Homework

Update: 2024-01-05

Task 1: Chronicles of Narnia

Your goal is to extract specific information from each book in the Chronicles of Narnia. For each book, gather the following details:

Book Title
Year of Publication

For each chapter within a book, collect the following information:

Chapter Number
Chapter Title
Token count of the chapter content (excluding the chapter number and title), counting each symbol and punctuation mark as a separate token.
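
Because the text file is described as already tokenized, one plausible way to count tokens is to split on whitespace. The sketch below assumes tokens are separated by spaces or newlines; the helper name count_tokens and the sample string are hypothetical and not part of the assignment.

```python
def count_tokens(chapter_text: str) -> int:
    """Count whitespace-separated tokens in already-tokenized text.

    Assumption: every symbol and punctuation mark is already set apart
    by whitespace, so a plain split() yields one item per token.
    """
    return len(chapter_text.split())


# Hypothetical sample line, not taken from the data file.
sample = "Hello , world !"
print(count_tokens(sample))  # 4
```

If the real file separates tokens differently, the split step would need to change accordingly.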

Steps

Download the chronicles_of_narnia.txt file and place it under the dat/ directory. Note that the text file is already tokenized using the ELIT Tokenizer.
Create a text_processing.py file in the src/homework/ directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:

{
  'The Lion , the Witch and the Wardrobe': {
    'title': 'The Lion , the Witch and the Wardrobe',
    'year': 1950,
    'chapters': [
      {
        'number': 1,
        'title': 'Lucy Looks into a Wardrobe',
        'token_count': 1915
      },
      {
        'number': 2,
        'title': 'What Lucy Found There',
        'token_count': 2887
      },
      ...
    ]
  },
  'Prince Caspian : The Return to Narnia': {
    'title': 'Prince Caspian : The Return to Narnia',
    'year': 1951,
    'chapters': [
      ...
    ]
  },
  ...
}
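
The nested structure above could be assembled as sketched below. This sketch assumes the file has already been parsed into flat records; the record layout and the helper name build_result are hypothetical, and the actual parsing of chronicles_of_narnia.txt is deliberately left out because it depends on the file format.

```python
def build_result(records):
    """Assemble the nested dictionary from flat, pre-parsed records.

    Assumption: each record is a tuple of
    (book_title, year, chapter_number, chapter_title, token_count).
    """
    result = {}
    for book_title, year, number, chapter_title, token_count in records:
        # Create the book entry the first time its title is seen.
        book = result.setdefault(
            book_title, {'title': book_title, 'year': year, 'chapters': []}
        )
        book['chapters'].append(
            {'number': number, 'title': chapter_title, 'token_count': token_count}
        )
    return result


# The values below are the ones shown in the example dictionary.
records = [
    ('The Lion , the Witch and the Wardrobe', 1950, 1, 'Lucy Looks into a Wardrobe', 1915),
    ('The Lion , the Witch and the Wardrobe', 1950, 2, 'What Lucy Found There', 2887),
]
books = build_result(records)
print(books['The Lion , the Witch and the Wardrobe']['year'])  # 1950
```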

Notes

The chronicles_of_narnia.txt file lists books and chapters in chronological order. Your program will be evaluated on both the original file and a modified version that keeps the same format but arranges books and chapters in a mixed order.
For each book, ensure that the list of chapters is sorted in ascending order by chapter number. For every book, your program should produce the same chapter list for both the original and the modified versions of the text file.
If your program cannot locate the text file, verify the working directory specified in the run configuration: [Run - Edit Configurations] -> text_processing. Confirm that the working directory path is set to the top directory, nlp-essentials.
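
The ordering requirement in the notes above can be sketched as follows: sorting each chapter list by its 'number' key makes the result independent of the order in which chapters appear in the file. The helper name sort_chapters and the sample data are hypothetical.

```python
def sort_chapters(books: dict) -> dict:
    """Sort every book's chapter list in ascending order of chapter number."""
    for book in books.values():
        book['chapters'].sort(key=lambda chapter: chapter['number'])
    return books


# Hypothetical result built from a file whose chapters appear out of order.
shuffled = {
    'The Lion , the Witch and the Wardrobe': {
        'title': 'The Lion , the Witch and the Wardrobe',
        'year': 1950,
        'chapters': [
            {'number': 2, 'title': 'What Lucy Found There', 'token_count': 2887},
            {'number': 1, 'title': 'Lucy Looks into a Wardrobe', 'token_count': 1915},
        ],
    }
}
ordered = sort_chapters(shuffled)
numbers = [c['number'] for c in ordered['The Lion , the Witch and the Wardrobe']['chapters']]
print(numbers)  # [1, 2]
```

Since Python dictionaries compare equal regardless of key insertion order, the results for the original and the modified files can then be compared directly with ==.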

Task 2: Regular Expressions

Define regular expressions to match the following cases using the corresponding variables in text_processing.py:

RE_Abbreviation: Dr., U.S.A.
RE_Apostrophe: '80, '90s, 'cause
RE_Concatenation: don't, gonna, cannot
RE_Hyperlink: https://emory.gitbook.io/nlp-essentials
RE_Number: 1/2, 123-456-7890, 1,000,000
RE_Unit: $10, #20, 5kg

Notes

Your regular expressions will be evaluated on typical cases beyond the examples listed above.
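
A small harness like the one below may help check a pattern against its examples before submitting. The pattern shown for RE_Hyperlink is only an illustrative starting point, not a reference solution, and the helper name matches_all is hypothetical; the sketch also assumes each example should be matched in full.

```python
import re

# Illustrative starting point only: a scheme, a host without spaces or
# slashes, and an optional path. It is not a complete or graded solution.
RE_Hyperlink = re.compile(r'https?://[^\s/]+(?:/\S*)?')


def matches_all(pattern, examples):
    """Return True if the pattern matches every example string in full."""
    return all(pattern.fullmatch(example) is not None for example in examples)


print(matches_all(RE_Hyperlink, ['https://emory.gitbook.io/nlp-essentials']))  # True
```

The same helper can be reused for the other variables by passing each compiled pattern together with its list of examples.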

Submission

Commit and push the text_processing.py file to your GitHub repository.
