Homework

Update: 2024-01-05

Task 1: Chronicles of Narnia

Your goal is to extract specific information from each book in the Chronicles of Narnia. For each book, gather the following details:

  • Book Title

  • Year of Publishing

For each chapter within a book, collect the following information:

  • Chapter Number

  • Chapter Title

  • Token count of the chapter content (excluding the chapter number and title), where every symbol and punctuation mark counts as a separate token.
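Since the text file is already tokenized (see Step 1 below), tokens on a line are separated by spaces, so counting them reduces to splitting on whitespace. A minimal illustration with a hypothetical chapter line:

```python
# The file is pre-tokenized, so tokens (including punctuation marks)
# are separated by spaces; splitting on whitespace gives the count.
line = 'Lucy looked into the wardrobe .'  # hypothetical line
print(len(line.split()))                  # 6: the period is its own token
```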

Steps

  1. Download the chronicles_of_narnia.txt file and place it under the dat/ directory. Note that the text file is already tokenized using the ELIT Tokenizer.

  2. Create a text_processing.py file in the src/homework/ directory.

  3. Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:

{
  'The Lion , the Witch and the Wardrobe': {
    'title': 'The Lion , the Witch and the Wardrobe',
    'year': 1950,
    'chapters': [
      {
        'number': 1,
        'title': 'Lucy Looks into a Wardrobe',
        'token_count': 1915
      },
      {
        'number': 2,
        'title': 'What Lucy Found There',
        'token_count': 2887
      },
      ...
    ]
  },
  'Prince Caspian : The Return to Narnia': {
    'title': 'Prince Caspian : The Return to Narnia',
    'year': 1951,
    'chapters': [
      ...
    ]
  },
  ...
}

Notes

  • The chronicles_of_narnia.txt file lists books and chapters in chronological order. Your program will be evaluated on both the original file and a modified version that keeps the same format but arranges books and chapters in a mixed order.

  • For each book, ensure that the list of chapters is sorted in ascending order of chapter number. Your program should therefore produce identical output for both the original and the modified versions of the text file.

  • If your program cannot locate the text file, verify the working directory specified in the run configuration: [Run - Edit Configurations] -> text_processing. Confirm that the working directory path is set to the top directory, nlp-essentials.

Task 2: Regular Expressions

Define regular expressions to match the following cases using the corresponding variables in text_processing.py:

  • RE_Abbreviation: Dr., U.S.A.

  • RE_Apostrophe: '80, '90s, 'cause

  • RE_Concatenation: don't, gonna, cannot

  • RE_Hyperlink: https://emory.gitbook.io/nlp-essentials

  • RE_Number: 1/2, 123-456-7890, 1,000,000

  • RE_Unit: $10, #20, 5kg

Notes

  • Your regular expressions will be assessed based on typical cases beyond the examples mentioned above.
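As a starting point, the patterns below match the examples listed above; they are sketches, not complete solutions, and will need refinement to handle the additional cases the assessment covers.

```python
import re

# Sketch patterns covering only the listed examples; refine as needed.
RE_Abbreviation = re.compile(r'\b(?:(?:[A-Z]\.){2,}|[A-Z][a-z]*\.)')  # Dr., U.S.A.
RE_Apostrophe = re.compile(r"'\d{2}s?|'[a-z]+")                       # '80, '90s, 'cause
RE_Concatenation = re.compile(r"\b\w+n't\b|\bgonna\b|\bcannot\b")     # don't, gonna, cannot
RE_Hyperlink = re.compile(r'https?://\S+')
RE_Number = re.compile(r'\d+(?:[/,-]\d+)*')                           # 1/2, 123-456-7890, 1,000,000
RE_Unit = re.compile(r'[$#]\d+|\d+[a-z]+')                            # $10, #20, 5kg
```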

Submission

Commit and push the text_processing.py file to your GitHub repository.

Copyright © 2023 All rights reserved