Homework

HW1: Text Processing

Task 1: Chronicles of Narnia

Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.

Data Collection

For each book, gather the following details:

  • Book Title (preserve exact spacing as shown in text)

  • Year of Publishing (indicated in the title)

For each chapter within every book, collect the following information:

  • Chapter Number (as Arabic numeral)

  • Chapter Title

  • Token Count of the Chapter Content

    • Each word, punctuation mark, and symbol counts as a separate token.

    • Counting begins after the chapter title and ends at the next chapter heading or at the end of the book. Do not include the chapter number or the chapter title in the count.
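
Because the provided file is pre-tokenized with whitespace between tokens (see Implementation), the counting rule above can be sketched as a plain whitespace split. The helper name count_tokens and the sample sentence are assumptions made for illustration; they are not part of the assignment.

```python
def count_tokens(chapter_body: str) -> int:
    """Count whitespace-separated tokens in a chapter body.

    Assumes the chapter number and chapter title have already been
    stripped off, so only the chapter content is passed in.
    """
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return len(chapter_body.split())


# Hypothetical pre-tokenized snippet: punctuation is its own token.
sample = "Hello , world !\nThis is a test ."
print(count_tokens(sample))
```

Splitting on whitespace only works here because tokenization has already been done; on raw text, punctuation would stay attached to words and the counts would differ.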

Implementation

  1. Download the chronicles_of_narnia.txt file and place it under the dat/ directory.

    • The text file is pre-tokenized using the ELIT Tokenizer.

    • Each token is separated by whitespace.

  2. Create a text_processing.py file in the src/homework/ directory.

  3. Define a function named chronicles_of_narnia() that:

    • Takes a file path pointing to the text file.

    • Returns a dictionary with the structure shown below.

    • Books must be stored as key-value pairs in the main dictionary.

    • Chapters must be stored as lists within each book's dictionary.

    • Chapter lists must be sorted by chapter number in ascending order.

{
  'The Lion , the Witch and the Wardrobe': {
    'title': 'The Lion , the Witch and the Wardrobe',
    'year': 1950,
    'chapters': [
      {
        'number': 1,
        'title': 'Lucy Looks into a Wardrobe',
        'token_count': 1915
      },
      {
        'number': 2,
        'title': 'What Lucy Found There',
        'token_count': 2887
      },
      ...
    ]
  },
  'Prince Caspian : The Return to Narnia': {
    'title': 'Prince Caspian : The Return to Narnia',
    'year': 1951,
    'chapters': [
      ...
    ]
  },
  ...
}
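
As a minimal sketch of how the structure above could be assembled once the titles, years, and counts have been parsed, the helper below builds one book entry and sorts its chapters by number. The name build_book_entry and the tuple layout are assumptions for illustration; locating book and chapter headings in the text file is left out because it depends on the file's actual layout.

```python
def build_book_entry(title, year, chapters):
    """Build one book dictionary in the required shape.

    `chapters` is an iterable of (number, title, token_count) tuples in
    any order; this tuple layout is an assumption of the sketch.
    """
    return {
        'title': title,
        'year': year,
        # Sort by chapter number so the list is in ascending order.
        'chapters': [
            {'number': number, 'title': chapter_title, 'token_count': count}
            for number, chapter_title, count in sorted(chapters, key=lambda c: c[0])
        ],
    }


# Values taken from the example structure above, given out of order on purpose.
books = {}
entry = build_book_entry(
    'The Lion , the Witch and the Wardrobe',
    1950,
    [(2, 'What Lucy Found There', 2887), (1, 'Lucy Looks into a Wardrobe', 1915)],
)
books[entry['title']] = entry  # the book title doubles as the dictionary key
```

Keeping the assembly step separate from the parsing step makes it easier to check the output shape on its own before working on the parser.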

Task 2: Regular Expressions

Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of four type labels, "email", "date", "url", or "cite", or None if nothing matches:

  • Email format:

    • username@hostname.domain

  • Username and Hostname:

    • Can contain letters, numbers, period (.), underscore (_), hyphen (-).

    • Must start and end with letter/number.

  • Domain:

    • Limited to com, org, edu, and gov.
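
The rules above can be sketched as a single pattern for the email case only. The helper name classify_email and the sample addresses are assumptions for illustration; the other three types are not covered, and the pattern reads the rules literally (for example, it does not forbid consecutive periods).

```python
import re
from typing import Optional

# A username or hostname: starts and ends with a letter or digit, with
# letters, digits, '.', '_', or '-' allowed in between. A single
# letter or digit on its own is also accepted.
_NAME = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"

# username@hostname.domain, with the domain limited to four values.
EMAIL_RE = re.compile(rf"{_NAME}@{_NAME}\.(?:com|org|edu|gov)")


def classify_email(text: str) -> Optional[str]:
    """Return "email" if the whole string matches the pattern, else None."""
    # fullmatch anchors the pattern at both ends of the string.
    return "email" if EMAIL_RE.fullmatch(text) else None


print(classify_email("first.last@example.com"))   # matches
print(classify_email(".abc@example.com"))         # starts with a period
print(classify_email("abc@example.net"))          # domain not allowed
```

Using fullmatch rather than search avoids accepting strings that merely contain a valid address somewhere inside them.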

Submission

Commit and push the text_processing.py file to your GitHub repository.

Rubric

  • Task 1: Chronicles of Narnia (7 points)

  • Task 2: Regular Expressions (3 points)

  • Concept Quiz (2 points)
