Homework

HW1: Text Processing

Task 1: Chronicles of Narnia

Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.

Data Collection

For each book, gather the following details:

  • Book Title (preserve exact spacing as shown in text)

  • Year of Publishing (indicated in the title)

For each chapter within every book, collect the following information:

  • Chapter Number (as Arabic numeral)

  • Chapter Title

  • Token Count of the Chapter Content
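The requirement to report chapter numbers as Arabic numerals suggests that the text may number its chapters differently. As a minimal sketch, assuming the chapter headings use Roman numerals (check the file to confirm), a helper such as the hypothetical `roman_to_int` below could do the conversion:

```python
def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an integer (hypothetical helper)."""
    values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100}
    s = numeral.upper()
    total = 0
    for i, ch in enumerate(s):
        v = values[ch]
        # A smaller value placed before a larger one is subtracted (e.g., IV = 4).
        if i + 1 < len(s) and values[s[i + 1]] > v:
            total -= v
        else:
            total += v
    return total
```

For example, `roman_to_int('XIV')` evaluates to 14.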

Implementation

  1. Download the chronicles_of_narnia.txt file and place it under the dat/ directory.

    • The text file is pre-tokenized using the ELIT Tokenizer.

    • Each token is separated by whitespace.
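Because the tokens are separated by whitespace, counting them can be as simple as splitting the chapter text. A minimal sketch, where `count_tokens` is a hypothetical helper name and not part of the assignment:

```python
def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in pre-tokenized text."""
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, and ignores leading and trailing whitespace.
    return len(text.split())
```

Since punctuation is already split off by the tokenizer, `count_tokens("Once there were four children .")` counts the final period as its own token.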

Task 2: Regular Expressions

Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:

  • Email format:

    • username@hostname.domain

  • Username and Hostname:

Submission

Commit and push the text_processing.py file to your GitHub repository.

Rubric

  • Task 1: Chronicles of Narnia (7 points)

  • Task 2: Regular Expressions (3 points)

  • Concept Quiz (2 points)

Each word, punctuation mark, and symbol counts as a separate token.
  • Count begins after chapter title and ends at next chapter heading or book end. Do not include chapter number and chapter title in count.

  • Create a text_processing.py file in the src/homework/ directory.

  • Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:

    • Takes a file path pointing to the text file.

    • Returns a dictionary with the structure shown below.

    • Books must be stored as key-value pairs in the main dictionary.

    • Chapters must be stored as lists within each book's dictionary.

    • Chapter lists must be sorted by chapter number in ascending order.

  • Can contain letters, numbers, period (.), underscore (_), hyphen (-).

  • Must start and end with letter/number.

  • Domain:

    • Limited to com, org, edu, and gov.
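The email rules can be sketched as a single pattern. This is one possible reading, not a reference solution: it assumes the whole input string must match, and `EMAIL_RE` and `is_email` are hypothetical names.

```python
import re

# One or more allowed characters that start and end with a letter or digit.
_PART = r'[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?'

# username@hostname.domain, with the domain limited to com, org, edu, gov.
EMAIL_RE = re.compile(rf'{_PART}@{_PART}\.(?:com|org|edu|gov)')


def is_email(s: str) -> bool:
    """Return True if the entire string matches the email rules (assumed reading)."""
    return EMAIL_RE.fullmatch(s) is not None
```

Under this reading, `jdoe@emory.edu` is accepted, while `.jdoe@emory.edu` (username starts with a period) and `jdoe@emory.net` (domain not in the list) are rejected.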

    • Date formats:

      • YYYY/MM/DD or YY/MM/DD

      • YYYY-MM-DD or YY-MM-DD

    • Year:

      • 4 digits: between 1951 and 2050

      • 2 digits: interpreted as a year between 1951 and 2050 (51-99 for 1951-1999, 00-50 for 2000-2050)

    • Month:

      • 1 - 12 (can be with/without leading zero)

    • Day:

      • 1 - 31 (can be with/without leading zero)

      • Must be valid for the given month.
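A regular expression alone struggles with the "valid for the given month" rule, so one option is to match the shape with a pattern and then validate the values. The sketch below makes several assumptions that the assignment does not state: the same separator is used on both sides, two-digit years 51-99 map to 1951-1999 and 00-50 to 2000-2050, and February 29 is accepted only in leap years because `datetime.date` enforces it. `is_date` is a hypothetical name.

```python
import re
from datetime import date

# Year (4 or 2 digits), a separator, month, the same separator, day.
DATE_RE = re.compile(r'(\d{4}|\d{2})([/-])(\d{1,2})\2(\d{1,2})')


def is_date(s: str) -> bool:
    """Return True if the entire string is a date under the assumed rules."""
    m = DATE_RE.fullmatch(s)
    if m is None:
        return False
    year_text, _, month_text, day_text = m.groups()
    year = int(year_text)
    if len(year_text) == 2:
        # Assumed mapping of two-digit years onto 1951-2050.
        year += 1900 if year >= 51 else 2000
    if not 1951 <= year <= 2050:
        return False
    try:
        # date() raises ValueError for an impossible month/day combination.
        date(year, int(month_text), int(day_text))
    except ValueError:
        return False
    return True
```

With these assumptions, `2024/02/29` is accepted (2024 is a leap year) and `2023-02-29` is rejected.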

    • URL format:

      • protocol://address

    • Protocol:

      • http or https (only)

    • Address:

      • Can contain letters, hyphens, and dots.

      • Must start with letter/number.
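The URL rules can be sketched in the same style. This version also applies the rule stated elsewhere on this page that the address must include at least one dot. It assumes digits are allowed anywhere in the address (the rules only mention them for the first character) and that the whole string must match. `is_url` is a hypothetical name.

```python
import re

# http or https, then an address that starts with a letter or digit.
URL_RE = re.compile(r'https?://([A-Za-z0-9][A-Za-z0-9.-]*)')


def is_url(s: str) -> bool:
    """Return True if the entire string matches the URL rules (assumed reading)."""
    m = URL_RE.fullmatch(s)
    # The address (group 1) must include at least one dot.
    return m is not None and '.' in m.group(1)
```

Here `https://www.emory.edu` is accepted, while `ftp://emory.edu` (wrong protocol) and `http://localhost` (no dot) are rejected.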

    • Citation formats:

      • Single author: Lastname, YYYY (e.g., Smith, 2023)

      • Two authors: Lastname 1 and Lastname 2, YYYY (e.g., Smith and Jones, 2023)

      • Multiple authors: Lastname 1 et al., YYYY (e.g., Smith et al., 2023)

    • Lastnames must be capitalized and can have multiple

    • Year must be between 1900-2024.
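A sketch of the citation check follows. It is deliberately limited: it treats a last name as one capitalized word, so it does not cover the truncated "can have multiple" rule above, and it assumes the whole string must match. `is_cite` is a hypothetical name.

```python
import re

# Assumption: a last name is a capital letter followed by lowercase letters.
_NAME = r'[A-Z][a-z]+'

# "Name, YYYY", "Name and Name, YYYY", or "Name et al., YYYY".
CITE_RE = re.compile(rf'{_NAME}(?: and {_NAME}| et al\.)?, (\d{{4}})')


def is_cite(s: str) -> bool:
    """Return True if the entire string matches the citation rules (assumed reading)."""
    m = CITE_RE.fullmatch(s)
    # The year range is checked numerically rather than in the pattern.
    return m is not None and 1900 <= int(m.group(1)) <= 2024
```

Checking the year numerically keeps the pattern short; encoding 1900-2024 directly in the regular expression is also possible but harder to read.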

    chronicles_of_narnia.txt
    dat/
    ELIT Tokenizer
    {
      'The Lion , the Witch and the Wardrobe': {
        'title': 'The Lion , the Witch and the Wardrobe',
        'year': 1950,
        'chapters': [
          {
            'number': 1,
            'title': 'Lucy Looks into a Wardrobe',
            'token_count': 1915
          },
          {
            'number': 2,
            'title': 'What Lucy Found There',
            'token_count': 2887
          },
          ...
        ]
      },
      'Prince Caspian : The Return to Narnia': {
        'title': 'Prince Caspian : The Return to Narnia',
        'year': 1951,
        'chapters': [
          ...
        ]
      },
      ...
    }
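Once the titles, years, and chapter records have been extracted (the parsing itself depends on the file layout and is not shown), assembling one book's entry in this structure is straightforward. A minimal sketch, where `build_book_entry` is a hypothetical helper name:

```python
def build_book_entry(title: str, year: int, chapters: list) -> dict:
    """Build one book's dictionary with chapters sorted by chapter number."""
    return {
        'title': title,
        'year': year,
        # sorted() returns a new list and leaves the input unchanged.
        'chapters': sorted(chapters, key=lambda c: c['number']),
    }


# Usage with the two chapters from the example, given out of order.
books = {}
title = 'The Lion , the Witch and the Wardrobe'
books[title] = build_book_entry(title, 1950, [
    {'number': 2, 'title': 'What Lucy Found There', 'token_count': 2887},
    {'number': 1, 'title': 'Lucy Looks into a Wardrobe', 'token_count': 1915},
])
```

Using the exact title string (including the space before the comma) as the key keeps it consistent with the 'title' field.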
    • The URL address must include at least one dot.