Homework

HW1: Text Processing

Task 1: Chronicles of Narnia

Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.

Data Collection

For each book, gather the following details:

  • Book Title (preserve exact spacing as shown in text)

  • Year of Publishing (indicated in the title)
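As a rough illustration of pulling both book fields from one heading line, the sketch below assumes each book heading looks like "Title (YYYY)". That layout is a guess based on the note that the year is indicated in the title; check it against the actual file. The name parse_book_heading is hypothetical.

```python
import re

# Assumed heading shape: "<title> (<four-digit year>)". The real file may differ.
BOOK_HEADING = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$")


def parse_book_heading(line):
    """Return (title, year) for a book heading line, or None if it does not match."""
    match = BOOK_HEADING.match(line)
    if match is None:
        return None
    # Spacing inside the title is kept exactly as it appears in the line.
    return match.group("title"), int(match.group("year"))


print(parse_book_heading("The Magician's Nephew (1955)"))
```

Lines that do not end in a parenthesized four-digit year return None, so the same helper can be used to tell book headings apart from other lines.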

For each chapter within every book, collect the following information:

  • Chapter Number (as Arabic numeral)

  • Chapter Title

  • Token Count of the Chapter Content

    • Each word, punctuation mark, and symbol counts as a separate token.

    • The count begins after the chapter title and ends at the next chapter heading or the end of the book. Do not include the chapter number or the chapter title in the count.
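The two helpers below sketch one possible reading of the chapter rules above. count_tokens treats every run of letters, digits, or underscores as one token and every other non-space character as its own token, so a contraction such as "Don't" counts as three tokens; whether the grader expects that split is an assumption. roman_to_int assumes chapter headings use Roman numerals, which this page does not state. Both function names are hypothetical.

```python
import re

# One token per run of word characters, or per single non-space symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")

# Values for the Roman digits likely to appear in chapter numbers (assumption).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def count_tokens(text):
    """Count words, punctuation marks, and symbols as separate tokens."""
    return len(TOKEN.findall(text))


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral such as 'XIV' to an integer."""
    total = 0
    largest_seen = 0
    # Scan right to left: a digit smaller than one already seen is subtracted.
    for char in reversed(numeral.upper()):
        value = ROMAN_VALUES[char]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


print(count_tokens("Hello, world!"))
print(roman_to_int("XIV"))
```

Only the chapter body should be passed to count_tokens, so the heading lines need to be stripped before counting.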

Implementation

  1. Download the chronicles_of_narnia.txt file and place it under the dat/ directory.

  2. Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:

    • Books must be stored as key-value pairs in the main dictionary.

    • Chapters must be stored as lists within each book's dictionary.

    • Chapter lists must be sorted by chapter number in ascending order.
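The exact key names of the returned dictionary are not given in this section, so the sketch below uses hypothetical ones ("title", "year", "chapters", "number", "token_count") and a hypothetical helper name, build_library. It only illustrates the three structural rules above: books as key-value pairs, chapters as a list per book, and each list sorted by chapter number. The token counts in the sample records are placeholders, not real counts.

```python
def build_library(records):
    """Group flat chapter records into a nested dictionary keyed by book title.

    Each record is (book_title, year, chapter_number, chapter_title, token_count).
    """
    books = {}
    for book_title, year, number, chapter_title, token_count in records:
        # Create the book entry on first sight, then reuse it.
        book = books.setdefault(
            book_title, {"title": book_title, "year": year, "chapters": []}
        )
        book["chapters"].append(
            {"number": number, "title": chapter_title, "token_count": token_count}
        )
    # Chapter lists must end up in ascending order of chapter number.
    for book in books.values():
        book["chapters"].sort(key=lambda chapter: chapter["number"])
    return books


sample = [
    ("The Lion, the Witch and the Wardrobe", 1950, 2, "What Lucy Found There", 200),
    ("The Lion, the Witch and the Wardrobe", 1950, 1, "Lucy Looks into a Wardrobe", 100),
]
library = build_library(sample)
print([c["number"] for c in library["The Lion, the Witch and the Wardrobe"]["chapters"]])
```

Sorting once at the end keeps the parsing loop simple and does not depend on the order in which chapters appear in the file.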

Task 2: Regular Expressions

Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:

  • Email format:

    • username@hostname.domain

  • Username and Hostname:

    • Can contain letters, numbers, period (.), underscore (_), hyphen (-).

    • Must start and end with letter/number.

  • Domain:

    • Limited to com, org, edu, and gov.
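The rules above can be sketched as a single pattern. This covers only the email case, since the rules for dates, URLs, and citations are not given in this section. The helper name classify_email is hypothetical, and the pattern assumes that a single-character username or hostname is acceptable and that letters and numbers mean ASCII only.

```python
import re

# A username or hostname: starts and ends with a letter or digit,
# with letters, digits, '.', '_' or '-' allowed in between.
NAME = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"

EMAIL = re.compile(rf"{NAME}@{NAME}\.(?:com|org|edu|gov)")


def classify_email(text):
    """Return "email" if the whole string matches the email rules, else None."""
    return "email" if EMAIL.fullmatch(text) else None


print(classify_email("john.doe@example.com"))
print(classify_email(".john@example.com"))
```

Using fullmatch rather than search means surrounding text makes the string fail, which fits a function that classifies the entire input.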

Submission

Commit and push the text_processing.py file to your GitHub repository.

Rubric

  • Task 1: Chronicles of Narnia (7 points)

  • Task 2: Regular Expressions (3 points)

  • Concept Quiz (2 points)
