HW1: Text Processing
Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.
For each book, gather the following details:
Book Title (preserve exact spacing as shown in text)
Year of Publishing (indicated in the title)
For each chapter within every book, collect the following information:
Chapter Number (as Arabic numeral)
Chapter Title
Token Count of the Chapter Content
Each word, punctuation mark, and symbol counts as a separate token.
The count begins after the chapter title and ends at the next chapter heading or the end of the book. Do not include the chapter number or the chapter title in the count.
Download the file and place it under the directory.
The text file is pre-tokenized using the .
Each token is separated by whitespace.
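Because the text is already tokenized, counting tokens reduces to splitting on whitespace. The sketch below shows one way to do that; the helper name `count_tokens` and the sample lines are illustrative assumptions, not part of the assignment or the actual text file.

```python
def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of text lines.

    Hypothetical helper: pass only the lines of the chapter body, i.e. the
    lines after the chapter title and before the next chapter heading.
    """
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace, so blank lines contribute 0 tokens.
    return sum(len(line.split()) for line in lines)


# Made-up, pre-tokenized sample lines (not taken from the real file).
chapter_body = [
    "It was a cold morning .",
    "",
    "Hello , said the faun .",
]

print(count_tokens(chapter_body))  # 6 + 0 + 6 = 12
```

Punctuation marks count as tokens here only because the file already separates them with whitespace; raw, untokenized text would need a real tokenizer first.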
Create a file in the directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:
Books must be stored as key-value pairs in the main dictionary.
Chapters must be stored as lists within each book's dictionary.
Chapter lists must be sorted by chapter number in ascending order.
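The three requirements above can be illustrated with a small sketch. The exact key names expected by the assignment are not reproduced in this text, so every key used here (`title`, `year`, `chapters`, `number`, `token_count`) and the helper `sort_chapters` are assumptions chosen only for illustration.

```python
def sort_chapters(book):
    """Return a copy of `book` whose chapter list is sorted by chapter number.

    Hypothetical helper; the key names are assumed, not prescribed.
    """
    sorted_book = dict(book)  # shallow copy so the input is left untouched
    sorted_book["chapters"] = sorted(book["chapters"], key=lambda ch: ch["number"])
    return sorted_book


# Made-up data: one book stored as a key-value pair in the main dictionary,
# with its chapters held in a list that is deliberately out of order.
books = {
    "Example Book": {
        "title": "Example Book",
        "year": 1950,
        "chapters": [
            {"number": 2, "title": "Second Chapter", "token_count": 20},
            {"number": 1, "title": "First Chapter", "token_count": 10},
        ],
    }
}

result = {name: sort_chapters(book) for name, book in books.items()}
print([ch["number"] for ch in result["Example Book"]["chapters"]])  # [1, 2]
```

Sorting on the integer chapter number, rather than on the heading text, avoids orderings such as "10" before "2".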
Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:
Format:
username@hostname.domain
Username and Hostname:
Can contain letters, numbers, period (.), underscore (_), hyphen (-).
Must start and end with letter/number.
Domain:
Limited to com, org, edu, and gov.
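The email rules above can be expressed as a single pattern. This is a minimal sketch rather than a reference solution: the names `_NAME`, `EMAIL_RE` and `is_email` are hypothetical, a one-character username or hostname is assumed to be valid, and the domain is assumed to be matched case-sensitively.

```python
import re

# A username or hostname: starts and ends with a letter or digit, with
# letters, digits, '.', '_' or '-' allowed in between. A single letter or
# digit is accepted too (an assumption; the rules do not say either way).
_NAME = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"

# username@hostname.domain, with the domain limited to com, org, edu, gov.
EMAIL_RE = re.compile(rf"{_NAME}@{_NAME}\.(?:com|org|edu|gov)")


def is_email(text):
    """Return True if the whole string matches the email format above."""
    # fullmatch() requires the pattern to cover the entire string.
    return EMAIL_RE.fullmatch(text) is not None


print(is_email("jane.doe@example.com"))  # True
print(is_email(".jane@example.com"))     # False: starts with a period
print(is_email("jane@example.net"))      # False: domain not allowed
```

Inside regular_expressions(), a check like this would return "email" on success, with the remaining types tested by their own patterns.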
Commit and push the text_processing.py file to your GitHub repository.
Task 1: Chronicles of Narnia (7 points)
Task 2: Regular Expressions (3 points)
Concept Quiz (2 points)