Homework
HW1: Text Processing
Task 1: Chronicles of Narnia
Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.
Data Collection
For each book, gather the following details:
Book Title (preserve exact spacing as shown in text)
Year of Publishing (indicated in the title)
For each chapter within every book, collect the following information:
Chapter Number (as Arabic numeral)
Chapter Title
Token Count of the Chapter Content
Each word, punctuation mark, and symbol counts as a separate token.
Counting begins after the chapter title and ends at the next chapter heading or the end of the book. Do not include the chapter number or chapter title in the count.
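As a minimal sketch of the counting rule, assuming the chapter content is already tokenized with whitespace between tokens (as the implementation notes state for the provided file), the count is just the number of whitespace-separated pieces. The helper name count_tokens and the sample sentence are illustrative and not part of the assignment.

```python
def count_tokens(text: str) -> int:
    """Count tokens in pre-tokenized text, where tokens are separated by whitespace."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return len(text.split())


# Illustrative sample: punctuation marks are already separate tokens.
sample = "The lion roared , and everyone ran ."
print(count_tokens(sample))  # 8
```

Because punctuation is already split off in the file, no extra tokenization step should be needed before counting.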
Implementation
Download the chronicles_of_narnia.txt file and place it under the dat/ directory.
The text file is pre-tokenized using the ELIT Tokenizer.
Each token is separated by whitespace.
Create a text_processing.py file in the src/homework/ directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary that meets the following requirements:
Books must be stored as key-value pairs in the main dictionary.
Chapters must be stored as lists within each book's dictionary.
Chapter lists must be sorted by chapter number in ascending order.
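The three requirements above can be sketched as a nested dictionary. The key names used here ("title", "year", "chapters", "number", "token_count"), the helper sort_chapters, and the sample values are all assumptions made for illustration; the structure given in the assignment takes precedence.

```python
def sort_chapters(books: dict) -> dict:
    """Sort each book's chapter list in place by chapter number, ascending."""
    for book in books.values():
        # Hypothetical keys: "chapters" holds a list of chapter dicts,
        # each with an integer "number".
        book["chapters"].sort(key=lambda chapter: chapter["number"])
    return books


# Illustrative data only; titles, year, and counts are made up.
example = {
    "Example Book": {
        "title": "Example Book",
        "year": 1950,
        "chapters": [
            {"number": 2, "title": "Second Chapter", "token_count": 120},
            {"number": 1, "title": "First Chapter", "token_count": 95},
        ],
    }
}

sort_chapters(example)
print([chapter["number"] for chapter in example["Example Book"]["chapters"]])  # [1, 2]
```

Storing chapter numbers as integers rather than strings keeps the sort numeric, so chapter 10 comes after chapter 9 instead of after chapter 1.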
Task 2: Regular Expressions
Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:
Email
Format: username@hostname.domain
Username and Hostname: Can contain letters, numbers, periods (.), underscores (_), and hyphens (-). Must start and end with a letter or number.
Domain: Limited to com, org, edu, and gov.
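One way to express the email rules above as a pattern is sketched below. It is one possible reading of the rules, not a reference solution: the names EMAIL_RE and looks_like_email are hypothetical, and the sketch assumes a single-character username or hostname is acceptable, since that one character both starts and ends the part.

```python
import re

# A username or hostname: starts and ends with a letter or digit, with
# letters, digits, '.', '_' or '-' allowed in between. A single
# letter or digit also qualifies (assumption).
_PART = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"

# username@hostname.domain, with the domain limited to the four listed values.
EMAIL_RE = re.compile(rf"{_PART}@{_PART}\.(?:com|org|edu|gov)")


def looks_like_email(text: str) -> bool:
    """Return True if the whole string matches the email format described above."""
    # fullmatch requires the pattern to cover the entire string.
    return EMAIL_RE.fullmatch(text) is not None


print(looks_like_email("jane.doe@example.edu"))  # True
print(looks_like_email("_jane@example.edu"))     # False: starts with an underscore
print(looks_like_email("jane@example.io"))       # False: domain not in the list
```

Inside regular_expressions(), a check like this would map to returning "email"; the other three types would each need their own pattern.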
Submission
Commit and push the text_processing.py file to your GitHub repository.
Rubric
Task 1: Chronicles of Narnia (7 points)
Task 2: Regular Expressions (3 points)
Concept Quiz (2 points)