Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.
Data Collection
For each book, gather the following details:
Book Title (preserve exact spacing as shown in text)
Year of Publishing (indicated in the title)
For each chapter within every book, collect the following information:
Chapter Number (as Arabic numeral)
Chapter Title
Token Count of the Chapter Content
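As a rough illustration of pulling the title and year out of a book heading, the sketch below assumes a heading line shaped like "<title> ( <year> )", with or without spaces inside the parentheses. That shape, the pattern name BOOK_HEADING, and the helper parse_book_heading are assumptions made for this example, not something the assignment text confirms, so check them against the actual file.

```python
import re

# Assumption: a book heading ends with a four-digit year in parentheses,
# e.g. "The Lion , the Witch and the Wardrobe ( 1950 )".
BOOK_HEADING = re.compile(r"^(?P<title>.+?)\s*\(\s*(?P<year>\d{4})\s*\)\s*$")


def parse_book_heading(line):
    """Return (title, year) if the line has the assumed shape, else None."""
    match = BOOK_HEADING.match(line)
    if match is None:
        return None
    # The title is returned unchanged so tokenized spacing (" , ", " : ") survives.
    return match.group("title"), int(match.group("year"))
```

Returning the title without normalizing whitespace is deliberate, since the exact spacing from the text must be preserved.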
Implementation
Download the file and place it under the directory.
The text file is pre-tokenized using the .
Each token is separated by whitespace.
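Because tokens are whitespace-separated, counting them reduces to splitting on whitespace. A minimal sketch, where the helper name count_tokens is only illustrative:

```python
def count_tokens(text):
    """Count whitespace-separated tokens in a pre-tokenized string."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading/trailing whitespace, so blank text yields 0.
    return len(text.split())
```

For example, count_tokens("Hello , world !") counts the comma and the exclamation mark as their own tokens.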
Task 2: Regular Expressions
Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:
Format:
username@hostname.domain
Username and Hostname:
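The sketch below shows one possible shape for the function, covering only the email case. The character classes used for the username and hostname are assumptions, since the exact rules are not reproduced in this section. The "date", "url", and "cite" checks would be added in the same way once their formats are known.

```python
import re

# Assumption: username allows letters, digits, '.', '_' and '-';
# hostname allows letters, digits and '-'; domain is letters only.
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+\.[A-Za-z]+")


def regular_expressions(text):
    """Return "email" if the whole string looks like username@hostname.domain, else None."""
    # fullmatch requires the entire string to match, not just a prefix.
    if EMAIL.fullmatch(text):
        return "email"
    # Checks for "date", "url" and "cite" would go here.
    return None
```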
Submission
Commit and push the text_processing.py file to your GitHub repository.
Rubric
Task 1: Chronicles of Narnia (7 points)
Task 2: Regular Expressions (3 points)
Concept Quiz (2 points)
Each word, punctuation mark, and symbol counts as a separate token.
The count begins after the chapter title and ends at the next chapter heading or at the end of the book. Do not include the chapter number or the chapter title in the count.
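The counting rule can be sketched as a single pass over one book's lines. This assumes the chapter number and title sit together on one heading line, and that the caller supplies a predicate recognizing such lines. Both the helper name chapter_token_counts and the is_heading parameter are hypothetical.

```python
def chapter_token_counts(lines, is_heading):
    """Return one token count per chapter for the lines of a single book.

    Heading lines (chapter number and title) are never counted.
    """
    counts = []
    current = None  # stays None until the first chapter heading is seen
    for line in lines:
        if is_heading(line):
            if current is not None:
                counts.append(current)  # close the previous chapter
            current = 0
        elif current is not None:
            current += len(line.split())
    if current is not None:
        counts.append(current)  # the last chapter ends at the end of the book
    return counts
```

A caller might pass something like lambda line: line.startswith("Chapter "), if that matches how headings appear in the file.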
{
    'The Lion , the Witch and the Wardrobe': {
        'title': 'The Lion , the Witch and the Wardrobe',
        'year': 1950,
        'chapters': [
            {
                'number': 1,
                'title': 'Lucy Looks into a Wardrobe',
                'token_count': 1915
            },
            {
                'number': 2,
                'title': 'What Lucy Found There',
                'token_count': 2887
            },
            ...
        ]
    },
    'Prince Caspian : The Return to Narnia': {
        'title': 'Prince Caspian : The Return to Narnia',
        'year': 1951,
        'chapters': [
            ...
        ]
    },
    ...
}
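One way to assemble a dictionary of this shape from already-parsed records is sketched below. The helper names build_book_entry and build_result, and the tuple layout of their inputs, are assumptions made for illustration.

```python
def build_book_entry(title, year, chapters):
    """Build one book's entry; chapters is a list of (number, title, token_count) tuples."""
    return {
        "title": title,
        "year": year,
        "chapters": [
            {"number": number, "title": chapter_title, "token_count": token_count}
            for number, chapter_title, token_count in chapters
        ],
    }


def build_result(books):
    """Map each book title to its entry; books is a list of (title, year, chapters) tuples."""
    return {
        title: build_book_entry(title, year, chapters)
        for title, year, chapters in books
    }
```

Keying the outer dictionary by the exact title string means the spacing around punctuation must match the source text character for character.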