
# Homework

## Task 1: Chronicles of Narnia

Your goal is to extract and organize structured information from C.S. Lewis's [Chronicles of Narnia](https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia) series, focusing on book metadata and chapter statistics.

### Data Collection

For each book, gather the following details:

* Book Title (preserve exact spacing as shown in text)
* Year of Publishing (indicated in the title)

For each chapter within every book, collect the following information:

* Chapter Number (as Arabic numeral)
* Chapter Title
* Token Count of the Chapter Content
  * Each word, punctuation mark, and symbol counts as a separate token.
  * Counting begins after the chapter title and ends at the next chapter heading or at the end of the book. Do not include the chapter number or the chapter title in the count.
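
Because the file is pre-tokenized with whitespace between tokens, the counting rule above can be sketched with `str.split()`. This is only an illustration: the helper name `count_tokens` and the sample lines are made up for this example and are not taken from the text file.

```python
def count_tokens(lines):
    """Count whitespace-separated tokens across the body lines of a chapter.

    `lines` is assumed to hold only the chapter content, i.e. the chapter
    number and chapter title lines have already been excluded.
    """
    # str.split() with no argument splits on any run of whitespace and
    # ignores blank lines, so each word, punctuation mark, or symbol that
    # the tokenizer separated is counted exactly once.
    return sum(len(line.split()) for line in lines)


# Hypothetical chapter body, written in the pre-tokenized style.
body = [
    "Once there were four children .",
    "",
    '" Hello , " said Lucy .',
]
print(count_tokens(body))  # 13
```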

### Implementation

1. Download the [**chronicles\_of\_narnia.txt**](https://github.com/emory-courses/nlp-essentials/blob/main/dat/chronicles_of_narnia.txt) file and place it under the [dat/](https://github.com/emory-courses/nlp-essentials/tree/main/dat) directory.
   * The text file is pre-tokenized using the [ELIT Tokenizer](https://github.com/emorynlp/elit-tokenizer).
   * Each token is separated by whitespace.
2. Create a [**text\_processing.py**](https://github.com/emory-courses/nlp-essentials/blob/main/src/homework/text_processing.py) file in the [src/homework/](https://github.com/emory-courses/nlp-essentials/tree/main/src/homework) directory.
3. Define a function named `chronicles_of_narnia()` that meets the following requirements:
   * Takes a file path pointing to the text file.
   * Returns a dictionary with the structure shown below.
   * Books must be stored as key-value pairs in the main dictionary.
   * Chapters must be stored as lists within each book's dictionary.
   * Chapter lists must be sorted by chapter number in ascending order.

```python
{
  'The Lion , the Witch and the Wardrobe': {
    'title': 'The Lion , the Witch and the Wardrobe',
    'year': 1950,
    'chapters': [
      {
        'number': 1,
        'title': 'Lucy Looks into a Wardrobe',
        'token_count': 1915
      },
      {
        'number': 2,
        'title': 'What Lucy Found There',
        'token_count': 2887
      },
      ...
    ]
  },
  'Prince Caspian : The Return to Narnia': {
    'title': 'Prince Caspian : The Return to Narnia',
    'year': 1951,
    'chapters': [
      ...
    ]
  },
  ...
}
```
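
Before submitting, it may help to check that a returned dictionary has the shape described above. The following is a minimal sketch, not part of the assignment: the helper name `find_structure_problems` is hypothetical, and the rule that each key equals the book's `title` is inferred from the example rather than stated explicitly.

```python
def find_structure_problems(books):
    """Return a list of human-readable problems found in `books`."""
    problems = []
    for key, book in books.items():
        # Assumption drawn from the example: the key repeats the title.
        if book.get("title") != key:
            problems.append(f"{key!r}: key and 'title' differ")
        if not isinstance(book.get("year"), int):
            problems.append(f"{key!r}: 'year' is not an int")
        chapters = book.get("chapters", [])
        # Every chapter is assumed to carry an integer 'number'.
        numbers = [chapter["number"] for chapter in chapters]
        if numbers != sorted(numbers):
            problems.append(f"{key!r}: chapters are not in ascending order")
        for chapter in chapters:
            if not isinstance(chapter.get("token_count"), int):
                problems.append(
                    f"{key!r}: chapter {chapter['number']} has no integer 'token_count'"
                )
    return problems


# The first two chapters from the example above.
sample = {
    "The Lion , the Witch and the Wardrobe": {
        "title": "The Lion , the Witch and the Wardrobe",
        "year": 1950,
        "chapters": [
            {"number": 1, "title": "Lucy Looks into a Wardrobe", "token_count": 1915},
            {"number": 2, "title": "What Lucy Found There", "token_count": 2887},
        ],
    }
}
print(find_structure_problems(sample))  # []
```

An empty list means none of these particular checks failed. It does not confirm that the titles, years, or token counts themselves are correct.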

## Task 2: Regular Expressions

Define a function named `regular_expressions()` in **src/homework/text\_processing.py** that takes a string and returns one of the four types, "email", "date", "url", or "cite", or `None` if nothing matches:

{% tabs %}
{% tab title="email" %}

* Format:
  * `username@hostname.domain`
* Username and Hostname:
  * Can contain letters, numbers, periods (`.`), underscores (`_`), and hyphens (`-`).
  * Must start and end with a letter or number.
* Domain:
  * Limited to `com`, `org`, `edu`, and `gov`.
    {% endtab %}

{% tab title="date" %}

* Formats:
  * YYYY/MM/DD or YY/MM/DD
  * YYYY-MM-DD or YY-MM-DD
* Year:
  * 4 digits: between 1951 and 2050
  * 2 digits: the last two digits of a year between 1951 and 2050 (e.g., 51 for 1951, 50 for 2050)
* Month:
  * 1 - 12 (can be with/without leading zero)
* Day:
  * 1 - 31 (can be with/without leading zero)
  * Must be valid for the given month.
    {% endtab %}

{% tab title="url" %}

* Format:
  * protocol://address
* Protocol:
  * http or https (only)
* Address:
  * Can contain letters, hyphens, and dots.
  * Must start with a letter or number.
  * Must include at least one dot.
    {% endtab %}

{% tab title="cite" %}

* Formats:
  * Single author: Lastname, YYYY (e.g., Smith, 2023)
  * Two authors: Lastname 1 and Lastname 2, YYYY (e.g., Smith and Jones, 2023)
  * Multiple authors: Lastname 1 et al., YYYY (e.g., Smith et al., 2023)
* Lastnames must be capitalized and can have multiple
* Year must be between 1900 and 2024.
  {% endtab %}
  {% endtabs %}
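
The "date" rules combine a textual pattern with a calendar check, which is awkward to express in a regular expression alone. One possible approach, sketched below, lets the regular expression capture the parts and then delegates the day-of-month check to `datetime.date`. The helper name `is_valid_date` is hypothetical, and three choices are assumptions rather than stated requirements: two-digit years 51-99 map to 1951-1999 and 00-50 map to 2000-2050, both separators in a date must be the same character, and February 29 is accepted only in leap years.

```python
import re
from datetime import date

# Year (4 or 2 digits), a separator, month, the same separator again
# (enforced by the backreference \2), then day.
DATE_RE = re.compile(r"([0-9]{4}|[0-9]{2})([/-])([0-9]{1,2})\2([0-9]{1,2})")


def is_valid_date(text: str) -> bool:
    """Return True if `text` is a date in one of the accepted formats."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return False
    year = int(match.group(1))
    month = int(match.group(3))
    day = int(match.group(4))
    if len(match.group(1)) == 2:
        # Assumed mapping: 51-99 -> 1951-1999, 00-50 -> 2000-2050.
        year += 1900 if year >= 51 else 2000
    if not 1951 <= year <= 2050:
        return False
    try:
        # date() raises ValueError for an impossible month or day,
        # including February 29 in a non-leap year.
        date(year, month, day)
    except ValueError:
        return False
    return True


print(is_valid_date("2024/02/29"))  # True
print(is_valid_date("2023-02-29"))  # False
```

Using `fullmatch` rather than `search` keeps strings with extra leading or trailing characters from being accepted.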

## Submission

Commit and push the **text\_processing.py** file to your GitHub repository.

## Rubric

* Task 1: Chronicles of Narnia (7 points)
* Task 2: Regular Expressions (3 points)
* Concept Quiz (2 points)

