
Copyright © 2023 All rights reserved


HW1: Text Processing


Last updated 3 months ago


Task 1: Chronicles of Narnia

Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.

Data Collection

For each book, gather the following details:

  • Book Title (preserve exact spacing as shown in text)

  • Year of Publishing (indicated in the title)

For each chapter within every book, collect the following information:

  • Chapter Number (as Arabic numeral)

  • Chapter Title

  • Token Count of the Chapter Content

    • Each word, punctuation mark, and symbol counts as a separate token.

    • Count begins after chapter title and ends at next chapter heading or book end. Do not include chapter number and chapter title in count.
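Because the provided text is pre-tokenized with whitespace between tokens (see Implementation below), counting a chapter's tokens reduces to splitting its body lines on whitespace:

```python
# The corpus is pre-tokenized: words, punctuation marks, and symbols are
# each separated by whitespace, so str.split() yields one item per token.
line = "Lucy opened the door of the wardrobe ."
tokens = line.split()
print(len(tokens))  # 8 tokens: 7 words + 1 punctuation mark
```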

Implementation

  1. Download the chronicles_of_narnia.txt file and place it under the dat/ directory.

    • The text file is pre-tokenized using the ELIT Tokenizer.

    • Each token is separated by whitespace.

  2. Create a text_processing.py file in the src/homework/ directory.

  3. Define a function named chronicles_of_narnia() that:

    • Takes a file path pointing to the text file.

    • Returns a dictionary with the structure shown below.

    • Books must be stored as key-value pairs in the main dictionary, keyed by book title.

    • Chapters must be stored as lists within each book's dictionary.

    • Chapter lists must be sorted by chapter number in ascending order.

{
  'The Lion , the Witch and the Wardrobe': {
    'title': 'The Lion , the Witch and the Wardrobe',
    'year': 1950,
    'chapters': [
      {
        'number': 1,
        'title': 'Lucy Looks into a Wardrobe',
        'token_count': 1915
      },
      {
        'number': 2,
        'title': 'What Lucy Found There',
        'token_count': 2887
      },
      ...
    ]
  },
  'Prince Caspian : The Return to Narnia': {
    'title': 'Prince Caspian : The Return to Narnia',
    'year': 1951,
    'chapters': [
      ...
    ]
  },
  ...
}

Task 2: Regular Expressions

Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:

Email

  • Format:

    • username@hostname.domain

  • Username and Hostname:

    • Can contain letters, numbers, period (.), underscore (_), hyphen (-).

    • Must start and end with letter/number.

  • Domain:

    • Limited to com, org, edu, and gov.

Date

  • Formats:

    • YYYY/MM/DD or YY/MM/DD

    • YYYY-MM-DD or YY-MM-DD

  • Year:

    • 4 digits: between 1951 and 2050

    • 2 digits: interpreted as a year in 1951 - 2050 (51 - 99 map to 1951 - 1999; 00 - 50 map to 2000 - 2050)

  • Month:

    • 1 - 12 (can be with/without leading zero)

  • Day:

    • 1 - 31 (can be with/without leading zero)

    • Must be valid for the given month.

URL

  • Format:

    • protocol://address

  • Protocol:

    • http or https (only)

  • Address:

    • Can contain letters, hyphen, dots.

    • Must start with letter/number.

    • Must include at least one dot.

Citation

  • Formats:

    • Single author: Lastname, YYYY (e.g., Smith, 2023)

    • Two authors: Lastname 1 and Lastname 2, YYYY (e.g., Smith and Jones, 2023)

    • Multiple authors: Lastname 1 et al., YYYY (Smith et al., 2023)

  • Last names must be capitalized and may consist of multiple capitalized words.

  • Year must be between 1900-2024.
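One possible implementation is sketched below. The patterns follow the rules above but simplify some constraints (in particular, day validity per month is not fully enforced), so treat them as a starting point rather than the reference solution:

```python
import re

# Hypothetical patterns for the four formats; a sketch, not the grader's exact rules.
_NAME = r'[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?'   # starts/ends with letter or digit
_EMAIL = re.compile(rf'{_NAME}@{_NAME}\.(?:com|org|edu|gov)')
_DATE = re.compile(
    r'(?:19(?:5[1-9]|[6-9]\d)|20(?:[0-4]\d|50)|\d\d)'   # 1951-2050, or any 2-digit year
    r'([/-])(?:0?[1-9]|1[0-2])'                         # separator + month 1-12
    r'\1(?:0?[1-9]|[12]\d|3[01])')                      # same separator + day 1-31 (simplified)
_URL = re.compile(r'https?://[A-Za-z0-9][A-Za-z-]*(?:\.[A-Za-z0-9][A-Za-z-]*)+')
_LAST = r'[A-Z][a-z]+(?: [A-Z][a-z]+)*'                 # one or more capitalized words
_CITE = re.compile(rf'{_LAST}(?: (?:and {_LAST}|et al\.))?, (?:19\d\d|20(?:[01]\d|2[0-4]))')

def regular_expressions(s):
    if _EMAIL.fullmatch(s): return 'email'
    if _DATE.fullmatch(s): return 'date'
    if _URL.fullmatch(s): return 'url'
    if _CITE.fullmatch(s): return 'cite'
    return None
```

Using fullmatch() rather than search() ensures the entire input string conforms to the pattern, not just a substring of it.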

Submission

Commit and push the text_processing.py file to your GitHub repository.

Rubric

  • Task 1: Chronicles of Narnia (7 points)

  • Task 2: Regular Expressions (3 points)

  • Concept Quiz (2 points)
