CS|QTM|LING-329: Computational Linguistics (Spring 2025)
Time: TBA
Location: TBA
Jinho Choi : Associate Professor of Computer Science : Office Hours → TBA
Catherine Baker : MS Student in Computer Science : Office Hours → TuTh 11 AM - 12:30 PM, Zoom (ID: 964 3750 1501, PW: posted in Canvas)
Zelin Zhang : Ph.D. student in Computer Science and Informatics : Office Hours → MW 11:20 AM - 12:50 PM, Zoom (ID: 975 2341 9724, PW: posted in Canvas)
Homework: 70%
Team Formation: 3%
Project Proposal: 12%
Live Demonstration: 15%
Your work is governed by the Emory Honor Code. Honor code violations (e.g., copying from any source, including colleagues and internet sites) will be referred to the Emory Honor Council.
Requests for absence/rescheduling due to severe personal events (such as health, family, or personal reasons) impacting course performance must be supported by a letter from the Office of Undergraduate Education.
Each topic will include homework that combines quizzes and programming assignments to assess your understanding of the subject matter.
Assignments must be submitted individually. While discussions are allowed, your work must be original.
Late submissions within a week will be accepted with a grading penalty of 15% but will not be accepted after the solutions are discussed in class.
Each section incorporates questions to explore the content more comprehensively, with their corresponding answers slated for discussion in the class.
While certain questions may have multiple valid answers, the grading will be based on the responses discussed in class, and alternative answers will be disregarded. This approach allows us to distinguish between answers discussed in class and those generated by AI tools like ChatGPT.
You are encouraged to use any code examples provided in this book.
You can invoke any APIs provided in the course packages (under the src/ directory).
Feel free to create additional functions and variables in the assigned Python file. For each homework, ensure that all your implementations are included in the respective Python file located under the src/homework/ directory.
Usage of packages not covered in the corresponding chapter is prohibited. Ensure that your code does not rely on the installation of additional packages, as we will not be able to execute your program for evaluation if external dependencies are needed.
You are expected to:
Form a team of 3-4 members.
Give a pitch presentation to showcase your idea for the project.
Provide a live demonstration to illustrate the details and potential of your project.
Everyone in the same group will receive the same grade for the project, except for the individual portion.
Your project will undergo evaluation based on various criteria, including originality, feasibility, and potential impact.
Your project will also undergo peer assessment, which will factor into your project grade.
Participation in project presentations and live demonstrations is compulsory. Failure to attend any of these events will result in a zero grade for the respective activity. In the event of unavoidable absence due to severe personal circumstances, a formal letter from the Office of Undergraduate Education must accompany any excuses.
You can earn up to 3 extra credits by helping us improve this online book. If you wish to contribute, please submit an issue to our GitHub repository using the "Online Book" template. Upon verification, you will receive credits based on the following criteria:
Content enhancements (e.g., additional explanations, test codes): 0.3 points
Code bug fixes: 0.2 points
Identification and correction of typos (and other obvious mistakes): 0.1 points
Prior to submission, please check for existing issues to avoid duplication. If multiple submissions of the same (or very similar) issues occur, only the first one will be credited.
By Jinho D. Choi (2023 Edition)
Natural Language Processing (NLP) is a vibrant field in Artificial Intelligence that seeks to create computational models to understand, interpret, and generate human language. NLP technology has become deeply ingrained in our daily lives through various applications, evolving at an unprecedented pace. Understanding how NLP works enables you to maximize the utilization of these applications, ultimately enhancing your lifestyle.
This course focuses on establishing a solid foundation in the core principles essential for modern NLP techniques. Starting with the basics of text processing, you will learn how to manipulate text to enhance data quality for developing NLP models. Next, we will delve into language modeling that enables computational systems to understand and generate human language and explore vector space models that convert human language into machine-readable vector representations.
Moving forward, we will cover distributional semantics, a technique for creating word embeddings based on their global contextual usage, and adapt them for sequence modeling to tackle NLP tasks that are inherently structured around sequences of words. We will also delve into contextual representations that capture the subtleties and nuances of language by considering local context. Finally, we will explore cutting-edge topics, including large language models and their effects on NLP tasks and applications.
Throughout the course, several quizzes and Python programming assignments will further deepen your understanding of the concepts and the practice of NLP. By the end of the term, you can expect to possess the knowledge and skills necessary to navigate the swiftly evolving landscape of NLP.
Introduction to Python Programming
Introduction to Machine Learning
Each section has its own set of references. We highly recommend you read the ones marked with asterisks (*), as they provide an in-depth understanding of those subjects.
HW0: Getting Started
Install Python version 3.10 or higher. Earlier versions are not compatible with this course.
You are encouraged to install the latest version of Python. Please review the new features introduced in each version.
Login to GitHub (create an account if you do not have one).
Create a new repository called nlp-essentials and set it to private.
From the [Settings] menu, add the following as a collaborator to this repository: EmoryTA.
Install PyCharm on your local machine:
The following instructions assume that you have "PyCharm 2023.3.x Professional Edition".
You can get the professional version by applying for an academic license.
Configure your GitHub account:
Go to [Settings] - [Version Control] - [GitHub].
Press [+], select Log in via GitHub, and follow the procedure.
Create a new project:
Press the [Get from VCS] button on the Welcome prompt.
Choose [GitHub] on the left menu, select the nlp-essentials repository, and press [Clone] (make sure the directory name is nlp-essentials).
Setup an interpreter:
Go to [Settings] - [Project: nlp-essentials] - [Project Interpreter].
Click Add Interpreter and select Add Local Interpreter.
In the prompted window, choose [Virtualenv Environment] on the left menu, configure as follows, then press [OK]:
Environment: New
Location: SOME_LOCAL_PATH/nlp-essentials/venv
Base interpreter: Python 3.11 (or the Python version you installed)
https://plugins.jetbrains.com/plugin/10081-jetbrains-academy
Open a terminal by clicking [Terminal] at the bottom (or go to [View] - [Terminal]).
Upgrade pip (if necessary) by entering the following command into the terminal:
Install setuptools (if necessary) using the following command:
Install the ELIT Tokenizer with the following command:
If the terminal prompts "Successfully installed ...", the packages are installed on your machine.
1. Create a package called src under the nlp-essentials directory.
PyCharm may automatically create the __init__.py file under src, which is required for Python to recognize the directory as a package, so leave the file as it is.
2. Create a homework package under the src package.
3. Create a Python file called getting_started.py under homework and copy the code:
If PyCharm prompts you to add getting_started.py to git, press [Add].
4. Run the program by clicking [Run] - [Run 'getting_started']. An alternative way is to click the green triangle (L20) and select Run 'getting_started':
5. If you see the following output, your program runs successfully.
1. Create a .gitignore file under the nlp-essentials directory and copy the content:
2. Add the following files to git by right-clicking on them and selecting [Git] - [Add] (if not already):
getting_started.py
.gitignore
Once the files are added to git, they should turn green. If not, restart PyCharm and try to add them again.
3. Commit and push your changes to GitHub:
Right-click on nlp-essentials.
Select [Git] - [Commit Directory].
Enter a commit message (e.g., Submit Quiz 0).
Press the [Commit and Push] button.
Make sure you both commit and push, not just commit.
4. Check if the above files are properly pushed to your GitHub repository.
Submit the URL of your GitHub repository to Canvas.
Text processing refers to the manipulation and analysis of textual data through techniques applied to raw text, making it more structured, understandable, and suitable for various applications.
If you are not acquainted with Python programming, I strongly recommend going through all the examples in this section, as they provide detailed explanations of packages and functions commonly used for language processing.
Update: 2023-10-31
Consider the following text from Wikipedia about (as of 2023-10-18):
Our task is to determine the number of word tokens and unique word types in this text. A simple way of accomplishing this task is to split the text with whitespaces and count the strings:
What is the difference between a word token and a word type?
L1: Import the Counter class from the collections package.
L4: Open the corpus file, read its contents as a string, split it into a list of words, and store them in the words list.
L5: Use Counter to count the occurrences of each word in words and store the results in the word_counts dictionary.
L7: Print the total number of word tokens in the corpus, which is the length of words.
L8: Print the number of unique word types in the corpus, which is the length of word_counts.
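The code listing itself is not reproduced above, so here is a minimal sketch of the described steps; the corpus path dat/emory-wiki.txt is a placeholder for wherever you saved the Wikipedia text:

```python
from collections import Counter

corpus = 'dat/emory-wiki.txt'  # placeholder path to the saved Wikipedia text
words = open(corpus).read().split()
word_counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(word_counts)}')
```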
In this task, we want to check the top-k most or least frequently occurring words in this text:
L4: Sort items in word_counts in ascending order and save them into wc_asc as a list of (word, count) tuples.
L5: Iterate over the top 10 (least frequent) words in the sorted list and print each word along with its count.
Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) or "Georgia." (with the period). This is because the text was split only by whitespace without considering punctuation. As a result, these words are recognized as separate word types from "Atlanta" and "Georgia", respectively. Hence, the counts of word tokens and types processed above do not necessarily represent the distributions of the text accurately.
Finally, save all word types in alphabetical order to a file:
L1: Open a file word_types.txt in write mode (w). If the file does not exist, it will be created; if it does exist, its previous contents will be overwritten.
L2: Iterate over the unique word types (keys) of word_counts in alphabetical order, and write each word followed by a newline character to fout.
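A short sketch of this step, assuming the word_counts dictionary from the previous snippet:

```python
with open('word_types.txt', 'w') as fout:
    for word in sorted(word_counts):   # word types in alphabetical order
        fout.write(word + '\n')
```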
Update: 2023-10-31
Sometimes, it is more appropriate to consider canonical forms as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature for each year, you want to count both "transformer" and "transformers" as a single item.
Lemmatization is a task that simplifies words into their base or dictionary forms, known as lemmas, to simplify the interpretation of their core meaning.
What is the difference between a lemmatizer and a stemmer [1]?
When analyzing the word types obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:
Universities
University
universities
university
Two variations are applied to the noun "university": capitalization, generally used for proper nouns or initial words, and pluralization, which indicates multiple instances of the term. On the other hand, verbs can also take several variations regarding tense and aspect:
study
studies
studied
studying
get_lemma_lexica()
We want to develop a lemmatizer that normalizes all variations into their respective lemmas. Let us start by creating lexica for lemmatization:
Lexica:
lemmatize()
Next, let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:
L2: Define a nested function aux to handle lemmatization.
L6-7: Try applying each rule in the rules list to word.
L8: If the resulting lemma is in the vocabulary, return it.
L10: If no lemma is found, return None.
L12: Convert the input word to lowercase for case-insensitive processing.
L13: Try to lemmatize the word using verb-related lexica.
L15-16: If no lemma is found among verbs, try to lemmatize using noun-related lexica.
L18: Return the lemma if found or the decapitalized word if no lemmatization occurred.
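The book builds its lexica with get_lemma_lexica(); the following is a simplified, self-contained sketch of the same idea using toy lexica, so the word lists and suffix rules here are illustrative rather than the book's actual data:

```python
def lemmatize(word: str) -> str:
    """Simplified sketch; the book's version uses the lexica created by get_lemma_lexica()."""
    # toy lexica: base vocabularies, irregular forms, and suffix-replacement rules
    nouns = {'university', 'study'}
    nouns_irregular = {'mice': 'mouse'}
    nouns_rules = [('ies', 'y'), ('es', ''), ('s', '')]
    verbs = {'study', 'get'}
    verbs_irregular = {'got': 'get', 'gotten': 'get'}
    verbs_rules = [('ies', 'y'), ('ied', 'y'), ('ying', 'y'), ('ing', ''), ('ed', ''), ('es', ''), ('s', '')]

    def aux(word, vocab, irregular, rules):
        if word in irregular:               # irregular form -> its lemma
            return irregular[word]
        for suffix, replacement in rules:   # try each suffix-replacement rule
            if word.endswith(suffix):
                lemma = word[:len(word) - len(suffix)] + replacement
                if lemma in vocab:          # accept only if the result is a known base form
                    return lemma
        return None

    word = word.lower()                     # case-insensitive processing
    lemma = aux(word, verbs, verbs_irregular, verbs_rules)
    if lemma is None:
        lemma = aux(word, nouns, nouns_irregular, nouns_rules)
    return lemma if lemma is not None else word

print(lemmatize('Universities'))  # university
print(lemmatize('studied'))       # study
```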
We now test our lemmatizer for nouns and verbs:
When the words are further normalized by lemmatization, the number of word tokens remains the same as without lemmatization, but the number of word types is reduced from 197 to 177.
In which tasks can lemmatization negatively impact performance?
Update: 2023-10-31
Regular expressions, commonly abbreviated as regex, form a language for string matching, enabling operations to search, match, and manipulate text based on specific patterns or rules.
Online Interpreter:
Regex provides metacharacters with specific meanings, making it convenient to define patterns:
. : any single character except a newline character.
[ ] : a character set matching any character within the brackets.
\d : any digit, equivalent to [0-9].
\D : any character that is not a digit, equivalent to [^0-9].
\s : any whitespace character, equivalent to [ \t\n\r\f\v].
\S : any character that is not a whitespace character, equivalent to [^ \t\n\r\f\v].
\w : any word character (alphanumeric or underscore), equivalent to [A-Za-z0-9_].
\W : any character that is not a word character, equivalent to [^A-Za-z0-9_].
\b : a word boundary matching the position between a word character and a non-word character.
Examples:
M.\. matches "Mr." and "Ms.", but not "Mrs." (\ escapes the metacharacter .).
[aeiou] matches any vowel.
\d\d\d searches for "170" in "CS170".
\D\D searches for "kg" in "100kg".
\s searches for the space " " in "Hello World".
\S searches for "H" in " Hello".
\w\w searches for "1K" in "$1K".
\W searches for "!" in "Hello!".
\bis\b matches "is", but does not match "island" nor searches for "is" in "basis".
Repetitions allow you to define complex patterns that can match multiple occurrences of a character or group of characters:
* : the preceding character or group appears zero or more times.
+ : the preceding character or group appears one or more times.
? : the preceding character or group appears zero times or once, making it optional.
{m} : the preceding character or group appears exactly m times.
{m,n} : the preceding character or group appears at least m times but no more than n times.
{m,} : the preceding character or group appears at least m times or more.
By default, matches are "greedy" such that patterns match as many characters as possible.
Matches become "lazy" by adding ? after the repetition metacharacters, in which case patterns match as few characters as possible.
Examples:
\d* matches "90" in "90s" as well as "" (the empty string) in "ABC".
\d+ matches "90" in "90s", but has no match in "ABC".
https? matches both "http" and "https".
\d{3} is equivalent to \d\d\d.
\d{2,4} matches "12", "123", and "1234", but not "1" or "12345".
\d{2,} matches "12", "123", "1234", and "12345", but not "1".
<.+> matches the entire string of "<Hello> and <World>".
<.+?> matches "<Hello>" in "<Hello> and <World>", and searches for "<World>" in the text.
Grouping allows you to treat multiple characters, subpatterns, or metacharacters as a single unit. It is achieved by placing these characters within parentheses ( and ).
| : a logical OR, referred to as a "pipe" symbol, allowing you to specify alternatives.
( ) : a capturing group; any text that matches the parenthesized pattern is "captured" and can be extracted or used in various ways.
(?: ) : a non-capturing group; any text that matches the parenthesized pattern, while indeed matched, is not "captured" and thus cannot be extracted or used in other ways.
\num : a backreference that refers back to the most recently matched text by the num'th capturing group within the same regex.
You can nest groups within other groups to create more complex patterns.
Examples:
(cat|dog) matches either "cat" or "dog".
(\w+)@(\w+.\w+) has two capturing groups, (\w+) and (\w+.\w+), and matches email addresses such as "john@emory.edu", where the first and second groups capture "john" and "emory.edu", respectively.
(?:\w+)@(\w+.\w+) has one non-capturing group (?:\w+) and one capturing group (\w+.\w+). It still matches "john@emory.edu" but only captures "emory.edu", not "john".
(\w+) (\w+) - (\2), (\1) has four capturing groups, where the third and fourth groups refer to the second and first groups, respectively. It matches "Jinho Choi - Choi, Jinho", where the first and fourth groups capture "Jinho" and the second and third groups capture "Choi".
(\w+.(edu|org)) has two capturing groups, where the second group is nested in the first group. It matches "emory.edu" or "emorynlp.org", where the first group captures the entire text while the second group captures "edu" or "org", respectively.
Assertions define conditions that must be met for a match to occur. They do not consume characters in the input text but specify the position where a match should happen based on specific criteria.
A positive lookahead assertion (?= ) checks that a specific pattern is present immediately after the current position.
A negative lookahead assertion (?! ) checks that a specific pattern is not present immediately after the current position.
A positive look-behind assertion (?<= ) checks that a specific pattern is present immediately before the current position.
A negative look-behind assertion (?<! ) checks that a specific pattern is not present immediately before the current position.
^ asserts that the pattern following the caret must match at the beginning of the text.
$ asserts that the pattern preceding the dollar sign must match at the end of the text.
Examples:
apple(?=[ -]pie) matches "apple" in "apple pie" or "apple-pie", but not in "apple juice".
do(?!(?: not|n't)) matches "do" in "do it" or "doing", but not in "do not" or "don't".
(?<=\$)\d+ matches "100" in "$100", but not in "100 dollars".
(?<!not )(happy|sad) searches for "happy" in "I'm happy", but does not search for "sad" in "I'm not sad".
not searches for "not" in "note" and "cannot", whereas ^not matches "not" in "note" but not in "cannot".
not$ searches for "not" in "cannot" but not in "note".
Python provides several functions to make use of regular expressions.
Let us create a regular expression that matches "Mr." and "Ms.":
L1:
r'M' matches the letter "M".
r'[rs]' matches either "r" or "s".
r'\.' matches a period (dot).
Currently, no group has been specified for re_mr:
Let us capture the letters and the period as separate groups:
L1: The pattern re_mr is looking for the following:
1st group: "M" followed by either "r" or "s".
2nd group: a period (".").
L2: Match re_mr with the input string "Ms".
L7: Print specific groups by specifying their indexes. Group 0 is the entire match, group 1 is the first capture group, and group 2 is the second capture group.
If the pattern does not find a match, it returns None.
Let us match the following strings with re_mr:
s1 matches "Mr." but not "Ms.", while s2 does not match any pattern. This is because the match() function matches patterns only at the beginning of the string. To match patterns anywhere in the string, we need to use search() instead:
The search() function matches "Mr." in both s1 and s2 but still does not match "Ms.". To match them all, we need to use the findall() function:
While the findall() function matches all occurrences of the pattern, it does not provide a way to locate the positions of the matched results in the string. To find the locations of the matched results, we need to use the finditer() function:
Finally, you can replace the matched results with another string by using the sub() function:
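Below is a small consolidated sketch of match(), search(), findall(), finditer(), and sub() using the grouped re_mr pattern; the test strings are stand-ins for the book's s1 and s2:

```python
import re

re_mr = re.compile(r'(M[rs])(\.)')
s1 = 'Mr. and Ms. Wayne are here.'   # hypothetical test strings
s2 = 'Here are Mr. and Ms. Wayne.'

print(re_mr.match(s1))        # matches "Mr." at the beginning of s1
print(re_mr.match(s2))        # None: the pattern is not at the beginning of s2
print(re_mr.search(s2))       # finds the first occurrence, "Mr.", anywhere in s2
print(re_mr.findall(s1))      # [('Mr', '.'), ('Ms', '.')]: the captured groups of every match
for m in re_mr.finditer(s1):  # each match object also provides its position
    print(m.group(), m.start(), m.end())
print(re_mr.sub('M*', s1))    # replaces every match with "M*"
```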
Finally, let us write a simple tokenizer using regular expressions. We will define a regular expression that matches the necessary patterns for tokenization:
L2: Create a regular expression to match delimiters and a special case:
Delimiters: ',', '.', or whitespaces ('\s+').
The special case: 'n't' (e.g., "can't").
L3: Create an empty list tokens to store the resulting tokens, and initialize prev_idx to keep track of the previous token's end position.
L5: Iterate over matches in text using the regular expression pattern.
L6: Extract the substring between the previous token's end and the current match's start, strip any leading or trailing whitespace, and assign it to t.
L7: If t is not empty (i.e., it is not just whitespace), add it to the tokens list.
L8: Extract the matched token from the match object, strip any leading or trailing whitespace, and assign it to t.
L10: If t is not empty (i.e., the pattern is matched):
L11-12: Check if the previous token in tokens is "Mr" or "Ms" and the current token is a period ("."), in which case, combine them into a single token.
L13-14: Otherwise, add t to tokens.
L18-19: After the loop, there might be some text left after the last token. Extract it, strip any leading or trailing whitespace, and add it to tokens.
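The exact pattern used in the book is not shown here, so the following sketch implements the described logic with an assumed pattern (commas, periods, whitespace, and "n't") and a made-up function name to avoid clashing with the tokenize() defined in the next section:

```python
import re

def tokenize_regex(text: str) -> list[str]:
    # delimiters: commas, periods, whitespace; special case: "n't" (e.g., "isn't" -> "is", "n't")
    re_tok = re.compile(r"([.,]|\s+|n't)")
    tokens, prev_idx = [], 0

    for m in re_tok.finditer(text):
        t = text[prev_idx:m.start()].strip()   # text between the previous match and this one
        if t:
            tokens.append(t)
        t = m.group().strip()                  # the matched delimiter/special case itself
        if t:
            if tokens and tokens[-1] in ('Mr', 'Ms') and t == '.':
                tokens[-1] += t                # re-attach the period of "Mr."/"Ms."
            else:
                tokens.append(t)
        prev_idx = m.end()

    t = text[prev_idx:].strip()                # whatever remains after the last match
    if t:
        tokens.append(t)
    return tokens

print(tokenize_regex("Mr. Wayne isn't here."))  # ['Mr.', 'Wayne', 'is', "n't", 'here', '.']
```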
Test cases for the tokenizer:
Update: 2024-01-05
Your goal is to extract specific information from each book in the . For each book, gather the following details:
Book Title
Year of Publishing
For each chapter within a book, collect the following information:
Chapter Number
Chapter Title
Token count of the chapter content (excluding the chapter number and title), considering each symbol and punctuation as a separate token.
Download the file and place it under the directory. Note that the text file is already tokenized using the .
Create a file in the directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:
For each book, ensure that the list of chapters is sorted in ascending order based on chapter numbers. For every chapter across the books, your program should produce the same list for both the original and modified versions of the text file.
If your program encounters difficulty locating the text file, please verify the working directory specified in the run configuration: [Run - Edit Configurations] -> text_processing. Confirm that the working directory path is set to the top directory, nlp-essentials.
RE_Abbreviation: Dr., U.S.A.
RE_Apostrophe: '80, '90s, 'cause
RE_Concatenation: don't, gonna, cannot
RE_Hyperlink: https://emory.gitbook.io/nlp-essentials
RE_Number: 1/2, 123-456-7890, 1,000,000
RE_Unit: $10, #20, 5kg
Your regular expressions will be assessed based on typical cases beyond the examples mentioned above.
Commit and push the text_processing.py file to your GitHub repository.
Update: 2023-10-31
Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.
What is the difference between a word and a token?
When examining the word types obtained in the previous section, you will notice several words that need further tokenization, many of which can be resolved by leveraging punctuation:
"R1: -> ['"', "R1", ":"]
(R&D) -> ['(', 'R&D', ')']
15th-largest -> ['15th', '-', 'largest']
Atlanta, -> ['Atlanta', ',']
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
Depending on the task, you may want to tokenize [26] into ['[', '26', ']'] for more generalization. In this case, however, we consider "[26]" as a unique identifier for the corresponding reference rather than as the number 26 surrounded by square brackets. Thus, we aim to recognize it as a single token.
delimit()
Let us write the delimit() function that takes a word and a set of delimiters and returns a list of tokens by splitting the word using the delimiters:
L3: If no delimiter is found, return a list containing word as a single token.
L5: If a delimiter is found, create a list tokens to store the individual tokens.
L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.
L7: Add the delimiter itself as a separate token to tokens.
We now test delimit() using the following cases:
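Here is a sketch of delimit() as described above, along with a few of the test cases; the delimiter set is an assumption, not the book's exact set:

```python
def delimit(word: str, delimiters: set[str]) -> list[str]:
    # index of the first delimiter character in word, or -1 if none is found
    i = next((i for i, c in enumerate(word) if c in delimiters), -1)
    if i < 0:
        return [word]                 # no delimiter: the whole word is a single token
    tokens = []
    if i > 0:
        tokens.append(word[:i])       # characters before the delimiter
    tokens.append(word[i])            # the delimiter itself
    if i + 1 < len(word):
        tokens.extend(delimit(word[i + 1:], delimiters))  # recurse on the remainder
    return tokens

delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}   # assumed delimiter set
for w in ['"R1:', '(R&D)', '15th-largest', 'Atlanta,', "Department's"]:
    print(w, delimit(w, delims))
```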
postprocess()
When reviewing the above output, the first four test cases yield accurate results, while the last five are not handled correctly; they should have been tokenized as follows:
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000 -> ['149,000']
U.S. -> ['U.S.']
To handle these special cases, let us post-process the tokens generated by delimit():
L2: Initialize variables i for the current position and new_tokens for the resulting tokens.
L4: Iterate through the input tokens.
L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).
L6: Combine the apostrophe and "s" and append it as a single token.
L7: Move the position indicator by 1 to skip the next character.
L8-10: Case 2: Handling numbers in special formats like [##], ###,### (e.g., [42], 12,345).
L11: Combine the special number format and append it as a single token.
L12: Move the position indicator by 2 to skip the next two characters.
L13: Case 3: Handling acronyms like "U.S.".
L14: Combine the acronym and append it as a single token.
L15: Move the position indicator by 3 to skip the next three characters.
L16-17: Case 4: If none of the special cases above are met, append the current token.
L18: Move the position indicator by 1 to process the next token.
L20: Return the list of processed tokens.
Once the post-processing is applied, all outputs are handled correctly:
tokenize()
At last, we write the tokenize() function that takes a file path to a corpus and a set of delimiters and returns a list of tokens from the corpus:
L2: Read the contents of a file (corpus) and split it into words.
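A sketch of tokenize() under the assumption that delimit() and postprocess() from above are defined in the same module; the corpus path and delimiter set are again placeholders:

```python
def tokenize(corpus: str, delimiters: set[str]) -> list[str]:
    words = open(corpus).read().split()      # whitespace tokenization first
    return [token for word in words          # then split each word on delimiters
                  for token in postprocess(delimit(word, delimiters))]

tokens = tokenize('dat/emory-wiki.txt', delims)   # placeholder path and delimiter set
print(len(tokens), len(set(tokens)))              # word tokens vs. word types
```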
Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.
Your goal is to develop a bigram model that uses the following techniques:
Laplace smoothing with
Measure the initial word probability by adding the artificial token at the beginning of every sentence.
Test your model using .
Create a file in the directory.
Define a function named bigram_model() that takes a file path pointing to the text file and returns a dictionary of bigram probabilities found in the text file.
Use the following constants to indicate the unknown and initial probabilities:
Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.
A bigram model (the resulting dictionary of Task 1)
The initial word (the first word to appear in the sequence)
The length of the sequence (the number of tokens in the sequence)
This function aims to generate a sequence of tokens that adheres to the following criteria:
It must have the precise number of tokens as specified.
Excluding punctuation, there should be no redundant tokens in the sequence.
Finally, the function returns a tuple comprising the following two elements:
The list of tokens in the sequence
Define a function named sequence_generator_max() that accepts the same parameters but returns a sequence with the highest sequence probability among all possible sequences using exhaustive search. To generate long sequences, dynamic programming needs to be adapted.
Commit and push the language_modeling.py file to your GitHub repository.
L1: Sort items in word_counts in descending order and save them into wc_dec as a list of (word, count) tuples, sorted from the most frequent to the least frequent words.
L2: Iterate over the top 10 (most frequent) words in the sorted list and print each word along with its count.
, The Python Standard Library - Built-in Types.
, The Python Standard Library - Built-in Types.
, The Python Tutorial.
Source:
L1:
L8-11:
: base nouns
: nouns whose plural forms are irregular (e.g., mouse -> mice)
: pluralization rules for nouns
: base verbs
: verbs whose inflection forms are irregular (e.g., buy -> bought)
: inflection rules for verbs
L3-4: Check if the word is in the irregular dictionary (), if so, return its lemma.
At last, let us recount word types in using the lemmatizer and save them:
Source:
, Porter, Program: Electronic Library and Information Systems, 14(3), 1980 ().
- A heuristic-based lemmatizer.
The terms "match" and "search" in the above examples have different meanings. "match" means that the pattern must be found at the beginning of the text, while "search" means that the pattern can be located anywhere in the text. We will discuss these two functions in more detail in the .
L3: Create a regular expression re_mr. Note that a string indicated by an r prefix is considered a regular expression in Python.
L4: Try to match re_mr at the beginning of the string "Mr. Wayne".
L6: Print the value of m. If matched, it prints the match information; otherwise, m is None; thus, it prints "None".
L7: Check if a match was found (m is not None), and print the start position and end position of the match.
L1:
L5: Print the entire matched string.
L6: Print a tuple of all captured groups.
Source:
, Kuchling, HOWTOs in Python Documentation.*
The file contains books and chapters listed in chronological order. Your program will be evaluated using both the original and a modified version of this text file, maintaining the same format but with books and chapters arranged in a mixed order.
Define regular expressions to match the following cases using the corresponding variables in :
L1:
L2: Find the index of the first character in word that is in the delimiters set. If no delimiter is found in word, return -1.
L9-10: If there are characters after the delimiter, recursively call the delimit() function on the remaining part of word and extend the tokens list with the result.
L1: .
L3: Tokenize each word in the corpus using the specified delimiters. postprocess() is used to process the special cases further. The resulting tokens are collected in a list and returned.
Given the new tokenizer, let us recount word types in the corpus and save them:
Despite the increase in word types, using a more advanced tokenizer effectively mitigates the issue of sparsity. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?
Source:
- A heuristic-based tokenizer.
In language_modeling.py, define a function named sequence_generator() that takes the following parameters:
Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of 4 punctuation tokens are permitted).
In this task, the goal is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the input word, a potential strategy is as follows:
Identify the next word for which the bigram probability given the current word is maximized.
If that word fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.
Take the chosen word as the new current word and repeat step 1 until you reach the specified sequence length.
The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to the base e, provided as the math.log() function in Python.
01/17
01/22
01/24
(continue)
01/29
(continue)
01/31
(continue)
02/05
02/07
(continue)
02/12
(continue)
02/14
(continue)
02/19
02/21
(continue)
02/26
(continue)
02/28
(continue)
03/04
03/06
(continue)
03/11
Spring Break
03/13
Spring Break
03/18
03/20
(continue)
03/25
(continue)
03/27
(continue)
04/01
(continue)
HW4
04/03
Progress Report
04/08
(continue)
04/10
04/15
04/17
HW5
04/22
04/24
(continue)
04/29
HW6
In the bag-of-words model, a document is represented as a set or a "bag" of words, disregarding any structure but maintaining information about the frequency of every word.
Consider a corpus containing the following two tokenized documents:
The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:
Let $D$ be a document and $w_i$ be the $i$'th word in $D$. A vector representation for $D$ can be defined as $v$, where the $j$'th dimension of $v$ corresponds to the $j$'th word type in the vocabulary and its value is the frequency of that word type's occurrences in $D$ such that:
One limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in being 0 in practice, as most words in the vocabulary do not occur in document . Therefore, it is more efficient to represent as a sparse vector:
How does the bag-of-words model handle unknown words that are not in the vocabulary?
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
Finally, let us test our bag-of-words model with the examples above:
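The source file is bag_of_words_model.py, but its exact signatures are not shown above; the following is a sketch of the two functions with hypothetical names and toy documents:

```python
from collections import Counter

def vocabulary(documents: list[list[str]]) -> dict[str, int]:
    # assign a unique ID to every word type in the corpus
    vocab = {}
    for document in documents:
        for word in document:
            vocab.setdefault(word, len(vocab))
    return vocab

def bag_of_words(vocab: dict[str, int], document: list[str]) -> list[tuple[int, int]]:
    # sparse vector: (word ID, frequency) pairs for words that occur in the document
    counts = Counter(document)
    return sorted((vocab[word], count) for word, count in counts.items() if word in vocab)

corpus = [['John', 'bought', 'a', 'book', '.'],
          ['Mary', 'bought', 'a', 'book', 'too', '.']]          # toy documents
vocab = vocabulary(corpus)
print(bag_of_words(vocab, corpus[0]))
print(bag_of_words(vocab, ['John', 'read', 'the', 'book']))     # 'read', 'the' are out of vocabulary
```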
Source: bag_of_words_model.py
Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials
Update: 2023-10-13
Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X$ is defined as:
$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$
where $P(x)$ is the probability distribution of $X$. The self-information of $x$ is defined as $-\log_2 P(x)$, which measures how much information is gained when $x$ occurs. The negative sign indicates that as the probability of $x$'s occurrence increases, its self-information value decreases.
Entropy has several properties, including:
It is non-negative: $H(X) \geq 0$.
It is at its minimum when $X$ is entirely predictable (all probability mass on a single outcome).
It is at its maximum when all outcomes of $X$ are equally likely.
Why is the self-information value expressed in a logarithmic scale?
Sequence entropy is a measure of the unpredictability or information content of the sequence, which quantifies how uncertain or random a word sequence is.
Assume a long sequence of words, , concatenating the entire text from a language . Let be a set of all possible sequences derived from , where is the shortest sequence (a single word) and is the longest sequence. Then, the entropy of can be measured as follows:
The entropy rate (per-word entropy), , can be measured by dividing by the total number of words :
In theory, there is an infinite number of unobserved word sequences in the language . To estimate the true entropy of , we need to take the limit to as approaches infinity:
The Shannon-McMillan-Breiman theorem implies that if the language is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to their respective probabilities.
The bigram model in the previous section is stationary because all probabilities rely on the same condition, . In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
By applying this theorem, can be approximated:
Consequently, is approximated as follows, where :
What does it mean when the entropy of a corpus is high?
Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence is measured as:
Hence, the higher the probability of the sequence is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating the sequence.
Perplexity, , can be directly derived from the approximated entropy rate, :
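As a concrete illustration of these definitions, here is a small sketch that computes the entropy of a toy distribution and the perplexity of a sequence under a unigram model; the function names and numbers are illustrative only:

```python
import math

def entropy(dist: dict[str, float]) -> float:
    # H(X) = -sum P(x) * log2 P(x); the self-information of x is -log2 P(x)
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def perplexity(sequence: list[str], probs: dict[str, float]) -> float:
    # per-word cross-entropy of the sequence under a unigram model, then PP = 2^H
    h = -sum(math.log2(probs[w]) for w in sequence) / len(sequence)
    return 2 ** h

uniform = {'a': 0.25, 'b': 0.25, 'c': 0.25, 'd': 0.25}
skewed = {'a': 0.85, 'b': 0.05, 'c': 0.05, 'd': 0.05}
print(entropy(uniform), entropy(skewed))         # the uniform distribution has higher entropy
print(perplexity(['a', 'b', 'a', 'a'], skewed))  # lower when the model fits the sequence well
```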
How does high entropy affect the perplexity of a language model?
A vector space model is a computational framework to represent text documents as vectors in a high-dimensional space such that each document is represented as a vector, and each dimension of the vector corresponds to a particular term in the vocabulary.
Update: 2024-01-05
An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.
For the sentence "I'm a computer scientist.", [1-3]-grams can be extracted as follows:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as individual tokens, which should have been tokenized as ["I", "'m"] and ["scientist", "."].
What are the potential issues of using n-grams without proper tokenization?
Given a large corpus, a unigram model calculates the probability of each word as follows (: the total occurrences of in the corpus, : a set of all word types in the corpus):
Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
What are the benefits of processing line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
L1: Import the Unigram type alias from the src.types package.
L5: Calculate the total count of all unigrams in the text.
L6: Return a dictionary where each word is a key and its probability is the value.
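A standalone sketch of the two functions described above; the book's versions return the Unigram type alias from src.types, whereas this sketch uses plain dictionaries:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    for line in open(filepath):           # count line by line instead of loading the whole file
        unigrams.update(line.split())
    return unigrams

def unigram_estimation(filepath: str) -> dict[str, float]:
    unigrams = unigram_count(filepath)
    total = sum(unigrams.values())        # total number of word tokens
    return {word: count / total for word, count in unigrams.items()}

probs = unigram_estimation('dat/chronicles_of_narnia.txt')
print(sorted(probs.items(), key=lambda x: x[1], reverse=True)[:10])
```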
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file dat/chronicles_of_narnia.txt:
L1: Import the Callable type from the typing module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
L12: Pass the unigram_estimation() function as the second argument.
What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?
A bigram model calculates the conditional probability of the current word given the previous word as follows (: the total occurrences of in the corpus in that order, : a set of all word types occurring after ):
Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
L1: Import the defaultdict class from the collections package.
L2: import the Bigram type alias from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
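A standalone sketch of bigram_count() and bigram_estimation(); the book's versions use the Bigram type alias from src.types, and the choice of 'the' as the previous word in the test is arbitrary:

```python
from collections import Counter, defaultdict

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)
    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]][words[i]] += 1     # count (previous word, current word)
    return bigrams

def bigram_estimation(filepath: str) -> dict[str, dict[str, float]]:
    bigrams = bigram_count(filepath)
    probs = {}
    for prev, counter in bigrams.items():
        total = sum(counter.values())                # all bigrams sharing the previous word
        probs[prev] = {curr: count / total for curr, count in counter.items()}
    return probs

probs = bigram_estimation('dat/chronicles_of_narnia.txt')
print(sorted(probs['the'].items(), key=lambda x: x[1], reverse=True)[:10])
```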
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
Source: ngram_models.py
N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.
Update: 2024-01-05
The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.
Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds the count of 1 to every word and recalculates the unigram probabilities:
Thus, the probability of any unknown word with Laplace smoothing is calculated as follows:
The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.
Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:
Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the unigram_count() function from the src.ngram_models package.
L4: Define a constant representing the unknown word.
L8: Increment the total count by the vocabulary size.
L9: Increment each unigram count by 1.
L10: Add the unknown word to the unigrams with a probability of 1 divided by the total count.
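A sketch of unigram_smoothing() following the steps above; the value of the UNKNOWN constant is a placeholder, and the import assumes the unigram_count() defined in the previous section:

```python
from src.ngram_models import unigram_count   # the counting function from the previous section

UNKNOWN = '<unk>'   # placeholder; use whatever constant the book defines for the unknown word

def unigram_smoothing(filepath: str) -> dict[str, float]:
    unigrams = unigram_count(filepath)
    total = sum(unigrams.values()) + len(unigrams)        # add the vocabulary size to the total
    probs = {word: (count + 1) / total for word, count in unigrams.items()}
    probs[UNKNOWN] = 1 / total                            # any unseen word gets 1 / total
    return probs

probs = unigram_smoothing('dat/chronicles_of_narnia.txt')
print(probs.get('Aslan', probs[UNKNOWN]), probs.get('Jinho', probs[UNKNOWN]))
```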
We then test unigram_smoothing() with a text file dat/chronicles_of_narnia.txt:
L1: Import the test_unigram() function from the ngram_models package.
Compared to the unigram results without smoothing (see the "Comparison" tab above), the probabilities for these top unigrams have slightly decreased.
Will the probabilities of all unigrams always decrease when Laplace smoothing is applied? If not, under what circumstances might the unigram probabilities increase after smoothing?
The unigram probability of any word (including unknown words) can be retrieved using the UNKNOWN key:
L2: Use the get() method to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.
L5: Test a known word, 'Aslan', and an unknown word, 'Jinho'.
The bigram model can also be enhanced by applying Laplace smoothing:
Thus, the probability of an unknown bigram, where the previous word is known but the current word is unknown, is calculated as follows:
What does the Laplace smoothed bigram probability of represent when is unknown? What is a potential problem with this estimation?
Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the bigram_count() function from the src.ngram_models package.
L6-8: Create a set vocab containing all unique words in the bigrams.
L12: Calculate the total count of all bigrams with the same previous word.
L13: Calculate and store the probabilities of each current word given the previous word
L14: Calculate the probability for an unknown current word.
L17: Add a probability for an unknown previous word.
Why are the L7-8 in the above code necessary to retrieve all word types?
We then test bigram_smoothing() with the same text file:
L1: Import the test_bigram() function from the ngram_models package.
Finally, we test the bigram estimation using smoothing for unknown sequences:
L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.
L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.
L8: The tuple word is unpacked and passed as the second and third parameters.
Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a word is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:
There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $|V| = 7$. Before Laplace smoothing, the bigram probabilities of are estimated as follows:
However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:
The bigram distribution for can be normalized to 1 by adding the total number of word types occurring after , denoted as , to the denominator instead of :
Consequently, the probability of an unknown bigram can be calculated with the normalization as follows:
For the above example, . Once you apply to , the sum of its bigram probabilities becomes 1:
A major drawback of this normalization is that the probability cannot be measured when is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability of , where the previous word is unknown, as follows:
Source: smoothing.py
A language model is a computational model designed to understand, generate, and predict human language. It captures language patterns, learns the likelihood of a specific term occurring in a given context, and assigns probabilities to word sequences through training on extensive text data.
Update: 2023-10-13
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.
In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.
Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I'm a teacher," and selecting the one with the highest probability.
The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:
The joint probability of the word sequence "I am a student" can be measured as follows:
Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: Chain Rule and Markov Assumption.
By applying the chain rule, the above joint probability can be decomposed into:
Thus, the probability of any given word sequence can be measured as:
The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring is as challenging as measuring .
The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:
The joint probability can now be measured by the product of unigram and bigram probabilities.
How do the chain rule and Markov assumption simplify the estimation of sequence probability?
Let us consider the unigram probabilities of the words "the" and "The". In general, "the" appears more frequently than "The", so $P(\text{the}) > P(\text{The})$.
Let $w_0$ be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial word of the text, $P(\text{the} \mid w_0)$ and $P(\text{The} \mid w_0)$. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that $P(\text{The} \mid w_0) > P(\text{the} \mid w_0)$.
This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.
Thus, to predict a more probable initial word, it is better to consider the bigram probability rather than the unigram probability when measuring sequence probability:
This enhancement allows us to elaborate the sequence probability as a simple product of bigram probabilities:
Is it worth considering the end of the text by introducing another artificial token to improve last-word prediction, multiplying the above product with the bigram probability of that token given the last word?
The multiplication of numerous probabilities can often be computationally infeasible due to slow processing and the potential for decimal points to exceed system limits. In practice, logarithmic probabilities are calculated instead:
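A small sketch of this computation with a toy bigram model; the function name and the '<s>' initial token are illustrative, not the book's constants:

```python
import math

def sequence_log_probability(sequence: list[str],
                             bigram_probs: dict[str, dict[str, float]],
                             init: str = '<s>') -> float:
    # log P(w_1 ... w_n) = sum_i log P(w_i | w_{i-1}), starting from an artificial initial token
    log_prob, prev = 0.0, init
    for word in sequence:
        log_prob += math.log(bigram_probs[prev][word])   # summing logs avoids numeric underflow
        prev = word
    return log_prob

# toy model, assuming '<s>' marks the beginning of a sentence
bigram_probs = {'<s>': {'The': 0.6, 'the': 0.1, 'A': 0.3},
                'The': {'cat': 0.5, 'dog': 0.5},
                'cat': {'sleeps': 1.0}}
print(sequence_log_probability(['The', 'cat', 'sleeps'], bigram_probs))
```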
Your task is to develop a sentiment analyzer trained on the Stanford Sentiment Treebank:
Create a vector_space_models.py file in the src/homework/ directory.
Define a function named sentiment_analyzer()
that takes two parameters, a list of training documents and a list of test documents for classification, and returns the predicted sentiment labels along with the respective similarity scores.
Use the k-nearest neighbors algorithm for the classification. Find the optimal value of k using the development set, and then hardcode this value into your function before submission.
The sentiment_treebank directory contains the following two files:
sst_trn.tst: a training set consisting of 8,544 labeled documents.
sst_dev.tst: a development set consisting of 1,101 labeled documents.
Each line is a document, which is formatted as follows:
Below are the explanations of what each label signifies:
0
: Very negative
1
: Negative
2
: Neutral
3
: Positive
4
: Very positive
Commit and push the vector_space_models.py file to your GitHub repository.
Define a function named sentiment_analyzer_extra()
that gives an improved sentiment analyzer.
The distributional hypothesis suggests that words that occur in similar contexts tend to have similar meanings [1]. Let us examine the following two sentences with blanks:
A: I sat on a __
B: I played with my __
The blank for A can be filled with words such as {bench, chair, sofa, stool}, which convey the meaning "something to sit on" in this context. On the other hand, the blank for B can be filled with words such as {child, dog, friend, toy}, carrying the meaning of "someone/thing to play with." However, these two sets of words are not interchangeable, as it is unlikely that you would sit on a "friend" or play with a "sofa".
This hypothesis provides a potent framework for understanding how meaning is encoded in language and has become a cornerstone of modern computational linguistics and natural language processing.
Assuming that your corpus has only the following three sentences, what context would influence the meaning of the word "chair" according to the distributional hypothesis?
I sat on a chair.
I will chair the meeting.
I am the chair of my department.
Distributional Structure, Zellig S. Harris, Word, 10 (2-3): 146-162, 1954.
Distributional semantics represents the meaning of words based on their distributional properties in large corpora of text. It follows the distributional hypothesis, which states that "words with similar meanings tend to occur in similar contexts".
One-hot encoding represents words as binary vectors such that each word is represented as a vector where all dimensions are zero except for one, which is set to one, indicating the presence of that word.
Consider the following vocabulary:
Given a vocabulary size of 4, each word is represented as a 4-dimensional vector as illustrated below:
One-hot encoding has been largely adopted in traditional NLP models due to its simple and efficient representation of words in sparse vectors.
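A tiny sketch of one-hot encoding over a hypothetical 4-word vocabulary (the book's example vocabulary is not reproduced here):

```python
vocab = ['chair', 'dog', 'king', 'queen']        # hypothetical 4-word vocabulary

def one_hot(word: str, vocab: list[str]) -> list[int]:
    # all dimensions are 0 except the one corresponding to the word
    return [1 if w == word else 0 for w in vocab]

for w in vocab:
    print(w, one_hot(w, vocab))
```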
What are the drawbacks of using one-hot encoding to represent word vectors?
Word embeddings are dense vector representations of words in a continuous vector space. Each word is represented in a high-dimensional space, where the dimensions correspond to different contextual features of the word's meaning.
Consider the embeddings for three words, 'king', 'male', and 'female':
Based on these distributions, we can infer that the four dimensions in this vector space represent royalty, gender, male, and female respectively, such that the embedding for the word 'queen' can be estimated as follows:
The key idea is to capture semantic relationships between words by representing them in a way that similar words have similar vector representations. These embeddings are learned from large amounts of text data, where the model aims to predict or capture the context in which words appear.
In the above examples, each dimension represents a distinct type of meaning. However, in practice, a dimension can encapsulate multiple types of meanings. Furthermore, a single type of meaning can be depicted by a weighted combination of several dimensions, making it challenging to precisely interpret what each dimension implies.
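A minimal sketch of the "queen ≈ king - male + female" arithmetic with made-up 4-dimensional embeddings; the actual values in the book differ:

```python
# hypothetical 4-dimensional embeddings
king   = [0.9, 0.8, 0.9, 0.1]
male   = [0.1, 0.9, 0.9, 0.1]
female = [0.1, 0.9, 0.1, 0.9]

# queen ~ king - male + female, applied dimension by dimension
queen = [k - m + f for k, m, f in zip(king, male, female)]
print(queen)   # high on the royalty- and female-like dimensions, low on the male-like one
```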
Multi-head attention is a crucial component in transformer-based models, such as BERT, GPT, and their variants. It extends the basic self-attention mechanism to capture different types of relationships and dependencies in a sequence. Here's an explanation of multi-head attention:
Motivation:
The primary motivation behind multi-head attention is to enable a model to focus on different parts of the input sequence when capturing dependencies and relationships.
It allows the model to learn multiple sets of attention patterns, each suited to capturing different kinds of associations in the data.
Mechanism:
In multi-head attention, the input sequence (e.g., a sentence or document) is processed by multiple "attention heads."
Each attention head independently computes attention scores and weighted sums for the input sequence, resulting in multiple sets of output values.
These output values from each attention head are then concatenated and linearly transformed to obtain the final multi-head attention output.
Learning Different Dependencies:
Each attention head can learn to attend to different aspects of the input sequence. For instance, one head may focus on syntactic relationships, another on semantic relationships, and a third on longer-range dependencies.
By having multiple heads, the model can learn to capture a variety of dependencies, making it more versatile and robust.
Multi-Head Processing:
In each attention head, there are three main components: queries, keys, and values. These are linearly transformed projections of the input data.
For each head, queries are compared to keys to compute attention weights, which are then used to weight the values.
Each attention head performs these calculations independently, allowing it to learn a unique set of attention patterns.
Concatenation and Linear Transformation:
The output values from each attention head are concatenated into a single tensor.
A linear transformation is applied to this concatenated output to obtain the final multi-head attention result. The linear transformation helps the model combine information from all heads appropriately.
Applications:
Multi-head attention is widely used in NLP tasks, such as text classification, machine translation, and text generation.
It allows models to capture diverse dependencies and relationships within text data, making it highly effective in understanding and generating natural language.
Multi-head attention has proven to be a powerful tool in transformer architectures, enabling models to handle complex and nuanced relationships within sequences effectively. It contributes to the remarkable success of transformer-based models in a wide range of NLP tasks.
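To make the mechanism concrete, here is a minimal NumPy sketch of multi-head attention; the head count, dimensions, and random projection weights are illustrative assumptions rather than the configuration of any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity between queries and keys
    return softmax(scores) @ V        # weighted sum of the values

def multi_head_attention(X, num_heads=4, seed=0):
    """X: (sequence_length, d_model). Returns (sequence_length, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(num_heads):
        # Each head has its own projections for queries, keys, and values.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))
    # Concatenate the heads and apply a final linear transformation.
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(outputs, axis=-1) @ W_o

X = np.random.rand(5, 16)             # a toy "sentence" of 5 tokens with 16-dim embeddings
print(multi_head_attention(X).shape)  # (5, 16)
```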
Self-attention, also known as scaled dot-product attention, is a fundamental mechanism used in deep learning and natural language processing, particularly in transformer-based models like BERT, GPT, and their variants. Self-attention is a crucial component that enables these models to understand relationships and dependencies between words or tokens in a sequence.
Here's an overview of self-attention:
The Motivation:
The primary motivation behind self-attention is to capture dependencies and relationships between different elements within a sequence, such as words in a sentence or tokens in a document.
It allows the model to consider the context of each element based on its relationships with other elements in the sequence.
The Mechanism:
Self-attention computes a weighted sum of the input elements (usually vectors) for each element in the sequence. This means that each element can attend to and be influenced by all other elements.
The key idea is to learn weights (attention scores) that reflect how much focus each element should give to the others. These weights are often referred to as "attention weights."
Attention Weights:
Attention weights are calculated using a similarity measure (typically the dot product) between a query vector and a set of key vectors.
The resulting attention weights are then used to take a weighted sum of the value vectors. This weighted sum forms the output for each element.
Scaling and Softmax:
To stabilize the gradients during training, the dot products are often scaled by the square root of the dimension of the key vectors.
After scaling, a softmax function is applied to obtain the attention weights. The softmax ensures that the weights are normalized and sum to 1.
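Putting the scaling and softmax together, the standard scaled dot-product attention can be written as follows, where $Q$, $K$, and $V$ denote the query, key, and value matrices and $d_k$ is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$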
Multi-Head Attention:
Many models use multi-head attention, where multiple sets of queries, keys, and values are learned. Each set of attention weights captures different aspects of relationships in the sequence.
These multiple sets of attention results are concatenated and linearly transformed to obtain the final output.
Applications:
Self-attention is widely used in transformer-based models for various NLP tasks, including machine translation, text classification, text generation, and more.
It is also applied in computer vision tasks, such as image captioning, where it can capture relationships between different parts of an image.
Self-attention is a powerful mechanism because it allows the model to focus on different elements of the input sequence depending on the context. This enables the model to capture long-range dependencies, word relationships, and nuances in natural language, making it a crucial innovation in deep learning for NLP and related fields.
The frequency of a term $t$'s occurrences in a document $d$ is called the Term Frequency, denoted $\mathbf{TF}(t,d)$. TF is often used to determine the importance of a term within a document such that terms that appear more frequently are considered more relevant to the document's content.
However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:
L3: Define a tokenizer function using a lambda expression.
We then retrieve documents in chronicles_of_narnia.txt and create a vocabulary dictionary:
Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document using the bag of words model:
L6: The underscore (_) is used to indicate that the variable is not being used in the loop.
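For reference, a minimal sketch of such a TF printer is shown below; the function and variable names are illustrative and may differ from those in the course package:

```python
from collections import Counter

def print_term_frequencies(vocab: dict, tokenizer, documents: list):
    """Print the term frequency of every known term in each document (bag-of-words)."""
    for document in documents:
        # Count occurrences of each token that exists in the vocabulary.
        counts = Counter(t for t in tokenizer(document) if t in vocab)
        bow = {vocab[term]: tf for term, tf in counts.items()}  # term ID -> frequency
        print(sorted(bow.items()))

# Example usage with a whitespace tokenizer and a tiny vocabulary:
tokenizer = lambda s: s.split()
vocab = {w: i for i, w in enumerate(['the', 'lion', 'roared', '.'])}
print_term_frequencies(vocab, tokenizer, ['the lion roared .', 'the lion slept .'])
```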
Finally, let us print the TFs of all terms in the following three documents:
In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.
If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?
One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.
Let us retrieve a set of commonly used stop words from stopwords.txt and define a function to determine if a term should be considered a stop word:
L1: string.punctuation.
Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:
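A sketch of how the stop-word check and the filtering tokenizer might fit together, assuming stopwords.txt contains one stop word per line (helper names are illustrative):

```python
import string

# Load stop words; stopwords.txt is assumed to contain one stop word per line.
with open('stopwords.txt') as fin:
    stopwords = {line.strip().lower() for line in fin if line.strip()}

def is_stopword(term: str) -> bool:
    """Treat a term as a stop word if it is in the list or is a punctuation mark."""
    return term.lower() in stopwords or term in string.punctuation

# A tokenizer that drops stop words during tokenization.
tokenizer = lambda s: [t for t in s.split() if not is_stopword(t)]
```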
Finally, let us print the TFs of the same documents using the updated vocabulary:
Note that stop words can be filtered either during the creation of the vocabulary dictionary or when generating the bag-of-words representations. Do both approaches produce the same results? Which approach is preferable?
Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.
Document Frequency (DF) quantifies how often a term appears across a set of documents: it is the number of documents within the corpus that contain a particular term.
Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:
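A possible sketch of this function, assuming the corpus is a list of raw documents and a tokenizer is also available (names are illustrative):

```python
from collections import Counter

def document_frequencies(vocab: dict, documents: list, tokenizer) -> dict:
    """Return a dictionary mapping term IDs to the number of documents containing the term."""
    dfs = Counter()
    for document in documents:
        # A term is counted at most once per document, hence the set().
        for term in set(tokenizer(document)):
            if term in vocab:
                dfs[vocab[term]] += 1
    return dict(dfs)
```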
We then compare the term and document frequencies of all terms in the above documents:
Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, it is possible to estimate more semantically important term scores through appropriate weighting between these two types of frequencies.
What are the implications when a term has a high document frequency?
Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).
Given a term $t$ in a document $d$, where $D$ is the set of all documents in a corpus, its TF-IDF score can be measured as follows:
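One common formulation consistent with the description below (the notation here is assumed) is:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \frac{\mathbf{TF}(t,d)}{|d|} \cdot \log\frac{|D|}{\mathbf{DF}(t,D)}$$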
In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.
Let us define a function that takes a vocabulary dictionary, a DF dictionary, the size of all documents, and a document, and returns the TF-IDF scores of all terms in the document:
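A sketch of such a function under the formulation above; the names are illustrative and a tokenizer is assumed to be available:

```python
import math
from collections import Counter

def tf_idf(vocab: dict, dfs: dict, num_documents: int, document: str, tokenizer) -> dict:
    """Return a sparse vector (term ID -> TF-IDF score) for the document."""
    tokens = [t for t in tokenizer(document) if t in vocab]
    if not tokens:
        return {}
    scores = {}
    for term, tf in Counter(tokens).items():
        tid = vocab[term]
        idf = math.log(num_documents / dfs[tid])   # rarer terms receive higher IDF
        scores[tid] = (tf / len(tokens)) * idf     # normalized TF times IDF
    return scores
```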
We then compute the TF-IDF scores of terms in the above documents:
Should we still apply stop words when using TF-IDF scores to represent the documents?
Several variants of TF-IDF have been proposed to enhance the representation in certain contexts:
Sublinear scaling on TF: $$\mathbf{TF}_{\mathrm{sublinear}}(t,d) = \left\{ \begin{array}{cl} 1 + \log\mathbf{TF}(t,d) & \mbox{if $\mathbf{TF}(t,d) > 0$}\\ 0 & \mbox{otherwise} \end{array} \right.$$
Normalized TF:
Normalized IDF:
Probabilistic IDF:
Source: term_weighting.py
Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the chronicles_of_narnia.txt corpus:
Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the Euclidean distance and Cosine similarity.
Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space such that it represents the magnitude of the differences between the two vectors.
Consider two vectors representing two documents. The Euclidean distance between them can be measured as follows:
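For two document vectors $\mathbf{u}$ and $\mathbf{v}$ of dimension $n$ (symbols assumed here), the standard definition is:

$$d(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$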
Let us define a function that takes two vectors in our SparseVector notation and returns the Euclidean distance between them:
L6: ** k raises the value to the power of k.
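A sketch of the computation, assuming a sparse vector is represented as a dictionary mapping term IDs to values (the course's SparseVector type may differ):

```python
import math

def euclidean_distance(v1: dict, v2: dict) -> float:
    """Euclidean distance between two sparse vectors (term ID -> value)."""
    total = 0.0
    for key in v1.keys() | v2.keys():
        # Missing dimensions are implicitly zero in a sparse representation.
        total += (v1.get(key, 0.0) - v2.get(key, 0.0)) ** 2
    return math.sqrt(total)
```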
We then measure the Euclidean distance between the two vectors above:
The Euclidean distance between two identical vectors is 0 (L1). Interestingly, the distance between D1 and D2 is shorter than the distance between D1 and D3, implying that D1 is more similar to D2 than to D3, which contradicts our intuition.
Cosine similarity is a measure of similarity between two vectors in an inner product space such that it calculates the cosine of the angle between two vectors, where a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular to each other).
The cosine similarity between two vectors can be measured as follows:
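Using the same two vectors $\mathbf{u}$ and $\mathbf{v}$ as above, the standard definition is:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert \, \lVert\mathbf{v}\rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$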
Let us define a function that takes two sparse vectors and returns the cosine similarity between them:
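A sketch under the same sparse-dictionary assumption as the Euclidean distance above:

```python
import math

def cosine_similarity(v1: dict, v2: dict) -> float:
    """Cosine similarity between two sparse vectors (term ID -> value)."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(v * v for v in v1.values()))
    norm2 = math.sqrt(sum(v * v for v in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```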
We then measure the cosine similarity between the two vectors above:
The cosine similarity between two identical vectors is 1, although it is calculated as 0.99 here due to floating-point precision (L1). Similar to the Euclidean distance case, the similarity between D1 and D2 is greater than the similarity between D1 and D3, which again contradicts our intuition.
Why do these metrics determine that D1 is more similar to D2 than to D3?
The following diagram illustrates the difference between the two metrics. The Euclidean distance measures the magnitude of the difference between two vectors, while the cosine similarity measures the angle between them with respect to the origin.
Create a file in the directory.
Your task is to read word embeddings trained by :
Define a function called read_word_embeddings()
that takes a path to the file consisting of word embeddings, .
Return a dictionary where the key is a word and the value is its corresponding embedding in .
Each line in the file adheres to the following format:
Your task is to retrieve a list of the most similar words to a given target word:
Define a function called similar_words()
that takes the word embeddings from Task 1, a target word (string), and a threshold (float).
Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must only include words with similarity scores greater than or equal to the threshold, sorted in descending order based on the similarity scores.
Your task is to measure a similarity score between two documents:
Define a function called document_similarity()
that takes the word embeddings and two documents (string). Assume that the documents are already tokenized.
For each document, generate a document embedding by averaging the embeddings of all words within the document.
Return the cosine similarity between the two document embeddings.
Commit and push the distributional_semantics.py file to your GitHub repository.
Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.
LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.
Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:
We then define a function that takes the document list and the vocabulary dictionary, and returns the document-term matrix, where each cell indicates the frequency of the corresponding term within the corresponding document:
With this current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:
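A sketch of what the NumPy-based version might look like, assuming the same tokenizer and vocabulary conventions as before (the course's document_term_matrix_np() may differ in details):

```python
import numpy as np

def document_term_matrix_np(documents: list, vocab: dict, tokenizer) -> np.ndarray:
    """Rows are documents, columns are terms; cell (i, j) is the frequency of term j in document i."""
    matrix = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, document in enumerate(documents):
        for term in tokenizer(document):
            j = vocab.get(term)
            if j is not None:
                matrix[i, j] += 1   # update counts in place instead of building Python lists
    return matrix
```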
Using this updated function, the document-term matrix is created in about 0.5 seconds, a noticeable improvement in speed:
Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:
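As a reference, here is how SVD can be applied with NumPy; the toy matrix below is illustrative rather than the eight-document corpus used in the course:

```python
import numpy as np

# A toy 4-document x 5-term matrix; the values are illustrative term frequencies.
X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, S.shape, Vt.shape)            # (4, 4) (4,) (4, 5)
print(np.allclose(X, U @ np.diag(S) @ Vt))   # True: the decomposition reconstructs X
```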
What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?
From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.
By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?
From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.
Document classification, also known as text classification, is a task that involves assigning predefined categories or labels to documents based on their content, used to automatically organize, categorize, or label large collections of textual documents.
Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.
Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.
It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice involves splitting the dataset into ratios such as 80/10/10 or 75/10/15 for the training, development, and evaluation sets, respectively.
The directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the corpus file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.
Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:
We then print the number of documents in each set:
To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:
Why do we use only the training set to collect the vocabulary?
Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:
We then vectorize all documents in each set:
Finally, we test our classification model on the development set:
What are the potential weaknesses or limitations of this classification model?
Let $\mathbf{x}$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y$ be its corresponding output label. Logistic regression uses the logistic function, aka the sigmoid function, to estimate the probability that $\mathbf{x}$ belongs to $y$:
The weight vector $\mathbf{w}$ assigns a weight to each dimension of the input vector for the label such that a higher magnitude of $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label within the training distribution.
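In this notation (symbols assumed here), the model can be written as:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$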
What role does the sigmoid function play in the logistic regression model?
Consider a corpus consisting of two sentences:
D1: I love this movie
D2: I hate this movie
The input vectors for these two sentences can be created using the bag-of-words model:
Let the output labels of D1 and D2 represent positive and negative sentiments of the input sentences, respectively. Then, a weight vector can be trained using logistic regression:
What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
Consider a corpus consisting of three sentences:
D1: I love this movie
D2: I hate this movie
D3: I watched this movie
What are the limitations of the softmax regression model?
Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?
What are the limitations of a multilayer perceptron?
Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.
Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.
Consider a sequence of words, $w_1, \ldots, w_n$. We can predict a word $w_i$ by leveraging its contextual words, using an approach similar to the n-gram language models discussed previously (with $V$ being a vocabulary list comprising all unique words in the corpus):
This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW) based on a feedforward neural network. Let $\mathbf{x} \in \{0, 1\}^{|V|}$ be an input vector created from a set of context words $C(w_i)$, such that only the dimensions of $\mathbf{x}$ representing words in $C(w_i)$ have a value of 1; otherwise, they are set to 0.
Let $\mathbf{y} \in \{0, 1\}^{|V|}$ be an output vector, where all dimensions have the value of 0 except for the one representing $w_i$, which is set to 1.
Let $\mathbf{h}$ be a hidden layer between $\mathbf{x}$ and $\mathbf{y}$, and let $W^{(1)}$ be the weight matrix between $\mathbf{x}$ and $\mathbf{h}$, where the sigmoid function is used as the activation function:
Finally, let $W^{(2)}$ be the weight matrix between $\mathbf{h}$ and $\mathbf{y}$:
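Under this notation, and assuming $W^{(1)}, W^{(2)} \in \mathbb{R}^{|V| \times d}$ for a hidden size $d$ (shapes assumed here), the forward computation of CBOW can be summarized as:

$$\mathbf{h} = \sigma\big(W^{(1)\top}\mathbf{x}\big), \qquad \mathbf{y} = \mathrm{softmax}\big(W^{(2)}\mathbf{h}\big)$$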
What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?
What are the advantages of CBOW models compared to Skip-gram models, and vice versa?
What limitations does the Word2Vec model have, and how can these limitations be addressed?
Byte Pair Encoding (BPE) is a data compression algorithm that is commonly used for subword tokenization in modern language models. BPE splits text into smaller units, such as subword pieces or characters, to handle out-of-vocabulary words, reduce vocabulary size, and enhance the efficiency of language models.
The following describes the steps of BPE in terms of the Expectation-Maximization (EM) framework:
Initialization: Given a dictionary consisting of all words and their counts in a corpus, the symbol vocabulary is initialized by tokenizing each word into its most basic subword units, such as characters.
Expectation: With the (updated) symbol vocabulary, it calculates the frequency of every symbol pair within the vocabulary.
Maximization: Given all symbol pairs and their frequencies, it merges the top-k most frequent symbol pairs in the vocabulary.
Steps 2 and 3 are repeated until meaningful sets of subwords are found for all words in the corpus.
The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?
Let us consider a toy vocabulary:
First, we create the symbol vocabulary by inserting a space between every pair of adjacent characters and adding a special symbol [EoW] at the end to indicate the End of the Word:
Next, we count the frequencies of all symbol pairs in the vocabulary:
Finally, we update the vocabulary by merging the most frequent symbol pair across all words:
The expect() and maximize() functions can be repeated for multiple iterations until the tokenization becomes reasonable:
When you uncomment L7 in bpe_vocab(), you can see how the symbols are merged in each iteration:
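For reference, a compact sketch of the expectation and maximization steps described above; the function names mirror the description but are not necessarily those of the course package:

```python
from collections import Counter

def expect(vocab: dict) -> Counter:
    """Count the frequency of every adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs

def maximize(vocab: dict, pair: tuple) -> dict:
    """Merge the given symbol pair into a single symbol in every word of the vocabulary."""
    new_vocab = {}
    for word, count in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[' '.join(merged)] = count
    return new_vocab

# Toy vocabulary: words are pre-split into characters plus an end-of-word symbol.
vocab = {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w e s t [EoW]': 6}
for _ in range(5):
    pairs = expect(vocab)
    best = max(pairs, key=pairs.get)   # merge only the single most frequent pair (k = 1)
    vocab = maximize(vocab, best)
print(vocab)
```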
Contextual representations are representations of words, phrases, or sentences within the context of the surrounding text. Unlike static word embeddings such as those from Word2Vec, where each word is represented by a fixed vector regardless of its context, contextual representations capture the meaning of a word or sequence of words based on their context in a particular document. As a result, the representation of a word can vary depending on the words surrounding it, allowing for a more nuanced understanding of meaning in natural language processing tasks.
Attention Is All You Need, Vaswani et al., Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.
L1: The package.
L3:
Let us create a document-term matrix from the corpus, :
L1: .
L3:
The $i$'th row of the document-term matrix is considered the document vector of the $i$'th document in the corpus, while the transpose of its $j$'th column is considered the term vector of the $j$'th term in the vocabulary.
LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix $X$ into three matrices, $U$, $S$, and $V^{T}$, where $U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix containing singular values, such that $X = U S V^{T}$.
An orthogonal matrix is a square matrix whose rows and columns are orthogonal unit vectors such that $Q^{T} Q = Q Q^{T} = I$, where $I$ is the identity matrix.
Singular values are non-negative values listed in decreasing order that represent the importance of each topic.
L3:
L4:
This results in $U$, $S$, and $V^{T}$ such that:
In $U$, each row represents a document and each column represents a topic.
In $S$, each diagonal cell represents the weight of the corresponding topic.
In $V^{T}$, each column represents a term and each row represents a topic.
The last two singular values in are actually non-negative values, and , respectively.
The first four singular values in $S$ appear to be sufficiently larger than the others; thus, let us reduce the dimensionality to four and truncate $U$, $S$, and $V^{T}$ accordingly:
Given the LSA results, an embedding of the $i$'th document can be obtained as follows:
Finally, an embedding of the $j$'th term can be obtained as follows:
$S$ is not transposed in L3 of the above code. Should we use S.transpose() instead?
Source:
, Wikipedia.
, Wikipedia.
L7: the module, the module.
Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and $k$, and returns the predicted book ID of the document and its similarity score:
L2: Measure the similarity between the input document and every document in the training set, and save it with the book ID of that training document.
L3-4: Return the most common book ID among the top-$k$ documents in the training set that are most similar to the input document.
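A sketch of such a k-nearest-neighbors classifier, assuming the training set is a list of (book ID, sparse vector) tuples and reusing the cosine similarity function defined earlier:

```python
from collections import Counter

def knn_predict(training_vectors: list, document_vector: dict, k: int = 5):
    """Return the most common book ID among the k most similar training documents."""
    # Similarity between the input document and every training document, paired with its book ID.
    scored = [(cosine_similarity(document_vector, vector), book_id)
              for book_id, vector in training_vectors]
    top_k = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    book_id, _ = Counter(book_id for _, book_id in top_k).most_common(1)[0]
    return book_id, top_k[0][0]   # predicted book ID and the highest similarity score
```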
Source:
, Wikipedia.
Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" contributes positively to the positive label, the weight for "hate" has a negative impact on it. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias is also set to 0.
Given the weight vector and the bias, we can compute the scores of the two sentences, resulting in the following probabilities:
As the probability of the positive label exceeds 0.5 (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of the positive label is below 50%.
Under what circumstances would the bias be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?
Softmax regression, aka multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector $\mathbf{x}$ and its output label $y$, the model uses the softmax function to estimate the probability that $\mathbf{x}$ belongs to each class:
The weight vector $\mathbf{w}_c$ assigns weights to $\mathbf{x}$ for the label $c$, while $b_c$ represents the bias associated with that label.
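With $K$ classes, the probability of class $c$ can be written as follows (symbols assumed here):

$$P(y = c \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c \cdot \mathbf{x} + b_c)}{\sum_{k=1}^{K} \exp(\mathbf{w}_k \cdot \mathbf{x} + b_k)}$$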
Then, the input vectors for the three sentences can be created using the bag-of-words model:
Let the output labels of D1, D2, and D3 represent positive, negative, and neutral sentiments of the input sentences, respectively. Then, a weight vector for each label can be trained using softmax regression as follows:
Unlike the case of logistic regression, where all weights are oriented toward a single label (with "love" and "hate" receiving positive and negative weights, respectively, for that label), the values in each weight vector are oriented toward its corresponding label.
Given the weight vectors and the biases, we can estimate the following probabilities for D1:
Since the probability of the positive label is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For D3, the following probabilities can be estimated:
Since the probability of the neutral label is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
Softmax regression predicts a probability for every class, so its prediction is represented by an output vector $\mathbf{y}$, wherein the $i$'th value of $\mathbf{y}$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $W$, where the $i$'th row represents the weight vector for the $i$'th label.
With this new formulation, softmax regression can be defined as $\mathbf{y} = \mathrm{softmax}(W\mathbf{x} + \mathbf{b})$, and the optimal prediction can be achieved as $\arg\max_i y_i$, which returns the label(s) with the highest probability.
A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $\mathbf{x}$ and an output vector $\mathbf{y}$, the model allows zero to many hidden layers to generate intermediate representations of the input.
Let $\mathbf{h}$ be a hidden layer between $\mathbf{x}$ and $\mathbf{y}$. To connect $\mathbf{x}$ and $\mathbf{h}$, we need a weight matrix $W^{(1)}$ such that $\mathbf{h} = f(W^{(1)}\mathbf{x})$, where $f$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, implying whether or not the neuron's output should be passed on to the next layer.
Similarly, to connect $\mathbf{h}$ and $\mathbf{y}$, we need a weight matrix $W^{(2)}$ such that $\mathbf{y} = g(W^{(2)}\mathbf{h})$, where $g$ is the output activation. Thus, a multilayer perceptron with one hidden layer can be represented as:
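Combining the two layers (symbols as above):

$$\mathbf{y} = g\big(W^{(2)} \, f(W^{(1)} \mathbf{x})\big)$$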
Consider a corpus comprising the following five sentences and their corresponding labels:
D1: I love this movie → positive
D2: I hate this movie → negative
D3: I watched this movie → neutral
D4: I truly love this movie → very positive
D5: I truly hate this movie → very negative
The input vectors can be created using the bag-of-words model:
The first weight matrix can be trained by an MLP as follows:
Given the values in $W^{(1)}$, we can infer that the first, second, and third columns represent "love", "hate", and "watch", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.
Each input vector is multiplied by $W^{(1)}$ to obtain the corresponding hidden layer, where the activation function is designed as follows:
The second weight matrix can also be trained by an MLP as follows:
By applying the softmax function to each hidden representation, we obtain the corresponding output vector:
The prediction can be made by taking the argmax of each output vector.
, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.
Thus, each dimension in $\mathbf{y}$ represents the probability of the corresponding word being the target word $w_i$ given the set of context words $C(w_i)$.
In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $C(w_i)$ given $w_i$. Formally, the objective of a Skip-gram model is as follows:
Let $\mathbf{x}$ be an input vector, where only the dimension representing $w_i$ is set to 1; all the other dimensions have the value of 0 (thus, $\mathbf{x}$ in Skip-gram is the same as $\mathbf{y}$ in CBOW). Let $\mathbf{y}$ be an output vector, where only the dimension representing a context word in $C(w_i)$ is set to 1; all the other dimensions have the value of 0. All the other components, such as the hidden layer $\mathbf{h}$ and the weight matrices $W^{(1)}$ and $W^{(2)}$, stay the same as the ones in CBOW.
What does each dimension in the hidden layer represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $C(w_i)$, deemed valuable for predicting the target word $w_i$. Formally, each dimension is computed as the sigmoid activation of the weighted sum between the input vector $\mathbf{x}$ and the corresponding column of $W^{(1)}$ such that:
Then, what does each row vector of $W^{(1)}$ represent? Its $j$'th dimension denotes the weight of the $j$'th feature in $\mathbf{h}$ with respect to the corresponding word in the vocabulary. In other words, it indicates the importance of that feature in representing the word. Thus, the $i$'th row of $W^{(1)}$ can serve as an embedding for the $i$'th word in $V$.
What about the other weight matrix $W^{(2)}$? Its $j$'th column vector denotes the weights of the $j$'th feature in $\mathbf{h}$ for all words in the vocabulary. Thus, the $k$'th dimension of that column indicates the importance of the $j$'th feature for the $k$'th word being predicted as the target word $w_i$.
On the other hand, the $k$'th row vector of $W^{(2)}$ denotes the weights of all features for the $k$'th word in the vocabulary, enabling it to be utilized as an embedding for that word. However, in practice, only the row vectors of the first weight matrix $W^{(1)}$ are employed as word embeddings, because the weights in $W^{(2)}$ are often optimized for the downstream task, in this case predicting $w_i$, whereas the weights in $W^{(1)}$ are optimized for finding representations that are generalizable across various tasks.
What are the implications of the weight matrices $W^{(1)}$ and $W^{(2)}$ in the Skip-gram model?
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.
GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
What are the disadvantages of using BPE-based tokenization instead of ? What are the potential issues with the implementation of BPE above?
Source code:
Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016.
A New Algorithm for Data Compression, Gage, The C Users Journal, 1994.
SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (WordPiece), Wu et al., arXiv, 2016.
Attention in Natural Language Processing (NLP) refers to a mechanism or technique that allows models to focus on specific parts of the input data while making predictions or generating output. It's a crucial component in many modern NLP models, especially in sequence-to-sequence tasks and transformer architectures. Attention mechanisms help the model assign different weights to different elements of the input sequence, allowing it to pay more attention to relevant information and ignore irrelevant or less important parts.
Neural Machine Translation by Jointly Learning to Align and Translate
Key points about attention in NLP:
Contextual Focus: Attention enables the model to focus on the most relevant parts of the input sequence for each step of the output sequence. It creates a dynamic and contextually adaptive way of processing data.
Weighted Information: Each element in the input sequence is associated with a weight or attention score, which determines its importance when generating the output. Elements with higher attention scores have a stronger influence on the model's predictions.
Self-Attention: Self-attention mechanisms allow a model to consider all elements of the input sequence when making predictions, and it learns to assign different attention weights to each element based on its relevance.
Multi-Head Attention: Many NLP models use multi-head attention, which allows the model to focus on different aspects of the input simultaneously. This can improve the capture of various patterns and dependencies.
Transformer Architecture: Attention mechanisms are a fundamental component of the transformer architecture, which has been highly influential in NLP. Transformers use self-attention to process sequences, enabling them to capture long-range dependencies and context.
Applications of attention mechanisms in NLP include:
Machine Translation: Attention helps the model align words in the source language with words in the target language.
Text Summarization: Attention identifies which parts of the source text are most important for generating a concise summary.
Question Answering: It helps the model find the most relevant parts of a passage to answer a question.
Named Entity Recognition: Attention can be used to focus on specific words or subwords to identify named entities.
Language Modeling: In tasks like text generation, attention helps the model decide which words or tokens to generate next based on the context.
Attention mechanisms have revolutionized the field of NLP by allowing models to handle complex and long sequences effectively, making them suitable for a wide range of natural language understanding and generation tasks.
Pointer Networks are a type of neural network architecture designed to handle tasks that involve selecting elements from an input sequence and generating output sequences that reference those elements. These networks are particularly useful when the output sequence is conditioned on the content of the input sequence, and the order or content of the output sequence can vary dynamically based on the input.
Key features of Pointer Networks:
Input Sequence Reference: In tasks that involve sequences, Pointer Networks learn to refer to specific elements from the input sequence. This is particularly valuable in problems like content selection or summarization, where elements from the input sequence are selectively copied to the output sequence.
Variable-Length Output: Pointer Networks are flexible in generating output sequences of variable lengths, as the length and content of the output sequence can depend on the input. This is in contrast to fixed-length output sequences common in many other sequence-to-sequence tasks.
Attention Mechanism: Attention mechanisms are a fundamental part of Pointer Networks. They allow the model to assign different weights or probabilities to elements in the input sequence, indicating which elements should be referenced in the output.
Applications of Pointer Networks include:
Content Selection: Selecting and copying specific elements from the input sequence to generate the output. This is useful in tasks like text summarization, where relevant sentences or phrases from the input are selectively included in the summary.
Entity Recognition: Identifying and referencing named entities from the input text in the output sequence. This is valuable for named entity recognition tasks in information extraction.
Geographic Location Prediction: Predicting geographic locations mentioned in text and generating a sequence of location references.
Pointer Networks have proven to be effective in tasks that involve content selection and variable-length output generation, addressing challenges that traditional sequence-to-sequence models with fixed vocabularies may encounter. They provide a way to dynamically handle content in natural language processing tasks where the input-output relationship can be complex and context-dependent.
Your team will showcase the final project to the class utilizing your laptop. Approximately 6-7 groups of people will rotate to your team's station, where you will have a designated 10-minute slot to present your project to each group. Allocate around 5 minutes for your prepared demonstration, and the remaining time for the audience to actively engage with and test your system.
Completeness: Evaluate whether the project is finalized in an end-to-end manner.
Functionality: Assess the functionality of individual features.
Engagement: Determine the level of engagement achieved during the prepared demonstration.
Robustness: Evaluate the system's robustness in handling various inputs.
Usefulness: Consider the overall usefulness of the system.
Create a 5-10 minute video demonstrating your system and upload it to Canvas.
Spring 2024
My vision is a model which could reference the major (at least American, but potentially other English) style guides (the AP Stylebook, Chicago, MLA, etc.) and, given a sentence with a point of ambiguous grammar/style, could give the solution according to the different major style guides. I'm not married to this idea per se, but I like the idea of working with style guides in some way.
I'd like to try and build a sentiment analysis tool, capable of classifying emotions into positive, neutral, or negative (or perhaps give a numerical rating). If possible, the model can be expanded to include other emotions like happiness, anger, disappointment, etc.
Looking at words used on social media and comparing sentiments across different social platforms.
For the team project, I want to use NLP skills to catch keywords in some math papers. From my own experience, math research papers are hard to understand due to their nouns and abstract ideas. If we can get some key things out from the math paper, this will be helpful.
N/A
For the team project at the end of the semester, I would like to build a Sentiment Analysis model for Cryptocurrency Trading. In the world of Crypto, many of the price movements are caused by sudden changes in the sentiment of investors. This often can be found on social media sites like Reddit and Twitter, where users often post their feelings about the currencies. A few years ago, a subreddit called WallStreetBets blew up for causing an insane price increase in Gamestop stock as well as some others. From this, I discovered the power of using Sentiment Analysis modeling on various social media sites to attempt to predict price movements of online currencies and stocks. For this project, I would like to analyze user posts on these social media sites to calculate the overall sentiment for specific Cryptocurrencies. I will then use this data to predict the incoming price changes that may occur.
My project idea involves training a large language model on countless recipes, found online, which must be preprocessed accordingly to be usable. Once the chatbot is trained, a simple web platform would be used for accessibility and testing. This platform would be comprised of chats, and an area to enter text. The goal is to make a chatbot capable of creating usable recipes based on user inputs, dietary restrictions, and ingredient restrictions. I thought of this project idea with Dylan Parker.
Concept and Idea: Spam Detection Bot for Email Systems The potential concept for our team project is to develop an advanced Spam Detection Bot that is specifically designed for email systems.
Intelligent Linguistic Analysis: The bot will utilize natural language processing (NLP) techniques to analyze the content of emails. It will focus on identifying linguistic patterns, keywords, and phrase structures commonly associated with spam.
Machine Learning Integration: By employing machine learning algorithms, the bot will be able to learn and adapt to evolving spam tactics. This continuous learning approach ensures that the bot remains effective even as spammers change their strategies.
Feedback Mechanism: The bot allows the users to manually mark certain email as spam or not, which allows the bot to update itself to increase the accuracy of detection and reduce the possibility of false detection.
Real-time Processing and Efficiency: The bot is designed to process emails in real-time, ensuring that users' inboxes are promptly cleared of spam.
Security and Privacy Focus: In addition to spam detection, the bot will prioritize user security and privacy, ensuring that the user’s privacy is always the top priority.
I would like to make a model that can read influential people's tweets in the financial world to get some sort of sentiment, and determine if someone should invest in that stock or not.
ex. If Warren Buffett were to tweet out "Apple is a terrible company", the model would be able to detect the sentiment of Apple as negative, and therefore not invest in the stock.
I want to create a language identifier that can output the name of a language based on a text of a certain language as the input. To make this work I believe that we would first identify the types of characters used and narrow down the languages based on that. Secondly, we would tokenize the text into words and compare it to the dictionaries of each language we are still considering. Lastly, we would look at how many matches there are between the text and the dictionaries and output the most similar one. Maybe if it's below a certain threshold, the output could say that there's not enough data to suggest that the language is one we have access to.
For the project, I'm thinking about an AI diary app using GPT. This app will let students write about their day, and the AI will offer encouraging words and advice. It could also detect if the student is stressed and help them as a friend. The goal is to create a comforting space for students to reflect and relax.
Project Idea I would like to study the difference between the usage of “(disability-adj) person” and “person with (disability)” in the context of academic papers. For example, there is lots of discourse on the difference between saying autistic person or person with autism, and from my own experience on reading research papers about autism, both are used frequently. The project could involve gathering data about its usage frequency in scientific papers between academic fields (medical, psychology, sociology, anthropology, etc.), within the same papers (do research papers always use the same term, or do some use both?), by date (is there a difference in pre- and post- 2015, or some other relevant date?), or some other metric. Further work could be done with sentiment analysis technology to see if the papers use language that is favorable / amiable towards disability or disparaging towards disability, and could be correlated with their chosen phrase to describe disability (disabled person vs person with disability) to see if there is a significant difference.
I just switched into this class and haven't attended any lectures yet so I don't have a good concept of what a good project is, but an idea is to analyze tik toks to gauge that population's opinions on the 2024 election.
I have a plan to do a text analysis on the answers in Quora to distinguish answers from experts and other participants.
One idea for a project which I might like to pursue would be an AI coach for video games. The idea would be to have the coach look at in game performance, potentially in real time, and give coaching about how to improve ala "pay more attention to objectives".
Something that may be challenging to do as a team project is to look at computational linguistics based on languages other than English. An alternative that would fall within the scope of the class while also looking at nonstandard English would be analyzing various English dialects, with projects such as dialect identification, translation between dialects, or even a chatbot that understands multiple English dialects.
For the group project, I am interested in sentiment analysis, specifically concerning evaluations of business products, or social media posts. Additionally, I am also inclined towards topics associated with the detection of spam emails or messages.
I am fairly open to the type of team project I would like to pursue this semester. Things such as sentiment analysis, story generation, and chatbots would all be interesting to me. I am not sure about the feasibility, but the project I would be most interested in working on would be a text to speech tool. I would like to learn how speech generation works on such a level, and it is a tool I could imagine using once created. I know that this type of tool already exists, so I would ideally like to work on a less prevalent aspect such as adding emotionally expressive speech so that it doesn't sound as flat. I am also interested in the ways that characters' or celebrities' voices can be emulated, but I am not sure if this would have any legal hurdles.
N/A
I have thought of working on a sentiment analysis model which would classify customer reviews we see on e-commerce platforms (whether they are positive, negative, or neutral). However, I also want to listen to other ideas so I am also open to new ideas.
Project Idea: Using AI for recipe recommendation, modification, and generation. Opening the fridge often reveals leftover ingredients from previous meals, which can make deciding what to cook a challenge. Typically, we end up spending considerable time looking for recipes that match our tastes and the ingredients we currently have. It’s common to find that we’re missing a few items for the recipes we’re interested in, which can be quite inconvenient. However, by incorporating artificial intelligence, we could have a tool that acts like a consultative chef. This AI system would allow us to tailor our recipe searches more precisely, adjusting recipes to fit our personal requirements, like creating a low-calorie version of a certain dish or suggesting substitutions for missing ingredients.
I am very unsure of what kind of project I would like to pursue. I think combining linguistics with visual data could be interesting though. Maybe some kind of coding that sorts or filters certain types of words or phrases, and then those are represented in some visually pleasing way. I think some kind of code that analyzes literature at a highly specific linguistic level could be very interesting. I am extremely flexible and would be willing to work on almost any project.
A possible project could be to use NLP techniques on unconventional data. Personally, I am interested in cross-cultural linguistics / multilingualism such as heritage language usage in diasporic settings. Some possible datasets could be English loan word usage in a non-English speaking countries (ie. in everyday life, music lyrics, social media, websites). For example, in South Korea and many other countries, popular music integrates English into various text such as music lyrics, advertisements and marketing, and everyday speech (based on age group). Another possible topic could be topics such as correcting gendered language to non-gendered language. Lastly, academic related topics: I think a fun project could be something like predicting final course grade based on a student writing sample, predicting / generating potential test questions based on a text, or predict / generate weak areas of students based on their code sample.
A text summarization tool for simplifying complex readings for classes.
I have a group that I believe we will be working together, yet we have not yet decided on a topic for a project. An idea I have been thinking about is incorporating a character into game that will take the language input that the player puts in. After analyzing the type of writing that the players uses, the character will respond in the same way the player wrote. For example, if someone is using Shakespearean language, then the character will respond back the same way.
Project Idea: Explain Attention Is All You Need to children: Design a system that explains and summarizes academic papers in a more comprehensive way, especially for those who do not have much background knowledge. It can reliably lay out the fundamental information from abstracts, introductions, methods, and findings from any academic paper without missing vital information, allowing readers to process the main ideas efficiently.
Though I still don't know much about NLP, for my team project, I think it would be interesting to try to work on a language model that is trained and works solely on inputs with perfect grammar in an attempt to see the effect of input "sanitization" on performance and model size.
I'd like to apply NLP algorithms and some machine learning algorithms on a public available dataset to perform supervised classification task. For example, applying MLP on product review to distinguish helpful reviews against unhelpful ones. I would like to further compare and evaluate the performance of some large language models, such as Bert and GPT4 API.
My idea is to make a "PolyGlotBot", a multilingual virtual assistant that helps Chinese learners practice and improve their skills. The model will give real-time feedback on grammar, pronunciation, and vocabulary when having a conversation with it. It will also have a built-in function that adapts to the user's proficiency level and personalizes learning content. This can help learners learn better at their own rhythm and level.
I'm not sure but I am open to exploring
Project Idea: Create a tool that can figure out how people feel when they post on social media, no matter what language they're using. The idea is to make a system that helps us see the emotions behind online conversations in different languages, so users and businesses can get a sense of what people are expressing on a global scale.
I would like to learn how could NLP techniques be applied to alternative data in finance and business.
Utilization of sentiment analysis would be something I'd be interested in working on; a project like conducting sentiment analysis on a dataset (such as poems) and using it to generate new data (such as new poems based on a key emotion) is something that I would be interested in. In addition, I would also be interested in leveraging LLMs to create something, such as recreating the personality of characters.
The team project that I have in mind is an LLM that recommends personalized recipes based on user requests, dietary restrictions, and ingredient availability. The system could also assist users in meal planning by suggesting balanced meal combinations and creating shopping lists if they do not have the necessary ingredients. My team would ideally collect many recipes online for our database to serve as the foundation for our recommendations. An algorithm would need to be developed for the meal recommendation system. We would then create a web-based application for user interactions with our system. If possible, it would be great to have a user "account" feature for further user personalization in the future. I thought of this project idea with Marcus Cheema.
Text Writing Editor: Use models like OpenAI's Davinci API for generating creative writing, including poetry, stories, or even scripts. The focus would be on fine-tuning the model's parameters and prompts to allow certain styles or themes.
GitHub repository link: Team project: I'd like to know more about sentiment/emotion/opinion analysis of text. This semester I'd probably do something related to analysis techniques regarding sentiment lexicons and sentiment classification models.
N/A
N/A
The team project that we would like to pursue is to use ai (chatGPT) to train specifically for a task, such as writing, research, programming. For example, there is a plugin function in Canva supported by ChatGPT that is able to generate infographics and visualizations automatically through prompts.
N/A
I am interested in doing a project about sentiment analysis, where a model would be able to decide what the tone of a piece of text is. This is intriguing to me because even as people, tone is difficult to convey accurately through text, and even when we are able to determine the sentiment of a piece of text, it is not always clear exactly how that tone was communicated. I am curious to see how accurate a machine can be in determining something so emotion-based and not clear-cut. With that said, I am open to other project ideas as well!
N/A
Two idea possibilities: 1)Sentiment Analysis Chatbot: A chatbot that detects the emotional and state of mind-being from a person, the idea behind this we are LARPing as Walmart or Target or another retailer, and we want to detect how the customer feels without having them directly state it (because that tends to anger them). Put it like this, if you're disappointed with either customer service or a product you just purchased, and you write to a bot wanting to express disappointment, but is instead asked how you feel about the service or product, that will just annoy you into stating you are angry or upset. The key to this tool is being able to interpret word choices and convert them into state of emotional well-being and satisfaction from a state of 1-5, 1 being extremely unsatisfactory, and 5 being extremely satisfactory, and utilizing this metric across 5 rubrics (Customer Service, Product Quality, Cleanliness of Stores, Location Convenience, Feeling of Safety) from a short conversation with the consumer. Along the way, the bot can also make recommendations, including for products and advices, after the conversation. 2)Spam Detection Bot: Email filter bot that would combine ANN with corpus of common spam emails, including ones I've fallen for (I've failed 100% of the Emory phishing emails, I'm sad to say, not an exaggerated stat, I've never not clicked on those baits). We'd create a distinction between actual legitimate bot emails (i.e. Job offers, important notifications, reminders) from both harmless and harmful spam. Differs from email spam on multi-lingual factor: my emails still have large amounts of non-english spam that filter through, but because the spam bot isn't as well-trained in that factor, I'm receiving garbage on fake chinese job referrals and german job offers (I know they're fake, because like the genius that I am, clicked on them and inquired, which resulted in me getting more spam emails). The spam detection bot would also create warnings on non-spam emails that border on spam (i.e. promotional emails), and offer the end user the opportunity to enable options to filter them out.
Team Project Idea: Examine presidential inaugural addresses for trends in word tokens, types, diversity, etc. More points of analysis will definitely come up, but in general hope to look for different trends over time, individual differences between presidents, and other interesting observations.
Our team would like to develop a system that can analyze and classify the sentiment of social media posts. We will choose one social media platform and focus on posts from a certain period or around a specific topic. Our goal is that the system can help businesses, organizations, and governments understand the public reaction and adjust policies/improve products.
N/A
Project Idea: I aim to analyze the linguistic nuances and sentiment differences in academic papers when referring to "people with disabilities" vs. "disabled people". I hope to understand if the choice of terminology correlates with varying sentiments and contextual frameworks in disability discourse within academic literature.
N/A
Our team would like to develop a system that can analyze and classify the sentiment of social media posts. We will choose one social media platform and focus on posts from a certain period or around a specific topic. Our goal is that the system can help businesses, organizations, and governments understand the public reaction and adjust policies/improve products.
The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?
What are the disadvantages of using BPE-based tokenization instead of ? What are the potential issues with the implementation of BPE above?
How does self-attention operate given an embedding matrix $D \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension?
Given the same embedding matrix as in question #3, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?
What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?
How is a Masked Language Model used in training a language model with a transformer?
How can one train a document-level embedding using a transformer?
What are the advantages of embeddings generated by transformers compared to those generated by ?
Neural networks have gained widespread popularity for training natural language processing models since 2013. What factors enabled this popularity, and how do they differ from traditional NLP methods?
Recent large language models like ChatGPT or Claude are trained quite differently from traditional NLP models. What are the main differences, and what factors enabled their development?
Attention Is All You Need, Vaswani et al., NIPS 2017.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL 2019.
Submit your presentation slides in PDF format:
Each team is allotted 8 minutes for their presentation, which may incorporate discussions.
Your presentation should be captivating and provide a concise overview of the written proposal.
Ensure to include an end-to-end example that illustrates the input and output of your system.
Submit a proposal in PDF format consisting of 5-8 pages (excluding references) using the provided template. Your proposal should offer a thorough description of your project.
Title (try to be as catchy as possible).
Course ID and name.
List of team members, their majors, and contact information.
Overview of the proposed project.
Intellectual merit.
Broader impact.
Objectives: what are your goals and what do you believe you can achieve during this semester?
Motivation: what contributes to the societal value of this project?
Problem Statement: what specific problem are you seeking to address using NLP technology?
Your research objectives should incorporate an element of novelty. Articulate which aspect of your project is innovative, and provide evidence by comparing it with (potential) competitors.
Reference relevant works, whether from academia or industry. Provide a brief overview of each work and highlight how your work differs from it.
If you've conducted preliminary work, delineate what has already been accomplished and outline what additional work will be undertaken during this semester.
Provide precise details regarding your methodologies, ensuring they are achievable within the confines of this semester. If you intend to continue work beyond this timeframe, delineate the extended timeline in the subsequent section.
Clearly outline the dataset and evaluation methods for experiments. This section must be compelling to demonstrate the feasibility of the proposal.
Outline weekly plans.
Assign tasks to individuals for each timeline.
Updated 2023-10-27
The Encoder-Decoder Framework is commonly used for solving sequence-to-sequence tasks, where it takes an input sequence, processes it through an encoder, and produces an output sequence. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1:
An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.
Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":
A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.
The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this?
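The following is a minimal NumPy sketch of this framework, not the course API: the weights are random and untrained, the encoder's last hidden state is taken as the context vector, and the greedy decoding loop with a max_len cutoff is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, V = 8, 16, 12                 # embedding size, hidden size, toy output vocabulary size
EOS = V - 1                         # index of the artificial end-of-sequence token

# Illustrative parameters (randomly initialized; a real model would learn these).
Wx, Wh = rng.normal(size=(e, d)), rng.normal(size=(e, e))
Wy, Ws = rng.normal(size=(e, V)), rng.normal(size=(e, e))
Wo = rng.normal(size=(V, e))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def encode(X):
    """Run the encoder over the input embeddings; the last hidden state serves as the context here."""
    h = np.zeros(e)
    for x in X:                     # one step per input token (including [EOS])
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def decode(c, max_len=20):
    """Generate output token ids one at a time, conditioned on the context vector."""
    s, out = np.tanh(Ws @ c), []
    for _ in range(max_len):        # max_len guards against a decoder that never predicts [EOS]
        y = int(softmax(Wo @ s).argmax())   # greedy choice of the next token
        if y == EOS:
            break
        out.append(y)
        s = np.tanh(Wy[:, y] + Ws @ s)      # condition on the previous output and its hidden state
    return out

X = rng.normal(size=(5, d))         # e.g., "I am a boy [EOS]" as five toy embeddings
print(decode(encode(X)))            # token ids only; the output is meaningless without training
```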
Update: 2023-10-26
A Recurrent Neural Network (RNN) [1] maintains hidden states of previous inputs and uses them to predict outputs, allowing it to model temporal dependencies in sequential data.
The hidden state is a vector representing the network's internal memory of the previous time step. It captures information from previous time steps, influences the predictions made at the current time step, and is updated at each time step as the RNN processes a sequence of inputs.
Given an input sequence $X = [x_1, \ldots, x_n]$, where $x_i \in \mathbb{R}^{d}$, an RNN for sequence tagging defines two functions, $f$ and $g$:
$f$ takes the current input $x_i$ and the hidden state $h_{i-1}$ of the previous input $x_{i-1}$, and returns a hidden state $h_i$ such that $h_i = f(x_i, h_{i-1}) = \alpha(W^{x} x_i + W^{h} h_{i-1})$, where $W^{x} \in \mathbb{R}^{e \times d}$, $W^{h} \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.
$g$ takes the hidden state $h_i$ and returns an output $y_i$ such that $y_i = g(h_i) = W^{o} h_i$, where $W^{o} \in \mathbb{R}^{o \times e}$.
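A minimal NumPy sketch of these two functions for sequence tagging; the dimensions, the use of tanh as the activation $\alpha$, and the random (untrained) weights are illustrative assumptions rather than the course implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, e, o = 8, 16, 5                 # embedding, hidden, and tag-set sizes (illustrative)
Wx = rng.normal(size=(e, d))       # W^x: input-to-hidden weights
Wh = rng.normal(size=(e, e))       # W^h: hidden-to-hidden weights
Wo = rng.normal(size=(o, e))       # W^o: hidden-to-output weights

def f(x, h_prev):
    # h_i = alpha(W^x x_i + W^h h_{i-1}), with tanh as the activation function
    return np.tanh(Wx @ x + Wh @ h_prev)

def g(h):
    # y_i = W^o h_i: a score vector over the tag set for the current token
    return Wo @ h

X = rng.normal(size=(4, d))        # e.g., toy embeddings of "They are early birds"
h = np.zeros(e)
for i, x in enumerate(X):          # left-to-right pass over the sequence
    h = f(x, h)
    print(i, int(g(h).argmax()))   # predicted tag id for each token
```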
Figure 1 shows an example of an RNN for sequence tagging, such as part-of-speech (POS) tagging:
For example, let us consider the word "early" in the following two sentences:
They are early birds -> "early" is an adjective.
They are early today -> "early" is an adverb.
The POS tags of "early" depend on the following words, "birds" and "today", so making the correct predictions becomes challenging without the following context.
To overcome this challenge, a Bidirectional RNN has been suggested [2] that considers both the forward and backward directions, creating twice as many hidden states to capture a more comprehensive context. Figure 3 illustrates a bidirectional RNN for sequence tagging:
Does it make sense to use bidirectional RNN for text classification? Explain your answer.
Long Short-Term Memory (LSTM) Networks [3-5]
Gated Recurrent Units (GRUs) [6-7]
Submit a final report in PDF format consisting of 5-8 pages (excluding references) using the provided template. Change the section titles as needed.
Title (try to be as catchy as possible).
Course ID and name.
List of team members, their majors, and contact information.
Overview of the proposed project.
Intellectual merit.
Broader impact.
Objectives: what goals do you aim to achieve through this project?
Motivation: what contributes to the societal value of this project?
Problem Statement: what specific problem are you seeking to address using NLP technology?
Your research objectives should incorporate an element of novelty. Articulate which aspect of your project is innovative, and provide evidence by comparing it with (potential) competitors.
Reference relevant works, whether from academia or industry. Provide a brief overview of each work and highlight how your work differs from it.
Provide a figure describing an end-to-end example of your system.
Provide precise details regarding your methodologies.
Clearly describe the dataset, evaluation methods, and results of your system.
Summarize key methods and findings.
Broader impact: upon completion, how your project can change the world.
Future work: how to improve the quality of your project.
Attention Is All You Need, Vaswani et al., NIPS 2017.
Let $X = [x_1, \ldots, x_n, x_{n+1}]$ be an input sequence, where $x_i$ is the $i$'th word in the sequence and $x_{n+1} = \text{[EOS]}$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN. Notice that the end-of-sequence token is used to create an additional hidden state $h_{n+1}$, which in turn creates the context vector $c$.
Is it possible to derive the context vector $c$ from $h_n$ instead of $h_{n+1}$? What is the purpose of appending an extra token to indicate the end of the sequence?
Let $Y = [y_1, \ldots, y_m, y_{m+1}]$ be an output sequence, where $y_j$ is the $j$'th word in the sequence, and $y_{m+1} = \text{[EOS]}$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f'$ and $g'$:
$f'$ takes the previous output $y_{j-1}$ and its hidden state $s_{j-1}$, and returns a hidden state $s_j$ such that $s_j = f'(y_{j-1}, s_{j-1}) = \alpha(W^{y} y_{j-1} + W^{s} s_{j-1})$, where $W^{y} \in \mathbb{R}^{e \times d}$, $W^{s} \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.
$g'$ takes the hidden state $s_j$ and returns an output $y_j$ such that $y_j = g'(s_j) = W^{o} s_j$, where $W^{o} \in \mathbb{R}^{o \times e}$.
Note that the initial hidden state $s_1$ is created by considering only the context vector $c$ such that the first output $y_1$ is solely predicted by the context in the input sequence. However, the prediction of every subsequent output $y_j$ is conditioned on both the previous output $y_{j-1}$ and its hidden state $s_{j-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $\text{[EOS]}$.
In some variations of the decoder, the initial hidden state is created by considering both and [1].
Figure 3 illustrates a decoder example that takes the context vector and generates the output sequence, "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which translates the input sequence from English to Korean:
The likelihood of the current output $y_j$ can be calculated as $P(y_j \mid y_1, \ldots, y_{j-1}, c) = \pi(c, y_{j-1}, s_{j-1})$,
where $\pi$ is a function that takes the context vector $c$, the previous input $y_{j-1}$, and its hidden state $s_{j-1}$, and returns the probability of $y_j$. Then, the likelihood of the output sequence can be estimated as follows: $P(Y \mid X) = \prod_{j=1}^{m+1} P(y_j \mid y_1, \ldots, y_{j-1}, c)$.
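As a small illustration of this estimate, the sequence likelihood is simply the product of the per-step probabilities; the probability values below are made-up numbers, not outputs of any model:

```python
import math

# Hypothetical per-step probabilities P(y_j | y_1..y_{j-1}, c) produced by a decoder,
# with the last entry corresponding to the end-of-sequence token [EOS].
step_probs = [0.62, 0.48, 0.55, 0.71, 0.90]

likelihood = math.prod(step_probs)                      # product over j = 1..m+1
log_likelihood = sum(math.log(p) for p in step_probs)   # equivalent, numerically safer form
print(likelihood, log_likelihood)
```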
The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token . What are the benefits of incorporating this artificial token when estimating the sequence probability?
Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.*
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.*
Notice that the output $y_1$ for the first input $x_1$ is predicted by considering only the input itself, i.e., $y_1 = g(h_1)$ where $h_1$ is derived from $x_1$ alone (e.g., the POS tag of the first word "I" is predicted solely using that word). However, the output $y_i$ for every other input $x_i$ is predicted by considering both $x_i$ and $h_{i-1}$, an intermediate representation created explicitly for the task. This enables RNNs to capture sequential information that feedforward neural networks cannot.
What does each hidden state represent in the RNN for sequence tagging?
Unlike sequence tagging, where the RNN predicts a sequence of outputs $[y_1, \ldots, y_n]$ for the input $[x_1, \ldots, x_n]$, an RNN designed for text classification predicts only one output $y$ for the entire input sequence such that:
Sequence Tagging: $[x_1, \ldots, x_n] \rightarrow [y_1, \ldots, y_n]$
Text Classification: $[x_1, \ldots, x_n] \rightarrow y$
To accomplish this, a common practice is to predict the output $y$ from the last hidden state $h_n$ using the function $g$. Figure 2 shows an example of an RNN for text classification, such as sentiment analysis:
What does the hidden state $h_n$ represent in the RNN for text classification?
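A minimal sketch of this setup, assuming the same recurrence as above with random, untrained weights; only the last hidden state feeds the output function, and all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, e, k = 8, 16, 3                    # embedding size, hidden size, number of classes
Wx, Wh = rng.normal(size=(e, d)), rng.normal(size=(e, e))
Wo = rng.normal(size=(k, e))          # maps the last hidden state to class scores

X = rng.normal(size=(6, d))           # toy embeddings of a 6-token document
h = np.zeros(e)
for x in X:                           # same recurrence as in sequence tagging
    h = np.tanh(Wx @ x + Wh @ h)
print(int((Wo @ h).argmax()))         # a single label predicted from h_n alone
```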
The RNN described above does not consider the words that follow the current word when predicting the output. This limitation can significantly impact model performance since contextual information following the current word can be crucial.
For every $x_i$, the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are created by considering $x_i$ together with $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$, respectively. The function $g$ takes both $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ and returns an output $y_i$ such that $y_i = g(\overrightarrow{h}_i \oplus \overleftarrow{h}_i)$, where $\overrightarrow{h}_i \oplus \overleftarrow{h}_i$ is a concatenation of the two hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$.
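A minimal NumPy sketch of the bidirectional pass; the separate forward and backward weight matrices and the concatenation before the output layer follow the description above, while the dimensions and random (untrained) weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, e, o = 8, 16, 5
Wxf, Whf = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # forward RNN weights
Wxb, Whb = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # backward RNN weights
Wo = rng.normal(size=(o, 2 * e))      # output weights over the concatenated states

X = rng.normal(size=(4, d))           # e.g., toy embeddings of "They are early today"
n = len(X)

fwd, bwd = [None] * n, [None] * n
h = np.zeros(e)
for i in range(n):                    # left-to-right pass
    h = np.tanh(Wxf @ X[i] + Whf @ h)
    fwd[i] = h
h = np.zeros(e)
for i in reversed(range(n)):          # right-to-left pass
    h = np.tanh(Wxb @ X[i] + Whb @ h)
    bwd[i] = h

for i in range(n):                    # y_i is predicted from the concatenation of both states
    scores = Wo @ np.concatenate([fwd[i], bwd[i]])
    print(i, int(scores.argmax()))
```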
Finding Structure in Time, Elman, Cognitive Science, 14(2), 1990.
Bidirectional Recurrent Neural Networks, Schuster and Paliwal, IEEE Transactions on Signal Processing, 45(11), 1997.
Long Short-Term Memory, Hochreiter and Schmidhuber, Neural Computation, 9(8), 1997 (available at ResearchGate).
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma and Hovy, ACL, 2016.*
Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.*
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., EMNLP, 2014.*
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., NeurIPS Workshop on Deep Learning and Representation Learning, 2014.*
Spring 2024
Will Kohn, Sam Liu, Ja’Zmin McKeel, Ellie Paek
Andrew Chung, Andrew Lu, Frederic Guintu, Tung Dinh
Yunnie Yu, Jason Zhang, Serena Zhou
Henry Dierkes, Andrew Lee, Nicole Thomas
Calvin Brauer, Cashin Woo, Jerry Hong
Calla Gong, Louis Lu, Wenzhuo Ma, Yoyo Wang
Freddy Xiong, Molly Han, Murphy Chen, Peter Jeong
Helen Jin, Michael Cao, Michael Wang, Michelle Kim
Marcus Cheema, Dylan Parker, Sherry Rui
Chengyu Shi, Chenming Zhou, Ruichen Ni, Wenxuan Cai
Joyce Zhang, Paige Hendricks, Lindsey Wendkos, Benjamin Dixon
Mara Adams, Hunter Grimes, Carl Kassabian, Simon Yu
Alec Chapman, Izana Melese
[03/04] 2, 5, 6, 8, 10, 11, 12
[03/06] 1, 3, 4, 7, 9, 13