NLP Essentials

Overview

By Jinho D. Choi (2025 Edition)

Natural Language Processing (NLP) is a dynamic field within Artificial Intelligence focused on developing computational models to understand, interpret, and generate human language. As NLP technologies become increasingly embedded in our daily lives, understanding its fundamentals is crucial for both leveraging its potential and enhancing our interaction with language-based systems.

This course is designed to build a robust foundation in the core principles of modern NLP. We begin with text processing techniques that show how to manipulate textual data for model development. We then progress to language models, enabling computational models to comprehend and generate human language, and vector space models that convert textual data into machine-readable formats. Advanced topics include distributional semantics for creating context-aware word embeddings, and contextual encoding for analyzing word relationships within their surrounding text.

The latter part of the course focuses on practical application through team projects. Students will have the opportunity to work with cutting-edge NLP technologies, such as large language models, to develop real-world applications. This hands-on approach encourages creativity and innovation, with students proposing their own ideas and presenting demonstrations to their peers.

Learning assessment combines concept quizzes to reinforce theoretical understanding with hands-on programming assignments that develop practical implementation skills. By the conclusion of the course, students will have gained the knowledge and skills to navigate and contribute to the rapidly evolving field of NLP.

Prerequisites

  • Introduction to Python Programming

  • Introduction to Machine Learning

Sections

Syllabus
Schedule
Development Environment
Homework

Syllabus

CS|QTM|LING-329: Computational Linguistics (Spring 2025)

General

  • Time: MW 4 - 5:15 PM

  • Location: MSC W201

Instructors

  • Jinho Choi

    • Associate Professor of Computer Science, Quantitative Theory and Methods, Linguistics

    • Office Hours and Location: MW 5:30 - 6:30 PM, White Hall 218

    • GitHub: jdchoi77

  • Grace Byun

    • Ph.D. Student in Computer Science and Informatics

    • Office Hours and Location:

      • Hours: Wed 1:30 - 3:30 PM and Fri 12:30 - 1:30 PM

      • Location: Math & Science Center E308 (Computer Lab)

    • GitHub: byunsj

  • Swati Rajwal

    • Ph.D. Student in Computer Science and Informatics

    • Office Hours and Location:

      • Hours: Mon 10:30 AM - 1:30 PM

      • Location: Math & Science Center E308 (Computer Lab)

    • GitHub: swati-rajwal

Grading

  • Homework: 65%

  • Team Project: 35%

Homework

  • Each topic will include homework that combines concept quizzes and programming assignments to assess your understanding of the subject matter.

  • Assignments must be submitted individually. While discussions are allowed, your work must be original.

  • Late submissions within a week will be accepted with a 15% grading penalty. They will not be accepted after the solutions are discussed in class.

  • Your work is governed by the Emory Honor Code. Honor code violations (e.g., copies from any source, including colleagues and internet sites) will be referred to the Emory Honor Council.

  • Requests for absence/rescheduling due to severe personal events (such as health, family, or personal reasons) impacting course performance must be supported by a letter from the Office for Undergraduate Education.

Concept Quizzes

  • Each section incorporates questions to explore the content more comprehensively, with their corresponding answers slated for discussion in the class.

  • While certain questions may have multiple valid answers, the grading will be based on the responses discussed in class, and alternative answers will be disregarded. This approach allows us to distinguish between answers discussed in class and those generated by AI tools like ChatGPT.

Programming Assignments

  • You are encouraged to use any code examples and invoke APIs provided in this book.

  • Feel free to create additional functions and variables in the assigned Python file. For each homework, ensure that all your implementations are included in the respective Python file located under the corresponding directory.

  • Usage of packages not covered in the assigned chapter is prohibited. Ensure that your code does not rely on the installation of additional packages, as we will not be able to execute your program for evaluation if external dependencies are needed.

Team Project

  • You are expected to:

    • Form a team of 3-4 members.

    • Present a project pitch to share your proposed idea, and write a proposal report.

    • Deliver a live demonstration showcasing your working project, create a demonstration video, and write a final report documenting details about your project.

    • Provide individual feedback on other teams' presentations and demonstrations.

  • Participation in pitch presentations and live demonstrations is compulsory. Failure to attend any of these events will result in a zero grade for the respective activity. In the event of unavoidable absence due to severe personal circumstances, a formal letter from the Office for Undergraduate Education must accompany any excuses.

Project Grading

  • Team members receive the same grade for the pitch presentation, live demonstration, and demonstration video.

  • Peer evaluations from other teams factor into your team grade.

  • Your feedback to other teams is graded individually.

For the project and final reports, you are required to indicate the contribution percentage of each team member, which impacts the individual grades for the assignment.

Contribution

If your team of two members received 4 out of 5 points, for example, and you indicate that your contribution was 60% while your teammate's was 40%, the points are distributed as follows:

  • You receive: 4 (team points) ⨉ 60 (your contribution) / 60 (max contribution) = 4 points.

  • Your teammate receives: 4 (team points) ⨉ 40 (your teammate's contribution) / 60 (max contribution) = 2.67 points.

This approach ensures that the grading reflects the effort and input of each team member, promoting fairness and accountability.

Consensus

  • Each team is required to submit a single, agreed-upon chart detailing the contribution percentages of all members for each team assignment. This means that you and your teammates must reach a consensus on the contribution rates before submitting your work.

  • Open communication and transparency are essential in this process. Disagreements should be resolved within the team, ensuring that the final submission reflects the true division of labor and contributions.

By adhering to these guidelines, you not only produce a strong research paper but also develop key skills in teamwork and fair assessment of contributions.

Development Environment

This guide will help you set up your development environment by installing the required tools: the Python programming language, GitHub for version control, and the PyCharm IDE.

  • Git: a version control system for tracking changes in files.

  • Virtualenv: a tool to create isolated Python environments.

Python

Install Python version 3.13.x or higher. Earlier versions may not be compatible with this course. Take some time to familiarize yourself with Python's new features.

GitHub

Create a GitHub account (if you do not already have one). As a student, you can get GitHub Pro features for free through the GitHub Student Developer Pack. Then:

  1. Login to GitHub.

  2. Create a new repository named "nlp-essentials" and set its visibility to Private.

  3. Go to [Settings] in your repository, and select [Collaborators and teams].

  4. Click [Add people], and add each instructor using their GitHub usernames (find their GitHub IDs in the "Instructors" section of the Syllabus):

    1. Enter each username and send the collaboration invitation.

  5. Verify that all instructors have been added as collaborators.

PyCharm

Install PyCharm on your local machine. As a student, you can get PyCharm Professional for free by applying for a JetBrains Educational License. The following instructions are based on PyCharm 2024.3.x Professional Edition.

  1. Configure your GitHub account:

    1. Go to [Settings] > [Version Control] > [GitHub].

    2. Press [+], select [Log in via GitHub], and follow the browser prompts to authorize PyCharm with your GitHub account.

    3. Once connected, you will be able to access GitHub directly from PyCharm for version control operations.

  2. Create a new PyCharm project from GitHub:

    1. On the PyCharm welcome screen, click [Clone Repository].

    2. In the new window, select [GitHub] from the left menu, choose your nlp-essentials repository, and click [Clone] (verify the directory name is "nlp-essentials").

  3. Set up a Python virtual environment:

    1. Go to [Settings] > [Project: nlp-essentials] > [Project Interpreter].

    2. Click [Add Interpreter] and choose [Add Local Interpreter].

    3. In the prompted menu, choose [Add Local Environment], configure as follows, then click [OK]:

      • Environment: Generate new

      • Type: Virtualenv

      • Base python: the Python version you installed above

      • Location: YOUR_LOCAL_PATH/nlp-essentials/.venv

References

  • Book: http://emory.gitbook.io/nlp-essentials/

  • GitHub: https://github.com/emory-courses/nlp-essentials/


Lemmatization

Sometimes, it is more appropriate to consider the canonical forms of words as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature for each year, you want to count both "transformer" and "transformers" as a single item.

Lemmatization is a task that reduces words to their base or dictionary forms, known as lemmas, making it easier to interpret their core meaning.

Q7: What is the difference between a lemmatizer and a stemmer [3]?

When analyzing dat/word_types-token.txt, obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:

  • Universities

  • University

  • universities

  • university

Two variations are applied to the noun "university": capitalization, generally used for proper nouns or sentence-initial words, and pluralization, which indicates multiple instances of the term. Verbs can also take several variations regarding tense and aspect:

  • study

  • studies

  • studied

  • studying

We want to develop a lemmatizer that normalizes all variations into their respective lemmas.

Lemma Lexica

Let us create lexica for lemmatization:

import json
import os
from types import SimpleNamespace
def get_lexica(res_dir: str) -> SimpleNamespace:
    with open(os.path.join(res_dir, 'nouns.txt')) as fin: nouns = {noun.strip() for noun in fin}
    with open(os.path.join(res_dir, 'verbs.txt')) as fin: verbs = {verb.strip() for verb in fin}
    with open(os.path.join(res_dir, 'nouns_irregular.json')) as fin: nouns_irregular = json.load(fin)
    with open(os.path.join(res_dir, 'verbs_irregular.json')) as fin: verbs_irregular = json.load(fin)
    with open(os.path.join(res_dir, 'nouns_rules.json')) as fin: nouns_rules = json.load(fin)
    with open(os.path.join(res_dir, 'verbs_rules.json')) as fin: verbs_rules = json.load(fin)

    return SimpleNamespace(
        nouns=nouns,
        verbs=verbs,
        nouns_irregular=nouns_irregular,
        verbs_irregular=verbs_irregular,
        nouns_rules=nouns_rules,
        verbs_rules=verbs_rules
    )
  • L1: res_dir: the path to the root directory where all lexica files are located.

  • The json (JSON encoder and decoder), os.path (common pathname manipulations), and types.SimpleNamespace modules from the standard library are used to load and store the lexica.

  • nouns.txt: a list of base nouns.

  • verbs.txt: a list of base verbs.

  • nouns_irregular.json: a dictionary of nouns whose plural forms are irregular (e.g., mouse → mice).

  • verbs_irregular.json: a dictionary of verbs whose inflection forms are irregular (e.g., buy → bought).

  • nouns_rules.json: a list of pluralization rules for nouns.

  • verbs_rules.json: a list of inflection rules for verbs.

We then create the lexica by calling get_lexica() with the path to the directory containing these files, and verify that all lexical resources are loaded correctly:

print(len(lexica.nouns))
print(len(lexica.verbs))
print(lexica.nouns_irregular)
print(lexica.verbs_irregular)
print(lexica.nouns_rules)
print(lexica.verbs_rules)
91
27
{'children': 'child', 'crises': 'crisis', 'mice': 'mouse'}
{'is': 'be', 'was': 'be', 'has': 'have', 'had': 'have', 'bought': 'buy'}
[['ies', 'y'], ['es', ''], ['s', ''], ['men', 'man'], ['ae', 'a'], ['i', 'us']]
[['ies', 'y'], ['ied', 'y'], ['es', ''], ['ed', ''], ['s', ''], ['d', ''], ['ying', 'ie'], ['ing', ''], ['ing', 'e'], ['n', ''], ['ung', 'ing']]
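To see how a single (suffix, replacement) rule from these lists transforms a word, consider the small illustration below; apply_rule() is a made-up helper for demonstration, not part of the book's code:

# Illustration only: apply one (suffix, replacement) rule to a word.
def apply_rule(word: str, suffix: str, replacement: str) -> str:
    return word[:-len(suffix)] + replacement if word.endswith(suffix) else word

print(apply_rule('studies', 'ies', 'y'))      # study
print(apply_rule('gentlemen', 'men', 'man'))  # gentleman
print(apply_rule('alumni', 'i', 'us'))        # alumnus

In the actual lemmatizer below, a candidate produced by a rule is only accepted if it appears in the corresponding vocabulary (nouns or verbs).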

Q8: What are the key differences between inflectional and derivational morphology?

Lemmatizing

Let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:

def lemmatize(lexica: SimpleNamespace, word: str) -> str:
    def aux(word: str, vocabs: dict[str, str], irregular: dict[str, str], rules: list[tuple[str, str]]):
        lemma = irregular.get(word, None)
        if lemma is not None: return lemma

        for p, s in rules:
            lemma = word[:-len(p)] + s
            if lemma in vocabs: return lemma

        return None

    word = word.lower()
    lemma = aux(word, lexica.verbs, lexica.verbs_irregular, lexica.verbs_rules)

    if lemma is None:
        lemma = aux(word, lexica.nouns, lexica.nouns_irregular, lexica.nouns_rules)

    return lemma if lemma else word
  • L2: Define a nested function aux to handle lemmatization.

  • L3-4: Check if the word is in the irregular dictionary (using get()); if so, return its lemma.

  • L6-7: Try applying each rule in the rules list to word.

  • L8: If the resulting lemma is in the vocabulary, return it.

  • L10: If no lemma is found, return None.

  • L12: Convert the input word to lowercase for case-insensitive processing.

  • L13: Try to lemmatize the word using verb-related lexica.

  • L15-16: If no lemma is found among verbs, try to lemmatize using noun-related lexica.

  • L18: Return the lemma if found or the decapitalized word if no lemmatization occurred.

We now test our lemmatizer for nouns and verbs:

nouns = ['studies', 'crosses', 'areas', 'gentlemen', 'vertebrae', 'alumni', 'children', 'crises']
nouns_lemmatized = [lemmatize(lexica, word) for word in nouns]
for word, lemma in zip(nouns, nouns_lemmatized): print('{} -> {}'.format(word, lemma))

verbs = ['applies', 'cried', 'pushes', 'entered', 'takes', 'heard', 'lying', 'studying', 'taking', 'drawn', 'clung', 'was', 'bought']
verbs_lemmatized = [lemmatize(lexica, word) for word in verbs]
for word, lemma in zip(verbs, verbs_lemmatized): print('{} -> {}'.format(word, lemma))
studies -> study
crosses -> cross
areas -> area
gentlemen -> gentleman
vertebrae -> vertebra
alumni -> alumnus
children -> child
crises -> crisis
applies -> apply
cried -> cry
pushes -> push
entered -> enter
takes -> take
heard -> hear
lying -> lie
studying -> study
taking -> take
drawn -> draw
clung -> cling
was -> be
bought -> buy
Finally, let us recount the word types in dat/emory-wiki.txt using the lemmatizer and the tokenize() function from src/tokenization.py, and save them to dat/word_types-token-lemma.txt:

from collections import Counter
from src.tokenization import tokenize

corpus = 'dat/emory-wiki.txt'
delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}
words = [lemmatize(lexica, word) for word in tokenize(corpus, delims)]
counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(counts)}')

output = 'dat/word_types-token-lemma.txt'
with open(output, 'w') as fout:
    for key in sorted(counts.keys()): fout.write(f'{key}\n')
# of word tokens: 363
# of word types: 177

When the words are further normalized by lemmatization, the number of word tokens remains the same as without lemmatization, but the number of word types is reduced from 197 to 177.

Q9: In which tasks can lemmatization negatively impact performance?

References

  • Source: lemmatization.py

  • ELIT Morphological Analyzer: a heuristic-based lemmatizer.

  • An Algorithm for Suffix Stripping, Porter, Program: Electronic Library and Information Systems, 14(3), 1980 (PDF).

Homework

HW0: Getting Started

Task 1: Getting Started

In this assignment, you will:

  1. Set up your development environment,

  2. Install your first Python package using pip within a virtual environment,

  3. Run a test program to verify the installation and environment configuration, and

  4. Commit your changes and push them to your GitHub repository.

This will ensure your development workspace is properly configured for this course.

Package Installation

  Once you have set up the development environment:

  1. Open Terminal in PyCharm:

    1. Select the [View] > [Tool Windows] > [Terminal] menu, or

    2. Click the Terminal icon at the bottom left.

  2. Update pip to the latest version (if necessary):

    python -m pip install --upgrade pip

  3. Install setuptools (if necessary):

    pip install setuptools

  4. Install the ELIT Tokenizer:

    pip install elit_tokenizer

  5. You will know the installation is successful when you see "Successfully installed ..." messages for each package in the terminal output.

Test Program

  1. Create the project structure:

    1. Create a new Python package called src in your nlp-essentials directory.

    2. Inside src, create a homework package.

    3. PyCharm will automatically create __init__.py files in both directories to mark them as Python packages.

  2. Create your first program:

    1. Create a Python file called getting_started.py inside homework.

    2. Copy the following code into the file:

    from elit_tokenizer import EnglishTokenizer
    
    if __name__ == '__main__':
        text = 'Emory NLP is a research lab in Atlanta, GA. It was founded by Jinho D. Choi in 2014. Dr. Choi is a professor at Emory University.'
        tokenizer = EnglishTokenizer()
        sentence = tokenizer.decode(text)
        print(sentence.tokens)
        print(sentence.offsets)
  3. Run the program:

    1. Choose the [Run] > [Run 'getting_started'] menu, or

    2. Use the green run button next to the main block.

  4. Verify the output; your program is working correctly if you see this output:

    ['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.', 'It', 'was', 'founded', 'by', 'Jinho', 'D.', 'Choi', 'in', '2014', '.', 'Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory', 'University', '.']
    [(0, 5), (6, 9), (10, 12), (13, 14), (15, 23), (24, 27), (28, 30), (31, 38), (38, 39), (40, 42), (42, 43), (44, 46), (47, 50), (51, 58), (59, 61), (62, 67), (68, 70), (71, 75), (76, 78), (79, 83), (83, 84), (85, 88), (89, 93), (94, 96), (97, 98), (99, 108), (109, 111), (112, 117), (118, 128), (128, 129)]

Commit & Push

    Create a .gitignore file in your nlp-essentials root directory and add the following lines to exclude unnecessary files:

    .idea/
    .venv/
  1. Stage your files for commit:

    1. Add the following files to Git by right-clicking them and selecting [Git] > [Add]:

      .gitignore
      src/__init__.py
      src/homework/__init__.py
      src/homework/getting_started.py
    2. Files should turn green when successfully added. If files do not change color, restart PyCharm and try again.

  2. Commit and push your changes:

    1. Right-click the nlp-essentials directory.

    2. Select [Git] > [Commit Directory].

    3. Write a descriptive commit message (e.g., "Initial setup and tokenizer test")

    4. Click [Commit and Push] (not just Commit)

    5. Click [Push] in the next dialog to upload to your GitHub repository.

  3. Verify your submission:

    1. Visit your GitHub repository in a web browser.

    2. Confirm that all files are properly present and contain the correct content.

Submission

Submit the URL of your GitHub repository to Canvas.

Task 2: Project Ideas

Share your team project concept by filling out the form in Canvas (about 100-150 words). Your description will be posted on the Project Ideas page to help classmates discover shared interests and form teams.

Rubric

  • GitHub Setup (0.2 points):

    • Private repository created.

    • All instructors added as collaborators.

  • Project Organization (0.2 points):

    • Correct directory structure.

    • No unnecessary files committed

  • Version Control (0.3 points):

    • All required files committed and successfully pushed to GitHub.

    • Content of the files are correct.

  • Code Implementation (0.3 points):

    • The program executes without errors.

    • Produces correct tokenizer output.

  • Project Ideas (1 point)

    • Is the team project idea well described?


Schedule

CS|QTM|LING-329: Computational Linguistics (Spring 2025)

Date
Topic
Assignment

01/15

01/20

Martin Luther King Day

01/22

01/27

(continue)

01/29

(continue)

02/03

(continue)

02/05

02/10

02/12

(continue)

02/17

(continue)

02/19

(continue)

02/24

02/26

(continue)

03/03

(continue)

03/05

(continue)

03/10

Spring Break

03/12

Spring Break

03/17

03/19

03/24

03/26

(continue)

03/31

(continue)

04/02

(continue)

04/07

(continue)

04/09

04/14

(continue)

04/16

(continue)

04/21

04/23

04/28

NLP Tasks & Applications

  • Attendance is mandatory for all Project Pitch and Live Demonstration sessions.

Overview
Homework 0
Text Processing
Homework 1
Speed Dating
Team Formation
Language Models
Homework 2
Vector Space Models
Proposal Pitch
Homework 3
Proposal Pitches
Proposal Report
Proposal Pitches
Distributional Semantics
Homework 4
Contextual Encoding
Live Demonstration
Live Demonstrations
Live Demonstrations
Final Report
Homework 5

Entropy and Perplexity

Entropy

Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable X = \{x_1, \ldots, x_n\} is defined as:

H(X) = \sum_{i=1}^n P(x_i) \cdot (-\log P(x_i)) = -\sum_{i=1}^n P(x_i) \cdot \log P(x_i)

P(x_i) is the probability of x_i. The self-information of x_i is defined as -\log P(x_i), which measures how much information is gained when x_i occurs. The negative sign indicates that as the probability of x_i increases, its self-information decreases.

Entropy has several properties, including:

  • It is non-negative: H(X) \geq 0.

  • It is at its minimum when X is entirely predictable (all probability mass on a single outcome).

  • It is at its maximum when all outcomes of X are equally likely.

Q10: Why is a logarithmic scale used to measure self-information in entropy calculations?

Sequence Entropy

Sequence entropy is a measure of the unpredictability or information content of a word sequence, which quantifies how uncertain or random the sequence is.

Assume a long sequence of words, W = [w_1, \ldots, w_n], concatenating the entire text from a language \mathcal{L}. Let \mathcal{S} = \{S_1, \ldots, S_m\} be a set of all possible sequences derived from W, where S_1 is the shortest sequence (a single word) and S_m = W is the longest sequence. Then, the entropy of W can be measured as follows:

H(W) = -\sum_{j=1}^m P(S_j) \cdot \log P(S_j)

The entropy rate (per-word entropy), H'(W), can be measured by dividing H(W) by the total number of words n:

H'(W) = -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)

In theory, there is an infinite number of unobserved word sequences in the language \mathcal{L}. To estimate the true entropy of \mathcal{L}, we need to take the limit of H'(W) as n approaches infinity:

H(\mathcal{L}) = \lim_{n \rightarrow \infty} H'(W) = \lim_{n \rightarrow \infty} -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)

The Shannon-McMillan-Breiman theorem implies that if the language is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure H(\mathcal{L}), because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its probability.

The bigram model in the previous section is stationary because all probabilities rely on the same condition, P(w_{i+1}, w_i). In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.

By applying this theorem, H(\mathcal{L}) can be approximated:

H(\mathcal{L}) \approx \lim_{n \rightarrow \infty} -\frac{1}{n} \log P(S_m)

Consequently, H'(W) is approximated as follows, where S_m = W:

H'(W) \approx -\frac{1}{n} \log P(W)

Q11: What indicates high entropy in a text corpus?

Perplexity

Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence W is measured as:

PP(W) = P(W)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(W)}}

Hence, the higher P(W) is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating W.

Perplexity, PP(W), can be directly derived from the approximated entropy rate, H'(W):

\begin{array}{rcl}
H'(W) & \approx & -\frac{1}{n} \log_2 P(W) \\
2^{H'(W)} & \approx & 2^{-\frac{1}{n} \log_2 P(W)} \\
& = & 2^{\log_2 P(W)^{-\frac{1}{n}}} \\
& = & P(W)^{-\frac{1}{n}} \\
& = & PP(W)
\end{array}

Q12: What is the relationship between corpus entropy and language model perplexity?
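To make these quantities concrete, here is a small sketch (not from the course materials) that computes entropy and perplexity with base-2 logarithms; the probability values below are made up:

import math

def entropy(probs: list[float]) -> float:
    # H(X) = -sum_i P(x_i) * log2 P(x_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(word_probs: list[float]) -> float:
    # PP(W) = P(W)^(-1/n), computed in log space for numerical stability
    n = len(word_probs)
    log_p = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_p / n)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equally likely outcomes
print(entropy([0.9, 0.1]))                # ~0.47 bits: highly predictable, low entropy
print(perplexity([0.25, 0.25, 0.25]))     # 4.0: equivalent to a uniform 4-way choice per word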

References

  • Entropy, Wikipedia

  • Perplexity, Wikipedia


Regular Expressions

Regular expressions, commonly abbreviated as regex, form a language for string matching, enabling operations to search, match, and manipulate text based on specific patterns or rules.

Core Syntax

Metacharacters

Regex provides metacharacters with specific meanings, making it convenient to define patterns:

  • .: any single character except a newline character

    e.g., M.\. matches "Mr." and "Ms.", but not "Mrs." (\ escapes the metacharacter .).

  • [ ]: a character set matching any character within the brackets

    e.g., [aeiou] matches any vowel.

  • \d: any digit, equivalent to [0-9]

    e.g., \d\d\d searches for "170" in "CS170".

  • \D: any character that is not a digit, equivalent to [^0-9]

    e.g., \D\D searches for "kg" in "100kg".

  • \s: any whitespace character, equivalent to [ \t\n\r\f\v]

    e.g., \s searches for the space " " in "Hello World".

  • \S: any character that is not a whitespace character, equivalent to [^ \t\n\r\f\v]

    e.g., \S searches for "H" in " Hello".

  • \w: any word character (alphanumeric or underscore), equivalent to [A-Za-z0-9_]

    e.g., \w\w searches for "1K" in "$1K".

  • \W: any character that is not a word character, equivalent to [^A-Za-z0-9_]

    e.g., \W searches for "!" in "Hello!".

  • \b: a word boundary matching the position between a word character and a non-word character

    e.g., \bis\b matches "is", but does not match "island" and does not find "is" in "basis".

The terms "match" and "search" in the above examples have different meanings: "match" means that the pattern must be found at the beginning of the text, while "search" means that the pattern can be located anywhere in the text. We will discuss these two functions in more detail in the Functions section below.

Repetitions

Repetitions allow you to define complex patterns that can match multiple occurrences of a character or group of characters:

  • *: the preceding character or group appears zero or more times

    e.g., \d* matches "90" in "90s" as well as "" (empty string) in "ABC".

  • +: the preceding character or group appears one or more times

    e.g., \d+ matches "90" in "90s", but no match in "ABC".

  • ?: the preceding character or group appears zero or once, making it optional

    e.g., https? matches both "http" and "https".

  • {m}: the preceding character or group appears exactly m times

    e.g., \d{3} is equivalent to \d\d\d.

  • {m,n}: the preceding character or group appears at least m times but no more than n times

    e.g., \d{2,4} matches "12", "123", "1234", but not "1" or "12345".

  • {m,}: the preceding character or group appears at least m times or more

    e.g., \d{2,} matches "12", "123", "1234", and "12345", but not "1".

  • By default, matches are "greedy" such that patterns match as many characters as possible

    e.g., <.+> matches the entire string of "<Hello> and <World>".

  • Matches become "lazy" by adding ? after the repetition metacharacters, in which case, patterns match as few characters as possible

    e.g., <.+?> matches "<Hello>" in "<Hello> and <World>", and searches for "<World>" in the text.
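The greedy vs. lazy behavior above can be verified with a quick sketch (the sample string is made up):

import re

text = '<Hello> and <World>'
print(re.findall(r'<.+>', text))   # ['<Hello> and <World>'] - greedy: one longest match
print(re.findall(r'<.+?>', text))  # ['<Hello>', '<World>'] - lazy: two shortest matches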

Groupings

Grouping allows you to treat multiple characters, subpatterns, or metacharacters as a single unit. It is achieved by placing these characters within parentheses ( and ).

  • |: a logical OR, referred to as a "pipe" symbol, allowing you to specify alternatives

    e.g., (cat|dog) matches either "cat" or "dog".

  • ( ): a capturing group; any text that matches the parenthesized pattern is "captured" and can be extracted or used in various ways

    e.g., (\w+)@(\w+.\w+) has two capturing groups, (\w+) and (\w+.\w+), and matches email addresses such as "john@emory.edu", where the first and second groups capture "john" and "emory.edu", respectively.

  • (?: ): a non-capturing group; any text that matches the parenthesized pattern, while indeed matched, is not "captured" and thus cannot be extracted or used in other ways

    e.g., (?:\w+)@(\w+.\w+) has one non-capturing group (?:\w+) and one capturing group (\w+.\w+). It still matches "john@emory.edu" but only captures "emory.edu", not "john".

  • \num: a backreference that refers back to the most recently matched text by the num'th capturing group within the same regex

    e.g., (\w+) (\w+) - (\2), (\1) has four capturing groups, where the third and fourth groups refer to the second and first groups, respectively. It matches "Jinho Choi - Choi, Jinho" where the first and fourth groups capture "Jinho" and the second and third groups capture "Choi".

  • You can nest groups within other groups to create more complex patterns

    e.g., (\w+.(edu|org)) has two capturing groups, where the second group is nested in the first group. It matches "emory.edu" or "emorynlp.org", where the first group captures the entire texts while the second group captures "edu" or "org", respectively.
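The following sketch (with made-up input strings) shows the capturing-group and backreference examples above in action:

import re

# Backreferences: groups 3 and 4 must repeat the text captured by groups 2 and 1.
m = re.match(r'(\w+) (\w+) - (\2), (\1)', 'Jinho Choi - Choi, Jinho')
print(m.groups())  # ('Jinho', 'Choi', 'Choi', 'Jinho')

# Nested groups: group 1 captures the whole domain, group 2 only the suffix.
m = re.search(r'(\w+\.(edu|org))', 'Visit emorynlp.org for more information.')
print(m.group(1), m.group(2))  # emorynlp.org org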

Assertions

Assertions define conditions that must be met for a match to occur. They do not consume characters in the input text but specify the position where a match should happen based on specific criteria.

  • A positive lookahead assertion (?= ) checks that a specific pattern is present immediately after the current position

    e.g., apple(?=[ -]pie) matches "apple" in "apple pie" or "apple-pie", but not in "apple juice".

  • A negative lookahead assertion (?! ) checks that a specific pattern is not present immediately after the current position

    e.g., do(?!(?: not|n't)) matches "do" in "do it" or "doing", but not in "do not" or "don't".

  • A positive look-behind assertion (?<= ) checks that a specific pattern is present immediately before the current position

    e.g., (?<=\$)\d+ matches "100" in "$100", but not in "100 dollars".

  • A negative look-behind assertion (?<! ) checks that a specific pattern is not present immediately before the current position

    e.g., (?<!not )(happy|sad) searches for "happy" in "I'm happy", but does not search for "sad" in "I'm not sad".

  • ^ asserts that the pattern following the caret must match at the beginning of the text

    e.g., not searches for "not" in "note" and "cannot", whereas ^not matches "not" in "note" but not in "cannot".

  • $ asserts that the pattern preceding the dollar sign must match at the end of the text

    e.g., not$ searches for "not" in "cannot" but not in "note".
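A small sketch of the lookaround assertions above (sample strings are made up):

import re

print(re.search(r'apple(?=[ -]pie)', 'apple-pie'))        # matches 'apple'
print(re.search(r'apple(?=[ -]pie)', 'apple juice'))      # None
print(re.search(r'(?<=\$)\d+', 'That costs $100 today'))  # matches '100'
print(re.search(r'(?<!not )(happy|sad)', "I'm not sad"))  # None (preceded by "not ")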

Functions

Python provides several functions to make use of regular expressions.

match()

Let us create a regular expression that matches "Mr." and "Ms.":

import re

re_mr = re.compile(r'M[rs]\.')
m = re_mr.match('Mr. Wayne')

print(m)
if m: print(m.start(), m.end())
  • L1: Import the re module (Regular expression operations).

  • L3: Create a regular expression re_mr (compile()). Note that a string with an r prefix is a raw string, which is the preferred way to write regular expressions in Python.

    • r'M' matches the letter "M".

    • r'[rs]' matches either "r" or "s".

    • r'\.' matches a period (dot).

  • L4: Try to match re_mr at the beginning of the string "Mr. Wayne" (match()).

  • L6: Print the value of m. If matched, it prints the match object information; otherwise, m is None, and it prints "None".

  • L7: Check if a match was found (m is not None), and print the start position (start()) and end position (end()) of the match.

<re.Match object; span=(0, 3), match='Mr.'>
0 3

group()

Currently, no group has been specified for re_mr, so groups() returns an empty tuple:

print(m.groups())
()

Let us capture the letters and the period as separate groups:

re_mr = re.compile(r'(M[rs])(\.)')
m = re_mr.match('Ms.')

print(m)
print(m.group())
print(m.groups())
print(m.group(0), m.group(1), m.group(2))
  • L1: The pattern re_mr is looking for the following:

    • 1st group: "M" followed by either "r" or 's'.

    • 2nd group: a period (".")

  • L2: Match re_mr with the input string "Ms.".

  • L5: Print the entire matched string (group()).

  • L6: Print a tuple of all captured groups (groups()).

  • L7: Print specific groups by specifying their indexes. Group 0 is the entire match, group 1 is the first capture group, and group 2 is the second capture group.

<re.Match object; span=(0, 3), match='Ms.'>
Ms.
('Ms', '.')
Ms. Ms .

If the pattern does not find a match, it returns None.

m = re_mr.match('Mrs.')
print(m)
None

search()

Let us match the following strings with re_mr:

s1 = 'Mr. and Ms. Wayne are here'
s2 = 'Here are Mr. and Ms. Wayne'

print(re_mr.match(s1))
print(re_mr.match(s2))
<re.Match object; span=(0, 3), match='Mr.'>
None

s1 matches "Mr." but not "Ms." while s2 does not match any pattern. It is because the match() function matches patterns only at the beginning of the string. To match patterns anywhere in the string, we need to use search() instead:

print(re_mr.search(s1))
print(re_mr.search(s2))
<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(9, 12), match='Mr.'>

findall()

The search() function matches "Mr." in both s1 and s2 but still does not match "Ms.". To match them all, we need to use the findall() function:

print(re_mr.findall(s1))
print(re_mr.findall(s2))
[('Mr', '.'), ('Ms', '.')]
[('Mr', '.'), ('Ms', '.')]

finditer()

While the findall() function matches all occurrences of the pattern, it does not provide a way to locate the positions of the matched results in the string. To find the locations of the matched results, we need to use the finditer() function:

ms = re_mr.finditer(s1)
for m in ms: print(m)

ms = re_mr.finditer(s2)
for m in ms: print(m)
<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(8, 11), match='Ms.'>
<re.Match object; span=(9, 12), match='Mr.'>
<re.Match object; span=(17, 20), match='Ms.'>

sub()

Finally, you can replace the matched results with another string by using the sub() function:

print(re_mr.sub('Dr.', 'I met Mr. Wayne and Ms. Kyle.'))
I met Dr. Wayne and Dr. Kyle.

Tokenization

Finally, let us write a simple tokenizer using regular expressions. We will define a regular expression that matches the necessary patterns for tokenization:

def tokenize(text: str) -> list[str]:
    re_tok = re.compile(r'([",.]|\s+|n\'t)')
    tokens, prev_idx = [], 0

    for m in re_tok.finditer(text):
        t = text[prev_idx:m.start()].strip()
        if t: tokens.append(t)
        t = m.group().strip()
        
        if t:
            if tokens and tokens[-1] in {'Mr', 'Ms'} and t == '.':
                tokens[-1] = tokens[-1] + t
            else:
                tokens.append(t)
        
        prev_idx = m.end()

    t = text[prev_idx:]
    if t: tokens.append(t)
    return tokens
  • L2: Create a regular expression to match delimiters and a special case:

    • Delimiters: '"', ',', '.', or whitespaces ('\s+').

    • The special case: "n't" (e.g., "can't").

  • L3: Create an empty list tokens to store the resulting tokens, and initialize prev_idx to keep track of the previous token's end position.

  • L5: Iterate over matches in text using the regular expression pattern.

    • L6: Extract the substring between the previous token's end and the current match's start, strip any leading or trailing whitespace, and assign it to t.

    • L7: If t is not empty (i.e., it is not just whitespace), add it to the tokens list.

    • L8: Extract the matched token from the match object, strip any leading or trailing whitespace, and assign it to t.

    • L10: If t is not empty (i.e., the pattern is matched):

      • L11-12: Check if the previous token in tokens is "Mr" or "Ms" and the current token is a period ("."), in which case, combine them into a single token.

      • L13-14: Otherwise, add t to tokens.

  • L18-19: After the loop, there might be some text left after the last token. Extract it, strip any leading or trailing whitespace, and add it to tokens.

Test cases for the tokenizer:

text = 'Mr. Wayne isn\'t the hero we need, but "the one" we deserve.'
print(tokenize(text))

text = 'Ms. Wayne is "Batgirl" but not "the one".'
print(tokenize(text))
['Mr.', 'Wayne', 'is', "n't", 'the', 'hero', 'we', 'need', ',', 'but', '"', 'the', 'one', '"', 'we', 'deserve', '.']
['Ms.', 'Wayne', 'is', '"', 'Batgirl', '"', 'but', 'not', '"', 'the', 'one', '"', '.']

Q10: What are the benefits and limitations of using regular expressions for tokenization vs. the rule-based tokenization approach discussed in the previous section?

References

  • Source: regular_expressions.py

  • Online Interpreter: Regular Expressions 101

  • Regular Expression HOWTO, Kuchling, HOWTOs in Python Documentation

Vector Space Models

A vector space model is a computational framework to represent text documents as vectors in a high-dimensional space such that each document is represented as a vector, and each dimension of the vector corresponds to a particular term in the vocabulary.
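As a toy illustration (the vocabulary and documents below are made up), two short documents can be represented as count vectors over a shared vocabulary:

# Toy illustration: documents as term-count vectors over a shared vocabulary.
vocab = ['aslan', 'lion', 'narnia', 'witch']   # one dimension per term
d1 = 'aslan the lion of narnia'.split()
d2 = 'the witch of narnia'.split()

v1 = [d1.count(term) for term in vocab]   # [1, 1, 1, 0]
v2 = [d2.count(term) for term in vocab]   # [0, 0, 1, 1]
print(v1, v2)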

Contents

  • Bag-of-Words Model

  • Term Weighting

  • Similarity Metrics

  • Document Classification

  • Homework

Smoothing

Unigram Smoothing

Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds the count of 1 to every word and recalculates the unigram probabilities:

The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.

Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

  • L3: Define a constant representing the unknown word.

  • L7: Increment the total count by the vocabulary size.

  • L8: Increment each unigram count by 1.

  • L9: Add the unknown word to the unigrams with a probability of 1 divided by the total count.

Compared to the unigram results without smoothing (see the "Comparison" tab above), the probabilities for these top unigrams have slightly decreased.

Q4: When applying Laplace smoothing, do unigram probabilities always decrease? If not, what conditions can cause a unigram's probability to increase?

The unigram probability of any word (including unknown) can be retrieved using the UNKNOWN key:

  • L2: Test a known word, 'Aslan', and an unknown word, 'Jinho'.

Bigram Smoothing

The bigram model can also be enhanced by applying Laplace smoothing:

Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

  • L11: Calculate the total count of all bigrams with the same previous word.

  • L12: Calculate and store the probabilities of each current word given the previous word

  • L13: Calculate the probability for an unknown current word.

  • L16: Add a probability for an unknown previous word.

We then test bigram_smoothing() with the same text file:

Finally, we test the bigram estimation using smoothing for unknown sequences:

  • L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.

  • L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.

Normalization

However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:

Q6: Why is it problematic when bigram probabilities following a given word don't sum to 1?

Reference

Homework

HW1: Text Processing

Task 1: Chronicles of Narnia

Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.

Download the chronicles_of_narnia.txt file and place it under the dat/ directory. The text file is pre-tokenized using the ELIT Tokenizer, and each token is separated by whitespace.

Data Collection

For each book, gather the following details:

  • Book Title (preserve exact spacing as shown in text)

  • Year of Publishing (indicated in the title)

For each chapter within every book, collect the following information:

  • Chapter Number (as Arabic numeral)

  • Chapter Title

  • Token Count of the Chapter Content

    • Each word, punctuation mark, and symbol counts as a separate token.

    • Count begins after chapter title and ends at next chapter heading or book end. Do not include chapter number and chapter title in count.

Implementation

Create a text_processing.py file in the src/homework/ directory.

  1. Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:

    • Takes a file path pointing to the text file.

    • Returns a dictionary with the structure shown below.

    • Books must be stored as key-value pairs in the main dictionary.

    • Chapters must be stored as lists within each book's dictionary.

    • Chapter lists must be sorted by chapter number in ascending order.

    Expected dictionary structure:

{
  'The Lion , the Witch and the Wardrobe': {
    'title': 'The Lion , the Witch and the Wardrobe',
    'year': 1950,
    'chapters': [
      {
        'number': 1,
        'title': 'Lucy Looks into a Wardrobe',
        'token_count': 1915
      },
      {
        'number': 2,
        'title': 'What Lucy Found There',
        'token_count': 2887
      },
      ...
    ]
  },
  'Prince Caspian : The Return to Narnia': {
    'title': 'Prince Caspian : The Return to Narnia',
    'year': 1951,
    'chapters': [
      ...
    ]
  },
  ...
}

Task 2: Regular Expressions

Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:

  • Format:

    • username@hostname.domain

  • Username and Hostname:

    • Can contain letters, numbers, period (.), underscore (_), hyphen (-).

    • Must start and end with letter/number.

  • Domain:

    • Limited to com, org, edu, and gov.

  • Formats:

    • YYYY/MM/DD or YY/MM/DD

    • YYYY-MM-DD or YY-MM-DD

  • Year:

    • 4 digits: between 1951 and 2050

    • 2 digits: for 1951 - 2050

  • Month:

    • 1 - 12 (can be with/without leading zero)

  • Day:

    • 1 - 31 (can be with/without leading zero)

    • Must be valid for the given month.

  • Format:

    • protocol://address

  • Protocol:

    • http or https (only)

  • Address:

    • Can contain letters, hyphen, dots.

    • Must start with letter/number.

    • Must include at least one dot.

  • Formats:

    • Single author: Lastname, YYYY (e.g., Smith, 2023)

    • Two authors: Lastname 1 and Lastname 2, YYYY (e.g., Smith and Jones, 2023)

    • Multiple authors: Lastname 1 et al., YYYY (Smith et al., 2023)

  • Lastnames must be capitalized and may consist of multiple words.

  • Year must be between 1900-2024.

Submission

Commit and push the text_processing.py file to your GitHub repository.

Rubric

  • Task 1: Chronicles of Narnia (7 points)

  • Task 2: Regular Expressions (3 points)

  • Concept Quiz (2 points)


The n-gram models estimated in the previous section face a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.

Thus, the probability of any unknown word w_* with Laplace smoothing is calculated as follows:

L1: Import the unigram_count() function from the src.ngram_models package.

We then test unigram_smoothing() with a text file, dat/chronicles_of_narnia.txt:

L1: Import test_unigram() from the ngram_models package.

L2: Use get() to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.

Thus, the probability of an unknown bigram (w_{u-1}, w_*), where w_{u-1} is known but w_* is unknown, is calculated as follows:

Q5: What does the Laplace smoothed bigram probability of (w_{u-1}, w_u) represent when w_{u-1} is unknown, and what is a potential problem with this estimation?

L1: Import the bigram_count() function from the src.ngram_models package.

L5: Create a set vocab containing all unique w_{i-1}.

L6-7: Add all unique w_i to vocab.

L1: Import the test_bigram() function from the ngram_models package.

L3: The tuple word is unpacked and passed as the second and third parameters.

Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a word is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:

There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that v = 7. Before Laplace smoothing, the bigram probabilities of (w_{i-1} = \textit{You}, w_i = *) are estimated as follows:

The bigram distribution for w_{i-1} can be normalized to 1 by adding the total number of word types occurring after w_{i-1}, denoted as |V_i|, to the denominator instead of v:

Consequently, the probability of an unknown bigram (w_{u-1}, w_*) can be calculated with the normalization as follows:

For the above example, |V_i| = |\{\textit{are}, \textit{and}\}| = 2. Once you apply |V_i| to P_{\mathcal{L}}(*|\textit{You}), the sum of its bigram probabilities becomes 1:

A major drawback of this normalization is that the probability cannot be measured when w_{u-1} is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability of (w_*, w_u), where the previous word is unknown, as follows:

Source: smoothing.py

Additive smoothing, Wikipedia


P_{\mathcal{L}}(w_i) = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} (\#(w_k) + 1)} = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}

P_{\mathcal{L}}(w_*) = \frac{1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}

\sum_{i=1}^v P(w_i) = \sum_{i=1}^v P_{\mathcal{L}}(w_i) = 1
from src.ngram_models import unigram_count, Unigram

UNKNOWN = ''

def unigram_smoothing(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values()) + len(counts)
    unigrams = {word: (count + 1) / total for word, count in counts.items()}
    unigrams[UNKNOWN] = 1 / total
    return unigrams
from src.ngram_models import test_unigram

corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_smoothing)
         I 0.010225
     Aslan 0.001796
      Lucy 0.001762
    Edmund 0.001369
    Narnia 0.001339
   Caspian 0.001300
      Jill 0.001226
     Peter 0.001005
    Shasta 0.000902
    Digory 0.000899
   Eustace 0.000853
     Susan 0.000636
    Tirian 0.000585
     Polly 0.000533
    Aravis 0.000523
      Bree 0.000479
Puddleglum 0.000479
    Scrubb 0.000469
    Andrew 0.000396
   Unigram  With Smoothing   W/O Smoothing
         I        0.010225        0.010543
     Aslan        0.001796        0.001850
      Lucy        0.001762        0.001815
    Edmund        0.001369        0.001409
    Narnia        0.001339        0.001379
   Caspian        0.001300        0.001338
      Jill        0.001226        0.001262
     Peter        0.001005        0.001034
    Shasta        0.000902        0.000928
    Digory        0.000899        0.000925
   Eustace        0.000853        0.000877
     Susan        0.000636        0.000654
    Tirian        0.000585        0.000601
     Polly        0.000533        0.000547
    Aravis        0.000523        0.000537
      Bree        0.000479        0.000492
Puddleglum        0.000479        0.000492
    Scrubb        0.000469        0.000482
    Andrew        0.000396        0.000406
def smoothed_unigram(probs: Unigram, word: str) -> float:
    return probs.get(word, unigram[UNKNOWN])
unigram = unigram_smoothing(corpus)
for word in ['Aslan', 'Jinho']:
    print(f'{word} {smoothed_unigram(unigram, word):.6f}')
Aslan 0.001796
Jinho 0.000002
P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V|}

P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{i}} \#(w_{u-1},w_k) + |V|}
from src.ngram_models import bigram_count, Bigram

def bigram_smoothing(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    vocab = set(counts.keys())
    for _, css in counts.items():
        vocab.update(css.keys())

    bigrams = dict()
    for prev, ccs in counts.items():
        total = sum(ccs.values()) + len(vocab)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d

    bigrams[UNKNOWN] = 1 / len(vocab)
    return bigrams
from src.ngram_models import test_bigram

corpus = 'dat/chronicles_of_narnia.txt'
test_bigram(corpus, bigram_smoothing)
I
        'm 0.020590
        do 0.019136
       've 0.011143
       was 0.010598
      have 0.009629
        am 0.009084
       'll 0.008236
     think 0.008115
        'd 0.006661
      know 0.006540
the
      same 0.008403
     other 0.007591
      King 0.007096
     Witch 0.006673
     whole 0.005119
    others 0.005084
     first 0.004978
     Dwarf 0.004872
      door 0.004837
     great 0.004837
said
       the 0.039038
         , 0.018270
      Lucy 0.014312
    Edmund 0.011206
   Caspian 0.010049
     Peter 0.009805
      Jill 0.008709
         . 0.008648
    Digory 0.007734
     Aslan 0.007491
def smoothed_bigram(probs: Bigram, prev: str, curr: str) -> float:
    d = probs.get(prev, None)
    return probs[UNKNOWN] if d is None else d.get(curr, d[UNKNOWN])
bigram = bigram_smoothing(corpus)
for word in [('Aslan', 'is'), ('Aslan', 'Jinho'), ('Jinho', 'is')]:
    print(f'{word} {smoothed_bigram(bigram, *word):.6f}')
('Aslan', 'is') 0.001146
('Aslan', 'Jinho') 0.000076
('Jinho', 'is') 0.000081
You are a student
You and I are students

\begin{align*}
P(\textit{are}|\textit{You}) = P(\textit{and}|\textit{You}) &= 1/2 \\
P(\textit{are}|\textit{You}) + P(\textit{and}|\textit{You}) &= 1
\end{align*}

\begin{align*}
P_{\mathcal{L}}(\textit{are}|\textit{You}) = P_{\mathcal{L}}(\textit{and}|\textit{You}) &= (1+1)/(2+7) = 2/9 \\
P_{\mathcal{L}}(\textit{are}|\textit{You}) + P_{\mathcal{L}}(\textit{and}|\textit{You}) &= 4/9
\end{align*}

P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V_i|}

P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{i}} \#(w_{u-1},w_k) + |V_i|}

\begin{align*}
P_{\mathcal{L}}(\textit{are}|\textit{You}) = P_{\mathcal{L}}(\textit{and}|\textit{You}) &= (1+1)/(2+2) = 1/2 \\
P_{\mathcal{L}}(\textit{are}|\textit{You}) + P_{\mathcal{L}}(\textit{and}|\textit{You}) &= 1
\end{align*}

P_{\mathcal{L}}(w_u|w_*) = \min(\{P_{\mathcal{L}}(w_*|w_k) : \forall w_k \in V\})
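A minimal sketch of this normalized variant is shown below, reusing the bigram_count() interface from the code above; it is an illustration of the technique, not the book's reference smoothing.py:

from src.ngram_models import bigram_count, Bigram

UNKNOWN = ''

def bigram_smoothing_normalized(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        # Normalize by |V_i|, the number of word types observed after prev, instead of |V|.
        total = sum(ccs.values()) + len(ccs)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d

    # For an unknown previous word, use the minimum unknown probability across all bigrams.
    bigrams[UNKNOWN] = min(d[UNKNOWN] for d in bigrams.values())
    return bigrams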

Term Weighting

Term Frequency

The frequency of a word www's occurrences in a document DDD is called the Term Frequency (TF) of w∈Dw \in Dw∈D. TF is often used to determine the importance of a term within a document such that terms that appear more frequently are considered more relevant to the document's content.

However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:

from typing import Callable

from src.bag_of_words_model import Document

def read_corpus(filename: str, tokenizer: Callable[[str], list[str]] | None = None) -> list[Document]:
    if tokenizer is None: tokenizer = lambda s: s.split()
    with open(filename) as fin:
        return [tokenizer(line) for line in fin]
from src.bag_of_words_model import vocabulary

corpus = read_corpus('dat/chronicles_of_narnia.txt')
vocab = vocabulary(corpus)
from src.bag_of_words_model import bag_of_words, Document, Vocab

def print_tfs(vocab: Vocab, documents: list[Document]):
    tfs = [bag_of_words(vocab, document) for document in documents]
    words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
    for tf in tfs:
        print([(words[index], count) for index, count in sorted(tf.items(), key=lambda x: x[1], reverse=True)])
  • L5: The underscore (_) is used to indicate that the variable is not being used in the loop.

At last, let us print the TFs of all terms in the following three documents:

from elit_tokenizer import EnglishTokenizer

ds = [
    "As dawn broke, the first light kissed the golden mane of Aslan, the rightful king of Narnia.",
    "The White Witch's icy breath froze the once lush meadows, casting a shadow over Narnia.",
    "Lucy's footsteps echoed in the halls of Cair Paravel, where legends were born."
]

etok = EnglishTokenizer()
documents = [etok.decode(d).tokens  for d in ds]
print_tfs(vocab, documents)
[('the', 3), (',', 2), ('of', 2), ('.', 1), ('As', 1), ('Aslan', 1), ('Narnia', 1), ('broke', 1), ('dawn', 1), ('first', 1), ('golden', 1), ('king', 1), ('kissed', 1), ('light', 1), ('mane', 1), ('rightful', 1)]
[("'s", 1), (',', 1), ('.', 1), ('Narnia', 1), ('The', 1), ('White', 1), ('Witch', 1), ('a', 1), ('breath', 1), ('casting', 1), ('froze', 1), ('icy', 1), ('once', 1), ('over', 1), ('shadow', 1), ('the', 1)]
[("'s", 1), (',', 1), ('.', 1), ('Cair', 1), ('Lucy', 1), ('Paravel', 1), ('born', 1), ('echoed', 1), ('footsteps', 1), ('halls', 1), ('in', 1), ('legends', 1), ('of', 1), ('the', 1), ('were', 1), ('where', 1)]

In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.

Q3: If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?

Stopwords

One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.

from string import punctuation

stopwords = {line.strip().lower() for line in open('dat/stopwords.txt')}
is_stopwords = lambda w: w.lower() in stopwords or w in punctuation

Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:

sw_tokenizer = lambda s: [word for word in s.split() if not is_stopwords(word)]
corpus = read_corpus('dat/chronicles_of_narnia.txt', sw_tokenizer)
vocab = vocabulary(corpus)

Finally, let us print the TFs of the same documents using the updated vocabulary:

print_tfs(vocab, documents)
[('Aslan', 1), ('Narnia', 1), ('broke', 1), ('dawn', 1), ('golden', 1), ('king', 1), ('kissed', 1), ('light', 1), ('mane', 1), ('rightful', 1)]
[("'s", 1), ('Narnia', 1), ('White', 1), ('Witch', 1), ('breath', 1), ('casting', 1), ('froze', 1), ('icy', 1), ('shadow', 1)]
[("'s", 1), ('Cair', 1), ('Lucy', 1), ('Paravel', 1), ('born', 1), ('echoed', 1), ('footsteps', 1), ('halls', 1), ('legends', 1)]

Q4: Stop words can be filtered either during the creation of vocabulary dictionary or when generating the bag-of-words representations. Which approach is preferable and why?

Document Frequency

Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.

Document Frequency (DF) is a measure to quantify how often a term appears across a set of documents within a corpus such that it represents the number of documents within the corpus that contain a particular term.

Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:

from collections import Counter
from src.types import SparseVector

def document_frequencies(vocab: Vocab, corpus: list[Document]) -> SparseVector:
    counts = Counter()
    for document in corpus:
        counts.update(set(document))
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}

We then compare the term and document frequencies of all terms in the above documents:

corpus = read_corpus('dat/chronicles_of_narnia.txt')
vocab = vocabulary(corpus)
words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]

dfs = document_frequencies(vocab, corpus)
for document in documents:
    bow = bag_of_words(vocab, document)
    tf_df = [(words[tid], tf, dfs[tid]) for tid, tf in sorted(bow.items())]
    tf_df = sorted(tf_df, key=lambda x: (-x[1], x[2]))
    print(' '.join(document))
    print('\n'.join(['{:>10}  {}  {:>5}'.format(*t) for t in tf_df]))
  • L9: Sort the list of tuples by the second item first in descending order then by the third item in ascending order.

As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
       the  3   9574
        of  2   5355
         ,  2  10578
  rightful  1      1
      dawn  1      6
    kissed  1     26
     broke  1     35
      king  1     40
      mane  1     40
    golden  1     52
        As  1    161
     light  1    203
     first  1    401
    Narnia  1    512
     Aslan  1    706
         .  1  19747
The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
   casting  1      1
     froze  1      2
       icy  1      3
    shadow  1     38
     White  1     44
    breath  1     86
     Witch  1    246
      once  1    378
      over  1    431
    Narnia  1    512
       The  1   1352
        's  1   2404
         a  1   5456
       the  1   9574
         ,  1  10578
         .  1  19747
Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
 footsteps  1      1
   legends  1      1
    echoed  1      2
     halls  1      4
      born  1     14
   Paravel  1     84
      Cair  1     86
     where  1    360
      Lucy  1    704
      were  1   1684
        's  1   2404
        in  1   3513
        of  1   5355
       the  1   9574
         ,  1  10578
         .  1  19747

Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, it is possible to estimate more semantically important term scores through appropriate weighting between these two types of frequencies.

Q5: What are the implications when a term has a high document frequency?

TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).

Given a term t in a document d ∈ D, where D is the set of all documents in a corpus, its TF-IDF score can be measured as follows:

\begin{align*}
\mathbf{TF\text{-}IDF}(t, d, D) &= \mathbf{TF}(t, d) \cdot \mathbf{IDF}(t, D) \\
\mathbf{TF}(t, d) &= \frac{\#(t, d)}{|d|} \\
\mathbf{IDF}(t, D) &= \log\frac{|D|}{\mathbf{DF}(t, D)}
\end{align*}

In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.

Let us define a function that takes a vocabulary dictionary, a DF dictionary, the size of all documents, and a document, and returns the TF-IDF scores of all terms in the document:

import math

def tf_idf(vocab: Vocab, dfs: SparseVector, D: int, document: Document) -> SparseVector:
    tf = lambda count: count / len(document)
    idf = lambda tid: math.log(D / dfs[tid])
    return {tid: tf(count) * idf(tid) for tid, count in bag_of_words(vocab, document).items()}

We then compute the TF-IDF scores of terms in the above documents:

for document in documents:
    tfidf = tf_idf(vocab, dfs, len(corpus), document)
    print(' '.join(document))
    print('\n'.join(['{:>10}  {:.2f}'.format(words[tid], score) for tid, score in sorted(tfidf.items(), key=lambda x: x[1], reverse=True)]))
As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
  rightful  0.50
      dawn  0.41
    kissed  0.34
     broke  0.32
      king  0.32
      mane  0.32
    golden  0.30
        As  0.25
     light  0.24
     first  0.20
    Narnia  0.19
     Aslan  0.17
        of  0.14
       the  0.13
         ,  0.08
         .  0.01
The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
   casting  0.56
     froze  0.52
       icy  0.50
    shadow  0.35
     White  0.35
    breath  0.31
     Witch  0.25
      once  0.23
      over  0.22
    Narnia  0.21
       The  0.16
        's  0.12
         a  0.08
       the  0.05
         ,  0.04
         .  0.01
Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
 footsteps  0.63
   legends  0.63
    echoed  0.58
     halls  0.54
      born  0.46
   Paravel  0.35
      Cair  0.35
     where  0.26
      Lucy  0.22
      were  0.16
        's  0.14
        in  0.12
        of  0.09
       the  0.05
         ,  0.05
         .  0.01

Several variants of TF-IDF have been proposed to enhance the representation in certain contexts (two of them are sketched in code after the list):

  • Sublinear scaling on TF: \begin{cases} 1 + \log\mathbf{TF}(t,d) & \text{if}\ \mathbf{TF}(t,d) > 0 \\ 0 & \text{otherwise} \end{cases}

  • Normalized TF: \alpha + (1-\alpha)\frac{\mathbf{TF}(t,d)}{\max(\mathbf{TF}(*,d))}

  • Normalized IDF: \frac{\max(\mathbf{IDF}(*,D))}{\mathbf{IDF}(t,D)}

  • Probabilistic IDF: \frac{|D| - \mathbf{DF}(t,D)}{\mathbf{DF}(t,D)}
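As a minimal sketch, two of these variants (sublinear TF scaling and probabilistic IDF) could be combined as follows, reusing vocab, dfs, and bag_of_words() from above; tf_idf_variant is a hypothetical name, not part of the original code:

import math

def tf_idf_variant(vocab: Vocab, dfs: SparseVector, D: int, document: Document) -> SparseVector:
    tf = lambda count: 1 + math.log(count) if count > 0 else 0.0   # sublinear TF scaling
    idf = lambda tid: (D - dfs[tid]) / dfs[tid]                    # probabilistic IDF
    return {tid: tf(count) * idf(tid) for tid, count in bag_of_words(vocab, document).items()}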

Q6: Should we still apply stop words when using TF-IDF scores to represent the documents?

References

Document Similarity

Euclidean Similarity

Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space such that it represents the magnitude of the differences between the two vectors.

  • L6: The expression x ** k raises x to the power of k (here, squaring each difference).

We then measure the Euclidean distance between the two vectors above:

Cosine Similarity

Cosine similarity is a measure of similarity between two vectors in an inner product space such that it calculates the cosine of the angle between two vectors, where a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular to each other).

The cosine similarity between two vectors can be measured as follows:

Let us define a function that takes two sparse vectors and returns the cosine similarity between them:

We then measure the cosine similarity between the two vectors above:

The following diagram illustrates the difference between the two metrics. The Euclidean distance measures the magnitude between two vectors, while the Cosine similarity measures their angle to the origin.

Q7: Why is Cosine Similarity generally preferred over Euclidean Distance in most NLP applications?

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.

Sequence Probability

Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I'm a teacher," and selecting the one with the highest probability.

The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:

The joint probability of the word sequence "I am a student" can be measured as follows:

Q7: What are the key differences between conditional and joint probabilities in sequence modeling, and how are they practically applied?

Chain Rule

By applying the chain rule, the above joint probability can be decomposed into:

Markov Assumption

The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:

The joint probability can now be measured by the product of unigram and bigram probabilities.

Q8: How do the Chain Rule and Markov Assumption simplify the estimation of sequence probability?

Initial Word Probability

This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.

This enhancement allows us to elaborate the sequence probability as a simple product of bigram probabilities:

Multiplying many probabilities is often numerically problematic: it is slow, and the product of many small values can underflow the precision limits of floating-point representation. In practice, logarithmic probabilities are computed instead:
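In code, this amounts to summing log-probabilities rather than multiplying raw probabilities. The following is a minimal sketch, assuming bigrams is a dictionary of bigram probabilities such as the one built in the N-gram Models section; the '<s>' initial token and the small fallback constant for unseen bigrams are assumptions for illustration only:

import math

def sequence_log_prob(bigrams: dict, words: list[str], init: str = '<s>') -> float:
    prev, score = init, 0.0
    for word in words:
        prob = bigrams.get(prev, {}).get(word, 1e-10)   # tiny fallback for unseen bigrams
        score += math.log(prob)
        prev = word
    return score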

References

Tokenization

Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.

Q3: What is the difference between a word and a token?

  • "R1: → ['"', "R1", ":"]

  • (R&D) → ['(', 'R&D', ')']

  • 15th-largest → ['15th', '-', 'largest']

  • Atlanta, → ['Atlanta', ',']

  • Department's → ['Department', "'s"]

  • activity"[26] → ['activity', '"', '[26]']

  • centers.[21][22] → ['centers', '.', '[21]', '[22]']

Depending on the task, you may want to tokenize "[26]" into ['[', '26', ']'] for more generalization. In this case, however, we consider "[26]" as a unique identifier for the corresponding reference rather than as the number "26" surrounded by square brackets. Thus, we aim to recognize it as a single token.

Delimiters

Let us write the delimit() function that takes a word and a set of delimiters, and returns a list of tokens by splitting the word using the delimiters:

  • L3: If no delimiter is found, return a list containing word as a single token.

  • L4: If a delimiter is found, create a list tokens to store the individual tokens.

  • L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.

  • L7: Add the delimiter itself as a separate token to tokens.

Let us define a set of delimiters and test delimit() using various input:

Q4: All delimiters used in our implementation are punctuation marks. What types of tokens should not be split by such delimiters?

Post-Processing

When reviewing the output of delimit(), the first four test cases yield accurate results, while the last five are not handled properly; they should have been tokenized as follows:

  • Department's → ['Department', "'s"]

  • activity"[26] → ['activity', '"', '[26]']

  • centers.[21][22] → ['centers', '.', '[21]', '[22]']

  • 149,000 → ['149,000']

  • U.S. → ['U.S.']

To handle these special cases, let us post-process the tokens generated by delimit():

  • L2: Initialize variables i for the current position and new_tokens for the resulting tokens.

  • L4: Iterate through the input tokens.

  • L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).

    • L6: Combine the apostrophe and "s" and append it as a single token.

    • L7: Move the position indicator by 1 to skip the next character.

  • L8-10: Case 2: Handling numbers in special formats like [##], ###,### (e.g., [42], 12,345).

    • L11: Combine the special number format and append it as a single token.

    • L12: Move the position indicator by 2 to skip the next two characters.

  • L13: Case 3: Handling acronyms like "U.S.".

    • L14: Combine the acronym and append it as a single token.

    • L15: Move the position indicator by 3 to skip the next three characters.

  • L17: Case 4: If none of the special cases above are met, append the current token.

  • L18: Move the position indicator by 1 to process the next token.

  • L20: Return the list of processed tokens.

Once the post-processing is applied, all outputs are now handled properly:

Q5: Our tokenizer uses hard-coded rules to handle specific cases. What would be a scalable approach to handling more diverse cases?

Tokenizing

Finally, let us write tokenize() that takes a path to a corpus and a set of delimiters, and returns a list of tokens from the corpus:

  • L2: Read the corpus file.

  • L3: Split the text into words.

Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.

Q6: The use of a more advanced tokenizer mitigates the issue of sparsity. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?

References

N-gram Models

An n-gram is a contiguous sequence of n items from text data. These items can be:

  • Words (most common in language modeling)

  • Characters (useful for morphological analysis)

  • Subword tokens (common in modern NLP)

  • Bytes or other units

For the sentence "I'm a computer scientist.", we can extract different n-grams (a small extraction sketch follows the list):

  • 1-gram (unigram): {"I'm", "a", "computer", "scientist."}

  • 2-gram (bigram): {"I'm a", "a computer", "computer scientist."}

  • 3-gram (trigram): {"I'm a computer", "a computer scientist."}
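A minimal sketch of extracting word n-grams from a whitespace-tokenized sentence:

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I'm a computer scientist.".split()
print(ngrams(tokens, 2))   # [("I'm", 'a'), ('a', 'computer'), ('computer', 'scientist.')]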

Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?

Unigram Estimation

Where:

  • The denominator represents the total word count.

Let us define the Unigram type:

  • L3: A dictionary where the key is a unigram and the value is its probability.

Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:

We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:

  • L3: Calculate the total count of all unigrams in the text.

  • L4: Return a dictionary where each word is a key and its probability is the value.

  • L3: The second argument accepts a function that takes a string and returns a Unigram.

  • L4: Call the estimator with the text file and store the result in unigrams.

  • L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.

  • L7: Iterate through the top 300 unigrams with the highest probabilities.

  • L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).

  • L2: Pass the unigram_estimation() function as the second argument.

This distribution shows the most common unigrams in the text meeting the conditions in L8, dominated by the first-person pronoun "I", followed by proper nouns - specifically character names such as "Aslan", "Lucy", and "Edmund".

Bigram Estimation

Where:

Let us define the Bigram type:

Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:

  • L5: Create a defaultdict with Counters as default values to store bigram frequencies.

  • L9: Iterate through the words, starting from the second word (index 1) in each line.

  • L10: Update the frequency of the current bigram.

We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:

  • L8: Calculate the total count of all bigrams with the same previous word.

  • L9: Calculate and store the probabilities of each current word given the previous word.

Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:

  • L2: Call the bigram_estimation() function with the text file and store the result.

  • L5: Create a bigram list given the previous word, sorted by probability in descending order.

  • L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.

Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?

References

L3: The function signature uses union types, callable objects, and default arguments.

L5: Define a default tokenizer function using a lambda expression.

We then retrieve documents in dat/chronicles_of_narnia.txt and create a vocabulary dictionary:

Let us define a function that takes a vocabulary dictionary and a list of documents, and prints the TFs of all terms in each document using the bag-of-words model:

Let us retrieve a set of commonly used stop words from dat/stopwords.txt and define a function to determine if a term should be considered a stop word:

L1: .

L11: The tuple t is unpacked into three arguments and passed to the format() function.

Source:

Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the corpus:

Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the Euclidean distance and the cosine similarity.

Let V_i and V_j be two vectors representing documents D_i and D_j, respectively. The Euclidean distance between the two vectors can be measured as follows:

Let us define a function that takes two vectors in our notation and returns the Euclidean distance between them:

The Euclidean distance between two identical vectors is 0 (L1). Interestingly, the distance between v_0 and v_1 is shorter than the distance between v_0 and v_2, implying that v_0 is more similar to v_1 than to v_2, which contradicts our intuition.

The cosine similarity between two identical vectors is 1, although it is calculated as 0.99 due to floating-point precision (L1). Similar to the Euclidean distance case, the similarity between v_0 and v_1 is greater than the similarity between v_0 and v_2, which again contradicts our intuition.

In the , you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.

Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the chain rule and the Markov assumption.

Thus, the probability of any given word sequence can be measured as:

The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring P(w_n|w_1^{n-1}) is as challenging as measuring P(w_1^n).

Let us consider the unigram probabilities P(the) and P(The). In general, "the" appears more frequently than "The" such that:

Let w_0 be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial word of the text, denoted as P(the|w_0) and P(The|w_0), respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

Thus, to predict a more probable initial word, it is better to consider the bigram probability P(w_1|w_0) rather than the unigram probability P(w_1) when measuring sequence probability:

Q9: Is it worth considering the end of the text by introducing another artificial token, w_{n+1}, to improve last-word prediction by multiplying the above product with P(w_{n+1}|w_n)?

, Wikipedia

When examining the from the previous section, you notice several words that need further tokenization, where many of them can be resolved by leveraging punctuation:

L1: .

L2: Find the index of the first character in word that is in a of delimiters (, ). If no delimiter is found in word, return -1 ().

L9-10: If there are characters after the delimiter, call delimit() recursively on the remaining part of word and the tokens list with the result.

L1: .

L17: Iterate over the two lists, input and output, in parallel using the zip() function.

L18:

L4: Tokenize each word in the corpus using the specified delimiters. postprocess() is applied to further handle the special cases. The resulting tokens are collected in a list comprehension and returned.

Given the new tokenizer, let us recount word types in the corpus, dat/emory-wiki.txt, and save them:

L2: Import save_output() from the src.frequency_analysis module.

L13: Save the tokenized output to dat/word_types-token.txt.

Source:

- a heuristic-based tokenizer

In the above example, "I'm" and "scientist." are recognized as individual tokens, whereas they should have been tokenized as ["I", "'m"] and ["scientist", "."].

Given a large corpus, a unigram model assumes word independence and calculates the probability of each word as:

\#(w_i): the count of word w_i in the corpus.

V: the vocabulary (set of all unique words).

L1: .

Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file :

L1: Import the type from the typing module.

Q2: What advantages do unigram probabilities have over ?

A bigram model calculates the conditional probability of the current word w_i given the previous word w_{i-1} as follows (\#(w_{i-1},w_i): the total occurrences of (w_{i-1},w_i) in the corpus in that order, V_i: the set of all word types occurring after w_{i-1}):

\#(w_{i-1},w_i): the total occurrences of (w_{i-1},w_i) in the corpus in that order.

V_i: the set of all word types occurring after w_{i-1}.

A dictionary where the key is the previous word w_{i-1} and the value is either a nested dictionary representing the unigram distribution of all current words w_i given w_{i-1}, or a float for special entries such as the unknown probability.

L1: Import the Counter and defaultdict classes from the collections package.

L2: Import the Bigram type from the src.types package.

Source: .

N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.

union types
callable objects
default arguments
lambda
chronicles_of_narnia.txt
bag of words model
stopwords.txt
string.punctuation
unpacked
format()
term_weighting.py
from src.bag_of_words_model import vocabulary
from src.term_weighting import read_corpus, document_frequencies, tf_idf

if __name__ == '__main__':
    corpus = read_corpus('dat/chronicles_of_narnia.txt')
    vocab = vocabulary(corpus)
    dfs = document_frequencies(vocab, corpus)
    D = len(corpus)

    documents = [
        'I like this movie very much'.split(),
        'I hate this movie very much'.split(),
        'I love this movie so much'.split()
    ]

    vs = [tf_idf(vocab, dfs, D, document) for document in documents]
    for v in vs: print(v)
{980: 0.31, 7363: 0.52, 7920: 0.70, 11168: 0.51, 11833: 0.51}
{980: 0.31, 6423: 1.24, 7920: 0.70, 10325: 0.53, 11168: 0.51}
V_i = [v_{i1}, \dots, v_{in}]
V_j = [v_{j1}, \dots, v_{jn}]
D_i
D_j
\lVert V_i - V_j \rVert = \sqrt{\sum_{k=1}^n (v_{ik} - v_{jk})^2}
import math
from src.bag_of_words_model import SparseVector

def euclidean_distance(v1: SparseVector, v2: SparseVector) -> float:
    d = sum((v - v2.get(k, 0)) ** 2 for k, v in v1.items())
    d += sum(v ** 2 for k, v in v2.items() if k not in v1)
    return math.sqrt(d)
print(euclidean_distance(vs[0], vs[0]))
print(euclidean_distance(vs[0], vs[1]))
print(euclidean_distance(vs[0], vs[2]))
0.0
1.347450458032576
1.3756015678855296
v_0
v_1
v_0
v_2
v_1
v_0
v_2
\frac{V_i \cdot V_j}{\lVert V_i \rVert \lVert V_j \rVert} = \frac{\sum_{\forall k} (v_{ik} \cdot v_{jk})}{\sqrt{\sum_{\forall k} (v_{ik})^2} \cdot \sqrt{\sum_{\forall k} (v_{jk})^2}}
def cosine_similarity(v1: SparseVector, v2: SparseVector) -> float:
    n = sum(v * v2.get(k, 0) for k, v in v1.items())
    d = math.sqrt(sum(v ** 2 for k, v in v1.items()))
    d *= math.sqrt(sum(v ** 2 for k, v in v2.items()))
    return n / d
print(cosine_similarity(vs[0], vs[0]))
print(cosine_similarity(vs[0], vs[1]))
print(cosine_similarity(vs[0], vs[2]))
0.9999999999999999
0.5775130451716284
0.4826178600593854
v_0
v_1
v_0
v_2
P(student|I,am,a) = \frac{P(I,am,a,student)}{P(I,am,a)}
P(I,am,a,student) = \frac{\#(I,am,a,student)}{\sum_{\forall a}\sum_{\forall b}\sum_{\forall c}\sum_{\forall d}\#(w_a,w_b,w_c,w_d)}
P(I,am,a,student) = P(I) \cdot P(am|I) \cdot P(a|I,am) \cdot P(student|I,am,a)
w_1^n = (w_1, \ldots, w_n)
P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1})
P(w_n|w_1^{n-1})
P(w_1^n)
P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})
P(the)
P(The)
P(the) > P(The)
w_0
P(the|w_0)
P(The|w_0)
P(the|w_0) < P(The|w_0)
P(w_1|w_0)
P(w_1)
P(w_1^n) = P(w_1|w_0) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})
P(w_1^n) = \prod_{i=1}^n P(w_i|w_{i-1})
\log P(w_1^n) = \log\Big(\prod_{i=1}^n P(w_i|w_{i-1})\Big) = \sum_{i=1}^n \log P(w_i|w_{i-1})
w_{n+1}
P(w_{n+1}|w_n)
def delimit(word: str, delimiters: set[str]) -> list[str]:
    i = next((i for i, c in enumerate(word) if c in delimiters), -1)
    if i < 0: return [word]
    tokens = []

    if i > 0: tokens.append(word[:i])
    tokens.append(word[i])

    if i + 1 < len(word):
        tokens.extend(delimit(word[i + 1:], delimiters))

    return tokens
delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}

input = [
    '"R1:',
    '(R&D)',
    '15th-largest',
    'Atlanta,',
    "Department's",
    'activity"[26]',
    'centers.[21][22]',
    '149,000',
    'U.S.'
]

output = [delimit(word, delims) for word in input]

for word, tokens in zip(input, output):
    print('{:<16} -> {}'.format(word, tokens))
"R1: -> ['"', 'R1', ':']
(R&D) -> ['(', 'R&D', ')']
15th-largest -> ['15th', '-', 'largest']
Atlanta, -> ['Atlanta', ',']
Department's -> ['Department', "'", 's']
activity"[26] -> ['activity', '"', '[', '26', ']']
centers.[21][22] -> ['centers', '.', '[', '21', ']', '[', '22', ']']
149,000 -> ['149', ',', '000']
U.S. -> ['U', '.', 'S', '.']
def postprocess(tokens: list[str]) -> list[str]:
    i, new_tokens = 0, []

    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == "'" and tokens[i + 1].lower() == 's':
            new_tokens.append(''.join(tokens[i:i + 2]))
            i += 1
        elif i + 2 < len(tokens) and \
                ((tokens[i] == '[' and tokens[i + 1].isnumeric() and tokens[i + 2] == ']') or
                 (tokens[i].isnumeric() and tokens[i + 1] == ',' and tokens[i + 2].isnumeric())):
            new_tokens.append(''.join(tokens[i:i + 3]))
            i += 2
        elif i + 3 < len(tokens) and ''.join(tokens[i:i + 4]) == 'U.S.':
            new_tokens.append(''.join(tokens[i:i + 4]))
            i += 3
        else:
            new_tokens.append(tokens[i])
        i += 1

    return new_tokens
output = [postprocess(delimit(word, delims)) for word in input]

for word, tokens in zip(input, output):
    print('{:<16} -> {}'.format(word, tokens))
"R1: -> ['"', 'R1', ':']
(R&D) -> ['(', 'R&D', ')']
15th-largest -> ['15th', '-', 'largest']
Atlanta, -> ['Atlanta', ',']
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000 -> ['149,000']
U.S. -> ['U.S.']
def tokenize(corpus: str, delimiters: set[str]) -> list[str]:
    with open(corpus) as fin:
        words = fin.read().split()
    return [token for word in words for token in postprocess(delimit(word, delimiters))]
from collections import Counter
from src.frequency_analysis import save_output

corpus = 'dat/emory-wiki.txt'
output = 'dat/word_types-token.txt'

words = tokenize(corpus, delims)
counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(counts)}')

save_output(counts, output)
# of word tokens: 363
# of word types: 197
w_i
P(w_i) = \frac{\#(w_i)}{\sum_{\forall w_k \in V}\#(w_k)}
\#(w_i)
V
from typing import TypeAlias

Unigram: TypeAlias = dict[str, float]
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()

    for line in open(filepath):
        words = line.split()
        unigrams.update(words)

    return unigrams
def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
from collections.abc import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]

    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>10} {prob:.6f}')
corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)
         I 0.010543
     Aslan 0.001850
      Lucy 0.001815
    Edmund 0.001409
    Narnia 0.001379
   Caspian 0.001338
      Jill 0.001262
     Peter 0.001034
    Shasta 0.000928
    Digory 0.000925
   Eustace 0.000877
     Susan 0.000654
    Tirian 0.000601
     Polly 0.000547
    Aravis 0.000537
      Bree 0.000492
Puddleglum 0.000492
    Scrubb 0.000482
    Andrew 0.000406
w_i
w_{i-1}
\#(w_{i-1},w_i)
(w_{i-1},w_i)
V_i
w_{i-1}
P(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i})}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k)}
\#(w_{i-1},w_i)
(w_{i-1},w_i)
V_i
w_{i-1}
Bigram: TypeAlias = dict[str, Unigram | float]
from collections import Counter, defaultdict
from src.types import Bigram

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])

    return bigrams
from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    for prev in ['I', 'the', 'said']:
        print(prev)
        bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
        for curr, prob in bigram_list[:10]:
            print("{:>10} {:.6f}".format(curr, prob))
test_bigram(corpus, bigram_estimation)
I
        'm 0.081628
        do 0.075849
       've 0.044065
       was 0.041897
      have 0.038045
        am 0.035878
       'll 0.032507
     think 0.032025
        'd 0.026246
      know 0.025765
the
      same 0.014846
     other 0.013405
      King 0.012528
     Witch 0.011776
     whole 0.009020
    others 0.008958
     first 0.008770
     Dwarf 0.008582
      door 0.008519
     great 0.008519
said
       the 0.157635
         , 0.073645
      Lucy 0.057635
    Edmund 0.045074
   Caspian 0.040394
     Peter 0.039409
      Jill 0.034975
         . 0.034729
    Digory 0.031034
     Aslan 0.030049
unigram model

Language Models

A language model is a computational model designed to understand, generate, and predict human language. It captures language patterns, learns the likelihood of a specific term occurring in a given context, and assigns probabilities to word sequences through training on extensive text data.

Contents

Frequency Analysis

Frequency analysis examines how often each word appears in a corpus, which helps reveal language patterns and structure.

Word Counting

Our task is to determine the number of word tokens and unique word types in this text.

Q1: What is the difference between a word token and a word type?

A simple way of accomplishing this task is to split the text by whitespace and count the resulting strings, as sketched after the following note:

  • L6: Count the occurrences of each word and return the results as a Counter.
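The original code block for this step is not shown here; the following is a minimal sketch of the counting described above, assuming dat/emory-wiki.txt as the corpus path (the same file used later in the Tokenization section):

from collections import Counter

corpus = 'dat/emory-wiki.txt'           # assumed corpus path
words = open(corpus).read().split()     # split the text by whitespace
counts = Counter(words)                 # count the occurrences of each word

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(counts)}')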

When running this program, you may encounter FileNotFoundError. To fix this error, follow these steps to set up your working directory:

  1. Go to [Run] > [Edit Configurations] in the menu.

  2. Select "frequency_analysis" from the sidebar.

  3. Change the working directory to the top-level "nlp-essentials" directory.

  4. Click [OK] to save the changes.

Word Frequency

In this task, we aim to retrieve the top-k most or least frequently occurring word types in the text (a sketch follows the notes below):

  • L2: Sort words in counts in ascending order and save them into asc as a list of (word, count) tuples.

  • L5: Iterate over the top k least frequent words in the sorted list and print each word along with its count.
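A minimal sketch of the retrieval described in the notes above, assuming counts is the Counter built in the previous step; the variable names dec and asc follow the notes:

k = 10

# most frequent word types
dec = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in dec[:k]:
    print(word, count)

# least frequent word types
asc = sorted(counts.items(), key=lambda x: x[1])
for word, count in asc[:k]:
    print(word, count)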

Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This occurs because the text was split only by whitespaces without considering punctuation. Consequently, these words are recognized separately from the word types "Atlanta" and "Georgia". Therefore, the counts of word tokens and types processed above do not necessarily represent the distributions of the text accurately.

Q2: How can we interpret the most frequent words in a text?

Save Output

Finally, let us save all word types in alphabetical order to a file, as sketched after the notes below:

  • L4: Iterate over unique word types (keys) of counts in alphabetical order.

  • L5: Write each word followed by a newline character to fout.

  • L7: Close the output stream.
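A minimal sketch of the saving step described above; save_output() matches the name imported later in the Tokenization section, while the output path is hypothetical:

def save_output(counts, outfile: str):
    fout = open(outfile, 'w')                # 'w' creates or overwrites the file
    for word in sorted(counts):              # unique word types in alphabetical order
        fout.write(word + '\n')
    fout.close()                             # close the output stream

save_output(counts, 'dat/word_types.txt')    # hypothetical output path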

References

Neural Networks

Logistic Regression

Q7: What role does the sigmoid function play in the logistic regression model?

Consider a corpus consisting of two sentences:

D1: I love this movie

D2: I hate this movie

Softmax Regression

Q9: What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?

Consider a corpus consisting of three sentences:

D1: I love this movie

D2: I hate this movie

D3: I watched this movie

What are the limitations of the softmax regression model?

Multilayer Perceptron

Q10: Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?

Q11: What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?

Q12: What are the limitations of a multilayer perceptron?

References

Recurrent Neural Networks

Update: 2023-10-26

A Recurrent Neural Network (RNN) [1] maintains hidden states of previous inputs and uses them to predict outputs, allowing it to model temporal dependencies in sequential data.

The hidden state is a vector representing the network's internal memory of the previous time step. It captures information from previous time steps and influences the predictions made at the current time step, often updated at each time step as the RNN processes a sequence of inputs.

RNN for Sequence Tagging

RNN for Text Classification

Bidirectional RNN

For example, let us consider the word "early" in the following two sentences:

  • They are early birds -> "early" is an adjective.

  • They are early today -> "early" is an adverb.

The POS tags of "early" depend on the following words, "birds" and "today", such that making the correct predictions becomes challenging without the following context.

To overcome this challenge, a bidirectional RNN has been suggested [2], which processes the input in both forward and backward directions, creating twice as many hidden states to capture a more comprehensive context. Figure 3 illustrates a bidirectional RNN for sequence tagging:

Q8: What are the advantages and limitations of implementing bidirectional RNNs for text classification and sequence tagging tasks?

Advanced Topics

  • Long Short-Term Memory (LSTM) Networks [3-5]

  • Gated Recurrent Units (GRUs) [6-7]

References

Subword Tokenization

Algorithm

  1. Initialization: Given a dictionary consisting of all words and their counts in a corpus, the symbol vocabulary is initialized by tokenizing each word into its most basic subword units, such as characters.

  2. Expectation: With the (updated) symbol vocabulary, it calculates the frequency of every symbol pair within the vocabulary.

  3. Maximization: Given all symbol pairs and their frequencies, it merges the top-k most frequent symbol pairs in the vocabulary.

  4. Steps 2 and 3 are repeated until meaningful sets of subwords are found for all words in the corpus.

Q4: The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?

Implementation

Let us consider a toy vocabulary:

First, we create the symbol vocabulary by inserting a space between every pair of adjacent characters and adding a special symbol [EoW] at the end to indicate the End of the Word:

Next, we count the frequencies of all symbol pairs in the vocabulary:

Finally, we update the vocabulary by merging the most frequent symbol pair across all words:

The expect() and maximize() can be repeated for multiple iterations until the tokenization becomes reasonable:
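The original expect(), maximize(), and bpe_vocab() implementations are not shown here; the following is a minimal sketch of the two EM steps, under the assumption that the vocabulary maps space-separated symbol sequences (ending with [EoW]) to word counts and that the toy vocabulary below is purely illustrative:

from collections import Counter

def expect(vocab: dict[str, int]) -> Counter:
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += count   # frequency of each adjacent symbol pair
    return pairs

def maximize(vocab: dict[str, int], pair: tuple[str, str]) -> dict[str, int]:
    old, new = ' '.join(pair), ''.join(pair)
    return {word.replace(old, new): count for word, count in vocab.items()}   # merge the pair everywhere

# hypothetical toy vocabulary
vocab = {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w e s t [EoW]': 6}
best_pair = expect(vocab).most_common(1)[0][0]
vocab = maximize(vocab, best_pair)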

When you uncomment L7 in bpe_vocab(), you can see how the symbols are merged in each iteration:

References

Document Classification

Document classification, also known as text classification, is the task of assigning predefined categories or labels to documents based on their content. It is used to automatically organize, categorize, or label large collections of textual documents.

Supervised Learning

Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.

Data Split

Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.

It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice involves splitting the dataset into ratios such as 80/10/10 or 75/10/15 for the training, development, and evaluation sets, respectively.

Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:
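The original loading function is not shown here; the following is a minimal sketch under the assumptions that the directory contains one file per chapter named A_B (A = book ID, B = chapter ID) and that each file holds a whitespace-tokenized chapter:

import os
from collections import defaultdict

def read_documents(dirpath: str) -> dict[str, list[list[str]]]:
    books = defaultdict(list)
    for filename in sorted(os.listdir(dirpath)):
        book_id = filename.split('_')[0]                 # A_B -> book ID
        with open(os.path.join(dirpath, filename)) as fin:
            books[book_id].append(fin.read().split())    # one tokenized chapter per file
    return books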

We then print the number of documents in each set:

Q8: What potential problems might arise from the above data splitting approach, and what alternative method could mitigate these issues?

Vectorization

To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:

Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:

We then vectorize all documents in each set:

Q9: Why do we use only the training set to collect the vocabulary?

Classification

Finally, we test our classification model on the development set:
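The original model and evaluation code are not shown here; the following is a minimal sketch of a k-nearest-neighbors classifier, assuming cosine_similarity() from the Similarity Metrics section and hypothetical lists trn_vectors and dev_vectors of (book ID, sparse vector) tuples produced by the vectorization step:

from collections import Counter

def knn(trn_vectors, document_vector, k: int = 5):
    sims = [(cosine_similarity(document_vector, v), book_id) for book_id, v in trn_vectors]
    top_k = sorted(sims, key=lambda x: x[0], reverse=True)[:k]
    book_id, votes = Counter(bid for _, bid in top_k).most_common(1)[0]
    return book_id, votes / k                            # vote ratio as a simple confidence score

correct = sum(1 for gold, v in dev_vectors if knn(trn_vectors, v)[0] == gold)
print(f'Accuracy: {correct / len(dev_vectors):.2%}')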

Q10: What are the primary weaknesses and limitations of the K-Nearest Neighbors (KNN) classification model when applied to document classification?

References

Bag-of-Words Model

Overview

In the bag-of-words model, a document is represented as a set or a "bag" of words, disregarding any structure but maintaining information about the frequency of every word.

The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:

Q1: One limitation of the bag-of-words model is its inability to handle unknown words. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?

Implementation

Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
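The original implementations are not shown here; the following is a minimal sketch of vocabulary() and bag_of_words(), assuming a document is a list of tokens, a vocabulary maps words to unique IDs, and a sparse vector maps term IDs to counts:

from collections import Counter

def vocabulary(documents: list[list[str]]) -> dict[str, int]:
    vocab = {}
    for document in documents:
        for word in document:
            vocab.setdefault(word, len(vocab))   # assign the next unique ID to a new word
    return vocab

def bag_of_words(vocab: dict[str, int], document: list[str]) -> dict[int, int]:
    counts = Counter(word for word in document if word in vocab)
    return {vocab[word]: count for word, count in counts.items()}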

Finally, let us test our bag-of-words model with the examples above:

Q2: Another limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?

References

Homework

HW2: Language Models

Task 1: Bigram Modeling

Implementation

  1. Define a function named bigram_model() that takes a file path pointing to the text file, and returns a dictionary of bigram probabilities estimated in the text file.

  2. Use the following constants to indicate the unknown and initial probabilities:

Notes

  1. Each line should be treated independently for bigram counting such that the INIT token should precede the first word of each line.

Task 2: Sequence Generation

Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.

Implementation

  • A bigram model (the resulting dictionary of Task 1)

  • The initial word (the first word to appear in the sequence)

  • The length of the sequence (the number of tokens in the sequence)

This function aims to generate a sequence of tokens that adheres to the following criteria:

  • It must have the precise number of tokens as specified.

  • Excluding punctuation, there should be no redundant tokens in the sequence.

Finally, the function returns a tuple comprising the following two elements:

  • The list of tokens in the sequence

Extra Credit

Create a function called sequence_generator_plus() that takes the same input parameters as the existing sequence_generator() function. This new function should generate sequences with higher probability scores and better semantic coherence compared to the original implementation.

Submission

Commit and push the language_models.py file to your GitHub repository.

Rubric

  • Task 1: Bigram Modeling (5 points)

  • Task 2: Sequence Generator (4.6 points), Extra Credit (2 points)

  • Concept Quiz (2.4 points)

w_{i-1}
w_i
w_{i-1}
w_{i-1}

Consider the following text from Wikipedia about Emory University (as of 2023-10-18):

L1: Import the Counter class, a special type of dictionary, from the collections package.

L3: Use to indicate the parameter type (str) and the return type (Counter).

L4: the corpus file.

L5: Read the contents of the file as a string, split it into a list of words, and store them in words.

L1: The corpus can be found .

L4: Save the total number of word tokens in the corpus, which is the of counts values.

L5: Save the number of unique word types in the corpus, which is the of counts.

L7-8: Print the value using the .

L1: Sort words in counts in descending order and save them into dec as a list of (word, count) , sorted from the most frequent to the least frequent words (, , ).

L4: Iterate over the top k most frequent words in the sorted list using , and print each word along with its count.

L2: Open outfile in write mode (w).

L1: Creates the file if it does not exist; otherwise, its previous contents will be completely overwritten.

Source:

, The Python Standard Library - Built-in Types

, The Python Standard Library - Built-in Types

, The Python Tutorial

Let x be a vector representing an input instance, where x_i denotes the i'th feature of the input, and let y be its corresponding output label. Logistic regression uses the logistic function, aka. the sigmoid function, to estimate the probability that x belongs to y:

The weight vector w assigns a weight to each dimension of the input vector x for the label y such that a higher magnitude of weight indicates greater importance of the corresponding feature. Finally, b represents the bias of the label y within the training distribution.

The input vectors x_1 and x_2 can be created for these two sentences using the bag-of-words model:

Let y_1 and y_2 be the output labels of x_1 and x_2, representing the positive and negative sentiments of the input sentences, respectively. Then, a weight vector can be trained using logistic regression:

Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" contributes positively to the label, the weight for "hate" has a negative impact on the label. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias is also set to 0.

Given the weight vector and the bias, we obtain the following probabilities:

As the probability of being positive exceeds 0.5 (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being positive is below 50%.
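The following is a minimal numerical sketch of the prediction described above, with hypothetical bag-of-words vectors and weights chosen only for illustration:

import math

vocab = ['I', 'love', 'hate', 'this', 'movie']
x1 = [1, 1, 0, 1, 1]                 # "I love this movie"
x2 = [1, 0, 1, 1, 1]                 # "I hate this movie"
w = [0.0, 1.0, -1.0, 0.0, 0.0]       # assumed weights: "love" positive, "hate" negative
b = 0.0

sigmoid = lambda z: 1 / (1 + math.exp(-z))
for x in (x1, x2):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    print(sigmoid(z))                # > 0.5 -> positive, < 0.5 -> negative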

Q8: Under what circumstances would the bias be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?

Softmax regression, aka. multinomial logistic regression, is an extension of logistic regression that handles classification problems with more than two classes. Given an input vector x and its output label y, the model uses the softmax function to estimate the probability that x belongs to each class separately:

The weight vector assigns weights to for the label , while represents the bias associated with the label .

Then, the input vectors x_1, x_2, and x_3 for the sentences can be created using the bag-of-words model:

Let y_1, y_2, and y_3 be the output labels of x_1, x_2, and x_3, representing the positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors can be trained using softmax regression as follows:

Unlike the case of logistic regression where all weights are oriented to (both and giving positive and negative weights to respectively, but not ), the values in each weigh vector are oriented to each corresponding label.

Given the weight vectors and the biases, we can estimate the following probabilities for x_1:

Since the probability of the positive label is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For x_3, the following probabilities can be estimated:

Since the probability of the neutral label is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.

Softmax regression predicts a probability for every class, so its prediction is represented by an output vector y, wherein the i'th value in y contains the probability of the input belonging to the i'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix W, where the i'th row represents the weight vector for the i'th label.

With this new formulation, softmax regression can be defined as y = softmax(Wx), and the optimal prediction can be achieved as argmax(y), which returns the label(s) with the highest probability.
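A minimal sketch of y = softmax(Wx) with a hypothetical 3x5 weight matrix for the positive, negative, and neutral labels (the values are illustrative only):

import math

def softmax(z: list[float]) -> list[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]         # subtract the max for numerical stability
    s = sum(e)
    return [v / s for v in e]

W = [[0.0, 1.0, -1.0, 0.0, 0.0],             # positive
     [0.0, -1.0, 1.0, 0.0, 0.0],             # negative
     [0.0, -0.5, -0.5, 0.0, 0.0]]            # neutral
x = [1, 1, 0, 1, 1]                          # "I love this movie" (bag-of-words)

y = softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])
print(y, y.index(max(y)))                    # argmax -> predicted label index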

A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector x and an output vector y, the model allows zero to many hidden layers to generate intermediate representations of the input.

Let h be a hidden layer between x and y. To connect x and h, we need a weight matrix W^(1) such that h = f(W^(1) x), where f is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated, implying whether or not the neuron's output should be passed on to the next layer.

Similarly, to connect h and y, we need a weight matrix W^(2) such that y = W^(2) h. Thus, a multilayer perceptron with one hidden layer can be represented as:

Consider a corpus comprising the following five sentences and the corresponding labels:

D1: I love this movie positive

D2: I hate this movie negative

D3: I watched this movie neutral

D4: I truly love this movie very positive

D5: I truly hate this movie very negative

The input vectors can be created using the :

The first weight matrix can be trained by an MLP as follows:

Given the values in the first weight matrix, we can infer that the first, second, and third columns represent "love", "hate", and "watch", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.

Each input vector is multiplied by the first weight matrix to achieve the corresponding hidden layer, where the activation function is designed as follows:

The second weight matrix can also be trained by an MLP as follows:

By applying the softmax function to each , we achieve the corresponding output vector :

The prediction can be made by taking the argmax of each .

, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.

Given an input sequence where , an RNN for defines two functions, and :

takes the current input and the hidden state of the previous input , and returns a hidden state such that , where , , and is an .

takes the hidden state and returns an output such that , where .

Figure 1 shows an example of an RNN for sequence tagging, such as :

Notice that the output for the first input is predicted by considering only the input itself (e.g., the POS tag of the first word "I" is predicted solely using that word). However, the output for every other input is predicted by considering both the current input and the hidden state of the previous input, an intermediate representation created explicitly for the task. This enables RNNs to capture sequential information that feedforward networks cannot.
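A minimal sketch of a single RNN step for sequence tagging, using plain Python lists and tanh as an assumed activation function; the weight matrices and dimensions are hypothetical:

import math

def matvec(W: list[list[float]], v: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_x, W_h, W_y):
    # h_t = tanh(W_x x_t + W_h h_{t-1});  y_t = W_y h_t
    h_t = [math.tanh(a + b) for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    y_t = matvec(W_y, h_t)                   # per-token scores (e.g., POS tag logits)
    return h_t, y_t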

Q6: How does each hidden state in an RNN encode information relevant to sequence tagging tasks?

Unlike sequence tagging, where the RNN predicts a sequence of outputs for the input sequence, an RNN designed for text classification predicts only one output for the entire input sequence such that:

Sequence Tagging

Text Classification:

To accomplish this, a common practice is to predict the output from the last hidden state using the function . Figure 2 shows an example of an RNN for text classification, such as :

Q7: In text classification tasks, what specific information is captured by the final hidden state of an RNN?

The RNNs above do not consider the words that follow the current word when predicting the output. This limitation can significantly impact model performance since contextual information following the current word can be crucial.

For every , the hidden states and are created by considering and , respectively. The function takes both and and returns an output such that , where is a concatenation of the two hidden states and .
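The bidirectional case can be sketched in the same style; the separate forward/backward weight matrices, the tanh activation, and the toy dimensions below are assumptions made for illustration only:

import numpy as np

def birnn_tag(X, Wf, Wb, Uf, Ub, Wo, alpha=np.tanh):
    """Sketch of a bidirectional RNN tagger: concatenate forward/backward states."""
    e = Uf.shape[0]
    fwd, h = [], np.zeros(e)                     # forward pass over the sequence
    for x in X:
        h = alpha(Wf @ x + Uf @ h)
        fwd.append(h)
    bwd, h = [], np.zeros(e)                     # backward pass over the reversed sequence
    for x in reversed(X):
        h = alpha(Wb @ x + Ub @ h)
        bwd.append(h)
    bwd.reverse()
    # y_i = W^o (forward state concatenated with backward state)
    return [Wo @ np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

rng = np.random.default_rng(2)
X = [rng.random(4) for _ in range(5)]
Wf, Wb = rng.random((3, 4)), rng.random((3, 4))
Uf, Ub = rng.random((3, 3)), rng.random((3, 3))
Wo = rng.random((2, 6))                          # o x 2e
print(len(birnn_tag(X, Wf, Wb, Uf, Ub, Wo)))     # 5 outputs, one per token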

Finding Structure in Time, Elman, Cognitive Science, 14(2), 1990.

Bidirectional Recurrent Neural Networks, Schuster and Paliwal, IEEE Transactions on Signal Processing, 45(11), 1997.

Long Short-Term Memory, Hochreiter and Schmidhuber, Neural Computation, 9(8), 1997 (PDF available at ResearchGate).

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma and Hovy, ACL, 2016.*

Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.*

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP, 2014.*

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., NeurIPS Workshop on Deep Learning and Representation Learning, 2014.*

Byte Pair Encoding (BPE) is a data compression algorithm that is commonly used in the context of subword tokenization for neural language models. BPE tokenizes text into smaller units, such as subword pieces or characters, to handle out-of-vocabulary words, reduce vocabulary size, and enhance the efficiency of language models.

The following describes the steps of BPE in terms of the EM algorithm:

Q5: What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the implementation of BPE above?

Source code:

Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016.

A New Algorithm for Data Compression, Gage, The C Users Journal, 1994.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (WordPiece), Wu et al., arXiv, 2016.

The document_classification directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the chronicles_of_narnia.txt file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.

L7: the glob module, the os module.

Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and $k$, and returns the predicted book ID of the document and its similarity score:

L2: Measure the similarity between the input document $v$ and every document $t$ in the training set and save it with the book ID of $t$.

L3-4: Return the most common book ID among the top-$k$ documents in the training set that are most similar to $v$.

Source:

K-nearest neighbors algorithm, Wikipedia.

Consider a corpus containing the following two documents:

Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in $W$ and each dimension in $v_i$ is the frequency of $w_j$'s occurrences in $D_i$ such that:

Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $v_i$ being 0 in practice, as most words in the vocabulary do not occur in document $D_i$. Therefore, it is more efficient to represent $v_i$ as a sparse vector:

Source:

Bag-of-Words Model, Wikipedia

Bags of words, Working With Text Data, scikit-learn Tutorials

Your goal is to build a bigram model using (1) Laplace smoothing with normalization and (2) initial word probabilities obtained by adding the artificial token [INIT] at the beginning of every sentence.

Create a language_models.py file in the src/homework/ directory.

Test your model using dat/chronicles_of_narnia.txt.

Use normalization such that all probabilities must sum to 1.0 for any given previous word.

Unknown word probabilities should be retrieved using the UNKNOWN key for both the previous word ($w_{i-1}$) and the current word ($w_i$).

Under language_models.py, define a function named sequence_generator() that takes the following parameters:

Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of $\mathrm{floor}(21 / 5) = 4$ punctuation tokens are permitted).

The goal of this task is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the input word $w$, a potential strategy is as follows:

Identify the next word $w'$ where the bigram probability $P(w'|w)$ is maximized.

If $w'$ fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.

Make $w = w'$ and repeat step #1 until you reach the specified sequence length.

The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to the base $e$, provided as the math.log() function in Python.

chronicles_of_narnia.txt
SpareVector
previous section
Maximum likelihood estimation
dat/word_types.txt
Support for type hints
set
enumerate()
next()
generator expressions
extend()
Set types
zip()
Format Specification Mini-Language
list comprehension
emory-wiki.txt
src/frequency_analysis.py
dat/word_types-token.txt
dat/text_processing /word_types-token.txt
src/tokenization.py
ELIT Tokenizer
tokenized
Type aliases
dat/chronicles_of_narnia.txt
Callable
defaultdict
collections
type alias
src.types
ngram_models.py
N-gram Language Models
Euclidean distance
Cosine similarity
Chain Rule
Markov Assumption
smoothing probability
from collections import Counter

def count_words(corpus: str) -> Counter:
    fin = open(corpus)
    words = fin.read().split()
    return Counter(words)
corpus = 'dat/emory-wiki.txt'
counts = count_words(corpus)

n_tokens = sum(counts.values())
n_types = len(counts)

print(f'# of word tokens: {n_tokens}')
print(f'# of word types: {n_types}')
# of word tokens: 305
# of word types: 180
des = sorted(counts.items(), key=lambda x: x[1], reverse=True)
asc = sorted(counts.items(), key=lambda x: x[1])

for word, count in des[:10]: print(word, count)
for word, count in asc[:10]: print(word, count)
the 18
and 15
of 12
Emory 11
in 10
University 7
is 7
university 6
United 6
research 5
private 1
Atlanta, 1
Georgia. 1
Founded 1
1836 1
College 1
by 1
Episcopal 1
Church 1
named 1
def save_output(counts: Counter, outfile: str):
    fout = open(outfile, 'w')

    for word in sorted(counts.keys()):
        fout.write(f'{word}\n')

    fout.close()
save_output(counts, 'dat/word_types.txt')
word frequencies
x=[x1,…,xn]\mathrm{x} = [x_1, \ldots, x_n]x=[x1​,…,xn​]
xix_ixi​
iii
y∈{0,1}y \in \{0, 1\}y∈{0,1}
x\mathrm{x}x
yyy
$\begin{align*} P(y=1|\mathrm{x}) &= \frac{1}{1 + e^{-(\mathrm{x}\cdot\mathrm{w}^T+b)}}\\ P(y=0|\mathrm{x}) &= 1 - P(y=1|\mathrm{x}) \end{align*}$
w=[w1,…,wn]\mathrm{w}=[w_1, \ldots, w_n]w=[w1​,…,wn​]
x\mathrm{x}x
y=1y=1y=1
wiw_iwi​
xix_ixi​
bbb
y=1y=1y=1
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie"}
x1 = [1, 1, 0, 1, 1]
x2 = [1, 0, 1, 1, 1]
y1=1y_1 =1y1​=1
y2=0y_2 = 0y2​=0
x1\mathrm{x}_1x1​
x2\mathrm{x}_2x2​
w\mathrm{w}w
w = [0.0, 1.5, -1.5, 0.0, 0.0]
b = 0
w1​w_1​w1​​
w4​w_4​w4​​
w5​w_5​w5​​
w1​w_1​w1​​
x1​x_1​x1​​
y=1y=1y=1
w2w_2w2​
x2​x_2​x2​​
y=1y=1y=1
bbb
$\mathrm{x}_1 \cdot \mathrm{w}^T + b = 1.5$
$\mathrm{x}_2 \cdot \mathrm{w}^T + b = -1.5$
$\begin{align*} P(y=1|\mathrm{x}_1) \approx 0.82\\ P(y=1|\mathrm{x}_2) \approx 0.18 \end{align*}$
x1​x_1​x1​​
y=1y=1y=1
0.50.50.5
y=1y=1y=1
bbb
x∈R1×n\mathrm{x} \in \mathbb{R}^{1 \times n}x∈R1×n
y∈{0,…,m−1}y \in \{0, \ldots, m-1\}y∈{0,…,m−1}
x\mathrm{x}x
$P(y=k|\mathrm{x}) = \frac{e^{\mathrm{x}\cdot\mathrm{w}_k^T+b_k}}{\sum_{j=1}^m e^{\mathrm{x}\cdot\mathrm{w}_j^T+b_j}}$
wk​\mathrm{w}_k​wk​​
x\mathrm{x}x
y=ky=ky=k
bk​b_k​bk​​
y=ky=ky=k
V = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched"}
x1 = [1, 1, 0, 1, 1, 0]
x2 = [1, 0, 1, 1, 1, 0]
x3 = [1, 0, 0, 1, 1, 1]
y1=1y_1 =1y1​=1
y2=0y_2 = 0y2​=0
y=2y=2y=2
x1\mathrm{x}_1x1​
x2\mathrm{x}_2x2​
x3\mathrm{x}_3x3​
w1\mathrm{w}_1w1​
w2\mathrm{w}_2w2​
w3\mathrm{w}_3w3​
w1 = [0.0,  1.5, -1.0, 0.0, 0.0, 0.0]
w2 = [0.0, -1.0,  1.5, 0.0, 0.0, 0.0]
w3 = [0.0, -1.0, -1.0, 0.0, 0.0, 1.5]
b1 = b2 = b3 = 0
y=1y = 1y=1
w1w_1w1​
w2w_2w2​
y=1y = 1y=1
y=0y=0y=0
x1\mathrm{x}_1x1​
$\begin{align*} \mathrm{x}_1 \cdot \mathrm{w}_1^T + b_1 &= 1.5 &\Rightarrow&\; P(y=1|\mathrm{x}_1) = 0.86\\ \mathrm{x}_1 \cdot \mathrm{w}_2^T + b_2 &= -1.0 &\Rightarrow&\; P(y=0|\mathrm{x}_1) = 0.07\\ \mathrm{x}_1 \cdot \mathrm{w}_3^T + b_3 &= -1.0 &\Rightarrow&\; P(y=2|\mathrm{x}_1) = 0.07 \end{align*}$
y=1y=1y=1
x3\mathrm{x}_3x3​
$\begin{align*} \mathrm{x}_3 \cdot \mathrm{w}_1^T + b_1 &= 0 &\Rightarrow&\; P(y=1|\mathrm{x}_3) = 0.14\\ \mathrm{x}_3 \cdot \mathrm{w}_2^T + b_2 &= 0 &\Rightarrow&\; P(y=0|\mathrm{x}_3) = 0.14\\ \mathrm{x}_3 \cdot \mathrm{w}_3^T + b_3 &= 1.5 &\Rightarrow&\; P(y=2|\mathrm{x}_3) = 0.69 \end{align*}$
y=2y=2y=2
mmm
y∈R1×m\mathrm{y} \in \mathbb{R}^{1 \times m}y∈R1×m
iii
y\mathrm{y}y
iii
W∈Rm×n\mathrm{W} \in \mathbb{R}^{m \times n}W∈Rm×n
iii
iii
$\mathrm{y} = \mathrm{softmax}(\mathrm{x} \cdot \mathrm{W}^T)$
$\mathrm{argmax}(\mathrm{y})$
$\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)$
$\mathrm{h}$
$\mathrm{y}$
$\mathrm{W}_h \in \mathbb{R}^{m \times d}$
$\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$
$\mathrm{y} = \mathrm{softmax}[\mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x) \cdot \mathrm{W}_h^T] = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$
X = {0: "I", 1: "love", 2: "hate", 3: "this", 4: "movie", 5: "watched", 6: "truly"}
Y = {0: "positive", 1: "negative", 2: "neutral", 3: "very positive", 4: "very negative"}

x1 = [1, 1, 0, 1, 1, 0, 0]
x2 = [1, 0, 1, 1, 1, 0, 0]
x3 = [1, 0, 0, 1, 1, 1, 0]
x4 = [1, 1, 0, 1, 1, 0, 1]
x5 = [1, 0, 1, 1, 1, 0, 1]

y1, y2, y3, y4, y5 = 0, 1, 2, 3, 4
$\mathrm{W}_x \in \mathbb{R}^{7 \times 5}$
Wx = [
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [1.0, 0.0, 0.0, 0.5, 0.0],
  [0.0, 1.0, 0.0, 0.0, 0.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 1.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.5, 0.5]
]
Wx\mathrm{W}_xWx​
x1..5\mathrm{x}_{1..5}x1..5​
Wx\mathrm{W}_xWx​
h1..5\mathrm{h}_{1..5}h1..5​
$\mathrm{activation}(x) = \begin{cases} x & \text{if}\; x > 0.5\\ 0 & \text{otherwise} \end{cases}$
g1 = [1.0, 0.0, 0.0, 0.5, 0.0]
g2 = [0.0, 1.0, 0.0, 0.0, 0.5]
g3 = [0.0, 0.0, 1.0, 0.0, 0.0]
g4 = [1.0, 0.0, 0.0, 1.0, 0.5]
g5 = [0.0, 1.0, 0.0, 0.5, 1.0]
h1 = activation(g1) = [1.0, 0.0, 0.0, 0.0, 0.0]
h2 = activation(g2) = [0.0, 1.0, 0.0, 0.0, 0.0]
h3 = activation(g3) = [0.0, 0.0, 1.0, 0.0, 0.0]
h4 = activation(g4) = [1.0, 0.0, 0.0, 1.0, 0.0]
h5 = activation(g5) = [0.0, 1.0, 0.0, 0.0, 1.0]
$\mathrm{W}_h \in \mathbb{R}^{5 \times 5}$
Wh = [
  [ 1.0, -1.0, 0.0, -0.5, -1.0],
  [-1.0,  1.0, 0.0, -1.0, -0.5],
  [-1.0, -1.0, 1.0, -1.0, -1.0],
  [ 0.0, -1.0, 0.0,  1.0, -1.0],
  [-1.0,  0.0, 0.0, -1.0,  1.0]
]
hi⋅ WhT\mathrm{h}_i \cdot \mathrm{W}^T_hhi​⋅ WhT​
yi\mathrm{y}_iyi​
o1 = [ 1.0, -1.0, -1.0,  0.0, -1.0]
o2 = [-1.0,  1.0, -1.0, -1.0,  0.0]
o3 = [ 0.0,  0.0,  1.0,  0.0,  0.0]
o4 = [ 0.5, -2.0, -2.0,  1.0, -2.0]
o5 = [-2.0,  0.5, -2.0, -2.0,  1.0]
y1 = softmax(o1) = [0.56, 0.08, 0.08, 0.21, 0.08]
y2 = softmax(o2) = [0.08, 0.56, 0.08, 0.08, 0.21]
y3 = softmax(o3) = [0.15, 0.15, 0.40, 0.15, 0.15]
y4 = softmax(o4) = [0.35, 0.03, 0.03, 0.57, 0.03]
y5 = softmax(o5) = [0.03, 0.35, 0.03, 0.03, 0.57]
yi\mathrm{y}_iyi​
ggg
hih_ihi​
yi∈Ro×1y_i \in \mathbb{R}^{o \times 1}yi​∈Ro×1
$g(h_i) = W^o h_i = y_i$
Wo∈Ro×eW^o \in \mathbb{R}^{o \times e}Wo∈Ro×e
hih_ihi​
$\text{RNN}_{st}(X) \rightarrow Y$
$\text{RNN}_{tc}(X) \rightarrow y$
hnh_nhn​
xix_ixi​
h→i\overrightarrow{h}_ihi​
h←i\overleftarrow{h}_ihi​
h→i−1\overrightarrow{h}_{i-1}hi−1​
h←i+1\overleftarrow{h}_{i+1}hi+1​
ggg
h→i\overrightarrow{h}_ihi​
h←i\overleftarrow{h}_ihi​
yi∈Ro×1y_i \in \mathbb{R}^{o \times 1}yi​∈Ro×1
$g(\overrightarrow{h}_i, \overleftarrow{h}_i) = W^o (\overrightarrow{h}_i \oplus \overleftarrow{h}_i) = y_i$
$(\overrightarrow{h}_i \oplus \overleftarrow{h}_i) \in \mathbb{R}^{2e \times 1}$
Wo∈Ro×2eW^o \in \mathbb{R}^{o \times 2e}Wo∈Ro×2e
import collections, re
from src.types import WordCount, PairCount
EOW = '[EoW]'

word_counts = {
    'high': 12,
    'higher': 14,
    'highest': 10,
    'low': 12,
    'lower': 11,
    'lowest': 13
}
def initialize(word_counts: WordCount) -> WordCount:
    return {' '.join(list(word) + [EOW]): count for word, count in word_counts.items()}
def expect(vocab: WordCount) -> PairCount:
    pairs = collections.defaultdict(int)

    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq

    return pairs
def maximize(vocab: WordCount, pairs: PairCount) -> WordCount:
    best = max(pairs, key=pairs.get)
    p = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
    return {p.sub(''.join(best), word): freq for word, freq in vocab.items()}
def bpe_vocab(word_counts: WordCount, max_iter: int):
    vocab = initialize(word_counts)

    for i in range(max_iter):
        pairs = expect(vocab)
        vocab = maximize(vocab, pairs)
        # print(vocab)

    return vocab
bpe_vocab(word_counts, 10)
{'hi g h [EoW]': 12, 'hi g h e r [EoW]': 14, 'hi g h e s t [EoW]': 10, 'l o w [EoW]': 12, 'l o w e r [EoW]': 11, 'l o w e s t [EoW]': 13}
{'hig h [EoW]': 12, 'hig h e r [EoW]': 14, 'hig h e s t [EoW]': 10, 'l o w [EoW]': 12, 'l o w e r [EoW]': 11, 'l o w e s t [EoW]': 13}
{'high [EoW]': 12, 'high e r [EoW]': 14, 'high e s t [EoW]': 10, 'l o w [EoW]': 12, 'l o w e r [EoW]': 11, 'l o w e s t [EoW]': 13}
{'high [EoW]': 12, 'high e r [EoW]': 14, 'high e s t [EoW]': 10, 'lo w [EoW]': 12, 'lo w e r [EoW]': 11, 'lo w e s t [EoW]': 13}
{'high [EoW]': 12, 'high e r [EoW]': 14, 'high e s t [EoW]': 10, 'low [EoW]': 12, 'low e r [EoW]': 11, 'low e s t [EoW]': 13}
{'high [EoW]': 12, 'high er [EoW]': 14, 'high e s t [EoW]': 10, 'low [EoW]': 12, 'low er [EoW]': 11, 'low e s t [EoW]': 13}
{'high [EoW]': 12, 'high er[EoW]': 14, 'high e s t [EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low e s t [EoW]': 13}
{'high [EoW]': 12, 'high er[EoW]': 14, 'high es t [EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low es t [EoW]': 13}
{'high [EoW]': 12, 'high er[EoW]': 14, 'high est [EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low est [EoW]': 13}
{'high [EoW]': 12, 'high er[EoW]': 14, 'high est[EoW]': 10, 'low [EoW]': 12, 'low er[EoW]': 11, 'low est[EoW]': 13}
from src.bag_of_words_model import Document
import glob, os

def collect(dirpath: str) -> dict[int, list[Document]]:
    books = dict()

    for filename in glob.glob(os.path.join(dirpath, '*.txt')):
        t = os.path.basename(filename).split('_')
        book_id = int(t[0])
        fin = open(filename)
        books.setdefault(book_id, list()).append(fin.read().split())

    return books
def join_documents(dataset: dict[int, list[Document]]) -> list[Document]:
    return [document for documents in dataset.values() for document in documents]

trn = collect('dat/document_classification/trn')
dev = collect('dat/document_classification/dev')
tst = collect('dat/document_classification/tst')
print(len(join_documents(trn)), len(join_documents(dev)), len(join_documents(tst)))
82 14 14
corpus = join_documents(trn)
vocab = vocabulary(join_documents(trn))
dfs = document_frequencies(vocab, corpus)
D = len(corpus)
def vectorize(vocab: Vocab, dfs: SparseVector, D: int, docset: dict[int, list[Document]]) -> list[tuple[int, SparseVector]]:
    vs = []

    for book_id, documents in docset.items():
        for document in documents:
            vs.append((book_id, tf_idf(vocab, dfs, D, document)))

    return vs
trn_vs = vectorize(vocab, dfs, D, trn)
dev_vs = vectorize(vocab, dfs, D, dev)
tst_vs = vectorize(vocab, dfs, D, tst)
kkk
def knn(trn_vs: list[tuple[int, SparseVector]], v: SparseVector, k: int = 1) -> tuple[int, float]:
    sims = [(book_id, cosine_similarity(v, t)) for book_id, t in trn_vs]
    sims.sort(key=lambda x: x[1], reverse=True)
    return Counter(sims[:k]).most_common(1)[0][0]
vvv
ttt
ttt
kkk
vvv
correct = 0
    
for g_book_id, document in dev_vs:
    p_book_id, p_score = knn(trn_vs, document)
    if g_book_id == p_book_id: correct += 1
    print('Gold: {}, Auto: {}, Score: {:.2f}'.format(g_book_id, p_book_id, p_score))

print('Accuracy: {} ({}/{})'.format(100 * correct / len(dev_vs), correct, len(dev_vs)))
Gold: 1, Auto: 1, Score: 0.49
Gold: 1, Auto: 1, Score: 0.27
Gold: 3, Auto: 3, Score: 0.36
Gold: 3, Auto: 3, Score: 0.32
Gold: 5, Auto: 5, Score: 0.29
Gold: 5, Auto: 5, Score: 0.54
Gold: 0, Auto: 0, Score: 0.32
Gold: 0, Auto: 0, Score: 0.26
Gold: 6, Auto: 6, Score: 0.48
Gold: 6, Auto: 6, Score: 0.49
Gold: 2, Auto: 2, Score: 0.37
Gold: 2, Auto: 2, Score: 0.31
Gold: 4, Auto: 4, Score: 0.56
Gold: 4, Auto: 4, Score: 0.60
Accuracy: 100.0 (14/14)
D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]
Di=[wi,1,…,wi,n]D_i = [w_{i,1}, \ldots, w_{i,n}]Di​=[wi,1​,…,wi,n​]
wi,jw_{i,j}wi,j​
jjj
DiD_iDi​
DiD_iDi​
$v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$
wjw_jwj​
jjj
WWW
viv_ivi​
wjw_jwj​
DiD_iDi​
#     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
viv_ivi​
WWW
DiD_iDi​
viv_ivi​
v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}
from typing import TypeAlias

Document: TypeAlias = list[str]
Vocab: TypeAlias = dict[str, int]

def vocabulary(documents: list[Document]) -> Vocab:
    vocab = set()

    for document in documents:
        vocab.update(document)

    return {word: i for i, word in enumerate(sorted(list(vocab)))}
from collections import Counter

SparseVector: TypeAlias = dict[int, int | float]

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    counts = Counter(document)
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]

vocab = vocabulary(documents)
print(vocab)

print(bag_of_words(vocab, documents[0]))
print(bag_of_words(vocab, documents[1]))
{
    '.': 0,
    'John': 1,
    'Mary': 2,
    'The': 3,
    'a': 4,
    'book': 5,
    'bought': 6,
    'funny': 7,
    'gave': 8,
    'it': 9,
    'liked': 10,
    'the': 11,
    'to': 12,
    'was': 13
}
{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}
UNKNOWN = ''
INIT = '[INIT]'
wi−1w_{i-1}wi−1​
wiw_iwi​
floor(21/5)=4\mathrm{floor}(21 / 5) = 4floor(21/5)=4
w′w'w′
P(w′∣w)P(w′∣w)P(w′∣w)
w′w′w′
w=w′w = w'w=w′
eee

Latent Semantic Analysis

Document-Term Matrix

LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.

With this current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:

Using this updated function, we see a noticeable enhancement in speed, about 0.5 seconds, to create the document-term matrix:

Q3: Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?

Dimensionality Reduction

For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:

Q4: What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?

Document Embedding

From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.

Q5: By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?

Word Embedding

From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.

References

Contextual Encoding

Contents

References

Q2: How did the embedding representation facilitate the adaptation of Neural Networks in Natural Language Processing?

Q3: How are embedding representations for Natural Language Processing fundamentally different from ones for Computer Vision?

Emory University is a private research university in Atlanta, Georgia. Founded in 1836 as Emory College by the Methodist Episcopal Church and named in honor of Methodist bishop John Emory.[18]

Emory University has nine academic divisions. Emory Healthcare is the largest healthcare system in the state of Georgia[19] and comprises seven major hospitals, including Emory University Hospital and Emory University Hospital Midtown.[20] The university operates the Winship Cancer Institute, Yerkes National Primate Research Center, and many disease and vaccine research centers.[21][22] Emory University is the leading coordinator of the U.S. Health Department's National Ebola Training and Education Center.[23] The university is one of four institutions involved in the NIAID's Tuberculosis Research Units Program.[24] The International Association of National Public Health Institutes is headquartered at the university.[25]

Emory University has the 15th-largest endowment among U.S. colleges and universities.[9] The university is classified among "R1: Doctoral Universities - Very high research activity"[26] and is cited for high scientific performance and citation impact in the CWTS Leiden Ranking.[27] The National Science Foundation ranked the university 36th among academic institutions in the United States for research and development (R&D) expenditures.[28][29] In 1995 Emory University was elected to the Association of American Universities, an association of the 65 leading research universities in the United States and Canada.[5]

Emory faculty and alumni include 2 Prime Ministers, 9 university presidents, 11 members of the United States Congress, 2 Nobel Peace Prize laureates, a Vice President of the United States, a United States Speaker of the House, and a United States Supreme Court Justice. Other notable alumni include 21 Rhodes Scholars and 6 Pulitzer Prize winners, as well as Emmy Award winners, MacArthur Fellows, CEOs of Fortune 500 companies, heads of state and other leaders in foreign government.[30] Emory has more than 149,000 alumni, with 75 alumni clubs established worldwide in 20 countries.[31][32][33]
x1\mathrm{x}_1x1​
x2\mathrm{x}_2x2​
x1\mathrm{x}_1x1​
x2\mathrm{x}_2x2​
x3\mathrm{x}_3x3​
x∈R1×n\mathrm{x} \in \mathbb{R}^{1 \times n}x∈R1×n
y∈R1×m\mathrm{y} \in \mathbb{R}^{1 \times m}y∈R1×m
h∈R1×d\mathrm{h} \in \mathbb{R}^{1 \times d}h∈R1×d
x\mathrm{x}x
y\mathrm{y}y
x\mathrm{x}x
h\mathrm{h}h
Wx∈Rn×d\mathrm{W}_x \in \mathbb{R}^{n \times d}Wx​∈Rn×d
h=activation(x⋅Wx)\mathrm{h} = \mathrm{activation}(\mathrm{x} \cdot \mathrm{W}_x)h=activation(x⋅Wx​)
activation()\mathrm{activation}()activation()
x1..5\mathrm{x}_{1..5}x1..5​
X=[x1,…,xn]X = [x_1, \ldots, x_n]X=[x1​,…,xn​]
xi∈Rd×1x_i \in \mathbb{R}^{d \times 1}xi​∈Rd×1
fff
ggg
fff
xi∈Xx_i \in Xxi​∈X
hi−1h_{i-1}hi−1​
xi−1x_{i-1}xi−1​
hi∈Re×1h_i \in \mathbb{R}^{e \times 1}hi​∈Re×1
$f(x_i, h_{i-1}) = \alpha(W^x x_i + W^h h_{i-1}) = h_i$
Wx∈Re×dW^x \in \mathbb{R}^{e \times d}Wx∈Re×d
Wh∈Re×eW^h \in \mathbb{R}^{e \times e}Wh∈Re×e
α\alphaα
y1y_1y1​
x1x_1x1​
$f(x_1, \mathbf{0}) = \alpha(W^x x_1) = h_1$
yiy_iyi​
xix_ixi​
xix_ixi​
hi−1h_{i-1}hi−1​
Y=[y1,…,yn]Y = [y_1, \ldots, y_n]Y=[y1​,…,yn​]
X=[x1,…,xn]X = [x_1, \ldots, x_n]X=[x1​,…,xn​]
yyy
yyy
hnh_nhn​
ggg
w0w_0w0​
www

Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.

Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:

We then define a function that takes $D$ and $T$, and returns the document-term matrix $X \in \mathbb{R}^{|D| \times |T|}$ such that $X_{i,j}$ indicates the frequency of the $j$'th term in $T$ within the $i$'th document:

L1: The numpy package.

L3:

Let us create a document-term matrix from the corpus, dat/chronicles_of_narnia.txt:

L1: .

L3:

The $i$'th row of $X$ is considered the document vector of the $i$'th document in the corpus, while the transpose of the $j$'th column of $X$ is considered the term vector of the $j$'th term in the vocabulary.

LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix into three matrices, $U$, $\Sigma$, and $V$, where $U \in \mathbb{R}^{|D| \times \sigma}$ and $V \in \mathbb{R}^{|T| \times \sigma}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{\sigma \times \sigma}$ is a diagonal matrix containing singular values, such that $X = U\Sigma V^T$.

An orthogonal matrix is a square matrix whose rows and columns are orthogonal such that $O O^T = O^T O = I$, where $I$ is the identity matrix.

Singular values are non-negative values listed in decreasing order that represent the importance of each topic.

L3:

L4:

This results in $U \in \mathbb{R}^{8 \times 6}$, $\Sigma \in \mathbb{R}^{6 \times 6}$, and $V^T \in \mathbb{R}^{6 \times 6}$ such that:

In $U$, each row represents a document and each column represents a topic.

In $\Sigma$, each diagonal cell represents the weight of the corresponding topic.

In $V^T$, each column represents a term and each row represents a topic.

The last two singular values in $\Sigma$ are not exactly zero but tiny non-negative values, $3.72393692 \times 10^{-16}$ and $2.29674157 \times 10^{-16}$, respectively.

The first four singular values in $\Sigma$ appear to be sufficiently larger than the others; thus, let us reduce their dimensions to $k = 4$ such that $U' \in \mathbb{R}^{8 \times 4}$, $\Sigma' \in \mathbb{R}^{4 \times 4}$, and $V'^T \in \mathbb{R}^{4 \times 6}$:

Given the LSA results, an embedding of the $i$'th document can be obtained as $d_i = U_{i,*} \cdot \Sigma \in \mathbb{R}^{1 \times 4}$:

Finally, an embedding of the $j$'th term can be achieved as $t_j = \Sigma \cdot V^T_{*,j} = V_{j,*} \cdot \Sigma^T \in \mathbb{R}^{1 \times 4}$:

Q6: $\Sigma$ is not transposed in L3 of the above code. Should we use S.transpose() instead?

Source:

Latent Semantic Analysis, Wikipedia.

Singular Value Decomposition, Wikipedia.

Contextual representations are representations of words, phrases, or sentences within the context of the surrounding text. Unlike word embeddings from Word2Vec, where each word is represented by a fixed vector regardless of its context, contextual representations capture the meaning of a word or sequence of words based on its context in a particular document. The representation of a word can therefore vary depending on the words surrounding it, allowing for a more nuanced understanding of meaning in natural language processing tasks.

Attention is All You Need, Vaswani et al., Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.

Q1: How can document-level vector representations be derived from word embeddings?

N-gram Models
Smoothing
Maximum Likelihood Estimation
Entropy and Perplexity
Assignments
Emory University
dat/emory-wiki.txt
Counter
dictionary
collections
typing
open()
read()
split()
list
dat/emory-wiki.txt
sum()
len()
formatted string literals
tuples
sorted()
items()
lambda
slicing
write
dat/word_types.txt
src/frequency_analysis.py
Sequence Types
Mapping Types
Input and Output
bag-of-words model
bag-of-words model
Feedforward Neural Networks
Activation functions
bag-of-words model
Neural Network Methodologies and their Potential Application to Cloud Pattern Recognition
sequence tagging
Feedforward Neural Networks
text classification
Finding Structure in Time
Bidirectional Recurrent Neural Networks
Long Short-Term Memory
PDF
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
Contextual String Embeddings for Sequence Labeling
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
neural language models
tokenizes
EM algorithm
rule-based tokenization
src/byte_pair_encoding.py
Neural Machine Translation of Rare Words with Subword Units
A New Algorithm for Data Compression
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
document_classification
chronicles_of_narnia.txt
glob
os
document_classification.py
K-nearest neighbors
tokenized
bag_of_words_model.py
Bag-of-Words Model
Bags of words
language_models.py
src/homework/
dat/chronicles_of_narnia.txt
language_models.py
activation function
RNN for sequence tagging
smoothing with normalization
normalization
initial word probabilities
sequence probability
DDD
TTT
from src.types import Document, Vocab

def retrieve(filename: str) -> tuple[list[Document], Vocab]:
    documents = [line.split() for line in open(filename)]
    t = {word for document in documents for word in document}
    terms = {term: j for j, term in enumerate(sorted(list(t)))}
    return documents, terms
DDD
TTT
X∈R∣D∣×∣T∣X \in \mathbb{R}^{|D| \times |T|}X∈R∣D∣×∣T∣
Xi,jX_{i,j}Xi,j​
jjj
TTT
iii
import numpy as np

def document_term_matrix(documents: list[Document], terms: Vocab) -> np.array:
    def doc_vector(document: list[str]) -> list[int]:
        v = [0] * len(terms)
        for term in document:
            v[terms[term]] += 1
        return v
    return np.array([doc_vector(document) for document in documents])
import time

D, T = retrieve('dat/chronicles_of_narnia.txt')

st = time.time()
X = document_term_matrix(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))
|D| = 22603, |T| = 12361, Process Time (sec) = 17.87
def document_term_matrix_np(documents: list[Document], terms: Vocab) -> np.array:
    X = np.zeros((len(documents), len(terms)), dtype=int)
    for i, document in enumerate(documents):
        for term in document:
            X[i, terms[term]] += 1
    return X
st = time.time()
X = document_term_matrix_np(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))
|D| = 22603, |T| = 12361, Process Time (sec) =  0.55
iii
XXX
iii
jjj
XXX
jjj
XXX
UUU
Σ\SigmaΣ
VVV
U,V∈R∣D∣×σU, V \in \mathbb{R}^{|D| \times \sigma}U,V∈R∣D∣×σ
Σ∈Rσ×σ\Sigma \in \mathbb{R}^{\sigma \times \sigma}Σ∈Rσ×σ
X=UΣVTX = U\Sigma V^TX=UΣVT
love white cat
love black cat
hate white cat
hate black cat
love white dog
love black dog
hate white dog
hate black dog
D, T = retrieve('dat/latent_semantic_analysis.txt')
X = document_term_matrix_np(D, T)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(S)
U∈R8×6U \in \mathbb{R}^{8 \times 6}U∈R8×6
Σ∈R6×6\Sigma \in \mathbb{R}^{6 \times 6}Σ∈R6×6
VT∈R6×6V^T \in \mathbb{R}^{6 \times 6}VT∈R6×6
UUU
Σ\SigmaΣ
VTV^TVT
def print_np(matrix: np.array):
    print(matrix.shape)
    for row in matrix:
        print(' '.join(['{:8.4f}'.format(c) for c in row]))
        
print_np(U)
print_np(S)
print_np(Vt)
(8, 6)
 -0.3536  -0.4969  -0.0552   0.3536  -0.6648  -0.0396
 -0.3536  -0.4969  -0.0552  -0.3536   0.3750   0.5911
 -0.3536  -0.0552   0.4969   0.3536   0.4847  -0.3598
 -0.3536  -0.0552   0.4969  -0.3536  -0.1949  -0.1917
 -0.3536   0.0552  -0.4969   0.3536   0.3187  -0.1450
 -0.3536   0.0552  -0.4969  -0.3536  -0.0289  -0.4065
 -0.3536   0.4969   0.0552   0.3536  -0.1386   0.5444
 -0.3536   0.4969   0.0552  -0.3536  -0.1512   0.0071
(6, 6)
  3.4641   0.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   2.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   2.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   2.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
(6, 6)
 -0.4082  -0.4082  -0.4082  -0.4082  -0.4082  -0.4082
  0.0000  -0.5522   0.5522   0.4417  -0.4417   0.0000
  0.0000   0.4417  -0.4417   0.5522  -0.5522   0.0000
 -0.7071  -0.0000  -0.0000  -0.0000  -0.0000   0.7071
 -0.2614  -0.3151  -0.3151   0.5765   0.5765  -0.2614
 -0.5148   0.4838   0.4838   0.0310   0.0310  -0.5148
Σ\SigmaΣ
3.72393692×10−163.72393692 \times 10^{−16}3.72393692×10−16
2.29674157×10−162.29674157 \times 10^{−16}2.29674157×10−16
Σ\SigmaΣ
k=4k = 4k=4
U′∈R8×4U' \in \mathbb{R}^{8 \times 4}U′∈R8×4
Σ′∈R4×4\Sigma' \in \mathbb{R}^{4 \times 4}Σ′∈R4×4
V′T∈R4×6V'^{T} \in \mathbb{R}^{4 \times 6}V′T∈R4×6
k = 4
U = U[:, :k]
S = S[:k, :k]
Vt = Vt[:k, :]
iii
di=Ui,∗⋅Σ∈R1×4d_i = U_{i,*} \cdot \Sigma \in \mathbb{R}^{1 \times 4}di​=Ui,∗​⋅Σ∈R1×4
for i, document in enumerate(D):
    t = np.dot(U[i], S)
    print('{}: {}'.format(' '.join(document), ['{:5.2f}'.format(f) for f in t]))
love white cat: [-1.22, -0.99, -0.11,  0.71]
love black cat: [-1.22, -0.99, -0.11, -0.71]
hate white cat: [-1.22, -0.11,  0.99,  0.71]
hate black cat: [-1.22, -0.11,  0.99, -0.71]
love white dog: [-1.22,  0.11, -0.99,  0.71]
love black dog: [-1.22,  0.11, -0.99, -0.71]
hate white dog: [-1.22,  0.99,  0.11,  0.71]
hate black dog: [-1.22,  0.99,  0.11, -0.71]
jjj
tj=Σ⋅V∗,jT=Vj,∗⋅ΣT∈R1×4t_j = \Sigma \cdot V^T_{*,j} = V_{j,*} \cdot \Sigma^T \in \mathbb{R}^{1 \times 4}tj​=Σ⋅V∗,jT​=Vj,∗​⋅ΣT∈R1×4
V = Vt.transpose()
for term, j in sorted(T.items(), key=lambda x: x[1]):
    t = np.dot(V[j], S)
    print('{:>5}: {}'.format(term, ['{:5.2f}'.format(f) for f in t]))
black: [-1.41,  0.00,  0.00, -1.41]
  cat: [-1.41, -1.10,  0.88, -0.00]
  dog: [-1.41,  1.10, -0.88, -0.00]
 hate: [-1.41,  0.88,  1.10, -0.00]
 love: [-1.41, -0.88, -1.10, -0.00]
white: [-1.41,  0.00,  0.00,  1.41]
Σ\SigmaΣ

Word Representations

One-hot Encoding

One-hot encoding represents words as binary vectors such that each word is represented as a vector where all dimensions are zero except for one, which is set to one, indicating the presence of that word.

Consider the following vocabulary:

V = [
    'king',    # 0
    'man',     # 1
    'woman',   # 2
    'queen'    # 3
]

Given a vocabulary size of 4, each word is represented as a 4-dimensional vector as illustrated below:

 king = [1, 0, 0, 0]
  man = [0, 1, 0, 0]
woman = [0, 0, 1, 0]
queen = [0, 0, 0, 1]

One-hot encoding has been largely adopted in traditional NLP models due to its simple and efficient representation of words in sparse vectors.
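As a tiny illustration, the one-hot vectors above can be generated directly from the vocabulary list; this sketch simply mirrors the table shown earlier:

import numpy as np

V = ['king', 'man', 'woman', 'queen']

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(V), dtype=int)   # all dimensions are zero
    v[V.index(word)] = 1              # except the one for this word
    return v

for w in V:
    print(f'{w:>5} = {one_hot(w).tolist()}')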

Q2: What are the drawbacks of using one-hot encoding to represent word vectors?

Word Embeddings

Word embeddings are dense vector representations of words in a continuous vector space. Each word is represented in a high-dimensional space, where the dimensions correspond to different contextual features of the word's meaning.

Consider the embeddings for three words, 'king', 'man', and 'woman':

 king = [0.5, 0.0, 0.5, 0.0]
  man = [0.0, 0.5, 0.5, 0.0]
woman = [0.0, 0.5, 0.0, 0.5]

Based on these distributions, we can infer that the four dimensions in this vector space represent royalty, gender, male, and female respectively, such that the embedding for the word 'queen' can be estimated as follows:

 queen = king - man + woman
       = [0.5, 0.0, 0.0, 0.5]
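This analogy arithmetic is easy to verify with NumPy, using the toy embeddings above:

import numpy as np

king  = np.array([0.5, 0.0, 0.5, 0.0])
man   = np.array([0.0, 0.5, 0.5, 0.0])
woman = np.array([0.0, 0.5, 0.0, 0.5])

queen = king - man + woman   # king - man + woman ≈ queen
print(queen)                 # [0.5 0.  0.  0.5]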

The key idea is to capture semantic relationships between words by representing them in a way that similar words have similar vector representations. These embeddings are learned from large amounts of text data, where the model aims to predict or capture the context in which words appear.

In the above examples, each dimension represents a distinct type of meaning. However, in practice, a dimension can encapsulate multiple types of meanings. Furthermore, a single type of meaning can be depicted by a weighted combination of several dimensions, making it challenging to precisely interpret what each dimension implies.

distributional hypothesis
numpy
numpy.array
dat/chronicles_of_narnia.txt
Basic date and time types
time.time()
$O O^T = O^T O = I$
III
orthogonal matrix
Singular values
dat/latent_semantic_analysis.txt
numpy.linalg.svd()
numpy.diag()
numpy.dot()
latent_semantic_analysis.py
Latent Semantic Analysis
Singular Value Decomposition
Word2Vec
Subword Tokenization
Recurrent Neural Networks
Transformer
Encoder-Decoder Framework
Attention is All You Need
Word2Vec

Team Formation

Formation

  1. Review project ideas from your classmates. Consider which projects align with your interests and skills.

  2. Reflect on your interactions during the speed dating session. Think about:

    • Who shared complementary skills to yours?

    • Which classmates did you communicate well with?

    • Whose working style seemed compatible with yours?

Submission

Connect with your classmates to discuss potential collaboration. Once you have formed a complete team of 3-4 members, submit your team composition to Canvas: [People] > [Teams #].

  • Contact the instructor if you want to form a team of 5 members. While teams of 3-4 are preferred to prevent free riding, exceptions may be considered with justification.

  • If you're having difficulty finding a team, please reach out to the instructor at least a week before the deadline.

Distributional Semantics

Distributional semantics represents the meaning of words based on their distributional properties in large corpora of text. It follows the distributional hypothesis, which states that "words with similar meanings tend to occur in similar contexts".

Contents

Distributional Hypothesis

The distributional hypothesis suggests that words that occur in similar contexts tend to have similar meanings [1]. Let us examine the following two sentences with blanks:

A: I sat on a __

B: I played with my __

The blank for A can be filled with words such as {bench, chair, sofa, stool}, which convey the meaning "something to sit on" in this context. On the other hand, the blank for B can be filled with words such as {child, dog, friend, toy}, carrying the meaning of "someone/thing to play with." However, these two sets of words are not interchangeable, as it is unlikely that you would sit on a "friend" or play with a "sofa".

This hypothesis provides a potent framework for understanding how meaning is encoded in language and has become a cornerstone of modern computational linguistics and natural language processing.

Q1: Assuming that your corpus has only the following three sentences, what context would influence the meaning of the word "chair" according to the distributional hypothesis?

  1. I sat on a chair.

  2. I will chair the meeting.

  3. I am the chair of my department.

References

Word2Vec

Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.

Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.

Continuous Bag-of-Words

$w_k = \arg\max_{\forall w_* \in V} P(w_*|w_{k-2},w_{k-1},w_{k+1},w_{k+2})$

Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where all dimensions have the value of 0 except for the one representing $w_k$, which is set to 1.

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$ and $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{x}$ and $\mathrm{h}$, where the sigmoid function is used as the activation function:

$\mathrm{h} = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{W}_x)$

Finally, let $\mathrm{W}_h \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{h}$ and $\mathrm{y}$:

$\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)$

Thus, each dimension in $\mathrm{y}$ represents the probability of the corresponding word being $w_k$ given the set of context words $I$.
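A minimal NumPy sketch of this CBOW forward pass may help; the toy vocabulary size, the embedding dimension, and the random (untrained) weights below are assumptions made only for illustration:

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(x, Wx, Wh):
    """x: multi-hot context vector (1 x n); returns a distribution over the vocabulary."""
    h = sigmoid(x @ Wx)         # h = sigmoid(x · Wx),   shape (1 x d)
    return softmax(h @ Wh.T)    # y = softmax(h · Wh^T), shape (1 x n)

rng = np.random.default_rng(0)
n, d = 6, 3                               # vocabulary of 6 words, 3-dimensional hidden layer
Wx, Wh = rng.random((n, d)), rng.random((n, d))
x = np.zeros(n); x[[0, 1, 3, 4]] = 1      # context words I = {w0, w1, w3, w4}
y = cbow_forward(x, Wx, Wh)
print(y.argmax(), y.sum())                # predicted target index; probabilities sum to 1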

Q13: What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?

Skip-gram

In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $I$ given $w_k$. Formally, the objective of a Skip-gram model is as follows:

$\begin{align*} w_{k-2} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\ w_{k-1} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\ w_{k+1} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\ w_{k+2} &= \arg\max_{\forall w_* \in V} P(w_*|w_k) \end{align*}$

Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where only the dimension representing $w_k$ is set to 1; all the other dimensions have the value of 0 (thus, $\mathrm{x}$ in Skip-gram is the same as $\mathrm{y}$ in CBOW). Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where only the dimension representing $w_j \in I$ is set to 1; all the other dimensions have the value of 0. All the other components, such as the hidden layer $\mathrm{h} \in \mathbb{R}^{1 \times d}$ and the weight matrices $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ and $\mathrm{W}_h \in \mathbb{R}^{n \times d}$, stay the same as the ones in CBOW.

Q14: What are the advantages of CBOW models compared to Skip-gram models, and vice versa?

Distributional Embeddings

What does each dimension in the hidden layer $\mathrm{h}$ represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $I$, deemed valuable for predicting the target word $w_k$. Formally, each dimension $\mathrm{h}_j$ is computed as the sigmoid activation of the weighted sum between the input vector $\mathrm{x}$ and the column vector $\mathrm{cx}_j = \mathrm{W}_x[:,j] \in \mathbb{R}^{n \times 1}$ such that:

$\mathrm{h}_j = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{cx}_j)$

Then, what does each row vector $\mathrm{rx}_i = \mathrm{W}_x[i,:] \in \mathbb{R}^{1 \times d}$ represent? The $j$'th dimension in $\mathrm{rx}_i$ denotes the weight of the $j$'th feature in $\mathrm{h}$ with respect to the $i$'th word in the vocabulary. In other words, it indicates the importance of the corresponding feature in representing the $i$'th word. Thus, $\mathrm{rx}_i$ can serve as an embedding for the $i$'th word in $V$.

What about the other weight matrix $\mathrm{W}_h$? The $j$'th column vector $\mathrm{ch}_j = \mathrm{W}_h[:,j] \in \mathbb{R}^{n \times 1}$ denotes the weights of the $j$'th feature in $\mathrm{h}$ for all words in the vocabulary. Thus, the $i$'th dimension of $\mathrm{ch}_j$ indicates the importance of the $j$'th feature for the $i$'th word being predicted as the target word $w_k$.

On the other hand, the $i$'th row vector $\mathrm{rh}_i = \mathrm{W}_h[i,:] \in \mathbb{R}^{1 \times d}$ denotes the weights of all features for the $i$'th word in the vocabulary, enabling it to be utilized as an embedding for $w_i \in V$. However, in practice, only the row vectors of the first weight matrix $\mathrm{W}_x$ are employed as word embeddings because the weights in $\mathrm{W}_h$ are often optimized for the downstream task, in this case predicting $w_k$, whereas the weights in $\mathrm{W}_x$ are optimized for finding representations that are generalizable across various tasks.
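The following sketch shows how, once training is done, the rows of $\mathrm{W}_x$ can be read off as word embeddings and compared by cosine similarity; the random matrix below merely stands in for trained weights:

import numpy as np

V  = ['king', 'man', 'woman', 'queen']                 # toy vocabulary
Wx = np.random.default_rng(0).random((len(V), 3))      # stand-in for a trained n x d matrix

# the i'th row of Wx serves as the embedding of the i'th word
embeddings = {word: Wx[i] for i, word in enumerate(V)}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(embeddings['king'], embeddings['queen']))    # similarity between two embeddings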

Q15: What are the implications of the weight matrices $\mathrm{W}_x$ and $\mathrm{W}_h$ in the Skip-gram model?

Q16: What limitations does the Word2Vec model have, and how can these limitations be addressed?

References

Encoder-Decoder Framework


Encoder

An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.

Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":

Is it possible to derive the context vector from $x_n$ instead of $x_c$? What is the purpose of appending an extra token to indicate the end of the sequence?

Decoder

A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.

Let $Y = [y_1, \ldots, y_m, y_t]$ be an output sequence, where $y_i \in \mathbb{R}^{o \times 1}$ is the $i$'th word in the sequence, and $y_t$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f$ and $g$:

  • $f$ takes the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns a hidden state $s_i \in \mathbb{R}^{e \times 1}$ such that $f(y_{i-1}, s_{i-1}) = \alpha(W^y y_{i-1} + W^s s_{i-1}) = s_i$, where $W^y \in \mathbb{R}^{e \times o}$, $W^s \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.

  • $g$ takes the hidden state $s_i$ and returns an output $y_i \in \mathbb{R}^{o \times 1}$ such that $g(s_i) = W^o s_i = y_i$, where $W^o \in \mathbb{R}^{o \times e}$.

Note that the initial hidden state $s_1$ is created by considering only the context vector $y_c$ such that the first output $y_1$ is solely predicted by the context in the input sequence. However, the prediction of every subsequent output $y_i$ is conditioned on both the previous output $y_{i-1}$ and its hidden state $s_{i-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $y_t$.

In some variations of the decoder, the initial hidden state $s_1$ is created by considering both $y_c$ and $h_c$ [1].

Figure 3 illustrates a decoder example that takes the context vector $y_0$ and generates the output sequence, "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which translates the input sequence from English to Korean:

The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this?
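One common way to deal with the issue raised above is to cap the number of decoding steps. The following greedy-decoding sketch follows the definitions of $f$ and $g$ above; the toy dimensions, random weights, and the hard step cap are assumptions made for illustration only:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(y_c, Wy, Ws, Wo, eos_id, max_steps=20, alpha=np.tanh):
    """Greedy decoding: s_i = alpha(W^y y_{i-1} + W^s s_{i-1}), y_i = softmax(W^o s_i)."""
    o, e = Wo.shape
    y_prev, s_prev = y_c, np.zeros(e)
    output = []
    for _ in range(max_steps):                 # cap steps in case [EOS] is never produced
        s = alpha(Wy @ y_prev + Ws @ s_prev)
        token = int(softmax(Wo @ s).argmax())
        output.append(token)
        if token == eos_id:
            break
        y_prev = np.eye(o)[token]              # feed the predicted token back in
        s_prev = s
    return output

rng = np.random.default_rng(3)
o, e = 5, 4                                    # 5 output tokens (id 4 = [EOS]), hidden size 4
Wy, Ws, Wo = rng.random((e, o)), rng.random((e, e)), rng.random((o, e))
print(decode(rng.random(o), Wy, Ws, Wo, eos_id=4))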

The likelihood of the current output $y_i$ can be calculated as:

$P(y_i \mid \{y_c\} \cup \{y_1, \ldots, y_{i-1}\}) = q(y_c, y_{i-1}, s_{i-1})$
$P(Y) = \prod_{i=1}^{m+1} q(y_c, y_{i-1}, s_{i-1})$

The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token $y_t$. What are the benefits of incorporating this artificial token when estimating the sequence probability?

References

Following our Project Ideas submissions and speed dating session, it is time to form your project teams. Each team will consist of 3-4 members.

Distributional Structure, Zellig S. Harris, Word, 10 (2-3): 146-162, 1954.

Consider a sequence of words, $\{w_{k-2}, w_{k-1}, w_{k}, w_{k+1}, w_{k+2}\}$. We can predict $w_k$ by leveraging its contextual words using a generative model similar to the n-gram models discussed previously ($V$: a vocabulary list comprising all unique words in the corpus):

This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW) using a multilayer perceptron. Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where $n = |V|$. $\mathrm{x}$ is created by the bag-of-words model on a set of context words, $I = \{w_{k-2},w_{k-1},w_{k+1},w_{k+2}\}$, such that only the dimensions of $\mathrm{x}$ representing words in $I$ have a value of 1; otherwise, they are set to 0.

Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.

GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.

The Encoder-Decoder Framework is commonly used for solving sequence-to-sequence tasks, where it takes an input sequence, processes it through an encoder, and produces an output sequence. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1:

Let $X = [x_1, \ldots, x_n, x_c]$ be an input sequence, where $x_i \in \mathbb{R}^{d \times 1}$ is the $i$'th word in the sequence and $x_c$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN for text classification. Notice that the end-of-sequence token $x_c$ is used to create an additional hidden state $h_c$, which in turn creates the context vector $y_c$.

where $q$ is a function that takes the context vector $y_c$, the previous output $y_{i-1}$ and its hidden state $s_{i-1}$, and returns the probability of $y_i$. Then, the maximum likelihood of the output sequence can be estimated as follows ($y_0 = s_0 = \mathbf{0}$):

Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.*

Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.*

Speed Dating
Distributional Hypothesis
Word Representations
Latent Semantic Analysis
Neural Networks
Word2Vec
Homework
Distributional Structure
generative model
n-gram models
Efficient Estimation of Word Representations in Vector Space
GloVe: Global Vectors for Word Representation
Bag of Tricks for Efficient Text Classification
maximum likelihood
Sequence to Sequence Learning with Neural Networks
Neural Machine Translation by Jointly Learning to Align and Translate
discriminative model
bag-of-words
multilayer perceptron
encoder
decoder
RNN for text classification

Text Classification

Sentiment Analysis

Project Ideas (2024)

Spring 2024

Mara Adams

My vision is a model which could reference the major (at least American, but potentially other English) style guides (the AP Stylebook, Chicago, MLA, etc.) and, given a sentence with a point of ambiguous grammar/style, could give the solution according to the different major style guides. I'm not married to this idea per se, but I like the idea of working with style guides in some way.

Nate Adams

I'd like to try and build a sentiment analysis tool, capable of classifying emotions into positive, neutral, or negative (or perhaps give a numerical rating). If possible, the model can be expanded to include other emotions like happiness, anger, disappointment, etc.

Calvin Brauer

Looking at words used on social media and comparing sentiments across different social platforms.

Wenxuan Cai

For the team project, I want to use NLP skills to catch keywords in some math papers. From my own experience, math research papers are hard to understand due to their nouns and abstract ideas. If we can get some key things out from the math paper, this will be helpful.

Michael Cao

N/A

Alec Chapman

For the team project at the end of the semester, I would like to build a Sentiment Analysis model for Cryptocurrency Trading. In the world of Crypto, many of the price movements are caused by sudden changes in the sentiment of investors. This often can be found on social media sites like Reddit and Twitter, where users often post their feelings about the currencies. A few years ago, a subreddit called WallStreetBets blew up for causing an insane price increase in Gamestop stock as well as some others. From this, I discovered the power of using Sentiment Analysis modeling on various social media sites to attempt to predict price movements of online currencies and stocks. For this project, I would like to analyze user posts on these social media sites to calculate the overall sentiment for specific Cryptocurrencies. I will then use this data to predict the incoming price changes that may occur.

Marcus Cheema

My project idea involves training a large language model on countless recipes, found online, which must be preprocessed accordingly to be usable. Once the chatbot is trained, a simple web platform would be used for accessibility and testing. This platform would be comprised of chats, and an area to enter text. The goal is to make a chatbot capable of creating usable recipes based on user inputs, dietary restrictions, and ingredient restrictions. I thought of this project idea with Dylan Parker.

Murphy Chen

Concept and Idea: Spam Detection Bot for Email Systems The potential concept for our team project is to develop an advanced Spam Detection Bot that is specifically designed for email systems.

Intelligent Linguistic Analysis: The bot will utilize natural language processing (NLP) techniques to analyze the content of emails. It will focus on identifying linguistic patterns, keywords, and phrase structures commonly associated with spam.

Machine Learning Integration: By employing machine learning algorithms, the bot will be able to learn and adapt to evolving spam tactics. This continuous learning approach ensures that the bot remains effective even as spammers change their strategies.

Feedback Mechanism: The bot allows the users to manually mark certain email as spam or not, which allows the bot to update itself to increase the accuracy of detection and reduce the possibility of false detection.

Real-time Processing and Efficiency: The bot is designed to process emails in real-time, ensuring that users' inboxes are promptly cleared of spam.

Security and Privacy Focus: In addition to spam detection, the bot will prioritize user security and privacy, ensuring that the user’s privacy is always the top priority.

Andrew Chung

I would like to make a model that can read influential people's tweets in the financial world to get some sort of sentiment, and determine if someone should invest in that stock or not.

ex. If Warren Buffett were to tweet out "Apple is a terrible company", the model would be able to detect the sentiment of Apple as negative, and therefore not invest in the stock.

Henry Dierkes

I want to create a language identifier that can output the name of a language based on a text of a certain language as the input. To make this work I believe that we would first identify the types of characters used and narrow down the languages based on that. Secondly, we would tokenize the text into words and compare it to the dictionaries of each language we are still considering. Lastly, we would look at how many matches there are between the text and the dictionaries and output the most similar one. Maybe if it's below a certain threshold, the output could say that there's not enough data to suggest that the language is one we have access to.

Tung Dinh

For the project, I'm thinking about an AI diary app using GPT. This app will let students write about their day, and the AI will offer encouraging words and advice. It could also detect if the student is stressed and help them as a friend. The goal is to create a comforting space for students to reflect and relax.

Benjamin Dixon

Project Idea I would like to study the difference between the usage of “(disability-adj) person” and “person with (disability)” in the context of academic papers. For example, there is lots of discourse on the difference between saying autistic person or person with autism, and from my own experience on reading research papers about autism, both are used frequently. The project could involve gathering data about its usage frequency in scientific papers between academic fields (medical, psychology, sociology, anthropology, etc.), within the same papers (do research papers always use the same term, or do some use both?), by date (is there a difference in pre- and post- 2015, or some other relevant date?), or some other metric. Further work could be done with sentiment analysis technology to see if the papers use language that is favorable / amiable towards disability or disparaging towards disability, and could be correlated with their chosen phrase to describe disability (disabled person vs person with disability) to see if there is a significant difference.

Lindsay Esterman

I just switched into this class and haven't attended any lectures yet so I don't have a good concept of what a good project is, but an idea is to analyze tik toks to gauge that population's opinions on the 2024 election.

Calla Gong

I have a plan to do a text analysis on answers on Quora to distinguish answers from experts and other participants.

Hunter Grimes

One idea for a project which I might like to pursue would be an AI coach for video games. The idea would be to have the coach look at in-game performance, potentially in real time, and give coaching on how to improve, e.g., "pay more attention to objectives".

Frederic Dean Michae Guintu

Something that may be challenging to do as a team project is to look at computational linguistics based on languages other than English. An alternative that would fall within the scope of the class while also looking at nonstandard English would be analyzing various English dialects, with projects such as dialect identification, translation between dialects, or even a chatbot that understands multiple English dialects.

Molly Han

For the group project, I am interested in sentiment analysis, specifically concerning evaluations of business products, or social media posts. Additionally, I am also inclined towards topics associated with the detection of spam emails or messages.

Paige Hendricks

I am fairly open to the type of team project I would like to pursue this semester. Things such as sentiment analysis, story generation, and chatbots would all be interesting to me. I am not sure about the feasibility, but the project I would be most interested in working on would be a text-to-speech tool. I would like to learn how speech generation works on such a level, and it is a tool I could imagine using once created. I know that this type of tool already exists, so I would ideally like to work on a less prevalent aspect such as adding emotionally expressive speech so that it doesn't sound as flat. I am also interested in the ways that characters' or celebrities' voices can be emulated, but I am not sure if this would have any legal hurdles.

Jerry Hong

N/A

Peter Jeong

I have thought of working on a sentiment analysis model which would classify customer reviews we see on e-commerce platforms (whether they are positive, negative, or neutral). However, I also want to listen to other ideas, so I am open to new ones as well.

Helen Jin

Project Idea: Using AI for recipe recommendation, modification, and generation. Opening the fridge often reveals leftover ingredients from previous meals, which can make deciding what to cook a challenge. Typically, we end up spending considerable time looking for recipes that match our tastes and the ingredients we currently have. It’s common to find that we’re missing a few items for the recipes we’re interested in, which can be quite inconvenient. However, by incorporating artificial intelligence, we could have a tool that acts like a consultative chef. This AI system would allow us to tailor our recipe searches more precisely, adjusting recipes to fit our personal requirements, like creating a low-calorie version of a certain dish or suggesting substitutions for missing ingredients.

Carl Kassabian

I am very unsure of what kind of project I would like to pursue. I think combining linguistics with visual data could be interesting though. Maybe some kind of coding that sorts or filters certain types of words or phrases, and then those are represented in some visually pleasing way. I think some kind of code that analyzes literature at a highly specific linguistic level could be very interesting. I am extremely flexible and would be willing to work on almost any project.

Michelle Kim

A possible project could be to use NLP techniques on unconventional data. Personally, I am interested in cross-cultural linguistics / multilingualism such as heritage language usage in diasporic settings. Some possible datasets could be English loan word usage in non-English-speaking countries (i.e., in everyday life, music lyrics, social media, websites). For example, in South Korea and many other countries, popular music integrates English into various texts such as music lyrics, advertisements and marketing, and everyday speech (based on age group). Another possible topic could be correcting gendered language to non-gendered language. Lastly, academic-related topics: I think a fun project could be something like predicting a final course grade based on a student writing sample, predicting / generating potential test questions based on a text, or predicting / generating weak areas of students based on their code samples.

Will Kohn

A text summarization tool for simplifying complex readings for classes.

Andrew Lee

I have a group that I believe I will be working with, yet we have not decided on a topic for a project. An idea I have been thinking about is incorporating a character into a game that takes the language input that the player puts in. After analyzing the type of writing that the player uses, the character will respond in the same way the player wrote. For example, if someone is using Shakespearean language, then the character will respond back the same way.

Sam Liu

Project Idea: Explain Attention Is All You Need to children: Design a system that explains and summarizes academic papers in a more comprehensible way, especially for those who do not have much background knowledge. It can reliably lay out the fundamental information from abstracts, introductions, methods, and findings from any academic paper without missing vital information, allowing readers to process the main ideas efficiently.

Andrew Lu

Though I still don't know much about NLP, for my team project, I think it would be interesting to try to work on a language model that is trained and works solely on inputs with perfect grammar in an attempt to see the effect of input "sanitization" on performance and model size.

Louis Lu

I'd like to apply NLP algorithms and some machine learning algorithms to a publicly available dataset to perform a supervised classification task. For example, applying an MLP to product reviews to distinguish helpful reviews from unhelpful ones. I would like to further compare and evaluate the performance of some large language models, such as BERT and the GPT-4 API.

Wenzhuo Ma

My idea is to make a "PolyGlotBot", which is a multilingual virtual assistant that helps Chinese learners practice and improve their skills. The model will give real-time feedback on grammar, pronunciation, and vocabulary when having a conversation with it. It will also adapt to the user's proficiency level and personalize learning content. This can help learners learn better at their own pace and level.

Ja'Zmin McKeel

I'm not sure but I am open to exploring

Izana Melese

Project Idea: Create a tool that can figure out how people feel when they post on social media, no matter what language they're using. The idea is to make a system that helps us see the emotions behind online conversations in different languages, so users and businesses can get a sense of what people are expressing on a global scale.

Ruichen Ni

I would like to learn how NLP techniques could be applied to alternative data in finance and business.

Ellie Paek

Utilization of sentiment analysis would be something I'd be interested in working on; a project like conducting sentiment analysis on a dataset (such as poems) and using it to generate new data (such as new poems based on a key emotion) is something that I would be interested in. In addition, I would also be interested in leveraging LLMs to create something, such as recreating the personality of characters.

Dylan Parker

The team project that I have in mind is an LLM that recommends personalized recipes based on user requests, dietary restrictions, and ingredient availability. The system could also assist users in meal planning by suggesting balanced meal combinations and creating shopping lists if they do not have the necessary ingredients. My team would ideally collect many recipes online for our database to serve as the foundation for our recommendations. An algorithm would need to be developed for the meal recommendation system. We would then create a web-based application for user interactions with our system. If possible, it would be great to have a user "account" feature for further user personalization in the future. I thought of this project idea with Marcus Cheema.

Sherry Rui

Text Writing Editor: Use models like OpenAI's Davinci API for generating creative writing, including poetry, stories, or even scripts. The focus would be on fine-tuning the model's parameters and prompts to allow certain styles or themes.

Chengyu Shi

Team project: I'd like to know more about sentiment/emotion/opinion analysis of text. This semester I'd probably do something related to analysis techniques regarding sentiment lexicons and sentiment classification models.

Dennis Sun

N/A

Nicole Thomas

N/A

Michael Wang

The team project that we would like to pursue is to use AI (ChatGPT) trained specifically for a task, such as writing, research, or programming. For example, there is a plugin function in Canva supported by ChatGPT that is able to generate infographics and visualizations automatically through prompts.

Yoyo Wang

N/A

Lindsey Wendkos

I am interested in doing a project about sentiment analysis, where a model would be able to decide what the tone of a piece of text is. This is intriguing to me because even as people, tone is difficult to convey accurately through text, and even when we are able to determine the sentiment of a piece of text, it is not always clear exactly how that tone was communicated. I am curious to see how accurate a machine can be in determining something so emotion-based and not clear-cut. With that said, I am open to other project ideas as well!

Cashin Woo

N/A

Freddy Xiong

Two idea possibilities:

  1. Sentiment Analysis Chatbot: A chatbot that detects a person's emotional state of mind. The idea behind this is that we are LARPing as Walmart, Target, or another retailer, and we want to detect how the customer feels without having them directly state it (because that tends to anger them). Put it like this: if you're disappointed with either customer service or a product you just purchased, and you write to a bot wanting to express disappointment but are instead asked how you feel about the service or product, that will just annoy you into stating you are angry or upset. The key to this tool is being able to interpret word choices and convert them into a state of emotional well-being and satisfaction on a scale of 1-5, 1 being extremely unsatisfactory and 5 being extremely satisfactory, and utilizing this metric across 5 rubrics (Customer Service, Product Quality, Cleanliness of Stores, Location Convenience, Feeling of Safety) from a short conversation with the consumer. Along the way, the bot can also make recommendations, including products and advice, after the conversation.

  2. Spam Detection Bot: An email filter bot that would combine an ANN with a corpus of common spam emails, including ones I've fallen for (I've failed 100% of the Emory phishing emails, I'm sad to say; not an exaggerated stat, I've never not clicked on those baits). We'd create a distinction between legitimate bot emails (i.e., job offers, important notifications, reminders) and both harmless and harmful spam. This differs from typical email spam filtering on the multilingual factor: my inbox still has large amounts of non-English spam that filters through, but because the spam bot isn't as well trained in that regard, I'm receiving garbage about fake Chinese job referrals and German job offers (I know they're fake because, like the genius that I am, I clicked on them and inquired, which resulted in me getting more spam emails). The spam detection bot would also create warnings on non-spam emails that border on spam (i.e., promotional emails) and offer the end user the opportunity to enable options to filter them out.

Simon Yu

Team Project Idea: Examine presidential inaugural addresses in terms of word tokens, types, diversity, etc. More points of analysis will definitely come up, but in general we hope to look for different trends over time, individual differences between presidents, and other interesting observations.

Yunnie Yu

Our team would like to develop a system that can analyze and classify the sentiment of social media posts. We will choose one social media platform and focus on posts from a certain period or around a specific topic. Our goal is that the system can help businesses, organizations, and governments understand the public reaction and adjust policies/improve products.

Jingxuan Zhang

N/A

Joyce Zhang

Project Idea: I aim to analyze the linguistic nuances and sentiment differences in academic papers when referring to "people with disabilities" vs. "disabled people". I hope to understand if the choice of terminology correlates with varying sentiments and contextual frameworks in disability discourse within academic literature.

Chenming Zhou

N/A

Serena Zhou

Our team would like to develop a system that can analyze and classify the sentiment of social media posts. We will choose one social media platform and focus on posts from a certain period or around a specific topic. Our goal is that the system can help businesses, organizations, and governments understand the public reaction and adjust policies/improve products.

Team Projects

Spring 2025

[T1] MedEase: AI-Powered Health Companion

  • Members: Peter Chen, Rolf Shi, Annie He, Vera Wen, Sixing Wu

[T2] GitFolio

  • Members: Humphrey Amoakohene, Charlington Coulanges, Robert Jarman, Joseph Ukpong

[T3] FutureFetch: AI-Driven Research & Internship Opportunity Extractor

  • Members: Carol Hou, Steven Liu, Sammi Mi

[T4] OpenDoc: Hybrid Search RAG System for Doctor Recommendation

  • Members: Chris Kim, Brian Suh, Gregory Tolmochow, Andrew Yang

[T5] CourseConnect

  • Members: Mary Guan, Yutong Hu, Jiya Shah

[T6] Tiered NLP Translation of Academic Texts for Variable Comprehension Levels

  • Members: Palmer King, Chloe Pham, El Yirdaw

[T7] CareerAi: AI-Powered Resume Tool

  • Members: Eric Ahn, Daniel Ambrose, Nicholas Correa-Perez, Zeshan Zahid

[T8] MarketGuardian: Your Digital Sentinel for Safe and Authentic Deals

  • Members: Alex Forbes, Jake Gelb, Riyan Shah, James Yoon

[T9] Predicting FDA Efficacy

  • Members: Ridge Chang, Angelina Chen, Isha Dahiya, Well Jitngamplang

[T10] EmaiLLM: Bulk Cleanser

  • Members: Nate Hu, Caleb Jennings, Michael Jung, Kairos Wu

[T11] ADAPT: Automated Document And Prose Tone-transformer

  • Members: Brian Hakam, Olivia Kim, Jonathan Zhang

[T12] ATAP: Application Tracking Automation Program Using NLP to Maximize Efficiency for Applications

  • Members: Mark Arshavsky, Henry Gao, Jenny Jiang

[T13] AlphaParse: AI Summarization of Company 10-K Reporting for Efficient Information Extraction

  • Members: Sujith Yeruva, Akhil Arularasu, Noah Reicin, Jerry Xing

[T14] LabLink: The Missing Link Between You and Research

  • Members: Emma Carrier, Maya Dattatreya, Rhea Kansal, Ryan Lin, Alaş Özen Sezgin

[T15] PoliLLM

  • Members: Harutoshi Okumura

Homework

HW3: Vector Space Models

Your task is to develop a sentiment analyzer trained on the Stanford Sentiment Treebank. Create a vector_space_models.py file in the src/homework/ directory.

  • Define a function named sentiment_analyzer() that takes two parameters, a list of training documents and a list of test documents for classification, and returns the predicted sentiment labels along with the respective similarity scores.

  • Use the $k$-nearest neighbors algorithm for the classification. Find the optimal value of $k$ using the development set, and then hardcode this value into your function before submission.

Data

The sentiment_treebank directory contains the following two files:

  • sst_trn.tst: a training set consisting of 8,544 labeled documents.

  • sst_dev.tst: a development set consisting of 1,101 labeled documents.

Each line is a document, which is formatted as follows:

[Label]\t[Document]

Below are the explanations of what each label signifies:

  • 0: Very negative

  • 1: Negative

  • 2: Neutral

  • 3: Positive

  • 4: Very positive
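Below is a minimal sketch, not a reference solution, of how sentiment_analyzer() could be organized using bag-of-words count vectors, cosine similarity, and k-nearest neighbors. The read_documents() helper, the assumed input format ((label, tokens) pairs for training, token lists for testing), and K = 5 are illustrative assumptions rather than part of the assignment specification.

# Sketch: bag-of-words vectors + cosine-similarity k-nearest neighbors.
from collections import Counter
import math

K = 5  # assumed value; tune k on the development set and hardcode the best one

def read_documents(path):
    # Read '[Label]<TAB>[Document]' lines into (label, tokens) pairs.
    with open(path) as fin:
        return [(int(label), text.split()) for label, text in
                (line.rstrip('\n').split('\t', 1) for line in fin)]

def cosine(v1, v2):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(c * v2.get(w, 0) for w, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def sentiment_analyzer(trn_docs, tst_docs):
    # Return a (predicted_label, similarity_score) pair for each test document.
    trn = [(label, Counter(tokens)) for label, tokens in trn_docs]
    predictions = []
    for tokens in tst_docs:
        vec = Counter(tokens)
        top = sorted(((cosine(vec, tvec), tlabel) for tlabel, tvec in trn),
                     reverse=True)[:K]
        label = Counter(lbl for _, lbl in top).most_common(1)[0][0]  # majority vote
        predictions.append((label, top[0][0]))  # similarity of the nearest neighbor
    return predictions

Accuracy on the development set can then guide the choice of k before it is hardcoded for submission.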

Submission

Commit and push the vector_space_models.py file to your GitHub repository.

Extra Credit

Define a function named sentiment_analyzer_extra() that gives an improved sentiment analyzer.

Rubric

  • Code Submission (1 point)

  • Program Execution (1 point)

  • Development Set Accuracy (4 points)

  • Evaluation Set Accuracy (4 points)

  • Concept Quiz (2 points)

Homework

HW4: Distributional Semantics

Create a distributional_semantics.py file in the src/homework/ directory.

Task 1

Your task is to read word embeddings trained by Word2Vec:

  1. Define a function called read_word_embeddings() that takes a path to the file consisting of word embeddings, word_embeddings.txt.

  2. Return a dictionary where the key is a word and the value is its corresponding embedding as a numpy.array.

Each line in the file adheres to the following format:

[WORD](\t[FLOAT]){50}
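A minimal sketch of parsing this format into the dictionary described above, assuming the file is UTF-8 text with one word followed by 50 tab-separated floats per line:

import numpy as np

def read_word_embeddings(path):
    # Map each word to its 50-dimensional embedding stored as a numpy.array.
    embeddings = {}
    with open(path, encoding='utf-8') as fin:
        for line in fin:
            fields = line.rstrip('\n').split('\t')  # [WORD] followed by 50 floats
            embeddings[fields[0]] = np.array([float(x) for x in fields[1:]])
    return embeddings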

Task 2

Your task is to retrieve a list of the most similar words to a given target word:

  1. Define a function called similar_words() that takes the word embeddings from Task 1, a target word (string), and a threshold (float).

  2. Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must only include words with similarity scores greater than or equal to the threshold, sorted in descending order based on the similarity scores.

Task 3

Your task is to measure a similarity score between two documents:

  1. Define a function called document_similarity() that takes the word embeddings and two documents (string). Assume that the documents are already tokenized.

  2. For each document, generate a document embedding by averaging the embeddings of all words within the document.

  3. Return the cosine similarity between the two document embeddings.
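A minimal sketch of the two functions described in Tasks 2 and 3, assuming the embedding dictionary returned by read_word_embeddings(); skipping out-of-vocabulary words in Task 3 is an assumption, not part of the task description:

import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two dense vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_words(embeddings, target, threshold):
    # Words whose similarity to the target is >= threshold, sorted in descending order.
    t = embeddings[target]
    pairs = [(w, cosine_similarity(t, e)) for w, e in embeddings.items() if w != target]
    return sorted((p for p in pairs if p[1] >= threshold), key=lambda p: p[1], reverse=True)

def document_similarity(embeddings, doc1, doc2):
    # Average the word embeddings of each tokenized document, then compare the averages.
    def embed(doc):
        return np.mean([embeddings[w] for w in doc.split() if w in embeddings], axis=0)
    return cosine_similarity(embed(doc1), embed(doc2))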

Submission

Commit and push the distributional_semantics.py file to your GitHub repository.

Rubric

  • Task 1: Read Word Embeddings (2.8 points)

  • Task 2: Similar Words (3 points)

  • Task 3: Document Similarity (3 points)

  • Concept Quiz (3.2 points)

Transformer

Self Attention

Self-attention, also known as scaled dot-product attention, is a fundamental mechanism used in deep learning and natural language processing, particularly in transformer-based models like BERT, GPT, and their variants. Self-attention is a crucial component that enables these models to understand relationships and dependencies between words or tokens in a sequence.

Here's an overview of self-attention:

  1. The Motivation:

    • The primary motivation behind self-attention is to capture dependencies and relationships between different elements within a sequence, such as words in a sentence or tokens in a document.

    • It allows the model to consider the context of each element based on its relationships with other elements in the sequence.

  2. The Mechanism:

    • Self-attention computes a weighted sum of the input elements (usually vectors) for each element in the sequence. This means that each element can attend to and be influenced by all other elements.

    • The key idea is to learn weights (attention scores) that reflect how much focus each element should give to the others. These weights are often referred to as "attention weights."

  3. Attention Weights:

    • Attention weights are calculated using a similarity measure (typically the dot product) between a query vector and a set of key vectors.

    • The resulting attention weights are then used to take a weighted sum of the value vectors. This weighted sum forms the output for each element.

  4. Scaling and Softmax:

    • To stabilize the gradients during training, the dot products are often scaled by the square root of the dimension of the key vectors.

    • After scaling, a softmax function is applied to obtain the attention weights. The softmax ensures that the weights are normalized and sum to 1.

  5. Multi-Head Attention:

    • Many models use multi-head attention, where multiple sets of queries, keys, and values are learned. Each set of attention weights captures different aspects of relationships in the sequence.

    • These multiple sets of attention results are concatenated and linearly transformed to obtain the final output.

  6. Applications:

    • Self-attention is widely used in transformer-based models for various NLP tasks, including machine translation, text classification, text generation, and more.

    • It is also applied in computer vision tasks, such as image captioning, where it can capture relationships between different parts of an image.

Self-attention is a powerful mechanism because it allows the model to focus on different elements of the input sequence depending on the context. This enables the model to capture long-range dependencies, word relationships, and nuances in natural language, making it a crucial innovation in deep learning for NLP and related fields.
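To make the computation concrete, here is a minimal numpy sketch of scaled dot-product self-attention over an n-by-d embedding matrix; the randomly initialized projection matrices and the toy dimensions are placeholders, since in a real model these projections are learned during training:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(W, Wq, Wk, Wv):
    # W: (n, d) word embeddings; Wq, Wk, Wv: (d, d_k) projection matrices.
    Q, K, V = W @ Wq, W @ Wk, W @ Wv         # queries, keys, values: (n, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n) scaled dot products
    weights = softmax(scores, axis=-1)       # each row of attention weights sums to 1
    return weights @ V                       # (n, d_k) weighted sums of the values

# Toy example: 4 words with 8-dimensional embeddings projected to 6 dimensions.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 6
W = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
print(self_attention(W, Wq, Wk, Wv).shape)  # (4, 6)

Each row of the output is a context-aware mixture of the value vectors, weighted by how strongly that word attends to every other word in the sequence.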

Q9: How does self-attention operate given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension?

Multi-head Attention

Multi-head attention is a crucial component in transformer-based models, such as BERT, GPT, and their variants. It extends the basic self-attention mechanism to capture different types of relationships and dependencies in a sequence. Here's an explanation of multi-head attention:

  1. Motivation:

    • The primary motivation behind multi-head attention is to enable a model to focus on different parts of the input sequence when capturing dependencies and relationships.

    • It allows the model to learn multiple sets of attention patterns, each suited to capturing different kinds of associations in the data.

  2. Mechanism:

    • In multi-head attention, the input sequence (e.g., a sentence or document) is processed by multiple "attention heads."

    • Each attention head independently computes attention scores and weighted sums for the input sequence, resulting in multiple sets of output values.

    • These output values from each attention head are then concatenated and linearly transformed to obtain the final multi-head attention output.

  3. Learning Different Dependencies:

    • Each attention head can learn to attend to different aspects of the input sequence. For instance, one head may focus on syntactic relationships, another on semantic relationships, and a third on longer-range dependencies.

    • By having multiple heads, the model can learn to capture a variety of dependencies, making it more versatile and robust.

  4. Multi-Head Processing:

    • In each attention head, there are three main components: queries, keys, and values. These are linearly transformed projections of the input data.

    • For each head, queries are compared to keys to compute attention weights, which are then used to weight the values.

    • Each attention head performs these calculations independently, allowing it to learn a unique set of attention patterns.

  5. Concatenation and Linear Transformation:

    • The output values from each attention head are concatenated into a single tensor.

    • A linear transformation is applied to this concatenated output to obtain the final multi-head attention result. The linear transformation helps the model combine information from all heads appropriately.

  6. Applications:

    • Multi-head attention is widely used in NLP tasks, such as text classification, machine translation, and text generation.

    • It allows models to capture diverse dependencies and relationships within text data, making it highly effective in understanding and generating natural language.

Multi-head attention has proven to be a powerful tool in transformer architectures, enabling models to handle complex and nuanced relationships within sequences effectively. It contributes to the remarkable success of transformer-based models in a wide range of NLP tasks.
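Continuing the sketch above, a minimal numpy version of multi-head attention concatenates the per-head outputs and applies a final linear projection; the number of heads, the dimensions, and the random projections are illustrative assumptions:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(W, Wq, Wk, Wv):
    # One head of scaled dot-product attention (same computation as above).
    Q, K, V = W @ Wq, W @ Wk, W @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(W, heads, Wo):
    # heads: list of (Wq, Wk, Wv) projection triples, one per attention head.
    # Wo: (h * d_k, d_model) projection applied to the concatenated head outputs.
    concat = np.concatenate([attention(W, *h) for h in heads], axis=-1)  # (n, h * d_k)
    return concat @ Wo                                                   # (n, d_model)

# Toy example: 2 heads, each projecting 8-dimensional embeddings down to 4 dimensions.
rng = np.random.default_rng(1)
n, d, d_k, h = 4, 8, 4, 2
W = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d))
print(multi_head_attention(W, heads, Wo).shape)  # (4, 8)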

Q10: Given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?

Transformer Architecture

Q11: What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?

BERT

Q12: How is a Masked Language Model used in training a language model with a transformer?

Q13: How can one train a document-level embedding using a transformer?

Q14: What are the advantages of embeddings generated by BERT compared to those generated by Word2Vec?

References

  • Attention is All you Need, Vaswani et al., NIPS 2017.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL 2019.

Homework

HW5: Contextual Encoding

Quiz

  1. The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?

  2. Given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?

  3. What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?

  4. How is a Masked Language Model used in training a language model with a transformer?

  5. How can one train a document-level embedding using a transformer?

  6. Neural networks have gained widespread popularity for training natural language processing models since 2013. What factors enabled this popularity, and how do they differ from traditional NLP methods?

  7. Recent large language models like ChatGPT or Claude are trained quite differently from traditional NLP models. What are the main differences, and what factors enabled their development?

References

  • Attention is All you Need, Vaswani et al., NIPS 2017.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL 2019.





Final Report

Report

Header

  • Project Title (create an engaging and impactful title)

  • Course ID and Course Name

  • Team Member Information:

    • Full names

    • Academic majors

    • Contact information

Abstract

A concise summary (200-300 words) addressing:

  • Complete project overview and objectives

  • Intellectual merit and technical innovation

  • Broader societal and field impact

Introduction

  • Objectives

    • Clear statement of project goals

    • Specific deliverables achievable within semester

    • Technical scope and boundaries

  • Motivation

    • Project's societal value and importance

    • Target beneficiaries or applications

    • Potential real-world impact

  • Problem Statement

    • Specific NLP challenge being addressed

    • Current limitations in the field

    • Technical barriers to overcome

  • Innovation Component

    • Unique aspects of proposed approach

    • Comparative analysis with existing solutions

    • Technical or methodological advances

Related Work

  • Analysis of relevant academic research

  • Review of existing industry solutions

  • Identification of current limitations

  • Differentiation of proposed approach

Approach

  • Methodology

    • Detailed technical approach

    • Development framework

    • Implementation strategy

    • Quality assurance methods

  • Research (if applicable)

    • Dataset description and preparation

    • Experimental design

    • Evaluation metrics

    • Validation approach

Results

  • Performance Analysis

    • Quantitative results and metrics

    • Qualitative observations

    • System capabilities and limitations

  • Limitations

    • Current system constraints

    • Technical barriers encountered

    • Areas for improvement

Conclusion

  • Summary

    • Project achievements and contributions

    • Key lessons learned

    • Technical and practical insights

  • Future Work

    • Potential improvements

    • Extension opportunities

    • Research directions

References

Properly formatted citations of all referenced work.

Contributions

In the Appendix, list each team member's name and their percentage of contribution to the final report development. The percentages should total 100%.

Submission

Submit a final report in PDF format, consisting of 5-8 pages (excluding references), to Canvas using the provided template. Change the section titles as needed.

Proposal Report

This half-semester-long project allows you to explore and contribute to the field of Natural Language Processing (NLP) through hands-on research and development. Working in teams, you will propose and execute an innovative NLP project that addresses a real-world problem. Write a proposal consisting of 5-8 pages (excluding references and appendix) using the provided template. Your proposal should offer a thorough description of your project.

Proposal

Header

  • Project Title (create an engaging and impactful title)

  • Course ID and Course Name

  • Team Member Information:

    • Full names

    • Academic majors

    • Contact information

Abstract

A concise summary (200-300 words) addressing:

  • Complete project overview and objectives

  • Intellectual merit and technical innovation

  • Broader societal and field impact

Introduction

  • Objectives

    • Clear statement of project goals

    • Specific deliverables achievable within semester

    • Technical scope and boundaries

  • Motivation

    • Project's societal value and importance

    • Target beneficiaries or applications

    • Potential real-world impact

  • Problem Statement

    • Specific NLP challenge being addressed

    • Current limitations in the field

    • Technical barriers to overcome

  • Innovation

    • Unique aspects of proposed approach

    • Comparative analysis with existing solutions

    • Technical or methodological advances

Background

  • Related Work

    • Analysis of relevant academic research

    • Review of existing industry solutions

    • Identification of current limitations

    • Differentiation of proposed approach

  • Preliminary Work (optional)

    • Summary of completed research/prototypes

    • Initial findings and insights

    • Remaining work for semester

Proposed Approach

  • Methodology

    • Detailed technical approach

    • Development framework

    • Implementation strategy

    • Quality assurance methods

  • Research (if applicable)

    • Dataset description and preparation

    • Experimental design

    • Evaluation metrics

    • Validation approach

Timeline

  • Weekly Schedule

    • Detailed milestone breakdown

    • Key deliverables and deadlines

    • Risk mitigation strategies

  • Team Responsibilities

    • Individual task assignments

    • Role distribution

    • Collaboration methods

    • Progress tracking plan

References

Properly formatted citations of all referenced work.

Contributions

In the Appendix, list each team member's name and their percentage of contribution to the proposal development. The percentages should total 100%.

Submission

Submit your proposal report in PDF to Canvas.


Speed Dating

During this class hour, you will have the opportunity to meet your classmates in a series of brief, focused conversations. The goal is to identify potential teammates whose research interests and skills complement your own. This is not about finding your exact clone, but rather discovering peers whose strengths and passions could contribute to a well-rounded and dynamic project team.

Be concise, be curious, and most importantly, be yourself. After this session, you will have 2.5 weeks to form a team for collaboration on a team project.

Process

  • You will participate in 7 rounds of speed dating, with each round lasting 8 minutes and 2-minute transition time between rounds.

  • During each round:

    • Form a group of three people with classmates.

    • Each person gets 2 minutes to present and answer questions (6 minutes total).

    • Use the remaining 2 minutes for mutual discussion and note-taking.

  • After each round, split up and form new groups with people you have not met yet.

  • By the end of the session, you will have met approximately 14 different classmates (about 25% of the class), helping you identify potential teammates that complement your skills and interests.

Make the most of each interaction by being concise about your interests and skills while asking thoughtful questions to understand your classmates' strengths and working styles.

Team Projects (2024)

Spring 2024

[T1] CAAP: Capture Assistant in Academic Papers

  • Will Kohn, Sam Liu, Ja’Zmin McKeel, Ellie Paek

[T2] Resume Revamper

  • Andrew Chung, Andrew Lu, Frederic Guintu, Tung Dinh

[T3] EmiChat: A 24/7 AI-Powered Academic Advisor

  • Yunnie Yu, Jason Zhang, Serena Zhou

[T4] Classify: The Key to Surviving Course Selection

  • Henry Dierkes, Andrew Lee, Nicole Thomas

[T5] Resu4me

  • Calvin Brauer, Cashin Woo, Jerry Hong

[T6] Enhancing Product Review Insights with OpenAI API

  • Calla Gong, Louis Lu, Wenzhuo Ma, Yoyo Wang

[T7] Parser Matcher: Enhancing Job Search Efficiency for International Students

  • Freddy Xiong, Molly Han, Murphy Chen, Peter Jeong

[T8] NotMe: A Tuning-Less Approach to AI Lyric Generation

  • Helen Jin, Michael Cao, Michael Wang, Michelle Kim

[T9] OpenPlate: Promoting Better Eating Habits

  • Marcus Cheema, Dylan Parker, Sherry Rui

[T10] Two-Stage Hate Speech Detection Method

  • Chengyu Shi, Chenming Zhou, Ruichen Ni, Wenxuan Cai

[T11] RateMyProfessor or HateMyProfessor? A Tool to Detangle Review Bias

  • Joyce Zhang, Paige Hendricks, Lindsey Wendkos, Benjamin Dixon

[T12] Citation Wizard

  • Mara Adams, Hunter Grimes, Carl Kassabian, Simon Yu

[T13] Sentiment Analysis Model

  • Alec Chapman, Izana Melese



Live Demonstration

Your team will participate in an interactive demonstration session where you will present your final project to multiple groups of classmates. The session will follow a rotation format, with each team hosting their station while other students move between presentations in 10-minute intervals.

Logistics

  • Time per rotation: 10 minutes (5 minutes presentation + 5 minutes interaction).

  • Format: Station-based demonstrations using your laptop.

  • Audience: 6-7 different groups will visit your station.

Preparation

Technical Setup

  • Ensure your laptop is fully charged and bring your charger.

  • Test all features of your project thoroughly.

  • Prepare a backup plan (e.g., screenshots, video recording) in case of technical issues.

  • Have all necessary software installed and running properly.

Demonstration Script

Create a concise 5-minute presentation that includes:

  • Project overview.

  • Key features and functionality.

  • Technical highlights or innovative solutions.

  • Implementation challenges and solutions.

  • Future improvements or extensions.

Interactive Component

Prepare for the 5-minute interaction period:

  • Design 2-3 hands-on activities for visitors to try.

  • Create a clear set of instructions for each interactive element.

  • Prepare sample data or test scenarios that showcase different aspects of your project.

  • Consider potential questions and prepare concise answers.

Submission

Create a 5-10 minute video demonstrating your system and submit it to Canvas.

Rubric

  • Technical Achievement & Innovation (1 point):

    • Achievement looks at the technical complexity level and how well it's executed - did you tackle challenging problems? Did you use advanced techniques appropriately?

    • Innovation evaluates creative solutions and novel approaches - did you solve problems in unique ways? Did you combine technologies creatively?

  • Documentation & Communication (0.5 points):

    • Assesses how clearly the team explains your technical implementation during the demo.

    • Evaluates your ability to communicate complex concepts to different audience levels.

    • Includes quality of any supporting materials (diagrams, architecture drawings, user guides).

  • Project Impact & Problem Solving (1 point):

    • Examines the problem definition and how well the solution addresses it.

    • Evaluates whether you understood your target users/market.

    • Considers the potential real-world applicability of the solution.

  • Testing & Quality Assurance (1 point):

    • Evaluates how thoroughly the project was tested.

    • Includes handling of edge cases and error conditions.

    • Assesses performance under various conditions.

  • User Experience & Interface Design (0.5 points):

    • Examines how intuitive and user-friendly the interface is.

    • Evaluates the clarity of user workflows.

    • Assesses visual design and consistency.

  • Video Submission (2 points)

    • Project overview

    • Key features demonstration

    • Technical highlights

    • Use case examples

Proposal Pitch

Each team will deliver an 8-minute presentation pitching their project proposal to the class. This presentation should effectively communicate your project's vision, methodology, and potential impact while demonstrating its feasibility within the semester timeframe.

Requirements

  • Duration: 8 minutes (strict time limit).

  • All team members should contribute to the presentation.

  • Content should cover key aspects of your written proposal, including:

    • Project overview and objectives

    • Problem statement and motivation

    • Proposed methodology

    • Dataset and evaluation plans

    • Timeline and team responsibilities

    • At least one end-to-end example demonstrating your system's input/output

Submission

Submit your presentation slides in PDF to Canvas.

Rubric

  • Content & Technical Merit (1 point):

    • Demonstrates clear understanding of the technical concepts and methodologies involved.

    • Presents a well-researched problem statement with relevant background information.

    • Shows depth of knowledge in the chosen technical area.

    • Identifies and addresses key technical challenges.

  • Feasibility & Planning (1 point):

    • Provides a realistic timeline and clear milestones.

    • Demonstrates thoughtful consideration of available resources and constraints.

    • Outlines specific roles and responsibilities within the team.

    • Shows understanding of potential risks and mitigation strategies.

  • Presentation Quality (1 point):

    • Delivers information in a clear, engaging, and professional manner.

    • Uses visual aids effectively to support key points.

    • Shows good preparation and coordination among team members.

    • Handles questions thoughtfully (if any) and demonstrates good understanding.

  • Innovation & Impact (1 point):

    • Presents a novel approach or unique application to address the problem.

    • Clearly articulates the potential value and significance of the project.

    • Demonstrates awareness of existing solutions and explains improvements.

    • Shows consideration of broader implications and applications.


Project Ideas

Spring 2025

Project Groups by Theme

A. Career Development & Professional Tools

Resume & Cover Letter Analysis

  • Amoakohene, Humphrey: NLP system to analyze and rank resumes based on job descriptions

  • Kansal, Rhea: System to extract key skills from resumes and match with job descriptions

  • Wen, Yuanhuizi: AI-powered resume-to-job matching system using neural networks

  • Zahid, Zeshan: Web app for resume editing and optimization

  • Ambrose, Daniel: Smart cover letter generator using job description keywords

Job Application Tracking

  • Arshavsky, Mark: Email-based internship application tracking system

B. Content Analysis & Verification

Media Bias & Fake News Detection

  • Dattatreya, Maya: Sentiment analysis for detecting tone and bias in media

  • Jerry Xing: Program to evaluate media bias levels

  • Kansal, Rhea: Fake news detection system

  • King, Marisa: Analysis of media word choices based on demographic factors

  • Suh, EunGyul: Fake news detector for social media

  • Yoon, James: Tool for identifying bias in journalism using sentiment analysis

Academic & Research Analysis

  • Ahn, Eric: Academic paper simplifier for general audiences

  • Xu, Jack: Sentiment analysis tool for evaluating research paper quality

  • Xinyuan, Hu: Tool to evaluate research paper credibility

  • Okumura, Harutoshi: LLM-based contradiction detection in political speeches

Review & Sentiment Analysis

  • Chen, Angelina: Bot detection in Amazon product reviews

  • Coulanges, Charlington: Sentiment analysis chatbot for platform reviews

  • Gyul Kim: Analysis of healthcare system and doctor reviews

  • Hou, Carol: Sentiment analysis for social media posts

  • Sezgin, Alas: Dogwhistle detector for social media

  • Yeruva, Sujith: Stock sentiment analysis from financial discussion platforms

C. Content Generation & Processing

Text Simplification & Accessibility

  • Pham, Chloe: Translation of academic text to accessible language

  • Yirdaw, Elshaday: Readability analysis tool with improvement suggestions

  • Yuxuan, Shi: Medical text simplification for patient education

Creative Writing & Story Generation

  • Zhang, Jingzhi: Interactive life story generator

  • Wu, Junting: Poetry style analysis and classification

  • Sixing, Wu: Context-aware reply generator for various scenarios

D. Educational Tools

Course Selection & Academic Planning

  • Hu, Yutong & Shah, Jiya: Natural language-based course recommendation system

Study Aids

  • Jarman, Robert: Pocket Teaching Assistant for lecture note processing

  • Yirdaw, Elshaday: Vocabulary building tool

E. Language & Translation

Language Understanding

  • Chang, Ridge: Idiomatic expression translator across languages

  • Dahiya, Rishita: Context-based word definition tool

  • Kim, Olivia: Hand gesture and NLP integration system

F. Content Classification

Genre & Style Analysis

  • Correa-Perez, Nicholas: Music genre and lyrics analysis

  • Jennings, Caleb: Author style prediction model

  • Jitngamplang, Varakorn: Music mood and theme analyzer

  • Mi, Renzhi & Xiaotong, Liu: Book genre classifier based on word patterns

G. Code Analysis & Documentation

Code Quality

  • Carrier, Emma: Code syntax analysis and improvement tool

  • Shah, Riyan: Automated code documentation generator

H. Data Processing & Analysis

Financial Analysis

  • Jiang, Jenny: Financial news summarizer

  • Reicin, Noah: Financial document analyzer

  • Suh, EunGyul: Accounting fraud detector

Data Cleaning & Structure

  • Guan, Yifei: Police report information extractor

  • Yang, Junhyeok: Web-scraped data cleaner for LLMs

  • Tolmochow, Gregory: Natural language query interface for structured data

I. Health & Fitness

Fitness Planning

  • Chen, Xinmo: Personalized fitness recommendation system

  • Arularasu, Akhil: AI-powered meal planning assistant

  • Jarman, Robert: Gym companion app

J. AI & Machine Learning Research

AI Development

  • Gao Henry: AI text detection and human-like text generation

  • Jung, Seungwon: Zero-shot vs few-shot learning comparison

  • Hakam, Brian: AI debate simulator using persona-based LLMs

Detailed Individual Ideas

Ahn, Eric

The team project idea I had in mind was an academic paper simplifier that would aim to bridge the gap between complex academic research and general audiences. Lots of scholarly papers today are filled with technical jargon, which makes them difficult to understand for non-experts. By using NLTK for text preprocessing followed by NLP (using pre-trained models like GPT or BERT) for summarization to analyze research articles, we could preserve key information and provide plain-language summaries, highlight essential points, and offer explanations for complex terms. Users, such as students or the general public, can input an academic paper and receive an easy-to-read summary via the following potential input options: (1) web scraping from URLs (like PubMed), (2) uploading a .txt/.docx/.pdf file, and/or (3) direct text input in the UI.

Ambrose, Daniel

Cover letters are definitely the most annoying and time consuming part about applying to jobs. There are methods to make a cover letter quicker like using ChatGPT, but the result is usually too general and recruiters could easily tell it's AI-generated. I want to create a project where you can input a job’s description and your general cover letter and it outputs a tailored cover letter that incorporates keywords from the job description. To create this project, I will need to use a mix of AI-generation and keyword matching through text processing. This project will mostly focus on computer science jobs to limit the variety of keywords. Keywords extracted from job descriptions will prioritize certain groups of keywords such as programming languages and the requirements section in the job description. Additionally, there could be functionality to rate how well the job's keywords match your cover letter's keywords.

Amoakohene, Humphrey

Building an NLP system to analyze and rank resumes based on job descriptions, making it easy to spam-apply for jobs without having to manually create new resumes.

Arshavsky, Mark

In Data Science and CS, it’s quite normal for students to apply to tons of internships. From personal experience, it gets really hard to keep track of all the applications. Sometimes you might even miss an important status update that gets buried in your email. My project idea is to create a product that simplifies the internship application process for students by handling the “tracking” and “monitoring” step for them. My vision is for it to work by students connecting the product to their emails (as when you apply to an internship you usually get an email), but I am open to other interpretations! Then, the product will use NLP to parse the relevant emails, extracting data such as company name, date of application, position name, etc. and put it in an Excel sheet. The product will also track for updates to students’ applications and color-code accordingly. Perhaps red would be “Rejected”, yellow would be “Status Update”, and green would be “Offered”. I have been thinking of this idea for some time now, and I think it can make a real-world impact!

Arularasu, Akhil

I want to implement an AI-powered meal planning assistant that dynamically generates personalized meal plans based on user-inputted dietary preferences, allergies, fitness goals, and daily caloric needs. By processing natural language inputs (e.g., "I need a high-protein vegetarian meal plan with 2500 calories per day"), the system will create meals from one or many extensive recipe datasets, ensuring nutritional balance while considering restrictions like keto, vegan, vegetarian, gluten-free, etc.

Key features include leveraging NLP for text generation, calorie and nutrient optimization, dynamic recipe recommendations, a grocery list generator, and more. Users can describe their dietary needs in natural language (e.g., "low-carb, high-protein meals under 2000 calories"), and the system will parse these requests into structured data to generate meal plans. I feel that there are a lot of places the group can take this idea and run with it while utilizing NLP.

Carrier, Emma

An application that identifies poor coding syntax practices and gives suggestions to fix it. For example, if there isn't any commenting found throughout the code, it will suggest to add more comments in specific places (before functions, etc.). The user would input a file / code chunk to be analyzed and it can either return the code chunk accepting the suggestions or just give a list of suggestions for the user to change themselves. There could be an option for multiple programming languages because they all have different syntax and standards that programmers should follow. Could market it as a way to help new programmers or programmers learning a new language as a way to help them identify areas in which their code could improve. Would likely need to implement a parser to achieve this goal.

Chang, Ridge

I want to build a system that translates idiomatic expressions across languages in a way that actually makes sense, rather than just converting words literally. Expressions like “break the ice” don’t mean much if translated directly, but most languages have their own way of saying the same thing or similar things. I want to create a tool that captures these nuances, ensuring that meaning—not just words—is preserved to the best of its ability. I’m interested in how idioms reflect cultural perspectives, and I want to explore how this project can improve communication between languages while keeping translations natural, authentic, and pragmatic.

Chen, Angelina

I think it would be interesting to look into Amazon product reviews and use NLP to detect whether a bot or an actual human is responding; this is due to the increased amount of bot activity on Amazon. Often bots will give fake positive reviews to help boost a particular item or to show that a user is active if they are a paid reviewer. Another idea is to analyze the most popular companies and the kinds of words that are used in advertisements/websites/customer interactions to understand how they have good customer retention. There could be an analysis of popular and dying companies.

Chen, Xinmo

I want to analyze user input and provide personalized fitness recommendations. It will allow users to log their workouts in natural language, such as "I lifted 50 lbs for 3 sets of 8 reps today," and extract key data points using NLP. By analyzing past workout logs, the system will identify trends, detect plateaus, and suggest progressive overload strategies, such as increasing weight, reps, or intensity. Additionally, I also want to incorporate sentiment analysis to give users motivation and provide encouraging feedback. This model will probably rely on outside fitness-related datasets, integrating them to offer actionable insights for optimizing workout routines.

Correa-Perez, Nicholas

My idea for the team project would be an analysis between lyrics and music genre. It would take a list of songs from different genres and see if common themes or even words come up. Perhaps an ambitious goal could be to see if we can tie the type of music to its words (i.e., bubbly music has positive words). Genre would be defined by the musical aspects of songs to have a consistent definition, and the lyrics would be analyzed for the overall theme and tone of the lyrics as a whole.

Coulanges, Charlington

I would like to build a sentiment analysis chatbot to distinguish reviews on given platforms (Yelp, Amazon, Rotten Tomatoes, etc.). By giving the data either a positive, neutral, or negative label, the majority label will either recommend or not recommend the product, using excerpts from the reviews analyzed. There are often times when a review site will have a decent amount of bot reviews to either tank or boost the rating. Thus, the chatbot will also be able to detect whether any review appears to be written by bots or any other non-human source.

Dahiya, Rishita

I've often been fascinated by the intricacies of the English language which may not be intuitive if English is not someone's first language. An example of such is how certain words change in definition by their context, for example the word "set", which means something different in each of these phrases: set the table, the sun set, to set the record, and so on. My goal in this project would be to develop a tool which is able to decipher the definition of the ambiguous words depending on the context it is used and as a result make machine language processing more precise and intuitive. This would make machines understand the human language better and would have practical applications in machine translations and search engines.

Dattatreya, Maya

My project idea would be to create a model that can perform sentiment analysis to detect the tone and bias in media and the news. I want to see how levels of emotional language used in different kinds of articles affect the engagement metrics of popular media and news sources. My goal is to understand how sentiment influences public perception and how media literacy can be improved.

Forbes, Alexander

I get spam emails daily, and I think it'd be an interesting final project to create a straightforward spam detection tool that flags suspicious emails or messages by using NLP techniques to identify spammy keywords, suspicious links, or recurring patterns. A feedback loop will let users confirm or override the model’s decisions, which could potentially improve accuracy over time. This project would combine classification algorithms with a user-friendly interface for simple adoption. Ultimately, I want to produce a packaged, practical tool that addresses an actual, real-world communication problem.

Gelb, Jake

Hi! I am open to all sorts of projects and am not attached to any specific idea. Some areas of interest include spam bot detection, phishing and scam prevention, recipes tailored to dietary needs, sentiment analysis, and chatbot development. One specific idea is a chatbot geared toward fantasy football users that analyzes trades and decision-making, helping users optimize their teams based on player performance, matchups, and trends. I am very excited for this course and am eager to collaborate on an impactful project.

Guan, Yifei

I'm currently working with a professor to aggregate data from police departments. Some departments have organized data that we can use directly, but most only have individual reports from which we must fetch information manually. I want to use language models and text processing techniques to extract information efficiently from individual police reports, including the incident, date, victims, officer information, and so on. This would largely shorten the time we need to process the reports.

Hakam, Brian

AI Debate Simulator using persona-based LLMs (similar to Character AI). This project would fine-tune LLMs to answer like specific people, using scraped transcripts and a tone analysis of the person (funny, serious), then attach the person's history and debate-relevant events to the context window. For example, to simulate Abraham Lincoln, an LLM would first be fine-tuned on speech of that era, then specifically tuned on his transcripts and texts. Another LLM would be trained in the same fashion on a different person. The two or more LLMs would then debate a chosen topic, with an intermediary, neutral LLM asking questions and ultimately deciding the winner. This process would be fully automated: you type in a name, and it would web-scrape and train all on its own.

Henry, Gao

I have two ideas:

  1. Create an AI model that will identify text written by AI. It would be difficult because AIs are becoming more and more human-like. Some ideas would be to look for some “errors” that AIs won't make, or check the cosine similarities between sentences. The “errors” should be something that is grammatically correct but not often used. By checking cosine similarities, we can see if the sentences use similar words, or have similar structures.

  2. Use human-written text to train an AGI agent that writes like a real human. Instead of always choosing the token with the highest probability, choose a random token based on the probability assigned; in this way, the AI will have a larger vocabulary set. Another possible approach is to analyze sentence structures and let each sentence structure appear with a probability similar to its frequency in real text (see the sampling sketch after this list).

  • Henry Gao
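
For the second idea, here is a minimal sketch of sampling the next token from the model's probability distribution instead of always taking the argmax; the toy vocabulary and scores below are made up for illustration and would come from a trained language model in practice.

```python
# Minimal sketch of probability-based token sampling (vs. greedy argmax).
# The vocabulary and logits are hypothetical; in practice they come from a
# trained language model at each decoding step.
import numpy as np

rng = np.random.default_rng(0)
vocab = ['the', 'a', 'quick', 'lazy', 'fox']
logits = np.array([2.0, 1.5, 0.3, 0.2, 0.1])   # hypothetical model scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
greedy = vocab[int(np.argmax(probs))]                                 # always 'the'
sampled = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(5)]  # varies

print('greedy :', greedy)
print('sampled:', sampled)   # sampling yields a larger effective vocabulary
```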

Hou, Carol

I'm interested in a project focusing on sentiment analysis or an email spam detector. I am interested in researching machine learning models that can be used for the classification tasks (classifying a social media post as positive, negative, or neutral, or for emails as spam or not spam.) I haven't narrowed down the main social media website I want to use as a data source for sentiment analysis, but I'm currently thinking of X (twitter), Yelp, etc. The goal of the project would be to determine what customers are thinking about certain products, restaurants, etc. Or, to detect if an email is spam or not.

Hu, Yutong

I would like to build a natural language-based course recommendation system that helps students find relevant courses based on their preferences and academic history. The system will process natural language inputs from users to understand their course requirements and generate personalized recommendations. It will accept free-form text descriptions of desired courses, parsing key information such as subject area/discipline, course topics and content, instructor preferences, credit hour requirements, etc. In addition, we would like to add more advanced filtering based on students' personal information: their majors and the courses they've already taken. We might need to use named entity recognition and some filtering engine.

Jarman, Robert

I have a few ideas.

Online Multiplayer Game: Create a simple multiplayer game (e.g., trivia like charades, word games like Wordle, or strategy-based optimization such as collecting coins on a map within a limited number of steps or time). The game would feature real-time interaction and leaderboards to engage players.

Pocket Teaching Assistant (Pocket-TA): Develop an app that uses natural language processing to summarize lecture notes, create flashcards, organize notes collaboratively in real-time (with features like version history, tags, and search), and answer questions from uploaded documents. This tool aims to make studying for exams more efficient.

Gym Companion App: Build an app to personalize fitness journeys with tailored exercise, diet, and sleep plans based on a user’s current build and desired physique. The app also fosters community by helping users find nearby sports partners for games like basketball or tennis, promoting health and social connection.

Jennings, Caleb

Create a model using machine learning combined with computational linguistics that learns the writing style of a list of classic authors from a database of many classic literature books, and the model tries to predict an author based on a sample text that was left out of the database for testing. Finding the database should be easy since much of classic literature is public domain (though it would take some time to collect it all). A number of machine learning techniques could be applied such as Stochastic Gradient Descent, Random Forests, Neural Networks, and Clustering. The challenge would be to determine how to parse the data, what learning technique to use, and how to use the data in learning the model.

Jiang, Jenny

An idea I had using NLP is a tool that processes the hundreds of news articles released daily regarding markets, finance, and the economy (from the Wall Street Journal, New York Times, CNBC, MarketWatch, etc.) and then provides a summary of what happened. As someone who likes to keep up with the news and markets myself, it can be exhausting to read multiple articles a day from different platforms to get an accurate grasp of current events. An NLP tool could process all these news articles using big data to generate daily summaries of the news. This can help make keeping up with the economy more accessible for those who don't follow the news.

Jitngamplang, Varakorn

I like the idea of a tool that helps find music that precisely matches a desired mood or theme. I want to develop a tool that uses natural language processing techniques to analyze song lyrics and facilitate music discovery based on specified lyrical characteristics. Ideally, it would search for songs based on themes (e.g., romance, travel), sentiment (e.g., melancholic, upbeat), and other lyrical features. By extracting keywords, analyzing sentiment, and employing other NLP processes, I hope it will recommend songs that resonate with given preferences. This goes beyond simple sentiment analysis tools or aggregators, as it would require analyzing musical rhetoric.

Jung, Seungwon

I am open to any topic to research, such as sentiment analysis, news fraud detection, text-to-speech tools that add emotional expressiveness, and even a poetry-generating writing editor. Research related to bias detection and ethical language generation could also be done. My idea at the moment is to research the effectiveness of zero-shot vs. few-shot learning for dynamic gaming NPCs (or everyday-life dialogue if sufficient gaming datasets are not found). Although it is well known that few-shot learning will be more effective than zero-shot, I would like to research its limitations using BLEU, ROUGE, or BERTScore to evaluate the quality of text generated with few-shot learning.

Kansal, Rhea

I am open to any project ideas, but one that I think could be interesting is using NLP to detect fake news. In this project, we would develop a system that can classify news articles or social media posts as either fake or real. The system could analyze linguistic patterns, stylistic features, and content to identify misinformation and provide users with a confidence score for the classification. The prevalence of misinformation in today’s society makes this a highly relevant topic.

Another idea I have is a system that identifies and extracts key skills, qualifications, and experiences from resumes. This tool can be used to match resumes with specific job descriptions, streamlining the recruitment process. It could also rank resumes based on how closely they align with the job requirements, helping recruiters quickly find the best candidates.

Kim, Gyul

I would like to analyze patient reviews of healthcare systems and doctors to identify key topics such as wait time, doctor empathy, quality of care, misdiagnosis, and facility conditions. Through topic modeling, the project can identify recurring themes to provide a summarized review of a facility, service, doctor, or other patient feedback. The project will involve collecting and preprocessing patient reviews from platforms such as Healthgrades, Zocdoc, or Reddit, followed by implementing unsupervised learning models to extract meaningful topics. Sentiment analysis can also be integrated to assess whether patient sentiments about specific topics are positive, negative, or neutral.

Kim, Olivia

Embedding hand gestures with NLP. As LLMs expand into voice recognition, there has been significantly more research on tone, but to my limited knowledge, hand gesture research hasn't been done as extensively. By recording the hand movements made when speaking a certain text and analyzing them, nuances can be captured more effectively. This could also be used to turn written text into AI-generated videos of sign language. However, I strongly suspect this is too good of an idea not to have been done already.

King, Marisa

I would enjoy developing a project that examines the different words and terms used by the media to describe events (both global and internal), and how their specific word choices differ based on factors such as race, gender, sexuality, politics, religion, etc.

Lin, Ryan

My team project concept is a text-message/DM chatbot that can mimic slang or particular styles of texting. People speak differently over text, whether in content, style, or length of messages. I think it's pretty interesting how language over text can be so different from proper grammar; in a sense, it is its own dialect of language. I'm open to doing other things because I'm also interested in using different types of data, including video, audio, and picture data, but I'm not really certain how yet.

Mi, Renzhi

I'm thinking about developing a program that classifies books and novels into their respective genres based on word frequency patterns and word types, which can potentially help to build the page suggestions for readers. This might involve processing volumes of text, extracting frequently occurring terms and utilizing them as key indicators to infer the overarching theme and genre. By leveraging NLP techniques, the program will identify genre-specific linguistic patterns and compare them against predefined datasets corresponding to various literary genres such as mystery, fantasy, science fiction, and romance. This automated classification framework enhances genre identification by analyzing textual content directly, so it can provide a more objective and data-driven approach to literary categorization.
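
As a starting point for this kind of word-frequency genre classifier, here is a minimal sketch using scikit-learn's bag-of-words features with a Naive Bayes model; the in-line snippets and genre labels are placeholders for a real corpus of full books.

```python
# Minimal sketch of genre classification from word-frequency features.
# The tiny in-line texts and labels are placeholders; a real project would
# train on full books and a much larger labeled corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the detective examined the crime scene for hidden clues",
         "the dragon guarded the enchanted castle with ancient magic",
         "the spaceship jumped to warp speed near a distant planet",
         "their hearts fluttered as they kissed under the moonlight"]
genres = ["mystery", "fantasy", "science fiction", "romance"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, genres)

print(model.predict(["a wizard cast a spell inside the castle"]))  # likely 'fantasy'
```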

Okumura, Harutoshi

Political Science research paper on using LLMs to detect explicit contradictions in presidential debate candidates' speeches.

Can LLMs rely solely on natural language structure to find explicit contradictions in candidates' past statements (public speeches, tweets, websites, etc.), use them to solidify an argument of contradiction, and thereby demonstrate a lack of credibility on certain topics?

Are there certain semantic features and laws of contradiction that we can adopt reliably for a consistent system that detects contradictions (no matter how fundamental they may be) and apply to a large corpus of past speeches?

Pham, Chloe

I'm potentially interested in the idea of "translating" texts written in a highly elevated/academic/obscure tone to language that may be more accessible to the average person. While there is an argument for "flowery" and "smart" writing, I think there is an elitist undertone to conveying ideas in a way that is inaccessible to many, especially if the text's purpose is education. There is a large set of existing source texts to draw from, along with some existing SparkNotes-type "translations" of classics and other foundational texts.

Reicin, Noah

Develop a Natural Language Processing (NLP) system that analyzes financial documents to perform two main tasks:

  1. Sentiment Analysis: Identify and categorize the document's tone—positive, neutral, or negative—based on relevant keywords' frequency and contextual usage.

  2. Key Focus Extraction: Detect the frequency and distribution of specific domain-relevant terms to determine the company’s primary areas of emphasis (e.g., sustainability initiatives, R&D, and shareholder returns).

Sezgin, Alas

I would like to work on a dogwhistle detector, specifically for Twitter (or, reluctantly, now named "X"), as with the website's new administration, far-right rhetoric (like fascism) has become less and less censored and perhaps even more common, albeit hidden behind a veil of dogwhistles and plausible deniability that makes it hard for the average person to determine whether what they are consuming is a joke or far-right rhetoric. This has the effect of unknowingly familiarizing users with harmful rhetoric through humor, and the selective lack of censorship toward it can sway user opinions on a supposedly neutral website. This is why it's important to detect dogwhistles or general far-right rhetoric, which may be difficult for the average person. It would be nice to implement this as a browser add-on, but that might be outside the scope of this class. This would likely involve a lot of hard-coding or feeding labeled data into a model, so I am open to new ideas.

Shah, Jiya

This idea was made in collaboration with Yutong Hu. Although I don’t have any concrete ideas on implementation currently, we were thinking of something that helps students choose courses based on information inputted in natural language. It would use their description of the course (i.e., which academic discipline, professors they’d like, how many hours they want/need, which major/general education requirements they still need to fulfill, general subjects they would like to explore, etc.) in addition to information about what courses they have already taken and their major(s) and/or minor(s). This would all be to provide suggestions of what courses they could take based on the course atlas of that semester.

Shah, Riyan

My idea is to process pieces of code and, using an NLP algorithm, generate meaningful comments for them so other people looking at the code can better understand it.

  • create a list of variables used with their purpose

  • comment long lines of code to explain what they are trying to do (e.g., long mathematical expressions, or lines involving lots of variables and packages)

  • create comments for blocks of code that are trying to perform a task (e.g., this for loop is meant to do such and such)

Sixing, Wu

My idea is to make a Brilliant Reply model. I hope the model can assist people in replying in various scenarios:

  1. Email writing assistant: helps formulate emails for different occasions, including networking, job applications, announcements, etc.

  2. Text message replies: helps generate an appropriate and engaging response for personal and business use, and is able to handle slang and emojis.

  3. Scenario-based replies: e.g., helps generate a hook-up or flirting reply on Instagram, a coffee chat invitation on LinkedIn, an appropriate complaint message to customer service, etc.

  4. Personalization: tries to tailor replies to the user's tone and learn from previous texts.

Suh, EunGyul

Project Concept 1: Accounting Fraud Detector. I assume that the SEC filings of companies involved in accounting fraud may contain exaggerated or magnified language and tone. I would like to develop a classification or anomaly detection model that identifies accounting fraud by analyzing companies' SEC filings. However, I anticipate that datasets containing SEC filings from fraudulent companies may not be sufficiently available to train the model effectively.

Project Concept 2: Fake News Detector. Fake news on social media has become a significant concern in recent times. Similar to the first concept, I aim to develop a detection model that identifies potential fake news by analyzing social media posts, such as those on platforms like X. I would collect a dataset from X via its API or use an existing dataset on fake news on social media. Additionally, by extracting meta features from posts, I am interested in exploring which components of a social media post (such as whether the account is verified, whether the post contains images, the post's length, etc.) serve as indicators of potential fake news.

Tolmochow, Gregory

Natural Language Query Interface for Structured Data: This project aims to develop an NLP-powered interface that allows users to query structured databases (CSV or SQL) using natural language. Instead of writing complex SQL queries or manually filtering data, users can ask questions conversationally, and the system will break down their queries into structured filters. For example, given a housing dataset, a user might ask, "How many homes built in 2023 have 2 or more residents within 5 miles of a school?" The model would extract relevant filters such as year_built = 2023, residents >= 2, and dist_to_school < 5. These filters would be presented as blocks that users can refine before and after applying them, or even choose which ones to remove. This project combines natural language understanding with data retrieval, making database interaction more intuitive and user-friendly.
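
Here is a minimal sketch of the filter extraction step, using a few hand-written regular expressions over a toy pandas DataFrame; the column names (year_built, residents, dist_to_school) and the patterns are hypothetical, and a real system would use a proper parser or an LLM instead.

```python
# Minimal sketch of turning a natural language query into structured filters.
# The column names and regex rules are hypothetical placeholders; a real
# system would use a semantic parser or an LLM for this step.
import re
import pandas as pd

def parse_query(query):
    """Extract (column, operator, value) filters from a toy housing query."""
    filters = []
    if m := re.search(r'built in (\d{4})', query):
        filters.append(('year_built', '==', int(m.group(1))))
    if m := re.search(r'(\d+) or more residents', query):
        filters.append(('residents', '>=', int(m.group(1))))
    if m := re.search(r'within (\d+) miles of a school', query):
        filters.append(('dist_to_school', '<=', float(m.group(1))))
    return filters

def apply_filters(df, filters):
    ops = {'==': lambda s, v: s == v, '>=': lambda s, v: s >= v, '<=': lambda s, v: s <= v}
    for col, op, val in filters:
        df = df[ops[op](df[col], val)]
    return df

if __name__ == '__main__':
    homes = pd.DataFrame({'year_built': [2023, 2020, 2023],
                          'residents': [3, 2, 1],
                          'dist_to_school': [2.5, 8.0, 4.0]})
    q = "How many homes built in 2023 have 2 or more residents within 5 miles of a school?"
    print(len(apply_filters(homes, parse_query(q))))  # -> 1
```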

Ukpong, Imeikan

One idea that will hopefully call on knowledge learned in this course is a tool that takes in prompts like conversations or simple sentences and returns whether that input is positive or negative (depending on the difficulty, neutral or more nuanced statements could be considered and categorized as well). To decide what to produce as output, the tool could observe certain keywords based on a heuristic, i.e., words or symbols that tend to be important when people use them; for example, "I just got a new job!" contains "new job", which is positive, and the exclamation point further helps the model be confident that the statement is positive. The sign of the output (positive or negative) is based solely on whether the input elicits positive emotions like happiness and laughter, or negative emotions like sadness and pain.

Wen, Yuanhuizi

This project aims to develop an AI-powered resume-to-job matching system using a dataset containing candidate resumes, the job positions they are applying for, and a numerically labeled match score. We will train a neural network-based regression model to predict the match score between a given resume and job position (already found the dataset from Kaggle). The model's performance will be evaluated using appropriate metrics such as Mean Squared Error (MSE) or R² score (We will do more research on this). This system has two key applications: (1) Automating resume screening to help recruiters efficiently rank candidates, reducing manual effort, and (2) Assisting job seekers in evaluating how well their resume aligns with different job positions before applying. By streamlining the hiring process, this project saves time for both candidates and recruiters, increasing overall efficiency in job matching.

Wu, Junting

Different poets have different styles. People who read enough poems can know the author of a poem through its style. But what is the style? Is it the use of words (some poets would use certain words repeatedly, some poets have characteristic ways of arranging the sequence of words, etc.)? Is it the meaning? Is it the way of starting or ending a poem? Or is it just a feeling? I wonder whether AI can classify poems according to their authors just based on their style. The result may shed light on the understanding of a poet's style. Only English poets would be chosen to avoid issues with translation.

Xiaotong, Liu

My project aims to develop a program for analyzing text in books and novels to determine their genre based on word frequency patterns. The program will skim through a large amount of text, identify the most frequently occurring words, and use them as key indicators to make inferences about the overall theme or genre. Using Natural Language Processing (NLP) techniques, the program will identify genre-specific words and compare them to predefined datasets of literary genres such as suspense, fantasy, science fiction, or romance. This automated classification system helps readers to effectively categorize novels based on textual content rather than metadata.

Xing, Jerry

My team project concept is a program that evaluates media and determines the level of bias. It would specifically evaluate news articles and transcripts of news videos. While all media has the same goal of informing the viewer or reader, often there is a certain level of bias from the person who wrote the article or script. I believe that natural language processing could be very applicable for analyzing media and gauging the amount of bias. In terms of deliverables, the end product would aggregate various sources of media on a topic, and compare the degrees of bias between the content that is written on a singular subject.

Xinyuan, Hu

I'm planning to build a tool that uses natural language processing to analyze and evaluate research paper credibility on platforms like arXiv and Google Scholar. Given the massive volume of papers being published today, having an automated way to assess paper quality would be incredibly helpful. The project will use well-established papers from top journals as training data to identify what makes research credible. The analysis will focus on key aspects such as methodology robustness (how well research methods are described and justified), citation patterns (how the work connects with existing research), and overall writing quality. The process might involve retrieving papers from websites, extracting the paper contents, and then developing a scoring method.

Xu, Jack

I'm interested in making a sentiment analysis tool for studies/papers that can classify the overall quality, or to see if there are patterns that can be found in papers that are considered low-quality in meta-research. Given the text of a paper, it should predict the quality of the evidence and overall methodology. Also, maybe it could highlight some common patterns that are indicative of high quality, etc.

Yang, Junhyeok

Idea: Making web-scraped data readable for LLM

Description: Web-scraped data from various websites often contain unstructured and irrelevant content, making it difficult for LLMs to process effectively. Issues include raw HTML tags, boilerplate text, headers, footers, advertisements, and redundant information. NLP techniques can be leveraged to clean, preprocess, and structure this data into a format optimized for LLMs. Methods such as text normalization, entity recognition, summarization, and noise filtering help remove unwanted elements while preserving meaningful content. By applying NLP-driven parsing and formatting, web-scraped data can be transformed into a structured, high-quality dataset for better comprehension and usability for LLMs.
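
As one possible preprocessing step, here is a minimal sketch that strips boilerplate tags from raw HTML and keeps only longer text fragments; it assumes the beautifulsoup4 package, and the tag list and length threshold are simplifications of what a production pipeline would need.

```python
# Minimal sketch of cleaning raw HTML into LLM-friendly text.
# Assumes beautifulsoup4 is installed; the boilerplate tag list and the
# length threshold are illustrative simplifications.
import re
from bs4 import BeautifulSoup

BOILERPLATE_TAGS = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form']

def clean_html(raw_html):
    soup = BeautifulSoup(raw_html, 'html.parser')
    # Drop tags that rarely carry content (menus, ads, tracking code).
    for tag in soup(BOILERPLATE_TAGS):
        tag.decompose()
    text = soup.get_text(separator='\n')
    # Normalize whitespace and drop very short fragments (labels, menu items).
    lines = [re.sub(r'\s+', ' ', line).strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if len(line) > 30)

if __name__ == '__main__':
    html = ("<html><nav>Home | About</nav>"
            "<p>Web-scraped pages often mix useful paragraphs with menus, ads, "
            "and footers that confuse language models.</p>"
            "<footer>Copyright 2025</footer></html>")
    print(clean_html(html))
```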

Yeruva, Sujith

I would like to do something involving sentiment analysis on stocks. There are sites like SeekingAlpha, ValueInvestorsClub, Yahoo Finance, and many others that analyze a stock and sometimes pitch it. There is also discussion on social media sites like Twitter and Reddit from individual retail investors. I would like to use this to potentially generate a "sentiment score" that shows what the public perception is on certain stocks (and how it might vary between different sources).

Yirdaw, Elshaday

I mainly have two project ideas that I would like to work on throughout the semester. My first (and main) project idea involves creating a tool that can estimate the readability level of a text provided by a user. In addition, I would like for this tool to offer some kind of suggestions that can make the text more accessible. These suggestions could range from simple modification, like replacing difficult words with more commonly used words, to advanced modifications, such as restructuring sentences or paragraphs that may otherwise be unclear and hard to understand. Essentially, the goal of the project would be to estimate the current readability level of the given text and provide modification suggestions to improve the readability of the text (perhaps to some level of complexity that a user might want). The secondary idea I am considering is developing some kind of a vocabulary builder. This tool would take a text provided by a user and identify words (and phrases) that may be challenging to understand. Then, using these words (and phrases), it would generate a vocabulary list along with definitions so that users can use it to expand their understanding.
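
For the first idea, here is a minimal readability sketch based on the Flesch-Kincaid grade level; the syllable counter is a crude vowel-group heuristic rather than a dictionary lookup, so the scores are rough, and a real tool might rely on a dedicated package such as textstat instead.

```python
# Minimal sketch of estimating readability with the Flesch-Kincaid grade level.
# The syllable counter is a rough vowel-group heuristic, so scores are only
# approximate; a real tool might use a dedicated package such as textstat.
import re

def count_syllables(word):
    # Count groups of consecutive vowels as a crude syllable estimate.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def fk_grade(text):
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

if __name__ == '__main__':
    simple = "The cat sat on the mat. It was happy."
    dense = ("The feline positioned itself upon the rectangular floor covering, "
             "experiencing considerable contentment.")
    print(round(fk_grade(simple), 1), round(fk_grade(dense), 1))  # lower = easier
```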

Yoon, James

With the Information Age's abundance of information, such an overwhelming sea of viewpoints and novel ideas may embolden more polarizing instances of media. Consequently, there remains an ever-urgent need to identify bias within journalism, which may be addressed by applying NLP and sentiment analysis techniques. By leveraging sentiment analysis, named entity recognition, and topic modeling, one could assess the emotional tone, political leaning, and framing of articles from multiple sources. Users could input URLs or text to receive a breakdown of sentiment polarity, lexical bias, and a comparative analysis against a diverse dataset of news sources. The tool could further employ machine learning models trained on labeled datasets to enhance accuracy, offering readers an objective lens through which to evaluate media narratives.

Yuxuan, Shi

Text Simplification for Patient Education

The complexity of medical language has been a major barrier in consumer health informatics. For individuals untrained in the medical field, healthcare processes become a black box. Here are some potential directions that I aim to optimize through this project: lexical and contextual complexity; health literacy; language and cultural barriers; patient-centric communication tools.

Solution: a multi-agent environment that can do:

  1. lexical simplification: replace complex medical jargon with simpler terms; syntactic simplification to shorten long, complex sentences.

  2. context-specific explanations (e.g., "benign" as "not cancerous" in a pathology report).

  3. dynamic summarization: highlight and prioritize most critical information, such as diagnoses, treatment plans, next steps.

  4. user feedback loop: allow interaction with agent for further clarification or simplification

  5. personalization: adjust output based on user’s patient literacy levels, language preference, and prior knowledge.

Zahid, Zeshan

The project I'm thinking about creating is a web app or program that can make edits to your resume, whether it be formatting or edits to the way you describe certain things. I'm familiar with backend web app creation in Python and think we could use some of its libraries, along with AI such as GPT-4, to make those edits. This tool could be useful for students as well as job seekers by helping tailor their resumes. Using AI would make it a lot more robust: once a resume is input, the AI could tailor it for specific jobs or purposes and make effective edits.

Zhang, Jingzhi

Interactive Life Story Generator: Turn user-provided life details into a realistic, interactive, fictional story, where real-time user choices can adjust the plot (i.e., exploring future "what-if" scenarios). The project will allow users to specify the genre and ending type and use pre-trained models to dynamically generate realistic narratives based on the user's past experiences. The focus will likely be on prompt engineering to ensure coherence and immersion with a web-based UI for user interaction.


Text Processing

Text processing refers to the manipulation and analysis of textual data through techniques applied to raw text, making it more structured, understandable, and suitable for various applications.
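
As a small preview of the sections below, the following sketch tokenizes a raw string with a simple regular expression and counts word frequencies; the later sections refine each of these steps in detail.

```python
# A small preview: regular-expression tokenization followed by a word
# frequency count on a toy sentence.
import re
from collections import Counter

text = "NLP makes raw text structured; structured text makes NLP applications possible."
tokens = re.findall(r"[a-z]+", text.lower())   # lowercase, keep alphabetic tokens only
print(Counter(tokens).most_common(3))          # e.g., [('nlp', 2), ('makes', 2), ('text', 2)]
```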

Sections

If you are not acquainted with Python programming, we strongly recommend going through all the examples in this section, as they provide detailed explanations of packages and functions commonly used for language processing.

Frequency Analysis
Tokenization
Lemmatization
Regular Expressions
Homework