CS|QTM|LING-329: Computational Linguistics (Spring 2025)
Time: TBA
Location: TBA
Jinho Choi : Associate Professor of Computer Science : Office Hours → TBA
Catherine Baker : MS Student in Computer Science : Office Hours → TuTh 11 AM - 12:30 PM, Zoom (ID: 964 3750 1501, PW: posted in Canvas)
Zelin Zhang : Ph.D. student in Computer Science and Informatics : Office Hours → MW 11:20 AM - 12:50 PM, Zoom (ID: 975 2341 9724, PW: posted in Canvas)
Homework: 70%
Team Formation: 3%
Project Proposal: 12%
Live Demonstration: 15%
Your work is governed by the Emory Honor Code. Honor code violations (e.g., copying from any source, including colleagues and internet sites) will be referred to the Emory Honor Council.
Requests for absence/rescheduling due to severe personal events (such as health, family, or personal reasons) impacting course performance must be supported by a letter from the Office of Undergraduate Education.
Each topic will include homework that combines quizzes and programming assignments to assess your understanding of the subject matter.
Assignments must be submitted individually. While discussions are allowed, your work must be original.
Late submissions within a week will be accepted with a grading penalty of 15% but will not be accepted after the solutions are discussed in class.
Each section incorporates questions to explore the content more comprehensively, with their corresponding answers slated for discussion in the class.
While certain questions may have multiple valid answers, the grading will be based on the responses discussed in class, and alternative answers will be disregarded. This approach allows us to distinguish between answers discussed in class and those generated by AI tools like ChatGPT.
You are encouraged to use any code examples provided in this book.
You can invoke any APIs provided in the course packages (under the src/ directory).
Feel free to create additional functions and variables in the assigned Python file. For each homework, ensure that all your implementations are included in the respective Python file located under the src/homework/ directory.
Usage of packages not covered in the corresponding chapter is prohibited. Ensure that your code does not rely on the installation of additional packages, as we will not be able to execute your program for evaluation if external dependencies are needed.
You are expected to:
Form a team of 3-4 members.
Give a pitch presentation to showcase your idea for the project.
Provide a live demonstration to illustrate the details and potential of your project.
Everyone in the same group will receive the same grade for the project, except for the individual portion.
Your project will undergo evaluation based on various criteria, including originality, feasibility, and potential impact.
Your project will also undergo peer assessment, which will factor into your project grade.
Participation in project presentations and live demonstrations is compulsory. Failure to attend any of these events will result in a zero grade for the respective activity. In the event of unavoidable absence due to severe personal circumstances, a formal letter from the Office of Undergraduate Education must accompany any excuses.
You can earn up to 3 extra credits by helping us improve this online book. If you wish to contribute, please submit an issue to our GitHub repository using the "Online Book" template. Upon verification, you will receive credits based on the following criteria:
Content enhancements (e.g., additional explanations, test codes): 0.3 points
Code bug fixes: 0.2 points
Identification and correction of typos (and other obvious mistakes): 0.1 points
Prior to submission, please check for existing issues to avoid duplication. If multiple submissions of the same (or very similar) issues occur, only the first one will be credited.
By Jinho D. Choi (2023 Edition)
Natural Language Processing (NLP) is a vibrant field in Artificial Intelligence that seeks to create computational models to understand, interpret, and generate human language. NLP technology has become deeply ingrained in our daily lives through various applications, evolving at an unprecedented pace. Understanding how NLP works enables you to maximize the utilization of these applications, ultimately enhancing your lifestyle.
This course focuses on establishing a solid foundation in the core principles essential for modern NLP techniques. Starting with the basics of text processing, you will learn how to manipulate text to enhance data quality for developing NLP models. Next, we will delve into language modeling that enables computational systems to understand and generate human language and explore vector space models that convert human language into machine-readable vector representations.
Moving forward, we will cover distributional semantics, a technique for creating word embeddings based on their global contextual usage, and adapt them for sequence modeling to tackle NLP tasks that are inherently structured around sequences of words. We will also delve into contextual representations that capture the subtleties and nuances of language by considering local context. Finally, we will explore cutting-edge topics, including large language models and their effects on NLP tasks and applications.
Throughout the course, several quizzes and Python programming assignments will further deepen your understanding of the concepts and the practice of NLP. By the end of the term, you can expect to possess the knowledge and skills necessary to navigate the swiftly evolving landscape of NLP.
Introduction to Python Programming
Introduction to Machine Learning
Each section has its own set of references. We highly recommend you read the ones marked with asterisks (*), as they provide an in-depth understanding of those subjects.
HW0: Getting Started
Install Python version 3.10 or higher. Earlier versions are not compatible with this course.
You are encouraged to install the latest version of Python. Please review the new features introduced in each version.
Login to GitHub (create an account if you do not have one).
Create a new repository called nlp-essentials and set it to private.
From the [Settings] menu, add the following as a collaborator to this repository: EmoryTA.
Install PyCharm on your local machine:
The following instructions assume that you have "PyCharm 2023.3.x Professional Edition".
You can get the professional version by applying for an academic license.
Configure your GitHub account:
Go to [Settings] - [Version Control] - [GitHub].
Press [+], select Log in via GitHub, and follow the procedure.
Create a new project:
Press the [Get from VCS] button on the Welcome prompt.
Choose [GitHub] on the left menu, select the nlp-essentials repository, and press [Clone] (make sure the directory name is nlp-essentials).
Setup an interpreter:
Go to [Settings] - [Project: nlp-essentials] - [Project Interpreter].
Click Add Interpreter and select Add Local Interpreter.
In the prompted window, choose [Virtualenv Environment] on the left menu, configure as follows, then press [OK]:
Environment: New
Location: SOME_LOCAL_PATH/nlp-essentials/venv
Base interpreter: Python 3.11 (or the Python version you installed)
https://plugins.jetbrains.com/plugin/10081-jetbrains-academy
Open a terminal by clicking [Terminal] at the bottom (or go to [View] - [Terminal]).
Upgrade pip (if necessary) by entering the following command into the terminal:
Install setuptools (if necessary) using the following command:
Install the ELIT Tokenizer with the following command:
If the terminal prompts "Successfully installed ...", the packages are installed on your machine.
1. Create a package called src under the nlp-essentials directory.
PyCharm may automatically create the __init__.py file under src, which is required for Python to recognize the directory as a package, so leave the file as it is.
2. Create a homework package under the src package.
3. Create a Python file called getting_started.py under homework and copy the code:
If PyCharm prompts you to add getting_started.py to git, press [Add].
4. Run the program by clicking [Run] - [Run 'getting_started']. An alternative way is to click the green triangle (L20) and select Run 'getting_started':
5. If you see the following output, your program runs successfully.
1. Create a .gitignore file under the nlp-essentials directory and copy the content:
2. Add the following files to git by right-clicking on them and selecting [Git] - [Add] (if not already):
getting_started.py
.gitignore
Once the files are added to git, they should turn green. If not, restart PyCharm and try to add them again.
3. Commit and push your changes to GitHub:
Right-click on nlp-essentials.
Select [Git] - [Commit Directory].
Enter a commit message (e.g., Submit Quiz 0).
Press the [Commit and Push] button.
Make sure you both commit and push, not just commit.
4. Check if the above files are properly pushed to your GitHub repository.
Submit the URL of your GitHub repository to Canvas.
Text processing refers to the manipulation and analysis of textual data through techniques applied to raw text, making it more structured, understandable, and suitable for various applications.
If you are not acquainted with Python programming, I strongly recommend going through all the examples in this section, as they provide detailed explanations of packages and functions commonly used for language processing.
Update: 2023-10-31
Consider the following text from Wikipedia about (as of 2023-10-18):
Our task is to determine the number of word tokens and unique word types in this text. A simple way of accomplishing this task is to split the text with whitespaces and count the strings:
What is the difference between a word token and a word type?
L1: Import the Counter class from the collections package.
L4: Open the corpus file, read its contents as a string, split it into a list of words, and store them in the words list.
L5: Use Counter to count the occurrences of each word in words and store the results in the word_counts dictionary.
L7: Print the total number of word tokens in the corpus, which is the length of words.
L8: Print the number of unique word types in the corpus, which is the length of word_counts.
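The code listing itself is not reproduced above, so here is a minimal sketch of the described steps; the corpus path dat/emory-wiki.txt is a placeholder for wherever you saved the Wikipedia text:

```python
from collections import Counter

corpus = 'dat/emory-wiki.txt'  # placeholder path to the saved Wikipedia text
words = open(corpus).read().split()
word_counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(word_counts)}')
```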
In this task, we want to check the top-k most or least frequently occurring words in this text:
L4: Sort items in word_counts in ascending order and save them into wc_asc as a list of (word, count) tuples.
L5: Iterate over the top 10 (least frequent) words in the sorted list and print each word along with its count.
Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) or "Georgia." (with the period). This is because the text was split only by whitespace without considering punctuation. As a result, these words are recognized as separate word types from "Atlanta" and "Georgia", respectively. Hence, the counts of word tokens and types processed above do not necessarily represent the distributions of the text accurately.
Finally, save all word types in alphabetical order to a file:
L1: Open a file word_types.txt in write mode (w). If the file does not exist, it will be created; if it does exist, its previous contents will be overwritten.
L2: Iterate over the unique word types (keys) of word_counts in alphabetical order, and write each word followed by a newline character to fout.
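A short sketch of this step, assuming the word_counts dictionary from the previous snippet:

```python
with open('word_types.txt', 'w') as fout:
    for word in sorted(word_counts):   # word types in alphabetical order
        fout.write(word + '\n')
```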
Update: 2023-10-31
Sometimes, it is more appropriate to consider canonical forms as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature for each year, you want to count both "transformer" and "transformers" as a single item.
Lemmatization is a task that simplifies words into their base or dictionary forms, known as lemmas, to simplify the interpretation of their core meaning.
What is the difference between a lemmatizer and a stemmer [1]?
When analyzing the word types obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:
Universities
University
universities
university
Two variations are applied to the noun "university": capitalization, generally used for proper nouns or initial words, and pluralization, which indicates multiple instances of the term. On the other hand, verbs can also take several variations regarding tense and aspect:
study
studies
studied
studying
get_lemma_lexica()
We want to develop a lemmatizer that normalizes all variations into their respective lemmas. Let us start by creating lexica for lemmatization:
Lexica:
lemmatize()
Next, let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:
L2: Define a nested function aux to handle lemmatization.
L6-7: Try applying each rule in the rules list to word.
L8: If the resulting lemma is in the vocabulary, return it.
L10: If no lemma is found, return None.
L12: Convert the input word to lowercase for case-insensitive processing.
L13: Try to lemmatize the word using verb-related lexica.
L15-16: If no lemma is found among verbs, try to lemmatize using noun-related lexica.
L18: Return the lemma if found or the decapitalized word if no lemmatization occurred.
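The book builds its lexica with get_lemma_lexica(); the following is a simplified, self-contained sketch of the same idea using toy lexica, so the word lists and suffix rules here are illustrative rather than the book's actual data:

```python
def lemmatize(word: str) -> str:
    """Simplified sketch; the book's version uses the lexica created by get_lemma_lexica()."""
    # toy lexica: base vocabularies, irregular forms, and suffix-replacement rules
    nouns = {'university', 'study'}
    nouns_irregular = {'mice': 'mouse'}
    nouns_rules = [('ies', 'y'), ('es', ''), ('s', '')]
    verbs = {'study', 'get'}
    verbs_irregular = {'got': 'get', 'gotten': 'get'}
    verbs_rules = [('ies', 'y'), ('ied', 'y'), ('ying', 'y'), ('ing', ''), ('ed', ''), ('es', ''), ('s', '')]

    def aux(word, vocab, irregular, rules):
        if word in irregular:               # irregular form -> its lemma
            return irregular[word]
        for suffix, replacement in rules:   # try each suffix-replacement rule
            if word.endswith(suffix):
                lemma = word[:len(word) - len(suffix)] + replacement
                if lemma in vocab:          # accept only if the result is a known base form
                    return lemma
        return None

    word = word.lower()                     # case-insensitive processing
    lemma = aux(word, verbs, verbs_irregular, verbs_rules)
    if lemma is None:
        lemma = aux(word, nouns, nouns_irregular, nouns_rules)
    return lemma if lemma is not None else word

print(lemmatize('Universities'))  # university
print(lemmatize('studied'))       # study
```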
We now test our lemmatizer for nouns and verbs:
When the words are further normalized by lemmatization, the number of word tokens remains the same as without lemmatization, but the number of word types is reduced from 197 to 177.
In which tasks can lemmatization negatively impact performance?
Update: 2023-10-31
Regular expressions, commonly abbreviated as regex, form a language for string matching, enabling operations to search, match, and manipulate text based on specific patterns or rules.
Online Interpreter:
Regex provides metacharacters with specific meanings, making it convenient to define patterns:
. : any single character except a newline character.
[ ] : a character set matching any character within the brackets.
\d : any digit, equivalent to [0-9].
\D : any character that is not a digit, equivalent to [^0-9].
\s : any whitespace character, equivalent to [ \t\n\r\f\v].
\S : any character that is not a whitespace character, equivalent to [^ \t\n\r\f\v].
\w : any word character (alphanumeric or underscore), equivalent to [A-Za-z0-9_].
\W : any character that is not a word character, equivalent to [^A-Za-z0-9_].
\b : a word boundary matching the position between a word character and a non-word character.
Examples:
M.\. matches "Mr." and "Ms.", but not "Mrs." (\ escapes the metacharacter .).
[aeiou] matches any vowel.
\d\d\d searches for "170" in "CS170".
\D\D searches for "kg" in "100kg".
\s searches for the space " " in "Hello World".
\S searches for "H" in " Hello".
\w\w searches for "1K" in "$1K".
\W searches for "!" in "Hello!".
\bis\b matches "is", but does not match "island" nor searches for "is" in "basis".
Repetitions allow you to define complex patterns that can match multiple occurrences of a character or group of characters:
* : the preceding character or group appears zero or more times.
+ : the preceding character or group appears one or more times.
? : the preceding character or group appears zero times or once, making it optional.
{m} : the preceding character or group appears exactly m times.
{m,n} : the preceding character or group appears at least m times but no more than n times.
{m,} : the preceding character or group appears at least m times or more.
By default, matches are "greedy" such that patterns match as many characters as possible.
Matches become "lazy" by adding ? after the repetition metacharacters, in which case patterns match as few characters as possible.
Examples:
\d* matches "90" in "90s" as well as "" (the empty string) in "ABC".
\d+ matches "90" in "90s", but has no match in "ABC".
https? matches both "http" and "https".
\d{3} is equivalent to \d\d\d.
\d{2,4} matches "12", "123", and "1234", but not "1" or "12345".
\d{2,} matches "12", "123", "1234", and "12345", but not "1".
<.+> matches the entire string of "<Hello> and <World>".
<.+?> matches "<Hello>" in "<Hello> and <World>", and searches for "<World>" in the text.
Grouping allows you to treat multiple characters, subpatterns, or metacharacters as a single unit. It is achieved by placing these characters within parentheses ( and ).
| : a logical OR, referred to as a "pipe" symbol, allowing you to specify alternatives.
( ) : a capturing group; any text that matches the parenthesized pattern is "captured" and can be extracted or used in various ways.
(?: ) : a non-capturing group; any text that matches the parenthesized pattern, while indeed matched, is not "captured" and thus cannot be extracted or used in other ways.
\num : a backreference that refers back to the most recently matched text by the num'th capturing group within the same regex.
You can nest groups within other groups to create more complex patterns.
Examples:
(cat|dog) matches either "cat" or "dog".
(\w+)@(\w+.\w+) has two capturing groups, (\w+) and (\w+.\w+), and matches email addresses such as "john@emory.edu", where the first and second groups capture "john" and "emory.edu", respectively.
(?:\w+)@(\w+.\w+) has one non-capturing group (?:\w+) and one capturing group (\w+.\w+). It still matches "john@emory.edu" but only captures "emory.edu", not "john".
(\w+) (\w+) - (\2), (\1) has four capturing groups, where the third and fourth groups refer to the second and first groups, respectively. It matches "Jinho Choi - Choi, Jinho", where the first and fourth groups capture "Jinho" and the second and third groups capture "Choi".
(\w+.(edu|org)) has two capturing groups, where the second group is nested in the first group. It matches "emory.edu" or "emorynlp.org", where the first group captures the entire text while the second group captures "edu" or "org", respectively.
Assertions define conditions that must be met for a match to occur. They do not consume characters in the input text but specify the position where a match should happen based on specific criteria.
A positive lookahead assertion (?= ) checks that a specific pattern is present immediately after the current position.
A negative lookahead assertion (?! ) checks that a specific pattern is not present immediately after the current position.
A positive look-behind assertion (?<= ) checks that a specific pattern is present immediately before the current position.
A negative look-behind assertion (?<! ) checks that a specific pattern is not present immediately before the current position.
^ asserts that the pattern following the caret must match at the beginning of the text.
$ asserts that the pattern preceding the dollar sign must match at the end of the text.
Examples:
apple(?=[ -]pie) matches "apple" in "apple pie" or "apple-pie", but not in "apple juice".
do(?!(?: not|n't)) matches "do" in "do it" or "doing", but not in "do not" or "don't".
(?<=\$)\d+ matches "100" in "$100", but not in "100 dollars".
(?<!not )(happy|sad) searches for "happy" in "I'm happy", but does not search for "sad" in "I'm not sad".
not searches for "not" in "note" and "cannot", whereas ^not matches "not" in "note" but not in "cannot".
not$ searches for "not" in "cannot" but not in "note".
Python provides several functions to make use of regular expressions.
Let us create a regular expression that matches "Mr." and "Ms.":
L1:
r'M' matches the letter "M".
r'[rs]' matches either "r" or "s".
r'\.' matches a period (dot).
Currently, no group has been specified for re_mr:
Let us capture the letters and the period as separate groups:
L1: The pattern re_mr is looking for the following:
1st group: "M" followed by either "r" or "s".
2nd group: a period (".").
L2: Match re_mr with the input string "Ms".
L7: Print specific groups by specifying their indexes. Group 0 is the entire match, group 1 is the first capture group, and group 2 is the second capture group.
If the pattern does not find a match, it returns None.
Let us match the following strings with re_mr:
s1 matches "Mr." but not "Ms.", while s2 does not match any pattern. This is because the match() function matches patterns only at the beginning of the string. To match patterns anywhere in the string, we need to use search() instead:
The search() function matches "Mr." in both s1 and s2 but still does not match "Ms.". To match them all, we need to use the findall() function:
While the findall() function matches all occurrences of the pattern, it does not provide a way to locate the positions of the matched results in the string. To find the locations of the matched results, we need to use the finditer() function:
Finally, you can replace the matched results with another string by using the sub() function:
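Below is a small consolidated sketch of match(), search(), findall(), finditer(), and sub() using the grouped re_mr pattern; the test strings are stand-ins for the book's s1 and s2:

```python
import re

re_mr = re.compile(r'(M[rs])(\.)')
s1 = 'Mr. and Ms. Wayne are here.'   # hypothetical test strings
s2 = 'Here are Mr. and Ms. Wayne.'

print(re_mr.match(s1))        # matches "Mr." at the beginning of s1
print(re_mr.match(s2))        # None: the pattern is not at the beginning of s2
print(re_mr.search(s2))       # finds the first occurrence, "Mr.", anywhere in s2
print(re_mr.findall(s1))      # [('Mr', '.'), ('Ms', '.')]: the captured groups of every match
for m in re_mr.finditer(s1):  # each match object also provides its position
    print(m.group(), m.start(), m.end())
print(re_mr.sub('M*', s1))    # replaces every match with "M*"
```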
Finally, let us write a simple tokenizer using regular expressions. We will define a regular expression that matches the necessary patterns for tokenization:
L2: Create a regular expression to match delimiters and a special case:
Delimiters: ',', '.', or whitespaces ('\s+').
The special case: 'n't' (e.g., "can't").
L3: Create an empty list tokens to store the resulting tokens, and initialize prev_idx to keep track of the previous token's end position.
L5: Iterate over matches in text using the regular expression pattern.
L6: Extract the substring between the previous token's end and the current match's start, strip any leading or trailing whitespace, and assign it to t.
L7: If t is not empty (i.e., it is not just whitespace), add it to the tokens list.
L8: Extract the matched token from the match object, strip any leading or trailing whitespace, and assign it to t.
L10: If t is not empty (i.e., the pattern is matched):
L11-12: Check if the previous token in tokens is "Mr" or "Ms" and the current token is a period ("."), in which case, combine them into a single token.
L13-14: Otherwise, add t to tokens.
L18-19: After the loop, there might be some text left after the last token. Extract it, strip any leading or trailing whitespace, and add it to tokens.
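The exact pattern used in the book is not shown here, so the following sketch implements the described logic with an assumed pattern (commas, periods, whitespace, and "n't") and a made-up function name to avoid clashing with the tokenize() defined in the next section:

```python
import re

def tokenize_regex(text: str) -> list[str]:
    # delimiters: commas, periods, whitespace; special case: "n't" (e.g., "isn't" -> "is", "n't")
    re_tok = re.compile(r"([.,]|\s+|n't)")
    tokens, prev_idx = [], 0

    for m in re_tok.finditer(text):
        t = text[prev_idx:m.start()].strip()   # text between the previous match and this one
        if t:
            tokens.append(t)
        t = m.group().strip()                  # the matched delimiter/special case itself
        if t:
            if tokens and tokens[-1] in ('Mr', 'Ms') and t == '.':
                tokens[-1] += t                # re-attach the period of "Mr."/"Ms."
            else:
                tokens.append(t)
        prev_idx = m.end()

    t = text[prev_idx:].strip()                # whatever remains after the last match
    if t:
        tokens.append(t)
    return tokens

print(tokenize_regex("Mr. Wayne isn't here."))  # ['Mr.', 'Wayne', 'is', "n't", 'here', '.']
```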
Test cases for the tokenizer:
Update: 2024-01-05
Your goal is to extract specific information from each book in the . For each book, gather the following details:
Book Title
Year of Publishing
For each chapter within a book, collect the following information:
Chapter Number
Chapter Title
Token count of the chapter content (excluding the chapter number and title), considering each symbol and punctuation as a separate token.
Download the file and place it under the directory. Note that the text file is already tokenized using the .
Create a file in the directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:
For each book, ensure that the list of chapters is sorted in ascending order based on chapter numbers. For every chapter across the books, your program should produce the same list for both the original and modified versions of the text file.
If your program encounters difficulty locating the text file, please verify the working directory specified in the run configuration: [Run - Edit Configurations] -> text_processing. Confirm that the working directory path is set to the top directory, nlp-essentials.
RE_Abbreviation: Dr., U.S.A.
RE_Apostrophe: '80, '90s, 'cause
RE_Concatenation: don't, gonna, cannot
RE_Hyperlink: https://emory.gitbook.io/nlp-essentials
RE_Number: 1/2, 123-456-7890, 1,000,000
RE_Unit: $10, #20, 5kg
Your regular expressions will be assessed based on typical cases beyond the examples mentioned above.
Commit and push the text_processing.py file to your GitHub repository.
Update: 2023-10-31
Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.
What is the difference between a word and a token?
When examining the word types obtained in the previous section, you will notice several words that need further tokenization, many of which can be resolved by leveraging punctuation:
"R1: -> ['"', "R1", ":"]
(R&D) -> ['(', 'R&D', ')']
15th-largest -> ['15th', '-', 'largest']
Atlanta, -> ['Atlanta', ',']
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
Depending on the task, you may want to tokenize [26] into ['[', '26', ']'] for more generalization. In this case, however, we consider "[26]" as a unique identifier for the corresponding reference rather than as the number 26 surrounded by square brackets. Thus, we aim to recognize it as a single token.
delimit()
Let us write the delimit() function that takes a word and a set of delimiters and returns a list of tokens by splitting the word using the delimiters:
L3: If no delimiter is found, return a list containing word as a single token.
L5: If a delimiter is found, create a list tokens to store the individual tokens.
L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.
L7: Add the delimiter itself as a separate token to tokens.
We now test delimit() using the following cases:
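Here is a sketch of delimit() as described above, along with a few of the test cases; the delimiter set is an assumption, not the book's exact set:

```python
def delimit(word: str, delimiters: set[str]) -> list[str]:
    # index of the first delimiter character in word, or -1 if none is found
    i = next((i for i, c in enumerate(word) if c in delimiters), -1)
    if i < 0:
        return [word]                 # no delimiter: the whole word is a single token
    tokens = []
    if i > 0:
        tokens.append(word[:i])       # characters before the delimiter
    tokens.append(word[i])            # the delimiter itself
    if i + 1 < len(word):
        tokens.extend(delimit(word[i + 1:], delimiters))  # recurse on the remainder
    return tokens

delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}   # assumed delimiter set
for w in ['"R1:', '(R&D)', '15th-largest', 'Atlanta,', "Department's"]:
    print(w, delimit(w, delims))
```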
postprocess()
When reviewing the above output, the first four test cases yield accurate results, while the last five are not handled correctly; they should have been tokenized as follows:
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000 -> ['149,000']
U.S. -> ['U.S.']
To handle these special cases, let us post-process the tokens generated by delimit():
L2: Initialize variables i for the current position and new_tokens for the resulting tokens.
L4: Iterate through the input tokens.
L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).
L6: Combine the apostrophe and "s" and append it as a single token.
L7: Move the position indicator by 1 to skip the next character.
L8-10: Case 2: Handling numbers in special formats like [##], ###,### (e.g., [42], 12,345).
L11: Combine the special number format and append it as a single token.
L12: Move the position indicator by 2 to skip the next two characters.
L13: Case 3: Handling acronyms like "U.S.".
L14: Combine the acronym and append it as a single token.
L15: Move the position indicator by 3 to skip the next three characters.
L16-17: Case 4: If none of the special cases above are met, append the current token.
L18: Move the position indicator by 1 to process the next token.
L20: Return the list of processed tokens.
Once the post-processing is applied, all outputs are handled correctly:
tokenize()
At last, we write the tokenize() function that takes a file path to a corpus and a set of delimiters and returns a list of tokens from the corpus:
L2: Read the contents of a file (corpus) and split it into words.
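A sketch of tokenize() under the assumption that delimit() and postprocess() from above are defined in the same module; the corpus path and delimiter set are again placeholders:

```python
def tokenize(corpus: str, delimiters: set[str]) -> list[str]:
    words = open(corpus).read().split()      # whitespace tokenization first
    return [token for word in words          # then split each word on delimiters
                  for token in postprocess(delimit(word, delimiters))]

tokens = tokenize('dat/emory-wiki.txt', delims)   # placeholder path and delimiter set
print(len(tokens), len(set(tokens)))              # word tokens vs. word types
```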
Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.
Your goal is to develop a bigram model that uses the following techniques:
Laplace smoothing with
Measure the initial word probability by adding the artificial token at the beginning of every sentence.
Test your model using .
Create a file in the directory.
Define a function named bigram_model() that takes a file path pointing to the text file and returns a dictionary of bigram probabilities found in the text file.
Use the following constants to indicate the unknown and initial probabilities:
Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.
A bigram model (the resulting dictionary of Task 1)
The initial word (the first word to appear in the sequence)
The length of the sequence (the number of tokens in the sequence)
This function aims to generate a sequence of tokens that adheres to the following criteria:
It must have the precise number of tokens as specified.
Excluding punctuation, there should be no redundant tokens in the sequence.
Finally, the function returns a tuple comprising the following two elements:
The list of tokens in the sequence
Define a function named sequence_generator_max() that accepts the same parameters but returns a sequence with the highest sequence probability among all possible sequences using exhaustive search. To generate long sequences, dynamic programming needs to be adapted.
Commit and push the language_modeling.py file to your GitHub repository.
L1: Sort items in word_counts in descending order and save them into wc_dec as a list of (word, count) tuples, sorted from the most frequent to the least frequent words.
L2: Iterate over the top 10 (most frequent) words in the sorted list and print each word along with its count.
, The Python Standard Library - Built-in Types.
, The Python Standard Library - Built-in Types.
, The Python Tutorial.
Source:
L1:
L8-11:
: base nouns
: nouns whose plural forms are irregular (e.g., mouse -> mice)
: pluralization rules for nouns
: base verbs
: verbs whose inflection forms are irregular (e.g., buy -> bought)
: inflection rules for verbs
L3-4: Check if the word is in the irregular dictionary (), if so, return its lemma.
At last, let us recount word types in using the lemmatizer and save them:
Source:
, Porter, Program: Electronic Library and Information Systems, 14(3), 1980 ().
- A heuristic-based lemmatizer.
The terms "match" and "search" in the above examples have different meanings. "match" means that the pattern must be found at the beginning of the text, while "search" means that the pattern can be located anywhere in the text. We will discuss these two functions in more detail in the .
L3: Create a regular expression re_mr. Note that a string indicated by an r prefix is considered a regular expression in Python.
L4: Try to match re_mr at the beginning of the string "Mr. Wayne".
L6: Print the value of m. If matched, it prints the match information; otherwise, m is None; thus, it prints "None".
L7: Check if a match was found (m is not None), and print the start position and end position of the match.
L1:
L5: Print the entire matched string.
L6: Print a tuple of all captured groups.
Source:
, Kuchling, HOWTOs in Python Documentation.*
The file contains books and chapters listed in chronological order. Your program will be evaluated using both the original and a modified version of this text file, maintaining the same format but with books and chapters arranged in a mixed order.
Define regular expressions to match the following cases using the corresponding variables in :
L1:
L2: Find the index of the first character in word that is in the delimiters set. If no delimiter is found in word, return -1.
L9-10: If there are characters after the delimiter, recursively call the delimit() function on the remaining part of word and extend the tokens list with the result.
L1: .
L3: Tokenize each word in the corpus using the specified delimiters. postprocess() is used to process the special cases further. The resulting tokens are collected in a list and returned.
Given the new tokenizer, let us recount word types in the corpus and save them:
Despite the increase in word types, using a more advanced tokenizer effectively mitigates the issue of sparsity. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?
Source:
- A heuristic-based tokenizer.
In language_modeling.py, define a function named sequence_generator() that takes the following parameters:
Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of 4 punctuation tokens are permitted).
In this task, the goal is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the input word, a potential strategy is as follows:
Identify the next word for which the bigram probability given the current word is maximized.
If that word fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.
Take the chosen word as the new current word and repeat step 1 until you reach the specified sequence length.
The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to the base e, provided as the math.log() function in Python.
01/17
01/22
01/24
(continue)
01/29
(continue)
01/31
(continue)
02/05
02/07
(continue)
02/12
(continue)
02/14
(continue)
02/19
02/21
(continue)
02/26
(continue)
02/28
(continue)
03/04
03/06
(continue)
03/11
Spring Break
03/13
Spring Break
03/18
03/20
(continue)
03/25
(continue)
03/27
(continue)
04/01
(continue)
HW4
04/03
Progress Report
04/08
(continue)
04/10
04/15
04/17
HW5
04/22
04/24
(continue)
04/29
HW6
In the bag-of-words model, a document is represented as a set or a "bag" of words, disregarding any structure but maintaining information about the frequency of every word.
Consider a corpus containing the following two tokenized documents:
The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:
Let $D$ be a document and $w_i$ be the $i$'th word in $D$. A vector representation for $D$ can be defined as $v$, where the $j$'th dimension of $v$ corresponds to the $j$'th word type in the vocabulary and its value is the frequency of that word type's occurrences in $D$ such that:
One limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in being 0 in practice, as most words in the vocabulary do not occur in document . Therefore, it is more efficient to represent as a sparse vector:
How does the bag-of-words model handle unknown words that are not in the vocabulary?
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
Finally, let us test our bag-of-words model with the examples above:
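The source file is bag_of_words_model.py, but its exact signatures are not shown above; the following is a sketch of the two functions with hypothetical names and toy documents:

```python
from collections import Counter

def vocabulary(documents: list[list[str]]) -> dict[str, int]:
    # assign a unique ID to every word type in the corpus
    vocab = {}
    for document in documents:
        for word in document:
            vocab.setdefault(word, len(vocab))
    return vocab

def bag_of_words(vocab: dict[str, int], document: list[str]) -> list[tuple[int, int]]:
    # sparse vector: (word ID, frequency) pairs for words that occur in the document
    counts = Counter(document)
    return sorted((vocab[word], count) for word, count in counts.items() if word in vocab)

corpus = [['John', 'bought', 'a', 'book', '.'],
          ['Mary', 'bought', 'a', 'book', 'too', '.']]          # toy documents
vocab = vocabulary(corpus)
print(bag_of_words(vocab, corpus[0]))
print(bag_of_words(vocab, ['John', 'read', 'the', 'book']))     # 'read', 'the' are out of vocabulary
```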
Source: bag_of_words_model.py
Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials
Update: 2023-10-13
Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X$ is defined as:
$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$
where $P(x)$ is the probability distribution of $X$. The self-information of $x$ is defined as $-\log_2 P(x)$, which measures how much information is gained when $x$ occurs. The negative sign indicates that as the probability of $x$'s occurrence increases, its self-information value decreases.
Entropy has several properties, including:
It is non-negative: $H(X) \geq 0$.
It is at its minimum when $X$ is entirely predictable (all probability mass on a single outcome).
It is at its maximum when all outcomes of $X$ are equally likely.
Why is the self-information value expressed in a logarithmic scale?
Sequence entropy is a measure of the unpredictability or information content of the sequence, which quantifies how uncertain or random a word sequence is.
Assume a long sequence of words, , concatenating the entire text from a language . Let be a set of all possible sequences derived from , where is the shortest sequence (a single word) and is the longest sequence. Then, the entropy of can be measured as follows:
The entropy rate (per-word entropy), , can be measured by dividing by the total number of words :
In theory, there is an infinite number of unobserved word sequences in the language . To estimate the true entropy of , we need to take the limit to as approaches infinity:
The Shannon-McMillan-Breiman theorem implies that if the language is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to their respective probabilities.
The bigram model in the previous section is stationary because all probabilities rely on the same condition, . In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
By applying this theorem, can be approximated:
Consequently, is approximated as follows, where :
What does it mean when the entropy of a corpus is high?
Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence is measured as:
Hence, the higher the probability of the sequence is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating the sequence.
Perplexity, , can be directly derived from the approximated entropy rate, :
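As a concrete illustration of these definitions, here is a small sketch that computes the entropy of a toy distribution and the perplexity of a sequence under a unigram model; the function names and numbers are illustrative only:

```python
import math

def entropy(dist: dict[str, float]) -> float:
    # H(X) = -sum P(x) * log2 P(x); the self-information of x is -log2 P(x)
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def perplexity(sequence: list[str], probs: dict[str, float]) -> float:
    # per-word cross-entropy of the sequence under a unigram model, then PP = 2^H
    h = -sum(math.log2(probs[w]) for w in sequence) / len(sequence)
    return 2 ** h

uniform = {'a': 0.25, 'b': 0.25, 'c': 0.25, 'd': 0.25}
skewed = {'a': 0.85, 'b': 0.05, 'c': 0.05, 'd': 0.05}
print(entropy(uniform), entropy(skewed))         # the uniform distribution has higher entropy
print(perplexity(['a', 'b', 'a', 'a'], skewed))  # lower when the model fits the sequence well
```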
How does high entropy affect the perplexity of a language model?
A vector space model is a computational framework to represent text documents as vectors in a high-dimensional space such that each document is represented as a vector, and each dimension of the vector corresponds to a particular term in the vocabulary.
Update: 2024-01-05
An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.
For the sentence "I'm a computer scientist.", [1-3]-grams can be extracted as follows:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as individual tokens, which should have been tokenized as ["I", "'m"] and ["scientist", "."].
What are the potential issues of using n-grams without proper tokenization?
Given a large corpus, a unigram model calculates the probability of each word as follows (: the total occurrences of in the corpus, : a set of all word types in the corpus):
Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
What are the benefits of processing line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
L1: Import the Unigram type alias from the src.types package.
L5: Calculate the total count of all unigrams in the text.
L6: Return a dictionary where each word is a key and its probability is the value.
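A standalone sketch of the two functions described above; the book's versions return the Unigram type alias from src.types, whereas this sketch uses plain dictionaries:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    for line in open(filepath):           # count line by line instead of loading the whole file
        unigrams.update(line.split())
    return unigrams

def unigram_estimation(filepath: str) -> dict[str, float]:
    unigrams = unigram_count(filepath)
    total = sum(unigrams.values())        # total number of word tokens
    return {word: count / total for word, count in unigrams.items()}

probs = unigram_estimation('dat/chronicles_of_narnia.txt')
print(sorted(probs.items(), key=lambda x: x[1], reverse=True)[:10])
```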
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file dat/chronicles_of_narnia.txt:
L1: Import the Callable type from the typing module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
L12: Pass the unigram_estimation() function as the second argument.
What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?
A bigram model calculates the conditional probability of the current word given the previous word as follows (: the total occurrences of in the corpus in that order, : a set of all word types occurring after ):
Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
L1: Import the defaultdict class from the collections package.
L2: import the Bigram type alias from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
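A standalone sketch of bigram_count() and bigram_estimation(); the book's versions use the Bigram type alias from src.types, and the choice of 'the' as the previous word in the test is arbitrary:

```python
from collections import Counter, defaultdict

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)
    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]][words[i]] += 1     # count (previous word, current word)
    return bigrams

def bigram_estimation(filepath: str) -> dict[str, dict[str, float]]:
    bigrams = bigram_count(filepath)
    probs = {}
    for prev, counter in bigrams.items():
        total = sum(counter.values())                # all bigrams sharing the previous word
        probs[prev] = {curr: count / total for curr, count in counter.items()}
    return probs

probs = bigram_estimation('dat/chronicles_of_narnia.txt')
print(sorted(probs['the'].items(), key=lambda x: x[1], reverse=True)[:10])
```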
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
Source: ngram_models.py
N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.
Update: 2024-01-05
The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.
Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds the count of 1 to every word and recalculates the unigram probabilities:
Thus, the probability of any unknown word with Laplace smoothing is calculated as follows:
The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.
Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:
Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the unigram_count() function from the src.ngram_models package.
L4: Define a constant representing the unknown word.
L8: Increment the total count by the vocabulary size.
L9: Increment each unigram count by 1.
L10: Add the unknown word to the unigrams with a probability of 1 divided by the total count.
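A sketch of unigram_smoothing() following the steps above; the value of the UNKNOWN constant is a placeholder, and the import assumes the unigram_count() defined in the previous section:

```python
from src.ngram_models import unigram_count   # the counting function from the previous section

UNKNOWN = '<unk>'   # placeholder; use whatever constant the book defines for the unknown word

def unigram_smoothing(filepath: str) -> dict[str, float]:
    unigrams = unigram_count(filepath)
    total = sum(unigrams.values()) + len(unigrams)        # add the vocabulary size to the total
    probs = {word: (count + 1) / total for word, count in unigrams.items()}
    probs[UNKNOWN] = 1 / total                            # any unseen word gets 1 / total
    return probs

probs = unigram_smoothing('dat/chronicles_of_narnia.txt')
print(probs.get('Aslan', probs[UNKNOWN]), probs.get('Jinho', probs[UNKNOWN]))
```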
We then test unigram_smoothing() with a text file dat/chronicles_of_narnia.txt:
L1: Import the test_unigram() function from the ngram_models package.
Compared to the unigram results without smoothing (see the "Comparison" tab above), the probabilities for these top unigrams have slightly decreased.
Will the probabilities of all unigrams always decrease when Laplace smoothing is applied? If not, under what circumstances might the unigram probabilities increase after smoothing?
The unigram probability of any word (including unknown words) can be retrieved using the UNKNOWN key:
L2: Use the get() method to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.
L5: Test a known word, 'Aslan', and an unknown word, 'Jinho'.
The bigram model can also be enhanced by applying Laplace smoothing:
Thus, the probability of an unknown bigram, where the previous word is known but the current word is unknown, is calculated as follows:
What does the Laplace smoothed bigram probability of represent when is unknown? What is a potential problem with this estimation?
Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the bigram_count() function from the src.ngram_models package.
L6-8: Create a set vocab containing all unique words in the bigrams.
L12: Calculate the total count of all bigrams with the same previous word.
L13: Calculate and store the probabilities of each current word given the previous word
L14: Calculate the probability for an unknown current word.
L17: Add a probability for an unknown previous word.
Why are the L7-8 in the above code necessary to retrieve all word types?
We then test bigram_smoothing() with the same text file:
L1: Import the test_bigram() function from the ngram_models package.
Finally, we test the bigram estimation using smoothing for unknown sequences:
L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.
L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.
L8: The tuple word is unpacked and passed as the second and third parameters.
Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a word is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:
There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $|V| = 7$. Before Laplace smoothing, the bigram probabilities of are estimated as follows:
However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:
The bigram distribution for can be normalized to 1 by adding the total number of word types occurring after , denoted as , to the denominator instead of :
Consequently, the probability of an unknown bigram can be calculated with the normalization as follows:
For the above example, . Once you apply to , the sum of its bigram probabilities becomes 1:
A major drawback of this normalization is that the probability cannot be measured when is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability of , where the previous word is unknown, as follows:
Source: smoothing.py
A language model is a computational model designed to understand, generate, and predict human language. It captures language patterns, learns the likelihood of a specific term occurring in a given context, and assigns probabilities to word sequences through training on extensive text data.
Update: 2023-10-13
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.
In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.
Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I'm a teacher," and selecting the one with the highest probability.
The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:
The joint probability of the word sequence "I am a student" can be measured as follows:
Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: Chain Rule and Markov Assumption.
By applying the chain rule, the above joint probability can be decomposed into:
Thus, the probability of any given word sequence can be measured as:
The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring is as challenging as measuring .
The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:
The joint probability can now be measured by the product of unigram and bigram probabilities.
How do the chain rule and Markov assumption simplify the estimation of sequence probability?
Let us consider the unigram probabilities of the words "the" and "The". In general, "the" appears more frequently than "The", so $P(\text{the}) > P(\text{The})$.
Let $w_0$ be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial word of the text, $P(\text{the} \mid w_0)$ and $P(\text{The} \mid w_0)$. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that $P(\text{The} \mid w_0) > P(\text{the} \mid w_0)$.
This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.
Thus, to predict a more probable initial word, it is better to consider the bigram probability rather than the unigram probability when measuring sequence probability:
This enhancement allows us to elaborate the sequence probability as a simple product of bigram probabilities:
Is it worth considering the end of the text by introducing another artificial token to improve last-word prediction, multiplying the above product with the bigram probability of that token given the last word?
The multiplication of numerous probabilities can often be computationally infeasible due to slow processing and the potential for decimal points to exceed system limits. In practice, logarithmic probabilities are calculated instead:
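A small sketch of this computation with a toy bigram model; the function name and the '<s>' initial token are illustrative, not the book's constants:

```python
import math

def sequence_log_probability(sequence: list[str],
                             bigram_probs: dict[str, dict[str, float]],
                             init: str = '<s>') -> float:
    # log P(w_1 ... w_n) = sum_i log P(w_i | w_{i-1}), starting from an artificial initial token
    log_prob, prev = 0.0, init
    for word in sequence:
        log_prob += math.log(bigram_probs[prev][word])   # summing logs avoids numeric underflow
        prev = word
    return log_prob

# toy model, assuming '<s>' marks the beginning of a sentence
bigram_probs = {'<s>': {'The': 0.6, 'the': 0.1, 'A': 0.3},
                'The': {'cat': 0.5, 'dog': 0.5},
                'cat': {'sleeps': 1.0}}
print(sequence_log_probability(['The', 'cat', 'sleeps'], bigram_probs))
```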
Your task is to develop a sentiment analyzer trained on the Stanford Sentiment Treebank:
Create a vector_space_models.py file in the src/homework/ directory.
Define a function named sentiment_analyzer()
that takes two parameters, a list of training documents and a list of test documents for classification, and returns the predicted sentiment labels along with the respective similarity scores.
Use the k-nearest neighbors algorithm for the classification. Find the optimal value of k using the development set, and then hardcode this value into your function before submission.
The sentiment_treebank directory contains the following two files:
sst_trn.tst: a training set consisting of 8,544 labeled documents.
sst_dev.tst: a development set consisting of 1,101 labeled documents.
Each line is a document, which is formatted as follows:
Below are the explanations of what each label signifies:
0
: Very negative
1
: Negative
2
: Neutral
3
: Positive
4
: Very positive
Commit and push the vector_space_models.py file to your GitHub repository.
Define a function named sentiment_analyzer_extra()
that gives an improved sentiment analyzer.
The distributional hypothesis suggests that words that occur in similar contexts tend to have similar meanings [1]. Let us examine the following two sentences with blanks:
A: I sat on a __
B: I played with my __
The blank for A can be filled with words such as {bench, chair, sofa, stool}, which convey the meaning "something to sit on" in this context. On the other hand, the blank for B can be filled with words such as {child, dog, friend, toy}, carrying the meaning of "someone/thing to play with." However, these two sets of words are not interchangeable, as it is unlikely that you would sit on a "friend" or play with a "sofa".
This hypothesis provides a potent framework for understanding how meaning is encoded in language and has become a cornerstone of modern computational linguistics and natural language processing.
Assuming that your corpus has only the following three sentences, what context would influence the meaning of the word "chair" according to the distributional hypothesis?
I sat on a chair.
I will chair the meeting.
I am the chair of my department.
Distributional Structure, Zellig S. Harris, Word, 10 (2-3): 146-162, 1954.
Distributional semantics represents the meaning of words based on their distributional properties in large corpora of text. It follows the distributional hypothesis, which states that "words with similar meanings tend to occur in similar contexts".
One-hot encoding represents words as binary vectors such that each word is represented as a vector where all dimensions are zero except for one, which is set to one, indicating the presence of that word.
Consider the following vocabulary:
Given a vocabulary size of 4, each word is represented as a 4-dimensional vector as illustrated below:
One-hot encoding has been largely adopted in traditional NLP models due to its simple and efficient representation of words in sparse vectors.
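A tiny sketch of one-hot encoding over a hypothetical 4-word vocabulary (the book's example vocabulary is not reproduced here):

```python
vocab = ['chair', 'dog', 'king', 'queen']        # hypothetical 4-word vocabulary

def one_hot(word: str, vocab: list[str]) -> list[int]:
    # all dimensions are 0 except the one corresponding to the word
    return [1 if w == word else 0 for w in vocab]

for w in vocab:
    print(w, one_hot(w, vocab))
```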
What are the drawbacks of using one-hot encoding to represent word vectors?
Word embeddings are dense vector representations of words in a continuous vector space. Each word is represented in a high-dimensional space, where the dimensions correspond to different contextual features of the word's meaning.
Consider the embeddings for three words, 'king', 'male', and 'female':
Based on these distributions, we can infer that the four dimensions in this vector space represent royalty, gender, male, and female respectively, such that the embedding for the word 'queen' can be estimated as follows:
The key idea is to capture semantic relationships between words by representing them in a way that similar words have similar vector representations. These embeddings are learned from large amounts of text data, where the model aims to predict or capture the context in which words appear.
In the above examples, each dimension represents a distinct type of meaning. However, in practice, a dimension can encapsulate multiple types of meanings. Furthermore, a single type of meaning can be depicted by a weighted combination of several dimensions, making it challenging to precisely interpret what each dimension implies.
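A minimal sketch of the "queen ≈ king - male + female" arithmetic with made-up 4-dimensional embeddings; the actual values in the book differ:

```python
# hypothetical 4-dimensional embeddings
king   = [0.9, 0.8, 0.9, 0.1]
male   = [0.1, 0.9, 0.9, 0.1]
female = [0.1, 0.9, 0.1, 0.9]

# queen ~ king - male + female, applied dimension by dimension
queen = [k - m + f for k, m, f in zip(king, male, female)]
print(queen)   # high on the royalty- and female-like dimensions, low on the male-like one
```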
Multi-head attention is a crucial component in transformer-based models, such as BERT, GPT, and their variants. It extends the basic self-attention mechanism to capture different types of relationships and dependencies in a sequence. Here's an explanation of multi-head attention:
Motivation:
The primary motivation behind multi-head attention is to enable a model to focus on different parts of the input sequence when capturing dependencies and relationships.
It allows the model to learn multiple sets of attention patterns, each suited to capturing different kinds of associations in the data.
Mechanism:
In multi-head attention, the input sequence (e.g., a sentence or document) is processed by multiple "attention heads."
Each attention head independently computes attention scores and weighted sums for the input sequence, resulting in multiple sets of output values.
These output values from each attention head are then concatenated and linearly transformed to obtain the final multi-head attention output.
Learning Different Dependencies:
Each attention head can learn to attend to different aspects of the input sequence. For instance, one head may focus on syntactic relationships, another on semantic relationships, and a third on longer-range dependencies.
By having multiple heads, the model can learn to capture a variety of dependencies, making it more versatile and robust.
Multi-Head Processing:
In each attention head, there are three main components: queries, keys, and values. These are linearly transformed projections of the input data.
For each head, queries are compared to keys to compute attention weights, which are then used to weight the values.
Each attention head performs these calculations independently, allowing it to learn a unique set of attention patterns.
Concatenation and Linear Transformation:
The output values from each attention head are concatenated into a single tensor.
A linear transformation is applied to this concatenated output to obtain the final multi-head attention result. The linear transformation helps the model combine information from all heads appropriately.
Applications:
Multi-head attention is widely used in NLP tasks, such as text classification, machine translation, and text generation.
It allows models to capture diverse dependencies and relationships within text data, making it highly effective in understanding and generating natural language.
Multi-head attention has proven to be a powerful tool in transformer architectures, enabling models to handle complex and nuanced relationships within sequences effectively. It contributes to the remarkable success of transformer-based models in a wide range of NLP tasks.
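To make the mechanism concrete, here is a minimal NumPy sketch of multi-head attention; the head count, dimensions, and random projection weights are illustrative assumptions rather than the configuration of any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity between queries and keys
    return softmax(scores) @ V        # weighted sum of the values

def multi_head_attention(X, num_heads=4, seed=0):
    """X: (sequence_length, d_model). Returns (sequence_length, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(num_heads):
        # Each head has its own projections for queries, keys, and values.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))
    # Concatenate the heads and apply a final linear transformation.
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(outputs, axis=-1) @ W_o

X = np.random.rand(5, 16)             # a toy "sentence" of 5 tokens with 16-dim embeddings
print(multi_head_attention(X).shape)  # (5, 16)
```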
Self-attention, also known as scaled dot-product attention, is a fundamental mechanism used in deep learning and natural language processing, particularly in transformer-based models like BERT, GPT, and their variants. Self-attention is a crucial component that enables these models to understand relationships and dependencies between words or tokens in a sequence.
Here's an overview of self-attention:
The Motivation:
The primary motivation behind self-attention is to capture dependencies and relationships between different elements within a sequence, such as words in a sentence or tokens in a document.
It allows the model to consider the context of each element based on its relationships with other elements in the sequence.
The Mechanism:
Self-attention computes a weighted sum of the input elements (usually vectors) for each element in the sequence. This means that each element can attend to and be influenced by all other elements.
The key idea is to learn weights (attention scores) that reflect how much focus each element should give to the others. These weights are often referred to as "attention weights."
Attention Weights:
Attention weights are calculated using a similarity measure (typically the dot product) between a query vector and a set of key vectors.
The resulting attention weights are then used to take a weighted sum of the value vectors. This weighted sum forms the output for each element.
Scaling and Softmax:
To stabilize the gradients during training, the dot products are often scaled by the square root of the dimension of the key vectors.
After scaling, a softmax function is applied to obtain the attention weights. The softmax ensures that the weights are normalized and sum to 1.
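Putting the scaling and softmax together, the standard scaled dot-product attention can be written as follows, where $Q$, $K$, and $V$ denote the query, key, and value matrices and $d_k$ is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$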
Multi-Head Attention:
Many models use multi-head attention, where multiple sets of queries, keys, and values are learned. Each set of attention weights captures different aspects of relationships in the sequence.
These multiple sets of attention results are concatenated and linearly transformed to obtain the final output.
Applications:
Self-attention is widely used in transformer-based models for various NLP tasks, including machine translation, text classification, text generation, and more.
It is also applied in computer vision tasks, such as image captioning, where it can capture relationships between different parts of an image.
Self-attention is a powerful mechanism because it allows the model to focus on different elements of the input sequence depending on the context. This enables the model to capture long-range dependencies, word relationships, and nuances in natural language, making it a crucial innovation in deep learning for NLP and related fields.
The frequency of a term $t$'s occurrences in a document $d$ is called the Term Frequency, denoted $\mathbf{TF}(t,d)$. TF is often used to determine the importance of a term within a document such that terms that appear more frequently are considered more relevant to the document's content.
However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:
L3: Define a tokenizer function using a lambda expression.
We then retrieve documents in chronicles_of_narnia.txt and create a vocabulary dictionary:
Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document using the bag of words model:
L6: The underscore (_) is used to indicate that the variable is not being used in the loop.
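For reference, a minimal sketch of such a TF printer is shown below; the function and variable names are illustrative and may differ from those in the course package:

```python
from collections import Counter

def print_term_frequencies(vocab: dict, tokenizer, documents: list):
    """Print the term frequency of every known term in each document (bag-of-words)."""
    for document in documents:
        # Count occurrences of each token that exists in the vocabulary.
        counts = Counter(t for t in tokenizer(document) if t in vocab)
        bow = {vocab[term]: tf for term, tf in counts.items()}  # term ID -> frequency
        print(sorted(bow.items()))

# Example usage with a whitespace tokenizer and a tiny vocabulary:
tokenizer = lambda s: s.split()
vocab = {w: i for i, w in enumerate(['the', 'lion', 'roared', '.'])}
print_term_frequencies(vocab, tokenizer, ['the lion roared .', 'the lion slept .'])
```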
Finally, let us print the TFs of all terms in the following three documents:
In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.
If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?
One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.
Let us retrieve a set of commonly used stop words from stopwords.txt and define a function to determine if a term should be considered a stop word:
L1: string.punctuation.
Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:
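A sketch of how the stop-word check and the filtering tokenizer might fit together, assuming stopwords.txt contains one stop word per line (helper names are illustrative):

```python
import string

# Load stop words; stopwords.txt is assumed to contain one stop word per line.
with open('stopwords.txt') as fin:
    stopwords = {line.strip().lower() for line in fin if line.strip()}

def is_stopword(term: str) -> bool:
    """Treat a term as a stop word if it is in the list or is a punctuation mark."""
    return term.lower() in stopwords or term in string.punctuation

# A tokenizer that drops stop words during tokenization.
tokenizer = lambda s: [t for t in s.split() if not is_stopword(t)]
```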
Finally, let us print the TFs of the same documents using the updated vocabulary:
Note that stop words can be filtered either during the creation of the vocabulary dictionary or when generating the bag-of-words representations. Do both approaches produce the same results? Which approach is preferable?
Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.
Document Frequency (DF) quantifies how often a term appears across a set of documents: it is the number of documents within the corpus that contain a particular term.
Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:
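A possible sketch of this function, assuming the corpus is a list of raw documents and a tokenizer is also available (names are illustrative):

```python
from collections import Counter

def document_frequencies(vocab: dict, documents: list, tokenizer) -> dict:
    """Return a dictionary mapping term IDs to the number of documents containing the term."""
    dfs = Counter()
    for document in documents:
        # A term is counted at most once per document, hence the set().
        for term in set(tokenizer(document)):
            if term in vocab:
                dfs[vocab[term]] += 1
    return dict(dfs)
```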
We then compare the term and document frequencies of all terms in the above documents:
Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, it is possible to estimate more semantically important term scores through appropriate weighting between these two types of frequencies.
What are the implications when a term has a high document frequency?
Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).
Given a term $t$ in a document $d$, where $D$ is the set of all documents in a corpus, its TF-IDF score can be measured as follows:
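One common formulation consistent with the description below (the notation here is assumed) is:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \frac{\mathbf{TF}(t,d)}{|d|} \cdot \log\frac{|D|}{\mathbf{DF}(t,D)}$$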
In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.
Let us define a function that takes a vocabulary dictionary, a DF dictionary, the size of all documents, and a document, and returns the TF-IDF scores of all terms in the document:
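A sketch of such a function under the formulation above; the names are illustrative and a tokenizer is assumed to be available:

```python
import math
from collections import Counter

def tf_idf(vocab: dict, dfs: dict, num_documents: int, document: str, tokenizer) -> dict:
    """Return a sparse vector (term ID -> TF-IDF score) for the document."""
    tokens = [t for t in tokenizer(document) if t in vocab]
    if not tokens:
        return {}
    scores = {}
    for term, tf in Counter(tokens).items():
        tid = vocab[term]
        idf = math.log(num_documents / dfs[tid])   # rarer terms receive higher IDF
        scores[tid] = (tf / len(tokens)) * idf     # normalized TF times IDF
    return scores
```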
We then compute the TF-IDF scores of terms in the above documents:
Should we still apply stop words when using TF-IDF scores to represent the documents?
Several variants of TF-IDF have been proposed to enhance the representation in certain contexts:
Sublinear scaling on TF: $$\mathbf{TF}_{\mathrm{sublinear}}(t,d) = \left\{ \begin{array}{cl} 1 + \log\mathbf{TF}(t,d) & \mbox{if $\mathbf{TF}(t,d) > 0$}\\ 0 & \mbox{otherwise} \end{array} \right.$$
Normalized TF:
Normalized IDF:
Probabilistic IDF:
Source: term_weighting.py
Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the chronicles_of_narnia.txt corpus:
Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the Euclidean distance and Cosine similarity.
Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space such that it represents the magnitude of the differences between the two vectors.
Consider two vectors representing two documents. The Euclidean distance between them can be measured as follows:
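For two document vectors $\mathbf{u}$ and $\mathbf{v}$ of dimension $n$ (symbols assumed here), the standard definition is:

$$d(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$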
Let us define a function that takes two vectors in our SparseVector notation and returns the Euclidean distance between them:
L6: ** k raises the value to the power of k.
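A sketch of the computation, assuming a sparse vector is represented as a dictionary mapping term IDs to values (the course's SparseVector type may differ):

```python
import math

def euclidean_distance(v1: dict, v2: dict) -> float:
    """Euclidean distance between two sparse vectors (term ID -> value)."""
    total = 0.0
    for key in v1.keys() | v2.keys():
        # Missing dimensions are implicitly zero in a sparse representation.
        total += (v1.get(key, 0.0) - v2.get(key, 0.0)) ** 2
    return math.sqrt(total)
```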
We then measure the Euclidean distance between the two vectors above:
The Euclidean distance between two identical vectors is 0 (L1). Interestingly, the distance between D1 and D2 is shorter than the distance between D1 and D3, implying that D1 is more similar to D2 than to D3, which contradicts our intuition.
Cosine similarity is a measure of similarity between two vectors in an inner product space such that it calculates the cosine of the angle between two vectors, where a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular to each other).
The cosine similarity between two vectors can be measured as follows:
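Using the same two vectors $\mathbf{u}$ and $\mathbf{v}$ as above, the standard definition is:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert \, \lVert\mathbf{v}\rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$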
Let us define a function that takes two sparse vectors and returns the cosine similarity between them:
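A sketch under the same sparse-dictionary assumption as the Euclidean distance above:

```python
import math

def cosine_similarity(v1: dict, v2: dict) -> float:
    """Cosine similarity between two sparse vectors (term ID -> value)."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(v * v for v in v1.values()))
    norm2 = math.sqrt(sum(v * v for v in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```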
We then measure the cosine similarity between the two vectors above:
The cosine similarity between two identical vectors is 1, although it is calculated as 0.99 here due to floating-point precision (L1). Similar to the Euclidean distance case, the similarity between D1 and D2 is greater than the similarity between D1 and D3, which again contradicts our intuition.
Why do these metrics determine that D1 is more similar to D2 than to D3?
The following diagram illustrates the difference between the two metrics. The Euclidean distance measures the magnitude of the difference between two vectors, while the cosine similarity measures the angle between them with respect to the origin.
Create a file in the directory.
Your task is to read word embeddings trained by :
Define a function called read_word_embeddings()
that takes a path to the file consisting of word embeddings, .
Return a dictionary where the key is a word and the value is its corresponding embedding in .
Each line in the file adheres to the following format:
Your task is to retrieve a list of the most similar words to a given target word:
Define a function called similar_words()
that takes the word embeddings from Task 1, a target word (string), and a threshold (float).
Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must only include words with similarity scores greater than or equal to the threshold, sorted in descending order based on the similarity scores.
Your task is to measure a similarity score between two documents:
Define a function called document_similarity()
that takes the word embeddings and two documents (string). Assume that the documents are already tokenized.
For each document, generate a document embedding by averaging the embeddings of all words within the document.
Return the cosine similarity between the two document embeddings.
Commit and push the distributional_semantics.py file to your GitHub repository.
Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.
LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.
Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:
We then define a function that takes the document list and the vocabulary dictionary, and returns the document-term matrix, where each cell indicates the frequency of the corresponding term within the corresponding document:
With this current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:
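A sketch of what the NumPy-based version might look like, assuming the same tokenizer and vocabulary conventions as before (the course's document_term_matrix_np() may differ in details):

```python
import numpy as np

def document_term_matrix_np(documents: list, vocab: dict, tokenizer) -> np.ndarray:
    """Rows are documents, columns are terms; cell (i, j) is the frequency of term j in document i."""
    matrix = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, document in enumerate(documents):
        for term in tokenizer(document):
            j = vocab.get(term)
            if j is not None:
                matrix[i, j] += 1   # update counts in place instead of building Python lists
    return matrix
```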
Using this updated function, the document-term matrix is created in about 0.5 seconds, a noticeable improvement in speed:
Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:
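As a reference, here is how SVD can be applied with NumPy; the toy matrix below is illustrative rather than the eight-document corpus used in the course:

```python
import numpy as np

# A toy 4-document x 5-term matrix; the values are illustrative term frequencies.
X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, S.shape, Vt.shape)            # (4, 4) (4,) (4, 5)
print(np.allclose(X, U @ np.diag(S) @ Vt))   # True: the decomposition reconstructs X
```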
What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?
From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.
By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?
From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.
Document classification, also known as text classification, is a task that involves assigning predefined categories or labels to documents based on their content, used to automatically organize, categorize, or label large collections of textual documents.
Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.
Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.
It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice involves splitting the dataset into ratios such as 80/10/10 or 75/10/15 for the training, development, and evaluation sets, respectively.
The directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the corpus file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.
Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:
We then print the number of documents in each set:
To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:
Why do we use only the training set to collect the vocabulary?
Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:
We then vectorize all documents in each set:
Finally, we test our classification model on the development set:
What are the potential weaknesses or limitations of this classification model?
Let $\mathbf{x}$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y$ be its corresponding output label. Logistic regression uses the logistic function, aka the sigmoid function, to estimate the probability that $\mathbf{x}$ belongs to $y$:
The weight vector $\mathbf{w}$ assigns a weight to each dimension of the input vector for the label such that a higher magnitude of $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label within the training distribution.
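In this notation (symbols assumed here), the model can be written as:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$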
What role does the sigmoid function play in the logistic regression model?
Consider a corpus consisting of two sentences:
D1: I love this movie
D2: I hate this movie
The input vectors for these two sentences can be created using the bag-of-words model:
Let the output labels of D1 and D2 represent positive and negative sentiments of the input sentences, respectively. Then, a weight vector can be trained using logistic regression:
What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
Consider a corpus consisting of three sentences:
D1: I love this movie
D2: I hate this movie
D3: I watched this movie
What are the limitations of the softmax regression model?
Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?
What are the limitations of a multilayer perceptron?
Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.
Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.
Consider a sequence of words, $w_1, \ldots, w_n$. We can predict a word $w_i$ by leveraging its contextual words, using an approach similar to the n-gram language models discussed previously (with $V$ being a vocabulary list comprising all unique words in the corpus):
This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW) based on a feedforward neural network. Let $\mathbf{x} \in \{0, 1\}^{|V|}$ be an input vector created from a set of context words $C(w_i)$, such that only the dimensions of $\mathbf{x}$ representing words in $C(w_i)$ have a value of 1; otherwise, they are set to 0.
Let $\mathbf{y} \in \{0, 1\}^{|V|}$ be an output vector, where all dimensions have the value of 0 except for the one representing $w_i$, which is set to 1.
Let $\mathbf{h}$ be a hidden layer between $\mathbf{x}$ and $\mathbf{y}$, and let $W^{(1)}$ be the weight matrix between $\mathbf{x}$ and $\mathbf{h}$, where the sigmoid function is used as the activation function:
Finally, let $W^{(2)}$ be the weight matrix between $\mathbf{h}$ and $\mathbf{y}$:
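Under this notation, and assuming $W^{(1)}, W^{(2)} \in \mathbb{R}^{|V| \times d}$ for a hidden size $d$ (shapes assumed here), the forward computation of CBOW can be summarized as:

$$\mathbf{h} = \sigma\big(W^{(1)\top}\mathbf{x}\big), \qquad \mathbf{y} = \mathrm{softmax}\big(W^{(2)}\mathbf{h}\big)$$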
What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?
What are the advantages of CBOW models compared to Skip-gram models, and vice versa?
What limitations does the Word2Vec model have, and how can these limitations be addressed?
Byte Pair Encoding (BPE) is a data compression algorithm that is commonly used for subword tokenization in modern language models. BPE splits text into smaller units, such as subword pieces or characters, to handle out-of-vocabulary words, reduce vocabulary size, and enhance the efficiency of language models.
The following describes the steps of BPE in terms of the Expectation-Maximization (EM) framework:
Initialization: Given a dictionary consisting of all words and their counts in a corpus, the symbol vocabulary is initialized by tokenizing each word into its most basic subword units, such as characters.
Expectation: With the (updated) symbol vocabulary, it calculates the frequency of every symbol pair within the vocabulary.
Maximization: Given all symbol pairs and their frequencies, it merges the top-k most frequent symbol pairs in the vocabulary.
Steps 2 and 3 are repeated until meaningful sets of subwords are found for all words in the corpus.
The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?
Let us consider a toy vocabulary:
First, we create the symbol vocabulary by inserting a space between every pair of adjacent characters and adding a special symbol [EoW] at the end to indicate the End of the Word:
Next, we count the frequencies of all symbol pairs in the vocabulary:
Finally, we update the vocabulary by merging the most frequent symbol pair across all words:
The expect() and maximize() functions can be repeated for multiple iterations until the tokenization becomes reasonable:
When you uncomment L7 in bpe_vocab(), you can see how the symbols are merged in each iteration:
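For reference, a compact sketch of the expectation and maximization steps described above; the function names mirror the description but are not necessarily those of the course package:

```python
from collections import Counter

def expect(vocab: dict) -> Counter:
    """Count the frequency of every adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs

def maximize(vocab: dict, pair: tuple) -> dict:
    """Merge the given symbol pair into a single symbol in every word of the vocabulary."""
    new_vocab = {}
    for word, count in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[' '.join(merged)] = count
    return new_vocab

# Toy vocabulary: words are pre-split into characters plus an end-of-word symbol.
vocab = {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w e s t [EoW]': 6}
for _ in range(5):
    pairs = expect(vocab)
    best = max(pairs, key=pairs.get)   # merge only the single most frequent pair (k = 1)
    vocab = maximize(vocab, best)
print(vocab)
```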
Contextual representations are representations of words, phrases, or sentences within the context of the surrounding text. Unlike static word embeddings such as those from Word2Vec, where each word is represented by a fixed vector regardless of its context, contextual representations capture the meaning of a word or sequence of words based on their context in a particular document. As a result, the representation of a word can vary depending on the words surrounding it, allowing for a more nuanced understanding of meaning in natural language processing tasks.
Attention Is All You Need, Vaswani et al., Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.
L1: The package.
L3:
Let us create a document-term matrix from the corpus, :
L1: .
L3:
The $i$'th row of the document-term matrix is considered the document vector of the $i$'th document in the corpus, while the transpose of its $j$'th column is considered the term vector of the $j$'th term in the vocabulary.
LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix $X$ into three matrices, $U$, $S$, and $V^{T}$, where $U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix containing singular values, such that $X = U S V^{T}$.
An orthogonal matrix is a square matrix whose rows and columns are orthogonal unit vectors such that $Q^{T} Q = Q Q^{T} = I$, where $I$ is the identity matrix.
Singular values are non-negative values listed in decreasing order that represent the importance of each topic.
L3:
L4:
This results in $U$, $S$, and $V^{T}$ such that:
In $U$, each row represents a document and each column represents a topic.
In $S$, each diagonal cell represents the weight of the corresponding topic.
In $V^{T}$, each column represents a term and each row represents a topic.
The last two singular values in are actually non-negative values, and , respectively.
The first four singular values in $S$ appear to be sufficiently larger than the others; thus, let us reduce the dimensionality to four and truncate $U$, $S$, and $V^{T}$ accordingly:
Given the LSA results, an embedding of the $i$'th document can be obtained as follows:
Finally, an embedding of the $j$'th term can be obtained as follows:
$S$ is not transposed in L3 of the above code. Should we use S.transpose() instead?
Source:
, Wikipedia.
, Wikipedia.
L7: the module, the module.
Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and $k$, and returns the predicted book ID of the document and its similarity score:
L2: Measure the similarity between the input document and every document in the training set, and save it with the book ID of that training document.
L3-4: Return the most common book ID among the top-$k$ documents in the training set that are most similar to the input document.
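A sketch of such a k-nearest-neighbors classifier, assuming the training set is a list of (book ID, sparse vector) tuples and reusing the cosine similarity function defined earlier:

```python
from collections import Counter

def knn_predict(training_vectors: list, document_vector: dict, k: int = 5):
    """Return the most common book ID among the k most similar training documents."""
    # Similarity between the input document and every training document, paired with its book ID.
    scored = [(cosine_similarity(document_vector, vector), book_id)
              for book_id, vector in training_vectors]
    top_k = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    book_id, _ = Counter(book_id for _, book_id in top_k).most_common(1)[0]
    return book_id, top_k[0][0]   # predicted book ID and the highest similarity score
```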
Source:
, Wikipedia.
Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" contributes positively to the positive label, the weight for "hate" has a negative impact on it. Furthermore, as positive and negative sentiment labels are equally represented in this corpus, the bias is also set to 0.
Given the weight vector and the bias, we can compute the scores of the two sentences, resulting in the following probabilities:
As the probability of the positive label exceeds 0.5 (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of the positive label is below 50%.
Under what circumstances would the bias be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?
Softmax regression, aka multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector $\mathbf{x}$ and its output label $y$, the model uses the softmax function to estimate the probability that $\mathbf{x}$ belongs to each class:
The weight vector $\mathbf{w}_c$ assigns weights to $\mathbf{x}$ for the label $c$, while $b_c$ represents the bias associated with that label.
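With $K$ classes, the probability of class $c$ can be written as follows (symbols assumed here):

$$P(y = c \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c \cdot \mathbf{x} + b_c)}{\sum_{k=1}^{K} \exp(\mathbf{w}_k \cdot \mathbf{x} + b_k)}$$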
Then, the input vectors for the three sentences can be created using the bag-of-words model:
Let the output labels of D1, D2, and D3 represent positive, negative, and neutral sentiments of the input sentences, respectively. Then, a weight vector for each label can be trained using softmax regression as follows:
Unlike the case of logistic regression, where all weights are oriented toward a single label (with "love" and "hate" receiving positive and negative weights, respectively, for that label), the values in each weight vector are oriented toward its corresponding label.
Given the weight vectors and the biases, we can estimate the following probabilities for D1:
Since the probability of the positive label is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For D3, the following probabilities can be estimated:
Since the probability of the neutral label is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
Softmax regression predicts a probability for every class, so its prediction is represented by an output vector $\mathbf{y}$, wherein the $i$'th value of $\mathbf{y}$ contains the probability of the input belonging to the $i$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $W$, where the $i$'th row represents the weight vector for the $i$'th label.
With this new formulation, softmax regression can be defined as $\mathbf{y} = \mathrm{softmax}(W\mathbf{x} + \mathbf{b})$, and the optimal prediction can be achieved as $\arg\max_i y_i$, which returns the label(s) with the highest probability.
A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $\mathbf{x}$ and an output vector $\mathbf{y}$, the model allows zero to many hidden layers to generate intermediate representations of the input.
Let $\mathbf{h}$ be a hidden layer between $\mathbf{x}$ and $\mathbf{y}$. To connect $\mathbf{x}$ and $\mathbf{h}$, we need a weight matrix $W^{(1)}$ such that $\mathbf{h} = f(W^{(1)}\mathbf{x})$, where $f$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, implying whether or not the neuron's output should be passed on to the next layer.
Similarly, to connect $\mathbf{h}$ and $\mathbf{y}$, we need a weight matrix $W^{(2)}$ such that $\mathbf{y} = g(W^{(2)}\mathbf{h})$, where $g$ is the output activation. Thus, a multilayer perceptron with one hidden layer can be represented as:
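Combining the two layers (symbols as above):

$$\mathbf{y} = g\big(W^{(2)} \, f(W^{(1)} \mathbf{x})\big)$$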
Consider a corpus comprising the following five sentences and their corresponding labels:
D1: I love this movie → positive
D2: I hate this movie → negative
D3: I watched this movie → neutral
D4: I truly love this movie → very positive
D5: I truly hate this movie → very negative
The input vectors can be created using the bag-of-words model:
The first weight matrix can be trained by an MLP as follows:
Given the values in $W^{(1)}$, we can infer that the first, second, and third columns represent "love", "hate", and "watch", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.
Each input vector is multiplied by $W^{(1)}$ to obtain the corresponding hidden layer, where the activation function is designed as follows:
The second weight matrix can also be trained by an MLP as follows:
By applying the softmax function to each hidden representation, we obtain the corresponding output vector:
The prediction can be made by taking the argmax of each output vector.
, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.
Thus, each dimension in $\mathbf{y}$ represents the probability of the corresponding word being the target word $w_i$ given the set of context words $C(w_i)$.
In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $C(w_i)$ given $w_i$. Formally, the objective of a Skip-gram model is as follows:
Let $\mathbf{x}$ be an input vector, where only the dimension representing $w_i$ is set to 1; all the other dimensions have the value of 0 (thus, $\mathbf{x}$ in Skip-gram is the same as $\mathbf{y}$ in CBOW). Let $\mathbf{y}$ be an output vector, where only the dimension representing a context word in $C(w_i)$ is set to 1; all the other dimensions have the value of 0. All the other components, such as the hidden layer $\mathbf{h}$ and the weight matrices $W^{(1)}$ and $W^{(2)}$, stay the same as the ones in CBOW.
What does each dimension in the hidden layer represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $C(w_i)$, deemed valuable for predicting the target word $w_i$. Formally, each dimension is computed as the sigmoid activation of the weighted sum between the input vector $\mathbf{x}$ and the corresponding column of $W^{(1)}$ such that:
Then, what does each row vector of $W^{(1)}$ represent? Its $j$'th dimension denotes the weight of the $j$'th feature in $\mathbf{h}$ with respect to the corresponding word in the vocabulary. In other words, it indicates the importance of that feature in representing the word. Thus, the $i$'th row of $W^{(1)}$ can serve as an embedding for the $i$'th word in $V$.
What about the other weight matrix $W^{(2)}$? Its $j$'th column vector denotes the weights of the $j$'th feature in $\mathbf{h}$ for all words in the vocabulary. Thus, the $k$'th dimension of that column indicates the importance of the $j$'th feature for the $k$'th word being predicted as the target word $w_i$.
On the other hand, the $k$'th row vector of $W^{(2)}$ denotes the weights of all features for the $k$'th word in the vocabulary, enabling it to be utilized as an embedding for that word. However, in practice, only the row vectors of the first weight matrix $W^{(1)}$ are employed as word embeddings, because the weights in $W^{(2)}$ are often optimized for the downstream task, in this case predicting $w_i$, whereas the weights in $W^{(1)}$ are optimized for finding representations that are generalizable across various tasks.
What are the implications of the weight matrices $W^{(1)}$ and $W^{(2)}$ in the Skip-gram model?
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.
GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
What are the disadvantages of using BPE-based tokenization instead of ? What are the potential issues with the implementation of BPE above?
Source code:
Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016.
A New Algorithm for Data Compression, Gage, The C Users Journal, 1994.
SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (WordPiece), Wu et al., arXiv, 2016.
Attention in Natural Language Processing (NLP) refers to a mechanism or technique that allows models to focus on specific parts of the input data while making predictions or generating output. It's a crucial component in many modern NLP models, especially in sequence-to-sequence tasks and transformer architectures. Attention mechanisms help the model assign different weights to different elements of the input sequence, allowing it to pay more attention to relevant information and ignore irrelevant or less important parts.
Neural Machine Translation by Jointly Learning to Align and Translate
Key points about attention in NLP:
Contextual Focus: Attention enables the model to focus on the most relevant parts of the input sequence for each step of the output sequence. It creates a dynamic and contextually adaptive way of processing data.
Weighted Information: Each element in the input sequence is associated with a weight or attention score, which determines its importance when generating the output. Elements with higher attention scores have a stronger influence on the model's predictions.
Self-Attention: Self-attention mechanisms allow a model to consider all elements of the input sequence when making predictions, and it learns to assign different attention weights to each element based on its relevance.
Multi-Head Attention: Many NLP models use multi-head attention, which allows the model to focus on different aspects of the input simultaneously. This can improve the capture of various patterns and dependencies.
Transformer Architecture: Attention mechanisms are a fundamental component of the transformer architecture, which has been highly influential in NLP. Transformers use self-attention to process sequences, enabling them to capture long-range dependencies and context.
Applications of attention mechanisms in NLP include:
Machine Translation: Attention helps the model align words in the source language with words in the target language.
Text Summarization: Attention identifies which parts of the source text are most important for generating a concise summary.
Question Answering: It helps the model find the most relevant parts of a passage to answer a question.
Named Entity Recognition: Attention can be used to focus on specific words or subwords to identify named entities.
Language Modeling: In tasks like text generation, attention helps the model decide which words or tokens to generate next based on the context.
Attention mechanisms have revolutionized the field of NLP by allowing models to handle complex and long sequences effectively, making them suitable for a wide range of natural language understanding and generation tasks.
Pointer Networks are a type of neural network architecture designed to handle tasks that involve selecting elements from an input sequence and generating output sequences that reference those elements. These networks are particularly useful when the output sequence is conditioned on the content of the input sequence, and the order or content of the output sequence can vary dynamically based on the input.
Key features of Pointer Networks:
Input Sequence Reference: In tasks that involve sequences, Pointer Networks learn to refer to specific elements from the input sequence. This is particularly valuable in problems like content selection or summarization, where elements from the input sequence are selectively copied to the output sequence.
Variable-Length Output: Pointer Networks are flexible in generating output sequences of variable lengths, as the length and content of the output sequence can depend on the input. This is in contrast to fixed-length output sequences common in many other sequence-to-sequence tasks.
Attention Mechanism: Attention mechanisms are a fundamental part of Pointer Networks. They allow the model to assign different weights or probabilities to elements in the input sequence, indicating which elements should be referenced in the output.
Applications of Pointer Networks include:
Content Selection: Selecting and copying specific elements from the input sequence to generate the output. This is useful in tasks like text summarization, where relevant sentences or phrases from the input are selectively included in the summary.
Entity Recognition: Identifying and referencing named entities from the input text in the output sequence. This is valuable for named entity recognition tasks in information extraction.
Geographic Location Prediction: Predicting geographic locations mentioned in text and generating a sequence of location references.
Pointer Networks have proven to be effective in tasks that involve content selection and variable-length output generation, addressing challenges that traditional sequence-to-sequence models with fixed vocabularies may encounter. They provide a way to dynamically handle content in natural language processing tasks where the input-output relationship can be complex and context-dependent.
Your team will showcase the final project to the class utilizing your laptop. Approximately 6-7 groups of people will rotate to your team's station, where you will have a designated 10-minute slot to present your project to each group. Allocate around 5 minutes for your prepared demonstration, and the remaining time for the audience to actively engage with and test your system.
Completeness: Evaluate whether the project is finalized in an end-to-end manner.
Functionality: Assess the functionality of individual features.
Engagement: Determine the level of engagement achieved during the prepared demonstration.
Robustness: Evaluate the system's robustness in handling various inputs.
Usefulness: Consider the overall usefulness of the system.
Create a 5-10 minute video demonstrating your system and upload it to Canvas.
Spring 2024
My vision is a model which could reference the major (at least American, but potentially other English) style guides (the AP Stylebook, Chicago, MLA, etc.) and, given a sentence with a point of ambiguous grammar/style, could give the solution according to the different major style guides. I'm not married to this idea per se, but I like the idea of working with style guides in some way.
I'd like to try and build a sentiment analysis tool, capable of classifying emotions into positive, neutral, or negative (or perhaps give a numerical rating). If possible, the model can be expanded to include other emotions like happiness, anger, disappointment, etc.
Looking at words used on social media and comparing sentiments across different social platforms.
For the team project, I want to use NLP skills to catch keywords in some math papers. From my own experience, math research papers are hard to understand due to their nouns and abstract ideas. If we can get some key things out from the math paper, this will be helpful.
N/A
For the team project at the end of the semester, I would like to build a Sentiment Analysis model for Cryptocurrency Trading. In the world of Crypto, many of the price movements are caused by sudden changes in the sentiment of investors. This often can be found on social media sites like Reddit and Twitter, where users often post their feelings about the currencies. A few years ago, a subreddit called WallStreetBets blew up for causing an insane price increase in Gamestop stock as well as some others. From this, I discovered the power of using Sentiment Analysis modeling on various social media sites to attempt to predict price movements of online currencies and stocks. For this project, I would like to analyze user posts on these social media sites to calculate the overall sentiment for specific Cryptocurrencies. I will then use this data to predict the incoming price changes that may occur.
My project idea involves training a large language model on countless recipes, found online, which must be preprocessed accordingly to be usable. Once the chatbot is trained, a simple web platform would be used for accessibility and testing. This platform would be comprised of chats, and an area to enter text. The goal is to make a chatbot capable of creating usable recipes based on user inputs, dietary restrictions, and ingredient restrictions. I thought of this project idea with Dylan Parker.
Concept and Idea: Spam Detection Bot for Email Systems The potential concept for our team project is to develop an advanced Spam Detection Bot that is specifically designed for email systems.
Intelligent Linguistic Analysis: The bot will utilize natural language processing (NLP) techniques to analyze the content of emails. It will focus on identifying linguistic patterns, keywords, and phrase structures commonly associated with spam.
Machine Learning Integration: By employing machine learning algorithms, the bot will be able to learn and adapt to evolving spam tactics. This continuous learning approach ensures that the bot remains effective even as spammers change their strategies.
Feedback Mechanism: The bot allows the users to manually mark certain email as spam or not, which allows the bot to update itself to increase the accuracy of detection and reduce the possibility of false detection.
Real-time Processing and Efficiency: The bot is designed to process emails in real-time, ensuring that users' inboxes are promptly cleared of spam.
Security and Privacy Focus: In addition to spam detection, the bot will prioritize user security and privacy, ensuring that the user’s privacy is always the top priority.
I would like to make a model that can read influential people's tweets in the financial world to get some sort of sentiment, and determine if someone should invest in that stock or not.
ex. If Warren Buffett were to tweet out "Apple is a terrible company", the model would be able to detect the sentiment of Apple as negative, and therefore not invest in the stock.
I want to create a language identifier that can output the name of a language based on a text of a certain language as the input. To make this work I believe that we would first identify the types of characters used and narrow down the languages based on that. Secondly, we would tokenize the text into words and compare it to the dictionaries of each language we are still considering. Lastly, we would look at how many matches there are between the text and the dictionaries and output the most similar one. Maybe if it's below a certain threshold, the output could say that there's not enough data to suggest that the language is one we have access to.
For the project, I'm thinking about an AI diary app using GPT. This app will let students write about their day, and the AI will offer encouraging words and advice. It could also detect if the student is stressed and help them as a friend. The goal is to create a comforting space for students to reflect and relax.
Project Idea I would like to study the difference between the usage of “(disability-adj) person” and “person with (disability)” in the context of academic papers. For example, there is lots of discourse on the difference between saying autistic person or person with autism, and from my own experience on reading research papers about autism, both are used frequently. The project could involve gathering data about its usage frequency in scientific papers between academic fields (medical, psychology, sociology, anthropology, etc.), within the same papers (do research papers always use the same term, or do some use both?), by date (is there a difference in pre- and post- 2015, or some other relevant date?), or some other metric. Further work could be done with sentiment analysis technology to see if the papers use language that is favorable / amiable towards disability or disparaging towards disability, and could be correlated with their chosen phrase to describe disability (disabled person vs person with disability) to see if there is a significant difference.
I just switched into this class and haven't attended any lectures yet so I don't have a good concept of what a good project is, but an idea is to analyze tik toks to gauge that population's opinions on the 2024 election.
I have a plan to do a text analysis on the answers in Quora to distinguish answers from experts and other participants.
One idea for a project which I might like to pursue would be an AI coach for video games. The idea would be to have the coach look at in game performance, potentially in real time, and give coaching about how to improve ala "pay more attention to objectives".
Something that may be challenging to do as a team project is to look at computational linguistics based on languages other than English. An alternative that would fall within the scope of the class while also looking at nonstandard English would be analyzing various English dialects, with projects such as dialect identification, translation between dialects, or even a chatbot that understands multiple English dialects.
For the group project, I am interested in sentiment analysis, specifically concerning evaluations of business products, or social media posts. Additionally, I am also inclined towards topics associated with the detection of spam emails or messages.
I am fairly open to the type of team project I would like to pursue this semester. Things such as sentiment analysis, story generation, and chatbots would all be interesting to me. I am not sure about the feasibility, but the project I would be most interested in working on would be a text to speech tool. I would like to learn how speech generation works on such a level, and it is a tool I could imagine using once created. I know that this type of tool already exists, so I would ideally like to work on a less prevalent aspect such as adding emotionally expressive speech so that it doesn't sound as flat. I am also interested in the ways that characters' or celebrities' voices can be emulated, but I am not sure if this would have any legal hurdles.
N/A
I have thought of working on a sentiment analysis model which would classify customer reviews we see on e-commerce platforms (whether they are positive, negative, or neutral). However, I also want to listen to other ideas so I am also open to new ideas.
Project Idea: Using AI for recipe recommendation, modification, and generation. Opening the fridge often reveals leftover ingredients from previous meals, which can make deciding what to cook a challenge. Typically, we end up spending considerable time looking for recipes that match our tastes and the ingredients we currently have. It’s common to find that we’re missing a few items for the recipes we’re interested in, which can be quite inconvenient. However, by incorporating artificial intelligence, we could have a tool that acts like a consultative chef. This AI system would allow us to tailor our recipe searches more precisely, adjusting recipes to fit our personal requirements, like creating a low-calorie version of a certain dish or suggesting substitutions for missing ingredients.
I am very unsure of what kind of project I would like to pursue. I think combining linguistics with visual data could be interesting though. Maybe some kind of coding that sorts or filters certain types of words or phrases, and then those are represented in some visually pleasing way. I think some kind of code that analyzes literature at a highly specific linguistic level could be very interesting. I am extremely flexible and would be willing to work on almost any project.
A possible project could be to use NLP techniques on unconventional data. Personally, I am interested in cross-cultural linguistics / multilingualism such as heritage language usage in diasporic settings. Some possible datasets could be English loan word usage in a non-English speaking countries (ie. in everyday life, music lyrics, social media, websites). For example, in South Korea and many other countries, popular music integrates English into various text such as music lyrics, advertisements and marketing, and everyday speech (based on age group). Another possible topic could be topics such as correcting gendered language to non-gendered language. Lastly, academic related topics: I think a fun project could be something like predicting final course grade based on a student writing sample, predicting / generating potential test questions based on a text, or predict / generate weak areas of students based on their code sample.
A text summarization tool for simplifying complex readings for classes.
I have a group that I believe we will be working together, yet we have not yet decided on a topic for a project. An idea I have been thinking about is incorporating a character into game that will take the language input that the player puts in. After analyzing the type of writing that the players uses, the character will respond in the same way the player wrote. For example, if someone is using Shakespearean language, then the character will respond back the same way.
Project Idea: Explain Attention Is All You Need to children: Design a system that explains and summarizes academic papers in a more comprehensive way, especially for those who do not have much background knowledge. It can reliably lay out the fundamental information from abstracts, introductions, methods, and findings from any academic paper without missing vital information, allowing readers to process the main ideas efficiently.
Though I still don't know much about NLP, for my team project, I think it would be interesting to try to work on a language model that is trained and works solely on inputs with perfect grammar in an attempt to see the effect of input "sanitization" on performance and model size.
I'd like to apply NLP algorithms and some machine learning algorithms on a public available dataset to perform supervised classification task. For example, applying MLP on product review to distinguish helpful reviews against unhelpful ones. I would like to further compare and evaluate the performance of some large language models, such as Bert and GPT4 API.
My idea is to make a "PolyGlotBot", a multilingual virtual assistant that helps Chinese learners practice and improve their skills. The model will give real-time feedback on grammar, pronunciation, and vocabulary when having a conversation with it. It will also have a built-in function that adapts to the user's proficiency level and personalizes learning content. This can help learners learn better at their own rhythm and level.
I'm not sure but I am open to exploring
Project Idea: Create a tool that can figure out how people feel when they post on social media, no matter what language they're using. The idea is to make a system that helps us see the emotions behind online conversations in different languages, so users and businesses can get a sense of what people are expressing on a global scale.
I would like to learn how could NLP techniques be applied to alternative data in finance and business.
Utilization of sentiment analysis would be something I'd be interested in working on; a project like conducting sentiment analysis on a dataset (such as poems) and using it to generate new data (such as new poems based on a key emotion) is something that I would be interested in. In addition, I would also be interested in leveraging LLMs to create something, such as recreating the personality of characters.
The team project that I have in mind is an LLM that recommends personalized recipes based on user requests, dietary restrictions, and ingredient availability. The system could also assist users in meal planning by suggesting balanced meal combinations and creating shopping lists if they do not have the necessary ingredients. My team would ideally collect many recipes online for our database to serve as the foundation for our recommendations. An algorithm would need to be developed for the meal recommendation system. We would then create a web-based application for user interactions with our system. If possible, it would be great to have a user "account" feature for further user personalization in the future. I thought of this project idea with Marcus Cheema.
Text Writing Editor: Use models like OpenAI's Davinci API for generating creative writing, including poetry, stories, or even scripts. The focus would be on fine-tuning the model's parameters and prompts to allow certain styles or themes.
GitHub repository link: Team project: I'd like to know more about sentiment/emotion/opinion analysis of text. This semester I'd probably do something related to analysis techniques regarding sentiment lexicons and sentiment classification models.
N/A
N/A
The team project that we would like to pursue is to use ai (chatGPT) to train specifically for a task, such as writing, research, programming. For example, there is a plugin function in Canva supported by ChatGPT that is able to generate infographics and visualizations automatically through prompts.
N/A
I am interested in doing a project about sentiment analysis, where a model would be able to decide what the tone of a piece of text is. This is intriguing to me because even as people, tone is difficult to convey accurately through text, and even when we are able to determine the sentiment of a piece of text, it is not always clear exactly how that tone was communicated. I am curious to see how accurate a machine can be in determining something so emotion-based and not clear-cut. With that said, I am open to other project ideas as well!
N/A
Two idea possibilities: 1)Sentiment Analysis Chatbot: A chatbot that detects the emotional and state of mind-being from a person, the idea behind this we are LARPing as Walmart or Target or another retailer, and we want to detect how the customer feels without having them directly state it (because that tends to anger them). Put it like this, if you're disappointed with either customer service or a product you just purchased, and you write to a bot wanting to express disappointment, but is instead asked how you feel about the service or product, that will just annoy you into stating you are angry or upset. The key to this tool is being able to interpret word choices and convert them into state of emotional well-being and satisfaction from a state of 1-5, 1 being extremely unsatisfactory, and 5 being extremely satisfactory, and utilizing this metric across 5 rubrics (Customer Service, Product Quality, Cleanliness of Stores, Location Convenience, Feeling of Safety) from a short conversation with the consumer. Along the way, the bot can also make recommendations, including for products and advices, after the conversation. 2)Spam Detection Bot: Email filter bot that would combine ANN with corpus of common spam emails, including ones I've fallen for (I've failed 100% of the Emory phishing emails, I'm sad to say, not an exaggerated stat, I've never not clicked on those baits). We'd create a distinction between actual legitimate bot emails (i.e. Job offers, important notifications, reminders) from both harmless and harmful spam. Differs from email spam on multi-lingual factor: my emails still have large amounts of non-english spam that filter through, but because the spam bot isn't as well-trained in that factor, I'm receiving garbage on fake chinese job referrals and german job offers (I know they're fake, because like the genius that I am, clicked on them and inquired, which resulted in me getting more spam emails). The spam detection bot would also create warnings on non-spam emails that border on spam (i.e. promotional emails), and offer the end user the opportunity to enable options to filter them out.
Team Project Idea: Examine presidential inaugural addresses for trends in word tokens, types, diversity, etc. More points of analysis will definitely come up, but in general hope to look for different trends over time, individual differences between presidents, and other interesting observations.
Our team would like to develop a system that can analyze and classify the sentiment of social media posts. We will choose one social media platform and focus on posts from a certain period or around a specific topic. Our goal is that the system can help businesses, organizations, and governments understand the public reaction and adjust policies/improve products.
N/A
Project Idea: I aim to analyze the linguistic nuances and sentiment differences in academic papers when referring to "people with disabilities" vs. "disabled people". I hope to understand if the choice of terminology correlates with varying sentiments and contextual frameworks in disability discourse within academic literature.
N/A
Our team would like to develop a system that can analyze and classify the sentiment of social media posts. We will choose one social media platform and focus on posts from a certain period or around a specific topic. Our goal is that the system can help businesses, organizations, and governments understand the public reaction and adjust policies/improve products.
The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?
What are the disadvantages of using BPE-based tokenization instead of ? What are the potential issues with the implementation of BPE above?
How does self-attention operate given an embedding matrix $D \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension?
Given the same embedding matrix as in question #3, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?
What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?
How is a Masked Language Model used in training a language model with a transformer?
How can one train a document-level embedding using a transformer?
What are the advantages of embeddings generated by transformers compared to those generated by ?
Neural networks have gained widespread popularity for training natural language processing models since 2013. What factors enabled this popularity, and how do they differ from traditional NLP methods?
Recent large language models like ChatGPT or Claude are trained quite differently from traditional NLP models. What are the main differences, and what factors enabled their development?
Attention Is All You Need, Vaswani et al., NIPS 2017.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL 2019.
Submit your presentation slides in PDF format:
Each team is allotted 8 minutes for their presentation, which may incorporate discussions.
Your presentation should be captivating and provide a concise overview of the written proposal.
Ensure to include an end-to-end example that illustrates the input and output of your system.
Submit a proposal in PDF format consisting of 5-8 pages (excluding references) using the provided template. Your proposal should offer a thorough description of your project.
Title (try to be as catchy as possible).
Course ID and name.
List of team members, their majors, and contact information.
Overview of the proposed project.
Intellectual merit.
Broader impact.
Objectives: what are your goals and what do you believe you can achieve during this semester?
Motivation: what contributes to the societal value of this project?
Problem Statement: what specific problem are you seeking to address using NLP technology?
Your research objectives should incorporate an element of novelty. Articulate which aspect of your project is innovative, and provide evidence by comparing it with (potential) competitors.
Reference relevant works, whether from academia or industry. Provide a brief overview of each work and highlight how your work differs from it.
If you've conducted preliminary work, delineate what has already been accomplished and outline what additional work will be undertaken during this semester.
Provide precise details regarding your methodologies, ensuring they are achievable within the confines of this semester. If you intend to continue work beyond this timeframe, delineate the extended timeline in the subsequent section.
Clearly outline the dataset and evaluation methods for experiments. This section must be compelling to demonstrate the feasibility of the proposal.
Outline weekly plans.
Assign tasks to individuals for each timeline.
Updated 2023-10-27
The Encoder-Decoder Framework is commonly used for solving sequence-to-sequence tasks, where it takes an input sequence, processes it through an encoder, and produces an output sequence. This framework consists of three main components: an encoder, a context vector, and a decoder, as illustrated in Figure 1:
An encoder processes an input sequence and creates a context vector that captures context from the entire sequence and serves as a summary of the input.
Figure 2 shows an encoder example that takes the input sequence, "I am a boy", appended with the end-of-sequence token "[EOS]":
A decoder is conditioned on the context vector, which allows it to generate an output sequence contextually relevant to the input, often one token at a time.
The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this?
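The following is a minimal NumPy sketch of this framework, not the course API: the weights are random and untrained, the encoder's last hidden state is taken as the context vector, and the greedy decoding loop with a max_len cutoff is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, V = 8, 16, 12                 # embedding size, hidden size, toy output vocabulary size
EOS = V - 1                         # index of the artificial end-of-sequence token

# Illustrative parameters (randomly initialized; a real model would learn these).
Wx, Wh = rng.normal(size=(e, d)), rng.normal(size=(e, e))
Wy, Ws = rng.normal(size=(e, V)), rng.normal(size=(e, e))
Wo = rng.normal(size=(V, e))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def encode(X):
    """Run the encoder over the input embeddings; the last hidden state serves as the context here."""
    h = np.zeros(e)
    for x in X:                     # one step per input token (including [EOS])
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def decode(c, max_len=20):
    """Generate output token ids one at a time, conditioned on the context vector."""
    s, out = np.tanh(Ws @ c), []
    for _ in range(max_len):        # max_len guards against a decoder that never predicts [EOS]
        y = int(softmax(Wo @ s).argmax())   # greedy choice of the next token
        if y == EOS:
            break
        out.append(y)
        s = np.tanh(Wy[:, y] + Ws @ s)      # condition on the previous output and its hidden state
    return out

X = rng.normal(size=(5, d))         # e.g., "I am a boy [EOS]" as five toy embeddings
print(decode(encode(X)))            # token ids only; the output is meaningless without training
```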
Update: 2023-10-26
A Recurrent Neural Network (RNN) [1] maintains hidden states of previous inputs and uses them to predict outputs, allowing it to model temporal dependencies in sequential data.
The hidden state is a vector representing the network's internal memory of the previous time step. It captures information from previous time steps, influences the predictions made at the current time step, and is updated at each time step as the RNN processes a sequence of inputs.
Given an input sequence $X = [x_1, \ldots, x_n]$, where $x_i \in \mathbb{R}^{d}$, an RNN for sequence tagging defines two functions, $f$ and $g$:
$f$ takes the current input $x_i$ and the hidden state $h_{i-1}$ of the previous input $x_{i-1}$, and returns a hidden state $h_i$ such that $h_i = f(x_i, h_{i-1}) = \alpha(W^{x} x_i + W^{h} h_{i-1})$, where $W^{x} \in \mathbb{R}^{e \times d}$, $W^{h} \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.
$g$ takes the hidden state $h_i$ and returns an output $y_i$ such that $y_i = g(h_i) = W^{o} h_i$, where $W^{o} \in \mathbb{R}^{o \times e}$.
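A minimal NumPy sketch of these two functions for sequence tagging; the dimensions, the use of tanh as the activation $\alpha$, and the random (untrained) weights are illustrative assumptions rather than the course implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, e, o = 8, 16, 5                 # embedding, hidden, and tag-set sizes (illustrative)
Wx = rng.normal(size=(e, d))       # W^x: input-to-hidden weights
Wh = rng.normal(size=(e, e))       # W^h: hidden-to-hidden weights
Wo = rng.normal(size=(o, e))       # W^o: hidden-to-output weights

def f(x, h_prev):
    # h_i = alpha(W^x x_i + W^h h_{i-1}), with tanh as the activation function
    return np.tanh(Wx @ x + Wh @ h_prev)

def g(h):
    # y_i = W^o h_i: a score vector over the tag set for the current token
    return Wo @ h

X = rng.normal(size=(4, d))        # e.g., toy embeddings of "They are early birds"
h = np.zeros(e)
for i, x in enumerate(X):          # left-to-right pass over the sequence
    h = f(x, h)
    print(i, int(g(h).argmax()))   # predicted tag id for each token
```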
Figure 1 shows an example of an RNN for sequence tagging, such as part-of-speech (POS) tagging:
For example, let us consider the word "early" in the following two sentences:
They are early birds -> "early" is an adjective.
They are early today -> "early" is an adverb.
The POS tags of "early" depend on the following words, "birds" and "today", so making the correct predictions becomes challenging without the following context.
To overcome this challenge, a Bidirectional RNN has been suggested [2] that considers both the forward and backward directions, creating twice as many hidden states to capture a more comprehensive context. Figure 3 illustrates a bidirectional RNN for sequence tagging:
Does it make sense to use bidirectional RNN for text classification? Explain your answer.
Long Short-Term Memory (LSTM) Networks [3-5]
Gated Recurrent Units (GRUs) [6-7]
Submit a final report in PDF format consisting of 5-8 pages (excluding references) using the provided template. Change the section titles as needed.
Title (try to be as catchy as possible).
Course ID and name.
List of team members, their majors, and contact information.
Overview of the proposed project.
Intellectual merit.
Broader impact.
Objectives: what goals do you aim to achieve through this project?
Motivation: what contributes to the societal value of this project?
Problem Statement: what specific problem are you seeking to address using NLP technology?
Your research objectives should incorporate an element of novelty. Articulate which aspect of your project is innovative, and provide evidence by comparing it with (potential) competitors.
Reference relevant works, whether from academia or industry. Provide a brief overview of each work and highlight how your work differs from it.
Provide a figure describing an end-to-end example of your system.
Provide precise details regarding your methodologies.
Clearly describe the dataset, evaluation methods, and results of your system.
Summarize key methods and findings.
Broader impact: upon completion, how your project can change the world.
Future work: how to improve the quality of your project.
Attention Is All You Need, Vaswani et al., NIPS 2017.
Let $X = [x_1, \ldots, x_n, x_{n+1}]$ be an input sequence, where $x_i$ is the $i$'th word in the sequence and $x_{n+1} = \text{[EOS]}$ is an artificial token appended to indicate the end of the sequence. The encoder utilizes two functions, $f$ and $g$, which are defined in the same way as in the RNN. Notice that the end-of-sequence token is used to create an additional hidden state $h_{n+1}$, which in turn creates the context vector $c$.
Is it possible to derive the context vector $c$ from $h_n$ instead of $h_{n+1}$? What is the purpose of appending an extra token to indicate the end of the sequence?
Let $Y = [y_1, \ldots, y_m, y_{m+1}]$ be an output sequence, where $y_j$ is the $j$'th word in the sequence, and $y_{m+1} = \text{[EOS]}$ is an artificial token to indicate the end of the sequence. To generate the output sequence, the decoder defines two functions, $f'$ and $g'$:
$f'$ takes the previous output $y_{j-1}$ and its hidden state $s_{j-1}$, and returns a hidden state $s_j$ such that $s_j = f'(y_{j-1}, s_{j-1}) = \alpha(W^{y} y_{j-1} + W^{s} s_{j-1})$, where $W^{y} \in \mathbb{R}^{e \times d}$, $W^{s} \in \mathbb{R}^{e \times e}$, and $\alpha$ is an activation function.
$g'$ takes the hidden state $s_j$ and returns an output $y_j$ such that $y_j = g'(s_j) = W^{o} s_j$, where $W^{o} \in \mathbb{R}^{o \times e}$.
Note that the initial hidden state $s_1$ is created by considering only the context vector $c$ such that the first output $y_1$ is solely predicted by the context in the input sequence. However, the prediction of every subsequent output $y_j$ is conditioned on both the previous output $y_{j-1}$ and its hidden state $s_{j-1}$. Finally, the decoder stops generating output when it predicts the end-of-sequence token $\text{[EOS]}$.
In some variations of the decoder, the initial hidden state is created by considering both and [1].
Figure 3 illustrates a decoder example that takes the context vector and generates the output sequence, "나(I) +는(SBJ) 소년(boy) +이다(am)", terminated by the end-of-sequence token "[EOS]", which translates the input sequence from English to Korean:
The likelihood of the current output $y_j$ can be calculated as $P(y_j \mid y_1, \ldots, y_{j-1}, c) = \pi(c, y_{j-1}, s_{j-1})$,
where $\pi$ is a function that takes the context vector $c$, the previous input $y_{j-1}$, and its hidden state $s_{j-1}$, and returns the probability of $y_j$. Then, the likelihood of the output sequence can be estimated as follows: $P(Y \mid X) = \prod_{j=1}^{m+1} P(y_j \mid y_1, \ldots, y_{j-1}, c)$.
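As a small illustration of this estimate, the sequence likelihood is simply the product of the per-step probabilities; the probability values below are made-up numbers, not outputs of any model:

```python
import math

# Hypothetical per-step probabilities P(y_j | y_1..y_{j-1}, c) produced by a decoder,
# with the last entry corresponding to the end-of-sequence token [EOS].
step_probs = [0.62, 0.48, 0.55, 0.71, 0.90]

likelihood = math.prod(step_probs)                      # product over j = 1..m+1
log_likelihood = sum(math.log(p) for p in step_probs)   # equivalent, numerically safer form
print(likelihood, log_likelihood)
```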
The maximum likelihood estimation of the output sequence above accounts for the end-of-sequence token . What are the benefits of incorporating this artificial token when estimating the sequence probability?
Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014.*
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015.*
Notice that the output $y_1$ for the first input $x_1$ is predicted by considering only the input itself, i.e., $y_1 = g(h_1)$ where $h_1$ is derived from $x_1$ alone (e.g., the POS tag of the first word "I" is predicted solely using that word). However, the output $y_i$ for every other input $x_i$ is predicted by considering both $x_i$ and $h_{i-1}$, an intermediate representation created explicitly for the task. This enables RNNs to capture sequential information that feedforward neural networks cannot.
What does each hidden state represent in the RNN for sequence tagging?
Unlike sequence tagging, where the RNN predicts a sequence of outputs $[y_1, \ldots, y_n]$ for the input $[x_1, \ldots, x_n]$, an RNN designed for text classification predicts only one output $y$ for the entire input sequence such that:
Sequence Tagging: $[x_1, \ldots, x_n] \rightarrow [y_1, \ldots, y_n]$
Text Classification: $[x_1, \ldots, x_n] \rightarrow y$
To accomplish this, a common practice is to predict the output $y$ from the last hidden state $h_n$ using the function $g$. Figure 2 shows an example of an RNN for text classification, such as sentiment analysis:
What does the hidden state $h_n$ represent in the RNN for text classification?
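A minimal sketch of this setup, assuming the same recurrence as above with random, untrained weights; only the last hidden state feeds the output function, and all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, e, k = 8, 16, 3                    # embedding size, hidden size, number of classes
Wx, Wh = rng.normal(size=(e, d)), rng.normal(size=(e, e))
Wo = rng.normal(size=(k, e))          # maps the last hidden state to class scores

X = rng.normal(size=(6, d))           # toy embeddings of a 6-token document
h = np.zeros(e)
for x in X:                           # same recurrence as in sequence tagging
    h = np.tanh(Wx @ x + Wh @ h)
print(int((Wo @ h).argmax()))         # a single label predicted from h_n alone
```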
The RNN described above does not consider the words that follow the current word when predicting the output. This limitation can significantly impact model performance since contextual information following the current word can be crucial.
For every $x_i$, the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are created by considering $x_i$ together with $\overrightarrow{h}_{i-1}$ and $\overleftarrow{h}_{i+1}$, respectively. The function $g$ takes both $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ and returns an output $y_i$ such that $y_i = g(\overrightarrow{h}_i \oplus \overleftarrow{h}_i)$, where $\overrightarrow{h}_i \oplus \overleftarrow{h}_i$ is a concatenation of the two hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$.
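A minimal NumPy sketch of the bidirectional pass; the separate forward and backward weight matrices and the concatenation before the output layer follow the description above, while the dimensions and random (untrained) weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, e, o = 8, 16, 5
Wxf, Whf = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # forward RNN weights
Wxb, Whb = rng.normal(size=(e, d)), rng.normal(size=(e, e))   # backward RNN weights
Wo = rng.normal(size=(o, 2 * e))      # output weights over the concatenated states

X = rng.normal(size=(4, d))           # e.g., toy embeddings of "They are early today"
n = len(X)

fwd, bwd = [None] * n, [None] * n
h = np.zeros(e)
for i in range(n):                    # left-to-right pass
    h = np.tanh(Wxf @ X[i] + Whf @ h)
    fwd[i] = h
h = np.zeros(e)
for i in reversed(range(n)):          # right-to-left pass
    h = np.tanh(Wxb @ X[i] + Whb @ h)
    bwd[i] = h

for i in range(n):                    # y_i is predicted from the concatenation of both states
    scores = Wo @ np.concatenate([fwd[i], bwd[i]])
    print(i, int(scores.argmax()))
```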
Finding Structure in Time, Elman, Cognitive Science, 14(2), 1990.
Bidirectional Recurrent Neural Networks, Schuster and Paliwal, IEEE Transactions on Signal Processing, 45(11), 1997.
Long Short-Term Memory, Hochreiter and Schmidhuber, Neural Computation, 9(8), 1997 (available at ResearchGate).
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma and Hovy, ACL, 2016.*
Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.*
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., EMNLP, 2014.*
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., NeurIPS Workshop on Deep Learning and Representation Learning, 2014.*
Spring 2024
Will Kohn, Sam Liu, Ja’Zmin McKeel, Ellie Paek
Andrew Chung, Andrew Lu, Frederic Guintu, Tung Dinh
Yunnie Yu, Jason Zhang, Serena Zhou
Henry Dierkes, Andrew Lee, Nicole Thomas
Calvin Brauer, Cashin Woo, Jerry Hong
Calla Gong, Louis Lu, Wenzhuo Ma, Yoyo Wang
Freddy Xiong, Molly Han, Murphy Chen, Peter Jeong
Helen Jin, Michael Cao, Michael Wang, Michelle Kim
Marcus Cheema, Dylan Parker, Sherry Rui
Chengyu Shi, Chenming Zhou, Ruichen Ni, Wenxuan Cai
Joyce Zhang, Paige Hendricks, Lindsey Wendkos, Benjamin Dixon
Mara Adams, Hunter Grimes, Carl Kassabian, Simon Yu
Alec Chapman, Izana Melese
[03/04] 2, 5, 6, 8, 10, 11, 12
[03/06] 1, 3, 4, 7, 9, 13