Update: 2023-10-31
Sometimes, it is more appropriate to consider the canonical forms as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature for each year, you want to count both "transformer" and "transformers" as a single item.
Lemmatization is the task of reducing words to their base or dictionary forms, known as lemmas, which makes it easier to interpret their core meaning.
What is the difference between a lemmatizer and a stemmer [1]?
When analyzing the word types obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:
Universities
University
universities
university
Two variations are applied to the noun "university": capitalization, generally used for proper nouns or initial words, and pluralization, which indicates multiple instances of the term. On the other hand, verbs can also take several variations regarding tense and aspect:
study
studies
studied
studying
get_lemma_lexica()
We want to develop a lemmatizer that normalizes all variations into their respective lemmas. Let us start by creating lexica for lemmatization:
L1: SimpleNamespace
L8-11: JSON encoder and decoder
Lexica:
nouns.txt: base nouns
nouns_irregular.json: nouns whose plural forms are irregular (e.g., mouse -> mice)
nouns_rules.json: pluralization rules for nouns
verbs.txt: base verbs
verbs_irregular.json: verbs whose inflection forms are irregular (e.g., buy -> bought)
verbs_rules.json: inflection rules for verbs
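A minimal sketch of what get_lemma_lexica() could look like is given below. It assumes the lexicon files above are stored in a dat/ directory and uses illustrative helper names; the original implementation may differ.

```python
import json
from types import SimpleNamespace

def get_lemma_lexica(res_dir: str = 'dat') -> SimpleNamespace:
    # Load a plain-text lexicon: one base form per line.
    def load_set(path):
        with open(path) as fin:
            return {line.strip() for line in fin if line.strip()}

    # Load a JSON lexicon: irregular forms or inflection rules.
    def load_json(path):
        with open(path) as fin:
            return json.load(fin)

    # Bundle all lexica into a single object for convenient access.
    return SimpleNamespace(
        nouns=load_set(f'{res_dir}/nouns.txt'),
        nouns_irregular=load_json(f'{res_dir}/nouns_irregular.json'),
        nouns_rules=load_json(f'{res_dir}/nouns_rules.json'),
        verbs=load_set(f'{res_dir}/verbs.txt'),
        verbs_irregular=load_json(f'{res_dir}/verbs_irregular.json'),
        verbs_rules=load_json(f'{res_dir}/verbs_rules.json'))
```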
lemmatize()
Next, let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:
L2: Define a nested function aux to handle lemmatization.
L3-4: Check if the word is in the irregular dictionary (using get()); if so, return its lemma.
L6-7: Try applying each rule in the rules list to word.
L8: If the resulting lemma is in the vocabulary, return it.
L10: If no lemma is found, return None.
L12: Convert the input word to lowercase for case-insensitive processing.
L13: Try to lemmatize the word using the verb-related lexica.
L15-16: If no lemma is found among verbs, try to lemmatize using the noun-related lexica.
L18: Return the lemma if found, or the lowercased word if no lemmatization occurred.
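A sketch consistent with these steps is shown below; it assumes the rule files store (suffix, replacement) pairs such as ["ies", "y"], which is an assumption about the lexica format rather than the confirmed one.

```python
def lemmatize(word: str, lexica) -> str:
    # Nested helper: try the irregular dictionary first, then the suffix rules.
    def aux(w, vocab, irregular, rules):
        lemma = irregular.get(w)
        if lemma is not None:
            return lemma
        for old, new in rules:
            if w.endswith(old):
                lemma = w[:len(w) - len(old)] + new
                if lemma in vocab:      # accept only lemmas found in the vocabulary
                    return lemma
        return None

    w = word.lower()                    # case-insensitive processing
    lemma = aux(w, lexica.verbs, lexica.verbs_irregular, lexica.verbs_rules)
    if lemma is None:
        lemma = aux(w, lexica.nouns, lexica.nouns_irregular, lexica.nouns_rules)
    return lemma if lemma is not None else w
```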
We now test our lemmatizer for nouns and verbs:
At last, let us recount word types in emory-wiki.txt using the lemmatizer and save them:
When the words are further normalized by lemmatization, the number of word tokens remains the same as without lemmatization, but the number of word types is reduced from 197 to 177.
In which tasks can lemmatization negatively impact performance?
Source: lemmatization.py
An Algorithm for Suffix Stripping, Porter, Program: Electronic Library and Information Systems, 14(3), 1980 (PDF).
ELIT Morphological Analyzer - A heuristic-based lemmatizer.
Text processing refers to the manipulation and analysis of textual data through techniques applied to raw text, making it more structured, understandable, and suitable for various applications.
If you are not acquainted with Python programming, I strongly recommend going through all the examples in this section, as they provide detailed explanations of packages and functions commonly used for language processing.
Update: 2023-10-31
Consider the following text from the Wikipedia article about Emory University (as of 2023-10-18):
Our task is to determine the number of word tokens and unique word types in this text. A simple way of accomplishing this task is to split the text with whitespaces and count the strings:
What is the difference between a word token and a word type?
L1: Import the Counter class from the collections package.
L4: Open the corpus file (open()), read its contents as a string (read()), split the string into a list of words (split()), and store them in the words list.
L5: Use Counter to count the occurrences of each word in words and store the results in the word_counts dictionary.
L7: Print the total number of word tokens in the corpus, which is the length of words (len()).
L8: Print the number of unique word types in the corpus, which is the length of word_counts.
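These steps can be sketched as follows, assuming the corpus file is named emory-wiki.txt as referenced elsewhere in this chapter:

```python
from collections import Counter

with open('emory-wiki.txt') as fin:
    words = fin.read().split()      # split on whitespace only

word_counts = Counter(words)        # word -> number of occurrences

print('# of word tokens:', len(words))
print('# of word types:', len(word_counts))
```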
In this task, we want to check the top-k most or least frequently occurring words in this text:
L4: Sort items in word_counts in ascending order of count and save them into wc_asc as a list of (word, count) tuples.
L5: Iterate over the top 10 (least frequent) words in the sorted list and print each word along with its count.
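A sketch of these two steps:

```python
# Sort by count in ascending order; the least frequent words come first.
wc_asc = sorted(word_counts.items(), key=lambda x: x[1])
for word, count in wc_asc[:10]:
    print(word, count)
```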
Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This is because the text was split only by whitespace without considering punctuation. As a result, these words are recognized as word types separate from "Atlanta" and "Georgia", respectively. Hence, the counts of word tokens and types computed above do not necessarily represent the distributions of the text accurately.
Finally, save all word types in alphabetical order to a file:
L1: Open a file word_types.txt in write mode (w). If the file does not exist, it will be created; if it does exist, its previous contents will be overwritten.
L2: Iterate over the unique word types (keys) of word_counts in alphabetical order, and write each word followed by a newline character to fout.
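A sketch of the saving step:

```python
with open('word_types.txt', 'w') as fout:
    for word in sorted(word_counts):   # unique word types in alphabetical order
        fout.write(word + '\n')
```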
Update: 2024-01-05
Your goal is to extract specific information from each book in the provided text file. For each book, gather the following details:
Book Title
Year of Publishing
For each chapter within a book, collect the following information:
Chapter Number
Chapter Title
Token count of the chapter content (excluding the chapter number and title), considering each symbol and punctuation as a separate token.
Download the file and place it under the directory. Note that the text file is already tokenized.
Create a text_processing.py file in the directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:
For each book, ensure that the list of chapters is sorted in ascending order based on chapter numbers. For every chapter across the books, your program should produce the same list for both the original and modified versions of the text file.
If your program encounters difficulty locating the text file, please verify the working directory specified in the run configuration: [Run - Edit Configurations] -> text_processing. Confirm that the working directory path is set to the top directory, nlp-essentials.
RE_Abbreviation: Dr., U.S.A.
RE_Apostrophe: '80, '90s, 'cause
RE_Concatenation: don't, gonna, cannot
RE_Hyperlink: https://emory.gitbook.io/nlp-essentials
RE_Number: 1/2, 123-456-7890, 1,000,000
RE_Unit: $10, #20, 5kg
Your regular expressions will be assessed based on typical cases beyond the examples mentioned above.
Commit and push the text_processing.py file to your GitHub repository.
Update: 2023-10-31
Regular expressions, commonly abbreviated as regex, form a language for string matching, enabling operations to search, match, and manipulate text based on specific patterns or rules.
Online Interpreter:
Regex provides metacharacters with specific meanings, making it convenient to define patterns:
. : any single character except a newline character.
[ ] : a character set matching any character within the brackets.
\d : any digit, equivalent to [0-9].
\D : any character that is not a digit, equivalent to [^0-9].
\s : any whitespace character, equivalent to [ \t\n\r\f\v].
\S : any character that is not a whitespace character, equivalent to [^ \t\n\r\f\v].
\w : any word character (alphanumeric or underscore), equivalent to [A-Za-z0-9_].
\W : any character that is not a word character, equivalent to [^A-Za-z0-9_].
\b : a word boundary matching the position between a word character and a non-word character.
Examples:
M.\. matches "Mr." and "Ms.", but not "Mrs." (\ escapes the metacharacter .).
[aeiou] matches any vowel.
\d\d\d searches for "170" in "CS170".
\D\D searches for "kg" in "100kg".
\s searches for the space " " in "Hello World".
\S searches for "H" in " Hello".
\w\w searches for "1K" in "$1K".
\W searches for "!" in "Hello!".
\bis\b matches "is", but does not match "island" and does not find "is" when searching "basis".
Repetitions allow you to define complex patterns that can match multiple occurrences of a character or group of characters:
* : the preceding character or group appears zero or more times.
+ : the preceding character or group appears one or more times.
? : the preceding character or group appears zero or one time, making it optional.
{m} : the preceding character or group appears exactly m times.
{m,n} : the preceding character or group appears at least m times but no more than n times.
{m,} : the preceding character or group appears at least m times.
By default, matches are "greedy" such that patterns match as many characters as possible.
Matches become "lazy" by adding ? after the repetition metacharacters, in which case patterns match as few characters as possible.
Examples:
\d* matches "90" in "90s" as well as "" (empty string) in "ABC".
\d+ matches "90" in "90s", but has no match in "ABC".
https? matches both "http" and "https".
\d{3} is equivalent to \d\d\d.
\d{2,4} matches "12", "123", "1234", but not "1" or "12345".
\d{2,} matches "12", "123", "1234", and "12345", but not "1".
<.+> matches the entire string of "<Hello> and <World>".
<.+?> matches "<Hello>" in "<Hello> and <World>", and searches for "<World>" in the text.
Grouping allows you to treat multiple characters, subpatterns, or metacharacters as a single unit. It is achieved by placing these characters within parentheses ( and ).
| : a logical OR, referred to as a "pipe" symbol, allowing you to specify alternatives.
( ) : a capturing group; any text that matches the parenthesized pattern is "captured" and can be extracted or used in various ways.
(?: ) : a non-capturing group; any text that matches the parenthesized pattern, while indeed matched, is not "captured" and thus cannot be extracted or used in other ways.
\num : a backreference that refers back to the text most recently matched by the num'th capturing group within the same regex.
You can nest groups within other groups to create more complex patterns.
Examples:
(cat|dog) matches either "cat" or "dog".
(\w+)@(\w+.\w+) has two capturing groups, (\w+) and (\w+.\w+), and matches email addresses such as "john@emory.edu", where the first and second groups capture "john" and "emory.edu", respectively.
(?:\w+)@(\w+.\w+) has one non-capturing group, (?:\w+), and one capturing group, (\w+.\w+). It still matches "john@emory.edu" but only captures "emory.edu", not "john".
(\w+) (\w+) - (\2), (\1) has four capturing groups, where the third and fourth groups refer to the second and first groups, respectively. It matches "Jinho Choi - Choi, Jinho", where the first and fourth groups capture "Jinho" and the second and third groups capture "Choi".
(\w+.(edu|org)) has two capturing groups, where the second group is nested inside the first. It matches "emory.edu" or "emorynlp.org", where the first group captures the entire text and the second group captures "edu" or "org", respectively.
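The capturing and backreference examples above can be verified as follows:

```python
import re

m = re.match(r'(\w+)@(\w+.\w+)', 'john@emory.edu')
print(m.group(0), m.group(1), m.group(2))   # john@emory.edu john emory.edu

m = re.match(r'(\w+) (\w+) - (\2), (\1)', 'Jinho Choi - Choi, Jinho')
print(m.groups())                           # ('Jinho', 'Choi', 'Choi', 'Jinho')
```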
Assertions define conditions that must be met for a match to occur. They do not consume characters in the input text but specify the position where a match should happen based on specific criteria.
A positive lookahead assertion (?= ) checks that a specific pattern is present immediately after the current position.
A negative lookahead assertion (?! ) checks that a specific pattern is not present immediately after the current position.
A positive lookbehind assertion (?<= ) checks that a specific pattern is present immediately before the current position.
A negative lookbehind assertion (?<! ) checks that a specific pattern is not present immediately before the current position.
^ asserts that the pattern following the caret must match at the beginning of the text.
$ asserts that the pattern preceding the dollar sign must match at the end of the text.
Examples:
apple(?=[ -]pie) matches "apple" in "apple pie" or "apple-pie", but not in "apple juice".
do(?!(?: not|n't)) matches "do" in "do it" or "doing", but not in "do not" or "don't".
(?<=\$)\d+ matches "100" in "$100", but not in "100 dollars".
(?<!not )(happy|sad) searches for "happy" in "I'm happy", but does not search for "sad" in "I'm not sad".
not searches for "not" in "note" and "cannot", whereas ^not matches "not" in "note" but not in "cannot".
not$ searches for "not" in "cannot" but not in "note".
Python provides several functions to make use of regular expressions.
Let us create a regular expression that matches "Mr." and "Ms.":
L1: r'M' matches the letter "M", r'[rs]' matches either "r" or "s", and r'\.' matches a period (dot).
Currently, no group has been specified for re_mr:
Let us capture the letters and the period as separate groups:
L1: The pattern re_mr looks for the following:
1st group: "M" followed by either "r" or "s".
2nd group: a period (".").
L2: Match re_mr with the input string "Ms".
L7: Print specific groups by specifying their indexes. Group 0 is the entire match, group 1 is the first capture group, and group 2 is the second capture group.
If the pattern does not find a match, it returns None.
Let us match the following strings with re_mr:
s1 matches "Mr." but not "Ms.", while s2 does not match any pattern. This is because the match() function matches patterns only at the beginning of the string. To match patterns anywhere in the string, we need to use search() instead:
The search() function matches "Mr." in both s1 and s2 but still does not match "Ms.". To match them all, we need to use the findall() function:
While the findall() function matches all occurrences of the pattern, it does not provide a way to locate the positions of the matched results in the string. To find the locations of the matched results, we need to use the finditer() function:
You can also replace the matched results with another string by using the sub() function:
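The following sketch consolidates the steps above. The pattern follows the grouped re_mr described earlier, while the test strings s1 and s2 are assumptions for illustration:

```python
import re

re_mr = re.compile(r'(M[rs])(\.)')
s1 = 'Mr. Wayne and Ms. Kyle'
s2 = 'Here are Mr. Wayne and Ms. Kyle'

print(re_mr.match(s1))        # matches "Mr." at the beginning of s1
print(re_mr.match(s2))        # None: the pattern is not at the beginning of s2
print(re_mr.search(s2))       # finds the first "Mr." anywhere in s2
print(re_mr.findall(s1))      # [('Mr', '.'), ('Ms', '.')] - groups only, no positions
for m in re_mr.finditer(s1):  # every occurrence with its location
    print(m.group(), m.start(), m.end())
print(re_mr.sub('Dr.', s1))   # 'Dr. Wayne and Dr. Kyle'
```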
Finally, let us write a simple tokenizer using regular expressions. We will define a regular expression that matches the necessary patterns for tokenization:
L2: Create a regular expression to match delimiters and a special case:
Delimiters: ',', '.', or whitespaces ('\s+').
The special case: 'n't' (e.g., "can't").
L3: Create an empty list tokens to store the resulting tokens, and initialize prev_idx to keep track of the previous token's end position.
L5: Iterate over matches in text using the regular expression pattern.
L6: Extract the substring between the previous token's end and the current match's start, strip any leading or trailing whitespace, and assign it to t.
L7: If t is not empty (i.e., it is not just whitespace), add it to the tokens list.
L8: Extract the matched token from the match object, strip any leading or trailing whitespace, and assign it to t.
L10: If t is not empty (i.e., the pattern is matched):
L11-12: Check if the previous token in tokens is "Mr" or "Ms" and the current token is a period ("."); if so, combine them into a single token.
L13-14: Otherwise, add t to tokens.
L18-19: After the loop, there might be some text left after the last token. Extract it, strip any leading or trailing whitespace, and add it to tokens.
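Putting these steps together, a sketch of such a tokenizer might look as follows; the exact pattern and function name in the original code may differ.

```python
import re

def regex_tokenize(text: str) -> list:
    # Delimiters (comma, period, whitespace) plus the special case "n't".
    re_tok = re.compile(r"([.,]|\s+|n't)")
    tokens, prev_idx = [], 0

    for m in re_tok.finditer(text):
        # Text between the previous match's end and this match's start.
        t = text[prev_idx:m.start()].strip()
        if t:
            tokens.append(t)
        # The matched delimiter or special case itself.
        t = m.group().strip()
        if t:
            if tokens and tokens[-1] in {'Mr', 'Ms'} and t == '.':
                tokens[-1] += t          # e.g., 'Mr' + '.' -> 'Mr.'
            else:
                tokens.append(t)
        prev_idx = m.end()

    # Any remaining text after the last match.
    t = text[prev_idx:].strip()
    if t:
        tokens.append(t)
    return tokens
```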
Test cases for the tokenizer:
Update: 2023-10-31
Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.
What is the difference between a word and a token?
When examining the word types from the previous section, you notice several words that need further tokenization, many of which can be resolved by leveraging punctuation:
"R1:
-> ['"', "R1", ":"]
(R&D)
-> ['(', 'R&D', ')']
15th-largest
-> ['15th', '-', 'largest']
Atlanta,
-> ['Atlanta', ',']
Department's
-> ['Department', "'s"]
activity"[26]
-> ['activity', '"', '[26]']
centers.[21][22]
-> ['centers', '.', '[21]', '[22]']
Depending on the task, you may want to tokenize [26] into ['[', '26', ']'] for more generalization. In this case, however, we consider "[26]" as a unique identifier for the corresponding reference rather than as the number 26 surrounded by square brackets. Thus, we aim to recognize it as a single token.
delimit()
Let us write the delimit() function that takes a word and a set of delimiters and returns a list of tokens by splitting the word using the delimiters:
L3: If no delimiter is found, return a list containing word as a single token.
L5: If a delimiter is found, create a list tokens to store the individual tokens.
L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.
L7: Add the delimiter itself as a separate token to tokens.
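A recursive sketch that follows this description (the original implementation may differ in detail):

```python
def delimit(word: str, delimiters: set) -> list:
    # Index of the first delimiter character in word, or -1 if none exists.
    i = next((k for k, c in enumerate(word) if c in delimiters), -1)
    if i < 0:
        return [word]                 # no delimiter: the whole word is one token
    tokens = []
    if i > 0:
        tokens.append(word[:i])       # characters before the delimiter
    tokens.append(word[i])            # the delimiter itself
    if i + 1 < len(word):             # recurse on the remaining characters
        tokens.extend(delimit(word[i + 1:], delimiters))
    return tokens
```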
We now test delimit() using the following cases:
postprocess()
When reviewing the above output, the first four test cases yield accurate results, while the last five are not handled correctly; they should have been tokenized as follows:
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000 -> ['149,000']
U.S. -> ['U.S.']
To handle these special cases, let us post-process the tokens generated by delimit():
L2: Initialize variables i for the current position and new_tokens for the resulting tokens.
L4: Iterate through the input tokens.
L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).
L6: Combine the apostrophe and "s" and append it as a single token.
L7: Move the position indicator by 1 to skip the next character.
L8-10: Case 2: Handling numbers in special formats like [##], ###,### (e.g., [42], 12,345).
L11: Combine the special number format and append it as a single token.
L12: Move the position indicator by 2 to skip the next two characters.
L13: Case 3: Handling acronyms like "U.S.".
L14: Combine the acronym and append it as a single token.
L15: Move the position indicator by 3 to skip the next three characters.
L16-17: Case 4: If none of the special cases above are met, append the current token.
L18: Move the position indicator by 1 to process the next token.
L20: Return the list of processed tokens.
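A sketch that follows the four cases above; the exact checks in the original code may differ.

```python
def postprocess(tokens: list) -> list:
    i, new_tokens = 0, []
    while i < len(tokens):
        # Case 1: contractions such as "'" + "s" -> "'s"
        if tokens[i] == "'" and i + 1 < len(tokens) and tokens[i + 1].lower() == 's':
            new_tokens.append(tokens[i] + tokens[i + 1])
            i += 2
        # Case 2: special number formats such as '[' + '42' + ']' or '12' + ',' + '345'
        elif i + 2 < len(tokens) and (
                (tokens[i] == '[' and tokens[i + 1].isdigit() and tokens[i + 2] == ']') or
                (tokens[i].isdigit() and tokens[i + 1] == ',' and tokens[i + 2].isdigit())):
            new_tokens.append(''.join(tokens[i:i + 3]))
            i += 3
        # Case 3: acronyms such as 'U' + '.' + 'S' + '.' -> 'U.S.'
        elif i + 3 < len(tokens) and tokens[i].isalpha() and tokens[i + 1] == '.' \
                and tokens[i + 2].isalpha() and tokens[i + 3] == '.':
            new_tokens.append(''.join(tokens[i:i + 4]))
            i += 4
        # Case 4: no special case applies
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens
```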
Once the post-processing is applied, all outputs are handled correctly:
tokenize()
At last, we write the tokenize() function that takes a file path to a corpus and a set of delimiters and returns a list of tokens from the corpus:
L2: Read the contents of a file (corpus) and split it into words.
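A sketch of the full pipeline, reusing the delimit() and postprocess() functions above:

```python
def tokenize(corpus: str, delimiters: set) -> list:
    with open(corpus) as fin:
        words = fin.read().split()
    # Refine each whitespace-separated word with delimit() and postprocess().
    return [t for word in words for t in postprocess(delimit(word, delimiters))]
```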
Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.
L1: Sort items in word_counts in descending order of count and save them into wc_dec as a list of (word, count) tuples, sorted from the most frequent to the least frequent words.
L2: Iterate over the top 10 (most frequent) words in the sorted list and print each word along with its count.
, The Python Standard Library - Built-in Types.
, The Python Standard Library - Built-in Types.
, The Python Tutorial.
Source:
The file contains books and chapters listed in chronological order. Your program will be evaluated using both the original and a modified version of this text file, maintaining the same format but with books and chapters arranged in a mixed order.
Define regular expressions to match the following cases using the corresponding variables in text_processing.py:
The terms "match" and "search" in the above examples have different meanings. "match" means that the pattern must be found at the beginning of the text, while "search" means that the pattern can be located anywhere in the text. We will discuss these two functions in more detail in the .
L3: Create a regular expression re_mr. Note that a string with an r prefix is a raw string literal in Python, which is commonly used to write regular expressions.
L4: Try to match re_mr at the beginning of the string "Mr. Wayne" (match()).
L6: Print the value of m. If matched, it prints the match information; otherwise, m is None; thus, it prints "None".
L7: Check if a match was found (m is not None), and print the start position (start()) and end position (end()) of the match.
L1:
L5: Print the entire matched string (group()).
L6: Print a tuple of all captured groups (groups()).
Source:
, Kuchling, HOWTOs in Python Documentation.
L1:
L2: Find the index of the first character in word that is in the delimiters set. If no delimiter is found in word, return -1.
L9-10: If there are characters after the delimiter, recursively call the delimit() function on the remaining part of word and extend the tokens list with the result.
L1: .
L3: Tokenize each word in the corpus using the specified delimiters. postprocess() is used to process the special cases further. The resulting tokens are collected in a list and returned.
Given the new tokenizer, let us recount word types in the corpus, emory-wiki.txt, and save them:
Despite the increase in word types, using a more advanced tokenizer effectively mitigates the sparsity issue. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?
Source:
- A heuristic-based tokenizer.