Update: 2023-10-31
Sometimes, it is more appropriate to consider the canonical forms as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature for each year, you want to count both "transformer" and "transformers" as a single item.
Lemmatization is the task of reducing words to their base or dictionary forms, known as lemmas, which makes it easier to interpret their core meaning.
What is the difference between a lemmatizer and a stemmer [1]?
When analyzing the word types obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:
Universities
University
universities
university
Two variations are applied to the noun "university": capitalization, generally used for proper nouns or initial words, and pluralization, which indicates multiple instances of the term. On the other hand, verbs can also take several variations regarding tense and aspect:
study
studies
studied
studying
get_lemma_lexica()
We want to develop a lemmatizer that normalizes all variations into their respective lemmas. Let us start by creating lexica for lemmatization:
L1: SimpleNamespace
L8-11: JSON encoder and decoder
Lexica:
nouns.txt: base nouns
nouns_irregular.json: nouns whose plural forms are irregular (e.g., mouse -> mice)
nouns_rules.json: pluralization rules for nouns
verbs.txt: base verbs
verbs_irregular.json: verbs whose inflection forms are irregular (e.g., buy -> bought)
verbs_rules.json: inflection rules for verbs
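A minimal sketch of what get_lemma_lexica() could look like is given below. It assumes the lexicon files above are stored in a dat/ directory and uses illustrative helper names; the original implementation may differ.

```python
import json
from types import SimpleNamespace

def get_lemma_lexica(res_dir: str = 'dat') -> SimpleNamespace:
    # Load a plain-text lexicon: one base form per line.
    def load_set(path):
        with open(path) as fin:
            return {line.strip() for line in fin if line.strip()}

    # Load a JSON lexicon: irregular forms or inflection rules.
    def load_json(path):
        with open(path) as fin:
            return json.load(fin)

    # Bundle all lexica into a single object for convenient access.
    return SimpleNamespace(
        nouns=load_set(f'{res_dir}/nouns.txt'),
        nouns_irregular=load_json(f'{res_dir}/nouns_irregular.json'),
        nouns_rules=load_json(f'{res_dir}/nouns_rules.json'),
        verbs=load_set(f'{res_dir}/verbs.txt'),
        verbs_irregular=load_json(f'{res_dir}/verbs_irregular.json'),
        verbs_rules=load_json(f'{res_dir}/verbs_rules.json'))
```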
lemmatize()
Next, let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:
L2: Define a nested function aux to handle lemmatization.
L3-4: Check if the word is in the irregular dictionary (using get()); if so, return its lemma.
L6-7: Try applying each rule in the rules list to word.
L8: If the resulting lemma is in the vocabulary, return it.
L10: If no lemma is found, return None.
L12: Convert the input word to lowercase for case-insensitive processing.
L13: Try to lemmatize the word using the verb-related lexica.
L15-16: If no lemma is found among verbs, try to lemmatize using the noun-related lexica.
L18: Return the lemma if found, or the lowercased word if no lemmatization occurred.
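A sketch consistent with these steps is shown below; it assumes the rule files store (suffix, replacement) pairs such as ["ies", "y"], which is an assumption about the lexica format rather than the confirmed one.

```python
def lemmatize(word: str, lexica) -> str:
    # Nested helper: try the irregular dictionary first, then the suffix rules.
    def aux(w, vocab, irregular, rules):
        lemma = irregular.get(w)
        if lemma is not None:
            return lemma
        for old, new in rules:
            if w.endswith(old):
                lemma = w[:len(w) - len(old)] + new
                if lemma in vocab:      # accept only lemmas found in the vocabulary
                    return lemma
        return None

    w = word.lower()                    # case-insensitive processing
    lemma = aux(w, lexica.verbs, lexica.verbs_irregular, lexica.verbs_rules)
    if lemma is None:
        lemma = aux(w, lexica.nouns, lexica.nouns_irregular, lexica.nouns_rules)
    return lemma if lemma is not None else w
```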
We now test our lemmatizer for nouns and verbs:
At last, let us recount word types in emory-wiki.txt using the lemmatizer and save them:
When the words are further normalized by lemmatization, the number of word tokens remains the same as without lemmatization, but the number of word types is reduced from 197 to 177.
In which tasks can lemmatization negatively impact performance?
Source: lemmatization.py
An Algorithm for Suffix Stripping, Porter, Program: Electronic Library and Information Systems, 14(3), 1980 (PDF).
ELIT Morphological Analyzer - A heuristic-based lemmatizer.
Text processing refers to the manipulation and analysis of textual data through techniques applied to raw text, making it more structured, understandable, and suitable for various applications.
If you are not acquainted with Python programming, I strongly recommend going through all the examples in this section, as they provide detailed explanations of packages and functions commonly used for language processing.
Update: 2023-10-31
Consider the following text from the Wikipedia article about Emory University (as of 2023-10-18):
Our task is to determine the number of word tokens and unique word types in this text. A simple way of accomplishing this task is to split the text with whitespaces and count the strings:
What is the difference between a word token and a word type?
L1: Import the Counter class from the collections package.
L4: Open the corpus file (open()), read its contents as a string (read()), split the string into a list of words (split()), and store them in the words list.
L5: Use Counter to count the occurrences of each word in words and store the results in the word_counts dictionary.
L7: Print the total number of word tokens in the corpus, which is the length of words (len()).
L8: Print the number of unique word types in the corpus, which is the length of word_counts.
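These steps can be sketched as follows, assuming the corpus file is named emory-wiki.txt as referenced elsewhere in this chapter:

```python
from collections import Counter

with open('emory-wiki.txt') as fin:
    words = fin.read().split()      # split on whitespace only

word_counts = Counter(words)        # word -> number of occurrences

print('# of word tokens:', len(words))
print('# of word types:', len(word_counts))
```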
In this task, we want to check the top-k most or least frequently occurring words in this text:
L4: Sort items in word_counts in ascending order of count and save them into wc_asc as a list of (word, count) tuples.
L5: Iterate over the top 10 (least frequent) words in the sorted list and print each word along with its count.
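A sketch of these two steps:

```python
# Sort by count in ascending order; the least frequent words come first.
wc_asc = sorted(word_counts.items(), key=lambda x: x[1])
for word, count in wc_asc[:10]:
    print(word, count)
```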
Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This is because the text was split only by whitespace without considering punctuation. As a result, these words are recognized as word types separate from "Atlanta" and "Georgia", respectively. Hence, the counts of word tokens and types computed above do not necessarily represent the distributions of the text accurately.
Finally, save all word types in alphabetical order to a file:
L1: Open a file word_types.txt in write mode (w). If the file does not exist, it will be created; if it does exist, its previous contents will be overwritten.
L2: Iterate over the unique word types (keys) of word_counts in alphabetical order, and write each word followed by a newline character to fout.
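A sketch of the saving step:

```python
with open('word_types.txt', 'w') as fout:
    for word in sorted(word_counts):   # unique word types in alphabetical order
        fout.write(word + '\n')
```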
Update: 2024-01-05
Your goal is to extract specific information from each book in the provided text file. For each book, gather the following details:
Book Title
Year of Publishing
For each chapter within a book, collect the following information:
Chapter Number
Chapter Title
Token count of the chapter content (excluding the chapter number and title), considering each symbol and punctuation as a separate token.
Download the file and place it under the directory. Note that the text file is already tokenized.
Create a text_processing.py file in the directory.
Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:
For each book, ensure that the list of chapters is sorted in ascending order based on chapter numbers. For every chapter across the books, your program should produce the same list for both the original and modified versions of the text file.
If your program encounters difficulty locating the text file, please verify the working directory specified in the run configuration: [Run - Edit Configurations] -> text_processing. Confirm that the working directory path is set to the top directory, nlp-essentials.
RE_Abbreviation: Dr., U.S.A.
RE_Apostrophe: '80, '90s, 'cause
RE_Concatenation: don't, gonna, cannot
RE_Hyperlink: https://emory.gitbook.io/nlp-essentials
RE_Number: 1/2, 123-456-7890, 1,000,000
RE_Unit: $10, #20, 5kg
Your regular expressions will be assessed based on typical cases beyond the examples mentioned above.
Commit and push the text_processing.py file to your GitHub repository.
Update: 2023-10-31
Regular expressions, commonly abbreviated as regex, form a language for string matching, enabling operations to search, match, and manipulate text based on specific patterns or rules.
Online Interpreter:
Regex provides metacharacters with specific meanings, making it convenient to define patterns:
. : any single character except a newline character.
[ ] : a character set matching any character within the brackets.
\d : any digit, equivalent to [0-9].
\D : any character that is not a digit, equivalent to [^0-9].
\s : any whitespace character, equivalent to [ \t\n\r\f\v].
\S : any character that is not a whitespace character, equivalent to [^ \t\n\r\f\v].
\w : any word character (alphanumeric or underscore), equivalent to [A-Za-z0-9_].
\W : any character that is not a word character, equivalent to [^A-Za-z0-9_].
\b : a word boundary matching the position between a word character and a non-word character.
Examples:
M.\. matches "Mr." and "Ms.", but not "Mrs." (\ escapes the metacharacter .).
[aeiou] matches any vowel.
\d\d\d searches for "170" in "CS170".
\D\D searches for "kg" in "100kg".
\s searches for the space " " in "Hello World".
\S searches for "H" in " Hello".
\w\w searches for "1K" in "$1K".
\W searches for "!" in "Hello!".
\bis\b matches "is", but does not match "island" and does not find "is" when searching "basis".
Repetitions allow you to define complex patterns that can match multiple occurrences of a character or group of characters:
* : the preceding character or group appears zero or more times.
+ : the preceding character or group appears one or more times.
? : the preceding character or group appears zero or one time, making it optional.
{m} : the preceding character or group appears exactly m times.
{m,n} : the preceding character or group appears at least m times but no more than n times.
{m,} : the preceding character or group appears at least m times.
By default, matches are "greedy" such that patterns match as many characters as possible.
Matches become "lazy" by adding ? after the repetition metacharacters, in which case patterns match as few characters as possible.
Examples:
\d* matches "90" in "90s" as well as "" (empty string) in "ABC".
\d+ matches "90" in "90s", but has no match in "ABC".
https? matches both "http" and "https".
\d{3} is equivalent to \d\d\d.
\d{2,4} matches "12", "123", "1234", but not "1" or "12345".
\d{2,} matches "12", "123", "1234", and "12345", but not "1".
<.+> matches the entire string of "<Hello> and <World>".
<.+?> matches "<Hello>" in "<Hello> and <World>", and searches for "<World>" in the text.
Grouping allows you to treat multiple characters, subpatterns, or metacharacters as a single unit. It is achieved by placing these characters within parentheses ( and ).
| : a logical OR, referred to as a "pipe" symbol, allowing you to specify alternatives.
( ) : a capturing group; any text that matches the parenthesized pattern is "captured" and can be extracted or used in various ways.
(?: ) : a non-capturing group; any text that matches the parenthesized pattern, while indeed matched, is not "captured" and thus cannot be extracted or used in other ways.
\num : a backreference that refers back to the text most recently matched by the num'th capturing group within the same regex.
You can nest groups within other groups to create more complex patterns.
Examples:
(cat|dog) matches either "cat" or "dog".
(\w+)@(\w+.\w+) has two capturing groups, (\w+) and (\w+.\w+), and matches email addresses such as "john@emory.edu", where the first and second groups capture "john" and "emory.edu", respectively.
(?:\w+)@(\w+.\w+) has one non-capturing group, (?:\w+), and one capturing group, (\w+.\w+). It still matches "john@emory.edu" but only captures "emory.edu", not "john".
(\w+) (\w+) - (\2), (\1) has four capturing groups, where the third and fourth groups refer to the second and first groups, respectively. It matches "Jinho Choi - Choi, Jinho", where the first and fourth groups capture "Jinho" and the second and third groups capture "Choi".
(\w+.(edu|org)) has two capturing groups, where the second group is nested inside the first. It matches "emory.edu" or "emorynlp.org", where the first group captures the entire text and the second group captures "edu" or "org", respectively.
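The capturing and backreference examples above can be verified as follows:

```python
import re

m = re.match(r'(\w+)@(\w+.\w+)', 'john@emory.edu')
print(m.group(0), m.group(1), m.group(2))   # john@emory.edu john emory.edu

m = re.match(r'(\w+) (\w+) - (\2), (\1)', 'Jinho Choi - Choi, Jinho')
print(m.groups())                           # ('Jinho', 'Choi', 'Choi', 'Jinho')
```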
Assertions define conditions that must be met for a match to occur. They do not consume characters in the input text but specify the position where a match should happen based on specific criteria.
A positive lookahead assertion (?= ) checks that a specific pattern is present immediately after the current position.
A negative lookahead assertion (?! ) checks that a specific pattern is not present immediately after the current position.
A positive lookbehind assertion (?<= ) checks that a specific pattern is present immediately before the current position.
A negative lookbehind assertion (?<! ) checks that a specific pattern is not present immediately before the current position.
^ asserts that the pattern following the caret must match at the beginning of the text.
$ asserts that the pattern preceding the dollar sign must match at the end of the text.
Examples:
apple(?=[ -]pie) matches "apple" in "apple pie" or "apple-pie", but not in "apple juice".
do(?!(?: not|n't)) matches "do" in "do it" or "doing", but not in "do not" or "don't".
(?<=\$)\d+ matches "100" in "$100", but not in "100 dollars".
(?<!not )(happy|sad) searches for "happy" in "I'm happy", but does not search for "sad" in "I'm not sad".
not searches for "not" in "note" and "cannot", whereas ^not matches "not" in "note" but not in "cannot".
not$ searches for "not" in "cannot" but not in "note".
Python provides several functions to make use of regular expressions.
Let us create a regular expression that matches "Mr." and "Ms.":
L1: r'M' matches the letter "M", r'[rs]' matches either "r" or "s", and r'\.' matches a period (dot).
Currently, no group has been specified for re_mr:
Let us capture the letters and the period as separate groups:
L1: The pattern re_mr looks for the following:
1st group: "M" followed by either "r" or "s".
2nd group: a period (".").
L2: Match re_mr with the input string "Ms".
L7: Print specific groups by specifying their indexes. Group 0 is the entire match, group 1 is the first capture group, and group 2 is the second capture group.
If the pattern does not find a match, it returns None.
Let us match the following strings with re_mr:
s1 matches "Mr." but not "Ms.", while s2 does not match any pattern. This is because the match() function matches patterns only at the beginning of the string. To match patterns anywhere in the string, we need to use search() instead:
The search() function matches "Mr." in both s1 and s2 but still does not match "Ms.". To match them all, we need to use the findall() function:
While the findall() function matches all occurrences of the pattern, it does not provide a way to locate the positions of the matched results in the string. To find the locations of the matched results, we need to use the finditer() function:
You can also replace the matched results with another string by using the sub() function:
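The following sketch consolidates the steps above. The pattern follows the grouped re_mr described earlier, while the test strings s1 and s2 are assumptions for illustration:

```python
import re

re_mr = re.compile(r'(M[rs])(\.)')
s1 = 'Mr. Wayne and Ms. Kyle'
s2 = 'Here are Mr. Wayne and Ms. Kyle'

print(re_mr.match(s1))        # matches "Mr." at the beginning of s1
print(re_mr.match(s2))        # None: the pattern is not at the beginning of s2
print(re_mr.search(s2))       # finds the first "Mr." anywhere in s2
print(re_mr.findall(s1))      # [('Mr', '.'), ('Ms', '.')] - groups only, no positions
for m in re_mr.finditer(s1):  # every occurrence with its location
    print(m.group(), m.start(), m.end())
print(re_mr.sub('Dr.', s1))   # 'Dr. Wayne and Dr. Kyle'
```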
Finally, let us write a simple tokenizer using regular expressions. We will define a regular expression that matches the necessary patterns for tokenization:
L2: Create a regular expression to match delimiters and a special case:
Delimiters: ',', '.', or whitespaces ('\s+').
The special case: 'n't' (e.g., "can't").
L3: Create an empty list tokens to store the resulting tokens, and initialize prev_idx to keep track of the previous token's end position.
L5: Iterate over matches in text using the regular expression pattern.
L6: Extract the substring between the previous token's end and the current match's start, strip any leading or trailing whitespace, and assign it to t.
L7: If t is not empty (i.e., it is not just whitespace), add it to the tokens list.
L8: Extract the matched token from the match object, strip any leading or trailing whitespace, and assign it to t.
L10: If t is not empty (i.e., the pattern is matched):
L11-12: Check if the previous token in tokens is "Mr" or "Ms" and the current token is a period ("."); if so, combine them into a single token.
L13-14: Otherwise, add t to tokens.
L18-19: After the loop, there might be some text left after the last token. Extract it, strip any leading or trailing whitespace, and add it to tokens.
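Putting these steps together, a sketch of such a tokenizer might look as follows; the exact pattern and function name in the original code may differ.

```python
import re

def regex_tokenize(text: str) -> list:
    # Delimiters (comma, period, whitespace) plus the special case "n't".
    re_tok = re.compile(r"([.,]|\s+|n't)")
    tokens, prev_idx = [], 0

    for m in re_tok.finditer(text):
        # Text between the previous match's end and this match's start.
        t = text[prev_idx:m.start()].strip()
        if t:
            tokens.append(t)
        # The matched delimiter or special case itself.
        t = m.group().strip()
        if t:
            if tokens and tokens[-1] in {'Mr', 'Ms'} and t == '.':
                tokens[-1] += t          # e.g., 'Mr' + '.' -> 'Mr.'
            else:
                tokens.append(t)
        prev_idx = m.end()

    # Any remaining text after the last match.
    t = text[prev_idx:].strip()
    if t:
        tokens.append(t)
    return tokens
```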
Test cases for the tokenizer:
Update: 2023-10-31
Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.
What is the difference between a word and a token?
When examining the word types from the previous section, you notice several words that need further tokenization, many of which can be resolved by leveraging punctuation:
"R1:
-> ['"', "R1", ":"]
(R&D)
-> ['(', 'R&D', ')']
15th-largest
-> ['15th', '-', 'largest']
Atlanta,
-> ['Atlanta', ',']
Department's
-> ['Department', "'s"]
activity"[26]
-> ['activity', '"', '[26]']
centers.[21][22]
-> ['centers', '.', '[21]', '[22]']
Depending on the task, you may want to tokenize [26] into ['[', '26', ']'] for more generalization. In this case, however, we consider "[26]" as a unique identifier for the corresponding reference rather than as the number 26 surrounded by square brackets. Thus, we aim to recognize it as a single token.
delimit()
Let us write the delimit() function that takes a word and a set of delimiters and returns a list of tokens by splitting the word using the delimiters:
L3: If no delimiter is found, return a list containing word as a single token.
L5: If a delimiter is found, create a list tokens to store the individual tokens.
L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.
L7: Add the delimiter itself as a separate token to tokens.
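A recursive sketch that follows this description (the original implementation may differ in detail):

```python
def delimit(word: str, delimiters: set) -> list:
    # Index of the first delimiter character in word, or -1 if none exists.
    i = next((k for k, c in enumerate(word) if c in delimiters), -1)
    if i < 0:
        return [word]                 # no delimiter: the whole word is one token
    tokens = []
    if i > 0:
        tokens.append(word[:i])       # characters before the delimiter
    tokens.append(word[i])            # the delimiter itself
    if i + 1 < len(word):             # recurse on the remaining characters
        tokens.extend(delimit(word[i + 1:], delimiters))
    return tokens
```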
We now test delimit() using the following cases:
postprocess()
When reviewing the above output, the first four test cases yield accurate results, while the last five are not handled correctly; they should have been tokenized as follows:
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000 -> ['149,000']
U.S. -> ['U.S.']
To handle these special cases, let us post-process the tokens generated by delimit():
L2: Initialize variables i for the current position and new_tokens for the resulting tokens.
L4: Iterate through the input tokens.
L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).
L6: Combine the apostrophe and "s" and append it as a single token.
L7: Move the position indicator by 1 to skip the next character.
L8-10: Case 2: Handling numbers in special formats like [##], ###,### (e.g., [42], 12,345).
L11: Combine the special number format and append it as a single token.
L12: Move the position indicator by 2 to skip the next two characters.
L13: Case 3: Handling acronyms like "U.S.".
L14: Combine the acronym and append it as a single token.
L15: Move the position indicator by 3 to skip the next three characters.
L16-17: Case 4: If none of the special cases above are met, append the current token.
L18: Move the position indicator by 1 to process the next token.
L20: Return the list of processed tokens.
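A sketch that follows the four cases above; the exact checks in the original code may differ.

```python
def postprocess(tokens: list) -> list:
    i, new_tokens = 0, []
    while i < len(tokens):
        # Case 1: contractions such as "'" + "s" -> "'s"
        if tokens[i] == "'" and i + 1 < len(tokens) and tokens[i + 1].lower() == 's':
            new_tokens.append(tokens[i] + tokens[i + 1])
            i += 2
        # Case 2: special number formats such as '[' + '42' + ']' or '12' + ',' + '345'
        elif i + 2 < len(tokens) and (
                (tokens[i] == '[' and tokens[i + 1].isdigit() and tokens[i + 2] == ']') or
                (tokens[i].isdigit() and tokens[i + 1] == ',' and tokens[i + 2].isdigit())):
            new_tokens.append(''.join(tokens[i:i + 3]))
            i += 3
        # Case 3: acronyms such as 'U' + '.' + 'S' + '.' -> 'U.S.'
        elif i + 3 < len(tokens) and tokens[i].isalpha() and tokens[i + 1] == '.' \
                and tokens[i + 2].isalpha() and tokens[i + 3] == '.':
            new_tokens.append(''.join(tokens[i:i + 4]))
            i += 4
        # Case 4: no special case applies
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens
```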
Once the post-processing is applied, all outputs are handled correctly:
tokenize()
At last, we write the tokenize() function that takes a file path to a corpus and a set of delimiters and returns a list of tokens from the corpus:
L2: Read the contents of a file (corpus) and split it into words.
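A sketch of the full pipeline, reusing the delimit() and postprocess() functions above:

```python
def tokenize(corpus: str, delimiters: set) -> list:
    with open(corpus) as fin:
        words = fin.read().split()
    # Refine each whitespace-separated word with delimit() and postprocess().
    return [t for word in words for t in postprocess(delimit(word, delimiters))]
```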
Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.
L1: Sort items in word_counts in descending order of count and save them into wc_dec as a list of (word, count) tuples, sorted from the most frequent to the least frequent words.
L2: Iterate over the top 10 (most frequent) words in the sorted list and print each word along with its count.
, The Python Standard Library - Built-in Types.
, The Python Standard Library - Built-in Types.
, The Python Tutorial.
Source:
The file contains books and chapters listed in chronological order. Your program will be evaluated using both the original and a modified version of this text file, maintaining the same format but with books and chapters arranged in a mixed order.
Define regular expressions to match the following cases using the corresponding variables in text_processing.py:
The terms "match" and "search" in the above examples have different meanings. "match" means that the pattern must be found at the beginning of the text, while "search" means that the pattern can be located anywhere in the text. We will discuss these two functions in more detail in the .
L3: Create a regular expression re_mr. Note that a string with an r prefix is a raw string literal in Python, which is commonly used to write regular expressions.
L4: Try to match re_mr at the beginning of the string "Mr. Wayne" (match()).
L6: Print the value of m. If matched, it prints the match information; otherwise, m is None; thus, it prints "None".
L7: Check if a match was found (m is not None), and print the start position (start()) and end position (end()) of the match.
L1:
L5: Print the entire matched string (group()).
L6: Print a tuple of all captured groups (groups()).
Source:
, Kuchling, HOWTOs in Python Documentation.
L1:
L2: Find the index of the first character in word that is in the delimiters set. If no delimiter is found in word, return -1.
L9-10: If there are characters after the delimiter, recursively call the delimit() function on the remaining part of word and extend the tokens list with the result.
L1: .
L3: Tokenize each word in the corpus using the specified delimiters. postprocess() is used to process the special cases further. The resulting tokens are collected in a list and returned.
Given the new tokenizer, let us recount word types in the corpus, emory-wiki.txt, and save them:
Despite the increase in word types, using a more advanced tokenizer effectively mitigates the sparsity issue. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?
Source:
- A heuristic-based tokenizer.