
Lemmatization

Sometimes, it is more appropriate to consider canonical forms as tokens instead of their variations. For example, to analyze the usage of the word "transformer" in NLP literature for each year, you would want to count both "transformer" and "transformers" as a single item.

Lemmatization is the task of reducing words to their base or dictionary forms, known as lemmas, which makes it easier to interpret their core meaning.
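
For instance, counting over lemmas rather than raw tokens merges such variants. Below is a minimal sketch of this idea; the lemma lookup table is hypothetical and not part of the lexica introduced later:

    from collections import Counter

    # Hypothetical lemma lookup: map surface forms to their lemmas before counting.
    lemmas = {'transformers': 'transformer', 'Transformer': 'transformer'}
    tokens = ['transformer', 'transformers', 'Transformer']
    counts = Counter(lemmas.get(t, t) for t in tokens)
    print(counts)  # Counter({'transformer': 3})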


Q7: What is the difference between a lemmatizer and a stemmer [3]?

When analyzing dat/word_types-token.txt, obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:

  • Universities

  • University

  • universities

Two variations are applied to the noun "university": capitalization, generally used for proper nouns or sentence-initial words, and pluralization, which indicates multiple instances of the term. Verbs, on the other hand, can take several variations regarding tense and aspect:

  • study

  • studies

  • studied

We want to develop a lemmatizer that normalizes all variations into their respective lemmas.

Lemma Lexica

Let us create lexica for lemmatization:

  • L1: res_dir: the path to the root directory where all lexica files are located.

  • L2: nouns.txt: a list of base nouns.

  • L3: verbs.txt: a list of base verbs.

  • L4: nouns_irregular.json: a dictionary of nouns whose plural forms are irregular (e.g., mouse → mice).

  • L5: verbs_irregular.json: a dictionary of verbs whose inflection forms are irregular (e.g., buy → bought).

  • L6: nouns_rules.json: a list of pluralization rules for nouns.

  • L7: verbs_rules.json: a list of inflection rules for verbs.
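
Each entry in the rule files is a (suffix, replacement) pair. The following throwaway snippet (not part of the lexica code; the toy vocabulary is made up) illustrates how such a pair is meant to be applied:

    # Strip the suffix, append the replacement, and accept the candidate only if
    # it appears in the base vocabulary (toy stand-in for nouns.txt / verbs.txt).
    vocab = {'study', 'university'}
    word, (suffix, repl) = 'studies', ('ies', 'y')
    lemma = word[:-len(suffix)] + repl        # 'stud' + 'y' -> 'study'
    print(lemma if lemma in vocab else None)  # study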

We then verify that all lexical resources are loaded correctly:


Q8: What are the key differences between inflectional and derivational morphology?

Lemmatizing

Let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:

  • L2: Define a nested function aux to handle lemmatization.

  • L3-4: Check if the word is in the irregular dictionary (get()); if so, return its lemma.

  • L6-7: Try applying each rule in the rules list to word.

  • L8: If the resulting lemma is in the vocabulary, return it.

  • L10: If no lemma is found, return None.

  • L12: Convert the input word to lowercase for case-insensitive processing.

  • L13: Try to lemmatize the word using verb-related lexica.

  • L15-16: If no lemma is found among verbs, try to lemmatize using noun-related lexica.

  • L18: Return the lemma if found, or the decapitalized word if no lemmatization occurred.

We now test our lemmatizer for nouns and verbs:

Finally, let us recount word types in dat/emory-wiki.txt using the lemmatizer and save them to dat/word_types-token-lemma.txt:

  • L2: Import the tokenize() function from the src.tokenization module.

When the words are further normalized by lemmatization, the number of word tokens remains the same as without lemmatization, but the number of word types is reduced from 197 to 177.


Q9: In which tasks can lemmatization negatively impact performance?

References

  1. Source: lemmatization.py

  2. ELIT Morphological Analyzer - a heuristic-based lemmatizer

  3. An Algorithm for Suffix Stripping, Porter, Program: Electronic Library and Information Systems, 14(3), 1980

    import json
    import os
    from types import SimpleNamespace
    def get_lexica(res_dir: str) -> SimpleNamespace:
        with open(os.path.join(res_dir, 'nouns.txt')) as fin: nouns = {noun.strip() for noun in fin}
        with open(os.path.join(res_dir, 'verbs.txt')) as fin: verbs = {verb.strip() for verb in fin}
        with open(os.path.join(res_dir, 'nouns_irregular.json')) as fin: nouns_irregular = json.load(fin)
        with open(os.path.join(res_dir, 'verbs_irregular.json')) as fin: verbs_irregular = json.load(fin)
        with open(os.path.join(res_dir, 'nouns_rules.json')) as fin: nouns_rules = json.load(fin)
        with open(os.path.join(res_dir, 'verbs_rules.json')) as fin: verbs_rules = json.load(fin)
    
        return SimpleNamespace(
            nouns=nouns,
            verbs=verbs,
            nouns_irregular=nouns_irregular,
            verbs_irregular=verbs_irregular,
            nouns_rules=nouns_rules,
            verbs_rules=verbs_rules
        )
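
    # The verification below assumes the lexica have already been loaded by calling
    # get_lexica(); the resource path here is a placeholder - point it to the
    # directory containing nouns.txt, verbs.txt, and the *.json files.
    lexica = get_lexica('res/lemmatization')
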
    print(len(lexica.nouns))
    print(len(lexica.verbs))
    print(lexica.nouns_irregular)
    print(lexica.verbs_irregular)
    print(lexica.nouns_rules)
    print(lexica.verbs_rules)
    91
    27
    {'children': 'child', 'crises': 'crisis', 'mice': 'mouse'}
    {'is': 'be', 'was': 'be', 'has': 'have', 'had': 'have', 'bought': 'buy'}
    [['ies', 'y'], ['es', ''], ['s', ''], ['men', 'man'], ['ae', 'a'], ['i', 'us']]
    [['ies', 'y'], ['ied', 'y'], ['es', ''], ['ed', ''], ['s', ''], ['d', ''], ['ying', 'ie'], ['ing', ''], ['ing', 'e'], ['n', ''], ['ung', 'ing']]
    def lemmatize(lexica: SimpleNamespace, word: str) -> str:
        def aux(word: str, vocabs: dict[str, str], irregular: dict[str, str], rules: list[tuple[str, str]]):
            lemma = irregular.get(word, None)
            if lemma is not None: return lemma
    
            for p, s in rules:
                lemma = word[:-len(p)] + s
                if lemma in vocabs: return lemma
    
            return None
    
        word = word.lower()
        lemma = aux(word, lexica.verbs, lexica.verbs_irregular, lexica.verbs_rules)
    
        if lemma is None:
            lemma = aux(word, lexica.nouns, lexica.nouns_irregular, lexica.nouns_rules)
    
        return lemma if lemma else word
    nouns = ['studies', 'crosses', 'areas', 'gentlemen', 'vertebrae', 'alumni', 'children', 'crises']
    nouns_lemmatized = [lemmatize(lexica, word) for word in nouns]
    for word, lemma in zip(nouns, nouns_lemmatized): print('{} -> {}'.format(word, lemma))
    
    verbs = ['applies', 'cried', 'pushes', 'entered', 'takes', 'heard', 'lying', 'studying', 'taking', 'drawn', 'clung', 'was', 'bought']
    verbs_lemmatized = [lemmatize(lexica, word) for word in verbs]
    for word, lemma in zip(verbs, verbs_lemmatized): print('{} -> {}'.format(word, lemma))
    studies -> study
    crosses -> cross
    areas -> area
    gentlemen -> gentleman
    vertebrae -> vertebra
    alumni -> alumnus
    children -> child
    crises -> crisis
    applies -> apply
    cried -> cry
    pushes -> push
    entered -> enter
    takes -> take
    heard -> hear
    lying -> lie
    studying -> study
    taking -> take
    drawn -> draw
    clung -> cling
    was -> be
    bought -> buy
    from collections import Counter
    from src.tokenization import tokenize
    
    corpus = 'dat/emory-wiki.txt'
    delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}
    words = [lemmatize(lexica, word) for word in tokenize(corpus, delims)]
    counts = Counter(words)
    
    print(f'# of word tokens: {len(words)}')
    print(f'# of word types: {len(counts)}')
    
    output = 'dat/word_types-token-lemma.txt'
    with open(output, 'w') as fout:
        for key in sorted(counts.keys()): fout.write(f'{key}\n')
    # of word tokens: 363
    # of word types: 177

Frequency Analysis

Frequency analysis examines how often each word appears in a corpus, which helps us understand the patterns and structure of the language used in the text.

Word Counting

Consider the following text from Wikipedia about Emory University (as of 2023-10-18):

    Emory University is a private research university in Atlanta, Georgia. Founded in 1836 as Emory College by the Methodist Episcopal Church and named in honor of Methodist bishop John Emory.[18]
    
    Emory University has nine academic divisions. Emory Healthcare is the largest healthcare system in the state of Georgia[19] and comprises seven major hospitals, including Emory University Hospital and Emory University Hospital Midtown.[20] The university operates the Winship Cancer Institute, Yerkes National Primate Research Center, and many disease and vaccine research centers.[21][22] Emory University is the leading coordinator of the U.S. Health Department's National Ebola Training and Education Center.[23] The university is one of four institutions involved in the NIAID's Tuberculosis Research Units Program.[24] The International Association of National Public Health Institutes is headquartered at the university.[25]
    
    Emory University has the 15th-largest endowment among U.S. colleges and universities.[9] The university is classified among "R1: Doctoral Universities - Very high research activity"[26] and is cited for high scientific performance and citation impact in the CWTS Leiden Ranking.[27] The National Science Foundation ranked the university 36th among academic institutions in the United States for research and development (R&D) expenditures.[28][29] In 1995 Emory University was elected to the Association of American Universities, an association of the 65 leading research universities in the United States and Canada.[5]

    Emory faculty and alumni include 2 Prime Ministers, 9 university presidents, 11 members of the United States Congress, 2 Nobel Peace Prize laureates, a Vice President of the United States, a United States Speaker of the House, and a United States Supreme Court Justice. Other notable alumni include 21 Rhodes Scholars and 6 Pulitzer Prize winners, as well as Emmy Award winners, MacArthur Fellows, CEOs of Fortune 500 companies, heads of state and other leaders in foreign government.[30] Emory has more than 149,000 alumni, with 75 alumni clubs established worldwide in 20 countries.[31][32][33]
    

    Our task is to determine the number of word tokens and unique word types in this text.


    Q1: What is the difference between a word token and a word type?

    A simple way of accomplishing this task is to split the text by whitespace and count the resulting strings:

    • L1: Import the Counter class, a special type of dictionary, from the collections package.

    • L3: Use typing to indicate the parameter type (str) and the return type (Counter).

    • L4: open() the corpus file.

    • L5: read() the contents of the file as a string, split() it into a list of words, and store them in words.

    • L6: Count the occurrences of each word and return the results as a Counter.

    • L1: The corpus can be found at dat/emory-wiki.txt.

    • L4: Save the total number of word tokens in the corpus, which is the sum() of the counts values.

    • L5: Save the number of unique word types in the corpus, which is the len() of counts.

    • L7-8: Print the values using formatted string literals.


    When running this program, you may encounter FileNotFoundError. To fix this error, follow these steps to set up your working directory:

    1. Go to [Run] > [Edit Configurations] in the menu.

    2. Select "frequency_analysis" from the sidebar.

    3. Change the working directory to the top-level "nlp-essentials" directory.

    4. Click [OK] to save the changes.
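
    You can also verify the working directory from within Python; this quick check is supplementary and not part of the lecture code:

    import os

    # The working directory should be the top-level "nlp-essentials" directory so
    # that relative paths such as 'dat/emory-wiki.txt' resolve correctly.
    print(os.getcwd())
    print(os.path.exists('dat/emory-wiki.txt'))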

    Word Frequency

    In this task, we aim to retrieve the top-k most or least frequently occurring word types in the text:

    • L1: Sort the words in counts in descending order and save them into des as a list of (word, count) tuples, sorted from the most frequent to the least frequent words (sorted(), items(), lambda).

    • L2: Sort the words in counts in ascending order and save them into asc as a list of (word, count) tuples.

    • L4: Iterate over the top k most frequent words in the sorted list using slicing, and print each word along with its count.

    • L5: Iterate over the top k least frequent words in the sorted list and print each word along with its count.

    Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This occurs because the text was split only by whitespaces without considering punctuation. Consequently, these words are recognized separately from the word types "Atlanta" and "Georgia". Therefore, the counts of word tokens and types processed above do not necessarily represent the distributions of the text accurately.
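
    The effect is easy to reproduce with a throwaway example (illustrative only, not part of the lecture code):

    from collections import Counter

    # Splitting by whitespace keeps punctuation attached, so 'Atlanta,' and
    # 'Atlanta' are counted as two distinct word types.
    counts = Counter('Atlanta, Georgia is home ; Atlanta is growing'.split())
    print(counts['Atlanta'], counts['Atlanta,'])  # 1 1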


    Q2: How can we interpret the most frequent words in a text?

    Save Output

    Finally, let us save all word types in alphabetical order to a file:

    • L2: Open outfile in write mode (w).

    • L4: Iterate over the unique word types (keys) of counts in alphabetical order.

    • L5: Write each word followed by a newline character to fout.

    • L7: Close the output stream.

    • L1: Creates the dat/word_types.txt file if it does not exist; otherwise, its previous contents are completely overwritten.

    References

    1. Source: src/frequency_analysis.py

    2. Sequence Types, The Python Standard Library - Built-in Types

    3. Mapping Types, The Python Standard Library - Built-in Types

    4. Input and Output, The Python Tutorial

    corpus = 'dat/emory-wiki.txt'
    counts = count_words(corpus)
    
    n_tokens = sum(counts.values())
    n_types = len(counts)
    
    print(f'# of word tokens: {n_tokens}')
    print(f'# of word types: {n_types}')
    des = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    asc = sorted(counts.items(), key=lambda x: x[1])
    
    for word, count in des[:10]: print(word, count)
    for word, count in asc[:10]: print(word, count)
    save_output(counts, 'dat/word_types.txt')
    from collections import Counter
    
    def count_words(corpus: str) -> Counter:
        fin = open(corpus)
        words = fin.read().split()
        return Counter(words)
    def save_output(counts: Counter, outfile: str):
        fout = open(outfile, 'w')
    
        for word in sorted(counts.keys()):
            fout.write(f'{word}\n')
    
        fout.close()
    Configure the working directory.
    # of word tokens: 305
    # of word types: 180
    the 18
    and 15
    of 12
    Emory 11
    in 10
    University 7
    is 7
    university 6
    United 6
    research 5
    private 1
    Atlanta, 1
    Georgia. 1
    Founded 1
    1836 1
    College 1
    by 1
    Episcopal 1
    Church 1
    named 1

    Regular Expressions

    Regular expressions, commonly abbreviated as regex, form a language for string matching, enabling operations to search, match, and manipulate text based on specific patterns or rules.

    • Online Interpreter: Regular Expressions 101

    Core Syntax

    Metacharacters

    Regex provides metacharacters with specific meanings, making it convenient to define patterns:

    • .: any single character except a newline character

      e.g., M.\. matches "Mr." and "Ms.", but not "Mrs." (\ escapes the metacharacter .).

    • [ ]: a character set matching any character within the brackets

      e.g., [aeiou] matches any vowel.

    • \d: any digit, equivalent to [0-9]

      e.g., \d\d\d searches for "170" in "CS170".

    • \D: any character that is not a digit, equivalent to [^0-9]

      e.g., \D\D searches for "kg" in "100kg".

    • \s: any whitespace character, equivalent to [ \t\n\r\f\v]

      e.g., \s searches for the space " " in "Hello World".

    • \S: any character that is not a whitespace character, equivalent to [^ \t\n\r\f\v]

      e.g., \S searches for "H" in " Hello".

    • \w: any word character (alphanumeric or underscore), equivalent to [A-Za-z0-9_]

      e.g., \w\w searches for "1K" in "$1K".

    • \W: any character that is not a word character, equivalent to [^A-Za-z0-9_]

      e.g., \W searches for "!" in "Hello!".

    • \b: a word boundary matching the position between a word character and a non-word character

      e.g., \bis\b matches "is", but does not match "island" nor searches for "is" in "basis".
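
    A quick way to try these out is re.search(), which scans the whole string for a pattern; the calls below are illustrative and not part of the lecture code:

    import re

    # Each call returns a match object if the pattern is found, or None otherwise.
    print(re.search(r'M.\.', 'Mr. Wayne'))   # matches 'Mr.'
    print(re.search(r'\d\d\d', 'CS170'))     # matches '170'
    print(re.search(r'\bis\b', 'basis'))     # None: "is" is not a standalone word here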


    The terms "match" and "search" in the above examples have different meanings. "match" means that the pattern must be found at the beginning of the text, while "search" means that the pattern can be located anywhere in the text. We will discuss these two functions in more detail in the .

    Repetitions

    Repetitions allow you to define complex patterns that can match multiple occurrences of a character or group of characters:

    • *: the preceding character or group appears zero or more times

      e.g., \d* matches "90" in "90s" as well as "" (empty string) in "ABC".

    • +: the preceding character or group appears one or more times

      e.g., \d+ matches "90" in "90s", but no match in "ABC".

    • ?: the preceding character or group appears zero or once, making it optional

      e.g., https? matches both "http" and "https".

    • {m}: the preceding character or group appears exactly m times

      e.g., \d{3} is equivalent to \d\d\d.

    • {m,n}: the preceding character or group appears at least m times but no more than n times

      e.g., \d{2,4} matches "12", "123", "1234", but not "1" or "12345".

    • {m,}: the preceding character or group appears at least m times or more

      e.g., \d{2,} matches "12", "123", "1234", and "12345", but not "1".

    • By default, matches are "greedy" such that patterns match as many characters as possible

      e.g., <.+> matches the entire string of "<Hello> and <World>".

    • Matches become "lazy" by adding ? after the repetition metacharacters, in which case, patterns match as few characters as possible

      e.g., <.+?> matches "<Hello>" in "<Hello> and <World>", and searches for "<World>" in the text.
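
    The difference between greedy, lazy, and bounded repetition can be seen with a few throwaway calls (illustrative only, not part of the lecture code):

    import re

    print(re.search(r'<.+>', '<Hello> and <World>').group())    # greedy: '<Hello> and <World>'
    print(re.search(r'<.+?>', '<Hello> and <World>').group())   # lazy:   '<Hello>'
    print(re.findall(r'\d{2,4}', '1 12 12345'))                 # ['12', '1234']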

    Groupings

    Grouping allows you to treat multiple characters, subpatterns, or metacharacters as a single unit. It is achieved by placing these characters within parentheses ( and ).

    • |: a logical OR, referred to as a "pipe" symbol, allowing you to specify alternatives

      e.g., (cat|dog) matches either "cat" or "dog".

    • ( ): a capturing group; any text that matches the parenthesized pattern is "captured" and can be extracted or used in various ways

      e.g., (\w+)@(\w+.\w+) has two capturing groups, (\w+) and (\w+.\w+), and matches email addresses such as "john@emory.edu", where the first and second groups capture "john" and "emory.edu", respectively.

    • (?: ): a non-capturing group; any text that matches the parenthesized pattern, while indeed matched, is not "captured" and thus cannot be extracted or used in other ways

      e.g., (?:\w+)@(\w+.\w+) has one non-capturing group (?:\w+) and one capturing group (\w+.\w+). It still matches "john@emory.edu" but only captures "emory.edu", not "john".

    • \num: a backreference that refers back to the most recently matched text by the num'th capturing group within the same regex

      e.g., (\w+) (\w+) - (\2), (\1) has four capturing groups, where the third and fourth groups refer to the second and first groups, respectively. It matches "Jinho Choi - Choi, Jinho", where the first and fourth groups capture "Jinho" and the second and third groups capture "Choi".

    • You can nest groups within other groups to create more complex patterns

      e.g., (\w+.(edu|org)) has two capturing groups, where the second group is nested in the first group. It matches "emory.edu" or "emorynlp.org", where the first group captures the entire text while the second group captures "edu" or "org", respectively.
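
    A short, illustrative check of capturing groups and a backreference (not from the lecture code; the dot is escaped here for strictness):

    import re

    m = re.match(r'(\w+)@(\w+\.\w+)', 'john@emory.edu')
    print(m.groups())                    # ('john', 'emory.edu')

    m = re.match(r'(\w+) (\w+) - \2, \1', 'Jinho Choi - Choi, Jinho')
    print(m.group(1), m.group(2))        # Jinho Choi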

    Assertions

    Assertions define conditions that must be met for a match to occur. They do not consume characters in the input text but specify the position where a match should happen based on specific criteria.

    • A positive lookahead assertion (?= ) checks that a specific pattern is present immediately after the current position

      e.g., apple(?=[ -]pie) matches "apple" in "apple pie" or "apple-pie", but not in "apple juice".

    • A negative lookahead assertion (?! ) checks that a specific pattern is not present immediately after the current position

      e.g., do(?!(?: not|n't)) matches "do" in "do it" or "doing", but not in "do not" or "don't".

    • A positive look-behind assertion (?<= ) checks that a specific pattern is present immediately before the current position

      e.g., (?<=\$)\d+ matches "100" in "$100", but not in "100 dollars".

    • A negative look-behind assertion (?<! ) checks that a specific pattern is not present immediately before the current position

      e.g., (?<!not )(happy|sad) searches for "happy" in "I'm happy", but does not search for "sad" in "I'm not sad".

    • ^ asserts that the pattern following the caret must match at the beginning of the text

      e.g., not searches for "not" in "note" and "cannot", whereas ^not matches "not" in "note" but not in "cannot".

    • $ asserts that the pattern preceding the dollar sign must match at the end of the text

      e.g., not$ searches for "not" in "cannot" but not in "note".
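
    The lookaround assertions can be exercised with re.findall(); the calls below are illustrative and not part of the lecture code:

    import re

    print(re.findall(r'apple(?=[ -]pie)', 'apple pie, apple juice'))  # ['apple']
    print(re.findall(r'(?<=\$)\d+', 'pay $100 or 100 dollars'))       # ['100']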

    Functions

    Python provides several functions to make use of regular expressions.

    match()

    Let us create a regular expression that matches "Mr." and "Ms.":

    • L1: Import the re module (Regular expression operations).

    • L3: Create a regular expression re_mr (compile()). Note that a string indicated by an r prefix is considered a regular expression in Python.

      • r'M' matches the letter "M".

      • r'[rs]' matches either "r" or "s".

      • r'\.' matches a period (dot).

    • L4: Try to match re_mr at the beginning of the string "Mr. Wayne" (match()).

    • L6: Print the value of m. If matched, it prints the match object information; otherwise, m is None; thus, it prints "None".

    • L7: Check if a match was found (m is not None), and print the start position (start()) and end position (end()) of the match.

    group()

    Currently, no group has been specified for re_mr:

    • L1: Print a tuple of all captured groups (groups()); since re_mr currently defines no groups, it prints an empty tuple ().

    Let us capture the letters and the period as separate groups:

    • L1: The pattern re_mr is looking for the following:

      • 1st group: "M" followed by either "r" or "s".

      • 2nd group: a period (".").

    • L2: Match re_mr with the input string "Ms.".

    • L5: Print the entire matched string (group()).

    • L6: Print a tuple of all captured groups (groups()).

    • L7: Print specific groups by specifying their indexes. Group 0 is the entire match, group 1 is the first capture group, and group 2 is the second capture group.

    If the pattern does not find a match, it returns None.

    search()

    Let us match the following strings with re_mr:

    s1 matches "Mr." but not "Ms." while s2 does not match any pattern. It is because the match() function matches patterns only at the beginning of the string. To match patterns anywhere in the string, we need to use search() instead:

    findall()

    The search() function matches "Mr." in both s1 and s2 but still does not match "Ms.". To match them all, we need to use the findall() function:

    finditer()

    While the findall() function matches all occurrences of the pattern, it does not provide a way to locate the positions of the matched results in the string. To find the locations of the matched results, we need to use the finditer() function:

    sub()

    Finally, you can replace the matched results with another string by using the sub() function:

    Tokenization

    Finally, let us write a simple tokenizer using regular expressions. We will define a regular expression that matches the necessary patterns for tokenization:

    • L2: Create a regular expression to match delimiters and a special case:

      • Delimiters: '"', ',', '.', or whitespaces ('\s+').

      • The special case: "n't" (e.g., "can't").

    • L3: Create an empty list tokens to store the resulting tokens, and initialize prev_idx to keep track of the previous token's end position.

    • L5: Iterate over matches in text using the regular expression pattern.

      • L6: Extract the substring between the previous token's end and the current match's start, strip any leading or trailing whitespace, and assign it to t.

      • L7: If t is not empty (i.e., it is not just whitespace), add it to the tokens list.

      • L8: Extract the matched token from the match object, strip any leading or trailing whitespace, and assign it to t.

      • L10: If t is not empty (i.e., the pattern is matched):

        • L11-12: Check if the previous token in tokens is "Mr" or "Ms" and the current token is a period ("."), in which case, combine them into a single token.

        • L13-14: Otherwise, add t to tokens.

    • L18-19: After the loop, there might be some text left after the last token. Extract it, strip any leading or trailing whitespace, and add it to tokens.

    Test cases for the tokenizer:


    Q10: What are the benefits and limitations of using regular expressions for tokenization vs. the rule-based tokenization approach discussed in the previous section?

    References

    1. Source: regular_expressions.py

    2. Regular Expression HOWTO, Kuchling, HOWTOs in Python Documentation


    Text Processing

    Text processing refers to the manipulation and analysis of textual data through techniques applied to raw text, making it more structured, understandable, and suitable for various applications.

    Sections

    • Frequency Analysis

    • Tokenization

    • Lemmatization

    • Regular Expressions


    If you are not acquainted with Python programming, we strongly recommend going through all the examples in this section, as they provide detailed explanations of packages and functions commonly used for language processing.

    Homework

    HW1: Text Processing

    Task 1: Chronicles of Narnia

    Your goal is to extract and organize structured information from C.S. Lewis's Chronicles of Narnia series, focusing on book metadata and chapter statistics.

    Data Collection

    For each book, gather the following details:

    • Book Title (preserve exact spacing as shown in text)

    • Year of Publishing (indicated in the title)

    For each chapter within every book, collect the following information:

    • Chapter Number (as Arabic numeral)

    • Chapter Title

    • Token Count of the Chapter Content

      • Each word, punctuation mark, and symbol counts as a separate token.

      • The count begins after the chapter title and ends at the next chapter heading or book end. Do not include the chapter number and chapter title in the count.

    Implementation

    1. Download the chronicles_of_narnia.txt file and place it under the dat/ directory.

      • The text file is pre-tokenized using the ELIT Tokenizer.

      • Each token is separated by whitespace.

    2. Create a text_processing.py file in the src/homework/ directory.

    3. Define a function named chronicles_of_narnia() that takes a file path pointing to the text file and returns a dictionary structured as follows:

      • Books must be stored as key-value pairs in the main dictionary.

      • Chapters must be stored as lists within each book's dictionary.

      • Chapter lists must be sorted by chapter number in ascending order.

    {
      'The Lion , the Witch and the Wardrobe': {
        'title': 'The Lion , the Witch and the Wardrobe',
        'year': 1950,
        'chapters': [
          {
            'number': 1,
            'title': 'Lucy Looks into a Wardrobe',
            'token_count': 1915
          },
          {
            'number': 2,
            'title': 'What Lucy Found There',
            'token_count': 2887
          },
          ...
        ]
      },
      'Prince Caspian : The Return to Narnia': {
        'title': 'Prince Caspian : The Return to Narnia',
        'year': 1951,
        'chapters': [
          ...
        ]
      },
      ...
    }

    Task 2: Regular Expressions

    Define a function named regular_expressions() in src/homework/text_processing.py that takes a string and returns one of the four types, "email", "date", "url", or "cite", or None if nothing matches:

    • email

      • Format:

        • username@hostname.domain

      • Username and Hostname:

        • Can contain letters, numbers, period (.), underscore (_), hyphen (-).

        • Must start and end with a letter/number.

      • Domain:

        • Limited to com, org, edu, and gov.

    • date

      • Formats:

        • YYYY/MM/DD or YY/MM/DD

        • YYYY-MM-DD or YY-MM-DD

      • Year:

        • 4 digits: between 1951 and 2050

        • 2 digits: for 1951 - 2050

      • Month:

        • 1 - 12 (can be with/without leading zero)

      • Day:

        • 1 - 31 (can be with/without leading zero)

        • Must be valid for the given month.

    • url

      • Format:

        • protocol://address

      • Protocol:

        • http or https (only)

      • Address:

        • Can contain letters, hyphens, dots.

        • Must start with a letter/number.

        • Must include at least one dot.

    • cite

      • Formats:

        • Single author: Lastname, YYYY (e.g., Smith, 2023)

        • Two authors: Lastname 1 and Lastname 2, YYYY (e.g., Smith and Jones, 2023)

        • Multiple authors: Lastname 1 et al., YYYY (e.g., Smith et al., 2023)

      • Lastnames must be capitalized and can have multiple words.

      • Year must be between 1900 and 2024.

    Submission

    Commit and push the text_processing.py file to your GitHub repository.

    Rubric

    • Task 1: Chronicles of Narnia (7 points)

    • Task 2: Regular Expressions (3 points)

    • Concept Quiz (2 points)

    import re
    
    re_mr = re.compile(r'M[rs]\.')
    m = re_mr.match('Mr. Wayne')
    
    print(m)
    if m: print(m.start(), m.end())
    <re.Match object; span=(0, 3), match='Mr.'>
    0 3
    print(m.groups())
    ()
    re_mr = re.compile(r'(M[rs])(\.)')
    m = re_mr.match('Ms.')
    
    print(m)
    print(m.group())
    print(m.groups())
    print(m.group(0), m.group(1), m.group(2))
    <re.Match object; span=(0, 3), match='Ms.'>
    Ms.
    ('Ms', '.')
    Ms. Ms .
    m = re_mr.match('Mrs.')
    print(m)
    None
    s1 = 'Mr. and Ms. Wayne are here'
    s2 = 'Here are Mr. and Ms. Wayne'
    
    print(re_mr.match(s1))
    print(re_mr.match(s2))
    <re.Match object; span=(0, 3), match='Mr.'>
    None
    print(re_mr.search(s1))
    print(re_mr.search(s2))
    <re.Match object; span=(0, 3), match='Mr.'>
    <re.Match object; span=(9, 12), match='Mr.'>
    print(re_mr.findall(s1))
    print(re_mr.findall(s2))
    [('Mr', '.'), ('Ms', '.')]
    [('Mr', '.'), ('Ms', '.')]
    ms = re_mr.finditer(s1)
    for m in ms: print(m)
    
    ms = re_mr.finditer(s2)
    for m in ms: print(m)
    <re.Match object; span=(0, 3), match='Mr.'>
    <re.Match object; span=(8, 11), match='Ms.'>
    <re.Match object; span=(9, 12), match='Mr.'>
    <re.Match object; span=(17, 20), match='Ms.'>
    print(re_mr.sub('Dr.', 'I met Mr. Wayne and Ms. Kyle.'))
    I met Dr. Wayne and Dr. Kyle.
    def tokenize(text: str) -> list[str]:
        re_tok = re.compile(r'([",.]|\s+|n\'t)')
        tokens, prev_idx = [], 0
    
        for m in re_tok.finditer(text):
            t = text[prev_idx:m.start()].strip()
            if t: tokens.append(t)
            t = m.group().strip()
            
            if t:
                if tokens and tokens[-1] in {'Mr', 'Ms'} and t == '.':
                    tokens[-1] = tokens[-1] + t
                else:
                    tokens.append(t)
            
            prev_idx = m.end()
    
        t = text[prev_idx:]
        if t: tokens.append(t)
        return tokens
    text = 'Mr. Wayne isn\'t the hero we need, but "the one" we deserve.'
    print(tokenize(text))
    
    text = 'Ms. Wayne is "Batgirl" but not "the one".'
    print(tokenize(text))
    ['Mr.', 'Wayne', 'is', "n't", 'the', 'hero', 'we', 'need', ',', 'but', '"', 'the', 'one', '"', 'we', 'deserve', '.']
    ['Ms.', 'Wayne', 'is', '"', 'Batgirl', '"', 'but', 'not', '"', 'the', 'one', '"', '.']

    Tokenization

    Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.


    Q3: What is the difference between a word and a token?

    When examining the dat/word_types.txt file from the previous section, you notice several words that need further tokenization, many of which can be resolved by leveraging punctuation:

    • "R1: → ['"', "R1", ":"]

    • (R&D) → ['(', 'R&D', ')']

    • 15th-largest → ['15th', '-', 'largest']

    • Atlanta, → ['Atlanta', ',']

    • Department's → ['Department', "'s"]

    • activity"[26] → ['activity', '"', '[26]']

    • centers.[21][22] → ['centers', '.', '[21]', '[22]']


    Depending on the task, you may want to tokenize "[26]" into ['[', '26', ']'] for more generalization. In this case, however, we consider "[26]" as a unique identifier for the corresponding reference rather than as the number "26" surrounded by square brackets. Thus, we aim to recognize it as a single token.

    Delimiters

    Let us write the delimit() function that takes a word and a set of delimiters, and returns a list of tokens by splitting the word using the delimiters:

    • L1: Use type hints to indicate the parameter types (str and set of str) and the return type (list of str).

    • L2: Find the index of the first character in word that is in the set of delimiters (enumerate(), generator expression). If no delimiter is found in word, return -1 (next()).

    • L3: If no delimiter is found, return a list containing word as a single token.

    • L4: If a delimiter is found, create a list tokens to store the individual tokens.

    • L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.

    • L7: Add the delimiter itself as a separate token to tokens.

    • L9-10: If there are characters after the delimiter, call delimit() recursively on the remaining part of word and extend() the tokens list with the result.

    Let us define a set of delimiters and test delimit() using various input:

    • L1: Define the set of delimiters to be used for tokenization.

    • L17: Iterate over the two lists, input and output, in parallel using the zip() function.

    • L18: Print each word and its tokens using the format specification mini-language.


    Q4: All delimiters used in our implementation are punctuation marks. What types of tokens should not be split by such delimiters?

    Post-Processing

    When reviewing the output of delimit(), the first four test cases yield accurate results, while the last five are not handled properly, which should have been tokenized as follows:

    • Department's → ['Department', "'s"]

    • activity"[26] → ['activity', '"', '[26]']

    • centers.[21][22] → ['centers', '.', '[21]', '[22]']

    • 149,000 → ['149,000']

    • U.S. → ['U.S.']

    To handle these special cases, let us post-process the tokens generated by delimit():

    • L2: Initialize variables i for the current position and new_tokens for the resulting tokens.

    • L4: Iterate through the input tokens.

    • L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).

      • L6: Combine the apostrophe and "s" and append it as a single token.

      • L7: Move the position indicator by 1 to skip the next character.

    • L8-10: Case 2: Handling numbers in special formats like [##] and ###,### (e.g., [42], 12,345).

      • L11: Combine the special number format and append it as a single token.

      • L12: Move the position indicator by 2 to skip the next two characters.

    • L13: Case 3: Handling acronyms like "U.S.".

      • L14: Combine the acronym and append it as a single token.

      • L15: Move the position indicator by 3 to skip the next three characters.

    • L17: Case 4: If none of the special cases above are met, append the current token.

    • L18: Move the position indicator by 1 to process the next token.

    • L20: Return the list of processed tokens.

    Once the post-processing is applied, all outputs are now handled properly:


    Q5: Our tokenizer uses hard-coded rules to handle specific cases. What would be a scalable approach to handling more diverse cases?

    Tokenizing

    Finally, let us write tokenize() that takes a path to a corpus and a set of delimiters, and returns a list of tokens from the corpus:

    • L2: Read the corpus file.

    • L3: Split the text into words by whitespace.

    • L4: Tokenize each word in the corpus using the specified delimiters. postprocess() is used to process the special cases further. The resulting tokens are collected in a list and returned (list comprehension).

    Given the new tokenizer, let us recount word types in the corpus, dat/emory-wiki.txt, and save them:

    • L2: Import save_output() from the src.frequency_analysis module.

    • L13: Save the tokenized output to dat/word_types-token.txt.

    Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.


    Q6: The use of a more advanced tokenizer mitigates the issue of sparsity. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?

    References

    1. Source: src/tokenization.py

    2. ELIT Tokenizer - a heuristic-based tokenizer

    "R1: -> ['"', 'R1', ':']
    (R&D) -> ['(', 'R&D', ')']
    15th-largest -> ['15th', '-', 'largest']
    Atlanta, -> ['Atlanta', ',']
    Department's -> ['Department', "'", 's']
    activity"[26] -> ['activity', '"', '[', '26', ']']
    centers.[21][22] -> ['centers', '.', '[', '21', ']', '[', '22', ']']
    149,000 -> ['149', ',', '000']
    U.S. -> ['U', '.', 'S', '.']
    def delimit(word: str, delimiters: set[str]) -> list[str]:
        i = next((i for i, c in enumerate(word) if c in delimiters), -1)
        if i < 0: return [word]
        tokens = []
    
        if i > 0: tokens.append(word[:i])
        tokens.append(word[i])
    
        if i + 1 < len(word):
            tokens.extend(delimit(word[i + 1:], delimiters))
    
        return tokens
    delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}
    
    input = [
        '"R1:',
        '(R&D)',
        '15th-largest',
        'Atlanta,',
        "Department's",
        'activity"[26]',
        'centers.[21][22]',
        '149,000',
        'U.S.'
    ]
    
    output = [delimit(word, delims) for word in input]
    
    for word, tokens in zip(input, output):
        print('{:<16} -> {}'.format(word, tokens))
    def postprocess(tokens: list[str]) -> list[str]:
        i, new_tokens = 0, []
    
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == "'" and tokens[i + 1].lower() == 's':
                new_tokens.append(''.join(tokens[i:i + 2]))
                i += 1
            elif i + 2 < len(tokens) and \
                    ((tokens[i] == '[' and tokens[i + 1].isnumeric() and tokens[i + 2] == ']') or
                     (tokens[i].isnumeric() and tokens[i + 1] == ',' and tokens[i + 2].isnumeric())):
                new_tokens.append(''.join(tokens[i:i + 3]))
                i += 2
            elif i + 3 < len(tokens) and ''.join(tokens[i:i + 4]) == 'U.S.':
                new_tokens.append(''.join(tokens[i:i + 4]))
                i += 3
            else:
                new_tokens.append(tokens[i])
            i += 1
    
        return new_tokens
    output = [postprocess(delimit(word, delims)) for word in input]
    
    for word, tokens in zip(input, output):
        print('{:<16} -> {}'.format(word, tokens))
    "R1: -> ['"', 'R1', ':']
    (R&D) -> ['(', 'R&D', ')']
    15th-largest -> ['15th', '-', 'largest']
    Atlanta, -> ['Atlanta', ',']
    Department's -> ['Department', "'s"]
    activity"[26] -> ['activity', '"', '[26]']
    centers.[21][22] -> ['centers', '.', '[21]', '[22]']
    149,000 -> ['149,000']
    U.S. -> ['U.S.']
    def tokenize(corpus: str, delimiters: set[str]) -> list[str]:
        with open(corpus) as fin:
            words = fin.read().split()
        return [token for word in words for token in postprocess(delimit(word, delimiters))]
    from collections import Counter
    from src.frequency_analysis import save_output
    
    corpus = 'dat/emory-wiki.txt'
    output = 'dat/word_types-token.txt'
    
    words = tokenize(corpus, delims)
    counts = Counter(words)
    
    print(f'# of word tokens: {len(words)}')
    print(f'# of word types: {len(counts)}')
    
    save_output(counts, output)
    # of word tokens: 363
    # of word types: 197