Lemmatization

Update: 2023-10-31

Sometimes, it is more appropriate to treat canonical forms as tokens rather than their variations. For example, to analyze how the word "transformer" has been used in the NLP literature over the years, you want to count both "transformer" and "transformers" as the same item.

Lemmatization is the task of reducing words to their base or dictionary forms, known as lemmas, which makes their core meaning easier to interpret.

What is the difference between a lemmatizer and a stemmer [1]?
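In short, a stemmer strips suffixes with fixed rules and may return strings that are not actual words, whereas a lemmatizer maps a word to a valid dictionary form. The toy sketch below illustrates the contrast; its rules and vocabulary are made up for this example and do not implement the actual Porter algorithm [1]:

def toy_stem(word: str) -> str:
    # Blindly strip a suffix, stemmer-style; the result need not be a real word.
    for suffix in ('ies', 'es', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word: str, vocab: set[str]) -> str:
    # Replace a suffix only if the result is a known dictionary form.
    for suffix, repl in (('ies', 'y'), ('es', ''), ('s', '')):
        if word.endswith(suffix):
            lemma = word[:-len(suffix)] + repl
            if lemma in vocab:
                return lemma
    return word

print(toy_stem('universities'))                       # universit
print(toy_lemmatize('universities', {'university'}))  # university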

When analyzing the word types obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:

  • Universities

  • University

  • universities

  • university

The noun "university" appears in two kinds of variation: capitalization, generally used for proper nouns or sentence-initial words, and pluralization, which indicates multiple instances of the term; as the snippet after the following list shows, lowercasing alone handles only the former. Verbs, in turn, can take several forms depending on tense and aspect:

  • study

  • studies

  • studied

  • studying
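As a quick illustration (this snippet is not part of the chapter's code), lowercasing merges the capitalization variants but leaves the inflected forms as separate types:

# Lowercasing collapses capitalization variants but not inflections.
word_types = {'Universities', 'University', 'universities', 'university'}
print(len({w.lower() for w in word_types}))  # 2: {'universities', 'university'}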

get_lemma_lexica()

We want to develop a lemmatizer that normalizes all variations into their respective lemmas. Let us start by creating lexica for lemmatization:

import json
from types import SimpleNamespace

def get_lemma_lexica() -> SimpleNamespace:
    return SimpleNamespace(
        nouns={noun.strip() for noun in open('dat/text_processing/nouns.txt')},
        verbs={verb.strip() for verb in open('dat/text_processing/verbs.txt')},
        nouns_irregular=json.load(open('dat/text_processing/nouns_irregular.json')),
        verbs_irregular=json.load(open('dat/text_processing/verbs_irregular.json')),
        nouns_rules=json.load(open('dat/text_processing/nouns_rules.json')),
        verbs_rules=json.load(open('dat/text_processing/verbs_rules.json'))
    )
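If the files under dat/text_processing/ are not available, the shape of each resource is easy to mock: the *.txt vocabularies contain one word per line, the *_irregular.json files map a surface form to its lemma, and the *_rules.json files list suffix-replacement pairs. The stand-in below is purely hypothetical and much smaller than the actual course data:

# Hypothetical stand-in for get_lemma_lexica(), useful for testing without the data files.
# The entries below are illustrative; the real lexica are read from dat/text_processing/.
def get_mock_lemma_lexica() -> SimpleNamespace:
    return SimpleNamespace(
        nouns={'university', 'study', 'area', 'cross'},
        verbs={'study', 'apply', 'take', 'enter'},
        nouns_irregular={'children': 'child', 'crises': 'crisis'},
        verbs_irregular={'was': 'be', 'bought': 'buy'},
        nouns_rules=[['ies', 'y'], ['es', ''], ['s', '']],
        verbs_rules=[['ies', 'y'], ['ied', 'y'], ['ing', ''], ['ed', ''], ['es', ''], ['s', '']]
    )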

lemmatize()

Next, let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:

def lemmatize(word: str, lexica: SimpleNamespace) -> str:
    def aux(word: str, vocabs: set[str], irregular: dict[str, str], rules: list[tuple[str, str]]) -> str | None:
        lemma = irregular.get(word, None)
        if lemma is not None: return lemma

        for p, s in rules:
            lemma = word[:-len(p)] + s
            if lemma in vocabs: return lemma

        return None

    word = word.lower()
    lemma = aux(word, lexica.verbs, lexica.verbs_irregular, lexica.verbs_rules)

    if lemma is None:
        lemma = aux(word, lexica.nouns, lexica.nouns_irregular, lexica.nouns_rules)

    return lemma if lemma else word

  • L2: Define a nested function aux to handle lemmatization.

    • L3-4: Check whether the word is in the irregular dictionary (via get()); if so, return its lemma.

    • L6-7: Try applying each rule in the rules list to word.

    • L8: If the resulting lemma is in the vocabulary, return it.

    • L10: If no lemma is found, return None.

  • L12: Convert the input word to lowercase for case-insensitive processing.

  • L13: Try to lemmatize the word using verb-related lexica.

  • L15-16: If no lemma is found among verbs, try to lemmatize using noun-related lexica.

  • L18: Return the lemma if one is found; otherwise, return the lowercased word.
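Note that the verb lexica are consulted before the noun lexica, so an ambiguous form is resolved as a verb first. The trace below, run against the hypothetical mock lexica sketched earlier, walks through the three possible outcomes:

# Worked trace using the mock lexica (hypothetical data, for illustration only).
lexica = get_mock_lemma_lexica()
print(lemmatize('studies', lexica))   # verbs_irregular: miss; rule ('ies', 'y') -> 'study', found in verbs
print(lemmatize('children', lexica))  # verbs: no match; nouns_irregular maps it to 'child'
print(lemmatize('Emory', lexica))     # no lemma found -> the lowercased input 'emory' is returned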

We now test our lemmatizer for nouns and verbs:

lemma_lexica = get_lemma_lexica()

test_nouns = ['studies', 'crosses', 'areas', 'gentlemen', 'vertebrae', 'alumni', 'children', 'crises']
for word in test_nouns: print('{} -> {}'.format(word, lemmatize(word, lemma_lexica)))

test_verbs = ['applies', 'cried', 'pushes', 'entered', 'takes', 'heard', 'lying', 'studying', 'taking', 'drawn', 'clung', 'was', 'bought']
for word in test_verbs: print('{} -> {}'.format(word, lemmatize(word, lemma_lexica)))

At last, let us recount word types in emory-wiki.txt using the lemmatizer and save them:

from collections import Counter
from src.tokenization import tokenize, DELIMITERS

corpus = 'dat/text_processing/emory-wiki.txt'
words = [lemmatize(word, lemma_lexica) for word in tokenize(corpus, DELIMITERS)]
word_counts = Counter(words)

print('# of word tokens: {}'.format(len(words)))
print('# of word types: {}'.format(len(word_counts)))

fout = open('dat/text_processing/word_types-token-lemma.txt', 'w')
for key in sorted(word_counts.keys()): fout.write('{}\n'.format(key))
fout.close()

When the words are further normalized by lemmatization, the number of word tokens stays the same, but the number of word types decreases from 197 to 177.
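To inspect exactly which surface forms were merged, we can group the original tokens by their lemmas and print the groups containing more than one form; this follow-up snippet is illustrative and not part of the saved output:

from collections import defaultdict

# Group the original surface forms by their lemma.
groups = defaultdict(set)
for token in tokenize(corpus, DELIMITERS):
    groups[lemmatize(token, lemma_lexica)].add(token)

# Print only the lemmas that merged multiple surface forms.
for lemma, forms in sorted(groups.items()):
    if len(forms) > 1:
        print('{}: {}'.format(lemma, sorted(forms)))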

In which tasks can lemmatization negatively impact performance?

Source: lemmatization.py

References

  1. Martin F. Porter. An Algorithm for Suffix Stripping. Program: Electronic Library and Information Systems, 14(3), 1980.

  2. ELIT Morphological Analyzer, a heuristic-based lemmatizer.
