Lemmatization
Sometimes, it is more appropriate to consider the canonical forms as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature for each year, you want to count both "transformer" and "transformers" as a single item.
Lemmatization is the task of reducing words to their base or dictionary forms, known as lemmas, which makes it easier to interpret their core meaning.
Q7: What is the difference between a lemmatizer and a stemmer [3]?
When analyzing dat/word_types-token.txt, generated by the tokenizer in the previous section, the following tokens are recognized as separate word types:
Universities
University
universities
university
The noun "university" appears in two variations: capitalization, generally used for proper nouns or sentence-initial words, and pluralization, which indicates multiple instances of the term. Verbs, on the other hand, can also take several variations for tense and aspect:
study
studies
studied
studying
We want to develop a lemmatizer that normalizes all variations into their respective lemmas.
Lemma Lexica
Let us create lexica for lemmatization:
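The loading code is available in the source file linked under References; below is a minimal sketch of what it might look like. The function name get_lexica() and the helpers read_words() and read_json() are assumptions for illustration, not the course's exact API.

```python
import json
import os
from types import SimpleNamespace


def get_lexica(res_dir: str) -> SimpleNamespace:
    """Loads all lexica under res_dir and bundles them into a SimpleNamespace."""
    def read_words(filename: str) -> set:
        # assumes one word per line in the plain-text word lists
        with open(os.path.join(res_dir, filename)) as fin:
            return {line.strip() for line in fin if line.strip()}

    def read_json(filename: str):
        with open(os.path.join(res_dir, filename)) as fin:
            return json.load(fin)

    return SimpleNamespace(
        nouns=read_words('nouns.txt'),
        verbs=read_words('verbs.txt'),
        nouns_irregular=read_json('nouns_irregular.json'),
        verbs_irregular=read_json('verbs_irregular.json'),
        nouns_rules=read_json('nouns_rules.json'),
        verbs_rules=read_json('verbs_rules.json'))
```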
L3: SimpleNamespace: a class whose members can be accessed as attributes, used to bundle all lexica into a single object.
L1: res_dir: the path to the root directory where all lexica files are located.
L2: nouns.txt: a list of base nouns
L3: verbs.txt: a list of base verbs
L4: nouns_irregular.json: a dictionary of nouns whose plural forms are irregular (e.g., mouse → mice)
L5: verbs_irregular.json: a dictionary of verbs whose inflection forms are irregular (e.g., buy → bought)
L6: nouns_rules.json: a list of pluralization rules for nouns
L7: verbs_rules.json: a list of inflection rules for verbs
We then verify that all lexical resources are loaded correctly:
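One quick way to check the loaded resources, assuming the lexica files sit under a res/ directory (the path is hypothetical), is to print their sizes and a few sample entries:

```python
lexica = get_lexica('res/')  # hypothetical path to the lexica directory

print(len(lexica.nouns), len(lexica.verbs))            # sizes of the base-word vocabularies
print(list(lexica.nouns_irregular.items())[:3])        # a few irregular noun mappings
print(lexica.nouns_rules[:3], lexica.verbs_rules[:3])  # a few pluralization/inflection rules
```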
Q8: What are the key differences between inflectional and derivational morphology?
Lemmatizing
Let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:
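A sketch of one possible implementation is given below, following the walkthrough that comes after it. It assumes the irregular dictionaries map an inflected form to its lemma and that the rule lists hold (suffix, replacement) pairs; both are assumptions about the lexica format rather than the course's exact schema.

```python
from types import SimpleNamespace


def lemmatize(word: str, lexica: SimpleNamespace) -> str:
    def aux(word: str, irregular: dict, rules: list, vocab: set):
        # irregular is assumed to map an inflected form to its lemma (e.g., "mice" -> "mouse")
        lemma = irregular.get(word)
        if lemma is not None:
            return lemma

        # each rule is assumed to be a (suffix, replacement) pair (e.g., ("ies", "y"))
        for suffix, replacement in rules:
            if word.endswith(suffix):
                lemma = word[:len(word) - len(suffix)] + replacement
                if lemma in vocab:
                    return lemma

        return None

    word = word.lower()
    lemma = aux(word, lexica.verbs_irregular, lexica.verbs_rules, lexica.verbs)

    if lemma is None:
        lemma = aux(word, lexica.nouns_irregular, lexica.nouns_rules, lexica.nouns)

    return lemma if lemma is not None else word
```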
L2: Define a nested function aux to handle lemmatization.
L3-4: Check if the word is in the irregular dictionary (get()); if so, return its lemma.
L6-7: Try applying each rule in the rules list to word.
L8: If the resulting lemma is in the vocabulary, return it.
L10: If no lemma is found, return None.
L12: Convert the input word to lowercase for case-insensitive processing.
L13: Try to lemmatize the word using the verb-related lexica.
L15-16: If no lemma is found among verbs, try to lemmatize using the noun-related lexica.
L18: Return the lemma if found, or the decapitalized word if no lemmatization occurred.
We now test our lemmatizer for nouns and verbs:
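For example, running it on the variations discussed above (the exact test words in the course material may differ) might look like this:

```python
for word in ['Universities', 'universities', 'University', 'mice',
             'studies', 'studied', 'studying', 'bought']:
    print(f'{word} -> {lemmatize(word, lexica)}')
```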
Finally, let us recount word types in dat/emory-wiki.txt using the lemmatizer and save them to dat/word_types-token-lemma.txt:
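A sketch of this step is shown below; it assumes tokenize() takes a file path and returns a list of word tokens, as in the previous section (the exact signature may differ):

```python
from src.tokenization import tokenize

# lemmatize every token, then collect the distinct word types
words = [lemmatize(word, lexica) for word in tokenize('dat/emory-wiki.txt')]
word_types = sorted(set(words))

with open('dat/word_types-token-lemma.txt', 'w') as fout:
    fout.write('\n'.join(word_types))

print(f'# of word tokens: {len(words)}')
print(f'# of word types:  {len(word_types)}')
```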
L2: Import the tokenize() function from the src/tokenization.py module.
When the words are further normalized by lemmatization, the number of word tokens stays the same as before, but the number of word types is reduced from 197 to 177.
Q9: In which tasks can lemmatization negatively impact performance?
References
Source: lemmatization.py
ELIT Morphological Analyzer - A heuristic-based lemmatizer
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program: Electronic Library and Information Systems, 14(3). (PDF)