Lemmatization

Sometimes it is more appropriate to treat canonical forms as tokens instead of their variations. For example, if you want to analyze the usage of the word "transformer" in NLP literature by year, you want to count both "transformer" and "transformers" as a single item.

Lemmatization is the task of reducing words to their base or dictionary forms, known as lemmas, which makes their core meanings easier to interpret.

Q7: What is the difference between a lemmatizer and a stemmer [3]?
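
Not part of the course materials: as a quick point of comparison for Q7, the sketch below (assuming NLTK and its WordNet data are installed) contrasts a heuristic stemmer with a dictionary-based lemmatizer.

from nltk.stem import PorterStemmer, WordNetLemmatizer  # assumes: pip install nltk; nltk.download('wordnet')

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ['studies', 'universities']:
    # A stemmer strips suffixes heuristically and may return non-words (e.g., 'studi'),
    # whereas a lemmatizer maps the word to its dictionary form (e.g., 'study').
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))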

When analyzing the word types in dat/word_types-token.txt obtained by the tokenizer in the previous section, the following tokens are recognized as separate word types:

  • Universities

  • University

  • universities

  • university

Two variations are applied to the noun "university": capitalization, generally used for proper nouns or sentence-initial words, and pluralization, which indicates multiple instances of the term. Verbs, on the other hand, can take several variations regarding tense and aspect:

  • study

  • studies

  • studied

  • studying

We want to develop a lemmatizer that normalizes all variations into their respective lemmas.

Lemma Lexica

Let us create lexica for lemmatization:

import json
import os
from types import SimpleNamespace

def get_lexica(res_dir: str) -> SimpleNamespace:
    # Word lists: one base form per line.
    with open(os.path.join(res_dir, 'nouns.txt')) as fin: nouns = {noun.strip() for noun in fin}
    with open(os.path.join(res_dir, 'verbs.txt')) as fin: verbs = {verb.strip() for verb in fin}
    # Irregular forms: dictionaries mapping an inflected form to its lemma.
    with open(os.path.join(res_dir, 'nouns_irregular.json')) as fin: nouns_irregular = json.load(fin)
    with open(os.path.join(res_dir, 'verbs_irregular.json')) as fin: verbs_irregular = json.load(fin)
    # Suffix rules: lists of [suffix, replacement] pairs.
    with open(os.path.join(res_dir, 'nouns_rules.json')) as fin: nouns_rules = json.load(fin)
    with open(os.path.join(res_dir, 'verbs_rules.json')) as fin: verbs_rules = json.load(fin)

    return SimpleNamespace(
        nouns=nouns,
        verbs=verbs,
        nouns_irregular=nouns_irregular,
        verbs_irregular=verbs_irregular,
        nouns_rules=nouns_rules,
        verbs_rules=verbs_rules
    )
  • res_dir: the path to the root directory where all lexica files are located.

  • json: JSON encoder and decoder, used to load the .json lexica.

  • os.path: common pathname manipulations, used here for os.path.join().

  • SimpleNamespace: a lightweight container that exposes its fields as attributes (e.g., lexica.nouns).

  • nouns.txt: a list of base nouns.

  • verbs.txt: a list of base verbs.

  • nouns_irregular.json: a dictionary of nouns whose plural forms are irregular (e.g., mouse → mice).

  • verbs_irregular.json: a dictionary of verbs whose inflection forms are irregular (e.g., buy → bought).

  • nouns_rules.json: a list of pluralization rules for nouns.

  • verbs_rules.json: a list of inflection rules for verbs (a toy example of these file formats follows this list).
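
If the course resources are not at hand, the following sketch creates minimal versions of the six files so that get_lexica() above runs end to end. It is purely illustrative: the helper functions and the scratch directory dat/toy_lexica are assumptions, and the file contents are not the actual course lexica.

import json
import os

res_dir = 'dat/toy_lexica'  # assumed scratch directory, not the course resource directory
os.makedirs(res_dir, exist_ok=True)

def write_text(name: str, words: list[str]):
    # One base form per line, matching the nouns.txt / verbs.txt format.
    with open(os.path.join(res_dir, name), 'w') as fout:
        fout.write('\n'.join(words) + '\n')

def write_json(name: str, obj):
    with open(os.path.join(res_dir, name), 'w') as fout:
        json.dump(obj, fout)

write_text('nouns.txt', ['study', 'university'])
write_text('verbs.txt', ['study', 'be'])
write_json('nouns_irregular.json', {'mice': 'mouse'})      # irregular form -> lemma
write_json('verbs_irregular.json', {'was': 'be'})
write_json('nouns_rules.json', [['ies', 'y'], ['s', '']])  # [suffix, replacement] pairs
write_json('verbs_rules.json', [['ies', 'y'], ['ing', '']])

print(sorted(get_lexica(res_dir).nouns))  # ['study', 'university']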

We then load the lexica and verify that all resources are read correctly (the directory passed to get_lexica() below is an assumption; adjust it to wherever your lexica files reside):

lexica = get_lexica('dat')  # assumed location of the lexica files
print(len(lexica.nouns))
print(len(lexica.verbs))
print(lexica.nouns_irregular)
print(lexica.verbs_irregular)
print(lexica.nouns_rules)
print(lexica.verbs_rules)
91
27
{'children': 'child', 'crises': 'crisis', 'mice': 'mouse'}
{'is': 'be', 'was': 'be', 'has': 'have', 'had': 'have', 'bought': 'buy'}
[['ies', 'y'], ['es', ''], ['s', ''], ['men', 'man'], ['ae', 'a'], ['i', 'us']]
[['ies', 'y'], ['ied', 'y'], ['es', ''], ['ed', ''], ['s', ''], ['d', ''], ['ying', 'ie'], ['ing', ''], ['ing', 'e'], ['n', ''], ['ung', 'ing']]

Q8: What are the key differences between inflectional and derivational morphology?

Lemmatizing

Let us write the lemmatize() function that takes a word and lemmatizes it using the lexica:

def lemmatize(lexica: SimpleNamespace, word: str) -> str:
    def aux(word: str, vocabs: set[str], irregular: dict[str, str], rules: list[tuple[str, str]]):
        # Irregular forms are looked up directly (e.g., "children" -> "child").
        lemma = irregular.get(word, None)
        if lemma is not None: return lemma

        # Otherwise, try each suffix rule: strip the suffix p and append the substitute s.
        for p, s in rules:
            if not word.endswith(p): continue  # skip rules whose suffix does not match
            lemma = word[:-len(p)] + s
            # Accept the candidate only if it is a known base form.
            if lemma in vocabs: return lemma

        return None

    word = word.lower()
    lemma = aux(word, lexica.verbs, lexica.verbs_irregular, lexica.verbs_rules)

    if lemma is None:
        lemma = aux(word, lexica.nouns, lexica.nouns_irregular, lexica.nouns_rules)

    return lemma if lemma else word
  • The nested function aux handles lemmatization against a given vocabulary (vocabs), irregular dictionary (irregular), and rule list (rules).

  • aux first checks whether the word is in the irregular dictionary using get(); if so, its lemma is returned immediately.

  • Otherwise, each rule in rules is tried by replacing the suffix p of word with s (a short trace of one rule follows this list).

  • If the resulting lemma is found in the vocabulary, it is returned.

  • If no rule produces a known lemma, aux returns None.

  • lemmatize() first converts the input word to lowercase for case-insensitive processing.

  • It then tries to lemmatize the word using the verb-related lexica.

  • If no lemma is found among the verbs, it tries the noun-related lexica.

  • Finally, it returns the lemma if one was found, or the lowercased word if no lemmatization occurred.
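
As a standalone trace of the rule mechanism (the rule below is the first entry of nouns_rules.json shown earlier; the word comes from the test cases):

word, (p, s) = 'studies', ('ies', 'y')  # suffix p is replaced by substitute s
assert word.endswith(p)                 # a rule fires only when the suffix matches
candidate = word[:-len(p)] + s          # 'stud' + 'y' -> 'study'
print(candidate)                        # study: accepted because it is a known base form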

We now test our lemmatizer for nouns and verbs:

nouns = ['studies', 'crosses', 'areas', 'gentlemen', 'vertebrae', 'alumni', 'children', 'crises']
nouns_lemmatized = [lemmatize(lexica, word) for word in nouns]
for word, lemma in zip(nouns, nouns_lemmatized): print('{} -> {}'.format(word, lemma))

verbs = ['applies', 'cried', 'pushes', 'entered', 'takes', 'heard', 'lying', 'studying', 'taking', 'drawn', 'clung', 'was', 'bought']
verbs_lemmatized = [lemmatize(lexica, word) for word in verbs]
for word, lemma in zip(verbs, verbs_lemmatized): print('{} -> {}'.format(word, lemma))
studies -> study
crosses -> cross
areas -> area
gentlemen -> gentleman
vertebrae -> vertebra
alumni -> alumnus
children -> child
crises -> crisis
applies -> apply
cried -> cry
pushes -> push
entered -> enter
takes -> take
heard -> hear
lying -> lie
studying -> study
taking -> take
drawn -> draw
clung -> cling
was -> be
bought -> buy
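
Words that are neither nouns nor verbs in the lexica fall through both aux() calls and come back unchanged except for lowercasing; for example (assuming "stronger" appears in neither word list):

print(lemmatize(lexica, 'Stronger'))  # stronger: no lemma found, so the lowercased word is returned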

Finally, let us recount the word types in dat/emory-wiki.txt using the lemmatizer together with the tokenize() function imported from the src/tokenization.py module, and save them to dat/word_types-token-lemma.txt:

from collections import Counter
from src.tokenization import tokenize

corpus = 'dat/emory-wiki.txt'
delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}
words = [lemmatize(lexica, word) for word in tokenize(corpus, delims)]
counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(counts)}')

output = 'dat/word_types-token-lemma.txt'
with open(output, 'w') as fout:
    for key in sorted(counts.keys()): fout.write(f'{key}\n')
# of word tokens: 363
# of word types: 177

When the words are further normalized by lemmatization, the number of word tokens (363) remains the same as without lemmatization, but the number of word types drops from 197 to 177.
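
To see exactly which word types were merged, the two word-type files can be compared directly; this sketch assumes that dat/word_types-token.txt from the previous section and dat/word_types-token-lemma.txt above were both written one word type per line:

with open('dat/word_types-token.txt') as fin:
    before = {line.strip() for line in fin if line.strip()}  # word types before lemmatization
with open('dat/word_types-token-lemma.txt') as fin:
    after = {line.strip() for line in fin if line.strip()}   # word types after lemmatization

print(len(before), len(after))      # expected: 197 177
print(sorted(before - after)[:10])  # sample of types that were collapsed into their lemmas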

Q9: In which tasks can lemmatization negatively impact performance?

References

Source: lemmatization.py

  • ELIT Morphological Analyzer: a heuristic-based lemmatizer.

  • An Algorithm for Suffix Stripping, Porter, Program: Electronic Library and Information Systems, 14(3), 1980 (PDF).