N-gram Models

An n-gram is a contiguous sequence of n items from text data. These items can be:

  • Words (most common in language modeling)

  • Characters (useful for morphological analysis)

  • Subword tokens (common in modern NLP)

  • Bytes or other units

For the sentence "I'm a computer scientist.", we can extract different n-grams:

  • 1-gram (unigram): {"I'm", "a", "computer", "scientist."}

  • 2-gram (bigram): {"I'm a", "a computer", "computer scientist."}

  • 3-gram (trigram): {"I'm a computer", "a computer scientist."}

In the above example, "I'm" and "scientist." are recognized as single tokens, whereas they should have been tokenized as ["I", "'m"] and ["scientist", "."], respectively.

Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?
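To make the definition concrete, below is a minimal sketch of a generic n-gram extractor (the function extract_ngrams() is not part of the course code and assumes the text has already been split into tokens):

def extract_ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of size n over the token list and collect each span.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(extract_ngrams(["I'm", 'a', 'computer', 'scientist.'], 2))
# [("I'm", 'a'), ('a', 'computer'), ('computer', 'scientist.')]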

Unigram Estimation

Given a large corpus, a unigram model assumes word independence and calculates the probability of each word $w_i$ as:

$$P(w_i) = \frac{\#(w_i)}{\sum_{\forall w_k \in V} \#(w_k)}$$

Where:

  • $\#(w_i)$: the count of word $w_i$ in the corpus.

  • $V$: the vocabulary (the set of all unique words).

  • The denominator represents the total word count in the corpus.
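For example, in a hypothetical corpus of 1,000,000 word tokens where "Aslan" occurs 1,850 times, the unigram probability would be $P(\text{Aslan}) = 1850 / 1000000 = 0.00185$.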

Let us define the Unigram type:

from typing import TypeAlias

Unigram: TypeAlias = dict[str, float]
  • L1: Import the TypeAlias type from the typing module.

  • L3: A dictionary where the key is a unigram and the value is its probability.

Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:

from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()

    for line in open(filepath):
        words = line.split()
        unigrams.update(words)

    return unigrams

We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:

def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
  • L3: Calculate the total count of all unigrams in the text.

  • L4: Return a dictionary where each word is a key and its probability is the value.

Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file:

from collections.abc import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]

    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>10} {prob:.6f}')
  • L3: The second argument accepts a function that takes a string and returns a Unigram.

  • L4: Call the estimator with the text file and store the result in unigrams.

  • L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.

  • L7: Iterate through the top 300 unigrams with the highest probabilities.

  • L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).

corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)
  • L2: Pass the unigram_estimation() function as the second argument.

         I 0.010543
     Aslan 0.001850
      Lucy 0.001815
    Edmund 0.001409
    Narnia 0.001379
   Caspian 0.001338
      Jill 0.001262
     Peter 0.001034
    Shasta 0.000928
    Digory 0.000925
   Eustace 0.000877
     Susan 0.000654
    Tirian 0.000601
     Polly 0.000547
    Aravis 0.000537
      Bree 0.000492
Puddleglum 0.000492
    Scrubb 0.000482
    Andrew 0.000406

This distribution shows the most common unigrams in the text meeting the conditions in L8: it is dominated by the first-person pronoun "I", followed by proper nouns, specifically character names such as "Aslan", "Lucy", and "Edmund".

Q2: What advantages do unigram probabilities have over word frequencies?

Bigram Estimation

A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows:

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{\forall w_k \in V_i} \#(w_{i-1}, w_k)}$$

Where:

  • $\#(w_{i-1}, w_i)$: the total occurrences of $(w_{i-1}, w_i)$ in the corpus in that order.

  • $V_i$: the set of all word types occurring after $w_{i-1}$.
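For example, if in a hypothetical corpus the word "said" occurs 2,000 times as a previous word and is followed by "Lucy" 115 times, then $P(\text{Lucy} \mid \text{said}) = 115 / 2000 = 0.0575$.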

Let us define the Bigram type:

Bigram: TypeAlias = dict[str, Unigram | float]
  • L1: A dictionary where the key is $w_{i-1}$ and the value is a nested dictionary representing the unigram distribution of all $w_i$ given $w_{i-1}$, or a smoothing probability for $w_{i-1}$.

Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:

from collections import Counter, defaultdict
from src.types import Bigram

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])

    return bigrams
  • L1: Import the Counter and defaultdict classes from the collections package.

  • L2: Import the Bigram type alias from the src.types package.

  • L5: Create a defaultdict with Counters as default values to store bigram frequencies.

  • L9: Iterate through the words, starting from the second word (index 1) in each line.

  • L10: Update the frequency of the current bigram.

We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:

from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams
  • L8: Calculate the total count of all bigrams with the same previous word.

  • L9: Calculate and store the probabilities of each current word given the previous word.
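As a quick sanity check (not part of the original code), the conditional probabilities associated with any previous word should sum to one, since they are normalized by that word's total bigram count:

bigrams = bigram_estimation('dat/chronicles_of_narnia.txt')
print(sum(bigrams['said'].values()))
# expected to print (approximately) 1.0, assuming 'said' occurs in the corpus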

Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:

def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    for prev in ['I', 'the', 'said']:
        print(prev)
        bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
        for curr, prob in bigram_list[:10]:
            print("{:>10} {:.6f}".format(curr, prob))
  • L2: Call the bigram_estimation() function with the text file and store the result.

  • L5: Create a bigram list given the previous word, sorted by probability in descending order.

  • L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.

test_bigram(corpus, bigram_estimation)
I
        'm 0.081628
        do 0.075849
       've 0.044065
       was 0.041897
      have 0.038045
        am 0.035878
       'll 0.032507
     think 0.032025
        'd 0.026246
      know 0.025765
the
      same 0.014846
     other 0.013405
      King 0.012528
     Witch 0.011776
     whole 0.009020
    others 0.008958
     first 0.008770
     Dwarf 0.008582
      door 0.008519
     great 0.008519
said
       the 0.157635
         , 0.073645
      Lucy 0.057635
    Edmund 0.045074
   Caspian 0.040394
     Peter 0.039409
      Jill 0.034975
         . 0.034729
    Digory 0.031034
     Aslan 0.030049
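
As an illustration of how these conditional probabilities can be used (a minimal sketch, not part of the course code), a tokenized sentence can be scored with the chain rule by multiplying the bigram probability of each word given its predecessor. The sketch assumes the Bigram alias and the unsmoothed bigram_estimation() defined above, and it simply skips unseen bigrams, which is the gap that smoothing (next section) addresses:

def sentence_probability(sentence: list[str], bigrams: Bigram) -> float:
    prob = 1.0
    for prev, curr in zip(sentence, sentence[1:]):
        # Multiply in P(curr | prev) only if the bigram was observed in the corpus.
        p = bigrams.get(prev, {}).get(curr)
        if p is not None:
            prob *= p
    return prob

bigrams = bigram_estimation('dat/chronicles_of_narnia.txt')
print(sentence_probability(['said', 'the', 'King'], bigrams))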

Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?

References

Source: ngram_models.py

N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.
