NLP Essentials
Language Models

Smoothing


Unigram Smoothing

The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.

Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds a count of 1 to every word and recalculates the unigram probabilities:

$$P_{\mathcal{L}}(w_i) = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} (\#(w_k) + 1)} = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}$$

Thus, the probability of any unknown word $w_*$ with Laplace smoothing is calculated as follows:

$$P_{\mathcal{L}}(w_*) = \frac{1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}$$

The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.

Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

$$\sum_{i=1}^{v} P(w_i) = \sum_{i=1}^{v} P_{\mathcal{L}}(w_i) = 1$$
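As a quick worked example (a toy setup, not drawn from the corpus used below), suppose the vocabulary contains only two word types $a$ and $b$ with counts 3 and 1, so the total count is 4 and $|V| = 2$:

$$P_{\mathcal{L}}(a) = \frac{3 + 1}{4 + 2} = \frac{2}{3}, \qquad P_{\mathcal{L}}(b) = \frac{1 + 1}{4 + 2} = \frac{1}{3}, \qquad P_{\mathcal{L}}(w_*) = \frac{1}{4 + 2} = \frac{1}{6}$$

The smoothed probabilities of the two known words still sum to 1, and any unseen word receives $1/6$, which is lower than the probability of either known word.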

Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

from src.ngram_models import unigram_count, Unigram

UNKNOWN = ''

def unigram_smoothing(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values()) + len(counts)
    unigrams = {word: (count + 1) / total for word, count in counts.items()}
    unigrams[UNKNOWN] = 1 / total
    return unigrams
  • L1: Import the unigram_count() function from the src.ngram_models package.

  • L3: Define a constant representing the unknown word.

  • L7: Increment the total count by the vocabulary size.

  • L8: Increment each unigram count by 1.

  • L9: Add the unknown word to the unigrams with a probability of 1 divided by the total count.

We then test unigram_smoothing() with the text file dat/chronicles_of_narnia.txt:

from src.ngram_models import test_unigram

corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_smoothing)
  • L1: Import the test_unigram() function from the src.ngram_models package.

         I 0.010225
     Aslan 0.001796
      Lucy 0.001762
    Edmund 0.001369
    Narnia 0.001339
   Caspian 0.001300
      Jill 0.001226
     Peter 0.001005
    Shasta 0.000902
    Digory 0.000899
   Eustace 0.000853
     Susan 0.000636
    Tirian 0.000585
     Polly 0.000533
    Aravis 0.000523
      Bree 0.000479
Puddleglum 0.000479
    Scrubb 0.000469
    Andrew 0.000396

   Unigram  With Smoothing   W/O Smoothing
         I        0.010225        0.010543
     Aslan        0.001796        0.001850
      Lucy        0.001762        0.001815
    Edmund        0.001369        0.001409
    Narnia        0.001339        0.001379
   Caspian        0.001300        0.001338
      Jill        0.001226        0.001262
     Peter        0.001005        0.001034
    Shasta        0.000902        0.000928
    Digory        0.000899        0.000925
   Eustace        0.000853        0.000877
     Susan        0.000636        0.000654
    Tirian        0.000585        0.000601
     Polly        0.000533        0.000547
    Aravis        0.000523        0.000537
      Bree        0.000479        0.000492
Puddleglum        0.000479        0.000492
    Scrubb        0.000469        0.000482
    Andrew        0.000396        0.000406

Compared to the unigram results without smoothing (see the comparison table above), the probabilities for these top unigrams have slightly decreased.

Q4: When applying Laplace smoothing, do unigram probabilities always decrease? If not, what conditions can cause a unigram's probability to increase?

The unigram probability of any word, known or unknown, can be retrieved by falling back to the UNKNOWN key:

def smoothed_unigram(probs: Unigram, word: str) -> float:
    return probs.get(word, probs[UNKNOWN])
  • L2: Use get() to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.

unigram = unigram_smoothing(corpus)
for word in ['Aslan', 'Jinho']:
    print(f'{word} {smoothed_unigram(unigram, word):.6f}')
  • L2: Test a known word, 'Aslan', and an unknown word, 'Jinho'.

Aslan 0.001796
Jinho 0.000002
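As a usage sketch (not part of the src.ngram_models package; the helper name unigram_log_likelihood is hypothetical), the smoothed unigram model can score an arbitrary word sequence without ever producing a zero probability, since unseen words fall back to the UNKNOWN estimate:

import math

def unigram_log_likelihood(probs: Unigram, words: list[str]) -> float:
    # Sum the log probabilities of all words; unseen words receive the
    # UNKNOWN probability, so the result is always finite.
    return sum(math.log(smoothed_unigram(probs, w)) for w in words)

sentence = ['Aslan', 'met', 'Jinho']
print(f'{unigram_log_likelihood(unigram, sentence):.4f}')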

Bigram Smoothing

The bigram model can also be enhanced by applying Laplace smoothing:

$$P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V|}$$

Thus, the probability of an unknown bigram $(w_{u-1}, w_*)$, where $w_{u-1}$ is known but $w_*$ is unknown, is calculated as follows:

$$P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{i}} \#(w_{u-1},w_k) + |V|}$$

Q5: What does the Laplace-smoothed bigram probability of $(w_{u-1}, w_u)$ represent when $w_{u-1}$ is unknown, and what is a potential problem with this estimation?

Let us define a function bigram_smoothing() that takes a file path and returns a nested dictionary whose keys are previous words and whose values are dictionaries mapping each current word to its probability, estimated by Laplace smoothing:

from src.ngram_models import bigram_count, Bigram

def bigram_smoothing(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    vocab = set(counts.keys())
    for _, css in counts.items():
        vocab.update(css.keys())

    bigrams = dict()
    for prev, ccs in counts.items():
        total = sum(ccs.values()) + len(vocab)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d

    bigrams[UNKNOWN] = 1 / len(vocab)
    return bigrams
  • L1: Import the bigram_count() function from the src.ngram_models package.

  • L5: Create a set vocab containing all unique previous words $w_{i-1}$.

  • L6-7: Add all unique current words $w_i$ to vocab.

  • L11: Calculate the total count of all bigrams with the same previous word.

  • L12: Calculate and store the probabilities of each current word given the previous word.

  • L13: Calculate the probability for an unknown current word.

  • L16: Add a probability for an unknown previous word.

We then test bigram_smoothing() with the same text file:

from src.ngram_models import test_bigram

corpus = 'dat/chronicles_of_narnia.txt'
test_bigram(corpus, bigram_smoothing)
  • L1: Import the test_bigram() function from the src.ngram_models package.

I
        'm 0.020590
        do 0.019136
       've 0.011143
       was 0.010598
      have 0.009629
        am 0.009084
       'll 0.008236
     think 0.008115
        'd 0.006661
      know 0.006540
the
      same 0.008403
     other 0.007591
      King 0.007096
     Witch 0.006673
     whole 0.005119
    others 0.005084
     first 0.004978
     Dwarf 0.004872
      door 0.004837
     great 0.004837
said
       the 0.039038
         , 0.018270
      Lucy 0.014312
    Edmund 0.011206
   Caspian 0.010049
     Peter 0.009805
      Jill 0.008709
         . 0.008648
    Digory 0.007734
     Aslan 0.007491
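
Note the mixed structure of the returned dictionary: every observed previous word maps to a nested dictionary of probabilities, whereas the top-level UNKNOWN key stores a single float. The sketch below (assuming 'Aslan' occurs as a previous word in this corpus) makes this explicit; smoothed_bigram() defined next relies on exactly this layout:

bigram = bigram_smoothing(corpus)
print(type(bigram['Aslan']))    # <class 'dict'>: successors of 'Aslan' and their probabilities
print(type(bigram[UNKNOWN]))    # <class 'float'>: fallback for an unknown previous word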

Finally, we test the bigram estimation using smoothing for unknown sequences:

def smoothed_bigram(probs: Bigram, prev: str, curr: str) -> float:
    d = probs.get(prev, None)
    return probs[UNKNOWN] if d is None else d.get(curr, d[UNKNOWN])
  • L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.

  • L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.

bigram = bigram_smoothing(corpus)
for word in [('Aslan', 'is'), ('Aslan', 'Jinho'), ('Jinho', 'is')]:
    print(f'{word} {smoothed_bigram(bigram, *word):.6f}')
  • L3: The tuple word is unpacked and passed as the second and third parameters.

('Aslan', 'is') 0.001146
('Aslan', 'Jinho') 0.000076
('Jinho', 'is') 0.000081

Normalization

Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a previous word $w_{i-1}$ is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:

You are a student
You and I are students

There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $v = 7$. Before Laplace smoothing, the bigram probabilities of $(w_{i-1} = \textit{You}, w_i = *)$ are estimated as follows:

$$\begin{align*}
P(\textit{are}|\textit{You}) = P(\textit{and}|\textit{You}) &= 1/2 \\
P(\textit{are}|\textit{You}) + P(\textit{and}|\textit{You}) &= 1
\end{align*}$$

However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:

$$\begin{align*}
P_{\mathcal{L}}(\textit{are}|\textit{You}) = P_{\mathcal{L}}(\textit{and}|\textit{You}) &= (1+1)/(2+7) = 2/9 \\
P_{\mathcal{L}}(\textit{are}|\textit{You}) + P_{\mathcal{L}}(\textit{and}|\textit{You}) &= 4/9
\end{align*}$$

Q6: Why is it problematic when bigram probabilities following a given word do not sum to 1?

The bigram distribution for $w_{i-1}$ can be normalized to 1 by adding the total number of word types occurring after $w_{i-1}$, denoted as $|V_i|$, to the denominator instead of $v$:

$$P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V_i|}$$

Consequently, the probability of an unknown bigram $(w_{u-1}, w_*)$ can be calculated with the normalization as follows:

$$P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{i}} \#(w_{u-1},w_k) + |V_i|}$$

For the above example, $|V_i| = |\{\textit{are}, \textit{and}\}| = 2$. Once you apply $|V_i|$ to $P_{\mathcal{L}}(*|\textit{You})$, the sum of its bigram probabilities becomes 1:

$$\begin{align*}
P_{\mathcal{L}}(\textit{are}|\textit{You}) = P_{\mathcal{L}}(\textit{and}|\textit{You}) &= (1+1)/(2+2) = 1/2 \\
P_{\mathcal{L}}(\textit{are}|\textit{You}) + P_{\mathcal{L}}(\textit{and}|\textit{You}) &= 1
\end{align*}$$

A major drawback of this normalization is that the probability cannot be measured when the previous word $w_{u-1}$ is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability of $(w_*, w_u)$, where the previous word is unknown:

$$P_{\mathcal{L}}(w_u|w_*) = \min(\{P_{\mathcal{L}}(w_*|w_k) : \forall w_k \in V\})$$
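The normalized variant can be sketched as follows. This is a hypothetical implementation (bigram_smoothing_normalized is not part of src.ngram_models); it mirrors bigram_smoothing() above but uses $|V_i|$ in each denominator and the minimum fallback for an unknown previous word:

from src.ngram_models import bigram_count, Bigram

def bigram_smoothing_normalized(filepath: str) -> Bigram:
    # UNKNOWN is the constant defined in the unigram section above.
    counts = bigram_count(filepath)

    bigrams = dict()
    for prev, ccs in counts.items():
        # |V_i|: the number of word types observed after this previous word
        total = sum(ccs.values()) + len(ccs)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total  # reserved for an unseen current word
        bigrams[prev] = d

    # Unknown previous word: the minimum unknown probability across all bigrams
    bigrams[UNKNOWN] = min(d[UNKNOWN] for d in bigrams.values())
    return bigrams

With this version, the smoothed probabilities of the observed successors of any previous word sum to 1, whereas they fall below 1 under bigram_smoothing(); for example (assuming 'said' appears as a previous word, as in the output above):

bigram_v = bigram_smoothing(corpus)              # |V| in the denominator
bigram_vi = bigram_smoothing_normalized(corpus)  # |V_i| in the denominator
print(sum(p for w, p in bigram_v['said'].items() if w != UNKNOWN))   # less than 1
print(sum(p for w, p in bigram_vi['said'].items() if w != UNKNOWN))  # 1 (up to floating-point rounding)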

Reference


Source: smoothing.py

Additive smoothing, Wikipedia
