Smoothing

Update: 2024-01-05

Unigram Smoothing

The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.

Laplace smoothing (a.k.a. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds a count of 1 to every word and recalculates the unigram probabilities:

$$P_{\mathcal{L}}(w_i) = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} (\#(w_k) + 1)} = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}$$

Thus, the probability of any unknown word $w_*$ with Laplace smoothing is calculated as follows:

$$P_{\mathcal{L}}(w_*) = \frac{1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}$$

The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.
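
Concretely, every known word $w_i$ occurs at least once, so its smoothed numerator $\#(w_i) + 1 \ge 2$ exceeds the numerator of 1 used for an unknown word, while both share the same denominator:

$$P_{\mathcal{L}}(w_*) = \frac{1}{\sum_{\forall w_k \in V} \#(w_k) + |V|} < \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} \#(w_k) + |V|} = P_{\mathcal{L}}(w_i)$$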

Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

$$\sum_{i=1}^{v} P(w_i) = \sum_{i=1}^{v} P_{\mathcal{L}}(w_i) = 1$$
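
For instance, consider a tiny corpus consisting only of the tokens "a a b", so that $V = \{\textit{a}, \textit{b}\}$ and the total token count is 3:

$$P_{\mathcal{L}}(\textit{a}) = \frac{2+1}{3+2} = \frac{3}{5}, \qquad P_{\mathcal{L}}(\textit{b}) = \frac{1+1}{3+2} = \frac{2}{5}, \qquad P_{\mathcal{L}}(\textit{a}) + P_{\mathcal{L}}(\textit{b}) = 1$$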

Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

from src.ngram_models import unigram_count
from src.types import Unigram

UNKNOWN = ''

def unigram_smoothing(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values()) + len(counts)
    unigrams = {word: (count + 1) / total for word, count in counts.items()}
    unigrams[UNKNOWN] = 1 / total
    return unigrams
  • L1: Import the unigram_count() function from the src.ngram_models package.

  • L4: Define a constant representing the unknown word.

  • L8: Increment the total count by the vocabulary size.

  • L9: Increment each unigram count by 1.

  • L10: Add the unknown word to the unigrams with a probability of 1 divided by the total count.

We then test unigram_smoothing() with a text file dat/chronicles_of_narnia.txt:

from src.ngram_models import test_unigram

corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_smoothing)
  • L1: Import the test_unigram() function from the ngram_models package.

Compared to the unigram results without smoothing, the probabilities of the most frequent unigrams have slightly decreased.

Will the probabilities of all unigrams always decrease when Laplace smoothing is applied? If not, under what circumstances might the unigram probabilities increase after smoothing?

The unigram probability of any word can now be retrieved; words not present in the dictionary fall back to the UNKNOWN key:

def smoothed_unigram(probs: Unigram, word: str) -> float:
    return probs.get(word, probs[UNKNOWN])

unigram = unigram_smoothing(corpus)
for word in ['Aslan', 'Jinho']:
    print("{} {:.6f}".format(word, smoothed_unigram(unigram, word)))
  • L2: Use the get() method to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the 'UNKNOWN' token.

  • L5: Test a known word, 'Aslan', and an unknown word, 'Jinho'.

Bigram Smoothing

The bigram model can also be enhanced by applying Laplace smoothing:

$$P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V|}$$

Thus, the probability of an unknown bigram $(w_{u-1}, w_*)$, where $w_{u-1}$ is known but $w_*$ is unknown, is calculated as follows:

$$P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{u}} \#(w_{u-1},w_k) + |V|}$$

What does the Laplace smoothed bigram probability of $(w_{u-1}, w_u)$ represent when $w_{u-1}$ is unknown? What is a potential problem with this estimation?

Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

from src.ngram_models import bigram_count
from src.types import Bigram

def bigram_smoothing(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    vocab = set(counts.keys())
    for _, ccs in counts.items():
        vocab.update(ccs.keys())

    bigrams = dict()
    for prev, ccs in counts.items():
        total = sum(ccs.values()) + len(vocab)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d

    bigrams[UNKNOWN] = 1 / len(vocab)
    return bigrams
  • L1: Import the bigram_count() function from the src.ngram_models package.

  • L6-8: Create a set vocab containing all unique words in the bigrams.

  • L12: Calculate the total count of all bigrams with the same previous word.

  • L13: Increment each bigram count by 1, then calculate and store the probabilities of each current word given the previous word.

  • L14: Calculate the probability for an unknown current word.

  • L17: Add a probability for an unknown previous word.

Why are L7-8 in the above code necessary to retrieve all word types?

We then test bigram_smoothing() with the same text file:

from src.ngram_models import test_bigram

corpus = 'dat/chronicles_of_narnia.txt'
test_bigram(corpus, bigram_smoothing)
  • L1: Import the test_bigram() function from the ngram_models package.
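
Note that, for each previous word, bigram_smoothing() stores probabilities only for the observed continuations plus a single UNKNOWN entry, so the stored values generally sum to less than 1; the Normalization section below revisits this point. As a quick check (a minimal sketch, assuming Aslan occurs as a previous word in this corpus):

bigram = bigram_smoothing(corpus)
d = bigram['Aslan']                  # 'Aslan' is assumed to occur as a previous word
print(round(sum(d.values()), 6))     # expected to be noticeably less than 1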

Finally, we test the smoothed bigram estimation on sequences containing unknown words:

def smoothed_bigram(probs: Bigram, prev: str, curr: str) -> float:
    d = probs.get(prev, None)
    return probs[UNKNOWN] if d is None else d.get(curr, d[UNKNOWN])


bigram = bigram_smoothing(corpus)
for word in [('Aslan', 'is'), ('Aslan', 'Jinho'), ('Jinho', 'is')]:
    print("{} {:.6f}".format(word, smoothed_bigram(bigram, *word)))
  • L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.

  • L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.

  • L8: The tuple word is unpacked and passed as the second and third parameters.

Normalization

Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a previous word $w_{i-1}$ is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:

You are a student
You and I are students

There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $v = 7$. Before Laplace smoothing, the bigram probabilities of $(w_{i-1} = \textit{You},\ w_i = *)$ are estimated as follows:

$$\begin{aligned} P(\textit{are}\,|\,\textit{You}) = P(\textit{and}\,|\,\textit{You}) &= 1/2 \\ P(\textit{are}\,|\,\textit{You}) + P(\textit{and}\,|\,\textit{You}) &= 1 \end{aligned}$$

However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:

$$\begin{aligned} P_{\mathcal{L}}(\textit{are}\,|\,\textit{You}) = P_{\mathcal{L}}(\textit{and}\,|\,\textit{You}) &= (1+1)/(2+7) = 2/9 \\ P_{\mathcal{L}}(\textit{are}\,|\,\textit{You}) + P_{\mathcal{L}}(\textit{and}\,|\,\textit{You}) &= 4/9 \end{aligned}$$

The bigram distribution for $w_{i-1}$ can be normalized to sum to 1 by adding the total number of word types occurring after $w_{i-1}$, denoted as $|V_i|$, to the denominator instead of $v$:

$$P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V_i|}$$

Consequently, the probability of an unknown bigram (wu1,w)(w_{u-1}, w_{*}) can be calculated with the normalization as follows:

$$P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{u}} \#(w_{u-1},w_k) + |V_u|}$$

For the above example, $|V_i| = |\{\textit{are}, \textit{and}\}| = 2$. Once you apply $|V_i|$ to $P_{\mathcal{L}}(*\,|\,\textit{You})$, the sum of its bigram probabilities becomes 1:

$$\begin{aligned} P_{\mathcal{L}}(\textit{are}\,|\,\textit{You}) = P_{\mathcal{L}}(\textit{and}\,|\,\textit{You}) &= (1+1)/(2+2) = 1/2 \\ P_{\mathcal{L}}(\textit{are}\,|\,\textit{You}) + P_{\mathcal{L}}(\textit{and}\,|\,\textit{You}) &= 1 \end{aligned}$$

A major drawback of this normalization is that the probability cannot be measured when $w_{u-1}$ is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability of $(w_*, w_u)$, where the previous word is unknown, as follows:

$$P_{\mathcal{L}}(w_u|w_*) = \min(\{P_{\mathcal{L}}(w_*|w_k) : \forall w_k \in V\})$$
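
The following sketch implements this normalized variant by reusing bigram_count(); the function name bigram_smoothing_normalized and its structure are illustrative assumptions rather than the implementation in smoothing.py. It replaces $|V|$ with $|V_i|$ in the denominator and applies the minimum rule above for an unknown previous word:

from src.ngram_models import bigram_count
from src.types import Bigram

UNKNOWN = ''  # same convention as above

def bigram_smoothing_normalized(filepath: str) -> Bigram:
    counts = bigram_count(filepath)

    bigrams = dict()
    for prev, ccs in counts.items():
        # |V_i|: number of word types observed after prev
        total = sum(ccs.values()) + len(ccs)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        # probability reserved for a current word never observed after prev
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d

    # unknown previous word: the minimum unknown probability across all bigrams
    bigrams[UNKNOWN] = min(d[UNKNOWN] for d in bigrams.values())
    return bigrams

Like bigram_smoothing(), this function can be passed to test_bigram() and used with smoothed_bigram().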

Reference

Source: smoothing.py
