Entropy and Perplexity

Entropy

Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X = \{x_1, \ldots, x_n\}$ is defined as:

$$H(X) = \sum_{i=1}^n P(x_i) \cdot (-\log P(x_i)) = -\sum_{i=1}^n P(x_i) \cdot \log P(x_i)$$

$P(x_i)$ is the probability of $x_i$. The self-information of $x_i$ is defined as $-\log P(x_i)$, which measures how much information is gained when $x_i$ occurs. The negative sign reflects that as the probability of $x_i$ increases, its self-information decreases.

Entropy has several properties, including:

  • It is non-negative: $H(X) \geq 0$.

  • It is at its minimum when the outcome of $X$ is entirely predictable (all probability mass on a single outcome).

  • It is at its maximum when all outcomes of $X$ are equally likely.

Q10: Why is logarithmic scale used to measure self-information in entropy calculations?
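
To make these properties concrete, here is a minimal sketch in Python (the `entropy` helper and the example distributions are illustrative, not part of any course library):

```python
import math

def entropy(probs, base=2):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0  -> entirely predictable (minimum)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  -> uniform over 4 outcomes (maximum)
```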

Sequence Entropy

Sequence entropy measures the unpredictability or information content of a word sequence, quantifying how uncertain or random the sequence is.

Assume a long sequence of words, $W = [w_1, \ldots, w_n]$, concatenating the entire text from a language $\mathcal{L}$. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ be the set of all possible sequences derived from $W$, where $S_1$ is the shortest sequence (a single word) and $S_m = W$ is the longest sequence. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

The entropy rate (per-word entropy), $H'(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H'(W) = -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

In theory, there is an infinite number of unobserved word sequences in the language $\mathcal{L}$. To estimate the true entropy of $\mathcal{L}$, we take the limit of $H'(W)$ as $n$ approaches infinity:

$$H(\mathcal{L}) = \lim_{n \rightarrow \infty} H'(W) = \lim_{n \rightarrow \infty} -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

The Shannon-McMillan-Breiman theorem implies that if the language is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(\mathcal{L})$, because a long sequence of words naturally contains numerous shorter sequences, each of which reoccurs within the longer sequence according to its probability.

The bigram model in the previous section is stationary because all probabilities rely on the same condition, $P(w_{i+1} \mid w_i)$. In reality, however, this assumption does not hold: the probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.

By applying this theorem, $H(\mathcal{L})$ can be approximated:

$$H(\mathcal{L}) \approx \lim_{n \rightarrow \infty} -\frac{1}{n} \log P(S_m)$$

Consequently, $H'(W)$ is approximated as follows, where $S_m = W$:

$$H'(W) \approx -\frac{1}{n} \log P(W)$$

Q11: What indicates high entropy in a text corpus?
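
As a rough illustration of the approximation $H'(W) \approx -\frac{1}{n} \log P(W)$, the sketch below scores a short sequence under a toy bigram model; the `bigram_prob` table and its probability values are invented purely for illustration:

```python
import math

# A toy bigram model: P(w_i | w_{i-1}) stored as a nested dict (made-up values).
bigram_prob = {
    '<s>': {'the': 0.6, 'a': 0.4},
    'the': {'cat': 0.5, 'dog': 0.5},
    'a':   {'cat': 0.3, 'dog': 0.7},
    'cat': {'</s>': 1.0},
    'dog': {'</s>': 1.0},
}

def entropy_rate(words):
    """Per-word entropy of a sequence: H'(W) ~ -(1/n) * log2 P(W) under the toy model."""
    log_prob, prev = 0.0, '<s>'
    for w in words:
        log_prob += math.log2(bigram_prob[prev][w])
        prev = w
    return -log_prob / len(words)

print(entropy_rate(['the', 'cat', '</s>']))  # ~0.58 bits per word
```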

Perplexity

Perplexity measures how well a language model predicts a sequence of words, based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W$ is measured as:

$$PP(W) = P(W)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(W)}}$$

Hence, the higher $P(W)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.

Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H'(W)$:

$$\begin{array}{rcl} H'(W) &\approx& -\frac{1}{n} \log_2 P(W) \\[5pt] 2^{H'(W)} &\approx& 2^{-\frac{1}{n} \log_2 P(W)} \\[5pt] &=& 2^{\log_2 P(W)^{-\frac{1}{n}}} \\[5pt] &=& P(W)^{-\frac{1}{n}} \\[5pt] &=& PP(W) \end{array}$$

Q12: What is the relationship between corpus entropy and language model perplexity?
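
The identity $PP(W) = 2^{H'(W)}$ also suggests a practical way to compute perplexity: accumulate log probabilities (avoiding numerical underflow on long sequences) and exponentiate the per-word average. A minimal sketch, where the probability value is made up for illustration:

```python
import math

def perplexity(log2_prob, n):
    """PP(W) = P(W)^(-1/n), computed from log2 P(W) for numerical stability."""
    entropy_rate = -log2_prob / n   # H'(W) ~ -(1/n) * log2 P(W)
    return 2 ** entropy_rate        # PP(W) = 2^{H'(W)}

# Suppose a model assigns a 3-word sequence the probability P(W) = 1/8.
print(perplexity(math.log2(1 / 8), 3))  # 2.0: as "perplexed" as a uniform choice
                                        # between 2 words at each step
```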

References

  • Entropy, Wikipedia
  • Perplexity, Wikipedia