Entropy and Perplexity

Entropy

Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X = \{x_1, \ldots, x_n\}$ is defined as:

$$H(X) = \sum_{i=1}^n P(x_i) \cdot (-\log P(x_i)) = -\sum_{i=1}^n P(x_i) \cdot \log P(x_i)$$

$P(x_i)$ is the probability of $x_i$. The self-information of $x_i$ is defined as $-\log P(x_i)$, which measures how much information is gained when $x_i$ occurs. The negative sign reflects that the more probable $x_i$ is, the less self-information its occurrence carries.

Entropy has several properties, including:

  • It is non-negative: $H(X) \geq 0$.

  • It is at its minimum when $x_i$ is entirely predictable (all probability mass on a single outcome).

  • It is at its maximum when all outcomes $x_i$ are equally likely.
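
As a quick illustration of the definition and these properties (not part of the original text, and using arbitrary example distributions), the following Python sketch computes $H(X)$ in bits for a few distributions over four outcomes; the uniform case attains the maximum entropy, while a fully predictable case attains the minimum of zero.

```python
import math

def entropy(probs):
    """Compute H(X) = -sum P(x) * log2 P(x) in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over 4 outcomes: maximum entropy, log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Skewed distribution: less uncertainty, lower entropy.
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36

# Entirely predictable outcome: minimum entropy, 0 bits.
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0
```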


Q10: Why is logarithmic scale used to measure self-information in entropy calculations?

Sequence Entropy

Sequence entropy measures the unpredictability or information content of a word sequence; it quantifies how uncertain or random the sequence is.

Assume a long sequence of words, $W = [w_1, \ldots, w_n]$, concatenating the entire text from a language $\mathcal{L}$. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ be the set of all possible sequences derived from $W$, where $S_1$ is the shortest sequence (a single word) and $S_m = W$ is the longest sequence. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

The entropy rate (per-word entropy), $H'(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H'(W) = -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

In theory, there is an infinite number of unobserved word sequences in the language $\mathcal{L}$. To estimate the true entropy of $\mathcal{L}$, we need to take the limit of $H'(W)$ as $n$ approaches infinity:

$$H(\mathcal{L}) = \lim_{n \rightarrow \infty} H'(W) = \lim_{n \rightarrow \infty} -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

The Shannon-McMillan-Breiman theorem implies that if the language is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(\mathcal{L})$: a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its probability.


The bigram model in the previous section is stationary because all probabilities rely on the same condition, $P(w_{i+1} \mid w_i)$. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.

By applying this theorem, $H(\mathcal{L})$ can be approximated:

$$H(\mathcal{L}) \approx \lim_{n \rightarrow \infty} -\frac{1}{n} \log P(S_m)$$

Consequently, $H'(W)$ is approximated as follows, where $S_m = W$:

$$H'(W) \approx -\frac{1}{n} \log P(W)$$
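
As a minimal sketch of this approximation, assume a toy bigram model whose words and probabilities are made up purely for illustration; the entropy rate of a short sequence is then estimated as $-\frac{1}{n} \log_2 P(W)$, where $P(W)$ is the product of the bigram probabilities.

```python
import math

# Hypothetical bigram probabilities P(w_{i+1} | w_i); "<s>" marks the start of the sequence.
bigram = {
    ("<s>", "the"): 0.60,
    ("the", "cat"): 0.20,
    ("cat", "sat"): 0.30,
    ("sat", "down"): 0.40,
}

def entropy_rate(words):
    """Approximate H'(W) = -(1/n) * log2 P(W), with P(W) as the product of bigram probabilities."""
    n = len(words)
    log_prob = sum(math.log2(bigram[(prev, curr)])
                   for prev, curr in zip(["<s>"] + words[:-1], words))
    return -log_prob / n

W = ["the", "cat", "sat", "down"]
print(entropy_rate(W))  # ~1.53 bits per word
```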


Q11: What indicates high entropy in a text corpus?

Perplexity

Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W$ is measured as:

$$PP(W) = P(W)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(W)}}$$

Hence, the higher $P(W)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.

Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H'(W)$:

$$\begin{array}{rcl}
H'(W) &\approx& -\frac{1}{n} \log_2 P(W) \\
2^{H'(W)} &\approx& 2^{-\frac{1}{n} \log_2 P(W)} \\
&=& 2^{\log_2 P(W)^{-\frac{1}{n}}} \\
&=& P(W)^{-\frac{1}{n}} \\
&=& PP(W)
\end{array}$$
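
Continuing the toy bigram example above (hypothetical numbers only), this sketch checks numerically that computing perplexity directly as $P(W)^{-\frac{1}{n}}$ and deriving it as $2^{H'(W)}$ give the same value when $H'(W)$ uses base-2 logarithms.

```python
import math

n = 4
# P(W) as the product of the hypothetical bigram probabilities used above.
p_w = 0.60 * 0.20 * 0.30 * 0.40           # = 0.0144
entropy_rate = -math.log2(p_w) / n        # H'(W), ~1.53 bits per word

pp_direct = p_w ** (-1 / n)               # PP(W) = P(W)^(-1/n)
pp_from_entropy = 2 ** entropy_rate       # PP(W) = 2^{H'(W)}

print(pp_direct, pp_from_entropy)         # both ~2.89
```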


Q12: What is the relationship between corpus entropy and language model perplexity?

References

  1. Entropy, Wikipedia

  2. Perplexity, Wikipedia
