# Entropy and Perplexity

## Entropy

**Entropy** is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $$X$$ with possible outcomes $$\{x\_1, \ldots, x\_n\}$$ is defined as:

$$
H(X) = \sum\_{i=1}^n P(x\_i) \cdot (-\log P(x\_i)) = -\sum\_{i=1}^n P(x\_i) \cdot \log P(x\_i)
$$

$$P(x\_i)$$ is the probability of the outcome $$x\_i$$. The self-information of $$x\_i$$ is defined as $$-\log P(x\_i)$$, which measures how much information is gained when $$x\_i$$ occurs. The negative sign ensures that self-information is non-negative, so the more probable $$x\_i$$ is, the less information its occurrence conveys.

Entropy has several properties, including:

* It is non-negative: $$H(X) \geq 0$$.
* It is at its minimum when $$x\_i$$ is entirely predictable (all probability mass on a single outcome).
* It is at its maximum when all outcomes of $$x\_i$$ are equally likely.
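
To make the definition and these properties concrete, here is a minimal Python sketch (not from the original text); the example distributions are arbitrary:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution, given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))     # 1.0   - a fair coin: maximum entropy for two outcomes
print(entropy([0.99, 0.01]))   # ~0.08 - a heavily biased coin: nearly predictable
print(entropy([1.0]))          # 0.0   - a single certain outcome: no uncertainty
print(-math.log2(0.25))        # 2.0   - self-information of an outcome with probability 0.25
```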

{% hint style="warning" %}
**Q10**: Why is logarithmic scale used to measure **self-information** in entropy calculations?
{% endhint %}

## Sequence Entropy

**Sequence entropy** measures the unpredictability or information content of a word sequence, quantifying how uncertain or random that sequence is.

Assume a long sequence of words, $$W = \[w\_1, \ldots, w\_n]$$, formed by concatenating the entire text of a language $$\mathcal{L}$$. Let $$\mathcal{S} = \{S\_1, \ldots, S\_m\}$$ be the set of all possible sequences derived from $$W$$, where $$S\_1$$ is the shortest sequence (a single word) and $$S\_m = W$$ is the longest sequence. Then, the entropy of $$W$$ can be measured as follows:

$$
H(W) = -\sum\_{j=1}^m P(S\_j) \cdot \log P(S\_j)
$$

The **entropy rate** (per-word entropy), $$H'(W)$$, can be measured by dividing $$H(W)$$ by the total number of words $$n$$:

$$
H'(W) = -\frac{1}{n} \sum\_{j=1}^m P(S\_j) \cdot \log P(S\_j)
$$
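
As a toy illustration (not from the original text), suppose the set of possible sequences and their probabilities were fully known; the sequences and probability values below are invented purely for demonstration:

```python
import math

# Hypothetical set of sequences S_1 ... S_m with assumed probabilities P(S_j);
# the values are made up and sum to 1.
sequence_probs = {
    "the": 0.4,
    "the cat": 0.3,
    "the cat sat": 0.2,
    "the cat sat down": 0.1,   # S_m = W, the longest sequence
}

# H(W) = -sum_j P(S_j) * log2 P(S_j)
H_W = -sum(p * math.log2(p) for p in sequence_probs.values())

# Entropy rate H'(W) = H(W) / n, where n is the number of words in W.
n = len("the cat sat down".split())
print(f"H(W)  = {H_W:.3f} bits")
print(f"H'(W) = {H_W / n:.3f} bits per word")
```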

In theory, there is an infinite number of unobserved word sequences in the language $$\mathcal{L}$$. To estimate the true entropy of $$\mathcal{L}$$, we need to take the limit of $$H'(W)$$ as $$n$$ approaches infinity:

$$
H(\mathcal{L}) = \lim\_{n \rightarrow \infty} H'(W) = \lim\_{n \rightarrow \infty} -\frac{1}{n} \sum\_{j=1}^m P(S\_j) \cdot \log P(S\_j)
$$

The [Shannon-McMillan-Breiman theorem](https://www.sciencedirect.com/science/article/pii/088523089190016J?via=ihub) implies that if the language is both stationary and ergodic, measuring $$H(\mathcal{L})$$ from a single, sufficiently long sequence can be as effective as summing over all possible sequences. This is because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences recurs within the longer sequence according to its probability.

{% hint style="info" %}
The [bigram model](/nlp-essentials/language-models/n-gram-models.md) in the previous section is **stationary** because every probability is conditioned in the same way, $$P(w\_{i+1} \mid w\_i)$$, regardless of the position in the sequence. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
{% endhint %}

By applying this theorem, $$H(\mathcal{L})$$ can be approximated:

$$
H(\mathcal{L}) \approx \lim\_{n \rightarrow \infty} -\frac{1}{n} \log P(S\_m)
$$

Consequently, $$H'(W)$$ is approximated as follows, where $$S\_m = W$$:

$$
H'(W) \approx -\frac{1}{n} \log P(W)
$$
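
In practice, $$P(W)$$ is typically obtained from a language model as a product of per-word probabilities, so the entropy rate can be computed by averaging per-word log-probabilities. Here is a minimal sketch (the per-word probabilities below are hypothetical, not produced by any actual model):

```python
import math

# Hypothetical per-word probabilities assigned by some language model to a sequence W.
word_probs = [0.2, 0.05, 0.1, 0.25, 0.08]
n = len(word_probs)

# log2 P(W) is the sum of per-word log-probabilities; summing logs avoids the
# numerical underflow caused by multiplying many small probabilities.
log_P_W = sum(math.log2(p) for p in word_probs)

# H'(W) ~ -(1/n) log2 P(W)
entropy_rate = -log_P_W / n
print(f"H'(W) ~ {entropy_rate:.3f} bits per word")   # ~3.122
```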

{% hint style="warning" %}
**Q11**: What indicates **high entropy** in a text corpus?
{% endhint %}

## Perplexity

**Perplexity** measures how well a language model can predict a sequence of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $$W$$ is measured as:

$$
PP(W) = P(W)^{-\frac{1}{n}} = \sqrt\[n]{\frac{1}{P(W)}}
$$

Hence, the higher $$P(W)$$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $$W$$.
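
A quick sanity check (with arbitrary numbers, not from the original text): a model that assigns every word the same probability $$1/V$$ has perplexity exactly $$V$$, i.e., it is as "perplexed" as choosing uniformly among $$V$$ words.

```python
V = 1000                     # hypothetical vocabulary size
n = 50                       # length of an arbitrary test sequence
P_W = (1 / V) ** n           # probability a uniform model assigns to the sequence
print(P_W ** (-1 / n))       # 1000.0 - the perplexity equals the vocabulary size
```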

Perplexity, $$PP(W)$$, can be directly derived from the approximated entropy rate, $$H'(W)$$:

$$
\begin{array}{rcl}
H'(W) &\approx & -\frac{1}{n} \log\_2 P(W) \\\[5pt]
2^{H'(W)} &\approx & 2^{-\frac{1}{n} \log\_2 P(W)} \\\[5pt]
&= & 2^{\log\_2 P(W)^{-\frac{1}{n}}} \\\[5pt]
&= & P(W)^{-\frac{1}{n}} \\\[5pt]
&= & PP(W)
\end{array}
$$
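
The equivalence can be verified numerically; the sketch below reuses the hypothetical per-word probabilities from the entropy-rate example above:

```python
import math

word_probs = [0.2, 0.05, 0.1, 0.25, 0.08]   # hypothetical model probabilities for W
n = len(word_probs)

P_W = math.prod(word_probs)                  # P(W); safe here because the sequence is short

pp_direct = P_W ** (-1 / n)                  # PP(W) = P(W)^(-1/n)

entropy_rate = -math.log2(P_W) / n           # H'(W)
pp_from_entropy = 2 ** entropy_rate          # 2^{H'(W)}

print(pp_direct, pp_from_entropy)            # both ~8.71: the two formulations agree
```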

{% hint style="warning" %}
**Q12**: What is the relationship between **corpus entropy** and **language model perplexity**?
{% endhint %}

## References

1. [Entropy](https://en.wikipedia.org/wiki/Entropy_\(information_theory\)), Wikipedia
2. [Perplexity](https://en.wikipedia.org/wiki/Perplexity), Wikipedia

