An alternative construction of Shannon entropy

TL;DR: Shannon’s entropy formula is usually justified by showing it satisfies key mathematical criteria, or by computing how much space is needed to encode a set of samples. But one can also construct Shannon’s formula starting purely from the simpler notion of entropy as a log-count—of how many different ways a distribution could have emerged from a sequence of samples.

Entropy of a probability distribution

The underpinning of information theory is Shannon’s formula for the entropy H of a discrete probability distribution P. Derived in 1948 in the paper “A mathematical theory of communication”, this is given by

    \[  H[P] = - \sum_{i=1}^N p_i \log p_i \quad \quad (1) \]

where p_i = p(x_i) is the probability of sampling x_i from P.

There are many ways to think about entropy, but one intuition is that H quantifies how spread out P is. It is a more general statistic than e.g. variance, because entropy does not require a notion of distance between different x_i. Thus one can compute the entropy of a distribution over colors, names, categories, etc., for which variance might not make sense.
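As a quick numerical illustration (a minimal sketch; the color distributions below are made up for the example), Eq. 1 assigns a larger value to a more spread-out distribution:

    import numpy as np

    def shannon_entropy(p, base=2):
        """Entropy (Eq. 1) of a discrete distribution given as a vector of probabilities."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # drop zero-probability outcomes (0 log 0 is taken as 0)
        return -np.sum(p * np.log(p)) / np.log(base)

    # Made-up distributions over four colors: entropy needs no notion of distance between outcomes.
    spread_out   = [0.3, 0.3, 0.2, 0.2]
    concentrated = [0.85, 0.05, 0.05, 0.05]

    print(shannon_entropy(spread_out))    # ~1.97 bits
    print(shannon_entropy(concentrated))  # ~0.85 bits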

For a uniform distribution over N outcomes, the Shannon entropy is:

    \[  H[P] = - \sum_{i=1}^N \frac{1}{N} \log \frac{1}{N} = - \log \frac{1}{N} = \log N. \]

The entropy is just the number of possible outcomes, counted in log units. Log units are used because the number of outcomes often grows exponentially with system size. For instance, the number of unique binary words of length T is 2^T. The entropy of a uniform distribution over such words is therefore log_2(2^T) = T bits.

But Eq. 1 applies to all discrete distributions, not just uniform ones. At first glance the general formula is not particularly intuitive. Why the minus sign? Why a probability times a logarithm of a probability? Where does Eq. 1 come from?

Shannon’s derivation

The usual explanation of why Eq. 1 is the right formula for entropy can be found in Shannon’s paper itself (p. 10-11).

Essentially Eq. 1 is the unique solution (up to a positive multiplicative factor) satisfying three criteria: (1) H is continuous in the probability levels, (2) H of a uniform distribution increases with the number of possible outcomes, and (3) if an observation from P can be broken into two or more sequential observations (e.g. first observing whether a sample is in category A or B, then observing its exact value), then H should be decomposable into a weighted sum of the individual entropies.
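Criterion (3) is the least intuitive, so it helps to see Shannon's own example from the paper: a choice among three outcomes with probabilities 1/2, 1/3, 1/6 can instead be made in two stages, first choosing between two equally likely branches and then, if the second branch occurs, choosing between two outcomes with probabilities 2/3 and 1/3. Criterion (3) demands

    \[ H\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2} H\left(\tfrac{2}{3}, \tfrac{1}{3}\right), \]

where the coefficient 1/2 appears because the second choice is only made half the time.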

Appendix 2 of Shannon’s paper shows why Eq. 1 satisfies these, and the proof is quite accessible. However, the derivation essentially starts by assuming one already knows the answer, obscuring how one would have discovered Eq. 1 in the first place.

Expected space required to encode a sample

A second derivation of Shannon’s formula comes from computing the expected number of bits required to encode a sample from P. For instance, if P is a uniform distribution over 4 possible outcomes, it takes log_2(4) = 2 bits to encode one of these outcomes (00, 01, 10, or 11), i.e. to communicate to someone else which outcome occurred. More generally, it takes log(1/p_i) = -log p_i bits to encode a sample that occurs with probability p_i, and the expectation of this quantity is E[-log p_i], which is exactly Eq. 1.
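To make this concrete, here is a minimal sketch (the distribution and codewords are made up for illustration) using a distribution whose probabilities are powers of 1/2, so that a prefix code with codeword lengths of exactly -log_2 p_i bits exists; its expected codeword length equals the entropy:

    import math

    # Made-up distribution whose probabilities are powers of 1/2,
    # so a codeword of exactly -log2(p) bits exists for each outcome.
    p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
    code = {"A": "0", "B": "10", "C": "110", "D": "111"}  # a prefix code

    expected_bits = sum(p[x] * len(code[x]) for x in p)        # E[codeword length]
    entropy_bits = -sum(q * math.log2(q) for q in p.values())  # Eq. 1, in base-2 logs

    print(expected_bits, entropy_bits)  # both 1.75 bits per sample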

This derivation is elegant and intuitive, but it requires introducing the notion of encoding. For instance, even though it’s somewhat intuitive that rarer outcomes would require more space to encode, it takes a bit of head-scratching (at least for me) to arrive at why log(1/p_i) is the right quantity for a non-uniform distribution. (Of course, the link between entropy and encoding is a core element of information theory, so it is worth working through this.)

An alternative construction

Is there a way to arrive at Shannon’s formula without knowing what it should look like ahead of time? And without the notion of encoding? In particular, can we go from the simpler notion of entropy as the log number of possible outcomes of something to Shannon’s formula for an arbitrary P?

It turns out that Shannon entropy emerges when we choose that something to be a sequence of samples consistent with a histogram approximating P (i.e. a histogram that converges to P in the limit of infinitely many samples). Counting these sequences in log units yields Eq. 1.

I first saw this derivation in Leonard Susskind’s excellent Statistical Mechanics course (Lecture 3), albeit in a different context (connecting Boltzmann and Gibbs entropy). Below I’ll present it differently, but the math is the same. This derivation is longer than Shannon’s, but it shows how one might have come upon Eq. 1 without already knowing what it should look like and without the notion of encoding.

Entropy as a log count of sequences of samples

Consider drawing L samples from a distribution P, one after another. As L increases, the histogram constructed from the samples will tend toward P. How many different sequences of L samples could have produced a given histogram?

Suppose P is defined over N possible outcomes, x_1, …, x_N, and we have collected L_1 observations of x_1, L_2 observations of x_2, etc., where L = L_1 + … + L_N. The number of ways (call this N_seq) we could have sampled x_1 a total of L_1 times, x_2 a total of L_2 times, etc. is

    \[ N_{seq}(L) = {L \choose L_1} {L - L_1 \choose L_2} \cdots {L - L_1 - \cdots - L_{N-1} \choose L_N} \]

    \[ = \frac{L!}{L_1! (L-L_1)!} \times \frac{(L-L_1)!}{L_2! (L-L_1-L_2)!} \cdots = \frac{L!}{L_1! L_2! \cdots L_N!} \]

where we have canceled the second term in each denominator with the next numerator.
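As a quick sanity check of this count (a minimal sketch with a made-up tiny histogram), the multinomial coefficient matches a brute-force enumeration of the distinct orderings of the samples:

    from itertools import permutations
    from math import factorial

    # Made-up tiny histogram: L1 = 2, L2 = 1, L3 = 1 observations (L = 4).
    counts = [2, 1, 1]
    L = sum(counts)

    # Brute force: count distinct orderings of the multiset of samples.
    samples = [i for i, c in enumerate(counts) for _ in range(c)]
    n_seq_brute = len(set(permutations(samples)))

    # Closed form: L! / (L1! L2! ... LN!)
    n_seq_formula = factorial(L)
    for c in counts:
        n_seq_formula //= factorial(c)

    print(n_seq_brute, n_seq_formula)  # both 12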

Counting in log units we then find that

    \[ \log N_{seq}(L) =  \log \left( \frac{L!}{L_1! \cdots L_N!} \right) = \log L! - \log L_1! - \dots - \log L_N!, \]

the log number of different sequences of L samples that would have produced our histogram. Next, letting L be large enough to use Stirling’s approximation (log L! ≈ L log L − L), we find

    \[ \log N_{seq}(L) \approx L \log L - L - (L_1 \log L_1 - L_1) - \dots - (L_N \log L_N - L_N), \]

and regrouping terms,

    \[ \log N_{seq}(L) \approx L \log L - (L_1 \log L_1 + \dots + L_N \log L_N) - L + (L_1 + \dots + L_N) \]

    \[ = L \log L - (L_1 \log L_1 + \dots + L_N \log L_N) \]

since L = L_1 + … + L_N. Factoring out L, we have

    \[ \log N_{seq}(L) \approx L\left[ \log L - \left(\frac{L_1}{L} \log L_1 + \dots + \frac{L_N}{L} \log L_N \right) \right] \]

    \[ = L\left[ \log L - (\hat{p}_1 \log L + \dots + \hat{p}_N \log L) - (\hat{p}_1 \log \hat{p}_1 + \cdots + \hat{p}_N \log \hat{p}_N) \right] \]

    \[ = L\left[ \log L - \log L - (\hat{p}_1 \log \hat{p}_1 + \cdots + \hat{p}_N \log \hat{p}_N) \right] \]

where we have used the fact that

    \[ L_i = L\hat{p}_i \quad \left(\text{so } \log L_i = \log L + \log \hat{p}_i\right) \quad \text{and} \quad \sum_i \hat{p}_i = 1.   \]

Thus

    \[ \log N_{seq}(L) \approx  L\left[ -(\hat{p}_1 \log \hat{p}_1 + \cdots + \hat{p}_N \log \hat{p}_N) \right] = -L \sum_i \hat{p}_i \log \hat{p}_i. \]

Since our histogram approaches P as L → ∞ (i.e., each p̂_i → p_i), we arrive at

    \[ \log N_{seq}(L) \to -L\sum_i p_i \log p_i . \]

Finally, to obtain a quantity that does not depend on the number of samples we divide both sides by L:

    \[ \frac{\log N_{seq}(L)}{L} \to -\sum_i p_i \log p_i = H[P]. \]

Thus, counting in log units how many different sequences of samples could have produced P, and normalizing by the number of samples, we arrive once again at Eq. 1. That is, Shannon’s formula is a normalized log-count of how many ways a distribution could have emerged.
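As a final numerical sanity check (a minimal sketch; the distribution and sample sizes are made up), one can compute log N_seq(L)/L exactly from the multinomial coefficient and watch it approach H[P] as L grows:

    import math

    p = [0.5, 0.3, 0.2]                       # a made-up distribution P
    H = -sum(q * math.log(q) for q in p)      # Eq. 1, in nats

    for L in [10, 100, 1_000, 10_000]:
        counts = [round(q * L) for q in p]    # idealized histogram with L_i = L * p_i
        # log N_seq = log L! - sum_i log L_i!, computed via lgamma(n + 1) = log(n!)
        log_n_seq = math.lgamma(L + 1) - sum(math.lgamma(c + 1) for c in counts)
        print(L, log_n_seq / L, H)            # log N_seq / L approaches H as L grows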

Further reading

C.E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal 1948.

David MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press 2003. (+ online lectures)
