An alternative construction of Shannon entropy

TL;DR: Shannon’s entropy formula is usually justified by showing it satisfies key mathematical criteria, or by computing how much space is needed to encode a set of samples. But one can also construct Shannon’s formula starting purely from the simpler notion of entropy as a log-count—of how many different ways a distribution could have emerged from a sequence of samples.

Entropy of a probability distribution

The underpinning of information theory is Shannon’s formula for the entropy H of a discrete probability distribution P. Derived in 1948 in the paper “A mathematical theory of communication”, this is given by

    \[  H[P] = - \sum_{i=1}^N p_i \log p_i \quad \quad (1) \]

where p_i = p(x_i) is the probability of sampling x_i from P.

There are many ways to think about entropy, but one intuition is that H quantifies how spread out P is. It is a more general statistic than e.g. variance, because entropy does not require a notion of distance between different x_i. Thus one can compute the entropy of a distribution over colors, names, categories, etc., for which variance might not make sense.
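As a quick numerical illustration (a minimal sketch; the color distributions below are made up for the example), Eq. 1 assigns a larger value to a more spread-out distribution:

    import numpy as np

    def shannon_entropy(p, base=2):
        """Entropy (Eq. 1) of a discrete distribution given as a vector of probabilities."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # drop zero-probability outcomes (0 log 0 is taken as 0)
        return -np.sum(p * np.log(p)) / np.log(base)

    # Made-up distributions over four colors: entropy needs no notion of distance between outcomes.
    spread_out   = [0.3, 0.3, 0.2, 0.2]
    concentrated = [0.85, 0.05, 0.05, 0.05]

    print(shannon_entropy(spread_out))    # ~1.97 bits
    print(shannon_entropy(concentrated))  # ~0.85 bits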

For a uniform distribution over N outcomes, the Shannon entropy is:

    \[  H[P] = - \sum_{i=1}^N \frac{1}{N} \log \frac{1}{N} = - \log \frac{1}{N} = \log N. \]

The entropy is just the number of possible outcomes, counted in log units. Log units are used because the number of outcomes often grows exponentially with system size. For instance, the number of unique binary words of length T is 2^T. The entropy of a uniform distribution over such words is therefore log_2(2^T) = T bits.

But Eq. 1 applies to all discrete distributions, not just uniform ones. At first glance the general formula is not particularly intuitive. Why the minus sign? Why a probability times a logarithm of a probability? Where does Eq. 1 come from?

Shannon’s derivation

The usual explanation of why Eq. 1 is the right formula for entropy can be found in Shannon’s paper itself (p. 10-11).

Essentially Eq. 1 is the unique solution (up to a positive multiplicative factor) satisfying three criteria: (1) H is continuous in the probability levels, (2) H of a uniform distribution increases with the number of possible outcomes, and (3) if an observation from P can be broken into two or more sequential observations (e.g. first observing whether a sample is in category A or B, then observing its exact value), then H should be decomposable into a weighted sum of the individual entropies.
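Criterion (3) is the least intuitive, so it helps to see Shannon's own example from the paper: a choice among three outcomes with probabilities 1/2, 1/3, 1/6 can instead be made in two stages, first choosing between two equally likely branches and then, if the second branch occurs, choosing between two outcomes with probabilities 2/3 and 1/3. Criterion (3) demands

    \[ H\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2} H\left(\tfrac{2}{3}, \tfrac{1}{3}\right), \]

where the coefficient 1/2 appears because the second choice is only made half the time.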

Appendix 2 of Shannon’s paper shows why Eq. 1 satisfies these, and the proof is quite accessible. However, the derivation essentially starts by assuming one already knows the answer, obscuring how one would have discovered Eq. 1 in the first place.

Expected space required to encode a sample

A second derivation of Shannon’s formula comes from computing the expected number of bits required to encode a sample from P. For instance, if P is a uniform distribution over 4 possible outcomes, it takes log_2(4) = 2 bits to encode one of these outcomes (00, 01, 10, or 11), i.e. to communicate to someone else which outcome occurred. More generally, it takes log(1/p_i) = -log p_i bits to encode a sample that occurs with probability p_i, and the expectation of this quantity is E[-log p_i], which is exactly Eq. 1.
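To make this concrete, here is a minimal sketch (the distribution and codewords are made up for illustration) using a distribution whose probabilities are powers of 1/2, so that a prefix code with codeword lengths of exactly -log_2 p_i bits exists; its expected codeword length equals the entropy:

    import math

    # Made-up distribution whose probabilities are powers of 1/2,
    # so a codeword of exactly -log2(p) bits exists for each outcome.
    p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
    code = {"A": "0", "B": "10", "C": "110", "D": "111"}  # a prefix code

    expected_bits = sum(p[x] * len(code[x]) for x in p)        # E[codeword length]
    entropy_bits = -sum(q * math.log2(q) for q in p.values())  # Eq. 1, in base-2 logs

    print(expected_bits, entropy_bits)  # both 1.75 bits per sample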

This derivation is elegant and intuitive, but it requires introducing the notion of encoding. For instance, even though it’s somewhat intuitive that rarer outcomes would require more space to encode, it takes a bit of head-scratching (at least for me) to arrive at why log(1/p_i) is the right quantity for a non-uniform distribution. (Of course, the link between entropy and encoding is a core element of information theory, so it is worth working through this.)

An alternative construction

Is there a way to arrive at Shannon’s formula without knowing what it should look like ahead of time? And without the notion of encoding? In particular, can we go from the simpler notion of entropy as the log number of possible outcomes of something to Shannon’s formula for an arbitrary P?

It turns out that Shannon entropy emerges when we choose that something to be a sequence of samples consistent with a histogram approximating P (i.e. a histogram that converges to P in the limit of infinitely many samples). Counting these sequences in log units yields Eq. 1.

I first saw this derivation in Leonard Susskind’s excellent Statistical Mechanics course (Lecture 3), albeit in a different context (connecting Boltzmann and Gibbs entropy). Below I’ll present it differently, but the math is the same. This derivation is longer than Shannon’s, but it shows how one might have come upon Eq. 1 without already knowing what it should look like and without the notion of encoding.

Entropy as a log count of sequences of samples

Consider drawing L samples from a distribution P, one after another. As L increases, the histogram constructed from the samples will tend toward P. How many different sequences of L samples could have produced a given histogram?

Suppose P is defined over N possible outcomes, x_1, …, x_N, and we have collected L_1 observations of x_1, L_2 observations of x_2, etc., where L = L_1 + … + L_N. The number of ways (call this N_seq) we could have sampled x_1 a total of L_1 times, x_2 a total of L_2 times, etc. is

    \[ N_{seq}(L) = {L \choose L_1} {L - L_1 \choose L_2} \cdots {L - L_1 - \cdots - L_{N-1} \choose L_N} \]

    \[ = \frac{L!}{L_1! (L-L_1)!} \times \frac{(L-L_1)!}{L_2! (L-L_1-L_2)!} \cdots = \frac{L!}{L_1! L_2! \cdots L_N!} \]

where we have canceled the second term in each denominator with the next numerator.
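As a quick sanity check of this count (a minimal sketch with a made-up tiny histogram), the multinomial coefficient matches a brute-force enumeration of the distinct orderings of the samples:

    from itertools import permutations
    from math import factorial

    # Made-up tiny histogram: L1 = 2, L2 = 1, L3 = 1 observations (L = 4).
    counts = [2, 1, 1]
    L = sum(counts)

    # Brute force: count distinct orderings of the multiset of samples.
    samples = [i for i, c in enumerate(counts) for _ in range(c)]
    n_seq_brute = len(set(permutations(samples)))

    # Closed form: L! / (L1! L2! ... LN!)
    n_seq_formula = factorial(L)
    for c in counts:
        n_seq_formula //= factorial(c)

    print(n_seq_brute, n_seq_formula)  # both 12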

Counting in log units we then find that

    \[ \log N_{seq}(L) =  \log \left( \frac{L!}{L_1! \cdots L_N!} \right) = \log L! - \log L_1! - \dots - \log L_N!, \]

the log number of different sequences of L samples that would have produced our histogram. Next, letting L be large enough to use Stirling’s approximation (log L! ≈ L log L − L), we find

    \[ \log N_{seq}(L) \approx L \log L - L - (L_1 \log L_1 - L_1) - \dots - (L_N \log L_N - L_N), \]

and regrouping terms,

    \[ \log N_{seq}(L) \approx L \log L - (L_1 \log L_1 + \dots + L_N \log L_N) - L + (L_1 + \dots + L_N) \]

    \[ = L \log L - (L_1 \log L_1 + \dots + L_N \log L_N) \]

since L = L_1 + … + L_N. Factoring out L, we have

    \[ \log N_{seq}(L) \approx L\left[ \log L - \left(\frac{L_1}{L} \log L_1 + \dots + \frac{L_N}{L} \log L_N \right) \right] \]

    \[ = L\left[ \log L - (\hat{p}_1 \log L + \dots + \hat{p}_N \log L) - (\hat{p}_1 \log \hat{p}_1 + \cdots + \hat{p}_N \log \hat{p}_N) \right] \]

    \[ = L\left[ \log L - \log L - (\hat{p}_1 \log \hat{p}_1 + \cdots + \hat{p}_N \log \hat{p}_N) \right] \]

where we have used the fact that

    \[ L_i = L\hat{p}_i \quad \left(\text{so } \log L_i = \log L + \log \hat{p}_i\right) \quad \text{and} \quad \sum_i \hat{p}_i = 1.   \]

Thus

    \[ \log N_{seq}(L) \approx  L\left[ -(\hat{p}_1 \log \hat{p}_1 + \cdots + \hat{p}_N \log \hat{p}_N) \right] = -L \sum_i \hat{p}_i \log \hat{p}_i. \]

Since our histogram approaches P as L → ∞ (i.e., each p̂_i → p_i), we arrive at

    \[ \log N_{seq}(L) \to -L\sum_i p_i \log p_i . \]

Finally, to obtain a quantity that does not depend on the number of samples we divide both sides by L:

    \[ \frac{\log N_{seq}(L)}{L} \to -\sum_i p_i \log p_i = H[P]. \]

Thus, counting in log units how many different sequences of samples could have produced P, and normalizing by the number of samples, we arrive once again at Eq. 1. That is, Shannon’s formula is a normalized log-count of how many ways a distribution could have emerged.
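As a final numerical sanity check (a minimal sketch; the distribution and sample sizes are made up), one can compute log N_seq(L)/L exactly from the multinomial coefficient and watch it approach H[P] as L grows:

    import math

    p = [0.5, 0.3, 0.2]                       # a made-up distribution P
    H = -sum(q * math.log(q) for q in p)      # Eq. 1, in nats

    for L in [10, 100, 1_000, 10_000]:
        counts = [round(q * L) for q in p]    # idealized histogram with L_i = L * p_i
        # log N_seq = log L! - sum_i log L_i!, computed via lgamma(n + 1) = log(n!)
        log_n_seq = math.lgamma(L + 1) - sum(math.lgamma(c + 1) for c in counts)
        print(L, log_n_seq / L, H)            # log N_seq / L approaches H as L grows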

Further reading

C.E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal 1948.

David MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press 2003. (+ online lectures)
