The Missing Chapter

How to Measure Surprise

Shannon entropy and the mathematics of "I didn't see that coming"

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 87

The Coin Flip That Founded a Science

You flip a fair coin. It lands heads. How much did you just learn?

The question sounds almost absurd — like asking how heavy a thought is. But it turns out there's an exact, rigorous, deeply useful answer: you learned exactly one bit of information. Not "a bit" in the vague English sense of "a small amount." One bit, a binary digit, the fundamental unit of information, as precisely defined as a meter or a kilogram.

Here's why. Before the flip, you had two equally likely possibilities. After the flip, you had one. The outcome resolved exactly one yes-or-no question: "Did it land heads?" That resolution — that collapse from uncertainty to certainty — is worth one bit. If I flip two coins and tell you both results, that's two bits (two binary questions resolved). Three coins? Three bits. You see where this is going.

Now roll a fair six-sided die. How much information does the outcome carry? More than a coin — six possibilities are more uncertain than two — but how much more? The answer is log₂(6) ≈ 2.58 bits. Not a round number, which might feel strange, but we're measuring something continuous: the amount of surprise inherent in learning which of six equally likely outcomes actually occurred.[1]

And here is where things get interesting. What if the die is loaded?

The Loaded Die Problem

Suppose one face of the die has a 99% chance of coming up. You roll, and that face appears. Are you surprised? Of course not. You expected it. The outcome barely taught you anything — it merely confirmed what you already knew. Now suppose, against the odds, one of the other five faces appears. That is genuinely surprising. That outcome carried a lot of information.

So the average information per roll of this loaded die should be much less than the 2.58 bits of a fair die. The loaded die is more predictable, less uncertain, less… what's the right word?

Claude Shannon, a 32-year-old mathematician at Bell Labs, found the word in 1948. He called it entropy.[2]

Shannon Entropy
H = −Σ p(x) log₂ p(x), where the sum runs over every possible outcome x

  H       Entropy — the average surprise, measured in bits
  p(x)    The probability of outcome x
  log₂    Logarithm base 2 — because we're counting binary questions

Read it aloud: entropy is the expected value of surprise, where the surprise of an event with probability p is −log₂(p). An event that happens with probability 1 (certainty) has surprise 0. An event with probability 1/2 has surprise 1 (one bit). An event with probability 1/64 has surprise 6 bits. The negative sign is there because the logarithm of a number between 0 and 1 is negative, and we want surprise to be positive.
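The surprise function is a one-liner; a minimal Python sketch checking the three values above:

```python
import math

def surprise(p: float) -> float:
    """Surprisal of an event with probability p, in bits: -log2(p)."""
    return -math.log2(p)

print(surprise(1))       # certainty: 0 bits
print(surprise(1 / 2))   # a coin flip: 1 bit
print(surprise(1 / 64))  # a 1-in-64 event: 6 bits
```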

For a fair coin: H = −[½ log₂(½) + ½ log₂(½)] = −[½(−1) + ½(−1)] = 1 bit. Exactly what we said.

For a fair die: H = −6 × [⅙ log₂(⅙)] = log₂(6) ≈ 2.58 bits. Also what we said.

For the 99% loaded die (the remaining 1% split evenly among the other five faces): H ≈ 0.10 bits. Almost zero. Almost nothing to learn.
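All three results can be checked with a direct implementation of Shannon's formula. Note that the loaded-die value depends on how the leftover probability is distributed; the sketch below assumes the remaining 1% is split evenly across the other five faces.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p = 0 terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin  = [0.5, 0.5]
fair_die   = [1 / 6] * 6
loaded_die = [0.99] + [0.002] * 5  # remaining 1% split evenly (an assumption)

print(f"{entropy(fair_coin):.3f} bits")   # 1.000
print(f"{entropy(fair_die):.3f} bits")    # 2.585
print(f"{entropy(loaded_die):.3f} bits")  # 0.104
```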

[Figure: the binary entropy curve H(p), in bits, plotted against p, the probability of heads; the peak is H = 1 bit at p = 0.5.]
The binary entropy function. Maximum uncertainty (1 bit) occurs when p = 0.5 — a perfectly fair coin. Certainty in either direction collapses entropy to zero.

That curve is the binary entropy function, H(p) = −p log₂(p) − (1−p) log₂(1−p), and it captures something philosophically profound. The most uncertain situation — the one that carries the most information when resolved — is the fair coin, right at the peak. The more lopsided the coin, the less you learn from flipping it. In the extreme, a two-headed coin carries zero information. You already knew what was going to happen.
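The curve is easy to trace numerically; a quick sketch confirming that H(p) peaks at p = 0.5 and vanishes at both extremes:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Sample the curve; note the symmetry H(p) = H(1 - p).
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f} bits")
```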


The Man Who Measured Information

Shannon didn't set out to revolutionize mathematics. He was trying to solve an engineering problem: how much data can you push through a noisy telephone wire?[3]

Bell Labs in the 1940s was arguably the most productive research institution in human history — transistors, lasers, solar cells, Unix, the cosmic microwave background. Shannon fit right in. He was the kind of person who juggled on a unicycle through the hallways and built a machine whose sole purpose was to turn itself off.[4] But his 1948 paper, "A Mathematical Theory of Communication," was his masterpiece, and it wasn't whimsical at all. It was a cathedral.

The paper did something unprecedented: it defined information as a mathematical quantity, separate from meaning. A Shakespeare sonnet and a random string of characters of the same length could carry the same number of bits. Shannon wasn't interested in whether the message was profound or gibberish. He was interested in how many yes-or-no questions you'd need to narrow down which message was sent.

"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
— Claude Shannon, 1948

Notice the word "selected." The message is chosen from a set of possibilities. The more possibilities, the more bits needed. The less uniform the probabilities, the fewer bits needed. This is entropy.

A Name Borrowed from Physics

Why "entropy"? The story, possibly apocryphal, involves John von Neumann, who allegedly told Shannon: "You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really means, so in a debate you will always have the advantage."[5]

But the connection to thermodynamic entropy — Boltzmann's entropy, the arrow-of-time entropy, the entropy you half-remember from chemistry class — is deeper than a naming joke. Boltzmann's entropy S = k_B ln W counts the number of microscopic arrangements (microstates) consistent with a macroscopic observation. Shannon's entropy counts uncertainty about which message was sent. Both measure the number of things you don't know. The mathematical form is essentially identical; the only difference is the logarithm base and a multiplicative constant.
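The "same skeleton" claim can be made exact. For W equally likely microstates, p(x) = 1/W for each, and Shannon's formula reduces to Boltzmann's up to a constant:

```latex
H = -\sum_{x=1}^{W} \frac{1}{W} \log_2 \frac{1}{W} = \log_2 W,
\qquad
S = k_B \ln W = (k_B \ln 2)\, H .
```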

              Shannon (1948)       Boltzmann (1877)
  Formula     H = −Σ p log₂ p      S = k_B ln W
  Measures    uncertainty          disorder
  Units       bits                 J/K
  Domain      communication        thermodynamics

Same math, different constants.
Two entropies, seventy years apart, with the same mathematical skeleton. The connection is not a coincidence — both count "ways things could be."

Try It Yourself: The Entropy Calculator

The formula comes alive when you play with it. Take a set of outcomes, adjust their probabilities, and watch how the entropy responds. Start with two outcomes (a coin) and work your way up. Notice how entropy peaks when all outcomes are equally likely, and collapses toward zero when any single outcome dominates.

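No widget is required; the calculator's core logic fits in a few lines of Python. This sketch takes raw weights (think slider positions), normalizes them into probabilities, and reports the entropy alongside the maximum possible for that many outcomes:

```python
import math

def entropy_calculator(weights):
    """Normalize raw weights to probabilities; return (H, H_max) in bits."""
    total = sum(weights)
    probs = [w / total for w in weights]
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    h_max = math.log2(len(weights))  # achieved when all outcomes are equally likely
    return h, h_max

h, h_max = entropy_calculator([1, 1])          # a fair coin
print(f"{h:.3f} of {h_max:.3f} bits possible") # 1.000 of 1.000
h, h_max = entropy_calculator([99, 1, 1, 1])   # one outcome dominates
print(f"{h:.3f} of {h_max:.3f} bits possible")
```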

The Limits of Compression

Shannon's entropy isn't just an abstract measure of surprise. It has a stunning operational meaning: entropy is the theoretical minimum number of bits per symbol needed to encode a message. You cannot do better. Period.[6]

Think about English text. There are 26 letters, so naively you'd need log₂(26) ≈ 4.7 bits per character. But English isn't random. After the letter Q, the next letter is almost certainly U. After "TH", the next letter is probably E, A, or I. After "THE UNITED STA", the continuation is almost certainly "TES." All this predictability — all this redundancy — means English carries far less than 4.7 bits per character. Shannon himself estimated about 1.0 to 1.5 bits per character, and modern estimates hover around 1.3.[7]

The Compression Limit

If a source has entropy H bits per symbol, you can compress it to approximately H bits per symbol but no further. This is Shannon's source coding theorem. A ZIP file shrinks English text by roughly 70% — because English is roughly 70% redundant.

This is why compression algorithms work. Huffman coding, arithmetic coding, LZ77 (the algorithm inside ZIP and gzip) — they all exploit the gap between the naive encoding (4.7 bits/char for English) and the actual entropy (~1.3 bits/char). That gap is redundancy, and redundancy is compressible.
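Here's a sketch of the simplest of those ideas, Huffman coding: repeatedly merge the two least frequent symbols into a subtree, so common symbols end up with short codes and rare ones with long codes. The sample string is just an illustration.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each character to a binary code; frequent characters get shorter codes."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tiebreaker, {char: code-so-far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        # Prefix one subtree's codes with 0, the other's with 1, and merge.
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes("this is an example of huffman coding")
for ch in sorted(codes, key=lambda c: len(codes[c])):
    print(repr(ch), codes[ch])
```

The resulting code is prefix-free: no codeword is a prefix of another, so a bit stream can be decoded unambiguously without separators.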

And now here's a beautiful connection that leads us straight into the 21st century.

Cross-Entropy and the Ghost in the Machine

Suppose you have the true probability distribution of letters in English (call it p), and you build a model that guesses the distribution (call it q). If your model is perfect — q = p — then the number of bits you'd need per symbol is exactly the entropy H(p). But if your model is imperfect, you'll need more bits. How many more? The answer is the cross-entropy:

Cross-Entropy
H(p, q) = −Σ p(x) log₂ q(x)
Always ≥ H(p), with equality iff q = p

The difference between cross-entropy and entropy is the Kullback–Leibler divergence, D_KL(p ‖ q) = H(p, q) − H(p), which measures how "wrong" your model is. It's always non-negative — you can never do better than the true distribution.
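Both quantities are direct translations of the formulas above; a sketch with a made-up true distribution p and an imperfect model q:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p): the extra bits an imperfect model costs."""
    return cross_entropy(p, q) - cross_entropy(p, p)  # H(p, p) is just H(p)

p = [0.5, 0.25, 0.25]  # "true" distribution (made up for illustration)
q = [0.4, 0.4, 0.2]    # an imperfect model of it

print(f"H(p)    = {cross_entropy(p, p):.3f} bits")
print(f"H(p, q) = {cross_entropy(p, q):.3f} bits")  # always >= H(p)
print(f"D_KL    = {kl_divergence(p, q):.3f} bits")  # >= 0, zero iff q == p
```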

Here's the punchline: every time you train a neural network for classification — every time you train a language model — the loss function you're minimizing is cross-entropy. The training process is trying to make the model's predicted distribution q match the true distribution p as closely as possible, which means minimizing the KL divergence, which means minimizing the gap between the model's encoding and the theoretically optimal one.[8]

Shannon published his paper in 1948. Seventy-five years later, his formula is the heartbeat of every large language model. Every time ChatGPT generates a word, it's playing Shannon's game: predicting the next symbol from a probability distribution, trying to minimize entropy. Shannon's ghost is very much in the machine.

[Figure: Entropy (how uncertain is the source?) → Compression (encode at ≈ H bits/symbol) → Machine learning (minimize cross-entropy loss).]
Shannon's 1948 idea flows through modern technology: entropy measures information, compression encodes it efficiently, and machine learning models are trained by minimizing the gap between predicted and true distributions.

What Your Text Reveals

You can measure the entropy of any text. Count the frequency of each character, treat those frequencies as probabilities, and plug them into Shannon's formula. (This gives you the single-character entropy — a lower bound on the true per-character entropy, which would account for patterns between characters.)

English prose typically lands around 4.0–4.2 bits/char at the single-character level, because some letters (E, T, A) are much more common than others (Q, Z, X). Random text, where every character is equally likely, would hit the maximum: log₂(26) ≈ 4.7 bits/char. The gap tells you about structure — about the fingerprint of a language.

Try it yourself: count the character frequencies of any text and watch the entropy emerge.

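The measurement is straightforward to reproduce; a sketch that computes per-character entropy, the maximum possible for the characters actually used, and the resulting redundancy estimate:

```python
import math
from collections import Counter

def analyze(text):
    """Per-character entropy, max possible for the alphabet used, and redundancy."""
    freq = Counter(text)
    n = len(text)
    probs = [count / n for count in freq.values()]
    h = -sum(p * math.log2(p) for p in probs)
    h_max = math.log2(len(freq))  # if every seen character were equally likely
    redundancy = 1 - h / h_max if h_max > 0 else 0.0
    return h, h_max, redundancy

h, h_max, r = analyze("the quick brown fox jumps over the lazy dog")
print(f"{h:.2f} bits/char (max {h_max:.2f}), redundancy {r:.0%}")
```

This is the single-character estimate described above: a lower bound, since it ignores patterns between characters.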

Why This Matters

Shannon entropy is one of those rare ideas that starts as a solution to a specific problem (telephone engineering) and expands until it touches everything. Here's a partial list of where entropy shows up:

Data compression: ZIP, gzip, PNG, MP3 — all rely on Shannon's source coding theorem.

Cryptography: A good encryption scheme maximizes the entropy of ciphertext. The one-time pad achieves perfect secrecy precisely because it makes every ciphertext equally likely.

Machine learning: Cross-entropy loss is the standard objective for classification. Decision trees use information gain (entropy reduction) to choose splits.

Statistics: The maximum entropy principle says: given partial information about a distribution, choose the one with maximum entropy. It's the least presumptuous choice.

Biology: DNA has about 1.95 bits per base pair (close to the maximum of 2, since there are 4 bases).

Ecology: The Shannon diversity index measures species diversity — same formula, different application.

Gambling: Kelly criterion, meet entropy. Betting to maximize the expected logarithm of wealth, the achievable growth rate is governed by the gap between your probabilities and the ones implied by the odds — an information edge.

Shannon gave us a ruler for measuring the invisible. Before him, "information" was a vague, qualitative word — something you could have more or less of but couldn't count. After him, information was as measurable as length or weight. That single act of quantification — of turning a fuzzy concept into a precise formula — is perhaps the most consequential mathematical move of the twentieth century.

The next time you send a text message, compress a file, or watch a neural network train, remember: you're watching Shannon's formula at work. The mathematics of surprise, written in 1948 by a juggling unicyclist at Bell Labs, quietly running the modern world.

Notes & References

  1. Why log base 2? Because we're counting binary questions. If you used base e, the unit would be "nats" (natural units); base 10 gives "hartleys." The choice is a convention, not a deep fact. Shannon used base 2, establishing the bit as the standard unit.
  2. Claude E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656 (1948). The paper was later republished as a book with Warren Weaver's expository introduction: The Mathematical Theory of Communication (University of Illinois Press, 1949).
  3. Shannon's earlier master's thesis at MIT (1937) had already shown that Boolean algebra could be used to design switching circuits — essentially founding digital circuit design. The man had two epoch-making ideas before turning 35.
  4. The "Ultimate Machine" — a box with a single switch. When you flip the switch, a hand emerges from the box, turns the switch off, and retreats. Shannon built it based on an idea by Marvin Minsky. See Jimmy Soni and Rob Goodman, A Mind at Play: How Claude Shannon Invented the Information Age (Simon & Schuster, 2017).
  5. The story appears in Myron Tribus and Edward C. McIrvine, "Energy and Information," Scientific American, Vol. 225, No. 3 (1971). Whether von Neumann actually said it remains debated; see Soni and Goodman (2017), pp. 147–148.
  6. This is Shannon's source coding theorem (the noiseless coding theorem). His channel coding theorem is even more remarkable: it says there exists a coding scheme that can transmit data at any rate below channel capacity with arbitrarily low error probability. The proofs are non-constructive — Shannon showed such codes exist without telling you how to build them. Finding practical codes that approach capacity took decades (turbo codes in 1993, LDPC codes rediscovered in the 1990s).
  7. Shannon estimated English entropy through a guessing-game experiment. Modern estimates using statistical and neural language models (Brown et al., 1992; more recently GPT-based estimates) place it around 1.2–1.4 bits per character when accounting for long-range context. See C. E. Shannon, "Prediction and Entropy of Printed English," Bell System Technical Journal, 30(1), 50–64 (1951).
  8. In practice, machine learning uses the natural logarithm (nats) rather than log base 2 (bits), but the principle is identical. The cross-entropy loss for a classification task with true label y and predicted probabilities q is −log q(y), averaged over examples. Minimizing this is equivalent to minimizing KL divergence between the empirical distribution and the model.