The Coin Flip That Founded a Science
You flip a fair coin. It lands heads. How much did you just learn?
The question sounds almost absurd — like asking how heavy a thought is. But it turns out there's an exact, rigorous, deeply useful answer: you learned exactly one bit of information. Not "a bit" in the vague English sense of "a small amount." One bit, a binary digit, the fundamental unit of information, as precisely defined as a meter or a kilogram.
Here's why. Before the flip, you had two equally likely possibilities. After the flip, you had one. The outcome resolved exactly one yes-or-no question: "Did it land heads?" That resolution — that collapse from uncertainty to certainty — is worth one bit. If I flip two coins and tell you both results, that's two bits (two binary questions resolved). Three coins? Three bits. You see where this is going.
Now roll a fair six-sided die. How much information does the outcome carry? More than a coin — six possibilities are more uncertain than two — but how much more? The answer is log₂(6) ≈ 2.58 bits. Not a round number, which might feel strange, but we're measuring something continuous: the amount of surprise inherent in learning which of six equally likely outcomes actually occurred.1
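Counting yes-or-no questions is just taking a base-2 logarithm, which is easy to check directly (a quick Python sketch):

```python
import math

# Bits carried by one outcome among n equally likely possibilities: log2(n).
print(math.log2(2))  # fair coin → 1.0
print(math.log2(6))  # fair die → 2.584962500721156
print(math.log2(8))  # three coins → 3.0, three yes-or-no questions resolved
```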
And here is where things get interesting. What if the die is loaded?
The Loaded Die Problem
Suppose one face of the die has a 99% chance of coming up. You roll, and that face appears. Are you surprised? Of course not. You expected it. The outcome barely taught you anything — it merely confirmed what you already knew. Now suppose, against the odds, one of the other five faces appears. That is genuinely surprising. That outcome carried a lot of information.
So the average information per roll of this loaded die should be much less than the 2.58 bits of a fair die. The loaded die is more predictable, less uncertain, less… what's the right word?
Claude Shannon, a 32-year-old mathematician at Bell Labs, found the word in 1948. He called it entropy.2
H = −Σₓ p(x) log₂ p(x)

- H — entropy, the average surprise, measured in bits
- p(x) — the probability of outcome x
- log₂ — logarithm base 2, because we're counting binary questions
Read it aloud: entropy is the expected value of surprise, where the surprise of an event with probability p is −log₂(p). An event that happens with probability 1 (certainty) has surprise 0. An event with probability 1/2 has surprise 1 (one bit). An event with probability 1/64 has surprise 6 bits. The negative sign is there because the logarithm of a number between 0 and 1 is negative, and we want surprise to be positive.
For a fair coin: H = −[½ log₂(½) + ½ log₂(½)] = −[½(−1) + ½(−1)] = 1 bit. Exactly what we said.
For a fair die: H = −6 × [⅙ log₂(⅙)] = log₂(6) ≈ 2.58 bits. Also what we said.
For the 99% loaded die: H ≈ 0.11 bits. Almost zero. Almost nothing to learn.
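The three worked examples can be verified in a few lines of Python. One caveat: the loaded-die value depends on how the remaining 1% is split among the other five faces; spreading it evenly gives about 0.10 bits, in the same ballpark as the figure above.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum of p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))            # fair coin → 1.0
print(entropy([1/6] * 6))             # fair die → ≈ 2.585
print(entropy([0.99] + [0.002] * 5))  # loaded die → ≈ 0.10, almost nothing to learn
```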
Plot entropy against the heads-probability p of a biased coin and you get the binary entropy function, H(p) = −p log₂(p) − (1−p) log₂(1−p). It captures something philosophically profound. The most uncertain situation — the one that carries the most information when resolved — is the fair coin, right at the peak at p = ½. The more lopsided the coin, the less you learn from flipping it. In the extreme, a two-headed coin carries zero information. You already knew what was going to happen.
The Man Who Measured Information
Shannon didn't set out to revolutionize mathematics. He was trying to solve an engineering problem: how much data can you push through a noisy telephone wire?3
Bell Labs in the 1940s was arguably the most productive research institution in human history — transistors, lasers, solar cells, Unix, the cosmic microwave background. Shannon fit right in. He was the kind of person who juggled on a unicycle through the hallways and built a machine whose sole purpose was to turn itself off.4 But his 1948 paper, "A Mathematical Theory of Communication," was his masterpiece, and it wasn't whimsical at all. It was a cathedral.
The paper did something unprecedented: it defined information as a mathematical quantity, separate from meaning. A Shakespeare sonnet and a random string of characters of the same length could carry the same number of bits. Shannon wasn't interested in whether the message was profound or gibberish. He was interested in how many yes-or-no questions you'd need to narrow down which message was sent.
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
— Claude Shannon, 1948
Notice the word "selected." The message is chosen from a set of possibilities. The more possibilities, the more bits needed. The less uniform the probabilities, the fewer bits needed. This is entropy.
A Name Borrowed from Physics
Why "entropy"? The story, possibly apocryphal, involves John von Neumann, who allegedly told Shannon: "You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really means, so in a debate you will always have the advantage."5
But the connection to thermodynamic entropy — Boltzmann's entropy, the arrow-of-time entropy, the entropy you half-remember from chemistry class — is deeper than a naming joke. Boltzmann's entropy S = k_B ln W counts the number of microscopic arrangements (microstates) consistent with a macroscopic observation. Shannon's entropy counts uncertainty about which message was sent. Both measure the number of things you don't know. The mathematical form is essentially identical; the only difference is the logarithm base and a multiplicative constant.
Try It Yourself: The Entropy Calculator
The formula comes alive when you play with it. Adjust the probabilities of a set of outcomes and watch entropy respond. Start with two outcomes (a coin) and work your way up. Notice how entropy peaks when all outcomes are equally likely, and collapses toward zero when any single outcome dominates.
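In place of an interactive widget, a few lines of Python do the same job (the function name and example weights are illustrative): feed it any set of nonnegative weights, and it normalizes them to probabilities and returns the entropy in bits.

```python
import math

def entropy_bits(weights):
    """Normalize nonnegative weights to probabilities, then return Shannon entropy in bits."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([1, 1]))         # fair coin → 1.0 (the peak for two outcomes)
print(entropy_bits([1, 1, 1, 1]))   # four equally likely outcomes → 2.0
print(entropy_bits([97, 1, 1, 1]))  # one outcome dominates → ≈ 0.24, collapsing toward zero
```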
The Limits of Compression
Shannon's entropy isn't just an abstract measure of surprise. It has a stunning operational meaning: entropy is the theoretical minimum number of bits per symbol needed to encode a message. You cannot do better. Period.6
Think about English text. There are 26 letters, so naively you'd need log₂(26) ≈ 4.7 bits per character. But English isn't random. After the letter Q, the next letter is almost certainly U. After "TH", the next letter is probably E, A, or I. After "THE UNITED STA", the next letters are almost certainly "TES," completing "STATES." All this predictability — all this redundancy — means English carries far less than 4.7 bits per character. Shannon himself estimated about 1.0 to 1.5 bits per character, and modern estimates hover around 1.3.7
If a source has entropy H bits per symbol, you can compress it to approximately H bits per symbol but no further. This is Shannon's source coding theorem. A ZIP file shrinks English text by roughly 70% — because English is roughly 70% redundant.
This is why compression algorithms work. Huffman coding, arithmetic coding, LZ77 (the algorithm inside ZIP and gzip) — they all exploit the gap between the naive encoding (4.7 bits/char for English) and the actual entropy (~1.3 bits/char). That gap is redundancy, and redundancy is compressible.
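You can watch that gap being exploited with a general-purpose compressor. A rough sketch using Python's zlib (the sample sentence is arbitrary, and DEFLATE is not an optimal entropy coder, so this only approximates the theoretical limit; heavy repetition makes the redundancy extreme, pushing the rate far below even 1.3 bits/char):

```python
import zlib

# Heavily repetitive English: extreme redundancy, so LZ77 finds long matches.
text = ("the quick brown fox jumps over the lazy dog " * 50).encode("ascii")

compressed = zlib.compress(text, level=9)
print(len(text), "bytes raw")
print(len(compressed), "bytes compressed")
print(f"{8 * len(compressed) / len(text):.2f} bits per character")
```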
And now here's a beautiful connection that leads us straight into the 21st century.
Cross-Entropy and the Ghost in the Machine
Suppose you have the true probability distribution of letters in English (call it p), and you build a model that guesses the distribution (call it q). If your model is perfect — q = p — then the number of bits you'd need per symbol is exactly the entropy H(p). But if your model is imperfect, you'll need more bits. How many more? The answer is the cross-entropy:

H(p, q) = −Σₓ p(x) log₂ q(x)
The difference between cross-entropy and entropy is the Kullback-Leibler divergence, D_KL(p ‖ q) = H(p, q) − H(p), which measures how "wrong" your model is. It's always non-negative — you can never do better than the true distribution.
Here's the punchline: every time you train a neural network for classification — every time you train a language model — the loss function you're minimizing is cross-entropy. The training process is trying to make the model's predicted distribution q match the true distribution p as closely as possible, which means minimizing the KL divergence, which means minimizing the gap between the model's encoding and the theoretically optimal one.8
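Here is that loss computed by hand for a hypothetical three-outcome problem, with a made-up true distribution p and model guess q (real frameworks compute the same quantity from logits, usually in nats rather than bits):

```python
import math

p = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
q = [0.6, 0.3, 0.1]  # model's predicted distribution

H_p = -sum(pi * math.log2(pi) for pi in p)               # entropy of p
H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # cross-entropy H(p, q)
kl = H_pq - H_p                                          # KL divergence, always >= 0

print(f"H(p) = {H_p:.4f} bits, H(p,q) = {H_pq:.4f} bits, D_KL = {kl:.4f} bits")
```

Training pushes q toward p, driving the KL term to zero; at q = p the cross-entropy bottoms out at H(p).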
Shannon published his paper in 1948. Seventy-five years later, his formula is the heartbeat of every large language model. Every time ChatGPT generates a word, it's playing Shannon's game: predicting the next symbol from a probability distribution, trying to minimize entropy. Shannon's ghost is very much in the machine.
What Your Text Reveals
You can measure the entropy of any text. Count the frequency of each character, treat those frequencies as probabilities, and plug them into Shannon's formula. (This gives you the single-character entropy — a lower bound on the true per-character entropy, which would account for patterns between characters.)
English prose typically lands around 4.0–4.2 bits/char at the single-character level, because some letters (E, T, A) are much more common than others (Q, Z, X). Random text, where every character is equally likely, would hit the maximum: log₂(26) ≈ 4.7 bits/char. The gap tells you about structure — about the fingerprint of a language.
Try it on any text you like: tally the characters, apply the formula, and watch the entropy emerge.
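A minimal single-character estimator in Python (it counts letters only, which matches the 26-letter figures above; including spaces and punctuation changes the numbers):

```python
import math
from collections import Counter

def char_entropy(text):
    """Single-character Shannon entropy in bits, over case-folded letters only."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "information theory began with a coin flip and a loaded die"
print(f"{char_entropy(sample):.2f} bits per character")
```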
Why This Matters
Shannon entropy is one of those rare ideas that starts as a solution to a specific problem (telephone engineering) and expands until it touches everything. Here's a partial list of where entropy shows up:
Data compression: ZIP, gzip, PNG, MP3 — all rely on Shannon's source coding theorem.
Cryptography: A good encryption scheme maximizes the entropy of ciphertext. The one-time pad achieves perfect secrecy precisely because it makes every ciphertext equally likely.
Machine learning: Cross-entropy loss is the standard objective for classification. Decision trees use information gain (entropy reduction) to choose splits.
Statistics: The maximum entropy principle says: given partial information about a distribution, choose the one with maximum entropy. It's the least presumptuous choice.
Biology: DNA has about 1.95 bits per base pair (close to the maximum of 2, since there are 4 bases).
Ecology: The Shannon diversity index measures species diversity — same formula, different application.
Gambling: Kelly criterion, meet entropy. The long-run growth rate of an optimal bettor is governed by the KL divergence between the true outcome probabilities and the ones implied by the posted odds.
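To make one entry from that list concrete: the information gain a decision tree uses to score a split is just entropy before minus the weighted entropy after (a toy sketch with made-up class counts):

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Hypothetical node with 10 positive and 10 negative examples,
# split into a (9, 1) branch and a (1, 9) branch of equal size.
parent = entropy([10, 10])  # 1.0 bit of uncertainty before the split
children = 0.5 * entropy([9, 1]) + 0.5 * entropy([1, 9])
gain = parent - children    # ≈ 0.53 bits of uncertainty removed

print(f"information gain = {gain:.3f} bits")
```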
Shannon gave us a ruler for measuring the invisible. Before him, "information" was a vague, qualitative word — something you could have more or less of but couldn't count. After him, information was as measurable as length or weight. That single act of quantification — of turning a fuzzy concept into a precise formula — is perhaps the most consequential mathematical move of the twentieth century.
The next time you send a text message, compress a file, or watch a neural network train, remember: you're watching Shannon's formula at work. The mathematics of surprise, written in 1948 by a juggling unicyclist at Bell Labs, quietly running the modern world.