The Word That Won't Shut Up
Open any book, in any language, and count the words. The most common word will appear roughly twice as often as the second most common, three times as often as the third, four times as often as the fourth. This isn't a quirk of one language or a loose tendency. It's one of the most reliable statistical laws ever discovered—and nobody can fully explain why.
In English, the winner is "the." It accounts for roughly 7% of the words in a typical text. The runner-up, "of," clocks in at around 3.5%. Then "and" at about 2.8%, "to" at 2.7%, and so on down a long, steep decline. By the time you reach the 100th most common word, you're looking at something that appears maybe once every thousand words. By the 10,000th, you're in the territory of "bioluminescence" and "defenestration"—lovely words, but vanishingly rare.1
In 1949, a Harvard linguist named George Kingsley Zipf gathered this observation into a crisp mathematical statement: the frequency of a word is inversely proportional to its rank. If the most common word appears f times, the nth most common appears roughly f/n times. Or, to put it in a formula that fits on a napkin:

$$ f(n) \approx \frac{f}{n} $$
Zipf checked this in English, German, Latin, and Chinese. It held everywhere. Later researchers confirmed it in every human language tested—including extinct ones, constructed ones, and even the babbling patterns of young children.2
This would be a nice curiosity—a quirk of how humans use language—if it stopped there. But it doesn't stop there. It doesn't even slow down.
The Law That's Everywhere
Consider the populations of American cities. New York, the largest, has about 8.3 million people. Los Angeles: 3.9 million, roughly half. Chicago: 2.7 million, roughly a third. Houston: 2.3 million, roughly a quarter. The pattern isn't perfect—Houston is a bit big for its rank, Phoenix a bit small—but the general shape is unmistakable.3 Plot population against rank on logarithmic axes and you get something very close to a straight line with slope −1.
Now consider website traffic. Google dominates. YouTube is roughly half of Google. Facebook roughly a third. Then comes a long, long tail of smaller sites, each with a fraction of the traffic of the one ranked above it. The same pattern. The same slope.
Music streaming? The top song on Spotify might get 100 million plays in a month. The 100th most popular song gets about a million. The 10,000th? Maybe 10,000 plays. Plot it: a straight line on log-log paper.
Word frequencies in every language. City populations within countries. Company sizes by revenue. Earthquake magnitudes. Moon crater diameters. Wealth distribution. Citations of scientific papers. Casualties in wars. Downloads of apps. Frequency of last names.
All follow approximately the same law: a few giants, a handful of medium players, and a vast ocean of small fry.
This is the kind of universality that makes physicists excited and mathematicians suspicious. When the same pattern shows up in domains that have nothing in common—no shared mechanism, no shared history, no shared anything—either there's a deep reason, or there's a deep confusion. In the case of Zipf's law, we've got a healthy dose of both.
On a log-log plot, Zipf's law becomes a straight line. Real word frequencies hug it remarkably closely.
The Mathematics of Inequality
Let's be a bit more precise. Zipf's law says that if you rank items by their size (or frequency, or population, or whatever), the size of the item at rank r is proportional to 1/r^α, where α is an exponent close to 1. When α = 1 exactly, you get the classical Zipf's law. When α drifts a bit above or below 1, you get what statisticians call a power law—a close cousin.4
The magic of a power law is what it looks like on a log-log plot. Write the law as f(r) = C/r^α; take the logarithm of both frequency and rank, and the relationship becomes:

$$ \log f(r) = \log C - \alpha \log r $$
This is why researchers love log-log plots. The messy curve of raw data—that vertiginous plunge from "the" down to "defenestration"—becomes a tidy straight line. You can read the exponent right off the slope. And you can see, at a glance, whether the data really follows Zipf's law or merely looks like it does for the top few ranks before wandering off.
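Here's what reading the exponent off the slope looks like in practice: a minimal Python sketch, tested on synthetic data that follows an exact 1/r law rather than on any real corpus.

```python
import numpy as np

def fit_zipf_exponent(counts):
    """Fit f(r) = C / r^alpha by least squares on log-log axes,
    where the slope of log f against log r is -alpha."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]  # largest first
    ranks = np.arange(1, len(counts) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope, np.exp(intercept)  # (alpha, C)

# Sanity check on synthetic data that follows an exact 1/r law:
alpha, C = fit_zipf_exponent([10_000 / r for r in range(1, 501)])
print(f"fitted alpha = {alpha:.3f}")  # prints 1.000
```

One caveat: a straight-line fit on log-log axes is the classic eyeball method, but it's statistically naive. For serious work, maximum-likelihood estimators are preferred, because noise in the sparse tail can bias the regression.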
There's a related observation that's even older. In 1896, the Italian economist Vilfredo Pareto noticed that 80% of Italy's land was owned by 20% of the population. His "80/20 rule" is really Zipf's law in disguise—or rather, both are manifestations of the same underlying power-law distribution. If wealth follows a Zipfian distribution, then a small fraction of people will inevitably hold a large fraction of the total. The exact split depends on the exponent α, but the qualitative story is always the same: massive concentration at the top, a long tail at the bottom.5
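In fact, you can get the 80/20 arithmetic on a napkin too. Assume α = 1 and, say, N = 10,000 items. The share held by the top k ranks is then a ratio of harmonic numbers, and harmonic numbers grow like logarithms:

$$ \text{share of top } k \;=\; \frac{H_k}{H_N} \;\approx\; \frac{\ln k + \gamma}{\ln N + \gamma}, \qquad \gamma \approx 0.577. $$

Plug in k = 2,000, the top 20%: that's (ln 2000 + γ)/(ln 10,000 + γ) ≈ 8.18/9.79 ≈ 0.84, so the top fifth holds about 84% of the total. The exact figure shifts with α and N, but Pareto's ballpark falls out of the harmonic series almost for free.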
Why? (Three Answers, Zero Consensus)
So: why does Zipf's law hold? You'd think that after seventy-five years of study, we'd have nailed this down. We haven't. Instead, we have a buffet of competing explanations, each plausible, none definitive. Here are the three most important.
1. The Rich Get Richer
In 1955, Herbert Simon—future Nobel laureate, polymathic genius—proposed a model based on what we now call "preferential attachment" (the catchy name came decades later, from network science). The idea is simple: the probability that a word (or city, or website) grows is proportional to how big it already is. Popular things attract more attention, which makes them more popular, which attracts more attention.
Simon proved mathematically that this feedback loop produces Zipf's law. It's not hard to see intuitively: if "the" gets used a lot, people get used to using it, which makes them use it more, which makes it even more common. Cities work the same way: people move to big cities because that's where the jobs are, which creates more jobs, which attracts more people. It's positive feedback with a mathematical punchline.6
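The feedback loop is easy to simulate. Here's a stripped-down, illustrative sketch of a Simon-style process in Python (the 200,000 steps and 5% innovation rate are arbitrary choices, not Simon's parameters): at each step a brand-new word appears with small probability; otherwise an existing word recurs, with probability proportional to how often it has already appeared.

```python
import random
from collections import Counter

def simon_process(steps=200_000, p_new=0.05, seed=1):
    """Rich-get-richer: with probability p_new, coin a brand-new word;
    otherwise repeat a past word with probability proportional to its count."""
    rng = random.Random(seed)
    history = [0]   # word 0 gets things started
    next_word = 1
    for _ in range(steps):
        if rng.random() < p_new:
            history.append(next_word)
            next_word += 1
        else:
            # A uniform draw from past occurrences is exactly
            # "probability proportional to current count."
            history.append(rng.choice(history))
    return Counter(history)

counts = sorted(simon_process().values(), reverse=True)
for r in (1, 10, 100, 1000):
    print(f"rank {r:>4}: count {counts[r - 1]:>6}, rank x count = {r * counts[r - 1]}")
```

If the result is Zipfian with α near 1, rank times count should hover around a constant, and in this simulation it roughly does, up to noise in the tail.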
2. Information Theory
Benoit Mandelbrot—yes, the fractal guy—came at the problem from a completely different angle. He argued that Zipf's law emerges from the requirement that language be an efficient coding system. If you want to transmit the maximum amount of information per character, you should make the most common words short and the rare words long (which, in fact, they are). Mandelbrot showed that under certain optimality conditions, the resulting frequency distribution follows Zipf's law.
Mandelbrot also proposed a generalization that handles the slight curvature you often see in real data:

$$ f(r) = \frac{C}{(r + q)^{\alpha}} $$
The extra parameter q shifts the distribution slightly, accounting for the fact that the very top-ranked items sometimes don't quite fit the pure 1/r pattern. It's a minor tweak, but it fits real data almost perfectly.
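To see what q does, compare the two formulas' predictions at the very top of the ranking, everything scaled relative to rank 1 (the q = 2.7 below is purely illustrative, not a fitted constant):

```python
alpha, q = 1.0, 2.7  # illustrative values, not fitted to any corpus

print("rank  pure Zipf  Zipf-Mandelbrot")
for r in range(1, 6):
    zipf = 1 / r ** alpha                        # f(r)/f(1) under pure Zipf
    zm = (1 + q) ** alpha / (r + q) ** alpha     # f(r)/f(1) with the q shift
    print(f"{r:>4}  {zipf:9.2f}  {zm:15.2f}")
```

Pure Zipf says rank 2 gets half of rank 1's share; with q = 2.7 the drop near the head is much gentler, which is exactly the flattening real corpora tend to show, while the two curves agree further down the ranks.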
3. Monkeys at Typewriters
Here's the most provocative explanation, and the one that drives Zipf enthusiasts craziest. In 1992, Wentian Li published a paper showing that if you let a monkey type randomly on a keyboard—including a space bar that separates "words"—the resulting "words" follow a Zipf-like power law.7
Let that sink in. Random typing produces Zipfian frequencies. No meaning, no grammar, no optimization—just chance. This suggests that Zipf's law might not be telling us anything deep about language or cities or wealth at all. It might be a statistical artifact, emerging from any process that produces items of varying length or size from a bounded set of components.
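Li's result is easy to reproduce. Here's an illustrative sketch: a five-letter "keyboard" plus a space bar that's hit 20% of the time (both parameters are arbitrary choices, not Li's).

```python
import random
from collections import Counter

def monkey_words(n_keys=1_000_000, letters="abcde", p_space=0.2, seed=42):
    """Random typing: each keystroke is the space bar with probability
    p_space, otherwise a uniformly random letter. Spaces end 'words'."""
    rng = random.Random(seed)
    keys = "".join(" " if rng.random() < p_space else rng.choice(letters)
                   for _ in range(n_keys))
    return Counter(keys.split())

counts = sorted(monkey_words().values(), reverse=True)
for r in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"rank {r:>3}: count {counts[r - 1]}")
```

Plot the counts on log-log axes and you get a power-law staircase: plateaus appear because all words of a given length are equally likely, but the overall slope is unmistakably Zipf-like. No meaning required.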
The debate between "Zipf's law is profound" and "Zipf's law is trivial" has been running for decades, and it's not resolved. The truth, as with many things in mathematics, is probably somewhere in the middle. The law might be easy to produce—many simple processes generate it—but the particular exponent you observe in a given domain probably does carry real information about the underlying mechanism.
Three competing explanations for Zipf's law—each compelling, none complete.
Try It Yourself
Don't take my word for it. Paste any text below—a speech, a chapter of a novel, your own writing—and watch Zipf's law emerge. The analyzer will count word frequencies, plot them on log-log axes, and fit the Zipf exponent. If you're lazy (no judgment), try one of the presets.
📊 Zipf Analyzer
Paste text or choose a preset. The analyzer counts word frequencies and fits Zipf's law.
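If you'd rather run the experiment offline, the heart of such an analyzer is a few lines of Python. Here's a minimal sketch (the filename is a placeholder; point it at any long text):

```python
import re
from collections import Counter

def zipf_table(text, top=10):
    """Rank words by frequency and print rank x count,
    which Zipf's law predicts should stay roughly constant."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    print(f"{'rank':>4}  {'word':<12}{'count':>7}  {'rank x count':>12}")
    for r, (word, count) in enumerate(ranked, start=1):
        print(f"{r:>4}  {word:<12}{count:>7}  {r * count:>12}")

with open("any_long_text.txt") as f:   # placeholder path
    zipf_table(f.read())
```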
Scale Invariance and the Long Tail
There's something philosophically dizzying about Zipf's law, and it has to do with the concept of scale invariance. A Zipfian distribution looks the same no matter what scale you examine it at. Zoom in on the top 10 items, and they follow the law. Zoom in on items 100–1,000, and they follow it too. There's no characteristic scale—no natural "size" at which things cluster. This is the hallmark of a power law, and it connects Zipf's law to a whole family of mathematical phenomena, from fractals to critical phase transitions in physics.8
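The algebra behind the dizziness is one line. Rescale the rank by any factor λ, and a power law changes only by a constant factor, the same factor at every rank:

$$ f(r) = \frac{C}{r^{\alpha}} \quad\Longrightarrow\quad \frac{f(\lambda r)}{f(r)} = \lambda^{-\alpha} \quad \text{for every } r. $$

Halving your position in the ranking costs the same factor whether you start at rank 10 or rank 10,000. Power laws are the only functions with this property, which is why they're the signature of scale-free phenomena.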
Contrast this with a normal (Gaussian) distribution, where there's a definite "typical" value. Human heights follow a bell curve: most men are around 5'10", and virtually nobody is 10 feet tall. But in a Zipfian world, there is no "typical." The median city is tiny, the mean is pulled up by a handful of megacities, and there's no meaningful "average city size." The distribution is dominated by its tail.
This has enormous practical consequences. If you're designing a web cache, and page visits follow Zipf's law, then a cache holding just 20% of the pages can serve 80% of the requests. If you're stocking a bookstore, a small number of bestsellers will account for most of your sales. If you're a search engine, a handful of queries ("facebook," "weather," "porn") dominate your traffic, and the rest is an effectively infinite variety of long-tail queries that each appear once in a blue moon.
Zipf's law means optimization is often about the head, not the tail. A cache, a recommendation engine, an inventory system—in each case, getting the top few items right matters far more than handling the rare ones. The 80/20 rule isn't a rough approximation; it's a mathematical consequence of the distribution's shape.
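Here's the cache arithmetic made concrete: a small sketch that computes the hit rate when requests follow Zipf with exponent α and the cache holds the most popular pages (the catalog size and cache fractions below are illustrative).

```python
import numpy as np

def zipf_hit_rate(n_pages=100_000, alpha=1.0, cache_frac=0.2):
    """Fraction of requests served if page popularity follows 1/r^alpha
    and the cache holds the top cache_frac of pages."""
    weights = 1.0 / np.arange(1, n_pages + 1) ** alpha
    k = int(cache_frac * n_pages)
    return weights[:k].sum() / weights.sum()

for frac in (0.01, 0.05, 0.20):
    print(f"top {frac:4.0%} of pages -> {zipf_hit_rate(cache_frac=frac):.0%} of requests")
# top 1% -> 62%, top 5% -> 75%, top 20% -> 87% (for these parameters)
```

With these parameters the top fifth of the pages serves about 87% of the requests; the exact number moves with α and the catalog size, but the dominance of the head doesn't.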
Chris Anderson popularized this idea in his book The Long Tail (2006), arguing that the internet changed the economics of the tail—that Netflix and Amazon could profit from rare items that no physical store could stock. He was right about the opportunity, but the underlying math was Zipf's all along. The long tail was always there; the internet just made it accessible.
Zipf in the Wild
How well does Zipf's law actually hold across different domains? Try the explorer below. Switch between datasets—city populations, word frequencies, website traffic—and see how each one plots on log-log axes. Some follow the law almost perfectly; others deviate at the extremes. The fitted exponent α tells you how steeply the distribution falls off.
🌍 Zipf in the Wild
Explore how Zipf's law appears across different domains. Toggle datasets and compare.
The head-and-tail structure of a Zipfian distribution. A few items dominate; most are rare.
The Moral of the Law
Zipf's law is one of those mathematical facts that's simultaneously obvious and mind-bending. Of course some words are more common than others—but why does the distribution have this exact shape, and why does the same shape show up in city sizes and earthquake magnitudes? Of course wealth is unequal—but why does the specific degree of inequality follow a predictable mathematical curve?
Here's what I think Zipf's law is really telling us, and it's a lesson that Ellenberg would appreciate: many of the patterns we observe aren't features of the specific system we're studying. They're features of the mathematics of ranking itself. Whenever you have items competing for a share of some finite resource—attention, population, wealth, usage—and the competition has any element of cumulative advantage, you get Zipf. Not because of some deep secret about language or cities, but because the mathematics of "more begets more" has a very particular shape.
This doesn't make the law trivial. It makes it a lens. When data doesn't follow Zipf's law, that's when you should get curious. It means something unusual is constraining the system—regulation, physical limits, deliberate design. The deviations from Zipf are where the interesting stories live.
And there's a practical takeaway for anyone who designs systems: don't design for the average. Design for the distribution. In a Zipfian world, the average is a fiction. The reality is a tiny head and an enormous tail, and the best designs acknowledge both.
Zipf's law won't tell you why "the" beats "of." It won't tell you why New York is bigger than Los Angeles. But it will tell you, with remarkable precision, by how much. And sometimes, knowing the shape of the world is enough.