The Missing Chapters

Goodhart's Law

Why every metric eventually fails — and what to do about the wreckage.

~28 min read · Interactive simulation included

Chapter 1

The Nail Factory

Here is a story they tell about the Soviet Union, and like the best stories about the Soviet Union, it is both hilarious and quietly horrifying.

Moscow, in its infinite wisdom, wanted to know how well its nail factories were performing. Reasonable enough. You need nails, you build factories, you want to make sure the factories are making nails. So the central planners did what central planners do: they set a target. The target was total weight of nails produced.

The factories immediately started producing enormous nails. Giant, useless, railroad-spike-sized monstrosities that no carpenter could drive into a plank. But the tonnage numbers looked spectacular.

The planners caught on and changed the metric to total number of nails. The factories pivoted overnight. Now they produced tiny, hair-thin nails — essentially metal slivers — by the millions. Useless in a different way, but the count was magnificent.

This story may be apocryphal, but it doesn't matter, because its truth is deeper than history. It captures something about the relationship between what we measure and what we want that turns out to be one of the most important ideas in modern life. The thing the planners wanted was "good nails, in the right quantities." The thing they measured was a number. And the moment they started optimizing the number, the nails went sideways.


The same factory, the same workers, the same metal — just a different number on the wall in Moscow.

In 1975, a British economist named Charles Goodhart noticed something similar happening — not in nail factories, but in the beating heart of the Bank of England.

· · ·
Chapter 2

What Goodhart Actually Said

Charles Goodhart was advising the Bank of England on monetary policy. The bank, like central banks everywhere, was trying to control inflation by targeting the money supply. The logic was elegant: inflation correlates with how much money is sloshing around the economy. Control the money supply, control inflation. Simple.

Except it wasn't simple at all. The moment the Bank of England started targeting a specific monetary aggregate — say, M3, the broad money supply — the historical relationship between M3 and inflation broke down. The number they were steering by had stopped being a reliable compass the instant they grabbed the wheel.1

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

That's Goodhart's Law, in Goodhart's own rather dry wording from 1975. It doesn't sound like much. It sounds like the kind of thing an economist would say at a conference before everyone shuffled off to lunch. But buried in that bland sentence is a kind of mathematical tragedy — a deep statement about the impossibility of indirect measurement under optimization pressure.

The anthropologist Marilyn Strathern later sharpened it into the version most people know: "When a measure becomes a target, it ceases to be a good measure."2

But let's be precise about why this happens, because the why is where the mathematics lives.

The Core Problem

Before optimization:
M ≈ V + ε

After optimization pressure on M:
M decouples from V

M = measured proxy, V = true value you care about, ε = noise. The correlation between M and V breaks down once agents optimize for M directly.

Think of it this way. There's something you actually care about — call it V, for value. You can't observe V directly (if you could, you wouldn't need a metric). So you find some proxy M that correlates with V. As long as nobody is specifically trying to push M around, the correlation holds. M is a decent window into V.

But the moment you announce that M is the target — that bonuses depend on M, that careers rise and fall with M, that the entire system orients itself around maximizing M — you've changed the game. Now there are agents in the system whose entire incentive is to push M upward, and they'll find ways to do it that have nothing to do with V.

· · ·
Chapter 3

The Zoo of Perverse Incentives

Goodhart's Law has cousins. It's not alone in the intellectual zoo. In fact, it shares a cage with several creatures that all bite in the same way.

Campbell's Law

In 1979, the social psychologist Donald Campbell put it this way: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."3

Campbell was thinking about education. About standardized test scores. About what happens when an entire school system's funding depends on whether kids bubble in the right circles on a Scantron sheet.

When No Child Left Behind tied school funding to standardized test scores in the United States, teachers didn't just "teach to the test" — they became the test. Curriculum narrowed to what was measured. Art vanished. Recess shrank. Some districts were caught literally erasing wrong answers and filling in correct ones.4

The metric (test scores) went up. The thing the metric was supposed to measure (educational quality) is... harder to say. But there's a telling fact: international rankings of U.S. students didn't budge.

The Cobra Effect

The most delightful variant comes from colonial India. The British government in Delhi, troubled by the number of venomous cobras, offered a bounty for every dead cobra brought in. Enterprising citizens began breeding cobras. When the government discovered this and scrapped the bounty program, the breeders released their now-worthless snakes into the wild, and the cobra population ended up higher than before.5

The Pattern

Every one of these stories has the same skeleton: (1) identify something you want, (2) find a measurable proxy, (3) incentivize the proxy, (4) watch in horror as the proxy diverges from the thing you wanted. The specifics change — nails, cobras, test scores — but the structure is invariant.

The Rat Farms of Hanoi

Lest you think this is ancient history: in French colonial Hanoi, the government paid citizens per rat tail delivered. Rat catchers started cutting off tails and releasing the rats alive — to breed more profitable rats. Some people even started rat farms.6

The lesson is always the same, and it's always learned too late.

STEP 1: Want something → STEP 2: Find a proxy → STEP 3: Incentivize it → STEP 4: Diverge

"Good nails" → Tonnage → Bonuses tied to it → Giant useless spikes
"Fewer cobras" → Dead cobras → Bounty per head → Cobra farms
"Smart kids" → Test scores → School funding → Erased answer sheets

...and it seems reasonable at every step.

The four-step skeleton of every Goodhart failure. Each step feels perfectly reasonable. The disaster is emergent.

· · ·
Chapter 4

The Silicon Valley Edition

Now let's talk about where you actually live. Because Goodhart's Law isn't a curiosity from Soviet factories or colonial capitals. It's the operating system of modern technology companies, and it's eating the world in real time.

In the early days of software engineering, managers needed to know who was productive. The obvious metric: lines of code written. Programmers immediately began writing verbose, sprawling code. Functions that should have been five lines became fifty. Why refactor elegantly when refactoring makes your number go down?

Bill Gates reportedly said: "Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs."

We laugh at lines of code now. We've moved on to much more sophisticated metrics that fail in much more sophisticated ways.

Engagement Metrics

Social media platforms optimize for "engagement" — clicks, time-on-site, shares, comments. The theory: if people are engaged, the product must be good. But what maximizes engagement? Outrage. Conflict. Misinformation that's so wrong people can't stop arguing about it. The most "engaging" content is often the most toxic, because anger is stickier than satisfaction.7

The proxy (engagement) diverged catastrophically from the true objective (user wellbeing, societal health, long-term platform value). And it took years for anyone with power to notice — because the number looked so good.

Citation Counts

In academia, your career depends on citations. Get cited a lot, and you're a star. This has produced an entire ecosystem of gaming: citation rings (I'll cite you if you cite me), salami-slicing (publishing the smallest possible unit of research to maximize paper count), and the curious phenomenon of papers that exist primarily to be cited rather than read.8

The metric became the product. The product was supposed to be knowledge.

Click-Through Rates

Advertisers pay for clicks. So the internet optimizes for clicks. And what gets clicks? Headlines engineered to exploit curiosity gaps. "You Won't BELIEVE What Happened Next." The click-through rate is high. The user satisfaction on the other side of that click? Abysmal. But the number on the dashboard is green, so everything must be fine.

· · ·
Chapter 5

The Mathematics of Divergence

Let me be a little more rigorous about what's happening here, because the math is illuminating and it's not complicated.

Suppose you have a true objective V that you care about. You choose a proxy metric M. Before optimization, M and V are correlated — let's say they move together with some noise:

Before Optimization
M = αV + βG + ε
α = signal strength from true value, β = sensitivity to gaming, G = gaming effort (initially ≈ 0), ε = random noise

When nobody's gaming the system, G ≈ 0, and M ≈ αV + ε. The proxy works. But the moment you incentivize M, rational agents will invest in G — in finding ways to push M up that don't go through V.

The Divergence Dynamics
G(t) = G₀ · e^(rt)
Gaming effort grows exponentially as agents discover and share exploits. r depends on incentive strength and the "attack surface" of the metric.

Here's the key insight: gaming effort grows over time because gaming is learnable. People get better at it. They share techniques. They build tools. The attack surface expands. Meanwhile, the true signal α stays constant or even degrades, because as the metric becomes less informative, people who were genuinely doing good work become demoralized and leave.

The Fundamental Asymmetry

Improving the true objective V is hard. It requires real work, talent, and time. Gaming the metric M is easier and gets easier over time. In any competitive system, the easy path eventually dominates. This is Goodhart's Law expressed as a differential equation: the gaming term grows faster than the signal term.

The situation is even worse than it looks. As more agents learn to game M, the informational content of M degrades — it becomes a noisier and noisier signal for V. Which means that even honest actors who try to use M as feedback for improving V are now navigating by a broken compass.
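Plugging illustrative numbers into these two equations shows how quickly the divergence arrives. The parameter values below are assumptions made for this sketch; the crossover time t* = ln(αV / (βG₀)) / r simply comes from setting the gaming term βG₀e^(rt) equal to the signal term αV:

```python
import math

# Illustrative parameters — all assumed for this sketch.
alpha, V = 1.0, 100.0   # signal strength, true value
beta, G0 = 0.5, 1.0     # gaming sensitivity, initial gaming effort
r = 0.4                 # gaming growth rate per quarter

# The gaming term βG₀e^(rt) overtakes the signal term αV when
#   t* = ln(αV / (βG₀)) / r
t_star = math.log(alpha * V / (beta * G0)) / r
print(f"gaming overtakes signal after {t_star:.1f} quarters")

# Sanity-check against the decomposition M(t) = αV + βG₀e^(rt) + ε:
for t in range(16):
    gaming = beta * G0 * math.exp(r * t)
    note = "  <- metric is now mostly gaming" if gaming > alpha * V else ""
    print(f"q{t:2d}: signal = {alpha * V:6.1f}, gaming = {gaming:8.1f}{note}")
```

With these numbers the gaming term starts at half a percent of the signal and dominates within fourteen quarters — exponential growth does not negotiate.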

Chart: Measured Metric (M) and True Objective (V) plotted against time, with the moment the target is set marked. Before the target, M tracks V. After? The metric soars while reality quietly decays — and the shaded area between the two curves, the Goodhart Gap, is how wrong your dashboard is.

· · ·
Chapter 6

Run a Company (Into the Ground)

Enough theory. Let's watch Goodhart's Law happen in real time. Below is a simulation of a company optimizing a metric. You control the incentive strength — how much pressure you put on the metric. Watch what happens to the true objective versus the measured metric as time passes.

The Goodhart Simulator

An interactive simulation: you're the CEO. Adjust three sliders — Incentive Strength (default 50%), Metric Gameability (default 40%), and Team Integrity (default 60%) — and the simulation runs automatically, plotting the True Objective (V), the Measured Metric (M), and Gaming Effort (G) over 24 quarters, along with the divergence between M and V. Watch the gap between what you think is happening and what's actually happening.
Play with the sliders. Notice what happens when you crank incentive strength to maximum with high gameability. That's the nail factory. Now try high incentive strength with low gameability and high integrity — that's the rare case where metrics actually work, at least for a while.

What the Simulation Shows

The measured metric always goes up — that's the point, you're incentivizing it. But the true objective often diverges and can even decline, because effort flows toward gaming rather than genuine improvement. The gap between the red line and the green line is the Goodhart Gap — the distance between what you think is happening and what's actually happening.
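A toy version of the simulator fits in a few lines. Every update rule and coefficient below is an assumption chosen to reproduce the behavior described above — a sketch of the idea, not the page's actual implementation:

```python
def simulate(incentive, gameability, integrity, quarters=24):
    """Toy Goodhart simulator: returns (M, V, G) per quarter.

    All dynamics and coefficients here are assumptions made for this
    sketch; they mirror the article's story, not the page's real code.
    """
    V, G = 50.0, 0.0  # true objective, accumulated gaming effort
    history = []
    for _ in range(quarters):
        # Effort flows toward gaming in proportion to metric pressure
        # and gameability; team integrity resists the pull.
        gaming_pull = incentive * gameability * (1 - integrity)
        G += 10 * gaming_pull                   # gaming is learnable: it accumulates
        V += 2 * incentive * (1 - gaming_pull)  # genuine improvement
        V -= 0.05 * gaming_pull * G             # gaming erodes the real thing
        V = max(V, 0.0)                         # reality bottoms out at zero
        M = V + gameability * G                 # what the dashboard shows
        history.append((M, V, G))
    return history

nail_factory = simulate(incentive=1.0, gameability=0.9, integrity=0.1)
healthy_team = simulate(incentive=1.0, gameability=0.1, integrity=0.9)

for name, run in [("nail factory", nail_factory), ("healthy team", healthy_team)]:
    M, V, G = run[-1]
    print(f"{name}: after 24 quarters M={M:.0f}, V={V:.0f}, gap={M - V:.0f}")
```

Maximum incentive on a highly gameable metric with low integrity collapses V while M soars — the nail factory. The same incentive with low gameability and high integrity keeps M an honest reading of V.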

· · ·
Chapter 7

Goodhart-Resistant Design

So what do we do? Give up on measurement? Retreat to vibes-based management?

No. That's the wrong lesson. The right lesson is that metrics are tools, not truths, and like all tools, they need to be used with awareness of their failure modes. Here's what actually works:

1. Rotate Your Metrics

If a metric can be gamed, it will be — given time. So don't give people time. Change the metric regularly. Not capriciously, but on a schedule that's shorter than the time it takes for gaming strategies to mature. This is expensive and annoying, which is why almost nobody does it, which is also why almost everyone has Goodhart problems.

2. Use Metric Bundles, Not Single Numbers

A single number is a target. A dashboard of correlated metrics is much harder to game, because optimizing one at the expense of others raises a red flag. If your customer satisfaction score goes up but retention drops and support tickets spike, something is wrong and the dashboard tells you so.9
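As a sketch of that red-flag logic (the helper function and the metric names are hypothetical, not an established API): compare two snapshots of the bundle and flag any companion metric that declined while the headline metric rose.

```python
def bundle_red_flags(previous, current, headline):
    """Flag likely gaming: the headline metric improved while companion
    metrics in the bundle moved the wrong way.

    `previous`/`current` map metric name -> value, higher = better.
    (A hypothetical helper for illustration.)
    """
    if current[headline] <= previous[headline]:
        return []  # headline didn't improve; nothing suspicious here
    return [
        name
        for name, value in current.items()
        if name != headline and value < previous[name]
    ]

q1 = {"csat": 4.1, "retention": 0.92, "tickets_resolved": 0.88}
q2 = {"csat": 4.7, "retention": 0.81, "tickets_resolved": 0.70}

flags = bundle_red_flags(q1, q2, headline="csat")
print(flags)  # -> ['retention', 'tickets_resolved']
```

Any single number here could be gamed in isolation; the point of the bundle is that gaming one metric leaves fingerprints on its neighbors.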

3. Measure Outcomes, Not Outputs

The nail factory measured output (tons, count). What they should have measured was the outcome: Were buildings getting built? Were things getting nailed together? Outcome metrics are harder to collect but much harder to game, because they're closer to V itself.

4. Keep Some Metrics Secret

This is counterintuitive in an age of transparency, but Goodhart's Law only bites when agents know what's being optimized. If you evaluate people on criteria they can't fully predict, gaming becomes harder. This is why good hiring processes include unstructured elements, and why the best teachers test understanding rather than memorization.

5. Watch for the Divergence

The most important thing is simply to be aware that your metric will eventually decouple from your objective. Build monitoring for the gap. When the metric improves but reality doesn't, that's your signal to intervene.
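A minimal version of such gap monitoring, assuming you can afford periodic audits of the true objective (the function and its alerting rule are illustrative, not an established technique):

```python
def goodhart_gap_alert(metric_series, audit_series, threshold=0.25):
    """Alert when the measured metric grows much faster than audited
    ground truth. Both series share a time index; `audit_series` comes
    from expensive spot checks of the true objective. (Illustrative
    sketch with an assumed alerting rule.)
    """
    m0, v0 = metric_series[0], audit_series[0]
    for t, (m, v) in enumerate(zip(metric_series, audit_series)):
        metric_growth = (m - m0) / m0
        truth_growth = (v - v0) / v0
        if metric_growth - truth_growth > threshold:
            return t  # first period the dashboard outran reality
    return None

metric = [100, 110, 125, 150, 190]   # what the dashboard shows
audits = [100, 104, 107, 109, 110]   # what spot checks find

print(goodhart_gap_alert(metric, audits))  # -> 3
```

The audits are slower and costlier than the metric — that's unavoidable. But even sparse ground truth is enough to notice when the compass has broken.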

The goal is not to find the perfect metric. The goal is to notice when your metric stops being a good one.

Netflix famously abandoned its star-rating system — which was being gamed by both users and the recommendation algorithm — in favor of a simpler thumbs-up/thumbs-down combined with deeper engagement signals (did you actually finish the show? did you come back the next day?). They didn't find a perfect metric. They found a harder-to-game bundle of metrics.10

· · ·
Coda

The Map Is Not the Territory

There's a beautiful short story by Jorge Luis Borges about an empire that created a map so detailed, so perfect, that it was the same size as the empire itself. A 1:1 map. It was, of course, completely useless.11

Every metric is a map. It compresses the messy, high-dimensional reality of whatever you're trying to understand into a single number, or a few numbers, that you can fit on a dashboard. That compression is simultaneously what makes metrics useful and what makes them dangerous. The map is always smaller than the territory. And when you start navigating by the map instead of looking out the window, you will eventually drive off a cliff that exists in reality but not on your map.

Goodhart's Law is, in the end, a statement about the limits of reductionism. You cannot reduce a complex system to a single number and then optimize that number without consequences. The system will fight back. The agents within it will find the gaps between your metric and your intention and drive a truck through them.

The Soviet planners learned this with nails. The British central bankers learned it with money supply. Silicon Valley is learning it right now with engagement metrics and AI alignment benchmarks. The lesson keeps being taught because it keeps being forgotten.

But here's the hopeful version: Goodhart's Law isn't a counsel of despair. It's a design constraint. Like the laws of thermodynamics, it tells you what you can't do (build a perpetual-metric machine) so you can focus on what you can do (build better, more resilient measurement systems that fail gracefully instead of catastrophically).

Measure wisely. Target reluctantly. And always, always keep one eye on the nails.

Notes & References

  1. Goodhart, C.A.E. (1975). "Problems of Monetary Management: The U.K. Experience." Papers in Monetary Economics, Reserve Bank of Australia. Reprinted in Goodhart (1984), Monetary Theory and Practice, pp. 91–121.
  2. Strathern, M. (1997). "'Improving Ratings': Audit in the British University System." European Review, 5(3), 305–321.
  3. Campbell, D.T. (1979). "Assessing the Impact of Planned Social Change." Evaluation and Program Planning, 2(1), 67–90.
  4. Jacob, B.A. & Levitt, S.D. (2003). "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating." Quarterly Journal of Economics, 118(3), 843–877.
  5. Siebert, H. (2001). Der Kobra-Effekt: Wie man Irrwege der Wirtschaftspolitik vermeidet. Deutsche Verlags-Anstalt. See Dubner, S.J. & Levitt, S.D., Freakonomics (2005) for a popular account.
  6. Vann, M.G. (2003). "Of Rats, Rice, and Race: The Great Hanoi Rat Massacre, an Episode in French Colonial History." French Colonial History, 4, 191–203.
  7. Brady, W.J. et al. (2017). "Emotion Shapes the Diffusion of Moralized Content in Social Networks." Proceedings of the National Academy of Sciences, 114(28), 7313–7318.
  8. Edwards, M.A. & Roy, S. (2017). "Academic Research in the 21st Century: Maintaining Scientific Integrity in a Climate of Perverse Incentives and Hypercompetition." Environmental Engineering Science, 34(1), 51–61.
  9. This approach is sometimes called "metric ecology" — see Muller, J.Z. (2018), The Tyranny of Metrics, Princeton University Press.
  10. Gomez-Uribe, C.A. & Hunt, N. (2015). "The Netflix Recommender System: Algorithms, Business Value, and Innovation." ACM Transactions on Management Information Systems, 6(4), 1–19.
  11. Borges, J.L. (1946). "On Exactitude in Science." Collected in A Universal History of Infamy.