
The Strange Math That Predicts (Almost) Anything

Veritasium · 5 min read

Based on Veritasium's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Markov chains show that long-run stabilization can occur even when events are dependent, so convergence in averages does not prove independence.

Briefing

A century-old math feud in Russia didn’t just settle a philosophical argument about free will—it produced tools that later powered everything from nuclear-bomb simulations to Google search. The key pivot came when Andrey Markov showed that probability can still “settle down” even when events are dependent, breaking Pavel Nekrasov’s claim that convergence in real-world statistics proves independence (and therefore free will).

Nekrasov leaned on the law of large numbers, the familiar idea that repeated independent trials (like fair coin flips) produce averages that drift toward expected values. He went further: if marriage rates, crime rates, or birth rates appear to stabilize over time, then the underlying decisions must be independent—acts of free will that can be treated as measurable. Markov, an atheist who criticized what he saw as sloppy reasoning, attacked that leap. He argued that dependence doesn’t prevent probability from working; it only invalidates the inference from “stable averages” to “independent causes.”

To make the case, Markov turned to language. In Alexander Pushkin’s Eugene Onegin, he treated letters as a sequence where the next letter’s type (vowel or consonant) depends on the current one. Counting overlapping letter pairs showed strong dependence: vowel-vowel pairs occurred far less often than they would under independence. Then Markov built a predictive “machine” using states (vowel or consonant) and transition probabilities derived from the observed frequencies. Even with dependence, the long-run vowel/consonant proportions converged to the same stable split found in the original text. The takeaway was blunt: convergence in social data doesn’t prove independence, and independence isn’t required for probability theory to function.

Markov’s framework—later known as Markov chains—became a practical engine for simulation. During the Manhattan Project, Stanislaw Ulam used a card-game intuition to approximate hard problems by sampling many random outcomes. But neutrons inside a bomb don’t behave independently; each step depends on the neutron’s current state. John von Neumann recognized the need for a chain model, and together they used Markov chains on ENIAC to estimate the multiplication factor k and determine whether a chain reaction would die out, stabilize, or grow explosively. The method was christened “Monte Carlo,” and it spread quickly beyond weapons research.

That same probabilistic logic later shaped the internet. Sergey Brin and Larry Page modeled web surfing as a Markov chain, treating links as endorsements and using a damping factor to prevent getting trapped in loops. Their PageRank algorithm—built on long-run state probabilities—ranked pages by relative importance and helped Google overtake Yahoo. Meanwhile, Markov ideas also underlie text prediction: Claude Shannon’s experiments with predicting letters and words foreshadow modern token-based language modeling, where context matters and attention mechanisms refine predictions.

The through-line is memorylessness: for many complex systems, it’s enough to track the current state and ignore most of the past. That simplification is what makes Markov chains powerful—and what turned a personal intellectual feud into a mathematical backbone for 20th-century science and today’s information systems.

Cornell Notes

Markov chains emerged from a Russian dispute about whether stable statistics imply independence and thus “free will.” Pavel Nekrasov argued that if averages converge, the underlying causes must be independent; Andrey Markov countered that dependent processes can still produce long-run stability. Using Pushkin’s Eugene Onegin, Markov showed vowels and consonants depend on the previous letter, yet the overall vowel/consonant proportions still converge to steady values. This “dependent but predictable” idea later enabled Monte Carlo simulation of neutron behavior in nuclear physics and powered PageRank, the Markov-chain ranking method behind Google search. The unifying principle is the memoryless property: many systems can be modeled by transitions from the current state without tracking the full history.

What was Nekrasov’s key inference from the law of large numbers to free will, and why did Markov reject it?

Nekrasov accepted that the law of large numbers works cleanly for independent trials. He then treated convergence in real social statistics—like marriage, crime, and birth rates—as evidence that the underlying decisions were independent. That led him to frame free will as something measurable. Markov rejected the inference: convergence of averages does not require independence. Dependent events can also settle into stable long-run frequencies, so stable statistics cannot prove that causes were independent.

How did Markov demonstrate dependence while still achieving convergence?

Markov analyzed Pushkin’s Eugene Onegin by stripping punctuation and spaces and treating the resulting character stream as a sequence of vowels and consonants. He found about 43% vowels and 57% consonants, but vowel-vowel adjacent pairs occurred only about 6% of the time—far below what independence would predict (roughly 0.43×0.43 ≈ 18%). He then built a two-state Markov chain (vowel, consonant) using transition probabilities derived from those counts. Simulating the chain produced long-run proportions that converged back to the observed 43/57 split, showing dependence doesn’t prevent the law-of-large-numbers-like stabilization.
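Markov’s two-state construction can be sketched as a short simulation. The transition probabilities below are illustrative round numbers chosen so the stationary split comes out near 43/57—they are not Markov’s exact counts from Eugene Onegin:

```python
import random

# Illustrative transition probabilities (assumed, not Markov's exact figures):
# after a vowel, the next letter is a vowel ~13% of the time;
# after a consonant, ~66% of the time.
P = {
    "V": {"V": 0.13, "C": 0.87},
    "C": {"V": 0.66, "C": 0.34},
}

def simulate(steps, start="C", seed=0):
    """Run the two-state chain and return the long-run vowel fraction."""
    rng = random.Random(seed)
    state, vowels = start, 0
    for _ in range(steps):
        state = "V" if rng.random() < P[state]["V"] else "C"
        vowels += state == "V"
    return vowels / steps

print(round(simulate(500_000), 3))  # long-run vowel fraction ≈ 0.43
```

Despite strong dependence between neighboring letters, the simulated proportion settles near 0.43—exactly the kind of dependent-but-convergent behavior Markov used against Nekrasov.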

Why did Ulam and von Neumann need Markov chains for nuclear-bomb simulations?

In Ulam’s solitaire analogy, each game is independent, so sampling random outcomes works. Neutron behavior inside a bomb is different: a neutron’s next interaction depends on its current position, velocity, and energy, and on the evolving state of the core. That dependence means you can’t sample isolated outcomes; you must model a chain of state-dependent steps. Von Neumann recognized that structure and used a Markov chain model to simulate neutron histories and estimate the multiplication factor k.
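A heavily stripped-down version of the k estimate can be sketched as a Monte Carlo over neutron fates. Every number here (fission probability, offspring counts) is invented for illustration; a real transport simulation tracks position, velocity, and energy as the chain’s state:

```python
import random

def estimate_k(trials, p_fission=0.35, seed=1):
    """Toy Monte Carlo estimate of the multiplication factor k:
    the average number of next-generation neutrons per neutron."""
    rng = random.Random(seed)
    produced = 0
    for _ in range(trials):
        if rng.random() < p_fission:        # neutron triggers a fission...
            produced += rng.choice([2, 3])  # ...yielding 2 or 3 new neutrons
        # otherwise it is absorbed or escapes: 0 offspring
    return produced / trials

k = estimate_k(100_000)
verdict = "dies out" if k < 1 else "steady" if k == 1 else "grows explosively"
```

With k below 1 each generation shrinks on average and the reaction dies out; above 1 it grows exponentially—the classification von Neumann and Ulam needed from ENIAC.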

What does PageRank do that makes it a Markov chain, and why is damping necessary?

PageRank models a “random surfer” moving across web pages, where each page is a state and outgoing links define transition probabilities. Over time, the fraction of time spent on each page converges to a steady ranking score. Damping is added because not all pages are connected; without it, a random surfer can get stuck in loops and never reach other parts of the web. With an 85% follow-link rule and a 15% random jump, the chain explores the whole network.
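The random-surfer chain can be sketched as a small power iteration. The toy graph, page names, and iteration count below are made up for illustration; the 0.85/0.15 split matches the damping described above:

```python
def pagerank(links, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # 15% random-jump share
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)  # 85% follow-a-link share
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank across all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy web: A and B link to each other; C links into the A-B loop.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

In this toy web, A and B accumulate most of the rank because they receive links, while C—linked to by no one—keeps only the random-jump floor. Without the 15% jump, a surfer entering the A–B loop would never leave it.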

How do Markov ideas relate to modern text prediction?

Shannon’s work predicted next characters or words using limited context, improving predictions as more previous tokens were considered. Modern language models generalize this: they predict the next token from prior tokens, but they don’t treat all context equally. Attention mechanisms decide what parts of the history matter most, such as distinguishing “cell” in “the structure of the cell” as biology rather than a prison cell. The underlying probabilistic prediction task still echoes Markov-style next-step modeling.
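Shannon-style next-character prediction with a single letter of context can be sketched in a few lines (the sample sentence is arbitrary):

```python
from collections import Counter, defaultdict

def bigram_model(text):
    """Count character bigrams: counts[a][b] = times b followed a."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, ch):
    """Most likely next character after ch, or None if ch is unseen."""
    if ch not in counts:
        return None
    return counts[ch].most_common(1)[0][0]

model = bigram_model("the theory of the thing")
print(predict_next(model, "t"))  # 'h' — 't' is always followed by 'h' here
```

Modern language models replace this fixed one-character context with thousands of tokens and let attention weight them unevenly, but the prediction task is the same shape.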

What is the “memoryless” property that makes Markov chains useful?

A Markov chain assumes the future depends only on the current state, not the full past. That memoryless property lets complex systems be simplified into transition rules between states. The transcript emphasizes that for many real processes—weather patterns, disease spread, particle interactions—tracking just the present state can still yield meaningful long-run predictions.
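The memoryless update is a single distribution-times-matrix step; repeating it shows convergence to a stationary distribution. The two-state “weather” probabilities below are invented for illustration:

```python
def step(dist, P):
    """One Markov transition: new_j = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Toy two-state weather chain (sunny, rainy); probabilities are assumed.
P = [[0.9, 0.1],   # sunny -> sunny 90%, sunny -> rainy 10%
     [0.5, 0.5]]   # rainy -> sunny 50%, rainy -> rainy 50%

dist = [1.0, 0.0]  # start certainly sunny
for _ in range(50):
    dist = step(dist, P)
# dist approaches the stationary distribution [5/6, 1/6]
# regardless of the starting state
```

Only the current distribution is carried between steps—no history—yet the long-run forecast is fully determined, which is the memoryless property in action.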

Review Questions

  1. How does Markov’s vowel/consonant example undermine the claim that convergence implies independence?
  2. In what way is neutron transport inside a bomb analogous to a Markov chain, and why would independent sampling fail?
  3. What problem does damping solve in PageRank, and what would happen without it?

Key Points

  1. Markov chains show that long-run stabilization can occur even when events are dependent, so convergence in averages does not prove independence.

  2. Nekrasov’s leap from stable social statistics to measurable free will was invalidated by Markov’s dependent-but-convergent construction.

  3. Markov used Pushkin’s Eugene Onegin to quantify dependence between adjacent letters and then built a two-state transition model to simulate long-run behavior.

  4. Ulam’s Monte Carlo sampling idea required Markov chains for neutron physics because each neutron’s next step depends on its current state.

  5. Von Neumann and Ulam used ENIAC simulations to estimate the multiplication factor k and classify whether a chain reaction dies out, stays steady, or grows explosively.

  6. PageRank treats web navigation as a Markov chain and uses damping to avoid getting trapped in disconnected loops.

  7. Modern text prediction systems inherit the “predict the next token from context” logic, but attention mechanisms decide which parts of the context matter most.

Highlights

Markov proved that dependent processes can still produce the kind of long-run convergence people associate with independent trials—so stable statistics don’t certify independence.
The ENIAC-based neutron simulation relied on state-dependent transitions, turning an intractable physics problem into a Markov-chain Monte Carlo estimate of k.
PageRank’s ranking comes from long-run state probabilities in a web-surfing Markov chain, with damping preventing the surfer from getting stuck.
Attention-based language models can be seen as a modern, more flexible extension of next-token prediction ideas that trace back to Markov-style thinking.
