The Key Equation Behind Probability
Based on Artem Kirsanov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Probabilistic thinking, which assigns likelihoods rather than single answers, sits at the core of how both brains and machine-learning systems handle uncertainty. A foggy walk-home example captures the point: when sensory information is ambiguous, the mind effectively carries a probability distribution over possible interpretations. That same probabilistic machinery powers generative modeling, where systems learn a distribution from data, then sample from it to produce new, plausible outcomes.
The discussion begins with probability distributions: a mapping from every possible state to a number between 0 and 1, with the constraint that all probabilities sum to one. For discrete cases like a fair six-sided die, each face gets probability 1/6. For continuous variables like adult height, the distribution must integrate to one (area under the curve), which can be pictured as many tiny bins. When the “state” includes multiple variables—height and weight, or even all pixels in a 100 by 100 image—the distribution lives in a high-dimensional space, making visualization impractical, though the underlying idea is unchanged.
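As a concrete illustration, here is a minimal Python sketch (NumPy assumed; the height mean and standard deviation are made-up numbers, not figures from the video) checking the normalization constraint for a discrete die and for a continuous density approximated with many tiny bins:

```python
import numpy as np

# Discrete case: a fair six-sided die, each face with probability 1/6.
die_probs = np.full(6, 1 / 6)
print("die probabilities sum to:", die_probs.sum())  # 1.0

# Continuous case: approximate a Gaussian "height" density with many tiny bins.
# The mean/std below are illustrative values only.
heights = np.linspace(100, 220, 10_000)              # centimetres
mean, std = 170.0, 10.0
density = np.exp(-0.5 * ((heights - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
bin_width = heights[1] - heights[0]
print("area under the curve ~", (density * bin_width).sum())  # ~ 1.0
```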
To use such distributions, the key operation is sampling: drawing new outcomes according to the learned probabilities. A die roll is the simplest example; a generative AI model extends the same logic to complex data, producing images that reflect structure in natural data rather than independent random noise per pixel.
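A rough sketch of that contrast, again assuming NumPy: a die roll sampled from its distribution, versus independent per-pixel noise. The generative-model step is only a hypothetical placeholder, since sampling a joint distribution over all pixels requires a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling a fair die: outcomes drawn according to the distribution above.
faces = np.arange(1, 7)
rolls = rng.choice(faces, size=10, p=np.full(6, 1 / 6))
print(rolls)

# Independent per-pixel noise: each pixel drawn on its own, with no shared structure.
noise_image = rng.random((100, 100))

# A generative model would instead draw one sample from a *joint* distribution
# over all 10,000 pixel values at once, so the pixels are correlated the way
# they are in natural images.
# image = trained_model.sample()   # hypothetical model object, not a real API
```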
From there, the math is built around “surprise.” Surprise should be larger when an event is unlikely and should add across independent events. The function that satisfies these requirements is proportional to log(1/p). Averaging surprise over the distribution yields entropy, a measure of inherent uncertainty. The transcript contrasts a fair coin with a “thick coin” that lands on its side 2% of the time: the side outcome creates rare but highly surprising events, raising the overall entropy.
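A small sketch of these definitions in Python (NumPy assumed): surprisal as log(1/p), entropy as its probability-weighted average, evaluated for a fair coin and for the thick coin that lands on its side 2% of the time.

```python
import numpy as np

def surprisal(p):
    """Surprise of an outcome with probability p, in bits: log2(1/p)."""
    return np.log2(1.0 / p)

def entropy(p):
    """Expected surprise of a distribution p (an array of probabilities)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * surprisal(p)))

fair_coin = [0.5, 0.5]
thick_coin = [0.49, 0.49, 0.02]   # heads, tails, lands on its side

print(entropy(fair_coin))    # 1.0 bit
print(entropy(thick_coin))   # ~1.12 bits: the rare side outcome raises entropy
```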
A crucial real-world twist follows: the true distribution is usually hidden. People and models rely on an internal belief distribution Q, not the real one P. When observations come from P but predictions use Q, the expected surprise becomes cross entropy. Cross entropy is always at least the entropy of P, meaning a wrong model can only increase expected surprise. It also behaves asymmetrically: believing a fair coin is rigged produces a different cross entropy than believing a rigged coin is fair, because rare outcomes under the mistaken model can dominate the average.
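The same approach extends to cross entropy: the average surprisal computed under the model Q when outcomes actually follow P. The rigged-coin probabilities below are invented purely to show the asymmetry.

```python
import numpy as np

def cross_entropy(p, q):
    """Expected surprise when data follow p but surprisal is computed under q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

fair = [0.5, 0.5]
rigged = [0.9, 0.1]          # illustrative numbers, not from the video

print(cross_entropy(fair, fair))     # 1.0  -> matches the entropy of P
print(cross_entropy(fair, rigged))   # ~1.74 -> believing a fair coin is rigged
print(cross_entropy(rigged, fair))   # 1.0  -> believing a rigged coin is fair
# The two mistaken cases give different values, so cross entropy is asymmetric
# in P and Q, and each is at least the entropy of its true distribution P.
```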
To isolate the extra surprise caused specifically by model mismatch, the transcript introduces Kullback-Leibler divergence (KL divergence). By subtracting entropy from cross entropy, KL divergence measures the additional penalty of using Q instead of P. In machine learning training, this matters because minimizing KL divergence is equivalent to minimizing cross entropy: the entropy term H(P) does not depend on the model parameters, so it shifts values but not which model Q is optimal. That equivalence helps explain why cross-entropy objectives dominate generative modeling, even when the true distribution cannot be directly computed.
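Putting the pieces together, a sketch (same NumPy helpers, with invented candidate distributions) showing that KL divergence is cross entropy minus the entropy of P, so the two objectives differ only by a constant that does not depend on Q:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log2(1.0 / p)))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

def kl_divergence(p, q):
    # Extra expected surprise from using Q instead of the true P.
    return cross_entropy(p, q) - entropy(p)

p = [0.49, 0.49, 0.02]                       # "true" thick-coin distribution
candidates = {"uniform": [1/3, 1/3, 1/3],    # two hypothetical model guesses
              "closer":  [0.45, 0.45, 0.10]}

for name, q in candidates.items():
    print(name, cross_entropy(p, q), kl_divergence(p, q))
# For every Q the two numbers differ by the same constant H(P), so whichever Q
# minimizes cross entropy also minimizes KL divergence.
```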
Cornell Notes
The core idea is that uncertainty is represented by probability distributions, not single answers. Surprise is defined as a quantity that grows when events are unlikely and adds for independent events; averaging surprise over a distribution produces entropy. When a model uses an internal belief distribution Q but the data come from a true distribution P, expected surprise becomes cross entropy, which is always at least the entropy of P and is asymmetric in P vs. Q. Subtracting entropy from cross entropy yields KL divergence, which isolates the extra surprise due to using the wrong model. In generative modeling, minimizing KL divergence is equivalent to minimizing cross entropy because the entropy of the true data distribution is constant with respect to model parameters.
Why does the transcript treat probability as “degree of belief” rather than just counting frequencies?
What exactly is a probability distribution, and why can’t one state be treated independently of the others?
How does the transcript derive entropy from the idea of “surprise”?
What is cross entropy measuring, and why does it increase when the model is wrong?
Why is KL divergence introduced, and what does it “peel away”?
Why does minimizing KL divergence end up equivalent to minimizing cross entropy in training?
Review Questions
- In what way does the additivity requirement for surprise across independent events force the surprisal function to involve a logarithm?
- Explain the difference between entropy, cross entropy, and KL divergence using the roles of P (true) and Q (model).
- Why does cross entropy minimization lead to the same optimal model as KL divergence minimization, even though KL divergence is the “true” discrepancy measure?
Key Points
1. Probability distributions assign likelihoods to all possible states, with probabilities constrained to sum (or integrate) to one.
2. Sampling from a learned distribution is the mechanism behind generative modeling, producing new outputs that reflect data structure rather than independent noise.
3. Surprise can be defined to increase for unlikely events and to add across independent events; this leads naturally to a log(1/p) form.
4. Entropy is the expected surprise under a distribution, quantifying inherent uncertainty and rising when rare outcomes exist.
5. Cross entropy measures expected surprise when data come from P but predictions use Q, and it is always at least the entropy of P.
6. KL divergence isolates the extra expected surprise caused by model mismatch by subtracting entropy from cross entropy.
7. In generative-model training, minimizing cross entropy is effectively equivalent to minimizing KL divergence because the entropy term of the true distribution is constant with respect to model parameters.