The Key Equation Behind Probability
Based on Artem Kirsanov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Probabilistic thinking, which assigns likelihoods rather than single answers, sits at the core of how both brains and machine-learning systems handle uncertainty. A foggy walk-home example captures the point: when sensory information is ambiguous, the mind effectively carries a probability distribution over possible interpretations. That same probabilistic machinery powers generative modeling, where systems learn a distribution from data, then sample from it to produce new, plausible outcomes.
The discussion begins with probability distributions: a mapping from every possible state to a number between 0 and 1, with the constraint that all probabilities sum to one. For discrete cases like a fair six-sided die, each face gets probability 1/6. For continuous variables like adult height, the distribution must integrate to one (area under the curve), which can be pictured as many tiny bins. When the “state” includes multiple variables—height and weight, or even all pixels in a 100 by 100 image—the distribution lives in a high-dimensional space, making visualization impractical, though the underlying idea is unchanged.
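As a concrete illustration, here is a minimal Python sketch (NumPy assumed; the height mean and standard deviation are made-up numbers, not figures from the video) checking the normalization constraint for a discrete die and for a continuous density approximated with many tiny bins:

```python
import numpy as np

# Discrete case: a fair six-sided die, each face with probability 1/6.
die_probs = np.full(6, 1 / 6)
print("die probabilities sum to:", die_probs.sum())  # 1.0

# Continuous case: approximate a Gaussian "height" density with many tiny bins.
# The mean/std below are illustrative values only.
heights = np.linspace(100, 220, 10_000)              # centimetres
mean, std = 170.0, 10.0
density = np.exp(-0.5 * ((heights - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
bin_width = heights[1] - heights[0]
print("area under the curve ~", (density * bin_width).sum())  # ~ 1.0
```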
To use such distributions, the key operation is sampling: drawing new outcomes according to the learned probabilities. A die roll is the simplest example; a generative AI model extends the same logic to complex data, producing images that reflect structure in natural data rather than independent random noise per pixel.
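A rough sketch of that contrast, again assuming NumPy: a die roll sampled from its distribution, versus independent per-pixel noise. The generative-model step is only a hypothetical placeholder, since sampling a joint distribution over all pixels requires a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling a fair die: outcomes drawn according to the distribution above.
faces = np.arange(1, 7)
rolls = rng.choice(faces, size=10, p=np.full(6, 1 / 6))
print(rolls)

# Independent per-pixel noise: each pixel drawn on its own, with no shared structure.
noise_image = rng.random((100, 100))

# A generative model would instead draw one sample from a *joint* distribution
# over all 10,000 pixel values at once, so the pixels are correlated the way
# they are in natural images.
# image = trained_model.sample()   # hypothetical model object, not a real API
```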
From there, the math is built around “surprise.” Surprise should be larger when an event is unlikely and should add across independent events. The function that satisfies these requirements is proportional to log(1/p). Averaging surprise over the distribution yields entropy, a measure of inherent uncertainty. The transcript contrasts a fair coin with a “thick coin” that lands on its side 2% of the time: the side outcome creates rare but highly surprising events, raising the overall entropy.
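A small sketch of these definitions in Python (NumPy assumed): surprisal as log(1/p), entropy as its probability-weighted average, evaluated for a fair coin and for the thick coin that lands on its side 2% of the time.

```python
import numpy as np

def surprisal(p):
    """Surprise of an outcome with probability p, in bits: log2(1/p)."""
    return np.log2(1.0 / p)

def entropy(p):
    """Expected surprise of a distribution p (an array of probabilities)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * surprisal(p)))

fair_coin = [0.5, 0.5]
thick_coin = [0.49, 0.49, 0.02]   # heads, tails, lands on its side

print(entropy(fair_coin))    # 1.0 bit
print(entropy(thick_coin))   # ~1.12 bits: the rare side outcome raises entropy
```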
A crucial real-world twist follows: the true distribution is usually hidden. People and models rely on an internal belief distribution Q, not the real one P. When observations come from P but predictions use Q, the expected surprise becomes cross entropy. Cross entropy is always at least the entropy of P, meaning a wrong model can only increase expected surprise. It also behaves asymmetrically: believing a fair coin is rigged produces a different cross entropy than believing a rigged coin is fair, because rare outcomes under the mistaken model can dominate the average.
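The same approach extends to cross entropy: the average surprisal computed under the model Q when outcomes actually follow P. The rigged-coin probabilities below are invented purely to show the asymmetry.

```python
import numpy as np

def cross_entropy(p, q):
    """Expected surprise when data follow p but surprisal is computed under q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

fair = [0.5, 0.5]
rigged = [0.9, 0.1]          # illustrative numbers, not from the video

print(cross_entropy(fair, fair))     # 1.0  -> matches the entropy of P
print(cross_entropy(fair, rigged))   # ~1.74 -> believing a fair coin is rigged
print(cross_entropy(rigged, fair))   # 1.0  -> believing a rigged coin is fair
# The two mistaken cases give different values, so cross entropy is asymmetric
# in P and Q, and each is at least the entropy of its true distribution P.
```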
To isolate the extra surprise caused specifically by model mismatch, the transcript introduces Kullback-Leibler divergence (KL divergence). By subtracting entropy from cross entropy, KL divergence measures the additional penalty of using Q instead of P. In machine learning training, this matters because minimizing KL divergence is equivalent to minimizing cross entropy: the entropy term H(P) does not depend on the model parameters, so it shifts values but not which model Q is optimal. That equivalence helps explain why cross-entropy objectives dominate generative modeling, even when the true distribution cannot be directly computed.
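Putting the pieces together, a sketch (same NumPy helpers, with invented candidate distributions) showing that KL divergence is cross entropy minus the entropy of P, so the two objectives differ only by a constant that does not depend on Q:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log2(1.0 / p)))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

def kl_divergence(p, q):
    # Extra expected surprise from using Q instead of the true P.
    return cross_entropy(p, q) - entropy(p)

p = [0.49, 0.49, 0.02]                       # "true" thick-coin distribution
candidates = {"uniform": [1/3, 1/3, 1/3],    # two hypothetical model guesses
              "closer":  [0.45, 0.45, 0.10]}

for name, q in candidates.items():
    print(name, cross_entropy(p, q), kl_divergence(p, q))
# For every Q the two numbers differ by the same constant H(P), so whichever Q
# minimizes cross entropy also minimizes KL divergence.
```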
Cornell Notes
The core idea is that uncertainty is represented by probability distributions, not single answers. Surprise is defined as a quantity that grows when events are unlikely and adds for independent events; averaging surprise over a distribution produces entropy. When a model uses an internal belief distribution Q but the data come from a true distribution P, expected surprise becomes cross entropy, which is always at least the entropy of P and is asymmetric in P vs. Q. Subtracting entropy from cross entropy yields KL divergence, which isolates the extra surprise due to using the wrong model. In generative modeling, minimizing KL divergence is equivalent to minimizing cross entropy because the entropy of the true data distribution is constant with respect to model parameters.
Why does the transcript treat probability as “degree of belief” rather than just counting frequencies?
What exactly is a probability distribution, and why can’t one state be treated independently of the others?
How does the transcript derive entropy from the idea of “surprise”?
What is cross entropy measuring, and why does it increase when the model is wrong?
Why is KL divergence introduced, and what does it “peel away”?
Why does minimizing KL divergence end up equivalent to minimizing cross entropy in training?
Review Questions
- In what way does the additivity requirement for surprise across independent events force the surprisal function to involve a logarithm?
- Explain the difference between entropy, cross entropy, and KL divergence using the roles of P (true) and Q (model).
- Why does cross entropy minimization lead to the same optimal model as KL divergence minimization, even though KL divergence is the “true” discrepancy measure?
Key Points
1. Probability distributions assign likelihoods to all possible states, with probabilities constrained to sum (or integrate) to one.
2. Sampling from a learned distribution is the mechanism behind generative modeling, producing new outputs that reflect data structure rather than independent noise.
3. Surprise can be defined to increase for unlikely events and to add across independent events; this leads naturally to a log(1/p) form.
4. Entropy is the expected surprise under a distribution, quantifying inherent uncertainty and rising when rare outcomes exist.
5. Cross entropy measures expected surprise when data come from P but predictions use Q, and it is always at least the entropy of P.
6. KL divergence isolates the extra expected surprise caused by model mismatch by subtracting entropy from cross entropy.
7. In generative-model training, minimizing cross entropy is effectively equivalent to minimizing KL divergence because the entropy term of the true distribution is constant with respect to model parameters.