Probability Theory 31 | Central Limit Theorem

Based on The Bright Side of Mathematics's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The central limit theorem requires IID random variables with finite expectation and finite variance.

Briefing

The central limit theorem's assumptions are simple (independent, identically distributed random variables with finite mean and variance), and they deliver a powerful payoff: standardized sample averages become approximately normal, and in the limit they converge to the standard normal distribution. This matters because it turns complicated averaging behavior into a predictable bell curve, enabling practical approximations for uncertainty and fluctuations in real-world data.

Start with IID random variables X1, X2, …, Xn. The distribution of each Xi can be anything, as long as the expectation E[X1] = μ and the variance Var(X1) = σ² exist and are finite. The sample mean X̄n = (1/n)∑_{k=1}^n Xk inherits the same expectation as the individual variables, so E[X̄n] = μ. But its variance shrinks with sample size: Var(X̄n) = σ²/n. Larger n means smaller fluctuations around μ, a theme already familiar from the law of large numbers—yet the central limit theorem goes further by describing the *shape* of those fluctuations.
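
To make the variance scaling concrete, here is a minimal simulation sketch (not from the video): it uses an exponential distribution with μ = σ² = 1 as a stand-in for the Xi, and the sample sizes and number of repetitions are illustrative choices.

```python
# Sketch: empirically check E[X̄n] = μ and Var(X̄n) = σ²/n for a non-normal
# distribution. Exponential(1) is an assumed example with μ = 1 and σ² = 1.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0  # true mean and variance of Exponential(1)

for n in (5, 50, 500):
    # 10,000 independent samples of size n, each averaged to one value of X̄n
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:4d}  mean(X̄n) ≈ {means.mean():.3f}  "
          f"var(X̄n) ≈ {means.var():.4f}  σ²/n = {sigma2 / n:.4f}")
```

The printed variances should track σ²/n closely, while the mean stays near μ regardless of n.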

A concrete example uses an urn model without replacement, connected to the hypergeometric distribution: the urn holds five balls of two types, three balls are drawn, and each random variable counts how many “ones” appear in the draw. The expectation for that count is given as 9/5, i.e. 1.8. By simulating many such draws and plotting histograms of X̄n for different n values, the distribution of the sample mean tightens around 1.8 as n grows. More importantly, the histogram begins to resemble a bell curve. Increasing n makes the approximation to normality clearer, while the spread decreases in line with the 1/n variance scaling.
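
A rough simulation of that experiment might look like the sketch below. The specific urn composition (three “one” balls and two “zero” balls, so that the hypergeometric mean is 3·(3/5) = 9/5) and the sample sizes are assumptions chosen to match the stated expectation of 1.8, not details confirmed from the video.

```python
# Sketch: histograms of the sample mean X̄n for hypergeometric draws.
# Assumed urn: 3 "one" balls, 2 "zero" balls, 3 drawn without replacement.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ngood, nbad, ndraw = 3, 2, 3  # assumed hypergeometric parameters

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, (10, 100, 1000)):
    # 5,000 independent sample means, each averaging n hypergeometric counts
    counts = rng.hypergeometric(ngood, nbad, ndraw, size=(5_000, n))
    ax.hist(counts.mean(axis=1), bins=40, density=True)
    ax.axvline(9 / 5, color="red")  # the expectation 9/5 = 1.8
    ax.set_title(f"X̄n for n={n}")
plt.tight_layout()
plt.show()
```

As n increases, the histograms should both narrow around 1.8 and take on an increasingly bell-shaped profile.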

The theorem’s formal statement comes from standardizing the sample mean. Define a standardized variable

Yn = (X̄n − μ) / (σ/√n).

This transformation shifts the mean to 0 and rescales the variance to 1. Under the IID and finite-mean/finite-variance conditions, the distribution of Yn converges to the Normal(0,1) distribution as n → ∞. One way to express this convergence is through cumulative distribution functions: for each real x, the CDF of Yn, P(Yn ≤ x), approaches the CDF of the standard normal. That limiting CDF can be written as an integral of the standard normal density from −∞ to x, using the familiar exponential form exp(−t²/2).
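
For reference, this CDF formulation can be written compactly in the notation above, with Φ denoting the standard normal CDF; this is the standard textbook statement rather than a transcription from the video.

```latex
% Standard CDF formulation of the central limit theorem
\lim_{n \to \infty} P\!\left( \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le x \right)
  \;=\; \Phi(x)
  \;=\; \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
  \qquad \text{for every } x \in \mathbb{R}.
```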

A key takeaway is robustness: the original Xi distribution doesn’t need to be normal. As long as the variables are IID and have finite expectation and variance, averages become approximately normal for large n, which explains why Gaussian models appear so often in statistics and applied probability.

Cornell Notes

With IID random variables X1,…,Xn that have finite mean μ and finite variance σ², the sample mean X̄n has expectation μ and variance σ²/n. As n grows, the spread around μ shrinks, but the central limit theorem also predicts the *shape* of that spread. By standardizing the mean as Yn = (X̄n − μ)/(σ/√n), the variance becomes 1 and the mean becomes 0. The distribution of Yn converges to the standard normal distribution Normal(0,1) as n → ∞, meaning its CDF approaches the standard normal CDF for every real x. This provides a general route to approximate distributions of averages even when the original Xi distribution is not normal.

What conditions must hold for the central limit theorem to apply?

The random variables must be independent and identically distributed (IID). In addition, the expectation E[X1] must exist and the variance Var(X1) must exist (finite). The original distribution of X1 can be arbitrary; only the IID structure and finite mean/variance are required.

Why does the sample mean get less variable as the sample size increases?

For X̄n = (1/n)∑_{k=1}^n Xk, the expectation stays the same as for a single draw: E[X̄n] = E[X1] = μ. The variance, however, scales down with n: Var(X̄n) = σ²/n. So increasing n reduces fluctuations around μ.

How does standardization turn the sample mean into something that converges to Normal(0,1)?

Standardization shifts and rescales: Yn = (X̄n − μ)/(σ/√n). Subtracting μ centers the variable at 0, and dividing by σ/√n stretches it so the variance becomes 1. This standardized variable is the one whose distribution converges to the standard normal.

What does “converges” mean in terms of cumulative distribution functions?

For each fixed real x, the CDF of Yn, P(Yn ≤ x), approaches the CDF of Normal(0,1) as n → ∞. The limiting CDF is computed by integrating the standard normal density from −∞ to x: F(x) = ∫_{−∞}^x (1/√(2π)) exp(−t²/2) dt. Numerically, those CDF values can be used as approximations when n is large.
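
As a sketch of how that approximation is used in practice, the snippet below compares an empirical estimate of P(Yn ≤ x) with Φ(x); the exponential distribution, the sample size, and the value of x are illustrative assumptions, not taken from the video.

```python
# Sketch: for large n, P(Yn ≤ x) ≈ Φ(x), the standard normal CDF.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma, n, x = 1.0, 1.0, 400, 1.0  # Exponential(1) has μ = σ = 1; n, x assumed

# 20,000 standardized sample means Yn = (X̄n − μ) / (σ/√n)
samples = rng.exponential(scale=1.0, size=(20_000, n))
yn = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print("empirical P(Yn <= x):", np.mean(yn <= x))  # close to 0.84 for x = 1
print("Phi(x) from scipy   :", norm.cdf(x))       # 0.8413...
```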

Why does the urn/hypergeometric simulation illustrate the theorem?

In the urn example, each draw counts how many “ones” appear when drawing a fixed number of balls without replacement, producing a hypergeometric-type random variable. Even though the underlying distribution is not normal, repeated averaging across many independent samples yields histograms for X̄n that become increasingly bell-shaped as n increases, centered near the expectation (9/5 = 1.8). That visual pattern matches the central limit theorem’s prediction.

Review Questions

  1. Given IID random variables with finite mean μ and variance σ², what are E[X̄n] and Var(X̄n)?
  2. Write the standardized variable Yn used in the central limit theorem and state its limiting distribution.
  3. How would you approximate P(Yn ≤ x) for large n using the standard normal CDF?

Key Points

  1. The central limit theorem requires IID random variables with finite expectation and finite variance.
  2. For the sample mean X̄n, the mean stays at μ while the variance shrinks to σ²/n.
  3. Standardizing the sample mean as Yn = (X̄n − μ)/(σ/√n) produces a variable with mean 0 and variance 1.
  4. As n → ∞, the distribution of Yn converges to Normal(0,1), not just in spread but in shape.
  5. Convergence can be expressed via CDFs: P(Yn ≤ x) approaches the standard normal CDF for every real x.
  6. The normal approximation for averages works even when the original Xi distribution is not normal, as long as the IID and finite-variance conditions hold.

Highlights

Even with non-normal underlying data (like a hypergeometric urn count), averages become approximately normal as n grows.
Variance of the sample mean scales like 1/n, explaining why histograms tighten around the mean.
The standardized statistic Yn = (X̄n − μ)/(σ/√n) is the object that converges to Normal(0,1).
CDF-based convergence means probabilities for YN can be approximated using the standard normal integral once n is large.
