Probability Theory 28 | Weak Law of Large Numbers

Based on The Bright Side of Mathematics's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Empirical probability is modeled as relative frequency: for n trials, X̄n=(1/n)∑_{k=1}^n Xk.

Briefing

The weak law of large numbers formalizes a simple but powerful idea: when independent, identically distributed random outcomes are sampled many times, the observed relative frequency of an event gets close to its theoretical probability. The key point isn’t that the relative frequency converges perfectly every time; it’s that the chance of a noticeable deviation becomes small as the number of trials grows. In the coin-toss example, the fraction of heads among the first n tosses is a random variable whose probability of differing from 1/2 by more than any fixed ε shrinks toward zero as n increases, which is why large samples reliably reflect underlying probabilities.

To make that connection precise, the discussion distinguishes theoretical probability from empirical probability (relative frequency). The theoretical side comes from a probability measure P on a model space, while the empirical side comes from repeatedly running the experiment and counting how often outcomes fall in an event A. For n tosses, the empirical probability of heads can be written as a random variable: define Xk to be 1 if the k-th toss is heads and 0 otherwise, then the relative frequency is X̄n = (1/n)∑_{k=1}^n Xk. Because X̄n is itself random, “convergence” must be defined in a probabilistic sense.
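
To make the construction concrete, here is a minimal simulation sketch. Python with NumPy is a choice made here, not something used in the video, and the seed and sample sizes are arbitrary: each Xk is drawn as a fair Bernoulli indicator, and X̄n is the running relative frequency.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 10_000
# X_k = 1 if the k-th toss is heads, 0 if tails (fair coin, so E(X_1) = 1/2)
tosses = rng.integers(0, 2, size=n)

# Running relative frequency: X_bar_m = (1/m) * sum_{k=1}^m X_k for m = 1..n
x_bar = tosses.cumsum() / np.arange(1, n + 1)

for m in (10, 100, 1_000, 10_000):
    print(f"X_bar_{m} = {x_bar[m - 1]:.4f}")  # drifts toward 1/2 as m grows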

That probabilistic convergence is given by convergence in probability. The weak law of large numbers says that X̄n converges in probability to μ, where μ is the common expected value of the underlying i.i.d. variables. Concretely, for every ε>0, the probability that the relative frequency deviates from μ by at least ε, namely P(|X̄n−μ|≥ε), tends to 0 as n→∞. This is the operational meaning of “the empirical probability approaches the theoretical probability” without requiring almost-sure (path-by-path) convergence.
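
Convergence in probability can also be checked numerically: repeat the whole n-toss experiment many times and count how often |X̄n−μ| ≥ ε. A sketch under the same assumptions as above (Python/NumPy; μ = 1/2, ε = 0.05, and the trial count are illustrative choices, not from the video):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
mu, eps, trials = 0.5, 0.05, 100_000

for n in (10, 100, 1_000, 10_000):
    # Heads count in n fair tosses is Binomial(n, 1/2), so X_bar_n = heads / n
    x_bar = rng.binomial(n, mu, size=trials) / n
    # Monte Carlo estimate of P(|X_bar_n - mu| >= eps); it shrinks with n
    p_dev = np.mean(np.abs(x_bar - mu) >= eps)
    print(f"n = {n:>6}: P(|X_bar_n - 1/2| >= 0.05) ~ {p_dev:.5f}")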

The assumptions needed for the general statement are spelled out: the random variables must be independent and identically distributed (IID), and they must be integrable, written E(|X1|)<∞, so that the common expectation exists. Under these conditions, μ = E(X1) is also the expectation of every Xk and of the average X̄n.

A clean proof route is then given for the common special case where the variables have finite variance. Let Var(X1)=σ^2. Using expectation properties, the average still has mean μ. Using variance properties and independence, the variance of the average scales down like Var(X̄n)=σ^2/n. Chebyshev’s inequality then turns that variance shrinkage into a probability bound: P(|X̄n−μ|≥ε) ≤ Var(X̄n)/ε^2 = σ^2/(nε^2). Since σ^2 and ε are fixed, the right-hand side goes to 0 as n grows, forcing the deviation probability to vanish—exactly the weak law’s convergence-in-probability claim. The result hinges on finite variance; the next step is to consider how things change when variance assumptions fail, which is left for a later application-focused discussion.
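
Written out, the two computations and the final bound look like this (a LaTeX sketch of the steps the section describes, with σ^2 denoting the common variance):

```latex
% Mean and variance of the sample average for IID X_k with Var(X_1) = \sigma^2
\[
\mathbb{E}(\bar{X}_n) = \frac{1}{n}\sum_{k=1}^{n}\mathbb{E}(X_k) = \mu,
\qquad
\operatorname{Var}(\bar{X}_n)
  = \frac{1}{n^{2}}\sum_{k=1}^{n}\operatorname{Var}(X_k)
  = \frac{\sigma^{2}}{n},
\]
% where independence removes all covariance cross terms; Chebyshev then yields
\[
P\bigl(\lvert \bar{X}_n - \mu \rvert \ge \varepsilon\bigr)
  \le \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^{2}}
  = \frac{\sigma^{2}}{n\,\varepsilon^{2}}
  \longrightarrow 0 \quad (n \to \infty).
\]
```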

Cornell Notes

The weak law of large numbers connects theoretical probability to empirical relative frequency. For IID random variables X1, X2, …, the sample average X̄n = (1/n)∑_{k=1}^n Xk converges to μ = E(X1) in probability. “In probability” means that for every ε>0, the probability of a deviation at least ε, P(|X̄n−μ|≥ε), goes to 0 as n→∞. A key proof uses finite variance: if Var(X1)=σ^2, then Var(X̄n)=σ^2/n, and Chebyshev’s inequality bounds the deviation probability by σ^2/(nε^2). As n increases, that bound collapses to zero, making large-sample frequencies reliable.

What does “empirical probability” mean in this setting, and how is it written using random variables?

Empirical probability is the relative frequency of an event after n trials. In the coin example, define Xk=1 if the k-th toss is heads and Xk=0 if it’s tails. Then the relative frequency of heads among the first n tosses is X̄n=(1/n)∑_{k=1}^n Xk, which is a random variable taking values between 0 and 1.

What is the exact meaning of convergence in probability used by the weak law?

Convergence in probability to μ means: for every ε>0, the probability that the average deviates from μ by at least ε goes to zero. Formally, P(|X̄n−μ|≥ε)→0 as n→∞. This captures “getting close with high probability,” not guaranteed closeness on every outcome path.

Why do independence and identical distribution matter, and what is the IID shorthand?

Independence ensures that probabilities for finite collections factor correctly, which is crucial when computing variance of sums/averages. Identical distribution ensures all Xk share the same distribution, so they have the same expectation μ and the same variance σ^2 (when finite). Together this is summarized as IID: independent and identically distributed.

How does finite variance enable a straightforward proof of the weak law?

With finite variance Var(X1)=σ^2, the variance of the average shrinks: Var(X̄n)=σ^2/n. Chebyshev’s inequality then gives P(|X̄n−μ|≥ε) ≤ Var(X̄n)/ε^2 = σ^2/(nε^2). As n→∞, σ^2/(nε^2)→0, forcing the deviation probability to vanish.
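
As a quick numerical sanity check (again a Python/NumPy sketch, not from the video; for a fair coin σ^2 = p(1−p) = 1/4), the simulated deviation probability should always sit below the Chebyshev bound σ^2/(nε^2):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
mu, eps, trials = 0.5, 0.05, 100_000
sigma2 = mu * (1 - mu)  # Var(X_1) = p(1 - p) = 1/4 for a fair coin

for n in (100, 1_000, 10_000):
    x_bar = rng.binomial(n, mu, size=trials) / n   # one X_bar_n per trial
    p_dev = np.mean(np.abs(x_bar - mu) >= eps)     # empirical tail probability
    bound = sigma2 / (n * eps**2)                  # Chebyshev bound
    print(f"n = {n:>6}: empirical {p_dev:.5f} <= bound {bound:.5f}")
```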

What role does Chebyshev’s inequality play in turning variance into a probability statement?

Chebyshev’s inequality converts a bound on spread (variance) into a bound on tail probabilities. Specifically, it bounds the probability that |X̄n−μ| is at least ε by Var(X̄n)/ε^2. Because Var(X̄n) decreases like 1/n, the tail probability bound decreases like 1/n as well.
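
For reference, the general statement behind this step (a standard fact, written out here in LaTeX rather than quoted from the video): for any random variable Y with finite variance and any ε>0,

```latex
\[
P\bigl(\lvert Y - \mathbb{E}(Y) \rvert \ge \varepsilon\bigr)
  \;\le\; \frac{\operatorname{Var}(Y)}{\varepsilon^{2}},
  \qquad \varepsilon > 0.
\]
% Applied with Y = \bar{X}_n, the numerator becomes \sigma^2 / n.
```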

Review Questions

  1. In the coin-toss model, what are Xk and X̄n, and what value does X̄n aim to approach?
  2. State the weak law of large numbers in terms of P(|X̄n−μ|≥ε) and explain what “for every ε>0” implies.
  3. Under what additional condition does the proof using Chebyshev’s inequality work, and how does Var(X̄n) scale with n?

Key Points

  1. Empirical probability is modeled as relative frequency: for n trials, X̄n=(1/n)∑_{k=1}^n Xk.

  2. The weak law of large numbers gives convergence in probability: for every ε>0, P(|X̄n−μ|≥ε)→0 as n→∞.

  3. The weak law requires IID random variables (independent and identically distributed) and integrability so expectations exist.

  4. The target value μ is the common expectation E(X1), which equals E(Xk) for all k and also E(X̄n).

  5. With finite variance Var(X1)=σ^2, the average’s variance shrinks as Var(X̄n)=σ^2/n.

  6. Chebyshev’s inequality turns variance shrinkage into a deviation bound: P(|X̄n−μ|≥ε) ≤ σ^2/(nε^2).

Highlights

Relative frequency becomes a random variable: X̄n=(1/n)∑_{k=1}^n Xk.
Convergence in probability is quantified by P(|X̄n−μ|≥ε)→0 for every ε>0.
Finite variance is the engine of the proof because Var(X̄n)=σ^2/n.
Chebyshev’s inequality provides the explicit bound σ^2/(nε^2), showing deviations become unlikely as n grows.
