Probability Theory 28 | Weak Law of Large Numbers
Based on a video by The Bright Side of Mathematics on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
The weak law of large numbers formalizes a simple but powerful idea: when independent, identically distributed random outcomes are sampled many times, the observed relative frequency of an event gets close to its theoretical probability. The key point isn’t that the relative frequency converges perfectly every time; it’s that the chance of a noticeable deviation becomes small as the number of trials grows. In the coin-toss example, the fraction of heads among the first n tosses behaves like a random variable whose probability of differing from 1/2 by more than any fixed ε shrinks toward zero as n increases—capturing why large samples reliably reflect underlying probabilities.
To make that connection precise, the discussion distinguishes theoretical probability from empirical probability (relative frequency). The theoretical side comes from a probability measure P on a model space, while the empirical side comes from repeatedly running the experiment and counting how often outcomes fall in an event A. For n tosses, the empirical probability of heads can be written as a random variable: define Xk to be 1 if the k-th toss is heads and 0 otherwise, then the relative frequency is X̄n = (1/n)∑_{k=1}^n Xk. Because X̄n is itself random, “convergence” must be defined in a probabilistic sense.
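A quick simulation makes the definition concrete. The sketch below (illustrative, not from the video) assumes a fair coin, so each Xk is a 0/1 indicator of heads and X̄n is the relative frequency:

```python
import random

random.seed(0)  # fixed seed so the illustrative run is reproducible

n = 10_000
# X_k = 1 if the k-th toss is heads, 0 otherwise (fair coin, P(heads) = 1/2)
tosses = [1 if random.random() < 0.5 else 0 for _ in range(n)]

# Relative frequency X_bar_n = (1/n) * sum of the X_k
x_bar_n = sum(tosses) / n
print(x_bar_n)  # close to the theoretical probability 1/2
```

Running this with different seeds gives different values of x_bar_n, which is exactly the point: X̄n is itself a random variable, so its convergence must be defined probabilistically.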
That probabilistic convergence is given by convergence in probability. The weak law of large numbers says that X̄n converges in probability to μ, where μ is the common expected value of the underlying i.i.d. variables. Concretely, for every ε>0, the probability that the relative frequency deviates from μ by at least ε, namely P(|X̄n−μ|≥ε), tends to 0 as n→∞. This is the operational meaning of “the empirical probability approaches the theoretical probability” without requiring almost-sure (path-by-path) convergence.
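Convergence in probability can be observed numerically by repeating the whole n-toss experiment many times and counting how often |X̄n−μ|≥ε occurs. This Monte Carlo sketch (a hypothetical illustration, assuming a fair coin with μ=1/2 and ε=0.05) shows the deviation probability shrinking as n grows:

```python
import random

random.seed(1)

mu, eps = 0.5, 0.05
trials = 2_000  # independent repetitions of the n-toss experiment

def deviation_prob(n):
    """Monte Carlo estimate of P(|X_bar_n - mu| >= eps) for a fair coin."""
    count = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(n) if random.random() < 0.5)
        if abs(heads / n - mu) >= eps:
            count += 1
    return count / trials

probs = {n: deviation_prob(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(n, p)  # the estimated deviation probability decreases with n
```

For small n the deviation probability is large; by n = 1000 it is nearly zero, matching the statement P(|X̄n−μ|≥ε)→0.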
The assumptions needed for the general statement are spelled out: the random variables must be independent and identically distributed (IID), and their expectations must exist in a way that ensures integrability—expressed as E(|X1|)<∞. Under these conditions, μ = E(X1) is also the expectation of every Xk and of the average X̄n.
A clean proof route is then given for the common special case where the variables have finite variance. Let Var(X1)=σ^2. Using expectation properties, the average still has mean μ. Using variance properties and independence, the variance of the average scales down like Var(X̄n)=σ^2/n. Chebyshev’s inequality then turns that variance shrinkage into a probability bound: P(|X̄n−μ|≥ε) ≤ Var(X̄n)/ε^2 = σ^2/(nε^2). Since σ^2 and ε are fixed, the right-hand side goes to 0 as n grows, forcing the deviation probability to vanish—exactly the weak law’s convergence-in-probability claim. The result hinges on finite variance; the next step is to consider how things change when variance assumptions fail, which is left for a later application-focused discussion.
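The Chebyshev bound can be checked against simulation. The sketch below (an illustrative check, assuming a fair-coin indicator with Var(X1)=1/4) estimates the deviation probability empirically and compares it with σ^2/(nε^2); the bound is loose but always holds:

```python
import random

random.seed(2)

eps = 0.1
mu, var_x1 = 0.5, 0.25  # fair-coin indicator: E(X1) = 1/2, Var(X1) = 1/4
trials = 5_000  # Monte Carlo repetitions per sample size

results = []
for n in (25, 100, 400):
    # Empirical estimate of P(|X_bar_n - mu| >= eps)
    exceed = sum(
        abs(sum(random.random() < 0.5 for _ in range(n)) / n - mu) >= eps
        for _ in range(trials)
    )
    empirical = exceed / trials
    chebyshev = var_x1 / (n * eps**2)  # Chebyshev bound: sigma^2 / (n * eps^2)
    results.append((n, empirical, chebyshev))
    print(n, empirical, chebyshev)
```

Both columns shrink as n grows, but the empirical deviation probability sits well below the Chebyshev bound, which is all the proof needs: the bound going to 0 forces the probability to 0.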
Cornell Notes
The weak law of large numbers connects theoretical probability to empirical relative frequency. For IID random variables X1, X2, …, the sample average X̄n = (1/n)∑_{k=1}^n Xk converges to μ = E(X1) in probability. “In probability” means that for every ε>0, the probability of a deviation at least ε, P(|X̄n−μ|≥ε), goes to 0 as n→∞. A key proof uses finite variance: if Var(X1)=σ^2, then Var(X̄n)=σ^2/n, and Chebyshev’s inequality bounds the deviation probability by σ^2/(nε^2). As n increases, that bound collapses to zero, making large-sample frequencies reliable.
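The variance-scaling step Var(X̄n)=σ^2/n can also be verified directly: draw many independent copies of X̄n and compare their sample variance with the theoretical value. A minimal sketch, again assuming a fair-coin indicator (so σ^2 = 1/4):

```python
import random

random.seed(3)

n, reps = 50, 20_000
var_x1 = 0.25  # variance of a fair-coin indicator

# Draw many independent copies of X_bar_n, then compare their sample
# variance with the theoretical value sigma^2 / n = 0.25 / 50 = 0.005.
averages = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(reps)]
mean = sum(averages) / reps
sample_var = sum((a - mean) ** 2 for a in averages) / (reps - 1)
print(sample_var, var_x1 / n)  # the two values nearly agree
```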
- What does “empirical probability” mean in this setting, and how is it written using random variables?
- What is the exact meaning of convergence in probability used by the weak law?
- Why do independence and identical distribution matter, and what is the IID shorthand?
- How does finite variance enable a straightforward proof of the weak law?
- What role does Chebyshev’s inequality play in turning variance into a probability statement?
Review Questions
- In the coin-toss model, what are Xk and X̄n, and what value does X̄n aim to approach?
- State the weak law of large numbers in terms of P(|X̄n−μ|≥ε) and explain what “for every ε>0” implies.
- Under what additional condition does the proof using Chebyshev’s inequality work, and how does Var(X̄n) scale with n?
Key Points
1. Empirical probability is modeled as relative frequency: for n trials, X̄n=(1/n)∑_{k=1}^n Xk.
2. The weak law of large numbers gives convergence in probability: for every ε>0, P(|X̄n−μ|≥ε)→0 as n→∞.
3. The weak law requires IID random variables (independent and identically distributed) and integrability so expectations exist.
4. The target value μ is the common expectation E(X1), which equals E(Xk) for all k and also E(X̄n).
5. With finite variance Var(X1)=σ^2, the average’s variance shrinks as Var(X̄n)=σ^2/n.
6. Chebyshev’s inequality turns variance shrinkage into a deviation bound: P(|X̄n−μ|≥ε) ≤ σ^2/(nε^2).