
Probability Theory 28 | Weak Law of Large Numbers [dark version]

4 min read

Based on The Bright Side of Mathematics' video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

The weak law of large numbers links theoretical probability μ to observed relative frequency via the empirical average X̄_n.

Briefing

The weak law of large numbers formalizes a simple but powerful intuition: when independent, identically distributed random outcomes are sampled many times, the observed relative frequency of an event settles near its theoretical probability. In the coin-toss example, the fraction of heads among the first n tosses becomes increasingly close to 1/2 as n grows. The key question becomes what “close” and “settles” mean mathematically, and the weak law answers it using convergence in probability.

Relative frequency is built from random variables. For each toss k, define X_k to be 1 if toss k lands heads and 0 otherwise. After n tosses, the empirical probability of heads is the average X̄_n = (1/n)∑_{k=1}^n X_k. Since each X_k has expected value μ (for a fair coin, μ = 1/2), the weak law targets the event that the empirical average deviates from μ by more than a chosen tolerance ε > 0. Convergence in probability to μ means that P(|X̄_n − μ| > ε) → 0 as n → ∞ for every ε > 0.
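
To make the construction concrete, here is a minimal Python sketch (our own illustration, not from the transcript) that simulates fair coin tosses and prints the empirical average X̄_n at a few checkpoints; the seed and variable names are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

mu = 0.5         # theoretical probability of heads for a fair coin
running_sum = 0  # sum X_1 + ... + X_n accumulated so far

# The empirical average X̄_n = (1/n) * sum_{k=1}^n X_k should settle near mu.
for n in range(1, 100_001):
    x_k = 1 if random.random() < mu else 0  # X_k = 1 for heads, 0 for tails
    running_sum += x_k
    if n in (10, 100, 1_000, 10_000, 100_000):
        x_bar = running_sum / n
        print(f"n = {n:>6}:  X̄_n = {x_bar:.4f},  |X̄_n − μ| = {abs(x_bar - mu):.4f}")
```

On a typical run, the deviation |X̄_n − μ| tends to shrink as n grows, which is exactly the behavior the weak law quantifies.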

To make this guarantee, the random variables must satisfy three main conditions. First, they are independent: any finite collection has joint probabilities that factor as products. Second, they are identically distributed: each X_k has the same distribution, so expectations match across k. Together these are summarized as an IID assumption. Third, expectations must exist in a way that supports the probability bound; the transcript uses the requirement that E|X_1| is finite (integrability), ensuring μ is well-defined.
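
The integrability requirement is not a technicality. As a side illustration of our own (not from the transcript), the standard Cauchy distribution has E|X_1| = ∞, and averaging does not help: the mean of n IID standard Cauchy variables is again standard Cauchy, so the running average never settles.

```python
import math
import random

random.seed(1)

def standard_cauchy() -> float:
    # Inverse-CDF sampling: tan(π(U − 1/2)) is standard Cauchy for U ~ Uniform(0, 1).
    return math.tan(math.pi * (random.random() - 0.5))

# With no finite mean, the running average keeps jumping instead of converging;
# the average of n IID standard Cauchy draws has the same distribution as a single draw.
running_sum = 0.0
for n in range(1, 1_000_001):
    running_sum += standard_cauchy()
    if n in (10, 1_000, 100_000, 1_000_000):
        print(f"n = {n:>7}:  running average = {running_sum / n:+.3f}")
```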

A clean proof route appears when the variables also have finite variance. Let Var(X_1) = σ^2. The expected value of the average stays fixed: E[X̄_n] = μ. More importantly, the variance shrinks with sample size: because independence makes variances add, Var(∑_{k=1}^n X_k) = nσ^2, and dividing by n^2 gives Var(X̄_n) = σ^2/n. This scaling is the engine behind the result. Chebyshev’s inequality then bounds the deviation probability: P(|X̄_n − μ| > ε) ≤ Var(X̄_n)/ε^2 = (σ^2/n)/ε^2. Because σ^2 and ε^2 are constants, the right-hand side goes to zero as n increases, forcing the left-hand side to vanish as well. That is exactly convergence in probability, which is the weak law’s formal meaning.
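
To watch the bound do its work, here is a minimal Monte Carlo sketch (our own illustration, not from the transcript): it estimates P(|X̄_n − μ| > ε) for a fair coin by repeating the n-toss experiment many times, and prints Chebyshev's bound σ^2/(nε^2) next to it.

```python
import random

random.seed(2)

mu, sigma_sq = 0.5, 0.25  # fair coin: Var(X_1) = p(1 − p) = 1/4
eps = 0.1                 # deviation tolerance in the event |X̄_n − μ| > ε
trials = 10_000           # Monte Carlo repetitions of the n-toss experiment

for n in (25, 100, 400):
    exceed = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(n) if random.random() < mu)
        if abs(heads / n - mu) > eps:
            exceed += 1
    bound = sigma_sq / (n * eps**2)
    print(f"n = {n:>3}:  P(|X̄_n − μ| > ε) ≈ {exceed / trials:.4f}   Chebyshev bound = {bound:.4f}")
```

The bound is loose (the true probability is far smaller), but looseness is harmless here: all the weak law needs is that the bound itself tends to zero.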

The practical message is that empirical frequencies become reliable without requiring almost-sure convergence. Under IID sampling and finite variance, the chance of a noticeable mismatch between observed frequency and theoretical probability becomes arbitrarily small as the number of trials grows—setting the probabilistic foundation for statistical estimation and long-run frequency reasoning.

Cornell Notes

The weak law of large numbers connects theoretical probability to observed relative frequency. For IID random variables X_1, X_2, … with common mean μ, the empirical average X̄_n = (1/n)∑_{k=1}^n X_k approaches μ in probability. “In probability” means that for every ε > 0, the probability of a deviation larger than ε goes to zero: P(|X̄_n − μ| > ε) → 0 as n → ∞. When the variables have finite variance σ^2, the proof becomes direct: Var(X̄_n) = σ^2/n, and Chebyshev’s inequality yields P(|X̄_n − μ| > ε) ≤ σ^2/(nε^2). As n grows, this bound shrinks to zero, making the empirical frequency increasingly accurate.

How is empirical probability (relative frequency) represented using random variables?

For each trial k, define X_k to encode whether the event occurred (e.g., for a fair coin, X_k = 1 for heads and 0 for tails). After n trials, the empirical probability of the event is the average X̄_n = (1/n)∑_{k=1}^n X_k. This average is itself a random variable, and it equals the fraction of trials in which the event happened.

What does “convergence in probability” mean in the weak law of large numbers?

Convergence in probability to μ means that for every tolerance ε > 0, the probability of deviating from μ by more than ε goes to zero: P(|X̄_n − μ| > ε) → 0 as n → ∞. The transcript emphasizes this ε-based probability statement as the explicit definition.

Why are the IID assumptions central to the weak law?

Independence ensures joint probabilities for any finite subset factor into products, which is needed for variance calculations. Identical distribution ensures all X_k share the same distribution, so they have the same mean μ (and the same variance σ^2 when it exists). Together, IID guarantees the averages behave consistently as n increases.

What role does finite variance play in the proof?

Finite variance enables a variance bound that shrinks with n. With Var(X_1) = σ^2, the variance of the average becomes Var(X̄_n) = σ^2/n. Without finite variance, the Chebyshev-based argument in the transcript would not apply in this form.
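
As a quick numerical check (again our own sketch, not from the transcript), one can draw many independent copies of X̄_n and verify that their sample variance tracks σ^2/n:

```python
import random

random.seed(3)

sigma_sq = 0.25  # Var(X_1) for a fair coin
reps = 5_000     # independent copies of X̄_n per sample size

for n in (10, 100, 1_000):
    means = []
    for _ in range(reps):
        heads = sum(1 for _ in range(n) if random.random() < 0.5)
        means.append(heads / n)
    grand_mean = sum(means) / reps
    # Sample variance of the reps copies of X̄_n (Bessel-corrected).
    sample_var = sum((m - grand_mean) ** 2 for m in means) / (reps - 1)
    print(f"n = {n:>5}:  Var(X̄_n) ≈ {sample_var:.6f}   σ²/n = {sigma_sq / n:.6f}")
```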

How does Chebyshev’s inequality produce the weak law’s probability bound?

Chebyshev’s inequality states P(|Y − E[Y]| > ε) ≤ Var(Y)/ε^2. Applying it to Y = X̄_n gives P(|X̄_n − μ| > ε) ≤ Var(X̄_n)/ε^2 = (σ^2/n)/ε^2 = σ^2/(nε^2). As n → ∞, the bound σ^2/(nε^2) tends to zero, so the deviation probability vanishes.
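
For a concrete reading of the bound (the numbers are ours, not from the transcript): a fair coin has σ^2 = 1/4, so with ε = 0.01 and n = 1,000,000 tosses, Chebyshev gives P(|X̄_n − 1/2| > 0.01) ≤ 0.25/(1,000,000 × 0.0001) = 0.0025, i.e., at most a 0.25% chance that the observed frequency misses 1/2 by more than one percentage point.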

Review Questions

  1. In the coin-toss setup, what are X_k and X̄_n, and what value does X̄_n aim to approach?
  2. State the formal definition of convergence in probability used in the weak law.
  3. Under what additional condition does the transcript give a Chebyshev-based proof, and what bound does it yield?

Key Points

  1. The weak law of large numbers links theoretical probability μ to observed relative frequency via the empirical average X̄_n.
  2. Empirical probability is modeled as X̄_n = (1/n)∑_{k=1}^n X_k, where X_k indicates whether the event occurred on trial k.
  3. Convergence in probability means P(|X̄_n − μ| > ε) → 0 for every ε > 0.
  4. The IID assumptions (independence and identical distribution) ensure consistent mean behavior and enable variance calculations.
  5. Integrability (e.g., E|X_1| < ∞) is used to guarantee the mean μ is well-defined.
  6. With finite variance σ^2, Var(X̄_n) = σ^2/n, so deviations become less likely as n grows.
  7. Chebyshev’s inequality turns the variance scaling into the weak law’s probability bound: P(|X̄_n − μ| > ε) ≤ σ^2/(nε^2).

Highlights

The empirical frequency of heads after n tosses is the random average X̄_n, and it targets the theoretical probability μ = 1/2.
“Converges in probability” is an explicit ε statement: the chance of a deviation larger than ε goes to zero.
Finite variance makes the proof concrete because Var(X̄_n) shrinks like 1/n.
Chebyshev’s inequality converts the shrinking variance into a vanishing deviation probability bound.
The weak law guarantees reliability in probability, not certainty in every outcome path.
