Session 45 - Hypothesis Testing Part 1 | DSMP 2023
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Hypothesis testing is presented as the decision-making tool for turning sample data into probabilistic claims about a population—especially when business, finance, and economics can’t rely on “try it and see” forever. The core problem: observed results (like longer average YouTube view duration after changing a video style) can be caused by randomness, smarter timing, or other hidden factors. For high-stakes decisions—whether a new product should replace an old one, or whether a training program truly improves productivity—teams need a structured way to test whether an observed difference is strong enough to reject a “no change” assumption.
The session frames hypothesis testing around two competing statements. The null hypothesis (H0) assumes no significant effect or difference—“nothing new is happening.” The alternative hypothesis (H1) contradicts H0 and represents the effect of interest—“the new method increases/decreases the metric” or “the weight is not equal to the claimed value.” A key rule of thumb is emphasized: H0 is the status-quo baseline, and H1 is the challenge. After collecting data, the logic is not “prove H0 true,” but “reject H0 if evidence is strong enough.” Confusion is addressed directly: failing to reject H0 does not mean H0 is true; it only means the data didn’t provide sufficient evidence against it.
To operationalize the decision, the lecture introduces the “rejection region approach” as a step-by-step workflow. First, H0 and H1 are defined (e.g., under H0 the average view duration stays at 6 minutes despite the new shooting technique, or a product’s mean weight equals the claimed 50 grams). Next, a significance level (α) is chosen, commonly 0.05 or 0.01, representing the probability of rejecting H0 when H0 is actually true (a Type I error). Then assumptions determine which test to use (for example, a z-test when the population standard deviation is known and a t-test when it isn’t). After computing a test statistic (such as a z- or t-statistic), the result is compared to critical values on the normal curve to decide whether to reject or fail to reject H0. The session also walks through two concrete examples: a training-program productivity test (sample mean productivity rises from a baseline of 50 to 53 with n=30, σ known) and a consumer “50 grams” weight-claim test (testing whether the mean weight differs from 50 using a two-sided setup with σ=4 and n=40).
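To make the workflow concrete, here is a minimal Python sketch of the rejection-region z-test applied to both examples. Two assumptions to flag: the session says σ is known for the productivity test but the summary above does not record its value, and no observed sample mean is given for the weight test, so `sigma=4` in the first call and `sample_mean=48.7` in the second are hypothetical placeholders:

```python
import math
from scipy.stats import norm

def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two-sided"):
    """Rejection-region z-test: compare the z statistic to a critical value."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    if tail == "two-sided":
        z_crit = norm.ppf(1 - alpha / 2)   # ≈ 1.96 for alpha = 0.05
        reject = abs(z) > z_crit
    elif tail == "right":
        z_crit = norm.ppf(1 - alpha)       # ≈ 1.645 for alpha = 0.05
        reject = z > z_crit
    else:  # "left"
        z_crit = norm.ppf(alpha)
        reject = z < z_crit
    return z, z_crit, reject

# Training program (right-tailed: H1 says productivity increased).
# sigma=4 is hypothetical; the lecture's value isn't recorded above.
print(z_test(sample_mean=53, mu0=50, sigma=4, n=30, tail="right"))

# "50 grams" weight claim (two-sided: H1 says mean != 50), sigma=4, n=40.
# sample_mean=48.7 is a made-up observation for illustration.
print(z_test(sample_mean=48.7, mu0=50, sigma=4, n=40, tail="two-sided"))
```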
The lecture then pivots to limitations of the rejection-region method: it can’t meaningfully distinguish between very close test statistics (e.g., 1.95 vs. 1.97) if both fall on the same side of the boundary. That motivates the next approach—p-value—promised for the following class. Additional foundational concepts are introduced: Type I vs. Type II errors, the trade-off controlled by α, and the difference between one-tailed and two-tailed tests based on whether the alternative hypothesis specifies “greater than,” “less than,” or “not equal to.” Finally, hypothesis testing is positioned as broadly useful across domains—evaluating interventions, comparing means and proportions, analyzing relationships, testing independence of categorical variables, and supporting machine learning tasks like model comparison, feature selection, hyperparameter tuning, and checking algorithm assumptions.
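A small numeric sketch of that limitation, using scipy’s standard normal helpers: with a right-tailed test at α = 0.05 the critical value is about 1.645, so 1.95 and 1.97 both land on the same side of the boundary and produce the same “reject” verdict; only the p-value, previewed here ahead of the next class, records how strong each result actually is.

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)  # ≈ 1.645 for a right-tailed test

for z in (1.95, 1.97):
    decision = "reject H0" if z > z_crit else "fail to reject H0"
    p = 1 - norm.cdf(z)  # right-tail p-value, previewing the next class
    print(f"z = {z}: {decision} (p ≈ {p:.4f})")
# Both statistics fall on the same side of the boundary, so the
# rejection-region method treats them identically; the p-values
# (≈ 0.0256 vs ≈ 0.0244) preserve the difference in evidence.
```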
Cornell Notes
Hypothesis testing turns sample evidence into a structured decision about population parameters by pitting a null hypothesis (H0: “no effect”) against an alternative hypothesis (H1: “an effect exists”). The session emphasizes that rejecting H0 means the data provide strong evidence against “no change,” while failing to reject H0 does not prove H0 is true. A significance level α (often 0.05 or 0.01) sets the tolerance for Type I error—rejecting H0 when it’s actually true. Using the rejection-region approach, the workflow defines H0/H1, chooses α, checks assumptions (z-test vs t-test), computes a test statistic, and compares it to critical values to decide reject vs not reject. Examples include testing whether a training program increases productivity and whether a package’s mean weight differs from 50 grams.
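To complement the z-test sketch in the Briefing, here is the σ-unknown branch using scipy’s one-sample t-test. Only H0: μ = 50 and n = 40 come from the session; the raw weights below are simulated stand-ins, not lecture data:

```python
import numpy as np
from scipy import stats

# Hypothetical raw weights (population sigma unknown), H0: mu = 50 grams.
rng = np.random.default_rng(42)
weights = rng.normal(loc=49.2, scale=4.0, size=40)  # simulated sample

t_stat, p_value = stats.ttest_1samp(weights, popmean=50)  # two-sided by default
t_crit = stats.t.ppf(1 - 0.05 / 2, df=len(weights) - 1)   # critical value, alpha = 0.05

print(f"t = {t_stat:.3f}, critical = ±{t_crit:.3f}, reject H0: {abs(t_stat) > t_crit}")
```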
- Why does hypothesis testing exist if you already observe differences in data (like longer view duration after changing a video style)?
- What is the null hypothesis (H0) and how is it chosen?
- What is the alternative hypothesis (H1), and how does it relate to H0?
- How does the rejection-region approach use α to decide reject vs not reject?
- What’s the key difference between Type I and Type II errors?
- When should a one-tailed test be used instead of a two-tailed test?
Review Questions
- In your own words, why does failing to reject H0 not equal proving H0 is true?
- Describe the step-by-step rejection-region workflow for a hypothesis test, including where α is used.
- Give one example of a scenario where a one-tailed test is appropriate and explain why the direction matters.
Key Points
1. Hypothesis testing converts sample results into a decision about population parameters by testing H0 (“no effect”) against H1 (“an effect exists”).
2. Rejecting H0 is evidence-based; not rejecting H0 does not prove H0 is true—it only means the evidence wasn’t strong enough.
3. The significance level α sets the probability of a Type I error (rejecting H0 when it’s actually true).
4. The rejection-region approach uses critical values: compute a test statistic and compare it to the rejection region to decide reject vs not reject.
5. Choosing the correct test depends on assumptions (e.g., a known population standard deviation calls for a z-test; an unknown one calls for a t-test).
6. One-tailed vs two-tailed tests depend on whether H1 specifies a direction (greater/less) or just “not equal to.”
7. Type I and Type II errors trade off: tightening α reduces Type I error risk but can increase Type II error risk (see the simulation sketch after this list).
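A minimal Monte Carlo sketch of the trade-off in key point 7, reusing the training-program setup (μ0 = 50, n = 30, with σ = 4 assumed as before); the true effect of +2 under H1 is a made-up value chosen to show the Type II error rate growing as α is tightened:

```python
import numpy as np
from scipy.stats import norm

# Right-tailed z-test: H0: mu = 50 vs H1: mu > 50, sigma = 4 (assumed), n = 30.
rng = np.random.default_rng(0)
mu0, sigma, n, trials = 50.0, 4.0, 30, 100_000
se = sigma / np.sqrt(n)

for alpha in (0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)
    # Type I error rate: data generated under H0 (true mean = mu0).
    z_h0 = (rng.normal(mu0, sigma, (trials, n)).mean(axis=1) - mu0) / se
    # Type II error rate: data generated under H1 (true mean = mu0 + 2, made up).
    z_h1 = (rng.normal(mu0 + 2, sigma, (trials, n)).mean(axis=1) - mu0) / se
    print(f"alpha={alpha}: Type I ≈ {(z_h0 > z_crit).mean():.3f}, "
          f"Type II ≈ {(z_h1 <= z_crit).mean():.3f}")
```

Under these assumptions, tightening α from 0.05 to 0.01 drives the simulated Type I rate down while the Type II rate rises, matching the trade-off stated in key point 7.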