Simpson's Paradox
Based on minutephysics's video on YouTube. If you enjoy this content, support the original creators by watching, liking, and subscribing.
Simpson’s paradox can produce opposite conclusions when comparing subgroup results versus aggregated totals from the same dataset.
Briefing
Simpson’s paradox can flip the apparent effect of a treatment depending on whether results are grouped by category or combined, so the same dataset can support opposite conclusions. In a medical-style example, treated cats all survive (100%), while untreated cats have a lower survival rate (75%). Humans fare worse overall, yet treatment still helps them: 25% of treated humans survive, versus 0% of untreated humans. Taken separately, then, the treatment looks beneficial for both cats and humans.
But once the data are aggregated across both groups, the story reverses: only 40% of the combined treated population survives, while 60% of the combined untreated population survives. The paradox isn’t a mistake in arithmetic; it’s a warning that statistical comparisons can be distorted by how categories are mixed, especially when the categories differ in baseline risk.
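To make the arithmetic concrete, here is a minimal sketch in Python. The head counts (1 treated cat, 4 untreated cats, 4 treated humans, 1 untreated human) are an assumption chosen to reproduce the stated percentages; the source gives only the rates.

```python
# Counts chosen to reproduce the stated rates; the source gives only percentages.
groups = {
    # group: (treated_survivors, treated_total, untreated_survivors, untreated_total)
    "cats":   (1, 1, 3, 4),   # treated 100%, untreated 75%
    "humans": (1, 4, 0, 1),   # treated 25%, untreated 0%
}

# Within each group, the treated survival rate is higher.
for name, (ts, tn, us, un) in groups.items():
    print(f"{name}: treated {ts / tn:.0%} vs untreated {us / un:.0%}")

# Aggregated across both groups, the comparison flips.
ts = sum(v[0] for v in groups.values())
tn = sum(v[1] for v in groups.values())
us = sum(v[2] for v in groups.values())
un = sum(v[3] for v in groups.values())
print(f"combined: treated {ts / tn:.0%} vs untreated {us / un:.0%}")  # 40% vs 60%
```

Within each group the treated rate is higher, yet the pooled treated rate is lower, because treatment is concentrated in the high-risk group (humans).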
The key drivers are causality and selection effects. If humans are more seriously ill than cats, and doctors therefore prescribe the treatment more often to the sickest humans, then the treated group starts out with a higher likelihood of death even if the treatment improves recovery chances. In that scenario, lower survival among treated humans doesn’t necessarily mean the treatment is bad; it can simply reflect that the treated humans were more likely to die in the first place.
The same logic can produce the opposite-looking pattern when treatment assignment is biased in the other direction. If humans are more likely to receive treatment than cats for non-medical reasons—such as social or financial incentives—then the raw survival rates by group can mislead. A pattern like “4 out of 5 humans die while only 1 in 5 cats die” might tempt a conclusion that treatment is harmful, even if the treatment actually helps within each group.
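This story can be checked with a small simulation, sketched below. All of the numbers (two severity levels, their base survival rates, the assignment probabilities, and a 20-point treatment benefit) are illustrative assumptions, not data from the source: the treatment helps within every severity level, yet the pooled rates make it look harmful.

```python
import random

random.seed(0)
BASE_SURVIVAL = {"mild": 0.70, "severe": 0.20}   # assumed baseline risk
P_TREATED = {"mild": 0.20, "severe": 0.80}       # sicker -> more often treated
BOOST = 0.20                                     # assumed true treatment benefit

# (severity, treated) -> [survivors, total]
tally = {(s, t): [0, 0] for s in BASE_SURVIVAL for t in (True, False)}
for _ in range(100_000):
    sev = random.choice(["mild", "severe"])
    treated = random.random() < P_TREATED[sev]
    survived = random.random() < BASE_SURVIVAL[sev] + (BOOST if treated else 0.0)
    tally[(sev, treated)][0] += survived
    tally[(sev, treated)][1] += 1

# Within each severity level, treatment wins by about 20 points.
for sev in BASE_SURVIVAL:
    (ts, tn), (us, un) = tally[(sev, True)], tally[(sev, False)]
    print(f"{sev}: treated {ts / tn:.0%} vs untreated {us / un:.0%}")

# Pooled across severity levels, treatment looks harmful (~50% vs ~60%).
ts = sum(tally[(s, True)][0] for s in BASE_SURVIVAL)
tn = sum(tally[(s, True)][1] for s in BASE_SURVIVAL)
us = sum(tally[(s, False)][0] for s in BASE_SURVIVAL)
un = sum(tally[(s, False)][1] for s in BASE_SURVIVAL)
print(f"pooled: treated {ts / tn:.0%} vs untreated {us / un:.0%}")
```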
Simpson’s paradox also shows up in real-world comparisons such as education. Wisconsin’s overall 8th grade standardized test scores have often been higher than Texas’s, suggesting better teaching. Yet when results are broken down by race, Texas students outperform Wisconsin students across black, Hispanic, and white groups. The overall ranking can still favor Wisconsin because Wisconsin has a different racial composition—proportionally fewer black and Hispanic students and more white students—so the aggregate statistic reflects demographics as much as instruction.
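A weighted-average toy model, using made-up numbers since the source gives no actual figures, shows how the aggregate can invert the subgroup comparison: every hypothetical Texas subgroup outscores the matching Wisconsin subgroup, yet Wisconsin’s overall average comes out higher because of its different demographic weights.

```python
# Hypothetical scores and population shares; not real data.
tx_scores = {"black": 270, "hispanic": 275, "white": 290}
wi_scores = {"black": 265, "hispanic": 270, "white": 287}
tx_shares = {"black": 0.15, "hispanic": 0.45, "white": 0.40}
wi_shares = {"black": 0.08, "hispanic": 0.07, "white": 0.85}

def overall(scores, shares):
    # Population-weighted average score for a state.
    return sum(scores[g] * shares[g] for g in scores)

for g in tx_scores:
    print(f"{g}: TX {tx_scores[g]} vs WI {wi_scores[g]}")  # TX wins every subgroup
print(f"overall: TX {overall(tx_scores, tx_shares):.1f} "
      f"vs WI {overall(wi_scores, wi_shares):.1f}")        # WI wins overall
```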
Visually, the paradox can appear when two subgroup trends move in the same direction, but the combined trend moves the opposite way. The transcript even uses a playful “money makes people sadder” versus “money makes cats sadder” setup to illustrate how an overall graph can mislead if the starting populations differ.
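That picture is easy to reproduce with synthetic data. In the sketch below, each of two clusters has a within-cluster slope of about +1, but because the high-x cluster sits at lower y, the pooled least-squares slope comes out negative; all numbers are invented for illustration.

```python
import random

def slope(xs, ys):
    # Ordinary least-squares slope of y on x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

random.seed(1)
# Cluster A sits at low x and high y; cluster B at high x and low y.
# Within each cluster, y rises with x (true slope +1).
xa = [random.uniform(0, 3) for _ in range(200)]
ya = [x + 10 + random.gauss(0, 0.5) for x in xa]
xb = [random.uniform(7, 10) for _ in range(200)]
yb = [x + random.gauss(0, 0.5) for x in xb]

print(f"cluster A slope: {slope(xa, ya):+.2f}")            # about +1
print(f"cluster B slope: {slope(xb, yb):+.2f}")            # about +1
print(f"combined slope:  {slope(xa + xb, ya + yb):+.2f}")  # negative
```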
The practical takeaway is procedural: controlled experiments must prevent any causally related factors from influencing who receives treatment, while uncontrolled studies must account for outside biases. Statistics alone can’t resolve the paradox; understanding what drives group differences—illness severity, treatment assignment, and demographic composition—is what determines which conclusion is meaningful.
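As a final sketch, under the same illustrative assumptions as the earlier simulation, randomizing assignment with a coin flip that ignores severity removes the selection effect, and the pooled comparison recovers the true within-group benefit.

```python
import random

random.seed(2)
BASE_SURVIVAL = {"mild": 0.70, "severe": 0.20}   # same assumed baselines as above
BOOST = 0.20                                     # assumed true treatment benefit

counts = {True: [0, 0], False: [0, 0]}           # treated? -> [survivors, total]
for _ in range(100_000):
    sev = random.choice(["mild", "severe"])
    treated = random.random() < 0.5              # coin flip, independent of severity
    p = BASE_SURVIVAL[sev] + (BOOST if treated else 0.0)
    counts[treated][0] += random.random() < p
    counts[treated][1] += 1

for treated in (True, False):
    s, n = counts[treated]
    print(f"{'treated' if treated else 'untreated'}: {s / n:.0%}")
# ~65% vs ~45%: the assumed 20-point benefit now shows up in the pooled rates.
```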
Cornell Notes
Simpson’s paradox shows how the same data can imply opposite conclusions depending on whether results are analyzed within subgroups or after combining them. In the treatment example, cats and humans show different survival rates when separated, but the aggregated survival rates reverse the apparent effect. The reversal happens because group composition can be shaped by causal factors like baseline severity or biased treatment assignment. Without controlling for those influences, aggregate statistics can misrepresent the true effect of a treatment or policy. The lesson extends beyond medicine to comparisons like education, where demographic differences can explain why overall rankings differ from subgroup performance.
- How can a treatment look beneficial in subgroup data but harmful after aggregation?
- What role does causality play in resolving Simpson’s paradox?
- Why can education rankings based on overall test scores be misleading?
- What does a “two trends that disagree with the combined trend” graph illustrate?
- What should be done in experiments to prevent Simpson’s paradox from distorting conclusions?
Review Questions
- In the cat-and-human treatment example, what specific mechanism allows the aggregated survival rate to reverse the subgroup conclusions?
- Give two different causal stories that could produce Simpson’s paradox in treatment assignment, and explain how each changes the interpretation of survival rates.
- Why can overall standardized test rankings differ from race-by-race performance comparisons?
Key Points
1. Simpson’s paradox can produce opposite conclusions when comparing subgroup results versus aggregated totals from the same dataset.
2. Baseline differences and selection effects, such as illness severity, can make treated and untreated groups incomparable.
3. Understanding treatment assignment requires causal reasoning; statistics alone can’t determine which conclusion is correct.
4. Uncontrolled studies must account for outside biases that influence who receives treatment or benefits from a program.
5. Overall education rankings can be driven by demographic composition even when subgroup performance tells a different story.
6. Graphical summaries can mislead when subgroup trends combine into an overall trend that points the opposite direction.
7. Controlled experiments should prevent causally related factors from affecting treatment assignment.