Is Most Published Research Wrong?
Based on Veritasium's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A small statistical bump can look like evidence of something extraordinary, but the modern “reproducibility crisis” suggests that many published findings—across psychology, medicine, and even physics—are more likely to be false positives than reliable truths. The core issue isn’t that researchers are always dishonest; it’s that common statistical practices, especially reliance on p-values with a fixed cutoff, can systematically inflate the odds of publishing results that don’t replicate.
The discussion begins with a 2011 paper in the Journal of Personality and Social Psychology claiming “anomalous retroactive influences on cognition and affect.” One experiment had participants choose between two curtains, expecting a hit rate near 50% because the image placement was random. For neutral and negative images, the hit rate matched chance. For erotic images, the hit rate rose to 53%, which produced a p-value of 0.01—seemingly strong enough to reject the null hypothesis that participants were merely guessing. But the significance threshold itself is arbitrary: the commonly used 0.05 line traces back to Ronald Fisher’s 1925 work. That arbitrariness matters because it interacts with how researchers test many possibilities.
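The arithmetic behind such a p-value can be sketched directly. The snippet below computes the one-sided probability of seeing at least a 53% hit rate purely by chance; the trial count of 1,000 is an illustrative assumption, not the paper's actual design:

```python
from math import comb

def binomial_tail_p(n, k, p=0.5):
    """One-sided p-value: probability of observing k or more successes
    in n trials under the null hypothesis of pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical run: 530 hits out of 1,000 binary choices (a 53% hit rate).
p_value = binomial_tail_p(1000, 530)
print(f"one-sided p = {p_value:.4f}")
```

With a large enough sample, even a 3-point bump over chance crosses conventional significance, which is why the threshold alone says little about whether the underlying claim is plausible.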
The transcript then shifts from one striking result to a broader math problem: when researchers test many hypotheses, the expected rate of false positives among “significant” findings can far exceed the naive 5% intuition. A worked example assumes a field testing 1,000 hypotheses, of which only 100 (10%) reflect real effects. Even with decent statistical power (80%), only 80 of those real effects are detected, while a p<0.05 rule incorrectly flags about 45 of the 900 false hypotheses as true. Because journals preferentially publish positive outcomes, while negative results are less likely to appear, the published literature can end up with a substantial fraction of incorrect claims: 45 of the 125 significant results, or roughly a third, are wrong even when the system is functioning as intended.
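The worked example above reduces to a few lines of arithmetic, reproduced here as a check:

```python
# Reproduce the transcript's 1,000-hypothesis scenario.
n_hypotheses = 1000
true_fraction = 0.10        # only 10% of tested hypotheses are real effects
power = 0.80                # chance of detecting a real effect
alpha = 0.05                # false-positive rate per false hypothesis

n_true = int(n_hypotheses * true_fraction)      # 100 real effects
n_false = n_hypotheses - n_true                 # 900 null effects

true_positives = power * n_true                 # 80 detected real effects
false_positives = alpha * n_false               # 45 spurious "discoveries"

significant = true_positives + false_positives  # 125 significant results
false_share = false_positives / significant
print(f"{false_positives:.0f} of {significant:.0f} significant results "
      f"are false ({false_share:.0%})")
```

The 5% error rate applies per false hypothesis, but because false hypotheses vastly outnumber true ones, the false share among the *significant* results climbs to about a third.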
Replication efforts underscore the concern. The Reproducibility Project, which re-ran 100 psychology studies, found that only 36% produced statistically significant results the second time, and the replicated effect sizes averaged about half of the original estimates. Among 53 “landmark” cancer studies, only six could be reproduced, even when researchers worked closely with the original authors.
The transcript illustrates how p-values can be gamed without overt fraud through “p-hacking.” A widely publicized 2015 chocolate-and-weight-loss study was later revealed to have been deliberately designed to generate false positives: tiny sample sizes (five per group) and many tracked outcomes (18 measurements per person) meant that if weight loss itself didn’t reach significance, one of the other variables likely would. The headline effect emerged because the probability of at least one significant result rises quickly as the number of comparisons grows.
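The multiple-comparisons effect is easy to quantify. Treating the outcomes as independent (a simplification, since real measurements are correlated, but the direction of the effect is the same):

```python
# Probability that at least one of m outcomes crosses p < 0.05
# purely by chance, as in the 18-outcome chocolate study.
alpha = 0.05
for m in (1, 5, 18):
    p_any = 1 - (1 - alpha) ** m
    print(f"{m:>2} outcomes: P(at least one 'significant') = {p_any:.2f}")
```

With 18 tracked outcomes, the odds of at least one spurious “significant” result are better than a coin flip, so a headline was nearly guaranteed.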
Incentives and publication rules further distort the evidence. Journals tend to favor novel, statistically significant results, while replication studies often struggle to get published. The transcript argues that science is still the best available method for approaching truth, but the frequency of wrong turns—despite peer review and sophisticated statistics—raises a sobering question: how often do people mislead themselves when they rely on less rigorous approaches?
Cornell Notes
The transcript argues that p-values and publication incentives can make false positives common, even when researchers are acting in good faith. A p<0.05 cutoff can look decisive, but when many hypotheses or many measurements are tested, the chance that at least one result crosses the threshold rises sharply. Replication projects in psychology and attempts to verify cancer “landmark” studies found low success rates and smaller effect sizes the second time around. Examples like the “chocolate lowers weight faster” story show how small samples plus many outcomes can manufacture statistically significant headlines. The takeaway: scientific knowledge self-corrects imperfectly, so results need replication and better safeguards against selective analysis.
Why does a p-value like 0.01 not automatically mean an extraordinary claim is reliable?
How can testing many hypotheses inflate false positives beyond the naive “5%” expectation?
What is p-hacking, and how does it relate to multiple measurements in real studies?
Why do replication attempts often fail to appear in the literature even when they matter?
How do incentives and publication bias affect what gets believed?
Review Questions
- In the retroactive-influence example, what would need to change for the erotic-image result to become more convincing than a single p-value?
- Using the transcript’s 1,000-hypothesis scenario, explain why false positives can dominate among published significant results even when power is reasonably high.
- What specific study design features (sample size, number of outcomes, analysis flexibility) most increase the risk of p-hacking?
Key Points
1. A low p-value can be misleading when the analysis effectively involves many unreported tests or comparisons rather than a single pre-specified hypothesis.
2. The common p<0.05 threshold is historically arbitrary, and its meaning changes when researchers have flexibility in how data are collected and analyzed.
3. When many hypotheses are tested, the expected false-positive rate among “significant” findings can be far larger than the naive 5% intuition.
4. Replication failures in psychology and limited success in cancer “landmark” verification suggest that published effect sizes often shrink or disappear under re-testing.
5. p-hacking can occur through legitimate-seeming choices (e.g., stopping rules, adding data, selecting among multiple outcomes), especially with small samples.
6. Publication incentives and journal policies can discourage replication, slowing the self-correction of science.
7. Recent reforms—large-scale replications, Retraction Watch, negative-results repositories, and preregistration—aim to reduce publication bias and analysis flexibility.
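The “stopping rules” mechanism in key point 5 can be illustrated with a small simulation (a hypothetical setup, not from the transcript): even when the null hypothesis is true, checking for significance every few observations and stopping at the first p < 0.05 inflates the false-positive rate well above the nominal 5%.

```python
import random
from math import sqrt

def peeking_false_positive_rate(n_trials=2000, max_n=100, peek_every=10, seed=1):
    """Simulate a researcher studying a true null effect who checks for
    significance after every `peek_every` observations and stops as soon
    as |z| > 1.96. Returns the fraction of runs declared 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)                # pure-noise data: no real effect
            if n % peek_every == 0:
                z = (total / n) / (1 / sqrt(n))     # z-score of the running mean
                if abs(z) > 1.96:                   # "significant" at the 5% level
                    hits += 1
                    break
    return hits / n_trials

rate = peeking_false_positive_rate()
print(f"false-positive rate with peeking: {rate:.1%} (nominal: 5.0%)")
```

Preregistration counters exactly this flexibility by fixing the sample size and analysis plan before any data arrive.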