
Is Most Published Research Wrong?

Veritasium · 5 min read

Based on Veritasium's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

A low p-value can be misleading when the analysis effectively involves many unreported tests or comparisons rather than a single pre-specified hypothesis.

Briefing

A small statistical bump can look like evidence of something extraordinary, but the modern “reproducibility crisis” suggests that many published findings—across psychology, medicine, and even physics—are more likely to be false positives than reliable truths. The core issue isn’t that researchers are always dishonest; it’s that common statistical practices, especially reliance on p-values with a fixed cutoff, can systematically inflate the odds of publishing results that don’t replicate.

The discussion begins with a 2011 paper in the Journal of Personality and Social Psychology claiming “anomalous retroactive influences on cognition and affect.” One experiment had participants choose between two curtains, expecting a hit rate near 50% because the image placement was random. For neutral and negative images, the hit rate matched chance. For erotic images, the hit rate rose to 53%, which produced a p-value of 0.01—seemingly strong enough to reject the null hypothesis that participants were merely guessing. But the significance threshold itself is arbitrary: the commonly used 0.05 line traces back to Ronald Fisher’s 1925 work. That arbitrariness matters because it interacts with how researchers test many possibilities.

The transcript then shifts from one striking result to a broader math problem: when researchers test many hypotheses, the expected rate of false positives among “significant” findings can far exceed the naive 5% intuition. A worked example assumes a field testing 1,000 hypotheses, with only 10% truly reflecting real effects. Even with decent statistical power (80%), false negatives still occur, and among the many false hypotheses, a p<0.05 rule can incorrectly flag dozens as true. Because journals preferentially publish positive outcomes—while negative results are less likely to appear—published literature can end up with a substantial fraction of incorrect claims, potentially approaching “nearly a third” even when the system is functioning as intended.
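The arithmetic behind that “nearly a third” figure is easy to verify. The snippet below is a minimal sketch using the transcript's assumed numbers (1,000 hypotheses, 10% true, 80% power, a 0.05 threshold), not data from any real field:

```python
# Worked example from the transcript: 1,000 hypotheses, 10% true,
# 80% statistical power, significance threshold alpha = 0.05.
n_hypotheses = 1000
true_fraction = 0.10
power = 0.80          # chance of detecting a real effect
alpha = 0.05          # chance a false hypothesis still crosses p < 0.05

n_true = n_hypotheses * true_fraction          # 100 real effects
true_positives = n_true * power                # 80 correctly detected
false_negatives = n_true * (1 - power)         # 20 real effects missed

n_false = n_hypotheses - n_true                # 900 hypotheses with no real effect
false_positives = n_false * alpha              # 45 spurious "significant" results

significant = true_positives + false_positives # 125 results clear the threshold
print(f"False positives among significant results: "
      f"{false_positives / significant:.0%}")  # 36%, roughly a third
```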

Replication efforts underscore the concern. The Reproducibility Project: Psychology re-ran 100 published psychology studies and found that only 36% produced statistically significant results the second time, with effect sizes averaging about half of the original estimates. Among cancer “landmark” studies, only six of 53 were reproduced even when researchers worked closely with the original authors.

The transcript illustrates how p-values can be gamed without overt fraud through “p-hacking.” A widely publicized 2015 chocolate-and-weight-loss study was later found to have been intentionally designed to increase false positives: tiny sample sizes (five per group) and many tracked outcomes (18 measurements per person) meant that if weight didn’t show significance, another variable could. The headline effect emerged because the probability of at least one significant result rises quickly when many comparisons are possible.

Incentives and publication rules further distort the evidence. Journals tend to favor novel, statistically significant results, while replication studies often struggle to get published. The transcript argues that science is still the best available method for approaching truth, but the frequency of wrong turns—despite peer review and sophisticated statistics—raises a sobering question: how often do people mislead themselves when they rely on less rigorous approaches?

Cornell Notes

The transcript argues that p-values and publication incentives can make false positives common, even when researchers are acting in good faith. A p<0.05 cutoff can look decisive, but when many hypotheses or many measurements are tested, the chance that at least one result crosses the threshold rises sharply. Replication projects in psychology and attempts to verify cancer “landmark” studies found low success rates and smaller effect sizes the second time around. Examples like the “chocolate lowers weight faster” story show how small samples plus many outcomes can manufacture statistically significant headlines. The takeaway: scientific knowledge self-corrects imperfectly, so results need replication and better safeguards against selective analysis.

Why does a p-value like 0.01 not automatically mean an extraordinary claim is reliable?

A p-value measures how likely an observed result (or something more extreme) would be if the null hypothesis were true. In the retroactive-influence example, a 53% hit rate for erotic images produced p=0.01, but the transcript stresses that the 0.05 significance threshold is arbitrary (linked to Ronald Fisher). More importantly, p-values assume a single, pre-specified test; if researchers effectively “try many shots” (many hypotheses or many outcomes), the probability of seeing a low p-value by chance increases.
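For intuition about where such a p-value comes from, one common approach is to test the observed hit rate against pure chance with an exact binomial test; the original study's actual analysis may differ, and the trial count below is a hypothetical stand-in, since the transcript does not say how many guesses were made. With roughly 1,500 guesses, a 53% hit rate lands near p = 0.01:

```python
from scipy.stats import binomtest

# Hypothetical illustration only: the transcript does not report the number of
# guesses, so assume 1,500 trials with a 53% hit rate (795 hits).
n_trials = 1500
hits = 795

# Null hypothesis: participants are merely guessing, so the true hit rate is 50%.
result = binomtest(hits, n_trials, p=0.5, alternative="greater")
print(f"One-sided p-value: {result.pvalue:.3f}")  # roughly 0.01 for these inputs
```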

How can testing many hypotheses inflate false positives beyond the naive “5%” expectation?

When many hypotheses are tested and only a minority describe real effects, the large pool of false hypotheses still generates significant-looking results. The transcript’s numerical example assumes 1,000 hypotheses with 10% true. With 80% power, about 80 of the 100 true relationships are detected and 20 are missed. Among the 900 false hypotheses, a p<0.05 threshold incorrectly labels about 45 as significant. Because journals publish positive results more often than null results, the published set can contain a large fraction of false positives: roughly 45 of the 125 significant findings, close to a third.

What is p-hacking, and how does it relate to multiple measurements in real studies?

p-hacking refers to analysis choices that reduce p-values and increase the chance of crossing the significance threshold. The transcript’s chocolate example shows the mechanism: researchers tracked 18 different measurements per participant (weight, cholesterol, sodium, blood proteins, sleep quality, well-being, etc.) with only five people per group. If weight didn’t show significance, another outcome could. That turns a single-test p-value into a “many comparisons” problem, where at least one significant result becomes likely.
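A back-of-the-envelope calculation shows how fast this grows. Assuming, for simplicity, that the 18 outcomes are independent and none has a real effect (real outcomes are correlated, so this is only a rough sketch):

```python
# Chance of at least one outcome crossing p < 0.05 by luck alone,
# assuming 18 independent outcomes with no true effect on any of them.
alpha = 0.05
n_outcomes = 18  # measurements tracked per participant in the chocolate study

p_at_least_one = 1 - (1 - alpha) ** n_outcomes
print(f"P(at least one 'significant' outcome): {p_at_least_one:.0%}")  # ~60%
```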

Why do replication attempts often fail to appear in the literature even when they matter?

The transcript highlights that replication studies can be rejected by journals. In the retroactive-influence case, three researchers attempted replication and found a hit rate not significantly different from chance; when they tried to publish in the same journal, the submission was rejected. The practical consequence is incentive distortion: scientists may avoid replication because it’s less likely to be published and more likely to be labeled as “doing it wrong” if results don’t reproduce.

How do incentives and publication bias affect what gets believed?

Journals favor statistically significant and novel findings, and researchers face career incentives tied to publication. The transcript quotes Brian Nosek’s line: “There is no cost to getting things wrong, the cost is not getting them published.” This encourages strategies that increase the probability of significance—such as selective analysis or testing unlikely hypotheses—thereby worsening the ratio of true to spurious effects in the published record.

Review Questions

  1. In the retroactive-influence example, what would need to change for the erotic-image result to become more convincing than a single p-value?
  2. Using the transcript’s 1,000-hypothesis scenario, explain why false positives can dominate among published significant results even when power is reasonably high.
  3. What specific study design features (sample size, number of outcomes, analysis flexibility) most increase the risk of p-hacking?

Key Points

  1. A low p-value can be misleading when the analysis effectively involves many unreported tests or comparisons rather than a single pre-specified hypothesis.
  2. The common p<0.05 threshold is historically arbitrary, and its meaning changes when researchers have flexibility in how data are collected and analyzed.
  3. When many hypotheses are tested, the expected false-positive rate among “significant” findings can be far larger than the naive 5% intuition.
  4. Replication failures in psychology and limited success in cancer “landmark” verification suggest that published effect sizes often shrink or disappear under re-testing.
  5. p-hacking can occur through legitimate-seeming choices (e.g., stopping rules, adding data, selecting among multiple outcomes), especially with small samples.
  6. Publication incentives and journal policies can discourage replication, slowing the self-correction of science.
  7. Recent reforms—large-scale replications, Retraction Watch, negative-results repositories, and preregistration—aim to reduce publication bias and analysis flexibility.

Highlights

  • A 53% “hit rate” in a retroactive-influence experiment produced p=0.01, yet the transcript emphasizes that significance thresholds don’t protect against multiple-testing problems.
  • A worked 1,000-hypothesis example shows how p<0.05 can yield dozens of false positives among published results when most tested hypotheses are wrong.
  • The chocolate-and-weight-loss headline was later tied to an intentionally false-positive design: five participants per group and 18 tracked outcomes per person.
  • Replication can be blocked by journal policy; a failed replication attempt submitted to the same journal was rejected, illustrating incentive problems.
  • Even with peer review and sophisticated statistics, the transcript argues that scientific error remains common enough to demand replication and caution.

Topics

  • p-values
  • False Positives
  • Reproducibility Crisis
  • p-Hacking
  • Replication Incentives
