
ChatGPT Will Destroy Your Papers If You Let It

Andy Stapleton · 4 min read

Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat citation reliability as two separate problems: reference existence and claim-to-evidence alignment.

Briefing

Researchers trying to use ChatGPT for academic writing face a practical risk: large language models can produce citations that exist but don’t actually support the claims being attributed to them. A stress test by Andy Stapleton’s team breaks that problem into two failure modes—first-order hallucinations (whether a cited paper truly exists) and second-order hallucinations (whether the paper contains the specific content for which it’s being cited). The results point to a clear workflow: rely on ChatGPT with “deep research” enabled and prefer the “auto” mode over agent-style tools.

In the first-order check, the team randomly selected five papers and asked different ChatGPT configurations to return summaries, direct quotations, and APA-style bibliographies. With ChatGPT5 auto plus deep research, all five references were correct—papers existed as cited. By contrast, ChatGPT instant plus web search produced only two correct references out of five, suggesting that web search alone doesn’t reliably anchor citations to real academic sources. Adding deep research improved outcomes across configurations: ChatGPT5 thinking with deep research beat web-search-only approaches, reinforcing the idea that deeper retrieval and verification steps reduce outright fabrication.
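
The video doesn't share a checking script, but the first-order test is straightforward to automate yourself. As a minimal sketch (an illustration, not the team's method), the snippet below queries Crossref's public REST API to confirm that a citation resolves to a real bibliographic record; the matching heuristic is an assumption for demonstration only.

```python
# Minimal sketch of a first-order existence check (illustrative, not the
# team's tooling): ask Crossref's public REST API whether a citation
# resolves to a real bibliographic record.
import requests

def reference_exists(title: str, author: str) -> bool:
    """Return True if Crossref returns a record whose title matches closely."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": f"{title} {author}", "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    wanted = title.lower()
    # Crude match: accept if any returned title contains the cited title.
    return any(
        wanted in candidate.lower()
        for item in items
        for candidate in item.get("title", [])
    )

# Hypothetical usage: flag any model-produced reference that fails the check.
# if not reference_exists("Some Cited Title", "Smith"):
#     print("possible first-order hallucination: no matching record found")
```

Anything that fails a lookup like this deserves manual checking before it goes anywhere near a bibliography.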

The second-order test targeted a subtler but more damaging issue: even when a reference exists, the model can cite it for the wrong reason. Using the same set of model configurations, the team evaluated whether each cited paper actually contained the content backing the claim attributed to it (the claim-citation match). ChatGPT5 auto plus deep research again performed best, though it was not perfect: some citations still didn't align cleanly with the claims attributed to them. Deep research consistently improved accuracy, while web search lagged behind. "Thinking" modes helped, but not as much as the auto configuration paired with deep research.
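
Automating the second-order check is harder, because it means judging whether a paper supports a claim. One coarse screen, sketched below as an assumption-laden illustration rather than anything from the video, is to flag citations whose abstracts share little vocabulary with the claim; anything flagged still needs human reading.

```python
# Rough illustrative screen for second-order mismatches: flag citations whose
# source abstract shares little vocabulary with the claim attributed to it.
# A coarse triage heuristic, never a substitute for reading the paper.
import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is",
             "are", "that", "this", "for", "with", "by", "as", "it", "be"}

def content_words(text: str) -> set[str]:
    """Lowercase alphabetic tokens with common stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def claim_overlap(claim: str, abstract: str) -> float:
    """Fraction of the claim's content words that also appear in the abstract."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(abstract)) / len(claim_words)

# Arbitrary threshold: low overlap means "verify this citation by hand".
# if claim_overlap(claim_text, abstract_text) < 0.4:
#     print("possible second-order mismatch")
```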

The most striking finding involved ChatGPT5 agent. Although the agent sometimes returned references that existed, it failed the second-order requirement: the cited papers did not contain the specific information for which they were being referenced. In the team’s random sample, the agent produced zero correct results for the citation-content match. That outcome challenges the expectation that agentic systems—designed to go out and “do stuff” in the world—will automatically reduce citation errors.

Taken together, the takeaways are operational rather than theoretical. For serious academic work, turn on deep research, use ChatGPT5 auto, and avoid ChatGPT5 agent. If deep research isn't available, web-search-only workflows appear substantially riskier for both real-reference accuracy and claim-to-evidence alignment. The team also plans further benchmarking of other AI tools for academia, with the goal of identifying which systems are worth time and money versus those prone to citation mismatches.

Cornell Notes

The core finding is that citation reliability depends on more than whether a referenced paper exists. A stress test separated “first-order hallucinations” (fake or nonexistent references) from “second-order hallucinations” (real papers cited for the wrong claim). Across both checks, ChatGPT5 auto plus deep research produced the most accurate results, including 100% correct existence checks in a five-paper sample. Deep research improved performance consistently, while web search alone lagged. ChatGPT5 agent performed worst on the claim-to-evidence match, returning references that existed but did not contain the cited content.

What’s the difference between first-order and second-order hallucinations in academic citations?

First-order hallucinations ask whether the cited paper actually exists. Second-order hallucinations ask whether the paper contains the specific content that justifies the claim being attributed to it. A model can pass the first test while still failing the second by citing a real paper for an incorrect reason.

Which ChatGPT configuration performed best for real-reference accuracy (first-order)?

ChatGPT5 auto plus deep research. In a sample where five papers were randomly selected, all five references were correct (the papers existed as cited). Other setups—like ChatGPT instant plus web search—scored lower (two correct out of five).

How did deep research affect citation quality compared with web search?

Deep research improved both existence checks and claim-to-evidence alignment. The team found that adding deep research consistently raised accuracy versus web-search-only workflows; "thinking" modes also helped, but still didn't beat auto plus deep research.

What did the second-order test reveal about citation-content matching?

Even with the best configuration, accuracy wasn't perfect. ChatGPT5 auto plus deep research performed best on the claim-citation match, but some citations still had mismatches, meaning the cited paper didn't fully support the specific claim as presented. This is why second-order checking matters.

Why was ChatGPT5 agent considered a major red flag?

ChatGPT5 agent failed the claim-citation match in the team's random sample. References could exist, but the cited papers did not contain the information they were being used to support, resulting in zero correct results for the second-order requirement.

Review Questions

  1. If a model returns citations to real papers but the quoted evidence doesn’t match the claim, which hallucination type is that?
  2. What evidence from the stress test supports using deep research over web search for academic referencing?
  3. Why might an agentic workflow still produce citation errors even when it can “go out” to find sources?

Key Points

  1. Treat citation reliability as two separate problems: reference existence and claim-to-evidence alignment.

  2. Enable deep research when using ChatGPT for academic references; it improved accuracy in both tests.

  3. Prefer ChatGPT5 auto over agent-style workflows for citation correctness.

  4. Web-search-only configurations were less reliable for both real references and correct citation-content matches.

  5. Even the best setup (auto plus deep research) can still produce some citation-content mismatches, so verification remains necessary.

  6. Agentic mode (ChatGPT5 agent) can return real references that don't support the cited claims, making it especially risky for research writing.

Highlights

ChatGPT5 auto plus deep research hit 100% correct reference existence in a five-paper sample, while web-search-only approaches lagged.
Second-order hallucinations showed that real papers can still be cited for the wrong reason, breaking the claim-citation match.
ChatGPT5 agent produced zero correct results for citation-content matching in the random sample—real references without the supporting content.
