
The Ultimate AI Showdown: ChatGPT vs Claude vs Gemini

Andy Stapleton · 5 min read

Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat LLM citations as two separate risks: non-existent references and citations that don’t support the specific claim.

Briefing

Large language models can produce citations that look academic while failing at the two hardest parts of scholarly referencing: finding references that actually exist and—more importantly—citing the right source for the specific claim. In a stress test across ChatGPT, Claude, and Gemini, real-world citation quality split sharply by model and by whether web search or “deep research” was enabled, with Gemini performing worst on both accuracy checks.

The testing framework separated “first-order hallucinations” (references that don’t exist) from “second-order hallucinations” (references that exist but don’t support the claim they’re attached to). Under the first-order check, ChatGPT produced correct, existing references more than 60% of the time, Claude landed around 56%, and Gemini managed only about 20%. The differences widened at the level of specific configurations: ChatGPT’s best results came from “ChatGPT 5 thinking” with web search enabled, followed by “ChatGPT 5 auto plus deep research.” Claude was inconsistent: Sonnet 4 with research hit a 100% success rate for existing references, while Opus 4.1 performed poorly, returning references that didn’t actually exist. Gemini was especially weak under pressure: Gemini 2.5 Pro with deep research and Gemini 2.5 Flash both produced no valid references, and only some other Gemini variants reached about 40%.
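The two-stage check described above can be expressed as a tiny decision function. This is a minimal sketch of the framework's logic, not code from the video; the type and function names are illustrative:

```python
from enum import Enum

class CitationVerdict(Enum):
    VALID = "valid"
    FIRST_ORDER = "first-order hallucination"    # the reference does not exist
    SECOND_ORDER = "second-order hallucination"  # it exists but doesn't support the claim

def classify_citation(reference_exists: bool, supports_claim: bool) -> CitationVerdict:
    """Apply the two checks in order: existence first, then claim support."""
    if not reference_exists:
        return CitationVerdict.FIRST_ORDER
    if not supports_claim:
        return CitationVerdict.SECOND_ORDER
    return CitationVerdict.VALID
```

The ordering matters: a reference that does not exist cannot be checked for claim support, so the existence test gates the second check.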

The second-order check, whether the cited paper actually contains the information behind the claim, was even more damning. Across configurations, just under half of ChatGPT’s citations matched the claim they were supposed to support. Claude did worse, with just over 40% of citations containing the relevant support. Gemini provided essentially no usable citations for claim support: none of its citations pointed to a paper that actually contained the cited content, making it a poor choice for academic research in this evaluation.

When the analysis drilled down further, the best-performing setups were again those that used retrieval tools. ChatGPT5 thinking with deep research or web search led the pack for second-order accuracy, while “ChatGPT5 agent” scored 0%. Claude’s success rate hovered around 40–50%, and Gemini again hit 0%.
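For readers who want to reproduce this kind of evaluation, the two headline rates can be computed from per-citation check results. A minimal sketch under assumed input shapes (not the video's actual scoring code):

```python
def citation_scores(results):
    """Compute first- and second-order success rates.

    results: list of (reference_exists, supports_claim) boolean pairs,
             one per citation checked.
    Returns (first_order_rate, second_order_rate): the share of citations
    that exist at all, and the share that both exist and support their claim.
    """
    if not results:
        return 0.0, 0.0
    n = len(results)
    existing = sum(1 for exists, _ in results if exists)
    supported = sum(1 for exists, supports in results if exists and supports)
    return existing / n, supported / n
```

Note that the second-order rate is measured against all citations, so it can never exceed the first-order rate; a model that invents references fails both checks at once.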

A key takeaway is that paying for a model doesn’t guarantee better citation reliability. The test also highlighted a common failure mode: even when a cited fact appears in the paper, it may sit in the introduction rather than in the section that directly supports the claim. This amounts to “double extraction”: the model cites what one paper reports another paper said instead of the primary source. These systems behave like plausibility engines, generating outputs that feel right even when the scholarly grounding is missing.

For safer workflows, the guidance was to treat LLM answers only as a starting point, then trace every claim back to the PDF and page. For reference gathering specifically, the transcript recommends avoiding LLMs as the primary source and instead using research-focused tools: Elicit (which checks papers in the background), SciSpace (with paper search and literature review built on real references), and Consensus (for field-level yes/no answers).
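Part of the "trace every claim" step, checking that a reference exists at all, can be automated against a bibliographic database such as Crossref. This sketch only builds the search URL for Crossref's public REST API (the helper name is mine, not from the video); fetching the URL returns JSON whose result list will be empty or off-topic for an invented reference:

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_lookup_url(reference: str, rows: int = 3) -> str:
    """Build a Crossref free-text bibliographic search URL for a reference
    string (title, authors, year). Fetch it with urllib or requests and
    inspect message.items in the JSON response to see candidate matches."""
    return f"{CROSSREF_WORKS}?{urlencode({'query.bibliographic': reference, 'rows': rows})}"
```

Existence is only the first-order check; even a real match still needs the second-order check of reading the paper itself.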

Cornell Notes

The core finding is that LLM citations often fail in two ways: references may not exist (first-order hallucinations), and even when references exist, they may not support the specific claim (second-order hallucinations). In tests across ChatGPT, Claude, and Gemini, ChatGPT performed best overall for existing references (over 60%), while Gemini lagged (about 20%). For claim-supported citations, ChatGPT landed just under 50%, Claude just over 40%, and Gemini hit 0%, meaning none of its citations pointed to a paper that actually contained the cited content. The results also show that enabling web search or deep research matters more than paying for a premium tier. For academic work, the transcript recommends tracing claims to the PDF and using tools like Elicit, SciSpace, and Consensus instead of relying on LLMs for reference collection.

What’s the difference between first-order and second-order hallucinations in citations?

First-order hallucinations are when the model outputs a reference that doesn’t actually exist (missing or incorrect bibliographic details). Second-order hallucinations happen when the reference exists, but the cited paper doesn’t contain the information for the claim it’s attached to—so the citation doesn’t truly support the statement.

Which models and settings performed best for finding references that actually exist?

ChatGPT’s strongest configuration was “ChatGPT 5 thinking” with web search enabled, followed by “ChatGPT 5 auto plus deep research.” Claude was mixed: Sonnet 4 with research achieved a 100% success rate for existing references, while Opus 4.1 performed poorly, producing references that didn’t exist. Gemini was weakest under pressure: Gemini 2.5 Pro with deep research and Gemini 2.5 Flash returned no valid references, with other Gemini variants around 40%.

How did the models perform when checking whether citations truly support the claims?

Across configurations, just under 50% of ChatGPT’s citations supported their claims. Claude was slightly lower, at just over 40%. Gemini effectively failed: none of its citations pointed to a paper that actually contained the content it was cited for.

Why does paying for a model not necessarily improve citation quality?

The test found that premium access didn’t guarantee better grounding. Gemini’s paid configuration (Gemini 2.5 Pro with deep research) still produced no valid references, and the best results depended more on retrieval features like web search and deep research than on subscription status.

What common citation failure mode did the test highlight?

Even when a cited fact appears somewhere in a paper, it may be in the introduction rather than in the section that directly supports the claim. That can reflect “double extraction,” where the model pulls what one paper says another paper said, rather than citing primary support.

What workflow is recommended to reduce citation errors in academic use?

Use LLM output as a draft, then trace every claim to the original PDF and page. For reference gathering, rely on research-focused tools: Elicit (background-checked real papers), SciSpace (paper search and literature review from real references), and Consensus (yes/no answers for research fields).

Review Questions

  1. How would you design a check to distinguish a non-existent citation from a citation that doesn’t actually support the claim?
  2. Which retrieval features (web search vs deep research vs agent mode) were most associated with higher second-order citation accuracy, and what were the worst-performing modes?
  3. Why might a citation still be misleading even if the cited paper contains the relevant idea somewhere in the document?

Key Points

  1. Treat LLM citations as two separate risks: non-existent references and citations that don’t support the specific claim.
  2. In the test, Gemini produced about 0% claim-supported citations, making it especially unreliable for academic research.
  3. Enabling web search or deep research improved citation reliability for ChatGPT, while some agent-style modes performed poorly (including 0% in one configuration).
  4. Premium tiers did not automatically improve citation accuracy; retrieval behavior mattered more than payment.
  5. A common failure mode is “double extraction,” where the cited support is indirect (e.g., mentioned in an introduction rather than directly supporting the claim).
  6. For academic use, trace every claim to the PDF and page before relying on it.
  7. Use research-first tools like Elicit, SciSpace, and Consensus for reference gathering instead of relying on LLMs as the primary source.

Highlights

Gemini’s citations failed the claim-support test: none of them pointed to a paper that actually contained the cited content.
ChatGPT’s best performance came from “ChatGPT5 thinking” with web search enabled, not from paying for a premium tier alone.
Even when references exist, they can be academically misleading if they don’t support the claim in the cited context (often due to indirect or introduction-level mentions).
The safest workflow is to generate with an LLM only as a starting point, then verify every claim against the original PDF and page.