Best AI for Literature Reviews? Only ONE Passed the Test
Based on Andy Stapleton's video on YouTube. If you find this useful, support the original creator by watching, liking, and subscribing.
Gemini AI produced the most verifiable citations in the audit, with an estimated ~1% hallucination/failure rate.
Briefing
AI tools can draft literature reviews quickly, but the real risk is whether the citations actually exist. After stress-testing three systems on the same thesis-style prompt (with requests for peer-reviewed sources and IEEE-style in-text citations), Gemini AI from Google produced the most reliable reference list, while GenSpark generated the highest share of non-existent or fabricated citations.
Manis was the fastest to return a structured literature review, generating a 14-page document with 38 references. The write-up itself looked strong: it covered key research themes, their evolution over time, conflicting viewpoints, gaps, and suggestions for future research. The problem emerged when every citation was checked against Google Scholar and, when needed, against the journal's own records. Manis's failure rate landed at about 16%: not catastrophic, but high enough to demand manual verification. Errors included wrong bibliographic details (such as an incorrect year or journal name), repeated references, and fully fabricated entries where the cited work could not be found at all.
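For readers who want to run this kind of audit at scale, here is a minimal sketch of a programmatic first pass. It is not the method from the video (the audit there was manual, via Google Scholar and journal records); it assumes the public Crossref REST API (api.crossref.org) as the lookup backend, and the `verify_citation` helper and the 0.9 match threshold are illustrative choices, not anything the video prescribes.

```python
# Sketch of an automated citation check, assuming the Crossref REST API
# as a stand-in for manual Google Scholar lookups. Requires `requests`.
import difflib
import requests

def verify_citation(title, year=None):
    """Look up a cited title on Crossref and report the closest match."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 3},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]

    # Pick the candidate whose title is most similar to the cited title.
    best, best_score = None, 0.0
    for item in items:
        candidate = (item.get("title") or [""])[0]
        score = difflib.SequenceMatcher(
            None, title.lower(), candidate.lower()
        ).ratio()
        if score > best_score:
            best, best_score = item, score

    # Below the (arbitrary) threshold, treat the reference as unverified,
    # i.e. a possible fabrication that needs manual escalation.
    if best is None or best_score < 0.9:
        return {"title": title, "status": "not found / possible fabrication"}

    result = {
        "title": title,
        "status": "found",
        "doi": best.get("DOI"),
        "journal": (best.get("container-title") or [None])[0],
    }
    # Flag metadata mismatches (e.g. wrong year), one of the error types
    # the audit observed in otherwise-real references.
    issued = best.get("issued", {}).get("date-parts", [[None]])[0][0]
    if year is not None and issued is not None and issued != year:
        result["status"] = f"found, but year mismatch ({issued} vs {year})"
    return result

# Example: audit one reference from an AI-generated list.
print(verify_citation("Attention Is All You Need", year=2017))
```

A pass like this only flags candidates; anything returning a mismatch or no match still needs the manual escalation to the journal's own records described above.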
GenSpark finished second (roughly 5–7 minutes) and produced a shorter reference set of 19 references, but it performed worse on citation accuracy. When the references were audited, GenSpark's rate of non-existent or incorrect citations came to about 26%. Multiple entries simply did not exist, and some citations were only "close" matches: similar titles or related works, not the exact peer-reviewed sources claimed. The output format also proved less convenient for workflow: unlike Manis's PDF, GenSpark's results were harder to extract into a file suitable for reference management.
Gemini AI (advanced tier, using "deep research") took the longest, around 20 minutes, but delivered a 61-page, table-heavy review with 105 references. The key differentiator was citation verifiability: after checking the reference list, Gemini's hallucination rate was about 1%. The citations mostly resolved to real sources in some form, including a thesis and a website, and it even surfaced very recent work (including 2025 research). The trade-offs were practical and stylistic rather than factual. Gemini did not strictly limit sources to peer-reviewed journal articles, and one cited item led to a live page whose underlying content the checker couldn't access. It also didn't format references cleanly to IEEE expectations: one entry was labeled "I E" rather than following the intended IEEE style.
Taken together, the audit suggests a clear hierarchy for literature-review drafting with citation reliability: Gemini AI for the lowest rate of broken references, Manis as a strong second option that still requires systematic checking, and GenSpark as the riskiest choice for citation accuracy. The takeaway is blunt: even when the narrative reads convincingly, references must be verified, because fabricated citations can be common, especially outside the top performer.
Cornell Notes
A citation audit of three AI literature-review tools found that reference accuracy varies dramatically. Manis produced a structured review quickly and generated 38 references, but about 16% of citations were wrong, repeated, or not found in Google Scholar or journal records. GenSpark was slower than Manis, produced 19 references, and had a higher failure rate—about 26%—including multiple citations that did not exist. Gemini AI (advanced deep research) took longer and produced a much larger review (61 pages, 105 references), but its citations were the most verifiable, with roughly a 1% hallucination/failure rate. Gemini’s main drawbacks were weaker adherence to “peer-reviewed only” and some formatting/workflow limitations, not widespread citation fabrication.
How was citation reliability measured across the tools?
What were the citation failure rates for Manis and what kinds of errors appeared?
Why did GenSpark rank lower despite producing a coherent review?
What made Gemini AI the top performer in the audit?
What practical limitations still matter even when citations are mostly real?
Review Questions
- Which tool had the highest citation failure rate, and what specific verification failures were most common?
- How did the auditing process handle citations that didn’t appear on Google Scholar?
- What trade-offs did Gemini AI have despite its low hallucination rate (e.g., source type and workflow/formatting issues)?
Key Points
1. Gemini AI produced the most verifiable citations in the audit, with an estimated ~1% hallucination/failure rate.
2. Manis generated a solid, thesis-ready literature review quickly, but about 16% of its references were wrong, repeated, or not found in journal/Scholar checks.
3. GenSpark had the highest citation failure rate at about 26%, including multiple citations that did not exist and some near-miss references.
4. Citation accuracy varied widely even when the narrative quality looked convincing, so manual verification remains essential.
5. When Google Scholar didn't confirm a reference, the audit escalated to checking the cited journal's records directly.
6. Output format affects usability: Manis's PDF was easier to work with than GenSpark's less extractable format.
7. Gemini's main drawbacks were weaker adherence to the peer-reviewed-only requirement and workflow/formatting limitations, not widespread fabricated citations.