Best AI for Literature Reviews? Only ONE Passed the Test

Andy Stapleton · 5 min read

Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini AI produced the most verifiable citations in the audit, with an estimated 1% hallucination/failure rate.

Briefing

AI tools can draft literature reviews quickly—but the real risk is whether the citations are real. After stress-testing three systems on the same thesis-style prompt (with requests for peer-reviewed sources and IEEE-style in-text citations), Gemini AI from Google produced the most reliable reference list, while GenSpark generated the highest share of non-existent or fabricated citations.
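The exact prompt used in the video isn't reproduced in this summary, but a prompt in the spirit of the test, covering the same required elements, might look like this (illustrative wording only):

```text
Write a literature review on [topic] suitable for a thesis chapter.
Cover the key research themes, how the field has evolved over time,
key studies, conflicting viewpoints, gaps in the literature, and
suggestions for future research. Use only peer-reviewed journal
articles, cite them with IEEE-style in-text citations, and include
a full reference list.
```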

Manis was the fastest to return a structured literature review, generating a 14-page document with 38 references. The write-up itself looked strong: it covered key research themes, evolution over time, conflicting viewpoints, gaps, and suggestions for future research. The problem emerged when every citation was checked against Google Scholar and, when needed, against the journal’s own records. Manis’s failure rate landed at about 16% (roughly six of the 38 references): not catastrophic, but high enough to demand manual verification. Errors included incorrect bibliographic details (such as the wrong year or journal), repeated references, and fully fabricated entries where the cited work could not be found at all.

GenSpark finished second (roughly 5–7 minutes) and produced a shorter reference set of 19 entries, but it performed worse on citation accuracy: about 26% of its references (roughly five of the 19) were non-existent or incorrect when audited. Multiple entries simply did not exist, and some citations were only close matches, similar titles or related works rather than the exact peer-reviewed sources claimed. The output format also proved less convenient for workflow: unlike Manis’s PDF, GenSpark’s results were harder to extract into a file suitable for reference management.

Gemini AI (Advanced, using Deep Research) took the longest, around 20 minutes, but delivered a 61-page, table-heavy review with 105 references. The key differentiator was citation verifiability: after checking the reference list, Gemini’s hallucination rate was about 1%, on the order of a single bad reference out of 105. Nearly all citations resolved to real sources, though not all were peer-reviewed journal articles (the list included a thesis and a website), and it even surfaced very recent work, including 2025 research. The trade-offs were practical and stylistic rather than factual: Gemini did not strictly limit sources to peer-reviewed journal articles, one cited item led to a live page whose underlying content the checker could not access, and the in-text citations did not cleanly follow the requested IEEE style (one reference was labeled “I E” instead of being properly formatted).

Taken together, the audit suggests a clear hierarchy for literature-review drafting with citation reliability: Gemini AI for the lowest rate of broken references, Manis as a strong second option that still requires systematic checking, and GenSpark as the riskiest choice for citation accuracy. The takeaway is blunt: even when the narrative reads convincingly, references must be verified—because fabricated citations can be common, especially outside the top performer.

Cornell Notes

A citation audit of three AI literature-review tools found that reference accuracy varies dramatically. Manis produced a structured review quickly and generated 38 references, but about 16% of citations were wrong, repeated, or not found in Google Scholar or journal records. GenSpark was slower than Manis, produced 19 references, and had a higher failure rate—about 26%—including multiple citations that did not exist. Gemini AI (advanced deep research) took longer and produced a much larger review (61 pages, 105 references), but its citations were the most verifiable, with roughly a 1% hallucination/failure rate. Gemini’s main drawbacks were weaker adherence to “peer-reviewed only” and some formatting/workflow limitations, not widespread citation fabrication.

How was citation reliability measured across the tools?

Each tool was given the same thesis-style prompt requiring themes, evolution, key studies, conflicts, gaps, and future research, plus peer-reviewed sources and IEEE-style in-text citations. After generating outputs, every reference was checked against Google Scholar; when a citation didn’t show up there, the checker went to the cited journal’s records to confirm whether the work existed. References were then categorized as existing, wrong (e.g., incorrect journal/year/title details), repeated, or not found (fabricated).
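As a rough sketch of what this audit loop looks like when scripted, the snippet below uses the public Crossref API as a stand-in for the manual Google Scholar and journal lookups. The reference list, the title-match heuristic, and the exact thresholds are assumptions layered on the process described above, not the video’s actual method.

```python
# Sketch of the audit loop: classify each citation, then compute the
# failure rate. Crossref stands in for the manual Scholar/journal checks;
# the references and the title-match heuristic are illustrative assumptions.
import requests

def check_reference(title: str, year: int) -> str:
    """Classify one citation as 'existing', 'wrong', or 'not found'."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=30,
    )
    items = resp.json()["message"]["items"]
    if not items:
        return "not found"  # nothing close exists: likely fabricated
    top = items[0]
    found_title = (top.get("title") or [""])[0].lower()
    year_parts = (top.get("issued", {}).get("date-parts") or [[None]])[0]
    if title.lower() not in found_title and found_title not in title.lower():
        return "not found"  # nearest hit is only a "close" match, not the cited work
    if year_parts[0] != year:
        return "wrong"      # real paper, wrong bibliographic details
    return "existing"

# Hypothetical (title, claimed year) pairs; a real run would parse these
# out of the AI-generated reference list. Detecting the "repeated"
# category would just need a set of already-seen titles.
references = [("Attention is all you need", 2017)]

counts = {"existing": 0, "wrong": 0, "not found": 0}
for title, year in references:
    counts[check_reference(title, year)] += 1

total = len(references)
print(counts, f"failure rate: {(total - counts['existing']) / total:.0%}")
```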

What were the citation failure rates for Manis and what kinds of errors appeared?

Manis generated 38 references in a 14-page PDF and looked strong in structure and critical analysis. The citation audit found about a 16% hallucination/failure rate. Errors included wrong journal and year details (even when the title was plausible), repeated references, and fully made-up entries where the journal lookup returned “records not found.”

Why did GenSpark rank lower despite producing a coherent review?

GenSpark produced a review in roughly 5–7 minutes and generated 19 references, but its references failed verification more often—about 26% of the time. Multiple citations simply did not exist, and some were “near matches” (similar titles) rather than the exact peer-reviewed works claimed. It also lacked a convenient PDF export, making it harder to move citations into a reference workflow.

What made Gemini AI the top performer in the audit?

Gemini AI (advanced deep research) took about 20 minutes and produced a 61-page review with 105 references. After checking, its hallucination/failure rate was about 1%, meaning nearly all citations resolved to real sources in some form. It also surfaced very recent research (including 2025) and provided references as links, which helped verification. The main issues were not widespread fabrication but looser adherence to “peer-reviewed only,” plus some formatting quirks and one inaccessible cited item.
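Because Gemini returned its references as links, a first-pass liveness check is easy to script. The sketch below (with a placeholder URL list) flags dead links, though, as the audit showed, a live page can still fail to expose the cited content.

```python
# Minimal link triage: a HEAD request flags dead reference URLs before
# any manual reading. A successful response does not guarantee the cited
# content is actually accessible, as one case in the audit showed.
import requests

urls = ["https://example.com/cited-source"]  # placeholder entries

for url in urls:
    try:
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
        print(url, "OK" if status < 400 else f"HTTP {status}")
    except requests.RequestException as exc:
        print(url, f"unreachable ({type(exc).__name__})")
```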

What practical limitations still matter even when citations are mostly real?

Even with high citation accuracy, Gemini’s output wasn’t immediately plug-and-play for reference managers like Mendeley or EndNote, requiring manual entry of sources. Also, the review didn’t strictly restrict itself to peer-reviewed journal articles (it included a thesis and a website). Finally, some formatting didn’t match the requested IEEE-style in-text citation expectations perfectly (e.g., a labeled “I E” reference issue).
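One way to bridge that manual-entry gap, assuming the verified references have been parsed into structured fields (the entry below is made up for illustration), is to write them out as BibTeX, which both Mendeley and EndNote can import:

```python
# Sketch: dump verified references to a BibTeX file for import into a
# reference manager. The fields and the sample entry are illustrative.
verified = [
    {"key": "smith2025", "author": "Smith, J.", "title": "An Example Study",
     "journal": "Journal of Examples", "year": "2025"},
]

with open("references.bib", "w", encoding="utf-8") as fh:
    for ref in verified:
        fh.write(
            "@article{{{key},\n"
            "  author  = {{{author}}},\n"
            "  title   = {{{title}}},\n"
            "  journal = {{{journal}}},\n"
            "  year    = {{{year}}}\n"
            "}}\n\n".format(**ref)
        )
```

RIS would work equally well; the point is that a scriptable interchange format removes the per-source manual entry step.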

Review Questions

  1. Which tool had the highest citation failure rate, and what specific verification failures were most common?
  2. How did the auditing process handle citations that didn’t appear on Google Scholar?
  3. What trade-offs did Gemini AI have despite its low hallucination rate (e.g., source type and workflow/formatting issues)?

Key Points

  1. Gemini AI produced the most verifiable citations in the audit, with an estimated 1% hallucination/failure rate.

  2. Manis generated a solid, thesis-ready literature review quickly, but about 16% of its references were wrong, repeated, or not found in journal/Scholar checks.

  3. GenSpark had the highest citation failure rate at about 26%, including multiple citations that did not exist and some near-miss references.

  4. Citation accuracy varied widely even when the narrative quality looked convincing, so manual verification remains essential.

  5. When Google Scholar didn’t confirm a reference, the audit escalated to checking the cited journal’s records directly.

  6. Output format affects usability: Manis’s PDF was easier to work with than GenSpark’s less extractable format.

  7. Gemini’s main drawbacks were looser adherence to the peer-reviewed-only requirement and some workflow/formatting limitations, not widespread fabricated citations.

Highlights

Gemini AI’s reference list checked out at roughly a 1% failure rate, making it the most reliable option among the three tested.
Manis’s citations failed verification about 16% of the time—often due to incorrect bibliographic details or repeated entries.
GenSpark’s citations failed verification about 26% of the time, with multiple references that simply did not exist.
The fastest tool wasn’t the safest: speed didn’t correlate with citation accuracy across the systems tested.