Best AI Tools for Deep Research (Ranked by a PhD, Not Hype)
Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Score deep-research tools using workflow criteria: recency, citation volume, clarity, multimedia usefulness, and exportability—not just how much text is generated.
Briefing
Deep-research tools for academia are only useful if they deliver recent, well-cited scholarship in a form researchers can actually use. After running the same nanostructured-electrodes prompt through multiple systems and scoring them on recency (past two years), reference volume, clarity, multimedia support, and exportability, Gemini and Manis AI emerged as the top performers—while Storm landed at the bottom.
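To make the rubric concrete, here is a minimal sketch of how an equal-weight score over those five criteria could be computed. Only the criterion names come from the video; the per-criterion numbers, the equal weighting, and the 0–5 scale below are illustrative assumptions, not the video's actual sub-scores.

```python
# Minimal sketch of a five-criterion scoring rubric (equal weighting assumed).
# The example ratings are hypothetical placeholders, not values from the video.

CRITERIA = ["recency", "citation_volume", "clarity", "multimedia", "exportability"]

def rubric_score(ratings: dict[str, float]) -> float:
    """Average the per-criterion ratings into one overall score."""
    missing = set(CRITERIA) - ratings.keys()
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return round(sum(ratings[c] for c in CRITERIA) / len(CRITERIA), 2)

# Hypothetical example: strong on references and export, weak on multimedia.
example = {
    "recency": 4, "citation_volume": 5, "clarity": 3,
    "multimedia": 1, "exportability": 4,
}
print(rubric_score(example))  # 3.4
```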
ChatGPT produced strong, readable research with clear explanations and visible multimedia elements like figures and tables pulled from real papers. It also generated a large set of sources (reported as 45), and the output included a dedicated section on recent breakthroughs. The weak spot was academic workflow compatibility: export options weren’t “academic friendly,” and the displayed citations didn’t fully match the claimed reference count (the interface showed far fewer than the initial total). That combination left ChatGPT in the middle of the pack at about 3.5 points.
SciSpace (using its deep review mode) leaned heavily into citation retrieval and structure. It returned a “top 20 papers” style synthesis with a compact but useful table and a clear breakdown into sections like current state, introduction, and key materials. It also performed well on recency, showing papers across 2023–2025. The tradeoff was depth and writing density: it offered fewer explanatory paragraphs than the strongest competitors, and the exportable value was mostly in references rather than a fully usable narrative report. Its score landed around 3.
Perplexity delivered a large number of sources (49 shown), with clickable references and a PDF export option. But the citation experience was inconsistent: the PDF export appeared to include only nine references, and the output lacked the same emphasis on “recent breakthroughs” and clear structuring seen elsewhere. With limited multimedia and weaker clarity, it finished around 3 (rounded up).
Gemini stood out for producing a highly referenced, exportable report. It generated an extremely citation-dense document, described as 28 pages of references and hundreds of cited items, where citations appeared to support nearly every sentence. While multimedia wasn’t present, the combination of “lots of references,” strong recency coverage, and export to Docs earned Gemini the highest practical score (rounded to 4).
Manis AI also scored a 4 by separating the work into multiple downloadable files (current state, recent breakthroughs, scalability challenges, key materials), which is useful for researchers who want to reorganize content. Its main drawback was citation usability: references were present but not reliably linked inline, and some sections had broken links. Still, it beat most competitors overall.
Storm, built by Stanford, was the only tool described as free, but it underperformed on academic usability. It provided many references and sentence- and paragraph-level citation markers, yet the references weren’t easy to verify (there was no consolidated external reference view, so each citation had to be hovered over or clicked individually), recency was unclear, and exportability was lacking. It scored about 1.5 and finished last.
Overall ranking: Gemini and Manis AI lead for different reasons—Gemini for highly referenced, exportable reports; Manis for segmented, multi-file deep research. SciSpace is best when the primary goal is exporting references to a reference manager, while ChatGPT and Perplexity sit in the middle due to citation/export mismatches and weaker workflow fit.
Cornell Notes
Running the same academic prompt about nanostructured electrodes in organic solar cells across several deep-research tools highlighted a consistent pattern: citation quality and export/workflow fit matter as much as raw output length. ChatGPT delivered readable explanations and multimedia (figures/tables) but had export limitations and citation-count inconsistencies. SciSpace excelled at returning many references and a structured “top papers” view, with export options focused on reference lists. Perplexity produced many sources and clickable citations, but the PDF export appeared to include far fewer references than the on-screen count. Gemini and Manis AI scored highest: Gemini for extremely citation-dense, exportable Docs reports; Manis for segmented downloadable files (current state, breakthroughs, scalability, materials) despite weaker inline citation linking.
What criteria were used to judge whether a deep-research tool is actually usable for academia?
Why did ChatGPT score well on content quality but not top the leaderboard?
What made SciSpace attractive for researchers even though it didn’t win overall?
What citation/export mismatch hurt Perplexity’s score?
How did Gemini and Manis AI differ in what they do best?
Why did Storm finish last despite being free?
Review Questions
- Which scoring dimensions most directly affect whether a deep-research tool fits an academic workflow (not just whether it produces text)?
- Compare how citation counts behaved across tools when exporting (e.g., Perplexity’s on-screen vs PDF references). What does that imply for trusting outputs?
- Why might a tool with strong segmentation into files (like Manis AI) still underperform if inline citations and link reliability are weak?
Key Points
1. Score deep-research tools using workflow criteria: recency, citation volume, clarity, multimedia usefulness, and exportability—not just how much text is generated.
2. ChatGPT’s strengths were readable synthesis and multimedia, but export limitations and citation-count inconsistencies reduced its academic usability.
3. SciSpace is particularly strong when the goal is collecting and exporting references to a reference manager, even if the narrative depth is lighter.
4. Perplexity’s on-screen citation count may not match what appears in exported documents, so exported reference lists should be checked.
5. Gemini’s advantage is extremely citation-dense, exportable reporting (Docs), making it strong for writing and verification.
6. Manis AI’s advantage is segmented deliverables (separate files for breakthroughs, scalability, materials), but inline citation linking and link integrity can be unreliable.
7. Storm’s free access didn’t compensate for weak reference presentation, unclear recency, and poor exportability for academic work.