Can Free AI Handle Academic Pressure? Only One Passed My Test
Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Free large language models can handle some academic workflows reliably—especially extracting facts from a PDF—but they still struggle with accurate literature retrieval and citations, with hallucinations remaining a serious risk.
In a multi-hour stress test of free options for research tasks, ChatGPT (including the free GPT-4 Turbo tier) delivered the strongest overall performance. When the models were asked to interrogate an uploaded research paper, answering questions strictly from the text, ChatGPT hit 100% accuracy on the test set and even corrected intentionally planted errors in the user's prompt. Claude also performed extremely well on PDF-based factual extraction, matching ChatGPT's accuracy in that category. Gemini was more vulnerable: it could be persuaded to confirm claims that were not actually present in the paper, effectively fabricating support. Perplexity was the weakest for PDF interrogation: it was the easiest to mislead into "finding" nonexistent content, and it was sometimes outright wrong about what the PDF contained.
The second major task, building a mini literature review by retrieving sources and formatting citations, proved harder for every model. The test asked for recent articles in a defined area using American Chemical Society citation style, plus additional prompts designed to see whether the model would invent references or fabricate expertise. Here, ChatGPT again led the group, but not by a margin wide enough to make unsupervised academic use safe: its reference accuracy still showed a substantial hallucination rate. Claude and Perplexity performed poorly overall, while Gemini landed in the middle.
A key pattern emerged across models: when prompts were structured to elicit "confident" answers, the systems sometimes backed into plausible-sounding citations or invented recognition for nonexistent theories. In one example, ChatGPT initially indicated, correctly, that a nonexistent "Stapleton theory" was not established, but after follow-up prompting it could be steered into producing plausible-looking citation behavior in support of the falsehood. Gemini and Perplexity showed similar susceptibility, with Perplexity especially prone to missing correct information or generating incorrect references.
The takeaway is practical and conditional. For free tools focused on extracting information from a single PDF, ChatGPT and Claude are the safest choices in this test, with ChatGPT topping the list. For free literature retrieval and citation generation, none of the models were dependable enough to trust without verification; Perplexity in particular was flagged as unreliable in the free tier. The results also suggest that “deep research” modes can change outcomes—Gemini’s deep research reportedly performed better and reduced hallucinations—but the recommendation in this test remains conservative: if only one free model is chosen for academic work, ChatGPT is the best bet, while Perplexity should be avoided for research-grade referencing in its free form.
Cornell Notes
A stress test of free AI tools for academic work found a split between tasks: extracting facts from an uploaded PDF can be highly accurate, but generating literature reviews with correct citations remains error-prone. ChatGPT (free GPT-4 Turbo) achieved 100% accuracy on PDF-based factual extraction and also corrected intentionally misleading user input. Claude performed similarly well on PDF interrogation, while Gemini and Perplexity were easier to trick into confirming claims not present in the paper. For reference gathering and citation accuracy, ChatGPT still led the group, but hallucinations remained common enough to require careful checking; Perplexity performed worst in the free tier. The practical recommendation: use ChatGPT (or Claude) for PDF Q&A, and verify any citations retrieved from free models.
- Why did the models perform so differently on "chat with PDF" versus literature review tasks?
- What counted as an "erroneous response" during PDF testing?
- How did the test try to measure hallucination risk in a way that reflects real misuse?
- What were the main outcomes for reference accuracy in the literature-review task?
- What practical recommendation came out of the results for someone using only free tiers?
Review Questions
- When a model is asked about a claim that is not present in the PDF, what behaviors separated ChatGPT from Gemini and Perplexity in this test?
- What specific kinds of citation mistakes were treated as reference accuracy failures, and why do those matter for academic work?
- If someone must choose one free model for research tasks, how should they decide between PDF Q&A and literature review/citation generation based on the reported results?
Key Points
1. ChatGPT (free GPT-4 Turbo) delivered the strongest overall performance across the tested academic tasks, especially for PDF-based factual extraction.
2. ChatGPT achieved 100% accuracy on interrogating uploaded PDFs in the test set and corrected intentionally misleading user prompts.
3. Claude matched ChatGPT's accuracy for PDF interrogation, but the models diverged more on literature retrieval and citation tasks.
4. Gemini and Perplexity were easier to mislead into confirming claims that were not present in the PDF, indicating higher hallucination risk for document Q&A.
5. Literature review and citation generation were significantly less reliable across all free models, with hallucinations still common enough to require verification.
6. Perplexity was flagged as the worst option for free academic referencing in this test, while ChatGPT and Gemini were better choices among free tiers.
7. Gemini's deep research mode reportedly reduced hallucinations compared with its standard free behavior, but the recommendation remained conservative for free usage.