Can Free AI Handle Academic Pressure? Only One Passed My Test

Andy Stapleton · 5 min read

Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ChatGPT (free ChatGPT-4 Turbo) delivered the strongest overall performance across the tested academic tasks, especially for PDF-based factual extraction.

Briefing

Free large language models can handle some academic workflows reliably—especially extracting facts from a PDF—but they still struggle with accurate literature retrieval and citations, with hallucinations remaining a serious risk.

In a multi-hour stress test of free options used for research tasks, ChatGPT (including the free ChatGPT-4 Turbo) delivered the strongest overall performance. When the models were asked to interrogate an uploaded research paper—answering questions strictly from the text—ChatGPT hit 100% accuracy in the test set and even corrected intentionally planted errors in the user’s prompt. Claude also performed extremely well on PDF-based factual extraction, matching ChatGPT’s accuracy in that specific category. Gemini was more vulnerable: it could be persuaded to confirm claims that were not actually present in the paper, effectively fabricating support. Perplexity was the weakest for PDF interrogation: it was the easiest to mislead into “finding” nonexistent content and was sometimes outright wrong about what the PDF contained.

The second major task—building a mini literature review by retrieving sources and formatting citations—proved harder for every model. The test asked for recent articles in a defined area using American Chemical Society citation style, plus additional prompts designed to see whether the model would invent references or fabricate expertise. Here, ChatGPT again led the group, but not by enough to make it safe for unsupervised academic use: its reference accuracy still showed a substantial hallucination rate. Claude and Perplexity performed poorly overall, while Gemini landed in the middle.

A key pattern emerged across models: when prompts were structured to elicit “confident” answers, the systems sometimes backed into plausible-sounding citations or invented recognition for nonexistent theories. In one example, ChatGPT initially indicated that a nonexistent “Stapleton theory” was not an established idea, yet follow-up prompting could still steer it toward citation-like output supporting the fabrication. Gemini and Perplexity showed similar susceptibility, with Perplexity especially prone to missing correct information or generating incorrect references.

The takeaway is practical and conditional. For free tools focused on extracting information from a single PDF, ChatGPT and Claude are the safest choices in this test, with ChatGPT topping the list. For free literature retrieval and citation generation, none of the models were dependable enough to trust without verification; Perplexity in particular was flagged as unreliable in the free tier. The results also suggest that “deep research” modes can change outcomes—Gemini’s deep research reportedly performed better and reduced hallucinations—but the recommendation in this test remains conservative: if only one free model is chosen for academic work, ChatGPT is the best bet, while Perplexity should be avoided for research-grade referencing in its free form.

Cornell Notes

A stress test of free AI tools for academic work found a split between tasks: extracting facts from an uploaded PDF can be highly accurate, but generating literature reviews with correct citations remains error-prone. ChatGPT (free ChatGPT-4 Turbo) achieved 100% accuracy for PDF-based factual extraction and also corrected intentionally misleading user input. Claude performed similarly well on PDF interrogation, while Gemini and Perplexity were easier to trick into confirming claims not present in the paper. For reference gathering and citation accuracy, ChatGPT still led the group, but hallucinations remained common enough to require careful checking; Perplexity performed worst in the free tier. The practical recommendation: use ChatGPT (or Claude) for PDF Q&A, and verify any retrieved citations from free models.

Why did the models perform very differently on “chat with PDF” versus literature review tasks?

PDF interrogation is constrained by the uploaded text: the model can quote or paraphrase what it has access to in the document. Literature review requires external retrieval and correct bibliographic details (authors, years, and titles), which increases the chance of inventing sources or mis-citing. In the test, ChatGPT and Claude were reliable on PDF questions, while Gemini and Perplexity were more easily persuaded into claiming nonexistent passages. For citations, every model showed meaningful hallucination risk, with ChatGPT best among the free options but still far from “trust without verification.”

What counted as an “erroneous response” during PDF testing?

Two failure modes were tracked. First, content accuracy errors: the model provided information not found in the paper, fabricated terminology, or otherwise answered with material absent from the PDF. Second, reference accuracy errors: citations attributed to the wrong author or year, or references that didn’t exist at all. Responses were marked wrong when either the content or the cited references failed these checks.
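
As a rough illustration of how that two-part rubric could be applied, here is a minimal Python sketch. The class name, field names, and helper function are hypothetical; the video describes manual checks, not an automated scoring script, so this only restates the two failure modes in code form.

```python
from dataclasses import dataclass

@dataclass
class Response:
    """One model answer plus the manual check results against the source PDF."""
    content_supported_by_pdf: bool  # every factual claim actually appears in the paper
    references_verified: bool       # cited authors, years, and titles exist and match

def is_erroneous(resp: Response) -> bool:
    """A response counts as wrong if either failure mode is present:
    content not found in the PDF, or a citation that is wrong or fabricated."""
    content_error = not resp.content_supported_by_pdf
    reference_error = not resp.references_verified
    return content_error or reference_error

# Example: the facts match the PDF, but the cited reference is invented -> erroneous
print(is_erroneous(Response(content_supported_by_pdf=True, references_verified=False)))  # True
```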

How did the test try to measure hallucination risk in a way that reflects real misuse?

It included “tricky” prompts that intentionally contained false premises—then checked whether the model would correct the user or confirm the incorrect claim. For example, the prompts asserted that a specific statement existed in the paper when it did not. ChatGPT and Claude tended to push back and correct the user, while Gemini and Perplexity were more likely to agree with the false premise or fabricate supporting text.
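
To make the false-premise idea concrete, the sketch below shows one way such a probe could be assembled and labelled. The example claim, prompt wording, and keyword-based classifier are invented for illustration; in the test itself the judgement of whether a model corrected or confirmed the premise was made by a human reviewer.

```python
# Hypothetical false-premise probe: assert something the paper does not say,
# then record whether the model corrects the premise or goes along with it.

def build_false_premise_prompt(claim_not_in_paper: str) -> str:
    # The prompt presents a fabricated claim as if it came from the uploaded PDF.
    return (
        f"The paper states that {claim_not_in_paper}. "
        "Please quote the passage where this is discussed."
    )

def classify_reply(reply: str) -> str:
    """Crude keyword stand-in for the manual judgement used in the test:
    did the model push back on the premise, or confirm/fabricate support?"""
    lowered = reply.lower()
    if "not" in lowered and ("paper" in lowered or "text" in lowered):
        return "corrected the false premise"
    return "confirmed or fabricated support"

prompt = build_false_premise_prompt(
    "the catalyst was tested at 900 degrees Celsius"  # invented detail, for illustration only
)
print(prompt)
print(classify_reply("I could not find that statement anywhere in the paper."))
```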

What were the main outcomes for reference accuracy in the literature-review task?

ChatGPT had the highest reference accuracy among the free models tested, but its hallucination rate was still substantial (described as just under 80% inaccuracy for the reference task). Claude and Perplexity performed poorly, and Gemini was second best. Perplexity was singled out as especially unreliable for free academic referencing, with “hit and miss” behavior across prompts.

What practical recommendation came out of the results for someone using only free tiers?

For extracting information from a single PDF, the test recommended ChatGPT or Claude, with ChatGPT leading. For literature review and citation generation, the test warned against trusting free outputs without verification—particularly recommending users avoid Perplexity in its free form. It also noted that Gemini’s “deep research” mode performed better than its standard free behavior, but the recommendation emphasized the free-tier limitations.

Review Questions

  1. When a model is asked about a claim that is not present in the PDF, what behaviors separated ChatGPT from Gemini and Perplexity in this test?
  2. What specific kinds of citation mistakes were treated as reference accuracy failures, and why do those matter for academic work?
  3. If someone must choose one free model for research tasks, how should they decide between PDF Q&A and literature review/citation generation based on the reported results?

Key Points

  1. ChatGPT (free ChatGPT-4 Turbo) delivered the strongest overall performance across the tested academic tasks, especially for PDF-based factual extraction.

  2. ChatGPT achieved 100% accuracy on interrogating uploaded PDFs in the test set and corrected intentionally misleading user prompts.

  3. Claude matched ChatGPT’s accuracy for PDF interrogation, but the models diverged more on literature retrieval and citation tasks.

  4. Gemini and Perplexity were easier to mislead into confirming claims that were not present in the PDF, indicating higher hallucination risk for document Q&A.

  5. Literature review and citation generation were significantly less reliable across all free models, with hallucinations still common enough to require verification.

  6. Perplexity was flagged as the worst option for free academic referencing in this test, while ChatGPT and Gemini were better choices for free tiers.

  7. Gemini’s deep research mode reportedly reduced hallucinations compared with its standard free behavior, but the recommendation remained conservative for free usage.

Highlights

ChatGPT hit 100% accuracy when extracting factual information from an uploaded PDF and even corrected false premises embedded in the questions.
Gemini could be persuaded to “find” or confirm information that wasn’t in the PDF, showing how easily document-grounded tasks can still go wrong.
For literature reviews with citations, no free model reached reliability; ChatGPT led the pack but still showed a high hallucination rate for references.
Perplexity performed worst for free academic referencing, with both incorrect citations and frequent misses.
