Shocking Flaws in ChatGPT's Latest Upgrade: DO NOT USE FOR RESEARCH!
Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Custom GPT document retrieval can be inconsistent once many documents are uploaded, making it unreliable for precise research needs.
Briefing
ChatGPT’s latest upgrade—especially longer context and the ability to build custom “research GPTs”—is drawing excitement, but it’s also raising red flags for academics and anyone relying on precise document retrieval. The core problem is reliability: uploaded material is not consistently recalled or returned accurately, and that unreliability becomes more pronounced as document size and query complexity grow. For PhD students and researchers who need exact citations, this makes the upgrade a risky foundation for “research mode.”
One major feature is the ability to create custom GPTs and feed them personal document collections. In practice, performance is described as “hit or miss” once many documents are added. Even with careful configuration—such as limiting capabilities to the code interpreter and disabling web browsing and image generation—the assistant still struggles to return the right information consistently. The transcript ties this to limitations of the assistant API retrieval approach: accuracy is reported as poor even on documents under 20 pages, and while there are “hacks” to push beyond that scale, they don’t reliably work. Speed is also criticized, with claims that retrieval can be surprisingly slow despite minimal network latency. The takeaway is blunt: there’s no guarantee that the assistant will retrieve the correct facts even when the documents are provided.
A second concern targets the new long-context capability. With a 128,000-token context window, users might expect stronger recall across large academic texts like theses and review papers. Instead, pressure testing reportedly shows recall performance degrading once the context exceeds roughly 73,000 tokens. The transcript attributes the drop to uneven representation of information across the document: facts placed near the beginning are recalled reliably, but statements buried in the middle (between about 7% and 50% of the document depth) are harder to retrieve, while recall improves again in the latter portion. A test method described as inserting a random statement at different depths and asking for retrieval highlights a key failure mode—there are no guarantees that a specific fact will be found when it’s located in the “wrong” region of a long document.
Security is the third issue. Custom GPTs and assistants can expose underlying knowledge files through simple actions, such as requesting a download of the file, which can reveal training or retrieval content that should remain protected. That’s particularly concerning for unpublished papers, industry-sponsored research, and patent-adjacent work where intellectual property needs to stay confidential. While the transcript suggests a “security floor” exists and may be tightened later, it’s still framed as an unacceptable default for sensitive research.
Overall, the guidance is to avoid relying on ChatGPT’s upgraded assistants for research retrieval right now. Alternatives are recommended—specifically tools like Doc analyzer and Power drill—because they reportedly retrieve information more accurately and handle smaller context with better precision. The transcript closes by positioning the upgrade as a step forward, but one that still lacks the robustness and safeguards researchers need for trustworthy, large-scale academic use.
Cornell Notes
The upgrade’s promise—custom GPTs trained on personal documents and a much larger 128,000-token context window—doesn’t translate into dependable retrieval for academic work. Document recall is described as inconsistent once many files are added, with retrieval accuracy dropping even for relatively small documents when using assistant-style retrieval. Long-context performance is also uneven: recall degrades beyond roughly 73,000 tokens, and where a fact appears in a long document strongly affects whether it’s retrieved. On top of accuracy issues, there are security concerns where knowledge files can be exposed via simple requests, raising risks for unpublished or proprietary research. The result: researchers are advised to steer clear of ChatGPT for precise document-based research and use purpose-built retrieval tools instead.
Why does custom GPT document retrieval become unreliable as more documents are added?
What does the 128,000-token context window change, and why doesn’t it deliver the expected recall?
How does the position of a fact inside a long document affect retrieval?
What does the transcript recommend instead of relying on ChatGPT for research retrieval?
What security risk is highlighted for custom GPTs and assistants?
Review Questions
- What retrieval failure modes appear when using assistant-style document ingestion for research, and how do they change with document count and query complexity?
- How do the transcript’s described long-context recall results differ for facts near the beginning versus facts in the middle of a document?
- What security behaviors are described that could expose knowledge files in custom GPTs, and why does that matter for academic or industry research?
Key Points
- 1
Custom GPT document retrieval can be inconsistent once many documents are uploaded, making it unreliable for precise research needs.
- 2
Assistant API retrieval accuracy is reported as poor even for documents under 20 pages, and workarounds for larger sets don’t reliably fix it.
- 3
Long-context performance with a 128,000-token window degrades above roughly 73,000 tokens, with recall uneven across the document.
- 4
Fact location inside long documents strongly affects retrieval: beginning facts are easier, middle facts (about 7%–50%) are harder, and later facts improve again.
- 5
Security concerns include the possibility of knowledge files being exposed through simple download requests, which is risky for unpublished or proprietary research.
- 6
Purpose-built retrieval tools like Doc analyzer and Power drill are recommended as more accurate alternatives for uploading and querying research materials.
- 7
For academics, the safest near-term approach is to avoid relying on ChatGPT’s upgraded assistants as a primary research retrieval system.