Shocking Flaws in ChatGPT's Latest Upgrade: DO NOT USE FOR RESEARCH!

TL;DR

Custom GPT document retrieval can be inconsistent once many documents are uploaded, making it unreliable for precise research needs.

Briefing Cornell Notes

Briefing

ChatGPT’s latest upgrade—especially longer context and the ability to build custom “research GPTs”—is drawing excitement, but it’s also raising red flags for academics and anyone relying on precise document retrieval. The core problem is reliability: uploaded material is not consistently recalled or returned accurately, and that unreliability becomes more pronounced as document size and query complexity grow. For PhD students and researchers who need exact citations, this makes the upgrade a risky foundation for “research mode.”

One major feature is the ability to create custom GPTs and feed them personal document collections. In practice, performance is described as “hit or miss” once many documents are added. Even with careful configuration—such as limiting capabilities to the code interpreter and disabling web browsing and image generation—the assistant still struggles to return the right information consistently. The transcript ties this to limitations of the assistant API retrieval approach: accuracy is reported as poor even on documents under 20 pages, and while there are “hacks” to push beyond that scale, they don’t reliably work. Speed is also criticized, with claims that retrieval can be surprisingly slow despite minimal network latency. The takeaway is blunt: there’s no guarantee that the assistant will retrieve the correct facts even when the documents are provided.

A second concern targets the new long-context capability. With a 128,000-token context window, users might expect stronger recall across large academic texts like theses and review papers. Instead, pressure testing reportedly shows recall performance degrading once the context exceeds roughly 73,000 tokens. The transcript attributes the drop to uneven representation of information across the document: facts placed near the beginning are recalled reliably, but statements buried in the middle (between about 7% and 50% of the document depth) are harder to retrieve, while recall improves again in the latter portion. A test method described as inserting a random statement at different depths and asking for retrieval highlights a key failure mode—there are no guarantees that a specific fact will be found when it’s located in the “wrong” region of a long document.

Security is the third issue. Custom GPTs and assistants can expose underlying knowledge files through simple actions, such as requesting a download of the file, which can reveal training or retrieval content that should remain protected. That’s particularly concerning for unpublished papers, industry-sponsored research, and patent-adjacent work where intellectual property needs to stay confidential. While the transcript suggests a “security floor” exists and may be tightened later, it’s still framed as an unacceptable default for sensitive research.

Overall, the guidance is to avoid relying on ChatGPT’s upgraded assistants for research retrieval right now. Alternatives are recommended—specifically tools like Doc analyzer and Power drill—because they reportedly retrieve information more accurately and handle smaller context with better precision. The transcript closes by positioning the upgrade as a step forward, but one that still lacks the robustness and safeguards researchers need for trustworthy, large-scale academic use.

Cornell Notes

The upgrade’s promise—custom GPTs trained on personal documents and a much larger 128,000-token context window—doesn’t translate into dependable retrieval for academic work. Document recall is described as inconsistent once many files are added, with retrieval accuracy dropping even for relatively small documents when using assistant-style retrieval. Long-context performance is also uneven: recall degrades beyond roughly 73,000 tokens, and where a fact appears in a long document strongly affects whether it’s retrieved. On top of accuracy issues, there are security concerns where knowledge files can be exposed via simple requests, raising risks for unpublished or proprietary research. The result: researchers are advised to steer clear of ChatGPT for precise document-based research and use purpose-built retrieval tools instead.

Why does custom GPT document retrieval become unreliable as more documents are added?

The transcript describes “hit or miss” behavior when feeding large numbers of documents into a custom research GPT. Even with constraints like disabling web browsing and image generation and relying only on the code interpreter, the assistant may not return the right amount of information or the right information. This is linked to limitations in assistant API retrieval: accuracy is reported as not ideal even for documents under 20 pages, and while there are workaround “hacks” to include more than that, they don’t reliably solve the problem—especially when queries get complex.

What does the 128,000-token context window change, and why doesn’t it deliver the expected recall?

A larger context window should, in theory, let the model ingest more of a thesis or long paper. But pressure testing described in the transcript finds recall performance degrades above about 73,000 tokens. The issue isn’t just scale; it’s distribution. Facts are not represented evenly across long-form content, so recall depends on where the information sits within the document.

How does the position of a fact inside a long document affect retrieval?

Position matters strongly. Facts near the beginning of a document are recalled reliably regardless of context length. But statements placed between roughly 7% and 50% of the document depth are harder to retrieve, while recall improves again in the second half. The transcript describes a test approach where a random statement is inserted at various depths and then retrieval is attempted, revealing a non-uniform recall pattern.

What does the transcript recommend instead of relying on ChatGPT for research retrieval?

It recommends using dedicated document-retrieval tools that perform better at extracting specific information. Two named options are Doc analyzer and Power drill. The transcript claims these tools currently outperform ChatGPT’s assistant retrieval for uploading and querying personal materials, and that smaller context can yield higher accuracy—something these tools reportedly handle more effectively.

What security risk is highlighted for custom GPTs and assistants?

The transcript warns that custom GPT knowledge and retrieval content can be exposed. It describes findings that a user can request a download of the file and receive the underlying information, and that knowledge files can be accessible through retrieval/augmented generation. This is framed as a major concern for protecting unpublished papers, industry-sponsored work, and patent-related intellectual property.

Review Questions

What retrieval failure modes appear when using assistant-style document ingestion for research, and how do they change with document count and query complexity?
How do the transcript’s described long-context recall results differ for facts near the beginning versus facts in the middle of a document?
What security behaviors are described that could expose knowledge files in custom GPTs, and why does that matter for academic or industry research?

Key Points

1
Custom GPT document retrieval can be inconsistent once many documents are uploaded, making it unreliable for precise research needs.
2
Assistant API retrieval accuracy is reported as poor even for documents under 20 pages, and workarounds for larger sets don’t reliably fix it.
3
Long-context performance with a 128,000-token window degrades above roughly 73,000 tokens, with recall uneven across the document.
4
Fact location inside long documents strongly affects retrieval: beginning facts are easier, middle facts (about 7%–50%) are harder, and later facts improve again.
5
Security concerns include the possibility of knowledge files being exposed through simple download requests, which is risky for unpublished or proprietary research.
6
Purpose-built retrieval tools like Doc analyzer and Power drill are recommended as more accurate alternatives for uploading and querying research materials.
7
For academics, the safest near-term approach is to avoid relying on ChatGPT’s upgraded assistants as a primary research retrieval system.

Highlights

Even with careful GPT configuration, retrieval from uploaded research collections is described as “hit or miss,” especially as document volume and query complexity increase.

Long-context recall isn’t uniform: facts near the start are retrieved reliably, while facts buried in the middle of a long document can be missed until the later portion.

A key security warning is that custom GPT knowledge can be exposed via simple actions like requesting a file download—an unacceptable risk for sensitive research.

Topics

Custom GPTs
Document Retrieval
Long-Context Recall
Research Security
Academic Tools

Mentioned

Andy Stapleton
James Brigg
Greg Camat