Why AI Companies Lied About Context Windows
Based on Tiago Forte's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Large context windows don’t guarantee reliable answers; effective performance often drops well below advertised token limits.
Briefing
AI companies advertise huge context windows, but real-world reliability drops far earlier—often to roughly a quarter to a half of the marketed capacity. A study testing 18 leading large language models across different context lengths found that promises like “200,000 tokens” (and even 1 million tokens) don’t translate into dependable performance at those scales. Instead, the effective context window—where answers stay reliably grounded—lands closer to about 50,000 to 100,000 tokens for major models such as ChatGPT and Claude. The gap isn’t just a matter of better prompting; it reflects a systemic limitation that has been dubbed “context rot.”
Context rot doesn’t mean the models can’t technically ingest large inputs. It means they struggle to use that information accurately as the context grows. The underlying mechanism is resource budgeting: LLMs operate under constraints like compute, time, and token processing capacity. As inputs expand, the system can’t “read everything” the way a careful human might. Instead, it samples and attends to parts of the context that seem relevant to the query. When the context becomes too large, that sampling gets less precise, so crucial details get missed even if they’re present somewhere in the prompt.
That framing shifts the practical goal. Rather than trying to cram more text into a single request—or switching models to chase larger context—users need to change how information is prepared before it reaches the model. The transcript lays out three tactics aimed at reducing the amount of irrelevant material while increasing clarity.
First, use structured formats that LLMs can parse efficiently, such as Markdown or JSON. Plain text pasted in bulk is less machine-friendly, while structured representations align better with how models process language. The workflow can even include asking an LLM to convert raw material into Markdown.
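As an illustrative sketch of this first tactic (the company facts, field names, and `to_markdown` helper below are invented for the example, not taken from the video), the same information can be passed to a model either as loose prose or as labeled structure:

```python
import json

# The same facts as free text vs. structured data. In the free-text form,
# the model must infer which number or date belongs to which concept.
raw_text = "Revenue last year was 1.2M. Payroll runs on the 15th. Tax filing is due April 15."

# Structured version: each fact gets an explicit label the model can key on.
structured = {
    "finance": {"revenue_last_year_usd": 1_200_000, "tax_filing_deadline": "April 15"},
    "hr": {"payroll_day_of_month": 15},
}

def to_markdown(data: dict) -> str:
    """Render the structured data as Markdown, one section per topic."""
    lines = []
    for section, fields in data.items():
        lines.append(f"## {section}")
        for key, value in fields.items():
            lines.append(f"- **{key}**: {value}")
    return "\n".join(lines)

# Either rendering is easier for a model to parse than the raw prose.
print(json.dumps(structured, indent=2))
print(to_markdown(structured))
```

Both renderings carry the same content; the point is that explicit labels and sections give the model's attention clear anchors, which matters more as context grows.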
Second, chunk documents into focused, clearly labeled units. Instead of one massive “master document,” the approach is to maintain separate files for major business areas and load only the subset needed for a given question. If the task is about taxes, only the finance document gets included; if the task is about compensation, the strategy document can stay out.
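The chunk-selection idea can be sketched as simple keyword routing (the file names, keyword sets, and `select_chunks` function are hypothetical illustrations, not the video's implementation):

```python
# Hypothetical chunk registry: one focused document per business area,
# instead of a single massive "master document".
chunks = {
    "finance.md": "Quarterly tax estimates, filing deadlines, bookkeeping rules...",
    "hr.md": "Compensation bands, benefits, onboarding checklist...",
    "strategy.md": "Three-year vision, market positioning, OKRs...",
}

# Map each chunk to the topics it covers.
topic_keywords = {
    "finance.md": {"tax", "taxes", "invoice", "bookkeeping"},
    "hr.md": {"compensation", "salary", "benefits", "hiring"},
    "strategy.md": {"vision", "roadmap", "positioning"},
}

def select_chunks(question: str) -> list[str]:
    """Return only the chunk files whose topics overlap the question."""
    words = set(question.lower().split())
    return [name for name, kws in topic_keywords.items() if words & kws]

# A tax question pulls in only the finance chunk; strategy stays out.
print(select_chunks("When are our quarterly taxes due?"))  # ['finance.md']
```

Real systems would use embeddings or a retrieval index rather than keyword matching, but the effect is the same: the model receives one small, relevant document instead of everything.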
Third, organize those chunks in a consistent hierarchy so the right documents are easy to retrieve. The transcript uses the PARA method—Projects, Areas, Resources, Archives—as an overarching system. It gives an example of a company drive structured into strategic planning, enabling processes (like HR and finance), and core processes (like delivering services). Drilling down further, core processes are broken into stages, then into “blueprints,” step-by-step guides, and finally tools and templates. The result is that a question can be answered by loading the smallest relevant “level” of detail rather than dumping tens of thousands of words into a model and hoping it finds the needle.
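A minimal sketch of that drill-down retrieval, using a nested structure loosely modeled on the transcript's example (the folder names, file names, and `load_level` helper are assumptions made up for illustration):

```python
# Hypothetical PARA-style drive: areas broken into enabling processes,
# core processes, stages, and blueprints, per the transcript's example.
drive = {
    "Areas": {
        "strategic-planning": ["annual-plan.md"],
        "enabling-processes": {
            "hr": ["compensation.md"],
            "finance": ["taxes.md"],
        },
        "core-processes": {
            "delivery": {
                "stage-1-intake": {
                    "blueprints": ["intake-guide.md"],
                    "templates": ["intake-form.md"],
                },
            },
        },
    },
}

def load_level(tree: dict, path: list[str]):
    """Walk down the hierarchy and return only the smallest relevant level."""
    node = tree
    for part in path:
        node = node[part]
    return node

# A tax question loads one short file list, not the entire drive.
print(load_level(drive, ["Areas", "enabling-processes", "finance"]))  # ['taxes.md']
```

The consistent hierarchy is what makes this retrieval cheap: because every question maps to a predictable path, the prompt can carry the smallest level of detail that still answers it.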
The takeaway is that reliability comes from context engineering: designing inputs so the model can focus. With the right structure, chunking, and retrieval hierarchy, context rot becomes something to work around rather than fight.
Cornell Notes
Marketing claims about massive context windows don’t match dependable performance. Testing across 18 leading LLMs found that reliable results often appear closer to 50,000–100,000 tokens, even when models advertise 200,000 tokens or more—meaning users may get only about 25%–50% of the promised capacity. The failure mode is “context rot,” driven by resource limits and less precise attention sampling as inputs grow. Instead of trying to stuff in more text or rewrite prompts, the transcript recommends preparing information so the model receives only what’s relevant. Structured formats (Markdown/JSON), purpose-built document chunks, and a retrieval hierarchy like PARA help keep context focused and answers more dependable.
What is “context rot,” and why does it happen even when a model can accept large inputs?
How do the reported effective context limits compare with marketing claims?
Why isn’t better prompting enough to solve the problem?
What role do structured formats like Markdown or JSON play?
How does chunking documents improve reliability?
How does the PARA method support “context engineering” in practice?
Review Questions
- If a model advertises a 200,000-token context window, what range does the transcript suggest is more reliable—and what mechanism causes the gap?
- Describe two preparation steps (formatting and chunking) that reduce context rot. How do they change what the model receives?
- How does a PARA-based hierarchy help decide which documents to load for a specific question?
Key Points
1. Large context windows don’t guarantee reliable answers; effective performance often drops well below advertised token limits.
2. “Context rot” is driven by resource constraints and less precise attention sampling as inputs grow.
3. Better prompting can help, but it doesn’t fix the underlying attention-and-cost problem caused by oversized context.
4. Use structured formats like Markdown or JSON to make inputs easier for LLMs to parse.
5. Chunk information into focused, labeled documents and load only the subset needed for each question.
6. Build a retrieval hierarchy (e.g., PARA: Projects, Areas, Resources, Archives) so the right level of detail is available without dumping everything into the prompt.
7. Treat context engineering as a workflow design problem: deliver less, but more relevant, information.