Why AI Companies Lied About Context Windows
Based on Tiago Forte's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Large context windows don’t guarantee reliable answers; effective performance often drops well below advertised token limits.
Briefing
AI companies advertise huge context windows, but real-world reliability drops far earlier—often to roughly a quarter to a half of the marketed capacity. A study testing 18 leading large language models across different context lengths found that promises like “200,000 tokens” (and even 1 million tokens) don’t translate into dependable performance at those scales. Instead, the effective context window—where answers stay reliably grounded—lands closer to about 50,000 to 100,000 tokens for major models such as ChatGPT and Claude. The gap isn’t just a matter of better prompting; it reflects a systemic limitation that has been dubbed “context rot.”
Context rot doesn’t mean the models can’t technically ingest large inputs. It means they struggle to use that information accurately as the context grows. The underlying mechanism is resource budgeting: LLMs operate under constraints like compute, time, and token processing capacity. As inputs expand, the system can’t “read everything” the way a careful human might. Instead, it samples and attends to parts of the context that seem relevant to the query. When the context becomes too large, that sampling gets less precise, so crucial details get missed even if they’re present somewhere in the prompt.
That framing shifts the practical goal. Rather than trying to cram more text into a single request—or switching models to chase larger context—users need to change how information is prepared before it reaches the model. The transcript lays out three tactics aimed at reducing the amount of irrelevant material while increasing clarity.
First, use structured formats that LLMs can parse efficiently, such as Markdown or JSON. Plain text pasted in bulk is less machine-friendly, while structured representations align better with how models process language. The workflow can even include asking an LLM to convert raw material into Markdown.
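As an illustrative sketch of this first tactic (the company facts, field names, and `to_markdown` helper below are invented for the example, not taken from the video), the same information can be passed to a model either as loose prose or as labeled structure:

```python
import json

# The same facts as free text vs. structured data. In the free-text form,
# the model must infer which number or date belongs to which concept.
raw_text = "Revenue last year was 1.2M. Payroll runs on the 15th. Tax filing is due April 15."

# Structured version: each fact gets an explicit label the model can key on.
structured = {
    "finance": {"revenue_last_year_usd": 1_200_000, "tax_filing_deadline": "April 15"},
    "hr": {"payroll_day_of_month": 15},
}

def to_markdown(data: dict) -> str:
    """Render the structured data as Markdown, one section per topic."""
    lines = []
    for section, fields in data.items():
        lines.append(f"## {section}")
        for key, value in fields.items():
            lines.append(f"- **{key}**: {value}")
    return "\n".join(lines)

# Either rendering is easier for a model to parse than the raw prose.
print(json.dumps(structured, indent=2))
print(to_markdown(structured))
```

Both renderings carry the same content; the point is that explicit labels and sections give the model's attention clear anchors, which matters more as context grows.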
Second, chunk documents into focused, clearly labeled units. Instead of one massive “master document,” the approach is to maintain separate files for major business areas and load only the subset needed for a given question. If the task is about taxes, only the finance document gets included; if the task is about compensation, the strategy document can stay out.
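The chunk-selection idea can be sketched as simple keyword routing (the file names, keyword sets, and `select_chunks` function are hypothetical illustrations, not the video's implementation):

```python
# Hypothetical chunk registry: one focused document per business area,
# instead of a single massive "master document".
chunks = {
    "finance.md": "Quarterly tax estimates, filing deadlines, bookkeeping rules...",
    "hr.md": "Compensation bands, benefits, onboarding checklist...",
    "strategy.md": "Three-year vision, market positioning, OKRs...",
}

# Map each chunk to the topics it covers.
topic_keywords = {
    "finance.md": {"tax", "taxes", "invoice", "bookkeeping"},
    "hr.md": {"compensation", "salary", "benefits", "hiring"},
    "strategy.md": {"vision", "roadmap", "positioning"},
}

def select_chunks(question: str) -> list[str]:
    """Return only the chunk files whose topics overlap the question."""
    words = set(question.lower().split())
    return [name for name, kws in topic_keywords.items() if words & kws]

# A tax question pulls in only the finance chunk; strategy stays out.
print(select_chunks("When are our quarterly taxes due?"))  # ['finance.md']
```

Real systems would use embeddings or a retrieval index rather than keyword matching, but the effect is the same: the model receives one small, relevant document instead of everything.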
Third, organize those chunks in a consistent hierarchy so the right documents are easy to retrieve. The transcript uses the PARA method—Projects, Areas, Resources, Archives—as an overarching system. It gives an example of a company drive structured into strategic planning, enabling processes (like HR and finance), and core processes (like delivering services). Drilling down further, core processes are broken into stages, then into “blueprints,” step-by-step guides, and finally tools and templates. The result is that a question can be answered by loading the smallest relevant “level” of detail rather than dumping tens of thousands of words into a model and hoping it finds the needle.
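A minimal sketch of that drill-down retrieval, using a nested structure loosely modeled on the transcript's example (the folder names, file names, and `load_level` helper are assumptions made up for illustration):

```python
# Hypothetical PARA-style drive: areas broken into enabling processes,
# core processes, stages, and blueprints, per the transcript's example.
drive = {
    "Areas": {
        "strategic-planning": ["annual-plan.md"],
        "enabling-processes": {
            "hr": ["compensation.md"],
            "finance": ["taxes.md"],
        },
        "core-processes": {
            "delivery": {
                "stage-1-intake": {
                    "blueprints": ["intake-guide.md"],
                    "templates": ["intake-form.md"],
                },
            },
        },
    },
}

def load_level(tree: dict, path: list[str]):
    """Walk down the hierarchy and return only the smallest relevant level."""
    node = tree
    for part in path:
        node = node[part]
    return node

# A tax question loads one short file list, not the entire drive.
print(load_level(drive, ["Areas", "enabling-processes", "finance"]))  # ['taxes.md']
```

The consistent hierarchy is what makes this retrieval cheap: because every question maps to a predictable path, the prompt can carry the smallest level of detail that still answers it.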
The takeaway is that reliability comes from context engineering: designing inputs so the model can focus. With the right structure, chunking, and retrieval hierarchy, context rot becomes something to work around rather than fight.
Cornell Notes
Marketing claims about massive context windows don’t match dependable performance. Testing across 18 leading LLMs found that reliable results often appear closer to 50,000–100,000 tokens, even when models advertise 200,000 tokens or more—meaning users may get only about 25%–50% of the promised capacity. The failure mode is “context rot,” driven by resource limits and less precise attention sampling as inputs grow. Instead of trying to stuff in more text or rewrite prompts, the transcript recommends preparing information so the model receives only what’s relevant. Structured formats (Markdown/JSON), purpose-built document chunks, and a retrieval hierarchy like PARA help keep context focused and answers more dependable.
What is “context rot,” and why does it happen even when a model can accept large inputs?
How do the reported effective context limits compare with marketing claims?
Why isn’t better prompting enough to solve the problem?
What role do structured formats like Markdown or JSON play?
How does chunking documents improve reliability?
How does the PARA method support “context engineering” in practice?
Review Questions
- If a model advertises a 200,000-token context window, what range does the transcript suggest is more reliable—and what mechanism causes the gap?
- Describe two preparation steps (formatting and chunking) that reduce context rot. How do they change what the model receives?
- How does a PARA-based hierarchy help decide which documents to load for a specific question?
Key Points
1. Large context windows don’t guarantee reliable answers; effective performance often drops well below advertised token limits.
2. “Context rot” is driven by resource constraints and less precise attention sampling as inputs grow.
3. Better prompting can help, but it doesn’t fix the underlying attention-and-cost problem caused by oversized context.
4. Use structured formats like Markdown or JSON to make inputs easier for LLMs to parse.
5. Chunk information into focused, labeled documents and load only the subset needed for each question.
6. Build a retrieval hierarchy (e.g., PARA: Projects, Areas, Resources, Archives) so the right level of detail is available without dumping everything into the prompt.
7. Treat context engineering as a workflow design problem: deliver less, but more relevant, information.