Why LLMs get dumb (Context Windows Explained)
Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
LLMs “forget” in long chats because their context window limits how many tokens they can effectively attend to at once.
Briefing
LLMs start “getting dumb” in long chats because their context window—the maximum amount of text (measured in tokens) the model can actively pay attention to—fills up and attention degrades. As the conversation grows, the model must keep track of more tokens and run heavier attention computations, which increases GPU load and slows responses. When the model can’t reliably focus on the relevant parts, accuracy drops and hallucinations become more likely.
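The "heavier attention computations" can be made concrete: self-attention compares every token against every other token, so the work grows roughly with the square of the context length. A quick sketch (illustrative only; real implementations batch and optimize this, but the scaling trend holds):

```python
# Rough illustration: full self-attention compares every token with every
# other token, so compute grows quadratically with context length.
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key comparisons in one full attention pass."""
    return n_tokens * n_tokens

for n in (2048, 4096, 8192):
    print(f"{n:>5} tokens -> {attention_pairs(n):,} comparisons")
```

Doubling the window quadruples the comparisons, which is why long chats hit GPU limits faster than their length alone suggests.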
The transcript uses a local-model demo in LM Studio to make the mechanism concrete. A model such as Gemma 3 4B is configured with a context window (e.g., 2048 tokens). After feeding it a statement about a book (“How to Take Smart Notes”), the conversation is extended with unrelated prompts (“a story about cows,” then a sequel, then a prequel). Even though the earlier detail was provided, the model later fails to recall the book, showing how earlier context can fall out of usable attention once the window is overwhelmed. Increasing the context window to 4096 restores the ability to retrieve the earlier information, but it doesn’t remove the underlying tradeoff.
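The failure mode in the demo can be sketched with a toy eviction model. This assumes the simplest possible policy (oldest tokens are dropped when the window fills); real chat runtimes handle overflow in more varied ways, but the effect on early facts is the same:

```python
from collections import deque

# Toy context window: a fixed-size buffer where the oldest
# tokens are evicted once the window is full.
ctx = deque(maxlen=6)  # tiny window for illustration

for token in "the book is smart notes now tell cow stories".split():
    ctx.append(token)

print(list(ctx))       # only the most recent 6 tokens remain
print("book" in ctx)   # False: the early fact has fallen out of the window
```

Once "book" is evicted, no amount of prompting can retrieve it; the model simply never sees that token again.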
Bigger context windows demand more compute and memory. The transcript notes that while some models advertise extremely large contexts—GPT-4o at 128,000 tokens, Claude 3.7 at 200,000, Google Gemini at 1 million, and even Meta’s Llama 4 Scout at 10 million—running those limits locally is constrained by hardware. In the demo, pushing toward very large contexts (around 120,000–131,000 tokens) maxes out video RAM (VRAM), causing the system to slow dramatically and become harder to interact with. The key point: “full advertised context” doesn’t automatically mean a local machine can handle it smoothly.
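One concrete reason VRAM fills up at long contexts is the KV cache, which stores a key and value vector for every token at every layer. A back-of-the-envelope estimate, using a hypothetical model shape (the layer/head/dimension numbers below are illustrative, not taken from the video):

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) per token,
    per layer, per KV head, at head_dim elements each (fp16 = 2 bytes)."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical mid-size model shape, pushed to ~131k tokens as in the demo
size = kv_cache_bytes(131_072, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"~{size / 2**30:.1f} GiB of KV cache")  # ~16.0 GiB, before weights
```

The cache grows linearly with token count and comes on top of the model weights themselves, which is why an "advertised" context length can exhaust a consumer GPU long before it is reached.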
Even with large windows, attention can still fail. A cited research result (“Lost in the Middle”) describes a U-shaped accuracy pattern: LLMs tend to perform best on information at the beginning and end of a long input, while middle content suffers. The transcript frames this as the model “falling asleep” during long sequences—less reliable attention to the middle portions of the conversation.
To manage the problem, the transcript recommends practical tactics: start a new chat when the user makes a significant topic shift, rather than letting one thread sprawl. It also highlights tooling to reduce token waste and improve readability when ingesting web pages—converting HTML into clean Markdown before pasting.
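The transcript points to dedicated HTML-to-Markdown tooling; as a minimal standard-library sketch of the same idea (keeping visible text and dropping markup so fewer tokens are wasted), something like the following works. Real converters preserve structure as Markdown; this only extracts text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal sketch: keep visible text, drop tags/scripts/styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = "<html><script>var x=1;</script><h1>Title</h1><p>Body text.</p></html>"
p = TextExtractor()
p.feed(html)
print(" ".join(p.parts))  # "Title Body text." — far fewer tokens than raw HTML
```

Stripping boilerplate before pasting means the tokens that do enter the window are the ones the model actually needs to attend to.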
Finally, it outlines local performance optimizations that can make long contexts more feasible: enabling Flash Attention to avoid building the full attention comparison matrix in memory; using KV cache optimizations to compress stored attention data; and optionally using paged cache to spill cache from GPU VRAM into system RAM (faster than disk, but still slower than staying on-GPU). The transcript closes with a caution: larger contexts increase the “attack surface,” making it easier for malicious content to hide in long inputs and potentially bypass safeguards. In short, long context is powerful, but it’s expensive, attention is imperfect, and longer chats can be riskier.
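Flash Attention's core trick can be sketched with an online (streaming) softmax: scores are folded in one at a time with a running maximum and denominator, so the full token-by-token score matrix is never held in memory. A toy scalar version (the real algorithm is tiled, multi-dimensional, and runs on GPU):

```python
import math

def streaming_attention(q: float, keys: list[float], values: list[float]) -> float:
    """Online-softmax attention over scalar keys/values: keep a running
    max, denominator, and weighted sum instead of materializing all scores."""
    m = float("-inf")  # running max of scores (numerical stability)
    denom = 0.0        # running softmax denominator
    acc = 0.0          # running weighted sum of values
    for k, v in zip(keys, values):
        score = q * k
        m_new = max(m, score)
        # Rescale previous partial sums to the new max before adding.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(score - m_new)
        denom = denom * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / denom

out = streaming_attention(1.0, [0.5, 1.5, -0.2], [10.0, 20.0, 30.0])
print(out)  # identical to ordinary softmax attention over the same scores
```

The memory footprint stays constant per query regardless of sequence length, which is exactly why Flash Attention avoids building the full attention comparison matrix.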
Cornell Notes
LLMs can “forget” or hallucinate during long conversations because their context window limits how much text they can attend to at once. Context windows are measured in tokens, and as chats grow, the model must run more expensive attention computations, increasing GPU/VRAM pressure and slowing responses. Research on long inputs finds a U-shaped pattern: accuracy is better near the beginning and end, while the middle degrades (“Lost in the Middle”). Even if a model supports a huge context length, local hardware may not sustain it, so performance can collapse. Practical fixes include starting a new chat when switching topics and using attention/memory optimizations like Flash Attention and KV cache compression for local models.
- What exactly is a “context window,” and why does it cause forgetting in long chats?
- Why do larger context windows often make responses slower or less accurate even when the model “supports” them?
- What does “Lost in the Middle” mean for real conversations?
- How can users reduce context-window problems during chat?
- What local optimizations help make long contexts practical?
- What risk comes with very large context windows beyond performance?
Review Questions
- How does token counting differ from word counting, and why does that matter when setting a context window?
- Why might increasing context length from 2048 to 4096 improve recall, yet still not prevent hallucinations in very long chats?
- What are the tradeoffs between Flash Attention, KV cache compression, and paged cache when running long-context models locally?
Key Points
1. LLMs “forget” in long chats because their context window limits how many tokens they can effectively attend to at once.
2. As conversations grow, attention computation becomes more expensive, increasing VRAM/compute load and slowing responses.
3. Accuracy can drop in the middle of long inputs even when the information is still inside the context window (U-shaped pattern).
4. Local hardware often can’t sustain advertised maximum context lengths; VRAM limits can cause major performance degradation.
5. Starting a new chat when switching topics reduces irrelevant tokens and improves reliability.
6. Flash Attention and KV cache optimizations can make large context windows more feasible on local machines.
7. Very large context windows can increase security risk by giving more space for malicious content to hide and potentially bypass safeguards.