Why LLMs get dumb (Context Windows Explained)

NetworkChuck · 5 min read

Based on NetworkChuck's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

LLMs “forget” in long chats because their context window limits how many tokens they can effectively attend to at once.

Briefing

LLMs start “getting dumb” in long chats because their context window—the maximum amount of text (measured in tokens) the model can actively pay attention to—fills up and attention degrades. As the conversation grows, the model must keep track of more tokens and run heavier attention computations, which increases GPU load and slows responses. When the model can’t reliably focus on the relevant parts, accuracy drops and hallucinations become more likely.
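
To make the token bookkeeping concrete, here is a minimal sketch using OpenAI's tiktoken library as a stand-in tokenizer (an assumption for illustration; local models such as Gemma ship their own tokenizers, so exact counts will differ):

```python
# Minimal sketch: text is measured in tokens, not words.
# tiktoken is used purely as an illustrative tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "How to Take Smart Notes is a book about note-taking."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# Every message in a chat consumes part of the context window this way;
# once the running total exceeds the window, older tokens stop being usable.
```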

The transcript uses a local-model demo in LM Studio to make the mechanism concrete. A model such as Gemma 3 4B is configured with a context window (e.g., 2048 tokens). After feeding it a statement about a book (“How to Take Smart Notes”), the conversation is extended with unrelated prompts (“a story about cows,” then a sequel, then a prequel). Even though the earlier detail was provided, the model later fails to recall the book—showing how earlier context can effectively fall out of usable attention once the window is overwhelmed. Increasing the context window (to 4096) restores the ability to retrieve the earlier information, but it doesn’t remove the underlying tradeoff.
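
The fall-out behavior can be sketched in a few lines. This is a toy model of one common strategy (dropping the oldest messages once the token budget is exceeded), not LM Studio's actual implementation:

```python
# Toy illustration of why the book title "falls out" of a small context
# window: if history must fit a fixed token budget, the oldest messages go first.
def fit_to_window(messages, budget, count_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                           # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

# Crude stand-in tokenizer: ~1 token per whitespace-separated word.
count = lambda m: len(m.split())

history = [
    "The book is 'How to Take Smart Notes'.",     # the early fact
    "Tell me a story about cows." + " moo" * 900,
    "Now write a sequel." + " moo" * 900,
    "Now write a prequel." + " moo" * 900,
]

visible = fit_to_window(history, budget=2048, count_tokens=count)
print("Book fact still visible?", any("Smart Notes" in m for m in visible))  # False
```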

Bigger context windows demand more compute and memory. The transcript notes that while some models advertise extremely large contexts—GPT-4o at 128,000 tokens, Claude 3.7 at 200,000, Google Gemini at 1 million, and even Meta’s Llama 4 Scout at 10 million—running those limits locally is constrained by hardware. In the demo, pushing toward very large contexts (around 120,000–131,000 tokens) maxes out video RAM (VRAM), causing the system to slow dramatically and become harder to interact with. The key point: “full advertised context” doesn’t automatically mean a local machine can handle it smoothly.
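
A back-of-envelope calculation (illustrative numbers, not from the transcript) shows why naive attention hurts at these lengths: the score matrix grows with the square of the token count.

```python
# Naive attention materializes an n x n score matrix per head, per layer.
n_small, n_large = 2_048, 131_072       # tokens
bytes_fp16 = 2                          # bytes per fp16 value

for n in (n_small, n_large):
    matrix_gb = n * n * bytes_fp16 / 1e9
    print(f"{n:>7} tokens -> {matrix_gb:,.2f} GB for one attention matrix")

# 2,048 tokens -> ~0.01 GB, but 131,072 tokens -> ~34 GB *per head, per layer*
# if the full matrix is materialized -- which is exactly what Flash Attention
# avoids by computing attention in chunks.
```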

Even with large windows, attention can still fail. A cited research result (“Lost in the Middle”) describes a U-shaped accuracy pattern: LLMs tend to perform best on information at the beginning and end of a long input, while middle content suffers. The transcript frames this as the model “falling asleep” during long sequences—less reliable attention to the middle portions of the conversation.
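
A hedged sketch of how such a positional probe could be run yourself: bury the same fact at different relative depths in filler text and compare recall. The `ask_model` call below is a hypothetical placeholder for whatever chat client you use, and this is our construction, not the paper's exact setup:

```python
# "Lost in the Middle"-style probe: same fact, different depths in the input.
FACT = "The book is 'How to Take Smart Notes'."
sentences = ["The cows grazed peacefully in the meadow."] * 400  # filler

def build_prompt(position: float) -> str:
    """Insert FACT at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    i = int(len(sentences) * position)
    parts = sentences[:i] + [FACT] + sentences[i:]
    return " ".join(parts) + "\nQuestion: which book was mentioned?"

for pos in (0.0, 0.5, 1.0):
    prompt = build_prompt(pos)
    print(f"fact at depth {pos:.1f}: {len(prompt.split())} words in prompt")
    # answer = ask_model(prompt)  # hypothetical client call -- plug in your own.
    # Per the paper, recall tends to be strongest at depth 0.0 and 1.0,
    # weakest near 0.5: the U-shaped curve described above.
```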

To manage the problem, the transcript recommends practical tactics: start a new chat when the user makes a significant topic shift, rather than letting one thread sprawl. It also highlights tooling to reduce token waste and improve readability when ingesting web pages—converting HTML into clean Markdown before pasting.
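
For the HTML-to-Markdown step, one option is Python's html2text library (an assumption; the transcript may use a different tool):

```python
# Convert noisy HTML into clean Markdown before pasting it into a chat,
# so tags and boilerplate don't waste context-window tokens.
import html2text  # pip install html2text

html = "<h1>Context Windows</h1><p>LLMs attend to a <b>limited</b> number of tokens.</p>"

converter = html2text.HTML2Text()
converter.ignore_images = True     # drop image tags that waste tokens
markdown = converter.handle(html)

print(markdown)
# # Context Windows
#
# LLMs attend to a **limited** number of tokens.
```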

Finally, it outlines local performance optimizations that can make long contexts more feasible: enabling Flash Attention to avoid building the full attention comparison matrix in memory; using KV cache optimizations to compress stored attention data; and optionally using paged cache to spill cache from GPU VRAM into system RAM (faster than disk, but still slower than staying on-GPU). The transcript closes with a caution: larger contexts increase the “attack surface,” making it easier for malicious content to hide in long inputs and potentially bypass safeguards. In short, long context is powerful, but it’s expensive, attention is imperfect, and longer chats can be riskier.
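
To see why KV cache optimizations matter, a rough sizing formula helps: the cache stores keys and values for every layer and KV head, so it grows linearly with context length. The model shape below is hypothetical, for illustration only:

```python
# Rough KV-cache sizing (illustrative numbers, not the transcript's).
layers, kv_heads, head_dim = 32, 8, 128    # hypothetical model shape
bytes_per_value = 2                        # fp16

def kv_cache_gb(context_tokens: int) -> float:
    # 2x for keys and values, stored for every layer and KV head
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

for ctx in (2_048, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):.2f} GB of KV cache")
# ~0.27 GB at 2,048 tokens vs ~17 GB at 131,072 tokens for this shape.
# Quantizing the cache (KV cache optimization) shrinks bytes_per_value;
# paged cache spills this allocation from VRAM into system RAM instead.
```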

Cornell Notes

LLMs can “forget” or hallucinate during long conversations because their context window limits how much text they can attend to at once. Context windows are measured in tokens, and as chats grow, the model must run more expensive attention computations, increasing GPU/VRAM pressure and slowing responses. Research on long inputs finds a U-shaped pattern: accuracy is better near the beginning and end, while the middle degrades (“Lost in the Middle”). Even if a model supports a huge context length, local hardware may not sustain it, so performance can collapse. Practical fixes include starting a new chat when switching topics and using attention/memory optimizations like Flash Attention and KV cache compression for local models.

What exactly is a “context window,” and why does it cause forgetting in long chats?

A context window is the maximum number of tokens the model can pay attention to at one time. Tokens are how the model counts text (words, parts of words, spaces, and punctuation can each map to tokens). When the conversation grows, earlier tokens may no longer be effectively attended to, so the model can’t reliably retrieve details it was given—leading to wrong answers or hallucinations. The transcript demonstrates this by setting a model to 2048 tokens, then extending the chat until it can’t recall a previously mentioned book; raising the context window to 4096 restores recall.

Why do larger context windows often make responses slower or less accurate even when the model “supports” them?

Longer contexts increase compute and memory demands. Each new message forces the model to recompute attention over more tokens, which requires more GPU power and more VRAM. The transcript’s local demo shows that pushing toward ~120,000–131,000 tokens can max out VRAM, making the system sluggish. Even if VRAM holds, attention quality can still degrade, especially in the middle of long sequences.

What does “Lost in the Middle” mean for real conversations?

The transcript cites a paper showing a U-shaped accuracy curve for long inputs: information at the beginning and end is handled better, while the middle suffers a large drop-off. In practice, that means a long chat can become unreliable for details buried in the middle—even if those details are still within the context window—because the model’s attention over long sequences becomes less effective.

How can users reduce context-window problems during chat?

A key rule is to start a new chat when there’s a significant topic shift, rather than keeping one conversation open while bouncing between unrelated subjects (coffee, weather, math, etc.). This reduces the amount of irrelevant text the model must weigh and keeps attention focused. Some LLM interfaces even warn that things are slowing down and suggest starting fresh.

What local optimizations help make long contexts practical?

The transcript highlights Flash Attention (computes attention more efficiently by processing tokens in chunks and avoiding storing the full attention matrix), KV cache optimizations (compress stored attention data to reduce VRAM usage), and paged cache (moves cache from GPU VRAM to system RAM, like a page file—often slower than staying on-GPU). Together, these can allow using very large context windows without immediately exhausting VRAM.
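
The chunked idea behind Flash Attention can be illustrated in plain NumPy with an online softmax: scores are computed one block of keys at a time, and earlier partial results are rescaled, so the full n × n matrix never exists in memory. Real Flash Attention is a fused GPU kernel; this sketch only mirrors the math:

```python
import numpy as np

def chunked_attention(q, K, V, chunk=1024):
    """Attention for one query vector q over keys K and values V, chunk by chunk."""
    m = -np.inf               # running max of scores (numerical stability)
    l = 0.0                   # running softmax denominator
    o = np.zeros(V.shape[1])  # running weighted sum of values
    for start in range(0, K.shape[0], chunk):
        s = K[start:start+chunk] @ q            # scores for this chunk only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)               # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[start:start+chunk]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(4096, 64)), rng.normal(size=(4096, 32))
full = np.exp(K @ q - (K @ q).max()); full /= full.sum()
assert np.allclose(chunked_attention(q, K, V), full @ V)  # matches naive attention
```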

What risk comes with very large context windows beyond performance?

Longer contexts increase the attack surface. With more text available, malicious content can be hidden in the middle of long inputs, potentially making it easier to bypass safety protections. The transcript frames this as a security downside: the longer the conversation, the more opportunities exist for harmful instructions to slip through and for the model to lose reliable attention to the relevant parts.

Review Questions

  1. How does token counting differ from word counting, and why does that matter when setting a context window?
  2. Why might increasing context length from 2048 to 4096 improve recall, yet still not prevent hallucinations in very long chats?
  3. What are the tradeoffs between Flash Attention, KV cache compression, and paged cache when running long-context models locally?

Key Points

  1. LLMs “forget” in long chats because their context window limits how many tokens they can effectively attend to at once.
  2. As conversations grow, attention computation becomes more expensive, increasing VRAM/compute load and slowing responses.
  3. Accuracy can drop in the middle of long inputs even when the information is still inside the context window (U-shaped pattern).
  4. Local hardware often can’t sustain advertised maximum context lengths; VRAM limits can cause major performance degradation.
  5. Starting a new chat when switching topics reduces irrelevant tokens and improves reliability.
  6. Flash Attention and KV cache optimizations can make large context windows more feasible on local machines.
  7. Very large context windows can increase security risk by giving more space for malicious content to hide and potentially bypass safeguards.

Highlights

A model set to a 2048-token context window can lose an earlier detail (the book title) after the conversation is extended with unrelated prompts; increasing to 4096 restores recall.
Even with huge context support, attention quality can degrade—research finds better handling at the beginning and end, with a sharp drop in the middle (“Lost in the Middle”).
Pushing context length to ~120,000+ tokens locally can max out VRAM, turning a fast chat into a slow, difficult-to-use experience.
Flash Attention and KV cache compression can reduce memory pressure enough to run very large contexts without immediately exhausting GPU resources.
Longer contexts expand the attack surface, making it easier for harmful content to be buried in the middle of inputs.
