
Better Attention is All You Need

sentdex
5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Most LLMs remain constrained by context windows around 2,048 tokens, with occasional 4K, which limits document-level tasks like research-paper analysis.

Briefing

Large language models are still bottlenecked by context length: even as model quality, data, and architectures improve, most systems remain stuck around 2,048 tokens, with occasional jumps to 4K. That ceiling quickly becomes limiting for real projects such as email and code workflows or research-paper analysis, where a single document can exceed the available window. Workarounds like summarization and vector search help, but they often fall short of replacing a truly large, end-to-end context.

The push for longer context windows runs into three attention-driven problems. First is raw feasibility: attention over more tokens demands more GPU memory, and for standard attention that requirement grows quadratically with sequence length. Second is latency: processing time increases with the token count, translating into longer training cycles and slower inference in production. Third, and arguably the most damaging for usefulness, is quality degradation inside long contexts. Experiments commonly show a “lost in the middle” pattern: information near the beginning and end of the context is recalled better than information in the middle, producing a U-shaped curve of performance across positions. This behavior appears across multiple models, pointing to a structural limitation of attention rather than a single model’s weakness.
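
To make the feasibility point concrete, here is a back-of-envelope sketch (my numbers, not the video's): the raw attention-score matrices alone grow quadratically with token count, so memory explodes long before a billion tokens.

```python
# Back-of-envelope sketch (illustrative assumptions: 32 heads, fp16 scores);
# counts only ONE layer's raw attention-score matrices, ignoring weights,
# activations, and the KV cache.
def attn_score_memory_gib(n_tokens: int, n_heads: int = 32, bytes_per_score: int = 2) -> float:
    """Memory for one layer's (heads x n x n) attention-score matrices, in GiB."""
    return n_heads * n_tokens ** 2 * bytes_per_score / 2 ** 30

for n in (2_048, 8_192, 32_768, 131_072):
    print(f"{n:>7} tokens -> {attn_score_memory_gib(n):9,.2f} GiB per layer")
    # 2,048 -> 0.25 GiB; 8,192 -> 4 GiB; 32,768 -> 64 GiB; 131,072 -> 1,024 GiB
```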

Microsoft Research’s LongNet is presented as a serious attempt to break the first two bottlenecks while keeping quality competitive. LongNet claims to scale to extremely large contexts, up to a billion tokens, by changing how attention is computed. Instead of applying full attention uniformly across the entire sequence, it splits the context into segments and sparsifies attention within those segments, then combines the results. The approach is described as “dilated attention,” with tunable segment size and sparsity so the system can be adjusted to different compute budgets. In principle, this makes it possible to fit a massive context into realistic hardware (for example, an 8× H100 SXM setup) and keep speed more comparable to standard Transformers operating at 8K–16K contexts.
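
To illustrate the mechanism, here is a minimal single-head sketch (my simplification for exposition, not the paper's implementation): within each fixed-length segment, attention runs only over every `dilation`-th token.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_len: int = 4, dilation: int = 2):
    """Minimal single-head sketch of dilated attention (illustration only).

    q, k, v: (seq_len, d) tensors; seq_len must be divisible by segment_len.
    Within each segment, attention runs only over every `dilation`-th token,
    shrinking per-segment cost by a factor of dilation ** 2. The real LongNet
    combines several (segment_len, dilation) patterns so every position is
    covered; here, positions the single pattern skips are simply left at zero.
    """
    seq_len, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, seq_len, segment_len):
        idx = torch.arange(start, start + segment_len, dilation)  # sparse positions
        scores = q[idx] @ k[idx].T / d ** 0.5   # attention restricted to the segment
        out[idx] = F.softmax(scores, dim=-1) @ v[idx]
    return out

q = k = v = torch.randn(16, 8)
print(dilated_attention(q, k, v).shape)  # torch.Size([16, 8])
```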

But the transcript also flags two unresolved questions that matter for whether billion-token context is actually usable. One is how LongNet compares to standard attention at truly large scales: Microsoft’s comparisons reportedly stop at standard Transformers up to 32K tokens, leaving uncertainty about how performance extrapolates toward the billion-token regime. The other is whether the “lost in the middle” problem returns in a new form. Even if dilated attention improves average perplexity, the concern is that quality may still deteriorate with context length in a way that mirrors known attention limits, meaning that simply fitting more tokens may not translate into reliable recall of the most relevant middle sections.

The bottom line is that attention in its current form likely must change to unlock larger context windows. LongNet is positioned as an important step toward that shift, but the transcript argues that hardware constraints, scaling comparisons, and positional recall quality remain the deciding hurdles before billion-token context becomes practical for everyday tasks like document-level reasoning and research synthesis.

Cornell Notes

The transcript argues that longer context windows remain the main practical limiter for large language models, despite rapid gains in model quality. Expanding context stresses GPUs (memory), slows computation (latency), and often harms recall inside long prompts via the “lost in the middle” effect. LongNet from Microsoft Research targets the first two issues by using dilated attention: it segments the context and sparsifies attention within segments, then combines outputs, aiming to scale to up to a billion tokens. Reported results suggest better perplexity than sparse Transformers as context grows, but key uncertainties remain—especially how performance compares beyond 32K tokens and whether positional quality degradation persists at extreme lengths.

Why does context length become the bottleneck even when LLMs keep improving?

Most models cluster around ~2,048 tokens, with some reaching 4K, but real tasks often require far more than a single prompt can hold. The transcript highlights workflows like email/code generation and research-paper analysis, where even 4K may not cover one paper. Techniques like summarization and vectorization can help, but they don’t fully replace the value of feeding a large document directly into the model’s context.
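
For a concrete sense of one such workaround (a generic chunking sketch, not anything from the transcript), a document that exceeds the window can be split into overlapping windows that each fit the token budget:

```python
# Generic chunking sketch (my illustration): split a tokenized document into
# windows that fit a 2,048-token budget, with overlap so sentences aren't
# cut cleanly at window boundaries.
def chunk_tokens(tokens: list[str], window: int = 2048, overlap: int = 128) -> list[list[str]]:
    step = window - overlap
    return [tokens[i:i + window] for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = ["tok"] * 10_000                 # stand-in for a tokenized research paper
chunks = chunk_tokens(doc)
print(len(chunks), len(chunks[0]))     # 6 windows of up to 2,048 tokens each
```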

What are the three major problems that appear when trying to scale attention to longer contexts?

(1) Memory feasibility: attention over more tokens requires more GPU memory, and the requirement grows quickly as context increases. (2) Compute/latency: processing time rises with token count, affecting both training time and inference speed in production. (3) Quality degradation: recall tends to follow a “U-shaped” pattern—information near the beginning and end is remembered better than information in the middle, a phenomenon often called “lost in the middle.”
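
The U-shaped pattern in (3) is usually measured with needle-in-a-haystack style probes. A generic sketch of that methodology (my illustration, not the transcript's experiment):

```python
# Plant one key fact at varying relative depths inside a long filler context,
# then ask the model under test to retrieve it.
def build_probe(filler: str, fact: str, depth: float, n_lines: int = 200) -> str:
    """Place `fact` at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    lines = [filler] * n_lines
    lines[int(depth * (n_lines - 1))] = fact
    return "\n".join(lines)

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_probe("The sky is blue.", "The passcode is 7421.", depth)
    # In a real test: send prompt + "What is the passcode?" to the model and
    # score the answer; a U-shaped accuracy-vs-depth curve is the signature
    # of "lost in the middle".
    print(f"depth {depth:.2f}: probe is {len(prompt.splitlines())} lines long")
```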

How does LongNet attempt to scale to extremely large contexts without full attention over every token?

LongNet uses “dilated attention.” The context is split into segments, attention is sparsified within those segments, and segment outputs are combined into a final result. The method is described as tunable via segment size and sparsity, allowing compute to be adjusted. The goal is to make a billion-token context feasible on realistic hardware while keeping speed more comparable to standard Transformers at 8K–16K contexts.
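
To see why segment size and sparsity act as compute knobs, here is a rough cost model (an assumption for illustration, not the paper's exact accounting) counting attention-score pairs:

```python
# Rough cost model (illustrative): with segment length w and dilation r, each
# segment attends over w // r tokens, so it costs ~ (w // r) ** 2 score pairs,
# and there are n // w segments in total.
def dilated_attn_score_pairs(n_tokens: int, segment_len: int, dilation: int) -> int:
    per_segment = (segment_len // dilation) ** 2
    return (n_tokens // segment_len) * per_segment

n = 1_000_000
print(f"full attention:         {n * n:,} score pairs")                     # 10^12
print(f"dilated (w=4096, r=16): {dilated_attn_score_pairs(n, 4096, 16):,}")  # ~16M
print(f"dilated (w=4096, r=4):  {dilated_attn_score_pairs(n, 4096, 4):,}")   # ~256M
```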

What evidence is cited for LongNet’s effectiveness, and what comparison gap remains?

LongNet is said to be compared using perplexity scores against sparse Transformers, with lower perplexity considered better. The transcript notes that LongNet looks somewhat better as context increases. However, comparisons reportedly stop at standard Transformers around 32K tokens, leaving uncertainty about how performance extrapolates toward the billion-token setting.
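
For reference, perplexity is simply the exponential of the mean per-token cross-entropy loss, so lower loss means lower perplexity. A two-line illustration:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity from mean negative log-likelihood (nats per token)."""
    return math.exp(mean_nll)

print(round(perplexity(2.3), 1))  # ~10.0: as uncertain as a uniform 10-way guess
print(round(perplexity(1.6), 1))  # ~5.0: lower loss -> better next-token prediction
```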

Why might “lost in the middle” still be a problem even if LongNet fits and runs fast?

Even if dilated attention improves average perplexity, positional recall could still degrade as context grows. The transcript suggests that if quality deteriorates significantly between smaller ranges (e.g., 8K to 32K), then it’s unclear why it would suddenly remain stable up to 1 billion tokens. Segmenting and sparsifying may change computation, but it may not eliminate the underlying positional limitations of attention-based mechanisms.

Review Questions

  1. What specific resource constraints (memory vs. compute) make long-context attention difficult to deploy in practice?
  2. How does the “lost in the middle” effect manifest, and why does it matter for tasks like paper analysis?
  3. What design change does dilated attention introduce, and what two major uncertainties remain about its real-world usefulness at billion-token scale?

Key Points

  1. Most LLMs remain constrained by context windows around 2,048 tokens, with occasional 4K, which limits document-level tasks like research-paper analysis.

  2. Scaling context via standard attention quickly becomes infeasible due to sharply increasing GPU memory requirements.

  3. Longer contexts also increase processing time, slowing both training and real-time inference.

  4. Quality often degrades inside long prompts through the “lost in the middle” effect, harming recall of middle information.

  5. LongNet targets feasibility by using dilated attention: segmenting the context and sparsifying attention within segments before combining results.

  6. Reported perplexity gains are promising, but comparisons reportedly stop around 32K tokens, leaving uncertainty about billion-token performance.

  7. Even if billion-token contexts become computationally possible, positional recall quality may still deteriorate unless attention mechanisms change in ways that address the underlying limitation.

Highlights

  • Context length—not model architecture or training data—remains a practical ceiling for many LLM applications, often blocking single-document workflows.
  • Attention scaling hits three linked barriers: GPU memory, latency, and positional quality degradation (“lost in the middle”).
  • LongNet’s dilated attention uses segmented, sparsified attention to make extremely large contexts (up to a billion tokens) more feasible.
  • The transcript questions whether perplexity improvements at moderate lengths will translate into reliable recall at extreme context sizes.

Topics

Mentioned

  • LLM
  • GPU
  • H100
  • SXMs
  • MPT
  • GPT-4