Better Attention is All You Need
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Large language models are still bottlenecked by context length: even as model quality, data, and architectures improve, most systems remain stuck around 2,048 tokens (with occasional jumps to 4K). That ceiling quickly becomes limiting for real projects, such as email and code workflows or research-paper analysis, where a single document can exceed the available window. Workarounds like summarization and vector search help, but they often fall short of replacing a truly large, end-to-end context.
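To make the vector-search workaround concrete, here is a toy sketch of the chunk-and-retrieve pattern: split a long document into chunks and pull only the most query-relevant chunks into the model's limited context window. Real systems use learned embeddings; plain token overlap stands in for the similarity score here, and all function names are illustrative.

```python
def chunk(text, size=50):
    # Split a long document into fixed-size word chunks.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query, passage):
    # Toy relevance score: fraction of query tokens present in the passage.
    # A real retriever would compare embedding vectors instead.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def retrieve(query, document, k=2, size=50):
    # Return the top-k chunks most relevant to the query; only these
    # would be placed into the model's small context window.
    chunks = chunk(document, size)
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
```

The limitation the transcript points at is visible in the sketch: only the retrieved chunks ever reach the model, so anything the scorer misses is simply invisible, unlike a genuinely large end-to-end context.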
The push for longer context windows runs into three attention-driven problems. First is raw feasibility: standard attention compares every token with every other token, so GPU memory demand grows quadratically with sequence length. Second is latency: processing time increases with token count, translating into longer training cycles and slower inference in production. Third, and arguably the most damaging for usefulness, is quality degradation inside long contexts. Experiments commonly show a “lost in the middle” pattern: information near the beginning and end of the context is recalled better than information in the middle, producing a U-shaped curve of performance across positions. This behavior appears across multiple models, pointing to a structural limitation of attention rather than a single model’s weakness.
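The memory problem can be made concrete with a back-of-envelope calculation: the attention score matrix alone is n × n per head, so its size grows quadratically with sequence length. The head count and fp16 precision below are illustrative assumptions, not figures from the video.

```python
def attn_matrix_gib(seq_len, heads=32, bytes_per=2):
    # Memory for the raw n x n attention score matrix across all heads,
    # in GiB. Assumes fp16 (2 bytes per score); ignores activations,
    # weights, and KV caches, which only add to the total.
    return seq_len ** 2 * heads * bytes_per / 2**30

for n in (2_048, 16_384, 131_072):
    print(f"{n:>9,} tokens -> {attn_matrix_gib(n):,.2f} GiB")
```

Even at 16K tokens the score matrix alone reaches 16 GiB under these assumptions, which is why full attention over very long sequences is infeasible on realistic hardware without sparsification.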
Microsoft Research’s LongNet is presented as a serious attempt to break the first two bottlenecks while keeping quality competitive. LongNet claims scaling to extremely large contexts, up to a billion tokens, by changing how attention is computed. Instead of applying full attention uniformly across the entire sequence, it splits the context into segments and sparsifies attention within those segments, then combines the results. The approach is called “dilated attention,” with tunable segment size and sparsity so the system can be adjusted to different compute budgets. In principle, this makes it possible to fit a massive context onto realistic hardware (for example, an 8× H100 SXM setup) while keeping latency closer to that of standard Transformers operating at 8K–16K contexts.
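A minimal single-pattern sketch of that idea, assuming NumPy: split the sequence into segments of length w, keep every r-th position within each segment, and attend only among the kept positions. Segment size w and dilation rate r are the tunable knobs the transcript mentions; LongNet actually mixes several (w, r) patterns and merges their outputs, which this toy version omits.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, w=8, r=2):
    # Toy dilated attention over one (segment, dilation) pattern:
    # each segment attends only among its every-r-th positions,
    # so cost per segment is O((w/r)^2) instead of O(n^2) overall.
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, w):                        # one segment at a time
        idx = np.arange(start, min(start + w, n))[::r]  # keep every r-th token
        scores = q[idx] @ k[idx].T / np.sqrt(d)         # attention within the sparse set
        out[idx] = softmax(scores) @ v[idx]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(32, 16)) for _ in range(3))
y = dilated_attention(q, k, v, w=8, r=2)
```

In the real design, positions dropped by one pattern are covered by another pattern with a different segment size and dilation; in this single-pattern toy they simply receive no output, which is why the combination step matters.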
But the transcript also flags two unresolved questions that matter for whether billion-token context is actually usable. One is how LongNet compares to standard attention at truly large scales: Microsoft’s reported comparisons stop at standard Transformers up to 32K tokens, leaving uncertainty about how performance extrapolates toward the billion-token regime. The other is whether the “lost in the middle” problem returns in a new form. Even if dilated attention improves average perplexity, the concern is that quality may still deteriorate with context length in a way that mirrors known attention limits, meaning that simply fitting more tokens may not translate into reliable recall of the most relevant middle sections.
The bottom line is that attention in its current form likely must change to unlock larger context windows. LongNet is positioned as an important step toward that shift, but the transcript argues that hardware constraints, scaling comparisons, and positional recall quality remain the deciding hurdles before billion-token context becomes practical for everyday tasks like document-level reasoning and research synthesis.
Cornell Notes
The transcript argues that longer context windows remain the main practical limiter for large language models, despite rapid gains in model quality. Expanding context stresses GPUs (memory), slows computation (latency), and often harms recall inside long prompts via the “lost in the middle” effect. LongNet from Microsoft Research targets the first two issues by using dilated attention: it segments the context and sparsifies attention within segments, then combines outputs, aiming to scale to up to a billion tokens. Reported results suggest better perplexity than sparse Transformers as context grows, but key uncertainties remain—especially how performance compares beyond 32K tokens and whether positional quality degradation persists at extreme lengths.
Why does context length become the bottleneck even when LLMs keep improving?
What are the three major problems that appear when trying to scale attention to longer contexts?
How does LongNet attempt to scale to extremely large contexts without full attention over every token?
What evidence is cited for LongNet’s effectiveness, and what comparison gap remains?
Why might “lost in the middle” still be a problem even if LongNet fits and runs fast?
Review Questions
- What specific resource constraints (memory vs. compute) make long-context attention difficult to deploy in practice?
- How does the “lost in the middle” effect manifest, and why does it matter for tasks like paper analysis?
- What design change does dilated attention introduce, and what two major uncertainties remain about its real-world usefulness at billion-token scale?
Key Points
1. Most LLMs remain constrained by context windows around 2,048 tokens, with occasional 4K, which limits document-level tasks like research-paper analysis.
2. Scaling context via standard attention quickly becomes infeasible because GPU memory requirements grow quadratically with sequence length.
3. Longer contexts also increase processing time, slowing both training and real-time inference.
4. Quality often degrades inside long prompts through the “lost in the middle” effect, harming recall of middle information.
5. LongNet targets feasibility by using dilated attention: segmenting the context and sparsifying attention within segments before combining results.
6. Reported perplexity gains are promising, but the published comparisons stop around 32K tokens, leaving uncertainty about billion-token performance.
7. Even if billion-token contexts become computationally possible, positional recall quality may still deteriorate unless attention mechanisms change in ways that address the underlying limitation.