AI just got Elephant Memory - Hands on with the Wildest AI Updates
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
MSA (Memory Sparse Attention) targets ultra-long LLM contexts up to 100 million tokens by embedding sparse document retrieval directly into the attention mechanism.
Briefing
AI memory is taking a major leap: Memory Sparse Attention (MSA) is presented as a way to push large language models to ultra-long contexts—up to 100 million tokens—without the usual collapse around the 1 million-token mark. The core idea is to stop treating attention as an exhaustive “compare every token to every earlier token” problem. Instead, MSA embeds a document-level sparse retrieval mechanism directly into the attention architecture, using a lightweight router key to find the most relevant chunks and then running full attention only on that small subset. That shift is designed to keep compute from exploding as context grows.
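The routing idea described above can be sketched in a few lines. This is an illustrative toy, not the published MSA code: block sizes, the scoring rule, and the top-k value are assumptions, but it shows the key move of scoring whole blocks cheaply and running full attention only on the winners.

```python
import numpy as np

def sparse_block_attention(q, keys, values, router_keys, k_blocks=2):
    """Toy sketch of router-key sparse attention (assumed, not MSA's code):
    score document blocks with one compact key each, then run full
    attention only over the top-k selected blocks."""
    # 1. Cheap routing pass: one dot product per block, not per token.
    scores = router_keys @ q                      # (num_blocks,)
    top = np.argsort(scores)[-k_blocks:]          # indices of best blocks

    # 2. Full attention restricted to the selected blocks.
    sel_k = np.concatenate([keys[b] for b in top])    # (m, d)
    sel_v = np.concatenate([values[b] for b in top])  # (m, d)
    logits = sel_k @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax over m tokens
    return weights @ sel_v                        # attended output, (d,)
```

Because the routing pass touches one key per block, cost grows with the number of blocks rather than with every token pair, which is the mechanism behind the claimed escape from quadratic attention.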
The transcript ties the breakthrough to three practical bottlenecks that normally block long-context deployments: quadratic compute cost, loss of precision from lossy compression, and hardware memory limits. MSA is described as re-engineering routing, positional encoding, and memory storage so the model can scale while staying stable. It uses a document-wise rotary positional embedding (RoPE) approach that resets position counting per document, helping the model extrapolate from training lengths to far longer sequences without producing incoherent output. On the systems side, it's framed as a tiered storage strategy that keeps the heavy content cache in CPU DRAM while loading only the compact routing keys into GPU VRAM, so the model can score relevance quickly without requiring a single GPU to hold the entire 100-million-token state.
Alongside the memory breakthrough, the transcript surveys a set of “hands-on” AI updates aimed at creativity, games, image generation, and agent workflows. Core AI’s Crea Node Agent is highlighted for automatically building modifiable node pipelines inside Crea—then letting users swap models, prompts, and connections while the workflow remains functional. Hugo’s browser-based local “battle royale” is pitched as an early multiplayer world-model demo: it runs a small model (70 million parameters) trained heavily on Doom, producing a fuzzy but coherent game world streamed in real time. Another thread compares Microsoft’s updated image generator (ranked fifth on an Arena leaderboard) against Nano Banana 2 and Nano Banana Pro, with the Microsoft model praised for text and graphics while Nano Banana variants are favored for instruction-following and coherence.
Google’s AI Studio “vibe coding” updates are also discussed, with expectations for design modes, Figma and Google Workspace integration, improved GitHub support, and planning-style execution inspired by Claude Code: features aimed at making app creation more systematic and easier to deploy. For cost control in agent systems, an open-source “Claw Router” project is described as routing prompts to the best-value LLM by scoring across 15 dimensions, with customization across 44+ models. Finally, an open-source Godot generation skill for Claude Code is presented as an end-to-end pipeline that plans, executes, generates assets (including via Tripo 3D), and performs visual QA by screenshot analysis.
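Score-based routing of the kind attributed to Claw Router can be sketched simply. Everything here is hypothetical: the real project scores across 15 dimensions and 44+ models, while this toy uses 3 made-up dimensions and 3 made-up model names to show the shape of the idea.

```python
# Hypothetical sketch of best-value LLM routing (all names, dimensions,
# and scores are invented for illustration, not taken from Claw Router).
MODELS = {
    "small-fast":   {"quality": 0.60, "speed": 0.90, "cost_efficiency": 0.95},
    "mid-balanced": {"quality": 0.80, "speed": 0.70, "cost_efficiency": 0.70},
    "large-smart":  {"quality": 0.95, "speed": 0.40, "cost_efficiency": 0.30},
}

def route(weights):
    """Pick the model maximizing the weighted sum of dimension scores;
    the weights encode what the caller values for this prompt."""
    def value(scores):
        return sum(weights[d] * scores[d] for d in weights)
    return max(MODELS, key=lambda name: value(MODELS[name]))
```

A quality-heavy weighting such as `{"quality": 1.0, "speed": 0.1, "cost_efficiency": 0.1}` selects `"large-smart"`, while a cost-heavy weighting flips the choice to `"small-fast"`, which is the cost-control behavior the transcript describes.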
Taken together, the updates point to a theme: AI is moving from isolated demos toward practical pipelines—whether that’s long-context memory that can actually scale, tools that assemble workflows on demand, or agent frameworks that reduce cost while improving reliability.
Cornell Notes
MSA (Memory Sparse Attention) is presented as a route to ultra-long context for LLMs—up to 100 million tokens—without relying on external retrieval (RAG) or brute-force context windows. The method replaces exhaustive token-to-token attention with document-level sparse retrieval embedded inside the attention mechanism: a lightweight router key scores which document blocks matter, then full attention runs only on the top-k blocks (often 16). MSA also addresses positional encoding and extrapolation by using document-wise rotary positional embeddings that reset position counting per document. A tiered storage strategy keeps routing keys on GPU VRAM and the large content cache in CPU DRAM, making the approach deployable despite GPU memory bandwidth limits.
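The tiered-storage split described above can be sketched as a simple cache object. The class and method names are ours, not MSA's: the point is only that the compact router keys live in fast memory and are scanned every step, while the bulky per-block content cache stays in large, slow memory and is fetched only for the few blocks the router selects.

```python
import numpy as np

class TieredKVCache:
    """Illustrative sketch of VRAM/DRAM tiering (names are assumptions):
    small router keys stay resident in fast 'GPU' memory; large content
    blocks stay in 'CPU' memory until explicitly fetched."""

    def __init__(self, router_keys, cpu_blocks):
        self.router_keys = router_keys   # small: fits in GPU VRAM
        self.cpu_blocks = cpu_blocks     # large: stays in CPU DRAM

    def fetch_top_k(self, q, k=2):
        # Cheap scan over every block's compact key.
        scores = self.router_keys @ q
        top = np.argsort(scores)[-k:]
        # Only now pay the transfer cost: k blocks, not the whole cache.
        return [self.cpu_blocks[i] for i in top]
```

The transfer per decoding step is proportional to the top-k block count, not the total context, which is why a 100M-token state does not have to fit on one GPU.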
- Why does standard attention break down around ~1M tokens, and what does MSA change?
- How does MSA keep outputs coherent when context length jumps far beyond training?
- What role does "router key" plus top-k selection play in MSA's efficiency?
- How does MSA get around GPU memory limits for 100M-token contexts?
- What practical AI workflow improvements appear alongside MSA in the transcript?
- How do the game and image updates fit the broader theme of "scalable AI"?
Review Questions
- MSA claims linear computational complexity relative to context length—what specific architectural change enables that, and what part of attention is avoided?
- Explain how document-wise RoPE differs from standard global positional encoding and why that matters for extrapolating to 100M tokens.
- Describe the tiered storage approach in MSA: which components stay on GPU VRAM, which move to CPU DRAM, and how does that affect deployability?
Key Points
- 1
MSA (Memory Sparse Attention) targets ultra-long LLM contexts up to 100 million tokens by embedding sparse document retrieval directly into the attention mechanism.
- 2
MSA avoids exhaustive token-to-token attention by using a router key to score document blocks and then running full attention only on the top-k relevant blocks (often 16).
- 3
Document-wise RoPE resets positional counting per document to reduce incoherence when context lengths far exceed training ranges.
- 4
A tiered storage strategy (“memory parallel”) keeps compact routing keys on GPU VRAM while offloading the large content cache to CPU DRAM, bypassing GPU memory overflow.
- 5
Core AI’s Crea Node Agent auto-builds modifiable workflow pipelines inside Crea, enabling users to swap models/prompts and rewire nodes without breaking the pipeline.
- 6
Hugo’s browser demo runs a local, Doom-trained world model for multiplayer gameplay, demonstrating real-time streamed generation despite fuzziness.
- 7
Claw Router is an open-source agent cost-control layer that routes prompts to the best-value LLM using scoring across 15 dimensions and supports 44+ models.