AI just got Elephant Memory - Hands on with the Wildest AI Updates

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

MSA (Memory Sparse Attention) targets ultra-long LLM contexts up to 100 million tokens by embedding sparse document retrieval directly into the attention mechanism.

Briefing

AI memory is taking a major leap: Memory Sparse Attention (MSA) is presented as a way to push large language models to ultra-long contexts—up to 100 million tokens—without the usual collapse around the 1 million-token mark. The core idea is to stop treating attention as an exhaustive “compare every token to every earlier token” problem. Instead, MSA embeds a document-level sparse retrieval mechanism directly into the attention architecture, using a lightweight router key to find the most relevant chunks and then running full attention only on that small subset. That shift is designed to keep compute from exploding as context grows.

The transcript ties the breakthrough to three practical bottlenecks that normally block long-context deployments: quadratic compute cost, loss of precision from lossy compression, and hardware memory limits. MSA is described as re-engineering routing, positional encoding, and memory storage so the model can scale while staying stable. It uses a document-wise rotary positional embedding (RoPE) approach that resets position counting per document, helping the model extrapolate from training lengths to far longer sequences without incoherent outputs. On the systems side, the approach is framed as a tiered storage strategy that keeps the heavy content cache in CPU DRAM while loading only the compact routing keys into GPU VRAM—so the model can score relevance quickly without requiring a single GPU to hold the entire 100-million-token state.

Alongside the memory breakthrough, the transcript surveys a set of “hands-on” AI updates aimed at creativity, games, image generation, and agent workflows. Core AI’s Crea Node Agent is highlighted for automatically building modifiable node pipelines inside Crea—then letting users swap models, prompts, and connections while the workflow remains functional. Hugo’s browser-based local “battle royale” is pitched as an early multiplayer world-model demo: it runs a small model (70 million parameters) trained heavily on Doom, producing a fuzzy but coherent game world streamed in real time. Another thread compares Microsoft’s updated image generator (ranked fifth on an Arena leaderboard) against Nano Banana 2 and Nano Banana Pro, with the Microsoft model praised for text and graphics while Nano Banana variants are favored for instruction-following and coherence.

Google’s AI Studio “vibe coding” updates are also discussed, with expectations for design modes, Figma and Google Workspace integration, improved GitHub support, and planning-style execution inspired by Claude Code—features aimed at making app creation more systematic and easier to deploy. For cost control in agent systems, an open-source “Claw Router” project is described as routing prompts to the best-value LLM by scoring across 15 dimensions, with customization across 44+ models. Finally, an open-source GoDo generation skill for Claude Code is presented as an end-to-end pipeline that plans, executes, generates assets (including via Tripo 3D), and performs visual QA by screenshot analysis.

Taken together, the updates point to a theme: AI is moving from isolated demos toward practical pipelines—whether that’s long-context memory that can actually scale, tools that assemble workflows on demand, or agent frameworks that reduce cost while improving reliability.

Cornell Notes

MSA (Memory Sparse Attention) is presented as a route to ultra-long context for LLMs—up to 100 million tokens—without relying on external retrieval (RAG) or brute-force context windows. The method replaces exhaustive token-to-token attention with document-level sparse retrieval embedded inside the attention mechanism: a lightweight router key scores which document blocks matter, then full attention runs only on the top-k blocks (often 16). MSA also addresses positional encoding and extrapolation by using document-wise rotary positional embeddings that reset position counting per document. A tiered storage strategy keeps routing keys in GPU VRAM and the large content cache in CPU DRAM, making the approach deployable despite GPU memory capacity limits.

Why does standard attention break down around ~1M tokens, and what does MSA change?

Standard dense self-attention requires comparing every new token against every historical token, creating quadratic compute growth. It also forces key/value caches to balloon until hardware memory limits are hit. MSA changes the mechanics by embedding a sparse, document-based router into attention: it uses a router key to retrieve the most relevant document blocks, then runs attention only on that small subset instead of the entire history.
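
As a back-of-envelope illustration (a sketch that borrows the 64-token block size and top-16 selection described later in these notes, not figures quoted from the transcript's cost analysis), the per-token work compares roughly like this:

```python
# Rough per-query-token comparison at a 100M-token context. The 64-token
# block size and top-16 selection come from the routing description below;
# the comparison itself is illustrative, not a quoted benchmark.
context_len = 100_000_000        # total tokens held in context
block_size  = 64                 # tokens per document block
top_k       = 16                 # blocks the router keeps per query

dense_scores  = context_len                 # dense attention: one score per past token
router_scores = context_len // block_size   # MSA routing: one score per block router key
attended      = top_k * block_size          # tokens that receive full attention afterwards

print(f"dense attention scores: {dense_scores:,}")   # 100,000,000
print(f"router-key scores:      {router_scores:,}")  # 1,562,500
print(f"tokens fully attended:  {attended:,}")       # 1,024
```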

How does MSA keep outputs coherent when context length jumps far beyond training?

The transcript highlights an extrapolation problem: global positional encodings assign strictly increasing IDs, and at millions of tokens those IDs fall outside the range seen during training, causing incoherence. MSA uses document-wise RoPE (rotary positional embedding) where the position counter resets to zero at the start of each document. It then applies a global RoPE offset so causal structure remains consistent while integrating facts across multiple documents.
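
A minimal sketch of that position-ID bookkeeping; the global offset is modeled here as a single constant shift, which is an assumption rather than the transcript's exact rule:

```python
# Document-wise position IDs: the counter restarts at 0 for every document,
# so no ID exceeds the range seen during training. Document lengths and the
# constant global offset below are placeholder values.
doc_lengths   = [4096, 2048, 8192]   # hypothetical document sizes (tokens)
global_offset = 0                    # assumed constant shift preserving causal order

global_ids  = list(range(sum(doc_lengths)))                                   # standard RoPE: strictly increasing
docwise_ids = [pos + global_offset for n in doc_lengths for pos in range(n)]  # resets per document

print(max(global_ids))    # 14335 -> can exceed positions seen in training
print(max(docwise_ids))   # 8191  -> stays within the per-document training range
```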

What role does “router key” plus top-k selection play in MSA’s efficiency?

MSA adds a third projection alongside the usual key/value matrices: a specialized router key. Document hidden states are chunked into fixed 64-token blocks, then compressed via chunkwise mean pooling into compact latent representations. For each query, the model creates a routing vector and performs cosine similarity search over the router-key cache to score documents, selecting the top-k (typically 16) relevant blocks for the expensive attention step.
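
A rough PyTorch-style sketch of that routing step; the model dimension, projection weights, and cache slice are placeholders, not values from the MSA implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative routing pass: chunk cached hidden states into 64-token blocks,
# mean-pool each block into a compact router key, then keep the top-k blocks
# by cosine similarity with the query's routing vector.
d_model, block_size, top_k = 1024, 64, 16
num_tokens = 65_536                              # a tiny slice of the full context

hidden   = torch.randn(num_tokens, d_model)      # cached document hidden states
w_router = torch.randn(d_model, d_model)         # hypothetical router-key projection

# 1) Chunk into fixed 64-token blocks and mean-pool into latent router keys.
blocks      = hidden.view(num_tokens // block_size, block_size, d_model)
router_keys = (blocks @ w_router).mean(dim=1)    # (num_blocks, d_model)

# 2) Score every block by cosine similarity against the query's routing vector.
query_routing = torch.randn(1, d_model)
scores = F.cosine_similarity(query_routing, router_keys, dim=-1)

# 3) Select the top-k blocks; full attention then runs only on these tokens.
top_blocks      = scores.topk(top_k).indices
selected_tokens = blocks[top_blocks].reshape(-1, d_model)
print(selected_tokens.shape)                     # torch.Size([1024, 1024])
```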

How does MSA get around GPU memory limits for 100M-token contexts?

The transcript describes a hardware constraint: storing a 100M-token cache can require ~169 GB, which exceeds the available GPU memory (e.g., two A800 GPUs providing ~160 GB of VRAM). MSA uses tiered storage (“memory parallel”): routing keys stay in GPU VRAM for fast scoring, while the heavy content keys/values are offloaded to CPU system DRAM. After relevant documents are identified, only those specific matrices are fetched to the GPU for final attention.
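
A simplified sketch of that split, with placeholder sizes; the real offloading and prefetch machinery would be considerably more involved:

```python
import torch
import torch.nn.functional as F

# Tiered-storage sketch: compact router keys sit in GPU VRAM for fast scoring,
# while the heavy per-block key/value cache stays in CPU DRAM (pinned so the
# copy is cheap). Only the blocks the router selects are moved to the GPU for
# the final attention step. All sizes here are placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_blocks, block_size, d_model, top_k = 1024, 64, 1024, 16

router_keys  = torch.randn(num_blocks, d_model, device=device)          # small: lives in VRAM
kv_cache_cpu = torch.randn(num_blocks, block_size, 2 * d_model,
                           pin_memory=(device == "cuda"))               # large: lives in DRAM

query_routing = torch.randn(1, d_model, device=device)
scores   = F.cosine_similarity(query_routing, router_keys, dim=-1)
selected = scores.topk(top_k).indices

# Fetch only the selected blocks' keys/values onto the GPU for attention.
kv_selected = kv_cache_cpu[selected.cpu()].to(device, non_blocking=True)
print(kv_selected.shape)   # torch.Size([16, 64, 2048])
```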

What practical AI workflow improvements appear alongside MSA in the transcript?

Core AI’s Crea Node Agent is framed as a way to generate complex node pipelines inside Crea automatically, while still allowing full modification—swapping models, editing prompts, detaching and reattaching nodes, and branching into new pipelines. Separately, Claw Router is described as an open-source cost-saving layer for agents, routing prompts to the best-value LLM by scoring across 15 dimensions and offering choices among 44+ models.

How do the game and image updates fit the broader theme of “scalable AI”?

Hugo’s local multiplayer Doom-like world model is positioned as an early step toward generative AI games that run in real time, though the world is described as fuzzy and low-resolution. Microsoft’s updated image generator is compared against Nano Banana 2/Pro: it’s praised for text and graphics, while Nano Banana variants are favored for coherence and instruction-following. Together, they illustrate progress in interactive generation and controllable creative output.

Review Questions

  1. MSA claims linear computational complexity relative to context length—what specific architectural change enables that, and what part of attention is avoided?
  2. Explain how document-wise RoPE differs from standard global positional encoding and why that matters for extrapolating to 100M tokens.
  3. Describe the tiered storage approach in MSA: which components stay on GPU VRAM, which move to CPU DRAM, and how does that affect deployability?

Key Points

  1. MSA (Memory Sparse Attention) targets ultra-long LLM contexts up to 100 million tokens by embedding sparse document retrieval directly into the attention mechanism.

  2. MSA avoids exhaustive token-to-token attention by using a router key to score document blocks and then running full attention only on the top-k relevant blocks (often 16).

  3. Document-wise RoPE resets positional counting per document to reduce incoherence when context lengths far exceed training ranges.

  4. A tiered storage strategy (“memory parallel”) keeps compact routing keys in GPU VRAM while offloading the large content cache to CPU DRAM, bypassing GPU memory overflow.

  5. Core AI’s Crea Node Agent auto-builds modifiable workflow pipelines inside Crea, enabling users to swap models/prompts and rewire nodes without breaking the pipeline.

  6. Hugo’s browser demo runs a local, Doom-trained world model for multiplayer gameplay, demonstrating real-time streamed generation despite fuzziness.

  7. Claw Router is an open-source agent cost-control layer that routes prompts to the best-value LLM using scoring across 15 dimensions and supports 44+ models.

Highlights

  • MSA reframes long-context attention as a retrieval-and-attend problem: route first, then attend—so compute doesn’t explode as context grows.
  • Document-wise RoPE is used to keep positional math stable when scaling from training lengths to 100M-token inference.
  • Tiered storage makes 100M-token contexts practical by splitting routing keys (GPU) from content caches (CPU).
  • Crea Node Agent generates complete node pipelines automatically, then keeps every part editable for iterative creative control.
  • Hugo’s Doom-like multiplayer world model runs locally with a small parameter count, showing early steps toward AI-generated multiplayer games.

Topics

  • Memory Sparse Attention
  • Crea Node Agent
  • Local World Models
  • Image Generation
  • Agent Cost Routing
