Mamba part 2 - Can it replace Transformers?
Based on West Coast Machine Learning's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Mamba aims to scale sequence modeling more efficiently than attention by using a state-space backbone with linear-in-sequence-length behavior.
Briefing
Mamba’s core pitch is simple: it aims to match—and in some settings surpass—Transformer-style language modeling while scaling linearly with sequence length, avoiding the quadratic attention cost that makes very long contexts expensive. The breakthrough is a “selective” state space design that lets key internal parameters vary with the current input token, so the model can emphasize important information (like content words) and downweight filler, without giving up the efficiency benefits of state space models.
The discussion starts by contrasting Transformers with earlier state space models such as S4. Transformers can directly attend to any prior token, but that flexibility drives quadratic compute and memory growth with sequence length. Classic recurrent models avoid that blow-up by carrying a fixed-size hidden state forward one step at a time, but they can struggle with long-range context and training stability. State space models sit in between: they can behave like RNNs for step-by-step inference while also being expressible in a way that enables parallel training, often via convolution-like equivalences.
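To make that contrast concrete, here is a minimal NumPy sketch (illustrative shapes and values, not S4 or Mamba itself) of a discretized, time-invariant state space layer: because the parameters never change, the same input-to-output map can be computed step by step like an RNN or, equivalently, as a convolution with a precomputed kernel.

```python
import numpy as np

# Minimal sketch of a time-invariant SSM (illustrative values, not S4/Mamba).
N, L = 4, 8                                  # state size, sequence length
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 0.9, N))        # fixed dynamics (input-independent)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# RNN-style view: carry a fixed-size hidden state forward one step at a time.
h = np.zeros((N, 1))
y_recurrent = []
for t in range(L):
    h = A @ h + B * x[t]
    y_recurrent.append((C @ h).item())

# Convolutional view: since A, B, C never change, the same map is a convolution
# of x with the kernel K_k = C A^k B, which is what enables parallel training.
K = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)]
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)]

assert np.allclose(y_recurrent, y_conv)
```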
S4’s efficiency depends on parameters that remain constant after training. Mamba relaxes that constraint: it introduces a selection mechanism that makes key state-space parameters (often described as B, C, and Delta) depend on the input at each token position. That input-dependent behavior increases modeling power, especially for tasks where the model must ignore irrelevant tokens and retrieve the right information based on context. The tradeoff is that the original S4 convolution trick no longer applies, because the “scan” dynamics now vary across time.
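As a rough illustration of what “selection” means, the sketch below projects each token’s features into its own B, C, and Delta. The projection matrices, names, and shapes here are assumptions for illustration, not the paper’s exact parameterization.

```python
import numpy as np

# Hypothetical sketch of selection: B, C, and Delta become functions of the
# current token's features. Projection names and shapes are assumptions.
D, N = 16, 4                                 # feature dim, state size
rng = np.random.default_rng(0)
W_B = rng.normal(size=(N, D)) / np.sqrt(D)
W_C = rng.normal(size=(N, D)) / np.sqrt(D)
w_delta = rng.normal(size=D) / np.sqrt(D)

def select_parameters(x_t):
    """Map one token's features x_t (shape [D]) to its own B_t, C_t, delta_t."""
    B_t = W_B @ x_t                              # input-dependent "write" matrix
    C_t = W_C @ x_t                              # input-dependent "read" matrix
    delta_t = np.log1p(np.exp(w_delta @ x_t))    # softplus keeps the step size positive
    return B_t, C_t, delta_t

# A filler token and a content token now induce different dynamics, which is
# what lets the model downweight irrelevant positions.
B_t, C_t, delta_t = select_parameters(rng.normal(size=D))
```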
To recover speed, Mamba relies on a hardware-aware algorithm that computes the state updates recurrently using a scan rather than convolution. The key engineering theme is memory movement: GPUs are fast at arithmetic but slower at moving large tensors between high-bandwidth memory and on-chip compute. Mamba’s implementation keeps the largest objects—particularly the hidden state—resident in fast on-chip memory (SRAM/registers) while loading smaller components as needed. When backpropagation requires intermediate values, it uses recomputation to avoid storing everything for gradients, reducing memory pressure.
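A slow reference version of that recurrence is easy to write down; the point of the hardware-aware kernel is to compute the same thing while keeping the hidden state on-chip and recomputing intermediates for the backward pass. The sketch below assumes a diagonal A and a simple per-step discretization, which are common SSM conventions rather than a claim about Mamba’s exact kernel.

```python
import numpy as np

# Slow reference for a selective scan: the fused kernel computes the same
# recurrence while keeping h in on-chip memory. Diagonal A and the per-step
# discretization below are simplifying assumptions for illustration.
def selective_scan_reference(x, A_diag, B_seq, C_seq, delta_seq):
    """x: [L], A_diag: [N], B_seq/C_seq: [L, N], delta_seq: [L] -> y: [L]."""
    L, N = B_seq.shape
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        A_bar = np.exp(delta_seq[t] * A_diag)    # per-step decay of the old state
        B_bar = delta_seq[t] * B_seq[t]          # per-step gain on the new input
        h = A_bar * h + B_bar * x[t]             # fixed-size state, updated in place
        y[t] = C_seq[t] @ h                      # read out with the token's own C
    return y

rng = np.random.default_rng(0)
L, N = 8, 4
y = selective_scan_reference(
    x=rng.normal(size=L),
    A_diag=-np.abs(rng.normal(size=N)),          # negative => decaying dynamics
    B_seq=rng.normal(size=(L, N)),
    C_seq=rng.normal(size=(L, N)),
    delta_seq=np.log1p(np.exp(rng.normal(size=L))),
)
```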
On the modeling side, the transcript highlights two synthetic tasks used to motivate the design: selective copying (where filler tokens are inserted at random intervals, forcing the model to rely on content rather than position) and induction heads (where the model must retrieve earlier information conditioned on context). In these tests, the selective state-space approach (described as S6 in the discussion) improves over S4-like baselines, while Mamba is reported to succeed on extremely long sequences—up to one million tokens in the induction-heads setting—far beyond training lengths.
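A toy data generator makes the selective-copying setup concrete: a few content tokens land at random positions among filler, and the target is the content in order, so a purely position-based solution cannot work. Vocabulary layout and sizes below are illustrative assumptions, not the paper’s exact task configuration.

```python
import numpy as np

# Toy generator for the selective-copying idea (sizes and token ids are
# illustrative assumptions).
def make_selective_copy_example(seq_len=32, n_content=4, vocab_size=16, seed=0):
    rng = np.random.default_rng(seed)
    FILLER = 0                                           # reserved filler/noise token
    content = rng.integers(1, vocab_size, size=n_content)
    positions = np.sort(rng.choice(seq_len, n_content, replace=False))
    inputs = np.full(seq_len, FILLER)
    inputs[positions] = content
    return inputs, content                               # target: the content, in order

inputs, targets = make_selective_copy_example()
# Because the filler positions differ between examples, a fixed time-invariant
# kernel cannot memorize where the content sits; the model has to decide what
# to keep based on token content, which is what selection provides.
```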
Empirically, Mamba is compared against strong Transformer baselines on language modeling (perplexity), downstream tasks, and non-text modalities like DNA and audio. The transcript also flags limitations: input-dependent selection helps on discrete modalities such as text and DNA, but may hurt when the data benefits from linear time-invariant structure (an audio ablation is mentioned). Overall, the takeaway is that Mamba’s advantage is not just “faster”—it’s a specific combination of selective state-space modeling and GPU-aware execution that targets long-context efficiency while keeping accuracy competitive.
Cornell Notes
Mamba is presented as a long-sequence alternative to Transformers that keeps compute scaling linear in sequence length. It builds on S4-style state space models but adds a selection mechanism that makes key parameters (notably B, C, and Delta) vary with the current input token, improving context-dependent behavior. That input-dependent selection breaks the classic S4 convolution trick, so Mamba uses a hardware-aware “selective scan” algorithm designed to minimize expensive GPU memory transfers and to reduce training memory via recomputation. Reported results include strong performance on synthetic retrieval tasks and language modeling, plus scaling to extremely long sequences (up to one million tokens in the induction-heads discussion). The main caveat raised is that the same selection mechanism can be less helpful for modalities where linear time-invariant inductive bias is advantageous, such as audio waveforms.
- Why do Transformers become expensive as context length grows, and what does Mamba try to replace?
- What changes from S4 to Mamba, and why does that matter for modeling power?
- How does Mamba regain efficiency after making parameters input-dependent?
- What is the “memory movement” argument behind Mamba’s speed?
- What synthetic tasks are used to motivate selective state spaces, and what do they test?
- Where might Mamba’s selection mechanism be a drawback?
Review Questions
- How does making B, C, and Delta input-dependent change what Mamba can represent compared with S4?
- What specific hardware bottleneck does Mamba’s scan-based algorithm target, and how does recomputation help during training?
- Why do random filler tokens in selective copying defeat fixed convolution-length solutions?
Key Points
1. Mamba aims to scale sequence modeling more efficiently than attention by using a state-space backbone with linear-in-sequence-length behavior.
2. S4’s efficiency relies on constant (input-independent) state-space parameters; Mamba adds input-dependent selection via parameters such as B, C, and Delta.
3. Input-dependent selection breaks the classic S4 convolution trick, so Mamba computes the recurrence with a selective scan instead.
4. Mamba’s speed depends heavily on minimizing GPU memory transfers by keeping the largest tensors (the hidden state) in fast on-chip memory and streaming smaller pieces.
5. Training memory is reduced by recomputing intermediate values for backprop rather than storing everything.
6. Reported experiments emphasize long-context generalization on synthetic retrieval tasks (selective copying and induction heads) and competitive language modeling performance.
7. A key risk is modality mismatch: selection may hurt when data benefits from linear time-invariant structure, with an audio ablation cited as evidence.