Mamba part 2 - Can it replace Transformers?
Based on West Coast Machine Learning's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Mamba aims to scale sequence modeling more efficiently than attention by using a state-space backbone with linear-in-sequence-length behavior.
Briefing
Mamba’s core pitch is simple: it aims to match—and in some settings surpass—Transformer-style language modeling while scaling linearly with sequence length, avoiding the quadratic attention cost that makes very long contexts expensive. The breakthrough is a “selective” state space design that lets key internal parameters vary with the current input token, so the model can emphasize important information (like content words) and downweight filler, without giving up the efficiency benefits of state space models.
The discussion starts by contrasting Transformers with earlier state space models such as S4. Transformers can directly attend to any prior token, but that flexibility drives quadratic compute and memory growth with sequence length. Classic recurrent models avoid that blow-up by carrying a fixed-size hidden state forward one step at a time, but they can struggle with long-range context and training stability. State space models sit in between: they can behave like RNNs for step-by-step inference while also being expressible in a way that enables parallel training, often via convolution-like equivalences.
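To make that contrast concrete, here is a minimal NumPy sketch (illustrative shapes and values, not S4 or Mamba itself) of a discretized, time-invariant state space layer: because the parameters never change, the same input-to-output map can be computed step by step like an RNN or, equivalently, as a convolution with a precomputed kernel.

```python
import numpy as np

# Minimal sketch of a time-invariant SSM (illustrative values, not S4/Mamba).
N, L = 4, 8                                  # state size, sequence length
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 0.9, N))        # fixed dynamics (input-independent)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# RNN-style view: carry a fixed-size hidden state forward one step at a time.
h = np.zeros((N, 1))
y_recurrent = []
for t in range(L):
    h = A @ h + B * x[t]
    y_recurrent.append((C @ h).item())

# Convolutional view: since A, B, C never change, the same map is a convolution
# of x with the kernel K_k = C A^k B, which is what enables parallel training.
K = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)]
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)]

assert np.allclose(y_recurrent, y_conv)
```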
S4’s efficiency depends on parameters that remain constant after training. Mamba relaxes that constraint: it introduces a selection mechanism that makes key state-space parameters (often described as B, C, and Delta) depend on the input at each token position. That input-dependent behavior increases modeling power, especially for tasks where the model must ignore irrelevant tokens and retrieve the right information based on context. The tradeoff is that the original S4 convolution trick no longer applies, because the “scan” dynamics now vary across time.
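As a rough illustration of what “selection” means, the sketch below projects each token’s features into its own B, C, and Delta. The projection matrices, names, and shapes here are assumptions for illustration, not the paper’s exact parameterization.

```python
import numpy as np

# Hypothetical sketch of selection: B, C, and Delta become functions of the
# current token's features. Projection names and shapes are assumptions.
D, N = 16, 4                                 # feature dim, state size
rng = np.random.default_rng(0)
W_B = rng.normal(size=(N, D)) / np.sqrt(D)
W_C = rng.normal(size=(N, D)) / np.sqrt(D)
w_delta = rng.normal(size=D) / np.sqrt(D)

def select_parameters(x_t):
    """Map one token's features x_t (shape [D]) to its own B_t, C_t, delta_t."""
    B_t = W_B @ x_t                              # input-dependent "write" matrix
    C_t = W_C @ x_t                              # input-dependent "read" matrix
    delta_t = np.log1p(np.exp(w_delta @ x_t))    # softplus keeps the step size positive
    return B_t, C_t, delta_t

# A filler token and a content token now induce different dynamics, which is
# what lets the model downweight irrelevant positions.
B_t, C_t, delta_t = select_parameters(rng.normal(size=D))
```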
To recover speed, Mamba relies on a hardware-aware algorithm that computes the state updates recurrently using a scan rather than convolution. The key engineering theme is memory movement: GPUs are fast at arithmetic but slower at moving large tensors between high-bandwidth memory and on-chip compute. Mamba’s implementation keeps the largest objects—particularly the hidden state—resident in fast on-chip memory (SRAM/registers) while loading smaller components as needed. When backpropagation requires intermediate values, it uses recomputation to avoid storing everything for gradients, reducing memory pressure.
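A slow reference version of that recurrence is easy to write down; the point of the hardware-aware kernel is to compute the same thing while keeping the hidden state on-chip and recomputing intermediates for the backward pass. The sketch below assumes a diagonal A and a simple per-step discretization, which are common SSM conventions rather than a claim about Mamba’s exact kernel.

```python
import numpy as np

# Slow reference for a selective scan: the fused kernel computes the same
# recurrence while keeping h in on-chip memory. Diagonal A and the per-step
# discretization below are simplifying assumptions for illustration.
def selective_scan_reference(x, A_diag, B_seq, C_seq, delta_seq):
    """x: [L], A_diag: [N], B_seq/C_seq: [L, N], delta_seq: [L] -> y: [L]."""
    L, N = B_seq.shape
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        A_bar = np.exp(delta_seq[t] * A_diag)    # per-step decay of the old state
        B_bar = delta_seq[t] * B_seq[t]          # per-step gain on the new input
        h = A_bar * h + B_bar * x[t]             # fixed-size state, updated in place
        y[t] = C_seq[t] @ h                      # read out with the token's own C
    return y

rng = np.random.default_rng(0)
L, N = 8, 4
y = selective_scan_reference(
    x=rng.normal(size=L),
    A_diag=-np.abs(rng.normal(size=N)),          # negative => decaying dynamics
    B_seq=rng.normal(size=(L, N)),
    C_seq=rng.normal(size=(L, N)),
    delta_seq=np.log1p(np.exp(rng.normal(size=L))),
)
```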
On the modeling side, the transcript highlights two synthetic tasks used to motivate the design: selective copying (where filler tokens are inserted at random intervals, forcing the model to rely on content rather than position) and induction heads (where the model must retrieve earlier information conditioned on context). In these tests, the selective state-space approach (described as S6 in the discussion) improves over S4-like baselines, while Mamba is reported to succeed on extremely long sequences—up to one million tokens in the induction-heads setting—far beyond training lengths.
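A toy data generator makes the selective-copying setup concrete: a few content tokens land at random positions among filler, and the target is the content in order, so a purely position-based solution cannot work. Vocabulary layout and sizes below are illustrative assumptions, not the paper’s exact task configuration.

```python
import numpy as np

# Toy generator for the selective-copying idea (sizes and token ids are
# illustrative assumptions).
def make_selective_copy_example(seq_len=32, n_content=4, vocab_size=16, seed=0):
    rng = np.random.default_rng(seed)
    FILLER = 0                                           # reserved filler/noise token
    content = rng.integers(1, vocab_size, size=n_content)
    positions = np.sort(rng.choice(seq_len, n_content, replace=False))
    inputs = np.full(seq_len, FILLER)
    inputs[positions] = content
    return inputs, content                               # target: the content, in order

inputs, targets = make_selective_copy_example()
# Because the filler positions differ between examples, a fixed time-invariant
# kernel cannot memorize where the content sits; the model has to decide what
# to keep based on token content, which is what selection provides.
```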
Empirically, Mamba is compared against strong Transformer baselines on language modeling (perplexity), downstream tasks, and non-text modalities like DNA and audio. The transcript also flags limitations: input-dependent selection helps on discrete modalities such as text and DNA, but may hurt when the data benefits from linear time-invariant structure (an audio ablation is mentioned). Overall, the takeaway is that Mamba’s advantage is not just “faster”—it’s a specific combination of selective state-space modeling and GPU-aware execution that targets long-context efficiency while keeping accuracy competitive.
Cornell Notes
Mamba is presented as a long-sequence alternative to Transformers that keeps compute scaling linear in sequence length. It builds on S4-style state space models but adds a selection mechanism that makes key parameters (notably B, C, and Delta) vary with the current input token, improving context-dependent behavior. That input-dependent selection breaks the classic S4 convolution trick, so Mamba uses a hardware-aware “selective scan” algorithm designed to minimize expensive GPU memory transfers and to reduce training memory via recomputation. Reported results include strong performance on synthetic retrieval tasks and language modeling, plus scaling to extremely long sequences (up to one million tokens in the induction-heads discussion). The main caveat raised is that the same selection mechanism can be less helpful for modalities where linear time-invariant inductive bias is advantageous, such as audio waveforms.
- Why do Transformers become expensive as context length grows, and what does Mamba try to replace?
- What changes from S4 to Mamba, and why does that matter for modeling power?
- How does Mamba regain efficiency after making parameters input-dependent?
- What is the “memory movement” argument behind Mamba’s speed?
- What synthetic tasks are used to motivate selective state spaces, and what do they test?
- Where might Mamba’s selection mechanism be a drawback?
Review Questions
- How does making B, C, and Delta input-dependent change what Mamba can represent compared with S4?
- What specific hardware bottleneck does Mamba’s scan-based algorithm target, and how does recomputation help during training?
- Why do random filler tokens in selective copying defeat fixed convolution-length solutions?
Key Points
1. Mamba aims to scale sequence modeling more efficiently than attention by using a state-space backbone with linear-in-sequence-length behavior.
2. S4’s efficiency relies on constant (input-independent) state-space parameters; Mamba adds input-dependent selection via parameters such as B, C, and Delta.
3. Input-dependent selection breaks the classic S4 convolution trick, so Mamba computes the recurrence with a selective scan instead.
4. Mamba’s speed depends heavily on minimizing GPU memory transfers by keeping the largest tensors (the hidden state) in fast on-chip memory and streaming smaller pieces.
5. Training memory is reduced by recomputing intermediate values for backprop rather than storing everything.
6. Reported experiments emphasize long-context generalization on synthetic retrieval tasks (selective copying and induction heads) and competitive language modeling performance.
7. A key risk is modality mismatch: selection may hurt when data benefits from linear time-invariant structure, with an audio ablation cited as evidence.