Mamba part 4 - System Details and Implementation
Based on West Coast Machine Learning's video on YouTube. If you find this content useful, support the original creators by watching, liking, and subscribing.
Briefing
Mamba’s core implementation hinges on a state-space “mixer” that updates a hidden state sequentially while keeping most computations hardware-friendly—especially by structuring parameters and operations so they can be streamed through time with minimal memory traffic. The key system insight from the walkthrough is that Mamba doesn’t rely on attention-style token-to-token similarity; instead, it compresses history into a hidden state and then uses learned dynamics to decide how much of that state to retain versus how much new input to inject at each step.
The architecture is built around a discretized continuous-time state-space model. A token (or, during training, a whole sequence tensor) first goes through a projection into a higher-dimensional “inner” space, then a convolution-through-time stage (described as a 1D convolution over a small window of neighboring tokens) adds local context. After that, the model generates the parameters needed for the state update: B and C matrices (used to inject input into the state and project the state back out), plus a learned time-step modulation term Δ (Delta) produced via a low-rank projection bottleneck. Δ is passed through softplus, then used to discretize the continuous dynamics.
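To make that data flow concrete, here is a minimal PyTorch sketch of the input path, loosely in the style of public minimal reimplementations of Mamba (e.g., johnma2006/mamba-minimal). The class and dimension names (MambaParamGen, d_inner, dt_rank, d_conv) are illustrative assumptions, not the walkthrough's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaParamGen(nn.Module):
    """Sketch of the input path: project up, convolve through time, emit Δ, B, C.
    Names and sizes are illustrative, not the reference implementation."""
    def __init__(self, d_model=64, d_inner=128, d_state=16, dt_rank=4, d_conv=4):
        super().__init__()
        # Project tokens into the wider "inner" space (x path plus gating z path).
        self.in_proj = nn.Linear(d_model, 2 * d_inner, bias=False)
        # Depthwise 1D convolution over a small causal window of neighboring tokens.
        self.conv1d = nn.Conv1d(d_inner, d_inner, kernel_size=d_conv,
                                groups=d_inner, padding=d_conv - 1)
        # One projection emits the raw Δ (low-rank), B, and C per token.
        self.x_proj = nn.Linear(d_inner, dt_rank + 2 * d_state, bias=False)
        # Low-rank bottleneck: dt_rank -> d_inner, then softplus for positivity.
        self.dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
        self.dt_rank, self.d_state = dt_rank, d_state

    def forward(self, tokens):                      # tokens: (batch, seq_len, d_model)
        x, z = self.in_proj(tokens).chunk(2, dim=-1)
        seq_len = x.shape[1]
        # Conv1d wants (batch, channels, time); trim the causal padding afterwards.
        x = self.conv1d(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        x = F.silu(x)
        dt, B, C = torch.split(self.x_proj(x),
                               [self.dt_rank, self.d_state, self.d_state], dim=-1)
        delta = F.softplus(self.dt_proj(dt))        # (batch, seq_len, d_inner), > 0
        return x, z, delta, B, C
```

The single projection emitting Δ, B, and C per token is what makes the model "selective": unlike S4, these quantities are functions of the input rather than fixed parameters.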
A major implementation detail is how the continuous-to-discrete conversion is handled. The diagonal state-transition parameter A is initialized from an S4-style diagonal structure (diagonal entries 1…n) and stored in log space; at runtime it is recovered as −exp(A_log), which guarantees strictly negative entries and therefore stable, decaying continuous dynamics. Discretization then uses a zero-order hold formulation: the model computes an "A-bar" term as exp(Δ·A) and a corresponding "B-bar" term that scales the input contribution by Δ. In the code walkthrough, B-bar is simplified compared with the full paper expression (effectively Δ·B, a shortcut consistent with an Euler-style approximation), so the math in the paper and the math in the implementation don't match term-for-term.
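A small numeric sketch (with made-up values) shows what this simplification trades away. For a diagonal A, the paper's zero-order-hold input term reduces per channel to (exp(Δ·A) − 1)/A · B, which the Δ·B shortcut approximates well when Δ·A is small:

```python
import torch

# Hypothetical values: diagonal A (negative, as in Mamba), one step size Δ, input matrix B.
A = -torch.tensor([1.0, 2.0, 4.0])           # diagonal entries of A
delta = torch.tensor(0.1)                    # one time step
B = torch.tensor([0.5, 0.5, 0.5])

A_bar = torch.exp(delta * A)                 # exact ZOH state transition
# Full ZOH input term for diagonal A: (exp(Δ·A) − 1) / A · B
B_bar_zoh = (A_bar - 1.0) / A * B
# Euler-style shortcut described in the walkthrough: Δ · B
B_bar_euler = delta * B

print(B_bar_zoh)    # ≈ tensor([0.0476, 0.0453, 0.0412])
print(B_bar_euler)  # tensor([0.0500, 0.0500, 0.0500])
```

For small Δ the two nearly coincide, which is presumably why the shortcut works in practice.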
Δ plays a dual role. It is not just a discretization constant; it also acts like a learned forget/selection mechanism. When Δ is small, the update behaves closer to retaining the existing hidden state; when Δ is larger, the model injects more influence from the current input. This steering is learned per token position (and per channel group), even though the underlying state transition A is diagonal, meaning state channels evolve independently within the dynamics.
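A one-channel toy example (hypothetical numbers, using the Euler-style B-bar from above) illustrates the retain-versus-inject behavior:

```python
import math

# One diagonal channel with A = -1; h' = A·h + B·x, discretized per step.
A, B, h, x = -1.0, 1.0, 5.0, 2.0   # hypothetical state h and input x

for delta in (0.01, 0.5, 4.0):
    a_bar = math.exp(delta * A)     # how much of the old state survives
    b_bar = delta * B               # Euler-style input scale from the walkthrough
    h_next = a_bar * h + b_bar * x
    print(f"Δ={delta:4.2f}: keep {a_bar:.3f} of h -> h_next = {h_next:.3f}")

# Δ=0.01: keep 0.990 of h -> h_next = 4.970   (mostly retain the past)
# Δ=0.50: keep 0.607 of h -> h_next = 4.033   (blend)
# Δ=4.00: keep 0.018 of h -> h_next = 8.092   (mostly inject the input)
```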
After updating the hidden state, the model projects it back to the inner dimension using C, applies a gated modulation via an additional Z path (a separate projection of the input that is run through a nonlinearity), and then returns to model dimension with an output projection. Residual connections and normalization wrap the mixer at the layer level, matching the typical deep-network pattern: each Mamba layer contributes to a residual stream, with RMSNorm applied around the block.
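A sketch of that read-out path and the layer-level wrapper, again with illustrative names rather than the reference code (the hand-rolled RMSNorm mirrors the usual formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Standard RMSNorm: scale features by their root-mean-square."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def mixer_readout(h, C, z, out_proj):
    """h: (batch, d_inner, d_state) state after the scan step;
    C: (batch, d_state) per-token read-out; z: (batch, d_inner) gate path."""
    y = torch.einsum("bdn,bn->bd", h, C)  # project the state back to the inner dim
    y = y * F.silu(z)                     # gated modulation via the Z path
    return out_proj(y)                    # nn.Linear(d_inner, d_model): back to model dim

class ResidualBlock(nn.Module):
    """Layer-level wrapper: normalize, run the mixer, add to the residual stream."""
    def __init__(self, d_model, mixer):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.mixer = mixer

    def forward(self, x):
        return x + self.mixer(self.norm(x))
```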
Finally, the walkthrough emphasizes how the implementation maps to tensor shapes and GPU execution. The "state" is effectively matrix-shaped (D_state × D_inner) and operations are vectorized across the inner dimension, while Δ is computed through a low-rank bottleneck. The result is a sequential RNN-like recurrence in time, but expressed in a way that can be efficiently unrolled and trained with backpropagation through time, aided by hardware-oriented kernel fusion and memory-aware execution. The group flags vanishing gradients as a separate, deeper topic, likely tied to the continuous-time formulation and discretization strategy, and suggests a future deep dive into the HiPPO connection.
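Putting the shapes together, here is a sketch of the scan itself in the style of minimal reimplementations; the shapes are annotated per tensor, the usual D·u skip term is omitted for brevity, and real implementations replace this Python loop with a fused, memory-aware kernel:

```python
import torch

def selective_scan(u, delta, A, B, C):
    """Sequential scan with explicit shapes (a sketch, using the Euler-style B-bar).

    u:     (b, l, d_in)   inner-dim inputs after conv + SiLU
    delta: (b, l, d_in)   positive step sizes from the softplus path
    A:     (d_in, n)      diagonal transition (negative), one row per channel
    B, C:  (b, l, n)      per-token input/output projections
    returns y: (b, l, d_in)
    """
    b, l, d_in = u.shape
    n = A.shape[1]
    # Precompute the discretized terms for every position at once.
    deltaA = torch.exp(torch.einsum("bld,dn->bldn", delta, A))    # A-bar
    deltaB_u = torch.einsum("bld,bln,bld->bldn", delta, B, u)     # B-bar · u
    h = torch.zeros(b, d_in, n, device=u.device)                  # matrix-shaped state
    ys = []
    for t in range(l):  # the RNN-like recurrence, unrolled through time
        h = deltaA[:, t] * h + deltaB_u[:, t]                     # channels evolve independently
        ys.append(torch.einsum("bdn,bn->bd", h, C[:, t]))         # read out via C
    return torch.stack(ys, dim=1)
```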
Cornell Notes
Mamba’s implementation centers on a state-space “mixer” that updates a hidden state sequentially, avoiding attention’s explicit token-to-token comparisons. Inputs are projected into an inner space, passed through a convolution-through-time for local context, and then used to generate B, C, and a learned time-step modulation Δ. The model discretizes continuous dynamics using a zero-order hold approach: it builds A-bar from exp(Δ·A) (with A constrained via log-space parameterization) and scales input injection via Δ to form B-bar. Δ functions like a learned forget/selection control, steering how much past state versus current input influences the next state. This matters because it provides an attention-free path to long-range dependency modeling while remaining GPU-efficient through structured, vectorized tensor operations.
- How does Mamba replace attention’s “who should I look at?” mechanism?
- What is the role of Δ (Delta) in the discretized state-space update?
- Why is A parameterized in log space, and what does that guarantee?
- What mismatch appears between the paper’s B-bar formula and the code’s implementation?
- How do tensor shapes and channel independence affect computation?
- Where do residual connections and the gated MLP-like path (Z) fit?
Review Questions
- In Mamba, what exact quantities are generated from the input before the state update (B, C, and Δ), and how does Δ influence the discretization?
- How does the diagonal structure of A change the way state channels interact during the recurrence?
- What implementation shortcut for B-bar was discussed, and why might it differ from the paper’s full zero-order hold expression?
Key Points
1. Mamba models long-range dependency by evolving a hidden state sequentially, not by computing explicit attention weights over prior tokens.
2. A token is projected into an inner dimension, processed by a convolution-through-time module for local context, and then used to generate B, C, and a learned Δ for state updates.
3. A is initialized from an S4-style diagonal pattern and stored in log space so runtime exponentiation enforces the desired sign/constraints for stable continuous dynamics.
4. Discretization uses Δ to compute A-bar (via exp(Δ·A)) and to scale input injection through B-bar, making Δ both a discretization control and a learned forget/selection mechanism.
5. The code’s B-bar computation appears simplified versus the paper’s full expression, resembling an Euler-style approximation (e.g., effectively Δ·B).
6. After the state update, C projects the hidden state back to inner space, a gated Z path modulates it, and an output projection returns to model dimension.
7. Residual connections and RMSNorm wrap the mixer at the layer/block level, while Z is an internal gating path rather than the residual connection itself.