Mamba part 4 - System Details and Implementation
Based on West Coast Machine Learning's video on YouTube. If you find this content useful, support the original creators by watching, liking, and subscribing.
Briefing
Mamba’s core implementation hinges on a state-space “mixer” that updates a hidden state sequentially while keeping most computations hardware-friendly—especially by structuring parameters and operations so they can be streamed through time with minimal memory traffic. The key system insight from the walkthrough is that Mamba doesn’t rely on attention-style token-to-token similarity; instead, it compresses history into a hidden state and then uses learned dynamics to decide how much of that state to retain versus how much new input to inject at each step.
The architecture is built around a discretized continuous-time state-space model. A token (or, during training, a whole sequence tensor) first goes through a projection into a higher-dimensional “inner” space, then a convolution-through-time stage (described as a 1D convolution over a small window of neighboring tokens) adds local context. After that, the model generates the parameters needed for the state update: B and C matrices (used to inject input into the state and project the state back out), plus a learned time-step modulation term Δ (Delta) produced via a low-rank projection bottleneck. Δ is passed through softplus, then used to discretize the continuous dynamics.
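To make that data flow concrete, here is a minimal PyTorch sketch of the input path, loosely in the style of public minimal reimplementations of Mamba (e.g., johnma2006/mamba-minimal). The class and dimension names (MambaParamGen, d_inner, dt_rank, d_conv) are illustrative assumptions, not the walkthrough's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaParamGen(nn.Module):
    """Sketch of the input path: project up, convolve through time, emit Δ, B, C.
    Names and sizes are illustrative, not the reference implementation."""
    def __init__(self, d_model=64, d_inner=128, d_state=16, dt_rank=4, d_conv=4):
        super().__init__()
        # Project tokens into the wider "inner" space (x path plus gating z path).
        self.in_proj = nn.Linear(d_model, 2 * d_inner, bias=False)
        # Depthwise 1D convolution over a small causal window of neighboring tokens.
        self.conv1d = nn.Conv1d(d_inner, d_inner, kernel_size=d_conv,
                                groups=d_inner, padding=d_conv - 1)
        # One projection emits the raw Δ (low-rank), B, and C per token.
        self.x_proj = nn.Linear(d_inner, dt_rank + 2 * d_state, bias=False)
        # Low-rank bottleneck: dt_rank -> d_inner, then softplus for positivity.
        self.dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
        self.dt_rank, self.d_state = dt_rank, d_state

    def forward(self, tokens):                      # tokens: (batch, seq_len, d_model)
        x, z = self.in_proj(tokens).chunk(2, dim=-1)
        seq_len = x.shape[1]
        # Conv1d wants (batch, channels, time); trim the causal padding afterwards.
        x = self.conv1d(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        x = F.silu(x)
        dt, B, C = torch.split(self.x_proj(x),
                               [self.dt_rank, self.d_state, self.d_state], dim=-1)
        delta = F.softplus(self.dt_proj(dt))        # (batch, seq_len, d_inner), > 0
        return x, z, delta, B, C
```

The single projection emitting Δ, B, and C per token is what makes the model "selective": unlike S4, these quantities are functions of the input rather than fixed parameters.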
A major implementation detail is how the continuous-to-discrete conversion is handled. The diagonal state-transition parameter A is initialized from an S4-style diagonal structure (diagonal entries 1…n) and stored in log space; at runtime it is recovered as −exp(A_log), which guarantees strictly negative entries and therefore stable, decaying continuous dynamics. Discretization then uses a zero-order hold formulation: the model computes an "A-bar" term as exp(Δ·A) and a corresponding "B-bar" term that scales the input contribution by Δ. In the code walkthrough, B-bar is simplified compared with the full paper expression (effectively Δ·B, a shortcut consistent with an Euler-style approximation), so the math in the paper and the math in the implementation don't match term-for-term.
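A small numeric sketch (with made-up values) shows what this simplification trades away. For a diagonal A, the paper's zero-order-hold input term reduces per channel to (exp(Δ·A) − 1)/A · B, which the Δ·B shortcut approximates well when Δ·A is small:

```python
import torch

# Hypothetical values: diagonal A (negative, as in Mamba), one step size Δ, input matrix B.
A = -torch.tensor([1.0, 2.0, 4.0])           # diagonal entries of A
delta = torch.tensor(0.1)                    # one time step
B = torch.tensor([0.5, 0.5, 0.5])

A_bar = torch.exp(delta * A)                 # exact ZOH state transition
# Full ZOH input term for diagonal A: (exp(Δ·A) − 1) / A · B
B_bar_zoh = (A_bar - 1.0) / A * B
# Euler-style shortcut described in the walkthrough: Δ · B
B_bar_euler = delta * B

print(B_bar_zoh)    # ≈ tensor([0.0476, 0.0453, 0.0412])
print(B_bar_euler)  # tensor([0.0500, 0.0500, 0.0500])
```

For small Δ the two nearly coincide, which is presumably why the shortcut works in practice.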
Δ plays a dual role. It is not just a discretization constant; it also acts like a learned forget/selection mechanism. When Δ is small, the update behaves closer to retaining the existing hidden state; when Δ is larger, the model injects more influence from the current input. This steering is learned per token position (and per channel group), even though the underlying state transition A is diagonal, meaning state channels evolve independently within the dynamics.
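A one-channel toy example (hypothetical numbers, using the Euler-style B-bar from above) illustrates the retain-versus-inject behavior:

```python
import math

# One diagonal channel with A = -1; h' = A·h + B·x, discretized per step.
A, B, h, x = -1.0, 1.0, 5.0, 2.0   # hypothetical state h and input x

for delta in (0.01, 0.5, 4.0):
    a_bar = math.exp(delta * A)     # how much of the old state survives
    b_bar = delta * B               # Euler-style input scale from the walkthrough
    h_next = a_bar * h + b_bar * x
    print(f"Δ={delta:4.2f}: keep {a_bar:.3f} of h -> h_next = {h_next:.3f}")

# Δ=0.01: keep 0.990 of h -> h_next = 4.970   (mostly retain the past)
# Δ=0.50: keep 0.607 of h -> h_next = 4.033   (blend)
# Δ=4.00: keep 0.018 of h -> h_next = 8.092   (mostly inject the input)
```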
After updating the hidden state, the model projects it back to the inner dimension using C, applies a gated modulation via an additional Z path (a separate projection of the input that is run through a nonlinearity), and then returns to model dimension with an output projection. Residual connections and normalization wrap the mixer at the layer level, matching the typical deep-network pattern: each Mamba layer contributes to a residual stream, with RMSNorm applied around the block.
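A sketch of that read-out path and the layer-level wrapper, again with illustrative names rather than the reference code (the hand-rolled RMSNorm mirrors the usual formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Standard RMSNorm: scale features by their root-mean-square."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def mixer_readout(h, C, z, out_proj):
    """h: (batch, d_inner, d_state) state after the scan step;
    C: (batch, d_state) per-token read-out; z: (batch, d_inner) gate path."""
    y = torch.einsum("bdn,bn->bd", h, C)  # project the state back to the inner dim
    y = y * F.silu(z)                     # gated modulation via the Z path
    return out_proj(y)                    # nn.Linear(d_inner, d_model): back to model dim

class ResidualBlock(nn.Module):
    """Layer-level wrapper: normalize, run the mixer, add to the residual stream."""
    def __init__(self, d_model, mixer):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.mixer = mixer

    def forward(self, x):
        return x + self.mixer(self.norm(x))
```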
Finally, the walkthrough emphasizes how the implementation maps to tensor shapes and GPU execution. The "state" is effectively matrix-shaped (D_state × D_inner) and operations are vectorized across the inner dimension, while Δ is computed through a low-rank bottleneck. The result is a sequential RNN-like recurrence in time, but expressed in a way that can be efficiently unrolled and trained with backpropagation through time, aided by hardware-oriented kernel fusion and memory-aware execution. The group flags vanishing gradients as a separate, deeper topic, likely tied to the continuous-time formulation and discretization strategy, and suggests a future deep dive into the HiPPO connection.
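Putting the shapes together, here is a sketch of the scan itself in the style of minimal reimplementations; the shapes are annotated per tensor, the usual D·u skip term is omitted for brevity, and real implementations replace this Python loop with a fused, memory-aware kernel:

```python
import torch

def selective_scan(u, delta, A, B, C):
    """Sequential scan with explicit shapes (a sketch, using the Euler-style B-bar).

    u:     (b, l, d_in)   inner-dim inputs after conv + SiLU
    delta: (b, l, d_in)   positive step sizes from the softplus path
    A:     (d_in, n)      diagonal transition (negative), one row per channel
    B, C:  (b, l, n)      per-token input/output projections
    returns y: (b, l, d_in)
    """
    b, l, d_in = u.shape
    n = A.shape[1]
    # Precompute the discretized terms for every position at once.
    deltaA = torch.exp(torch.einsum("bld,dn->bldn", delta, A))    # A-bar
    deltaB_u = torch.einsum("bld,bln,bld->bldn", delta, B, u)     # B-bar · u
    h = torch.zeros(b, d_in, n, device=u.device)                  # matrix-shaped state
    ys = []
    for t in range(l):  # the RNN-like recurrence, unrolled through time
        h = deltaA[:, t] * h + deltaB_u[:, t]                     # channels evolve independently
        ys.append(torch.einsum("bdn,bn->bd", h, C[:, t]))         # read out via C
    return torch.stack(ys, dim=1)
```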
Cornell Notes
Mamba’s implementation centers on a state-space “mixer” that updates a hidden state sequentially, avoiding attention’s explicit token-to-token comparisons. Inputs are projected into an inner space, passed through a convolution-through-time for local context, and then used to generate B, C, and a learned time-step modulation Δ. The model discretizes continuous dynamics using a zero-order hold approach: it builds A-bar from exp(Δ·A) (with A constrained via log-space parameterization) and scales input injection via Δ to form B-bar. Δ functions like a learned forget/selection control, steering how much past state versus current input influences the next state. This matters because it provides an attention-free path to long-range dependency modeling while remaining GPU-efficient through structured, vectorized tensor operations.
- How does Mamba replace attention’s “who should I look at?” mechanism?
- What is the role of Δ (Delta) in the discretized state-space update?
- Why is A parameterized in log space, and what does that guarantee?
- What mismatch appears between the paper’s B-bar formula and the code’s implementation?
- How do tensor shapes and channel independence affect computation?
- Where do residual connections and the gated MLP-like path (Z) fit?
Review Questions
- In Mamba, what exact quantities are generated from the input before the state update (B, C, and Δ), and how does Δ influence the discretization?
- How does the diagonal structure of A change the way state channels interact during the recurrence?
- What implementation shortcut for B-bar was discussed, and why might it differ from the paper’s full zero-order hold expression?
Key Points
1. Mamba models long-range dependency by evolving a hidden state sequentially, not by computing explicit attention weights over prior tokens.
2. A token is projected into an inner dimension, processed by a convolution-through-time module for local context, and then used to generate B, C, and a learned Δ for state updates.
3. A is initialized from an S4-style diagonal pattern and stored in log space so runtime exponentiation enforces the desired sign/constraints for stable continuous dynamics.
4. Discretization uses Δ to compute A-bar (via exp(Δ·A)) and to scale input injection through B-bar, making Δ both a discretization control and a learned forget/selection mechanism.
5. The code’s B-bar computation appears simplified versus the paper’s full expression, resembling an Euler-style approximation (e.g., effectively Δ·B).
6. After the state update, C projects the hidden state back to inner space, a gated Z path modulates it, and an output projection returns to model dimension.
7. Residual connections and RMSNorm wrap the mixer at the layer/block level, while Z is an internal gating path rather than the residual connection itself.