SYSTEM / MACHINE LEARNING / DEEP-LEARNING / STATE-SPACE-MODELS
By Gabriel Navarro
May 17, 2025
This post demonstrates how to forecast minute-by-minute Bitcoin OHLCV data using the Mamba State Space Model (SSM), a linear-time, hardware-aware sequence model that rivals Transformers on long contexts. We'll:
- set up an environment with mamba-ssm and key dependencies
- download and preprocess minute-level Bitcoin OHLCV data from Kaggle
- build and train a compact Mamba-based model with PyTorch Lightning
- evaluate it against a FlashAttention Transformer baseline

By the end, you'll understand not just how to implement Mamba SSM, but why it works and where to go next.
Traditional time-series methods (ARIMA, exponential smoothing) excel at short-range forecasts under stationarity, but struggle with latent dynamics and non-stationarity common in finance (Medium).
State-space models (SSMs) introduce hidden “state” vectors that evolve via linear or nonlinear dynamics, while observations are noisy functions of those states (mfe.baruch.cuny.edu). Separating the underlying signal from observation noise in this way yields more robust estimates of the latent dynamics and a principled treatment of measurement noise and missing data.
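In the classical linear-Gaussian case this takes the familiar form

$$ h_t = A\,h_{t-1} + w_t, \qquad y_t = C\,h_t + v_t $$

where $h_t$ is the latent state, $y_t$ the observed value, and $w_t$, $v_t$ are process and observation noise.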
SSMs have underpinned Kalman filtering in control, robotics, and economics since the 1960s, yet only recently have deep SSM variants matched Transformers on raw sequence tasks (arXiv).
Mamba is a next-generation SSM that combines selective state updates with hardware-aware kernels, achieving linear time and memory complexity while retaining strong performance on language, audio, and genomic data (arXiv).
Key innovations:
- A selection mechanism: the SSM step size and projection parameters are computed from the current input, letting the model decide what to keep in or drop from its state.
- Hardware-aware fused kernels that evaluate the recurrence as a parallel scan without materializing the expanded state in slow GPU memory.
- A simplified, homogeneous block that replaces the usual attention-plus-MLP pair.

Empirically, Mamba rivals or outperforms Transformers with 5× higher throughput at 2K–16K contexts (Goomba Lab) and exhibits Lyapunov stability under mixed precision (arXiv).
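At its core, each Mamba layer runs a discretized state-space recurrence in which the step size $\Delta_t$ and the projections $B_t$, $C_t$ are themselves functions of the input $x_t$ (the “selective” part):

$$ h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t, \qquad \bar{A}_t = \exp(\Delta_t A) $$

with $\bar{B}_t$ the corresponding discretization of $\Delta_t B_t$; the whole scan is evaluated by fused kernels rather than attention.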
We recommend the Mamba package manager (a faster drop-in replacement for Conda, not to be confused with mamba-ssm) to isolate dependencies:
```bash
# Install mamba in base
conda install -n base -c conda-forge mamba

# Create env & install PyTorch, Lightning, etc.
mamba create -n mamba-ssm python=3.10
conda activate mamba-ssm
mamba install pytorch lightning numpy pandas matplotlib \
    -c pytorch -c conda-forge
pip install mamba-ssm litdata kaggle
```
Verify your NVIDIA drivers + CUDA match your PyTorch build to leverage Mamba’s Triton kernels. See our MLContainer Lab for a full Dockerfile with FlashAttention, Triton, and Mamba (MLContainer Lab).
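A quick way to confirm the environment is wired up correctly is to run a single Mamba block forward pass on the GPU (a minimal sanity check; the shapes here are arbitrary):

```python
import torch
from mamba_ssm import Mamba  # fused kernels are CUDA-only

assert torch.cuda.is_available(), "mamba-ssm's fused kernels require a CUDA GPU"

# (batch, sequence length, model dimension) -- arbitrary sanity-check sizes
x = torch.randn(2, 64, 16, device="cuda")
block = Mamba(d_model=16, d_state=16, d_conv=4, expand=2).to("cuda")
print(block(x).shape)  # expected: torch.Size([2, 64, 16])
```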
We use the mczielinski/bitcoin-historical-data Kaggle dataset (1-min OHLCV) spanning multiple years, ideal for long-context models. Download via the Kaggle API:
```bash
kaggle datasets download mczielinski/bitcoin-historical-data \
    -f btcusd_1-min_data.csv -p ./datasets/   # requires Kaggle API credentials
```
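Depending on the CLI version the file may arrive as .csv or .csv.zip; either way, a minimal loading sketch looks like this (the column names are assumptions based on the current version of the dataset):

```python
from pathlib import Path
import pandas as pd

# Pick up either btcusd_1-min_data.csv or btcusd_1-min_data.csv.zip;
# pandas reads zipped CSVs transparently.
path = next(Path("./datasets").glob("btcusd_1-min_data.csv*"))
df = pd.read_csv(path)

# Assumed schema: Timestamp (Unix seconds), Open, High, Low, Close, Volume
df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit="s")
df = df.set_index("Timestamp").sort_index()
print(df.head())
```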
The raw series is then preprocessed as follows:
1. Log-transform prices to stabilize variance.
2. Apply a log1p transform (log(1 + x)) to volume to reduce skew.
3. Z-score each feature:
   $$ x' = \frac{x - \mu}{\sigma} $$
4. Window the series into 2,048-step sequences with 75% overlap.
5. Mask (drop) any window containing NaNs.
We then stream and serialize with litdata.optimize(), producing ~100 k valid windows for training (Medium).
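A condensed sketch of those steps, continuing from the loading sketch above (the column order, the hand-rolled windowing, and the use of plain NumPy are assumptions, not the post's exact code; the resulting windows are what litdata.optimize() serializes):

```python
import numpy as np

FEATURES = ["Open", "High", "Low", "Close", "Volume"]   # assumed column order
SEQ_LEN, STRIDE = 2048, 512                             # 75% overlap -> stride of 512

values = df[FEATURES].to_numpy(dtype=np.float32)
values[:, :4] = np.log(values[:, :4])                   # log-prices
values[:, 4] = np.log1p(values[:, 4])                   # log1p(volume)
values = (values - np.nanmean(values, axis=0)) / np.nanstd(values, axis=0)  # z-score

# Overlapping 2048-step windows; any window touching a NaN (data gap) is discarded.
windows = [
    values[s : s + SEQ_LEN]
    for s in range(0, len(values) - SEQ_LEN + 1, STRIDE)
    if not np.isnan(values[s : s + SEQ_LEN]).any()
]
# These windows are then serialized with litdata.optimize() into streamable chunks.
```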
```python
import torch.nn as nn
import lightning.pytorch as pl
from mamba_ssm import Mamba2  # fused Mamba-2 block

class Mamba2Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(5, 64)   # OHLCV features -> model width
        self.layers = nn.ModuleList([
            # keyword arguments mirror the original positional call (mapping assumed)
            Mamba2(d_model=64, d_state=64, d_conv=4, expand=4, headdim=16, layer_idx=i)
            for i in range(4)
        ])
        self.out_proj = nn.Linear(64, 5)  # model width -> OHLCV features

    def forward(self, x):
        x = self.in_proj(x)
        for layer in self.layers:
            x = layer(x)
        return self.out_proj(x)
```
This lean design eschews attention and MLP blocks—yet matches Transformers on nonlinear tasks (Hugging Face).
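The module above only defines the forward pass; for trainer.fit() below it also needs training/validation steps and an optimizer. A minimal sketch, assuming one-step-ahead prediction with an MSE objective and AdamW (both are assumptions, not details given in the post):

```python
import torch
import torch.nn.functional as F

class Mamba2Model(pl.LightningModule):
    # __init__ and forward as defined above

    def _step(self, batch):
        # Predict each timestep's OHLCV features from the sequence up to the previous step.
        inputs, targets = batch[:, :-1, :], batch[:, 1:, :]
        return F.mse_loss(self(inputs), targets)

    def training_step(self, batch, batch_idx):
        loss = self._step(batch)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._step(batch), prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)
```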
We leverage PyTorch Lightning for:
- mixed-precision training (bf16-mixed)
- gradient accumulation and gradient clipping
- early stopping and checkpointing on validation loss

```python
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu", devices=1,
    precision="bf16-mixed",
    accumulate_grad_batches=8,
    gradient_clip_val=0.5,
    callbacks=[EarlyStopping("val_loss", patience=10),
               ModelCheckpoint(monitor="val_loss", save_top_k=1)],
)
trainer.fit(model, train_loader, val_loader)
```
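For completeness, the train_loader / val_loader passed to trainer.fit() above can be built from the serialized windows with litdata's streaming classes (a sketch; the separate train/val output directories are an assumption):

```python
import litdata as ld

# Assumes the optimized windows were written to separate train/ and val/ output dirs.
train_ds = ld.StreamingDataset("./datasets/btc_windows/train", shuffle=True)
val_ds = ld.StreamingDataset("./datasets/btc_windows/val")

train_loader = ld.StreamingDataLoader(train_ds, batch_size=4, num_workers=4)
val_loader = ld.StreamingDataLoader(val_ds, batch_size=4, num_workers=4)
```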
The training (blue) and validation (orange) curves track closely, indicating stable generalization without heavy overfitting.
After trainer.test(), we compute per-feature MSE, RMSE, MAE, MAPE, R² and compare against a FlashAttention Transformer trained on the same data.
| Feature | Mamba R² | Mamba MAPE (%) | FlashAttn R² | FlashAttn MAPE (%) |
|---|---|---|---|---|
| Close | 0.9901 | 1.45 | 0.8242 | 2.42 |
| Open | 0.9903 | 1.31 | 0.9900 | 2.54 |
| High | 0.9918 | 1.09 | 0.9918 | 2.47 |
| Low | 0.9799 | 1.27 | 0.9798 | 3.48 |
| Volume | 0.1692 | 330.00 | 0.1692 | 393.19 |
Mamba matches or slightly exceeds FlashAttention on price series, cutting percentage errors nearly in half—and both struggle on noisy volume (Maarten Grootendorst Substack).
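For reference, the per-feature metrics behind the table can be computed along these lines (a NumPy sketch; preds and targets are assumed to be the de-normalized test predictions and ground truth, shaped (N, 5)):

```python
import numpy as np

FEATURES = ["Open", "High", "Low", "Close", "Volume"]

def per_feature_metrics(preds: np.ndarray, targets: np.ndarray) -> dict:
    """MSE, RMSE, MAE, MAPE (%) and R^2 for each OHLCV column."""
    results = {}
    for i, name in enumerate(FEATURES):
        p, t = preds[:, i], targets[:, i]
        err = p - t
        mse = float(np.mean(err ** 2))
        ss_res = float(np.sum(err ** 2))
        ss_tot = float(np.sum((t - t.mean()) ** 2))
        results[name] = {
            "MSE": mse,
            "RMSE": float(np.sqrt(mse)),
            "MAE": float(np.mean(np.abs(err))),
            # guard against zero targets (e.g. zero-volume minutes)
            "MAPE": float(np.mean(np.abs(err) / np.maximum(np.abs(t), 1e-8)) * 100.0),
            "R2": 1.0 - ss_res / ss_tot,
        }
    return results
```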
By uniting Mamba's linear scaling with domain-aware preprocessing, you can push forecasting further and tackle million-step horizons in finance and beyond. Happy modeling! 🚀