Forecasting Bitcoin with Mamba State Space Models

By Gabriel Navarro
May 17, 2025


This post demonstrates how to forecast minute-by-minute Bitcoin OHLCV data using the Mamba State Space Model (SSM)—a linear-time, hardware-aware sequence model that rivals Transformers on long contexts. We’ll:

  • Build intuition for state-space forecasting and why it excels on noisy financial series
  • Set up your environment with mamba-ssm and key dependencies
  • Ingest & preprocess Bitcoin data: log-scaling, z-scoring, and windowing
  • Define a stacked Mamba2 architecture for autoregression
  • Train with PyTorch Lightning: mixed-precision, gradient clipping, and scheduler tricks
  • Evaluate via regression metrics and compare against a FlashAttention baseline

By the end, you’ll understand not just how to implement Mamba SSM, but why it works—and where to go next.


1. Why State-Space Models for Finance?

Traditional time-series methods (ARIMA, exponential smoothing) excel at short-range forecasts under stationarity, but struggle with latent dynamics and non-stationarity common in finance (Medium).

State-space models (SSMs) introduce hidden “state” vectors that evolve via linear or nonlinear dynamics, while observations are noisy functions of those states (mfe.baruch.cuny.edu). This separation of signal vs. noise yields:

  • Adaptive memory: long-range dependencies learned without quadratic attention costs
  • Robustness: explicit modeling of process & measurement noise
  • Interpretability: clear transition vs. observation equations
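
In its simplest linear-Gaussian form, this separation is just two equations: a hidden state that evolves over time and an observation that reads it out through noise:

$$ h_t = A\,h_{t-1} + B\,u_t + w_t, \qquad y_t = C\,h_t + v_t $$

where $h_t$ is the latent state, $u_t$ an optional exogenous input, $y_t$ the observed series (here, OHLCV values), and $w_t$, $v_t$ the process and measurement noise.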

SSMs have underpinned Kalman filtering in control, robotics, and economics since the 1960s, yet only recently have they matched Transformers on raw sequence tasks (arXiv).


2. Introducing Mamba SSM

Mamba is a next-generation SSM that combines selective state updates with hardware-aware kernels, achieving linear time and memory complexity while retaining strong performance on language, audio, and genomic data (arXiv).

Key innovations:

  1. Selective state propagation
  2. Control-theory inspired dynamics
  3. Kernel fusion akin to FlashAttention for GPU efficiency (The Gradient)
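
The first of these, selective state propagation, means the SSM parameters are functions of the current input. Following the Mamba paper's notation, each channel carries a hidden state updated as

$$ h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t $$

where $\bar{A}_t = \exp(\Delta_t A)$ and the step size $\Delta_t$, input matrix $\bar{B}_t$, and readout $C_t$ are all computed from $x_t$, letting the model decide per time step what to remember and what to discard.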

Empirically, Mamba rivals or outperforms Transformers with 5× higher throughput at 2K–16K contexts (Goomba Lab) and exhibits Lyapunov stability under mixed precision (arXiv).


3. Installation & Setup

We recommend the Mamba package manager (a fast, drop-in Conda replacement, not to be confused with the Mamba SSM) to isolate dependencies:

# Install mamba in base
conda install -n base -c conda-forge mamba

# Create env & install PyTorch, Lightning, etc.
mamba create -n mamba-ssm python=3.10
conda activate mamba-ssm
mamba install pytorch lightning numpy pandas matplotlib \
             -c pytorch -c conda-forge
pip install mamba-ssm litdata kaggle

Verify your NVIDIA drivers + CUDA match your PyTorch build to leverage Mamba’s Triton kernels. See our MLContainer Lab for a full Dockerfile with FlashAttention, Triton, and Mamba (MLContainer Lab).
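
A quick sanity check inside the activated environment (a minimal sketch; assumes mamba-ssm ≥ 2.0, which exposes the Mamba2 block at the package root):

import torch
from mamba_ssm import Mamba2

# PyTorch build, its CUDA version, and GPU visibility
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

# If this constructs and runs, the fused CUDA/Triton kernels are usable
block = Mamba2(d_model=64, d_state=64, d_conv=4, expand=4, headdim=16).cuda()
x = torch.randn(1, 2048, 64, device="cuda")
print(block(x).shape)   # torch.Size([1, 2048, 64])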


4. Data Sourcing & Preprocessing

We use the mczielinski/bitcoin-historical-data Kaggle dataset (1-min OHLCV) spanning multiple years, ideal for long-context models. Download via the Kaggle API:

kaggle datasets download mczielinski/bitcoin-historical-data \
  -f btcusd_1-min_data.csv -p ./datasets/   # requires Kaggle credentials

Preprocessing Steps

  1. Log-transform prices to stabilize variance

  2. Log1p-transform volume to reduce skew

  3. Z-score each feature:

    $$ x' = \frac{x - \mu}{\sigma} $$

  4. Window into 2,048-step sequences (75% overlap)

  5. Mask any window containing NaNs

We then stream and serialize with litdata.optimize(), producing ~100 k valid windows for training (Medium).
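
A minimal pandas/NumPy sketch of steps 1–5 (column names assume the btcusd_1-min_data.csv schema; the litdata streaming/serialization step is omitted):

import numpy as np
import pandas as pd

df = pd.read_csv("datasets/btcusd_1-min_data.csv")

# 1-2. Log-scale prices, log1p-scale volume
feats = df[["Open", "High", "Low", "Close", "Volume"]].copy()
feats[["Open", "High", "Low", "Close"]] = np.log(feats[["Open", "High", "Low", "Close"]])
feats["Volume"] = np.log1p(feats["Volume"])

# 3. Z-score each feature
feats = (feats - feats.mean()) / feats.std()

# 4. Window into 2,048-step sequences with 75% overlap (stride of 512 steps)
seq_len, stride = 2048, 512
arr = feats.to_numpy(dtype=np.float32)
windows = [arr[i:i + seq_len] for i in range(0, len(arr) - seq_len + 1, stride)]

# 5. Mask (drop) any window containing NaNs
windows = [w for w in windows if not np.isnan(w).any()]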


5. Model Architecture

import torch
import torch.nn as nn
import lightning.pytorch as pl


class Mamba2Model(pl.LightningModule):
    def __init__(self, n_features: int = 5, d_model: int = 64, n_layers: int = 4):
        super().__init__()
        # Project the 5 OHLCV features into a 64-dim embedding
        self.in_proj = nn.Linear(n_features, d_model)
        # Stack of Mamba2 blocks; args: d_model, d_state, d_conv, expand, headdim, layer index
        self.layers = nn.ModuleList([
            Mamba2Layer(d_model, 64, 4, 4, 16, i) for i in range(n_layers)
        ])
        # Project back to the 5 output features
        self.out_proj = nn.Linear(d_model, n_features)

    def forward(self, x):
        x = self.in_proj(x)
        for layer in self.layers:
            x = layer(x)
        return self.out_proj(x)

  • Input: 5 features → 64-dim embedding
  • 4× Mamba2 layers (d_state=64, d_conv=4, expand=4, headdim=16)
  • Output: 64 → 5 features
  • Loss: Autoregressive Smooth L₁ + per-feature weighting (volume 0.25×)

This lean design eschews attention and MLP blocks—yet matches Transformers on nonlinear tasks (Hugging Face).
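
The autoregressive Smooth L1 objective from the list above can be sketched as a one-step-ahead shift with per-feature weights (the 0.25 weight on volume comes from the bullet; the feature ordering and reduction are assumed):

import torch
import torch.nn.functional as F

# Assumed feature order: Open, High, Low, Close, Volume (volume down-weighted to 0.25)
FEATURE_WEIGHTS = torch.tensor([1.0, 1.0, 1.0, 1.0, 0.25])

def autoregressive_smooth_l1(preds, targets):
    """Compare the prediction at step t against the target at step t+1."""
    # preds, targets: (batch, seq_len, 5)
    pred_next, true_next = preds[:, :-1], targets[:, 1:]
    per_elem = F.smooth_l1_loss(pred_next, true_next, reduction="none")
    return (per_elem * FEATURE_WEIGHTS.to(per_elem.device)).mean()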


6. Training Loop

We leverage PyTorch Lightning for:

  • Autoregressive Huber loss (shifted predictions)
  • AdamW + ReduceLROnPlateau
  • Gradient clipping (0.5)
  • Mixed precision (bf16-mixed)
  • Batch size: 32 × accumulation 8 → effective 256

from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu", devices=1,
    precision="bf16-mixed",
    accumulate_grad_batches=8,
    gradient_clip_val=0.5,
    callbacks=[EarlyStopping("val_loss", patience=10),
               ModelCheckpoint(monitor="val_loss", save_top_k=1)],
)
trainer.fit(model, train_loader, val_loader)
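
The AdamW + ReduceLROnPlateau bullets map onto the LightningModule's configure_optimizers hook; a sketch to drop into Mamba2Model (the learning rate, weight decay, and scheduler settings are assumed values, not taken from this post):

# Inside Mamba2Model (torch is imported at module level)
def configure_optimizers(self):
    optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "monitor": "val_loss"},
    }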

Loss Curves

The training (blue) and validation (orange) curves track closely, indicating stable generalization without heavy overfitting.


7. Evaluation & Baseline Comparison

After trainer.test(), we compute per-feature MSE, RMSE, MAE, MAPE, R² and compare against a FlashAttention Transformer trained on the same data.
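
These per-feature metrics can be computed with plain NumPy on the de-normalized test predictions; a minimal sketch (array shapes and the de-normalization step are assumed):

import numpy as np

FEATURES = ["Open", "High", "Low", "Close", "Volume"]

def regression_report(y_true, y_pred):
    """y_true, y_pred: (n_samples, 5) arrays in the original price/volume scale."""
    for i, name in enumerate(FEATURES):
        t, p = y_true[:, i], y_pred[:, i]
        mse = np.mean((t - p) ** 2)
        rmse = np.sqrt(mse)
        mae = np.mean(np.abs(t - p))
        mape = np.mean(np.abs((t - p) / t)) * 100   # unstable near zero, hence the volume blow-up
        r2 = 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)
        print(f"{name:<7} MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  "
              f"MAPE={mape:.2f}%  R2={r2:.4f}")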

Feature   Mamba R²   Mamba MAPE (%)   FlashAttn R²   FlashAttn MAPE (%)
Close     0.9901     1.45             0.8242         2.42
Open      0.9903     1.31             0.9900         2.54
High      0.9918     1.09             0.9918         2.47
Low       0.9799     1.27             0.9798         3.48
Volume    0.1692     330.00           0.1692         393.19

Mamba matches or exceeds FlashAttention on the price series, roughly halving the percentage errors, while both models struggle on the noisy volume feature (Maarten Grootendorst Substack).


8. Next Steps

To push forecasting further:

  • Feature engineering: technical indicators, regime‐change flags
  • Hybrid models: combine Mamba SSM with sparse attention
  • Zero-shot forecasting: explore Mamba4Cast’s synthetic training paradigm (arXiv)
  • Hyperparameter sweeps: integrate Lightning’s tuner for optimal d_state, layers, etc.
  • Theoretical analysis: leverage Lyapunov stability for robust mixed-precision training (arXiv)

By uniting Mamba’s linear scaling with domain-aware preprocessing, you can tackle million-step horizons in finance and beyond. Happy modeling! 🚀