Machine learning has delivered remarkable breakthroughs, from beating humans at Go to powering life-saving medical diagnoses. Yet amid this progress lurks a nagging problem: reproducibility. You train a model on your machine and achieve 92% accuracy. You share your code and data—only to hear crickets when colleagues try to reproduce your results. Why? Subtle differences in library versions, CUDA drivers, or even Python patch levels can send your metrics spiraling. As ML workloads grow in complexity, mixing Python packages, C++ extensions, GPU drivers, and cloud services, the “it works on my laptop” excuse no longer cuts it.
Containerization, and Docker in particular, offers a compelling antidote. By packaging code, dependencies, and runtime into a self-contained image, you lock down your environment once and for all. But how do you wield Docker effectively for ML? Let’s dive in.
At its core, Docker lets you build images (immutable snapshots containing everything your code needs) and run them as containers (lightweight, isolated processes that behave identically across hosts).
Typical workflow:
1. Pick a base image (e.g., python:3.10, nvcr.io/nvidia/pytorch, etc.).
2. Write a Dockerfile and build the image: docker build -t my-ml-image .
3. Run it as a container: docker run --gpus all -it my-ml-image bash

This ensures your code always sees the same OS libraries, Python packages, and even GPU drivers, regardless of where you run it.
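As a minimal sketch of that loop (the Dockerfile contents, requirements.txt, and train.py are illustrative placeholders, not part of the flash-attn example later in this post):

# Write a throwaway Dockerfile inline; real projects keep this file in the repo.
cat > Dockerfile <<'EOF'
FROM python:3.10-slim
WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "train.py"]
EOF

# Build the image, then start a disposable interactive container from it.
docker build -t my-ml-image .
docker run --rm -it my-ml-image bash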
A well-crafted ML dev container should feel as seamless as a local virtualenv, yet fully reproducible. Key considerations:
- Bake in developer tooling such as ruff or black so everyone lints and formats the same way.
- Mount your source tree into the container (e.g., -v $(pwd):/workspace) so you can iterate without rebuilding every change.
- Run as your host user rather than root inside the container to prevent file-permission headaches.

Example docker-compose.yml for local dev:
version: '3.8'
services:
  ml-dev:
    image: my-flash-attn:latest  # Dockerfile example below
    build: .
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - .:/workspace
    ports:
      - "8888:8888"  # Jupyter
      - "6006:6006"  # TensorBoard
    working_dir: /workspace
    user: "${UID}:${GID}"
This lets you docker-compose up and jump straight into coding with GPU support, linting, notebooks, and your editor talking to a containerized runtime.
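Day to day, that might look like the following; the .env file supplies the ${UID}/${GID} values the compose file expects (bash's own UID variable is read-only, so writing it to .env is the easiest route):

echo "UID=$(id -u)"  > .env
echo "GID=$(id -g)" >> .env

# One-off interactive dev container with GPU access; --service-ports also
# publishes the Jupyter/TensorBoard ports declared in docker-compose.yml.
docker-compose run --rm --service-ports ml-dev bash

# Quick GPU sanity check without opening a shell.
docker-compose run --rm ml-dev nvidia-smi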
Production and heavy-duty training images need more careful layering for cache efficiency and minimal size. Here’s a pattern I often use:
1. Base image selection: start from a framework image (e.g., nvcr.io/nvidia/pytorch:xx.xx-py3) or a slimmer nvidia/cuda:xx.x-cudnn8-runtime-ubuntu20.04.
2. System dependencies: install only what you need and clean the apt cache in the same layer:
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential git curl && \
    rm -rf /var/lib/apt/lists/*
3. Python dependencies: copy requirements.txt on its own and install it in a dedicated layer, so package installs are cached independently of code changes.
4. Source code: copy your code last and use a .dockerignore aggressively.

Layering tip: group seldom-changing steps (OS packages, complex library builds) before frequently updated steps (your code), so Docker’s cache accelerates iterative workflows.
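A sketch of both ideas; the ignore entries and image tag are placeholders:

# Keep the build context lean so COPY layers stay small and cache-friendly.
cat > .dockerignore <<'EOF'
.git
__pycache__/
*.ipynb_checkpoints
data/
checkpoints/
EOF

# Inspect layer sizes and order: docker history lists the newest layers first,
# so your code layers should sit near the top and heavy dependency layers below.
docker history --format '{{.Size}}\t{{.CreatedBy}}' my-ml-image | head -n 15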
Pin dependencies for reproducibility:
- Pin exact versions (e.g., torch==2.1.0) in your Dockerfile or a requirements.txt.
- For stricter guarantees, verify package hashes with pip install --require-hashes -r requirements.txt.
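One way to produce a fully hashed requirements file is pip-tools; it isn't used elsewhere in this post, so treat this as an illustrative sketch:

# pip-compile resolves loose requirements.in pins into exact versions plus hashes.
pip install pip-tools
pip-compile --generate-hashes requirements.in -o requirements.txt

# With hashes present, any artifact that changes upstream fails the install.
pip install --require-hashes -r requirements.txt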
Docker integrates with NVIDIA GPUs through the NVIDIA Container Toolkit. Essentials:
- Install the NVIDIA Container Toolkit (nvidia-docker runtime) on the host.
- Use a base image whose CUDA libraries match the host driver.
- Pass GPUs explicitly in the run command:
docker run --gpus '"device=0,1"' -e NCCL_SOCKET_IFNAME=^docker0,lo \
    my-ml-image nvidia-smi
Environment variables:
- NCCL_SOCKET_IFNAME steers multi-GPU communication over the correct network interface.
- PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 helps with out-of-memory tuning.
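Both can be set per run and verified in one shot; a quick sketch with an illustrative image name:

# Confirm the container sees the GPUs and that the variables are set as expected.
docker run --rm --gpus all \
    -e NCCL_SOCKET_IFNAME=^docker0,lo \
    -e PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
    my-ml-image \
    bash -c 'nvidia-smi && env | grep -E "NCCL|PYTORCH_CUDA"'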
Let’s walk through a real example: containerizing the flash-attn repository for both local experimentation and large-scale cloud training.

FROM nvcr.io/nvidia/pytorch:25.01-py3
WORKDIR /workdir
# 1. Build flash-attn from source
RUN rm -rf ./flash-attention/* && \
    pip uninstall flash_attn -y && \
    git clone -b v2.7.4.post1 https://github.com/Dao-AILab/flash-attention.git && \
    cd flash-attention/csrc/rotary && python setup.py install && \
    cd ../layer_norm && python setup.py install && \
    cd ../fused_dense_lib && python setup.py install && \
    cd ../fused_softmax && python setup.py install && \
    cd ../../ && python setup.py install
# 2. Install ML & dev dependencies
RUN pip install --no-cache-dir \
    lightning tensorboard pydantic \
    ipykernel ruff nbformat ipywidgets tqdm \
    synapseclient datasets litdata \
    google-cloud-aiplatform google-cloud-pipeline-components db-dtypes
# 3. Optimize NCCL for Docker networking
ENV NCCL_SOCKET_IFNAME=^docker0,lo
# 4. Switch to app directory & add healthcheck
WORKDIR /app
HEALTHCHECK --interval=30s --timeout=30s --retries=3 \
    CMD nvidia-smi || exit 1
Note that the flash-attn source build is grouped into a single RUN block, so that layer only changes when you upgrade the library.

Local builds:
docker build -t flash-attn:local -f Dockerfile .
Once built, you can run the container locally:
docker run --gpus all -it flash-attn:local bash
Inside the container:
- Run nvidia-smi to confirm GPU access.
- Run a short script (e.g., python flash_example.py) to validate FlashAttention functionality; a one-line smoke test is sketched below.
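For instance, a one-off smoke test might look like this (it only checks that the package imports and CUDA is visible):

# Import flash_attn and confirm CUDA is available inside the container.
docker run --rm --gpus all flash-attn:local \
    python -c "import torch, flash_attn; print('flash-attn OK, CUDA:', torch.cuda.is_available())"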
Push to registry:
docker tag flash-attn:local myrepo/flash-attn:1.0.0
docker push myrepo/flash-attn:1.0.0
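If you push to GCP Artifact Registry instead (matching the image URI used in the Vertex AI snippet below), Docker needs to authenticate against the registry host first; the region and repository names here are placeholders:

# Let Docker use gcloud credentials for the Artifact Registry host.
gcloud auth configure-docker us-central1-docker.pkg.dev

docker tag flash-attn:local us-central1-docker.pkg.dev/PROJECT_ID/repositories/flash-attention:latest
docker push us-central1-docker.pkg.dev/PROJECT_ID/repositories/flash-attention:latest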
Deploy to cloud:
Specifically, on GCP you can use Vertex AI to create a custom training job from your Docker image, which lets you run distributed training across multiple GPUs with minimal extra plumbing.
In later posts, we’ll explore how to set up a GCP training job using the flash-attn image. For now, the focus is on the Docker pipeline. As a teaser, here’s a snippet of how you might configure a GCP training job:
from google.cloud import aiplatform
from google.oauth2 import service_account
import os
# Vertex AI Configuration
SERVICE_KEY_PATH = os.getenv(
    "GOOGLE_APPLICATION_CREDENTIALS",
    "/path/to/your/service_account_key.json",
)
LOCATION = "your-gcp-region" # e.g., "us-central1"
ZONE = "your-gcp-zone" # e.g., "us-central1-a"
PROJECT_ID = "your-gcp-project-id" # e.g., "my-project-12345"
RESERVATION_TYPE = "ANY_RESERVATION"  # or "NO_RESERVATION" / "SPECIFIC_RESERVATION"
STAGING_BUCKET = "gs://your-gcp-training-bucket/flash-attn-example/staging"
SERVICE_ACCOUNT = f"vertex-ai@{PROJECT_ID}.iam.gserviceaccount.com"
TRAIN_IMAGE = f"your-location-docker.pkg.dev/{PROJECT_ID}/repositories/flash-attention:latest"
DISPLAY_NAME = "flash-attn-crypto-model-training"
# Hardware Configuration
NODES = 1
MACHINE_TYPE = "a3-megagpu-8g"
ACCELERATOR_TYPE = "NVIDIA_H100_MEGA_80GB"
ACCELERATOR_COUNT = 8
# Training Command
CMD = [
    "python3",
    "/gcs/your-gcp-training-bucket/flash-attn-example/scripts/flash_attn_train.py",
    "--config",
    "/gcs/your-gcp-training-bucket/flash-attn-example/config/flash_attn_crypto_model_config.yaml",
]
# Worker pool specification
worker_pool_specs = [
    {
        "replica_count": NODES,
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_COUNT,
            "reservation_affinity": {
                "reservation_affinity_type": RESERVATION_TYPE,
            },
        },
        "container_spec": {
            "image_uri": TRAIN_IMAGE,
            "command": CMD,
        },
    }
]

# Initialize Vertex AI
aiplatform.init(
    project=PROJECT_ID,
    location=LOCATION,
    credentials=service_account.Credentials.from_service_account_file(
        SERVICE_KEY_PATH
    ),
)

# Create and submit the training job
job = aiplatform.CustomJob(
    display_name=DISPLAY_NAME,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=STAGING_BUCKET,
)
job.submit(
    service_account=SERVICE_ACCOUNT,
)
# Print job details
print(f"Job ID: {job.resource_name}")
print(f"Job state: {job.state}")
This pipeline—from a reproducible local Docker build to a scalable multi-GPU cluster—ensures consistency and dramatically lowers the “it runs here” friction.
Best Practices
- Pin dependencies and base-image tags so builds stay repeatable.
- Optimize layer ordering (and a strict .dockerignore) for cache hits and smaller images.
- Manage GPU access explicitly via the NVIDIA Container Toolkit and --gpus.
- Scan images for known vulnerabilities (e.g., docker scan).

Common Pitfalls
- CUDA/driver mismatches: the CUDA libraries baked into the image must be compatible with the host driver. Locally, run nvidia-smi to check the driver version; a quick check is sketched below. For cloud, ensure your service provider supports the required CUDA version.
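A minimal host-side check, assuming the NVIDIA Container Toolkit is already installed (the CUDA image tag is illustrative):

# Driver version on the host...
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# ...and proof that a container with the CUDA version you need can actually see the GPUs.
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi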
Docker transforms ML workflows by locking down environments from local dev through production. It tackles the reproducibility crisis head-on and scales effortlessly, whether on your workstation or a GPU cluster. By following best practices—pinning dependencies, optimizing layers, and managing GPU access—you’ll save countless hours chasing elusive bugs.
With a solid containerization strategy, your ML projects will be as reproducible as they are performant. Happy containerizing!