Evolutionary Scale Modeling (ESM) is a suite of high‑capacity Transformer protein language models—from ESM‑2 for embeddings to ESMFold for end‑to‑end structure prediction—developed by the FAIR Protein Team. Containerizing ESM via Docker ensures reproducibility, easy environment management, and GPU enablement.
This guide covers:

- `esm-extract`: embeddings from FASTA files
- `esm-fold`: batch structure prediction

Below is a minimal Dockerfile that:

- starts from an official PyTorch image with CUDA support
- installs `fair-esm[esmfold]` via pip (includes ESM‑2, the ESMFold CLI, and dependencies)

```dockerfile
# Use official PyTorch image with CUDA support
FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

# Avoid interactive prompts during apt installs
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Install Evolutionary Scale Modeling package with structure support
# (quoted so the brackets are not shell-expanded)
RUN pip install --no-cache-dir "fair-esm[esmfold]"

# Set working directory
WORKDIR /app

# Default to a shell; `docker run esm:latest <command>` overrides it
CMD ["bash"]
```
Note: If you only need embeddings (no structure prediction), you can omit `[esmfold]` and install `fair-esm` alone.
From the folder containing the Dockerfile:

```shell
docker build -t esm:latest .
```

If you plan to use an external MSA server inside the container, ensure the container has internet access. For GPU support, run with `--gpus all`.
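Before running real workloads, it can help to smoke-test the image. A sketch (assumes the `esm:latest` tag built above, a local NVIDIA GPU, and the NVIDIA Container Toolkit installed on the host):

```shell
# Confirm the ESM CLIs are on PATH and PyTorch can see the GPU.
# --entrypoint bash makes the check independent of the image's default entrypoint.
docker run --rm --gpus all --entrypoint bash esm:latest \
    -c "esm-extract --help > /dev/null && python -c 'import torch; print(torch.cuda.is_available())'"
```

If this prints `True`, GPU passthrough is working; `False` usually means the run was missing `--gpus all` or the host toolkit is not set up.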
## `esm-extract`

Use the `esm-extract` CLI to compute per-token or per-sequence embeddings from any FASTA file.

```shell
# Host directory structure:
# ├── Dockerfile
# ├── sequences.fasta
# └── outputs/

# Run in Docker to extract embeddings
docker run --rm \
  -v $(pwd)/sequences.fasta:/data/sequences.fasta \
  -v $(pwd)/outputs:/data/outputs \
  esm:latest esm-extract \
  esm2_t33_650M_UR50D \
  /data/sequences.fasta \
  /data/outputs \
  --repr_layers 0 32 33 \
  --include mean per_tok
```

This creates one `.pt` embedding file per input sequence under `outputs/`.
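Each `.pt` file can be loaded with `torch.load` and holds a small dict keyed by the `--include` flags used above. A minimal sketch of pulling embeddings out of that dict (the helper names are ours, not part of the ESM API):

```python
# Each outputs/<id>.pt written by `esm-extract ... --include mean per_tok` is a dict:
#   "label":                the sequence ID from the FASTA header
#   "representations":      {layer: (seq_len, dim) tensor}   -- from --include per_tok
#   "mean_representations": {layer: (dim,) tensor}           -- from --include mean
# Load one with: data = torch.load("outputs/<id>.pt")

def mean_embedding(data, layer=33):
    """Sequence-level embedding: per-token vectors averaged over the sequence."""
    return data["mean_representations"][layer]

def per_token_embedding(data, layer=33):
    """One vector per residue, e.g. for residue-level downstream tasks."""
    return data["representations"][layer]
```

Layer 33 is the final layer of `esm2_t33_650M_UR50D`; pass any layer listed in `--repr_layers` instead.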
## `esm-fold`

The `esm-fold` CLI wraps ESMFold for batch structure prediction.

```shell
# Host directory:
# ├── sequence.fasta
# └── pdb_out/

# Run structure prediction (GPU)
docker run --rm --gpus all \
  -v $(pwd)/sequence.fasta:/data/seqs.fasta \
  -v $(pwd)/pdb_out:/data/pdb_out \
  esm:latest esm-fold \
  -i /data/seqs.fasta \
  -o /data/pdb_out \
  --num-recycles 3 \
  --chunk-size 128 \
  --cpu-offload
```
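ESMFold writes its per-residue confidence score (pLDDT, on a 0-100 scale) into the B-factor column of each predicted PDB, so you can triage results without extra tools. A minimal sketch (the function name is ours; the column positions follow the fixed-width PDB format):

```python
def mean_plddt(pdb_text):
    """Average pLDDT over all atoms of an ESMFold-predicted PDB.

    ESMFold stores per-residue pLDDT in the B-factor field,
    which occupies columns 61-66 of each ATOM record.
    """
    scores = [float(line[60:66])
              for line in pdb_text.splitlines()
              if line.startswith("ATOM")]
    return sum(scores) / len(scores)

# Usage sketch:
# text = open("pdb_out/seq1.pdb").read()
# print(f"mean pLDDT: {mean_plddt(text):.1f}")
```

As a rough rule of thumb, structures averaging above ~70 pLDDT are generally usable, while low-scoring regions should be treated with caution.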
Outputs:

- One `.pdb` file per input sequence in `pdb_out/`

Tips:

- Tune speed and memory with the `--num-recycles`, `--max-tokens-per-batch`, and `--chunk-size` flags.
- Pin a versioned image tag (e.g., `fair-esm:2.9.5`) for reproducibility.
- If you use `--use_msa_server`, mount a host folder at `/root/.cache/esm` to reuse alignments.
- Adjust `--max-tokens-per-batch` to avoid OOMs when batching shorter sequences.
- Feed `esm-extract` results into downstream ML pipelines by mounting the `outputs/` folder into your training container.

Enjoy scalable protein modeling with ESM!