ESM2: Setting Up Evolutionary Scale Modeling with Docker
SYSTEM / DOCKER / PROTEIN FOLDING / BIOINFORMATICS

Evolutionary Scale Modeling (ESM) is a suite of high‑capacity Transformer protein language models—from ESM‑2 for embeddings to ESMFold for end‑to‑end structure prediction—developed by the FAIR Protein Team. Containerizing ESM via Docker ensures reproducibility, easy environment management, and GPU enablement.

This guide covers:

  1. Prerequisites
  2. Writing the Dockerfile
  3. Building the Docker image
  4. Extracting embeddings with esm-extract
  5. Predicting structures with esm-fold
  6. Tips & Best Practices
  7. Further Resources

1. Prerequisites

  • Docker (v20.10+)
  • (Optional) An NVIDIA GPU and the NVIDIA Container Toolkit (formerly nvidia-docker2) for GPU acceleration
  • One or more input FASTA files
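
To confirm GPU passthrough before building anything, you can run a quick sanity check against the same base image used below (optional):

# Should print "True" when the NVIDIA Container Toolkit is set up correctly
docker run --rm --gpus all \
  pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
  python -c "import torch; print(torch.cuda.is_available())"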

2. Writing the Dockerfile

Below is a minimal Dockerfile that:

  • Starts from a GPU‑enabled PyTorch image (for CUDA support)
  • Installs fair-esm[esmfold] via pip (includes ESM‑2, ESMFold CLI, and dependencies)

# Use official PyTorch image with CUDA support
FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

# Avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Install Evolutionary Scale Modeling package with structure support.
# Quote the extras so the shell does not treat the brackets as a glob.
# Note: per the ESM README, ESMFold additionally depends on OpenFold and
# dllogger, installed from their Git repositories; building OpenFold's CUDA
# kernels needs a -devel base image rather than -runtime, so check the ESM
# README for the full ESMFold install recipe.
RUN pip install --no-cache-dir "fair-esm[esmfold]"

# Set working directory
WORKDIR /app

# Default command (overridden by whatever you pass to docker run)
CMD ["bash"]

Note: If you only need embeddings (no structure prediction), you can omit [esmfold] and install fair-esm alone.
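
For an embeddings-only image, the install line reduces to:

# Embeddings only: skips ESMFold and its OpenFold dependency chain
RUN pip install --no-cache-dir fair-esm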


3. Building the Docker Image

From the folder containing Dockerfile:

docker build -t esm:latest .

The first run of a given model downloads its checkpoint from the internet (via the PyTorch Hub cache), so ensure the container has network access, or pre-download the weights. For GPU support, run the container with --gpus all.
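
Checkpoints are cached under /root/.cache/torch inside the container; mounting a host folder at that path keeps them across runs. The sketch below warms the cache with the smallest ESM-2 model (esm2_t6_8M_UR50D); add the same -v flag to any of the docker run commands in the next sections:

# Warm the checkpoint cache once, then reuse it in later runs
mkdir -p torch_cache
docker run --rm \
  -v $(pwd)/torch_cache:/root/.cache/torch \
  esm:latest python -c "import esm; esm.pretrained.esm2_t6_8M_UR50D()"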


4. Extracting Embeddings with esm-extract

Use the esm-extract CLI to compute per-token or per-sequence embeddings from any FASTA file.

# Host directory structure:
# ├── Dockerfile
# ├── sequences.fasta
# └── outputs/

# Run in Docker to extract embeddings
docker run --rm \
  -v $(pwd)/sequences.fasta:/data/sequences.fasta \
  -v $(pwd)/outputs:/data/outputs \
  esm:latest esm-extract \
    esm2_t33_650M_UR50D \
    /data/sequences.fasta \
    /data/outputs \
    --repr_layers 0 32 33 \
    --include mean per_tok

This creates one .pt embedding file per input sequence under outputs/.
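
Each .pt file is a PyTorch dictionary; with the flags above it should contain label, representations, and mean_representations entries keyed by layer (exact keys can vary across fair-esm versions). A quick way to inspect one, reusing the same image (the filename is a placeholder):

# Inspect one embedding file from inside the container
docker run --rm \
  -v $(pwd)/outputs:/data/outputs \
  esm:latest python -c "
import torch
d = torch.load('/data/outputs/my_protein.pt')  # placeholder filename
print(d.keys())
print(d['mean_representations'][33].shape)     # torch.Size([1280]) for the 650M model
"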


5. Predicting Structures with esm-fold

The esm-fold CLI wraps ESMFold for batch structure predictions.

# Host directory:
# ├── sequence.fasta
# └── pdb_out/

# Run structure prediction (GPU)
docker run --rm --gpus all \
  -v $(pwd)/sequence.fasta:/data/seqs.fasta \
  -v $(pwd)/pdb_out:/data/pdb_out \
  esm:latest esm-fold \
    -i /data/seqs.fasta \
    -o /data/pdb_out \
    --num-recycles 3 \
    --chunk-size 128 \
    --cpu-offload

Outputs:

  • One .pdb file per input sequence in pdb_out/

Throughput and memory use can be tuned via the --num-recycles, --max-tokens-per-batch, and --chunk-size flags.
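
ESMFold stores per-residue confidence (pLDDT, on a 0-100 scale) in the B-factor column of each PDB file, so a rough mean confidence can be read out with standard tools (a sketch; columns 61-66 hold the B-factor in the fixed-width PDB format, and the filename is a placeholder):

# Mean pLDDT for one prediction, averaged over the B-factor column
awk '/^ATOM/ { s += substr($0, 61, 6); n++ } END { if (n) printf "mean pLDDT: %.1f\n", s / n }' \
  pdb_out/my_protein.pdb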

6. Tips & Best Practices

  • Pin versions: pin the package (e.g., pip install fair-esm==2.0.0) and tag the image to match (e.g., esm:2.0.0) for reproducibility.
  • Cache weights: mount a host folder at /root/.cache/torch so model checkpoints are downloaded once and reused across runs (see section 3).
  • Resource limits: lower --max-tokens-per-batch to avoid out-of-memory errors on long sequences; shorter sequences are batched together automatically for throughput.
  • CI integration: add a lightweight smoke test (e.g., embed a 50-AA peptide with a small model) to your CI pipeline to catch regressions, as sketched below.
  • Hybrid workflows: combine esm-extract with downstream ML pipelines by mounting the outputs/ folder into your training container.
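
A minimal smoke test along those lines, using the smallest ESM-2 checkpoint (esm2_t6_8M_UR50D, six layers) so it finishes in seconds; the peptide sequence and paths are placeholders:

#!/usr/bin/env bash
set -euo pipefail

# CI smoke test: build the image and embed a short peptide
docker build -t esm:ci .
mkdir -p ci_out
printf '>test_peptide\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n' > ci_test.fasta

docker run --rm \
  -v $(pwd)/ci_test.fasta:/data/ci_test.fasta \
  -v $(pwd)/ci_out:/data/ci_out \
  esm:ci esm-extract esm2_t6_8M_UR50D \
    /data/ci_test.fasta /data/ci_out \
    --repr_layers 6 --include mean

# Fail the pipeline if no embedding file was produced
ls ci_out/*.pt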

7. Further Resources

  • GitHub Repo: https://github.com/facebookresearch/esm
  • ESM Metagenomic Atlas: https://esmatlas.com
  • ESMFold Paper: Lin et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379:1123-1130.
  • Community Slack: Join via the link in the repo README.

Enjoy scalable protein modeling with ESM!