By Gabriel Navarro
May 27, 2025
Predicting a protein's three-dimensional structure from its amino acid sequence has been a "grand challenge" since Christian Anfinsen showed in the late 1950s and early 1960s that denatured ribonuclease can spontaneously refold to its native, active conformation based solely on sequence-encoded information (Aklectures, MIT OpenCourseWare). This fundamental discovery established that all the information needed for proper protein folding is encoded within the amino acid sequence itself.
In the 1970s and 1980s, statistical and physics-based approaches, ranging from all-atom molecular dynamics to coarse-grained energy functions and knowledge-based potentials, demonstrated that force fields and simplified models could recapitulate many aspects of folding thermodynamics and kinetics (Wikipedia). However, the computational complexity of the protein folding problem remained formidable.
To benchmark progress objectively, the CASP (Critical Assessment of Structure Prediction) challenge was launched in 1994 as a blind, community-wide experiment held every two years, driving innovation in homology modeling, threading, and de-novo methods (Wikipedia). This competition became the gold standard for evaluating protein structure prediction methods.
In the late 1990s and 2000s, Rosetta, pioneered by David Baker's lab, harnessed fragment assembly with Monte Carlo sampling guided by physics-inspired scoring functions to win CASP targets and expand into docking, design, and even citizen-science via Foldit (PubMed, Biostatistics and Medical Informatics). Meanwhile, large-scale supercomputers like IBM's Blue Gene sought to tackle folding through brute-force molecular simulations, but these efforts underscored the need for data-driven shortcuts in conformational search (WIRED).
The turning point arrived in 2020 when DeepMind's AlphaFold2 achieved a median backbone RMSD of 0.96 Å in CASP14, roughly three times more accurate than the next-best method (2.8 Å), and effectively "solved" single-chain structure prediction for most targets (Nature). This breakthrough demonstrated the power of combining deep learning with structural biology insights.
Shortly afterward, in 2021, the Baker lab released RoseTTAFold, a three-track network delivering comparable accuracy on consumer GPUs in minutes (Baker Lab), and in 2022 Meta's ESMFold leveraged massive protein language models to extend high-throughput predictions into metagenomics (Meta AI). These developments democratized access to high-quality protein structure prediction.
While these discriminative networks excel at predicting structures from known sequences, generative design—creating new folds, binding sites, and assemblies—requires models that can sample from the Boltzmann ensemble of conformations. Responding to this need, the Baker group introduced RFdiffusion, which fine-tunes a RoseTTAFold backbone into a denoising diffusion model over coordinate space, enabling de-novo design of symmetric oligomers, enzyme active-site scaffolds, and small-molecule binders with drastically fewer experimental iterations (ScienceDirect, Baker Lab).
Building on this rich heritage, Boltz-1x extends Boltz-1, an open-source AlphaFold3-class model that generates biomolecular complex structures with a diffusion module, by applying physics-inspired potentials at inference time to steer samples toward physically plausible conformations. By pairing energy-like guidance with modern deep learning, Boltz-1x aims to deliver accurate, accessible predictions of multi-chain complexes.
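In the following sections, we will set up Boltz-1x in a Docker container, predict the structure of the GSK3A-FRAT1 complex, and validate the prediction against the experimental structure.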
Docker containerization ensures reproducible environments and simplified deployment across different systems. This section provides a comprehensive guide to setting up Boltz-1x using Docker, enabling you to get started quickly regardless of your local system configuration.
Before diving in, ensure your system meets the following requirements:
Note: While a GPU significantly enhances performance, Boltz-1x can also run on CPU-only systems, albeit with longer processing times.
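As a rough illustration of the GPU-versus-CPU choice, the sketch below picks a value for the --accelerator flag used later in this guide. It assumes that a working nvidia-smi on the PATH is a good enough signal for a usable GPU; the helper name pick_accelerator is hypothetical and not part of Boltz.

```python
import shutil
import subprocess


def pick_accelerator() -> str:
    """Return "gpu" if an NVIDIA GPU looks usable, otherwise "cpu"."""
    # nvidia-smi ships with the NVIDIA driver; without it we assume no usable GPU.
    if shutil.which("nvidia-smi") is None:
        return "cpu"
    try:
        # "nvidia-smi -L" lists detected GPUs, one per line.
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    has_gpu = result.returncode == 0 and "GPU" in result.stdout
    return "gpu" if has_gpu else "cpu"


if __name__ == "__main__":
    print(f"--accelerator {pick_accelerator()}")
```

Run inside the container, this also doubles as a quick check that the GPU was actually passed through.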
Begin by cloning the repository containing the necessary Docker configurations for Boltz-1x:
git clone https://github.com/gabenavarro/MLContainerLab.git
cd MLContainerLab
This repository contains pre-configured Dockerfiles optimized for various CUDA and Python versions, streamlining the setup process.
Navigate to the directory containing the Dockerfile and build the Docker image:
docker build -f ./assets/build/Dockerfile.boltz1x.cu126cp310 -t boltz1x:cu126-py310 .
Explanation of parameters:
-f ./assets/build/Dockerfile.boltz1x.cu126cp310: Specifies the Dockerfile tailored for CUDA 12.6 and Python 3.10
-t boltz1x:cu126-py310: Tags the image for easy reference and version management
Tip: Ensure your host's NVIDIA driver supports a CUDA version equal to or newer than the one specified in the Dockerfile (12.6 here) to avoid compatibility issues with the NVIDIA Container Toolkit.
Launch the Docker container with GPU support and necessary configurations:
docker run -dt \
--gpus all \
--shm-size=64g \
-v "$(pwd):/workspace" \
--name boltz1x \
--env NVIDIA_VISIBLE_DEVICES=all \
boltz1x:cu126-py310
Parameter breakdown:
--gpus all: Grants the container access to all available GPUs
--shm-size=64g: Allocates shared memory to prevent out-of-memory errors during computation
-v "$(pwd):/workspace": Mounts the current directory to /workspace inside the container for file access
--name boltz1x: Assigns a memorable name to the container
--env NVIDIA_VISIBLE_DEVICES=all: Ensures all GPUs are visible within the container
Note: Adjust the --shm-size parameter based on your system's available memory and the complexity of your prediction tasks.
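One way to adjust --shm-size to the host is sketched below. The rule of thumb used here (half of physical RAM, capped at the 64g from the command above) is an assumption for illustration, not a Boltz or Docker recommendation, and the helper names suggest_shm_size and docker_run_args are hypothetical.

```python
import os


def suggest_shm_size(total_bytes: int, fraction: float = 0.5, cap_gib: int = 64) -> str:
    """Return a --shm-size value such as "32g" from total RAM in bytes."""
    total_gib = total_bytes // (1024 ** 3)
    # Stay between 1 GiB and the cap used in the command above.
    size_gib = max(1, min(cap_gib, int(total_gib * fraction)))
    return f"{size_gib}g"


def docker_run_args(shm_size: str, workdir: str) -> list[str]:
    """Assemble the docker run invocation shown above as an argument list."""
    return [
        "docker", "run", "-dt",
        "--gpus", "all",
        f"--shm-size={shm_size}",
        "-v", f"{workdir}:/workspace",
        "--name", "boltz1x",
        "--env", "NVIDIA_VISIBLE_DEVICES=all",
        "boltz1x:cu126-py310",
    ]


if __name__ == "__main__":
    # Total physical memory; these sysconf names are available on Linux.
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(" ".join(docker_run_args(suggest_shm_size(total), os.getcwd())))
```

The script only prints the command; passing the list to subprocess.run would launch the container.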
For an integrated development experience, connect to the running container using Visual Studio Code:
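Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P) and select Dev Containers: Attach to Running Container... (listed as Remote-Containers: Attach to Running Container... in older VS Code releases)
Select the boltz1x container from the list
Alternative scriptable approach: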
# Programmatic container attachment
CONTAINER_NAME=boltz1x
FOLDER=/workspace
# Hex-encode the JSON config; quoting keeps the shell from mangling the braces and quotes
HEX_CONFIG=$(printf '{"containerName":"/%s"}' "$CONTAINER_NAME" | od -A n -t x1 | tr -d ' \n\t')
code --folder-uri "vscode-remote://attached-container+$HEX_CONFIG$FOLDER"
Note: Ensure you have the Dev Containers extension (formerly named Remote - Containers) installed in VS Code for seamless container integration.
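To make the encoding step in the shell snippet above less opaque, here is a minimal Python sketch of the same idea: the JSON config is hex-encoded byte by byte and appended to the attached-container URI scheme. The function name attached_container_uri is hypothetical.

```python
import json


def attached_container_uri(container_name: str, folder: str) -> str:
    """Build the vscode-remote URI passed to `code --folder-uri`."""
    # Same JSON payload as the shell version: the container name with a leading slash.
    config = json.dumps({"containerName": f"/{container_name}"}, separators=(",", ":"))
    # Hex-encode the UTF-8 bytes, which is what `od -t x1` does in the shell version.
    hex_config = config.encode("utf-8").hex()
    return f"vscode-remote://attached-container+{hex_config}{folder}"


if __name__ == "__main__":
    print(attached_container_uri("boltz1x", "/workspace"))
```

The printed URI can be handed to code --folder-uri exactly as in the shell version.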
Inside the container, familiarize yourself with the available command-line options:
boltz predict --help
This command displays comprehensive parameter options including output directories, checkpoint paths, device configurations, recycling steps, and diffusion sampling parameters—all crucial for optimizing prediction performance.
Now that we have Boltz-1x set up, let's explore its capabilities through a practical example. We'll focus on predicting the structure of a protein complex involving glycogen synthase kinase 3 alpha (GSK3A) and frequently rearranged in advanced T-cell lymphomas 1 (FRAT1), two proteins that play crucial roles in cellular signaling pathways.
Before diving into the computational work, it's important to understand the biological significance of our target proteins and their interaction.
Glycogen synthase kinase-3 alpha (GSK3A) is a serine/threonine kinase that serves multiple regulatory functions in cellular biology (Atlas of Genetics in Oncology):
GSK3A is constitutively active in resting cells and becomes inhibited upon stimulation by various signals, including insulin and growth factors, through phosphorylation at specific serine residues.
FRAT1 is a member of the GSK-3-binding protein family and functions as a positive regulator of the Wnt signaling pathway (NCBI, PMC). Its key functions include:
The interaction between GSK3A and FRAT1 is central to Wnt/β-catenin pathway modulation (GeneCards):
Understanding this interaction is crucial for drug design and therapeutic interventions targeting the Wnt signaling pathway.
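Let's demonstrate Boltz-1x's capabilities by predicting the structure of the GSK3A-FRAT1 complex, using the experimentally determined structure (PDB ID: 1GNG) as our reference.
PDB entry 1GNG is the crystal structure of phosphorylated GSK3β, the close paralog of GSK3A with a highly conserved kinase domain, bound to FRATtide, a short peptide derived from FRAT1. It provides valuable insights into the interaction mechanism:
This structure shows the FRAT1 peptide binding to the C-terminal lobe of the kinase, away from the active site, at a surface that overlaps the axin-binding site. FRAT1 therefore inhibits phosphorylation of β-catenin by competing with axin rather than by directly blocking substrate access.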
Boltz-1x uses YAML format to specify input sequences and molecular compositions. Here's the configuration file for reconstructing the 1GNG structure:
📘 Input YAML file (1GNG-boltz1.yaml):
version: 1
sequences:
  - protein:
      id: A
      sequence: MSGRPRTTSF... # GSK3A sequence (truncated for display)
  - protein:
      id: B
      sequence: MPCRREEE... # FRAT1 sequence (truncated for display)
This format allows Boltz-1x to understand the multi-chain nature of the complex and predict inter-chain interactions.
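For batches of complexes it can be convenient to generate this file rather than write it by hand. The sketch below renders the same protein-only layout shown above using plain string formatting; the function name boltz_input_yaml is hypothetical, and the sequences in the usage example are placeholders, not the real 1GNG chains.

```python
def boltz_input_yaml(chains: dict[str, str]) -> str:
    """Render protein chains as a Boltz input YAML document (version 1 layout)."""
    lines = ["version: 1", "sequences:"]
    for chain_id, sequence in chains.items():
        # Keep only letters and upper-case them so pasted FASTA text is tolerated.
        cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
        if not cleaned:
            raise ValueError(f"chain {chain_id!r} has an empty sequence")
        lines += [
            "  - protein:",
            f"      id: {chain_id}",
            f"      sequence: {cleaned}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical placeholder sequences for illustration only.
    print(boltz_input_yaml({"A": "MKTAYIAK", "B": "GSHMLEDP"}))
```

Writing the returned string to a .yaml file gives an input that follows the same structure as 1GNG-boltz1.yaml.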
Execute the prediction using optimized parameters:
boltz predict /workspace/1GNG-boltz1.yaml \
--recycling_steps 10 \
--diffusion_samples 25 \
--accelerator gpu \
--out_dir /workspace/datasets/predict \
--cache /workspace/boltz1x/cache \
--use_msa_server
Parameter explanation:
--recycling_steps 10: Number of recycling iterations, in which the model refines its own output for improved accuracy
--diffusion_samples 25: Number of independent structure samples drawn from the diffusion model
--accelerator gpu: Utilizes GPU acceleration for faster computation
--use_msa_server: Generates the multiple sequence alignments automatically through the MMseqs2 server instead of requiring precomputed MSAs
The prediction generates a complete structural model of the GSK3A-FRAT1 complex:
In green, we see the predicted structure of GSK3A, while FRAT1 is shown in teal. The model captures the key features of the interaction, including the binding interface and overall complex architecture, and illustrates how Boltz-1x handles multi-chain protein complexes.
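When the same prediction has to be repeated for many inputs, the boltz predict command used earlier can be assembled programmatically. This is a minimal sketch that uses only the flags already shown in this guide; the wrapper name boltz_predict_args and its defaults are hypothetical.

```python
import shlex


def boltz_predict_args(
    input_yaml: str,
    out_dir: str,
    cache_dir: str,
    recycling_steps: int = 10,
    diffusion_samples: int = 25,
    accelerator: str = "gpu",
    use_msa_server: bool = True,
) -> list[str]:
    """Assemble the `boltz predict` invocation as an argument list."""
    if recycling_steps < 0 or diffusion_samples < 1:
        raise ValueError("recycling_steps must be >= 0 and diffusion_samples >= 1")
    args = [
        "boltz", "predict", input_yaml,
        "--recycling_steps", str(recycling_steps),
        "--diffusion_samples", str(diffusion_samples),
        "--accelerator", accelerator,
        "--out_dir", out_dir,
        "--cache", cache_dir,
    ]
    if use_msa_server:
        args.append("--use_msa_server")
    return args


if __name__ == "__main__":
    args = boltz_predict_args(
        "/workspace/1GNG-boltz1.yaml",
        "/workspace/datasets/predict",
        "/workspace/boltz1x/cache",
    )
    # Print a copy-pasteable command; pass `args` to subprocess.run to execute it.
    print(shlex.join(args))
```

Keeping the arguments as a list avoids shell-quoting problems when paths contain spaces.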
To validate the accuracy of our prediction, we compare the Boltz-1x model against the experimentally determined structure using structural alignment techniques. The predicted structure is aligned with the experimental structure (PDB ID: 1GNG) to assess how closely they match.
This animated overlay shows the predicted structure (green) aligned with the experimental structure (magenta), demonstrating the high accuracy of the Boltz-1x prediction.
To objectively evaluate prediction quality, we calculate the Root Mean Square Deviation (RMSD) between predicted and experimental structures:
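from pymol import cmd
# Load both structures
cmd.load("predicted_1GNG.pdb", "predicted")
cmd.load("1GNG.pdb", "experimental")
# Perform sequence-guided structural alignment; cmd.align also runs
# outlier-rejection cycles, so the reported RMSD covers the retained atoms only
alignment_result = cmd.align("predicted", "experimental")
# alignment_result[0] is the RMSD after refinement,
# alignment_result[1] the number of atoms it was computed over
rmsd, n_atoms = alignment_result[0], alignment_result[1]
print(f"RMSD: {rmsd:.2f} Å over {n_atoms} aligned atoms")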
Result: RMSD = 0.71 Å
This exceptionally low RMSD value indicates high prediction accuracy. For context:
The 0.71 Å RMSD demonstrates that Boltz-1x successfully captured the essential features of the GSK3A-FRAT1 interaction, including the precise positioning of binding interfaces and overall complex architecture.
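For readers unfamiliar with the metric, RMSD over N matched atoms is the square root of the mean squared distance between corresponding positions, sqrt((1/N) Σ |a_i − b_i|²). The sketch below computes only that final quantity and assumes the two coordinate lists are already paired and superimposed, which is the part PyMOL handles during alignment; the toy coordinates are hypothetical.

```python
import math

Coord = tuple[float, float, float]


def rmsd(coords_a: list[Coord], coords_b: list[Coord]) -> float:
    """RMSD between two paired, already superimposed lists of atom positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared distances between matched atoms.
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Hypothetical toy coordinates in Å, not taken from 1GNG.
    predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    experimental = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
    print(f"RMSD: {rmsd(predicted, experimental):.2f} Å")
```

Because every atom in the toy example is displaced by exactly 1 Å, the script prints an RMSD of 1.00 Å.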
This successful prediction showcases several important capabilities of Boltz-1x:
Such predictions can inform:
This comprehensive guide has demonstrated the power and accessibility of Boltz-1x for next-generation protein structure prediction. Through our practical example of the GSK3A-FRAT1 complex, we've shown how this Boltzmann-inspired deep learning framework can achieve remarkable accuracy (0.71 Å RMSD) in predicting complex protein-protein interactions.
The success with the GSK3A-FRAT1 complex represents just the beginning of Boltz-1x's potential applications. Future work could explore:
Boltz-1x represents a significant advancement in computational structural biology, combining the theoretical rigor of statistical mechanics with the practical power of modern deep learning. As demonstrated through our GSK3A-FRAT1 example, this approach promises to accelerate both fundamental research and therapeutic development by providing accurate, accessible, and efficient protein structure prediction capabilities.
The integration of energy-based principles with graph neural networks and diffusion models positions Boltz-1x as a valuable tool for the broader scientific community, democratizing access to high-quality structural predictions and enabling new discoveries in protein science and drug design.