Accurate variant calling is fundamental to modern genomics research and clinical sequencing. With the rise of ever-larger datasets, speed and precision are no longer optional — they are mandatory.
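Dragen-GATK offers an accelerated, highly optimized pipeline for germline variant calling, combining algorithms from Illumina's DRAGEN platform with GATK’s trusted software toolkit. The open-source tools used below (DragMap and GATK) run in software inside a container, so this walkthrough needs no DRAGEN hardware. Let's dive into how you can set up and run Dragen-GATK workflows efficiently!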
Dragen-GATK is a collaboration between Illumina and the Broad Institute, combining:
Dragen-GATK highlights:
You can also use a lightweight container for Dragen-GATK. Example:
Create a Dockerfile.dragmap:
FROM gambalab/dragmap:latest@sha256:d1d322d87744f154bc53cd400c35bddfeff4d5787c8f6347764caf27512e3fc0
# Install OS deps and Miniforge, then clean up
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
wget \
bash \
bzip2 \
ca-certificates \
coreutils \
tar \
&& wget -q -P /tmp \
https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
&& bash /tmp/Miniforge3-Linux-x86_64.sh -b -p /opt/mamba \
&& rm -rf /tmp/Miniforge3-Linux-x86_64.sh /var/lib/apt/lists/* \
&& apt-get clean
# Add conda to path
ENV PATH="/opt/mamba/bin:$PATH"
ENV LD_LIBRARY_PATH="/opt/mamba/lib"
# Install Python and bioinformatics tools
RUN mamba install -y \
-c conda-forge \
-c bioconda \
python=3.11 \
samtools=1.21 \
gatk4=4.6.1.0 \
&& mamba clean -afy
Build the image:
docker build -f Dockerfile.dragmap -t dragmap:1.3.0 .
Verify the install by running:
docker run --rm -it dragmap:1.3.0 gatk --help
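Since the later steps also call `samtools` and `dragen-os`, it may be worth checking that all three commands are on the `PATH`, not just `gatk`. The following is a minimal sketch; `check_tools` is a hypothetical helper name, not something shipped with the image.

```shell
# Report every command from the argument list that is not on PATH.
# Returns 0 when all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Intended call inside the container (assumed usage):
#   check_tools gatk samtools dragen-os
```

One way to use it, assuming the function and the call are saved together as `check_tools.sh` (a hypothetical file name) in the working directory: `docker run --rm -v "$(pwd):/app" dragmap:1.3.0 bash /app/check_tools.sh`.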
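Download FASTQ files
Follow the example from the previous section of the FastP tutorial [here](TODO:Add href) to get trimmed paired-end FASTQ files (SRR3317165_1.trim.fastq.gz and SRR3317165_2.trim.fastq.gz) from the Bacillus subtilis ALBA01 strain. The later steps expect them in ./data.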
Download Reference Genome
# Make the data directory if it doesn't exist
mkdir -p data
# Choosing the ASM904v1 B. subtilis assembly for variant calling.
wget -nc -P ./data ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/045/GCF_000009045.1_ASM904v1/GCF_000009045.1_ASM904v1_genomic.fna.gz
# Unzip the reference genome
gunzip ./data/GCF_000009045.1_ASM904v1_genomic.fna.gz
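A quick sanity check after the download can catch a truncated or empty file before any index is built. This is a rough sketch that only looks at FASTA header lines; the function names are hypothetical, and it assumes the file has already been decompressed as above.

```shell
# Count sequence records (header lines starting with ">") in an uncompressed FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print the header lines so the sequence names can be inspected by eye.
list_fasta_headers() {
    grep '^>' "$1"
}

# Assumed usage on the host, from the directory that contains ./data:
#   count_fasta_records ./data/GCF_000009045.1_ASM904v1_genomic.fna
#   list_fasta_headers  ./data/GCF_000009045.1_ASM904v1_genomic.fna
```

A count of zero would suggest the download or decompression did not complete.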
Create FASTA Dictionary (GATK):
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk CreateSequenceDictionary \
-R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna"'
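Every step in this walkthrough repeats the same `docker run` preamble. As a convenience, a small wrapper function could hold it. This is only a sketch: `in_container` and the `IMAGE` variable are hypothetical names, and it drops `-it` on the assumption that none of the commands needs an interactive terminal.

```shell
# Hypothetical helper: run one shell command string inside the image,
# with the current directory mounted at /app as in the commands of this tutorial.
# IMAGE defaults to the tag built earlier and can be overridden.
in_container() {
    docker run --rm \
        -v "$(pwd):/app" \
        "${IMAGE:-dragmap:1.3.0}" \
        bash -c "$1"
}

# Assumed usage, equivalent to the samtools indexing step:
#   in_container 'samtools faidx "/app/data/GCF_000009045.1_ASM904v1_genomic.fna"'
```

The command string is passed as a single argument, so it keeps the same single-quoted form used in the explicit `docker run` calls.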
Index Reference (Samtools):
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
samtools faidx "/app/data/GCF_000009045.1_ASM904v1_genomic.fna"'
Compose STR Table (GATK):
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk ComposeSTRTableFile \
-R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
-O "/app/data/GCF_000009045.1_ASM904v1_genomic.fna.strtable"'
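After the three reference-preparation commands, three sidecar files should sit next to the FASTA. The sketch below checks for them. It assumes that CreateSequenceDictionary writes a `.dict` file whose name replaces the `.fna` extension, consistent with the `*.dict` entry in the file summary, and `check_reference_sidecars` is a hypothetical name.

```shell
# Return non-zero and name any expected sidecar file that is missing or empty.
# Assumed names: <ref without extension>.dict, <ref>.fai, <ref>.strtable
check_reference_sidecars() {
    ref=$1
    status=0
    for f in "${ref%.*}.dict" "$ref.fai" "$ref.strtable"; do
        if [ ! -s "$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}

# Assumed usage on the host:
#   check_reference_sidecars ./data/GCF_000009045.1_ASM904v1_genomic.fna
```

Running it before alignment gives an early, readable error instead of a failure deep inside a later GATK step.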
Build Hash Table
Before alignment, DragMap requires a hash table built from the reference:
# Create a directory for the hash table
mkdir -p data/hash_table
# Build the hash table using DragMap
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
dragen-os \
--build-hash-table true \
--ht-reference "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
--output-directory "/app/data/hash_table/" \
--ht-write-hash-bin 1 \
--num-threads 16'
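The commands in this tutorial hard-code 16 threads. On a smaller machine it may be preferable to cap that at the number of CPUs actually available. A minimal sketch, with `pick_threads` as a hypothetical helper name:

```shell
# Print a thread count: the requested value (default 16, as used in this tutorial),
# capped at the number of CPUs the system reports.
pick_threads() {
    requested=${1:-16}
    available=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Assumed usage on the host:
#   THREADS=$(pick_threads 16)
```

To pass the value into a container command, the `bash -c` string would need double quotes (or the value passed with `-e THREADS`), because the single-quoted strings used here do not expand host variables.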
Map Reads and Create BAM
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
dragen-os \
-r "/app/data/hash_table/" \
-1 "/app/data/SRR3317165_1.trim.fastq.gz" \
-2 "/app/data/SRR3317165_2.trim.fastq.gz" \
--num-threads 16 \
| samtools view \
--threads 16 \
-bh -o "/app/data/SRR3317165.bam"'
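Paired-end mapping expects both FASTQ files to contain the same number of reads. Before running the mapping step, a rough check like the one below could confirm that trimming left the two files in sync. The function names are hypothetical, and the sketch assumes standard four-line FASTQ records.

```shell
# Count reads in a gzip-compressed FASTQ file (four lines per record).
count_fastq_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# Print both counts and succeed only when the two mates have the same number of reads.
check_pair_counts() {
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "R1=$r1 R2=$r2"
    [ "$r1" -eq "$r2" ]
}

# Assumed usage on the host:
#   check_pair_counts ./data/SRR3317165_1.trim.fastq.gz ./data/SRR3317165_2.trim.fastq.gz
```

Equal counts do not prove the reads are in matching order, so this is a cheap first check and not a full validation.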
Prepare BAM for Variant Calling
Sort the BAM file by coordinate:
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
samtools sort "/app/data/SRR3317165.bam" \
-@ 16 \
-o "/app/data/SRR3317165.sorted.bam"'
Deduplicate the BAM file:
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk --java-options "-XX:+UseG1GC -Xms4g -Xmx64g" MarkDuplicatesSpark \
-I "/app/data/SRR3317165.sorted.bam" \
-O "/app/data/SRR3317165.dedup.bam" \
-M "/app/data/SRR3317165.dedup.bam.txt" \
--conf "spark.executor.cores=16"'
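The `-Xmx64g` setting assumes a machine with more than 64 GB of RAM. On smaller hosts the JVM heap could instead be derived from the memory that is actually present. The sketch below is Linux-only, uses an arbitrary 80% rule of thumb, and `java_heap_gb` is a hypothetical helper name.

```shell
# Print a heap size in whole GB equal to about 80% of total RAM (minimum 1).
# Reads MemTotal (in kB) from /proc/meminfo by default; another file can be
# passed as the first argument, which is mainly useful for testing.
java_heap_gb() {
    meminfo=${1:-/proc/meminfo}
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    heap_gb=$((total_kb * 80 / 100 / 1048576))
    [ "$heap_gb" -ge 1 ] || heap_gb=1
    echo "$heap_gb"
}

# Assumed usage on the host:
#   HEAP=$(java_heap_gb)
#   ... --java-options "-XX:+UseG1GC -Xmx${HEAP}g" ...
```

If the computed value is below 4, the `-Xms4g` option used in this tutorial would also need lowering, since the initial heap cannot exceed the maximum.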
Index the deduplicated BAM file:
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
samtools \
index "/app/data/SRR3317165.dedup.bam" \
-@ 16'
Calibrate DragStr Model
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk --java-options "-XX:+UseG1GC -Xms4g -Xmx64g" \
CalibrateDragstrModel \
-R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
-I "/app/data/SRR3317165.dedup.bam" \
-str "/app/data/GCF_000009045.1_ASM904v1_genomic.fna.strtable" \
-O "/app/data/GCF_000009045.1_ASM904v1_genomic.model"'
Call Variants
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk --java-options "-XX:+UseG1GC -Xms4g -Xmx64g" \
HaplotypeCaller \
-R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
-I "/app/data/SRR3317165.dedup.bam" \
-O "/app/data/SRR3317165.variants.vcf" \
--native-pair-hmm-threads 16 \
--dragen-mode true \
--dragstr-params-path "/app/data/GCF_000009045.1_ASM904v1_genomic.model"'
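Once HaplotypeCaller finishes, a quick look at the VCF helps confirm that it produced records at all. The following is a rough sketch for an uncompressed VCF. The function names are hypothetical, and the SNV count is approximate: it only counts records whose REF and ALT are both a single character, so multi-allelic sites are not counted.

```shell
# Count variant records (non-header lines) in an uncompressed VCF.
count_variants() {
    grep -vc '^#' "$1"
}

# Count records where REF and ALT are both one character long (simple SNVs).
count_simple_snvs() {
    awk -F'\t' '!/^#/ && length($4) == 1 && length($5) == 1 { n++ } END { print n + 0 }' "$1"
}

# Assumed usage on the host:
#   count_variants    ./data/SRR3317165.variants.vcf
#   count_simple_snvs ./data/SRR3317165.variants.vcf
```

Both functions run on the host because the VCF is plain text under ./data, so no container is needed for this check.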
| File | Description | Step |
|---|---|---|
| GCF_000009045.1_ASM904v1_genomic.fna | Reference genome | Download |
| *.dict | Dictionary file | GATK |
| *.fai | FAI index file | Samtools |
| *.strtable | STR table | GATK |
| ./data/hash_table | Directory with hash files | Dragen-os |
| SRR3317165.bam | Aligned reads | Dragen-os |
| SRR3317165.sorted.bam | Sorted BAM file | Samtools |
| SRR3317165.dedup.bam | Deduplicated BAM file | GATK |
| SRR3317165.dedup.bam.bai | Index for deduplicated BAM | Samtools |
| SRR3317165.dedup.bam.txt | MarkDuplicates report | GATK |
| GCF_000009045.1_ASM904v1_genomic.model | Calibrated model | GATK |
| SRR3317165.variants.vcf | Called variants (VCF) | GATK |
| Tool | Parameter | Purpose |
|---|---|---|
| gatk.CreateSequenceDictionary | -R | Reference genome FASTA |
| gatk.ComposeSTRTableFile | -R | Reference genome FASTA |
| gatk.ComposeSTRTableFile | -O | Output STR table |
| dragen-os | --build-hash-table | Build the hash table for the reference |
| dragen-os.build-hash-table | --ht-reference | Reference genome FASTA to build the hash table from |
| dragen-os.build-hash-table | --output-directory | Output directory for hash table |
| dragen-os.build-hash-table | --ht-write-hash-bin | Write hash bin files |
| dragen-os | --num-threads | Number of threads to use |
| dragen-os | -r | Hash table directory |
| dragen-os | -1 | Input FASTQ file 1 |
| dragen-os | -2 | Input FASTQ file 2 |
| gatk | --java-options | Java options for GATK |
| gatk.MarkDuplicatesSpark | -I | Input BAM file |
| gatk.MarkDuplicatesSpark | -O | Output BAM file |
| gatk.MarkDuplicatesSpark | -M | MarkDuplicates report |
| gatk.MarkDuplicatesSpark | --conf | Spark configuration |
| gatk.CalibrateDragstrModel | -R | Reference genome FASTA |
| gatk.CalibrateDragstrModel | -I | Input deduplicated BAM file |
| gatk.CalibrateDragstrModel | -str | STR table file |
| gatk.CalibrateDragstrModel | -O | Output model file |
| gatk.HaplotypeCaller | -R | Reference genome FASTA |
| gatk.HaplotypeCaller | -I | Input deduplicated BAM file |
| gatk.HaplotypeCaller | -O | Output VCF file |
| gatk.HaplotypeCaller | --native-pair-hmm-threads | Number of threads for pair HMM |
| gatk.HaplotypeCaller | --dragen-mode | Enable DRAGEN mode |
| gatk.HaplotypeCaller | --dragstr-params-path | Path to calibrated model file |
A typical NGS analysis workflow with Dragen-GATK would look like:
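Based on the steps in this tutorial, the workflow can be sketched as an ordered chain of stages that stops at the first failure. This is only an outline: `run_stage`, `run_pipeline`, and the `stage_*` function names are hypothetical, and each `stage_*` function would need to wrap the corresponding `docker run` command from the sections above.

```shell
# Run one named stage: print its name, run the command, report a failure.
run_stage() {
    name=$1
    shift
    echo "==> $name"
    "$@" || { echo "stage failed: $name" >&2; return 1; }
}

# Run the given stages in order by calling stage_<name> for each one,
# and stop as soon as one of them fails.
run_pipeline() {
    for stage in "$@"; do
        run_stage "$stage" "stage_$stage" || return 1
    done
}

# Assumed usage, with stage order taken from this tutorial:
#   run_pipeline prepare_reference build_hash_table map_reads sort_bam \
#                mark_duplicates index_bam calibrate_dragstr call_variants
```

Stopping on the first failure matters here because each stage consumes the files written by the one before it.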
Dragen-GATK aims to provide the best of both worlds: DRAGEN's speed-focused algorithms and GATK's well-established accuracy.
If you're building high-throughput sequencing pipelines, whether for research, diagnostics, or clinical genomics, integrating Dragen-GATK can improve both the speed and the accuracy of your variant discovery pipeline.
High-quality variant calling is closer (and faster) than ever!