TECHNICAL OVERVIEW

DRAGEN-GATK: HIGH-PERFORMANCE VARIANT CALLING

High-Performance Variant Calling with Dragen-GATK

Accurate variant calling is fundamental to modern genomics research and clinical sequencing. As datasets keep growing, pipelines need both speed and precision.
Dragen-GATK offers an optimized pipeline for germline variant calling that combines algorithms from Illumina's DRAGEN platform with GATK’s trusted software toolkit. Let's dive into how you can set up and run Dragen-GATK workflows efficiently!

What is Dragen-GATK?

Dragen-GATK is a collaboration between Illumina and the Broad Institute, combining:

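DRAGEN’s algorithms, originally developed for Illumina's hardware-accelerated platform (available on-premises or in cloud platforms) and reimplemented as open-source software that runs on standard CPUs,
with GATK 4’s best-practice methods for variant calling.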

Dragen-GATK highlights:

Improved germline variant quality, especially for indels and difficult regions.
Out-of-the-box compatibility with standard GATK workflows.

Why Choose Dragen-GATK?

🎯 Accuracy: Optimized algorithms for challenging regions of the genome.
🏥 Clinical-grade robustness: Used in clinical diagnostics pipelines.
☁️ Cloud-ready: Available on AWS, GCP, and other cloud platforms.
🛠 Best of both worlds: Combines DRAGEN and GATK technologies under one unified toolkit.

Installing Dragen-GATK

🐳 Docker Local Installation

You can run the Dragen-GATK tools from a lightweight container that adds GATK and Samtools to a DragMap base image.

Create a Dockerfile.dragmap:

FROM gambalab/dragmap:latest@sha256:d1d322d87744f154bc53cd400c35bddfeff4d5787c8f6347764caf27512e3fc0

# Install OS deps and Miniforge, then clean up
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        wget \
        bash \
        bzip2 \
        ca-certificates \
        coreutils \
        tar \
    && wget -q -P /tmp \
    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
    && bash /tmp/Miniforge3-Linux-x86_64.sh -b -p /opt/mamba \
    && rm -rf /tmp/Miniforge3-Linux-x86_64.sh /var/lib/apt/lists/* \
    && apt-get clean

# Add conda to path
ENV PATH="/opt/mamba/bin:$PATH"
ENV LD_LIBRARY_PATH="/opt/mamba/lib"

# Install Python and bioinformatics tools
RUN mamba install -y \
        -c conda-forge \
        -c bioconda \
        python=3.11 \
        samtools=1.21 \
        gatk4=4.6.1.0 \
    && mamba clean -afy

Build the image:

docker build -f Dockerfile.dragmap -t dragmap:1.3.0 .

Verify the installation by running:

docker run --rm -it dragmap:1.3.0 gatk --help
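
Beyond `gatk --help`, it can be useful to confirm that every tool used later in this walkthrough is on the container's PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, not part of GATK or DragMap.

```shell
# Hypothetical helper: print the names of any commands not found on PATH.
# Prints an empty line when everything is available.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # Strip the leading space left by the concatenation above.
    printf '%s\n' "${missing# }"
}
```

Inside the container (for example after `docker run --rm -it dragmap:1.3.0 bash`), define the function and run `missing_tools gatk samtools dragen-os`. Empty output means all three commands were found.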

Running Dragen-GATK on Sample Data

📥 Download Test Data

Download FASTQ files

Follow the example from the previous section, the FastP tutorial [here](TODO:Add href), to get trimmed paired-end FASTQ files (SRR3317165_1.trim.fastq.gz and SRR3317165_2.trim.fastq.gz) from the Bacillus subtilis ALBA01 strain.

Download Reference Genome

To continue with the previous example, we will use the Bacillus subtilis reference genome (assembly ASM904v1) from NCBI RefSeq.

# Make the data directory if it doesn't exist
mkdir -p data
# Choosing the ASM904v1 B. subtilis assembly for variant calling.
wget -nc -P ./data ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/045/GCF_000009045.1_ASM904v1/GCF_000009045.1_ASM904v1_genomic.fna.gz
# Unzip the reference genome
gunzip ./data/GCF_000009045.1_ASM904v1_genomic.fna.gz
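
Before indexing, a quick sanity check can confirm that the download decompressed into a FASTA file with at least one sequence record. This is a sketch only; `fasta_record_count` is a hypothetical helper introduced for illustration.

```shell
# Hypothetical helper: count sequence records (header lines starting with ">").
fasta_record_count() {
    # grep exits non-zero when the count is 0, so "|| true" keeps scripts
    # that run under "set -e" alive while still printing the count.
    grep -c '^>' "$1" || true
}
```

Running `fasta_record_count ./data/GCF_000009045.1_ASM904v1_genomic.fna` prints the number of records. A count of zero suggests a failed or truncated download.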

Create FASTA Dictionary (GATK):

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk CreateSequenceDictionary \
    -R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna"'
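
Every step in this walkthrough repeats the same `docker run` boilerplate, and the `-it` flags fail when no terminal is attached (for example in CI jobs). One possible way to handle both is a small wrapper. This is a sketch under assumptions: `in_container`, `IMAGE`, and `DRY_RUN` are hypothetical names chosen for illustration, not part of Docker or the image.

```shell
# Illustrative defaults; override IMAGE in the environment if needed.
IMAGE="${IMAGE:-dragmap:1.3.0}"

# Hypothetical wrapper: run one shell command string inside the container,
# mounting the current directory at /app as the steps in this guide do.
in_container() {
    # Request an interactive TTY only when stdin is a terminal.
    tty_flags=""
    if [ -t 0 ]; then tty_flags="-it"; fi

    # With DRY_RUN set, print the command instead of running it.
    if [ -n "${DRY_RUN:-}" ]; then
        printf 'docker run --rm %s-v %s:/app %s bash -c %s\n' \
            "${tty_flags:+$tty_flags }" "$(pwd)" "$IMAGE" "$1"
        return 0
    fi

    # tty_flags is intentionally unquoted so an empty value expands to nothing.
    docker run --rm $tty_flags -v "$(pwd):/app" "$IMAGE" bash -c "$1"
}
```

With this in place, the dictionary step could be written as `in_container 'gatk CreateSequenceDictionary -R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna"'`. The explicit `docker run` commands in the rest of this guide work the same way without the wrapper.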

Index Reference (Samtools):

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
samtools faidx "/app/data/GCF_000009045.1_ASM904v1_genomic.fna"'

Compose STR Table (GATK):

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
gatk ComposeSTRTableFile \
    -R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
    -O "/app/data/GCF_000009045.1_ASM904v1_genomic.fna.strtable"'
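
After these three preparation steps, the reference should have three companion files. The sketch below lists and checks them, assuming that CreateSequenceDictionary uses its default naming (the FASTA extension replaced by `.dict`), that `samtools faidx` appends `.fai`, and that the STR table path matches the `-O` value used above. The function names are hypothetical.

```shell
# Print the companion files the preparation steps are expected to produce.
expected_reference_files() {
    ref=$1
    # ${ref%.*} drops the final extension (".fna" here).
    printf '%s\n' "${ref%.*}.dict" "$ref.fai" "$ref.strtable"
}

# Print any expected file that is missing or empty.
missing_reference_files() {
    expected_reference_files "$1" | while IFS= read -r path; do
        [ -s "$path" ] || printf '%s\n' "$path"
    done
}
```

Calling `missing_reference_files ./data/GCF_000009045.1_ASM904v1_genomic.fna` prints nothing when all three files are present and non-empty.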

🚀 Variant Calling Workflow

Build Hash Table

Before alignment, DragMap requires a hash table built from the reference:

# Create a directory for the hash table
mkdir -p data/hash_table
# Build the hash table using DragMap
docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
dragen-os \
    --build-hash-table true \
    --ht-reference "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
    --output-directory "/app/data/hash_table/" \
    --ht-write-hash-bin 1 \
    --num-threads 16'
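
The commands in this guide hard-code 16 threads. On machines with fewer cores, a value derived from the system may be a better fit. A minimal sketch, assuming `getconf _NPROCESSORS_ONLN` is available (it is on typical Linux and macOS systems); `thread_count` is a hypothetical helper.

```shell
# Hypothetical helper: number of online CPUs, capped at an optional maximum
# (default cap: 16, the value used throughout this guide).
thread_count() {
    max=${1:-16}
    # Fall back to a single thread if the CPU count cannot be determined.
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$n" -gt "$max" ]; then n=$max; fi
    echo "$n"
}
```

The result can then replace the literal value, for example by storing `THREADS=$(thread_count)` on the host and substituting it for `16` in the `--num-threads` argument.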

Map Reads and Create BAM

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
dragen-os \
    -r "/app/data/hash_table/" \
    -1 "/app/data/SRR3317165_1.trim.fastq.gz" \
    -2 "/app/data/SRR3317165_2.trim.fastq.gz" \
    --num-threads 16 \
| samtools view \
    --threads 16 \
    -bh -o "/app/data/SRR3317165.bam"'

Prepare BAM for Variant Calling

Sort the BAM file by coordinate:

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
    samtools sort "/app/data/SRR3317165.bam" \
        -@ 16 \
        -o "/app/data/SRR3317165.sorted.bam"'
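
To confirm that sorting worked, the `@HD` header line of the BAM can be inspected for its `SO:` (sort order) tag. The sketch below parses SAM header text from standard input; `sam_sort_order` is a hypothetical helper name.

```shell
# Hypothetical helper: read SAM header text on stdin and print the value of
# the SO: tag from the @HD line (for example "coordinate" or "unsorted").
sam_sort_order() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SO:/) { print substr($i, 4); exit }
            }
        }
    '
}
```

Inside the container, `samtools view -H /app/data/SRR3317165.sorted.bam | sam_sort_order` is expected to print `coordinate` for a coordinate-sorted file.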

Deduplicate the BAM file:

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
    gatk --java-options "-XX:+UseG1GC -Xms4g -Xmx64g" MarkDuplicatesSpark \
        -I "/app/data/SRR3317165.sorted.bam" \
        -O "/app/data/SRR3317165.dedup.bam" \
        -M "/app/data/SRR3317165.dedup.bam.txt" \
        --conf "spark.executor.cores=16"'
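
The report written with `-M` can be checked for the duplication rate. The sketch below assumes the report uses the Picard-style metrics layout: a tab-separated header row beginning with `LIBRARY` that contains a `PERCENT_DUPLICATION` column, followed by one data row per library. `duplication_rate` is a hypothetical helper.

```shell
# Hypothetical helper: print PERCENT_DUPLICATION for the first library
# listed in a Picard-style duplication metrics file.
duplication_rate() {
    awk -F'\t' '
        # Once the header row has been seen, the next row holds the data.
        found { print $col; exit }
        $1 == "LIBRARY" {
            for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i
            if (col) found = 1
        }
    ' "$1"
}
```

For example, `duplication_rate ./data/SRR3317165.dedup.bam.txt` prints the fraction of reads marked as duplicates, which can flag library problems before variant calling.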

Index the deduplicated BAM file:

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
    samtools \
        index "/app/data/SRR3317165.dedup.bam" \
        -@ 16'

Calibrate DragStr Model

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
    gatk --java-options "-XX:+UseG1GC -Xms4g -Xmx64g" \
    CalibrateDragstrModel \
        -R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
        -I "/app/data/SRR3317165.dedup.bam" \
        -str "/app/data/GCF_000009045.1_ASM904v1_genomic.fna.strtable" \
        -O "/app/data/GCF_000009045.1_ASM904v1_genomic.model"'

Call Variants

docker run --rm -it \
-v "$(pwd):/app" \
dragmap:1.3.0 \
bash -c '
    gatk --java-options "-XX:+UseG1GC -Xms4g -Xmx64g" \
    HaplotypeCaller \
        -R "/app/data/GCF_000009045.1_ASM904v1_genomic.fna" \
        -I "/app/data/SRR3317165.dedup.bam" \
        -O "/app/data/SRR3317165.variants.vcf" \
        --native-pair-hmm-threads 16 \
        --dragen-mode true \
        --dragstr-params-path "/app/data/GCF_000009045.1_ASM904v1_genomic.model"'
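
Once the VCF exists, a quick tally gives a first impression of the call set. The sketch below counts records where the reference allele and every alternate allele are single bases as SNPs, and everything else (indels, multi-base substitutions, spanning deletions written as `*`) as "other". `vcf_summary` is a hypothetical helper and works on uncompressed VCF only.

```shell
# Hypothetical helper: count SNP and non-SNP records in an uncompressed VCF.
vcf_summary() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($5, alts, ",")       # ALT may list several alleles
            is_snp = (length($4) == 1)     # REF must be a single base
            for (i = 1; i <= n; i++) {
                if (length(alts[i]) != 1 || alts[i] == "*") is_snp = 0
            }
            if (is_snp) snps++; else other++
        }
        END { printf "snps=%d other=%d\n", snps, other }
    ' "$1"
}
```

Running `vcf_summary ./data/SRR3317165.variants.vcf` prints a single line such as `snps=<count> other=<count>`. An empty or header-only file yields zeros for both.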

📂 Output Files

File	Description	Step
`GCF_000009045.1_ASM904v1_genomic.fna`	Reference genome	Download
`*.dict`	Dictionary file	GATK
`*.fai`	FAI index file	Samtools
`*.strtable`	STR table	GATK
`./data/hash_table`	Directory with hash files	dragen-os
`SRR3317165.bam`	Aligned reads	dragen-os
`SRR3317165.sorted.bam`	Sorted BAM file	Samtools
`SRR3317165.dedup.bam`	Deduplicated BAM file	GATK
`SRR3317165.dedup.bam.bai`	Index for deduplicated BAM	Samtools
`SRR3317165.dedup.bam.txt`	MarkDuplicates report	GATK
`GCF_000009045.1_ASM904v1_genomic.model`	Calibrated model	GATK
`SRR3317165.variants.vcf`	Called variants (VCF)	GATK

Key Dragen-GATK Parameters You Should Know

Tool	Parameter	Purpose
CreateSequenceDictionary	`-R`	Reference genome FASTA
ComposeSTRTableFile	`-R`	Reference genome FASTA
ComposeSTRTableFile	`-O`	Output STR table
dragen-os	`--build-hash-table`	Build a hash table from the reference
dragen-os.build-hash-table	`--ht-reference`	Reference genome FASTA used to build the hash table
dragen-os.build-hash-table	`--output-directory`	Output directory for hash table
dragen-os.build-hash-table	`--ht-write-hash-bin`	Write hash bin files
dragen-os	`--num-threads`	Number of threads to use
dragen-os	`-r`	Hash table directory
dragen-os	`-1`	Input FASTQ file 1
dragen-os	`-2`	Input FASTQ file 2
gatk	`--java-options`	Java options for GATK
gatk.MarkDuplicatesSpark	`-I`	Input BAM file
gatk.MarkDuplicatesSpark	`-O`	Output BAM file
gatk.MarkDuplicatesSpark	`-M`	MarkDuplicates report
gatk.MarkDuplicatesSpark	`--conf`	Spark configuration
gatk.CalibrateDragstrModel	`-R`	Reference genome FASTA
gatk.CalibrateDragstrModel	`-I`	Input deduplicated BAM file
gatk.CalibrateDragstrModel	`-str`	STR table file
gatk.CalibrateDragstrModel	`-O`	Output model file
gatk.HaplotypeCaller	`-R`	Reference genome FASTA
gatk.HaplotypeCaller	`-I`	Input deduplicated BAM file
gatk.HaplotypeCaller	`-O`	Output VCF file
gatk.HaplotypeCaller	`--native-pair-hmm-threads`	Number of threads for pair HMM
gatk.HaplotypeCaller	`--dragen-mode`	Enable DRAGEN mode
gatk.HaplotypeCaller	`--dragstr-params-path`	Path to calibrated model file

Dragen-GATK in Your Genomics Workflow

A typical NGS analysis workflow with Dragen-GATK would look like:

QC: Preprocess reads with FastP or similar tools.
Alignment: Map reads to the reference using DragMap (dragen-os).
Variant Calling: Call variants using GATK HaplotypeCaller in DRAGEN mode.
Post-processing: Perform variant filtration and annotation (e.g., with SnpEff or VEP).
Interpretation: Analyze clinically or biologically relevant variants.
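
Each stage in this workflow produces a file that the next stage consumes, so a driver script can skip stages whose output already exists when a run is resumed. A minimal sketch of that idea follows; `run_step` is a hypothetical helper, not part of any tool used here.

```shell
# Hypothetical helper: run a command only if its output file is missing
# or empty. Usage: run_step <output-file> <command> [args...]
run_step() {
    output=$1
    shift
    if [ -s "$output" ]; then
        echo "skip: $output already exists"
        return 0
    fi
    echo "run: $*"
    "$@"
}
```

For instance, `run_step data/SRR3317165.sorted.bam docker run --rm -v "$(pwd):/app" dragmap:1.3.0 samtools sort /app/data/SRR3317165.bam -@ 16 -o /app/data/SRR3317165.sorted.bam` would sort only when the sorted BAM is absent. Note that a partial file left behind by a failed step would be treated as complete, so removing outputs of failed steps is advisable.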

🎯 Conclusion

Dragen-GATK combines DRAGEN's mapping and genotyping algorithms with GATK's established tooling.
If you're building high-throughput sequencing pipelines for research, diagnostics, or clinical genomics, Dragen-GATK is worth evaluating for your variant discovery workflow, particularly for indels and difficult regions.

High-quality variant calling is closer (and faster) than ever!
