In next-generation sequencing (NGS) workflows, clean data is critical. Low-quality reads, adapter sequences, and other artifacts can heavily impact downstream analyses like genome assembly, transcript quantification, or variant calling. FastP is a modern, ultra-efficient tool that performs both quality control and read cleaning, all in a single pass through your FASTQ files. Let's dive into how you can install and use FastP for your omics pipelines!
FastP is an all-in-one FASTQ preprocessing tool, written in C++ and designed for maximum speed and minimal memory usage. Whether you're trimming adapters, filtering poor-quality reads, or visualizing sequencing quality, FastP handles it all, and does it very quickly.
FastP highlights:
Install FastP using conda:
conda install -c conda-forge -c bioconda fastp=0.24.1
or, if you prefer mamba:
mamba install -c conda-forge -c bioconda fastp=0.24.1
Verify the installation:
fastp --version
This should return the version number, e.g., fastp 0.24.1.
Notes: If you are using a conda environment, make sure to activate it first. Also, if you are using a different version of FastP, adjust the version number accordingly.
This installs FastP v0.24.1 inside a lightweight container, perfect for local or cloud workflows. To build the Docker image:
Create a Dockerfile.fastp in your working directory.
FROM mambaorg/micromamba:2.0-debian11
# Install fastp into the image's base environment without prompting (-y)
RUN micromamba install -y -n base -c conda-forge -c bioconda fastp==0.24.1 \
    && micromamba clean -a -y
Build the Docker image:
docker build \
    -f Dockerfile.fastp \
    -t fastp:0.24.1 .
Verify the installation:
docker run --rm fastp:0.24.1 fastp --version
This should return the version number, e.g., fastp 0.24.1.
Notes: If you are using a different version of FastP, adjust the version number accordingly.
Build the image locally as shown above.
Tag it for GCP Artifact Registry:
docker tag fastp:0.24.1 us-central1-docker.pkg.dev/my-project-id/my-repo/fastp:0.24.1
Push it to the Artifact Registry:
docker push us-central1-docker.pkg.dev/my-project-id/my-repo/fastp:0.24.1
Ensure you have the gcloud CLI installed and configured for authentication, for example by running gcloud auth configure-docker us-central1-docker.pkg.dev.
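The tag and push commands above repeat the same long image path. As a minimal sketch, the path can be assembled from its parts so that it is written only once. The helper name image_uri and the variables REGION, PROJECT_ID, and REPO are illustrative, and their values are the placeholders used above, not real resources.

```shell
# Hypothetical helper: compose an Artifact Registry image path.
# Arguments: region, project ID, repository, image name, tag.
image_uri() {
    printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values copied from the commands above; replace with your own.
REGION="us-central1"
PROJECT_ID="my-project-id"
REPO="my-repo"

TARGET="$(image_uri "$REGION" "$PROJECT_ID" "$REPO" fastp 0.24.1)"
echo "$TARGET"
```

With TARGET set, the two steps become docker tag fastp:0.24.1 "$TARGET" followed by docker push "$TARGET".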
Let's grab a paired-end dataset for the Bacillus subtilis ALBA01 strain from the European Nucleotide Archive:
# Make the data directory if it doesn't exist
mkdir -p data
# Download FASTQ files for Bacillus subtilis ALBA01
wget -nc -P ./data ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR331/005/SRR3317165/SRR3317165_1.fastq.gz
wget -nc -P ./data ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR331/005/SRR3317165/SRR3317165_2.fastq.gz
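Because wget -nc skips any file that already exists, an interrupted download can leave a truncated archive that is never fetched again. The sketch below checks that each file is a complete gzip stream before preprocessing starts. The function name check_gz is an illustrative helper, not part of FastP or wget.

```shell
# Hypothetical helper: report whether each given file is a valid gzip stream.
# Returns non-zero if any file is missing or fails the integrity test.
check_gz() {
    rc=0
    for f in "$@"; do
        if [ -f "$f" ] && gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "FAILED  $f"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   check_gz data/SRR3317165_1.fastq.gz data/SRR3317165_2.fastq.gz
```

If a file is reported as FAILED, deleting it and rerunning the wget command should trigger a fresh download.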
Now let's run preprocessing:
docker run --rm -it \
    -v "$(pwd):/app" \
    --user "$(id -u):$(id -g)" \
    fastp:0.24.1 \
    bash -c '
        fastp \
            --in1 "/app/data/SRR3317165_1.fastq.gz" \
            --in2 "/app/data/SRR3317165_2.fastq.gz" \
            --out1 "/app/data/SRR3317165_1.trim.fastq.gz" \
            --out2 "/app/data/SRR3317165_2.trim.fastq.gz" \
            --unpaired1 "/app/data/SRR3317165_1.trim_up.fastq.gz" \
            --unpaired2 "/app/data/SRR3317165_2.trim_up.fastq.gz" \
            --qualified_quality_phred 20 \
            --detect_adapter_for_pe \
            --length_required 50 \
            --correction \
            --low_complexity_filter \
            --complexity_threshold 30 \
            --html "/app/data/fastp.html" \
            --json "/app/data/fastp.json" \
            --thread 16'
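The command above hard-codes 16 threads, which may be more than a laptop or a small cloud instance offers. The sketch below picks the smaller of the available core count and a ceiling. MAX_THREADS and THREADS are illustrative variable names, and the ceiling of 16 is an assumption taken only from the value used above.

```shell
# Assumption: 16 is used as the ceiling only because that is the value
# passed to --thread above; adjust MAX_THREADS to suit your setup.
MAX_THREADS=16

# Detect available cores; fall back to 1 if detection fails.
cores="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Use the smaller of the two values.
if [ "$cores" -lt "$MAX_THREADS" ]; then
    THREADS="$cores"
else
    THREADS="$MAX_THREADS"
fi
echo "Using $THREADS threads"
```

Because the fastp command above sits inside single quotes, the value has to be passed into the container, for example with docker run -e THREADS="$THREADS" and --thread "$THREADS" inside the quoted command.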
FastP will generate:
| File | Description |
|---|---|
| SRR3317165_1.trim.fastq.gz | Trimmed forward reads |
| SRR3317165_2.trim.fastq.gz | Trimmed reverse reads |
| SRR3317165_1.trim_up.fastq.gz | Unpaired forward reads |
| SRR3317165_2.trim_up.fastq.gz | Unpaired reverse reads |
| fastp.html | Interactive QC report |
| fastp.json | Machine-readable QC report |
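A quick sanity check on the outputs listed above: in a paired-end run, the trimmed forward and reverse files are expected to contain the same number of reads, since reads that lose their mate go to the unpaired files. The sketch below counts records, assuming the usual four lines per FASTQ record. The function names count_reads and same_read_count are illustrative helpers.

```shell
# Hypothetical helper: count records in a gzipped FASTQ file,
# assuming the standard four-lines-per-record layout.
count_reads() {
    lines="$(gzip -cd "$1" | wc -l)"
    echo $((lines / 4))
}

# Hypothetical helper: succeed only if both files hold the same number of reads.
same_read_count() {
    n1="$(count_reads "$1")"
    n2="$(count_reads "$2")"
    echo "$1: $n1 reads; $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Usage:
#   same_read_count data/SRR3317165_1.trim.fastq.gz data/SRR3317165_2.trim.fastq.gz
```

A mismatch usually points to a truncated file or an interrupted run rather than to a filtering decision.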
| Parameter | Purpose |
|---|---|
| --in1 / --in2 | Input FASTQ files |
| --out1 / --out2 | Output trimmed FASTQ files |
| --qualified_quality_phred | Base quality threshold (default 15) |
| --detect_adapter_for_pe | Auto-detect adapters for paired-end reads |
| --correction | Overlapping read correction |
| --length_required | Minimum length to keep read |
| --low_complexity_filter | Remove low complexity sequences |
| --html | Generate HTML QC report |
| --thread | Number of threads for multithreading |
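One way to confirm that a parameter took effect is to inspect the output directly. Since the run above sets --length_required 50, the trimmed files should contain no read shorter than 50 bases. The sketch below reports the shortest read in a gzipped FASTQ file, assuming four lines per record with the sequence on the second line. The function name min_read_length is an illustrative helper.

```shell
# Hypothetical helper: print the shortest read length in a gzipped FASTQ file,
# assuming four lines per record with the sequence on the second line.
# Prints 0 for an empty file.
min_read_length() {
    gzip -cd "$1" | awk 'NR % 4 == 2 {
        len = length($0)
        if (min == "" || len < min) min = len
    }
    END { print min + 0 }'
}

# Usage:
#   min_read_length data/SRR3317165_1.trim.fastq.gz
```

For the run above, the printed value is expected to be 50 or greater for both trimmed files.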
A standard workflow incorporating FastP would look like this:
FastP is an amazing tool for sequencing data preprocessing: ultra-fast, user-friendly, and packed with features. If you are building NGS workflows for genomics, transcriptomics, or metagenomics projects, FastP is the perfect starting point for producing high-quality, reliable datasets.
Clean data leads to better science!