Analysis of Saunders, Kronenberg, Holt, Rowell, Eberle (2026), PacBio — bioRxiv preprint Generated on April 29, 2026
Long-read de novo assembly is the right tool for rare-disease whole-genome sequencing. Rare variants — the ones rare-disease patients carry by definition — are also the ones most likely to be wildly different from the reference genome. A sample-specific assembly catches what a reference-based pipeline misses, because the assembler doesn't have to fight the reference at all; it builds the patient's own genome up from raw reads. Yet in clinical practice, assembly remains the backup approach. Almost every diagnostic pipeline still funnels reads straight to GRCh38 and calls variants from that mapping.
This paper diagnoses why assembly is under-used and proposes a fix. The two complaints from real users are concrete: (1) when a variant is called from the assembly, it lives in contig coordinates, which makes it hard to view the read-level evidence that a clinician needs to sign off on (basecall quality, methylation, mosaic support); and (2) reconciling assembly-based calls with the lab's existing reference-based pipeline is awkward — you can end up with two views of the same locus that disagree, and no easy way to decide which is right.
Portello (named for the small Italian word for "hatch" or "porthole" — a window through which to look at the same thing from a new angle) is a small idea with a big payoff: instead of choosing between assembly-based calls and reference-based calls, transfer the read alignments. The reads are first aligned to the sample's own assembly contigs (so they benefit from the long-range haplotype structure the assembler discovered); the contigs are mapped to GRCh38; and portello composes those two mapping functions to produce read alignments that land on GRCh38 coordinates but were placed via the assembly. Standard tools — DeepVariant, structural-variant callers, IGV — consume the resulting BAM file without modification.
The headline number: when DeepVariant is run on portello-remapped HG002 reads instead of pbmm2-mapped reads (same reads, same model, only the mapping changed), it removes 47% of small-variant basecall errors. False negatives drop by 52%; false positives stay essentially flat. On NA12878 the total error reduction is 29%. None of this required retraining DeepVariant.
If you mostly work with short-read or non-genomic data, this section gives you the scaffolding for what follows. Skip if you already speak HiFi.
Sequencing. A modern PacBio HiFi instrument reads single DNA molecules of ~15–25 kb at >99.9% per-base accuracy. ONT (Oxford Nanopore) reads are longer (often >100 kb) but historically less accurate, though that gap has been closing. "Long-read sequencing" in this paper means HiFi unless otherwise specified.
Reads vs. assembly vs. reference. A read is one sequenced molecule. A reference (GRCh38) is the public consensus human genome — one assembly built from many people, used as a coordinate system. De novo assembly takes a sample's reads and stitches them into long contigs (contiguous sequences) without using the reference at all. With HiFi data and a modern assembler (hifiasm, Verkko), a human assembly is typically partially phased or dual-assembly: you get two contig sets representing both haplotypes of the diploid genome, but the contigs are not perfectly switched to the right haplotype throughout (there are switch errors).
Conventional read mapping. The standard variant-calling pipeline aligns each read to the reference (pbmm2 for PacBio HiFi, minimap2 more generally), then runs a caller like DeepVariant on those alignments. This is fast, scales easily, and produces variants in standard reference coordinates. The downside: when the patient's true sequence is wildly different from GRCh38 (large insertions, structural variants, divergent haplotypes), the mapper struggles — reads are split, soft-clipped, or misplaced, and the caller sees garbage.
Variant types. Small variants are SNVs and short indels (typically ≤ 50 bp). Structural variants (SVs) are larger insertions, deletions, inversions, and translocations. Copy-number variants (CNVs) are gains or losses of larger chromosomal regions. Rare-disease patients carry both classes; small variants are easier to call but SVs and CNVs are often the actual diagnosis.
Phasing. A diploid human carries two copies of (most of) each chromosome, one from each parent. Phasing assigns each variant to its parental haplotype. The bam tags HP (haplotype 1 or 2) and PS (phase set ID, marking the contiguous region within which the phasing is internally consistent) are the standard way to record this on individual reads.
Why this paper exists. Assembly captures the patient's genome better than reference-based mapping in hard regions, but assembly-based variant calls are awkward to integrate with the rest of the standard pipeline. Portello bridges the two worlds.
Predicting which variants caused a patient's disease is one of the central tasks of clinical genomics, and for rare disease it is unusually hard. The variants in question, by definition, are not in any common-variant database. They are also more likely than common variants to live in regions where the patient's sequence diverges substantially from GRCh38 — segmental duplications, repeat expansions, structural rearrangements. These are exactly the regions where conventional reference-based mapping has the most trouble: a read drawn from a region that doesn't really exist in GRCh38 has nowhere good to land.
The field has worked through several paradigms over the last two decades. Short-read sequencing (Illumina) plus reference mapping (BWA) plus a probabilistic caller (GATK) became the workhorse in the 2010s, and it solved most of the easy genome. The hard regions remained hard. Long-read sequencing — first PacBio CLR and ONT R9, then HiFi and ONT R10 — arrived with two answers: longer reads make repeat-spanning unambiguous, and high-accuracy long reads make whole-genome assembly feasible from a single library. Tools like hifiasm (Cheng et al. 2021) and Verkko (Rautiainen et al. 2023) routinely produce diploid human assemblies in days. Recent rare-disease studies have shown that a meaningful fraction of candidate causal variants — both de novo and recessive — are only recoverable from assembly-based calling.
So why isn't every rare-disease lab running assembly? Because the operational story is messy. Assembly produces contigs in their own coordinate system. Variants called against contigs need to be lifted to GRCh38 to be talked about, reviewed, or compared with prior data. Read-level evidence — the kind a clinician squints at in IGV before signing off — is hidden behind an extra coordinate translation. Mosaic variants (present in only a fraction of cells, important for cancer and developmental disorders) are unlikely to make it into a haploid assembly consensus at all. And when an assembly-based call disagrees with the lab's existing reference-based pipeline, there is no obvious adjudicator.
Portello's framing is that you don't have to pick. The reference-based pipeline gets to keep its variant callers, its coordinates, and its review tools, but the read alignments feeding the pipeline now ride on top of the patient's assembly. The two halves of the alignment problem — "where is this read on the patient's genome?" and "where is the patient's genome on GRCh38?" — are decoupled and solved separately, then composed back into one BAM in reference coordinates. The mapper handling each sub-problem only has to deal with one source of difficulty: sequencing error for read-to-contig, biological variation for contig-to-reference. Each sub-problem is what its mapper was actually built for.
The paper presents four results, two of which carry quantitative head-to-heads (small-variant accuracy and phasing) and two of which are qualitative demonstrations on illustrative loci (read representation and CNV interpretation). I treat them in the order the paper does.
Before the result, a quick orientation: a VNTR is a variable number tandem repeat — a short sequence motif that is repeated end-to-end, with the number of repeats varying between people. VNTRs are notoriously hard for mappers because the read can plausibly align to many positions within the repeat block at very similar scores; the mapper has to pick one, and small differences in scoring (or in the read's own sequencing errors) can flip the choice. The result is a "messy pileup": within a few hundred base pairs you'll see indels at slightly different reference positions, soft-clips that don't agree with each other, and split alignments fighting over the same reads.
Portello's contribution here is structural rather than quantitative. The authors don't put a number on read-representation quality in this section; they show side-by-side IGV-style views of a VNTR locus mapped two ways. With pbmm2 (Figure 1B in the paper), each read is doing its own thing. With portello (Figure 1C), the reads agree with each other. The mechanism is the same one that drives the variant-calling result later: portello aligns each read against the patient's own contigs first, in which the VNTR has a single, consistent representation; only then does the contig-to-reference alignment have to grapple with how that representation differs from GRCh38, and it does so once per contig instead of once per read.
There is also a more subtle gain: the patient's contigs act as a sample-specific decoy. In conventional mapping, reads from a region that's been duplicated in the sample but not in the reference end up scattered across the closest reference homologs, creating phantom variants. With portello, those reads first align to the correct sample-specific contig — their actual home — and the contig-to-reference step decides where to deposit the contig. Reads that genuinely don't have a good home in the sample assembly are kept in a separate "unassembled" output.
Before the result, a quick orientation: small variants are SNVs and short indels (typically ≤ 50 bp), and they are scored against a truth set — a hand-curated, often assembly-confirmed list of variants for a benchmark sample. The two samples here are HG002 (an Ashkenazi Jewish reference sample with the GIAB draft T2T benchmark) and NA12878 (a CEPH reference sample with the Platinum Pedigree v1.2 truth set). The metric is F1 of variant calls against the benchmark, decomposed into false negatives (variants the truth set has but the caller missed) and false positives (calls the caller made that aren't in the truth set).
The experiment is designed to isolate one variable. The same reads are used in both arms. The same caller (DeepVariant v1.9 with the standard HiFi model) is used in both arms. The DeepVariant model is not retrained on portello output. The only thing that changes between the two arms is the read mapper: pbmm2 v1.17 (conventional) vs portello (assembly-based). Anything that improves between the arms therefore comes purely from better alignment, not from a smarter caller or a model that has seen more of the input distribution.
The result is large. On HG002, total basecall errors drop by 47% (from 375,634 to 199,368 across both BASEPAIR FN and FP) — almost entirely driven by a 52% reduction in false negatives. False positives change by a small amount (-14%). On NA12878, the picture is similar in shape: a 55% drop in false negatives, little change in false positives, and a 29% drop in total errors. The absolute F1 on HG002 climbs from 0.9901 to 0.9948; on NA12878 from 0.9916 to 0.9940. The headline number sounds modest in F1 space (under one percentage point) because both methods are already near the ceiling, but the error reduction lens is the right one for clinical use: every one of those false negatives is a missing variant call that a clinician would have wanted to see.
The asymmetry between samples is notable. HG002 sees a bigger total-error reduction than NA12878, mostly because NA12878 starts with a higher baseline of false positives (likely tied to using an externally produced Verkko assembly from the Platinum Pedigree project rather than a fresh hifiasm assembly built from these reads). Even so, the shape — recall up sharply, precision essentially flat — is consistent and is the expected signature of better placement: reads that previously missed their region (false negatives) get there, while reads that were going to call a wrong variant don't suddenly become more confident.
The two key knobs that decide how big the portello win is on a given sample are (1) the fraction of reads from hard regions (VNTRs, segmental duplications, divergent haplotypes) and (2) the quality of the assembly the reads were aligned to first. Click through to feel how those interact. The numbers below are illustrative, anchored on the paper's HG002 vs NA12878 contrast.
To reason quantitatively about "47% of basecall errors removed", here is the reduction calculation as the paper applies it — treating a basecall error as the union of false negatives and false positives, BASEPAIR-metric, summed across the sample.
from typing import TypedDict
class CountsBP(TypedDict):
fn: int # false negatives in the BASEPAIR comparison
fp: int # false positives in the BASEPAIR comparison
def basecall_error_reduction(
pbmm2: CountsBP, # FN/FP from conventional read mapping
portello: CountsBP, # FN/FP from assembly-based remapping
) -> dict[str, float]:
"""Fraction of small-variant basecall errors removed by switching mapper.
The paper defines `basecall errors = FN + FP` under the BASEPAIR metric,
and reports the fraction *removed* by the new mapper. The same caller
(DeepVariant v1.9, standard HiFi model) is used in both arms, so any
reduction is attributable to mapping, not modeling.
Returns the total reduction plus the FN- and FP-specific reductions
so you can see whether the gain is recall-driven (paper: yes) or
precision-driven.
"""
# Step 1: total errors per arm
err_pbmm2 = pbmm2["fn"] + pbmm2["fp"]
err_portello = portello["fn"] + portello["fp"]
# Step 2: relative reduction (positive = portello better)
total_red = 1.0 - err_portello / err_pbmm2
# Step 3: split by error type so you see *where* the win came from
fn_red = 1.0 - portello["fn"] / pbmm2["fn"]
fp_red = 1.0 - portello["fp"] / pbmm2["fp"]
return {"total_reduction": total_red, "fn_reduction": fn_red, "fp_reduction": fp_red}
# HG002 numbers from Table S1, BASEPAIR row
hg002 = basecall_error_reduction(
pbmm2 ={"fn": 326_253, "fp": 49_381},
portello={"fn": 156_775, "fp": 42_593},
)
# {'total_reduction': 0.469..., 'fn_reduction': 0.519..., 'fp_reduction': 0.137...}
Before the result, a quick orientation: phasing assigns each variant to one of two parental haplotypes. Haplotagging puts that information on the individual reads (so you can color reads by haplotype in IGV, or feed haplotype-aware callers). The standard tags are HP (1 or 2, the haplotype) and PS (phase set, the contiguous region within which the assignment is internally consistent). Phasing quality is summarized by block NG50 (a length statistic: the largest L such that 50% of the genome is contained in phase blocks of length ≥ L), switch errors (pairs of adjacent variants assigned to the wrong relative haplotype), and flips (single-variant inversions, less serious than a switch).
What's nice about portello here is that phasing comes essentially for free. The reads have already been aligned to specific contigs from the assembly, and the assembly already encodes haplotype information — either implicitly (in dual-assembly mode, where each contig is probably one haplotype but with switch errors), or explicitly (in fully-phased mode, when the assembly was scaffolded with parental sequencing or optical mapping).
In dual-assembly mode (the more common setting for unrelated patients), portello uses a small read-backed scheme: walk along the reference, and at each pair of adjacent heterozygous variants where the two contigs disagree, check that at least one read supports both contig alleles in the same way. If yes, extend the phase set across that pair. If no, end the phase set and start a new one. This is essentially a stripped-down whatshap-style phasing, but operating on the contig-derived heterozygous variant set rather than a separately-called variant set.
The numbers reported are intentionally specific to a single sample (HG002), benchmarked with the GIAB T2T draft small-variant truth set: a phase block NG50 of 334,357 bp with 274 switch errors and 67 flips. The authors are explicit that direct evaluation of read haplotag accuracy is fiddly, so they instead emit a phased VCF of heterozygous variants and benchmark that with whatshap compare. Those numbers are competitive with dedicated read-phasing tools, particularly given that portello obtained them as a side effect of the alignment process rather than a separate phasing step.
The authors are also frank that "with further development, there is significant potential to improve this capability in coordination with a portello-integrated small-variant caller designed to more accurately call small variants from the read-to-contig alignments" — i.e., they think this can get better, and a more integrated phasing-aware caller is the next step.
Before the result, a quick orientation: segmental duplications are large (typically ≥ 10 kb) regions where two or more nearly-identical copies of a sequence appear elsewhere in the genome. They are notoriously hard for short-read mapping and even for long-read mapping, because a read's exact origin within the duplicate family is sometimes genuinely ambiguous — the copies are too similar. Copy-number variants (CNVs) in these regions are clinically meaningful; the example here, on chromosome 22, falls in a region implicated in the 22q11.2 deletion/duplication syndromes that affect heart, palate, and immune development.
The result in this section is qualitative — a single locus, walked through — but it is the most clinically vivid result in the paper. The locus is a 225 kb duplication segment in NA21886, confirmed by a clinical microarray as a copy gain. The authors run conventional pbmm2 mapping and portello on the same reads and visualize the coverage profile.
The pbmm2 view is unusable for CNV calling: a coverage spike of nearly 12,000-fold (vs. an expected ~30× for the rest of the genome), followed by a coverage dropout across the genes DGCR6 and PRODH. This shape is characteristic of misplacement — reads from multiple copies of the segmental duplication pile onto a single reference homolog, while the reference homologs that should have received them are starved. A clinician looking at this in IGV would see noise.
The portello view shows a 3-fold coverage region tracking the array segment. That is (i) a clean ~3× bump (consistent with three total copies, i.e. one extra copy on one haplotype), (ii) sized to match the array call, and (iii) partitionable by contig: because the reads are tagged with the contig they aligned through, you can split the pileup into its source contigs and check whether the contigs themselves represent a sequence compression. That last property — a contig-aware view of a CNV — is the new diagnostic capability portello unlocks beyond just "cleaner coverage".
Before the algorithm, a quick orientation: think of read-to-contig alignment as a function R: read → contig coordinates, and contig-to-reference alignment as C: contig coordinate → reference coordinate. The composition C ° R gives read → reference coordinates — that's portello's job. The two practical complications are (1) split alignments (a single read can be split across several segments of contig coordinates, and a single contig can be split across several reference segments) and (2) inconsistent indel representations between the two coordinate systems.
The algorithm, in three pieces:
Pre-process the contig-to-reference mapping. Two cleanup steps. Repeated matches trimming: for each pair of overlapping contig alignments to the reference, retain the alignment with higher gap-compressed identity and clip the others, ensuring each contig base is mapped to the reference at most once. (pbmm2, the recommended read-to-contig mapper, does the analogous step on its side, so each read base also maps once.) Joining co-linear segments: when minimap2 splits a contig alignment into pieces because of a Z-drop dip, but the pieces are still in the same orientation and order, portello rejoins them into a single segment, encoding any unmapped intervening contig sequence as an insertion. This both improves variant calling accuracy and makes downstream visualization saner.
Compose the alignments. For each read, walk through its alignment to its contig, and for each contig position the read covers, look up the contig-to-reference alignment to get a reference position. Concatenate the resulting per-position correspondences into a CIGAR string. Handle split contig alignments by emitting separate read-to-reference segments where appropriate.
Normalize indels. Both inputs follow the convention that insertions and deletions are left-shifted (placed at the leftmost coordinate consistent with the alignment). When a contig segment was mapped to the reference in reverse orientation, the read's indels need to be re-left-shifted in the reference frame to preserve this property — otherwise a region whose two haplotypes are represented by oppositely-oriented contigs would emit a mixture of left- and right-shifted variants, which downstream callers handle inconsistently.
The careful indel handling (step 3) is the kind of detail that doesn't show up until you actually run a caller on the output and notice that a region with two haplotypes pointing different directions is producing duplicated variants.
A toy of step 2 (composition) on a single read. The real implementation handles split alignments, reverse-strand mapping, and CIGAR encoding; this version exists to make the lookup geometry concrete.
import numpy as np
def transfer_read_alignment(
read_to_contig: np.ndarray, # (L, 2) — (read_pos, contig_pos) per matched base
contig_to_ref: np.ndarray, # (M, 2) — (contig_pos, ref_pos) per matched base
) -> np.ndarray:
"""Compose read->contig and contig->reference into read->reference.
Each input is a sorted table of base-level correspondences (the
expanded form of a CIGAR with --eqx). For each read base, we look
up the contig base it touches, then look up the reference base
that contig base maps to. Bases without a contig->reference
correspondence (insertions in the contig vs reference) are
dropped; in the real implementation they would be encoded as
insertions in the output CIGAR.
"""
# Step 1: build a contig_pos -> ref_pos lookup. The contig table is
# already sorted; np.searchsorted gives O(log M) per query.
contig_keys = contig_to_ref[:, 0]
# Step 2: for each read base, find the contig position it touches,
# then resolve that contig position to a reference position.
out: list[tuple[int, int]] = []
for read_pos, contig_pos in read_to_contig:
idx = np.searchsorted(contig_keys, contig_pos)
# Drop read bases whose contig pos isn't in the contig->ref map
# (these are insertions in the contig relative to the reference).
if idx < len(contig_keys) and contig_keys[idx] == contig_pos:
out.append((int(read_pos), int(contig_to_ref[idx, 1])))
# Step 3: return as a (K, 2) array — sorted by read_pos by construction.
return np.array(out, dtype=np.int64) if out else np.empty((0, 2), dtype=np.int64)
Before the algorithm, a quick orientation: a phase set is the contiguous reference region within which haplotype assignments (HP=1/HP=2) are internally consistent. Within a phase set, all the reads with HP=1 are believed to come from the same parental haplotype; between phase sets, the labels reset (so HP=1 in phase set A does not mean the same parent as HP=1 in phase set B). The bigger and longer your phase sets, the more useful the phasing.
In partially-phased mode, portello builds phase sets by:
Haplotype index (HP=1 vs HP=2) is then assigned within each phase set by which of the two contigs each read aligned to. The conservative-overlap rule the paper describes — intervals overlap only if the intersection exceeds min(24 kb, ⅓ of new interval length) — is what stops a single small spurious overlap from collapsing two real contigs into one haplotype index.
This is roughly read-backed phasing (the standard whatshap recipe) but with the heterozygous variant set coming directly from the contig-to-reference alignment rather than a separate variant calling pass. That is a small but meaningful shift — the heterozygous calls are implicit in the assembly, so portello inherits them without an extra step.
--eqx is mandatory: Both alignments must use =/X CIGAR ops, not M. The standard M op doesn't distinguish match from mismatch, which makes the composition ambiguous.The paper's framing of its own contribution is interesting and worth taking seriously. The variant-calling accuracy gain — the number people will quote — is deliberately positioned as one benefit, not the benefit. The authors argue that the operational improvement (assembly-based and reference-based pipelines unified onto a consistent view of the sample) may matter more in the long run, because it removes the integration friction that has kept assembly out of clinical workflows.
There are two forward-looking claims the authors make that are worth holding lightly. First, they note that hybrid variant callers integrating contig and read inputs to portello are a natural next step — CNV calling being the most obvious target, since the chr22 example demonstrates how much information becomes available when reads are tagged with their source contig. They don't quantify this in the paper; they outline it as a direction. Second, they argue that with portello, a contig-based SV caller (PAV is cited) could be combined with a read-based small-variant caller, with all calls landing on the same coordinate system and the same review tools. This combination is enabled by portello rather than demonstrated by it.
The honest limitation: this paper is a method paper with two sample evaluations (HG002 and NA12878), one demonstrative CNV locus (NA21886), and a phasing benchmark on one of those samples. It is not a clinical study. The claim is "this should help rare-disease analysis", and the evidence supports the should, particularly in segdups and VNTRs, but a real evaluation on a rare-disease cohort — with diagnostic yield as the endpoint — remains future work. The CNV story in particular is a single locus; the paper does not yet present a genome-wide CNV F1 vs conventional methods.
There are also dependencies worth naming. Portello inherits the assembly's quality. If the assembly mis-resolves a haplotype switch, the read mapping inherits that error. The authors are explicit about this in the partially-phased section. A phasing-aware caller designed for portello inputs, plus better dual-assembly accuracy, are the two upstream shifts that would compound the wins shown here.
HP and PS tags through the alignment transfer in one step.