Analysis of Pankov, Pronina, Kuzmin, Borisov, Usoltsev, Zeng, Golubkov, Ermolenko, Shirshova, Matveeva (Huawei + ITMO + HSE + SPbU), arXiv:2311.09770v3, June 2024. Generated on May 6, 2026.
Zero-shot voice cloning is the task of synthesizing speech in a target speaker's voice from a single reference audio at inference time, without any finetuning. The dominant recipe through 2023 — and the one DINO-VITS lives inside — slots a pretrained speaker encoder (a small network that maps an audio clip to a fixed-dimensional embedding capturing "who is speaking") into a sequence-to-sequence speech synthesizer. The encoder produces an embedding from the reference audio; the synthesizer is conditioned on that embedding to read arbitrary text in that voice. Speaker encoder + acoustic decoder — both pretrained, glued together at training time, frozen at inference.
The recipe has a load-bearing failure mode: the speaker encoder is usually trained for speaker verification (the binary task of "is this the same speaker?"), and verification training pushes embeddings of the same speaker into a tight cluster regardless of the speech's emotion, prosody, or acoustic environment. That invariance is exactly wrong for voice cloning, which wants to transfer style, not erase it. It is also brittle to noise — a noisy reference clip lands in a slightly different region of the embedding space than its clean counterpart, and the synthesizer faithfully reproduces the perturbed embedding as a degraded voice.
Pankov et al. attack both problems with one move: they keep the verification-pretrained CAM++ speaker encoder (Wang et al. 2023, INTERSPEECH) but, during the joint TTS training stage, attach a DINO self-supervised loss (Caron et al. 2021, ICCV) as a sidecar objective. DINO — borrowed unchanged from self-supervised vision — uses a teacher exponential moving average (EMA) and a student network looking at two random crops of the same audio, then minimizes cross-entropy between their projected output distributions. Because DINO uses soft targets and never trains a classifier head, it produces a speaker embedding space that captures within-speaker variation (style, emotion, recording conditions) rather than collapsing it. The whole thing is trained jointly with VITS (Kim et al. 2021, ICML) — the variational-autoencoder-plus-flow-plus-GAN backbone that has anchored open-source TTS for half a decade.
The headline numbers, on a ChiME3 noisy-environment subset rated by Toloka crowdworkers (Table 1 of the paper): naturalness MOS in noisy conditions of 3.55 ± 0.03 vs. 3.11 ± 0.11 for YourTTS and 3.28 ± 0.04 for YourTTS plus a DEMUCS denoiser; similarity MOS of 3.52 ± 0.08 vs. 3.20 ± 0.08 / 3.35 ± 0.08 for the same baselines. A separate ablation run (Table 2 — re-rated by a fresh listener panel, hence different absolute values) tells you what is doing the work: relative to a 4.07 ± 0.05 noisy naturalness for the full DINO-VITS arm, replacing DINO with the standard AAM-Softmax verification loss collapses it to 3.58 ± 0.05; removing reference-audio noise augmentations on top of that collapses it further to 2.47 ± 0.05. DINO is doing real work, and the work is concentrated in the noisy condition — clean-condition naturalness is statistically a tie (4.03 / 4.00 / 4.04 across the three ablation arms), and the paper says so.
A second contribution rides along: in the training-data-noise regime, where you train on uncurated noisy speech without transcripts, the paper shows that using HuBERT-derived discrete units as the linguistic content representation outperforms a Whisper-ASR transcription baseline. The dramatic comparison is on noisy target audio: Whisper-N gets a character error rate of 24.05% while DINO-VITS-N gets 5.04% — the ASR-based pipeline catastrophically degrades when transcripts are noisy, the HuBERT-based one barely moves.
What is at stake practically: most "voice cloning in the wild" use cases — accessibility, dubbing from podcast clips, voice restoration, agent personalization — get reference audio from voicemails, recordings, and remote calls, not studio booths. A speaker encoder that quietly degrades on a 0 dB SNR clip is a pipeline that quietly fails on the data its users actually have. DINO-VITS argues that you can buy a meaningful chunk of that robustness by attaching a self-supervised loss to an off-the-shelf speaker verifier, not by replacing the architecture. Whether the trick generalizes beyond VITS is the open question — and one we will return to.
To place DINO-VITS, picture the last decade of neural TTS as a sequence of bottleneck moves. Around 2017, Tacotron and its successors made TTS neural and end-to-end: a sequence-to-sequence model went from text to mel-spectrograms, and a separate vocoder went from mel-spectrograms to waveforms. The vocoder was the early bottleneck — Griffin-Lim phase reconstruction was scratchy, autoregressive WaveNet was slow — and the next several years bled improvements out of the vocoder side: Parallel WaveNet, WaveGlow, HiFi-GAN. By 2021, VITS (Kim, Kong, Son 2021) folded the acoustic model and vocoder into one variational-autoencoder-plus-flow-plus-GAN trained end-to-end on waveforms. Mel-spectrograms became an internal representation rather than a hand-off interface. VITS is still the synthesizer underneath much of this paper.
The next bottleneck moved upstream, to who the model could synthesize. Single-speaker TTS was solved; zero-shot multi-speaker TTS — clone any voice from a single reference clip — was not. The 2022 reference here is YourTTS (Casanova et al. 2022, ICML), which conditioned a VITS variant on a pretrained speaker-verification embedding and got recognizable cross-speaker cloning out of it. YourTTS made the speaker-encoder-based recipe — pretrained verifier + acoustic decoder, glued at training time — the workhorse of open-source zero-shot TTS for the next 18 months. DINO-VITS is, in lineage terms, a refinement of that recipe.
But by mid-2023 the speaker-encoder-based recipe was no longer the only game. P-Flow (Kim et al. 2023, NeurIPS) and VoiceBox (Le et al. 2023, NeurIPS) replaced the speaker encoder with in-context speech prompting: instead of a 192-dim embedding, the model receives a chunk of the reference audio directly and uses flow-matching to fill in continuation audio that matches its acoustic signature. This is conceptually closer to how GPT does few-shot text, and it removes the speaker encoder as a possible bottleneck, because there is no external speaker encoder. DINO-VITS predates much of that work in submission but lives in a world where it has already landed; the paper acknowledges as much in its Conclusion, and we will pick the thread up in the Critical Read.
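The two specific challenges DINO-VITS addresses are noise-related and orthogonal to the architecture choice: first, noisy reference audio at inference time, and second, noisy, untranscribed speech in the training data.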
DINO-VITS attacks the first problem with a dual-objective speaker encoder (the title contribution) and the second with a HuBERT-based content-representation pipeline (a smaller second contribution that rides on the same paper).
The "why now?" — what enabling shifts make this paper possible — comes from three directions converging in 2023. CAM++ (Wang et al. 2023, INTERSPEECH) gave the field a fast, AAM-Softmax-trained speaker verifier that is small enough (6M parameters) to cheaply joint-train. DINO (Caron et al. 2021, ICCV) — originally for self-supervised vision — provided a teacher-EMA-plus-student framework whose centering trick prevents representation collapse in a non-contrastive setting. HuBERT k-means-clustered units gave a discrete content representation that is downstream-trainable like phonemes but does not require transcripts. The cross-domain transfer of DINO to speaker embeddings is the paper's main contribution; the other ingredients were sitting on the shelf.
Before the architecture, a one-line orientation: the system has four pretrained components, two of which are co-trained at TTS time. The four are HuBERT (speech-to-unit, 95M params), mBART (text-to-unit, 610M params), CAM++ (speaker encoder, 6M params), and VITS (unit-to-speech, 40M params). At training time, HuBERT extracts discrete content units from the input speech and the speaker encoder extracts an embedding from the same speech; VITS reads the units and the embedding and reconstructs the speech, optimizing a standard VITS loss plus the DINO sidecar. At inference time, mBART replaces HuBERT — text is converted to the same unit space — and the same VITS-plus-encoder pipeline runs.
The architectural commitment that matters for the rest of this section is the speaker encoder. The CAM++ verifier is pretrained on the VoxCeleb2 speaker-rich dataset using AAM-Softmax, the standard angular-margin softmax loss for verification (Wang et al. 2023). AAM-Softmax does two things: it pulls embeddings of the same speaker into a tight cluster, and it pushes embeddings of different speakers apart on the unit hypersphere. Both are great for "is this the same speaker?" — and both are problematic for cloning. A tight cluster erases within-speaker variation, but voice cloning needs to transfer style: emotion, prosody, recording acoustics. The encoder you want for verification is approximately the encoder you do not want for cloning.
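To make the two effects of AAM-Softmax concrete, the following is a minimal sketch of the additive angular margin applied to the target-class logit. The margin `0.2` and scale `32.0` are illustrative assumptions, not values taken from the paper or from CAM++, and the sketch skips the edge-case handling that full implementations apply when the shifted angle exceeds π.

```python
import math


def aam_logits(
    cosines: list[float], target: int, margin: float = 0.2, scale: float = 32.0
) -> list[float]:
    """Scaled logits from cosine similarities to class centers.

    The target class uses cos(theta + margin) instead of cos(theta), so an
    embedding must sit closer to its own class center to earn the same logit.
    """
    logits = []
    for idx, cos_theta in enumerate(cosines):
        if idx == target:
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            cos_theta = math.cos(theta + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits: list[float], target: int) -> float:
    """Numerically stable softmax cross-entropy for one example."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


# Toy cosines between one embedding and three speaker centers (assumed values).
cosines = [0.8, 0.3, -0.1]
plain = cross_entropy([32.0 * c for c in cosines], target=0)
with_margin = cross_entropy(aam_logits(cosines, target=0), target=0)
```

With the margin applied, the loss for the same embedding is higher, so gradient descent keeps pulling same-speaker embeddings closer to their center. That extra pull is the tight clustering the paragraph above describes.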
The naive fix — unfreeze CAM++ during VITS training and let the synthesis loss drag it toward more cloning-friendly representations — runs into catastrophic forgetting. The encoder loses its noise robustness and its speaker discriminability before it picks up useful style sensitivity. The paper's fix is to keep CAM++ trainable but constrain it with a DINO auxiliary loss applied to two augmented crops of the reference audio.
DINO (Caron et al. 2021) is a self-distillation framework originally proposed for self-supervised vision transformers. It maintains two networks with the same architecture: a student whose weights are updated by gradient descent, and a teacher whose weights are an exponential moving average (EMA) of the student's weights. Both networks see different augmentations of the same input, project their outputs through a small head into a K-dimensional logit space, and the loss is the cross-entropy between the teacher's softmax distribution (with a tight temperature) and the student's. Two anti-collapse tricks make this stable: centering subtracts a running mean of teacher outputs before the softmax (preventing the trivial solution where all inputs map to the same point), and sharpening uses a much lower temperature for the teacher than the student (preventing the trivial solution where all outputs are uniform). The loss the paper uses is
L_DINO = − Σᵢ σ((P_T(x_a1)ᵢ − C) / τ) · log σ(P_S(x_a2)ᵢ / τ)
where σ is the softmax, C is the EMA centering vector, τ is the temperature, and x_a1, x_a2 are two random crops of the same audio with independent noise augmentations. The formula writes a single τ for both branches; in the original DINO recipe the teacher temperature is lower than the student temperature, which is the sharpening described above. The student is the speaker encoder (CAM++) plus a three-layer MLP projection head producing K-dimensional output; the teacher shares the architecture with EMA-coupled weights.
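A small numeric sketch can show what centering and sharpening do to a single output vector. The logits and the centering vector below are made-up toy values, and the temperatures 0.04 and 0.1 follow the defaults of the original DINO recipe rather than anything confirmed for this paper.

```python
import math


def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [0.30, 0.25, 0.10, 0.05]   # toy projection-head output (assumed)
center = [0.20, 0.20, 0.10, 0.10]   # hypothetical running mean of teacher outputs

# Teacher branch: subtract the center, then sharpen with a low temperature.
teacher = softmax([x - c for x, c in zip(logits, center)], temperature=0.04)
# Student branch: no centering, higher temperature.
student = softmax(logits, temperature=0.1)

teacher_entropy = entropy(teacher)
student_entropy = entropy(student)
uniform_entropy = math.log(len(logits))
```

On these toy values the teacher distribution has lower entropy than the student's, and both sit below the uniform bound. The student is therefore trained toward a more confident target than it currently produces.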
Training and deployment proceed in three stages. Stage 1: pretrain CAM++ on VoxCeleb2 with AAM-Softmax (noise augmentations from MUSAN and RIRS). Stage 2: train the joint TTS system (VITS plus speaker encoder), minimizing the standard VITS loss plus λ · L_DINO. Stage 2 itself splits in two: 95k iterations with the speaker encoder frozen except for its last layer, then 175k iterations with it fully unfrozen. Stage 3 is inference rather than training: unit sequences come from mBART (text-to-unit) instead of HuBERT (speech-to-unit), and only the student speaker encoder is used to embed the reference audio. Total training time: 5 days on 2× RTX 3090, batch size 80.
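The two-phase split of Stage 2 can be written down as a small schedule helper. This is only a sketch of the iteration counts quoted above; the function name and the scope labels are hypothetical, and how the paper's code actually toggles gradients is not described here.

```python
FROZEN_PHASE_ITERS = 95_000     # encoder frozen except its last layer
UNFROZEN_PHASE_ITERS = 175_000  # encoder fully trainable
TOTAL_STAGE2_ITERS = FROZEN_PHASE_ITERS + UNFROZEN_PHASE_ITERS


def encoder_trainable_scope(step: int) -> str:
    """Return which part of the speaker encoder receives gradients at `step`."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < FROZEN_PHASE_ITERS:
        return "last_layer_only"
    if step < TOTAL_STAGE2_ITERS:
        return "all_layers"
    return "training_finished"
```

A training loop would call `encoder_trainable_scope(step)` once per iteration (or only at the boundary) and set `requires_grad` on the encoder's parameters accordingly.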
A minimal PyTorch reference for the DINO loss adapted to two audio crops. The full DINO codebase ships multi-crop logic, momentum schedules, and projection heads; this is the core of the loss function itself, the part the paper substitutes for AAM-Softmax.
```python
import torch
from torch import Tensor, nn
from torch.nn import functional as F


class DinoSpeakerLoss(nn.Module):
    """Two-crop DINO self-distillation loss for a speaker encoder.

    The student is the trainable speaker encoder + projection head; the teacher
    shares its architecture with EMA-coupled weights. Loss is cross-entropy
    between the (centered, sharpened) teacher distribution and the student
    distribution over augmented crops of the same utterance. Centering is the
    anti-collapse trick that lets DINO work without negative pairs.

    Args:
        out_dim: dimensionality K of the projection-head output.
        teacher_temp: τ_T, sharpens teacher targets to prevent uniform collapse.
        student_temp: τ_S > τ_T, smoother student distribution.
        center_momentum: EMA factor for the running mean over teacher outputs.
    """

    def __init__(
        self,
        out_dim: int,
        teacher_temp: float = 0.04,
        student_temp: float = 0.1,
        center_momentum: float = 0.9,
    ) -> None:
        super().__init__()
        self.teacher_temp = teacher_temp
        self.student_temp = student_temp
        self.center_momentum = center_momentum
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, student_out: Tensor, teacher_out: Tensor) -> Tensor:
        # Teacher is stop-grad: detach so gradients only flow into student.
        teacher_logits = (teacher_out.detach() - self.center) / self.teacher_temp
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        student_log_probs = F.log_softmax(student_out / self.student_temp, dim=-1)
        loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()

        # Update center as EMA over the batch mean of (un-centered) teacher outputs.
        batch_center = teacher_out.detach().mean(dim=0, keepdim=True)
        self.center.mul_(self.center_momentum).add_(
            batch_center, alpha=1.0 - self.center_momentum
        )
        return loss
```
In the joint training loop, this loss is computed on student/teacher outputs from two augmented crops x_a1, x_a2 of the same reference audio, and added to the standard VITS reconstruction + KL + duration + adversarial losses with a scalar weight λ. The student encoder's gradients flow from both objectives simultaneously, and the student embedding is what conditions VITS at inference — there is no separate "DINO encoder" path at deployment.
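The two pieces the paragraph above relies on, the EMA teacher update and the weighted sum of objectives, can be sketched as follows. The momentum value `0.996` is the starting default from the original DINO recipe and is an assumption here; `dino_weight` stands in for λ, whose value is not given in this analysis; and the `nn.Linear` modules are hypothetical stand-ins for the CAM++ encoder plus projection head.

```python
import copy

import torch
from torch import Tensor, nn


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996) -> None:
    """Move each teacher parameter toward the matching student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)


def joint_loss(vits_loss: Tensor, dino_loss: Tensor, dino_weight: float) -> Tensor:
    """Standard VITS objective plus the weighted DINO sidecar."""
    return vits_loss + dino_weight * dino_loss


# Stand-in networks: the teacher starts as a frozen copy of the student.
student = nn.Linear(8, 4)
teacher = copy.deepcopy(student)
for param in teacher.parameters():
    param.requires_grad_(False)

# After each optimizer step on the student, refresh the teacher.
ema_update(teacher, student)
```

Only the student receives gradients; the teacher changes solely through `ema_update`, which is why it can be dropped at deployment.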
This is the headline experiment. The setup: the ChiME3 dataset (Barker et al. 2017) provides recordings of speakers reading the same prompt in a quiet "booth" environment and in four real-world noisy environments (bus, cafe, pedestrian area, street junction). A subset is selected: 8 speakers, 15 reference audios per speaker (roughly evenly distributed across the four environments) — 120 reference audios per condition. For each reference, a different ground-truth source audio + text from the same speaker is selected as the cloning target. All test reference audios are trimmed to 3 seconds. Quality is measured by MOS (mean opinion score) on naturalness and similarity, gathered via the Toloka crowdsourcing platform with 10 raters per audio. Total: ~1200 ratings per cell.
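For readers who want to see how a "MOS ± 95% CI" cell arises from roughly 1200 ratings, here is a minimal sketch using a normal approximation. The paper's exact interval procedure is not described in this analysis, so the formula is an assumption, and the ratings are synthetic. The sketch also treats ratings as independent, which ignores clustering by rater and by clip.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (mean opinion score, half-width of a normal-approximation 95% CI)."""
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Synthetic cell: 120 reference audios x 10 raters each = 1200 ratings on a 1-5 scale.
ratings = [3, 4, 4, 5, 3, 4, 4, 3, 4, 4] * 120
mean, half_width = mos_with_ci(ratings)
```

The half-width shrinks with the square root of the number of ratings, which is why per-cell intervals in the tables below are only a few hundredths of a MOS point wide.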
The baselines: YourTTS (Casanova et al. 2022) reproduced without the multilingual head; YourTTS + DEMUCS denoiser (Défossez 2021), where a pretrained denoiser cleans the reference before YourTTS sees it, giving a stronger noisy-condition baseline; and BYOL-A-encoder-conditioned VITS (Klapsas et al. 2022), where the speaker encoder is a frozen BYOL-A self-supervised audio encoder. Each baseline answers a different question. Against YourTTS: did adding DINO and joint training help over the standard speaker-encoder recipe? Against YourTTS + DEMUCS: did learning robustness internally beat externalizing it to a denoiser? Against BYOL-A: does multi-task learning beat a generic frozen SSL audio encoder?
The Table 1 numbers (MOS ± 95% CI):
| System | Naturalness Clean | Naturalness Noisy | Similarity Clean | Similarity Noisy |
|---|---|---|---|---|
| Ground truth | 4.68 ± 0.03 | — | 3.94 ± 0.07 | — |
| DINO-VITS (ours) | 4.00 ± 0.05 | 3.55 ± 0.03 | 3.85 ± 0.08 | 3.52 ± 0.08 |
| YourTTS | 3.96 ± 0.05 | 3.11 ± 0.11 | 3.33 ± 0.08 | 3.20 ± 0.08 |
| YourTTS + DEMUCS | — | 3.28 ± 0.04 | — | 3.35 ± 0.08 |
| BYOL-A frozen | — | 1.85 ± 0.09 | — | 1.89 ± 0.07 |
Three readings of these numbers. First, the headline win is in the noisy columns: similarity of 3.52 vs. 3.20 (YourTTS) and 3.35 (YourTTS + DEMUCS), and naturalness of 3.55 vs. 3.11 and 3.28. The denoiser helps YourTTS by 0.15 similarity MOS and 0.17 naturalness MOS but does not close the gap to DINO-VITS. Second, BYOL-A frozen lands far below: a frozen generic SSL encoder is not a substitute for a multi-task-trained one, even though both rely on self-supervised objectives of a similar kind. Third, in clean conditions, DINO-VITS and YourTTS naturalness are essentially tied (4.00 vs. 3.96, overlapping CIs), so the naturalness win is concentrated in the noisy regime where the speaker encoder is exposed to out-of-distribution input. Similarity is the exception: DINO-VITS already leads in clean conditions (3.85 vs. 3.33).
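The "overlapping CIs" reading can be checked mechanically. The helper below is a coarse heuristic, not a significance test, and its name is hypothetical; the numbers are the Table 1 cells quoted above.

```python
def intervals_overlap(mean_a: float, ci_a: float, mean_b: float, ci_b: float) -> bool:
    """True if [mean_a ± ci_a] and [mean_b ± ci_b] share at least one point."""
    return abs(mean_a - mean_b) <= ci_a + ci_b


# Clean naturalness: DINO-VITS vs. YourTTS.
clean_tie = intervals_overlap(4.00, 0.05, 3.96, 0.05)
# Noisy naturalness: DINO-VITS vs. YourTTS + DEMUCS.
noisy_nat_overlap = intervals_overlap(3.55, 0.03, 3.28, 0.04)
# Noisy similarity: DINO-VITS vs. YourTTS + DEMUCS.
noisy_sim_overlap = intervals_overlap(3.52, 0.08, 3.35, 0.08)
```

On these cells the clean naturalness intervals overlap, while both noisy comparisons against the denoiser baseline do not, the similarity one only narrowly (a 0.17 gap against a combined half-width of 0.16).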
The ablation in Table 2 makes the role of DINO sharper. (Note: Table 2 is a separate MOS evaluation from Table 1, with its own listener panel. The absolute values shift across panels, but the within-table comparisons are clean.) Three arms: full DINO-VITS (Ours), AAM-Softmax replacing DINO (AV), and AAM-Softmax with reference-noise augmentations also removed (NV). Naturalness in the noisy condition: 4.07 ± 0.05 / 3.58 ± 0.05 / 2.47 ± 0.05. Replacing DINO costs roughly 0.5 MOS in noisy naturalness, and additionally removing the augmentations costs another 1.1. DINO loss and noise augmentation are both doing work, and their effects stack rather than substitute for each other. Because the arms are nested, the table does not isolate an arm that keeps DINO while dropping the augmentations. (In clean conditions, the three arms are statistically indistinguishable, 4.03 / 4.00 / 4.04, and the paper says so explicitly.)
A second, smaller experiment in Section 3.2.1 probes whether the DINO sidecar actually changes what the speaker embedding encodes. The authors train a small two-layer classifier on top of the speaker encoder to predict emotion category on CREMA-D (Cao et al. 2014) and IEMOCAP (Busso et al. 2008). The DINO-jointly-trained encoder reaches 62.4% on CREMA-D vs. 53.4% for the AAM-Softmax-only baseline (a 9.0 pp gain), and 45.8% on IEMOCAP vs. 39.8% (a 6.0 pp gain). The paper rounds this to "+9% accuracy" in passing — the larger of the two numbers — but the gain is meaningfully smaller on IEMOCAP. Either way, it is a probe, not a cloning result, but it is consistent with the hypothesis that DINO loosens the embedding cluster enough to encode style.
A short orientation before the result: this section is a separate contribution from the DINO sidecar, addressing a different problem. The DINO experiments above assumed clean training data and stressed the system at inference time with noisy reference audios. This section instead stresses the system at training time: what happens if you train on noisy speech without transcripts? The architectural lever here is not the speaker encoder but the content representation — the bridge between text input and acoustic output.
The standard recipe before HuBERT-style discrete units was: run an ASR system over the training audio to get phoneme transcripts, then train TTS on (audio, phonemes) pairs. This works while transcripts are accurate, and degrades catastrophically once the ASR can no longer transcribe — exactly what happens with noisy data. The HuBERT-based alternative bypasses transcripts: HuBERT extracts continuous self-supervised features from raw audio, k-means clusters them into 1000 discrete units, and TTS learns to map text → unit sequences (via mBART) and unit sequences → speech (via VITS). No transcripts; the units are the bridge.
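The step from continuous features to discrete units is a nearest-centroid lookup. The sketch below uses three 2-D centroids as a toy stand-in for the 1000 k-means centroids in HuBERT feature space; the feature values are made up, and any post-processing of the unit sequence is left out because this analysis does not describe it.

```python
Vector = tuple[float, ...]


def nearest_unit(frame: Vector, centroids: list[Vector]) -> int:
    """Index of the centroid closest to `frame` in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))


def quantize(frames: list[Vector], centroids: list[Vector]) -> list[int]:
    """Map a sequence of feature frames to a sequence of discrete unit IDs."""
    return [nearest_unit(frame, centroids) for frame in frames]


# Toy codebook and toy feature frames (assumed values).
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 1.2)]
units = quantize(frames, centroids)  # → [0, 1, 1, 2]
```

The resulting integer sequence plays the role phonemes play in a transcript-based pipeline: mBART learns to predict it from text, and VITS learns to synthesize speech from it.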
The experiment compares the two recipes head-to-head. Each is trained on a 60% clean / 40% noisy variant of LibriTTS (noises from MUSAN, SNR 0 dB on the noisy subset), with the original VCTK + LibriTTS retained as the clean variant. The ASR baseline uses Whisper-medium (Radford et al. 2023), a strong, recent, large-pretrained ASR system, to convert content audio to phonemes; HuBERT does the same job in the proposed system. Inference happens with both clean and noisy target audios; metrics are MOS naturalness, MOS similarity, and CER (character error rate of the synthesized speech, measured by running an ASR over the output and comparing to the ground-truth text).
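"SNR 0 dB" means the noise is scaled until its average power equals that of the speech. A minimal sketch of mixing at a target SNR follows; the function name is hypothetical and the toy waveforms are made up, and how the paper selects and aligns MUSAN clips is not described here.

```python
import math


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add `noise` to `speech`, scaled so speech/noise power equals `snr_db`."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]   # toy signal with power 1.0
noise = [0.5, 0.5, -0.5, -0.5]    # toy noise with power 0.25
mixed = mix_at_snr(speech, noise, snr_db=0.0)  # → [2.0, 0.0, 0.0, -2.0]
```

At 0 dB the noise is as loud as the speech, which is a harsh condition for any transcription system.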
The key Table 4 result, on noisy target audios:
| Train data | System | Naturalness | Similarity | CER |
|---|---|---|---|---|
| — | Ground truth | 4.79 ± 0.02 | 3.86 ± 0.08 | 3.86 ± 0.20 |
| Clean | Whisper-C | 3.04 ± 0.05 | 2.99 ± 0.08 | 7.29 ± 0.38 |
| Clean | DINO-VITS-C | 3.57 ± 0.05 | 3.16 ± 0.08 | 4.56 ± 0.25 |
| Noisy | Whisper-N | 1.29 ± 0.04 | 1.86 ± 0.07 | 24.05 ± 0.97 |
| Noisy | DINO-VITS-N | 3.52 ± 0.05 | 3.32 ± 0.08 | 5.04 ± 0.28 |
The Whisper-N row is the dramatic one. Trained on noisy data, the Whisper-based pipeline produces synthesized speech whose CER at inference (with a noisy target reference) is 24% — about a quarter of the characters are wrong — and whose MOS naturalness collapses to 1.29. The DINO-VITS-N row, by contrast, holds: CER 5.04%, naturalness 3.52, similarity 3.32. The HuBERT-based pipeline barely degrades when its training data goes from clean to noisy.
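Character error rate is the character-level edit distance between the ASR transcript of the synthesized speech and the ground-truth text, divided by the length of the ground truth. A minimal sketch follows; text normalization (casing, punctuation) and the ASR system used for scoring are not specified in this analysis and are left out.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,                             # deletion
                    current[j - 1] + 1,                          # insertion
                    previous[j - 1] + (ref_char != hyp_char),    # substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


example = cer("abcd", "abxd")  # → 0.25
```

A CER of 24.05% therefore corresponds to roughly one edit for every four reference characters.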
The mechanism, per the paper's diagnosis: noisy training data corrupts the Whisper transcripts that the ASR-based pipeline relies on. Whisper itself is robust enough at the noise levels used (0 dB SNR on 40% of clips), but "robust" is not "perfect" — small transcription errors propagate as noise into the TTS training signal and cumulatively destabilize the synthesizer. HuBERT, by contrast, does not produce transcripts; its discrete units encode noise and content together, and the TTS model learns to associate noisy units with noisy targets and (separately) clean units with clean targets. At inference, when the unit sequence comes from mBART (clean by construction, since mBART runs on text), the system synthesizes clean speech.
A tiny corroborating experiment in Section 3.3.2: train a binary CatBoost classifier (Prokhorenkova et al. 2018) on HuBERT features to distinguish noisy from clean speech, leave-one-noise-out across the four ChiME3 environments. F-scores exceed 0.97. The paper presents this as evidence that HuBERT features encode noise, which is separability, not invariance — and arguably the opposite of what you would naively want from a content encoder. The reason it still works for the TTS task is that the synthesizer has learned to associate the noise-encoded part of the units with the noise-encoded part of the target, so at inference (with units from clean text) the noise channel is silent.
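The leave-one-noise-out protocol can be sketched as a fold generator over the four environments. The classifier itself is omitted, the environment labels are shorthand for the ChiME3 conditions named earlier, and how clean speech is divided between train and test is an open detail not covered in this analysis.

```python
from collections.abc import Iterator

ENVIRONMENTS = ("bus", "cafe", "pedestrian", "street")


def leave_one_noise_out(
    environments: tuple[str, ...],
) -> Iterator[tuple[tuple[str, ...], str]]:
    """Yield (training environments, held-out environment) for each fold."""
    for held_out in environments:
        train = tuple(env for env in environments if env != held_out)
        yield train, held_out


folds = list(leave_one_noise_out(ENVIRONMENTS))
```

Each fold trains the noisy-vs-clean classifier on three noise types and tests it on the fourth. A high F-score under this protocol means the features separate noisy from clean speech even for a noise type the classifier never saw.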
The paper makes two bounded claims, both of which survive scrutiny: (1) attaching a DINO loss to the speaker encoder of a VITS-based zero-shot TTS system improves robustness to reference-audio noise at inference, and (2) using HuBERT-derived discrete units as the content representation makes the TTS pipeline robust to training on noisy, untranscribed speech. The evidence for each is well-scoped — internally consistent, statistically tight at the per-cell level, and ablation-validated.
What remains open is more interesting: the mechanism behind the wins, and how far they generalize beyond the specific recipe and test set evaluated. The paper itself flags some of this in its Conclusion (limited speaker-encoder-based scope, no flow-matching comparisons). The synthesis below restates those open threads as three experiments that would convince me the contribution generalizes, framed constructively rather than as gotchas. None of these are killers; they are the experiments that move the result from "a clever trick that works in this regime" to "a load-bearing pattern that other systems can adopt."
Experiment 1: Cross-loss ablation. The paper claims DINO loss specifically improves the speaker encoder. The broader truth might be that any non-discriminative self-supervised auxiliary loss with a soft target distribution would work, whether DINO, MoCo (He et al. 2020), BYOL (Grill et al. 2020), or SimCLR (Chen et al. 2020), because the load-bearing property is the absence of a hard classification head rather than the centering-and-sharpening trick specifically. An ablation that swaps DINO for one of these alternatives, holding everything else fixed, would tell you whether the contribution is "DINO works here" (as the paper says) or the broader "non-discriminative SSL teachers preserve within-speaker variation, and DINO is one example." The latter would be the more useful result for the community; the former leaves a hyperparameter-search-shaped hole.
Experiment 2 — Speaker generality. The Toloka MOS values are tight per-cell because each cell aggregates 1200 ratings. But the test set is 8 ChiME3 speakers across 4 noise environments. That is enough to estimate MOS for those speakers in those environments with confidence; it is not enough to claim generalization across speaker demographics, accents, recording devices, or noise classes outside the four ChiME3 environments. A re-run on a corpus with ≥50 speakers spanning ≥3 accent groups, with noise from at least one corpus other than ChiME3 + MUSAN, reporting MOS variance across speakers (not just within), would clarify whether the gap in Table 1 is a global property of the recipe or a property of this specific test slice. The paper's evidence is consistent with either reading.
Experiment 3 — Cross-architecture transfer. The paper restricts its claims to speaker-encoder-based TTS and explicitly does not compare against P-Flow or VoiceBox, the contemporaneous flow-matching SOTA. This is a fair scope choice — the paper is making a point about a specific recipe, not staking a SOTA claim. But it leaves the most interesting question unanswered: is the DINO sidecar a property of speaker encoders that any system using one would benefit from, or is it a property of VITS that does not transfer to other backbones? Grafting the DINO sidecar onto a flow-matching backbone (P-Flow-style speech prompting still uses an internal embedding pathway; you could attach a DINO loss to that) and measuring noise robustness would resolve the question. If it helps, the contribution is general; if not, it is a regularizer specific to VITS-style architectures.
There is a fourth thread worth noting briefly. The HuBERT noise-encoding probe (Section 3.3.2) shows F > 0.97 separability of noisy vs. clean HuBERT features — separability, not invariance. The paper presents this as evidence the recipe works "because" HuBERT encodes noise. That is a defensible reading of the mechanism (the synthesizer learns to associate the noise channel of units with the noise channel of the target), but it is also a yellow flag: a content encoder that explicitly encodes noise is, in principle, the opposite of what you would naively want. The reason it works here is the joint training of the synthesizer; the reason it might fail elsewhere is that you depend on that joint training to "absorb" the noise channel into the right output behavior. Replacing VITS with a different decoder could re-expose this dependency.