Analysis of Sun, Canziani, LeCun, Zhu (2026), New York University, arXiv preprint
Generated on April 29, 2026
Two strange things keep showing up in large language models, and nobody has been able to explain why they travel together. The first is massive activations: a handful of tokens (almost always the very first token in a sequence, sometimes a period or a newline) carry hidden-state values thousands of times larger than the rest of the network. The second is attention sinks: those same tokens absorb a huge slice of attention probability across many heads, regardless of what the prompt is about. Sun, Canziani, LeCun and Zhu set out to ask whether these are two faces of the same mechanism, or two phenomena that just happen to land on the same tokens.
If you have only worked at the application layer of LLMs, both terms can sound exotic, so a quick orientation. A modern decoder-only Transformer (think Llama or Qwen) is a stack of layers; each layer reads the running hidden state, applies a normalization step (RMSNorm — just rescale each row to unit RMS), and then either does multi-head attention or a SwiGLU feed-forward block. The output is added back to the running state via a residual connection. "Pre-norm" means the normalization happens before the block, not after; this design has become essentially universal because it trains more stably than the original "post-norm" Transformer. The catch — and this paper's central thesis — is that pre-norm has a side effect: the residual stream can accumulate unbounded values, and the model learns to use that.
The headline finding is that the co-occurrence is not a fundamental property of Transformers. It is an artifact of the pre-norm design plus the training recipe. Massive activations and attention sinks serve related but distinct functions: spikes act globally (a few channels carry a near-constant signal across all intermediate layers, behaving like implicit bias parameters baked into the residual stream), while sinks act locally (specific heads route excess attention into the spike token to bias themselves toward short-range dependencies). Crucially, the authors show you can suppress either phenomenon without harming the model. Swap RMSNorm for a sandwich-norm or QKNorm and the spikes vanish but the sinks remain. Add per-head conditional gating and the sinks vanish but the spikes remain (sort of — they fall too in some configurations).
That decoupling matters because both phenomena have caused real engineering pain. Quantization — squeezing model weights and activations into 8-bit or 4-bit integers to make inference fast — gets brittle when a few channels carry values in the thousands. KV-cache eviction strategies need special handling for sink tokens or model quality collapses. If we can choose to keep one and drop the other, the cost-quality frontier widens.
To see why this paper is more than an interpretability curiosity, it helps to walk through how the field arrived here. When the original Transformer was published in 2017, attention was a clean idea: every token computes a softmax over the keys of every other token, and the resulting weights say "attend here this much". Five or six years and many trillions of parameters later, we know the behavior of attention in trained LLMs is not always so clean. Two specific oddities have been documented and re-documented: certain hidden activations are gigantic compared to typical values, and certain tokens (especially the first one in a sequence) absorb most of the attention mass even when they are semantically irrelevant.
The first hints of "outlier dimensions" came from BERT-era work in 2021 (Kovaleva et al.) and from Dettmers et al.'s 2022 quantization paper, which found that 8-bit quantization of GPT-3 only worked if you handled a few outlier channels in higher precision. The second oddity — attention sinks — was named and characterized by Xiao et al. in 2024, who showed that streaming inference works well only when you keep the first few tokens around as "sinks". Sun et al. (2024) brought the two together by showing the outlier tokens are the sink tokens, but framed the link as observational. This paper is the mechanistic follow-up: it asks why they share tokens, and what each phenomenon is for.
The "why now?" is largely about practical pressure. Modern LLM serving lives or dies by quantization (FP16 weights and activations are too expensive at scale, so the field has driven hard into INT8, INT4, and now FP4 / NVFP4). Massive activations sabotage low-precision arithmetic because their magnitudes do not fit into the dynamic range of small integer types (Bondarenko et al., 2021; Wei et al., 2022). At the same time, the success of long-context inference techniques like StreamingLLM (Xiao et al., 2024b) and adaptive KV-cache eviction (Ge et al., 2024) hinges on protecting sink tokens. So if you are an LLM systems engineer in 2026, you have been simultaneously trying to suppress spikes (for quantization) and preserve sinks (for long-context fidelity). It would be very useful to know whether you have to trade off, or whether the two can be controlled independently. This paper says you can.
The paper makes three central claims, which structure the rest of the post:
A note on terminology used throughout: a spike token is a token where massive activations appear; a spike channel is one of the few hidden-state dimensions where the magnitudes get huge. A sink token is a token that absorbs disproportionate attention; a sink head is an attention head that routes most of its mass into the sink token. Empirically these largely overlap in pre-norm LLMs — that is the puzzle.
Before diving into the mechanism, lock in the architecture the paper is studying. A modern Llama / Qwen-style decoder-only Transformer is a stack of 2L blocks — alternating attention and feed-forward — with a single hidden-state matrix H flowing through. Each block applies the pre-norm + residual rule:
H_{i+1} = H_i + F_i(RMSNorm(H_i))
A few terms to defuse, since not everyone has stared at this equation for years.
RMSNorm: divides each row of H by its L2 norm and rescales to sqrt(d_model). Where LayerNorm subtracts the mean and divides by the std, RMSNorm just divides by the RMS. It is faster and works just as well in practice (Zhang & Sennrich, 2019), which is why most modern LLMs use it.

Multi-head attention: splits the computation into N_head heads, each of dimension d_head. For each head, Q = H̃ W_Q, K = H̃ W_K, V = H̃ W_V (where H̃ is the RMSNormed input), and the head output is softmax(Q K^T / sqrt(d_head)) V. The per-head outputs are concatenated and projected by W_O.

SwiGLU feed-forward: F_ffn(h) = W_down (SiLU(W_gate h) ⊙ (W_up h)), where ⊙ is element-wise product. The gating structure (multiplying two parallel projections) is what makes SwiGLU more expressive than a standard MLP and, as we will see, what lets it act as a quadratic amplifier when SiLU happens to be near identity.

The paper focuses on Llama 2 7B / 13B, Llama 3 8B, and Qwen 2.5 / 3 (7B/8B/14B) for the analysis, plus a from-scratch 7B Llama-style model trained for 100B tokens for the ablations.
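To make the pre-norm rule concrete, here is a minimal dependency-free sketch of one residual step on a single hidden-state row. The names `rms_norm`, `pre_norm_block`, and `toy_block` are illustrative, the learned RMSNorm gain is omitted, and `toy_block` is a hypothetical stand-in for a real attention or feed-forward block.

```python
import math

def rms_norm(row, eps=1e-6):
    """Rescale one hidden-state row to (approximately) unit RMS; no learned gain."""
    rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x / rms for x in row]

def pre_norm_block(h, block_fn):
    """One pre-norm residual step: h + F(RMSNorm(h))."""
    update = block_fn(rms_norm(h))
    return [a + b for a, b in zip(h, update)]

def toy_block(x):
    # Hypothetical stand-in for attention or an FFN: a fixed scaling.
    return [0.5 * v for v in x]

h = [3.0, -4.0, 0.0, 0.0]
normed = rms_norm(h)               # what the block sees
out = pre_norm_block(h, toy_block) # what the residual stream carries forward
```

Note that the block only ever sees the normalized row, while the un-normalized `h` is carried forward and added to. Nothing in this rule caps the size of the residual stream, which is the property the rest of the analysis relies on.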
The element-wise product (W_gate h) ⊙ (W_up h) is the structural reason an FFN can act as a quadratic amplifier, not just a linear one.

Here is the load-bearing observation. If you instrument a trained Llama 2 7B and watch the magnitudes of the hidden state across its 64 blocks (32 attention and 32 feed-forward, in the 2L counting above), almost every channel is small: a few units, maybe a few tens. But two or three specific channels, on a small set of specific tokens (overwhelmingly the very first token), reach magnitudes in the thousands for a long stretch of intermediate layers. The trajectory is consistent: a sharp jump up around block 4, a plateau through the middle, a sharp drop back to normal at block 62 (out of 64). The authors call this the "rise – plateau – fall" lifecycle, and they identify three classes of blocks that produce it: step-up blocks that inject the spike, intermediate blocks that propagate it via residual addition, and step-down blocks at the end that cancel it by injecting equal-and-opposite values.
The mechanism behind step-up is the most surprising part. The SwiGLU feed-forward block, which usually looks like a humdrum nonlinearity, behaves as a directional quadratic amplifier for these tokens. Here is the chain of approximations the paper lays out: when SiLU happens to operate in its near-identity regime (the specific spike tokens land in a part of input space where SiLU(x) ≈ x), the SwiGLU FFN simplifies from W_down (SiLU(W_gate h) ⊙ (W_up h)) to roughly W_down ((W_gate h) ⊙ (W_up h)). That elementwise product of two linear projections is quadratic in h. Each output coordinate k then has the form h^T S_k h for some matrix S_k. For most output coordinates, S_k is unremarkable. But for a few "spike channels", S_k is dominated by a single eigenvalue λ* orders of magnitude larger than the rest of the spectrum. When the input h aligns with the leading eigenvector s*, the quadratic form evaluates to roughly λ* (s*^T h)^2, so the output scales with λ* and with the square of h's component along s*. That is the moment a hidden value goes from order-10 to order-1000.
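The quadratic-form step can be written out explicitly. What follows is a worked derivation under the SiLU(x) ≈ x approximation described above. The symbols g_j and u_j (the j-th rows of W_gate and W_up), D_{kj} (an entry of W_down), and y_k (output coordinate k) are notation introduced here for illustration, not taken from the paper.

```latex
\begin{aligned}
y_k &= \sum_{j} D_{kj}\,(g_j^\top h)\,(u_j^\top h)
     = h^\top \Big(\sum_{j} D_{kj}\, g_j u_j^\top\Big) h
     = h^\top S_k h, \\
h^\top S_k h &= h^\top S_k^{\mathrm{sym}} h,
  \qquad S_k^{\mathrm{sym}} = \tfrac{1}{2}\,\big(S_k + S_k^\top\big), \\
y_k &\approx \lambda^{*}\,\big(s^{*\top} h\big)^{2}
  \quad \text{when } S_k^{\mathrm{sym}} \approx \lambda^{*}\, s^{*} s^{*\top}.
\end{aligned}
```

Only the symmetric part of S_k contributes to the quadratic form, which is why it makes sense to talk about its eigenvalues and a leading eigenvector. In the rank-one-dominated case the output depends on h almost entirely through its projection onto s*.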
Why does the alignment with s* happen for the first token specifically? Because the first token is in a structurally privileged position. With causal masking, the first token can only attend to itself, so the attention block at position 0 collapses to a fixed linear map W_VO. That map is the same for every prompt. The model has therefore learned a W_VO that consistently steers position-0 representations toward s*. Delimiter tokens (periods, newlines) follow a related but slightly different path: their embeddings are highly aligned with the RMSNorm scale parameters, so RMSNorm gives them an outsized magnitude post-norm, which makes them self-attend disproportionately, which lands them in a near-first-token regime, which triggers the same quadratic amplifier. The pattern holds remarkably broadly: across Llama 2 / 3 and Qwen 2.5 / 3, over 98% of all vocabulary items become spike tokens when placed at position 0 (Table 2 in the paper). The exceptions are rare characters from low-resource scripts whose embeddings stayed close to initialization.
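The claim that position 0 can only attend to itself is easy to check numerically. The sketch below is a plain-Python causal softmax (the function name and the score matrices are made up for illustration). Whatever the logits are, the first row puts all of its weight on the first key, so the head output at position 0 is just its own value vector passed through the output projection.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[i][j] is the query-i / key-j logit; position i may only see j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Two unrelated, arbitrary score matrices standing in for two different prompts.
scores_a = [[0.3, 9.0, -2.0], [1.0, 0.5, 4.0], [0.2, 0.1, 0.0]]
scores_b = [[-7.0, 1.0, 1.0], [2.0, 2.0, 2.0], [5.0, -5.0, 0.0]]

row0_a = causal_attention_weights(scores_a)[0]
row0_b = causal_attention_weights(scores_b)[0]
```

Both `row0_a` and `row0_b` come out as all weight on position 0, independent of the prompt-specific scores. That is the sense in which the attention block at position 0 reduces to a fixed linear map.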
Five concrete properties characterize massive activations, and they all fall out of this mechanism: (i) they appear only in intermediate layers (because of the step-up / step-down injection pattern), (ii) only in a small number of channels (the few S_k matrices with high-gain quadratic forms), (iii) the affected channels spike together (they share the same trigger direction s*), (iv) inter-channel ratios stay nearly fixed (governed by the leading eigenvalues of the shared S_ks), and (v) only a small number of tokens spike (the few that align with s*).
The model spends most of its life in the plateau. The exact step-up and step-down indices vary across families, but the shape stays the same.
A minimal sketch of the directional quadratic amplifier — the part of SwiGLU that becomes a quadratic form when SiLU is in its near-identity regime.
import torch
import torch.nn as nn


class DirectionalQuadraticAmplifier(nn.Module):
    """SwiGLU FFN, simplified to the regime that produces massive activations.

    When SiLU(x) is approximately x for the spike token's input, the SwiGLU
    block reduces to W_down ((W_gate h) * (W_up h)), with * element-wise. Each
    output coordinate k is then h^T S_k h, and a few output channels have an
    S_k that is rank-one dominated by a leading eigenvector s*. When h aligns
    with s*, those channels are amplified by the leading eigenvalue lambda*.
    """

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.W_up = nn.Linear(d_model, d_ffn, bias=False)
        self.W_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h shape: (batch, seq, d_model). Each output channel k will end up
        # behaving like a quadratic form h^T S_k h.
        gate = self.W_gate(h)
        up = self.W_up(h)
        # SwiGLU normally applies SiLU here; for spike tokens it is
        # near-identity, so the multiplicative gate is the dominant nonlinearity.
        # The element-wise product is what makes the whole block quadratic in h.
        product = gate * up
        # W_down picks linear combinations; spike channels are rows whose
        # corresponding S_k matrix has a single dominant eigenvalue.
        return self.W_down(product)
Each output channel is a quadratic form h^T S_k h, and a handful of S_k matrices have rank-one structure that explodes a single direction.

So we have a few tokens carrying values in the thousands across most of the network. Why does that turn into an attention sink? The answer hinges on what RMSNorm does to those tokens. Three properties matter, and they all follow from the spike structure described above.
First, RMSNorm bounds the magnitudes. A spike token entering a block has L2 norm dominated by the few spike channels. After RMSNorm, every coordinate is bounded by sqrt(d_model). So the spike disappears from the normalized input, even though it is still huge in the residual stream.
Second, RMSNorm sparsifies the spike token. Because the L2 norm is dominated by a few channels, division-by-norm crushes the non-spike channels relative to the spike channels. The post-norm vector is approximately a sparse multi-hot indicator over the spike channel set — the rest is noise.
Third, and most importantly, RMSNorm makes the spike token near-constant across prompts. Spike channels maintain (almost) fixed inter-channel ratios across different spike tokens, so when you normalize, you get nearly the same vector regardless of which specific spike token you started from. The paper visualizes this with cosine similarity: pre-step-up, spike-token representations vary widely; post-step-up, they collapse to cosine similarity ≈ 1.0.
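The three RMSNorm effects (bounding, sparsifying, near-constancy) can be seen on toy numbers. This is a sketch under made-up values: two hypothetical spike tokens with 16 channels, different content in the ordinary channels, and the same 2:1 ratio between two spike channels. The learned RMSNorm gain is left out.

```python
import math

def rms_norm(row, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x / rms for x in row]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

d_model = 16
# Channels 0 and 1 are the spike channels; the remaining 14 are ordinary.
token_a = [2000.0, 1000.0] + [0.5 * ((-1) ** i) for i in range(d_model - 2)]
token_b = [3000.0, 1500.0] + [0.8 * math.sin(i) for i in range(d_model - 2)]

norm_a, norm_b = rms_norm(token_a), rms_norm(token_b)

max_coord = max(abs(x) for x in norm_a + norm_b)                 # bounded by sqrt(d_model)
largest_ordinary = max(abs(x) for x in norm_a[2:] + norm_b[2:])  # crushed toward zero
similarity = cosine(norm_a, norm_b)                              # nearly identical vectors
```

After normalization no coordinate exceeds sqrt(d_model), the ordinary channels are orders of magnitude smaller than the spike channels, and the two tokens are almost the same vector even though their raw magnitudes and ordinary-channel content differ.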
Now feed those near-constant vectors into the key projection W_K. Spike-token keys collapse to the span of just one or two rows of W_K (the rows corresponding to the spike channels). That is a radical dimensionality reduction relative to d_head — sink keys live in 1-2 dimensions, not 64 or 128. Non-spike-token keys, in contrast, span a much higher-dimensional manifold.
Whether a head becomes a sink head is then a matter of geometry. Each head has a query subspace and a key subspace. If the query subspace happens to align more closely with the (1-2 dim, near-constant) sink-key subspace than with the non-sink-key subspace, the dot products q^T k_sink are systematically larger than q^T k_non-sink, and the softmax dumps mass on the sink. If the alignment is the other way, the head attends semantically. The model has many heads, and the geometry of their query subspaces relative to W_K determines who becomes a sink head.
This is the key mechanistic insight: sinks emerge because (i) sparsification from normalization confines sink keys to a low-dimensional subspace, and (ii) near-constancy keeps that subspace stable across prompts, which gives the learned W_K something predictable to route around. The result is that W_K partitions the key space cleanly into "sink" and "non-sink" regions, and each query head picks a side.
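The geometric argument can be illustrated with a toy head. In this sketch the sink key lies along one axis and three content keys lie in the other dimensions. All vectors are invented for illustration, and `sink_ratio` here means the softmax weight on the sink key for a single query, a simplified stand-in for the metric used in the ablations.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dim head: the sink key occupies one axis, content keys the rest.
k_sink = [4.0, 0.0, 0.0, 0.0]
content_keys = [[0.0, 1.0, 0.5, -0.5], [0.0, -0.5, 1.0, 0.5], [0.0, 0.5, -0.5, 1.0]]

def sink_ratio(query):
    """Fraction of attention mass that lands on the sink key (position 0)."""
    logits = [dot(query, k_sink)] + [dot(query, k) for k in content_keys]
    scale = math.sqrt(len(query))
    return softmax([x / scale for x in logits])[0]

sink_head_query = [1.5, 0.2, 0.1, 0.0]      # aligned with the sink axis
content_head_query = [-1.5, 2.0, 1.0, 0.0]  # points away from the sink axis

ratio_sink_head = sink_ratio(sink_head_query)
ratio_content_head = sink_ratio(content_head_query)
```

The first query sends most of its attention to the sink, the second sends almost none. The only difference between them is how the query is oriented relative to the sink axis.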
Sink keys live in one or two dimensions, far fewer than d_head (typically 64 or 128). That is what the learned W_K exploits.

Whether a head becomes a sink head depends on which side of W_K's partition the head's query subspace falls on, not anything semantic.

The previous two sections build a mechanism. The next sections turn that mechanism into a causal claim by intervening, one architectural component at a time, and watching what happens to the spike magnitude, the sink ratio (the fraction of attention mass routed to the sink token), and language-modeling perplexity. The setup: a Llama-style 7B model trained from scratch on the DCLM dataset for 100B tokens, which is enough to reproduce both phenomena. Each ablation modifies a specific architectural choice while keeping the rest fixed.
The picture that emerges across the ablations is the headline of the paper: the two phenomena respond differently to the same interventions. That is the empirical fingerprint of two mechanisms, not one.
If SwiGLU is the "directional quadratic amplifier" responsible for spikes, what happens when you take it away? The authors compare four feed-forward designs at fixed capacity: SwiGLU (baseline), GeLU (the older standard), a single linear layer, and an attention-only design (no FFN, just more attention layers).
A quick orientation on what those mean for a non-specialist. SwiGLU and GeLU are both nonlinear MLP variants: each applies a nonlinearity (SiLU or GeLU), and SwiGLU additionally applies a multiplicative gate. A single linear layer is just W h, the simplest possible block. Attention-only replaces every FFN with another attention layer; the model becomes "all attention all the time".
The result: massive activations and attention sinks emerge in all four configurations. The block design is not a prerequisite. But the magnitude of the spikes varies wildly. SwiGLU and GeLU yield spike magnitudes around 3000-4000. Linear and attention-only yield spike magnitudes around 600-700 — the spikes still exist, but they are much smaller because they have to be built up gradually across many layers instead of in a single block. The gating (SwiGLU) and saturating-nonlinearity (GeLU) designs concentrate amplification within one block; the others spread it.
That is a useful refinement of the earlier story. SwiGLU is not necessary for spikes — any pre-norm architecture will accumulate them — but it is the most efficient amplifier, which is why the spike magnitudes are largest there. The implication for quantization is direct: if your model uses SwiGLU, expect bigger spikes and budget more dynamic range for them.
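To see why a single large value hurts low-precision arithmetic, consider the simplest scheme: symmetric per-tensor INT8 with the scale set by the largest magnitude. This is a sketch with invented activation values, and real deployments typically use finer-grained scales or outlier handling. The function names are illustrative.

```python
def absmax_quantize_int8(values):
    """Symmetric per-tensor INT8: one scale set by the largest magnitude."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return [c * scale for c in codes]  # dequantized values

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

ordinary = [0.8, -1.3, 2.1, -0.4, 1.7, -2.5, 0.9, 0.3]
with_spike = ordinary + [3000.0]  # one hypothetical massive activation

err_plain = mean_abs_error(ordinary, absmax_quantize_int8(ordinary))
restored_spiked = absmax_quantize_int8(with_spike)
err_spiked = mean_abs_error(ordinary, restored_spiked[: len(ordinary)])
```

Without the spike, the ordinary values survive with small rounding error. With it, the step size becomes about 23.6, so every ordinary value rounds to zero and the information in those channels is lost.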
Now the most consequential ablation. If normalization is the bridge between spikes and sinks, swapping it should disconnect them. The authors test three alternatives.
DynamicTanh: replaces the normalization with an element-wise tanh-like saturating function. This caps each coordinate independently, which means it cannot produce the sparse multi-hot vector that arises from L2 normalization of a peaky distribution.

The headline result: each alternative successfully reduces spike magnitude relative to the baseline, but the sink ratio mostly survives. Sandwich norm: spike falls from 3818 to 520 while sinks stay at 44.7%. Sandwich-QK: spikes almost gone (92), sinks 42.0%. DynamicTanh: spikes fall to 153 and sinks increase to 61.0%; the model finds an alternative pathway to designate the first token as a stable reference, without needing huge magnitudes.
That is a striking decoupling. The standard story — "spikes cause sinks" — is too strong. What is true is: pre-norm RMSNorm + unbounded residual + SwiGLU is one path to sinks, but it is not the only path. When you take the magnitudes away, the model finds another way.
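The difference between dividing by a shared norm and capping each coordinate independently shows up clearly on a toy spike token. The sketch below assumes an element-wise tanh(alpha * x) as the saturating function, with the learned gain and bias omitted. The value alpha = 0.5 and the input vector are illustrative choices, not settings from the paper.

```python
import math

def rms_norm(row, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x / rms for x in row]

def dynamic_tanh(row, alpha=0.5):
    """Element-wise tanh(alpha * x); learned gain and bias omitted for brevity."""
    return [math.tanh(alpha * x) for x in row]

def ordinary_share(vec, n_spike=2):
    """Share of the squared norm carried by the non-spike channels."""
    total = sum(x * x for x in vec)
    return sum(x * x for x in vec[n_spike:]) / total

# Two spike channels followed by six ordinary channels (hypothetical values).
spike_token = [2000.0, 1000.0, 1.2, -0.7, 0.4, -1.5, 0.9, 0.1]

share_rms = ordinary_share(rms_norm(spike_token))
share_dyt = ordinary_share(dynamic_tanh(spike_token))
max_dyt = max(abs(x) for x in dynamic_tanh(spike_token))
```

Under RMSNorm the ordinary channels carry a negligible share of the normalized vector, which is the sparsification effect. Under the tanh cap the spike channels saturate at 1 and the ordinary channels keep a substantial share, so the block input is no longer a near-constant sparse indicator.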
If the geometric story is right — sinks emerge when the per-head subspace is large enough to cleanly separate sink keys from non-sink keys — then the attention head dimension d_head should be a major lever. The authors run a clean sweep.
A quick orientation. In multi-head attention, the total attention capacity is split into N_head heads of dimension d_head. The per-head Q/K/V projections are d_model -> d_head matrices. Larger d_head gives each head more "room" geometrically. Modern Llama-style models use d_head = 128 and N_head = 32 for a 4096-dim model.
The ablation: hold N_head = 32 fixed, vary d_head from 8 to 128. Result: a monotonic rise in sink ratio (4.1% at d_head=8 to 46.0% at d_head=128) and spike magnitude (291 to 3818). Tiny heads cannot form sinks at all, because the sink-key subspace and non-sink-key subspace cannot be cleanly separated in low dimensions.
A second sweep is more illuminating: hold the total attention capacity d_head x N_head = 4096 fixed, and re-allocate it. With 8/512 (tiny heads, lots of them) the sink ratio is 11%; with 128/32 (Llama-style) it is 46%; with 256/16 (giant heads, fewer of them) it is 52%. Concentrating capacity into fewer, larger heads strengthens sink behavior, even at fixed total capacity.
This is a strong piece of evidence for the geometric mechanism. Sink formation is fundamentally about whether the head has enough dimensions to partition, not about how much total capacity the model has.
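As a quick arithmetic check on the fixed-capacity sweep, the snippet below records the three configurations and sink ratios quoted above and confirms that they share the same d_head x N_head budget. The dictionary layout and variable names are just one illustrative way to organize the numbers.

```python
# (d_head, n_head) -> sink ratio in percent, as quoted in the text above.
fixed_capacity_sweep = {
    (8, 512): 11.0,
    (128, 32): 46.0,
    (256, 16): 52.0,
}

TOTAL_CAPACITY = 4096

def capacity(d_head, n_head):
    return d_head * n_head

configs_by_head_size = sorted(fixed_capacity_sweep)  # ascending d_head
ratios = [fixed_capacity_sweep[c] for c in configs_by_head_size]

all_same_capacity = all(capacity(d, n) == TOTAL_CAPACITY for d, n in configs_by_head_size)
monotonic = all(a < b for a, b in zip(ratios, ratios[1:]))
```

All three rows multiply out to 4096, and the sink ratio rises with head size across them, so the trend cannot be attributed to a change in total attention capacity.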
d_head is the dominant architectural lever for sinks: monotonic rise from 4.1% at d_head=8 to 46% at d_head=128.

A second-order question: if sinks are useful for routing, what happens when we give the model an explicit routing mechanism instead? The answer is striking. Following Qiu et al. (2025), the authors test gated attention variants, in which the head output gets multiplied by a learned gate.
The taxonomy that matters is whether the gate is conditional (a function of the current hidden representation, so it changes prompt-by-prompt) or unconditional / static (fixed at the per-head, per-channel, or per-position level). Within conditional, you can have per-channel gates (one gate per output channel), per-head gates, or single-token gates.
Result: conditional gating eliminates sinks. Per-channel conditional gate yields a 4.5% sink ratio (down from 46%) with spike magnitude 202. Per-head: 6.4% sink, spike 186. Single-token: 31% — partial. Static gates (positional or token-embedding based) preserve sink behavior almost fully (~31-44%).
The interpretation: attention sinks are a learned input-conditioned gating mechanism. When the model lacks a built-in dynamic gate, it improvises one by routing excess attention into the spike token, effectively zeroing out unwanted heads. When you give it a real gate, the improvisation becomes redundant and disappears.
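A conditional per-head gate can be sketched in a few lines. The version below assumes a sigmoid of a linear function of the current hidden state, which is one common way to build such a gate. The exact parameterization in the paper and in Qiu et al. may differ, and all weights and vectors here are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_head_output(head_out, hidden, gate_weights, gate_bias=0.0):
    """Scale one head's output by a scalar gate computed from the current hidden state."""
    gate = sigmoid(sum(w * h for w, h in zip(gate_weights, hidden)) + gate_bias)
    return [gate * v for v in head_out], gate

head_out = [0.6, -1.1, 0.3]
gate_weights = [2.0, -1.0, 0.5, 0.0]  # hypothetical learned parameters

# Two different hidden states produce two different gates for the same head.
passed, gate_on = gated_head_output(head_out, [3.0, -2.0, 1.0, 0.0], gate_weights)
silenced, gate_off = gated_head_output(head_out, [-3.0, 2.0, -1.0, 0.0], gate_weights)
```

Because the gate depends on the hidden state, the same head can be passed through for one token and silenced for another. That gives the model a direct way to switch a head off without parking attention mass on the first token.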
This connects sinks to a larger architectural conversation. Multiple recent designs — gated linear units, gated state-space models, mixture-of-experts routers — build dynamic input-conditioned routing as an explicit primitive. The sink phenomenon is a hint that vanilla self-attention has been silently doing a version of this, with the first token playing the role of a "this head is off" signal.
The last ablation tests the hypothesis that sinks are not just architectural — they are useful for short-range prediction. Xiao et al. (2024a) noted that sink heads tend to attend to nearby tokens of the query; the authors of this paper test that systematically by varying the training context-length distribution. Concretely, they change the range of sequence positions over which the loss is computed during training. Configurations are reported as min/max — e.g. 1/4096 means losses are computed at positions 1 to 4096.
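The min/max notation amounts to a mask over sequence positions when averaging the loss. Here is a small sketch of that idea on an 8-position toy sequence. The function names and the per-token loss values are invented, and how the authors implement the restriction in their training code is not specified in this post.

```python
def loss_position_mask(seq_len, min_pos, max_pos):
    """1.0 where the loss is counted (1-indexed positions min_pos..max_pos), else 0.0."""
    return [1.0 if min_pos <= pos <= max_pos else 0.0 for pos in range(1, seq_len + 1)]

def masked_mean_loss(per_token_loss, mask):
    kept = sum(mask)
    if not kept:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept

# Toy analogue of a mixed-length setting (1/8) versus a long-only setting (5/8).
per_token_loss = [4.0, 3.5, 3.0, 2.5, 2.0, 2.0, 1.5, 1.5]
mixed = loss_position_mask(8, 1, 8)
long_only = loss_position_mask(8, 5, 8)

mixed_loss = masked_mean_loss(per_token_loss, mixed)
long_only_loss = masked_mean_loss(per_token_loss, long_only)
```

In the long-only setting the early positions contribute nothing to the gradient, so the model is never directly rewarded for predicting well with only a short context available.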
Result: when training includes short sequences (1/256, 1/1024, 1/4096), the sink ratio is stable at ~42-46%. When short contexts are removed and only long-range positions are optimized (1024/4096, 2048/4096, 2048/6144), the sink ratio collapses dramatically — 13%, 1.2%, 5.8% respectively. Spike magnitudes go up in some of these long-only configs (38000+ at 1024/4096), but the sinks have already disengaged.
The interpretation: sinks exist to support short-range prediction. In a mixed-length training regime, the first token serves as a cheap, universal "ignore the far context" reference for short-context examples. When the model never has to do short-context prediction, it never learns to use that reference, and sinks do not form. This is a counterintuitive but empirically robust result — a phenomenon we usually frame as architectural turns out to be partly training-data-distribution-driven.
Pulled together, the picture is coherent. Pre-norm Transformers, as currently trained, have a quirky internal logic. The first token, which can only attend to itself, sits in a structurally privileged position; the model learns to push its representation in a direction s* that the SwiGLU FFN can amplify quadratically; that amplification dumps massive values into a few specific channels, which persist through the residual stream as approximately constant signals (implicit parameters, not data); RMSNorm then transforms those large values into a sparse, near-constant input vector; the learned key projection W_K notices that the first-token keys cluster in a tiny subspace, and partitions the key space accordingly; some heads orient their queries toward the sink subspace and become sink heads, which is useful because dumping attention into the first token is a cheap way to bias toward short-range prediction in mixed-length training.
So the spike-and-sink couple is not one phenomenon; it is two phenomena tied together by an architectural choice (pre-norm + RMSNorm) and a training distribution choice (mixed-length context). Each can be undone:
That last point is the load-bearing engineering claim. If the spike-and-sink coupling were doing something necessary, suppressing it would damage the model. It does not. Their overlap in standard pretrained LLMs is best understood as a byproduct of the default normalization-and-training recipe, not a reflection of any underlying functional necessity.
For practitioners, the main implications:
The bigger picture: this paper continues a useful trend of treating "weird LLM behaviors" as architectural artifacts to be designed around, rather than mysterious emergent properties. Spikes and sinks are not magic; they are the predictable output of pre-norm + RMSNorm + SwiGLU + mixed-length training, and we now have a menu of replacements for each ingredient.
Each spike-channel output is a quadratic form h^T S_k h; a few rank-one-dominated S_k blow up a single direction s*, which the first token consistently aligns with.

RMSNorm sparsifies, W_K partitions, heads pick a side based on query alignment.

The recipe pre-norm + RMSNorm + SwiGLU + mixed-length training is one path; the menu of alternatives is now mapped.