TECHNICAL OVERVIEW

THE SPIKE, THE SPARSE AND THE SINK: ANATOMY OF MASSIVE ACTIVATIONS AND ATTENTION SINKS

Analysis of Sun, Canziani, LeCun, Zhu (2026), New York University. arXiv preprint. Generated on April 29, 2026.

Abstract
Introduction
Preliminaries (Pre-Norm Transformer)
The Emergence of Massive Activations
The Emergence of Attention Sinks
Anatomy by Ablation
Discussion
Key Takeaways (Summary)

Abstract

Overview

Two strange things keep showing up in large language models, and nobody has been able to explain why they travel together. The first is massive activations: a handful of tokens (almost always the very first token in a sequence, sometimes a period or a newline) carry hidden-state values thousands of times larger than the rest of the network. The second is attention sinks: those same tokens absorb a huge slice of attention probability across many heads, regardless of what the prompt is about. Sun, Canziani, LeCun and Zhu set out to ask whether these are two faces of the same mechanism, or two phenomena that just happen to land on the same tokens.

If you have only worked at the application layer of LLMs, both terms can sound exotic, so a quick orientation. A modern decoder-only Transformer (think Llama or Qwen) is a stack of layers; each layer reads the running hidden state, applies a normalization step (RMSNorm — just rescale each row to unit RMS), and then either does multi-head attention or a SwiGLU feed-forward block. The output is added back to the running state via a residual connection. "Pre-norm" means the normalization happens before the block, not after; this design has become essentially universal because it trains more stably than the original "post-norm" Transformer. The catch — and this paper's central thesis — is that pre-norm has a side effect: the residual stream can accumulate unbounded values, and the model learns to use that.

The headline finding is that the co-occurrence is not a fundamental property of Transformers. It is an artifact of the pre-norm design plus the training recipe. Massive activations and attention sinks serve related but distinct functions: spikes act globally (a few channels carry a near-constant signal across all intermediate layers, behaving like implicit bias parameters baked into the residual stream), while sinks act locally (specific heads route excess attention into the spike token to bias themselves toward short-range dependencies). Crucially, the authors show you can suppress either phenomenon without harming the model. Swap RMSNorm for a sandwich-norm or QKNorm and the spikes vanish but the sinks remain. Add per-head conditional gating and the sinks vanish; in the reported ablations the spikes shrink as well under gating, whereas training only on long-context positions removes the sinks while leaving the spikes in place.

That decoupling matters because both phenomena have caused real engineering pain. Quantization — squeezing model weights and activations into 8-bit or 4-bit integers to make inference fast — gets brittle when a few channels carry values in the thousands. KV-cache eviction strategies need special handling for sink tokens or model quality collapses. If we can choose to keep one and drop the other, the cost-quality frontier widens.

Key Takeaways

Two phenomena, one token, one bridge: massive activations and attention sinks frequently land on the same first / delimiter tokens, but the bridge between them is normalization, not a single underlying mechanism.
Different jobs: spikes operate as global, near-constant implicit parameters in the residual stream; sinks operate as local, per-head attention routers.
Suppressible independently: swap RMSNorm for sandwich-norm or QKNorm and spikes vanish; add conditional gating and sinks vanish. Either is possible without measurable language-modeling cost.

Introduction

Overview

To see why this paper is more than an interpretability curiosity, it helps to walk through how the field arrived here. When the original Transformer was published in 2017, attention was a clean idea: every token computes a softmax over the keys of every other token, and the resulting weights say "attend here this much". Five or six years and many trillions of parameters later, we know the behavior of attention in trained LLMs is not always so clean. Two specific oddities have been documented and re-documented: certain hidden activations are gigantic compared to typical values, and certain tokens (especially the first one in a sequence) absorb most of the attention mass even when they are semantically irrelevant.

The first hints of "outlier dimensions" came from BERT-era work in 2021 (Kovaleva et al.) and from Dettmers et al.'s 2022 quantization paper, which found that 8-bit quantization of GPT-3-scale models only worked if you handled a few outlier channels in higher precision. The second oddity, attention sinks, was named and characterized by Xiao et al. in 2024, who showed that streaming inference works well only when you keep the first few tokens around as "sinks". Sun et al. (2024) brought the two together by showing the outlier tokens are the sink tokens, but framed the link as observational. This paper is the mechanistic follow-up: it asks why they share tokens, and what each phenomenon is for.

The "why now?" is largely about practical pressure. Modern LLM serving lives or dies by quantization (FP16 weights and activations are too expensive at scale, so the field has driven hard into INT8, INT4, and now FP4 / NVFP4). Massive activations sabotage low-precision arithmetic because their magnitudes do not fit into the dynamic range of small integer types (Bondarenko et al., 2021; Wei et al., 2022). At the same time, the success of long-context inference techniques like StreamingLLM (Xiao et al., 2024b) and adaptive KV-cache eviction (Ge et al., 2024) hinges on protecting sink tokens. So if you are an LLM systems engineer in 2026, you have been simultaneously trying to suppress spikes (for quantization) and preserve sinks (for long-context fidelity). It would be very useful to know whether you have to trade off, or whether the two can be controlled independently. This paper says you can.

The paper makes three central claims, which structure the rest of the post:

Normalization is the bridge. Standard pre-norm RMSNorm allows residual values to grow unbounded across layers and then collapses spike tokens into a sparse, nearly-constant vector. That collapse is what enables sinks — without it, you cannot reliably separate sink keys from non-sink keys. Swap the normalizer and the bridge breaks.
Sinks are driven by attention dimensionality and training context length. Larger per-head dimension gives the geometry room to separate sink keys from non-sink keys; mixed short / long context training makes sinks useful as a way to dump attention.
Independent suppression is possible without quality cost. Architectural choices that eliminate spikes do not necessarily destroy sinks, and vice versa. The two are not functionally fused.

A note on terminology used throughout: a spike token is a token where massive activations appear; a spike channel is one of the few hidden-state dimensions where the magnitudes get huge. A sink token is a token that absorbs disproportionate attention; a sink head is an attention head that routes most of its mass into the sink token. Empirically these largely overlap in pre-norm LLMs — that is the puzzle.

Key Takeaways

The puzzle is mechanistic, not observational: previous work established that spikes and sinks share tokens. This paper explains why they share tokens and what each phenomenon does.
Practical stakes are quantization and long-context inference: spikes break low-precision arithmetic; sinks support short-range routing. Engineers want to control them separately.
Three claims drive the paper: normalization is the bridge, sinks live in head-dimension and context-length, and either can be suppressed alone.

Preliminaries (Pre-Norm Transformer)

Overview

Before diving into the mechanism, lock in the architecture the paper is studying. A modern Llama / Qwen-style decoder-only Transformer is a stack of 2L blocks — alternating attention and feed-forward — with a single hidden-state matrix H flowing through. Each block applies the pre-norm + residual rule:

H_{i+1} = H_i + F_i(RMSNorm(H_i))

A few terms to defuse, since not everyone has stared at this equation for years.

RMSNorm: a simpler cousin of LayerNorm. It divides each row of H by its L2 norm and multiplies by sqrt(d_model), which is the same as dividing by the row's root-mean-square (RMS); a learned per-channel scale is then applied. Where LayerNorm subtracts the mean and divides by the std, RMSNorm just divides by the RMS. It is faster and works just as well in practice (Zhang & Sennrich, 2019), which is why the Llama and Qwen families studied here use it.
Pre-norm vs post-norm: in pre-norm (every modern LLM), the normalization happens before the attention or FFN block, and the residual connection adds the un-normalized input. In post-norm (the original Transformer), normalization happens after the block. Pre-norm is much more stable to train at depth, but it has the side effect that the residual stream can accumulate unbounded values across layers — nothing normalizes the running sum until you hit the final norm before the prediction head. That accumulation is where massive activations live.
Multi-head attention: each layer has N_head heads, each of dimension d_head. For each head, Q = H̃ W_Q, K = H̃ W_K, V = H̃ W_V (where H̃ is the RMSNormed input), and the head output is softmax(Q K^T / sqrt(d_head)) V. The per-head outputs are concatenated and projected by W_O.
SwiGLU FFN: the modern feed-forward block is F_ffn(h) = W_down (SiLU(W_gate h) ⊙ (W_up h)), where ⊙ is element-wise product. The gating structure (multiplying two parallel projections) is what makes SwiGLU more expressive than a standard MLP — and, as we will see, what lets it act as a quadratic amplifier when SiLU happens to be near identity.
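The pre-norm rule and RMSNorm defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: `toy_block` is a hypothetical stand-in for F_i (a fixed random linear map), and the sizes are arbitrary. It shows the point the later sections rely on: the block only ever sees the normalized input, so the residual stream itself can grow without the block noticing.

```python
import numpy as np

def rms_norm(h, gain=None, eps=1e-6):
    """Divide each row by its root-mean-square, then apply an optional per-channel gain."""
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    out = h / rms
    return out if gain is None else out * gain

def pre_norm_step(h, block):
    """One pre-norm block: H_{i+1} = H_i + F_i(RMSNorm(H_i))."""
    return h + block(rms_norm(h))

rng = np.random.default_rng(0)
d_model = 16
h = rng.normal(size=(4, d_model))                      # (seq, d_model)

# Hypothetical stand-in for F_i: a fixed random linear map.
W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def toy_block(x):
    return x @ W

# Rescaling the residual stream by 1000x leaves the block's contribution
# unchanged, because the block reads only the normalized input.
update_small = toy_block(rms_norm(h))
update_large = toy_block(rms_norm(1000.0 * h))

# Equivalent formulation: divide by the L2 norm, multiply by sqrt(d_model).
l2_form = h / np.linalg.norm(h, axis=-1, keepdims=True) * np.sqrt(d_model)

h_next = pre_norm_step(h, toy_block)
print("max difference between updates:", np.abs(update_small - update_large).max())
```

Nothing in `pre_norm_step` rescales `h` itself, which is the un-normalized accumulation the text describes.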

The paper focuses on Llama 2 7B / 13B, Llama 3 8B, and Qwen 2.5 / 3 (7B/8B/14B) for the analysis, plus a from-scratch 7B Llama-style model trained for 100B tokens for the ablations.

Key Takeaways

The residual stream is un-normalized between blocks: nothing rescales the running sum until the final layer. That is what lets a few channels grow huge.
Pre-norm is universal but not free: the price of training stability is the quiet accumulation that makes massive activations possible.
SwiGLU is a multiplication of two projections: the gate (W_gate h) ⊙ (W_up h) is the structural reason an FFN can act as a quadratic amplifier, not just a linear one.

The Emergence of Massive Activations

Overview

Here is the load-bearing observation. If you instrument a trained Llama 2 7B and watch the magnitudes of the hidden state across its 64 blocks (32 layers, each contributing one attention block and one feed-forward block), almost every channel is small: a few units, maybe a few tens. But two or three specific channels, on a small set of specific tokens (overwhelmingly the very first token), reach magnitudes in the thousands for a long stretch of intermediate blocks. The trajectory is consistent: a sharp jump up around block 4, a plateau through the middle, a sharp drop back to normal at block 62 (out of 64). The authors call this the "rise – plateau – fall" lifecycle, and they identify three classes of blocks that produce it: step-up blocks that inject the spike, intermediate blocks that propagate it via residual addition, and step-down blocks at the end that cancel it by injecting equal-and-opposite values.
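The lifecycle can be pictured as a running sum of per-block contributions to a single spike channel. The toy below is only a cartoon under stated assumptions: the block indices 4 and 62 are the ones quoted above, while the magnitude 3000 and the small Gaussian updates are illustrative placeholders, not measured values.

```python
import numpy as np

n_blocks = 64
step_up, step_down = 4, 62      # block indices quoted in the text
spike = 3000.0                  # illustrative magnitude only

rng = np.random.default_rng(0)

# Contribution of each block to one spike channel at the first token.
delta = rng.normal(scale=1.0, size=n_blocks)   # ordinary blocks: small updates
delta[step_up] += spike                        # step-up block injects the spike
delta[step_down] -= spike                      # step-down block cancels it

# The residual stream after block i is the running sum of contributions 0..i.
trajectory = np.cumsum(delta)

before = np.abs(trajectory[:step_up]).max()
plateau = trajectory[step_up:step_down]
after = np.abs(trajectory[step_down:]).max()

print(f"before step-up: max |value| = {before:.1f}")
print(f"plateau       : min value   = {plateau.min():.1f}")
print(f"after step-down: max |value| = {after:.1f}")
```

Intermediate blocks do nothing special here; the plateau persists purely because the residual connection carries the injected value forward.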

The mechanism behind step-up is the most surprising part. The SwiGLU feed-forward block, which usually looks like a humdrum nonlinearity, behaves as a directional quadratic amplifier for these tokens. Here is the chain of approximations the paper lays out: when SiLU happens to operate in its near-identity regime (the specific spike tokens land in a part of input space where SiLU(x) ≈ x), the SwiGLU FFN simplifies from W_down (SiLU(W_gate h) ⊙ (W_up h)) to roughly W_down ((W_gate h) ⊙ (W_up h)). That elementwise product of two linear projections is quadratic in h. Each output coordinate k then has the form h^T S_k h for some matrix S_k. For most output coordinates, S_k is unremarkable. But for a few "spike channels", S_k is dominated by a single eigenvalue λ* orders of magnitude larger than the rest of the spectrum. When the input h aligns with the leading eigenvector s*, the quadratic form evaluates to roughly λ* times the squared length of h, so that channel's output is scaled by λ*. That is the moment a hidden value goes from order-10 to order-1000.
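The reduction to a quadratic form can be checked numerically. In the sketch below the weights are random placeholders (so no channel is actually rank-one dominated), and the matrix S_k = W_gate^T diag(W_down[k, :]) W_up, symmetrized, is my own spelling-out of the algebra rather than a formula quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 8, 32
W_gate = rng.normal(size=(d_ffn, d_model))
W_up = rng.normal(size=(d_ffn, d_model))
W_down = rng.normal(size=(d_model, d_ffn))

def ffn_no_silu(h):
    """SwiGLU with SiLU replaced by the identity (the near-identity regime)."""
    return W_down @ ((W_gate @ h) * (W_up @ h))

def quadratic_matrix(k):
    """Symmetric S_k such that output coordinate k equals h^T S_k h."""
    S = W_gate.T @ np.diag(W_down[k]) @ W_up
    return 0.5 * (S + S.T)

h = rng.normal(size=d_model)
k = 3
S_k = quadratic_matrix(k)
direct = ffn_no_silu(h)[k]
as_quadratic_form = h @ S_k @ h

# Feed in an input aligned with the largest-magnitude eigenvector of S_k.
eigvals, eigvecs = np.linalg.eigh(S_k)
top = np.argmax(np.abs(eigvals))
lam, s_star = eigvals[top], eigvecs[:, top]
c = 5.0
aligned_output = ffn_no_silu(c * s_star)[k]    # equals lam * c**2

print(direct, as_quadratic_form)
print(aligned_output, lam * c**2)
```

Because the block is quadratic, doubling the input quadruples every output coordinate, which is why a modest alignment with a high-gain direction can produce a very large value.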

Why does the alignment with s* happen for the first token specifically? Because the first token is in a structurally privileged position. With causal masking, the first token can only attend to itself, so the attention block at position 0 collapses to a fixed linear map W_VO. That map is the same for every prompt. The model has therefore learned a W_VO that consistently steers position-0 representations toward s*. Delimiter tokens (periods, newlines) follow a related but slightly different path: their embeddings are highly aligned with the RMSNorm scale parameters, so RMSNorm gives them an outsized magnitude post-norm, which makes them self-attend disproportionately, which lands them in a near-first-token regime, which triggers the same quadratic amplifier. The pattern holds remarkably broadly: across Llama 2 / 3 and Qwen 2.5 / 3, over 98% of all vocabulary items become spike tokens when placed at position 0 (Table 2 in the paper). The exceptions are rare characters from low-resource scripts whose embeddings stayed close to initialization.
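The claim that position 0 sees a fixed linear map follows directly from the causal mask: the softmax over a single allowed key is always 1. A minimal single-head sketch, with random placeholder weights and `h` standing in for the already-normalized input:

```python
import numpy as np

def causal_attention(h, W_q, W_k, W_v, W_o):
    """Single-head causal self-attention; returns (weights, output)."""
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, (weights @ v) @ W_o

rng = np.random.default_rng(0)
seq, d_model, d_head = 6, 16, 4
h = rng.normal(size=(seq, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_head, d_model))

weights, out = causal_attention(h, W_q, W_k, W_v, W_o)

# At position 0 the query and key projections are irrelevant:
# the output is just h[0] pushed through W_v and then W_o.
fixed_map_output = h[0] @ W_v @ W_o
print(weights[0])
```

The product `W_v @ W_o` plays the role of the W_VO map in the text: it does not depend on the prompt, only on the weights.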

Five concrete properties characterize massive activations, and they all fall out of this mechanism: (i) they appear only in intermediate layers (because of the step-up / step-down injection pattern), (ii) only in a small number of channels (the few S_k matrices with high-gain quadratic forms), (iii) the affected channels spike together (they share the same trigger direction s*), (iv) inter-channel ratios stay nearly fixed (governed by the leading eigenvalues of the shared S_ks), and (v) only a small number of tokens spike (the few that align with s*).

The model spends most of its life in the plateau. The exact step-up and step-down indices vary across families, but the shape stays the same.

Implementation

A minimal sketch of the directional quadratic amplifier — the part of SwiGLU that becomes a quadratic form when SiLU is in its near-identity regime.

import torch
import torch.nn as nn

class DirectionalQuadraticAmplifier(nn.Module):
    """SwiGLU FFN, simplified to the regime that produces massive activations.

    When SiLU(x) is approximately x for the spike token's input, the SwiGLU
    block reduces to W_down ((W_gate h) * (W_up h)), with * element-wise. Each output
    coordinate k is then h^T S_k h, and a few output channels have an S_k
    that is rank-one dominated by a leading eigenvector s*. When h aligns
    with s*, those channels are amplified by the leading eigenvalue lambda*.
    """

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.W_up   = nn.Linear(d_model, d_ffn, bias=False)
        self.W_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h shape: (batch, seq, d_model). Each output channel k will end up
        # behaving like a quadratic form h^T S_k h.
        gate = self.W_gate(h)
        up   = self.W_up(h)

        # SwiGLU normally applies SiLU here; for spike tokens it is
        # near-identity, so the multiplicative gate is the dominant nonlinearity.
        # The element-wise product is what makes the whole block quadratic in h.
        product = gate * up

        # W_down picks linear combinations; spike channels are rows whose
        # corresponding S_k matrix has a single dominant eigenvalue.
        return self.W_down(product)

Key Takeaways

Step-up + step-down is structural: spikes are injected by one or two early blocks and cancelled by a symmetric block at the end. They live in the residual stream, not the activations of any one block.
SwiGLU is a quadratic amplifier when SiLU is near-identity: the gating structure makes the FFN behave like h^T S_k h, and a handful of S_k matrices have rank-one structure that explodes a single direction.
First tokens are structurally privileged: with causal masking the first token sees only itself, so the attention block applies a fixed linear map that the model can train to point at the spike direction. Delimiters reach the same regime through a related route.
The 5 properties of massive activations are corollaries of one mechanism: layer confinement, channel scarcity, joint activation, fixed ratios, and token scarcity all fall out of the rank-one-dominated quadratic forms.

The Emergence of Attention Sinks

Overview

So we have a few tokens carrying values in the thousands across most of the network. Why does that turn into an attention sink? The answer hinges on what RMSNorm does to those tokens. Three properties matter, and they all follow from the spike structure described above.

First, RMSNorm bounds the magnitudes. A spike token entering a block has L2 norm dominated by the few spike channels. After RMSNorm, every coordinate is bounded by sqrt(d_model). So the spike disappears from the normalized input, even though it is still huge in the residual stream.

Second, RMSNorm sparsifies the spike token. Because the L2 norm is dominated by a few channels, dividing by it leaves the spike channels at order sqrt(d_model) while shrinking every other channel toward zero. The post-norm vector is approximately a sparse multi-hot indicator over the spike channel set, and the rest is noise.

Third, and most importantly, RMSNorm makes the spike token near-constant across prompts. Spike channels maintain (almost) fixed inter-channel ratios across different spike tokens, so when you normalize, you get nearly the same vector regardless of which specific spike token you started from. The paper visualizes this with cosine similarity: pre-step-up, spike-token representations vary widely; post-step-up, they collapse to cosine similarity ≈ 1.0.
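The three properties can be seen on synthetic vectors. In this sketch the spike channel indices (3 and 17), the spike values, and the Gaussian "content" in the remaining channels are all made up for illustration; the learned RMSNorm gain is omitted.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d_model = 64
spike_channels = np.array([3, 17])            # hypothetical channel indices
spike_values = np.array([3000.0, -2000.0])    # illustrative, fixed ratio

# Two different tokens before the step-up block.
base_a = rng.normal(size=d_model)
base_b = rng.normal(size=d_model)

# After step-up: same spike ratio, slightly different overall scale.
tok_a, tok_b = base_a.copy(), base_b.copy()
tok_a[spike_channels] += 1.0 * spike_values
tok_b[spike_channels] += 1.2 * spike_values

norm_a, norm_b = rms_norm(tok_a), rms_norm(tok_b)
non_spike = np.delete(norm_a, spike_channels)

cos_before = cosine(base_a, base_b)
cos_after = cosine(norm_a, norm_b)

print("bounded:  max |coord| =", np.abs(norm_a).max(), "<= sqrt(d_model) =", np.sqrt(d_model))
print("sparse:   max non-spike |coord| =", np.abs(non_spike).max())
print("constant: cosine before / after =", cos_before, cos_after)
```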

Now feed those near-constant vectors into the key projection W_K. Spike-token keys collapse to the span of just one or two rows of W_K (the rows corresponding to the spike channels). That is a radical dimensionality reduction relative to d_head — sink keys live in 1-2 dimensions, not 64 or 128. Non-spike-token keys, in contrast, span a much higher-dimensional manifold.
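One way to make the "span of one or two rows of W_K" statement concrete is to measure how much of a key lies outside that span. The helper `off_span_fraction` and all weights below are hypothetical; the row-vector convention K = H̃ W_K matches the preliminaries.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def off_span_fraction(key, basis_rows):
    """Fraction of the key's norm that lies outside the span of basis_rows."""
    coeffs, *_ = np.linalg.lstsq(basis_rows.T, key, rcond=None)
    residual = key - basis_rows.T @ coeffs
    return float(np.linalg.norm(residual) / np.linalg.norm(key))

rng = np.random.default_rng(0)
d_model, d_head = 64, 16
W_K = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
spike_channels = [3, 17]                       # hypothetical channel indices

spike_tok = rng.normal(size=d_model)
spike_tok[spike_channels] += [3000.0, -2000.0] # illustrative spike values
plain_tok = rng.normal(size=d_model)

basis = W_K[spike_channels]                    # the two rows picked out by the spike channels
k_spike = rms_norm(spike_tok) @ W_K
k_plain = rms_norm(plain_tok) @ W_K

frac_spike = off_span_fraction(k_spike, basis)
frac_plain = off_span_fraction(k_plain, basis)
print("spike-token key outside span:", frac_spike)
print("plain-token key outside span:", frac_plain)
```

The spike-token key sits almost entirely inside the two-row span, while an ordinary key spreads across the full head dimension.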

Whether a head becomes a sink head is then a matter of geometry. Each head has a query subspace and a key subspace. If the query subspace happens to align more closely with the (1-2 dim, near-constant) sink-key subspace than with the non-sink-key subspace, the dot products q^T k_sink are systematically larger than q^T k_non-sink, and the softmax dumps mass on the sink. If the alignment is the other way, the head attends semantically. The model has many heads, and the geometry of their query subspaces relative to W_K determines who becomes a sink head.
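The geometric argument can be turned into a toy experiment: place a fixed key at position 0, then compare a head whose queries lean toward that key's direction with one whose queries do not. Everything here is an assumption made for illustration (a one-dimensional sink direction, the scale 8.0, Gaussian queries and keys); the measured quantity is the average attention mass landing on position 0.

```python
import numpy as np

def causal_softmax(scores):
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

def sink_mass(weights):
    """Mean attention mass on position 0, skipping query 0 (which has no choice)."""
    return float(weights[1:, 0].mean())

rng = np.random.default_rng(0)
seq, d_head = 32, 64
sink_dir = np.zeros(d_head)
sink_dir[0] = 1.0                       # hypothetical sink-key direction

keys = rng.normal(size=(seq, d_head))
keys[1:, 0] = 0.0                       # keep ordinary keys out of the sink direction
keys[0] = 8.0 * sink_dir                # position 0 holds the near-constant sink key

def head_sink_mass(alignment):
    """Queries are random content plus `alignment` units along the sink direction."""
    queries = rng.normal(size=(seq, d_head)) + alignment * sink_dir
    scores = queries @ keys.T / np.sqrt(d_head)
    return sink_mass(causal_softmax(scores))

aligned = head_sink_mass(alignment=8.0)
unaligned = head_sink_mass(alignment=0.0)
print("aligned head  :", aligned)
print("unaligned head:", unaligned)
```

The only difference between the two heads is where their queries point, which is the sense in which sink behavior is a matter of geometry rather than semantics.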

This is the key mechanistic insight: sinks emerge because (i) sparsification from normalization confines sink keys to a low-dimensional subspace, and (ii) near-constancy keeps that subspace stable across prompts, which gives the learned W_K something predictable to route around. The result is that W_K partitions the key space cleanly into "sink" and "non-sink" regions, and each query head picks a side.

Key Takeaways

RMSNorm produces three magic properties for spike tokens: bounded magnitudes, sparsity, and near-constancy across prompts. All three matter for sink formation.
Sink keys live in a 1-2 dim subspace: a dramatic dimensional collapse from the full d_head (typically 64 or 128). That is what the learned W_K exploits.
Sink-vs-non-sink heads are a matter of geometry: which side of W_K's partition the head's query subspace falls on, not anything semantic.

Anatomy by Ablation

The previous two sections build a mechanism. The next sections turn that mechanism into a causal claim by intervening — one architectural component at a time — and watching what happens to the spike magnitude, the sink ratio (the fraction of attention mass routed to the sink token), and language-modeling perplexity. The setup: a Llama-style 7B model trained from scratch on the DCLM dataset for 100B tokens, which is enough to reproduce both phenomena. Each ablation modifies a specific architectural choice while keeping the rest fixed.

The picture that emerges across the ablations is the headline of the paper: the two phenomena respond differently to the same interventions. That is the empirical fingerprint of two mechanisms, not one.

Feed-Forward Block Design

Overview

If SwiGLU is the "directional quadratic amplifier" responsible for spikes, what happens when you take it away? The authors compare four feed-forward designs at fixed capacity: SwiGLU (baseline), GeLU (the older standard), a single linear layer, and an attention-only design (no FFN, just more attention layers).

A quick orientation on what those mean for a non-specialist. SwiGLU and GeLU are both nonlinear MLP variants: each applies a nonlinearity (SiLU or GeLU), and SwiGLU additionally multiplies by a second, parallel projection (the gate structure). A GeLU MLP has no such gate. A single linear layer is just W h, the simplest possible block. Attention-only replaces every FFN with another attention layer; the model becomes "all attention all the time".

The result: massive activations and attention sinks emerge in all four configurations. The block design is not a prerequisite. But the magnitude of the spikes varies wildly. SwiGLU and GeLU yield spike magnitudes around 3000-4000. Linear and attention-only yield spike magnitudes around 600-700 — the spikes still exist, but they are much smaller because they have to be built up gradually across many layers instead of in a single block. The gating (SwiGLU) and saturating-nonlinearity (GeLU) designs concentrate amplification within one block; the others spread it.

That is a useful refinement of the earlier story. SwiGLU is not necessary for spikes — any pre-norm architecture will accumulate them — but it is the most efficient amplifier, which is why the spike magnitudes are largest there. The implication for quantization is direct: if your model uses SwiGLU, expect bigger spikes and budget more dynamic range for them.
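To see why a single large channel eats dynamic range, consider the simplest possible scheme: symmetric per-tensor INT8 with the scale set by the largest magnitude. This scheme is a generic textbook baseline chosen for illustration, not the quantizer used in the paper or in any particular serving stack, and the value 3000 is a placeholder.

```python
import numpy as np

def quantize_int8_absmax(x):
    """Symmetric per-tensor INT8 round trip: one scale from the largest magnitude."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

def mean_abs_error(x):
    """Round-trip error on the ordinary channels (channel 0 is excluded)."""
    return float(np.abs(quantize_int8_absmax(x) - x)[1:].mean())

rng = np.random.default_rng(0)
acts = rng.normal(size=4096).astype(np.float32)   # ordinary channels, order 1

with_spike = acts.copy()
with_spike[0] = 3000.0                            # illustrative spike magnitude

err_plain = mean_abs_error(acts)
err_spike = mean_abs_error(with_spike)
print("error without spike:", err_plain)
print("error with spike   :", err_spike)
```

With the spike present, the quantization step is about 3000 / 127, so every order-1 value rounds to zero. Outlier-aware schemes exist precisely to avoid this.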

Key Takeaways

FFN design is a knob, not a switch: every architecture produces spikes and sinks, but SwiGLU and GeLU produce far larger spikes (~3000-4000) than linear or attention-only (~600-700).
Concentration vs accumulation: gated nonlinearities concentrate the amplification into a single block; flatter designs accumulate the same effect over many blocks.
Quantization implication: in these ablations the SwiGLU and GeLU models show the largest dynamic-range pressure. If you can swap to a less concentrated design, you buy precision headroom.

Normalization Configuration

Overview

Now the most consequential ablation. If normalization is the bridge between spikes and sinks, swapping it should disconnect them. The authors test three alternatives.

Sandwich normalization (Ding et al., 2021): an extra RMSNorm at the output of each block, applied to the block's output before it is added back to the residual stream. This bounds how much any one block can add to the residual stream, which should limit the growth that makes spikes possible.
Sandwich-QK normalization: a related variant where input normalization is applied only to queries and keys. This decouples the path that produces sinks (Q/K projections) from the rest of the residual stream.
DynamicTanh (Zhu et al., 2025): replace RMSNorm with an element-wise tanh-like saturating function. This caps each coordinate independently, which means it cannot produce the sparse multi-hot vector that arises from L2 normalization of a peaky distribution.
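The difference between L2-style normalization and an element-wise saturating function shows up clearly on a synthetic spike vector. The sketch assumes the DynamicTanh form gamma * tanh(alpha * x) + beta with illustrative parameter values (alpha = 0.5, gamma = 1, beta = 0); the spike channels and magnitudes are made up.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def dynamic_tanh(h, alpha=0.5, gamma=1.0, beta=0.0):
    """Element-wise saturating replacement for a norm layer."""
    return gamma * np.tanh(alpha * h) + beta

rng = np.random.default_rng(0)
d_model = 64
spike_channels = [3, 17]                    # hypothetical channel indices

base = rng.normal(size=d_model)             # token without a spike
tok = base.copy()
tok[spike_channels] += [3000.0, -2000.0]    # same token with an illustrative spike

ordinary = np.ones(d_model, dtype=bool)
ordinary[spike_channels] = False

rms_out = rms_norm(tok)
dyt_out = dynamic_tanh(tok)

# RMSNorm couples channels: the spike crushes everything else toward zero.
print("RMSNorm, ordinary channels:", np.abs(rms_out[ordinary]).max())
# DynamicTanh treats each coordinate alone: ordinary channels are untouched by the spike.
print("DynamicTanh, ordinary channels:", np.abs(dyt_out[ordinary]).mean())
```

Because `dynamic_tanh` never looks at other coordinates, the ordinary channels come out the same whether or not the spike is present, so the sparse multi-hot vector never forms.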

The headline result: each alternative reduces spike magnitude relative to the baseline (spike 3818, sink ratio 46.0%), but the sink ratio mostly survives. Sandwich norm: spike falls from 3818 to 520 while sinks stay at 44.7%. Sandwich-QK: spike falls to 92, sinks 42.0%. DynamicTanh: spike falls to 153 and sinks increase to 61.0%; the model finds an alternative pathway to designate the first token as a stable reference, without needing huge magnitudes.

That is a striking decoupling. The standard story — "spikes cause sinks" — is too strong. What is true is: pre-norm RMSNorm + unbounded residual + SwiGLU is one path to sinks, but it is not the only path. When you take the magnitudes away, the model finds another way.

Key Takeaways

Pre-norm + unbounded residual is one path to sinks, not the only one: Sandwich, QKNorm, and DynamicTanh all kill spike magnitudes while preserving (or even increasing) sinks.
DynamicTanh is the cleanest decoupling: element-wise saturation is mathematically incapable of producing the sparse multi-hot post-norm vector, yet sinks not only survive but strengthen.
The implication for low-precision serving: you can keep sinks (good for streaming inference) and drop spikes (good for INT4 / FP4 quantization) by changing the normalizer at training time. This is the most actionable result in the paper.

Attention Head Settings

Overview

If the geometric story is right — sinks emerge when the per-head subspace is large enough to cleanly separate sink keys from non-sink keys — then the attention head dimension d_head should be a major lever. The authors run a clean sweep.

A quick orientation. In multi-head attention, the total attention capacity is split into N_head heads of dimension d_head. The per-head Q/K/V projections are d_model -> d_head matrices. Larger d_head gives each head more "room" geometrically. Modern Llama-style models use d_head = 128 and N_head = 32 for a 4096-dim model.

The ablation: hold N_head = 32 fixed, vary d_head from 8 to 128. Result: a monotonic rise in sink ratio (4.1% at d_head=8 to 46.0% at d_head=128) and spike magnitude (291 to 3818). Tiny heads barely form sinks, because the sink-key subspace and non-sink-key subspace cannot be cleanly separated in low dimensions.

A second sweep is more illuminating: hold the total attention capacity d_head x N_head = 4096 fixed, and re-allocate it. With 8/512 (tiny heads, lots of them) the sink ratio is 11%; with 128/32 (Llama-style) it is 46%; with 256/16 (giant heads, fewer of them) it is 52%. Concentrating capacity into fewer, larger heads strengthens sink behavior, even at fixed total capacity.

This is a strong piece of evidence for the geometric mechanism. Sink formation is fundamentally about whether the head has enough dimensions to partition, not about how much total capacity the model has.

Key Takeaways

d_head is the dominant architectural lever for sinks: monotonic rise from 4.1% at d_head=8 to 46% at d_head=128.
Concentration beats distribution at fixed capacity: moving from many-tiny to few-large heads strengthens sinks even when total capacity is held constant.
The geometric mechanism is the right model: sinks need room in the per-head subspace to separate sink-keys from non-sink-keys. Total capacity is not the binding constraint; per-head capacity is.

Gated Attention

Overview

A second-order question: if sinks are useful for routing, what happens when we give the model an explicit routing mechanism instead? The answer is striking. Following Qiu et al. (2025), the authors test gated attention variants — the head output gets multiplied by a learned gate.

The taxonomy that matters is whether the gate is conditional (a function of the current hidden representation, so it changes prompt-by-prompt) or unconditional / static (fixed at the per-head, per-channel, or per-position level). Within conditional, you can have per-channel gates (one gate per output channel), per-head gates, or single-token gates.
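The taxonomy can be sketched as three small functions. The sigmoid gate, the weight shapes, and the function names are assumptions made for illustration; the source only states that the head output is multiplied by a learned gate that is either a function of the current hidden state or fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditional_head_gate(h, head_out, W_g):
    """Per-head gate computed from the current hidden state (changes per token)."""
    gate = sigmoid(h @ W_g)                                 # (seq, n_head)
    return head_out * gate[:, :, None], gate

def conditional_channel_gate(h, head_out, W_g):
    """Per-channel gate: one value for every output channel of every head."""
    seq, n_head, d_head = head_out.shape
    gate = sigmoid(h @ W_g).reshape(seq, n_head, d_head)
    return head_out * gate, gate

def static_head_gate(head_out, g_param):
    """Unconditional gate: a learned per-head value shared by all tokens."""
    gate = sigmoid(g_param)                                 # (n_head,)
    return head_out * gate[None, :, None], gate

rng = np.random.default_rng(0)
seq, n_head, d_head, d_model = 5, 4, 8, 32
h = rng.normal(size=(seq, d_model))
head_out = rng.normal(size=(seq, n_head, d_head))           # per-head attention outputs

W_head = rng.normal(size=(d_model, n_head)) / np.sqrt(d_model)
W_chan = rng.normal(size=(d_model, n_head * d_head)) / np.sqrt(d_model)
g_static = rng.normal(size=n_head)

out_h, gate_h = conditional_head_gate(h, head_out, W_head)
out_c, gate_c = conditional_channel_gate(h, head_out, W_chan)
out_s, gate_s = static_head_gate(head_out, g_static)
print(gate_h.shape, gate_c.shape, gate_s.shape)
```

A conditional gate near zero switches a head off for one token without touching the attention pattern, which is the job the text attributes to routing mass into the first token.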

Result: conditional gating eliminates sinks. Per-channel conditional gate yields a 4.5% sink ratio (down from 46%) with spike magnitude 202. Per-head: 6.4% sink, spike 186. Single-token: 31% — partial. Static gates (positional or token-embedding based) preserve sink behavior almost fully (~31-44%).

The interpretation: attention sinks are a learned input-conditioned gating mechanism. When the model lacks a built-in dynamic gate, it improvises one by routing excess attention into the spike token, effectively zeroing out unwanted heads. When you give it a real gate, the improvisation becomes redundant and disappears.

This connects sinks to a larger architectural conversation. Multiple recent designs — gated linear units, gated state-space models, mixture-of-experts routers — build dynamic input-conditioned routing as an explicit primitive. The sink phenomenon is a hint that vanilla self-attention has been silently doing a version of this, with the first token playing the role of a "this head is off" signal.

Key Takeaways

Sinks are implicit gates: when an explicit input-conditioned gate is added, sink behavior disappears with no perplexity cost.
Conditional vs static is the dividing line: unconditional or static-signal gates do not replace sinks. The model needs a dynamic gating signal to free up the first token.
A unifying view of recent architectures: gated attention, gated SSMs, and similar explicit-gating designs are doing what attention sinks have been silently doing. Once explicit, the implicit version is vestigial.

Training Context Length

Overview

The last ablation tests the hypothesis that sinks are not just architectural — they are useful for short-range prediction. Xiao et al. (2024a) noted that sink heads tend to attend to nearby tokens of the query; the authors of this paper test that systematically by varying the training context-length distribution. Concretely, they change the range of sequence positions over which the loss is computed during training. Configurations are reported as min/max — e.g. 1/4096 means losses are computed at positions 1 to 4096.
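The min/max notation amounts to a mask over sequence positions applied to the per-token loss. The sketch below assumes positions are 1-indexed and inclusive, which is how the 1/4096 example reads; the helper names are hypothetical.

```python
import numpy as np

def position_loss_mask(seq_len, min_pos, max_pos):
    """1.0 where the loss is counted; positions are 1-indexed and inclusive."""
    positions = np.arange(1, seq_len + 1)
    return ((positions >= min_pos) & (positions <= max_pos)).astype(np.float32)

def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over the counted positions only."""
    return float((per_token_loss * mask).sum() / mask.sum())

mixed = position_loss_mask(4096, 1, 4096)        # the "1/4096" configuration
long_only = position_loss_mask(4096, 1024, 4096) # the "1024/4096" configuration

# Tiny worked example: 8 positions, loss counted only at positions 5..8.
toy_loss = np.arange(1, 9, dtype=np.float32)
toy_mask = position_loss_mask(8, 5, 8)
toy_mean = masked_mean_loss(toy_loss, toy_mask)

print(int(mixed.sum()), int(long_only.sum()), toy_mean)
```

In the long-only configurations the model still reads the early tokens as context; it is simply never graded on predicting them.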

Result: when training includes short sequences (1/256, 1/1024, 1/4096), the sink ratio is stable at ~42-46%. When short contexts are removed and only long-range positions are optimized (1024/4096, 2048/4096, 2048/6144), the sink ratio collapses dramatically — 13%, 1.2%, 5.8% respectively. Spike magnitudes go up in some of these long-only configs (38000+ at 1024/4096), but the sinks have already disengaged.

The interpretation: sinks exist to support short-range prediction. In a mixed-length training regime, the first token serves as a cheap, universal "ignore the far context" reference for short-context examples. When the model never has to do short-context prediction, it never learns to use that reference, and sinks do not form. This is a counterintuitive but empirically robust result — a phenomenon we usually frame as architectural turns out to be partly training-data-distribution-driven.

Key Takeaways

Sinks are partly a data-distribution phenomenon: training only on long-context positions collapses sinks to ~1-13%.
The first token is a cheap universal "ignore far context" reference: in mixed-length training, it lets the model ignore distant tokens for short-range prediction.
Architectural and data-side mitigations are both available: if you want to avoid sinks, you can change the architecture (gated attention, alternative norm) or change the training context distribution.

Discussion

Overview

Pulled together, the picture is coherent. Pre-norm Transformers, as currently trained, have a quirky internal logic that unfolds in steps:

The first token, which can only attend to itself, sits in a structurally privileged position.
The model learns to push its representation in a direction s* that the SwiGLU FFN can amplify quadratically.
That amplification dumps massive values into a few specific channels, which persist through the residual stream as approximately constant signals (implicit parameters, not data).
RMSNorm then transforms those large values into a sparse, near-constant input vector.
The first-token keys cluster in a tiny subspace, and the learned key projection W_K partitions the key space accordingly.
Some heads orient their queries toward the sink subspace and become sink heads, which is useful because dumping attention into the first token is a cheap way to bias toward short-range prediction in mixed-length training.
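The RMSNorm step in this chain can be checked on a toy example. The sketch below uses RMSNorm without a learned gain, a 64-dimensional hidden state, and a spike of 3000 on channel 5; all three are illustrative assumptions, not values from the paper. Two otherwise unrelated vectors become nearly identical after normalization once they share the spike.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: rescale x to unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def with_spike(base, channel, magnitude):
    """Copy base and overwrite one channel with a massive value."""
    out = list(base)
    out[channel] = magnitude
    return out


d = 64
token_a = [math.sin(i) for i in range(d)]       # two unrelated "ordinary" states
token_b = [math.cos(3 * i) for i in range(d)]

plain_similarity = cosine(rms_norm(token_a), rms_norm(token_b))

spiked_a = rms_norm(with_spike(token_a, 5, 3000.0))
spiked_b = rms_norm(with_spike(token_b, 5, 3000.0))
spiked_similarity = cosine(spiked_a, spiked_b)

print(plain_similarity, spiked_similarity, spiked_a[5])
```

After normalization the spike channel sits close to sqrt(d) and every other channel is close to zero, so whatever the token originally encoded is almost erased. That is the sense in which the normalized input is sparse and near-constant.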

So the spike-and-sink pair is not one phenomenon; it is two phenomena tied together by an architectural choice (pre-norm + RMSNorm) and a training distribution choice (mixed-length context). Each can be undone:

Suppress spikes, keep sinks: sandwich norm, QKNorm, DynamicTanh.
Suppress sinks, keep spikes: per-channel or per-head conditional gating; long-only training distribution.
Suppress both: combining the above.
In every case, language-modeling perplexity is preserved.

That last point is the load-bearing engineering claim. If the spike-and-sink coupling were doing something necessary, suppressing it would damage the model. It does not. The overlap of spikes and sinks in standard pretrained LLMs is best understood as a byproduct of the default normalization-and-training recipe, not a reflection of any underlying functional necessity.

For practitioners, the main implications:

Quantization gets easier with the right normalizer. A model trained with sandwich norm or QKNorm has 6-40x smaller spike magnitudes, which translates directly into lower precision loss in INT4 / FP4 quantization without specialized outlier handling.
KV-cache strategies that rely on sinks need not break under spike suppression. Sinks survive most spike-killing interventions, including DynamicTanh (where they actually strengthen).
Long-context-only training is a different regime. If you fine-tune a base model exclusively on very long sequences, expect attention sinks to fade — which may or may not be what you want depending on your inference pipeline.
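The quantization point can be illustrated with a toy calculation. The sketch below assumes the simplest possible scheme, symmetric per-tensor absmax INT4 with a single scale, which is not necessarily what a production pipeline uses (per-group scales and outlier-aware methods behave differently). The values and the 3000 spike are made up for illustration.

```python
def quantize_absmax_int4(x):
    """Symmetric per-tensor INT4: one scale, integer codes clipped to [-8, 7].

    Returns the dequantized values so they can be compared with the input.
    """
    scale = max(abs(v) for v in x) / 7.0
    if scale == 0.0:
        return list(x)
    codes = [max(-8, min(7, round(v / scale))) for v in x]
    return [c * scale for c in codes]


def mean_abs_error(original, restored, indices):
    """Mean absolute reconstruction error over the chosen indices."""
    return sum(abs(original[i] - restored[i]) for i in indices) / len(indices)


small = [0.9, -0.7, 0.5, -0.3, 0.1, 0.8, -0.6, 0.4]   # ordinary activations
spiked = small + [3000.0]                              # same values plus one spike
ordinary = range(len(small))                           # score only the ordinary entries

err_plain = mean_abs_error(small, quantize_absmax_int4(small), ordinary)
err_spiked = mean_abs_error(spiked, quantize_absmax_int4(spiked), ordinary)
print(err_plain, err_spiked)
```

With the spike present, the single scale is set by the outlier and every ordinary value rounds to zero, so the error on those entries equals their full magnitude. How much of that is recovered by a 6-40x smaller spike depends on the quantization scheme, which this sketch does not model.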

The bigger picture: this paper continues a useful trend of treating "weird LLM behaviors" as architectural artifacts to be designed around, rather than mysterious emergent properties. Spikes and sinks are not magic; they are the predictable output of pre-norm + RMSNorm + SwiGLU + mixed-length training, and we now have a menu of replacements for each ingredient.

Concept Diagram

Key Takeaways

Two phenomena, two levers: normalization controls spikes; gated attention and short-context training control sinks. They are not knobs on the same machine.
No language-modeling cost: every intervention examined preserves perplexity. The spike-and-sink coupling is not load-bearing for next-token prediction.
Engineering frontier widens: low-precision serving (which wants no spikes) and streaming inference (which wants sinks) are no longer in tension once you pick the right combination of normalizer and gating.

Key Takeaways (Summary)

Massive activations and attention sinks share tokens but not mechanisms: they co-occur because pre-norm + RMSNorm bridges them, not because one causes the other.
Spikes are a directional quadratic amplifier story: SwiGLU's gated structure makes the FFN behave like h^T S_k h; a few rank-one-dominated S_k blow up a single direction s*, which the first token consistently aligns with.
Sinks are a geometric story: RMSNorm collapses spike tokens to a sparse near-constant vector, so sink keys live in a 1-2 dimensional subspace. W_K partitions the key space, and each head picks a side based on how its queries align with that subspace.
The engineering payoff is real: swap the normalizer to suppress spikes (good for INT4/FP4 quantization) while keeping sinks (good for streaming inference / KV-cache).
Sinks are also a training-distribution phenomenon: in long-only training, they collapse to ~1-13%. Mixed-length training is what makes the first-token sink useful in the first place.
The bigger picture: these are designable artifacts, not emergent magic. The recipe pre-norm + RMSNorm + SwiGLU + mixed-length training is one path; the menu of alternatives is now mapped.
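The quadratic-form reading of SwiGLU in the summary can be checked numerically. The sketch below replaces the SiLU gate with the identity, which is only a reasonable approximation where the gate pre-activations are strongly positive; that substitution, the construction of S_k as a weighted sum of outer products, and all names and sizes are assumptions made for illustration rather than the paper's exact definitions.

```python
import random


def ffn_channel_linear_gate(h, w_gate, w_up, w_down_row):
    """One output channel of a SwiGLU-style FFN with the gate activation
    replaced by the identity: sum_j w_down[j] * (w_gate[j] . h) * (w_up[j] . h)."""
    total = 0.0
    for j, coeff in enumerate(w_down_row):
        gate = sum(g * x for g, x in zip(w_gate[j], h))
        up = sum(u * x for u, x in zip(w_up[j], h))
        total += coeff * gate * up
    return total


def quadratic_matrix(w_gate, w_up, w_down_row):
    """Build S_k = sum_j w_down[j] * outer(w_gate[j], w_up[j])."""
    d = len(w_gate[0])
    s = [[0.0] * d for _ in range(d)]
    for j, coeff in enumerate(w_down_row):
        for a in range(d):
            for b in range(d):
                s[a][b] += coeff * w_gate[j][a] * w_up[j][b]
    return s


def quadratic_form(h, s):
    """Evaluate h^T S h."""
    n = len(h)
    return sum(h[a] * s[a][b] * h[b] for a in range(n) for b in range(n))


random.seed(0)
d, m = 4, 6                                   # hidden size and FFN width (toy)
w_gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
w_up = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
w_down_row = [random.gauss(0, 1) for _ in range(m)]   # row k of the down projection
h = [random.gauss(0, 1) for _ in range(d)]

direct = ffn_channel_linear_gate(h, w_gate, w_up, w_down_row)
via_s_k = quadratic_form(h, quadratic_matrix(w_gate, w_up, w_down_row))
doubled = ffn_channel_linear_gate([2.0 * v for v in h], w_gate, w_up, w_down_row)
print(direct, via_s_k, doubled)
```

Doubling the input quadruples the output in this regime, which is the quadratic amplification the summary refers to. When S_k is dominated by a single rank-one term, inputs aligned with that term's direction receive most of the gain.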
