This research investigates the mathematical properties of attention matrices in transformer models by analyzing their eigenvalues, singular values, and norms. We show that attention matrices undergo progressive eigenvalue concentration during training — shifting from uniform spectral distributions to sparse dominant spectra. Singular values follow a power-law decay σᵢ ≈ C/i^α with layer-depth-dependent exponent α ∈ [0.5, 1.0]. We further establish that the Frobenius norm scales as O(√n) and the spectral norm as O(log n), providing theoretical justification for long-context numerical stability. Finally, the effective rank of attention — bounded by the entropy of the attention distribution — reveals a fundamental trade-off between sparsity, expressivity, and generalization.
Attention matrices are not symmetric in general, so we consider their eigenvalues as complex numbers. The softmax normalization ensures row-stochastic structure, which constrains the spectrum: all eigenvalues lie in or on the unit disk in the complex plane, and 1 is always an eigenvalue (the Perron eigenvalue).
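These spectral constraints are easy to verify numerically. The sketch below, using a hypothetical attention matrix built from random logits (a stand-in for a trained layer), checks that the spectrum of a row-stochastic matrix stays in the closed unit disk and always contains 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Hypothetical attention matrix: softmax over random logits (row-stochastic).
logits = rng.normal(size=(n, n))
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)       # each row sums to 1

eigvals = np.linalg.eigvals(A)          # complex in general: A is not symmetric

# All eigenvalues lie in the closed unit disk, and 1 is always an
# eigenvalue, since A @ ones = ones for any row-stochastic matrix.
assert np.all(np.abs(eigvals) <= 1 + 1e-8)
assert np.min(np.abs(eigvals - 1.0)) < 1e-8
```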
As training progresses, we observe a characteristic three-phase structure: near-uniform coverage of the unit disk, progressive concentration of spectral mass onto a few large real eigenvalues, and collapse of the remaining eigenvalues toward zero. We model this evolution as exponential relaxation:

λᵢ(t) ≈ λᵢ(∞) + [λᵢ(0) − λᵢ(∞)] · e^(−t/τᵢ),
where τᵢ is the characteristic convergence time for eigenvalue i, with τ₁ < τ₂ < ··· — dominant modes converge faster. Deeper layers exhibit accelerated concentration due to more aggressive information compression.
For an attention matrix A ∈ ℝⁿˣⁿ, its singular value decomposition A = UΣVᵀ reveals the informational skeleton of the matrix. We establish empirically and theoretically that singular values follow a power law:

σᵢ ≈ C · i^(−α),
The exponent α increases with layer depth: shallow layers exhibit α ≈ 0.5 (slower decay, richer rank), while deep layers reach α ≈ 1.0 (aggressive compression). On a log-log plot, this appears as a straight line with slope −α.
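The fitting procedure can be sketched on synthetic data. The matrix below is constructed with an exact power-law spectrum (a stand-in for a trained attention matrix, which follows the law only approximately), and an ordinary least-squares fit in log-log coordinates recovers the exponent:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha_true, C = 256, 0.75, 2.0

# Synthetic matrix with spectrum sigma_i = C * i^(-alpha): random orthogonal
# factors around a prescribed diagonal of singular values.
i = np.arange(1, n + 1)
sigma_true = C * i ** (-alpha_true)
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = U @ np.diag(sigma_true) @ V.T

sigma = np.linalg.svd(A, compute_uv=False)

# log sigma_i = log C - alpha * log i: a line of slope -alpha on log-log axes.
slope, intercept = np.polyfit(np.log(i), np.log(sigma), 1)
print(round(-slope, 2))             # 0.75 (recovered alpha)
print(round(np.exp(intercept), 2))  # 2.0  (recovered constant C)
```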
For a power-law spectrum σᵢ = C · i^(−α), the effective rank (defined as r_eff = (Σσᵢ)² / Σσᵢ²) satisfies:

r_eff ≈ n^(2(1−α)) / ((1−α)² · ζ(2α)),  1/2 < α < 1,

since Σᵢ σᵢ ≈ C · n^(1−α)/(1−α) while Σᵢ σᵢ² ≈ C² · ζ(2α) converges. Effective rank therefore shrinks as α grows with depth: at α = 3/4 it scales as O(√n), and at α = 1 the harmonic sums give r_eff ≈ (ln n)²/ζ(2) = O(log²n), confirming information compression with depth.
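The effective-rank formula can be evaluated numerically for power-law spectra (the n and α values below are illustrative):

```python
import numpy as np

def effective_rank(sigma):
    # r_eff = (sum sigma_i)^2 / (sum sigma_i^2); scale-invariant in sigma.
    return sigma.sum() ** 2 / (sigma ** 2).sum()

n = 100_000
i = np.arange(1, n + 1)

r_half = effective_rank(i ** -0.5)   # alpha = 1/2: near-linear, ~ n / log n
r_mid = effective_rank(i ** -0.75)   # alpha = 3/4: O(sqrt(n))
r_one = effective_rank(i ** -1.0)    # alpha = 1:   O(log^2 n)

print(r_mid / np.sqrt(n))       # an O(1) constant
print(r_one / np.log(n) ** 2)   # an O(1) constant, far below sqrt(n) scaling
```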
A central concern in transformer training is numerical stability. The spectral norm of the attention matrix directly controls signal amplification: for input x, the output satisfies ‖Ax‖ ≤ ‖A‖_spec · ‖x‖. If this norm grows unboundedly with sequence length, long-context inference becomes numerically hazardous.
If the spectral norm of each attention matrix satisfies ‖A‖_spec ≤ C < 2, transformer training is stable. Furthermore, softmax normalization inherently guarantees:

‖A‖_F ≤ √n = O(√n),  ‖A‖_spec = O(log n),

where the Frobenius bound follows because each row of A is a probability vector, so its ℓ₂ norm is at most 1. Both norms grow sublinearly in the sequence length n, providing the theoretical foundation for long-context stability.
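A minimal numerical check of the Frobenius bound, again using a hypothetical attention matrix built from random logits (the O(log n) spectral-norm behavior is an empirical claim and is only printed, not tested, here):

```python
import numpy as np

rng = np.random.default_rng(2)

for n in (64, 256, 1024):
    logits = rng.normal(size=(n, n))
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)    # row-stochastic

    fro = np.linalg.norm(A, "fro")       # Frobenius norm
    spec = np.linalg.norm(A, 2)          # spectral norm (largest singular value)

    # ||A||_F <= sqrt(n): each row's squared L2 norm is at most 1.
    assert fro <= np.sqrt(n) + 1e-9
    # ||A||_spec >= spectral radius = 1 for any row-stochastic matrix.
    assert spec >= 1 - 1e-9
    print(n, round(fro, 2), round(spec, 2))
```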
Attention sparsity, the fraction of near-zero attention weights, is intimately linked to spectral structure. A fully dense attention matrix distributes signal uniformly; a sparse one concentrates it. We formalize this duality via the entropy bound:

r_eff(A) ≤ exp(H(A)),

where H denotes the entropy of the attention distribution. Maximum entropy (uniform attention) maximizes the bound at exp(H) = n; minimum entropy (one-hot attention) collapses it to exp(H) = 1, i.e., rank 1. Sparsity and rank are thus dual quantities, mediated by attention entropy.
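The two entropy extremes in miniature (a toy sketch: H here is the entropy of a single attention row, and exp(H) its effective support size):

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy of an attention row; eps guards against log(0).
    return -np.sum(p * np.log(p + eps))

n = 8
uniform = np.full(n, 1.0 / n)   # maximum entropy: H = ln n
one_hot = np.eye(n)[0]          # minimum entropy: H = 0

print(np.exp(entropy(uniform)))  # ~8.0: effective support of all n tokens
print(np.exp(entropy(one_hot)))  # ~1.0: effective support of one token
```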
Expressivity, the capacity of a transformer to represent complex functions, is ultimately bounded by the rank of its attention matrices: since rank(AX) ≤ rank(A), a rank-r attention matrix can mix the token representations X only within an r-dimensional subspace.
This reveals a fundamental tension: sparse attention (low rank) offers efficiency and interpretability at the cost of expressivity; dense attention (high rank) enables complex reasoning but resists regularization.
Well-generalized models exhibit lower effective rank than overfit models. Overfitting manifests spectrally as rank inflation — the model attends to noise, introducing spurious singular components. Monitoring effective rank during training can serve as an early indicator of overfitting, and regularization that implicitly suppresses extra singular values improves generalization.
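One way to operationalize this monitoring (a hypothetical hook, not prescribed by the text): compute the effective rank of each layer's attention at every evaluation step and watch for sustained upward drift, the rank-inflation signature described above.

```python
import numpy as np

def effective_rank(A):
    # r_eff = (sum sigma_i)^2 / (sum sigma_i^2), a smooth proxy for rank.
    s = np.linalg.svd(A, compute_uv=False)
    return s.sum() ** 2 / (s ** 2).sum()

# Sanity checks: the identity has full effective rank; a rank-1 matrix has 1.
assert np.isclose(effective_rank(np.eye(16)), 16.0)
assert np.isclose(effective_rank(np.ones((16, 16))), 1.0)

# Hypothetical training hook: record per-layer effective ranks each eval step;
# a sustained rise across steps flags possible overfitting.
def record_ranks(attn_per_layer, history):
    ranks = [effective_rank(A) for A in attn_per_layer]
    history.append(ranks)
    return ranks
```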
Classification, retrieval, simple QA — dominated by a few salient tokens. High sparsity is sufficient and improves efficiency and interpretability without sacrificing accuracy.
Multi-step reasoning, complex generation — require richer token interactions. Moderate sparsity preserves the expressivity needed for these tasks while maintaining training stability.