Research & Case Study
Journal of Theoretical Deep Learning  ·  Vol. 2, 2025  ·  Interactive Paper

Spectral Properties of Attention Matrices
in Transformer Models

Eigenvalue concentration, singular value decay, norm scaling, and expressivity bounds
Vishwajeet Adkine
Keywords: Eigenvalue Analysis · Norm Scaling · Sparsity–Spectrum Duality · Expressivity Bounds

This research investigates the mathematical properties of attention matrices in transformer models by analyzing their eigenvalues, singular values, and norms. We show that attention matrices undergo progressive eigenvalue concentration during training — shifting from uniform spectral distributions to sparse dominant spectra. Singular values follow a power-law decay σᵢ ≈ C/i^α with layer-depth-dependent exponent α ∈ [0.5, 1.0]. We further establish that the Frobenius norm scales as O(√n) and the spectral norm as O(log n), providing theoretical justification for long-context numerical stability. Finally, the effective rank of attention — bounded by the entropy of the attention distribution — reveals a fundamental trade-off between sparsity, expressivity, and generalization.

§1 Major Findings

Finding 01
Eigenvalue Concentration
Dominant eigenvalues emerge and stabilize during training while others decay toward zero — revealing progressive hierarchical feature learning.
Finding 02
Power-Law Singular Decay
Singular values follow σᵢ ≈ C/i^α, α ∈ [0.5, 1.0]. Deeper layers show faster decay, indicating more selective attention.
Finding 03
Sublinear Norm Scaling
Frobenius norm grows as O(√n), spectral norm as O(log n) — explaining why long-context transformers remain numerically stable.
Finding 04
Sparsity–Spectrum Duality
High attention sparsity concentrates the spectrum into few large eigenvalues, constraining effective rank and defining an expressivity–efficiency trade-off.

§2 Eigenvalue Concentration During Training

Attention matrices are not symmetric in general, so we consider their eigenvalues as complex numbers. The softmax normalization ensures row-stochastic structure, which constrains the spectrum: all eigenvalues lie in the closed unit disk of the complex plane, and 1 is always an eigenvalue (the Perron eigenvalue, associated with the all-ones eigenvector).
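These spectral constraints are easy to verify numerically. The sketch below (our own minimal check, not the paper's code) builds a softmax attention matrix from random logits and confirms both properties:

```python
import numpy as np

# Minimal check of the spectral constraints of row-stochastic attention:
# spectral radius 1 (the Perron eigenvalue) and all eigenvalues inside
# the closed unit disk.  Logits here are random, not from a trained model.
rng = np.random.default_rng(0)
n = 64
logits = rng.normal(size=(n, n))
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)      # softmax over each row

eigvals = np.linalg.eigvals(A)

# A @ ones = ones, so 1 is always an eigenvalue and it is the largest in modulus.
assert np.isclose(np.max(np.abs(eigvals)), 1.0)
# Every eigenvalue lies in the closed unit disk.
assert np.all(np.abs(eigvals) <= 1.0 + 1e-10)
```

The same check applies to attention matrices extracted from a trained model, since softmax guarantees row-stochasticity regardless of the logits.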

As training progresses, we observe a characteristic three-phase structure. The spectrum transitions from uniform coverage of the unit disk toward concentration of mass near a few large real eigenvalues, with the remainder collapsing toward zero. We model this evolution as:

λᵢ(t) ≈ λᵢ(∞) · (1 − e^{−t/τᵢ}) + λᵢ(0) · e^{−t/τᵢ} (1)

where τᵢ is the characteristic convergence time for eigenvalue i, with τ₁ < τ₂ < ··· — dominant modes converge faster. Deeper layers exhibit accelerated concentration due to more aggressive information compression.
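Eq. (1) can be made concrete with a small worked example. The initial spectrum, limiting spectrum, and time constants below are hypothetical illustrative values, chosen only to show that the mode with the smallest τ converges first:

```python
import numpy as np

# Illustrative evaluation of Eq. (1).  lam_0, lam_inf, and tau are
# hypothetical values (tau_1 < tau_2 < tau_3), not fitted to any model.
lam_0   = np.array([0.30, 0.25, 0.20])   # diffuse spectrum at t = 0
lam_inf = np.array([0.95, 0.40, 0.02])   # concentrated spectrum at convergence
tau     = np.array([1e3, 5e3, 2e4])      # characteristic convergence times

def lam(t):
    """Eq. (1): lam_i(t) = lam_i(inf)*(1 - exp(-t/tau_i)) + lam_i(0)*exp(-t/tau_i)."""
    decay = np.exp(-t / tau)
    return lam_inf * (1.0 - decay) + lam_0 * decay

print(lam(0))      # exactly the initial spectrum lam_0
print(lam(5e3))    # dominant mode (smallest tau) is already near its limit
```

At t = 5τ₁ the dominant eigenvalue has covered over 99% of its trajectory while the slowest mode has barely moved, reproducing the hierarchical convergence described above.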

Figure 1 — Eigenvalue Spectrum Evolution During Training
[Interactive figure: complex-plane eigenvalue scatter; controls for training step and layer; readouts for top |λ|, effective rank, concentration, and training phase.]
Figure 1. Eigenvalue spectrum (complex plane) at varying training steps. Observe the collapse from a diffuse cloud toward dominant real eigenvalues as training proceeds. The unit circle marks the stability boundary. Deeper layers concentrate faster.

§3 Singular Value Power-Law Decay

For an attention matrix A ∈ ℝⁿˣⁿ, its singular value decomposition A = UΣVᵀ reveals the informational skeleton of the matrix. We establish empirically and theoretically that singular values follow a power-law:

σᵢ ≈ C · i^{−α}, α ∈ [0.5, 1.0] (2)

The exponent α increases with layer depth: shallow layers exhibit α ≈ 0.5 (slower decay, richer rank), while deep layers reach α ≈ 1.0 (aggressive compression). On a log-log plot, this appears as a straight line with slope −α.
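The exponent can be estimated exactly as described: linear regression on the log-log spectrum. The helper below is our own sketch, fit here to synthetic singular values with a known α; on a real model one would pass the singular values of an attention matrix instead:

```python
import numpy as np

# Estimate the decay exponent alpha of Eq. (2) by linear regression of
# log(sigma_i) on log(i); the slope of the log-log line is -alpha.
def fit_alpha(sigma):
    i = np.arange(1, len(sigma) + 1)
    slope, _ = np.polyfit(np.log(i), np.log(sigma), 1)
    return -slope

# Synthetic spectrum with C = 3, alpha = 0.8 (noiseless, so the fit is exact).
sigma = 3.0 * np.arange(1, 129) ** -0.8
print(fit_alpha(sigma))   # -> 0.8
```

On noisy empirical spectra the same fit applies, though the tail (smallest singular values) is often dominated by numerical noise and is typically excluded from the regression.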

Theorem — Effective Rank Bound

For a power-law spectrum σᵢ = C · i^{−α} with 1/2 < α < 1, the partial sum Σσᵢ grows as C · n^{1−α}/(1−α) while Σσᵢ² converges to C² · ζ(2α). The effective rank (defined as r_eff = (Σσᵢ)² / Σσᵢ²) therefore satisfies:

r_eff ≈ n^{2(1−α)} / ((1−α)² · ζ(2α)) (3)

At α = 3/4, effective rank scales as O(√n); as α → 1 (deeper layers), the polynomial exponent 2(1−α) vanishes and r_eff collapses toward polylogarithmic growth, confirming information compression with depth.
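The scaling is easy to check numerically from the definition alone: for σᵢ = i^{−α}, Σσᵢ ≈ n^{1−α}/(1−α) while Σσᵢ² converges for α > 1/2, so r_eff grows like n^{2(1−α)}. With α = 0.75 that exponent is 1/2, so quadrupling n should roughly double the effective rank:

```python
import numpy as np

# Numerical check of effective-rank scaling for a power-law spectrum.
# With alpha = 0.75, r_eff grows like n^{2(1 - alpha)} = n^0.5, so
# quadrupling n should roughly double r_eff.
def effective_rank(sigma):
    return sigma.sum() ** 2 / (sigma ** 2).sum()

alpha = 0.75
r = {n: effective_rank(np.arange(1, n + 1) ** -alpha) for n in (10_000, 40_000)}
growth = r[40_000] / r[10_000]
print(growth)   # close to 4 ** 0.5 = 2 (finite-n corrections push it slightly above)
```

The small excess over 2 comes from the subleading constant in the partial sum Σ i^{−α}, which has not yet become negligible at these matrix sizes.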

Figure 2 — Singular Value Power-Law Distribution
[Interactive figure: log-log singular value spectrum; controls for decay exponent α and matrix size n.]
Figure 2. Singular value spectrum on log-log axes. The linear trend confirms power-law decay with slope −α. Shallow layers (low α) retain richer rank; deep layers (high α) compress aggressively. Dashed = theory, solid = empirical.

§4 Stability Analysis: Norm Scaling

A central concern in transformer training is numerical stability. The spectral norm of the attention matrix directly controls signal amplification: for input x, the output satisfies ‖Ax‖ ≤ ‖A‖_spec · ‖x‖. If this norm grows unboundedly with sequence length, long-context inference becomes numerically hazardous.

Theorem — Stability Criterion

If the spectral norm of each attention matrix satisfies ‖A‖_spec ≤ C < 2, transformer training is stable. Softmax normalization guarantees the first bound below directly (every row of A has unit ℓ₁ norm, hence ℓ₂ norm at most 1, so ‖A‖_F² ≤ n); the second holds empirically for trained attention:

‖A‖_F = O(√n), ‖A‖_spec = O(log n) (4)

Both norms grow sublinearly in sequence length n, providing the theoretical foundation for long-context stability.

Figure 3 — Norm Scaling vs. Sequence Length
[Interactive figure: norm curves vs. n; controls for sparsity and displayed norms.]
Figure 3. Empirical norm scaling vs. sequence length n. Rust: Frobenius norm — theory O(√n). Gold: Spectral norm — theory O(log n). Dashed lines show theoretical predictions. Sublinear growth in both confirms long-context stability.

§5 Sparsity–Spectrum Relationship

Attention sparsity — the fraction of near-zero attention weights — is intimately linked to spectral structure. A fully dense attention matrix distributes signal uniformly; a sparse one concentrates it. We formalize this duality via:

Effective Rank ≈ exp( H(attention distribution) ) (5)

where H denotes the Shannon entropy of the attention weights. Maximum entropy (uniform attention) yields maximum effective rank. Minimum entropy (one-hot attention) yields rank 1. This establishes that sparsity and rank are dual quantities, mediated by attention entropy.
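The two extremes of Eq. (5) can be verified directly. The helper name and the clipping constant below are our own choices for numerical safety:

```python
import numpy as np

# The two extremes of Eq. (5): uniform attention (maximum entropy) gives
# exp(H) = n, one-hot attention (minimum entropy) gives exp(H) = 1.
# eps-clipping avoids log(0) on exactly-zero weights.
def exp_entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return np.exp(-(p * np.log(p)).sum())

n = 20
uniform = np.full(n, 1.0 / n)
one_hot = np.zeros(n); one_hot[0] = 1.0

print(exp_entropy(uniform))   # ~20, the maximum effective rank
print(exp_entropy(one_hot))   # ~1, the rank-1 fully sparse limit
```

Intermediate sparsity levels interpolate between these extremes, which is exactly the sparsity–rank duality Figure 4 visualizes.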

Figure 4 — Sparsity vs. Eigenvalue Spectrum
[Interactive figure: row-stochastic attention heatmap and sorted spectrum; controls for sparsity and n; readouts for effective rank, entropy H, max σ, and regime.]
Figure 4. Left: sampled row-stochastic attention matrix (heatmap). Right: sorted singular value magnitudes. As sparsity increases, few singular values dominate and effective rank drops — confirming the sparsity–spectrum duality of Eq. (5).

§6 Expressivity Bounds & Generalization

Expressivity — the capacity of a transformer to represent complex functions — is ultimately bounded by the rank of its attention matrices:

Expressivity ≤ f( rank(A), d_model, num_heads ) (6)

This reveals a fundamental tension: sparse attention (low rank) offers efficiency and interpretability at the cost of expressivity; dense attention (high rank) enables complex reasoning but resists regularization.

6.1 Generalization via Low-Rank Attention

Well-generalized models exhibit lower effective rank than overfit models. Overfitting manifests spectrally as rank inflation — the model attends to noise, introducing spurious singular components. Monitoring effective rank during training can serve as an early indicator of overfitting, and regularization that implicitly suppresses extra singular values improves generalization.
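The monitoring idea above can be sketched as a simple checkpoint diagnostic. The alert threshold and the synthetic "generalizing" versus "noisy" matrices below are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

# Sketch of effective-rank monitoring as an overfitting diagnostic:
# compute r_eff from the singular values of an attention matrix and flag
# checkpoints whose rank inflated sharply.  Threshold is illustrative.
def effective_rank(A):
    s = np.linalg.svd(A, compute_uv=False)
    return s.sum() ** 2 / (s ** 2).sum()

def rank_inflation_alert(prev_rank, curr_rank, tol=1.2):
    """Flag a checkpoint whose effective rank jumped by more than tol x."""
    return curr_rank > tol * prev_rank

rng = np.random.default_rng(2)
low_rank  = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))  # generalizing regime
high_rank = rng.normal(size=(64, 64))                            # "attending to noise"
print(effective_rank(low_rank) < effective_rank(high_rank))      # -> True
```

In practice this would run over attention matrices averaged across heads and batches at each checkpoint, with the alert feeding into early stopping or a rank-suppressing regularizer.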

Figure 5 — Expressivity–Efficiency Trade-off
[Interactive figure: controls for condition number κ and d_model; sparsity on the horizontal axis.]
Figure 5. Expressivity bound (rust) and stability margin (sage) as functions of attention sparsity. Their intersection — marked in gold — defines the optimal operating point for a given task complexity (κ) and model width.

6.2 Practical Sparsity Guidelines

Simple Tasks — >90% Sparsity

Classification, retrieval, simple QA — dominated by a few salient tokens. High sparsity is sufficient and improves efficiency and interpretability without sacrificing accuracy.

Complex Tasks — 60–80% Sparsity

Multi-step reasoning, complex generation — requires richer token interactions. Moderate sparsity preserves the expressivity needed for these tasks while maintaining training stability.
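A sparsity target in either regime can be imposed post hoc by top-k masking each row and renormalizing. The helper below is a minimal sketch of that idea (function name and renormalization choice are ours; production sparse-attention kernels avoid materializing the dense matrix):

```python
import numpy as np

# Sketch: enforce a target attention sparsity by keeping only the top
# k = round((1 - sparsity) * n) weights per row, then renormalizing so
# each row is again a probability distribution.
def sparsify_rows(A, sparsity):
    n = A.shape[1]
    k = max(1, round((1.0 - sparsity) * n))
    out = np.zeros_like(A)
    idx = np.argsort(A, axis=1)[:, -k:]            # top-k columns per row
    np.put_along_axis(out, idx, np.take_along_axis(A, idx, axis=1), axis=1)
    return out / out.sum(axis=1, keepdims=True)    # restore row-stochasticity

rng = np.random.default_rng(3)
A = np.exp(rng.normal(size=(16, 16)))
A /= A.sum(axis=1, keepdims=True)
S = sparsify_rows(A, 0.90)           # the ">90% sparsity" regime for simple tasks
print((S == 0).mean())               # realized fraction of zeroed weights
```

Because rounding keeps at least one weight per row, the realized sparsity can fall slightly below the requested target at small n, as here (14 of 16 weights zeroed per row).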
