This research investigates the mathematical properties of attention matrices in transformer models by analyzing their eigenvalues, singular values, and norms. We show that attention matrices undergo progressive eigenvalue concentration during training — shifting from uniform spectral distributions to sparse dominant spectra. Singular values follow a power-law decay σᵢ ≈ C/i^α with layer-depth-dependent exponent α ∈ [0.5, 1.0]. We further establish that the Frobenius norm scales as O(√n) and the spectral norm as O(log n), providing theoretical justification for long-context numerical stability. Finally, the effective rank of attention — bounded by the entropy of the attention distribution — reveals a fundamental trade-off between sparsity, expressivity, and generalization.
Attention matrices are not symmetric in general, so we consider their eigenvalues as complex numbers. The softmax normalization ensures row-stochastic structure, which constrains the spectrum: all eigenvalues lie in or on the unit disk in the complex plane, and 1 is always an eigenvalue (the Perron eigenvalue).
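These spectral constraints are easy to verify numerically. The sketch below, using a hypothetical attention matrix built from random logits (a stand-in for a trained layer), checks that the spectrum of a row-stochastic matrix stays in the closed unit disk and always contains 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Hypothetical attention matrix: softmax over random logits (row-stochastic).
logits = rng.normal(size=(n, n))
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)       # each row sums to 1

eigvals = np.linalg.eigvals(A)          # complex in general: A is not symmetric

# All eigenvalues lie in the closed unit disk, and 1 is always an
# eigenvalue, since A @ ones = ones for any row-stochastic matrix.
assert np.all(np.abs(eigvals) <= 1 + 1e-8)
assert np.min(np.abs(eigvals - 1.0)) < 1e-8
```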
As training progresses, we observe a characteristic three-phase structure: near-uniform coverage of the unit disk, progressive concentration of spectral mass onto a few large real eigenvalues, and collapse of the remaining eigenvalues toward zero. We model this evolution as exponential relaxation:

λᵢ(t) ≈ λᵢ(∞) + [λᵢ(0) − λᵢ(∞)] · e^(−t/τᵢ),
where τᵢ is the characteristic convergence time for eigenvalue i, with τ₁ < τ₂ < ··· — dominant modes converge faster. Deeper layers exhibit accelerated concentration due to more aggressive information compression.
For an attention matrix A ∈ ℝⁿˣⁿ, its singular value decomposition A = UΣVᵀ reveals the informational skeleton of the matrix. We establish empirically and theoretically that singular values follow a power law:

σᵢ ≈ C · i^(−α),
The exponent α increases with layer depth: shallow layers exhibit α ≈ 0.5 (slower decay, richer rank), while deep layers reach α ≈ 1.0 (aggressive compression). On a log-log plot, this appears as a straight line with slope −α.
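The fitting procedure can be sketched on synthetic data. The matrix below is constructed with an exact power-law spectrum (a stand-in for a trained attention matrix, which follows the law only approximately), and an ordinary least-squares fit in log-log coordinates recovers the exponent:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha_true, C = 256, 0.75, 2.0

# Synthetic matrix with spectrum sigma_i = C * i^(-alpha): random orthogonal
# factors around a prescribed diagonal of singular values.
i = np.arange(1, n + 1)
sigma_true = C * i ** (-alpha_true)
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = U @ np.diag(sigma_true) @ V.T

sigma = np.linalg.svd(A, compute_uv=False)

# log sigma_i = log C - alpha * log i: a line of slope -alpha on log-log axes.
slope, intercept = np.polyfit(np.log(i), np.log(sigma), 1)
print(round(-slope, 2))             # 0.75 (recovered alpha)
print(round(np.exp(intercept), 2))  # 2.0  (recovered constant C)
```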
For a power-law spectrum σᵢ = C · i^(−α), the effective rank (defined as r_eff = (Σσᵢ)² / Σσᵢ²) satisfies:

r_eff ≈ n^(2(1−α)) / ((1−α)² · ζ(2α)),  1/2 < α < 1,

since Σᵢ σᵢ ≈ C · n^(1−α)/(1−α) while Σᵢ σᵢ² ≈ C² · ζ(2α) converges. Effective rank therefore shrinks as α grows with depth: at α = 3/4 it scales as O(√n), and at α = 1 the harmonic sums give r_eff ≈ (ln n)²/ζ(2) = O(log²n), confirming information compression with depth.
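The effective-rank formula can be evaluated numerically for power-law spectra (the n and α values below are illustrative):

```python
import numpy as np

def effective_rank(sigma):
    # r_eff = (sum sigma_i)^2 / (sum sigma_i^2); scale-invariant in sigma.
    return sigma.sum() ** 2 / (sigma ** 2).sum()

n = 100_000
i = np.arange(1, n + 1)

r_half = effective_rank(i ** -0.5)   # alpha = 1/2: near-linear, ~ n / log n
r_mid = effective_rank(i ** -0.75)   # alpha = 3/4: O(sqrt(n))
r_one = effective_rank(i ** -1.0)    # alpha = 1:   O(log^2 n)

print(r_mid / np.sqrt(n))       # an O(1) constant
print(r_one / np.log(n) ** 2)   # an O(1) constant, far below sqrt(n) scaling
```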
A central concern in transformer training is numerical stability. The spectral norm of the attention matrix directly controls signal amplification: for input x, the output satisfies ‖Ax‖ ≤ ‖A‖_spec · ‖x‖. If this norm grows unboundedly with sequence length, long-context inference becomes numerically hazardous.
If the spectral norm of each attention matrix satisfies ‖A‖_spec ≤ C < 2, transformer training is stable. Furthermore, softmax normalization inherently guarantees:

‖A‖_F ≤ √n = O(√n),  ‖A‖_spec = O(log n),

where the Frobenius bound follows because each row of A is a probability vector, so its ℓ₂ norm is at most 1. Both norms grow sublinearly in the sequence length n, providing the theoretical foundation for long-context stability.
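A minimal numerical check of the Frobenius bound, again using a hypothetical attention matrix built from random logits (the O(log n) spectral-norm behavior is an empirical claim and is only printed, not tested, here):

```python
import numpy as np

rng = np.random.default_rng(2)

for n in (64, 256, 1024):
    logits = rng.normal(size=(n, n))
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)    # row-stochastic

    fro = np.linalg.norm(A, "fro")       # Frobenius norm
    spec = np.linalg.norm(A, 2)          # spectral norm (largest singular value)

    # ||A||_F <= sqrt(n): each row's squared L2 norm is at most 1.
    assert fro <= np.sqrt(n) + 1e-9
    # ||A||_spec >= spectral radius = 1 for any row-stochastic matrix.
    assert spec >= 1 - 1e-9
    print(n, round(fro, 2), round(spec, 2))
```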
Attention sparsity, the fraction of near-zero attention weights, is intimately linked to spectral structure. A fully dense attention matrix distributes signal uniformly; a sparse one concentrates it. We formalize this duality via the entropy bound:

r_eff(A) ≤ exp(H(A)),

where H denotes the entropy of the attention distribution. Maximum entropy (uniform attention) maximizes the bound at exp(H) = n; minimum entropy (one-hot attention) collapses it to exp(H) = 1, i.e., rank 1. Sparsity and rank are thus dual quantities, mediated by attention entropy.
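The two entropy extremes in miniature (a toy sketch: H here is the entropy of a single attention row, and exp(H) its effective support size):

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy of an attention row; eps guards against log(0).
    return -np.sum(p * np.log(p + eps))

n = 8
uniform = np.full(n, 1.0 / n)   # maximum entropy: H = ln n
one_hot = np.eye(n)[0]          # minimum entropy: H = 0

print(np.exp(entropy(uniform)))  # ~8.0: effective support of all n tokens
print(np.exp(entropy(one_hot)))  # ~1.0: effective support of one token
```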
Expressivity, the capacity of a transformer to represent complex functions, is ultimately bounded by the rank of its attention matrices: since rank(AX) ≤ rank(A), a rank-r attention matrix can mix the token representations X only within an r-dimensional subspace.
This reveals a fundamental tension: sparse attention (low rank) offers efficiency and interpretability at the cost of expressivity; dense attention (high rank) enables complex reasoning but resists regularization.
Well-generalized models exhibit lower effective rank than overfit models. Overfitting manifests spectrally as rank inflation — the model attends to noise, introducing spurious singular components. Monitoring effective rank during training can serve as an early indicator of overfitting, and regularization that implicitly suppresses extra singular values improves generalization.
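One way to operationalize this monitoring (a hypothetical hook, not prescribed by the text): compute the effective rank of each layer's attention at every evaluation step and watch for sustained upward drift, the rank-inflation signature described above.

```python
import numpy as np

def effective_rank(A):
    # r_eff = (sum sigma_i)^2 / (sum sigma_i^2), a smooth proxy for rank.
    s = np.linalg.svd(A, compute_uv=False)
    return s.sum() ** 2 / (s ** 2).sum()

# Sanity checks: the identity has full effective rank; a rank-1 matrix has 1.
assert np.isclose(effective_rank(np.eye(16)), 16.0)
assert np.isclose(effective_rank(np.ones((16, 16))), 1.0)

# Hypothetical training hook: record per-layer effective ranks each eval step;
# a sustained rise across steps flags possible overfitting.
def record_ranks(attn_per_layer, history):
    ranks = [effective_rank(A) for A in attn_per_layer]
    history.append(ranks)
    return ranks
```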
Classification, retrieval, simple QA — dominated by a few salient tokens. High sparsity is sufficient and improves efficiency and interpretability without sacrificing accuracy.
Multi-step reasoning, complex generation — require richer token interactions. Moderate sparsity preserves the expressivity needed for these tasks while maintaining training stability.