We analyze gradient descent through the framework of discrete-time dynamical systems. By treating each update step as the application of a map Φ_η : ℝⁿ → ℝⁿ, we derive exact stability conditions in terms of the spectral radius of the update Jacobian. Our central result establishes that gradient descent undergoes a stability phase transition when ρ(I − ηH) = 1, where H is the Hessian of the loss and η is the learning rate. We characterize three distinct regimes: geometric convergence (stable fixed point), critical oscillation (an eigenvalue on the unit circle), and exponential divergence (unstable repeller). We provide interactive experiments demonstrating each regime and derive tight convergence bounds with empirical verification.
Consider minimizing a twice-differentiable objective L : ℝⁿ → ℝ. The gradient descent update rule with learning rate η > 0 defines a discrete-time dynamical system via the update map Φ_η(θ) = θ − η∇L(θ), so that θ_{t+1} = Φ_η(θ_t).
Rather than asking "does this converge?", we ask the more precise dynamical question: what are the fixed points of Φ_η, and under what conditions are they stable attractors?
A point θ* is a fixed point of Φ_η if and only if Φ_η(θ*) = θ*, i.e. θ* − η∇L(θ*) = θ*. Since η > 0, this holds precisely when ∇L(θ*) = 0.
Thus, fixed points of Φ_η are precisely the stationary points of L, independent of η. The learning rate governs not where we converge, but whether we converge.
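The η-independence of fixed points is easy to check numerically. A minimal sketch on an assumed diagonal quadratic (H = diag(1, 10) and the objective L(θ) = ½ θᵀHθ are illustrative choices, not from the text):

```python
import numpy as np

# Assumed toy quadratic L(θ) = ½ θᵀHθ with H = diag(1, 10): its unique
# stationary point θ* = 0 is a fixed point of Φ_η for every learning rate.
H = np.diag([1.0, 10.0])

def phi(theta, eta):
    """Gradient descent update map Φ_η(θ) = θ − η∇L(θ), with ∇L(θ) = Hθ."""
    return theta - eta * (H @ theta)

theta_star = np.zeros(2)
for eta in [0.01, 0.1, 0.5, 5.0]:
    # Φ_η(θ*) = θ* because ∇L(θ*) = 0; the value of η changes nothing here.
    assert np.allclose(phi(theta_star, eta), theta_star)
```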
To analyze the local behavior near a fixed point θ*, we linearize Φ_η via its Jacobian, J_Φ(θ*) = ∂Φ_η/∂θ |_{θ=θ*} = I − ηH(θ*),
where H(θ*) = ∇²L(θ*) is the Hessian. The local dynamics of perturbations δ_t = θ_t − θ* are governed by δ_{t+1} ≈ (I − ηH(θ*)) δ_t.
After t steps, δ_t ≈ (I − ηH)^t · δ_0. Since H is symmetric, let its eigendecomposition be H = QΛQ^T. The update Jacobian has eigenvalues λ_i(J_Φ) = 1 − ηλ_i(H).
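For a quadratic the linearization is exact, so both claims are directly checkable. A sketch with an assumed diagonal Hessian H = diag(0.5, 2) and η = 0.4 (values chosen for illustration):

```python
import numpy as np

# Check λ_i(J_Φ) = 1 − ηλ_i(H) and the closed form δ_t = (I − ηH)^t δ_0.
H = np.diag([0.5, 2.0])
eta = 0.4
J = np.eye(2) - eta * H                  # update Jacobian I − ηH

lam_H = np.diag(H)
lam_J = 1.0 - eta * lam_H                # predicted eigenvalues of J

delta = np.array([1.0, 1.0])             # initial perturbation δ_0
for _ in range(20):
    delta = J @ delta                    # iterate the linearized map
closed_form = lam_J ** 20 * np.array([1.0, 1.0])  # diagonal H → per-mode decay
assert np.allclose(delta, closed_form)
```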
A fixed point θ* of the gradient descent map Φ_η is locally asymptotically stable if and only if the spectral radius of the update Jacobian satisfies ρ(I − ηH(θ*)) = max_i |1 − ηλ_i(H)| < 1.
For a positive definite Hessian with eigenvalues 0 < λ_min ≤ λ_i ≤ λ_max, this holds if and only if η < 2/λ_max. The optimal learning rate minimizing the spectral radius is η* = 2/(λ_min + λ_max).
Proof sketch. The spectral radius condition follows directly from the theory of discrete linear systems: δ_t → 0 iff all eigenvalues of J_Φ lie strictly inside the unit disk. For ηλ_i > 2, the i-th mode has |1 − ηλ_i| > 1, causing divergence. The optimal η* equalizes the magnitudes |1 − η*λ_min| and |1 − η*λ_max|, balancing the slowest and fastest modes. □
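Both the threshold and the optimal step size can be verified with a few lines. A sketch assuming the spectrum λ_min = 1, λ_max = 9 (so κ = 9), values not from the text:

```python
import numpy as np

# ρ(I − ηH) < 1 iff η < 2/λ_max; minimized at η* = 2/(λ_min + λ_max).
lam_min, lam_max = 1.0, 9.0

def spectral_radius(eta):
    # For symmetric H, ρ(I − ηH) is attained at an extreme eigenvalue of H.
    return max(abs(1.0 - eta * lam_min), abs(1.0 - eta * lam_max))

assert spectral_radius(2.0 / lam_max - 1e-6) < 1.0  # just below threshold: stable
assert spectral_radius(2.0 / lam_max + 1e-3) > 1.0  # just above: unstable

eta_star = 2.0 / (lam_min + lam_max)
kappa = lam_max / lam_min
# At η*, both extreme modes contract at the same rate (κ−1)/(κ+1) = 0.8.
assert np.isclose(spectral_radius(eta_star), (kappa - 1.0) / (kappa + 1.0))
```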
In the stable regime, convergence is geometric (linear) with rate equal to the spectral radius: ‖δ_t‖ ≤ ρ(I − ηH)^t · ‖δ_0‖ to first order.
The number of steps to reach ε-accuracy scales as T(ε) = O(log(1/ε) / log(1/ρ)). With optimal η*, the spectral radius equals (κ−1)/(κ+1) where κ = λ_max/λ_min is the condition number—recovering the classical result that ill-conditioned problems converge slowly.
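The iteration-count prediction is easy to check against a direct simulation. A sketch for one decaying mode, with assumed ρ = 0.8 and ε = 1e-6:

```python
import math

# Compare the predicted step count T(ε) ≈ log(1/ε)/log(1/ρ) with a direct
# simulation of a single geometric contraction.
rho, eps = 0.8, 1e-6
predicted = math.log(1 / eps) / math.log(1 / rho)

delta, steps = 1.0, 0
while delta > eps:
    delta *= rho          # one contraction by ρ per step
    steps += 1

# The empirical count matches the prediction to within one step.
assert abs(steps - predicted) <= 1.0
```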
As η increases from 0 toward 2/λ_max, the dynamics undergo a qualitative change akin to a bifurcation in continuous dynamical systems. We identify three distinct phases:
Phase I (stable): 0 < η < 2/λ_max. All eigenvalues of J_Φ lie inside the unit disk; the fixed point is a stable node or spiral, and geometric convergence is guaranteed.
Critical point: η = 2/λ_max. The eigenvalue of J_Φ associated with λ_max equals −1, placing it on the unit circle: a non-convergent period-2 oscillation along the eigenvector of λ_max.
Phase II (unstable): η > 2/λ_max. The spectral radius exceeds 1; perturbations grow exponentially, and the fixed point is an unstable repeller.
Optimal step: η* = 2/(λ_min + λ_max). The spectral radius is minimized at (κ−1)/(κ+1), giving the fastest guaranteed convergence within Phase I.
We can plot the long-run behavior of gradient descent as a function of η, analogous to the logistic map bifurcation diagram. For each η, we run 500 steps from a fixed initialization and plot the last 50 iterates of the loss. The transition from a single fixed point to an oscillating orbit and then to explosion is clearly visible.
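A plot-free sketch of the data-generation step, again assuming a 1-D quadratic with λ = 4 (threshold η = 0.5); only the recorded tail values are checked here:

```python
# For each η, run 500 steps and keep the loss over the last 50 iterates,
# as in the bifurcation diagram described above.
lam = 4.0

def tail_losses(eta, theta0=1.0, steps=500, tail=50):
    theta, losses = theta0, []
    for t in range(steps):
        theta *= 1.0 - eta * lam              # exact 1-D gradient step
        if t >= steps - tail:
            losses.append(0.5 * lam * theta * theta)
    return losses

assert max(tail_losses(0.3)) < 1e-12          # stable: tail collapses onto the fixed point
assert len(set(tail_losses(0.5))) == 1        # critical: constant-amplitude orbit
```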
A key test of any theoretical framework is its predictive power. Here we directly compare the theoretically predicted convergence bound (Eq. 6) with empirical trajectories. The bound predicts geometric decay with ratio ρ(I−ηH); we measure actual decay rates and verify tightness.
The condition number κ = λ_max/λ_min is the fundamental quantity governing how quickly an ill-conditioned quadratic converges. With the optimal step size η*, the convergence rate is ρ* = (κ − 1)/(κ + 1), which approaches 1 as κ grows.
This explains why deep networks with large condition numbers converge so slowly under vanilla gradient descent, and why preconditioning methods (Adam, natural gradient) that implicitly reduce κ are so effective in practice.
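The κ-dependence can be seen concretely by counting iterations to ε-accuracy under the optimal step size. A sketch with an assumed 2-D diagonal quadratic (λ_min = 1, λ_max = κ):

```python
import numpy as np

# Iterations to reach ε-accuracy grow with the condition number κ,
# even with the optimal step size η* = 2/(λ_min + λ_max).
def steps_to_eps(kappa, eps=1e-6):
    lam = np.array([1.0, kappa])
    eta = 2.0 / (1.0 + kappa)
    theta = np.ones(2)
    for t in range(1, 200000):
        theta = theta - eta * lam * theta   # GD on the diagonal quadratic
        if np.linalg.norm(theta) < eps:
            return t
    return None

# Ill-conditioning slows convergence roughly linearly in κ.
assert steps_to_eps(10) < steps_to_eps(100) < steps_to_eps(1000)
```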
Heavy-ball momentum introduces a second-order system. The update now involves two consecutive iterates, and the Jacobian acts on pairs (θ_t, θ_{t-1}). The stability region becomes a region in the (η, β)-plane, and optimal parameters achieve O(√κ · log(1/ε)) convergence—a quadratic improvement in condition number dependence.
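A sketch of the augmented system's Jacobian on a single quadratic mode λ, using the standard heavy-ball form θ_{t+1} = θ_t − ηλθ_t + β(θ_t − θ_{t−1}) (the specific η, β values below are illustrative assumptions):

```python
import numpy as np

# Heavy-ball momentum as a first-order system on pairs (θ_t, θ_{t−1}).
def momentum_jacobian(eta, beta, lam):
    # [θ_{t+1}, θ_t] = J [θ_t, θ_{t−1}]
    return np.array([[1.0 + beta - eta * lam, -beta],
                     [1.0, 0.0]])

def spectral_radius(M):
    return max(abs(np.linalg.eigvals(M)))

lam = 1.0
# β = 0 recovers plain gradient descent: eigenvalues {1 − ηλ, 0}.
assert np.isclose(spectral_radius(momentum_jacobian(0.5, 0.0, lam)), 0.5)
# Momentum enlarges the stable region in the (η, β)-plane: η = 2.5 > 2/λ
# diverges for plain GD, but is stable with β = 0.5.
assert spectral_radius(momentum_jacobian(2.5, 0.0, lam)) > 1.0
assert spectral_radius(momentum_jacobian(2.5, 0.5, lam)) < 1.0
```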
For non-quadratic losses, our analysis applies locally near minima where the quadratic Taylor approximation is valid. Globally, the dynamics may be far richer: multiple basins of attraction, saddle points with unstable manifolds, and an effective Hessian that changes as the trajectory traverses the landscape. The stability criterion becomes an instantaneous condition, η < 2/λ_max(H(θ_t)), at each point along the trajectory; a uniform sufficient condition is η < 2/L, where L is the Lipschitz constant of ∇L (so that λ_max(H(θ)) ≤ L everywhere).
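The instantaneous nature of the condition shows up even in one dimension. A sketch on the assumed non-quadratic loss L(θ) = θ⁴/4, whose Hessian 3θ² varies along the trajectory:

```python
# The same η is stable where the loss is flat and unstable where it is sharp:
# the pointwise criterion is η < 2/(3θ²).
def run(eta, theta0, steps=1000):
    theta = theta0
    for _ in range(steps):
        theta -= eta * theta ** 3          # ∇L(θ) = θ³
        if abs(theta) > 1e6:
            return float("inf")            # flag divergence before overflow
    return abs(theta)

assert run(0.1, theta0=1.0) < 0.1              # flat start: 2/(3·1²) ≈ 0.67 > η → converges
assert run(0.1, theta0=10.0) == float("inf")   # sharp start: 2/(3·100) ≈ 0.007 < η → diverges
```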
If L is L-smooth (i.e., ∇L is L-Lipschitz) and μ-strongly convex, then gradient descent with η = 2/(μ+L) satisfies ‖θ_t − θ*‖ ≤ ((κ − 1)/(κ + 1))^t · ‖θ_0 − θ*‖,
where κ = L/μ is the global condition number. This bound is tight for quadratics.
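Tightness on quadratics is directly verifiable: the bound holds with equality when θ_0 excites an extreme eigenvector. A sketch with assumed μ = 1 and smoothness constant 4 (κ = 4):

```python
import numpy as np

# On H = diag(μ, L) with θ* = 0, initialize along the λ_min eigenvector:
# each step contracts the error by exactly ρ = (κ−1)/(κ+1).
mu, L_smooth = 1.0, 4.0
kappa = L_smooth / mu
eta = 2.0 / (mu + L_smooth)
rho = (kappa - 1.0) / (kappa + 1.0)          # = 0.6

lam = np.array([mu, L_smooth])
theta = np.array([1.0, 0.0])                 # along the μ-eigenvector
for t in range(1, 51):
    theta = theta - eta * lam * theta
    # The bound is met with equality at every step for this initialization.
    assert np.isclose(np.linalg.norm(theta), rho ** t)
```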