We analyze gradient descent through the framework of discrete-time dynamical systems. By treating each update step as the application of a map Φ_η : ℝⁿ → ℝⁿ, we derive exact stability conditions in terms of the spectral radius of the update Jacobian. Our central result establishes that gradient descent undergoes a stability phase transition when ρ(I − ηH) = 1, where H is the Hessian of the loss and η is the learning rate. We characterize three distinct regimes: geometric convergence (stable fixed point), critical oscillation (an eigenvalue on the unit circle), and exponential divergence (unstable repeller). We provide interactive experiments demonstrating each regime and derive tight convergence bounds with empirical verification.
Consider minimizing a twice-differentiable objective L : ℝⁿ → ℝ. The gradient descent update rule with learning rate η > 0 defines a discrete-time dynamical system via the update map Φ_η(θ) = θ − η∇L(θ), so that θ_{t+1} = Φ_η(θ_t).
Rather than asking "does this converge?", we ask the more precise dynamical question: what are the fixed points of Φ_η, and under what conditions are they stable attractors?
A point θ* is a fixed point of Φ_η if and only if Φ_η(θ*) = θ*, i.e. θ* − η∇L(θ*) = θ*. Since η > 0, this holds precisely when ∇L(θ*) = 0.
Thus, fixed points of Φ_η are precisely the stationary points of L, independent of η. The learning rate governs not where we converge, but whether we converge.
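The η-independence of fixed points is easy to check numerically. A minimal sketch on an assumed diagonal quadratic (H = diag(1, 10) and the objective L(θ) = ½ θᵀHθ are illustrative choices, not from the text):

```python
import numpy as np

# Assumed toy quadratic L(θ) = ½ θᵀHθ with H = diag(1, 10): its unique
# stationary point θ* = 0 is a fixed point of Φ_η for every learning rate.
H = np.diag([1.0, 10.0])

def phi(theta, eta):
    """Gradient descent update map Φ_η(θ) = θ − η∇L(θ), with ∇L(θ) = Hθ."""
    return theta - eta * (H @ theta)

theta_star = np.zeros(2)
for eta in [0.01, 0.1, 0.5, 5.0]:
    # Φ_η(θ*) = θ* because ∇L(θ*) = 0; the value of η changes nothing here.
    assert np.allclose(phi(theta_star, eta), theta_star)
```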
To analyze the local behavior near a fixed point θ*, we linearize Φ_η via its Jacobian, J_Φ(θ*) = ∂Φ_η/∂θ |_{θ=θ*} = I − ηH(θ*),
where H(θ*) = ∇²L(θ*) is the Hessian. The local dynamics of perturbations δ_t = θ_t − θ* are governed by δ_{t+1} ≈ (I − ηH(θ*)) δ_t.
After t steps, δ_t ≈ (I − ηH)^t · δ_0. Since H is symmetric, let its eigendecomposition be H = QΛQ^T. The update Jacobian has eigenvalues λ_i(J_Φ) = 1 − ηλ_i(H).
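For a quadratic the linearization is exact, so both claims are directly checkable. A sketch with an assumed diagonal Hessian H = diag(0.5, 2) and η = 0.4 (values chosen for illustration):

```python
import numpy as np

# Check λ_i(J_Φ) = 1 − ηλ_i(H) and the closed form δ_t = (I − ηH)^t δ_0.
H = np.diag([0.5, 2.0])
eta = 0.4
J = np.eye(2) - eta * H                  # update Jacobian I − ηH

lam_H = np.diag(H)
lam_J = 1.0 - eta * lam_H                # predicted eigenvalues of J

delta = np.array([1.0, 1.0])             # initial perturbation δ_0
for _ in range(20):
    delta = J @ delta                    # iterate the linearized map
closed_form = lam_J ** 20 * np.array([1.0, 1.0])  # diagonal H → per-mode decay
assert np.allclose(delta, closed_form)
```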
A fixed point θ* of the gradient descent map Φ_η is locally asymptotically stable if and only if the spectral radius of the update Jacobian satisfies ρ(I − ηH(θ*)) = max_i |1 − ηλ_i(H)| < 1.
For a positive definite Hessian with eigenvalues 0 < λ_min ≤ λ_i ≤ λ_max, this holds if and only if η < 2/λ_max. The optimal learning rate minimizing the spectral radius is η* = 2/(λ_min + λ_max).
Proof sketch. The spectral radius condition follows directly from the theory of discrete linear systems: δ_t → 0 iff all eigenvalues of J_Φ lie strictly inside the unit disk. For ηλ_i > 2, the i-th mode has |1 − ηλ_i| > 1, causing divergence. The optimal η* equalizes the magnitudes |1 − η*λ_min| and |1 − η*λ_max|, balancing the slowest and fastest modes. □
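Both the threshold and the optimal step size can be verified with a few lines. A sketch assuming the spectrum λ_min = 1, λ_max = 9 (so κ = 9), values not from the text:

```python
import numpy as np

# ρ(I − ηH) < 1 iff η < 2/λ_max; minimized at η* = 2/(λ_min + λ_max).
lam_min, lam_max = 1.0, 9.0

def spectral_radius(eta):
    # For symmetric H, ρ(I − ηH) is attained at an extreme eigenvalue of H.
    return max(abs(1.0 - eta * lam_min), abs(1.0 - eta * lam_max))

assert spectral_radius(2.0 / lam_max - 1e-6) < 1.0  # just below threshold: stable
assert spectral_radius(2.0 / lam_max + 1e-3) > 1.0  # just above: unstable

eta_star = 2.0 / (lam_min + lam_max)
kappa = lam_max / lam_min
# At η*, both extreme modes contract at the same rate (κ−1)/(κ+1) = 0.8.
assert np.isclose(spectral_radius(eta_star), (kappa - 1.0) / (kappa + 1.0))
```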
In the stable regime, convergence is geometric (linear) with rate equal to the spectral radius: ‖δ_t‖ ≤ ρ(I − ηH)^t · ‖δ_0‖ to first order.
The number of steps to reach ε-accuracy scales as T(ε) = O(log(1/ε) / log(1/ρ)). With optimal η*, the spectral radius equals (κ−1)/(κ+1) where κ = λ_max/λ_min is the condition number—recovering the classical result that ill-conditioned problems converge slowly.
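The iteration-count prediction is easy to check against a direct simulation. A sketch for one decaying mode, with assumed ρ = 0.8 and ε = 1e-6:

```python
import math

# Compare the predicted step count T(ε) ≈ log(1/ε)/log(1/ρ) with a direct
# simulation of a single geometric contraction.
rho, eps = 0.8, 1e-6
predicted = math.log(1 / eps) / math.log(1 / rho)

delta, steps = 1.0, 0
while delta > eps:
    delta *= rho          # one contraction by ρ per step
    steps += 1

# The empirical count matches the prediction to within one step.
assert abs(steps - predicted) <= 1.0
```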
As η increases from 0 toward 2/λ_max, the dynamics undergo a qualitative change akin to a bifurcation in continuous dynamical systems. We identify three distinct phases:
Phase I (stable): 0 < η < 2/λ_max. All eigenvalues of J_Φ lie inside the unit disk; the fixed point is a stable node or spiral, and geometric convergence is guaranteed.
Critical point: η = 2/λ_max. The eigenvalue of J_Φ associated with λ_max equals −1, placing it on the unit circle: a non-convergent period-2 oscillation along the eigenvector of λ_max.
Phase II (unstable): η > 2/λ_max. The spectral radius exceeds 1; perturbations grow exponentially, and the fixed point is an unstable repeller.
Optimal step: η* = 2/(λ_min + λ_max). The spectral radius is minimized at (κ−1)/(κ+1), giving the fastest guaranteed convergence within Phase I.
We can plot the long-run behavior of gradient descent as a function of η, analogous to the logistic map bifurcation diagram. For each η, we run 500 steps from a fixed initialization and plot the last 50 iterates of the loss. The transition from a single fixed point to an oscillating orbit and then to explosion is clearly visible.
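A plot-free sketch of the data-generation step, again assuming a 1-D quadratic with λ = 4 (threshold η = 0.5); only the recorded tail values are checked here:

```python
# For each η, run 500 steps and keep the loss over the last 50 iterates,
# as in the bifurcation diagram described above.
lam = 4.0

def tail_losses(eta, theta0=1.0, steps=500, tail=50):
    theta, losses = theta0, []
    for t in range(steps):
        theta *= 1.0 - eta * lam              # exact 1-D gradient step
        if t >= steps - tail:
            losses.append(0.5 * lam * theta * theta)
    return losses

assert max(tail_losses(0.3)) < 1e-12          # stable: tail collapses onto the fixed point
assert len(set(tail_losses(0.5))) == 1        # critical: constant-amplitude orbit
```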
A key test of any theoretical framework is its predictive power. Here we directly compare the theoretically predicted convergence bound (Eq. 6) with empirical trajectories. The bound predicts geometric decay with ratio ρ(I−ηH); we measure actual decay rates and verify tightness.
The condition number κ = λ_max/λ_min is the fundamental quantity governing how quickly an ill-conditioned quadratic converges. With the optimal step size η*, the convergence rate is ρ* = (κ − 1)/(κ + 1), which approaches 1 as κ grows.
This explains why deep networks with large condition numbers converge so slowly under vanilla gradient descent, and why preconditioning methods (Adam, natural gradient) that implicitly reduce κ are so effective in practice.
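The κ-dependence can be seen concretely by counting iterations to ε-accuracy under the optimal step size. A sketch with an assumed 2-D diagonal quadratic (λ_min = 1, λ_max = κ):

```python
import numpy as np

# Iterations to reach ε-accuracy grow with the condition number κ,
# even with the optimal step size η* = 2/(λ_min + λ_max).
def steps_to_eps(kappa, eps=1e-6):
    lam = np.array([1.0, kappa])
    eta = 2.0 / (1.0 + kappa)
    theta = np.ones(2)
    for t in range(1, 200000):
        theta = theta - eta * lam * theta   # GD on the diagonal quadratic
        if np.linalg.norm(theta) < eps:
            return t
    return None

# Ill-conditioning slows convergence roughly linearly in κ.
assert steps_to_eps(10) < steps_to_eps(100) < steps_to_eps(1000)
```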
Heavy-ball momentum introduces a second-order system. The update now involves two consecutive iterates, and the Jacobian acts on pairs (θ_t, θ_{t-1}). The stability region becomes a region in the (η, β)-plane, and optimal parameters achieve O(√κ · log(1/ε)) convergence—a quadratic improvement in condition number dependence.
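A sketch of the augmented system's Jacobian on a single quadratic mode λ, using the standard heavy-ball form θ_{t+1} = θ_t − ηλθ_t + β(θ_t − θ_{t−1}) (the specific η, β values below are illustrative assumptions):

```python
import numpy as np

# Heavy-ball momentum as a first-order system on pairs (θ_t, θ_{t−1}).
def momentum_jacobian(eta, beta, lam):
    # [θ_{t+1}, θ_t] = J [θ_t, θ_{t−1}]
    return np.array([[1.0 + beta - eta * lam, -beta],
                     [1.0, 0.0]])

def spectral_radius(M):
    return max(abs(np.linalg.eigvals(M)))

lam = 1.0
# β = 0 recovers plain gradient descent: eigenvalues {1 − ηλ, 0}.
assert np.isclose(spectral_radius(momentum_jacobian(0.5, 0.0, lam)), 0.5)
# Momentum enlarges the stable region in the (η, β)-plane: η = 2.5 > 2/λ
# diverges for plain GD, but is stable with β = 0.5.
assert spectral_radius(momentum_jacobian(2.5, 0.0, lam)) > 1.0
assert spectral_radius(momentum_jacobian(2.5, 0.5, lam)) < 1.0
```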
For non-quadratic losses, our analysis applies locally near minima where the quadratic Taylor approximation is valid. Globally, the dynamics may be far richer: multiple basins of attraction, saddle points with unstable manifolds, and an effective Hessian that changes as the trajectory traverses the landscape. The stability criterion becomes an instantaneous condition, η < 2/λ_max(H(θ_t)), at each point along the trajectory; a uniform sufficient condition is η < 2/L, where L is the Lipschitz constant of ∇L (so that λ_max(H(θ)) ≤ L everywhere).
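The instantaneous nature of the condition shows up even in one dimension. A sketch on the assumed non-quadratic loss L(θ) = θ⁴/4, whose Hessian 3θ² varies along the trajectory:

```python
# The same η is stable where the loss is flat and unstable where it is sharp:
# the pointwise criterion is η < 2/(3θ²).
def run(eta, theta0, steps=1000):
    theta = theta0
    for _ in range(steps):
        theta -= eta * theta ** 3          # ∇L(θ) = θ³
        if abs(theta) > 1e6:
            return float("inf")            # flag divergence before overflow
    return abs(theta)

assert run(0.1, theta0=1.0) < 0.1              # flat start: 2/(3·1²) ≈ 0.67 > η → converges
assert run(0.1, theta0=10.0) == float("inf")   # sharp start: 2/(3·100) ≈ 0.007 < η → diverges
```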
If L is L-smooth (i.e., ∇L is L-Lipschitz) and μ-strongly convex, then gradient descent with η = 2/(μ+L) satisfies ‖θ_t − θ*‖ ≤ ((κ − 1)/(κ + 1))^t · ‖θ_0 − θ*‖,
where κ = L/μ is the global condition number. This bound is tight for quadratics.
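Tightness on quadratics is directly verifiable: the bound holds with equality when θ_0 excites an extreme eigenvector. A sketch with assumed μ = 1 and smoothness constant 4 (κ = 4):

```python
import numpy as np

# On H = diag(μ, L) with θ* = 0, initialize along the λ_min eigenvector:
# each step contracts the error by exactly ρ = (κ−1)/(κ+1).
mu, L_smooth = 1.0, 4.0
kappa = L_smooth / mu
eta = 2.0 / (mu + L_smooth)
rho = (kappa - 1.0) / (kappa + 1.0)          # = 0.6

lam = np.array([mu, L_smooth])
theta = np.array([1.0, 0.0])                 # along the μ-eigenvector
for t in range(1, 51):
    theta = theta - eta * lam * theta
    # The bound is met with equality at every step for this initialization.
    assert np.isclose(np.linalg.norm(theta), rho ** t)
```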