Chapter 8
Eigenvalues & Eigenvectors
Key ideas: Introduction

Introduction#

Eigen-analysis provides structure for symmetric/PSD matrices (covariances, Laplacians) and general matrices (Markov chains, Jacobians). Power iteration is a simple iterative method to approximate the largest eigenvalue/eigenvector using repeated multiplication and normalization.

Important ideas#

  1. Eigenpairs $(\lambda, v)$ satisfy $Av = \lambda v$; for symmetric $A$, eigenvectors form an orthonormal basis.

  2. Spectral theorem (symmetric): $A = Q \Lambda Q^\top$ with real eigenvalues; $Q$ orthogonal, $\Lambda$ diagonal.

  3. Eigengap and convergence: Power iteration converges at rate $|\lambda_2/\lambda_1|^k$ to the dominant eigenvector when $|\lambda_1|>|\lambda_2|$.

  4. Rayleigh quotient: $\rho(x) = \dfrac{x^\top A x}{x^\top x}$; maximized at $\lambda_\max$, minimized at $\lambda_\min$ for symmetric $A$.

  5. PSD matrices: Eigenvalues nonnegative; covariances and Gram matrices are PSD.

  6. Gershgorin disks: Eigenvalues lie within unions of disks defined by row sums—gives quick bounds.

  7. Perron–Frobenius: An irreducible nonnegative matrix has a simple, positive dominant eigenvalue with a strictly positive eigenvector (Markov chains, PageRank).
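Ideas 1, 2, and 4 are easy to verify numerically. A minimal sketch (assuming a random 5×5 symmetric test matrix): the spectral theorem reconstruction $A = Q\Lambda Q^\top$ holds to machine precision, and the Rayleigh quotient of any vector stays inside $[\lambda_\min, \lambda_\max]$.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2  # symmetric test matrix

# Spectral theorem: A = Q Λ Q^T with orthogonal Q, real eigenvalues
evals, Q = np.linalg.eigh(A)
reconstruction_err = np.linalg.norm(Q @ np.diag(evals) @ Q.T - A)

# Rayleigh quotient of random vectors stays in [λ_min, λ_max]
xs = rng.standard_normal((100, 5))
rho = np.einsum('ij,jk,ik->i', xs, A, xs) / np.einsum('ij,ij->i', xs, xs)

print("reconstruction error:", reconstruction_err)
print("Rayleigh range:", rho.min(), rho.max())
print("spectrum endpoints:", evals[0], evals[-1])
```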

Relevance to ML#

  • PCA: Leading eigenvectors of covariance capture maximal variance; truncation yields best low-rank approximation.

  • Spectral clustering: Laplacian eigenvectors reveal cluster structure via graph cuts.

  • PageRank/Markov chains: Stationary distribution is dominant eigenvector.

  • Stability: Jacobian eigenvalues inform exploding/vanishing dynamics.

  • Attention/covariance spectra: Eigenvalue spread relates to conditioning and numerical stability.

Algorithmic development (milestones)#

  • 1900s: Spectral theorem for symmetric matrices.

  • 1907–1912: Perron–Frobenius theory for nonnegative matrices (Perron 1907, Frobenius 1912).

  • 1929: Power method (von Mises–Pollaczek-Geiringer), with later refinements (Householder).

  • 1931: Gershgorin circle theorem (eigenvalue bounds).

  • 1998: PageRank (Brin–Page) uses power iteration at web scale.

  • 2000s: Spectral clustering widely adopted (Ng–Jordan–Weiss 2002).

  • 2010s: Randomized eigensolvers/SVD for large-scale ML.

Definitions#

  • Eigenpair: $(\lambda, v\neq 0)$ with $Av=\lambda v$.

  • Spectrum $\sigma(A)$: set of eigenvalues; spectral radius $\rho(A)=\max |\lambda_i|$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$ for $x\neq0$.

  • Power iteration: $x_{k+1}=\dfrac{A x_k}{\lVert A x_k\rVert}$; converges to dominant eigenvector under eigengap.

  • Laplacian: $L=D-W$ (unnormalized), $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$ for graph spectral methods.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Spectral theorem (symmetric/PSD): orthonormal eigenbasis; real eigenvalues.

  • Rayleigh–Ritz: Extreme eigenvalues maximize/minimize Rayleigh quotient.

  • Perron–Frobenius: Positive eigenvector for irreducible nonnegative matrices; spectral gap governs convergence.

  • Gershgorin circle theorem: Eigenvalues lie in disk unions from row sums.

  • Power iteration convergence: Linear rate governed by $|\lambda_2/\lambda_1|$ when $|\lambda_1|>|\lambda_2|$.
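The convergence rate in the last item can be observed directly. A minimal sketch, assuming a diagonal test matrix with known spectrum ($\lambda_1=3$, $\lambda_2=2$) so the dominant eigenvector is $e_1$ and the predicted error ratio is $|\lambda_2/\lambda_1| = 2/3$:

```python
import numpy as np

# Diagonal test matrix: λ1 = 3, λ2 = 2, so predicted rate is 2/3
lams = np.array([3.0, 2.0, 1.0, 0.5])
A = np.diag(lams)

x = np.ones(4) / 2.0  # unit-norm start with a component along e1
errs = []
for k in range(30):
    x = A @ x
    x /= np.linalg.norm(x)
    # distance (sine of angle) to the dominant eigenvector e1
    errs.append(np.sqrt(1 - x[0] ** 2))

# Successive error ratios should approach |λ2/λ1| = 2/3
ratios = [errs[k + 1] / errs[k] for k in range(20, 25)]
print("observed ratios:", np.round(ratios, 4))
```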

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Spectral clustering: Ng–Jordan–Weiss (2002); von Luxburg (2007).

  • PageRank: Brin–Page (1998).

  • Randomized SVD/eigs: Halko–Martinsson–Tropp (2011).

  • Stability of deep nets (spectral radius/initialization): Saxe et al. (2013); Goodfellow et al. (2016).

Key ideas: Where it shows up
  1. PCA and covariance spectra

  • Covariance $\Sigma=\tfrac{1}{n} X_c^\top X_c$ (PSD); eigenvectors = principal axes; eigenvalues = variances.

  • Achievements: Dimensionality reduction, whitening; core tool in vision/speech. References: Jolliffe 2002; Shlens 2014.

  2. Spectral clustering (graph Laplacian)

  • Use first $k$ eigenvectors of $L_{\text{sym}}$ to embed nodes, then k-means.

  • Achievements: Strong performance on non-convex clusters and manifold data. References: Ng–Jordan–Weiss 2002; von Luxburg 2007.

  3. PageRank / Markov chains

  • Dominant eigenvector of stochastic $P$ gives stationary distribution; computed via power iteration.

  • Achievements: Web search ranking at internet scale. References: Brin–Page 1998.

  4. Conditioning and stability (Jacobians/Hessians)

  • Largest eigenvalue relates to Lipschitz constants; affects step sizes and gradient explosion/vanishing.

  • Achievements: Initialization/normalization techniques guided by spectral radius. References: Saxe et al. 2013; Goodfellow et al. 2016.

  5. Randomized eigen/SVD for large-scale ML

  • Approximate leading eigenpairs with fewer passes over data.

  • Achievements: Scalable PCA/LSA/embedding preprocessing. References: Halko–Martinsson–Tropp 2011.
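The conditioning point in item 4 has a concrete form for quadratic objectives: gradient descent on $f(x)=\tfrac12 x^\top A x$ contracts when the step size is below $2/\lambda_\max$ and diverges above it. A minimal sketch (random SPD Hessian is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)  # SPD Hessian of f(x) = x^T A x / 2
lam_max = np.linalg.eigvalsh(A)[-1]

def gd_norm(eta, steps=500):
    """Run gradient descent on the quadratic; return final iterate norm."""
    x = rng.standard_normal(5)
    for _ in range(steps):
        x = x - eta * (A @ x)  # gradient of the quadratic is A x
    return np.linalg.norm(x)

stable = gd_norm(0.9 * 2 / lam_max)    # below the 2/λ_max threshold
unstable = gd_norm(1.1 * 2 / lam_max)  # above it: dominant mode grows
print("‖x‖ with η below 2/λ_max:", stable)
print("‖x‖ with η above 2/λ_max:", unstable)
```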

Notation
  • Eigenvalues/eigenvectors: $Av_i = \lambda_i v_i$; order eigenvalues $\lambda_1\ge\lambda_2\ge\cdots$ for symmetric PSD.

  • Decompositions: $A=V\Lambda V^{-1}$ (diagonalizable); symmetric $A=Q \Lambda Q^\top$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$; for unit $x$, $\rho(x)=x^\top A x$.

  • Power iteration step: $x_{k+1} = A x_k / \lVert A x_k\rVert$; eigenvalue estimate $\lambda_k = x_k^\top A x_k$ if $\lVert x_k\rVert=1$.

  • Laplacian eigenmaps: $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$; use the eigenvectors of the $k$ smallest nontrivial eigenvalues.

  • Examples:

    • Dominant eigenpair by power iteration on SPD matrix.

    • PCA via eigen-decomposition of $\Sigma$.

    • PageRank: eigenvector of stochastic matrix with eigenvalue 1.

Pitfalls & sanity checks
  • Non-symmetric matrices may have complex eigenvalues; use appropriate routines.

  • Power iteration slows when eigengap is small; consider Lanczos/Arnoldi or deflation.

  • Normalize in power iteration to avoid overflow/underflow.

  • Center data before PCA; otherwise the leading eigenvector tends to capture the mean offset rather than a direction of maximal variance.

  • Gershgorin bounds are often loose; use them as quick qualitative checks.
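The deflation idea mentioned above can be sketched in a few lines for a symmetric matrix: after power iteration finds $(\lambda_1, v_1)$, subtracting $\lambda_1 v_1 v_1^\top$ removes that component, so power iteration on the deflated matrix recovers $\lambda_2$. (The generous iteration count is an assumption chosen so a tiny eigengap cannot spoil the check.)

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + 0.1 * np.eye(6)  # SPD test matrix

def power_iteration(M, iters=5000):
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x @ M @ x, x  # Rayleigh quotient estimate, unit vector

lam1, v1 = power_iteration(A)
# Deflation: remove the found component; the top eigenpair of A' is (λ2, v2)
A_defl = A - lam1 * np.outer(v1, v1)
lam2, v2 = power_iteration(A_defl)

true = np.linalg.eigvalsh(A)[::-1]  # descending order
print("power + deflation:", lam1, lam2)
print("eigvalsh         :", true[0], true[1])
```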

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  4. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Spectral methods and applications

  5. Jolliffe, I. (2002). Principal Component Analysis.

  6. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  7. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering: Analysis and an Algorithm.

  8. von Luxburg, U. (2007). A Tutorial on Spectral Clustering.

  9. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.

  10. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.

  11. Saxe, A., McClelland, J., & Ganguli, S. (2013). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks.

  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  13. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  14. Langville, A., & Meyer, C. (2006). Google's PageRank and Beyond.

Five worked examples

Worked Example 1: Power iteration for dominant eigenpair (SPD matrix)#

Introduction#

Compute the largest eigenvalue/vector of an SPD matrix via power iteration and compare to NumPy's eigh.

Purpose#

Illustrate convergence rate and normalization; give a practical recipe.

Importance#

Many large-scale methods rely on top eigenpairs (PCA, spectral norm estimation).

What this example demonstrates#

  • Convergence rate depends on the eigengap ratio $|\lambda_2/\lambda_1|$.

  • Rayleigh quotient of iterates approximates $\lambda_1$.

Background#

Power methods date back a century and remain relevant for scalable eigensolvers.

Historical context#

von Mises/Householder refinements; modern variants include Lanczos/Arnoldi.

Prevalence in ML#

Used implicitly in randomized PCA, spectral norm estimation, and operator norm regularization.

Notes#

  • Normalize every step; monitor residual $\lVert Ax-\lambda x\rVert$.

Connection to ML#

Top eigenvalue relates to Lipschitz constants; top eigenvector for PCA directions.

Connection to Linear Algebra Theory#

Convergence proof via eigen-expansion; Rayleigh quotient bounds.

Pedagogical Significance#

Shows a simple iterative algorithm achieving an eigenpair without full decomposition.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
d = 6
A = np.random.randn(d, d)
A = A.T @ A + 0.5 * np.eye(d)  # SPD

def power_iteration(A, iters=50):
    x = np.random.randn(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    lam = x @ (A @ x)  # Rayleigh quotient of the unit iterate
    return lam, x

lam_pi, v_pi = power_iteration(A, iters=40)
eigvals, eigvecs = np.linalg.eigh(A)
lam_true = eigvals[-1]
print("power iteration λ≈", round(lam_pi, 6), " true λ=", round(lam_true, 6))
print("angle to true v (deg):", round(np.degrees(np.arccos(np.clip(abs(v_pi @ eigvecs[:, -1]), -1, 1))), 6))

Worked Example 2: PCA via eigen-decomposition of covariance#

Introduction#

Compute covariance, eigenpairs, and project data onto top-$k$ components; verify variance captured.

Purpose#

Connect PCA steps to eigenvalues/eigenvectors and retained variance.

Importance#

Core dimensionality reduction tool in ML.

What this example demonstrates#

  • $\Sigma = \tfrac{1}{n} X_c^\top X_c$ is PSD; eigenvalues = variances along eigenvectors.

  • Retained variance ratio from top-$k$ eigenvalues.

Background#

Eigen-decomposition of covariance equals SVD-based PCA.

Historical context#

PCA roots in Pearson/Hotelling; ubiquitous in data analysis.

Prevalence in ML#

Widely used in preprocessing, visualization, and compression.

Notes#

  • Center data; consider scaling features.

Connection to ML#

Variance retention guides choice of $k$; whitening uses inverse sqrt of eigenvalues.
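The whitening step mentioned above can be sketched directly from the eigendecomposition: multiplying centered data by $\Sigma^{-1/2} = V\,\mathrm{diag}(\lambda_i^{-1/2})\,V^\top$ (the ZCA form) yields identity covariance. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 4
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

Sigma = Xc.T @ Xc / n
evals, V = np.linalg.eigh(Sigma)

# Whitening: rotate to eigenbasis, rescale by 1/sqrt(eigenvalue), rotate back
W = V @ np.diag(1.0 / np.sqrt(evals)) @ V.T  # Σ^{-1/2} (ZCA whitening)
Xw = Xc @ W
Sigma_w = Xw.T @ Xw / n
print("whitened covariance ≈ I:", np.allclose(Sigma_w, np.eye(d), atol=1e-8))
```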

Connection to Linear Algebra Theory#

PSD eigen-structure; orthogonal projections onto principal subspaces.

Pedagogical Significance#

Demonstrates PSD eigen-decomposition in a practical workflow.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 120, 8, 3
X = np.random.randn(n, d) @ np.diag(np.linspace(3, 0.5, d))
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
idx = np.argsort(evals)[::-1]
evals, evecs = evals[idx], evecs[:, idx]

Vk = evecs[:, :k]
X_proj = Xc @ Vk
variance_retained = evals[:k].sum() / evals.sum()
print("Top-k variance retained:", round(variance_retained, 4))

Worked Example 3: PageRank via power iteration (stochastic matrix)#

Introduction#

Compute PageRank on a small directed graph using power iteration on a stochastic matrix with damping.

Purpose#

Show Perron–Frobenius in action and convergence to the dominant eigenvector.

Importance#

Seminal large-scale eigen-application; template for Markov-chain ranking.

What this example demonstrates#

  • Transition matrix $P$ has eigenvalue 1; power iteration converges to stationary distribution.

  • Damping ensures irreducibility/aperiodicity.

Background#

Web graph ranking; the damping factor (e.g., 0.85) handles dangling nodes and spider traps.

Historical context#

Brin–Page (1998) launched web search revolution.

Prevalence in ML#

Graph ranking, recommendation propagation, random-walk-based features.

Notes#

  • Normalize columns to sum to 1; add teleportation for damping.

Connection to ML#

Graph-based semi-supervised learning often reuses random-walk ideas.

Connection to Linear Algebra Theory#

Perron–Frobenius guarantees positive dominant eigenvector; spectral gap drives convergence.

Pedagogical Significance#

Concrete, small-scale power iteration on a stochastic matrix.

References#

  1. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.

  2. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Solution (Python)#

import numpy as np

# Small directed graph adjacency
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# Column-stochastic transition (out-links per column)
col_sums = A.sum(axis=0, keepdims=True)
P = A / np.where(col_sums == 0, 1, col_sums)

alpha = 0.85
n = P.shape[0]
J = np.ones((n, n)) / n
M = alpha * P + (1 - alpha) * J  # damping

v = np.ones(n) / n
for _ in range(50):
    v = M @ v
    v = v / v.sum()

eigvals, eigvecs = np.linalg.eig(M)
idx = np.argmax(np.real(eigvals))
v_true = np.real(eigvecs[:, idx])
v_true = v_true / v_true.sum()

print("Power iteration PageRank:", np.round(v, 4))
print("Eigenvector PageRank:", np.round(v_true, 4))

Worked Example 4: Gershgorin disks vs true eigenvalues (bounds)#

Introduction#

Compute Gershgorin disks for a matrix and compare to actual eigenvalues to illustrate spectrum localization.

Purpose#

Provide quick sanity bounds on eigenvalues without full eigendecomposition.

Importance#

Useful for diagnosing stability (e.g., Jacobians) and conditioning.

What this example demonstrates#

  • All eigenvalues lie within the union of Gershgorin disks centered at $a_{ii}$ with radius $\sum_{j\neq i} |a_{ij}|$.

Background#

Classic theorem (1931) for eigenvalue localization.

Historical context#

Still a staple for quick qualitative checks.

Prevalence in ML#

Less direct, but useful for reasoning about spectrum without expensive computation.

Notes#

  • Tightness varies; row/column scaling can sharpen disks.
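The scaling remark above can be demonstrated with a diagonal similarity transform: $D^{-1} A D$ has the same eigenvalues as $A$ but different disks. A minimal sketch (the matrix and the choice $D = \mathrm{diag}(1, 4)$ are illustrative assumptions):

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [4.0, 10.0]])

def disks(M):
    """Gershgorin centers (diagonal) and radii (off-diagonal row sums)."""
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)
    return centers, radii

c0, r0 = disks(A)

# Similarity transform D^{-1} A D preserves eigenvalues but reshapes disks
D = np.diag([1.0, 4.0])
As = np.linalg.inv(D) @ A @ D
c1, r1 = disks(As)

print("original radii:", r0, " scaled radii:", r1)
print("eigenvalues unchanged:",
      np.allclose(np.sort(np.linalg.eigvals(A).real),
                  np.sort(np.linalg.eigvals(As).real)))
```

Here the largest disk radius shrinks from 4 to 1, giving a tighter inclusion region for the same spectrum.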

Connection to ML#

Stability analysis of iterative methods; rough spectral norm estimates.

Connection to Linear Algebra Theory#

Eigenvalue inclusion sets; diagonal dominance implications.

Pedagogical Significance#

Gives a geometric picture of eigenvalue bounds.

References#

  1. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  2. Horn & Johnson (2013). Matrix Analysis.

Solution (Python)#

import numpy as np

np.random.seed(2)
A = np.random.randn(4, 4)

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
eigvals = np.linalg.eigvals(A)

print("Gershgorin centers:", np.round(centers, 3))
print("Gershgorin radii:", np.round(radii, 3))
print("Eigenvalues:", np.round(eigvals, 3))

Worked Example 5: Spectral clustering on a toy graph (Laplacian eigenvectors)#

Introduction#

Perform unnormalized spectral clustering on a simple graph with two clusters; use second-smallest eigenvector (Fiedler) for separation.

Purpose#

Show how Laplacian eigenvectors reveal cluster structure.

Importance#

Nonlinear/non-convex cluster discovery.

What this example demonstrates#

  • $L=D-W$; smallest eigenvalue 0 with eigenvector $\mathbf{1}$; Fiedler vector splits the graph.

Background#

Graph cuts and Laplacian spectra; normalized variants common in practice.

Historical context#

Spectral clustering surged in the 2000s for manifold data.

Prevalence in ML#

Image segmentation, manifold learning, community detection.

Notes#

  • For the normalized Laplacian, use $L_{\text{sym}}$; results are similar on this toy example. Caveat: because the two triangles below are disconnected, the zero eigenvalue has multiplicity 2 and `evecs[:, 1]` is just another null vector, so the sign split can be arbitrary within the null space; adding a weak bridge edge between the clusters yields a proper, well-defined Fiedler vector.
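A minimal sketch of the normalized variant on the same two-triangle graph, with a hypothetical weak bridge edge (weight 0.1 between nodes 2 and 3, an assumption added here) so the graph is connected and the Fiedler vector is well defined:

```python
import numpy as np

W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
W[2, 3] = W[3, 2] = 0.1  # weak bridge so the graph is connected

deg = W.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
L_sym = np.eye(6) - D_inv_sqrt @ W @ D_inv_sqrt

evals, evecs = np.linalg.eigh(L_sym)
fiedler = D_inv_sqrt @ evecs[:, 1]  # map back to node space
labels = (fiedler > 0).astype(int)
print("normalized-Laplacian split:", labels)
```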

Connection to ML#

Embedding nodes using a few eigenvectors before k-means is standard pipeline.

Connection to Linear Algebra Theory#

Properties of Laplacian eigenvalues (nonnegative; multiplicity of 0 equals number of components).
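The multiplicity property above is easy to check: a graph made of two disconnected triangles has a Laplacian with exactly two zero eigenvalues. A minimal sketch:

```python
import numpy as np

# Two disconnected triangles → Laplacian null space has dimension 2
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
evals = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.abs(evals) < 1e-10))
print("eigenvalues:", np.round(evals, 6))
print("zero eigenvalues (= number of components):", num_zero)
```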

Pedagogical Significance#

Concrete end-to-end spectral clustering demonstration.

References#

  1. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  2. von Luxburg, U. (2007). A tutorial on spectral clustering.

Solution (Python)#

import numpy as np

# Two clusters with strong intra-cluster edges
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W
evals, evecs = np.linalg.eigh(L)

fiedler = evecs[:, 1]
clusters = (fiedler > 0).astype(int)
print("Eigenvalues:", np.round(evals, 4))
print("Fiedler vector:", np.round(fiedler, 4))
print("Cluster assignment via sign:", clusters)
