Chapter 8
Eigenvalues & Eigenvectors
Key ideas: Introduction

Introduction#

Eigen-analysis provides structure for symmetric/PSD matrices (covariances, Laplacians) and general matrices (Markov chains, Jacobians). Power iteration is a simple iterative method to approximate the largest eigenvalue/eigenvector using repeated multiplication and normalization.

Important ideas#

  1. Eigenpairs $(\lambda, v)$ satisfy $Av = \lambda v$; for symmetric $A$, eigenvectors form an orthonormal basis.

  2. Spectral theorem (symmetric): $A = Q \Lambda Q^\top$ with real eigenvalues; $Q$ orthogonal, $\Lambda$ diagonal.

  3. Eigengap and convergence: Power iteration converges at rate $|\lambda_2/\lambda_1|^k$ to the dominant eigenvector when $|\lambda_1|>|\lambda_2|$.

  4. Rayleigh quotient: $\rho(x) = \dfrac{x^\top A x}{x^\top x}$; maximized at $\lambda_\max$, minimized at $\lambda_\min$ for symmetric $A$.

  5. PSD matrices: Eigenvalues nonnegative; covariances and Gram matrices are PSD.

  6. Gershgorin disks: Eigenvalues lie within unions of disks defined by row sums—gives quick bounds.

  7. Perron–Frobenius: An irreducible nonnegative matrix has a simple, positive dominant eigenvalue with a strictly positive eigenvector (Markov chains, PageRank).
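Ideas 1, 2, and 4 are easy to verify numerically. A minimal sketch (assuming a random 5×5 symmetric test matrix): the spectral theorem reconstruction $A = Q\Lambda Q^\top$ holds to machine precision, and the Rayleigh quotient of any vector stays inside $[\lambda_\min, \lambda_\max]$.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2  # symmetric test matrix

# Spectral theorem: A = Q Λ Q^T with orthogonal Q, real eigenvalues
evals, Q = np.linalg.eigh(A)
reconstruction_err = np.linalg.norm(Q @ np.diag(evals) @ Q.T - A)

# Rayleigh quotient of random vectors stays in [λ_min, λ_max]
xs = rng.standard_normal((100, 5))
rho = np.einsum('ij,jk,ik->i', xs, A, xs) / np.einsum('ij,ij->i', xs, xs)

print("reconstruction error:", reconstruction_err)
print("Rayleigh range:", rho.min(), rho.max())
print("spectrum endpoints:", evals[0], evals[-1])
```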

Relevance to ML#

  • PCA: Leading eigenvectors of covariance capture maximal variance; truncation yields best low-rank approximation.

  • Spectral clustering: Laplacian eigenvectors reveal cluster structure via graph cuts.

  • PageRank/Markov chains: Stationary distribution is dominant eigenvector.

  • Stability: Jacobian eigenvalues inform exploding/vanishing dynamics.

  • Attention/covariance spectra: Eigenvalue spread relates to conditioning and numerical stability.

Algorithmic development (milestones)#

  • 1900s: Spectral theorem for symmetric matrices.

  • 1907–1912: Perron–Frobenius theory for nonnegative matrices (Perron 1907, Frobenius 1912).

  • 1929: Power method (von Mises–Pollaczek-Geiringer), with later refinements (Householder).

  • 1931: Gershgorin circle theorem (eigenvalue bounds).

  • 1998: PageRank (Brin–Page) uses power iteration at web scale.

  • 2000s: Spectral clustering widely adopted (Ng–Jordan–Weiss 2002).

  • 2010s: Randomized eigensolvers/SVD for large-scale ML.

Definitions#

  • Eigenpair: $(\lambda, v\neq 0)$ with $Av=\lambda v$.

  • Spectrum $\sigma(A)$: set of eigenvalues; spectral radius $\rho(A)=\max |\lambda_i|$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$ for $x\neq0$.

  • Power iteration: $x_{k+1}=\dfrac{A x_k}{\lVert A x_k\rVert}$; converges to dominant eigenvector under eigengap.

  • Laplacian: $L=D-W$ (unnormalized), $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$ for graph spectral methods.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Spectral theorem (symmetric/PSD): orthonormal eigenbasis; real eigenvalues.

  • Rayleigh–Ritz: Extreme eigenvalues maximize/minimize Rayleigh quotient.

  • Perron–Frobenius: Positive eigenvector for irreducible nonnegative matrices; spectral gap governs convergence.

  • Gershgorin circle theorem: Eigenvalues lie in disk unions from row sums.

  • Power iteration convergence: Linear rate governed by $|\lambda_2/\lambda_1|$ when $|\lambda_1|>|\lambda_2|$.
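The convergence rate in the last item can be observed directly. A minimal sketch, assuming a diagonal test matrix with known spectrum ($\lambda_1=3$, $\lambda_2=2$) so the dominant eigenvector is $e_1$ and the predicted error ratio is $|\lambda_2/\lambda_1| = 2/3$:

```python
import numpy as np

# Diagonal test matrix: λ1 = 3, λ2 = 2, so predicted rate is 2/3
lams = np.array([3.0, 2.0, 1.0, 0.5])
A = np.diag(lams)

x = np.ones(4) / 2.0  # unit-norm start with a component along e1
errs = []
for k in range(30):
    x = A @ x
    x /= np.linalg.norm(x)
    # distance (sine of angle) to the dominant eigenvector e1
    errs.append(np.sqrt(1 - x[0] ** 2))

# Successive error ratios should approach |λ2/λ1| = 2/3
ratios = [errs[k + 1] / errs[k] for k in range(20, 25)]
print("observed ratios:", np.round(ratios, 4))
```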

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Spectral clustering: Ng–Jordan–Weiss (2002); von Luxburg (2007).

  • PageRank: Brin–Page (1998).

  • Randomized SVD/eigs: Halko–Martinsson–Tropp (2011).

  • Stability of deep nets (spectral radius/initialization): Saxe et al. (2013); Goodfellow et al. (2016).

Key ideas: Where it shows up
  1. PCA and covariance spectra

  • Covariance $\Sigma=\tfrac{1}{n} X_c^\top X_c$ (PSD); eigenvectors = principal axes; eigenvalues = variances.

  • Achievements: Dimensionality reduction, whitening; core tool in vision/speech. References: Jolliffe 2002; Shlens 2014.

  2. Spectral clustering (graph Laplacian)

  • Use first $k$ eigenvectors of $L_{\text{sym}}$ to embed nodes, then k-means.

  • Achievements: Strong performance on non-convex clusters and manifold data. References: Ng–Jordan–Weiss 2002; von Luxburg 2007.

  3. PageRank / Markov chains

  • Dominant eigenvector of stochastic $P$ gives stationary distribution; computed via power iteration.

  • Achievements: Web search ranking at internet scale. References: Brin–Page 1998.

  4. Conditioning and stability (Jacobians/Hessians)

  • Largest eigenvalue relates to Lipschitz constants; affects step sizes and gradient explosion/vanishing.

  • Achievements: Initialization/normalization techniques guided by spectral radius. References: Saxe et al. 2013; Goodfellow et al. 2016.

  5. Randomized eigen/SVD for large-scale ML

  • Approximate leading eigenpairs with fewer passes over data.

  • Achievements: Scalable PCA/LSA/embedding preprocessing. References: Halko–Martinsson–Tropp 2011.
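The conditioning point in item 4 has a concrete form for quadratic objectives: gradient descent on $f(x)=\tfrac12 x^\top A x$ contracts when the step size is below $2/\lambda_\max$ and diverges above it. A minimal sketch (random SPD Hessian is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)  # SPD Hessian of f(x) = x^T A x / 2
lam_max = np.linalg.eigvalsh(A)[-1]

def gd_norm(eta, steps=500):
    """Run gradient descent on the quadratic; return final iterate norm."""
    x = rng.standard_normal(5)
    for _ in range(steps):
        x = x - eta * (A @ x)  # gradient of the quadratic is A x
    return np.linalg.norm(x)

stable = gd_norm(0.9 * 2 / lam_max)    # below the 2/λ_max threshold
unstable = gd_norm(1.1 * 2 / lam_max)  # above it: dominant mode grows
print("‖x‖ with η below 2/λ_max:", stable)
print("‖x‖ with η above 2/λ_max:", unstable)
```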

Notation
  • Eigenvalues/eigenvectors: $Av_i = \lambda_i v_i$; order eigenvalues $\lambda_1\ge\lambda_2\ge\cdots$ for symmetric PSD.

  • Decompositions: $A=V\Lambda V^{-1}$ (diagonalizable); symmetric $A=Q \Lambda Q^\top$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$; for unit $x$, $\rho(x)=x^\top A x$.

  • Power iteration step: $x_{k+1} = A x_k / \lVert A x_k\rVert$; eigenvalue estimate $\lambda_k = x_k^\top A x_k$ if $\lVert x_k\rVert=1$.

  • Laplacian eigenmaps: $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$; use the eigenvectors of the $k$ smallest nontrivial eigenvalues.

  • Examples:

    • Dominant eigenpair by power iteration on SPD matrix.

    • PCA via eigen-decomposition of $\Sigma$.

    • PageRank: eigenvector of stochastic matrix with eigenvalue 1.

Pitfalls & sanity checks
  • Non-symmetric matrices may have complex eigenvalues; use appropriate routines.

  • Power iteration slows when eigengap is small; consider Lanczos/Arnoldi or deflation.

  • Normalize in power iteration to avoid overflow/underflow.

  • Center data before PCA; otherwise the leading eigenvector tends to capture the mean offset rather than a direction of maximal variance.

  • Gershgorin bounds are often loose; use them as quick qualitative checks.
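The deflation idea mentioned above can be sketched in a few lines for a symmetric matrix: after power iteration finds $(\lambda_1, v_1)$, subtracting $\lambda_1 v_1 v_1^\top$ removes that component, so power iteration on the deflated matrix recovers $\lambda_2$. (The generous iteration count is an assumption chosen so a tiny eigengap cannot spoil the check.)

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + 0.1 * np.eye(6)  # SPD test matrix

def power_iteration(M, iters=5000):
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x @ M @ x, x  # Rayleigh quotient estimate, unit vector

lam1, v1 = power_iteration(A)
# Deflation: remove the found component; the top eigenpair of A' is (λ2, v2)
A_defl = A - lam1 * np.outer(v1, v1)
lam2, v2 = power_iteration(A_defl)

true = np.linalg.eigvalsh(A)[::-1]  # descending order
print("power + deflation:", lam1, lam2)
print("eigvalsh         :", true[0], true[1])
```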

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  4. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Spectral methods and applications

  5. Jolliffe, I. (2002). Principal Component Analysis.

  6. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  7. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering: Analysis and an Algorithm.

  8. von Luxburg, U. (2007). A Tutorial on Spectral Clustering.

  9. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.

  10. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.

  11. Saxe, A., McClelland, J., & Ganguli, S. (2013). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks.

  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  13. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  14. Langville, A., & Meyer, C. (2006). Google's PageRank and Beyond.

Five worked examples

Worked Example 1: Power iteration for dominant eigenpair (SPD matrix)#

Introduction#

Compute the largest eigenvalue/vector of an SPD matrix via power iteration and compare to NumPy's eigh.

Purpose#

Illustrate convergence rate and normalization; give a practical recipe.

Importance#

Many large-scale methods rely on top eigenpairs (PCA, spectral norm estimation).

What this example demonstrates#

  • Convergence rate depends on the eigengap ratio $|\lambda_2/\lambda_1|$.

  • Rayleigh quotient of iterates approximates $\lambda_1$.

Background#

Power methods date back a century and remain relevant for scalable eigensolvers.

Historical context#

von Mises/Householder refinements; modern variants include Lanczos/Arnoldi.

Prevalence in ML#

Used implicitly in randomized PCA, spectral norm estimation, and operator norm regularization.

Notes#

  • Normalize every step; monitor residual $\lVert Ax-\lambda x\rVert$.

Connection to ML#

Top eigenvalue relates to Lipschitz constants; top eigenvector for PCA directions.

Connection to Linear Algebra Theory#

Convergence proof via eigen-expansion; Rayleigh quotient bounds.

Pedagogical Significance#

Shows a simple iterative algorithm achieving an eigenpair without full decomposition.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
d = 6
A = np.random.randn(d, d)
A = A.T @ A + 0.5 * np.eye(d)  # SPD

def power_iteration(A, iters=50):
    x = np.random.randn(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    lam = x @ (A @ x)  # Rayleigh quotient of the unit iterate
    return lam, x

lam_pi, v_pi = power_iteration(A, iters=40)
eigvals, eigvecs = np.linalg.eigh(A)
lam_true = eigvals[-1]
print("power iteration λ≈", round(lam_pi, 6), " true λ=", round(lam_true, 6))
print("angle to true v (deg):", round(np.degrees(np.arccos(np.clip(abs(v_pi @ eigvecs[:, -1]), -1, 1))), 6))

Worked Example 2: PCA via eigen-decomposition of covariance#

Introduction#

Compute covariance, eigenpairs, and project data onto top-$k$ components; verify variance captured.

Purpose#

Connect PCA steps to eigenvalues/eigenvectors and retained variance.

Importance#

Core dimensionality reduction tool in ML.

What this example demonstrates#

  • $\Sigma = \tfrac{1}{n} X_c^\top X_c$ is PSD; eigenvalues = variances along eigenvectors.

  • Retained variance ratio from top-$k$ eigenvalues.

Background#

Eigen-decomposition of covariance equals SVD-based PCA.

Historical context#

PCA roots in Pearson/Hotelling; ubiquitous in data analysis.

Prevalence in ML#

Widely used in preprocessing, visualization, and compression.

Notes#

  • Center data; consider scaling features.

Connection to ML#

Variance retention guides choice of $k$; whitening uses inverse sqrt of eigenvalues.
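The whitening step mentioned above can be sketched directly from the eigendecomposition: multiplying centered data by $\Sigma^{-1/2} = V\,\mathrm{diag}(\lambda_i^{-1/2})\,V^\top$ (the ZCA form) yields identity covariance. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 4
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

Sigma = Xc.T @ Xc / n
evals, V = np.linalg.eigh(Sigma)

# Whitening: rotate to eigenbasis, rescale by 1/sqrt(eigenvalue), rotate back
W = V @ np.diag(1.0 / np.sqrt(evals)) @ V.T  # Σ^{-1/2} (ZCA whitening)
Xw = Xc @ W
Sigma_w = Xw.T @ Xw / n
print("whitened covariance ≈ I:", np.allclose(Sigma_w, np.eye(d), atol=1e-8))
```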

Connection to Linear Algebra Theory#

PSD eigen-structure; orthogonal projections onto principal subspaces.

Pedagogical Significance#

Demonstrates PSD eigen-decomposition in a practical workflow.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 120, 8, 3
X = np.random.randn(n, d) @ np.diag(np.linspace(3, 0.5, d))
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
idx = np.argsort(evals)[::-1]
evals, evecs = evals[idx], evecs[:, idx]

Vk = evecs[:, :k]
X_proj = Xc @ Vk
variance_retained = evals[:k].sum() / evals.sum()
print("Top-k variance retained:", round(variance_retained, 4))

Worked Example 3: PageRank via power iteration (stochastic matrix)#

Introduction#

Compute PageRank on a small directed graph using power iteration on a stochastic matrix with damping.

Purpose#

Show Perron–Frobenius in action and convergence to the dominant eigenvector.

Importance#

Seminal large-scale eigen-application; template for Markov-chain ranking.

What this example demonstrates#

  • Transition matrix $P$ has eigenvalue 1; power iteration converges to stationary distribution.

  • Damping ensures irreducibility/aperiodicity.

Background#

Web graph ranking; the damping factor (e.g., 0.85) handles dangling nodes and spider traps.

Historical context#

Brin–Page (1998) launched web search revolution.

Prevalence in ML#

Graph ranking, recommendation propagation, random-walk-based features.

Notes#

  • Normalize columns to sum to 1; add teleportation for damping.

Connection to ML#

Graph-based semi-supervised learning often reuses random-walk ideas.

Connection to Linear Algebra Theory#

Perron–Frobenius guarantees positive dominant eigenvector; spectral gap drives convergence.

Pedagogical Significance#

Concrete, small-scale power iteration on a stochastic matrix.

References#

  1. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.

  2. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Solution (Python)#

import numpy as np

# Small directed graph adjacency
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# Column-stochastic transition (out-links per column)
col_sums = A.sum(axis=0, keepdims=True)
P = A / np.where(col_sums == 0, 1, col_sums)

alpha = 0.85
n = P.shape[0]
J = np.ones((n, n)) / n
M = alpha * P + (1 - alpha) * J  # damping

v = np.ones(n) / n
for _ in range(50):
    v = M @ v
    v = v / v.sum()

eigvals, eigvecs = np.linalg.eig(M)
idx = np.argmax(np.real(eigvals))
v_true = np.real(eigvecs[:, idx])
v_true = v_true / v_true.sum()

print("Power iteration PageRank:", np.round(v, 4))
print("Eigenvector PageRank:", np.round(v_true, 4))

Worked Example 4: Gershgorin disks vs true eigenvalues (bounds)#

Introduction#

Compute Gershgorin disks for a matrix and compare to actual eigenvalues to illustrate spectrum localization.

Purpose#

Provide quick sanity bounds on eigenvalues without full eigendecomposition.

Importance#

Useful for diagnosing stability (e.g., Jacobians) and conditioning.

What this example demonstrates#

  • All eigenvalues lie within the union of Gershgorin disks centered at $a_{ii}$ with radius $\sum_{j\neq i} |a_{ij}|$.

Background#

Classic theorem (1931) for eigenvalue localization.

Historical context#

Still a staple for quick qualitative checks.

Prevalence in ML#

Less direct, but useful for reasoning about spectrum without expensive computation.

Notes#

  • Tightness varies; row/column scaling can sharpen disks.
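The scaling remark above can be demonstrated with a diagonal similarity transform: $D^{-1} A D$ has the same eigenvalues as $A$ but different disks. A minimal sketch (the matrix and the choice $D = \mathrm{diag}(1, 4)$ are illustrative assumptions):

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [4.0, 10.0]])

def disks(M):
    """Gershgorin centers (diagonal) and radii (off-diagonal row sums)."""
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)
    return centers, radii

c0, r0 = disks(A)

# Similarity transform D^{-1} A D preserves eigenvalues but reshapes disks
D = np.diag([1.0, 4.0])
As = np.linalg.inv(D) @ A @ D
c1, r1 = disks(As)

print("original radii:", r0, " scaled radii:", r1)
print("eigenvalues unchanged:",
      np.allclose(np.sort(np.linalg.eigvals(A).real),
                  np.sort(np.linalg.eigvals(As).real)))
```

Here the largest disk radius shrinks from 4 to 1, giving a tighter inclusion region for the same spectrum.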

Connection to ML#

Stability analysis of iterative methods; rough spectral norm estimates.

Connection to Linear Algebra Theory#

Eigenvalue inclusion sets; diagonal dominance implications.

Pedagogical Significance#

Gives a geometric picture of eigenvalue bounds.

References#

  1. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  2. Horn & Johnson (2013). Matrix Analysis.

Solution (Python)#

import numpy as np

np.random.seed(2)
A = np.random.randn(4, 4)

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
eigvals = np.linalg.eigvals(A)

print("Gershgorin centers:", np.round(centers, 3))
print("Gershgorin radii:", np.round(radii, 3))
print("Eigenvalues:", np.round(eigvals, 3))

Worked Example 5: Spectral clustering on a toy graph (Laplacian eigenvectors)#

Introduction#

Perform unnormalized spectral clustering on a simple graph with two clusters; use second-smallest eigenvector (Fiedler) for separation.

Purpose#

Show how Laplacian eigenvectors reveal cluster structure.

Importance#

Nonlinear/non-convex cluster discovery.

What this example demonstrates#

  • $L=D-W$; smallest eigenvalue 0 with eigenvector $\mathbf{1}$; Fiedler vector splits the graph.

Background#

Graph cuts and Laplacian spectra; normalized variants common in practice.

Historical context#

Spectral clustering surged in the 2000s for manifold data.

Prevalence in ML#

Image segmentation, manifold learning, community detection.

Notes#

  • For the normalized Laplacian, use $L_{\text{sym}}$; results are similar on this toy example. Caveat: because the two triangles below are disconnected, the zero eigenvalue has multiplicity 2 and `evecs[:, 1]` is just another null vector, so the sign split can be arbitrary within the null space; adding a weak bridge edge between the clusters yields a proper, well-defined Fiedler vector.
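A minimal sketch of the normalized variant on the same two-triangle graph, with a hypothetical weak bridge edge (weight 0.1 between nodes 2 and 3, an assumption added here) so the graph is connected and the Fiedler vector is well defined:

```python
import numpy as np

W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
W[2, 3] = W[3, 2] = 0.1  # weak bridge so the graph is connected

deg = W.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
L_sym = np.eye(6) - D_inv_sqrt @ W @ D_inv_sqrt

evals, evecs = np.linalg.eigh(L_sym)
fiedler = D_inv_sqrt @ evecs[:, 1]  # map back to node space
labels = (fiedler > 0).astype(int)
print("normalized-Laplacian split:", labels)
```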

Connection to ML#

Embedding nodes using a few eigenvectors before k-means is standard pipeline.

Connection to Linear Algebra Theory#

Properties of Laplacian eigenvalues (nonnegative; multiplicity of 0 equals number of components).
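The multiplicity property above is easy to check: a graph made of two disconnected triangles has a Laplacian with exactly two zero eigenvalues. A minimal sketch:

```python
import numpy as np

# Two disconnected triangles → Laplacian null space has dimension 2
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
evals = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.abs(evals) < 1e-10))
print("eigenvalues:", np.round(evals, 6))
print("zero eigenvalues (= number of components):", num_zero)
```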

Pedagogical Significance#

Concrete end-to-end spectral clustering demonstration.

References#

  1. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  2. von Luxburg, U. (2007). A tutorial on spectral clustering.

Solution (Python)#

import numpy as np

# Two clusters with strong intra-cluster edges
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W
evals, evecs = np.linalg.eigh(L)

fiedler = evecs[:, 1]
clusters = (fiedler > 0).astype(int)
print("Eigenvalues:", np.round(evals, 4))
print("Fiedler vector:", np.round(fiedler, 4))
print("Cluster assignment via sign:", clusters)
