Introduction
Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares reduces to the normal equations $(X^\top X) w = X^\top y$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization solves preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.
Important ideas
Gaussian elimination and LU factorization
Direct method: $A = LU$ via row operations.
Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.
Cost: $O(n^3)$ for dense matrices; without pivoting, small pivots cause numerical instability, so partial pivoting is standard.
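As a concrete sketch of the factor-once, solve-many pattern, using SciPy's `lu_factor`/`lu_solve` (the matrix values here are arbitrary illustrations):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Small illustrative system (values chosen arbitrarily).
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([4.0, 10.0, 24.0])

# Factor once with partial pivoting; reuse the factors for any
# number of right-hand sides at O(n^2) per solve.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)   # forward/back-substitution
```

For a well-conditioned matrix the residual $b - Ax$ comes out at machine precision.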
Cholesky factorization for SPD systems
For symmetric positive definite $A$: $A = L L^\top$ (one triangle).
Cost: $n^3/3$ flops (half the cost of LU); more stable than LU.
Numerically stable if $A$ is well-conditioned.
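A minimal sketch with SciPy's `cho_factor`/`cho_solve` on an SPD Gram matrix (the ridge-style diagonal shift guarantees positive definiteness; all values are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
A = X.T @ X + 1e-3 * np.eye(5)   # SPD Gram matrix with a small shift
b = rng.standard_normal(5)

c, low = cho_factor(A)           # A = L L^T, only one triangle stored
x = cho_solve((c, low), b)
```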
QR factorization and least squares
$A = QR$ with orthonormal $Q$ and upper triangular $R$.
Numerically stable; used for least squares (Chapter 12).
Cost: $O(n d^2)$ for an $n \times d$ matrix with $n \ge d$.
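Least squares via reduced QR, as a sketch in NumPy (random data for illustration): factor $A = QR$, then solve the triangular system $R x = Q^\top b$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))   # tall n x d design matrix
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)              # reduced QR: Q is 100 x 3, R is 3 x 3
x = np.linalg.solve(R, Q.T @ b)     # triangular solve for the LS solution
```

This avoids forming $A^\top A$, which would square the condition number.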
Iterative solvers and conjugate gradient
Conjugate gradient (CG): for SPD systems; terminates in at most $n$ iterations in exact arithmetic, and in practice converges in $\ll n$ iterations.
GMRES/MINRES: for general/symmetric non-SPD systems.
Cost per iteration: $O(\mathrm{nnz}(A))$, where $\mathrm{nnz}(A)$ is the number of nonzeros; scales to $n \sim 10^8$ and beyond.
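A sketch using `scipy.sparse.linalg.cg` on a 1-D Laplacian (tridiagonal, SPD, very sparse); each iteration touches only the $\approx 3n$ nonzeros:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 1000
# 1-D Laplacian: tridiagonal, symmetric positive definite.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# info == 0 signals convergence to the default relative tolerance.
x, info = cg(A, b)
```

The matrix is never densified; CG only needs matrix-vector products.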
Preconditioning and conditioning
Condition number $\kappa(A)$ governs iteration complexity: CG reduces the error by a factor of roughly $(\sqrt{\kappa} - 1) / (\sqrt{\kappa} + 1)$ per iteration.
Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.
Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.
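The simplest of these, the Jacobi (diagonal) preconditioner, can be sketched as a `LinearOperator` applying $M^{-1} v = v / \operatorname{diag}(A)$; the badly scaled diagonal below is a contrived illustration of a large condition number:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 500
rng = np.random.default_rng(2)
# SPD test matrix with a wildly varying diagonal -> large kappa.
d = 10.0 ** rng.uniform(0, 4, size=n)
A = (sp.diags(d) + sp.diags([-0.1, -0.1], [-1, 1], shape=(n, n))).tocsr()
b = np.ones(n)

# Jacobi preconditioner: divide by the diagonal of A.
diag = A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: v / diag)

iters = {"plain": 0, "jacobi": 0}
def counter(key):
    def cb(xk):
        iters[key] += 1
    return cb

x_plain, info1 = cg(A, b, callback=counter("plain"))
x_prec, info2 = cg(A, b, M=M, callback=counter("jacobi"))
```

Rescaling by the diagonal clusters the eigenvalues of $M^{-1} A$ near 1, so the preconditioned run needs far fewer iterations.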
Sparse matrix structure
Banded, tridiagonal, block-structured systems exploit locality.
Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.
Fill-in during factorization can destroy sparsity; ordering matters.
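Banded structure is the simplest case to exploit: a tridiagonal system can be stored as 3 rows instead of an $n \times n$ array and solved in $O(n)$. A sketch with SciPy's `solve_banded`:

```python
import numpy as np
from scipy.linalg import solve_banded

n = 100_000
# Tridiagonal system in banded storage: row 0 = superdiagonal,
# row 1 = main diagonal, row 2 = subdiagonal.
ab = np.zeros((3, n))
ab[0, 1:] = -1.0
ab[1, :] = 2.0
ab[2, :-1] = -1.0
b = np.ones(n)

# (1, 1) = one subdiagonal and one superdiagonal; O(n) time and memory.
x = solve_banded((1, 1), ab, b)
```

A dense factorization of the same system would need $\sim 10^{10}$ entries of storage; the banded solve needs $3n$.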
Rank deficiency and ill-posedness
Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.
Ill-conditioned $A$ (nearly rank-deficient) amplifies noise in $b$; stabilization is required.
Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.
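A sketch of the shifted normal equations above, on a contrived nearly-collinear design matrix (NumPy only; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
# Nearly collinear columns -> ill-conditioned A^T A.
A = rng.standard_normal((200, 2)) @ np.array([[1.0, 1.0],
                                              [0.0, 1e-6]])
b = rng.standard_normal(200)

lam = 1e-2
# Tikhonov/ridge: every eigenvalue of A^T A is shifted up by lam,
# so the system is well-conditioned and the solution norm is damped.
x = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)
```

The unregularized least-squares solution for this matrix has an enormous norm because the tiny singular value is inverted; the shift suppresses that amplification.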
Relevance to ML
Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.
Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.
Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.
Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.
Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.
Algorithmic development (milestones)
1670s: Newton’s method and early algebraic solutions.
1810: Gaussian elimination formalized (Gauss, Legendre).
c. 1910: Cholesky factorization developed by André-Louis Cholesky (published posthumously in 1924).
1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.
1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.
1982: LSQR algorithm (Paige–Saunders); numerically stable iterative method for least squares.
1986: GMRES (Saad–Schultz) for non-symmetric systems; MINRES for symmetric indefinite systems appeared earlier (Paige–Saunders, 1975).
1990s–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enable near-$O(n)$ scaling.
2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.
Definitions
Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.
LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.
Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.
Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.
Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.
Residual: $r = b - A x$; measure of solution error.
Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.
Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.
Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).
Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.
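Fill-in can be demonstrated directly: an "arrowhead" matrix (dense first row and column plus a diagonal) factorizes with catastrophic fill under the natural ordering, but stays sparse under a fill-reducing ordering. A sketch with SciPy's `splu` (SuperLU), using arbitrary illustrative values:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 200
# Arrowhead matrix: dense first row/column plus the diagonal.
A = sp.lil_matrix((n, n))
A.setdiag(4.0)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = float(n)   # diagonally dominant, so pivots stay in place
A = A.tocsc()

# Natural ordering: eliminating the dense row/column first
# fills in essentially the whole matrix, O(n^2) nonzeros.
lu_nat = splu(A, permc_spec="NATURAL")
# Fill-reducing ordering (COLAMD): dense node eliminated last,
# fill-in stays O(n).
lu_amd = splu(A, permc_spec="COLAMD")

fill_nat = lu_nat.L.nnz + lu_nat.U.nnz
fill_amd = lu_amd.L.nnz + lu_amd.U.nnz
```

The same matrix, the same factorization algorithm, and orders-of-magnitude difference in fill; this is why sparse direct solvers always reorder first.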