Chapter 10
Singular Value Decomposition
Key ideas: Introduction

Introduction#

SVD generalizes eigen-decomposition to arbitrary rectangular matrices. Columns of $U$ span the column space; columns of $V$ span the row space; singular values quantify stretch along these directions. SVD underlies dimensionality reduction, denoising, and stable linear solves.

Important ideas#

  1. Orthogonal bases

    • $U \in \mathbb{R}^{n\times n}$, $V \in \mathbb{R}^{d\times d}$ orthogonal; $\Sigma$ stores $\sigma_1 \ge \dots \ge \sigma_r > 0$ with $r=\operatorname{rank}(X)$.

  2. Spectral norm and conditioning

    • $\lVert X \rVert_2 = \sigma_1$; condition number $\kappa_2(X)=\sigma_1/\sigma_r$ for full-rank square $X$.

  3. Best low-rank approximation

    • Eckart–Young–Mirsky: $X_k = U_k \Sigma_k V_k^\top$ minimizes $\lVert X - Y \rVert_F$ over $\operatorname{rank}(Y)\le k$; error $\sum_{i>k}\sigma_i^2$.

  4. Pseudoinverse and least squares

    • $X^\dagger = V \Sigma^\dagger U^\top$; solves minimum-norm least squares and handles rank deficiency.

  5. Energy and variance

    • In PCA, singular values squared (scaled) are variances along principal axes.

  6. Truncation and regularization

    • Truncated SVD and spectral filtering (e.g., Tikhonov) improve stability for ill-conditioned problems.

  7. Computation

    • Golub–Kahan bidiagonalization; randomized SVD for large, sparse data.

Relevance to ML#

  • Dimensionality reduction (PCA/LSA/embeddings).

  • Low-rank modeling for recommendation and compression.

  • Stable least squares and ridge via spectral filtering.

  • Conditioning diagnostics and gradient scaling.

  • Spectral clustering and graph embeddings (Laplacian SVD/eigendecomp relation).

Algorithmic development (milestones)#

  • 1930s: Schmidt, Eckart–Young best approximation theorem.

  • 1960s: Golub–Kahan algorithms for practical SVD.

  • 1990s: Latent Semantic Analysis uses truncated SVD for text.

  • 2000s: Randomized SVD for large-scale ML (Halko–Martinsson–Tropp).

  • 2010s: Matrix factorization dominates recommender benchmarks (Netflix era); spectral initializations in deep models.

Definitions#

  • Full SVD: $X = U \Sigma V^\top$ with $U^\top U=I_n$, $V^\top V=I_d$, $\Sigma \in \mathbb{R}^{n\times d}$ diagonal (nonnegative entries).

  • Thin SVD: $X = U_r \Sigma_r V_r^\top$ with $r=\operatorname{rank}(X)$, $U_r \in \mathbb{R}^{n\times r}$, $V_r \in \mathbb{R}^{d\times r}$.

  • Pseudoinverse: $X^\dagger = V_r \Sigma_r^{-1} U_r^\top$.

  • Spectral norm: $\lVert X \rVert_2 = \sigma_1$; Frobenius norm $\lVert X \rVert_F^2 = \sum_i \sigma_i^2$.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Existence and uniqueness (up to signs) of SVD for any real matrix.

  • Eckart–Young–Mirsky best rank-$k$ approximation theorem.

  • Relation to eigen-decomposition of $X^\top X$ and $X X^\top$.

  • Moore–Penrose pseudoinverse via SVD; minimum-norm least squares.

  • Spectral/Frobenius norm expressions via singular values.

Applied (landmark systems)#

  • PCA via SVD for dimensionality reduction (Jolliffe 2002; Shlens 2014).

  • LSA for text (Deerwester et al. 1990).

  • Matrix factorization in recommenders (Koren et al. 2009).

  • Spectral regularization for ill-posed problems (Hansen 1998).

  • Randomized SVD for large-scale ML (Halko–Martinsson–Tropp 2011).

Key ideas: Where it shows up
  1. PCA and variance decomposition

  • Centered data $X_c$ yields $\tfrac{1}{n}X_c^\top X_c = V \tfrac{\Sigma^2}{n} V^\top$. Explained variance ratios follow $\sigma_i^2$. Achievements: Shlens 2014 (tutorial), Jolliffe 2002 (classical PCA), used widely in vision/audio.

  1. Latent semantic analysis and embeddings

  • Truncated SVD on term-document matrices uncovers latent topics. Achievement: Deerwester et al. 1990 LSA; foundation for modern embeddings.

  1. Recommendation and matrix completion

  • Low-rank factorization approximates user-item matrices; SVD intuition drives alternating least squares. Achievements: Netflix Prize-era matrix factorization (Koren et al. 2009).

  1. Stable least squares and pseudoinverse

  • $x^\* = X^\dagger y$ gives minimum-norm solution; SVD reveals conditioning. Achievements: Golub–Van Loan numerical stability guidance.

  1. Regularization and denoising

  • Truncated SVD/Tikhonov filter small singular values to denoise ill-posed problems (e.g., deblurring). Achievements: Hansen 1998 (regularization tools).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$ (rows = examples).

  • SVD: $X = U \Sigma V^\top$, singular values $\sigma_i$ sorted descending.

  • Rank-$k$: $X_k = U_k \Sigma_k V_k^\top$ with $U_k \in \mathbb{R}^{n\times k}$, $\Sigma_k \in \mathbb{R}^{k\times k}$, $V_k \in \mathbb{R}^{d\times k}$.

  • Pseudoinverse: $X^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ replaces nonzero $\sigma_i$ by $1/\sigma_i$.

  • Explained variance: $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i} \sigma_i^2}$.

  • Example: If $X = U \Sigma V^\top$ with $\sigma_1=5,\sigma_2=2,\sigma_3=0.5$, then $\kappa_2(X)=10$ and the best rank-2 approximation error is $0.5^2$ in Frobenius norm squared.

Pitfalls & sanity checks
  • Always center data before PCA via SVD.

  • Sign ambiguity: singular vectors are unique up to sign; do not compare raw signs across runs.

  • Conditioning: tiny singular values indicate potential instability; consider truncation or ridge.

  • Shapes: ensure full_matrices=False for thin SVD when $n \gg d$ to save compute.

  • Reconstruction checks: verify $\lVert X - U\Sigma V^\top \rVert_F$ is near machine precision for correctness.

References
  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  3. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

  4. Jolliffe, I. T. (2002). Principal Component Analysis.

  5. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  6. Deerwester, S. et al. (1990). Indexing by Latent Semantic Analysis.

  7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  8. Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems.

  9. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

Five worked examples

Worked Example 1: Full SVD, rank, and reconstruction#

Introduction#

Compute thin SVD of a tall matrix, verify reconstruction and conditioning.

Purpose#

Show how $U,\Sigma,V$ capture range, domain, and scaling; check rank and condition number.

Importance#

SVD diagnostics reveal redundancy and stability before downstream tasks.

What this example demonstrates#

  • Thin SVD of $X \in \mathbb{R}^{n\times d}$.

  • Reconstruction $U\Sigma V^\top$ matches $X$.

  • Rank from nonzero singular values; condition number from extremes.

Background#

SVD generalizes eigen-decomposition; thin form is efficient when $n \gg d$.

Historical context#

Golub–Kahan bidiagonalization made SVD practical (1965).

History#

Adopted as a standard numerical primitive in LAPACK/BLAS.

Prevalence in ML#

Used in preprocessing, diagnostics, and as a baseline factorization.

Notes#

  • Use np.linalg.svd with full_matrices=False for thin SVD.

  • Small singular values indicate near-rank-deficiency.

Connection to ML#

Conditioning guides choice of solvers/regularizers; rank informs model capacity.

Connection to Linear Algebra Theory#

$X^\top X = V \Sigma^2 V^\top$; nonzero singular values are square roots of eigenvalues.

Pedagogical Significance#

Concrete start-to-finish SVD computation with sanity checks.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 8, 4
X = np.random.randn(n, d) + 0.2 * np.random.randn(n, d)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)
X_rec = U @ Sigma @ Vt

rank = np.sum(s > 1e-10)
kappa = s[0] / s[-1]

print("singular values:", np.round(s, 4))
print("rank:", rank, "cond:", round(kappa, 4))
print("reconstruction error Fro:", np.linalg.norm(X - X_rec, "fro"))

Worked Example 2: Best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate low-rank approximation of an image-like matrix and quantify error decay.

Purpose#

Show optimality of truncated SVD and energy captured by top-$k$ singular values.

Importance#

Core to compression, recommendation, and fast inference with reduced models.

What this example demonstrates#

  • Build $X$, compute SVD, form $X_k$.

  • Compare Frobenius error to tail singular values.

  • Report explained variance ratio.

Background#

Eckart–Young–Mirsky: truncated SVD minimizes Frobenius/spectral error among rank-$k$ matrices.

Historical context#

Eckart & Young (1936) established the optimality theorem.

History#

Used in image compression and modern low-rank neural adaptation.

Prevalence in ML#

Standard for LSA, recommender warm-starts, and distillation via low-rank layers.

Notes#

  • Sort singular values (already sorted by SVD) before computing energy ratios.

Connection to ML#

Determines how many components to keep for accuracy/latency trade-offs.

Connection to Linear Algebra Theory#

Error squared equals sum of squared discarded singular values.

Pedagogical Significance#

Links abstract theorem to a tangible compression example.

References#

  1. Eckart & Young (1936). Approximation of matrices.

  2. Hansen (1998). Rank truncation for regularization.

Solution (Python)#

import numpy as np

np.random.seed(1)
X = np.random.randn(40, 30)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err = np.linalg.norm(X - Xk, "fro")
tail = np.sqrt((s[k:] ** 2).sum())
ev_ratio = (s[:k] ** 2).sum() / (s ** 2).sum()

print("rank-k error:", round(err, 4), "tail from theorem:", round(tail, 4))
print("explained variance ratio:", round(ev_ratio, 4))

Worked Example 3: PCA via SVD on centered data#

Introduction#

Use SVD on centered data to extract principal components and explained variance.

Purpose#

Show that PCA can be computed via SVD without forming the covariance matrix explicitly.

Importance#

Numerically stable and efficient for tall data; avoids $d\times d$ covariance when $d$ is large.

What this example demonstrates#

  • Center data $X_c$.

  • Compute SVD of $X_c / \sqrt{n}$ to obtain principal directions and variances.

  • Project onto top components and report variance explained.

Background#

PCA solves an eigenproblem on covariance; SVD is an equivalent and stable route.

Historical context#

Hotelling (1933) PCA; SVD popularized for PCA in numerical practice.

History#

Standard in scikit-learn and ML toolkits.

Prevalence in ML#

Common preprocessing for vision, speech, genomics, and exploratory analysis.

Notes#

  • Center rows; scaling affects covariance.

Connection to ML#

Dimensionality reduction preserves variance and often improves downstream generalization.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition equals right-singular structure of centered data.

Pedagogical Significance#

Shows the practical equivalence of PCA and SVD.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). Tutorial on PCA.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
ev = s**2
ev_ratio = ev / ev.sum()

k = 2
Z = Xc @ Vt[:k].T  # scores

print("singular values (scaled):", np.round(s, 4))
print("explained variance ratios:", np.round(ev_ratio, 4))
print("projected shape:", Z.shape)

Worked Example 4: Least squares via pseudoinverse and SVD#

Introduction#

Solve an overdetermined system with SVD-based pseudoinverse; compare to normal equations.

Purpose#

Highlight stability of SVD solves, especially when $X^\top X$ is ill-conditioned.

Importance#

Prevents amplification of noise from near-singular normal matrices.

What this example demonstrates#

  • Compute $x^\* = X^\dagger y$.

  • Compare residuals and conditioning to normal-equation solution.

Background#

Normal equations square condition number; SVD avoids this amplification.

Historical context#

Moore–Penrose pseudoinverse formalized generalized inverses; SVD provides a constructive route.

History#

Standard approach in numerical linear algebra libraries for robust least squares.

Prevalence in ML#

Used in linear regression baselines, design-matrix solvers, and as subroutines in algorithms.

Notes#

  • Small singular values can be truncated or damped for better stability.

Connection to ML#

Improves robustness of linear regression and feature engineering pipelines.

Connection to Linear Algebra Theory#

Pseudoinverse gives minimum-norm solution among all least-squares minimizers.

Pedagogical Significance#

Shows practical benefit of SVD over naive normal equations.

References#

  1. Golub & Van Loan (2013). Stability of least squares.

  2. Hansen (1998). Regularization.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 80, 6
X = np.random.randn(n, d)
# Make columns nearly dependent
X[:, 0] = X[:, 1] + 1e-3 * np.random.randn(n)
w_true = np.array([2, -2, 0.5, 0, 0, 0])
y = X @ w_true + 0.05 * np.random.randn(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma_inv = np.diag(1.0 / s)
w_svd = Vt.T @ Sigma_inv @ U.T @ y

w_ne = np.linalg.lstsq(X, y, rcond=None)[0]

res_svd = np.linalg.norm(X @ w_svd - y)
res_ne = np.linalg.norm(X @ w_ne - y)

print("singular values:", np.round(s, 6))
print("residual SVD:", round(res_svd, 6), "residual normal eqn:", round(res_ne, 6))

Worked Example 5: Truncated SVD for regularization and denoising#

Introduction#

Apply truncated SVD to an ill-conditioned linear system to reduce noise amplification.

Purpose#

Illustrate spectral filtering: discard tiny singular values to stabilize solutions.

Importance#

Prevents overfitting/noise blow-up in inverse problems and regression with multicollinearity.

What this example demonstrates#

  • Construct ill-conditioned matrix with decaying singular values.

  • Solve with full pseudoinverse vs. truncated SVD; compare error to ground truth.

Background#

Truncated SVD is a classical regularizer; related to Tikhonov but with hard cutoff in spectrum.

Historical context#

Widely used in inverse problems (Hansen 1998) and information retrieval.

History#

Precedes modern weight decay; spectral analog of feature pruning.

Prevalence in ML#

Used in deblurring, recommendation cold-start smoothing, and feature denoising.

Notes#

  • Choose cutoff based on noise level or cross-validation.

Connection to ML#

Improves generalization when design matrices are ill-conditioned.

Connection to Linear Algebra Theory#

Filters small singular values; solution lives in dominant singular subspace.

Pedagogical Significance#

Shows tangible benefit of truncating spectrum.

References#

  1. Hansen (1998). Rank truncation and regularization.

  2. Golub & Van Loan (2013). Regularized least squares.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 50, 30
# Create decaying singular values
U_rand, _ = np.linalg.qr(np.random.randn(n, n))
V_rand, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -3, min(n, d))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U_rand @ Sigma @ V_rand.T

w_true = np.zeros(d)
w_true[:5] = [2, -1, 0.5, 0.5, -0.2]
y_clean = A @ w_true
y = y_clean + 0.01 * np.random.randn(n)

U, sv, Vt = np.linalg.svd(A, full_matrices=False)
# Full pseudoinverse solution
w_full = (Vt.T / sv) @ (U.T @ y)
# Truncate small singular values
cut = 10
sv_trunc = sv[:cut]
w_trunc = (Vt[:cut].T / sv_trunc) @ (U[:, :cut].T @ y)

err_full = np.linalg.norm(w_full - w_true)
err_trunc = np.linalg.norm(w_trunc - w_true)

print("smallest singular value:", sv[-1])
print("error full:", round(err_full, 4), "error trunc:", round(err_trunc, 4))

Comments

Algorithm Category
Historical & Attribution
Key Concepts & Theorems
Learning Path & Sequencing
Matrix Decompositions
ML Applications
Theoretical Foundation