Introduction#
SVD generalizes eigendecomposition to arbitrary rectangular matrices: the columns of $U$ span the column space, the columns of $V$ span the row space, and the singular values quantify the stretch along these directions. SVD underlies dimensionality reduction, denoising, and numerically stable linear solves.
Important ideas#
Orthogonal bases: $U \in \mathbb{R}^{n\times n}$ and $V \in \mathbb{R}^{d\times d}$ are orthogonal; $\Sigma$ stores $\sigma_1 \ge \dots \ge \sigma_r > 0$ with $r=\operatorname{rank}(X)$.
Spectral norm and conditioning: $\lVert X \rVert_2 = \sigma_1$; the condition number is $\kappa_2(X)=\sigma_1/\sigma_r$ for full-rank square $X$.
Best low-rank approximation: Eckart–Young–Mirsky: the truncation $X_k = U_k \Sigma_k V_k^\top$ minimizes $\lVert X - Y \rVert_F$ over all $Y$ with $\operatorname{rank}(Y)\le k$, with squared error $\lVert X - X_k \rVert_F^2 = \sum_{i>k}\sigma_i^2$.
Pseudoinverse and least squares: $X^\dagger = V \Sigma^\dagger U^\top$ gives the minimum-norm least-squares solution and handles rank deficiency.
Energy and variance: in PCA on centered data, the variance along the $i$-th principal axis is $\sigma_i^2/(n-1)$.
Truncation and regularization: truncated SVD and spectral filtering (e.g., Tikhonov) stabilize ill-conditioned problems by damping or discarding small singular values.
Computation: Golub–Kahan bidiagonalization for dense matrices; randomized SVD for large-scale or sparse data.
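The identities above can be checked numerically. A minimal NumPy sketch (the random test matrix and the rank cutoff `k = 2` are illustrative choices, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))

# Thin SVD: U is 8x5, s holds singular values in decreasing order, Vt is 5x5.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Spectral norm equals the largest singular value.
assert np.isclose(np.linalg.norm(X, 2), s[0])

# Condition number (full column rank): sigma_1 / sigma_r.
assert np.isclose(np.linalg.cond(X), s[0] / s[-1])

# Eckart-Young: the rank-k truncation attains Frobenius error sqrt(sum_{i>k} sigma_i^2).
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.isclose(np.linalg.norm(X - Xk, "fro"), np.sqrt(np.sum(s[k:] ** 2)))
```

Each assertion corresponds directly to one bullet above, so the sketch doubles as a sanity check for hand derivations.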
Relevance to ML#
Dimensionality reduction (PCA/LSA/embeddings).
Low-rank modeling for recommendation and compression.
Stable least squares and ridge via spectral filtering.
Conditioning diagnostics and gradient scaling.
Spectral clustering and graph embeddings (Laplacian SVD/eigendecomp relation).
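To make the PCA connection concrete, here is a hedged sketch of PCA via the SVD of a centered data matrix (the synthetic anisotropic point cloud is an assumption for illustration): the squared singular values, scaled by $1/(n-1)$, match the eigenvalues of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.1])  # anisotropic cloud

Ac = A - A.mean(axis=0)                      # center each column
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)

# Variance along the i-th principal axis: sigma_i^2 / (n - 1).
var = s ** 2 / (len(A) - 1)

# These match the sample-covariance eigenvalues (sorted descending).
evals = np.linalg.eigvalsh(np.cov(Ac, rowvar=False))[::-1]
assert np.allclose(var, evals)

# Dimensionality reduction: project onto the top-2 principal axes.
Z = Ac @ Vt[:2].T
```

Computing PCA this way avoids forming the covariance matrix explicitly, which is the standard numerically preferred route.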
Algorithmic development (milestones)#
1930s: Eckart–Young best-approximation theorem (1936), building on Schmidt's earlier work on integral operators.
1960s: Golub–Kahan algorithms for practical SVD.
1990s: Latent Semantic Analysis uses truncated SVD for text.
2000s: Randomized SVD for large-scale ML (Halko–Martinsson–Tropp).
2010s: Matrix factorization dominates recommender benchmarks (Netflix era); spectral initializations in deep models.
Definitions#
Full SVD: $X = U \Sigma V^\top$ with $U^\top U=I_n$, $V^\top V=I_d$, $\Sigma \in \mathbb{R}^{n\times d}$ diagonal (nonnegative entries).
Thin SVD: $X = U_r \Sigma_r V_r^\top$ with $r=\operatorname{rank}(X)$, $U_r \in \mathbb{R}^{n\times r}$, $V_r \in \mathbb{R}^{d\times r}$.
Pseudoinverse: $X^\dagger = V_r \Sigma_r^{-1} U_r^\top$.
Spectral norm: $\lVert X \rVert_2 = \sigma_1$; Frobenius norm $\lVert X \rVert_F^2 = \sum_i \sigma_i^2$.
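The pseudoinverse definition can be implemented directly from the thin SVD and checked against NumPy's built-ins. A sketch, assuming a deliberately rank-deficient matrix built as a product of random factors:

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient design: 6 x 4 matrix of rank 3.
B = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))
y = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(B, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))            # numerical rank via a relative cutoff

# X^dagger = V_r Sigma_r^{-1} U_r^T, inverting only the nonzero singular values.
B_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
assert np.allclose(B_pinv, np.linalg.pinv(B))

# Minimum-norm least-squares solution, matching lstsq on the rank-deficient system.
w = B_pinv @ y
w_ref = np.linalg.lstsq(B, y, rcond=None)[0]
assert np.allclose(w, w_ref)
```

Truncating at the numerical rank (rather than inverting all singular values) is what makes the pseudoinverse well behaved under rank deficiency.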