Chapter 15
Sparse Matrices
Key ideas: Introduction

Introduction#

Most real-world data matrices are sparse: social networks are sparse graphs; text is sparse (a document uses a few hundred words out of a million-word vocabulary); images are sparse in wavelet domains; scientific simulations generate sparse discretizations. Dense matrices require $O(n^2)$ memory and $O(n^3)$ work for factorization; sparse matrices require only $O(\text{nnz})$ memory and $O(\text{nnz})$ work per matrix-vector product. A sparse matrix with $\text{nnz} = O(n)$ or $O(n \log n)$ therefore reduces many computations from cubic to linear or nearly linear, enabling algorithms that would be intractable on dense representations. Exploiting sparsity requires careful algorithm design: level-of-fill control, reordering strategies, specialized data structures, and communication patterns in distributed solvers.

Important ideas#

  1. Sparse matrix representations

    • Compressed Sparse Row (CSR): store nonzeros by row; three arrays (values, column indices, row pointers); $O(\text{nnz} + n)$ memory.

    • Compressed Sparse Column (CSC): symmetric role for columns; efficient for column access.

    • Coordinate format (COO): list of $(i, j, v)$ triples; inefficient for computation but flexible for construction.

    • Diagonal, tridiagonal, block-structured: exploit special structure for even greater efficiency.

  2. Sparsity patterns and fill-in

    • Reachability: nonzero $(i, j)$ entry means path from $i$ to $j$ in graph; reveals communication pattern.

    • Fill-in: $L, U$ in $A = LU$ can have more nonzeros than $A$; reordering (minimum degree, nested dissection) minimizes fill-in.

    • Symbolic factorization: determine nonzero pattern of $L, U$ without computing values.

  3. Matrix-vector products and efficient computation

    • Dense $y = A x$ costs $O(n^2)$; sparse costs $O(\text{nnz})$.

    • SpMV (sparse matrix-vector product) is bandwidth-limited on modern hardware; careful data layout crucial.

    • GPU sparse kernels: CUSPARSE; CPU sparse kernels: MKL Sparse BLAS, OpenMP parallelization.

  4. Direct sparse solvers

    • LU with partial pivoting: reorder to minimize fill-in, apply pivoting.

    • Cholesky for SPD: symmetric structure implies symmetric reordering; often very sparse.

    • Supernodal methods: group small dense blocks for cache efficiency; key to modern solvers (PARDISO, SuperLU, UMFPACK).

  5. Iterative sparse solvers

    • CG, GMRES, MINRES: require only matrix-vector products; no explicit factorization.

    • Preconditioning essential: ILU (incomplete LU), IC (incomplete Cholesky), algebraic multigrid (AMG) enable convergence.

    • Matrix-free methods: supply only matrix-vector product function; never form or store matrix.

  6. Graph algorithms and adjacency matrices

    • Adjacency matrix $A$ where $A_{ij} = 1$ if edge $(i, j)$ exists; number of nonzeros = $2 \times$ edges (undirected).

    • Shortest paths, PageRank, diffusion: powers of the adjacency matrix encode walk counts and reachability.

    • Laplacian matrix $L = D - A$ (degree diagonal minus adjacency); eigenvectors yield graph partitions (spectral clustering).

  7. Storage and computational complexity trade-offs

    • Dense $A \in \mathbb{R}^{n \times n}$: $O(n^2)$ memory, $O(n^3)$ LU, $O(n^2)$ SpMV.

    • Sparse $A$ with $\text{nnz} = O(n)$: $O(n)$ memory, roughly $O(n^{1.5})$ LU for 2D mesh-like problems with nested-dissection reordering, $O(n)$ SpMV.

    • For $n = 10^6$: dense storage requires terabytes; sparse storage (with $\text{nnz} = 10^7$) fits in on the order of 100 MB in CSR.
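As a quick check of these orders of magnitude, the sketch below estimates raw storage for a hypothetical $n = 10^6$ matrix with $10^7$ nonzeros, assuming float64 values and 32-bit indices (illustrative numbers only, not a measurement):

# Back-of-the-envelope storage estimate: dense vs. CSR
n = 10**6
nnz = 10**7

dense_bytes = n * n * 8                       # n^2 float64 values
csr_bytes = nnz * 8 + nnz * 4 + (n + 1) * 4   # values + column indices + row pointers

print("dense: {:.1f} TB".format(dense_bytes / 1e12))
print("CSR:   {:.0f} MB".format(csr_bytes / 1e6))
print("ratio: {:.0f}x".format(dense_bytes / csr_bytes))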

Relevance to ML#

  • Text and NLP: Document-term matrices are sparse (million words, hundreds per document); sparse SVD enables topic models.

  • Graphs and networks: Social networks, citation graphs, knowledge graphs are sparse; graph neural networks leverage adjacency matrix sparsity.

  • Recommender systems: User-item ratings sparse ($\sim 0.1\%$ observed); matrix completion via sparse factorization.

  • Inverse problems and imaging: Medical imaging, seismic inversion, tomography discretizations are sparse; sparse solvers essential.

  • Large-scale optimization: SGD and proximal methods operate on sparse data; distributed solvers exploit communication sparsity.

Algorithmic development (milestones)#

  • 1960s: Sparse matrix computation recognized as practical necessity in structural engineering (finite element analysis).

  • 1967: Tinney & Walker develop minimum degree ordering; George later extends to nested dissection (1973).

  • 1979: Duff develops efficient symbolic factorization algorithms.

  • 1980s: Supernodal methods developed (Ashcraft, Eisenstat); enable cache-efficient dense kernels within sparse framework.

  • 1990s: PARDISO, SuperLU, UMFPACK emerge as production sparse solvers.

  • 2000s: Algebraic multigrid (AMG) becomes standard preconditioner for sparse iterative methods.

  • 2005–2010: GPU sparse kernels (CUSPARSE) enable acceleration; sparse matrix formats adapt to GPU memory.

  • 2010s: Sparse deep learning (neural networks on graph-structured data); graph convolutional networks (GCN), graph attention networks (GAT) exploit adjacency matrix sparsity.

Definitions#

  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A) \ll n^2$ nonzeros; typically $\text{nnz} = O(n)$ or $O(n \log n)$.

  • Sparsity: $\text{sparsity} = 1 - \frac{\text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity} > 0.99$ is typical for ML.

  • Compressed Sparse Row (CSR): Three arrays: values[nnz], col[nnz], row_ptr[n+1]; row_ptr[i] points to start of row $i$ in values/col.

  • Fill-in: Nonzero entries created during factorization that were zero in original matrix.

  • Reordering: Permutation $P$ such that $P A P^\top$ has lower fill-in; minimum degree, nested dissection are common strategies.

  • Graph of a matrix: Undirected graph $G$ where vertices are matrix indices, edge $(i, j)$ exists iff $A_{ij} \ne 0$.

  • Degree: Vertex $i$ has degree = number of nonzeros in row/column $i$.

  • Laplacian matrix: $L = D - A$ where $D = \text{diag}(d_1, \ldots, d_n)$ is degree matrix, $A$ is adjacency matrix.
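A minimal sketch making the CSR and Laplacian definitions above concrete on a tiny assumed example (a 4-node path graph, not taken from the text):

import numpy as np
import scipy.sparse as sp

# Adjacency matrix of the undirected path graph 0-1-2-3
edges = [(0, 1), (1, 2), (2, 3)]
rows = [i for i, j in edges] + [j for i, j in edges]
cols = [j for i, j in edges] + [i for i, j in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

# The three CSR arrays from the definition above
print("values :", A.data)      # nonzero values, stored row by row
print("col    :", A.indices)   # column index of each stored value
print("row_ptr:", A.indptr)    # row i occupies data[indptr[i]:indptr[i+1]]

# Degree matrix and graph Laplacian L = D - A
degrees = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(degrees) - A
print("Laplacian:\n", L.toarray())
print("smallest eigenvalue (should be ~0):", np.linalg.eigvalsh(L.toarray())[0])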

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Sparse matrix representations: CSR, CSC, COO formats; trade-offs between storage and access patterns. Reference: Golub & Van Loan (2013).

  • Fill-in and reordering: Minimum degree, nested dissection reduce fill-in; symbolic factorization predicts nonzero pattern. Reference: George & Liu (1981).

  • SpMV complexity: $O(\text{nnz})$ time, bandwidth-limited on modern hardware. Reference: Williams et al. (2009).

  • Direct sparse solvers: LU with reordering, supernodal methods, pivot selection for stability. Reference: Davis (2006) (SuiteSparse); Schenk & Gärtner (2004) (PARDISO).

  • Iterative sparse solvers: CG, GMRES on Krylov subspaces; preconditioning essential (ILU, IC, AMG). Reference: Saad (2003); Briggs et al. (2000) (multigrid tutorial).

  • Graph Laplacian and spectral methods: Eigenvalues/eigenvectors of $L = D - A$ for clustering, diffusion. Reference: Spielman & Teng (2004).
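The spectral bullet above can be illustrated with a minimal spectral-bisection sketch on an assumed toy graph (two 5-node cliques joined by one edge, not from the text); the sign pattern of the Fiedler vector recovers the two groups.

import numpy as np

# Two 5-node cliques joined by a single bridge edge
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0    # bridge between the cliques

D = np.diag(A.sum(axis=1))
L = D - A                   # graph Laplacian

evals, evecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = evecs[:, 1]              # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)
print("Fiedler-vector partition:", labels)   # one clique gets 0s, the other 1s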

Applied (landmark systems)#

  • Sparse SVD for dimensionality reduction: ARPACK for large sparse matrices; used in topic modeling (LSA, LDA). Implementation: scikit-learn TruncatedSVD. Reference: Deerwester et al. (1990).

  • Direct sparse solvers (production): PARDISO (Intel MKL), SuperLU (LBNL), UMFPACK (UMF collection). Deployed in finite element, circuit simulation. Reference: Schenk & Gärtner (2004); Li (2005).

  • Iterative sparse solvers with preconditioning: GMRES + ILU in PETSc, Trilinos; ALS for matrix completion in Spark MLlib. Reference: Balay et al. (2021) (PETSc); Heroux et al. (2005) (Trilinos).

  • Graph neural networks: PyTorch Geometric adjacency matrix operations; sparse matrix multiplication for graph convolution. Reference: Kipf & Welling (2017).

  • Compressed sensing and sparse recovery: ADMM/FISTA for L1-regularized problems; applications in imaging, signal processing. Reference: Boyd et al. (2011); Beck & Teboulle (2009).

Key ideas: Where it shows up
  1. Text and NLP

    • Document-term matrices $X \in \mathbb{R}^{n \times d}$ with $n$ documents, $d$ vocabulary terms; typically $\text{nnz} \approx n \times (\text{words per document}) \ll nd$.

    • Achievements: sparse SVD/NMF in topic modeling (LSA, LDA); scikit-learn TfidfVectorizer produces sparse matrices. References: Deerwester et al. (1990) (LSA); Blei et al. (2003) (LDA).

  2. Graph neural networks and social networks

    • Adjacency matrix $A \in \mathbb{R}^{n \times n}$ sparse; graph convolution $X' = \sigma(AXW)$ exploits sparsity for $O(\text{nnz} \cdot d)$ per layer vs. $O(n^2 d)$ dense.

    • Achievements: PyTorch Geometric, DGL, Spektral enable billion-node GNNs; GCN (Kipf & Welling 2017), GAT (Velickovic et al. 2018) standard. References: Kipf & Welling (2017); Velickovic et al. (2018).

  3. Recommender systems and matrix completion

    • Rating matrices $R \in \mathbb{R}^{m \times n}$ (users × items) extremely sparse ($\sim 0.01\%$ observed); matrix factorization via alternating least squares.

    • Achievements: Sparse SVD in Netflix prize; ALS in Spark MLlib scales to billions of ratings. References: Koren et al. (2009) (matrix factorization); Zhou et al. (2008) (implicit feedback).

  4. Inverse problems and sparse regularization

    • Measurement matrix $A$ in compressed sensing, tomography often sparse or low-rank. Sparse solutions via L1 regularization (LASSO).

    • Achievements: ADMM (alternating direction method of multipliers) for large-scale L1-regularized problems; FISTA (fast iterative shrinkage-thresholding). References: Boyd et al. (2011) (ADMM); Beck & Teboulle (2009) (FISTA).

  5. Distributed and federated learning

    • Distributed optimization exploits sparsity in communication: gradient compression, local SGD reduce communication. Sparse parameter updates.

    • Achievements: gradient sparsification in federated learning (TensorFlow Federated); communication-efficient distributed SGD (Kairouz et al. 2021). References: Kairouz et al. (2021) (federated learning).
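A minimal sketch of top-k gradient sparsification as mentioned above, on an assumed synthetic gradient (real systems typically add error feedback and compress the index set as well):

import numpy as np

np.random.seed(0)

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries; return their indices and values."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

grad = np.random.randn(1_000_000)            # dense local gradient
idx, vals = topk_sparsify(grad, k=10_000)    # transmit ~1% of the entries

# Receiver reconstructs a sparse update
update = np.zeros_like(grad)
update[idx] = vals
print("kept fraction: {:.3f}".format(len(idx) / grad.size))
print("captured norm fraction: {:.3f}".format(np.linalg.norm(update) / np.linalg.norm(grad)))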

Notation
  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A)$ nonzeros; typically $\text{nnz} \ll n^2$.

  • Sparsity: $\text{sparsity}(A) = \frac{n^2 - \text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity}(A) > 0.99$ typical.

  • CSR format: Values v[1..nnz], column indices j[1..nnz], row pointers p[1..n+1] with p[i+1] - p[i] = nonzeros in row $i$.

  • Adjacency matrix: $A \in \{0,1\}^{n \times n}$ where $A_{ij} = 1$ if edge $(i, j)$ exists; $\text{nnz}(A) = 2m$ for $m$ edges (undirected).

  • Degree matrix: $D = \text{diag}(d_1, \ldots, d_n)$ where $d_i = \sum_j A_{ij}$ (degree of vertex $i$).

  • Laplacian: $L = D - A$ (graph Laplacian); $L$ is PSD with $\lambda_1 = 0$ (constant eigenvector).

  • Fill-in ratio: $\text{fill}(L, U) = \frac{\text{nnz}(L) + \text{nnz}(U)}{\text{nnz}(A)}$ (factorization creates nonzeros).

  • Example: Document-term matrix for 100K documents, 1M vocabulary words, ~200 words per doc: $\text{nnz} \approx 100\text{K} \times 200 = 20\text{M}$; density $\approx 20\text{M}/10^{11} = 2\times 10^{-4} \approx 0.02\%$.

Pitfalls & sanity checks
  • Never convert large sparse matrix to dense: Memory explosion; dense matrix for $n = 10^6$ requires terabytes.

  • Choose representation based on access pattern: CSR for row access/SpMV, CSC for column access/factorization.

  • Monitor fill-in in factorization: If fill-in ratio $> 100$, consider iterative solver + preconditioner instead.

  • Ensure reordering is applied: Production solvers (PARDISO, SuperLU) apply automatically; verify if using custom code.

  • Test on subset before scaling: Verify correctness on small sparse system before large deployment.

  • GPU sparse kernels have overhead: SpMV on GPU is often slower than on CPU for small matrices; the crossover depends on hardware and sparsity, but systems with only a few thousand rows frequently run faster on the CPU.

  • Preconditioning setup vs. solve trade-off: Single solve: skip preconditioning. Multiple solves with same matrix: preconditioning pays off.

  • Sparsity pattern changes: If $A$ has varying sparsity over iterations, symbolic factorization must be updated.
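A minimal sketch of the access-pattern pitfall above, timing row and column slices in CSR vs. CSC on an assumed random matrix (exact numbers vary by machine):

import scipy.sparse as sp
import time

n, density = 20000, 0.001
A_csr = sp.random(n, n, density=density, format='csr', random_state=1)
A_csc = A_csr.tocsc()

def time_slices(M, axis, reps=2000):
    t0 = time.time()
    for i in range(reps):
        _ = M[i % n, :] if axis == 'row' else M[:, i % n]
    return time.time() - t0

print("row slices - CSR: {:.3f} s, CSC: {:.3f} s".format(
    time_slices(A_csr, 'row'), time_slices(A_csc, 'row')))
print("col slices - CSR: {:.3f} s, CSC: {:.3f} s".format(
    time_slices(A_csr, 'col'), time_slices(A_csc, 'col')))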

References

Foundational sparse matrix theory

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Sparse matrix representations and algorithms

  1. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  2. Tinney, W. F., & Walker, J. W. (1967). Direct solutions of sparse network equations by optimally ordered triangular factorization.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Sparse iterative solvers and preconditioning

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Ruge, J. W., & Stüben, K. (1987). Algebraic Multigrid (AMG).

Sparse linear algebra in scientific computing

  1. Balay, S., Abhyankar, S., Adams, M. F., et al. (2021). PETSc Users Manual.

  2. Heroux, M. A., Bartlett, R. A., Howle, V. E., et al. (2005). An overview of the Trilinos project.

  3. Li, X. S. (2005). An overview of SuperLU: Algorithms, implementation, and user interface.

Topic modeling and sparse SVD

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

  3. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

Graph neural networks and sparse operations

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Recommender systems and matrix completion

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  2. Zhou, T., Kuscsik, Z., Liu, J. K., Medo, M., Yeung, C. H., & Zhang, Y. C. (2010). Solving the apparent diversity-accuracy dilemma of recommender systems.

Optimization and sparse recovery

  1. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers.

  2. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems.

  3. Kairouz, P., McMahan, H. B., Avent, B., et al. (2021). Advances and Open Problems in Federated Learning.

Five worked examples

Worked Example 1: Sparse matrix representations and storage efficiency#

Introduction#

Construct a sparse matrix (adjacency matrix of a large graph); compare storage and computation time across dense, CSR, and COO formats; demonstrate memory/time trade-offs.

Purpose#

Illustrate practical efficiency gains of sparse formats; motivate algorithm selection based on access patterns.

Importance#

Foundation for understanding when sparse methods are worthwhile; shows memory savings enable billion-scale problems.

What this example demonstrates#

  • Generate sparse graph (e.g., scale-free network or grid).

  • Store as dense array, then convert to CSR, CSC, COO.

  • Measure memory footprint for each format.

  • Benchmark matrix-vector product on dense vs. sparse.

  • Plot time vs. sparsity level.

Background#

Dense storage is simple but wasteful for sparse matrices. Sparse formats trade memory for computational complexity in accessing elements.

Historical context#

Sparse matrix formats developed as computers gained memory constraints in 1960s–1970s (finite element analysis).

History#

CSR became the de facto standard; GPU support was later added through NVIDIA's cuSPARSE library.

Prevalence in ML#

Every large-scale ML system (Spark, Hadoop, TensorFlow) supports sparse matrices; default for text, graphs.

Notes#

  • CSR efficient for row access, SpMV; CSC efficient for column access, matrix-matrix products.

  • COO flexible for construction; convert to CSR/CSC before computation.

  • GPU memory bandwidth is bottleneck; careful format choice crucial.

Connection to ML#

Text vectorization produces sparse matrices; graphs produce sparse adjacency matrices. System must efficiently handle both.

Connection to Linear Algebra Theory#

Sparse matrix properties (connectivity, degree distribution) determine factorization cost and algorithm performance.

Pedagogical Significance#

Demonstrates performance trade-offs; motivates specialized algorithms.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(30)

# Create sparse adjacency matrix (random graph)
n = 10000
density = 0.001  # 0.1% of entries are nonzero (99.9% sparse)
nnz = int(density * n * n)

# Generate random sparse matrix
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
data = np.ones(nnz)

# CSR format (efficient for SpMV)
A_csr = sp.csr_matrix((data, (rows, cols)), shape=(n, n))

# CSC format (efficient for column access)
A_csc = A_csr.tocsc()

# COO format (flexible)
A_coo = sp.coo_matrix((data, (rows, cols)), shape=(n, n))

# Dense format
A_dense = A_csr.toarray()

# Measure memory (exact array byte counts)
mem_csr = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
mem_dense = A_dense.nbytes

print("Memory comparison for {}x{} matrix (density {:.4f}%)".format(n, n, 100*density))
print("  Dense: {:.2f} MB".format(mem_dense / 1e6))
print("  CSR: {:.2f} MB".format(mem_csr / 1e6))
print("  Savings: {:.1f}x".format(mem_dense / mem_csr))

# Benchmark matrix-vector product
x = np.random.randn(n)
iterations = 100

# Sparse SpMV
t0 = time.time()
for _ in range(iterations):
    y_sparse = A_csr @ x
t_sparse = time.time() - t0

# Dense gemv
t0 = time.time()
for _ in range(iterations):
    y_dense = A_dense @ x
t_dense = time.time() - t0

print("\nMatrix-vector product time ({} iterations):".format(iterations))
print("  Sparse: {:.4f} s".format(t_sparse))
print("  Dense: {:.4f} s".format(t_dense))
print("  Speedup: {:.1f}x".format(t_dense / t_sparse))

print("\nSparsity level: {:.1f}%".format(100 * A_csr.nnz / (n*n)))
print("NNZ: {}".format(A_csr.nnz))

Worked Example 2: Fill-in and reordering in sparse LU factorization#

Introduction#

Construct a sparse matrix; compute LU factorization without reordering and with minimum-degree reordering; compare fill-in and factorization time.

Purpose#

Demonstrate how reordering dramatically reduces fill-in and speeds factorization.

Importance#

Critical for practical sparse solvers; shows that problem structure matters.

What this example demonstrates#

  • Construct sparse system (banded, block-structured, or random).

  • Perform symbolic LU factorization (predict nonzero pattern).

  • Compute LU without reordering; measure fill-in ratio.

  • Apply minimum-degree reordering; compute LU again.

  • Compare factorization time and fill-in.

Background#

Fill-in (new nonzeros created during factorization) can destroy sparsity; reordering (permutation of rows/columns) reduces fill-in.

Historical context#

Tinney & Walker (1967) minimum degree; George (1973) nested dissection; standard in SPARSE BLAS era.

History#

All production sparse solvers (PARDISO, SuperLU, UMFPACK) include automatic reordering.

Prevalence in ML#

Used in sparse inverse problems, finite element discretizations; less visible in ML but critical for scalability.

Notes#

  • Minimum degree is heuristic; nested dissection is theoretically superior but more complex.

  • Fill-in ratio typically 5–100× for unordered; reordering reduces to 1–5×.

Connection to ML#

Sparse solve is bottleneck in iterative methods (Newton, interior-point algorithms). Reordering accelerates optimization.

Connection to Linear Algebra Theory#

Graph reachability (matrix powers) predicts fill-in; reduction is graph-theoretic problem.

Pedagogical Significance#

Shows interplay between linear algebra and graph algorithms.

References#

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
import time

np.random.seed(31)

# Create sparse banded system (mimics discretized PDE)
n = 1000
A_banded = sp.diags([np.ones(n-1), -2*np.ones(n), np.ones(n-1)],
                     [-1, 0, 1], shape=(n, n), format='csr')
# Add random entries to create irregular pattern
rows = np.random.choice(n, 50)
cols = np.random.choice(n, 50)
A_banded = A_banded + 0.01 * sp.csr_matrix((np.ones(50), (rows, cols)), shape=(n, n))
A_banded.eliminate_zeros()

# Convert to CSC for LU (standard for sparse direct solvers)
A_csc = A_banded.tocsc()

# LU without reordering
print("LU factorization comparison (n={})".format(n))
print("\nNo reordering:")
t0 = time.time()
LU_no_reorder = spla.splu(A_csc, permc_spec='NATURAL')
t_no_reorder = time.time() - t0
nnz_no_reorder = LU_no_reorder.nnz

# LU with COLAMD reordering (minimum degree heuristic)
print("\nWith COLAMD reordering:")
t0 = time.time()
LU_reorder = spla.splu(A_csc, permc_spec='COLAMD')
t_reorder = time.time() - t0
nnz_reorder = LU_reorder.nnz

print("\nResults:")
print("  Original NNZ: {}".format(A_csc.nnz))
print("  No reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_no_reorder, nnz_no_reorder / A_csc.nnz, t_no_reorder))
print("  With reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_reorder, nnz_reorder / A_csc.nnz, t_reorder))
print("  Fill-in reduction: {:.1f}x".format(nnz_no_reorder / nnz_reorder))
print("  Time improvement: {:.1f}x".format(t_no_reorder / t_reorder))

Worked Example 3: Sparse SVD for topic modeling (LSA)#

Introduction#

Apply truncated SVD to sparse document-term matrix; extract latent topics; compare sparse vs. dense SVD efficiency.

Purpose#

Show practical use of sparse linear algebra in NLP; demonstrate dimensionality reduction on real-world sparse data.

Importance#

Sparse SVD is standard in topic modeling; ARPACK enables scalability to billions of words.

What this example demonstrates#

  • Generate synthetic document-term matrix (sparse, realistic sparsity ~0.01%).

  • Compute truncated SVD via ARPACK (sparse).

  • Compute dense SVD for comparison (infeasible for large $n$).

  • Extract topics (top words per topic).

  • Measure reconstruction error vs. rank.

Background#

LSA (latent semantic analysis) applies SVD to tf-idf matrix; low-rank factors reveal latent topics.

Historical context#

Deerwester et al. (1990) pioneered LSA; ARPACK (Lehoucq et al. 1998) made scalable SVD practical.

History#

Standard in topic modeling pipelines; scikit-learn TruncatedSVD widely used.

Prevalence in ML#

Text mining, information retrieval, and recommender systems routinely use sparse SVD.

Notes#

  • Truncated SVD (rank $k \ll n$) computes only $k$ largest singular values; much cheaper than full SVD.

  • Sparse matrix multiplication dominates cost; $O(k \cdot \text{nnz})$ per iteration.

Connection to ML#

Topics are interpretable; document-topic and topic-word matrices enable downstream tasks.

Connection to Linear Algebra Theory#

ARPACK uses Krylov subspaces and implicitly restarted Arnoldi; exploits SpMV efficiency.

Pedagogical Significance#

Demonstrates power of combining sparse formats with specialized algorithms (Krylov methods).

References#

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

  3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import time

np.random.seed(32)

# Synthetic document-term matrix (sparse)
n_docs, n_terms = 10000, 50000
nnz_per_doc = 200

# Generate sparse matrix (realistic sparsity)
data = []
row_idx = []
col_idx = []
for doc in range(n_docs):
    terms = np.random.choice(n_terms, nnz_per_doc, replace=False)
    counts = np.random.poisson(2, nnz_per_doc) + 1
    data.extend(counts)
    row_idx.extend([doc] * nnz_per_doc)
    col_idx.extend(terms)

X = sp.csr_matrix((data, (row_idx, col_idx)), shape=(n_docs, n_terms))

# Normalize each row by its total count (simple term-frequency scaling)
X = X.astype(np.float32)
row_sums = np.asarray(X.sum(axis=1)).ravel()
X = sp.diags(1.0 / row_sums) @ X

print("Document-term matrix: {}x{}, sparsity: {:.4f}%".format(
    n_docs, n_terms, 100*(1 - X.nnz/(n_docs*n_terms))))

# Sparse SVD (ARPACK)
k = 50  # Number of topics
print("\nTruncated SVD (rank {})...".format(k))
t0 = time.time()
U_sparse, s_sparse, Vt_sparse = svds(X, k=k, which='LM')
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense SVD (for small subset for illustration)
print("\nFull dense SVD (on 100x1000 subset for illustration)...")
X_dense_small = X[:100, :1000].toarray()
t0 = time.time()
U_dense, s_dense, Vt_dense = np.linalg.svd(X_dense_small, full_matrices=False)
t_dense = time.time() - t0
print("  Time (100x1000): {:.4f} s".format(t_dense))
print("  Dense SVD on full matrix would take >> hours")

# Extract topics (top terms per topic); svds returns components in ascending
# order of singular value, so the last rows of Vt are the leading topics
print("\nTop 5 terms per topic (3 leading topics):")
for topic_idx in range(min(3, k)):
    top_terms = np.argsort(np.abs(Vt_sparse[-(topic_idx + 1)]))[-5:][::-1]
    print("  Topic {}: terms {}".format(topic_idx, top_terms))

# Reconstruction error (evaluated on the first 100 rows so the dense
# reconstruction stays small; svds orders components ascending, so the
# last r columns/rows correspond to the r largest singular values)
rank_list = [5, 10, 20, 50]
X_top = X[:100, :].toarray()
print("\nReconstruction error vs. rank (first 100 documents):")
for r in rank_list:
    U_r = U_sparse[:100, -r:]
    s_r = s_sparse[-r:]
    Vt_r = Vt_sparse[-r:, :]
    X_recon = U_r @ np.diag(s_r) @ Vt_r
    err = np.linalg.norm(X_top - X_recon) / np.linalg.norm(X_top)
    print("  Rank {}: relative error {:.4f}".format(r, err))

Worked Example 4: Preconditioned iterative solver on sparse system#

Introduction#

Construct a large sparse SPD system (from discretized PDE); solve via unpreconditioned and preconditioned CG; demonstrate preconditioning impact.

Purpose#

Show how preconditioning enables convergence on sparse systems where direct solvers would be expensive.

Importance#

Practical approach for million-scale sparse systems; avoids expensive fill-in of direct LU.

What this example demonstrates#

  • Construct sparse Laplacian or discretized diffusion matrix.

  • Solve unpreconditioned CG; count iterations.

  • Construct ILU or AMG preconditioner.

  • Solve preconditioned CG; count iterations.

  • Compare iteration counts and wall-clock time.

Background#

Iterative solvers avoid factorization fill-in; preconditioning clusters spectrum for fast convergence.

Historical context#

Preconditioned CG developed 1970s–1980s; ILU standard; AMG emerged 1980s–1990s.

History#

Standard in PETSc, Trilinos, deal.II (finite element libraries); production code for billion-node problems.

Prevalence in ML#

Used in large-scale optimization (Newton, interior-point), graph algorithms (diffusion on networks).

Notes#

  • Preconditioning setup cost amortized over many solves.

  • ILU relatively cheap; AMG more expensive but better convergence for difficult problems.

Connection to ML#

Large-scale optimization (e.g., logistic regression on massive sparse data) requires preconditioned iterative methods.

Connection to Linear Algebra Theory#

CG iteration count $\approx \sqrt{\kappa}$; preconditioning reduces $\kappa$ to enable convergence.

Pedagogical Significance#

Demonstrates integration of sparsity, preconditioning, and iterative methods.

References#

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

np.random.seed(33)

# Create sparse Laplacian (discretized 2D Poisson on an N x N grid, 5-point stencil)
N = 100
n = N * N
T = sp.diags([-np.ones(N-1), 2*np.ones(N), -np.ones(N-1)], [-1, 0, 1])
A = (sp.kron(sp.identity(N), T) + sp.kron(T, sp.identity(N))).tocsr()

# RHS
b = np.random.randn(n)

print("Sparse Laplacian system: {}x{}, nnz={}".format(n, n, A.nnz))
print("Condition number grows like O(N^2); roughly {:.0f} for this grid".format((2 * N / np.pi) ** 2))

# Unpreconditioned CG
print("\nUnpreconditioned CG:")
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))

x_un, info_un = spla.cg(A, b, rtol=1e-6, callback=callback_un, maxiter=n)  # use tol= on SciPy < 1.12
print("  Iterations: {}".format(len(residuals_unprecond)))

# ILU preconditioner
print("\nBuilding ILU preconditioner...")
M_ilu = spla.spilu(A.tocsc())
M = spla.LinearOperator((n, n), matvec=M_ilu.solve)

print("Preconditioned CG:")
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))

x_pre, info_pre = spla.cg(A, b, M=M, rtol=1e-6, callback=callback_pre, maxiter=n)  # use tol= on SciPy < 1.12
print("  Iterations: {}".format(len(residuals_precond)))

print("\nSummary:")
print("  Unpreconditioned iterations: {}".format(len(residuals_unprecond)))
print("  Preconditioned iterations: {}".format(len(residuals_precond)))
print("  Speedup: {:.1f}x".format(len(residuals_unprecond) / len(residuals_precond)))
print("  Both converged:", info_un == 0 and info_pre == 0)

Worked Example 5: Graph neural network convolution on sparse adjacency matrix#

Introduction#

Construct a graph (sparse adjacency matrix); implement graph convolution layer; measure SpMM efficiency vs. dense multiplication.

Purpose#

Demonstrate sparse matrix operations in modern ML (graph neural networks).

Importance#

GNNs process billion-node graphs; sparsity is essential for feasibility.

What this example demonstrates#

  • Generate graph (scale-free, grid, or real network).

  • Build node feature matrix $X \in \mathbb{R}^{n \times d}$.

  • Implement sparse graph convolution: $X' = \sigma(A X W)$ with SpMM.

  • Compare sparse vs. dense GCN layer time/memory.

  • Verify output correctness.

Background#

Graph convolution aggregates features from neighbors via sparse $A$ multiplication.

Historical context#

Kipf & Welling (2017) GCN; PyTorch Geometric (2019) enables efficient sparse operations.

History#

Standard in graph learning frameworks (DGL, Spektral, PyTorch Geometric).

Prevalence in ML#

GNNs on social networks, citation graphs, molecular graphs; sparse adjacency matrix standard.

Notes#

  • SpMM (sparse-dense matrix product) is key operation; GPU acceleration available.

  • For $n$-node graph with $m$ edges and $d$ features: sparse is $O(m \cdot d)$, dense is $O(n^2 \cdot d)$.

Connection to ML#

GNNs enable learning on structured, graph-based data; scalable to billion-node networks via sparsity.

Connection to Linear Algebra Theory#

Graph convolution is polynomial filter on graph Laplacian; spectral perspective connects to Chebyshev polynomials.

Pedagogical Significance#

Integrates sparse linear algebra, graph theory, and modern deep learning.

References#

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(34)

# Generate graph (random graph with ~10 outgoing edges per node)
n = 10000  # Nodes
avg_degree = 10
edges = []
for i in range(n):
    # Connect each node to avg_degree randomly chosen neighbors
    targets = np.random.choice(n, min(avg_degree, n-1), replace=False)
    for j in targets:
        if i != j:
            edges.append((i, j))
            edges.append((j, i))  # Undirected

edges = np.array(edges)
A = sp.csr_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(n, n))
A.data[:] = 1.0  # duplicate (i, j) pairs were summed during construction; re-binarize

# Node features
d = 64  # Feature dimension
X = np.random.randn(n, d).astype(np.float32)

# Weight matrix (GCN parameters)
W = np.random.randn(d, d).astype(np.float32)

print("Graph: n={}, edges={}, sparsity={:.4f}%".format(
    n, A.nnz // 2, 100*(1 - A.nnz/(n*n))))
print("Features: n={}, d={}".format(n, d))

# Sparse graph convolution: X_out = A @ X @ W
print("\nSparse graph convolution:")
t0 = time.time()
X_intermediate = A @ X  # SpMM: sparse @ dense
X_out_sparse = X_intermediate @ W  # Dense @ dense
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense graph convolution (for comparison on subset)
print("\nDense graph convolution (on 1000-node subgraph):")
A_dense_sub = A[:1000, :1000].toarray()
X_sub = X[:1000]
t0 = time.time()
X_out_dense_sub = A_dense_sub @ X_sub @ W
t_dense_sub = time.time() - t0
print("  Time (1000 nodes): {:.4f} s".format(t_dense_sub))
# Extrapolate to full graph
t_dense_extrapolated = t_dense_sub * (n/1000)**2
print("  Extrapolated to full graph: {:.2f} s".format(t_dense_extrapolated))

print("\nSpeedup (dense vs. sparse): {:.1f}x".format(t_dense_extrapolated / t_sparse))

# Apply nonlinearity (ReLU)
X_out_sparse = np.maximum(X_out_sparse, 0)

print("\nGCN layer output shape:", X_out_sparse.shape)
print("Output mean:", np.mean(X_out_sparse))
print("Output sparsity after ReLU:", 100 * np.sum(X_out_sparse == 0) / X_out_sparse.size, "%")

Chapter 11
Principal Component Analysis
Key ideas: Introduction

Introduction#

PCA seeks a low-dimensional projection that captures the most variance. Geometrically, it rotates data so axes align with directions of maximal spread. Algebraically, it solves the optimization $\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$, yielding the top eigenvector of $X_c^\top X_c$. Successive components are orthogonal and capture diminishing variance.

Important ideas#

  1. Covariance matrix

    • $\Sigma = \tfrac{1}{n}X_c^\top X_c$ with $X_c$ centered. Eigenvalues $\lambda_i$ are variances along principal directions.

  2. Principal components (eigenvectors)

    • Columns of $V$ from SVD (or eigenvectors of $\Sigma$) form an orthonormal basis ordered by variance.

  3. Explained variance ratio

    • EVR = $\tfrac{\lambda_i}{\sum_j \lambda_j}$ quantifies how much total variance component $i$ explains; cumulative EVR guides dimensionality choice.

  4. Scores and loadings

    • Scores: $Z = X_c V$ (projections onto components); loadings: $V$ (directions in original space).

  5. Reconstruction and truncation

    • Truncated PCA: keep $k$ components; $\tilde{X}_c = Z_k V_k^\top$ minimizes squared error (Eckart–Young).

  6. Standardization and scaling

    • Standardize to unit variance before PCA if variables have different scales; otherwise leading component may be dominated by high-variance features.

  7. Whitening

    • Transform to unit variance: $Z_w = Z \Lambda^{-1/2}$ decorrelates and rescales for downstream algorithms (e.g., RBF kernels).

Relevance to ML#

  • Dimensionality reduction: speeds training, avoids overfitting, improves generalization.

  • Visualization: 2D/3D projection of high-dimensional data for exploration.

  • Preprocessing: removes noise, aligns scales, improves conditioning of solvers.

  • Feature extraction: learned components as features for downstream classifiers.

  • Denoising: truncated PCA removes low-variance (noisy) directions.

  • Whitening: standardizes correlation structure, crucial for many algorithms (kernels, distance-based methods).

Algorithmic development (milestones)#

  • 1901: Pearson introduces lines/planes of closest fit (geometric intuition).

  • 1933: Hotelling formalizes PCA as eigen-decomposition of covariance.

  • 1950s–1960s: Computational advances (QR, Jacobi methods) enable practical PCA.

  • 1997: Probabilistic PCA (Tipping–Bishop) bridges PCA and Gaussian latent variable models.

  • 1997–2010s: Kernel PCA (Schölkopf et al.) and sparse PCA emerge for nonlinear and interpretable variants.

  • 2000s: Randomized PCA for large-scale data (Halko–Martinsson).

  • 2010s: PCA integrated into deep learning (autoencoders, PCA layers, spectral initialization).

Definitions#

  • Centered data: $X_c = X - \bar{X}$ with $\bar{X} = \tfrac{1}{n}\mathbf{1}^\top X$ (the row vector of per-feature column means).

  • Covariance matrix: $\Sigma = \tfrac{1}{n}X_c^\top X_c \in \mathbb{R}^{d\times d}$ (PSD).

  • Principal components: eigenvectors of $\Sigma$, ordered by eigenvalue magnitude.

  • Variance explained by component $i$: $\lambda_i / \operatorname{tr}(\Sigma)$.

  • Whitened data: $Z_w = X_c V \Lambda^{-1/2}$ with $\Lambda$ diagonal eigenvalue matrix.

  • Reconstructed data: $\tilde{X}_c = X_c V_k V_k^\top$ using rank-$k$ approximation.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Covariance matrix and its PSD structure (Chapter 09).

  • Eigen-decomposition of symmetric covariance matrix.

  • Variational characterization: $\arg\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$ yields top eigenvector.

  • Eckart–Young–Mirsky low-rank approximation error (Chapter 10).

  • Relation to SVD: PCA via SVD of centered data (Chapter 10).

  • Standardization and scaling effects on covariance eigenvalues.
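A minimal sketch of the scaling point above on assumed synthetic data: without standardization, the leading component is pinned to the high-variance feature; after standardization, it mixes the two correlated features.

import numpy as np

np.random.seed(5)
n = 500
latent = np.random.randn(n)
X = np.column_stack([
    1000.0 * latent + 50.0 * np.random.randn(n),   # large-scale feature
    latent + 0.1 * np.random.randn(n),             # small-scale feature
    np.random.randn(n),                            # unrelated noise feature
])
Xc = X - X.mean(axis=0, keepdims=True)

def top_pc(M):
    evals, evecs = np.linalg.eigh((M.T @ M) / len(M))
    return evecs[:, -1]   # eigenvector of the largest eigenvalue

print("top PC, raw scales   :", np.round(top_pc(Xc), 3))
print("top PC, standardized :", np.round(top_pc(Xc / Xc.std(axis=0)), 3))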

Applied (landmark systems)#

  • Dimensionality reduction (Jolliffe 2002; Hastie et al. 2009).

  • Whitening for deep learning (LeCun et al. 1998; Krizhevsky et al. 2012).

  • Probabilistic PCA and latent variable models (Tipping & Bishop 1997).

  • Kernel PCA for nonlinear reduction (Schölkopf et al. 1998).

  • Randomized PCA for large scale (Halko–Martinsson–Tropp 2011).

  • Matrix completion via truncated SVD (Candès & Tao 2010).

Key ideas: Where it shows up
  1. Dimensionality reduction and preprocessing

  • Removes redundant features; improves stability of downstream solvers. Achievements: widely used in computer vision (image preprocessing), bioinformatics (gene expression), and AutoML pipelines. References: Jolliffe 2002.

  2. Visualization and exploratory data analysis

  • Project to 2D/3D for interactive inspection and cluster discovery. Achievements: industry standard in data exploration tools (Pandas, R, Plotly). References: Hastie et al. 2009.

  3. Whitening and decorrelation

  • Standardizes feature covariance to identity; improves kernel methods and RBF networks. Achievements: standard preprocessing in deep learning frameworks. References: LeCun et al. 1998 (early deep learning); Krizhevsky et al. 2012 (ImageNet AlexNet).

  4. Denoising and matrix completion

  • Truncated PCA recovers low-rank structure from noisy observations. Achievements: used in image inpainting and recommendation cold-start. References: Candès & Tao 2010 (matrix completion); Pearson 1901 (geometric intuition).

  5. Feature extraction and representation learning

  • Learned components become features for classifiers; precursor to autoencoders. Achievements: basis for deep autoencoders and VAEs. References: Hinton & Salakhutdinov 2006 (deep learning via autoencoders).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$; centered $X_c = X - \bar{X}$.

  • Covariance: $\Sigma = \tfrac{1}{n}X_c^\top X_c$.

  • Eigendecomposition: $\Sigma = V \Lambda V^\top$ with $\Lambda$ diagonal.

  • Principal components: columns of $V$; eigenvalues $\lambda_i$ are variances.

  • Scores (projections): $Z = X_c V \in \mathbb{R}^{n\times d}$ or truncated $Z_k = X_c V_k$.

  • Explained variance ratio: $\text{EVR}_i = \tfrac{\lambda_i}{\sum_j \lambda_j}$.

  • Standardized data: $X_s = X_c / \sigma$ (element-wise or per-column standard deviation).

  • Whitened data: $Z_w = Z \Lambda^{-1/2} = X_c V \Lambda^{-1/2}$.

  • Example: If $X$ is $100 \times 50$ with 2 dominant eigenvalues $\lambda_1=8, \lambda_2=3, \sum_j \lambda_j=12$, then $\text{EVR}_1 \approx 0.67, \text{EVR}_2 \approx 0.25$; keep 2 components to explain $92\%$ of variance.
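A quick check of the numeric example above (the 48 remaining small eigenvalues are lumped into a single value of 1 so that the total is 12, as assumed in the example):

import numpy as np

lam = np.array([8.0, 3.0, 1.0])   # leading eigenvalues plus lumped remainder; sum = 12
evr = lam / lam.sum()
print("EVR:", np.round(evr, 2), "cumulative:", np.round(np.cumsum(evr), 2))
# EVR: [0.67 0.25 0.08] cumulative: [0.67 0.92 1.  ]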

Pitfalls & sanity checks
  • Always center data; forgetting this is a common error.

  • Standardize features if they have different scales; otherwise PCA is dominated by high-variance features.

  • Sign ambiguity: eigenvectors are unique up to sign; do not compare raw signs across methods.

  • Small/negative eigenvalues: should not occur for PSD covariance matrix; indicates numerical error or centering issue.

  • Reconstruction: verify $\lVert X_c - X_c V_k V_k^\top \rVert_F$ equals tail variance for sanity check.

  • Number of components: do not blindly choose $k$; use scree plot, cumulative variance, or cross-validation.

References

Foundational work

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.

Classical theory and methods

  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.).

  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  5. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

Numerical algorithms

  6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  7. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  8. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding structure with randomness.

Extensions and applications

  9. Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Kernel Principal Component Analysis.

  10. Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis.

  11. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition.

  12. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  13. Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion.

  14. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.

Five worked examples

Worked Example 1: Computing PCA via eigen-decomposition and interpreting variance#

Introduction#

Compute PCA on a synthetic dataset via covariance eigendecomposition; examine explained variance and principal directions.

Purpose#

Build intuition for how eigenvalues quantify variance along principal axes.

Importance#

Diagnostics guide choice of number of components for downstream tasks.

What this example demonstrates#

  • Center data; compute covariance matrix.

  • Compute eigendecomposition of covariance.

  • Report eigenvalues, cumulative explained variance ratio, and principal component directions.

Background#

Hotelling (1933) formalized PCA as eigen-decomposition of covariance.

Historical context#

Foundational work in multivariate statistics; adapted widely in ML.

History#

Standard in all major statistical/ML libraries (R, Python, MATLAB).

Prevalence in ML#

Used routinely in preprocessing and exploratory analysis.

Notes#

  • Ensure data is centered; if not, covariance is inaccurate.

  • Eigenvalues are variances; eigenvectors are directions.

Connection to ML#

Explains what PCA extracts and why truncation works.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition is the Rayleigh quotient optimization.

Pedagogical Significance#

Links variance to eigenvalues concretely.

References#

  1. Hotelling, H. (1933). Analysis of complex statistical variables into principal components.

  2. Jolliffe, I. T. (2002). Principal Component Analysis.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
evals = evals[::-1]
evecs = evecs[:, ::-1]

cumsum_var = np.cumsum(evals) / evals.sum()

print("eigenvalues:", np.round(evals, 4))
print("explained variance ratio:", np.round(evals / evals.sum(), 4))
print("cumulative EVR (k=1,2,3):", np.round(cumsum_var[:3], 4))

Worked Example 2: PCA via SVD (numerically stable)#

Introduction#

Compute PCA using SVD of centered data instead of forming covariance matrix explicitly.

Purpose#

Show how SVD avoids squaring condition number; more numerically stable for ill-conditioned data.

Importance#

Standard in practice; avoids explicit covariance computation and is faster for tall data.

What this example demonstrates#

  • SVD of $X_c / \sqrt{n}$ yields principal components (columns of $V$) and singular values.

  • Squared singular values equal eigenvalues of $X_c^\top X_c / n$.

Background#

SVD is numerically more stable than eigen-decomposition of $X_c^\top X_c$.

Historical context#

Popularized in numerical linear algebra as the default PCA route.

History#

Standard in scikit-learn PCA class; uses SVD internally.

Prevalence in ML#

Default in modern PCA implementations.

Notes#

  • Use full_matrices=False for efficiency when $n \gg d$.

  • If $s$ are the singular values of $X_c$, the covariance eigenvalues are $\lambda_i = s_i^2 / n$; the code below factors $X_c/\sqrt{n}$ instead, whose singular values satisfy $\lambda_i = s_i^2$ directly.

Connection to ML#

More robust to numerical issues; faster for large $n$.

Connection to Linear Algebra Theory#

SVD of $X_c$ relates directly to covariance eigen-structure.

Pedagogical Significance#

Bridges SVD (Chapter 10) and PCA practically.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals_from_svd = s ** 2

Sigma = (Xc.T @ Xc) / n
evals_from_eig = np.linalg.eigvalsh(Sigma)[::-1]

print("eigenvalues from eig:", np.round(evals_from_eig, 6))
print("eigenvalues from SVD:", np.round(evals_from_svd, 6))
print("difference:", np.linalg.norm(evals_from_eig - evals_from_svd))

Worked Example 3: Dimensionality reduction and reconstruction error#

Introduction#

Demonstrate PCA truncation to $k$ components; compare reconstruction error to variance lost.

Purpose#

Show how truncated PCA minimizes squared error (Eckart–Young); guide choice of $k$.

Importance#

Core to deciding how many components to keep in applications.

What this example demonstrates#

  • Compute full PCA; truncate to $k$ components.

  • Reconstruct and compute Frobenius error.

  • Verify error matches variance in discarded components.

Background#

Eckart–Young–Mirsky theorem guarantees optimality of rank-$k$ truncation.

Historical context#

Theoretical guarantee for best low-rank approximation.

History#

Used in all dimensionality reduction and compression workflows.

Prevalence in ML#

Standard choice heuristic for $k$: keep 90–95% explained variance.

Notes#

  • Reconstruction error squared equals sum of squared singular values of discarded components.

  • Trade-off: fewer components → less storage/compute, but more information loss.

Connection to ML#

Informs practical $k$ selection for downstream tasks.

Connection to Linear Algebra Theory#

Optimal low-rank approximation per Eckart–Young theorem.

Pedagogical Significance#

Links theory to practical dimensionality reduction.

References#

  1. Eckart & Young (1936). The approximation of one matrix by another of lower rank.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 10
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals = s ** 2 / n  # covariance eigenvalues

k = 4
Xc_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
error_fro = np.linalg.norm(Xc - Xc_k, "fro")
tail_vars = evals[k:].sum()

# Eckart-Young: ||Xc - Xc_k||_F^2 = sum_{i>k} s_i^2 = n * (tail variance)
ev_ratio = np.cumsum(evals) / evals.sum()
print("explained variance ratio (k=1..5):", np.round(ev_ratio[:5], 4))
print("reconstruction error:", round(error_fro, 4))
print("sqrt(n * tail variance):", round(np.sqrt(n * tail_vars), 4))

Worked Example 4: Whitening and decorrelation#

Introduction#

Apply PCA whitening to standardize covariance; show uncorrelated and unit-variance output.

Purpose#

Demonstrate how whitening decorrelates features and enables downstream algorithms.

Importance#

Preprocessing step for many algorithms (kernels, RBF networks, distance-based methods).

What this example demonstrates#

  • Compute PCA; form whitening transform $Z_w = Z \Lambda^{-1/2}$.

  • Verify output covariance is identity.

  • Compare to standard scaling.

Background#

Whitening removes correlation structure and equalizes variance across dimensions.

Historical context#

Used in signal processing; adopted in deep learning for stabilization.

History#

LeCun et al. (1998) highlighted importance in early deep learning; Krizhevsky et al. (2012) used it in AlexNet.

Prevalence in ML#

Standard preprocessing in deep learning, kernel methods, and statistical tests.

Notes#

  • Add small floor to tiny eigenvalues to avoid division by zero.

  • Whitening can amplify noise if done naively on high-variance directions.

Connection to ML#

Improves convergence, gradient scales, and generalization of many algorithms.

Connection to Linear Algebra Theory#

Transform to canonical coordinate system (aligned with PCA axes).

Pedagogical Significance#

Practical application of PCA-based preprocessing.

References#

  1. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 150, 4
# Create correlated features
A = np.random.randn(d, d)
Cov = A.T @ A
X = np.random.randn(n, d) @ np.linalg.cholesky(Cov).T
Xc = X - X.mean(axis=0, keepdims=True)

evals, evecs = np.linalg.eigh((Xc.T @ Xc) / n)
evals = evals[::-1]
evecs = evecs[:, ::-1]

# Whitening transform
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
Z = Xc @ evecs
Zw = Z @ Lambda_inv_sqrt

# Verify output covariance is identity
Sigma_w = (Zw.T @ Zw) / n

print("input covariance diag:", np.round(np.diag((Xc.T @ Xc) / n), 4))
print("whitened covariance diag:", np.round(np.diag(Sigma_w), 4))
print("whitened covariance off-diag max:", round(np.max(np.abs(Sigma_w - np.eye(d))), 6))

Worked Example 5: Denoising via truncated PCA#

Introduction#

Apply truncated PCA to a noisy signal; show noise reduction as a function of truncation.

Purpose#

Illustrate how keeping top-$k$ components removes high-frequency (noisy) information.

Importance#

Core application in image denoising, signal processing, and data cleaning.

What this example demonstrates#

  • Add noise to data; apply truncated PCA for different $k$.

  • Measure reconstruction error vs. ground truth vs. noise level.

  • Show improvement from truncation.

Background#

Noise typically occupies low-variance directions; truncation removes it.

Historical context#

Classical application dating to Pearson (1901); widely used in signal/image processing.

History#

Precursor to modern deep denoising autoencoders.

Prevalence in ML#

Used in image inpainting, audio denoising, sensor data cleanup.

Notes#

  • Noise reduction works best if signal occupies few components.

  • Trade-off: lower $k$ → more denoising, but may remove true signal.

Connection to ML#

Improves feature quality for downstream models; common preprocessing.

Connection to Linear Algebra Theory#

Low-rank structure (signal) separated from noise via truncation.

Pedagogical Significance#

Demonstrates practical benefit of dimensionality reduction.

References#

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 100, 20
# True signal with low-rank structure
U_true, _ = np.linalg.qr(np.random.randn(n, 5))
V_true, _ = np.linalg.qr(np.random.randn(d, 5))
s_true = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X_clean = U_true @ np.diag(s_true) @ V_true.T

# Add noise
noise = 0.5 * np.random.randn(n, d)
X_noisy = X_clean + noise

X_mean = X_noisy.mean(axis=0, keepdims=True)
Xc_noisy = X_noisy - X_mean
U, s, Vt = np.linalg.svd(Xc_noisy, full_matrices=False)

errors = []
ks = range(1, 11)
for k in ks:
    # Denoised reconstruction: rank-k approximation plus the mean
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k] + X_mean
    err = np.linalg.norm(X_clean - X_k, "fro")
    errors.append(err)

noise_err = np.linalg.norm(X_noisy - X_clean, "fro")
print("error of raw noisy data:", round(noise_err, 4))
print("reconstruction error for k=1..5:", np.round(errors[:5], 4))
print("best k:", np.argmin(errors) + 1)

Chapter 10
Singular Value Decomposition
Key ideas: Introduction

Introduction#

SVD generalizes eigen-decomposition to arbitrary rectangular matrices. Columns of $U$ span the column space; columns of $V$ span the row space; singular values quantify stretch along these directions. SVD underlies dimensionality reduction, denoising, and stable linear solves.

Important ideas#

  1. Orthogonal bases

    • $U \in \mathbb{R}^{n\times n}$, $V \in \mathbb{R}^{d\times d}$ orthogonal; $\Sigma$ stores $\sigma_1 \ge \dots \ge \sigma_r > 0$ with $r=\operatorname{rank}(X)$.

  2. Spectral norm and conditioning

    • $\lVert X \rVert_2 = \sigma_1$; condition number $\kappa_2(X)=\sigma_1/\sigma_r$ for full-rank square $X$.

  3. Best low-rank approximation

    • Eckart–Young–Mirsky: $X_k = U_k \Sigma_k V_k^\top$ minimizes $\lVert X - Y \rVert_F$ over $\operatorname{rank}(Y)\le k$; squared error $\lVert X - X_k \rVert_F^2 = \sum_{i>k}\sigma_i^2$.

  4. Pseudoinverse and least squares

    • $X^\dagger = V \Sigma^\dagger U^\top$; solves minimum-norm least squares and handles rank deficiency.

  5. Energy and variance

    • In PCA, singular values squared (scaled) are variances along principal axes.

  6. Truncation and regularization

    • Truncated SVD and spectral filtering (e.g., Tikhonov) improve stability for ill-conditioned problems.

  7. Computation

    • Golub–Kahan bidiagonalization; randomized SVD for large, sparse data.

Relevance to ML#

  • Dimensionality reduction (PCA/LSA/embeddings).

  • Low-rank modeling for recommendation and compression.

  • Stable least squares and ridge via spectral filtering.

  • Conditioning diagnostics and gradient scaling.

  • Spectral clustering and graph embeddings (Laplacian SVD/eigendecomp relation).

Algorithmic development (milestones)#

  • 1930s: Schmidt, Eckart–Young best approximation theorem.

  • 1960s: Golub–Kahan algorithms for practical SVD.

  • 1990s: Latent Semantic Analysis uses truncated SVD for text.

  • 2000s: Randomized SVD for large-scale ML (Halko–Martinsson–Tropp).

  • 2010s: Matrix factorization dominates recommender benchmarks (Netflix era); spectral initializations in deep models.

Definitions#

  • Full SVD: $X = U \Sigma V^\top$ with $U^\top U=I_n$, $V^\top V=I_d$, $\Sigma \in \mathbb{R}^{n\times d}$ diagonal (nonnegative entries).

  • Thin SVD: $X = U_r \Sigma_r V_r^\top$ with $r=\operatorname{rank}(X)$, $U_r \in \mathbb{R}^{n\times r}$, $V_r \in \mathbb{R}^{d\times r}$.

  • Pseudoinverse: $X^\dagger = V_r \Sigma_r^{-1} U_r^\top$.

  • Spectral norm: $\lVert X \rVert_2 = \sigma_1$; Frobenius norm $\lVert X \rVert_F^2 = \sum_i \sigma_i^2$.
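A minimal numerical check of these definitions on an assumed random matrix:

import numpy as np

np.random.seed(7)
X = np.random.randn(8, 5)
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD

# Pseudoinverse from the thin SVD vs. NumPy's built-in
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print("pinv matches np.linalg.pinv:", np.allclose(X_pinv, np.linalg.pinv(X)))

# Norms from singular values
print("spectral norm == sigma_1  :", np.isclose(np.linalg.norm(X, 2), s[0]))
print("Frobenius^2 == sum sigma^2:", np.isclose(np.linalg.norm(X, 'fro')**2, np.sum(s**2)))

# Eckart-Young: squared Frobenius error of the best rank-k approximation
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print("rank-k error check:", np.isclose(np.linalg.norm(X - X_k, 'fro')**2, np.sum(s[k:]**2)))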

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Existence and uniqueness (up to signs) of SVD for any real matrix.

  • Eckart–Young–Mirsky best rank-$k$ approximation theorem.

  • Relation to eigen-decomposition of $X^\top X$ and $X X^\top$.

  • Moore–Penrose pseudoinverse via SVD; minimum-norm least squares.

  • Spectral/Frobenius norm expressions via singular values.

Applied (landmark systems)#

  • PCA via SVD for dimensionality reduction (Jolliffe 2002; Shlens 2014).

  • LSA for text (Deerwester et al. 1990).

  • Matrix factorization in recommenders (Koren et al. 2009).

  • Spectral regularization for ill-posed problems (Hansen 1998).

  • Randomized SVD for large-scale ML (Halko–Martinsson–Tropp 2011).

Key ideas: Where it shows up
  1. PCA and variance decomposition

  • Centered data $X_c$ yields $\tfrac{1}{n}X_c^\top X_c = V \tfrac{\Sigma^2}{n} V^\top$. Explained variance ratios follow $\sigma_i^2$. Achievements: Shlens 2014 (tutorial), Jolliffe 2002 (classical PCA), used widely in vision/audio.

  2. Latent semantic analysis and embeddings

  • Truncated SVD on term-document matrices uncovers latent topics. Achievement: Deerwester et al. 1990 LSA; foundation for modern embeddings.

  3. Recommendation and matrix completion

  • Low-rank factorization approximates user-item matrices; SVD intuition drives alternating least squares. Achievements: Netflix Prize-era matrix factorization (Koren et al. 2009).

  4. Stable least squares and pseudoinverse

  • $x^\* = X^\dagger y$ gives minimum-norm solution; SVD reveals conditioning. Achievements: Golub–Van Loan numerical stability guidance.

  5. Regularization and denoising

  • Truncated SVD/Tikhonov filter small singular values to denoise ill-posed problems (e.g., deblurring). Achievements: Hansen 1998 (regularization tools).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$ (rows = examples).

  • SVD: $X = U \Sigma V^\top$, singular values $\sigma_i$ sorted descending.

  • Rank-$k$: $X_k = U_k \Sigma_k V_k^\top$ with $U_k \in \mathbb{R}^{n\times k}$, $\Sigma_k \in \mathbb{R}^{k\times k}$, $V_k \in \mathbb{R}^{d\times k}$.

  • Pseudoinverse: $X^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ replaces nonzero $\sigma_i$ by $1/\sigma_i$.

  • Explained variance: $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i} \sigma_i^2}$.

  • Example: If $X = U \Sigma V^\top$ with $\sigma_1=5,\sigma_2=2,\sigma_3=0.5$, then $\kappa_2(X)=10$ and the best rank-2 approximation has squared Frobenius error $0.5^2 = 0.25$.
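
A quick check of this example (a sketch; the $8\times 3$ construction is arbitrary, chosen only to realize the prescribed singular values):

import numpy as np

# Build a matrix with prescribed singular values 5, 2, 0.5
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((8, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X = Q1 @ np.diag([5.0, 2.0, 0.5]) @ Q2.T

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]              # best rank-2 approximation

print(round(s[0] / s[-1], 4))                         # kappa_2(X) = 10.0
print(round(np.linalg.norm(X - X2, "fro") ** 2, 4))   # squared Frobenius error = 0.25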

Pitfalls & sanity checks
  • Always center data before PCA via SVD.

  • Sign ambiguity: singular vectors are unique up to sign; do not compare raw signs across runs.

  • Conditioning: tiny singular values indicate potential instability; consider truncation or ridge.

  • Shapes and compute: use full_matrices=False for the thin SVD when $n \gg d$; the full $n\times n$ $U$ is rarely needed and wastes memory.

  • Reconstruction checks: verify $\lVert X - U\Sigma V^\top \rVert_F$ is near machine precision for correctness.
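
These checks are easy to script; a minimal sketch of the centering, reconstruction, and sign-ambiguity checks (assuming NumPy, with an arbitrary $100\times 8$ matrix):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8)) + 5.0            # deliberately nonzero mean

# Center before PCA-via-SVD; skipping this makes the first component track the mean
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Reconstruction error should be near machine precision
print(np.linalg.norm(Xc - U @ np.diag(s) @ Vt, "fro"))

# Sign ambiguity: fix a convention, e.g. make the largest-magnitude entry of each
# right singular vector positive, flipping the matching left vector as well
signs = np.sign(Vt[np.arange(Vt.shape[0]), np.abs(Vt).argmax(axis=1)])
Vt_fixed = signs[:, None] * Vt
U_fixed = U * signs[None, :]
print(np.allclose(Xc, U_fixed @ np.diag(s) @ Vt_fixed))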

References
  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  3. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

  4. Jolliffe, I. T. (2002). Principal Component Analysis.

  5. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  6. Deerwester, S. et al. (1990). Indexing by Latent Semantic Analysis.

  7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  8. Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems.

  9. Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.

Five worked examples

Worked Example 1: Full SVD, rank, and reconstruction#

Introduction#

Compute thin SVD of a tall matrix, verify reconstruction and conditioning.

Purpose#

Show how $U,\Sigma,V$ capture range, domain, and scaling; check rank and condition number.

Importance#

SVD diagnostics reveal redundancy and stability before downstream tasks.

What this example demonstrates#

  • Thin SVD of $X \in \mathbb{R}^{n\times d}$.

  • Reconstruction $U\Sigma V^\top$ matches $X$.

  • Rank from nonzero singular values; condition number from extremes.

Background#

SVD generalizes eigen-decomposition; thin form is efficient when $n \gg d$.

Historical context#

Golub–Kahan bidiagonalization made SVD practical (1965).

History#

Adopted as a standard numerical primitive in LAPACK/BLAS.

Prevalence in ML#

Used in preprocessing, diagnostics, and as a baseline factorization.

Notes#

  • Use np.linalg.svd with full_matrices=False for thin SVD.

  • Small singular values indicate near-rank-deficiency.

Connection to ML#

Conditioning guides choice of solvers/regularizers; rank informs model capacity.

Connection to Linear Algebra Theory#

$X^\top X = V \Sigma^2 V^\top$; nonzero singular values are square roots of eigenvalues.

Pedagogical Significance#

Concrete start-to-finish SVD computation with sanity checks.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 8, 4
X = np.random.randn(n, d) + 0.2 * np.random.randn(n, d)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)
X_rec = U @ Sigma @ Vt

rank = np.sum(s > 1e-10)
kappa = s[0] / s[-1]

print("singular values:", np.round(s, 4))
print("rank:", rank, "cond:", round(kappa, 4))
print("reconstruction error Fro:", np.linalg.norm(X - X_rec, "fro"))

Worked Example 2: Best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate low-rank approximation of an image-like matrix and quantify error decay.

Purpose#

Show optimality of truncated SVD and energy captured by top-$k$ singular values.

Importance#

Core to compression, recommendation, and fast inference with reduced models.

What this example demonstrates#

  • Build $X$, compute SVD, form $X_k$.

  • Compare Frobenius error to tail singular values.

  • Report explained variance ratio.

Background#

Eckart–Young–Mirsky: truncated SVD minimizes Frobenius/spectral error among rank-$k$ matrices.

Historical context#

Eckart & Young (1936) established the optimality theorem.

History#

Used in image compression and modern low-rank neural adaptation.

Prevalence in ML#

Standard for LSA, recommender warm-starts, and distillation via low-rank layers.

Notes#

  • Sort singular values (already sorted by SVD) before computing energy ratios.

Connection to ML#

Determines how many components to keep for accuracy/latency trade-offs.

Connection to Linear Algebra Theory#

Error squared equals sum of squared discarded singular values.

Pedagogical Significance#

Links abstract theorem to a tangible compression example.

References#

  1. Eckart & Young (1936). Approximation of matrices.

  2. Hansen (1998). Rank truncation for regularization.

Solution (Python)#

import numpy as np

np.random.seed(1)
X = np.random.randn(40, 30)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err = np.linalg.norm(X - Xk, "fro")
tail = np.sqrt((s[k:] ** 2).sum())
ev_ratio = (s[:k] ** 2).sum() / (s ** 2).sum()

print("rank-k error:", round(err, 4), "tail from theorem:", round(tail, 4))
print("explained variance ratio:", round(ev_ratio, 4))

Worked Example 3: PCA via SVD on centered data#

Introduction#

Use SVD on centered data to extract principal components and explained variance.

Purpose#

Show that PCA can be computed via SVD without forming the covariance matrix explicitly.

Importance#

Numerically stable and efficient for tall data; avoids $d\times d$ covariance when $d$ is large.

What this example demonstrates#

  • Center data $X_c$.

  • Compute SVD of $X_c / \sqrt{n}$ to obtain principal directions and variances.

  • Project onto top components and report variance explained.

Background#

PCA solves an eigenproblem on covariance; SVD is an equivalent and stable route.

Historical context#

Hotelling (1933) PCA; SVD popularized for PCA in numerical practice.

History#

Standard in scikit-learn and ML toolkits.

Prevalence in ML#

Common preprocessing for vision, speech, genomics, and exploratory analysis.

Notes#

  • Center each feature (subtract column means); feature scaling changes the covariance and hence the principal directions.

Connection to ML#

Dimensionality reduction preserves variance and often improves downstream generalization.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition equals right-singular structure of centered data.

Pedagogical Significance#

Shows the practical equivalence of PCA and SVD.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). Tutorial on PCA.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
ev = s**2
ev_ratio = ev / ev.sum()

k = 2
Z = Xc @ Vt[:k].T  # scores

print("singular values (scaled):", np.round(s, 4))
print("explained variance ratios:", np.round(ev_ratio, 4))
print("projected shape:", Z.shape)
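
To confirm the equivalence claimed above, a short self-contained check (same seed and construction as the solution, so the numbers match) compares the covariance eigendecomposition with the right-singular structure:

import numpy as np

np.random.seed(2)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)

C = Xc.T @ Xc / n                                    # covariance (1/n convention)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # eigh returns ascending order

print("eigenvalues equal squared scaled singular values:", np.allclose(eigvals, s ** 2))
print("principal directions agree up to sign:", np.allclose(np.abs(eigvecs), np.abs(Vt.T)))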

Worked Example 4: Least squares via pseudoinverse and SVD#

Introduction#

Solve an overdetermined system with SVD-based pseudoinverse; compare to normal equations.

Purpose#

Highlight stability of SVD solves, especially when $X^\top X$ is ill-conditioned.

Importance#

Prevents amplification of noise from near-singular normal matrices.

What this example demonstrates#

  • Compute $x^\* = X^\dagger y$.

  • Compare residuals and conditioning to normal-equation solution.

Background#

Normal equations square condition number; SVD avoids this amplification.

Historical context#

Moore–Penrose pseudoinverse formalized generalized inverses; SVD provides a constructive route.

History#

Standard approach in numerical linear algebra libraries for robust least squares.

Prevalence in ML#

Used in linear regression baselines, design-matrix solvers, and as subroutines in algorithms.

Notes#

  • Small singular values can be truncated or damped for better stability.

Connection to ML#

Improves robustness of linear regression and feature engineering pipelines.

Connection to Linear Algebra Theory#

Pseudoinverse gives minimum-norm solution among all least-squares minimizers.

Pedagogical Significance#

Shows practical benefit of SVD over naive normal equations.

References#

  1. Golub & Van Loan (2013). Stability of least squares.

  2. Hansen (1998). Regularization.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 80, 6
X = np.random.randn(n, d)
# Make columns nearly dependent
X[:, 0] = X[:, 1] + 1e-3 * np.random.randn(n)
w_true = np.array([2, -2, 0.5, 0, 0, 0])
y = X @ w_true + 0.05 * np.random.randn(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma_inv = np.diag(1.0 / s)
w_svd = Vt.T @ Sigma_inv @ U.T @ y

# Normal equations: solve (X^T X) w = X^T y; this squares the condition number
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

res_svd = np.linalg.norm(X @ w_svd - y)
res_ne = np.linalg.norm(X @ w_ne - y)

print("singular values:", np.round(s, 6))
print("residual SVD:", round(res_svd, 6), "residual normal eqn:", round(res_ne, 6))
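
A brief follow-up sketch (same seed and construction as above) illustrating the Background's point that forming $X^\top X$ squares the condition number:

import numpy as np

np.random.seed(3)
n, d = 80, 6
X = np.random.randn(n, d)
X[:, 0] = X[:, 1] + 1e-3 * np.random.randn(n)        # nearly dependent columns, as above

kappa_X = np.linalg.cond(X)
kappa_XtX = np.linalg.cond(X.T @ X)

print("cond(X):    ", f"{kappa_X:.3e}")
print("cond(X^T X):", f"{kappa_XtX:.3e}")
print("cond(X)^2:  ", f"{kappa_X**2:.3e}")           # matches cond(X^T X) up to rounding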

Worked Example 5: Truncated SVD for regularization and denoising#

Introduction#

Apply truncated SVD to an ill-conditioned linear system to reduce noise amplification.

Purpose#

Illustrate spectral filtering: discard tiny singular values to stabilize solutions.

Importance#

Prevents overfitting/noise blow-up in inverse problems and regression with multicollinearity.

What this example demonstrates#

  • Construct ill-conditioned matrix with decaying singular values.

  • Solve with full pseudoinverse vs. truncated SVD; compare error to ground truth.

Background#

Truncated SVD is a classical regularizer; related to Tikhonov but with hard cutoff in spectrum.

Historical context#

Widely used in inverse problems (Hansen 1998) and information retrieval.

History#

Precedes modern weight decay; spectral analog of feature pruning.

Prevalence in ML#

Used in deblurring, recommendation cold-start smoothing, and feature denoising.

Notes#

  • Choose cutoff based on noise level or cross-validation.

Connection to ML#

Improves generalization when design matrices are ill-conditioned.

Connection to Linear Algebra Theory#

Filters small singular values; solution lives in dominant singular subspace.

Pedagogical Significance#

Shows tangible benefit of truncating spectrum.

References#

  1. Hansen (1998). Rank truncation and regularization.

  2. Golub & Van Loan (2013). Regularized least squares.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 50, 30
# Create decaying singular values
U_rand, _ = np.linalg.qr(np.random.randn(n, n))
V_rand, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -3, min(n, d))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U_rand @ Sigma @ V_rand.T

w_true = np.zeros(d)
w_true[:5] = [2, -1, 0.5, 0.5, -0.2]
y_clean = A @ w_true
y = y_clean + 0.01 * np.random.randn(n)

U, sv, Vt = np.linalg.svd(A, full_matrices=False)
# Full pseudoinverse solution
w_full = (Vt.T / sv) @ (U.T @ y)
# Truncate small singular values
cut = 10
sv_trunc = sv[:cut]
w_trunc = (Vt[:cut].T / sv_trunc) @ (U[:, :cut].T @ y)

err_full = np.linalg.norm(w_full - w_true)
err_trunc = np.linalg.norm(w_trunc - w_true)

print("smallest singular value:", sv[-1])
print("error full:", round(err_full, 4), "error trunc:", round(err_trunc, 4))
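
As the Notes suggest, the truncation level can be chosen by sweeping it. A self-contained sketch (same seed and construction as above, so the data are identical) scores each cutoff against the known ground truth; in practice, where the true weights are unavailable, a held-out set or cross-validation plays the same role.

import numpy as np

np.random.seed(4)
n, d = 50, 30
U_rand, _ = np.linalg.qr(np.random.randn(n, n))
V_rand, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -3, min(n, d))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U_rand @ Sigma @ V_rand.T

w_true = np.zeros(d)
w_true[:5] = [2, -1, 0.5, 0.5, -0.2]
y = A @ w_true + 0.01 * np.random.randn(n)

U, sv, Vt = np.linalg.svd(A, full_matrices=False)

# Sweep the cutoff and record parameter error for each truncated solution
errs = []
for cut in range(1, d + 1):
    w = (Vt[:cut].T / sv[:cut]) @ (U[:, :cut].T @ y)
    errs.append(np.linalg.norm(w - w_true))

best = int(np.argmin(errs)) + 1
print("best cutoff:", best, "parameter error:", round(errs[best - 1], 4))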
