Chapter 15
Sparse Matrices
Key ideas: Introduction

Introduction#

Most real-world data matrices are sparse: social networks are sparse graphs; text is sparse (a million-word vocabulary, but each document contains only hundreds of words); images are sparse in wavelet domains; scientific simulations generate sparse discretizations. Dense matrices require $O(n^2)$ memory and $O(n^3)$ computation; sparse matrices require only $O(\text{nnz})$. A sparse matrix with $\text{nnz} = O(n)$ or $O(n \log n)$ reduces complexity from cubic to linear or nearly linear, enabling algorithms that would be intractable on dense representations. Exploiting sparsity requires careful algorithm design: level-of-fill control, reordering strategies, specialized data structures, and communication patterns in distributed solvers.

Important ideas#

  1. Sparse matrix representations

    • Compressed Sparse Row (CSR): store nonzeros by row; three arrays (values, column indices, row pointers); $O(\text{nnz} + n)$ memory.

    • Compressed Sparse Column (CSC): symmetric role for columns; efficient for column access.

    • Coordinate format (COO): list of $(i, j, v)$ triples; inefficient for computation but flexible for construction.

    • Diagonal, tridiagonal, block-structured: exploit special structure for even greater efficiency.

  2. Sparsity patterns and fill-in

    • Reachability: nonzero $(i, j)$ entry means path from $i$ to $j$ in graph; reveals communication pattern.

    • Fill-in: $L, U$ in $A = LU$ can have more nonzeros than $A$; reordering (minimum degree, nested dissection) minimizes fill-in.

    • Symbolic factorization: determine nonzero pattern of $L, U$ without computing values.

  3. Matrix-vector products and efficient computation

    • Dense $y = A x$ costs $O(n^2)$; sparse costs $O(\text{nnz})$.

    • SpMV (sparse matrix-vector product) is bandwidth-limited on modern hardware; careful data layout crucial.

    • GPU sparse kernels: cuSPARSE; CPU sparse kernels: MKL Sparse BLAS, OpenMP parallelization.

  4. Direct sparse solvers

    • LU with partial pivoting: reorder to minimize fill-in, apply pivoting.

    • Cholesky for SPD: symmetric structure implies symmetric reordering; often very sparse.

    • Supernodal methods: group small dense blocks for cache efficiency; key to modern solvers (PARDISO, SuperLU, UMFPACK).

  5. Iterative sparse solvers

    • CG, GMRES, MINRES: require only matrix-vector products; no explicit factorization.

    • Preconditioning essential: ILU (incomplete LU), IC (incomplete Cholesky), algebraic multigrid (AMG) enable convergence.

    • Matrix-free methods: supply only matrix-vector product function; never form or store matrix.

  6. Graph algorithms and adjacency matrices

    • Adjacency matrix $A$ where $A_{ij} = 1$ if edge $(i, j)$ exists; number of nonzeros = $2 \times$ edges (undirected).

    • Shortest paths, PageRank, diffusion: powers of the adjacency matrix encode reachability; repeated SpMV drives these computations.

    • Laplacian matrix $L = D - A$ (degree diagonal minus adjacency); eigenvectors yield graph partitions (spectral clustering).

  7. Storage and computational complexity trade-offs

    • Dense $A \in \mathbb{R}^{n \times n}$: $O(n^2)$ memory, $O(n^3)$ LU, $O(n^2)$ matrix-vector product.

    • Sparse $A$ with $\text{nnz} = O(n)$: $O(n)$ memory, roughly $O(n^{1.5})$ LU (with nested-dissection reordering on 2D/planar problems), $O(n)$ SpMV.

    • For $n = 10^6$: dense storage requires ~8 TB; sparse storage with $\text{nnz} = 10^7$ requires only a few hundred megabytes.
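
The CSR layout from item 1 can be inspected directly; a minimal SciPy sketch on a tiny matrix (`data`, `indices`, `indptr` are the actual scipy.sparse attribute names):

```python
import numpy as np
import scipy.sparse as sp

# A small 4x4 matrix with 7 nonzeros.
A = np.array([[10, 0, 0, 2],
              [0, 3, 9, 0],
              [0, 0, 7, 0],
              [8, 0, 0, 5]], dtype=float)
A_csr = sp.csr_matrix(A)

# CSR stores three arrays: values, column indices, and row pointers.
print(A_csr.data)     # nonzero values, row by row
print(A_csr.indices)  # column index of each stored value
print(A_csr.indptr)   # row i occupies data[indptr[i]:indptr[i+1]]

# Sanity check: SpMV agrees with the dense product.
x = np.arange(4, dtype=float)
assert np.allclose(A_csr @ x, A @ x)
```

Note how `indptr` has length $n+1$: the difference `indptr[i+1] - indptr[i]` is the number of nonzeros in row $i$, matching the CSR definition later in this chapter.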

Relevance to ML#

  • Text and NLP: Document-term matrices are sparse (million words, hundreds per document); sparse SVD enables topic models.

  • Graphs and networks: Social networks, citation graphs, knowledge graphs are sparse; graph neural networks leverage adjacency matrix sparsity.

  • Recommender systems: User-item ratings sparse ($\sim 0.1\%$ observed); matrix completion via sparse factorization.

  • Inverse problems and imaging: Medical imaging, seismic inversion, tomography discretizations are sparse; sparse solvers essential.

  • Large-scale optimization: SGD and proximal methods operate on sparse data; distributed solvers exploit communication sparsity.

Algorithmic development (milestones)#

  • 1960s: Sparse matrix computation recognized as practical necessity in structural engineering (finite element analysis).

  • 1967: Tinney & Walker develop minimum degree ordering; George (1973) extends it to nested dissection.

  • 1979: Duff develops efficient symbolic factorization algorithms.

  • 1980s: Supernodal methods developed (Ashcraft, Eisenstat); enable cache-efficient dense kernels within sparse framework.

  • 1990s: PARDISO, SuperLU, UMFPACK emerge as production sparse solvers.

  • 2000s: Algebraic multigrid (AMG) becomes standard preconditioner for sparse iterative methods.

  • 2005–2010: GPU sparse kernels (cuSPARSE) enable acceleration; sparse matrix formats adapt to GPU memory.

  • 2010s: Sparse deep learning (neural networks on graph-structured data); graph convolutional networks (GCN), graph attention networks (GAT) exploit adjacency matrix sparsity.

Definitions#

  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A) \ll n^2$ nonzeros; typically $\text{nnz} = O(n)$ or $O(n \log n)$.

  • Sparsity: $\text{sparsity} = 1 - \frac{\text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity} > 0.99$ is typical for ML.

  • Compressed Sparse Row (CSR): Three arrays: values[nnz], col[nnz], row_ptr[n+1]; row_ptr[i] points to start of row $i$ in values/col.

  • Fill-in: Nonzero entries created during factorization that were zero in original matrix.

  • Reordering: Permutation $P$ such that $P A P^\top$ has lower fill-in; minimum degree, nested dissection are common strategies.

  • Graph of a matrix: Undirected graph $G$ where vertices are matrix indices, edge $(i, j)$ exists iff $A_{ij} \ne 0$.

  • Degree: Vertex $i$ has degree = number of nonzeros in row/column $i$.

  • Laplacian matrix: $L = D - A$ where $D = \text{diag}(d_1, \ldots, d_n)$ is degree matrix, $A$ is adjacency matrix.
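
The degree-matrix and Laplacian definitions above can be verified numerically; a small NumPy sketch on a 4-cycle graph:

```python
import numpy as np

# Adjacency matrix of an undirected 4-cycle: 0-1-2-3-0.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

# L is PSD with smallest eigenvalue 0 (constant eigenvector).
eigvals = np.linalg.eigvalsh(L)   # ascending order
assert abs(eigvals[0]) < 1e-10
assert (eigvals >= -1e-10).all()
# Row sums vanish: L @ 1 = 0, i.e. the constant vector is in the null space.
assert np.allclose(L @ np.ones(4), 0)
```

For the 4-cycle the spectrum is $\{0, 2, 2, 4\}$; the zero eigenvalue with constant eigenvector is exactly the property exploited by spectral clustering.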

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Sparse matrix representations: CSR, CSC, COO formats; trade-offs between storage and access patterns. Reference: Golub & Van Loan (2013).

  • Fill-in and reordering: Minimum degree, nested dissection reduce fill-in; symbolic factorization predicts nonzero pattern. Reference: George & Liu (1981).

  • SpMV complexity: $O(\text{nnz})$ time, bandwidth-limited on modern hardware. Reference: Williams et al. (2009).

  • Direct sparse solvers: LU with reordering, supernodal methods, pivot selection for stability. Reference: Davis (2006) (SuiteSparse); Schenk & Gärtner (2004) (PARDISO).

  • Iterative sparse solvers: CG, GMRES on Krylov subspaces; preconditioning essential (ILU, IC, AMG). Reference: Saad (2003); Briggs et al. (2000) (multigrid tutorial).

  • Graph Laplacian and spectral methods: Eigenvalues/eigenvectors of $L = D - A$ for clustering, diffusion. Reference: Spielman & Teng (2004).

Applied (landmark systems)#

  • Sparse SVD for dimensionality reduction: ARPACK for large sparse matrices; used in topic modeling (LSA, LDA). Implementation: scikit-learn TruncatedSVD. Reference: Deerwester et al. (1990).

  • Direct sparse solvers (production): PARDISO (Intel MKL), SuperLU (LBNL), UMFPACK (UMF collection). Deployed in finite element, circuit simulation. Reference: Schenk & Gärtner (2004); Li (2005).

  • Iterative sparse solvers with preconditioning: GMRES + ILU in PETSc, Trilinos; ALS for matrix completion in Spark MLlib. Reference: Balay et al. (2021) (PETSc); Heroux et al. (2005) (Trilinos).

  • Graph neural networks: PyTorch Geometric adjacency matrix operations; sparse matrix multiplication for graph convolution. Reference: Kipf & Welling (2017).

  • Compressed sensing and sparse recovery: ADMM/FISTA for L1-regularized problems; applications in imaging, signal processing. Reference: Boyd et al. (2011); Beck & Teboulle (2009).

Key ideas: Where it shows up
  1. Text and NLP

    • Document-term matrices $X \in \mathbb{R}^{n \times d}$ with $n$ documents, $d$ vocabulary terms; $\text{nnz}$ scales as $n$ times the words per document, far below the $nd$ entries of a dense matrix.

    • Achievements: sparse SVD/NMF in topic modeling (LSA, LDA); scikit-learn TfidfVectorizer produces sparse matrices. References: Deerwester et al. (1990) (LSA); Blei et al. (2003) (LDA).

  2. Graph neural networks and social networks

    • Adjacency matrix $A \in \mathbb{R}^{n \times n}$ sparse; graph convolution $X' = \sigma(AXW)$ exploits sparsity for $O(\text{nnz} \cdot d)$ per layer vs. $O(n^2 d)$ dense.

    • Achievements: PyTorch Geometric, DGL, Spektral enable billion-node GNNs; GCN (Kipf & Welling 2017), GAT (Velickovic et al. 2018) standard. References: Kipf & Welling (2017); Velickovic et al. (2018).

  3. Recommender systems and matrix completion

    • Rating matrices $R \in \mathbb{R}^{m \times n}$ (users × items) extremely sparse ($\sim 0.01\%$ observed); matrix factorization via alternating least squares.

    • Achievements: Sparse SVD in Netflix prize; ALS in Spark MLlib scales to billions of ratings. References: Koren et al. (2009) (matrix factorization); Zhou et al. (2008) (implicit feedback).

  4. Inverse problems and sparse regularization

    • Measurement matrix $A$ in compressed sensing, tomography often sparse or low-rank. Sparse solutions via L1 regularization (LASSO).

    • Achievements: ADMM (alternating direction method of multipliers) for large-scale L1-regularized problems; FISTA (fast iterative shrinkage-thresholding). References: Boyd et al. (2011) (ADMM); Beck & Teboulle (2009) (FISTA).

  5. Distributed and federated learning

    • Distributed optimization exploits sparsity in communication: gradient compression, local SGD reduce communication. Sparse parameter updates.

    • Achievements: gradient sparsification in federated learning (TensorFlow Federated); communication-efficient distributed SGD (Kairouz et al. 2021). References: Kairouz et al. (2021) (federated learning).

Notation
  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A)$ nonzeros; typically $\text{nnz} \ll n^2$.

  • Sparsity: $\text{sparsity}(A) = \frac{n^2 - \text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity}(A) > 0.99$ typical.

  • CSR format: Values v[1..nnz], column indices j[1..nnz], row pointers p[1..n+1] with p[i+1] - p[i] = nonzeros in row $i$.

  • Adjacency matrix: $A \in \{0,1\}^{n \times n}$ where $A_{ij} = 1$ if edge $(i, j)$ exists; $\text{nnz}(A) = 2m$ for $m$ edges (undirected).

  • Degree matrix: $D = \text{diag}(d_1, \ldots, d_n)$ where $d_i = \sum_j A_{ij}$ (degree of vertex $i$).

  • Laplacian: $L = D - A$ (graph Laplacian); $L$ is PSD with $\lambda_1 = 0$ (constant eigenvector).

  • Fill-in ratio: $\text{fill}(L, U) = \frac{\text{nnz}(L) + \text{nnz}(U)}{\text{nnz}(A)}$ (factorization creates nonzeros).

  • Example: Document-term matrix for 100K documents, 1M vocabulary words, ~200 words per doc: $\text{nnz} \approx 100K \times 200 = 20M$; density $= 2 \times 10^7 / 10^{11} = 0.02\%$.

Pitfalls & sanity checks
  • Never convert large sparse matrix to dense: Memory explosion; dense matrix for $n = 10^6$ requires terabytes.

  • Choose representation based on access pattern: CSR for row access/SpMV, CSC for column access/factorization.

  • Monitor fill-in in factorization: If fill-in ratio $> 100$, consider iterative solver + preconditioner instead.

  • Ensure reordering is applied: Production solvers (PARDISO, SuperLU) apply automatically; verify if using custom code.

  • Test on subset before scaling: Verify correctness on small sparse system before large deployment.

  • GPU sparse kernels have overhead: SpMV on GPU can be slower than on CPU for small matrices; the crossover is hardware-dependent, often around $n$ in the low thousands.

  • Preconditioning setup vs. solve trade-off: Single solve: skip preconditioning. Multiple solves with same matrix: preconditioning pays off.

  • Sparsity pattern changes: If $A$ has varying sparsity over iterations, symbolic factorization must be updated.
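
The format-choice pitfall above is easy to observe in SciPy: row slices are cheap from CSR and column slices from CSC, while the mismatched access must scan the index arrays. A small timing sketch (absolute times are machine-dependent):

```python
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n = 5000
A_csr = sp.random(n, n, density=1e-3, format='csr', random_state=rng)
A_csc = A_csr.tocsc()  # same matrix, column-oriented storage

# Row slices are contiguous in CSR.
t0 = time.perf_counter()
for i in range(200):
    _ = A_csr.getrow(i)
t_row_csr = time.perf_counter() - t0

# Column slices are contiguous in CSC.
t0 = time.perf_counter()
for i in range(200):
    _ = A_csc.getcol(i)
t_col_csc = time.perf_counter() - t0

# Mismatched access: extracting columns from CSR scans index arrays.
t0 = time.perf_counter()
for i in range(200):
    _ = A_csr.getcol(i)
t_col_csr = time.perf_counter() - t0

print("row from CSR: {:.4f}s, col from CSC: {:.4f}s, col from CSR: {:.4f}s".format(
    t_row_csr, t_col_csc, t_col_csr))
```

Conversion between CSR and CSC costs $O(\text{nnz})$, so converting once up front is almost always cheaper than repeatedly slicing against the grain.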

References

Foundational sparse matrix theory

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Sparse matrix representations and algorithms

  1. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  2. Tinney, W. F., & Walker, J. W. (1967). Direct solutions of sparse network equations by optimally ordered triangular factorization.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Sparse iterative solvers and preconditioning

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Ruge, J. W., & Stüben, K. (1987). Algebraic Multigrid (AMG).

Sparse linear algebra in scientific computing

  1. Balay, S., Abhyankar, S., Adams, M. F., et al. (2021). PETSc Users Manual.

  2. Heroux, M. A., Bartlett, R. A., Howle, V. E., et al. (2005). An overview of the Trilinos project.

  3. Li, X. S. (2005). An overview of SuperLU: Algorithms, implementation, and user interface.

Topic modeling and sparse SVD

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

  3. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

Graph neural networks and sparse operations

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Recommender systems and matrix completion

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  2. Zhou, T., Kuscsik, Z., Liu, J. K., Medo, M., Yeung, C. H., & Zhang, Y. C. (2010). Solving the apparent diversity-accuracy dilemma of recommender systems.

Optimization and sparse recovery

  1. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers.

  2. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems.

  3. Kairouz, P., McMahan, H. B., Avent, B., et al. (2021). Advances and Open Problems in Federated Learning.

Five worked examples

Worked Example 1: Sparse matrix representations and storage efficiency#

Introduction#

Construct a sparse matrix (adjacency matrix of a large graph); compare storage and computation time across dense, CSR, and COO formats; demonstrate memory/time trade-offs.

Purpose#

Illustrate practical efficiency gains of sparse formats; motivate algorithm selection based on access patterns.

Importance#

Foundation for understanding when sparse methods are worthwhile; shows memory savings enable billion-scale problems.

What this example demonstrates#

  • Generate sparse graph (e.g., scale-free network or grid).

  • Store as dense array, then convert to CSR, CSC, COO.

  • Measure memory footprint for each format.

  • Benchmark matrix-vector product on dense vs. sparse.

  • Plot time vs. sparsity level.

Background#

Dense storage is simple but wasteful for sparse matrices. Sparse formats trade memory for computational complexity in accessing elements.

Historical context#

Sparse matrix formats were developed in the 1960s–1970s, when memory constraints in finite element analysis made dense storage infeasible.

History#

CSR became the de facto standard; GPU support was added via NVIDIA's cuSPARSE library.

Prevalence in ML#

Every large-scale ML system (Spark, Hadoop, TensorFlow) supports sparse matrices; default for text, graphs.

Notes#

  • CSR efficient for row access, SpMV; CSC efficient for column access, matrix-matrix products.

  • COO flexible for construction; convert to CSR/CSC before computation.

  • GPU memory bandwidth is bottleneck; careful format choice crucial.

Connection to ML#

Text vectorization produces sparse matrices; graphs produce sparse adjacency matrices. System must efficiently handle both.

Connection to Linear Algebra Theory#

Sparse matrix properties (connectivity, degree distribution) determine factorization cost and algorithm performance.

Pedagogical Significance#

Demonstrates performance trade-offs; motivates specialized algorithms.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(30)

# Create sparse adjacency matrix (random graph)
n = 10000
density = 0.001  # 0.1% density (99.9% sparse)
nnz = int(density * n * n)

# Generate random sparse matrix
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
data = np.ones(nnz)

# CSR format (efficient for SpMV)
A_csr = sp.csr_matrix((data, (rows, cols)), shape=(n, n))

# CSC format (efficient for column access)
A_csc = A_csr.tocsc()

# COO format (flexible)
A_coo = sp.coo_matrix((data, (rows, cols)), shape=(n, n))

# Dense format (feasible only at this size: ~0.8 GB for 10000x10000 float64)
A_dense = A_csr.toarray()

# Measure memory (nbytes counts the raw array storage, unlike sys.getsizeof)
mem_csr = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
mem_dense = A_dense.nbytes

print("Memory comparison for {}x{} matrix (density {:.4f}%)".format(n, n, 100*density))
print("  Dense: {:.2f} MB".format(mem_dense / 1e6))
print("  CSR: {:.2f} MB".format(mem_csr / 1e6))
print("  Savings: {:.1f}x".format(mem_dense / mem_csr))

# Benchmark matrix-vector product
x = np.random.randn(n)
iterations = 100

# Sparse SpMV
t0 = time.time()
for _ in range(iterations):
    y_sparse = A_csr @ x
t_sparse = time.time() - t0

# Dense gemv
t0 = time.time()
for _ in range(iterations):
    y_dense = A_dense @ x
t_dense = time.time() - t0

print("\nMatrix-vector product time ({} iterations):".format(iterations))
print("  Sparse: {:.4f} s".format(t_sparse))
print("  Dense: {:.4f} s".format(t_dense))
print("  Speedup: {:.1f}x".format(t_dense / t_sparse))

print("\nDensity: {:.2f}%".format(100 * A_csr.nnz / (n*n)))
print("NNZ: {}".format(A_csr.nnz))

Worked Example 2: Fill-in and reordering in sparse LU factorization#

Introduction#

Construct a sparse matrix; compute LU factorization without reordering and with minimum-degree reordering; compare fill-in and factorization time.

Purpose#

Demonstrate how reordering dramatically reduces fill-in and speeds factorization.

Importance#

Critical for practical sparse solvers; shows that problem structure matters.

What this example demonstrates#

  • Construct sparse system (banded, block-structured, or random).

  • Perform symbolic LU factorization (predict nonzero pattern).

  • Compute LU without reordering; measure fill-in ratio.

  • Apply minimum-degree reordering; compute LU again.

  • Compare factorization time and fill-in.

Background#

Fill-in (new nonzeros created during factorization) can destroy sparsity; reordering (permutation of rows/columns) reduces fill-in.

Historical context#

Tinney & Walker (1967) minimum degree; George (1973) nested dissection; automatic reordering is now standard in sparse solver libraries.

History#

All production sparse solvers (PARDISO, SuperLU, UMFPACK) include automatic reordering.

Prevalence in ML#

Used in sparse inverse problems, finite element discretizations; less visible in ML but critical for scalability.

Notes#

  • Minimum degree is heuristic; nested dissection is theoretically superior but more complex.

  • Fill-in ratio typically 5–100× for unordered; reordering reduces to 1–5×.

Connection to ML#

Sparse solve is bottleneck in iterative methods (Newton, interior-point algorithms). Reordering accelerates optimization.

Connection to Linear Algebra Theory#

Graph reachability (matrix powers) predicts fill-in; reduction is graph-theoretic problem.

Pedagogical Significance#

Shows interplay between linear algebra and graph algorithms.

References#

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
import time

np.random.seed(31)

# Create sparse banded system (mimics discretized PDE)
n = 1000
A_banded = sp.diags([np.ones(n-1), -2*np.ones(n), np.ones(n-1)],
                    [-1, 0, 1], shape=(n, n), format='csr')
# Add random entries to create irregular pattern
rows = np.random.choice(n, 50)
cols = np.random.choice(n, 50)
A_banded = A_banded + 0.01 * sp.csr_matrix((np.ones(50), (rows, cols)), shape=(n, n))
A_banded.eliminate_zeros()

# Convert to CSC for LU (standard for sparse direct solvers)
A_csc = A_banded.tocsc()

# LU without reordering
print("LU factorization comparison (n={})".format(n))
print("\nNo reordering:")
t0 = time.time()
LU_no_reorder = spla.splu(A_csc, permc_spec='NATURAL')
t_no_reorder = time.time() - t0
nnz_no_reorder = LU_no_reorder.nnz

# LU with COLAMD reordering (minimum degree heuristic)
print("\nWith COLAMD reordering:")
t0 = time.time()
LU_reorder = spla.splu(A_csc, permc_spec='COLAMD')
t_reorder = time.time() - t0
nnz_reorder = LU_reorder.nnz

print("\nResults:")
print("  Original NNZ: {}".format(A_csc.nnz))
print("  No reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_no_reorder, nnz_no_reorder / A_csc.nnz, t_no_reorder))
print("  With reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_reorder, nnz_reorder / A_csc.nnz, t_reorder))
print("  Fill-in reduction: {:.1f}x".format(nnz_no_reorder / nnz_reorder))
print("  Time improvement: {:.1f}x".format(t_no_reorder / t_reorder))

Worked Example 3: Sparse SVD for topic modeling (LSA)#

Introduction#

Apply truncated SVD to sparse document-term matrix; extract latent topics; compare sparse vs. dense SVD efficiency.

Purpose#

Show practical use of sparse linear algebra in NLP; demonstrate dimensionality reduction on real-world sparse data.

Importance#

Sparse SVD is standard in topic modeling; ARPACK enables scalability to billions of words.

What this example demonstrates#

  • Generate synthetic document-term matrix (sparse, density well under 1%).

  • Compute truncated SVD via ARPACK (sparse).

  • Compute dense SVD for comparison (infeasible for large $n$).

  • Extract topics (top words per topic).

  • Measure reconstruction error vs. rank.

Background#

LSA (latent semantic analysis) applies SVD to tf-idf matrix; low-rank factors reveal latent topics.

Historical context#

Deerwester et al. (1990) pioneered LSA; ARPACK (Lehoucq et al. 1998) made scalable SVD practical.

History#

Standard in topic modeling pipelines; scikit-learn TruncatedSVD widely used.

Prevalence in ML#

Text mining, information retrieval, and recommendation systems routinely use sparse SVD.

Notes#

  • Truncated SVD (rank $k \ll n$) computes only $k$ largest singular values; much cheaper than full SVD.

  • Sparse matrix multiplication dominates cost; $O(k \cdot \text{nnz})$ per iteration.

Connection to ML#

Topics are interpretable; document-topic and topic-word matrices enable downstream tasks.

Connection to Linear Algebra Theory#

ARPACK uses Krylov subspaces and implicitly restarted Arnoldi; exploits SpMV efficiency.

Pedagogical Significance#

Demonstrates power of combining sparse formats with specialized algorithms (Krylov methods).

References#

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

  3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import time

np.random.seed(32)

# Synthetic document-term matrix (sparse)
n_docs, n_terms = 10000, 50000
nnz_per_doc = 200

# Generate sparse matrix (realistic sparsity)
data = []
row_idx = []
col_idx = []
for doc in range(n_docs):
    terms = np.random.choice(n_terms, nnz_per_doc, replace=False)
    counts = np.random.poisson(2, nnz_per_doc) + 1
    data.extend(counts)
    row_idx.extend([doc] * nnz_per_doc)
    col_idx.extend(terms)

X = sp.csr_matrix((data, (row_idx, col_idx)), shape=(n_docs, n_terms))

# Normalize rows (tf-like weighting)
X = X.astype(np.float32)
row_sums = np.asarray(X.sum(axis=1)).ravel()
X = sp.diags(1.0 / row_sums) @ X  # row-scaling via a sparse diagonal product

print("Document-term matrix: {}x{}, sparsity: {:.4f}%".format(
    n_docs, n_terms, 100*(1 - X.nnz/(n_docs*n_terms))))

# Sparse SVD (ARPACK)
k = 50  # Number of topics
print("\nTruncated SVD (rank {})...".format(k))
t0 = time.time()
U_sparse, s_sparse, Vt_sparse = svds(X, k=k, which='LM')
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense SVD (for small subset for illustration)
print("\nFull dense SVD (on 100x1000 subset for illustration)...")
X_dense_small = X[:100, :1000].toarray()
t0 = time.time()
U_dense, s_dense, Vt_dense = np.linalg.svd(X_dense_small, full_matrices=False)
t_dense = time.time() - t0
print("  Time (100x1000): {:.4f} s".format(t_dense))
print("  Dense SVD on the full matrix would be far more expensive in time and memory")

# Extract topics (top terms per topic)
print("\nTop 5 terms per topic (first 3 topics):")
for topic_idx in range(min(3, k)):
    top_terms = np.argsort(Vt_sparse[topic_idx])[-5:][::-1]
    print("  Topic {}: terms {}".format(topic_idx, top_terms))

# Reconstruction error on the first 100 rows (avoids forming the full dense reconstruction)
rank_list = [5, 10, 20, 50]
X_top = X[:100, :].toarray()
print("\nReconstruction error vs. rank:")
for r in rank_list:
    # svds returns singular values in ascending order; take the r largest
    U_r = U_sparse[:100, -r:]
    s_r = s_sparse[-r:]
    Vt_r = Vt_sparse[-r:, :]
    X_recon = U_r @ np.diag(s_r) @ Vt_r
    err = np.linalg.norm(X_top - X_recon) / np.linalg.norm(X_top)
    print("  Rank {}: relative error {:.4f}".format(r, err))

Worked Example 4: Preconditioned iterative solver on sparse system#

Introduction#

Construct a large sparse SPD system (from discretized PDE); solve via unpreconditioned and preconditioned CG; demonstrate preconditioning impact.

Purpose#

Show how preconditioning enables convergence on sparse systems where direct solvers would be expensive.

Importance#

Practical approach for million-scale sparse systems; avoids expensive fill-in of direct LU.

What this example demonstrates#

  • Construct sparse Laplacian or discretized diffusion matrix.

  • Solve unpreconditioned CG; count iterations.

  • Construct ILU or AMG preconditioner.

  • Solve preconditioned CG; count iterations.

  • Compare iteration counts and wall-clock time.

Background#

Iterative solvers avoid factorization fill-in; preconditioning clusters spectrum for fast convergence.

Historical context#

Preconditioned CG developed 1970s–1980s; ILU standard; AMG emerged 1980s–1990s.

History#

Standard in PETSc, Trilinos, deal.II (finite element libraries); production code for billion-node problems.

Prevalence in ML#

Used in large-scale optimization (Newton, interior-point), graph algorithms (diffusion on networks).

Notes#

  • Preconditioning setup cost amortized over many solves.

  • ILU relatively cheap; AMG more expensive but better convergence for difficult problems.

Connection to ML#

Large-scale optimization (e.g., logistic regression on massive sparse data) requires preconditioned iterative methods.

Connection to Linear Algebra Theory#

CG iteration count $\approx \sqrt{\kappa}$; preconditioning reduces $\kappa$ to enable convergence.

Pedagogical Significance#

Demonstrates integration of sparsity, preconditioning, and iterative methods.

References#

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

np.random.seed(33)

# Create sparse Laplacian (discretized 2D Poisson, 100x100 grid, 5-point stencil)
# Built as a Kronecker sum of 1D second-difference matrices: A = I (x) T + T (x) I
grid = 100
n = grid * grid
T = sp.diags([-np.ones(grid-1), 2*np.ones(grid), -np.ones(grid-1)],
             [-1, 0, 1], shape=(grid, grid))
A = (sp.kron(sp.eye(grid), T) + sp.kron(T, sp.eye(grid))).tocsr()

# RHS
b = np.random.randn(n)

print("Sparse Laplacian system: {}x{}, nnz={}".format(n, n, A.nnz))
print("Condition number grows as O(grid^2): roughly {:.0f} here".format((2*grid/np.pi)**2))

# Unpreconditioned CG
print("\nUnpreconditioned CG:")
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))

x_un, info_un = spla.cg(A, b, rtol=1e-6, callback=callback_un, maxiter=n)  # rtol was named `tol` before SciPy 1.12
print("  Iterations: {}".format(len(residuals_unprecond)))

# ILU preconditioner
print("\nBuilding ILU preconditioner...")
M_ilu = spla.spilu(A.tocsc())
M = spla.LinearOperator((n, n), matvec=M_ilu.solve)

print("Preconditioned CG:")
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))

x_pre, info_pre = spla.cg(A, b, M=M, rtol=1e-6, callback=callback_pre, maxiter=n)
print("  Iterations: {}".format(len(residuals_precond)))

print("\nSummary:")
print("  Unpreconditioned iterations: {}".format(len(residuals_unprecond)))
print("  Preconditioned iterations: {}".format(len(residuals_precond)))
print("  Speedup: {:.1f}x".format(len(residuals_unprecond) / len(residuals_precond)))
print("  Both converged:", info_un == 0 and info_pre == 0)

Worked Example 5: Graph neural network convolution on sparse adjacency matrix#

Introduction#

Construct a graph (sparse adjacency matrix); implement graph convolution layer; measure SpMM efficiency vs. dense multiplication.

Purpose#

Demonstrate sparse matrix operations in modern ML (graph neural networks).

Importance#

GNNs process billion-node graphs; sparsity is essential for feasibility.

What this example demonstrates#

  • Generate graph (scale-free, grid, or real network).

  • Build node feature matrix $X \in \mathbb{R}^{n \times d}$.

  • Implement sparse graph convolution: $X' = \sigma(A X W)$ with SpMM.

  • Compare sparse vs. dense GCN layer time/memory.

  • Verify output correctness.

Background#

Graph convolution aggregates features from neighbors via sparse $A$ multiplication.

Historical context#

Kipf & Welling (2017) GCN; PyTorch Geometric (2019) enables efficient sparse operations.

History#

Standard in graph learning frameworks (DGL, Spektral, PyTorch Geometric).

Prevalence in ML#

GNNs on social networks, citation graphs, molecular graphs; sparse adjacency matrix standard.

Notes#

  • SpMM (sparse-dense matrix product) is key operation; GPU acceleration available.

  • For $n$-node graph with $m$ edges and $d$ features: sparse is $O(m \cdot d)$, dense is $O(n^2 \cdot d)$.

Connection to ML#

GNNs enable learning on structured, graph-based data; scalable to billion-node networks via sparsity.

Connection to Linear Algebra Theory#

Graph convolution is polynomial filter on graph Laplacian; spectral perspective connects to Chebyshev polynomials.

Pedagogical Significance#

Integrates sparse linear algebra, graph theory, and modern deep learning.

References#

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(34)

# Generate a random graph (uniform random attachment; a true scale-free
# graph would need preferential attachment, e.g. Barabasi-Albert)
n = 10000  # Nodes
avg_degree = 10
edges = []
for i in range(n):
    # Connect each node to avg_degree targets chosen uniformly at random
    targets = np.random.choice(n, min(avg_degree, n-1), replace=False)
    for j in targets:
        if i != j:
            edges.append((i, j))
            edges.append((j, i))  # Undirected: store both directions

edges = np.array(edges)
A = sp.csr_matrix((np.ones(len(edges)), edges.T), shape=(n, n))
A.sum_duplicates()  # COO-style construction sums duplicate (i, j) pairs...
A.data[:] = 1.0     # ...so reset summed entries to 1 (plain 0/1 adjacency)

# Node features
d = 64  # Feature dimension
X = np.random.randn(n, d).astype(np.float32)

# Weight matrix (GCN parameters)
W = np.random.randn(d, d).astype(np.float32)

print("Graph: n={}, edges={}, sparsity={:.4f}%".format(
    n, A.nnz // 2, 100*(1 - A.nnz/(n*n))))
print("Features: n={}, d={}".format(n, d))

# Sparse graph convolution: X_out = A @ X @ W
print("\nSparse graph convolution:")
t0 = time.time()
X_intermediate = A @ X  # SpMM: sparse @ dense
X_out_sparse = X_intermediate @ W  # Dense @ dense
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense graph convolution (for comparison on subset)
print("\nDense graph convolution (on 1000-node subgraph):")
A_dense_sub = A[:1000, :1000].toarray()
X_sub = X[:1000]
t0 = time.time()
X_out_dense_sub = A_dense_sub @ X_sub @ W
t_dense_sub = time.time() - t0
print("  Time (1000 nodes): {:.4f} s".format(t_dense_sub))
# Extrapolate to full graph
t_dense_extrapolated = t_dense_sub * (n/1000)**2
print("  Extrapolated to full graph: {:.2f} s".format(t_dense_extrapolated))

print("\nSpeedup (dense vs. sparse): {:.1f}x".format(t_dense_extrapolated / t_sparse))

# Apply nonlinearity (ReLU)
X_out_sparse = np.maximum(X_out_sparse, 0)

print("\nGCN layer output shape:", X_out_sparse.shape)
print("Output mean:", np.mean(X_out_sparse))
print("Output sparsity after ReLU:", 100 * np.sum(X_out_sparse == 0) / X_out_sparse.size, "%")

Chapter 14
Conditioning & Stability
Key ideas: Introduction

Introduction#

Numerical stability determines whether an algorithm produces a correct answer. Machine arithmetic operates on finite precision (float64: ~16 significant digits). An ill-conditioned problem (large $\kappa$) is inherently difficult: small input perturbations cause large output changes. Backward-stable algorithms guarantee that computed solution is exact solution to a nearby problem. Forward error depends on conditioning; backward error depends on algorithm stability. Together, they predict what accuracy is achievable. This is fundamental to reliable ML: ill-conditioned matrices appear in nearly singular Gram matrices, ill-posed inverse problems, and deep network Hessians.

Important ideas#

  1. Condition number and sensitivity

    • $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

    • Forward error bound: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \cdot \frac{\| \Delta A \|}{\| A \|}$ (perturbation on $A$ causes error in $x$).

    • Ill-conditioned ($\kappa \gg 1$) problems amplify noise; well-conditioned ($\kappa \approx 1$) are benign.

  2. Machine precision and floating-point arithmetic

    • Machine epsilon: $\varepsilon_{\text{machine}} \approx 2^{-52} \approx 2.2 \times 10^{-16}$ for float64.

    • Relative error per operation: $O(\varepsilon_{\text{machine}})$; accumulated error over $n$ operations: $O(n \cdot \varepsilon_{\text{machine}})$.

    • Loss of significance: if $a \approx b$, then $a - b$ has few correct digits (cancellation).

  3. Backward error analysis and stability

    • Algorithm is backward-stable if computed solution $\tilde{x}$ is exact solution to perturbed problem: $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon_{\text{machine}})$.

    • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|} \approx \kappa(A) \cdot O(\varepsilon_{\text{machine}})$ for stable algorithm.

    • Unstable algorithm: backward error is large; forward error can be catastrophic.

  4. Condition number of Gram matrix vs. original matrix

    • $\kappa(A^\top A) = \kappa(A)^2$; normal equations square the condition number.

    • QR/SVD avoid this: $\kappa(R) = \kappa(A)$ (no squaring), $\kappa(\Sigma) = \kappa(A)$.

    • For least squares: direct method on $A$ (QR) preferable to solving $A^\top A x = A^\top b$.

  5. Preconditioning and effective conditioning

    • Preconditioner $M \approx A$ transforms $A \to M^{-1} A$; goal is $\kappa(M^{-1} A) \ll \kappa(A)$.

    • Incomplete LU (ILU), Jacobi (diagonal), algebraic multigrid (AMG) are practical preconditioners.

    • CG iteration count $\approx \sqrt{\kappa(M^{-1} A)}$; reducing $\kappa$ by factor 100 reduces iterations by factor 10.

  6. Regularization and ill-posedness

    • Ill-posed problem: small changes in data cause arbitrarily large changes in solution ($\kappa = \infty$).

    • Tikhonov regularization: $(A^\top A + \lambda I) x = A^\top b$ shifts small singular values by $\lambda$.

    • Effective $\kappa$ after regularization: $\kappa_{\text{reg}} = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$; well-chosen $\lambda$ reduces condition number.

  7. Numerical cancellation and loss of significance

    • Subtraction $a - b$ where $a \approx b$ cancels leading digits; result has low precision.

    • Orthogonal transformations (QR, SVD, Householder) avoid this; preserve angles and norms.

    • Modified Gram–Schmidt (MGS) is more stable than classical Gram–Schmidt because each projection uses the partially orthogonalized vector; full reorthogonalization (CGS2) is more robust still.
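
Idea 4 above (the Gram matrix squares the condition number) is easy to verify numerically; a minimal check on a synthetic ill-conditioned matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rectangular data matrix with singular values spanning three decades
A = rng.standard_normal((200, 20)) @ np.diag(np.logspace(0, -3, 20))
kA = np.linalg.cond(A)
kG = np.linalg.cond(A.T @ A)
print(kG / kA**2)  # ~1.0: kappa(A^T A) = kappa(A)^2 holds exactly
```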

Relevance to ML#

  • Linear regression and least squares: Gram matrix $X^\top X$ can be ill-conditioned if features are correlated; normal equations fail; QR/SVD required.

  • Kernel methods and Gaussian processes: Gram/covariance matrix condition number determines sample complexity and numerical stability of Cholesky.

  • Deep learning and Hessian conditioning: Neural network loss Hessian is often ill-conditioned; affects convergence rate, learning rate selection, and second-order optimization.

  • Regularization and implicit bias: Ill-posed learning problems ($\kappa = \infty$) require regularization (L2, dropout, early stopping); reduces effective $\kappa$.

  • Numerical robustness in production: Conditioning determines whether model training is reproducible, whether predictions are stable to input noise, and whether inference is safe.

Algorithmic development (milestones)#

  • 1947: von Neumann & Goldstine analyze numerical stability of Gaussian elimination; discover amplification by machine epsilon.

  • 1961: Wilkinson develops comprehensive framework for backward error analysis; introduces condition number formally.

  • 1965: Golub & Kahan recognize $\kappa(A^\top A) = \kappa(A)^2$ is catastrophic for normal equations; advocate for QR/SVD.

  • 1971: Peters & Wilkinson analyze stable solution of least-squares problems via LU; this line of work feeds into LINPACK (1979), which implements numerically stable algorithms with careful pivot selection.

  • 1979–1990: Higham develops extended backward error analysis; proves stability of Cholesky, QR, modified Gram–Schmidt.

  • 1992: Arioli et al. develop a posteriori error bounds; can verify computed solution without knowing true solution.

  • 2000s: Preconditioning becomes standard in iterative solvers; algebraic multigrid (AMG) enables linear scaling.

  • 2010s: Automatic differentiation frameworks (TensorFlow, PyTorch) provide implicit solvers with backward stability analysis.

Definitions#

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); $\kappa_\infty(A) = \| A \|_\infty \| A^{-1} \|_\infty$ (operator norm).

  • Machine epsilon: $\varepsilon_{\text{machine}} = 2^{-(p-1)}$ for $p$-bit mantissa; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point number: Normalized form: $\pm m \times 2^e$ with $1 \le m < 2$ (hidden bit), $e \in [-1022, 1023]$.

  • Relative error: $\frac{\| \tilde{x} - x \|}{\| x \|}$ (dimensionless; independent of scaling).

  • Backward error: $\frac{\| \tilde{A} - A \|}{\| A \|}$ for perturbed matrix; small backward error $\Rightarrow$ computed $\tilde{x}$ is exact solution to nearby problem.

  • Forward error: $\frac{\| \tilde{x} - x \|}{\| x \|}$; depends on both backward error and conditioning: $\text{forward error} \approx \kappa(A) \times \text{backward error}$.

  • Ill-conditioned: $\kappa(A) \gg 1$; small perturbations cause large output changes.

  • Well-conditioned: $\kappa(A) \approx 1$; stable to perturbations.
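
A quick check of these definitions in NumPy (the $\kappa_\infty$ definition above uses the $\infty$-norm; the sketch below uses the 2-norm, where the SVD-based and operator-norm definitions coincide):

```python
import numpy as np

# Machine epsilon: 2^-(p-1) with p = 53 significand bits for float64
print(np.finfo(np.float64).eps == 2.0 ** -52)  # True

# SVD-based and operator-norm condition numbers agree in the 2-norm
A = np.random.default_rng(1).standard_normal((30, 30))
s = np.linalg.svd(A, compute_uv=False)
kappa_svd = s[0] / s[-1]
kappa_norm = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
print(np.isclose(kappa_svd, kappa_norm))  # True
```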

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Condition number definition and properties: $\kappa(A) = \sigma_1 / \sigma_n$; $\kappa(AB) \le \kappa(A) \kappa(B)$; $\kappa(A) \ge 1$. Reference: Golub & Van Loan (2013).

  • Perturbation theory and forward error bounds: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \frac{\| \Delta A \|}{\| A \|}$ (first-order). Reference: Wilkinson (1961); Higham (2002).

  • Backward error analysis: Algorithm is backward-stable if $\tilde{x}$ satisfies $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon)$. Reference: Higham (2002).

  • Machine precision and floating-point error: $\varepsilon_{\text{mach}} \approx 2^{-p}$; error per operation $O(\varepsilon)$; cumulative error over $n$ steps $O(n \varepsilon)$. Reference: IEEE 754 (2019); Goldberg (1991).

  • Condition number of Gram matrix: $\kappa(A^\top A) = \kappa(A)^2$; why normal equations are risky. Reference: Golub & Kahan (1965).

  • Preconditioning theory: Transform $A \to M^{-1} A$; goal $\kappa(M^{-1} A) \ll \kappa(A)$. Reference: Axelsson (1994).

  • Tikhonov regularization: Shift eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$; reduce effective condition number. Reference: Tikhonov (1963).
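
The Tikhonov entry has a closed-form effect on conditioning; a short sketch tabulating $\kappa_{\text{reg}}(\lambda) = (\sigma_1^2 + \lambda)/(\sigma_n^2 + \lambda)$ for a synthetic spectrum:

```python
import numpy as np

sigma = np.logspace(0, -4, 50)  # Singular values: kappa(A) = 1e4
kappas = []
for lam in [0.0, 1e-6, 1e-4, 1e-2]:
    kappa_reg = (sigma[0]**2 + lam) / (sigma[-1]**2 + lam)
    kappas.append(kappa_reg)
    print(f"lambda={lam:g}  kappa_reg={kappa_reg:.3g}")
# lambda = 0 gives kappa(A)^2 = 1e8; increasing lambda shrinks it monotonically
```

The choice of $\lambda$ trades bias for conditioning: large enough to lift $\sigma_n^2$, small enough not to distort the dominant directions.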

Applied (landmark systems)#

  • QR factorization for stable least squares: $\kappa(R) = \kappa(A)$ (no squaring); stable even for ill-conditioned $A$. Implementation: LAPACK dgeqrf; scikit-learn default solver. Reference: Golub & Kahan (1965); Golub & Van Loan (2013).

  • SVD for rank-deficient and ill-conditioned systems: Direct pseudoinverse computation; threshold small singular values. Implementation: LAPACK dgesvd; scipy.linalg.svd. Reference: Golub & Kahan (1965).

  • Cholesky with jitter for near-singular covariance: Add $\epsilon I$ if Cholesky fails; trade-off between accuracy and stability. Standard in GPyTorch, Pyro, Stan. Reference: Rasmussen & Williams (2006).

  • Modified Gram–Schmidt (MGS): more stable than classical GS; severely ill-conditioned bases may still need an extra reorthogonalization pass. Reference: Golub & Pereyra (1973); Yarlagadda et al. (1979).

  • Preconditioned conjugate gradient (PCG): L-BFGS (quasi-Newton), ILU preconditioning, algebraic multigrid reduce iteration count by factors 10–100. Standard in scipy.sparse.linalg, PETSc. Reference: Nocedal & Wright (2006); Saad (2003).

  • Ridge regression and elastic net for regularization: Manual $\lambda$ tuning or cross-validation; scikit-learn Ridge, ElasticNet. Reference: Hoerl & Kennard (1970); Zou & Hastie (2005).

Key ideas: Where it shows up
  1. Linear regression and least squares

    • Gram matrix $X^\top X$ condition number determines numerical stability of normal equations.

    • Achievements: QR/SVD solvers in scikit-learn, statsmodels ensure stability; warning when $\kappa > 10^{10}$. References: Hastie et al. (2009); Golub & Kahan (1965).

  2. Kernel methods and Gaussian processes

    • Covariance/kernel matrix $K$ condition number governs Cholesky stability and marginal likelihood evaluation.

    • Achievements: Cholesky with jitter in Stan/PyMC3; sparse GP approximations circumvent $O(n^3)$ and ill-conditioning. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005).

  3. Deep learning and second-order optimization

    • Hessian condition number determines convergence rate of Newton/quasi-Newton methods; affects learning rate and step size.

    • Achievements: L-BFGS (quasi-Newton) in PyTorch/TensorFlow adapts to Hessian conditioning; natural gradient methods rescale by Fisher information. References: Martens & Grosse (2015); Nocedal & Wright (2006).

  4. Regularization and implicit bias

    • Ill-posed learning problems require regularization; L2 penalty (ridge) shifts eigenvalues, reducing effective condition number.

    • Achievements: elastic net (L1+L2) in scikit-learn balances sparsity and conditioning; early stopping in SGD acts as implicit regularization. References: Tibshirani (1996); Zou & Hastie (2005); Zhu et al. (2021).

  5. Numerical robustness and reliability

    • Production ML systems must be stable to input noise, feature preprocessing, and computational precision.

    • Achievements: feature standardization reduces condition number; model monitoring tracks numerical stability (checks for NaN/Inf); validation on adversarial perturbations. References: Goodfellow et al. (2014) (adversarial robustness); LeCun et al. (1998) (feature normalization).

Notation
  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)} = \sqrt{\frac{\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)}}$; for SPD $A$ this reduces to $\lambda_{\max}(A) / \lambda_{\min}(A)$.

  • Machine epsilon: $\varepsilon_{\text{mach}}$ or $\text{eps}$ in code; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point relative error: $\tilde{a} = a(1 + \delta)$ where $|\delta| \le \varepsilon_{\text{mach}}$.

  • Perturbation: $\tilde{A} = A + \Delta A$ with $\| \Delta A \| \le \varepsilon \| A \|$.

  • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$ (computed vs. true solution).

  • Backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$ (residual relative to data norm).

  • Norm: $\| A \|_2 = \sigma_1(A)$ (spectral norm); $\| A \|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius).

  • Example: For $A \in \mathbb{R}^{100 \times 100}$ with $\sigma_1 = 10, \sigma_{100} = 0.01$: $\kappa(A) = 1000$. Relative error in solving $Ax = b$ is amplified by factor $\approx 1000$; if input has error $10^{-15}$, output error $\approx 10^{-12}$.
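
The numeric example above can be reproduced in a few lines (a sketch; the perturbation direction is random, so the observed error sits at or below the worst-case $\kappa \times 10^{-15}$ bound):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.linspace(10, 0.01, n)) @ V.T  # sigma_1 = 10, sigma_100 = 0.01

x = rng.standard_normal(n)
b = A @ x
db = rng.standard_normal(n)
db *= 1e-15 * np.linalg.norm(b) / np.linalg.norm(db)  # Relative perturbation 1e-15

x_pert = np.linalg.solve(A, b + db)
rel_err = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(np.linalg.cond(A))  # ~1000.0 (= 10 / 0.01)
print(rel_err)            # amplified by up to kappa: near the 1e-12 scale
```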

Pitfalls & sanity checks
  • Never ignore condition number: If $\kappa(A) > 10^{8}$, accuracy is compromised; either regularize or use higher precision.

  • Use QR/SVD for least squares, never normal equations (unless you’ve verified conditioning): $\kappa(X^\top X) = \kappa(X)^2$ is catastrophic.

  • Check for catastrophic cancellation: Subtraction of nearly equal numbers loses precision; use orthogonal algorithms instead.

  • Machine precision is absolute limit: Even with perfect algorithm, relative error $\gtrsim \varepsilon_{\text{mach}} \times \kappa(A)$ is unavoidable.

  • Backward error bounds are meaningful: Small backward error $\Rightarrow$ computed solution is exact for nearby problem (reassuring).

  • Regularization parameter selection matters: Cross-validation, L-curve, or discrepancy principle; manual tuning often fails.

  • Feature scaling affects conditioning: Standardize features before regression; poor scaling inflates condition number unnecessarily.

  • Preconditioning amortized over multiple solves: Single solve: preconditioning setup not worth it; multiple solves: huge benefit.
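
The feature-scaling pitfall is cheap to demonstrate (synthetic features on hypothetical scales; standardization restores a benign condition number):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Three independent features on scales spanning nine orders of magnitude
X = np.column_stack([
    rng.standard_normal(n),        # O(1), e.g. an already-scaled feature
    1e6 * rng.standard_normal(n),  # O(1e6), e.g. income in dollars
    1e-3 * rng.standard_normal(n), # O(1e-3), e.g. a small rate
])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
kappa_raw = np.linalg.cond(X)
kappa_std = np.linalg.cond(X_std)
print(kappa_raw)  # ~1e9: dominated by the scale ratio, not by correlation
print(kappa_std)  # ~1: columns are now comparably scaled
```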

References

Foundational backward error analysis

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

  3. IEEE 754. (2019). IEEE Standard for Floating-Point Arithmetic.

Classical numerical linear algebra

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Comprehensive stability analysis

  1. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  2. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix.

  3. Givens, W. (1958). Computation of plane unitary rotations transforming a general matrix to triangular form.

Orthogonal factorizations and reorthogonalization

  1. Golub, G. H., & Pereyra, V. (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.

  2. Yarlagadda, R. K., Hershey, J. E., & Sule, S. M. (1979). Gram–Schmidt orthogonalization with application to order recursive lattice filters.

Iterative methods and preconditioning

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Axelsson, O. (1994). Iterative Solution Methods.

  3. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  4. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  5. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

Ill-posed problems and regularization

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Hansen, P. C., Nagy, J. G., & O’Leary, D. P. (2006). Deblurring Images: Matrices, Spectra, and Filtering.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Applied ML and regularization

  1. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  3. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

  4. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

  5. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Five worked examples

Worked Example 1: Condition number of Gram matrix and instability of normal equations#

Introduction#

Construct a data matrix $X$ with correlated features; compute condition number $\kappa(X)$ and $\kappa(X^\top X)$; solve least squares via normal equations vs. QR to demonstrate error amplification.

Purpose#

Illustrate why normal equations fail for ill-conditioned data; show that $\kappa(X^\top X) = \kappa(X)^2$ causes error to explode.

Importance#

Foundational motivation for using QR/SVD in regression; explains why scikit-learn warns on ill-conditioned matrices.

What this example demonstrates#

  • Construct $X$ with exponentially decaying singular values (high correlation).

  • Compute $\kappa(X)$ and $\kappa(X^\top X) = \kappa(X)^2$.

  • Solve via normal equations: $G = X^\top X$, $w = G^{-1} X^\top y$.

  • Solve via QR factorization: $w$ from $R w = Q^\top y$.

  • Compare residuals and error growth on perturbed data.

Background#

Gram matrix $G = X^\top X$ is square and symmetric but inherits (squared) conditioning from $X$. Gaussian elimination on $G$ amplifies errors by $\kappa(G)^2 = \kappa(X)^4$ in worst case.

Historical context#

Golub & Kahan (1965) proved $\kappa(X^\top X) = \kappa(X)^2$; advocated QR as practical solution.

History#

Recognition of this issue led to LINPACK (1979) and LAPACK (1992) emphasis on QR/SVD.

Prevalence in ML#

Statisticians and practitioners routinely encounter this; feature correlation is common in real data.

Notes#

  • Problem is fundamental to linear algebra, not implementation-specific.

  • Data standardization (mean 0, std 1) helps but doesn’t eliminate correlation.

  • Large-scale regression ($n, d > 10^6$) exacerbates conditioning issues.

Connection to ML#

Ridge regression adds $\lambda I$ to shift eigenvalues; reduces effective condition number. Feature engineering (orthogonalization, PCA preprocessing) also helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ is exact; QR avoids Gram matrix entirely.

Pedagogical Significance#

Concrete demonstration of why theory (backward error analysis) matters in practice.

References#

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

Solution (Python)#

import numpy as np
np.random.seed(20)

# Create ill-conditioned data (correlated features)
n, d = 100, 10
# Generate features with exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, d)  # Singular values spanning 2 decades: kappa(X) = 100
X = U[:, :d] @ np.diag(s)

# True solution and target
w_true = np.random.randn(d)
y = X @ w_true + 0.001 * np.random.randn(n)

# Condition numbers
kappa_X = np.linalg.cond(X)
G = X.T @ X
kappa_G = np.linalg.cond(G)

# Solve via normal equations
w_ne = np.linalg.solve(G, X.T @ y)
residual_ne = np.linalg.norm(X @ w_ne - y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)
residual_qr = np.linalg.norm(X @ w_qr - y)

# Solve via SVD (most stable)
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((U_svd.T @ y) / s_svd)  # Divide by singular values directly
residual_svd = np.linalg.norm(X @ w_svd - y)

print("Condition numbers:")
print("  kappa(X) =", round(kappa_X, 2))
print("  kappa(X^T X) =", round(kappa_G, 2))
print("  kappa(X^T X) / kappa(X)^2 =", round(kappa_G / kappa_X**2, 4), "(should be ~1)")
print("\nResiduals:")
print("  Normal equations:", round(residual_ne, 8))
print("  QR:", round(residual_qr, 8))
print("  SVD:", round(residual_svd, 8))
print("\nSolution errors:")
print("  NE error:", round(np.linalg.norm(w_ne - w_true), 4))
print("  QR error:", round(np.linalg.norm(w_qr - w_true), 4))
print("  SVD error:", round(np.linalg.norm(w_svd - w_true), 4))

Worked Example 2: Backward error analysis and numerical stability of QR vs. normal equations#

Introduction#

Solve the same least squares problem via QR and normal equations; compute backward and forward errors; verify that QR is backward-stable while normal equations magnify error.

Purpose#

Demonstrate backward error analysis in practice; show that backward-stable algorithm (QR) guarantees small forward error even on ill-conditioned problems.

Importance#

Foundation for trusting numerical algorithms; explains why stable algorithms are worth the computational cost.

What this example demonstrates#

  • Solve $Ax = b$ via QR factorization and normal equations.

  • Compute backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$.

  • Compute forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$.

  • Verify $\text{forward error} \approx \kappa(A) \times \text{backward error}$ for both methods.

Background#

Backward-stable algorithm: computed $\tilde{x}$ is exact solution to $(A + \Delta A) \tilde{x} = b$ with small $\Delta A$.

Historical context#

Wilkinson (1961) formalized backward error analysis; revolutionary impact on numerical linear algebra.

History#

Modern algorithms designed to be backward-stable (QR, SVD, Cholesky); guarantees via theorems in Higham (2002).

Prevalence in ML#

Used in numerical software certification; backward error commonly reported in solvers.

Notes#

  • QR backward error: $O(\varepsilon_{\text{mach}} \| A \|)$ (small).

  • Normal equations backward error: $O(\varepsilon_{\text{mach}} \| A \|^2 / \min(\sigma_i))$ (amplified).

Connection to ML#

Backward stability ensures that model parameters are not polluted by rounding error; prediction is numerically sound.

Connection to Linear Algebra Theory#

Backward error analysis decouples algorithm stability from problem conditioning.

Pedagogical Significance#

Shows that algorithm design (not just theory) determines practical accuracy.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np
np.random.seed(21)

# Ill-conditioned system
n = 50
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # kappa ~ 1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
x_true = np.random.randn(n)
b = A @ x_true

# Solve via QR
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Solve via normal equations
G = A.T @ A
x_ne = np.linalg.solve(G, A.T @ b)

# Reference: the known true solution (b was constructed as A @ x_true)
x_ref = x_true

# Backward errors
residual_qr = np.linalg.norm(A @ x_qr - b)
residual_ne = np.linalg.norm(A @ x_ne - b)
backward_error_qr = residual_qr / np.linalg.norm(b)
backward_error_ne = residual_ne / np.linalg.norm(b)

# Forward errors
forward_error_qr = np.linalg.norm(x_qr - x_ref) / np.linalg.norm(x_ref)
forward_error_ne = np.linalg.norm(x_ne - x_ref) / np.linalg.norm(x_ref)

# Condition number
kappa = np.linalg.cond(A)

# Verify: forward_error ≈ kappa × backward_error
pred_forward_qr = kappa * backward_error_qr
pred_forward_ne = kappa * backward_error_ne

print("Condition number kappa(A):", round(kappa, 0))
print("\nQR Factorization:")
print("  Backward error:", format(backward_error_qr, '.2e'))
print("  Observed forward error:", format(forward_error_qr, '.2e'))
print("  Predicted forward error:", format(pred_forward_qr, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_qr / pred_forward_qr, 2))
print("\nNormal Equations:")
print("  Backward error:", format(backward_error_ne, '.2e'))
print("  Observed forward error:", format(forward_error_ne, '.2e'))
print("  Predicted forward error:", format(pred_forward_ne, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_ne / pred_forward_ne, 2))
print("\nConclusion: QR is backward-stable; NE backward error is larger.")

Worked Example 3: Machine precision and loss of significance in subtraction#

Introduction#

Demonstrate catastrophic cancellation when subtracting nearby numbers; show how orthogonal transformations avoid this.

Purpose#

Illustrate fundamental floating-point pitfall; motivate use of orthogonal algorithms (QR, SVD, Householder).

Importance#

Concrete example of why naive implementations fail; builds intuition for numerical stability.

What this example demonstrates#

  • Compute $c = a - b$ where $a \approx b$ (many leading digits cancel).

  • Compare relative error in $c$ vs. in $a, b$ individually.

  • Show that orthogonal transformations (e.g., Householder reflection) preserve norms and avoid subtraction.

Background#

Subtraction of nearly equal numbers is inherently unstable; cannot be fixed by higher precision alone.

Historical context#

Recognized early in computer arithmetic; led to development of Householder reflections (1958) and other orthogonal methods.

History#

Modern best practice: avoid subtraction of near-equal numbers; use transformation-based algorithms (QR, SVD).

Prevalence in ML#

Common issue in numerical gradient computation, vector normalization, and Gram–Schmidt orthogonalization.

Notes#

  • Relative error in $c$ can be $O(1)$ even if $a, b$ are accurate.

  • Classical Gram–Schmidt suffers; modified Gram–Schmidt fixes this by projecting against the running, already-updated vector.
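
The second note can be made concrete with a side-by-side implementation (a minimal sketch; textbook loss-of-orthogonality behavior is $O(\varepsilon \kappa^2)$ for classical GS vs. $O(\varepsilon \kappa)$ for MGS):

```python
import numpy as np

def gram_schmidt(A, modified=False):
    """Orthonormalize the columns of A (classical or modified variant)."""
    n, k = A.shape
    Q = np.zeros((n, k))
    for j in range(k):
        v = A[:, j].copy()
        for i in range(j):
            if modified:
                v -= (Q[:, i] @ v) * Q[:, i]        # Project out the running v
            else:
                v -= (Q[:, i] @ A[:, j]) * Q[:, i]  # Project out the original column
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(4)
# Ill-conditioned 50 x 10 matrix: singular values span nine decades
U, _ = np.linalg.qr(rng.standard_normal((50, 10)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
A = U @ np.diag(np.logspace(0, -9, 10)) @ V.T

errs = {}
for name, flag in [("classical", False), ("modified", True)]:
    Q = gram_schmidt(A, modified=flag)
    errs[name] = np.linalg.norm(Q.T @ Q - np.eye(10))
    print(name, errs[name])
# Classical GS loses orthogonality almost completely; MGS stays near eps*kappa
```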

Connection to ML#

Adversarial robustness and numerical gradient computation are vulnerable to cancellation; careful implementation required.

Connection to Linear Algebra Theory#

Orthogonal transformations (Householder, Givens) have condition number $\kappa = 1$; stable regardless of input.

Pedagogical Significance#

Demonstrates limit of forward-stable algorithms; backward stability (computing exact solution to nearby problem) is more achievable.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

Solution (Python)#

import numpy as np

# Example 1: Catastrophic cancellation
a = 1.0 + 1e-16
b = 1.0
c = a - b
print("Example: Subtraction of nearly equal numbers")
print("  a = 1.0 + 1e-16 =", repr(a))
print("  b = 1.0 =", repr(b))
print("  c = a - b =", repr(c))
print("  Exact difference: 1e-16, Computed:", c)  # a rounds to 1.0, so c = 0.0
print("  Relative error:", abs(c - 1e-16) / 1e-16)  # 1.0, i.e. 100% error

# Example 2: Absorption and loss of significance in summation
print("\nExample: Loss of significance in a sum")
import math
terms = [1e16, 1.0, -1e16]
naive = (terms[0] + terms[1]) + terms[2]  # 1e16 + 1.0 absorbs the 1.0 entirely
exact = math.fsum(terms)                  # Compensated summation recovers it
print("  Naive left-to-right sum:", naive)  # 0.0
print("  math.fsum (compensated):", exact)  # 1.0

# Example 3: Orthogonal transformation vs. subtraction
print("\nExample: Orthogonal transformation avoids cancellation")
# Create orthogonal matrix (preserve norms)
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
v = np.array([1.0, 0.0])
v_rot = Q @ v
print("  ||v|| =", np.linalg.norm(v))
print("  ||Q v|| =", np.linalg.norm(v_rot), "(should equal ||v||)")
print("  Orthogonal transformation preserves norms: stable!")

Worked Example 4: Preconditioning and condition number reduction#

Introduction#

Construct an ill-conditioned SPD system; solve via unpreconditioned CG and preconditioned CG with diagonal preconditioner; measure iteration count reduction.

Purpose#

Show quantitatively how preconditioning reduces condition number and accelerates convergence.

Importance#

Essential technique for large-scale optimization; directly translates to practical speedups.

What this example demonstrates#

  • Construct ill-conditioned SPD matrix (exponentially decaying eigenvalues).

  • Solve via CG (count iterations to convergence).

  • Construct diagonal (Jacobi) preconditioner $M = \text{diag}(A)$.

  • Solve via preconditioned CG (count iterations).

  • Compute effective $\kappa(M^{-1} A)$ and verify iteration count reduction.

Background#

CG iteration count $\approx \sqrt{\kappa(A)}$; preconditioning reduces $\kappa \to \kappa(M^{-1} A)$.

Historical context#

Preconditioning recognized as essential by 1970s; became standard in iterative solvers.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg (all offer preconditioning options).

Prevalence in ML#

L-BFGS uses quasi-Newton preconditioning; preconditioned SGD in modern optimizers.

Notes#

  • Iteration count $\approx \sqrt{\kappa}$; reducing $\kappa$ by 100 reduces iterations by $\sqrt{100} = 10$.

  • Preconditioner setup cost must be amortized over multiple solves.

Connection to ML#

Optimization accelerators: Newton’s method, natural gradient, adaptive methods all precondition the gradient.

Connection to Linear Algebra Theory#

Preconditioning clusters eigenvalues; standard result in iterative methods.

Pedagogical Significance#

Demonstrates power of structure exploitation (knowing $M \approx A$) to improve algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  3. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(23)

# Create an ill-conditioned SPD matrix whose conditioning comes mostly from
# bad row/column scaling -- exactly the regime a Jacobi preconditioner fixes
n = 200
B = np.random.randn(n, n)
B = B @ B.T / n + np.eye(n)      # Well-conditioned SPD core
B = (B + B.T) / 2                # Symmetrize against rounding
d = np.logspace(0, -1.5, n)      # Row/column scales spanning 1.5 decades
A = np.diag(d) @ B @ np.diag(d)  # Scaling inflates kappa(A) by ~1000x
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_un, info_un = spla.cg(A, b, tol=1e-8, callback=callback_un, maxiter=n)

# Preconditioned CG with diagonal (Jacobi) preconditioner
d_inv = 1.0 / np.diag(A)        # M^{-1} = diag(A)^{-1}
M = sp.diags(d_inv)             # scipy's cg applies M as an approximation to A^{-1}
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_pre, info_pre = spla.cg(A, b, M=M, tol=1e-8,
                          callback=callback_pre, maxiter=n)

kappa_A = np.linalg.cond(A)
# kappa(M^{-1} A): M^{-1} A scales the rows of A by d_inv
kappa_AM = np.linalg.cond(d_inv[:, None] * A)

print("Condition numbers:")
print("  kappa(A) =", round(kappa_A, 0))
print("  kappa(M^{-1} A) ≈", round(kappa_AM, 0), "(after diagonal preconditioning)")
print("\nCG Convergence:")
print("  Unpreconditioned iterations:", len(residuals_unprecond))
print("  Preconditioned iterations:", len(residuals_precond))
print("  Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("  Predicted speedup (sqrt of kappa ratio):", round(np.sqrt(kappa_A / kappa_AM), 1))
print("\nBoth solutions satisfy A x = b:",
      np.allclose(A @ x_un, b) and np.allclose(A @ x_pre, b))

Worked Example 5: Regularization and effective condition number#

Introduction#

Construct an ill-posed rectangular system (underdetermined or rank-deficient); solve via Tikhonov regularization with varying $\lambda$; plot condition number reduction and solution norm vs. residual (L-curve).

Purpose#

Demonstrate how regularization shifts small singular values and reduces effective condition number.

Importance#

Core technique in inverse problems, ridge regression, and ill-posed ML problems.

What this example demonstrates#

  • Construct rank-deficient or ill-posed system (exponentially decaying singular values).

  • Solve regularized system $(A^\top A + \lambda I) w = A^\top b$ for range of $\lambda$.

  • Compute effective condition number: $\kappa_\lambda = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$.

  • Plot L-curve: solution norm vs. residual; optimal $\lambda$ at corner.

  • Verify that small $\lambda$ gives large solution, large $\lambda$ gives small solution but large residual.

Background#

Tikhonov adds a quadratic penalty $\lambda \lVert w \rVert^2$; it shifts the eigenvalues of $A^\top A$ from $\sigma_i^2$ to $\sigma_i^2 + \lambda$.
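The shift is easy to verify numerically. The sketch below (sizes, seed, and singular values are arbitrary choices) builds a small matrix with known singular values and compares $\kappa(A^\top A + \lambda I)$ against the closed form $(\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 3
# Matrix with chosen singular values via A = U diag(s) V^T
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.array([1.0, 0.1, 0.001])
A = U @ np.diag(s) @ V.T

lam = 1e-4
G = A.T @ A + lam * np.eye(n)                       # Tikhonov-shifted normal matrix
kappa_formula = (s[0]**2 + lam) / (s[-1]**2 + lam)  # effective condition number
kappa_actual = np.linalg.cond(G)
print(round(kappa_actual, 1), round(kappa_formula, 1))  # the two agree
```

Without the shift, $\kappa(A^\top A) = 10^6$ here; with $\lambda = 10^{-4}$ it drops to about $10^4$.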

Historical context#

Tikhonov (1963) for ill-posed problems; Hansen (1998) developed regularization parameter selection tools.

History#

Standard in inverse problems (deblurring, tomography); ridge regression standard in ML.

Prevalence in ML#

Ridge regression, elastic net, dropout, early stopping all act as implicit regularization.

Notes#

  • Optimal $\lambda$ balances fit (small residual) and solution norm (small solution).

  • L-curve method (Hansen 1992): plot $\log(\| r \|)$ vs. $\log(\| w \|)$; corner indicates optimal $\lambda$.

Connection to ML#

Bias-variance trade-off: underfitting vs. overfitting; regularization parameter selection critical.

Connection to Linear Algebra Theory#

Spectral filtering: Tikhonov filters small singular values; LSQR via early stopping does implicit filtering.

Pedagogical Significance#

Shows how problem structure (ill-posedness) guides algorithm design (regularization).

References#

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Golub, G. H., Hansen, P. C., & O’Leary, D. P. (1999). Tikhonov regularization and total least squares.

Solution (Python)#

import numpy as np
np.random.seed(24)

# Construct ill-posed system
m, n = 100, 40
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T

# True solution and noisy data
x_true = np.zeros(n)
x_true[:5] = [10, 8, 6, 4, 2]
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# Solve Tikhonov for range of lambda
lams = np.logspace(-6, 2, 50)
sol_norms = []
residuals = []
cond_numbers = []

for lam in lams:
    G = A.T @ A + lam * np.eye(n)
    w = np.linalg.solve(G, A.T @ b)
    sol_norms.append(np.linalg.norm(w))
    residuals.append(np.linalg.norm(A @ w - b))
    cond_numbers.append(np.linalg.cond(G))

sol_norms = np.array(sol_norms)
residuals = np.array(residuals)
cond_numbers = np.array(cond_numbers)

# Find the L-curve corner via a crude curvature proxy (not true geometric curvature)
log_res = np.log(residuals)
log_norm = np.log(sol_norms)
curvature = np.gradient(np.gradient(log_res) / np.gradient(log_norm))
opt_idx = np.argmax(curvature)
lam_opt = lams[opt_idx]

# Solve with optimal lambda
G_opt = A.T @ A + lam_opt * np.eye(n)
w_opt = np.linalg.solve(G_opt, A.T @ b)

print("Regularization parameter selection (L-curve method):")
print("  Optimal lambda:", format(lam_opt, '.2e'))
print("  At optimal lambda:")
print("    Solution norm:", round(np.linalg.norm(w_opt), 4))
print("    Residual norm:", round(np.linalg.norm(A @ w_opt - b), 4))
print("    Condition number:", round(np.linalg.cond(G_opt), 2))
print("\nFor comparison:")
print("  Smallest lambda: cond =", round(cond_numbers[0], 0), ", residual =", 
      format(residuals[0], '.2e'))
print("  Largest lambda: cond =", round(cond_numbers[-1], 0), ", residual =", 
      format(residuals[-1], '.2e'))
print("\nTrue solution norm:", round(np.linalg.norm(x_true), 4))
print("Solution error at optimal lambda:", round(np.linalg.norm(w_opt - x_true), 4))

Chapter 13
Solving Systems
Key ideas: Introduction

Introduction#

Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares solves the normal equations $(X^\top X)\, w = X^\top y$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization requires solving preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.

Important ideas#

  1. Gaussian elimination and LU factorization

    • Direct method: $A = LU$ via row operations.

    • Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.

    • Cost: $O(n^3)$ for dense matrices; requires partial pivoting for numerical stability.

  2. Cholesky factorization for SPD systems

    • For symmetric positive definite $A$: $A = L L^\top$ (one triangle).

    • Cost: $\sim n^3/3$ flops (half of LU’s $\sim 2n^3/3$); stable without pivoting.

    • Numerically stable if $A$ is well-conditioned.

  3. QR factorization and least squares

    • $A = QR$ with orthonormal $Q$ and upper triangular $R$.

    • Numerically stable; used for least squares (Chapter 12).

    • Cost: $O(n^2 d)$ for $n \times d$ matrix.

  4. Iterative solvers and conjugate gradient

    • Conjugate gradient (CG): optimal for SPD systems in $n$ iterations (theory); practical convergence in $\ll n$ iterations.

    • GMRES/MINRES: for general/symmetric non-SPD systems.

    • Cost per iteration: $O(nnz(A))$ (number of nonzeros); scales to $n = 10^8+$.

  5. Preconditioning and conditioning

    • Condition number $\kappa(A)$ determines iteration complexity: CG error contracts by a factor $\rho \approx (\sqrt{\kappa} - 1) / (\sqrt{\kappa} + 1)$ per iteration.

    • Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.

    • Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.

  6. Sparse matrix structure

    • Banded, tridiagonal, block-structured systems exploit locality.

    • Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.

    • Fill-in during factorization can destroy sparsity; ordering matters.

  7. Rank deficiency and ill-posedness

    • Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.

    • Ill-conditioned $A$ (nearly rank-deficient) amplifies noise; require stabilization.

    • Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.
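The first four ideas can be exercised on one small system. The sketch below (size and seed are arbitrary) solves the same SPD system by LU, Cholesky, and conjugate gradient, and confirms all three agree:

```python
import numpy as np
import scipy.linalg
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = B.T @ B + n * np.eye(n)          # SPD and well-conditioned by construction
b = rng.standard_normal(n)

# Direct: LU with partial pivoting (what np.linalg.solve does internally)
x_lu = np.linalg.solve(A, b)

# Direct: Cholesky, exploiting symmetry (about half the flops of LU)
L = np.linalg.cholesky(A)
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_chol = scipy.linalg.solve_triangular(L.T, y, lower=False)

# Iterative: conjugate gradient, touching A only through matrix-vector products
x_cg, info = spla.cg(A, b)           # info == 0 signals convergence

print(np.allclose(x_lu, x_chol), np.allclose(x_lu, x_cg, atol=1e-5))
```

On a well-conditioned system all three routes are interchangeable; the differences in cost and stability only bite at scale or near ill-conditioning.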

Relevance to ML#

  • Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.

  • Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.

  • Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.

  • Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.

  • Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.

Algorithmic development (milestones)#

  • 1670s: Newton’s method and early algebraic solutions.

  • 1810: Gaussian elimination formalized (Gauss, Legendre).

  • c. 1910: Cholesky decomposition developed; published posthumously in 1924.

  • 1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.

  • 1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.

  • 1982: LSQR algorithm (Paige–Saunders); numerically stable iterative method for least squares.

  • 1975–1986: MINRES (Paige–Saunders) and GMRES (Saad–Schultz); iterative methods for symmetric indefinite and non-symmetric systems.

  • 1990s–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enables $O(n)$ scaling.

  • 2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.

Definitions#

  • Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.

  • LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.

  • Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.

  • Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.

  • Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.

  • Residual: $r = b - A x$; measure of solution error.

  • Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.

  • Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.

  • Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

  • Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.
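Forward and back-substitution, as defined above, are short enough to implement directly. This sketch (unpivoted elimination on a diagonally dominant matrix, chosen so skipping pivoting is safe; size and seed arbitrary) verifies $A = LU$ and the $O(n^2)$ triangular solves:

```python
import numpy as np

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L in O(n^2) flops."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve U x = y for upper-triangular U in O(n^2) flops."""
    x = np.zeros(len(y))
    for i in range(len(y) - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)) + 2 * n * np.eye(n)  # diagonally dominant
b = rng.standard_normal(n)

# Unpivoted Gaussian elimination: A = L_f @ U_f
U_f = A.copy()
L_f = np.eye(n)
for k in range(n - 1):
    L_f[k + 1:, k] = U_f[k + 1:, k] / U_f[k, k]      # multipliers below pivot
    U_f[k + 1:] -= np.outer(L_f[k + 1:, k], U_f[k])  # eliminate column k

x = back_sub(U_f, forward_sub(L_f, b))
print(np.allclose(L_f @ U_f, A), np.allclose(A @ x, b))
```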

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Gaussian elimination and LU: Forward elimination, row pivoting for stability, $A = LU$ factorization. References: Golub & Van Loan (2013); Trefethen & Bau (1997).

  • Cholesky factorization: $A = L L^\top$ for SPD matrices; numerical stability via diagonal dominance and conditioning. Reference: Golub & Van Loan (2013).

  • Forward/back-substitution: $O(n^2)$ solve for triangular systems; essential subroutine.

  • QR factorization: $A = QR$ with orthonormal $Q$; stable least squares (Chapter 12). Reference: Golub & Van Loan (2013).

  • Conjugate gradient: Minimize $\frac{1}{2} x^\top A x - b^\top x$ on Krylov subspace; optimality in $n$ iterations for SPD. Reference: Hestenes & Stiefel (1952).

  • Condition number and residual analysis: $\kappa(A) = \sigma_1 / \sigma_n$; backward error bounds. Reference: Wilkinson (1961).

  • Preconditioning: Transform $A \to M^{-1} A$ to reduce condition number. Reference: Axelsson (1994).

Applied (landmark systems)#

  • Cholesky solve for Gram matrices: $(X^\top X) w = X^\top y$ via Cholesky; scikit-learn Ridge with solver='cholesky'. Reference: Hastie et al. (2009).

  • CG for large-scale least squares: CGLS (CG on normal equations) for $n, d > 10^6$; scikit-learn SGDRegressor. Reference: Paige & Saunders (1982).

  • Gaussian process inference: Cholesky of $K$ for marginal likelihood; approximate GPs via inducing points reduce $O(n^3)$ to $O(m^3)$. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005); Hensman et al. (2015).

  • Preconditioned optimization: L-BFGS with Hessian approximation; widely used in TensorFlow/PyTorch. Reference: Nocedal & Wright (2006).

  • Graph Laplacian solvers: Fast Poisson equation solvers on mesh/graphs; ChebNet (Defferrard et al. 2016); enables scalable GNNs. Reference: Defferrard et al. (2016).

  • Inverse problems: LSQR iterative method; clinical deployment in medical imaging. Reference: Vogel (2002); Bardsley et al. (2012).

Key ideas: Where it shows up
  1. Least squares and kernel ridge regression

    • Solve $A = X^\top X$ (Gram matrix) or $A = K + \lambda I$ (kernel matrix).

    • Achievements: scikit-learn, PyTorch linear solvers; KRR standard in Gaussian processes. References: Rasmussen & Williams (2006); Scholkopf & Smola (2002).

  2. Gaussian processes and covariance systems

    • Solve $K x = y$ where $K$ is $n \times n$ covariance matrix.

    • Achievements: Cholesky solve in Stan, PyMC3; approximate inference via sparse GPs (inducing points). References: Quinonero-Candela & Rasmussen (2005); Snelson & Ghahramani (2005).

  3. Optimization and preconditioning

    • Preconditioned gradient descent: $x_{t+1} = x_t - \alpha M^{-1} \nabla f(x_t)$.

    • Achievements: L-BFGS preconditioner reduces iteration count by factor of $10$-$100$; quasi-Newton methods. References: Nocedal & Wright (2006); Martens & Grosse (2015).

  4. Graph neural networks and sparse convolutions

    • Solve graph Laplacian systems $L x = b$ for diffusion, smoothing, attention mechanisms.

    • Achievements: GraphSAGE, GCN via approximate polynomial filters; scalable to graphs with $10^9$ nodes. References: Defferrard et al. (2016) (ChebNet); Kipf & Welling (2017) (GCN).

  5. Inverse problems and regularized imaging

    • Solve Tikhonov system $(A^\top A + \lambda I) x = A^\top b$ for ill-posed deconvolution, tomography, parameter estimation.

    • Achievements: iterative methods in medical imaging (CGLS for CT/MRI); LSQR in seismic inversion. References: Hansen (1998); Vogel (2002); Bardsley et al. (2012).

Notation
  • Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}$ (or $\mathbb{R}^{m \times n}$ overdetermined).

  • LU factorization: $A = L U$ with $L \in \mathbb{R}^{n \times n}$ lower triangular (unit diagonal), $U \in \mathbb{R}^{n \times n}$ upper triangular.

  • Cholesky: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite; $L \in \mathbb{R}^{n \times n}$ lower triangular.

  • QR factorization: $A = QR$ with $Q \in \mathbb{R}^{m \times n}$ orthonormal, $R \in \mathbb{R}^{n \times n}$ upper triangular.

  • Residual: $r = b - A x \in \mathbb{R}^n$; goal is $\lVert r \rVert \ll \lVert b \rVert$.

  • Conjugate gradient: $x_k$ minimizes $\frac{1}{2} x^\top A x - b^\top x$ over $k$-dimensional Krylov subspace $\text{span}(r_0, A r_0, \ldots, A^{k-1} r_0)$.

  • Preconditioner: $M \approx A$ (cheap to invert); solve $M^{-1} A x = M^{-1} b$.

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); measures sensitivity to perturbation.

  • Example: $A = \begin{pmatrix} 10 & 1 \\ 1 & 1 \end{pmatrix}$ has $\sigma_1 \approx 10.11, \sigma_2 \approx 0.89$, so $\kappa(A) \approx 11.4$ (mildly ill-conditioned).
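The condition number in this example can be checked numerically in a couple of lines:

```python
import numpy as np

A = np.array([[10.0, 1.0], [1.0, 1.0]])
sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending
kappa = sigma[0] / sigma[-1]
print(np.round(sigma, 4), round(kappa, 2))  # sigma ≈ [10.1098, 0.8902], kappa ≈ 11.36
```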

Pitfalls & sanity checks
  • Never solve normal equations $A^\top A x = A^\top b$ directly for ill-conditioned $A$: Use QR, SVD, or iterative methods (LSQR) instead; $\kappa(A^\top A) = \kappa(A)^2$.

  • Verify SPD before Cholesky: Non-SPD matrices cause NaN/Inf; test via eigenvalues or try-catch.

  • Check residual convergence: $\lVert A x - b \rVert$ should decrease monotonically in iterative solvers; stagnation signals bad conditioning or preconditioner failure.

  • Preconditioning setup cost: If solve is run once, setup overhead may exceed savings; only cost-effective for multiple solves with same $A$.

  • Fill-in in sparse LU: Sparse matrix factorization can become dense; use reordering (minimum degree, nested dissection) to minimize fill-in.

  • Early stopping in LSQR: Stop iteration based on residual norm or discrepancy principle; continuing to convergence amplifies noise.

  • Regularization parameter selection: Cross-validation, L-curve, or discrepancy principle; do not fit $\lambda$ on training data.

  • Scaling sensitivity: Ill-scaled systems (different row/column magnitudes) become ill-conditioned; normalize before solving.
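The first pitfall is easy to see numerically. This sketch (arbitrary sizes and seed) builds a matrix with $\kappa(A) = 10^4$ and confirms that forming $A^\top A$ squares it:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 8
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -4, n)               # singular values from 1 down to 1e-4
A = U @ np.diag(s) @ V.T

kappa_A = np.linalg.cond(A)             # ~1e4
kappa_normal = np.linalg.cond(A.T @ A)  # ~1e8: the condition number is squared
print(f"{kappa_A:.2e}  {kappa_normal:.2e}")
```

At $\kappa(A) \approx 10^8$ the normal equations are already near the limit of double precision; QR or LSQR on $A$ itself avoids the squaring.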

References

Foundational algorithms

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.

  2. Cholesky, A. L. (1910). Sur la résolution numérique des systèmes d’équations linéaires.

Classical theory and numerical stability

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  3. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Iterative methods and conjugate gradient

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.

  3. Saad, Y., & Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.

  4. Axelsson, O. (1994). Iterative Solution Methods.

Preconditioning and multigrid

  1. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

Applied: Machine learning and inverse problems

  1. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

  5. Quinonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression.

  6. Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs.

  7. Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering.

  8. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  9. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  10. Hensman, J., Matthews, A. G. D. E., & Ghahramani, Z. (2015). Scalable variational Gaussian process classification.

  11. Scholkopf, B., & Smola, A. J. (2002). Learning with Kernels.

  12. Bardsley, J. M., Chung, J., & Palmer, K. (2012). Regularization parameter selection methods for linear least squares problems.

Five worked examples

Worked Example 1: Gaussian elimination, LU factorization, and pivoting#

Introduction#

Solve a system $A x = b$ via LU factorization with partial pivoting; verify stability and reconstruction.

Purpose#

Illustrate direct method for dense systems; compare unpivoted LU (numerically risky) to pivoted LU (numerically stable).

Importance#

LU is the foundation of dense linear algebra; pivoting is critical for preventing small pivots from amplifying rounding error.

What this example demonstrates#

  • Compute LU factorization via Gaussian elimination (with and without pivoting).

  • Verify $A = L U$ and solve via forward/back-substitution.

  • Compute condition number and demonstrate error growth without pivoting on ill-conditioned matrix.

Background#

Partial pivoting (swapping rows to maximize pivot) prevents small pivots from amplifying errors during elimination.
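A two-by-two system makes the point. This sketch runs the classic small-pivot example through unpivoted elimination by hand and compares against a pivoted solve (the true solution is approximately $[1, 1]$):

```python
import numpy as np

# Classic example: a tiny pivot wrecks unpivoted elimination
eps = 1e-17                               # far below machine precision relative to 1
A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])

# Unpivoted elimination: the multiplier 1/eps swamps row 2 in roundoff
m21 = A[1, 0] / A[0, 0]
U = np.array([[eps, 1.0], [0.0, A[1, 1] - m21 * A[0, 1]]])
c = np.array([b[0], b[1] - m21 * b[0]])
x2 = c[1] / U[1, 1]
x1 = (c[0] - U[0, 1] * x2) / U[0, 0]
x_unpivoted = np.array([x1, x2])          # first component is completely wrong

# With partial pivoting (np.linalg.solve): accurate answer
x_pivoted = np.linalg.solve(A, b)

print(x_unpivoted, x_pivoted)
```

Swapping the rows first makes the pivot $1$ instead of $10^{-17}$, and the same elimination steps then give the correct answer.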

Historical context#

Numerical instability of unpivoted LU recognized mid-20th century; pivoting became standard.

History#

Modern linear algebra libraries (LAPACK) default to pivoted LU.

Prevalence in ML#

Used in scikit-learn’s least squares and in matrix factorization algorithms.

Notes#

  • Pivoting is automatic in most libraries; rarely explicit in practice.

  • For sparse matrices, fill-in during elimination can destroy sparsity; sparse LU uses ordering strategies (minimum degree, nested dissection).

Connection to ML#

Condition number of $A$ determines solver reliability; poor conditioning necessitates regularization.

Connection to Linear Algebra Theory#

$\kappa(A) = \sigma_1 / \sigma_n$ predicts error magnification; LU error $\sim \kappa(A) \times \text{machine eps}$.

Pedagogical Significance#

Concrete demonstration of numerical stability and pivoting strategy.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.linalg
np.random.seed(10)

# Create moderately ill-conditioned system
n = 5
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, n)
A = U @ np.diag(s) @ U.T
b = np.random.randn(n)

# Compute LU with partial pivoting (scipy); convention: A = P @ L @ U
P, L, U_lu = scipy.linalg.lu(A)
kappa_A = np.linalg.cond(A)

# Verify A = P L U
A_recon = P @ L @ U_lu
recon_error = np.linalg.norm(A - A_recon)

# Solve via LU: forward sub (L y = P^T b), back sub (U x = y)
y = scipy.linalg.solve_triangular(L, P.T @ b, lower=True)
x_lu = scipy.linalg.solve_triangular(U_lu, y, lower=False)

# Solve directly (for comparison)
x_direct = np.linalg.solve(A, b)

print("Condition number kappa(A):", round(kappa_A, 2))
print("LU reconstruction error:", round(recon_error, 8))
print("LU solution matches direct:", np.allclose(x_lu, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_lu - b), 8))

Worked Example 2: Cholesky factorization for symmetric positive definite systems#

Introduction#

Solve a symmetric positive definite (SPD) system via Cholesky factorization; verify stability and computational savings.

Purpose#

Show how SPD structure reduces computation by half and improves stability vs. general LU.

Importance#

Cholesky is standard for covariance matrices, Gram matrices, and Hessians in optimization.

What this example demonstrates#

  • Construct SPD matrix (e.g., covariance or Gram).

  • Compute Cholesky $A = L L^\top$ and verify reconstruction.

  • Solve via forward/back-substitution on $L$.

    • Compare to general LU: $\sim n^3/3$ flops vs. $\sim 2n^3/3$ for LU.

Background#

For SPD $A$, Cholesky is more stable than LU (no pivoting needed) and faster (half the operations).

Historical context#

Cholesky rediscovered in early 1900s; became standard method for SPD systems by mid-20th century.

History#

LAPACK dpotrf is gold-standard Cholesky implementation.

Prevalence in ML#

Used in kernel ridge regression, Gaussian process inference, and proximal methods.

Notes#

  • Fails gracefully if $A$ is not SPD; returns error (no negative square roots).

  • Numerical breakdown signals ill-conditioning; add small jitter ($A + \epsilon I$) if needed.
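The jitter trick from the notes can be sketched as follows; the matrix construction, shift sizes, and seed are arbitrary choices made so that the first factorization is guaranteed to fail:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.standard_normal((n - 1, n))
A = B.T @ B - 1e-6 * np.eye(n)     # symmetric but slightly indefinite: Cholesky must fail
try:
    np.linalg.cholesky(A)
    failed = False
except np.linalg.LinAlgError:
    failed = True

jitter = 1e-4                      # small diagonal shift restores positive definiteness
L = np.linalg.cholesky(A + jitter * np.eye(n))
print(failed, np.allclose(L @ L.T, A + jitter * np.eye(n)))
```

In practice the jitter should be small relative to $\mathrm{diag}(A)$ so the perturbed system stays close to the original.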

Connection to ML#

Covariance and Gram matrices are always SPD; Cholesky enables efficient sampling and likelihood computation in probabilistic models.

Connection to Linear Algebra Theory#

Cholesky exists iff all leading principal minors are positive (Sylvester criterion).

Pedagogical Significance#

Demonstrates how structure (symmetry, PSD) enables algorithmic optimization.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

Solution (Python)#

import numpy as np
import scipy.linalg
np.random.seed(11)

# Create SPD matrix (covariance-like)
n = 5
A_temp = np.random.randn(n, n)
A = A_temp.T @ A_temp + np.eye(n)  # SPD by construction
b = np.random.randn(n)

# Cholesky factorization
L = np.linalg.cholesky(A)

# Verify A = L L^T
A_recon = L @ L.T
recon_error = np.linalg.norm(A - A_recon)

# Solve via forward/back-substitution
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_chol = scipy.linalg.solve_triangular(L.T, y, lower=False)

# Direct solve for comparison
x_direct = np.linalg.solve(A, b)

# Condition number
kappa_A = np.linalg.cond(A)

print("Condition number kappa(A):", round(kappa_A, 2))
print("Cholesky reconstruction error:", round(recon_error, 8))
print("Cholesky solution matches direct:", np.allclose(x_chol, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_chol - b), 8))
print("Lower triangular L diagonal:", np.round(np.diag(L), 4))

Worked Example 3: Conjugate gradient for large sparse systems#

Introduction#

Solve a large sparse SPD system via conjugate gradient (CG); demonstrate convergence and compare to direct Cholesky.

Purpose#

Show how iterative methods scale to large systems while respecting sparsity.

Importance#

CG is the standard iterative solver for large-scale ML and scientific computing.

What this example demonstrates#

  • Construct sparse SPD system (e.g., graph Laplacian or 2D Poisson discretization).

  • Solve via CG and via direct Cholesky.

  • Plot convergence: residual norm vs. iteration.

  • Measure wall-clock time and memory for dense vs. sparse approaches.

Background#

CG finds solution in at most $n$ iterations (theory); practical convergence in $\ll n$ iterations for well-conditioned systems.
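For reference, the entire CG iteration fits in a dozen lines. This is a textbook-style sketch (not the scipy implementation), using only matrix-vector products with $A$; size, seed, and tolerance are arbitrary:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=None):
    """Textbook CG for SPD A; touches A only via matrix-vector products."""
    n = len(b)
    maxiter = maxiter or n
    x = np.zeros(n)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs = r @ r
    for k in range(maxiter):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k + 1
        p = r + (rs_new / rs) * p      # A-conjugate update of the direction
        rs = rs_new
    return x, maxiter

rng = np.random.default_rng(4)
n = 80
B = rng.standard_normal((n, n))
A = B.T @ B + n * np.eye(n)            # SPD with modest condition number
b = rng.standard_normal(n)
x, iters = conjugate_gradient(A, b)
print(iters, np.linalg.norm(A @ x - b))  # converges in far fewer than n iterations
```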

Historical context#

Hestenes & Stiefel (1952); rediscovered in 1970s as demand for large-scale methods grew.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg.

Prevalence in ML#

Used in large-scale kernel methods, graph neural networks, and optimization.

Notes#

  • Convergence rate depends on condition number $\kappa(A)$; preconditioning improves rate dramatically.

  • Requires only matrix-vector products $A v$; no explicit $A$ storage needed.

Connection to ML#

Enables Gaussian process inference on large datasets; scales kernel methods from $10^4$ to $10^6$ points.

Connection to Linear Algebra Theory#

CG minimizes $\frac{1}{2} x^\top A x - b^\top x$ over Krylov subspace; conjugate directions ensure optimality.

Pedagogical Significance#

Demonstrates trade-off between direct (dense, $O(n^3)$, low iteration) and iterative (sparse-compatible, $O(n)$ per iteration, many iterations).

References#

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Golub & Kahan (1965). Calculating singular values and pseudo-inverses.

  3. Nocedal & Wright (2006). Numerical Optimization.

Solution (Python)#

import numpy as np
import scipy.linalg
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(12)

# Construct sparse SPD system (2D Laplacian discretization)
n_side = 20
n = n_side ** 2
# Discrete Laplacian on 2D grid (pentadiagonal structure)
I = np.arange(n)
J = np.arange(n)
V = 4 * np.ones(n)
# Neighbors one grid row apart (offset n_side)
I_h = np.arange(n - n_side)
J_h = I_h + n_side
V_h = -np.ones(len(I_h))
# Neighbors within a grid row (offset 1, skipping row boundaries)
I_v = np.arange(n - 1)
I_v = I_v[I_v % n_side != n_side - 1]
J_v = I_v + 1
V_v = -np.ones(len(I_v))

# Assemble symmetrically: each off-diagonal entry together with its transpose
rows = np.concatenate([I, I_h, J_h, I_v, J_v])
cols = np.concatenate([J, J_h, I_h, J_v, I_v])
vals = np.concatenate([V, V_h, V_h, V_v, V_v])
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
b = np.random.randn(n)

# CG solution (count iterations via callback; cg's info flag is 0 on success,
# not an iteration count)
iters = [0]
def count_iter(x):
    iters[0] += 1

def residual_norm(x):
    return np.linalg.norm(A @ x - b)

x_cg, info = spla.cg(A, b, tol=1e-6, maxiter=n, callback=count_iter)

# Direct Cholesky on dense
A_dense = A.toarray()
L = np.linalg.cholesky(A_dense)
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_direct = scipy.linalg.solve_triangular(L.T, y, lower=False)

print("Problem size (n):", n)
print("Sparsity: {:.1f}%".format(100 * A.nnz / (n ** 2)))
print("CG converged in:", iters[0], "iterations")
print("CG residual:", round(residual_norm(x_cg), 8))
print("Direct residual:", round(residual_norm(x_direct), 8))
print("Solutions match:", np.allclose(x_cg, x_direct, atol=1e-5))

Worked Example 4: Preconditioning and conditioning number#

Introduction#

Solve an ill-conditioned SPD system with and without preconditioning; demonstrate acceleration and residual reduction.

Purpose#

Show how preconditioning reduces effective condition number and dramatically accelerates convergence.

Importance#

Preconditioning is essential for large-scale optimization and inverse problems.

What this example demonstrates#

  • Construct ill-conditioned SPD system (exponentially decaying eigenvalues).

  • Solve via unpreconditioned CG and via diagonal (Jacobi) preconditioner.

  • Plot convergence: residual vs. iteration for both.

  • Verify that CG on $M^{-1} A$ has better convergence rate.

Background#

Convergence rate $\rho \approx \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per iteration; preconditioning reduces $\kappa(M^{-1} A)$.
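The rate formula converts directly into predicted iteration counts. This sketch evaluates the standard CG error bound $2\rho^k \le \text{tol}$ for a few illustrative $\kappa$ values (the target tolerance is an arbitrary choice):

```python
import numpy as np

def predicted_iters(kappa, tol=1e-8):
    """Smallest k with 2*rho^k <= tol, where rho = (sqrt(kappa)-1)/(sqrt(kappa)+1)."""
    rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    return int(np.ceil(np.log(tol / 2) / np.log(rho)))

for kappa in [10, 100, 1000, 10000]:
    print(kappa, predicted_iters(kappa))
# Iteration count grows roughly like sqrt(kappa):
# a 100x larger kappa costs about 10x more iterations.
```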

Historical context#

Preconditioning recognized as essential in 1970s–1980s for practical large-scale solvers.

History#

MINRES-QLP, GMRES with ILU preconditioning (Saad 1993); algebraic multigrid (Ruge–Stüben 1987).

Prevalence in ML#

Preconditioned gradients, L-BFGS with Hessian approximation, trust-region methods.

Notes#

  • Preconditioning trades setup cost (computing and applying $M^{-1}$) for a reduced iteration count.

  • Diagonal (Jacobi) preconditioner: $M = \text{diag}(A)$; trivial but often effective.

Connection to ML#

Preconditioned gradient descent achieves faster convergence; L-BFGS implicitly preconditions via quasi-Newton approximation.

Connection to Linear Algebra Theory#

CG convergence depends on spectrum of $A$; preconditioning clusters eigenvalues of $M^{-1} A$ to reduce effective $\kappa$.

Pedagogical Significance#

Concrete demonstration of conditioning’s impact on algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal & Wright (2006). Numerical Optimization.

  3. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(13)

# Create ill-conditioned SPD matrix
n = 100
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # Condition number ~1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_unprecond(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_unprecond, _ = spla.cg(A, b, tol=1e-8, callback=callback_unprecond)

# Preconditioned CG (diagonal/Jacobi preconditioner)
d_inv = 1.0 / np.diag(A)     # M^{-1} = diag(A)^{-1}
M = sp.diags(d_inv)          # scipy's cg applies M as an approximation to A^{-1}
residuals_precond = []
def callback_precond(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_precond, _ = spla.cg(A, b, M=M, tol=1e-8, callback=callback_precond)

kappa_A = np.linalg.cond(A)
print("Condition number kappa(A):", round(kappa_A, 0))
print("Unpreconditioned CG iterations:", len(residuals_unprecond))
print("Preconditioned CG iterations:", len(residuals_precond))
print("Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("Both solutions satisfy A x = b:", 
      np.allclose(A @ x_unprecond, b) and np.allclose(A @ x_precond, b))

Worked Example 5: LSQR for ill-posed inverse problems#

Introduction#

Solve an ill-posed least squares problem $A x \approx b$ via LSQR (iterative least squares QR); compare to direct pseudoinverse and Tikhonov regularization.

Purpose#

Demonstrate how LSQR stabilizes solutions to ill-posed systems without explicit regularization parameter tuning.

Importance#

LSQR is standard in medical imaging, geophysical inversion, and inverse problems.

What this example demonstrates#

  • Construct ill-posed rectangular system (tall, decaying singular values).

  • Solve via LSQR (iterative, early stopping acts as regularization).

  • Compare to pseudoinverse (noisy) and Tikhonov (requires $\lambda$ selection).

  • Plot regularization curves (solution norm vs. residual).

Background#

LSQR is mathematically equivalent to conjugate gradients applied to the normal equations $A^\top A x = A^\top b$, but it is built on Golub–Kahan bidiagonalization and is more stable numerically; stopping before full convergence avoids amplifying noise in the small singular directions.
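
The equivalence can be sketched with CGLS, which applies CG to the normal equations implicitly, using only products with $A$ and $A^\top$ (a minimal teaching sketch, not the production LSQR algorithm):

```python
import numpy as np

def cgls(A, b, iters, tol=1e-12):
    """CG on A.T @ A x = A.T @ b without ever forming A.T @ A."""
    x = np.zeros(A.shape[1])
    r = b.astype(float)        # residual in data space: b - A x
    s = A.T @ r                # residual of the normal equations
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        if np.sqrt(gamma) <= tol:   # normal-equations residual small enough
            break
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 5))
b = rng.standard_normal(12)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(cgls(A, b, 25), x_ref, atol=1e-8))
```

On a well-conditioned problem like this one, CGLS reaches the least-squares solution; on ill-posed problems the same iteration, stopped early, is where the regularization effect discussed below comes from.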

Historical context#

Paige & Saunders (1982); foundational algorithm for inverse problems.

History#

CGLS is the classical CG-on-normal-equations variant; LSQR (and the related LSMR) are available in scipy.sparse.linalg and are widely used in medical imaging.

Prevalence in ML#

Used in sparse coding, compressed sensing, and large-scale least squares.

Notes#

  • Early stopping acts as regularization; the iteration count plays the role of the regularization parameter, so no explicit $\lambda$ must be tuned.

  • Discrepancy principle: stop when residual reaches expected noise level.
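
A minimal sketch of the discrepancy principle on a small dense stand-in problem (the safety factor `tau = 1.1` is a common but arbitrary choice; re-running LSQR from scratch for each iteration count is wasteful and only done here for clarity):

```python
import numpy as np
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
m, n = 80, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = rng.standard_normal(n)
sigma = 0.05                               # assumed known noise std
b = A @ x_true + sigma * rng.standard_normal(m)

# Discrepancy principle: stop once the residual norm drops to the
# expected noise level tau * sigma * sqrt(m), with tau slightly > 1.
tau = 1.1
target = tau * sigma * np.sqrt(m)
for k in range(1, 100):
    x_k = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)[0]
    if np.linalg.norm(A @ x_k - b) <= target:
        break
print("stopped at iteration", k)
```

In a real solver the residual is monitored inside the iteration rather than by restarting; the stopping index chosen this way tracks the noise level instead of a hand-tuned $\lambda$.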

Connection to ML#

Implicit regularization; enables robust solutions without hyperparameter tuning.

Connection to Linear Algebra Theory#

LSQR on $\min_x \lVert A x - b \rVert^2$ without explicit pseudoinverse; early stopping filters small singular values.
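
The filtering interpretation can be made explicit with truncated SVD, which applies 0/1 filter factors directly: keep the $k$ largest singular triplets, drop the rest (small synthetic problem; `k = 10` is an illustrative cutoff, not a tuned value):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 60, 30
G = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(G, full_matrices=False)
s = s * np.exp(-np.linspace(0, 10, n))   # force rapid singular-value decay
A = (U * s) @ Vt
x_true = np.ones(n)
b = A @ x_true + 0.01 * rng.standard_normal(m)

# Full pseudoinverse: divides by every singular value, amplifying noise.
x_pinv = Vt.T @ ((U.T @ b) / s)
# Truncated SVD: keep only the k dominant triplets (filter factors 1, rest 0).
k = 10
x_tsvd = Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])
print("pinv error:", np.linalg.norm(x_pinv - x_true))
print("tsvd error:", np.linalg.norm(x_tsvd - x_true))
```

Early-stopped LSQR produces smoother filter factors than this hard 0/1 cutoff, but the effect is the same: contributions from small singular values are suppressed before the noise they carry dominates.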

Pedagogical Significance#

Demonstrates how algorithm structure (iteration count) can act as regularization.

References#

  1. Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(14)

# Construct ill-posed rectangular system
m, n = 120, 50
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:, :n] @ np.diag(s) @ V.T
x_true = np.zeros(n)
x_true[:8] = np.sin(np.linspace(0, 2*np.pi, 8))
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# LSQR with early stopping (implicit regularization)
residuals_lsqr = []
x_lsqr_early = None
for k in [5, 10, 20, 50]:
    x_lsqr = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)[0]  # lsqr returns a 10-tuple; the solution is element 0
    residuals_lsqr.append(np.linalg.norm(A @ x_lsqr - b))
    if k == 20:
        x_lsqr_early = x_lsqr

# Pseudoinverse solution (LSQR run to convergence: no regularization, noise-amplified)
x_pinv = spla.lsqr(A, b, atol=0, btol=0, iter_lim=1000)[0]

# Tikhonov with manual lambda (requires tuning)
lam = 0.01
G = A.T @ A + lam * np.eye(n)
x_tikhonov = np.linalg.solve(G, A.T @ b)

print("True solution norm:", round(np.linalg.norm(x_true), 4))
print("LSQR (early stop k=20) norm:", round(np.linalg.norm(x_lsqr_early), 4))
print("LSQR (early stop) error:", round(np.linalg.norm(x_lsqr_early - x_true), 4))
print("Pseudoinverse error:", round(np.linalg.norm(x_pinv - x_true), 4))
print("Tikhonov error:", round(np.linalg.norm(x_tikhonov - x_true), 4))
print("LSQR residual at k=20:", round(residuals_lsqr[2], 6))
