ex1.ai

Solving Systems

Chapter 13

Solving Systems

Key ideas: Introduction

Introduction#

Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares solves $A = X^\top X$ for $x = w$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization requires solving preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.

Important ideas#

Gaussian elimination and LU factorization
- Direct method: $A = LU$ via row operations.
- Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.
- Cost: $O(n^3)$ for dense; prone to numerical error (pivot instability).
Cholesky factorization for SPD systems
- For symmetric positive definite $A$: $A = L L^\top$ (one triangle).
- Cost: $O(n^3 / 3)$ (half LU cost); more stable than LU.
- Numerically stable if $A$ is well-conditioned.
QR factorization and least squares
- $A = QR$ with orthonormal $Q$ and upper triangular $R$.
- Numerically stable; used for least squares (Chapter 12).
- Cost: $O(n^2 d)$ for $n \times d$ matrix.
Iterative solvers and conjugate gradient
- Conjugate gradient (CG): optimal for SPD systems in $n$ iterations (theory); practical convergence in $\ll n$ iterations.
- GMRES/MINRES: for general/symmetric non-SPD systems.
- Cost per iteration: $O(nnz(A))$ (number of nonzeros); scales to $n = 10^8+$.
Preconditioning and conditioning
- Condition number $\kappa(A)$ determines iteration complexity: residual reduces by factor $\rho \approx (\kappa - 1) / (\kappa + 1)$ per iteration.
- Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.
- Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.
Sparse matrix structure
- Banded, tridiagonal, block-structured systems exploit locality.
- Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.
- Fill-in during factorization can destroy sparsity; ordering matters.
Rank deficiency and ill-posedness
- Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.
- Ill-conditioned $A$ (nearly rank-deficient) amplifies noise; require stabilization.
- Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.

Relevance to ML#

Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.
Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.
Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.
Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.
Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.

Algorithmic development (milestones)#

1670s: Newton’s method and early algebraic solutions.
1810: Gaussian elimination formalized (Gauss, Legendre).
1875: Cholesky decomposition developed (rediscovered 1910).
1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.
1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.
1971: LSQR algorithm (Paige–Saunders); numerically stable for least squares.
1986: GMRES and MINRES (Saad–Schultz); iterative methods for non-symmetric systems.
1990s–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enables $O(n)$ scaling.
2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.

Definitions#

Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.
LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.
Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.
Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.
Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.
Residual: $r = b - A x$; measure of solution error.
Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.
Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.
Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).
Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

Gaussian elimination and LU: Forward elimination, row pivoting for stability, $A = LU$ factorization. References: Golub & Van Loan (2013); Trefethen & Bau (1997).
Cholesky factorization: $A = L L^\top$ for SPD matrices; numerical stability via diagonal dominance and conditioning. Reference: Golub & Van Loan (2013).
Forward/back-substitution: $O(n^2)$ solve for triangular systems; essential subroutine.
QR factorization: $A = QR$ with orthonormal $Q$; stable least squares (Chapter 12). Reference: Golub & Kahan (1965).
Conjugate gradient: Minimize $\frac{1}{2} x^\top A x - b^\top x$ on Krylov subspace; optimality in $n$ iterations for SPD. Reference: Hestenes & Stiefel (1952).
Condition number and residual analysis: $\kappa(A) = \sigma_1 / \sigma_n$; backward error bounds. Reference: Wilkinson (1961).
Preconditioning: Transform $A \to M^{-1} A$ to reduce condition number. Reference: Axelsson (1994).

Applied (landmark systems)#

Cholesky solve for Gram matrices: $(X^\top X) w = X^\top y$ via Cholesky; scikit-learn LinearRegression with solver='cholesky'. Reference: Hastie et al. (2009).
CG for large-scale least squares: CGLS (CG on normal equations) for $n, d > 10^6$; scikit-learn SGDRegressor. Reference: Paige & Saunders (1982).
Gaussian process inference: Cholesky of $K$ for marginal likelihood; approximate GPs via inducing points reduce $O(n^3)$ to $O(m^3)$. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005); Hensman et al. (2015).
Preconditioned optimization: L-BFGS with Hessian approximation; widely used in TensorFlow/PyTorch. Reference: Nocedal & Wright (2006).
Graph Laplacian solvers: Fast Poisson equation solvers on mesh/graphs; ChebNet (Defferrard et al. 2016); enables scalable GNNs. Reference: Defferrard et al. (2016).
Inverse problems: LSQR iterative method; clinical deployment in medical imaging. Reference: Vogel (2002); Bardsley et al. (2012).

Key ideas: Where it shows up

Least squares and kernel ridge regression
- Solve $A = X^\top X$ (Gram matrix) or $A = K + \lambda I$ (kernel matrix).
- Achievements: scikit-learn, PyTorch linear solvers; KRR standard in Gaussian processes. References: Rasmussen & Williams (2006); Scholkopf & Smola (2002).
Gaussian processes and covariance systems
- Solve $K x = y$ where $K$ is $n \times n$ covariance matrix.
- Achievements: Cholesky solve in Stan, PyMC3; approximate inference via sparse GPs (inducing points). References: Quinonero-Candela & Rasmussen (2005); Snelson & Ghahramani (2005).
Optimization and preconditioning
- Preconditioned gradient descent: $x_{t+1} = x_t - \alpha M^{-1} \nabla f(x_t)$.
- Achievements: L-BFGS preconditioner reduces iteration count by factor of $10$-$100$; quasi-Newton methods. References: Nocedal & Wright (2006); Martens & Grosse (2015).
Graph neural networks and sparse convolutions
- Solve graph Laplacian systems $L x = b$ for diffusion, smoothing, attention mechanisms.
- Achievements: GraphSAGE, GCN via approximate polynomial filters; scalable to graphs with $10^9$ nodes. References: Defferrard et al. (2016) (ChebNet); Kipf & Welling (2017) (GCN).
Inverse problems and regularized imaging
- Solve Tikhonov system $(A^\top A + \lambda I) x = A^\top b$ for ill-posed deconvolution, tomography, parameter estimation.
- Achievements: iterative methods in medical imaging (CGLS for CT/MRI); LSQR in seismic inversion. References: Hansen (1998); Vogel (2002); Bardsley et al. (2012).

Notation

Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}$ (or $\mathbb{R}^{m \times n}$ overdetermined).
LU factorization: $A = L U$ with $L \in \mathbb{R}^{n \times n}$ lower triangular (unit diagonal), $U \in \mathbb{R}^{n \times n}$ upper triangular.
Cholesky: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite; $L \in \mathbb{R}^{n \times n}$ lower triangular.
QR factorization: $A = QR$ with $Q \in \mathbb{R}^{m \times n}$ orthonormal, $R \in \mathbb{R}^{n \times n}$ upper triangular.
Residual: $r = b - A x \in \mathbb{R}^n$; goal is $\lVert r \rVert \ll \lVert b \rVert$.
Conjugate gradient: $x_k$ minimizes $\frac{1}{2} x^\top A x - b^\top x$ over $k$-dimensional Krylov subspace $\text{span}(r_0, A r_0, \ldots, A^{k-1} r_0)$.
Preconditioner: $M \approx A$ (cheap to invert); solve $M^{-1} A x = M^{-1} b$.
Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); measures sensitivity to perturbation.
Example: $A = \begin{pmatrix} 10 & 1 \\ 1 & 1 \end{pmatrix}$ has $\sigma_1 \approx 10.05, \sigma_2 \approx 0.95$, so $\kappa(A) \approx 10.6$ (moderately ill-conditioned).

Pitfalls & sanity checks

Never solve normal equations $A^\top A x = A^\top b$ directly for ill-conditioned $A$: Use QR, SVD, or iterative methods (LSQR) instead; $\kappa(A^\top A) = \kappa(A)^2$.
Verify SPD before Cholesky: Non-SPD matrices cause NaN/Inf; test via eigenvalues or try-catch.
Check residual convergence: $\lVert A x - b \rVert$ should decrease monotonically in iterative solvers; stagnation signals bad conditioning or preconditioner failure.
Preconditioning setup cost: If solve is run once, setup overhead may exceed savings; only cost-effective for multiple solves with same $A$.
Fill-in in sparse LU: Sparse matrix factorization can become dense; use reordering (minimum degree, nested dissection) to minimize fill-in.
Early stopping in LSQR: Stop iteration based on residual norm or discrepancy principle; continuing to convergence amplifies noise.
Regularization parameter selection: Cross-validation, L-curve, or discrepancy principle; do not fit $\lambda$ on training data.
Scaling sensitivity: Ill-scaled systems (different row/column magnitudes) become ill-conditioned; normalize before solving.

References

Foundational algorithms

Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.
Cholesky, A. L. (1910). Sur la résolution numérique des systèmes d’équations linéaires.

Classical theory and numerical stability

Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.
Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Iterative methods and conjugate gradient

Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.
Saad, Y., & Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.
Axelsson, O. (1994). Iterative Solution Methods.

Preconditioning and multigrid

Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

Applied: Machine learning and inverse problems

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.
Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.
Vogel, C. R. (2002). Computational Methods for Inverse Problems.
Quinonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression.
Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs.
Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering.
Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.
Hensman, J., Matthews, A. G. D. E., & Ghahramani, Z. (2015). Scalable variational Gaussian process classification.
Scholkopf, B., & Smola, A. J. (2002). Learning with Kernels.
Bardsley, J. M., Chung, J., & Palmer, K. (2012). Regularization parameter selection methods for linear least squares problems.

Five worked examples

Worked Example 1: Gaussian elimination, LU factorization, and pivoting#

Introduction#

Solve a system $A x = b$ via LU factorization with partial pivoting; verify stability and reconstruction.

Purpose#

Illustrate direct method for dense systems; compare unpivoted LU (numerically risky) to pivoted LU (numerically stable).

Importance#

LU is foundation of dense linear algebra; pivoting is critical for avoiding catastrophic cancellation.

What this example demonstrates#

Compute LU factorization via Gaussian elimination (with and without pivoting).
Verify $A = L U$ and solve via forward/back-substitution.
Compute condition number and demonstrate error growth without pivoting on ill-conditioned matrix.

Background#

Partial pivoting (swapping rows to maximize pivot) prevents small pivots from amplifying errors during elimination.

Historical context#

Numerical instability of unpivoted LU recognized mid-20th century; pivoting became standard.

History#

Modern linear algebra libraries (LAPACK) default to pivoted LU.

Prevalence in ML#

Used in scikit-learn’s least squares and in matrix factorization algorithms.

Notes#

Pivoting is automatic in most libraries; rarely explicit in practice.
For sparse matrices, fill-in during elimination can destroy sparsity; sparse LU uses ordering strategies (minimum degree, nested dissection).

Connection to ML#

Condition number of $A$ determines solver reliability; poor conditioning necessitates regularization.

Connection to Linear Algebra Theory#

$\kappa(A) = \sigma_1 / \sigma_n$ predicts error magnification; LU error $\sim \kappa(A) \times \text{machine eps}$.

Pedagogical Significance#

Concrete demonstration of numerical stability and pivoting strategy.

References#

Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
np.random.seed(10)

# Create moderately ill-conditioned system
n = 5
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, n)
A = U @ np.diag(s) @ U.T
b = np.random.randn(n)

# Compute LU with pivoting (numpy default)
P, L, U_lu = scipy.linalg.lu(A)
kappa_A = np.linalg.cond(A)

# Verify A = P^T L U
A_recon = P.T @ L @ U_lu
recon_error = np.linalg.norm(A - A_recon)

# Solve via LU: forward sub (L y = P b), back sub (U x = y)
y = scipy.linalg.solve_triangular(L, P @ b, lower=True)
x_lu = scipy.linalg.solve_triangular(U_lu, y, lower=False)

# Solve directly (for comparison)
x_direct = np.linalg.solve(A, b)

print("Condition number kappa(A):", round(kappa_A, 2))
print("LU reconstruction error:", round(recon_error, 8))
print("LU solution matches direct:", np.allclose(x_lu, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_lu - b), 8))

Worked Example 2: Cholesky factorization for symmetric positive definite systems#

Introduction#

Solve a symmetric positive definite (SPD) system via Cholesky factorization; verify stability and computational savings.

Purpose#

Show how SPD structure reduces computation by half and improves stability vs. general LU.

Importance#

Cholesky is standard for covariance matrices, Gram matrices, and Hessians in optimization.

What this example demonstrates#

Construct SPD matrix (e.g., covariance or Gram).
Compute Cholesky $A = L L^\top$ and verify reconstruction.
Solve via forward/back-substitution on $L$.
Compare to general LU: cost $O(n^3 / 3)$ vs. $O(n^3)$.

Background#

For SPD $A$, Cholesky is more stable than LU (no pivoting needed) and faster (half the operations).

Historical context#

Cholesky rediscovered in early 1900s; became standard method for SPD systems by mid-20th century.

History#

LAPACK dpotrf is gold-standard Cholesky implementation.

Prevalence in ML#

Used in kernel ridge regression, Gaussian process inference, and proximal methods.

Notes#

Fails gracefully if $A$ is not SPD; returns error (no negative square roots).
Numerical breakdown signals ill-conditioning; add small jitter ($A + \epsilon I$) if needed.

Connection to ML#

Covariance and Gram matrices are always SPD; Cholesky enables efficient sampling and likelihood computation in probabilistic models.

Connection to Linear Algebra Theory#

Cholesky exists iff all leading principal minors are positive (Sylvester criterion).

Pedagogical Significance#

Demonstrates how structure (symmetry, PSD) enables algorithmic optimization.

References#

Golub & Van Loan (2013). Matrix Computations.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

Solution (Python)#

import numpy as np
np.random.seed(11)

# Create SPD matrix (covariance-like)
n = 5
A_temp = np.random.randn(n, n)
A = A_temp.T @ A_temp + np.eye(n)  # SPD by construction
b = np.random.randn(n)

# Cholesky factorization
L = np.linalg.cholesky(A)

# Verify A = L L^T
A_recon = L @ L.T
recon_error = np.linalg.norm(A - A_recon)

# Solve via forward/back-substitution
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_chol = scipy.linalg.solve_triangular(L.T, y, lower=False)

# Direct solve for comparison
x_direct = np.linalg.solve(A, b)

# Condition number
kappa_A = np.linalg.cond(A)

print("Condition number kappa(A):", round(kappa_A, 2))
print("Cholesky reconstruction error:", round(recon_error, 8))
print("Cholesky solution matches direct:", np.allclose(x_chol, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_chol - b), 8))
print("Lower triangular L diagonal:", np.round(np.diag(L), 4))

Worked Example 3: Conjugate gradient for large sparse systems#

Introduction#

Solve a large sparse SPD system via conjugate gradient (CG); demonstrate convergence and compare to direct Cholesky.

Purpose#

Show how iterative methods scale to large systems while respecting sparsity.

Importance#

CG is the standard iterative solver for large-scale ML and scientific computing.

What this example demonstrates#

Construct sparse SPD system (e.g., graph Laplacian or 2D Poisson discretization).
Solve via CG and via direct Cholesky.
Plot convergence: residual norm vs. iteration.
Measure wall-clock time and memory for dense vs. sparse approaches.

Background#

CG finds solution in at most $n$ iterations (theory); practical convergence in $\ll n$ iterations for well-conditioned systems.

Historical context#

Hestenes & Stiefel (1952); rediscovered in 1970s as demand for large-scale methods grew.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg.

Prevalence in ML#

Used in large-scale kernel methods, graph neural networks, and optimization.

Notes#

Convergence rate depends on condition number $\kappa(A)$; preconditioning improves rate dramatically.
Requires only matrix-vector products $A v$; no explicit $A$ storage needed.

Connection to ML#

Enables Gaussian process inference on large datasets; scales kernel methods from $10^4$ to $10^6$ points.

Connection to Linear Algebra Theory#

CG minimizes $\frac{1}{2} x^\top A x - b^\top x$ over Krylov subspace; conjugate directions ensure optimality.

Pedagogical Significance#

Demonstrates trade-off between direct (dense, $O(n^3)$, low iteration) and iterative (sparse-compatible, $O(n)$ per iteration, many iterations).

References#

Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Golub & Kahan (1965). Calculating singular values and pseudo-inverses.
Nocedal & Wright (2006). Numerical Optimization.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(12)

# Construct sparse SPD system (2D Laplacian discretization)
n_side = 20
n = n_side ** 2
# Discrete Laplacian on 2D grid (tridiagonal structure)
I = np.arange(n)
J = np.arange(n)
V = 4 * np.ones(n)
# Horizontal neighbors
I_h = np.arange(n - n_side)
J_h = I_h + n_side
V_h = -np.ones(len(I_h))
# Vertical neighbors
I_v = np.arange(n - 1)
I_v = I_v[I_v % n_side != n_side - 1]
J_v = I_v + 1
V_v = -np.ones(len(I_v))

rows = np.concatenate([I, I_h, I_h, I_v, I_v])
cols = np.concatenate([J, J_h, I_h, J_v, I_v])
vals = np.concatenate([V, V_h, V_h, V_v, V_v])
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
b = np.random.randn(n)

# CG solution
def residual_norm(x):
    return np.linalg.norm(A @ x - b)

x_cg, info = spla.cg(A, b, tol=1e-6, maxiter=n)

# Direct Cholesky on dense
A_dense = A.toarray()
L = np.linalg.cholesky(A_dense)
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_direct = scipy.linalg.solve_triangular(L.T, y, lower=False)

print("Problem size (n):", n)
print("Sparsity: {:.1f}%".format(100 * A.nnz / (n ** 2)))
print("CG converged in:", info, "iterations")
print("CG residual:", round(residual_norm(x_cg), 8))
print("Direct residual:", round(residual_norm(x_direct), 8))
print("Solutions match:", np.allclose(x_cg, x_direct, atol=1e-5))

Worked Example 4: Preconditioning and conditioning number#

Introduction#

Solve an ill-conditioned SPD system with and without preconditioning; demonstrate acceleration and residual reduction.

Purpose#

Show how preconditioning reduces effective condition number and dramatically accelerates convergence.

Importance#

Preconditioning is essential for large-scale optimization and inverse problems.

What this example demonstrates#

Construct ill-conditioned SPD system (exponentially decaying eigenvalues).
Solve via unpreconditioned CG and via diagonal (Jacobi) preconditioner.
Plot convergence: residual vs. iteration for both.
Verify that CG on $M^{-1} A$ has better convergence rate.

Background#

Convergence rate $\rho \approx \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per iteration; preconditioning reduces $\kappa(M^{-1} A)$.

Historical context#

Preconditioning recognized as essential in 1970s–1980s for practical large-scale solvers.

History#

MINRES-QLP, GMRES with ILU preconditioning (Saad 1993); algebraic multigrid (Ruge–Stüben 1987).

Prevalence in ML#

Preconditioned gradients, L-BFGS with Hessian approximation, trust-region methods.

Notes#

Preconditioning trades: setup cost (compute/apply $M^{-1}$) for iteration reduction.
Diagonal (Jacobi) preconditioner: $M = \text{diag}(A)$; trivial but often effective.

Connection to ML#

Preconditioned gradient descent achieves faster convergence; L-BFGS implicitly preconditions via quasi-Newton approximation.

Connection to Linear Algebra Theory#

CG convergence depends on spectrum of $A$; preconditioning clusters eigenvalues of $M^{-1} A$ to reduce effective $\kappa$.

Pedagogical Significance#

Concrete demonstration of conditioning’s impact on algorithm complexity.

References#

Axelsson, O. (1994). Iterative Solution Methods.
Nocedal & Wright (2006). Numerical Optimization.
Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(13)

# Create ill-conditioned SPD matrix
n = 100
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # Condition number ~1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_unprecond(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_unprecond, _ = spla.cg(A, b, tol=1e-8, callback=callback_unprecond)

# Preconditioned CG (diagonal/Jacobi preconditioner)
M_inv_diag = 1.0 / np.diag(A)
M = sp.diags(1.0 / M_inv_diag)
residuals_precond = []
def callback_precond(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_precond, _ = spla.cg(A, b, M=M, tol=1e-8, callback=callback_precond)

kappa_A = np.linalg.cond(A)
print("Condition number kappa(A):", round(kappa_A, 0))
print("Unpreconditioned CG iterations:", len(residuals_unprecond))
print("Preconditioned CG iterations:", len(residuals_precond))
print("Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("Both solutions satisfy A x = b:", 
      np.allclose(A @ x_unprecond, b) and np.allclose(A @ x_precond, b))

Worked Example 5: LSQR for ill-posed inverse problems#

Introduction#

Solve an ill-posed least squares problem $A x \approx b$ via LSQR (iterative least squares QR); compare to direct pseudoinverse and Tikhonov regularization.

Purpose#

Demonstrate how LSQR stabilizes solutions to ill-posed systems without explicit regularization parameter tuning.

Importance#

LSQR is standard in medical imaging, geophysical inversion, and inverse problems.

What this example demonstrates#

Construct ill-posed rectangular system (tall, decaying singular values).
Solve via LSQR (iterative, early stopping acts as regularization).
Compare to pseudoinverse (noisy) and Tikhonov (requires $\lambda$ selection).
Plot regularization curves (solution norm vs. residual).

Background#

LSQR is conjugate gradient on normal equations $A^\top A x = A^\top b$; stops before convergence to avoid amplifying noise.

Historical context#

Paige & Saunders (1982); foundational algorithm for inverse problems.

History#

CGLS, LSQR in scipy.sparse.linalg; widely used in medical imaging.

Prevalence in ML#

Used in sparse coding, compressed sensing, and large-scale least squares.

Notes#

Stopping iteration acts as regularization; no regularization parameter needed.
Discrepancy principle: stop when residual reaches expected noise level.

Connection to ML#

Implicit regularization; enables robust solutions without hyperparameter tuning.

Connection to Linear Algebra Theory#

LSQR on $\min_x \lVert A x - b \rVert^2$ without explicit pseudoinverse; early stopping filters small singular values.

Pedagogical Significance#

Demonstrates how algorithm structure (iteration count) can act as regularization.

References#

Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.
Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.
Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(14)

# Construct ill-posed rectangular system
m, n = 120, 50
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T
x_true = np.zeros(n)
x_true[:8] = np.sin(np.linspace(0, 2*np.pi, 8))
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# LSQR with early stopping (implicit regularization)
residuals_lsqr = []
x_lsqr_early = None
for k in [5, 10, 20, 50]:
    x_lsqr, _ = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)
    residuals_lsqr.append(np.linalg.norm(A @ x_lsqr - b))
    if k == 20:
        x_lsqr_early = x_lsqr

# Pseudoinverse (no regularization, noisy)
x_pinv = spla.lsqr(A, b, atol=0, btol=0, iter_lim=1000)[0]

# Tikhonov with manual lambda (requires tuning)
lam = 0.01
G = A.T @ A + lam * np.eye(n)
x_tikhonov = np.linalg.solve(G, A.T @ b)

print("True solution norm:", round(np.linalg.norm(x_true), 4))
print("LSQR (early stop k=20) norm:", round(np.linalg.norm(x_lsqr_early), 4))
print("LSQR (early stop) error:", round(np.linalg.norm(x_lsqr_early - x_true), 4))
print("Pseudoinverse error:", round(np.linalg.norm(x_pinv - x_true), 4))
print("Tikhonov error:", round(np.linalg.norm(x_tikhonov - x_true), 4))
print("LSQR residual at k=20:", round(residuals_lsqr[2], 6))

Comments

Algorithm Category

Direct Methods

Iterative Methods

Computational Efficiency

Cache-Aware Algorithms

Data Modality

Dense Matrices

Structured Matrices

Historical & Attribution

20th Century

Key Concepts & Theorems

Projection & Orthogonality

Learning Path & Sequencing

Intermediate

Matrix Decompositions

Cholesky Decomposition

Orthogonal Factorizations

ML Applications

Optimization & Training

Numerical Stability & Robustness

Preconditioning

Theoretical Foundation

Numerical Linear Algebra

PSD & PD Matrices

Chapter 9

PSD & PD Matrices

Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $>0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues, admit Cholesky factorizations (for PD), and define inner products or metrics. PSD structure underpins convexity, stability, and validity of kernels.

Important ideas#

Eigenvalue characterization
- Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.
Quadratic forms and convexity
- If $A\succeq 0$, $f(x)=\tfrac{1}{2} x^\top A x$ is convex; Hessian PSD implies convex objective locally/globally (for twice-differentiable functions).
Gram matrices
- $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).
Cholesky factorization
- PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.
Schur complement
- For block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, PSD iff Schur complement $C-B^\top A^{-1}B \succeq 0$.
Mahalanobis metric
- For SPD $M$, $d_M(x,y)^2=(x-y)^\top M (x-y)$ defines a metric; whitening corresponds to $M=\Sigma^{-1}$.
Numerical PSD
- In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).

Relevance to ML#

Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.
Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.
Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.
Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.
Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

1900s: Mercer’s theorem (kernel PSD) and early quadratic form characterizations.
1918: Schur complement criteria.
1910–1940s: Cholesky factorization for SPD solves.
1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.
1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.
Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.
Principal minors: PD iff all leading principal minors are positive (Sylvester).
Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.
Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

PSD/PD definitions via quadratic form and eigenvalues.
Sylvester’s criterion (leading principal minors > 0 for PD).
Schur complement PSD condition.
Mercer’s theorem (kernel PSD).
Cholesky existence for SPD.

Applied (landmark systems)#

SVM with PSD kernels (Cortes–Vapnik 1995).
Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.
Kernel Ridge Regression (Murphy 2022) solved via SPD systems.
Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).
GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up

Covariance and PCA

$\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; eigenvalues are variances. PCA uses PSD structure. References: Jolliffe 2002; Shlens 2014.

Kernels and SVMs/GPs

RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

Hessians and convexity

For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

Mahalanobis distance & whitening

SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

Cholesky-based solvers (KRR/GP)

SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation

PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.
Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.
Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.
Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.
Schur complement: For blocks, $S = C - B^\top A^{-1} B$.
Examples:
- PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$.
- Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.
- Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.

Pitfalls & sanity checks

Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.
Near-singular PSD: add jitter; do not invert directly.
Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.
Kernel parameters: very small/large length-scales can hurt conditioning.
Cholesky failures: indicate non-PSD or insufficient jitter.

References

Foundations

Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models 4. Mercer, J. (1909). Functions of positive and negative type. 5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels. 6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. 7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics 8. Nocedal, J., & Wright, S. (2006). Numerical Optimization. 9. De Maesschalck, R. et al. (2000). The Mahalanobis distance.

Applications and practice 10. Cortes, C., & Vapnik, V. (1995). Support-vector networks. 11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare PSD tests: eigenvalues vs. Cholesky; show failure on non-PSD and success on jittered fix.

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

Symmetrization, eigenvalue check, and Cholesky factorization.
Adding jitter ($\epsilon I$) can restore PD for near-PSD matrices.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

Always symmetrize numerically; use small jitter (e.g., $1e-6$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; $A=LL^\top$ iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

Golub & Van Loan (2013). Matrix Computations.
Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
A = np.random.randn(5, 5)
A = 0.5 * (A + A.T)  # symmetrize
A_bad = A.copy()
A_bad[0, 0] = -5.0  # make indefinite

def is_psd_eig(M, tol=1e-10):
    w = np.linalg.eigvalsh(M)
    return np.all(w >= -tol), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.4f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Jitter fix
eps = 1e-3
A_fix = A_bad + eps * np.eye(A_bad.shape[0])
ok, w = is_psd_eig(A_fix)
print("jittered: min eig=", round(w.min(), 4), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("jittered Cholesky diag min:", round(np.min(np.diag(L)), 4))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify variance becomes identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

$\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.
Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

Jolliffe (2002). PCA.
Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

RBF kernel is PSD; small datasets can be nearly singular when points are close.
Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

Schölkopf & Smola (2002). Learning with Kernels.
Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = A2 + A2.T - 2 * A @ A.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

$d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

De Maesschalck et al. (2000). The Mahalanobis distance.
Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

Schölkopf & Smola (2002). Learning with Kernels.
Rasmussen & Williams (2006). Gaussian Processes for ML.
Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = A2 + B2 - 2 * A @ B.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L L^T alpha = y
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))