Introduction
Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares reduces to the normal equations $(X^\top X) w = X^\top y$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization solves preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.
Important ideas
Gaussian elimination and LU factorization
Direct method: $A = LU$ via row operations.
Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.
Cost: $O(n^3)$ for dense matrices; without pivoting, small pivots cause numerical instability, so partial pivoting is standard.
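As a concrete sketch of the factor-once, solve-many pattern, using SciPy's `lu_factor`/`lu_solve` (the matrix values here are arbitrary illustrations):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Small illustrative system (values chosen arbitrarily).
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([4.0, 10.0, 24.0])

# Factor once with partial pivoting; reuse the factors for any
# number of right-hand sides at O(n^2) per solve.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)   # forward/back-substitution
```

For a well-conditioned matrix the residual $b - Ax$ comes out at machine precision.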
Cholesky factorization for SPD systems
For symmetric positive definite $A$: $A = L L^\top$ (one triangle).
Cost: $n^3/3$ flops (half the cost of LU); more stable than LU.
Numerically stable if $A$ is well-conditioned.
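A minimal sketch with SciPy's `cho_factor`/`cho_solve` on an SPD Gram matrix (the ridge-style diagonal shift guarantees positive definiteness; all values are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
A = X.T @ X + 1e-3 * np.eye(5)   # SPD Gram matrix with a small shift
b = rng.standard_normal(5)

c, low = cho_factor(A)           # A = L L^T, only one triangle stored
x = cho_solve((c, low), b)
```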
QR factorization and least squares
$A = QR$ with orthonormal $Q$ and upper triangular $R$.
Numerically stable; used for least squares (Chapter 12).
Cost: $O(n d^2)$ for an $n \times d$ matrix with $n \ge d$.
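Least squares via reduced QR, as a sketch in NumPy (random data for illustration): factor $A = QR$, then solve the triangular system $R x = Q^\top b$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))   # tall n x d design matrix
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)              # reduced QR: Q is 100 x 3, R is 3 x 3
x = np.linalg.solve(R, Q.T @ b)     # triangular solve for the LS solution
```

This avoids forming $A^\top A$, which would square the condition number.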
Iterative solvers and conjugate gradient
Conjugate gradient (CG): for SPD systems; terminates in at most $n$ iterations in exact arithmetic, and in practice converges in $\ll n$ iterations.
GMRES/MINRES: for general/symmetric non-SPD systems.
Cost per iteration: $O(\mathrm{nnz}(A))$, where $\mathrm{nnz}(A)$ is the number of nonzeros; scales to $n \sim 10^8$ and beyond.
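A sketch using `scipy.sparse.linalg.cg` on a 1-D Laplacian (tridiagonal, SPD, very sparse); each iteration touches only the $\approx 3n$ nonzeros:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 1000
# 1-D Laplacian: tridiagonal, symmetric positive definite.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# info == 0 signals convergence to the default relative tolerance.
x, info = cg(A, b)
```

The matrix is never densified; CG only needs matrix-vector products.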
Preconditioning and conditioning
Condition number $\kappa(A)$ governs iteration complexity: CG reduces the error by a factor of roughly $(\sqrt{\kappa} - 1) / (\sqrt{\kappa} + 1)$ per iteration.
Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.
Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.
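The simplest of these, the Jacobi (diagonal) preconditioner, can be sketched as a `LinearOperator` applying $M^{-1} v = v / \operatorname{diag}(A)$; the badly scaled diagonal below is a contrived illustration of a large condition number:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 500
rng = np.random.default_rng(2)
# SPD test matrix with a wildly varying diagonal -> large kappa.
d = 10.0 ** rng.uniform(0, 4, size=n)
A = (sp.diags(d) + sp.diags([-0.1, -0.1], [-1, 1], shape=(n, n))).tocsr()
b = np.ones(n)

# Jacobi preconditioner: divide by the diagonal of A.
diag = A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: v / diag)

iters = {"plain": 0, "jacobi": 0}
def counter(key):
    def cb(xk):
        iters[key] += 1
    return cb

x_plain, info1 = cg(A, b, callback=counter("plain"))
x_prec, info2 = cg(A, b, M=M, callback=counter("jacobi"))
```

Rescaling by the diagonal clusters the eigenvalues of $M^{-1} A$ near 1, so the preconditioned run needs far fewer iterations.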
Sparse matrix structure
Banded, tridiagonal, block-structured systems exploit locality.
Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.
Fill-in during factorization can destroy sparsity; ordering matters.
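Banded structure is the simplest case to exploit: a tridiagonal system can be stored as 3 rows instead of an $n \times n$ array and solved in $O(n)$. A sketch with SciPy's `solve_banded`:

```python
import numpy as np
from scipy.linalg import solve_banded

n = 100_000
# Tridiagonal system in banded storage: row 0 = superdiagonal,
# row 1 = main diagonal, row 2 = subdiagonal.
ab = np.zeros((3, n))
ab[0, 1:] = -1.0
ab[1, :] = 2.0
ab[2, :-1] = -1.0
b = np.ones(n)

# (1, 1) = one subdiagonal and one superdiagonal; O(n) time and memory.
x = solve_banded((1, 1), ab, b)
```

A dense factorization of the same system would need $\sim 10^{10}$ entries of storage; the banded solve needs $3n$.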
Rank deficiency and ill-posedness
Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.
Ill-conditioned $A$ (nearly rank-deficient) amplifies noise in $b$; stabilization is required.
Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.
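A sketch of the shifted normal equations above, on a contrived nearly-collinear design matrix (NumPy only; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
# Nearly collinear columns -> ill-conditioned A^T A.
A = rng.standard_normal((200, 2)) @ np.array([[1.0, 1.0],
                                              [0.0, 1e-6]])
b = rng.standard_normal(200)

lam = 1e-2
# Tikhonov/ridge: every eigenvalue of A^T A is shifted up by lam,
# so the system is well-conditioned and the solution norm is damped.
x = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)
```

The unregularized least-squares solution for this matrix has an enormous norm because the tiny singular value is inverted; the shift suppresses that amplification.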
Relevance to ML
Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.
Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.
Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.
Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.
Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.
Algorithmic development (milestones)
1670s: Newton’s method and early algebraic solutions.
1810: Gaussian elimination formalized (Gauss, Legendre).
c. 1910: Cholesky factorization developed by André-Louis Cholesky (published posthumously in 1924).
1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.
1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.
1982: LSQR algorithm (Paige–Saunders); numerically stable iterative method for least squares.
1986: GMRES (Saad–Schultz) for non-symmetric systems; MINRES for symmetric indefinite systems appeared earlier (Paige–Saunders, 1975).
1990s–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enable near-$O(n)$ scaling.
2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.
Definitions
Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.
LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.
Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.
Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.
Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.
Residual: $r = b - A x$; measure of solution error.
Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.
Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.
Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).
Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.
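Fill-in can be demonstrated directly: an "arrowhead" matrix (dense first row and column plus a diagonal) factorizes with catastrophic fill under the natural ordering, but stays sparse under a fill-reducing ordering. A sketch with SciPy's `splu` (SuperLU), using arbitrary illustrative values:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 200
# Arrowhead matrix: dense first row/column plus the diagonal.
A = sp.lil_matrix((n, n))
A.setdiag(4.0)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = float(n)   # diagonally dominant, so pivots stay in place
A = A.tocsc()

# Natural ordering: eliminating the dense row/column first
# fills in essentially the whole matrix, O(n^2) nonzeros.
lu_nat = splu(A, permc_spec="NATURAL")
# Fill-reducing ordering (COLAMD): dense node eliminated last,
# fill-in stays O(n).
lu_amd = splu(A, permc_spec="COLAMD")

fill_nat = lu_nat.L.nnz + lu_nat.U.nnz
fill_amd = lu_amd.L.nnz + lu_amd.U.nnz
```

The same matrix, the same factorization algorithm, and orders-of-magnitude difference in fill; this is why sparse direct solvers always reorder first.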