Chapter 14
Conditioning & Stability
Key ideas: Introduction

Introduction#

Numerical stability determines whether an algorithm produces a usable answer at all. Machine arithmetic operates at finite precision (float64: ~16 significant digits). An ill-conditioned problem (large $\kappa$) is inherently sensitive: small input perturbations cause large output changes. Backward-stable algorithms guarantee that the computed solution is the exact solution to a nearby problem. Forward error depends on conditioning; backward error depends on the algorithm's stability. Together, they predict what accuracy is achievable. This is fundamental to reliable ML: ill-conditioned matrices appear as nearly singular Gram matrices, in ill-posed inverse problems, and in deep network Hessians.

Important ideas#

  1. Condition number and sensitivity

    • $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

    • Forward error bound: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \cdot \frac{\| \Delta A \|}{\| A \|}$ (perturbation on $A$ causes error in $x$).

    • Ill-conditioned ($\kappa \gg 1$) problems amplify noise; well-conditioned ($\kappa \approx 1$) are benign.
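
A minimal numerical check of this bound (dimensions, seed, and singular values are illustrative): build $A$ with prescribed singular values, perturb the right-hand side, and compare the observed amplification to $\kappa(A)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2x2 matrix with singular values 1 and 1e-6, so kappa(A) = 1e6
U, _ = np.linalg.qr(rng.standard_normal((2, 2)))
V, _ = np.linalg.qr(rng.standard_normal((2, 2)))
A = U @ np.diag([1.0, 1e-6]) @ V.T

x = np.array([1.0, 1.0])
b = A @ x

# Tiny perturbation of the right-hand side, then re-solve
db = 1e-10 * rng.standard_normal(2)
x_pert = np.linalg.solve(A, b + db)

amplification = (np.linalg.norm(x_pert - x) / np.linalg.norm(x)) \
              / (np.linalg.norm(db) / np.linalg.norm(b))
print("kappa(A)               =", np.linalg.cond(A))
print("observed amplification =", amplification)  # never exceeds kappa(A)
```

The same first-order bound holds for perturbations of $b$ as for perturbations of $A$; the observed ratio sits below $\kappa(A)$, reaching it only for worst-case perturbation directions.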

  2. Machine precision and floating-point arithmetic

    • Machine epsilon: $\varepsilon_{\text{machine}} \approx 2^{-52} \approx 2.2 \times 10^{-16}$ for float64.

    • Relative error per operation: $O(\varepsilon_{\text{machine}})$; accumulated error over $n$ operations: $O(n \cdot \varepsilon_{\text{machine}})$.

    • Loss of significance: if $a \approx b$, then $a - b$ has few correct digits (cancellation).
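
These three facts can be observed directly in a few lines (the constants are float64-specific):

```python
import numpy as np

eps = np.finfo(np.float64).eps          # 2.220446049250313e-16 = 2**-52
print(eps)

# 1.0 + eps is the next representable double after 1.0;
# anything smaller than eps/2 rounds back to 1.0
assert 1.0 + eps != 1.0
assert 1.0 + eps / 4 == 1.0

# Cancellation: a and b each carry ~16 correct digits, but their
# difference keeps only the digits in which they disagree
a, b = 1.0 + 1e-12, 1.0
rel_err = abs((a - b) - 1e-12) / 1e-12
print(rel_err)                          # ~9e-5: roughly 12 digits lost
```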

  3. Backward error analysis and stability

    • Algorithm is backward-stable if computed solution $\tilde{x}$ is exact solution to perturbed problem: $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon_{\text{machine}})$.

    • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|} \approx \kappa(A) \cdot O(\varepsilon_{\text{machine}})$ for stable algorithm.

    • Unstable algorithm: backward error is large; forward error can be catastrophic.
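
A sketch of the backward/forward distinction using `np.linalg.solve` (LU with partial pivoting, backward-stable in practice); sizes and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Test matrix with singular values from 1 down to 1e-8 (kappa ~ 1e8)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -8, n)) @ V.T
x_true = rng.standard_normal(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)
backward = np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b)
forward = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

print(f"backward error {backward:.1e}")  # ~ eps: exact for a nearby problem
print(f"forward error  {forward:.1e}")   # ~ kappa * eps: conditioning shows up
```

The backward error sits near machine precision regardless of $\kappa$; the forward error inherits the factor $\kappa(A)$.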

  4. Condition number of Gram matrix vs. original matrix

    • $\kappa(A^\top A) = \kappa(A)^2$; normal equations square the condition number.

    • QR/SVD avoid this: $\kappa(R) = \kappa(A)$ (no squaring), $\kappa(\Sigma) = \kappa(A)$.

    • For least squares: direct method on $A$ (QR) preferable to solving $A^\top A x = A^\top b$.
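
The squaring is easy to verify numerically (shapes and spectrum are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 50, 5
# A with prescribed singular values 1 ... 1e-4, so kappa(A) = 1e4
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = U[:, :d] @ np.diag(np.logspace(0, -4, d))

print(np.linalg.cond(A))        # ~1e4
print(np.linalg.cond(A.T @ A))  # ~1e8: the condition number is squared
```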

  5. Preconditioning and effective conditioning

    • Preconditioner $M \approx A$ transforms $A \to M^{-1} A$; goal is $\kappa(M^{-1} A) \ll \kappa(A)$.

    • Incomplete LU (ILU), Jacobi (diagonal), algebraic multigrid (AMG) are practical preconditioners.

    • CG iteration count $\approx \sqrt{\kappa(M^{-1} A)}$; reducing $\kappa$ by factor 100 reduces iterations by factor 10.

  6. Regularization and ill-posedness

    • Ill-posed problem: small changes in data cause arbitrarily large changes in solution ($\kappa = \infty$).

    • Tikhonov regularization: $(A^\top A + \lambda I) x = A^\top b$ shifts small singular values by $\lambda$.

    • Effective $\kappa$ after regularization: $\kappa_{\text{reg}} = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$; well-chosen $\lambda$ reduces condition number.
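
The $\kappa_{\text{reg}}$ formula can be tabulated directly for a synthetic spectrum (values are illustrative):

```python
import numpy as np

sigma = np.logspace(0, -5, 20)    # singular values: kappa(A) = 1e5
kappas = []
for lam in [0.0, 1e-8, 1e-4, 1e-2]:
    kappa_reg = (sigma[0] ** 2 + lam) / (sigma[-1] ** 2 + lam)
    kappas.append(kappa_reg)
    print(f"lambda = {lam:.0e}  effective kappa = {kappa_reg:.2e}")
# lambda = 0 gives kappa(A)^2 = 1e10; each larger lambda shrinks it
```

Each increase in $\lambda$ lifts the small singular values far more (in relative terms) than the large ones, so the ratio collapses.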

  7. Numerical cancellation and loss of significance

    • Subtraction $a - b$ where $a \approx b$ cancels leading digits; result has low precision.

    • Orthogonal transformations (QR, SVD, Householder) avoid this; preserve angles and norms.

    • Modified Gram–Schmidt (MGS) is more stable than classical Gram–Schmidt: projecting each column against the already-updated vectors keeps the loss of orthogonality near $O(\kappa \varepsilon)$ instead of $O(\kappa^2 \varepsilon)$.
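
The gap between the classical and modified variants shows up clearly on a nearly rank-deficient matrix. `gram_schmidt` below is a toy helper written for this sketch, not a library routine:

```python
import numpy as np

def gram_schmidt(A, modified=True):
    """Orthonormalize the columns of A (classical vs. modified variant)."""
    Q = A.copy().astype(float)
    for j in range(A.shape[1]):
        for i in range(j):
            # MGS projects out q_i from the *updated* column; CGS from the original
            v = Q[:, j] if modified else A[:, j]
            Q[:, j] = Q[:, j] - (Q[:, i] @ v) * Q[:, i]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q

rng = np.random.default_rng(3)
# Columns with singular values 1 ... 1e-10: nearly dependent
U, _ = np.linalg.qr(rng.standard_normal((40, 40)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
A = U[:, :10] @ np.diag(np.logspace(0, -10, 10)) @ V.T

for modified in (False, True):
    Q = gram_schmidt(A, modified)
    loss = np.linalg.norm(Q.T @ Q - np.eye(10))
    print("MGS" if modified else "CGS", "loss of orthogonality:", loss)
# CGS loses orthogonality at O(kappa^2 * eps); MGS only at O(kappa * eps)
```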

Relevance to ML#

  • Linear regression and least squares: Gram matrix $X^\top X$ can be ill-conditioned if features are correlated; normal equations fail; QR/SVD required.

  • Kernel methods and Gaussian processes: Gram/covariance matrix condition number determines sample complexity and numerical stability of Cholesky.

  • Deep learning and Hessian conditioning: Neural network loss Hessian is often ill-conditioned; affects convergence rate, learning rate selection, and second-order optimization.

  • Regularization and implicit bias: Ill-posed learning problems ($\kappa = \infty$) require regularization (L2, dropout, early stopping); reduces effective $\kappa$.

  • Numerical robustness in production: Conditioning determines whether model training is reproducible, whether predictions are stable to input noise, and whether inference is safe.

Algorithmic development (milestones)#

  • 1947: von Neumann & Goldstine analyze the numerical stability of Gaussian elimination; show that rounding errors are amplified in proportion to the conditioning of the matrix.

  • 1961: Wilkinson develops comprehensive framework for backward error analysis; introduces condition number formally.

  • 1965: Golub & Kahan recognize $\kappa(A^\top A) = \kappa(A)^2$ is catastrophic for normal equations; advocate for QR/SVD.

  • 1979: Dongarra, Bunch, Moler & Stewart release LINPACK, implementing numerically stable factorizations with careful pivot selection.

  • 1980s–1990s: Higham develops extended backward error analysis; proves stability results for Cholesky, QR, and modified Gram–Schmidt.

  • 1992: Arioli et al. develop a posteriori error bounds; can verify computed solution without knowing true solution.

  • 2000s: Preconditioning becomes standard in iterative solvers; algebraic multigrid (AMG) enables linear scaling.

  • 2010s: Automatic differentiation frameworks (TensorFlow, PyTorch) expose differentiable linear solvers built on numerically stable LAPACK/cuBLAS kernels.

Definitions#

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); $\kappa_\infty(A) = \| A \|_\infty \| A^{-1} \|_\infty$ (operator norm).

  • Machine epsilon: $\varepsilon_{\text{machine}} = 2^{-(p-1)}$ for $p$-bit mantissa; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point number: Normalized form: $\pm m \times 2^e$ with $1 \le m < 2$ (hidden bit), $e \in [-1022, 1023]$.

  • Relative error: $\frac{\| \tilde{x} - x \|}{\| x \|}$ (dimensionless; independent of scaling).

  • Backward error: $\frac{\| \tilde{A} - A \|}{\| A \|}$ for perturbed matrix; small backward error $\Rightarrow$ computed $\tilde{x}$ is exact solution to nearby problem.

  • Forward error: $\frac{\| \tilde{x} - x \|}{\| x \|}$; depends on both backward error and conditioning: $\text{forward error} \approx \kappa(A) \times \text{backward error}$.

  • Ill-conditioned: $\kappa(A) \gg 1$; small perturbations cause large output changes.

  • Well-conditioned: $\kappa(A) \approx 1$; stable to perturbations.
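
A few of these definitions checked interactively (note that `math.frexp` normalizes to $0.5 \le m < 1$; this differs from the $1 \le m < 2$ convention above only by a shift of the exponent):

```python
import math
import numpy as np

# Normalized form: 6.5 = 0.8125 * 2**3 (frexp convention), i.e. 1.625 * 2**2
m, e = math.frexp(6.5)
print(m, e)                          # 0.8125 3

# Rounding model: fl(a op b) = (a op b)(1 + delta), |delta| <= eps_mach
eps = np.finfo(np.float64).eps
s = 0.1 + 0.2
print(s == 0.3)                      # False
print(abs(s - 0.3) / 0.3 < 2 * eps)  # True: off by about one rounding error
```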

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Condition number definition and properties: $\kappa(A) = \sigma_1 / \sigma_n$; $\kappa(AB) \le \kappa(A) \kappa(B)$; $\kappa(A) \ge 1$. Reference: Golub & Van Loan (2013).

  • Perturbation theory and forward error bounds: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \frac{\| \Delta A \|}{\| A \|}$ (first-order). Reference: Wilkinson (1961); Higham (2002).

  • Backward error analysis: Algorithm is backward-stable if $\tilde{x}$ satisfies $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon)$. Reference: Higham (2002).

  • Machine precision and floating-point error: $\varepsilon_{\text{mach}} = 2^{-(p-1)}$ for a $p$-bit significand; error per operation $O(\varepsilon)$; cumulative error over $n$ steps $O(n \varepsilon)$. Reference: IEEE 754 (2019); Goldberg (1991).

  • Condition number of Gram matrix: $\kappa(A^\top A) = \kappa(A)^2$; why normal equations are risky. Reference: Golub & Kahan (1965).

  • Preconditioning theory: Transform $A \to M^{-1} A$; goal $\kappa(M^{-1} A) \ll \kappa(A)$. Reference: Axelsson (1994).

  • Tikhonov regularization: Shift eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$; reduce effective condition number. Reference: Tikhonov (1963).

Applied (landmark systems)#

  • QR factorization for stable least squares: $\kappa(R) = \kappa(A)$ (no squaring); stable even for ill-conditioned $A$. Implementation: LAPACK dgeqrf; scikit-learn default solver. Reference: Golub & Kahan (1965); Golub & Van Loan (2013).

  • SVD for rank-deficient and ill-conditioned systems: Direct pseudoinverse computation; threshold small singular values. Implementation: LAPACK dgesvd; scipy.linalg.svd. Reference: Golub & Kahan (1965).

  • Cholesky with jitter for near-singular covariance: Add $\epsilon I$ if Cholesky fails; trade-off between accuracy and stability. Standard in GPyTorch, Pyro, Stan. Reference: Rasmussen & Williams (2006).

  • Modified Gram–Schmidt (MGS) for stable orthogonalization: More stable than classical GS; a reorthogonalization pass restores orthogonality to working precision when needed. Reference: Golub & Pereyra (1973); Yarlagadda et al. (1979).

  • Preconditioned conjugate gradient (PCG): ILU and algebraic multigrid preconditioners reduce iteration counts by factors of 10–100; quasi-Newton methods such as L-BFGS play the analogous role in optimization. Standard in scipy.sparse.linalg, PETSc. Reference: Nocedal & Wright (2006); Saad (2003).

  • Ridge regression and elastic net for regularization: Manual $\lambda$ tuning or cross-validation; scikit-learn Ridge, ElasticNet. Reference: Hoerl & Kennard (1970); Zou & Hastie (2005).

Key ideas: Where it shows up
  1. Linear regression and least squares

    • Gram matrix $X^\top X$ condition number determines numerical stability of normal equations.

    • Achievements: QR/SVD solvers in scikit-learn, statsmodels ensure stability; warning when $\kappa > 10^{10}$. References: Hastie et al. (2009); Golub & Kahan (1965).

  2. Kernel methods and Gaussian processes

    • Covariance/kernel matrix $K$ condition number governs Cholesky stability and marginal likelihood evaluation.

    • Achievements: Cholesky with jitter in Stan/PyMC3; sparse GP approximations circumvent $O(n^3)$ and ill-conditioning. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005).

  3. Deep learning and second-order optimization

    • Hessian condition number determines convergence rate of Newton/quasi-Newton methods; affects learning rate and step size.

    • Achievements: L-BFGS (quasi-Newton) in PyTorch/TensorFlow adapts to Hessian conditioning; natural gradient methods rescale by Fisher information. References: Martens & Grosse (2015); Nocedal & Wright (2006).

  4. Regularization and implicit bias

    • Ill-posed learning problems require regularization; L2 penalty (ridge) shifts eigenvalues, reducing effective condition number.

    • Achievements: elastic net (L1+L2) in scikit-learn balances sparsity and conditioning; early stopping in SGD acts as implicit regularization. References: Tibshirani (1996); Zou & Hastie (2005); Zhu et al. (2021).

  5. Numerical robustness and reliability

    • Production ML systems must be stable to input noise, feature preprocessing, and computational precision.

    • Achievements: feature standardization reduces condition number; model monitoring tracks numerical stability (checks for NaN/Inf); validation on adversarial perturbations. References: Goodfellow et al. (2014) (adversarial robustness); LeCun et al. (1998) (feature normalization).

Notation
  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)} = \sqrt{\frac{\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)}}$; for SPD $A$, this reduces to $\lambda_{\max}(A) / \lambda_{\min}(A)$.

  • Machine epsilon: $\varepsilon_{\text{mach}}$ or $\text{eps}$ in code; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point relative error: $\tilde{a} = a(1 + \delta)$ where $|\delta| \le \varepsilon_{\text{mach}}$.

  • Perturbation: $\tilde{A} = A + \Delta A$ with $\| \Delta A \| \le \varepsilon \| A \|$.

  • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$ (computed vs. true solution).

  • Backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$ (residual relative to data norm).

  • Norm: $\| A \|_2 = \sigma_1(A)$ (spectral norm); $\| A \|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius).

  • Example: For $A \in \mathbb{R}^{100 \times 100}$ with $\sigma_1 = 10, \sigma_{100} = 0.01$: $\kappa(A) = 1000$. Relative error in solving $Ax = b$ is amplified by factor $\approx 1000$; if input has error $10^{-15}$, output error $\approx 10^{-12}$.
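
The worked numbers above can be reproduced directly (the random construction is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
# sigma_1 = 10, sigma_100 = 0.01, so kappa(A) = 10 / 0.01 = 1000
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(1, -2, n)) @ V.T

x = rng.standard_normal(n)
b = A @ x
x_hat = np.linalg.solve(A, b)

kappa = np.linalg.cond(A)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(kappa)     # ~1000
print(rel_err)   # roughly kappa * eps_mach, i.e. ~1e-13
```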

Pitfalls & sanity checks
  • Never ignore condition number: If $\kappa(A) > 10^{8}$, accuracy is compromised; either regularize or use higher precision.

  • Use QR/SVD for least squares, never normal equations (unless you’ve verified conditioning): $\kappa(X^\top X) = \kappa(X)^2$ is catastrophic.

  • Check for catastrophic cancellation: Subtraction of nearly equal numbers loses precision; use orthogonal algorithms instead.

  • Machine precision is absolute limit: Even with perfect algorithm, relative error $\gtrsim \varepsilon_{\text{mach}} \times \kappa(A)$ is unavoidable.

  • Backward error bounds are meaningful: Small backward error $\Rightarrow$ computed solution is exact for nearby problem (reassuring).

  • Regularization parameter selection matters: Cross-validation, L-curve, or discrepancy principle; manual tuning often fails.

  • Feature scaling affects conditioning: Standardize features before regression; poor scaling inflates condition number unnecessarily.

  • Preconditioning amortized over multiple solves: Single solve: preconditioning setup not worth it; multiple solves: huge benefit.
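
The feature-scaling pitfall in one snippet (the scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
# Three features on wildly different scales (e.g., km, m, mm)
X = rng.standard_normal((n, 3)) * np.array([1e3, 1.0, 1e-3])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, std 1 per column

kappa_raw = np.linalg.cond(X)
kappa_std = np.linalg.cond(X_std)
print(kappa_raw)   # ~1e6: driven entirely by the units
print(kappa_std)   # ~1: conditioning restored
```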

References

Foundational backward error analysis

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

  3. IEEE 754. (2019). IEEE Standard for Floating-Point Arithmetic.

Classical numerical linear algebra

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Comprehensive stability analysis

  1. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  2. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix.

  3. Givens, W. (1958). Computation of plane unitary rotations transforming a general matrix to triangular form.

Orthogonal factorizations and reorthogonalization

  1. Golub, G. H., & Pereyra, V. (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.

  2. Yarlagadda, R. K., Hershey, J. E., & Sule, S. M. (1979). Gram–Schmidt orthogonalization with application to order recursive lattice filters.

Iterative methods and preconditioning

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Axelsson, O. (1994). Iterative Solution Methods.

  3. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  4. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  5. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

Ill-posed problems and regularization

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Hansen, P. C., Nagy, J. G., & O’Leary, D. P. (2006). Deblurring Images: Matrices, Spectra, and Filtering.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Applied ML and regularization

  1. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  3. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

  4. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

  5. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Five worked examples

Worked Example 1: Condition number of Gram matrix and instability of normal equations#

Introduction#

Construct a data matrix $X$ with correlated features; compute condition number $\kappa(X)$ and $\kappa(X^\top X)$; solve least squares via normal equations vs. QR to demonstrate error amplification.

Purpose#

Illustrate why normal equations fail for ill-conditioned data; show that $\kappa(X^\top X) = \kappa(X)^2$ causes error to explode.

Importance#

Foundational motivation for using QR/SVD in regression; explains why scikit-learn warns on ill-conditioned matrices.

What this example demonstrates#

  • Construct $X$ with exponentially decaying singular values (high correlation).

  • Compute $\kappa(X)$ and $\kappa(X^\top X) = \kappa(X)^2$.

  • Solve via normal equations: $G = X^\top X$, $w = G^{-1} X^\top y$.

  • Solve via QR factorization: $w$ from $R w = Q^\top y$.

  • Compare residuals and error growth on perturbed data.

Background#

Gram matrix $G = X^\top X$ is square and symmetric but inherits squared conditioning from $X$: solving with $G$ amplifies errors by $\kappa(G) = \kappa(X)^2$ in the worst case.

Historical context#

Golub & Kahan (1965) proved $\kappa(X^\top X) = \kappa(X)^2$; advocated QR as practical solution.

History#

Recognition of this issue led to LINPACK (1979) and LAPACK (1992) emphasis on QR/SVD.

Prevalence in ML#

Statisticians and practitioners routinely encounter this; feature correlation is common in real data.

Notes#

  • Problem is fundamental to linear algebra, not implementation-specific.

  • Data standardization (mean 0, std 1) helps but doesn’t eliminate correlation.

  • Large-scale regression ($n, d > 10^6$) exacerbates conditioning issues.

Connection to ML#

Ridge regression adds $\lambda I$ to shift eigenvalues; reduces effective condition number. Feature engineering (orthogonalization, PCA preprocessing) also helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ is exact; QR avoids Gram matrix entirely.

Pedagogical Significance#

Concrete demonstration of why theory (backward error analysis) matters in practice.

References#

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

Solution (Python)#

import numpy as np
np.random.seed(20)

# Create ill-conditioned data (correlated features)
n, d = 100, 10
# Exponentially decaying singular values: kappa(X) ~ 1e6, so
# kappa(X^T X) ~ 1e12 -- near the limit of float64
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -6, d)
X = U[:, :d] @ np.diag(s)

# True solution and noiseless target (so all observed error is numerical)
w_true = np.random.randn(d)
y = X @ w_true

# Condition numbers
kappa_X = np.linalg.cond(X)
G = X.T @ X
kappa_G = np.linalg.cond(G)

# Solve via normal equations
w_ne = np.linalg.solve(G, X.T @ y)
residual_ne = np.linalg.norm(X @ w_ne - y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)
residual_qr = np.linalg.norm(X @ w_qr - y)

# Solve via SVD (most stable)
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((U_svd.T @ y) / s_svd)
residual_svd = np.linalg.norm(X @ w_svd - y)

print("Condition numbers:")
print("  kappa(X) =", round(kappa_X, 2))
print("  kappa(X^T X) =", round(kappa_G, 2))
print("  kappa(X^T X) / kappa(X)^2 =", round(kappa_G / kappa_X**2, 4), "(should be ~1)")
print("\nResiduals:")
print("  Normal equations:", round(residual_ne, 8))
print("  QR:", round(residual_qr, 8))
print("  SVD:", round(residual_svd, 8))
print("\nSolution errors:")
print("  NE error:", round(np.linalg.norm(w_ne - w_true), 4))
print("  QR error:", round(np.linalg.norm(w_qr - w_true), 4))
print("  SVD error:", round(np.linalg.norm(w_svd - w_true), 4))

Worked Example 2: Backward error analysis and numerical stability of QR vs. normal equations#

Introduction#

Solve the same least squares problem via QR and normal equations; compute backward and forward errors; verify that QR is backward-stable while normal equations magnify error.

Purpose#

Demonstrate backward error analysis in practice; show that backward-stable algorithm (QR) guarantees small forward error even on ill-conditioned problems.

Importance#

Foundation for trusting numerical algorithms; explains why stable algorithms are worth the computational cost.

What this example demonstrates#

  • Solve $Ax = b$ via QR factorization and normal equations.

  • Compute backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$.

  • Compute forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$.

  • Verify $\text{forward error} \approx \kappa(A) \times \text{backward error}$ for both methods.

Background#

Backward-stable algorithm: computed $\tilde{x}$ is exact solution to $(A + \Delta A) \tilde{x} = b$ with small $\Delta A$.

Historical context#

Wilkinson (1961) formalized backward error analysis; revolutionary impact on numerical linear algebra.

History#

Modern algorithms designed to be backward-stable (QR, SVD, Cholesky); guarantees via theorems in Higham (2002).

Prevalence in ML#

Used in numerical software certification; backward error commonly reported in solvers.

Notes#

  • QR backward error: $O(\varepsilon_{\text{mach}} \| A \|)$ (small).

    • Normal equations: the effective backward error relative to the original problem grows like $\kappa(A) \cdot \varepsilon_{\text{mach}}$ (amplified).

Connection to ML#

Backward stability ensures that model parameters are not polluted by rounding error; prediction is numerically sound.

Connection to Linear Algebra Theory#

Backward error analysis decouples algorithm stability from problem conditioning.

Pedagogical Significance#

Shows that algorithm design (not just theory) determines practical accuracy.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np
np.random.seed(21)

# Ill-conditioned system
n = 50
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # kappa ~ 1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
x_true = np.random.randn(n)
b = A @ x_true

# Solve via QR
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Solve via normal equations
G = A.T @ A
x_ne = np.linalg.solve(G, A.T @ b)

# Reference: x_true is known exactly by construction
x_ref = x_true

# Backward errors
residual_qr = np.linalg.norm(A @ x_qr - b)
residual_ne = np.linalg.norm(A @ x_ne - b)
backward_error_qr = residual_qr / np.linalg.norm(b)
backward_error_ne = residual_ne / np.linalg.norm(b)

# Forward errors
forward_error_qr = np.linalg.norm(x_qr - x_ref) / np.linalg.norm(x_ref)
forward_error_ne = np.linalg.norm(x_ne - x_ref) / np.linalg.norm(x_ref)

# Condition number
kappa = np.linalg.cond(A)

# Verify: forward_error ≈ kappa × backward_error
pred_forward_qr = kappa * backward_error_qr
pred_forward_ne = kappa * backward_error_ne

print("Condition number kappa(A):", round(kappa, 0))
print("\nQR Factorization:")
print("  Backward error:", format(backward_error_qr, '.2e'))
print("  Observed forward error:", format(forward_error_qr, '.2e'))
print("  Predicted forward error:", format(pred_forward_qr, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_qr / pred_forward_qr, 2))
print("\nNormal Equations:")
print("  Backward error:", format(backward_error_ne, '.2e'))
print("  Observed forward error:", format(forward_error_ne, '.2e'))
print("  Predicted forward error:", format(pred_forward_ne, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_ne / pred_forward_ne, 2))
print("\nConclusion: QR is backward-stable; NE backward error is larger.")

Worked Example 3: Machine precision and loss of significance in subtraction#

Introduction#

Demonstrate catastrophic cancellation when subtracting nearby numbers; show how orthogonal transformations avoid this.

Purpose#

Illustrate fundamental floating-point pitfall; motivate use of orthogonal algorithms (QR, SVD, Householder).

Importance#

Concrete example of why naive implementations fail; builds intuition for numerical stability.

What this example demonstrates#

  • Compute $c = a - b$ where $a \approx b$ (many leading digits cancel).

  • Compare relative error in $c$ vs. in $a, b$ individually.

  • Show that orthogonal transformations (e.g., Householder reflection) preserve norms and avoid subtraction.

Background#

Subtraction of nearly equal numbers is inherently unstable; cannot be fixed by higher precision alone.

Historical context#

Recognized early in computer arithmetic; led to development of Householder reflections (1958) and other orthogonal methods.

History#

Modern best practice: avoid subtraction of near-equal numbers; use transformation-based algorithms (QR, SVD).

Prevalence in ML#

Common issue in numerical gradient computation, vector normalization, and Gram–Schmidt orthogonalization.

Notes#

  • Relative error in $c$ can be $O(1)$ even if $a, b$ are accurate.

  • Classical Gram–Schmidt suffers; modified Gram–Schmidt fixes by reorthogonalization.

Connection to ML#

Adversarial robustness and numerical gradient computation are vulnerable to cancellation; careful implementation required.

Connection to Linear Algebra Theory#

Orthogonal transformations (Householder, Givens) have condition number $\kappa = 1$; stable regardless of input.

Pedagogical Significance#

Demonstrates limit of forward-stable algorithms; backward stability (computing exact solution to nearby problem) is more achievable.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

Solution (Python)#

import numpy as np

# Example 1: Catastrophic cancellation
a = 1.0 + 1e-12    # representable only to ~1e-16 absolute accuracy
b = 1.0
c = a - b          # subtraction is exact, but the stored rounding error now dominates
print("Example: Subtraction of nearly equal numbers")
print("  a = 1.0 + 1e-12 =", repr(a))
print("  b = 1.0 =", repr(b))
print("  c = a - b =", repr(c))
print("  Exact: 1e-12, Computed:", c)
print("  Relative error:", abs(c - 1e-12) / 1e-12)  # ~1e-4: about 12 digits lost

# Example 2: Loss of significance in a sum (absorption then cancellation)
import math
print("\nExample: Loss of significance in a sum")
terms = [1e16, 1.0, -1e16]
naive = (terms[0] + terms[1]) + terms[2]  # 1.0 is absorbed by 1e16, then cancelled
safe = (terms[0] + terms[2]) + terms[1]   # cancel the big terms first
print("  Naive order:", naive)            # 0.0 -- the 1.0 is lost entirely
print("  Safe order: ", safe)             # 1.0
print("  math.fsum:  ", math.fsum(terms)) # 1.0 (compensated summation)

# Example 3: Orthogonal transformation vs. subtraction
print("\nExample: Orthogonal transformation avoids cancellation")
# Create orthogonal matrix (preserve norms)
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
v = np.array([1.0, 0.0])
v_rot = Q @ v
print("  ||v|| =", np.linalg.norm(v))
print("  ||Q v|| =", np.linalg.norm(v_rot), "(should equal ||v||)")
print("  Orthogonal transformation preserves norms: stable!")

Worked Example 4: Preconditioning and condition number reduction#

Introduction#

Construct an ill-conditioned SPD system; solve via unpreconditioned CG and preconditioned CG with diagonal preconditioner; measure iteration count reduction.

Purpose#

Show quantitatively how preconditioning reduces condition number and accelerates convergence.

Importance#

Essential technique for large-scale optimization; directly translates to practical speedups.

What this example demonstrates#

  • Construct ill-conditioned SPD matrix (exponentially decaying eigenvalues).

  • Solve via CG (count iterations to convergence).

  • Construct diagonal (Jacobi) preconditioner $M = \text{diag}(A)$.

  • Solve via preconditioned CG (count iterations).

  • Compute effective $\kappa(M^{-1} A)$ and verify iteration count reduction.

Background#

CG iteration count $\approx \sqrt{\kappa(A)}$; preconditioning reduces $\kappa \to \kappa(M^{-1} A)$.

Historical context#

Preconditioning recognized as essential by 1970s; became standard in iterative solvers.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg (all offer preconditioning options).

Prevalence in ML#

L-BFGS uses quasi-Newton preconditioning; preconditioned SGD in modern optimizers.

Notes#

  • Iteration count $\approx \sqrt{\kappa}$; reducing $\kappa$ by 100 reduces iterations by $\sqrt{100} = 10$.

  • Preconditioner setup cost must be amortized over multiple solves.

Connection to ML#

Optimization accelerators: Newton’s method, natural gradient, adaptive methods all precondition the gradient.

Connection to Linear Algebra Theory#

Preconditioning clusters eigenvalues; standard result in iterative methods.

Pedagogical Significance#

Demonstrates power of structure exploitation (knowing $M \approx A$) to improve algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  3. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(23)

# Create an SPD matrix that is ill-conditioned because of bad row/column
# scaling -- exactly the situation a diagonal (Jacobi) preconditioner fixes
n = 200
U, _ = np.linalg.qr(np.random.randn(n, n))
B = U @ np.diag(np.linspace(1.0, 2.0, n)) @ U.T   # well-conditioned core
D = np.diag(np.logspace(0, -1.5, n))              # badly scaled "units"
A = D @ B @ D                                     # kappa(A) ~ 1e3 * kappa(B)
A = (A + A.T) / 2
b = np.random.randn(n)

# Unpreconditioned CG (on SciPy < 1.12, pass tol=1e-8 instead of rtol=1e-8)
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_un, info_un = spla.cg(A, b, rtol=1e-8, callback=callback_un, maxiter=5 * n)

# Preconditioned CG: M should approximate A^{-1}; Jacobi uses diag(A)^{-1}
M_jacobi = sp.diags(1.0 / np.diag(A))
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_pre, info_pre = spla.cg(A, b, M=M_jacobi, rtol=1e-8,
                          callback=callback_pre, maxiter=5 * n)

kappa_A = np.linalg.cond(A)
# Symmetric Jacobi scaling has the same spectrum as M^{-1} A
d_sqrt = np.sqrt(np.diag(A))
kappa_AM = np.linalg.cond(A / np.outer(d_sqrt, d_sqrt))

print("Condition numbers:")
print("  kappa(A) =", round(kappa_A, 0))
print("  kappa(M^{-1} A) ≈", round(kappa_AM, 0), "(after diagonal preconditioning)")
print("\nCG Convergence:")
print("  Unpreconditioned iterations:", len(residuals_unprecond))
print("  Preconditioned iterations:", len(residuals_precond))
print("  Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("  Predicted speedup (~sqrt(kappa)):", round(np.sqrt(kappa_A / kappa_AM), 1))
print("\nBoth solutions satisfy A x = b:",
      np.allclose(A @ x_un, b) and np.allclose(A @ x_pre, b))

Worked Example 5: Regularization and effective condition number#

Introduction#

Construct an ill-posed rectangular system (underdetermined or rank-deficient); solve via Tikhonov regularization with varying $\lambda$; plot condition number reduction and solution norm vs. residual (L-curve).

Purpose#

Demonstrate how regularization shifts small singular values and reduces effective condition number.

Importance#

Core technique in inverse problems, ridge regression, and ill-posed ML problems.

What this example demonstrates#

  • Construct rank-deficient or ill-posed system (exponentially decaying singular values).

  • Solve regularized system $(A^\top A + \lambda I) w = A^\top b$ for range of $\lambda$.

  • Compute effective condition number: $\kappa_\lambda = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$.

  • Plot L-curve: solution norm vs. residual; optimal $\lambda$ at corner.

  • Verify that small $\lambda$ gives large solution, large $\lambda$ gives small solution but large residual.

Background#

Tikhonov adds penalty; shifts eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$.
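The shift $\sigma_i^2 \to \sigma_i^2 + \lambda$ is equivalent to filtering each SVD component of the solution by the factor $f_i = \sigma_i^2 / (\sigma_i^2 + \lambda)$; a quick NumPy check on a small synthetic problem (dimensions and $\lambda$ here are illustrative, not from the worked example below):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
lam = 0.1

# Direct Tikhonov solve: (A^T A + lam I) w = A^T b
w_direct = np.linalg.solve(A.T @ A + lam * np.eye(8), A.T @ b)

# Equivalent spectral form: each SVD component of the unregularized
# solution is damped by the filter factor f_i = sigma_i^2 / (sigma_i^2 + lam)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = s**2 / (s**2 + lam)
w_svd = Vt.T @ (f * (U.T @ b) / s)

print(np.allclose(w_direct, w_svd))  # True
```

For small $\sigma_i$ the factor $f_i \approx \sigma_i^2/\lambda \to 0$, so the noisy directions are suppressed; for $\sigma_i^2 \gg \lambda$, $f_i \approx 1$ and the component passes through unchanged.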

Historical context#

Tikhonov (1963) for ill-posed problems; Hansen (1998) developed regularization parameter selection tools.

History#

Standard in inverse problems (deblurring, tomography); ridge regression standard in ML.

Prevalence in ML#

Ridge regression, elastic net, dropout, early stopping all act as implicit regularization.

Notes#

  • Optimal $\lambda$ balances fit (small residual) and solution norm (small solution).

  • L-curve method (Hansen 1992): plot $\log(\| r \|)$ vs. $\log(\| w \|)$; the corner indicates the optimal $\lambda$. Generalized cross-validation (Golub et al. 1979) is a common alternative.

Connection to ML#

Bias-variance trade-off: underfitting vs. overfitting; regularization parameter selection critical.

Connection to Linear Algebra Theory#

Spectral filtering: Tikhonov filters small singular values; LSQR via early stopping does implicit filtering.

Pedagogical Significance#

Shows how problem structure (ill-posedness) guides algorithm design (regularization).

References#

  1. Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter.

Solution (Python)#

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(24)

# Construct ill-posed system
m, n = 100, 40
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T

# True solution and noisy data
x_true = np.zeros(n)
x_true[:5] = [10, 8, 6, 4, 2]
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# Solve Tikhonov for range of lambda
lams = np.logspace(-6, 2, 50)
sol_norms = []
residuals = []
cond_numbers = []

for lam in lams:
    G = A.T @ A + lam * np.eye(n)
    w = np.linalg.solve(G, A.T @ b)
    sol_norms.append(np.linalg.norm(w))
    residuals.append(np.linalg.norm(A @ w - b))
    cond_numbers.append(np.linalg.cond(G))

sol_norms = np.array(sol_norms)
residuals = np.array(residuals)
cond_numbers = np.array(cond_numbers)

# Find corner of L-curve (crude discrete-curvature proxy on the log-log curve)
log_res = np.log(residuals)
log_norm = np.log(sol_norms)
curvature = np.gradient(np.gradient(log_res) / np.gradient(log_norm))
opt_idx = np.argmax(curvature)
lam_opt = lams[opt_idx]

# Solve with optimal lambda
G_opt = A.T @ A + lam_opt * np.eye(n)
w_opt = np.linalg.solve(G_opt, A.T @ b)

print("Regularization parameter selection (L-curve method):")
print("  Optimal lambda:", format(lam_opt, '.2e'))
print("  At optimal lambda:")
print("    Solution norm:", round(np.linalg.norm(w_opt), 4))
print("    Residual norm:", round(np.linalg.norm(A @ w_opt - b), 4))
print("    Condition number:", round(np.linalg.cond(G_opt), 2))
print("\nFor comparison:")
print("  Smallest lambda: cond =", round(cond_numbers[0], 0), ", residual =", 
      format(residuals[0], '.2e'))
print("  Largest lambda: cond =", round(cond_numbers[-1], 0), ", residual =", 
      format(residuals[-1], '.2e'))
print("\nTrue solution norm:", round(np.linalg.norm(x_true), 4))
print("Solution error at optimal lambda:", round(np.linalg.norm(w_opt - x_true), 4))

Chapter 13
Solving Systems
Key ideas: Introduction

Introduction#

Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares reduces to the normal equations $X^\top X w = X^\top y$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization relies on preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.

Important ideas#

  1. Gaussian elimination and LU factorization

    • Direct method: $A = LU$ via row operations.

    • Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.

    • Cost: $O(n^3)$ for dense matrices; numerically risky without pivoting (small pivots amplify rounding error).

  2. Cholesky factorization for SPD systems

    • For symmetric positive definite $A$: $A = L L^\top$ (one triangle).

    • Cost: $O(n^3 / 3)$ (half LU cost); more stable than LU.

    • Numerically stable if $A$ is well-conditioned.

  3. QR factorization and least squares

    • $A = QR$ with orthonormal $Q$ and upper triangular $R$.

    • Numerically stable; used for least squares (Chapter 12).

    • Cost: $O(n d^2)$ for an $n \times d$ matrix with $n \ge d$.

  4. Iterative solvers and conjugate gradient

    • Conjugate gradient (CG): for SPD systems, exact in at most $n$ iterations (in exact arithmetic); practical convergence in $\ll n$ iterations.

    • GMRES/MINRES: for general/symmetric non-SPD systems.

    • Cost per iteration: $O(\mathrm{nnz}(A))$ (number of nonzeros); scales to $n = 10^8+$.

  5. Preconditioning and conditioning

    • Condition number $\kappa(A)$ determines iteration complexity: CG error contracts by a factor $\rho \approx (\sqrt{\kappa} - 1) / (\sqrt{\kappa} + 1)$ per iteration.

    • Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.

    • Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.

  6. Sparse matrix structure

    • Banded, tridiagonal, block-structured systems exploit locality.

    • Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.

    • Fill-in during factorization can destroy sparsity; ordering matters.

  7. Rank deficiency and ill-posedness

    • Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.

    • Ill-conditioned $A$ (nearly rank-deficient) amplifies noise; require stabilization.

    • Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.
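Point 5 above can be made concrete: for an ill-scaled SPD matrix, even the trivial Jacobi (diagonal) preconditioner collapses the condition number. A minimal sketch (the matrix construction is ours; the symmetric scaling $D_A^{-1/2} A D_A^{-1/2}$ stands in for $M^{-1} A$, to which it is similar):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50

# Well-conditioned SPD core (eigenvalues in [1, 2]) wrapped in wild scaling
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = Q @ np.diag(np.linspace(1.0, 2.0, n)) @ Q.T
D = np.diag(10.0 ** rng.uniform(-3, 3, n))
A = D @ B @ D                          # SPD but terribly scaled

# Symmetric Jacobi (diagonal) preconditioning: diag(A)^{-1/2} A diag(A)^{-1/2}
d = 1.0 / np.sqrt(np.diag(A))
A_pre = A * d[:, None] * d[None, :]

print("kappa(A)             = %.2e" % np.linalg.cond(A))
print("kappa(Jacobi-scaled) = %.2e" % np.linalg.cond(A_pre))
```

The scaled matrix's condition number drops from astronomically large back to near that of the well-conditioned core, which is exactly why Jacobi is a cheap first preconditioner to try.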

Relevance to ML#

  • Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.

  • Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.

  • Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.

  • Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.

  • Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.

Algorithmic development (milestones)#

  • 1670s: Newton’s method and early algebraic solutions.

  • 1810: Gaussian elimination formalized (Gauss, Legendre).

  • c. 1910: Cholesky decomposition developed (André-Louis Cholesky; published posthumously in 1924).

  • 1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.

  • 1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.

  • 1982: LSQR algorithm (Paige–Saunders); numerically stable for least squares.

  • 1975 and 1986: MINRES (Paige–Saunders) and GMRES (Saad–Schultz); iterative methods for symmetric indefinite and non-symmetric systems.

  • 1987–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enable $O(n)$ scaling.

  • 2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.

Definitions#

  • Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.

  • LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.

  • Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.

  • Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.

  • Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.

  • Residual: $r = b - A x$; measure of solution error.

  • Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.

  • Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.

  • Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

  • Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.
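The two triangular solves defined above are short enough to write out in full; a minimal sketch (helper names are ours):

```python
import numpy as np

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L (O(n^2))."""
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve U x = y for upper-triangular U (O(n^2))."""
    x = np.zeros_like(y, dtype=float)
    for i in range(len(y) - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)               # SPD by construction
b = rng.standard_normal(4)
L = np.linalg.cholesky(A)
x = back_sub(L.T, forward_sub(L, b))  # solves L L^T x = b
print(np.allclose(A @ x, b))          # True
```

Every direct method in this chapter reduces to these two $O(n^2)$ loops once a factorization is in hand.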

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Gaussian elimination and LU: Forward elimination, row pivoting for stability, $A = LU$ factorization. References: Golub & Van Loan (2013); Trefethen & Bau (1997).

  • Cholesky factorization: $A = L L^\top$ for SPD matrices; numerical stability via diagonal dominance and conditioning. Reference: Golub & Van Loan (2013).

  • Forward/back-substitution: $O(n^2)$ solve for triangular systems; essential subroutine.

  • QR factorization: $A = QR$ with orthonormal $Q$; stable least squares (Chapter 12). Reference: Trefethen & Bau (1997).

  • Conjugate gradient: Minimize $\frac{1}{2} x^\top A x - b^\top x$ on Krylov subspace; optimality in $n$ iterations for SPD. Reference: Hestenes & Stiefel (1952).

  • Condition number and residual analysis: $\kappa(A) = \sigma_1 / \sigma_n$; backward error bounds. Reference: Wilkinson (1961).

  • Preconditioning: Transform $A \to M^{-1} A$ to reduce condition number. Reference: Axelsson (1994).

Applied (landmark systems)#

  • Cholesky solve for Gram matrices: $(X^\top X) w = X^\top y$ via Cholesky; scikit-learn LinearRegression with solver='cholesky'. Reference: Hastie et al. (2009).

  • CG for large-scale least squares: CGLS (CG on normal equations) for $n, d > 10^6$; scikit-learn Ridge with solver='sparse_cg'. Reference: Paige & Saunders (1982).

  • Gaussian process inference: Cholesky of $K$ for marginal likelihood; approximate GPs via inducing points reduce $O(n^3)$ to $O(m^3)$. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005); Hensman et al. (2015).

  • Preconditioned optimization: L-BFGS with Hessian approximation; widely used in TensorFlow/PyTorch. Reference: Nocedal & Wright (2006).

  • Graph Laplacian solvers: Fast Poisson equation solvers on mesh/graphs; ChebNet (Defferrard et al. 2016); enables scalable GNNs. Reference: Defferrard et al. (2016).

  • Inverse problems: LSQR iterative method; clinical deployment in medical imaging. Reference: Vogel (2002); Bardsley et al. (2012).

Key ideas: Where it shows up
  1. Least squares and kernel ridge regression

    • Solve systems with $A = X^\top X$ (Gram matrix) or $A = K + \lambda I$ (kernel matrix).

    • Achievements: scikit-learn, PyTorch linear solvers; KRR standard in Gaussian processes. References: Rasmussen & Williams (2006); Scholkopf & Smola (2002).

  2. Gaussian processes and covariance systems

    • Solve $K x = y$ where $K$ is $n \times n$ covariance matrix.

    • Achievements: Cholesky solve in Stan, PyMC3; approximate inference via sparse GPs (inducing points). References: Quinonero-Candela & Rasmussen (2005); Snelson & Ghahramani (2005).

  3. Optimization and preconditioning

    • Preconditioned gradient descent: $x_{t+1} = x_t - \alpha M^{-1} \nabla f(x_t)$.

    • Achievements: L-BFGS preconditioner reduces iteration count by factor of $10$-$100$; quasi-Newton methods. References: Nocedal & Wright (2006); Martens & Grosse (2015).

  4. Graph neural networks and sparse convolutions

    • Solve graph Laplacian systems $L x = b$ for diffusion, smoothing, attention mechanisms.

    • Achievements: GraphSAGE, GCN via approximate polynomial filters; scalable to graphs with $10^9$ nodes. References: Defferrard et al. (2016) (ChebNet); Kipf & Welling (2017) (GCN).

  5. Inverse problems and regularized imaging

    • Solve Tikhonov system $(A^\top A + \lambda I) x = A^\top b$ for ill-posed deconvolution, tomography, parameter estimation.

    • Achievements: iterative methods in medical imaging (CGLS for CT/MRI); LSQR in seismic inversion. References: Hansen (1998); Vogel (2002); Bardsley et al. (2012).

Notation
  • Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}$ (or $\mathbb{R}^{m \times n}$ overdetermined).

  • LU factorization: $A = L U$ with $L \in \mathbb{R}^{n \times n}$ lower triangular (unit diagonal), $U \in \mathbb{R}^{n \times n}$ upper triangular.

  • Cholesky: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite; $L \in \mathbb{R}^{n \times n}$ lower triangular.

  • QR factorization: $A = QR$ with $Q \in \mathbb{R}^{m \times n}$ orthonormal, $R \in \mathbb{R}^{n \times n}$ upper triangular.

  • Residual: $r = b - A x \in \mathbb{R}^n$; goal is $\lVert r \rVert \ll \lVert b \rVert$.

  • Conjugate gradient: $x_k$ minimizes $\frac{1}{2} x^\top A x - b^\top x$ over $k$-dimensional Krylov subspace $\text{span}(r_0, A r_0, \ldots, A^{k-1} r_0)$.

  • Preconditioner: $M \approx A$ (cheap to invert); solve $M^{-1} A x = M^{-1} b$.

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); measures sensitivity to perturbation.

  • Example: $A = \begin{pmatrix} 10 & 1 \\ 1 & 1 \end{pmatrix}$ has $\sigma_1 \approx 10.11, \sigma_2 \approx 0.89$, so $\kappa(A) \approx 11.4$ (mildly ill-conditioned).

Pitfalls & sanity checks
  • Never solve normal equations $A^\top A x = A^\top b$ directly for ill-conditioned $A$: Use QR, SVD, or iterative methods (LSQR) instead; $\kappa(A^\top A) = \kappa(A)^2$.

  • Verify SPD before Cholesky: Non-SPD matrices cause NaN/Inf; test via eigenvalues or try-catch.

  • Check residual convergence: $\lVert A x - b \rVert$ should decrease monotonically in iterative solvers; stagnation signals bad conditioning or preconditioner failure.

  • Preconditioning setup cost: If solve is run once, setup overhead may exceed savings; only cost-effective for multiple solves with same $A$.

  • Fill-in in sparse LU: Sparse matrix factorization can become dense; use reordering (minimum degree, nested dissection) to minimize fill-in.

  • Early stopping in LSQR: Stop iteration based on residual norm or discrepancy principle; continuing to convergence amplifies noise.

  • Regularization parameter selection: Cross-validation, L-curve, or discrepancy principle; do not fit $\lambda$ on training data.

  • Scaling sensitivity: Ill-scaled systems (different row/column magnitudes) become ill-conditioned; normalize before solving.
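The first pitfall, $\kappa(A^\top A) = \kappa(A)^2$, is easy to verify numerically; a small sketch with a synthetic matrix of known singular values (construction ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 10

# Build A with prescribed singular values spanning four orders of magnitude
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -4, n)
A = U @ np.diag(s) @ V.T

print("kappa(A)     = %.2e" % np.linalg.cond(A))        # ~1e4
print("kappa(A^T A) = %.2e" % np.linalg.cond(A.T @ A))  # ~1e8
```

With $\kappa(A) = 10^4$ the normal equations already sit at $10^8$; another two orders of magnitude and the Gram matrix is numerically singular in float64, while QR or LSQR on $A$ itself remains usable.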

References

Foundational algorithms

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.

  2. Cholesky, A. L. (1910). Sur la résolution numérique des systèmes d’équations linéaires.

Classical theory and numerical stability

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  3. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Iterative methods and conjugate gradient

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.

  3. Saad, Y., & Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.

  4. Axelsson, O. (1994). Iterative Solution Methods.

Preconditioning and multigrid

  1. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

Applied: Machine learning and inverse problems

  1. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

  5. Quinonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression.

  6. Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs.

  7. Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering.

  8. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  9. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  10. Hensman, J., Matthews, A. G. D. E., & Ghahramani, Z. (2015). Scalable variational Gaussian process classification.

  11. Scholkopf, B., & Smola, A. J. (2002). Learning with Kernels.

  12. Bardsley, J. M., Chung, J., & Palmer, K. (2012). Regularization parameter selection methods for linear least squares problems.

Five worked examples

Worked Example 1: Gaussian elimination, LU factorization, and pivoting#

Introduction#

Solve a system $A x = b$ via LU factorization with partial pivoting; verify stability and reconstruction.

Purpose#

Illustrate direct method for dense systems; compare unpivoted LU (numerically risky) to pivoted LU (numerically stable).

Importance#

LU is foundation of dense linear algebra; pivoting is critical for avoiding catastrophic cancellation.

What this example demonstrates#

  • Compute LU factorization via Gaussian elimination (with and without pivoting).

  • Verify $A = L U$ and solve via forward/back-substitution.

  • Compute condition number and demonstrate error growth without pivoting on ill-conditioned matrix.

Background#

Partial pivoting (swapping rows to maximize pivot) prevents small pivots from amplifying errors during elimination.

Historical context#

Numerical instability of unpivoted LU recognized mid-20th century; pivoting became standard.

History#

Modern linear algebra libraries (LAPACK) default to pivoted LU.

Prevalence in ML#

Used in scikit-learn’s least squares and in matrix factorization algorithms.

Notes#

  • Pivoting is automatic in most libraries; rarely explicit in practice.

  • For sparse matrices, fill-in during elimination can destroy sparsity; sparse LU uses ordering strategies (minimum degree, nested dissection).

Connection to ML#

Condition number of $A$ determines solver reliability; poor conditioning necessitates regularization.

Connection to Linear Algebra Theory#

$\kappa(A) = \sigma_1 / \sigma_n$ predicts error magnification; LU error $\sim \kappa(A) \times \text{machine eps}$.

Pedagogical Significance#

Concrete demonstration of numerical stability and pivoting strategy.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.linalg
np.random.seed(10)

# Create moderately ill-conditioned system
n = 5
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, n)
A = U @ np.diag(s) @ U.T
b = np.random.randn(n)

# Compute LU with partial pivoting (scipy.linalg.lu convention: A = P @ L @ U)
P, L, U_lu = scipy.linalg.lu(A)
kappa_A = np.linalg.cond(A)

# Verify A = P L U
A_recon = P @ L @ U_lu
recon_error = np.linalg.norm(A - A_recon)

# Solve via LU: forward sub (L y = P^T b), back sub (U x = y)
y = scipy.linalg.solve_triangular(L, P.T @ b, lower=True)
x_lu = scipy.linalg.solve_triangular(U_lu, y, lower=False)

# Solve directly (for comparison)
x_direct = np.linalg.solve(A, b)

print("Condition number kappa(A):", round(kappa_A, 2))
print("LU reconstruction error:", round(recon_error, 8))
print("LU solution matches direct:", np.allclose(x_lu, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_lu - b), 8))

Worked Example 2: Cholesky factorization for symmetric positive definite systems#

Introduction#

Solve a symmetric positive definite (SPD) system via Cholesky factorization; verify stability and computational savings.

Purpose#

Show how SPD structure reduces computation by half and improves stability vs. general LU.

Importance#

Cholesky is standard for covariance matrices, Gram matrices, and Hessians in optimization.

What this example demonstrates#

  • Construct SPD matrix (e.g., covariance or Gram).

  • Compute Cholesky $A = L L^\top$ and verify reconstruction.

  • Solve via forward/back-substitution on $L$.

  • Compare to general LU: cost $O(n^3 / 3)$ vs. $O(n^3)$.

Background#

For SPD $A$, Cholesky is more stable than LU (no pivoting needed) and faster (half the operations).

Historical context#

Cholesky rediscovered in early 1900s; became standard method for SPD systems by mid-20th century.

History#

LAPACK dpotrf is gold-standard Cholesky implementation.

Prevalence in ML#

Used in kernel ridge regression, Gaussian process inference, and proximal methods.

Notes#

  • Fails gracefully if $A$ is not SPD; returns error (no negative square roots).

  • Numerical breakdown signals ill-conditioning; add small jitter ($A + \epsilon I$) if needed.
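The jitter trick from the notes above is commonly wrapped in a small retry loop; a sketch of that practice (the helper name and parameters are ours, not a library API):

```python
import numpy as np

def safe_cholesky(A, max_tries=6, jitter=1e-10):
    """Cholesky with increasing diagonal jitter on failure (common GP practice)."""
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(A)
        except np.linalg.LinAlgError:
            A = A + jitter * np.eye(A.shape[0])
            jitter *= 10
    raise np.linalg.LinAlgError("matrix is not PD even after jitter")

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))
K = X @ X.T                    # rank-2 Gram matrix: PSD but singular
L = safe_cholesky(K)
print(np.allclose(L @ L.T, K, atol=1e-6))  # True
```

The jitter perturbs the factorized matrix by at most a few multiples of the final $\epsilon$, which is negligible next to the noise already present in most ML covariance estimates.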

Connection to ML#

Covariance and Gram matrices are always SPD; Cholesky enables efficient sampling and likelihood computation in probabilistic models.

Connection to Linear Algebra Theory#

Cholesky exists iff all leading principal minors are positive (Sylvester criterion).

Pedagogical Significance#

Demonstrates how structure (symmetry, PSD) enables algorithmic optimization.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

Solution (Python)#

import numpy as np
import scipy.linalg
np.random.seed(11)

# Create SPD matrix (covariance-like)
n = 5
A_temp = np.random.randn(n, n)
A = A_temp.T @ A_temp + np.eye(n)  # SPD by construction
b = np.random.randn(n)

# Cholesky factorization
L = np.linalg.cholesky(A)

# Verify A = L L^T
A_recon = L @ L.T
recon_error = np.linalg.norm(A - A_recon)

# Solve via forward/back-substitution
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_chol = scipy.linalg.solve_triangular(L.T, y, lower=False)

# Direct solve for comparison
x_direct = np.linalg.solve(A, b)

# Condition number
kappa_A = np.linalg.cond(A)

print("Condition number kappa(A):", round(kappa_A, 2))
print("Cholesky reconstruction error:", round(recon_error, 8))
print("Cholesky solution matches direct:", np.allclose(x_chol, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_chol - b), 8))
print("Lower triangular L diagonal:", np.round(np.diag(L), 4))

Worked Example 3: Conjugate gradient for large sparse systems#

Introduction#

Solve a large sparse SPD system via conjugate gradient (CG); demonstrate convergence and compare to direct Cholesky.

Purpose#

Show how iterative methods scale to large systems while respecting sparsity.

Importance#

CG is the standard iterative solver for large-scale ML and scientific computing.

What this example demonstrates#

  • Construct sparse SPD system (e.g., graph Laplacian or 2D Poisson discretization).

  • Solve via CG and via direct Cholesky.

  • Plot convergence: residual norm vs. iteration.

  • Measure wall-clock time and memory for dense vs. sparse approaches.

Background#

CG finds solution in at most $n$ iterations (theory); practical convergence in $\ll n$ iterations for well-conditioned systems.

Historical context#

Hestenes & Stiefel (1952); rediscovered in 1970s as demand for large-scale methods grew.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg.

Prevalence in ML#

Used in large-scale kernel methods, graph neural networks, and optimization.

Notes#

  • Convergence rate depends on condition number $\kappa(A)$; preconditioning improves rate dramatically.

  • Requires only matrix-vector products $A v$; no explicit $A$ storage needed.

Connection to ML#

Enables Gaussian process inference on large datasets; scales kernel methods from $10^4$ to $10^6$ points.

Connection to Linear Algebra Theory#

CG minimizes $\frac{1}{2} x^\top A x - b^\top x$ over Krylov subspace; conjugate directions ensure optimality.

Pedagogical Significance#

Demonstrates trade-off between direct (dense, $O(n^3)$, low iteration) and iterative (sparse-compatible, $O(n)$ per iteration, many iterations).

References#

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Golub & Kahan (1965). Calculating singular values and pseudo-inverses.

  3. Nocedal & Wright (2006). Numerical Optimization.

Solution (Python)#

import numpy as np
import scipy.linalg
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(12)

# Construct sparse SPD system (2D Laplacian discretization)
n_side = 20
n = n_side ** 2
# Discrete Laplacian on 2D grid (5-point stencil): 4 on the diagonal,
# -1 for each grid neighbor
I = np.arange(n)
J = np.arange(n)
V = 4 * np.ones(n)
# Vertical neighbors (one grid row apart)
I_v = np.arange(n - n_side)
J_v = I_v + n_side
V_v = -np.ones(len(I_v))
# Horizontal neighbors (adjacent within a row; skip row boundaries)
I_h = np.arange(n - 1)
I_h = I_h[I_h % n_side != n_side - 1]
J_h = I_h + 1
V_h = -np.ones(len(I_h))

# Assemble symmetrically: each off-diagonal entry together with its transpose
rows = np.concatenate([I, I_v, J_v, I_h, J_h])
cols = np.concatenate([J, J_v, I_v, J_h, I_h])
vals = np.concatenate([V, V_v, V_v, V_h, V_h])
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
b = np.random.randn(n)

# CG solution (count iterations via callback; info == 0 signals convergence)
def residual_norm(x):
    return np.linalg.norm(A @ x - b)

iters = [0]
def count_iter(xk):
    iters[0] += 1

x_cg, info = spla.cg(A, b, rtol=1e-6, maxiter=n, callback=count_iter)

# Direct Cholesky on the dense matrix, for comparison
A_dense = A.toarray()
L = np.linalg.cholesky(A_dense)
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_direct = scipy.linalg.solve_triangular(L.T, y, lower=False)

print("Problem size (n):", n)
print("Sparsity: {:.1f}%".format(100 * A.nnz / (n ** 2)))
print("CG converged:", info == 0, "in", iters[0], "iterations")
print("CG residual:", round(residual_norm(x_cg), 8))
print("Direct residual:", round(residual_norm(x_direct), 8))
print("Solutions match:", np.allclose(x_cg, x_direct, atol=1e-5))

Worked Example 4: Preconditioning and conditioning number#

Introduction#

Solve an ill-conditioned SPD system with and without preconditioning; demonstrate acceleration and residual reduction.

Purpose#

Show how preconditioning reduces effective condition number and dramatically accelerates convergence.

Importance#

Preconditioning is essential for large-scale optimization and inverse problems.

What this example demonstrates#

  • Construct ill-conditioned SPD system (exponentially decaying eigenvalues).

  • Solve via unpreconditioned CG and via diagonal (Jacobi) preconditioner.

  • Plot convergence: residual vs. iteration for both.

  • Verify that CG on $M^{-1} A$ has better convergence rate.

Background#

Convergence rate $\rho \approx \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per iteration; preconditioning reduces $\kappa(M^{-1} A)$.
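Plugging this rate into the standard CG error bound $\lVert e_k \rVert_A \le 2 \rho^k \lVert e_0 \rVert_A$ gives a back-of-envelope iteration count; a small sketch:

```python
import numpy as np

# Iterations needed for the CG bound 2 * rho^k to fall below 1e-8,
# with rho = (sqrt(kappa) - 1) / (sqrt(kappa) + 1)
for kappa in [10, 100, 1000, 10000]:
    rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    k = int(np.ceil(np.log(1e-8 / 2) / np.log(rho)))
    print(f"kappa = {kappa:6d}: rho = {rho:.4f}, iterations for 1e-8 ~ {k}")
```

The count grows like $\sqrt{\kappa}$: each 100-fold increase in $\kappa$ costs roughly a 10-fold increase in iterations, which is exactly the gap a good preconditioner closes.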

Historical context#

Preconditioning recognized as essential in 1970s–1980s for practical large-scale solvers.

History#

MINRES-QLP, GMRES with ILU preconditioning (Saad 1993); algebraic multigrid (Ruge–Stüben 1987).

Prevalence in ML#

Preconditioned gradients, L-BFGS with Hessian approximation, trust-region methods.

Notes#

  • Preconditioning trades: setup cost (compute/apply $M^{-1}$) for iteration reduction.

  • Diagonal (Jacobi) preconditioner: $M = \text{diag}(A)$; trivial but often effective.

Connection to ML#

Preconditioned gradient descent achieves faster convergence; L-BFGS implicitly preconditions via quasi-Newton approximation.

Connection to Linear Algebra Theory#

CG convergence depends on spectrum of $A$; preconditioning clusters eigenvalues of $M^{-1} A$ to reduce effective $\kappa$.

Pedagogical Significance#

Concrete demonstration of conditioning’s impact on algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal & Wright (2006). Numerical Optimization.

  3. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(13)

# Create ill-conditioned SPD matrix
n = 100
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # Condition number ~1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_unprecond(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_unprecond, _ = spla.cg(A, b, rtol=1e-8, callback=callback_unprecond)

# Preconditioned CG (diagonal/Jacobi preconditioner); scipy's M must
# approximate A^{-1}, so pass the inverse of the diagonal
M = sp.diags(1.0 / np.diag(A))
residuals_precond = []
def callback_precond(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_precond, _ = spla.cg(A, b, M=M, rtol=1e-8, callback=callback_precond)

kappa_A = np.linalg.cond(A)
print("Condition number kappa(A):", round(kappa_A, 0))
print("Unpreconditioned CG iterations:", len(residuals_unprecond))
print("Preconditioned CG iterations:", len(residuals_precond))
print("Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("Both solutions satisfy A x = b:", 
      np.allclose(A @ x_unprecond, b) and np.allclose(A @ x_precond, b))

Worked Example 5: LSQR for ill-posed inverse problems#

Introduction#

Solve an ill-posed least squares problem $A x \approx b$ via LSQR (iterative least squares QR); compare to direct pseudoinverse and Tikhonov regularization.

Purpose#

Demonstrate how LSQR stabilizes solutions to ill-posed systems without explicit regularization parameter tuning.

Importance#

LSQR is standard in medical imaging, geophysical inversion, and inverse problems.

What this example demonstrates#

  • Construct ill-posed rectangular system (tall, decaying singular values).

  • Solve via LSQR (iterative, early stopping acts as regularization).

  • Compare to pseudoinverse (noisy) and Tikhonov (requires $\lambda$ selection).

  • Plot regularization curves (solution norm vs. residual).

Background#

LSQR is mathematically equivalent to conjugate gradient on the normal equations $A^\top A x = A^\top b$, but avoids forming $A^\top A$ explicitly; stopping before convergence avoids amplifying noise.

Historical context#

Paige & Saunders (1982); foundational algorithm for inverse problems.

History#

CGLS, LSQR in scipy.sparse.linalg; widely used in medical imaging.

Prevalence in ML#

Used in sparse coding, compressed sensing, and large-scale least squares.

Notes#

  • Stopping iteration acts as regularization; no regularization parameter needed.

  • Discrepancy principle: stop when residual reaches expected noise level.
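The discrepancy principle in the second note can be sketched as follows (the test problem is synthetic and ours; re-running LSQR per candidate iteration count is wasteful but keeps the sketch short — in practice one monitors the residual inside a single run):

```python
import numpy as np
import scipy.sparse.linalg as spla

rng = np.random.default_rng(3)
m, n = 80, 30
# Synthetic ill-posed problem: Gaussian matrix with decaying column scales
A = rng.standard_normal((m, n)) @ np.diag(np.exp(-np.linspace(0, 4, n)))
x_true = rng.standard_normal(n)
sigma = 0.01
b = A @ x_true + sigma * rng.standard_normal(m)

# Discrepancy principle: stop once the residual matches the expected
# noise level, E||noise|| ~ sigma * sqrt(m), with safety factor tau = 1.1
target = 1.1 * sigma * np.sqrt(m)
for k in range(1, 200):
    x_k = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)[0]
    if np.linalg.norm(A @ x_k - b) <= target:
        break
print("discrepancy-principle stop at iteration", k)
```

Fitting the residual below the noise level would mean fitting the noise itself, so the first iterate whose residual reaches the noise floor is taken as the regularized solution.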

Connection to ML#

Implicit regularization; enables robust solutions without hyperparameter tuning.

Connection to Linear Algebra Theory#

LSQR on $\min_x \lVert A x - b \rVert^2$ without explicit pseudoinverse; early stopping filters small singular values.

Pedagogical Significance#

Demonstrates how algorithm structure (iteration count) can act as regularization.

References#

  1. Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(14)

# Construct ill-posed rectangular system
m, n = 120, 50
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T
x_true = np.zeros(n)
x_true[:8] = np.sin(np.linspace(0, 2*np.pi, 8))
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# LSQR with early stopping (implicit regularization); lsqr returns a
# 10-tuple whose first element is the solution
residuals_lsqr = []
x_lsqr_early = None
for k in [5, 10, 20, 50]:
    x_lsqr = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)[0]
    residuals_lsqr.append(np.linalg.norm(A @ x_lsqr - b))
    if k == 20:
        x_lsqr_early = x_lsqr

# LSQR run (almost) to convergence approximates the pseudoinverse solution (noisy)
x_pinv = spla.lsqr(A, b, atol=0, btol=0, iter_lim=1000)[0]

# Tikhonov with manual lambda (requires tuning)
lam = 0.01
G = A.T @ A + lam * np.eye(n)
x_tikhonov = np.linalg.solve(G, A.T @ b)

print("True solution norm:", round(np.linalg.norm(x_true), 4))
print("LSQR (early stop k=20) norm:", round(np.linalg.norm(x_lsqr_early), 4))
print("LSQR (early stop) error:", round(np.linalg.norm(x_lsqr_early - x_true), 4))
print("Pseudoinverse error:", round(np.linalg.norm(x_pinv - x_true), 4))
print("Tikhonov error:", round(np.linalg.norm(x_tikhonov - x_true), 4))
print("LSQR residual at k=20:", round(residuals_lsqr[2], 6))
