Chapter 14
Conditioning & Stability
Key ideas: Introduction

Introduction#

Numerical stability determines whether an algorithm produces a correct answer. Machine arithmetic operates on finite precision (float64: ~16 significant digits). An ill-conditioned problem (large $\kappa$) is inherently difficult: small input perturbations cause large output changes. Backward-stable algorithms guarantee that computed solution is exact solution to a nearby problem. Forward error depends on conditioning; backward error depends on algorithm stability. Together, they predict what accuracy is achievable. This is fundamental to reliable ML: ill-conditioned matrices appear in nearly singular Gram matrices, ill-posed inverse problems, and deep network Hessians.

Important ideas#

  1. Condition number and sensitivity

    • $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

    • Forward error bound: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \cdot \frac{\| \Delta A \|}{\| A \|}$ (perturbation on $A$ causes error in $x$).

    • Ill-conditioned ($\kappa \gg 1$) problems amplify noise; well-conditioned ($\kappa \approx 1$) are benign.

  2. Machine precision and floating-point arithmetic

    • Machine epsilon: $\varepsilon_{\text{machine}} \approx 2^{-52} \approx 2.2 \times 10^{-16}$ for float64.

    • Relative error per operation: $O(\varepsilon_{\text{machine}})$; accumulated error over $n$ operations: $O(n \cdot \varepsilon_{\text{machine}})$.

    • Loss of significance: if $a \approx b$, then $a - b$ has few correct digits (cancellation).

  3. Backward error analysis and stability

    • Algorithm is backward-stable if computed solution $\tilde{x}$ is exact solution to perturbed problem: $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon_{\text{machine}})$.

    • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|} \approx \kappa(A) \cdot O(\varepsilon_{\text{machine}})$ for stable algorithm.

    • Unstable algorithm: backward error is large; forward error can be catastrophic.

  4. Condition number of Gram matrix vs. original matrix

    • $\kappa(A^\top A) = \kappa(A)^2$; normal equations square the condition number.

    • QR/SVD avoid this: $\kappa(R) = \kappa(A)$ (no squaring), $\kappa(\Sigma) = \kappa(A)$.

    • For least squares: direct method on $A$ (QR) preferable to solving $A^\top A x = A^\top b$.

  5. Preconditioning and effective conditioning

    • Preconditioner $M \approx A$ transforms $A \to M^{-1} A$; goal is $\kappa(M^{-1} A) \ll \kappa(A)$.

    • Incomplete LU (ILU), Jacobi (diagonal), algebraic multigrid (AMG) are practical preconditioners.

    • CG iteration count $\approx \sqrt{\kappa(M^{-1} A)}$; reducing $\kappa$ by factor 100 reduces iterations by factor 10.

  6. Regularization and ill-posedness

    • Ill-posed problem: small changes in data cause arbitrarily large changes in solution ($\kappa = \infty$).

    • Tikhonov regularization: $(A^\top A + \lambda I) x = A^\top b$ shifts small singular values by $\lambda$.

    • Effective $\kappa$ after regularization: $\kappa_{\text{reg}} = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$; well-chosen $\lambda$ reduces condition number.

  7. Numerical cancellation and loss of significance

    • Subtraction $a - b$ where $a \approx b$ cancels leading digits; result has low precision.

    • Orthogonal transformations (QR, SVD, Householder) avoid this; preserve angles and norms.

    • Modified Gram–Schmidt (MGS) more stable than classical Gram–Schmidt due to reorthogonalization.

Relevance to ML#

  • Linear regression and least squares: Gram matrix $X^\top X$ can be ill-conditioned if features are correlated; normal equations fail; QR/SVD required.

  • Kernel methods and Gaussian processes: Gram/covariance matrix condition number determines sample complexity and numerical stability of Cholesky.

  • Deep learning and Hessian conditioning: Neural network loss Hessian is often ill-conditioned; affects convergence rate, learning rate selection, and second-order optimization.

  • Regularization and implicit bias: Ill-posed learning problems ($\kappa = \infty$) require regularization (L2, dropout, early stopping); reduces effective $\kappa$.

  • Numerical robustness in production: Conditioning determines whether model training is reproducible, whether predictions are stable to input noise, and whether inference is safe.

Algorithmic development (milestones)#

  • 1947: von Neumann & Goldstine analyze numerical stability of Gaussian elimination; discover amplification by machine epsilon.

  • 1961: Wilkinson develops comprehensive framework for backward error analysis; introduces condition number formally.

  • 1965: Golub & Kahan recognize $\kappa(A^\top A) = \kappa(A)^2$ is catastrophic for normal equations; advocate for QR/SVD.

  • 1971: Peters & Wilkinson develop LINPACK; implement numerically stable algorithms with careful pivot selection.

  • 1979–1990: Higham develops extended backward error analysis; proves stability of Cholesky, QR, modified Gram–Schmidt.

  • 1992: Arioli et al. develop a posteriori error bounds; can verify computed solution without knowing true solution.

  • 2000s: Preconditioning becomes standard in iterative solvers; algebraic multigrid (AMG) enables linear scaling.

  • 2010s: Automatic differentiation frameworks (TensorFlow, PyTorch) provide implicit solvers with backward stability analysis.

Definitions#

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); $\kappa_\infty(A) = \| A \|_\infty \| A^{-1} \|_\infty$ (operator norm).

  • Machine epsilon: $\varepsilon_{\text{machine}} = 2^{-(p-1)}$ for $p$-bit mantissa; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point number: Normalized form: $\pm m \times 2^e$ with $1 \le m < 2$ (hidden bit), $e \in [-1022, 1023]$.

  • Relative error: $\frac{\| \tilde{x} - x \|}{\| x \|}$ (dimensionless; independent of scaling).

  • Backward error: $\frac{\| \tilde{A} - A \|}{\| A \|}$ for perturbed matrix; small backward error $\Rightarrow$ computed $\tilde{x}$ is exact solution to nearby problem.

  • Forward error: $\frac{\| \tilde{x} - x \|}{\| x \|}$; depends on both backward error and conditioning: $\text{forward error} \approx \kappa(A) \times \text{backward error}$.

  • Ill-conditioned: $\kappa(A) \gg 1$; small perturbations cause large output changes.

  • Well-conditioned: $\kappa(A) \approx 1$; stable to perturbations.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Condition number definition and properties: $\kappa(A) = \sigma_1 / \sigma_n$; $\kappa(AB) \le \kappa(A) \kappa(B)$; $\kappa(A) \ge 1$. Reference: Golub & Van Loan (2013).

  • Perturbation theory and forward error bounds: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \frac{\| \Delta A \|}{\| A \|}$ (first-order). Reference: Wilkinson (1961); Higham (2002).

  • Backward error analysis: Algorithm is backward-stable if $\tilde{x}$ satisfies $(\tilde{A}) \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon)$. Reference: Higham (2002).

  • Machine precision and floating-point error: $\varepsilon_{\text{mach}} \approx 2^{-p}$; error per operation $O(\varepsilon)$; cumulative error over $n$ steps $O(n \varepsilon)$. Reference: IEEE 754 (2019); Goldberg (1991).

  • Condition number of Gram matrix: $\kappa(A^\top A) = \kappa(A)^2$; why normal equations are risky. Reference: Golub & Kahan (1965).

  • Preconditioning theory: Transform $A \to M^{-1} A$; goal $\kappa(M^{-1} A) \ll \kappa(A)$. Reference: Axelsson (1994).

  • Tikhonov regularization: Shift eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$; reduce effective condition number. Reference: Tikhonov (1963).

Applied (landmark systems)#

  • QR factorization for stable least squares: $\kappa(R) = \kappa(A)$ (no squaring); stable even for ill-conditioned $A$. Implementation: LAPACK dgeqrf; scikit-learn default solver. Reference: Golub & Kahan (1965); Golub & Van Loan (2013).

  • SVD for rank-deficient and ill-conditioned systems: Direct pseudoinverse computation; threshold small singular values. Implementation: LAPACK dgesvd; scipy.linalg.svd. Reference: Golub & Kahan (1965).

  • Cholesky with jitter for near-singular covariance: Add $\epsilon I$ if Cholesky fails; trade-off between accuracy and stability. Standard in GPyTorch, Pyro, Stan. Reference: Rasmussen & Williams (2006).

  • Modified Gram–Schmidt (MGS) for reorthogonalization: More stable than classical GS; requires reorthogonalization passes. Reference: Golub & Pereyra (1973); Yarlagadda et al. (1979).

  • Preconditioned conjugate gradient (PCG): L-BFGS (quasi-Newton), ILU preconditioning, algebraic multigrid reduce iteration count by factors 10–100. Standard in scipy.sparse.linalg, PETSc. Reference: Nocedal & Wright (2006); Saad (2003).

  • Ridge regression and elastic net for regularization: Manual $\lambda$ tuning or cross-validation; scikit-learn Ridge, ElasticNet. Reference: Hoerl & Kennard (1970); Zou & Hastie (2005).

Key ideas: Where it shows up
  1. Linear regression and least squares

    • Gram matrix $X^\top X$ condition number determines numerical stability of normal equations.

    • Achievements: QR/SVD solvers in scikit-learn, statsmodels ensure stability; warning when $\kappa > 10^{10}$. References: Hastie et al. (2009); Golub & Kahan (1965).

  2. Kernel methods and Gaussian processes

    • Covariance/kernel matrix $K$ condition number governs Cholesky stability and marginal likelihood evaluation.

    • Achievements: Cholesky with jitter in Stan/PyMC3; sparse GP approximations circumvent $O(n^3)$ and ill-conditioning. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005).

  3. Deep learning and second-order optimization

    • Hessian condition number determines convergence rate of Newton/quasi-Newton methods; affects learning rate and step size.

    • Achievements: L-BFGS (quasi-Newton) in PyTorch/TensorFlow adapts to Hessian conditioning; natural gradient methods rescale by Fisher information. References: Martens & Grosse (2015); Nocedal & Wright (2006).

  4. Regularization and implicit bias

    • Ill-posed learning problems require regularization; L2 penalty (ridge) shifts eigenvalues, reducing effective condition number.

    • Achievements: elastic net (L1+L2) in scikit-learn balances sparsity and conditioning; early stopping in SGD acts as implicit regularization. References: Tibshirani (1996); Zou & Hastie (2005); Zhu et al. (2021).

  5. Numerical robustness and reliability

    • Production ML systems must be stable to input noise, feature preprocessing, and computational precision.

    • Achievements: feature standardization reduces condition number; model monitoring tracks numerical stability (checks for NaN/Inf); validation on adversarial perturbations. References: Goodfellow et al. (2014) (adversarial robustness); LeCun et al. (1998) (feature normalization).

Notation
  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)} = \frac{\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)}$ (for SPD $A$).

  • Machine epsilon: $\varepsilon_{\text{mach}}$ or $\text{eps}$ in code; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point relative error: $\tilde{a} = a(1 + \delta)$ where $|\delta| \le \varepsilon_{\text{mach}}$.

  • Perturbation: $\tilde{A} = A + \Delta A$ with $\| \Delta A \| \le \varepsilon \| A \|$.

  • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$ (computed vs. true solution).

  • Backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$ (residual relative to data norm).

  • Norm: $\| A \|_2 = \sigma_1(A)$ (spectral norm); $\| A \|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius).

  • Example: For $A \in \mathbb{R}^{100 \times 100}$ with $\sigma_1 = 10, \sigma_{100} = 0.01$: $\kappa(A) = 1000$. Relative error in solving $Ax = b$ is amplified by factor $\approx 1000$; if input has error $10^{-15}$, output error $\approx 10^{-12}$.

Pitfalls & sanity checks
  • Never ignore condition number: If $\kappa(A) > 10^{8}$, accuracy is compromised; either regularize or use higher precision.

  • Use QR/SVD for least squares, never normal equations (unless you’ve verified conditioning): $\kappa(X^\top X) = \kappa(X)^2$ is catastrophic.

  • Check for catastrophic cancellation: Subtraction of nearly equal numbers loses precision; use orthogonal algorithms instead.

  • Machine precision is absolute limit: Even with perfect algorithm, relative error $\gtrsim \varepsilon_{\text{mach}} \times \kappa(A)$ is unavoidable.

  • Backward error bounds are meaningful: Small backward error $\Rightarrow$ computed solution is exact for nearby problem (reassuring).

  • Regularization parameter selection matters: Cross-validation, L-curve, or discrepancy principle; manual tuning often fails.

  • Feature scaling affects conditioning: Standardize features before regression; poor scaling inflates condition number unnecessarily.

  • Preconditioning amortized over multiple solves: Single solve: preconditioning setup not worth it; multiple solves: huge benefit.

References

Foundational backward error analysis

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

  3. IEEE 754. (2019). IEEE Standard for Floating-Point Arithmetic.

Classical numerical linear algebra

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Comprehensive stability analysis

  1. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  2. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix.

  3. Givens, W. (1958). Computation of plane unitary rotations transforming a general matrix to triangular form.

Orthogonal factorizations and reorthogonalization

  1. Golub, G. H., & Pereyra, V. (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.

  2. Yarlagadda, R. K., Hershey, J. E., & Sule, S. M. (1979). Gram–Schmidt orthogonalization with application to order recursive lattice filters.

Iterative methods and preconditioning

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Axelsson, O. (1994). Iterative Solution Methods.

  3. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  4. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  5. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

Ill-posed problems and regularization

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Hansen, P. C., Nagy, J. G., & O’Leary, D. P. (2006). Deblurring Images: Matrices, Spectra, and Filtering.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Applied ML and regularization

  1. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  3. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

  4. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

  5. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Five worked examples

Worked Example 1: Condition number of Gram matrix and instability of normal equations#

Introduction#

Construct a data matrix $X$ with correlated features; compute condition number $\kappa(X)$ and $\kappa(X^\top X)$; solve least squares via normal equations vs. QR to demonstrate error amplification.

Purpose#

Illustrate why normal equations fail for ill-conditioned data; show that $\kappa(X^\top X) = \kappa(X)^2$ causes error to explode.

Importance#

Foundational motivation for using QR/SVD in regression; explains why scikit-learn warns on ill-conditioned matrices.

What this example demonstrates#

  • Construct $X$ with exponentially decaying singular values (high correlation).

  • Compute $\kappa(X)$ and $\kappa(X^\top X) = \kappa(X)^2$.

  • Solve via normal equations: $G = X^\top X$, $w = G^{-1} X^\top y$.

  • Solve via QR factorization: $w$ from $R w = Q^\top y$.

  • Compare residuals and error growth on perturbed data.

Background#

Gram matrix $G = X^\top X$ is square and symmetric but inherits (squared) conditioning from $X$. Gaussian elimination on $G$ amplifies errors by $\kappa(G)^2 = \kappa(X)^4$ in worst case.

Historical context#

Golub & Kahan (1965) proved $\kappa(X^\top X) = \kappa(X)^2$; advocated QR as practical solution.

History#

Recognition of this issue led to LINPACK (1979) and LAPACK (1992) emphasis on QR/SVD.

Prevalence in ML#

Statisticians and practitioners routinely encounter this; feature correlation is common in real data.

Notes#

  • Problem is fundamental to linear algebra, not implementation-specific.

  • Data standardization (mean 0, std 1) helps but doesn’t eliminate correlation.

  • Large-scale regression ($n, d > 10^6$) exacerbates conditioning issues.

Connection to ML#

Ridge regression adds $\lambda I$ to shift eigenvalues; reduces effective condition number. Feature engineering (orthogonalization, PCA preprocessing) also helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ is exact; QR avoids Gram matrix entirely.

Pedagogical Significance#

Concrete demonstration of why theory (backward error analysis) matters in practice.

References#

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

Solution (Python)#

import numpy as np
np.random.seed(20)

# Create ill-conditioned data (correlated features)
n, d = 100, 10
# Generate features with exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, d)
X = U[:n, :d] @ np.diag(s)

# True solution and target
w_true = np.random.randn(d)
y = X @ w_true + 0.001 * np.random.randn(n)

# Condition numbers
kappa_X = np.linalg.cond(X)
G = X.T @ X
kappa_G = np.linalg.cond(G)

# Solve via normal equations
w_ne = np.linalg.solve(G, X.T @ y)
residual_ne = np.linalg.norm(X @ w_ne - y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)
residual_qr = np.linalg.norm(X @ w_qr - y)

# Solve via SVD (most stable)
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ (np.linalg.solve(np.diag(s_svd), U_svd.T @ y))
residual_svd = np.linalg.norm(X @ w_svd - y)

print("Condition numbers:")
print("  kappa(X) =", round(kappa_X, 2))
print("  kappa(X^T X) =", round(kappa_G, 2))
print("  kappa(X^T X) / kappa(X)^2 =", round(kappa_G / kappa_X**2, 4), "(should be ~1)")
print("\nResiduals:")
print("  Normal equations:", round(residual_ne, 8))
print("  QR:", round(residual_qr, 8))
print("  SVD:", round(residual_svd, 8))
print("\nSolution errors:")
print("  NE error:", round(np.linalg.norm(w_ne - w_true), 4))
print("  QR error:", round(np.linalg.norm(w_qr - w_true), 4))
print("  SVD error:", round(np.linalg.norm(w_svd - w_true), 4))

Worked Example 2: Backward error analysis and numerical stability of QR vs. normal equations#

Introduction#

Solve the same least squares problem via QR and normal equations; compute backward and forward errors; verify that QR is backward-stable while normal equations magnify error.

Purpose#

Demonstrate backward error analysis in practice; show that backward-stable algorithm (QR) guarantees small forward error even on ill-conditioned problems.

Importance#

Foundation for trusting numerical algorithms; explains why stable algorithms are worth the computational cost.

What this example demonstrates#

  • Solve $Ax = b$ via QR factorization and normal equations.

  • Compute backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$.

  • Compute forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$.

  • Verify $\text{forward error} \approx \kappa(A) \times \text{backward error}$ for both methods.

Background#

Backward-stable algorithm: computed $\tilde{x}$ is exact solution to $(A + \Delta A) \tilde{x} = b$ with small $\Delta A$.

Historical context#

Wilkinson (1961) formalized backward error analysis; revolutionary impact on numerical linear algebra.

History#

Modern algorithms designed to be backward-stable (QR, SVD, Cholesky); guarantees via theorems in Higham (2002).

Prevalence in ML#

Used in numerical software certification; backward error commonly reported in solvers.

Notes#

  • QR backward error: $O(\varepsilon_{\text{mach}} \| A \|)$ (small).

  • Normal equations backward error: $O(\varepsilon_{\text{mach}} \| A \|^2 / \min(\sigma_i))$ (amplified).

Connection to ML#

Backward stability ensures that model parameters are not polluted by rounding error; prediction is numerically sound.

Connection to Linear Algebra Theory#

Backward error analysis decouples algorithm stability from problem conditioning.

Pedagogical Significance#

Shows that algorithm design (not just theory) determines practical accuracy.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np
np.random.seed(21)

# Ill-conditioned system
n = 50
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # kappa ~ 1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
x_true = np.random.randn(n)
b = A @ x_true

# Solve via QR
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Solve via normal equations
G = A.T @ A
x_ne = np.linalg.solve(G, A.T @ b)

# Direct solve (reference)
x_ref = np.linalg.solve(A, b)

# Backward errors
residual_qr = np.linalg.norm(A @ x_qr - b)
residual_ne = np.linalg.norm(A @ x_ne - b)
backward_error_qr = residual_qr / np.linalg.norm(b)
backward_error_ne = residual_ne / np.linalg.norm(b)

# Forward errors
forward_error_qr = np.linalg.norm(x_qr - x_ref) / np.linalg.norm(x_ref)
forward_error_ne = np.linalg.norm(x_ne - x_ref) / np.linalg.norm(x_ref)

# Condition number
kappa = np.linalg.cond(A)

# Verify: forward_error ≈ kappa × backward_error
pred_forward_qr = kappa * backward_error_qr
pred_forward_ne = kappa * backward_error_ne

print("Condition number kappa(A):", round(kappa, 0))
print("\nQR Factorization:")
print("  Backward error:", format(backward_error_qr, '.2e'))
print("  Observed forward error:", format(forward_error_qr, '.2e'))
print("  Predicted forward error:", format(pred_forward_qr, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_qr / pred_forward_qr, 2))
print("\nNormal Equations:")
print("  Backward error:", format(backward_error_ne, '.2e'))
print("  Observed forward error:", format(forward_error_ne, '.2e'))
print("  Predicted forward error:", format(pred_forward_ne, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_ne / pred_forward_ne, 2))
print("\nConclusion: QR is backward-stable; NE backward error is larger.")

Worked Example 3: Machine precision and loss of significance in subtraction#

Introduction#

Demonstrate catastrophic cancellation when subtracting nearby numbers; show how orthogonal transformations avoid this.

Purpose#

Illustrate fundamental floating-point pitfall; motivate use of orthogonal algorithms (QR, SVD, Householder).

Importance#

Concrete example of why naive implementations fail; builds intuition for numerical stability.

What this example demonstrates#

  • Compute $c = a - b$ where $a \approx b$ (many leading digits cancel).

  • Compare relative error in $c$ vs. in $a, b$ individually.

  • Show that orthogonal transformations (e.g., Householder reflection) preserve norms and avoid subtraction.

Background#

Subtraction of nearly equal numbers is inherently unstable; cannot be fixed by higher precision alone.

Historical context#

Recognized early in computer arithmetic; led to development of Householder reflections (1958) and other orthogonal methods.

History#

Modern best practice: avoid subtraction of near-equal numbers; use transformation-based algorithms (QR, SVD).

Prevalence in ML#

Common issue in numerical gradient computation, vector normalization, and Gram–Schmidt orthogonalization.

Notes#

  • Relative error in $c$ can be $O(1)$ even if $a, b$ are accurate.

  • Classical Gram–Schmidt suffers; modified Gram–Schmidt fixes by reorthogonalization.

Connection to ML#

Adversarial robustness and numerical gradient computation are vulnerable to cancellation; careful implementation required.

Connection to Linear Algebra Theory#

Orthogonal transformations (Householder, Givens) have condition number $\kappa = 1$; stable regardless of input.

Pedagogical Significance#

Demonstrates limit of forward-stable algorithms; backward stability (computing exact solution to nearby problem) is more achievable.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

Solution (Python)#

import numpy as np

# Example 1: Catastrophic cancellation
a = 1.0 + 1e-16
b = 1.0
c = a - b
print("Example: Subtraction of nearly equal numbers")
print("  a = 1.0 + 1e-16 =", repr(a))
print("  b = 1.0 =", repr(b))
print("  c = a - b =", repr(c))
print("  Exact: 1e-16, Computed: ", c)
print("  Relative error:", abs(c - 1e-16) / 1e-16 if c != 0 else "inf")

# Example 2: Loss of significance in dot product
print("\nExample: Loss of significance in inner product")
np.random.seed(22)
x = np.array([1e8, 1.0, 1.0])
y = np.array([1e-8, 1.0, 1.0])

# Naive: compute separately then subtract
xy_naive = x[0] * y[0] + x[1] * y[1] + x[2] * y[2]  # 1 + 1 + 1 = 3

# More stable: accumulate with smaller terms first
xy_stable = x[2] * y[2] + x[1] * y[1] + x[0] * y[0]  # Reverse order

print("  Inner product (x · y):")
print("    Naive order: {:.15f}".format(xy_naive))
print("    Reverse order: {:.15f}".format(xy_stable))
print("    NumPy dot: {:.15f}".format(np.dot(x, y)))

# Example 3: Orthogonal transformation vs. subtraction
print("\nExample: Orthogonal transformation avoids cancellation")
# Create orthogonal matrix (preserve norms)
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
v = np.array([1.0, 0.0])
v_rot = Q @ v
print("  ||v|| =", np.linalg.norm(v))
print("  ||Q v|| =", np.linalg.norm(v_rot), "(should equal ||v||)")
print("  Orthogonal transformation preserves norms: stable!")

Worked Example 4: Preconditioning and condition number reduction#

Introduction#

Construct an ill-conditioned SPD system; solve via unpreconditioned CG and preconditioned CG with diagonal preconditioner; measure iteration count reduction.

Purpose#

Show quantitatively how preconditioning reduces condition number and accelerates convergence.

Importance#

Essential technique for large-scale optimization; directly translates to practical speedups.

What this example demonstrates#

  • Construct ill-conditioned SPD matrix (exponentially decaying eigenvalues).

  • Solve via CG (count iterations to convergence).

  • Construct diagonal (Jacobi) preconditioner $M = \text{diag}(A)$.

  • Solve via preconditioned CG (count iterations).

  • Compute effective $\kappa(M^{-1} A)$ and verify iteration count reduction.

Background#

CG iteration count $\approx \sqrt{\kappa(A)}$; preconditioning reduces $\kappa \to \kappa(M^{-1} A)$.

Historical context#

Preconditioning recognized as essential by 1970s; became standard in iterative solvers.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg (all offer preconditioning options).

Prevalence in ML#

L-BFGS uses quasi-Newton preconditioning; preconditioned SGD in modern optimizers.

Notes#

  • Iteration count $\approx \sqrt{\kappa}$; reducing $\kappa$ by 100 reduces iterations by $\sqrt{100} = 10$.

  • Preconditioner setup cost must be amortized over multiple solves.

Connection to ML#

Optimization accelerators: Newton’s method, natural gradient, adaptive methods all precondition the gradient.

Connection to Linear Algebra Theory#

Preconditioning clusters eigenvalues; standard result in iterative methods.

Pedagogical Significance#

Demonstrates power of structure exploitation (knowing $M \approx A$) to improve algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  3. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(23)

# Create ill-conditioned SPD matrix
n = 200
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # kappa ~ 1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_un, info_un = spla.cg(A, b, tol=1e-8, callback=callback_un, maxiter=n)

# Preconditioned CG with diagonal (Jacobi) preconditioner
M_diag = np.diag(1.0 / np.diag(A))
M_diag_sparse = sp.diags(np.diag(M_diag))
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_pre, info_pre = spla.cg(A, b, M=M_diag_sparse, tol=1e-8, 
                           callback=callback_pre, maxiter=n)

kappa_A = np.linalg.cond(A)
# Approximate kappa(M^{-1} A)
AM = M_diag @ A
kappa_AM = np.linalg.cond(AM)

print("Condition numbers:")
print("  kappa(A) =", round(kappa_A, 0))
print("  kappa(M^{-1} A) ≈", round(kappa_AM, 0), "(after diagonal preconditioning)")
print("\nCG Convergence:")
print("  Unpreconditioned iterations:", len(residuals_unprecond))
print("  Preconditioned iterations:", len(residuals_precond))
print("  Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("  Predicted speedup (~sqrt(kappa)):", round(np.sqrt(kappa_A / kappa_AM), 1))
print("\nBoth solutions satisfy A x = b:",
      np.allclose(A @ x_un, b) and np.allclose(A @ x_pre, b))

Worked Example 5: Regularization and effective condition number#

Introduction#

Construct an ill-posed rectangular system (underdetermined or rank-deficient); solve via Tikhonov regularization with varying $\lambda$; plot condition number reduction and solution norm vs. residual (L-curve).

Purpose#

Demonstrate how regularization shifts small singular values and reduces effective condition number.

Importance#

Core technique in inverse problems, ridge regression, and ill-posed ML problems.

What this example demonstrates#

  • Construct rank-deficient or ill-posed system (exponentially decaying singular values).

  • Solve regularized system $(A^\top A + \lambda I) w = A^\top b$ for range of $\lambda$.

  • Compute effective condition number: $\kappa_\lambda = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$.

  • Plot L-curve: solution norm vs. residual; optimal $\lambda$ at corner.

  • Verify that small $\lambda$ gives large solution, large $\lambda$ gives small solution but large residual.

Background#

Tikhonov adds penalty; shifts eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$.

Historical context#

Tikhonov (1963) for ill-posed problems; Hansen (1998) developed regularization parameter selection tools.

History#

Standard in inverse problems (deblurring, tomography); ridge regression standard in ML.

Prevalence in ML#

Ridge regression, elastic net, dropout, early stopping all act as implicit regularization.

Notes#

  • Optimal $\lambda$ balances fit (small residual) and solution norm (small solution).

  • L-curve method (Golub et al. 1979): plot $\log(\| r \|)$ vs. $\log(\| w \|)$; corner indicates optimal $\lambda$.

Connection to ML#

Bias-variance trade-off: underfitting vs. overfitting; regularization parameter selection critical.

Connection to Linear Algebra Theory#

Spectral filtering: Tikhonov filters small singular values; LSQR via early stopping does implicit filtering.

Pedagogical Significance#

Shows how problem structure (ill-posedness) guides algorithm design (regularization).

References#

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Golub, G. H., Van Loan, C. F., & Milanfar, P. (1999). The total least squares problem.

Solution (Python)#

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(24)

# Construct ill-posed system
m, n = 100, 40
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T

# True solution and noisy data
x_true = np.zeros(n)
x_true[:5] = [10, 8, 6, 4, 2]
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# Solve Tikhonov for range of lambda
lams = np.logspace(-6, 2, 50)
sol_norms = []
residuals = []
cond_numbers = []

for lam in lams:
    G = A.T @ A + lam * np.eye(n)
    w = np.linalg.solve(G, A.T @ b)
    sol_norms.append(np.linalg.norm(w))
    residuals.append(np.linalg.norm(A @ w - b))
    cond_numbers.append(np.linalg.cond(G))

sol_norms = np.array(sol_norms)
residuals = np.array(residuals)
cond_numbers = np.array(cond_numbers)

# Find corner of L-curve (optimal lambda via curvature)
log_res = np.log(residuals)
log_norm = np.log(sol_norms)
curvature = np.gradient(np.gradient(log_res) / np.gradient(log_norm))
opt_idx = np.argmax(curvature)
lam_opt = lams[opt_idx]

# Solve with optimal lambda
G_opt = A.T @ A + lam_opt * np.eye(n)
w_opt = np.linalg.solve(G_opt, A.T @ b)

print("Regularization parameter selection (L-curve method):")
print("  Optimal lambda:", format(lam_opt, '.2e'))
print("  At optimal lambda:")
print("    Solution norm:", round(np.linalg.norm(w_opt), 4))
print("    Residual norm:", round(np.linalg.norm(A @ w_opt - b), 4))
print("    Condition number:", round(np.linalg.cond(G_opt), 2))
print("\nFor comparison:")
print("  Smallest lambda: cond =", round(cond_numbers[0], 0), ", residual =", 
      format(residuals[0], '.2e'))
print("  Largest lambda: cond =", round(cond_numbers[-1], 0), ", residual =", 
      format(residuals[-1], '.2e'))
print("\nTrue solution norm:", round(np.linalg.norm(x_true), 4))
print("Solution error at optimal lambda:", round(np.linalg.norm(w_opt - x_true), 4))

Comments

Algorithm Category
Computational Efficiency
Historical & Attribution
Learning Path & Sequencing
ML Applications
Theoretical Foundation