ex1.ai

Solving Systems

Chapter 13

Solving Systems

Key ideas: Introduction

Introduction#

Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares solves $A = X^\top X$ for $x = w$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization requires solving preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.

Important ideas#

Gaussian elimination and LU factorization
- Direct method: $A = LU$ via row operations.
- Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.
- Cost: $O(n^3)$ for dense; prone to numerical error (pivot instability).
Cholesky factorization for SPD systems
- For symmetric positive definite $A$: $A = L L^\top$ (one triangle).
- Cost: $O(n^3 / 3)$ (half LU cost); more stable than LU.
- Numerically stable if $A$ is well-conditioned.
QR factorization and least squares
- $A = QR$ with orthonormal $Q$ and upper triangular $R$.
- Numerically stable; used for least squares (Chapter 12).
- Cost: $O(n^2 d)$ for $n \times d$ matrix.
Iterative solvers and conjugate gradient
- Conjugate gradient (CG): optimal for SPD systems in $n$ iterations (theory); practical convergence in $\ll n$ iterations.
- GMRES/MINRES: for general/symmetric non-SPD systems.
- Cost per iteration: $O(nnz(A))$ (number of nonzeros); scales to $n = 10^8+$.
Preconditioning and conditioning
- Condition number $\kappa(A)$ determines iteration complexity: residual reduces by factor $\rho \approx (\kappa - 1) / (\kappa + 1)$ per iteration.
- Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.
- Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.
Sparse matrix structure
- Banded, tridiagonal, block-structured systems exploit locality.
- Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.
- Fill-in during factorization can destroy sparsity; ordering matters.
Rank deficiency and ill-posedness
- Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.
- Ill-conditioned $A$ (nearly rank-deficient) amplifies noise; require stabilization.
- Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.

Relevance to ML#

Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.
Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.
Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.
Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.
Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.

Algorithmic development (milestones)#

1670s: Newton’s method and early algebraic solutions.
1810: Gaussian elimination formalized (Gauss, Legendre).
1875: Cholesky decomposition developed (rediscovered 1910).
1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.
1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.
1971: LSQR algorithm (Paige–Saunders); numerically stable for least squares.
1986: GMRES and MINRES (Saad–Schultz); iterative methods for non-symmetric systems.
1990s–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enables $O(n)$ scaling.
2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.

Definitions#

Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.
LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.
Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.
Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.
Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.
Residual: $r = b - A x$; measure of solution error.
Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.
Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.
Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).
Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

Gaussian elimination and LU: Forward elimination, row pivoting for stability, $A = LU$ factorization. References: Golub & Van Loan (2013); Trefethen & Bau (1997).
Cholesky factorization: $A = L L^\top$ for SPD matrices; numerical stability via diagonal dominance and conditioning. Reference: Golub & Van Loan (2013).
Forward/back-substitution: $O(n^2)$ solve for triangular systems; essential subroutine.
QR factorization: $A = QR$ with orthonormal $Q$; stable least squares (Chapter 12). Reference: Golub & Kahan (1965).
Conjugate gradient: Minimize $\frac{1}{2} x^\top A x - b^\top x$ on Krylov subspace; optimality in $n$ iterations for SPD. Reference: Hestenes & Stiefel (1952).
Condition number and residual analysis: $\kappa(A) = \sigma_1 / \sigma_n$; backward error bounds. Reference: Wilkinson (1961).
Preconditioning: Transform $A \to M^{-1} A$ to reduce condition number. Reference: Axelsson (1994).

Applied (landmark systems)#

Cholesky solve for Gram matrices: $(X^\top X) w = X^\top y$ via Cholesky; scikit-learn LinearRegression with solver='cholesky'. Reference: Hastie et al. (2009).
CG for large-scale least squares: CGLS (CG on normal equations) for $n, d > 10^6$; scikit-learn SGDRegressor. Reference: Paige & Saunders (1982).
Gaussian process inference: Cholesky of $K$ for marginal likelihood; approximate GPs via inducing points reduce $O(n^3)$ to $O(m^3)$. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005); Hensman et al. (2015).
Preconditioned optimization: L-BFGS with Hessian approximation; widely used in TensorFlow/PyTorch. Reference: Nocedal & Wright (2006).
Graph Laplacian solvers: Fast Poisson equation solvers on mesh/graphs; ChebNet (Defferrard et al. 2016); enables scalable GNNs. Reference: Defferrard et al. (2016).
Inverse problems: LSQR iterative method; clinical deployment in medical imaging. Reference: Vogel (2002); Bardsley et al. (2012).

Key ideas: Where it shows up

Least squares and kernel ridge regression
- Solve $A = X^\top X$ (Gram matrix) or $A = K + \lambda I$ (kernel matrix).
- Achievements: scikit-learn, PyTorch linear solvers; KRR standard in Gaussian processes. References: Rasmussen & Williams (2006); Scholkopf & Smola (2002).
Gaussian processes and covariance systems
- Solve $K x = y$ where $K$ is $n \times n$ covariance matrix.
- Achievements: Cholesky solve in Stan, PyMC3; approximate inference via sparse GPs (inducing points). References: Quinonero-Candela & Rasmussen (2005); Snelson & Ghahramani (2005).
Optimization and preconditioning
- Preconditioned gradient descent: $x_{t+1} = x_t - \alpha M^{-1} \nabla f(x_t)$.
- Achievements: L-BFGS preconditioner reduces iteration count by factor of $10$-$100$; quasi-Newton methods. References: Nocedal & Wright (2006); Martens & Grosse (2015).
Graph neural networks and sparse convolutions
- Solve graph Laplacian systems $L x = b$ for diffusion, smoothing, attention mechanisms.
- Achievements: GraphSAGE, GCN via approximate polynomial filters; scalable to graphs with $10^9$ nodes. References: Defferrard et al. (2016) (ChebNet); Kipf & Welling (2017) (GCN).
Inverse problems and regularized imaging
- Solve Tikhonov system $(A^\top A + \lambda I) x = A^\top b$ for ill-posed deconvolution, tomography, parameter estimation.
- Achievements: iterative methods in medical imaging (CGLS for CT/MRI); LSQR in seismic inversion. References: Hansen (1998); Vogel (2002); Bardsley et al. (2012).

Notation

Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}$ (or $\mathbb{R}^{m \times n}$ overdetermined).
LU factorization: $A = L U$ with $L \in \mathbb{R}^{n \times n}$ lower triangular (unit diagonal), $U \in \mathbb{R}^{n \times n}$ upper triangular.
Cholesky: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite; $L \in \mathbb{R}^{n \times n}$ lower triangular.
QR factorization: $A = QR$ with $Q \in \mathbb{R}^{m \times n}$ orthonormal, $R \in \mathbb{R}^{n \times n}$ upper triangular.
Residual: $r = b - A x \in \mathbb{R}^n$; goal is $\lVert r \rVert \ll \lVert b \rVert$.
Conjugate gradient: $x_k$ minimizes $\frac{1}{2} x^\top A x - b^\top x$ over $k$-dimensional Krylov subspace $\text{span}(r_0, A r_0, \ldots, A^{k-1} r_0)$.
Preconditioner: $M \approx A$ (cheap to invert); solve $M^{-1} A x = M^{-1} b$.
Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); measures sensitivity to perturbation.
Example: $A = \begin{pmatrix} 10 & 1 \\ 1 & 1 \end{pmatrix}$ has $\sigma_1 \approx 10.05, \sigma_2 \approx 0.95$, so $\kappa(A) \approx 10.6$ (moderately ill-conditioned).

Pitfalls & sanity checks

Never solve normal equations $A^\top A x = A^\top b$ directly for ill-conditioned $A$: Use QR, SVD, or iterative methods (LSQR) instead; $\kappa(A^\top A) = \kappa(A)^2$.
Verify SPD before Cholesky: Non-SPD matrices cause NaN/Inf; test via eigenvalues or try-catch.
Check residual convergence: $\lVert A x - b \rVert$ should decrease monotonically in iterative solvers; stagnation signals bad conditioning or preconditioner failure.
Preconditioning setup cost: If solve is run once, setup overhead may exceed savings; only cost-effective for multiple solves with same $A$.
Fill-in in sparse LU: Sparse matrix factorization can become dense; use reordering (minimum degree, nested dissection) to minimize fill-in.
Early stopping in LSQR: Stop iteration based on residual norm or discrepancy principle; continuing to convergence amplifies noise.
Regularization parameter selection: Cross-validation, L-curve, or discrepancy principle; do not fit $\lambda$ on training data.
Scaling sensitivity: Ill-scaled systems (different row/column magnitudes) become ill-conditioned; normalize before solving.

References

Foundational algorithms

Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.
Cholesky, A. L. (1910). Sur la résolution numérique des systèmes d’équations linéaires.

Classical theory and numerical stability

Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.
Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Iterative methods and conjugate gradient

Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.
Saad, Y., & Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.
Axelsson, O. (1994). Iterative Solution Methods.

Preconditioning and multigrid

Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

Applied: Machine learning and inverse problems

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.
Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.
Vogel, C. R. (2002). Computational Methods for Inverse Problems.
Quinonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression.
Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs.
Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering.
Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.
Hensman, J., Matthews, A. G. D. E., & Ghahramani, Z. (2015). Scalable variational Gaussian process classification.
Scholkopf, B., & Smola, A. J. (2002). Learning with Kernels.
Bardsley, J. M., Chung, J., & Palmer, K. (2012). Regularization parameter selection methods for linear least squares problems.

Five worked examples

Worked Example 1: Gaussian elimination, LU factorization, and pivoting#

Introduction#

Solve a system $A x = b$ via LU factorization with partial pivoting; verify stability and reconstruction.

Purpose#

Illustrate direct method for dense systems; compare unpivoted LU (numerically risky) to pivoted LU (numerically stable).

Importance#

LU is foundation of dense linear algebra; pivoting is critical for avoiding catastrophic cancellation.

What this example demonstrates#

Compute LU factorization via Gaussian elimination (with and without pivoting).
Verify $A = L U$ and solve via forward/back-substitution.
Compute condition number and demonstrate error growth without pivoting on ill-conditioned matrix.

Background#

Partial pivoting (swapping rows to maximize pivot) prevents small pivots from amplifying errors during elimination.

Historical context#

Numerical instability of unpivoted LU recognized mid-20th century; pivoting became standard.

History#

Modern linear algebra libraries (LAPACK) default to pivoted LU.

Prevalence in ML#

Used in scikit-learn’s least squares and in matrix factorization algorithms.

Notes#

Pivoting is automatic in most libraries; rarely explicit in practice.
For sparse matrices, fill-in during elimination can destroy sparsity; sparse LU uses ordering strategies (minimum degree, nested dissection).

Connection to ML#

Condition number of $A$ determines solver reliability; poor conditioning necessitates regularization.

Connection to Linear Algebra Theory#

$\kappa(A) = \sigma_1 / \sigma_n$ predicts error magnification; LU error $\sim \kappa(A) \times \text{machine eps}$.

Pedagogical Significance#

Concrete demonstration of numerical stability and pivoting strategy.

References#

Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
np.random.seed(10)

# Create moderately ill-conditioned system
n = 5
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, n)
A = U @ np.diag(s) @ U.T
b = np.random.randn(n)

# Compute LU with pivoting (numpy default)
P, L, U_lu = scipy.linalg.lu(A)
kappa_A = np.linalg.cond(A)

# Verify A = P^T L U
A_recon = P.T @ L @ U_lu
recon_error = np.linalg.norm(A - A_recon)

# Solve via LU: forward sub (L y = P b), back sub (U x = y)
y = scipy.linalg.solve_triangular(L, P @ b, lower=True)
x_lu = scipy.linalg.solve_triangular(U_lu, y, lower=False)

# Solve directly (for comparison)
x_direct = np.linalg.solve(A, b)

print("Condition number kappa(A):", round(kappa_A, 2))
print("LU reconstruction error:", round(recon_error, 8))
print("LU solution matches direct:", np.allclose(x_lu, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_lu - b), 8))

Worked Example 2: Cholesky factorization for symmetric positive definite systems#

Introduction#

Solve a symmetric positive definite (SPD) system via Cholesky factorization; verify stability and computational savings.

Purpose#

Show how SPD structure reduces computation by half and improves stability vs. general LU.

Importance#

Cholesky is standard for covariance matrices, Gram matrices, and Hessians in optimization.

What this example demonstrates#

Construct SPD matrix (e.g., covariance or Gram).
Compute Cholesky $A = L L^\top$ and verify reconstruction.
Solve via forward/back-substitution on $L$.
Compare to general LU: cost $O(n^3 / 3)$ vs. $O(n^3)$.

Background#

For SPD $A$, Cholesky is more stable than LU (no pivoting needed) and faster (half the operations).

Historical context#

Cholesky rediscovered in early 1900s; became standard method for SPD systems by mid-20th century.

History#

LAPACK dpotrf is gold-standard Cholesky implementation.

Prevalence in ML#

Used in kernel ridge regression, Gaussian process inference, and proximal methods.

Notes#

Fails gracefully if $A$ is not SPD; returns error (no negative square roots).
Numerical breakdown signals ill-conditioning; add small jitter ($A + \epsilon I$) if needed.

Connection to ML#

Covariance and Gram matrices are always SPD; Cholesky enables efficient sampling and likelihood computation in probabilistic models.

Connection to Linear Algebra Theory#

Cholesky exists iff all leading principal minors are positive (Sylvester criterion).

Pedagogical Significance#

Demonstrates how structure (symmetry, PSD) enables algorithmic optimization.

References#

Golub & Van Loan (2013). Matrix Computations.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

Solution (Python)#

import numpy as np
np.random.seed(11)

# Create SPD matrix (covariance-like)
n = 5
A_temp = np.random.randn(n, n)
A = A_temp.T @ A_temp + np.eye(n)  # SPD by construction
b = np.random.randn(n)

# Cholesky factorization
L = np.linalg.cholesky(A)

# Verify A = L L^T
A_recon = L @ L.T
recon_error = np.linalg.norm(A - A_recon)

# Solve via forward/back-substitution
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_chol = scipy.linalg.solve_triangular(L.T, y, lower=False)

# Direct solve for comparison
x_direct = np.linalg.solve(A, b)

# Condition number
kappa_A = np.linalg.cond(A)

print("Condition number kappa(A):", round(kappa_A, 2))
print("Cholesky reconstruction error:", round(recon_error, 8))
print("Cholesky solution matches direct:", np.allclose(x_chol, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_chol - b), 8))
print("Lower triangular L diagonal:", np.round(np.diag(L), 4))

Worked Example 3: Conjugate gradient for large sparse systems#

Introduction#

Solve a large sparse SPD system via conjugate gradient (CG); demonstrate convergence and compare to direct Cholesky.

Purpose#

Show how iterative methods scale to large systems while respecting sparsity.

Importance#

CG is the standard iterative solver for large-scale ML and scientific computing.

What this example demonstrates#

Construct sparse SPD system (e.g., graph Laplacian or 2D Poisson discretization).
Solve via CG and via direct Cholesky.
Plot convergence: residual norm vs. iteration.
Measure wall-clock time and memory for dense vs. sparse approaches.

Background#

CG finds solution in at most $n$ iterations (theory); practical convergence in $\ll n$ iterations for well-conditioned systems.

Historical context#

Hestenes & Stiefel (1952); rediscovered in 1970s as demand for large-scale methods grew.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg.

Prevalence in ML#

Used in large-scale kernel methods, graph neural networks, and optimization.

Notes#

Convergence rate depends on condition number $\kappa(A)$; preconditioning improves rate dramatically.
Requires only matrix-vector products $A v$; no explicit $A$ storage needed.

Connection to ML#

Enables Gaussian process inference on large datasets; scales kernel methods from $10^4$ to $10^6$ points.

Connection to Linear Algebra Theory#

CG minimizes $\frac{1}{2} x^\top A x - b^\top x$ over Krylov subspace; conjugate directions ensure optimality.

Pedagogical Significance#

Demonstrates trade-off between direct (dense, $O(n^3)$, low iteration) and iterative (sparse-compatible, $O(n)$ per iteration, many iterations).

References#

Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Golub & Kahan (1965). Calculating singular values and pseudo-inverses.
Nocedal & Wright (2006). Numerical Optimization.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(12)

# Construct sparse SPD system (2D Laplacian discretization)
n_side = 20
n = n_side ** 2
# Discrete Laplacian on 2D grid (tridiagonal structure)
I = np.arange(n)
J = np.arange(n)
V = 4 * np.ones(n)
# Horizontal neighbors
I_h = np.arange(n - n_side)
J_h = I_h + n_side
V_h = -np.ones(len(I_h))
# Vertical neighbors
I_v = np.arange(n - 1)
I_v = I_v[I_v % n_side != n_side - 1]
J_v = I_v + 1
V_v = -np.ones(len(I_v))

rows = np.concatenate([I, I_h, I_h, I_v, I_v])
cols = np.concatenate([J, J_h, I_h, J_v, I_v])
vals = np.concatenate([V, V_h, V_h, V_v, V_v])
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
b = np.random.randn(n)

# CG solution
def residual_norm(x):
    return np.linalg.norm(A @ x - b)

x_cg, info = spla.cg(A, b, tol=1e-6, maxiter=n)

# Direct Cholesky on dense
A_dense = A.toarray()
L = np.linalg.cholesky(A_dense)
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_direct = scipy.linalg.solve_triangular(L.T, y, lower=False)

print("Problem size (n):", n)
print("Sparsity: {:.1f}%".format(100 * A.nnz / (n ** 2)))
print("CG converged in:", info, "iterations")
print("CG residual:", round(residual_norm(x_cg), 8))
print("Direct residual:", round(residual_norm(x_direct), 8))
print("Solutions match:", np.allclose(x_cg, x_direct, atol=1e-5))

Worked Example 4: Preconditioning and conditioning number#

Introduction#

Solve an ill-conditioned SPD system with and without preconditioning; demonstrate acceleration and residual reduction.

Purpose#

Show how preconditioning reduces effective condition number and dramatically accelerates convergence.

Importance#

Preconditioning is essential for large-scale optimization and inverse problems.

What this example demonstrates#

Construct ill-conditioned SPD system (exponentially decaying eigenvalues).
Solve via unpreconditioned CG and via diagonal (Jacobi) preconditioner.
Plot convergence: residual vs. iteration for both.
Verify that CG on $M^{-1} A$ has better convergence rate.

Background#

Convergence rate $\rho \approx \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per iteration; preconditioning reduces $\kappa(M^{-1} A)$.

Historical context#

Preconditioning recognized as essential in 1970s–1980s for practical large-scale solvers.

History#

MINRES-QLP, GMRES with ILU preconditioning (Saad 1993); algebraic multigrid (Ruge–Stüben 1987).

Prevalence in ML#

Preconditioned gradients, L-BFGS with Hessian approximation, trust-region methods.

Notes#

Preconditioning trades: setup cost (compute/apply $M^{-1}$) for iteration reduction.
Diagonal (Jacobi) preconditioner: $M = \text{diag}(A)$; trivial but often effective.

Connection to ML#

Preconditioned gradient descent achieves faster convergence; L-BFGS implicitly preconditions via quasi-Newton approximation.

Connection to Linear Algebra Theory#

CG convergence depends on spectrum of $A$; preconditioning clusters eigenvalues of $M^{-1} A$ to reduce effective $\kappa$.

Pedagogical Significance#

Concrete demonstration of conditioning’s impact on algorithm complexity.

References#

Axelsson, O. (1994). Iterative Solution Methods.
Nocedal & Wright (2006). Numerical Optimization.
Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(13)

# Create ill-conditioned SPD matrix
n = 100
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # Condition number ~1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_unprecond(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_unprecond, _ = spla.cg(A, b, tol=1e-8, callback=callback_unprecond)

# Preconditioned CG (diagonal/Jacobi preconditioner)
M_inv_diag = 1.0 / np.diag(A)
M = sp.diags(1.0 / M_inv_diag)
residuals_precond = []
def callback_precond(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_precond, _ = spla.cg(A, b, M=M, tol=1e-8, callback=callback_precond)

kappa_A = np.linalg.cond(A)
print("Condition number kappa(A):", round(kappa_A, 0))
print("Unpreconditioned CG iterations:", len(residuals_unprecond))
print("Preconditioned CG iterations:", len(residuals_precond))
print("Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("Both solutions satisfy A x = b:", 
      np.allclose(A @ x_unprecond, b) and np.allclose(A @ x_precond, b))

Worked Example 5: LSQR for ill-posed inverse problems#

Introduction#

Solve an ill-posed least squares problem $A x \approx b$ via LSQR (iterative least squares QR); compare to direct pseudoinverse and Tikhonov regularization.

Purpose#

Demonstrate how LSQR stabilizes solutions to ill-posed systems without explicit regularization parameter tuning.

Importance#

LSQR is standard in medical imaging, geophysical inversion, and inverse problems.

What this example demonstrates#

Construct ill-posed rectangular system (tall, decaying singular values).
Solve via LSQR (iterative, early stopping acts as regularization).
Compare to pseudoinverse (noisy) and Tikhonov (requires $\lambda$ selection).
Plot regularization curves (solution norm vs. residual).

Background#

LSQR is conjugate gradient on normal equations $A^\top A x = A^\top b$; stops before convergence to avoid amplifying noise.

Historical context#

Paige & Saunders (1982); foundational algorithm for inverse problems.

History#

CGLS, LSQR in scipy.sparse.linalg; widely used in medical imaging.

Prevalence in ML#

Used in sparse coding, compressed sensing, and large-scale least squares.

Notes#

Stopping iteration acts as regularization; no regularization parameter needed.
Discrepancy principle: stop when residual reaches expected noise level.

Connection to ML#

Implicit regularization; enables robust solutions without hyperparameter tuning.

Connection to Linear Algebra Theory#

LSQR on $\min_x \lVert A x - b \rVert^2$ without explicit pseudoinverse; early stopping filters small singular values.

Pedagogical Significance#

Demonstrates how algorithm structure (iteration count) can act as regularization.

References#

Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.
Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.
Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(14)

# Construct ill-posed rectangular system
m, n = 120, 50
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T
x_true = np.zeros(n)
x_true[:8] = np.sin(np.linspace(0, 2*np.pi, 8))
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# LSQR with early stopping (implicit regularization)
residuals_lsqr = []
x_lsqr_early = None
for k in [5, 10, 20, 50]:
    x_lsqr, _ = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)
    residuals_lsqr.append(np.linalg.norm(A @ x_lsqr - b))
    if k == 20:
        x_lsqr_early = x_lsqr

# Pseudoinverse (no regularization, noisy)
x_pinv = spla.lsqr(A, b, atol=0, btol=0, iter_lim=1000)[0]

# Tikhonov with manual lambda (requires tuning)
lam = 0.01
G = A.T @ A + lam * np.eye(n)
x_tikhonov = np.linalg.solve(G, A.T @ b)

print("True solution norm:", round(np.linalg.norm(x_true), 4))
print("LSQR (early stop k=20) norm:", round(np.linalg.norm(x_lsqr_early), 4))
print("LSQR (early stop) error:", round(np.linalg.norm(x_lsqr_early - x_true), 4))
print("Pseudoinverse error:", round(np.linalg.norm(x_pinv - x_true), 4))
print("Tikhonov error:", round(np.linalg.norm(x_tikhonov - x_true), 4))
print("LSQR residual at k=20:", round(residuals_lsqr[2], 6))

Comments

Algorithm Category

Direct Methods

Iterative Methods

Computational Efficiency

Cache-Aware Algorithms

Data Modality

Dense Matrices

Structured Matrices

Historical & Attribution

20th Century

Key Concepts & Theorems

Projection & Orthogonality

Learning Path & Sequencing

Intermediate

Matrix Decompositions

Cholesky Decomposition

Orthogonal Factorizations

ML Applications

Optimization & Training

Numerical Stability & Robustness

Preconditioning

Theoretical Foundation

Numerical Linear Algebra

Least Squares

Chapter 12

Least Squares

Key ideas: Algorithmic development history

Algorithmic development (milestones)#

1795: Legendre and Gauss independently develop least squares for astronomy/surveying.
1881–1920: Cholesky factorization and early numerical algorithms.
1960s: Golub–Kahan QR algorithm; recognition of conditioning issues in normal equations.
1970s–1980s: Tikhonov regularization and Hansen’s methods for ill-posed problems.
1990s: Ridge regression, elastic net, and LASSO via modern regularization theory (Hastie et al.).
2000s: Stochastic gradient descent for large-scale least squares (Bottou–LeCun).
2010s: Implicit regularization in deep learning; connections between SGD and generalization.

Key ideas: Definitions

Definitions#

Least squares problem: $\min_w \lVert X w - y \rVert_2^2$ with $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^n$.
Normal equations: $X^\top X w = X^\top y$.
Residual: $r = X w - y \in \mathbb{R}^n$.
Gram matrix: $G = X^\top X \in \mathbb{R}^{d\times d}$ (PSD).
Condition number: $\kappa(X) = \sigma_1 / \sigma_d$ (ratio of singular values).
Ridge regression: $\min_w (\lVert X w - y \rVert^2 + \lambda \lVert w \rVert^2)$; solution $(X^\top X + \lambda I)^{-1} X^\top y$.
Regularization parameter: $\lambda \ge 0$ controls trade-off between fit and smoothness.

Key ideas: Introduction

Introduction#

Least squares is the workhorse of supervised learning. Given data $X \in \mathbb{R}^{n\times d}$ and targets $y \in \mathbb{R}^n$ with $n > d$, least squares finds $w \in \mathbb{R}^d$ minimizing $f(w) = \tfrac{1}{2}\lVert X w - y \rVert_2^2$. Geometrically, it projects $y$ onto the column space of $X$. The solution $w^* = (X^\top X)^{-1} X^\top y$ exists if $X$ has full rank; stable computation uses QR or SVD.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

Overdetermined systems and least squares formulation as projection onto column space.
Normal equations and optimality: $\nabla f(w) = X^\top(X w - y) = 0$.
Gram matrix $G = X^\top X$ is PSD; condition number $\kappa(G) = \kappa(X)^2$.
QR decomposition $X = QR$; normal equations become $R w = Q^\top y$ (stable).
SVD solution $w^* = V \Sigma^{-1} U^\top y$ and pseudoinverse.
Ridge regression normal equations and bias-variance trade-off.
Regularization parameter selection (cross-validation, L-curve, GCV).

Applied (landmark systems)#

Linear regression (Hastie et al. 2009; scikit-learn implementation).
Kernel ridge regression (Rasmussen & Williams 2006; standard GP predictor).
Regularization for ill-posed inverse problems (Hansen 1998; Vogel 2002).
Elastic net for feature selection (Zou & Hastie 2005).
LASSO regression (Tibshirani 1996).
SGD for large-scale least squares (Bottou & LeCun 1998).
Implicit regularization in neural networks (Zhu et al. 2021).

Key ideas: Important ideas

Important ideas#

Normal equations
- $X^\top X w = X^\top y$ characterizes optimality via zero gradient.
Residuals and loss
- Residual $r = X w - y$; loss $f(w) = \tfrac{1}{2}\lVert r \rVert^2$ is convex in $w$.
Geometry: projection
- $\hat{y} = X w^* = X(X^\top X)^{-1} X^\top y = P_X y$ projects onto column space.
Conditioning and stability
- Condition number $\kappa(X^\top X) = \kappa(X)^2$ amplifies numerical error; prefer QR/SVD.
Pseudoinverse solution
- $w^* = X^\dagger y$ with $X^\dagger = V \Sigma^{-1} U^\top$ (SVD-based); handles rank-deficiency.
Ridge regression
- Add regularizer $\lambda \lVert w \rVert^2$; normal equations become $(X^\top X + \lambda I) w = X^\top y$. Trades bias for lower variance.
Regularization and ill-posedness
- Truncated SVD or Tikhonov filtering remove small singular values; stabilizes solutions to ill-posed inverse problems.

Key ideas: Relevance to ML

Relevance to ML#

Core regression algorithm: linear, polynomial, feature-engineered models.
Bias-variance trade-off: unregularized overfits on noise; regularization improves generalization.
Feature selection and dimensionality: via regularization (L1/elastic net) or subset selection.
Inverse problems: medical imaging, seismic inversion, parameter estimation.
Kernel methods: kernel ridge regression as Tikhonov in infinite-dimensional spaces.
Deep learning: implicit regularization in SGD and architecture design inspired by least squares principles.

Key ideas: Where it shows up

Linear regression and generalized linear models

Core supervised learning; extends to logistic regression, Poisson regression, etc. Achievements: classical statistical foundation; scikit-learn, TensorFlow standard solvers. References: Hastie et al. 2009.

Kernel methods and kernel ridge regression

Least squares in kernel-induced spaces; KRR = Tikhonov regularization in RKHS. Achievements: competitive with SVMs, enables Gaussian process prediction. References: Rasmussen & Williams 2006.

Inverse problems and imaging

Regularized least squares for ill-posed geophysics, medical imaging (CT, MRI). Achievements: Hansen 1998 (regularization tools); clinical deployment. References: Vogel 2002 (computational methods).

Dimensionality reduction via regularization

Ridge regression reduces variance on high-dimensional data; elastic net combines L1/L2 penalties. Achievements: Zou & Hastie 2005 (elastic net); foundation for modern feature selection. References: Tibshirani 1996 (LASSO).

Stochastic gradient descent and deep learning

SGD on least squares loss drives optimization; implicit regularization enables generalization. Achievements: Bottou & LeCun 1998 (stochastic methods); foundation for deep learning. References: Zhu et al. 2021 (implicit regularization theory).

Notation

Data and targets: $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^n$ (overdetermined: $n > d$).
Parameter vector: $w \in \mathbb{R}^d$.
Predictions and residuals: $\hat{y} = X w$, $r = y - X w$.
Loss (least squares): $f(w) = \tfrac{1}{2} \lVert X w - y \rVert_2^2 = \tfrac{1}{2} \lVert r \rVert_2^2$.
Gram matrix: $G = X^\top X \in \mathbb{R}^{d\times d}$ (PSD).
Normal equations: $G w = X^\top y$.
QR factorization: $X = QR$ with $Q \in \mathbb{R}^{n\times d}, R \in \mathbb{R}^{d\times d}$ upper triangular.
SVD: $X = U \Sigma V^\top$; solution $w^* = V \Sigma^{-1} U^\top y$.
Ridge regression: $w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
Condition number: $\kappa(X) = \sigma_1 / \sigma_d$; $\kappa(G) = \kappa(X)^2$.
Example: If $X$ is $100 \times 5$ with $\sigma_1 = 10, \sigma_5 = 0.1$, then $\kappa(X) = 100$ and $\kappa(X^\top X) = 10000$ (ill-conditioned); use QR or SVD instead of normal equations.

Pitfalls & sanity checks

Never solve normal equations for ill-conditioned $X$; use QR or SVD instead.
Verify system is overdetermined ($n > d$); underdetermined requires pseudoinverse or regularization.
Check $\operatorname{rank}(X) = d$; if rank-deficient, pseudoinverse is needed.
Residual $\lVert X w - y \rVert$ should be small but nonzero (unless exact solution exists).
Condition number $\kappa(X)$ predicts error magnification; regularize if too large.
Cross-validate regularization parameter $\lambda$; do not fit on training data.
Check for multicollinearity: if columns of $X$ are nearly dependent, condition number explodes.
Standardize features before ridge regression; otherwise $\lambda$ is scale-dependent.

References

Historical foundations

Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium.

Classical theory and methods 3. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. 4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). 5. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. 6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

Regularization and ridge regression 7. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. 8. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods. 9. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. 10. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

Inverse problems and regularization 11. Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems. 12. Vogel, C. R. (2002). Computational Methods for Inverse Problems. 13. Ben-Israel, A., & Greville, T. N. E. (2003). Generalized Inverses: Theory and Applications.

Stochastic optimization and deep learning 14. Bottou, L., & LeCun, Y. (1998). Large-scale machine learning with stochastic gradient descent. 15. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. 16. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

Five worked examples

Worked Example 1: Normal equations and condition number#

Introduction#

Solve an overdetermined least squares system via normal equations; compute condition number and compare to QR.

Purpose#

Illustrate how Gram matrix conditioning affects solution accuracy and why normal equations can fail.

Importance#

Guides choice between normal equations (fast but risky) and QR/SVD (stable but slower).

What this example demonstrates#

Construct overdetermined system $X w = y$.
Solve via normal equations and via QR factorization.
Compute condition numbers $\kappa(X)$ and $\kappa(X^\top X)$.
Compare residuals and solution difference.

Background#

Normal equations are fast but square the condition number, amplifying errors when ill-conditioned.

Historical context#

Recognized by Golub–Kahan (1960s) as a fundamental numerical stability issue.

History#

Modern solvers default to QR/SVD and treat normal equations as historical reference.

Prevalence in ML#

Normal equations still used for quick estimates; QR/SVD for production systems.

Notes#

Condition number roughly predicts relative error magnification (error ~ $\kappa$ × machine epsilon).
For ill-conditioned problems, QR/SVD reduce error by factor of $\kappa(X)$.

Connection to ML#

Conditioning affects whether training converges and generalization; regularization helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ follows from SVD; QR avoids squaring via triangular solve.

Pedagogical Significance#

Concrete demonstration of why stable algorithms matter.

References#

Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 80, 6
# Create ill-conditioned system
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -2, d)
X = U[:n, :d] @ np.diag(s) @ V.T
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

# Solve via normal equations
G = X.T @ X
kappa_G = np.linalg.cond(G)
w_ne = np.linalg.solve(G, X.T @ y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)

# Solve via SVD
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ (np.linalg.solve(np.diag(s_svd), U_svd.T @ y))

kappa_X = s_svd[0] / s_svd[-1]
print("kappa(X):", round(kappa_X, 4), "kappa(X^T X):", round(kappa_G, 4))
print("residual NE:", round(np.linalg.norm(X @ w_ne - y), 6))
print("residual QR:", round(np.linalg.norm(X @ w_qr - y), 6))
print("residual SVD:", round(np.linalg.norm(X @ w_svd - y), 6))

Worked Example 2: QR factorization and stable least squares#

Introduction#

Solve least squares via QR factorization; verify projection onto column space.

Purpose#

Show numerically stable approach compared to normal equations.

Importance#

QR is standard in practice; enables backward-substitution on triangular systems.

What this example demonstrates#

Compute QR of $X = QR$.
Solve normal equations as $R w = Q^\top y$ (via back-substitution).
Verify $\hat{y} = Q Q^\top y$ is the projection.

Background#

QR factorization avoids forming $X^\top X$ explicitly; more stable for ill-conditioned data.

Historical context#

Golub–Kahan algorithm (1965) made QR practical; became standard in numerical libraries.

History#

LAPACK and NumPy default QR implementation.

Prevalence in ML#

Used in scikit-learn LinearRegression, statsmodels, and production systems.

Notes#

$\kappa(R) = \kappa(X)$, so no amplification from squaring.
Back-substitution on $R$ is faster than forming inverse.

Connection to ML#

Faster convergence for large-scale regression; enables incremental updates.

Connection to Linear Algebra Theory#

QR reduces $\kappa$ compared to normal equations; triangular solve is $O(d^2)$.

Pedagogical Significance#

Demonstrates practical stability improvements.

References#

Golub & Kahan (1965). Singular values and pseudo-inverses.
Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 80, 6
X = np.random.randn(n, d)
X = X / np.linalg.norm(X, axis=0)  # normalize columns
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

# QR factorization
Q, R = np.linalg.qr(X, mode='reduced')

# Solve via back-substitution
w_qr = np.linalg.solve(R, Q.T @ y)

# Verify projection
y_proj = Q @ (Q.T @ y)
proj_error = np.linalg.norm(y - y_proj)

# Compare to normal equations
G = X.T @ X
w_ne = np.linalg.solve(G, X.T @ y)

print("QR solution:", np.round(w_qr[:3], 4))
print("NE solution:", np.round(w_ne[:3], 4))
print("projection error:", round(proj_error, 8))
print("residual QR:", round(np.linalg.norm(X @ w_qr - y), 6))

Worked Example 3: Ridge regression and regularization parameter#

Introduction#

Solve ridge regression for different $\lambda$ values; demonstrate bias-variance trade-off.

Purpose#

Show how regularization reduces variance at cost of bias; guide $\lambda$ selection via cross-validation.

Importance#

Ridge is standard regularizer in practice; teaches regularization principles.

What this example demonstrates#

Solve ridge normal equations $(X^\top X + \lambda I) w = X^\top y$ for range of $\lambda$.
Compute training error, test error, and norm of solution $\lVert w \rVert$.
Find optimal $\lambda$ via k-fold cross-validation.

Background#

Tikhonov regularization: add penalty $\lambda \lVert w \rVert^2$ to balance fit and complexity.

Historical context#

Tikhonov (1963) for ill-posed problems; Hoerl & Kennard (1970) for regression.

History#

Ridge regression now standard in modern ML frameworks and statistical software.

Prevalence in ML#

Used in virtually all supervised learning systems for regularization.

Notes#

As $\lambda \to 0$: unregularized least squares (high variance, low bias).
As $\lambda \to \infty$: solution $w \to 0$ (high bias, low variance).
Optimal $\lambda$ found by cross-validation or L-curve method.

Connection to ML#

Core regularization strategy; extends to LASSO (L1), elastic net (L1+L2).

Connection to Linear Algebra Theory#

Regularization improves conditioning: $\kappa(X^\top X + \lambda I) = (\sigma_1^2 + \lambda) / (\sigma_d^2 + \lambda)$.

Pedagogical Significance#

Illustrates bias-variance trade-off quantitatively.

References#

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.
Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 20
# Create ill-conditioned design matrix
A = np.random.randn(d, d)
X = np.random.randn(n, d) @ np.linalg.cholesky(A.T @ A).T
w_true = np.random.randn(d)
y = X @ w_true + 0.1 * np.random.randn(n)

lams = np.logspace(-4, 2, 20)
errors_train = []
errors_test = []
norms_w = []

for lam in lams:
    G = X.T @ X + lam * np.eye(d)
    w = np.linalg.solve(G, X.T @ y)
    errors_train.append(np.linalg.norm(X @ w - y)**2 / n)
    errors_test.append(np.linalg.norm(X @ w - y)**2 / n + lam * np.linalg.norm(w)**2)
    norms_w.append(np.linalg.norm(w))

opt_idx = np.argmin(errors_test)
print("optimal lambda:", round(lams[opt_idx], 6))
print("train error at opt:", round(errors_train[opt_idx], 6))
print("test error at opt:", round(errors_test[opt_idx], 6))
print("norm(w) at opt:", round(norms_w[opt_idx], 4))

Worked Example 4: SVD-based pseudoinverse for rank-deficient systems#

Introduction#

Solve rank-deficient least squares via SVD pseudoinverse; compare to underdetermined system.

Purpose#

Show how SVD handles rank deficiency gracefully (vs. normal equations failing).

Importance#

Essential for underdetermined and ill-posed problems; enables robust solutions.

What this example demonstrates#

Construct rank-deficient $X$ (more columns than linearly independent rows).
Compute pseudoinverse $X^\dagger = V \Sigma^{-1} U^\top$ via SVD.
Find minimum-norm solution $w^* = X^\dagger y$.
Verify that solution has smallest $\lVert w \rVert$ among all least-squares solutions.

Background#

Moore–Penrose pseudoinverse extends inverse to non-square/rank-deficient matrices.

Historical context#

Formalized early 1900s; SVD computation enabled practical implementation (Golub 1960s).

History#

Standard in scientific computing and ML libraries for robust least squares.

Prevalence in ML#

Used in feature selection (removing redundant features) and underdetermined systems.

Notes#

Minimum-norm solution is unique; smallest in $\ell_2$ norm among all minimizers.
Handle tiny singular values carefully (threshold or regularize).

Connection to ML#

Supports feature selection and handles collinear features.

Connection to Linear Algebra Theory#

Pseudoinverse via SVD; minimum norm property from projection theory.

Pedagogical Significance#

Extends inversion to singular/rectangular matrices.

References#

Golub & Pereyra (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.
Ben-Israel & Greville (2003). Generalized Inverses: Theory and Applications.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 30
# Rank deficient: only 20 independent columns
X = np.random.randn(n, 20) @ np.random.randn(20, d)
w_true = np.random.randn(d)
w_true[25:] = 0  # sparse ground truth
y = X @ w_true + 0.01 * np.random.randn(n)

# SVD-based pseudoinverse
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = np.sum(s > 1e-10)
w_pinv = Vt[:r].T @ (np.linalg.solve(np.diag(s[:r]), U[:, :r].T @ y))

# Extend to full dimension
w_pinv_full = np.zeros(d)
w_pinv_full[:len(w_pinv)] = w_pinv if len(w_pinv) == d else np.concatenate([w_pinv, np.zeros(d - len(w_pinv))])

print("rank of X:", r)
print("residual:", round(np.linalg.norm(X @ w_pinv_full - y), 6))
print("norm(w):", round(np.linalg.norm(w_pinv_full), 4))

Worked Example 5: Truncated SVD for ill-posed inverse problems#

Introduction#

Solve an ill-posed inverse problem; apply truncated SVD regularization to stabilize solution.

Purpose#

Demonstrate spectral filtering and its effect on noise amplification.

Importance#

Core technique in inverse problems (imaging, geophysics); teaches when to truncate spectrum.

What this example demonstrates#

Construct ill-posed system with decaying singular values.
Solve with pseudoinverse (amplifies noise) vs. truncated SVD (filters noise).
Compare noise-free and noisy solutions; show improved robustness of truncation.

Background#

Ill-posed problems have tiny singular values; pseudoinverse amplifies noise. Truncation discards these.

Historical context#

Hansen (1998) and Vogel (2002) developed regularization tools for inverse problems.

History#

Standard in medical imaging (deblurring CT/MRI) and geophysical inversion.

Prevalence in ML#

Used in deblurring, denoising, and parameter estimation in inverse problems.

Notes#

Choose truncation point via L-curve, GCV, or discrepancy principle.
Trade-off: lower truncation $\to$ more smoothing, less noise, but more bias.

Connection to ML#

Improves robustness of learned models in presence of noise and measurement error.

Connection to Linear Algebra Theory#

Small singular values correspond to high-frequency/noisy directions; truncation removes them.

Pedagogical Significance#

Shows quantitative benefit of spectral filtering.

References#

Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems.
Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 80, 50
# Create ill-posed system: exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(d, d))
s = np.exp(-np.linspace(0, 3, min(n, d)))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U @ Sigma @ V.T

# True solution and clean data
w_true = np.zeros(d)
w_true[:5] = [10, 5, 2, 1, 0.5]
y_clean = A @ w_true

# Add noise
noise_level = 0.01
y_noisy = y_clean + noise_level * np.random.randn(n)

# Full pseudoinverse solution
U_a, s_a, Vt_a = np.linalg.svd(A, full_matrices=False)
w_full = Vt_a.T @ (np.linalg.solve(np.diag(s_a), U_a.T @ y_noisy))

# Truncated SVD solutions
errors = []
truncs = range(5, 30)
for trunc in truncs:
    s_trunc = s_a[:trunc]
    w_trunc = Vt_a[:trunc].T @ (np.linalg.solve(np.diag(s_trunc), U_a[:, :trunc].T @ y_noisy))
    err = np.linalg.norm(w_trunc - w_true)
    errors.append(err)

best_trunc = truncs[np.argmin(errors)]
print("smallest singular value:", round(s_a[-1], 8))
print("error full pseudoinverse:", round(np.linalg.norm(w_full - w_true), 4))
print("error best truncation (k={})".format(best_trunc), round(min(errors), 4))

Comments

Algorithm Category

Approximation Methods

Direct Methods

Computational Efficiency

Cache-Aware Algorithms

Data Modality

Dense Matrices

Low-Rank Matrices

Historical & Attribution

20th Century

Key Concepts & Theorems

Projection & Orthogonality

Learning Path & Sequencing

Intermediate

Matrix Decompositions

Orthogonal Factorizations

Singular Value Decomposition (SVD)

ML Applications

Supervised Learning

Numerical Stability & Robustness

Algorithm Stability

Theoretical Foundation

Numerical Linear Algebra

Orthogonality & Projections

Chapter 6

Orthogonality & Projections

Key ideas: Introduction

Introduction#

Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:

Residuals in least squares are orthogonal to the column space (no further decrease possible within subspace)
Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace
Orthonormal bases simplify computations and improve numerical stability
Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers
PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation

Important ideas#

Orthogonality and complements
- $x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.
Orthogonal projectors
- A projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For orthonormal $U\in\mathbb{R}^{d\times k}$ spanning $\mathcal{S}$: $P=UU^\top$.
Projection theorem
- For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $r\in\mathcal{S}^\perp$ that minimizes $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.
Pythagorean identity
- If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.
Orthonormal bases and QR
- Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.
Spectral/SVD structure
- For symmetric $\Sigma$, eigenvectors are orthonormal; SVD gives $X=U\Sigma V^\top$ with $U,V$ orthogonal. Truncation yields best rank-$k$ approximation (Eckart–Young).
Orthogonal transformations
- $Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; determinants $\pm1$ (rotations or reflections). Condition numbers remain unchanged.

Relevance to ML#

Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.
PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.
Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.
Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.
Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.
Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).

Algorithmic development (milestones)#

1900s–1930s: Gram–Schmidt orthonormalization; least squares geometry formalized.
1958–1965: Householder reflections and Golub’s QR algorithms stabilize orthogonalization.
1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).
1966: Orthogonal Procrustes (Schönemann) closed-form solution.
1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.
2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.

Definitions#

Orthogonal/Orthonormal: columns of $U$ satisfy $U^\top U=I$; orthonormal if unit length as well.
Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.
Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.
Orthogonal matrix: $Q^\top Q=I$; preserves norms and inner products.
PCA subspace: top-$k$ eigenvectors of covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems)#

Projection theorem: For closed subspace $\mathcal{S}$, projection $P_\mathcal{S}x$ uniquely minimizes $\lVert x-s\rVert_2$; residual is orthogonal to $\mathcal{S}$.
Pythagorean/Bessel/Parseval: Orthogonal decompositions preserve squared norms; partial sums bounded (Bessel); complete bases preserve energy (Parseval).
Fundamental theorem of linear algebra: $\text{col}(A)$ is orthogonal to $\text{null}(A^\top)$; $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$.
Spectral theorem: Symmetric matrices have orthonormal eigenbases; diagonalizable by $Q^\top A Q$.
Eckart–Young–Mirsky: Best rank-$k$ approximation in Frobenius/2-norm via truncated SVD.

Applied (landmark systems and practices)#

PCA/whitening: Jolliffe (2002); Shlens (2014) — denoising and compression.
Least squares/QR solvers: Golub–Van Loan (2013) — stable projections.
Orthogonal Procrustes in embedding alignment: Schönemann (1966); Smith et al. (2017).
Orthogonal initialization/constraints: Saxe et al. (2013); Mishkin & Matas (2015).
Subspace tracking and signal processing: Halko et al. (2011) randomized SVD.

Key ideas: Where it shows up

PCA and subspace denoising

PCA finds orthonormal directions $U$ maximizing variance; projection $X_k = X V_k V_k^\top$ minimizes reconstruction error.
Achievements: Dimensionality reduction at scale; whitening and denoising in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

Least squares as projection

$\hat{y} = X w^*$ is the projection of $y$ onto $\text{col}(X)$; residual $r=y-\hat{y}$ satisfies $X^\top r=0$.
Achievements: Foundational to regression and linear models; efficient via QR/SVD. References: Gauss 1809; Golub–Van Loan 2013.

Orthogonalization algorithms (QR)

Householder/Modified Gram–Schmidt produce orthonormal bases with numerical stability; essential in solvers and factorizations.
Achievements: Robust, high-performance linear algebra libraries (LAPACK). References: Householder 1958; Golub 1965; Trefethen–Bau 1997.

Orthogonal Procrustes and embedding alignment

Best orthogonal alignment between representation spaces via SVD of $A^\top B$ (solution $R=UV^\top$).
Achievements: Cross-lingual word embedding alignment; domain adaptation. References: Schönemann 1966; Smith et al. 2017.

Orthogonal constraints/initialization in deep nets

Orthogonal weight matrices preserve variance across layers; improve training stability and gradient flow.
Achievements: Deep linear dynamics analysis; practical initializations. References: Saxe et al. 2013; Mishkin & Matas 2015.

Notation

Data matrix and spaces: $X\in\mathbb{R}^{n\times d}$, $\text{col}(X)\subseteq\mathbb{R}^n$, $\text{null}(X^\top)$.
Orthonormal basis: $U\in\mathbb{R}^{n\times k}$ with $U^\top U=I$.
Orthogonal projector: $P=UU^\top$ (symmetric, idempotent); residual $r=(I-P)y$ satisfies $U^\top r=0$.
QR factorization: $X=QR$ with $Q^\top Q=I$; $Q$ spans $\text{col}(X)$.
SVD/PCA: $X=U\Sigma V^\top$; top-$k$ projection $P_k=U_k U_k^\top$ (or $X V_k V_k^\top$ on features).
Examples:
- Least squares via projection: $\hat{y} = P y$ with $P=Q Q^\top$ for $Q$ from QR of $X$.
- PCA reconstruction: $\hat{X} = X V_k V_k^\top$; error $\lVert X-\hat{X}\rVert_F^2 = \sum_{i>k}\sigma_i^2$.
- Procrustes alignment: $R=UV^\top$ from SVD of $A^\top B$; $R$ is orthogonal.

Pitfalls & sanity checks

Centering for PCA: use $X_c$ to ensure principal directions capture variance, not mean.
Orthogonality of bases: $U$ must be orthonormal for $P=UU^\top$ to be an orthogonal projector; otherwise projection is oblique.
Numerical orthogonality: prefer QR/SVD; classical Gram–Schmidt can lose orthogonality under ill-conditioning.
Certificates: verify $P$ is symmetric/idempotent and that residuals are orthogonal to $\text{col}(X)$.
Overfitting with high-$k$ PCA: track retained variance and use validation.

References

Foundations and numerical linear algebra

Strang, G. (2016). Introduction to Linear Algebra (5th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Projections, orthogonality, and approximation 4. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. 5. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix. 6. Gram, J. (1883); Schmidt, E. (1907). Orthonormalization methods.

PCA and applications 7. Jolliffe, I. (2002). Principal Component Analysis. 8. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

Embedding alignment and orthogonal methods in ML 9. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem. 10. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations. 11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 12. Mishkin, D., & Matas, J. (2015). All you need is a good init.

General ML texts 13. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. 14. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

Five worked examples

Worked Example 1: Least squares as orthogonal projection (QR certificate)#

Introduction#

Show that least squares fits correspond to orthogonal projection of $y$ onto $\text{col}(X)$, with residual orthogonal to features.

Purpose#

Derive $\hat{y}=P y$ with $P=Q Q^\top$ and verify $X^\top r=0$ numerically.

Importance#

Anchors regression in subspace geometry; provides robust implementation guidance via QR.

What this example demonstrates#

$X=QR$ with $Q^\top Q=I$ yields $\hat{y}=QQ^\top y$.
Residual $r=y-\hat{y}$ satisfies $Q^\top r=0$ and $X^\top r=0$.

Background#

Least squares minimizes squared error; projection theorem assures unique closest point in $\text{col}(X)$.

Historical context#

Gauss/Legendre least squares; Householder/Golub QR for numerical stability.

Prevalence in ML#

Linear models, GLM approximations, and as inner loops in larger systems.

Notes#

Prefer QR/SVD over normal equations.
Check $P$ is symmetric and idempotent in code.

Connection to ML#

Core of regression pipelines; basis for Ridge/Lasso solvers (with modifications).

Connection to Linear Algebra Theory#

Projection theorem; FTLA decomposition $\mathbb{R}^n=\text{col}(X)\oplus\text{null}(X^\top)$.

Pedagogical Significance#

Gives a geometric certificate of optimality via orthogonality.

References#

Gauss (1809); Legendre (1805) — least squares.
Golub & Van Loan (2013) — QR solvers.
Trefethen & Bau (1997) — numerical linear algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 5
X = np.random.randn(n, d)
w_true = np.array([1.2, -0.8, 0.5, 0.0, 2.0])
y = X @ w_true + 0.1 * np.random.randn(n)

Q, R = np.linalg.qr(X)
P = Q @ Q.T
y_hat = P @ y
r = y - y_hat

# Certificates
print("Symmetric P?", np.allclose(P, P.T, atol=1e-10))
print("Idempotent P?", np.allclose(P @ P, P, atol=1e-10))
print("Q^T r ~ 0?", np.linalg.norm(Q.T @ r))
print("X^T r ~ 0?", np.linalg.norm(X.T @ r))

# Compare to lstsq fit
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Projection match?", np.allclose(y_hat, X @ w_ls, atol=1e-8))

Worked Example 2: PCA projection and best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate orthogonal projection onto top-$k$ principal components and verify reconstruction error equals the sum of squared tail singular values.

Purpose#

Connect PCA’s orthogonal subspace to optimal low-rank approximation.

Importance#

Backbone of dimensionality reduction and denoising in ML.

What this example demonstrates#

$X=U\Sigma V^\top$; projection to rank-$k$ is $X_k = U_k \Sigma_k V_k^\top = X V_k V_k^\top$.
Error: $\lVert X-X_k\rVert_F^2 = \sum_{i>k} \sigma_i^2$.

Background#

Eckart–Young shows truncated SVD minimizes Frobenius/2-norm error among rank-$k$ matrices.

Historical context#

Low-rank approximation dates to the 1930s; widespread modern use in ML systems.

Prevalence in ML#

Feature compression, noise removal, approximate nearest neighbors, latent semantic analysis.

Notes#

Center data for covariance-based PCA; use SVD directly on $X_c$.

Connection to ML#

Trade off between compression (smaller $k$) and fidelity (retained variance).

Connection to Linear Algebra Theory#

Orthogonal projectors $U_k U_k^\top$; spectral ordering of singular values.

Pedagogical Significance#

Illustrates how orthogonality yields optimality guarantees.

References#

Eckart & Young (1936) — best rank-$k$.
Jolliffe (2002) — PCA.
Shlens (2014) — PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 80, 30, 5
X = np.random.randn(n, d) @ np.diag(np.linspace(5, 0.1, d))  # create decaying spectrum
Xc = X - X.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T
Xk = Xc @ Vk @ Vk.T

err = np.linalg.norm(Xc - Xk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-6))

Worked Example 3: Gram–Schmidt vs Householder QR (orthogonality under stress)#

Introduction#

Compare classical Gram–Schmidt to numerically stable QR on nearly colinear vectors.

Purpose#

Show why stable orthogonalization matters when projecting in high dimensions.

Importance#

Precision loss destroys orthogonality and degrades projections/solvers.

What this example demonstrates#

Classical GS loses orthogonality; QR (Householder) maintains $Q^\top Q\approx I$.

Background#

Modified GS improves stability, but Householder QR is preferred in libraries.

Historical context#

Stability advancements from Gram–Schmidt to Householder underpin modern LAPACK.

Prevalence in ML#

Everywhere orthogonalization is needed: least squares, PCA, subspace tracking.

Notes#

Measure orthogonality via $\lVert Q^\top Q - I\rVert$.

Connection to ML#

Reliable projections and decompositions => reliable models.

Connection to Linear Algebra Theory#

Orthogonality preservation and rounding error analysis.

Pedagogical Significance#

Demonstrates the gap between algebraic identities and floating-point realities.

References#

Trefethen & Bau (1997). Numerical Linear Algebra.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 40, 8
X = np.random.randn(n, d)
X[:, 1] = X[:, 0] + 1e-6 * np.random.randn(n)  # near colinearity

# Classical Gram–Schmidt
def classical_gs(A):
	 A = A.copy().astype(float)
	 n, d = A.shape
	 Q = np.zeros_like(A)
	 for j in range(d):
		  v = A[:, j]
		  for i in range(j):
				v = v - Q[:, i] * (Q[:, i].T @ A[:, j])
		  Q[:, j] = v / (np.linalg.norm(v) + 1e-18)
	 return Q

Q_gs = classical_gs(X)
Q_qr, _ = np.linalg.qr(X)

orth_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(d))
orth_qr = np.linalg.norm(Q_qr.T @ Q_qr - np.eye(d))
print("||Q^TQ - I|| (GS)", orth_gs)
print("||Q^TQ - I|| (QR)", orth_qr)

Worked Example 4: Orthogonal Procrustes — aligning embeddings via SVD#

Introduction#

Find the orthogonal matrix $R$ that best aligns $A$ to $B$ by minimizing $\lVert AR - B\rVert_F$.

Purpose#

Show closed-form solution $R=UV^\top$ from SVD of $A^\top B$ and connect to embedding alignment.

Importance#

Stable alignment across domains/languages without distorting geometry.

What this example demonstrates#

If $A^\top B = U\Sigma V^\top$, the optimal orthogonal $R=UV^\top$.

Background#

Procrustes problems arise in shape analysis and representation alignment.

Historical context#

Schönemann (1966) established the orthogonal solution; widely used afterward.

Prevalence in ML#

Cross-lingual word embeddings and domain adaptation pipelines.

Notes#

Center and scale if appropriate; enforce $\det(R)=+1$ for rotation-only alignment (optional).

Connection to ML#

Enables mapping between independently trained embedding spaces.

Connection to Linear Algebra Theory#

Orthogonal transformations preserve inner products; SVD reveals optimal rotation/reflection.

Pedagogical Significance#

Bridges an optimization problem to a single SVD call.

References#

Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.
Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 16
A = np.random.randn(n, d)
Q, _ = np.linalg.qr(np.random.randn(d, d))  # true orthogonal map
B = A @ Q + 0.01 * np.random.randn(n, d)

M = A.T @ B
U, S, Vt = np.linalg.svd(M)
R = U @ Vt

err = np.linalg.norm(A @ R - B, 'fro')
print("Alignment error:", round(err, 4))
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d), atol=1e-8))

Worked Example 5: Householder reflections — building orthogonal projectors#

Introduction#

Construct a Householder reflection to zero components and illustrate its orthogonality and symmetry; connect to QR and projection building.

Purpose#

Expose a basic orthogonal transformation used to construct $Q$ in QR.

Importance#

Underpins numerically stable orthogonalization in solvers and projections.

What this example demonstrates#

$H=I-2uu^\top$ is orthogonal and symmetric; $Hx$ zeros all but one component.

Background#

Householder reflections are the workhorse of QR; compose reflections to build $Q$.

Historical context#

Householder (1958) introduced the approach; remains standard.

Prevalence in ML#

Appears indirectly via libraries (NumPy/SciPy/LAPACK) that power ML pipelines.

Notes#

Stable and efficient vs. naive orthogonalization in finite precision.

Connection to ML#

Reliable QR leads to reliable least squares, PCA, and projection-based models.

Connection to Linear Algebra Theory#

Reflections generate orthogonal groups; preserve lengths and angles.

Pedagogical Significance#

Shows a concrete, constructive way to obtain orthogonal maps.

References#

Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(4)
d = 6
x = np.random.randn(d)
e1 = np.zeros(d); e1[0] = 1.0
v = x + np.sign(x[0]) * np.linalg.norm(x) * e1
u = v / (np.linalg.norm(v) + 1e-18)
H = np.eye(d) - 2 * np.outer(u, u)

Hx = H @ x
print("H orthogonal?", np.allclose(H.T @ H, np.eye(d), atol=1e-10))
print("H symmetric?", np.allclose(H, H.T, atol=1e-10))
print("Zeroed tail?", np.allclose(Hx[1:], 0.0, atol=1e-8))