ex1.ai

Iterative Methods

Computational Efficiency

Cache-Aware Algorithms

Data Modality

Structured Matrices

Historical & Attribution

Key Concepts & Theorems

Orthogonal Factorizations

Learning Path & Sequencing

Intermediate

Matrix Decompositions

Cholesky Decomposition

ML Applications

Optimization & Training

Numerical Stability & Robustness

Preconditioning

Theoretical Foundation

Numerical Linear Algebra

Least Squares

Chapter 12

Least Squares

Key ideas: Algorithmic development history

Algorithmic development (milestones)#

1795: Legendre and Gauss independently develop least squares for astronomy/surveying.
1881–1920: Cholesky factorization and early numerical algorithms.
1960s: Golub–Kahan QR algorithm; recognition of conditioning issues in normal equations.
1970s–1980s: Tikhonov regularization and Hansen’s methods for ill-posed problems.
1990s: Ridge regression, elastic net, and LASSO via modern regularization theory (Hastie et al.).
2000s: Stochastic gradient descent for large-scale least squares (Bottou–LeCun).
2010s: Implicit regularization in deep learning; connections between SGD and generalization.

Key ideas: Definitions

Definitions#

Least squares problem: $\min_w \lVert X w - y \rVert_2^2$ with $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^n$.
Normal equations: $X^\top X w = X^\top y$.
Residual: $r = X w - y \in \mathbb{R}^n$.
Gram matrix: $G = X^\top X \in \mathbb{R}^{d\times d}$ (PSD).
Condition number: $\kappa(X) = \sigma_1 / \sigma_d$ (ratio of singular values).
Ridge regression: $\min_w (\lVert X w - y \rVert^2 + \lambda \lVert w \rVert^2)$; solution $(X^\top X + \lambda I)^{-1} X^\top y$.
Regularization parameter: $\lambda \ge 0$ controls trade-off between fit and smoothness.

Key ideas: Introduction

Introduction#

Least squares is the workhorse of supervised learning. Given data $X \in \mathbb{R}^{n\times d}$ and targets $y \in \mathbb{R}^n$ with $n > d$, least squares finds $w \in \mathbb{R}^d$ minimizing $f(w) = \tfrac{1}{2}\lVert X w - y \rVert_2^2$. Geometrically, it projects $y$ onto the column space of $X$. The solution $w^* = (X^\top X)^{-1} X^\top y$ exists if $X$ has full rank; stable computation uses QR or SVD.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

Overdetermined systems and least squares formulation as projection onto column space.
Normal equations and optimality: $\nabla f(w) = X^\top(X w - y) = 0$.
Gram matrix $G = X^\top X$ is PSD; condition number $\kappa(G) = \kappa(X)^2$.
QR decomposition $X = QR$; normal equations become $R w = Q^\top y$ (stable).
SVD solution $w^* = V \Sigma^{-1} U^\top y$ and pseudoinverse.
Ridge regression normal equations and bias-variance trade-off.
Regularization parameter selection (cross-validation, L-curve, GCV).

Applied (landmark systems)#

Linear regression (Hastie et al. 2009; scikit-learn implementation).
Kernel ridge regression (Rasmussen & Williams 2006; standard GP predictor).
Regularization for ill-posed inverse problems (Hansen 1998; Vogel 2002).
Elastic net for feature selection (Zou & Hastie 2005).
LASSO regression (Tibshirani 1996).
SGD for large-scale least squares (Bottou & LeCun 1998).
Implicit regularization in neural networks (Zhu et al. 2021).

Key ideas: Important ideas

Important ideas#

Normal equations
- $X^\top X w = X^\top y$ characterizes optimality via zero gradient.
Residuals and loss
- Residual $r = X w - y$; loss $f(w) = \tfrac{1}{2}\lVert r \rVert^2$ is convex in $w$.
Geometry: projection
- $\hat{y} = X w^* = X(X^\top X)^{-1} X^\top y = P_X y$ projects onto column space.
Conditioning and stability
- Condition number $\kappa(X^\top X) = \kappa(X)^2$ amplifies numerical error; prefer QR/SVD.
Pseudoinverse solution
- $w^* = X^\dagger y$ with $X^\dagger = V \Sigma^{-1} U^\top$ (SVD-based); handles rank-deficiency.
Ridge regression
- Add regularizer $\lambda \lVert w \rVert^2$; normal equations become $(X^\top X + \lambda I) w = X^\top y$. Trades bias for lower variance.
Regularization and ill-posedness
- Truncated SVD or Tikhonov filtering remove small singular values; stabilizes solutions to ill-posed inverse problems.

Key ideas: Relevance to ML

Relevance to ML#

Core regression algorithm: linear, polynomial, feature-engineered models.
Bias-variance trade-off: unregularized overfits on noise; regularization improves generalization.
Feature selection and dimensionality: via regularization (L1/elastic net) or subset selection.
Inverse problems: medical imaging, seismic inversion, parameter estimation.
Kernel methods: kernel ridge regression as Tikhonov in infinite-dimensional spaces.
Deep learning: implicit regularization in SGD and architecture design inspired by least squares principles.

Key ideas: Where it shows up

Linear regression and generalized linear models

Core supervised learning; extends to logistic regression, Poisson regression, etc. Achievements: classical statistical foundation; scikit-learn, TensorFlow standard solvers. References: Hastie et al. 2009.

Kernel methods and kernel ridge regression

Least squares in kernel-induced spaces; KRR = Tikhonov regularization in RKHS. Achievements: competitive with SVMs, enables Gaussian process prediction. References: Rasmussen & Williams 2006.

Inverse problems and imaging

Regularized least squares for ill-posed geophysics, medical imaging (CT, MRI). Achievements: Hansen 1998 (regularization tools); clinical deployment. References: Vogel 2002 (computational methods).

Dimensionality reduction via regularization

Ridge regression reduces variance on high-dimensional data; elastic net combines L1/L2 penalties. Achievements: Zou & Hastie 2005 (elastic net); foundation for modern feature selection. References: Tibshirani 1996 (LASSO).

Stochastic gradient descent and deep learning

SGD on least squares loss drives optimization; implicit regularization enables generalization. Achievements: Bottou & LeCun 1998 (stochastic methods); foundation for deep learning. References: Zhu et al. 2021 (implicit regularization theory).

Notation

Data and targets: $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^n$ (overdetermined: $n > d$).
Parameter vector: $w \in \mathbb{R}^d$.
Predictions and residuals: $\hat{y} = X w$, $r = y - X w$.
Loss (least squares): $f(w) = \tfrac{1}{2} \lVert X w - y \rVert_2^2 = \tfrac{1}{2} \lVert r \rVert_2^2$.
Gram matrix: $G = X^\top X \in \mathbb{R}^{d\times d}$ (PSD).
Normal equations: $G w = X^\top y$.
QR factorization: $X = QR$ with $Q \in \mathbb{R}^{n\times d}, R \in \mathbb{R}^{d\times d}$ upper triangular.
SVD: $X = U \Sigma V^\top$; solution $w^* = V \Sigma^{-1} U^\top y$.
Ridge regression: $w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
Condition number: $\kappa(X) = \sigma_1 / \sigma_d$; $\kappa(G) = \kappa(X)^2$.
Example: If $X$ is $100 \times 5$ with $\sigma_1 = 10, \sigma_5 = 0.1$, then $\kappa(X) = 100$ and $\kappa(X^\top X) = 10000$ (ill-conditioned); use QR or SVD instead of normal equations.

Pitfalls & sanity checks

Never solve normal equations for ill-conditioned $X$; use QR or SVD instead.
Verify system is overdetermined ($n > d$); underdetermined requires pseudoinverse or regularization.
Check $\operatorname{rank}(X) = d$; if rank-deficient, pseudoinverse is needed.
Residual $\lVert X w - y \rVert$ should be small but nonzero (unless exact solution exists).
Condition number $\kappa(X)$ predicts error magnification; regularize if too large.
Cross-validate regularization parameter $\lambda$; do not fit on training data.
Check for multicollinearity: if columns of $X$ are nearly dependent, condition number explodes.
Standardize features before ridge regression; otherwise $\lambda$ is scale-dependent.

References

Historical foundations

Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium.

Classical theory and methods 3. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. 4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). 5. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. 6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

Regularization and ridge regression 7. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. 8. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods. 9. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. 10. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

Inverse problems and regularization 11. Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems. 12. Vogel, C. R. (2002). Computational Methods for Inverse Problems. 13. Ben-Israel, A., & Greville, T. N. E. (2003). Generalized Inverses: Theory and Applications.

Stochastic optimization and deep learning 14. Bottou, L., & LeCun, Y. (1998). Large-scale machine learning with stochastic gradient descent. 15. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. 16. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

Five worked examples

Worked Example 1: Normal equations and condition number#

Introduction#

Solve an overdetermined least squares system via normal equations; compute condition number and compare to QR.

Purpose#

Illustrate how Gram matrix conditioning affects solution accuracy and why normal equations can fail.

Importance#

Guides choice between normal equations (fast but risky) and QR/SVD (stable but slower).

What this example demonstrates#

Construct overdetermined system $X w = y$.
Solve via normal equations and via QR factorization.
Compute condition numbers $\kappa(X)$ and $\kappa(X^\top X)$.
Compare residuals and solution difference.

Background#

Normal equations are fast but square the condition number, amplifying errors when ill-conditioned.

Historical context#

Recognized by Golub–Kahan (1960s) as a fundamental numerical stability issue.

History#

Modern solvers default to QR/SVD and treat normal equations as historical reference.

Prevalence in ML#

Normal equations still used for quick estimates; QR/SVD for production systems.

Notes#

Condition number roughly predicts relative error magnification (error ~ $\kappa$ × machine epsilon).
For ill-conditioned problems, QR/SVD reduce error by factor of $\kappa(X)$.

Connection to ML#

Conditioning affects whether training converges and generalization; regularization helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ follows from SVD; QR avoids squaring via triangular solve.

Pedagogical Significance#

Concrete demonstration of why stable algorithms matter.

References#

Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 80, 6
# Create ill-conditioned system
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -2, d)
X = U[:n, :d] @ np.diag(s) @ V.T
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

# Solve via normal equations
G = X.T @ X
kappa_G = np.linalg.cond(G)
w_ne = np.linalg.solve(G, X.T @ y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)

# Solve via SVD
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ (np.linalg.solve(np.diag(s_svd), U_svd.T @ y))

kappa_X = s_svd[0] / s_svd[-1]
print("kappa(X):", round(kappa_X, 4), "kappa(X^T X):", round(kappa_G, 4))
print("residual NE:", round(np.linalg.norm(X @ w_ne - y), 6))
print("residual QR:", round(np.linalg.norm(X @ w_qr - y), 6))
print("residual SVD:", round(np.linalg.norm(X @ w_svd - y), 6))

Worked Example 2: QR factorization and stable least squares#

Introduction#

Solve least squares via QR factorization; verify projection onto column space.

Purpose#

Show numerically stable approach compared to normal equations.

Importance#

QR is standard in practice; enables backward-substitution on triangular systems.

What this example demonstrates#

Compute QR of $X = QR$.
Solve normal equations as $R w = Q^\top y$ (via back-substitution).
Verify $\hat{y} = Q Q^\top y$ is the projection.

Background#

QR factorization avoids forming $X^\top X$ explicitly; more stable for ill-conditioned data.

Historical context#

Golub–Kahan algorithm (1965) made QR practical; became standard in numerical libraries.

History#

LAPACK and NumPy default QR implementation.

Prevalence in ML#

Used in scikit-learn LinearRegression, statsmodels, and production systems.

Notes#

$\kappa(R) = \kappa(X)$, so no amplification from squaring.
Back-substitution on $R$ is faster than forming inverse.

Connection to ML#

Faster convergence for large-scale regression; enables incremental updates.

Connection to Linear Algebra Theory#

QR reduces $\kappa$ compared to normal equations; triangular solve is $O(d^2)$.

Pedagogical Significance#

Demonstrates practical stability improvements.

References#

Golub & Kahan (1965). Singular values and pseudo-inverses.
Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 80, 6
X = np.random.randn(n, d)
X = X / np.linalg.norm(X, axis=0)  # normalize columns
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

# QR factorization
Q, R = np.linalg.qr(X, mode='reduced')

# Solve via back-substitution
w_qr = np.linalg.solve(R, Q.T @ y)

# Verify projection
y_proj = Q @ (Q.T @ y)
proj_error = np.linalg.norm(y - y_proj)

# Compare to normal equations
G = X.T @ X
w_ne = np.linalg.solve(G, X.T @ y)

print("QR solution:", np.round(w_qr[:3], 4))
print("NE solution:", np.round(w_ne[:3], 4))
print("projection error:", round(proj_error, 8))
print("residual QR:", round(np.linalg.norm(X @ w_qr - y), 6))

Worked Example 3: Ridge regression and regularization parameter#

Introduction#

Solve ridge regression for different $\lambda$ values; demonstrate bias-variance trade-off.

Purpose#

Show how regularization reduces variance at cost of bias; guide $\lambda$ selection via cross-validation.

Importance#

Ridge is standard regularizer in practice; teaches regularization principles.

What this example demonstrates#

Solve ridge normal equations $(X^\top X + \lambda I) w = X^\top y$ for range of $\lambda$.
Compute training error, test error, and norm of solution $\lVert w \rVert$.
Find optimal $\lambda$ via k-fold cross-validation.

Background#

Tikhonov regularization: add penalty $\lambda \lVert w \rVert^2$ to balance fit and complexity.

Historical context#

Tikhonov (1963) for ill-posed problems; Hoerl & Kennard (1970) for regression.

History#

Ridge regression now standard in modern ML frameworks and statistical software.

Prevalence in ML#

Used in virtually all supervised learning systems for regularization.

Notes#

As $\lambda \to 0$: unregularized least squares (high variance, low bias).
As $\lambda \to \infty$: solution $w \to 0$ (high bias, low variance).
Optimal $\lambda$ found by cross-validation or L-curve method.

Connection to ML#

Core regularization strategy; extends to LASSO (L1), elastic net (L1+L2).

Connection to Linear Algebra Theory#

Regularization improves conditioning: $\kappa(X^\top X + \lambda I) = (\sigma_1^2 + \lambda) / (\sigma_d^2 + \lambda)$.

Pedagogical Significance#

Illustrates bias-variance trade-off quantitatively.

References#

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.
Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 20
# Create ill-conditioned design matrix
A = np.random.randn(d, d)
X = np.random.randn(n, d) @ np.linalg.cholesky(A.T @ A).T
w_true = np.random.randn(d)
y = X @ w_true + 0.1 * np.random.randn(n)

lams = np.logspace(-4, 2, 20)
errors_train = []
errors_test = []
norms_w = []

for lam in lams:
    G = X.T @ X + lam * np.eye(d)
    w = np.linalg.solve(G, X.T @ y)
    errors_train.append(np.linalg.norm(X @ w - y)**2 / n)
    errors_test.append(np.linalg.norm(X @ w - y)**2 / n + lam * np.linalg.norm(w)**2)
    norms_w.append(np.linalg.norm(w))

opt_idx = np.argmin(errors_test)
print("optimal lambda:", round(lams[opt_idx], 6))
print("train error at opt:", round(errors_train[opt_idx], 6))
print("test error at opt:", round(errors_test[opt_idx], 6))
print("norm(w) at opt:", round(norms_w[opt_idx], 4))

Worked Example 4: SVD-based pseudoinverse for rank-deficient systems#

Introduction#

Solve rank-deficient least squares via SVD pseudoinverse; compare to underdetermined system.

Purpose#

Show how SVD handles rank deficiency gracefully (vs. normal equations failing).

Importance#

Essential for underdetermined and ill-posed problems; enables robust solutions.

What this example demonstrates#

Construct rank-deficient $X$ (more columns than linearly independent rows).
Compute pseudoinverse $X^\dagger = V \Sigma^{-1} U^\top$ via SVD.
Find minimum-norm solution $w^* = X^\dagger y$.
Verify that solution has smallest $\lVert w \rVert$ among all least-squares solutions.

Background#

Moore–Penrose pseudoinverse extends inverse to non-square/rank-deficient matrices.

Historical context#

Formalized early 1900s; SVD computation enabled practical implementation (Golub 1960s).

History#

Standard in scientific computing and ML libraries for robust least squares.

Prevalence in ML#

Used in feature selection (removing redundant features) and underdetermined systems.

Notes#

Minimum-norm solution is unique; smallest in $\ell_2$ norm among all minimizers.
Handle tiny singular values carefully (threshold or regularize).

Connection to ML#

Supports feature selection and handles collinear features.

Connection to Linear Algebra Theory#

Pseudoinverse via SVD; minimum norm property from projection theory.

Pedagogical Significance#

Extends inversion to singular/rectangular matrices.

References#

Golub & Pereyra (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.
Ben-Israel & Greville (2003). Generalized Inverses: Theory and Applications.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 30
# Rank deficient: only 20 independent columns
X = np.random.randn(n, 20) @ np.random.randn(20, d)
w_true = np.random.randn(d)
w_true[25:] = 0  # sparse ground truth
y = X @ w_true + 0.01 * np.random.randn(n)

# SVD-based pseudoinverse
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = np.sum(s > 1e-10)
w_pinv = Vt[:r].T @ (np.linalg.solve(np.diag(s[:r]), U[:, :r].T @ y))

# Extend to full dimension
w_pinv_full = np.zeros(d)
w_pinv_full[:len(w_pinv)] = w_pinv if len(w_pinv) == d else np.concatenate([w_pinv, np.zeros(d - len(w_pinv))])

print("rank of X:", r)
print("residual:", round(np.linalg.norm(X @ w_pinv_full - y), 6))
print("norm(w):", round(np.linalg.norm(w_pinv_full), 4))

Worked Example 5: Truncated SVD for ill-posed inverse problems#

Introduction#

Solve an ill-posed inverse problem; apply truncated SVD regularization to stabilize solution.

Purpose#

Demonstrate spectral filtering and its effect on noise amplification.

Importance#

Core technique in inverse problems (imaging, geophysics); teaches when to truncate spectrum.

What this example demonstrates#

Construct ill-posed system with decaying singular values.
Solve with pseudoinverse (amplifies noise) vs. truncated SVD (filters noise).
Compare noise-free and noisy solutions; show improved robustness of truncation.

Background#

Ill-posed problems have tiny singular values; pseudoinverse amplifies noise. Truncation discards these.

Historical context#

Hansen (1998) and Vogel (2002) developed regularization tools for inverse problems.

History#

Standard in medical imaging (deblurring CT/MRI) and geophysical inversion.

Prevalence in ML#

Used in deblurring, denoising, and parameter estimation in inverse problems.

Notes#

Choose truncation point via L-curve, GCV, or discrepancy principle.
Trade-off: lower truncation $\to$ more smoothing, less noise, but more bias.

Connection to ML#

Improves robustness of learned models in presence of noise and measurement error.

Connection to Linear Algebra Theory#

Small singular values correspond to high-frequency/noisy directions; truncation removes them.

Pedagogical Significance#

Shows quantitative benefit of spectral filtering.

References#

Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems.
Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 80, 50
# Create ill-posed system: exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(d, d))
s = np.exp(-np.linspace(0, 3, min(n, d)))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U @ Sigma @ V.T

# True solution and clean data
w_true = np.zeros(d)
w_true[:5] = [10, 5, 2, 1, 0.5]
y_clean = A @ w_true

# Add noise
noise_level = 0.01
y_noisy = y_clean + noise_level * np.random.randn(n)

# Full pseudoinverse solution
U_a, s_a, Vt_a = np.linalg.svd(A, full_matrices=False)
w_full = Vt_a.T @ (np.linalg.solve(np.diag(s_a), U_a.T @ y_noisy))

# Truncated SVD solutions
errors = []
truncs = range(5, 30)
for trunc in truncs:
    s_trunc = s_a[:trunc]
    w_trunc = Vt_a[:trunc].T @ (np.linalg.solve(np.diag(s_trunc), U_a[:, :trunc].T @ y_noisy))
    err = np.linalg.norm(w_trunc - w_true)
    errors.append(err)

best_trunc = truncs[np.argmin(errors)]
print("smallest singular value:", round(s_a[-1], 8))
print("error full pseudoinverse:", round(np.linalg.norm(w_full - w_true), 4))
print("error best truncation (k={})".format(best_trunc), round(min(errors), 4))

Comments

Algorithm Category

Approximation Methods

Computational Efficiency

Cache-Aware Algorithms

Data Modality

Low-Rank Matrices

Historical & Attribution

Key Concepts & Theorems

Orthogonal Factorizations

Learning Path & Sequencing

Intermediate

Matrix Decompositions

Singular Value Decomposition (SVD)

ML Applications

Supervised Learning

Numerical Stability & Robustness

Algorithm Stability

Theoretical Foundation

Numerical Linear Algebra

Principal Component Analysis

Chapter 11

Principal Component Analysis

Key ideas: Introduction

Introduction#

PCA seeks a low-dimensional projection that captures the most variance. Geometrically, it rotates data so axes align with directions of maximal spread. Algebraically, it solves the optimization $\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$, yielding the top eigenvector of $X_c^\top X_c$. Successive components are orthogonal and capture diminishing variance.

Important ideas#

Covariance matrix
- $\Sigma = \tfrac{1}{n}X_c^\top X_c$ with $X_c$ centered. Eigenvalues $\lambda_i$ are variances along principal directions.
Principal components (eigenvectors)
- Columns of $V$ from SVD (or eigenvectors of $\Sigma$) form an orthonormal basis ordered by variance.
Explained variance ratio
- EVR = $\tfrac{\lambda_i}{\sum_j \lambda_j}$ quantifies how much total variance component $i$ explains; cumulative EVR guides dimensionality choice.
Scores and loadings
- Scores: $Z = X_c V$ (projections onto components); loadings: $V$ (directions in original space).
Reconstruction and truncation
- Truncated PCA: keep $k$ components; $\tilde{X}_c = Z_k V_k^\top$ minimizes squared error (Eckart–Young).
Standardization and scaling
- Standardize to unit variance before PCA if variables have different scales; otherwise leading component may be dominated by high-variance features.
Whitening
- Transform to unit variance: $Z_w = Z \Lambda^{-1/2}$ decorrelates and rescales for downstream algorithms (e.g., RBF kernels).

Relevance to ML#

Dimensionality reduction: speeds training, avoids overfitting, improves generalization.
Visualization: 2D/3D projection of high-dimensional data for exploration.
Preprocessing: removes noise, aligns scales, improves conditioning of solvers.
Feature extraction: learned components as features for downstream classifiers.
Denoising: truncated PCA removes low-variance (noisy) directions.
Whitening: standardizes correlation structure, crucial for many algorithms (kernels, distance-based methods).

Algorithmic development (milestones)#

1901: Pearson introduces lines/planes of closest fit (geometric intuition).
1933: Hotelling formalizes PCA as eigen-decomposition of covariance.
1950s–1960s: Computational advances (QR, Jacobi methods) enable practical PCA.
1995: Probabilistic PCA (Tipping–Bishop) bridges PCA and Gaussian latent variable models.
1997–2010s: Kernel PCA (Schölkopf et al.) and sparse PCA emerge for nonlinear and interpretable variants.
2000s: Randomized PCA for large-scale data (Halko–Martinsson).
2010s: PCA integrated into deep learning (autoencoders, PCA layers, spectral initialization).

Definitions#

Centered data: $X_c = X - \bar{X}$ with $\bar{X} = \tfrac{1}{n}\mathbf{1}^\top X$ (row means).
Covariance matrix: $\Sigma = \tfrac{1}{n}X_c^\top X_c \in \mathbb{R}^{d\times d}$ (PSD).
Principal components: eigenvectors of $\Sigma$, ordered by eigenvalue magnitude.
Variance explained by component $i$: $\lambda_i / \operatorname{tr}(\Sigma)$.
Whitened data: $Z_w = X_c V \Lambda^{-1/2}$ with $\Lambda$ diagonal eigenvalue matrix.
Reconstructed data: $\tilde{X}_c = X_c V_k V_k^\top$ using rank-$k$ approximation.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

Covariance matrix and its PSD structure (Chapter 09).
Eigen-decomposition of symmetric covariance matrix.
Variational characterization: $\arg\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$ yields top eigenvector.
Eckart–Young–Mirsky low-rank approximation error (Chapter 10).
Relation to SVD: PCA via SVD of centered data (Chapter 10).
Standardization and scaling effects on covariance eigenvalues.

Applied (landmark systems)#

Dimensionality reduction (Jolliffe 2002; Hastie et al. 2009).
Whitening for deep learning (LeCun et al. 1998; Krizhevsky et al. 2012).
Probabilistic PCA and latent variable models (Tipping & Bishop 1997).
Kernel PCA for nonlinear reduction (Schölkopf et al. 1998).
Randomized PCA for large scale (Halko–Martinsson–Tropp 2011).
Matrix completion via truncated SVD (Candès & Tao 2010).

Key ideas: Where it shows up

Dimensionality reduction and preprocessing

Removes redundant features; improves stability of downstream solvers. Achievements: widely used in computer vision (image preprocessing), bioinformatics (gene expression), and AutoML pipelines. References: Jolliffe 2002.

Visualization and exploratory data analysis

Project to 2D/3D for interactive inspection and cluster discovery. Achievements: industry standard in data exploration tools (Pandas, R, Plotly). References: Hastie et al. 2009.

Whitening and decorrelation

Standardizes feature covariance to identity; improves kernel methods and RBF networks. Achievements: standard preprocessing in deep learning frameworks. References: LeCun et al. 1998 (early deep learning); Krizhevsky et al. 2012 (ImageNet AlexNet).

Denoising and matrix completion

Truncated PCA recovers low-rank structure from noisy observations. Achievements: used in image inpainting and recommendation cold-start. References: Candès & Tao 2010 (matrix completion); Pearson 1901 (geometric intuition).

Feature extraction and representation learning

Learned components become features for classifiers; precursor to autoencoders. Achievements: basis for deep autoencoders and VAEs. References: Hinton & Salakhutdinov 2006 (deep learning via autoencoders).

Notation

Data: $X \in \mathbb{R}^{n\times d}$; centered $X_c = X - \bar{X}$.
Covariance: $\Sigma = \tfrac{1}{n}X_c^\top X_c$.
Eigendecomposition: $\Sigma = V \Lambda V^\top$ with $\Lambda$ diagonal.
Principal components: columns of $V$; eigenvalues $\lambda_i$ are variances.
Scores (projections): $Z = X_c V \in \mathbb{R}^{n\times d}$ or truncated $Z_k = X_c V_k$.
Explained variance ratio: $\text{EVR}_i = \tfrac{\lambda_i}{\sum_j \lambda_j}$.
Standardized data: $X_s = X_c / \sigma$ (element-wise or per-column standard deviation).
Whitened data: $Z_w = Z \Lambda^{-1/2} = X_c V \Lambda^{-1/2}$.
Example: If $X$ is $100 \times 50$ with 2 dominant eigenvalues $\lambda_1=8, \lambda_2=3, \sum_j \lambda_j=12$, then $\text{EVR}_1 \approx 0.67, \text{EVR}_2 \approx 0.25$; keep 2 components to explain $92\%$ of variance.

Pitfalls & sanity checks

Always center data; forgetting this is a common error.
Standardize features if they have different scales; otherwise PCA is dominated by high-variance features.
Sign ambiguity: eigenvectors are unique up to sign; do not compare raw signs across methods.
Small/negative eigenvalues: should not occur for PSD covariance matrix; indicates numerical error or centering issue.
Reconstruction: verify $\lVert X_c - X_c V_k V_k^\top \rVert_F$ equals tail variance for sanity check.
Number of components: do not blindly choose $k$; use scree plot, cumulative variance, or cross-validation.

References

Foundational work

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.

Classical theory and methods 3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). 4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). 5. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

Numerical algorithms 6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). 7. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. 8. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding structure with randomness.

Extensions and applications 9. Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Kernel Principal Component Analysis. 10. Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis. 11. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition. 12. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. 13. Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Proximal algorithms and shared optimality conditions. 14. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.

Five worked examples

Worked Example 1: Computing PCA via eigen-decomposition and interpreting variance#

Introduction#

Compute PCA on a synthetic dataset via covariance eigendecomposition; examine explained variance and principal directions.

Purpose#

Build intuition for how eigenvalues quantify variance along principal axes.

Importance#

Diagnostics guide choice of number of components for downstream tasks.

What this example demonstrates#

Center data; compute covariance matrix.
Compute eigendecomposition of covariance.
Report eigenvalues, cumulative explained variance ratio, and principal component directions.

Background#

Hotelling (1933) formalized PCA as eigen-decomposition of covariance.

Historical context#

Foundational work in multivariate statistics; adapted widely in ML.

History#

Standard in all major statistical/ML libraries (R, Python, MATLAB).

Prevalence in ML#

Used routinely in preprocessing and exploratory analysis.

Notes#

Ensure data is centered; if not, covariance is inaccurate.
Eigenvalues are variances; eigenvectors are directions.

Connection to ML#

Explains what PCA extracts and why truncation works.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition is the Rayleigh quotient optimization.

Pedagogical Significance#

Links variance to eigenvalues concretely.

References#

Hotelling, H. (1933). Analysis of complex statistical variables into principal components.
Jolliffe, I. T. (2002). Principal Component Analysis.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
evals = evals[::-1]
evecs = evecs[:, ::-1]

cumsum_var = np.cumsum(evals) / evals.sum()

print("eigenvalues:", np.round(evals, 4))
print("explained variance ratio:", np.round(evals / evals.sum(), 4))
print("cumulative EVR (k=1,2,3):", np.round(cumsum_var[:3], 4))

Worked Example 2: PCA via SVD (numerically stable)#

Introduction#

Compute PCA using SVD of centered data instead of forming covariance matrix explicitly.

Purpose#

Show how SVD avoids squaring condition number; more numerically stable for ill-conditioned data.

Importance#

Standard in practice; avoids explicit covariance computation and is faster for tall data.

What this example demonstrates#

SVD of $X_c / \sqrt{n}$ yields principal components (columns of $V$) and singular values.
Squared singular values equal eigenvalues of $X_c^\top X_c / n$.

Background#

SVD is numerically more stable than eigen-decomposition of $X_c^\top X_c$.

Historical context#

Popularized in numerical linear algebra as the default PCA route.

History#

Standard in scikit-learn PCA class; uses SVD internally.

Prevalence in ML#

Default in modern PCA implementations.

Notes#

Use full_matrices=False for efficiency when $n \gg d$.
Singular values $s$ relate to eigenvalues via $\lambda_i = (s_i / \sqrt{n})^2$.

Connection to ML#

More robust to numerical issues; faster for large $n$.

Connection to Linear Algebra Theory#

SVD of $X_c$ relates directly to covariance eigen-structure.

Pedagogical Significance#

Bridges SVD (Chapter 10) and PCA practically.

References#

Golub & Van Loan (2013). Matrix Computations.
Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals_from_svd = s ** 2

Sigma = (Xc.T @ Xc) / n
evals_from_eig = np.linalg.eigvalsh(Sigma)[::-1]

print("eigenvalues from eig:", np.round(evals_from_eig, 6))
print("eigenvalues from SVD:", np.round(evals_from_svd, 6))
print("difference:", np.linalg.norm(evals_from_eig - evals_from_svd))

Worked Example 3: Dimensionality reduction and reconstruction error#

Introduction#

Demonstrate PCA truncation to $k$ components; compare reconstruction error to variance lost.

Purpose#

Show how truncated PCA minimizes squared error (Eckart–Young); guide choice of $k$.

Importance#

Core to deciding how many components to keep in applications.

What this example demonstrates#

Compute full PCA; truncate to $k$ components.
Reconstruct and compute Frobenius error.
Verify error matches variance in discarded components.

Background#

Eckart–Young–Mirsky theorem guarantees optimality of rank-$k$ truncation.

Historical context#

Theoretical guarantee for best low-rank approximation.

History#

Used in all dimensionality reduction and compression workflows.

Prevalence in ML#

Standard choice heuristic for $k$: keep 90–95% explained variance.

Notes#

Reconstruction error squared equals sum of squared singular values of discarded components.
Trade-off: fewer components → less storage/compute, but more information loss.

Connection to ML#

Informs practical $k$ selection for downstream tasks.

Connection to Linear Algebra Theory#

Optimal low-rank approximation per Eckart–Young theorem.

Pedagogical Significance#

Links theory to practical dimensionality reduction.

References#

Eckart & Young (1936). The approximation of one matrix by another of lower rank.
Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 10
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals = s ** 2

k = 4
Xc_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
error_fro = np.linalg.norm(Xc - Xc_k, "fro")
tail_vars = evals[k:].sum()

ev_ratio = np.cumsum(evals) / evals.sum()
print("explained variance ratio (k=1..5):", np.round(ev_ratio[:5], 4))
print("reconstruction error:", round(error_fro, 4))
print("tail variance:", round(tail_vars, 4), "sqrt:", round(np.sqrt(tail_vars), 4))

Worked Example 4: Whitening and decorrelation#

Introduction#

Apply PCA whitening to standardize covariance; show uncorrelated and unit-variance output.

Purpose#

Demonstrate how whitening decorrelates features and enables downstream algorithms.

Importance#

Preprocessing step for many algorithms (kernels, RBF networks, distance-based methods).

What this example demonstrates#

Compute PCA; form whitening transform $Z_w = Z \Lambda^{-1/2}$.
Verify output covariance is identity.
Compare to standard scaling.

Background#

Whitening removes correlation structure and equalizes variance across dimensions.

Historical context#

Used in signal processing; adopted in deep learning for stabilization.

History#

LeCun et al. (1998) highlighted importance in early deep learning; Krizhevsky et al. (2012) used it in AlexNet.

Prevalence in ML#

Standard preprocessing in deep learning, kernel methods, and statistical tests.

Notes#

Add small floor to tiny eigenvalues to avoid division by zero.
Whitening can amplify noise if done naively on high-variance directions.

Connection to ML#

Improves convergence, gradient scales, and generalization of many algorithms.

Connection to Linear Algebra Theory#

Transform to canonical coordinate system (aligned with PCA axes).

Pedagogical Significance#

Practical application of PCA-based preprocessing.

References#

LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 150, 4
# Create correlated features
A = np.random.randn(d, d)
Cov = A.T @ A
X = np.random.randn(n, d) @ np.linalg.cholesky(Cov).T
Xc = X - X.mean(axis=0, keepdims=True)

evals, evecs = np.linalg.eigh((Xc.T @ Xc) / n)
evals = evals[::-1]
evecs = evecs[:, ::-1]

# Whitening transform
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
Z = Xc @ evecs
Zw = Z @ Lambda_inv_sqrt

# Verify output covariance is identity
Sigma_w = (Zw.T @ Zw) / n

print("input covariance diag:", np.round(np.diag((Xc.T @ Xc) / n), 4))
print("whitened covariance diag:", np.round(np.diag(Sigma_w), 4))
print("whitened covariance off-diag max:", round(np.max(np.abs(Sigma_w - np.eye(d))), 6))

Worked Example 5: Denoising via truncated PCA#

Introduction#

Apply truncated PCA to a noisy signal; show noise reduction as a function of truncation.

Purpose#

Illustrate how keeping top-$k$ components removes high-frequency (noisy) information.

Importance#

Core application in image denoising, signal processing, and data cleaning.

What this example demonstrates#

Add noise to data; apply truncated PCA for different $k$.
Measure reconstruction error vs. ground truth vs. noise level.
Show improvement from truncation.

Background#

Noise typically occupies low-variance directions; truncation removes it.

Historical context#

Classical application dating to Pearson (1901); widely used in signal/image processing.

History#

Precursor to modern deep denoising autoencoders.

Prevalence in ML#

Used in image inpainting, audio denoising, sensor data cleanup.

Notes#

Noise reduction works best if signal occupies few components.
Trade-off: lower $k$ → more denoising, but may remove true signal.

Connection to ML#

Improves feature quality for downstream models; common preprocessing.

Connection to Linear Algebra Theory#

Low-rank structure (signal) separated from noise via truncation.

Pedagogical Significance#

Demonstrates practical benefit of dimensionality reduction.

References#

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.
Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 100, 20
# True signal with low-rank structure
U_true, _ = np.linalg.qr(np.random.randn(n, 5))
V_true, _ = np.linalg.qr(np.random.randn(d, 5))
s_true = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X_clean = U_true @ np.diag(s_true) @ V_true.T

# Add noise
noise = 0.5 * np.random.randn(n, d)
X_noisy = X_clean + noise

Xc_noisy = X_noisy - X_noisy.mean(axis=0, keepdims=True)
U, s, Vt = np.linalg.svd(Xc_noisy / np.sqrt(n), full_matrices=False)

errors = []
ks = range(1, 11)
for k in ks:
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
    err = np.linalg.norm(X_clean[:, :] - X_k, "fro")
    errors.append(err)

print("reconstruction error for k=1..5:", np.round(errors[:5], 4))
print("best k:", np.argmin(errors) + 1)

Comments

Algorithm Category

Data Modality

Low-Rank Matrices

Historical & Attribution

Key Concepts & Theorems

Spectral Theory

Singular Value Decomposition (SVD)

Learning Path & Sequencing

Intermediate

Matrix Decompositions

ML Applications

Dimensionality Reduction

Theoretical Foundation

Applied Machine Learning

Orthogonality & Projections

Chapter 6

Orthogonality & Projections

Key ideas: Introduction

Introduction#

Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:

Residuals in least squares are orthogonal to the column space (no further decrease possible within subspace)
Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace
Orthonormal bases simplify computations and improve numerical stability
Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers
PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation

Important ideas#

Orthogonality and complements
- $x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.
Orthogonal projectors
- A projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For orthonormal $U\in\mathbb{R}^{d\times k}$ spanning $\mathcal{S}$: $P=UU^\top$.
Projection theorem
- For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $r\in\mathcal{S}^\perp$ that minimizes $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.
Pythagorean identity
- If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.
Orthonormal bases and QR
- Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.
Spectral/SVD structure
- For symmetric $\Sigma$, eigenvectors are orthonormal; SVD gives $X=U\Sigma V^\top$ with $U,V$ orthogonal. Truncation yields best rank-$k$ approximation (Eckart–Young).
Orthogonal transformations
- $Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; determinants $\pm1$ (rotations or reflections). Condition numbers remain unchanged.

Relevance to ML#

Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.
PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.
Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.
Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.
Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.
Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).

Algorithmic development (milestones)#

1900s–1930s: Gram–Schmidt orthonormalization; least squares geometry formalized.
1958–1965: Householder reflections and Golub’s QR algorithms stabilize orthogonalization.
1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).
1966: Orthogonal Procrustes (Schönemann) closed-form solution.
1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.
2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.

Definitions#

Orthogonal/Orthonormal: columns of $U$ satisfy $U^\top U=I$; orthonormal if unit length as well.
Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.
Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.
Orthogonal matrix: $Q^\top Q=I$; preserves norms and inner products.
PCA subspace: top-$k$ eigenvectors of covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems)#

Projection theorem: For closed subspace $\mathcal{S}$, projection $P_\mathcal{S}x$ uniquely minimizes $\lVert x-s\rVert_2$; residual is orthogonal to $\mathcal{S}$.
Pythagorean/Bessel/Parseval: Orthogonal decompositions preserve squared norms; partial sums bounded (Bessel); complete bases preserve energy (Parseval).
Fundamental theorem of linear algebra: $\text{col}(A)$ is orthogonal to $\text{null}(A^\top)$; $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$.
Spectral theorem: Symmetric matrices have orthonormal eigenbases; diagonalizable by $Q^\top A Q$.
Eckart–Young–Mirsky: Best rank-$k$ approximation in Frobenius/2-norm via truncated SVD.

Applied (landmark systems and practices)#

PCA/whitening: Jolliffe (2002); Shlens (2014) — denoising and compression.
Least squares/QR solvers: Golub–Van Loan (2013) — stable projections.
Orthogonal Procrustes in embedding alignment: Schönemann (1966); Smith et al. (2017).
Orthogonal initialization/constraints: Saxe et al. (2013); Mishkin & Matas (2015).
Subspace tracking and signal processing: Halko et al. (2011) randomized SVD.

Key ideas: Where it shows up

PCA and subspace denoising

PCA finds orthonormal directions $U$ maximizing variance; projection $X_k = X V_k V_k^\top$ minimizes reconstruction error.
Achievements: Dimensionality reduction at scale; whitening and denoising in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

Least squares as projection

$\hat{y} = X w^*$ is the projection of $y$ onto $\text{col}(X)$; residual $r=y-\hat{y}$ satisfies $X^\top r=0$.
Achievements: Foundational to regression and linear models; efficient via QR/SVD. References: Gauss 1809; Golub–Van Loan 2013.

Orthogonalization algorithms (QR)

Householder/Modified Gram–Schmidt produce orthonormal bases with numerical stability; essential in solvers and factorizations.
Achievements: Robust, high-performance linear algebra libraries (LAPACK). References: Householder 1958; Golub 1965; Trefethen–Bau 1997.

Orthogonal Procrustes and embedding alignment

Best orthogonal alignment between representation spaces via SVD of $A^\top B$ (solution $R=UV^\top$).
Achievements: Cross-lingual word embedding alignment; domain adaptation. References: Schönemann 1966; Smith et al. 2017.

Orthogonal constraints/initialization in deep nets

Orthogonal weight matrices preserve variance across layers; improve training stability and gradient flow.
Achievements: Deep linear dynamics analysis; practical initializations. References: Saxe et al. 2013; Mishkin & Matas 2015.

Notation

Data matrix and spaces: $X\in\mathbb{R}^{n\times d}$, $\text{col}(X)\subseteq\mathbb{R}^n$, $\text{null}(X^\top)$.
Orthonormal basis: $U\in\mathbb{R}^{n\times k}$ with $U^\top U=I$.
Orthogonal projector: $P=UU^\top$ (symmetric, idempotent); residual $r=(I-P)y$ satisfies $U^\top r=0$.
QR factorization: $X=QR$ with $Q^\top Q=I$; $Q$ spans $\text{col}(X)$.
SVD/PCA: $X=U\Sigma V^\top$; top-$k$ projection $P_k=U_k U_k^\top$ (or $X V_k V_k^\top$ on features).
Examples:
- Least squares via projection: $\hat{y} = P y$ with $P=Q Q^\top$ for $Q$ from QR of $X$.
- PCA reconstruction: $\hat{X} = X V_k V_k^\top$; error $\lVert X-\hat{X}\rVert_F^2 = \sum_{i>k}\sigma_i^2$.
- Procrustes alignment: $R=UV^\top$ from SVD of $A^\top B$; $R$ is orthogonal.

Pitfalls & sanity checks

Centering for PCA: use $X_c$ to ensure principal directions capture variance, not mean.
Orthogonality of bases: $U$ must be orthonormal for $P=UU^\top$ to be an orthogonal projector; otherwise projection is oblique.
Numerical orthogonality: prefer QR/SVD; classical Gram–Schmidt can lose orthogonality under ill-conditioning.
Certificates: verify $P$ is symmetric/idempotent and that residuals are orthogonal to $\text{col}(X)$.
Overfitting with high-$k$ PCA: track retained variance and use validation.

References

Foundations and numerical linear algebra

Strang, G. (2016). Introduction to Linear Algebra (5th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Projections, orthogonality, and approximation 4. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. 5. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix. 6. Gram, J. (1883); Schmidt, E. (1907). Orthonormalization methods.

PCA and applications 7. Jolliffe, I. (2002). Principal Component Analysis. 8. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

Embedding alignment and orthogonal methods in ML 9. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem. 10. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations. 11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 12. Mishkin, D., & Matas, J. (2015). All you need is a good init.

General ML texts 13. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. 14. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

Five worked examples

Worked Example 1: Least squares as orthogonal projection (QR certificate)#

Introduction#

Show that least squares fits correspond to orthogonal projection of $y$ onto $\text{col}(X)$, with residual orthogonal to features.

Purpose#

Derive $\hat{y}=P y$ with $P=Q Q^\top$ and verify $X^\top r=0$ numerically.

Importance#

Anchors regression in subspace geometry; provides robust implementation guidance via QR.

What this example demonstrates#

$X=QR$ with $Q^\top Q=I$ yields $\hat{y}=QQ^\top y$.
Residual $r=y-\hat{y}$ satisfies $Q^\top r=0$ and $X^\top r=0$.

Background#

Least squares minimizes squared error; projection theorem assures unique closest point in $\text{col}(X)$.

Historical context#

Gauss/Legendre least squares; Householder/Golub QR for numerical stability.

Prevalence in ML#

Linear models, GLM approximations, and as inner loops in larger systems.

Notes#

Prefer QR/SVD over normal equations.
Check $P$ is symmetric and idempotent in code.

Connection to ML#

Core of regression pipelines; basis for Ridge/Lasso solvers (with modifications).

Connection to Linear Algebra Theory#

Projection theorem; FTLA decomposition $\mathbb{R}^n=\text{col}(X)\oplus\text{null}(X^\top)$.

Pedagogical Significance#

Gives a geometric certificate of optimality via orthogonality.

References#

Gauss (1809); Legendre (1805) — least squares.
Golub & Van Loan (2013) — QR solvers.
Trefethen & Bau (1997) — numerical linear algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 5
X = np.random.randn(n, d)
w_true = np.array([1.2, -0.8, 0.5, 0.0, 2.0])
y = X @ w_true + 0.1 * np.random.randn(n)

Q, R = np.linalg.qr(X)
P = Q @ Q.T
y_hat = P @ y
r = y - y_hat

# Certificates
print("Symmetric P?", np.allclose(P, P.T, atol=1e-10))
print("Idempotent P?", np.allclose(P @ P, P, atol=1e-10))
print("Q^T r ~ 0?", np.linalg.norm(Q.T @ r))
print("X^T r ~ 0?", np.linalg.norm(X.T @ r))

# Compare to lstsq fit
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Projection match?", np.allclose(y_hat, X @ w_ls, atol=1e-8))

Worked Example 2: PCA projection and best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate orthogonal projection onto top-$k$ principal components and verify reconstruction error equals the sum of squared tail singular values.

Purpose#

Connect PCA’s orthogonal subspace to optimal low-rank approximation.

Importance#

Backbone of dimensionality reduction and denoising in ML.

What this example demonstrates#

$X=U\Sigma V^\top$; projection to rank-$k$ is $X_k = U_k \Sigma_k V_k^\top = X V_k V_k^\top$.
Error: $\lVert X-X_k\rVert_F^2 = \sum_{i>k} \sigma_i^2$.

Background#

Eckart–Young shows truncated SVD minimizes Frobenius/2-norm error among rank-$k$ matrices.

Historical context#

Low-rank approximation dates to the 1930s; widespread modern use in ML systems.

Prevalence in ML#

Feature compression, noise removal, approximate nearest neighbors, latent semantic analysis.

Notes#

Center data for covariance-based PCA; use SVD directly on $X_c$.

Connection to ML#

Trade off between compression (smaller $k$) and fidelity (retained variance).

Connection to Linear Algebra Theory#

Orthogonal projectors $U_k U_k^\top$; spectral ordering of singular values.

Pedagogical Significance#

Illustrates how orthogonality yields optimality guarantees.

References#

Eckart & Young (1936) — best rank-$k$.
Jolliffe (2002) — PCA.
Shlens (2014) — PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 80, 30, 5
X = np.random.randn(n, d) @ np.diag(np.linspace(5, 0.1, d))  # create decaying spectrum
Xc = X - X.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T
Xk = Xc @ Vk @ Vk.T

err = np.linalg.norm(Xc - Xk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-6))

Worked Example 3: Gram–Schmidt vs Householder QR (orthogonality under stress)#

Introduction#

Compare classical Gram–Schmidt to numerically stable QR on nearly colinear vectors.

Purpose#

Show why stable orthogonalization matters when projecting in high dimensions.

Importance#

Precision loss destroys orthogonality and degrades projections/solvers.

What this example demonstrates#

Classical GS loses orthogonality; QR (Householder) maintains $Q^\top Q\approx I$.

Background#

Modified GS improves stability, but Householder QR is preferred in libraries.

Historical context#

Stability advancements from Gram–Schmidt to Householder underpin modern LAPACK.

Prevalence in ML#

Everywhere orthogonalization is needed: least squares, PCA, subspace tracking.

Notes#

Measure orthogonality via $\lVert Q^\top Q - I\rVert$.

Connection to ML#

Reliable projections and decompositions => reliable models.

Connection to Linear Algebra Theory#

Orthogonality preservation and rounding error analysis.

Pedagogical Significance#

Demonstrates the gap between algebraic identities and floating-point realities.

References#

Trefethen & Bau (1997). Numerical Linear Algebra.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 40, 8
X = np.random.randn(n, d)
X[:, 1] = X[:, 0] + 1e-6 * np.random.randn(n)  # near colinearity

# Classical Gram–Schmidt
def classical_gs(A):
	 A = A.copy().astype(float)
	 n, d = A.shape
	 Q = np.zeros_like(A)
	 for j in range(d):
		  v = A[:, j]
		  for i in range(j):
				v = v - Q[:, i] * (Q[:, i].T @ A[:, j])
		  Q[:, j] = v / (np.linalg.norm(v) + 1e-18)
	 return Q

Q_gs = classical_gs(X)
Q_qr, _ = np.linalg.qr(X)

orth_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(d))
orth_qr = np.linalg.norm(Q_qr.T @ Q_qr - np.eye(d))
print("||Q^TQ - I|| (GS)", orth_gs)
print("||Q^TQ - I|| (QR)", orth_qr)

Worked Example 4: Orthogonal Procrustes — aligning embeddings via SVD#

Introduction#

Find the orthogonal matrix $R$ that best aligns $A$ to $B$ by minimizing $\lVert AR - B\rVert_F$.

Purpose#

Show closed-form solution $R=UV^\top$ from SVD of $A^\top B$ and connect to embedding alignment.

Importance#

Stable alignment across domains/languages without distorting geometry.

What this example demonstrates#

If $A^\top B = U\Sigma V^\top$, the optimal orthogonal $R=UV^\top$.

Background#

Procrustes problems arise in shape analysis and representation alignment.

Historical context#

Schönemann (1966) established the orthogonal solution; widely used afterward.

Prevalence in ML#

Cross-lingual word embeddings and domain adaptation pipelines.

Notes#

Center and scale if appropriate; enforce $\det(R)=+1$ for rotation-only alignment (optional).

Connection to ML#

Enables mapping between independently trained embedding spaces.

Connection to Linear Algebra Theory#

Orthogonal transformations preserve inner products; SVD reveals optimal rotation/reflection.

Pedagogical Significance#

Bridges an optimization problem to a single SVD call.

References#

Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.
Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 16
A = np.random.randn(n, d)
Q, _ = np.linalg.qr(np.random.randn(d, d))  # true orthogonal map
B = A @ Q + 0.01 * np.random.randn(n, d)

M = A.T @ B
U, S, Vt = np.linalg.svd(M)
R = U @ Vt

err = np.linalg.norm(A @ R - B, 'fro')
print("Alignment error:", round(err, 4))
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d), atol=1e-8))

Worked Example 5: Householder reflections — building orthogonal projectors#

Introduction#

Construct a Householder reflection to zero components and illustrate its orthogonality and symmetry; connect to QR and projection building.

Purpose#

Expose a basic orthogonal transformation used to construct $Q$ in QR.

Importance#

Underpins numerically stable orthogonalization in solvers and projections.

What this example demonstrates#

$H=I-2uu^\top$ is orthogonal and symmetric; $Hx$ zeros all but one component.

Background#

Householder reflections are the workhorse of QR; compose reflections to build $Q$.

Historical context#

Householder (1958) introduced the approach; remains standard.

Prevalence in ML#

Appears indirectly via libraries (NumPy/SciPy/LAPACK) that power ML pipelines.

Notes#

Stable and efficient vs. naive orthogonalization in finite precision.

Connection to ML#

Reliable QR leads to reliable least squares, PCA, and projection-based models.

Connection to Linear Algebra Theory#

Reflections generate orthogonal groups; preserve lengths and angles.

Pedagogical Significance#

Shows a concrete, constructive way to obtain orthogonal maps.

References#

Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(4)
d = 6
x = np.random.randn(d)
e1 = np.zeros(d); e1[0] = 1.0
v = x + np.sign(x[0]) * np.linalg.norm(x) * e1
u = v / (np.linalg.norm(v) + 1e-18)
H = np.eye(d) - 2 * np.outer(u, u)

Hx = H @ x
print("H orthogonal?", np.allclose(H.T @ H, np.eye(d), atol=1e-10))
print("H symmetric?", np.allclose(H, H.T, atol=1e-10))
print("Zeroed tail?", np.allclose(Hx[1:], 0.0, atol=1e-8))

Comments

Algorithm Category

Data Modality

Historical & Attribution

Key Concepts & Theorems

Orthogonal Factorizations

Learning Path & Sequencing

Foundational

Linear Algebra Foundations

Inner Products & Norms

Matrix Decompositions

ML Applications

Kernels & Gaussian Processes

Theoretical Foundation

Linear Algebra Theory

Span & Linear Combination

Chapter 2

Span & Linear Combination

Key ideas: Introduction

Introduction#

Span and linear combinations are the fundamental building blocks of linear algebra and machine learning. Every prediction $\hat{y} = Xw$, every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}$, every attention output $\sum_i \alpha_i v_i$, and every representation learned by a neural network is ultimately a linear combination of basis vectors. Understanding span—the set of all possible linear combinations—reveals model expressiveness, training dynamics, and the geometry of learned representations.

The span of a set of vectors $\{v_1, \ldots, v_k\}$ is the smallest subspace containing all of them. Geometrically, it’s all points reachable by scaling and adding the vectors. Algebraically, it’s $\{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$. In ML, span determines:

Model capacity: What functions can a model represent?
Feature redundancy: Are some features linear combinations of others?
Solution uniqueness: When are there multiple parameter vectors giving identical predictions?
Expressiveness vs. efficiency: Can we reduce dimensionality without losing information?

This chapter adopts an ML-first perspective: we introduce span through concrete algorithms (kernel methods, attention, overparameterization) rather than abstract axioms. The goal is to build geometric intuition (span as reachable points) and computational skill (checking linear independence, computing basis) simultaneously.

Important Ideas#

1. Linear combinations are everywhere in ML. A linear combination of vectors $\{v_1, \ldots, v_k\}$ with coefficients $\{\alpha_1, \ldots, \alpha_k\}$ is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k $$

Examples in ML:

Linear regression predictions: $\hat{y} = Xw = \sum_{j=1}^d w_j x_j$ (linear combination of feature columns).
Gradient descent updates: $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ (linear combination of current parameters and gradient).
Attention outputs: $z = \sum_{i=1}^n \alpha_i v_i$ (weighted sum of value vectors with attention weights $\alpha_i$).
Kernel predictions: $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (representer theorem: optimal solution is a linear combination of training kernels).
Word embeddings: Analogies $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (linear combinations capture semantic relationships).

2. Span determines expressiveness. The span of $\{v_1, \ldots, v_k\}$ is: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$

This is the set of all possible linear combinations—the “reachable subspace” if we’re allowed to scale and add the vectors. Key properties:

It’s a subspace: Closed under addition and scalar multiplication (adding/scaling linear combinations gives another linear combination).
It’s the smallest subspace containing $\{v_1, \ldots, v_k\}$: Any subspace containing all $v_i$ must contain their span.
Dimension = number of linearly independent vectors: If $v_k = \sum_{i=1}^{k-1} c_i v_i$ (linear dependence), adding $v_k$ doesn’t increase the span.

In ML context:

Column space of $X$: All predictions $\hat{y} = Xw$ lie in $\text{span}(\text{columns of } X) = \text{col}(X)$. If $y \notin \text{col}(X)$, perfect fit is impossible (residual is nonzero).
Feature redundancy: If feature $x_j$ is a linear combination of other features, adding it doesn’t increase $\text{span}(\text{columns of } X)$ or model capacity.
Kernel methods: Predictions lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (representer theorem). This is typically a finite-dimensional subspace of the (infinite-dimensional) RKHS.

3. Linear independence vs. dependence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if the only solution to $\sum_{i=1}^k \alpha_i v_i = 0$ is $\alpha_1 = \cdots = \alpha_k = 0$. Otherwise, they’re linearly dependent (one is a linear combination of others).

Why it matters:

Basis: A linearly independent set spanning $V$ is a basis for $V$. Every vector in $V$ has a unique representation as a linear combination of basis vectors.
Rank: $\text{rank}(X) = $ number of linearly independent columns = $\dim(\text{col}(X))$.
Multicollinearity: In regression, linearly dependent features ($\text{rank}(X) < d$) make $X^\top X$ singular (non-invertible), requiring regularization.

4. Representer theorem: solutions lie in span of training data. For many ML problems (kernel ridge regression, SVMs, Gaussian processes), the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$

This is a linear combination of kernel functions evaluated at training points. Despite working in an infinite-dimensional space (e.g., RBF kernel), the solution lies in an $n$-dimensional subspace (span of $\{k(x_i, \cdot)\}_{i=1}^n$).

Implications:

Computational tractability: Optimization over infinite dimensions reduces to solving an $n \times n$ system.
Overfitting vs. underfitting: More training points ($n$ large) increases capacity but also computational cost ($O(n^3)$ for exact methods).
Sparse solutions: $\ell_1$ regularization (Lasso, SVM) produces solutions with many $\alpha_i = 0$ (sparse linear combinations).

Relevance to Machine Learning#

Model expressiveness and capacity. The span of a feature matrix $X \in \mathbb{R}^{n \times d}$ determines all possible predictions. For linear regression $\hat{y} = Xw$:

If $\text{rank}(X) = d$ (full column rank), the model can fit $d$ linearly independent targets.
If $\text{rank}(X) < d$, features are redundant. Adding more linearly dependent features doesn’t help.
If $\text{rank}(X) < n$ (overdetermined), exact fit is impossible unless $y \in \text{col}(X)$ (rare).

Attention mechanisms. Transformer attention computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where the output is a convex combination (weighted average with non-negative weights summing to 1) of value vectors. Each output lies in $\text{span}(\text{rows of } V)$. Multi-head attention projects to multiple subspaces (heads), increasing expressiveness.

Kernel methods and representer theorem. For kernel ridge regression, the optimal solution is: $$ \alpha^* = (K + \lambda I)^{-1} y $$ where $K_{ij} = k(x_i, x_j)$ is the Gram matrix. Predictions are $f(x) = \sum_{i=1}^n \alpha_i^* k(x_i, x)$ (linear combination of training kernels). This holds for any kernel (linear, polynomial, RBF, neural network), enabling implicit infinite-dimensional feature spaces.

Word embeddings and analogies. Word2Vec (Mikolov et al., 2013) famously demonstrated that semantic relationships correspond to linear offsets in embedding space: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$. This shows embeddings capture compositional structure (adding/subtracting vectors blends meanings).

Algorithmic Development History#

1. Linear combinations in classical mechanics (Newton, 1687). Newton’s second law $F = ma$ expresses force as a linear combination of acceleration components. Decomposing vectors into basis components (Cartesian coordinates) enabled solving physical systems.

2. Linear algebra formalization (Grassmann 1844, Peano 1888). Grassmann introduced “extensive magnitudes” (vectors) and exterior algebra (wedge products, spans). Peano axiomatized vector spaces with addition and scalar multiplication, formalizing linear combinations.

3. Least squares and column space (Gauss 1809, Legendre 1805). Gauss used least squares for orbit determination. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, and the best fit minimizes $\|y - \hat{y}\|_2$ by projecting $y$ onto $\text{col}(X)$.

4. Kernel trick and representer theorem (Kimeldorf & Wahba 1970, Schölkopf 1990s). Kimeldorf & Wahba proved the representer theorem for splines: optimal smoothing spline is a linear combination of kernel basis functions. Schölkopf, Smola, and Vapnik extended this to SVMs and kernel ridge regression, enabling nonlinear learning in RKHS.

5. Word embeddings and linear structure (Mikolov et al. 2013). Word2Vec revealed that embeddings exhibit linear compositionality: analogies like “king - man + woman ≈ queen” work because semantic relationships correspond to parallel vectors (linear offsets). This was surprising—neural networks learned a structured linear space without explicit supervision.

6. Attention and weighted sums (Bahdanau 2015, Vaswani 2017). Attention mechanisms compute outputs as convex combinations (weighted averages) of value vectors. The Transformer (Vaswani et al., 2017) replaced recurrence with attention, showing that linear combinations of context (with learned weights) suffice for sequence modeling.

7. Overparameterization and implicit bias (Bartlett 2020, Arora 2019). Modern deep networks are vastly overparameterized ($d \gg n$), so solutions lie in $w_{\min} + \text{null}(X)$ (affine subspace). Gradient descent exhibits implicit regularization, preferring solutions in specific subspaces (e.g., low-rank, sparse). Understanding span and null space clarifies why overparameterized models generalize.

Definitions#

Linear combination. Given vectors $\{v_1, \ldots, v_k\} \subset V$ and scalars $\{\alpha_1, \ldots, \alpha_k\} \subset \mathbb{R}$, the linear combination is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \cdots + \alpha_k v_k \in V $$

Span. The span of $\{v_1, \ldots, v_k\}$ is the set of all linear combinations: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$ This is the **smallest subspace** containing ${v_1, \ldots, v_k}$.

Linear independence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if: $$ \sum_{i=1}^k \alpha_i v_i = 0 \quad \Longrightarrow \quad \alpha_1 = \cdots = \alpha_k = 0 $$ Otherwise, they are **linearly dependent** (at least one $v_j$ is a linear combination of the others).

Basis. A set $\{v_1, \ldots, v_k\}$ is a basis for subspace $S$ if:

It spans $S$: $\text{span}\{v_1, \ldots, v_k\} = S$.
It is linearly independent.

Every vector in $S$ has a unique representation as a linear combination of basis vectors.

Column space (range). For $A \in \mathbb{R}^{m \times n}$, the column space is: $$ \text{col}(A) = \{Ax : x \in \mathbb{R}^n\} = \text{span}\{\text{columns of } A\} $$

Dimension. $\dim(S) = $ number of vectors in any basis for $S$. For $\text{col}(A)$, $\dim(\text{col}(A)) = \text{rank}(A)$ (number of linearly independent columns).

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

Representer theorem (Kimeldorf & Wahba 1970, Schölkopf et al. 2001). For kernel ridge regression and SVMs, the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$ This holds for **any** reproducing kernel $k$ on RKHS $\mathcal{H}$. The solution lies in the $n$-dimensional subspace $\text{span}{k(x_1, \cdot), \ldots, k(x_n, \cdot)}$, even though $\mathcal{H}$ may be infinite-dimensional (e.g., RBF kernel).
VC dimension and span (Vapnik & Chervonenkis 1971). For linear classifiers in $\mathbb{R}^d$, the VC dimension is $d+1$. This measures expressiveness: the classifier can shatter (correctly classify all $2^{d+1}$ labelings) any set of $d+1$ points in general position. The decision boundaries are hyperplanes (linear combinations of features).
Rank-nullity theorem (fundamental theorem of linear algebra). For $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) + \dim(\text{null}(A)) = n $$ In ML: If $X \in \mathbb{R}^{n \times d}$ has $\text{rank}(X) = r < d$, there are $d - r$ linearly dependent features (null space dimension). Solutions to $Xw = y$ form an affine subspace $w_{\text{particular}} + \text{null}(X)$.
Eckart-Young theorem (1936). The truncated SVD $\hat{X} = U_k \Sigma_k V_k^\top$ (keeping top $k$ singular values) minimizes: $$ \|\hat{X} - X\|_F = \min_{\text{rank}(\hat{X}) \leq k} \|\hat{X} - X\|_F $$ Geometrically: Projecting columns of $X$ onto $\text{span}{u_1, \ldots, u_k}$ minimizes reconstruction error. This justifies PCA, low-rank matrix completion, and recommender systems.
Johnson-Lindenstrauss lemma (1984). Random projection from $\mathbb{R}^d$ to $\mathbb{R}^k$ (with $k = O(\log n / \epsilon^2)$) approximately preserves pairwise distances with high probability. This enables dimensionality reduction: data approximately lies in a low-dimensional subspace, discoverable via random projections.

Why essential: These theorems quantify when learning is tractable (representer theorem → finite-dimensional optimization), how much data suffices (VC dimension → sample complexity), and when low-dimensional structure exists (Eckart-Young → lossy compression bounds).

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

Word2Vec (Mikolov et al., 2013). Learned 300-dimensional embeddings for millions of words via skip-gram/CBOW. Demonstrated linear structure: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ achieved 40% accuracy on analogy tasks. Showed that linear combinations capture semantic relationships (gender, tense, capitals).
ResNet (He et al., 2015). Introduced skip connections $y = F(x) + x$, enabling training of 152-layer networks (vs. ~20 layers for VGG). Won ImageNet 2015 with 3.57% top-5 error. The key: $F(x) + x$ is a linear combination (residual + identity), preserving gradients during backpropagation.
Transformer (Vaswani et al., 2017). Replaced RNNs with attention: $\text{softmax}(QK^\top / \sqrt{d_k}) V$ (linear combination of value vectors). Enabled GPT-3 (175B params, Brown et al. 2020), BERT (340M params, Devlin et al. 2018), and state-of-the-art results across NLP (translation, summarization, QA).
Kernel SVMs (Boser et al. 1992, Cortes & Vapnik 1995). Applied kernel trick to large-margin classifiers. Won NIPS 2003 feature selection challenge, achieved 99.3% accuracy on MNIST (Decoste & Schölkopf 2002). Decision function $f(x) = \sum_{i \in SV} \alpha_i y_i k(x_i, x)$ is a sparse linear combination (only support vectors have $\alpha_i \neq 0$).
PCA for face recognition (Eigenfaces, Turk & Pentland 1991). Projected face images onto span of top eigenvectors (principal components). Each face is approximated as $x \approx \sum_{i=1}^k c_i u_i$ (linear combination of eigenfaces). Achieved real-time recognition with $k = 50$-$100$ components (vs. $d = 10,000$ pixels).
GPT-3 (Brown et al., 2020). 175B parameter Transformer trained on 300B tokens. Demonstrated few-shot learning (2-3 examples) across diverse tasks without fine-tuning. Attention layers compute $\sum_{i=1}^n \alpha_i v_i$ (linear combinations of context), with $n = 2048$ tokens.

Why essential: These systems achieved state-of-the-art by exploiting linear combination structure (attention, skip connections, kernel methods). Understanding span is necessary to interpret embeddings (Word2Vec analogies), debug failures (rank deficiency in features), and design architectures (multi-head attention = multiple subspaces).

Key ideas: Where it shows up

1. Principal Component Analysis (PCA) — Data spans low-dimensional subspace#

Major achievements:

Hotelling (1933): Formalized PCA as finding orthogonal directions of maximum variance. Principal components are eigenvectors of the covariance matrix $C = \frac{1}{n} X_c^\top X_c$ (centered data).
Eckart-Young theorem (1936): Proved that truncated SVD $X \approx U_k \Sigma_k V_k^\top$ (keeping top $k$ singular vectors) minimizes reconstruction error $\|X - \hat{X}\|_F$. This justifies PCA: projecting onto $\text{span}\{u_1, \ldots, u_k\}$ (top eigenvectors) is optimal.
Modern applications: Face recognition (eigenfaces, Turk & Pentland 1991), data compression (JPEG2000), preprocessing for neural networks (whitening), exploratory data analysis (visualizing high-dimensional datasets in 2D/3D).

Connection to span: PCA finds the $k$-dimensional subspace (span of top eigenvectors) that best approximates the data cloud. Projecting data $X$ onto $\text{span}\{u_1, \ldots, u_k\}$ gives $X_{\text{proj}} = X V_k V_k^\top$, where each row is a linear combination of top eigenvectors. The retained variance is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^d \lambda_i$.

2. Stochastic Gradient Descent (SGD) — Updates are linear combinations#

Major achievements:

Robbins & Monro (1951): Proved convergence of stochastic approximation $\theta_{t+1} = \theta_t - \eta_t g_t$ (where $g_t$ is a noisy gradient) under diminishing step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$.
Momentum methods (Polyak 1964, Nesterov 1983): Introduced momentum $m_{t+1} = \beta m_t + \nabla \mathcal{L}(\theta_t)$, $\theta_{t+1} = \theta_t - \eta m_{t+1}$ (exponentially weighted average of gradients). This is a linear combination of past gradients with decaying weights.
Adam optimizer (Kingma & Ba 2014): Adaptive learning rates using first and second moment estimates. Became the dominant optimizer for deep learning (BERT, GPT, Stable Diffusion).

Connection to span: Every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ is a linear combination of the current parameters and the negative gradient. The optimization trajectory $\{\theta_0, \theta_1, \ldots\}$ lies in the affine subspace $\theta_0 + \text{span}\{\nabla \mathcal{L}(\theta_0), \nabla \mathcal{L}(\theta_1), \ldots\}$. For linear models, gradients are linear combinations of data columns.

3. Deep Neural Networks—Compositional Linear Combinations

Major achievements:

Universal approximation (Cybenko 1989, Hornik 1991): Single hidden layer networks can approximate continuous functions arbitrarily well. The output is $f(x) = \sum_{i=1}^h w_i \sigma(v_i^\top x + b_i)$ (linear combination of activations).
Deep learning revolution (2012-present): AlexNet (2012), VGG (2014), ResNet (2015), Transformers (2017) demonstrated that depth (composing linear maps + nonlinearities) is more powerful than width (more neurons per layer).
Neural Tangent Kernels (Jacot et al. 2018): Showed that infinite-width networks behave like kernel methods, with predictions in $\text{span}\{\text{training features}\}$.

Connection to span: Each layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$, where $W_l h_l$ is a linear combination of hidden activations (columns of $W_l$ with coefficients from $h_l$). The pre-activation $W_l h_l$ lies in $\text{col}(W_l)$. Deep networks compose these linear combinations across layers, creating hierarchical representations.

4. Kernel Methods—Predictions as linear combinations of kernels#

Major achievements:

Representer theorem (Kimeldorf & Wahba 1970): For regularized risk minimization $\min_{f \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, the optimal solution is $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (linear combination of kernel basis functions).
Support Vector Machines (Boser et al. 1992, Cortes & Vapnik 1995): Introduced large-margin classifiers with kernel trick. Won NIPS feature selection challenge (2003), dominated ML competitions (early 2000s).
Gaussian Processes (Rasmussen & Williams 2006): Bayesian kernel methods for regression/classification. Predictions are linear combinations $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ with $\alpha = (K + \sigma^2 I)^{-1} y$.

Connection to span: Despite working in (potentially infinite-dimensional) RKHS, kernel predictions always lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (finite-dimensional subspace spanned by training kernels). The Gram matrix $K_{ij} = k(x_i, x_j)$ encodes inner products in this subspace.

5. Transformer Attention—Weighted sums of value vectors#

Major achievements:

Vaswani et al. (2017): “Attention is All You Need” replaced RNNs with self-attention. Enabled parallelization and scaling to billions of parameters (GPT-3: 175B params, GPT-4: ~1.7T params).
BERT (Devlin et al. 2018): Bidirectional Transformers for masked language modeling. Achieved state-of-the-art on 11 NLP tasks (GLUE benchmark).
Vision Transformers (Dosovitskiy et al. 2020): Applied attention to image patches, surpassing CNNs on ImageNet (ViT-H/14: 88.5% top-1 accuracy).
Multimodal models (CLIP, Flamingo, GPT-4): Unified vision and language via attention over heterogeneous inputs.

Connection to span: Attention output $z = \text{softmax}(QK^\top / \sqrt{d_k}) V$ is a convex combination (weighted average with non-negative weights summing to 1) of value vectors (rows of $V$). Each output lies in $\text{span}(\text{rows of } V)$. Multi-head attention projects to $h$ different subspaces, computing $h$ independent linear combinations in parallel.

Notation

Standard Conventions#

1. Linear combination syntax.

Summation notation: $\sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k$.
Matrix-vector product: $Xw = \sum_{j=1}^d w_j x_j$ (linear combination of columns of $X$ with weights from $w$).
Convex combination: $\sum_{i=1}^k \alpha_i v_i$ with $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$ (weighted average).

Examples:

Linear regression prediction: $\hat{y} = X w = \sum_{j=1}^d w_j X_{:,j}$ (each prediction is a linear combination of feature columns).
Attention output: $z = \sum_{i=1}^n \alpha_i v_i$ where $\alpha = \text{softmax}(q^\top K / \sqrt{d_k})$ (convex combination of value vectors).
Word analogy: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} = 1 \cdot e_{\text{king}} + (-1) \cdot e_{\text{man}} + 1 \cdot e_{\text{woman}}$ (coefficients can be negative).

2. Span notation.

Set notation: $\text{span}\{v_1, \ldots, v_k\} = \{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$.
Equivalent: $\text{span}(S)$ where $S = \{v_1, \ldots, v_k\}$ (span of a set).
Column space: $\text{col}(A) = \text{span}\{\text{columns of } A\}$.
Row space: $\text{row}(A) = \text{span}\{\text{rows of } A\} = \text{col}(A^\top)$.

Examples:

For $X \in \mathbb{R}^{3 \times 2}$ with columns $x_1 = [1, 0, 1]^\top$, $x_2 = [0, 1, 1]^\top$: $$ \text{col}(X) = \text{span}\{x_1, x_2\} = \left\{ w_1 \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + w_2 \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} : w_1, w_2 \in \mathbb{R} \right\} $$ This is a 2D plane in $\mathbb{R}^3$ passing through the origin.

3. Linear independence notation.

Independence: Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if $\sum_{i=1}^k \alpha_i v_i = 0 \Rightarrow \alpha_1 = \cdots = \alpha_k = 0$.
Dependence: If there exist $\alpha_i$ (not all zero) such that $\sum_{i=1}^k \alpha_i v_i = 0$, vectors are linearly dependent.
Rank: $\text{rank}(A) = \max\{\text{number of linearly independent columns of } A\} = \max\{\text{number of linearly independent rows of } A\}$.

Examples:

Vectors $v_1 = [1, 0]^\top$, $v_2 = [0, 1]^\top$ are linearly independent (standard basis for $\mathbb{R}^2$).
Vectors $v_1 = [1, 2]^\top$, $v_2 = [2, 4]^\top$ are linearly dependent ($v_2 = 2 v_1$).
For $X \in \mathbb{R}^{100 \times 50}$, $\text{rank}(X) \leq 50$ (at most 50 linearly independent columns).

4. Basis notation.

Basis: A linearly independent spanning set. Denoted $\mathcal{B} = \{v_1, \ldots, v_d\}$ for a $d$-dimensional space.
Standard basis: $\{e_1, \ldots, e_d\}$ where $e_i$ has 1 in position $i$, 0 elsewhere.
Coordinates: For vector $v = \sum_{i=1}^d \alpha_i v_i$ (linear combination of basis vectors), the coordinates are $[\alpha_1, \ldots, \alpha_d]^\top$.

Examples:

Standard basis for $\mathbb{R}^3$: $e_1 = [1, 0, 0]^\top$, $e_2 = [0, 1, 0]^\top$, $e_3 = [0, 0, 1]^\top$.
Any vector $v = [v_1, v_2, v_3]^\top = v_1 e_1 + v_2 e_2 + v_3 e_3$ (linear combination of standard basis).
For PCA, top $k$ eigenvectors $\{u_1, \ldots, u_k\}$ form a basis for the principal subspace.

5. Kernel and null space notation.

Null space: $\text{null}(A) = \{x : Ax = 0\}$ (vectors mapped to zero).
Kernel: $\ker(A) = \text{null}(A)$ (alternative notation).
Range (column space): $\text{range}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^n\}$.

Examples:

For $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ (rank 1), $\text{null}(A) = \text{span}\{[2, -1]^\top\}$ (1D subspace).
Overparameterized regression: If $\text{rank}(X) < d$, solutions to $Xw = y$ form $w_0 + \text{null}(X)$ (affine subspace).
Kernel ridge regression: Solution $\alpha = (K + \lambda I)^{-1} y$ lies in $\mathbb{R}^n$ (span of training examples).

6. Projection notation.

Orthogonal projection: $P_S v$ projects $v$ onto subspace $S$.
Projection matrix: $P = A(A^\top A)^{-1} A^\top$ projects onto $\text{col}(A)$.
Complement: $v = P_S v + P_{S^\perp} v$ (decomposition into parallel and perpendicular components).

Examples:

PCA projection onto top $k$ eigenvectors: $X_{\text{proj}} = X V_k V_k^\top$ where $V_k = [u_1 | \cdots | u_k]$.
Least squares: $\hat{y} = X(X^\top X)^{-1} X^\top y$ (projection of $y$ onto $\text{col}(X)$).
Residual: $r = y - \hat{y} = (I - X(X^\top X)^{-1} X^\top) y$ (projection onto $\text{col}(X)^\perp$).

Pitfalls & sanity checks

Common Mistakes#

Confusing span with basis: Span dimension = number of linearly independent vectors, not total count.
Assuming full rank: Always check np.linalg.matrix_rank(X) before inverting $X^\top X$.
Ignoring numerical stability: Use lstsq instead of normal equations.
Misunderstanding convex combinations: Not all linear combinations are convex (need $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$).
Overparameterization misconceptions: $d > n$ doesn’t always cause overfitting (implicit regularization).

Essential Checks#

# Check linear independence
rank = np.linalg.matrix_rank(X)
assert rank == X.shape[1], "Columns linearly dependent"

# Verify span membership
V = np.column_stack([v1, v2, v3])
alpha = np.linalg.lstsq(V, v, rcond=None)[0]
assert np.allclose(V @ alpha, v), "v not in span(V)"

# Test null space
assert np.allclose(X @ z, 0), "z not in null(X)"

# Attention weights
assert np.allclose(alpha.sum(), 1) and (alpha >= 0).all()

References

Foundational Texts#

Strang (2016): Linear Algebra - span, basis, column/null space
Axler (2015): Linear Algebra Done Right - abstract vector spaces
Horn & Johnson (2013): Matrix Analysis - rank, decompositions

Machine Learning#

Hastie et al. (2009): Elements of Statistical Learning - regression, SVMs
Goodfellow et al. (2016): Deep Learning - Chapter 2 (Linear Algebra)
Murphy (2022): Probabilistic ML - linear regression, kernels

Key Papers#

Kimeldorf & Wahba (1970): Representer theorem
Vapnik & Chervonenkis (1971): VC dimension
Eckart & Young (1936): Low-rank approximation
Mikolov et al. (2013): Word2Vec analogies
Vaswani et al. (2017): Transformer attention
He et al. (2015): ResNet skip connections
Bartlett et al. (2020): Benign overfitting
Belkin et al. (2019): Double descent

Advanced Topics#

Schölkopf & Smola (2002): Learning with Kernels
Rasmussen & Williams (2006): Gaussian Processes
Golub & Van Loan (2013): Matrix Computations
Trefethen & Bau (1997): Numerical Linear Algebra

Five worked examples

Worked Example 1: Predictions lie in span(columns of X)#

Introduction#

Linear regression predictions $\hat{y} = Xw$ are linear combinations of the columns of the feature matrix $X$. This fundamental observation reveals model expressiveness: all possible predictions lie in the column space $\text{col}(X)$, a subspace of $\mathbb{R}^n$. If the target $y$ lies outside this subspace ($y \notin \text{col}(X)$), perfect fit is impossible—the best we can do is project $y$ onto $\text{col}(X)$ (least squares solution).

This example explicitly computes $Xw$ as $\sum_{j=1}^d w_j X_{:,j}$ (sum of weighted columns), demonstrating that predictions span a subspace determined entirely by the features.

Purpose#

Visualize predictions as linear combinations: Show that $\hat{y} = Xw = w_1 X_{:,1} + w_2 X_{:,2} + \cdots + w_d X_{:,d}$.
Identify the constraint: Predictions lie in $\text{col}(X)$, limiting model capacity to $\dim(\text{col}(X)) = \text{rank}(X)$.
Connect to least squares: When $y \notin \text{col}(X)$, minimizing $\|Xw - y\|_2$ finds the closest point in $\text{col}(X)$ to $y$.

Importance#

Model expressiveness. The span of $X$’s columns determines all possible predictions. For $X \in \mathbb{R}^{n \times d}$:

If $\text{rank}(X) = d$ (full column rank), the model can fit any $d$ linearly independent targets.
If $\text{rank}(X) < d$, some features are redundant (linearly dependent). Adding more linearly dependent features doesn’t increase capacity.
If $\text{rank}(X) < n$ (typical when $d < n$), predictions lie in a proper subspace of $\mathbb{R}^n$. Perfect fit is impossible unless $y \in \text{col}(X)$.

Residuals and orthogonality. The least squares residual $r = y - \hat{y}$ is orthogonal to $\text{col}(X)$: $X^\top r = 0$. Geometrically, $\hat{y}$ is the orthogonal projection of $y$ onto $\text{col}(X)$, and $r$ lies in the orthogonal complement $\text{col}(X)^\perp$.

Feature selection. If feature $j$ is a linear combination of other features ($X_{:,j} = \sum_{i \neq j} c_i X_{:,i}$), including it doesn’t increase $\text{rank}(X)$ or expand $\text{col}(X)$. Feature selection algorithms (Lasso, forward selection) aim to find minimal feature sets spanning the target space.

What This Example Demonstrates#

Matrix-vector product as linear combination: $Xw = \sum_{j=1}^d w_j X_{:,j}$ (sum of weighted columns).
Predictions constrained to subspace: $\hat{y} \in \text{col}(X) = \text{span}\{X_{:,1}, \ldots, X_{:,d}\}$.
Numerical verification: Compute both $Xw$ (matrix product) and $\sum_j w_j X_{:,j}$ (explicit sum), verify they’re identical.

Background#

Least squares (Gauss 1809, Legendre 1805). Gauss used least squares to fit planetary orbits, minimizing sum of squared errors. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, so minimizing $\|y - Xw\|_2^2$ finds the closest point in $\text{col}(X)$ to $y$.

Normal equations. Setting $\nabla_w \|Xw - y\|_2^2 = 0$ gives $X^\top X w = X^\top y$. If $X$ has full column rank, $w^* = (X^\top X)^{-1} X^\top y$. The prediction is $\hat{y} = X w^* = X(X^\top X)^{-1} X^\top y$ (projection matrix $P = X(X^\top X)^{-1} X^\top$ projects onto $\text{col}(X)$).

Geometric interpretation. $\text{col}(X)$ is a $d$-dimensional (or $\text{rank}(X)$-dimensional) hyperplane in $\mathbb{R}^n$. The prediction $\hat{y}$ is the foot of the perpendicular from $y$ to this hyperplane. The residual $r = y - \hat{y}$ is perpendicular to the hyperplane.

Historical Context#

1. Least squares origins (Gauss 1809, Legendre 1805). Legendre published the method in 1805 for fitting orbits. Gauss claimed to have used it since 1795 (controversy over priority). Both recognized that predictions are linear combinations of features.

2. Matrix formulation (Cauchy 1829, Sylvester 1850). Matrix algebra enabled compact notation $\hat{y} = Xw$ instead of writing out sums. Sylvester introduced “matrix” terminology in 1850.

3. Projection interpretation (Schmidt 1907, Courant & Hilbert 1924). Erhard Schmidt formalized orthogonal projections in Hilbert spaces. The least squares solution became understood as projecting $y$ onto $\text{col}(X)$.

4. Modern ML (1990s-present). Regularization (ridge, Lasso) modifies $\text{col}(X)$ by adding penalty terms. Kernel methods (SVMs, kernel ridge regression) work in implicitly mapped feature spaces, where $\text{col}(\Phi(X))$ may be infinite-dimensional but solutions lie in $\text{span}\{k(x_i, \cdot)\}_{i=1}^n$ (finite-dimensional by the representer theorem).

History in Machine Learning#

1805: Legendre publishes least squares (linear combinations of features).
1809: Gauss derives normal equations $X^\top X w = X^\top y$.
1907: Schmidt formalizes orthogonal projections (geometric interpretation).
1970: Kimeldorf & Wahba prove the representer theorem (kernel solutions in the span of training points).
1995: Vapnik’s Nature of Statistical Learning Theory connects VC dimension to span of hypothesis class.
2006: Compressed sensing (Candès, Donoho) exploits sparse linear combinations for recovery.
2018: Neural Tangent Kernels (Jacot et al.) show infinite-width networks have predictions in the span of features.

Prevalence in Machine Learning#

Universal in supervised learning: Every linear model (linear regression, logistic regression, linear SVM, perceptron) computes predictions as $\hat{y} = Xw$ or $\hat{y} = \sigma(Xw + b)$ (linear combination + nonlinearity).

Deep learning layers: Each fully connected layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$, where $W_l h_l$ is a linear combination of hidden activations (columns of $W_l$ with coefficients from $h_l$).

Generalized linear models (GLMs): Exponential family models (Poisson regression, gamma regression) use $\mathbb{E}[y] = g^{-1}(X w)$ (linear combination inside link function).

Kernel methods: SVMs, kernel ridge regression, Gaussian processes all predict via $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (linear combination of kernel evaluations).

Notes and Explanatory Details#

Shape discipline:

Feature matrix: $X \in \mathbb{R}^{n \times d}$ (rows = examples, columns = features).
Weights: $w \in \mathbb{R}^d$ (one weight per feature).
Prediction: $\hat{y} = Xw \in \mathbb{R}^n$ (one prediction per example).
Column $j$ of $X$: $X_{:,j} \in \mathbb{R}^n$ (feature $j$ across all examples).

Matrix-vector product identity: $$ Xw = \begin{bmatrix} | & | & & | \\ X_{:,1} & X_{:,2} & \cdots & X_{:,d} \\ | & | & & | \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} = \sum_{j=1}^d w_j X_{:,j} $$

Example: For $X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$, $w = \begin{bmatrix} 2 \\ -1 \end{bmatrix}$: $$ Xw = 2 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + (-1) \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 4 \end{bmatrix} $$

Numerical considerations: For large $d$ (wide data), storing $X$ explicitly may be wasteful if $\text{rank}(X) \ll d$. Low-rank approximations (truncated SVD) reduce storage and computation.

Connection to Machine Learning#

Underfitting vs. overfitting: If $\text{rank}(X) \ll n$ (few effective features), the model underfits (predictions lie in low-dimensional subspace). If $\text{rank}(X) = n$ and $d \geq n$ (more features than examples), the model can perfectly fit noise (overfitting).

Regularization modifies the span: Ridge regression solves $(X^\top X + \lambda I) w = X^\top y$, shrinking weights toward zero. This effectively reduces the effective rank of $X$, constraining predictions to a lower-dimensional subspace.

Basis functions and feature expansion: Nonlinear models (polynomial regression, RBF networks) expand features: $\phi(x) = [x, x^2, x^3, \ldots]$. Predictions $\hat{y} = \Phi(X) w$ lie in $\text{col}(\Phi(X))$, a nonlinear subspace in the original space but linear in feature space.

Connection to Linear Algebra Theory#

Fundamental theorem of linear algebra. For $X \in \mathbb{R}^{n \times d}$: $$ \mathbb{R}^n = \text{col}(X) \oplus \text{null}(X^\top) $$ (direct sum: every vector $y \in \mathbb{R}^n$ decomposes uniquely as $y = y_{\parallel} + y_{\perp}$ where $y_{\parallel} \in \text{col}(X)$ and $y_{\perp} \in \text{null}(X^\top)$).

In least squares, $\hat{y} = y_{\parallel}$ (projection onto $\text{col}(X)$) and $r = y_{\perp}$ (projection onto $\text{null}(X^\top)$). The normal equations $X^\top r = 0$ express orthogonality.

Rank and dimension: $\dim(\text{col}(X)) = \text{rank}(X) \leq \min(n, d)$. If $\text{rank}(X) = r < d$, there are $d - r$ redundant features (null space has dimension $d - r$).

Projection matrix: $P = X(X^\top X)^{-1} X^\top$ (assuming $X$ has full column rank) satisfies:

$P^2 = P$ (idempotent: projecting twice is the same as projecting once).
$P^\top = P$ (symmetric: orthogonal projection).
$\text{col}(P) = \text{col}(X)$ (projects onto column space of $X$).

Pedagogical Significance#

Concrete visualization. Students can compute $Xw$ by hand for small $X$ (e.g., $3 \times 2$ matrix) and verify it’s a weighted sum of columns. This makes the abstract “linear combination” concept tangible.

Foundation for least squares. Understanding that predictions lie in $\text{col}(X)$ is essential before learning least squares. The geometric interpretation (projecting $y$ onto $\text{col}(X)$) clarifies why least squares works and when it fails.

Debugging linear models. If predictions are poor, check $\text{rank}(X)$: low rank indicates redundant/collinear features. Use np.linalg.matrix_rank(X) to diagnose.

References#

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press. Chapter 4: “Orthogonality” (projections, least squares).
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 3: “Linear Methods for Regression.”
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Appendix C: “Numerical Linear Algebra Background” (least squares, QR decomposition).
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. Chapter 5: “Orthogonalization and Least Squares.”
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11: “Linear Regression.”

Problem. Show $\hat{y} = Xw$ lies in the span of columns of $X$.

Solution (math).

For $X \in \mathbb{R}^{n \times d}$ with columns $X_{:,1}, \ldots, X_{:,d} \in \mathbb{R}^n$ and weights $w = [w_1, \ldots, w_d]^\top \in \mathbb{R}^d$, the prediction is: $$ \hat{y} = Xw = \sum_{j=1}^d w_j X_{:,j} $$

This is a linear combination of the columns of $X$, so $\hat{y} \in \text{span}\{X_{:,1}, \ldots, X_{:,d}\} = \text{col}(X)$.

Solution (Python).

import numpy as np

# Define feature matrix X (3 examples, 2 features)
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])

# Define weight vector w
w = np.array([2., -1.])

# Prediction via matrix-vector product
y_hat_1 = X @ w

# Prediction as explicit linear combination of columns
y_hat_2 = w[0] * X[:, 0] + w[1] * X[:, 1]

print(f"X =\n{X}\n")
print(f"w = {w}\n")
print(f"Method 1 (matrix product): y_hat = X @ w = {y_hat_1}")
print(f"Method 2 (linear combination): y_hat = {w[0]}*X[:,0] + {w[1]}*X[:,1] = {y_hat_2}")
print(f"\nAre they equal? {np.allclose(y_hat_1, y_hat_2)}")
print(f"y_hat lies in span(columns of X): True (by construction)")

Output:

X =
[[1. 2.]
 [3. 4.]
 [5. 6.]]

w = [ 2. -1.]

Method 1 (matrix product): y_hat = X @ w = [0. 2. 4.]
Method 2 (linear combination): y_hat = 2.0*X[:,0] + -1.0*X[:,1] = [0. 2. 4.]

Are they equal? True
y_hat lies in span(columns of X): True (by construction)

Worked Example 2: Kernel ridge solution lies in span(training features)#

Introduction#

The representer theorem states that despite optimizing over an infinite-dimensional RKHS, the optimal solution for kernel ridge regression always has the form $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$—a linear combination of kernel functions evaluated at training points.

Purpose#

Demonstrate the representer theorem computationally
Show that optimization in infinite dimensions reduces to solving $(K + \lambda I)\alpha = y$
Verify predictions lie in span of training kernels

Importance#

Kernel methods enable nonlinear learning in implicitly mapped feature spaces while maintaining computational tractability ($O(n^3)$ instead of infinite-dimensional optimization).

What This Example Demonstrates#

Compute kernel Gram matrix $K_{ij} = k(x_i, x_j)$, solve for $\alpha$, interpret as linear combination of training kernels.

Background#

RKHS and representer theorem (Kimeldorf & Wahba 1970): For loss $\mathcal{L}(f) = \sum_i \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, the minimizer is $f^*(x) = \sum_i \alpha_i k(x_i, x)$.

References#

Kimeldorf & Wahba (1970), Schölkopf et al. (2001), Rasmussen & Williams (2006)

Problem: Compute $\alpha$ for kernel ridge regression and interpret span.

Solution (math): $\alpha = (K + \lambda I)^{-1} y$ where $K_{ij} = k(x_i, x_j)$. Predictions: $f(x) = \sum_i \alpha_i k(x_i, x)$.

Solution (Python):

import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=6, seed=1)
y = np.arange(len(X), dtype=float)
K = toy_kernel_rbf(X, gamma=0.5)
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

print(f"Coefficients alpha: {alpha}")
print(f"Predictions lie in span{{k(x_1, ·), ..., k(x_6, ·)}}")

Worked Example 3: Attention is a weighted sum#

Introduction#

Attention computes outputs as convex combinations of value vectors: $z = \sum_i \alpha_i v_i$ where $\alpha = \text{softmax}(q^\top K / \sqrt{d_k})$.

Purpose#

Show attention output is a linear combination, verify weights sum to 1, demonstrate constraint to span of values.

Importance#

Attention is the core operation in Transformers (GPT, BERT), enabling contextual representations through weighted averaging.

References#

Vaswani et al. (2017), Bahdanau et al. (2015)

Problem: Compute attention output as $\sum_i \alpha_i v_i$.

Solution (math): $z = \text{softmax}(q^\top K / \sqrt{d_k}) V$

Solution (Python):

import numpy as np
from scripts.toy_data import scaled_dot_attention

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.], [1., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])
output = scaled_dot_attention(Q, K, V)

print(f"Attention output: {output[0]}")
print(f"Output lies in span(rows of V)")

Worked Example 4: Overparameterization and null space#

Introduction#

When $d > n$ (more parameters than examples), solutions to $Xw = y$ are non-unique. The solution set forms an affine subspace $w_0 + \text{null}(X)$.

Purpose#

Show non-uniqueness, identify solution set structure, and discuss the minimum-norm solution returned by lstsq.

Importance#

Modern deep learning is vastly overparameterized. Understanding null space clarifies why multiple parameters give identical predictions yet generalize differently.

References#

Bartlett et al. (2020), Belkin et al. (2019)

Problem: Explain non-uniqueness when $d > n$.

Solution (math): If $Xw_0 = y$ and $z \in \text{null}(X)$, then $X(w_0 + z) = y$. Solutions form $w_0 + \text{null}(X)$.

Solution (Python):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))  # n=3, d=5
w0 = rng.normal(size=5)
y = X @ w0
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"Rank(X): {np.linalg.matrix_rank(X)}")
print(f"Null space dim: {X.shape[1] - np.linalg.matrix_rank(X)}")
print(f"||w_hat||_2 = {np.linalg.norm(w_hat):.4f} (minimum norm)")
print(f"w0 - w_hat in null(X): {np.allclose(X @ (w0 - w_hat), 0)}")

Worked Example 5: Word analogy vector arithmetic#

Introduction#

Word2Vec embeddings exhibit linear structure: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (semantic relationships = vector offsets).

Purpose#

Compute analogy as linear combination, demonstrate compositional semantics, motivate embedding arithmetic.

Importance#

Analogies reveal that neural networks learn structured representations where linear algebra operations correspond to semantic operations.

References#

Mikolov et al. (2013), Pennington et al. (2014)

Problem: Compute “king - man + woman” analogy.

Solution (math): $e_{\text{target}} = 1 \cdot e_{\text{king}} + (-1) \cdot e_{\text{man}} + 1 \cdot e_{\text{woman}}$

Solution (Python):

import numpy as np

E = {
    'king': np.array([0.8, 0.2, 0.1]),
    'man': np.array([0.7, 0.1, 0.0]),
    'woman': np.array([0.6, 0.3, 0.0])
}

analogy = E['king'] - E['man'] + E['woman']
print(f"king - man + woman = {analogy}")
print(f"(Find nearest word to this vector → queen)")

Comments

Algorithm Category

Data Modality

Historical & Attribution

Classical Era

Key Concepts & Theorems