Chapter 9
PSD & PD Matrices
Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $>0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues, admit Cholesky factorizations (for PD), and define inner products or metrics. PSD structure underpins convexity, stability, and validity of kernels.

Important ideas#

  1. Eigenvalue characterization

    • Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.

  2. Quadratic forms and convexity

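    • If $A\succeq 0$, $f(x)=\tfrac{1}{2} x^\top A x$ is convex. More generally, a twice-differentiable function whose Hessian is PSD everywhere on a convex domain is convex there; a Hessian that is PSD only on a neighborhood gives convexity only on that neighborhood.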

  3. Gram matrices

    • $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).

  4. Cholesky factorization

    • PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.

  5. Schur complement

    • For block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, PSD iff Schur complement $C-B^\top A^{-1}B \succeq 0$.

  6. Mahalanobis metric

    • For SPD $M$, $d_M(x,y)=\sqrt{(x-y)^\top M (x-y)}$ is a metric; choosing $M=\Sigma^{-1}$ gives the classical Mahalanobis distance, which equals Euclidean distance after whitening.

  7. Numerical PSD

    • In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).
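
The repair steps in item 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: the helper name `clip_to_psd`, the `floor` parameter, and the noisy test matrix are all introduced here for the example.

```python
import numpy as np

def clip_to_psd(A, floor=0.0):
    """Symmetrize A, then raise eigenvalues below `floor` up to `floor` (hypothetical helper)."""
    S = 0.5 * (A + A.T)                  # enforce symmetry
    w, Q = np.linalg.eigh(S)             # real eigenvalues, orthonormal eigenvectors
    w_clipped = np.maximum(w, floor)     # threshold negative eigenvalues
    return (Q * w_clipped) @ Q.T         # rebuild Q diag(w_clipped) Q^T

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 2))
A_noisy = B @ B.T + 1e-8 * rng.standard_normal((4, 4))   # rank-2 PSD plus asymmetric noise
A_psd = clip_to_psd(A_noisy)

print("min eig before:", np.linalg.eigvalsh(0.5 * (A_noisy + A_noisy.T)).min())
print("min eig after: ", np.linalg.eigvalsh(A_psd).min())
```

Clipping changes the matrix as little as possible in the Frobenius norm among PSD matrices, whereas jitter shifts every eigenvalue; which is preferable depends on whether a strictly PD result is needed downstream.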

Relevance to ML#

  • Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.

  • Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.

  • Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.

  • Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.

  • Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

  • 1900s: Mercer’s theorem (kernel PSD) and early quadratic form characterizations.

  • 1918: Schur complement criteria.

  • 1910–1940s: Cholesky factorization for SPD solves.

  • 1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.

  • 1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

  • PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.

  • Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.

  • Principal minors: PD iff all leading principal minors are positive (Sylvester).

  • Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.

  • Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.
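
Sylvester's criterion from the list above can be checked directly on small matrices. The sketch below is illustrative only: the helper `leading_minors` and both test matrices are assumptions made for this example, and determinant-based tests are not recommended numerically for large matrices (Cholesky or eigenvalues are the usual tools).

```python
import numpy as np

def leading_minors(A):
    """Determinants of the top-left k x k blocks, k = 1..n (hypothetical helper)."""
    n = A.shape[0]
    return np.array([np.linalg.det(A[:k, :k]) for k in range(1, n + 1)])

A_pd = np.array([[2.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 2.0]])   # tridiagonal matrix, positive definite
A_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])      # eigenvalues 3 and -1

for name, M in [("A_pd", A_pd), ("A_indef", A_indef)]:
    minors = leading_minors(M)
    print(name, "minors:", np.round(minors, 6),
          "all positive:", bool(np.all(minors > 0)),
          "min eig:", round(float(np.linalg.eigvalsh(M).min()), 6))
```

Note that the criterion characterizes PD only; positivity of leading minors with some zeros does not settle PSD, which requires all principal minors to be nonnegative.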

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • PSD/PD definitions via quadratic form and eigenvalues.

  • Sylvester’s criterion (leading principal minors > 0 for PD).

  • Schur complement PSD condition.

  • Mercer’s theorem (kernel PSD).

  • Cholesky existence for SPD.

Applied (landmark systems)#

  • SVM with PSD kernels (Cortes–Vapnik 1995).

  • Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.

  • Kernel Ridge Regression (Murphy 2022) solved via SPD systems.

  • Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).

  • GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up
  1. Covariance and PCA

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; eigenvalues are variances. PCA uses PSD structure. References: Jolliffe 2002; Shlens 2014.

  2. Kernels and SVMs/GPs

  • RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  3. Hessians and convexity

  • For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

  4. Mahalanobis distance & whitening

  • SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

  5. Cholesky-based solvers (KRR/GP)

  • SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation
  • PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.

  • Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.

  • Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.

  • Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.

  • Schur complement: For blocks, $S = C - B^\top A^{-1} B$.

  • Examples:

    • PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$.

    • Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.

    • Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.
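
As a small numerical illustration of the Schur complement notation above, the following sketch partitions an SPD matrix into blocks and confirms that both $A$ and $S = C - B^\top A^{-1} B$ are PD. The block sizes and the random test matrix are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((5, 5))
M = G @ G.T + 0.1 * np.eye(5)          # SPD test matrix (assumed example)

# Partition M = [[A, B], [B^T, C]] with a 3x3 leading block
A, B, C = M[:3, :3], M[:3, 3:], M[3:, 3:]
S = C - B.T @ np.linalg.solve(A, B)    # Schur complement, computed without forming A^{-1}
S = 0.5 * (S + S.T)                    # remove round-off asymmetry before eigvalsh

print("min eig of M:", np.linalg.eigvalsh(M).min())
print("min eig of A:", np.linalg.eigvalsh(A).min())
print("min eig of S:", np.linalg.eigvalsh(S).min())
```

The same identity is what makes Gaussian conditioning work: the conditional covariance of a jointly Gaussian block is exactly such a Schur complement.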

Pitfalls & sanity checks
  • Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.

  • Near-singular PSD: add jitter; do not invert directly.

  • Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.

  • Kernel parameters: very small/large length-scales can hurt conditioning.

  • Cholesky failures: indicate non-PSD or insufficient jitter.
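
One common way to act on the last two pitfalls is to retry Cholesky with increasing jitter. The sketch below is one possible arrangement, not a standard API: the helper name `cholesky_with_jitter`, the starting jitter, and the factor-of-10 schedule are assumptions for illustration.

```python
import numpy as np

def cholesky_with_jitter(K, jitter0=1e-10, max_tries=8):
    """Try Cholesky; on failure add jitter * I, growing jitter 10x each retry (hypothetical helper)."""
    K = 0.5 * (K + K.T)                       # enforce symmetry first
    jitter = 0.0
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
            return L, jitter
        except np.linalg.LinAlgError:
            jitter = jitter0 if jitter == 0.0 else 10.0 * jitter
    raise np.linalg.LinAlgError("matrix not PD even after jitter; check the kernel or data")

# A duplicated input point makes this RBF kernel matrix exactly singular
x = np.array([0.0, 0.0, 1.0])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

L, used = cholesky_with_jitter(K)
print("jitter used:", used)
print("max reconstruction error:", np.abs(L @ L.T - K).max())
```

If the loop needs large jitter to succeed, that usually points to a genuinely indefinite matrix (for example an invalid kernel) rather than round-off, and the input deserves a closer look.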

References

Foundations

  1. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models

  4. Mercer, J. (1909). Functions of positive and negative type.

  5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels.

  6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics

  8. Nocedal, J., & Wright, S. (2006). Numerical Optimization.

  9. De Maesschalck, R. et al. (2000). The Mahalanobis distance.

Applications and practice

  10. Cortes, C., & Vapnik, V. (1995). Support-vector networks.

  11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

  12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare PSD tests: eigenvalues vs. Cholesky; show failure on non-PSD and success on jittered fix.

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

  • Symmetrization, eigenvalue check, and Cholesky factorization.

  • Adding jitter ($\epsilon I$) can restore PD for near-PSD matrices.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

  • Always symmetrize numerically; use small jitter (e.g., $10^{-6}$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; a symmetric $A$ has a factorization $A=LL^\top$ with $L$ lower-triangular and positive diagonal iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
n = 5
B = np.random.randn(n, n)
A = B @ B.T                # Gram matrix of a full-rank B: SPD
A = 0.5 * (A + A.T)        # symmetrize against round-off

# Shift the spectrum so the smallest eigenvalue is about -1e-4:
# A_bad is near-PSD but indefinite, the case jitter is meant to repair.
lam_min = np.linalg.eigvalsh(A).min()
A_bad = A - (lam_min + 1e-4) * np.eye(n)

def is_psd_eig(M, tol=1e-10):
    """Eigenvalue test: PSD if the smallest eigenvalue is >= -tol."""
    w = np.linalg.eigvalsh(M)
    return bool(np.all(w >= -tol)), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.6f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Jitter fix: eps must exceed the magnitude of the most negative eigenvalue
eps = 1e-3
A_fix = A_bad + eps * np.eye(n)
ok, w = is_psd_eig(A_fix)
print("jittered: min eig=", round(w.min(), 6), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("jittered Cholesky diag min:", round(np.min(np.diag(L)), 4))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify that the whitened covariance is (close to) the identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.

  • Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

  • Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))
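
The solution above prints only the diagonal of the whitened covariance. As a complementary check, the sketch below verifies the full identity $W \Sigma W^\top = I$, off-diagonal entries included. It regenerates similar data with NumPy's `default_rng` and omits the eigenvalue floor, both of which are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)
Sigma = (Xc.T @ Xc) / Xc.shape[0]

evals, Q = np.linalg.eigh(Sigma)             # Sigma = Q diag(evals) Q^T
W = np.diag(1.0 / np.sqrt(evals)) @ Q.T      # W = Lambda^{-1/2} Q^T (no floor: Sigma is PD here)

Sigma_w = W @ Sigma @ W.T                    # covariance of the whitened data
off_diag = Sigma_w - np.diag(np.diag(Sigma_w))
print("max |diag - 1|:   ", np.abs(np.diag(Sigma_w) - 1.0).max())
print("max |off-diagonal|:", np.abs(off_diag).max())
```

With a floor added to the eigenvalues, as in the solution above, the diagonal falls slightly below 1 in the directions with the smallest variance.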

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

  • RBF kernel is PSD; small datasets can be nearly singular when points are close.

  • Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

  • Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = A2 + A2.T - 2 * A @ A.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))
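
To connect this example with the earlier pitfall about length-scales, the following sketch prints the condition number of an RBF kernel matrix for several values of $\sigma$, with and without jitter. The helper `rbf_gram`, the list of length-scales, and the jitter size $10^{-6}$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 3))

def rbf_gram(X, sigma):
    """RBF kernel matrix exp(-||xi - xj||^2 / (2 sigma^2)) (hypothetical helper)."""
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # clip round-off negatives
    return np.exp(-D2 / (2.0 * sigma ** 2))

jitter = 1e-6
for sigma in [0.1, 1.0, 10.0, 100.0]:
    K = rbf_gram(X, sigma)
    K_jit = K + jitter * np.eye(len(X))
    print(f"sigma={sigma:>6}: cond(K)={np.linalg.cond(K):.3e}  "
          f"cond(K + jitter I)={np.linalg.cond(K_jit):.3e}")
```

Large length-scales push $K$ toward the all-ones matrix, which has rank one, so the condition number grows; jitter caps it at roughly $\lambda_\max / \text{jitter}$.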

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

  • $d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

  • Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

  1. De Maesschalck et al. (2000). The Mahalanobis distance.

  2. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))
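
The link to whitening mentioned above can be made concrete: for any factor with $M = LL^\top$, the Mahalanobis distance is the Euclidean distance after mapping points through $L^\top$. The sketch below uses the Cholesky factor in place of the symmetric square root $M^{1/2}$; that substitution is a choice made here for convenience, and either factor gives the same distance.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d))
M = A.T @ A + 0.5 * np.eye(d)              # SPD metric, built as in the example above

L = np.linalg.cholesky(M)                  # M = L L^T
x, y = rng.standard_normal(d), rng.standard_normal(d)
diff = x - y

mah2_direct = diff @ M @ diff              # (x - y)^T M (x - y)
mah2_mapped = np.sum((L.T @ diff) ** 2)    # ||L^T (x - y)||^2

print("direct:", mah2_direct, " via L^T map:", mah2_mapped)
```

This is why an SPD constraint on a learned $M$ matters: without it no real factor $L$ exists and the "distance" can become negative or zero for distinct points.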

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

  • Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

  • Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

  3. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = A2 + B2 - 2 * A @ B.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L L^T alpha = y
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))
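
The introduction mentions stability relative to an explicit inverse, which the solution does not compute. The sketch below compares the two on an ill-conditioned system by printing relative residuals. The small ridge value and the random targets are assumptions chosen to stress the solve; the outcome depends on the data, so no particular ordering of the residuals is claimed.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
X = rng.standard_normal((n, 3))
sq = (X ** 2).sum(axis=1)
D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-D2 / (2.0 * 1.2 ** 2))            # RBF kernel with sigma = 1.2
y = rng.standard_normal(n)

lam = 1e-6                                     # deliberately small ridge: ill-conditioned system
K_reg = K + lam * np.eye(n)

L = np.linalg.cholesky(K_reg)
# np.linalg.solve is a general solver; a dedicated triangular solver would exploit L's structure
alpha_chol = np.linalg.solve(L.T, np.linalg.solve(L, y))
alpha_inv = np.linalg.inv(K_reg) @ y           # explicit inverse, shown only for comparison

def rel_residual(alpha):
    return np.linalg.norm(K_reg @ alpha - y) / np.linalg.norm(y)

print(f"cond(K_reg): {np.linalg.cond(K_reg):.2e}")
print(f"relative residual, Cholesky: {rel_residual(alpha_chol):.2e}")
print(f"relative residual, inverse:  {rel_residual(alpha_inv):.2e}")
```

Beyond accuracy, the factorization can be reused for several right-hand sides and for the log-determinant, which is the main practical reason GP and KRR code keeps $L$ around.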

Chapter 8
Eigenvalues & Eigenvectors
Key ideas: Introduction

Introduction#

Eigen-analysis provides structure for symmetric/PSD matrices (covariances, Laplacians) and general matrices (Markov chains, Jacobians). Power iteration is a simple iterative method to approximate the largest eigenvalue/eigenvector using repeated multiplication and normalization.

Important ideas#

  1. Eigenpairs $(\lambda, v)$ satisfy $Av = \lambda v$; for symmetric $A$, eigenvectors form an orthonormal basis.

  2. Spectral theorem (symmetric): $A = Q \Lambda Q^\top$ with real eigenvalues; $Q$ orthogonal, $\Lambda$ diagonal.

  3. Eigengap and convergence: Power iteration converges at rate $|\lambda_2/\lambda_1|^k$ to the dominant eigenvector when $|\lambda_1|>|\lambda_2|$.

  4. Rayleigh quotient: $\rho(x) = \dfrac{x^\top A x}{x^\top x}$; maximized at $\lambda_\max$, minimized at $\lambda_\min$ for symmetric $A$.

  5. PSD matrices: Eigenvalues nonnegative; covariances and Gram matrices are PSD.

  6. Gershgorin disks: Eigenvalues lie within unions of disks defined by row sums—gives quick bounds.

  7. Perron–Frobenius: Nonnegative irreducible matrices have a unique positive dominant eigenvalue/vector (Markov chains, PageRank).
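
The Rayleigh quotient bounds in item 4 are easy to check numerically. The sketch below samples random vectors for a symmetric test matrix and compares the range of $\rho(x)$ with the extreme eigenvalues; the matrix size, sample count, and helper name `rayleigh` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = 0.5 * (B + B.T)                          # symmetric test matrix

evals, evecs = np.linalg.eigh(A)             # eigenvalues in ascending order

def rayleigh(A, x):
    """Rayleigh quotient x^T A x / x^T x for nonzero x."""
    return (x @ A @ x) / (x @ x)

samples = np.array([rayleigh(A, rng.standard_normal(5)) for _ in range(1000)])
print("lambda_min, lambda_max:", evals[0], evals[-1])
print("sampled Rayleigh range:", samples.min(), samples.max())
print("rho at top eigenvector:", rayleigh(A, evecs[:, -1]))
```

Random samples stay strictly inside the interval; the bounds are attained only at the corresponding eigenvectors.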

Relevance to ML#

  • PCA: Leading eigenvectors of covariance capture maximal variance; truncation yields best low-rank approximation.

  • Spectral clustering: Laplacian eigenvectors reveal cluster structure via graph cuts.

  • PageRank/Markov chains: Stationary distribution is dominant eigenvector.

  • Stability: Jacobian eigenvalues inform exploding/vanishing dynamics.

  • Attention/covariance spectra: Eigenvalue spread relates to conditioning and numerical stability.

Algorithmic development (milestones)#

  • 1900s: Spectral theorem for symmetric matrices.

  • 1907–1912: Perron–Frobenius theory for positive and nonnegative matrices.

  • 1929: Power method (von Mises), with later refinements (Householder and others).

  • 1931: Gershgorin circle theorem (eigenvalue bounds).

  • 1998: PageRank (Brin–Page) uses power iteration at web scale.

  • 2000s: Spectral clustering widely adopted (Ng–Jordan–Weiss 2002).

  • 2010s: Randomized eigensolvers/SVD for large-scale ML.

Definitions#

  • Eigenpair: $(\lambda, v\neq 0)$ with $Av=\lambda v$.

  • Spectrum $\sigma(A)$: set of eigenvalues; spectral radius $\rho(A)=\max |\lambda_i|$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$ for $x\neq0$.

  • Power iteration: $x_{k+1}=\dfrac{A x_k}{\lVert A x_k\rVert}$; converges to dominant eigenvector under eigengap.

  • Laplacian: $L=D-W$ (unnormalized), $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$ for graph spectral methods.
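
The two Laplacians defined above can be built in a few lines. The sketch below uses a small connected graph (an assumed example) and prints both spectra; the unnormalized Laplacian has smallest eigenvalue 0 with the constant vector as eigenvector, and the eigenvalues of $L_{\text{sym}}$ lie in $[0, 2]$.

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)     # small connected, undirected graph (assumed example)

deg = W.sum(axis=1)
L = np.diag(deg) - W                           # unnormalized Laplacian D - W
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))       # requires every node to have degree > 0
L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt

print("eig(L):    ", np.round(np.linalg.eigvalsh(L), 4))
print("eig(L_sym):", np.round(np.linalg.eigvalsh(L_sym), 4))
print("L @ 1:     ", L @ np.ones(len(W)))
```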

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Spectral theorem (symmetric/PSD): orthonormal eigenbasis; real eigenvalues.

  • Rayleigh–Ritz: Extreme eigenvalues maximize/minimize Rayleigh quotient.

  • Perron–Frobenius: Positive eigenvector for irreducible nonnegative matrices; spectral gap governs convergence.

  • Gershgorin circle theorem: Eigenvalues lie in disk unions from row sums.

  • Power iteration convergence: Linear rate governed by $|\lambda_2/\lambda_1|$ when $|\lambda_1|>|\lambda_2|$.

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Spectral clustering: Ng–Jordan–Weiss (2002); von Luxburg (2007).

  • PageRank: Brin–Page (1998).

  • Randomized SVD/eigs: Halko–Martinsson–Tropp (2011).

  • Stability of deep nets (spectral radius/initialization): Saxe et al. (2013); Goodfellow et al. (2016).

Key ideas: Where it shows up
  1. PCA and covariance spectra

  • Covariance $\Sigma=\tfrac{1}{n} X_c^\top X_c$ (PSD); eigenvectors = principal axes; eigenvalues = variances.

  • Achievements: Dimensionality reduction, whitening; core tool in vision/speech. References: Jolliffe 2002; Shlens 2014.

  2. Spectral clustering (graph Laplacian)

  • Use first $k$ eigenvectors of $L_{\text{sym}}$ to embed nodes, then k-means.

  • Achievements: Strong performance on non-convex clusters and manifold data. References: Ng–Jordan–Weiss 2002; von Luxburg 2007.

  3. PageRank / Markov chains

  • Dominant eigenvector of stochastic $P$ gives stationary distribution; computed via power iteration.

  • Achievements: Web search ranking at internet scale. References: Brin–Page 1998.

  4. Conditioning and stability (Jacobians/Hessians)

  • Largest eigenvalue relates to Lipschitz constants; affects step sizes and gradient explosion/vanishing.

  • Achievements: Initialization/normalization techniques guided by spectral radius. References: Saxe et al. 2013; Goodfellow et al. 2016.

  5. Randomized eigen/SVD for large ML

  • Approximate leading eigenpairs with fewer passes over data.

  • Achievements: Scalable PCA/LSA/embedding preprocessing. References: Halko–Martinsson–Tropp 2011.

Notation
  • Eigenvalues/eigenvectors: $Av_i = \lambda_i v_i$; order eigenvalues $\lambda_1\ge\lambda_2\ge\cdots$ for symmetric PSD.

  • Decompositions: $A=V\Lambda V^{-1}$ (diagonalizable); symmetric $A=Q \Lambda Q^\top$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$; for unit $x$, $\rho(x)=x^\top A x$.

  • Power iteration step: $x_{k+1} = A x_k / \lVert A x_k\rVert$; eigenvalue estimate $\lambda_k = x_k^\top A x_k$ if $\lVert x_k\rVert=1$.

  • Laplacian eigenmaps: $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$; use $k$ smallest nontrivial eigenvectors.

  • Examples:

    • Dominant eigenpair by power iteration on SPD matrix.

    • PCA via eigen-decomposition of $\Sigma$.

    • PageRank: eigenvector of stochastic matrix with eigenvalue 1.

Pitfalls & sanity checks
  • Non-symmetric matrices may have complex eigenvalues; use appropriate routines.

  • Power iteration slows when eigengap is small; consider Lanczos/Arnoldi or deflation.

  • Normalize in power iteration to avoid overflow/underflow.

  • Center data before PCA; otherwise leading eigenvectors capture mean.

  • Gershgorin bounds are loose; use as qualitative checks only.
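
The first pitfall is worth seeing once. In the sketch below, a rotation matrix (an assumed example) has a complex-conjugate pair of eigenvalues under the general routine, while the symmetric routine raises no error and returns real numbers computed from one triangle of the matrix only.

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: non-symmetric, no real eigenvectors

ev_general = np.linalg.eigvals(R)     # general routine: complex pair on the unit circle
ev_sym = np.linalg.eigvalsh(R)        # symmetric routine: reads only the lower triangle by default

print("eigvals :", ev_general)
print("eigvalsh:", ev_sym, "(not eigenvalues of R)")
```

A cheap guard is to check `np.allclose(A, A.T)` before calling any of the symmetric routines.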

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  4. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Spectral methods and applications

  5. Jolliffe, I. (2002). Principal Component Analysis.

  6. Shlens, J. (2014). A Tutorial on PCA.

  7. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering: Analysis and an Algorithm.

  8. von Luxburg, U. (2007). A Tutorial on Spectral Clustering.

  9. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.

  10. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.

  11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.

  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  13. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  14. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Five worked examples

Worked Example 1: Power iteration for dominant eigenpair (SPD matrix)#

Introduction#

Compute the largest eigenvalue/vector of an SPD matrix via power iteration and compare to NumPy’s eigh.

Purpose#

Illustrate convergence rate and normalization; give a practical recipe.

Importance#

Many large-scale methods rely on top eigenpairs (PCA, spectral norm estimation).

What this example demonstrates#

  • Convergence rate depends on the eigenvalue ratio $|\lambda_2/\lambda_1|$: the smaller the ratio (the larger the eigengap), the faster the convergence.

  • Rayleigh quotient of iterates approximates $\lambda_1$.

Background#

Power methods date back a century and remain relevant for scalable eigensolvers.

Historical context#

von Mises/Householder refinements; modern variants include Lanczos/Arnoldi.

Prevalence in ML#

Used implicitly in randomized PCA, spectral norm estimation, and operator norm regularization.

Notes#

  • Normalize every step; monitor residual $\lVert Ax-\lambda x\rVert$.

Connection to ML#

Top eigenvalue relates to Lipschitz constants; top eigenvector for PCA directions.

Connection to Linear Algebra Theory#

Convergence proof via eigen-expansion; Rayleigh quotient bounds.

Pedagogical Significance#

Shows a simple iterative algorithm achieving an eigenpair without full decomposition.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
d = 6
A = np.random.randn(d, d)
A = A.T @ A + 0.5 * np.eye(d)  # SPD

def power_iteration(A, iters=50):
    # Random unit-norm start vector
    x = np.random.randn(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)  # renormalize to avoid overflow/underflow
    lam = x @ (A @ x)  # Rayleigh quotient (x has unit norm)
    return lam, x

lam_pi, v_pi = power_iteration(A, iters=40)
eigvals, eigvecs = np.linalg.eigh(A)  # ascending order, so the last entry is the largest
lam_true = eigvals[-1]
print("power iteration λ≈", round(lam_pi, 6), " true λ=", round(lam_true, 6))
print("angle to true v (deg):", round(np.degrees(np.arccos(np.clip(abs(v_pi @ eigvecs[:, -1]), -1, 1))), 6))
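
The note about monitoring the residual $\lVert Ax-\lambda x\rVert$ can be turned into a convergence check. In the sketch below the spectrum is chosen by hand (an assumption for illustration) so that $|\lambda_2/\lambda_1| = 0.5$; the ratio of successive residuals should then settle near that value.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal basis
lams = np.array([4.0, 2.0, 1.0, 0.5])              # chosen spectrum: |lambda_2 / lambda_1| = 0.5
A = (Q * lams) @ Q.T                               # symmetric matrix with these eigenvalues

x = rng.standard_normal(4)
x /= np.linalg.norm(x)
residuals = []
for _ in range(25):
    x = A @ x
    x /= np.linalg.norm(x)
    lam = x @ A @ x                                # Rayleigh quotient estimate
    residuals.append(np.linalg.norm(A @ x - lam * x))

ratios = np.array(residuals[1:]) / np.array(residuals[:-1])
print("final eigenvalue estimate:", lam)
print("late residual ratios:", np.round(ratios[-5:], 4))
```

In practice the residual also serves as a stopping criterion: iterate until it drops below a tolerance instead of fixing the iteration count.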

Worked Example 2: PCA via eigen-decomposition of covariance#

Introduction#

Compute covariance, eigenpairs, and project data onto top-$k$ components; verify variance captured.

Purpose#

Connect PCA steps to eigenvalues/eigenvectors and retained variance.

Importance#

Core dimensionality reduction tool in ML.

What this example demonstrates#

  • $\Sigma = \tfrac{1}{n} X_c^\top X_c$ is PSD; eigenvalues = variances along eigenvectors.

  • Retained variance ratio from top-$k$ eigenvalues.

Background#

Eigen-decomposition of covariance equals SVD-based PCA.

Historical context#

PCA roots in Pearson/Hotelling; ubiquitous in data analysis.

Prevalence in ML#

Widely used in preprocessing, visualization, and compression.

Notes#

  • Center data; consider scaling features.

Connection to ML#

Variance retention guides choice of $k$; whitening uses inverse sqrt of eigenvalues.

Connection to Linear Algebra Theory#

PSD eigen-structure; orthogonal projections onto principal subspaces.

Pedagogical Significance#

Demonstrates PSD eigen-decomposition in a practical workflow.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 120, 8, 3
X = np.random.randn(n, d) @ np.diag(np.linspace(3, 0.5, d))
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
idx = np.argsort(evals)[::-1]
evals, evecs = evals[idx], evecs[:, idx]

Vk = evecs[:, :k]
X_proj = Xc @ Vk
variance_retained = evals[:k].sum() / evals.sum()
print("Top-k variance retained:", round(variance_retained, 4))
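
The background note says that eigen-decomposition of the covariance matches SVD-based PCA. A quick check, sketched below on similar synthetic data (regenerated here with `default_rng`, an assumption of this example): the covariance eigenvalues equal $s_i^2/n$, where $s_i$ are the singular values of the centered data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 120, 8
X = rng.standard_normal((n, d)) @ np.diag(np.linspace(3, 0.5, d))
Xc = X - X.mean(axis=0, keepdims=True)

evals = np.linalg.eigvalsh((Xc.T @ Xc) / n)[::-1]       # covariance eigenvalues, descending
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)       # singular values are returned descending

print("eig of covariance:", np.round(evals[:3], 4))
print("s^2 / n from SVD: ", np.round(s[:3] ** 2 / n, 4))
```

The rows of `Vt` are the principal directions (up to sign), so the SVD route avoids forming $X_c^\top X_c$ explicitly, which is often preferable for conditioning.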

Worked Example 3: PageRank via power iteration (stochastic matrix)#

Introduction#

Compute PageRank on a small directed graph using power iteration on a stochastic matrix with damping.

Purpose#

Show Perron–Frobenius in action and convergence to the dominant eigenvector.

Importance#

Seminal large-scale eigen-application; template for Markov-chain ranking.

What this example demonstrates#

  • Transition matrix $P$ has eigenvalue 1; power iteration converges to stationary distribution.

  • Damping ensures irreducibility/aperiodicity.

Background#

Web graph ranking; a damping factor (e.g., 0.85) handles dead ends and spider traps.

Historical context#

Brin–Page (1998) launched web search revolution.

Prevalence in ML#

Graph ranking, recommendation propagation, random-walk-based features.

Notes#

  • Normalize columns to sum to 1; add teleportation for damping.

Connection to ML#

Graph-based semi-supervised learning often reuses random-walk ideas.

Connection to Linear Algebra Theory#

Perron–Frobenius guarantees positive dominant eigenvector; spectral gap drives convergence.

Pedagogical Significance#

Concrete, small-scale power iteration on a stochastic matrix.

References#

  1. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.

  2. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Solution (Python)#

import numpy as np

# Small directed graph adjacency
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# Column-stochastic transition (out-links per column)
col_sums = A.sum(axis=0, keepdims=True)
P = A / np.where(col_sums == 0, 1, col_sums)

alpha = 0.85
n = P.shape[0]
J = np.ones((n, n)) / n
M = alpha * P + (1 - alpha) * J  # damping

# Power iteration from the uniform distribution
v = np.ones(n) / n
for _ in range(50):
    v = M @ v
    v = v / v.sum()

# Reference: eigenvector for the eigenvalue with largest real part (equal to 1)
eigvals, eigvecs = np.linalg.eig(M)
idx = np.argmax(np.real(eigvals))
v_true = np.real(eigvecs[:, idx])
v_true = v_true / v_true.sum()

print("Power iteration PageRank:", np.round(v, 4))
print("Eigenvector PageRank:", np.round(v_true, 4))

Worked Example 4: Gershgorin disks vs true eigenvalues (bounds)#

Introduction#

Compute Gershgorin disks for a matrix and compare to actual eigenvalues to illustrate spectrum localization.

Purpose#

Provide quick sanity bounds on eigenvalues without full eigendecomposition.

Importance#

Useful for diagnosing stability (e.g., Jacobians) and conditioning.

What this example demonstrates#

  • All eigenvalues lie within the union of Gershgorin disks centered at $a_{ii}$ with radius $\sum_{j\neq i} |a_{ij}|$.

Background#

Classic theorem (1931) for eigenvalue localization.

Historical context#

Still a staple for quick qualitative checks.

Prevalence in ML#

Less direct, but useful for reasoning about spectrum without expensive computation.

Notes#

  • Tightness varies; row/column scaling can sharpen disks.

Connection to ML#

Stability analysis of iterative methods; rough spectral norm estimates.

Connection to Linear Algebra Theory#

Eigenvalue inclusion sets; diagonal dominance implications.

Pedagogical Significance#

Gives a geometric picture of eigenvalue bounds.

References#

  1. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  2. Horn & Johnson (2013). Matrix Analysis.

Solution (Python)#

import numpy as np

np.random.seed(2)
A = np.random.randn(4, 4)

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
eigvals = np.linalg.eigvals(A)

print("Gershgorin centers:", np.round(centers, 3))
print("Gershgorin radii:", np.round(radii, 3))
print("Eigenvalues:", np.round(eigvals, 3))
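
The solution prints disks and eigenvalues side by side but leaves the comparison to the reader. The sketch below automates the containment check; the variable names and the small tolerance are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)   # row sums of off-diagonal magnitudes
eigvals = np.linalg.eigvals(A)                    # may be complex for non-symmetric A

# dist[i, j] = |lambda_i - a_jj|; eigenvalue i is covered if it lies inside at least one disk j
dist = np.abs(eigvals[:, None] - centers[None, :])
covered = (dist <= radii[None, :] + 1e-12).any(axis=1)
print("each eigenvalue inside some disk:", covered)
```

The theorem guarantees containment in the union of disks, not one eigenvalue per disk, which is why the check uses `any` across disks.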

Worked Example 5: Spectral clustering on a toy graph (Laplacian eigenvectors)#

Introduction#

Perform unnormalized spectral clustering on a simple graph with two clusters; use second-smallest eigenvector (Fiedler) for separation.

Purpose#

Show how Laplacian eigenvectors reveal cluster structure.

Importance#

Nonlinear/non-convex cluster discovery.

What this example demonstrates#

  • $L=D-W$; smallest eigenvalue 0 with eigenvector $\mathbf{1}$; Fiedler vector splits the graph.

Background#

Graph cuts and Laplacian spectra; normalized variants common in practice.

Historical context#

Spectral clustering surged in the 2000s for manifold data.

Prevalence in ML#

Image segmentation, manifold learning, community detection.

Notes#

  • For normalized Laplacian, use $L_{\text{sym}}$; results similar on this toy example.

Connection to ML#

Embedding nodes using a few eigenvectors before k-means is standard pipeline.

Connection to Linear Algebra Theory#

Properties of Laplacian eigenvalues (nonnegative; multiplicity of 0 equals number of components).

Pedagogical Significance#

Concrete end-to-end spectral clustering demonstration.

References#

  1. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  2. von Luxburg, U. (2007). A tutorial on spectral clustering.

Solution (Python)#

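import numpy as np

# Two triangles (strong intra-cluster edges) joined by one weak bridge edge.
# The bridge keeps the graph connected, so eigenvalue 0 is simple and the
# Fiedler vector (second-smallest eigenvector) is well defined.
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W
evals, evecs = np.linalg.eigh(L)  # ascending eigenvalues

fiedler = evecs[:, 1]
clusters = (fiedler > 0).astype(int)  # labels are defined only up to a global sign flip
print("Eigenvalues:", np.round(evals, 4))
print("Fiedler vector:", np.round(fiedler, 4))
print("Cluster assignment via sign:", clusters)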

Chapter 7
Rank & Nullspace
Key ideas: Introduction

Introduction#

Rank and null space describe how information flows through matrices:

  • Rank $r$ = number of independent columns/rows (nonzero singular values)

  • Null space $\text{null}(A)$ = set of inputs mapped to zero (lost information)

  • Column/row spaces = subspaces where outputs/inputs live; orthogonal complements relate via FTLA

  • Pseudoinverse solves least squares even when $A$ is rank-deficient (minimal-norm solutions)

  • Low-rank structure compresses models and reveals latent factors (factorization)

Important ideas#

  1. Row rank equals column rank

    • $\operatorname{rank}(A)$ is the dimension of $\text{col}(A)$ and equals that of $\text{row}(A)$.

  2. Rank via singular values

    • If $A=U\Sigma V^\top$, then $\operatorname{rank}(A)$ equals the number of nonzero singular values $\sigma_i$.

  3. Rank–nullity theorem

    • For $A\in\mathbb{R}^{m\times d}$, $$\operatorname{rank}(A) + \operatorname{nullity}(A) = d.$$

  4. Fundamental theorem of linear algebra (FTLA)

    • For $A\in\mathbb{R}^{m\times d}$: $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$ and $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$ (orthogonal decompositions).

  5. Rank of products and sums

    • $\operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$; rank is subadditive for sums: $\operatorname{rank}(A+B) \le \operatorname{rank}(A) + \operatorname{rank}(B)$.

  6. Pseudoinverse $A^+$

    • Moore–Penrose $A^+$ gives the minimal-norm least-squares solution $x^* = A^+ b$; satisfies $AA^+A = A$.

  7. Numerical rank

    • Practical rank uses thresholds on singular values to handle floating-point noise.
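Items 2, 3, and 7 can be tied together in a few lines. This is a sketch under stated assumptions: the sizes `m, d, r` are illustrative, and the threshold `tol` follows the scale-aware form used as the default in `np.linalg.matrix_rank`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, r = 40, 30, 5
# Product of an m x r and an r x d factor: exact rank r, up to floating-point round-off
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, d))

S = np.linalg.svd(A, compute_uv=False)
# Threshold scaled by the largest singular value, the matrix size, and machine epsilon
tol = S.max() * max(m, d) * np.finfo(A.dtype).eps
num_rank = int(np.sum(S > tol))
nullity = d - num_rank

print("leading singular values:", np.round(S[:8], 6))
print("numerical rank:", num_rank, " nullity:", nullity, " rank + nullity =", num_rank + nullity)
```

The trailing singular values are not exactly zero in floating point, which is why a threshold tied to the scale of $A$ is used instead of a test against zero.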

Relevance to ML#

  • Multicollinearity: rank-deficient design $X$ yields non-unique OLS solutions; regularization/pseudoinverse needed.

  • PCA/compression: low rank captures variance efficiently; truncation yields best rank-$k$ approximation.

  • Recommendation systems: user–item matrices modeled as low-rank factorization.

  • Kernels/Gram matrices: rank informs capacity and generalization; $\operatorname{rank}(XX^\top) \le \min(n,d)$.

  • Attention: score matrix $QK^\top$ has rank bounded by $\min(n, d_k)$; head dimension limits expressivity.

  • Deep nets: bottleneck layers enforce low-rank mapping; adapters/LoRA factorize weights.

Algorithmic development (milestones)#

  • 1936: Eckart–Young — best rank-$k$ approximation via SVD.

  • 1955: Penrose — Moore–Penrose pseudoinverse.

  • 1990s–2000s: Matrix factorization in recommender systems (SVD-based, ALS).

  • 2009: Candès–Recht — nuclear norm relaxation for matrix completion.

  • 2011: Halko–Martinsson–Tropp — randomized SVD for large-scale low-rank.

  • 2019–2021: Adapters and low-rank adaptation (LoRA) fine-tune transformers with few added parameters; LoRA learns low-rank updates to frozen weights.

Definitions#

  • $\operatorname{rank}(A)$: dimension of $\text{col}(A)$ (or $\text{row}(A)$); number of nonzero singular values.

  • $\text{null}(A) = \{x: Ax=0\}$; $\text{null}(A^\top)$ similarly.

  • $\text{col}(A)$: span of columns; $\text{row}(A)$: span of rows.

  • FTLA decompositions (for $A\in\mathbb{R}^{m\times d}$): $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$, $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$.

  • Pseudoinverse: $A^+ = V \Sigma^+ U^\top$ where $\Sigma^+$ reciprocates nonzero $\sigma_i$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Rank–nullity: $$\operatorname{rank}(A)+\operatorname{nullity}(A)=d.$$

  • FTLA (four subspaces): $\text{col}(A) \perp \text{null}(A^\top)$ and $\text{row}(A) \perp \text{null}(A)$.

  • Row=column rank: $\dim\text{row}(A) = \dim\text{col}(A)$.

  • Singular values and rank: $\operatorname{rank}(A)$ is the count of positive $\sigma_i$.

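  • Sylvester’s inequality: for $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{k\times p}$, $\operatorname{rank}(A) + \operatorname{rank}(B) - k \le \operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$, where $k$ is the shared inner dimension.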

  • Eckart–Young–Mirsky: Truncated SVD minimizes error among rank-$k$ approximations.

  • Moore–Penrose pseudoinverse properties: $AA^+A=A$, $A^+AA^+=A^+$.
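The two-sided bound on the rank of a product can be checked on random low-rank factors. This is an illustrative sketch: the shapes `m, k, p` and target ranks `rA, rB` are arbitrary choices, with `k` denoting the inner dimension of the product.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, p = 30, 20, 25        # A is m x k, B is k x p; k is the inner dimension
rA, rB = 12, 11             # target ranks (illustrative)

A = rng.standard_normal((m, rA)) @ rng.standard_normal((rA, k))
B = rng.standard_normal((k, rB)) @ rng.standard_normal((rB, p))

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
rank_AB = np.linalg.matrix_rank(A @ B)

lower = rank_A + rank_B - k    # Sylvester's lower bound
upper = min(rank_A, rank_B)    # upper bound for a product
print(f"{lower} <= rank(AB) = {rank_AB} <= {upper}")
```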

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Stable least squares: Golub–Van Loan (2013).

  • Matrix completion via nuclear norm: Candès–Recht (2009).

  • Randomized SVD for scale: Halko–Martinsson–Tropp (2011).

  • Recommender systems: Koren–Bell–Volinsky (2009).

  • Low-rank adapters in transformers: Hu et al. (2021).

Key ideas: Where it shows up
  1. PCA and covariance rank

  • Centered data $X_c$ yields covariance $\Sigma = \tfrac{1}{n} X_c^\top X_c$ with $\operatorname{rank}(\Sigma) \le \min(n-1, d)$.

  • Achievements: Dimensionality reduction with $k\ll d$; whitening in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. Regression and multicollinearity

  • If $\operatorname{rank}(X) < d$, normal equations $X^\top X w = X^\top y$ are singular; pseudoinverse/regularization resolve ambiguity.

  • Achievements: Robust linear modeling; Ridge/Lasso mitigate rank issues. References: Hoerl–Kennard 1970; Tibshirani 1996; Golub–Van Loan 2013.

  3. Low-rank models and compression

  • Factorize $W \approx AB^\top$ with small inner dimension to reduce parameters and computation (adapters, LoRA).

  • Achievements: Efficient fine-tuning of large transformers. References: Hu et al. 2021 (LoRA); Tishby & Zaslavsky 2015 (bottlenecks conceptual).

  4. Matrix factorization for recommendation

  • User–item ratings approximated by low-rank matrices; SVD/ALS used in practice.

  • Achievements: Netflix Prize-era improvements; widespread deployment. References: Koren et al. 2009; Funk 2006.

  5. Kernels/Gram and attention score rank

  • $G=XX^\top$ has rank $\le \min(n,d)$; $QK^\top$ rank $\le \min(n,d_k)$. Rank limits expressivity and affects generalization.

  • Achievements: Scalable kernel methods via low-rank approximations; attention head size trade-offs. References: Schölkopf–Smola 2002; Vaswani et al. 2017.
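The parameter saving from a factorization $W \approx AB^\top$ can be made concrete. The sketch below uses illustrative sizes (`d_out`, `d_in`, `r` are assumptions, not values from the text) and only demonstrates the counting and the rank bound, not any particular adapter method.

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 64, 48, 4      # illustrative sizes; r is the inner (bottleneck) dimension

A = rng.standard_normal((d_out, r))
B = rng.standard_normal((d_in, r))
W_lowrank = A @ B.T             # d_out x d_in matrix of rank at most r

full_params = d_out * d_in
factored_params = r * (d_out + d_in)
rank_W = np.linalg.matrix_rank(W_lowrank)
print("full params:", full_params, " factored params:", factored_params)
print("rank(A B^T) =", rank_W, "<= r =", r)

# Applying the factors in sequence avoids forming the d_out x d_in matrix
x = rng.standard_normal(d_in)
same = np.allclose(W_lowrank @ x, A @ (B.T @ x))
print("factored matvec matches dense matvec?", same)
```

Applying $B^\top$ and then $A$ costs on the order of $r(d_{\text{out}}+d_{\text{in}})$ multiplications per vector instead of $d_{\text{out}} d_{\text{in}}$.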

Notation
  • Shapes: $A\in\mathbb{R}^{m\times d}$; $X\in\mathbb{R}^{n\times d}$ is data.

  • Spaces: $\text{col}(A)$, $\text{row}(A)$, $\text{null}(A)$, $\text{null}(A^\top)$.

  • Rank: $\operatorname{rank}(A)$; Nullity: $\operatorname{nullity}(A)$.

  • SVD: $A=U\Sigma V^\top$; $U\in\mathbb{R}^{m\times r}$, $V\in\mathbb{R}^{d\times r}$ span column/row spaces; $r=\operatorname{rank}(A)$.

  • Pseudoinverse: $A^+ = V\Sigma^+ U^\top$; minimal-norm solution $x^* = A^+ b$.

  • Examples:

    • Rank via SVD: count $\sigma_i > \tau$ with threshold $\tau$.

    • Projection onto column space: $P_{\text{col}} = U_r U_r^\top$; onto row space: $P_{\text{row}} = V_r V_r^\top$.

    • Covariance rank: $\operatorname{rank}(X_c^\top X_c) \le n-1$ for centered data.

Pitfalls & sanity checks
  • Never invert $X^\top X$ when $\operatorname{rank}(X)<d$; use QR/SVD or regularize.

  • Diagnose numerical rank via singular values; set thresholds based on scale.

  • Center data for covariance; otherwise rank properties and PCA directions change.

  • Beware overfitting: increasing rank (k in PCA/factorization) beyond signal raises variance.

  • Attention heads: too small $d_k$ may limit expressivity; too large may hurt stability.
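The first pitfall can be illustrated as follows. This is a sketch, assuming a synthetic design with one exactly dependent column and an arbitrary ridge strength `lam`; it contrasts the singular Gram matrix with two safer routes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
X = rng.standard_normal((n, d))
X[:, 3] = X[:, 0] - X[:, 1]          # exact linear dependence, so rank(X) = 3 < d
y = rng.standard_normal(n)

gram = X.T @ X
cond_gram = np.linalg.cond(gram)
print("cond(X^T X) =", cond_gram)    # enormous: the Gram matrix is numerically singular

# Option 1: SVD-based least squares returns the minimal-norm solution
w_svd, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Option 2: ridge regularization makes the system well-posed (lam is an illustrative value)
lam = 1e-3
w_ridge = np.linalg.solve(gram + lam * np.eye(d), X.T @ y)

print("rank reported by lstsq:", rank)
print("||w_svd|| =", np.linalg.norm(w_svd), " ||w_ridge|| =", np.linalg.norm(w_ridge))
```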

References

Foundations and theory

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Low-rank approximation and factorization

  4. Eckart, C., & Young, G. (1936). Best rank-$k$ approximation.

  5. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

  6. Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization.

  7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

Regression and pseudoinverse

  8. Penrose, R. (1955). A generalized inverse for matrices.

  9. Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression.

  10. Tibshirani, R. (1996). Lasso.

ML systems and practice

  11. Jolliffe, I. (2002). Principal Component Analysis.

  12. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  13. Murphy, K. P. (2022). Probabilistic Machine Learning.

  14. Vaswani, A. et al. (2017). Attention Is All You Need.

  15. Devlin, J. et al. (2019). BERT.

Five worked examples

Worked Example 1: Detecting multicollinearity via null space (non-unique regression)#

Introduction#

Show how null space reveals linear dependencies among features and why OLS becomes non-unique when $\operatorname{rank}(X)<d$.

Purpose#

Compute null space vectors and connect them to redundant directions; use pseudoinverse for a minimal-norm solution.

Importance#

Avoids unstable fits and clarifies identifiability in models.

What this example demonstrates#

  • If $v\in\text{null}(X)$, $X(w+\alpha v) = Xw$ for all $\alpha$; infinitely many OLS solutions.

  • Pseudoinverse $w^* = X^+ y$ yields the minimal-norm solution.

Background#

Rank deficiency arises from duplicate/derived features or insufficient data.

Historical context#

Gauss/Legendre least squares; Penrose pseudoinverse enables solutions in singular cases.

Prevalence in ML#

High-dimensional regression, feature engineering pipelines, polynomial expansions.

Notes#

  • Use SVD to diagnose numerical rank; add Ridge to regularize.

Connection to ML#

Feature selection and regularization strategies hinge on rank awareness.

Connection to Linear Algebra Theory#

FTLA: residuals in $\text{null}(X^\top)$; solution set $w_0 + \text{null}(X)$.
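The solution-set structure $w_0 + \text{null}(X)$ can be seen directly. In this sketch the data are synthetic and the null vector `v` is written by hand from the known dependence between columns; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 4
X = rng.standard_normal((n, d))
X[:, 3] = X[:, 0] + X[:, 1]                  # column 3 is redundant
y = rng.standard_normal(n)

w0, *_ = np.linalg.lstsq(X, y, rcond=None)   # one least-squares solution (the minimal-norm one)
v = np.array([1.0, 1.0, 0.0, -1.0])          # X @ v = x0 + x1 - x3 = 0, so v is in null(X)

for alpha in [0.0, 1.0, -5.0]:
    w = w0 + alpha * v
    res = np.linalg.norm(y - X @ w)
    print(f"alpha={alpha:+.1f}  residual={res:.6f}  ||w||={np.linalg.norm(w):.4f}")

r = y - X @ w0
print("||X^T r|| =", np.linalg.norm(X.T @ r))   # residual lies in null(X^T)
```

Every `alpha` gives the same residual, while the norm of `w` is smallest at `alpha = 0` because the minimal-norm solution is orthogonal to $\text{null}(X)$.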

Pedagogical Significance#

Makes the geometry of “non-unique solutions” tangible.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Penrose (1955). Moore–Penrose pseudoinverse.

  3. Hoerl & Kennard (1970). Ridge regression.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 6
X = np.random.randn(n, d)
X[:, 5] = X[:, 0] + X[:, 1]  # make a perfectly collinear feature
w_true = np.array([1.0, -0.5, 0.3, 0.0, 2.0, 0.3])
y = X @ w_true + 0.1 * np.random.randn(n)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
tol = 1e-8  # single threshold for both the rank estimate and the pseudoinverse
rank = int(np.sum(S > tol))
nullspace_basis = Vt[rank:].T  # columns spanning null(X); Vt is d x d because n >= d

print("rank(X)=", rank, " d=", d, " nullity=", d - rank)
print("Nullspace basis shape:", nullspace_basis.shape)

# Minimal-norm solution via pseudoinverse (invert only singular values above tol)
S_inv = np.zeros_like(S)
S_inv[S > tol] = 1.0 / S[S > tol]
w_min = Vt.T @ (S_inv * (U.T @ y))
print("||w_min||2=", np.linalg.norm(w_min))
print("OLS residual norm:", np.linalg.norm(y - X @ w_min))

Worked Example 2: Covariance rank ≤ n−1 (PCA in n<d regimes)#

Introduction#

Verify empirically that centered covariance has rank at most $n-1$ regardless of feature dimension.

Purpose#

Explain why PCA cannot produce more than $n-1$ nonzero eigenvalues and how this affects high-dimensional settings.

Importance#

Shapes expectations for PCA on small data; prevents overinterpretation.

What this example demonstrates#

  • With $X_c\in\mathbb{R}^{n\times d}$ centered, $\operatorname{rank}(X_c) \le \min(n-1, d)$; hence $\operatorname{rank}(\Sigma) \le n-1$.

Background#

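Centering imposes one linear constraint (every column of $X_c$ sums to zero), so $\operatorname{rank}(X_c)\le n-1$; this lowers the rank only when the uncentered rank would otherwise exceed $n-1$, as in $d\ge n$ settings.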

Historical context#

PCA theory and practice emphasize centering for correct variance structure.

Prevalence in ML#

Common in text, genomics, and other $d\gg n$ problems.

Notes#

  • Always center before PCA; whitening depends on accurate rank.

Connection to ML#

Model selection of $k$ principal components must respect $n-1$ limit.

Connection to Linear Algebra Theory#

Centering makes every column of $X_c$ sum to zero, i.e. $X_c^\top \mathbf{1} = 0$, which places $\mathbf{1}$ in $\text{null}(X_c^\top)$.
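The constraint and its effect on rank can be checked in both regimes. This is a sketch with a hypothetical helper `centered_rank` and arbitrary sizes; it contrasts a wide matrix ($n<d$) with a tall one ($n>d$).

```python
import numpy as np

rng = np.random.default_rng(5)

def centered_rank(n, d):
    """Return the rank of a column-centered Gaussian matrix and the norm of 1^T Xc."""
    X = rng.standard_normal((n, d))
    Xc = X - X.mean(axis=0, keepdims=True)
    col_sum_norm = np.linalg.norm(np.ones(n) @ Xc)   # column sums of Xc vanish
    return np.linalg.matrix_rank(Xc), col_sum_norm

rank_wide, s_wide = centered_rank(n=10, d=50)   # n < d: rank is capped at n - 1
rank_tall, s_tall = centered_rank(n=50, d=10)   # n > d: rank is capped at d
print("n=10, d=50: rank =", rank_wide, " ||1^T Xc|| =", s_wide)
print("n=50, d=10: rank =", rank_tall, " ||1^T Xc|| =", s_tall)
```

In the tall case the cap $\min(n-1, d)$ is set by $d$, so centering does not reduce the rank there.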

Pedagogical Significance#

Reinforces how constraints reduce rank.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 30, 200
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
rank = np.sum(S > 1e-10)
print("rank(Xc)=", rank, " <= min(n-1,d)=", min(n-1, d))

Worked Example 3: Low-rank matrix factorization for recommendation (synthetic)#

Introduction#

Construct a synthetic user–item rating matrix with known low rank and recover it via truncated SVD.

Purpose#

Demonstrate latent-factor modeling and show that the squared Frobenius reconstruction error equals the sum of squared tail singular values.

Importance#

Illustrates the power of rank reduction in recommender systems.

What this example demonstrates#

  • $R\approx U_k \Sigma_k V_k^\top$ captures most variance when spectrum decays.

Background#

Matrix factorization underlies collaborative filtering; ALS/SGD optimize latent vectors.

Historical context#

Post-Netflix Prize, low-rank methods became industry standard.

Prevalence in ML#

Ubiquitous in recommendation and implicit feedback modeling.

Notes#

  • For missing data, completion requires specialized optimization (not shown here).

Connection to ML#

Latent dimensions reflect user/item factors; rank controls capacity.

Connection to Linear Algebra Theory#

Eckart–Young guarantees best rank-$k$ approximation.
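The optimality statement can be probed numerically. The sketch below uses a random matrix and an arbitrary `k`, and compares the truncated SVD against one competing rank-$k$ approximation (a projection onto a random subspace); that single comparison is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(6)
R = rng.standard_normal((30, 20))
k = 4

U, S, Vt = np.linalg.svd(R, full_matrices=False)
Rk = (U[:, :k] * S[:k]) @ Vt[:k]                 # truncated SVD, rank k

err_fro = np.linalg.norm(R - Rk, 'fro')
err_2 = np.linalg.norm(R - Rk, 2)
tail = np.sqrt((S[k:] ** 2).sum())
print("Frobenius error:", err_fro, " sqrt(tail sum):", tail)
print("Spectral error:", err_2, " sigma_{k+1}:", S[k])

# A competing rank-k approximation: project rows onto a random k-dimensional subspace
G, _ = np.linalg.qr(rng.standard_normal((20, k)))
R_other = R @ G @ G.T
err_other = np.linalg.norm(R - R_other, 'fro')
print("Random rank-k projection error:", err_other)
```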

Pedagogical Significance#

Shows direct link from SVD to practical factor models.

References#

  1. Koren, Bell, Volinsky (2009). Matrix factorization techniques for recommender systems.

  2. Candès & Recht (2009). Exact matrix completion via convex optimization.

Solution (Python)#

import numpy as np

np.random.seed(2)
u, i, k = 80, 60, 5
U_true = np.random.randn(u, k)
V_true = np.random.randn(i, k)
R = U_true @ V_true.T + 0.1 * np.random.randn(u, i)

U, S, Vt = np.linalg.svd(R, full_matrices=False)
Rk = (U[:, :k] * S[:k]) @ Vt[:k]
err = np.linalg.norm(R - Rk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-5))

Worked Example 4: Moore–Penrose pseudoinverse — minimal-norm solutions#

Introduction#

Solve $Ax=b$ when $A$ is rectangular or rank-deficient; verify $AA^+A=A$ and minimal norm among all solutions.

Purpose#

Provide a robust recipe for under-/overdetermined systems.

Importance#

Avoids fragile inverses and clarifies the solution geometry.

What this example demonstrates#

  • $x^*=A^+ b$ minimizes $\lVert x\rVert_2$ subject to $Ax=b$ for consistent systems.

  • Penrose conditions hold numerically.

Background#

Pseudoinverse defined via SVD; used in control, signal processing, ML.

Historical context#

Penrose (1955) established the four defining equations.

Prevalence in ML#

Closed-form layers, analytic baselines, and data-fitting routines.

Notes#

  • Use SVD-backed implementations; threshold small singular values.

Connection to ML#

Stable baselines and analytic steps inside pipelines.

Connection to Linear Algebra Theory#

$A^+A$ projects onto $\text{row}(A)$ and $AA^+$ onto $\text{col}(A)$; the minimal-norm solution has no component in $\text{null}(A)$.
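The projector interpretation can be verified with `np.linalg.pinv`. This is a sketch on a synthetic rank-3 matrix; the explicit cutoff `rcond=1e-10` is an assumption chosen so that round-off-level singular values are not inverted.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))   # 6 x 8, rank 3
A_plus = np.linalg.pinv(A, rcond=1e-10)

P_row = A_plus @ A      # 8 x 8, projects onto row(A)
P_col = A @ A_plus      # 6 x 6, projects onto col(A)

for name, P in [("A^+ A", P_row), ("A A^+", P_col)]:
    print(name, "symmetric:", np.allclose(P, P.T), " idempotent:", np.allclose(P @ P, P),
          " trace (= rank):", round(float(np.trace(P)), 6))

# Any x splits into a row-space part and a null-space part; A ignores the latter
x = rng.standard_normal(8)
x_null = x - P_row @ x
print("||A x_null|| =", np.linalg.norm(A @ x_null))
```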

Pedagogical Significance#

Bridges algebraic definition to numerical practice.

References#

  1. Penrose, R. (1955). A generalized inverse for matrices.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

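np.random.seed(3)
m, d = 10, 12
A = np.random.randn(m, d)
# Make A rank-deficient with a dependent row (rank drops from m to m-1).
# A dependent column alone would not lower the rank here, since d > m.
A[0, :] = A[1, :] + A[2, :]
x0 = np.random.randn(d)
b = A @ x0  # consistent right-hand side, so exact solutions exist

U, S, Vt = np.linalg.svd(A, full_matrices=False)
keep = S > 1e-10
S_inv = np.zeros_like(S)
S_inv[keep] = 1.0 / S[keep]  # invert only the nonzero singular values
A_plus = Vt.T @ np.diag(S_inv) @ U.T

x_star = A_plus @ b
print("rank(A):", int(keep.sum()))
print("Penrose A A^+ A ~ A?", np.allclose(A @ (A_plus @ A), A, atol=1e-8))
print("Residual ||Ax-b||:", np.linalg.norm(A @ x_star - b))
print("||x_star||2:", np.linalg.norm(x_star), " <= ||x0||2:", np.linalg.norm(x0))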

Worked Example 5: Rank of attention scores QK^T (expressivity bound)#

Introduction#

Show that the attention score matrix $S=QK^\top$ has rank at most $\min(n, d_k)$ and explore implications for head dimension.

Purpose#

Connect feature dimension to expressivity through rank bounds.

Importance#

Head size choices affect the diversity of attention patterns.

What this example demonstrates#

  • For $Q\in\mathbb{R}^{n\times d_k}$ and $K\in\mathbb{R}^{n\times d_k}$, $\operatorname{rank}(QK^\top) \le \min(n, d_k)$.

Background#

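Rank of a product is bounded by the inner dimension; scaling the scores by $1/\sqrt{d_k}$ preserves rank, whereas the subsequent row-wise softmax is nonlinear and is not subject to this bound.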
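The bound applies to the raw and scaled scores, and it is worth seeing what happens after a row-wise softmax. The sketch below uses random `Q` and `K` with illustrative sizes; the softmax is written out by hand and is an addition for illustration, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(8)
n, dk = 32, 8
Q = rng.standard_normal((n, dk))
K = rng.standard_normal((n, dk))

scores = Q @ K.T / np.sqrt(dk)                           # scaling by a nonzero constant keeps the rank
shifted = scores - scores.max(axis=1, keepdims=True)     # subtract row maxima for numerical stability
weights = np.exp(shifted)
weights /= weights.sum(axis=1, keepdims=True)            # row-wise softmax

rank_raw = np.linalg.matrix_rank(Q @ K.T)
rank_scaled = np.linalg.matrix_rank(scores)
rank_softmax = np.linalg.matrix_rank(weights)
print("rank(QK^T) =", rank_raw, " rank(scaled) =", rank_scaled, " rank(softmax) =", rank_softmax)
```

Because the exponential acts elementwise, the softmax output is not a product with inner dimension $d_k$, so its rank can exceed $d_k$ even though the pre-softmax scores cannot.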

Historical context#

Transformers leverage multiple heads to increase effective rank/expressivity.

Prevalence in ML#

All transformer models; multi-head concatenation increases representational capacity.

Notes#

  • Multi-head attention can be seen as block structures that raise overall rank after concatenation.
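The block-structure view can be sketched in a simplified, linear setting. The code below ignores the softmax and the value/output projections of real attention layers (a deliberate simplification), and only shows that concatenating heads along the feature axis raises the rank bound from $d_k$ to $H d_k$; `heads`, `n`, and `dk` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
n, dk, heads = 32, 4, 3

Qs = [rng.standard_normal((n, dk)) for _ in range(heads)]
Ks = [rng.standard_normal((n, dk)) for _ in range(heads)]

per_head_ranks = [int(np.linalg.matrix_rank(Q @ K.T)) for Q, K in zip(Qs, Ks)]

# Concatenating heads along the feature axis: Q_cat K_cat^T equals the sum of per-head scores
Q_cat = np.concatenate(Qs, axis=1)          # n x (heads * dk)
K_cat = np.concatenate(Ks, axis=1)
S_sum = sum(Q @ K.T for Q, K in zip(Qs, Ks))
rank_sum = int(np.linalg.matrix_rank(S_sum))

print("per-head ranks:", per_head_ranks)
print("rank of summed scores:", rank_sum, " bound:", min(n, heads * dk))
```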

Connection to ML#

Guides architecture design (choosing $d_k$ and number of heads).

Connection to Linear Algebra Theory#

Rank bounds and product properties.

Pedagogical Significance#

Links a practical hyperparameter to a crisp linear algebra bound.

References#

  1. Vaswani, A. et al. (2017). Attention Is All You Need.

  2. Devlin, J. et al. (2019). BERT.

Solution (Python)#

import numpy as np

np.random.seed(4)
for n, dk in [(32, 8), (32, 32), (64, 16)]:
    Q = np.random.randn(n, dk)
    K = np.random.randn(n, dk)
    S = Q @ K.T
    r = np.linalg.matrix_rank(S)
    print(f"n={n}, d_k={dk}, rank(S)={r}, bound={min(n, dk)}")

Chapter 6
Orthogonality & Projections
Key ideas: Introduction

Introduction#

Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:

  • Residuals in least squares are orthogonal to the column space (no further decrease possible within subspace)

  • Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace

  • Orthonormal bases simplify computations and improve numerical stability

  • Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers

  • PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation

Important ideas#

  1. Orthogonality and complements

    • $x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.

  2. Orthogonal projectors

    • An orthogonal projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For orthonormal $U\in\mathbb{R}^{d\times k}$ spanning $\mathcal{S}$: $P=UU^\top$.

  3. Projection theorem

    • For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $r\in\mathcal{S}^\perp$ that minimizes $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.

  4. Pythagorean identity

    • If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.

  5. Orthonormal bases and QR

    • Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.

  6. Spectral/SVD structure

    • For symmetric $\Sigma$, eigenvectors can be chosen orthonormal; SVD gives $X=U\Sigma V^\top$ with orthonormal columns in $U$ and $V$ (here $\Sigma$ holds the singular values, not a covariance). Truncation yields best rank-$k$ approximation (Eckart–Young).

  7. Orthogonal transformations

    • $Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; determinants $\pm1$ (rotations or reflections). 2-norm condition numbers are unchanged: $\kappa_2(QA)=\kappa_2(A)$.
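The invariances in item 7 are easy to confirm numerically. This is a minimal sketch that draws a random orthogonal matrix from the QR factorization of a Gaussian matrix; the dimension `d` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal matrix
A = rng.standard_normal((d, d))
x, y = rng.standard_normal(d), rng.standard_normal(d)

det_Q = np.linalg.det(Q)
cond_A = np.linalg.cond(A)
cond_QA = np.linalg.cond(Q @ A)

print("Q^T Q = I?", np.allclose(Q.T @ Q, np.eye(d)))
print("norm preserved?", np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))
print("inner product preserved?", np.isclose((Q @ x) @ (Q @ y), x @ y))
print("det(Q) =", round(float(det_Q), 6))
print("cond(A) =", cond_A, " cond(QA) =", cond_QA)
```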

Relevance to ML#

  • Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.

  • PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.

  • Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.

  • Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.

  • Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.

  • Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).

Algorithmic development (milestones)#

  • 1900s–1930s: Gram–Schmidt orthonormalization; least squares geometry formalized.

  • 1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).

  • 1958–1965: Householder reflections and Golub’s QR algorithms stabilize orthogonalization.

  • 1966: Orthogonal Procrustes (Schönemann) closed-form solution.

  • 1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.

  • 2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.

Definitions#

  • Orthogonal/Orthonormal: columns $u_i$ of $U$ are orthogonal if $u_i^\top u_j=0$ for $i\ne j$, and orthonormal if they also have unit length, i.e. $U^\top U=I$.

  • Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.

  • Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.

  • Orthogonal matrix: square $Q$ with $Q^\top Q=QQ^\top=I$; preserves norms and inner products.

  • PCA subspace: top-$k$ eigenvectors of covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems)#

  • Projection theorem: For closed subspace $\mathcal{S}$, projection $P_\mathcal{S}x$ uniquely minimizes $\lVert x-s\rVert_2$; residual is orthogonal to $\mathcal{S}$.

  • Pythagorean/Bessel/Parseval: Orthogonal decompositions preserve squared norms; partial sums bounded (Bessel); complete bases preserve energy (Parseval).

  • Fundamental theorem of linear algebra: $\text{col}(A)$ is orthogonal to $\text{null}(A^\top)$; $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$.

  • Spectral theorem: Symmetric matrices have orthonormal eigenbases; $A = Q\Lambda Q^\top$, so $Q^\top A Q=\Lambda$ is diagonal.

  • Eckart–Young–Mirsky: Best rank-$k$ approximation in Frobenius/2-norm via truncated SVD.

Applied (landmark systems and practices)#

  • PCA/whitening: Jolliffe (2002); Shlens (2014) — denoising and compression.

  • Least squares/QR solvers: Golub–Van Loan (2013) — stable projections.

  • Orthogonal Procrustes in embedding alignment: Schönemann (1966); Smith et al. (2017).

  • Orthogonal initialization/constraints: Saxe et al. (2013); Mishkin & Matas (2015).

  • Subspace tracking and signal processing: Halko et al. (2011) randomized SVD.

Key ideas: Where it shows up
  1. PCA and subspace denoising

  • PCA finds orthonormal directions $V_k$ maximizing variance; projection $X_k = X V_k V_k^\top$ minimizes reconstruction error.

  • Achievements: Dimensionality reduction at scale; whitening and denoising in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. Least squares as projection

  • $\hat{y} = X w^*$ is the projection of $y$ onto $\text{col}(X)$; residual $r=y-\hat{y}$ satisfies $X^\top r=0$.

  • Achievements: Foundational to regression and linear models; efficient via QR/SVD. References: Gauss 1809; Golub–Van Loan 2013.

  3. Orthogonalization algorithms (QR)

  • Householder/Modified Gram–Schmidt produce orthonormal bases with numerical stability; essential in solvers and factorizations.

  • Achievements: Robust, high-performance linear algebra libraries (LAPACK). References: Householder 1958; Golub 1965; Trefethen–Bau 1997.

  4. Orthogonal Procrustes and embedding alignment

  • Best orthogonal alignment between representation spaces via SVD of $A^\top B$ (solution $R=UV^\top$).

  • Achievements: Cross-lingual word embedding alignment; domain adaptation. References: Schönemann 1966; Smith et al. 2017.

  5. Orthogonal constraints/initialization in deep nets

  • Orthogonal weight matrices preserve variance across layers; improve training stability and gradient flow.

  • Achievements: Deep linear dynamics analysis; practical initializations. References: Saxe et al. 2013; Mishkin & Matas 2015.

Notation
  • Data matrix and spaces: $X\in\mathbb{R}^{n\times d}$, $\text{col}(X)\subseteq\mathbb{R}^n$, $\text{null}(X^\top)$.

  • Orthonormal basis: $U\in\mathbb{R}^{n\times k}$ with $U^\top U=I$.

  • Orthogonal projector: $P=UU^\top$ (symmetric, idempotent); residual $r=(I-P)y$ satisfies $U^\top r=0$.

  • QR factorization: $X=QR$ with $Q^\top Q=I$; $Q$ spans $\text{col}(X)$.

  • SVD/PCA: $X=U\Sigma V^\top$; top-$k$ projection $P_k=U_k U_k^\top$ (or $X V_k V_k^\top$ on features).

  • Examples:

    • Least squares via projection: $\hat{y} = P y$ with $P=Q Q^\top$ for $Q$ from QR of $X$.

    • PCA reconstruction: $\hat{X} = X V_k V_k^\top$; error $\lVert X-\hat{X}\rVert_F^2 = \sum_{i>k}\sigma_i^2$.

    • Procrustes alignment: $R=UV^\top$ from SVD of $A^\top B$; $R$ is orthogonal.

Pitfalls & sanity checks
  • Centering for PCA: use $X_c$ to ensure principal directions capture variance, not mean.

  • Orthogonality of bases: $U$ must be orthonormal for $P=UU^\top$ to be an orthogonal projector; for a general full-column-rank basis $B$, $BB^\top$ is not idempotent and the projector is $B(B^\top B)^{-1}B^\top$.

  • Numerical orthogonality: prefer QR/SVD; classical Gram–Schmidt can lose orthogonality under ill-conditioning.

  • Certificates: verify $P$ is symmetric/idempotent and that residuals are orthogonal to $\text{col}(X)$.

  • Overfitting with high-$k$ PCA: track retained variance and use validation.
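The sanity checks on projectors can be scripted. In this sketch `B` is a random, non-orthonormal basis (an assumption for illustration); it compares the naive product $BB^\top$ with the general projector formula and with $QQ^\top$ from a QR factorization.

```python
import numpy as np

rng = np.random.default_rng(11)
n, k = 10, 3
B = rng.standard_normal((n, k))                  # a basis that is NOT orthonormal

P_naive = B @ B.T                                # not a projector in general
P_general = B @ np.linalg.solve(B.T @ B, B.T)    # B (B^T B)^{-1} B^T
Q, _ = np.linalg.qr(B)                           # orthonormal basis for the same subspace
P_qr = Q @ Q.T

print("naive idempotent?  ", np.allclose(P_naive @ P_naive, P_naive))
print("general idempotent?", np.allclose(P_general @ P_general, P_general))
print("general == QQ^T?   ", np.allclose(P_general, P_qr))

y = rng.standard_normal(n)
r = y - P_qr @ y
print("||B^T r|| =", np.linalg.norm(B.T @ r))    # residual is orthogonal to the subspace
```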

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Projections, orthogonality, and approximation

  4. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

  5. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.

  6. Gram, J. (1883); Schmidt, E. (1907). Orthonormalization methods.

PCA and applications

  7. Jolliffe, I. (2002). Principal Component Analysis.

  8. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

Embedding alignment and orthogonal methods in ML

  9. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.

  10. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

  11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.

  12. Mishkin, D., & Matas, J. (2015). All you need is a good init.

General ML texts

  13. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  14. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

Five worked examples

Worked Example 1: Least squares as orthogonal projection (QR certificate)#

Introduction#

Show that least squares fits correspond to orthogonal projection of $y$ onto $\text{col}(X)$, with residual orthogonal to features.

Purpose#

Derive $\hat{y}=P y$ with $P=Q Q^\top$ and verify $X^\top r=0$ numerically.

Importance#

Anchors regression in subspace geometry; provides robust implementation guidance via QR.

What this example demonstrates#

  • $X=QR$ with $Q^\top Q=I$ yields $\hat{y}=QQ^\top y$.

  • Residual $r=y-\hat{y}$ satisfies $Q^\top r=0$ and $X^\top r=0$.

Background#

Least squares minimizes squared error; projection theorem assures unique closest point in $\text{col}(X)$.

Historical context#

Gauss/Legendre least squares; Householder/Golub QR for numerical stability.

Prevalence in ML#

Linear models, GLM approximations, and as inner loops in larger systems.

Notes#

  • Prefer QR/SVD over normal equations.

  • Check $P$ is symmetric and idempotent in code.
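The first note can be motivated by comparing condition numbers. This is a sketch on a well-conditioned random design (so both routes agree); the point is only that forming $X^\top X$ squares the 2-norm condition number, which is what makes the normal equations fragile on harder problems.

```python
import numpy as np

rng = np.random.default_rng(12)
n, d = 40, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# QR route: X = QR, so the normal equations reduce to the triangular system R w = Q^T y
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)   # a dedicated triangular solver would exploit R's structure

# Normal-equations route works with X^T X, whose condition number is cond(X)^2
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

cond_X = np.linalg.cond(X)
cond_gram = np.linalg.cond(X.T @ X)
print("cond(X)      =", cond_X)
print("cond(X)^2    =", cond_X ** 2)
print("cond(X^T X)  =", cond_gram)
print("solutions agree here?", np.allclose(w_qr, w_ne))
```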

Connection to ML#

Core of regression pipelines; basis for Ridge/Lasso solvers (with modifications).

Connection to Linear Algebra Theory#

Projection theorem; FTLA decomposition $\mathbb{R}^n=\text{col}(X)\oplus\text{null}(X^\top)$.

Pedagogical Significance#

Gives a geometric certificate of optimality via orthogonality.

References#

  1. Gauss (1809); Legendre (1805) — least squares.

  2. Golub & Van Loan (2013) — QR solvers.

  3. Trefethen & Bau (1997) — numerical linear algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 5
X = np.random.randn(n, d)
w_true = np.array([1.2, -0.8, 0.5, 0.0, 2.0])
y = X @ w_true + 0.1 * np.random.randn(n)

Q, R = np.linalg.qr(X)
P = Q @ Q.T
y_hat = P @ y
r = y - y_hat

# Certificates
print("Symmetric P?", np.allclose(P, P.T, atol=1e-10))
print("Idempotent P?", np.allclose(P @ P, P, atol=1e-10))
print("Q^T r ~ 0?", np.linalg.norm(Q.T @ r))
print("X^T r ~ 0?", np.linalg.norm(X.T @ r))

# Compare to lstsq fit
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Projection match?", np.allclose(y_hat, X @ w_ls, atol=1e-8))

Worked Example 2: PCA projection and best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate orthogonal projection onto top-$k$ principal components and verify reconstruction error equals the sum of squared tail singular values.

Purpose#

Connect PCA’s orthogonal subspace to optimal low-rank approximation.

Importance#

Backbone of dimensionality reduction and denoising in ML.

What this example demonstrates#

  • $X=U\Sigma V^\top$; projection to rank-$k$ is $X_k = U_k \Sigma_k V_k^\top = X V_k V_k^\top$.

  • Error: $\lVert X-X_k\rVert_F^2 = \sum_{i>k} \sigma_i^2$.

Background#

Eckart–Young shows truncated SVD minimizes Frobenius/2-norm error among rank-$k$ matrices.

Historical context#

Low-rank approximation dates to the 1930s; widespread modern use in ML systems.

Prevalence in ML#

Feature compression, noise removal, approximate nearest neighbors, latent semantic analysis.

Notes#

  • Center data for covariance-based PCA; use SVD directly on $X_c$.

Connection to ML#

Trade off between compression (smaller $k$) and fidelity (retained variance).
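One common way to navigate this trade-off is to pick the smallest $k$ whose cumulative retained variance reaches a target. The sketch below assumes a synthetic data matrix and an arbitrary 95% target; the threshold is illustrative, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(13)
n, d = 100, 20
X = rng.standard_normal((n, d)) @ np.diag(np.linspace(5, 0.1, d))   # decaying column scales
Xc = X - X.mean(axis=0, keepdims=True)

S = np.linalg.svd(Xc, compute_uv=False)
var_ratio = S**2 / np.sum(S**2)            # fraction of total variance per component
cumulative = np.cumsum(var_ratio)

target = 0.95                              # illustrative threshold
# searchsorted returns the first index whose cumulative value is >= target
k = int(np.searchsorted(cumulative, target)) + 1
print("cumulative retained variance:", np.round(cumulative[:8], 3))
print(f"smallest k retaining {target:.0%} of variance: {k}")
```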

Connection to Linear Algebra Theory#

Orthogonal projectors $U_k U_k^\top$; spectral ordering of singular values.

Pedagogical Significance#

Illustrates how orthogonality yields optimality guarantees.

References#

  1. Eckart & Young (1936) — best rank-$k$.

  2. Jolliffe (2002) — PCA.

  3. Shlens (2014) — PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 80, 30, 5
X = np.random.randn(n, d) @ np.diag(np.linspace(5, 0.1, d))  # create decaying spectrum
Xc = X - X.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T
Xk = Xc @ Vk @ Vk.T

err = np.linalg.norm(Xc - Xk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-6))

Worked Example 3: Gram–Schmidt vs Householder QR (orthogonality under stress)#

Introduction#

Compare classical Gram–Schmidt to numerically stable QR on nearly collinear vectors.

Purpose#

Show why stable orthogonalization matters when projecting in high dimensions.

Importance#

Precision loss destroys orthogonality and degrades projections/solvers.

What this example demonstrates#

  • Classical GS loses orthogonality; QR (Householder) maintains $Q^\top Q\approx I$.

Background#

Modified GS improves stability, but Householder QR is preferred in libraries.
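The difference between the classical and modified variants is a one-line change in where the projection coefficient is computed. The sketch below uses a hypothetical helper `gram_schmidt` with a `modified` flag and a small Läuchli-type test matrix (a standard ill-conditioned example, chosen here as an assumption) on which the gap is easy to see.

```python
import numpy as np

def gram_schmidt(A, modified):
    """Orthonormalize the columns of A by classical (modified=False) or modified Gram-Schmidt."""
    A = np.asarray(A, dtype=float)
    n, d = A.shape
    Q = np.zeros((n, d))
    for j in range(d):
        v = A[:, j].copy()
        for i in range(j):
            # Classical GS takes coefficients from the original column;
            # modified GS takes them from the partially orthogonalized remainder v.
            coeff = Q[:, i] @ (v if modified else A[:, j])
            v -= coeff * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

def orth_error(Q):
    """Frobenius norm of Q^T Q - I."""
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))

eps = 1e-8   # small enough that 1 + eps**2 rounds to 1 in double precision
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])

err_cgs = orth_error(gram_schmidt(A, modified=False))
err_mgs = orth_error(gram_schmidt(A, modified=True))
err_qr = orth_error(np.linalg.qr(A)[0])
print("classical GS:", err_cgs)
print("modified GS: ", err_mgs)
print("np.linalg.qr:", err_qr)
```

On this matrix classical Gram–Schmidt produces two columns that are far from orthogonal, modified Gram–Schmidt keeps the error near `eps`, and the library QR stays near machine precision.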

Historical context#

Stability advancements from Gram–Schmidt to Householder underpin modern LAPACK.

Prevalence in ML#

Everywhere orthogonalization is needed: least squares, PCA, subspace tracking.

Notes#

  • Measure orthogonality via $\lVert Q^\top Q - I\rVert$.

Connection to ML#

Reliable projections and decompositions => reliable models.

Connection to Linear Algebra Theory#

Orthogonality preservation and rounding error analysis.

Pedagogical Significance#

Demonstrates the gap between algebraic identities and floating-point realities.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 40, 8
X = np.random.randn(n, d)
X[:, 1] = X[:, 0] + 1e-6 * np.random.randn(n)  # near collinearity

# Classical Gram–Schmidt
def classical_gs(A):
    A = A.copy().astype(float)
    n, d = A.shape
    Q = np.zeros_like(A)
    for j in range(d):
        v = A[:, j]
        for i in range(j):
            # Classical GS: coefficients use the original column A[:, j]
            v = v - Q[:, i] * (Q[:, i].T @ A[:, j])
        Q[:, j] = v / (np.linalg.norm(v) + 1e-18)
    return Q

Q_gs = classical_gs(X)
Q_qr, _ = np.linalg.qr(X)

orth_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(d))
orth_qr = np.linalg.norm(Q_qr.T @ Q_qr - np.eye(d))
print("||Q^TQ - I|| (GS)", orth_gs)
print("||Q^TQ - I|| (QR)", orth_qr)
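
The Background remark that modified Gram–Schmidt improves stability can be illustrated with a minimal sketch. The function name `modified_gs` and the test matrix are illustrative choices, not part of the original solution; the only change from the classical variant is that each projection uses the running vector `v` rather than the original column.

```python
import numpy as np

def modified_gs(A):
    """Modified Gram-Schmidt: subtract each projection from the running vector."""
    A = np.asarray(A, dtype=float)
    n, d = A.shape
    Q = np.zeros((n, d))
    for j in range(d):
        v = A[:, j].copy()
        for i in range(j):
            # Project the *updated* v (classical GS would use A[:, j] here)
            v -= Q[:, i] * (Q[:, i] @ v)
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# Hypothetical stress test: two nearly collinear columns
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
X[:, 1] = X[:, 0] + 1e-6 * rng.standard_normal(40)

Q_mgs = modified_gs(X)
orth_mgs = np.linalg.norm(Q_mgs.T @ Q_mgs - np.eye(8))
print("||Q^TQ - I|| (modified GS)", orth_mgs)
```

On inputs like this one, the orthogonality error of modified GS typically falls between that of classical GS and Householder QR.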

Worked Example 4: Orthogonal Procrustes — aligning embeddings via SVD#

Introduction#

Find the orthogonal matrix $R$ that best aligns $A$ to $B$ by minimizing $\lVert AR - B\rVert_F$.

Purpose#

Show closed-form solution $R=UV^\top$ from SVD of $A^\top B$ and connect to embedding alignment.

Importance#

Stable alignment across domains/languages without distorting geometry.

What this example demonstrates#

  • If $A^\top B = U\Sigma V^\top$, the optimal orthogonal $R=UV^\top$.

Background#

Procrustes problems arise in shape analysis and representation alignment.

Historical context#

Schönemann (1966) established the orthogonal solution; widely used afterward.

Prevalence in ML#

Cross-lingual word embeddings and domain adaptation pipelines.

Notes#

  • Center and scale if appropriate; enforce $\det(R)=+1$ for rotation-only alignment (optional).
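
The optional rotation-only constraint mentioned above can be sketched as follows. This assumes the standard sign correction on the last singular direction (as used in the Kabsch algorithm); the function name `procrustes_rotation` is hypothetical.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R with det(R) = +1 minimizing ||A R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    D = np.eye(U.shape[0])
    # Flip the last singular direction if U V^T would be a reflection
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))
    return U @ D @ Vt

# Illustrative data: B is an exact rotation of A
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))
if np.linalg.det(Q_true) < 0:
    Q_true[:, 0] *= -1          # make the ground truth a proper rotation
B = A @ Q_true

R = procrustes_rotation(A, B)
print("det(R):", np.linalg.det(R))
print("recovered rotation?", np.allclose(R, Q_true, atol=1e-8))
```

When $UV^\top$ already has determinant $+1$, the correction matrix is the identity and the result coincides with the unconstrained solution.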

Connection to ML#

Enables mapping between independently trained embedding spaces.

Connection to Linear Algebra Theory#

Orthogonal transformations preserve inner products; SVD reveals optimal rotation/reflection.

Pedagogical Significance#

Bridges an optimization problem to a single SVD call.

References#

  1. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.

  2. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 16
A = np.random.randn(n, d)
Q, _ = np.linalg.qr(np.random.randn(d, d))  # true orthogonal map
B = A @ Q + 0.01 * np.random.randn(n, d)

M = A.T @ B
U, S, Vt = np.linalg.svd(M)
R = U @ Vt

err = np.linalg.norm(A @ R - B, 'fro')
print("Alignment error:", round(err, 4))
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d), atol=1e-8))

Worked Example 5: Householder reflections — building orthogonal projectors#

Introduction#

Construct a Householder reflection that zeros every component of a vector except the first, verify that it is orthogonal and symmetric, and connect it to how QR builds $Q$.

Purpose#

Expose a basic orthogonal transformation used to construct $Q$ in QR.

Importance#

Underpins numerically stable orthogonalization in solvers and projections.

What this example demonstrates#

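  • For a unit vector $u$, $H=I-2uu^\top$ is orthogonal and symmetric; with $u \propto x + \operatorname{sign}(x_1)\lVert x\rVert_2 e_1$, $Hx$ is zero in every component except the first.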

Background#

Householder reflections are the workhorse of QR; compose reflections to build $Q$.
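
To make "compose reflections to build $Q$" concrete, here is a minimal, unoptimized sketch of Householder QR. The function name `householder_qr` is illustrative; library routines (LAPACK) store the reflectors compactly rather than forming $Q$ explicitly as done here.

```python
import numpy as np

def householder_qr(A):
    """Return Q (m x m, orthogonal) and R (m x n, upper triangular) with A = Q R."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for j in range(min(m - 1, n)):
        x = R[j:, j]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue                      # column already zero below the diagonal
        v = x.copy()
        v[0] += np.copysign(normx, x[0])  # sign choice avoids cancellation
        u = v / np.linalg.norm(v)
        # Apply H = I - 2 u u^T to the trailing rows of R
        R[j:, :] -= 2.0 * np.outer(u, u @ R[j:, :])
        # Accumulate Q <- Q H so that Q = H_1 H_2 ... H_k
        Q[:, j:] -= 2.0 * np.outer(Q[:, j:] @ u, u)
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = householder_qr(A)
print("A = QR?", np.allclose(Q @ R, A))
print("Q orthogonal?", np.allclose(Q.T @ Q, np.eye(6)))
```

Each pass zeros one column below the diagonal, so $R$ is upper triangular after at most $\min(m-1, n)$ reflections.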

Historical context#

Householder (1958) introduced the approach; remains standard.

Prevalence in ML#

Appears indirectly via libraries (NumPy/SciPy/LAPACK) that power ML pipelines.

Notes#

  • Stable and efficient vs. naive orthogonalization in finite precision.

Connection to ML#

Reliable QR leads to reliable least squares, PCA, and projection-based models.

Connection to Linear Algebra Theory#

Reflections generate orthogonal groups; preserve lengths and angles.

Pedagogical Significance#

Shows a concrete, constructive way to obtain orthogonal maps.

References#

  1. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(4)
d = 6
x = np.random.randn(d)
e1 = np.zeros(d); e1[0] = 1.0
v = x + np.copysign(np.linalg.norm(x), x[0]) * e1  # copysign avoids sign(0) = 0
u = v / (np.linalg.norm(v) + 1e-18)
H = np.eye(d) - 2 * np.outer(u, u)

Hx = H @ x
print("H orthogonal?", np.allclose(H.T @ H, np.eye(d), atol=1e-10))
print("H symmetric?", np.allclose(H, H.T, atol=1e-10))
print("Zeroed tail?", np.allclose(Hx[1:], 0.0, atol=1e-8))

Chapter 5
Inner Products & Norms
Key ideas: Introduction

Introduction#

Inner products and norms provide the geometry for data and models:

  • Similarity via inner products $\langle x, y\rangle$ and cosine $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$

  • Size and distance via norms $\lVert x\rVert$ and induced metrics $d(x,y) = \lVert x-y\rVert$

  • Orthogonality ($\langle x, y\rangle = 0$) and projections onto subspaces

  • Positive semidefinite (PSD) Gram matrices and kernels driving SVMs/GPs

  • Stability and regularization via $\ell_2$ (Ridge) and $\ell_1$ (Lasso) penalties

  • Scaled dot-product attention uses many inner products and a normalization factor $1/\sqrt{d}$

Important ideas#

  1. Inner product axioms and induced norms

    • An inner product $\langle x, y\rangle$ on $\mathbb{R}^d$ is symmetric, bilinear, and positive definite; the induced norm is $\lVert x\rVert = \sqrt{\langle x, x\rangle}$.

  2. Cauchy–Schwarz and cosine similarity

    • $$\big|\langle x, y\rangle\big| \le \lVert x\rVert\,\lVert y\rVert$$

    • For nonzero $x, y$ this bound keeps the ratio in $[-1, 1]$, so it defines the angle via $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$.

  3. Triangle inequality and Minkowski/Hölder

    • For $p\in[1,\infty]$, $\lVert x+y\rVert_p \le \lVert x\rVert_p + \lVert y\rVert_p$; Hölder duality connects $p$ and $q$ with $1/p+1/q=1$.

  4. Dual norms and bounds

    • The dual norm is $\lVert z\rVert_* = \sup_{\lVert x\rVert\le 1} \langle z, x\rangle$; e.g., dual of $\ell_1$ is $\ell_\infty$, dual of $\ell_2$ is $\ell_2$.

  5. Orthogonality, orthonormal bases, and projections

    • If $U\in\mathbb{R}^{d\times k}$ has orthonormal columns, the orthogonal projector is $P = UU^\top$, minimizing reconstruction error.

  6. Gram matrices, PSD, and kernels

    • For data matrix $X\in\mathbb{R}^{n\times d}$, $G=X X^\top$ has entries $G_{ij}=\langle x_i, x_j\rangle$ and is PSD. Kernel matrices generalize this to $K_{ij}=k(x_i,x_j)$.

  7. Mahalanobis norms

    • For SPD $M\succ 0$, $\lVert x\rVert_M = \sqrt{x^\top M x}$ reweights geometry (whitening, metric learning).

  8. Norm-induced stability

    • Lipschitz constants, gradient clipping, and regularization costs all depend on norms.
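
The inequalities and the dual-norm statement in items 2 to 4 can be checked numerically. The snippet below is a small sanity check on random vectors (an illustration, not a proof); the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)

# Cauchy-Schwarz: |<x, z>| <= ||x||_2 ||z||_2
cs_ok = abs(x @ z) <= np.linalg.norm(x) * np.linalg.norm(z)

# Hoelder with p = 1, q = inf: |<x, z>| <= ||x||_1 ||z||_inf
holder_ok = abs(x @ z) <= np.linalg.norm(x, 1) * np.linalg.norm(z, np.inf)

# Dual of l1 is l_inf: the sup over the l1 unit ball is attained at a
# signed standard basis vector picking the largest |z_i|
i = np.argmax(np.abs(z))
x_star = np.zeros(5)
x_star[i] = np.sign(z[i])
dual_val = z @ x_star

print("Cauchy-Schwarz holds:", cs_ok)
print("Hoelder (1, inf) holds:", holder_ok)
print("sup over l1 ball equals ||z||_inf:", np.isclose(dual_val, np.linalg.norm(z, np.inf)))
```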

Relevance to ML#

  • Similarity search: cosine similarity is the standard for embeddings (IR, recommendation, retrieval, metric learning).

  • Regularization: $\ell_2$ (weight decay) controls scale; $\ell_1$ encourages sparsity.

  • Optimization: gradient norms determine step sizes; clipping prevents exploding gradients.

  • Kernels: SVMs, GPs rely on PSD Gram matrices of inner products.

  • Attention: scaled dot-products stabilize softmax logits as dimension grows.

  • PCA/covariance: variance equals squared $\ell_2$ norm along directions; orthogonal projections minimize $\ell_2$ error.

Algorithmic development (select milestones)#

  • 1850s–1900s: Euclidean geometry formalized; Cauchy–Schwarz inequality.

  • 1909: Mercer’s theorem (PSD kernels); foundations of kernel methods.

  • 1950: Aronszajn formalizes RKHS; inner products in function spaces.

  • 1960s–1970s: Robust norms (Huber); convex analysis; optimization bounds.

  • 1995: SVMs (Cortes–Vapnik) with kernel trick.

  • 2013–2015: Word2Vec, GloVe popularize cosine similarity in embeddings.

  • 2015–2016: BatchNorm/LayerNorm normalize activations (variance/norm control).

  • 2017: Scaled dot-product attention (Transformers) stabilizes inner-product logits.

  • 2020: Contrastive learning (SimCLR) uses normalized cosine objectives.

Definitions#

  • Inner product: $\langle x, y\rangle = x^\top y$ (standard), or weighted $\langle x, y\rangle_M = x^\top M y$ with $M\succ 0$.

  • Induced norm: $\lVert x\rVert = \sqrt{\langle x, x\rangle}$; $\ell_p$ norms: $\lVert x\rVert_1=\sum_i|x_i|$, $\lVert x\rVert_2=\sqrt{\sum_i x_i^2}$, $\lVert x\rVert_\infty=\max_i |x_i|$.

  • Cosine similarity: $\cos\theta(x,y) = \dfrac{\langle x,y\rangle}{\lVert x\rVert\,\lVert y\rVert}$.

  • Orthogonality: $\langle x, y\rangle = 0$; orthonormal set: $\langle u_i, u_j\rangle = \delta_{ij}$.

  • Gram matrix: $G_{ij}=\langle x_i, x_j\rangle$; PSD: $z^\top G z \ge 0$ $\forall z$.

  • Kernel: $k(x,y)=\langle \phi(x), \phi(y)\rangle$; $K_{ij}=k(x_i,x_j)$ is PSD.

  • Mahalanobis norm: $\lVert x\rVert_M = \sqrt{x^\top M x}$ with $M\succ 0$.
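
As a small illustration of the Mahalanobis norm, the sketch below uses a Cholesky factor $M = LL^\top$ to show that $\lVert x\rVert_M$ is an ordinary $\ell_2$ norm after the linear change of variables $x \mapsto L^\top x$. The matrix `M` here is an arbitrary SPD matrix built for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
M = B.T @ B + np.eye(d)          # SPD by construction (illustrative choice)
x = rng.standard_normal(d)

norm_M = np.sqrt(x @ M @ x)      # definition: sqrt(x^T M x)

L = np.linalg.cholesky(M)        # M = L L^T, L lower triangular
norm_via_chol = np.linalg.norm(L.T @ x)

print("||x||_M          :", norm_M)
print("||L^T x||_2      :", norm_via_chol)
```

The two values agree because $x^\top M x = x^\top L L^\top x = \lVert L^\top x\rVert_2^2$.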

Essential vs Optional: Theoretical ML

Theoretical (essential theorems and tools)#

  • Cauchy–Schwarz: $$\big|\langle x,y\rangle\big|\le \lVert x\rVert\,\lVert y\rVert,$$ equality iff $x, y$ are linearly dependent.

  • Triangle inequality and Minkowski: $$\lVert x+y\rVert_p \le \lVert x\rVert_p + \lVert y\rVert_p,$$ basis of $\ell_p$ geometries.

  • Hölder’s inequality: $$|\langle x,y\rangle| \le \lVert x\rVert_p\,\lVert y\rVert_q,$$ with $1/p+1/q=1$.

  • Pythagorean theorem (projections): For orthogonal $a\perp b$, $$\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2.$$

  • Norm equivalence (finite-dimensional): For any two norms on $\mathbb{R}^d$, there exist $c, C>0$ with $c\lVert x\rVert_a \le \lVert x\rVert_b \le C\lVert x\rVert_a$.

  • PSD characterization: A symmetric matrix $G$ is a Gram matrix (i.e., $G_{ij}=\langle x_i, x_j\rangle$ for some vectors $x_i$) iff $z^\top G z \ge 0$ for all $z$ (kernel validity test).
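
One direction of the PSD characterization is constructive: given a symmetric PSD matrix, an eigendecomposition yields vectors whose inner products reproduce it. The following is a minimal sketch under that assumption; the clipping step only removes round-off negatives.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 3))
G = B @ B.T                        # a PSD matrix (rank 3) used as test input

w, V = np.linalg.eigh(G)           # G = V diag(w) V^T
w = np.clip(w, 0.0, None)          # clip tiny negative eigenvalues from round-off
X = V * np.sqrt(w)                 # row i of X is the recovered point x_i

G_rec = X @ X.T                    # inner products of the recovered points
print("Recovered Gram matrix matches:", np.allclose(G_rec, G))
```

The recovered points are not unique: any orthogonal transformation of the rows of `X` gives the same Gram matrix.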

Applied (landmark systems and practices)#

  • SVMs (margins via inner products): Cortes–Vapnik (1995); kernel trick.

  • Gaussian Processes (inner products in function space): Rasmussen–Williams (2006).

  • BatchNorm/LayerNorm (norm/variance control): Ioffe–Szegedy (2015); Ba et al. (2016).

  • Word2Vec/GloVe (cosine similarity): Mikolov et al. (2013); Pennington et al. (2014).

  • SimCLR/contrastive learning (normalized dot-products): Chen et al. (2020).

  • Transformers (scaled dot-product): Vaswani et al. (2017).

  • Gradient clipping (norm control in training): Pascanu et al. (2013).

Key ideas: Where it shows up
  1. PCA and covariance geometry

  • Variance along $u$: $\sigma^2(u)=\lVert X_c u\rVert_2^2/n= u^\top \Sigma u$, with $\Sigma=\tfrac{1}{n}X_c^\top X_c$.

  • Principal components are eigenvectors of $\Sigma$: the top eigenvector is the unit direction $u$ maximizing the projected variance $u^\top \Sigma u$; projection error uses Pythagorean decomposition.

  • Achievements: Dimensionality reduction at scale; whitening used broadly in vision and speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. SGD/optimization: gradient norms and clipping

  • Step sizes depend on Lipschitz constants tied to operator/dual norms.

  • Gradient clipping by $\ell_2$ norm prevents exploding gradients (RNNs). References: Pascanu et al. 2013; Goodfellow et al. 2016; Nesterov 2018.

  3. Deep nets: normalization and regularization

  • Weight decay ($\ell_2$) controls model complexity; $\ell_1$ encourages sparsity.

  • BatchNorm/LayerNorm normalize mean/variance, implicitly controlling activation norms. References: Ioffe–Szegedy 2015; Ba et al. 2016.

  4. Kernels and PSD Gram matrices

  • SVMs and GPs depend on PSD kernels (Mercer). $K=XX^\top$ is PSD; RBF kernel yields smooth function priors.

  • Achievements: Kernel SVMs in text/vision (1990s–2000s); GPs in Bayesian ML. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  5. Transformers: scaled dot-product attention

  • Scores $S=QK^\top/\sqrt{d_k}$ use many inner products; dividing by $\sqrt{d_k}$ keeps the variance of the logits near 1, so the softmax does not saturate as $d_k$ grows.

  • Achievements: SOTA in NLP/vision; ubiquitous backbone. References: Vaswani et al. 2017; Devlin et al. 2019; Dosovitskiy et al. 2020.

  6. Embeddings and retrieval

  • Cosine similarity is the default for semantic retrieval and metric learning; normalization puts data on the unit sphere.

  • Achievements: Word2Vec/GloVe; SimCLR; CLIP/contrastive vision-language models. References: Mikolov et al. 2013; Pennington et al. 2014; Chen et al. 2020; Radford et al. 2021.

Notation
  • Vectors are column vectors. Data matrix: $X\in\mathbb{R}^{n\times d}$ (rows are examples; columns features). Centered data: $X_c$.

  • Inner product: $\langle x, y\rangle = x^\top y$; cosine similarity: $$\cos\theta(x,y) = \frac{\langle x, y\rangle}{\lVert x\rVert_2\,\lVert y\rVert_2}.$$

  • Norms: $\lVert x\rVert_1, \lVert x\rVert_2, \lVert x\rVert_\infty$; dual norms $\lVert\cdot\rVert_*$; Mahalanobis $\lVert x\rVert_M = \sqrt{x^\top M x}$.

  • Projection: If $U\in\mathbb{R}^{d\times k}$ is orthonormal, $P=UU^\top$; residual $r=(I-P)x$ is orthogonal to $\text{col}(U)$.

  • Gram matrix: $G=XX^\top$ (PSD); kernel matrix: $K_{ij}=k(x_i,x_j)$.

  • Examples:

    • Embedding cosine: normalize $\hat{x}=x/\lVert x\rVert_2$, $\hat{y}=y/\lVert y\rVert_2$, then $\langle \hat{x},\hat{y}\rangle=\cos\theta$.

    • Ridge penalty: $\lambda\lVert w\rVert_2^2$; Lasso: $\lambda\lVert w\rVert_1$.

    • Attention scores: $S=QK^\top/\sqrt{d_k}$; softmax row-wise on $S$.
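
The attention example above can be written out in a few lines. This is a single-head sketch without masking or batching; the function names `softmax_rows` and `attention` are illustrative, and subtracting the row maximum is a common numerical-stability choice rather than part of the definition.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[1]
    S = Q @ K.T / np.sqrt(d_k)     # all query-key inner products, scaled
    W = softmax_rows(S)            # each row is a distribution over keys
    return W @ V, W

rng = np.random.default_rng(0)
n_q, n_k, d_k, d_v = 3, 5, 8, 4
Q = rng.standard_normal((n_q, d_k))
K = rng.standard_normal((n_k, d_k))
V = rng.standard_normal((n_k, d_v))

out, W = attention(Q, K, V)
print("output shape:", out.shape)
print("rows of W sum to 1:", np.allclose(W.sum(axis=1), 1.0))
```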

Pitfalls & sanity checks
  • Cosine vs Euclidean: without normalization, rankings can change due to scale.

  • PSD checks: ensure Gram/kernel matrices are PSD (numerically, allow tiny negatives).

  • Norm choice: $\ell_2$ is rotation-invariant; $\ell_1$ is robust/sparse but not smooth.

  • Attention scaling: omit $1/\sqrt{d_k}$ and softmax saturates for large $d_k$.

  • Centering for covariance: use $X_c$ for PCA; otherwise directions mix mean effects.

  • Gradient norms: clip by global norm to avoid exploding updates.
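
Clipping by global norm, as in the last item, treats all parameter gradients as one long vector and rescales them together. A minimal NumPy sketch follows; the helper name `clip_by_global_norm` and the small constant in the denominator are assumptions made for the illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint l2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # never scale up
    return [g * scale for g in grads], total

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, total = clip_by_global_norm(grads, max_norm=1.0)
new_total = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before:", total)
print("norm after :", new_total)
```

Because every array is multiplied by the same factor, the direction of the overall update is preserved; only its length changes.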

References

Foundations and geometry

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.).

  3. Horn, R. & Johnson, C. (2012). Matrix Analysis.

  4. Boyd, S. & Vandenberghe, L. (2004). Convex Optimization.

Kernels and PSD

  5. Mercer, J. (1909). Functions of positive and negative type.

  6. Aronszajn, N. (1950). Theory of Reproducing Kernels.

  7. Schölkopf, B. & Smola, A. (2002). Learning with Kernels.

  8. Rasmussen, C. & Williams, C. (2006). Gaussian Processes for ML.

Regularization and optimization

  9. Hoerl, A. & Kennard, R. (1970). Ridge Regression.

  10. Tibshirani, R. (1996). Lasso.

  11. Nesterov, Y. (2018). Lectures on Convex Optimization.

  12. Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning.

  13. Pascanu, R. et al. (2013). On the difficulty of training RNNs (gradient clipping).

Embeddings, normalization, attention

  14. Mikolov, T. et al. (2013). Word2Vec.

  15. Pennington, J. et al. (2014). GloVe.

  16. Ioffe, S. & Szegedy, C. (2015). Batch Normalization.

  17. Ba, J. L. et al. (2016). Layer Normalization.

  18. Chen, T. et al. (2020). SimCLR.

  19. Radford, A. et al. (2021). CLIP.

  20. Vaswani, A. et al. (2017). Attention Is All You Need.

PCA and projections

  21. Jolliffe, I. (2002). Principal Component Analysis.

  22. Shlens, J. (2014). A Tutorial on PCA.

  23. Eckart, C. & Young, G. (1936). Low-rank approximation.

  24. Golub, G. & Van Loan, C. (2013). Matrix Computations.

Five worked examples

Worked Example 1: Cosine similarity vs Euclidean distance for embedding retrieval#

Introduction#

Cosine similarity is ubiquitous for nearest-neighbor search in embedding spaces (text, images, audio). We show that for $\ell_2$-normalized vectors, maximizing cosine similarity is equivalent to minimizing Euclidean distance.

Purpose#

Relate inner products to distances under normalization; provide a fast retrieval recipe.

Importance#

Industrial search, recommendation, and retrieval pipelines rely on cosine similarity with normalized embeddings for stability and interpretability.

What this example demonstrates#

  • Equivalence: For unit vectors, $$\lVert x-y\rVert_2^2 = 2(1-\langle x,y\rangle).$$

  • Ranking by cosine equals ranking by negative Euclidean distance after normalization.

Background#

Vector space models in IR (Salton) and modern embeddings (Word2Vec, GloVe, CLIP) use cosine similarity due to scale invariance.

Historical context#

From tf–idf cosine in IR to neural embeddings; normalization combats varying document lengths and feature scales.

Prevalence in ML#

Text retrieval, semantic search, metric learning, contrastive pretraining; approximate nearest neighbor (ANN) indices often assume normalized data.

Notes#

  • Always normalize embeddings: $\hat{x}=x/\lVert x\rVert_2$.

  • For batched comparisons: use matrix products $S=\hat{X}\hat{Y}^\top$ to get all cosines.

Connection to ML#

Similarity search, contrastive objectives, and re-ranking all hinge on stable cosine scores.

Connection to Linear Algebra Theory#

Inner products induce norms/angles; normalization maps data to the unit sphere $\mathbb{S}^{d-1}$.

Pedagogical Significance#

Shows direct algebraic link between inner product and Euclidean geometry under normalization.

References#

  1. Salton, G. et al. (1975). A vector space model for automatic indexing.

  2. Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space.

  3. Pennington, J. et al. (2014). GloVe.

  4. Radford, A. et al. (2021). CLIP.

Solution (Python)#

import numpy as np

np.random.seed(0)
d, n_query, n_db = 128, 4, 6
X = np.random.randn(n_query, d)
Y = np.random.randn(n_db, d)

def normalize(A):
    # Row-wise L2 normalization; the small constant guards against zero rows
    nrm = np.linalg.norm(A, axis=1, keepdims=True) + 1e-12
    return A / nrm

Xn, Yn = normalize(X), normalize(Y)
cos = Xn @ Yn.T                  # cosine similarities
eucl2 = ((Xn[:, None, :] - Yn[None, :, :])**2).sum(-1)  # squared distances

print("Cosine matrix:\n", np.round(cos, 3))
print("Squared distances (normalized):\n", np.round(eucl2, 3))
print("Relationship check (row 0):", np.allclose(eucl2[0], 2 * (1 - cos[0]), atol=1e-6))
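
The ranking claim (cosine order equals Euclidean order after normalization) can also be checked directly. The snippet below is a small retrieval sketch on random data; the database size and `top-5` cutoff are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(32)               # one query embedding
db = rng.standard_normal((100, 32))       # database embeddings

qn = q / np.linalg.norm(q)
dbn = db / np.linalg.norm(db, axis=1, keepdims=True)

cos = dbn @ qn                            # cosine similarity to each item
dist2 = ((dbn - qn) ** 2).sum(axis=1)     # squared Euclidean distance

top_by_cos = np.argsort(-cos)[:5]         # largest cosine first
top_by_dist = np.argsort(dist2)[:5]       # smallest distance first

print("top-5 by cosine  :", top_by_cos)
print("top-5 by distance:", top_by_dist)
```

Since $\lVert x-y\rVert_2^2 = 2(1-\langle x,y\rangle)$ is a decreasing function of the cosine, the two orderings coincide (barring exact ties).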

Worked Example 2: $\ell_2$ vs $\ell_1$ regularization under orthonormal design#

Introduction#

Compare Ridge ($\ell_2$) and Lasso ($\ell_1$) when $X^\top X = I$. Ridge has a closed form; Lasso reduces to soft-thresholding.

Purpose#

Show how norms shape solutions: $\ell_2$ shrinks weights smoothly; $\ell_1$ induces sparsity.

Importance#

Regularization choice affects interpretability, robustness, and generalization.

What this example demonstrates#

  • For $X^\top X=I$, OLS is $w_{\text{ls}}=X^\top y$.

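  • Ridge (minimizing $\tfrac{1}{2}\lVert y-Xw\rVert_2^2+\tfrac{\lambda}{2}\lVert w\rVert_2^2$): $$w_{\text{ridge}} = \frac{1}{1+\lambda} w_{\text{ls}}.$$

  • Lasso (minimizing $\tfrac{1}{2}\lVert y-Xw\rVert_2^2+\lambda\lVert w\rVert_1$): $$w_{\text{lasso}, i} = \operatorname{sign}(w_{\text{ls}, i})\,\max\{|w_{\text{ls}, i}|-\lambda, 0\}.$$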

Background#

Ridge stabilizes ill-conditioned problems; Lasso selects features.

Historical context#

Ridge (Tikhonov, 1963; Hoerl–Kennard, 1970) and Lasso (Tibshirani, 1996) are canonical.

Prevalence in ML#

Widely used in linear models, compressed sensing, and high-dimensional statistics.

Notes#

  • The soft-threshold formula holds exactly under orthonormal design; otherwise use coordinate descent.
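
For designs that are not orthonormal, the coordinate-descent alternative mentioned above can be sketched as follows. This assumes the objective $\tfrac{1}{2}\lVert y-Xw\rVert_2^2+\lambda\lVert w\rVert_1$, a fixed number of sweeps instead of a convergence test, and no all-zero columns; `lasso_cd` is a hypothetical helper name.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for 0.5 * ||y - X w||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)          # ||X_j||^2 for each column
    r = y - X @ w                          # running residual
    for _ in range(n_sweeps):
        for j in range(d):
            r += X[:, j] * w[j]            # remove coordinate j from the fit
            rho = X[:, j] @ r              # correlation with partial residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * w[j]            # add the updated coordinate back
    return w

# Sanity check on an orthonormal design, where the closed form applies
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((64, 8)))
w_true = np.zeros(8)
w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + 0.1 * rng.standard_normal(64)

w_cd = lasso_cd(X, y, lam=0.5)
w_closed = soft_threshold(X.T @ y, 0.5)
print("CD matches soft-thresholding:", np.allclose(w_cd, w_closed))
```

With orthonormal columns each coordinate update is independent of the others, so one sweep already reproduces the soft-threshold formula.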

Connection to ML#

Norm penalties as priors/constraints: weight decay, sparsity, and model selection.

Connection to Linear Algebra Theory#

Dual norms and subgradients for $\ell_1$; spectral properties for $\ell_2$.

Pedagogical Significance#

Highlights geometric differences: $\ell_2$ balls are round; $\ell_1$ balls have corners that promote zeros.

References#

  1. Hoerl, A. & Kennard, R. (1970). Ridge Regression.

  2. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.

  3. Hastie, T. et al. (2009). Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 64, 8
U, _ = np.linalg.qr(np.random.randn(n, d))  # n x d with orthonormal columns
X = U
w_true = np.zeros(d); w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + 0.1 * np.random.randn(n)

w_ls = X.T @ y
lam = 0.5
w_ridge = w_ls / (1.0 + lam)

def soft_threshold(a, lam):
    # Elementwise soft-thresholding: shrink toward zero by lam, clip at zero
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

w_lasso = soft_threshold(w_ls, lam)

print("||w_ls||2=", np.linalg.norm(w_ls))
print("Ridge (lam=0.5):", np.round(w_ridge, 3))
print("Lasso (lam=0.5):", np.round(w_lasso, 3))

Worked Example 3: Gram matrices are PSD; kernels in practice#

Introduction#

Show that $G=XX^\top$ is PSD and illustrate a common kernel (RBF). Verify PSD numerically.

Purpose#

Connect inner products to PSD matrices and kernel validity.

Importance#

Kernel methods hinge on PSD property; invalid kernels can break optimization.

What this example demonstrates#

  • For any $z$, $$z^\top (XX^\top) z = \lVert X^\top z\rVert_2^2 \ge 0.$$

  • RBF kernel is PSD; eigenvalues are nonnegative up to numerical tolerance.

Background#

Mercer’s theorem characterizes kernels as inner products in (possibly infinite-dimensional) feature spaces.

Historical context#

Kernel trick popularized SVMs and GPs; modern random features approximate kernels at scale.

Prevalence in ML#

Text, bioinformatics, small/medium tabular data, Bayesian regression.

Notes#

  • Numerical PSD check via eigenvalues or Cholesky with jitter.
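
The Cholesky-with-jitter check from the note above might look like the sketch below. The relative jitter level is an assumption: it sets how much negative round-off is tolerated, and should be chosen for the problem at hand. The helper name `is_psd` is illustrative.

```python
import numpy as np

def is_psd(K, rel_jitter=1e-8):
    """Heuristic PSD test: Cholesky of the symmetrized matrix plus small jitter."""
    K = 0.5 * (K + K.T)                              # enforce symmetry
    jitter = rel_jitter * np.mean(np.abs(np.diag(K)))
    try:
        np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
        return True
    except np.linalg.LinAlgError:
        return False

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 5))
G = X @ X.T                        # PSD but rank-deficient (rank 5)
bad = np.diag([1.0, -1.0])         # indefinite

print("Gram matrix passes:", is_psd(G))
print("Indefinite matrix passes:", is_psd(bad))
```

Without jitter, Cholesky can fail on a rank-deficient Gram matrix purely because of round-off, which is why the eigenvalue check in the solution reports the minimum eigenvalue instead of a yes/no answer.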

Connection to ML#

SVM margin maximization and GP covariance both rely on PSD structure.

Connection to Linear Algebra Theory#

Gram operators encode geometry via inner products.

Pedagogical Significance#

Concrete link between data matrix products and PSD.

References#

  1. Mercer, J. (1909). Functions of positive and negative type.

  2. Schölkopf, B. & Smola, A. (2002). Learning with Kernels.

  3. Rasmussen, C. & Williams, C. (2006). Gaussian Processes for ML.

  4. Rahimi, A. & Recht, B. (2007). Random features for large-scale kernels.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 10, 5
X = np.random.randn(n, d)
G = X @ X.T

evals = np.linalg.eigvalsh(G)
print("Gram PSD? min eigenvalue:", np.min(evals))

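def rbf_kernel(A, B, sigma=1.0):
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = np.maximum(A2 + B2 - 2 * A @ B.T, 0.0)  # clamp round-off negatives
    return np.exp(-D2 / (2 * sigma**2))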

K = rbf_kernel(X, X, sigma=1.0)
kevals = np.linalg.eigvalsh(K)
print("RBF kernel PSD? min eigenvalue:", np.min(kevals))

Worked Example 4: Why attention uses $1/\sqrt{d_k}$ scaling#

Introduction#

For random features with variance 1, dot-products have variance that grows with $d_k$; scaling by $1/\sqrt{d_k}$ stabilizes softmax.

Purpose#

Quantify inner-product growth and show stabilization by scaling.

Importance#

Essential to prevent saturation and numerical instability in attention.

What this example demonstrates#

  • If $q,k\sim \mathcal{N}(0, I)$ in $\mathbb{R}^{d_k}$ are independent, then $\operatorname{Var}(q^\top k) = d_k$.

  • Scaling by $1/\sqrt{d_k}$ makes variance approximately 1 across dimensions.
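
The variance claim follows from a short calculation, written out here under the stated assumptions (independent $q$ and $k$, each with independent zero-mean, unit-variance components):

```latex
\begin{aligned}
\operatorname{Var}(q^\top k)
  &= \operatorname{Var}\Big(\sum_{i=1}^{d_k} q_i k_i\Big)
   = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
   && \text{(the products } q_i k_i \text{ are independent across } i\text{)} \\
\operatorname{Var}(q_i k_i)
  &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \big(\mathbb{E}[q_i]\,\mathbb{E}[k_i]\big)^2
   = 1\cdot 1 - 0 = 1 \\
\Rightarrow\quad
\operatorname{Var}(q^\top k) &= d_k,
\qquad
\operatorname{Var}\!\Big(\tfrac{q^\top k}{\sqrt{d_k}}\Big) = \tfrac{1}{d_k}\, d_k = 1 .
\end{aligned}
```

Gaussianity is not needed for this step; only independence, zero mean, and unit variance are used.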

Background#

Softmax is sensitive to logit scale; large variance yields peaky distributions and vanishing gradients.
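
The effect of logit scale on the softmax can be seen in a small experiment. This is an illustrative sketch with random queries and keys; the sizes `d_k = 1024` and 16 keys are arbitrary choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, n_keys = 1024, 16
q = rng.standard_normal(d_k)
K = rng.standard_normal((n_keys, d_k))

logits = K @ q                              # standard deviation about sqrt(d_k)
p_raw = softmax(logits)                     # unscaled logits
p_scaled = softmax(logits / np.sqrt(d_k))   # scaled logits

print("largest weight, unscaled:", p_raw.max())
print("largest weight, scaled  :", p_scaled.max())
```

The unscaled distribution puts nearly all of its mass on one key, which is the "peaky" regime where gradients through the softmax become very small.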

Historical context#

Transformers introduced the scaling to stabilize training across widths.

Prevalence in ML#

The standard Transformer attention layer, and most of its variants, use this factor (self- and cross-attention).

Notes#

  • Normalization and temperature are closely related; tuning temperature affects entropy.

Connection to ML#

Stable attention distributions, better gradient flow, easier optimization.

Connection to Linear Algebra Theory#

Variance of inner products aggregates component variances; normalization rescales geometry.

Pedagogical Significance#

Shows a direct norm/variance argument behind a ubiquitous architectural choice.

References#

  1. Vaswani, A. et al. (2017). Attention Is All You Need.

  2. Goodfellow, I. et al. (2016). Deep Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
trials = 2000
for d in [16, 64, 256, 1024]:
    q = np.random.randn(trials, d)
    k = np.random.randn(trials, d)
    dots = np.sum(q * k, axis=1)  # one dot product per trial
    scaled = dots / np.sqrt(d)
    print(f"d={d:4d} var(dot)={np.var(dots):.1f}  var(scaled)={np.var(scaled):.2f}")

Worked Example 5: Orthogonal projection minimizes squared error (Pythagorean decomposition)#

Introduction#

Projecting onto an orthonormal subspace minimizes $\ell_2$ reconstruction error and decomposes energy orthogonally.

Purpose#

Connect projections, norms, and PCA-style reconstructions.

Importance#

Underlies least squares, PCA truncation, and many dimensionality-reduction pipelines.

What this example demonstrates#

  • For orthonormal $U\in\mathbb{R}^{d\times k}$, $$\hat{x}=UU^\top x = \arg\min_{z\in\text{col}(U)} \lVert x-z\rVert_2.$$

  • Pythagorean identity: $$\lVert x\rVert_2^2 = \lVert UU^\top x\rVert_2^2 + \lVert (I-UU^\top)x\rVert_2^2.$$

Background#

Least squares is projection onto column space; PCA chooses $U$ to maximize captured variance.

Historical context#

Orthogonal expansions from Fourier to PCA; SVD gives best rank-$k$ approximation.

Prevalence in ML#

Everywhere: regression, PCA, subspace tracking, recommendation.

Notes#

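  • Orthonormality is crucial: if the columns of a basis matrix $A$ are not orthonormal, $AA^\top$ is not a projector. Use $P = A(A^\top A)^{-1}A^\top$ instead, or orthonormalize first via QR/SVD.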
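
A short check of what happens without orthonormal columns, as a sketch on random data: the projector onto $\text{col}(A)$ can be formed either from a QR factorization or from the normal-equations formula $A(A^\top A)^{-1}A^\top$, while the naive product $AA^\top$ fails the idempotence test.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))      # spans a 3-dim subspace, columns not orthonormal

Q, _ = np.linalg.qr(A)                # orthonormal basis for col(A)
P_qr = Q @ Q.T                        # projector from the orthonormal basis
P_normal = A @ np.linalg.solve(A.T @ A, A.T)   # A (A^T A)^{-1} A^T

naive = A @ A.T                       # not a projector in general

print("QR and normal-equation projectors agree:", np.allclose(P_qr, P_normal))
print("P idempotent:", np.allclose(P_qr @ P_qr, P_qr))
print("A A^T idempotent:", np.allclose(naive @ naive, naive))
```

The QR route is usually preferred numerically, since forming $A^\top A$ squares the condition number.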

Connection to ML#

Data compression and denoising via low-dimensional projections.

Connection to Linear Algebra Theory#

Orthogonal projectors are idempotent and symmetric; decomposition follows from orthogonality of components.

Pedagogical Significance#

Reinforces geometric intuition of least squares and PCA.

References#

  1. Golub, G. & Van Loan, C. (2013). Matrix Computations.

  2. Jolliffe, I. (2002). Principal Component Analysis.

  3. Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank.

Solution (Python)#

import numpy as np

np.random.seed(4)
d, k = 20, 3
x = np.random.randn(d)
U, _ = np.linalg.qr(np.random.randn(d, k))  # orthonormal basis
P = U @ U.T
x_hat = P @ x
r = x - x_hat

lhs = np.linalg.norm(x)**2
rhs = np.linalg.norm(x_hat)**2 + np.linalg.norm(r)**2
z = U @ np.random.randn(k)  # an arbitrary competitor in col(U)
print("Projection error minimal?", np.linalg.norm(r) <= np.linalg.norm(x - z) + 1e-12)
print("Pythagorean holds (numeric):", np.allclose(lhs, rhs, atol=1e-10))

Chapter 4
Linear Maps & Matrices
Key ideas: Introduction

Introduction#

Linear maps (also called linear transformations or functions) are structure-preserving transformations between vector spaces: they respect addition and scalar multiplication. Matrices are their concrete representation: a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ is represented as a matrix $A \in \mathbb{R}^{m \times d}$ so that $f(x) = Ax$. This is the language of neural networks: each layer is a composition of linear maps (matrix multiplications) and nonlinear activations. Understanding linear maps clarifies:

  • Model expressiveness: What functions can be represented? (Universal approximation via composition of linear maps and nonlinearities.)

  • Gradient flow: How do errors backpropagate through layers? (Chain rule uses transposes of linear map matrices.)

  • Data transformation: How do representations change through layers? (Each layer applies a linear map to its input.)

  • Optimization: How should weights change to reduce loss? (Gradient is also a linear map, obtained via transpose.)

Linear maps are everywhere in ML:

  • Neural networks: Each dense layer computes $h_{i+1} = \sigma(W_i h_i + b_i)$: the linear map $W_i$, a bias shift $b_i$ (together an affine map), then the activation $\sigma$.

  • Attention: Query/Key/Value projections are linear maps. Attention output is a weighted linear combination.

  • Least squares: Solving $\hat{w} = (X^\top X)^{-1} X^\top y$ involves products of linear maps.

  • PCA: Projection onto principal components is a linear map.

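  • Convolution: Convolutional layers (without bias) are linear maps: in the spatial domain they act as a structured, Toeplitz-like matrix, and in the frequency domain circular convolution becomes elementwise multiplication.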

Important Ideas#

1. Linear map = function preserving structure. A function $f: V \to W$ between vector spaces is linear if:

  • Additivity: $f(u + v) = f(u) + f(v)$ for all $u, v \in V$.

  • Homogeneity: $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$.

Why these properties? Linear maps are exactly those that can be written as matrix multiplication: $f(x) = Ax$. Additivity ensures the matrix distributes: $A(x + y) = Ax + Ay$. Homogeneity ensures scaling: $A(\alpha x) = \alpha (Ax)$.

Example: Rotation by angle $\theta$ is linear: $f([x, y]^\top) = [\cos\theta \cdot x - \sin\theta \cdot y, \sin\theta \cdot x + \cos\theta \cdot y]^\top = R_\theta [x, y]^\top$.

Non-example: $f(x) = x + 1$ is not linear (fails $f(0) = 0$ test). $f(x) = \|x\|$ is not linear (not additive).

2. Matrix representation is unique once bases are fixed. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ with standard bases, there is exactly one matrix $A \in \mathbb{R}^{m \times d}$ with $f(x) = Ax$. Columns of $A$ are images of standard basis vectors: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Why unique? By linearity, $f(x) = f(\sum_j x_j e_j) = \sum_j x_j f(e_j)$. If we know $f$ on basis vectors, we know $f$ everywhere.

Example: $f(x) = 2x_1 + 3x_2$ is $f([x_1, x_2]^\top) = [2, 3] \cdot [x_1, x_2]^\top$. Matrix is $A = [2, 3]$ (1 row, 2 columns).
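
The column-by-column construction can be turned into code directly. The helper `matrix_of` below is a hypothetical name; it assumes the function passed in really is linear and simply evaluates it on the standard basis.

```python
import numpy as np

def matrix_of(f, d):
    """Stack f(e_1), ..., f(e_d) as columns (f is assumed to be linear)."""
    return np.column_stack([np.atleast_1d(f(e)) for e in np.eye(d)])

theta = 0.3

def rotate(v):
    x, y = v
    return np.array([np.cos(theta) * x - np.sin(theta) * y,
                     np.sin(theta) * x + np.cos(theta) * y])

R = matrix_of(rotate, 2)                           # the rotation matrix R_theta
A = matrix_of(lambda v: 2 * v[0] + 3 * v[1], 2)    # the 1 x 2 example above

v = np.array([1.5, -0.5])
print("R =\n", np.round(R, 3))
print("A =", A)
print("R @ v equals rotate(v):", np.allclose(R @ v, rotate(v)))
```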

3. Composition = matrix multiplication. For linear maps $f: \mathbb{R}^d \to \mathbb{R}^m$ with matrix $A$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ with matrix $B$, the composition $g \circ f: \mathbb{R}^d \to \mathbb{R}^p$ has matrix $BA$ (note the order: the map applied first, $A$, is the rightmost factor).

Why this order? $(g \circ f)(x) = g(f(x)) = g(Ax) = B(Ax) = (BA)x$. Matrix product $BA$ is therefore natural for composition.

Example: Neural network layer 1 applies $A_1$, layer 2 applies $A_2$. Composition is $A_2 A_1$ (layer 1 first, then layer 2).
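
A quick numerical check of the composition rule, with arbitrary dimensions chosen for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, p = 5, 4, 3
A = rng.standard_normal((m, d))    # f: R^d -> R^m
B = rng.standard_normal((p, m))    # g: R^m -> R^p
x = rng.standard_normal(d)

f = lambda v: A @ v
g = lambda v: B @ v

lhs = g(f(x))                      # apply f first, then g
rhs = (B @ A) @ x                  # single matrix for the composition

print("shape of BA:", (B @ A).shape)
print("g(f(x)) == (BA)x:", np.allclose(lhs, rhs))
```

The product `A @ B` is not even defined here unless $d = p$, which is a useful reminder that the order of factors follows the order of application.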

4. Transpose = dual map (adjoint). For matrix $A: \mathbb{R}^d \to \mathbb{R}^m$, the transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^d$ is the unique linear map satisfying: $$ (Ax)^\top y = x^\top (A^\top y) \quad \text{for all } x, y $$

Geometric interpretation: If $A$ is a rotation, $A^\top = A^{-1}$ rotates by the opposite angle. If $A$ is an orthogonal projector, $A^\top = A$. In general, $A^\top$ maps output-space vectors back to input space and need not invert $A$.

In backprop: If forward pass applies $y = Ax$, reverse mode applies $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (transpose carries gradients backward).

Example: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
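
Both the adjoint identity and the backprop rule can be checked numerically. The sketch below uses the loss $L(x)=\tfrac{1}{2}\lVert Ax-b\rVert_2^2$ as an assumed example and compares the transpose-based gradient with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
y = rng.standard_normal(4)

# Adjoint identity: (Ax)^T y = x^T (A^T y)
adjoint_gap = (A @ x) @ y - x @ (A.T @ y)

# Backprop through y = A x for L = 0.5 * ||A x - b||^2
b = rng.standard_normal(4)
loss = lambda v: 0.5 * np.sum((A @ v - b) ** 2)
grad_y = A @ x - b                 # dL/dy at y = A x
grad_x = A.T @ grad_y              # transpose carries the gradient backward

eps = 1e-6
grad_fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print("adjoint gap:", adjoint_gap)
print("gradient matches finite differences:", np.allclose(grad_x, grad_fd, atol=1e-6))
```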

5. Image and kernel characterize a linear map. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:

  • Image (column space): $\text{im}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^d\}$ (all possible outputs). Dimension = rank$(A)$.

  • Kernel (null space): $\ker(A) = \text{null}(A) = \{x : Ax = 0\}$ (inputs mapping to zero). Dimension = nullity$(A) = d - \text{rank}(A)$.

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = d$ (dimension in = rank out + null space).

Why important? Image tells us what the map can represent. Kernel tells us what information is lost. For invertible maps, kernel is trivial (only zero maps to zero).
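
Rank, a kernel basis, and the rank-nullity identity can all be read off an SVD. The following is a minimal sketch; the tolerance rule is one common convention, and the test matrix is built to have rank 3.

```python
import numpy as np

rng = np.random.default_rng(0)
# A 4 x 6 matrix of rank 3 (product of 4 x 3 and 3 x 6 factors)
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 6))
m, d = A.shape

U, s, Vt = np.linalg.svd(A)                     # full SVD: Vt is d x d
tol = s.max() * max(m, d) * np.finfo(float).eps
rank = int((s > tol).sum())

N = Vt[rank:].T                                 # columns span ker(A)
nullity = N.shape[1]

print("rank:", rank, " nullity:", nullity, " d:", d)
print("A N = 0?", np.allclose(A @ N, 0.0, atol=1e-10))
```

The first `rank` columns of `U` give an orthonormal basis for the image, so the same factorization describes both subspaces.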

Relevance to Machine Learning#

Expressiveness through composition. A single linear map is limited: it can only rotate, scale, shear, or project its input. Composing linear maps with nonlinearities dramatically increases expressiveness. The universal approximation theorem (Cybenko 1989) says that a network with one hidden layer of sigmoidal units can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units.

Gradient computation via transposes. Backpropagation is the chain rule applied backward through the network. Gradient w.r.t. input of a layer uses the transpose of the weight matrix. Understanding transposes is essential for implementing and understanding neural networks.

Data transformation and representation learning. Neural networks learn by composing linear maps (weight matrices) with nonlinearities. Early layers learn low-level features (via image of $A_1$). Deep layers compose these into high-level features (via $(A_k \cdots A_2 A_1)$).

Optimization structure. For least squares, gradient descent updates weights in proportion to $X^\top (Xw - y)$ (a composition of linear maps). Understanding matrix products clarifies why batch size, feature dimension, and conditioning affect optimization.
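
As an illustration of that update, here is a minimal gradient-descent sketch for least squares. The step size $1/\sigma_{\max}(X)^2$ is an assumed (standard) choice tied to the Lipschitz constant of the gradient, and the iteration count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)

L = np.linalg.norm(X, 2) ** 2      # largest singular value squared
w = np.zeros(5)
for _ in range(500):
    grad = X.T @ (X @ w - y)       # gradient of 0.5 * ||X w - y||^2
    w -= grad / L

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print("matches least-squares solution:", np.allclose(w, w_star, atol=1e-8))
```

Convergence speed depends on the ratio of the smallest to the largest singular value of $X$, which is one concrete way conditioning enters optimization.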

Algorithmic Development History#

1. Linear transformations (Euler, 1750s-1770s). Euler rotated coordinate systems to solve differential equations and optimize geometry problems. Rotations are linear maps.

2. Matrix algebra (Cayley, Sylvester, 1850s-1880s). Introduced matrices as algebraic objects. Cayley-Hamilton theorem: matrices satisfy their own characteristic polynomial. Matrix multiplication defined to represent composition of linear transformations.

3. Bilinear forms and adjoints (Cauchy, Hermite, Hilbert, 1800s-1900s). Developed duality theory: every linear form has an adjoint. Transpose is the matrix adjoint.

4. Rank and nullity (Grassmann 1844, Frobenius 1870s-1880s). Formalized rank as dimension of image. Rank-nullity theorem central to linear algebra.

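5. Spectral theory (Schur 1909, Hilbert 1920s). Every symmetric (more generally, normal) matrix has an orthonormal basis of eigenvectors, and Schur showed that every square complex matrix is unitarily similar to an upper triangular matrix. Spectral decomposition reveals structure of linear maps.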

6. Computational algorithms (Householder 1958, Golub-Kahan 1965): Developed numerically stable algorithms for matrix factorization (QR, SVD, Cholesky). Made linear algebra practical at scale.

7. Neural networks and backprop (Rumelhart, Hinton, Williams 1986). Showed that composing linear maps with nonlinearities, trained via backprop (which uses transposes), learns powerful representations. Modern deep learning.

8. Transformers and attention (Vaswani et al. 2017). Attention is built almost entirely from linear maps: $\text{softmax}(QK^\top / \sqrt{d_k}) V$ is a composition of matrix multiplications, a row-wise softmax (the only nonlinear step), and another multiplication.

Definitions#

Linear map (linear transformation). A function $f: V \to W$ between vector spaces over $\mathbb{R}$ is linear if:

  1. $f(u + v) = f(u) + f(v)$ for all $u, v \in V$ (additivity).

  2. $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$ (homogeneity).

Equivalently: $f(\alpha u + \beta v) = \alpha f(u) + \beta f(v)$ (linearity).

Matrix representation. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$, the matrix $A \in \mathbb{R}^{m \times d}$ represents $f$ if $f(x) = Ax$ for all $x \in \mathbb{R}^d$. Columns of $A$ are: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Image and kernel. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$: $$ \text{im}(A) = \{Ax : x \in \mathbb{R}^d\} = \text{col}(A), \quad \text{ker}(A) = \{x : Ax = 0\} = \text{null}(A) $$

Rank. The rank of $A$ is: $$ \text{rank}(A) = \dim(\text{im}(A)) = \dim(\text{col}(A)) = \text{number of linearly independent columns} $$

Nullity. The nullity of $A$ is: $$ \text{nullity}(A) = \dim(\text{ker}(A)) = d - \text{rank}(A) $$

Rank-nullity theorem. For any matrix $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$

Transpose (adjoint). The transpose of $A \in \mathbb{R}^{m \times d}$ is $A^\top \in \mathbb{R}^{d \times m}$ satisfying: $$(Ax)^\top y = x^\top (A^\top y), \quad (AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A$$

Invertible matrix. A square matrix $A \in \mathbb{R}^{d \times d}$ is invertible (nonsingular) if there exists $A^{-1}$ such that $AA^{-1} = A^{-1} A = I$. Equivalent: $\text{rank}(A) = d$ (full rank), $\ker(A) = \{0\}$ (trivial kernel), $\det(A) \neq 0$ (nonzero determinant).
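
The adjoint identity and the equivalent invertibility criteria above can be verified on small examples. This is a minimal sketch; the random seed and the matrix `M` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, arbitrary choice
A = rng.standard_normal((3, 4))      # A maps R^4 -> R^3
x = rng.standard_normal(4)
y = rng.standard_normal(3)

# Adjoint property: (Ax)^T y == x^T (A^T y)
lhs = (A @ x) @ y
rhs = x @ (A.T @ y)
print(np.isclose(lhs, rhs))

# Invertibility criteria on a small square example (det = 5)
M = np.array([[2., 1.],
              [1., 3.]])
full_rank = np.linalg.matrix_rank(M) == M.shape[0]
nonzero_det = not np.isclose(np.linalg.det(M), 0.0)
M_inv = np.linalg.inv(M)
print(full_rank, nonzero_det, np.allclose(M @ M_inv, np.eye(2)))
```

Forming `M_inv` explicitly is done here only to check the definition; for solving systems, a solver call is usually preferable.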

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rank-nullity theorem. For $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$ Consequences: If $\text{rank}(A) < d$, solutions to $Ax = b$ are not unique (null space is non-trivial). For invertible $A$ (rank = $d$), solutions are unique.

  2. Fundamental theorem of linear algebra. Orthogonal decomposition: $\mathbb{R}^d = \text{col}(A^\top) \oplus \text{null}(A)$ and $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$ (orthogonal direct sums). Basis for all linear algebra.

  3. Universal approximation (Cybenko 1989, Hornik 1991). A neural network with one hidden layer (linear map + nonlinearity + output linear map) can approximate any continuous function on compact sets arbitrarily well (with enough hidden units).

  4. Spectral theorem for symmetric matrices (Hamilton, Sylvester, 1850s-1880s). Every symmetric $A$ has eigendecomposition $A = U \Lambda U^\top$ (orthogonal diagonalization). Basis for PCA, optimization, understanding symmetric structures.

  5. Singular Value Decomposition (Beltrami 1873, Eckart-Young 1936). Every matrix $A \in \mathbb{R}^{m \times d}$ can be written as $A = U \Sigma V^\top$ (orthogonal $U, V$, diagonal $\Sigma$). Reveals low-rank structure, optimal approximations, conditioning.

Why essential: These theorems quantify what linear maps can/cannot represent, how to invert them, when solutions exist, and how to find optimal approximations.
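
To make the orthogonal decomposition $\mathbb{R}^d = \text{col}(A^\top) \oplus \text{null}(A)$ concrete, the sketch below splits a vector into its row-space and null-space parts. It assumes the standard fact that `pinv(A) @ A` is the orthogonal projector onto the row space; the matrix and vector are illustrative.

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])         # rank 1, maps R^3 -> R^2
x = np.array([1., 0., -1.])

# pinv(A) @ A is the orthogonal projector onto the row space col(A^T).
P_row = np.linalg.pinv(A) @ A
x_row = P_row @ x                    # component in col(A^T)
x_null = x - x_row                   # component in null(A)

print(np.allclose(A @ x_null, 0.0))      # x_null lies in the kernel
print(np.isclose(x_row @ x_null, 0.0))   # the two parts are orthogonal
print(np.allclose(A @ x, A @ x_row))     # only the row-space part affects Ax
```

The last check illustrates the "information lost" reading of the kernel: adding any null-space vector to the input leaves the output unchanged.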

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. Backpropagation and gradient-based learning (Rumelhart et al. 1986, 1990s-present). Automatic differentiation computes gradients via chain rule (composition of matrix transposes). Enables training networks with billions of parameters. All modern deep learning depends on this.

  2. Dense neural networks (Cybenko 1989, Hornik 1991, 1990s-present). Theoretical universality + practical training via backprop = powerful function approximators. AlexNet (2012) showed depth matters: stacking linear maps + activations learns hierarchical representations.

  3. Convolutional Neural Networks (LeCun et al. 1990, AlexNet 2012, ResNet 2015). Structured linear maps (convolution with weight sharing). Dramatically reduced parameters vs. dense. State-of-the-art on vision (ImageNet), object detection, segmentation.

  4. Recurrent Neural Networks and LSTMs (Hochreiter & Schmidhuber 1997, 2000s-present). Apply same linear map over time steps (sequence model for NLP, time series). Enabled machine translation, speech recognition.

  5. Transformers and Attention (Vaswani et al. 2017, Devlin et al. 2018, GPT series 2018-2023). All-attention architecture (linear projections + softmax + matrix multiply). Achieved state-of-the-art across NLP (GLUE, SuperGLUE), vision (ImageNet via ViT), multimodal (CLIP). Scales to trillions of parameters.

  6. Least squares for regression (Gauss, Legendre, Tikhonov, modern methods). Normal equations $(X^\top X) w = X^\top y$ solved via QR/SVD (numerically stable). Classical ML workhorse; fast closed-form solution, interpretable results.

Why essential: These systems achieve state-of-the-art by leveraging linear map structure (composition, transposes, efficient matrix multiply). Understanding linear algebra is necessary to design architectures, optimize, and debug.

Key ideas: Where it shows up

1. Backpropagation and Gradient Flow — Transpose carries errors backward#

Major achievements:

  • Backpropagation (Rumelhart, Hinton, Williams 1986): Efficient algorithm for computing gradients through neural networks via chain rule. Each layer applies $y = \sigma(h)$ with $h = W x + b$; the backward pass uses $\frac{\partial L}{\partial h} = \sigma'(h) \odot \frac{\partial L}{\partial y}$ and then $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial h}$ (transpose carries gradients).

  • Modern deep learning (1990s-2010s): Backprop enabled training of deep networks (10-1000+ layers). Scaling to billions of parameters (GPT, Vision Transformers).

  • Automatic differentiation (1980s-present): Frameworks (TensorFlow, PyTorch) implement backprop automatically by chaining vector-Jacobian products (transposed linear maps). Practitioners rarely write transposes explicitly; the framework handles it.

  • Applications: All supervised learning, reinforcement learning, generative models. Billions of backprop steps every day globally.

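Connection to linear maps: Forward pass chains linear maps with nonlinearities: $f = \sigma_k \circ A_k \circ \sigma_{k-1} \circ A_{k-1} \circ \cdots \circ \sigma_1 \circ A_1$. Writing $h_i = A_i x_{i-1}$, $x_i = \sigma_i(h_i)$, and $D_i = \text{diag}(\sigma_i'(h_i))$, the backward pass computes the gradient with respect to the input as $\nabla_{x_0} L = A_1^\top D_1 A_2^\top D_2 \cdots A_k^\top D_k \nabla_{x_k} L$ (a product of transposes interleaved with activation derivatives).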
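
The backward pass can be sketched for a two-layer network: at each layer, multiply the incoming gradient elementwise by the activation derivative, then by the transposed weight matrix. The layer sizes, `tanh` activation, squared-error loss, and seed below are assumptions made for illustration, and the result is checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((4, 3))     # layer 1: R^3 -> R^4
A2 = rng.standard_normal((2, 4))     # layer 2: R^4 -> R^2
x0 = rng.standard_normal(3)
target = rng.standard_normal(2)

def forward(x):
    h1 = A1 @ x
    x1 = np.tanh(h1)
    h2 = A2 @ x1
    x2 = np.tanh(h2)
    return h1, h2, x2

def loss(x):
    return 0.5 * np.sum((forward(x)[2] - target) ** 2)

# Backward pass: scale by tanh'(h) = 1 - tanh(h)^2, then apply the transpose.
h1, h2, x2 = forward(x0)
g = x2 - target                              # dL/dx2
g = A2.T @ ((1 - np.tanh(h2) ** 2) * g)      # dL/dx1
g = A1.T @ ((1 - np.tanh(h1) ** 2) * g)      # dL/dx0

# Central finite differences as an independent check.
eps = 1e-6
g_fd = np.array([(loss(x0 + eps * e) - loss(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(np.allclose(g, g_fd, atol=1e-6))
```

The layers are visited in reverse order in the backward pass, which mirrors the reversal rule $(AB)^\top = B^\top A^\top$.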

2. Neural Network Layers — Linear maps + activation functions#

Major achievements:

  • Dense layers (Rosenblatt Perceptron 1958, MLPs 1970s-1980s): Input $x$, linear map $h = Wx + b$, activation $y = \sigma(h)$ (ReLU, sigmoid, tanh). Each layer is a learnable linear map.

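  • Depth (ResNets, He et al. 2015; Transformers, Vaswani et al. 2017): 50-1000 layers. Skip connections $x_{i+1} = \sigma(W_i x_i + b_i) + x_i$ allow training very deep networks. The skip path is the identity map, so gradients also flow backward through a plain sum rather than only through products of weight matrices.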

  • Scaling (AlexNet 2012, GPT-3 2020, Gato 2022): Modern networks: billions to trillions of parameters. Matrix multiply dominates computation. Large linear maps $W \in \mathbb{R}^{4096 \times 4096}$ applied to batches.

  • Optimization: Understanding composition of linear maps helps explain generalization (implicit regularization favors low-complexity solutions in the span of data).

Connection to linear maps: Each dense layer is $W: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$. Network composes $W_k \circ \sigma \circ W_{k-1} \circ \sigma \circ \cdots \circ W_1$. Expressiveness comes from depth (composition) and nonlinearity ($\sigma$).

3. Attention Mechanism — Multi-head projections and weighted sums#

Major achievements:

  • Scaled dot-product attention (Vaswani et al. 2017): Queries, Keys, Values are projections (linear maps) $Q = XW_Q, K = XW_K, V = XW_V$. Attention weights $A = \text{softmax}(QK^\top / \sqrt{d_k})$. Output $\text{Attention}(Q,K,V) = AV$ (matrix multiply with softmax-weighted rows).

  • Multi-head attention: $h$ heads, each applying different linear projections. Concatenate: $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$, where $\text{head}_i$ is the attention output of head $i$ (linear map $W^O$ combines heads).

  • Transformers (Vaswani 2017, Devlin et al. 2018): Attention layers (all linear maps + softmax) in sequence. BERT, GPT achieve state-of-the-art across NLP tasks.

  • Scale: GPT-3 (175B parameters), PaLM (540B), GPT-4. Training scales across thousands of GPUs, with matrix multiplication as bottleneck.

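Connection to linear maps: Attention combines linear maps with one nonlinearity: $\text{Attention} = A V$ where $A = \text{softmax}(Q K^\top / \sqrt{d_k})$. Each head applies different linear projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$. For fixed weights $A$, the output is a weighted linear combination of values; because $A$ itself depends on the input through the softmax, the layer as a whole is not linear in $X$.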
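
A minimal NumPy sketch of multi-head attention follows. The sizes `n`, `d_model`, `h`, the random weights standing in for learned projections, and the local `softmax` helper are all assumptions for illustration; masking, biases, and batching are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2              # tokens, model width, heads (toy sizes)
d_k = d_model // h
X = rng.standard_normal((n, d_model))

heads = []
for _ in range(h):
    # Per-head projections (random stand-ins for learned weights)
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=1)   # (n, n), rows sum to 1
    heads.append(A @ V)                            # (n, d_k)

W_O = rng.standard_normal((h * d_k, d_model))      # output projection
out = np.concatenate(heads, axis=1) @ W_O          # (n, d_model)
print(out.shape, np.allclose(A.sum(axis=1), 1.0))
```

Every step except the softmax is a matrix product, which is why matrix multiplication dominates the cost.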

4. Least Squares and Regression — Normal equations as linear system#

Major achievements:

  • Least squares (Gauss, Legendre, early 1800s): Solve $\min_w \|Xw - y\|_2^2$. Normal equations: $(X^\top X) w = X^\top y$. This is a linear system $Aw = b$ with $A = X^\top X$ (product of two linear maps) and $b = X^\top y$.

  • Ridge regression (Tikhonov 1963, Hoerl & Kennard 1970): Add regularization $\min_w (\|Xw - y\|_2^2 + \lambda \|w\|_2^2)$. Solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$ (invertible for any $\lambda > 0$).

  • LASSO (Tibshirani 1996): L1 regularization forces sparsity. Solved via proximal methods, which alternate a gradient step on the least-squares term (matrix products with $X$ and $X^\top$) with soft-thresholding, a simple elementwise nonlinear operator.

  • Kernel methods (Mercer 1909, Schölkopf & Smola 2001): Non-linear regression via Gram matrix $K = X X^\top$ (product of linear maps, then apply kernel trick).

Connection to linear maps: Normal equations involve products of matrices: $X^\top X$ (composition of $X^\top$ and $X$), $X^\top y$ (linear map applied to $y$). Solution involves matrix inversion (inverse is also a linear map).
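
As a sketch of how the ridge solution can be computed without forming an inverse, the code below solves the regularized normal equations with `np.linalg.solve` and cross-checks against ordinary least squares on an augmented system. The synthetic data, `w_true`, and `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(20)
lam = 0.1                            # illustrative regularization strength

# Ridge: solve (X^T X + lam I) w = X^T y without forming an inverse.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Same problem written as ordinary least squares on an augmented system.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(3)])
y_aug = np.concatenate([y, np.zeros(3)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_aug))
```

The augmented form works because $\|X_{\text{aug}} w - y_{\text{aug}}\|_2^2 = \|Xw - y\|_2^2 + \lambda \|w\|_2^2$.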

5. Convolutional and Recurrent Networks — Structured linear maps#

Major achievements:

  • CNNs (LeCun et al. 1990s, AlexNet 2012, ResNet 2015): Convolutional layers are linear maps with weight sharing (same weights applied across spatial positions). Reduces parameters vs. dense layer (e.g., conv 3×3×64→64 channels vs. dense with same feature count).

  • RNNs, LSTMs (Hochreiter & Schmidhuber 1997): Recurrent layers apply the same linear map $W$ repeatedly over time: $h_t = \sigma(W h_{t-1} + U x_t)$ (composition of linear maps over time steps).

  • Efficiency: Weight sharing and structured matrices (convolution, recurrence) reduce parameters and computation compared to dense layers.

  • Interpretability: Convolutional structure learned by early layers is interpretable (edge filters, textures). Linear maps with structured sparsity/sharing have semantic meaning.

Connection to linear maps: Conv layer is a linear map (convolution can be written as matrix multiplication with Toeplitz structure). RNN applies same linear map repeatedly: composition $W \circ W \circ \cdots \circ W$ over time.
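
The Toeplitz view of convolution can be illustrated in one dimension. This sketch builds the matrix of a "valid" cross-correlation (the operation deep-learning libraries usually call convolution) and compares it with `np.correlate`; the signal and filter values are arbitrary.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])   # input signal, length n
k = np.array([1., 0., -1.])          # filter, length m
n, m = len(x), len(k)

# Build the (n - m + 1) x n matrix of the "valid" cross-correlation.
T = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    T[i, i:i + m] = k                # same weights, shifted one step per row

y_matrix = T @ x
y_direct = np.correlate(x, k, mode="valid")
print(y_matrix, np.allclose(y_matrix, y_direct))
```

The matrix `T` has 15 entries but only 3 distinct weights, which is the parameter saving that weight sharing provides.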

Notation

Standard Conventions#

1. Linear map and matrix notation.

  • Linear map: $f: V \to W$ or $A: \mathbb{R}^d \to \mathbb{R}^m$ (function notation).

  • Matrix representation: $A \in \mathbb{R}^{m \times d}$ or $[A]_{ij}$ for entry in row $i$, column $j$.

  • Matrix-vector product: $y = Ax$ (linear map applied to vector $x$).

  • Matrix-matrix product: $C = AB$ (composition: apply $B$ then $A$).

  • Image and kernel: $\text{im}(A)$ or $\text{col}(A)$ for column space; $\ker(A)$ or $\text{null}(A)$ for null space.

Examples:

  • Linear map: $f(x) = 3x_1 - 2x_2 \in \mathbb{R}$. Matrix: $A = [3, -2] \in \mathbb{R}^{1 \times 2}$.

  • Linear map: $(x, y) \mapsto (2x + y, x - 3y)$. Matrix: $A = \begin{bmatrix} 2 & 1 \\ 1 & -3 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$.

  • Composition: Neural network layer 1: $h_1 = \sigma(W_1 x)$, layer 2: $h_2 = \sigma(W_2 h_1) = \sigma(W_2 \sigma(W_1 x))$. Composition: $f = \sigma \circ (W_2 \circ \sigma \circ W_1)$.

2. Rank notation.

  • Rank: $\text{rank}(A)$ = dimension of column space = number of linearly independent columns.

  • Nullity: $\text{nullity}(A) = d - \text{rank}(A)$ (dimension of null space).

  • Full rank: $\text{rank}(A) = \min(m, d)$ (maximum possible rank).

  • Rank deficient: $\text{rank}(A) < \min(m, d)$ (singular or near-singular).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 2 (full rank), columns independent.

  • $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 1 (rank deficient), second column = 2 × first column.

3. Transpose notation.

  • Transpose: $A^\top$ (rows and columns swapped).

  • Adjoint property: $(Ax)^\top y = x^\top (A^\top y)$ (inner product duality).

  • Composition rule: $(AB)^\top = B^\top A^\top$ (note reversed order).

  • Inverse of transpose: $(A^\top)^{-1} = (A^{-1})^\top$ (for invertible $A$).

Examples:

  • $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$.

  • Gradient in backprop: $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (linear map $A$ → transpose $A^\top$ for gradient).

4. Composition and chaining notation.

  • Composition operator: $(f \circ g)(x) = f(g(x))$ (apply $g$ first, then $f$).

  • Matrix chaining: For $f = A, g = B$, composition is $f \circ g = A \circ B$ with matrix product $AB$ (apply $B$ then $A$).

  • Neural network layers: Output $h_i = \sigma_i(A_i h_{i-1})$ (chain $A_1, \sigma_1, A_2, \sigma_2, \ldots$).

Examples:

  • Rotate by $\theta$, then scale by $2$: $S_2 \circ R_\theta$. Matrix: $S_2 R_\theta$ (the map applied first sits on the right).

  • Neural network: $f(x) = \sigma_2(A_2 \sigma_1(A_1 x))$. Composition: $\sigma_2 \circ A_2 \circ \sigma_1 \circ A_1$.
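
With the convention $(f \circ g)(x) = f(g(x))$, the map applied first appears rightmost in the matrix product. The sketch below uses a non-uniform scaling `S` (an assumption made here so that the order visibly matters; a uniform scaling would commute with the rotation).

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotate by 90 degrees
S = np.array([[2., 0.],
              [0., 1.]])                          # stretch the first axis by 2

x = np.array([1., 0.])
rotate_then_scale = S @ R @ x        # R acts first, then S
scale_then_rotate = R @ S @ x        # S acts first, then R
print(rotate_then_scale, scale_then_rotate)
print(np.allclose(S @ R, R @ S))     # False: the two orders differ
```

Rotating first moves `x` onto the second axis, where the stretch has no effect; stretching first doubles its length before the rotation.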

5. Invertibility and determinant notation.

  • Invertible (nonsingular): $A^{-1}$ exists; $AA^{-1} = A^{-1} A = I$.

  • Determinant: $\det(A)$ or $|A|$. For invertibility: $\det(A) \neq 0 \Leftrightarrow A$ invertible.

  • Condition number: $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max} / \sigma_{\min}$ (ratio of largest to smallest singular value).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. $\det(A) = 2 \neq 0$, so $A$ is invertible. $A^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1/2 \end{bmatrix}$.

  • Ill-conditioned matrix: $\kappa(A) = 10^{10}$ (nearly singular). Small perturbations cause large changes in solution. Use regularization or preconditioning.
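
The effect of a large condition number can be seen on a nearly singular $2 \times 2$ system. The matrix, right-hand side, and perturbation size below are illustrative assumptions.

```python
import numpy as np

A = np.array([[1., 1.],
              [1., 1. + 1e-8]])      # nearly singular (illustrative)

# Condition number as the ratio of extreme singular values.
s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[-1]
print(kappa, np.linalg.cond(A))

# A tiny change in b produces a large change in the solution.
b = np.array([2., 2.])
x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + np.array([0., 1e-6]))
print(np.linalg.norm(x_pert - x))
```

A perturbation of size $10^{-6}$ in `b` moves the solution by a distance on the order of $10^2$, consistent with amplification by up to $\kappa(A)$.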

6. Special matrices notation.

  • Identity: $I \in \mathbb{R}^{d \times d}$ (diagonal matrix with 1’s).

  • Orthogonal/orthonormal: $Q^\top Q = QQ^\top = I$ (columns/rows orthonormal).

  • Symmetric: $A^\top = A$.

  • Positive semi-definite (PSD): $A \succeq 0$; all eigenvalues $\geq 0$. Covariance matrices are PSD.

Examples:

  • QR decomposition: $A = QR$ where $Q$ orthonormal, $R$ upper triangular.

  • Symmetric matrix: $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$. Eigendecomposition: $\Sigma = U \Lambda U^\top$ (orthonormal $U$, diagonal $\Lambda$).

  • PSD matrix: Covariance $\text{Cov}(X) \succeq 0$ (always PSD). Gram matrix $G = X^\top X \succeq 0$ (always PSD).

Pitfalls & sanity checks

When working with linear maps and matrices:

  1. Always check shapes. Matrix multiply requires compatible dimensions. $A \in \mathbb{R}^{m \times d}$, $x \in \mathbb{R}^d$ yields $Ax \in \mathbb{R}^m$. Shape mismatch = runtime error.

  2. Prefer stable decompositions. Avoid computing $(X^\top X)^{-1}$ explicitly. Use a QR- or SVD-based least-squares solver (truncating small singular values if needed) for numerical stability.

  3. Transpose order matters. $(AB)^\top = B^\top A^\top$ (reversed order). In backprop, composition reverses layer order via transposes.

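  4. Condition number determines stability. Solving with $A$ can lose roughly $\log_{10} \kappa(A)$ significant digits, so if $\kappa(A) > 10^8$ expect about half of double precision's ~16 digits to be unreliable. Use regularization (Ridge, Tikhonov) or preconditioning.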

  5. Gradients flow via transposes. Backprop systematically applies transposes. Understand: ill-conditioned weights → vanishing/exploding gradients.
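
One reason to avoid forming $X^\top X$ is that its condition number is the square of that of $X$. The sketch below illustrates this with two nearly collinear columns (a made-up example) and solves the least-squares problem with `np.linalg.lstsq`, which works on $X$ directly.

```python
import numpy as np

# Two nearly collinear columns (illustrative).
t = np.linspace(0., 1., 50)
X = np.column_stack([t, t + 1e-5 * t**2])

kappa_X = np.linalg.cond(X)
kappa_gram = np.linalg.cond(X.T @ X)     # roughly kappa_X squared
print(kappa_X, kappa_gram)

# SVD-based least squares operates on X itself, not on X^T X.
y = X @ np.array([1.0, 1.0])
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.abs(w_lstsq - 1.0).max())       # error of the recovered weights
```

Squaring the condition number roughly doubles the number of digits at risk, which is why QR- or SVD-based solvers are the usual recommendation.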

References

Foundational texts:

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge University Press.

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

Linear maps and matrix theory:

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

  2. Hoffman, K., & Kunze, R. (1971). Linear Algebra (2nd ed.). Prentice-Hall.

  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

  4. Axler, S., Bourdon, P., & Ramey, W. (2001). Harmonic Function Theory (2nd ed.). Springer.

Neural networks and backpropagation:

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533–536.

  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  3. Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM.

  4. LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). “Efficient backprop.” In Neural Networks: Tricks of the Trade (pp. 9–50). Springer.

Optimization:

  1. Robbins, H., & Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22(3), 400–407.

  2. Nesterov, Y. (2018). Lectures on Convex Optimization (2nd ed.). Springer.

  3. Kingma, D. P., & Ba, J. (2014). “Adam: A method for stochastic optimization.” arXiv:1412.6980.

Transformers and attention:

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). “Attention is all you need.” In NeurIPS (pp. 5998–6008).

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding.” NAACL.

  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). “An image is worth 16×16 words: Transformers for image recognition at scale.” ICLR.

Least squares and numerical methods:

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Dover reprint.

  2. Golub, G. H., & Pereyra, V. (1973). “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate.” SIAM Journal on Numerical Analysis, 10(2), 413–432.

Five worked examples

Worked Example 1: Backprop uses transpose#

Problem. For y=Wx, show ∂L/∂x = W^T ∂L/∂y.

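Solution (math). Since y_i = Σ_j W_ij x_j, the Jacobian ∂y/∂x is W. The chain rule gives ∂L/∂x_j = Σ_i W_ij ∂L/∂y_i, which is the j-th entry of W^T ∂L/∂y. Reverse mode therefore multiplies the upstream gradient by the transpose.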

Solution (Python).

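import numpy as np
W = np.array([[2., 1.], [-1., 3.]])   # layer weights, y = W x
dL_dy = np.array([0.5, -2.])          # upstream gradient dL/dy
dL_dx = W.T @ dL_dy                   # reverse mode: multiply by the transpose
print(dL_dx)                          # dL/dx = (3, -5.5)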

Worked Example 2: Q,K,V projections in transformers#

Problem. Compute Q=XW_Q, K=XW_K, V=XW_V.

Solution (math). These are linear maps from model dimension to head dimensions.

Solution (Python).

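import numpy as np
X = np.array([[1., 0.], [0., 1.], [1., 1.]])   # 3 tokens, model dimension 2
Wq = np.array([[1., 0.], [0., 2.]])            # query projection
Wk = np.array([[2., 0.], [0., 1.]])            # key projection
Wv = np.array([[1., 1.], [0., 1.]])            # value projection
# Each product maps every token (row of X) into the head's 2-dimensional space.
print(X @ Wq)
print(X @ Wk)
print(X @ Wv)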

Worked Example 3: Normal equations matrix#

Problem. Form A=X^TX and b=X^Ty for least squares.

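Solution (math). Setting the gradient of ||Xw-y||^2 to zero gives A w=b, so every solution of the normal equations is a least-squares minimizer. The minimizer is unique when X has full column rank, because A is then invertible.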

Solution (Python).

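import numpy as np
X = np.array([[1., 1.], [1., 2.], [1., 3.]])   # intercept column plus one feature
y = np.array([1., 2., 2.5])
A = X.T @ X                                     # 2x2 normal-equations matrix
b = X.T @ y
print(A)
print(b)
print(np.linalg.solve(A, b))                    # least-squares weights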

Worked Example 4: Batch GD as matrix products#

Problem. Compute one gradient step for MSE.

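Solution (math). For the loss L(w)=(1/2n)||Xw-y||^2 the gradient is (1/n)X^T(Xw-y), so one step is w←w-η(1/n)X^T(Xw-y). Without the factor 1/2 in the loss, the gradient picks up a factor of 2 that can be absorbed into η.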

Solution (Python).

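import numpy as np
X = np.array([[1., 2.], [3., 4.], [5., 6.]])
y = np.array([1., 0., 1.])
w = np.zeros(2)                       # initial weights
eta = 0.1                             # learning rate
n = len(X)                            # number of samples
g = X.T @ (X @ w - y) / n             # gradient of (1/2n)||Xw - y||^2
print(w - eta * g)                    # weights after one step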

Worked Example 5: Attention is matrix multiplication#

Problem. Compute A=softmax(QK^T/√d) and output O=AV.

Solution (math). Attention is a composition of matrix multiplications plus a row-wise softmax.

Solution (Python).

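import numpy as np
def softmax(z, axis=-1):
    # Numerically stable softmax; assumed stand-in for the repo helper scripts.toy_data.softmax
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)
Q = np.array([[1., 0.], [0., 1.]])             # 2 queries
K = np.array([[1., 0.], [1., 1.], [0., 1.]])   # 3 keys
V = np.array([[1., 0.], [0., 2.], [1., 1.]])   # 3 values
scores = Q @ K.T / np.sqrt(2)                  # d = 2 is the key dimension
A = softmax(scores, axis=1)                    # each row sums to 1
print(A @ V)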

Chapter 3
Basis & Dimension
Key ideas: Introduction

Introduction#

Basis and dimension are the language for measuring and reasoning about vector spaces. A basis is a minimal spanning set—a collection of linearly independent vectors that can be scaled and added to represent every vector in the space. The dimension is simply the size of a basis (number of basis vectors). These concepts are ubiquitous in ML:

  • PCA: Principal components form a basis for a lower-dimensional subspace capturing most data variance.

  • Autoencoders: Encoder learns a basis for latent representations (bottleneck layer); decoder reconstructs using this basis.

  • Neural networks: Each layer’s hidden activations form a basis for representations learned by that layer.

  • Whitening/normalization: Change of basis to decorrelate and rescale features (covariance becomes identity).

  • Sparse coding: Find a basis (dictionary) that sparsely represents data.

Understanding basis and dimension clarifies model capacity (how many independent directions can the model control?), data complexity (how many dimensions does the data actually occupy?), and numerical stability (are basis vectors nearly parallel, i.e., ill-conditioned?).

Important Ideas#

1. Basis = minimal spanning set. A basis $\{v_1, \ldots, v_d\}$ for subspace $S$ satisfies:

  • Spans $S$: Every vector in $S$ is a linear combination $s = \sum_{i=1}^d \alpha_i v_i$.

  • Linearly independent: No $v_j$ is a linear combination of the others. Equivalently, $\sum_{i=1}^d \alpha_i v_i = 0 \Rightarrow \alpha_1 = \cdots = \alpha_d = 0$.

Why minimal? If any vector is removed, the set no longer spans $S$. If any vector is linearly dependent on others, it’s redundant.

Uniqueness of representation: For basis $\{v_1, \ldots, v_d\}$, each vector $s \in S$ has unique coefficients: if $s = \sum_i \alpha_i v_i = \sum_i \beta_i v_i$, then $\alpha_i = \beta_i$ for all $i$ (follows from linear independence).

2. Dimension = basis size. All bases for a subspace have the same number of vectors. This number is the dimension: $\dim(S) = $ number of vectors in any basis for $S$. For matrix $A \in \mathbb{R}^{m \times n}$: $$ \dim(\text{col}(A)) = \text{rank}(A), \quad \dim(\text{null}(A)) = \text{nullity}(A) = n - \text{rank}(A) $$

Why constant? Different bases may use different vectors, but all bases have the same cardinality. Changing basis doesn’t change dimension (it’s a property of the subspace, not the basis).

3. Change of basis. The same vector has different coordinates in different bases. For bases $\mathcal{B} = \{v_1, \ldots, v_d\}$ and $\mathcal{B}' = \{v'_1, \ldots, v'_d\}$, the change of basis matrix $P = [v'_1 | \cdots | v'_d]$ (columns are new basis vectors in old basis) satisfies: $$ v_\text{old basis} = P v_\text{new basis}, \quad v_\text{new basis} = P^{-1} v_\text{old basis} $$

In ML: Changing basis = change of representation. PCA rotates to principal component basis. Whitening rotates and rescales to a decorrelated basis (covariance = identity).
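
A small numerical sketch of the change-of-basis relations: the columns of `P` hold the new basis vectors in old coordinates, and new coordinates are found by solving a linear system. The particular `P` and `v_old` are illustrative.

```python
import numpy as np

# New basis vectors written in the old (standard) coordinates, as columns of P.
P = np.array([[1., 1.],
              [0., 2.]])
v_old = np.array([3., 4.])

# Solve P @ v_new = v_old rather than forming P^{-1} explicitly.
v_new = np.linalg.solve(P, v_old)
print(v_new)
print(np.allclose(P @ v_new, v_old))   # converting back recovers v_old
```

Here `v_old` equals 1 times the first new basis vector plus 2 times the second, so its new coordinates are `(1, 2)`.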

4. Standard basis and explicit coordinates. The standard basis in $\mathbb{R}^d$ is $\{e_1, \ldots, e_d\}$ where $e_i = [0, \ldots, 1, \ldots, 0]^\top$ (1 in position $i$). For this basis: $$ v = [v_1, \ldots, v_d]^\top = \sum_{i=1}^d v_i e_i $$

Coordinates in the standard basis are just the vector’s components. Embedding lookups use one-hot basis: to get embedding of token $i$, multiply by $e_i$ (selection vector).
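
The one-hot view of an embedding lookup can be sketched as follows. The table `E`, its sizes, and the token index are made-up values; real libraries index the row directly rather than multiplying.

```python
import numpy as np

vocab_size, emb_dim = 5, 3                     # toy sizes
E = np.arange(vocab_size * emb_dim, dtype=float).reshape(vocab_size, emb_dim)

i = 2                                          # token index
e_i = np.zeros(vocab_size)
e_i[i] = 1.0                                   # one-hot standard basis vector

via_matmul = e_i @ E                           # e_i^T E selects row i
via_index = E[i]                               # direct row lookup
print(via_matmul, np.array_equal(via_matmul, via_index))
```

Both routes return the same row; the indexing form simply skips the multiplications by zero.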

Relevance to Machine Learning#

Model capacity and expressiveness. A model operating in a $d$-dimensional space can control at most $d$ independent directions. Linear regression with $d$ features has $d$ free coefficients, so it can exactly interpolate $d$ data points whose feature vectors are linearly independent, but generally not more. Deep networks learn hierarchical bases: layer $\ell$ learns a basis for its activation space (dimension = number of hidden units).

Data intrinsic dimensionality. Real data often lies in a lower-dimensional subspace (manifold hypothesis). PCA finds a basis for the dominant subspace; if top $k$ eigenvalues capture 95% of variance, data is approximately $k$-dimensional. This justifies dimensionality reduction with little information loss.

Numerical stability and conditioning. Basis properties affect computation: orthonormal bases (columns have norm 1, mutually perpendicular) are numerically stable (condition number = 1). Nearly parallel basis vectors (ill-conditioned) cause numerical errors in linear algebra operations.

Feature engineering and representation learning. Hand-crafted features (e.g., polynomial features, Fourier basis) are explicit bases. Neural networks learn implicit bases (hidden layers) that are optimized for the task. Autoencoders learn an optimal basis for data reconstruction (variational autoencoders add structure to this basis).

Algorithmic Development History#

1. Coordinate geometry (Descartes & Fermat, 1630s-1640s). Cartesian coordinates introduced the standard basis for Euclidean space. Descartes’ La Géométrie (1637) showed geometry could be expressed algebraically using coordinate axes.

2. Change of basis and coordinate transformations (Euler 1770s, Cauchy 1820s-1830s). Euler rotated coordinate systems to simplify equations. Cauchy formalized change of basis through matrix transformations.

3. Linear independence and dimension (Grassmann 1844, Peano 1888). Grassmann introduced independence axiomatically. Peano formalized dimension as the size of maximal independent sets.

4. Orthonormal bases and orthogonalization (Schmidt 1907, Gram-Schmidt process). Erhard Schmidt proved every finite-dimensional space admits an orthonormal basis. Gram-Schmidt orthogonalization computes one constructively.

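5. Eigendecomposition and spectral bases (Schur 1909, 1920s-1930s). Schur showed every square complex matrix is unitarily similar to an upper triangular matrix with its eigenvalues on the diagonal; symmetric matrices are fully diagonalized by an orthonormal eigenvector basis. Eigenvalues/eigenvectors define natural bases (spectral decomposition).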

6. PCA and dimensionality reduction (Pearson 1901, Hotelling 1933, SVD 1960s). PCA finds the basis vectors (principal components) that best capture data variance. SVD algorithm (Golub & Kahan 1965) computes this stably.

7. Neural basis learning (1980s-2010s). Neural networks learn basis implicitly: hidden layers learn representations that act as bases for downstream computations. Deeper networks learn hierarchical bases (abstract high-level concepts from concrete low-level features).

8. Dictionary learning and sparse coding (2000s). Learn overcomplete bases (more basis vectors than dimension) that sparsely represent data. Applications: image denoising (K-SVD, Aharon et al. 2006), signal processing.

Definitions#

Basis. A set $\mathcal{B} = \{v_1, \ldots, v_d\}$ is a basis for subspace $S$ if:

  1. $\text{span}(\mathcal{B}) = S$ (spans the subspace).

  2. Vectors in $\mathcal{B}$ are linearly independent.

Every vector in $S$ has a unique representation: $s = \sum_{i=1}^d \alpha_i v_i$ (coordinates $\alpha_i$ are unique).

Dimension. The dimension of subspace $S$ is: $$ \dim(S) = \text{size of any basis for } S $$ This is well-defined: all bases for $S$ have the same size.

Rank and nullity. For matrix $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) = \dim(\text{col}(A)) = \dim(\text{row}(A)) $$ $$ \text{nullity}(A) = \dim(\text{null}(A)) = n - \text{rank}(A) $$

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = n$.

Change of basis matrix. For bases $\mathcal{B} = \{v_1, \ldots, v_d\}$ and $\mathcal{B}' = \{v'_1, \ldots, v'_d\}$, the change of basis matrix $P = [v'_1 | \cdots | v'_d]$ has as its columns the vectors of $\mathcal{B}'$ written in coordinates of $\mathcal{B}$. For coordinates $[v]_\mathcal{B}$ (in basis $\mathcal{B}$) and $[v]_{\mathcal{B}'}$ (in basis $\mathcal{B}'$): $$ [v]_\mathcal{B} = P [v]_{\mathcal{B}'}, \quad [v]_{\mathcal{B}'} = P^{-1} [v]_\mathcal{B} $$

Coordinates. For vector $v \in S$ and basis $\mathcal{B} = \{v_1, \ldots, v_d\}$, the coordinates are scalars $[\alpha_1, \ldots, \alpha_d]^\top$ satisfying: $$ v = \sum_{i=1}^d \alpha_i v_i $$ Coordinates are unique (follows from linear independence).

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rank-nullity theorem. For $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) + \dim(\text{null}(A)) = n $$ Consequences: If $\text{rank}(A) = r < n$, then $\dim(\text{null}(A)) = n - r$, so the solution set of a consistent system $Ax = b$ is an $(n - r)$-dimensional affine subspace. If $\text{rank}(A) = n$ (full column rank), solutions are unique (if they exist).

  2. Basis existence (Steinitz 1913). Every vector space has a basis. For finite-dimensional spaces: basis exists, all bases have same size (dimension).

  3. Dimension and approximation (approximation theory). Best $k$-dimensional approximation to $x$ is $\text{proj}_{S_k} x$ (projection onto $k$-dimensional subspace). The $k$-dimensional subspace with minimum approximation error is $\text{span}\{u_1, \ldots, u_k\}$ (top $k$ singular vectors of data matrix).

  4. VC dimension and capacity (Vapnik & Chervonenkis 1971). For linear classifiers in $\mathbb{R}^d$: VC dimension = $d + 1$. Capacity grows with dimension (more basis dimensions = more expressiveness).

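  5. Matrix rank and complexity. Rank determines complexity: an $m \times n$ matrix of rank $r$ can be written as $UV^\top$ with $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$, so it is described by $O(r(m+n))$ parameters (vs. $mn$ for general matrices). Low-rank approximation is easier to learn (fewer degrees of freedom).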

Why essential: These theorems quantify relationships between dimension, expressiveness, and solution complexity.
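
The low-rank statements above can be illustrated with a truncated SVD. This is a sketch on synthetic data (a rank-3 matrix plus small noise, with arbitrary sizes and seed); it checks the Eckart-Young identity that the Frobenius error of the best rank-$r$ approximation equals the energy in the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 30, 20, 3
# Rank-r matrix plus small noise (synthetic data).
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
M += 0.01 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = (U[:, :r] * s[:r]) @ Vt[:r]              # best rank-r approximation

err = np.linalg.norm(M - M_r, "fro")
tail = np.sqrt(np.sum(s[r:] ** 2))             # energy in discarded singular values
print(r * (m + n), m * n)                      # parameter counts: 150 vs 600
print(np.isclose(err, tail))
```

Storing the two thin factors takes `r * (m + n)` numbers instead of `m * n`, which is the saving that low-rank models exploit.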

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. PCA for dimensionality reduction (Turk & Pentland 1991). Eigenfaces for face recognition: project face images onto top 50-100 eigenvectors (basis), achieve real-time recognition. Dimension reduction from 10,000 pixels to ~100 PCA coordinates.

  2. Whitening for preprocessing (LeCun et al. 1998). Decorrelate and rescale features: change of basis to make covariance matrix identity. Improves gradient-based optimization (condition number = 1), so a single learning rate works well in every direction.

  3. Batch normalization (Ioffe & Szegedy 2015). Normalize layer activations: change of basis (center and rescale). Speeds training (the original paper reports reaching the same accuracy in 14 times fewer training steps), enables higher learning rates, and was motivated as reducing internal covariate shift.

  4. Autoencoders for representation learning (Hinton & Salakhutdinov 2006, 2010s-present). Learn non-linear basis via encoder bottleneck. Applications: image compression, anomaly detection, generative models (VAE, 2013).

  5. Transformer attention with multiple bases (Vaswani et al. 2017, Devlin et al. 2018). 8-64 attention heads = 8-64 different basis projections. Achieved state-of-the-art on GLUE (NLP benchmark), ImageNet-scale vision (ViT), multimodal (CLIP).

  6. Dictionary learning and sparse coding (Aharon et al. 2006, Mairal et al. 2009). Learn overcomplete basis that sparsely represents data. Applications: image denoising (matched K-SVD, PSNR improvement 2-5dB), face recognition (sparse representation).

Why essential: These systems achieve state-of-the-art by exploiting basis structure (dimensionality reduction, whitening, learned bases, attention heads). Understanding dimension and basis is necessary to design architectures and interpret representations.

Key ideas: Where it shows up

1. Principal Component Analysis (PCA) — Optimal basis for dimensionality reduction#

Major achievements:

  • Pearson (1901), Hotelling (1933): Formalized finding axes (basis vectors) of maximum variance. Principal components are orthonormal basis vectors.

  • Eigendecomposition solution: Eigenvectors of covariance matrix $C = \frac{1}{n} X_c^\top X_c$ are principal components. Top $k$ eigenvectors form a $k$-dimensional basis capturing maximum variance.

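  • SVD connection (Eckart-Young 1936): Truncated SVD $X_c \approx U_k \Sigma_k V_k^\top$ of the centered data gives the same basis as PCA. The $k$ rows of $V_k^\top$ are the principal directions (basis vectors); the rows of $U_k \Sigma_k = X_c V_k$ are the principal component coordinates of the data points.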

  • Modern applications: Preprocessing (feature whitening), visualization (2D/3D plots from high-D data), compression (keep top $k$ components).

Connection to basis: PCA finds an orthonormal basis $\{u_1, \ldots, u_k\}$ (eigenvectors) for the principal subspace. Each centered data point $x_i - \mu$ has coordinates $(u_1^\top (x_i - \mu), \ldots, u_k^\top (x_i - \mu))$ in this basis (dimension reduction from $d$ to $k$).
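
The two routes to PCA coordinates described above can be compared in a minimal sketch. The synthetic data, the variable names, and the choice $k = 2$ are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 points in 3-D with unequal variances.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
Xc = X - X.mean(axis=0)                 # center before PCA

# Route 1: eigenvectors of the covariance matrix C = Xc^T Xc / n.
C = Xc.T @ Xc / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
Z_eig = Xc @ eigvecs[:, :k]             # coordinates u_j^T (x_i - mu)
Z_svd = U[:, :k] * S[:k]                # U_k Sigma_k, same up to column signs
print(np.allclose(np.abs(Z_eig), np.abs(Z_svd)))
print(np.allclose(eigvals, S**2 / Xc.shape[0]))
```

Eigenvectors and singular vectors are only determined up to sign, which is why the comparison uses absolute values.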

2. Autoencoders and Latent Representations — Learned bases#

Major achievements:

  • Autoencoders (1980s-1990s): Neural networks learn a bottleneck (low-dimensional basis). Encoder compresses to basis coordinates; decoder reconstructs from coordinates.

  • Variational Autoencoders (Kingma & Welling 2013): Adds probabilistic structure: basis coordinates are drawn from Gaussian prior. Enables generative modeling (sampling new data).

  • Disentangled representations (2010s): Learn bases where each coordinate captures an interpretable factor (e.g., pose, lighting in faces). Beta-VAE encourages disentanglement.

  • Applications: Image generation (basis for visual features), anomaly detection (reconstruction error when input outside learned basis), data compression.

Connection to basis: Encoder learns a basis $\mathcal{B} = \{\text{hidden activations}\}$ for a low-dimensional latent space. Decoder reconstructs using this basis.

3. Deep Neural Networks — Hierarchical basis learning#

Major achievements:

  • Universal approximation (Cybenko 1989): Hidden layer activations act as a family of basis functions for representing functions. A single hidden layer with enough units suffices to approximate any continuous function on a compact set arbitrarily well.

  • Hierarchy of bases (Bengio 2013): Deep networks learn multiple bases: low layers learn simple bases (edges, textures), high layers learn complex bases (objects, concepts).

  • Representation learning (LeCun et al. 2015): Deep networks optimize their learned representations (bases) for the task at hand, giving a task-specific basis rather than a generic one such as PCA.

  • Scale (AlexNet 2012, ResNet 2015, Vision Transformers 2020): Depth enables learning of rich hierarchical bases, enabling state-of-the-art performance on complex tasks.

Connection to basis: Layer $\ell$ operates in a $d_\ell$-dimensional activation space (dimension = number of hidden units). Weights $W_\ell$ form a basis (or part of one) for transforming layer $\ell-1$ representations to layer $\ell$ basis.

4. Feature Engineering and Feature Spaces — Hand-crafted vs. learned bases#

Major achievements:

  • Polynomial basis (classical ML): Augment features with powers: $[x_1, x_2, x_1^2, x_1 x_2, x_2^2, \ldots]$ (explicit basis for nonlinear decision boundaries).

  • Fourier basis (signal processing): Decompose signals using Fourier basis $\{\cos(k\omega t), \sin(k\omega t)\}_{k=0}^\infty$ (frequency domain basis).

  • Radial Basis Functions (RBF networks, 1980s): Basis functions centered at data points (e.g., Gaussian bumps). Kernel methods implicitly use infinite RBF basis.

  • Learned bases (deep learning, 2010s-present): End-to-end training optimizes basis (feature space) jointly with downstream task, outperforming hand-crafted bases.

Connection to basis: Features are coordinates in an explicit basis space. Deep learning learns the basis implicitly through weight optimization.
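
To make "features are coordinates in an explicit basis" concrete, here is a small sketch with a hand-crafted polynomial basis. The quadratic target and its coefficients are assumptions chosen so the answer is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 - 2.0 * x + 3.0 * x**2          # noiseless quadratic target (assumption)

# Hand-crafted basis: columns are the monomials 1, x, x^2.
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Least squares finds the coordinates of y in this basis.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 6))                   # recovers the coefficients of the target
```

Because the target lies exactly in the span of the three basis columns, the fitted coordinates match the generating coefficients up to rounding.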

5. Transformer Attention and Multi-Head Projections — Multiple basis subspaces#

Major achievements:

  • Multi-head attention (Vaswani et al. 2017): Project inputs to $h$ different subspaces (heads), each with its own basis. Enables learning multiple relationships in parallel.

  • Scaled dot-product (Vaswani 2017): Each head computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where the rows of $V$ (the value vectors) span the subspace that contains every output.

  • BERT (Devlin et al. 2018): Bidirectional Transformers with 12-24 layers and 12-16 attention heads per layer. Different heads learn different linguistic bases (syntax, semantics, discourse).

  • Applications: Machine translation (parallel bases for source/target), question answering (basis for understanding), language generation.

Connection to basis: With model width $d_{\text{model}}$ and $h$ heads, each attention head projects to a subspace of dimension $d_v = d_{\text{model}} / h$. Multi-head concatenation combines the $h$ resulting representations.

Notation

Standard Conventions#

1. Basis notation.

  • Basis: $\mathcal{B} = \{v_1, \ldots, v_d\}$ (set of basis vectors).

  • Basis matrix: $V = [v_1 | \cdots | v_d] \in \mathbb{R}^{n \times d}$ (columns are basis vectors).

  • Standard basis: $\{e_1, \ldots, e_d\}$ where $e_i$ has 1 in position $i$, 0 elsewhere.

  • Coordinates: $[v]_\mathcal{B} = [\alpha_1, \ldots, \alpha_d]^\top$ satisfying $v = \sum_i \alpha_i v_i$.

Examples:

  • Standard basis for $\mathbb{R}^3$: $e_1 = [1, 0, 0]^\top, e_2 = [0, 1, 0]^\top, e_3 = [0, 0, 1]^\top$.

  • Vector $v = [3, -1, 2]^\top$ has coordinates $[3, -1, 2]^\top$ in standard basis.

  • PCA basis for 2D data: $\mathcal{B} = \{u_1, u_2\}$ (top 2 eigenvectors). Coordinates $[v]_\mathcal{B} = [u_1^\top v, u_2^\top v]^\top$.
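
For a basis that is not orthonormal, the coordinates $[v]_\mathcal{B}$ come from solving the linear system $V\alpha = v$. A minimal sketch, with a hypothetical basis and vector chosen for illustration:

```python
import numpy as np

# A non-standard basis for R^2 (hypothetical example vectors).
V = np.array([[1.0, 1.0],
              [0.0, 2.0]])              # columns v_1 = [1, 0], v_2 = [1, 2]
v = np.array([3.0, 4.0])

# Coordinates alpha solve V @ alpha = v.
alpha = np.linalg.solve(V, v)
print(alpha)                            # coefficients on v_1 and v_2
print(np.allclose(V @ alpha, v))        # reconstruction check
```

Here $v = 1 \cdot v_1 + 2 \cdot v_2$, so the coordinates differ from the standard-basis entries of $v$.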

2. Dimension notation.

  • Dimension: $\dim(V)$ or $\dim(S)$ for subspace $S$.

  • Rank: $\text{rank}(A)$ = dimension of column space (equivalently, row space).

  • Nullity: $\text{nullity}(A) = \dim(\text{null}(A)) = n - \text{rank}(A)$.

Examples:

  • For $X \in \mathbb{R}^{100 \times 50}$ (100 examples, 50 features), if $\text{rank}(X) = 30$, then $\dim(\text{col}(X)) = \dim(\text{row}(X)) = 30$ (the 100 data rows lie in a 30-dimensional subspace of $\mathbb{R}^{50}$).

  • If $\text{rank}(X) = 30 < 50$, then $\text{nullity}(X) = 50 - 30 = 20$ (when $Xw = y$ has a solution, the solution set is a 20-dimensional affine subspace).
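
The rank and nullity in this example can be checked numerically. The sketch below builds a $100 \times 50$ matrix of rank 30 as a product of random thin factors, which is an assumption made only to get a matrix with known rank.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 100 x 50 matrix of rank 30 as a product of thin factors (assumption).
X = rng.normal(size=(100, 30)) @ rng.normal(size=(30, 50))

r = np.linalg.matrix_rank(X)
nullity = X.shape[1] - r                # rank-nullity: r + nullity = 50

# Rows of Vt beyond the first r span null(X).
_, _, Vt = np.linalg.svd(X)
N = Vt[r:].T                            # 50 x 20 null-space basis
print(r, nullity, N.shape)
print(np.allclose(X @ N, 0.0, atol=1e-8))
```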

3. Change of basis notation.

  • Basis transition: $\mathcal{B} \to \mathcal{B}'$ (change from basis $\mathcal{B}$ to $\mathcal{B}'$).

  • Transition matrix: $P = [v'_1 | \cdots | v'_d]$ (new basis vectors in old coordinates).

  • Coordinates transform: $[v]_\mathcal{B} = P [v]_{\mathcal{B}'}$ (old basis = transition matrix times new basis coordinates).

Examples:

  • Rotate from standard basis to basis aligned with principal components: $P = [u_1 | \cdots | u_d]$ (eigenvectors of covariance matrix).

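  • Whitening: $W = \Lambda^{-1/2} U^\top$ (inverse square root of covariance eigenvalues, times eigenvectors transpose) maps centered data to decorrelated, unit-variance coordinates. In the convention above it plays the role of the inverse transition matrix: $W = P^{-1}$ with $P = U \Lambda^{1/2}$.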
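
The convention "old coordinates = $P$ times new coordinates" is easy to get backwards, so a small numerical check may help. The 45-degree rotation and the test vector are assumptions chosen for illustration.

```python
import numpy as np

# New basis vectors written in old (standard) coordinates: a 45-degree rotation.
theta = np.pi / 4
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v_old = np.array([1.0, 1.0])
v_new = np.linalg.solve(P, v_old)       # [v]_{B'} = P^{-1} [v]_B
print(v_new)
print(np.allclose(P @ v_new, v_old))    # old = P @ new
```

The vector $[1, 1]^\top$ points along the first rotated axis, so its new coordinates are $[\sqrt{2}, 0]^\top$.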

4. Orthonormal basis notation.

  • Orthonormal: Basis vectors $\{q_1, \ldots, q_d\}$ satisfy $\|q_i\|_2 = 1$ and $q_i^\top q_j = 0$ for $i \neq j$.

  • Orthonormal matrix: $Q = [q_1 | \cdots | q_d]$ satisfies $Q^\top Q = I$ (columns orthonormal).

  • Orthogonal matrix: $Q \in \mathbb{R}^{d \times d}$ satisfies $Q^\top Q = QQ^\top = I$ (square, invertible, $Q^{-1} = Q^\top$).

Examples:

  • Eigenvectors of a real symmetric matrix can be chosen to form an orthonormal basis (spectral theorem).

  • QR decomposition: $A = QR$ where $Q$ has orthonormal columns (basis for $\text{col}(A)$).

  • SVD: $A = U \Sigma V^\top$ where $U, V$ are orthogonal matrices (their first $\text{rank}(A)$ columns are orthonormal bases for the column and row spaces).
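
The difference between a matrix with orthonormal columns and a square orthogonal matrix shows up directly in a reduced QR factorization. A minimal sketch on a random tall matrix (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))             # full column rank with probability 1

Q, R = np.linalg.qr(A)                  # reduced QR: Q is 6 x 3
print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns
print(np.allclose(Q @ R, A))            # factorization reproduces A
print(np.allclose(Q @ Q.T, np.eye(6)))  # False: Q is not square
```

$Q^\top Q = I$ holds, but $QQ^\top$ is only the projector onto $\text{col}(A)$, not the identity.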

5. Span and basis rank notation.

  • Full rank: $\text{rank}(A) = \min(m, n)$ (maximum possible rank). Columns (or rows) are linearly independent.

  • Rank deficient: $\text{rank}(A) < \min(m, n)$ (some columns/rows linearly dependent).

  • Column space dimension: $\dim(\text{col}(A)) = \text{rank}(A)$ (basis for column space has $\text{rank}(A)$ vectors).

Examples:

  • $X = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix}$ has rank 1 (second column = 2× first column). Basis for column space: $\{[1, 2, 3]^\top\}$ (1 vector).

  • $X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$ has rank 2 (columns linearly independent). Basis: $\{[1, 0, 0]^\top, [0, 1, 0]^\top\}$ (2 vectors).

6. Projection onto basis notation.

  • Projection matrix: $P_V = V(V^\top V)^{-1} V^\top$ projects onto column space of $V$ (assuming full column rank).

  • Orthonormal case: If $V$ has orthonormal columns, $V^\top V = I$ and the projector simplifies to $P_V = VV^\top$.

  • Coordinates via projection: For orthonormal basis $V$, coordinates are $[v]_\mathcal{B} = V^\top v$.

Examples:

  • PCA projection: $X_\text{proj} = XVV^\top$ (project onto basis of top eigenvectors $V$).

  • Least squares: $\hat{y} = X(X^\top X)^{-1} X^\top y$ (projection of $y$ onto column space of $X$).
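
The projection view of least squares can be verified numerically. This sketch uses random data (an assumption) and builds the projector from an orthonormal basis of $\text{col}(X)$ rather than from $(X^\top X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# Use an orthonormal basis Q for col(X) instead of forming (X^T X)^{-1}.
Q, _ = np.linalg.qr(X)
P = Q @ Q.T                             # projector onto col(X)

y_hat = P @ y
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(P @ P, P))            # idempotent
print(np.allclose(y_hat, X @ w))        # matches the least-squares fit
print(np.allclose(X.T @ (y - y_hat), 0))  # residual orthogonal to col(X)
```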

Pitfalls & sanity checks

When working with bases and dimension:

  1. Always center before PCA. Uncentered data gives wrong principal components. Check: mean of centered data should be ~0.

  2. Rank-deficient systems. If rank($X$) < $d$, solution is not unique. Use minimum norm solution or add regularization (Ridge).

  3. Numerical instability from nearly-dependent features. Check condition number: if $\kappa(X) > 10^8$, expect numerical errors. Use SVD/QR instead of explicit inverse.

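  4. One-hot trap: With an intercept, $k$ categories give only $k-1$ independent one-hot variables, because the $k$ indicator columns sum to the all-ones column. Adding all $k$ makes $X^\top X$ singular. Drop one category (or drop the intercept).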

  5. Coordinate consistency: When changing basis, verify: (1) old = $P$ × new, (2) $P$ is invertible (full column rank), (3) reconstruction error ~0.
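
Several of these checks take only a line or two of NumPy. As a sketch of the one-hot trap and the condition-number check, with a hypothetical three-level categorical feature:

```python
import numpy as np

# Hypothetical categorical feature with k = 3 levels.
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
onehot = np.eye(3)[labels]              # 8 x 3 indicator columns
intercept = np.ones((len(labels), 1))

X_all = np.hstack([intercept, onehot])          # intercept + all k columns
X_drop = np.hstack([intercept, onehot[:, 1:]])  # drop the first category

print(np.linalg.matrix_rank(X_all), X_all.shape[1])    # rank 3 < 4 columns
print(np.linalg.matrix_rank(X_drop), X_drop.shape[1])  # rank 3 = 3 columns
print(np.linalg.cond(X_drop))                          # finite condition number
```

With all $k$ indicators plus an intercept the design matrix loses a rank; dropping one category restores full column rank.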

References

Foundational texts:

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge University Press.

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

PCA and dimensionality reduction:

  1. Pearson, K. (1901). “On lines and planes of closest fit to systems of points.” Philosophical Magazine, 2(11), 559–572.

  2. Hotelling, H. (1933). “Analysis of a complex of statistical variables.” Journal of Educational Psychology, 24(6), 417–441.

  3. Eckart, C., & Young, G. (1936). “The approximation of one matrix by another of lower rank.” Psychometrika, 1(3), 211–218.

  4. Turk, M., & Pentland, A. (1991). “Eigenfaces for recognition.” Journal of Cognitive Neuroscience, 3(1), 71–86.

Neural networks and representation learning:

  1. Cybenko, G. (1989). “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems, 2(4), 303–314.

  2. Bengio, Y. (2013). “Deep learning of representations: Looking forward.” In Statistical Language and Speech Processing (pp. 1–37). Springer.

  3. Hinton, G. E., & Salakhutdinov, R. R. (2006). “Reducing the dimensionality of data with neural networks.” Science, 313(5786), 504–507.

  4. Kingma, D. P., & Welling, M. (2013). “Auto-encoding variational Bayes.” arXiv:1312.6114.

Optimization and normalization:

  1. Robbins, H., & Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22(3), 400–407.

  2. Ioffe, S., & Szegedy, C. (2015). “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” In ICML (pp. 448–456).

  3. Ba, J., Kiros, J. R., & Hinton, G. E. (2016). “Layer normalization.” arXiv:1607.06450.

Transformers and attention:

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). “Attention is all you need.” In NeurIPS (pp. 5998–6008).

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv:1810.04805.

Dictionary learning and sparse coding:

  1. Aharon, M., Elad, M., & Bruckstein, A. (2006). “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation.” IEEE Transactions on Signal Processing, 54(11), 4311–4322.

Five worked examples

Worked Example 1: One-hot basis and embedding lookup#

Problem. Show embedding lookup is matrix multiplication by a one-hot vector.

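Solution (math). If $x = e_i$ (the $i$-th standard basis vector), then $Ex$ selects the $i$-th column of $E$.

Solution (Python).

import numpy as np
# Embedding matrix: each column is the embedding of one token.
E = np.array([[1., 2., 3.], [0., -1., 1.]])
x = np.array([0., 1., 0.])  # one-hot vector e_2
print(E @ x)                # second column of E: [ 2. -1.]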

Worked Example 2: Coordinates in a PCA basis#

Problem. Compute 1D PCA coordinate z of a point.

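Solution (math). If $u$ is the top principal direction and $\mu$ is the data mean, then $z = u^\top (x - \mu)$.

Solution (Python).

import numpy as np
from scripts.toy_data import toy_pca_points  # project-local helper, not part of NumPy
X = toy_pca_points(n=10, seed=0)
mu = X.mean(axis=0)
Xc = X - mu                                   # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
u = Vt[0]                                     # top principal direction
z = u @ (X[0] - mu)                           # 1D coordinate of the first point
print(z)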

Worked Example 3: Redundant engineered features via rank#

Problem. Detect redundancy by checking rank.

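Solution (math). If $\text{rank}(X) < d$, the $d$ columns are linearly dependent.

Solution (Python).

import numpy as np
# Columns 2 and 3 are multiples of column 1, so only one direction is independent.
X = np.array([[1., 2., 3.], [2., 4., 6.], [3., 6., 9.]])
print(np.linalg.matrix_rank(X))  # 1, which is less than d = 3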

Worked Example 4: Whitening as change of basis#

Problem. Whiten data using covariance eigendecomposition.

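Solution (math). If $\Sigma = U \Lambda U^\top$, then $x_{\text{white}} = \Lambda^{-1/2} U^\top (x - \mu)$.

Solution (Python).

import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = X.mean(axis=0)
Xc = X - mu
Sigma = np.cov(Xc, rowvar=False)
lam, U = np.linalg.eigh(Sigma)           # Sigma = U diag(lam) U^T
W = np.diag(1 / np.sqrt(lam)) @ U.T      # whitening matrix Lambda^{-1/2} U^T
Xw = Xc @ W.T                            # apply W to every row
print(np.cov(Xw, rowvar=False))          # approximately the identity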

Worked Example 5: SGD gradients live in feature span#

Problem. Show each per-example gradient is a scalar times the feature vector.

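Solution (math). For the squared loss $\tfrac{1}{2}(x^\top w - y)^2$, the gradient is $\nabla_w = (x^\top w - y)\,x$, a scalar multiple of $x$.

Solution (Python).

import numpy as np
x = np.array([1., -2., 0.5])
w = np.array([0.2, 0.1, -0.3])
y = 1.0
g = (x @ w - y) * x   # residual (a scalar) times the feature vector
print(g)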

Chapter 2
Span & Linear Combination
Key ideas: Introduction

Introduction#

Span and linear combinations are the fundamental building blocks of linear algebra and machine learning. Every prediction $\hat{y} = Xw$, every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}$, every attention output $\sum_i \alpha_i v_i$, and every representation learned by a neural network is ultimately a linear combination of basis vectors. Understanding span—the set of all possible linear combinations—reveals model expressiveness, training dynamics, and the geometry of learned representations.

The span of a set of vectors $\{v_1, \ldots, v_k\}$ is the smallest subspace containing all of them. Geometrically, it’s all points reachable by scaling and adding the vectors. Algebraically, it’s $\{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$. In ML, span determines:

  • Model capacity: What functions can a model represent?
  • Feature redundancy: Are some features linear combinations of others?
  • Solution uniqueness: When are there multiple parameter vectors giving identical predictions?
  • Expressiveness vs. efficiency: Can we reduce dimensionality without losing information?

This chapter adopts an ML-first perspective: we introduce span through concrete algorithms (kernel methods, attention, overparameterization) rather than abstract axioms. The goal is to build geometric intuition (span as reachable points) and computational skill (checking linear independence, computing basis) simultaneously.

 

Important Ideas#

1. Linear combinations are everywhere in ML. A linear combination of vectors $\{v_1, \ldots, v_k\}$ with coefficients $\{\alpha_1, \ldots, \alpha_k\}$ is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k $$

Examples in ML:

  • Linear regression predictions: $\hat{y} = Xw = \sum_{j=1}^d w_j x_j$ (linear combination of feature columns).
  • Gradient descent updates: $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ (linear combination of current parameters and gradient).
  • Attention outputs: $z = \sum_{i=1}^n \alpha_i v_i$ (weighted sum of value vectors with attention weights $\alpha_i$).
  • Kernel predictions: $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (representer theorem: optimal solution is a linear combination of training kernels).
  • Word embeddings: Analogies $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (linear combinations capture semantic relationships).

2. Span determines expressiveness. The span of $\{v_1, \ldots, v_k\}$ is: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$

This is the set of all possible linear combinations—the “reachable subspace” if we’re allowed to scale and add the vectors. Key properties:

  • It’s a subspace: Closed under addition and scalar multiplication (adding/scaling linear combinations gives another linear combination).
  • It’s the smallest subspace containing $\{v_1, \ldots, v_k\}$: Any subspace containing all $v_i$ must contain their span.
  • Dimension = maximum number of linearly independent vectors among $v_1, \ldots, v_k$: If $v_k = \sum_{i=1}^{k-1} c_i v_i$ (linear dependence), adding $v_k$ doesn’t increase the span.

In ML context:

  • Column space of $X$: All predictions $\hat{y} = Xw$ lie in $\text{span}(\text{columns of } X) = \text{col}(X)$. If $y \notin \text{col}(X)$, perfect fit is impossible (residual is nonzero).
  • Feature redundancy: If feature $x_j$ is a linear combination of other features, adding it doesn’t increase $\text{span}(\text{columns of } X)$ or model capacity.
  • Kernel methods: Predictions lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (representer theorem). This is a subspace of dimension at most $n$ inside the (possibly infinite-dimensional) RKHS.
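
Whether a target lies in $\text{col}(X)$, and whether a new feature enlarges the span, can both be tested numerically. The helper name `in_col_space`, the tolerance, and the small matrix below are assumptions for illustration.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def in_col_space(X, y, tol=1e-10):
    """Return True if y is (numerically) a linear combination of X's columns."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return bool(np.linalg.norm(X @ w - y) <= tol)

y_in = X @ np.array([2.0, -1.0])        # built from the columns of X
y_out = np.array([1.0, 1.0, 0.0])       # third entry is not the sum of the first two

print(in_col_space(X, y_in))            # perfect fit is possible
print(in_col_space(X, y_out))           # nonzero residual is unavoidable

# A redundant feature (sum of existing columns) leaves the rank unchanged.
X_aug = np.column_stack([X, X[:, 0] + X[:, 1]])
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_aug))
```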

3. Linear independence vs. dependence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if the only solution to $\sum_{i=1}^k \alpha_i v_i = 0$ is $\alpha_1 = \cdots = \alpha_k = 0$. Otherwise, they’re linearly dependent (one is a linear combination of others).

Why it matters:

  • Basis: A linearly independent set spanning $V$ is a basis for $V$. Every vector in $V$ has a unique representation as a linear combination of basis vectors.
  • Rank: $\text{rank}(X) = $ number of linearly independent columns = $\dim(\text{col}(X))$.
  • Multicollinearity: In regression, linearly dependent features ($\text{rank}(X) < d$) make $X^\top X$ singular (non-invertible), requiring regularization.

4. Representer theorem: solutions lie in span of training data. For many ML problems (kernel ridge regression, SVMs, Gaussian processes), the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$

This is a linear combination of kernel functions evaluated at training points. Despite working in an infinite-dimensional space (e.g., RBF kernel), the solution lies in an $n$-dimensional subspace (span of $\{k(x_i, \cdot)\}_{i=1}^n$).

Implications:

  • Computational tractability: Optimization over infinite dimensions reduces to solving an $n \times n$ system.
  • Overfitting vs. underfitting: More training points ($n$ large) increases capacity but also computational cost ($O(n^3)$ for exact methods).
  • Sparse solutions: The hinge loss (SVMs) and $\ell_1$ penalties on the coefficients produce solutions with many $\alpha_i = 0$ (sparse linear combinations).

 

Relevance to Machine Learning#

Model expressiveness and capacity. The span of a feature matrix $X \in \mathbb{R}^{n \times d}$ determines all possible predictions. For linear regression $\hat{y} = Xw$:

  • If $\text{rank}(X) = d$ (full column rank), $\text{col}(X)$ is $d$-dimensional and each prediction vector $\hat{y}$ corresponds to exactly one weight vector $w$.
  • If $\text{rank}(X) < d$, features are redundant. Adding more linearly dependent features doesn’t help.
  • If $\text{rank}(X) < n$ (always the case for overdetermined systems with $n > d$), exact fit is impossible unless $y \in \text{col}(X)$ (rare).

Attention mechanisms. Transformer attention computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where the output is a convex combination (weighted average with non-negative weights summing to 1) of value vectors. Each output lies in $\text{span}(\text{rows of } V)$. Multi-head attention projects to multiple subspaces (heads), increasing expressiveness.
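
A single attention head is short enough to write out in NumPy. The sketch below uses random toy matrices (sizes are assumptions) and checks that each output row is a convex combination of the rows of $V$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 3                   # toy sizes (assumptions)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=1, keepdims=True)    # stabilize the softmax
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)              # each row: weights >= 0, sum to 1

out = A @ V                                     # row i = sum_j A[i, j] * V[j]
print(np.allclose(A.sum(axis=1), 1.0), (A >= 0).all())
# Convex combinations stay inside the coordinate-wise range of the values.
inside = (out >= V.min(axis=0) - 1e-12).all() and (out <= V.max(axis=0) + 1e-12).all()
print(inside)
```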

Kernel methods and representer theorem. For kernel ridge regression, the optimal solution is: $$ \alpha^* = (K + \lambda I)^{-1} y $$ where $K_{ij} = k(x_i, x_j)$ is the Gram matrix. Predictions are $f(x) = \sum_{i=1}^n \alpha_i^* k(x_i, x)$ (linear combination of training kernels). This holds for any positive semidefinite kernel (linear, polynomial, RBF, neural network), enabling implicit infinite-dimensional feature spaces.
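
The closed form above translates almost line for line into NumPy. This is a sketch under stated assumptions: the `rbf_kernel` helper, the sine target, and the values of `gamma` and `lam` are illustrative choices, not part of the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0])         # toy target (assumption)
lam = 1e-3

K = rbf_kernel(X_train, X_train)
# Solve (K + lam I) alpha = y rather than forming the inverse explicitly.
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

X_test = np.array([[0.5], [1.5]])
f_test = rbf_kernel(X_test, X_train) @ alpha   # f(x) = sum_i alpha_i k(x_i, x)
print(f_test, np.sin(X_test[:, 0]))
```

The only quantity learned is the length-$n$ vector `alpha`, even though the RBF feature space is infinite-dimensional.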

Word embeddings and analogies. Word2Vec (Mikolov et al., 2013) famously demonstrated that semantic relationships correspond to linear offsets in embedding space: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$. This shows embeddings capture compositional structure (adding/subtracting vectors blends meanings).

 

Algorithmic Development History#

1. Linear combinations in classical mechanics (Newton, 1687). Newton’s second law $F = ma$ expresses force as a linear combination of acceleration components. Decomposing vectors into basis components (Cartesian coordinates) enabled solving physical systems.

2. Linear algebra formalization (Grassmann 1844, Peano 1888). Grassmann introduced “extensive magnitudes” (vectors) and exterior algebra (wedge products, spans). Peano axiomatized vector spaces with addition and scalar multiplication, formalizing linear combinations.

3. Least squares and column space (Gauss 1809, Legendre 1805). Gauss used least squares for orbit determination. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, and the best fit minimizes $\|y - \hat{y}\|_2$ by projecting $y$ onto $\text{col}(X)$.

4. Kernel trick and representer theorem (Kimeldorf & Wahba 1970, Schölkopf 1990s). Kimeldorf & Wahba proved the representer theorem for splines: optimal smoothing spline is a linear combination of kernel basis functions. Schölkopf, Smola, and Vapnik extended this to SVMs and kernel ridge regression, enabling nonlinear learning in RKHS.

5. Word embeddings and linear structure (Mikolov et al. 2013). Word2Vec revealed that embeddings exhibit linear compositionality: analogies like “king - man + woman ≈ queen” work because semantic relationships correspond to parallel vectors (linear offsets). This was surprising—neural networks learned a structured linear space without explicit supervision.

6. Attention and weighted sums (Bahdanau 2015, Vaswani 2017). Attention mechanisms compute outputs as convex combinations (weighted averages) of value vectors. The Transformer (Vaswani et al., 2017) replaced recurrence with attention, showing that linear combinations of context (with learned weights) suffice for sequence modeling.

7. Overparameterization and implicit bias (Bartlett 2020, Arora 2019). Modern deep networks are vastly overparameterized ($d \gg n$). In the linear case, the interpolating solutions of $Xw = y$ form the affine subspace $w_{\min} + \text{null}(X)$, where $w_{\min}$ is the minimum-norm solution. Gradient descent exhibits implicit regularization, preferring solutions in specific subspaces (e.g., low-rank, sparse). Understanding span and null space helps explain why overparameterized models generalize.

 

Definitions#

Linear combination. Given vectors $\{v_1, \ldots, v_k\} \subset V$ and scalars $\{\alpha_1, \ldots, \alpha_k\} \subset \mathbb{R}$, the linear combination is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \cdots + \alpha_k v_k \in V $$

Span. The span of $\{v_1, \ldots, v_k\}$ is the set of all linear combinations: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$ This is the smallest subspace containing $\{v_1, \ldots, v_k\}$.

Linear independence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if: $$ \sum_{i=1}^k \alpha_i v_i = 0 \quad \Longrightarrow \quad \alpha_1 = \cdots = \alpha_k = 0 $$ Otherwise, they are linearly dependent (at least one $v_j$ is a linear combination of the others).

Basis. A set $\{v_1, \ldots, v_k\}$ is a basis for subspace $S$ if:

  1. It spans $S$: $\text{span}\{v_1, \ldots, v_k\} = S$.
  2. It is linearly independent.

Every vector in $S$ has a unique representation as a linear combination of basis vectors.

Column space (range). For $A \in \mathbb{R}^{m \times n}$, the column space is: $$ \text{col}(A) = \{Ax : x \in \mathbb{R}^n\} = \text{span}\{\text{columns of } A\} $$

Dimension. $\dim(S) = $ number of vectors in any basis for $S$. For $\text{col}(A)$, $\dim(\text{col}(A)) = \text{rank}(A)$ (number of linearly independent columns).

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Representer theorem (Kimeldorf & Wahba 1970, Schölkopf et al. 2001). For kernel ridge regression and SVMs, the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$ This holds for any reproducing kernel $k$ on RKHS $\mathcal{H}$. The solution lies in the subspace $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ of dimension at most $n$, even though $\mathcal{H}$ may be infinite-dimensional (e.g., RBF kernel).
  2. VC dimension and span (Vapnik & Chervonenkis 1971). For linear classifiers in $\mathbb{R}^d$, the VC dimension is $d+1$. This measures expressiveness: the classifier can shatter (correctly classify all $2^{d+1}$ labelings) any set of $d+1$ points in general position. The decision boundaries are hyperplanes (linear combinations of features).
  3. Rank-nullity theorem (fundamental theorem of linear algebra). For $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) + \dim(\text{null}(A)) = n $$ In ML: If $X \in \mathbb{R}^{n \times d}$ has $\text{rank}(X) = r < d$, there are $d - r$ linearly dependent features (null space dimension). Solutions to $Xw = y$ form an affine subspace $w_{\text{particular}} + \text{null}(X)$.
  4. Eckart-Young theorem (1936). The truncated SVD $\hat{X} = U_k \Sigma_k V_k^\top$ (keeping top $k$ singular values) is a best rank-$k$ approximation in Frobenius norm: $$ \|X - \hat{X}\|_F = \min_{\text{rank}(B) \leq k} \|X - B\|_F $$ Geometrically: Projecting columns of $X$ onto $\text{span}\{u_1, \ldots, u_k\}$ minimizes reconstruction error. This justifies PCA, low-rank matrix completion, and recommender systems.
  5. Johnson-Lindenstrauss lemma (1984). A random linear projection from $\mathbb{R}^d$ to $\mathbb{R}^k$ with $k = O(\log n / \epsilon^2)$ preserves all pairwise distances among $n$ points up to a factor $1 \pm \epsilon$ with high probability. The target dimension $k$ depends only on $n$ and $\epsilon$, not on $d$, and no low-dimensional structure in the data is assumed. This enables dimensionality reduction without first learning a subspace.
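
The Johnson-Lindenstrauss statement can be probed empirically. The sketch below uses a Gaussian random projection with toy sizes (both are assumptions) and reports how much squared pairwise distances change; it illustrates the lemma and is not a proof of the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 400                 # toy sizes (assumptions)
X = rng.normal(size=(n, d))

# Gaussian random projection, scaled so squared norms are preserved on average.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Z = X @ R

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all distinct pairs of rows of A."""
    G = A @ A.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
    return sq[np.triu_indices(len(A), k=1)]

ratio = pairwise_sq_dists(Z) / pairwise_sq_dists(X)
print(ratio.min(), ratio.max())         # both should be close to 1
```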

Why essential: These theorems quantify when learning is tractable (representer theorem → finite-dimensional optimization), how much data suffices (VC dimension → sample complexity), and when low-dimensional structure exists (Eckart-Young → lossy compression bounds).

 

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. Word2Vec (Mikolov et al., 2013). Learned 300-dimensional embeddings for millions of words via skip-gram/CBOW. Demonstrated linear structure: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ achieved 40% accuracy on analogy tasks. Showed that linear combinations capture semantic relationships (gender, tense, capitals).
  2. ResNet (He et al., 2015). Introduced skip connections $y = F(x) + x$, enabling training of 152-layer networks (vs. ~20 layers for VGG). Won ImageNet 2015 with 3.57% top-5 error. The key: $F(x) + x$ is a linear combination (residual + identity), preserving gradients during backpropagation.
  3. Transformer (Vaswani et al., 2017). Replaced RNNs with attention: $\text{softmax}(QK^\top / \sqrt{d_k}) V$ (linear combination of value vectors). Enabled GPT-3 (175B params, Brown et al. 2020), BERT (340M params, Devlin et al. 2018), and state-of-the-art results across NLP (translation, summarization, QA).
  4. Kernel SVMs (Boser et al. 1992, Cortes & Vapnik 1995). Applied kernel trick to large-margin classifiers. Reached over 99% accuracy on MNIST (Decoste & Schölkopf 2002). Decision function $f(x) = \sum_{i \in SV} \alpha_i y_i k(x_i, x)$ is a sparse linear combination (only support vectors have $\alpha_i \neq 0$).
  5. PCA for face recognition (Eigenfaces, Turk & Pentland 1991). Projected face images onto span of top eigenvectors (principal components). Each face is approximated as $x \approx \sum_{i=1}^k c_i u_i$ (linear combination of eigenfaces). Achieved real-time recognition with $k = 50$-$100$ components (vs. $d = 10,000$ pixels).
  6. GPT-3 (Brown et al., 2020). 175B parameter Transformer trained on 300B tokens. Demonstrated few-shot learning (2-3 examples) across diverse tasks without fine-tuning. Attention layers compute $\sum_{i=1}^n \alpha_i v_i$ (linear combinations of context), with $n = 2048$ tokens.

Why essential: These systems achieved state-of-the-art by exploiting linear combination structure (attention, skip connections, kernel methods). Understanding span is necessary to interpret embeddings (Word2Vec analogies), debug failures (rank deficiency in features), and design architectures (multi-head attention = multiple subspaces).

Key ideas: Where it shows up

1. Principal Component Analysis (PCA) — Data spans low-dimensional subspace#

Major achievements:

  • Hotelling (1933): Formalized PCA as finding orthogonal directions of maximum variance. Principal components are eigenvectors of the covariance matrix $C = \frac{1}{n} X_c^\top X_c$ (centered data).
  • Eckart-Young theorem (1936): Proved that truncated SVD $\hat{X} = U_k \Sigma_k V_k^\top$ (keeping top $k$ singular vectors) minimizes reconstruction error $\|X - \hat{X}\|_F$ over all matrices of rank at most $k$. This justifies PCA: projecting the rows of $X$ onto $\text{span}\{v_1, \ldots, v_k\}$ (top eigenvectors of $C$, the columns of $V_k$) is optimal.
  • Modern applications: Face recognition (eigenfaces, Turk & Pentland 1991), data compression, preprocessing for neural networks (whitening), exploratory data analysis (visualizing high-dimensional datasets in 2D/3D).

Connection to span: PCA finds the $k$-dimensional subspace (span of top eigenvectors) that best approximates the data cloud. Projecting centered data $X$ onto $\text{span}\{v_1, \ldots, v_k\}$, where $v_i$ are the columns of $V_k$, gives $X_{\text{proj}} = X V_k V_k^\top$, where each row is a linear combination of top eigenvectors. The retained variance is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^d \lambda_i$.
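
The projection, the truncated SVD, and the retained-variance ratio can be checked against each other. The synthetic data with decaying column scales and the choice $k = 2$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Vk = Vt[:k].T                           # d x k, top principal directions

X_proj = Xc @ Vk @ Vk.T                 # projection onto the principal subspace
X_svd = (U[:, :k] * S[:k]) @ Vt[:k]     # truncated SVD U_k Sigma_k V_k^T

lam = S**2 / Xc.shape[0]                # covariance eigenvalues
retained = lam[:k].sum() / lam.sum()
err = np.linalg.norm(Xc - X_proj, "fro")

print(np.allclose(X_proj, X_svd))
print(np.isclose(err, np.sqrt((S[k:]**2).sum())))
print(retained)
```

The reconstruction error equals the root sum of squares of the discarded singular values, which is the quantity Eckart-Young says cannot be improved by any other rank-$k$ matrix.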

 

2. Stochastic Gradient Descent (SGD) — Updates are linear combinations#

Major achievements:

  • Robbins & Monro (1951): Proved convergence of stochastic approximation $\theta_{t+1} = \theta_t - \eta_t g_t$ (where $g_t$ is a noisy gradient) under diminishing step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$.
  • Momentum methods (Polyak 1964, Nesterov 1983): Introduced momentum $m_{t+1} = \beta m_t + \nabla \mathcal{L}(\theta_t)$, $\theta_{t+1} = \theta_t - \eta m_{t+1}$ (exponentially weighted average of gradients). This is a linear combination of past gradients with decaying weights.
  • Adam optimizer (Kingma & Ba 2014): Adaptive learning rates using first and second moment estimates. Became the dominant optimizer for deep learning (BERT, GPT, Stable Diffusion).

Connection to span: Every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ is a linear combination of the current parameters and the negative gradient. The optimization trajectory $\{\theta_0, \theta_1, \ldots\}$ lies in the affine subspace $\theta_0 + \text{span}\{\nabla \mathcal{L}(\theta_0), \nabla \mathcal{L}(\theta_1), \ldots\}$. For linear models with squared loss, the gradient $X^\top (Xw - y)$ is a linear combination of the training examples (the rows of $X$).
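
For a linear least-squares model this span argument has a checkable consequence: started from zero, gradient descent never leaves the row space of $X$. The sketch below uses a random underdetermined problem; the sizes, step size, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                            # more parameters than examples (assumption)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                         # start at the origin
eta = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1 / sigma_max^2
for _ in range(5000):
    grad = X.T @ (X @ w - y)            # a combination of the rows of X
    w -= eta * grad

# Component of w outside the row space of X.
_, _, Vt = np.linalg.svd(X)
null_basis = Vt[n:].T                   # d x (d - n) basis of null(X)
null_part = np.linalg.norm(null_basis.T @ w)
print(null_part)
print(np.allclose(w, np.linalg.pinv(X) @ y, atol=1e-6))
```

Because the iterate has no null-space component, the interpolating solution it reaches is the minimum-norm one given by the pseudoinverse.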

 

3. Deep Neural Networks—Compositional Linear Combinations

Major achievements:

  • Universal approximation (Cybenko 1989, Hornik 1991): Single hidden layer networks with enough hidden units can approximate continuous functions on compact sets arbitrarily well. The output is $f(x) = \sum_{i=1}^h w_i \sigma(v_i^\top x + b_i)$ (linear combination of activations).
  • Deep learning revolution (2012-present): AlexNet (2012), VGG (2014), ResNet (2015), Transformers (2017) demonstrated empirically that adding depth (composing linear maps + nonlinearities) is often more effective than adding width (more neurons per layer).
  • Neural Tangent Kernels (Jacot et al. 2018): Showed that infinite-width networks behave like kernel methods, with predictions in $\text{span}\{\text{training features}\}$.

Connection to span: Each layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$, where $W_l h_l$ is a linear combination of the columns of $W_l$ with coefficients given by the entries of $h_l$. The product $W_l h_l$ therefore lies in $\text{col}(W_l)$. Deep networks compose these linear combinations across layers, creating hierarchical representations.

 

4. Kernel Methods—Predictions as linear combinations of kernels#

Major achievements:

  • Representer theorem (Kimeldorf & Wahba 1970): For regularized risk minimization $\min_{f \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, the optimal solution is $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (linear combination of kernel basis functions).
  • Support Vector Machines (Boser et al. 1992, Cortes & Vapnik 1995): Introduced large-margin classifiers with the kernel trick. They were among the most widely used classifiers through the early 2000s.
  • Gaussian Processes (Rasmussen & Williams 2006): Bayesian kernel methods for regression/classification. Predictions are linear combinations $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ with $\alpha = (K + \sigma^2 I)^{-1} y$.

Connection to span: Despite working in (potentially infinite-dimensional) RKHS, kernel predictions always lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (finite-dimensional subspace spanned by training kernels). The Gram matrix $K_{ij} = k(x_i, x_j)$ encodes inner products in this subspace.

 

5. Transformer Attention—Weighted sums of value vectors#

Major achievements:

  • Vaswani et al. (2017): “Attention is All You Need” replaced RNNs with self-attention, enabling parallelization and scaling to hundreds of billions of parameters (GPT-3: 175B parameters).
  • BERT (Devlin et al. 2018): Bidirectional Transformers for masked language modeling. Achieved state-of-the-art on 11 NLP tasks (including the GLUE benchmark and SQuAD).
  • Vision Transformers (Dosovitskiy et al. 2020): Applied attention to image patches, matching or surpassing strong CNNs on ImageNet when pre-trained on large datasets (ViT-H/14: 88.5% top-1 accuracy).
  • Multimodal models (CLIP, Flamingo, GPT-4): Unified vision and language via attention over heterogeneous inputs.

Connection to span: Attention output $z = \text{softmax}(QK^\top / \sqrt{d_k}) V$ is a convex combination (weighted average with non-negative weights summing to 1) of value vectors (rows of $V$). Each output row therefore lies in the convex hull of the rows of $V$, and in particular in $\text{span}(\text{rows of } V)$. Multi-head attention projects to $h$ different subspaces, computing $h$ independent convex combinations in parallel.

Notation

Standard Conventions#

1. Linear combination syntax.

  • Summation notation: $\sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k$.
  • Matrix-vector product: $Xw = \sum_{j=1}^d w_j x_j$ (linear combination of columns of $X$ with weights from $w$).
  • Convex combination: $\sum_{i=1}^k \alpha_i v_i$ with $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$ (weighted average).

Examples:

  • Linear regression prediction: $\hat{y} = X w = \sum_{j=1}^d w_j X_{:,j}$ (each prediction is a linear combination of feature columns).
  • Attention output: $z = \sum_{i=1}^n \alpha_i v_i$ where $\alpha = \text{softmax}(q^\top K^\top / \sqrt{d_k})$, with the keys stored as rows of $K$ (convex combination of value vectors).
  • Word analogy: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} = 1 \cdot e_{\text{king}} + (-1) \cdot e_{\text{man}} + 1 \cdot e_{\text{woman}}$ (coefficients can be negative).

2. Span notation.

  • Set notation: $\text{span}\{v_1, \ldots, v_k\} = \{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$.
  • Equivalent: $\text{span}(S)$ where $S = \{v_1, \ldots, v_k\}$ (span of a set).
  • Column space: $\text{col}(A) = \text{span}\{\text{columns of } A\}$.
  • Row space: $\text{row}(A) = \text{span}\{\text{rows of } A\} = \text{col}(A^\top)$.

Examples:

  • For $X \in \mathbb{R}^{3 \times 2}$ with columns $x_1 = [1, 0, 1]^\top$, $x_2 = [0, 1, 1]^\top$: $$ \text{col}(X) = \text{span}\{x_1, x_2\} = \left\{ w_1 \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + w_2 \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} : w_1, w_2 \in \mathbb{R} \right\} $$ This is a 2D plane in $\mathbb{R}^3$ passing through the origin.

3. Linear independence notation.

  • Independence: Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if $\sum_{i=1}^k \alpha_i v_i = 0 \Rightarrow \alpha_1 = \cdots = \alpha_k = 0$.
  • Dependence: If there exist $\alpha_i$ (not all zero) such that $\sum_{i=1}^k \alpha_i v_i = 0$, vectors are linearly dependent.
  • Rank: $\text{rank}(A) = \max\{\text{number of linearly independent columns of } A\} = \max\{\text{number of linearly independent rows of } A\}$.

Examples:

  • Vectors $v_1 = [1, 0]^\top$, $v_2 = [0, 1]^\top$ are linearly independent (standard basis for $\mathbb{R}^2$).
  • Vectors $v_1 = [1, 2]^\top$, $v_2 = [2, 4]^\top$ are linearly dependent ($v_2 = 2 v_1$).
  • For $X \in \mathbb{R}^{100 \times 50}$, $\text{rank}(X) \leq 50$ (at most 50 linearly independent columns).

4. Basis notation.

  • Basis: A linearly independent spanning set. Denoted $\mathcal{B} = \{v_1, \ldots, v_d\}$ for a $d$-dimensional space.
  • Standard basis: $\{e_1, \ldots, e_d\}$ where $e_i$ has 1 in position $i$, 0 elsewhere.
  • Coordinates: For vector $v = \sum_{i=1}^d \alpha_i v_i$ (linear combination of basis vectors), the coordinates are $[\alpha_1, \ldots, \alpha_d]^\top$.

Examples:

  • Standard basis for $\mathbb{R}^3$: $e_1 = [1, 0, 0]^\top$, $e_2 = [0, 1, 0]^\top$, $e_3 = [0, 0, 1]^\top$.
  • Any vector $v = [v_1, v_2, v_3]^\top = v_1 e_1 + v_2 e_2 + v_3 e_3$ (linear combination of standard basis).
  • For PCA, top $k$ eigenvectors $\{u_1, \ldots, u_k\}$ form a basis for the principal subspace.

5. Kernel and null space notation.

  • Null space: $\text{null}(A) = \{x : Ax = 0\}$ (vectors mapped to zero).
  • Kernel: $\ker(A) = \text{null}(A)$ (alternative notation).
  • Range (column space): $\text{range}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^n\}$.

Examples:

  • For $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ (rank 1), $\text{null}(A) = \text{span}\{[2, -1]^\top\}$ (1D subspace).
  • Overparameterized regression: If $\text{rank}(X) < d$, solutions to $Xw = y$ form $w_0 + \text{null}(X)$ (affine subspace).
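  • Kernel ridge regression: The coefficient vector $\alpha = (K + \lambda I)^{-1} y$ lies in $\mathbb{R}^n$ (one coefficient per training example), and the fitted function $f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$ lies in the span of the training kernels.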
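
The rank-1 example above can be reproduced numerically. This sketch uses the SVD to extract a basis for the null space, which is one common approach and an assumption here rather than a method the text prescribes.

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.]])

rank = np.linalg.matrix_rank(A)

# Right singular vectors beyond the rank span null(A)
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T           # columns form an orthonormal basis of null(A)

z = np.array([2., -1.])            # the spanning vector quoted in the example
print(rank)
print(np.allclose(A @ z, 0))

# z is parallel to the computed basis vector (up to sign and scale)
cos = abs(null_basis[:, 0] @ z) / np.linalg.norm(z)
print(np.isclose(cos, 1.0))
```

Rank plus null-space dimension equals the number of columns (here $1 + 1 = 2$), which is the rank-nullity theorem.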

6. Projection notation.

  • Orthogonal projection: $P_S v$ projects $v$ onto subspace $S$.
  • Projection matrix: $P = A(A^\top A)^{-1} A^\top$ projects onto $\text{col}(A)$ (requires $A$ to have full column rank, so that $A^\top A$ is invertible).
  • Complement: $v = P_S v + P_{S^\perp} v$ (decomposition into parallel and perpendicular components).

Examples:

  • PCA projection onto top $k$ eigenvectors: $X_{\text{proj}} = X V_k V_k^\top$ where $V_k = [u_1 | \cdots | u_k]$.
  • Least squares: $\hat{y} = X(X^\top X)^{-1} X^\top y$ (projection of $y$ onto $\text{col}(X)$).
  • Residual: $r = y - \hat{y} = (I - X(X^\top X)^{-1} X^\top) y$ (projection onto $\text{col}(X)^\perp$).
Pitfalls & sanity checks

Common Mistakes#

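  1. Confusing span with basis: The dimension of a span is the maximum number of linearly independent vectors in the spanning set, not the total count.
  2. Assuming full rank: Check that np.linalg.matrix_rank(X) equals $d$ before inverting $X^\top X$; the matrix is singular otherwise.
  3. Ignoring numerical stability: Prefer np.linalg.lstsq over forming the normal equations, which squares the condition number of $X$.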
  4. Misunderstanding convex combinations: Not all linear combinations are convex (need $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$).
  5. Overparameterization misconceptions: $d > n$ doesn’t always cause overfitting (implicit regularization).

     

Essential Checks#

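import numpy as np

# Small example inputs so the checks below run as written
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
v1, v2, v3 = X[:, 0], X[:, 1], X[:, 0] + X[:, 1]
v = 2 * v1 - v2
z = np.zeros(X.shape[1])
attn = np.array([0.2, 0.3, 0.5])

# Check linear independence of the columns of X
rank = np.linalg.matrix_rank(X)
assert rank == X.shape[1], "Columns linearly dependent"

# Verify span membership: solve V @ coef ≈ v in the least-squares sense, then check the fit is exact
V = np.column_stack([v1, v2, v3])
coef = np.linalg.lstsq(V, v, rcond=None)[0]
assert np.allclose(V @ coef, v), "v not in span(V)"

# Test null-space membership
assert np.allclose(X @ z, 0), "z not in null(X)"

# Attention weights must be nonnegative and sum to 1
assert np.isclose(attn.sum(), 1) and (attn >= 0).all()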
References

Foundational Texts#

  1. Strang (2016): Linear Algebra - span, basis, column/null space
  2. Axler (2015): Linear Algebra Done Right - abstract vector spaces
  3. Horn & Johnson (2013): Matrix Analysis - rank, decompositions

     

Machine Learning#

  1. Hastie et al. (2009): Elements of Statistical Learning - regression, SVMs
  2. Goodfellow et al. (2016): Deep Learning - Chapter 2 (Linear Algebra)
  3. Murphy (2022): Probabilistic ML - linear regression, kernels

     

Key Papers#

  1. Kimeldorf & Wahba (1970): Representer theorem
  2. Vapnik & Chervonenkis (1971): VC dimension
  3. Eckart & Young (1936): Low-rank approximation
  4. Mikolov et al. (2013): Word2Vec analogies
  5. Vaswani et al. (2017): Transformer attention
  6. He et al. (2015): ResNet skip connections
  7. Bartlett et al. (2020): Benign overfitting
  8. Belkin et al. (2019): Double descent

     

Advanced Topics#

  1. Schölkopf & Smola (2002): Learning with Kernels
  2. Rasmussen & Williams (2006): Gaussian Processes
  3. Golub & Van Loan (2013): Matrix Computations
  4. Trefethen & Bau (1997): Numerical Linear Algebra
Five worked examples

Worked Example 1: Predictions lie in span(columns of X)#

Introduction#

Linear regression predictions $\hat{y} = Xw$ are linear combinations of the columns of the feature matrix $X$. This fundamental observation reveals model expressiveness: all possible predictions lie in the column space $\text{col}(X)$, a subspace of $\mathbb{R}^n$. If the target $y$ lies outside this subspace ($y \notin \text{col}(X)$), perfect fit is impossible—the best we can do is project $y$ onto $\text{col}(X)$ (least squares solution).

This example explicitly computes $Xw$ as $\sum_{j=1}^d w_j X_{:,j}$ (sum of weighted columns), demonstrating that predictions span a subspace determined entirely by the features.

 

Purpose#

  • Visualize predictions as linear combinations: Show that $\hat{y} = Xw = w_1 X_{:,1} + w_2 X_{:,2} + \cdots + w_d X_{:,d}$.
  • Identify the constraint: Predictions lie in $\text{col}(X)$, limiting model capacity to $\dim(\text{col}(X)) = \text{rank}(X)$.
  • Connect to least squares: When $y \notin \text{col}(X)$, minimizing $\|Xw - y\|_2$ finds the closest point in $\text{col}(X)$ to $y$.
     

Importance#

Model expressiveness. The span of $X$’s columns determines all possible predictions. For $X \in \mathbb{R}^{n \times d}$:

  • If $\text{rank}(X) = d$ (full column rank), the columns are linearly independent, $\text{col}(X)$ is $d$-dimensional, and the least squares weights are unique.
  • If $\text{rank}(X) < d$, some features are redundant (linearly dependent). Adding more linearly dependent features doesn’t increase capacity.
  • If $\text{rank}(X) < n$ (always the case when $d < n$), predictions lie in a proper subspace of $\mathbb{R}^n$. Perfect fit is impossible unless $y \in \text{col}(X)$.

Residuals and orthogonality. The least squares residual $r = y - \hat{y}$ is orthogonal to $\text{col}(X)$: $X^\top r = 0$. Geometrically, $\hat{y}$ is the orthogonal projection of $y$ onto $\text{col}(X)$, and $r$ lies in the orthogonal complement $\text{col}(X)^\perp$.
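
The orthogonality condition $X^\top r = 0$ is easy to verify numerically. A minimal sketch on random data (the shapes and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))       # n=20 examples, d=3 features (synthetic)
y = rng.normal(size=20)            # generic target, almost surely not in col(X)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ w_hat
r = y - y_hat

print(np.allclose(X.T @ r, 0))     # residual is orthogonal to every column of X
print(np.isclose(y_hat @ r, 0))    # and hence to the prediction itself
# Pythagoras: ||y||^2 = ||y_hat||^2 + ||r||^2
print(np.isclose(y @ y, y_hat @ y_hat + r @ r))
```

The Pythagorean identity in the last line is a direct consequence of the decomposition $y = \hat{y} + r$ with $\hat{y} \perp r$.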

Feature selection. If feature $j$ is a linear combination of other features ($X_{:,j} = \sum_{i \neq j} c_i X_{:,i}$), including it doesn’t increase $\text{rank}(X)$ or expand $\text{col}(X)$. Feature selection algorithms (Lasso, forward selection) look for small feature subsets whose span still approximates the target well.

 

What This Example Demonstrates#

  • Matrix-vector product as linear combination: $Xw = \sum_{j=1}^d w_j X_{:,j}$ (sum of weighted columns).
  • Predictions constrained to subspace: $\hat{y} \in \text{col}(X) = \text{span}\{X_{:,1}, \ldots, X_{:,d}\}$.
  • Numerical verification: Compute both $Xw$ (matrix product) and $\sum_j w_j X_{:,j}$ (explicit sum), verify they’re identical.

     

Background#

Least squares (Gauss 1809, Legendre 1805). Gauss used least squares to fit planetary orbits, minimizing sum of squared errors. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, so minimizing $\|y - Xw\|_2^2$ finds the closest point in $\text{col}(X)$ to $y$.

Normal equations. Setting $\nabla_w \|Xw - y\|_2^2 = 0$ gives $X^\top X w = X^\top y$. If $X$ has full column rank, $w^* = (X^\top X)^{-1} X^\top y$. The prediction is $\hat{y} = X w^* = X(X^\top X)^{-1} X^\top y$ (projection matrix $P = X(X^\top X)^{-1} X^\top$ projects onto $\text{col}(X)$).

Geometric interpretation. $\text{col}(X)$ is a $\text{rank}(X)$-dimensional subspace of $\mathbb{R}^n$ ($d$-dimensional when $X$ has full column rank). The prediction $\hat{y}$ is the foot of the perpendicular from $y$ to this subspace. The residual $r = y - \hat{y}$ is perpendicular to the subspace.

 

Historical Context#

1. Least squares origins (Gauss 1809, Legendre 1805). Legendre published the method in 1805 for fitting orbits. Gauss claimed to have used it since 1795 (controversy over priority). Both recognized that predictions are linear combinations of features.

2. Matrix formulation (Cauchy 1829, Sylvester 1850). Matrix algebra enabled compact notation $\hat{y} = Xw$ instead of writing out sums. Sylvester introduced “matrix” terminology in 1850.

3. Projection interpretation (Schmidt 1907, Courant & Hilbert 1924). Erhard Schmidt formalized orthogonal projections in Hilbert spaces. The least squares solution became understood as projecting $y$ onto $\text{col}(X)$.

4. Modern ML (1990s-present). Regularization (ridge, Lasso) leaves $\text{col}(X)$ unchanged but adds penalty terms that change which point in it is selected. Kernel methods (SVMs, kernel ridge regression) work in implicitly mapped feature spaces, where $\text{col}(\Phi(X))$ may be infinite-dimensional but solutions lie in $\text{span}\{k(x_i, \cdot)\}_{i=1}^n$ (finite-dimensional by the representer theorem).

 

History in Machine Learning#

  • 1805: Legendre publishes least squares (linear combinations of features).
  • 1809: Gauss derives normal equations $X^\top X w = X^\top y$.
  • 1907: Schmidt formalizes orthogonal projections (geometric interpretation).
  • 1970: Kimeldorf & Wahba prove the representer theorem (kernel solutions in the span of training points).
  • 1995: Vapnik’s Nature of Statistical Learning Theory connects VC dimension to span of hypothesis class.
  • 2006: Compressed sensing (Candès, Donoho) exploits sparse linear combinations for recovery.
  • 2018: Neural Tangent Kernels (Jacot et al.) show infinite-width networks have predictions in the span of features.

     

Prevalence in Machine Learning#

Universal in supervised learning: Every linear model (linear regression, logistic regression, linear SVM, perceptron) computes predictions as $\hat{y} = Xw$ or $\hat{y} = \sigma(Xw + b)$ (linear combination + nonlinearity).

Deep learning layers: Each fully connected layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$, where $W_l h_l$ is a linear combination of the columns of $W_l$ with coefficients given by the entries of $h_l$.

Generalized linear models (GLMs): Exponential family models (Poisson regression, gamma regression) use $\mathbb{E}[y] = g^{-1}(X w)$ (linear combination inside link function).

Kernel methods: SVMs, kernel ridge regression, Gaussian processes all predict via $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (linear combination of kernel evaluations).

 

Notes and Explanatory Details#

Shape discipline:

  • Feature matrix: $X \in \mathbb{R}^{n \times d}$ (rows = examples, columns = features).
  • Weights: $w \in \mathbb{R}^d$ (one weight per feature).
  • Prediction: $\hat{y} = Xw \in \mathbb{R}^n$ (one prediction per example).
  • Column $j$ of $X$: $X_{:,j} \in \mathbb{R}^n$ (feature $j$ across all examples).

Matrix-vector product identity: $$ Xw = \begin{bmatrix} | & | & & | \\ X_{:,1} & X_{:,2} & \cdots & X_{:,d} \\ | & | & & | \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} = \sum_{j=1}^d w_j X_{:,j} $$

Example: For $X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$, $w = \begin{bmatrix} 2 \\ -1 \end{bmatrix}$: $$ Xw = 2 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + (-1) \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 4 \end{bmatrix} $$

Numerical considerations: For large $d$ (wide data), storing $X$ explicitly may be wasteful if $\text{rank}(X) \ll d$. Low-rank approximations (truncated SVD) reduce storage and computation.

 

Connection to Machine Learning#

Underfitting vs. overfitting: If $\text{rank}(X) \ll n$ (few effective features), predictions are confined to a low-dimensional subspace and the model may underfit. If $\text{rank}(X) = n$ (which requires $d \geq n$), then $\text{col}(X) = \mathbb{R}^n$ and the model can fit any target exactly, including noise (overfitting).

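Regularization and the span: Ridge regression solves $(X^\top X + \lambda I) w = X^\top y$, shrinking weights toward zero. Predictions still lie in $\text{col}(X)$, but the component along each left singular vector $u_i$ is scaled by $\sigma_i^2 / (\sigma_i^2 + \lambda)$, so directions with small singular values are suppressed and the effective degrees of freedom $\sum_i \sigma_i^2 / (\sigma_i^2 + \lambda)$ drop below $\text{rank}(X)$.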
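
One way to see what ridge does to predictions is through the SVD $X = U \Sigma V^\top$: the ridge fit equals $\hat{y} = \sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda} u_i u_i^\top y$. The sketch below checks this identity against a direct solve on random data; the sizes, seed, and $\lambda = 5$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 5.0

# Direct solve of the ridge normal equations
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
y_hat = X @ w_ridge

# Same prediction via the SVD: the component along u_i is scaled by sigma_i^2 / (sigma_i^2 + lam)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + lam)
y_hat_svd = U @ (shrink * (U.T @ y))

dof = shrink.sum()                 # effective degrees of freedom; below rank(X) whenever lam > 0
print(np.allclose(y_hat, y_hat_svd))
print(f"effective dof = {dof:.3f} (rank = {np.linalg.matrix_rank(X)})")
```

Because every shrinkage factor is strictly between 0 and 1, the ridge prediction stays in $\text{col}(X)$ while being pulled toward the origin, most strongly along weak directions.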

Basis functions and feature expansion: Nonlinear models (polynomial regression, RBF networks) expand features: $\phi(x) = [x, x^2, x^3, \ldots]$. Predictions $\hat{y} = \Phi(X) w$ are nonlinear functions of the original inputs but linear in $w$, so they lie in $\text{col}(\Phi(X))$, a linear subspace of $\mathbb{R}^n$.

 

Connection to Linear Algebra Theory#

Fundamental theorem of linear algebra. For $X \in \mathbb{R}^{n \times d}$: $$ \mathbb{R}^n = \text{col}(X) \oplus \text{null}(X^\top) $$ (direct sum: every vector $y \in \mathbb{R}^n$ decomposes uniquely as $y = y_{\parallel} + y_{\perp}$ where $y_{\parallel} \in \text{col}(X)$ and $y_{\perp} \in \text{null}(X^\top)$).

In least squares, $\hat{y} = y_{\parallel}$ (projection onto $\text{col}(X)$) and $r = y_{\perp}$ (projection onto $\text{null}(X^\top)$). The normal equations $X^\top r = 0$ express orthogonality.

Rank and dimension: $\dim(\text{col}(X)) = \text{rank}(X) \leq \min(n, d)$. If $\text{rank}(X) = r < d$, there are $d - r$ redundant features (null space has dimension $d - r$).

Projection matrix: $P = X(X^\top X)^{-1} X^\top$ (assuming $X$ has full column rank) satisfies:

  • $P^2 = P$ (idempotent: projecting twice is the same as projecting once).
  • $P^\top = P$ (symmetric: orthogonal projection).
  • $\text{col}(P) = \text{col}(X)$ (projects onto column space of $X$).
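
These three properties can be confirmed on a small random matrix. The sketch below forms $P$ from the formula purely for illustration (the sizes and seed are assumptions), and then rebuilds it from a QR factorization, which is the more stable route in practice.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))                  # full column rank with probability 1

# The explicit inverse mirrors the formula; it is not the recommended way to compute P
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P))                 # idempotent
print(np.allclose(P.T, P))                   # symmetric
print(np.allclose(P @ X, X))                 # vectors in col(X) are left unchanged
print(np.isclose(np.trace(P), X.shape[1]))   # trace of a projector equals the dimension it projects onto

# Equivalent construction from an orthonormal basis Q of col(X)
Q, _ = np.linalg.qr(X)
print(np.allclose(Q @ Q.T, P))
```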

     

Pedagogical Significance#

Concrete visualization. Students can compute $Xw$ by hand for small $X$ (e.g., $3 \times 2$ matrix) and verify it’s a weighted sum of columns. This makes the abstract “linear combination” concept tangible.

Foundation for least squares. Understanding that predictions lie in $\text{col}(X)$ is essential before learning least squares. The geometric interpretation (projecting $y$ onto $\text{col}(X)$) clarifies why least squares works and when it fails.

Debugging linear models. If predictions are poor, check $\text{rank}(X)$: low rank indicates redundant/collinear features. Use np.linalg.matrix_rank(X) to diagnose.

 

References#

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press. Chapter 4: “Orthogonality” (projections, least squares).
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 3: “Linear Methods for Regression.”
  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Appendix C: “Numerical Linear Algebra Background” (least squares, QR decomposition).
  4. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. Chapter 5: “Orthogonalization and Least Squares.”
  5. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11: “Linear Regression.”

Problem. Show $\hat{y} = Xw$ lies in the span of columns of $X$.

Solution (math).

For $X \in \mathbb{R}^{n \times d}$ with columns $X_{:,1}, \ldots, X_{:,d} \in \mathbb{R}^n$ and weights $w = [w_1, \ldots, w_d]^\top \in \mathbb{R}^d$, the prediction is: $$ \hat{y} = Xw = \sum_{j=1}^d w_j X_{:,j} $$

This is a linear combination of the columns of $X$, so $\hat{y} \in \text{span}\{X_{:,1}, \ldots, X_{:,d}\} = \text{col}(X)$.

Solution (Python).

import numpy as np

# Define feature matrix X (3 examples, 2 features)
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])

# Define weight vector w
w = np.array([2., -1.])

# Prediction via matrix-vector product
y_hat_1 = X @ w

# Prediction as explicit linear combination of columns
y_hat_2 = w[0] * X[:, 0] + w[1] * X[:, 1]

print(f"X =\n{X}\n")
print(f"w = {w}\n")
print(f"Method 1 (matrix product): y_hat = X @ w = {y_hat_1}")
print(f"Method 2 (linear combination): y_hat = {w[0]}*X[:,0] + {w[1]}*X[:,1] = {y_hat_2}")
print(f"\nAre they equal? {np.allclose(y_hat_1, y_hat_2)}")
print(f"y_hat lies in span(columns of X): True (by construction)")

Output:

X =
[[1. 2.]
 [3. 4.]
 [5. 6.]]

w = [ 2. -1.]

Method 1 (matrix product): y_hat = X @ w = [0. 2. 4.]
Method 2 (linear combination): y_hat = 2.0*X[:,0] + -1.0*X[:,1] = [0. 2. 4.]

Are they equal? True
y_hat lies in span(columns of X): True (by construction)

Worked Example 2: Kernel ridge solution lies in span(training features)#

Introduction#

The representer theorem states that despite optimizing over an infinite-dimensional RKHS, the optimal solution for kernel ridge regression always has the form $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$—a linear combination of kernel functions evaluated at training points.

 

Purpose#

  • Demonstrate the representer theorem computationally
  • Show that optimization in infinite dimensions reduces to solving $(K + \lambda I)\alpha = y$
  • Verify predictions lie in span of training kernels

Importance#

Kernel methods enable nonlinear learning in implicitly mapped feature spaces while maintaining computational tractability ($O(n^3)$ instead of infinite-dimensional optimization).

 

What This Example Demonstrates#

Compute kernel Gram matrix $K_{ij} = k(x_i, x_j)$, solve for $\alpha$, interpret as linear combination of training kernels.

 

Background#

RKHS and representer theorem (Kimeldorf & Wahba 1970): For loss $\mathcal{L}(f) = \sum_i \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, the minimizer is $f^*(x) = \sum_i \alpha_i k(x_i, x)$.

 

References#

  1. Kimeldorf & Wahba (1970), Schölkopf et al. (2001), Rasmussen & Williams (2006)

Problem: Compute $\alpha$ for kernel ridge regression and interpret span.

Solution (math): $\alpha = (K + \lambda I)^{-1} y$ where $K_{ij} = k(x_i, x_j)$. Predictions: $f(x) = \sum_i \alpha_i k(x_i, x)$.

Solution (Python):

import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=6, seed=1)
y = np.arange(len(X), dtype=float)
K = toy_kernel_rbf(X, gamma=0.5)
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

print(f"Coefficients alpha: {alpha}")
print(f"Predictions lie in span{{k(x_1, ·), ..., k(x_6, ·)}}")
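
The listing above depends on helper functions from a local `scripts.toy_data` module. For readers without that module, here is a self-contained sketch of the same computation. The `rbf_kernel` helper and the random inputs are stand-ins written for this illustration and are not the book's helpers, so the numbers will differ from the listing above.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B (illustrative helper)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(6, 2))            # synthetic stand-in for the toy data
y = np.arange(len(X_train), dtype=float)
lam = 1e-2

K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y)

# Predict at new points: f(x) = sum_i alpha_i k(x_i, x)
X_new = rng.normal(size=(3, 2))
k_new = rbf_kernel(X_new, X_train)           # shape (3, 6): row j holds k(x_i, x_new_j) for every i
f_new = k_new @ alpha

# The same prediction written as an explicit sum over training kernels
f_explicit = np.array([sum(alpha[i] * k_new[j, i] for i in range(6)) for j in range(3)])
print(np.allclose(f_new, f_explicit))
```

Every prediction is determined by the six coefficients in `alpha`, regardless of how many new points are evaluated, which is the finite-dimensional reduction the representer theorem promises.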

Worked Example 3: Attention is a weighted sum#

Introduction#

Attention computes outputs as convex combinations of value vectors: $z = \sum_i \alpha_i v_i$ where $\alpha = \text{softmax}(q^\top K^\top / \sqrt{d_k})$, with the keys stored as rows of $K$.

 

Purpose#

Show attention output is a linear combination, verify weights sum to 1, demonstrate constraint to span of values.

 

Importance#

Attention is the core operation in Transformers (GPT, BERT), enabling contextual representations through weighted averaging.

 

References#

Vaswani et al. (2017), Bahdanau et al. (2015)

Problem: Compute attention output as $\sum_i \alpha_i v_i$.

Solution (math): $z = \text{softmax}(q^\top K^\top / \sqrt{d_k}) V$

Solution (Python):

import numpy as np
from scripts.toy_data import scaled_dot_attention

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.], [1., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])
output = scaled_dot_attention(Q, K, V)

print(f"Attention output: {output[0]}")
print(f"Output lies in span(rows of V)")
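
The listing above calls `scaled_dot_attention` from a local helper module. A self-contained sketch of the same computation follows, with the softmax written out explicitly; this is an illustrative reimplementation under the standard scaled dot-product definition, not necessarily identical to the book's helper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.], [1., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])

d_k = K.shape[1]
weights = softmax(Q @ K.T / np.sqrt(d_k))     # shape (1, 3): one weight per value vector
output = weights @ V

# The same output written as an explicit weighted sum of the rows of V
explicit = sum(weights[0, i] * V[i] for i in range(len(V)))

print(np.isclose(weights.sum(), 1.0), (weights >= 0).all())
print(np.allclose(output[0], explicit))
```

The first and third keys have the same dot product with the query, so they receive equal weight; the second key is orthogonal to the query and receives less.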

Worked Example 4: Overparameterization and null space#

Introduction#

When $d > n$ (more parameters than examples), solutions to $Xw = y$, when they exist, are non-unique. The solution set forms an affine subspace $w_0 + \text{null}(X)$.

 

Purpose#

Show non-uniqueness, identify solution set structure, and discuss the minimum-norm solution returned by lstsq.

 

Importance#

Modern deep learning is vastly overparameterized. Understanding the null space clarifies why different parameter vectors can give identical training predictions yet generalize differently.

 

References#

Bartlett et al. (2020), Belkin et al. (2019)

Problem: Explain non-uniqueness when $d > n$.

Solution (math): If $Xw_0 = y$ and $z \in \text{null}(X)$, then $X(w_0 + z) = y$. Solutions form $w_0 + \text{null}(X)$.

Solution (Python):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))  # n=3, d=5
w0 = rng.normal(size=5)
y = X @ w0
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"Rank(X): {np.linalg.matrix_rank(X)}")
print(f"Null space dim: {X.shape[1] - np.linalg.matrix_rank(X)}")
print(f"||w_hat||_2 = {np.linalg.norm(w_hat):.4f} (minimum norm)")
print(f"w0 - w_hat in null(X): {np.allclose(X @ (w0 - w_hat), 0)}")
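
To make the solution set $w_0 + \text{null}(X)$ concrete, the sketch below builds an explicit null-space basis from the SVD and shifts the minimum-norm solution along it. Using the pseudoinverse for the minimum-norm solution and the SVD for the basis are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                   # n=3, d=5
y = X @ rng.normal(size=5)
w_min = np.linalg.pinv(X) @ y                 # minimum-norm solution of Xw = y

# Orthonormal basis of null(X): right singular vectors beyond the rank
_, s, Vt = np.linalg.svd(X)
rank = np.linalg.matrix_rank(X)
N = Vt[rank:].T                               # shape (5, 2)

z = N @ rng.normal(size=N.shape[1])           # an arbitrary vector in null(X)
w_alt = w_min + z

print(np.allclose(X @ w_alt, y))              # same predictions on the training data
print(np.isclose(w_min @ z, 0))               # w_min is orthogonal to null(X)
print(np.linalg.norm(w_alt) > np.linalg.norm(w_min))
```

Since `w_min` is orthogonal to every null-space direction, adding any nonzero `z` strictly increases the norm, which is why the minimum-norm solution is unique.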

Worked Example 5: Word analogy vector arithmetic#

Introduction#

Word2Vec embeddings exhibit linear structure: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (semantic relationships = vector offsets).

Purpose#

Compute analogy as linear combination, demonstrate compositional semantics, motivate embedding arithmetic.

Importance#

Analogies reveal that neural networks learn structured representations where linear algebra operations correspond to semantic operations.

References#

Mikolov et al. (2013), Pennington et al. (2014)

Problem: Compute “king - man + woman” analogy.

Solution (math): $e_{\text{target}} = 1 \cdot e_{\text{king}} + (-1) \cdot e_{\text{man}} + 1 \cdot e_{\text{woman}}$

Solution (Python):

import numpy as np

E = {
    'king': np.array([0.8, 0.2, 0.1]),
    'man': np.array([0.7, 0.1, 0.0]),
    'woman': np.array([0.6, 0.3, 0.0])
}

analogy = E['king'] - E['man'] + E['woman']
print(f"king - man + woman = {analogy}")
print(f"(Find nearest word to this vector → queen)")
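
The final step, finding the nearest word, can be sketched with cosine similarity. The `queen` and `apple` vectors below are made-up values added for this illustration (the `queen` vector is chosen to coincide with the analogy result), so the outcome shows the mechanics rather than a property of real embeddings.

```python
import numpy as np

# Toy 3-d embeddings; 'queen' and 'apple' are hypothetical vectors added for illustration
E = {
    'king':  np.array([0.8, 0.2, 0.1]),
    'man':   np.array([0.7, 0.1, 0.0]),
    'woman': np.array([0.6, 0.3, 0.0]),
    'queen': np.array([0.7, 0.4, 0.1]),
    'apple': np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = E['king'] - E['man'] + E['woman']

# Exclude the query words themselves, as is common in analogy evaluation
candidates = {w: v for w, v in E.items() if w not in {'king', 'man', 'woman'}}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)
```

With real embeddings the match is only approximate, and excluding the query words matters: without that step the nearest neighbor is often one of the inputs.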

Chapter 1
Vector Spaces & Subspaces
Key ideas: Introduction

Introduction#

Vector spaces and subspaces form the foundational algebraic structures underlying all of machine learning. Every dataset, parameter vector, gradient, embedding, and prediction lives in a vector space. Understanding vector space structure—closure under addition and scalar multiplication, the existence of subspaces, and the geometric interpretation of span—is essential for reasoning about model capacity, optimization trajectories, dimensionality reduction, and numerical stability.

This chapter adopts an ML-first approach: we introduce definitions only when they illuminate practical algorithms or enable rigorous reasoning about ML systems. Rather than axiomatizing vector spaces abstractly, we show how closure properties guarantee that gradient descent never “leaves” the parameter space, how subspaces capture low-dimensional structure in data (PCA, autoencoders), and how span determines the expressiveness of linear models.

Important Ideas#

1. Closure under linear combinations. A vector space $V$ over $\mathbb{R}$ is closed under addition and scalar multiplication: for any $u, v \in V$ and $\alpha, \beta \in \mathbb{R}$, we have $\alpha u + \beta v \in V$. This seemingly trivial property is foundational:

  • Optimization: Gradient descent updates $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ are linear combinations, so parameters remain in $\mathbb{R}^d$.

  • Convex combinations: Interpolations $v = \alpha a + (1-\alpha)b$ with $\alpha \in [0,1]$ stay in the space (used in mixup data augmentation, model averaging, momentum methods).

  • Span: The set of all linear combinations $\{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$ forms a subspace (the span of $\{v_1, \ldots, v_k\}$).

2. Subspaces capture structure. A subspace $S \subseteq V$ is itself a vector space (closed under addition/scaling and contains the zero vector). Key examples in ML:

  • Column space of $X$: All possible predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, the span of feature columns. This determines model expressiveness.

  • Null space (kernel): Solutions to $Xw = 0$ form the null space, revealing parameter redundancy and identifiability issues.

  • Orthogonal complements: Least squares residuals $r = y - X\hat{w}$ lie in $\text{col}(X)^\perp$, the subspace perpendicular to all predictions.

  • Eigenspaces: Eigenvectors with the same eigenvalue span an eigenspace (used in spectral clustering, PCA).

3. Geometric vs. algebraic perspectives. Vector spaces admit dual interpretations:

  • Algebraic: Vectors as tuples of numbers, operations as element-wise arithmetic, subspaces defined by equations.

  • Geometric: Vectors as arrows, subspaces as planes/lines, projections as “shadows,” orthogonality as perpendicularity.

  • ML benefit: Switching perspectives clarifies why algorithms work (geometry) and how to implement them (algebra).

Relevance to Machine Learning#

Model capacity. The span of a feature matrix $X \in \mathbb{R}^{n \times d}$ determines all possible linear predictions. If $\text{rank}(X) < d$, features are redundant (collinear). If $\text{rank}(X) < n$, the model cannot fit arbitrary targets (the system $Xw = y$ is inconsistent for any $y \notin \text{col}(X)$). Understanding span reveals when adding features helps vs. when it introduces multicollinearity.

Dimensionality reduction. PCA projects data onto the span of top eigenvectors, a low-dimensional subspace capturing most variance. Autoencoders learn nonlinear mappings to low-dimensional latent spaces. Kernels implicitly map to high-dimensional (or infinite-dimensional) feature spaces where data may become linearly separable.

Optimization and numerical stability. Gradient-based methods exploit closure: updates are linear combinations of parameters and gradients. Regularization (ridge, Lasso) modifies the effective subspace where solutions lie. Numerical conditioning depends on subspace geometry (angles between basis vectors, subspace dimension).

Algorithmic Development History#

1. Grassmann and the formal axiomatization (1844). Hermann Grassmann introduced the concept of an “extensive magnitude” (vector space) in Die lineale Ausdehnungslehre, defining addition and scalar multiplication axiomatically. His work was largely ignored until the 20th century but provided the first rigorous algebraic treatment of linear combinations and subspaces.

2. Peano’s axioms (1888). Giuseppe Peano formalized vector spaces with the modern axiomatic definition (closure, associativity, distributivity, identity, inverses). This abstraction enabled studying function spaces, polynomial spaces, and infinite-dimensional spaces under a unified framework.

3. Hilbert spaces and functional analysis (1900s-1920s). David Hilbert extended vector space theory to infinite dimensions with inner products, enabling rigorous foundations for quantum mechanics and integral equations. Banach, Fréchet, and Riesz developed norm theory, completing the modern framework.

4. Numerical linear algebra (1950s-1970s). With the advent of digital computers, numerical stability became critical. Householder (QR decomposition, 1958), Golub (SVD algorithm, 1965-1970), and Wilkinson (error analysis, 1960s-1980s) developed stable algorithms exploiting subspace orthogonality. These methods underpin modern least-squares solvers, eigensolvers, and PCA implementations.

5. Kernel methods and reproducing kernel Hilbert spaces (1990s-2000s). The kernel trick (Boser, Guyon, Vapnik, 1992; Schölkopf, Smola, 1998) showed that nonlinear problems become linear in high-dimensional (or infinite-dimensional) feature spaces. Support Vector Machines exploit subspace geometry (maximum margin hyperplanes) in these spaces.

6. Deep learning and representation learning (2010s-present). Neural networks learn hierarchical representations by composing linear maps (matrix multiplications) with nonlinearities. Each layer’s output spans a subspace; training adjusts these subspaces to separate classes or capture structure. Attention mechanisms (Vaswani et al., 2017) compute weighted sums (linear combinations) of value vectors, with outputs constrained to the span of the value vectors.

Definitions#

Vector space. A set $V$ over a field $\mathbb{F}$ (typically $\mathbb{R}$ or $\mathbb{C}$) with operations $+: V \times V \to V$ (addition) and $\cdot: \mathbb{F} \times V \to V$ (scalar multiplication) satisfying:

  1. Closure: $u + v \in V$ and $\alpha v \in V$ for all $u, v \in V$, $\alpha \in \mathbb{F}$.

  2. Associativity: $(u + v) + w = u + (v + w)$ and $\alpha(\beta v) = (\alpha\beta) v$.

  3. Commutativity: $u + v = v + u$.

  4. Identity: There exists $0 \in V$ such that $v + 0 = v$ for all $v \in V$.

  5. Inverses: For each $v \in V$, there exists $-v \in V$ such that $v + (-v) = 0$.

  6. Distributivity: $\alpha(u + v) = \alpha u + \alpha v$ and $(\alpha + \beta)v = \alpha v + \beta v$.

  7. Scalar identity: $1 \cdot v = v$ for all $v \in V$.

Subspace. A subset $S \subseteq V$ is a subspace if:

  1. $0 \in S$ (contains the zero vector).

  2. $u + v \in S$ for all $u, v \in S$ (closed under addition).

  3. $\alpha u \in S$ for all $u \in S$, $\alpha \in \mathbb{F}$ (closed under scalar multiplication).

Equivalently, $S$ is a subspace if it is closed under linear combinations.

Span. The span of vectors $\{v_1, \ldots, v_k\} \subset V$ is: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{F} \right\} $$ This is the smallest subspace containing $\{v_1, \ldots, v_k\}$.

Column space and range. For a matrix $A \in \mathbb{R}^{m \times n}$, the column space is $\text{col}(A) = \{Ax : x \in \mathbb{R}^n\} = \text{span}\{a_1, \ldots, a_n\}$, where $a_i$ are the columns of $A$. This is also called the range or image of $A$.

Null space (kernel). The null space of $A \in \mathbb{R}^{m \times n}$ is $\text{null}(A) = \{x \in \mathbb{R}^n : Ax = 0\}$, the set of vectors mapped to zero by $A$.
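The column space and null space defined above can be computed numerically from the SVD. The following is a minimal sketch: the matrix `A` and the rank tolerance `tol` are assumptions chosen for illustration, not values from the text.

```python
import numpy as np

# Hypothetical 3x4 matrix of rank 2: the third row is the sum of the first two.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 3.0, 2.0]])

U, s, Vt = np.linalg.svd(A)                          # full SVD: U is 3x3, Vt is 4x4
tol = max(A.shape) * np.finfo(float).eps * s.max()   # assumed rank tolerance
rank = int((s > tol).sum())

col_basis = U[:, :rank]      # orthonormal basis of col(A), a subspace of R^3
null_basis = Vt[rank:].T     # orthonormal basis of null(A), a subspace of R^4

# Every null-space vector is mapped to zero by A.
print(np.allclose(A @ null_basis, 0))
# Rank-nullity: rank + dim null(A) equals the number of columns.
print(rank + null_basis.shape[1] == A.shape[1])
```

The leading left singular vectors span the column space, and the trailing right singular vectors span the null space, so one factorization yields both subspaces.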

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rademacher complexity bounds. Generalization error depends on the complexity of the hypothesis class (function space). For linear models, the hypothesis space is finite-dimensional (span of features), enabling tight bounds. Key results:

    • Vapnik-Chervonenkis dimension for linear classifiers is $d+1$ (Vapnik & Chervonenkis, 1971).

    • Rademacher complexity of linear predictors with weights in the unit ball of $\mathbb{R}^d$ (and bounded inputs) scales as $O(1/\sqrt{n})$ (Bartlett & Mendelson, 2002).

  2. Universal approximation. Existence of dense subspaces in function spaces:

    • Single hidden layer neural networks are dense in $C([0,1]^d)$ (Cybenko 1989).

    • Span of RBF kernels is dense in $L^2$ (Micchelli 1986).

    • Fourier series: span of $\{\sin(kx), \cos(kx)\}_{k=0}^\infty$ is dense in $L^2[0, 2\pi]$.

  3. Convex optimization. Gradient descent with a suitable step size converges to a global minimum for smooth convex functions; Nesterov (1983) gave an accelerated variant with an $O(1/t^2)$ rate. Convergence rates depend on properties of the objective (strong convexity, smoothness).

  4. Matrix concentration inequalities. Random matrix theory provides tail bounds for spectral norms, operator norms, and subspace angles (Tropp 2015). Used in randomized linear algebra (sketching, low-rank approximation).

Why essential: These theorems quantify when learning is possible, how many examples suffice, and when optimization succeeds. Vector space structure (dimension, subspaces, inner products) appears directly in the bounds.

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. AlexNet (Krizhevsky et al., 2012). First deep convolutional network to win ImageNet (top-5 error of 15.3% versus 26.2% for the runner-up, a margin of 10.9 percentage points). Demonstrated that compositional linear maps (convolutions as local weight-sharing matrices) with nonlinearities learn hierarchical representations.

    • Vector space insight: Each convolutional layer maps feature maps $X_l \in \mathbb{R}^{h \times w \times c_l}$ through linear filters $W_l$ to $X_{l+1}$. The number of channels $c_{l+1}$ sets the size of the output space, while the rank of the effective weight matrix bounds the dimension of the subspace that the pre-activation outputs can actually occupy.

  2. Word2Vec (Mikolov et al., 2013). Learned dense word embeddings in $\mathbb{R}^{300}$ by predicting context words. Famous “king - man + woman = queen” demonstrated that semantic relationships are linear offsets in embedding space.

    • Subspace insight: Analogies correspond to parallel vectors in subspaces (gender direction, verb tense direction). Linear algebra operations (vector arithmetic) capture linguistic structure.

  3. ResNet (He et al., 2015). Introduced skip connections $y = F(x) + x$, enabling training of 152-layer networks (previous best: ~20 layers). Won ImageNet 2015 with 3.57% top-5 error.

    • Closure insight: Adding $x$ and $F(x)$ is a linear combination, guaranteed to stay in the same vector space. Residuals $F(x)$ span a learned subspace; identity shortcuts preserve gradients during backpropagation.

  4. Transformer (Vaswani et al., 2017). Replaced recurrence with attention, enabling parallelization and scaling to billions of parameters (GPT-3 has 175B).

    • Linear combination insight: Attention outputs are weighted sums $\sum_i \alpha_i V_i$, constrained to $\text{span}(V)$. Multi-head attention learns multiple subspaces in parallel.

  5. Diffusion Models (Ho et al., 2020; Rombach et al., 2022). DALL-E 2, Stable Diffusion generate images by iteratively denoising in latent space. Latent vectors $z \in \mathbb{R}^{d_{\text{latent}}}$ lie in an autoencoder’s learned subspace.

Why essential: These systems achieve state-of-the-art performance by exploiting vector space structure (linear combinations, subspaces, closure). Understanding span, null space, and projections is necessary to debug failures, interpret representations, and design architectures.

Key ideas: Where it shows up

1. Principal Component Analysis (PCA) — Subspace projections for dimensionality reduction#

Major achievements:

  • Pearson (1901): Introduced the concept of “lines of closest fit” (principal components) for reducing multidimensional data to low-dimensional representations.

  • Hotelling (1933): Formalized PCA as finding orthogonal axes of maximum variance. Applied to psychology/economics data.

  • Modern applications: Face recognition (eigenfaces, Turk & Pentland 1991), image compression (low-rank approximation via truncated SVD), preprocessing for neural networks (whitening, decorrelation), latent semantic analysis (LSA for text, Deerwester et al. 1990).

  • Computational impact: For centered $X$, the covariance matrix $C = \frac{1}{n} X^\top X$ is PSD, its eigenspaces are mutually orthogonal subspaces, and projecting the data onto the top-$k$ eigenvectors minimizes reconstruction error among all $k$-dimensional subspaces.

Connection to subspaces: PCA finds the $k$-dimensional subspace (span of top eigenvectors) that best approximates the data cloud. The residuals lie in the orthogonal complement (discarded eigenspaces).
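As a minimal sketch of the projection described above, the snippet below runs PCA through the SVD of centered data. The synthetic data, the seed, and the choice $k = 2$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumed): 200 points in R^5 with most variance in 2 directions.
Z = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = Z + 0.05 * rng.normal(size=(200, 5)) + 3.0   # add noise and a nonzero mean

Xc = X - X.mean(axis=0)                          # center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
Vk = Vt[:k].T                    # top-k principal directions, shape (5, k)
X_proj = Xc @ Vk @ Vk.T          # projection onto the span of the top-k directions
residual = Xc - X_proj           # component in the orthogonal complement

# Residuals are orthogonal to the retained subspace.
print(np.allclose(residual @ Vk, 0))
# Squared reconstruction error equals the sum of the discarded squared singular values.
print(np.isclose((residual ** 2).sum(), (s[k:] ** 2).sum()))
```

Working from the SVD of the centered data avoids forming the covariance matrix explicitly; the right singular vectors are its eigenvectors.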

2. Stochastic Gradient Descent (SGD) — Parameter updates as linear combinations#

Major achievements:

  • Robbins & Monro (1951): Proved convergence of stochastic approximation methods under diminishing step sizes.

  • Deep learning era (2012-present): SGD with minibatches is the dominant optimizer for neural networks. Variants (momentum, Adam, RMSprop) use weighted averages of gradients—linear combinations in parameter space.

  • Theoretical foundations: Gradient descent never leaves the parameter vector space $\mathbb{R}^d$ because updates $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ are linear combinations. Convergence analysis relies on inner products (gradient angles) and subspace projections (low-rank gradients, Hessian-free optimization).

Connection to vector spaces: The optimization trajectory $\{\theta_0, \theta_1, \theta_2, \ldots\}$ lies entirely within the parameter space by closure. Momentum methods average previous gradients (linear combinations with exponential decay weights). Coordinate descent restricts updates to axis-aligned subspaces.
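One concrete consequence of updates being linear combinations can be checked numerically: for least squares, every gradient has the form $X^\top(\cdot)$, so gradient descent started at the origin never leaves the row space of $X$. The sketch below assumes a random wide matrix, a fixed step size `eta`, and 500 iterations, all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))     # assumed wide design matrix: 5 examples, 8 features
y = rng.normal(size=5)

w = np.zeros(8)                 # start at the origin
eta = 0.01                      # assumed step size
for _ in range(500):
    grad = X.T @ (X @ w - y)    # gradient of 0.5 * ||Xw - y||^2, always in row(X)
    w = w - eta * grad          # linear combination of w and grad

# Orthogonal projector onto the row space of X.
P_row = np.linalg.pinv(X) @ X
print(np.allclose(P_row @ w, w))   # w has no component in null(X)
```

Because the iterates stay in the row space, gradient descent from zero approaches the minimum-norm solution of an underdetermined least-squares problem.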

3. Deep Neural Networks — Compositional linear maps between layer subspaces#

Major achievements:

  • Universal approximation (Cybenko 1989, Hornik 1991): Neural networks with one hidden layer can approximate continuous functions arbitrarily well. The span of hidden layer activations determines expressiveness.

  • ImageNet revolution (Krizhevsky, Sutskever, Hinton 2012): AlexNet demonstrated that deep networks learn hierarchical feature representations. Each layer maps inputs through a linear transformation (matrix multiplication) followed by nonlinearity.

  • Residual connections (He et al. 2015): ResNets add skip connections $y = f(x) + x$, keeping outputs in the span of inputs plus a learned residual subspace.

Connection to linear maps: Each layer $h_{l+1} = \sigma(W_l h_l + b_l)$ applies a linear map $W_l$ (matrix multiplication), adds a bias $b_l$, and then applies a nonlinearity $\sigma$. The intermediate representation $h_l$ lives in a vector space; the pre-activation $W_l h_l + b_l$ is confined to the column space of $W_l$ shifted by $b_l$.

4. Kernel Methods — Implicit infinite-dimensional feature spaces#

Major achievements:

  • Support Vector Machines (Boser, Guyon, Vapnik 1992): Introduced the kernel trick for implicitly computing inner products in high-dimensional spaces without explicitly constructing features.

  • Reproducing Kernel Hilbert Spaces (Aronszajn 1950): Provided rigorous mathematical foundation. Kernels $k(x, x')$ correspond to inner products in a (possibly infinite-dimensional) feature space $\mathcal{H}$: $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$.

  • Modern applications: Gaussian processes (Rasmussen & Williams 2006), kernel PCA, kernel ridge regression, attention mechanisms (the scaled dot-product score is an inner product between query and key vectors).

Connection to vector spaces: The feature map $\phi: \mathcal{X} \to \mathcal{H}$ embeds inputs into a vector space (often infinite-dimensional). The kernel trick avoids explicit computation by working in the dual (span of training examples). Decision boundaries are hyperplanes in $\mathcal{H}$, corresponding to nonlinear boundaries in input space.
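Since a kernel is an inner product in feature space, its Gram matrix must be PSD. The sketch below checks this for an RBF kernel; the random inputs and the bandwidth `gamma` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))                 # assumed inputs: 30 points in R^3

# RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2); gamma is an assumed bandwidth.
gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)                # Gram matrix, shape (30, 30)

eigvals = np.linalg.eigvalsh(K)              # K is symmetric, so eigvalsh applies
print(np.allclose(K, K.T))
print(eigvals.min() >= -1e-10)               # PSD up to round-off
```

The small negative tolerance reflects the numerical-PSD point made earlier in the chapter: round-off can push eigenvalues that are zero in exact arithmetic slightly below zero.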

5. Transformer Attention — Weighted sums over value subspaces#

Major achievements:

  • Vaswani et al. (2017): “Attention is All You Need” introduced the Transformer architecture, replacing recurrence with self-attention. Enabled scaling to billion-parameter models (GPT-3, GPT-4, LLaMA).

  • Mechanism: Attention computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where $Q, K, V$ are linear projections of inputs. The output is a linear combination of value vectors $V$, with weights from softmax-normalized inner products $QK^\top$.

  • Multi-head attention: Projects to multiple subspaces (heads), learns different span representations in parallel, concatenates results.

Connection to subspaces: Each row of a head’s output lies in the span of the rows of its value matrix $V$. The attention weights $\alpha_i$ (softmax scores) determine the convex combination $\sum_{i=1}^n \alpha_i V_i$ (each row is a weighted sum of value vectors). Because the weights are nonnegative and sum to one, each output row lies in the convex hull of $\{V_1, \ldots, V_n\}$, which is contained in $\text{span}(\{V_1, \ldots, V_n\})$.
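A minimal single-head sketch of the mechanism above, with assumed sizes `n`, `d_k`, `d_v` and random `Q`, `K`, `V` standing in for the learned projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n, d_k, d_v = 4, 8, 6                          # assumed sizes for illustration
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

weights = softmax(Q @ K.T / np.sqrt(d_k))      # (n, n), rows are nonnegative and sum to 1
out = weights @ V                              # each output row is a weighted sum of rows of V

print(np.allclose(weights.sum(axis=1), 1.0))
# Convex combinations cannot leave the coordinate-wise range of the value vectors.
in_range = np.all(out <= V.max(axis=0) + 1e-12) and np.all(out >= V.min(axis=0) - 1e-12)
print(in_range)
```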

Notation

Standard Conventions#

1. Vectors and matrices.

  • Scalars: Lowercase Roman or Greek letters ($a, b, \alpha, \beta, \lambda$).

  • Vectors: Lowercase bold ($\mathbf{x}, \mathbf{w}$) or with explicit space annotation ($x \in \mathbb{R}^d$). Default: column vectors.

  • Matrices: Uppercase Roman letters ($A, X, W, \Sigma$). $A \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns.

  • Transpose: $A^\top$ (not $A^T$).

Examples:

  • MNIST images flattened to $x \in \mathbb{R}^{784}$ (28×28 pixels).

  • Dataset matrix $X \in \mathbb{R}^{n \times d}$ with $n$ examples (rows) and $d$ features (columns). Example: ImageNet batch $X \in \mathbb{R}^{256 \times 150528}$ (256 images, 224×224×3 pixels).

  • Weight matrix for a linear layer: $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ maps $\mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ via $y = Wx$.

2. Norms and inner products.

  • Euclidean norm (L2 norm): $\|x\|_2 = \sqrt{x_1^2 + \cdots + x_d^2} = \sqrt{x^\top x}$.

  • L1 norm (sparsity-inducing): $\|x\|_1 = |x_1| + \cdots + |x_d|$ (used in Lasso regression).

  • Frobenius norm (matrix): $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{trace}(A^\top A)}$.

  • Inner product (dot product): $\langle x, y \rangle = x^\top y = \sum_{i=1}^d x_i y_i$.

Examples:

  • Regularization: Ridge regression minimizes $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$ (L2 penalty).

  • Lasso regression: $\|Xw - y\|_2^2 + \lambda \|w\|_1$ (L1 penalty encourages sparse $w$).

  • Gradient magnitude: $\|\nabla \mathcal{L}(\theta)\|_2$ measures steepness of loss surface.

3. Subspaces and projections.

  • Column space: $\text{col}(A)$ or $\text{range}(A)$ or $\mathcal{R}(A)$.

  • Null space (kernel): $\text{null}(A)$ or $\ker(A)$ or $\mathcal{N}(A)$.

  • Orthogonal complement: $S^\perp = \{v \in V : \langle v, s \rangle = 0 \text{ for all } s \in S\}$.

  • Span: $\text{span}\{v_1, \ldots, v_k\}$ = all linear combinations $\sum_{i=1}^k \alpha_i v_i$.

Examples:

  • Least squares: predictions $\hat{y} = Xw$ lie in $\text{col}(X) \subseteq \mathbb{R}^n$. Residuals $r = y - \hat{y}$ lie in $\text{col}(X)^\perp$.

  • PCA: data projected onto $\text{span}\{u_1, \ldots, u_k\}$ where $u_i$ are top eigenvectors of covariance matrix.

  • Underdetermined systems: $Xw = y$ has infinitely many solutions in $w_0 + \text{null}(X)$ (affine subspace).
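The least-squares and underdetermined examples above can be verified in a few lines. This is a sketch on random data (shapes and seed are assumptions), using `np.linalg.lstsq` and the SVD.

```python
import numpy as np

rng = np.random.default_rng(4)

# Overdetermined case: the least-squares residual is orthogonal to col(X).
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ w
print(np.allclose(X.T @ r, 0))

# Underdetermined case: adding a null-space vector gives another exact solution.
A = rng.normal(size=(2, 4))
b = rng.normal(size=2)
w0, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimum-norm solution
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]                                     # a unit vector in null(A)
print(np.allclose(A @ (w0 + 5.0 * z), b))
```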

4. Special matrices and decompositions.

  • Identity matrix: $I$ (or $I_n$ for $n \times n$). Satisfies $Ix = x$ for all $x$.

  • Zero vector: $0$ (or $\mathbf{0}$). Satisfies $v + 0 = v$ for all $v$.

  • Eigenvalues/eigenvectors: $Ax = \lambda x$ with $x \neq 0$. Eigenvalue $\lambda \in \mathbb{R}$ (or $\mathbb{C}$), eigenvector $x \in \mathbb{R}^d$.

  • Singular value decomposition: $X = U \Sigma V^\top$ with $U \in \mathbb{R}^{n \times n}$ (left singular vectors), $\Sigma \in \mathbb{R}^{n \times d}$ (diagonal singular values $\sigma_i \geq 0$), $V \in \mathbb{R}^{d \times d}$ (right singular vectors).

Examples:

  • Covariance matrix: For centered $X$, $C = \frac{1}{n} X^\top X$ is PSD and has eigenpairs $(\lambda_i, u_i)$ with $\lambda_i \geq 0$.

  • SVD truncation: $X \approx U_k \Sigma_k V_k^\top$ (rank-$k$ approximation minimizing $\|X - \hat{X}\|_F$).

  • Condition number: $\kappa(X) = \sigma_{\max} / \sigma_{\min}$ measures numerical stability (large $\kappa$ → ill-conditioned).
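The covariance and condition-number examples are linked through the singular values: for centered data, the eigenvalues of $C$ are $\sigma_i^2 / n$. A short sketch on assumed random data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))                   # assumed data: 50 examples, 4 features
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

C = Xc.T @ Xc / n                              # covariance matrix (PSD)
eigvals = np.linalg.eigvalsh(C)[::-1]          # eigenvalues in descending order
s = np.linalg.svd(Xc, compute_uv=False)        # singular values in descending order

print(np.allclose(eigvals, s ** 2 / n))        # lambda_i = sigma_i^2 / n
kappa = s.max() / s.min()
print(np.isclose(kappa, np.linalg.cond(Xc)))   # matches NumPy's 2-norm condition number
```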

5. Index conventions.

  • Matrix indexing: $A_{ij}$ = element in row $i$, column $j$. Python uses 0-indexing; math uses 1-indexing.

  • Vector indexing: $x_i$ = $i$-th element of $x$. In Python: x[i] (0-based).

  • Colon notation: $A_{:,j}$ = $j$-th column of $A$. $A_{i,:}$ = $i$-th row. Ranges: $A_{1:k, :}$ = first $k$ rows.

Examples:

  • Feature $j$ across all examples: $X_{:,j} \in \mathbb{R}^n$ (column vector).

  • Example $i$ features: $X_{i,:} \in \mathbb{R}^{1 \times d}$ (row vector).

  • Top-$k$ singular vectors: $U_{:, 1:k} \in \mathbb{R}^{n \times k}$ (first $k$ columns of $U$).

Pitfalls & sanity checks

Common Mistakes#

1. Confusing affine and linear maps.

  • Error: Calling $f(x) = Wx + b$ a “linear” function.

  • Correction: It’s affine (not linear) if $b \neq 0$. Linear maps satisfy $f(0) = 0$; affine maps with $b \neq 0$ don’t.

  • Why it matters: Composition of affine maps is affine (not linear unless biases cancel). Regularization treats $W$ and $b$ differently.
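A quick numerical check of the distinction, using an assumed random `W` and a hand-picked nonzero bias `b`:

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.normal(size=(3, 2))
b = np.array([1.0, -2.0, 0.5])       # assumed nonzero bias

def f(x):
    return W @ x + b                 # affine map

x, y = rng.normal(size=2), rng.normal(size=2)

print(np.allclose(f(np.zeros(2)), 0))                       # False: f(0) = b != 0
print(np.allclose(f(x + y), f(x) + f(y)))                   # False: the sum differs by b
print(np.allclose(f(x + y) - b, (f(x) - b) + (f(y) - b)))   # True: x -> f(x) - b is linear
```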

2. Forgetting to center data for PCA.

  • Error: Computing eigenvalues of $X^\top X$ without centering $X$.

  • Correction: First compute $X_c = X - \frac{1}{n} \mathbf{1}\mathbf{1}^\top X$ (subtract column means), then use $X_c^\top X_c$.

  • Why it matters: Without centering, the first principal component points toward the mean (captures location, not variance).
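The effect of skipping the centering step can be seen on synthetic data. In this sketch (all values assumed), the variance is concentrated along $[1, -1]$ while the mean sits far from the origin at $[10, 10]$.

```python
import numpy as np

rng = np.random.default_rng(7)
# Assumed synthetic data: variance mostly along [1, -1], mean at [10, 10].
t = rng.normal(size=500)
X = np.outer(t, [1.0, -1.0]) + 0.1 * rng.normal(size=(500, 2)) + np.array([10.0, 10.0])

def top_direction(M):
    # First right singular vector of M = top eigenvector of M^T M.
    return np.linalg.svd(M, full_matrices=False)[2][0]

v_raw = top_direction(X)                        # no centering
v_centered = top_direction(X - X.mean(axis=0))  # with centering

mean_dir = np.array([1.0, 1.0]) / np.sqrt(2)
var_dir = np.array([1.0, -1.0]) / np.sqrt(2)
print(abs(v_raw @ mean_dir))        # close to 1: uncentered PCA points at the mean
print(abs(v_centered @ var_dir))    # close to 1: centered PCA finds the variance direction
```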

3. Assuming rank(X) = d by default.

  • Error: Solving $X^\top X w = X^\top y$ without checking if $X^\top X$ is invertible.

  • Correction: Check $\text{rank}(X)$ with np.linalg.matrix_rank(X). If $\text{rank}(X) < d$, use regularization (ridge regression) or pseudoinverse.

  • Why it matters: Singular $X^\top X$ causes LinAlgError or numerical instability (condition number $\kappa \to \infty$).

4. Confusing column space and row space.

  • Error: Saying “predictions $Xw$ lie in the span of rows of $X$.”

  • Correction: $Xw$ lies in the span of columns of $X$ (column space). Row space is the span of rows (equivalently, column space of $X^\top$).

  • Why it matters: For $X \in \mathbb{R}^{n \times d}$, column space is in $\mathbb{R}^n$ (prediction space), row space is in $\mathbb{R}^d$ (feature space).

5. Ignoring numerical stability.

  • Error: Computing $(X^\top X)^{-1} X^\top y$ explicitly (normal equations).

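  • Correction: Use np.linalg.lstsq(X, y, rcond=None), which is SVD-based and works on $X$ directly. If $X$ is well conditioned, scipy.linalg.solve(X.T @ X, X.T @ y, assume_a='pos') (Cholesky) avoids the explicit inverse, but it still forms $X^\top X$.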

  • Why it matters: Explicitly forming $X^\top X$ squares the condition number ($\kappa(X^\top X) = \kappa(X)^2$), amplifying errors.
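The squaring of the condition number is easy to observe. The sketch below builds a design matrix with prescribed singular values (an assumption for illustration) and compares $\kappa(X)$ with $\kappa(X^\top X)$.

```python
import numpy as np

rng = np.random.default_rng(8)
# Assumed ill-conditioned design: singular values from 1 down to 1e-4.
U, _ = np.linalg.qr(rng.normal(size=(50, 4)))    # orthonormal columns, shape (50, 4)
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # orthogonal, shape (4, 4)
X = U @ np.diag([1.0, 1e-1, 1e-2, 1e-4]) @ V.T
y = rng.normal(size=50)

kappa_X = np.linalg.cond(X)
kappa_gram = np.linalg.cond(X.T @ X)
print(f"cond(X) = {kappa_X:.2e}, cond(X^T X) = {kappa_gram:.2e}")

# Solving on X directly avoids the squared condition number.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With $\kappa(X) \approx 10^4$, the Gram matrix has $\kappa \approx 10^8$, so roughly half of the available double-precision digits are lost before any solve begins.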

Essential Sanity Checks#

Always verify shapes:

  • After matrix multiply $C = AB$, check C.shape == (A.shape[0], B.shape[1]).

  • For batch processing, ensure leading dimensions match (e.g., $X \in \mathbb{R}^{B \times d}$, $W \in \mathbb{R}^{d \times m}$ gives $XW \in \mathbb{R}^{B \times m}$).

Check rank before solving:

rank = np.linalg.matrix_rank(X)
if rank < X.shape[1]:
    print(f"Warning: X is rank-deficient ({rank} < {X.shape[1]}). Use regularization.")

Verify projections are idempotent: For projection matrix $P$, check $P^2 = P$ and $P^\top = P$ (orthogonal projection).

assert np.allclose(P @ P, P), "Projection not idempotent"
assert np.allclose(P.T, P), "Projection not symmetric"
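The assertions above assume a projection matrix `P` already exists. One way to build an orthogonal projector onto a column space is via QR; the matrix `A` here is an assumed random example.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(6, 3))          # assumed full-column-rank matrix
Q, _ = np.linalg.qr(A)               # orthonormal basis for col(A), shape (6, 3)
P = Q @ Q.T                          # orthogonal projector onto col(A)

assert np.allclose(P @ P, P), "Projection not idempotent"
assert np.allclose(P.T, P), "Projection not symmetric"
assert np.allclose(P @ A, A)         # vectors already in col(A) are unchanged
dim = int(round(float(np.trace(P)))) # trace of a projector equals the subspace dimension
print(dim)
```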

Test centering explicitly: After centering $X_c = X - \text{mean}(X)$, verify column means are zero:

assert np.allclose(X_c.mean(axis=0), 0), "Data not centered"

Condition number monitoring: For ill-conditioned systems, check $\kappa(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$:

cond = np.linalg.cond(X)
if cond > 1e10:
    print(f"Warning: X is ill-conditioned (κ = {cond:.2e}). Results may be numerically unstable.")

Debugging Checklist#

  • Shapes mismatch? Print X.shape, w.shape before every matrix operation.

  • Unexpected zeros? Check for rank deficiency (np.linalg.matrix_rank(X)).

  • Large errors? Compute residuals $\|Xw - y\|_2$, check if $y \in \text{col}(X)$.

  • Numerical issues? Switch to stable solvers (np.linalg.lstsq, QR, SVD instead of normal equations).

  • Non-converging optimization? Verify that gradients $\nabla \mathcal{L}(\theta)$ are finite (no NaN or Inf) and have the same shape as $\theta$, then check the learning rate.

References

Foundational Texts#

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

    • Chapters 1-4: Vector spaces, subspaces, orthogonality, least squares.

    • Emphasizes geometric intuition and computational methods.

    • Companion video lectures: MIT OpenCourseWare 18.06.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

    • Rigorous, abstract treatment (avoids determinants until late).

    • Focuses on vector spaces, linear maps, eigenvalues.

    • Best for theoretical foundations.

  3. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.). Cambridge University Press.

    • Comprehensive reference for matrix theory.

    • Covers norms, singular values, matrix decompositions, perturbation theory.

    • Graduate-level depth.

Machine Learning Perspectives#

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

    • Chapter 2: Linear Algebra (vectors, matrices, norms, eigendecomposition, SVD).

    • Chapter 6: Feedforward Networks (linear layers, activation functions).

    • Free online: deeplearningbook.org

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.

    • Chapter 3: Linear Methods for Regression (least squares, ridge, lasso, PCA).

    • Chapter 4: Linear Methods for Classification (LDA, logistic regression).

    • Emphasizes statistical perspective (bias-variance, model selection).

  3. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

    • Chapter 7: Linear Algebra (subspaces, rank, matrix calculus).

    • Chapter 11: Linear Regression (Bayesian, regularization).

    • Modern treatment with probabilistic framing.

Historical Papers#

  1. Pearson, K. (1901). “On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine, 2(11), 559–572.

    • Introduced principal components (PCA).

  2. Hotelling, H. (1933). “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology, 24(6), 417–441.

    • Formalized PCA with covariance matrices.

  3. Eckart, C., & Young, G. (1936). “The Approximation of One Matrix by Another of Lower Rank.” Psychometrika, 1(3), 211–218.

    • Proved SVD gives optimal low-rank approximation.

Modern Machine Learning#

  1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” NeurIPS 2013.

    • Word2Vec embeddings; demonstrated linear structure (analogies).

  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention is All You Need.” NeurIPS 2017.

    • Transformer architecture; attention as weighted sums (linear combinations).

  3. He, K., Zhang, X., Ren, S., & Sun, J. (2015). “Deep Residual Learning for Image Recognition.” CVPR 2016.

    • ResNets with skip connections ($y = F(x) + x$, closure in vector space).

  4. Ioffe, S., & Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML 2015.

    • Batch norm (centering + scaling activations).

Numerical Linear Algebra#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

    • Authoritative reference for numerical algorithms (QR, SVD, eigensolvers).

    • Emphasizes stability, conditioning, complexity.

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

    • Concise treatment of QR, SVD, least squares, eigenvalue algorithms.

    • Focus on geometric intuition and practical computation.

Online Resources#

  1. 3Blue1Brown (Grant Sanderson). Essence of Linear Algebra (video series).

  2. Gilbert Strang. MIT OpenCourseWare 18.06: Linear Algebra (video lectures).

  3. The Matrix Cookbook (Petersen & Pedersen, 2012).

Five worked examples

Worked Example 1: Embedding interpolation is still a vector#

Introduction#

Token embeddings in NLP models (Word2Vec, GloVe, BERT, GPT) map discrete tokens to continuous vectors in $\mathbb{R}^d$. A fundamental property: any linear combination of embeddings is again a vector in $\mathbb{R}^d$ (closure under vector space operations), although it need not coincide with the embedding of any token. This enables semantic arithmetic (“king” - “man” + “woman” ≈ “queen”), interpolation between concepts, and averaging embeddings for sentences or documents.

This example demonstrates that embedding spaces are vector spaces by explicitly computing an interpolation (convex combination) of two token embeddings. The result stays in $\mathbb{R}^d$, illustrating closure under linear combinations.

Purpose#

  • Verify closure: Show that $\alpha e(a) + (1-\alpha) e(b) \in \mathbb{R}^d$ for any embeddings $e(a), e(b)$ and scalar $\alpha \in [0,1]$.

  • Introduce convex combinations: Interpolation with $\alpha \in [0,1]$ produces points on the line segment between $e(a)$ and $e(b)$.

  • Connect to ML: Embedding arithmetic is used in analogy tasks, compositional semantics, and prompt engineering (e.g., blending concepts for image generation).

Importance#

Semantic compositionality. The vector space structure of embeddings enables composing meanings via linear algebra. Famous examples:

  • Word2Vec analogies (Mikolov et al., 2013): $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$ achieves ~40% accuracy on analogy tasks.

  • Sentence embeddings: Average token embeddings $\bar{v} = \frac{1}{n} \sum_{i=1}^n e(t_i)$ (simple but effective baseline for sentence similarity).

  • Image-text embeddings (CLIP, 2021): Contrastive learning aligns image and text embeddings in a shared vector space. Interpolations blend visual/textual concepts.

Training stability. Gradient descent updates embeddings via $e(t) \leftarrow e(t) - \eta \nabla \mathcal{L}$. Closure ensures embeddings never “leave” $\mathbb{R}^d$ during training.

What This Example Demonstrates#

This example shows that embedding spaces are closed under linear combinations, a necessary condition for being a vector space. Interpolation $v = \alpha e(a) + (1-\alpha)e(b)$ produces a point between $e(a)$ and $e(b)$, illustrating that we can “blend” semantic meanings by taking weighted averages.

The geometric interpretation: $e(a)$ and $e(b)$ define a line in $\mathbb{R}^d$; all convex combinations lie on the line segment $[e(a), e(b)]$. This extends to arbitrary linear combinations (not just convex), forming the span $\{\alpha e(a) + \beta e(b) : \alpha, \beta \in \mathbb{R}\}$ (a 2D subspace if $e(a)$ and $e(b)$ are linearly independent).

Background#

Distributional semantics. The idea that “words are characterized by the company they keep” (Firth, 1957) led to vector space models in NLP. Early methods (latent semantic analysis, 1990; HAL, 1997) used co-occurrence matrices. Modern neural embeddings (Word2Vec, 2013; GloVe, 2014) learn dense representations by predicting context words.

Vector space models in NLP:

  • Bag-of-words: Represent documents as sparse vectors in $\mathbb{R}^{|\text{vocab}|}$ (counts or TF-IDF weights).

  • Word embeddings: Learn dense vectors $e(w) \in \mathbb{R}^d$ ($d \approx 50$-$1000$) capturing semantic similarity. Similar words have nearby vectors (measured by cosine similarity or Euclidean distance).

  • Contextual embeddings (BERT, GPT): Embeddings depend on context; $e(w | \text{context})$ varies across sentences. Still vectors in $\mathbb{R}^d$ at each layer.

Closure and linearity: The vector space axioms (closure, distributivity) are assumed in embedding models but rarely verified explicitly. This example makes closure concrete: interpolation $\alpha e(a) + (1-\alpha)e(b)$ stays in $\mathbb{R}^d$ because $\mathbb{R}^d$ is a vector space.

Historical Context#

1. Distributional hypothesis (1950s-1960s). Harris (1954) and Firth (1957) proposed that word meaning is determined by distribution (co-occurrence patterns). This motivated vector representations based on context counts.

2. Latent Semantic Analysis (Deerwester et al., 1990). Applied SVD to term-document matrices, projecting words/documents into low-dimensional subspaces. Demonstrated that dimensionality reduction (via truncated SVD) preserves semantic relationships.

3. Word2Vec (Mikolov et al., 2013). Introduced skip-gram and CBOW models, training shallow neural networks to predict context words. Showed that embeddings exhibit linear structure: analogies correspond to parallel vectors ($v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$).

4. GloVe (Pennington et al., 2014). Combined global co-occurrence statistics with local context prediction, achieving state-of-the-art performance on analogy and similarity tasks.

5. Contextual embeddings (2018-present). BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) compute embeddings that vary by context, using Transformer architectures. Embeddings at each layer are still vectors in $\mathbb{R}^d$, but $e(w)$ depends on the entire input sequence.

History in Machine Learning#

  • 1990: LSA applies SVD to term-document matrices (vector space models).

  • 2013: Word2Vec popularizes dense embeddings; analogy tasks demonstrate linear structure.

  • 2014: GloVe combines global statistics with neural methods.

  • 2017: Transformers (Vaswani et al.) enable contextualized embeddings via attention.

  • 2018: BERT and GPT revolutionize NLP by learning contextual representations at scale.

  • 2021: CLIP (Radford et al.) aligns image and text embeddings in a shared vector space, enabling zero-shot image classification and text-to-image generation.

Prevalence in Machine Learning#

Ubiquitous in NLP: Every modern NLP model (BERT, GPT, T5, LLaMA) uses token embeddings in $\mathbb{R}^d$ ($d = 768$ for BERT-base, $d = 12288$ for the 175B-parameter GPT-3). Embeddings are the primary representation for text.

Vision and multimodal models:

  • Vision Transformers (ViT, 2020): Patch embeddings in $\mathbb{R}^d$ replace pixel representations.

  • CLIP (2021): Image and text embeddings in a shared $\mathbb{R}^{512}$ space enable cross-modal retrieval.

  • DALL-E, Stable Diffusion (2021-2022): Text embeddings condition diffusion models for image generation.

Recommendation systems: Item embeddings in $\mathbb{R}^d$ capture user preferences. Collaborative filtering factorizes user-item matrices into embeddings.

Notes and Explanatory Details#

Shape discipline: $e(a) \in \mathbb{R}^d$, $e(b) \in \mathbb{R}^d$, $\alpha \in \mathbb{R}$. The interpolation $v = \alpha e(a) + (1-\alpha)e(b)$ is a linear combination, so $v \in \mathbb{R}^d$ by closure.

Convex combinations: Restricting $\alpha \in [0,1]$ ensures $v$ lies on the line segment $[e(a), e(b)]$. Allowing $\alpha \in \mathbb{R}$ gives the entire line through $e(a)$ and $e(b)$ (their affine hull). The span is larger: it allows independent coefficients, $\alpha e(a) + \beta e(b)$.

Geometric interpretation: In 3D, if $e(a) = [1, 0, 2]$ and $e(b) = [-1, 3, 0]$, then $v = 0.3 e(a) + 0.7 e(b)$ lies 30% of the way from $e(b)$ to $e(a)$.
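The 3D example can be computed directly. The sketch uses the vectors from the text; the variable names are illustrative.

```python
import numpy as np

e_a = np.array([1.0, 0.0, 2.0])
e_b = np.array([-1.0, 3.0, 0.0])
alpha = 0.3

v = alpha * e_a + (1 - alpha) * e_b   # convex combination, still a vector in R^3
print(v)                              # approximately [-0.4, 2.1, 0.6]

# v sits 30% of the way from e(b) to e(a): v = e(b) + 0.3 * (e(a) - e(b)).
frac = np.linalg.norm(v - e_b) / np.linalg.norm(e_a - e_b)
print(np.isclose(frac, alpha))
```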

Numerical considerations: Embedding norms vary (typical $\|e(w)\|_2 \approx 1$-$10$ depending on initialization). Normalization (dividing by $\|e(w)\|_2$) is common for cosine similarity metrics.

Connection to Machine Learning#

Analogy tasks. Linear offsets capture semantic relationships: $v_{\text{France}} - v_{\text{Paris}} \approx v_{\text{Germany}} - v_{\text{Berlin}}$ (capital relationship). The vector $v_{\text{France}} - v_{\text{Paris}}$ represents the “capital-of” direction in embedding space.

Prompt interpolation. In text-to-image models, interpolating prompt embeddings generates images blending two concepts. Example: $\alpha e(\text{"dog"}) + (1-\alpha)e(\text{"cat"})$ with $\alpha = 0.5$ might generate a hybrid dog-cat image.

Sentence embeddings. Averaging token embeddings $\bar{v} = \frac{1}{n} \sum_{i=1}^n e(t_i)$ is a simple but effective sentence representation, and a common baseline for learned sentence encoders such as Skip-Thought and InferSent. More sophisticated: weighted averages (TF-IDF weights) or learned aggregations (attention).
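The analogy and averaging operations can be sketched with toy vectors. The embeddings below are hypothetical hand-picked values, not outputs of any trained model, constructed so that "royalty" and "gender" are separate axes.

```python
import numpy as np

# Toy 3-D embeddings (hypothetical values, not from a trained model).
# Axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ unrelated.
emb = {
    "king":  np.array([1.0,  1.0, 0.2]),
    "queen": np.array([1.0, -1.0, 0.2]),
    "man":   np.array([0.0,  1.0, 0.1]),
    "woman": np.array([0.0, -1.0, 0.1]),
    "apple": np.array([0.1,  0.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Analogy by vector arithmetic: king - man + woman.
target = emb["king"] - emb["man"] + emb["woman"]
# Nearest neighbour by cosine similarity, excluding the three query words.
candidates = [w for w in emb if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(target, emb[w]))
print(best)   # queen

# Sentence embedding as the average of token embeddings.
sentence = np.mean([emb["king"], emb["queen"]], axis=0)
```

Real embeddings only approximately satisfy such offsets, so in practice the nearest neighbour is found by ranking the whole vocabulary rather than by exact equality.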

Connection to Linear Algebra Theory#

Vector space axioms. $\mathbb{R}^d$ satisfies all vector space axioms:

  1. Closure: $e(a) + e(b) \in \mathbb{R}^d$ and $\alpha e(a) \in \mathbb{R}^d$.

  2. Associativity: $(e(a) + e(b)) + e(c) = e(a) + (e(b) + e(c))$.

  3. Commutativity: $e(a) + e(b) = e(b) + e(a)$.

  4. Identity: $e(a) + 0 = e(a)$ where $0 = [0, \ldots, 0] \in \mathbb{R}^d$.

  5. Inverses: $e(a) + (-e(a)) = 0$.

  6. Distributivity: $\alpha(e(a) + e(b)) = \alpha e(a) + \alpha e(b)$ and $(\alpha + \beta) e(a) = \alpha e(a) + \beta e(a)$.

  7. Scalar identity: $1 \cdot e(a) = e(a)$.

Subspaces. The span of embeddings $\text{span}\{e(t_1), \ldots, e(t_k)\}$ is a subspace of $\mathbb{R}^d$. For a vocabulary of size $|V|$, all embeddings lie in an $r$-dimensional subspace, where $r \leq \min(|V|, d)$ is the rank of the embedding matrix; $r < d$ corresponds to a low-rank embedding matrix.

Affine and convex combinations. Affine combinations $\sum_{i=1}^k \alpha_i e(t_i)$ with $\sum_i \alpha_i = 1$ form the affine hull of the embeddings. Adding the constraint $\alpha_i \geq 0$ gives convex combinations, which form the convex hull (a polytope in $\mathbb{R}^d$). Sentence embeddings via averaging lie in this convex hull.

Pedagogical Significance#

Concrete verification of closure. Many students learn vector space axioms abstractly but rarely see explicit numerical verification. This example shows that $\alpha e(a) + (1-\alpha)e(b) \in \mathbb{R}^d$ by computing actual numbers.

Geometric intuition. Interpolation visualizes the line segment between two points in $\mathbb{R}^d$. Extending to $\alpha \notin [0,1]$ shows extrapolation (moving beyond $e(a)$ or $e(b)$ along the line).

Foundation for advanced topics. Understanding embedding spaces as vector spaces is prerequisite for:

  • Analogies: Vector arithmetic $e(a) - e(b) + e(c)$ requires closure.

  • Dimensionality reduction: Projecting embeddings to lower-dimensional subspaces (PCA) or embedding them with nonlinear low-dimensional maps (t-SNE).

  • Alignment: Mapping embeddings between languages (Procrustes alignment, learned transforms).

References#

  1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” NeurIPS 2013. Extended the Word2Vec skip-gram model with negative sampling and phrase vectors; demonstrated analogy tasks.

  2. Pennington, J., Socher, R., & Manning, C. D. (2014). “GloVe: Global Vectors for Word Representation.” EMNLP 2014. Combined global co-occurrence statistics with local context.

  3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019. Contextual embeddings via masked language modeling.

  4. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” ICML 2021. CLIP aligns image/text embeddings in shared space.

  5. Firth, J. R. (1957). “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic Analysis. Introduced distributional hypothesis: “You shall know a word by the company it keeps.”

Problem. Show token embeddings live in a vector space and compute an interpolation.

Solution (math).

Given embeddings $e(a), e(b) \in \mathbb{R}^d$ and $\alpha \in [0,1]$, the interpolation is: $$ v = \alpha e(a) + (1-\alpha)e(b) $$

By closure of $\mathbb{R}^d$ under linear combinations, $v \in \mathbb{R}^d$. For $\alpha = 0$, $v = e(b)$; for $\alpha = 1$, $v = e(a)$; for $\alpha = 0.5$, $v$ is the midpoint.

Solution (Python).

import numpy as np

# Define embeddings for tokens 'a' and 'b' in R^3
E = {
    'a': np.array([1., 0., 2.]),
    'b': np.array([-1., 3., 0.])
}

# Interpolation parameter (0 <= alpha <= 1)
alpha = 0.3

# Compute convex combination
v = alpha * E['a'] + (1 - alpha) * E['b']

print(f"e(a) = {E['a']}")
print(f"e(b) = {E['b']}")
print(f"v = {alpha} * e(a) + {1-alpha} * e(b) = {v}")
print(f"v is in R^3: {v.shape == (3,)}")

Output:

e(a) = [1. 0. 2.]
e(b) = [-1.  3.  0.]
v = 0.3 * e(a) + 0.7 * e(b) = [-0.4  2.1  0.6]
v is in R^3: True

Worked Example 2: Zero-mean subspace projection#

Introduction#

Centering data (subtracting the mean) is a ubiquitous preprocessing step in machine learning. PCA, covariance estimation, batch normalization, and many other algorithms assume zero-mean data. Mathematically, zero-mean vectors form a subspace: $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ where $\mathbf{1} = [1, \ldots, 1]^\top$ is the all-ones vector.

This example shows that $S$ is the null space of the row vector $\mathbf{1}^\top$, demonstrates projection onto $S$ via the centering matrix $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$, and verifies that the projected vector has zero mean.

Purpose#

  • Demonstrate subspaces defined by constraints: $S = \{x : \mathbf{1}^\top x = 0\}$ is a hyperplane through the origin ($(n-1)$-dimensional subspace).

  • Introduce projection matrices: $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ projects onto $S$ (removes the mean).

  • Connect to ML: Centering data is equivalent to projecting onto the zero-mean subspace.

Importance#

PCA and covariance estimation. PCA operates on the centered data matrix $X_c = X - \frac{1}{n} \mathbf{1}\mathbf{1}^\top X$ (each column has zero mean). The covariance matrix $C = \frac{1}{n} X_c^\top X_c$ measures variance around the mean; if data is not centered, $C$ mixes mean and variance.

Batch normalization (Ioffe & Szegedy, 2015). Normalizes layer activations by subtracting batch mean and dividing by batch std. The mean-centering step is projection onto the zero-mean subspace.

Regularization and identifiability. In linear regression with an intercept $f(x) = w^\top x + b$, centering inputs $x \mapsto x - \bar{x}$ and targets $y \mapsto y - \bar{y}$ decouples the intercept from the weights, improving numerical stability and interpretability.

What This Example Demonstrates#

This example shows that:

  1. Zero-mean vectors form a subspace (closed under addition/scaling, contains zero).

  2. Projection onto $S$ is linear: $x_{\text{proj}} = Px$ with $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$.

  3. The projection removes the mean: $\mathbf{1}^\top (Px) = 0$ for all $x$.

  4. $P$ is idempotent: $P^2 = P$ (projecting twice is the same as projecting once).

  5. $P$ is symmetric: $P^\top = P$ (orthogonal projection).

Background#

Affine subspaces vs. linear subspaces. An affine hyperplane has the form $\{x : a^\top x = c\}$. It is a linear subspace only when $c = 0$; for $c \neq 0$ it does not contain the origin. Zero-mean vectors form a linear subspace because $\mathbf{1}^\top 0 = 0$ and the constraint $\mathbf{1}^\top x = 0$ is preserved under addition and scaling.

Centering matrix. The matrix $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ is called the centering matrix or projection onto zero-mean subspace. It satisfies:

  • $P\mathbf{1} = 0$ (projects all-ones vector to zero).

  • $Px = x - \bar{x} \mathbf{1}$ where $\bar{x} = \frac{1}{n} \mathbf{1}^\top x$ (subtracts the mean).

  • $P^2 = P$ (idempotent: projecting a second time changes nothing).

  • $P^\top = P$ (symmetric: orthogonal projection).

Null space and column space. $S = \text{null}(\mathbf{1}^\top)$ is the set of vectors perpendicular to $\mathbf{1}$. The orthogonal complement is $S^\perp = \text{span}\{\mathbf{1}\}$ (scalar multiples of $\mathbf{1}$). By the fundamental theorem of linear algebra, $\mathbb{R}^n = S \oplus S^\perp$ (direct sum).

Historical Context#

1. Least squares and centering (Gauss, 1809). Gauss used mean-centering in least squares for astronomical orbit fitting (method of least squares, published in Theoria Motus).

2. PCA (Pearson 1901, Hotelling 1933). Both Pearson’s “lines of closest fit” and Hotelling’s “principal components” assume centered data. The covariance matrix $C = \frac{1}{n} X_c^\top X_c$ requires centering: without it, $\frac{1}{n} X^\top X$ is the uncentered second-moment matrix, which conflates mean and variance.

3. Projection matrices (Penrose 1955, Rao 1955). The theory of orthogonal projections was formalized in the 1950s. The centering matrix $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ is a rank-$(n-1)$ projection matrix.

4. Batch normalization (Ioffe & Szegedy, 2015). Revolutionized deep learning by normalizing layer activations. The first step is centering: $\hat{x}_i = x_i - \frac{1}{B} \sum_{j=1}^B x_j$ (subtract batch mean).

History in Machine Learning#

  • 1809: Gauss applies least squares with mean-centering (astronomical data).

  • 1901: Pearson introduces PCA (assumes centered data).

  • 1933: Hotelling formalizes PCA (covariance matrix requires centering).

  • 1955: Penrose and Rao develop theory of projection matrices.

  • 2015: Batch normalization (Ioffe & Szegedy) makes centering a built-in layer operation.

  • 2016: Layer normalization (Ba et al.) centers across features instead of batches.

  • 2018: Group normalization (Wu & He) centers within feature groups (used in computer vision).

Prevalence in Machine Learning#

Preprocessing: Nearly all classical ML algorithms (PCA, LDA, SVM, ridge regression) assume centered data. Scikit-learn’s StandardScaler first centers ($x \mapsto x - \bar{x}$) then scales ($x \mapsto x / \sigma$).

Deep learning normalization:

  • Batch norm: Centers and scales activations using mini-batch statistics.

  • Layer norm: Centers and scales across feature dimension (used in Transformers).

  • Instance norm: Centers each example independently (style transfer, GANs).

Optimization: The Adam optimizer maintains exponential moving averages of gradients and squared gradients. The first moment $m_t$ estimates the mean gradient; the second moment $v_t$ is an uncentered variance estimate, since the mean is not subtracted before squaring.

Notes and Explanatory Details#

Shape discipline:

  • Input: $x \in \mathbb{R}^n$ (column vector).

  • All-ones vector: $\mathbf{1} \in \mathbb{R}^n$ (column vector).

  • Mean: $\bar{x} = \frac{1}{n} \mathbf{1}^\top x \in \mathbb{R}$ (scalar).

  • Centering matrix: $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \in \mathbb{R}^{n \times n}$.

  • Projected vector: $x_{\text{proj}} = Px \in \mathbb{R}^n$.

Verification of projection properties:

  1. Projects to zero-mean subspace: $\mathbf{1}^\top (Px) = \mathbf{1}^\top \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \right) x = \mathbf{1}^\top x - \frac{1}{n} (\mathbf{1}^\top \mathbf{1})(\mathbf{1}^\top x) = \mathbf{1}^\top x - \mathbf{1}^\top x = 0$.

  2. Idempotent: $P^2 = \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \right)^2 = I - \frac{2}{n} \mathbf{1}\mathbf{1}^\top + \frac{1}{n^2} \mathbf{1}(\mathbf{1}^\top \mathbf{1})\mathbf{1}^\top = I - \frac{2}{n} \mathbf{1}\mathbf{1}^\top + \frac{1}{n} \mathbf{1}\mathbf{1}^\top = P$.

  3. Symmetric: $P^\top = \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \right)^\top = I - \frac{1}{n} (\mathbf{1}\mathbf{1}^\top)^\top = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top = P$.

Numerical considerations: Computing $P$ explicitly (storing $n \times n$ matrix) is wasteful for large $n$. Instead, compute $Px = x - \bar{x} \mathbf{1}$ directly (subtracting the mean).
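A minimal sketch comparing the two routes on synthetic data (the size $n = 200$ and the random vector are arbitrary choices for illustration). It also checks the idempotence and symmetry properties numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)

# Explicit centering matrix: O(n^2) memory and O(n^2) work per product
P = np.eye(n) - np.ones((n, n)) / n
x_explicit = P @ x

# Direct mean subtraction: O(n) memory and work
x_direct = x - x.mean()

print("Same result:", np.allclose(x_explicit, x_direct))
print("Idempotent: ", np.allclose(P @ P, P))
print("Symmetric:  ", np.allclose(P, P.T))
```

The two results agree up to floating-point rounding, so the explicit matrix is useful for derivations but rarely needed in code.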

Connection to Machine Learning#

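PCA. The centered data matrix $X_c = PX$ (apply $P$ to each column) ensures principal components capture variance, not mean. Without centering, $\frac{1}{n} X^\top X = C + \mu\mu^\top$ with $\mu = \frac{1}{n} X^\top \mathbf{1}$ the vector of column means, so the leading direction can be dominated by the mean rather than by the variance.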

Batch normalization. For a mini-batch $\{x_1, \ldots, x_B\}$, batch norm computes: $$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \mu_B = \frac{1}{B} \sum_{i=1}^B x_i, \quad \sigma_B^2 = \frac{1}{B} \sum_{i=1}^B (x_i - \mu_B)^2 $$ The centering step $x_i - \mu_B$ is projection onto the zero-mean subspace.
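The formula above can be sketched in NumPy as follows. This is a simplified illustration, not a framework implementation: it omits the learned scale and shift parameters and the running statistics used at inference time, and the function name `batch_norm` and the synthetic batch are assumptions.

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    """Normalize each feature (column) of a (B, d) batch."""
    mu = X.mean(axis=0)    # batch mean per feature
    var = X.var(axis=0)    # population variance (divides by B), as in the formula
    return (X - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # B = 8 examples, d = 4 features
X_hat = batch_norm(X)

print("Per-feature mean after normalization:", X_hat.mean(axis=0))
print("Per-feature variance after normalization:", X_hat.var(axis=0))
```

Each column of `X - mu` is the projection of that column onto the zero-mean subspace of $\mathbb{R}^B$; the variance comes out slightly below 1 because of the $\epsilon$ term.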

Residuals in regression. In ordinary least squares, residuals $r = y - \hat{y} = (I - X(X^\top X)^{-1} X^\top)y$ are projections onto the orthogonal complement of $\text{col}(X)$. If $X$ includes a column of ones (intercept), residuals automatically have zero mean.
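A quick numerical check of this claim, using synthetic data and `np.linalg.lstsq` (the sizes and random targets are arbitrary; any design matrix with a column of ones would do):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
features = rng.normal(size=(n, 2))
y = rng.normal(size=n)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(n), features])

# Least-squares fit and residuals
w, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ w

print(f"Mean of residuals: {r.mean():.2e}")
print("Residuals orthogonal to col(X):", np.allclose(X.T @ r, 0.0))
```

Orthogonality to the ones column is exactly the statement $\mathbf{1}^\top r = 0$, which is why the residual mean vanishes up to rounding.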

Connection to Linear Algebra Theory#

Projection theorem. Every vector $x \in \mathbb{R}^n$ can be uniquely decomposed as $x = x_\parallel + x_\perp$ where $x_\parallel \in S$ and $x_\perp \in S^\perp$. For $S = \{x : \mathbf{1}^\top x = 0\}$, we have: $$ x_\parallel = Px = x - \bar{x} \mathbf{1}, \quad x_\perp = (I-P)x = \bar{x} \mathbf{1} $$
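The decomposition can be checked on a small vector (the entries are arbitrary illustration values):

```python
import numpy as np

x = np.array([3.0, 1.0, 0.0, -2.0, 4.0])

x_par = x - x.mean()                 # component in S (zero mean)
x_perp = np.full_like(x, x.mean())   # component in span{1} (constant vector)

print("Reconstructs x:", np.allclose(x_par + x_perp, x))
print(f"Inner product <x_par, x_perp> = {x_par @ x_perp:.2e}")
print(f"||x||^2 = {x @ x:.4f}, ||x_par||^2 + ||x_perp||^2 = {x_par @ x_par + x_perp @ x_perp:.4f}")
```

Because the two components are orthogonal, the squared norms add up (Pythagoras), which the last line confirms.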

Rank of projection matrix. $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ has rank $n-1$ because:

  • $\text{null}(P) = \text{span}\{\mathbf{1}\}$ (1D subspace).

  • By rank-nullity theorem, $\text{rank}(P) + \dim(\text{null}(P)) = n$, so $\text{rank}(P) = n-1$.

Eigenvalues of $P$. $P$ has eigenvalues $\lambda = 1$ (multiplicity $n-1$) and $\lambda = 0$ (multiplicity 1):

  • Eigenvectors with $\lambda = 1$: any $v \perp \mathbf{1}$ (orthogonal to all-ones).

  • Eigenvector with $\lambda = 0$: $v = \mathbf{1}$.

This confirms $P$ is a projection: eigenvalues are 0 or 1, characteristic of projection matrices.
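A small numerical check of the spectrum and rank, assuming $n = 5$ for illustration. `np.linalg.eigvalsh` is used because $P$ is symmetric; it returns eigenvalues in ascending order.

```python
import numpy as np

n = 5
P = np.eye(n) - np.ones((n, n)) / n

eigvals = np.linalg.eigvalsh(P)       # ascending order, real for symmetric P
rank = np.linalg.matrix_rank(P)

print("Eigenvalues (rounded):", np.round(eigvals, 10))
print("Rank:", rank)
print("P @ 1 is zero:", np.allclose(P @ np.ones(n), 0.0))
```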

Relation to covariance. The sample covariance matrix is: $$ C = \frac{1}{n} X_c^\top X_c = \frac{1}{n} (PX)^\top (PX) = \frac{1}{n} X^\top P^\top P X = \frac{1}{n} X^\top P X $$ using $P^\top = P$ and $P^2 = P$.
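The identity $C = \frac{1}{n} X^\top P X$ can be compared against NumPy's covariance routine. The sketch assumes synthetic data with a nonzero mean; `np.cov` is called with `rowvar=False` (columns are variables) and `bias=True` (normalize by $n$ rather than $n-1$) to match the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d)) + 5.0   # shift so the mean is clearly nonzero

P = np.eye(n) - np.ones((n, n)) / n
C_proj = X.T @ P @ X / n                      # (1/n) X^T P X
C_np = np.cov(X, rowvar=False, bias=True)     # NumPy's 1/n covariance

print("Matches np.cov:", np.allclose(C_proj, C_np))
```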

Pedagogical Significance#

Concrete example of a subspace. Students often learn subspaces abstractly (“closed under addition and scaling”). This example gives a geometric and algebraic definition: $S = \{x : \mathbf{1}^\top x = 0\}$ (algebraic) is an $(n-1)$-dimensional hyperplane through the origin (geometric).

Projection as matrix multiplication. Projecting $x$ onto $S$ is simply $x_{\text{proj}} = Px$. This demystifies projections (often introduced with complicated formulas) by showing they’re linear maps.

Foundation for PCA. Understanding centering is essential before learning PCA. Many textbooks jump to “compute eigenvalues of $X^\top X$” without explaining why $X$ must be centered.

Computational perspective. Explicitly forming $P$ (storing $n^2$ entries) is wasteful. Implementing $Px$ as $x - \bar{x} \mathbf{1}$ (computing mean, subtracting) is much faster ($O(n)$ vs. $O(n^2)$).

References#

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Introduced least squares with mean-centering for orbit determination.

  2. Hotelling, H. (1933). “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology, 24(6), 417–441. Formalized PCA (assumes centered data).

  3. Ioffe, S., & Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML 2015. Introduced batch normalization (centering + scaling layer activations).

  4. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press. Chapter 4 covers projection matrices and least squares.

  5. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.). Cambridge University Press. Section 2.5 covers idempotent matrices (projections).

Problem. Project $x$ onto $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ (zero-mean subspace).

Solution (math).

$S$ is a subspace (null space of $\mathbf{1}^\top$). The projection matrix is: $$ P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top $$

Applying $P$ to $x$ gives: $$ x_{\text{proj}} = Px = x - \frac{1}{n} (\mathbf{1}^\top x) \mathbf{1} = x - \bar{x} \mathbf{1} $$ where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the mean of $x$. Verification: $\mathbf{1}^\top x_{\text{proj}} = \mathbf{1}^\top (x - \bar{x} \mathbf{1}) = \mathbf{1}^\top x - n \bar{x} = 0$.

Solution (Python).

import numpy as np

# Define vector x in R^5
n = 5
x = np.array([3., 1., 0., -2., 4.])

# Centering matrix P = I - (1/n) * 1 * 1^T
I = np.eye(n)
one = np.ones((n, 1))
P = I - (1 / n) * (one @ one.T)

# Project x onto zero-mean subspace
x_proj = P @ x

print(f"Original x: {x}")
print(f"Mean of x: {x.mean():.2f}")
print(f"Projected x_proj: {x_proj}")
print(f"Mean of x_proj: {x_proj.mean():.2e} (should be ~0)")
# Parenthesize the product before extracting the scalar: indexing binds tighter than @
print(f"Verification: 1^T @ x_proj = {(one.T @ x_proj).item():.2e}")

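Output (values that are zero in exact arithmetic may print as tiny nonzero numbers on the order of 1e-16, depending on rounding):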

Original x: [ 3.  1.  0. -2.  4.]
Mean of x: 1.20
Projected x_proj: [ 1.8 -0.2 -1.2 -3.2  2.8]
Mean of x_proj: 0.00e+00 (should be ~0)
Verification: 1^T @ x_proj = 0.00e+00

Worked Example 3: Model outputs form range(X)#

Introduction#

In linear regression $\hat{y} = Xw$, all possible predictions lie in the column space (range) of the feature matrix $X$. This fundamental constraint determines model expressiveness: if the target $y \notin \text{col}(X)$, the model cannot fit perfectly (residual is nonzero). Understanding $\text{col}(X)$ as a subspace clarifies when adding features helps, when features are redundant (linearly dependent), and how model capacity relates to matrix rank.

Purpose#

  • Identify $\text{col}(X)$ as a subspace: All vectors $Xw$ (for $w \in \mathbb{R}^d$) form a subspace of $\mathbb{R}^n$.

  • Relate expressiveness to rank: $\dim(\text{col}(X)) = \text{rank}(X) \leq \min(n, d)$.

  • Connect to ML: Model predictions span a $\text{rank}(X)$-dimensional subspace. If $\text{rank}(X) < n$, the model cannot fit arbitrary targets.

Importance#

Underdetermined vs. overdetermined systems.

  • Underdetermined ($n < d$): More features than examples. $\text{rank}(X) \leq n < d$, so the null space is nontrivial and any solution of $Xw = y$ is non-unique; when $\text{rank}(X) = n$, every target has infinitely many exact fits.

  • Overdetermined ($n > d$): More examples than features. $\text{rank}(X) \leq d < n$, so exact fit is impossible unless $y \in \text{col}(X)$ (rare). Least squares finds best approximation.

Multicollinearity. If $\text{rank}(X) < d$, features are linearly dependent (redundant). Example: including both “temperature in Celsius” and “temperature in Fahrenheit” alongside an intercept column makes $X$ rank-deficient, because $F = 1.8\,C + 32$ is a linear combination of the Celsius column and the constant column. Solutions are then non-unique: if $w$ is a solution, so is $w + v$ for any $v \in \text{null}(X)$.
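The temperature example can be checked with `np.linalg.matrix_rank`. The readings below are made up. Note that Fahrenheit is an affine function of Celsius ($F = 1.8\,C + 32$), so the linear dependence only appears once a constant (intercept) column is present.

```python
import numpy as np

celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])   # hypothetical readings
fahrenheit = 1.8 * celsius + 32.0
ones = np.ones_like(celsius)

X_no_intercept = np.column_stack([celsius, fahrenheit])
X_with_intercept = np.column_stack([ones, celsius, fahrenheit])

rank_no = np.linalg.matrix_rank(X_no_intercept)
rank_with = np.linalg.matrix_rank(X_with_intercept)

print(f"Without intercept: rank {rank_no} of {X_no_intercept.shape[1]} columns")
print(f"With intercept:    rank {rank_with} of {X_with_intercept.shape[1]} columns")
```

With the intercept, three columns span only a 2-dimensional subspace, so one feature adds no capacity.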

What This Example Demonstrates#

  • Column space = set of all predictions: $\{Xw : w \in \mathbb{R}^d\} = \text{col}(X) = \text{span}\{x_1, \ldots, x_d\}$ where $x_j$ are columns of $X$.

  • Rank determines dimension: $\dim(\text{col}(X)) = \text{rank}(X)$.

  • NumPy verification: np.linalg.matrix_rank(X) computes rank via SVD.

Background#

Fundamental Theorem of Linear Algebra (Strang). For $A \in \mathbb{R}^{m \times n}$:

  1. $\text{col}(A)$ and $\text{null}(A^\top)$ are orthogonal complements in $\mathbb{R}^m$: $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$.

  2. $\text{col}(A^\top)$ and $\text{null}(A)$ are orthogonal complements in $\mathbb{R}^n$: $\mathbb{R}^n = \text{col}(A^\top) \oplus \text{null}(A)$.

  3. $\dim(\text{col}(A)) = \dim(\text{col}(A^\top)) = \text{rank}(A)$.

  4. Rank-nullity theorem: $\text{rank}(A) + \dim(\text{null}(A)) = n$.

Historical Context: The concept of rank appeared in Frobenius (1911) and Sylvester (1850s), but the geometric interpretation as “dimension of column space” became standard only in the 20th century with abstract linear algebra.

Connection to Machine Learning#

Regularization and identifiability. If $\text{rank}(X) < d$, the normal equations $X^\top X w = X^\top y$ are singular ($X^\top X$ is not invertible). Ridge regression adds $\lambda I$ with $\lambda > 0$, making $X^\top X + \lambda I$ positive definite and hence invertible; the resulting solution is unique and lies in the row space of $X$ (it has no component in $\text{null}(X)$).
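A minimal sketch with a deliberately rank-deficient design matrix (second column equal to twice the first). The data and the value $\lambda = 0.1$ are arbitrary illustration choices.

```python
import numpy as np

# Second feature is exactly 2x the first, so rank(X) = 1 < d = 2
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

G = X.T @ X
lam = 0.1
G_ridge = G + lam * np.eye(2)

# Ridge solution: solve (X^T X + lam I) w = X^T y
w_ridge = np.linalg.solve(G_ridge, X.T @ y)

print("rank(X^T X)         =", np.linalg.matrix_rank(G))
print("rank(X^T X + lam I) =", np.linalg.matrix_rank(G_ridge))
print("ridge weights       =", w_ridge)
```

The ridge weights are proportional to $[1, 2]$, the direction spanning the row space of this $X$, so no weight is spent on the null-space direction.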

Feature selection. Adding a feature that’s a linear combination of existing features (e.g., $x_{\text{new}} = 2x_1 + 3x_2$) does not increase $\text{rank}(X)$ or model capacity. Feature selection methods (LASSO, forward selection) aim to find a small, non-redundant subset of features that retains predictive power.

Low-rank approximation. If $X$ is approximately low-rank (its singular values decay quickly), truncated SVD $X \approx U_k \Sigma_k V_k^\top$ captures most information with $k \ll d$ features. This is the basis of PCA, autoencoders, and matrix factorization (recommender systems).
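A minimal sketch of truncated SVD on synthetic data: a rank-3 signal plus small noise (sizes and noise level are arbitrary assumptions). The relative error is compared with the value predicted by the discarded singular values (Eckart-Young).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 20, 3

# Rank-k signal plus small noise (synthetic, for illustration only)
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
X += 0.01 * rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]   # keep the k largest singular triplets

rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
predicted = np.sqrt((s[k:] ** 2).sum() / (s ** 2).sum())

print(f"Relative Frobenius error of rank-{k} approximation: {rel_err:.4f}")
print(f"Error predicted from discarded singular values:    {predicted:.4f}")
```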

References#

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press. Chapter 3: “The Four Fundamental Subspaces.”

  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 2: “Linear Algebra” (discusses rank and span).

  3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Section 3.4: “Shrinkage Methods” (ridge regression, handling rank deficiency).

Problem. Interpret $\{Xw : w \in \mathbb{R}^d\}$ as a subspace and compute its dimension.

Solution (math).

The set $\{Xw : w \in \mathbb{R}^d\}$ is the column space (range) of $X$: $$ \text{col}(X) = \{Xw : w \in \mathbb{R}^d\} = \text{span}\{x_1, \ldots, x_d\} $$ where $x_j \in \mathbb{R}^n$ are the columns of $X \in \mathbb{R}^{n \times d}$. The dimension is: $$ \dim(\text{col}(X)) = \text{rank}(X) \leq \min(n, d) $$

Solution (Python).

import numpy as np

# Define feature matrix X (3 examples, 2 features)
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])

# Compute rank (dimension of column space)
rank = np.linalg.matrix_rank(X)

print(f"X shape: {X.shape}")
print(f"X =\n{X}")
print(f"Rank(X) = {rank}")
print(f"Column space dimension = {rank}")
print(f"Predictions Xw span a {rank}-dimensional subspace of R^{X.shape[0]}")

Output:

X shape: (3, 2)
X =
[[1. 0.]
 [0. 1.]
 [1. 1.]]
Rank(X) = 2
Column space dimension = 2
Predictions Xw span a 2-dimensional subspace of R^3

Worked Example 4: Bias trick (affine → linear)#

Introduction#

Most machine learning models use affine transformations: $f(x) = Wx + b$ where $W$ is a weight matrix and $b$ is a bias vector. Affine maps are not linear (they don’t map zero to zero if $b \neq 0$), but there’s an elegant trick: augment the input with a constant 1, turning $f(x) = Wx + b$ into a purely linear map $f(x') = W' x'$ in a higher-dimensional space.

This “bias trick” (also called “homogeneous coordinates”) is ubiquitous in ML: neural network layers, logistic regression, SVMs, and computer graphics all use it. It simplifies implementation (one matrix multiply instead of multiply + add) and unifies the treatment of weights and biases.

Purpose#

  • Unify affine and linear maps: Convert $f(x) = Wx + b$ to $f(x') = W'x'$ where $x' = [x; 1]$ (augmented input) and $W' = [W \,|\, b]$ (concatenated weight matrix and bias).

  • Simplify backpropagation: Gradients w.r.t. $W'$ handle both weights and biases uniformly.

  • Connect to projective geometry: Homogeneous coordinates in computer graphics use the same augmentation.

Importance#

Neural network implementation. Every linear layer $h = Wx + b$ can be written as $h = W'x'$ where $x' = [x; 1]$ and $W' = [W \,|\, b]$. Many frameworks (PyTorch, TensorFlow) handle biases separately for efficiency, but conceptually this augmentation clarifies the math.

Logistic regression. The decision boundary $w^\top x + b = 0$ becomes $w'^\top x' = 0$ in augmented space, a linear classifier (hyperplane through the origin in $\mathbb{R}^{d+1}$).

Conditioning and regularization. Regularizing $\|w\|_2^2$ without penalizing $b$ (common practice) is harder to express if $w$ and $b$ are combined. Keeping them separate maintains flexibility, but the augmentation perspective clarifies that they live in different subspaces.

What This Example Demonstrates#

  • Affine maps become linear in augmented space: $f(x) = Wx + b$ (affine in $\mathbb{R}^d$) equals $f(x') = W'x'$ (linear in $\mathbb{R}^{d+1}$).

  • Augmentation preserves structure: Adding a constant 1 extends the input space without losing information.

  • Numerical verification: Compute both $Wx + b$ and $W'x'$, verify they’re identical.

Background#

Affine vs. linear. A map $f: \mathbb{R}^d \to \mathbb{R}^m$ is:

  • Linear if $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$ for all $x, y, \alpha, \beta$. Equivalently, $f(x) = Ax$ for some matrix $A$.

  • Affine if $f(x) = Ax + b$ for some matrix $A$ and vector $b$. Affine maps preserve affine combinations (weighted averages with weights summing to 1) but not arbitrary linear combinations.

Homogeneous coordinates. In computer graphics, 3D points $(x, y, z)$ are represented as 4-vectors $(x, y, z, 1)$ to handle translations uniformly. The last coordinate acts as a “scaling factor” (1 for ordinary points, 0 for vectors). This is exactly the bias trick.

Historical Context: Homogeneous coordinates were introduced by August Ferdinand Möbius (1827) for projective geometry. Their use in ML is a modern application of this classical idea.

Connection to Machine Learning#

Deep neural networks. Each layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$ where $\sigma$ is a nonlinearity (ReLU, sigmoid). The affine transformation $W_l h_l + b_l$ can be written as $W'_l h'_l$ with $h'_l = [h_l; 1]$.

Batch processing. For a mini-batch $X \in \mathbb{R}^{B \times d}$ (rows are examples), the transformation $Y = XW^\top + \mathbf{1} b^\top$ (broadcasting bias) becomes $Y = X' W'^\top$ where $X' = [X \,|\, \mathbf{1}]$ (augmented batch) and $W' = [W \,|\, b]$.
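The batched form can be verified numerically. The weights, bias, and three-example batch below are arbitrary illustration values.

```python
import numpy as np

W = np.array([[ 2.0, 1.0],
              [-1.0, 3.0]])          # (m, d) weight matrix
b = np.array([0.5, -2.0])            # (m,) bias
X = np.array([[1.0,  4.0],
              [0.0,  0.0],
              [2.0, -1.0]])          # (B, d) batch, rows are examples

# Affine form with broadcasting: Y = X W^T + 1 b^T
Y_affine = X @ W.T + b

# Augmented form: X' = [X | 1], W' = [W | b], Y = X' W'^T
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
W_aug = np.hstack([W, b[:, None]])
Y_linear = X_aug @ W_aug.T

print("Shapes:", X_aug.shape, W_aug.shape)
print("Equal:", np.allclose(Y_affine, Y_linear))
```

The all-zero input row maps to $b$ in both forms, which shows concretely that the affine map does not send zero to zero while the augmented linear map is applied to the nonzero vector $[0, 0, 1]$.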

Regularization subtlety. Ridge regression penalizes $\|w\|_2^2$ but not $b$ (bias is unregularized). If we augment and use $w' = [w; b]$, regularizing $\|w'\|_2^2$ would incorrectly penalize the bias. This is why most implementations keep $w$ and $b$ separate despite the conceptual elegance of augmentation.

References#

  1. Möbius, A. F. (1827). Der barycentrische Calcul. Introduced homogeneous coordinates for projective geometry.

  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 6.1: “Feedforward Networks” (linear layers with biases).

  3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Section 3.1: “Linear Regression” (discusses augmentation for intercept terms).

Problem. Rewrite $f(x) = Wx + b$ as a linear map in augmented space.

Solution (math).

Define the augmented input $x' \in \mathbb{R}^{d+1}$: $$ x' = \begin{bmatrix} x \\ 1 \end{bmatrix} $$

and the augmented weight matrix $W' \in \mathbb{R}^{m \times (d+1)}$: $$ W' = \begin{bmatrix} W & b \end{bmatrix} $$

Then: $$ f(x) = Wx + b = W' x' = \begin{bmatrix} W & b \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} = Wx + b \cdot 1 $$

This is a linear map from $\mathbb{R}^{d+1}$ to $\mathbb{R}^m$ (the bias is absorbed into $W'$, so no separate bias term is needed).

Solution (Python).

import numpy as np

# Define weight matrix W (2x2) and bias vector b (2,)
W = np.array([[2., 1.],
              [-1., 3.]])
b = np.array([0.5, -2.])

# Input vector x (2,)
x = np.array([1., 4.])

# Standard affine transformation: Wx + b
y_affine = W @ x + b

# Augmented transformation: W' @ x'
# W' = [W | b] (concatenate b as a column)
W_aug = np.c_[W, b]  # Shape: (2, 3)

# x' = [x; 1] (append 1)
x_aug = np.r_[x, 1.]  # Shape: (3,)

# Linear transformation in augmented space
y_linear = W_aug @ x_aug

print(f"W =\n{W}")
print(f"b = {b}")
print(f"x = {x}\n")

print(f"Affine: Wx + b = {y_affine}")
print(f"\nAugmented W' =\n{W_aug}")
print(f"Augmented x' = {x_aug}")
print(f"Linear: W'x' = {y_linear}\n")

print(f"Are they equal? {np.allclose(y_affine, y_linear)}")

Output:

W =
[[ 2.  1.]
 [-1.  3.]]
b = [ 0.5 -2. ]
x = [1. 4.]

Affine: Wx + b = [ 6.5 9. ]

Augmented W' =
[[ 2.   1.   0.5]
 [-1.   3.  -2. ]]
Augmented x' = [1. 4. 1.]
Linear: W'x' = [ 6.5 9. ]

Are they equal? True

Worked Example 5: Attention outputs lie in span(V)#

Introduction#

The attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017) is the core operation in Transformers, powering modern LLMs (GPT, BERT, LLaMA), vision models (ViT), and multimodal systems (CLIP, Flamingo). Attention computes weighted sums of value vectors $V$, with weights determined by query-key similarities. Crucially, attention outputs are constrained to lie in $\text{span}(V)$ — they cannot “invent” information outside the value subspace.

This example demonstrates that attention is a linear combination operation: the output $z = \sum_{i=1}^n \alpha_i v_i$ (where $\alpha_i$ are softmax-normalized attention scores) always lies in $\text{span}\{v_1, \ldots, v_n\}$, a subspace of $\mathbb{R}^{d_v}$.

Purpose#

  • Understand attention as weighted averaging: Output = $\sum_{i=1}^n \alpha_i v_i$ with $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$ (convex combination).

  • Identify the constraint: Attention outputs lie in $\text{span}(V)$, limiting expressiveness to the value subspace.

  • Connect to ML: Multi-head attention learns multiple subspaces (heads) in parallel, increasing capacity.

Importance#

Transformer architecture. Attention is the primary operation in Transformers, replacing recurrence (RNNs) and convolution (CNNs). Each layer computes: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V $$ where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of inputs. The output is a weighted sum of value vectors, constrained to $\text{span}(V)$.
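A minimal NumPy sketch of the scaled dot-product formula, assuming random $Q$, $K$, $V$ with small illustrative shapes. It is single-head, unmasked, and unbatched; the helper names `softmax` and `attention` are local to this example.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n), each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 3, 2
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

Z, A = attention(Q, K, V)
print("Row sums of attention weights:", A.sum(axis=1))
print("Output shape:", Z.shape)
```

Each row of `Z` is a convex combination of the rows of `V`, so every output coordinate lies between the smallest and largest value of that coordinate in `V`.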

Multi-head attention. Splitting into $h$ heads projects to $h$ different subspaces: $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O $$ where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$. In Vaswani et al. (2017), each head uses values of dimension $d_v = d_{\text{model}} / h$, so each head operates in its own learned $d_v$-dimensional subspace.

Expressiveness vs. efficiency. Attention can only combine existing values (linear combinations), not generate new directions. This is a feature, not a bug: it provides inductive bias (outputs depend on inputs) and computational efficiency (matrix multiplications).

What This Example Demonstrates#

  • Attention is linear combination: For softmax weights $\alpha \in \mathbb{R}^{1 \times n}$ and value matrix $V \in \mathbb{R}^{n \times d_v}$, the output $z = \alpha V \in \mathbb{R}^{1 \times d_v}$ is a linear combination of rows of $V$.

  • Outputs lie in subspace: $z \in \text{span}\{v_1, \ldots, v_n\}$ where $v_i \in \mathbb{R}^{d_v}$ are rows of $V$.

  • Convex combination: Since $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$, $z$ lies in the convex hull of $\{v_1, \ldots, v_n\}$.

Background#

Attention mechanism history:

  1. Bahdanau attention (2015): Introduced for neural machine translation. Computes alignment scores between encoder hidden states and decoder state, uses weighted sum for context.

  2. Scaled dot-product attention (Vaswani 2017): Simplified to $\text{softmax}(QK^\top / \sqrt{d_k})V$, enabling parallelization and scaling.

  3. Multi-head attention (Vaswani 2017): Projects to multiple subspaces (heads), learns different relationships.

Mathematical interpretation: Attention is a content-based addressing mechanism: queries “look up” relevant keys, retrieve corresponding values. The softmax ensures smooth interpolation (differentiable, convex weights).

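Relation to kernels: Attention can be viewed as a kernel smoother with similarity $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$ (the unnormalized softmax weight; written $\kappa$ to avoid a clash with the key matrix $K$). The output is the $\kappa$-weighted average of the values, so it stays in the span of the values.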

Connection to Machine Learning#

Self-attention in Transformers. For input sequence $X \in \mathbb{R}^{n \times d}$, self-attention computes $Q = XW^Q$, $K = XW^K$, $V = XW^V$. Each output token is a linear combination of all input tokens’ values, weighted by query-key similarities.

Cross-attention in encoder-decoder models. Decoder queries attend to encoder keys/values. Example: in machine translation, the decoder (target language) attends to encoder representations (source language). Outputs lie in the span of encoder values.

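Positional embeddings. Self-attention is permutation-equivariant: permuting the input tokens permutes the outputs in the same way, because each output is a weighted sum whose value does not depend on the order of its terms. Position information must therefore be injected via positional encodings $p_i \in \mathbb{R}^d$ added to the embeddings, which makes queries, keys, and values depend on position.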
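The order-insensitivity of self-attention without positional encodings can be checked directly: permuting the input rows should permute the output rows in the same way. The sketch assumes random inputs and random projection matrices; `self_attention` is a local helper, not a library function.

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention without positional encodings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
Z = self_attention(X, Wq, Wk, Wv)
Z_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the tokens shuffles the outputs identically
print("Outputs permute with inputs:", np.allclose(Z_perm, Z[perm]))
```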

Computational complexity. Computing the scores $QK^\top$ costs $O(n^2 d_k)$ operations, and multiplying the $n \times n$ attention matrix by the $n \times d_v$ value matrix costs $O(n^2 d_v)$. For long sequences (e.g., books, genomic data), this quadratic cost in $n$ becomes prohibitive, motivating sparse attention and low-rank approximations.

Connection to Linear Algebra Theory#

Convex combinations and convex hulls. If $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$, then $z = \sum_i \alpha_i v_i$ lies in the convex hull $\text{conv}\{v_1, \ldots, v_n\}$, the smallest convex set containing all $v_i$. Geometrically, this is a polytope (bounded region) in $\mathbb{R}^{d_v}$.

Rank of attention output. The attention output matrix $Z = \text{softmax}(QK^\top / \sqrt{d_k}) V \in \mathbb{R}^{n \times d_v}$ has $\text{rank}(Z) \leq \min(n, \text{rank}(V))$. If $V$ is low-rank, attention cannot increase rank (outputs lie in a low-dimensional subspace).
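A small check of the rank bound, assuming a rank-1 value matrix built as an outer product and arbitrary random attention scores:

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_v = 6, 4

# Rank-1 value matrix: every row is a multiple of the same direction
V = np.outer(rng.normal(size=n), rng.normal(size=d_v))

# Arbitrary attention weights (rows are nonnegative and sum to 1)
A = softmax(rng.normal(size=(n, n)))
Z = A @ V

rank_V = np.linalg.matrix_rank(V)
rank_Z = np.linalg.matrix_rank(Z)
print(f"rank(V) = {rank_V}, rank(Z) = {rank_Z}")
```

Whatever the attention weights are, the rows of `Z` remain multiples of the single direction spanned by the rows of `V`.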

Limiting cases. If all attention weights concentrate on a single token ($\alpha_i = 1$, $\alpha_j = 0$ for $j \neq i$), the output equals $v_i$ (attention selects a single value vector, a vertex of the convex hull). Uniform attention weights ($\alpha_i = 1/n$) give the average $\bar{v} = \frac{1}{n} \sum_i v_i$ (the centroid of the value vectors).

Orthogonality and attention scores. High query-key similarity $q^\top k$ indicates alignment (small angle between $q$ and $k$). Orthogonal query-key pairs ($q^\top k = 0$) receive a score of zero; whether that becomes a low attention weight depends on the other scores, because softmax weights are relative. This is the same geometric intuition as inner products measuring similarity.
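Because softmax normalizes scores against each other, the weight given to an orthogonal key depends on what the other keys look like. In the hypothetical setup below, the orthogonal key has score 0 while the other keys point away from the query, so the orthogonal key ends up with the largest weight.

```python
import numpy as np

q = np.array([1.0, 0.0])
K = np.array([[ 0.0, 1.0],    # orthogonal to q: score 0
              [-1.0, 0.0],    # anti-aligned with q: negative score
              [-2.0, 0.0]])   # strongly anti-aligned: more negative score

scores = K @ q / np.sqrt(q.size)

# Softmax over the scores (max-subtraction for stability)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print("scores :", scores)
print("weights:", np.round(weights, 3))
print("index of largest weight:", int(np.argmax(weights)))
```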

Pedagogical Significance#

Concrete example of span. Attention outputs visibly demonstrate that linear combinations $\sum_i \alpha_i v_i$ lie in $\text{span}\{v_1, \ldots, v_n\}$. Students can compute actual numbers and verify the output is expressible as a weighted sum.

Geometric visualization. For 2D or 3D value vectors, plot $\{v_1, \ldots, v_n\}$ together with the attention output $z$. The output always lies inside the convex hull of the values (the polygon or polytope whose vertices are among the $v_i$).
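For three non-collinear points in 2D, hull membership can also be checked numerically with barycentric coordinates, without plotting. This is a sketch with illustrative numbers; it assumes the three value vectors form a proper triangle, so the barycentric coordinates are unique.

```python
import numpy as np

V = np.array([[1., 0.],    # v_1
              [0., 1.],    # v_2
              [2., 2.]])   # v_3
alpha = np.array([0.2, 0.5, 0.3])   # attention weights (nonnegative, sum to 1)
z = alpha @ V                        # attention output

# Solve for beta with  sum_i beta_i * v_i = z  and  sum_i beta_i = 1.
M = np.vstack([V.T, np.ones(3)])     # 3x3 system matrix
rhs = np.append(z, 1.0)
beta = np.linalg.solve(M, rhs)

# z is inside the triangle exactly when all barycentric coordinates are >= 0.
inside = bool(np.all(beta >= -1e-12))
print(f"z = {z}, barycentric coordinates = {beta.round(3)}, inside hull: {inside}")
```

For a triangle the barycentric coordinates coincide with the attention weights, which is why the check recovers `alpha`.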

Foundation for Transformers. Understanding attention as linear combination clarifies:

  • Why attention is permutation-invariant: for a given query, the weighted sum does not depend on the order in which the key-value pairs are listed.

  • Why multi-head attention helps: Different heads explore different subspaces.

  • Limitations: Attention can only interpolate existing values, not generate new directions (nonlinearity comes from layer stacking and feedforward networks).

References#

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention is All You Need.” NeurIPS 2017. Introduced Transformer architecture with scaled dot-product attention and multi-head attention.

  2. Bahdanau, D., Cho, K., & Bengio, Y. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR 2015. First attention mechanism for seq2seq models.

  3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019. Bidirectional self-attention for masked language modeling.

  4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ICLR 2021. Vision Transformers (ViT) apply attention to image patches.

  5. Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). “Efficient Transformers: A Survey.” arXiv:2009.06732. Reviews sparse attention, low-rank approximations, and other efficiency techniques.

Problem. Show attention output is in the span of value vectors.

Solution (math).

For value matrix $V \in \mathbb{R}^{n \times d_v}$ (rows $v_1, \ldots, v_n$) and attention weights $\alpha \in \mathbb{R}^{1 \times n}$ (from softmax, so $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$), the attention output is: $$ z = \alpha V = \sum_{i=1}^n \alpha_i v_i \in \mathbb{R}^{1 \times d_v} $$

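This is a convex combination of the rows of $V$. A convex combination is a linear combination whose coefficients happen to be nonnegative and sum to 1, hence $z \in \text{conv}\{v_1, \ldots, v_n\} \subseteq \text{span}\{v_1, \ldots, v_n\} \subseteq \mathbb{R}^{d_v}$.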

Solution (Python).

import numpy as np

# Define value matrix V (3 tokens, 2-dimensional values)
V = np.array([[1., 0.],   # v_1
              [0., 1.],   # v_2
              [2., 2.]])  # v_3

# Define attention weights (from softmax, sum to 1)
A = np.array([[0.2, 0.5, 0.3]])  # Shape: (1, 3)

# Compute attention output z = A @ V
z = A @ V  # Shape: (1, 2)

print(f"Value matrix V (rows are values):\n{V}\n")
print(f"Attention weights A = {A[0]} (sum = {A.sum():.1f})\n")
print(f"Attention output z = A @ V = {z[0]}")
print(f"\nVerification as linear combination:")
print(f"z = {A[0,0]}*v_1 + {A[0,1]}*v_2 + {A[0,2]}*v_3")
print(f"  = {A[0,0]}*{V[0]} + {A[0,1]}*{V[1]} + {A[0,2]}*{V[2]}")
print(f"  = {A[0,0]*V[0] + A[0,1]*V[1] + A[0,2]*V[2]}")
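# Check span membership numerically: appending z as a row must not raise the rank
in_span = np.linalg.matrix_rank(np.vstack([V, z])) == np.linalg.matrix_rank(V)
print(f"\nz lies in span(V): {in_span} (z is a linear combination of rows of V)")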

Output:

Value matrix V (rows are values):
[[1. 0.]
 [0. 1.]
 [2. 2.]]

Attention weights A = [0.2 0.5 0.3] (sum = 1.0)

Attention output z = A @ V = [0.8 1.1]

Verification as linear combination:
z = 0.2*v_1 + 0.5*v_2 + 0.3*v_3
  = 0.2*[1. 0.] + 0.5*[0. 1.] + 0.3*[2. 2.]
  = [0.8 1.1]

z lies in span(V): True (z is a linear combination of rows of V)
