Chapter 9
PSD & PD Matrices
Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $>0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues, admit Cholesky factorizations (for PD), and define inner products or metrics. PSD structure underpins convexity, stability, and validity of kernels.

Important ideas#

  1. Eigenvalue characterization

    • Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.

  2. Quadratic forms and convexity

    • If $A\succeq 0$, $f(x)=\tfrac{1}{2} x^\top A x$ is convex; for twice-differentiable functions, a Hessian that is PSD everywhere implies a convex objective (PSD at a single point certifies only local curvature).

  3. Gram matrices

    • $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).

  4. Cholesky factorization

    • PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.

  5. Schur complement

    • For block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, PSD iff Schur complement $C-B^\top A^{-1}B \succeq 0$ (a numerical check follows this list).

  6. Mahalanobis metric

    • For SPD $M$, $d_M(x,y)^2=(x-y)^\top M (x-y)$ defines a metric; whitening corresponds to $M=\Sigma^{-1}$.

  7. Numerical PSD

    • In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).
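
As a quick numerical check of the Schur complement criterion in idea 5 (a sketch, not a proof): for a block matrix with $A \succ 0$, the PSD verdict for the full matrix and for its Schur complement must agree, whichever way it falls.

import numpy as np

np.random.seed(0)
p, q = 3, 2
F = np.random.randn(p, p)
A = F @ F.T + 0.5 * np.eye(p)            # A > 0 by construction
B = np.random.randn(p, q)

for c in [5.0, 0.01]:                    # generous vs. tiny lower-right block
    C = c * np.eye(q)
    M = np.block([[A, B], [B.T, C]])
    S = C - B.T @ np.linalg.solve(A, B)  # Schur complement of A in M
    psd_M = np.linalg.eigvalsh(M).min() >= -1e-10
    psd_S = np.linalg.eigvalsh(S).min() >= -1e-10
    print(f"c={c}: M PSD? {psd_M}  Schur complement PSD? {psd_S}")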

Relevance to ML#

  • Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.

  • Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.

  • Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.

  • Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.

  • Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

  • 1909: Mercer’s theorem (kernel PSD) and early quadratic form characterizations.

  • 1918: Schur complement criteria.

  • 1910–1940s: Cholesky factorization for SPD solves.

  • 1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.

  • 1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

  • PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.

  • Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.

  • Principal minors: PD iff all leading principal minors are positive (Sylvester).

  • Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.

  • Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • PSD/PD definitions via quadratic form and eigenvalues.

  • Sylvester’s criterion (leading principal minors > 0 for PD).

  • Schur complement PSD condition.

  • Mercer’s theorem (kernel PSD).

  • Cholesky existence for SPD.

Applied (landmark systems)#

  • SVM with PSD kernels (Cortes–Vapnik 1995).

  • Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.

  • Kernel ridge regression solved via SPD systems (Murphy 2022).

  • Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).

  • GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up
  1. Covariance and PCA

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; eigenvalues are variances. PCA uses PSD structure. References: Jolliffe 2002; Shlens 2014.

  2. Kernels and SVMs/GPs

  • RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  3. Hessians and convexity

  • For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

  4. Mahalanobis distance & whitening

  • SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

  5. Cholesky-based solvers (KRR/GP)

  • SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation
  • PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.

  • Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.

  • Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.

  • Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.

  • Schur complement: For blocks, $S = C - B^\top A^{-1} B$.

  • Examples:

    • PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$.

    • Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.

    • Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.

Pitfalls & sanity checks
  • Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.

  • Near-singular PSD: add jitter; do not invert directly.

  • Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.

  • Kernel parameters: very small/large length-scales can hurt conditioning.

  • Cholesky failures: indicate non-PSD or insufficient jitter.

References

Foundations

  1. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models

  4. Mercer, J. (1909). Functions of positive and negative type.

  5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels.

  6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics

  8. Nocedal, J., & Wright, S. (2006). Numerical Optimization.

  9. De Maesschalck, R. et al. (2000). The Mahalanobis distance.

Applications and practice

  10. Cortes, C., & Vapnik, V. (1995). Support-vector networks.

  11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

  12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare PSD tests: eigenvalues vs. Cholesky; show failure on an indefinite matrix and repair via a diagonal shift (jitter).

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

  • Eigenvalue checks and Cholesky factorization as complementary PSD tests.

  • A diagonal shift $\epsilon I$ (jitter) restores PD; tiny jitter suffices only for near-PSD matrices, otherwise the shift must exceed $|\lambda_{\min}|$.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

  • Always symmetrize numerically; use small jitter (e.g., $10^{-6}$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; $A=LL^\top$ iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
B = np.random.randn(5, 5)
A = B @ B.T + 0.5 * np.eye(5)  # SPD by construction (min eigenvalue >= 0.5)
A_bad = A.copy()
A_bad[0, 0] = -5.0  # negative diagonal entry makes the matrix indefinite

def is_psd_eig(M, tol=1e-10):
    w = np.linalg.eigvalsh(M)
    return np.all(w >= -tol), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.4f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Repair: shift the diagonal past the most negative eigenvalue.
# Tiny jitter (e.g., 1e-6) works only for *near*-PSD matrices; this one is
# far from PSD, so the shift must exceed |min eig|.
_, w_bad = is_psd_eig(A_bad)
eps = -w_bad.min() + 1e-6
A_fix = A_bad + eps * np.eye(A_bad.shape[0])
ok, w = is_psd_eig(A_fix)
print("shifted: min eig=", round(w.min(), 6), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("shifted Cholesky diag min:", round(np.min(np.diag(L)), 6))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify variance becomes identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.

  • Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

  • Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

  • RBF kernel is PSD; small datasets can be nearly singular when points are close.

  • Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

  • Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = A2 + A2.T - 2 * A @ A.T
    D2 = np.maximum(D2, 0.0)  # guard against tiny negative distances from round-off
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

  • $d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

  • Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

  1. De Maesschalck et al. (2000). The Mahalanobis distance.

  2. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

  • Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

  • Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

  3. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = A2 + B2 - 2 * A @ B.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L L^T alpha = y
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))

Chapter 8
Eigenvalues & Eigenvectors
Key ideas: Introduction

Introduction#

Eigen-analysis provides structure for symmetric/PSD matrices (covariances, Laplacians) and general matrices (Markov chains, Jacobians). Power iteration is a simple iterative method to approximate the largest eigenvalue/eigenvector using repeated multiplication and normalization.

Important ideas#

  1. Eigenpairs $(\lambda, v)$ satisfy $Av = \lambda v$; for symmetric $A$, eigenvectors form an orthonormal basis.

  2. Spectral theorem (symmetric): $A = Q \Lambda Q^\top$ with real eigenvalues; $Q$ orthogonal, $\Lambda$ diagonal.

  3. Eigengap and convergence: Power iteration converges at rate $|\lambda_2/\lambda_1|^k$ to the dominant eigenvector when $|\lambda_1|>|\lambda_2|$.

  4. Rayleigh quotient: $\rho(x) = \dfrac{x^\top A x}{x^\top x}$; maximized at $\lambda_\max$, minimized at $\lambda_\min$ for symmetric $A$ (see the sketch after this list).

  5. PSD matrices: Eigenvalues nonnegative; covariances and Gram matrices are PSD.

  6. Gershgorin disks: Eigenvalues lie within unions of disks defined by row sums—gives quick bounds.

  7. Perron–Frobenius: Nonnegative irreducible matrices have a unique positive dominant eigenvalue/vector (Markov chains, PageRank).
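
A minimal sketch of the Rayleigh quotient bounds in idea 4: quotients of random unit vectors never leave $[\lambda_{\min}, \lambda_{\max}]$, and the maximum is attained at the top eigenvector (assumptions: symmetric $A$, unit-norm samples).

import numpy as np

np.random.seed(0)
F = np.random.randn(5, 5)
A = 0.5 * (F + F.T)  # symmetric (possibly indefinite)
evals, evecs = np.linalg.eigh(A)

xs = np.random.randn(1000, 5)
xs /= np.linalg.norm(xs, axis=1, keepdims=True)  # unit vectors
rq = np.einsum('ij,jk,ik->i', xs, A, xs)         # x^T A x for each sample
print("sampled RQ range:", round(rq.min(), 4), "to", round(rq.max(), 4))
print("eigenvalue range:", round(evals[0], 4), "to", round(evals[-1], 4))
print("RQ at top eigenvector:", round(evecs[:, -1] @ A @ evecs[:, -1], 4))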

Relevance to ML#

  • PCA: Leading eigenvectors of covariance capture maximal variance; truncation yields best low-rank approximation.

  • Spectral clustering: Laplacian eigenvectors reveal cluster structure via graph cuts.

  • PageRank/Markov chains: Stationary distribution is dominant eigenvector.

  • Stability: Jacobian eigenvalues inform exploding/vanishing dynamics.

  • Attention/covariance spectra: Eigenvalue spread relates to conditioning and numerical stability.

Algorithmic development (milestones)#

  • 1900s: Spectral theorem for symmetric matrices.

  • 1907–1912: Perron–Frobenius theory for nonnegative matrices.

  • 1929–1931: von Mises power iteration (1929) and Gershgorin circle theorem (1931, eigenvalue bounds); later refinements by Householder.

  • 1998: PageRank (Brin–Page) uses power iteration at web scale.

  • 2000s: Spectral clustering widely adopted (Ng–Jordan–Weiss 2002).

  • 2010s: Randomized eigensolvers/svd for large-scale ML.

Definitions#

  • Eigenpair: $(\lambda, v\neq 0)$ with $Av=\lambda v$.

  • Spectrum $\sigma(A)$: set of eigenvalues; spectral radius $\rho(A)=\max |\lambda_i|$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$ for $x\neq0$.

  • Power iteration: $x_{k+1}=\dfrac{A x_k}{\lVert A x_k\rVert}$; converges to dominant eigenvector under eigengap.

  • Laplacian: $L=D-W$ (unnormalized), $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$ for graph spectral methods.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Spectral theorem (symmetric/PSD): orthonormal eigenbasis; real eigenvalues.

  • Rayleigh–Ritz: Extreme eigenvalues maximize/minimize Rayleigh quotient.

  • Perron–Frobenius: Positive eigenvector for irreducible nonnegative matrices; spectral gap governs convergence.

  • Gershgorin circle theorem: Eigenvalues lie in disk unions from row sums.

  • Power iteration convergence: Linear rate governed by $|\lambda_2/\lambda_1|$ when $|\lambda_1|>|\lambda_2|$.

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Spectral clustering: Ng–Jordan–Weiss (2002); von Luxburg (2007).

  • PageRank: Brin–Page (1998).

  • Randomized SVD/eigs: Halko–Martinsson–Tropp (2011).

  • Stability of deep nets (spectral radius/initialization): Saxe et al. (2013); Goodfellow et al. (2016).

Key ideas: Where it shows up
  1. PCA and covariance spectra

  • Covariance $\Sigma=\tfrac{1}{n} X_c^\top X_c$ (PSD); eigenvectors = principal axes; eigenvalues = variances.

  • Achievements: Dimensionality reduction, whitening; core tool in vision/speech. References: Jolliffe 2002; Shlens 2014.

  2. Spectral clustering (graph Laplacian)

  • Use first $k$ eigenvectors of $L_{\text{sym}}$ to embed nodes, then k-means.

  • Achievements: Strong performance on non-convex clusters and manifold data. References: Ng–Jordan–Weiss 2002; von Luxburg 2007.

  3. PageRank / Markov chains

  • Dominant eigenvector of stochastic $P$ gives stationary distribution; computed via power iteration.

  • Achievements: Web search ranking at internet scale. References: Brin–Page 1998.

  4. Conditioning and stability (Jacobians/Hessians)

  • Largest eigenvalue relates to Lipschitz constants; affects step sizes and gradient explosion/vanishing.

  • Achievements: Initialization/normalization techniques guided by spectral radius. References: Saxe et al. 2013; Goodfellow et al. 2016.

  5. Randomized eigen/SVD for large ML

  • Approximate leading eigenpairs with fewer passes over data.

  • Achievements: Scalable PCA/LSA/embedding preprocessing. References: Halko–Martinsson–Tropp 2011.

Notation
  • Eigenvalues/eigenvectors: $Av_i = \lambda_i v_i$; order eigenvalues $\lambda_1\ge\lambda_2\ge\cdots$ for symmetric PSD.

  • Decompositions: $A=V\Lambda V^{-1}$ (diagonalizable); symmetric $A=Q \Lambda Q^\top$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$; for unit $x$, $\rho(x)=x^\top A x$.

  • Power iteration step: $x_{k+1} = A x_k / \lVert A x_k\rVert$; eigenvalue estimate $\lambda_k = x_k^\top A x_k$ if $\lVert x_k\rVert=1$.

  • Laplacian eigenmaps: $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$; use $k$ smallest nontrivial eigenvectors.

  • Examples:

    • Dominant eigenpair by power iteration on SPD matrix.

    • PCA via eigen-decomposition of $\Sigma$.

    • PageRank: eigenvector of stochastic matrix with eigenvalue 1.

Pitfalls & sanity checks
  • Non-symmetric matrices may have complex eigenvalues; use appropriate routines.

  • Power iteration slows when the eigengap is small; consider Lanczos/Arnoldi or deflation (a deflation sketch follows this list).

  • Normalize in power iteration to avoid overflow/underflow.

  • Center data before PCA; otherwise the leading eigenvector captures the mean offset rather than maximal variance.

  • Gershgorin bounds are loose; use as qualitative checks only.
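
A minimal deflation sketch for the symmetric case, as mentioned above: once power iteration finds the dominant eigenpair $(\lambda_1, v_1)$, running the same loop on $A - \lambda_1 v_1 v_1^\top$ approximately recovers the second eigenpair (assuming an adequate eigengap and iteration budget).

import numpy as np

np.random.seed(0)
F = np.random.randn(6, 6)
A = F @ F.T + 0.5 * np.eye(6)  # SPD, so all eigenvalues are real and positive

def power_iteration(M, iters=200):
    x = np.random.randn(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x @ (M @ x), x

lam1, v1 = power_iteration(A)
lam2, _ = power_iteration(A - lam1 * np.outer(v1, v1))  # deflate top eigenpair

top_two = np.linalg.eigvalsh(A)[::-1][:2]
print("power iteration + deflation:", round(lam1, 4), round(lam2, 4))
print("numpy eigh (top two):", round(top_two[0], 4), round(top_two[1], 4))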

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  4. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Spectral methods and applications

  5. Jolliffe, I. (2002). Principal Component Analysis.

  6. Shlens, J. (2014). A Tutorial on PCA.

  7. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  8. von Luxburg, U. (2007). A Tutorial on Spectral Clustering.

  9. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.

  10. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

  11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.

  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  13. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  14. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Five worked examples

Worked Example 1: Power iteration for dominant eigenpair (SPD matrix)#

Introduction#

Compute the largest eigenvalue/eigenvector of an SPD matrix via power iteration and compare to NumPy’s eigh.

Purpose#

Illustrate convergence rate and normalization; give a practical recipe.

Importance#

Many large-scale methods rely on top eigenpairs (PCA, spectral norm estimation).

What this example demonstrates#

  • Convergence rate depends on the eigengap ratio $|\lambda_2/\lambda_1|$.

  • Rayleigh quotient of iterates approximates $\lambda_1$.

Background#

Power methods date back a century and remain relevant for scalable eigensolvers.

Historical context#

von Mises/Householder refinements; modern variants include Lanczos/Arnoldi.

Prevalence in ML#

Used implicitly in randomized PCA, spectral norm estimation, and operator norm regularization.

Notes#

  • Normalize every step; monitor residual $\lVert Ax-\lambda x\rVert$.

Connection to ML#

Top eigenvalue relates to Lipschitz constants; top eigenvector for PCA directions.

Connection to Linear Algebra Theory#

Convergence proof via eigen-expansion; Rayleigh quotient bounds.

Pedagogical Significance#

Shows a simple iterative algorithm achieving an eigenpair without full decomposition.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
d = 6
A = np.random.randn(d, d)
A = A.T @ A + 0.5 * np.eye(d)  # SPD

def power_iteration(A, iters=50):
    x = np.random.randn(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    lam = x @ (A @ x)
    return lam, x

lam_pi, v_pi = power_iteration(A, iters=40)
eigvals, eigvecs = np.linalg.eigh(A)
lam_true = eigvals[-1]
print("power iteration λ≈", round(lam_pi, 6), " true λ=", round(lam_true, 6))
print("angle to true v (deg):", round(np.degrees(np.arccos(np.clip(abs(v_pi @ eigvecs[:, -1]), -1, 1))), 6))

Worked Example 2: PCA via eigen-decomposition of covariance#

Introduction#

Compute covariance, eigenpairs, and project data onto top-$k$ components; verify variance captured.

Purpose#

Connect PCA steps to eigenvalues/eigenvectors and retained variance.

Importance#

Core dimensionality reduction tool in ML.

What this example demonstrates#

  • $\Sigma = \tfrac{1}{n} X_c^\top X_c$ is PSD; eigenvalues = variances along eigenvectors.

  • Retained variance ratio from top-$k$ eigenvalues.

Background#

Eigen-decomposition of covariance equals SVD-based PCA.

Historical context#

PCA roots in Pearson/Hotelling; ubiquitous in data analysis.

Prevalence in ML#

Widely used in preprocessing, visualization, and compression.

Notes#

  • Center data; consider scaling features.

Connection to ML#

Variance retention guides choice of $k$; whitening uses inverse sqrt of eigenvalues.

Connection to Linear Algebra Theory#

PSD eigen-structure; orthogonal projections onto principal subspaces.

Pedagogical Significance#

Demonstrates PSD eigen-decomposition in a practical workflow.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 120, 8, 3
X = np.random.randn(n, d) @ np.diag(np.linspace(3, 0.5, d))
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
idx = np.argsort(evals)[::-1]
evals, evecs = evals[idx], evecs[:, idx]

Vk = evecs[:, :k]
X_proj = Xc @ Vk
variance_retained = evals[:k].sum() / evals.sum()
print("Top-k variance retained:", round(variance_retained, 4))

Worked Example 3: PageRank via power iteration (stochastic matrix)#

Introduction#

Compute PageRank on a small directed graph using power iteration on a stochastic matrix with damping.

Purpose#

Show Perron–Frobenius in action and convergence to the dominant eigenvector.

Importance#

Seminal large-scale eigen-application; template for Markov-chain ranking.

What this example demonstrates#

  • Transition matrix $P$ has eigenvalue 1; power iteration converges to stationary distribution.

  • Damping ensures irreducibility/aperiodicity.

Background#

Web graph ranking; damping factor (e.g., 0.85) handles dead ends/spiders.

Historical context#

Brin–Page (1998) launched web search revolution.

Prevalence in ML#

Graph ranking, recommendation propagation, random-walk-based features.

Notes#

  • Normalize columns to sum to 1; add teleportation for damping.

Connection to ML#

Graph-based semi-supervised learning often reuses random-walk ideas.

Connection to Linear Algebra Theory#

Perron–Frobenius guarantees positive dominant eigenvector; spectral gap drives convergence.

Pedagogical Significance#

Concrete, small-scale power iteration on a stochastic matrix.

References#

  1. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.

  2. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Solution (Python)#

import numpy as np

# Small directed graph adjacency
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# Column-stochastic transition (out-links per column)
col_sums = A.sum(axis=0, keepdims=True)
P = A / np.where(col_sums == 0, 1, col_sums)

alpha = 0.85
n = P.shape[0]
J = np.ones((n, n)) / n
M = alpha * P + (1 - alpha) * J  # damping

v = np.ones(n) / n
for _ in range(50):
    v = M @ v
    v = v / v.sum()

eigvals, eigvecs = np.linalg.eig(M)
idx = np.argmax(np.real(eigvals))
v_true = np.real(eigvecs[:, idx])
v_true = v_true / v_true.sum()

print("Power iteration PageRank:", np.round(v, 4))
print("Eigenvector PageRank:", np.round(v_true, 4))

Worked Example 4: Gershgorin disks vs true eigenvalues (bounds)#

Introduction#

Compute Gershgorin disks for a matrix and compare to actual eigenvalues to illustrate spectrum localization.

Purpose#

Provide quick sanity bounds on eigenvalues without full eigendecomposition.

Importance#

Useful for diagnosing stability (e.g., Jacobians) and conditioning.

What this example demonstrates#

  • All eigenvalues lie within the union of Gershgorin disks centered at $a_{ii}$ with radius $\sum_{j\neq i} |a_{ij}|$.

Background#

Classic theorem (1931) for eigenvalue localization.

Historical context#

Still a staple for quick qualitative checks.

Prevalence in ML#

Less direct, but useful for reasoning about spectrum without expensive computation.

Notes#

  • Tightness varies; row/column scaling can sharpen disks.

Connection to ML#

Stability analysis of iterative methods; rough spectral norm estimates.

Connection to Linear Algebra Theory#

Eigenvalue inclusion sets; diagonal dominance implications.

Pedagogical Significance#

Gives a geometric picture of eigenvalue bounds.

References#

  1. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  2. Horn & Johnson (2013). Matrix Analysis.

Solution (Python)#

import numpy as np

np.random.seed(2)
A = np.random.randn(4, 4)

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
eigvals = np.linalg.eigvals(A)

print("Gershgorin centers:", np.round(centers, 3))
print("Gershgorin radii:", np.round(radii, 3))
print("Eigenvalues:", np.round(eigvals, 3))

Worked Example 5: Spectral clustering on a toy graph (Laplacian eigenvectors)#

Introduction#

Perform unnormalized spectral clustering on a simple graph with two clusters; use second-smallest eigenvector (Fiedler) for separation.

Purpose#

Show how Laplacian eigenvectors reveal cluster structure.

Importance#

Nonlinear/non-convex cluster discovery.

What this example demonstrates#

  • $L=D-W$; smallest eigenvalue 0 with eigenvector $\mathbf{1}$; Fiedler vector splits the graph.

Background#

Graph cuts and Laplacian spectra; normalized variants common in practice.

Historical context#

Spectral clustering surged in the 2000s for manifold data.

Prevalence in ML#

Image segmentation, manifold learning, community detection.

Notes#

  • For normalized Laplacian, use $L_{\text{sym}}$; results similar on this toy example.

Connection to ML#

Embedding nodes using a few eigenvectors before k-means is standard pipeline.

Connection to Linear Algebra Theory#

Properties of Laplacian eigenvalues (nonnegative; multiplicity of 0 equals number of components).

Pedagogical Significance#

Concrete end-to-end spectral clustering demonstration.

References#

  1. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  2. von Luxburg, U. (2007). A tutorial on spectral clustering.

Solution (Python)#

import numpy as np

# Two clusters with strong intra-cluster edges plus one weak bridge edge,
# which keeps the graph connected so the Fiedler vector is well-defined.
W = np.array([
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1,   0, 1],
    [0, 0, 0, 1,   1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W
evals, evecs = np.linalg.eigh(L)

fiedler = evecs[:, 1]  # eigenvector for the second-smallest eigenvalue
clusters = (fiedler > 0).astype(int)
print("Eigenvalues:", np.round(evals, 4))
print("Fiedler vector:", np.round(fiedler, 4))
print("Cluster assignment via sign:", clusters)

Chapter 7
Rank & Nullspace
Key ideas: Introduction

Introduction#

Rank and null space describe how information flows through matrices:

  • Rank $r$ = number of independent columns/rows (nonzero singular values)

  • Null space $\text{null}(A)$ = set of inputs mapped to zero (lost information)

  • Column/row spaces = subspaces where outputs/inputs live; orthogonal complements relate via FTLA

  • Pseudoinverse solves least squares even when $A$ is rank-deficient (minimal-norm solutions)

  • Low-rank structure compresses models and reveals latent factors (factorization)

Important ideas#

  1. Row rank equals column rank

    • $\operatorname{rank}(A)$ is the dimension of $\text{col}(A)$ and equals that of $\text{row}(A)$.

  2. Rank via singular values

    • If $A=U\Sigma V^\top$, then $\operatorname{rank}(A)$ equals the number of nonzero singular values $\sigma_i$.

  3. Rank–nullity theorem

    • For $A\in\mathbb{R}^{m\times d}$, $$\operatorname{rank}(A) + \operatorname{nullity}(A) = d.$$

  4. Fundamental theorem of linear algebra (FTLA)

    • $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$ and $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$ (orthogonal decompositions).

  5. Rank of products and sums

    • $\operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$; subadditivity for sums.

  6. Pseudoinverse $A^+$

    • Moore–Penrose $A^+$ gives minimal-norm solutions $x^* = A^+ b$; satisfies $AA^+A = A$.

  7. Numerical rank

    • Practical rank uses thresholds on singular values to handle floating-point noise.

Relevance to ML#

  • Multicollinearity: rank-deficient design $X$ yields non-unique OLS solutions; regularization/pseudoinverse needed.

  • PCA/compression: low rank captures variance efficiently; truncation yields best rank-$k$ approximation.

  • Recommendation systems: user–item matrices modeled as low-rank factorization.

  • Kernels/Gram matrices: rank informs capacity and generalization; $\operatorname{rank}(XX^\top) \le \min(n,d)$.

  • Attention: score matrix $QK^\top$ has rank bounded by $\min(n, d_k)$; head dimension limits expressivity.

  • Deep nets: bottleneck layers enforce low-rank mapping; adapters/LoRA factorize weights.

Algorithmic development (milestones)#

  • 1936: Eckart–Young — best rank-$k$ approximation via SVD.

  • 1955: Penrose — Moore–Penrose pseudoinverse.

  • 1990s–2000s: Matrix factorization in recommender systems (SVD-based, ALS).

  • 2009: Candès–Recht — nuclear norm relaxation for matrix completion.

  • 2011: Halko–Martinsson–Tropp — randomized SVD for large-scale low-rank.

  • 2019–2021: Low-rank adapters (LoRA) compress transformer weights.

Definitions#

  • $\operatorname{rank}(A)$: dimension of $\text{col}(A)$ (or $\text{row}(A)$); number of nonzero singular values.

  • $\text{null}(A) = \{x: Ax=0\}$; $\text{null}(A^\top)$ similarly.

  • $\text{col}(A)$: span of columns; $\text{row}(A)$: span of rows.

  • FTLA decompositions: $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$, $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$.

  • Pseudoinverse: $A^+ = V \Sigma^+ U^\top$ where $\Sigma^+$ reciprocates nonzero $\sigma_i$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Rank–nullity: $$\operatorname{rank}(A)+\operatorname{nullity}(A)=d.$$

  • FTLA (four subspaces): $\text{col}(A) \perp \text{null}(A^\top)$ and $\text{row}(A) \perp \text{null}(A)$.

  • Row=column rank: $\dim\text{row}(A) = \dim\text{col}(A)$.

  • Singular values and rank: $\operatorname{rank}(A)$ is the count of positive $\sigma_i$.

  • Sylvester’s inequality: for $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{k\times p}$, $\operatorname{rank}(AB) \ge \operatorname{rank}(A) + \operatorname{rank}(B) - k$; together with the product upper bound, this brackets $\operatorname{rank}(AB)$.

  • Eckart–Young–Mirsky: Truncated SVD minimizes error among rank-$k$ approximations.

  • Moore–Penrose pseudoinverse properties: $AA^+A=A$, $A^+AA^+=A^+$.

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Stable least squares: Golub–Van Loan (2013).

  • Matrix completion via nuclear norm: Candès–Recht (2009).

  • Randomized SVD for scale: Halko–Martinsson–Tropp (2011).

  • Recommender systems: Koren–Bell–Volinsky (2009).

  • Low-rank adapters in transformers: Hu et al. (2021).

Key ideas: Where it shows up
  1. PCA and covariance rank

  • Centered data $X_c$ yields covariance $\Sigma = \tfrac{1}{n} X_c^\top X_c$ with $\operatorname{rank}(\Sigma) \le \min(n-1, d)$.

  • Achievements: Dimensionality reduction with $k\ll d$; whitening in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. Regression and multicollinearity

  • If $\operatorname{rank}(X) < d$, normal equations $X^\top X w = X^\top y$ are singular; pseudoinverse/regularization resolve ambiguity.

  • Achievements: Robust linear modeling; Ridge/Lasso mitigate rank issues. References: Hoerl–Kennard 1970; Tibshirani 1996; Golub–Van Loan 2013.

  3. Low-rank models and compression

  • Factorize $W \approx AB^\top$ with small inner dimension to reduce parameters and computation (adapters, LoRA).

  • Achievements: Efficient fine-tuning of large transformers. References: Hu et al. 2021 (LoRA); Tishby & Zaslavsky 2015 (bottlenecks conceptual).

  4. Matrix factorization for recommendation

  • User–item ratings approximated by low-rank matrices; SVD/ALS used in practice.

  • Achievements: Netflix Prize-era improvements; widespread deployment. References: Koren et al. 2009; Funk 2006.

  5. Kernels/Gram and attention score rank

  • $G=XX^\top$ has rank $\le \min(n,d)$; $QK^\top$ rank $\le \min(n,d_k)$. Rank limits expressivity and affects generalization.

  • Achievements: Scalable kernel methods via low-rank approximations; attention head size trade-offs. References: Schölkopf–Smola 2002; Vaswani et al. 2017.

Notation
  • Shapes: $A\in\mathbb{R}^{m\times d}$; $X\in\mathbb{R}^{n\times d}$ is data.

  • Spaces: $\text{col}(A)$, $\text{row}(A)$, $\text{null}(A)$, $\text{null}(A^\top)$.

  • Rank: $\operatorname{rank}(A)$; Nullity: $\operatorname{nullity}(A)$.

  • SVD: $A=U\Sigma V^\top$; $U\in\mathbb{R}^{m\times r}$, $V\in\mathbb{R}^{d\times r}$ span column/row spaces; $r=\operatorname{rank}(A)$.

  • Pseudoinverse: $A^+ = V\Sigma^+ U^\top$; minimal-norm solution $x^* = A^+ b$.

  • Examples:

    • Rank via SVD: count $\sigma_i > \tau$ with threshold $\tau$.

    • Projection onto column space: $P_{\text{col}} = U_r U_r^\top$; onto row space: $P_{\text{row}} = V_r V_r^\top$ (see the sketch after this list).

    • Covariance rank: $\operatorname{rank}(X_c^\top X_c) \le n-1$ for centered data.
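
A small sketch of the projection formulas above: any $b$ splits into its projection onto $\text{col}(A)$ plus a remainder lying in $\text{null}(A^\top)$, matching the FTLA decomposition.

import numpy as np

np.random.seed(0)
m, d = 8, 5
A = np.random.randn(m, d)
A[:, 4] = A[:, 0] - A[:, 1]  # force rank deficiency (rank 4)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(S > 1e-10)
Ur = U[:, :r]

b = np.random.randn(m)
b_col = Ur @ (Ur.T @ b)  # projection onto col(A)
b_perp = b - b_col       # remainder, should lie in null(A^T)

print("rank:", r)
print("||A^T b_perp|| (should be ~0):", np.linalg.norm(A.T @ b_perp))
print("b reconstructed?", np.allclose(b, b_col + b_perp))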

Pitfalls & sanity checks
  • Never invert $X^\top X$ when $\operatorname{rank}(X)<d$; use QR/SVD or regularize.

  • Diagnose numerical rank via singular values; set thresholds based on scale.

  • Center data for covariance; otherwise rank properties and PCA directions change.

  • Beware overfitting: increasing rank (k in PCA/factorization) beyond signal raises variance.

  • Attention heads: too small $d_k$ may limit expressivity; too large may hurt stability.

References

Foundations and theory

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Low-rank approximation and factorization

  4. Eckart, C., & Young, G. (1936). Best rank-$k$ approximation.

  5. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

  6. Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization.

  7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

Regression and pseudoinverse

  8. Penrose, R. (1955). A generalized inverse for matrices.

  9. Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression.

  10. Tibshirani, R. (1996). Lasso.

ML systems and practice

  11. Jolliffe, I. (2002). Principal Component Analysis.

  12. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  13. Murphy, K. P. (2022). Probabilistic Machine Learning.

  14. Vaswani, A. et al. (2017). Attention Is All You Need.

  15. Devlin, J. et al. (2019). BERT.

Five worked examples

Worked Example 1: Detecting multicollinearity via null space (non-unique regression)#

Introduction#

Show how null space reveals linear dependencies among features and why OLS becomes non-unique when $\operatorname{rank}(X)<d$.

Purpose#

Compute null space vectors and connect them to redundant directions; use pseudoinverse for a minimal-norm solution.

Importance#

Avoids unstable fits and clarifies identifiability in models.

What this example demonstrates#

  • If $v\in\text{null}(X)$, $X(w+\alpha v) = Xw$ for all $\alpha$; infinitely many OLS solutions.

  • Pseudoinverse $w^* = X^+ y$ yields the minimal-norm solution.

Background#

Rank deficiency arises from duplicate/derived features or insufficient data.

Historical context#

Gauss/Legendre least squares; Penrose pseudoinverse enables solutions in singular cases.

Prevalence in ML#

High-dimensional regression, feature engineering pipelines, polynomial expansions.

Notes#

  • Use SVD to diagnose numerical rank; add Ridge to regularize.

Connection to ML#

Feature selection and regularization strategies hinge on rank awareness.

Connection to Linear Algebra Theory#

FTLA: residuals in $\text{null}(X^\top)$; solution set $w_0 + \text{null}(X)$.

Pedagogical Significance#

Makes the geometry of “non-unique solutions” tangible.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Penrose (1955). Moore–Penrose pseudoinverse.

  3. Hoerl & Kennard (1970). Ridge regression.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 6
X = np.random.randn(n, d)
X[:, 5] = X[:, 0] + X[:, 1]  # make a perfectly colinear feature
w_true = np.array([1.0, -0.5, 0.3, 0.0, 2.0, 0.3])
y = X @ w_true + 0.1 * np.random.randn(n)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
tol = 1e-8
rank = np.sum(S > tol)
nullspace_basis = Vt[rank:].T  # columns spanning null(X)

print("rank(X)=", rank, " d=", d, " nullity=", d - rank)
print("Nullspace basis shape:", nullspace_basis.shape)

# Minimal-norm solution via pseudoinverse: invert only singular values above tol
coef = np.zeros_like(S)
mask = S > tol
coef[mask] = (U.T @ y)[mask] / S[mask]
w_min = Vt.T @ coef
print("||w_min||2=", np.linalg.norm(w_min))
print("OLS residual norm:", np.linalg.norm(y - X @ w_min))

Worked Example 2: Covariance rank ≤ n−1 (PCA in n<d regimes)#

Introduction#

Verify empirically that centered covariance has rank at most $n-1$ regardless of feature dimension.

Purpose#

Explain why PCA cannot produce more than $n-1$ nonzero eigenvalues and how this affects high-dimensional settings.

Importance#

Shapes expectations for PCA on small data; prevents overinterpretation.

What this example demonstrates#

  • With $X_c\in\mathbb{R}^{n\times d}$ centered, $\operatorname{rank}(X_c) \le \min(n-1, d)$; hence $\operatorname{rank}(\Sigma) \le n-1$.

Background#

Centering imposes a linear constraint across rows, reducing rank by at least one when $n>1$.

Historical context#

PCA theory and practice emphasize centering for correct variance structure.

Prevalence in ML#

Common in text, genomics, and other $d\gg n$ problems.

Notes#

  • Always center before PCA; whitening depends on accurate rank.

Connection to ML#

Model selection of $k$ principal components must respect $n-1$ limit.

Connection to Linear Algebra Theory#

Row-sum constraint places $\mathbf{1}$ in $\text{null}(X_c^\top)$.

Pedagogical Significance#

Reinforces how constraints reduce rank.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 30, 200
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
rank = np.sum(S > 1e-10)
print("rank(Xc)=", rank, " <= min(n-1,d)=", min(n-1, d))

Worked Example 3: Low-rank matrix factorization for recommendation (synthetic)#

Introduction#

Construct a synthetic user–item rating matrix with known low rank and recover it via truncated SVD.

Purpose#

Demonstrate latent-factor modeling and show reconstruction error scales with tail singular values.

Importance#

Illustrates the power of rank reduction in recommender systems.

What this example demonstrates#

  • $R\approx U_k \Sigma_k V_k^\top$ captures most variance when spectrum decays.

Background#

Matrix factorization underlies collaborative filtering; ALS/SGD optimize latent vectors.

Historical context#

Post-Netflix Prize, low-rank methods became industry standard.

Prevalence in ML#

Ubiquitous in recommendation and implicit feedback modeling.

Notes#

  • For missing data, completion requires specialized optimization (not shown here).

Connection to ML#

Latent dimensions reflect user/item factors; rank controls capacity.

Connection to Linear Algebra Theory#

Eckart–Young guarantees best rank-$k$ approximation.

Pedagogical Significance#

Shows direct link from SVD to practical factor models.

References#

  1. Koren, Bell, Volinsky (2009). Matrix factorization techniques for recommender systems.

  2. Candès & Recht (2009). Exact matrix completion via convex optimization.

Solution (Python)#

import numpy as np

np.random.seed(2)
u, i, k = 80, 60, 5
U_true = np.random.randn(u, k)
V_true = np.random.randn(i, k)
R = U_true @ V_true.T + 0.1 * np.random.randn(u, i)

U, S, Vt = np.linalg.svd(R, full_matrices=False)
Rk = (U[:, :k] * S[:k]) @ Vt[:k]
err = np.linalg.norm(R - Rk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-5))

Worked Example 4: Moore–Penrose pseudoinverse — minimal-norm solutions#

Introduction#

Solve $Ax=b$ when $A$ is rectangular or rank-deficient; verify $AA^+A=A$ and minimal norm among all solutions.

Purpose#

Provide a robust recipe for under-/overdetermined systems.

Importance#

Avoids fragile inverses and clarifies the solution geometry.

What this example demonstrates#

  • $x^*=A^+ b$ minimizes $\lVert x\rVert_2$ subject to $Ax=b$ for consistent systems.

  • Penrose conditions hold numerically.

Background#

Pseudoinverse defined via SVD; used in control, signal processing, ML.

Historical context#

Penrose (1955) established the four defining equations.

Prevalence in ML#

Closed-form layers, analytic baselines, and data-fitting routines.

Notes#

  • Use SVD-backed implementations; threshold small singular values.

Connection to ML#

Stable baselines and analytic steps inside pipelines.

Connection to Linear Algebra Theory#

Projects onto $\text{row}(A)$/$\text{col}(A)$; minimal-norm in $\text{null}(A)$ components.

Pedagogical Significance#

Bridges algebraic definition to numerical practice.

References#

  1. Penrose, R. (1955). A generalized inverse for matrices.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(3)
m, d = 10, 12
A = np.random.randn(m, d)
# Make A rank-deficient by zeroing a singular value via colinearity
A[:, 0] = A[:, 1] + A[:, 2]
b = np.random.randn(m)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
S_inv = np.where(S > 1e-10, 1.0 / S, 0.0)
A_plus = Vt.T @ np.diag(S_inv) @ U.T

x_star = A_plus @ b
print("Penrose A A^+ A ~ A?", np.allclose(A @ (A_plus @ A), A, atol=1e-8))
print("Residual ||Ax-b||:", np.linalg.norm(A @ x_star - b))
print("||x_star||2:", np.linalg.norm(x_star))

Worked Example 5: Rank of attention scores QK^T (expressivity bound)#

Introduction#

Show that the attention score matrix $S=QK^\top$ has rank at most $\min(n, d_k)$ and explore implications for head dimension.

Purpose#

Connect feature dimension to expressivity through rank bounds.

Importance#

Head size choices affect the diversity of attention patterns.

What this example demonstrates#

  • For $Q\in\mathbb{R}^{n\times d_k}$ and $K\in\mathbb{R}^{n\times d_k}$, $\operatorname{rank}(QK^\top) \le \min(n, d_k)$.

Background#

Rank of product bounded by inner dimension; scaled dot-products preserve rank.

Historical context#

Transformers leverage multiple heads to increase effective rank/expressivity.

Prevalence in ML#

All transformer models; multi-head concatenation increases representational capacity.

Notes#

  • Multi-head attention can be seen as block structures that raise overall rank after concatenation.

Connection to ML#

Guides architecture design (choosing $d_k$ and number of heads).

Connection to Linear Algebra Theory#

Rank bounds and product properties.

Pedagogical Significance#

Links a practical hyperparameter to a crisp linear algebra bound.

References#

  1. Vaswani, A. et al. (2017). Attention Is All You Need.

  2. Devlin, J. et al. (2019). BERT.

Solution (Python)#

import numpy as np

np.random.seed(4)
for n, dk in [(32, 8), (32, 32), (64, 16)]:
    Q = np.random.randn(n, dk)
    K = np.random.randn(n, dk)
    S = Q @ K.T
    r = np.linalg.matrix_rank(S)
    print(f"n={n}, d_k={dk}, rank(S)={r}, bound={min(n, dk)}")

Chapter 4
Linear Maps & Matrices
Key ideas: Introduction

Introduction#

Linear maps (also called linear transformations or functions) are structure-preserving transformations between vector spaces: they respect addition and scalar multiplication. Matrices are their concrete representation: a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ is represented as a matrix $A \in \mathbb{R}^{m \times d}$ so that $f(x) = Ax$. This is the language of neural networks: each layer is a composition of linear maps (matrix multiplications) and nonlinear activations. Understanding linear maps clarifies:

  • Model expressiveness: What functions can be represented? (Universal approximation via composition of linear maps and nonlinearities.)

  • Gradient flow: How do errors backpropagate through layers? (Chain rule uses transposes of linear map matrices.)

  • Data transformation: How do representations change through layers? (Each layer applies a linear map to its input.)

  • Optimization: How should weights change to reduce loss? (Gradient is also a linear map, obtained via transpose.)

Linear maps are everywhere in ML:

  • Neural networks: Each dense layer is a linear map $h_{i+1} = \sigma(W_i h_i + b_i)$ (linear map $W_i$, then activation $\sigma$).

  • Attention: Query/Key/Value projections are linear maps. Attention output is a weighted linear combination.

  • Least squares: Solving $\hat{w} = (X^\top X)^{-1} X^\top y$ involves products of linear maps.

  • PCA: Projection onto principal components is a linear map.

  • Convolution: Convolutional layers are linear maps; each can be written as a structured, weight-sharing matrix multiplication.

Important Ideas#

1. Linear map = function preserving structure. A function $f: V \to W$ between vector spaces is linear if:

  • Additivity: $f(u + v) = f(u) + f(v)$ for all $u, v \in V$.

  • Homogeneity: $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$.

Why these properties? Linear maps are exactly those that can be written as matrix multiplication: $f(x) = Ax$. Additivity ensures the matrix distributes: $A(x + y) = Ax + Ay$. Homogeneity ensures scaling: $A(\alpha x) = \alpha (Ax)$.

Example: Rotation by angle $\theta$ is linear: $f([x, y]^\top) = [\cos\theta \cdot x - \sin\theta \cdot y, \sin\theta \cdot x + \cos\theta \cdot y]^\top = R_\theta [x, y]^\top$.

Non-example: $f(x) = x + 1$ is not linear (fails $f(0) = 0$ test). $f(x) = \|x\|$ is not linear (not additive).
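
A quick numerical check of these tests (a minimal sketch): the rotation map passes both properties, while the shift map $f(x) = x + 1$ fails additivity.

import numpy as np

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation by theta

np.random.seed(0)
u, v = np.random.randn(2), np.random.randn(2)
alpha = 2.5

print("additivity :", np.allclose(R @ (u + v), R @ u + R @ v))
print("homogeneity:", np.allclose(R @ (alpha * v), alpha * (R @ v)))

f = lambda x: x + 1.0  # non-example: the shift map
print("shift map additive?", np.allclose(f(u + v), f(u) + f(v)))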

2. Matrix representation is unique (up to basis). For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ with standard bases, the matrix $A \in \mathbb{R}^{m \times d}$ satisfies $f(x) = Ax$ uniquely. Columns of $A$ are images of standard basis vectors: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Why unique? By linearity, $f(x) = f(\sum_j x_j e_j) = \sum_j x_j f(e_j)$. If we know $f$ on basis vectors, we know $f$ everywhere.

Example: $f(x) = 2x_1 + 3x_2$ is $f([x_1, x_2]^\top) = [2, 3] \cdot [x_1, x_2]^\top$. Matrix is $A = [2, 3]$ (1 row, 2 columns).

3. Composition = matrix multiplication. For linear maps $f: \mathbb{R}^d \to \mathbb{R}^m$ with matrix $A$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ with matrix $B$, the composition $g \circ f: \mathbb{R}^d \to \mathbb{R}^p$ has matrix $BA$ (note the order: $f$ is applied first, yet its matrix $A$ sits rightmost in the product).

Why this order? $(g \circ f)(x) = g(f(x)) = g(Ax) = B(Ax) = (BA)x$. Matrix product $BA$ is therefore natural for composition.

Example: Neural network layer 1 applies $A_1$, layer 2 applies $A_2$. Composition is $A_2 A_1$ (layer 1 first, then layer 2).
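
A one-line numerical check of this ordering (a minimal sketch with random matrices):

import numpy as np

np.random.seed(1)
A = np.random.randn(4, 3)  # f: R^3 -> R^4
B = np.random.randn(2, 4)  # g: R^4 -> R^2
x = np.random.randn(3)

# applying f then g matches a single multiplication by BA
print("g(f(x)) == (BA)x ?", np.allclose(B @ (A @ x), (B @ A) @ x))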

4. Transpose = dual map (adjoint). For matrix $A: \mathbb{R}^d \to \mathbb{R}^m$, the transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^d$ is the unique linear map satisfying: $$ (Ax)^\top y = x^\top (A^\top y) \quad \text{for all } x, y $$

Geometric interpretation: For a rotation (orthogonal) matrix $A$, $A^\top = A^{-1}$ undoes the rotation. For an orthogonal projection $P$, $P^\top = P$. In general, $A^\top$ carries directions in the output space back to directions in the input space.

In backprop: If forward pass applies $y = Ax$, reverse mode applies $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (transpose carries gradients backward).

Example: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
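
The defining identity is easy to verify numerically; the sketch below checks $\langle Ax, y\rangle = \langle x, A^\top y\rangle$ on random inputs, the same relation reverse-mode autodiff exploits when it carries gradients backward.

import numpy as np

np.random.seed(2)
A = np.random.randn(4, 3)
x = np.random.randn(3)
y = np.random.randn(4)

# defining identity of the adjoint: <Ax, y> == <x, A^T y>
print(np.allclose((A @ x) @ y, x @ (A.T @ y)))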

5. Image and kernel characterize a linear map. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:

  • Image (column space): $\text{im}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^d\}$ (all possible outputs). Dimension = rank$(A)$.

  • Kernel (null space): $\ker(A) = \text{null}(A) = \{x : Ax = 0\}$ (inputs mapping to zero). Dimension = nullity$(A) = d - \text{rank}(A)$.

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = d$ (dimension in = rank out + null space).

Why important? Image tells us what the map can represent. Kernel tells us what information is lost. For invertible maps, kernel is trivial (only zero maps to zero).

Relevance to Machine Learning#

Expressiveness through composition. A single linear map is limited: it can only represent linear functions (rotations, scalings, projections). Composing linear maps with nonlinearities dramatically increases expressiveness. The universal approximation theorem (Cybenko 1989) says a single hidden layer with a nonlinear activation and enough units can approximate any continuous function on a compact set arbitrarily well.

Gradient computation via transposes. Backpropagation is the chain rule applied backward through the network. Gradient w.r.t. input of a layer uses the transpose of the weight matrix. Understanding transposes is essential for implementing and understanding neural networks.

Data transformation and representation learning. Neural networks learn by composing linear maps (weight matrices) with nonlinearities. Early layers learn low-level features (via image of $A_1$). Deep layers compose these into high-level features (via $(A_k \cdots A_2 A_1)$).

Optimization structure. Gradient descent updates weights proportional to $X^\top (Xw - y)$ (linear map composition). Understanding matrix products clarifies why batch size, feature dimension, and conditioning affect optimization.
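
A minimal sketch of that structure: plain gradient descent on least squares, with each update assembled from the product $X^\top(Xw - y)$ (the step size below $1/L$ with $L = \lVert X\rVert_2^2$ is a standard choice, assumed adequate for this random problem).

import numpy as np

np.random.seed(3)
n, d = 50, 4
X = np.random.randn(n, d)
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

w = np.zeros(d)
lr = 0.5 / np.linalg.norm(X, 2) ** 2  # step size below 1/L, L = ||X||_2^2
for _ in range(500):
    w -= lr * X.T @ (X @ w - y)  # gradient of 0.5 * ||Xw - y||^2

print("||w - w_true||:", np.linalg.norm(w - w_true))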

Algorithmic Development History#

1. Linear transformations (Euler, 1750s-1770s). Euler rotated coordinate systems to solve differential equations and optimize geometry problems. Rotations are linear maps.

2. Matrix algebra (Cayley, Sylvester, 1850s-1880s). Introduced matrices as algebraic objects. Cayley-Hamilton theorem: matrices satisfy their own characteristic polynomial. Matrix multiplication defined to represent composition of linear transformations.

3. Bilinear forms and adjoints (Cauchy, Hermite, Hilbert, 1800s-1900s). Developed duality theory: every linear form has an adjoint. Transpose is the matrix adjoint.

4. Rank and nullity (Grassmann 1844, Frobenius 1870s-1880s). Formalized rank as dimension of image. Rank-nullity theorem central to linear algebra.

5. Spectral theory (Schur 1909, Hilbert 1920s). Symmetric matrices decompose into real eigenvalues and orthonormal eigenvectors; general matrices admit Schur (triangular) forms. Spectral decompositions reveal the structure of linear maps.

6. Computational algorithms (Householder 1958, Golub-Kahan 1965): Developed numerically stable algorithms for matrix factorization (QR, SVD, Cholesky). Made linear algebra practical at scale.

7. Neural networks and backprop (Rumelhart, Hinton, Williams 1986). Showed that composing linear maps with nonlinearities, trained via backprop (which uses transposes), learns powerful representations. Modern deep learning.

8. Transformers and attention (Vaswani et al. 2017). Attention is built almost entirely from linear maps: $\text{softmax}(QK^\top/\sqrt{d_k})\,V$ is a composition of matrix multiplications with a row-wise softmax (the only nonlinearity) in between.

Definitions#

Linear map (linear transformation). A function $f: V \to W$ between vector spaces over $\mathbb{R}$ is linear if:

  1. $f(u + v) = f(u) + f(v)$ for all $u, v \in V$ (additivity).

  2. $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$ (homogeneity).

Equivalently: $f(\alpha u + \beta v) = \alpha f(u) + \beta f(v)$ (linearity).

Matrix representation. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$, the matrix $A \in \mathbb{R}^{m \times d}$ represents $f$ if $f(x) = Ax$ for all $x \in \mathbb{R}^d$. Columns of $A$ are: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Image and kernel. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$: $$ \text{im}(A) = \{Ax : x \in \mathbb{R}^d\} = \text{col}(A), \quad \text{ker}(A) = \{x : Ax = 0\} = \text{null}(A) $$

Rank. The rank of $A$ is: $$ \text{rank}(A) = \dim(\text{im}(A)) = \dim(\text{col}(A)) = \text{number of linearly independent columns} $$

Nullity. The nullity of $A$ is: $$ \text{nullity}(A) = \dim(\text{ker}(A)) = d - \text{rank}(A) $$

Rank-nullity theorem. For any matrix $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$

Transpose (adjoint). The transpose of $A \in \mathbb{R}^{m \times d}$ is $A^\top \in \mathbb{R}^{d \times m}$ satisfying: $$(Ax)^\top y = x^\top (A^\top y), \quad (AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A$$

Invertible matrix. A square matrix $A \in \mathbb{R}^{d \times d}$ is invertible (nonsingular) if there exists $A^{-1}$ such that $AA^{-1} = A^{-1} A = I$. Equivalent: $\text{rank}(A) = d$ (full rank), $\ker(A) = \{0\}$ (trivial kernel), $\det(A) \neq 0$ (nonzero determinant).
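
The equivalent invertibility conditions are easy to verify numerically (a minimal sketch; the matrix is an arbitrary invertible example):

import numpy as np

A = np.array([[2., 1.], [1., 3.]])
print(np.linalg.matrix_rank(A) == 2)          # full rank
print(not np.isclose(np.linalg.det(A), 0.0))  # nonzero determinant
x = np.linalg.solve(A, np.array([1., 0.]))    # unique solution to Ax = b
print(np.allclose(A @ x, [1., 0.]))           # True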

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rank-nullity theorem. For $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$ Consequences: if $\text{rank}(A) < d$, the null space is nontrivial, so solutions to $Ax = b$ (when they exist) are not unique. For invertible $A$ (rank $= d$), the solution is unique.

  2. Fundamental theorem of linear algebra. Orthogonal decomposition: $\mathbb{R}^d = \text{col}(A^\top) \oplus \text{null}(A)$ and $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$ (orthogonal direct sums); a numerical check appears after this list. This four-subspace picture underlies much of applied linear algebra.

  3. Universal approximation (Cybenko 1989, Hornik 1991). A neural network with one hidden layer (linear map + nonlinearity + output linear map) can approximate any continuous function on compact sets arbitrarily well (with enough hidden units).

  4. Spectral theorem for symmetric matrices (Cauchy 1829 and later refinements). Every symmetric $A$ has eigendecomposition $A = U \Lambda U^\top$ (orthogonal diagonalization). Basis for PCA, optimization, and understanding symmetric structures.

  5. Singular Value Decomposition (Beltrami 1873, Eckart-Young 1936). Every matrix $A \in \mathbb{R}^{m \times d}$ can be written as $A = U \Sigma V^\top$ (orthogonal $U, V$, diagonal $\Sigma$). Reveals low-rank structure, optimal approximations, conditioning.

Why essential: These theorems quantify what linear maps can/cannot represent, how to invert them, when solutions exist, and how to find optimal approximations.
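
A minimal numerical check of the orthogonal decomposition in item 2 (the rank-deficient matrix is an arbitrary example): bases for $\text{col}(A^\top)$ and $\text{null}(A)$ come from the SVD and are mutually orthogonal.

import numpy as np

A = np.array([[1., 2., 3.], [2., 4., 6.]])  # rank 1, d = 3
r = np.linalg.matrix_rank(A)
U, s, Vt = np.linalg.svd(A)
row_space = Vt[:r].T    # basis for col(A^T), shape (3, 1)
null_space = Vt[r:].T   # basis for null(A), shape (3, 2)
print(np.allclose(row_space.T @ null_space, 0))  # orthogonal: True
print(row_space.shape[1] + null_space.shape[1])  # dimensions sum to d = 3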

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. Backpropagation and gradient-based learning (Rumelhart et al. 1986, 1990s-present). Automatic differentiation computes gradients via chain rule (composition of matrix transposes). Enables training networks with billions of parameters. All modern deep learning depends on this.

  2. Dense neural networks (Cybenko 1989, Hornik 1991, 1990s-present). Theoretical universality + practical training via backprop = powerful function approximators. AlexNet (2012) showed depth matters: stacking linear maps + activations learns hierarchical representations.

  3. Convolutional Neural Networks (LeCun et al. 1990, AlexNet 2012, ResNet 2015). Structured linear maps (convolution with weight sharing). Dramatically reduced parameters vs. dense. State-of-the-art on vision (ImageNet), object detection, segmentation.

  4. Recurrent Neural Networks and LSTMs (Hochreiter & Schmidhuber 1997, 2000s-present). Apply the same linear map across time steps (sequence models for NLP and time series). Enabled machine translation and speech recognition.

  5. Transformers and Attention (Vaswani et al. 2017, Devlin et al. 2018, GPT series 2018-2023). All-attention architecture (linear projections + softmax + matrix multiply). Achieved state-of-the-art across NLP (GLUE, SuperGLUE), vision (ImageNet via ViT), multimodal (CLIP). Scales to trillions of parameters.

  6. Least squares for regression (Gauss, Legendre, Tikhonov, modern methods). Normal equations $(X^\top X) w = X^\top y$ solved via QR/SVD (numerically stable). Classical ML workhorse; fast closed-form solution, interpretable results.

Why essential: These systems achieve state-of-the-art by leveraging linear map structure (composition, transposes, efficient matrix multiply). Understanding linear algebra is necessary to design architectures, optimize, and debug.

Key ideas: Where it shows up

1. Backpropagation and Gradient Flow — Transpose carries errors backward#

Major achievements:

  • Backpropagation (Rumelhart, Hinton, Williams 1986): Efficient algorithm for computing gradients through neural networks via chain rule. Each layer applies $y = \sigma(W x + b)$; backward pass uses $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$ (transpose carries gradients).

  • Modern deep learning (1990s-2010s): Backprop enabled training of deep networks (10-1000+ layers). Scaling to billions of parameters (GPT, Vision Transformers).

  • Automatic differentiation (1980s-present): Frameworks (TensorFlow, PyTorch) implement backprop automatically by composing vector-Jacobian products; for linear layers these are exactly multiplications by transposes. Practitioners rarely write the transposes explicitly; the framework handles them.

  • Applications: All supervised learning, reinforcement learning, generative models. Billions of backprop steps every day globally.

Connection to linear maps: The forward pass chains linear maps with nonlinearities: $f = \sigma_k \circ A_k \circ \sigma_{k-1} \circ A_{k-1} \circ \cdots \circ \sigma_1 \circ A_1$. The backward pass multiplies by the transposes in reverse order: $\nabla_x L = A_1^\top D_1 \cdots A_k^\top D_k \, \nabla_f L$, where $D_i$ is the (diagonal) Jacobian of $\sigma_i$ at the forward activations.
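
To see the transposes at work, here is a minimal two-layer sketch (the weights, input, and ReLU activation are illustrative choices) that computes $\nabla_x L$ by multiplying transposes in reverse order and checks it against a finite-difference estimate:

import numpy as np

A1 = np.array([[1., -2.], [0., 3.]])
A2 = np.array([[2., 1.]])
relu = lambda z: np.maximum(z, 0.0)

def loss(x):
    return (A2 @ relu(A1 @ x)).item()  # scalar L = A2 relu(A1 x)

x = np.array([1., 2.])
h = A1 @ x
D1 = np.diag((h > 0).astype(float))  # Jacobian of ReLU at the forward activations
grad = (A1.T @ D1 @ A2.T).ravel()    # reverse mode: transposes in reversed order

eps = 1e-6                           # finite-difference check
fd = np.array([(loss(x + eps*e) - loss(x - eps*e)) / (2*eps) for e in np.eye(2)])
print(np.allclose(grad, fd))         # True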

2. Neural Network Layers — Linear maps + activation functions#

Major achievements:

  • Dense layers (Rosenblatt Perceptron 1958, MLPs 1970s-1980s): Input $x$, linear map $h = Wx + b$, activation $y = \sigma(h)$ (ReLU, sigmoid, tanh). Each layer is a learnable linear map.

  • Depth (He et al. 2015 ResNets; Vaswani et al. 2017): 50-1000 layers. Skip connections $x_{i+1} = \sigma(W_i x_i + b_i) + x_i$ allow training very deep networks; each residual branch is a linear map followed by a nonlinearity.

  • Scaling (AlexNet 2012, GPT-3 2020, Gato 2022): Modern networks: billions to trillions of parameters. Matrix multiply dominates computation. Large linear maps $W \in \mathbb{R}^{4096 \times 4096}$ applied to batches.

  • Optimization: Understanding composition of linear maps helps explain generalization (implicit regularization favors low-complexity solutions in the span of data).

Connection to linear maps: Each dense layer is $W: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$. Network composes $W_k \circ \sigma \circ W_{k-1} \circ \sigma \circ \cdots \circ W_1$. Expressiveness comes from depth (composition) and nonlinearity ($\sigma$).

3. Attention Mechanism — Multi-head projections and weighted sums#

Major achievements:

  • Scaled dot-product attention (Vaswani et al. 2017): Queries, Keys, Values are projections (linear maps) $Q = XW_Q, K = XW_K, V = XW_V$. Attention weights $A = \text{softmax}(QK^\top / \sqrt{d_k})$. Output $\text{Attention}(Q,K,V) = AV$ (matrix multiply with softmax-weighted rows).

  • Multi-head attention: $h$ heads, each applying different linear projections. Concatenate: $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$ (a final linear map combines the heads).

  • Transformers (Vaswani 2017, Devlin et al. 2018): Attention layers (all linear maps + softmax) in sequence. BERT, GPT achieve state-of-the-art across NLP tasks.

  • Scale: GPT-3 (175B parameters), PaLM (540B), GPT-4. Training scales across thousands of GPUs, with matrix multiplication as bottleneck.

Connection to linear maps: Attention is composition of linear maps: $\text{Attention} = A V$ where $A = \text{softmax}(Q K^\top / \sqrt{d_k})$. Each head applies different linear projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$. Output is weighted linear combination of values.

4. Least Squares and Regression — Normal equations as linear system#

Major achievements:

  • Least squares (Gauss, Legendre, early 1800s): Solve $\min_w \|Xw - y\|_2^2$. Normal equations: $(X^\top X) w = X^\top y$. Linear system $Aw = b$ (product of two linear maps).

  • Ridge regression (Tikhonov 1963, Hoerl & Kennard 1970): Add regularization $\min_w (\|Xw - y\|_2^2 + \lambda \|w\|_2^2)$. Solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$ (invertible for any $\lambda > 0$).

  • LASSO (Tibshirani 1996): L1 regularization forces sparsity. Solved via proximal methods (gradient steps composed with the soft-thresholding proximal operator).

  • Kernel methods (Mercer 1909, Schölkopf & Smola 2001): Nonlinear regression via the kernel matrix $K_{ij} = k(x_i, x_j)$, which generalizes the Gram matrix $XX^\top$ of inner products (the kernel trick).

Connection to linear maps: Normal equations involve products of matrices: $X^\top X$ (composition of $X^\top$ and $X$), $X^\top y$ (linear map applied to $y$). Solution involves matrix inversion (inverse is also a linear map).

5. Convolutional and Recurrent Networks — Structured linear maps#

Major achievements:

  • CNNs (LeCun et al. 1990s, AlexNet 2012, ResNet 2015): Convolutional layers are linear maps with weight sharing (same weights applied across spatial positions). Reduces parameters vs. dense layer (e.g., conv 3×3×64→64 channels vs. dense with same feature count).

  • RNNs, LSTMs (Hochreiter & Schmidhuber 1997): Recurrent layers apply the same linear map $W$ repeatedly over time: $h_t = \sigma(W h_{t-1} + U x_t)$ (composition of linear maps over time steps).

  • Efficiency: Weight sharing and structured matrices (convolution, recurrence) reduce parameters and computation compared to dense layers.

  • Interpretability: Convolutional structure learned by early layers is interpretable (edge filters, textures). Linear maps with structured sparsity/sharing have semantic meaning.

Connection to linear maps: Conv layer is a linear map (convolution can be written as matrix multiplication with Toeplitz structure). RNN applies same linear map repeatedly: composition $W \circ W \circ \cdots \circ W$ over time.
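
To make the Toeplitz claim concrete, a one-dimensional sketch (signal and kernel are arbitrary; np.convolve provides the reference): the banded matrix built below reproduces 'valid' convolution as a matrix-vector product.

import numpy as np

x = np.array([1., 2., 3., 4.])
k = np.array([1., -1.])          # 1-D kernel

n_out = len(x) - len(k) + 1      # 'valid' convolution output length
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i+len(k)] = k[::-1]   # flipped kernel on each shifted row (Toeplitz band)
print(T @ x)
print(np.convolve(x, k, mode='valid'))  # same result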

Notation

Standard Conventions#

1. Linear map and matrix notation.

  • Linear map: $f: V \to W$ or $A: \mathbb{R}^d \to \mathbb{R}^m$ (function notation).

  • Matrix representation: $A \in \mathbb{R}^{m \times d}$ or $[A]_{ij}$ for entry in row $i$, column $j$.

  • Matrix-vector product: $y = Ax$ (linear map applied to vector $x$).

  • Matrix-matrix product: $C = AB$ (composition: apply $B$ then $A$).

  • Image and kernel: $\text{im}(A)$ or $\text{col}(A)$ for column space; $\ker(A)$ or $\text{null}(A)$ for null space.

Examples:

  • Linear map: $f(x) = 3x_1 - 2x_2 \in \mathbb{R}$. Matrix: $A = [3, -2] \in \mathbb{R}^{1 \times 2}$.

  • Linear map: $(x, y) \mapsto (2x + y, x - 3y)$. Matrix: $A = \begin{bmatrix} 2 & 1 \\ 1 & -3 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$.

  • Composition: Neural network layer 1: $h_1 = \sigma(W_1 x)$, layer 2: $h_2 = \sigma(W_2 h_1) = \sigma(W_2 \sigma(W_1 x))$. Composition: $f = \sigma \circ (W_2 \circ \sigma \circ W_1)$.

2. Rank notation.

  • Rank: $\text{rank}(A)$ = dimension of column space = number of linearly independent columns.

  • Nullity: $\text{nullity}(A) = d - \text{rank}(A)$ (dimension of null space).

  • Full rank: $\text{rank}(A) = \min(m, d)$ (maximum possible rank).

  • Rank deficient: $\text{rank}(A) < \min(m, d)$ (singular or near-singular).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 2 (full rank), columns independent.

  • $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 1 (rank deficient), second column = 2 × first column.

3. Transpose notation.

  • Transpose: $A^\top$ (rows and columns swapped).

  • Adjoint property: $(Ax)^\top y = x^\top (A^\top y)$ (inner product duality).

  • Composition rule: $(AB)^\top = B^\top A^\top$ (note reversed order).

  • Inverse of transpose: $(A^\top)^{-1} = (A^{-1})^\top$ (for invertible $A$).

Examples:

  • $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$.

  • Gradient in backprop: $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (linear map $A$ → transpose $A^\top$ for gradient).
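
These identities are cheap to verify numerically (arbitrary example matrices, NumPy assumed):

import numpy as np

A = np.array([[1., 2., 3.], [4., 5., 6.]])    # 2x3
B = np.array([[1., 0.], [0., 1.], [1., 1.]])  # 3x2
x = np.array([1., -1., 2.]); y = np.array([0.5, 2.])

print(np.isclose((A @ x) @ y, x @ (A.T @ y)))  # adjoint property
print(np.allclose((A @ B).T, B.T @ A.T))       # reversed order under transpose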

4. Composition and chaining notation.

  • Composition operator: $(f \circ g)(x) = f(g(x))$ (apply $g$ first, then $f$).

  • Matrix chaining: For $f = A, g = B$, composition is $f \circ g = A \circ B$ with matrix product $AB$ (apply $B$ then $A$).

  • Neural network layers: Output $h_i = \sigma_i(A_i h_{i-1})$ (chain $A_1, \sigma_1, A_2, \sigma_2, \ldots$).

Examples:

  • Rotate by $\theta$, then scale by $2$: composition $S_2 \circ R_\theta$. Matrix: $S_2 R_\theta$ (the rightmost factor is applied first).

  • Neural network: $f(x) = \sigma_2(A_2 \sigma_1(A_1 x))$. Composition: $\sigma_2 \circ A_2 \circ \sigma_1 \circ A_1$.
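
Composition order is a frequent bug source. A small sketch (using an anisotropic scale $\mathrm{diag}(2,1)$, since a uniform scale commutes with rotations) shows the two orders differ:

import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([2., 1.])  # anisotropic scale (a uniform scale would commute with R)
x = np.array([1., 0.])

print(S @ R @ x)  # rotate first, then scale: [0, 1]
print(R @ S @ x)  # scale first, then rotate: [0, 2]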

5. Invertibility and determinant notation.

  • Invertible (nonsingular): $A^{-1}$ exists; $AA^{-1} = A^{-1} A = I$.

  • Determinant: $\det(A)$ or $|A|$. For invertibility: $\det(A) \neq 0 \Leftrightarrow A$ invertible.

  • Condition number: $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max} / \sigma_{\min}$ (ratio of largest to smallest singular value).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. $\det(A) = 2 \neq 0$, so $A$ is invertible. $A^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1/2 \end{bmatrix}$.

  • Ill-conditioned matrix: $\kappa(A) = 10^{10}$ (nearly singular). Small perturbations cause large changes in solution. Use regularization or preconditioning.
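
To make the conditioning pitfall concrete, a small sketch (the near-singular matrix is an arbitrary example): a tiny perturbation of $b$ produces a large change in the solution.

import numpy as np

A = np.array([[1., 1.], [1., 1.0001]])  # nearly singular
s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])                     # condition number, roughly 4e4

b = np.array([2., 2.0001])
db = np.array([0., 1e-4])               # tiny perturbation of b
x1 = np.linalg.solve(A, b)
x2 = np.linalg.solve(A, b + db)
print(np.linalg.norm(x2 - x1))          # large change in solution, roughly 1.4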

6. Special matrices notation.

  • Identity: $I \in \mathbb{R}^{d \times d}$ (diagonal matrix with 1’s).

  • Orthogonal/orthonormal: $Q^\top Q = QQ^\top = I$ (columns/rows orthonormal).

  • Symmetric: $A^\top = A$.

  • Positive semi-definite (PSD): $A \succeq 0$; all eigenvalues $\geq 0$. Covariance matrices are PSD.

Examples:

  • QR decomposition: $A = QR$ with $Q$ having orthonormal columns and $R$ upper triangular.

  • Symmetric matrix: $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$. Eigendecomposition: $\Sigma = U \Lambda U^\top$ (orthonormal $U$, diagonal $\Lambda$).

  • PSD matrix: Covariance $\text{Cov}(X) \succeq 0$ (always PSD). Gram matrix $G = X^\top X \succeq 0$ (always PSD).
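
Consistent with the PSD facts above, a quick numerical check (arbitrary random data; NumPy assumed) that a Gram matrix is symmetric PSD:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
G = X.T @ X                                     # Gram matrix, 3x3
print(np.allclose(G, G.T))                      # symmetric: True
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))  # eigenvalues >= 0 up to roundoff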

Pitfalls & sanity checks

When working with linear maps and matrices:

  1. Always check shapes. Matrix multiply requires compatible dimensions. $A \in \mathbb{R}^{m \times d}$, $x \in \mathbb{R}^d$ yields $Ax \in \mathbb{R}^m$. Shape mismatch = runtime error.

  2. Prefer stable decompositions. Never compute $(X^\top X)^{-1}$ explicitly. Use QR or SVD via a least-squares solver (truncating small singular values when needed) for numerical stability; see the sketch after this list.

  3. Transpose order matters. $(AB)^\top = B^\top A^\top$ (reversed order). In backprop, composition reverses layer order via transposes.

  4. Condition number determines stability. If $\kappa(A) > 10^8$, expect numerical errors. Use regularization (Ridge, Tikhonov) or preconditioning.

  5. Gradients flow via transposes. Backprop systematically applies transposes. Understand: ill-conditioned weights → vanishing/exploding gradients.
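
For pitfall 2, a minimal comparison sketch (assumptions: NumPy, an arbitrary ill-conditioned Vandermonde design; np.linalg.lstsq is the SVD-based solver):

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
X = np.vander(t, 8)                    # ill-conditioned polynomial features
y = np.sin(2 * np.pi * t) + 0.01 * rng.standard_normal(50)

w_stable = np.linalg.lstsq(X, y, rcond=None)[0]  # SVD-based solve
w_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)    # explicit inverse: the route to avoid

print(np.linalg.cond(X))  # large condition number
print(np.linalg.norm(X @ w_stable - y), np.linalg.norm(X @ w_normal - y))
# the explicit-inverse residual degrades as conditioning worsens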

References

Foundational texts:

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge University Press.

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

Linear maps and matrix theory:

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

  2. Hoffman, K., & Kunze, R. (1971). Linear Algebra (2nd ed.). Prentice-Hall.

  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

  4. Axler, S., Bourdon, P., & Ramey, W. (2001). Harmonic Function Theory (2nd ed.). Springer.

Neural networks and backpropagation:

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533–536.

  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  3. Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM.

  4. LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). “Efficient backprop.” In Neural Networks: Tricks of the Trade (pp. 9–50). Springer.

Optimization:

  1. Robbins, H., & Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22(3), 400–407.

  2. Nesterov, Y. (2018). Lectures on Convex Optimization (2nd ed.). Springer.

  3. Kingma, D. P., & Ba, J. (2014). “Adam: A method for stochastic optimization.” arXiv:1412.6980.

Transformers and attention:

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). “Attention is all you need.” In NeurIPS (pp. 5998–6008).

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding.” NAACL.

  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). “An image is worth 16×16 words: Transformers for image recognition at scale.” ICLR.

Least squares and numerical methods:

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Dover reprint.

  2. Golub, G. H., & Pereyra, V. (1973). “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate.” SIAM Journal on Numerical Analysis, 10(2), 413–432.

Five worked examples

Worked Example 1: Backprop uses transpose#

Problem. For $y = Wx$, show $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$.

Solution (math). The Jacobian of $y = Wx$ is $W$; reverse-mode differentiation multiplies the upstream gradient by the Jacobian transpose, giving $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$.

Solution (Python).

import numpy as np

W = np.array([[2., 1.], [-1., 3.]])
dL_dy = np.array([0.5, -2.])
print(W.T @ dL_dy)  # gradient w.r.t. x: W^T dL/dy

Worked Example 2: Q,K,V projections in transformers#

Problem. Compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$.

Solution (math). Each projection is a linear map from the model dimension to the head dimension, applied row-wise to $X$.

Solution (Python).

import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])  # three tokens, model dimension 2
Wq = np.array([[1., 0.], [0., 2.]])
Wk = np.array([[2., 0.], [0., 1.]])
Wv = np.array([[1., 1.], [0., 1.]])
print(X @ Wq)  # queries
print(X @ Wk)  # keys
print(X @ Wv)  # values

Worked Example 3: Normal equations matrix#

Problem. Form $A = X^\top X$ and $b = X^\top y$ for least squares.

Solution (math). Solving $Aw = b$ is equivalent to minimizing $\|Xw - y\|_2^2$ when $X$ has full column rank.

Solution (Python).

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.]])  # intercept column plus one feature
y = np.array([1., 2., 2.5])
A = X.T @ X  # normal-equations matrix
b = X.T @ y
print(A)
print(b)

Worked Example 4: Batch GD as matrix products#

Problem. Compute one gradient step for MSE.

Solution (math). $w \leftarrow w - \eta \,\tfrac{1}{n} X^\top (Xw - y)$.

Solution (Python).

import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
y = np.array([1., 0., 1.])
w = np.zeros(2)
eta = 0.1
g = (1 / len(X)) * X.T @ (X @ w - y)  # MSE gradient
print(w - eta * g)                    # one gradient step

Worked Example 5: Attention is matrix multiplication#

Problem. Compute $A = \text{softmax}(QK^\top / \sqrt{d_k})$ and the output $O = AV$.

Solution (math). Attention is a composition of matrix multiplications plus a row-wise softmax.

Solution (Python).

import numpy as np

def softmax(z, axis=-1):
    # numerically stable row-wise softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1., 0.], [0., 1.]])
K = np.array([[1., 0.], [1., 1.], [0., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])
scores = Q @ K.T / np.sqrt(2)  # scaled dot products, d_k = 2
A = softmax(scores, axis=1)    # attention weights, rows sum to 1
print(A @ V)                   # output O = AV
