Introduction
Linear maps (also called linear transformations) are structure-preserving functions between vector spaces: they respect addition and scalar multiplication. Matrices are their concrete representation: a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ is represented by a matrix $A \in \mathbb{R}^{m \times d}$ so that $f(x) = Ax$. This is the language of neural networks: each layer applies a linear map (a matrix multiplication) followed by a nonlinear activation, and the network is their composition. Understanding linear maps clarifies:
Model expressiveness: What functions can be represented? (Universal approximation via composition of linear maps and nonlinearities.)
Gradient flow: How do errors backpropagate through layers? (Chain rule uses transposes of linear map matrices.)
Data transformation: How do representations change through layers? (Each layer applies a linear map to its input.)
Optimization: How should weights change to reduce loss? (Gradients of linear layers are computed via transposes.)
Linear maps are everywhere in ML:
Neural networks: Each dense layer is a linear map $h_{i+1} = \sigma(W_i h_i + b_i)$ (linear map $W_i$, then activation $\sigma$).
Attention: Query/Key/Value projections are linear maps. Attention output is a weighted linear combination.
Least squares: Solving $\hat{w} = (X^\top X)^{-1} X^\top y$ involves products of linear maps.
PCA: Projection onto principal components is a linear map.
Convolution: Convolution is a linear map; a convolutional layer is a matrix multiplication with a structured (sparse, weight-shared) matrix.
Important Ideas
1. Linear map = function preserving structure. A function $f: V \to W$ between vector spaces is linear if:
Additivity: $f(u + v) = f(u) + f(v)$ for all $u, v \in V$.
Homogeneity: $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$.
Why these properties? Linear maps are exactly those that can be written as matrix multiplication: $f(x) = Ax$. Additivity ensures the matrix distributes: $A(x + y) = Ax + Ay$. Homogeneity ensures scaling: $A(\alpha x) = \alpha (Ax)$.
Example: Rotation by angle $\theta$ is linear: $f([x, y]^\top) = [\cos\theta \cdot x - \sin\theta \cdot y, \sin\theta \cdot x + \cos\theta \cdot y]^\top = R_\theta [x, y]^\top$.
Non-example: $f(x) = x + 1$ is not linear (fails $f(0) = 0$ test). $f(x) = \|x\|$ is not linear (not additive).
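The two axioms are easy to check numerically. A minimal NumPy sketch (the angle and test vectors are arbitrary choices) verifying that rotation is linear while the two non-examples fail:

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

u, v, alpha = np.array([1.0, 2.0]), np.array([-3.0, 0.5]), 2.5

# Additivity and homogeneity hold for the rotation
assert np.allclose(R @ (u + v), R @ u + R @ v)
assert np.allclose(R @ (alpha * u), alpha * (R @ u))

# f(x) = x + 1 fails the f(0) = 0 test
f = lambda x: x + 1.0
print(f(np.zeros(2)))  # [1. 1.], not 0, so not linear

# f(x) = ||x|| is not additive (triangle inequality is strict here)
g = np.linalg.norm
print(g(u + v), g(u) + g(v))  # unequal
```

Checking the axioms on a handful of vectors is of course not a proof, but it is a quick sanity test when implementing a map you believe is linear.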
2. Matrix representation is unique (for a fixed choice of bases). For a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ with the standard bases, the matrix $A \in \mathbb{R}^{m \times d}$ satisfying $f(x) = Ax$ is unique. Columns of $A$ are images of standard basis vectors: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.
Why unique? By linearity, $f(x) = f(\sum_j x_j e_j) = \sum_j x_j f(e_j)$. If we know $f$ on basis vectors, we know $f$ everywhere.
Example: $f(x) = 2x_1 + 3x_2$ is $f([x_1, x_2]^\top) = [2, 3] \cdot [x_1, x_2]^\top$. Matrix is $A = [2, 3]$ (1 row, 2 columns).
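This recipe, columns are images of basis vectors, can be run directly. A sketch using an example map $f(x) = (2x_1 + 3x_2,\; x_1 - x_2)$ chosen for illustration:

```python
import numpy as np

def f(x):
    # example linear map R^2 -> R^2 (chosen for illustration)
    return np.array([2 * x[0] + 3 * x[1], x[0] - x[1]])

d = 2
# Columns of A are f(e_1), f(e_2): evaluate f on the standard basis
A = np.column_stack([f(e) for e in np.eye(d)])
print(A)  # [[ 2.  3.]
          #  [ 1. -1.]]

# Knowing f on the basis determines f everywhere
x = np.array([4.0, -1.0])
assert np.allclose(f(x), A @ x)
```

The assertion is exactly the uniqueness argument: since $f(x) = \sum_j x_j f(e_j)$, the matrix built from basis images reproduces $f$ on every input.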
3. Composition = matrix multiplication. For linear maps $f: \mathbb{R}^d \to \mathbb{R}^m$ with matrix $A$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ with matrix $B$, the composition $g \circ f: \mathbb{R}^d \to \mathbb{R}^p$ has matrix $BA$ (note order: right-to-left in notation, left-to-right in matrix product).
Why this order? $(g \circ f)(x) = g(f(x)) = g(Ax) = B(Ax) = (BA)x$. Matrix product $BA$ is therefore natural for composition.
Example: Neural network layer 1 applies $A_1$, layer 2 applies $A_2$. Composition is $A_2 A_1$ (layer 1 first, then layer 2).
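A quick numerical check of the composition order, with shapes chosen arbitrarily for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # f: R^4 -> R^3 (applied first)
B = rng.standard_normal((2, 3))   # g: R^3 -> R^2 (applied second)
x = rng.standard_normal(4)

# g(f(x)) = B(Ax) = (BA)x: B is on the left even though f acts first
assert np.allclose(B @ (A @ x), (B @ A) @ x)
print((B @ A).shape)  # (2, 4): a single map R^4 -> R^2
```

Note that the composed matrix $BA$ is computed once but represents applying $A$ then $B$, mirroring how a two-layer network (without nonlinearities) collapses to one linear map.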
4. Transpose = dual map (adjoint). For matrix $A: \mathbb{R}^d \to \mathbb{R}^m$, the transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^d$ is the unique linear map satisfying:
$$
(Ax)^\top y = x^\top (A^\top y) \quad \text{for all } x, y
$$
Geometric interpretation: If $A$ is a rotation, $A^\top = A^{-1}$ rotates by the same angle in the opposite direction. If $A$ is an orthogonal projection onto a subspace, $A^\top = A$: the projection is its own adjoint.
In backprop: If forward pass applies $y = Ax$, reverse mode applies $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (transpose carries gradients backward).
Example: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
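Both the defining identity and the backprop rule can be checked in a few lines. A sketch where the matrices, vectors, and the upstream gradient `dL_dy` are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))   # A: R^2 -> R^3
x = rng.standard_normal(2)
y = rng.standard_normal(3)

# Defining identity of the adjoint: <Ax, y> = <x, A^T y>
assert np.isclose((A @ x) @ y, x @ (A.T @ y))

# Backprop sketch: forward pass y = Ax, backward pass uses A^T
dL_dy = rng.standard_normal(3)    # upstream gradient (assumed given)
dL_dx = A.T @ dL_dy               # transpose carries gradients backward
print(dL_dx.shape)                # (2,): back in the input space
```

The shape change ($\mathbb{R}^3 \to \mathbb{R}^2$) is the point: gradients live in the input space, and the transpose is the map that takes them there.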
5. Image and kernel characterize a linear map. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:
Image (column space): $\text{im}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^d\}$ (all possible outputs). Dimension = rank$(A)$.
Kernel (null space): $\ker(A) = \text{null}(A) = \{x : Ax = 0\}$ (inputs mapping to zero). Dimension = nullity$(A) = d - \text{rank}(A)$.
Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = d$ (the input dimension splits into the part the map preserves and the part it collapses to zero).
Why important? Image tells us what the map can represent. Kernel tells us what information is lost. For invertible maps, kernel is trivial (only zero maps to zero).
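These quantities are directly computable. A sketch for a deliberately rank-deficient example matrix, using the SVD to extract a kernel basis (the right singular vectors beyond the rank span the null space):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # second row = 2 * first row, so rank 1

d = A.shape[1]
rank = np.linalg.matrix_rank(A)  # dimension of the image
nullity = d - rank               # rank-nullity theorem
print(rank, nullity)             # 1 2

# Kernel basis: right singular vectors with (numerically) zero singular value
_, s, Vt = np.linalg.svd(A)
kernel_basis = Vt[rank:]         # rows span ker(A)
assert np.allclose(A @ kernel_basis.T, 0)
```

Here two of three input dimensions are lost: any input component lying in the two-dimensional kernel is invisible to the map's output.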
Relevance to Machine Learning
Expressiveness through composition. A single linear map is limited (rotations, scalings, projections, and their combinations), and a composition of linear maps is still a single linear map. Interleaving nonlinearities is what dramatically increases expressiveness: the universal approximation theorem (Cybenko 1989) says a single hidden layer with a sigmoidal activation and enough units can approximate any continuous function on a compact set.
Gradient computation via transposes. Backpropagation is the chain rule applied backward through the network. Gradient w.r.t. input of a layer uses the transpose of the weight matrix. Understanding transposes is essential for implementing and understanding neural networks.
Data transformation and representation learning. Neural networks learn by composing linear maps (weight matrices) with nonlinearities. Early layers learn low-level features (captured by the image of the first layer's map); deeper layers compose these into high-level features by stacking further linear maps and nonlinearities.
Optimization structure. Gradient descent updates weights proportional to $X^\top (Xw - y)$ (linear map composition). Understanding matrix products clarifies why batch size, feature dimension, and conditioning affect optimization.
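A minimal sketch of this update rule on a synthetic least-squares problem (the data, step size, and iteration count are arbitrary choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))     # design matrix
w_true = np.array([1.0, -2.0, 0.5])  # assumed ground-truth weights
y = X @ w_true                       # noiseless targets for the sketch

w = np.zeros(3)
lr = 0.01                            # step size (illustrative, not tuned)
for _ in range(2000):
    grad = X.T @ (X @ w - y)         # apply X, then its transpose
    w -= lr * grad

print(w)  # converges toward w_true
```

The gradient line is the whole story: one forward linear map ($Xw$), a residual, and one transpose ($X^\top$) carrying the residual back to weight space. Conditioning of $X^\top X$ governs how large `lr` can be.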
Algorithmic Development History
1. Linear transformations (Euler, 1750s-1770s). Euler rotated coordinate systems to solve differential equations and optimize geometry problems. Rotations are linear maps.
2. Matrix algebra (Cayley, Sylvester, 1850s-1880s). Introduced matrices as algebraic objects. Cayley-Hamilton theorem: matrices satisfy their own characteristic polynomial. Matrix multiplication defined to represent composition of linear transformations.
3. Bilinear forms and adjoints (Cauchy, Hermite, Hilbert, 1800s-1900s). Developed duality theory: every linear form has an adjoint. Transpose is the matrix adjoint.
4. Rank and nullity (Grassmann 1844, Frobenius 1870s-1880s). Formalized rank as dimension of image. Rank-nullity theorem central to linear algebra.
5. Spectral theory (Schur 1909, Hilbert 1920s). Every square matrix is unitarily similar to a triangular matrix (Schur decomposition); symmetric matrices decompose into real eigenvalues and orthonormal eigenvectors. Spectral decompositions reveal the structure of linear maps.
6. Computational algorithms (Householder 1958, Golub-Kahan 1965): Developed numerically stable algorithms for matrix factorization (QR, SVD, Cholesky). Made linear algebra practical at scale.
7. Neural networks and backprop (Rumelhart, Hinton, Williams 1986). Showed that composing linear maps with nonlinearities, trained via backprop (which uses transposes), learns powerful representations. Modern deep learning.
8. Transformers and attention (Vaswani et al. 2017). Attention is built almost entirely from linear maps: $\text{softmax}(QK^\top) V$ composes matrix multiplications with a single nonlinearity (softmax) in between, and the Query/Key/Value projections are themselves linear maps.
Definitions
Linear map (linear transformation). A function $f: V \to W$ between vector spaces over $\mathbb{R}$ is linear if:
$f(u + v) = f(u) + f(v)$ for all $u, v \in V$ (additivity).
$f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$ (homogeneity).
Equivalently: $f(\alpha u + \beta v) = \alpha f(u) + \beta f(v)$ (linearity).
Matrix representation. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$, the matrix $A \in \mathbb{R}^{m \times d}$ represents $f$ if $f(x) = Ax$ for all $x \in \mathbb{R}^d$. Columns of $A$ are: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.
Image and kernel. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:
$$
\text{im}(A) = \{Ax : x \in \mathbb{R}^d\} = \text{col}(A), \quad \text{ker}(A) = \{x : Ax = 0\} = \text{null}(A)
$$
Rank. The rank of $A$ is:
$$
\text{rank}(A) = \dim(\text{im}(A)) = \dim(\text{col}(A)) = \text{number of linearly independent columns}
$$
Nullity. The nullity of $A$ is:
$$
\text{nullity}(A) = \dim(\text{ker}(A)) = d - \text{rank}(A)
$$
Rank-nullity theorem. For any matrix $A \in \mathbb{R}^{m \times d}$:
$$
\text{rank}(A) + \text{nullity}(A) = d
$$
Transpose (adjoint). The transpose of $A \in \mathbb{R}^{m \times d}$ is $A^\top \in \mathbb{R}^{d \times m}$ satisfying:
$$(Ax)^\top y = x^\top (A^\top y), \quad (AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A$$
Invertible matrix. A square matrix $A \in \mathbb{R}^{d \times d}$ is invertible (nonsingular) if there exists $A^{-1}$ such that $AA^{-1} = A^{-1} A = I$. Equivalent conditions: $\text{rank}(A) = d$ (full rank), $\ker(A) = \{0\}$ (trivial kernel), $\det(A) \neq 0$ (nonzero determinant).
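The equivalent conditions can be checked against each other numerically. A sketch on the $2 \times 2$ example from earlier in the notes:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# The equivalent invertibility conditions should all agree
full_rank = np.linalg.matrix_rank(A) == A.shape[0]
nonzero_det = not np.isclose(np.linalg.det(A), 0.0)  # det = -2 here
assert full_rank and nonzero_det

# So the inverse exists, and both products give the identity
A_inv = np.linalg.inv(A)
assert np.allclose(A @ A_inv, np.eye(2))
assert np.allclose(A_inv @ A, np.eye(2))
```

In practice, explicit determinant or inverse computations are avoided for large matrices (solve linear systems instead), but for checking definitions at this scale they are fine.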
Comments