Chapter 6
Orthogonality & Projections
Key ideas: Introduction

Introduction#

Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:

  • Residuals in least squares are orthogonal to the column space (no further decrease is possible within the subspace)

  • Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace

  • Orthonormal bases simplify computations and improve numerical stability

  • Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers

  • PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation

Important ideas#

  1. Orthogonality and complements

    • $x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.

  2. Orthogonal projectors

    • A projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For orthonormal $U\in\mathbb{R}^{d\times k}$ spanning $\mathcal{S}$: $P=UU^\top$.

  3. Projection theorem

    • For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $r\in\mathcal{S}^\perp$; moreover $P_{\mathcal{S}}x$ is the unique minimizer of $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.

  4. Pythagorean identity

    • If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.

  5. Orthonormal bases and QR

    • Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.

  6. Spectral/SVD structure

    • For symmetric $\Sigma$, eigenvectors are orthonormal; SVD gives $X=U\Sigma V^\top$ with $U,V$ orthogonal. Truncation yields best rank-$k$ approximation (Eckart–Young).

  7. Orthogonal transformations

    • $Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; determinants $\pm1$ (rotations or reflections). Condition numbers remain unchanged.
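These properties are cheap to certify numerically. A minimal NumPy sketch (the random subspace and test vectors are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

# Orthonormal basis for a random k-dimensional subspace via QR
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = U @ U.T  # orthogonal projector onto col(U)

x = rng.standard_normal(d)
Px, r = P @ x, x - P @ x

assert np.allclose(P @ P, P)      # idempotent
assert np.allclose(P, P.T)        # symmetric
assert np.allclose(U.T @ r, 0)    # residual lies in the orthogonal complement
# Pythagorean identity: ||x||^2 = ||Px||^2 + ||r||^2
assert np.isclose(x @ x, Px @ Px + r @ r)

# Orthogonal transformations preserve norms and inner products
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
assert np.isclose((Q @ x) @ (Q @ y), x @ y)
assert np.isclose(abs(np.linalg.det(Q)), 1.0)  # rotation or reflection
```

Running the block silently passes all assertions, which is exactly the "certificate" style used throughout this chapter.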

Relevance to ML#

  • Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.

  • PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.

  • Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.

  • Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.

  • Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.

  • Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).

Algorithmic development (milestones)#

  • 1900s–1930s: Gram–Schmidt orthonormalization; least squares geometry formalized.

  • 1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).

  • 1958–1965: Householder reflections and Golub’s QR algorithms stabilize orthogonalization.

  • 1966: Orthogonal Procrustes (Schönemann) closed-form solution.

  • 1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.

  • 2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.

Definitions#

  • Orthogonal/Orthonormal: vectors are orthogonal if their pairwise inner products vanish, orthonormal if in addition each has unit length; equivalently, the columns of $U$ are orthonormal iff $U^\top U=I$.

  • Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.

  • Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.

  • Orthogonal matrix: $Q^\top Q=I$; preserves norms and inner products.

  • PCA subspace: top-$k$ eigenvectors of covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems)#

  • Projection theorem: For closed subspace $\mathcal{S}$, projection $P_\mathcal{S}x$ uniquely minimizes $\lVert x-s\rVert_2$; residual is orthogonal to $\mathcal{S}$.

  • Pythagorean/Bessel/Parseval: Orthogonal decompositions preserve squared norms; partial sums bounded (Bessel); complete bases preserve energy (Parseval).

  • Fundamental theorem of linear algebra: $\text{col}(A)$ is orthogonal to $\text{null}(A^\top)$; $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$.

  • Spectral theorem: Symmetric matrices have orthonormal eigenbases; they are orthogonally diagonalizable, $Q^\top A Q=\Lambda$ with $Q$ orthogonal.

  • Eckart–Young–Mirsky: Best rank-$k$ approximation in Frobenius/2-norm via truncated SVD.
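The fundamental-theorem decomposition above is easy to certify in code; a sketch (random $A$ and $y$ are illustrative) that extracts orthonormal bases for $\text{col}(A)$ and $\text{null}(A^\top)$ from the full SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4
A = rng.standard_normal((n, d))

# Full SVD: left singular vectors split into col(A) and null(A^T) bases
U, S, Vt = np.linalg.svd(A)           # U is n x n
r = int(np.sum(S > 1e-10))            # numerical rank
U_col, U_null = U[:, :r], U[:, r:]

# The two subspaces are orthogonal and together span R^n
assert np.allclose(U_col.T @ U_null, 0)
assert U_col.shape[1] + U_null.shape[1] == n

# Any y splits uniquely: y = (component in col(A)) + (component in null(A^T))
y = rng.standard_normal(n)
y_col = U_col @ (U_col.T @ y)
y_null = U_null @ (U_null.T @ y)
assert np.allclose(y, y_col + y_null)
assert np.allclose(A.T @ y_null, 0)   # null(A^T) component is annihilated by A^T
```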

Applied (landmark systems and practices)#

  • PCA/whitening: Jolliffe (2002); Shlens (2014) — denoising and compression.

  • Least squares/QR solvers: Golub–Van Loan (2013) — stable projections.

  • Orthogonal Procrustes in embedding alignment: Schönemann (1966); Smith et al. (2017).

  • Orthogonal initialization/constraints: Saxe et al. (2013); Mishkin & Matas (2015).

  • Subspace tracking and signal processing: Halko et al. (2011) randomized SVD.

Key ideas: Where it shows up
  1. PCA and subspace denoising

  • PCA finds orthonormal directions $V_k$ (top right singular vectors of centered $X$) maximizing variance; the projection $X_k = X V_k V_k^\top$ minimizes reconstruction error.

  • Achievements: Dimensionality reduction at scale; whitening and denoising in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. Least squares as projection

  • $\hat{y} = X w^*$ is the projection of $y$ onto $\text{col}(X)$; residual $r=y-\hat{y}$ satisfies $X^\top r=0$.

  • Achievements: Foundational to regression and linear models; efficient via QR/SVD. References: Gauss 1809; Golub–Van Loan 2013.

  3. Orthogonalization algorithms (QR)

  • Householder/Modified Gram–Schmidt produce orthonormal bases with numerical stability; essential in solvers and factorizations.

  • Achievements: Robust, high-performance linear algebra libraries (LAPACK). References: Householder 1958; Golub 1965; Trefethen–Bau 1997.

  4. Orthogonal Procrustes and embedding alignment

  • Best orthogonal alignment between representation spaces via SVD of $A^\top B$ (solution $R=UV^\top$).

  • Achievements: Cross-lingual word embedding alignment; domain adaptation. References: Schönemann 1966; Smith et al. 2017.

  5. Orthogonal constraints/initialization in deep nets

  • Orthogonal weight matrices preserve variance across layers; improve training stability and gradient flow.

  • Achievements: Deep linear dynamics analysis; practical initializations. References: Saxe et al. 2013; Mishkin & Matas 2015.

Notation
  • Data matrix and spaces: $X\in\mathbb{R}^{n\times d}$, $\text{col}(X)\subseteq\mathbb{R}^n$, $\text{null}(X^\top)$.

  • Orthonormal basis: $U\in\mathbb{R}^{n\times k}$ with $U^\top U=I$.

  • Orthogonal projector: $P=UU^\top$ (symmetric, idempotent); residual $r=(I-P)y$ satisfies $U^\top r=0$.

  • QR factorization: $X=QR$ with $Q^\top Q=I$; $Q$ spans $\text{col}(X)$.

  • SVD/PCA: $X=U\Sigma V^\top$; top-$k$ projection $P_k=U_k U_k^\top$ (or $X V_k V_k^\top$ on features).

  • Examples:

    • Least squares via projection: $\hat{y} = P y$ with $P=Q Q^\top$ for $Q$ from QR of $X$.

    • PCA reconstruction: $\hat{X} = X V_k V_k^\top$; error $\lVert X-\hat{X}\rVert_F^2 = \sum_{i>k}\sigma_i^2$.

    • Procrustes alignment: $R=UV^\top$ from SVD of $A^\top B$; $R$ is orthogonal.

Pitfalls & sanity checks
  • Centering for PCA: use $X_c$ to ensure principal directions capture variance, not mean.

  • Orthogonality of bases: $U$ must be orthonormal for $P=UU^\top$ to be an orthogonal projector; otherwise projection is oblique.

  • Numerical orthogonality: prefer QR/SVD; classical Gram–Schmidt can lose orthogonality under ill-conditioning.

  • Certificates: verify $P$ is symmetric/idempotent and that residuals are orthogonal to $\text{col}(X)$.

  • Overfitting with high-$k$ PCA: track retained variance and use validation.

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Projections, orthogonality, and approximation

  4. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

  5. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.

  6. Gram, J. (1883); Schmidt, E. (1907). Orthonormalization methods.

PCA and applications

  7. Jolliffe, I. (2002). Principal Component Analysis.

  8. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

Embedding alignment and orthogonal methods in ML

  9. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.

  10. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

  11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.

  12. Mishkin, D., & Matas, J. (2015). All you need is a good init.

General ML texts

  13. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  14. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

Five worked examples

Worked Example 1: Least squares as orthogonal projection (QR certificate)#

Introduction#

Show that least squares fits correspond to orthogonal projection of $y$ onto $\text{col}(X)$, with residual orthogonal to features.

Purpose#

Derive $\hat{y}=P y$ with $P=Q Q^\top$ and verify $X^\top r=0$ numerically.

Importance#

Anchors regression in subspace geometry; provides robust implementation guidance via QR.

What this example demonstrates#

  • $X=QR$ with $Q^\top Q=I$ yields $\hat{y}=QQ^\top y$.

  • Residual $r=y-\hat{y}$ satisfies $Q^\top r=0$ and $X^\top r=0$.

Background#

Least squares minimizes squared error; projection theorem assures unique closest point in $\text{col}(X)$.

Historical context#

Gauss/Legendre least squares; Householder/Golub QR for numerical stability.

Prevalence in ML#

Linear models, GLM approximations, and as inner loops in larger systems.

Notes#

  • Prefer QR/SVD over normal equations.

  • Check $P$ is symmetric and idempotent in code.

Connection to ML#

Core of regression pipelines; basis for Ridge/Lasso solvers (with modifications).

Connection to Linear Algebra Theory#

Projection theorem; FTLA decomposition $\mathbb{R}^n=\text{col}(X)\oplus\text{null}(X^\top)$.

Pedagogical Significance#

Gives a geometric certificate of optimality via orthogonality.

References#

  1. Gauss (1809); Legendre (1805) — least squares.

  2. Golub & Van Loan (2013) — QR solvers.

  3. Trefethen & Bau (1997) — numerical linear algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 5
X = np.random.randn(n, d)
w_true = np.array([1.2, -0.8, 0.5, 0.0, 2.0])
y = X @ w_true + 0.1 * np.random.randn(n)

Q, R = np.linalg.qr(X)
P = Q @ Q.T
y_hat = P @ y
r = y - y_hat

# Certificates
print("Symmetric P?", np.allclose(P, P.T, atol=1e-10))
print("Idempotent P?", np.allclose(P @ P, P, atol=1e-10))
print("Q^T r ~ 0?", np.linalg.norm(Q.T @ r))
print("X^T r ~ 0?", np.linalg.norm(X.T @ r))

# Compare to lstsq fit
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Projection match?", np.allclose(y_hat, X @ w_ls, atol=1e-8))

Worked Example 2: PCA projection and best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate orthogonal projection onto top-$k$ principal components and verify reconstruction error equals the sum of squared tail singular values.

Purpose#

Connect PCA’s orthogonal subspace to optimal low-rank approximation.

Importance#

Backbone of dimensionality reduction and denoising in ML.

What this example demonstrates#

  • $X=U\Sigma V^\top$; projection to rank-$k$ is $X_k = U_k \Sigma_k V_k^\top = X V_k V_k^\top$.

  • Error: $\lVert X-X_k\rVert_F^2 = \sum_{i>k} \sigma_i^2$.

Background#

Eckart–Young shows truncated SVD minimizes Frobenius/2-norm error among rank-$k$ matrices.

Historical context#

Low-rank approximation dates to the 1930s; widespread modern use in ML systems.

Prevalence in ML#

Feature compression, noise removal, approximate nearest neighbors, latent semantic analysis.

Notes#

  • Center data for covariance-based PCA; use SVD directly on $X_c$.

Connection to ML#

Trade off between compression (smaller $k$) and fidelity (retained variance).

Connection to Linear Algebra Theory#

Orthogonal projectors $U_k U_k^\top$; spectral ordering of singular values.

Pedagogical Significance#

Illustrates how orthogonality yields optimality guarantees.

References#

  1. Eckart & Young (1936) — best rank-$k$.

  2. Jolliffe (2002) — PCA.

  3. Shlens (2014) — PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 80, 30, 5
X = np.random.randn(n, d) @ np.diag(np.linspace(5, 0.1, d))  # create decaying spectrum
Xc = X - X.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T
Xk = Xc @ Vk @ Vk.T

err = np.linalg.norm(Xc - Xk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-6))

Worked Example 3: Gram–Schmidt vs Householder QR (orthogonality under stress)#

Introduction#

Compare classical Gram–Schmidt to numerically stable QR on nearly collinear vectors.

Purpose#

Show why stable orthogonalization matters when projecting in high dimensions.

Importance#

Precision loss destroys orthogonality and degrades projections/solvers.

What this example demonstrates#

  • Classical GS loses orthogonality; QR (Householder) maintains $Q^\top Q\approx I$.

Background#

Modified GS improves stability, but Householder QR is preferred in libraries.

Historical context#

Stability advancements from Gram–Schmidt to Householder underpin modern LAPACK.

Prevalence in ML#

Everywhere orthogonalization is needed: least squares, PCA, subspace tracking.

Notes#

  • Measure orthogonality via $\lVert Q^\top Q - I\rVert$.

Connection to ML#

Reliable projections and decompositions => reliable models.

Connection to Linear Algebra Theory#

Orthogonality preservation and rounding error analysis.

Pedagogical Significance#

Demonstrates the gap between algebraic identities and floating-point realities.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 40, 8
X = np.random.randn(n, d)
X[:, 1] = X[:, 0] + 1e-6 * np.random.randn(n)  # near collinearity

# Classical Gram–Schmidt
def classical_gs(A):
    # Classical Gram–Schmidt: orthogonality degrades under ill-conditioning
    A = A.copy().astype(float)
    n, d = A.shape
    Q = np.zeros_like(A)
    for j in range(d):
        v = A[:, j]
        for i in range(j):
            v = v - Q[:, i] * (Q[:, i] @ A[:, j])
        Q[:, j] = v / (np.linalg.norm(v) + 1e-18)
    return Q

Q_gs = classical_gs(X)
Q_qr, _ = np.linalg.qr(X)

orth_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(d))
orth_qr = np.linalg.norm(Q_qr.T @ Q_qr - np.eye(d))
print("||Q^TQ - I|| (GS)", orth_gs)
print("||Q^TQ - I|| (QR)", orth_qr)

Worked Example 4: Orthogonal Procrustes — aligning embeddings via SVD#

Introduction#

Find the orthogonal matrix $R$ that best aligns $A$ to $B$ by minimizing $\lVert AR - B\rVert_F$.

Purpose#

Show closed-form solution $R=UV^\top$ from SVD of $A^\top B$ and connect to embedding alignment.

Importance#

Stable alignment across domains/languages without distorting geometry.

What this example demonstrates#

  • If $A^\top B = U\Sigma V^\top$, the optimal orthogonal $R=UV^\top$.

Background#

Procrustes problems arise in shape analysis and representation alignment.

Historical context#

Schönemann (1966) established the orthogonal solution; widely used afterward.

Prevalence in ML#

Cross-lingual word embeddings and domain adaptation pipelines.

Notes#

  • Center and scale if appropriate; enforce $\det(R)=+1$ for rotation-only alignment (optional).

Connection to ML#

Enables mapping between independently trained embedding spaces.

Connection to Linear Algebra Theory#

Orthogonal transformations preserve inner products; SVD reveals optimal rotation/reflection.

Pedagogical Significance#

Bridges an optimization problem to a single SVD call.

References#

  1. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.

  2. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 16
A = np.random.randn(n, d)
Q, _ = np.linalg.qr(np.random.randn(d, d))  # true orthogonal map
B = A @ Q + 0.01 * np.random.randn(n, d)

M = A.T @ B
U, S, Vt = np.linalg.svd(M)
R = U @ Vt

err = np.linalg.norm(A @ R - B, 'fro')
print("Alignment error:", round(err, 4))
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d), atol=1e-8))

Worked Example 5: Householder reflections — building orthogonal projectors#

Introduction#

Construct a Householder reflection to zero components and illustrate its orthogonality and symmetry; connect to QR and projection building.

Purpose#

Expose a basic orthogonal transformation used to construct $Q$ in QR.

Importance#

Underpins numerically stable orthogonalization in solvers and projections.

What this example demonstrates#

  • $H=I-2uu^\top$ is orthogonal and symmetric; $Hx$ zeros all but one component.

Background#

Householder reflections are the workhorse of QR; compose reflections to build $Q$.

Historical context#

Householder (1958) introduced the approach; remains standard.

Prevalence in ML#

Appears indirectly via libraries (NumPy/SciPy/LAPACK) that power ML pipelines.

Notes#

  • Stable and efficient vs. naive orthogonalization in finite precision.

Connection to ML#

Reliable QR leads to reliable least squares, PCA, and projection-based models.

Connection to Linear Algebra Theory#

Reflections generate orthogonal groups; preserve lengths and angles.

Pedagogical Significance#

Shows a concrete, constructive way to obtain orthogonal maps.

References#

  1. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(4)
d = 6
x = np.random.randn(d)
e1 = np.zeros(d); e1[0] = 1.0
v = x + np.copysign(np.linalg.norm(x), x[0]) * e1  # sign chosen to avoid cancellation (robust even if x[0] == 0)
u = v / (np.linalg.norm(v) + 1e-18)
H = np.eye(d) - 2 * np.outer(u, u)

Hx = H @ x
print("H orthogonal?", np.allclose(H.T @ H, np.eye(d), atol=1e-10))
print("H symmetric?", np.allclose(H, H.T, atol=1e-10))
print("Zeroed tail?", np.allclose(Hx[1:], 0.0, atol=1e-8))

Chapter 5
Inner Products & Norms
Key ideas: Introduction

Introduction#

Inner products and norms provide the geometry for data and models:

  • Similarity via inner products $\langle x, y\rangle$ and cosine $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$

  • Size and distance via norms $\lVert x\rVert$ and induced metrics $d(x,y) = \lVert x-y\rVert$

  • Orthogonality ($\langle x, y\rangle = 0$) and projections onto subspaces

  • Positive semidefinite (PSD) Gram matrices and kernels driving SVMs/GPs

  • Stability and regularization via $\ell_2$ (Ridge) and $\ell_1$ (Lasso) penalties

  • Scaled dot-product attention uses many inner products and a normalization factor $1/\sqrt{d}$

Important ideas#

  1. Inner product axioms and induced norms

    • An inner product $\langle x, y\rangle$ on $\mathbb{R}^d$ is symmetric, bilinear, and positive definite; the induced norm is $\lVert x\rVert = \sqrt{\langle x, x\rangle}$.

  2. Cauchy–Schwarz and cosine similarity

    • \[\big|\langle x, y\rangle\big| \le \lVert x\rVert\,\lVert y\rVert\]
    • Defines the angle via $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$.

  3. Triangle inequality and Minkowski/Hölder

    • For $p\in[1,\infty]$, $\lVert x+y\rVert_p \le \lVert x\rVert_p + \lVert y\rVert_p$; Hölder duality connects $p$ and $q$ with $1/p+1/q=1$.

  4. Dual norms and bounds

    • The dual norm is $\lVert z\rVert_* = \sup_{\lVert x\rVert\le 1} \langle z, x\rangle$; e.g., dual of $\ell_1$ is $\ell_\infty$, dual of $\ell_2$ is $\ell_2$.

  5. Orthogonality, orthonormal bases, and projections

    • If $U\in\mathbb{R}^{d\times k}$ has orthonormal columns, the orthogonal projector is $P = UU^\top$, minimizing reconstruction error.

  6. Gram matrices, PSD, and kernels

    • For data matrix $X\in\mathbb{R}^{n\times d}$, $G=X X^\top$ has entries $G_{ij}=\langle x_i, x_j\rangle$ and is PSD. Kernel matrices generalize this to $K_{ij}=k(x_i,x_j)$.

  7. Mahalanobis norms

    • For SPD $M\succ 0$, $\lVert x\rVert_M = \sqrt{x^\top M x}$ reweights geometry (whitening, metric learning).

  8. Norm-induced stability

    • Lipschitz constants, gradient clipping, and regularization costs all depend on norms.
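Several of these inequalities can be checked directly; a minimal sketch with random vectors (illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(6), rng.standard_normal(6)

# Cauchy–Schwarz: |<x, y>| <= ||x||_2 ||y||_2
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y) + 1e-12

# Hölder with p = 1, q = inf: |<x, y>| <= ||x||_1 ||y||_inf
assert abs(x @ y) <= np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf) + 1e-12

# Dual norm of l1 is l_inf: sup over ||x||_1 <= 1 of <z, x> equals ||z||_inf,
# attained at a signed standard basis vector
z = rng.standard_normal(6)
i = int(np.argmax(np.abs(z)))
e = np.zeros(6)
e[i] = np.sign(z[i])  # ||e||_1 = 1
assert np.isclose(z @ e, np.linalg.norm(z, np.inf))

# Cosine similarity lies in [-1, 1] (a consequence of Cauchy–Schwarz)
cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert -1.0 <= cos <= 1.0
```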

Relevance to ML#

  • Similarity search: cosine similarity is the standard for embeddings (IR, recommendation, retrieval, metric learning).

  • Regularization: $\ell_2$ (weight decay) controls scale; $\ell_1$ encourages sparsity.

  • Optimization: gradient norms determine step sizes; clipping prevents exploding gradients.

  • Kernels: SVMs, GPs rely on PSD Gram matrices of inner products.

  • Attention: scaled dot-products stabilize softmax logits as dimension grows.

  • PCA/covariance: variance equals squared $\ell_2$ norm along directions; orthogonal projections minimize $\ell_2$ error.

Algorithmic development (select milestones)#

  • 1850s–1900s: Euclidean geometry formalized; Cauchy–Schwarz inequality.

  • 1909: Mercer’s theorem (PSD kernels); foundations of kernel methods.

  • 1950: Aronszajn formalizes RKHS; inner products in function spaces.

  • 1960s–1970s: Robust norms (Huber); convex analysis; optimization bounds.

  • 1995: SVMs (Cortes–Vapnik) with kernel trick.

  • 2013–2015: Word2Vec, GloVe popularize cosine similarity in embeddings.

  • 2015–2016: BatchNorm/LayerNorm normalize activations (variance/norm control).

  • 2017: Scaled dot-product attention (Transformers) stabilizes inner-product logits.

  • 2020: Contrastive learning (SimCLR) uses normalized cosine objectives.

Definitions#

  • Inner product: $\langle x, y\rangle = x^\top y$ (standard), or weighted $\langle x, y\rangle_M = x^\top M y$ with $M\succ 0$.

  • Induced norm: $\lVert x\rVert = \sqrt{\langle x, x\rangle}$; $\ell_p$ norms: $\lVert x\rVert_1=\sum_i|x_i|$, $\lVert x\rVert_2=\sqrt{\sum_i x_i^2}$, $\lVert x\rVert_\infty=\max_i |x_i|$.

  • Cosine similarity: $\cos\theta(x,y) = \dfrac{\langle x,y\rangle}{\lVert x\rVert\,\lVert y\rVert}$.

  • Orthogonality: $\langle x, y\rangle = 0$; orthonormal set: $\langle u_i, u_j\rangle = \delta_{ij}$.

  • Gram matrix: $G_{ij}=\langle x_i, x_j\rangle$; PSD: $z^\top G z \ge 0$ $\forall z$.

  • Kernel: $k(x,y)=\langle \phi(x), \phi(y)\rangle$; $K_{ij}=k(x_i,x_j)$ is PSD.

  • Mahalanobis norm: $\lVert x\rVert_M = \sqrt{x^\top M x}$ with $M\succ 0$.
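A quick numerical check of the Gram/PSD and Mahalanobis definitions; the SPD matrix $M$ is constructed for illustration as $B^\top B + \varepsilon I$ (an assumption of convenience, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 5
X = rng.standard_normal((n, d))

# Gram matrix G = X X^T is PSD: all eigenvalues >= 0 (up to round-off)
G = X @ X.T
eigs = np.linalg.eigvalsh(G)
assert eigs.min() >= -1e-10

# Mahalanobis norm with an SPD M (M = B^T B + eps*I guarantees M > 0)
B = rng.standard_normal((d, d))
M = B.T @ B + 1e-3 * np.eye(d)
x = rng.standard_normal(d)
maha = np.sqrt(x @ M @ x)
assert maha > 0

# Whitening view: ||x||_M equals the l2 norm of L x, where M = L^T L
L = np.linalg.cholesky(M).T  # upper-triangular factor, so M = L^T L
assert np.allclose(maha, np.linalg.norm(L @ x))
```

The last check is the usual whitening trick: distances in the $M$-geometry become ordinary Euclidean distances after the linear change of variables $x \mapsto Lx$.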

Essential vs Optional: Theoretical ML

Theoretical (essential theorems and tools)#

  • Cauchy–Schwarz: $$\big|\langle x,y\rangle\big|\le \lVert x\rVert\,\lVert y\rVert,$$ equality iff $x, y$ are linearly dependent.

  • Triangle inequality and Minkowski: $$\lVert x+y\rVert_p \le \lVert x\rVert_p + \lVert y\rVert_p,$$ basis of $\ell_p$ geometries.

  • Hölder’s inequality: $$|\langle x,y\rangle| \le \lVert x\rVert_p\,\lVert y\rVert_q,$$ with $1/p+1/q=1$.

  • Pythagorean theorem (projections): For orthogonal $a\perp b$, $$\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2.$$

  • Norm equivalence (finite-dimensional): For any two norms on $\mathbb{R}^d$, there exist $c, C>0$ with $c\lVert x\rVert_a \le \lVert x\rVert_b \le C\lVert x\rVert_a$.

  • PSD characterization: $G$ is a Gram matrix iff $z^\top G z \ge 0$ for all $z$ (kernel validity test).
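The norm-equivalence bounds on $\mathbb{R}^d$ have explicit constants worth verifying, e.g. $\lVert x\rVert_\infty \le \lVert x\rVert_2 \le \lVert x\rVert_1 \le \sqrt{d}\,\lVert x\rVert_2$ and $\lVert x\rVert_2 \le \sqrt{d}\,\lVert x\rVert_\infty$; a sketch over random draws:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
for _ in range(100):
    x = rng.standard_normal(d)
    l1 = np.linalg.norm(x, 1)
    l2 = np.linalg.norm(x)
    linf = np.linalg.norm(x, np.inf)
    # Standard equivalence constants on R^d
    assert linf <= l2 + 1e-12
    assert l2 <= np.sqrt(d) * linf + 1e-12
    assert l2 <= l1 + 1e-12
    assert l1 <= np.sqrt(d) * l2 + 1e-12  # Cauchy–Schwarz with the all-ones vector
```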

Applied (landmark systems and practices)#

  • SVMs (margins via inner products): Cortes–Vapnik (1995); kernel trick.

  • Gaussian Processes (inner products in function space): Rasmussen–Williams (2006).

  • BatchNorm/LayerNorm (norm/variance control): Ioffe–Szegedy (2015); Ba et al. (2016).

  • Word2Vec/GloVe (cosine similarity): Mikolov et al. (2013); Pennington et al. (2014).

  • SimCLR/contrastive learning (normalized dot-products): Chen et al. (2020).

  • Transformers (scaled dot-product): Vaswani et al. (2017).

  • Gradient clipping (norm control in training): Pascanu et al. (2013).

Key ideas: Where it shows up
  1. PCA and covariance geometry

  • Variance along $u$: $\sigma^2(u)=\lVert X_c u\rVert_2^2/n= u^\top \Sigma u$, with $\Sigma=\tfrac{1}{n}X_c^\top X_c$.

  • Principal components are eigenvectors of $\Sigma$ maximizing inner products with data; projection error uses Pythagorean decomposition.

  • Achievements: Dimensionality reduction at scale; whitening used broadly in vision and speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. SGD/optimization: gradient norms and clipping

  • Step sizes depend on Lipschitz constants tied to operator/dual norms.

  • Gradient clipping by $\ell_2$ norm prevents exploding gradients (RNNs). References: Pascanu et al. 2013; Goodfellow et al. 2016; Nesterov 2018.

  3. Deep nets: normalization and regularization

  • Weight decay ($\ell_2$) controls model complexity; $\ell_1$ encourages sparsity.

  • BatchNorm/LayerNorm normalize mean/variance, implicitly controlling activation norms. References: Ioffe–Szegedy 2015; Ba et al. 2016.

  4. Kernels and PSD Gram matrices

  • SVMs and GPs depend on PSD kernels (Mercer). $K=XX^\top$ is PSD; RBF kernel yields smooth function priors.

  • Achievements: Kernel SVMs in text/vision (1990s–2000s); GPs in Bayesian ML. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  5. Transformers: scaled dot-product attention

  • Scores $S=QK^\top/\sqrt{d_k}$ use many inner products; the $\sqrt{d_k}$ factor stabilizes softmax variance.

  • Achievements: SOTA in NLP/vision; ubiquitous backbone. References: Vaswani et al. 2017; Devlin et al. 2019; Dosovitskiy et al. 2020.

  6. Embeddings and retrieval

  • Cosine similarity is the default for semantic retrieval and metric learning; normalization puts data on the unit sphere.

  • Achievements: Word2Vec/GloVe; SimCLR; CLIP/contrastive vision-language models. References: Mikolov et al. 2013; Pennington et al. 2014; Chen et al. 2020; Radford et al. 2021.

Notation
  • Vectors are column vectors. Data matrix: $X\in\mathbb{R}^{n\times d}$ (rows are examples; columns features). Centered data: $X_c$.

  • Inner product: $\langle x, y\rangle = x^\top y$; cosine similarity: $$\cos\theta(x,y) = \frac{\langle x, y\rangle}{\lVert x\rVert_2\,\lVert y\rVert_2}.$$

  • Norms: $\lVert x\rVert_1, \lVert x\rVert_2, \lVert x\rVert_\infty$; dual norms $\lVert\cdot\rVert_*$; Mahalanobis $\lVert x\rVert_M = \sqrt{x^\top M x}$.

  • Projection: If $U\in\mathbb{R}^{d\times k}$ is orthonormal, $P=UU^\top$; residual $r=(I-P)x$ is orthogonal to $\text{col}(U)$.

  • Gram matrix: $G=XX^\top$ (PSD); kernel matrix: $K_{ij}=k(x_i,x_j)$.

  • Examples:

    • Embedding cosine: normalize $\hat{x}=x/\lVert x\rVert_2$, $\hat{y}=y/\lVert y\rVert_2$, then $\langle \hat{x},\hat{y}\rangle=\cos\theta$.

    • Ridge penalty: $\lambda\lVert w\rVert_2^2$; Lasso: $\lambda\lVert w\rVert_1$.

    • Attention scores: $S=QK^\top/\sqrt{d_k}$; softmax row-wise on $S$.

Pitfalls & sanity checks
  • Cosine vs Euclidean: without normalization, rankings can change due to scale.

  • PSD checks: ensure Gram/kernel matrices are PSD (numerically, allow tiny negatives).

  • Norm choice: $\ell_2$ is rotation-invariant; $\ell_1$ is robust/sparse but not smooth.

  • Attention scaling: omit $1/\sqrt{d_k}$ and softmax saturates for large $d_k$.

  • Centering for covariance: use $X_c$ for PCA; otherwise directions mix mean effects.

  • Gradient norms: clip by global norm to avoid exploding updates.
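The attention-scaling pitfall can be observed directly; a minimal sketch (the `softmax` helper and the random $Q, K$ are illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    # Numerically stable row-wise softmax
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k, n = 512, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

raw = Q @ K.T                 # logit std grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)   # entries have variance ~ 1

p_raw = softmax(raw)
p_scaled = softmax(scaled)

# Without scaling the logits are far more spread out, so softmax saturates:
# each row's probability mass collapses onto one key
assert raw.std() > scaled.std()
assert p_raw.max(axis=1).mean() > p_scaled.max(axis=1).mean()
```

This is the variance argument behind the $1/\sqrt{d_k}$ factor in scaled dot-product attention.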

References

Foundations and geometry

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.).

  3. Horn, R. & Johnson, C. (2012). Matrix Analysis.

  4. Boyd, S. & Vandenberghe, L. (2004). Convex Optimization.

Kernels and PSD

  5. Mercer, J. (1909). Functions of positive and negative type.

  6. Aronszajn, N. (1950). Theory of Reproducing Kernels.

  7. Schölkopf, B. & Smola, A. (2002). Learning with Kernels.

  8. Rasmussen, C. & Williams, C. (2006). Gaussian Processes for ML.

Regularization and optimization

  9. Hoerl, A. & Kennard, R. (1970). Ridge Regression.

  10. Tibshirani, R. (1996). Lasso.

  11. Nesterov, Y. (2018). Lectures on Convex Optimization.

  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  13. Pascanu, R. et al. (2013). On the difficulty of training RNNs (gradient clipping).

Embeddings, normalization, attention

  14. Mikolov, T. et al. (2013). Word2Vec.

  15. Pennington, J. et al. (2014). GloVe.

  16. Ioffe, S. & Szegedy, C. (2015). Batch Normalization.

  17. Ba, J. L. et al. (2016). Layer Normalization.

  18. Chen, T. et al. (2020). SimCLR.

  19. Radford, A. et al. (2021). CLIP.

  20. Vaswani, A. et al. (2017). Attention Is All You Need.

PCA and projections

  21. Jolliffe, I. (2002). Principal Component Analysis.

  22. Shlens, J. (2014). A Tutorial on PCA.

  23. Eckart, C. & Young, G. (1936). Low-rank approximation.

  24. Golub, G. & Van Loan, C. (2013). Matrix Computations.

Five worked examples

Worked Example 1: Cosine similarity vs Euclidean distance for embedding retrieval#

Introduction#

Cosine similarity is ubiquitous for nearest-neighbor search in embedding spaces (text, images, audio). We show that for $\ell_2$-normalized vectors, maximizing cosine similarity is equivalent to minimizing Euclidean distance.

Purpose#

Relate inner products to distances under normalization; provide a fast retrieval recipe.

Importance#

Industrial search, recommendation, and retrieval pipelines rely on cosine similarity with normalized embeddings for stability and interpretability.

What this example demonstrates#

  • Equivalence: For unit vectors, $$\lVert x-y\rVert_2^2 = 2(1-\langle x,y\rangle).$$

  • Ranking by cosine equals ranking by negative Euclidean distance after normalization.

Background#

Vector space models in IR (Salton) and modern embeddings (Word2Vec, GloVe, CLIP) use cosine similarity due to scale invariance.

Historical context#

From tf–idf cosine in IR to neural embeddings; normalization combats varying document lengths and feature scales.

Prevalence in ML#

Text retrieval, semantic search, metric learning, contrastive pretraining; approximate nearest neighbor (ANN) indices often assume normalized data.

Notes#

  • Always normalize embeddings: $\hat{x}=x/\lVert x\rVert_2$.

  • For batched comparisons: use matrix products $S=\hat{X}\hat{Y}^\top$ to get all cosines.

Connection to ML#

Similarity search, contrastive objectives, and re-ranking all hinge on stable cosine scores.

Connection to Linear Algebra Theory#

Inner products induce norms/angles; normalization maps data to the unit sphere $\mathbb{S}^{d-1}$.

Pedagogical Significance#

Shows direct algebraic link between inner product and Euclidean geometry under normalization.

References#

  1. Salton, G. et al. (1975). A vector space model for information retrieval.

  2. Mikolov, T. et al. (2013). Efficient Estimation of Word Representations.

  3. Pennington, J. et al. (2014). GloVe.

  4. Radford, A. et al. (2021). CLIP.

Solution (Python)#

import numpy as np

np.random.seed(0)
d, n_query, n_db = 128, 4, 6
X = np.random.randn(n_query, d)
Y = np.random.randn(n_db, d)

def normalize(A):
    # Row-wise l2 normalization with a small epsilon to avoid division by zero
    nrm = np.linalg.norm(A, axis=1, keepdims=True) + 1e-12
    return A / nrm

Xn, Yn = normalize(X), normalize(Y)
cos = Xn @ Yn.T                  # cosine similarities
eucl2 = ((Xn[:, None, :] - Yn[None, :, :])**2).sum(-1)  # squared distances

print("Cosine matrix:\n", np.round(cos, 3))
print("Squared distances (normalized):\n", np.round(eucl2, 3))
print("Relationship check (row 0):", np.allclose(eucl2[0], 2 * (1 - cos[0]), atol=1e-6))

Worked Example 2: $\ell_2$ vs $\ell_1$ regularization under orthonormal design#

Introduction#

Compare Ridge ($\ell_2$) and Lasso ($\ell_1$) when $X^\top X = I$. Ridge has a closed form; Lasso reduces to soft-thresholding.

Purpose#

Show how norms shape solutions: $\ell_2$ shrinks weights smoothly; $\ell_1$ induces sparsity.

Importance#

Regularization choice affects interpretability, robustness, and generalization.

What this example demonstrates#

  • For $X^\top X=I$, OLS is $w_{\text{ls}}=X^\top y$.

  • Ridge: $$w_{\text{ridge}} = \frac{1}{1+\lambda} w_{\text{ls}}.$$

  • Lasso: $$w_{\text{lasso}, i} = \operatorname{sign}(w_{\text{ls}, i})\,\max\{|w_{\text{ls}, i}|-\lambda, 0\}.$$

Background#

Ridge stabilizes ill-conditioned problems; Lasso selects features.

Historical context#

Ridge (Tikhonov, 1963; Hoerl–Kennard, 1970) and Lasso (Tibshirani, 1996) are canonical.

Prevalence in ML#

Widely used in linear models, compressed sensing, and high-dimensional statistics.

Notes#

  • The soft-threshold formula holds exactly under orthonormal design; otherwise use coordinate descent.
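
As a sketch of the general case (the function and variable names below are illustrative, not from a library), cyclic coordinate descent applies the same soft-threshold coordinate by coordinate; under an orthonormal design it reproduces the closed form after a single sweep.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=50):
    """Cyclic coordinate descent for (1/2)||y - Xw||^2 + lam*||w||_1 (illustrative sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X**2).sum(axis=0)                      # ||x_j||^2 per column
    for _ in range(n_sweeps):
        for j in range(d):
            # correlation of feature j with the residual, with j's own contribution added back
            r_j = X[:, j] @ (y - X @ w) + col_sq[j] * w[j]
            w[j] = np.sign(r_j) * max(abs(r_j) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
Xo, _ = np.linalg.qr(rng.standard_normal((30, 5)))   # orthonormal design
y = Xo @ np.array([2.0, -1.5, 0.0, 0.0, 0.5]) + 0.05 * rng.standard_normal(30)
lam = 0.3
w_cd = lasso_cd(Xo, y, lam)
w_st = np.sign(Xo.T @ y) * np.maximum(np.abs(Xo.T @ y) - lam, 0.0)  # closed form
print(np.allclose(w_cd, w_st, atol=1e-8))
```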

Connection to ML#

Norm penalties as priors/constraints: weight decay, sparsity, and model selection.

Connection to Linear Algebra Theory#

Dual norms and subgradients for $\ell_1$; spectral properties for $\ell_2$.

Pedagogical Significance#

Highlights geometric differences: $\ell_2$ balls are round; $\ell_1$ balls have corners that promote zeros.

References#

  1. Hoerl, A. & Kennard, R. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems.

  2. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.

  3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 64, 8
U, _ = np.linalg.qr(np.random.randn(n, d))  # n x d with orthonormal columns
X = U
w_true = np.zeros(d); w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + 0.1 * np.random.randn(n)

w_ls = X.T @ y
lam = 0.5
w_ridge = w_ls / (1.0 + lam)

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
w_lasso = soft_threshold(w_ls, lam)

print("||w_ls||2=", np.linalg.norm(w_ls))
print("Ridge (lam=0.5):", np.round(w_ridge, 3))
print("Lasso (lam=0.5):", np.round(w_lasso, 3))

Worked Example 3: Gram matrices are PSD; kernels in practice#

Introduction#

Show that $G=XX^\top$ is PSD and illustrate a common kernel (RBF). Verify PSD numerically.

Purpose#

Connect inner products to PSD matrices and kernel validity.

Importance#

Kernel methods hinge on PSD property; invalid kernels can break optimization.

What this example demonstrates#

  • For any $z$, $$z^\top (XX^\top) z = \lVert X^\top z\rVert_2^2 \ge 0.$$

  • RBF kernel is PSD; eigenvalues are nonnegative up to numerical tolerance.

Background#

Mercer’s theorem characterizes kernels as inner products in (possibly infinite-dimensional) feature spaces.

Historical context#

Kernel trick popularized SVMs and GPs; modern random features approximate kernels at scale.
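
The random-features idea can be sketched in a few lines (dimensions and seed are arbitrary): random Fourier features $z(x)=\sqrt{2/D}\,\cos(Wx+b)$ with $W_{ij}\sim\mathcal{N}(0,1/\sigma^2)$ and $b_j\sim\mathrm{Unif}[0,2\pi]$ satisfy $z(x)^\top z(y)\approx e^{-\lVert x-y\rVert^2/(2\sigma^2)}$ in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 50, 5, 20000, 1.0
X = rng.standard_normal((n, d))

W = rng.standard_normal((d, D)) / sigma        # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)           # random phases
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)       # random Fourier feature map

sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K_exact = np.exp(-sq / (2.0 * sigma**2))       # exact RBF Gram matrix
max_err = np.abs(K_exact - Z @ Z.T).max()      # worst-case entrywise error
print(max_err)
```

The approximation error shrinks at the Monte Carlo rate $O(1/\sqrt{D})$, which is why large $D$ is cheap but effective.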

Prevalence in ML#

Text, bioinformatics, small/medium tabular data, Bayesian regression.

Notes#

  • Numerical PSD check via eigenvalues or Cholesky with jitter.
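
One way to implement the Cholesky-with-jitter check is sketched below (the helper name is illustrative): attempt a factorization after adding a small multiple of the identity, and treat failure as evidence of a genuinely negative eigenvalue.

```python
import numpy as np

def is_psd_chol(K, jitter=1e-8):
    """Heuristic PSD test: Cholesky succeeds once a small diagonal jitter is added."""
    try:
        np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
        return True
    except np.linalg.LinAlgError:
        return False

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
G = X @ X.T                        # rank-3 Gram matrix: PSD but singular
B = G.copy()
B[0, 0] -= 1.0                     # perturbation that introduces a negative eigenvalue
print(is_psd_chol(G), is_psd_chol(B))
```

The jitter is needed because a singular PSD matrix (like this rank-deficient Gram matrix) fails a plain Cholesky even though it is valid.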

Connection to ML#

SVM margin maximization and GP covariance both rely on PSD structure.

Connection to Linear Algebra Theory#

Gram operators encode geometry via inner products.

Pedagogical Significance#

Concrete link between data matrix products and PSD.

References#

  1. Mercer, J. (1909). Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.

  2. Schölkopf, B. & Smola, A. (2002). Learning with Kernels.

  3. Rasmussen, C. & Williams, C. (2006). Gaussian Processes for Machine Learning.

  4. Rahimi, A. & Recht, B. (2007). Random Features for Large-Scale Kernel Machines.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 10, 5
X = np.random.randn(n, d)
G = X @ X.T

evals = np.linalg.eigvalsh(G)
print("Gram PSD? min eigenvalue:", np.min(evals))

def rbf_kernel(A, B, sigma=1.0):
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = np.maximum(A2 + B2 - 2 * A @ B.T, 0.0)  # clip tiny negatives from round-off
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.0)
kevals = np.linalg.eigvalsh(K)
print("RBF kernel PSD? min eigenvalue:", np.min(kevals))

Worked Example 4: Why attention uses $1/\sqrt{d_k}$ scaling#

Introduction#

For query and key vectors with i.i.d. unit-variance entries, the dot product $q^\top k$ has variance $d_k$; scaling by $1/\sqrt{d_k}$ keeps the logits at unit scale and stabilizes the softmax.

Purpose#

Quantify inner-product growth and show stabilization by scaling.

Importance#

Essential to prevent saturation and numerical instability in attention.

What this example demonstrates#

  • If $q,k\sim \mathcal{N}(0, I)$ in $\mathbb{R}^{d_k}$, then $\operatorname{Var}(q^\top k) = d_k$.

  • Scaling by $1/\sqrt{d_k}$ makes variance approximately 1 across dimensions.

Background#

Softmax is sensitive to logit scale; large variance yields peaky distributions and vanishing gradients.
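
A small sketch makes the saturation visible (sizes and seed are arbitrary): with $d_k=512$, unscaled logits have standard deviation $\approx\sqrt{512}\approx 22.6$, so the softmax collapses toward one-hot and its entropy toward zero, while scaled logits keep the entropy near the uniform maximum $\log 32\approx 3.47$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, L = 512, 32                  # key dimension and number of attention positions
q = rng.standard_normal(d_k)
K = rng.standard_normal((L, d_k))

def softmax(z):
    e = np.exp(z - z.max())       # shift for numerical stability
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

h_raw = entropy(softmax(K @ q))                      # logits with std ~ sqrt(d_k)
h_scaled = entropy(softmax(K @ q / np.sqrt(d_k)))    # logits with std ~ 1
print(h_raw, h_scaled)
```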

Historical context#

Transformers introduced the scaling to stabilize training across widths.

Prevalence in ML#

Every modern Transformer variant uses this factor (self- and cross-attention).

Notes#

  • Normalization and temperature are closely related; tuning temperature affects entropy.

Connection to ML#

Stable attention distributions, better gradient flow, easier optimization.

Connection to Linear Algebra Theory#

Variance of inner products aggregates component variances; normalization rescales geometry.

Pedagogical Significance#

Shows a direct norm/variance argument behind a ubiquitous architectural choice.

References#

  1. Vaswani, A. et al. (2017). Attention Is All You Need.

  2. Goodfellow, I. et al. (2016). Deep Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
for d in [16, 64, 256, 1024]:
    trials = 2000
    q = np.random.randn(trials, d)
    k = np.random.randn(trials, d)
    dots = np.sum(q * k, axis=1)
    scaled = dots / np.sqrt(d)
    print(f"d={d:4d} var(dot)={np.var(dots):.1f}  var(scaled)={np.var(scaled):.2f}")

Worked Example 5: Orthogonal projection minimizes squared error (Pythagorean decomposition)#

Introduction#

Projecting onto an orthonormal subspace minimizes $\ell_2$ reconstruction error and decomposes energy orthogonally.

Purpose#

Connect projections, norms, and PCA-style reconstructions.

Importance#

Underlies least squares, PCA truncation, and many dimensionality-reduction pipelines.

What this example demonstrates#

  • For orthonormal $U\in\mathbb{R}^{d\times k}$, $$\hat{x}=UU^\top x = \arg\min_{z\in\text{col}(U)} \lVert x-z\rVert_2.$$

  • Pythagorean identity: $$\lVert x\rVert_2^2 = \lVert UU^\top x\rVert_2^2 + \lVert (I-UU^\top)x\rVert_2^2.$$

Background#

Least squares is projection onto column space; PCA chooses $U$ to maximize captured variance.
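
The projection view of least squares can be checked in a few lines (sizes and seed are arbitrary): at the optimum the residual satisfies the normal equations $A^\top r = 0$, i.e. it is orthogonal to every column of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 4))
y = rng.standard_normal(40)
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
r = y - A @ w                               # residual
ortho = np.abs(A.T @ r).max()               # should be ~0: r is in col(A)^perp
print(ortho)
```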

Historical context#

Orthogonal expansions from Fourier to PCA; SVD gives best rank-$k$ approximation.

Prevalence in ML#

Everywhere: regression, PCA, subspace tracking, recommendation.

Notes#

  • The formula $P=UU^\top$ requires orthonormal columns; for a general basis $A$, use $P=A(A^\top A)^{-1}A^\top$ or orthonormalize first via QR/SVD.
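
A sketch under these assumptions (random non-orthonormal basis, arbitrary seed): the normal-equations projector $A(A^\top A)^{-1}A^\top$ and the QR route give the same orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 3))              # basis with non-orthonormal columns
x = rng.standard_normal(12)

P = A @ np.linalg.solve(A.T @ A, A.T)         # projector onto col(A) via normal equations
Q, _ = np.linalg.qr(A)                        # orthonormalize first, then use Q Q^T
agree = np.allclose(P @ x, Q @ (Q.T @ x))
print(agree)
```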

Connection to ML#

Data compression and denoising via low-dimensional projections.

Connection to Linear Algebra Theory#

Orthogonal projectors are idempotent and symmetric; decomposition follows from orthogonality of components.
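
These two defining properties are easy to verify numerically (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((10, 2)))  # orthonormal basis of a 2D subspace
P = U @ U.T
idem = np.allclose(P @ P, P)   # idempotent: P^2 = P
sym = np.allclose(P, P.T)      # symmetric: P = P^T
print(idem, sym)
```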

Pedagogical Significance#

Reinforces geometric intuition of least squares and PCA.

References#

  1. Golub, G. & Van Loan, C. (2013). Matrix Computations.

  2. Jolliffe, I. (2002). Principal Component Analysis.

  3. Eckart, C. & Young, G. (1936). The Approximation of One Matrix by Another of Lower Rank.

Solution (Python)#

import numpy as np

np.random.seed(4)
d, k = 20, 3
x = np.random.randn(d)
U, _ = np.linalg.qr(np.random.randn(d, k))  # orthonormal basis
P = U @ U.T
x_hat = P @ x
r = x - x_hat

lhs = np.linalg.norm(x)**2
rhs = np.linalg.norm(x_hat)**2 + np.linalg.norm(r)**2
z = U @ np.random.randn(k)  # an arbitrary competing point in col(U)
print("Projection error minimal?", np.linalg.norm(r) <= np.linalg.norm(x - z) + 1e-12)
print("Pythagorean holds (numeric):", np.allclose(lhs, rhs, atol=1e-10))
