
Chapter 6

Orthogonality & Projections

Key ideas: Introduction

Introduction#

Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:

Residuals in least squares are orthogonal to the column space (so no other point in the subspace reduces the error further)
Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace
Orthonormal bases simplify computations and improve numerical stability
Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers
PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation

Important ideas#

Orthogonality and complements
- $x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.
Orthogonal projectors
- An orthogonal projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For $U\in\mathbb{R}^{d\times k}$ with orthonormal columns spanning $\mathcal{S}$: $P=UU^\top$.
Projection theorem
- For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $P_{\mathcal{S}}x\in\mathcal{S}$ and $r\in\mathcal{S}^\perp$; $P_{\mathcal{S}}x$ is the unique minimizer of $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.
Pythagorean identity
- If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.
Orthonormal bases and QR
- Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.
Spectral/SVD structure
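- For a real symmetric matrix (e.g., a covariance matrix), the eigenvectors can be chosen orthonormal; the SVD gives $X=U\Sigma V^\top$ with orthonormal columns in $U$ and $V$ (here $\Sigma$ holds the singular values). Truncation yields the best rank-$k$ approximation (Eckart–Young).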
Orthogonal transformations
- $Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; $\det Q=\pm1$ (rotations or reflections). Multiplying by $Q$ leaves 2-norm condition numbers unchanged.
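The projector properties, the projection theorem, and the Pythagorean identity listed above can be checked numerically. The following is a minimal sketch on a random subspace; the dimensions, the seed, and the names `U`, `P`, `p`, `r`, `s` are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2

# Orthonormal basis U for a random k-dimensional subspace of R^d (reduced QR).
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = U @ U.T  # orthogonal projector onto col(U)

x = rng.standard_normal(d)
p = P @ x    # component of x inside the subspace
r = x - p    # residual, expected to lie in the orthogonal complement

print("idempotent:", np.allclose(P @ P, P))
print("symmetric: ", np.allclose(P, P.T))
print("U^T r ~ 0: ", np.allclose(U.T @ r, 0.0))
print("Pythagoras:", np.isclose(x @ x, p @ p + r @ r))

# Any other point s in the subspace is at least as far from x as p is.
s = U @ rng.standard_normal(k)
print("p is closest:", np.linalg.norm(x - p) <= np.linalg.norm(x - s))
```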

Relevance to ML#

Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.
PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.
Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.
Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.
Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.
Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).
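As one illustration of the projected-methods item, here is a sketch of projected gradient descent for least squares constrained to the unit $\ell_2$ ball. The helper `project_l2_ball`, the synthetic data, the step size, and the iteration count are assumptions made for this example only.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection of w onto {z : ||z||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + 0.1 * rng.standard_normal(50)

w = np.zeros(4)
# Step 1/L, where L = ||X||_2^2 is the Lipschitz constant of the gradient
# of the objective 0.5 * ||X w - y||^2.
step = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(500):
    grad = X.T @ (X @ w - y)
    w = project_l2_ball(w - step * grad, radius=1.0)  # gradient step, then project

print("||w||_2 =", np.linalg.norm(w))
```

Because the unconstrained solution here has norm well above 1, the constrained iterate ends up on the boundary of the ball.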

Algorithmic development (milestones)#

1880s–1930s: Gram–Schmidt orthonormalization (Gram 1883; Schmidt 1907); least squares geometry formalized.
1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).
1958–1965: Householder reflections and Golub’s QR-based least squares methods stabilize orthogonalization.
1966: Orthogonal Procrustes (Schönemann) closed-form solution.
1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.
2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.

Definitions#

Orthogonal/Orthonormal: columns of $U$ are orthogonal if they are pairwise perpendicular ($U^\top U$ is diagonal); they are orthonormal if they also have unit length ($U^\top U=I$).
Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.
Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.
Orthogonal matrix: $Q^\top Q=I$; preserves norms and inner products.
PCA subspace: top-$k$ eigenvectors of covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.
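The orthogonal-matrix definition can be exercised directly. This sketch draws a random orthogonal $Q$ from a QR factorization (an illustrative construction, not one prescribed by the text) and checks that norms, inner products, the determinant magnitude, and the 2-norm condition number behave as stated.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a random orthogonal matrix
A = rng.standard_normal((d, d))
x, y = rng.standard_normal(d), rng.standard_normal(d)

print("Q^T Q = I:       ", np.allclose(Q.T @ Q, np.eye(d)))
print("norm preserved:  ", np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))
print("inner product:   ", np.isclose((Q @ x) @ (Q @ y), x @ y))
print("|det Q| = 1:     ", np.isclose(abs(np.linalg.det(Q)), 1.0))
print("cond_2 unchanged:", np.isclose(np.linalg.cond(Q @ A), np.linalg.cond(A)))
```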

Essential vs Optional: Theoretical ML

Theoretical (essential theorems)#

Projection theorem: For closed subspace $\mathcal{S}$, projection $P_\mathcal{S}x$ uniquely minimizes $\lVert x-s\rVert_2$; residual is orthogonal to $\mathcal{S}$.
Pythagorean/Bessel/Parseval: Orthogonal decompositions preserve squared norms; partial sums bounded (Bessel); complete bases preserve energy (Parseval).
Fundamental theorem of linear algebra: for $A\in\mathbb{R}^{n\times d}$, $\text{col}(A)$ is orthogonal to $\text{null}(A^\top)$; $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$.
Spectral theorem: Real symmetric matrices have orthonormal eigenbases; $Q^\top A Q=\Lambda$ is diagonal for an orthogonal $Q$ whose columns are eigenvectors.
Eckart–Young–Mirsky: Best rank-$k$ approximation in Frobenius/2-norm via truncated SVD.
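The fundamental-theorem decomposition $\mathbb{R}^n=\text{col}(A)\oplus\text{null}(A^\top)$ can be illustrated with a full SVD. In this sketch the matrix size, the rank tolerance, and the names `U_col` and `U_null` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 7, 3
A = rng.standard_normal((n, d))

# Full SVD: the first `rank` columns of U span col(A); the rest span null(A^T).
U, s, Vt = np.linalg.svd(A, full_matrices=True)
rank = int(np.sum(s > 1e-12 * s[0]))
U_col, U_null = U[:, :rank], U[:, rank:]

print("A^T U_null ~ 0:      ", np.allclose(A.T @ U_null, 0.0))
print("subspaces orthogonal:", np.allclose(U_col.T @ U_null, 0.0))
print("dimensions add to n: ", U_col.shape[1] + U_null.shape[1] == n)

# Any y in R^n splits into a col(A) part plus a null(A^T) part.
y = rng.standard_normal(n)
y_col = U_col @ (U_col.T @ y)
y_null = U_null @ (U_null.T @ y)
print("y = y_col + y_null:  ", np.allclose(y, y_col + y_null))
```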

Applied (landmark systems and practices)#

PCA/whitening: Jolliffe (2002); Shlens (2014) — denoising and compression.
Least squares/QR solvers: Golub–Van Loan (2013) — stable projections.
Orthogonal Procrustes in embedding alignment: Schönemann (1966); Smith et al. (2017).
Orthogonal initialization/constraints: Saxe et al. (2013); Mishkin & Matas (2015).
Subspace tracking and signal processing: Halko et al. (2011) randomized SVD.

Key ideas: Where it shows up

PCA and subspace denoising

PCA finds orthonormal directions $V_k$ (the top-$k$ right singular vectors of the centered data) maximizing variance; the projection $X_k = X V_k V_k^\top$ minimizes reconstruction error among rank-$k$ approximations.
Achievements: Dimensionality reduction at scale; whitening and denoising in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

Least squares as projection

$\hat{y} = X w^*$ is the projection of $y$ onto $\text{col}(X)$; residual $r=y-\hat{y}$ satisfies $X^\top r=0$.
Achievements: Foundational to regression and linear models; efficient via QR/SVD. References: Gauss 1809; Golub–Van Loan 2013.

Orthogonalization algorithms (QR)

Householder/Modified Gram–Schmidt produce orthonormal bases with numerical stability; essential in solvers and factorizations.
Achievements: Robust, high-performance linear algebra libraries (LAPACK). References: Householder 1958; Golub 1965; Trefethen–Bau 1997.

Orthogonal Procrustes and embedding alignment

Best orthogonal alignment between representation spaces via SVD of $A^\top B$ (solution $R=UV^\top$).
Achievements: Cross-lingual word embedding alignment; domain adaptation. References: Schönemann 1966; Smith et al. 2017.

Orthogonal constraints/initialization in deep nets

Orthogonal weight matrices preserve variance across layers; improve training stability and gradient flow.
Achievements: Deep linear dynamics analysis; practical initializations. References: Saxe et al. 2013; Mishkin & Matas 2015.
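A small experiment makes the norm-preservation claim concrete. This is a sketch under simplifying assumptions: purely linear layers (no nonlinearities), square weight matrices, and a variance-scaled Gaussian baseline chosen here for comparison; none of these settings come from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(3)
d, depth = 64, 50
x = rng.standard_normal(d)

def random_orthogonal(d, rng):
    """Orthogonal matrix taken from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

h_orth, h_gauss = x.copy(), x.copy()
for _ in range(depth):
    h_orth = random_orthogonal(d, rng) @ h_orth
    # Gaussian weights with entries of variance 1/d (baseline for comparison).
    h_gauss = (rng.standard_normal((d, d)) / np.sqrt(d)) @ h_gauss

print("input norm:              ", np.linalg.norm(x))
print("orthogonal layers, output:", np.linalg.norm(h_orth))
print("Gaussian layers, output:  ", np.linalg.norm(h_gauss))
```

With orthogonal layers the output norm matches the input norm up to rounding, while the Gaussian chain typically drifts as depth grows.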

Notation

Data matrix and spaces: $X\in\mathbb{R}^{n\times d}$, $\text{col}(X)\subseteq\mathbb{R}^n$, $\text{null}(X^\top)$.
Orthonormal basis: $U\in\mathbb{R}^{n\times k}$ with $U^\top U=I$.
Orthogonal projector: $P=UU^\top$ (symmetric, idempotent); residual $r=(I-P)y$ satisfies $U^\top r=0$.
QR factorization: $X=QR$ with $Q^\top Q=I$; the columns of $Q$ span $\text{col}(X)$ when $X$ has full column rank.
SVD/PCA: $X=U\Sigma V^\top$; top-$k$ projection $P_k=U_k U_k^\top$ (or $X V_k V_k^\top$ on features).
Examples:
- Least squares via projection: $\hat{y} = P y$ with $P=Q Q^\top$ for $Q$ from QR of $X$.
- PCA reconstruction: $\hat{X} = X V_k V_k^\top$; error $\lVert X-\hat{X}\rVert_F^2 = \sum_{i>k}\sigma_i^2$.
- Procrustes alignment: $R=UV^\top$ from SVD of $A^\top B$; $R$ is orthogonal.

Pitfalls & sanity checks

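Centering for PCA: use the column-centered matrix $X_c = X - \mathbf{1}\bar{x}^\top$ so that principal directions capture variance rather than the mean.
Orthogonality of bases: $U$ must have orthonormal columns for $P=UU^\top$ to be an orthogonal projector; otherwise $UU^\top$ is generally not idempotent and so is not a projector at all. For a general basis $A$ with full column rank, use $P=A(A^\top A)^{-1}A^\top$.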
Numerical orthogonality: prefer QR/SVD; classical Gram–Schmidt can lose orthogonality under ill-conditioning.
Certificates: verify $P$ is symmetric/idempotent and that residuals are orthogonal to $\text{col}(X)$.
Overfitting with high-$k$ PCA: track retained variance and use validation.
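The pitfall about non-orthonormal bases is easy to reproduce. The sketch below compares $AA^\top$ for a random (non-orthonormal) basis with the general formula $A(A^\top A)^{-1}A^\top$ and with $QQ^\top$ from QR; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
A = rng.standard_normal((n, k))  # a basis for a subspace, but not orthonormal

P_naive = A @ A.T                              # generally NOT a projector
P_general = A @ np.linalg.solve(A.T @ A, A.T)  # A (A^T A)^{-1} A^T
Q, _ = np.linalg.qr(A)
P_qr = Q @ Q.T                                 # same projector via an orthonormal basis

print("A A^T idempotent?       ", np.allclose(P_naive @ P_naive, P_naive))
print("general form idempotent?", np.allclose(P_general @ P_general, P_general))
print("general form symmetric? ", np.allclose(P_general, P_general.T))
print("general form == Q Q^T?  ", np.allclose(P_general, P_qr))
```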

References

Foundations and numerical linear algebra

Strang, G. (2016). Introduction to Linear Algebra (5th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Projections, orthogonality, and approximation

Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.
Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.
Gram, J. (1883); Schmidt, E. (1907). Orthonormalization methods.

PCA and applications

Jolliffe, I. (2002). Principal Component Analysis.
Shlens, J. (2014). A Tutorial on Principal Component Analysis.

Embedding alignment and orthogonal methods in ML

Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.
Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.
Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.
Mishkin, D., & Matas, J. (2015). All you need is a good init.

General ML texts

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

Five worked examples

Worked Example 1: Least squares as orthogonal projection (QR certificate)#

Introduction#

Show that least squares fits correspond to orthogonal projection of $y$ onto $\text{col}(X)$, with residual orthogonal to features.

Purpose#

Derive $\hat{y}=P y$ with $P=Q Q^\top$ and verify $X^\top r=0$ numerically.

Importance#

Anchors regression in subspace geometry; provides robust implementation guidance via QR.

What this example demonstrates#

$X=QR$ with $Q^\top Q=I$ yields $\hat{y}=QQ^\top y$.
Residual $r=y-\hat{y}$ satisfies $Q^\top r=0$ and $X^\top r=0$.

Background#

Least squares minimizes squared error; projection theorem assures unique closest point in $\text{col}(X)$.

Historical context#

Gauss/Legendre least squares; Householder/Golub QR for numerical stability.

Prevalence in ML#

Linear models, GLM approximations, and as inner loops in larger systems.

Notes#

Prefer QR/SVD over the normal equations: forming $X^\top X$ squares the condition number of $X$.
Check $P$ is symmetric and idempotent in code.
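The advice about the normal equations can be checked by comparing condition numbers. This sketch builds a design matrix with two nearly collinear columns (an assumption made to provoke ill-conditioning) and compares $\kappa_2(X)$ with $\kappa_2(X^\top X)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 4
X = rng.standard_normal((n, d))
X[:, 3] = X[:, 2] + 1e-5 * rng.standard_normal(n)  # two nearly collinear columns

kappa_X = np.linalg.cond(X)           # 2-norm condition number of X
kappa_gram = np.linalg.cond(X.T @ X)  # condition number of the Gram matrix

print(f"cond(X)     = {kappa_X:.3e}")
print(f"cond(X^T X) = {kappa_gram:.3e}")
print(f"cond(X)^2   = {kappa_X**2:.3e}")
```

A QR-based solve works with $X$ directly, so it is exposed to $\kappa_2(X)$ rather than its square.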

Connection to ML#

Core of regression pipelines; basis for Ridge/Lasso solvers (with modifications).

Connection to Linear Algebra Theory#

Projection theorem; FTLA decomposition $\mathbb{R}^n=\text{col}(X)\oplus\text{null}(X^\top)$.

Pedagogical Significance#

Gives a geometric certificate of optimality via orthogonality.

References#

Gauss (1809); Legendre (1805) — least squares.
Golub & Van Loan (2013) — QR solvers.
Trefethen & Bau (1997) — numerical linear algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 5
X = np.random.randn(n, d)
w_true = np.array([1.2, -0.8, 0.5, 0.0, 2.0])
y = X @ w_true + 0.1 * np.random.randn(n)

Q, R = np.linalg.qr(X)
P = Q @ Q.T
y_hat = P @ y
r = y - y_hat

# Certificates
print("Symmetric P?", np.allclose(P, P.T, atol=1e-10))
print("Idempotent P?", np.allclose(P @ P, P, atol=1e-10))
print("Q^T r ~ 0?", np.linalg.norm(Q.T @ r))
print("X^T r ~ 0?", np.linalg.norm(X.T @ r))

# Compare to lstsq fit
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Projection match?", np.allclose(y_hat, X @ w_ls, atol=1e-8))

Worked Example 2: PCA projection and best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate orthogonal projection onto top-$k$ principal components and verify reconstruction error equals the sum of squared tail singular values.

Purpose#

Connect PCA’s orthogonal subspace to optimal low-rank approximation.

Importance#

Backbone of dimensionality reduction and denoising in ML.

What this example demonstrates#

$X=U\Sigma V^\top$; projection to rank-$k$ is $X_k = U_k \Sigma_k V_k^\top = X V_k V_k^\top$.
Error: $\lVert X-X_k\rVert_F^2 = \sum_{i>k} \sigma_i^2$.

Background#

Eckart–Young shows truncated SVD minimizes Frobenius/2-norm error among rank-$k$ matrices.

Historical context#

Low-rank approximation dates to the 1930s; widespread modern use in ML systems.

Prevalence in ML#

Feature compression, noise removal, approximate nearest neighbors, latent semantic analysis.

Notes#

Center data for covariance-based PCA; use SVD directly on $X_c$.
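The two routes to PCA mentioned in this note, eigendecomposition of the covariance and SVD of the centered data, agree up to sign. The sketch below checks this on synthetic data; the column scales, the added offset, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 5
# Columns with distinct scales, shifted to have a nonzero mean.
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1]) + 10.0
Xc = X - X.mean(axis=0, keepdims=True)

# Route 1: eigendecomposition of the sample covariance.
cov = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

# Route 2: SVD of the centered data matrix.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print("eigenvalues == S^2/(n-1):   ", np.allclose(evals, S**2 / (n - 1)))
# Eigenvectors and right singular vectors agree up to a sign per column.
overlap = np.abs(evecs.T @ Vt.T)
print("directions match up to sign:", np.allclose(overlap, np.eye(d), atol=1e-6))
```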

Connection to ML#

Trade off between compression (smaller $k$) and fidelity (retained variance).

Connection to Linear Algebra Theory#

Orthogonal projectors $U_k U_k^\top$; spectral ordering of singular values.

Pedagogical Significance#

Illustrates how orthogonality yields optimality guarantees.

References#

Eckart & Young (1936) — best rank-$k$.
Jolliffe (2002) — PCA.
Shlens (2014) — PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 80, 30, 5
X = np.random.randn(n, d) @ np.diag(np.linspace(5, 0.1, d))  # create decaying spectrum
Xc = X - X.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T
Xk = Xc @ Vk @ Vk.T

err = np.linalg.norm(Xc - Xk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-6))

Worked Example 3: Gram–Schmidt vs Householder QR (orthogonality under stress)#

Introduction#

Compare classical Gram–Schmidt to numerically stable QR on nearly collinear vectors.

Purpose#

Show why stable orthogonalization matters when projecting in high dimensions.

Importance#

Precision loss destroys orthogonality and degrades projections/solvers.

What this example demonstrates#

Classical GS loses orthogonality; QR (Householder) maintains $Q^\top Q\approx I$.

Background#

Modified GS improves stability, but Householder QR is preferred in libraries.
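For reference, here is a sketch of Modified Gram–Schmidt alongside the library QR on a nearly collinear pair of columns. The function name `modified_gs` and the test matrix are illustrative; the only difference from the classical variant is that each projection coefficient is computed from the running remainder `v` rather than from the original column.

```python
import numpy as np

def modified_gs(A):
    """Modified Gram-Schmidt; assumes A has full column rank."""
    A = np.array(A, dtype=float)
    n, d = A.shape
    Q = np.zeros((n, d))
    for j in range(d):
        v = A[:, j].copy()
        for i in range(j):
            # Coefficient uses the updated v, not the original column A[:, j].
            v -= Q[:, i] * (Q[:, i] @ v)
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(7)
n, d = 40, 8
X = rng.standard_normal((n, d))
X[:, 1] = X[:, 0] + 1e-6 * rng.standard_normal(n)  # nearly collinear pair

Q_mgs = modified_gs(X)
Q_hh, _ = np.linalg.qr(X)  # NumPy's QR, which calls LAPACK's Householder-based routines

orth_mgs = np.linalg.norm(Q_mgs.T @ Q_mgs - np.eye(d))
orth_hh = np.linalg.norm(Q_hh.T @ Q_hh - np.eye(d))
print("||Q^T Q - I|| (MGS):        ", orth_mgs)
print("||Q^T Q - I|| (Householder):", orth_hh)
```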

Historical context#

Stability advancements from Gram–Schmidt to Householder underpin modern LAPACK.

Prevalence in ML#

Everywhere orthogonalization is needed: least squares, PCA, subspace tracking.

Notes#

Measure orthogonality via $\lVert Q^\top Q - I\rVert$.

Connection to ML#

Reliable projections and decompositions => reliable models.

Connection to Linear Algebra Theory#

Orthogonality preservation and rounding error analysis.

Pedagogical Significance#

Demonstrates the gap between algebraic identities and floating-point realities.

References#

Trefethen & Bau (1997). Numerical Linear Algebra.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 40, 8
X = np.random.randn(n, d)
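X[:, 1] = X[:, 0] + 1e-6 * np.random.randn(n)  # near collinearity

# Classical Gram–Schmidt: each column is orthogonalized against the previous
# q_i using coefficients computed from the ORIGINAL column A[:, j].
def classical_gs(A):
    A = A.copy().astype(float)
    n, d = A.shape
    Q = np.zeros_like(A)
    for j in range(d):
        v = A[:, j]
        for i in range(j):
            v = v - Q[:, i] * (Q[:, i] @ A[:, j])
        Q[:, j] = v / (np.linalg.norm(v) + 1e-18)  # tiny offset guards against division by zero
    return Q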

Q_gs = classical_gs(X)
Q_qr, _ = np.linalg.qr(X)

orth_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(d))
orth_qr = np.linalg.norm(Q_qr.T @ Q_qr - np.eye(d))
print("||Q^TQ - I|| (GS)", orth_gs)
print("||Q^TQ - I|| (QR)", orth_qr)

Worked Example 4: Orthogonal Procrustes — aligning embeddings via SVD#

Introduction#

Find the orthogonal matrix $R$ that best aligns $A$ to $B$ by minimizing $\lVert AR - B\rVert_F$.

Purpose#

Show closed-form solution $R=UV^\top$ from SVD of $A^\top B$ and connect to embedding alignment.

Importance#

Stable alignment across domains/languages without distorting geometry.

What this example demonstrates#

If $A^\top B = U\Sigma V^\top$, the optimal orthogonal $R=UV^\top$.
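A short derivation of why the SVD gives the optimum, using only $R^\top R=I$ and the trace form of the Frobenius norm. The auxiliary matrix $Z$ is introduced here for the argument and does not appear elsewhere in the text.

$$
\lVert AR-B\rVert_F^2
= \lVert A\rVert_F^2 + \lVert B\rVert_F^2 - 2\,\mathrm{tr}\!\left(R^\top A^\top B\right),
\qquad\text{since } \lVert AR\rVert_F = \lVert A\rVert_F .
$$

Minimizing the error is therefore equivalent to maximizing $\mathrm{tr}(R^\top A^\top B)$. Substituting $A^\top B = U\Sigma V^\top$ and setting $Z = V^\top R^\top U$ (orthogonal, as a product of orthogonal matrices):

$$
\mathrm{tr}\!\left(R^\top U\Sigma V^\top\right)
= \mathrm{tr}(Z\Sigma)
= \sum_i z_{ii}\,\sigma_i
\;\le\; \sum_i \sigma_i ,
$$

because every entry of an orthogonal matrix satisfies $|z_{ii}|\le 1$ and $\sigma_i\ge 0$. Equality holds for $Z=I$, that is $R^\top = VU^\top$, which gives $R = UV^\top$.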

Background#

Procrustes problems arise in shape analysis and representation alignment.

Historical context#

Schönemann (1966) established the orthogonal solution; widely used afterward.

Prevalence in ML#

Cross-lingual word embeddings and domain adaptation pipelines.

Notes#

Center and scale if appropriate; enforce $\det(R)=+1$ for rotation-only alignment (optional).
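The rotation-only option in this note can be sketched as follows: if $\det(UV^\top)=-1$, flip the sign of the direction associated with the smallest singular value. The function name `procrustes_rotation` and the synthetic test data are assumptions for illustration.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A R - B||_F subject to det(R) = +1."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    D = np.eye(U.shape[0])
    # If U V^T is a reflection, flip the last (smallest-singular-value) direction.
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))
    return U @ D @ Vt

rng = np.random.default_rng(8)
n, d = 50, 3
A = rng.standard_normal((n, d))
A -= A.mean(axis=0)  # optional centering

# Ground-truth proper rotation: force det = +1 on a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1
B = A @ Q

R = procrustes_rotation(A, B)
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d)))
print("det(R) = +1? ", np.isclose(np.linalg.det(R), 1.0))
print("recovers Q?  ", np.allclose(R, Q, atol=1e-8))
```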

Connection to ML#

Enables mapping between independently trained embedding spaces.

Connection to Linear Algebra Theory#

Orthogonal transformations preserve inner products; SVD reveals optimal rotation/reflection.

Pedagogical Significance#

Bridges an optimization problem to a single SVD call.

References#

Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.
Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 16
A = np.random.randn(n, d)
Q, _ = np.linalg.qr(np.random.randn(d, d))  # true orthogonal map
B = A @ Q + 0.01 * np.random.randn(n, d)

M = A.T @ B
U, S, Vt = np.linalg.svd(M)
R = U @ Vt

err = np.linalg.norm(A @ R - B, 'fro')
print("Alignment error:", round(err, 4))
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d), atol=1e-8))

Worked Example 5: Householder reflections — building orthogonal projectors#

Introduction#

Construct a Householder reflection to zero components and illustrate its orthogonality and symmetry; connect to QR and projection building.

Purpose#

Expose a basic orthogonal transformation used to construct $Q$ in QR.

Importance#

Underpins numerically stable orthogonalization in solvers and projections.

What this example demonstrates#

$H=I-2uu^\top$ with $\lVert u\rVert_2=1$ is orthogonal and symmetric; for a suitably chosen $u$, $Hx$ has zeros in all but one component.

Background#

Householder reflections are the workhorse of QR; compose reflections to build $Q$.
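Composing reflections column by column yields a full QR factorization. The following is a teaching sketch, not the blocked algorithm that LAPACK uses; the function name `householder_qr` and the choice to form $Q$ explicitly are illustrative.

```python
import numpy as np

def householder_qr(A):
    """QR by successive Householder reflections; returns Q (n x n) and R (n x d)."""
    R = np.array(A, dtype=float)
    n, d = R.shape
    Q = np.eye(n)
    for j in range(min(n - 1, d)):
        x = R[j:, j]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])  # sign choice avoids cancellation
        u = v / np.linalg.norm(v)
        # Apply H_j = I - 2 u u^T to the trailing rows of R ...
        R[j:, :] -= 2.0 * np.outer(u, u @ R[j:, :])
        # ... and accumulate Q = H_1 H_2 ... H_j on the right.
        Q[:, j:] -= 2.0 * np.outer(Q[:, j:] @ u, u)
    return Q, R

rng = np.random.default_rng(9)
A = rng.standard_normal((6, 4))
Q, R = householder_qr(A)

print("Q R = A?           ", np.allclose(Q @ R, A))
print("Q orthogonal?      ", np.allclose(Q.T @ Q, np.eye(6)))
print("R upper triangular?", np.allclose(np.tril(R, -1), 0.0))
```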

Historical context#

Householder (1958) introduced the approach; remains standard.

Prevalence in ML#

Appears indirectly via libraries (NumPy/SciPy/LAPACK) that power ML pipelines.

Notes#

Stable and efficient vs. naive orthogonalization in finite precision.

Connection to ML#

Reliable QR leads to reliable least squares, PCA, and projection-based models.

Connection to Linear Algebra Theory#

Reflections generate orthogonal groups; preserve lengths and angles.

Pedagogical Significance#

Shows a concrete, constructive way to obtain orthogonal maps.

References#

Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(4)
d = 6
x = np.random.randn(d)
e1 = np.zeros(d); e1[0] = 1.0
# copysign (rather than np.sign) keeps v nonzero even when x[0] == 0
v = x + np.copysign(np.linalg.norm(x), x[0]) * e1
u = v / (np.linalg.norm(v) + 1e-18)
H = np.eye(d) - 2 * np.outer(u, u)

Hx = H @ x
print("H orthogonal?", np.allclose(H.T @ H, np.eye(d), atol=1e-10))
print("H symmetric?", np.allclose(H, H.T, atol=1e-10))
print("Zeroed tail?", np.allclose(Hx[1:], 0.0, atol=1e-8))