ex1.ai

PSD & PD Matrices

Chapter 9

PSD & PD Matrices

Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $>0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues, admit Cholesky factorizations (for PD), and define inner products or metrics. PSD structure underpins convexity, stability, and validity of kernels.

Important ideas#

Eigenvalue characterization
- Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.
Quadratic forms and convexity
- If $A\succeq 0$, $f(x)=\tfrac{1}{2} x^\top A x$ is convex; Hessian PSD implies convex objective locally/globally (for twice-differentiable functions).
Gram matrices
- $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).
Cholesky factorization
- PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.
Schur complement
- For block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, PSD iff Schur complement $C-B^\top A^{-1}B \succeq 0$.
Mahalanobis metric
- For SPD $M$, $d_M(x,y)^2=(x-y)^\top M (x-y)$ defines a metric; whitening corresponds to $M=\Sigma^{-1}$.
Numerical PSD
- In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).

Relevance to ML#

Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.
Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.
Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.
Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.
Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

1900s: Mercer’s theorem (kernel PSD) and early quadratic form characterizations.
1918: Schur complement criteria.
1910–1940s: Cholesky factorization for SPD solves.
1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.
1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.
Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.
Principal minors: PD iff all leading principal minors are positive (Sylvester).
Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.
Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

PSD/PD definitions via quadratic form and eigenvalues.
Sylvester’s criterion (leading principal minors > 0 for PD).
Schur complement PSD condition.
Mercer’s theorem (kernel PSD).
Cholesky existence for SPD.

Applied (landmark systems)#

SVM with PSD kernels (Cortes–Vapnik 1995).
Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.
Kernel Ridge Regression (Murphy 2022) solved via SPD systems.
Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).
GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up

Covariance and PCA

$\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; eigenvalues are variances. PCA uses PSD structure. References: Jolliffe 2002; Shlens 2014.

Kernels and SVMs/GPs

RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

Hessians and convexity

For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

Mahalanobis distance & whitening

SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

Cholesky-based solvers (KRR/GP)

SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation

PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.
Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.
Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.
Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.
Schur complement: For blocks, $S = C - B^\top A^{-1} B$.
Examples:
- PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$.
- Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.
- Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.

Pitfalls & sanity checks

Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.
Near-singular PSD: add jitter; do not invert directly.
Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.
Kernel parameters: very small/large length-scales can hurt conditioning.
Cholesky failures: indicate non-PSD or insufficient jitter.

References

Foundations

Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models 4. Mercer, J. (1909). Functions of positive and negative type. 5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels. 6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. 7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics 8. Nocedal, J., & Wright, S. (2006). Numerical Optimization. 9. De Maesschalck, R. et al. (2000). The Mahalanobis distance.

Applications and practice 10. Cortes, C., & Vapnik, V. (1995). Support-vector networks. 11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare PSD tests: eigenvalues vs. Cholesky; show failure on non-PSD and success on jittered fix.

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

Symmetrization, eigenvalue check, and Cholesky factorization.
Adding jitter ($\epsilon I$) can restore PD for near-PSD matrices.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

Always symmetrize numerically; use small jitter (e.g., $1e-6$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; $A=LL^\top$ iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

Golub & Van Loan (2013). Matrix Computations.
Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
A = np.random.randn(5, 5)
A = 0.5 * (A + A.T)  # symmetrize
A_bad = A.copy()
A_bad[0, 0] = -5.0  # make indefinite

def is_psd_eig(M, tol=1e-10):
    w = np.linalg.eigvalsh(M)
    return np.all(w >= -tol), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.4f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Jitter fix
eps = 1e-3
A_fix = A_bad + eps * np.eye(A_bad.shape[0])
ok, w = is_psd_eig(A_fix)
print("jittered: min eig=", round(w.min(), 4), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("jittered Cholesky diag min:", round(np.min(np.diag(L)), 4))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify variance becomes identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

$\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.
Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

Jolliffe (2002). PCA.
Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

RBF kernel is PSD; small datasets can be nearly singular when points are close.
Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

Schölkopf & Smola (2002). Learning with Kernels.
Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = A2 + A2.T - 2 * A @ A.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

$d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

De Maesschalck et al. (2000). The Mahalanobis distance.
Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

Schölkopf & Smola (2002). Learning with Kernels.
Rasmussen & Williams (2006). Gaussian Processes for ML.
Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = A2 + B2 - 2 * A @ B.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L L^T alpha = y
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))

Comments

Algorithm Category

Direct Methods

Data Modality

Dense Matrices

Historical & Attribution

20th Century

Key Concepts & Theorems

Positive Semidefiniteness

Learning Path & Sequencing

Intermediate

Linear Algebra Foundations

Matrix Theory

Matrix Decompositions

Cholesky Decomposition

ML Applications

Kernels & Gaussian Processes

Problem Structure & Exploitation

Symmetry & Structure

Theoretical Foundation

Linear Algebra Theory

Orthogonality & Projections

Chapter 6

Orthogonality & Projections

Key ideas: Introduction

Introduction#

Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:

Residuals in least squares are orthogonal to the column space (no further decrease possible within subspace)
Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace
Orthonormal bases simplify computations and improve numerical stability
Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers
PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation

Important ideas#

Orthogonality and complements
- $x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.
Orthogonal projectors
- A projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For orthonormal $U\in\mathbb{R}^{d\times k}$ spanning $\mathcal{S}$: $P=UU^\top$.
Projection theorem
- For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $r\in\mathcal{S}^\perp$ that minimizes $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.
Pythagorean identity
- If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.
Orthonormal bases and QR
- Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.
Spectral/SVD structure
- For symmetric $\Sigma$, eigenvectors are orthonormal; SVD gives $X=U\Sigma V^\top$ with $U,V$ orthogonal. Truncation yields best rank-$k$ approximation (Eckart–Young).
Orthogonal transformations
- $Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; determinants $\pm1$ (rotations or reflections). Condition numbers remain unchanged.

Relevance to ML#

Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.
PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.
Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.
Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.
Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.
Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).

Algorithmic development (milestones)#

1900s–1930s: Gram–Schmidt orthonormalization; least squares geometry formalized.
1958–1965: Householder reflections and Golub’s QR algorithms stabilize orthogonalization.
1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).
1966: Orthogonal Procrustes (Schönemann) closed-form solution.
1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.
2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.

Definitions#

Orthogonal/Orthonormal: columns of $U$ satisfy $U^\top U=I$; orthonormal if unit length as well.
Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.
Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.
Orthogonal matrix: $Q^\top Q=I$; preserves norms and inner products.
PCA subspace: top-$k$ eigenvectors of covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems)#

Projection theorem: For closed subspace $\mathcal{S}$, projection $P_\mathcal{S}x$ uniquely minimizes $\lVert x-s\rVert_2$; residual is orthogonal to $\mathcal{S}$.
Pythagorean/Bessel/Parseval: Orthogonal decompositions preserve squared norms; partial sums bounded (Bessel); complete bases preserve energy (Parseval).
Fundamental theorem of linear algebra: $\text{col}(A)$ is orthogonal to $\text{null}(A^\top)$; $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$.
Spectral theorem: Symmetric matrices have orthonormal eigenbases; diagonalizable by $Q^\top A Q$.
Eckart–Young–Mirsky: Best rank-$k$ approximation in Frobenius/2-norm via truncated SVD.

Applied (landmark systems and practices)#

PCA/whitening: Jolliffe (2002); Shlens (2014) — denoising and compression.
Least squares/QR solvers: Golub–Van Loan (2013) — stable projections.
Orthogonal Procrustes in embedding alignment: Schönemann (1966); Smith et al. (2017).
Orthogonal initialization/constraints: Saxe et al. (2013); Mishkin & Matas (2015).
Subspace tracking and signal processing: Halko et al. (2011) randomized SVD.

Key ideas: Where it shows up

PCA and subspace denoising

PCA finds orthonormal directions $U$ maximizing variance; projection $X_k = X V_k V_k^\top$ minimizes reconstruction error.
Achievements: Dimensionality reduction at scale; whitening and denoising in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

Least squares as projection

$\hat{y} = X w^*$ is the projection of $y$ onto $\text{col}(X)$; residual $r=y-\hat{y}$ satisfies $X^\top r=0$.
Achievements: Foundational to regression and linear models; efficient via QR/SVD. References: Gauss 1809; Golub–Van Loan 2013.

Orthogonalization algorithms (QR)

Householder/Modified Gram–Schmidt produce orthonormal bases with numerical stability; essential in solvers and factorizations.
Achievements: Robust, high-performance linear algebra libraries (LAPACK). References: Householder 1958; Golub 1965; Trefethen–Bau 1997.

Orthogonal Procrustes and embedding alignment

Best orthogonal alignment between representation spaces via SVD of $A^\top B$ (solution $R=UV^\top$).
Achievements: Cross-lingual word embedding alignment; domain adaptation. References: Schönemann 1966; Smith et al. 2017.

Orthogonal constraints/initialization in deep nets

Orthogonal weight matrices preserve variance across layers; improve training stability and gradient flow.
Achievements: Deep linear dynamics analysis; practical initializations. References: Saxe et al. 2013; Mishkin & Matas 2015.

Notation

Data matrix and spaces: $X\in\mathbb{R}^{n\times d}$, $\text{col}(X)\subseteq\mathbb{R}^n$, $\text{null}(X^\top)$.
Orthonormal basis: $U\in\mathbb{R}^{n\times k}$ with $U^\top U=I$.
Orthogonal projector: $P=UU^\top$ (symmetric, idempotent); residual $r=(I-P)y$ satisfies $U^\top r=0$.
QR factorization: $X=QR$ with $Q^\top Q=I$; $Q$ spans $\text{col}(X)$.
SVD/PCA: $X=U\Sigma V^\top$; top-$k$ projection $P_k=U_k U_k^\top$ (or $X V_k V_k^\top$ on features).
Examples:
- Least squares via projection: $\hat{y} = P y$ with $P=Q Q^\top$ for $Q$ from QR of $X$.
- PCA reconstruction: $\hat{X} = X V_k V_k^\top$; error $\lVert X-\hat{X}\rVert_F^2 = \sum_{i>k}\sigma_i^2$.
- Procrustes alignment: $R=UV^\top$ from SVD of $A^\top B$; $R$ is orthogonal.

Pitfalls & sanity checks

Centering for PCA: use $X_c$ to ensure principal directions capture variance, not mean.
Orthogonality of bases: $U$ must be orthonormal for $P=UU^\top$ to be an orthogonal projector; otherwise projection is oblique.
Numerical orthogonality: prefer QR/SVD; classical Gram–Schmidt can lose orthogonality under ill-conditioning.
Certificates: verify $P$ is symmetric/idempotent and that residuals are orthogonal to $\text{col}(X)$.
Overfitting with high-$k$ PCA: track retained variance and use validation.

References

Foundations and numerical linear algebra

Strang, G. (2016). Introduction to Linear Algebra (5th ed.).
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Projections, orthogonality, and approximation 4. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. 5. Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix. 6. Gram, J. (1883); Schmidt, E. (1907). Orthonormalization methods.

PCA and applications 7. Jolliffe, I. (2002). Principal Component Analysis. 8. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

Embedding alignment and orthogonal methods in ML 9. Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem. 10. Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations. 11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 12. Mishkin, D., & Matas, J. (2015). All you need is a good init.

General ML texts 13. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. 14. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

Five worked examples

Worked Example 1: Least squares as orthogonal projection (QR certificate)#

Introduction#

Show that least squares fits correspond to orthogonal projection of $y$ onto $\text{col}(X)$, with residual orthogonal to features.

Purpose#

Derive $\hat{y}=P y$ with $P=Q Q^\top$ and verify $X^\top r=0$ numerically.

Importance#

Anchors regression in subspace geometry; provides robust implementation guidance via QR.

What this example demonstrates#

$X=QR$ with $Q^\top Q=I$ yields $\hat{y}=QQ^\top y$.
Residual $r=y-\hat{y}$ satisfies $Q^\top r=0$ and $X^\top r=0$.

Background#

Least squares minimizes squared error; projection theorem assures unique closest point in $\text{col}(X)$.

Historical context#

Gauss/Legendre least squares; Householder/Golub QR for numerical stability.

Prevalence in ML#

Linear models, GLM approximations, and as inner loops in larger systems.

Notes#

Prefer QR/SVD over normal equations.
Check $P$ is symmetric and idempotent in code.

Connection to ML#

Core of regression pipelines; basis for Ridge/Lasso solvers (with modifications).

Connection to Linear Algebra Theory#

Projection theorem; FTLA decomposition $\mathbb{R}^n=\text{col}(X)\oplus\text{null}(X^\top)$.

Pedagogical Significance#

Gives a geometric certificate of optimality via orthogonality.

References#

Gauss (1809); Legendre (1805) — least squares.
Golub & Van Loan (2013) — QR solvers.
Trefethen & Bau (1997) — numerical linear algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 5
X = np.random.randn(n, d)
w_true = np.array([1.2, -0.8, 0.5, 0.0, 2.0])
y = X @ w_true + 0.1 * np.random.randn(n)

Q, R = np.linalg.qr(X)
P = Q @ Q.T
y_hat = P @ y
r = y - y_hat

# Certificates
print("Symmetric P?", np.allclose(P, P.T, atol=1e-10))
print("Idempotent P?", np.allclose(P @ P, P, atol=1e-10))
print("Q^T r ~ 0?", np.linalg.norm(Q.T @ r))
print("X^T r ~ 0?", np.linalg.norm(X.T @ r))

# Compare to lstsq fit
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Projection match?", np.allclose(y_hat, X @ w_ls, atol=1e-8))

Worked Example 2: PCA projection and best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate orthogonal projection onto top-$k$ principal components and verify reconstruction error equals the sum of squared tail singular values.

Purpose#

Connect PCA’s orthogonal subspace to optimal low-rank approximation.

Importance#

Backbone of dimensionality reduction and denoising in ML.

What this example demonstrates#

$X=U\Sigma V^\top$; projection to rank-$k$ is $X_k = U_k \Sigma_k V_k^\top = X V_k V_k^\top$.
Error: $\lVert X-X_k\rVert_F^2 = \sum_{i>k} \sigma_i^2$.

Background#

Eckart–Young shows truncated SVD minimizes Frobenius/2-norm error among rank-$k$ matrices.

Historical context#

Low-rank approximation dates to the 1930s; widespread modern use in ML systems.

Prevalence in ML#

Feature compression, noise removal, approximate nearest neighbors, latent semantic analysis.

Notes#

Center data for covariance-based PCA; use SVD directly on $X_c$.

Connection to ML#

Trade off between compression (smaller $k$) and fidelity (retained variance).

Connection to Linear Algebra Theory#

Orthogonal projectors $U_k U_k^\top$; spectral ordering of singular values.

Pedagogical Significance#

Illustrates how orthogonality yields optimality guarantees.

References#

Eckart & Young (1936) — best rank-$k$.
Jolliffe (2002) — PCA.
Shlens (2014) — PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 80, 30, 5
X = np.random.randn(n, d) @ np.diag(np.linspace(5, 0.1, d))  # create decaying spectrum
Xc = X - X.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T
Xk = Xc @ Vk @ Vk.T

err = np.linalg.norm(Xc - Xk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-6))

Worked Example 3: Gram–Schmidt vs Householder QR (orthogonality under stress)#

Introduction#

Compare classical Gram–Schmidt to numerically stable QR on nearly colinear vectors.

Purpose#

Show why stable orthogonalization matters when projecting in high dimensions.

Importance#

Precision loss destroys orthogonality and degrades projections/solvers.

What this example demonstrates#

Classical GS loses orthogonality; QR (Householder) maintains $Q^\top Q\approx I$.

Background#

Modified GS improves stability, but Householder QR is preferred in libraries.

Historical context#

Stability advancements from Gram–Schmidt to Householder underpin modern LAPACK.

Prevalence in ML#

Everywhere orthogonalization is needed: least squares, PCA, subspace tracking.

Notes#

Measure orthogonality via $\lVert Q^\top Q - I\rVert$.

Connection to ML#

Reliable projections and decompositions => reliable models.

Connection to Linear Algebra Theory#

Orthogonality preservation and rounding error analysis.

Pedagogical Significance#

Demonstrates the gap between algebraic identities and floating-point realities.

References#

Trefethen & Bau (1997). Numerical Linear Algebra.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 40, 8
X = np.random.randn(n, d)
X[:, 1] = X[:, 0] + 1e-6 * np.random.randn(n)  # near colinearity

# Classical Gram–Schmidt
def classical_gs(A):
	 A = A.copy().astype(float)
	 n, d = A.shape
	 Q = np.zeros_like(A)
	 for j in range(d):
		  v = A[:, j]
		  for i in range(j):
				v = v - Q[:, i] * (Q[:, i].T @ A[:, j])
		  Q[:, j] = v / (np.linalg.norm(v) + 1e-18)
	 return Q

Q_gs = classical_gs(X)
Q_qr, _ = np.linalg.qr(X)

orth_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(d))
orth_qr = np.linalg.norm(Q_qr.T @ Q_qr - np.eye(d))
print("||Q^TQ - I|| (GS)", orth_gs)
print("||Q^TQ - I|| (QR)", orth_qr)

Worked Example 4: Orthogonal Procrustes — aligning embeddings via SVD#

Introduction#

Find the orthogonal matrix $R$ that best aligns $A$ to $B$ by minimizing $\lVert AR - B\rVert_F$.

Purpose#

Show closed-form solution $R=UV^\top$ from SVD of $A^\top B$ and connect to embedding alignment.

Importance#

Stable alignment across domains/languages without distorting geometry.

What this example demonstrates#

If $A^\top B = U\Sigma V^\top$, the optimal orthogonal $R=UV^\top$.

Background#

Procrustes problems arise in shape analysis and representation alignment.

Historical context#

Schönemann (1966) established the orthogonal solution; widely used afterward.

Prevalence in ML#

Cross-lingual word embeddings and domain adaptation pipelines.

Notes#

Center and scale if appropriate; enforce $\det(R)=+1$ for rotation-only alignment (optional).

Connection to ML#

Enables mapping between independently trained embedding spaces.

Connection to Linear Algebra Theory#

Orthogonal transformations preserve inner products; SVD reveals optimal rotation/reflection.

Pedagogical Significance#

Bridges an optimization problem to a single SVD call.

References#

Schönemann, P. (1966). A generalized solution of the orthogonal Procrustes problem.
Smith, S. et al. (2017). Offline Bilingual Word Vectors, Orthogonal Transformations.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 16
A = np.random.randn(n, d)
Q, _ = np.linalg.qr(np.random.randn(d, d))  # true orthogonal map
B = A @ Q + 0.01 * np.random.randn(n, d)

M = A.T @ B
U, S, Vt = np.linalg.svd(M)
R = U @ Vt

err = np.linalg.norm(A @ R - B, 'fro')
print("Alignment error:", round(err, 4))
print("R orthogonal?", np.allclose(R.T @ R, np.eye(d), atol=1e-8))

Worked Example 5: Householder reflections — building orthogonal projectors#

Introduction#

Construct a Householder reflection to zero components and illustrate its orthogonality and symmetry; connect to QR and projection building.

Purpose#

Expose a basic orthogonal transformation used to construct $Q$ in QR.

Importance#

Underpins numerically stable orthogonalization in solvers and projections.

What this example demonstrates#

$H=I-2uu^\top$ is orthogonal and symmetric; $Hx$ zeros all but one component.

Background#

Householder reflections are the workhorse of QR; compose reflections to build $Q$.

Historical context#

Householder (1958) introduced the approach; remains standard.

Prevalence in ML#

Appears indirectly via libraries (NumPy/SciPy/LAPACK) that power ML pipelines.

Notes#

Stable and efficient vs. naive orthogonalization in finite precision.

Connection to ML#

Reliable QR leads to reliable least squares, PCA, and projection-based models.

Connection to Linear Algebra Theory#

Reflections generate orthogonal groups; preserve lengths and angles.

Pedagogical Significance#

Shows a concrete, constructive way to obtain orthogonal maps.

References#

Householder, A. (1958). Unitary Triangularization of a Nonsymmetric Matrix.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(4)
d = 6
x = np.random.randn(d)
e1 = np.zeros(d); e1[0] = 1.0
v = x + np.sign(x[0]) * np.linalg.norm(x) * e1
u = v / (np.linalg.norm(v) + 1e-18)
H = np.eye(d) - 2 * np.outer(u, u)

Hx = H @ x
print("H orthogonal?", np.allclose(H.T @ H, np.eye(d), atol=1e-10))
print("H symmetric?", np.allclose(H, H.T, atol=1e-10))
print("Zeroed tail?", np.allclose(Hx[1:], 0.0, atol=1e-8))