Introduction#
Orthogonality and projections are the geometry of fitting, decomposing, and compressing data:
Residuals in least squares are orthogonal to the column space (no further decrease is possible within the subspace)
Orthogonal projectors $P$ produce the best $\ell_2$ approximation in a subspace
Orthonormal bases simplify computations and improve numerical stability
Orthogonal transformations (rotations/reflections) preserve lengths, angles, and condition numbers
PCA chooses an orthonormal basis maximizing variance; truncation is the best rank-$k$ approximation
Important ideas#
Orthogonality and complements
$x \perp y$ iff $\langle x,y\rangle = 0$. For a subspace $\mathcal{S}$, the orthogonal complement $\mathcal{S}^\perp = \{z: \langle z, s\rangle = 0,\; \forall s\in\mathcal{S}\}$.
Orthogonal projectors
An orthogonal projector $P$ onto $\mathcal{S}$ is idempotent and symmetric: $P^2=P$, $P^\top=P$. For orthonormal $U\in\mathbb{R}^{d\times k}$ spanning $\mathcal{S}$: $P=UU^\top$.
Projection theorem
For any $x$ and closed subspace $\mathcal{S}$, there is a unique decomposition $x = P_{\mathcal{S}}x + r$ with $P_{\mathcal{S}}x\in\mathcal{S}$ and $r\in\mathcal{S}^\perp$; moreover $P_{\mathcal{S}}x$ is the unique minimizer of $\lVert x - s\rVert_2$ over $s\in\mathcal{S}$.
Pythagorean identity
If $a\perp b$, then $\lVert a+b\rVert_2^2 = \lVert a\rVert_2^2 + \lVert b\rVert_2^2$. For $x = P x + r$ with $r\perp \mathcal{S}$: $\lVert x\rVert_2^2 = \lVert Px\rVert_2^2 + \lVert r\rVert_2^2$.
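The projector properties and the Pythagorean identity above can be checked numerically. A minimal NumPy sketch (variable names illustrative): build an orthonormal basis $U$ for a random subspace, form $P=UU^\top$, and verify idempotency, symmetry, residual orthogonality, and the norm decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal basis U for a random 2-D subspace of R^5 (via QR).
A = rng.standard_normal((5, 2))
U, _ = np.linalg.qr(A)

P = U @ U.T              # orthogonal projector onto col(U)
x = rng.standard_normal(5)
px = P @ x               # projection of x onto the subspace
r = x - px               # residual, lives in the orthogonal complement

# Idempotent and symmetric.
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)

# Residual is orthogonal to the subspace, so Pythagoras holds.
assert np.allclose(U.T @ r, 0)
assert np.isclose(x @ x, px @ px + r @ r)
```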
Orthonormal bases and QR
Gram–Schmidt, Modified Gram–Schmidt, and Householder QR compute orthonormal bases; Householder QR is numerically stable.
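The stability claim can be illustrated with a small experiment: on an ill-conditioned matrix, classical Gram–Schmidt loses orthogonality while Householder QR (as used by LAPACK behind `np.linalg.qr`) keeps $\lVert Q^\top Q - I\rVert$ near machine precision. A sketch, with the test matrix constructed for the purpose:

```python
import numpy as np

def classical_gram_schmidt(A):
    """Orthonormalize the columns of A (numerically fragile when A is ill-conditioned)."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].copy()
        for i in range(j):
            v -= (Q[:, i] @ A[:, j]) * Q[:, i]   # subtract components along earlier q_i
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(1)
# Ill-conditioned test matrix: orthogonal factors with rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((50, 10)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
s = np.logspace(0, -10, 10)            # condition number ~ 1e10
A = U * s @ V.T

Q_cgs = classical_gram_schmidt(A)
Q_hh, _ = np.linalg.qr(A)              # Householder QR via LAPACK

err_cgs = np.linalg.norm(Q_cgs.T @ Q_cgs - np.eye(10))
err_hh = np.linalg.norm(Q_hh.T @ Q_hh - np.eye(10))
# err_hh stays near machine precision; err_cgs is orders of magnitude larger.
```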
Spectral/SVD structure
For symmetric $\Sigma$, an orthonormal eigenbasis exists (spectral theorem); the SVD gives $X=U S V^\top$ with $U,V$ orthogonal and $S$ diagonal. Truncating to the top $k$ singular values yields the best rank-$k$ approximation (Eckart–Young).
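A short NumPy check of the Eckart–Young statement (matrix and rank chosen arbitrarily): the Frobenius error of the truncated SVD equals the energy in the discarded singular values, and a random rank-$k$ competitor does no better.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 12))

# Thin SVD, then truncate to rank k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_k = U[:, :k] * s[:k] @ Vt[:k, :]     # best rank-k approximation

# Frobenius error equals the energy in the discarded singular values.
err = np.linalg.norm(X - X_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Any other rank-k matrix (here, a random one) is no closer to X.
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 12))
assert np.linalg.norm(X - B, 'fro') >= err
```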
Orthogonal transformations
$Q$ orthogonal ($Q^\top Q=I$) preserves inner products and norms; determinant $\pm1$ ($+1$ for rotations, $-1$ for reflections). Multiplying by $Q$ leaves condition numbers unchanged.
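These invariances are easy to verify numerically. A sketch that draws a random orthogonal $Q$ (the Q-factor of a Gaussian matrix) and checks norm, inner-product, determinant, and condition-number preservation:

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # random orthogonal matrix

x = rng.standard_normal(6)
y = rng.standard_normal(6)
A = rng.standard_normal((6, 6))

assert np.allclose(Q.T @ Q, np.eye(6))                         # Q^T Q = I
assert np.isclose(abs(np.linalg.det(Q)), 1.0)                  # det = ±1
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))    # lengths preserved
assert np.isclose((Q @ x) @ (Q @ y), x @ y)                    # inner products preserved
assert np.isclose(np.linalg.cond(Q @ A), np.linalg.cond(A))    # condition number preserved
```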
Relevance to ML#
Least squares: residual orthogonality certifies optimality; $P=UU^\top$ gives fitted values.
PCA/denoising: orthogonal subspaces capture variance; residuals capture noise.
Numerical stability: QR/SVD underpin robust solvers and decompositions used across ML.
Deep nets: orthogonal initialization stabilizes signal propagation; orthogonal regularization promotes decorrelation.
Embedding alignment: Procrustes gives the best orthogonal alignment of spaces.
Projected methods: projection operators enforce constraints in optimization (e.g., norm balls, subspaces).
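The least-squares point above can be made concrete: the residual of the fitted model is orthogonal to every column of $X$, and the fitted values equal $UU^\top y$ with $U$ from a QR factorization of $X$. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

# Solve min_w ||y - Xw||_2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w

# Optimality certificate: residual orthogonal to col(X).
assert np.allclose(X.T @ residual, 0)

# Fitted values via the projector P = U U^T onto col(X).
U, _ = np.linalg.qr(X)
assert np.allclose(U @ (U.T @ y), X @ w)
```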
Algorithmic development (milestones)#
1900s–1930s: Gram–Schmidt orthonormalization; least squares geometry formalized.
1936: Eckart–Young theorem (best rank-$k$ approximation via SVD).
1958–1965: Householder reflections and Golub’s QR algorithms stabilize orthogonalization.
1966: Orthogonal Procrustes (Schönemann) closed-form solution.
1990s–2000s: PCA mainstream in data analysis; subspace methods in signal processing.
2013–2016: Orthogonal initialization (Saxe et al.) and normalization methods in deep learning.
Definitions#
Orthogonal/Orthonormal: vectors are orthogonal if pairwise $\langle u_i,u_j\rangle=0$ for $i\neq j$; orthonormal if additionally unit length, i.e. the columns of $U$ satisfy $U^\top U=I$.
Projector: $P^2=P$. Orthogonal projector satisfies $P^\top=P$; projection onto $\text{col}(U)$ is $P=UU^\top$ for orthonormal $U$.
Orthogonal complement: $\mathcal{S}^\perp=\{x: \langle x, s\rangle=0,\;\forall s\in\mathcal{S}\}$.
Orthogonal matrix: $Q^\top Q=I$; preserves norms and inner products.
PCA subspace: span of the top-$k$ eigenvectors $U_k$ of the covariance $\Sigma$; projection operator $P_k=U_k U_k^\top$.
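The PCA definition can be sketched end to end in NumPy (synthetic data, anisotropic by construction): center, form the sample covariance, take the top-$k$ eigenvectors, and check that $P_k$ is an orthogonal projector capturing exactly the top-$k$ eigenvalues of variance.

```python
import numpy as np

rng = np.random.default_rng(5)
# Data with most variance concentrated in a few coordinate directions.
X = rng.standard_normal((500, 5)) * np.array([5.0, 2.0, 1.0, 0.1, 0.1])
Xc = X - X.mean(axis=0)                      # center first

Sigma = Xc.T @ Xc / (len(Xc) - 1)            # sample covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)     # ascending eigenvalue order
k = 2
U_k = eigvecs[:, ::-1][:, :k]                # top-k eigenvectors

P_k = U_k @ U_k.T                            # projector onto the PCA subspace
assert np.allclose(P_k @ P_k, P_k)           # idempotent
assert np.allclose(P_k, P_k.T)               # symmetric

# Variance captured by the subspace = sum of the top-k eigenvalues.
captured = np.trace(P_k @ Sigma @ P_k)
assert np.isclose(captured, np.sort(eigvals)[::-1][:k].sum())
```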