Introduction#
PCA seeks a low-dimensional projection that captures the most variance. Geometrically, it rotates data so axes align with directions of maximal spread. Algebraically, it solves the optimization $\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$, yielding the top eigenvector of $X_c^\top X_c$. Successive components are orthogonal and capture diminishing variance.
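The variance-maximization view above can be checked numerically. A minimal NumPy sketch (synthetic data and variable names are illustrative, not from the notes): the unit vector maximizing $\lVert X_c u \rVert^2$ is the top eigenvector of $X_c^\top X_c$, and the maximized value equals the largest eigenvalue of $\Sigma$.

```python
import numpy as np

# Synthetic data with one dominant direction (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])

Xc = X - X.mean(axis=0)                   # center each column
Sigma = Xc.T @ Xc / len(Xc)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
u = eigvecs[:, -1]                        # top eigenvector = first principal direction

# ||Xc u||^2 / n is the variance along u; it equals the largest eigenvalue
top_var = np.linalg.norm(Xc @ u) ** 2 / len(Xc)
```

Any other unit vector gives a strictly smaller variance along its direction, which is exactly the constrained optimization stated above.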
Important ideas#
Covariance matrix
$\Sigma = \tfrac{1}{n}X_c^\top X_c$ with $X_c$ centered. Eigenvalues $\lambda_i$ are variances along principal directions.
Principal components (eigenvectors)
Columns of $V$ from SVD (or eigenvectors of $\Sigma$) form an orthonormal basis ordered by variance.
Explained variance ratio
EVR = $\tfrac{\lambda_i}{\sum_j \lambda_j}$ quantifies how much total variance component $i$ explains; cumulative EVR guides dimensionality choice.
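A short sketch of using cumulative EVR to pick $k$ (synthetic data; the 95% threshold is an assumed example, not a rule from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

eigvals = np.linalg.eigvalsh(Sigma)[::-1]  # eigenvalues, descending
evr = eigvals / eigvals.sum()              # explained variance ratio per component
cum = np.cumsum(evr)                       # cumulative EVR
k = int(np.searchsorted(cum, 0.95)) + 1    # smallest k reaching 95% of variance
```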
Scores and loadings
Scores: $Z = X_c V$ (projections onto components); loadings: $V$ (directions in original space).
Reconstruction and truncation
Truncated PCA: keep $k$ components; $\tilde{X}_c = Z_k V_k^\top$ minimizes squared error (Eckart–Young).
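The Eckart–Young property can be verified directly: the squared reconstruction error of the rank-$k$ truncation equals the sum of the discarded squared singular values. A sketch with assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Vk = Vt[:k].T                       # top-k loadings
Zk = Xc @ Vk                        # scores on the first k components
X_rec = Zk @ Vk.T                   # rank-k reconstruction

err = np.sum((Xc - X_rec) ** 2)     # squared Frobenius error
discarded = np.sum(s[k:] ** 2)      # sum of discarded singular values squared
```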
Standardization and scaling
Standardize variables to unit variance before PCA if they are on different scales; otherwise the leading components may simply track the highest-variance features.
Whitening
Rescale scores to unit variance: $Z_w = Z \Lambda^{-1/2}$. The scores $Z$ are already decorrelated; whitening additionally gives each direction unit variance (identity covariance), which helps downstream algorithms sensitive to scale (e.g., RBF kernels, distance-based methods).
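A sketch of whitening and a check that the result has identity covariance (synthetic correlated data, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
X = rng.normal(size=(1000, 4)) @ A   # correlated data
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

eigvals, V = np.linalg.eigh(Sigma)
Z = Xc @ V                           # scores: decorrelated, variances = eigenvalues
Zw = Z / np.sqrt(eigvals)            # whitened: unit variance along every axis

cov_w = Zw.T @ Zw / len(Zw)          # empirical covariance of whitened data
```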
Relevance to ML#
Dimensionality reduction: speeds training, avoids overfitting, improves generalization.
Visualization: 2D/3D projection of high-dimensional data for exploration.
Preprocessing: removes noise, aligns scales, improves conditioning of solvers.
Feature extraction: learned components as features for downstream classifiers.
Denoising: truncated PCA removes low-variance (noisy) directions.
Whitening: standardizes correlation structure, crucial for many algorithms (kernels, distance-based methods).
Algorithmic development (milestones)#
1901: Pearson introduces lines/planes of closest fit (geometric intuition).
1933: Hotelling formalizes PCA as eigen-decomposition of covariance.
1950s–1960s: Computational advances (QR, Jacobi methods) enable practical PCA.
1999: Probabilistic PCA (Tipping–Bishop) bridges PCA and Gaussian latent variable models.
1997–2010s: Kernel PCA (Schölkopf et al.) and sparse PCA emerge for nonlinear and interpretable variants.
Late 2000s: Randomized SVD/PCA for large-scale data (Halko–Martinsson–Tropp).
2010s: PCA integrated into deep learning (autoencoders, PCA layers, spectral initialization).
Definitions#
Centered data: $X_c = X - \mathbf{1}\bar{X}$ with $\bar{X} = \tfrac{1}{n}\mathbf{1}^\top X$ (the row vector of column/feature means, subtracted from every row).
Covariance matrix: $\Sigma = \tfrac{1}{n}X_c^\top X_c \in \mathbb{R}^{d\times d}$ (PSD).
Principal components: eigenvectors of $\Sigma$, ordered by decreasing eigenvalue.
Variance explained by component $i$: $\lambda_i / \operatorname{tr}(\Sigma)$.
Whitened data: $Z_w = X_c V \Lambda^{-1/2}$ with $\Lambda$ diagonal eigenvalue matrix.
Reconstructed data: $\tilde{X}_c = X_c V_k V_k^\top$ using rank-$k$ approximation.
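The definitions above can be tied together in one sketch: the SVD of the centered data yields the same principal directions as the eigen-decomposition of $\Sigma$, with $\lambda_i = s_i^2 / n$ relating singular values to eigenvalues (synthetic data assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.2])

Xbar = X.mean(axis=0)                 # column (feature) means
Xc = X - Xbar                         # centered data
Sigma = Xc.T @ Xc / len(Xc)           # covariance matrix (PSD)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = s ** 2 / len(Xc)                # eigenvalues of Sigma from singular values
evr = lam / lam.sum()                 # variance explained per component

k = 2
Vk = Vt[:k].T                         # top-k loadings
X_rec = Xc @ Vk @ Vk.T                # rank-k reconstruction
```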