Chapter 11
Principal Component Analysis
Key ideas: Introduction

Introduction#

PCA seeks a low-dimensional projection that captures the most variance. Geometrically, it rotates data so axes align with directions of maximal spread. Algebraically, it solves the optimization $\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$, yielding the top eigenvector of $X_c^\top X_c$. Successive components are orthogonal and capture diminishing variance.
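This variational characterization can be checked numerically; the sketch below (synthetic data, with arbitrary sizes and scaling) runs power iteration on $X_c^\top X_c$ and confirms it converges to the same direction as the top eigenvector from `np.linalg.eigh`.

```python
import numpy as np

# Power iteration on X_c^T X_c: repeated multiplication plus normalization
# converges to the maximizer of ||X_c u||^2 over unit vectors u.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
G = Xc.T @ Xc

u = rng.normal(size=4)
for _ in range(200):
    u = G @ u
    u /= np.linalg.norm(u)

# Top eigenvector from a direct eigen-decomposition (eigh sorts ascending).
_, V = np.linalg.eigh(G)
v_top = V[:, -1]
print("alignment |<u, v_top>|:", abs(u @ v_top))  # close to 1.0
```

Eigenvectors are defined only up to sign, so the comparison uses the absolute inner product.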

Important ideas#

  1. Covariance matrix

    • $\Sigma = \tfrac{1}{n}X_c^\top X_c$ with $X_c$ centered. Eigenvalues $\lambda_i$ are variances along principal directions.

  2. Principal components (eigenvectors)

    • Columns of $V$ from SVD (or eigenvectors of $\Sigma$) form an orthonormal basis ordered by variance.

  3. Explained variance ratio

    • EVR = $\tfrac{\lambda_i}{\sum_j \lambda_j}$ quantifies how much total variance component $i$ explains; cumulative EVR guides dimensionality choice.

  4. Scores and loadings

    • Scores: $Z = X_c V$ (projections onto components); loadings: $V$ (directions in original space).

  5. Reconstruction and truncation

    • Truncated PCA: keep $k$ components; $\tilde{X}_c = Z_k V_k^\top$ minimizes squared error (Eckart–Young).

  6. Standardization and scaling

    • Standardize to unit variance before PCA if variables have different scales; otherwise leading component may be dominated by high-variance features.

  7. Whitening

    • Transform to unit variance: $Z_w = Z \Lambda^{-1/2}$ decorrelates and rescales for downstream algorithms (e.g., RBF kernels).
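A minimal sketch tying these ideas together on synthetic data (sizes are illustrative; idea 6, standardization, would additionally divide each column of $X_c$ by its standard deviation before the decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)                 # center first

Sigma = Xc.T @ Xc / n                   # 1. covariance matrix
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]          # 2. components, descending variance

evr = lam / lam.sum()                   # 3. explained variance ratio
Z = Xc @ V                              # 4. scores; V holds the loadings

k = 2
Xc_k = Z[:, :k] @ V[:, :k].T            # 5. rank-k reconstruction
Zw = Z / np.sqrt(lam)                   # 7. whitened scores

print("EVR:", np.round(evr, 3))
print("whitened covariance:\n", np.round(Zw.T @ Zw / n, 3))  # the identity
```

The whitened covariance is the identity by construction: $Z_w^\top Z_w / n = \Lambda^{-1/2} V^\top \Sigma V \Lambda^{-1/2} = I$.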

Relevance to ML#

  • Dimensionality reduction: speeds training, avoids overfitting, improves generalization.

  • Visualization: 2D/3D projection of high-dimensional data for exploration.

  • Preprocessing: removes noise, aligns scales, improves conditioning of solvers.

  • Feature extraction: learned components as features for downstream classifiers.

  • Denoising: truncated PCA removes low-variance (noisy) directions.

  • Whitening: standardizes correlation structure, crucial for many algorithms (kernels, distance-based methods).

Algorithmic development (milestones)#

  • 1901: Pearson introduces lines/planes of closest fit (geometric intuition).

  • 1933: Hotelling formalizes PCA as eigen-decomposition of covariance.

  • 1950s–1960s: Computational advances (QR, Jacobi methods) enable practical PCA.

  • 1997: Probabilistic PCA (Tipping & Bishop) bridges PCA and Gaussian latent variable models.

  • 1997–2010s: Kernel PCA (Schölkopf et al.) and sparse PCA emerge for nonlinear and interpretable variants.

  • 2009–2011: Randomized PCA for large-scale data (Halko, Martinsson & Tropp).

  • 2010s: PCA integrated into deep learning (autoencoders, PCA layers, spectral initialization).

Definitions#

  • Centered data: $X_c = X - \bar{X}$ with $\bar{X} = \tfrac{1}{n}\mathbf{1}^\top X$ (the $1 \times d$ row of column means, broadcast across rows).

  • Covariance matrix: $\Sigma = \tfrac{1}{n}X_c^\top X_c \in \mathbb{R}^{d\times d}$ (PSD).

  • Principal components: eigenvectors of $\Sigma$, ordered by eigenvalue magnitude.

  • Variance explained by component $i$: $\lambda_i / \operatorname{tr}(\Sigma)$.

  • Whitened data: $Z_w = X_c V \Lambda^{-1/2}$ with $\Lambda$ diagonal eigenvalue matrix.

  • Reconstructed data: $\tilde{X}_c = X_c V_k V_k^\top$ using rank-$k$ approximation.
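These definitions can be verified mechanically; the snippet below (random data, arbitrary sizes) checks that the covariance is PSD, that its trace equals the sum of eigenvalues (total variance), and that the rank-$k$ reconstruction indeed has rank $k$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
n = X.shape[0]
Xc = X - X.mean(axis=0)                 # centered data

Sigma = Xc.T @ Xc / n                   # covariance matrix (PSD)
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]          # descending order

print("min eigenvalue >= 0:", lam.min() >= -1e-12)        # PSD check
print("tr(Sigma) == sum(lam):", np.isclose(lam.sum(), np.trace(Sigma)))

k = 3
Xc_k = Xc @ V[:, :k] @ V[:, :k].T       # reconstructed data, rank k
print("rank of reconstruction:", np.linalg.matrix_rank(Xc_k))
```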

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Covariance matrix and its PSD structure (Chapter 09).

  • Eigen-decomposition of symmetric covariance matrix.

  • Variational characterization: $\arg\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$ yields top eigenvector.

  • Eckart–Young–Mirsky low-rank approximation error (Chapter 10).

  • Relation to SVD: PCA via SVD of centered data (Chapter 10).

  • Standardization and scaling effects on covariance eigenvalues.
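The last point is easy to demonstrate: rescaling a single feature (the same data in different units) reshuffles the covariance eigenvalues, so PCA without standardization is unit-dependent. A small sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
Xc = X - X.mean(axis=0)

lam_raw = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]

X2 = Xc.copy()
X2[:, 0] *= 100.0                       # same feature, different units
lam_scaled = np.linalg.eigvalsh(X2.T @ X2 / n)[::-1]

print("original eigenvalues:", np.round(lam_raw, 2))   # roughly equal
print("rescaled eigenvalues:", np.round(lam_scaled, 0))  # one dominates
```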

Applied (landmark systems)#

  • Dimensionality reduction (Jolliffe 2002; Hastie et al. 2009).

  • Whitening for deep learning (LeCun et al. 1998; Krizhevsky et al. 2012).

  • Probabilistic PCA and latent variable models (Tipping & Bishop 1997).

  • Kernel PCA for nonlinear reduction (Schölkopf et al. 1998).

  • Randomized PCA for large scale (Halko–Martinsson–Tropp 2011).

  • Matrix completion via nuclear-norm relaxation (Candès & Tao 2010).

Key ideas: Where it shows up
  1. Dimensionality reduction and preprocessing

  • Removes redundant features; improves stability of downstream solvers. Achievements: widely used in computer vision (image preprocessing), bioinformatics (gene expression), and AutoML pipelines. References: Jolliffe 2002.

  2. Visualization and exploratory data analysis

  • Project to 2D/3D for interactive inspection and cluster discovery. Achievements: industry standard in data exploration tools (Pandas, R, Plotly). References: Hastie et al. 2009.

  3. Whitening and decorrelation

  • Standardizes feature covariance to identity; improves kernel methods and RBF networks. Achievements: standard preprocessing in deep learning frameworks. References: LeCun et al. 1998 (early deep learning); Krizhevsky et al. 2012 (ImageNet AlexNet).

  4. Denoising and matrix completion

  • Truncated PCA recovers low-rank structure from noisy observations. Achievements: used in image inpainting and recommendation cold-start. References: Candès & Tao 2010 (matrix completion); Pearson 1901 (geometric intuition).

  5. Feature extraction and representation learning

  • Learned components become features for classifiers; precursor to autoencoders. Achievements: basis for deep autoencoders and VAEs. References: Hinton & Salakhutdinov 2006 (deep learning via autoencoders).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$; centered $X_c = X - \bar{X}$.

  • Covariance: $\Sigma = \tfrac{1}{n}X_c^\top X_c$.

  • Eigendecomposition: $\Sigma = V \Lambda V^\top$ with $\Lambda$ diagonal.

  • Principal components: columns of $V$; eigenvalues $\lambda_i$ are variances.

  • Scores (projections): $Z = X_c V \in \mathbb{R}^{n\times d}$ or truncated $Z_k = X_c V_k$.

  • Explained variance ratio: $\text{EVR}_i = \tfrac{\lambda_i}{\sum_j \lambda_j}$.

  • Standardized data: $X_s = X_c / \sigma$ (element-wise division of each column by that column's standard deviation).

  • Whitened data: $Z_w = Z \Lambda^{-1/2} = X_c V \Lambda^{-1/2}$.

  • Example: If $X$ is $100 \times 50$ with 2 dominant eigenvalues $\lambda_1=8, \lambda_2=3, \sum_j \lambda_j=12$, then $\text{EVR}_1 \approx 0.67, \text{EVR}_2 \approx 0.25$; keep 2 components to explain $92\%$ of variance.
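The arithmetic in this example can be reproduced directly; the two trailing eigenvalues below are hypothetical fillers so the spectrum sums to 12 as stated.

```python
import numpy as np

lam = np.array([8.0, 3.0, 0.6, 0.4])   # hypothetical spectrum, sum = 12
evr = lam / lam.sum()
print("EVR_1, EVR_2:", np.round(evr[:2], 2))                    # 0.67, 0.25
print("cumulative EVR (k=2):", np.round(np.cumsum(evr)[1], 2))  # 0.92
```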

Pitfalls & sanity checks
  • Always center data; forgetting this is a common error.

  • Standardize features if they have different scales; otherwise PCA is dominated by high-variance features.

  • Sign ambiguity: eigenvectors are unique up to sign; do not compare raw signs across methods.

  • Negative eigenvalues: should not occur for a PSD covariance matrix; tiny negative values indicate round-off error, larger ones a centering or computation bug.

  • Reconstruction: as a sanity check, verify $\lVert X_c - X_c V_k V_k^\top \rVert_F^2 = n \sum_{i>k} \lambda_i$ (the variance in the discarded components).

  • Number of components: do not blindly choose $k$; use scree plot, cumulative variance, or cross-validation.
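The first two pitfalls are easy to observe numerically; in the sketch below (synthetic data with a deliberately large mean offset), skipping centering lets the mean masquerade as the dominant "component":

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 2)) + np.array([50.0, -20.0])  # large mean offset

lam_raw = np.linalg.eigvalsh(X.T @ X / n)[::-1]        # forgot to center
Xc = X - X.mean(axis=0)
lam_ctr = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]      # centered properly

print("uncentered:", np.round(lam_raw, 1))   # huge top value from the mean
print("centered:  ", np.round(lam_ctr, 2))   # O(1) variances
```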

References

Foundational work

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.

Classical theory and methods

  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.).

  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  5. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

Numerical algorithms

  6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  7. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  8. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding structure with randomness.

Extensions and applications

  9. Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Kernel principal component analysis.

  10. Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis.

  11. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.

  12. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  13. Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion.

  14. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.

Five worked examples

Worked Example 1: Computing PCA via eigen-decomposition and interpreting variance#

Introduction#

Compute PCA on a synthetic dataset via covariance eigendecomposition; examine explained variance and principal directions.

Purpose#

Build intuition for how eigenvalues quantify variance along principal axes.

Importance#

Diagnostics guide choice of number of components for downstream tasks.

What this example demonstrates#

  • Center data; compute covariance matrix.

  • Compute eigendecomposition of covariance.

  • Report eigenvalues, cumulative explained variance ratio, and principal component directions.

Background#

Hotelling (1933) formalized PCA as eigen-decomposition of covariance.

Historical context#

Foundational work in multivariate statistics; adapted widely in ML.

History#

Standard in all major statistical/ML libraries (R, Python, MATLAB).

Prevalence in ML#

Used routinely in preprocessing and exploratory analysis.

Notes#

  • Ensure data is centered; if not, covariance is inaccurate.

  • Eigenvalues are variances; eigenvectors are directions.

Connection to ML#

Explains what PCA extracts and why truncation works.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition is the Rayleigh quotient optimization.

Pedagogical Significance#

Links variance to eigenvalues concretely.

References#

  1. Hotelling, H. (1933). Analysis of complex statistical variables into principal components.

  2. Jolliffe, I. T. (2002). Principal Component Analysis.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
evals = evals[::-1]
evecs = evecs[:, ::-1]

cumsum_var = np.cumsum(evals) / evals.sum()

print("eigenvalues:", np.round(evals, 4))
print("explained variance ratio:", np.round(evals / evals.sum(), 4))
print("cumulative EVR (k=1,2,3):", np.round(cumsum_var[:3], 4))

Worked Example 2: PCA via SVD (numerically stable)#

Introduction#

Compute PCA using SVD of centered data instead of forming covariance matrix explicitly.

Purpose#

Show how SVD avoids squaring condition number; more numerically stable for ill-conditioned data.

Importance#

Standard in practice; avoids explicit covariance computation and is faster for tall data.

What this example demonstrates#

  • SVD of $X_c / \sqrt{n}$ yields principal components (columns of $V$) and singular values.

  • Squared singular values equal eigenvalues of $X_c^\top X_c / n$.

Background#

SVD is numerically more stable than eigen-decomposition of $X_c^\top X_c$.

Historical context#

Popularized in numerical linear algebra as the default PCA route.

History#

Standard in scikit-learn PCA class; uses SVD internally.

Prevalence in ML#

Default in modern PCA implementations.

Notes#

  • Use full_matrices=False for efficiency when $n \gg d$.

  • Singular values of $X_c$ relate to eigenvalues via $\lambda_i = s_i^2 / n$; with the SVD of $X_c / \sqrt{n}$ used here, $\lambda_i = s_i^2$ directly.

Connection to ML#

More robust to numerical issues; faster for large $n$.

Connection to Linear Algebra Theory#

SVD of $X_c$ relates directly to covariance eigen-structure.

Pedagogical Significance#

Bridges SVD (Chapter 10) and PCA practically.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals_from_svd = s ** 2

Sigma = (Xc.T @ Xc) / n
evals_from_eig = np.linalg.eigvalsh(Sigma)[::-1]

print("eigenvalues from eig:", np.round(evals_from_eig, 6))
print("eigenvalues from SVD:", np.round(evals_from_svd, 6))
print("difference:", np.linalg.norm(evals_from_eig - evals_from_svd))

Worked Example 3: Dimensionality reduction and reconstruction error#

Introduction#

Demonstrate PCA truncation to $k$ components; compare reconstruction error to variance lost.

Purpose#

Show how truncated PCA minimizes squared error (Eckart–Young); guide choice of $k$.

Importance#

Core to deciding how many components to keep in applications.

What this example demonstrates#

  • Compute full PCA; truncate to $k$ components.

  • Reconstruct and compute Frobenius error.

  • Verify error matches variance in discarded components.

Background#

Eckart–Young–Mirsky theorem guarantees optimality of rank-$k$ truncation.

Historical context#

Theoretical guarantee for best low-rank approximation.

History#

Used in all dimensionality reduction and compression workflows.

Prevalence in ML#

Standard choice heuristic for $k$: keep 90–95% explained variance.

Notes#

  • Reconstruction error squared equals sum of squared singular values of discarded components.

  • Trade-off: fewer components → less storage/compute, but more information loss.

Connection to ML#

Informs practical $k$ selection for downstream tasks.

Connection to Linear Algebra Theory#

Optimal low-rank approximation per Eckart–Young theorem.

Pedagogical Significance#

Links theory to practical dimensionality reduction.

References#

  1. Eckart & Young (1936). The approximation of one matrix by another of lower rank.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 10
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals = s ** 2 / n

k = 4
Xc_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
error_fro = np.linalg.norm(Xc - Xc_k, "fro")
tail_var = evals[k:].sum()  # variance in discarded components

ev_ratio = np.cumsum(evals) / evals.sum()
print("explained variance ratio (k=1..5):", np.round(ev_ratio[:5], 4))
print("reconstruction error:", round(error_fro, 4))
print("sqrt(n * tail variance):", round(np.sqrt(n * tail_var), 4))  # equals error_fro

Worked Example 4: Whitening and decorrelation#

Introduction#

Apply PCA whitening to standardize covariance; show uncorrelated and unit-variance output.

Purpose#

Demonstrate how whitening decorrelates features and enables downstream algorithms.

Importance#

Preprocessing step for many algorithms (kernels, RBF networks, distance-based methods).

What this example demonstrates#

  • Compute PCA; form whitening transform $Z_w = Z \Lambda^{-1/2}$.

  • Verify output covariance is identity.

  • Compare to standard scaling.

Background#

Whitening removes correlation structure and equalizes variance across dimensions.

Historical context#

Used in signal processing; adopted in deep learning for stabilization.

History#

LeCun et al. (1998) highlighted importance in early deep learning; Krizhevsky et al. (2012) used it in AlexNet.

Prevalence in ML#

Standard preprocessing in deep learning, kernel methods, and statistical tests.

Notes#

  • Add small floor to tiny eigenvalues to avoid division by zero.

  • Whitening can amplify noise if done naively on high-variance directions.

Connection to ML#

Improves convergence, gradient scales, and generalization of many algorithms.

Connection to Linear Algebra Theory#

Transform to canonical coordinate system (aligned with PCA axes).

Pedagogical Significance#

Practical application of PCA-based preprocessing.

References#

  1. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 150, 4
# Create correlated features
A = np.random.randn(d, d)
Cov = A.T @ A
X = np.random.randn(n, d) @ np.linalg.cholesky(Cov).T
Xc = X - X.mean(axis=0, keepdims=True)

evals, evecs = np.linalg.eigh((Xc.T @ Xc) / n)
evals = evals[::-1]
evecs = evecs[:, ::-1]

# Whitening transform
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
Z = Xc @ evecs
Zw = Z @ Lambda_inv_sqrt

# Verify output covariance is identity
Sigma_w = (Zw.T @ Zw) / n

print("input covariance diag:", np.round(np.diag((Xc.T @ Xc) / n), 4))
print("whitened covariance diag:", np.round(np.diag(Sigma_w), 4))
print("whitened covariance off-diag max:", round(np.max(np.abs(Sigma_w - np.eye(d))), 6))

Worked Example 5: Denoising via truncated PCA#

Introduction#

Apply truncated PCA to a noisy signal; show noise reduction as a function of truncation.

Purpose#

Illustrate how keeping top-$k$ components removes high-frequency (noisy) information.

Importance#

Core application in image denoising, signal processing, and data cleaning.

What this example demonstrates#

  • Add noise to data; apply truncated PCA for different $k$.

  • Measure reconstruction error vs. ground truth vs. noise level.

  • Show improvement from truncation.

Background#

Noise typically occupies low-variance directions; truncation removes it.

Historical context#

Classical application dating to Pearson (1901); widely used in signal/image processing.

History#

Precursor to modern deep denoising autoencoders.

Prevalence in ML#

Used in image inpainting, audio denoising, sensor data cleanup.

Notes#

  • Noise reduction works best if signal occupies few components.

  • Trade-off: lower $k$ → more denoising, but may remove true signal.

Connection to ML#

Improves feature quality for downstream models; common preprocessing.

Connection to Linear Algebra Theory#

Low-rank structure (signal) separated from noise via truncation.

Pedagogical Significance#

Demonstrates practical benefit of dimensionality reduction.

References#

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 100, 20
# True signal with low-rank structure
U_true, _ = np.linalg.qr(np.random.randn(n, 5))
V_true, _ = np.linalg.qr(np.random.randn(d, 5))
s_true = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X_clean = U_true @ np.diag(s_true) @ V_true.T

# Add noise
noise = 0.5 * np.random.randn(n, d)
X_noisy = X_clean + noise

# Center both the noisy data and the ground truth consistently
Xc_noisy = X_noisy - X_noisy.mean(axis=0, keepdims=True)
Xc_clean = X_clean - X_clean.mean(axis=0, keepdims=True)
U, s, Vt = np.linalg.svd(Xc_noisy, full_matrices=False)

errors = []
ks = range(1, 11)
for k in ks:
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
    errors.append(np.linalg.norm(Xc_clean - X_k, "fro"))

print("reconstruction error for k=1..5:", np.round(errors[:5], 4))
print("best k:", np.argmin(errors) + 1)
