Chapter 16
Matrix Products in DL & Transformers
Key ideas: Introduction

Introduction#

Neural networks are fundamentally stacks of matrix multiplications. A forward pass through a deep network is a product of weight matrices and activations. Each layer computes $A_{l+1} = \sigma(W_l A_l)$ where $W_l \in \mathbb{R}^{d_{l+1} \times d_l}$ and $A_l \in \mathbb{R}^{d_l \times n}$ (batch size $n$, layer input dimension $d_l$, output dimension $d_{l+1}$). The cost is $O(n d_l d_{l+1})$ per layer. Transformers add attention: $\text{Attention}(Q, K, V) = \text{softmax}(Q K^\top / \sqrt{d_k}) V$, which involves two GEMMs wrapped around a row-wise softmax, plus the GEMMs that project $Q$, $K$, and $V$. For a Transformer with sequence length $L$ and hidden dimension $d$, attention cost is $O(L^2 d)$ (quadratic in sequence length, a major bottleneck). Modern accelerators (GPUs, TPUs) are matrix-multiply engines delivering trillions of floating-point operations per second (TFLOP/s). Utilization depends on arithmetic intensity (FLOPs per byte): bandwidth-bound operations underutilize the accelerator; compute-bound operations (high arithmetic intensity) achieve near-peak performance. Understanding how to write matrix products that achieve high arithmetic intensity, and how to distribute them across devices, determines whether you can train billion-parameter models.
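The layer recurrence above is just a chain of GEMMs; a minimal NumPy sketch (layer widths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                          # batch size
dims = [784, 256, 128, 10]      # layer widths d_0, ..., d_3 (illustrative)
A = rng.standard_normal((dims[0], n))   # activations stored column-wise: (d_l, n)

for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # W_l is (d_{l+1}, d_l)
    A = np.maximum(W @ A, 0.0)  # A_{l+1} = sigma(W_l A_l) with ReLU; cost O(n d_l d_{l+1})

print(A.shape)  # (10, 64): final-layer activations for the whole batch
```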

Important ideas#

  1. Matrix-matrix multiplication (GEMM) structure

    • Dense GEMM: $C \leftarrow AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

    • Arithmetic: $2mkn = O(mkn)$ floating-point operations (FLOPs): one multiply and one add per term of each inner product.

    • Memory: $O(mk + kn + mn)$ words (inputs + output); on GPU, $m, k, n$ in the thousands already mean hundreds of MB to GB.

    • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes}} = \frac{2mkn}{8(mk + kn + mn)} = \frac{mkn}{4(mk + kn + mn)}$ for 8-byte words (higher is better); for $m = k = n$ this grows like $n/12$.

  2. Blocking and cache efficiency

    • GEMM blocked into $b \times b$ tiles; each tile multiplied using fast cache.

    • Cache line length (64 bytes typical); GEMM loads tile once, reuses it $O(b)$ times.

    • Roofline model: peak FLOP rate vs. memory bandwidth; if arithmetic intensity $< I_{\text{roof}}$, algorithm is bandwidth-bound.

  3. Batch matrix multiplication (batched GEMM)

    • Forward pass: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size).

    • Exploit parallelism: process multiple batches on multiple cores/GPU SMs.

    • Highly efficient when batch size is large; small batches underutilize accelerator.

  4. Convolution as matrix multiplication (im2col, Winograd)

    • Convolution unfolds as GEMM: reshape input patches into columns; multiply by filter matrix; reshape output.

    • im2col: input image to column matrix; allows use of highly optimized GEMM (cuBLAS, MKL).

    • Cost: $O(k_h k_w d_{\text{in}} d_{\text{out}} h_{\text{out}} w_{\text{out}})$ (kernel height/width, input/output channels, output spatial dims).

    • Winograd: fast convolution via a transformed domain; reduces arithmetic but is less numerically robust (larger intermediate values amplify rounding error).

  5. Scaled dot-product attention

    • Query-key-value paradigm: $Q, K \in \mathbb{R}^{L \times d_k}$, $V \in \mathbb{R}^{L \times d_v}$ (sequence length $L$, head dimensions $d_k$, $d_v$).

    • Attention: (1) $M = Q K^\top / \sqrt{d_k}$ (an $(L \times d_k)(d_k \times L)$ product, $O(L^2 d_k)$), (2) $A = \text{softmax}(M)$ (per-row normalization, no matrix products), (3) $O = AV$ (an $(L \times L)(L \times d_v)$ product, $O(L^2 d_v)$).

    • Total: $O(L^2 (d_k + d_v)) = O(L^2 d)$ (quadratic in sequence length).

    • Challenge: for $L = 4096$, the attention matrix has $L^2 \approx 16.8$M entries per head; across many heads and layers, FLOPs and memory reach into the billions.

  6. Mixed precision and numerical stability

    • FP32 (single precision, float32): 32 bits, ~7 significant digits; gradients, weights commonly stored in FP32.

    • FP16 (half precision, float16): 16 bits, ~4 significant digits; range $[6 \times 10^{-8}, 6 \times 10^4]$; GPU operations 2–10× faster.

    • BFloat16 (Brain Float): 16 bits, same exponent range as FP32, reduced mantissa; intermediate between FP32 and FP16.

    • Mixed precision: compute GEMM in FP16 (fast), accumulate in FP32 (stable); scale loss to prevent underflow.

  7. Distributed matrix multiplication

    • Data parallelism: replicate model on each device; partition minibatches; synchronize gradients via all-reduce.

    • Model parallelism: partition matrix weights across devices; communication within matrix product (e.g., matmul followed by communication).

    • Pipeline parallelism: partition layers across devices; overlap computation on layer $i$ with communication on layer $i-1$.

    • Cost: compute + communication latency; communication often dominates at large scale (a network-bandwidth analogue of the roofline trade-off).
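The all-reduce used in data parallelism can be sketched in NumPy (a toy tree all-reduce over a power-of-two device count; production frameworks use ring or hierarchical variants):

```python
import numpy as np

def tree_allreduce(grads):
    """Sum gradients across simulated devices in log2(D) pairwise rounds."""
    partial = [g.copy() for g in grads]
    D, step, rounds = len(partial), 1, 0
    while step < D:
        for i in range(0, D, 2 * step):
            partial[i] += partial[i + step]   # device i+step sends its partial sum to i
        step *= 2
        rounds += 1
    # Broadcast the full sum back to every device
    return [partial[0].copy() for _ in range(D)], rounds

rng = np.random.default_rng(7)
local_grads = [rng.standard_normal(5) for _ in range(8)]
reduced, rounds = tree_allreduce(local_grads)
print(rounds)  # 3 rounds for 8 devices
assert np.allclose(reduced[0], np.sum(local_grads, axis=0))
```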

Relevance to ML#

  • Convolutional neural networks (CNNs): Forward and backward passes are GEMM-heavy; efficiency determines whether you can train on billion-pixel images or video.

  • Recurrent neural networks (RNNs), LSTMs, GRUs: Fully-connected layers between timesteps; matrix products per timestep.

  • Transformers and large language models: Attention contributes $O(L^2 d)$ matrix products; for GPT-3 ($L = 2048$, $d = 12288$), the dense projection and MLP GEMMs dominate, with attention's share growing as context length increases.

  • Graph neural networks (GNNs): Graph convolution is sparse matrix product; efficiency depends on sparsity and format.

  • Distributed training: Modern LLMs trained on thousands of GPUs/TPUs; communication cost (network bandwidth) often exceeds computation cost.

Algorithmic development (milestones)#

  • 1969: Strassen algorithm: $O(n^{2.807})$ vs. $O(n^3)$ naive GEMM (theoretically significant; rarely used in practice due to constants).

  • 1979–1990: Level-1/2/3 BLAS (Basic Linear Algebra Subprograms); standardized interface for matrix ops; LAPACK (1992) built on BLAS.

  • 1995–2005: GPU era begins (NVIDIA GeForce, Tesla); GPUs offer roughly an order of magnitude more memory bandwidth than CPUs; GEMMs run 10–100× faster.

  • 2006: CUDA (Compute Unified Device Architecture) released; enables general-purpose GPU computing; cuBLAS optimized GEMM for NVIDIA GPUs.

  • 2012: AlexNet (Krizhevsky et al.) demonstrates deep CNN training on GPUs; FLOPs dominate; GEMM-heavy.

  • 2015: Batch normalization (Ioffe & Szegedy); reduces sensitivity to initialization and stabilizes training of very deep networks.

  • 2017: Transformer architecture (Vaswani et al.); attention is dense GEMM-based; quadratic in sequence length.

  • 2017–2018: Mixed precision training (Micikevicius et al. 2018); FP16 compute with FP32 accumulation and loss scaling yields 2–3× speedups on tensor-core GPUs.

  • 2018–2020: Distributed training frameworks mature (PyTorch DDP, Horovod); multi-billion-parameter models trained via model parallelism.

  • 2020–2023: Flash Attention (Dao et al. 2022) reduces attention memory traffic via tiled, IO-aware kernels; Megatron-LM and DeepSpeed enable distributed GEMMs at petaflop scale.

Definitions#

  • GEMM (General Matrix Multiply): $C \leftarrow \alpha A B + \beta C$ (standard matrix multiply with scaling/accumulation).

  • FLOP (floating-point operation): One add or multiply; GEMM $C \leftarrow AB$ is $2mkn$ FLOPs.

  • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes read/written}}$ (ops per byte); high $I$ means compute-bound; low $I$ means bandwidth-bound.

  • Roofline model: Peak achievable throughput = $\min(\text{peak FLOP rate}, \text{memory bandwidth} \times \text{arithmetic intensity})$.

  • Memory-bound: Algorithm where memory bandwidth is bottleneck; cannot achieve peak FLOP rate.

  • Compute-bound: Algorithm where compute is bottleneck; limited by FLOPs/cycle, not memory.

  • Mixed precision: Using multiple precision levels (e.g., FP16 for compute, FP32 for accumulation) to trade accuracy for speed.

  • All-reduce: Distributed operation: each device sums its values with all others; result replicated on all devices. Cost: $O(\log D)$ communication rounds for $D$ devices.

  • Collective communication: Broadcasting, all-reduce, reduce-scatter, all-gather operations in distributed training.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • GEMM arithmetic and complexity: $O(mkn)$ FLOPs, memory $O(mk + kn + mn)$ words. Reference: Golub & Van Loan (2013).

  • Arithmetic intensity and Roofline model: $I = \text{FLOPs/bytes}$; peak rate is $\min(\text{FLOP rate}, \text{bandwidth} \times I)$. Reference: Williams et al. (2009).

  • Cache-oblivious algorithms: Block-recursive GEMM achieves near-optimal cache behavior independent of cache size. Reference: Frigo et al. (1999).

  • Batched GEMM: Independent products $C_i \leftarrow A_i B_i$; parallelism across batch dimension. Reference: BLAS 3 standard (1990).

  • Attention complexity: Scaled dot-product $O(L^2 d)$ without optimizations; challenges for long contexts. Reference: Vaswani et al. (2017).

  • Distributed GEMM: Communication cost for gradient all-reduce, model/data parallelism. Reference: Thakur et al. (2005) (MPI Collective Communications).

Applied (landmark systems)#

  • Level-3 BLAS (cuBLAS, MKL): Industry-standard GEMM implementations; peak performance on CPUs/GPUs. Implementation: NVIDIA cuBLAS, Intel MKL. Reference: Dongarra et al. (1990) (BLAS 3).

  • Convolution as GEMM (im2col): Standard in deep learning frameworks (Caffe, PyTorch); enables reuse of optimized GEMM. Implementation: PyTorch conv2d uses im2col-style unfolding on CPU. Reference: Jia et al. (2014).

  • Mixed precision training: Automatic mixed precision in PyTorch (torch.cuda.amp), TensorFlow (tf.keras.mixed_precision). Achieves 2–3× speedup on V100/A100. Reference: NVIDIA Automatic Mixed Precision Training Guide (2020).

  • Distributed GEMM (Megatron-LM, DeepSpeed): Tensor parallelism partitions GEMM across devices; pipeline parallelism overlaps layers. Implementation: Microsoft DeepSpeed, NVIDIA Megatron-LM. Reference: Shoeybi et al. (2019); Rasley et al. (2020).

  • Flash Attention: IO-efficient attention via blocked matrix products; reduces HBM memory traffic by roughly an order of magnitude. Implementation: Tri Dao’s flash-attention library. Reference: Dao et al. (2022).

Key ideas: Where it shows up
  1. Convolutional neural networks and image classification

    • Forward pass: convolutional layers (im2col GEMM), batch norm (element-wise), pooling (no GEMM).

    • Backward: weight gradient, input gradient via GEMM.

    • Achievements: ResNet-50 trains on ImageNet in ~1 hour on 256 GPUs (Goyal et al. 2017); mixed precision further reduces per-step time by roughly 2×. References: Krizhevsky et al. (2012) (AlexNet); He et al. (2015) (ResNet); Goyal et al. (2017) (accurate large-batch SGD).

  2. Transformer models and large language models

    • Per-layer: QKV projections (3 GEMMs), attention (2 GEMMs), output projection (1 GEMM), MLP (2 GEMMs) ≈ 8 GEMMs per layer.

    • Attention cost: $O(L^2 d)$ (quadratic in sequence length); dominates for long sequences.

    • Achievements: GPT-3 (Brown et al. 2020) trained on ~300 billion tokens ($\approx 3 \times 10^{23}$ FLOPs), parallelized across thousands of GPUs using model parallelism. Flash Attention (Dao et al. 2022) reduces attention memory traffic by roughly 10×. References: Vaswani et al. (2017) (Transformer); Brown et al. (2020) (GPT-3); Dao et al. (2022) (Flash Attention).

  3. Distributed training and synchronization

    • Data parallelism: gradient all-reduce after each minibatch.

    • Model parallelism: activation and partial-sum exchanges within matrix products.

    • Achievements: the LAMB optimizer (You et al. 2019) enables batch sizes up to 32k, training BERT in 76 minutes on a TPU pod. Megatron-LM (Shoeybi et al. 2019) trains GPT models with tensor parallelism. References: You et al. (2019) (LAMB); Shoeybi et al. (2019) (Megatron-LM).

  4. Mixed precision training

    • Automatic mixed precision (AMP): dynamically select FP16/FP32 for operations.

    • Loss scaling: prevent FP16 gradient underflow.

    • Achievements: NVIDIA Automatic Mixed Precision reduces training time by 2–3× on V100/A100 while maintaining accuracy. References: NVIDIA Mixed Precision Training guide; Micikevicius et al. (2018).

  5. Graph neural networks and sparse matrix products

    • Graph convolution: $X' = \sigma(AXW)$ where $A$ is sparse adjacency matrix.

    • Sparse-dense GEMM: $O(\text{nnz}(A) \cdot d)$ FLOPs; arithmetic intensity is lower than dense GEMM, but the cost is feasible for sparse graphs.

    • Achievements: DGL, PyG enable billion-node GNNs via optimized sparse GEMMs. References: Kipf & Welling (2017) (GCN); Wang et al. (2019) (DGL); Fey et al. (2019) (PyG).
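The sparse graph-convolution product above can be sketched with SciPy (assuming SciPy is available; sizes are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
N, d = 1000, 16
A = sp.random(N, N, density=0.01, format="csr", random_state=42)  # sparse adjacency
X = rng.standard_normal((N, d))                                    # node features
W = rng.standard_normal((d, d))                                    # layer weights

# Order the product as A @ (X @ W): the dense X W costs O(N d^2),
# then the sparse-dense multiply costs O(nnz(A) * d)
H = A @ (X @ W)
print(A.nnz, H.shape)
```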

Notation
  • Matrix product: $C \leftarrow A B$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

  • Batched product: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size); vectorization across batch.

  • Attention: $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) V$ with $Q, K, V \in \mathbb{R}^{L \times d}$ (sequence length $L$, dimension $d$).

  • Complexity: Attention is $O(L^2 d)$ FLOPs; dense GEMM is $O(n d_{\text{in}} d_{\text{out}})$ per layer (batch size $n$).

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$ (depends on matrix shapes; higher $I$ achieves better GPU utilization).

  • FLOP rate: Peak $P$ (e.g., ~15.7 TFLOP/s for V100 in FP32); practical: $P \times \text{efficiency}$ (typically 50–80%).

  • Memory bandwidth: $B$ (e.g., 900 GB/s for V100 HBM2, ~2 TB/s for A100 HBM2e); roofline: achieved throughput $= \min(P, I \times B)$.

  • Example: ResNet-50 forward pass: ~4 GFLOPs per image; batch size 256 ≈ 1 TFLOP of work; at ~100 TFLOP/s sustained (mixed precision on A100), the forward pass takes ≈ 10 ms.
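The intensity and roofline formulas in this list can be checked numerically; a minimal sketch using FP32 (4-byte) words and V100-class peak/bandwidth figures:

```python
def arithmetic_intensity(m, k, n, bytes_per_word=4):
    """FLOPs per byte for C <- A B, counting one pass over A, B, and C (FP32)."""
    return (2 * m * k * n) / (bytes_per_word * (m * k + k * n + m * n))

def roofline_tflops(intensity, peak_tflops=15.7, bw_tb_s=0.9):
    """Attainable throughput = min(peak FLOP rate, bandwidth x intensity)."""
    return min(peak_tflops, bw_tb_s * intensity)

for m, k, n in [(4096, 4096, 4096), (8, 4096, 8)]:
    I = arithmetic_intensity(m, k, n)
    print((m, k, n), round(I, 1), round(roofline_tflops(I), 2))
# Square product: compute-bound, hits peak; tall-skinny: bandwidth-bound, far below peak
```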

Pitfalls & sanity checks
  • Batch size too small: GPUs underutilized; poor arithmetic intensity. Typical minimum: 32–64 per device.

  • Tall-skinny GEMM: Low arithmetic intensity; underutilizes the accelerator. Prefer square or batched products.

  • Ignoring data layout: Row-major vs. column-major affects cache performance by 10×.

  • Mixed precision without loss scaling: FP16 gradients below $\approx 6 \times 10^{-8}$ flush to zero; loss scaling prevents this (multiply the loss by e.g. $2^{16}$, unscale gradients before the optimizer step).

  • Attention without length limits: Quadratic memory; even with batch size 1, $L = 8192$ requires 256 MB for single head.

  • Synchronous all-reduce without compression: Communication time dominates; gradient compression (sparsification, quantization) essential at scale.

  • Assuming linear scaling: Communication cost breaks linear scaling; efficiency can fall from ~90% on a handful of devices to well below 50% on hundreds.

  • Convolution without im2col: Naive loops 100–1000× slower than GEMM-based implementation.
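The loss-scaling pitfall above can be reproduced directly in NumPy:

```python
import numpy as np

g = np.float32(1e-8)                  # a gradient representable in FP32
print(np.float16(g))                  # 0.0 -- flushed to zero in FP16 (min subnormal ~6e-8)

scale = np.float32(2.0 ** 16)
g_scaled = np.float16(g * scale)      # scaled into FP16's representable range
print(np.float32(g_scaled) / scale)   # ~1e-8 recovered after unscaling in FP32
```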

References

Matrix multiplication theory

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Strassen, V. (1969). Gaussian elimination is not optimal.

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Performance modeling and BLAS

  1. Dongarra, J., Du Croz, J., Duff, I. S., & Hammarling, S. (1990). A set of Level 3 basic linear algebra subprograms.

  2. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  3. Grigori, L., Demmel, J., & Xiang, H. (2008). Communication avoiding Gaussian elimination.

Deep learning and convolution

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.

  3. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

Transformer and attention

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners.

  3. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

Mixed precision and numerical stability

  1. Micikevicius, P., Narang, S., Alben, J., et al. (2018). Mixed precision training.

  2. Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift.

  3. NVIDIA Automatic Mixed Precision Training Guide. (2020).

Distributed training

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

  3. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  4. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

  5. You, Y., Gitman, I., & Ginsburg, B. (2019). Large batch optimization for deep learning: training BERT in 76 minutes.

Attention optimization

  1. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

  2. Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse transformers.

  3. Fan, A., Grave, E., & Joulin, A. (2019). Reducing transformer depth on demand with structured dropout.

Five worked examples

Worked Example 1: GEMM efficiency and arithmetic intensity#

Introduction#

Implement dense matrix multiplication on CPU and GPU; measure FLOP rate and memory bandwidth utilization; demonstrate how matrix shape affects arithmetic intensity.

Purpose#

Understand relationship between GEMM dimensions and arithmetic intensity; show how to achieve peak GPU performance.

Importance#

Foundation for understanding deep learning performance; shapes (batch size, hidden dimensions) directly impact training time.

What this example demonstrates#

  • Construct tall-skinny vs. square GEMM matrices.

  • Measure FLOPs and memory bandwidth for each.

  • Compute arithmetic intensity $I = \text{FLOPs/bytes}$.

  • Compare achieved FLOP rate vs. peak.

  • Predict speedup from roofline model.

Background#

GEMM efficiency depends on matrix shape: square matrices have high arithmetic intensity; tall-skinny have low intensity.

Historical context#

Roofline model (Williams et al. 2009) formalizes this trade-off; guides architecture and algorithm design.

History#

Standard framework for performance modeling in HPC and ML systems.

Prevalence in ML#

Every deep learning practitioner adjusts batch size, layer dimensions to maximize GPU utilization.

Notes#

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$; maximized when $m \approx k \approx n$ (cube).

  • For fixed $k$, varying $m, n$ (batch size, hidden dims) changes $I$ by 10×.

Connection to ML#

Batch size and hidden dimension choices affect both accuracy and training speed; understanding trade-offs is critical.

Connection to Linear Algebra Theory#

GEMM is fundamental linear algebra operation; efficiency is determined by cache locality (blocking theory).
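Blocking can be sketched in pure NumPy (a toy tiling; real BLAS kernels pick tile sizes to match cache levels and registers):

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    """Tiled GEMM: each b x b block of C accumulates products of b x b tiles of A and B."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, b):
        for j in range(0, n, b):
            for p in range(0, k, b):       # tiles of A and B are reused while cache-hot
                C[i:i+b, j:j+b] += A[i:i+b, p:p+b] @ B[p:p+b, j:j+b]
    return C

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 300))
B = rng.standard_normal((300, 150))
assert np.allclose(blocked_matmul(A, B), A @ B)   # same result as the unblocked product
```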

Pedagogical Significance#

Demonstrates practical performance modeling; connects theory (arithmetic intensity) to practice (measured FLOP rates).

References#

  1. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Solution (Python)#

import numpy as np
import time

np.random.seed(35)

# Test different matrix shapes (keeping k fixed)
k = 1024
shapes = [
    (128, k, 128),    # Tall-skinny-ish: low intensity
    (1024, k, 1024),  # Square: high intensity
    (4096, k, 4096),  # Large square: even higher
]

print("GEMM Efficiency Analysis")
print("=" * 80)
print(f"{'m x n':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Intensity':15} {'Est. GFLOPs':15}")
print("-" * 80)

for m, k_dim, n in shapes:
    # Arithmetic
    flops = 2 * m * k_dim * n
    # Memory: read A (m*k), read B (k*n), write C (m*n)
    mem_bytes = 8 * (m * k_dim + k_dim * n + m * n)
    intensity = flops / mem_bytes
    
    # Estimate performance from roofline
    # Assume: peak ~15.7 TFLOP/s (V100 FP32), bandwidth 900 GB/s (V100 HBM2)
    peak_flops = 15.7e12
    bandwidth = 900e9
    roofline = min(peak_flops, bandwidth * intensity)
    
    print(f"{f'{m}x{n}':15} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {intensity:>14.2f} {roofline/1e9:>14.1f}")

print("\n" + "=" * 80)
print("Key insight: Higher arithmetic intensity -> higher roofline GFLOPs")
print("Square matrices (m ~ k ~ n) achieve 10-100x higher intensity than tall-skinny")

Worked Example 2: Batched GEMM and GPU parallelism#

Introduction#

Implement batched matrix multiplication; measure performance as batch size varies; show speedup from batch parallelism.

Purpose#

Demonstrate how batch dimension enables parallelism; show relationship between batch size and GPU utilization.

Importance#

Batch size is a key hyperparameter; understanding its impact on performance guides training setup.

What this example demonstrates#

  • Generate batched matrices $A_i, B_i$ for $i = 1, \ldots, B$.

  • Time batched GEMM vs. sequential.

  • Measure speedup; show scaling with batch size.

  • Explain why small batches underutilize GPU.

Background#

GPUs have thousands of cores; small batches can’t keep all cores busy; large batches achieve better utilization.

Historical context#

Batch GEMM standardized in BLAS Level 3 (1990); essential for CNN/RNN training.

History#

Modern frameworks (PyTorch, TensorFlow) automatically batch GEMMs; rarely needs manual tuning.

Prevalence in ML#

Every training loop uses batched GEMM; batch size choice directly impacts throughput.

Notes#

  • Batch size $B = 1$: each GEMM is independent; throughput limited.

  • $B = 32$: better utilization; GPUs have 80+ SMs (streaming multiprocessors).

  • $B = 256$: excellent utilization; typical for modern training.

Connection to ML#

Batch size affects both convergence (larger batches can have worse generalization) and speed; practical sweet spot is usually 32–256.

Connection to Linear Algebra Theory#

Batched GEMM exploits structure (independent problems); vectorization across batch dimension.
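In NumPy the batch dimension is expressed directly: np.matmul treats leading dimensions as a batch of independent products:

```python
import numpy as np

rng = np.random.default_rng(1)
Bsz, m, k, n = 16, 32, 48, 24
A = rng.standard_normal((Bsz, m, k))
Bm = rng.standard_normal((Bsz, k, n))

C = np.matmul(A, Bm)            # one call computes all Bsz independent GEMMs
print(C.shape)                  # (16, 32, 24)
assert np.allclose(C[3], A[3] @ Bm[3])   # each slice matches its standalone product
```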

Pedagogical Significance#

Shows interplay between algorithm structure and hardware parallelism.

References#

  1. Dongarra, J., Du Croz, J., Duff, I. S., & Hammarling, S. (1990). A set of Level 3 basic linear algebra subprograms.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  3. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

Solution (Python)#

import numpy as np
import time

np.random.seed(36)

# Batched GEMM performance
batch_sizes = [1, 4, 16, 64, 256]
m, k, n = 1024, 1024, 1024
iterations = 10

print("Batched GEMM Performance (m=k=n={}, {} iterations)".format(m, iterations))
print("=" * 60)
print(f"{'Batch Size':15} {'Total Time (s)':20} {'GFLOPs':15}")
print("-" * 60)

for B in batch_sizes:
    # Create batch of matrices
    A = np.random.randn(B, m, k).astype(np.float32)
    B_mat = np.random.randn(B, k, n).astype(np.float32)
    
    # Batched matmul (sequential in Python; normally GPU would parallelize)
    t0 = time.time()
    for _ in range(iterations):
        C = np.matmul(A, B_mat)
    t_total = time.time() - t0
    
    # FLOPs: 2mkn per batch, B batches, iterations
    flops = iterations * B * 2 * m * k * n
    gflops = flops / (t_total * 1e9)
    
    print(f"{B:>14} {t_total:>19.4f} {gflops:>14.1f}")

print("\n" + "=" * 60)
print("Note: Larger batch sizes achieve higher GFLOPs due to better parallelism")

Worked Example 3: Convolution as GEMM (im2col)#

Introduction#

Implement convolution using naive loops, then via im2col GEMM; measure speedup from optimized GEMM.

Purpose#

Show how convolution is equivalent to matrix multiplication; demonstrate efficiency gain from reusing optimized GEMM.

Importance#

Foundational for understanding why GPUs excel at CNNs; im2col is standard in production frameworks.

What this example demonstrates#

  • Naive 5-loop convolution implementation.

  • im2col transformation: reshape patches into columns.

  • GEMM on im2col matrix; reshape output.

  • Compare naive vs. GEMM time.

Background#

Convolution unfolds into GEMM; allows reuse of highly-tuned BLAS kernels; 10–100× speedup.

Historical context#

im2col-based convolution was popularized by early deep learning frameworks (Caffe, 2013–2014), building on earlier GEMM-based convolution work.

History#

Standard in all deep learning frameworks; sometimes augmented by Winograd for further speedup.

Prevalence in ML#

Every CNN implementation uses im2col or similar GEMM-based convolution.

Notes#

  • im2col memory overhead: factors of 2–4× larger than direct convolution; trade memory for speed.

    • Winograd convolution (for $3 \times 3$ kernels): lower arithmetic cost but reduced numerical stability.

Connection to ML#

Convolutional layers dominate image classification and detection models; efficiency here directly impacts training speed.

Connection to Linear Algebra Theory#

Convolution is linear transformation; im2col exploits structure to reduce to GEMM.

Pedagogical Significance#

Demonstrates how abstract operations (convolution) map to concrete linear algebra (GEMM).

References#

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

  3. Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks.

Solution (Python)#

import numpy as np
import time

np.random.seed(37)

# Convolution parameters
batch_size, in_height, in_width, in_channels = 32, 64, 64, 3
out_channels, kernel_h, kernel_w, stride = 16, 3, 3, 1
pad = 1

# Padded input
X_padded = np.pad(np.random.randn(batch_size, in_height, in_width, in_channels),
                   ((0,0), (pad,pad), (pad,pad), (0,0)), mode='constant')
W = np.random.randn(out_channels, kernel_h, kernel_w, in_channels)

# Output dimensions
out_height = (in_height + 2*pad - kernel_h) // stride + 1
out_width = (in_width + 2*pad - kernel_w) // stride + 1

# Naive convolution (slow)
print("Naive convolution (5-loop implementation):")
t0 = time.time()
Y_naive = np.zeros((batch_size, out_height, out_width, out_channels))
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            for c in range(out_channels):
                h_start = h * stride
                w_start = w * stride
                patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
                Y_naive[b, h, w, c] = np.sum(patch * W[c])
t_naive = time.time() - t0
print(f"  Time: {t_naive:.4f} s")

# im2col GEMM (fast)
print("\nim2col GEMM (optimized convolution):")
t0 = time.time()

# im2col: extract patches
X_col = np.zeros((batch_size * out_height * out_width, kernel_h * kernel_w * in_channels))
idx = 0
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            h_start = h * stride
            w_start = w * stride
            patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
            X_col[idx] = patch.reshape(-1)
            idx += 1

# Weight matrix (reshape filters)
W_mat = W.reshape(out_channels, -1).T  # (kernel_h*kernel_w*in_channels, out_channels)

# GEMM
Y_col = X_col @ W_mat  # (batch*out_h*out_w, out_channels)

# Reshape to output
Y_gemm = Y_col.reshape(batch_size, out_height, out_width, out_channels)

t_gemm = time.time() - t0
print(f"  Time: {t_gemm:.4f} s")

print(f"\nSpeedup: {t_naive / t_gemm:.1f}x")
print(f"Results match: {np.allclose(Y_naive, Y_gemm, atol=1e-5)}")

Worked Example 4: Scaled dot-product attention complexity#

Introduction#

Implement attention operation; measure memory and time complexity; show quadratic dependence on sequence length.

Purpose#

Understand why attention is a bottleneck for long sequences; motivate approximate attention methods.

Importance#

Attention scales as $O(L^2 d)$; for long sequences (4K tokens), this dominates; critical for efficiency research.

What this example demonstrates#

  • Implement attention: QK^T, softmax, output.

  • Measure memory (intermediate softmax matrix is $L \times L$).

  • Time scaling with $L$; show quadratic growth.

  • Compare attention time vs. other layers.

Background#

Quadratic attention complexity is fundamental limitation of transformer architecture; many proposed approximations.

Historical context#

Vaswani et al. (2017) introduce scaled dot-product attention; the quadratic cost was not initially a practical bottleneck because context lengths stayed near 512.

History#

Post-2020, attention optimization becomes major research area: Flash Attention, sparse attention, linear attention variants.

Prevalence in ML#

Every transformer model suffers from quadratic attention; common workaround is to limit context length or use approximations.

Notes#

  • Attention FLOPs: $2L^2 d$ (dominant for $L > d$).

    • Memory: $O(L^2)$ for the attention matrix; for $L = 4096$, the $L \times L$ matrix is ~64 MB per head in FP32.

Connection to ML#

Limiting context length ($L = 512$ vs. $L = 4096$) is common trade-off between expressiveness and efficiency.

Connection to Linear Algebra Theory#

The two attention GEMMs each carry an $L \times L$ operand, so cost and memory scale quadratically in sequence length.

Pedagogical Significance#

Shows concrete example of how algorithmic bottleneck (quadratic) impacts practical ML.

References#

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

  3. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

Solution (Python)#

import numpy as np
import time

np.random.seed(38)

# Attention parameters
d = 768  # Hidden dimension
num_heads = 12
d_k = d // num_heads
L_values = [128, 256, 512, 1024, 2048]  # Sequence lengths

print("Attention Complexity Analysis (d={}, num_heads={})".format(d, num_heads))
print("=" * 70)
print(f"{'Seq Len L':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Time (ms)':15}")
print("-" * 70)

for L in L_values:
    batch_size = 1
    
    # Create Q, K, V
    Q = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    K = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    V = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    
    # Measure time and memory
    t0 = time.time()
    
    # Attention: QK^T / sqrt(d_k)
    scores = np.matmul(Q, K.transpose(0, 1, 3, 2))  # (batch, heads, L, L)
    scores = scores / np.sqrt(d_k)
    
    # Softmax
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    
    # Output
    output = np.matmul(weights, V)  # (batch, heads, L, d_k)
    
    t_attn = time.time() - t0
    
    # FLOPs: QK^T = 2*L^2*d_k, softmax ~L^2, output = 2*L^2*d_k
    flops = batch_size * num_heads * (2 * L * L * d_k + 2 * L * L * d_k)
    
    # Memory: scores matrix is L x L per head
    mem_bytes = batch_size * num_heads * L * L * 4
    
    print(f"{L:>14} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {t_attn*1e3:>14.2f}")

print("\n" + "=" * 70)
print("Key insight: FLOPs and memory scale quadratically with sequence length")
print("For L=4096 with 12 heads, the FP32 attention matrices alone occupy ~0.8 GB per sequence")

Worked Example 5: Distributed GEMM and communication cost#

Introduction#

Implement data parallel training with gradient synchronization; measure computation vs. communication time; show communication overhead.

Purpose#

Understand communication bottleneck in distributed training; motivate communication-efficient algorithms.

Importance#

Modern LLMs trained on 1000s of GPUs; communication often dominates; critical for scaling.

What this example demonstrates#

  • Simulate distributed GEMM (matmul on local device).

  • Simulate all-reduce for gradient synchronization.

  • Measure computation time vs. communication time.

  • Show how communication latency scales with number of devices.

Background#

Distributed training divides minibatches across devices; after each minibatch, devices exchange gradients via all-reduce.

Historical context#

Large-batch SGD and gradient compression (2017–2019) driven by communication bottleneck.

History#

Modern frameworks (PyTorch DDP, Horovod) optimize communication; mixed precision + gradient compression reduce overhead.

Prevalence in ML#

Every distributed training run uses all-reduce; communication cost is a well-studied bottleneck.

Notes#

  • Computation time: $O(B \cdot d_{\text{in}} \cdot d_{\text{out}})$ (linear in batch size, dimensions).

  • Communication time: $O(\alpha \log D + \beta G)$ for gradient size $G$: a latency term logarithmic in device count $D$ plus a bandwidth term linear in gradient bytes.

  • For 1000 devices: all-reduce with $\log D \approx 10$ rounds; if each round takes 10 μs, total ~100 μs; computation often takes ms.
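The two cost terms above can be captured in a tiny alpha-beta (latency-bandwidth) cost model. A minimal sketch; the constants (10 μs per round, 25 GB/s) are illustrative assumptions, not measured hardware:

```python
import math

def allreduce_time(D, msg_bytes, alpha=10e-6, beta=1 / 25e9):
    """Estimate all-reduce time with an alpha-beta model.

    alpha: per-round latency (s); beta: seconds per byte.
    Tree all-reduce: ~log2(D) rounds, each moving the full message.
    (Simplified model; ring all-reduce has different constants.)
    """
    if D <= 1:
        return 0.0  # single device: no communication
    rounds = math.ceil(math.log2(D))
    return rounds * (alpha + beta * msg_bytes)

# 1000 devices, tiny 4 KB message: latency-dominated (~100 us, matching the note above)
t_small = allreduce_time(1000, 4_096)
# Same device count, 100 MB gradient: bandwidth-dominated
t_large = allreduce_time(1000, 100_000_000)
print(f"small message: {t_small*1e6:.0f} us, large gradient: {t_large*1e3:.1f} ms")
```

For small messages the $\alpha \log D$ latency term dominates; for full weight gradients the $\beta G$ bandwidth term does, which is why gradient compression targets $G$.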

Connection to ML#

Large-batch training requires communication efficiency; gradient compression and other tricks essential for practical scaling.

Connection to Linear Algebra Theory#

All-reduce is a collective communication; tree-based implementations achieve the latency-optimal $O(\log D)$ round complexity.

Pedagogical Significance#

Shows distributed systems aspect of linear algebra; explains why scaling beyond certain point is challenging.

References#

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  3. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

Solution (Python)#

import numpy as np
import time

np.random.seed(39)

# Distributed training simulation
num_devices = [1, 4, 8, 16, 32]
batch_size = 256
hidden_dim = 2048

print("Distributed GEMM: Computation vs. Communication")
print("=" * 70)
print(f"{'Devices':15} {'Comp Time (ms)':20} {'Comm Time (μs)':20} {'Comp/Comm Ratio':15}")
print("-" * 70)

# Assume (illustrative constants, not measured hardware):
# - Effective compute: 100 GFLOP/s per device
# - Interconnect bandwidth: 25 GB/s

compute_flops_per_device = 100e9  # FLOP/s
comm_bandwidth = 25e9  # bytes/s (25 GB/s)

for D in num_devices:
    # Local batch per device
    local_batch = batch_size // D

    # GEMM: local_batch x hidden_dim x hidden_dim
    flops_local = 2 * local_batch * hidden_dim * hidden_dim

    # Computation time
    t_compute = flops_local / compute_flops_per_device

    # Communication: all-reduce of the layer's weight gradient
    # Tree all-reduce: O(log D) communication rounds (simplified model)
    comm_rounds = int(np.log2(D)) + 1
    gradient_size = hidden_dim * hidden_dim * 4  # bytes (FP32 weight matrix)
    comm_per_round = gradient_size / comm_bandwidth
    t_comm = comm_rounds * comm_per_round

    ratio = t_compute / t_comm

    print(f"{D:>14} {t_compute*1e3:>19.3f} {t_comm*1e6:>19.2f} {ratio:>14.1f}x")

print("\n" + "=" * 70)
print("Key insight: communication becomes the bottleneck at scale")
print("Per-device compute shrinks as 1/D while all-reduce cost grows with log D,")
print("so the compute/comm ratio falls -> parallel efficiency degrades")


Chapter 15
Sparse Matrices
Key ideas: Introduction

Introduction#

Most real-world data matrices are sparse: social networks are sparse graphs; text is sparse (million-word vocabulary but document has hundreds of words); images are sparse in wavelet domains; scientific simulations generate sparse discretizations. Dense matrices require $O(n^2)$ memory and $O(n^3)$ computation; sparse matrices require only $O(\text{nnz})$. A sparse matrix with $\text{nnz} = O(n)$ or $O(n \log n)$ reduces complexity from cubic to linear or nearly linear. This is transformational: it enables algorithms that would be intractable on dense representations. Exploiting sparsity requires careful algorithm design: level-of-fill control, reordering strategies, specialized data structures, and communication patterns in distributed solvers.

Important ideas#

  1. Sparse matrix representations

    • Compressed Sparse Row (CSR): store nonzeros by row; three arrays (values, column indices, row pointers); $O(\text{nnz} + n)$ memory.

    • Compressed Sparse Column (CSC): symmetric role for columns; efficient for column access.

    • Coordinate format (COO): list of $(i, j, v)$ triples; inefficient for computation but flexible for construction.

    • Diagonal, tridiagonal, block-structured: exploit special structure for even greater efficiency.

  2. Sparsity patterns and fill-in

    • Reachability: nonzero $(i, j)$ entry means path from $i$ to $j$ in graph; reveals communication pattern.

    • Fill-in: $L, U$ in $A = LU$ can have more nonzeros than $A$; reordering (minimum degree, nested dissection) minimizes fill-in.

    • Symbolic factorization: determine nonzero pattern of $L, U$ without computing values.

  3. Matrix-vector products and efficient computation

    • Dense $y = A x$ costs $O(n^2)$; sparse costs $O(\text{nnz})$.

    • SpMV (sparse matrix-vector product) is bandwidth-limited on modern hardware; careful data layout crucial.

    • GPU sparse kernels: NVIDIA cuSPARSE; CPU sparse kernels: MKL Sparse BLAS, OpenMP parallelization.

  4. Direct sparse solvers

    • LU with partial pivoting: reorder to minimize fill-in, apply pivoting.

    • Cholesky for SPD: symmetric structure implies symmetric reordering; often very sparse.

    • Supernodal methods: group small dense blocks for cache efficiency; key to modern solvers (PARDISO, SuperLU, UMFPACK).

  5. Iterative sparse solvers

    • CG, GMRES, MINRES: require only matrix-vector products; no explicit factorization.

    • Preconditioning essential: ILU (incomplete LU), IC (incomplete Cholesky), algebraic multigrid (AMG) enable convergence.

    • Matrix-free methods: supply only matrix-vector product function; never form or store matrix.

  6. Graph algorithms and adjacency matrices

    • Adjacency matrix $A$ where $A_{ij} = 1$ if edge $(i, j)$ exists; number of nonzeros = $2 \times$ edges (undirected).

    • Shortest paths, PageRank, diffusion: powers of the adjacency matrix encode reachability; computed via repeated SpMV.

    • Laplacian matrix $L = D - A$ (degree diagonal minus adjacency); eigenvectors yield graph partitions (spectral clustering).

  7. Storage and computational complexity trade-offs

    • Dense $A \in \mathbb{R}^{n \times n}$: $O(n^2)$ memory, $O(n^3)$ LU, $O(n^2)$ matrix-vector product.

    • Sparse $A$ with $\text{nnz} = O(n)$: $O(n)$ memory, $O(n^{1.5})$ LU (nested-dissection reordering, 2D-type problems), $O(n)$ SpMV.

    • For $n = 10^6$: dense requires terabytes ($8 \times 10^{12}$ bytes in FP64); sparse with $\text{nnz} = 10^7$ fits in roughly 100 MB (CSR).
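The CSR layout behind these numbers can be made concrete with a minimal SciPy sketch showing the three arrays and an $O(\text{nnz})$ matrix-vector product:

```python
import numpy as np
import scipy.sparse as sp

# A small matrix with 4 nonzeros:
# [[10,  0,  0],
#  [ 0, 20, 30],
#  [ 0,  0, 40]]
A = sp.csr_matrix(np.array([[10, 0, 0],
                            [0, 20, 30],
                            [0, 0, 40]]))

print("values  :", A.data)     # nonzero values in row-major order
print("col idx :", A.indices)  # column index of each value
print("row ptr :", A.indptr)   # row i occupies data[indptr[i]:indptr[i+1]]

# SpMV touches only the nnz stored entries: O(nnz) work, not O(n^2)
x = np.array([1.0, 2.0, 3.0])
print("A @ x   :", A @ x)
```

Here `values = [10, 20, 30, 40]`, `col idx = [0, 1, 2, 2]`, and `row ptr = [0, 1, 3, 4]`: row pointers delimit each row's slice of the other two arrays.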

Relevance to ML#

  • Text and NLP: Document-term matrices are sparse (million words, hundreds per document); sparse SVD enables topic models.

  • Graphs and networks: Social networks, citation graphs, knowledge graphs are sparse; graph neural networks leverage adjacency matrix sparsity.

  • Recommender systems: User-item ratings sparse ($\sim 0.1\%$ observed); matrix completion via sparse factorization.

  • Inverse problems and imaging: Medical imaging, seismic inversion, tomography discretizations are sparse; sparse solvers essential.

  • Large-scale optimization: SGD and proximal methods operate on sparse data; distributed solvers exploit communication sparsity.

Algorithmic development (milestones)#

  • 1960s: Sparse matrix computation recognized as practical necessity in structural engineering (finite element analysis).

  • 1967: Tinney & Walker develop minimum degree ordering; George (1973) extends to nested dissection.

  • 1979: Duff develops efficient symbolic factorization algorithms.

  • 1980s: Supernodal methods developed (Ashcraft, Eisenstat); enable cache-efficient dense kernels within sparse framework.

  • 1990s: PARDISO, SuperLU, UMFPACK emerge as production sparse solvers.

  • 2000s: Algebraic multigrid (AMG) becomes standard preconditioner for sparse iterative methods.

  • 2005–2010: GPU sparse kernels (cuSPARSE) enable acceleration; sparse matrix formats adapt to GPU memory.

  • 2010s: Sparse deep learning (neural networks on graph-structured data); graph convolutional networks (GCN), graph attention networks (GAT) exploit adjacency matrix sparsity.

Definitions#

  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A) \ll n^2$ nonzeros; typically $\text{nnz} = O(n)$ or $O(n \log n)$.

  • Sparsity: $\text{sparsity} = 1 - \frac{\text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity} > 0.99$ is typical for ML.

  • Compressed Sparse Row (CSR): Three arrays: values[nnz], col[nnz], row_ptr[n+1]; row_ptr[i] points to start of row $i$ in values/col.

  • Fill-in: Nonzero entries created during factorization that were zero in original matrix.

  • Reordering: Permutation $P$ such that $P A P^\top$ has lower fill-in; minimum degree, nested dissection are common strategies.

  • Graph of a matrix: Undirected graph $G$ where vertices are matrix indices, edge $(i, j)$ exists iff $A_{ij} \ne 0$.

  • Degree: Vertex $i$ has degree = number of nonzeros in row/column $i$.

  • Laplacian matrix: $L = D - A$ where $D = \text{diag}(d_1, \ldots, d_n)$ is degree matrix, $A$ is adjacency matrix.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Sparse matrix representations: CSR, CSC, COO formats; trade-offs between storage and access patterns. Reference: Golub & Van Loan (2013).

  • Fill-in and reordering: Minimum degree, nested dissection reduce fill-in; symbolic factorization predicts nonzero pattern. Reference: George & Liu (1981).

  • SpMV complexity: $O(\text{nnz})$ time, bandwidth-limited on modern hardware. Reference: Williams et al. (2009).

  • Direct sparse solvers: LU with reordering, supernodal methods, pivot selection for stability. Reference: Davis (2006) (SuiteSparse); Schenk & Gärtner (2004) (PARDISO).

  • Iterative sparse solvers: CG, GMRES on Krylov subspaces; preconditioning essential (ILU, IC, AMG). Reference: Saad (2003); Briggs et al. (2000) (multigrid tutorial).

  • Graph Laplacian and spectral methods: Eigenvalues/eigenvectors of $L = D - A$ for clustering, diffusion. Reference: Spielman & Teng (2004).

Applied (landmark systems)#

  • Sparse SVD for dimensionality reduction: ARPACK for large sparse matrices; used in topic modeling (LSA, LDA). Implementation: scikit-learn TruncatedSVD. Reference: Deerwester et al. (1990).

  • Direct sparse solvers (production): PARDISO (Intel MKL), SuperLU (LBNL), UMFPACK (SuiteSparse). Deployed in finite element analysis, circuit simulation. Reference: Schenk & Gärtner (2004); Li (2005).

  • Iterative sparse solvers with preconditioning: GMRES + ILU in PETSc, Trilinos; ALS for matrix completion in Spark MLlib. Reference: Balay et al. (2021) (PETSc); Heroux et al. (2005) (Trilinos).

  • Graph neural networks: PyTorch Geometric adjacency matrix operations; sparse matrix multiplication for graph convolution. Reference: Kipf & Welling (2017).

  • Compressed sensing and sparse recovery: ADMM/FISTA for L1-regularized problems; applications in imaging, signal processing. Reference: Boyd et al. (2011); Beck & Teboulle (2009).

Key ideas: Where it shows up
  1. Text and NLP

    • Document-term matrices $X \in \mathbb{R}^{n \times d}$ with $n$ documents, $d$ vocabulary terms; typically $\text{nnz} = O(n \bar{w})$ for $\bar{w}$ words per document, $\bar{w} \ll d$.

    • Achievements: sparse SVD/NMF in topic modeling (LSA, LDA); scikit-learn TfidfVectorizer produces sparse matrices. References: Deerwester et al. (1990) (LSA); Blei et al. (2003) (LDA).

  2. Graph neural networks and social networks

    • Adjacency matrix $A \in \mathbb{R}^{n \times n}$ sparse; graph convolution $X' = \sigma(AXW)$ exploits sparsity for $O(\text{nnz} \cdot d)$ per layer vs. $O(n^2 d)$ dense.

    • Achievements: PyTorch Geometric, DGL, Spektral enable billion-node GNNs; GCN (Kipf & Welling 2017), GAT (Velickovic et al. 2018) standard. References: Kipf & Welling (2017); Velickovic et al. (2018).

  3. Recommender systems and matrix completion

    • Rating matrices $R \in \mathbb{R}^{m \times n}$ (users × items) extremely sparse ($\sim 0.01\%$ observed); matrix factorization via alternating least squares.

    • Achievements: Sparse SVD in the Netflix Prize; ALS in Spark MLlib scales to billions of ratings. References: Koren et al. (2009) (matrix factorization); Zhou et al. (2008) (ALS for the Netflix Prize).

  4. Inverse problems and sparse regularization

    • Measurement matrix $A$ in compressed sensing, tomography often sparse or low-rank. Sparse solutions via L1 regularization (LASSO).

    • Achievements: ADMM (alternating direction method of multipliers) for large-scale L1-regularized problems; FISTA (fast iterative shrinkage-thresholding). References: Boyd et al. (2011) (ADMM); Beck & Teboulle (2009) (FISTA).

  5. Distributed and federated learning

    • Distributed optimization exploits sparsity in communication: gradient compression, local SGD reduce communication. Sparse parameter updates.

    • Achievements: gradient sparsification in federated learning (TensorFlow Federated); communication-efficient distributed SGD (Kairouz et al. 2021). References: Kairouz et al. (2021) (federated learning).
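The graph-convolution pattern $X' = \sigma(AXW)$ from item 2 above can be sketched directly with SciPy sparse matrices. A minimal illustration with random data and untrained weights (no training loop, purely to show the sparse-times-dense cost structure):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d_in, d_out = 1000, 16, 8

# Sparse symmetric adjacency with self-loops (A_hat = A + I), ~10 edges per node
A = sp.random(n, n, density=10 / n, format='csr', random_state=0)
A = ((A + A.T) > 0).astype(np.float32) + sp.eye(n, format='csr', dtype=np.float32)

X = rng.standard_normal((n, d_in)).astype(np.float32)   # node features
W = rng.standard_normal((d_in, d_out)).astype(np.float32)  # layer weights

# One graph-convolution layer: the sparse matmul costs O(nnz * d_out),
# versus O(n^2 * d_out) if A were stored dense
H = np.maximum(A @ (X @ W), 0.0)  # ReLU(A X W)

print("output:", H.shape, "| nnz(A):", A.nnz, "vs dense n^2:", n * n)
```

Computing `X @ W` first keeps the sparse product at the small width $d_{\text{out}}$, which is the standard ordering in GCN implementations.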

Notation
  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A)$ nonzeros; typically $\text{nnz} \ll n^2$.

  • Sparsity: $\text{sparsity}(A) = \frac{n^2 - \text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity}(A) > 0.99$ typical.

  • CSR format: Values v[1..nnz], column indices j[1..nnz], row pointers p[1..n+1] with p[i+1] - p[i] = nonzeros in row $i$.

  • Adjacency matrix: $A \in \{0,1\}^{n \times n}$ where $A_{ij} = 1$ if edge $(i, j)$ exists; $\text{nnz}(A) = 2m$ for $m$ edges (undirected).

  • Degree matrix: $D = \text{diag}(d_1, \ldots, d_n)$ where $d_i = \sum_j A_{ij}$ (degree of vertex $i$).

  • Laplacian: $L = D - A$ (graph Laplacian); $L$ is PSD with $\lambda_1 = 0$ (constant eigenvector).

  • Fill-in ratio: $\text{fill}(L, U) = \frac{\text{nnz}(L) + \text{nnz}(U)}{\text{nnz}(A)}$ (factorization creates nonzeros).

  • Example: Document-term matrix for 100K documents, 1M vocabulary words, ~200 words per doc: $\text{nnz} \approx 100\text{K} \times 200 = 20\text{M}$; density $= 20\text{M} / 10^{11} = 2 \times 10^{-4} \approx 0.02\%$.
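The Laplacian facts listed above (PSD, $\lambda_1 = 0$ with a constant eigenvector) can be verified numerically on a small random graph; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Random symmetric adjacency with zero diagonal
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

eigvals = np.linalg.eigvalsh(L)  # ascending order
print("eigenvalues:", np.round(eigvals, 6))
print("smallest eigenvalue ~ 0:", abs(eigvals[0]) < 1e-8)
print("L @ ones = 0:", np.allclose(L @ np.ones(n), 0.0))
```

The all-ones vector is annihilated because each row of $L$ sums to zero (degree minus the row sum of $A$); all remaining eigenvalues are nonnegative, confirming PSD.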

Pitfalls & sanity checks
  • Never convert large sparse matrix to dense: Memory explosion; dense matrix for $n = 10^6$ requires terabytes.

  • Choose representation based on access pattern: CSR for row access/SpMV, CSC for column access/factorization.

  • Monitor fill-in in factorization: If fill-in ratio $> 100$, consider iterative solver + preconditioner instead.

  • Ensure reordering is applied: Production solvers (PARDISO, SuperLU) apply automatically; verify if using custom code.

  • Test on subset before scaling: Verify correctness on small sparse system before large deployment.

  • GPU sparse kernels have overhead: SpMV on GPU slower than CPU for small matrices; threshold typically $n > 1000$.

  • Preconditioning setup vs. solve trade-off: Single solve: skip preconditioning. Multiple solves with same matrix: preconditioning pays off.

  • Sparsity pattern changes: If $A$ has varying sparsity over iterations, symbolic factorization must be updated.

References

Foundational sparse matrix theory

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Sparse matrix representations and algorithms

  1. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  2. Tinney, W. F., & Walker, J. W. (1967). Direct solutions of sparse network equations by optimally ordered triangular factorization.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Sparse iterative solvers and preconditioning

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Ruge, J. W., & Stüben, K. (1987). Algebraic Multigrid (AMG).

Sparse linear algebra in scientific computing

  1. Balay, S., Abhyankar, S., Adams, M. F., et al. (2021). PETSc Users Manual.

  2. Heroux, M. A., Bartlett, R. A., Howle, V. E., et al. (2005). An overview of the Trilinos project.

  3. Li, X. S. (2005). An overview of SuperLU: Algorithms, implementation, and user interface.

Topic modeling and sparse SVD

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

  3. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

Graph neural networks and sparse operations

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Recommender systems and matrix completion

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  2. Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix Prize.

Optimization and sparse recovery

  1. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers.

  2. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems.

  3. Kairouz, P., McMahan, H. B., Avent, B., et al. (2021). Advances and Open Problems in Federated Learning.

Five worked examples

Worked Example 1: Sparse matrix representations and storage efficiency#

Introduction#

Construct a sparse matrix (adjacency matrix of a large graph); compare storage and computation time across dense, CSR, and COO formats; demonstrate memory/time trade-offs.

Purpose#

Illustrate practical efficiency gains of sparse formats; motivate algorithm selection based on access patterns.

Importance#

Foundation for understanding when sparse methods are worthwhile; shows memory savings enable billion-scale problems.

What this example demonstrates#

  • Generate sparse graph (e.g., scale-free network or grid).

  • Store as dense array, then convert to CSR, CSC, COO.

  • Measure memory footprint for each format.

  • Benchmark matrix-vector product on dense vs. sparse.

  • Plot time vs. sparsity level.

Background#

Dense storage is simple but wasteful for sparse matrices. Sparse formats trade memory for computational complexity in accessing elements.

Historical context#

Sparse matrix formats developed as computers gained memory constraints in 1960s–1970s (finite element analysis).

History#

CSR became the de facto standard; GPU support added via NVIDIA's cuSPARSE library.

Prevalence in ML#

Every large-scale ML system (Spark, Hadoop, TensorFlow) supports sparse matrices; default for text, graphs.

Notes#

  • CSR efficient for row access, SpMV; CSC efficient for column access, matrix-matrix products.

  • COO flexible for construction; convert to CSR/CSC before computation.

  • GPU memory bandwidth is bottleneck; careful format choice crucial.

Connection to ML#

Text vectorization produces sparse matrices; graphs produce sparse adjacency matrices. System must efficiently handle both.

Connection to Linear Algebra Theory#

Sparse matrix properties (connectivity, degree distribution) determine factorization cost and algorithm performance.

Pedagogical Significance#

Demonstrates performance trade-offs; motivates specialized algorithms.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(30)

# Create sparse adjacency matrix (random graph)
n = 10000
density = 0.001  # 0.1% density (99.9% sparse)
nnz = int(density * n * n)

# Generate random sparse matrix
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
data = np.ones(nnz)

# CSR format (efficient for SpMV)
A_csr = sp.csr_matrix((data, (rows, cols)), shape=(n, n))

# CSC format (efficient for column access)
A_csc = A_csr.tocsc()

# COO format (flexible)
A_coo = sp.coo_matrix((data, (rows, cols)), shape=(n, n))

# Dense format
A_dense = A_csr.toarray()

# Measure memory (use .nbytes of the underlying arrays)
mem_csr = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
mem_dense = A_dense.nbytes

print("Memory comparison for {}x{} matrix (density {:.4f}%)".format(n, n, 100*density))
print("  Dense: {:.2f} MB".format(mem_dense / 1e6))
print("  CSR: {:.2f} MB".format(mem_csr / 1e6))
print("  Savings: {:.1f}x".format(mem_dense / mem_csr))

# Benchmark matrix-vector product
x = np.random.randn(n)
iterations = 100

# Sparse SpMV
t0 = time.time()
for _ in range(iterations):
    y_sparse = A_csr @ x
t_sparse = time.time() - t0

# Dense gemv
t0 = time.time()
for _ in range(iterations):
    y_dense = A_dense @ x
t_dense = time.time() - t0

print("\nMatrix-vector product time ({} iterations):".format(iterations))
print("  Sparse: {:.4f} s".format(t_sparse))
print("  Dense: {:.4f} s".format(t_dense))
print("  Speedup: {:.1f}x".format(t_dense / t_sparse))

print("\nDensity: {:.4f}%".format(100 * A_csr.nnz / (n*n)))
print("NNZ: {}".format(A_csr.nnz))

Worked Example 2: Fill-in and reordering in sparse LU factorization#

Introduction#

Construct a sparse matrix; compute LU factorization without reordering and with minimum-degree reordering; compare fill-in and factorization time.

Purpose#

Demonstrate how reordering dramatically reduces fill-in and speeds factorization.

Importance#

Critical for practical sparse solvers; shows that problem structure matters.

What this example demonstrates#

  • Construct sparse system (banded, block-structured, or random).

  • Perform symbolic LU factorization (predict nonzero pattern).

  • Compute LU without reordering; measure fill-in ratio.

  • Apply minimum-degree reordering; compute LU again.

  • Compare factorization time and fill-in.

Background#

Fill-in (new nonzeros created during factorization) can destroy sparsity; reordering (permutation of rows/columns) reduces fill-in.

Historical context#

Tinney & Walker (1967) minimum degree; George (1973) nested dissection; standard in SPARSE BLAS era.

History#

All production sparse solvers (PARDISO, SuperLU, UMFPACK) include automatic reordering.

Prevalence in ML#

Used in sparse inverse problems, finite element discretizations; less visible in ML but critical for scalability.

Notes#

  • Minimum degree is heuristic; nested dissection is theoretically superior but more complex.

  • Fill-in ratio typically 5–100× for unordered; reordering reduces to 1–5×.

Connection to ML#

Sparse solve is bottleneck in iterative methods (Newton, interior-point algorithms). Reordering accelerates optimization.

Connection to Linear Algebra Theory#

Graph reachability (matrix powers) predicts fill-in; reduction is graph-theoretic problem.

Pedagogical Significance#

Shows interplay between linear algebra and graph algorithms.

References#

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
import time

np.random.seed(31)

# Create sparse tridiagonal system with extra off-band entries (mimics a discretized PDE)
n = 1000
A_banded = sp.diags([np.ones(n-1), -2*np.ones(n), np.ones(n-1)],
                     [-1, 0, 1], shape=(n, n), format='csr')
# Add random entries to create irregular pattern
rows = np.random.choice(n, 50)
cols = np.random.choice(n, 50)
A_banded = A_banded + 0.01 * sp.csr_matrix((np.ones(50), (rows, cols)), shape=(n, n))
A_banded.eliminate_zeros()

# Convert to CSC for LU (standard for sparse direct solvers)
A_csc = A_banded.tocsc()

# LU without reordering
print("LU factorization comparison (n={})".format(n))
print("\nNo reordering:")
t0 = time.time()
LU_no_reorder = spla.splu(A_csc, permc_spec='natural')
t_no_reorder = time.time() - t0
nnz_no_reorder = LU_no_reorder.nnz

# LU with COLAMD reordering (minimum degree heuristic)
print("\nWith COLAMD reordering:")
t0 = time.time()
LU_reorder = spla.splu(A_csc, permc_spec='COLAMD')
t_reorder = time.time() - t0
nnz_reorder = LU_reorder.nnz

print("\nResults:")
print("  Original NNZ: {}".format(A_csc.nnz))
print("  No reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_no_reorder, nnz_no_reorder / A_csc.nnz, t_no_reorder))
print("  With reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_reorder, nnz_reorder / A_csc.nnz, t_reorder))
print("  Fill-in reduction: {:.1f}x".format(nnz_no_reorder / nnz_reorder))
print("  Time improvement: {:.1f}x".format(t_no_reorder / t_reorder))

Worked Example 3: Sparse SVD for topic modeling (LSA)#

Introduction#

Apply truncated SVD to sparse document-term matrix; extract latent topics; compare sparse vs. dense SVD efficiency.

Purpose#

Show practical use of sparse linear algebra in NLP; demonstrate dimensionality reduction on real-world sparse data.

Importance#

Sparse SVD is standard in topic modeling; ARPACK enables scalability to billions of words.

What this example demonstrates#

  • Generate synthetic document-term matrix (sparse, realistic sparsity ~0.01%).

  • Compute truncated SVD via ARPACK (sparse).

  • Compute dense SVD for comparison (infeasible for large $n$).

  • Extract topics (top words per topic).

  • Measure reconstruction error vs. rank.

Background#

LSA (latent semantic analysis) applies SVD to tf-idf matrix; low-rank factors reveal latent topics.

Historical context#

Deerwester et al. (1990) pioneered LSA; ARPACK (Lehoucq et al. 1998) made scalable SVD practical.

History#

Standard in topic modeling pipelines; scikit-learn TruncatedSVD widely used.

Prevalence in ML#

Text mining, information retrieval, and recommendation systems routinely use sparse SVD.

Notes#

  • Truncated SVD (rank $k \ll n$) computes only $k$ largest singular values; much cheaper than full SVD.

  • Sparse matrix multiplication dominates cost; $O(k \cdot \text{nnz})$ per iteration.

Connection to ML#

Topics are interpretable; document-topic and topic-word matrices enable downstream tasks.

Connection to Linear Algebra Theory#

ARPACK uses Krylov subspaces and implicitly restarted Arnoldi; exploits SpMV efficiency.

Pedagogical Significance#

Demonstrates power of combining sparse formats with specialized algorithms (Krylov methods).

References#

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

  3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import time

np.random.seed(32)

# Synthetic document-term matrix (sparse)
n_docs, n_terms = 10000, 50000
nnz_per_doc = 200

# Generate sparse matrix (realistic sparsity)
data = []
row_idx = []
col_idx = []
for doc in range(n_docs):
    terms = np.random.choice(n_terms, nnz_per_doc, replace=False)
    counts = np.random.poisson(2, nnz_per_doc) + 1
    data.extend(counts)
    row_idx.extend([doc] * nnz_per_doc)
    col_idx.extend(terms)

X = sp.csr_matrix((data, (row_idx, col_idx)), shape=(n_docs, n_terms))

# Row-normalize to term frequencies (tf-idf variant) via diagonal scaling
X = X.astype(np.float32)
row_sums = np.array(X.sum(axis=1)).flatten()
X = sp.diags(1.0 / row_sums) @ X

print("Document-term matrix: {}x{}, sparsity: {:.4f}%".format(
    n_docs, n_terms, 100*(1 - X.nnz/(n_docs*n_terms))))

# Sparse SVD (ARPACK)
k = 50  # Number of topics
print("\nTruncated SVD (rank {})...".format(k))
t0 = time.time()
U_sparse, s_sparse, Vt_sparse = svds(X, k=k, which='LM')
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense SVD (for small subset for illustration)
print("\nFull dense SVD (on 100x1000 subset for illustration)...")
X_dense_small = X[:100, :1000].toarray()
t0 = time.time()
U_dense, s_dense, Vt_dense = np.linalg.svd(X_dense_small, full_matrices=False)
t_dense = time.time() - t0
print("  Time (100x1000): {:.4f} s".format(t_dense))
print("  Dense SVD on the full {}x{} matrix would need a ~2 GB dense array and far longer".format(n_docs, n_terms))

# Extract topics (top terms per topic)
print("\nTop 5 terms per topic (first 3 topics):")
for topic_idx in range(min(3, k)):
    top_terms = np.argsort(Vt_sparse[topic_idx])[-5:][::-1]
    print("  Topic {}: terms {}".format(topic_idx, top_terms))

# Reconstruction error
rank_list = [5, 10, 20, 50]
print("\nReconstruction error vs. rank:")
for r in rank_list:
    # svds returns singular values in ascending order; take the largest r
    U_r = U_sparse[:, -r:]
    s_r = s_sparse[-r:]
    Vt_r = Vt_sparse[-r:, :]
    X_recon = U_r @ np.diag(s_r) @ Vt_r
    X_top = X[:100, :].toarray()  # densify a small slice; np.linalg.norm does not accept sparse matrices
    err = np.linalg.norm(X_top - X_recon[:100]) / np.linalg.norm(X_top)
    print("  Rank {}: relative error {:.4f}".format(r, err))

Worked Example 4: Preconditioned iterative solver on sparse system#

Introduction#

Construct a large sparse SPD system (from discretized PDE); solve via unpreconditioned and preconditioned CG; demonstrate preconditioning impact.

Purpose#

Show how preconditioning enables convergence on sparse systems where direct solvers would be expensive.

Importance#

Practical approach for million-scale sparse systems; avoids expensive fill-in of direct LU.

What this example demonstrates#

  • Construct sparse Laplacian or discretized diffusion matrix.

  • Solve unpreconditioned CG; count iterations.

  • Construct ILU or AMG preconditioner.

  • Solve preconditioned CG; count iterations.

  • Compare iteration counts and wall-clock time.

Background#

Iterative solvers avoid factorization fill-in; preconditioning clusters spectrum for fast convergence.

Historical context#

Preconditioned CG developed 1970s–1980s; ILU standard; AMG emerged 1980s–1990s.

History#

Standard in PETSc, Trilinos, deal.II (finite element libraries); production code for billion-node problems.

Prevalence in ML#

Used in large-scale optimization (Newton, interior-point), graph algorithms (diffusion on networks).

Notes#

  • Preconditioning setup cost amortized over many solves.

  • ILU relatively cheap; AMG more expensive but better convergence for difficult problems.

Connection to ML#

Large-scale optimization (e.g., logistic regression on massive sparse data) requires preconditioned iterative methods.

Connection to Linear Algebra Theory#

CG iteration count $\approx \sqrt{\kappa}$; preconditioning reduces $\kappa$ to enable convergence.

Pedagogical Significance#

Demonstrates integration of sparsity, preconditioning, and iterative methods.

References#

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

np.random.seed(33)

# Create sparse 2D Poisson matrix (5-point stencil) on an m x m grid
m = 100
n = m * m  # 10000 unknowns
# 1D second-difference matrix; the 2D Laplacian is its Kronecker sum
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

# RHS
b = np.random.randn(n)

print("Sparse Laplacian system: {}x{}, nnz={}".format(n, n, A.nnz))
print("Condition number (theory): ~4 m^2 / pi^2 = {:.0f}".format(4 * m**2 / np.pi**2))

# Unpreconditioned CG
print("\nUnpreconditioned CG:")
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))

x_un, info_un = spla.cg(A, b, rtol=1e-6, callback=callback_un, maxiter=n)  # rtol= requires SciPy >= 1.12; older versions use tol=
print("  Iterations: {}".format(len(residuals_unprecond)))

# ILU preconditioner
print("\nBuilding ILU preconditioner...")
M_ilu = spla.spilu(A.tocsc())
M = spla.LinearOperator((n, n), matvec=M_ilu.solve)

print("Preconditioned CG:")
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))

x_pre, info_pre = spla.cg(A, b, M=M, rtol=1e-6, callback=callback_pre, maxiter=n)  # rtol= requires SciPy >= 1.12
print("  Iterations: {}".format(len(residuals_precond)))

print("\nSummary:")
print("  Unpreconditioned iterations: {}".format(len(residuals_unprecond)))
print("  Preconditioned iterations: {}".format(len(residuals_precond)))
print("  Speedup: {:.1f}x".format(len(residuals_unprecond) / len(residuals_precond)))
print("  Both converged:", info_un == 0 and info_pre == 0)

Worked Example 5: Graph neural network convolution on sparse adjacency matrix#

Introduction#

Construct a graph (sparse adjacency matrix); implement graph convolution layer; measure SpMM efficiency vs. dense multiplication.

Purpose#

Demonstrate sparse matrix operations in modern ML (graph neural networks).

Importance#

GNNs process billion-node graphs; sparsity is essential for feasibility.

What this example demonstrates#

  • Generate graph (scale-free, grid, or real network).

  • Build node feature matrix $X \in \mathbb{R}^{n \times d}$.

  • Implement sparse graph convolution: $X' = \sigma(A X W)$ with SpMM.

  • Compare sparse vs. dense GCN layer time/memory.

  • Verify output correctness.

Background#

Graph convolution aggregates features from neighbors via sparse $A$ multiplication.

Historical context#

Kipf & Welling (2017) GCN; PyTorch Geometric (2019) enables efficient sparse operations.

History#

Standard in graph learning frameworks (DGL, Spektral, PyTorch Geometric).

Prevalence in ML#

GNNs on social networks, citation graphs, molecular graphs; sparse adjacency matrix standard.

Notes#

  • SpMM (sparse-dense matrix product) is key operation; GPU acceleration available.

  • For $n$-node graph with $m$ edges and $d$ features: sparse is $O(m \cdot d)$, dense is $O(n^2 \cdot d)$.

Connection to ML#

GNNs enable learning on structured, graph-based data; scalable to billion-node networks via sparsity.

Connection to Linear Algebra Theory#

Graph convolution is polynomial filter on graph Laplacian; spectral perspective connects to Chebyshev polynomials.

Pedagogical Significance#

Integrates sparse linear algebra, graph theory, and modern deep learning.

References#

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(34)

# Generate a random graph (each node links to ~avg_degree uniform targets)
n = 10000  # Nodes
avg_degree = 10
edges = []
for i in range(n):
    # Uniform random targets (an Erdos-Renyi-style graph, not preferential attachment)
    targets = np.random.choice(n, min(avg_degree, n-1), replace=False)
    for j in targets:
        if i != j:
            edges.append((i, j))
            edges.append((j, i))  # Undirected

edges = np.array(edges)
A = sp.csr_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(n, n))
A.data[:] = 1.0  # binarize: the constructor summed duplicate edges

# Node features
d = 64  # Feature dimension
X = np.random.randn(n, d).astype(np.float32)

# Weight matrix (GCN parameters)
W = np.random.randn(d, d).astype(np.float32)

print("Graph: n={}, edges={}, sparsity={:.4f}%".format(
    n, A.nnz // 2, 100*(1 - A.nnz/(n*n))))
print("Features: n={}, d={}".format(n, d))

# Sparse graph convolution: X_out = A @ X @ W
print("\nSparse graph convolution:")
t0 = time.time()
X_intermediate = A @ X  # SpMM: sparse @ dense
X_out_sparse = X_intermediate @ W  # Dense @ dense
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense graph convolution (for comparison on subset)
print("\nDense graph convolution (on 1000-node subgraph):")
A_dense_sub = A[:1000, :1000].toarray()
X_sub = X[:1000]
t0 = time.time()
X_out_dense_sub = A_dense_sub @ X_sub @ W
t_dense_sub = time.time() - t0
print("  Time (1000 nodes): {:.4f} s".format(t_dense_sub))
# Extrapolate to full graph
t_dense_extrapolated = t_dense_sub * (n/1000)**2
print("  Extrapolated to full graph: {:.2f} s".format(t_dense_extrapolated))

print("\nSpeedup (dense vs. sparse): {:.1f}x".format(t_dense_extrapolated / t_sparse))

# Apply nonlinearity (ReLU)
X_out_sparse = np.maximum(X_out_sparse, 0)

print("\nGCN layer output shape:", X_out_sparse.shape)
print("Output mean:", np.mean(X_out_sparse))
print("Output sparsity after ReLU:", 100 * np.sum(X_out_sparse == 0) / X_out_sparse.size, "%")

Chapter 14
Conditioning & Stability
Key ideas: Introduction

Introduction#

Numerical stability determines whether an algorithm produces a correct answer. Machine arithmetic operates at finite precision (float64: ~16 significant digits). An ill-conditioned problem (large $\kappa$) is inherently difficult: small input perturbations cause large output changes. A backward-stable algorithm guarantees that the computed solution is the exact solution to a nearby problem. Forward error depends on conditioning; backward error depends on algorithm stability. Together, they predict what accuracy is achievable. This is fundamental to reliable ML: ill-conditioned matrices appear as nearly singular Gram matrices, in ill-posed inverse problems, and in deep network Hessians.

Important ideas#

  1. Condition number and sensitivity

    • $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

    • Forward error bound: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \cdot \frac{\| \Delta A \|}{\| A \|}$ (perturbation on $A$ causes error in $x$).

    • Ill-conditioned ($\kappa \gg 1$) problems amplify noise; well-conditioned ($\kappa \approx 1$) are benign.

  2. Machine precision and floating-point arithmetic

    • Machine epsilon: $\varepsilon_{\text{machine}} \approx 2^{-52} \approx 2.2 \times 10^{-16}$ for float64.

    • Relative error per operation: $O(\varepsilon_{\text{machine}})$; accumulated error over $n$ operations: $O(n \cdot \varepsilon_{\text{machine}})$.

    • Loss of significance: if $a \approx b$, then $a - b$ has few correct digits (cancellation).

  3. Backward error analysis and stability

    • Algorithm is backward-stable if computed solution $\tilde{x}$ is exact solution to perturbed problem: $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon_{\text{machine}})$.

    • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|} \approx \kappa(A) \cdot O(\varepsilon_{\text{machine}})$ for stable algorithm.

    • Unstable algorithm: backward error is large; forward error can be catastrophic.

  4. Condition number of Gram matrix vs. original matrix

    • $\kappa(A^\top A) = \kappa(A)^2$; normal equations square the condition number.

    • QR/SVD avoid this: $\kappa(R) = \kappa(A)$ (no squaring), $\kappa(\Sigma) = \kappa(A)$.

    • For least squares: direct method on $A$ (QR) preferable to solving $A^\top A x = A^\top b$.

  5. Preconditioning and effective conditioning

    • Preconditioner $M \approx A$ transforms $A \to M^{-1} A$; goal is $\kappa(M^{-1} A) \ll \kappa(A)$.

    • Incomplete LU (ILU), Jacobi (diagonal), algebraic multigrid (AMG) are practical preconditioners.

    • CG iteration count $\approx \sqrt{\kappa(M^{-1} A)}$; reducing $\kappa$ by factor 100 reduces iterations by factor 10.

  6. Regularization and ill-posedness

    • Ill-posed problem: small changes in data cause arbitrarily large changes in solution ($\kappa = \infty$).

    • Tikhonov regularization: $(A^\top A + \lambda I) x = A^\top b$ shifts small singular values by $\lambda$.

    • Effective $\kappa$ after regularization: $\kappa_{\text{reg}} = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$; well-chosen $\lambda$ reduces condition number.

  7. Numerical cancellation and loss of significance

    • Subtraction $a - b$ where $a \approx b$ cancels leading digits; result has low precision.

    • Orthogonal transformations (QR, SVD, Householder) avoid this; preserve angles and norms.

    • Modified Gram–Schmidt (MGS) is more stable than classical Gram–Schmidt because each column is orthogonalized against the already-updated columns; an explicit reorthogonalization pass improves either variant further.
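The ideas above can be poked at numerically. A compact sketch, numbered to match the list (the prescribed spectra, sizes, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# (2) Machine epsilon: the gap between 1.0 and the next representable float
eps = np.finfo(np.float64).eps
print("eps =", eps, "; 1 + eps/2 == 1:", 1.0 + eps / 2 == 1.0)

# (1), (3) Conditioning vs. backward stability for Ax = b
n = 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -8, n)) @ U.T   # SPD, kappa(A) = 1e8
x_true = rng.standard_normal(n)
b = A @ x_true
x = np.linalg.solve(A, b)                      # LU with partial pivoting
backward = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
forward = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print("backward error = {:.1e} (tiny); forward error = {:.1e} (<= kappa * backward)".format(
    backward, forward))

# (4) The Gram matrix squares the condition number
X = U[:, :10] @ np.diag(np.logspace(0, -4, 10))   # kappa(X) = 1e4
print("kappa(X) = {:.1e}, kappa(X^T X) = {:.1e}".format(
    np.linalg.cond(X), np.linalg.cond(X.T @ X)))

# (6) Tikhonov regularization shifts small singular values
sig_max, sig_min = 1.0, 1e-6
for lam in [0.0, 1e-8, 1e-4]:
    print("lambda = {:6.0e}: kappa_reg = {:.1e}".format(
        lam, (sig_max**2 + lam) / (sig_min**2 + lam)))

# (7) Classical vs. modified Gram-Schmidt on ill-conditioned columns
def gram_schmidt(A, modified=True):
    Q = A.copy()
    for j in range(A.shape[1]):
        for i in range(j):
            # MGS projects out the updated column; CGS uses the original one
            coeff = Q[:, i] @ (Q[:, j] if modified else A[:, j])
            Q[:, j] -= coeff * Q[:, i]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q

V20, _ = np.linalg.qr(rng.standard_normal((20, 20)))
M = U[:, :20] @ np.diag(np.logspace(0, -8, 20)) @ V20  # kappa(M) = 1e8
losses = []
for modified in (False, True):
    Q = gram_schmidt(M, modified)
    losses.append(np.linalg.norm(Q.T @ Q - np.eye(20)))
print("orthogonality loss: CGS = {:.1e}, MGS = {:.1e}".format(*losses))
```

The Worked Examples at the end of the chapter revisit each of these effects in more depth.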

Relevance to ML#

  • Linear regression and least squares: Gram matrix $X^\top X$ can be ill-conditioned if features are correlated; normal equations fail; QR/SVD required.

  • Kernel methods and Gaussian processes: Gram/covariance matrix condition number determines sample complexity and numerical stability of Cholesky.

  • Deep learning and Hessian conditioning: Neural network loss Hessian is often ill-conditioned; affects convergence rate, learning rate selection, and second-order optimization.

  • Regularization and implicit bias: Ill-posed learning problems ($\kappa = \infty$) require regularization (L2, dropout, early stopping); reduces effective $\kappa$.

  • Numerical robustness in production: Conditioning determines whether model training is reproducible, whether predictions are stable to input noise, and whether inference is safe.

Algorithmic development (milestones)#

  • 1947: von Neumann & Goldstine analyze numerical stability of Gaussian elimination; discover amplification by machine epsilon.

  • 1961: Wilkinson develops comprehensive framework for backward error analysis; introduces condition number formally.

  • 1965: Golub & Kahan recognize $\kappa(A^\top A) = \kappa(A)^2$ is catastrophic for normal equations; advocate for QR/SVD.

  • 1979: Dongarra, Bunch, Moler, & Stewart release LINPACK; implement numerically stable algorithms with careful pivot selection.

  • 1979–1990: Higham develops extended backward error analysis; proves stability of Cholesky, QR, modified Gram–Schmidt.

  • 1992: Arioli et al. develop a posteriori error bounds; can verify computed solution without knowing true solution.

  • 2000s: Preconditioning becomes standard in iterative solvers; algebraic multigrid (AMG) enables linear scaling.

  • 2010s: Automatic differentiation frameworks (TensorFlow, PyTorch) provide implicit solvers with backward stability analysis.

Definitions#

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); $\kappa_\infty(A) = \| A \|_\infty \| A^{-1} \|_\infty$ (operator norm).

  • Machine epsilon: $\varepsilon_{\text{machine}} = 2^{-(p-1)}$ for $p$-bit mantissa; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point number: Normalized form: $\pm m \times 2^e$ with $1 \le m < 2$ (hidden bit), $e \in [-1022, 1023]$.

  • Relative error: $\frac{\| \tilde{x} - x \|}{\| x \|}$ (dimensionless; independent of scaling).

  • Backward error: $\frac{\| \tilde{A} - A \|}{\| A \|}$ for perturbed matrix; small backward error $\Rightarrow$ computed $\tilde{x}$ is exact solution to nearby problem.

  • Forward error: $\frac{\| \tilde{x} - x \|}{\| x \|}$; depends on both backward error and conditioning: $\text{forward error} \approx \kappa(A) \times \text{backward error}$.

  • Ill-conditioned: $\kappa(A) \gg 1$; small perturbations cause large output changes.

  • Well-conditioned: $\kappa(A) \approx 1$; stable to perturbations.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Condition number definition and properties: $\kappa(A) = \sigma_1 / \sigma_n$; $\kappa(AB) \le \kappa(A) \kappa(B)$; $\kappa(A) \ge 1$. Reference: Golub & Van Loan (2013).

  • Perturbation theory and forward error bounds: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \frac{\| \Delta A \|}{\| A \|}$ (first-order). Reference: Wilkinson (1961); Higham (2002).

  • Backward error analysis: Algorithm is backward-stable if $\tilde{x}$ satisfies $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon)$. Reference: Higham (2002).

  • Machine precision and floating-point error: $\varepsilon_{\text{mach}} \approx 2^{-p}$; error per operation $O(\varepsilon)$; cumulative error over $n$ steps $O(n \varepsilon)$. Reference: IEEE 754 (2019); Goldberg (1991).

  • Condition number of Gram matrix: $\kappa(A^\top A) = \kappa(A)^2$; why normal equations are risky. Reference: Golub & Kahan (1965).

  • Preconditioning theory: Transform $A \to M^{-1} A$; goal $\kappa(M^{-1} A) \ll \kappa(A)$. Reference: Axelsson (1994).

  • Tikhonov regularization: Shift eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$; reduce effective condition number. Reference: Tikhonov (1963).

Applied (landmark systems)#

  • QR factorization for stable least squares: $\kappa(R) = \kappa(A)$ (no squaring); stable even for ill-conditioned $A$. Implementation: LAPACK dgeqrf; scikit-learn default solver. Reference: Golub & Kahan (1965); Golub & Van Loan (2013).

  • SVD for rank-deficient and ill-conditioned systems: Direct pseudoinverse computation; threshold small singular values. Implementation: LAPACK dgesvd; scipy.linalg.svd. Reference: Golub & Kahan (1965).

  • Cholesky with jitter for near-singular covariance: Add $\epsilon I$ if Cholesky fails; trade-off between accuracy and stability. Standard in GPyTorch, Pyro, Stan. Reference: Rasmussen & Williams (2006).

  • Modified Gram–Schmidt (MGS) and reorthogonalization: More stable than classical GS; a reorthogonalization pass ("twice is enough") restores orthogonality to working precision. Reference: Golub & Pereyra (1973); Yarlagadda et al. (1979).

  • Preconditioned conjugate gradient (PCG): L-BFGS (quasi-Newton), ILU preconditioning, algebraic multigrid reduce iteration count by factors 10–100. Standard in scipy.sparse.linalg, PETSc. Reference: Nocedal & Wright (2006); Saad (2003).

  • Ridge regression and elastic net for regularization: Manual $\lambda$ tuning or cross-validation; scikit-learn Ridge, ElasticNet. Reference: Hoerl & Kennard (1970); Zou & Hastie (2005).

Key ideas: Where it shows up
  1. Linear regression and least squares

    • Gram matrix $X^\top X$ condition number determines numerical stability of normal equations.

    • Achievements: QR/SVD solvers in scikit-learn, statsmodels ensure stability; warning when $\kappa > 10^{10}$. References: Hastie et al. (2009); Golub & Kahan (1965).

  2. Kernel methods and Gaussian processes

    • Covariance/kernel matrix $K$ condition number governs Cholesky stability and marginal likelihood evaluation.

    • Achievements: Cholesky with jitter in Stan/PyMC3; sparse GP approximations circumvent $O(n^3)$ and ill-conditioning. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005).

  3. Deep learning and second-order optimization

    • Hessian condition number determines convergence rate of Newton/quasi-Newton methods; affects learning rate and step size.

    • Achievements: L-BFGS (quasi-Newton) in PyTorch/TensorFlow adapts to Hessian conditioning; natural gradient methods rescale by Fisher information. References: Martens & Grosse (2015); Nocedal & Wright (2006).

  4. Regularization and implicit bias

    • Ill-posed learning problems require regularization; L2 penalty (ridge) shifts eigenvalues, reducing effective condition number.

    • Achievements: elastic net (L1+L2) in scikit-learn balances sparsity and conditioning; early stopping in SGD acts as implicit regularization. References: Tibshirani (1996); Zou & Hastie (2005); Zhu et al. (2021).

  5. Numerical robustness and reliability

    • Production ML systems must be stable to input noise, feature preprocessing, and computational precision.

    • Achievements: feature standardization reduces condition number; model monitoring tracks numerical stability (checks for NaN/Inf); validation on adversarial perturbations. References: Goodfellow et al. (2014) (adversarial robustness); LeCun et al. (1998) (feature normalization).

Notation
  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)} = \sqrt{\frac{\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)}}$; for SPD $A$, $\kappa(A) = \lambda_{\max}(A) / \lambda_{\min}(A)$.

  • Machine epsilon: $\varepsilon_{\text{mach}}$ or $\text{eps}$ in code; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point relative error: $\tilde{a} = a(1 + \delta)$ where $|\delta| \le \varepsilon_{\text{mach}}$.

  • Perturbation: $\tilde{A} = A + \Delta A$ with $\| \Delta A \| \le \varepsilon \| A \|$.

  • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$ (computed vs. true solution).

  • Backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$ (residual relative to data norm).

  • Norm: $\| A \|_2 = \sigma_1(A)$ (spectral norm); $\| A \|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius).

  • Example: For $A \in \mathbb{R}^{100 \times 100}$ with $\sigma_1 = 10, \sigma_{100} = 0.01$: $\kappa(A) = 1000$. Relative error in solving $Ax = b$ is amplified by factor $\approx 1000$; if input has error $10^{-15}$, output error $\approx 10^{-12}$.
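The closing example can be verified in a few lines (the random orthogonal factors are an illustrative construction for a matrix with the stated singular values):

```python
import numpy as np

rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((100, 100)))
V, _ = np.linalg.qr(rng.standard_normal((100, 100)))
s = np.linspace(10.0, 0.01, 100)       # sigma_1 = 10, sigma_100 = 0.01
A = U @ np.diag(s) @ V.T               # 100 x 100 with prescribed spectrum

print(np.linalg.cond(A))               # ~1000.0 = 10 / 0.01
# Achievable relative error is roughly eps * kappa(A), here ~2e-13
print(np.finfo(np.float64).eps * np.linalg.cond(A))
```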

Pitfalls & sanity checks
  • Never ignore condition number: If $\kappa(A) > 10^{8}$, accuracy is compromised; either regularize or use higher precision.

  • Use QR/SVD for least squares, never normal equations (unless you’ve verified conditioning): $\kappa(X^\top X) = \kappa(X)^2$ is catastrophic.

  • Check for catastrophic cancellation: Subtraction of nearly equal numbers loses precision; use orthogonal algorithms instead.

  • Machine precision is absolute limit: Even with perfect algorithm, relative error $\gtrsim \varepsilon_{\text{mach}} \times \kappa(A)$ is unavoidable.

  • Backward error bounds are meaningful: Small backward error $\Rightarrow$ computed solution is exact for nearby problem (reassuring).

  • Regularization parameter selection matters: Cross-validation, L-curve, or discrepancy principle; manual tuning often fails.

  • Feature scaling affects conditioning: Standardize features before regression; poor scaling inflates condition number unnecessarily.

  • Preconditioning amortized over multiple solves: Single solve: preconditioning setup not worth it; multiple solves: huge benefit.
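The feature-scaling pitfall in particular is easy to demonstrate (the scale factors below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
# Independent features measured on wildly different scales
X = rng.standard_normal((n, 3)) * np.array([1.0, 1e-6, 1e3])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize: mean 0, std 1
print("kappa(X)    : {:.1e}".format(np.linalg.cond(X)))      # inflated purely by scaling
print("kappa(X_std): {:.1e}".format(np.linalg.cond(X_std)))  # close to 1
```

The underlying features are independent, so the huge condition number of `X` is an artifact of units, and standardization removes it entirely.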

References

Foundational backward error analysis

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

  3. IEEE 754. (2019). IEEE Standard for Floating-Point Arithmetic.

Classical numerical linear algebra

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Comprehensive stability analysis

  1. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  2. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix.

  3. Givens, W. (1958). Computation of plane unitary rotations transforming a general matrix to triangular form.

Orthogonal factorizations and reorthogonalization

  1. Golub, G. H., & Pereyra, V. (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.

  2. Yarlagadda, R. K., Hershey, J. E., & Sule, S. M. (1979). Gram–Schmidt orthogonalization with application to order recursive lattice filters.

Iterative methods and preconditioning

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Axelsson, O. (1994). Iterative Solution Methods.

  3. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  4. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  5. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

Ill-posed problems and regularization

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Hansen, P. C., Nagy, J. G., & O’Leary, D. P. (2006). Deblurring Images: Matrices, Spectra, and Filtering.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Applied ML and regularization

  1. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  3. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

  4. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

  5. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  6. Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples.

Five worked examples

Worked Example 1: Condition number of Gram matrix and instability of normal equations#

Introduction#

Construct a data matrix $X$ with correlated features; compute condition number $\kappa(X)$ and $\kappa(X^\top X)$; solve least squares via normal equations vs. QR to demonstrate error amplification.

Purpose#

Illustrate why normal equations fail for ill-conditioned data; show that $\kappa(X^\top X) = \kappa(X)^2$ causes error to explode.

Importance#

Foundational motivation for using QR/SVD in regression; explains why scikit-learn warns on ill-conditioned matrices.

What this example demonstrates#

  • Construct $X$ with exponentially decaying singular values (high correlation).

  • Compute $\kappa(X)$ and $\kappa(X^\top X) = \kappa(X)^2$.

  • Solve via normal equations: $G = X^\top X$, $w = G^{-1} X^\top y$.

  • Solve via QR factorization: $w$ from $R w = Q^\top y$.

  • Compare residuals and error growth on perturbed data.

Background#

Gram matrix $G = X^\top X$ is square and symmetric but inherits (squared) conditioning from $X$. Gaussian elimination on $G$ amplifies errors by $\kappa(G)^2 = \kappa(X)^4$ in worst case.

Historical context#

Golub & Kahan (1965) proved $\kappa(X^\top X) = \kappa(X)^2$; advocated QR as practical solution.

History#

Recognition of this issue led to LINPACK (1979) and LAPACK (1992) emphasis on QR/SVD.

Prevalence in ML#

Statisticians and practitioners routinely encounter this; feature correlation is common in real data.

Notes#

  • Problem is fundamental to linear algebra, not implementation-specific.

  • Data standardization (mean 0, std 1) helps but doesn’t eliminate correlation.

  • Large-scale regression ($n, d > 10^6$) exacerbates conditioning issues.

Connection to ML#

Ridge regression adds $\lambda I$ to shift eigenvalues; reduces effective condition number. Feature engineering (orthogonalization, PCA preprocessing) also helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ is exact; QR avoids Gram matrix entirely.

Pedagogical Significance#

Concrete demonstration of why theory (backward error analysis) matters in practice.

References#

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

Solution (Python)#

import numpy as np
np.random.seed(20)

# Create ill-conditioned data (correlated features)
n, d = 100, 10
# Generate features with exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, d)
X = U[:n, :d] @ np.diag(s)

# True solution and target
w_true = np.random.randn(d)
y = X @ w_true + 0.001 * np.random.randn(n)

# Condition numbers
kappa_X = np.linalg.cond(X)
G = X.T @ X
kappa_G = np.linalg.cond(G)

# Solve via normal equations
w_ne = np.linalg.solve(G, X.T @ y)
residual_ne = np.linalg.norm(X @ w_ne - y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)
residual_qr = np.linalg.norm(X @ w_qr - y)

# Solve via SVD (most stable)
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((U_svd.T @ y) / s_svd)  # divide by the singular values directly
residual_svd = np.linalg.norm(X @ w_svd - y)

print("Condition numbers:")
print("  kappa(X) =", round(kappa_X, 2))
print("  kappa(X^T X) =", round(kappa_G, 2))
print("  kappa(X^T X) / kappa(X)^2 =", round(kappa_G / kappa_X**2, 4), "(should be ~1)")
print("\nResiduals:")
print("  Normal equations:", round(residual_ne, 8))
print("  QR:", round(residual_qr, 8))
print("  SVD:", round(residual_svd, 8))
print("\nSolution errors:")
print("  NE error:", round(np.linalg.norm(w_ne - w_true), 4))
print("  QR error:", round(np.linalg.norm(w_qr - w_true), 4))
print("  SVD error:", round(np.linalg.norm(w_svd - w_true), 4))

Worked Example 2: Backward error analysis and numerical stability of QR vs. normal equations#

Introduction#

Solve the same least squares problem via QR and normal equations; compute backward and forward errors; verify that QR is backward-stable while normal equations magnify error.

Purpose#

Demonstrate backward error analysis in practice; show that backward-stable algorithm (QR) guarantees small forward error even on ill-conditioned problems.

Importance#

Foundation for trusting numerical algorithms; explains why stable algorithms are worth the computational cost.

What this example demonstrates#

  • Solve $Ax = b$ via QR factorization and normal equations.

  • Compute backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$.

  • Compute forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$.

  • Verify $\text{forward error} \approx \kappa(A) \times \text{backward error}$ for both methods.

Background#

Backward-stable algorithm: computed $\tilde{x}$ is exact solution to $(A + \Delta A) \tilde{x} = b$ with small $\Delta A$.

Historical context#

Wilkinson (1961) formalized backward error analysis; revolutionary impact on numerical linear algebra.

History#

Modern algorithms designed to be backward-stable (QR, SVD, Cholesky); guarantees via theorems in Higham (2002).

Prevalence in ML#

Used in numerical software certification; backward error commonly reported in solvers.

Notes#

  • QR backward error: $O(\varepsilon_{\text{mach}} \| A \|)$ (small).

  • Normal equations backward error: $O(\varepsilon_{\text{mach}} \| A \|^2 / \min(\sigma_i))$ (amplified).

Connection to ML#

Backward stability ensures that model parameters are not polluted by rounding error; prediction is numerically sound.

Connection to Linear Algebra Theory#

Backward error analysis decouples algorithm stability from problem conditioning.

Pedagogical Significance#

Shows that algorithm design (not just theory) determines practical accuracy.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np
np.random.seed(21)

# Ill-conditioned system
n = 50
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # kappa ~ 1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
x_true = np.random.randn(n)
b = A @ x_true

# Solve via QR
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Solve via normal equations
G = A.T @ A
x_ne = np.linalg.solve(G, A.T @ b)

# Reference solution: x_true is exact up to the rounding incurred when forming b
x_ref = x_true

# Backward errors
residual_qr = np.linalg.norm(A @ x_qr - b)
residual_ne = np.linalg.norm(A @ x_ne - b)
backward_error_qr = residual_qr / np.linalg.norm(b)
backward_error_ne = residual_ne / np.linalg.norm(b)

# Forward errors
forward_error_qr = np.linalg.norm(x_qr - x_ref) / np.linalg.norm(x_ref)
forward_error_ne = np.linalg.norm(x_ne - x_ref) / np.linalg.norm(x_ref)

# Condition number
kappa = np.linalg.cond(A)

# Verify: forward_error ≈ kappa × backward_error
pred_forward_qr = kappa * backward_error_qr
pred_forward_ne = kappa * backward_error_ne

print("Condition number kappa(A):", round(kappa, 0))
print("\nQR Factorization:")
print("  Backward error:", format(backward_error_qr, '.2e'))
print("  Observed forward error:", format(forward_error_qr, '.2e'))
print("  Predicted forward error:", format(pred_forward_qr, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_qr / pred_forward_qr, 2))
print("\nNormal Equations:")
print("  Backward error:", format(backward_error_ne, '.2e'))
print("  Observed forward error:", format(forward_error_ne, '.2e'))
print("  Predicted forward error:", format(pred_forward_ne, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_ne / pred_forward_ne, 2))
print("\nConclusion: QR is backward-stable; NE backward error is larger.")

Worked Example 3: Machine precision and loss of significance in subtraction#

Introduction#

Demonstrate catastrophic cancellation when subtracting nearby numbers; show how orthogonal transformations avoid this.

Purpose#

Illustrate fundamental floating-point pitfall; motivate use of orthogonal algorithms (QR, SVD, Householder).

Importance#

Concrete example of why naive implementations fail; builds intuition for numerical stability.

What this example demonstrates#

  • Compute $c = a - b$ where $a \approx b$ (many leading digits cancel).

  • Compare relative error in $c$ vs. in $a, b$ individually.

  • Show that orthogonal transformations (e.g., Householder reflection) preserve norms and avoid subtraction.

Background#

Subtraction of nearly equal numbers is inherently unstable; cannot be fixed by higher precision alone.

Historical context#

Recognized early in computer arithmetic; led to development of Householder reflections (1958) and other orthogonal methods.

History#

Modern best practice: avoid subtraction of near-equal numbers; use transformation-based algorithms (QR, SVD).

Prevalence in ML#

Common issue in numerical gradient computation, vector normalization, and Gram–Schmidt orthogonalization.

Notes#

  • Relative error in $c$ can be $O(1)$ even if $a, b$ are accurate.

  • Classical Gram–Schmidt suffers; modified Gram–Schmidt fixes by reorthogonalization.

Connection to ML#

Adversarial robustness and numerical gradient computation are vulnerable to cancellation; careful implementation required.

Connection to Linear Algebra Theory#

Orthogonal transformations (Householder, Givens) have condition number $\kappa = 1$; stable regardless of input.

Pedagogical Significance#

Demonstrates limit of forward-stable algorithms; backward stability (computing exact solution to nearby problem) is more achievable.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

Solution (Python)#

import numpy as np

# Example 1: Catastrophic cancellation
# Note: 1e-16 sits below machine epsilon (~2.2e-16), so 1.0 + 1e-16 would
# simply round back to 1.0 (absorption). Use 1e-15 so the sum is
# representable but the leading digits still cancel in the subtraction.
a = 1.0 + 1e-15
b = 1.0
c = a - b
print("Example: Subtraction of nearly equal numbers")
print("  a = 1.0 + 1e-15 =", repr(a))
print("  b = 1.0 =", repr(b))
print("  c = a - b =", repr(c))
print("  Exact: 1e-15, Computed:", c)
print("  Relative error:", abs(c - 1e-15) / 1e-15)  # ~0.11: one good digit left

# Example 2: Order of accumulation matters when large terms cancel
import math

print("\nExample: Loss of significance in inner product")
x = np.array([1e16, 1.0, -1e16])
y = np.array([1.0, 1.0, 1.0])

# Naive left-to-right: 1e16 + 1.0 absorbs the 1.0, then cancels to 0.0
xy_naive = x[0] * y[0] + x[1] * y[1] + x[2] * y[2]

# Pairing the large, cancelling terms first preserves the small one
xy_paired = (x[0] * y[0] + x[2] * y[2]) + x[1] * y[1]

print("  Inner product (x . y), exact value 1.0:")
print("    Naive left-to-right:", xy_naive)
print("    Large terms paired first:", xy_paired)
print("    Compensated (math.fsum):", math.fsum(x * y))

# Example 3: Orthogonal transformation vs. subtraction
print("\nExample: Orthogonal transformation avoids cancellation")
# Create orthogonal matrix (preserve norms)
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
v = np.array([1.0, 0.0])
v_rot = Q @ v
print("  ||v|| =", np.linalg.norm(v))
print("  ||Q v|| =", np.linalg.norm(v_rot), "(should equal ||v||)")
print("  Orthogonal transformation preserves norms: stable!")
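The same principle appears inside QR itself: when forming the Householder vector $v = x \pm \lVert x \rVert e_1$, the sign is chosen so the first component is computed by addition, never by subtraction of nearby numbers. A sketch (the helper names are ours, not from any library):

```python
import numpy as np

def householder_vector(x, safe=True):
    """v defining a reflection that maps x onto a multiple of e_1.
    safe=True picks the sign that avoids subtracting nearby numbers."""
    v = x.astype(float).copy()
    alpha = np.linalg.norm(x)
    if safe:
        v[0] += alpha if x[0] >= 0 else -alpha  # same sign: pure addition
    else:
        v[0] -= alpha                           # cancels when x ~ ||x|| e_1
    return v

def apply_reflection(v, x):
    """Apply H = I - 2 v v^T / (v^T v) to x."""
    return x - 2.0 * (v @ x) / (v @ v) * v

# x nearly parallel to e_1, so x[0] - ||x|| loses almost all digits
x = np.array([1.0, 1e-9, 1e-9])
res = {}
for safe in (True, False):
    v = householder_vector(x, safe)
    y = apply_reflection(v, x)
    res[safe] = np.linalg.norm(y[1:])  # should be ~0: all mass in e_1
    print("safe sign " if safe else "naive sign",
          "-> residual off e_1:", res[safe])
```

With the naive sign, the computed $v$ loses its tiny first component to rounding and the resulting "reflection" fails to annihilate the trailing entries; the safe sign zeroes them essentially exactly.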

Worked Example 4: Preconditioning and condition number reduction#

Introduction#

Construct an ill-conditioned SPD system; solve via unpreconditioned CG and preconditioned CG with diagonal preconditioner; measure iteration count reduction.

Purpose#

Show quantitatively how preconditioning reduces condition number and accelerates convergence.

Importance#

Essential technique for large-scale optimization; directly translates to practical speedups.

What this example demonstrates#

  • Construct ill-conditioned SPD matrix (exponentially decaying eigenvalues).

  • Solve via CG (count iterations to convergence).

  • Construct diagonal (Jacobi) preconditioner $M = \text{diag}(A)$.

  • Solve via preconditioned CG (count iterations).

  • Compute effective $\kappa(M^{-1} A)$ and verify iteration count reduction.

Background#

CG iteration count scales as $O(\sqrt{\kappa(A)} \log(1/\epsilon))$; preconditioning replaces $\kappa(A)$ with $\kappa(M^{-1} A)$.

Historical context#

Preconditioning recognized as essential by 1970s; became standard in iterative solvers.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg (all offer preconditioning options).

Prevalence in ML#

L-BFGS uses quasi-Newton preconditioning; preconditioned SGD in modern optimizers.

Notes#

  • Iteration count $\approx \sqrt{\kappa}$; reducing $\kappa$ by 100 reduces iterations by $\sqrt{100} = 10$.

  • Preconditioner setup cost must be amortized over multiple solves.

Connection to ML#

Optimization accelerators: Newton’s method, natural gradient, adaptive methods all precondition the gradient.

Connection to Linear Algebra Theory#

Preconditioning clusters eigenvalues; standard result in iterative methods.

Pedagogical Significance#

Demonstrates power of structure exploitation (knowing $M \approx A$) to improve algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  3. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(23)

# Create an SPD matrix that is ill-conditioned because of bad row/column
# scaling -- exactly the situation a diagonal (Jacobi) preconditioner fixes.
# (A rotated diagonal spectrum would have a nearly constant diagonal, and
# Jacobi preconditioning would barely help.)
n = 200
G = np.random.randn(n, n)
B = G @ G.T / n + np.eye(n)        # well-conditioned SPD core
d = np.logspace(0, -2, n)          # widely varying row/column scales
A = np.diag(d) @ B @ np.diag(d)    # SPD; kappa(A) roughly 1e4
b = np.random.randn(n)

# Unpreconditioned CG ('rtol' replaced 'tol' in SciPy 1.12; use tol on
# older SciPy versions)
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_un, info_un = spla.cg(A, b, rtol=1e-8, callback=callback_un, maxiter=10 * n)

# Preconditioned CG with diagonal (Jacobi) preconditioner
M_diag = np.diag(1.0 / np.diag(A))   # dense form, reused below for kappa(M^{-1} A)
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_pre, info_pre = spla.cg(A, b, M=sp.diags(1.0 / np.diag(A)), rtol=1e-8,
                          callback=callback_pre, maxiter=10 * n)

kappa_A = np.linalg.cond(A)
# Approximate kappa(M^{-1} A)
AM = M_diag @ A
kappa_AM = np.linalg.cond(AM)

print("Condition numbers:")
print("  kappa(A) =", round(kappa_A, 0))
print("  kappa(M^{-1} A) ≈", round(kappa_AM, 0), "(after diagonal preconditioning)")
print("\nCG Convergence:")
print("  Unpreconditioned iterations:", len(residuals_unprecond))
print("  Preconditioned iterations:", len(residuals_precond))
print("  Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("  Predicted speedup (~sqrt(kappa)):", round(np.sqrt(kappa_A / kappa_AM), 1))
print("\nBoth solutions satisfy A x = b:",
      np.allclose(A @ x_un, b) and np.allclose(A @ x_pre, b))

Worked Example 5: Regularization and effective condition number#

Introduction#

Construct an ill-posed rectangular system (underdetermined or rank-deficient); solve via Tikhonov regularization with varying $\lambda$; plot condition number reduction and solution norm vs. residual (L-curve).

Purpose#

Demonstrate how regularization shifts small singular values and reduces effective condition number.

Importance#

Core technique in inverse problems, ridge regression, and ill-posed ML problems.

What this example demonstrates#

  • Construct rank-deficient or ill-posed system (exponentially decaying singular values).

  • Solve regularized system $(A^\top A + \lambda I) w = A^\top b$ for range of $\lambda$.

  • Compute effective condition number: $\kappa_\lambda = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$.

  • Plot L-curve: solution norm vs. residual; optimal $\lambda$ at corner.

  • Verify that small $\lambda$ gives large solution, large $\lambda$ gives small solution but large residual.

Background#

Tikhonov adds penalty; shifts eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$.

Historical context#

Tikhonov (1963) for ill-posed problems; Hansen (1998) developed regularization parameter selection tools.

History#

Standard in inverse problems (deblurring, tomography); ridge regression standard in ML.

Prevalence in ML#

Ridge regression, elastic net, dropout, early stopping all act as implicit regularization.

Notes#

  • Optimal $\lambda$ balances fit (small residual) and solution norm (small solution).

  • L-curve method (Hansen): plot $\log(\| r \|)$ vs. $\log(\| w \|)$; the corner indicates a good $\lambda$. Generalized cross-validation (Golub, Heath & Wahba, 1979) is a common alternative.

Connection to ML#

Bias-variance trade-off: underfitting vs. overfitting; regularization parameter selection critical.

Connection to Linear Algebra Theory#

Spectral filtering: Tikhonov filters small singular values; LSQR via early stopping does implicit filtering.
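The filtering view is directly computable: the Tikhonov solution equals $\sum_i f_i \frac{u_i^\top b}{\sigma_i} v_i$ with filter factors $f_i = \sigma_i^2 / (\sigma_i^2 + \lambda)$. A sketch with synthetic data (dimensions and $\lambda$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 10
# Design matrix with rapidly decaying singular values (ill-posed flavor)
A = rng.standard_normal((m, n)) @ np.diag(np.logspace(0, -4, n))
b = rng.standard_normal(m)
lam = 1e-3

# Direct Tikhonov solve: (A^T A + lam I) w = A^T b
w_direct = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Equivalent spectral-filtering form: w = V diag(f_i / sigma_i) U^T b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = s**2 / (s**2 + lam)            # filter factors in [0, 1]
w_filtered = Vt.T @ ((f / s) * (U.T @ b))

print("filter factors:", np.round(f, 4))  # ~1 for large s_i, ~0 for small
print("two forms agree:", np.allclose(w_direct, w_filtered))
```

Directions with $\sigma_i^2 \gg \lambda$ pass through almost unchanged ($f_i \approx 1$); directions with $\sigma_i^2 \ll \lambda$ are damped toward zero instead of being amplified by $1/\sigma_i$.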

Pedagogical Significance#

Shows how problem structure (ill-posedness) guides algorithm design (regularization).

References#

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Golub, G. H., Hansen, P. C., & O'Leary, D. P. (1999). Tikhonov regularization and total least squares.

Solution (Python)#

import numpy as np
np.random.seed(24)

# Construct ill-posed system
m, n = 100, 40
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T

# True solution and noisy data
x_true = np.zeros(n)
x_true[:5] = [10, 8, 6, 4, 2]
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# Solve Tikhonov for range of lambda
lams = np.logspace(-6, 2, 50)
sol_norms = []
residuals = []
cond_numbers = []

for lam in lams:
    G = A.T @ A + lam * np.eye(n)
    w = np.linalg.solve(G, A.T @ b)
    sol_norms.append(np.linalg.norm(w))
    residuals.append(np.linalg.norm(A @ w - b))
    cond_numbers.append(np.linalg.cond(G))

sol_norms = np.array(sol_norms)
residuals = np.array(residuals)
cond_numbers = np.array(cond_numbers)

# Find corner of L-curve via a crude finite-difference curvature proxy
# (robust corner detection needs more care; see Hansen 1998)
log_res = np.log(residuals)
log_norm = np.log(sol_norms)
curvature = np.gradient(np.gradient(log_res) / np.gradient(log_norm))
opt_idx = np.argmax(curvature)
lam_opt = lams[opt_idx]

# Solve with optimal lambda
G_opt = A.T @ A + lam_opt * np.eye(n)
w_opt = np.linalg.solve(G_opt, A.T @ b)

print("Regularization parameter selection (L-curve method):")
print("  Optimal lambda:", format(lam_opt, '.2e'))
print("  At optimal lambda:")
print("    Solution norm:", round(np.linalg.norm(w_opt), 4))
print("    Residual norm:", round(np.linalg.norm(A @ w_opt - b), 4))
print("    Condition number:", round(np.linalg.cond(G_opt), 2))
print("\nFor comparison:")
print("  Smallest lambda: cond =", round(cond_numbers[0], 0), ", residual =", 
      format(residuals[0], '.2e'))
print("  Largest lambda: cond =", round(cond_numbers[-1], 0), ", residual =", 
      format(residuals[-1], '.2e'))
print("\nTrue solution norm:", round(np.linalg.norm(x_true), 4))
print("Solution error at optimal lambda:", round(np.linalg.norm(w_opt - x_true), 4))

Chapter 13
Solving Systems
Key ideas: Introduction

Introduction#

Solving linear systems $A x = b$ is the computational engine of machine learning. In supervised learning, least squares solves the normal equations $(X^\top X) w = X^\top y$. In inference, Gaussian processes and Bayesian deep learning solve Gram or Hessian systems. Large-scale optimization solves preconditioned systems $M^{-1} A x = M^{-1} b$ to accelerate convergence. Fast, stable, and reliable solvers determine whether an algorithm is practical or intractable.

Important ideas#

  1. Gaussian elimination and LU factorization

    • Direct method: $A = LU$ via row operations.

    • Solution $x = U^{-1} (L^{-1} b)$ via forward/back-substitution.

    • Cost: $O(n^3)$ for dense; requires partial pivoting for stability (small pivots amplify rounding error).

  2. Cholesky factorization for SPD systems

    • For symmetric positive definite $A$: $A = L L^\top$ (one triangle).

    • Cost: $\approx n^3/3$ flops (half of LU); stable without pivoting.

    • Numerically stable if $A$ is well-conditioned.

  3. QR factorization and least squares

    • $A = QR$ with orthonormal $Q$ and upper triangular $R$.

    • Numerically stable; used for least squares (Chapter 12).

    • Cost: $O(n d^2)$ for an $n \times d$ matrix ($n \ge d$).

  4. Iterative solvers and conjugate gradient

    • Conjugate gradient (CG): for SPD systems, exact in at most $n$ iterations (in exact arithmetic); practical convergence in $\ll n$ iterations.

    • GMRES/MINRES: for general/symmetric non-SPD systems.

    • Cost per iteration: $O(\text{nnz}(A))$ (number of nonzeros); scales to $n = 10^8+$.

  5. Preconditioning and conditioning

    • Condition number $\kappa(A)$ determines iteration complexity: CG error contracts by $\rho \approx \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per iteration (steepest descent only achieves $\frac{\kappa - 1}{\kappa + 1}$).

    • Preconditioner $M \approx A$ reduces effective $\kappa(M^{-1} A)$.

    • Incomplete LU, Jacobi, algebraic multigrid: practical preconditioners.

  6. Sparse matrix structure

    • Banded, tridiagonal, block-structured systems exploit locality.

    • Sparse $A$ avoids dense intermediate results; enables $n > 10^9$.

    • Fill-in during factorization can destroy sparsity; ordering matters.

  7. Rank deficiency and ill-posedness

    • Rank-deficient $A$ has no unique solution; pseudoinverse or regularization needed.

    • Ill-conditioned $A$ (nearly rank-deficient) amplifies noise; require stabilization.

    • Tikhonov regularization $(A^\top A + \lambda I) x = A^\top b$ shifts small eigenvalues.
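The fill-in phenomenon from idea 6 is easy to see on an "arrow" matrix, where elimination order changes the factor size by orders of magnitude. A sketch using SciPy's sparse LU (`permc_spec` is a real SciPy option; the matrix itself is our illustration):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
# "Arrow" matrix: dense first row and column plus a dominant diagonal.
# Eliminating the dense node FIRST couples every remaining pair of nodes.
A = sp.lil_matrix((n, n))
A.setdiag(2.0 * n * np.ones(n))
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = 2.0 * n
A = A.tocsc()

fills = {}
for spec in ("NATURAL", "COLAMD"):
    lu = spla.splu(A, permc_spec=spec)
    fills[spec] = lu.L.nnz + lu.U.nnz
    print(f"{spec:8s} ordering: {fills[spec]:6d} nonzeros in L+U "
          f"(A itself has {A.nnz})")
```

Under the natural ordering the dense node is eliminated first and the factors should fill in almost completely (order $n^2$ nonzeros); a fill-reducing ordering like COLAMD pushes it to the end, keeping the factors at roughly $3n$.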

Relevance to ML#

  • Least squares and linear regression: Core supervised learning; kernel ridge regression solves Gram systems.

  • Gaussian processes and Bayesian inference: Solve covariance ($n \times n$) systems; practical only with approximations or sparse methods.

  • Optimization acceleration: Preconditioned gradient descent exploits Hessian structure to reduce iteration count.

  • Graph neural networks and sparse convolutions: Solve adjacency/Laplacian systems; diffusion requires matrix exponential or iterative approximation.

  • Inverse problems and imaging: Regularized least squares $(A^\top A + \lambda I) x = A^\top b$ solves ill-posed systems in MRI, CT, tomography.

Algorithmic development (milestones)#

  • 1670s: Newton’s method and early algebraic solutions.

  • 1810: Gaussian elimination formalized (Gauss, Legendre).

  • c. 1910: Cholesky decomposition (manuscript by A.-L. Cholesky; published posthumously 1924).

  • 1947: Numerical stability of Gaussian elimination (von Neumann–Goldstine); LU factorization analysis.

  • 1950s: Conjugate gradient (Hestenes–Stiefel, 1952); revolutionary for large-scale systems.

  • 1982: LSQR algorithm (Paige–Saunders); numerically stable for least squares.

  • 1975–1986: MINRES (Paige–Saunders) and GMRES (Saad–Schultz); iterative methods for symmetric indefinite and nonsymmetric systems.

  • 1990s–2000s: Algebraic multigrid preconditioners (Ruge–Stüben); enables $O(n)$ scaling.

  • 2010s: Implicit solvers in automatic differentiation (JAX, PyTorch); enable differentiation through solves.

Definitions#

  • Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}, x, b \in \mathbb{R}^n$.

  • LU factorization: $A = L U$ with $L$ lower triangular, $U$ upper triangular.

  • Cholesky factorization: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite.

  • Forward substitution: Solve $L y = b$ for lower triangular $L$ in $O(n^2)$.

  • Back-substitution: Solve $U x = y$ for upper triangular $U$ in $O(n^2)$.

  • Residual: $r = b - A x$; measure of solution error.

  • Conjugate gradient: Iterative solver minimizing $\frac{1}{2} x^\top A x - b^\top x$ in Krylov subspace.

  • Preconditioner: $M \approx A$; solve $M^{-1} A x = M^{-1} b$ instead of $A x = b$ to reduce $\kappa$.

  • Condition number: $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

  • Fill-in: Nonzeros created during factorization of sparse matrix; can destroy sparsity structure.
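The forward/back-substitution definitions above translate line-for-line into code; a minimal, unoptimized sketch that solves an SPD system through its Cholesky factor:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L in O(n^2)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_substitution(U, y):
    """Solve U x = y for upper-triangular U in O(n^2)."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(1)
n = 6
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)   # SPD by construction
b = rng.standard_normal(n)

# A = L L^T, then two O(n^2) triangular solves
L = np.linalg.cholesky(A)
x = back_substitution(L.T, forward_substitution(L, b))
print("max residual |A x - b|:", np.max(np.abs(A @ x - b)))
```

Each loop iteration touches one row of the triangle, giving the $O(n^2)$ cost stated in the definitions; production code calls BLAS `trsv`/`trsm` instead.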

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Gaussian elimination and LU: Forward elimination, row pivoting for stability, $A = LU$ factorization. References: Golub & Van Loan (2013); Trefethen & Bau (1997).

  • Cholesky factorization: $A = L L^\top$ for SPD matrices; numerical stability via diagonal dominance and conditioning. Reference: Golub & Van Loan (2013).

  • Forward/back-substitution: $O(n^2)$ solve for triangular systems; essential subroutine.

  • QR factorization: $A = QR$ with orthonormal $Q$; stable least squares (Chapter 12). Reference: Trefethen & Bau (1997).

  • Conjugate gradient: Minimize $\frac{1}{2} x^\top A x - b^\top x$ on Krylov subspace; optimality in $n$ iterations for SPD. Reference: Hestenes & Stiefel (1952).

  • Condition number and residual analysis: $\kappa(A) = \sigma_1 / \sigma_n$; backward error bounds. Reference: Wilkinson (1961).

  • Preconditioning: Transform $A \to M^{-1} A$ to reduce condition number. Reference: Axelsson (1994).

Applied (landmark systems)#

  • Cholesky solve for Gram matrices: $(X^\top X) w = X^\top y$ via Cholesky; scikit-learn Ridge with solver='cholesky'. Reference: Hastie et al. (2009).

  • CG for large-scale least squares: CGLS (CG on the normal equations) and LSQR for $n, d > 10^6$; implemented as scipy.sparse.linalg.lsqr. Reference: Paige & Saunders (1982).

  • Gaussian process inference: Cholesky of $K$ for marginal likelihood; approximate GPs via $m$ inducing points reduce $O(n^3)$ to $O(n m^2)$. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005); Hensman et al. (2015).

  • Preconditioned optimization: L-BFGS with Hessian approximation; widely used in TensorFlow/PyTorch. Reference: Nocedal & Wright (2006).

  • Graph Laplacian solvers: Fast Poisson equation solvers on mesh/graphs; ChebNet (Defferrard et al. 2016); enables scalable GNNs. Reference: Defferrard et al. (2016).

  • Inverse problems: LSQR iterative method; clinical deployment in medical imaging. Reference: Vogel (2002); Bardsley et al. (2012).

Key ideas: Where it shows up
  1. Least squares and kernel ridge regression

    • Solve $A = X^\top X$ (Gram matrix) or $A = K + \lambda I$ (kernel matrix).

    • Achievements: scikit-learn, PyTorch linear solvers; KRR standard in Gaussian processes. References: Rasmussen & Williams (2006); Schölkopf & Smola (2002).

  2. Gaussian processes and covariance systems

    • Solve $K x = y$ where $K$ is $n \times n$ covariance matrix.

    • Achievements: Cholesky solve in Stan, PyMC3; approximate inference via sparse GPs (inducing points). References: Quinonero-Candela & Rasmussen (2005); Snelson & Ghahramani (2005).

  3. Optimization and preconditioning

    • Preconditioned gradient descent: $x_{t+1} = x_t - \alpha M^{-1} \nabla f(x_t)$.

    • Achievements: L-BFGS preconditioner reduces iteration count by factor of $10$-$100$; quasi-Newton methods. References: Nocedal & Wright (2006); Martens & Grosse (2015).

  4. Graph neural networks and sparse convolutions

    • Solve graph Laplacian systems $L x = b$ for diffusion, smoothing, attention mechanisms.

    • Achievements: GraphSAGE, GCN via approximate polynomial filters; scalable to graphs with $10^9$ nodes. References: Defferrard et al. (2016) (ChebNet); Kipf & Welling (2017) (GCN).

  5. Inverse problems and regularized imaging

    • Solve Tikhonov system $(A^\top A + \lambda I) x = A^\top b$ for ill-posed deconvolution, tomography, parameter estimation.

    • Achievements: iterative methods in medical imaging (CGLS for CT/MRI); LSQR in seismic inversion. References: Hansen (1998); Vogel (2002); Bardsley et al. (2012).

Notation
  • Linear system: $A x = b$ with $A \in \mathbb{R}^{n \times n}$ (or $\mathbb{R}^{m \times n}$ overdetermined).

  • LU factorization: $A = L U$ with $L \in \mathbb{R}^{n \times n}$ lower triangular (unit diagonal), $U \in \mathbb{R}^{n \times n}$ upper triangular.

  • Cholesky: $A = L L^\top$ for $A \in \mathbb{R}^{n \times n}$ symmetric positive definite; $L \in \mathbb{R}^{n \times n}$ lower triangular.

  • QR factorization: $A = QR$ with $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns, $R \in \mathbb{R}^{n \times n}$ upper triangular.

  • Residual: $r = b - A x \in \mathbb{R}^n$; goal is $\lVert r \rVert \ll \lVert b \rVert$.

  • Conjugate gradient: $x_k$ minimizes $\frac{1}{2} x^\top A x - b^\top x$ over $k$-dimensional Krylov subspace $\text{span}(r_0, A r_0, \ldots, A^{k-1} r_0)$.

  • Preconditioner: $M \approx A$ (cheap to invert); solve $M^{-1} A x = M^{-1} b$.

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); measures sensitivity to perturbation.

  • Example: $A = \begin{pmatrix} 10 & 1 \\ 1 & 1 \end{pmatrix}$ has $\sigma_1 \approx 10.11, \sigma_2 \approx 0.89$, so $\kappa(A) \approx 11.4$ (moderately ill-conditioned).
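Checks like this are one-liners in NumPy:

```python
import numpy as np

A = np.array([[10.0, 1.0],
              [1.0, 1.0]])
# Singular values via SVD; for this symmetric PD matrix they equal
# the eigenvalues (11 +/- sqrt(85)) / 2
s = np.linalg.svd(A, compute_uv=False)
print("singular values:", np.round(s, 4))
print("kappa(A) =", round(float(s[0] / s[-1]), 4))
print("matches np.linalg.cond:", np.isclose(s[0] / s[-1], np.linalg.cond(A)))
```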

Pitfalls & sanity checks
  • Never solve normal equations $A^\top A x = A^\top b$ directly for ill-conditioned $A$: Use QR, SVD, or iterative methods (LSQR) instead; $\kappa(A^\top A) = \kappa(A)^2$.

  • Verify SPD before Cholesky: Non-SPD matrices cause NaN/Inf; test via eigenvalues or try-catch.

  • Check residual convergence: $\lVert A x - b \rVert$ should decrease monotonically in iterative solvers; stagnation signals bad conditioning or preconditioner failure.

  • Preconditioning setup cost: If solve is run once, setup overhead may exceed savings; only cost-effective for multiple solves with same $A$.

  • Fill-in in sparse LU: Sparse matrix factorization can become dense; use reordering (minimum degree, nested dissection) to minimize fill-in.

  • Early stopping in LSQR: Stop iteration based on residual norm or discrepancy principle; continuing to convergence amplifies noise.

  • Regularization parameter selection: Cross-validation, L-curve, or discrepancy principle; do not fit $\lambda$ on training data.

  • Scaling sensitivity: Ill-scaled systems (different row/column magnitudes) become ill-conditioned; normalize before solving.
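The first pitfall — never square the condition number — is worth seeing numerically; a sketch with a synthetic design matrix whose conditioning we control exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 8
# Design matrix with kappa(A) = 1e6, so kappa(A^T A) = 1e12
Q1, _ = np.linalg.qr(rng.standard_normal((m, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(np.logspace(0, -6, n)) @ Q2
x_true = rng.standard_normal(n)
b = A @ x_true   # consistent system, exact solution known

x_normal = np.linalg.solve(A.T @ A, A.T @ b)   # squares the condition number
x_qr, *_ = np.linalg.lstsq(A, b, rcond=None)   # QR/SVD-based, stable

err_normal = np.linalg.norm(x_normal - x_true)
err_qr = np.linalg.norm(x_qr - x_true)
print(f"kappa(A)       : {np.linalg.cond(A):.1e}")
print(f"kappa(A^T A)   : {np.linalg.cond(A.T @ A):.1e}")
print(f"normal-eq error: {err_normal:.1e}")
print(f"lstsq error    : {err_qr:.1e}")
```

Forming $A^\top A$ loses roughly twice as many digits as a QR/SVD-based solve on the same problem, exactly as $\kappa(A^\top A) = \kappa(A)^2$ predicts.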

References

Foundational algorithms

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium.

  2. Cholesky, A. L. (1910). Sur la résolution numérique des systèmes d’équations linéaires.

Classical theory and numerical stability

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  3. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Iterative methods and conjugate gradient

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.

  3. Saad, Y., & Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.

  4. Axelsson, O. (1994). Iterative Solution Methods.

Preconditioning and multigrid

  1. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

Applied: Machine learning and inverse problems

  1. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

  5. Quinonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression.

  6. Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs.

  7. Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering.

  8. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  9. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  10. Hensman, J., Matthews, A. G. D. E., & Ghahramani, Z. (2015). Scalable variational Gaussian process classification.

  11. Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels.

  12. Bardsley, J. M., Chung, J., & Palmer, K. (2012). Regularization parameter selection methods for linear least squares problems.

Five worked examples

Worked Example 1: Gaussian elimination, LU factorization, and pivoting#

Introduction#

Solve a system $A x = b$ via LU factorization with partial pivoting; verify stability and reconstruction.

Purpose#

Illustrate direct method for dense systems; compare unpivoted LU (numerically risky) to pivoted LU (numerically stable).

Importance#

LU is the foundation of dense linear algebra; pivoting is critical for controlling element growth and rounding error.

What this example demonstrates#

  • Compute LU factorization via Gaussian elimination (with and without pivoting).

  • Verify $A = L U$ and solve via forward/back-substitution.

  • Compute condition number and demonstrate error growth without pivoting on ill-conditioned matrix.

Background#

Partial pivoting (swapping rows to maximize pivot) prevents small pivots from amplifying errors during elimination.

Historical context#

Numerical instability of unpivoted LU recognized mid-20th century; pivoting became standard.

History#

Modern linear algebra libraries (LAPACK) default to pivoted LU.

Prevalence in ML#

Used in scikit-learn’s least squares and in matrix factorization algorithms.

Notes#

  • Pivoting is automatic in most libraries; rarely explicit in practice.

  • For sparse matrices, fill-in during elimination can destroy sparsity; sparse LU uses ordering strategies (minimum degree, nested dissection).

Connection to ML#

Condition number of $A$ determines solver reliability; poor conditioning necessitates regularization.

Connection to Linear Algebra Theory#

$\kappa(A) = \sigma_1 / \sigma_n$ predicts error magnification; LU error $\sim \kappa(A) \times \text{machine eps}$.

Pedagogical Significance#

Concrete demonstration of numerical stability and pivoting strategy.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.linalg
np.random.seed(10)

# Create moderately ill-conditioned system
n = 5
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, n)
A = U @ np.diag(s) @ U.T
b = np.random.randn(n)

# Compute LU with pivoting (numpy default)
P, L, U_lu = scipy.linalg.lu(A)
kappa_A = np.linalg.cond(A)

# scipy.linalg.lu returns factors with A = P @ L @ U; verify reconstruction
A_recon = P @ L @ U_lu
recon_error = np.linalg.norm(A - A_recon)

# Solve via LU: forward sub (L y = P^T b), then back sub (U x = y)
y = scipy.linalg.solve_triangular(L, P.T @ b, lower=True)
x_lu = scipy.linalg.solve_triangular(U_lu, y, lower=False)

# Solve directly (for comparison)
x_direct = np.linalg.solve(A, b)

print("Condition number kappa(A):", round(kappa_A, 2))
print("LU reconstruction error:", round(recon_error, 8))
print("LU solution matches direct:", np.allclose(x_lu, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_lu - b), 8))

Worked Example 2: Cholesky factorization for symmetric positive definite systems#

Introduction#

Solve a symmetric positive definite (SPD) system via Cholesky factorization; verify stability and computational savings.

Purpose#

Show how SPD structure reduces computation by half and improves stability vs. general LU.

Importance#

Cholesky is standard for covariance matrices, Gram matrices, and Hessians in optimization.

What this example demonstrates#

  • Construct SPD matrix (e.g., covariance or Gram).

  • Compute Cholesky $A = L L^\top$ and verify reconstruction.

  • Solve via forward/back-substitution on $L$.

  • Compare to general LU: cost $O(n^3 / 3)$ vs. $O(n^3)$.

Background#

For SPD $A$, Cholesky is more stable than LU (no pivoting needed) and faster (half the operations).

Historical context#

Cholesky rediscovered in early 1900s; became standard method for SPD systems by mid-20th century.

History#

LAPACK dpotrf is gold-standard Cholesky implementation.

Prevalence in ML#

Used in kernel ridge regression, Gaussian process inference, and proximal methods.

Notes#

  • Fails loudly if $A$ is not SPD: the factorization encounters a nonpositive pivot and raises an error rather than returning garbage.

  • Numerical breakdown signals ill-conditioning; add small jitter ($A + \epsilon I$) if needed.

Connection to ML#

Covariance and Gram matrices are always SPD; Cholesky enables efficient sampling and likelihood computation in probabilistic models.

Connection to Linear Algebra Theory#

Cholesky exists iff all leading principal minors are positive (Sylvester criterion).

Pedagogical Significance#

Demonstrates how structure (symmetry, PSD) enables algorithmic optimization.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

Solution (Python)#

import numpy as np
import scipy.linalg
np.random.seed(11)

# Create SPD matrix (covariance-like)
n = 5
A_temp = np.random.randn(n, n)
A = A_temp.T @ A_temp + np.eye(n)  # SPD by construction
b = np.random.randn(n)

# Cholesky factorization
L = np.linalg.cholesky(A)

# Verify A = L L^T
A_recon = L @ L.T
recon_error = np.linalg.norm(A - A_recon)

# Solve via forward/back-substitution
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_chol = scipy.linalg.solve_triangular(L.T, y, lower=False)

# Direct solve for comparison
x_direct = np.linalg.solve(A, b)

# Condition number
kappa_A = np.linalg.cond(A)

print("Condition number kappa(A):", round(kappa_A, 2))
print("Cholesky reconstruction error:", round(recon_error, 8))
print("Cholesky solution matches direct:", np.allclose(x_chol, x_direct))
print("Residual ||A x - b||:", round(np.linalg.norm(A @ x_chol - b), 8))
print("Lower triangular L diagonal:", np.round(np.diag(L), 4))

Worked Example 3: Conjugate gradient for large sparse systems#

Introduction#

Solve a large sparse SPD system via conjugate gradient (CG); demonstrate convergence and compare to direct Cholesky.

Purpose#

Show how iterative methods scale to large systems while respecting sparsity.

Importance#

CG is the standard iterative solver for large-scale ML and scientific computing.

What this example demonstrates#

  • Construct sparse SPD system (e.g., graph Laplacian or 2D Poisson discretization).

  • Solve via CG and via direct Cholesky.

  • Plot convergence: residual norm vs. iteration.

  • Measure wall-clock time and memory for dense vs. sparse approaches.

Background#

CG finds solution in at most $n$ iterations (theory); practical convergence in $\ll n$ iterations for well-conditioned systems.

Historical context#

Hestenes & Stiefel (1952); rediscovered in 1970s as demand for large-scale methods grew.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg.

Prevalence in ML#

Used in large-scale kernel methods, graph neural networks, and optimization.

Notes#

  • Convergence rate depends on condition number $\kappa(A)$; preconditioning improves rate dramatically.

  • Requires only matrix-vector products $A v$; no explicit $A$ storage needed.

Connection to ML#

Enables Gaussian process inference on large datasets; scales kernel methods from $10^4$ to $10^6$ points.

Connection to Linear Algebra Theory#

CG minimizes $\frac{1}{2} x^\top A x - b^\top x$ over Krylov subspace; conjugate directions ensure optimality.
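The Krylov characterization corresponds to a remarkably short algorithm; a textbook CG sketch (dense matrix for clarity — only the products $A v$ are actually needed, which is what makes CG matrix-free):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=None):
    """Textbook CG for SPD A; needs only matrix-vector products."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = -gradient of the quadratic
    p = r.copy()                   # first search direction
    rs = r @ r
    for _ in range(maxiter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # A-conjugate to all previous directions
        rs = rs_new
    return x

rng = np.random.default_rng(3)
n = 50
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)        # SPD, reasonably well-conditioned
b = rng.standard_normal(n)

x = conjugate_gradient(A, b)
print("residual:", np.linalg.norm(A @ x - b))
```

Each iteration costs one matrix-vector product plus a few vector operations; replacing `A @ p` with any black-box function computing the same product gives the matrix-free solver used at scale.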

Pedagogical Significance#

Demonstrates trade-off between direct (dense, $O(n^3)$, low iteration) and iterative (sparse-compatible, $O(n)$ per iteration, many iterations).

References#

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Golub & Kahan (1965). Calculating singular values and pseudo-inverses.

  3. Nocedal & Wright (2006). Numerical Optimization.

Solution (Python)#

import numpy as np
import scipy.linalg
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(12)

# Construct sparse SPD system (2D Laplacian discretization)
n_side = 20
n = n_side ** 2
# Discrete Laplacian on a 2D grid (5-point stencil, row-major ordering)
I = np.arange(n)
J = np.arange(n)
V = 4 * np.ones(n)
# Vertical neighbors (offset n_side: next grid row)
I_v = np.arange(n - n_side)
J_v = I_v + n_side
V_v = -np.ones(len(I_v))
# Horizontal neighbors (offset 1, skipping grid-row boundaries)
I_h = np.arange(n - 1)
I_h = I_h[I_h % n_side != n_side - 1]
J_h = I_h + 1
V_h = -np.ones(len(I_h))

# Add each off-diagonal entry AND its transpose to keep A symmetric
rows = np.concatenate([I, I_v, J_v, I_h, J_h])
cols = np.concatenate([J, J_v, I_v, J_h, I_h])
vals = np.concatenate([V, V_v, V_v, V_h, V_h])
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
b = np.random.randn(n)

# CG solution (count iterations via a callback; the `info` return value of
# spla.cg is only a status flag, 0 on success, not an iteration count)
def residual_norm(x):
    return np.linalg.norm(A @ x - b)

n_iters = [0]
def count_iter(xk):
    n_iters[0] += 1

# 'rtol' replaced 'tol' in SciPy 1.12; use tol=1e-6 on older versions
x_cg, info = spla.cg(A, b, rtol=1e-6, maxiter=n, callback=count_iter)

# Direct Cholesky on dense
A_dense = A.toarray()
L = np.linalg.cholesky(A_dense)
y = scipy.linalg.solve_triangular(L, b, lower=True)
x_direct = scipy.linalg.solve_triangular(L.T, y, lower=False)

print("Problem size (n):", n)
print("Sparsity: {:.1f}%".format(100 * A.nnz / (n ** 2)))
print("CG converged in:", n_iters[0], "iterations (info =", info, ")")
print("CG residual:", round(residual_norm(x_cg), 8))
print("Direct residual:", round(residual_norm(x_direct), 8))
print("Solutions match:", np.allclose(x_cg, x_direct, atol=1e-4))

Worked Example 4: Preconditioning and condition number#

Introduction#

Solve an ill-conditioned SPD system with and without preconditioning; demonstrate acceleration and residual reduction.

Purpose#

Show how preconditioning reduces effective condition number and dramatically accelerates convergence.

Importance#

Preconditioning is essential for large-scale optimization and inverse problems.

What this example demonstrates#

  • Construct ill-conditioned SPD system (exponentially decaying eigenvalues).

  • Solve via unpreconditioned CG and via diagonal (Jacobi) preconditioner.

  • Plot convergence: residual vs. iteration for both.

  • Verify that CG on $M^{-1} A$ has better convergence rate.

Background#

Convergence rate $\rho \approx \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ per iteration; preconditioning reduces $\kappa(M^{-1} A)$.
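
This rate can be turned into a rough iteration estimate. A small sketch (using the standard CG error bound $2\rho^i \le \varepsilon$; the tolerance is illustrative):

```python
import numpy as np

def cg_iter_bound(kappa, eps=1e-6):
    """Iterations i such that 2 * rho**i <= eps, with rho = (sqrt(k)-1)/(sqrt(k)+1)."""
    rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    return int(np.ceil(np.log(eps / 2) / np.log(rho)))

for kappa in [10, 100, 10_000]:
    print(kappa, cg_iter_bound(kappa))  # iterations grow like sqrt(kappa)
```

The bound scales as $O(\sqrt{\kappa})$, which is why reducing $\kappa(M^{-1}A)$ by even a modest factor pays off.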

Historical context#

Preconditioning recognized as essential in 1970s–1980s for practical large-scale solvers.

History#

MINRES-QLP, GMRES with ILU preconditioning (Saad 1993); algebraic multigrid (Ruge–Stüben 1987).

Prevalence in ML#

Preconditioned gradients, L-BFGS with Hessian approximation, trust-region methods.

Notes#

  • Preconditioning trades setup cost (computing and applying $M^{-1}$) for a reduction in iteration count.

  • Diagonal (Jacobi) preconditioner: $M = \text{diag}(A)$; trivial but often effective.
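
The Jacobi-preconditioned CG iteration itself is only a few lines. A minimal hand-rolled sketch (illustrative only; the matrix is a random SPD example, and the worked solution uses SciPy's `cg` instead):

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, max_iter=500):
    """Preconditioned CG with M = diag(A); applying M^{-1} is elementwise scaling."""
    m_inv = 1.0 / np.diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = m_inv * r                      # z = M^{-1} r
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Random SPD test matrix (condition number ~1000)
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = Q @ np.diag(np.logspace(0, -3, 50)) @ Q.T
b = rng.standard_normal(50)
x, iters = jacobi_pcg(A, b)
print("iterations:", iters, "residual:", np.linalg.norm(A @ x - b))
```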

Connection to ML#

Preconditioned gradient descent achieves faster convergence; L-BFGS implicitly preconditions via quasi-Newton approximation.

Connection to Linear Algebra Theory#

CG convergence depends on spectrum of $A$; preconditioning clusters eigenvalues of $M^{-1} A$ to reduce effective $\kappa$.

Pedagogical Significance#

Concrete demonstration of conditioning’s impact on algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal & Wright (2006). Numerical Optimization.

  3. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(13)

# Create ill-conditioned SPD matrix
n = 100
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # Condition number ~1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_unprecond(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_unprecond, _ = spla.cg(A, b, rtol=1e-8, callback=callback_unprecond)  # `tol=` on SciPy < 1.12

# Preconditioned CG (diagonal/Jacobi preconditioner)
# Note: spla.cg's M must approximate A^{-1}, so pass diag(A)^{-1}
M = sp.diags(1.0 / np.diag(A))
residuals_precond = []
def callback_precond(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_precond, _ = spla.cg(A, b, M=M, rtol=1e-8, callback=callback_precond)

kappa_A = np.linalg.cond(A)
print("Condition number kappa(A):", round(kappa_A, 0))
print("Unpreconditioned CG iterations:", len(residuals_unprecond))
print("Preconditioned CG iterations:", len(residuals_precond))
print("Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("Both solutions satisfy A x = b:", 
      np.allclose(A @ x_unprecond, b) and np.allclose(A @ x_precond, b))

Worked Example 5: LSQR for ill-posed inverse problems#

Introduction#

Solve an ill-posed least squares problem $A x \approx b$ via LSQR (iterative least squares QR); compare to direct pseudoinverse and Tikhonov regularization.

Purpose#

Demonstrate how LSQR stabilizes solutions to ill-posed systems without explicit regularization parameter tuning.

Importance#

LSQR is standard in medical imaging, geophysical inversion, and inverse problems.

What this example demonstrates#

  • Construct ill-posed rectangular system (tall, decaying singular values).

  • Solve via LSQR (iterative, early stopping acts as regularization).

  • Compare to pseudoinverse (noisy) and Tikhonov (requires $\lambda$ selection).

  • Plot regularization curves (solution norm vs. residual).

Background#

LSQR is mathematically equivalent to conjugate gradient on the normal equations $A^\top A x = A^\top b$ but more numerically stable; stopping before convergence avoids amplifying noise.

Historical context#

Paige & Saunders (1982); foundational algorithm for inverse problems.

History#

CGLS, LSQR in scipy.sparse.linalg; widely used in medical imaging.

Prevalence in ML#

Used in sparse coding, compressed sensing, and large-scale least squares.

Notes#

  • Stopping iteration acts as regularization; no regularization parameter needed.

  • Discrepancy principle: stop when residual reaches expected noise level.

Connection to ML#

Implicit regularization; enables robust solutions without hyperparameter tuning.

Connection to Linear Algebra Theory#

LSQR on $\min_x \lVert A x - b \rVert^2$ without explicit pseudoinverse; early stopping filters small singular values.

Pedagogical Significance#

Demonstrates how algorithm structure (iteration count) can act as regularization.

References#

  1. Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(14)

# Construct ill-posed rectangular system
m, n = 120, 50
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:m, :n] @ np.diag(s) @ V.T
x_true = np.zeros(n)
x_true[:8] = np.sin(np.linspace(0, 2*np.pi, 8))
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# LSQR with early stopping (implicit regularization)
# Note: spla.lsqr returns a long tuple (x, istop, itn, ...); the solution is element 0
residuals_lsqr = []
x_lsqr_early = None
for k in [5, 10, 20, 50]:
    x_lsqr = spla.lsqr(A, b, atol=0, btol=0, iter_lim=k)[0]
    residuals_lsqr.append(np.linalg.norm(A @ x_lsqr - b))
    if k == 20:
        x_lsqr_early = x_lsqr

# Fully converged LSQR approximates the pseudoinverse solution (no regularization, noisy)
x_pinv = spla.lsqr(A, b, atol=0, btol=0, iter_lim=1000)[0]

# Tikhonov with manual lambda (requires tuning)
lam = 0.01
G = A.T @ A + lam * np.eye(n)
x_tikhonov = np.linalg.solve(G, A.T @ b)

print("True solution norm:", round(np.linalg.norm(x_true), 4))
print("LSQR (early stop k=20) norm:", round(np.linalg.norm(x_lsqr_early), 4))
print("LSQR (early stop) error:", round(np.linalg.norm(x_lsqr_early - x_true), 4))
print("Pseudoinverse error:", round(np.linalg.norm(x_pinv - x_true), 4))
print("Tikhonov error:", round(np.linalg.norm(x_tikhonov - x_true), 4))
print("LSQR residual at k=20:", round(residuals_lsqr[2], 6))

Chapter 12
Least Squares
Key ideas: Algorithmic development history

Algorithmic development (milestones)#

  • 1795: Legendre and Gauss independently develop least squares for astronomy/surveying.
  • 1881–1920: Cholesky factorization and early numerical algorithms.
  • 1960s: Householder QR for least squares and Golub–Kahan bidiagonalization; recognition of conditioning issues in normal equations.
  • 1963–1970: Tikhonov regularization (1963) and ridge regression (Hoerl & Kennard 1970) for ill-posed and collinear problems.
  • 1996–2005: LASSO (Tibshirani 1996) and elastic net (Zou & Hastie 2005) via modern regularization theory.
  • 2000s: Stochastic gradient descent for large-scale least squares (Bottou–LeCun).
  • 2010s: Implicit regularization in deep learning; connections between SGD and generalization.
Key ideas: Definitions

Definitions#

  • Least squares problem: $\min_w \lVert X w - y \rVert_2^2$ with $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^n$.
  • Normal equations: $X^\top X w = X^\top y$.
  • Residual: $r = X w - y \in \mathbb{R}^n$.
  • Gram matrix: $G = X^\top X \in \mathbb{R}^{d\times d}$ (PSD).
  • Condition number: $\kappa(X) = \sigma_1 / \sigma_d$ (ratio of singular values).
  • Ridge regression: $\min_w (\lVert X w - y \rVert^2 + \lambda \lVert w \rVert^2)$; solution $(X^\top X + \lambda I)^{-1} X^\top y$.
  • Regularization parameter: $\lambda \ge 0$ controls trade-off between fit and smoothness.
Key ideas: Introduction

Introduction#

Least squares is the workhorse of supervised learning. Given data $X \in \mathbb{R}^{n\times d}$ and targets $y \in \mathbb{R}^n$ with $n > d$, least squares finds $w \in \mathbb{R}^d$ minimizing $f(w) = \tfrac{1}{2}\lVert X w - y \rVert_2^2$. Geometrically, it projects $y$ onto the column space of $X$. The solution $w^* = (X^\top X)^{-1} X^\top y$ exists if $X$ has full rank; stable computation uses QR or SVD.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Overdetermined systems and least squares formulation as projection onto column space.
  • Normal equations and optimality: $\nabla f(w) = X^\top(X w - y) = 0$.
  • Gram matrix $G = X^\top X$ is PSD; condition number $\kappa(G) = \kappa(X)^2$.
  • QR decomposition $X = QR$; normal equations become $R w = Q^\top y$ (stable).
  • SVD solution $w^* = V \Sigma^{-1} U^\top y$ and pseudoinverse.
  • Ridge regression normal equations and bias-variance trade-off.
  • Regularization parameter selection (cross-validation, L-curve, GCV).

Applied (landmark systems)#

  • Linear regression (Hastie et al. 2009; scikit-learn implementation).
  • Kernel ridge regression (Rasmussen & Williams 2006; standard GP predictor).
  • Regularization for ill-posed inverse problems (Hansen 1998; Vogel 2002).
  • Elastic net for feature selection (Zou & Hastie 2005).
  • LASSO regression (Tibshirani 1996).
  • SGD for large-scale least squares (Bottou & LeCun 1998).
  • Implicit regularization in neural networks (Zhu et al. 2021).
Key ideas: Important ideas

Important ideas#

  1. Normal equations
    • $X^\top X w = X^\top y$ characterizes optimality via zero gradient.
  2. Residuals and loss
    • Residual $r = X w - y$; loss $f(w) = \tfrac{1}{2}\lVert r \rVert^2$ is convex in $w$.
  3. Geometry: projection
    • $\hat{y} = X w^* = X(X^\top X)^{-1} X^\top y = P_X y$ projects onto column space.
  4. Conditioning and stability
    • Condition number $\kappa(X^\top X) = \kappa(X)^2$ amplifies numerical error; prefer QR/SVD.
  5. Pseudoinverse solution
    • $w^* = X^\dagger y$ with $X^\dagger = V \Sigma^{-1} U^\top$ (SVD-based); handles rank-deficiency.
  6. Ridge regression
    • Add regularizer $\lambda \lVert w \rVert^2$; normal equations become $(X^\top X + \lambda I) w = X^\top y$. Trades bias for lower variance.
  7. Regularization and ill-posedness
    • Truncated SVD or Tikhonov filtering remove small singular values; stabilizes solutions to ill-posed inverse problems.
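
The projection in item 3 and the orthogonality of the residual can be verified in a few lines (random data; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))      # full column rank almost surely
y = rng.standard_normal(50)

P = X @ np.linalg.inv(X.T @ X) @ X.T  # projector onto col(X); fine for a tiny, well-conditioned X
y_hat = P @ y

print(np.allclose(P @ P, P))          # idempotent
print(np.allclose(P, P.T))            # symmetric
print(np.allclose(X.T @ (y - y_hat), 0, atol=1e-10))  # residual orthogonal to col(X)
```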

 

Key ideas: Relevance to ML

Relevance to ML#

  • Core regression algorithm: linear, polynomial, feature-engineered models.
  • Bias-variance trade-off: unregularized overfits on noise; regularization improves generalization.
  • Feature selection and dimensionality: via regularization (L1/elastic net) or subset selection.
  • Inverse problems: medical imaging, seismic inversion, parameter estimation.
  • Kernel methods: kernel ridge regression as Tikhonov in infinite-dimensional spaces.
  • Deep learning: implicit regularization in SGD and architecture design inspired by least squares principles.

 

Key ideas: Where it shows up
  1. Linear regression and generalized linear models
  • Core supervised learning; extends to logistic regression, Poisson regression, etc. Achievements: classical statistical foundation; scikit-learn, TensorFlow standard solvers. References: Hastie et al. 2009.
  2. Kernel methods and kernel ridge regression
  • Least squares in kernel-induced spaces; KRR = Tikhonov regularization in RKHS. Achievements: competitive with SVMs, enables Gaussian process prediction. References: Rasmussen & Williams 2006.
  3. Inverse problems and imaging
  • Regularized least squares for ill-posed geophysics, medical imaging (CT, MRI). Achievements: Hansen 1998 (regularization tools); clinical deployment. References: Vogel 2002 (computational methods).
  4. Dimensionality reduction via regularization
  • Ridge regression reduces variance on high-dimensional data; elastic net combines L1/L2 penalties. Achievements: Zou & Hastie 2005 (elastic net); foundation for modern feature selection. References: Tibshirani 1996 (LASSO).
  5. Stochastic gradient descent and deep learning
  • SGD on least squares loss drives optimization; implicit regularization enables generalization. Achievements: Bottou & LeCun 1998 (stochastic methods); foundation for deep learning. References: Zhu et al. 2021 (implicit regularization theory).
Notation
  • Data and targets: $X \in \mathbb{R}^{n\times d}, y \in \mathbb{R}^n$ (overdetermined: $n > d$).
  • Parameter vector: $w \in \mathbb{R}^d$.
  • Predictions and residuals: $\hat{y} = X w$, $r = y - X w$.
  • Loss (least squares): $f(w) = \tfrac{1}{2} \lVert X w - y \rVert_2^2 = \tfrac{1}{2} \lVert r \rVert_2^2$.
  • Gram matrix: $G = X^\top X \in \mathbb{R}^{d\times d}$ (PSD).
  • Normal equations: $G w = X^\top y$.
  • QR factorization: $X = QR$ with $Q \in \mathbb{R}^{n\times d}, R \in \mathbb{R}^{d\times d}$ upper triangular.
  • SVD: $X = U \Sigma V^\top$; solution $w^* = V \Sigma^{-1} U^\top y$.
  • Ridge regression: $w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
  • Condition number: $\kappa(X) = \sigma_1 / \sigma_d$; $\kappa(G) = \kappa(X)^2$.
  • Example: If $X$ is $100 \times 5$ with $\sigma_1 = 10, \sigma_5 = 0.1$, then $\kappa(X) = 100$ and $\kappa(X^\top X) = 10000$ (ill-conditioned); use QR or SVD instead of normal equations.
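
The $100 \times 5$ example in the last bullet can be reproduced by planting those singular values in a random orthogonal basis (a sketch; the orthogonal factors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((100, 5)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
s = np.array([10.0, 5.0, 2.0, 1.0, 0.1])            # sigma_1 = 10, sigma_5 = 0.1
X = U @ np.diag(s) @ V.T

print(np.linalg.cond(X))        # ~ 100
print(np.linalg.cond(X.T @ X))  # ~ 10000 = kappa(X)**2
```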
Pitfalls & sanity checks
  • Never solve normal equations for ill-conditioned $X$; use QR or SVD instead.
  • Verify system is overdetermined ($n > d$); underdetermined requires pseudoinverse or regularization.
  • Check $\operatorname{rank}(X) = d$; if rank-deficient, pseudoinverse is needed.
  • Residual $\lVert X w - y \rVert$ should be small but nonzero (unless exact solution exists).
  • Condition number $\kappa(X)$ predicts error magnification; regularize if too large.
  • Cross-validate regularization parameter $\lambda$; do not fit on training data.
  • Check for multicollinearity: if columns of $X$ are nearly dependent, condition number explodes.
  • Standardize features before ridge regression; otherwise $\lambda$ is scale-dependent.
References

Historical foundations

  1. Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
  2. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium.

Classical theory and methods

  3. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.
  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).
  5. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.
  6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

Regularization and ridge regression

  7. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.
  8. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.
  9. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO.
  10. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

Inverse problems and regularization

  11. Hansen, P. C. (1998). Rank-Deficient and Discrete Ill-Posed Problems.
  12. Vogel, C. R. (2002). Computational Methods for Inverse Problems.
  13. Ben-Israel, A., & Greville, T. N. E. (2003). Generalized Inverses: Theory and Applications.

Stochastic optimization and deep learning

  14. Bottou, L., & LeCun, Y. (1998). Large-scale machine learning with stochastic gradient descent.
  15. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.
  16. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

Five worked examples

Worked Example 1: Normal equations and condition number#

Introduction#

Solve an overdetermined least squares system via normal equations; compute condition number and compare to QR.

Purpose#

Illustrate how Gram matrix conditioning affects solution accuracy and why normal equations can fail.

Importance#

Guides choice between normal equations (fast but risky) and QR/SVD (stable but slower).

What this example demonstrates#

  • Construct overdetermined system $X w = y$.
  • Solve via normal equations and via QR factorization.
  • Compute condition numbers $\kappa(X)$ and $\kappa(X^\top X)$.
  • Compare residuals and solution difference.

Background#

Normal equations are fast but square the condition number, amplifying errors when ill-conditioned.

Historical context#

Recognized by Golub–Kahan (1960s) as a fundamental numerical stability issue.

History#

Modern solvers default to QR/SVD and treat normal equations as historical reference.

Prevalence in ML#

Normal equations still used for quick estimates; QR/SVD for production systems.

Notes#

  • Condition number roughly predicts relative error magnification (error ~ $\kappa$ × machine epsilon).
  • For ill-conditioned problems, QR/SVD reduce error by factor of $\kappa(X)$.

Connection to ML#

Conditioning affects whether training converges and generalization; regularization helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ follows from SVD; QR avoids squaring via triangular solve.

Pedagogical Significance#

Concrete demonstration of why stable algorithms matter.

References#

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.
  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 80, 6
# Create ill-conditioned system
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -2, d)
X = U[:n, :d] @ np.diag(s) @ V.T
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

# Solve via normal equations
G = X.T @ X
kappa_G = np.linalg.cond(G)
w_ne = np.linalg.solve(G, X.T @ y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)

# Solve via SVD
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ (np.linalg.solve(np.diag(s_svd), U_svd.T @ y))

kappa_X = s_svd[0] / s_svd[-1]
print("kappa(X):", round(kappa_X, 4), "kappa(X^T X):", round(kappa_G, 4))
print("residual NE:", round(np.linalg.norm(X @ w_ne - y), 6))
print("residual QR:", round(np.linalg.norm(X @ w_qr - y), 6))
print("residual SVD:", round(np.linalg.norm(X @ w_svd - y), 6))

Worked Example 2: QR factorization and stable least squares#

Introduction#

Solve least squares via QR factorization; verify projection onto column space.

Purpose#

Show numerically stable approach compared to normal equations.

Importance#

QR is standard in practice; enables backward-substitution on triangular systems.

What this example demonstrates#

  • Compute QR of $X = QR$.
  • Solve normal equations as $R w = Q^\top y$ (via back-substitution).
  • Verify $\hat{y} = Q Q^\top y$ is the projection.

Background#

QR factorization avoids forming $X^\top X$ explicitly; more stable for ill-conditioned data.

Historical context#

Golub–Kahan algorithm (1965) made QR practical; became standard in numerical libraries.

History#

LAPACK and NumPy default QR implementation.

Prevalence in ML#

Used in scikit-learn LinearRegression, statsmodels, and production systems.

Notes#

  • $\kappa(R) = \kappa(X)$, so no amplification from squaring.
  • Back-substitution on $R$ is faster than forming inverse.

Connection to ML#

Faster convergence for large-scale regression; enables incremental updates.

Connection to Linear Algebra Theory#

Solving via QR works with $\kappa(X)$ rather than $\kappa(X)^2$, avoiding the squaring incurred by the normal equations; the triangular solve costs $O(d^2)$.

Pedagogical Significance#

Demonstrates practical stability improvements.

References#

  1. Golub & Kahan (1965). Singular values and pseudo-inverses.
  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 80, 6
X = np.random.randn(n, d)
X = X / np.linalg.norm(X, axis=0)  # normalize columns
w_true = np.random.randn(d)
y = X @ w_true + 0.01 * np.random.randn(n)

# QR factorization
Q, R = np.linalg.qr(X, mode='reduced')

# Solve via back-substitution
w_qr = np.linalg.solve(R, Q.T @ y)

# Verify projection: y_proj = Q Q^T y; its residual is orthogonal to col(X)
y_proj = Q @ (Q.T @ y)
resid_norm = np.linalg.norm(y - y_proj)

# Compare to normal equations
G = X.T @ X
w_ne = np.linalg.solve(G, X.T @ y)

print("QR solution:", np.round(w_qr[:3], 4))
print("NE solution:", np.round(w_ne[:3], 4))
print("residual norm ||y - y_proj||:", round(resid_norm, 8))
print("residual QR:", round(np.linalg.norm(X @ w_qr - y), 6))

Worked Example 3: Ridge regression and regularization parameter#

Introduction#

Solve ridge regression for different $\lambda$ values; demonstrate bias-variance trade-off.

Purpose#

Show how regularization reduces variance at cost of bias; guide $\lambda$ selection via cross-validation.

Importance#

Ridge is standard regularizer in practice; teaches regularization principles.

What this example demonstrates#

  • Solve ridge normal equations $(X^\top X + \lambda I) w = X^\top y$ for range of $\lambda$.
  • Compute training error, test error, and norm of solution $\lVert w \rVert$.
  • Find optimal $\lambda$ via k-fold cross-validation.

Background#

Tikhonov regularization: add penalty $\lambda \lVert w \rVert^2$ to balance fit and complexity.

Historical context#

Tikhonov (1963) for ill-posed problems; Hoerl & Kennard (1970) for regression.

History#

Ridge regression now standard in modern ML frameworks and statistical software.

Prevalence in ML#

Used in virtually all supervised learning systems for regularization.

Notes#

  • As $\lambda \to 0$: unregularized least squares (high variance, low bias).
  • As $\lambda \to \infty$: solution $w \to 0$ (high bias, low variance).
  • Optimal $\lambda$ found by cross-validation or L-curve method.

Connection to ML#

Core regularization strategy; extends to LASSO (L1), elastic net (L1+L2).

Connection to Linear Algebra Theory#

Regularization improves conditioning: $\kappa(X^\top X + \lambda I) = (\sigma_1^2 + \lambda) / (\sigma_d^2 + \lambda)$.
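
That identity is easy to confirm numerically, since $X^\top X + \lambda I$ has eigenvalues $\sigma_i^2 + \lambda$ (the singular values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.logspace(0, -2, 6)                            # sigma_1 = 1, sigma_6 = 0.01
U, _ = np.linalg.qr(rng.standard_normal((80, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
X = U @ np.diag(s) @ V.T

for lam in [0.0, 1e-4, 1e-2]:
    G = X.T @ X + lam * np.eye(6)
    predicted = (s[0]**2 + lam) / (s[-1]**2 + lam)
    print(lam, np.linalg.cond(G), predicted)         # measured vs. predicted kappa
```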

Pedagogical Significance#

Illustrates bias-variance trade-off quantitatively.

References#

  1. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.
  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 20
# Create correlated (ill-conditioned) design matrix
A = np.random.randn(d, d)
C = np.linalg.cholesky(A.T @ A).T   # correlation-inducing factor
X = np.random.randn(n, d) @ C
w_true = np.random.randn(d)
y = X @ w_true + 0.1 * np.random.randn(n)

# Held-out test set drawn from the same distribution
X_test = np.random.randn(200, d) @ C
y_test = X_test @ w_true + 0.1 * np.random.randn(200)

lams = np.logspace(-4, 2, 20)
errors_train = []
errors_test = []
norms_w = []

for lam in lams:
    G = X.T @ X + lam * np.eye(d)
    w = np.linalg.solve(G, X.T @ y)
    errors_train.append(np.linalg.norm(X @ w - y)**2 / n)
    errors_test.append(np.linalg.norm(X_test @ w - y_test)**2 / len(y_test))
    norms_w.append(np.linalg.norm(w))

opt_idx = np.argmin(errors_test)
print("optimal lambda:", round(lams[opt_idx], 6))
print("train error at opt:", round(errors_train[opt_idx], 6))
print("test error at opt:", round(errors_test[opt_idx], 6))
print("norm(w) at opt:", round(norms_w[opt_idx], 4))

Worked Example 4: SVD-based pseudoinverse for rank-deficient systems#

Introduction#

Solve rank-deficient least squares via SVD pseudoinverse; compare to underdetermined system.

Purpose#

Show how SVD handles rank deficiency gracefully (vs. normal equations failing).

Importance#

Essential for underdetermined and ill-posed problems; enables robust solutions.

What this example demonstrates#

  • Construct rank-deficient $X$ (rank strictly less than the number of columns).
  • Compute pseudoinverse $X^\dagger = V \Sigma^{-1} U^\top$ via SVD.
  • Find minimum-norm solution $w^* = X^\dagger y$.
  • Verify that solution has smallest $\lVert w \rVert$ among all least-squares solutions.

Background#

Moore–Penrose pseudoinverse extends inverse to non-square/rank-deficient matrices.

Historical context#

Formalized early 1900s; SVD computation enabled practical implementation (Golub 1960s).

History#

Standard in scientific computing and ML libraries for robust least squares.

Prevalence in ML#

Used in feature selection (removing redundant features) and underdetermined systems.

Notes#

  • Minimum-norm solution is unique; smallest in $\ell_2$ norm among all minimizers.
  • Handle tiny singular values carefully (threshold or regularize).

Connection to ML#

Supports feature selection and handles collinear features.

Connection to Linear Algebra Theory#

Pseudoinverse via SVD; minimum norm property from projection theory.

Pedagogical Significance#

Extends inversion to singular/rectangular matrices.

References#

  1. Golub & Pereyra (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.
  2. Ben-Israel & Greville (2003). Generalized Inverses: Theory and Applications.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 50, 30
# Rank deficient: only 20 independent columns
X = np.random.randn(n, 20) @ np.random.randn(20, d)
w_true = np.random.randn(d)
w_true[25:] = 0  # sparse ground truth
y = X @ w_true + 0.01 * np.random.randn(n)

# SVD-based pseudoinverse (truncate to numerical rank)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = np.sum(s > 1e-10)
# V_r maps rank-r coefficients back to R^d, so w_pinv already has length d
w_pinv = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])

print("rank of X:", r)
print("residual:", round(np.linalg.norm(X @ w_pinv - y), 6))
print("norm(w):", round(np.linalg.norm(w_pinv), 4))

Worked Example 5: Truncated SVD for ill-posed inverse problems#

Introduction#

Solve an ill-posed inverse problem; apply truncated SVD regularization to stabilize solution.

Purpose#

Demonstrate spectral filtering and its effect on noise amplification.

Importance#

Core technique in inverse problems (imaging, geophysics); teaches when to truncate spectrum.

What this example demonstrates#

  • Construct ill-posed system with decaying singular values.
  • Solve with pseudoinverse (amplifies noise) vs. truncated SVD (filters noise).
  • Compare noise-free and noisy solutions; show improved robustness of truncation.

Background#

Ill-posed problems have tiny singular values; pseudoinverse amplifies noise. Truncation discards these.

Historical context#

Hansen (1998) and Vogel (2002) developed regularization tools for inverse problems.

History#

Standard in medical imaging (deblurring CT/MRI) and geophysical inversion.

Prevalence in ML#

Used in deblurring, denoising, and parameter estimation in inverse problems.

Notes#

  • Choose truncation point via L-curve, GCV, or discrepancy principle.
  • Trade-off: lower truncation $\to$ more smoothing, less noise, but more bias.

Connection to ML#

Improves robustness of learned models in presence of noise and measurement error.

Connection to Linear Algebra Theory#

Small singular values correspond to high-frequency/noisy directions; truncation removes them.

Pedagogical Significance#

Shows quantitative benefit of spectral filtering.

References#

  1. Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems.
  2. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 80, 50
# Create ill-posed system: exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(d, d))
s = np.exp(-np.linspace(0, 3, min(n, d)))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U @ Sigma @ V.T

# True solution and clean data
w_true = np.zeros(d)
w_true[:5] = [10, 5, 2, 1, 0.5]
y_clean = A @ w_true

# Add noise
noise_level = 0.01
y_noisy = y_clean + noise_level * np.random.randn(n)

# Full pseudoinverse solution
U_a, s_a, Vt_a = np.linalg.svd(A, full_matrices=False)
w_full = Vt_a.T @ (np.linalg.solve(np.diag(s_a), U_a.T @ y_noisy))

# Truncated SVD solutions
errors = []
truncs = range(5, 30)
for trunc in truncs:
    s_trunc = s_a[:trunc]
    w_trunc = Vt_a[:trunc].T @ (np.linalg.solve(np.diag(s_trunc), U_a[:, :trunc].T @ y_noisy))
    err = np.linalg.norm(w_trunc - w_true)
    errors.append(err)

best_trunc = truncs[np.argmin(errors)]
print("smallest singular value:", round(s_a[-1], 8))
print("error full pseudoinverse:", round(np.linalg.norm(w_full - w_true), 4))
print("error best truncation (k={})".format(best_trunc), round(min(errors), 4))

Chapter 11
Principal Component Analysis
Key ideas: Introduction

Introduction#

PCA seeks a low-dimensional projection that captures the most variance. Geometrically, it rotates data so axes align with directions of maximal spread. Algebraically, it solves the optimization $\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$, yielding the top eigenvector of $X_c^\top X_c$. Successive components are orthogonal and capture diminishing variance.

Important ideas#

  1. Covariance matrix

    • $\Sigma = \tfrac{1}{n}X_c^\top X_c$ with $X_c$ centered. Eigenvalues $\lambda_i$ are variances along principal directions.

  2. Principal components (eigenvectors)

    • Columns of $V$ from SVD (or eigenvectors of $\Sigma$) form an orthonormal basis ordered by variance.

  3. Explained variance ratio

    • EVR = $\tfrac{\lambda_i}{\sum_j \lambda_j}$ quantifies how much total variance component $i$ explains; cumulative EVR guides dimensionality choice.

  4. Scores and loadings

    • Scores: $Z = X_c V$ (projections onto components); loadings: $V$ (directions in original space).

  5. Reconstruction and truncation

    • Truncated PCA: keep $k$ components; $\tilde{X}_c = Z_k V_k^\top$ minimizes squared error (Eckart–Young).

  6. Standardization and scaling

    • Standardize to unit variance before PCA if variables have different scales; otherwise leading component may be dominated by high-variance features.

  7. Whitening

    • Transform to unit variance: $Z_w = Z \Lambda^{-1/2}$ decorrelates and rescales for downstream algorithms (e.g., RBF kernels).
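
These pieces can be combined in a few lines via the SVD of the centered data (a sketch on synthetic anisotropic data; shapes and scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.1])  # anisotropic cloud

Xc = X - X.mean(axis=0)                  # center
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / len(Xc)                 # eigenvalues of (1/n) Xc^T Xc
evr = eigvals / eigvals.sum()            # explained variance ratio
Z = Xc @ Vt.T                            # scores (projections onto components)
Zw = Z / s * np.sqrt(len(Xc))            # whitening: unit variance per component

print(np.round(evr, 3))                  # decreasing, sums to 1
print(np.round(np.cov(Zw.T, bias=True), 3))   # ~ identity after whitening
```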

Relevance to ML#

  • Dimensionality reduction: speeds training, avoids overfitting, improves generalization.

  • Visualization: 2D/3D projection of high-dimensional data for exploration.

  • Preprocessing: removes noise, aligns scales, improves conditioning of solvers.

  • Feature extraction: learned components as features for downstream classifiers.

  • Denoising: truncated PCA removes low-variance (noisy) directions.

  • Whitening: standardizes correlation structure, crucial for many algorithms (kernels, distance-based methods).

Algorithmic development (milestones)#

  • 1901: Pearson introduces lines/planes of closest fit (geometric intuition).

  • 1933: Hotelling formalizes PCA as eigen-decomposition of covariance.

  • 1950s–1960s: Computational advances (QR, Jacobi methods) enable practical PCA.

  • 1997: Probabilistic PCA (Tipping–Bishop) bridges PCA and Gaussian latent variable models.

  • 1997–2010s: Kernel PCA (Schölkopf et al.) and sparse PCA emerge for nonlinear and interpretable variants.

  • 2000s–2011: Randomized PCA for large-scale data (Halko–Martinsson–Tropp).

  • 2010s: PCA integrated into deep learning (autoencoders, PCA layers, spectral initialization).

Definitions#

  • Centered data: $X_c = X - \bar{X}$ with $\bar{X} = \tfrac{1}{n}\mathbf{1}^\top X$ (vector of column/feature means, broadcast across rows).

  • Covariance matrix: $\Sigma = \tfrac{1}{n}X_c^\top X_c \in \mathbb{R}^{d\times d}$ (PSD).

  • Principal components: eigenvectors of $\Sigma$, ordered by eigenvalue magnitude.

  • Variance explained by component $i$: $\lambda_i / \operatorname{tr}(\Sigma)$.

  • Whitened data: $Z_w = X_c V \Lambda^{-1/2}$ with $\Lambda$ diagonal eigenvalue matrix.

  • Reconstructed data: $\tilde{X}_c = X_c V_k V_k^\top$ using rank-$k$ approximation.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Covariance matrix and its PSD structure (Chapter 09).

  • Eigen-decomposition of symmetric covariance matrix.

  • Variational characterization: $\arg\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$ yields top eigenvector.

  • Eckart–Young–Mirsky low-rank approximation error (Chapter 10).

  • Relation to SVD: PCA via SVD of centered data (Chapter 10).

  • Standardization and scaling effects on covariance eigenvalues.
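The variational characterization above can be checked numerically with a few lines of power iteration — a minimal sketch on synthetic data (no library PCA assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) @ np.diag([3.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

# Power iteration maximizes the Rayleigh quotient u^T Sigma u over unit u,
# converging to the top eigenvector of Sigma
u = rng.standard_normal(4)
for _ in range(200):
    u = Sigma @ u
    u /= np.linalg.norm(u)

top = np.linalg.eigh(Sigma)[1][:, -1]  # eigh sorts ascending; last column is top
print(abs(u @ top))  # close to 1 (eigenvectors carry a sign ambiguity)
```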

Applied (landmark systems)#

  • Dimensionality reduction (Jolliffe 2002; Hastie et al. 2009).

  • Whitening for deep learning (LeCun et al. 1998; Krizhevsky et al. 2012).

  • Probabilistic PCA and latent variable models (Tipping & Bishop 1997).

  • Kernel PCA for nonlinear reduction (Schölkopf et al. 1998).

  • Randomized PCA for large scale (Halko–Martinsson–Tropp 2011).

  • Matrix completion via truncated SVD (Candès & Tao 2010).

Key ideas: Where it shows up
  1. Dimensionality reduction and preprocessing

  • Removes redundant features; improves stability of downstream solvers. Achievements: widely used in computer vision (image preprocessing), bioinformatics (gene expression), and AutoML pipelines. References: Jolliffe 2002.

  2. Visualization and exploratory data analysis

  • Project to 2D/3D for interactive inspection and cluster discovery. Achievements: industry standard in data exploration tools (Pandas, R, Plotly). References: Hastie et al. 2009.

  3. Whitening and decorrelation

  • Standardizes feature covariance to identity; improves kernel methods and RBF networks. Achievements: standard preprocessing in deep learning frameworks. References: LeCun et al. 1998 (early deep learning); Krizhevsky et al. 2012 (ImageNet AlexNet).

  4. Denoising and matrix completion

  • Truncated PCA recovers low-rank structure from noisy observations. Achievements: used in image inpainting and recommendation cold-start. References: Candès & Tao 2010 (matrix completion); Pearson 1901 (geometric intuition).

  5. Feature extraction and representation learning

  • Learned components become features for classifiers; precursor to autoencoders. Achievements: basis for deep autoencoders and VAEs. References: Hinton & Salakhutdinov 2006 (deep learning via autoencoders).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$; centered $X_c = X - \bar{X}$.

  • Covariance: $\Sigma = \tfrac{1}{n}X_c^\top X_c$.

  • Eigendecomposition: $\Sigma = V \Lambda V^\top$ with $\Lambda$ diagonal.

  • Principal components: columns of $V$; eigenvalues $\lambda_i$ are variances.

  • Scores (projections): $Z = X_c V \in \mathbb{R}^{n\times d}$ or truncated $Z_k = X_c V_k$.

  • Explained variance ratio: $\text{EVR}_i = \tfrac{\lambda_i}{\sum_j \lambda_j}$.

  • Standardized data: $X_s = X_c / \sigma$ (element-wise division by each column's standard deviation).

  • Whitened data: $Z_w = Z \Lambda^{-1/2} = X_c V \Lambda^{-1/2}$.

  • Example: If $X$ is $100 \times 50$ with 2 dominant eigenvalues $\lambda_1=8, \lambda_2=3, \sum_j \lambda_j=12$, then $\text{EVR}_1 \approx 0.67, \text{EVR}_2 \approx 0.25$; keep 2 components to explain $92\%$ of variance.
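The worked numbers in the example above are easy to verify directly (the remaining eigenvalues are chosen here so the total matches the stated sum of 12):

```python
import numpy as np

lam = np.array([8.0, 3.0, 0.6, 0.4])  # eigenvalues summing to 12, as in the example
evr = lam / lam.sum()

print(np.round(evr[:2], 2))         # [0.67 0.25]
print(round(np.cumsum(evr)[1], 2))  # 0.92: two components explain ~92% of variance
```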

Pitfalls & sanity checks
  • Always center data; forgetting this is a common error.

  • Standardize features if they have different scales; otherwise PCA is dominated by high-variance features.

  • Sign ambiguity: eigenvectors are unique up to sign; do not compare raw signs across methods.

  • Small/negative eigenvalues: should not occur for PSD covariance matrix; indicates numerical error or centering issue.

  • Reconstruction: verify $\lVert X_c - X_c V_k V_k^\top \rVert_F$ equals tail variance for sanity check.

  • Number of components: do not blindly choose $k$; use scree plot, cumulative variance, or cross-validation.
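The cumulative-variance heuristic from the last point — pick the smallest $k$ whose cumulative explained variance reaches a threshold — might be sketched as follows (the helper name `choose_k` is ours, not a library API):

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio meets threshold.

    Assumes eigvals are sorted in descending order.
    """
    evr = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(evr, threshold)) + 1

# Cumulative EVR: 0.67, 0.92, 0.97, 1.0 -> need k=3 for a 95% threshold
print(choose_k(np.array([8.0, 3.0, 0.6, 0.4])))  # 3
```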

References

Foundational work

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.

Classical theory and methods

  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.).

  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  5. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

Numerical algorithms

  6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  7. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  8. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding structure with randomness.

Extensions and applications

  9. Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Kernel Principal Component Analysis.

  10. Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis.

  11. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.

  12. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  13. Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion.

  14. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.

Five worked examples

Worked Example 1: Computing PCA via eigen-decomposition and interpreting variance#

Introduction#

Compute PCA on a synthetic dataset via covariance eigendecomposition; examine explained variance and principal directions.

Purpose#

Build intuition for how eigenvalues quantify variance along principal axes.

Importance#

Diagnostics guide choice of number of components for downstream tasks.

What this example demonstrates#

  • Center data; compute covariance matrix.

  • Compute eigendecomposition of covariance.

  • Report eigenvalues, cumulative explained variance ratio, and principal component directions.

Background#

Hotelling (1933) formalized PCA as eigen-decomposition of covariance.

Historical context#

Foundational work in multivariate statistics; adapted widely in ML.

History#

Standard in all major statistical/ML libraries (R, Python, MATLAB).

Prevalence in ML#

Used routinely in preprocessing and exploratory analysis.

Notes#

  • Ensure data is centered; if not, covariance is inaccurate.

  • Eigenvalues are variances; eigenvectors are directions.

Connection to ML#

Explains what PCA extracts and why truncation works.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition is the Rayleigh quotient optimization.

Pedagogical Significance#

Links variance to eigenvalues concretely.

References#

  1. Hotelling, H. (1933). Analysis of complex statistical variables into principal components.

  2. Jolliffe, I. T. (2002). Principal Component Analysis.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
evals = evals[::-1]
evecs = evecs[:, ::-1]

cumsum_var = np.cumsum(evals) / evals.sum()

print("eigenvalues:", np.round(evals, 4))
print("explained variance ratio:", np.round(evals / evals.sum(), 4))
print("cumulative EVR (k=1,2,3):", np.round(cumsum_var[:3], 4))

Worked Example 2: PCA via SVD (numerically stable)#

Introduction#

Compute PCA using SVD of centered data instead of forming covariance matrix explicitly.

Purpose#

Show how SVD avoids squaring condition number; more numerically stable for ill-conditioned data.

Importance#

Standard in practice; avoids explicit covariance computation and is faster for tall data.

What this example demonstrates#

  • SVD of $X_c / \sqrt{n}$ yields principal components (columns of $V$) and singular values.

  • Squared singular values equal eigenvalues of $X_c^\top X_c / n$.

Background#

SVD is numerically more stable than eigen-decomposition of $X_c^\top X_c$.

Historical context#

Popularized in numerical linear algebra as the default PCA route.

History#

Standard in scikit-learn PCA class; uses SVD internally.

Prevalence in ML#

Default in modern PCA implementations.

Notes#

  • Use full_matrices=False for efficiency when $n \gg d$.

  • If the SVD is taken of the raw $X_c$, singular values relate to eigenvalues via $\lambda_i = (s_i / \sqrt{n})^2$; factoring $X_c / \sqrt{n}$ (as in the code below) gives $\lambda_i = s_i^2$ directly.

Connection to ML#

More robust to numerical issues; faster for large $n$.

Connection to Linear Algebra Theory#

SVD of $X_c$ relates directly to covariance eigen-structure.

Pedagogical Significance#

Bridges SVD (Chapter 10) and PCA practically.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals_from_svd = s ** 2

Sigma = (Xc.T @ Xc) / n
evals_from_eig = np.linalg.eigvalsh(Sigma)[::-1]

print("eigenvalues from eig:", np.round(evals_from_eig, 6))
print("eigenvalues from SVD:", np.round(evals_from_svd, 6))
print("difference:", np.linalg.norm(evals_from_eig - evals_from_svd))

Worked Example 3: Dimensionality reduction and reconstruction error#

Introduction#

Demonstrate PCA truncation to $k$ components; compare reconstruction error to variance lost.

Purpose#

Show how truncated PCA minimizes squared error (Eckart–Young); guide choice of $k$.

Importance#

Core to deciding how many components to keep in applications.

What this example demonstrates#

  • Compute full PCA; truncate to $k$ components.

  • Reconstruct and compute Frobenius error.

  • Verify error matches variance in discarded components.

Background#

Eckart–Young–Mirsky theorem guarantees optimality of rank-$k$ truncation.

Historical context#

Theoretical guarantee for best low-rank approximation.

History#

Used in all dimensionality reduction and compression workflows.

Prevalence in ML#

Standard choice heuristic for $k$: keep 90–95% explained variance.

Notes#

  • Reconstruction error squared equals sum of squared singular values of discarded components.

  • Trade-off: fewer components → less storage/compute, but more information loss.

Connection to ML#

Informs practical $k$ selection for downstream tasks.

Connection to Linear Algebra Theory#

Optimal low-rank approximation per Eckart–Young theorem.

Pedagogical Significance#

Links theory to practical dimensionality reduction.

References#

  1. Eckart & Young (1936). The approximation of one matrix by another of lower rank.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 10
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals = s ** 2

k = 4
# The SVD factored Xc / sqrt(n), so reconstruct and compare at that scale;
# then error_fro**2 equals the discarded eigenvalue mass (Eckart-Young).
Xs = Xc / np.sqrt(n)
Xs_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
error_fro = np.linalg.norm(Xs - Xs_k, "fro")
tail_vars = evals[k:].sum()

ev_ratio = np.cumsum(evals) / evals.sum()
print("explained variance ratio (k=1..5):", np.round(ev_ratio[:5], 4))
print("reconstruction error:", round(error_fro, 4))
print("tail variance:", round(tail_vars, 4), "sqrt:", round(np.sqrt(tail_vars), 4))

Worked Example 4: Whitening and decorrelation#

Introduction#

Apply PCA whitening to standardize covariance; show uncorrelated and unit-variance output.

Purpose#

Demonstrate how whitening decorrelates features and enables downstream algorithms.

Importance#

Preprocessing step for many algorithms (kernels, RBF networks, distance-based methods).

What this example demonstrates#

  • Compute PCA; form whitening transform $Z_w = Z \Lambda^{-1/2}$.

  • Verify output covariance is identity.

  • Compare to standard scaling.

Background#

Whitening removes correlation structure and equalizes variance across dimensions.

Historical context#

Used in signal processing; adopted in deep learning for stabilization.

History#

LeCun et al. (1998) highlighted importance in early deep learning; Krizhevsky et al. (2012) used it in AlexNet.

Prevalence in ML#

Standard preprocessing in deep learning, kernel methods, and statistical tests.

Notes#

  • Add small floor to tiny eigenvalues to avoid division by zero.

  • Whitening can amplify noise if done naively on high-variance directions.

Connection to ML#

Improves convergence, gradient scales, and generalization of many algorithms.

Connection to Linear Algebra Theory#

Transform to canonical coordinate system (aligned with PCA axes).

Pedagogical Significance#

Practical application of PCA-based preprocessing.

References#

  1. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 150, 4
# Create correlated features
A = np.random.randn(d, d)
Cov = A.T @ A
X = np.random.randn(n, d) @ np.linalg.cholesky(Cov).T
Xc = X - X.mean(axis=0, keepdims=True)

evals, evecs = np.linalg.eigh((Xc.T @ Xc) / n)
evals = evals[::-1]
evecs = evecs[:, ::-1]

# Whitening transform
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
Z = Xc @ evecs
Zw = Z @ Lambda_inv_sqrt

# Verify output covariance is identity
Sigma_w = (Zw.T @ Zw) / n

print("input covariance diag:", np.round(np.diag((Xc.T @ Xc) / n), 4))
print("whitened covariance diag:", np.round(np.diag(Sigma_w), 4))
print("whitened covariance max deviation from identity:", round(np.max(np.abs(Sigma_w - np.eye(d))), 6))

Worked Example 5: Denoising via truncated PCA#

Introduction#

Apply truncated PCA to a noisy signal; show noise reduction as a function of truncation.

Purpose#

Illustrate how keeping top-$k$ components removes high-frequency (noisy) information.

Importance#

Core application in image denoising, signal processing, and data cleaning.

What this example demonstrates#

  • Add noise to data; apply truncated PCA for different $k$.

  • Measure reconstruction error vs. ground truth vs. noise level.

  • Show improvement from truncation.

Background#

Noise typically occupies low-variance directions; truncation removes it.

Historical context#

Classical application dating to Pearson (1901); widely used in signal/image processing.

History#

Precursor to modern deep denoising autoencoders.

Prevalence in ML#

Used in image inpainting, audio denoising, sensor data cleanup.

Notes#

  • Noise reduction works best if signal occupies few components.

  • Trade-off: lower $k$ → more denoising, but may remove true signal.

Connection to ML#

Improves feature quality for downstream models; common preprocessing.

Connection to Linear Algebra Theory#

Low-rank structure (signal) separated from noise via truncation.

Pedagogical Significance#

Demonstrates practical benefit of dimensionality reduction.

References#

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 100, 20
# True signal with low-rank structure
U_true, _ = np.linalg.qr(np.random.randn(n, 5))
V_true, _ = np.linalg.qr(np.random.randn(d, 5))
s_true = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X_clean = U_true @ np.diag(s_true) @ V_true.T

# Add noise
noise = 0.5 * np.random.randn(n, d)
X_noisy = X_clean + noise

mean = X_noisy.mean(axis=0, keepdims=True)
Xc_noisy = X_noisy - mean
U, s, Vt = np.linalg.svd(Xc_noisy, full_matrices=False)

errors = []
ks = range(1, 11)
for k in ks:
    # Reconstruct in the original coordinates (add the mean back) before
    # comparing against the clean signal
    X_k = mean + U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
    err = np.linalg.norm(X_clean - X_k, "fro")
    errors.append(err)

print("reconstruction error for k=1..5:", np.round(errors[:5], 4))
print("best k:", np.argmin(errors) + 1)

Comments

Algorithm Category
Data Modality
Historical & Attribution
Learning Path & Sequencing
Matrix Decompositions
ML Applications
Theoretical Foundation
Chapter 10
Singular Value Decomposition
Key ideas: Introduction

Introduction#

SVD generalizes eigen-decomposition to arbitrary rectangular matrices. Columns of $U$ span the column space; columns of $V$ span the row space; singular values quantify stretch along these directions. SVD underlies dimensionality reduction, denoising, and stable linear solves.

Important ideas#

  1. Orthogonal bases

    • $U \in \mathbb{R}^{n\times n}$, $V \in \mathbb{R}^{d\times d}$ orthogonal; $\Sigma$ stores $\sigma_1 \ge \dots \ge \sigma_r > 0$ with $r=\operatorname{rank}(X)$.

  2. Spectral norm and conditioning

    • $\lVert X \rVert_2 = \sigma_1$; condition number $\kappa_2(X)=\sigma_1/\sigma_r$ for full-rank square $X$.

  3. Best low-rank approximation

    • Eckart–Young–Mirsky: $X_k = U_k \Sigma_k V_k^\top$ minimizes $\lVert X - Y \rVert_F$ over $\operatorname{rank}(Y)\le k$; squared error $\lVert X - X_k \rVert_F^2 = \sum_{i>k}\sigma_i^2$.

  4. Pseudoinverse and least squares

    • $X^\dagger = V \Sigma^\dagger U^\top$; solves minimum-norm least squares and handles rank deficiency.

  5. Energy and variance

    • In PCA, singular values squared (scaled) are variances along principal axes.

  6. Truncation and regularization

    • Truncated SVD and spectral filtering (e.g., Tikhonov) improve stability for ill-conditioned problems.

  7. Computation

    • Golub–Kahan bidiagonalization; randomized SVD for large, sparse data.
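The pseudoinverse construction above ($X^\dagger = V \Sigma^\dagger U^\top$) can be sanity-checked against NumPy's built-in `np.linalg.pinv` in a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))  # tall, full column rank with probability 1

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T  # V Sigma^+ U^T

print(np.allclose(X_pinv, np.linalg.pinv(X)))  # True
```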

Relevance to ML#

  • Dimensionality reduction (PCA/LSA/embeddings).

  • Low-rank modeling for recommendation and compression.

  • Stable least squares and ridge via spectral filtering.

  • Conditioning diagnostics and gradient scaling.

  • Spectral clustering and graph embeddings (Laplacian SVD/eigendecomp relation).

Algorithmic development (milestones)#

  • 1930s: Schmidt, Eckart–Young best approximation theorem.

  • 1960s: Golub–Kahan algorithms for practical SVD.

  • 1990s: Latent Semantic Analysis uses truncated SVD for text.

  • 2000s: Randomized SVD for large-scale ML (Halko–Martinsson–Tropp).

  • 2010s: Matrix factorization dominates recommender benchmarks (Netflix era); spectral initializations in deep models.

Definitions#

  • Full SVD: $X = U \Sigma V^\top$ with $U^\top U=I_n$, $V^\top V=I_d$, $\Sigma \in \mathbb{R}^{n\times d}$ diagonal (nonnegative entries).

  • Thin SVD: $X = U_r \Sigma_r V_r^\top$ with $r=\operatorname{rank}(X)$, $U_r \in \mathbb{R}^{n\times r}$, $V_r \in \mathbb{R}^{d\times r}$.

  • Pseudoinverse: $X^\dagger = V_r \Sigma_r^{-1} U_r^\top$.

  • Spectral norm: $\lVert X \rVert_2 = \sigma_1$; Frobenius norm $\lVert X \rVert_F^2 = \sum_i \sigma_i^2$.
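The norm identities in the last definition are easy to verify numerically on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
s = np.linalg.svd(X, compute_uv=False)  # singular values only, descending

print(np.isclose(np.linalg.norm(X, 2), s[0]))                       # spectral norm = sigma_1
print(np.isclose(np.linalg.norm(X, "fro"), np.sqrt((s**2).sum())))  # Frobenius^2 = sum sigma_i^2
```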

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Existence and uniqueness (up to signs) of SVD for any real matrix.

  • Eckart–Young–Mirsky best rank-$k$ approximation theorem.

  • Relation to eigen-decomposition of $X^\top X$ and $X X^\top$.

  • Moore–Penrose pseudoinverse via SVD; minimum-norm least squares.

  • Spectral/Frobenius norm expressions via singular values.

Applied (landmark systems)#

  • PCA via SVD for dimensionality reduction (Jolliffe 2002; Shlens 2014).

  • LSA for text (Deerwester et al. 1990).

  • Matrix factorization in recommenders (Koren et al. 2009).

  • Spectral regularization for ill-posed problems (Hansen 1998).

  • Randomized SVD for large-scale ML (Halko–Martinsson–Tropp 2011).

Key ideas: Where it shows up
  1. PCA and variance decomposition

  • Centered data $X_c$ yields $\tfrac{1}{n}X_c^\top X_c = V \tfrac{\Sigma^2}{n} V^\top$. Explained variance ratios follow $\sigma_i^2$. Achievements: Shlens 2014 (tutorial), Jolliffe 2002 (classical PCA), used widely in vision/audio.

  2. Latent semantic analysis and embeddings

  • Truncated SVD on term-document matrices uncovers latent topics. Achievement: Deerwester et al. 1990 LSA; foundation for modern embeddings.

  3. Recommendation and matrix completion

  • Low-rank factorization approximates user-item matrices; SVD intuition drives alternating least squares. Achievements: Netflix Prize-era matrix factorization (Koren et al. 2009).

  4. Stable least squares and pseudoinverse

  • $x^* = X^\dagger y$ gives minimum-norm solution; SVD reveals conditioning. Achievements: Golub–Van Loan numerical stability guidance.

  5. Regularization and denoising

  • Truncated SVD/Tikhonov filter small singular values to denoise ill-posed problems (e.g., deblurring). Achievements: Hansen 1998 (regularization tools).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$ (rows = examples).

  • SVD: $X = U \Sigma V^\top$, singular values $\sigma_i$ sorted descending.

  • Rank-$k$: $X_k = U_k \Sigma_k V_k^\top$ with $U_k \in \mathbb{R}^{n\times k}$, $\Sigma_k \in \mathbb{R}^{k\times k}$, $V_k \in \mathbb{R}^{d\times k}$.

  • Pseudoinverse: $X^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ replaces nonzero $\sigma_i$ by $1/\sigma_i$.

  • Explained variance: $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i} \sigma_i^2}$.

  • Example: If $X = U \Sigma V^\top$ with $\sigma_1=5,\sigma_2=2,\sigma_3=0.5$, then $\kappa_2(X)=10$ and the best rank-2 approximation has squared Frobenius error $\sigma_3^2 = 0.25$.
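The example above can be reproduced by building a matrix with exactly those singular values from random orthogonal factors — a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # orthonormal columns
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # orthogonal
X = Q1 @ np.diag([5.0, 2.0, 0.5]) @ Q2.T           # singular values 5, 2, 0.5 by construction

s = np.linalg.svd(X, compute_uv=False)
X2 = Q1[:, :2] @ np.diag([5.0, 2.0]) @ Q2[:, :2].T  # best rank-2 approximation

print(round(s[0] / s[-1], 6))                        # 10.0 (condition number)
print(round(np.linalg.norm(X - X2, "fro") ** 2, 6))  # 0.25 (squared Frobenius error)
```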

Pitfalls & sanity checks
  • Always center data before PCA via SVD.

  • Sign ambiguity: singular vectors are unique up to sign; do not compare raw signs across runs.

  • Conditioning: tiny singular values indicate potential instability; consider truncation or ridge.

  • Shapes: ensure full_matrices=False for thin SVD when $n \gg d$ to save compute.

  • Reconstruction checks: verify $\lVert X - U\Sigma V^\top \rVert_F$ is near machine precision for correctness.

References
  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  3. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

  4. Jolliffe, I. T. (2002). Principal Component Analysis.

  5. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  6. Deerwester, S. et al. (1990). Indexing by Latent Semantic Analysis.

  7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  8. Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems.

  9. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

Five worked examples

Worked Example 1: Full SVD, rank, and reconstruction#

Introduction#

Compute thin SVD of a tall matrix, verify reconstruction and conditioning.

Purpose#

Show how $U,\Sigma,V$ capture range, domain, and scaling; check rank and condition number.

Importance#

SVD diagnostics reveal redundancy and stability before downstream tasks.

What this example demonstrates#

  • Thin SVD of $X \in \mathbb{R}^{n\times d}$.

  • Reconstruction $U\Sigma V^\top$ matches $X$.

  • Rank from nonzero singular values; condition number from extremes.

Background#

SVD generalizes eigen-decomposition; thin form is efficient when $n \gg d$.

Historical context#

Golub–Kahan bidiagonalization made SVD practical (1965).

History#

Adopted as a standard numerical primitive in LAPACK/BLAS.

Prevalence in ML#

Used in preprocessing, diagnostics, and as a baseline factorization.

Notes#

  • Use np.linalg.svd with full_matrices=False for thin SVD.

  • Small singular values indicate near-rank-deficiency.

Connection to ML#

Conditioning guides choice of solvers/regularizers; rank informs model capacity.

Connection to Linear Algebra Theory#

$X^\top X = V \Sigma^2 V^\top$; nonzero singular values are square roots of eigenvalues.

Pedagogical Significance#

Concrete start-to-finish SVD computation with sanity checks.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 8, 4
X = np.random.randn(n, d) + 0.2 * np.random.randn(n, d)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)
X_rec = U @ Sigma @ Vt

rank = np.sum(s > 1e-10)
kappa = s[0] / s[-1]

print("singular values:", np.round(s, 4))
print("rank:", rank, "cond:", round(kappa, 4))
print("reconstruction error Fro:", np.linalg.norm(X - X_rec, "fro"))

Worked Example 2: Best rank-k approximation (Eckart–Young)#

Introduction#

Demonstrate low-rank approximation of an image-like matrix and quantify error decay.

Purpose#

Show optimality of truncated SVD and energy captured by top-$k$ singular values.

Importance#

Core to compression, recommendation, and fast inference with reduced models.

What this example demonstrates#

  • Build $X$, compute SVD, form $X_k$.

  • Compare Frobenius error to tail singular values.

  • Report explained variance ratio.

Background#

Eckart–Young–Mirsky: truncated SVD minimizes Frobenius/spectral error among rank-$k$ matrices.

Historical context#

Eckart & Young (1936) established the optimality theorem.

History#

Used in image compression and modern low-rank neural adaptation.

Prevalence in ML#

Standard for LSA, recommender warm-starts, and distillation via low-rank layers.

Notes#

  • Sort singular values (already sorted by SVD) before computing energy ratios.

Connection to ML#

Determines how many components to keep for accuracy/latency trade-offs.

Connection to Linear Algebra Theory#

Error squared equals sum of squared discarded singular values.

Pedagogical Significance#

Links abstract theorem to a tangible compression example.

References#

  1. Eckart & Young (1936). Approximation of matrices.

  2. Hansen (1998). Rank truncation for regularization.

Solution (Python)#

import numpy as np

np.random.seed(1)
X = np.random.randn(40, 30)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err = np.linalg.norm(X - Xk, "fro")
tail = np.sqrt((s[k:] ** 2).sum())
ev_ratio = (s[:k] ** 2).sum() / (s ** 2).sum()

print("rank-k error:", round(err, 4), "tail from theorem:", round(tail, 4))
print("explained variance ratio:", round(ev_ratio, 4))

Worked Example 3: PCA via SVD on centered data#

Introduction#

Use SVD on centered data to extract principal components and explained variance.

Purpose#

Show that PCA can be computed via SVD without forming the covariance matrix explicitly.

Importance#

Numerically stable and efficient for tall data; avoids $d\times d$ covariance when $d$ is large.

What this example demonstrates#

  • Center data $X_c$.

  • Compute SVD of $X_c / \sqrt{n}$ to obtain principal directions and variances.

  • Project onto top components and report variance explained.

Background#

PCA solves an eigenproblem on covariance; SVD is an equivalent and stable route.

Historical context#

Hotelling (1933) PCA; SVD popularized for PCA in numerical practice.

History#

Standard in scikit-learn and ML toolkits.

Prevalence in ML#

Common preprocessing for vision, speech, genomics, and exploratory analysis.

Notes#

  • Center rows; scaling affects covariance.

Connection to ML#

Dimensionality reduction preserves variance and often improves downstream generalization.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition equals right-singular structure of centered data.

Pedagogical Significance#

Shows the practical equivalence of PCA and SVD.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). Tutorial on PCA.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
ev = s**2
ev_ratio = ev / ev.sum()

k = 2
Z = Xc @ Vt[:k].T  # scores

print("singular values (scaled):", np.round(s, 4))
print("explained variance ratios:", np.round(ev_ratio, 4))
print("projected shape:", Z.shape)

Worked Example 4: Least squares via pseudoinverse and SVD#

Introduction#

Solve an overdetermined system with SVD-based pseudoinverse; compare to normal equations.

Purpose#

Highlight stability of SVD solves, especially when $X^\top X$ is ill-conditioned.

Importance#

Prevents amplification of noise from near-singular normal matrices.

What this example demonstrates#

  • Compute $x^* = X^\dagger y$.

  • Compare residuals and conditioning to normal-equation solution.

Background#

Normal equations square condition number; SVD avoids this amplification.

Historical context#

Moore–Penrose pseudoinverse formalized generalized inverses; SVD provides a constructive route.

History#

Standard approach in numerical linear algebra libraries for robust least squares.

Prevalence in ML#

Used in linear regression baselines, design-matrix solvers, and as subroutines in algorithms.

Notes#

  • Small singular values can be truncated or damped for better stability.

Connection to ML#

Improves robustness of linear regression and feature engineering pipelines.

Connection to Linear Algebra Theory#

Pseudoinverse gives minimum-norm solution among all least-squares minimizers.

Pedagogical Significance#

Shows practical benefit of SVD over naive normal equations.

References#

  1. Golub & Van Loan (2013). Stability of least squares.

  2. Hansen (1998). Regularization.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 80, 6
X = np.random.randn(n, d)
# Make columns nearly dependent
X[:, 0] = X[:, 1] + 1e-3 * np.random.randn(n)
w_true = np.array([2, -2, 0.5, 0, 0, 0])
y = X @ w_true + 0.05 * np.random.randn(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma_inv = np.diag(1.0 / s)
w_svd = Vt.T @ Sigma_inv @ U.T @ y

# Solve the normal equations explicitly (this squares the condition number)
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

res_svd = np.linalg.norm(X @ w_svd - y)
res_ne = np.linalg.norm(X @ w_ne - y)

print("singular values:", np.round(s, 6))
print("residual SVD:", round(res_svd, 6), "residual normal eqn:", round(res_ne, 6))

Worked Example 5: Truncated SVD for regularization and denoising#

Introduction#

Apply truncated SVD to an ill-conditioned linear system to reduce noise amplification.

Purpose#

Illustrate spectral filtering: discard tiny singular values to stabilize solutions.

Importance#

Prevents overfitting/noise blow-up in inverse problems and regression with multicollinearity.

What this example demonstrates#

  • Construct ill-conditioned matrix with decaying singular values.

  • Solve with full pseudoinverse vs. truncated SVD; compare error to ground truth.

Background#

Truncated SVD is a classical regularizer; related to Tikhonov but with hard cutoff in spectrum.

Historical context#

Widely used in inverse problems (Hansen 1998) and information retrieval.

History#

Precedes modern weight decay; spectral analog of feature pruning.

Prevalence in ML#

Used in deblurring, recommendation cold-start smoothing, and feature denoising.

Notes#

  • Choose cutoff based on noise level or cross-validation.

Connection to ML#

Improves generalization when design matrices are ill-conditioned.

Connection to Linear Algebra Theory#

Filters small singular values; solution lives in dominant singular subspace.

Pedagogical Significance#

Shows tangible benefit of truncating spectrum.

References#

  1. Hansen (1998). Rank truncation and regularization.

  2. Golub & Van Loan (2013). Regularized least squares.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 50, 30
# Create decaying singular values
U_rand, _ = np.linalg.qr(np.random.randn(n, n))
V_rand, _ = np.linalg.qr(np.random.randn(d, d))
s = np.logspace(0, -3, min(n, d))
Sigma = np.zeros((n, d))
Sigma[:len(s), :len(s)] = np.diag(s)
A = U_rand @ Sigma @ V_rand.T

w_true = np.zeros(d)
w_true[:5] = [2, -1, 0.5, 0.5, -0.2]
y_clean = A @ w_true
y = y_clean + 0.01 * np.random.randn(n)

U, sv, Vt = np.linalg.svd(A, full_matrices=False)
# Full pseudoinverse solution
w_full = (Vt.T / sv) @ (U.T @ y)
# Truncate small singular values
cut = 10
sv_trunc = sv[:cut]
w_trunc = (Vt[:cut].T / sv_trunc) @ (U[:, :cut].T @ y)

err_full = np.linalg.norm(w_full - w_true)
err_trunc = np.linalg.norm(w_trunc - w_true)

print("smallest singular value:", sv[-1])
print("error full:", round(err_full, 4), "error trunc:", round(err_trunc, 4))

Chapter 9
PSD & PD Matrices
Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $>0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues, admit Cholesky factorizations (for PD), and define inner products or metrics. PSD structure underpins convexity, stability, and validity of kernels.

Important ideas#

  1. Eigenvalue characterization

    • Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.

  2. Quadratic forms and convexity

    • If $A\succeq 0$, $f(x)=\tfrac{1}{2} x^\top A x$ is convex; Hessian PSD implies convex objective locally/globally (for twice-differentiable functions).

  3. Gram matrices

    • $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).

  4. Cholesky factorization

    • PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.

  5. Schur complement

    • For block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, PSD iff Schur complement $C-B^\top A^{-1}B \succeq 0$.

  6. Mahalanobis metric

    • For SPD $M$, $d_M(x,y)^2=(x-y)^\top M (x-y)$ defines a metric; whitening corresponds to $M=\Sigma^{-1}$.

  7. Numerical PSD

    • In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).
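The Schur complement criterion (idea 5) is easy to check numerically: if the whole block matrix is PD and the top-left block is PD, the Schur complement must be PD as well. A minimal sketch; the block sizes and random seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random SPD block matrix M = [[A, B], [B^T, C]]
p, q = 3, 2
R = rng.standard_normal((p + q, p + q))
M = R @ R.T + 0.1 * np.eye(p + q)  # SPD by construction

A, B, C = M[:p, :p], M[:p, p:], M[p:, p:]
S = C - B.T @ np.linalg.solve(A, B)  # Schur complement of A in M

# M is PD and A is PD, so S must be PD: all eigenvalues positive
print("min eig of M:", np.linalg.eigvalsh(M).min())
print("min eig of S:", np.linalg.eigvalsh(S).min())
```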

Relevance to ML#

  • Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.

  • Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.

  • Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.

  • Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.

  • Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

  • 1900s: Mercer’s theorem (kernel PSD) and early quadratic form characterizations.

  • 1918: Schur complement criteria.

  • 1910–1940s: Cholesky factorization for SPD solves.

  • 1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.

  • 1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

  • PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.

  • Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.

  • Principal minors: PD iff all leading principal minors are positive (Sylvester).

  • Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.

  • Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.
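The eigenvalue test and Sylvester's principal-minor test can be compared side by side. A small sketch on a classic SPD tridiagonal matrix (the matrix is illustrative):

```python
import numpy as np

def leading_minors(A):
    # Determinants of the k x k top-left blocks, k = 1..n (Sylvester's criterion)
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])  # SPD tridiagonal

print("leading minors:", np.round(leading_minors(A), 4))   # all positive -> PD
print("eigenvalues:  ", np.round(np.linalg.eigvalsh(A), 4))  # all positive -> PD
```

Both tests agree: the minors are 2, 3, 4 and the eigenvalues are $2-\sqrt{2}$, $2$, $2+\sqrt{2}$, all positive.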

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • PSD/PD definitions via quadratic form and eigenvalues.

  • Sylvester’s criterion (leading principal minors > 0 for PD).

  • Schur complement PSD condition.

  • Mercer’s theorem (kernel PSD).

  • Cholesky existence for SPD.

Applied (landmark systems)#

  • SVM with PSD kernels (Cortes–Vapnik 1995).

  • Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.

  • Kernel Ridge Regression (Murphy 2022) solved via SPD systems.

  • Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).

  • GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up
  1. Covariance and PCA

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; eigenvalues are variances. PCA uses PSD structure. References: Jolliffe 2002; Shlens 2014.

  2. Kernels and SVMs/GPs

  • RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  3. Hessians and convexity

  • For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

  4. Mahalanobis distance & whitening

  • SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

  5. Cholesky-based solvers (KRR/GP)

  • SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation
  • PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.

  • Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.

  • Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.

  • Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.

  • Schur complement: For blocks, $S = C - B^\top A^{-1} B$.

  • Examples:

    • PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$.

    • Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.

    • Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.

Pitfalls & sanity checks
  • Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.

  • Near-singular PSD: add jitter; do not invert directly.

  • Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.

  • Kernel parameters: very small/large length-scales can hurt conditioning.

  • Cholesky failures: indicate non-PSD or insufficient jitter.
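In practice the jitter fix is often wrapped in a retry loop: symmetrize, try Cholesky, and escalate the jitter until the factorization succeeds. A sketch; the starting jitter and growth factor are illustrative choices, not a fixed convention:

```python
import numpy as np

def safe_cholesky(A, jitter=1e-10, max_tries=10):
    # Symmetrize first, then escalate jitter until Cholesky succeeds
    A = 0.5 * (A + A.T)
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(A + jitter * np.eye(A.shape[0])), jitter
        except np.linalg.LinAlgError:
            jitter *= 10
    raise np.linalg.LinAlgError("matrix not PSD even after jitter")

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
K = X @ X.T  # rank-2 Gram matrix: PSD but singular
L, used = safe_cholesky(K)
print("jitter used:", used)
print("reconstruction error:", np.linalg.norm(L @ L.T - K))
```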

References

Foundations

  1. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models

  4. Mercer, J. (1909). Functions of positive and negative type.

  5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels.

  6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics

  8. Nocedal, J., & Wright, S. (2006). Numerical Optimization.

  9. De Maesschalck, R. et al. (2000). The Mahalanobis distance.

Applications and practice

  10. Cortes, C., & Vapnik, V. (1995). Support-vector networks.

  11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

  12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare PSD tests: eigenvalues vs. Cholesky; show failure on non-PSD and success on jittered fix.

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

  • Symmetrization, eigenvalue check, and Cholesky factorization.

  • Adding jitter ($\epsilon I$) can restore PD for near-PSD matrices.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

  • Always symmetrize numerically; use small jitter (e.g., $10^{-6}$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; $A=LL^\top$ iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
A = np.random.randn(5, 5)
A = 0.5 * (A + A.T)  # symmetrize
A_bad = A.copy()
A_bad[0, 0] = -5.0  # make indefinite

def is_psd_eig(M, tol=1e-10):
    w = np.linalg.eigvalsh(M)
    return np.all(w >= -tol), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.4f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Jitter fix
eps = 1e-3
A_fix = A_bad + eps * np.eye(A_bad.shape[0])
ok, w = is_psd_eig(A_fix)
print("jittered: min eig=", round(w.min(), 4), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("jittered Cholesky diag min:", round(np.min(np.diag(L)), 4))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify variance becomes identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.

  • Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

  • Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

  • RBF kernel is PSD; small datasets can be nearly singular when points are close.

  • Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

  • Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = A2 + A2.T - 2 * A @ A.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

  • $d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

  • Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

  1. De Maesschalck et al. (2000). The Mahalanobis distance.

  2. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

  • Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

  • Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

  3. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = A2 + B2 - 2 * A @ B.T
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L L^T alpha = y
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))

Chapter 8
Eigenvalues & Eigenvectors
Key ideas: Introduction

Introduction#

Eigen-analysis provides structure for symmetric/PSD matrices (covariances, Laplacians) and general matrices (Markov chains, Jacobians). Power iteration is a simple iterative method to approximate the largest eigenvalue/eigenvector using repeated multiplication and normalization.

Important ideas#

  1. Eigenpairs $(\lambda, v)$ satisfy $Av = \lambda v$; for symmetric $A$, eigenvectors form an orthonormal basis.

  2. Spectral theorem (symmetric): $A = Q \Lambda Q^\top$ with real eigenvalues; $Q$ orthogonal, $\Lambda$ diagonal.

  3. Eigengap and convergence: Power iteration converges at rate $|\lambda_2/\lambda_1|^k$ to the dominant eigenvector when $|\lambda_1|>|\lambda_2|$.

  4. Rayleigh quotient: $\rho(x) = \dfrac{x^\top A x}{x^\top x}$; maximized at $\lambda_\max$, minimized at $\lambda_\min$ for symmetric $A$.

  5. PSD matrices: Eigenvalues nonnegative; covariances and Gram matrices are PSD.

  6. Gershgorin disks: Eigenvalues lie within unions of disks defined by row sums—gives quick bounds.

  7. Perron–Frobenius: Nonnegative irreducible matrices have a unique positive dominant eigenvalue/vector (Markov chains, PageRank).
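Ideas 1, 2, and 4 can be verified numerically in a few lines. A sketch on a random symmetric matrix (the seed and size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = 0.5 * (A + A.T)  # symmetrize

lam, Q = np.linalg.eigh(A)  # spectral theorem: A = Q diag(lam) Q^T

# Eigenvectors form an orthonormal basis and reconstruct A exactly
print("Q orthogonal:", np.allclose(Q.T @ Q, np.eye(5)))
print("A reconstructed:", np.allclose(Q @ np.diag(lam) @ Q.T, A))

# Rayleigh quotient is extremized at the extreme eigenvectors
rho = lambda x: (x @ A @ x) / (x @ x)
print("rho(v_max) = lambda_max:", np.isclose(rho(Q[:, -1]), lam[-1]))
print("rho(v_min) = lambda_min:", np.isclose(rho(Q[:, 0]), lam[0]))
```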

Relevance to ML#

  • PCA: Leading eigenvectors of covariance capture maximal variance; truncation yields best low-rank approximation.

  • Spectral clustering: Laplacian eigenvectors reveal cluster structure via graph cuts.

  • PageRank/Markov chains: Stationary distribution is dominant eigenvector.

  • Stability: Jacobian eigenvalues inform exploding/vanishing dynamics.

  • Attention/covariance spectra: Eigenvalue spread relates to conditioning and numerical stability.

Algorithmic development (milestones)#

  • 1900s: Spectral theorem for symmetric matrices.

  • 1907–1912: Perron–Frobenius theory for nonnegative matrices (Markov chains, PageRank).

  • 1929–1930s: Power iteration (von Mises) and later refinements (Householder); 1931: Gershgorin circle theorem (eigenvalue bounds).

  • 1998: PageRank (Brin–Page) uses power iteration at web scale.

  • 2000s: Spectral clustering widely adopted (Ng–Jordan–Weiss 2002).

  • 2010s: Randomized eigensolvers/svd for large-scale ML.

Definitions#

  • Eigenpair: $(\lambda, v\neq 0)$ with $Av=\lambda v$.

  • Spectrum $\sigma(A)$: set of eigenvalues; spectral radius $\rho(A)=\max |\lambda_i|$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$ for $x\neq0$.

  • Power iteration: $x_{k+1}=\dfrac{A x_k}{\lVert A x_k\rVert}$; converges to dominant eigenvector under eigengap.

  • Laplacian: $L=D-W$ (unnormalized), $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$ for graph spectral methods.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Spectral theorem (symmetric/PSD): orthonormal eigenbasis; real eigenvalues.

  • Rayleigh–Ritz: Extreme eigenvalues maximize/minimize Rayleigh quotient.

  • Perron–Frobenius: Positive eigenvector for irreducible nonnegative matrices; spectral gap governs convergence.

  • Gershgorin circle theorem: Eigenvalues lie in disk unions from row sums.

  • Power iteration convergence: Linear rate governed by $|\lambda_2/\lambda_1|$ when $|\lambda_1|>|\lambda_2|$.

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Spectral clustering: Ng–Jordan–Weiss (2002); von Luxburg (2007).

  • PageRank: Brin–Page (1998).

  • Randomized SVD/eigs: Halko–Martinsson–Tropp (2011).

  • Stability of deep nets (spectral radius/initialization): Saxe et al. (2013); Goodfellow et al. (2016).

Key ideas: Where it shows up
  1. PCA and covariance spectra

  • Covariance $\Sigma=\tfrac{1}{n} X_c^\top X_c$ (PSD); eigenvectors = principal axes; eigenvalues = variances.

  • Achievements: Dimensionality reduction, whitening; core tool in vision/speech. References: Jolliffe 2002; Shlens 2014.

  2. Spectral clustering (graph Laplacian)

  • Use first $k$ eigenvectors of $L_{\text{sym}}$ to embed nodes, then k-means.

  • Achievements: Strong performance on non-convex clusters and manifold data. References: Ng–Jordan–Weiss 2002; von Luxburg 2007.

  3. PageRank / Markov chains

  • Dominant eigenvector of stochastic $P$ gives stationary distribution; computed via power iteration.

  • Achievements: Web search ranking at internet scale. References: Brin–Page 1998.

  4. Conditioning and stability (Jacobians/Hessians)

  • Largest eigenvalue relates to Lipschitz constants; affects step sizes and gradient explosion/vanishing.

  • Achievements: Initialization/normalization techniques guided by spectral radius. References: Saxe et al. 2013; Goodfellow et al. 2016.

  5. Randomized eigen/svd for large ML

  • Approximate leading eigenpairs with fewer passes over data.

  • Achievements: Scalable PCA/LSA/embedding preprocessing. References: Halko–Martinsson–Tropp 2011.
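The randomized idea above can be sketched as subspace iteration on a random sketch followed by Rayleigh–Ritz. The synthetic spectrum, oversampling amount, and iteration count below are illustrative, not tuned recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 5

# PSD matrix with a fast-decaying spectrum (top-5 eigenvalues dominate)
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
s = np.concatenate([[100.0, 50.0, 20.0, 10.0, 5.0], np.full(n - 5, 0.1)])
A = (U * s) @ U.T  # A = U diag(s) U^T

# Randomized subspace iteration with oversampling
Q = np.linalg.qr(rng.standard_normal((n, k + 5)))[0]
for _ in range(3):
    Q = np.linalg.qr(A @ Q)[0]

# Rayleigh-Ritz: solve the small eigenproblem in the captured subspace
w = np.linalg.eigvalsh(Q.T @ A @ Q)
approx = np.sort(w)[::-1][:k]
print("approx top-5:", np.round(approx, 3))
print("true  top-5:", s[:k])
```

Only small $(k{+}5)$-dimensional factorizations and a few passes over $A$ are needed, which is the point at scale.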

Notation
  • Eigenvalues/eigenvectors: $Av_i = \lambda_i v_i$; order eigenvalues $\lambda_1\ge\lambda_2\ge\cdots$ for symmetric PSD.

  • Decompositions: $A=V\Lambda V^{-1}$ (diagonalizable); symmetric $A=Q \Lambda Q^\top$.

  • Rayleigh quotient: $\rho(x)=\dfrac{x^\top A x}{x^\top x}$; for unit $x$, $\rho(x)=x^\top A x$.

  • Power iteration step: $x_{k+1} = A x_k / \lVert A x_k\rVert$; eigenvalue estimate $\lambda_k = x_k^\top A x_k$ if $\lVert x_k\rVert=1$.

  • Laplacian eigenmaps: $L_{\text{sym}}=I-D^{-1/2} W D^{-1/2}$; use $k$ smallest nontrivial eigenvectors.

  • Examples:

    • Dominant eigenpair by power iteration on SPD matrix.

    • PCA via eigen-decomposition of $\Sigma$.

    • PageRank: eigenvector of stochastic matrix with eigenvalue 1.

Pitfalls & sanity checks
  • Non-symmetric matrices may have complex eigenvalues; use appropriate routines.

  • Power iteration slows when eigengap is small; consider Lanczos/Arnoldi or deflation.

  • Normalize in power iteration to avoid overflow/underflow.

  • Center data before PCA; otherwise leading eigenvectors capture mean.

  • Gershgorin bounds are loose; use as qualitative checks only.
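The deflation mentioned above can be sketched directly: after finding the dominant eigenpair, subtract it and run power iteration again to get the second pair. A sketch on an SPD matrix; the seed and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 0.5 * np.eye(6)  # SPD

def power_iter(M, iters=1000):
    x = rng.standard_normal(M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)  # normalize to avoid overflow
    return x @ M @ x, x

lam1, v1 = power_iter(A)
# Deflate: remove the dominant component, then find the next eigenpair
A_defl = A - lam1 * np.outer(v1, v1)
lam2, v2 = power_iter(A_defl)

true = np.sort(np.linalg.eigvalsh(A))[::-1]
print("power+deflation:", round(lam1, 4), round(lam2, 4))
print("true top two:   ", round(true[0], 4), round(true[1], 4))
```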

References

Foundations and numerical linear algebra

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  4. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Spectral methods and applications

  5. Jolliffe, I. (2002). Principal Component Analysis.

  6. Shlens, J. (2014). A Tutorial on PCA.

  7. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  8. von Luxburg, U. (2007). A Tutorial on Spectral Clustering.

  9. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.

  10. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

  11. Saxe, A. et al. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.

  12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

  13. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  14. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Five worked examples

Worked Example 1: Power iteration for dominant eigenpair (SPD matrix)#

Introduction#

Compute the largest eigenvalue/vector of an SPD matrix via power iteration and compare to numpy’s eigvals.

Purpose#

Illustrate convergence rate and normalization; give a practical recipe.

Importance#

Many large-scale methods rely on top eigenpairs (PCA, spectral norm estimation).

What this example demonstrates#

  • Convergence rate depends on the eigengap ratio $|\lambda_2/\lambda_1|$.

  • Rayleigh quotient of iterates approximates $\lambda_1$.

Background#

Power methods date back a century and remain relevant for scalable eigensolvers.

Historical context#

von Mises/Householder refinements; modern variants include Lanczos/Arnoldi.

Prevalence in ML#

Used implicitly in randomized PCA, spectral norm estimation, and operator norm regularization.

Notes#

  • Normalize every step; monitor residual $\lVert Ax-\lambda x\rVert$.

Connection to ML#

Top eigenvalue relates to Lipschitz constants; top eigenvector for PCA directions.

Connection to Linear Algebra Theory#

Convergence proof via eigen-expansion; Rayleigh quotient bounds.

Pedagogical Significance#

Shows a simple iterative algorithm achieving an eigenpair without full decomposition.

References#

  1. Trefethen & Bau (1997). Numerical Linear Algebra.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(0)
d = 6
A = np.random.randn(d, d)
A = A.T @ A + 0.5 * np.eye(d)  # SPD

def power_iteration(A, iters=50):
    x = np.random.randn(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    lam = x @ (A @ x)
    return lam, x

lam_pi, v_pi = power_iteration(A, iters=40)
eigvals, eigvecs = np.linalg.eigh(A)
lam_true = eigvals[-1]
print("power iteration λ≈", round(lam_pi, 6), " true λ=", round(lam_true, 6))
print("angle to true v (deg):", round(np.degrees(np.arccos(np.clip(abs(v_pi @ eigvecs[:, -1]), -1, 1))), 6))

Worked Example 2: PCA via eigen-decomposition of covariance#

Introduction#

Compute covariance, eigenpairs, and project data onto top-$k$ components; verify variance captured.

Purpose#

Connect PCA steps to eigenvalues/eigenvectors and retained variance.

Importance#

Core dimensionality reduction tool in ML.

What this example demonstrates#

  • $\Sigma = \tfrac{1}{n} X_c^\top X_c$ is PSD; eigenvalues = variances along eigenvectors.

  • Retained variance ratio from top-$k$ eigenvalues.

Background#

Eigen-decomposition of covariance equals SVD-based PCA.

Historical context#

PCA roots in Pearson/Hotelling; ubiquitous in data analysis.

Prevalence in ML#

Widely used in preprocessing, visualization, and compression.

Notes#

  • Center data; consider scaling features.

Connection to ML#

Variance retention guides choice of $k$; whitening uses inverse sqrt of eigenvalues.

Connection to Linear Algebra Theory#

PSD eigen-structure; orthogonal projections onto principal subspaces.

Pedagogical Significance#

Demonstrates PSD eigen-decomposition in a practical workflow.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d, k = 120, 8, 3
X = np.random.randn(n, d) @ np.diag(np.linspace(3, 0.5, d))
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
idx = np.argsort(evals)[::-1]
evals, evecs = evals[idx], evecs[:, idx]

Vk = evecs[:, :k]
X_proj = Xc @ Vk
variance_retained = evals[:k].sum() / evals.sum()
print("Top-k variance retained:", round(variance_retained, 4))

Worked Example 3: PageRank via power iteration (stochastic matrix)#

Introduction#

Compute PageRank on a small directed graph using power iteration on a stochastic matrix with damping.

Purpose#

Show Perron–Frobenius in action and convergence to the dominant eigenvector.

Importance#

Seminal large-scale eigen-application; template for Markov-chain ranking.

What this example demonstrates#

  • Transition matrix $P$ has eigenvalue 1; power iteration converges to stationary distribution.

  • Damping ensures irreducibility/aperiodicity.

Background#

Web graph ranking; the damping factor (e.g., 0.85) handles dead ends and spider traps.

Historical context#

Brin–Page (1998) launched web search revolution.

Prevalence in ML#

Graph ranking, recommendation propagation, random-walk-based features.

Notes#

  • Normalize columns to sum to 1; add teleportation for damping.

Connection to ML#

Graph-based semi-supervised learning often reuses random-walk ideas.

Connection to Linear Algebra Theory#

Perron–Frobenius guarantees positive dominant eigenvector; spectral gap drives convergence.

Pedagogical Significance#

Concrete, small-scale power iteration on a stochastic matrix.

References#

  1. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.

  2. Langville, A., & Meyer, C. (2006). Google’s PageRank and Beyond.

Solution (Python)#

import numpy as np

# Small directed graph adjacency
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# Column-stochastic transition (out-links per column)
col_sums = A.sum(axis=0, keepdims=True)
P = A / np.where(col_sums == 0, 1, col_sums)

alpha = 0.85
n = P.shape[0]
J = np.ones((n, n)) / n
M = alpha * P + (1 - alpha) * J  # damping

v = np.ones(n) / n
for _ in range(50):
    v = M @ v
    v = v / v.sum()

eigvals, eigvecs = np.linalg.eig(M)
idx = np.argmax(np.real(eigvals))
v_true = np.real(eigvecs[:, idx])
v_true = v_true / v_true.sum()

print("Power iteration PageRank:", np.round(v, 4))
print("Eigenvector PageRank:", np.round(v_true, 4))

Worked Example 4: Gershgorin disks vs true eigenvalues (bounds)#

Introduction#

Compute Gershgorin disks for a matrix and compare to actual eigenvalues to illustrate spectrum localization.

Purpose#

Provide quick sanity bounds on eigenvalues without full eigendecomposition.

Importance#

Useful for diagnosing stability (e.g., Jacobians) and conditioning.

What this example demonstrates#

  • All eigenvalues lie within the union of Gershgorin disks centered at $a_{ii}$ with radius $\sum_{j\neq i} |a_{ij}|$.

Background#

Classic theorem (1931) for eigenvalue localization.

Historical context#

Still a staple for quick qualitative checks.

Prevalence in ML#

Less direct, but useful for reasoning about spectrum without expensive computation.

Notes#

  • Tightness varies; row/column scaling can sharpen disks.

Connection to ML#

Stability analysis of iterative methods; rough spectral norm estimates.

Connection to Linear Algebra Theory#

Eigenvalue inclusion sets; diagonal dominance implications.

Pedagogical Significance#

Gives a geometric picture of eigenvalue bounds.

References#

  1. Gershgorin, S. (1931). Über die Abgrenzung der Eigenwerte einer Matrix.

  2. Horn & Johnson (2013). Matrix Analysis.

Solution (Python)#

import numpy as np

np.random.seed(2)
A = np.random.randn(4, 4)

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
eigvals = np.linalg.eigvals(A)

print("Gershgorin centers:", np.round(centers, 3))
print("Gershgorin radii:", np.round(radii, 3))
print("Eigenvalues:", np.round(eigvals, 3))

Worked Example 5: Spectral clustering on a toy graph (Laplacian eigenvectors)#

Introduction#

Perform unnormalized spectral clustering on a simple graph with two clusters; use second-smallest eigenvector (Fiedler) for separation.

Purpose#

Show how Laplacian eigenvectors reveal cluster structure.

Importance#

Nonlinear/non-convex cluster discovery.

What this example demonstrates#

  • $L=D-W$; smallest eigenvalue 0 with eigenvector $\mathbf{1}$; Fiedler vector splits the graph.

Background#

Graph cuts and Laplacian spectra; normalized variants common in practice.

Historical context#

Spectral clustering surged in the 2000s for manifold data.

Prevalence in ML#

Image segmentation, manifold learning, community detection.

Notes#

  • For normalized Laplacian, use $L_{\text{sym}}$; results similar on this toy example.

Connection to ML#

Embedding nodes using a few eigenvectors before k-means is standard pipeline.

Connection to Linear Algebra Theory#

Properties of Laplacian eigenvalues (nonnegative; multiplicity of 0 equals number of components).

Pedagogical Significance#

Concrete end-to-end spectral clustering demonstration.

References#

  1. Ng, A., Jordan, M., & Weiss, Y. (2002). On Spectral Clustering.

  2. von Luxburg, U. (2007). A tutorial on spectral clustering.

Solution (Python)#

import numpy as np

# Two clusters (triangles) with strong intra-cluster edges and no edges between them
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W
evals, evecs = np.linalg.eigh(L)

fiedler = evecs[:, 1]
clusters = (fiedler > 0).astype(int)
print("Eigenvalues:", np.round(evals, 4))
print("Fiedler vector:", np.round(fiedler, 4))
print("Cluster assignment via sign:", clusters)
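The normalized variant mentioned in the Notes can be sketched on the same toy graph. Here $L_{\text{sym}} = I - D^{-1/2} W D^{-1/2}$, and since the two triangles are disconnected, the zero eigenvalue appears with multiplicity 2 (matching the number of components):

```python
import numpy as np

# Same toy graph as above (two triangles, no inter-cluster edges)
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

deg = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_sym = np.eye(6) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian

evals, evecs = np.linalg.eigh(L_sym)
clusters = (evecs[:, 1] > 0).astype(int)
print("L_sym eigenvalues:", np.round(evals, 4))
print("Cluster assignment via sign:", clusters)
```

The null vectors of $L_{\text{sym}}$ are $D^{1/2}\mathbf{1}_C$ for each component $C$, so the sign-based assignment is again constant on each triangle.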

Chapter 7
Rank & Nullspace
Key ideas: Introduction

Introduction#

Rank and null space describe how information flows through matrices:

  • Rank $r$ = number of independent columns/rows (nonzero singular values)

  • Null space $\text{null}(A)$ = set of inputs mapped to zero (lost information)

  • Column/row spaces = subspaces where outputs/inputs live; orthogonal complements relate via FTLA

  • Pseudoinverse solves least squares even when $A$ is rank-deficient (minimal-norm solutions)

  • Low-rank structure compresses models and reveals latent factors (factorization)

Important ideas#

  1. Row rank equals column rank

    • $\operatorname{rank}(A)$ is the dimension of $\text{col}(A)$ and equals that of $\text{row}(A)$.

  2. Rank via singular values

    • If $A=U\Sigma V^\top$, then $\operatorname{rank}(A)$ equals the number of nonzero singular values $\sigma_i$.

  3. Rank–nullity theorem

    • For $A\in\mathbb{R}^{m\times d}$, $$\operatorname{rank}(A) + \operatorname{nullity}(A) = d.$$

  4. Fundamental theorem of linear algebra (FTLA)

    • $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$ and $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$ (orthogonal decompositions).

  5. Rank of products and sums

    • $\operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$; subadditivity for sums.

  6. Pseudoinverse $A^+$

    • Moore–Penrose $A^+$ gives minimal-norm solutions $x^* = A^+ b$; satisfies $AA^+A = A$.

  7. Numerical rank

    • Practical rank uses thresholds on singular values to handle floating-point noise.
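Several of the facts above can be verified numerically. A minimal sketch, using an illustrative random matrix of known rank built as a product of thin factors:

```python
import numpy as np

np.random.seed(0)
m, d, r = 8, 6, 3
# A matrix of known rank r, built as a product of thin random factors
A = np.random.randn(m, r) @ np.random.randn(r, d)

rank = np.linalg.matrix_rank(A)
nullity = d - rank                       # rank-nullity: rank + nullity = d

# Rank via singular values: count sigma_i above a relative threshold
S = np.linalg.svd(A, compute_uv=False)
rank_svd = int(np.sum(S > 1e-10 * S[0]))

# Rank of a product is bounded by the smaller factor rank
B = np.random.randn(d, 5)
rank_AB = np.linalg.matrix_rank(A @ B)

print("rank(A) =", rank, " nullity =", nullity, " sum =", rank + nullity)
print("rank via sigma count:", rank_svd)
print("rank(AB) =", rank_AB, " <= min of factor ranks:",
      min(rank, np.linalg.matrix_rank(B)))
```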

Relevance to ML#

  • Multicollinearity: rank-deficient design $X$ yields non-unique OLS solutions; regularization/pseudoinverse needed.

  • PCA/compression: low rank captures variance efficiently; truncation yields best rank-$k$ approximation.

  • Recommendation systems: user–item matrices modeled as low-rank factorization.

  • Kernels/Gram matrices: rank informs capacity and generalization; $\operatorname{rank}(XX^\top) \le \min(n,d)$.

  • Attention: score matrix $QK^\top$ has rank bounded by $\min(n, d_k)$; head dimension limits expressivity.

  • Deep nets: bottleneck layers enforce low-rank mapping; adapters/LoRA factorize weights.

Algorithmic development (milestones)#

  • 1936: Eckart–Young — best rank-$k$ approximation via SVD.

  • 1955: Penrose — Moore–Penrose pseudoinverse.

  • 1990s–2000s: Matrix factorization in recommender systems (SVD-based, ALS).

  • 2009: Candès–Recht — nuclear norm relaxation for matrix completion.

  • 2011: Halko–Martinsson–Tropp — randomized SVD for large-scale low-rank.

  • 2019–2021: Low-rank adapters (LoRA) compress transformer weights.

Definitions#

  • $\operatorname{rank}(A)$: dimension of $\text{col}(A)$ (or $\text{row}(A)$); number of nonzero singular values.

  • $\text{null}(A) = \{x: Ax=0\}$; $\text{null}(A^\top)$ similarly.

  • $\text{col}(A)$: span of columns; $\text{row}(A)$: span of rows.

  • FTLA decompositions: $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$, $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$.

  • Pseudoinverse: $A^+ = V \Sigma^+ U^\top$ where $\Sigma^+$ reciprocates nonzero $\sigma_i$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

  • Rank–nullity: $$\operatorname{rank}(A)+\operatorname{nullity}(A)=d.$$

  • FTLA (four subspaces): $\text{col}(A) \perp \text{null}(A^\top)$ and $\text{row}(A) \perp \text{null}(A)$.

  • Row=column rank: $\dim\text{row}(A) = \dim\text{col}(A)$.

  • Singular values and rank: $\operatorname{rank}(A)$ is the count of positive $\sigma_i$.

  • Sylvester’s inequality: for $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{k\times n}$, $\operatorname{rank}(A) + \operatorname{rank}(B) - k \le \operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$.

  • Eckart–Young–Mirsky: Truncated SVD minimizes error among rank-$k$ approximations.

  • Moore–Penrose pseudoinverse properties: $AA^+A=A$, $A^+AA^+=A^+$.
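Sylvester’s inequality can be checked directly. A small sketch with illustrative shapes (inner dimension $k = 5$, and a deliberately rank-deficient $B$):

```python
import numpy as np

np.random.seed(1)
m, k, n = 7, 5, 6
A = np.random.randn(m, k)
B = np.random.randn(k, n)
B[0, :] = B[1, :]            # duplicate a row: rank(B) drops to k - 1

rA = np.linalg.matrix_rank(A)
rB = np.linalg.matrix_rank(B)
rAB = np.linalg.matrix_rank(A @ B)

# Sylvester: rA + rB - k <= rank(AB) <= min(rA, rB)
print(f"rank(A)={rA}, rank(B)={rB}, rank(AB)={rAB}")
print("lower bound:", rA + rB - k, " upper bound:", min(rA, rB))
```

With these shapes the two bounds coincide ($5 + 4 - 5 = 4 = \min(5, 4)$), pinning $\operatorname{rank}(AB)$ exactly.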

Applied (landmark systems/practices)#

  • PCA: Jolliffe (2002); Shlens (2014).

  • Stable least squares: Golub–Van Loan (2013).

  • Matrix completion via nuclear norm: Candès–Recht (2009).

  • Randomized SVD for scale: Halko–Martinsson–Tropp (2011).

  • Recommender systems: Koren–Bell–Volinsky (2009).

  • Low-rank adapters in transformers: Hu et al. (2021).

Key ideas: Where it shows up
  1. PCA and covariance rank

  • Centered data $X_c$ yields covariance $\Sigma = \tfrac{1}{n} X_c^\top X_c$ with $\operatorname{rank}(\Sigma) \le \min(n-1, d)$.

  • Achievements: Dimensionality reduction with $k\ll d$; whitening in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

  2. Regression and multicollinearity

  • If $\operatorname{rank}(X) < d$, normal equations $X^\top X w = X^\top y$ are singular; pseudoinverse/regularization resolve ambiguity.

  • Achievements: Robust linear modeling; Ridge/Lasso mitigate rank issues. References: Hoerl–Kennard 1970; Tibshirani 1996; Golub–Van Loan 2013.

  3. Low-rank models and compression

  • Factorize $W \approx AB^\top$ with small inner dimension to reduce parameters and computation (adapters, LoRA).

  • Achievements: Efficient fine-tuning of large transformers. References: Hu et al. 2021 (LoRA); Tishby & Zaslavsky 2015 (bottlenecks conceptual).

  4. Matrix factorization for recommendation

  • User–item ratings approximated by low-rank matrices; SVD/ALS used in practice.

  • Achievements: Netflix Prize-era improvements; widespread deployment. References: Koren et al. 2009; Funk 2006.

  5. Kernels/Gram and attention score rank

  • $G=XX^\top$ has rank $\le \min(n,d)$; $QK^\top$ rank $\le \min(n,d_k)$. Rank limits expressivity and affects generalization.

  • Achievements: Scalable kernel methods via low-rank approximations; attention head size trade-offs. References: Schölkopf–Smola 2002; Vaswani et al. 2017.
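The parameter savings behind the low-rank adapter idea (item 3 above) is simple arithmetic. A small sketch with illustrative sizes ($d$ and $r$ chosen for illustration, not taken from any particular model):

```python
import numpy as np

np.random.seed(3)
d, r = 768, 8   # illustrative hidden size and adapter rank

# Full weight update: d*d parameters; low-rank update A @ B: d*r + r*d
full_params = d * d
lora_params = 2 * d * r
print(f"full update: {full_params:,} params; low-rank update: {lora_params:,} params "
      f"({lora_params / full_params:.2%} of full)")

# The low-rank update itself has rank at most r
A = np.random.randn(d, r)
B = np.random.randn(r, d)
Delta = A @ B
print("rank(Delta) =", np.linalg.matrix_rank(Delta), "<= r =", r)
```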

Notation
  • Shapes: $A\in\mathbb{R}^{m\times d}$; $X\in\mathbb{R}^{n\times d}$ is data.

  • Spaces: $\text{col}(A)$, $\text{row}(A)$, $\text{null}(A)$, $\text{null}(A^\top)$.

  • Rank: $\operatorname{rank}(A)$; Nullity: $\operatorname{nullity}(A)$.

  • SVD: $A=U\Sigma V^\top$; $U\in\mathbb{R}^{m\times r}$, $V\in\mathbb{R}^{d\times r}$ span column/row spaces; $r=\operatorname{rank}(A)$.

  • Pseudoinverse: $A^+ = V\Sigma^+ U^\top$; minimal-norm solution $x^* = A^+ b$.

  • Examples:

    • Rank via SVD: count $\sigma_i > \tau$ with threshold $\tau$.

    • Projection onto column space: $P_{\text{col}} = U_r U_r^\top$; onto row space: $P_{\text{row}} = V_r V_r^\top$.

    • Covariance rank: $\operatorname{rank}(X_c^\top X_c) \le n-1$ for centered data.
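The projection formulas in the examples above can be verified in a few lines; a sketch with an illustrative rank-deficient matrix:

```python
import numpy as np

np.random.seed(6)
A = np.random.randn(5, 4)
A[:, 3] = A[:, 0] - A[:, 1]   # force rank 3

U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(S > 1e-10))
Ur, Vr = U[:, :r], Vt[:r].T

P_col = Ur @ Ur.T   # projector onto col(A)
P_row = Vr @ Vr.T   # projector onto row(A)

print("rank:", r)
print("P_col idempotent:", np.allclose(P_col @ P_col, P_col))
print("P_col A == A:", np.allclose(P_col @ A, A))
print("A P_row == A:", np.allclose(A @ P_row, A))
```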

Pitfalls & sanity checks
  • Never invert $X^\top X$ when $\operatorname{rank}(X)<d$; use QR/SVD or regularize.

  • Diagnose numerical rank via singular values; set thresholds based on scale.

  • Center data for covariance; otherwise rank properties and PCA directions change.

  • Beware overfitting: increasing rank (k in PCA/factorization) beyond signal raises variance.

  • Attention heads: too small $d_k$ may limit expressivity; too large may hurt stability.
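The first pitfall can be made concrete: with a rank-deficient design the normal-equations matrix is numerically singular, while an SVD-backed solver still returns a minimal-norm fit. An illustrative sketch:

```python
import numpy as np

np.random.seed(7)
n, d = 15, 4
X = np.random.randn(n, d)
X[:, 3] = X[:, 0] + X[:, 2]    # rank-deficient design: rank(X) = 3
y = np.random.randn(n)

XtX = X.T @ X
print("cond(X^T X):", np.linalg.cond(XtX))   # enormous: do not invert

# An SVD-backed solver returns the minimal-norm least-squares fit instead
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq residual:", np.linalg.norm(y - X @ w))
```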

References

Foundations and theory

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.).

  2. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Low-rank approximation and factorization

  4. Eckart, C., & Young, G. (1936). Best rank-$k$ approximation.

  5. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices.

  6. Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization.

  7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

Regression and pseudoinverse

  8. Penrose, R. (1955). A generalized inverse for matrices.

  9. Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression.

  10. Tibshirani, R. (1996). Lasso.

ML systems and practice

  11. Jolliffe, I. (2002). Principal Component Analysis.

  12. Shlens, J. (2014). A Tutorial on Principal Component Analysis.

  13. Murphy, K. P. (2022). Probabilistic Machine Learning.

  14. Vaswani, A. et al. (2017). Attention Is All You Need.

  15. Devlin, J. et al. (2019). BERT.

Five worked examples

Worked Example 1: Detecting multicollinearity via null space (non-unique regression)#

Introduction#

Show how null space reveals linear dependencies among features and why OLS becomes non-unique when $\operatorname{rank}(X)<d$.

Purpose#

Compute null space vectors and connect them to redundant directions; use pseudoinverse for a minimal-norm solution.

Importance#

Avoids unstable fits and clarifies identifiability in models.

What this example demonstrates#

  • If $v\in\text{null}(X)$, $X(w+\alpha v) = Xw$ for all $\alpha$; infinitely many OLS solutions.

  • Pseudoinverse $w^* = X^+ y$ yields the minimal-norm solution.

Background#

Rank deficiency arises from duplicate/derived features or insufficient data.

Historical context#

Gauss/Legendre least squares; Penrose pseudoinverse enables solutions in singular cases.

Prevalence in ML#

High-dimensional regression, feature engineering pipelines, polynomial expansions.

Notes#

  • Use SVD to diagnose numerical rank; add Ridge to regularize.

Connection to ML#

Feature selection and regularization strategies hinge on rank awareness.

Connection to Linear Algebra Theory#

FTLA: residuals in $\text{null}(X^\top)$; solution set $w_0 + \text{null}(X)$.

Pedagogical Significance#

Makes the geometry of “non-unique solutions” tangible.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Penrose (1955). Moore–Penrose pseudoinverse.

  3. Hoerl & Kennard (1970). Ridge regression.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 6
X = np.random.randn(n, d)
X[:, 5] = X[:, 0] + X[:, 1]  # make a perfectly collinear feature
w_true = np.array([1.0, -0.5, 0.3, 0.0, 2.0, 0.3])
y = X @ w_true + 0.1 * np.random.randn(n)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
rank = np.sum(S > 1e-8)
nullspace_basis = Vt[rank:].T  # columns spanning null(X)

print("rank(X)=", rank, " d=", d, " nullity=", d - rank)
print("Nullspace basis shape:", nullspace_basis.shape)

# Minimal-norm solution via pseudoinverse (mask tiny singular values before dividing)
S_inv = np.where(S > 1e-12, 1.0 / S, 0.0)
w_min = Vt.T @ (S_inv * (U.T @ y))
print("||w_min||2=", np.linalg.norm(w_min))
print("OLS residual norm:", np.linalg.norm(y - X @ w_min))

Worked Example 2: Covariance rank ≤ n−1 (PCA in n<d regimes)#

Introduction#

Verify empirically that centered covariance has rank at most $n-1$ regardless of feature dimension.

Purpose#

Explain why PCA cannot produce more than $n-1$ nonzero eigenvalues and how this affects high-dimensional settings.

Importance#

Shapes expectations for PCA on small data; prevents overinterpretation.

What this example demonstrates#

  • With $X_c\in\mathbb{R}^{n\times d}$ centered, $\operatorname{rank}(X_c) \le \min(n-1, d)$; hence $\operatorname{rank}(\Sigma) \le n-1$.

Background#

Centering imposes a linear constraint across rows, reducing rank by at least one when $n>1$.

Historical context#

PCA theory and practice emphasize centering for correct variance structure.

Prevalence in ML#

Common in text, genomics, and other $d\gg n$ problems.

Notes#

  • Always center before PCA; whitening depends on accurate rank.

Connection to ML#

Model selection of $k$ principal components must respect $n-1$ limit.

Connection to Linear Algebra Theory#

Row-sum constraint places $\mathbf{1}$ in $\text{null}(X_c^\top)$.

Pedagogical Significance#

Reinforces how constraints reduce rank.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 30, 200
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
rank = np.sum(S > 1e-10)
print("rank(Xc)=", rank, " <= min(n-1,d)=", min(n-1, d))

Worked Example 3: Low-rank matrix factorization for recommendation (synthetic)#

Introduction#

Construct a synthetic user–item rating matrix with known low rank and recover it via truncated SVD.

Purpose#

Demonstrate latent-factor modeling and show reconstruction error scales with tail singular values.

Importance#

Illustrates the power of rank reduction in recommender systems.

What this example demonstrates#

  • $R\approx U_k \Sigma_k V_k^\top$ captures most variance when spectrum decays.

Background#

Matrix factorization underlies collaborative filtering; ALS/SGD optimize latent vectors.

Historical context#

Post-Netflix Prize, low-rank methods became industry standard.

Prevalence in ML#

Ubiquitous in recommendation and implicit feedback modeling.

Notes#

  • For missing data, completion requires specialized optimization (not shown here).

Connection to ML#

Latent dimensions reflect user/item factors; rank controls capacity.

Connection to Linear Algebra Theory#

Eckart–Young guarantees best rank-$k$ approximation.

Pedagogical Significance#

Shows direct link from SVD to practical factor models.

References#

  1. Koren, Bell, Volinsky (2009). Matrix factorization techniques for recommender systems.

  2. Candès & Recht (2009). Exact matrix completion via convex optimization.

Solution (Python)#

import numpy as np

np.random.seed(2)
u, i, k = 80, 60, 5
U_true = np.random.randn(u, k)
V_true = np.random.randn(i, k)
R = U_true @ V_true.T + 0.1 * np.random.randn(u, i)

U, S, Vt = np.linalg.svd(R, full_matrices=False)
Rk = (U[:, :k] * S[:k]) @ Vt[:k]
err = np.linalg.norm(R - Rk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-5))

Worked Example 4: Moore–Penrose pseudoinverse — minimal-norm solutions#

Introduction#

Solve $Ax=b$ when $A$ is rectangular or rank-deficient; verify $AA^+A=A$ and minimal norm among all solutions.

Purpose#

Provide a robust recipe for under-/overdetermined systems.

Importance#

Avoids fragile inverses and clarifies the solution geometry.

What this example demonstrates#

  • $x^*=A^+ b$ minimizes $\lVert x\rVert_2$ subject to $Ax=b$ for consistent systems.

  • Penrose conditions hold numerically.

Background#

Pseudoinverse defined via SVD; used in control, signal processing, ML.

Historical context#

Penrose (1955) established the four defining equations.

Prevalence in ML#

Closed-form layers, analytic baselines, and data-fitting routines.

Notes#

  • Use SVD-backed implementations; threshold small singular values.

Connection to ML#

Stable baselines and analytic steps inside pipelines.

Connection to Linear Algebra Theory#

Projects onto $\text{row}(A)$/$\text{col}(A)$; minimal-norm in $\text{null}(A)$ components.

Pedagogical Significance#

Bridges algebraic definition to numerical practice.

References#

  1. Penrose, R. (1955). A generalized inverse for matrices.

  2. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(3)
m, d = 10, 12
A = np.random.randn(m, d)
# Make A rank-deficient by introducing a collinear column (one singular value becomes ~0)
A[:, 0] = A[:, 1] + A[:, 2]
b = np.random.randn(m)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
S_inv = np.where(S > 1e-10, 1.0 / S, 0.0)
A_plus = Vt.T @ np.diag(S_inv) @ U.T

x_star = A_plus @ b
print("Penrose A A^+ A ~ A?", np.allclose(A @ (A_plus @ A), A, atol=1e-8))
print("Residual ||Ax-b||:", np.linalg.norm(A @ x_star - b))
print("||x_star||2:", np.linalg.norm(x_star))

Worked Example 5: Rank of attention scores QK^T (expressivity bound)#

Introduction#

Show that the attention score matrix $S=QK^\top$ has rank at most $\min(n, d_k)$ and explore implications for head dimension.

Purpose#

Connect feature dimension to expressivity through rank bounds.

Importance#

Head size choices affect the diversity of attention patterns.

What this example demonstrates#

  • For $Q\in\mathbb{R}^{n\times d_k}$ and $K\in\mathbb{R}^{n\times d_k}$, $\operatorname{rank}(QK^\top) \le \min(n, d_k)$.

Background#

Rank of product bounded by inner dimension; scaled dot-products preserve rank.

Historical context#

Transformers leverage multiple heads to increase effective rank/expressivity.

Prevalence in ML#

All transformer models; multi-head concatenation increases representational capacity.

Notes#

  • Multi-head attention can be seen as block structures that raise overall rank after concatenation.

Connection to ML#

Guides architecture design (choosing $d_k$ and number of heads).

Connection to Linear Algebra Theory#

Rank bounds and product properties.

Pedagogical Significance#

Links a practical hyperparameter to a crisp linear algebra bound.

References#

  1. Vaswani, A. et al. (2017). Attention Is All You Need.

  2. Devlin, J. et al. (2019). BERT.

Solution (Python)#

import numpy as np

np.random.seed(4)
for n, dk in [(32, 8), (32, 32), (64, 16)]:
    Q = np.random.randn(n, dk)
    K = np.random.randn(n, dk)
    S = Q @ K.T
    r = np.linalg.matrix_rank(S)
    print(f"n={n}, d_k={dk}, rank(S)={r}, bound={min(n, dk)}")
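The multi-head remark in the Notes can also be illustrated: concatenating per-head outputs (a sketch using random projections rather than trained weights) produces a matrix whose rank can reach $H \cdot d_k$, well above any single head's bound:

```python
import numpy as np

np.random.seed(8)
n, dk, H = 32, 4, 4

heads = []
for _ in range(H):
    Q = np.random.randn(n, dk)
    K = np.random.randn(n, dk)
    V = np.random.randn(n, dk)
    S = Q @ K.T              # per-head scores: rank <= dk
    heads.append(S @ V)      # per-head output: rank <= dk

concat = np.concatenate(heads, axis=1)  # shape (n, H*dk)
print("single-head rank bound:", dk)
print("rank of concatenated heads:", np.linalg.matrix_rank(concat), "<=", H * dk)
```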
