Chapter 16
Matrix Products in DL & Transformers
Key ideas: Introduction

Introduction#

Neural networks are fundamentally stacks of matrix multiplications. A forward pass through a deep network is a product of weight matrices and activations. Each layer computes $A_{l+1} = \sigma(A_l W_l^\top)$ where $W_l \in \mathbb{R}^{d_{l+1} \times d_l}$, $A_l \in \mathbb{R}^{n \times d_l}$ (batch size $n$, layer input dimension $d_l$, output dimension $d_{l+1}$). The cost is $O(n d_l d_{l+1})$ per layer. Transformers add attention: $\text{Attention}(Q, K, V) = \text{softmax}(Q K^\top / \sqrt{d_k}) V$, which involves two GEMM operations ($QK^\top$ and the product with $V$) plus a row-wise softmax. For a Transformer with sequence length $L$ and hidden dimension $d$, attention cost is $O(L^2 d)$ per layer (quadratic in sequence length, a major bottleneck). Modern accelerators (GPUs, TPUs) are matrix-multiply engines delivering trillions of floating-point operations per second (TFLOPS). Utilization depends on arithmetic intensity (ops/byte): bandwidth-bound operations underutilize the accelerator; compute-bound operations (high arithmetic intensity) achieve near-peak performance. Understanding how to write matrix products that achieve high arithmetic intensity, and how to distribute them across devices, determines whether you can train billion-parameter models.
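
The per-layer and attention cost formulas above can be checked with a few lines of arithmetic; a minimal sketch (layer and sequence sizes here are illustrative):

```python
def forward_flops(batch, dims):
    """FLOPs for a forward pass through dense layers: 2 * n * d_in * d_out per layer."""
    return sum(2 * batch * d_in * d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

def attention_flops(L, d):
    """Scaled dot-product attention: Q K^T and A V each cost 2 * L^2 * d FLOPs."""
    return 4 * L * L * d

n, dims = 256, [1024, 4096, 1024]
print(f"MLP forward: {forward_flops(n, dims) / 1e9:.1f} GFLOPs")
print(f"Attention (L=2048, d=1024): {attention_flops(2048, 1024) / 1e9:.1f} GFLOPs")
```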

Important ideas#

  1. Matrix-matrix multiplication (GEMM) structure

    • Dense GEMM: $C \leftarrow AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

    • Arithmetic: $2mkn = O(mkn)$ floating-point operations (FLOPs): $mkn$ multiplications and $mkn$ additions.

    • Memory: $O(mk + kn + mn)$ words (inputs + output); on GPU, $m, k, n$ can each be in the $1000$s, so operands occupy MB to GB of memory.

    • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes}} = \frac{2mkn}{8(mk + kn + mn)} = \frac{mkn}{4(mk + kn + mn)}$ in FP64 (higher is better); for square matrices ($m = k = n$) this grows as $n/12$.

  2. Blocking and cache efficiency

    • GEMM blocked into $b \times b$ tiles; each tile multiplied using fast cache.

    • Cache line length (64 bytes typical); GEMM loads tile once, reuses it $O(b)$ times.

    • Roofline model: peak FLOP rate vs. memory bandwidth; if arithmetic intensity $< I_{\text{roof}}$, algorithm is bandwidth-bound.
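
The blocking idea can be sketched in NumPy; this toy version (tile size $b$ is illustrative, and real kernels block for registers and multiple cache levels) shows the tile structure without changing the result:

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    """C = A @ B computed b x b tile by tile; each loaded tile is reused across
    a whole row/column of tiles of C, which is what improves cache locality."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, b):
        for j in range(0, n, b):
            for p in range(0, k, b):
                C[i:i+b, j:j+b] += A[i:i+b, p:p+b] @ B[p:p+b, j:j+b]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 192))
B = rng.standard_normal((192, 128))
print(np.allclose(blocked_matmul(A, B), A @ B))  # tiling changes only the order of the work
```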

  3. Batch matrix multiplication (batched GEMM)

    • Forward pass: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size).

    • Exploit parallelism: process multiple batches on multiple cores/GPU SMs.

    • Highly efficient when batch size is large; small batches underutilize accelerator.

  4. Convolution as matrix multiplication (im2col, Winograd)

    • Convolution unfolds as GEMM: reshape input patches into columns; multiply by filter matrix; reshape output.

    • im2col: input image to column matrix; allows use of highly optimized GEMM (cuBLAS, MKL).

    • Cost: $O(k_h k_w d_{\text{in}} d_{\text{out}} h_{\text{out}} w_{\text{out}})$ (kernel height/width $k_h, k_w$, input/output channels $d_{\text{in}}, d_{\text{out}}$, output spatial dims $h_{\text{out}}, w_{\text{out}}$).

    • Winograd: fast convolution via transformed domain; reduces arithmetic for small kernels at the cost of numerical accuracy and implementation complexity.

  5. Scaled dot-product attention

    • Query-key-value paradigm: $Q, K \in \mathbb{R}^{L \times d_k}$, $V \in \mathbb{R}^{L \times d_v}$ (sequence length $L$, head dimensions $d_k$, $d_v$).

    • Attention: (1) $M = Q K^\top / \sqrt{d_k}$ (an $(L \times d_k)(d_k \times L)$ product, $O(L^2 d_k)$), (2) $A = \text{softmax}(M)$ (per-row normalization, no matrix products), (3) $O = AV$ (an $(L \times L)(L \times d_v)$ product, $O(L^2 d_v)$).

    • Total: $O(L^2 (d_k + d_v)) = O(L^2 d)$ (quadratic in sequence length).

    • Challenge: for $L = 4096$, the attention matrix has $L^2 \approx 16.8$M entries per head; across heads and layers, FLOPs and memory run into the billions.

  6. Mixed precision and numerical stability

    • FP32 (single precision, float32): 32 bits, ~7 significant digits; gradients, weights commonly stored in FP32.

    • FP16 (half precision, float16): 16 bits, ~3 significant digits; smallest subnormal $\approx 6 \times 10^{-8}$, maximum $\approx 6.5 \times 10^4$; GPU operations 2–10× faster.

    • BFloat16 (Brain Float): 16 bits, same exponent range as FP32 but only an 8-bit mantissa; FP32's dynamic range with lower precision than FP16.

    • Mixed precision: compute GEMM in FP16 (fast), accumulate in FP32 (stable); scale loss to prevent underflow.
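
The underflow problem and the loss-scaling fix can be demonstrated directly in NumPy (a sketch; production AMP implementations adjust the scale dynamically):

```python
import numpy as np

grad = np.float32(1e-8)              # a small FP32 gradient
print(np.float16(grad))              # 0.0 -- below FP16's subnormal range (~6e-8), it flushes to zero

scale = np.float32(2.0 ** 16)        # static loss scale; AMP frameworks tune this dynamically
scaled = np.float16(grad * scale)    # ~6.6e-4, comfortably representable in FP16
print(scaled)

recovered = np.float32(scaled) / scale
print(recovered)                     # close to the original 1e-8 after unscaling
```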

  7. Distributed matrix multiplication

    • Data parallelism: replicate model on each device; partition minibatches; synchronize gradients via all-reduce.

    • Model parallelism: partition matrix weights across devices; communication within matrix product (e.g., matmul followed by communication).

    • Pipeline parallelism: partition layers across devices; overlap computation on layer $i$ with communication on layer $i-1$.

    • Cost: compute + communication latency; communication often dominates at large scale (Roofline model).

Relevance to ML#

  • Convolutional neural networks (CNNs): Forward and backward passes are GEMM-heavy; efficiency determines whether you can train on billion-pixel images or video.

  • Recurrent neural networks (RNNs), LSTMs, GRUs: Fully-connected layers between timesteps; matrix products per timestep.

  • Transformers and large language models: Attention contributes $O(L^2 d)$ matrix products per layer; for GPT-3 ($L = 2048$, $d = 12288$), the projection and MLP GEMMs dominate since $d \gg L$, but attention dominates as context length grows.

  • Graph neural networks (GNNs): Graph convolution is sparse matrix product; efficiency depends on sparsity and format.

  • Distributed training: Modern LLMs trained on thousands of GPUs/TPUs; communication cost (network bandwidth) often exceeds computation cost.

Algorithmic development (milestones)#

  • 1969: Strassen algorithm: $O(n^{2.807})$ vs. $O(n^3)$ naive GEMM (theoretically significant; rarely used in practice due to constants).

  • 1979–1990: Level-1/2/3 BLAS (Basic Linear Algebra Subprograms); standardized interface for matrix ops; LAPACK (1992) built on BLAS.

  • 1995–2005: GPU era begins: NVIDIA GeForce, Tesla; GPUs offer roughly an order of magnitude more memory bandwidth than CPUs; GEMMs run 10–100× faster.

  • 2006: CUDA (Compute Unified Device Architecture) released; enables general-purpose GPU computing; cuBLAS optimized GEMM for NVIDIA GPUs.

  • 2012: AlexNet (Krizhevsky et al.) demonstrates deep CNN training on GPUs; FLOPs dominate; GEMM-heavy.

  • 2015: Batch normalization (Ioffe & Szegedy); reduces sensitivity to initialization and stabilizes large-scale training.

  • 2017: Transformer architecture (Vaswani et al.); attention is dense GEMM-based; quadratic in sequence length.

  • 2017: Mixed precision training (Micikevicius et al.); FP16 compute with FP32 accumulation and loss scaling enables 2–3× speedups on GPUs.

  • 2018–2020: Distributed training frameworks mature (PyTorch DDP, Horovod); multi-billion-parameter models trained via model parallelism.

  • 2020–2023: FlashAttention (Dao et al. 2022) reduces attention memory traffic via tiling and recomputation, never materializing the full $L \times L$ matrix; Megatron-LM and DeepSpeed enable distributed GEMMs at petaflop scale.

Definitions#

  • GEMM (General Matrix Multiply): $C \leftarrow \alpha A B + \beta C$ (standard matrix multiply with scaling/accumulation).

  • FLOP (floating-point operation): One add or multiply; GEMM $C \leftarrow AB$ is $2mkn$ FLOPs.

  • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes read/written}}$ (ops per byte); high $I$ means compute-bound; low $I$ means bandwidth-bound.

  • Roofline model: Peak achievable throughput = $\min(\text{peak FLOP rate}, \text{memory bandwidth} \times \text{arithmetic intensity})$.

  • Memory-bound: Algorithm where memory bandwidth is bottleneck; cannot achieve peak FLOP rate.

  • Compute-bound: Algorithm where compute is bottleneck; limited by FLOPs/cycle, not memory.

  • Mixed precision: Using multiple precision levels (e.g., FP16 for compute, FP32 for accumulation) to trade accuracy for speed.

  • All-reduce: Distributed operation: each device sums its values with all others; result replicated on all devices. Cost: $O(\log D)$ communication rounds for $D$ devices.

  • Collective communication: Broadcasting, all-reduce, reduce-scatter, all-gather operations in distributed training.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • GEMM arithmetic and complexity: $O(mkn)$ FLOPs, memory $O(mk + kn + mn)$. Reference: Golub & Van Loan (2013).

  • Arithmetic intensity and Roofline model: $I = \text{FLOPs/bytes}$; peak rate is $\min(\text{FLOP rate}, \text{bandwidth} \times I)$. Reference: Williams et al. (2009).

  • Cache-oblivious algorithms: Block-recursive GEMM achieves near-optimal cache behavior independent of cache size. Reference: Frigo et al. (1999).

  • Batched GEMM: Independent products $C_i \leftarrow A_i B_i$; parallelism across batch dimension. Reference: BLAS 3 standard (1990).

  • Attention complexity: Scaled dot-product $O(L^2 d)$ without optimizations; challenges for long contexts. Reference: Vaswani et al. (2017).

  • Distributed GEMM: Communication cost for gradient all-reduce, model/data parallelism. Reference: Thakur et al. (2005) (MPI Collective Communications).

Applied (landmark systems)#

  • Level-3 BLAS (cuBLAS, MKL): Industry-standard GEMM implementations; peak performance on CPUs/GPUs. Implementation: NVIDIA cuBLAS, Intel MKL. Reference: Dongarra et al. (1990) (BLAS 3).

  • Convolution as GEMM (im2col): Standard in Caffe and PyTorch CPU backends; enables reuse of optimized GEMM. Implementation: PyTorch conv2d uses im2col on CPU. Reference: Jia et al. (2014).

  • Mixed precision training: Automatic mixed precision in PyTorch (torch.cuda.amp), TensorFlow (tf.keras.mixed_precision). Achieves 2–3× speedup on V100/A100. Reference: NVIDIA Automatic Mixed Precision Training Guide (2020).

  • Distributed GEMM (Megatron-LM, DeepSpeed): Tensor parallelism partitions GEMM across devices; pipeline parallelism overlaps layers. Implementation: Microsoft DeepSpeed, NVIDIA Megatron-LM. Reference: Shoeybi et al. (2019); Rasley et al. (2020).

  • Flash Attention: IO-efficient attention via blocked matrix products; reduces HBM reads/writes by up to an order of magnitude. Implementation: Tri Dao’s flash-attention library. Reference: Dao et al. (2022).

Key ideas: Where it shows up
  1. Convolutional neural networks and image classification

    • Forward pass: convolutional layers (im2col GEMM), batch norm (element-wise), pooling (no GEMM).

    • Backward: weight gradient, input gradient via GEMM.

    • Achievements: ResNet-50 trained on ImageNet in 1 hour on 256 GPUs (Goyal et al. 2017); later large-cluster and mixed-precision efforts reduced this to minutes. References: Krizhevsky et al. (2012) (AlexNet); He et al. (2015) (ResNet); Goyal et al. (2017) (accurate large-batch SGD).

  2. Transformer models and large language models

    • Per-layer: QKV projections (3 GEMMs), attention (2 GEMMs), output projection (1 GEMM), MLP (2 GEMMs) ≈ 8 GEMMs per layer.

    • Attention cost: $O(L^2 d)$ (quadratic in sequence length); dominates for long sequences.

    • Achievements: GPT-3 (Brown et al. 2020) was trained on ~300 billion tokens (~$3 \times 10^{23}$ FLOPs), parallelized across thousands of GPUs using model parallelism. FlashAttention (Dao et al. 2022) reduces attention memory traffic by roughly 10×. References: Vaswani et al. (2017) (Transformer); Brown et al. (2020) (GPT-3); Dao et al. (2022) (FlashAttention).

  3. Distributed training and synchronization

    • Data parallelism: gradient all-reduce after each minibatch.

    • Model parallelism: gradient exchanges within matrix products.

    • Achievements: LAMB optimizer (You et al. 2019) enables BERT pretraining with batch size 32k on a TPU pod in 76 minutes. Megatron-LM (Shoeybi et al. 2019) trains GPT models with tensor parallelism. References: You et al. (2019) (LAMB); Shoeybi et al. (2019) (Megatron-LM).

  4. Mixed precision training

    • Automatic mixed precision (AMP): dynamically select FP16/FP32 for operations.

    • Loss scaling: prevent FP16 gradient underflow.

    • Achievements: NVIDIA Automatic Mixed Precision reduces training time by 2–3× on V100/A100 while maintaining accuracy. References: NVIDIA Mixed Precision Training guide; Micikevicius et al. (2018).

  5. Graph neural networks and sparse matrix products

    • Graph convolution: $X' = \sigma(AXW)$ where $A$ is sparse adjacency matrix.

    • Sparse-dense GEMM: cost $O(\text{nnz}(A) \cdot d)$; arithmetic intensity is lower than dense GEMM, but the computation is feasible for sparse graphs.

    • Achievements: DGL, PyG enable billion-node GNNs via optimized sparse GEMMs. References: Kipf & Welling (2017) (GCN); Wang et al. (2019) (DGL); Fey et al. (2019) (PyG).
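
The graph-convolution product $\sigma(AXW)$ reduces to one dense GEMM plus one sparse-dense product; a minimal sketch with SciPy (graph size, density, and dimensions are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n_nodes, d_in, d_out = 1000, 64, 32

# Sparse adjacency with ~1% density, so nnz(A) << n_nodes^2
A = sp.random(n_nodes, n_nodes, density=0.01, format="csr", random_state=1)
X = rng.standard_normal((n_nodes, d_in))
W = rng.standard_normal((d_in, d_out))

# Dense GEMM first (O(n * d_in * d_out)), then sparse-dense product (O(nnz(A) * d_out))
H = A @ (X @ W)
print(H.shape, f"nnz(A) = {A.nnz}")
```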

Notation
  • Matrix product: $C \leftarrow A B$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

  • Batched product: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size); vectorization across batch.

  • Attention: $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) V$ with $Q, K, V \in \mathbb{R}^{L \times d}$ (sequence length $L$, dimension $d$).

  • Complexity: Attention is $O(L^2 d)$ FLOPs; dense GEMM is $O(n d_{\text{in}} d_{\text{out}})$ per layer (batch size $n$).

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$ (depends on matrix shapes; higher $I$ achieves better GPU utilization).

  • FLOP rate: Peak $P$ (e.g., ~15 TFLOPS for V100 in FP32); practical: $P \times \text{efficiency}$ (typically 50–80%).

  • Memory bandwidth: $B$ (e.g., 900 GB/s for V100 HBM2, ~2 TB/s for A100); roofline: achieved throughput $= \min(P, I \times B)$.

  • Example: ResNet-50 forward pass: ~8 GFLOPs per image; batch size 256 ≈ 2 TFLOPs of work; at ~80% utilization an A100 sustains ~16 TFLOPS, so the batch takes ≈ 0.13 s.

Pitfalls & sanity checks
  • Batch size too small: GPUs underutilized; poor arithmetic intensity. Typical minimum: 32–64 per device.

  • Tall-skinny GEMM: Low arithmetic intensity; underutilize accelerator. Prefer square or batched products.

  • Ignoring data layout: Row-major vs. column-major affects cache performance by 10×.

  • Mixed precision without loss scaling: small FP16 gradients underflow (values below $\approx 6 \times 10^{-8}$ flush to zero); loss scaling prevents this (multiply the loss by, e.g., $2^{16}$, then unscale gradients before the update).

  • Attention without length limits: Quadratic memory; even with batch size 1, $L = 8192$ requires 256 MB for single head.

  • Synchronous all-reduce without compression: Communication time dominates; gradient compression (sparsification, quantization) essential at scale.

  • Assuming linear scaling: Communication cost breaks linear scaling; efficiency commonly falls from ~90% on a handful of devices to far lower on hundreds, depending on interconnect and model size.

  • Convolution without im2col: Naive loops 100–1000× slower than GEMM-based implementation.

References

Matrix multiplication theory

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Strassen, V. (1969). Gaussian elimination is not optimal.

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Performance modeling and BLAS

  1. Dongarra, J., Du Croz, J., Hammarling, S., & Hanson, R. H. (1990). An extended set of FORTRAN basic linear algebra subprograms.

  2. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  3. Grigori, L., Demmel, J., & Xiang, H. (2008). Communication avoiding Gaussian elimination.

Deep learning and convolution

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.

  3. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

Transformer and attention

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners.

  3. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

Mixed precision and numerical stability

  1. Micikevicius, P., Narang, S., Alben, J., et al. (2018). Mixed precision training.

  2. Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift.

  3. NVIDIA Automatic Mixed Precision Training Guide. (2020).

Distributed training

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

  3. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  4. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

  5. You, Y., Gitman, I., & Ginsburg, B. (2019). Large batch optimization for deep learning: training BERT in 76 minutes.

Attention optimization

  1. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

  2. Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse transformers.

  3. Fan, A., Grave, E., & Joulin, A. (2019). Reducing transformer depth on demand with structured dropout.

Five worked examples

Worked Example 1: GEMM efficiency and arithmetic intensity#

Introduction#

Implement dense matrix multiplication on CPU and GPU; measure FLOP rate and memory bandwidth utilization; demonstrate how matrix shape affects arithmetic intensity.

Purpose#

Understand relationship between GEMM dimensions and arithmetic intensity; show how to achieve peak GPU performance.

Importance#

Foundation for understanding deep learning performance; shapes (batch size, hidden dimensions) directly impact training time.

What this example demonstrates#

  • Construct tall-skinny vs. square GEMM matrices.

  • Measure FLOPs and memory bandwidth for each.

  • Compute arithmetic intensity $I = \text{FLOPs/bytes}$.

  • Compare achieved FLOP rate vs. peak.

  • Predict speedup from roofline model.

Background#

GEMM efficiency depends on matrix shape: square matrices have high arithmetic intensity; tall-skinny have low intensity.

Historical context#

Roofline model (Williams et al. 2009) formalizes this trade-off; guides architecture and algorithm design.

History#

Standard framework for performance modeling in HPC and ML systems.

Prevalence in ML#

Every deep learning practitioner adjusts batch size, layer dimensions to maximize GPU utilization.

Notes#

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$; maximized when $m \approx k \approx n$ (cube).

  • For fixed $k$, varying $m, n$ (batch size, hidden dims) changes $I$ by 10×.

Connection to ML#

Batch size and hidden dimension choices affect both accuracy and training speed; understanding trade-offs is critical.

Connection to Linear Algebra Theory#

GEMM is fundamental linear algebra operation; efficiency is determined by cache locality (blocking theory).

Pedagogical Significance#

Demonstrates practical performance modeling; connects theory (arithmetic intensity) to practice (measured FLOP rates).

References#

  1. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Solution (Python)#

import numpy as np
import time

np.random.seed(35)

# Test different matrix shapes (keeping k fixed)
k = 1024
shapes = [
    (128, k, 128),    # Tall-skinny-ish: low intensity
    (1024, k, 1024),  # Square: high intensity
    (4096, k, 4096),  # Large square: even higher
]

print("GEMM Efficiency Analysis")
print("=" * 80)
print(f"{'m x n':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Intensity':15} {'Est. GFLOPs':15}")
print("-" * 80)

for m, k_dim, n in shapes:
    # Arithmetic
    flops = 2 * m * k_dim * n
    # Memory: read A (m*k), read B (k*n), write C (m*n)
    mem_bytes = 8 * (m * k_dim + k_dim * n + m * n)
    intensity = flops / mem_bytes
    
    # Estimate performance from roofline
    # Assume: peak ~15 TFLOPS (V100 FP32), bandwidth 900 GB/s (V100 HBM2)
    peak_flops = 15e12
    bandwidth = 900e9
    roofline = min(peak_flops, bandwidth * intensity)
    
    print(f"{m}x{n}         {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {intensity:>14.2f} {roofline/1e9:>14.1f}")

print("\n" + "=" * 80)
print("Key insight: Higher arithmetic intensity -> higher roofline GFLOPs")
print("Square matrices (m ~ k ~ n) achieve roughly 10x higher intensity than tall-skinny")

Worked Example 2: Batched GEMM and GPU parallelism#

Introduction#

Implement batched matrix multiplication; measure performance as batch size varies; show speedup from batch parallelism.

Purpose#

Demonstrate how batch dimension enables parallelism; show relationship between batch size and GPU utilization.

Importance#

Batch size is a key hyperparameter; understanding its impact on performance guides training setup.

What this example demonstrates#

  • Generate batched matrices $A_i, B_i$ for $i = 1, \ldots, B$.

  • Time batched GEMM vs. sequential.

  • Measure speedup; show scaling with batch size.

  • Explain why small batches underutilize GPU.

Background#

GPUs have thousands of cores; small batches can’t keep all cores busy; large batches achieve better utilization.

Historical context#

Batch GEMM standardized in BLAS Level 3 (1990); essential for CNN/RNN training.

History#

Modern frameworks (PyTorch, TensorFlow) automatically batch GEMMs; rarely needs manual tuning.

Prevalence in ML#

Every training loop uses batched GEMM; batch size choice directly impacts throughput.

Notes#

  • Batch size $B = 1$: each GEMM is independent; throughput limited.

  • $B = 32$: better utilization; GPUs have 80+ SMs (streaming multiprocessors).

  • $B = 256$: excellent utilization; typical for modern training.

Connection to ML#

Batch size affects both convergence (larger batches can have worse generalization) and speed; practical sweet spot is usually 32–256.

Connection to Linear Algebra Theory#

Batched GEMM exploits structure (independent problems); vectorization across batch dimension.

Pedagogical Significance#

Shows interplay between algorithm structure and hardware parallelism.

References#

  1. Dongarra, J., Du Croz, J., Hammarling, S., & Hanson, R. H. (1990). An extended set of Fortran basic linear algebra subprograms.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  3. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

Solution (Python)#

import numpy as np
import time

np.random.seed(36)

# Batched GEMM performance
batch_sizes = [1, 4, 16, 64, 256]
m, k, n = 1024, 1024, 1024
iterations = 10

print("Batched GEMM Performance (m=k=n={}, {} iterations)".format(m, iterations))
print("=" * 60)
print(f"{'Batch Size':15} {'Total Time (s)':20} {'GFLOPs':15}")
print("-" * 60)

for B in batch_sizes:
    # Create batch of matrices
    A = np.random.randn(B, m, k).astype(np.float32)
    B_mat = np.random.randn(B, k, n).astype(np.float32)
    
    # Batched matmul (np.matmul loops over the batch in compiled code; a GPU would process batches in parallel)
    t0 = time.time()
    for _ in range(iterations):
        C = np.matmul(A, B_mat)
    t_total = time.time() - t0
    
    # FLOPs: 2mkn per batch, B batches, iterations
    flops = iterations * B * 2 * m * k * n
    gflops = flops / (t_total * 1e9)
    
    print(f"{B:>14} {t_total:>19.4f} {gflops:>14.1f}")

print("\n" + "=" * 60)
print("Note: On a GPU, larger batches improve utilization and achieved GFLOPs; CPU timings may vary")

Worked Example 3: Convolution as GEMM (im2col)#

Introduction#

Implement convolution using naive loops, then via im2col GEMM; measure speedup from optimized GEMM.

Purpose#

Show how convolution is equivalent to matrix multiplication; demonstrate efficiency gain from reusing optimized GEMM.

Importance#

Foundational for understanding why GPUs excel at CNNs; im2col is standard in production frameworks.

What this example demonstrates#

  • Naive 5-loop convolution implementation.

  • im2col transformation: reshape patches into columns.

  • GEMM on im2col matrix; reshape output.

  • Compare naive vs. GEMM time.

Background#

Convolution unfolds into GEMM; allows reuse of highly-tuned BLAS kernels; 10–100× speedup.

Historical context#

im2col technique developed for efficient convolution implementations in early deep learning (Caffe, 2013).

History#

Standard in all deep learning frameworks; sometimes augmented by Winograd for further speedup.

Prevalence in ML#

Every CNN implementation uses im2col or similar GEMM-based convolution.

Notes#

  • im2col memory overhead: factors of 2–4× larger than direct convolution; trade memory for speed.

  • Winograd convolution (for $3 \times 3$ kernels): lower arithmetic but numerically complex.

Connection to ML#

Convolutional layers dominate image classification and detection models; efficiency here directly impacts training speed.

Connection to Linear Algebra Theory#

Convolution is linear transformation; im2col exploits structure to reduce to GEMM.

Pedagogical Significance#

Demonstrates how abstract operations (convolution) map to concrete linear algebra (GEMM).

References#

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

  3. Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks.

Solution (Python)#

import numpy as np
import time

np.random.seed(37)

# Convolution parameters
batch_size, in_height, in_width, in_channels = 32, 64, 64, 3
out_channels, kernel_h, kernel_w, stride = 16, 3, 3, 1
pad = 1

# Padded input
X_padded = np.pad(np.random.randn(batch_size, in_height, in_width, in_channels),
                   ((0,0), (pad,pad), (pad,pad), (0,0)), mode='constant')
W = np.random.randn(out_channels, kernel_h, kernel_w, in_channels)

# Output dimensions
out_height = (in_height + 2*pad - kernel_h) // stride + 1
out_width = (in_width + 2*pad - kernel_w) // stride + 1

# Naive convolution (slow)
print("Naive convolution (5-loop implementation):")
t0 = time.time()
Y_naive = np.zeros((batch_size, out_height, out_width, out_channels))
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            for c in range(out_channels):
                h_start = h * stride
                w_start = w * stride
                patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
                Y_naive[b, h, w, c] = np.sum(patch * W[c])
t_naive = time.time() - t0
print(f"  Time: {t_naive:.4f} s")

# im2col GEMM (fast)
print("\nim2col GEMM (optimized convolution):")
t0 = time.time()

# im2col: extract patches
X_col = np.zeros((batch_size * out_height * out_width, kernel_h * kernel_w * in_channels))
idx = 0
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            h_start = h * stride
            w_start = w * stride
            patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
            X_col[idx] = patch.reshape(-1)
            idx += 1

# Weight matrix (reshape filters)
W_mat = W.reshape(out_channels, -1).T  # (kernel_h*kernel_w*in_channels, out_channels)

# GEMM
Y_col = X_col @ W_mat  # (batch*out_h*out_w, out_channels)

# Reshape to output
Y_gemm = Y_col.reshape(batch_size, out_height, out_width, out_channels)

t_gemm = time.time() - t0
print(f"  Time: {t_gemm:.4f} s")

print(f"\nSpeedup: {t_naive / t_gemm:.1f}x")
print(f"Results match: {np.allclose(Y_naive, Y_gemm, atol=1e-5)}")

Worked Example 4: Scaled dot-product attention complexity#

Introduction#

Implement attention operation; measure memory and time complexity; show quadratic dependence on sequence length.

Purpose#

Understand why attention is a bottleneck for long sequences; motivate approximate attention methods.

Importance#

Attention scales as $O(L^2 d)$; for long sequences (4K tokens), this dominates; critical for efficiency research.

What this example demonstrates#

  • Implement attention: QK^T, softmax, output.

  • Measure memory (intermediate softmax matrix is $L \times L$).

  • Time scaling with $L$; show quadratic growth.

  • Compare attention time vs. other layers.

Background#

Quadratic attention complexity is fundamental limitation of transformer architecture; many proposed approximations.

Historical context#

Vaswani et al. (2017) introduced the attention-based architecture; the quadratic cost was not a practical bottleneck at the original context lengths ($L \le 512$).

History#

Post-2020, attention optimization becomes major research area: Flash Attention, sparse attention, linear attention variants.

Prevalence in ML#

Every transformer model suffers from quadratic attention; common workaround is to limit context length or use approximations.

Notes#

  • Attention FLOPs: $2L^2 d$ (dominant for $L > d$).

    • Memory: $O(L^2)$ for the attention matrix; for $L = 4096$ in FP32: ~64 MB per head.

Connection to ML#

Limiting context length ($L = 512$ vs. $L = 4096$) is common trade-off between expressiveness and efficiency.

Connection to Linear Algebra Theory#

Attention is polynomial in sequence length; matrix products scale quadratically in one dimension.

Pedagogical Significance#

Shows concrete example of how algorithmic bottleneck (quadratic) impacts practical ML.

References#

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

  3. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

Solution (Python)#

import numpy as np
import time

np.random.seed(38)

# Attention parameters
d = 768  # Hidden dimension
num_heads = 12
d_k = d // num_heads
L_values = [128, 256, 512, 1024, 2048]  # Sequence lengths

print("Attention Complexity Analysis (d={}, num_heads={})".format(d, num_heads))
print("=" * 70)
print(f"{'Seq Len L':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Time (ms)':15}")
print("-" * 70)

for L in L_values:
    batch_size = 1
    
    # Create Q, K, V
    Q = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    K = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    V = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    
    # Measure time and memory
    t0 = time.time()
    
    # Attention: QK^T / sqrt(d_k)
    scores = np.matmul(Q, K.transpose(0, 1, 3, 2))  # (batch, heads, L, L)
    scores = scores / np.sqrt(d_k)
    
    # Softmax
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    
    # Output
    output = np.matmul(weights, V)  # (batch, heads, L, d_k)
    
    t_attn = time.time() - t0
    
    # FLOPs: QK^T = 2*L^2*d_k, softmax ~L^2, output = 2*L^2*d_k
    flops = batch_size * num_heads * (2 * L * L * d_k + 2 * L * L * d_k)
    
    # Memory: scores matrix is L x L per head
    mem_bytes = batch_size * num_heads * L * L * 4
    
    print(f"{L:>14} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {t_attn*1e3:>14.2f}")

print("\n" + "=" * 70)
print("Key insight: FLOPs and memory scale quadratically with sequence length")
print("For L=4096 with 12 heads: ~0.8 GB for the FP32 attention matrices alone; billions of FLOPs")

Worked Example 5: Distributed GEMM and communication cost#

Introduction#

Implement data parallel training with gradient synchronization; measure computation vs. communication time; show communication overhead.

Purpose#

Understand communication bottleneck in distributed training; motivate communication-efficient algorithms.

Importance#

Modern LLMs trained on 1000s of GPUs; communication often dominates; critical for scaling.

What this example demonstrates#

  • Simulate distributed GEMM (matmul on local device).

  • Simulate all-reduce for gradient synchronization.

  • Measure computation time vs. communication time.

  • Show how communication latency scales with number of devices.

Background#

Distributed training divides minibatches across devices; after each minibatch, devices exchange gradients via all-reduce.

Historical context#

Large-batch SGD and gradient compression (2017–2019) driven by communication bottleneck.

History#

Modern frameworks (PyTorch DDP, Horovod) optimize communication; mixed precision + gradient compression reduce overhead.

Prevalence in ML#

Every distributed training uses all-reduce; communication cost is well-studied bottleneck.

Notes#

  • Computation time: $O(B \cdot d_{\text{in}} \cdot d_{\text{out}})$ (linear in batch size, dimensions).

  • Communication time: $O(\log D + d_{\text{gradient}})$ (logarithmic in device count $D$, linear in gradient size).

  • For 1000 devices: all-reduce with $\log D \approx 10$ rounds; if each round takes 10 μs, total ~100 μs; computation often takes ms.

Connection to ML#

Large-batch training requires communication efficiency; gradient compression and other tricks essential for practical scaling.

Connection to Linear Algebra Theory#

All-reduce is tree-based collective communication; optimal complexity is $O(\log D)$.
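The $O(\log D)$ claim can be made concrete with a toy recursive-doubling all-reduce — a minimal sketch assuming a power-of-two device count (production libraries such as NCCL typically use bandwidth-optimal ring variants instead):

```python
import numpy as np

def tree_allreduce(grads):
    """Recursive-doubling all-reduce (sum) over a power-of-two device count.

    grads: list of per-device gradient arrays. After log2(D) exchange rounds,
    every device holds the global sum.
    """
    D = len(grads)
    assert D & (D - 1) == 0, "sketch assumes D is a power of two"
    grads = [g.copy() for g in grads]
    step, rounds = 1, 0
    while step < D:
        # In each round, device i exchanges with partner i XOR step; the XOR
        # pairing partitions devices into disjoint pairs (butterfly pattern).
        for i in range(D):
            partner = i ^ step
            if partner > i:
                s = grads[i] + grads[partner]
                grads[i] = s
                grads[partner] = s.copy()
        step *= 2
        rounds += 1
    return grads, rounds

D = 8
local = [np.full(4, float(i)) for i in range(D)]  # device i holds a vector of i's
reduced, rounds = tree_allreduce(local)
print("rounds:", rounds)                       # log2(8) = 3
print("every device holds the sum:", reduced[0])  # 0+1+...+7 = 28 per entry
```

Each round moves the full gradient, so total time is roughly (rounds) × (gradient bytes / bandwidth) — the model used in the solution code below.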

Pedagogical Significance#

Shows distributed systems aspect of linear algebra; explains why scaling beyond certain point is challenging.

References#

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  3. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

Solution (Python)#

import numpy as np
import time

np.random.seed(39)

# Distributed training simulation
num_devices = [1, 4, 8, 16, 32]
batch_size = 256
hidden_dim = 2048

print("Distributed GEMM: Computation vs. Communication")
print("=" * 70)
print(f"{'Devices':15} {'Comp Time (ms)':20} {'Comm Time (μs)':20} {'Comp/Comm Ratio':15}")
print("-" * 70)

# Simplified model:
# - Computation: 100 GFLOP/s sustained per device (illustrative figure,
#   well below modern accelerator peak)
# - Communication: 25 GB/s interconnect bandwidth (typical)

compute_flops_per_device = 100e9  # FLOP/s
comm_bandwidth = 25e9  # bytes/s (25 GB/s)

for D in num_devices:
    # Local batch per device
    local_batch = batch_size // D
    
    # GEMM: local_batch x hidden_dim x hidden_dim
    flops_local = 2 * local_batch * hidden_dim * hidden_dim
    
    # Computation time
    t_compute = flops_local / compute_flops_per_device
    
    # Communication: all-reduce of the layer's weight gradient
    # (hidden_dim x hidden_dim FP32 entries). Simplified tree all-reduce:
    # O(log D) rounds, full gradient sent each round; >= 1 round even for
    # D=1 to avoid division by zero below.
    comm_rounds = int(np.log2(D)) + 1
    gradient_size = hidden_dim * hidden_dim * 4  # bytes (FP32)
    comm_per_round = gradient_size / comm_bandwidth
    t_comm = comm_rounds * comm_per_round
    
    ratio = t_compute / t_comm
    
    print(f"{D:>14} {t_compute*1e3:>19.3f} {t_comm*1e6:>19.2f} {ratio:>14.1f}x")

print("\n" + "=" * 70)
print("Key insight: per-device compute shrinks as devices are added, while")
print("the all-reduce volume is fixed by the model size, so the compute/comm")
print("ratio falls and communication becomes the bottleneck at scale")

Chapter 15
Sparse Matrices
Key ideas: Introduction

Introduction#

Most real-world data matrices are sparse: social networks are sparse graphs; text is sparse (million-word vocabulary but document has hundreds of words); images are sparse in wavelet domains; scientific simulations generate sparse discretizations. Dense matrices require $O(n^2)$ memory and $O(n^3)$ computation; sparse matrices require only $O(\text{nnz})$. A sparse matrix with $\text{nnz} = O(n)$ or $O(n \log n)$ reduces complexity from cubic to linear or nearly linear. This is transformational: it enables algorithms that would be intractable on dense representations. Exploiting sparsity requires careful algorithm design: level-of-fill control, reordering strategies, specialized data structures, and communication patterns in distributed solvers.

Important ideas#

  1. Sparse matrix representations

    • Compressed Sparse Row (CSR): store nonzeros by row; three arrays (values, column indices, row pointers); $O(\text{nnz} + n)$ memory.

    • Compressed Sparse Column (CSC): symmetric role for columns; efficient for column access.

    • Coordinate format (COO): list of $(i, j, v)$ triples; inefficient for computation but flexible for construction.

    • Diagonal, tridiagonal, block-structured: exploit special structure for even greater efficiency.

  2. Sparsity patterns and fill-in

    • Reachability: a nonzero $(i, j)$ entry is an edge from $i$ to $j$ in the matrix's graph; paths in this graph encode reachability and reveal fill-in and communication patterns.

    • Fill-in: $L, U$ in $A = LU$ can have more nonzeros than $A$; reordering (minimum degree, nested dissection) minimizes fill-in.

    • Symbolic factorization: determine nonzero pattern of $L, U$ without computing values.

  3. Matrix-vector products and efficient computation

    • Dense $y = A x$ costs $O(n^2)$; sparse costs $O(\text{nnz})$.

    • SpMV (sparse matrix-vector product) is bandwidth-limited on modern hardware; careful data layout crucial.

    • GPU sparse kernels: NVIDIA cuSPARSE; CPU sparse kernels: Intel MKL Sparse BLAS with OpenMP parallelization.

  4. Direct sparse solvers

    • LU with partial pivoting: reorder to minimize fill-in, apply pivoting.

    • Cholesky for SPD: symmetric structure implies symmetric reordering; often very sparse.

    • Supernodal methods: group small dense blocks for cache efficiency; key to modern solvers (PARDISO, SuperLU, UMFPACK).

  5. Iterative sparse solvers

    • CG, GMRES, MINRES: require only matrix-vector products; no explicit factorization.

    • Preconditioning essential: ILU (incomplete LU), IC (incomplete Cholesky), algebraic multigrid (AMG) enable convergence.

    • Matrix-free methods: supply only matrix-vector product function; never form or store matrix.

  6. Graph algorithms and adjacency matrices

    • Adjacency matrix $A$ where $A_{ij} = 1$ if edge $(i, j)$ exists; number of nonzeros = $2 \times$ edges (undirected).

    • Shortest paths, PageRank, diffusion: powers of the adjacency matrix count walks and encode reachability.

    • Laplacian matrix $L = D - A$ (degree diagonal minus adjacency); eigenvectors yield graph partitions (spectral clustering).

  7. Storage and computational complexity trade-offs

    • Dense $A \in \mathbb{R}^{n \times n}$: $O(n^2)$ memory, $O(n^3)$ LU, $O(n^2)$ matrix-vector product.

    • Sparse $A$ with $\text{nnz} = O(n)$: $O(n)$ memory, $O(n^{1.5})$ LU with nested-dissection reordering (2D-mesh-type problems), $O(n)$ SpMV.

    • For $n = 10^6$: dense requires terabytes ($8 \times 10^{12}$ bytes in FP64); sparse with $\text{nnz} = 10^7$ requires only hundreds of megabytes.
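The CSR layout from idea 1 can be made concrete with a hand-rolled SpMV — a minimal sketch over a small toy matrix (invented for illustration) that walks the three CSR arrays and checks the result against the dense product:

```python
import numpy as np

# Toy sparse matrix, written out densely for reference:
# [[10,  0,  0,  2],
#  [ 3,  9,  0,  0],
#  [ 0,  7,  8,  7],
#  [ 0,  0,  0,  5]]
values  = np.array([10., 2., 3., 9., 7., 8., 7., 5.])  # nonzeros, row by row
col_idx = np.array([0, 3, 0, 1, 1, 2, 3, 3])           # column of each nonzero
row_ptr = np.array([0, 2, 4, 7, 8])                    # row i spans values[row_ptr[i]:row_ptr[i+1]]

def csr_matvec(values, col_idx, row_ptr, x):
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):  # O(nnz) total work across all rows
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = values[lo:hi] @ x[col_idx[lo:hi]]
    return y

x = np.array([1., 2., 3., 4.])
y = csr_matvec(values, col_idx, row_ptr, x)
A_dense = np.array([[10, 0, 0, 2], [3, 9, 0, 0],
                    [0, 7, 8, 7], [0, 0, 0, 5]], dtype=float)
print(y)  # [18. 21. 66. 20.], matching A_dense @ x
```

Each row's nonzeros sit in a contiguous slice of `values`/`col_idx`, which is exactly why CSR favors row access and SpMV.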

Relevance to ML#

  • Text and NLP: Document-term matrices are sparse (million words, hundreds per document); sparse SVD enables topic models.

  • Graphs and networks: Social networks, citation graphs, knowledge graphs are sparse; graph neural networks leverage adjacency matrix sparsity.

  • Recommender systems: User-item ratings sparse ($\sim 0.1\%$ observed); matrix completion via sparse factorization.

  • Inverse problems and imaging: Medical imaging, seismic inversion, tomography discretizations are sparse; sparse solvers essential.

  • Large-scale optimization: SGD and proximal methods operate on sparse data; distributed solvers exploit communication sparsity.

Algorithmic development (milestones)#

  • 1960s: Sparse matrix computation recognized as practical necessity in structural engineering (finite element analysis).

  • 1967: Tinney & Walker develop minimum degree ordering; George (1973) extends to nested dissection.

  • 1979: Duff develops efficient symbolic factorization algorithms.

  • 1980s: Supernodal methods developed (Ashcraft, Eisenstat); enable cache-efficient dense kernels within sparse framework.

  • 1990s: PARDISO, SuperLU, UMFPACK emerge as production sparse solvers.

  • 2000s: Algebraic multigrid (AMG) becomes standard preconditioner for sparse iterative methods.

  • 2005–2010: GPU sparse kernels (CUSPARSE) enable acceleration; sparse matrix formats adapt to GPU memory.

  • 2010s: Sparse deep learning (neural networks on graph-structured data); graph convolutional networks (GCN), graph attention networks (GAT) exploit adjacency matrix sparsity.

Definitions#

  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A) \ll n^2$ nonzeros; typically $\text{nnz} = O(n)$ or $O(n \log n)$.

  • Sparsity: $\text{sparsity} = 1 - \frac{\text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity} > 0.99$ is typical for ML.

  • Compressed Sparse Row (CSR): Three arrays: values[nnz], col[nnz], row_ptr[n+1]; row_ptr[i] points to start of row $i$ in values/col.

  • Fill-in: Nonzero entries created during factorization that were zero in original matrix.

  • Reordering: Permutation $P$ such that $P A P^\top$ has lower fill-in; minimum degree, nested dissection are common strategies.

  • Graph of a matrix: Undirected graph $G$ where vertices are matrix indices, edge $(i, j)$ exists iff $A_{ij} \ne 0$.

  • Degree: Vertex $i$ has degree = number of nonzeros in row/column $i$.

  • Laplacian matrix: $L = D - A$ where $D = \text{diag}(d_1, \ldots, d_n)$ is degree matrix, $A$ is adjacency matrix.
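The Laplacian definition above can be checked numerically; a minimal sketch on a small path graph (an illustrative choice), verifying that $L = D - A$ is PSD with eigenvalue $0$ on the constant vector:

```python
import numpy as np

# Path graph on 5 vertices: 0-1-2-3-4
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # graph Laplacian

eigvals = np.linalg.eigvalsh(L)  # ascending order
print("eigenvalues:", np.round(eigvals, 6))
print("smallest eigenvalue ~ 0:", abs(eigvals[0]) < 1e-10)
print("L @ ones = 0:", np.allclose(L @ np.ones(n), 0))  # constant eigenvector
print("PSD:", np.all(eigvals > -1e-10))
```

The multiplicity of the zero eigenvalue equals the number of connected components, which is why the spectrum of $L$ is used for graph partitioning.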

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Sparse matrix representations: CSR, CSC, COO formats; trade-offs between storage and access patterns. Reference: Golub & Van Loan (2013).

  • Fill-in and reordering: Minimum degree, nested dissection reduce fill-in; symbolic factorization predicts nonzero pattern. Reference: George & Liu (1981).

  • SpMV complexity: $O(\text{nnz})$ time, bandwidth-limited on modern hardware. Reference: Williams et al. (2009).

  • Direct sparse solvers: LU with reordering, supernodal methods, pivot selection for stability. Reference: Davis (2006) (SuiteSparse); Schenk & Gärtner (2004) (PARDISO).

  • Iterative sparse solvers: CG, GMRES on Krylov subspaces; preconditioning essential (ILU, IC, AMG). Reference: Saad (2003); Briggs et al. (2000) (multigrid tutorial).

  • Graph Laplacian and spectral methods: Eigenvalues/eigenvectors of $L = D - A$ for clustering, diffusion. Reference: Spielman & Teng (2004).

Applied (landmark systems)#

  • Sparse SVD for dimensionality reduction: ARPACK for large sparse matrices; used in topic modeling (LSA, LDA). Implementation: scikit-learn TruncatedSVD. Reference: Deerwester et al. (1990).

  • Direct sparse solvers (production): PARDISO (Intel MKL), SuperLU (LBNL), UMFPACK (SuiteSparse). Deployed in finite element, circuit simulation. Reference: Schenk & Gärtner (2004); Li (2005).

  • Iterative sparse solvers with preconditioning: GMRES + ILU in PETSc, Trilinos; ALS for matrix completion in Spark MLlib. Reference: Balay et al. (2021) (PETSc); Heroux et al. (2005) (Trilinos).

  • Graph neural networks: PyTorch Geometric adjacency matrix operations; sparse matrix multiplication for graph convolution. Reference: Kipf & Welling (2017).

  • Compressed sensing and sparse recovery: ADMM/FISTA for L1-regularized problems; applications in imaging, signal processing. Reference: Boyd et al. (2011); Beck & Teboulle (2009).

Key ideas: Where it shows up
  1. Text and NLP

    • Document-term matrices $X \in \mathbb{R}^{n \times d}$ with $n$ documents, $d$ vocabulary terms; typically $\text{nnz} \approx n \times$ (words per doc) $\ll nd$.

    • Achievements: sparse SVD/NMF in topic modeling (LSA, LDA); scikit-learn TfidfVectorizer produces sparse matrices. References: Deerwester et al. (1990) (LSA); Blei et al. (2003) (LDA).

  2. Graph neural networks and social networks

    • Adjacency matrix $A \in \mathbb{R}^{n \times n}$ sparse; graph convolution $X' = \sigma(AXW)$ exploits sparsity for $O(\text{nnz} \cdot d)$ per layer vs. $O(n^2 d)$ dense.

    • Achievements: PyTorch Geometric, DGL, Spektral enable billion-node GNNs; GCN (Kipf & Welling 2017), GAT (Velickovic et al. 2018) standard. References: Kipf & Welling (2017); Velickovic et al. (2018).

  3. Recommender systems and matrix completion

    • Rating matrices $R \in \mathbb{R}^{m \times n}$ (users × items) extremely sparse ($\sim 0.01\%$ observed); matrix factorization via alternating least squares.

    • Achievements: Sparse SVD in Netflix prize; ALS in Spark MLlib scales to billions of ratings. References: Koren et al. (2009) (matrix factorization); Zhou et al. (2008) (parallel ALS).

  4. Inverse problems and sparse regularization

    • Measurement matrix $A$ in compressed sensing, tomography often sparse or low-rank. Sparse solutions via L1 regularization (LASSO).

    • Achievements: ADMM (alternating direction method of multipliers) for large-scale L1-regularized problems; FISTA (fast iterative shrinkage-thresholding). References: Boyd et al. (2011) (ADMM); Beck & Teboulle (2009) (FISTA).

  5. Distributed and federated learning

    • Distributed optimization exploits sparsity in communication: gradient compression, local SGD reduce communication. Sparse parameter updates.

    • Achievements: gradient sparsification in federated learning (TensorFlow Federated); communication-efficient distributed SGD (Kairouz et al. 2021). References: Kairouz et al. (2021) (federated learning).
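The graph-convolution update $X' = \sigma(AXW)$ from item 2 can be sketched with SciPy sparse matrices. This simplified layer omits the usual degree normalization of $A$ and uses ReLU (both simplifying assumptions); the point is the cost structure — a sparse aggregate followed by a dense transform:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

n, d_in, d_out = 2000, 16, 8
density = 0.002

# Random sparse adjacency (symmetrized), random features and weights
A = sp.random(n, n, density=density, format='csr', random_state=0)
A = ((A + A.T) > 0).astype(np.float64).tocsr()
X = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))

def gcn_layer(A, X, W):
    # (A @ X) is sparse-times-dense: O(nnz * d_in); the dense transform
    # adds O(n * d_in * d_out). A dense A would cost O(n^2 * d_in) instead.
    return np.maximum(A @ X @ W, 0.0)  # ReLU

H = gcn_layer(A, X, W)
print("output shape:", H.shape)
print("adjacency nnz:", A.nnz, "vs dense n^2 =", n * n)
```

With $\text{nnz} \ll n^2$, the aggregation step is the cheap part — exactly the savings the bullet above quantifies.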

Notation
  • Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A)$ nonzeros; typically $\text{nnz} \ll n^2$.

  • Sparsity: $\text{sparsity}(A) = \frac{n^2 - \text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity}(A) > 0.99$ typical.

  • CSR format: Values v[1..nnz], column indices j[1..nnz], row pointers p[1..n+1] with p[i+1] - p[i] = nonzeros in row $i$.

  • Adjacency matrix: $A \in \{0,1\}^{n \times n}$ where $A_{ij} = 1$ if edge $(i, j)$ exists; $\text{nnz}(A) = 2m$ for $m$ edges (undirected).

  • Degree matrix: $D = \text{diag}(d_1, \ldots, d_n)$ where $d_i = \sum_j A_{ij}$ (degree of vertex $i$).

  • Laplacian: $L = D - A$ (graph Laplacian); $L$ is PSD with $\lambda_1 = 0$ (constant eigenvector).

  • Fill-in ratio: $\text{fill}(L, U) = \frac{\text{nnz}(L) + \text{nnz}(U)}{\text{nnz}(A)}$ (factorization creates nonzeros).

  • Example: Document-term matrix for 100K documents, 1M vocabulary words, ~200 words per doc: $\text{nnz} \approx 100\text{K} \times 200 = 20\text{M}$; density $= 20\text{M} / 10^{11} \approx 0.02\%$.

Pitfalls & sanity checks
  • Never convert large sparse matrix to dense: Memory explosion; dense matrix for $n = 10^6$ requires terabytes.

  • Choose representation based on access pattern: CSR for row access/SpMV, CSC for column access/factorization.

  • Monitor fill-in in factorization: If fill-in ratio $> 100$, consider iterative solver + preconditioner instead.

  • Ensure reordering is applied: Production solvers (PARDISO, SuperLU) apply automatically; verify if using custom code.

  • Test on subset before scaling: Verify correctness on small sparse system before large deployment.

  • GPU sparse kernels have overhead: SpMV on GPU slower than CPU for small matrices; threshold typically $n > 1000$.

  • Preconditioning setup vs. solve trade-off: Single solve: skip preconditioning. Multiple solves with same matrix: preconditioning pays off.

  • Sparsity pattern changes: If $A$ has varying sparsity over iterations, symbolic factorization must be updated.
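The pitfall about matching format to access pattern is easy to check empirically — a minimal timing sketch comparing row extraction from CSR vs. CSC (exact timings are machine-dependent, so only the qualitative gap matters):

```python
import time

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n = 20000
A_csr = sp.random(n, n, density=1e-3, format='csr', random_state=1)
A_csc = A_csr.tocsc()

rows = rng.integers(0, n, 500)

t0 = time.perf_counter()
for i in rows:
    _ = A_csr.getrow(i)  # contiguous slice of values/indices: fast
t_csr = time.perf_counter() - t0

t0 = time.perf_counter()
for i in rows:
    _ = A_csc.getrow(i)  # must gather across every column's segment: slow
t_csc = time.perf_counter() - t0

print(f"row access: CSR {t_csr*1e3:.1f} ms vs CSC {t_csc*1e3:.1f} ms")
```

The symmetric experiment with `getcol` favors CSC, which is why sparse direct solvers typically want CSC input while SpMV-heavy iterative solvers want CSR.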

References

Foundational sparse matrix theory

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Sparse matrix representations and algorithms

  1. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  2. Tinney, W. F., & Walker, J. W. (1967). Direct solutions of sparse network equations by optimally ordered triangular factorization.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Sparse iterative solvers and preconditioning

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Ruge, J. W., & Stüben, K. (1987). Algebraic Multigrid (AMG).

Sparse linear algebra in scientific computing

  1. Balay, S., Abhyankar, S., Adams, M. F., et al. (2021). PETSc Users Manual.

  2. Heroux, M. A., Bartlett, R. A., Howle, V. E., et al. (2005). An overview of the Trilinos project.

  3. Li, X. S. (2005). An overview of SuperLU: Algorithms, implementation, and user interface.

Topic modeling and sparse SVD

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

  3. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

Graph neural networks and sparse operations

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Recommender systems and matrix completion

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

  2. Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix prize.

Optimization and sparse recovery

  1. Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers.

  2. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems.

  3. Kairouz, P., McMahan, H. B., Avent, B., et al. (2021). Advances and Open Problems in Federated Learning.

Five worked examples

Worked Example 1: Sparse matrix representations and storage efficiency#

Introduction#

Construct a sparse matrix (adjacency matrix of a large graph); compare storage and computation time across dense, CSR, and COO formats; demonstrate memory/time trade-offs.

Purpose#

Illustrate practical efficiency gains of sparse formats; motivate algorithm selection based on access patterns.

Importance#

Foundation for understanding when sparse methods are worthwhile; shows memory savings enable billion-scale problems.

What this example demonstrates#

  • Generate sparse graph (e.g., scale-free network or grid).

  • Store as dense array, then convert to CSR, CSC, COO.

  • Measure memory footprint for each format.

  • Benchmark matrix-vector product on dense vs. sparse.

  • Plot time vs. sparsity level.

Background#

Dense storage is simple but wasteful for sparse matrices. Sparse formats trade memory for computational complexity in accessing elements.

Historical context#

Sparse matrix formats developed as computers gained memory constraints in 1960s–1970s (finite element analysis).

History#

CSR became the de facto standard; GPU support was added via NVIDIA's cuSPARSE library.

Prevalence in ML#

Every large-scale ML system (Spark, Hadoop, TensorFlow) supports sparse matrices; default for text, graphs.

Notes#

  • CSR efficient for row access, SpMV; CSC efficient for column access, matrix-matrix products.

  • COO flexible for construction; convert to CSR/CSC before computation.

  • GPU memory bandwidth is bottleneck; careful format choice crucial.

Connection to ML#

Text vectorization produces sparse matrices; graphs produce sparse adjacency matrices. System must efficiently handle both.

Connection to Linear Algebra Theory#

Sparse matrix properties (connectivity, degree distribution) determine factorization cost and algorithm performance.

Pedagogical Significance#

Demonstrates performance trade-offs; motivates specialized algorithms.

References#

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

  3. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(30)

# Create sparse adjacency matrix (random graph)
n = 10000
density = 0.001  # 0.1% sparse
nnz = int(density * n * n)

# Generate random sparse matrix
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
data = np.ones(nnz)

# CSR format (efficient for SpMV)
A_csr = sp.csr_matrix((data, (rows, cols)), shape=(n, n))

# CSC format (efficient for column access)
A_csc = A_csr.tocsc()

# COO format (flexible)
A_coo = sp.coo_matrix((data, (rows, cols)), shape=(n, n))

# Dense format
A_dense = A_csr.toarray()

# Measure memory (sum of the three CSR array buffers)
mem_csr = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
mem_dense = A_dense.nbytes

print("Memory comparison for {}x{} matrix (density {:.4f}%)".format(n, n, 100*density))
print("  Dense: {:.2f} MB".format(mem_dense / 1e6))
print("  CSR: {:.2f} MB".format(mem_csr / 1e6))
print("  Savings: {:.1f}x".format(mem_dense / mem_csr))

# Benchmark matrix-vector product
x = np.random.randn(n)
iterations = 100

# Sparse SpMV
t0 = time.time()
for _ in range(iterations):
    y_sparse = A_csr @ x
t_sparse = time.time() - t0

# Dense gemv
t0 = time.time()
for _ in range(iterations):
    y_dense = A_dense @ x
t_dense = time.time() - t0

print("\nMatrix-vector product time ({} iterations):".format(iterations))
print("  Sparse: {:.4f} s".format(t_sparse))
print("  Dense: {:.4f} s".format(t_dense))
print("  Speedup: {:.1f}x".format(t_dense / t_sparse))

print("\nSparsity level: {:.1f}%".format(100 * A_csr.nnz / (n*n)))
print("NNZ: {}".format(A_csr.nnz))

Worked Example 2: Fill-in and reordering in sparse LU factorization#

Introduction#

Construct a sparse matrix; compute LU factorization without reordering and with minimum-degree reordering; compare fill-in and factorization time.

Purpose#

Demonstrate how reordering dramatically reduces fill-in and speeds factorization.

Importance#

Critical for practical sparse solvers; shows that problem structure matters.

What this example demonstrates#

  • Construct sparse system (banded, block-structured, or random).

  • Perform symbolic LU factorization (predict nonzero pattern).

  • Compute LU without reordering; measure fill-in ratio.

  • Apply minimum-degree reordering; compute LU again.

  • Compare factorization time and fill-in.

Background#

Fill-in (new nonzeros created during factorization) can destroy sparsity; reordering (permutation of rows/columns) reduces fill-in.

Historical context#

Tinney & Walker (1967) minimum degree; George (1973) nested dissection; standard in SPARSE BLAS era.

History#

All production sparse solvers (PARDISO, SuperLU, UMFPACK) include automatic reordering.

Prevalence in ML#

Used in sparse inverse problems, finite element discretizations; less visible in ML but critical for scalability.

Notes#

  • Minimum degree is heuristic; nested dissection is theoretically superior but more complex.

  • Fill-in ratio typically 5–100× for unordered; reordering reduces to 1–5×.

Connection to ML#

Sparse solve is bottleneck in iterative methods (Newton, interior-point algorithms). Reordering accelerates optimization.

Connection to Linear Algebra Theory#

Graph reachability (matrix powers) predicts fill-in; reduction is graph-theoretic problem.

Pedagogical Significance#

Shows interplay between linear algebra and graph algorithms.

References#

  1. George, J. A., & Liu, J. W. H. (1981). Computer Solution of Large Sparse Positive Definite Systems.

  2. Davis, T. A. (2006). Direct Methods for Sparse Linear Systems.

  3. Schenk, O., & Gärtner, K. (2004). PARDISO: A high-performance serial and parallel sparse linear solver.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
import time

np.random.seed(31)

# Create sparse banded system (mimics discretized PDE)
n = 1000
A_banded = sp.diags([np.ones(n-1), -2*np.ones(n), np.ones(n-1)],
                     [-1, 0, 1], shape=(n, n), format='csr')
# Add random entries to create irregular pattern
rows = np.random.choice(n, 50)
cols = np.random.choice(n, 50)
A_banded = A_banded + 0.01 * sp.csr_matrix((np.ones(50), (rows, cols)), shape=(n, n))
A_banded.eliminate_zeros()

# Convert to CSC for LU (standard for sparse direct solvers)
A_csc = A_banded.tocsc()

# LU without reordering
print("LU factorization comparison (n={})".format(n))
print("\nNo reordering:")
t0 = time.time()
LU_no_reorder = spla.splu(A_csc, permc_spec='NATURAL')
t_no_reorder = time.time() - t0
nnz_no_reorder = LU_no_reorder.L.nnz + LU_no_reorder.U.nnz

# LU with COLAMD reordering (minimum degree heuristic)
print("\nWith COLAMD reordering:")
t0 = time.time()
LU_reorder = spla.splu(A_csc, permc_spec='COLAMD')
t_reorder = time.time() - t0
nnz_reorder = LU_reorder.L.nnz + LU_reorder.U.nnz

print("\nResults:")
print("  Original NNZ: {}".format(A_csc.nnz))
print("  No reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_no_reorder, nnz_no_reorder / A_csc.nnz, t_no_reorder))
print("  With reorder - NNZ: {}, fill-in ratio: {:.2f}, time: {:.4f} s".format(
    nnz_reorder, nnz_reorder / A_csc.nnz, t_reorder))
print("  Fill-in reduction: {:.1f}x".format(nnz_no_reorder / nnz_reorder))
print("  Time improvement: {:.1f}x".format(t_no_reorder / t_reorder))

Worked Example 3: Sparse SVD for topic modeling (LSA)#

Introduction#

Apply truncated SVD to sparse document-term matrix; extract latent topics; compare sparse vs. dense SVD efficiency.

Purpose#

Show practical use of sparse linear algebra in NLP; demonstrate dimensionality reduction on real-world sparse data.

Importance#

Sparse SVD is standard in topic modeling; ARPACK enables scalability to billions of words.

What this example demonstrates#

  • Generate synthetic document-term matrix (sparse, realistic sparsity ~0.01%).

  • Compute truncated SVD via ARPACK (sparse).

  • Compute dense SVD for comparison (infeasible for large $n$).

  • Extract topics (top words per topic).

  • Measure reconstruction error vs. rank.

Background#

LSA (latent semantic analysis) applies SVD to tf-idf matrix; low-rank factors reveal latent topics.

Historical context#

Deerwester et al. (1990) pioneered LSA; ARPACK (Lehoucq et al. 1998) made scalable SVD practical.

History#

Standard in topic modeling pipelines; scikit-learn TruncatedSVD widely used.

Prevalence in ML#

Textmining, information retrieval, recommendation systems routinely use sparse SVD.

Notes#

  • Truncated SVD (rank $k \ll n$) computes only $k$ largest singular values; much cheaper than full SVD.

  • Sparse matrix multiplication dominates cost; $O(k \cdot \text{nnz})$ per iteration.

Connection to ML#

Topics are interpretable; document-topic and topic-word matrices enable downstream tasks.

Connection to Linear Algebra Theory#

ARPACK uses Krylov subspaces and implicitly restarted Arnoldi; exploits SpMV efficiency.

Pedagogical Significance#

Demonstrates power of combining sparse formats with specialized algorithms (Krylov methods).

References#

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

  2. Lehoucq, R. B., Sorensen, D. C., & Yang, C. (1998). ARPACK Users’ Guide.

  3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import time

np.random.seed(32)

# Synthetic document-term matrix (sparse)
n_docs, n_terms = 10000, 50000
nnz_per_doc = 200

# Generate sparse matrix (realistic sparsity)
data = []
row_idx = []
col_idx = []
for doc in range(n_docs):
    terms = np.random.choice(n_terms, nnz_per_doc, replace=False)
    counts = np.random.poisson(2, nnz_per_doc) + 1
    data.extend(counts)
    row_idx.extend([doc] * nnz_per_doc)
    col_idx.extend(terms)

X = sp.csr_matrix((data, (row_idx, col_idx)), shape=(n_docs, n_terms))

# Normalize (tf-idf variant)
X = X.astype(np.float32)
row_sums = np.array(X.sum(axis=1)).flatten()
X.data /= row_sums[X.nonzero()[0]]

print("Document-term matrix: {}x{}, sparsity: {:.4f}%".format(
    n_docs, n_terms, 100*(1 - X.nnz/(n_docs*n_terms))))

# Sparse SVD (ARPACK)
k = 50  # Number of topics
print("\nTruncated SVD (rank {})...".format(k))
t0 = time.time()
U_sparse, s_sparse, Vt_sparse = svds(X, k=k, which='LM')
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense SVD (for small subset for illustration)
print("\nFull dense SVD (on 100x1000 subset for illustration)...")
X_dense_small = X[:100, :1000].toarray()
t0 = time.time()
U_dense, s_dense, Vt_dense = np.linalg.svd(X_dense_small, full_matrices=False)
t_dense = time.time() - t0
print("  Time (100x1000): {:.4f} s".format(t_dense))
print("  Dense SVD on the full 10000x50000 matrix would need GBs of dense storage and vastly more time")

# Extract topics (top terms per topic)
print("\nTop 5 terms per topic (first 3 topics):")
for topic_idx in range(min(3, k)):
    top_terms = np.argsort(Vt_sparse[topic_idx])[-5:][::-1]
    print("  Topic {}: terms {}".format(topic_idx, top_terms))

# Reconstruction error vs. rank (evaluated on the first 100 rows to avoid
# forming the full dense n_docs x n_terms reconstruction)
X_head = X[:100, :].toarray()
rank_list = [5, 10, 20, 50]
print("\nReconstruction error vs. rank:")
for r in rank_list:
    # svds returns singular values in ascending order: take the last r
    U_r = U_sparse[:100, -r:]
    s_r = s_sparse[-r:]
    Vt_r = Vt_sparse[-r:, :]
    X_recon_head = U_r @ np.diag(s_r) @ Vt_r
    err = np.linalg.norm(X_head - X_recon_head) / np.linalg.norm(X_head)
    print("  Rank {}: relative error {:.4f}".format(r, err))

Worked Example 4: Preconditioned iterative solver on sparse system#

Introduction#

Construct a large sparse SPD system (from discretized PDE); solve via unpreconditioned and preconditioned CG; demonstrate preconditioning impact.

Purpose#

Show how preconditioning enables convergence on sparse systems where direct solvers would be expensive.

Importance#

Practical approach for million-scale sparse systems; avoids expensive fill-in of direct LU.

What this example demonstrates#

  • Construct sparse Laplacian or discretized diffusion matrix.

  • Solve unpreconditioned CG; count iterations.

  • Construct ILU or AMG preconditioner.

  • Solve preconditioned CG; count iterations.

  • Compare iteration counts and wall-clock time.

Background#

Iterative solvers avoid factorization fill-in; preconditioning clusters spectrum for fast convergence.

Historical context#

Preconditioned CG developed 1970s–1980s; ILU standard; AMG emerged 1980s–1990s.

History#

Standard in PETSc, Trilinos, deal.II (finite element libraries); production code for billion-node problems.

Prevalence in ML#

Used in large-scale optimization (Newton, interior-point), graph algorithms (diffusion on networks).

Notes#

  • Preconditioning setup cost amortized over many solves.

  • ILU relatively cheap; AMG more expensive but better convergence for difficult problems.

Connection to ML#

Large-scale optimization (e.g., logistic regression on massive sparse data) requires preconditioned iterative methods.

Connection to Linear Algebra Theory#

CG iteration count $\approx \sqrt{\kappa}$; preconditioning reduces $\kappa$ to enable convergence.

Pedagogical Significance#

Demonstrates integration of sparsity, preconditioning, and iterative methods.

References#

  1. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  2. Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A Multigrid Tutorial (2nd ed.).

  3. Barrett, R., Berry, M., Chan, T. F., et al. (1994). Templates for the Solution of Linear Systems.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

np.random.seed(33)

# Create sparse Laplacian (discretized 2D Poisson, 100x100 grid)
m = 100
n = m * m  # 10000 unknowns
# 5-point stencil: 4 on the diagonal, -1 for each of the four grid neighbors
main = 4 * np.ones(n)
side = -np.ones(n - 1)
side[np.arange(1, n) % m == 0] = 0  # no coupling across grid-row boundaries
updown = -np.ones(n - m)
A = sp.diags([main, side, side, updown, updown],
             [0, -1, 1, -m, m], format="csr")

# RHS
b = np.random.randn(n)

print("Sparse Laplacian system: {}x{}, nnz={}".format(n, n, A.nnz))
print("Condition number (estimated): ~4 m^2 / pi^2 = {:.0f}".format(4 * m**2 / np.pi**2))

# Unpreconditioned CG
print("\nUnpreconditioned CG:")
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))

x_un, info_un = spla.cg(A, b, rtol=1e-6, callback=callback_un, maxiter=n)  # use tol= on SciPy < 1.12
print("  Iterations: {}".format(len(residuals_unprecond)))

# ILU preconditioner
print("\nBuilding ILU preconditioner...")
M_ilu = spla.spilu(A.tocsc())
M = spla.LinearOperator((n, n), matvec=M_ilu.solve)

print("Preconditioned CG:")
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))

x_pre, info_pre = spla.cg(A, b, M=M, rtol=1e-6, callback=callback_pre, maxiter=n)  # use tol= on SciPy < 1.12
print("  Iterations: {}".format(len(residuals_precond)))

print("\nSummary:")
print("  Unpreconditioned iterations: {}".format(len(residuals_unprecond)))
print("  Preconditioned iterations: {}".format(len(residuals_precond)))
print("  Speedup: {:.1f}x".format(len(residuals_unprecond) / len(residuals_precond)))
print("  Both converged:", info_un == 0 and info_pre == 0)

Worked Example 5: Graph neural network convolution on sparse adjacency matrix#

Introduction#

Construct a graph (sparse adjacency matrix); implement graph convolution layer; measure SpMM efficiency vs. dense multiplication.

Purpose#

Demonstrate sparse matrix operations in modern ML (graph neural networks).

Importance#

GNNs process billion-node graphs; sparsity is essential for feasibility.

What this example demonstrates#

  • Generate graph (scale-free, grid, or real network).

  • Build node feature matrix $X \in \mathbb{R}^{n \times d}$.

  • Implement sparse graph convolution: $X' = \sigma(A X W)$ with SpMM.

  • Compare sparse vs. dense GCN layer time/memory.

  • Verify output correctness.

Background#

Graph convolution aggregates features from neighbors via sparse $A$ multiplication.

Historical context#

Kipf & Welling (2017) GCN; PyTorch Geometric (2019) enables efficient sparse operations.

History#

Standard in graph learning frameworks (DGL, Spektral, PyTorch Geometric).

Prevalence in ML#

GNNs on social networks, citation graphs, molecular graphs; sparse adjacency matrix standard.

Notes#

  • SpMM (sparse-dense matrix product) is key operation; GPU acceleration available.

  • For $n$-node graph with $m$ edges and $d$ features: sparse is $O(m \cdot d)$, dense is $O(n^2 \cdot d)$.
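Plugging illustrative (hypothetical) sizes into these counts shows why sparsity is non-negotiable at scale:

```python
# Operation counts for one graph aggregation A @ X with hypothetical sizes
# (n nodes, m edges, d features -- numbers are illustrative only)
n = 10**6
m = 10**7
d = 64

sparse_flops = 2 * m * d      # SpMM: one multiply-add per (edge, feature) pair
dense_flops = 2 * n * n * d   # GEMM on the dense n x n adjacency

print("sparse SpMM FLOPs: {:.2e}".format(float(sparse_flops)))
print("dense GEMM FLOPs : {:.2e}".format(float(dense_flops)))
print("dense / sparse   : {:.0f}x".format(dense_flops / sparse_flops))
```

The ratio is $n^2/m$: five orders of magnitude for these sizes, before memory is even considered.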

Connection to ML#

GNNs enable learning on structured, graph-based data; scalable to billion-node networks via sparsity.

Connection to Linear Algebra Theory#

Graph convolution is polynomial filter on graph Laplacian; spectral perspective connects to Chebyshev polynomials.
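In symbols (notation from the Chebyshev filter formulation of Defferrard et al. 2016, the precursor to the Kipf–Welling GCN; $T_k$ are Chebyshev polynomials and $\tilde{L}$ the Laplacian rescaled to $[-1, 1]$):

```latex
g_\theta(L)\, x \;\approx\; \sum_{k=0}^{K} \theta_k \, T_k(\tilde{L})\, x,
\qquad \tilde{L} = \frac{2}{\lambda_{\max}} L - I
```

A GCN layer is the $K=1$ truncation of this filter with tied parameters, which is why a single SpMM per layer suffices.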

Pedagogical Significance#

Integrates sparse linear algebra, graph theory, and modern deep learning.

References#

  1. Kipf, T., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks.

  2. Velickovic, P., Cucurull, G., Casanova, A., et al. (2018). Graph Attention Networks.

  3. Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import time

np.random.seed(34)

# Generate a random graph (uniform edge endpoints; not truly scale-free)
n = 10000  # Nodes
avg_degree = 10
edges = []
for i in range(n):
    # Connect each node to avg_degree uniformly random targets
    targets = np.random.choice(n, min(avg_degree, n-1), replace=False)
    for j in targets:
        if i != j:
            edges.append((i, j))
            edges.append((j, i))  # Undirected

edges = np.array(edges)
A = sp.csr_matrix((np.ones(len(edges)), edges.T), shape=(n, n))
A.data[:] = 1.0  # binarize: duplicate (i, j) entries were summed during construction

# Node features
d = 64  # Feature dimension
X = np.random.randn(n, d).astype(np.float32)

# Weight matrix (GCN parameters)
W = np.random.randn(d, d).astype(np.float32)

print("Graph: n={}, edges={}, sparsity={:.4f}%".format(
    n, A.nnz // 2, 100*(1 - A.nnz/(n*n))))
print("Features: n={}, d={}".format(n, d))

# Sparse graph convolution: X_out = A @ X @ W
print("\nSparse graph convolution:")
t0 = time.time()
X_intermediate = A @ X  # SpMM: sparse @ dense
X_out_sparse = X_intermediate @ W  # Dense @ dense
t_sparse = time.time() - t0
print("  Time: {:.4f} s".format(t_sparse))

# Dense graph convolution (for comparison on subset)
print("\nDense graph convolution (on 1000-node subgraph):")
A_dense_sub = A[:1000, :1000].toarray()
X_sub = X[:1000]
t0 = time.time()
X_out_dense_sub = A_dense_sub @ X_sub @ W
t_dense_sub = time.time() - t0
print("  Time (1000 nodes): {:.4f} s".format(t_dense_sub))
# Extrapolate: the dense A @ X cost grows quadratically with node count
t_dense_extrapolated = t_dense_sub * (n/1000)**2
print("  Extrapolated to full graph: {:.2f} s".format(t_dense_extrapolated))

print("\nSpeedup (dense vs. sparse): {:.1f}x".format(t_dense_extrapolated / t_sparse))

# Apply nonlinearity (ReLU)
X_out_sparse = np.maximum(X_out_sparse, 0)

print("\nGCN layer output shape:", X_out_sparse.shape)
print("Output mean:", np.mean(X_out_sparse))
print("Output sparsity after ReLU:", 100 * np.sum(X_out_sparse == 0) / X_out_sparse.size, "%")

Chapter 11
Principal Component Analysis
Key ideas: Introduction

Introduction#

PCA seeks a low-dimensional projection that captures the most variance. Geometrically, it rotates data so axes align with directions of maximal spread. Algebraically, it solves the optimization $\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$, yielding the top eigenvector of $X_c^\top X_c$. Successive components are orthogonal and capture diminishing variance.

Important ideas#

  1. Covariance matrix

    • $\Sigma = \tfrac{1}{n}X_c^\top X_c$ with $X_c$ centered. Eigenvalues $\lambda_i$ are variances along principal directions.

  2. Principal components (eigenvectors)

    • Columns of $V$ from SVD (or eigenvectors of $\Sigma$) form an orthonormal basis ordered by variance.

  3. Explained variance ratio

    • EVR = $\tfrac{\lambda_i}{\sum_j \lambda_j}$ quantifies how much total variance component $i$ explains; cumulative EVR guides dimensionality choice.

  4. Scores and loadings

    • Scores: $Z = X_c V$ (projections onto components); loadings: $V$ (directions in original space).

  5. Reconstruction and truncation

    • Truncated PCA: keep $k$ components; $\tilde{X}_c = Z_k V_k^\top$ minimizes squared error (Eckart–Young).

  6. Standardization and scaling

    • Standardize to unit variance before PCA if variables have different scales; otherwise leading component may be dominated by high-variance features.

  7. Whitening

    • Transform to unit variance: $Z_w = Z \Lambda^{-1/2}$ decorrelates and rescales for downstream algorithms (e.g., RBF kernels).

Relevance to ML#

  • Dimensionality reduction: speeds training, avoids overfitting, improves generalization.

  • Visualization: 2D/3D projection of high-dimensional data for exploration.

  • Preprocessing: removes noise, aligns scales, improves conditioning of solvers.

  • Feature extraction: learned components as features for downstream classifiers.

  • Denoising: truncated PCA removes low-variance (noisy) directions.

  • Whitening: standardizes correlation structure, crucial for many algorithms (kernels, distance-based methods).

Algorithmic development (milestones)#

  • 1901: Pearson introduces lines/planes of closest fit (geometric intuition).

  • 1933: Hotelling formalizes PCA as eigen-decomposition of covariance.

  • 1950s–1960s: Computational advances (QR, Jacobi methods) enable practical PCA.

  • 1997: Probabilistic PCA (Tipping–Bishop) bridges PCA and Gaussian latent variable models.

  • 1997–2010s: Kernel PCA (Schölkopf et al.) and sparse PCA emerge for nonlinear and interpretable variants.

  • 2011: Randomized PCA for large-scale data (Halko–Martinsson–Tropp).

  • 2010s: PCA integrated into deep learning (autoencoders, PCA layers, spectral initialization).

Definitions#

  • Centered data: $X_c = X - \bar{X}$ with $\bar{X} = \tfrac{1}{n}\mathbf{1}^\top X$ (vector of column means).

  • Covariance matrix: $\Sigma = \tfrac{1}{n}X_c^\top X_c \in \mathbb{R}^{d\times d}$ (PSD).

  • Principal components: eigenvectors of $\Sigma$, ordered by eigenvalue magnitude.

  • Variance explained by component $i$: $\lambda_i / \operatorname{tr}(\Sigma)$.

  • Whitened data: $Z_w = X_c V \Lambda^{-1/2}$ with $\Lambda$ diagonal eigenvalue matrix.

  • Reconstructed data: $\tilde{X}_c = X_c V_k V_k^\top$ using rank-$k$ approximation.
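The definitions above can be checked numerically in a few lines (a minimal sketch on synthetic data; sizes and the random mixing matrix are arbitrary choices):

```python
import numpy as np

np.random.seed(7)
n, d = 300, 6
X = np.random.randn(n, d) @ np.random.randn(d, d)  # correlated synthetic data

# Centered data and covariance (1/n convention used throughout this chapter)
Xc = X - X.mean(axis=0, keepdims=True)
Sigma = (Xc.T @ Xc) / n

# Eigendecomposition, sorted by decreasing eigenvalue
evals, V = np.linalg.eigh(Sigma)
evals, V = evals[::-1], V[:, ::-1]

# Whitened data: Z_w = Xc V Lambda^{-1/2}
Zw = Xc @ V @ np.diag(1.0 / np.sqrt(evals))

# Rank-k reconstruction: Xc V_k V_k^T
k = 3
X_rec = Xc @ V[:, :k] @ V[:, :k].T

print("whitened covariance is identity:", np.allclose((Zw.T @ Zw) / n, np.eye(d)))
print("eigenvalues sum to tr(Sigma):", np.isclose(evals.sum(), np.trace(Sigma)))
```

Both checks print True: whitening makes the sample covariance exactly the identity, and the eigenvalues partition the total variance.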

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Covariance matrix and its PSD structure (Chapter 09).

  • Eigen-decomposition of symmetric covariance matrix.

  • Variational characterization: $\arg\max_u \lVert X_c u \rVert^2$ subject to $\lVert u \rVert=1$ yields top eigenvector.

  • Eckart–Young–Mirsky low-rank approximation error (Chapter 10).

  • Relation to SVD: PCA via SVD of centered data (Chapter 10).

  • Standardization and scaling effects on covariance eigenvalues.
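A quick numerical check of the variational characterization (a sketch; the random-direction search is purely illustrative, not an algorithm):

```python
import numpy as np

np.random.seed(11)
n, d = 500, 8
# Anisotropic data: variance decreases from the first axis to the last
Xc = np.random.randn(n, d) @ np.diag(np.linspace(3.0, 0.5, d))
Xc = Xc - Xc.mean(axis=0, keepdims=True)
Sigma = (Xc.T @ Xc) / n

evals, V = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
u_top = V[:, -1]                   # eigenvector of the largest eigenvalue

def captured_variance(u):
    """||Xc u||^2 / n for a unit direction u: the PCA objective."""
    u = u / np.linalg.norm(u)
    return np.linalg.norm(Xc @ u) ** 2 / n

best = captured_variance(u_top)
rand = max(captured_variance(np.random.randn(d)) for _ in range(1000))
print("objective at top eigenvector :", round(best, 4))
print("best of 1000 random unit dirs:", round(rand, 4))
```

The top eigenvector attains the maximum (its objective equals $\lambda_{\max}$); no random direction exceeds it.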

Applied (landmark systems)#

  • Dimensionality reduction (Jolliffe 2002; Hastie et al. 2009).

  • Whitening for deep learning (LeCun et al. 1998; Krizhevsky et al. 2012).

  • Probabilistic PCA and latent variable models (Tipping & Bishop 1997).

  • Kernel PCA for nonlinear reduction (Schölkopf et al. 1998).

  • Randomized PCA for large scale (Halko–Martinsson–Tropp 2011).

  • Matrix completion via truncated SVD (Candès & Tao 2010).

Key ideas: Where it shows up
  1. Dimensionality reduction and preprocessing

  • Removes redundant features; improves stability of downstream solvers. Achievements: widely used in computer vision (image preprocessing), bioinformatics (gene expression), and AutoML pipelines. References: Jolliffe 2002.

  2. Visualization and exploratory data analysis

  • Project to 2D/3D for interactive inspection and cluster discovery. Achievements: industry standard in data exploration tools (Pandas, R, Plotly). References: Hastie et al. 2009.

  3. Whitening and decorrelation

  • Standardizes feature covariance to identity; improves kernel methods and RBF networks. Achievements: standard preprocessing in deep learning frameworks. References: LeCun et al. 1998 (early deep learning); Krizhevsky et al. 2012 (ImageNet AlexNet).

  4. Denoising and matrix completion

  • Truncated PCA recovers low-rank structure from noisy observations. Achievements: used in image inpainting and recommendation cold-start. References: Candès & Tao 2010 (matrix completion); Pearson 1901 (geometric intuition).

  5. Feature extraction and representation learning

  • Learned components become features for classifiers; precursor to autoencoders. Achievements: basis for deep autoencoders and VAEs. References: Hinton & Salakhutdinov 2006 (deep learning via autoencoders).

Notation
  • Data: $X \in \mathbb{R}^{n\times d}$; centered $X_c = X - \bar{X}$.

  • Covariance: $\Sigma = \tfrac{1}{n}X_c^\top X_c$.

  • Eigendecomposition: $\Sigma = V \Lambda V^\top$ with $\Lambda$ diagonal.

  • Principal components: columns of $V$; eigenvalues $\lambda_i$ are variances.

  • Scores (projections): $Z = X_c V \in \mathbb{R}^{n\times d}$ or truncated $Z_k = X_c V_k$.

  • Explained variance ratio: $\text{EVR}_i = \tfrac{\lambda_i}{\sum_j \lambda_j}$.

  • Standardized data: $X_s = X_c / \sigma$ (each column divided by its standard deviation).

  • Whitened data: $Z_w = Z \Lambda^{-1/2} = X_c V \Lambda^{-1/2}$.

  • Example: If $X$ is $100 \times 50$ with 2 dominant eigenvalues $\lambda_1=8, \lambda_2=3, \sum_j \lambda_j=12$, then $\text{EVR}_1 \approx 0.67, \text{EVR}_2 \approx 0.25$; keep 2 components to explain $92\%$ of variance.

Pitfalls & sanity checks
  • Always center data; forgetting this is a common error.

  • Standardize features if they have different scales; otherwise PCA is dominated by high-variance features.

  • Sign ambiguity: eigenvectors are unique up to sign; do not compare raw signs across methods.

  • Negative eigenvalues: should not occur for a PSD covariance matrix; tiny negative values indicate numerical round-off or a centering issue.

  • Reconstruction: verify $\lVert X_c - X_c V_k V_k^\top \rVert_F$ equals tail variance for sanity check.

  • Number of components: do not blindly choose $k$; use scree plot, cumulative variance, or cross-validation.
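The first pitfall is easy to demonstrate (a minimal sketch with a hypothetical mean offset along one axis):

```python
import numpy as np

np.random.seed(5)
n = 500
# Largest spread along axis 1, but a big mean offset along axis 0
X = np.random.randn(n, 3) * np.array([1.0, 3.0, 0.5]) + np.array([50.0, 0.0, 0.0])

def top_component(M):
    """Leading eigenvector of M^T M / n (no centering performed here)."""
    _, V = np.linalg.eigh(M.T @ M / len(M))
    return V[:, -1]

v_centered = top_component(X - X.mean(axis=0, keepdims=True))
v_uncentered = top_component(X)  # the common mistake: skipping centering

# Centered PCA finds the direction of spread (axis 1);
# uncentered "PCA" points at the mean (axis 0) instead
print("centered   |v|:", np.round(np.abs(v_centered), 3))
print("uncentered |v|:", np.round(np.abs(v_uncentered), 3))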

References

Foundational work

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.

Classical theory and methods

  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.).

  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  5. Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.

Numerical algorithms

  6. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  7. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

  8. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Finding structure with randomness.

Extensions and applications

  9. Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Kernel principal component analysis.

  10. Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis.

  11. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition.

  12. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  13. Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion.

  14. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.

Five worked examples

Worked Example 1: Computing PCA via eigen-decomposition and interpreting variance#

Introduction#

Compute PCA on a synthetic dataset via covariance eigendecomposition; examine explained variance and principal directions.

Purpose#

Build intuition for how eigenvalues quantify variance along principal axes.

Importance#

Diagnostics guide choice of number of components for downstream tasks.

What this example demonstrates#

  • Center data; compute covariance matrix.

  • Compute eigendecomposition of covariance.

  • Report eigenvalues, cumulative explained variance ratio, and principal component directions.

Background#

Hotelling (1933) formalized PCA as eigen-decomposition of covariance.

Historical context#

Foundational work in multivariate statistics; adapted widely in ML.

History#

Standard in all major statistical/ML libraries (R, Python, MATLAB).

Prevalence in ML#

Used routinely in preprocessing and exploratory analysis.

Notes#

  • Ensure data is centered; if not, covariance is inaccurate.

  • Eigenvalues are variances; eigenvectors are directions.

Connection to ML#

Explains what PCA extracts and why truncation works.

Connection to Linear Algebra Theory#

Covariance eigen-decomposition is the Rayleigh quotient optimization.

Pedagogical Significance#

Links variance to eigenvalues concretely.

References#

  1. Hotelling, H. (1933). Analysis of complex statistical variables into principal components.

  2. Jolliffe, I. T. (2002). Principal Component Analysis.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)
evals = evals[::-1]
evecs = evecs[:, ::-1]

cumsum_var = np.cumsum(evals) / evals.sum()

print("eigenvalues:", np.round(evals, 4))
print("explained variance ratio:", np.round(evals / evals.sum(), 4))
print("cumulative EVR (k=1,2,3):", np.round(cumsum_var[:3], 4))

Worked Example 2: PCA via SVD (numerically stable)#

Introduction#

Compute PCA using SVD of centered data instead of forming covariance matrix explicitly.

Purpose#

Show how SVD avoids squaring condition number; more numerically stable for ill-conditioned data.

Importance#

Standard in practice; avoids explicit covariance computation and is faster for tall data.

What this example demonstrates#

  • SVD of $X_c / \sqrt{n}$ yields principal components (columns of $V$) and singular values.

  • Squared singular values equal eigenvalues of $X_c^\top X_c / n$.

Background#

SVD is numerically more stable than eigen-decomposition of $X_c^\top X_c$.
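The classic illustration is a Läuchli-type matrix, where forming the Gram matrix squares the condition number and loses the small singular value entirely (a minimal sketch; $\varepsilon$ is chosen so $\varepsilon^2$ rounds away against 1 in float64):

```python
import numpy as np

eps = 1e-8  # eps**2 = 1e-16 is below float64 resolution relative to 1
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

# Route 1: SVD of X directly -- both singular values are resolved
s_svd = np.linalg.svd(X, compute_uv=False)

# Route 2: eigen-decomposition of X^T X -- the Gram matrix rounds to
# [[1, 1], [1, 1]], so the small singular value is lost
evals = np.linalg.eigvalsh(X.T @ X)[::-1]
s_eig = np.sqrt(np.maximum(evals, 0.0))

print("singular values via SVD      :", s_svd)
print("singular values via X^T X eig:", s_eig)
```

The SVD route recovers the second singular value near $10^{-8}$; the Gram-matrix route returns essentially zero for it.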

Historical context#

Popularized in numerical linear algebra as the default PCA route.

History#

Standard in scikit-learn PCA class; uses SVD internally.

Prevalence in ML#

Default in modern PCA implementations.

Notes#

  • Use full_matrices=False for efficiency when $n \gg d$.

  • For the SVD of the unscaled $X_c$, singular values relate to eigenvalues via $\lambda_i = (s_i / \sqrt{n})^2$; scaling the data by $1/\sqrt{n}$ first, as in the solution, gives $\lambda_i = s_i^2$.

Connection to ML#

More robust to numerical issues; faster for large $n$.

Connection to Linear Algebra Theory#

SVD of $X_c$ relates directly to covariance eigen-structure.

Pedagogical Significance#

Bridges SVD (Chapter 10) and PCA practically.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Trefethen & Bau (1997). Numerical Linear Algebra.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([5.0, 3.0, 1.5, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
evals_from_svd = s ** 2

Sigma = (Xc.T @ Xc) / n
evals_from_eig = np.linalg.eigvalsh(Sigma)[::-1]

print("eigenvalues from eig:", np.round(evals_from_eig, 6))
print("eigenvalues from SVD:", np.round(evals_from_svd, 6))
print("difference:", np.linalg.norm(evals_from_eig - evals_from_svd))

Worked Example 3: Dimensionality reduction and reconstruction error#

Introduction#

Demonstrate PCA truncation to $k$ components; compare reconstruction error to variance lost.

Purpose#

Show how truncated PCA minimizes squared error (Eckart–Young); guide choice of $k$.

Importance#

Core to deciding how many components to keep in applications.

What this example demonstrates#

  • Compute full PCA; truncate to $k$ components.

  • Reconstruct and compute Frobenius error.

  • Verify error matches variance in discarded components.

Background#

Eckart–Young–Mirsky theorem guarantees optimality of rank-$k$ truncation.

Historical context#

Theoretical guarantee for best low-rank approximation.

History#

Used in all dimensionality reduction and compression workflows.

Prevalence in ML#

Standard choice heuristic for $k$: keep 90–95% explained variance.

Notes#

  • Reconstruction error squared equals sum of squared singular values of discarded components.

  • Trade-off: fewer components → less storage/compute, but more information loss.

Connection to ML#

Informs practical $k$ selection for downstream tasks.

Connection to Linear Algebra Theory#

Optimal low-rank approximation per Eckart–Young theorem.

Pedagogical Significance#

Links theory to practical dimensionality reduction.

References#

  1. Eckart & Young (1936). The approximation of one matrix by another of lower rank.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 100, 10
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # SVD of Xc itself, so the truncation reconstructs Xc
evals = s ** 2 / n  # eigenvalues of the covariance matrix

k = 4
Xc_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
error_fro = np.linalg.norm(Xc - Xc_k, "fro")
tail_vars = evals[k:].sum()  # variance in the discarded components

ev_ratio = np.cumsum(evals) / evals.sum()
print("explained variance ratio (k=1..5):", np.round(ev_ratio[:5], 4))
print("reconstruction error:", round(error_fro, 4))
print("sqrt(n * tail variance):", round(float(np.sqrt(n * tail_vars)), 4))

Worked Example 4: Whitening and decorrelation#

Introduction#

Apply PCA whitening to standardize covariance; show uncorrelated and unit-variance output.

Purpose#

Demonstrate how whitening decorrelates features and enables downstream algorithms.

Importance#

Preprocessing step for many algorithms (kernels, RBF networks, distance-based methods).

What this example demonstrates#

  • Compute PCA; form whitening transform $Z_w = Z \Lambda^{-1/2}$.

  • Verify output covariance is identity.

  • Compare to standard scaling.

Background#

Whitening removes correlation structure and equalizes variance across dimensions.

Historical context#

Used in signal processing; adopted in deep learning for stabilization.

History#

LeCun et al. (1998) highlighted importance in early deep learning; Krizhevsky et al. (2012) used it in AlexNet.

Prevalence in ML#

Standard preprocessing in deep learning, kernel methods, and statistical tests.

Notes#

  • Add small floor to tiny eigenvalues to avoid division by zero.

  • Whitening can amplify noise if done naively on high-variance directions.

Connection to ML#

Improves convergence, gradient scales, and generalization of many algorithms.

Connection to Linear Algebra Theory#

Transform to canonical coordinate system (aligned with PCA axes).

Pedagogical Significance#

Practical application of PCA-based preprocessing.

References#

  1. LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

Solution (Python)#

import numpy as np

np.random.seed(3)
n, d = 150, 4
# Create correlated features
A = np.random.randn(d, d)
Cov = A.T @ A
X = np.random.randn(n, d) @ np.linalg.cholesky(Cov).T
Xc = X - X.mean(axis=0, keepdims=True)

evals, evecs = np.linalg.eigh((Xc.T @ Xc) / n)
evals = evals[::-1]
evecs = evecs[:, ::-1]

# Whitening transform
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
Z = Xc @ evecs
Zw = Z @ Lambda_inv_sqrt

# Verify output covariance is identity
Sigma_w = (Zw.T @ Zw) / n

print("input covariance diag:", np.round(np.diag((Xc.T @ Xc) / n), 4))
print("whitened covariance diag:", np.round(np.diag(Sigma_w), 4))
print("whitened covariance off-diag max:", round(np.max(np.abs(Sigma_w - np.eye(d))), 6))

Worked Example 5: Denoising via truncated PCA#

Introduction#

Apply truncated PCA to a noisy signal; show noise reduction as a function of truncation.

Purpose#

Illustrate how keeping top-$k$ components removes high-frequency (noisy) information.

Importance#

Core application in image denoising, signal processing, and data cleaning.

What this example demonstrates#

  • Add noise to data; apply truncated PCA for different $k$.

  • Measure reconstruction error vs. ground truth vs. noise level.

  • Show improvement from truncation.

Background#

Noise typically occupies low-variance directions; truncation removes it.

Historical context#

Classical application dating to Pearson (1901); widely used in signal/image processing.

History#

Precursor to modern deep denoising autoencoders.

Prevalence in ML#

Used in image inpainting, audio denoising, sensor data cleanup.

Notes#

  • Noise reduction works best if signal occupies few components.

  • Trade-off: lower $k$ → more denoising, but may remove true signal.

Connection to ML#

Improves feature quality for downstream models; common preprocessing.

Connection to Linear Algebra Theory#

Low-rank structure (signal) separated from noise via truncation.

Pedagogical Significance#

Demonstrates practical benefit of dimensionality reduction.

References#

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.

  2. Hastie et al. (2009). The Elements of Statistical Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 100, 20
# True signal with low-rank structure
U_true, _ = np.linalg.qr(np.random.randn(n, 5))
V_true, _ = np.linalg.qr(np.random.randn(d, 5))
s_true = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X_clean = U_true @ np.diag(s_true) @ V_true.T

# Add noise
noise = 0.5 * np.random.randn(n, d)
X_noisy = X_clean + noise

mu = X_noisy.mean(axis=0, keepdims=True)
Xc_noisy = X_noisy - mu
U, s, Vt = np.linalg.svd(Xc_noisy, full_matrices=False)  # SVD of the centered data itself

errors = []
ks = range(1, 11)
for k in ks:
    X_k = mu + U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # add the mean back before comparing
    err = np.linalg.norm(X_clean - X_k, "fro")
    errors.append(err)

print("reconstruction error for k=1..5:", np.round(errors[:5], 4))
print("best k:", np.argmin(errors) + 1)
