Chapter 16
Matrix Products in DL & Transformers
Key ideas: Introduction

Introduction#

Neural networks are fundamentally stacks of matrix multiplications. A forward pass through a deep network is a product of weight matrices and activations. Each layer computes $A_{l+1} = \sigma(A_l W_l^\top)$ where $W_l \in \mathbb{R}^{d_{l+1} \times d_l}$, $A_l \in \mathbb{R}^{n \times d_l}$ (batch size $n$, layer input dimension $d_l$, output dimension $d_{l+1}$). The cost is $O(n d_l d_{l+1})$ per layer. Transformers add attention: $\text{Attention}(Q, K, V) = \text{softmax}(Q K^\top / \sqrt{d_k}) V$, which involves two GEMM operations ($QK^\top$ and the product with $V$) and a row-wise softmax. For a Transformer with sequence length $L$, hidden dimension $d$, and $H$ attention heads per layer, attention cost is $O(L^2 d)$ (quadratic in sequence length, a major bottleneck). Modern accelerators (GPUs, TPUs) are matrix-multiply engines capable of trillions of floating-point operations per second (TFLOPS). Utilization depends on arithmetic intensity (ops/byte): bandwidth-bound operations underutilize the accelerator; compute-bound operations (high arithmetic intensity) achieve near-peak performance. Understanding how to write matrix products that achieve high arithmetic intensity, and how to distribute them across devices, determines whether you can train billion-parameter models.
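
These scaling relations can be checked with a back-of-the-envelope FLOP count per Transformer layer. The sketch below assumes the standard block (QKV plus output projections, a $4d$-wide MLP) and counts each multiply-add as 2 FLOPs; the constants are approximations, not measurements of any particular model.

```python
def layer_flops(L, d):
    """Approximate forward-pass FLOPs for one standard Transformer layer."""
    proj = 4 * 2 * L * d * d        # Q, K, V, output projections: four (L,d)x(d,d) GEMMs
    attn = 2 * 2 * L * L * d        # QK^T and AV: O(L^2 d) total across all heads
    mlp = 2 * 2 * L * d * (4 * d)   # two GEMMs through the 4d-wide hidden layer
    return proj, attn, mlp

d = 1024
for L in [512, 2048, 8192]:
    proj, attn, mlp = layer_flops(L, d)
    total = proj + attn + mlp
    print(f"L={L:5d}: attention share = {attn/total:5.1%}  (total {total/1e9:.1f} GFLOPs)")
```

The attention share grows with $L$: for short sequences the $O(Ld^2)$ projection and MLP GEMMs dominate, and only once $L$ approaches and exceeds $d$ does the quadratic attention term take over.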

Important ideas#

  1. Matrix-matrix multiplication (GEMM) structure

    • Dense GEMM: $C \leftarrow AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

    • Arithmetic: $2mkn = O(mkn)$ floating-point operations (FLOPs): each of the $mn$ output entries requires $k$ multiply-add pairs.

    • Memory: $O(mk + kn + mn)$ words (inputs + output); on GPU, $m, k, n$ can be in the thousands, so operands occupy hundreds of MB to GBs.

    • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes}} = \frac{2mkn}{8(mk + kn + mn)} = \frac{mkn}{4(mk + kn + mn)}$ in FP64 (higher is better).

  2. Blocking and cache efficiency

    • GEMM blocked into $b \times b$ tiles; each tile multiplied using fast cache.

    • Cache line length (64 bytes typical); GEMM loads tile once, reuses it $O(b)$ times.

    • Roofline model: peak FLOP rate vs. memory bandwidth; if arithmetic intensity $< I_{\text{roof}}$, algorithm is bandwidth-bound.

  3. Batch matrix multiplication (batched GEMM)

    • Batched product: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size); arises in every minibatch forward pass.

    • Exploit parallelism: process multiple batches on multiple cores/GPU SMs.

    • Highly efficient when batch size is large; small batches underutilize accelerator.

  4. Convolution as matrix multiplication (im2col, Winograd)

    • Convolution unfolds as GEMM: reshape input patches into columns; multiply by filter matrix; reshape output.

    • im2col: input image to column matrix; allows use of highly optimized GEMM (cuBLAS, MKL).

    • Cost: $O(k_h k_w \, d_{\text{in}} d_{\text{out}} h_{\text{out}} w_{\text{out}})$ (kernel height/width, input/output channels, output spatial dims).

    • Winograd: fast convolution via a transformed domain; reduces arithmetic at the cost of numerical accuracy and implementation complexity.

  5. Scaled dot-product attention

    • Query-key-value paradigm: $Q, K \in \mathbb{R}^{L \times d_k}$, $V \in \mathbb{R}^{L \times d_v}$ (sequence length $L$, head dimensions $d_k, d_v$).

    • Attention: (1) $M = Q K^\top / \sqrt{d_k}$ (an $(L \times d_k)(d_k \times L)$ product, $O(L^2 d_k)$), (2) $A = \text{softmax}(M)$ (per-row normalization, no matrix products), (3) $O = AV$ (an $(L \times L)(L \times d_v)$ product, $O(L^2 d_v)$).

    • Total: $O(L^2 (d_k + d_v)) = O(L^2 d)$ (quadratic in sequence length).

    • Challenge: for $L = 4096$ (a typical context length), the attention matrix has $L^2 \approx 16.8$M entries per head; across heads and layers this amounts to billions of FLOPs.

  6. Mixed precision and numerical stability

    • FP32 (single precision, float32): 32 bits, ~7 significant digits; gradients, weights commonly stored in FP32.

    • FP16 (half precision, float16): 16 bits, ~3 significant digits; normal range $[6 \times 10^{-5}, 6.5 \times 10^4]$, subnormals down to $6 \times 10^{-8}$; GPU operations 2–10× faster.

    • BFloat16 (Brain Float): 16 bits, same exponent range as FP32, reduced mantissa (~2–3 significant digits); avoids most FP16 underflow/overflow issues.

    • Mixed precision: compute GEMM in FP16 (fast), accumulate in FP32 (stable); scale loss to prevent underflow.

  7. Distributed matrix multiplication

    • Data parallelism: replicate model on each device; partition minibatches; synchronize gradients via all-reduce.

    • Model parallelism: partition matrix weights across devices; communication within matrix product (e.g., matmul followed by communication).

    • Pipeline parallelism: partition layers across devices; overlap computation on layer $i$ with communication on layer $i-1$.

    • Cost: compute + communication latency; communication often dominates at large scale (Roofline model).
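
The loss-scaling recipe from idea 6 above can be reproduced in a few lines of NumPy; the $2^{16}$ scale is a commonly used default, shown here purely for illustration.

```python
import numpy as np

# A gradient value below FP16's smallest subnormal (~6e-8) flushes to zero
grad = np.float32(1e-8)
print(np.float16(grad))            # underflows to 0.0: gradient information lost

# Loss scaling: multiply the loss (and hence all gradients) by 2^16 before
# the backward pass, then unscale in FP32 after accumulation
scale = np.float32(2.0 ** 16)
scaled = np.float16(grad * scale)  # 6.55e-4: comfortably representable in FP16
recovered = np.float32(scaled) / scale
print(recovered)                   # ~1e-8: gradient information preserved
```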

Relevance to ML#

  • Convolutional neural networks (CNNs): Forward and backward passes are GEMM-heavy; efficiency determines whether you can train on billion-pixel images or video.

  • Recurrent neural networks (RNNs), LSTMs, GRUs: Fully-connected layers between timesteps; matrix products per timestep.

  • Transformers and large language models: Attention contributes $O(L^2 d)$ matrix products; for GPT-3 ($L = 2048$, $d = 12288$) the $O(L d^2)$ projection and MLP GEMMs actually dominate, with attention taking over only when $L \gg d$.

  • Graph neural networks (GNNs): Graph convolution is sparse matrix product; efficiency depends on sparsity and format.

  • Distributed training: Modern LLMs trained on thousands of GPUs/TPUs; communication cost (network bandwidth) often exceeds computation cost.

Algorithmic development (milestones)#

  • 1969: Strassen algorithm: $O(n^{2.807})$ vs. $O(n^3)$ naive GEMM (theoretically significant; rarely used in practice due to constants).

  • 1979–1990: Level-1/2/3 BLAS (Basic Linear Algebra Subprograms); standardized interface for matrix ops; LAPACK (1992) built on BLAS.

  • 1995–2005: GPU era begins: NVIDIA GeForce, Tesla; GPUs offer roughly an order of magnitude more memory bandwidth than CPUs; GEMMs run 10–100× faster.

  • 2006: CUDA (Compute Unified Device Architecture) released; enables general-purpose GPU computing; cuBLAS optimized GEMM for NVIDIA GPUs.

  • 2012: AlexNet (Krizhevsky et al.) demonstrates deep CNN training on GPUs; FLOPs dominate; GEMM-heavy.

  • 2015: Batch normalization (Ioffe & Szegedy); reduces sensitivity to initialization; stabilizes deep-network training.

  • 2017: Transformer architecture (Vaswani et al.); attention is dense GEMM-based; quadratic in sequence length.

  • 2017–2018: Mixed precision training (Micikevicius et al. 2018); FP16 compute + FP32 accumulation + loss scaling enables 2–8× speedups on tensor-core GPUs.

  • 2018–2020: Distributed training frameworks mature (PyTorch DDP, Horovod); models with hundreds of billions of parameters trained via combined data, tensor, and pipeline parallelism.

  • 2020–2023: FlashAttention (Dao et al. 2022) reduces attention memory traffic via tiling and recomputation (IO-aware exact attention); Megatron-LM and DeepSpeed enable distributed GEMMs at petaflop scales.

Definitions#

  • GEMM (General Matrix Multiply): $C \leftarrow \alpha A B + \beta C$ (standard matrix multiply with scaling/accumulation).

  • FLOP (floating-point operation): One add or multiply; GEMM $C \leftarrow AB$ is $2mkn$ FLOPs.

  • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes read/written}}$ (ops per byte); high $I$ means compute-bound; low $I$ means bandwidth-bound.

  • Roofline model: Peak achievable throughput = $\min(\text{peak FLOP rate}, \text{memory bandwidth} \times \text{arithmetic intensity})$.

  • Memory-bound: Algorithm where memory bandwidth is bottleneck; cannot achieve peak FLOP rate.

  • Compute-bound: Algorithm where compute is bottleneck; limited by FLOPs/cycle, not memory.

  • Mixed precision: Using multiple precision levels (e.g., FP16 for compute, FP32 for accumulation) to trade accuracy for speed.

  • All-reduce: Distributed operation: values from all devices are combined (e.g., elementwise sum) and the result is replicated on every device. Latency-optimal tree algorithms use $O(\log D)$ communication rounds for $D$ devices; bandwidth-optimal rings use $O(D)$ steps.

  • Collective communication: Broadcasting, all-reduce, reduce-scatter, all-gather operations in distributed training.
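
A toy recursive-doubling all-reduce makes the $O(\log D)$ round count concrete; this is a NumPy simulation of the collective, not a real communication implementation.

```python
import numpy as np

D = 8                                     # number of (simulated) devices, power of two
rng = np.random.default_rng(0)
local = [rng.standard_normal(4) for _ in range(D)]
expected = np.sum(local, axis=0)

# Recursive doubling: in round r, device i exchanges with partner i XOR 2^r
# and both keep the sum; after log2(D) rounds every device holds the total.
cur, rounds, step = [v.copy() for v in local], 0, 1
while step < D:
    cur = [cur[i] + cur[i ^ step] for i in range(D)]
    step *= 2
    rounds += 1

print("rounds:", rounds)                  # 3 == log2(8)
print("all devices agree:", all(np.allclose(c, expected) for c in cur))
```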

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • GEMM arithmetic and complexity: $O(mkn)$ FLOPs, memory $O(mk + kn + mn)$. Reference: Golub & Van Loan (2013).

  • Arithmetic intensity and Roofline model: $I = \text{FLOPs/bytes}$; peak rate is $\min(\text{FLOP rate}, \text{bandwidth} \times I)$. Reference: Williams et al. (2009).

  • Cache-oblivious algorithms: Block-recursive GEMM achieves near-optimal cache behavior independent of cache size. Reference: Frigo et al. (1999).

  • Batched GEMM: Independent products $C_i \leftarrow A_i B_i$; parallelism across the batch dimension. Reference: Level-3 BLAS (1990) and vendor batched extensions (e.g., cuBLAS gemmBatched).

  • Attention complexity: Scaled dot-product $O(L^2 d)$ without optimizations; challenges for long contexts. Reference: Vaswani et al. (2017).

  • Distributed GEMM: Communication cost for gradient all-reduce, model/data parallelism. Reference: Thakur et al. (2005) (MPI Collective Communications).

Applied (landmark systems)#

  • Level-3 BLAS (cuBLAS, MKL): Industry-standard GEMM implementations; peak performance on CPUs/GPUs. Implementation: NVIDIA cuBLAS, Intel MKL. Reference: Dongarra et al. (1990) (BLAS 3).

  • Convolution as GEMM (im2col): Standard in Caffe and in PyTorch's CPU convolution path; enables reuse of optimized GEMM. Reference: Jia et al. (2014).

  • Mixed precision training: Automatic mixed precision in PyTorch (torch.cuda.amp), TensorFlow (tf.keras.mixed_precision). Achieves 2–3× speedup on V100/A100. Reference: NVIDIA Automatic Mixed Precision Training Guide (2020).

  • Distributed GEMM (Megatron-LM, DeepSpeed): Tensor parallelism partitions GEMM across devices; pipeline parallelism overlaps layers. Implementation: Microsoft DeepSpeed, NVIDIA Megatron-LM. Reference: Shoeybi et al. (2019); Rasley et al. (2020).

  • Flash Attention: IO-efficient attention via blocked matrix products; reduces memory bandwidth by 10×. Implementation: Tri Dao’s flash-attention library. Reference: Dao et al. (2022).

Key ideas: Where it shows up
  1. Convolutional neural networks and image classification

    • Forward pass: convolutional layers (im2col GEMM), batch norm (element-wise), pooling (no GEMM).

    • Backward: weight gradient, input gradient via GEMM.

    • Achievements: ResNet-50 trains on ImageNet in 1 hour on 256 GPUs (Goyal et al. 2017); mixed precision roughly halves per-step time. References: Krizhevsky et al. (2012) (AlexNet); He et al. (2015) (ResNet); Goyal et al. (2017) (accurate large-batch SGD).

  2. Transformer models and large language models

    • Per-layer: QKV projections (3 GEMMs), attention (2 GEMMs), output projection (1 GEMM), MLP (2 GEMMs) ≈ 8 GEMMs per layer.

    • Attention cost: $O(L^2 d)$ (quadratic in sequence length); dominates for long sequences.

    • Achievements: GPT-3 (Brown et al. 2020) trained on 300 billion tokens ($\approx 3 \times 10^{23}$ FLOPs), parallelized across a cluster of V100 GPUs using model parallelism. FlashAttention (Dao et al. 2022) reduces attention memory traffic by an order of magnitude. References: Vaswani et al. (2017) (Transformer); Brown et al. (2020) (GPT-3); Dao et al. (2022) (FlashAttention).

  3. Distributed training and synchronization

    • Data parallelism: gradient all-reduce after each minibatch.

    • Model parallelism: gradient exchanges within matrix products.

    • Achievements: LAMB optimizer (You et al. 2019) enables BERT pretraining in 76 minutes on a 1024-chip TPUv3 pod with batch size 32k. Megatron-LM (Shoeybi et al. 2019) trains GPT models with tensor parallelism. References: You et al. (2019) (LAMB); Shoeybi et al. (2019) (Megatron-LM).

  4. Mixed precision training

    • Automatic mixed precision (AMP): dynamically select FP16/FP32 for operations.

    • Loss scaling: prevent FP16 gradient underflow.

    • Achievements: NVIDIA Automatic Mixed Precision reduces training time by 2–3× on V100/A100 while maintaining accuracy. References: NVIDIA Mixed Precision Training guide; Micikevicius et al. (2018).

  5. Graph neural networks and sparse matrix products

    • Graph convolution: $X' = \sigma(AXW)$ where $A$ is sparse adjacency matrix.

    • Sparse-dense GEMM: $O(\text{nnz}(A) \cdot d)$ FLOPs; arithmetic intensity is lower than dense GEMM, but cost scales with the number of edges rather than $n^2$.

    • Achievements: DGL, PyG enable billion-node GNNs via optimized sparse GEMMs. References: Kipf & Welling (2017) (GCN); Wang et al. (2019) (DGL); Fey et al. (2019) (PyG).
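
The sparse product $X' = \sigma(AXW)$ can be sketched in plain NumPy with the adjacency matrix in COO form; the toy graph and weights below are hypothetical, and a scatter-add stands in for an optimized sparse kernel.

```python
import numpy as np

# Toy 4-node graph as a COO edge list (row, col) with unit weights
rows = np.array([0, 1, 1, 2, 3, 3])
cols = np.array([1, 0, 2, 1, 2, 0])
vals = np.ones(len(rows))

n, d_in, d_out = 4, 8, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))

def spmm(rows, cols, vals, M):
    """Sparse(A) @ dense(M) via scatter-add: O(nnz * d) FLOPs."""
    out = np.zeros((n, M.shape[1]))
    np.add.at(out, rows, vals[:, None] * M[cols])   # accumulate edge contributions
    return out

H = np.maximum(spmm(rows, cols, vals, X) @ W, 0.0)  # X' = relu(A X W)

# Sanity check against the dense formulation
A = np.zeros((n, n))
A[rows, cols] = vals
print(np.allclose(H, np.maximum(A @ X @ W, 0.0)))   # True
```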

Notation
  • Matrix product: $C \leftarrow A B$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

  • Batched product: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size); vectorization across batch.

  • Attention: $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) V$ with $Q, K, V \in \mathbb{R}^{L \times d}$ (sequence length $L$, dimension $d$).

  • Complexity: Attention is $O(L^2 d)$ FLOPs; dense GEMM is $O(n d_{\text{in}} d_{\text{out}})$ per layer (batch size $n$).

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$ (depends on matrix shapes; higher $I$ achieves better GPU utilization).

  • FLOP rate: Peak: $P$ (e.g., ~15.7 TFLOPS for V100 in FP32); practical: $P \times \text{efficiency}$ (typically 50–80%).

  • Memory bandwidth: $B$ (e.g., ~900 GB/s for V100 HBM2; ~2 TB/s for A100 HBM2e); roofline: achieved throughput $= \min(P, I \times B)$.

  • Example: ResNet-50 forward pass: ~4 GFLOPs per image; batch size 256 ≈ 1 TFLOP of work; at ~80% utilization of an A100's 19.5 TFLOPS FP32 (~16 TFLOPS achieved), time ≈ 65 ms.

Pitfalls & sanity checks
  • Batch size too small: GPUs underutilized; poor arithmetic intensity. Typical minimum: 32–64 per device.

  • Tall-skinny GEMM: Low arithmetic intensity; underutilize accelerator. Prefer square or batched products.

  • Ignoring data layout: Row-major vs. column-major affects cache performance by 10×.

  • Mixed precision without loss scaling: FP16 gradients below $\approx 6 \times 10^{-8}$ flush to zero; loss scaling prevents this (multiply the loss by e.g. $2^{16}$, unscale gradients in FP32).

  • Attention without length limits: Quadratic memory; even with batch size 1, $L = 8192$ requires 256 MB for single head.

  • Synchronous all-reduce without compression: Communication time dominates; gradient compression (sparsification, quantization) essential at scale.

  • Assuming linear scaling: Communication cost breaks linear scaling; without overlap and compression, efficiency can fall from ~90% on a handful of devices to well below 50% on hundreds.

  • Convolution without im2col: Naive loops 100–1000× slower than GEMM-based implementation.
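
The last two pitfalls can be made concrete with a simple ring all-reduce cost model. All hardware numbers below (per-step compute time, model size, interconnect bandwidth, per-hop latency) are illustrative assumptions, not measurements.

```python
# Simple data-parallel scaling model with illustrative (assumed) numbers
t_compute = 100e-3           # assumed per-device compute time per step (s)
grad_bytes = 4 * 100e6       # assumed 100M parameters in FP32
bandwidth = 50e9             # assumed interconnect bandwidth (bytes/s)
latency = 20e-6              # assumed per-hop latency (s)

effs = []
for D in [4, 16, 64, 256]:
    # Ring all-reduce: each device moves 2(D-1)/D of the gradient bytes
    # and incurs 2(D-1) latency hops
    t_comm = 2 * (D - 1) / D * grad_bytes / bandwidth + 2 * (D - 1) * latency
    eff = t_compute / (t_compute + t_comm)
    effs.append(eff)
    print(f"D={D:4d}: comm {t_comm*1e3:6.1f} ms, scaling efficiency {eff:5.1%}")
```

The bandwidth term saturates as $D$ grows (rings are bandwidth-optimal), but the latency term keeps growing, so efficiency declines monotonically with device count.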

References

Matrix multiplication theory

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Strassen, V. (1969). Gaussian elimination is not optimal.

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Performance modeling and BLAS

  1. Dongarra, J., Du Croz, J., Duff, I. S., & Hammarling, S. (1990). A set of level 3 basic linear algebra subprograms.

  2. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  3. Grigori, L., Demmel, J., & Xiang, H. (2008). Communication avoiding Gaussian elimination.

Deep learning and convolution

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.

  3. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

Transformer and attention

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners.

  3. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

Mixed precision and numerical stability

  1. Micikevicius, P., Narang, S., Alben, J., et al. (2018). Mixed precision training.

  2. Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift.

  3. NVIDIA Automatic Mixed Precision Training Guide. (2020).

Distributed training

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

  3. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  4. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

  5. You, Y., Gitman, I., & Ginsburg, B. (2019). Large batch optimization for deep learning: training BERT in 76 minutes.

Attention optimization

  1. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

  2. Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse transformers.

  3. Fan, A., Grave, E., & Joulin, A. (2019). Reducing transformer depth on demand with structured dropout.

Five worked examples

Worked Example 1: GEMM efficiency and arithmetic intensity#

Introduction#

Implement dense matrix multiplication on CPU and GPU; measure FLOP rate and memory bandwidth utilization; demonstrate how matrix shape affects arithmetic intensity.

Purpose#

Understand relationship between GEMM dimensions and arithmetic intensity; show how to achieve peak GPU performance.

Importance#

Foundation for understanding deep learning performance; shapes (batch size, hidden dimensions) directly impact training time.

What this example demonstrates#

  • Construct tall-skinny vs. square GEMM matrices.

  • Measure FLOPs and memory bandwidth for each.

  • Compute arithmetic intensity $I = \text{FLOPs/bytes}$.

  • Compare achieved FLOP rate vs. peak.

  • Predict speedup from roofline model.

Background#

GEMM efficiency depends on matrix shape: square matrices have high arithmetic intensity; tall-skinny have low intensity.

Historical context#

Roofline model (Williams et al. 2009) formalizes this trade-off; guides architecture and algorithm design.

History#

Standard framework for performance modeling in HPC and ML systems.

Prevalence in ML#

Every deep learning practitioner adjusts batch size, layer dimensions to maximize GPU utilization.

Notes#

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$; maximized when $m \approx k \approx n$ (cube).

  • For fixed $k$, varying $m, n$ (batch size, hidden dims) changes $I$ by 10×.

Connection to ML#

Batch size and hidden dimension choices affect both accuracy and training speed; understanding trade-offs is critical.

Connection to Linear Algebra Theory#

GEMM is fundamental linear algebra operation; efficiency is determined by cache locality (blocking theory).

Pedagogical Significance#

Demonstrates practical performance modeling; connects theory (arithmetic intensity) to practice (measured FLOP rates).

References#

  1. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Solution (Python)#

import numpy as np
import time

np.random.seed(35)

# Test different matrix shapes (keeping k fixed)
k = 1024
shapes = [
    (128, k, 128),    # Tall-skinny-ish: low intensity
    (1024, k, 1024),  # Square: high intensity
    (4096, k, 4096),  # Large square: even higher
]

print("GEMM Efficiency Analysis")
print("=" * 80)
print(f"{'m x n':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Intensity':15} {'Est. GFLOPs':15}")
print("-" * 80)

for m, k_dim, n in shapes:
    # Arithmetic
    flops = 2 * m * k_dim * n
    # Memory: read A (m*k), read B (k*n), write C (m*n)
    mem_bytes = 8 * (m * k_dim + k_dim * n + m * n)
    intensity = flops / mem_bytes
    
    # Estimate performance from roofline
    # Assume: peak ~15.7 TFLOPS (V100 FP32), bandwidth ~900 GB/s (V100 HBM2)
    peak_flops = 15.7e12
    bandwidth = 900e9
    roofline = min(peak_flops, bandwidth * intensity)
    
    print(f"{f'{m}x{n}':<15} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {intensity:>14.2f} {roofline/1e9:>14.1f}")

print("\n" + "=" * 80)
print("Key insight: Higher arithmetic intensity -> higher roofline GFLOPs")
print("Square matrices (m ~ k ~ n) achieve 10-100x higher intensity than tall-skinny")

Worked Example 2: Batched GEMM and GPU parallelism#

Introduction#

Implement batched matrix multiplication; measure performance as batch size varies; show speedup from batch parallelism.

Purpose#

Demonstrate how batch dimension enables parallelism; show relationship between batch size and GPU utilization.

Importance#

Batch size is a key hyperparameter; understanding its impact on performance guides training setup.

What this example demonstrates#

  • Generate batched matrices $A_i, B_i$ for $i = 1, \ldots, B$.

  • Time batched GEMM vs. sequential.

  • Measure speedup; show scaling with batch size.

  • Explain why small batches underutilize GPU.

Background#

GPUs have thousands of cores; small batches can’t keep all cores busy; large batches achieve better utilization.

Historical context#

Batch GEMM standardized in BLAS Level 3 (1990); essential for CNN/RNN training.

History#

Modern frameworks (PyTorch, TensorFlow) automatically batch GEMMs; rarely needs manual tuning.

Prevalence in ML#

Every training loop uses batched GEMM; batch size choice directly impacts throughput.

Notes#

  • Batch size $B = 1$: each GEMM is independent; throughput limited.

  • $B = 32$: better utilization; GPUs have 80+ SMs (streaming multiprocessors).

  • $B = 256$: excellent utilization; typical for modern training.

Connection to ML#

Batch size affects both convergence (larger batches can have worse generalization) and speed; practical sweet spot is usually 32–256.

Connection to Linear Algebra Theory#

Batched GEMM exploits structure (independent problems); vectorization across batch dimension.

Pedagogical Significance#

Shows interplay between algorithm structure and hardware parallelism.

References#

  1. Dongarra, J., Du Croz, J., Duff, I. S., & Hammarling, S. (1990). A set of level 3 basic linear algebra subprograms.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  3. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

Solution (Python)#

import numpy as np
import time

np.random.seed(36)

# Batched GEMM performance
batch_sizes = [1, 4, 16, 64, 256]
m, k, n = 1024, 1024, 1024
iterations = 10

print("Batched GEMM Performance (m=k=n={}, {} iterations)".format(m, iterations))
print("=" * 60)
print(f"{'Batch Size':15} {'Total Time (s)':20} {'GFLOPs':15}")
print("-" * 60)

for B in batch_sizes:
    # Create batch of matrices
    A = np.random.randn(B, m, k).astype(np.float32)
    B_mat = np.random.randn(B, k, n).astype(np.float32)
    
    # Batched matmul (sequential in Python; normally GPU would parallelize)
    t0 = time.time()
    for _ in range(iterations):
        C = np.matmul(A, B_mat)
    t_total = time.time() - t0
    
    # FLOPs: 2mkn per batch, B batches, iterations
    flops = iterations * B * 2 * m * k * n
    gflops = flops / (t_total * 1e9)
    
    print(f"{B:>14} {t_total:>19.4f} {gflops:>14.1f}")

print("\n" + "=" * 60)
print("Note: Larger batch sizes achieve higher GFLOPs due to better parallelism")

Worked Example 3: Convolution as GEMM (im2col)#

Introduction#

Implement convolution using naive loops, then via im2col GEMM; measure speedup from optimized GEMM.

Purpose#

Show how convolution is equivalent to matrix multiplication; demonstrate efficiency gain from reusing optimized GEMM.

Importance#

Foundational for understanding why GPUs excel at CNNs; im2col is standard in production frameworks.

What this example demonstrates#

  • Naive 5-loop convolution implementation.

  • im2col transformation: reshape patches into columns.

  • GEMM on im2col matrix; reshape output.

  • Compare naive vs. GEMM time.

Background#

Convolution unfolds into GEMM; allows reuse of highly-tuned BLAS kernels; 10–100× speedup.

Historical context#

im2col technique developed for efficient convolution implementations in early deep learning (Caffe, 2013).

History#

Standard in all deep learning frameworks; sometimes augmented by Winograd for further speedup.

Prevalence in ML#

Every CNN implementation uses im2col or similar GEMM-based convolution.

Notes#

  • im2col memory overhead: up to $k_h k_w$× larger than the input (9× for a $3 \times 3$ kernel at stride 1); trades memory for speed.

  • Winograd convolution (for $3 \times 3$ kernels): lower arithmetic but numerically complex.

Connection to ML#

Convolutional layers dominate image classification and detection models; efficiency here directly impacts training speed.

Connection to Linear Algebra Theory#

Convolution is linear transformation; im2col exploits structure to reduce to GEMM.

Pedagogical Significance#

Demonstrates how abstract operations (convolution) map to concrete linear algebra (GEMM).

References#

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

  3. Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks.

Solution (Python)#

import numpy as np
import time

np.random.seed(37)

# Convolution parameters
batch_size, in_height, in_width, in_channels = 32, 64, 64, 3
out_channels, kernel_h, kernel_w, stride = 16, 3, 3, 1
pad = 1

# Padded input
X_padded = np.pad(np.random.randn(batch_size, in_height, in_width, in_channels),
                   ((0,0), (pad,pad), (pad,pad), (0,0)), mode='constant')
W = np.random.randn(out_channels, kernel_h, kernel_w, in_channels)

# Output dimensions
out_height = (in_height + 2*pad - kernel_h) // stride + 1
out_width = (in_width + 2*pad - kernel_w) // stride + 1

# Naive convolution (slow)
print("Naive convolution (5-loop implementation):")
t0 = time.time()
Y_naive = np.zeros((batch_size, out_height, out_width, out_channels))
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            for c in range(out_channels):
                h_start = h * stride
                w_start = w * stride
                patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
                Y_naive[b, h, w, c] = np.sum(patch * W[c])
t_naive = time.time() - t0
print(f"  Time: {t_naive:.4f} s")

# im2col GEMM (fast)
print("\nim2col GEMM (optimized convolution):")
t0 = time.time()

# im2col: extract patches
X_col = np.zeros((batch_size * out_height * out_width, kernel_h * kernel_w * in_channels))
idx = 0
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            h_start = h * stride
            w_start = w * stride
            patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
            X_col[idx] = patch.reshape(-1)
            idx += 1

# Weight matrix (reshape filters)
W_mat = W.reshape(out_channels, -1).T  # (kernel_h*kernel_w*in_channels, out_channels)

# GEMM
Y_col = X_col @ W_mat  # (batch*out_h*out_w, out_channels)

# Reshape to output
Y_gemm = Y_col.reshape(batch_size, out_height, out_width, out_channels)

t_gemm = time.time() - t0
print(f"  Time: {t_gemm:.4f} s")

print(f"\nSpeedup: {t_naive / t_gemm:.1f}x")
print(f"Results match: {np.allclose(Y_naive, Y_gemm, atol=1e-5)}")

Worked Example 4: Scaled dot-product attention complexity#

Introduction#

Implement attention operation; measure memory and time complexity; show quadratic dependence on sequence length.

Purpose#

Understand why attention is a bottleneck for long sequences; motivate approximate attention methods.

Importance#

Attention scales as $O(L^2 d)$; for long sequences (4K tokens), this dominates; critical for efficiency research.

What this example demonstrates#

  • Implement attention: QK^T, softmax, output.

  • Measure memory (intermediate softmax matrix is $L \times L$).

  • Time scaling with $L$; show quadratic growth.

  • Compare attention time vs. other layers.

Background#

Quadratic attention complexity is fundamental limitation of transformer architecture; many proposed approximations.

Historical context#

Vaswani et al. (2017) introduce the Transformer; quadratic attention complexity was not initially recognized as a bottleneck for $L \le 512$.

History#

Post-2020, attention optimization becomes major research area: Flash Attention, sparse attention, linear attention variants.

Prevalence in ML#

Every transformer model suffers from quadratic attention; common workaround is to limit context length or use approximations.

Notes#

  • Attention FLOPs: $\approx 4L^2 d$ for $QK^\top$ and $AV$ together (dominant for $L > d$).

  • Memory: $O(L^2)$ for the attention matrix; for $L = 4096$: ~67 MB per head in FP32, ~0.8 GB across 12 heads.

Connection to ML#

Limiting context length ($L = 512$ vs. $L = 4096$) is common trade-off between expressiveness and efficiency.

Connection to Linear Algebra Theory#

The $L \times L$ attention score matrix makes total cost quadratic in sequence length, even though each constituent GEMM is a standard matrix product.

Pedagogical Significance#

Shows concrete example of how algorithmic bottleneck (quadratic) impacts practical ML.

References#

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

  3. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

Solution (Python)#

import numpy as np
import time

np.random.seed(38)

# Attention parameters
d = 768  # Hidden dimension
num_heads = 12
d_k = d // num_heads
L_values = [128, 256, 512, 1024, 2048]  # Sequence lengths

print("Attention Complexity Analysis (d={}, num_heads={})".format(d, num_heads))
print("=" * 70)
print(f"{'Seq Len L':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Time (ms)':15}")
print("-" * 70)

for L in L_values:
    batch_size = 1
    
    # Create Q, K, V
    Q = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    K = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    V = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    
    # Measure time and memory
    t0 = time.time()
    
    # Attention: QK^T / sqrt(d_k)
    scores = np.matmul(Q, K.transpose(0, 1, 3, 2))  # (batch, heads, L, L)
    scores = scores / np.sqrt(d_k)
    
    # Softmax
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    
    # Output
    output = np.matmul(weights, V)  # (batch, heads, L, d_k)
    
    t_attn = time.time() - t0
    
    # FLOPs: QK^T = 2*L^2*d_k, softmax ~L^2, output = 2*L^2*d_k
    flops = batch_size * num_heads * (2 * L * L * d_k + 2 * L * L * d_k)
    
    # Memory: scores matrix is L x L per head
    mem_bytes = batch_size * num_heads * L * L * 4
    
    print(f"{L:>14} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {t_attn*1e3:>14.2f}")

print("\n" + "=" * 70)
print("Key insight: FLOPs and memory scale quadratically with sequence length")
print("For L=4096 with 12 heads: ~0.8 GB of attention matrices, tens of GFLOPs per layer -- attention becomes the bottleneck")

Worked Example 5: Distributed GEMM and communication cost#

Introduction#

Implement data parallel training with gradient synchronization; measure computation vs. communication time; show communication overhead.

Purpose#

Understand communication bottleneck in distributed training; motivate communication-efficient algorithms.

Importance#

Modern LLMs trained on 1000s of GPUs; communication often dominates; critical for scaling.

What this example demonstrates#

  • Simulate distributed GEMM (matmul on local device).

  • Simulate all-reduce for gradient synchronization.

  • Measure computation time vs. communication time.

  • Show how communication latency scales with number of devices.

Background#

Distributed training divides minibatches across devices; after each minibatch, devices exchange gradients via all-reduce.

Historical context#

Large-batch SGD and gradient compression (2017–2019) driven by communication bottleneck.

History#

Modern frameworks (PyTorch DDP, Horovod) optimize communication; mixed precision + gradient compression reduce overhead.

Prevalence in ML#

Every distributed training uses all-reduce; communication cost is well-studied bottleneck.

Notes#

  • Computation time: $O(B \cdot d_{\text{in}} \cdot d_{\text{out}})$ (linear in batch size, dimensions).

  • Communication time: $O(\log D)$ latency rounds plus $O(d_{\text{gradient}})$ bandwidth (logarithmic in device count $D$, linear in gradient size).

  • For 1000 devices: all-reduce with $\log D \approx 10$ rounds; if each round takes 10 μs, total ~100 μs; computation often takes ms.

Connection to ML#

Large-batch training requires communication efficiency; gradient compression and other tricks essential for practical scaling.

Connection to Linear Algebra Theory#

All-reduce can be implemented as a tree-based collective with $O(\log D)$ latency rounds; bandwidth-optimal ring variants move $O(d)$ data per device regardless of $D$.

Pedagogical Significance#

Shows distributed systems aspect of linear algebra; explains why scaling beyond certain point is challenging.

References#

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  3. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

Solution (Python)#

import numpy as np
import time

np.random.seed(39)

# Distributed training simulation
num_devices = [1, 4, 8, 16, 32]
batch_size = 256
hidden_dim = 2048

print("Distributed GEMM: Computation vs. Communication")
print("=" * 70)
print(f"{'Devices':15} {'Comp Time (ms)':20} {'Comm Time (μs)':20} {'Comp/Comm Ratio':15}")
print("-" * 70)

# Assumed hardware model (illustrative round numbers, not measured):
# - Effective sustained compute: 100 GFLOP/s per device
# - Interconnect bandwidth: 25 GB/s

compute_flops_per_device = 100e9  # FLOP/s (assumed effective throughput)
comm_bandwidth = 25e9  # bytes/s (25 GB/s)

for D in num_devices:
    # Local batch per device
    local_batch = batch_size // D
    
    # GEMM: local_batch x hidden_dim x hidden_dim
    flops_local = 2 * local_batch * hidden_dim * hidden_dim
    
    # Computation time
    t_compute = flops_local / compute_flops_per_device
    
    # Communication: all-reduce of gradients (hidden_dim)
    # Complexity: O(log D) communication rounds
    # Each round transmits O(hidden_dim) data (simplified)
    comm_rounds = int(np.log2(D)) + 1
    gradient_size = hidden_dim * 4  # bytes (FP32)
    comm_per_round = gradient_size / comm_bandwidth
    t_comm = comm_rounds * comm_per_round
    
    ratio = t_compute / t_comm
    
    print(f"{D:>14} {t_compute*1e3:>19.3f} {t_comm*1e6:>19.2f} {ratio:>14.1f}x")

print("\n" + "=" * 70)
print("Key insight: per-device computation shrinks as devices are added while")
print("all-reduce cost grows, so the compute/comm ratio falls -- communication")
print("becomes the bottleneck at scale")


Chapter 4
Linear Maps & Matrices
Key ideas: Introduction

Introduction#

Linear maps (also called linear transformations or functions) are structure-preserving transformations between vector spaces: they respect addition and scalar multiplication. Matrices are their concrete representation: a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ is represented as a matrix $A \in \mathbb{R}^{m \times d}$ so that $f(x) = Ax$. This is the language of neural networks: each layer is a composition of linear maps (matrix multiplications) and nonlinear activations. Understanding linear maps clarifies:

  • Model expressiveness: What functions can be represented? (Universal approximation via composition of linear maps and nonlinearities.)

  • Gradient flow: How do errors backpropagate through layers? (Chain rule uses transposes of linear map matrices.)

  • Data transformation: How do representations change through layers? (Each layer applies a linear map to its input.)

  • Optimization: How should weights change to reduce loss? (Gradient is also a linear map, obtained via transpose.)

Linear maps are everywhere in ML:

  • Neural networks: Each dense layer is a linear map $h_{i+1} = \sigma(W_i h_i + b_i)$ (linear map $W_i$, then activation $\sigma$).

  • Attention: Query/Key/Value projections are linear maps. Attention output is a weighted linear combination.

  • Least squares: Solving $\hat{w} = (X^\top X)^{-1} X^\top y$ involves products of linear maps.

  • PCA: Projection onto principal components is a linear map.

  • Convolution: Convolutional layers are linear maps when viewed in the spatial/frequency domain.

Important Ideas#

1. Linear map = function preserving structure. A function $f: V \to W$ between vector spaces is linear if:

  • Additivity: $f(u + v) = f(u) + f(v)$ for all $u, v \in V$.

  • Homogeneity: $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$.

Why these properties? Linear maps are exactly those that can be written as matrix multiplication: $f(x) = Ax$. Additivity ensures the matrix distributes: $A(x + y) = Ax + Ay$. Homogeneity ensures scaling: $A(\alpha x) = \alpha (Ax)$.

Example: Rotation by angle $\theta$ is linear: $f([x, y]^\top) = [\cos\theta \cdot x - \sin\theta \cdot y, \sin\theta \cdot x + \cos\theta \cdot y]^\top = R_\theta [x, y]^\top$.

Non-example: $f(x) = x + 1$ is not linear (fails $f(0) = 0$ test). $f(x) = \|x\|$ is not linear (not additive).
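The two defining properties can be checked numerically. A minimal sketch (the helper name `is_linear`, the seed, and the tolerance are illustrative choices): random inputs are drawn and additivity/homogeneity are tested for the rotation example and the two non-examples above.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_linear(f, d, trials=100, tol=1e-9):
    """Heuristically test additivity and homogeneity of f on R^d."""
    for _ in range(trials):
        u, v = rng.standard_normal(d), rng.standard_normal(d)
        a = rng.standard_normal()
        if not np.allclose(f(u + v), f(u) + f(v), atol=tol):
            return False
        if not np.allclose(f(a * u), a * f(u), atol=tol):
            return False
    return True

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(is_linear(lambda x: R @ x, 2))                          # rotation: linear
print(is_linear(lambda x: x + 1.0, 2))                        # shift: not linear
print(is_linear(lambda x: np.array([np.linalg.norm(x)]), 2))  # norm: not linear
```

Randomized checks like this are a useful sanity test when implementing custom layers.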

2. Matrix representation is unique (up to basis). For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ with standard bases, the matrix $A \in \mathbb{R}^{m \times d}$ satisfies $f(x) = Ax$ uniquely. Columns of $A$ are images of standard basis vectors: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Why unique? By linearity, $f(x) = f(\sum_j x_j e_j) = \sum_j x_j f(e_j)$. If we know $f$ on basis vectors, we know $f$ everywhere.

Example: $f(x) = 2x_1 + 3x_2$ is $f([x_1, x_2]^\top) = [2, 3] \cdot [x_1, x_2]^\top$. Matrix is $A = [2, 3]$ (1 row, 2 columns).
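The construction "columns of $A$ are images of basis vectors" can be carried out directly in code; a sketch with a hypothetical map `f` on $\mathbb{R}^2$:

```python
import numpy as np

def f(x):
    # f(x) = (2*x1 + x2, x1 - 3*x2): a linear map on R^2 (illustrative)
    return np.array([2 * x[0] + x[1], x[0] - 3 * x[1]])

d = 2
E = np.eye(d)
# Column j of A is the image of the j-th standard basis vector
A = np.column_stack([f(E[:, j]) for j in range(d)])
print(A)  # columns are f(e_1), f(e_2)

x = np.array([0.5, -1.0])
print(np.allclose(A @ x, f(x)))  # True: A represents f everywhere
```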

3. Composition = matrix multiplication. For linear maps $f: \mathbb{R}^d \to \mathbb{R}^m$ with matrix $A$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ with matrix $B$, the composition $g \circ f: \mathbb{R}^d \to \mathbb{R}^p$ has matrix $BA$ (note order: right-to-left in notation, left-to-right in matrix product).

Why this order? $(g \circ f)(x) = g(f(x)) = g(Ax) = B(Ax) = (BA)x$. Matrix product $BA$ is therefore natural for composition.

Example: Neural network layer 1 applies $A_1$, layer 2 applies $A_2$. Composition is $A_2 A_1$ (layer 1 first, then layer 2).
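A quick numerical confirmation of the ordering (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))   # f: R^2 -> R^3
B = rng.standard_normal((2, 3))   # g: R^3 -> R^2
x = rng.standard_normal(2)

# (g o f)(x) = B(Ax) equals (BA)x -- composition is the matrix product BA
print(np.allclose(B @ (A @ x), (B @ A) @ x))  # True
```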

4. Transpose = dual map (adjoint). For matrix $A: \mathbb{R}^d \to \mathbb{R}^m$, the transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^d$ is the unique linear map satisfying: $$ (Ax)^\top y = x^\top (A^\top y) \quad \text{for all } x, y $$

Geometric interpretation: If $A$ rotates a vector, $A^\top$ rotates in the opposite direction (roughly). If $A$ projects onto a subspace, $A^\top$ projects perpendicular to that subspace (in a weighted sense).

In backprop: If forward pass applies $y = Ax$, reverse mode applies $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (transpose carries gradients backward).

Example: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
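The defining adjoint identity $(Ax)^\top y = x^\top (A^\top y)$ is easy to verify numerically; a sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))
x = rng.standard_normal(2)
y = rng.standard_normal(3)

lhs = (A @ x) @ y      # <Ax, y> in R^3
rhs = x @ (A.T @ y)    # <x, A^T y> in R^2
print(np.isclose(lhs, rhs))  # True: transpose is the adjoint
```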

5. Image and kernel characterize a linear map. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:

  • Image (column space): $\text{im}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^d\}$ (all possible outputs). Dimension = rank$(A)$.

  • Kernel (null space): $\ker(A) = \text{null}(A) = \{x : Ax = 0\}$ (inputs mapping to zero). Dimension = nullity$(A) = d - \text{rank}(A)$.

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = d$ (dimension in = rank out + null space).

Why important? Image tells us what the map can represent. Kernel tells us what information is lost. For invertible maps, kernel is trivial (only zero maps to zero).
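Rank, nullity, and a kernel basis can all be computed from the SVD. A sketch using a rank-deficient matrix whose second column is twice the first:

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.],
              [3., 6.]])
rank = np.linalg.matrix_rank(A)
d = A.shape[1]
nullity = d - rank
print(rank, nullity)  # rank + nullity = d, per the rank-nullity theorem

# Kernel basis from the SVD: right singular vectors past the rank
U, s, Vt = np.linalg.svd(A)
kernel_basis = Vt[rank:].T        # columns span null(A)
print(np.allclose(A @ kernel_basis, 0))  # True: these inputs map to zero
```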

Relevance to Machine Learning#

Expressiveness through composition. A single linear map is limited (it can only represent rotations/scalings/projections). Composing many linear maps with nonlinearities dramatically increases expressiveness. The universal approximation theorem (Cybenko 1989) says a single hidden layer with a suitable nonlinearity can approximate any continuous function on a compact set arbitrarily well.

Gradient computation via transposes. Backpropagation is the chain rule applied backward through the network. Gradient w.r.t. input of a layer uses the transpose of the weight matrix. Understanding transposes is essential for implementing and understanding neural networks.

Data transformation and representation learning. Neural networks learn by composing linear maps (weight matrices) with nonlinearities. Early layers learn low-level features (via image of $A_1$). Deep layers compose these into high-level features (via $(A_k \cdots A_2 A_1)$).

Optimization structure. Gradient descent updates weights proportional to $X^\top (Xw - y)$ (linear map composition). Understanding matrix products clarifies why batch size, feature dimension, and conditioning affect optimization.

Algorithmic Development History#

1. Linear transformations (Euler, 1750s-1770s). Euler rotated coordinate systems to solve differential equations and optimize geometry problems. Rotations are linear maps.

2. Matrix algebra (Cayley, Sylvester, 1850s-1880s). Introduced matrices as algebraic objects. Cayley-Hamilton theorem: matrices satisfy their own characteristic polynomial. Matrix multiplication defined to represent composition of linear transformations.

3. Bilinear forms and adjoints (Cauchy, Hermite, Hilbert, 1800s-1900s). Developed duality theory: every linear form has an adjoint. Transpose is the matrix adjoint.

4. Rank and nullity (Grassmann 1844, Frobenius 1870s-1880s). Formalized rank as dimension of image. Rank-nullity theorem central to linear algebra.

5. Spectral theory (Schur 1909, Hilbert 1920s). Schur showed every square matrix is unitarily similar to an upper-triangular matrix; symmetric (and, more generally, normal) matrices decompose into eigenvalues and orthonormal eigenvectors. Spectral decomposition reveals the structure of linear maps.

6. Computational algorithms (Householder 1958, Golub-Kahan 1965): Developed numerically stable algorithms for matrix factorization (QR, SVD, Cholesky). Made linear algebra practical at scale.

7. Neural networks and backprop (Rumelhart, Hinton, Williams 1986). Showed that composing linear maps with nonlinearities, trained via backprop (which uses transposes), learns powerful representations. Modern deep learning.

8. Transformers and attention (Vaswani et al. 2017). All attention operations are linear maps: $\text{softmax}(QK^\top) V$ is a composition of matrix multiplications, softmax (nonlinear), and another multiplication.

Definitions#

Linear map (linear transformation). A function $f: V \to W$ between vector spaces over $\mathbb{R}$ is linear if:

  1. $f(u + v) = f(u) + f(v)$ for all $u, v \in V$ (additivity).

  2. $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$ (homogeneity).

Equivalently: $f(\alpha u + \beta v) = \alpha f(u) + \beta f(v)$ (linearity).

Matrix representation. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$, the matrix $A \in \mathbb{R}^{m \times d}$ represents $f$ if $f(x) = Ax$ for all $x \in \mathbb{R}^d$. Columns of $A$ are: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Image and kernel. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$: $$ \text{im}(A) = \{Ax : x \in \mathbb{R}^d\} = \text{col}(A), \quad \text{ker}(A) = \{x : Ax = 0\} = \text{null}(A) $$

Rank. The rank of $A$ is: $$ \text{rank}(A) = \dim(\text{im}(A)) = \dim(\text{col}(A)) = \text{number of linearly independent columns} $$

Nullity. The nullity of $A$ is: $$ \text{nullity}(A) = \dim(\text{ker}(A)) = d - \text{rank}(A) $$

Rank-nullity theorem. For any matrix $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$

Transpose (adjoint). The transpose of $A \in \mathbb{R}^{m \times d}$ is $A^\top \in \mathbb{R}^{d \times m}$ satisfying: $$(Ax)^\top y = x^\top (A^\top y), \quad (AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A$$

Invertible matrix. A square matrix $A \in \mathbb{R}^{d \times d}$ is invertible (nonsingular) if there exists $A^{-1}$ such that $AA^{-1} = A^{-1} A = I$. Equivalent: $\text{rank}(A) = d$ (full rank), $\ker(A) = \{0\}$ (trivial kernel), $\det(A) \neq 0$ (nonzero determinant).
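The listed equivalences can be checked numerically; a sketch using a small diagonal matrix:

```python
import numpy as np

A = np.array([[1., 0.],
              [0., 2.]])

# Equivalent invertibility criteria
print(np.linalg.matrix_rank(A) == A.shape[0])  # True: full rank
print(not np.isclose(np.linalg.det(A), 0.0))   # True: nonzero determinant

Ainv = np.linalg.inv(A)
print(np.allclose(A @ Ainv, np.eye(2)))        # True: A A^{-1} = I
```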

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rank-nullity theorem. For $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$ Consequences: If $\text{rank}(A) < d$, solutions to $Ax = b$ are not unique (null space is non-trivial). For invertible $A$ (rank = $d$), solutions are unique.

  2. Fundamental theorem of linear algebra. Orthogonal decomposition: $\mathbb{R}^d = \text{col}(A^\top) \oplus \text{null}(A)$ and $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$ (orthogonal direct sums). Basis for all linear algebra.

  3. Universal approximation (Cybenko 1989, Hornik 1991). A neural network with one hidden layer (linear map + nonlinearity + output linear map) can approximate any continuous function on compact sets arbitrarily well (with enough hidden units).

  4. Spectral theorem for symmetric matrices (Cauchy 1829; developed through the 1800s). Every symmetric $A$ has eigendecomposition $A = U \Lambda U^\top$ (orthogonal diagonalization). Basis for PCA, optimization, understanding symmetric structures.

  5. Singular Value Decomposition (Beltrami 1873, Eckart-Young 1936). Every matrix $A \in \mathbb{R}^{m \times d}$ can be written as $A = U \Sigma V^\top$ (orthogonal $U, V$, diagonal $\Sigma$). Reveals low-rank structure, optimal approximations, conditioning.

Why essential: These theorems quantify what linear maps can/cannot represent, how to invert them, when solutions exist, and how to find optimal approximations.

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. Backpropagation and gradient-based learning (Rumelhart et al. 1986, 1990s-present). Automatic differentiation computes gradients via chain rule (composition of matrix transposes). Enables training networks with billions of parameters. All modern deep learning depends on this.

  2. Dense neural networks (Cybenko 1989, Hornik 1991, 1990s-present). Theoretical universality + practical training via backprop = powerful function approximators. AlexNet (2012) showed depth matters: stacking linear maps + activations learns hierarchical representations.

  3. Convolutional Neural Networks (LeCun et al. 1990, AlexNet 2012, ResNet 2015). Structured linear maps (convolution with weight sharing). Dramatically reduced parameters vs. dense. State-of-the-art on vision (ImageNet), object detection, segmentation.

  4. Recurrent Neural Networks and LSTMs (Hochreiter & Schmidhuber 1997, 2000s-present). Apply same linear map over time steps (sequence model for NLP, time series). Enabled machine translation, speech recognition.

  5. Transformers and Attention (Vaswani et al. 2017, Devlin et al. 2018, GPT series 2018-2023). All-attention architecture (linear projections + softmax + matrix multiply). Achieved state-of-the-art across NLP (GLUE, SuperGLUE), vision (ImageNet via ViT), multimodal (CLIP). Scales to trillions of parameters.

  6. Least squares for regression (Gauss, Legendre, Tikhonov, modern methods). Normal equations $(X^\top X) w = X^\top y$ solved via QR/SVD (numerically stable). Classical ML workhorse; fast closed-form solution, interpretable results.

Why essential: These systems achieve state-of-the-art by leveraging linear map structure (composition, transposes, efficient matrix multiply). Understanding linear algebra is necessary to design architectures, optimize, and debug.

Key ideas: Where it shows up

1. Backpropagation and Gradient Flow — Transpose carries errors backward#

Major achievements:

  • Backpropagation (Rumelhart, Hinton, Williams 1986): Efficient algorithm for computing gradients through neural networks via chain rule. Each layer applies $y = \sigma(W x + b)$; backward pass uses $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$ (transpose carries gradients).

  • Modern deep learning (1990s-2010s): Backprop enabled training of deep networks (10-1000+ layers). Scaling to billions of parameters (GPT, Vision Transformers).

  • Automatic differentiation (1980s-present): Frameworks (TensorFlow, PyTorch) implement backprop automatically by composing transposes. Practitioners never write transposes explicitly; framework handles it.

  • Applications: All supervised learning, reinforcement learning, generative models. Billions of backprop steps every day globally.

Connection to linear maps: Forward pass chains linear maps with nonlinearities: $h_i = \sigma_i(A_i h_{i-1})$, so $f = \sigma_k \circ A_k \circ \cdots \circ \sigma_1 \circ A_1$. Backward pass propagates the loss gradient through the transposes: $\nabla_x L = A_1^\top D_1 A_2^\top D_2 \cdots A_k^\top D_k \, \nabla_{h_k} L$, where $D_i$ is the diagonal matrix of activation derivatives $\sigma_i'$.
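The transpose chain can be verified against finite differences. A minimal two-layer sketch (dimensions, seed, and the squared-norm loss are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A1 = rng.standard_normal((4, 3))
A2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

def loss(x):
    h = np.tanh(A1 @ x)        # layer 1: linear map + nonlinearity
    y = A2 @ h                 # layer 2: linear map
    return 0.5 * np.sum(y**2)  # illustrative scalar loss

# Backward pass: transposes carry the gradient from output to input
h = np.tanh(A1 @ x)
y = A2 @ h
dL_dy = y                              # d(0.5 ||y||^2)/dy
dL_dh = A2.T @ dL_dy                   # through layer 2
dL_dx = A1.T @ ((1 - h**2) * dL_dh)    # through tanh' then layer 1

# Central finite-difference check of the analytic gradient
eps = 1e-6
num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(dL_dx, num, atol=1e-5))  # True
```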

2. Neural Network Layers — Linear maps + activation functions#

Major achievements:

  • Dense layers (Rosenblatt Perceptron 1958, MLPs 1970s-1980s): Input $x$, linear map $h = Wx + b$, activation $y = \sigma(h)$ (ReLU, sigmoid, tanh). Each layer is a learnable linear map.

  • Depth (ResNets, He et al. 2015; Transformers, Vaswani et al. 2017): 50-1000 layers. Skip connections $x_{i+1} = \sigma(W_i x_i + b_i) + x_i$ allow training very deep networks. Each skip branch is a composition of linear maps.

  • Scaling (AlexNet 2012, GPT-3 2020, Gato 2022): Modern networks: billions to trillions of parameters. Matrix multiply dominates computation. Large linear maps $W \in \mathbb{R}^{4096 \times 4096}$ applied to batches.

  • Optimization: Understanding composition of linear maps helps explain generalization (implicit regularization favors low-complexity solutions in the span of data).

Connection to linear maps: Each dense layer is $W: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$. Network composes $W_k \circ \sigma \circ W_{k-1} \circ \sigma \circ \cdots \circ W_1$. Expressiveness comes from depth (composition) and nonlinearity ($\sigma$).

3. Attention Mechanism — Multi-head projections and weighted sums#

Major achievements:

  • Scaled dot-product attention (Vaswani et al. 2017): Queries, Keys, Values are projections (linear maps) $Q = XW_Q, K = XW_K, V = XW_V$. Attention weights $A = \text{softmax}(QK^\top / \sqrt{d_k})$. Output $\text{Attention}(Q,K,V) = AV$ (matrix multiply with softmax-weighted rows).

  • Multi-head attention: $h$ heads, each applying different linear projections. Concatenate: $\text{MultiHead}(Q,K,V) = \text{Concat}(A_1, \ldots, A_h) W^O$ (linear map combines heads).

  • Transformers (Vaswani 2017, Devlin et al. 2018): Attention layers (all linear maps + softmax) in sequence. BERT, GPT achieve state-of-the-art across NLP tasks.

  • Scale: GPT-3 (175B parameters), PaLM (540B), GPT-4. Training scales across thousands of GPUs, with matrix multiplication as bottleneck.

Connection to linear maps: Attention is composition of linear maps: $\text{Attention} = A V$ where $A = \text{softmax}(Q K^\top / \sqrt{d_k})$. Each head applies different linear projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$. Output is weighted linear combination of values.
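The per-head projections, softmax weighting, concatenation, and output projection can be sketched in a few lines (all dimensions and weights below are illustrative random values, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(4)
L, d, h = 4, 8, 2        # sequence length, model dim, heads (illustrative)
d_k = d // h
X = rng.standard_normal((L, d))
W_Q = rng.standard_normal((h, d, d_k))
W_K = rng.standard_normal((h, d, d_k))
W_V = rng.standard_normal((h, d, d_k))
W_O = rng.standard_normal((d, d))

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)  # subtract row max for stability
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]  # per-head linear projections
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))      # attention weights
    heads.append(A @ V)                           # weighted sum of values
out = np.concatenate(heads, axis=-1) @ W_O        # linear map combines heads
print(out.shape)  # (4, 8): back to (sequence length, model dim)
```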

4. Least Squares and Regression — Normal equations as linear system#

Major achievements:

  • Least squares (Gauss, Legendre, early 1800s): Solve $\min_w \|Xw - y\|_2^2$. Normal equations: $(X^\top X) w = X^\top y$. Linear system $Aw = b$ (product of two linear maps).

  • Ridge regression (Tikhonov 1963, Hoerl & Kennard 1970): Add regularization $\min_w (\|Xw - y\|_2^2 + \lambda \|w\|_2^2)$. Solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$ (invertible for any $\lambda > 0$).

  • LASSO (Tibshirani 1996): L1 regularization forces sparsity. Solved via proximal methods (composition of proximal operators, each a linear map or projection).

  • Kernel methods (Mercer 1909, Schölkopf & Smola 2001): Non-linear regression via Gram matrix $K = X X^\top$ (product of linear maps, then apply kernel trick).

Connection to linear maps: Normal equations involve products of matrices: $X^\top X$ (composition of $X^\top$ and $X$), $X^\top y$ (linear map applied to $y$). Solution involves matrix inversion (inverse is also a linear map).

5. Convolutional and Recurrent Networks — Structured linear maps#

Major achievements:

  • CNNs (LeCun et al. 1990s, AlexNet 2012, ResNet 2015): Convolutional layers are linear maps with weight sharing (same weights applied across spatial positions). Reduces parameters vs. dense layer (e.g., conv 3×3×64→64 channels vs. dense with same feature count).

  • RNNs, LSTMs (Hochreiter & Schmidhuber 1997): Recurrent layers apply the same linear map $W$ repeatedly over time: $h_t = \sigma(W h_{t-1} + U x_t)$ (composition of linear maps over time steps).

  • Efficiency: Weight sharing and structured matrices (convolution, recurrence) reduce parameters and computation compared to dense layers.

  • Interpretability: Convolutional structure learned by early layers is interpretable (edge filters, textures). Linear maps with structured sparsity/sharing have semantic meaning.

Connection to linear maps: Conv layer is a linear map (convolution can be written as matrix multiplication with Toeplitz structure). RNN applies same linear map repeatedly: composition $W \circ W \circ \cdots \circ W$ over time.
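The Toeplitz view of convolution can be demonstrated directly; a sketch comparing an explicitly built matrix against np.convolve (the filter and signal values are arbitrary):

```python
import numpy as np

w = np.array([1., -2., 1.])            # filter (illustrative)
x = np.array([3., 1., 4., 1., 5., 9.]) # signal (illustrative)
n, k = len(x), len(w)
m = n - k + 1                          # number of 'valid' output positions

# Each row of T holds the flipped filter, shifted one position right:
# a Toeplitz-structured matrix with shared weights across positions.
T = np.zeros((m, n))
for i in range(m):
    T[i, i:i + k] = w[::-1]            # np.convolve flips the kernel

print(np.allclose(T @ x, np.convolve(x, w, mode='valid')))  # True
```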

Notation

Standard Conventions#

1. Linear map and matrix notation.

  • Linear map: $f: V \to W$ or $A: \mathbb{R}^d \to \mathbb{R}^m$ (function notation).

  • Matrix representation: $A \in \mathbb{R}^{m \times d}$ or $[A]_{ij}$ for entry in row $i$, column $j$.

  • Matrix-vector product: $y = Ax$ (linear map applied to vector $x$).

  • Matrix-matrix product: $C = AB$ (composition: apply $B$ then $A$).

  • Image and kernel: $\text{im}(A)$ or $\text{col}(A)$ for column space; $\ker(A)$ or $\text{null}(A)$ for null space.

Examples:

  • Linear map: $f(x) = 3x_1 - 2x_2 \in \mathbb{R}$. Matrix: $A = [3, -2] \in \mathbb{R}^{1 \times 2}$.

  • Linear map: $(x, y) \mapsto (2x + y, x - 3y)$. Matrix: $A = \begin{bmatrix} 2 & 1 \\ 1 & -3 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$.

  • Composition: Neural network layer 1: $h_1 = \sigma(W_1 x)$, layer 2: $h_2 = \sigma(W_2 h_1) = \sigma(W_2 \sigma(W_1 x))$. Composition: $f = \sigma \circ (W_2 \circ \sigma \circ W_1)$.

2. Rank notation.

  • Rank: $\text{rank}(A)$ = dimension of column space = number of linearly independent columns.

  • Nullity: $\text{nullity}(A) = d - \text{rank}(A)$ (dimension of null space).

  • Full rank: $\text{rank}(A) = \min(m, d)$ (maximum possible rank).

  • Rank deficient: $\text{rank}(A) < \min(m, d)$ (singular or near-singular).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 2 (full rank), columns independent.

  • $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 1 (rank deficient), second column = 2 × first column.

3. Transpose notation.

  • Transpose: $A^\top$ (rows and columns swapped).

  • Adjoint property: $(Ax)^\top y = x^\top (A^\top y)$ (inner product duality).

  • Composition rule: $(AB)^\top = B^\top A^\top$ (note reversed order).

  • Inverse of transpose: $(A^\top)^{-1} = (A^{-1})^\top$ (for invertible $A$).

Examples:

  • $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$.

  • Gradient in backprop: $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (linear map $A$ → transpose $A^\top$ for gradient).

4. Composition and chaining notation.

  • Composition operator: $(f \circ g)(x) = f(g(x))$ (apply $g$ first, then $f$).

  • Matrix chaining: For $f = A, g = B$, composition is $f \circ g = A \circ B$ with matrix product $AB$ (apply $B$ then $A$).

  • Neural network layers: Output $h_i = \sigma_i(A_i h_{i-1})$ (chain $A_1, \sigma_1, A_2, \sigma_2, \ldots$).

Examples:

  • Rotate by $\theta$, then scale by $2$: $S_2 \circ R_\theta$. Matrix: $S_2 R_\theta$.

  • Neural network: $f(x) = \sigma_2(A_2 \sigma_1(A_1 x))$. Composition: $\sigma_2 \circ A_2 \circ \sigma_1 \circ A_1$.

5. Invertibility and determinant notation.

  • Invertible (nonsingular): $A^{-1}$ exists; $AA^{-1} = A^{-1} A = I$.

  • Determinant: $\det(A)$ or $|A|$. For invertibility: $\det(A) \neq 0 \Leftrightarrow A$ invertible.

  • Condition number: $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max} / \sigma_{\min}$ (ratio of largest to smallest singular value).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. $\det(A) = 2 \neq 0$, so $A$ is invertible. $A^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1/2 \end{bmatrix}$.

  • Ill-conditioned matrix: $\kappa(A) = 10^{10}$ (nearly singular). Small perturbations cause large changes in solution. Use regularization or preconditioning.
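The sensitivity claim can be illustrated by perturbing the right-hand side of a well-conditioned and a nearly singular system (the matrices and perturbation size below are illustrative):

```python
import numpy as np

def sensitivity(A, b, db):
    """Relative change in the solution of Ax = b under perturbation db."""
    x = np.linalg.solve(A, b)
    x2 = np.linalg.solve(A, b + db)
    return np.linalg.norm(x2 - x) / np.linalg.norm(x)

b = np.array([1., 1.])
db = np.array([0., 1e-6])  # tiny perturbation of the data

A_good = np.array([[1., 0.], [0., 1.]])        # kappa = 1
A_bad = np.array([[1., 1.], [1., 1.0000001]])  # nearly singular

print(np.linalg.cond(A_good), sensitivity(A_good, b, db))  # tiny change
print(np.linalg.cond(A_bad), sensitivity(A_bad, b, db))    # huge amplification
```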

6. Special matrices notation.

  • Identity: $I \in \mathbb{R}^{d \times d}$ (diagonal matrix with 1’s).

  • Orthogonal/orthonormal: $Q^\top Q = QQ^\top = I$ (columns/rows orthonormal).

  • Symmetric: $A^\top = A$.

  • Positive semi-definite (PSD): $A \succeq 0$; all eigenvalues $\geq 0$. Covariance matrices are PSD.

Examples:

  • QR decomposition: $A = QR$ where $Q$ orthonormal, $R$ upper triangular.

  • Symmetric matrix: $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$. Eigendecomposition: $\Sigma = U \Lambda U^\top$ (orthonormal $U$, diagonal $\Lambda$).

  • PSD matrix: Covariance $\text{Cov}(X) \succeq 0$ (always PSD). Gram matrix $G = X^\top X \succeq 0$ (always PSD).
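A quick check that Gram matrices are symmetric PSD (random data; the shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 4))
G = X.T @ X                       # Gram matrix

eigvals = np.linalg.eigvalsh(G)   # eigvalsh: for symmetric matrices
print(np.all(eigvals >= -1e-10))  # True: nonnegative up to roundoff
print(np.allclose(G, G.T))        # True: symmetric
```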

Pitfalls & sanity checks

When working with linear maps and matrices:

  1. Always check shapes. Matrix multiply requires compatible dimensions. $A \in \mathbb{R}^{m \times d}$, $x \in \mathbb{R}^d$ yields $Ax \in \mathbb{R}^m$. Shape mismatch = runtime error.

  2. Prefer stable decompositions. Never compute $(X^\top X)^{-1}$ explicitly. Use QR (via solve) or SVD (truncate small singular values) for numerical stability.

  3. Transpose order matters. $(AB)^\top = B^\top A^\top$ (reversed order). In backprop, composition reverses layer order via transposes.

  4. Condition number determines stability. If $\kappa(A) > 10^8$, expect numerical errors. Use regularization (Ridge, Tikhonov) or preconditioning.

  5. Gradients flow via transposes. Backprop systematically applies transposes. Understand: ill-conditioned weights → vanishing/exploding gradients.
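Pitfall 2 can be demonstrated on an ill-conditioned design matrix; a sketch (the Vandermonde design and polynomial degree are illustrative) comparing the normal equations against an SVD-based solver:

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0, 1, 50)
X = np.vander(t, 10)              # ill-conditioned polynomial design matrix
w_true = rng.standard_normal(10)
y = X @ w_true                    # noiseless target: exact solution is w_true

# Normal equations square the condition number: kappa(X^T X) = kappa(X)^2
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
# SVD-based least squares: numerically stable
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

err_normal = np.linalg.norm(w_normal - w_true)
err_lstsq = np.linalg.norm(w_lstsq - w_true)
print(f"cond(X) = {np.linalg.cond(X):.2e}")
print(f"normal equations error: {err_normal:.2e}")
print(f"lstsq (SVD) error:      {err_lstsq:.2e}")
```

The stable solver recovers the true coefficients to many more digits than the explicitly formed normal equations.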

References

Foundational texts:

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge University Press.

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

Linear maps and matrix theory:

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

  2. Hoffman, K., & Kunze, R. (1971). Linear Algebra (2nd ed.). Prentice-Hall.

  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

  4. Axler, S., Bourdon, P., & Ramey, W. (2001). Harmonic Function Theory (2nd ed.). Springer.

Neural networks and backpropagation:

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533–536.

  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  3. Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM.

  4. LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). “Efficient backprop.” In Neural Networks: Tricks of the Trade (pp. 9–50). Springer.

Optimization:

  1. Robbins, H., & Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22(3), 400–407.

  2. Nesterov, Y. (2018). Lectures on Convex Optimization (2nd ed.). Springer.

  3. Kingma, D. P., & Ba, J. (2014). “Adam: A method for stochastic optimization.” arXiv:1412.6980.

Transformers and attention:

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). “Attention is all you need.” In NeurIPS (pp. 5998–6008).

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding.” NAACL.

  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). “An image is worth 16×16 words: Transformers for image recognition at scale.” ICLR.

Least squares and numerical methods:

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Dover reprint.

  2. Golub, G. H., & Pereyra, V. (1973). “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate.” SIAM Journal on Numerical Analysis, 10(2), 413–432.

Five worked examples

Worked Example 1: Backprop uses transpose#

Problem. For $y = Wx$, show $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$.

Solution (math). The Jacobian of $y = Wx$ is $W$; the reverse-mode chain rule applies its transpose.

Solution (Python).

import numpy as np

W = np.array([[2., 1.], [-1., 3.]])
dL_dy = np.array([0.5, -2.])
print(W.T @ dL_dy)  # gradient with respect to x

Worked Example 2: Q,K,V projections in transformers#

Problem. Compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$.

Solution (math). These are linear maps from the model dimension to the per-head dimensions.

Solution (Python).

import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])
Wq = np.array([[1., 0.], [0., 2.]])
Wk = np.array([[2., 0.], [0., 1.]])
Wv = np.array([[1., 1.], [0., 1.]])
print(X @ Wq)
print(X @ Wk)
print(X @ Wv)

Worked Example 3: Normal equations matrix#

Problem. Form $A = X^\top X$ and $b = X^\top y$ for least squares.

Solution (math). Solving $Aw = b$ is equivalent to minimizing $\|Xw - y\|_2^2$ when $X$ has full column rank.

Solution (Python).

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.]])
y = np.array([1., 2., 2.5])
A = X.T @ X
b = X.T @ y
print(A)
print(b)

Worked Example 4: Batch GD as matrix products#

Problem. Compute one gradient step for MSE.

Solution (math). $w \leftarrow w - \eta \, \frac{1}{n} X^\top (Xw - y)$.

Solution (Python).

import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
y = np.array([1., 0., 1.])
w = np.zeros(2)
eta = 0.1
g = (1 / len(X)) * X.T @ (X @ w - y)
print(w - eta * g)

Worked Example 5: Attention is matrix multiplication#

Problem. Compute $A = \text{softmax}(QK^\top / \sqrt{d_k})$ and output $O = AV$.

Solution (math). Attention is a composition of matrix multiplications plus a row-wise softmax.

Solution (Python).

import numpy as np

def softmax(S, axis=-1):
    # Numerically stable row-wise softmax
    S = S - S.max(axis=axis, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

Q = np.array([[1., 0.], [0., 1.]])
K = np.array([[1., 0.], [1., 1.], [0., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])
scores = Q @ K.T / np.sqrt(2)  # d_k = 2
A = softmax(scores, axis=1)
print(A @ V)
