Chapter 16
Matrix Products in DL & Transformers
Key ideas: Introduction

Introduction#

Neural networks are fundamentally stacks of matrix multiplications. A forward pass through a deep network is a product of weight matrices and activations. Each layer computes $A_{l+1} = \sigma(W_l A_l)$ where $W_l \in \mathbb{R}^{d_{l+1} \times d_l}$, $A_l \in \mathbb{R}^{d_l \times n}$ (batch size $n$, layer input dimension $d_l$, output dimension $d_{l+1}$). The cost is $O(n d_l d_{l+1})$ per layer. Transformers add attention: $\text{Attention}(Q, K, V) = \text{softmax}(Q K^\top / \sqrt{d_k}) V$, which involves two GEMM operations and a row-wise softmax. For a Transformer with sequence length $L$ and hidden dimension $d$, attention cost is $O(L^2 d)$ per layer (quadratic in sequence length, a major bottleneck). Modern accelerators (GPUs, TPUs) are matrix-multiply engines capable of trillions of floating-point operations per second (TFLOPS). Utilization depends on arithmetic intensity (ops/byte): bandwidth-bound operations underutilize the accelerator; compute-bound operations (high arithmetic intensity) achieve near-peak performance. Understanding how to write matrix products that achieve high arithmetic intensity, and how to distribute them across devices, determines whether you can train billion-parameter models.
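
The per-layer recurrence can be sketched directly in NumPy (a minimal sketch: the layer widths, ReLU nonlinearity, and row-major $A W^\top$ layout are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Batch of n = 64 examples stored as rows; widths d_0 -> d_1 -> d_2.
# The text writes the column-major form A_{l+1} = sigma(W_l A_l);
# with row-major activations the same layer is A @ W.T.
n, dims = 64, [784, 512, 256]
A = rng.standard_normal((n, dims[0]))

total_flops = 0
for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = rng.standard_normal((d_out, d_in)) * 0.01
    A = relu(A @ W.T)                      # one GEMM per layer
    total_flops += 2 * n * d_in * d_out    # 2*n*d_in*d_out FLOPs per layer

print(A.shape)       # (64, 256)
print(total_flops)   # 68157440
```

Consistent with the $O(n d_l d_{l+1})$ formula, the wider first layer accounts for roughly three quarters of the total FLOPs here.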

Important ideas#

  1. Matrix-matrix multiplication (GEMM) structure

    • Dense GEMM: $C \leftarrow AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

    • Arithmetic: $2mkn = O(mkn)$ floating-point operations (FLOPs): $mkn$ multiplications plus $mkn$ additions.

    • Memory: $O(mk + kn + mn)$ words (inputs + output); with $m, k, n$ in the thousands, operands can occupy GBs of memory.

    • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes}} = \frac{2mkn}{8(mk + kn + mn)} = \frac{mkn}{4(mk + kn + mn)}$ in double precision (higher is better).

  2. Blocking and cache efficiency

    • GEMM blocked into $b \times b$ tiles; each tile multiplied using fast cache.

    • Cache line length (64 bytes typical); GEMM loads tile once, reuses it $O(b)$ times.

    • Roofline model: peak FLOP rate vs. memory bandwidth; if arithmetic intensity is below the ridge point $I_{\text{ridge}} = \text{peak FLOP rate} / \text{bandwidth}$, the algorithm is bandwidth-bound.

  3. Batch matrix multiplication (batched GEMM)

    • Batched form: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size $B$); one independent GEMM per batch element.

    • Exploit parallelism: process multiple batches on multiple cores/GPU SMs.

    • Highly efficient when batch size is large; small batches underutilize accelerator.

  4. Convolution as matrix multiplication (im2col, Winograd)

    • Convolution unfolds as GEMM: reshape input patches into columns; multiply by filter matrix; reshape output.

    • im2col: input image to column matrix; allows use of highly optimized GEMM (cuBLAS, MKL).

    • Cost: $O(k_h k_w \, d_{\text{in}} d_{\text{out}} h_{\text{out}} w_{\text{out}})$ (kernel height/width, input/output channels, output spatial dims).

    • Winograd: fast convolution via a transformed domain; reduces arithmetic at the cost of numerical stability and implementation complexity.

  5. Scaled dot-product attention

    • Query-key-value paradigm: $Q, K \in \mathbb{R}^{L \times d_k}$, $V \in \mathbb{R}^{L \times d_v}$ (sequence length $L$, head dimensions $d_k$, $d_v$).

    • Attention: (1) $M = Q K^\top / \sqrt{d_k}$, an $(L \times d_k)(d_k \times L)$ product costing $O(L^2 d_k)$; (2) $A = \text{softmax}(M)$ (per-row normalization, no matrix products); (3) $O = AV$, an $(L \times L)(L \times d_v)$ product costing $O(L^2 d_v)$.

    • Total: $O(L^2 (d_k + d_v)) = O(L^2 d)$ (quadratic in sequence length).

    • Challenge: for $L = 4096$, the attention matrix alone has $L^2 \approx 16.8$ million entries per head; across heads and layers, the FLOP count runs into the billions.

  6. Mixed precision and numerical stability

    • FP32 (single precision, float32): 32 bits, ~7 significant digits; gradients, weights commonly stored in FP32.

    • FP16 (half precision, float16): 16 bits, ~3 significant digits; representable magnitudes from $\approx 6 \times 10^{-8}$ (smallest subnormal) to $65504$; GPU tensor-core operations 2–10× faster than FP32.

    • BFloat16 (Brain Float): 16 bits, same exponent range as FP32, reduced mantissa; intermediate between FP32 and FP16.

    • Mixed precision: compute GEMM in FP16 (fast), accumulate in FP32 (stable); scale loss to prevent underflow.

  7. Distributed matrix multiplication

    • Data parallelism: replicate model on each device; partition minibatches; synchronize gradients via all-reduce.

    • Model parallelism: partition weight matrices across devices; communication occurs inside the matrix product (e.g., local matmuls followed by an all-reduce or all-gather of partial results).

    • Pipeline parallelism: partition layers into stages across devices; keep stages busy by overlapping computation on different microbatches.

    • Cost: compute + communication; communication often dominates at large scale.

Relevance to ML#

  • Convolutional neural networks (CNNs): Forward and backward passes are GEMM-heavy; efficiency determines whether large image and video datasets are trainable in reasonable time.

  • Recurrent neural networks (RNNs), LSTMs, GRUs: Fully-connected layers between timesteps; matrix products per timestep.

  • Transformers and large language models: Attention adds $O(L^2 d)$ matrix products on top of the $O(L d^2)$ projection and MLP GEMMs; for GPT-3 ($L = 2048$, $d = 12288$), the dense GEMMs dominate, while attention dominates once $L$ grows past $d$.

  • Graph neural networks (GNNs): Graph convolution is sparse matrix product; efficiency depends on sparsity and format.

  • Distributed training: Modern LLMs trained on thousands of GPUs/TPUs; communication cost (network bandwidth) often exceeds computation cost.

Algorithmic development (milestones)#

  • 1969: Strassen algorithm: $O(n^{2.807})$ vs. $O(n^3)$ naive GEMM (theoretically significant; rarely used in practice due to constants).

  • 1979–1990: Level-1/2/3 BLAS (Basic Linear Algebra Subprograms); standardized interface for matrix ops; LAPACK (1992) built on BLAS.

  • 1995–2005: GPU era begins (NVIDIA GeForce, Tesla); GPUs offer roughly an order of magnitude more memory bandwidth than CPUs; GEMMs run 10–100× faster.

  • 2006: CUDA (Compute Unified Device Architecture) released; enables general-purpose GPU computing; cuBLAS optimized GEMM for NVIDIA GPUs.

  • 2012: AlexNet (Krizhevsky et al.) demonstrates deep CNN training on GPUs; training time is dominated by GEMM-heavy layers.

  • 2015: Batch normalization (Ioffe & Szegedy); reduces sensitivity to initialization and stabilizes training of deep networks.

  • 2017: Transformer architecture (Vaswani et al.); attention is dense GEMM-based; quadratic in sequence length.

  • 2017–2018: Mixed precision training (Micikevicius et al. 2018); FP16 compute with FP32 accumulation and loss scaling yields 2–3× speedups on tensor-core GPUs.

  • 2018–2020: Distributed training frameworks mature (PyTorch DDP, Horovod); multi-billion-parameter models trained via data, tensor, and pipeline parallelism.

  • 2020–2023: FlashAttention (Dao et al. 2022) reduces attention memory via tiled, IO-aware computation that never materializes the full $L \times L$ matrix; Megatron-LM and DeepSpeed scale distributed GEMMs to petaflop clusters.

Definitions#

  • GEMM (General Matrix Multiply): $C \leftarrow \alpha A B + \beta C$ (standard matrix multiply with scaling/accumulation).

  • FLOP (floating-point operation): One add or multiply; GEMM $C \leftarrow AB$ is $2mkn$ FLOPs.

  • Arithmetic intensity: $I = \frac{\text{FLOPs}}{\text{bytes read/written}}$ (ops per byte); high $I$ means compute-bound; low $I$ means bandwidth-bound.

  • Roofline model: Peak achievable throughput = $\min(\text{peak FLOP rate}, \text{memory bandwidth} \times \text{arithmetic intensity})$.

  • Memory-bound: Algorithm where memory bandwidth is bottleneck; cannot achieve peak FLOP rate.

  • Compute-bound: Algorithm where compute is bottleneck; limited by FLOPs/cycle, not memory.

  • Mixed precision: Using multiple precision levels (e.g., FP16 for compute, FP32 for accumulation) to trade accuracy for speed.

  • All-reduce: Distributed operation: each device sums its values with all others; result replicated on all devices. Cost: $O(\log D)$ communication rounds for $D$ devices.

  • Collective communication: Broadcasting, all-reduce, reduce-scatter, all-gather operations in distributed training.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • GEMM arithmetic and complexity: $O(mkn)$ FLOPs, memory $O(mk + kn + mn)$. Reference: Golub & Van Loan (2013).

  • Arithmetic intensity and Roofline model: $I = \text{FLOPs/bytes}$; peak rate is $\min(\text{FLOP rate}, \text{bandwidth} \times I)$. Reference: Williams et al. (2009).

  • Cache-oblivious algorithms: Block-recursive GEMM achieves near-optimal cache behavior independent of cache size. Reference: Frigo et al. (1999).

  • Batched GEMM: Independent products $C_i \leftarrow A_i B_i$; parallelism across the batch dimension. Reference: Dongarra et al. (1990) (Level-3 BLAS) for the underlying GEMM.

  • Attention complexity: Scaled dot-product $O(L^2 d)$ without optimizations; challenges for long contexts. Reference: Vaswani et al. (2017).

  • Distributed GEMM: Communication cost for gradient all-reduce, model/data parallelism. Reference: Thakur et al. (2005) (MPI Collective Communications).

Applied (landmark systems)#

  • Level-3 BLAS (cuBLAS, MKL): Industry-standard GEMM implementations; peak performance on CPUs/GPUs. Implementation: NVIDIA cuBLAS, Intel MKL. Reference: Dongarra et al. (1990) (BLAS 3).

  • Convolution as GEMM (im2col): Standard in Caffe and PyTorch; enables reuse of optimized GEMM. Implementation: PyTorch conv2d uses an im2col-style unfold path on CPU. Reference: Jia et al. (2014) (Caffe).

  • Mixed precision training: Automatic mixed precision in PyTorch (torch.cuda.amp), TensorFlow (tf.keras.mixed_precision). Achieves 2–3× speedup on V100/A100. Reference: NVIDIA Automatic Mixed Precision Training Guide (2020).

  • Distributed GEMM (Megatron-LM, DeepSpeed): Tensor parallelism partitions GEMM across devices; pipeline parallelism overlaps layers. Implementation: Microsoft DeepSpeed, NVIDIA Megatron-LM. Reference: Shoeybi et al. (2019); Rasley et al. (2020).

  • Flash Attention: IO-efficient attention via blocked matrix products; reduces memory bandwidth by 10×. Implementation: Tri Dao’s flash-attention library. Reference: Dao et al. (2022).

Key ideas: Where it shows up
  1. Convolutional neural networks and image classification

    • Forward pass: convolutional layers (im2col GEMM), batch norm (element-wise), pooling (no GEMM).

    • Backward: weight gradient, input gradient via GEMM.

    • Achievements: Goyal et al. (2017) train ResNet-50 on ImageNet in 1 hour using 256 GPUs; mixed precision further reduces training time. References: Krizhevsky et al. (2012) (AlexNet); He et al. (2015) (ResNet); Goyal et al. (2017) (accurate large-batch SGD).

  2. Transformer models and large language models

    • Per-layer: QKV projections (3 GEMMs), attention (2 GEMMs), output projection (1 GEMM), MLP (2 GEMMs) = 8 GEMMs per layer.

    • Attention cost: $O(L^2 d)$ (quadratic in sequence length); dominates for long sequences.

    • Achievements: GPT-3 (Brown et al. 2020) trained on 300 billion tokens (roughly $3 \times 10^{23}$ FLOPs), parallelized across thousands of GPUs using model parallelism. FlashAttention (Dao et al. 2022) reduces attention memory by ~10×. References: Vaswani et al. (2017) (Transformer); Brown et al. (2020) (GPT-3); Dao et al. (2022) (FlashAttention).

  3. Distributed training and synchronization

    • Data parallelism: gradient all-reduce after each minibatch.

    • Model parallelism: gradient exchanges within matrix products.

    • Achievements: The LAMB optimizer (You et al. 2019) enables BERT pretraining in 76 minutes with batch sizes up to 32k on a TPU Pod. Megatron-LM (Shoeybi et al. 2019) trains GPT models with tensor parallelism. References: You et al. (2019) (LAMB); Shoeybi et al. (2019) (Megatron-LM).

  4. Mixed precision training

    • Automatic mixed precision (AMP): dynamically select FP16/FP32 for operations.

    • Loss scaling: prevent FP16 gradient underflow.

    • Achievements: NVIDIA Automatic Mixed Precision reduces training time by 2–3× on V100/A100 while maintaining accuracy. References: NVIDIA Mixed Precision Training guide; Micikevicius et al. (2018).

  5. Graph neural networks and sparse matrix products

    • Graph convolution: $X' = \sigma(AXW)$ where $A$ is sparse adjacency matrix.

    • Sparse-dense GEMM: $O(\text{nnz}(A) \cdot d)$ FLOPs; arithmetic intensity is lower than dense GEMM, but the cost is feasible for sparse graphs.

    • Achievements: DGL, PyG enable billion-node GNNs via optimized sparse GEMMs. References: Kipf & Welling (2017) (GCN); Wang et al. (2019) (DGL); Fey et al. (2019) (PyG).
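
The graph convolution $X' = \sigma(AXW)$ above can be sketched with SciPy's sparse-dense products (a minimal sketch: the random graph, layer dimensions, and ReLU are illustrative assumptions):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n_nodes, d_in, d_out = 1000, 64, 32

# Sparse adjacency (~1% density), stored as CSR so A @ X is a
# sparse-dense GEMM costing O(nnz(A) * d_in)
A = sp.random(n_nodes, n_nodes, density=0.01, format="csr", random_state=1)
X = rng.standard_normal((n_nodes, d_in))
W = rng.standard_normal((d_in, d_out)) * 0.1

# Evaluation order matters: (A @ X) @ W costs
# O(nnz(A)*d_in + n*d_in*d_out), far cheaper than any dense n x n product
H = np.maximum((A @ X) @ W, 0.0)   # ReLU(A X W)
print(H.shape, A.nnz)              # -> (1000, 32) 10000
```

Parenthesizing as $(AX)W$ rather than $A(XW)$ changes only which GEMM runs first here, but keeping the sparse product on the small feature dimension is the standard GCN layout.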

Notation
  • Matrix product: $C \leftarrow A B$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

  • Batched product: $C_i \leftarrow A_i B_i$ for $i = 1, \ldots, B$ (batch size); vectorization across batch.

  • Attention: $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) V$ with $Q, K, V \in \mathbb{R}^{L \times d}$ (sequence length $L$, dimension $d$).

  • Complexity: Attention is $O(L^2 d)$ FLOPs; dense GEMM is $O(n d_{\text{in}} d_{\text{out}})$ per layer (batch size $n$).

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$ (depends on matrix shapes; higher $I$ achieves better GPU utilization).

  • FLOP rate: Peak $P$ (e.g., ~15 TFLOPS for V100 in FP32); practical: $P \times \text{efficiency}$ (typically 50–80%).

  • Memory bandwidth: $B$ (e.g., 900 GB/s for V100 HBM2); roofline: achieved throughput $= \min(P, I \times B)$.

  • Example: ResNet-50 forward pass: ~8 GFLOPs per image; batch size 256 gives ~2 TFLOPs of work; at ~80% utilization of an A100 (~16 TFLOPS sustained), time $\approx$ 0.13 s.

Pitfalls & sanity checks
  • Batch size too small: GPUs underutilized; poor arithmetic intensity. Typical minimum: 32–64 per device.

  • Tall-skinny GEMM: Low arithmetic intensity; underutilize accelerator. Prefer square or batched products.

  • Ignoring data layout: Row-major vs. column-major affects cache performance by 10×.

  • Mixed precision without loss scaling: FP16 gradients below the subnormal floor ($\approx 6 \times 10^{-8}$) flush to zero, and nearby values lose most of their precision; loss scaling prevents this (multiply the loss by e.g. $2^{16}$, then divide gradients by the same factor after the backward pass).

  • Attention without length limits: Quadratic memory; even with batch size 1, $L = 8192$ requires 256 MB for single head.

  • Synchronous all-reduce without overlap or compression: Communication time can dominate; gradient compression (sparsification, quantization) and compute-communication overlap become essential at scale.

  • Assuming linear scaling: Communication cost breaks linear scaling; in unoptimized setups, efficiency can drop from ~90% at a few devices to ~30% at hundreds of devices.

  • Convolution without im2col: Naive loops 100–1000× slower than GEMM-based implementation.
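
The loss-scaling pitfall above can be demonstrated directly with NumPy's float16 (a sketch: the $2^{16}$ scale follows the text, though production AMP implementations typically choose the scale dynamically):

```python
import numpy as np

grad = 1e-8   # a small but meaningful gradient value

# Direct FP16 cast: 1e-8 is below half the smallest FP16 subnormal
# (~6e-8), so it rounds to exactly zero
naive = np.float16(grad)

# Loss scaling: multiply by 2**16 before the FP16 cast (moving the value
# into FP16's representable range), then divide back in higher precision
scale = 2.0 ** 16
scaled = np.float16(grad * scale)
recovered = float(scaled) / scale

print(naive)      # 0.0 -- the gradient is lost entirely
print(recovered)  # ~1e-8, recovered to within FP16 rounding error
```

The same mechanism is what `torch.cuda.amp.GradScaler` automates: scale the loss up, run the FP16 backward pass, then unscale before the optimizer step.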

References

Matrix multiplication theory

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  2. Strassen, V. (1969). Gaussian elimination is not optimal.

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Performance modeling and BLAS

  1. Dongarra, J., Du Croz, J., Duff, I. S., & Hammarling, S. (1990). A set of level 3 basic linear algebra subprograms.

  2. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  3. Grigori, L., Demmel, J., & Xiang, H. (2008). Communication avoiding Gaussian elimination.

Deep learning and convolution

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.

  3. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

Transformer and attention

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners.

  3. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

Mixed precision and numerical stability

  1. Micikevicius, P., Narang, S., Alben, J., et al. (2018). Mixed precision training.

  2. Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift.

  3. NVIDIA Automatic Mixed Precision Training Guide. (2020).

Distributed training

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

  3. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  4. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

  5. You, Y., Gitman, I., & Ginsburg, B. (2019). Large batch optimization for deep learning: training BERT in 76 minutes.

Attention optimization

  1. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

  2. Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse transformers.

  3. Fan, A., Grave, E., & Joulin, A. (2019). Reducing transformer depth on demand with structured dropout.

Five worked examples

Worked Example 1: GEMM efficiency and arithmetic intensity#

Introduction#

Implement dense matrix multiplication on CPU and GPU; measure FLOP rate and memory bandwidth utilization; demonstrate how matrix shape affects arithmetic intensity.

Purpose#

Understand relationship between GEMM dimensions and arithmetic intensity; show how to achieve peak GPU performance.

Importance#

Foundation for understanding deep learning performance; shapes (batch size, hidden dimensions) directly impact training time.

What this example demonstrates#

  • Construct tall-skinny vs. square GEMM matrices.

  • Measure FLOPs and memory bandwidth for each.

  • Compute arithmetic intensity $I = \text{FLOPs/bytes}$.

  • Compare achieved FLOP rate vs. peak.

  • Predict speedup from roofline model.

Background#

GEMM efficiency depends on matrix shape: square matrices have high arithmetic intensity; tall-skinny have low intensity.

Historical context#

Roofline model (Williams et al. 2009) formalizes this trade-off; guides architecture and algorithm design.

History#

Standard framework for performance modeling in HPC and ML systems.

Prevalence in ML#

Every deep learning practitioner adjusts batch size, layer dimensions to maximize GPU utilization.

Notes#

  • Arithmetic intensity: $I = \frac{2mkn}{8(mk + kn + mn)}$; maximized when $m \approx k \approx n$ (cube).

  • For fixed $k$, varying $m, n$ (batch size, hidden dims) changes $I$ by 10×.

Connection to ML#

Batch size and hidden dimension choices affect both accuracy and training speed; understanding trade-offs is critical.

Connection to Linear Algebra Theory#

GEMM is fundamental linear algebra operation; efficiency is determined by cache locality (blocking theory).

Pedagogical Significance#

Demonstrates practical performance modeling; connects theory (arithmetic intensity) to practice (measured FLOP rates).

References#

  1. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for floating-point programs.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Frigo, M., Leiserson, C. E., Prokop, H., & Ramachandran, S. (1999). Cache-oblivious algorithms.

Solution (Python)#

import numpy as np
import time

np.random.seed(35)

# Test different matrix shapes (keeping k fixed)
k = 1024
shapes = [
    (128, k, 128),    # Tall-skinny-ish: low intensity
    (1024, k, 1024),  # Square: high intensity
    (4096, k, 4096),  # Large square: even higher
]

print("GEMM Efficiency Analysis")
print("=" * 80)
print(f"{'m x n':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Intensity':15} {'Est. GFLOPs':15}")
print("-" * 80)

for m, k_dim, n in shapes:
    # Arithmetic
    flops = 2 * m * k_dim * n
    # Memory: read A (m*k), read B (k*n), write C (m*n)
    mem_bytes = 8 * (m * k_dim + k_dim * n + m * n)
    intensity = flops / mem_bytes
    
    # Estimate performance from the roofline model
    # Assume: peak ~15 TFLOPS (V100 FP32), bandwidth 900 GB/s (V100 HBM2)
    peak_flops = 15e12
    bandwidth = 900e9
    roofline = min(peak_flops, bandwidth * intensity)

    print(f"{f'{m}x{n}':15} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {intensity:>14.2f} {roofline/1e9:>14.1f}")

print("\n" + "=" * 80)
print("Key insight: Higher arithmetic intensity -> higher roofline GFLOPs")
print("Square matrices (m ~ k ~ n) achieve 10-100x higher intensity than tall-skinny")

Worked Example 2: Batched GEMM and GPU parallelism#

Introduction#

Implement batched matrix multiplication; measure performance as batch size varies; show speedup from batch parallelism.

Purpose#

Demonstrate how batch dimension enables parallelism; show relationship between batch size and GPU utilization.

Importance#

Batch size is a key hyperparameter; understanding its impact on performance guides training setup.

What this example demonstrates#

  • Generate batched matrices $A_i, B_i$ for $i = 1, \ldots, B$.

  • Time batched GEMM vs. sequential.

  • Measure speedup; show scaling with batch size.

  • Explain why small batches underutilize GPU.

Background#

GPUs have thousands of cores; small batches can’t keep all cores busy; large batches achieve better utilization.

Historical context#

Batch GEMM standardized in BLAS Level 3 (1990); essential for CNN/RNN training.

History#

Modern frameworks (PyTorch, TensorFlow) automatically batch GEMMs; rarely needs manual tuning.

Prevalence in ML#

Every training loop uses batched GEMM; batch size choice directly impacts throughput.

Notes#

  • Batch size $B = 1$: a single moderate GEMM offers too little parallel work to fill the GPU; throughput limited.

  • $B = 32$: better utilization; GPUs have 80+ SMs (streaming multiprocessors).

  • $B = 256$: excellent utilization; typical for modern training.

Connection to ML#

Batch size affects both convergence (larger batches can have worse generalization) and speed; practical sweet spot is usually 32–256.

Connection to Linear Algebra Theory#

Batched GEMM exploits structure (independent problems); vectorization across batch dimension.

Pedagogical Significance#

Shows interplay between algorithm structure and hardware parallelism.

References#

  1. Dongarra, J., Du Croz, J., Duff, I. S., & Hammarling, S. (1990). A set of level 3 basic linear algebra subprograms.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  3. Goyal, P., Dollár, P., Girshick, R., et al. (2017). Accurate large-batch SGD: training ImageNet in 1 hour.

Solution (Python)#

import numpy as np
import time

np.random.seed(36)

# Batched GEMM performance
batch_sizes = [1, 4, 16, 64, 256]
m, k, n = 256, 256, 256  # kept small so the pure-NumPy CPU demo runs in seconds
iterations = 10

print("Batched GEMM Performance (m=k=n={}, {} iterations)".format(m, iterations))
print("=" * 60)
print(f"{'Batch Size':15} {'Total Time (s)':20} {'GFLOPs':15}")
print("-" * 60)

for B in batch_sizes:
    # Create batch of matrices
    A = np.random.randn(B, m, k).astype(np.float32)
    B_mat = np.random.randn(B, k, n).astype(np.float32)
    
    # Batched matmul (sequential in Python; normally GPU would parallelize)
    t0 = time.time()
    for _ in range(iterations):
        C = np.matmul(A, B_mat)
    t_total = time.time() - t0
    
    # FLOPs: 2mkn per batch, B batches, iterations
    flops = iterations * B * 2 * m * k * n
    gflops = flops / (t_total * 1e9)
    
    print(f"{B:>14} {t_total:>19.4f} {gflops:>14.1f}")

print("\n" + "=" * 60)
print("Note: on a GPU, larger batches achieve higher GFLOPs via parallelism;")
print("this CPU demo mainly illustrates how to measure throughput")

Worked Example 3: Convolution as GEMM (im2col)#

Introduction#

Implement convolution using naive loops, then via im2col GEMM; measure speedup from optimized GEMM.

Purpose#

Show how convolution is equivalent to matrix multiplication; demonstrate efficiency gain from reusing optimized GEMM.

Importance#

Foundational for understanding why GPUs excel at CNNs; im2col is standard in production frameworks.

What this example demonstrates#

  • Naive 5-loop convolution implementation.

  • im2col transformation: reshape patches into columns.

  • GEMM on im2col matrix; reshape output.

  • Compare naive vs. GEMM time.

Background#

Convolution unfolds into GEMM; allows reuse of highly-tuned BLAS kernels; 10–100× speedup.

Historical context#

im2col popularized by early deep learning frameworks (Caffe, 2013) as a way to route convolution through optimized GEMM.

History#

Standard in all deep learning frameworks; sometimes augmented by Winograd for further speedup.

Prevalence in ML#

Every CNN implementation uses im2col or similar GEMM-based convolution.

Notes#

  • im2col memory overhead: the unrolled input is $k_h k_w$× larger than the original (9× for $3 \times 3$ kernels); trades memory for speed.

  • Winograd convolution (for $3 \times 3$ kernels): lower arithmetic but numerically complex.

Connection to ML#

Convolutional layers dominate image classification and detection models; efficiency here directly impacts training speed.

Connection to Linear Algebra Theory#

Convolution is linear transformation; im2col exploits structure to reduce to GEMM.

Pedagogical Significance#

Demonstrates how abstract operations (convolution) map to concrete linear algebra (GEMM).

References#

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

  2. Jia, Y., Shelhamer, E., Donahue, J., et al. (2014). Caffe: convolutional architecture for fast feature embedding.

  3. Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks.

Solution (Python)#

import numpy as np
import time

np.random.seed(37)

# Convolution parameters
batch_size, in_height, in_width, in_channels = 8, 32, 32, 3  # small sizes so the naive loop finishes quickly
out_channels, kernel_h, kernel_w, stride = 16, 3, 3, 1
pad = 1

# Padded input
X_padded = np.pad(np.random.randn(batch_size, in_height, in_width, in_channels),
                   ((0,0), (pad,pad), (pad,pad), (0,0)), mode='constant')
W = np.random.randn(out_channels, kernel_h, kernel_w, in_channels)

# Output dimensions
out_height = (in_height + 2*pad - kernel_h) // stride + 1
out_width = (in_width + 2*pad - kernel_w) // stride + 1

# Naive convolution (slow)
print("Naive convolution (5-loop implementation):")
t0 = time.time()
Y_naive = np.zeros((batch_size, out_height, out_width, out_channels))
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            for c in range(out_channels):
                h_start = h * stride
                w_start = w * stride
                patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
                Y_naive[b, h, w, c] = np.sum(patch * W[c])
t_naive = time.time() - t0
print(f"  Time: {t_naive:.4f} s")

# im2col GEMM (fast)
print("\nim2col GEMM (optimized convolution):")
t0 = time.time()

# im2col: extract patches
X_col = np.zeros((batch_size * out_height * out_width, kernel_h * kernel_w * in_channels))
idx = 0
for b in range(batch_size):
    for h in range(out_height):
        for w in range(out_width):
            h_start = h * stride
            w_start = w * stride
            patch = X_padded[b, h_start:h_start+kernel_h, w_start:w_start+kernel_w, :]
            X_col[idx] = patch.reshape(-1)
            idx += 1

# Weight matrix (reshape filters)
W_mat = W.reshape(out_channels, -1).T  # (kernel_h*kernel_w*in_channels, out_channels)

# GEMM
Y_col = X_col @ W_mat  # (batch*out_h*out_w, out_channels)

# Reshape to output
Y_gemm = Y_col.reshape(batch_size, out_height, out_width, out_channels)

t_gemm = time.time() - t0
print(f"  Time: {t_gemm:.4f} s")

print(f"\nSpeedup: {t_naive / t_gemm:.1f}x")
print(f"Results match: {np.allclose(Y_naive, Y_gemm, atol=1e-5)}")

Worked Example 4: Scaled dot-product attention complexity#

Introduction#

Implement attention operation; measure memory and time complexity; show quadratic dependence on sequence length.

Purpose#

Understand why attention is a bottleneck for long sequences; motivate approximate attention methods.

Importance#

Attention scales as $O(L^2 d)$; for long sequences (4K tokens), this dominates; critical for efficiency research.

What this example demonstrates#

  • Implement attention: QK^T, softmax, output.

  • Measure memory (intermediate softmax matrix is $L \times L$).

  • Time scaling with $L$; show quadratic growth.

  • Compare attention time vs. other layers.

Background#

Quadratic attention complexity is fundamental limitation of transformer architecture; many proposed approximations.

Historical context#

Vaswani et al. (2017) introduce scaled dot-product attention; the quadratic cost only became the dominant bottleneck as context lengths grew beyond ~512 tokens.

History#

Post-2020, attention optimization becomes major research area: Flash Attention, sparse attention, linear attention variants.

Prevalence in ML#

Every transformer model suffers from quadratic attention; common workaround is to limit context length or use approximations.

Notes#

  • Attention FLOPs: $\approx 4L^2 d$ per layer across heads (two $O(L^2 d)$ GEMMs: scores and output); dominant for $L > d$.

  • Memory: $O(L^2)$ per head for the attention matrix; for $L = 4096$: ~67 MB per head in FP32 (~0.8 GB across 12 heads).

Connection to ML#

Limiting context length ($L = 512$ vs. $L = 4096$) is common trade-off between expressiveness and efficiency.

Connection to Linear Algebra Theory#

Attention's cost comes from matrix products whose shared dimension is the sequence length; both GEMMs scale quadratically in $L$.

Pedagogical Significance#

Shows concrete example of how algorithmic bottleneck (quadratic) impacts practical ML.

References#

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need.

  2. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness.

  3. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). Rethinking attention with performers.

Solution (Python)#

import numpy as np
import time

np.random.seed(38)

# Attention parameters
d = 768  # Hidden dimension
num_heads = 12
d_k = d // num_heads
L_values = [128, 256, 512, 1024, 2048]  # Sequence lengths

print("Attention Complexity Analysis (d={}, num_heads={})".format(d, num_heads))
print("=" * 70)
print(f"{'Seq Len L':15} {'FLOPs (M)':15} {'Memory (MB)':15} {'Time (ms)':15}")
print("-" * 70)

for L in L_values:
    batch_size = 1
    
    # Create Q, K, V
    Q = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    K = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    V = np.random.randn(batch_size, num_heads, L, d_k).astype(np.float32)
    
    # Measure time and memory
    t0 = time.time()
    
    # Attention: QK^T / sqrt(d_k)
    scores = np.matmul(Q, K.transpose(0, 1, 3, 2))  # (batch, heads, L, L)
    scores = scores / np.sqrt(d_k)
    
    # Softmax
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    
    # Output
    output = np.matmul(weights, V)  # (batch, heads, L, d_k)
    
    t_attn = time.time() - t0
    
    # FLOPs: QK^T = 2*L^2*d_k, softmax ~L^2, output = 2*L^2*d_k
    flops = batch_size * num_heads * (2 * L * L * d_k + 2 * L * L * d_k)
    
    # Memory: scores matrix is L x L per head
    mem_bytes = batch_size * num_heads * L * L * 4
    
    print(f"{L:>14} {flops/1e6:>14.0f} {mem_bytes/1e6:>14.1f} {t_attn*1e3:>14.2f}")

print("\n" + "=" * 70)
print("Key insight: FLOPs and memory scale quadratically with sequence length")
print("For L=4096: ~0.8 GB of attention matrices (12 heads, FP32) and ~50 GFLOPs -- attention becomes the bottleneck")

Worked Example 5: Distributed GEMM and communication cost#

Introduction#

Implement data parallel training with gradient synchronization; measure computation vs. communication time; show communication overhead.

Purpose#

Understand communication bottleneck in distributed training; motivate communication-efficient algorithms.

Importance#

Modern LLMs trained on 1000s of GPUs; communication often dominates; critical for scaling.

What this example demonstrates#

  • Simulate distributed GEMM (matmul on local device).

  • Simulate all-reduce for gradient synchronization.

  • Measure computation time vs. communication time.

  • Show how communication latency scales with number of devices.

Background#

Distributed training divides minibatches across devices; after each minibatch, devices exchange gradients via all-reduce.

Historical context#

Large-batch SGD and gradient compression (2017–2019) driven by communication bottleneck.

History#

Modern frameworks (PyTorch DDP, Horovod) optimize communication; mixed precision + gradient compression reduce overhead.

Prevalence in ML#

Every distributed training run uses all-reduce; communication cost is a well-studied bottleneck.

Notes#

  • Computation time: $O(B \cdot d_{\text{in}} \cdot d_{\text{out}})$ (linear in batch size, dimensions).

  • Communication time (tree all-reduce): $O(\alpha \log D + \beta \, d_{\text{grad}})$, with per-round latency $\alpha$ (logarithmic in device count $D$) and per-word cost $\beta$ (linear in gradient size $d_{\text{grad}}$).

  • For 1000 devices: all-reduce with $\log D \approx 10$ rounds; if each round takes 10 μs, total ~100 μs; computation often takes ms.

Connection to ML#

Large-batch training requires communication efficiency; gradient compression and other tricks essential for practical scaling.

Connection to Linear Algebra Theory#

All-reduce is tree-based collective communication; optimal complexity is $O(\log D)$.

Pedagogical Significance#

Shows distributed systems aspect of linear algebra; explains why scaling beyond certain point is challenging.
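The $O(\log D)$ round structure can be simulated directly. Below is a minimal sketch of recursive-doubling all-reduce, in which device $i$ exchanges partial sums with partner $i \oplus 2^r$ in round $r$; the function name and setup are illustrative, not drawn from any MPI or NCCL API.

```python
import numpy as np

def allreduce_recursive_doubling(local_values):
    """Simulate recursive-doubling all-reduce on D (power-of-two) devices.

    In each round, device i combines its partial sum with partner i XOR step,
    so after log2(D) rounds every device holds the global sum.
    """
    D = len(local_values)
    assert D & (D - 1) == 0 and D > 0, "D must be a power of two"
    vals = [np.array(v, dtype=float) for v in local_values]
    rounds = 0
    step = 1
    while step < D:
        vals = [vals[i] + vals[i ^ step] for i in range(D)]
        step *= 2
        rounds += 1
    return vals, rounds

# 8 devices, device i holds a length-4 vector of i's
local = [np.full(4, i, dtype=float) for i in range(8)]
result, rounds = allreduce_recursive_doubling(local)
print("rounds:", rounds)  # log2(8) = 3
print("every device holds the global sum:",
      all(np.allclose(r, 28.0) for r in result))  # 0+1+...+7 = 28
```

With 8 devices only 3 exchange rounds are needed, which is the logarithmic latency scaling the complexity bound describes.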

References#

  1. Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH.

  2. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism.

  3. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters.

Solution (Python)#

import numpy as np
import time

np.random.seed(39)

# Distributed training simulation
num_devices = [1, 4, 8, 16, 32]
batch_size = 256
hidden_dim = 2048

print("Distributed GEMM: Computation vs. Communication")
print("=" * 70)
print(f"{'Devices':15} {'Comp Time (ms)':20} {'Comm Time (μs)':20} {'Comp/Comm Ratio':15}")
print("-" * 70)

# Assumed hardware model:
# - Computation: 100 GFLOP/s sustained per device (conservative)
# - Communication: 25 GB/s interconnect bandwidth (typical)

compute_flops_per_device = 100e9  # FLOP/s
comm_bandwidth = 25e9  # bytes/s (25 GB/s)

for D in num_devices:
    # Local batch per device
    local_batch = batch_size // D
    
    # GEMM: local_batch x hidden_dim x hidden_dim
    flops_local = 2 * local_batch * hidden_dim * hidden_dim
    
    # Computation time
    t_compute = flops_local / compute_flops_per_device
    
    # Communication: all-reduce of the layer's weight gradient
    # (hidden_dim x hidden_dim entries in FP32)
    # Tree all-reduce: ~log2(D) communication rounds
    comm_rounds = int(np.ceil(np.log2(D))) if D > 1 else 1
    gradient_size = hidden_dim * hidden_dim * 4  # bytes (FP32)
    comm_per_round = gradient_size / comm_bandwidth
    t_comm = comm_rounds * comm_per_round
    
    ratio = t_compute / t_comm
    
    print(f"{D:>14} {t_compute*1e3:>19.3f} {t_comm*1e6:>19.2f} {ratio:>14.1f}x")

print("\n" + "=" * 70)
print("Key insight: per-device computation shrinks as 1/D while all-reduce")
print("adds ~log2(D) rounds, so the compute/comm ratio falls with scale")
print("-> communication becomes the bottleneck at large device counts")


Chapter 14
Conditioning & Stability
Key ideas: Introduction

Introduction#

Numerical stability determines whether an algorithm produces a correct answer. Machine arithmetic operates at finite precision (float64: ~16 significant digits). An ill-conditioned problem (large $\kappa$) is inherently difficult: small input perturbations cause large output changes. Backward-stable algorithms guarantee that the computed solution is the exact solution to a nearby problem. Forward error depends on conditioning; backward error depends on algorithm stability. Together, they predict what accuracy is achievable. This is fundamental to reliable ML: ill-conditioning appears in nearly singular Gram matrices, ill-posed inverse problems, and deep network Hessians.

Important ideas#

  1. Condition number and sensitivity

    • $\kappa(A) = \sigma_1 / \sigma_n$ (ratio of largest/smallest singular values).

    • Forward error bound: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \cdot \frac{\| \Delta A \|}{\| A \|}$ (perturbation on $A$ causes error in $x$).

    • Ill-conditioned ($\kappa \gg 1$) problems amplify noise; well-conditioned ($\kappa \approx 1$) are benign.

  2. Machine precision and floating-point arithmetic

    • Machine epsilon: $\varepsilon_{\text{machine}} \approx 2^{-52} \approx 2.2 \times 10^{-16}$ for float64.

    • Relative error per operation: $O(\varepsilon_{\text{machine}})$; accumulated error over $n$ operations: $O(n \cdot \varepsilon_{\text{machine}})$.

    • Loss of significance: if $a \approx b$, then $a - b$ has few correct digits (cancellation).

  3. Backward error analysis and stability

    • Algorithm is backward-stable if computed solution $\tilde{x}$ is exact solution to perturbed problem: $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon_{\text{machine}})$.

    • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|} \approx \kappa(A) \cdot O(\varepsilon_{\text{machine}})$ for stable algorithm.

    • Unstable algorithm: backward error is large; forward error can be catastrophic.

  4. Condition number of Gram matrix vs. original matrix

    • $\kappa(A^\top A) = \kappa(A)^2$; normal equations square the condition number.

    • QR/SVD avoid this: $\kappa(R) = \kappa(A)$ (no squaring), $\kappa(\Sigma) = \kappa(A)$.

    • For least squares: direct method on $A$ (QR) preferable to solving $A^\top A x = A^\top b$.

  5. Preconditioning and effective conditioning

    • Preconditioner $M \approx A$ transforms $A \to M^{-1} A$; goal is $\kappa(M^{-1} A) \ll \kappa(A)$.

    • Incomplete LU (ILU), Jacobi (diagonal), algebraic multigrid (AMG) are practical preconditioners.

    • CG iteration count $\approx \sqrt{\kappa(M^{-1} A)}$; reducing $\kappa$ by factor 100 reduces iterations by factor 10.

  6. Regularization and ill-posedness

    • Ill-posed problem: small changes in data cause arbitrarily large changes in solution ($\kappa = \infty$).

    • Tikhonov regularization: $(A^\top A + \lambda I) x = A^\top b$ shifts small singular values by $\lambda$.

    • Effective $\kappa$ after regularization: $\kappa_{\text{reg}} = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$; well-chosen $\lambda$ reduces condition number.

  7. Numerical cancellation and loss of significance

    • Subtraction $a - b$ where $a \approx b$ cancels leading digits; result has low precision.

    • Orthogonal transformations (QR, SVD, Householder) avoid this; preserve angles and norms.

    • Modified Gram–Schmidt (MGS) is more stable than classical Gram–Schmidt: it orthogonalizes each vector against the computed (already-rounded) vectors sequentially; full reorthogonalization passes improve either variant further.
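Two facts from this list — $\kappa(A^\top A) = \kappa(A)^2$ and the Tikhonov-reduced condition number — are easy to verify numerically. A minimal sketch (the dimensions, spectrum, and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
# Build A with prescribed singular values 1 ... 1e-3 via orthonormal columns
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -3, d)
A = U[:, :d] * s                      # kappa(A) = 1e3 by construction

kappa_A = np.linalg.cond(A)
kappa_G = np.linalg.cond(A.T @ A)
print(f"kappa(A)     = {kappa_A:.3e}")   # ~1e3
print(f"kappa(A^T A) = {kappa_G:.3e}")   # ~1e6 = kappa(A)^2

# Tikhonov: shifting sigma_i^2 by lambda reduces the effective kappa
lam = 1e-4
kappa_reg = (s[0]**2 + lam) / (s[-1]**2 + lam)
print(f"kappa after regularization (lambda={lam}): {kappa_reg:.3e}")  # ~9.9e3
```

Squaring turns a tolerable $\kappa = 10^3$ into $10^6$, while a modest $\lambda$ pulls the effective condition number back down by two orders of magnitude.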

Relevance to ML#

  • Linear regression and least squares: Gram matrix $X^\top X$ can be ill-conditioned if features are correlated; normal equations fail; QR/SVD required.

  • Kernel methods and Gaussian processes: Gram/covariance matrix condition number determines sample complexity and numerical stability of Cholesky.

  • Deep learning and Hessian conditioning: Neural network loss Hessian is often ill-conditioned; affects convergence rate, learning rate selection, and second-order optimization.

  • Regularization and implicit bias: Ill-posed learning problems ($\kappa = \infty$) require regularization (L2, dropout, early stopping); reduces effective $\kappa$.

  • Numerical robustness in production: Conditioning determines whether model training is reproducible, whether predictions are stable to input noise, and whether inference is safe.

Algorithmic development (milestones)#

  • 1947: von Neumann & Goldstine analyze numerical stability of Gaussian elimination; discover amplification by machine epsilon.

  • 1961: Wilkinson develops comprehensive framework for backward error analysis; introduces condition number formally.

  • 1965: Golub & Kahan recognize $\kappa(A^\top A) = \kappa(A)^2$ is catastrophic for normal equations; advocate for QR/SVD.

  • 1979: Dongarra, Bunch, Moler & Stewart release LINPACK; implement numerically stable algorithms with careful pivot selection.

  • 1979–1990: Higham develops extended backward error analysis; proves stability of Cholesky, QR, modified Gram–Schmidt.

  • 1992: Arioli et al. develop a posteriori error bounds; can verify computed solution without knowing true solution.

  • 2000s: Preconditioning becomes standard in iterative solvers; algebraic multigrid (AMG) enables linear scaling.

  • 2010s: Automatic differentiation frameworks (TensorFlow, PyTorch) provide implicit solvers with backward stability analysis.

Definitions#

  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)}$ (SVD-based); $\kappa_\infty(A) = \| A \|_\infty \| A^{-1} \|_\infty$ (operator norm).

  • Machine epsilon: $\varepsilon_{\text{machine}} = 2^{-(p-1)}$ for $p$-bit mantissa; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point number: Normalized form: $\pm m \times 2^e$ with $1 \le m < 2$ (hidden bit), $e \in [-1022, 1023]$.

  • Relative error: $\frac{\| \tilde{x} - x \|}{\| x \|}$ (dimensionless; independent of scaling).

  • Backward error: $\frac{\| \tilde{A} - A \|}{\| A \|}$ for perturbed matrix; small backward error $\Rightarrow$ computed $\tilde{x}$ is exact solution to nearby problem.

  • Forward error: $\frac{\| \tilde{x} - x \|}{\| x \|}$; depends on both backward error and conditioning: $\text{forward error} \approx \kappa(A) \times \text{backward error}$.

  • Ill-conditioned: $\kappa(A) \gg 1$; small perturbations cause large output changes.

  • Well-conditioned: $\kappa(A) \approx 1$; stable to perturbations.

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • Condition number definition and properties: $\kappa(A) = \sigma_1 / \sigma_n$; $\kappa(AB) \le \kappa(A) \kappa(B)$; $\kappa(A) \ge 1$. Reference: Golub & Van Loan (2013).

  • Perturbation theory and forward error bounds: $\frac{\| \Delta x \|}{\| x \|} \lesssim \kappa(A) \frac{\| \Delta A \|}{\| A \|}$ (first-order). Reference: Wilkinson (1961); Higham (2002).

  • Backward error analysis: Algorithm is backward-stable if $\tilde{x}$ satisfies $\tilde{A} \tilde{x} = \tilde{b}$ with $\| \tilde{A} - A \|, \| \tilde{b} - b \| = O(\varepsilon)$. Reference: Higham (2002).

  • Machine precision and floating-point error: $\varepsilon_{\text{mach}} \approx 2^{-p}$; error per operation $O(\varepsilon)$; cumulative error over $n$ steps $O(n \varepsilon)$. Reference: IEEE 754 (2019); Goldberg (1991).

  • Condition number of Gram matrix: $\kappa(A^\top A) = \kappa(A)^2$; why normal equations are risky. Reference: Golub & Kahan (1965).

  • Preconditioning theory: Transform $A \to M^{-1} A$; goal $\kappa(M^{-1} A) \ll \kappa(A)$. Reference: Axelsson (1994).

  • Tikhonov regularization: Shift eigenvalues $\sigma_i^2 \to \sigma_i^2 + \lambda$; reduce effective condition number. Reference: Tikhonov (1963).
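The basic condition-number identities listed above can be checked in a few lines; a sketch with illustrative random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
B = rng.standard_normal((50, 50))

kA, kB, kAB = np.linalg.cond(A), np.linalg.cond(B), np.linalg.cond(A @ B)
print("kappa(A) >= 1:", kA >= 1.0)
print("kappa(AB) <= kappa(A) * kappa(B):", kAB <= kA * kB)

# QR does not square the conditioning: R shares A's singular values
Q, R = np.linalg.qr(A)
print("kappa(R) / kappa(A) =", np.linalg.cond(R) / kA)  # ~1.0
```

The last line is the practical payoff: solving least squares through $R$ keeps $\kappa(A)$, whereas forming $A^\top A$ would square it.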

Applied (landmark systems)#

  • QR factorization for stable least squares: $\kappa(R) = \kappa(A)$ (no squaring); stable even for ill-conditioned $A$. Implementation: LAPACK dgeqrf; scikit-learn default solver. Reference: Golub & Kahan (1965); Golub & Van Loan (2013).

  • SVD for rank-deficient and ill-conditioned systems: Direct pseudoinverse computation; threshold small singular values. Implementation: LAPACK dgesvd; scipy.linalg.svd. Reference: Golub & Kahan (1965).

  • Cholesky with jitter for near-singular covariance: Add $\epsilon I$ if Cholesky fails; trade-off between accuracy and stability. Standard in GPyTorch, Pyro, Stan. Reference: Rasmussen & Williams (2006).

  • Modified Gram–Schmidt (MGS) for reorthogonalization: More stable than classical GS; requires reorthogonalization passes. Reference: Golub & Pereyra (1973); Yarlagadda et al. (1979).

  • Preconditioned conjugate gradient (PCG): L-BFGS (quasi-Newton), ILU preconditioning, algebraic multigrid reduce iteration count by factors 10–100. Standard in scipy.sparse.linalg, PETSc. Reference: Nocedal & Wright (2006); Saad (2003).

  • Ridge regression and elastic net for regularization: Manual $\lambda$ tuning or cross-validation; scikit-learn Ridge, ElasticNet. Reference: Hoerl & Kennard (1970); Zou & Hastie (2005).

Key ideas: Where it shows up
  1. Linear regression and least squares

    • Gram matrix $X^\top X$ condition number determines numerical stability of normal equations.

    • Achievements: QR/SVD solvers in scikit-learn, statsmodels ensure stability; warning when $\kappa > 10^{10}$. References: Hastie et al. (2009); Golub & Kahan (1965).

  2. Kernel methods and Gaussian processes

    • Covariance/kernel matrix $K$ condition number governs Cholesky stability and marginal likelihood evaluation.

    • Achievements: Cholesky with jitter in Stan/PyMC3; sparse GP approximations circumvent $O(n^3)$ and ill-conditioning. References: Rasmussen & Williams (2006); Snelson & Ghahramani (2005).

  3. Deep learning and second-order optimization

    • Hessian condition number determines convergence rate of Newton/quasi-Newton methods; affects learning rate and step size.

    • Achievements: L-BFGS (quasi-Newton) in PyTorch/TensorFlow adapts to Hessian conditioning; natural gradient methods rescale by Fisher information. References: Martens & Grosse (2015); Nocedal & Wright (2006).

  4. Regularization and implicit bias

    • Ill-posed learning problems require regularization; L2 penalty (ridge) shifts eigenvalues, reducing effective condition number.

    • Achievements: elastic net (L1+L2) in scikit-learn balances sparsity and conditioning; early stopping in SGD acts as implicit regularization. References: Tibshirani (1996); Zou & Hastie (2005); Zhu et al. (2021).

  5. Numerical robustness and reliability

    • Production ML systems must be stable to input noise, feature preprocessing, and computational precision.

    • Achievements: feature standardization reduces condition number; model monitoring tracks numerical stability (checks for NaN/Inf); validation on adversarial perturbations. References: Goodfellow et al. (2014) (adversarial robustness); LeCun et al. (1998) (feature normalization).

Notation
  • Condition number: $\kappa(A) = \frac{\sigma_1(A)}{\sigma_n(A)} = \sqrt{\frac{\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)}}$; for SPD $A$, $\kappa(A) = \lambda_{\max}(A) / \lambda_{\min}(A)$.

  • Machine epsilon: $\varepsilon_{\text{mach}}$ or $\text{eps}$ in code; $\approx 2.2 \times 10^{-16}$ for float64.

  • Floating-point relative error: $\tilde{a} = a(1 + \delta)$ where $|\delta| \le \varepsilon_{\text{mach}}$.

  • Perturbation: $\tilde{A} = A + \Delta A$ with $\| \Delta A \| \le \varepsilon \| A \|$.

  • Forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$ (computed vs. true solution).

  • Backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$ (residual relative to data norm).

  • Norm: $\| A \|_2 = \sigma_1(A)$ (spectral norm); $\| A \|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius).

  • Example: For $A \in \mathbb{R}^{100 \times 100}$ with $\sigma_1 = 10, \sigma_{100} = 0.01$: $\kappa(A) = 1000$. Relative error in solving $Ax = b$ is amplified by factor $\approx 1000$; if input has error $10^{-15}$, output error $\approx 10^{-12}$.
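The worked numbers in the last bullet can be reproduced directly. A minimal sketch, constructing $A \in \mathbb{R}^{100 \times 100}$ with $\sigma_1 = 10$ and $\sigma_{100} = 0.01$ as in the example (the random seed and perturbation direction are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(np.log10(10.0), np.log10(0.01), n)  # sigma_1 = 10, sigma_100 = 0.01
A = (U * s) @ V.T                                   # kappa(A) = 1000 by construction

print("kappa(A) =", round(np.linalg.cond(A)))       # 1000

# Perturb b at roughly machine-precision level and observe the amplification
x = rng.standard_normal(n)
b = A @ x
db = 1e-15 * np.linalg.norm(b) * rng.standard_normal(n) / np.sqrt(n)
rel_err = np.linalg.norm(np.linalg.solve(A, b + db) - x) / np.linalg.norm(x)
print(f"relative error in x: {rel_err:.2e}  (perturbation ~1e-15, amplified up to kappa)")
```

An input error near $10^{-15}$ emerges as a solution error near $10^{-12}$, exactly the factor-$\kappa$ amplification the bullet describes.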

Pitfalls & sanity checks
  • Never ignore condition number: If $\kappa(A) > 10^{8}$, accuracy is compromised; either regularize or use higher precision.

  • Use QR/SVD for least squares, never normal equations (unless you’ve verified conditioning): $\kappa(X^\top X) = \kappa(X)^2$ is catastrophic.

  • Check for catastrophic cancellation: Subtraction of nearly equal numbers loses precision; use orthogonal algorithms instead.

  • Machine precision is absolute limit: Even with perfect algorithm, relative error $\gtrsim \varepsilon_{\text{mach}} \times \kappa(A)$ is unavoidable.

  • Backward error bounds are meaningful: Small backward error $\Rightarrow$ computed solution is exact for nearby problem (reassuring).

  • Regularization parameter selection matters: Cross-validation, L-curve, or discrepancy principle; manual tuning often fails.

  • Feature scaling affects conditioning: Standardize features before regression; poor scaling inflates condition number unnecessarily.

  • Preconditioning amortized over multiple solves: Single solve: preconditioning setup not worth it; multiple solves: huge benefit.
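The feature-scaling pitfall is easy to see concretely. A minimal sketch with three illustrative columns on wildly different scales:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Three features on very different scales (e.g., meters vs. micrometers)
X = np.column_stack([
    rng.standard_normal(n) * 1e6,
    rng.standard_normal(n) * 1.0,
    rng.standard_normal(n) * 1e-6,
])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

print(f"kappa(X) before scaling: {np.linalg.cond(X):.3e}")      # ~1e12
print(f"kappa(X) after scaling:  {np.linalg.cond(X_std):.3e}")  # close to 1
```

The data carries exactly the same information in both cases; standardization alone removes roughly twelve orders of magnitude of artificial ill-conditioning.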

References

Foundational backward error analysis

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

  3. IEEE 754. (2019). IEEE Standard for Floating-Point Arithmetic.

Classical numerical linear algebra

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.

Comprehensive stability analysis

  1. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  2. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix.

  3. Givens, W. (1958). Computation of plane unitary rotations transforming a general matrix to triangular form.

Orthogonal factorizations and reorthogonalization

  1. Golub, G. H., & Pereyra, V. (1973). The differentiation of pseudo-inverses and nonlinear least squares problems.

  2. Yarlagadda, R. K., Hershey, J. E., & Sule, S. M. (1979). Gram–Schmidt orthogonalization with application to order recursive lattice filters.

Iterative methods and preconditioning

  1. Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.

  2. Axelsson, O. (1994). Iterative Solution Methods.

  3. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  4. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

  5. Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid (AMG).

Ill-posed problems and regularization

  1. Tikhonov, A. N. (1963). On the solution of ill-posed problems and regularized methods.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Hansen, P. C., Nagy, J. G., & O’Leary, D. P. (2006). Deblurring Images: Matrices, Spectra, and Filtering.

  4. Vogel, C. R. (2002). Computational Methods for Inverse Problems.

Applied ML and regularization

  1. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

  3. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

  4. Zhu, Z., Wu, J., Yu, B., Wu, D., & Welling, M. (2021). The implicit regularization of ordinary SGD for loss functions with modulus of continuity.

  5. Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature.

  6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Five worked examples

Worked Example 1: Condition number of Gram matrix and instability of normal equations#

Introduction#

Construct a data matrix $X$ with correlated features; compute condition number $\kappa(X)$ and $\kappa(X^\top X)$; solve least squares via normal equations vs. QR to demonstrate error amplification.

Purpose#

Illustrate why normal equations fail for ill-conditioned data; show that $\kappa(X^\top X) = \kappa(X)^2$ causes error to explode.

Importance#

Foundational motivation for using QR/SVD in regression; explains why scikit-learn warns on ill-conditioned matrices.

What this example demonstrates#

  • Construct $X$ with exponentially decaying singular values (high correlation).

  • Compute $\kappa(X)$ and $\kappa(X^\top X) = \kappa(X)^2$.

  • Solve via normal equations: $G = X^\top X$, $w = G^{-1} X^\top y$.

  • Solve via QR factorization: $w$ from $R w = Q^\top y$.

  • Compare residuals and error growth on perturbed data.

Background#

Gram matrix $G = X^\top X$ is square and symmetric but inherits squared conditioning from $X$: solving with $G$ amplifies errors by $\kappa(G) = \kappa(X)^2$, catastrophic when $X$ is already ill-conditioned.

Historical context#

Golub & Kahan (1965) proved $\kappa(X^\top X) = \kappa(X)^2$; advocated QR as practical solution.

History#

Recognition of this issue led to LINPACK (1979) and LAPACK (1992) emphasis on QR/SVD.

Prevalence in ML#

Statisticians and practitioners routinely encounter this; feature correlation is common in real data.

Notes#

  • Problem is fundamental to linear algebra, not implementation-specific.

  • Data standardization (mean 0, std 1) helps but doesn’t eliminate correlation.

  • Large-scale regression ($n, d > 10^6$) exacerbates conditioning issues.

Connection to ML#

Ridge regression adds $\lambda I$ to shift eigenvalues; reduces effective condition number. Feature engineering (orthogonalization, PCA preprocessing) also helps.

Connection to Linear Algebra Theory#

$\kappa(X^\top X) = \kappa(X)^2$ is exact; QR avoids Gram matrix entirely.

Pedagogical Significance#

Concrete demonstration of why theory (backward error analysis) matters in practice.

References#

  1. Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix.

  2. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.).

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

Solution (Python)#

import numpy as np
np.random.seed(20)

# Create ill-conditioned data (correlated features)
n, d = 100, 10
# Generate features with exponentially decaying singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -2, d)
X = U[:n, :d] @ np.diag(s)

# True solution and target
w_true = np.random.randn(d)
y = X @ w_true + 0.001 * np.random.randn(n)

# Condition numbers
kappa_X = np.linalg.cond(X)
G = X.T @ X
kappa_G = np.linalg.cond(G)

# Solve via normal equations
w_ne = np.linalg.solve(G, X.T @ y)
residual_ne = np.linalg.norm(X @ w_ne - y)

# Solve via QR
Q, R = np.linalg.qr(X, mode='reduced')
w_qr = np.linalg.solve(R, Q.T @ y)
residual_qr = np.linalg.norm(X @ w_qr - y)

# Solve via SVD (most stable)
U_svd, s_svd, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((U_svd.T @ y) / s_svd)  # divide by singular values directly
residual_svd = np.linalg.norm(X @ w_svd - y)

print("Condition numbers:")
print("  kappa(X) =", round(kappa_X, 2))
print("  kappa(X^T X) =", round(kappa_G, 2))
print("  kappa(X^T X) / kappa(X)^2 =", round(kappa_G / kappa_X**2, 4), "(should be ~1)")
print("\nResiduals:")
print("  Normal equations:", round(residual_ne, 8))
print("  QR:", round(residual_qr, 8))
print("  SVD:", round(residual_svd, 8))
print("\nSolution errors:")
print("  NE error:", round(np.linalg.norm(w_ne - w_true), 4))
print("  QR error:", round(np.linalg.norm(w_qr - w_true), 4))
print("  SVD error:", round(np.linalg.norm(w_svd - w_true), 4))

Worked Example 2: Backward error analysis and numerical stability of QR vs. normal equations#

Introduction#

Solve the same least squares problem via QR and normal equations; compute backward and forward errors; verify that QR is backward-stable while normal equations magnify error.

Purpose#

Demonstrate backward error analysis in practice; show that backward-stable algorithm (QR) guarantees small forward error even on ill-conditioned problems.

Importance#

Foundation for trusting numerical algorithms; explains why stable algorithms are worth the computational cost.

What this example demonstrates#

  • Solve $Ax = b$ via QR factorization and normal equations.

  • Compute backward error: $\frac{\| A \tilde{x} - b \|}{\| b \|}$.

  • Compute forward error: $\frac{\| \tilde{x} - x^* \|}{\| x^* \|}$.

  • Verify $\text{forward error} \approx \kappa(A) \times \text{backward error}$ for both methods.

Background#

Backward-stable algorithm: computed $\tilde{x}$ is exact solution to $(A + \Delta A) \tilde{x} = b$ with small $\Delta A$.

Historical context#

Wilkinson (1961) formalized backward error analysis; revolutionary impact on numerical linear algebra.

History#

Modern algorithms designed to be backward-stable (QR, SVD, Cholesky); guarantees via theorems in Higham (2002).

Prevalence in ML#

Used in numerical software certification; backward error commonly reported in solvers.

Notes#

  • QR backward error: $O(\varepsilon_{\text{mach}} \| A \|)$ (small).

  • Normal equations backward error: $O(\varepsilon_{\text{mach}} \| A \|^2 / \min(\sigma_i))$ (amplified).

Connection to ML#

Backward stability ensures that model parameters are not polluted by rounding error; prediction is numerically sound.

Connection to Linear Algebra Theory#

Backward error analysis decouples algorithm stability from problem conditioning.

Pedagogical Significance#

Shows that algorithm design (not just theory) determines practical accuracy.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np
np.random.seed(21)

# Ill-conditioned system
n = 50
U, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -3, n)  # kappa ~ 1000
A = U @ np.diag(s) @ U.T
A = (A + A.T) / 2  # Ensure symmetry
x_true = np.random.randn(n)
b = A @ x_true

# Solve via QR
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Solve via normal equations
G = A.T @ A
x_ne = np.linalg.solve(G, A.T @ b)

# Direct solve (reference)
x_ref = np.linalg.solve(A, b)

# Backward errors
residual_qr = np.linalg.norm(A @ x_qr - b)
residual_ne = np.linalg.norm(A @ x_ne - b)
backward_error_qr = residual_qr / np.linalg.norm(b)
backward_error_ne = residual_ne / np.linalg.norm(b)

# Forward errors (measured against the known true solution)
forward_error_qr = np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true)
forward_error_ne = np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true)

# Condition number
kappa = np.linalg.cond(A)

# Verify: forward_error ≈ kappa × backward_error
pred_forward_qr = kappa * backward_error_qr
pred_forward_ne = kappa * backward_error_ne

print("Condition number kappa(A):", round(kappa, 0))
print("\nQR Factorization:")
print("  Backward error:", format(backward_error_qr, '.2e'))
print("  Observed forward error:", format(forward_error_qr, '.2e'))
print("  Predicted forward error:", format(pred_forward_qr, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_qr / pred_forward_qr, 2))
print("\nNormal Equations:")
print("  Backward error:", format(backward_error_ne, '.2e'))
print("  Observed forward error:", format(forward_error_ne, '.2e'))
print("  Predicted forward error:", format(pred_forward_ne, '.2e'))
print("  Ratio (obs/pred):", round(forward_error_ne / pred_forward_ne, 2))
print("\nConclusion: QR is backward-stable; NE backward error is larger.")

Worked Example 3: Machine precision and loss of significance in subtraction#

Introduction#

Demonstrate catastrophic cancellation when subtracting nearby numbers; show how orthogonal transformations avoid this.

Purpose#

Illustrate fundamental floating-point pitfall; motivate use of orthogonal algorithms (QR, SVD, Householder).

Importance#

Concrete example of why naive implementations fail; builds intuition for numerical stability.

What this example demonstrates#

  • Compute $c = a - b$ where $a \approx b$ (many leading digits cancel).

  • Compare relative error in $c$ vs. in $a, b$ individually.

  • Show that orthogonal transformations (e.g., Householder reflection) preserve norms and avoid subtraction.

Background#

Subtraction of nearly equal numbers is inherently unstable; cannot be fixed by higher precision alone.

Historical context#

Recognized early in computer arithmetic; led to development of Householder reflections (1958) and other orthogonal methods.

History#

Modern best practice: avoid subtraction of near-equal numbers; use transformation-based algorithms (QR, SVD).

Prevalence in ML#

Common issue in numerical gradient computation, vector normalization, and Gram–Schmidt orthogonalization.

Notes#

  • Relative error in $c$ can be $O(1)$ even if $a, b$ are accurate.

  • Classical Gram–Schmidt suffers; modified Gram–Schmidt fixes by reorthogonalization.

Connection to ML#

Adversarial robustness and numerical gradient computation are vulnerable to cancellation; careful implementation required.

Connection to Linear Algebra Theory#

Orthogonal transformations (Householder, Givens) have condition number $\kappa = 1$; stable regardless of input.

Pedagogical Significance#

Demonstrates limit of forward-stable algorithms; backward stability (computing exact solution to nearby problem) is more achievable.

References#

  1. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion.

  2. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.).

  3. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic.

Solution (Python)#

import numpy as np

# Example 1: Catastrophic cancellation
a = 1.0 + 1e-15   # rounded to the nearest float64
b = 1.0
c = a - b         # the subtraction is exact, but a already lost information
print("Example: Subtraction of nearly equal numbers")
print("  a = 1.0 + 1e-15 =", repr(a))
print("  b = 1.0 =", repr(b))
print("  c = a - b =", repr(c))
print("  Exact difference: 1e-15, Computed:", c)
print("  Relative error:", abs(c - 1e-15) / 1e-15)

# Example 2: Order-dependent rounding in summation
print("\nExample: Loss of significance in a sum")
terms = [1e16, 1.0, -1e16]

# Left-to-right: (1e16 + 1.0) rounds back to 1e16, so the 1.0 is lost
s_naive = (terms[0] + terms[1]) + terms[2]

# Combining the large terms first lets the small term survive
s_stable = (terms[0] + terms[2]) + terms[1]

print("  Exact sum: 1.0")
print("  Left-to-right order:", s_naive)   # 0.0
print("  Large terms first:  ", s_stable)  # 1.0

# Example 3: Orthogonal transformation vs. subtraction
print("\nExample: Orthogonal transformation avoids cancellation")
# Create orthogonal matrix (preserve norms)
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
v = np.array([1.0, 0.0])
v_rot = Q @ v
print("  ||v|| =", np.linalg.norm(v))
print("  ||Q v|| =", np.linalg.norm(v_rot), "(should equal ||v||)")
print("  Orthogonal transformation preserves norms: stable!")

Worked Example 4: Preconditioning and condition number reduction#

Introduction#

Construct an ill-conditioned SPD system; solve via unpreconditioned CG and preconditioned CG with diagonal preconditioner; measure iteration count reduction.

Purpose#

Show quantitatively how preconditioning reduces condition number and accelerates convergence.

Importance#

Essential technique for large-scale optimization; directly translates to practical speedups.

What this example demonstrates#

  • Construct ill-conditioned SPD matrix (exponentially decaying eigenvalues).

  • Solve via CG (count iterations to convergence).

  • Construct diagonal (Jacobi) preconditioner $M = \text{diag}(A)$.

  • Solve via preconditioned CG (count iterations).

  • Compute effective $\kappa(M^{-1} A)$ and verify iteration count reduction.

Background#

CG iteration count grows as $O(\sqrt{\kappa(A)})$ at fixed tolerance; preconditioning replaces $\kappa(A)$ with the (ideally much smaller) $\kappa(M^{-1} A)$.
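The $\sqrt{\kappa}$ scaling comes from the classical CG error bound $\| e_k \|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \| e_0 \|_A$. A quick sketch (the helper name is illustrative) solves the bound for the iteration count:

```python
import numpy as np

# Classical CG convergence bound:
#   ||e_k||_A <= 2 * ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))**k * ||e_0||_A.
# Solving for k gives an upper bound on the iterations needed to reduce
# the A-norm error by a factor eps.
def cg_iterations_bound(kappa, eps=1e-8):
    rho = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    return int(np.ceil(np.log(eps / 2.0) / np.log(rho)))

for kappa in (10, 1_000, 100_000):
    print(f"kappa = {kappa:>7}: at most {cg_iterations_bound(kappa)} iterations")
```

Note the roughly tenfold growth in the bound for each hundredfold increase in $\kappa$, matching the $\sqrt{\kappa}$ rule of thumb.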

Historical context#

Preconditioning recognized as essential by 1970s; became standard in iterative solvers.

History#

Standard in PETSc, Trilinos, scipy.sparse.linalg (all offer preconditioning options).

Prevalence in ML#

L-BFGS uses quasi-Newton preconditioning; preconditioned SGD in modern optimizers.

Notes#

  • Iteration count scales as $\sqrt{\kappa}$; reducing $\kappa$ by a factor of 100 cuts the iteration count by a factor of $\sqrt{100} = 10$.

  • Preconditioner setup cost must be amortized over multiple solves.

Connection to ML#

Optimization accelerators: Newton’s method, natural gradient, adaptive methods all precondition the gradient.

Connection to Linear Algebra Theory#

Preconditioning clusters eigenvalues; standard result in iterative methods.
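Clustering is easiest to see on a matrix whose ill-conditioning comes purely from diagonal scaling (the construction below is illustrative, separate from the worked example). Symmetric Jacobi scaling $D^{-1/2} A D^{-1/2}$ has the same spectrum as $M^{-1} A$ with $M = \text{diag}(A)$, and pulls the eigenvalues into a tight cluster around 1:

```python
import numpy as np
np.random.seed(0)

# SPD matrix whose ill-conditioning comes from bad diagonal scaling.
n = 100
d = np.logspace(0, 3, n)            # diagonal scales spanning 1..1000
B = 0.01 * np.random.randn(n, n)
A = np.diag(d) + B @ B.T            # SPD with strongly varying diagonal

# Symmetric Jacobi scaling D^{-1/2} A D^{-1/2}: similar to M^{-1} A,
# so it has the same eigenvalues.
d_inv_sqrt = 1.0 / np.sqrt(np.diag(A))
A_prec = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

eig = np.linalg.eigvalsh(A)         # ascending eigenvalues
eig_p = np.linalg.eigvalsh(A_prec)
print("eigenvalue range of A:         [%.3g, %.3g]" % (eig[0], eig[-1]))
print("after Jacobi scaling:          [%.3g, %.3g]" % (eig_p[0], eig_p[-1]))
```

The scaled spectrum sits in a narrow band around 1, so CG on the preconditioned system converges in a handful of iterations.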

Pedagogical Significance#

Demonstrates power of structure exploitation (knowing $M \approx A$) to improve algorithm complexity.

References#

  1. Axelsson, O. (1994). Iterative Solution Methods.

  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.).

  3. Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.).

Solution (Python)#

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
np.random.seed(23)

# Create ill-conditioned SPD matrix whose ill-conditioning comes from
# diagonal scaling (for a rotation-invariant random SPD matrix the diagonal
# is nearly constant, and a Jacobi preconditioner would do almost nothing)
n = 200
C = np.random.randn(n, n)
C = C @ C.T / n + np.eye(n)            # well-conditioned SPD core
d = np.logspace(0, -3, n)              # scaling spanning three decades
A = np.diag(np.sqrt(d)) @ C @ np.diag(np.sqrt(d))
A = (A + A.T) / 2
b = np.random.randn(n)

# Unpreconditioned CG
residuals_unprecond = []
def callback_un(x):
    residuals_unprecond.append(np.linalg.norm(A @ x - b))
x_un, info_un = spla.cg(A, b, rtol=1e-8, callback=callback_un,
                        maxiter=10 * n)  # rtol was named tol in SciPy < 1.12

# Preconditioned CG with diagonal (Jacobi) preconditioner
M_diag = np.diag(1.0 / np.diag(A))           # dense M^{-1}, reused for cond estimate below
M_diag_sparse = sp.diags(1.0 / np.diag(A))   # sparse Jacobi preconditioner for CG
residuals_precond = []
def callback_pre(x):
    residuals_precond.append(np.linalg.norm(A @ x - b))
x_pre, info_pre = spla.cg(A, b, M=M_diag_sparse, rtol=1e-8,
                           callback=callback_pre, maxiter=10 * n)

kappa_A = np.linalg.cond(A)
# kappa(M^{-1} A) via the similar symmetric matrix M^{-1/2} A M^{-1/2}
AM = np.sqrt(M_diag) @ A @ np.sqrt(M_diag)
kappa_AM = np.linalg.cond(AM)

print("Condition numbers:")
print("  kappa(A) =", round(kappa_A, 0))
print("  kappa(M^{-1} A) ≈", round(kappa_AM, 0), "(after diagonal preconditioning)")
print("\nCG Convergence:")
print("  Unpreconditioned iterations:", len(residuals_unprecond))
print("  Preconditioned iterations:", len(residuals_precond))
print("  Speedup factor:", round(len(residuals_unprecond) / len(residuals_precond), 1))
print("  Predicted speedup (~sqrt(kappa)):", round(np.sqrt(kappa_A / kappa_AM), 1))
print("\nBoth solutions satisfy A x = b:",
      np.allclose(A @ x_un, b) and np.allclose(A @ x_pre, b))

Worked Example 5: Regularization and effective condition number#

Introduction#

Construct an ill-posed rectangular system (rapidly decaying singular values, effectively rank-deficient); solve via Tikhonov regularization with varying $\lambda$; trace the condition-number reduction and the solution-norm vs. residual trade-off (L-curve).

Purpose#

Demonstrate how regularization shifts small singular values and reduces effective condition number.

Importance#

Core technique in inverse problems, ridge regression, and ill-posed ML problems.

What this example demonstrates#

  • Construct rank-deficient or ill-posed system (exponentially decaying singular values).

  • Solve regularized system $(A^\top A + \lambda I) w = A^\top b$ for range of $\lambda$.

  • Compute effective condition number: $\kappa_\lambda = (\sigma_1^2 + \lambda) / (\sigma_n^2 + \lambda)$.

  • Trace the L-curve (solution norm vs. residual); the optimal $\lambda$ sits at the corner.

  • Verify that small $\lambda$ gives large solution, large $\lambda$ gives small solution but large residual.

Background#

Tikhonov adds the penalty $\lambda \| w \|^2$; in the normal equations this shifts the eigenvalues of $A^\top A$ from $\sigma_i^2$ to $\sigma_i^2 + \lambda$.
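In the SVD basis the regularized solution damps each component by a filter factor $f_i = \sigma_i^2 / (\sigma_i^2 + \lambda)$: components with $\sigma_i \gg \sqrt{\lambda}$ pass nearly unchanged, those with $\sigma_i \ll \sqrt{\lambda}$ are suppressed. A quick sketch with illustrative values:

```python
import numpy as np

# Tikhonov in the SVD basis: w = sum_i f_i * (u_i^T b / s_i) * v_i with
# filter factors f_i = s_i^2 / (s_i^2 + lam). Large singular values pass
# through; small ones (which amplify noise) are damped toward zero.
s = np.exp(-np.linspace(0, 4, 9))   # decaying singular values, 1 down to e^-4
lam = 1e-2
f = s**2 / (s**2 + lam)
for si, fi in zip(s, f):
    print(f"sigma = {si:.4f}   filter factor = {fi:.4f}")
```

The crossover sits near $\sigma_i \approx \sqrt{\lambda} = 0.1$, which is how $\lambda$ controls the effective rank of the regularized solution.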

Historical context#

Tikhonov (1963) for ill-posed problems; Hansen (1998) developed regularization parameter selection tools.

History#

Standard in inverse problems (deblurring, tomography); ridge regression standard in ML.

Prevalence in ML#

Ridge regression and elastic net regularize explicitly; dropout and early stopping act as implicit regularization.

Notes#

  • Optimal $\lambda$ balances fit (small residual) and solution norm (small solution).

  • L-curve method (Hansen, 1992): plot $\log(\| r \|)$ vs. $\log(\| w \|)$; the corner indicates a good $\lambda$.

Connection to ML#

Bias-variance trade-off: underfitting vs. overfitting; regularization parameter selection critical.

Connection to Linear Algebra Theory#

Spectral filtering: Tikhonov filters small singular values; LSQR via early stopping does implicit filtering.

Pedagogical Significance#

Shows how problem structure (ill-posedness) guides algorithm design (regularization).

References#

  1. Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4.

  2. Hansen, P. C. (1998). Rank-deficient and Discrete Ill-Posed Problems.

  3. Golub, G. H., Hansen, P. C., & O'Leary, D. P. (1999). Tikhonov regularization and total least squares. SIAM Journal on Matrix Analysis and Applications, 21(1).

Solution (Python)#

import numpy as np
np.random.seed(24)

# Construct ill-posed system
m, n = 100, 40
U, _ = np.linalg.qr(np.random.randn(m, m))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.exp(-np.linspace(0, 4, n))  # Exponentially decaying singular values
A = U[:, :n] @ np.diag(s) @ V.T  # first n left singular vectors of U

# True solution and noisy data
x_true = np.zeros(n)
x_true[:5] = [10, 8, 6, 4, 2]
b_clean = A @ x_true
noise_level = 0.01
b = b_clean + noise_level * np.random.randn(m)

# Solve Tikhonov for range of lambda
lams = np.logspace(-6, 2, 50)
sol_norms = []
residuals = []
cond_numbers = []

for lam in lams:
    G = A.T @ A + lam * np.eye(n)
    w = np.linalg.solve(G, A.T @ b)
    sol_norms.append(np.linalg.norm(w))
    residuals.append(np.linalg.norm(A @ w - b))
    cond_numbers.append(np.linalg.cond(G))

sol_norms = np.array(sol_norms)
residuals = np.array(residuals)
cond_numbers = np.array(cond_numbers)

# Find corner of L-curve: the point of maximum curvature. This uses a rough
# finite-difference proxy (second derivative of the log-log curve);
# dedicated L-curve routines use an analytic curvature formula.
log_res = np.log(residuals)
log_norm = np.log(sol_norms)
curvature = np.gradient(np.gradient(log_res) / np.gradient(log_norm))
opt_idx = np.argmax(curvature)
lam_opt = lams[opt_idx]

# Solve with optimal lambda
G_opt = A.T @ A + lam_opt * np.eye(n)
w_opt = np.linalg.solve(G_opt, A.T @ b)

print("Regularization parameter selection (L-curve method):")
print("  Optimal lambda:", format(lam_opt, '.2e'))
print("  At optimal lambda:")
print("    Solution norm:", round(np.linalg.norm(w_opt), 4))
print("    Residual norm:", round(np.linalg.norm(A @ w_opt - b), 4))
print("    Condition number:", round(np.linalg.cond(G_opt), 2))
print("\nFor comparison:")
print("  Smallest lambda: cond =", round(cond_numbers[0], 0), ", residual =", 
      format(residuals[0], '.2e'))
print("  Largest lambda: cond =", round(cond_numbers[-1], 0), ", residual =", 
      format(residuals[-1], '.2e'))
print("\nTrue solution norm:", round(np.linalg.norm(x_true), 4))
print("Solution error at optimal lambda:", round(np.linalg.norm(w_opt - x_true), 4))
