Chapter 4
Linear Maps & Matrices
Key ideas: Introduction

Introduction#

Linear maps (also called linear transformations or functions) are structure-preserving transformations between vector spaces: they respect addition and scalar multiplication. Matrices are their concrete representation: a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ is represented as a matrix $A \in \mathbb{R}^{m \times d}$ so that $f(x) = Ax$. This is the language of neural networks: each layer is a composition of linear maps (matrix multiplications) and nonlinear activations. Understanding linear maps clarifies:

  • Model expressiveness: What functions can be represented? (Universal approximation via composition of linear maps and nonlinearities.)

  • Gradient flow: How do errors backpropagate through layers? (Chain rule uses transposes of linear map matrices.)

  • Data transformation: How do representations change through layers? (Each layer applies a linear map to its input.)

  • Optimization: How should weights change to reduce loss? (Gradient is also a linear map, obtained via transpose.)

Linear maps are everywhere in ML:

  • Neural networks: Each dense layer is a linear map $h_{i+1} = \sigma(W_i h_i + b_i)$ (linear map $W_i$, then activation $\sigma$).

  • Attention: Query/Key/Value projections are linear maps. Attention output is a weighted linear combination.

  • Least squares: Solving $\hat{w} = (X^\top X)^{-1} X^\top y$ involves products of linear maps.

  • PCA: Projection onto principal components is a linear map.

  • Convolution: Convolutional layers are structured linear maps (the same weights shared across spatial positions); in the frequency domain they act as pointwise multiplication.

Important Ideas#

1. Linear map = function preserving structure. A function $f: V \to W$ between vector spaces is linear if:

  • Additivity: $f(u + v) = f(u) + f(v)$ for all $u, v \in V$.

  • Homogeneity: $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$.

Why these properties? Linear maps are exactly those that can be written as matrix multiplication: $f(x) = Ax$. Additivity ensures the matrix distributes: $A(x + y) = Ax + Ay$. Homogeneity ensures scaling: $A(\alpha x) = \alpha (Ax)$.

Example: Rotation by angle $\theta$ is linear: $f([x, y]^\top) = [\cos\theta \cdot x - \sin\theta \cdot y, \sin\theta \cdot x + \cos\theta \cdot y]^\top = R_\theta [x, y]^\top$.

Non-example: $f(x) = x + 1$ is not linear (fails $f(0) = 0$ test). $f(x) = \|x\|$ is not linear (not additive).
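Both properties can be checked numerically; a quick sketch (the maps `f` and `g` below encode the rotation example and the shift non-example above):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
# Rotation by theta: a linear map represented by R_theta.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
f = lambda v: R @ v      # linear
g = lambda v: v + 1.0    # shift: NOT linear (g(0) != 0)

u, v, alpha = rng.normal(size=2), rng.normal(size=2), 2.5
# Additivity and homogeneity hold for f...
print(np.allclose(f(u + v), f(u) + f(v)))        # True
print(np.allclose(f(alpha * u), alpha * f(u)))   # True
# ...but additivity fails for g: g(u+v) differs from g(u)+g(v) by 1.
print(np.allclose(g(u + v), g(u) + g(v)))        # False
```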

2. Matrix representation is unique (up to basis). For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ with standard bases, the matrix $A \in \mathbb{R}^{m \times d}$ satisfies $f(x) = Ax$ uniquely. Columns of $A$ are images of standard basis vectors: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Why unique? By linearity, $f(x) = f(\sum_j x_j e_j) = \sum_j x_j f(e_j)$. If we know $f$ on basis vectors, we know $f$ everywhere.

Example: $f(x) = 2x_1 + 3x_2$ is $f([x_1, x_2]^\top) = [2, 3] \cdot [x_1, x_2]^\top$. Matrix is $A = [2, 3]$ (1 row, 2 columns).
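The construction "columns are images of basis vectors" can be sketched directly for the example map (the lambda `f` below is illustrative):

```python
import numpy as np

# Rebuild the matrix for f(x) = 2*x_1 + 3*x_2 from its action on the basis.
f = lambda x: 2 * x[0] + 3 * x[1]
e1, e2 = np.array([1., 0.]), np.array([0., 1.])
A = np.array([[f(e1), f(e2)]])    # columns are the images f(e_1), f(e_2)
print(A)                          # [[2. 3.]]

x = np.array([0.5, -2.0])
print(np.isclose(A @ x, f(x)))    # A reproduces f on every input
```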

3. Composition = matrix multiplication. For linear maps $f: \mathbb{R}^d \to \mathbb{R}^m$ with matrix $A$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ with matrix $B$, the composition $g \circ f: \mathbb{R}^d \to \mathbb{R}^p$ has matrix $BA$ (note order: right-to-left in notation, left-to-right in matrix product).

Why this order? $(g \circ f)(x) = g(f(x)) = g(Ax) = B(Ax) = (BA)x$. Matrix product $BA$ is therefore natural for composition.

Example: Neural network layer 1 applies $A_1$, layer 2 applies $A_2$. Composition is $A_2 A_1$ (layer 1 first, then layer 2).
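A quick numerical check of the composition order (random matrices with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))   # f: R^2 -> R^3
B = rng.normal(size=(4, 3))   # g: R^3 -> R^4
x = rng.normal(size=2)
# g(f(x)) = B(Ax) equals (BA)x: composition g . f has matrix BA.
print(np.allclose(B @ (A @ x), (B @ A) @ x))   # True
```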

4. Transpose = dual map (adjoint). For matrix $A: \mathbb{R}^d \to \mathbb{R}^m$, the transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^d$ is the unique linear map satisfying: $$ (Ax)^\top y = x^\top (A^\top y) \quad \text{for all } x, y $$

Geometric interpretation: For an orthogonal matrix (e.g., a rotation), $A^\top = A^{-1}$, so the transpose exactly undoes the rotation. In general, $A^\top$ maps from the output space $\mathbb{R}^m$ back to the input space $\mathbb{R}^d$, paired with $A$ through the inner-product identity above.

In backprop: If forward pass applies $y = Ax$, reverse mode applies $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (transpose carries gradients backward).

Example: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
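The defining identity of the adjoint can be verified numerically (random $A$, $x$, $y$ for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 2))
x, y = rng.normal(size=2), rng.normal(size=3)
# Adjoint property: <Ax, y> = <x, A^T y>.
print(np.isclose((A @ x) @ y, x @ (A.T @ y)))   # True
```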

5. Image and kernel characterize a linear map. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:

  • Image (column space): $\text{im}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^d\}$ (all possible outputs). Dimension = rank$(A)$.

  • Kernel (null space): $\ker(A) = \text{null}(A) = \{x : Ax = 0\}$ (inputs mapping to zero). Dimension = nullity$(A) = d - \text{rank}(A)$.

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = d$ (the input dimension splits into the dimension of the image plus the dimension of the kernel).

Why important? Image tells us what the map can represent. Kernel tells us what information is lost. For invertible maps, kernel is trivial (only zero maps to zero).
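A sketch on a rank-deficient matrix (values illustrative; the second column is twice the first):

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.],
              [3., 6.]])          # second column = 2 x first: rank 1
d = A.shape[1]
rank = np.linalg.matrix_rank(A)
nullity = d - rank                # rank-nullity: rank + nullity = d
print(rank, nullity)              # 1 1
# A kernel vector: [2, -1] maps to zero (information lost by A).
print(np.allclose(A @ np.array([2., -1.]), 0))   # True
```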

Relevance to Machine Learning#

Expressiveness through composition. A single linear map is limited (it can only represent rotations, scalings, shears, and projections, and a composition of linear maps is still linear). Interleaving linear maps with nonlinearities dramatically increases expressiveness. The universal approximation theorem (Cybenko 1989) says a single hidden layer with a suitable activation can approximate any continuous function on a compact set, given enough hidden units.

Gradient computation via transposes. Backpropagation is the chain rule applied backward through the network. Gradient w.r.t. input of a layer uses the transpose of the weight matrix. Understanding transposes is essential for implementing and understanding neural networks.

Data transformation and representation learning. Neural networks learn by composing linear maps (weight matrices) with nonlinearities. Early layers learn low-level features (via image of $A_1$). Deep layers compose these into high-level features (via $(A_k \cdots A_2 A_1)$).

Optimization structure. Gradient descent updates weights proportional to $X^\top (Xw - y)$ (linear map composition). Understanding matrix products clarifies why batch size, feature dimension, and conditioning affect optimization.

Algorithmic Development History#

1. Linear transformations (Euler, 1750s-1770s). Euler rotated coordinate systems to solve differential equations and optimize geometry problems. Rotations are linear maps.

2. Matrix algebra (Cayley, Sylvester, 1850s-1880s). Introduced matrices as algebraic objects. Cayley-Hamilton theorem: matrices satisfy their own characteristic polynomial. Matrix multiplication defined to represent composition of linear transformations.

3. Bilinear forms and adjoints (Cauchy, Hermite, Hilbert, 1800s-1900s). Developed duality theory: every linear form has an adjoint. Transpose is the matrix adjoint.

4. Rank and nullity (Grassmann 1844, Frobenius 1870s-1880s). Formalized rank as dimension of image. Rank-nullity theorem central to linear algebra.

5. Spectral theory (Schur 1909, Hilbert 1920s). Every symmetric matrix can be orthogonally diagonalized into eigenvalues/eigenvectors, and every square matrix admits a Schur triangularization. Spectral decomposition reveals the structure of linear maps.

6. Computational algorithms (Householder 1958, Golub-Kahan 1965): Developed numerically stable algorithms for matrix factorization (QR, SVD, Cholesky). Made linear algebra practical at scale.

7. Neural networks and backprop (Rumelhart, Hinton, Williams 1986). Showed that composing linear maps with nonlinearities, trained via backprop (which uses transposes), learns powerful representations. Modern deep learning.

8. Transformers and attention (Vaswani et al. 2017). All attention operations are linear maps: $\text{softmax}(QK^\top) V$ is a composition of matrix multiplications, softmax (nonlinear), and another multiplication.

Definitions#

Linear map (linear transformation). A function $f: V \to W$ between vector spaces over $\mathbb{R}$ is linear if:

  1. $f(u + v) = f(u) + f(v)$ for all $u, v \in V$ (additivity).

  2. $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$ (homogeneity).

Equivalently: $f(\alpha u + \beta v) = \alpha f(u) + \beta f(v)$ (linearity).

Matrix representation. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$, the matrix $A \in \mathbb{R}^{m \times d}$ represents $f$ if $f(x) = Ax$ for all $x \in \mathbb{R}^d$. Columns of $A$ are: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.

Image and kernel. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$: $$ \text{im}(A) = \{Ax : x \in \mathbb{R}^d\} = \text{col}(A), \quad \text{ker}(A) = \{x : Ax = 0\} = \text{null}(A) $$

Rank. The rank of $A$ is: $$ \text{rank}(A) = \dim(\text{im}(A)) = \dim(\text{col}(A)) = \text{number of linearly independent columns} $$

Nullity. The nullity of $A$ is: $$ \text{nullity}(A) = \dim(\text{ker}(A)) = d - \text{rank}(A) $$

Rank-nullity theorem. For any matrix $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$

Transpose (adjoint). The transpose of $A \in \mathbb{R}^{m \times d}$ is $A^\top \in \mathbb{R}^{d \times m}$ satisfying: $$(Ax)^\top y = x^\top (A^\top y), \quad (AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A$$

Invertible matrix. A square matrix $A \in \mathbb{R}^{d \times d}$ is invertible (nonsingular) if there exists $A^{-1}$ such that $AA^{-1} = A^{-1} A = I$. Equivalent: $\text{rank}(A) = d$ (full rank), $\ker(A) = \{0\}$ (trivial kernel), $\det(A) \neq 0$ (nonzero determinant).
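The equivalent invertibility conditions can be checked on a small example (values illustrative):

```python
import numpy as np

A = np.array([[1., 0.],
              [0., 2.]])
# Three equivalent checks from the definition above.
print(np.linalg.matrix_rank(A) == A.shape[0])   # full rank: True
print(np.isclose(np.linalg.det(A), 2.0))        # det != 0: True
Ainv = np.linalg.inv(A)
print(np.allclose(A @ Ainv, np.eye(2)))         # A A^{-1} = I: True
```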

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rank-nullity theorem. For $A \in \mathbb{R}^{m \times d}$: $$ \text{rank}(A) + \text{nullity}(A) = d $$ Consequences: If $\text{rank}(A) < d$, solutions to $Ax = b$ are not unique (null space is non-trivial). For invertible $A$ (rank = $d$), solutions are unique.

  2. Fundamental theorem of linear algebra. Orthogonal decomposition: $\mathbb{R}^d = \text{col}(A^\top) \oplus \text{null}(A)$ and $\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top)$ (orthogonal direct sums). Basis for all linear algebra.

  3. Universal approximation (Cybenko 1989, Hornik 1991). A neural network with one hidden layer (linear map + nonlinearity + output linear map) can approximate any continuous function on compact sets arbitrarily well (with enough hidden units).

  4. Spectral theorem for symmetric matrices (Hamilton, Sylvester, 1850s-1880s). Every symmetric $A$ has eigendecomposition $A = U \Lambda U^\top$ (orthogonal diagonalization). Basis for PCA, optimization, understanding symmetric structures.

  5. Singular Value Decomposition (Beltrami 1873, Eckart-Young 1936). Every matrix $A \in \mathbb{R}^{m \times d}$ can be written as $A = U \Sigma V^\top$ (orthogonal $U, V$, diagonal $\Sigma$). Reveals low-rank structure, optimal approximations, conditioning.

Why essential: These theorems quantify what linear maps can/cannot represent, how to invert them, when solutions exist, and how to find optimal approximations.

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. Backpropagation and gradient-based learning (Rumelhart et al. 1986, 1990s-present). Automatic differentiation computes gradients via chain rule (composition of matrix transposes). Enables training networks with billions of parameters. All modern deep learning depends on this.

  2. Dense neural networks (Cybenko 1989, Hornik 1991, 1990s-present). Theoretical universality + practical training via backprop = powerful function approximators. AlexNet (2012) showed depth matters: stacking linear maps + activations learns hierarchical representations.

  3. Convolutional Neural Networks (LeCun et al. 1990, AlexNet 2012, ResNet 2015). Structured linear maps (convolution with weight sharing). Dramatically reduced parameters vs. dense. State-of-the-art on vision (ImageNet), object detection, segmentation.

  4. Recurrent Neural Networks and LSTMs (Hochreiter & Schmidhuber 1997, 2000s-present). Apply the same linear map over time steps (sequence models for NLP, time series). Enabled machine translation, speech recognition.

  5. Transformers and Attention (Vaswani et al. 2017, Devlin et al. 2018, GPT series 2018-2023). All-attention architecture (linear projections + softmax + matrix multiply). Achieved state-of-the-art across NLP (GLUE, SuperGLUE), vision (ImageNet via ViT), multimodal (CLIP). Scales to trillions of parameters.

  6. Least squares for regression (Gauss, Legendre, Tikhonov, modern methods). Normal equations $(X^\top X) w = X^\top y$ solved via QR/SVD (numerically stable). Classical ML workhorse; fast closed-form solution, interpretable results.

Why essential: These systems achieve state-of-the-art by leveraging linear map structure (composition, transposes, efficient matrix multiply). Understanding linear algebra is necessary to design architectures, optimize, and debug.

Key ideas: Where it shows up

1. Backpropagation and Gradient Flow — Transpose carries errors backward#

Major achievements:

  • Backpropagation (Rumelhart, Hinton, Williams 1986): Efficient algorithm for computing gradients through neural networks via chain rule. Each layer applies $y = \sigma(W x + b)$; backward pass uses $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$ (transpose carries gradients).

  • Modern deep learning (1990s-2010s): Backprop enabled training of deep networks (10-1000+ layers). Scaling to billions of parameters (GPT, Vision Transformers).

  • Automatic differentiation (1980s-present): Frameworks (TensorFlow, PyTorch) implement backprop automatically by composing transposes. Practitioners never write transposes explicitly; framework handles it.

  • Applications: All supervised learning, reinforcement learning, generative models. Billions of backprop steps every day globally.

Connection to linear maps: The forward pass chains linear maps with nonlinearities: $f = \sigma_k \circ A_k \circ \sigma_{k-1} \circ A_{k-1} \circ \cdots \circ \sigma_1 \circ A_1$. The backward pass propagates the loss gradient through the same chain in reverse, multiplying by the transpose of each weight matrix and the (diagonal) derivative of each activation: $\nabla_x L = A_1^\top \operatorname{diag}(\sigma_1') \, A_2^\top \operatorname{diag}(\sigma_2') \cdots A_k^\top \operatorname{diag}(\sigma_k') \, \nabla_f L$ (products of transposes).
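The chain of transposes can be traced explicitly for two layers; a minimal sketch with fixed illustrative weights and a ReLU activation:

```python
import numpy as np

# Two-layer forward pass with ReLU; weights and input chosen for illustration.
A1 = np.array([[1., 0., 2.],
               [0., -1., 1.],
               [2., 1., 1.],
               [1., 1., 1.]])
A2 = np.array([[1., -1., 0., 2.],
               [0., 1., 1., -1.]])
x = np.array([0.5, -1., 2.])
h = A1 @ x                     # first linear map
z = np.maximum(h, 0.0)         # ReLU
y = A2 @ z                     # second linear map
dL_dy = np.array([1., -0.5])   # upstream gradient (illustrative)
# Backward pass: each linear map contributes its transpose.
dL_dz = A2.T @ dL_dy
dL_dh = dL_dz * (h > 0)        # ReLU derivative is 0/1, applied elementwise
dL_dx = A1.T @ dL_dh
print(dL_dx)                   # [2.5 3.5 2.5]
```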

2. Neural Network Layers — Linear maps + activation functions#

Major achievements:

  • Dense layers (Rosenblatt Perceptron 1958, MLPs 1970s-1980s): Input $x$, linear map $h = Wx + b$, activation $y = \sigma(h)$ (ReLU, sigmoid, tanh). Each layer is a learnable linear map.

  • Depth (ResNet, He et al. 2015; Transformers, Vaswani et al. 2017): 50-1000 layers. Skip connections $x_{i+1} = \sigma(W_i x_i + b_i) + x_i$ allow training very deep networks. Each residual branch is a composition of linear maps.

  • Scaling (AlexNet 2012, GPT-3 2020, Gato 2022): Modern networks: billions to trillions of parameters. Matrix multiply dominates computation. Large linear maps $W \in \mathbb{R}^{4096 \times 4096}$ applied to batches.

  • Optimization: Understanding composition of linear maps helps explain generalization (implicit regularization favors low-complexity solutions in the span of data).

Connection to linear maps: Each dense layer is $W: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$. Network composes $W_k \circ \sigma \circ W_{k-1} \circ \sigma \circ \cdots \circ W_1$. Expressiveness comes from depth (composition) and nonlinearity ($\sigma$).

3. Attention Mechanism — Multi-head projections and weighted sums#

Major achievements:

  • Scaled dot-product attention (Vaswani et al. 2017): Queries, Keys, Values are projections (linear maps) $Q = XW_Q, K = XW_K, V = XW_V$. Attention weights $A = \text{softmax}(QK^\top / \sqrt{d_k})$. Output $\text{Attention}(Q,K,V) = AV$ (matrix multiply with softmax-weighted rows).

  • Multi-head attention: $h$ heads, each applying its own linear projections. Concatenate the head outputs: $\text{MultiHead}(Q,K,V) = \text{Concat}(O_1, \ldots, O_h) W^O$, where $O_i$ is head $i$'s output and $W^O$ is a final linear map combining the heads.

  • Transformers (Vaswani 2017, Devlin et al. 2018): Attention layers (all linear maps + softmax) in sequence. BERT, GPT achieve state-of-the-art across NLP tasks.

  • Scale: GPT-3 (175B parameters), PaLM (540B), GPT-4. Training scales across thousands of GPUs, with matrix multiplication as bottleneck.

Connection to linear maps: Attention is composition of linear maps: $\text{Attention} = A V$ where $A = \text{softmax}(Q K^\top / \sqrt{d_k})$. Each head applies different linear projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$. Output is weighted linear combination of values.

4. Least Squares and Regression — Normal equations as linear system#

Major achievements:

  • Least squares (Gauss, Legendre, early 1800s): Solve $\min_w \|Xw - y\|_2^2$. Normal equations: $(X^\top X) w = X^\top y$. Linear system $Aw = b$ (product of two linear maps).

  • Ridge regression (Tikhonov 1963, Hoerl & Kennard 1970): Add regularization $\min_w (\|Xw - y\|_2^2 + \lambda \|w\|_2^2)$. Solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$ (invertible for any $\lambda > 0$).

  • LASSO (Tibshirani 1996): L1 regularization forces sparsity. Solved via proximal methods, which alternate gradient steps (built from linear maps) with the soft-thresholding proximal operator.

  • Kernel methods (Mercer 1909, Schölkopf & Smola 2001): Non-linear regression via Gram matrix $K = X X^\top$ (product of linear maps, then apply kernel trick).

Connection to linear maps: Normal equations involve products of matrices: $X^\top X$ (composition of $X^\top$ and $X$), $X^\top y$ (linear map applied to $y$). Solution involves matrix inversion (inverse is also a linear map).

5. Convolutional and Recurrent Networks — Structured linear maps#

Major achievements:

  • CNNs (LeCun et al. 1990s, AlexNet 2012, ResNet 2015): Convolutional layers are linear maps with weight sharing (same weights applied across spatial positions). Reduces parameters vs. dense layer (e.g., conv 3×3×64→64 channels vs. dense with same feature count).

  • RNNs, LSTMs (Hochreiter & Schmidhuber 1997): Recurrent layers apply the same linear map $W$ repeatedly over time: $h_t = \sigma(W h_{t-1} + U x_t)$ (composition of linear maps over time steps).

  • Efficiency: Weight sharing and structured matrices (convolution, recurrence) reduce parameters and computation compared to dense layers.

  • Interpretability: Convolutional structure learned by early layers is interpretable (edge filters, textures). Linear maps with structured sparsity/sharing have semantic meaning.

Connection to linear maps: Conv layer is a linear map (convolution can be written as matrix multiplication with Toeplitz structure). RNN applies same linear map repeatedly: composition $W \circ W \circ \cdots \circ W$ over time.
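The Toeplitz view can be made concrete in a few lines (a 1-D sketch; note that, as in conv layers, this computes cross-correlation with the kernel):

```python
import numpy as np

# 1-D "valid" convolution as a Toeplitz-structured matrix multiply.
x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])
n, m = len(x), len(k)
# Each row of T is a shifted copy of the kernel.
T = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    T[i, i:i + m] = k
print(T @ x)                               # [-2. -2. -2.]
print(np.correlate(x, k, mode='valid'))    # same result
```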

Notation

Standard Conventions#

1. Linear map and matrix notation.

  • Linear map: $f: V \to W$ or $A: \mathbb{R}^d \to \mathbb{R}^m$ (function notation).

  • Matrix representation: $A \in \mathbb{R}^{m \times d}$ or $[A]_{ij}$ for entry in row $i$, column $j$.

  • Matrix-vector product: $y = Ax$ (linear map applied to vector $x$).

  • Matrix-matrix product: $C = AB$ (composition: apply $B$ then $A$).

  • Image and kernel: $\text{im}(A)$ or $\text{col}(A)$ for column space; $\ker(A)$ or $\text{null}(A)$ for null space.

Examples:

  • Linear map: $f(x) = 3x_1 - 2x_2 \in \mathbb{R}$. Matrix: $A = [3, -2] \in \mathbb{R}^{1 \times 2}$.

  • Linear map: $(x, y) \mapsto (2x + y, x - 3y)$. Matrix: $A = \begin{bmatrix} 2 & 1 \\ 1 & -3 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$.

  • Composition: Neural network layer 1: $h_1 = \sigma(W_1 x)$, layer 2: $h_2 = \sigma(W_2 h_1) = \sigma(W_2 \sigma(W_1 x))$. Composition: $f = \sigma \circ (W_2 \circ \sigma \circ W_1)$.

2. Rank notation.

  • Rank: $\text{rank}(A)$ = dimension of column space = number of linearly independent columns.

  • Nullity: $\text{nullity}(A) = d - \text{rank}(A)$ (dimension of null space).

  • Full rank: $\text{rank}(A) = \min(m, d)$ (maximum possible rank).

  • Rank deficient: $\text{rank}(A) < \min(m, d)$ (singular or near-singular).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 2 (full rank), columns independent.

  • $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$. Rank = 1 (rank deficient), second column = 2 × first column.

3. Transpose notation.

  • Transpose: $A^\top$ (rows and columns swapped).

  • Adjoint property: $(Ax)^\top y = x^\top (A^\top y)$ (inner product duality).

  • Composition rule: $(AB)^\top = B^\top A^\top$ (note reversed order).

  • Inverse of transpose: $(A^\top)^{-1} = (A^{-1})^\top$ (for invertible $A$).

Examples:

  • $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$.

  • Gradient in backprop: $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (linear map $A$ → transpose $A^\top$ for gradient).
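Both rules above are easy to confirm numerically (random matrices for illustration; the shift by $3I$ just keeps $M$ safely invertible):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
# Composition rule: (AB)^T = B^T A^T (reversed order).
print(np.allclose((A @ B).T, B.T @ A.T))                     # True
M = rng.normal(size=(3, 3)) + 3 * np.eye(3)                  # invertible
# Inverse of transpose: (M^T)^{-1} = (M^{-1})^T.
print(np.allclose(np.linalg.inv(M.T), np.linalg.inv(M).T))   # True
```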

4. Composition and chaining notation.

  • Composition operator: $(f \circ g)(x) = f(g(x))$ (apply $g$ first, then $f$).

  • Matrix chaining: For $f = A, g = B$, composition is $f \circ g = A \circ B$ with matrix product $AB$ (apply $B$ then $A$).

  • Neural network layers: Output $h_i = \sigma_i(A_i h_{i-1})$ (chain $A_1, \sigma_1, A_2, \sigma_2, \ldots$).

Examples:

  • Rotate by $\theta$, then scale by $2$: $S_2 \circ R_\theta$. Matrix: $S_2 R_\theta$ (the map applied first appears rightmost in the product).

  • Neural network: $f(x) = \sigma_2(A_2 \sigma_1(A_1 x))$. Composition: $\sigma_2 \circ A_2 \circ \sigma_1 \circ A_1$.

5. Invertibility and determinant notation.

  • Invertible (nonsingular): $A^{-1}$ exists; $AA^{-1} = A^{-1} A = I$.

  • Determinant: $\det(A)$ or $|A|$. For invertibility: $\det(A) \neq 0 \Leftrightarrow A$ invertible.

  • Condition number: $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max} / \sigma_{\min}$ (ratio of largest to smallest singular value).

Examples:

  • $A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. $\det(A) = 2 \neq 0$, so $A$ is invertible. $A^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1/2 \end{bmatrix}$.

  • Ill-conditioned matrix: $\kappa(A) = 10^{10}$ (nearly singular). Small perturbations cause large changes in solution. Use regularization or preconditioning.
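The condition number can be computed directly from singular values; a sketch on a nearly singular matrix (values illustrative):

```python
import numpy as np

A = np.array([[1., 1.],
              [1., 1.000001]])   # nearly singular
s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[-1]             # sigma_max / sigma_min
print(kappa)                     # ~4e6: ill-conditioned
print(np.isclose(kappa, np.linalg.cond(A)))   # True (cond uses the 2-norm)
```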

6. Special matrices notation.

  • Identity: $I \in \mathbb{R}^{d \times d}$ (1's on the diagonal, 0's elsewhere).

  • Orthogonal/orthonormal: $Q^\top Q = QQ^\top = I$ (columns/rows orthonormal).

  • Symmetric: $A^\top = A$.

  • Positive semi-definite (PSD): $A \succeq 0$; all eigenvalues $\geq 0$. Covariance matrices are PSD.

Examples:

  • QR decomposition: $A = QR$ where $Q$ orthonormal, $R$ upper triangular.

  • Symmetric matrix: $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$. Eigendecomposition: $\Sigma = U \Lambda U^\top$ (orthonormal $U$, diagonal $\Lambda$).

  • PSD matrix: Covariance $\text{Cov}(X) \succeq 0$ (always PSD). Gram matrix $G = X^\top X \succeq 0$ (always PSD).
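A quick check that a Gram matrix is symmetric PSD (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))
G = X.T @ X                        # Gram matrix
print(np.allclose(G, G.T))         # symmetric: True
eigvals = np.linalg.eigvalsh(G)
print(np.all(eigvals >= -1e-10))   # eigenvalues >= 0 (up to rounding): PSD
```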

Pitfalls & sanity checks

When working with linear maps and matrices:

  1. Always check shapes. Matrix multiply requires compatible dimensions. $A \in \mathbb{R}^{m \times d}$, $x \in \mathbb{R}^d$ yields $Ax \in \mathbb{R}^m$. Shape mismatch = runtime error.

  2. Prefer stable decompositions. Never compute $(X^\top X)^{-1}$ explicitly. Use QR (via solve) or SVD (truncate small singular values) for numerical stability.

  3. Transpose order matters. $(AB)^\top = B^\top A^\top$ (reversed order). In backprop, composition reverses layer order via transposes.

  4. Condition number determines stability. If $\kappa(A) > 10^8$, expect numerical errors. Use regularization (Ridge, Tikhonov) or preconditioning.

  5. Gradients flow via transposes. Backprop systematically applies transposes. Understand: ill-conditioned weights → vanishing/exploding gradients.
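Item 2 can be illustrated by comparing a stable solver against the explicit normal equations (random well-conditioned data, so both agree here; the gap between them appears for ill-conditioned $X$):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
# Stable: QR/SVD-based least-squares solver.
w_stable, *_ = np.linalg.lstsq(X, y, rcond=None)
# Normal equations: squares the condition number of X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_stable, w_normal))   # True for this well-conditioned X
```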

References

Foundational texts:

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge University Press.

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

Linear maps and matrix theory:

  1. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

  2. Hoffman, K., & Kunze, R. (1971). Linear Algebra (2nd ed.). Prentice-Hall.

  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

  4. Axler, S. J., Bourdon, P. S., & Wade, W. M. (2000). Harmonic Function Theory (2nd ed.). Springer.

Neural networks and backpropagation:

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533–536.

  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  3. Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM.

  4. LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). “Efficient backprop.” In Neural Networks: Tricks of the Trade (pp. 9–50). Springer.

Optimization:

  1. Robbins, H., & Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22(3), 400–407.

  2. Nesterov, Y. (2018). Lectures on Convex Optimization (2nd ed.). Springer.

  3. Kingma, D. P., & Ba, J. (2014). “Adam: A method for stochastic optimization.” arXiv:1412.6980.

Transformers and attention:

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). “Attention is all you need.” In NeurIPS (pp. 5998–6008).

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding.” NAACL.

  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). “An image is worth 16×16 words: Transformers for image recognition at scale.” ICLR.

Least squares and numerical methods:

  1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Dover reprint.

  2. Golub, G. H., & Pereyra, V. (1973). “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate.” SIAM Journal on Numerical Analysis, 10(2), 413–432.

Five worked examples

Worked Example 1: Backprop uses transpose#

Problem. For $y = Wx$, show $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$.

Solution (math). The Jacobian of $y = Wx$ with respect to $x$ is $W$; reverse-mode differentiation multiplies the upstream gradient by the Jacobian transpose, giving $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$.

Solution (Python).

import numpy as np

W = np.array([[2., 1.], [-1., 3.]])   # forward map y = Wx
dL_dy = np.array([0.5, -2.])          # upstream gradient dL/dy
print(W.T @ dL_dy)                    # dL/dx = W^T dL/dy

Worked Example 2: Q,K,V projections in transformers#

Problem. Compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$.

Solution (math). Each projection is a linear map from the model dimension to the head dimension, applied row-wise to the token matrix $X$.

Solution (Python).

import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])   # three tokens, d_model = 2
Wq = np.array([[1., 0.], [0., 2.]])
Wk = np.array([[2., 0.], [0., 1.]])
Wv = np.array([[1., 1.], [0., 1.]])
print(X @ Wq)   # queries Q
print(X @ Wk)   # keys K
print(X @ Wv)   # values V

Worked Example 3: Normal equations matrix#

Problem. Form $A = X^\top X$ and $b = X^\top y$ for least squares.

Solution (math). Solving $Aw = b$ is equivalent to minimizing $\|Xw - y\|_2^2$ when $X$ has full column rank.

Solution (Python).

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.]])   # design matrix with intercept column
y = np.array([1., 2., 2.5])
A = X.T @ X   # Gram matrix X^T X
b = X.T @ y   # right-hand side X^T y
print(A)
print(b)

Worked Example 4: Batch GD as matrix products#

Problem. Compute one gradient step for MSE.

Solution (math). $w \leftarrow w - \eta \cdot \tfrac{1}{n} X^\top (Xw - y)$.

Solution (Python).

import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
y = np.array([1., 0., 1.])
w = np.zeros(2)
eta = 0.1                               # learning rate
g = (1 / len(X)) * X.T @ (X @ w - y)    # batch gradient: products of X^T and X
print(w - eta * g)                      # updated weights

Worked Example 5: Attention is matrix multiplication#

Problem. Compute $A = \text{softmax}(QK^\top / \sqrt{d_k})$ and the output $O = AV$.

Solution (math). Attention is a composition of matrix multiplications plus a row-wise softmax.

Solution (Python).

import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax, defined inline so the example is self-contained.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1., 0.], [0., 1.]])
K = np.array([[1., 0.], [1., 1.], [0., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])
scores = Q @ K.T / np.sqrt(2)   # scaled dot-product scores QK^T / sqrt(d_k)
A = softmax(scores, axis=1)     # attention weights; each row sums to 1
print(A @ V)                    # output O = AV
