Introduction
Linear maps (also called linear transformations) are structure-preserving functions between vector spaces: they respect addition and scalar multiplication. Matrices are their concrete representation: a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ is represented by a matrix $A \in \mathbb{R}^{m \times d}$ so that $f(x) = Ax$. This is the language of neural networks: each layer applies a linear map (a matrix multiplication) followed by a nonlinear activation, and the network is their composition. Understanding linear maps clarifies:
Model expressiveness: What functions can be represented? (Universal approximation via composition of linear maps and nonlinearities.)
Gradient flow: How do errors backpropagate through layers? (Chain rule uses transposes of linear map matrices.)
Data transformation: How do representations change through layers? (Each layer applies a linear map to its input.)
Optimization: How should weights change to reduce loss? (Gradients of linear layers are computed via transposes.)
Linear maps are everywhere in ML:
Neural networks: Each dense layer is a linear map $h_{i+1} = \sigma(W_i h_i + b_i)$ (linear map $W_i$, then activation $\sigma$).
Attention: Query/Key/Value projections are linear maps. Attention output is a weighted linear combination.
Least squares: Solving $\hat{w} = (X^\top X)^{-1} X^\top y$ involves products of linear maps.
PCA: Projection onto principal components is a linear map.
Convolution: Convolution is a linear map; a convolutional layer is a matrix multiplication with a structured (sparse, weight-shared) matrix.
Important Ideas
1. Linear map = function preserving structure. A function $f: V \to W$ between vector spaces is linear if:
Additivity: $f(u + v) = f(u) + f(v)$ for all $u, v \in V$.
Homogeneity: $f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$.
Why these properties? Linear maps are exactly those that can be written as matrix multiplication: $f(x) = Ax$. Additivity ensures the matrix distributes: $A(x + y) = Ax + Ay$. Homogeneity ensures scaling: $A(\alpha x) = \alpha (Ax)$.
Example: Rotation by angle $\theta$ is linear: $f([x, y]^\top) = [\cos\theta \cdot x - \sin\theta \cdot y, \sin\theta \cdot x + \cos\theta \cdot y]^\top = R_\theta [x, y]^\top$.
Non-example: $f(x) = x + 1$ is not linear (fails $f(0) = 0$ test). $f(x) = \|x\|$ is not linear (not additive).
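The two axioms are easy to check numerically. A minimal NumPy sketch (the angle and test vectors are arbitrary choices) verifying that rotation is linear while the two non-examples fail:

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

u, v, alpha = np.array([1.0, 2.0]), np.array([-3.0, 0.5]), 2.5

# Additivity and homogeneity hold for the rotation
assert np.allclose(R @ (u + v), R @ u + R @ v)
assert np.allclose(R @ (alpha * u), alpha * (R @ u))

# f(x) = x + 1 fails the f(0) = 0 test
f = lambda x: x + 1.0
print(f(np.zeros(2)))  # [1. 1.], not 0, so not linear

# f(x) = ||x|| is not additive (triangle inequality is strict here)
g = np.linalg.norm
print(g(u + v), g(u) + g(v))  # unequal
```

Checking the axioms on a handful of vectors is of course not a proof, but it is a quick sanity test when implementing a map you believe is linear.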
2. Matrix representation is unique (for a fixed choice of bases). For a linear map $f: \mathbb{R}^d \to \mathbb{R}^m$ with the standard bases, the matrix $A \in \mathbb{R}^{m \times d}$ satisfying $f(x) = Ax$ is unique. Columns of $A$ are images of standard basis vectors: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.
Why unique? By linearity, $f(x) = f(\sum_j x_j e_j) = \sum_j x_j f(e_j)$. If we know $f$ on basis vectors, we know $f$ everywhere.
Example: $f(x) = 2x_1 + 3x_2$ is $f([x_1, x_2]^\top) = [2, 3] \cdot [x_1, x_2]^\top$. Matrix is $A = [2, 3]$ (1 row, 2 columns).
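This recipe, columns are images of basis vectors, can be run directly. A sketch using an example map $f(x) = (2x_1 + 3x_2,\; x_1 - x_2)$ chosen for illustration:

```python
import numpy as np

def f(x):
    # example linear map R^2 -> R^2 (chosen for illustration)
    return np.array([2 * x[0] + 3 * x[1], x[0] - x[1]])

d = 2
# Columns of A are f(e_1), f(e_2): evaluate f on the standard basis
A = np.column_stack([f(e) for e in np.eye(d)])
print(A)  # [[ 2.  3.]
          #  [ 1. -1.]]

# Knowing f on the basis determines f everywhere
x = np.array([4.0, -1.0])
assert np.allclose(f(x), A @ x)
```

The assertion is exactly the uniqueness argument: since $f(x) = \sum_j x_j f(e_j)$, the matrix built from basis images reproduces $f$ on every input.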
3. Composition = matrix multiplication. For linear maps $f: \mathbb{R}^d \to \mathbb{R}^m$ with matrix $A$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ with matrix $B$, the composition $g \circ f: \mathbb{R}^d \to \mathbb{R}^p$ has matrix $BA$ (note order: right-to-left in notation, left-to-right in matrix product).
Why this order? $(g \circ f)(x) = g(f(x)) = g(Ax) = B(Ax) = (BA)x$. Matrix product $BA$ is therefore natural for composition.
Example: Neural network layer 1 applies $A_1$, layer 2 applies $A_2$. Composition is $A_2 A_1$ (layer 1 first, then layer 2).
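A quick numerical check of the composition order, with shapes chosen arbitrarily for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # f: R^4 -> R^3 (applied first)
B = rng.standard_normal((2, 3))   # g: R^3 -> R^2 (applied second)
x = rng.standard_normal(4)

# g(f(x)) = B(Ax) = (BA)x: B is on the left even though f acts first
assert np.allclose(B @ (A @ x), (B @ A) @ x)
print((B @ A).shape)  # (2, 4): a single map R^4 -> R^2
```

Note that the composed matrix $BA$ is computed once but represents applying $A$ then $B$, mirroring how a two-layer network (without nonlinearities) collapses to one linear map.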
4. Transpose = dual map (adjoint). For matrix $A: \mathbb{R}^d \to \mathbb{R}^m$, the transpose $A^\top: \mathbb{R}^m \to \mathbb{R}^d$ is the unique linear map satisfying:
$$
(Ax)^\top y = x^\top (A^\top y) \quad \text{for all } x, y
$$
Geometric interpretation: If $A$ is a rotation, $A^\top = A^{-1}$ rotates by the same angle in the opposite direction. If $A$ is an orthogonal projection onto a subspace, $A^\top = A$: the projection is its own adjoint.
In backprop: If forward pass applies $y = Ax$, reverse mode applies $\frac{\partial L}{\partial x} = A^\top \frac{\partial L}{\partial y}$ (transpose carries gradients backward).
Example: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $A^\top = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
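Both the defining identity and the backprop rule can be checked in a few lines. A sketch where the matrices, vectors, and the upstream gradient `dL_dy` are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))   # A: R^2 -> R^3
x = rng.standard_normal(2)
y = rng.standard_normal(3)

# Defining identity of the adjoint: <Ax, y> = <x, A^T y>
assert np.isclose((A @ x) @ y, x @ (A.T @ y))

# Backprop sketch: forward pass y = Ax, backward pass uses A^T
dL_dy = rng.standard_normal(3)    # upstream gradient (assumed given)
dL_dx = A.T @ dL_dy               # transpose carries gradients backward
print(dL_dx.shape)                # (2,): back in the input space
```

The shape change ($\mathbb{R}^3 \to \mathbb{R}^2$) is the point: gradients live in the input space, and the transpose is the map that takes them there.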
5. Image and kernel characterize a linear map. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:
Image (column space): $\text{im}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^d\}$ (all possible outputs). Dimension = rank$(A)$.
Kernel (null space): $\ker(A) = \text{null}(A) = \{x : Ax = 0\}$ (inputs mapping to zero). Dimension = nullity$(A) = d - \text{rank}(A)$.
Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = d$ (the input dimension splits into the part the map preserves and the part it collapses to zero).
Why important? Image tells us what the map can represent. Kernel tells us what information is lost. For invertible maps, kernel is trivial (only zero maps to zero).
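These quantities are directly computable. A sketch for a deliberately rank-deficient example matrix, using the SVD to extract a kernel basis (the right singular vectors beyond the rank span the null space):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # second row = 2 * first row, so rank 1

d = A.shape[1]
rank = np.linalg.matrix_rank(A)  # dimension of the image
nullity = d - rank               # rank-nullity theorem
print(rank, nullity)             # 1 2

# Kernel basis: right singular vectors with (numerically) zero singular value
_, s, Vt = np.linalg.svd(A)
kernel_basis = Vt[rank:]         # rows span ker(A)
assert np.allclose(A @ kernel_basis.T, 0)
```

Here two of three input dimensions are lost: any input component lying in the two-dimensional kernel is invisible to the map's output.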
Relevance to Machine Learning
Expressiveness through composition. A single linear map is limited (rotations, scalings, projections, and their combinations), and a composition of linear maps is still a single linear map. Interleaving nonlinearities is what dramatically increases expressiveness: the universal approximation theorem (Cybenko 1989) says a single hidden layer with a sigmoidal activation and enough units can approximate any continuous function on a compact set.
Gradient computation via transposes. Backpropagation is the chain rule applied backward through the network. Gradient w.r.t. input of a layer uses the transpose of the weight matrix. Understanding transposes is essential for implementing and understanding neural networks.
Data transformation and representation learning. Neural networks learn by composing linear maps (weight matrices) with nonlinearities. Early layers learn low-level features (captured by the image of the first layer's map); deeper layers compose these into high-level features by stacking further linear maps and nonlinearities.
Optimization structure. Gradient descent updates weights proportional to $X^\top (Xw - y)$ (linear map composition). Understanding matrix products clarifies why batch size, feature dimension, and conditioning affect optimization.
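A minimal sketch of this update rule on a synthetic least-squares problem (the data, step size, and iteration count are arbitrary choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))     # design matrix
w_true = np.array([1.0, -2.0, 0.5])  # assumed ground-truth weights
y = X @ w_true                       # noiseless targets for the sketch

w = np.zeros(3)
lr = 0.01                            # step size (illustrative, not tuned)
for _ in range(2000):
    grad = X.T @ (X @ w - y)         # apply X, then its transpose
    w -= lr * grad

print(w)  # converges toward w_true
```

The gradient line is the whole story: one forward linear map ($Xw$), a residual, and one transpose ($X^\top$) carrying the residual back to weight space. Conditioning of $X^\top X$ governs how large `lr` can be.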
Algorithmic Development History
1. Linear transformations (Euler, 1750s-1770s). Euler rotated coordinate systems to solve differential equations and optimize geometry problems. Rotations are linear maps.
2. Matrix algebra (Cayley, Sylvester, 1850s-1880s). Introduced matrices as algebraic objects. Cayley-Hamilton theorem: matrices satisfy their own characteristic polynomial. Matrix multiplication defined to represent composition of linear transformations.
3. Bilinear forms and adjoints (Cauchy, Hermite, Hilbert, 1800s-1900s). Developed duality theory: every linear form has an adjoint. Transpose is the matrix adjoint.
4. Rank and nullity (Grassmann 1844, Frobenius 1870s-1880s). Formalized rank as dimension of image. Rank-nullity theorem central to linear algebra.
5. Spectral theory (Schur 1909, Hilbert 1920s). Every square matrix is unitarily similar to a triangular matrix (Schur decomposition); symmetric matrices decompose into real eigenvalues and orthonormal eigenvectors. Spectral decompositions reveal the structure of linear maps.
6. Computational algorithms (Householder 1958, Golub-Kahan 1965): Developed numerically stable algorithms for matrix factorization (QR, SVD, Cholesky). Made linear algebra practical at scale.
7. Neural networks and backprop (Rumelhart, Hinton, Williams 1986). Showed that composing linear maps with nonlinearities, trained via backprop (which uses transposes), learns powerful representations. Modern deep learning.
8. Transformers and attention (Vaswani et al. 2017). Attention is built almost entirely from linear maps: $\text{softmax}(QK^\top) V$ composes matrix multiplications with a single nonlinearity (softmax) in between, and the Query/Key/Value projections are themselves linear maps.
Definitions
Linear map (linear transformation). A function $f: V \to W$ between vector spaces over $\mathbb{R}$ is linear if:
$f(u + v) = f(u) + f(v)$ for all $u, v \in V$ (additivity).
$f(\alpha v) = \alpha f(v)$ for all $v \in V$, $\alpha \in \mathbb{R}$ (homogeneity).
Equivalently: $f(\alpha u + \beta v) = \alpha f(u) + \beta f(v)$ (linearity).
Matrix representation. For linear map $f: \mathbb{R}^d \to \mathbb{R}^m$, the matrix $A \in \mathbb{R}^{m \times d}$ represents $f$ if $f(x) = Ax$ for all $x \in \mathbb{R}^d$. Columns of $A$ are: $A = [f(e_1) | f(e_2) | \cdots | f(e_d)]$.
Image and kernel. For linear map $A: \mathbb{R}^d \to \mathbb{R}^m$:
$$
\text{im}(A) = \{Ax : x \in \mathbb{R}^d\} = \text{col}(A), \quad \text{ker}(A) = \{x : Ax = 0\} = \text{null}(A)
$$
Rank. The rank of $A$ is:
$$
\text{rank}(A) = \dim(\text{im}(A)) = \dim(\text{col}(A)) = \text{number of linearly independent columns}
$$
Nullity. The nullity of $A$ is:
$$
\text{nullity}(A) = \dim(\text{ker}(A)) = d - \text{rank}(A)
$$
Rank-nullity theorem. For any matrix $A \in \mathbb{R}^{m \times d}$:
$$
\text{rank}(A) + \text{nullity}(A) = d
$$
Transpose (adjoint). The transpose of $A \in \mathbb{R}^{m \times d}$ is $A^\top \in \mathbb{R}^{d \times m}$ satisfying:
$$(Ax)^\top y = x^\top (A^\top y), \quad (AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A$$
Invertible matrix. A square matrix $A \in \mathbb{R}^{d \times d}$ is invertible (nonsingular) if there exists $A^{-1}$ such that $AA^{-1} = A^{-1} A = I$. Equivalent conditions: $\text{rank}(A) = d$ (full rank), $\ker(A) = \{0\}$ (trivial kernel), $\det(A) \neq 0$ (nonzero determinant).
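The equivalent conditions can be checked against each other numerically. A sketch on the $2 \times 2$ example from earlier in the notes:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# The equivalent invertibility conditions should all agree
full_rank = np.linalg.matrix_rank(A) == A.shape[0]
nonzero_det = not np.isclose(np.linalg.det(A), 0.0)  # det = -2 here
assert full_rank and nonzero_det

# So the inverse exists, and both products give the identity
A_inv = np.linalg.inv(A)
assert np.allclose(A @ A_inv, np.eye(2))
assert np.allclose(A_inv @ A, np.eye(2))
```

In practice, explicit determinant or inverse computations are avoided for large matrices (solve linear systems instead), but for checking definitions at this scale they are fine.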
Comments