Chapter 2
Span & Linear Combination
Key ideas: Introduction

Introduction#

Span and linear combinations are the fundamental building blocks of linear algebra and machine learning. Every prediction $\hat{y} = Xw$, every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}$, every attention output $\sum_i \alpha_i v_i$, and every representation learned by a neural network is ultimately a linear combination of basis vectors. Understanding span—the set of all possible linear combinations—reveals model expressiveness, training dynamics, and the geometry of learned representations.

The span of a set of vectors $\{v_1, \ldots, v_k\}$ is the smallest subspace containing all of them. Geometrically, it’s all points reachable by scaling and adding the vectors. Algebraically, it’s $\{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$. In ML, span determines:

  • Model capacity: What functions can a model represent?
  • Feature redundancy: Are some features linear combinations of others?
  • Solution uniqueness: When are there multiple parameter vectors giving identical predictions?
  • Expressiveness vs. efficiency: Can we reduce dimensionality without losing information?

This chapter adopts an ML-first perspective: we introduce span through concrete algorithms (kernel methods, attention, overparameterization) rather than abstract axioms. The goal is to build geometric intuition (span as reachable points) and computational skill (checking linear independence, computing basis) simultaneously.

 

Important Ideas#

1. Linear combinations are everywhere in ML. A linear combination of vectors $\{v_1, \ldots, v_k\}$ with coefficients $\{\alpha_1, \ldots, \alpha_k\}$ is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k $$

Examples in ML:

  • Linear regression predictions: $\hat{y} = Xw = \sum_{j=1}^d w_j x_j$ (linear combination of feature columns).
  • Gradient descent updates: $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ (linear combination of current parameters and gradient).
  • Attention outputs: $z = \sum_{i=1}^n \alpha_i v_i$ (weighted sum of value vectors with attention weights $\alpha_i$).
  • Kernel predictions: $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (representer theorem: optimal solution is a linear combination of training kernels).
  • Word embeddings: Analogies $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (linear combinations capture semantic relationships).
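Two of these identities are easy to check numerically. A minimal numpy sketch (toy random data, illustrative only) verifying that the matrix product $Xw$ equals the weighted sum of feature columns, and that an attention output equals the weighted sum of value vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prediction as a weighted sum of feature columns: Xw = sum_j w_j X[:, j]
X = rng.standard_normal((5, 3))
w = np.array([0.5, -1.0, 2.0])
pred_matrix = X @ w
pred_columns = sum(w[j] * X[:, j] for j in range(3))
assert np.allclose(pred_matrix, pred_columns)

# Attention output as a weighted sum of value vectors: z = sum_i alpha_i v_i
V = rng.standard_normal((4, 8))                 # 4 value vectors in R^8
scores = rng.standard_normal(4)
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
z = alpha @ V
assert np.allclose(z, sum(alpha[i] * V[i] for i in range(4)))
```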

2. Span determines expressiveness. The span of $\{v_1, \ldots, v_k\}$ is: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$

This is the set of all possible linear combinations—the “reachable subspace” if we’re allowed to scale and add the vectors. Key properties:

  • It’s a subspace: Closed under addition and scalar multiplication (adding/scaling linear combinations gives another linear combination).
  • It’s the smallest subspace containing $\{v_1, \ldots, v_k\}$: Any subspace containing all $v_i$ must contain their span.
  • Dimension = number of linearly independent vectors: If $v_k = \sum_{i=1}^{k-1} c_i v_i$ (linear dependence), adding $v_k$ doesn’t increase the span.
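The third property can be verified directly: appending a linearly dependent vector leaves the rank, and therefore the span, unchanged. A sketch with toy vectors:

```python
import numpy as np

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])
assert np.linalg.matrix_rank(np.column_stack([v1, v2])) == 2

# v3 = 2*v1 - v2 is linearly dependent: appending it does not increase
# the rank, so span{v1, v2, v3} = span{v1, v2}
v3 = 2 * v1 - v2
assert np.linalg.matrix_rank(np.column_stack([v1, v2, v3])) == 2
```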

In ML context:

  • Column space of $X$: All predictions $\hat{y} = Xw$ lie in $\text{span}(\text{columns of } X) = \text{col}(X)$. If $y \notin \text{col}(X)$, perfect fit is impossible (residual is nonzero).
  • Feature redundancy: If feature $x_j$ is a linear combination of other features, adding it doesn’t increase $\text{span}(\text{columns of } X)$ or model capacity.
  • Kernel methods: Predictions lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (representer theorem). This is typically a finite-dimensional subspace of the (infinite-dimensional) RKHS.
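The first bullet can be made concrete with a small helper (`in_col_space` is a hypothetical name, not from the text): a target is fittable exactly if and only if least squares achieves zero residual.

```python
import numpy as np

def in_col_space(X, y, tol=1e-10):
    """True iff y lies in col(X): least squares then achieves zero residual."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return bool(np.linalg.norm(X @ w - y) < tol)

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                 # col(X): vectors with y3 = y1 + y2
assert in_col_space(X, np.array([1.0, 2.0, 3.0]))      # 1*x1 + 2*x2, exact fit
assert not in_col_space(X, np.array([0.0, 0.0, 1.0]))  # off the plane: residual nonzero
```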

3. Linear independence vs. dependence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if the only solution to $\sum_{i=1}^k \alpha_i v_i = 0$ is $\alpha_1 = \cdots = \alpha_k = 0$. Otherwise, they’re linearly dependent (one is a linear combination of others).

Why it matters:

  • Basis: A linearly independent set spanning $V$ is a basis for $V$. Every vector in $V$ has a unique representation as a linear combination of basis vectors.
  • Rank: $\text{rank}(X) = $ number of linearly independent columns = $\dim(\text{col}(X))$.
  • Multicollinearity: In regression, linearly dependent features ($\text{rank}(X) < d$) make $X^\top X$ singular (non-invertible), requiring regularization.
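The multicollinearity point can be demonstrated with a toy duplicated feature; the ridge fix sketched here is the standard one (regularization strength chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal(10)
X = np.column_stack([x1, 2 * x1])          # feature 2 duplicates feature 1

XtX = X.T @ X
assert np.linalg.matrix_rank(XtX) == 1     # singular: cannot invert

# Ridge regularization restores invertibility: X^T X + lambda*I is positive definite
lam = 0.1
assert np.linalg.matrix_rank(XtX + lam * np.eye(2)) == 2
```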

4. Representer theorem: solutions lie in span of training data. For many ML problems (kernel ridge regression, SVMs, Gaussian processes), the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$

This is a linear combination of kernel functions evaluated at training points. Despite working in an infinite-dimensional space (e.g., RBF kernel), the solution lies in an $n$-dimensional subspace (span of $\{k(x_i, \cdot)\}_{i=1}^n$).

Implications:

  • Computational tractability: Optimization over infinite dimensions reduces to solving an $n \times n$ system.
  • Overfitting vs. underfitting: More training points ($n$ large) increases capacity but also computational cost ($O(n^3)$ for exact methods).
  • Sparse solutions: $\ell_1$ regularization (Lasso, SVM) produces solutions with many $\alpha_i = 0$ (sparse linear combinations).

 

Relevance to Machine Learning#

Model expressiveness and capacity. The span of a feature matrix $X \in \mathbb{R}^{n \times d}$ determines all possible predictions. For linear regression $\hat{y} = Xw$:

  • If $\text{rank}(X) = d$ (full column rank), no feature is redundant: $\text{col}(X)$ is a full $d$-dimensional subspace, and each weight vector $w$ yields a distinct prediction.
  • If $\text{rank}(X) < d$, features are redundant. Adding more linearly dependent features doesn’t help.
  • If $\text{rank}(X) < n$, $\text{col}(X)$ is a proper subspace of $\mathbb{R}^n$, so an exact fit is impossible unless $y \in \text{col}(X)$ (rare).

Attention mechanisms. Transformer attention computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where the output is a convex combination (weighted average with non-negative weights summing to 1) of value vectors. Each output lies in $\text{span}(\text{rows of } V)$. Multi-head attention projects to multiple subspaces (heads), increasing expressiveness.
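A minimal single-head self-attention sketch (toy shapes, `softmax` defined locally) confirming that each output row is a convex combination of the rows of $V$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_k = 4, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(Q @ K.T / np.sqrt(d_k))        # attention weights
Z = A @ V                                  # outputs: weighted sums of rows of V

# Convex combination: non-negative weights summing to 1
assert (A >= 0).all() and np.allclose(A.sum(axis=1), 1)
# Output row 0 equals sum_i alpha_i v_i explicitly
assert np.allclose(Z[0], sum(A[0, i] * V[i] for i in range(n)))
```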

Kernel methods and representer theorem. For kernel ridge regression, the optimal solution is: $$ \alpha^* = (K + \lambda I)^{-1} y $$ where $K_{ij} = k(x_i, x_j)$ is the Gram matrix. Predictions are $f(x) = \sum_{i=1}^n \alpha_i^* k(x_i, x)$ (linear combination of training kernels). This holds for any kernel (linear, polynomial, RBF, neural network), enabling implicit infinite-dimensional feature spaces.
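A kernel ridge regression sketch under illustrative assumptions (RBF kernel with `gamma=1.0`, toy 1D data): the infinite-dimensional RKHS fit reduces to a single $n \times n$ solve, exactly as the formula above states.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(30)

def rbf(A, B, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Fit: alpha* = (K + lambda*I)^{-1} y -- one n x n solve
lam = 1e-2
K = rbf(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

# Predict: f(x) = sum_i alpha_i k(x_i, x), a combination of training kernels
f = rbf(np.array([[0.5]]), X_train) @ alpha
assert f.shape == (1,) and np.isfinite(f).all()
```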

Word embeddings and analogies. Word2Vec (Mikolov et al., 2013) famously demonstrated that semantic relationships correspond to linear offsets in embedding space: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$. This shows embeddings capture compositional structure (adding/subtracting vectors blends meanings).

 

Algorithmic Development History#

1. Linear combinations in classical mechanics (Newton, 1687). Newton’s second law $F = ma$ expresses force as a linear combination of acceleration components. Decomposing vectors into basis components (Cartesian coordinates) enabled solving physical systems.

2. Linear algebra formalization (Grassmann 1844, Peano 1888). Grassmann introduced “extensive magnitudes” (vectors) and exterior algebra (wedge products, spans). Peano axiomatized vector spaces with addition and scalar multiplication, formalizing linear combinations.

3. Least squares and column space (Gauss 1809, Legendre 1805). Gauss used least squares for orbit determination. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, and the best fit minimizes $\|y - \hat{y}\|_2$ by projecting $y$ onto $\text{col}(X)$.

4. Kernel trick and representer theorem (Kimeldorf & Wahba 1970, Schölkopf 1990s). Kimeldorf & Wahba proved the representer theorem for splines: optimal smoothing spline is a linear combination of kernel basis functions. Schölkopf, Smola, and Vapnik extended this to SVMs and kernel ridge regression, enabling nonlinear learning in RKHS.

5. Word embeddings and linear structure (Mikolov et al. 2013). Word2Vec revealed that embeddings exhibit linear compositionality: analogies like “king - man + woman ≈ queen” work because semantic relationships correspond to parallel vectors (linear offsets). This was surprising—neural networks learned a structured linear space without explicit supervision.

6. Attention and weighted sums (Bahdanau 2015, Vaswani 2017). Attention mechanisms compute outputs as convex combinations (weighted averages) of value vectors. The Transformer (Vaswani et al., 2017) replaced recurrence with attention, showing that linear combinations of context (with learned weights) suffice for sequence modeling.

7. Overparameterization and implicit bias (Bartlett 2020, Arora 2019). Modern deep networks are vastly overparameterized ($d \gg n$), so solutions lie in $w_{\min} + \text{null}(X)$ (affine subspace). Gradient descent exhibits implicit regularization, preferring solutions in specific subspaces (e.g., low-rank, sparse). Understanding span and null space clarifies why overparameterized models generalize.
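The affine-subspace claim is easy to check numerically. A sketch with a toy overparameterized system ($d > n$, Gaussian data, full row rank almost surely):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 8                                # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolating solution (pseudoinverse via lstsq)
w_min, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(X @ w_min, y)

# Any w_min + z with z in null(X) predicts identically ...
_, _, Vt = np.linalg.svd(X)
z = Vt[n:].T @ rng.standard_normal(d - n)  # random null-space vector
assert np.allclose(X @ z, 0)
assert np.allclose(X @ (w_min + z), y)
# ... but has larger norm: the minimum-norm solution lies in the row space
assert np.linalg.norm(w_min + z) > np.linalg.norm(w_min)
```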

 

Definitions#

Linear combination. Given vectors $\{v_1, \ldots, v_k\} \subset V$ and scalars $\{\alpha_1, \ldots, \alpha_k\} \subset \mathbb{R}$, the linear combination is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \cdots + \alpha_k v_k \in V $$

Span. The span of $\{v_1, \ldots, v_k\}$ is the set of all linear combinations: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$ This is the **smallest subspace** containing $\{v_1, \ldots, v_k\}$.

Linear independence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if: $$ \sum_{i=1}^k \alpha_i v_i = 0 \quad \Longrightarrow \quad \alpha_1 = \cdots = \alpha_k = 0 $$ Otherwise, they are **linearly dependent** (at least one $v_j$ is a linear combination of the others).

Basis. A set $\{v_1, \ldots, v_k\}$ is a basis for subspace $S$ if:

  1. It spans $S$: $\text{span}\{v_1, \ldots, v_k\} = S$.
  2. It is linearly independent.

Every vector in $S$ has a unique representation as a linear combination of basis vectors.

Column space (range). For $A \in \mathbb{R}^{m \times n}$, the column space is: $$ \text{col}(A) = \{Ax : x \in \mathbb{R}^n\} = \text{span}\{\text{columns of } A\} $$

Dimension. $\dim(S) = $ number of vectors in any basis for $S$. For $\text{col}(A)$, $\dim(\text{col}(A)) = \text{rank}(A)$ (number of linearly independent columns).

Essential vs Optional: Theoretical ML

Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Representer theorem (Kimeldorf & Wahba 1970, Schölkopf et al. 2001). For kernel ridge regression and SVMs, the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$ This holds for **any** reproducing kernel $k$ on RKHS $\mathcal{H}$. The solution lies in the $n$-dimensional subspace $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$, even though $\mathcal{H}$ may be infinite-dimensional (e.g., RBF kernel).
  2. VC dimension and span (Vapnik & Chervonenkis 1971). For linear classifiers in $\mathbb{R}^d$, the VC dimension is $d+1$. This measures expressiveness: there exists a set of $d+1$ points in general position that the classifier can shatter (realize all $2^{d+1}$ labelings), while no set of $d+2$ points can be shattered. The decision boundaries are hyperplanes (linear combinations of features).
  3. Rank-nullity theorem (fundamental theorem of linear algebra). For $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) + \dim(\text{null}(A)) = n $$ In ML: If $X \in \mathbb{R}^{n \times d}$ has $\text{rank}(X) = r < d$, there are $d - r$ linearly dependent features (null space dimension). Solutions to $Xw = y$ form an affine subspace $w_{\text{particular}} + \text{null}(X)$.
  4. Eckart-Young theorem (1936). The truncated SVD $\hat{X} = U_k \Sigma_k V_k^\top$ (keeping the top $k$ singular values) is the best rank-$k$ approximation in Frobenius norm: $$ \|X - \hat{X}\|_F = \min_{\text{rank}(B) \leq k} \|X - B\|_F $$ Geometrically: Projecting columns of $X$ onto $\text{span}\{u_1, \ldots, u_k\}$ minimizes reconstruction error. This justifies PCA, low-rank matrix completion, and recommender systems.
  5. Johnson-Lindenstrauss lemma (1984). Random projection from $\mathbb{R}^d$ to $\mathbb{R}^k$ (with $k = O(\log n / \epsilon^2)$) approximately preserves pairwise distances with high probability. This enables dimensionality reduction: data approximately lies in a low-dimensional subspace, discoverable via random projections.
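Theorems 3 and 4 can be verified numerically; a sketch with toy matrices (ranks constructed by hand):

```python
import numpy as np

rng = np.random.default_rng(5)

# Rank-nullity: a 6x4 matrix of rank 2 has a 2-dimensional null space
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
rank = np.linalg.matrix_rank(A)
s = np.linalg.svd(A, compute_uv=False)     # here min(m, n) = n = 4 singular values
nullity = int((s <= 1e-10 * s.max()).sum())
assert rank + nullity == A.shape[1]        # rank + dim(null) = n

# Eckart-Young: truncated SVD is the best rank-k approximation in Frobenius norm
M = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_k = (U[:, :k] * s[:k]) @ Vt[:k]
# the error equals the norm of the discarded singular values
assert np.allclose(np.linalg.norm(M - M_k), np.sqrt((s[k:] ** 2).sum()))
```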

Why essential: These theorems quantify when learning is tractable (representer theorem → finite-dimensional optimization), how much data suffices (VC dimension → sample complexity), and when low-dimensional structure exists (Eckart-Young → lossy compression bounds).

 

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. Word2Vec (Mikolov et al., 2013). Learned 300-dimensional embeddings for millions of words via skip-gram/CBOW. Demonstrated linear structure: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ achieved 40% accuracy on analogy tasks. Showed that linear combinations capture semantic relationships (gender, tense, capitals).
  2. ResNet (He et al., 2015). Introduced skip connections $y = F(x) + x$, enabling training of 152-layer networks (vs. ~20 layers for VGG). Won ImageNet 2015 with 3.57% top-5 error. The key: $F(x) + x$ is a linear combination (residual + identity), preserving gradients during backpropagation.
  3. Transformer (Vaswani et al., 2017). Replaced RNNs with attention: $\text{softmax}(QK^\top / \sqrt{d_k}) V$ (linear combination of value vectors). Enabled GPT-3 (175B params, Brown et al. 2020), BERT (340M params, Devlin et al. 2018), and state-of-the-art results across NLP (translation, summarization, QA).
  4. Kernel SVMs (Boser et al. 1992, Cortes & Vapnik 1995). Applied kernel trick to large-margin classifiers. Won NIPS 2003 feature selection challenge, achieved 99.3% accuracy on MNIST (Decoste & Schölkopf 2002). Decision function $f(x) = \sum_{i \in SV} \alpha_i y_i k(x_i, x)$ is a sparse linear combination (only support vectors have $\alpha_i \neq 0$).
  5. PCA for face recognition (Eigenfaces, Turk & Pentland 1991). Projected face images onto span of top eigenvectors (principal components). Each face is approximated as $x \approx \sum_{i=1}^k c_i u_i$ (linear combination of eigenfaces). Achieved real-time recognition with $k = 50$-$100$ components (vs. $d = 10,000$ pixels).
  6. GPT-3 (Brown et al., 2020). 175B parameter Transformer trained on 300B tokens. Demonstrated few-shot learning (2-3 examples) across diverse tasks without fine-tuning. Attention layers compute $\sum_{i=1}^n \alpha_i v_i$ (linear combinations of context), with $n = 2048$ tokens.

Why essential: These systems achieved state-of-the-art by exploiting linear combination structure (attention, skip connections, kernel methods). Understanding span is necessary to interpret embeddings (Word2Vec analogies), debug failures (rank deficiency in features), and design architectures (multi-head attention = multiple subspaces).

Key ideas: Where it shows up

1. Principal Component Analysis (PCA) — Data spans low-dimensional subspace#

Major achievements:

  • Hotelling (1933): Formalized PCA as finding orthogonal directions of maximum variance. Principal components are eigenvectors of the covariance matrix $C = \frac{1}{n} X_c^\top X_c$ (centered data).
  • Eckart-Young theorem (1936): Proved that truncated SVD $X \approx U_k \Sigma_k V_k^\top$ (keeping top $k$ singular vectors) minimizes reconstruction error $\|X - \hat{X}\|_F$. This justifies PCA: projecting onto $\text{span}\{u_1, \ldots, u_k\}$ (top eigenvectors) is optimal.
  • Modern applications: Face recognition (eigenfaces, Turk & Pentland 1991), data compression (JPEG2000), preprocessing for neural networks (whitening), exploratory data analysis (visualizing high-dimensional datasets in 2D/3D).

Connection to span: PCA finds the $k$-dimensional subspace (span of top eigenvectors) that best approximates the data cloud. Projecting data $X$ onto $\text{span}\{u_1, \ldots, u_k\}$ gives $X_{\text{proj}} = X V_k V_k^\top$, where each row is a linear combination of top eigenvectors. The retained variance is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^d \lambda_i$.
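A PCA-via-SVD sketch (synthetic data near a 2D subspace; noise level illustrative) computing the projection and the retained-variance fraction described above:

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic data near a 2D subspace of R^5, plus small isotropic noise
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 5))
X += 0.01 * rng.standard_normal(X.shape)
Xc = X - X.mean(axis=0)                    # center before PCA

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
V_k = Vt[:k].T                             # top-k principal directions (d x k)
X_proj = Xc @ V_k @ V_k.T                  # rows projected onto the principal subspace

# Retained variance fraction: top-k eigenvalues over the total
var = s ** 2
retained = var[:k].sum() / var.sum()
assert retained > 0.99                     # data is nearly rank-2
assert np.linalg.matrix_rank(X_proj) == k  # projections live in a k-dim subspace
```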

 

2. Stochastic Gradient Descent (SGD) — Updates are linear combinations#

Major achievements:

  • Robbins & Monro (1951): Proved convergence of stochastic approximation $\theta_{t+1} = \theta_t - \eta_t g_t$ (where $g_t$ is a noisy gradient) under diminishing step sizes $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$.
  • Momentum methods (Polyak 1964, Nesterov 1983): Introduced momentum $m_{t+1} = \beta m_t + \nabla \mathcal{L}(\theta_t)$, $\theta_{t+1} = \theta_t - \eta m_{t+1}$ (exponentially weighted average of gradients). This is a linear combination of past gradients with decaying weights.
  • Adam optimizer (Kingma & Ba 2014): Adaptive learning rates using first and second moment estimates. Became the dominant optimizer for deep learning (BERT, GPT, Stable Diffusion).

Connection to span: Every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ is a linear combination of the current parameters and the negative gradient. The optimization trajectory $\{\theta_0, \theta_1, \ldots\}$ lies in the affine subspace $\theta_0 + \text{span}\{\nabla \mathcal{L}(\theta_0), \nabla \mathcal{L}(\theta_1), \ldots\}$. For linear models, gradients are linear combinations of data columns.
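A heavy-ball momentum sketch on a toy quadratic (step size and $\beta$ chosen for illustration): the momentum buffer is exactly an exponentially weighted sum of past gradients, and each update is a linear combination of the current parameters and that buffer.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
grad = lambda th: A.T @ (A @ th - b)       # gradient of 0.5 * ||A theta - b||^2

theta, m = np.zeros(3), np.zeros(3)
eta, beta = 0.01, 0.9
for _ in range(500):
    m = beta * m + grad(theta)             # exponentially weighted sum of past gradients
    theta = theta - eta * m                # linear combination of theta and m

# Converges toward the least squares minimizer
theta_star, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(theta, theta_star, atol=1e-3)
```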

 

3. Deep Neural Networks—Compositional Linear Combinations#

Major achievements:

  • Universal approximation (Cybenko 1989, Hornik 1991): Single hidden layer networks can approximate continuous functions arbitrarily well. The output is $f(x) = \sum_{i=1}^h w_i \sigma(v_i^\top x + b_i)$ (linear combination of activations).
  • Deep learning revolution (2012-present): AlexNet (2012), VGG (2014), ResNet (2015), Transformers (2017) demonstrated that depth (composing linear maps + nonlinearities) is more powerful than width (more neurons per layer).
  • Neural Tangent Kernels (Jacot et al. 2018): Showed that infinite-width networks behave like kernel methods, with predictions in $\text{span}\{\text{training features}\}$.

Connection to span: Each layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$, where $W_l h_l$ is a linear combination of hidden activations (columns of $W_l$ with coefficients from $h_l$). The pre-activation $W_l h_l$ lies in $\text{col}(W_l)$. Deep networks compose these linear combinations across layers, creating hierarchical representations.
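One layer's pre-activation, checked explicitly as a linear combination of the columns of $W_l$ (toy shapes, ReLU assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
W = rng.standard_normal((4, 6))            # layer weights: maps R^6 -> R^4
b = rng.standard_normal(4)
h = rng.standard_normal(6)                 # incoming activations

pre = W @ h                                # pre-activation, lies in col(W)
assert np.allclose(pre, sum(h[j] * W[:, j] for j in range(6)))

h_next = np.maximum(pre + b, 0)            # ReLU applied after the linear combination
assert h_next.shape == (4,)
```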

 

4. Kernel Methods—Predictions as linear combinations of kernels#

Major achievements:

  • Representer theorem (Kimeldorf & Wahba 1970): For regularized risk minimization $\min_{f \in \mathcal{H}} \sum_{i=1}^n \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, the optimal solution is $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (linear combination of kernel basis functions).
  • Support Vector Machines (Boser et al. 1992, Cortes & Vapnik 1995): Introduced large-margin classifiers with kernel trick. Won NIPS feature selection challenge (2003), dominated ML competitions (early 2000s).
  • Gaussian Processes (Rasmussen & Williams 2006): Bayesian kernel methods for regression/classification. Predictions are linear combinations $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ with $\alpha = (K + \sigma^2 I)^{-1} y$.

Connection to span: Despite working in (potentially infinite-dimensional) RKHS, kernel predictions always lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (finite-dimensional subspace spanned by training kernels). The Gram matrix $K_{ij} = k(x_i, x_j)$ encodes inner products in this subspace.

 

5. Transformer Attention—Weighted sums of value vectors#

Major achievements:

  • Vaswani et al. (2017): “Attention is All You Need” replaced RNNs with self-attention. Enabled parallelization and scaling to billions of parameters (GPT-3: 175B params, GPT-4: ~1.7T params).
  • BERT (Devlin et al. 2018): Bidirectional Transformers for masked language modeling. Achieved state-of-the-art on 11 NLP tasks (GLUE benchmark).
  • Vision Transformers (Dosovitskiy et al. 2020): Applied attention to image patches, surpassing CNNs on ImageNet (ViT-H/14: 88.5% top-1 accuracy).
  • Multimodal models (CLIP, Flamingo, GPT-4): Unified vision and language via attention over heterogeneous inputs.

Connection to span: Attention output $z = \text{softmax}(QK^\top / \sqrt{d_k}) V$ is a convex combination (weighted average with non-negative weights summing to 1) of value vectors (rows of $V$). Each output lies in $\text{span}(\text{rows of } V)$. Multi-head attention projects to $h$ different subspaces, computing $h$ independent linear combinations in parallel.

Notation

Standard Conventions#

1. Linear combination syntax.

  • Summation notation: $\sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k$.
  • Matrix-vector product: $Xw = \sum_{j=1}^d w_j x_j$ (linear combination of columns of $X$ with weights from $w$).
  • Convex combination: $\sum_{i=1}^k \alpha_i v_i$ with $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$ (weighted average).

Examples:

  • Linear regression prediction: $\hat{y} = X w = \sum_{j=1}^d w_j X_{:,j}$ (each prediction is a linear combination of feature columns).
  • Attention output: $z = \sum_{i=1}^n \alpha_i v_i$ where $\alpha = \text{softmax}(q^\top K / \sqrt{d_k})$ (convex combination of value vectors).
  • Word analogy: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} = 1 \cdot e_{\text{king}} + (-1) \cdot e_{\text{man}} + 1 \cdot e_{\text{woman}}$ (coefficients can be negative).

2. Span notation.

  • Set notation: $\text{span}\{v_1, \ldots, v_k\} = \{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$.
  • Equivalent: $\text{span}(S)$ where $S = \{v_1, \ldots, v_k\}$ (span of a set).
  • Column space: $\text{col}(A) = \text{span}\{\text{columns of } A\}$.
  • Row space: $\text{row}(A) = \text{span}\{\text{rows of } A\} = \text{col}(A^\top)$.

Examples:

  • For $X \in \mathbb{R}^{3 \times 2}$ with columns $x_1 = [1, 0, 1]^\top$, $x_2 = [0, 1, 1]^\top$: $$ \text{col}(X) = \text{span}\{x_1, x_2\} = \left\{ w_1 \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + w_2 \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} : w_1, w_2 \in \mathbb{R} \right\} $$ This is a 2D plane in $\mathbb{R}^3$ passing through the origin.

3. Linear independence notation.

  • Independence: Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if $\sum_{i=1}^k \alpha_i v_i = 0 \Rightarrow \alpha_1 = \cdots = \alpha_k = 0$.
  • Dependence: If there exist $\alpha_i$ (not all zero) such that $\sum_{i=1}^k \alpha_i v_i = 0$, vectors are linearly dependent.
  • Rank: $\text{rank}(A)$ = number of linearly independent columns of $A$ = number of linearly independent rows of $A$ (these counts always coincide).

Examples:

  • Vectors $v_1 = [1, 0]^\top$, $v_2 = [0, 1]^\top$ are linearly independent (standard basis for $\mathbb{R}^2$).
  • Vectors $v_1 = [1, 2]^\top$, $v_2 = [2, 4]^\top$ are linearly dependent ($v_2 = 2 v_1$).
  • For $X \in \mathbb{R}^{100 \times 50}$, $\text{rank}(X) \leq 50$ (at most 50 linearly independent columns).

4. Basis notation.

  • Basis: A linearly independent spanning set. Denoted $\mathcal{B} = \{v_1, \ldots, v_d\}$ for a $d$-dimensional space.
  • Standard basis: $\{e_1, \ldots, e_d\}$ where $e_i$ has 1 in position $i$, 0 elsewhere.
  • Coordinates: For vector $v = \sum_{i=1}^d \alpha_i v_i$ (linear combination of basis vectors), the coordinates are $[\alpha_1, \ldots, \alpha_d]^\top$.

Examples:

  • Standard basis for $\mathbb{R}^3$: $e_1 = [1, 0, 0]^\top$, $e_2 = [0, 1, 0]^\top$, $e_3 = [0, 0, 1]^\top$.
  • Any vector $v = [v_1, v_2, v_3]^\top = v_1 e_1 + v_2 e_2 + v_3 e_3$ (linear combination of standard basis).
  • For PCA, top $k$ eigenvectors $\{u_1, \ldots, u_k\}$ form a basis for the principal subspace.

5. Kernel and null space notation.

  • Null space: $\text{null}(A) = \{x : Ax = 0\}$ (vectors mapped to zero).
  • Kernel: $\ker(A) = \text{null}(A)$ (alternative notation).
  • Range (column space): $\text{range}(A) = \text{col}(A) = \{Ax : x \in \mathbb{R}^n\}$.

Examples:

  • For $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ (rank 1), $\text{null}(A) = \text{span}\{[2, -1]^\top\}$ (1D subspace).
  • Overparameterized regression: If $\text{rank}(X) < d$, solutions to $Xw = y$ form $w_0 + \text{null}(X)$ (affine subspace).
  • Kernel ridge regression: Solution $\alpha = (K + \lambda I)^{-1} y$ lies in $\mathbb{R}^n$ (span of training examples).
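The first bullet's null space can be computed from the SVD; a sketch:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])                 # rank 1
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[s < 1e-10]                 # right singular vectors with sigma = 0

assert null_basis.shape == (1, 2)          # 1D null space
assert np.allclose(A @ null_basis.T, 0)
# parallel to [2, -1]: the 2x2 determinant with [2, -1] vanishes
assert abs(null_basis[0, 0] * (-1.0) - null_basis[0, 1] * 2.0) < 1e-10
```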

6. Projection notation.

  • Orthogonal projection: $P_S v$ projects $v$ onto subspace $S$.
  • Projection matrix: $P = A(A^\top A)^{-1} A^\top$ projects onto $\text{col}(A)$.
  • Complement: $v = P_S v + P_{S^\perp} v$ (decomposition into parallel and perpendicular components).

Examples:

  • PCA projection onto top $k$ eigenvectors: $X_{\text{proj}} = X V_k V_k^\top$ where $V_k = [u_1 | \cdots | u_k]$.
  • Least squares: $\hat{y} = X(X^\top X)^{-1} X^\top y$ (projection of $y$ onto $\text{col}(X)$).
  • Residual: $r = y - \hat{y} = (I - X(X^\top X)^{-1} X^\top) y$ (projection onto $\text{col}(X)^\perp$).
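A sketch verifying the projection identities on toy data: $P$ is idempotent, the residual is orthogonal to $\text{col}(X)$, and fitted values plus residual recover $y$.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal((6, 2))
y = rng.standard_normal(6)

P = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto col(X)
y_hat = P @ y                              # least squares fitted values
r = y - y_hat                              # residual, lies in col(X)^perp

assert np.allclose(P @ P, P)               # idempotent: P^2 = P
assert np.allclose(X.T @ r, 0)             # residual orthogonal to col(X)
assert np.allclose(y_hat + r, y)           # decomposition v = P_S v + P_{S^perp} v
```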
Pitfalls & sanity checks

Common Mistakes#

  1. Confusing span with basis: Span dimension = number of linearly independent vectors, not total count.
  2. Assuming full rank: Always check np.linalg.matrix_rank(X) before inverting $X^\top X$.
  3. Ignoring numerical stability: Use lstsq instead of normal equations.
  4. Misunderstanding convex combinations: Not all linear combinations are convex (need $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$).
  5. Overparameterization misconceptions: $d > n$ doesn’t always cause overfitting (implicit regularization).

     

Essential Checks#

import numpy as np

# Example data (illustrative): three examples, two independent features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Check linear independence
rank = np.linalg.matrix_rank(X)
assert rank == X.shape[1], "Columns linearly dependent"

# Verify span membership (lstsq finds coefficients if v is in span(V))
v1, v2, v3 = np.eye(3)
V = np.column_stack([v1, v2, v3])
v = np.array([2.0, -1.0, 3.0])
alpha = np.linalg.lstsq(V, v, rcond=None)[0]
assert np.allclose(V @ alpha, v), "v not in span(V)"

# Test null space membership
z = np.zeros(X.shape[1])  # the zero vector is always in null(X)
assert np.allclose(X @ z, 0), "z not in null(X)"

# Attention weights must form a convex combination
alpha = np.array([0.2, 0.5, 0.3])
assert np.allclose(alpha.sum(), 1) and (alpha >= 0).all()
References

Foundational Texts#

  1. Strang (2016): Linear Algebra - span, basis, column/null space
  2. Axler (2015): Linear Algebra Done Right - abstract vector spaces
  3. Horn & Johnson (2013): Matrix Analysis - rank, decompositions

     

Machine Learning#

  1. Hastie et al. (2009): Elements of Statistical Learning - regression, SVMs
  2. Goodfellow et al. (2016): Deep Learning - Chapter 2 (Linear Algebra)
  3. Murphy (2022): Probabilistic ML - linear regression, kernels

     

Key Papers#

  1. Kimeldorf & Wahba (1970): Representer theorem
  2. Vapnik & Chervonenkis (1971): VC dimension
  3. Eckart & Young (1936): Low-rank approximation
  4. Mikolov et al. (2013): Word2Vec analogies
  5. Vaswani et al. (2017): Transformer attention
  6. He et al. (2015): ResNet skip connections
  7. Bartlett et al. (2020): Benign overfitting
  8. Belkin et al. (2019): Double descent

     

Advanced Topics#

  1. Schölkopf & Smola (2002): Learning with Kernels
  2. Rasmussen & Williams (2006): Gaussian Processes
  3. Golub & Van Loan (2013): Matrix Computations
  4. Trefethen & Bau (1997): Numerical Linear Algebra
Five worked examples

Worked Example 1: Predictions lie in span(columns of X)#

Introduction#

Linear regression predictions $\hat{y} = Xw$ are linear combinations of the columns of the feature matrix $X$. This fundamental observation reveals model expressiveness: all possible predictions lie in the column space $\text{col}(X)$, a subspace of $\mathbb{R}^n$. If the target $y$ lies outside this subspace ($y \notin \text{col}(X)$), perfect fit is impossible—the best we can do is project $y$ onto $\text{col}(X)$ (least squares solution).

This example explicitly computes $Xw$ as $\sum_{j=1}^d w_j X_{:,j}$ (sum of weighted columns), demonstrating that predictions span a subspace determined entirely by the features.

 

Purpose#

  • Visualize predictions as linear combinations: Show that $\hat{y} = Xw = w_1 X_{:,1} + w_2 X_{:,2} + \cdots + w_d X_{:,d}$.
  • Identify the constraint: Predictions lie in $\text{col}(X)$, limiting model capacity to $\dim(\text{col}(X)) = \text{rank}(X)$.
  • Connect to least squares: When $y \notin \text{col}(X)$, minimizing $\|Xw - y\|_2$ finds the closest point in $\text{col}(X)$ to $y$.
     

Importance#

Model expressiveness. The span of $X$’s columns determines all possible predictions. For $X \in \mathbb{R}^{n \times d}$:

  • If $\text{rank}(X) = d$ (full column rank), no feature is redundant: $\text{col}(X)$ is a full $d$-dimensional subspace, and each weight vector yields a distinct prediction.
  • If $\text{rank}(X) < d$, some features are redundant (linearly dependent). Adding more linearly dependent features doesn’t increase capacity.
  • If $\text{rank}(X) < n$ (typical when $d < n$), predictions lie in a proper subspace of $\mathbb{R}^n$. Perfect fit is impossible unless $y \in \text{col}(X)$.

Residuals and orthogonality. The least squares residual $r = y - \hat{y}$ is orthogonal to $\text{col}(X)$: $X^\top r = 0$. Geometrically, $\hat{y}$ is the orthogonal projection of $y$ onto $\text{col}(X)$, and $r$ lies in the orthogonal complement $\text{col}(X)^\perp$.

Feature selection. If feature $j$ is a linear combination of other features ($X_{:,j} = \sum_{i \neq j} c_i X_{:,i}$), including it doesn’t increase $\text{rank}(X)$ or expand $\text{col}(X)$. Feature selection algorithms (Lasso, forward selection) aim to find minimal feature sets spanning the target space.

 

What This Example Demonstrates#

  • Matrix-vector product as linear combination: $Xw = \sum_{j=1}^d w_j X_{:,j}$ (sum of weighted columns).
  • Predictions constrained to subspace: $\hat{y} \in \text{col}(X) = \text{span}\{X_{:,1}, \ldots, X_{:,d}\}$.
  • Numerical verification: Compute both $Xw$ (matrix product) and $\sum_j w_j X_{:,j}$ (explicit sum), verify they’re identical.

     

Background#

Least squares (Gauss 1809, Legendre 1805). Gauss used least squares to fit planetary orbits, minimizing sum of squared errors. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, so minimizing $\|y - Xw\|_2^2$ finds the closest point in $\text{col}(X)$ to $y$.

Normal equations. Setting $\nabla_w \|Xw - y\|_2^2 = 0$ gives $X^\top X w = X^\top y$. If $X$ has full column rank, $w^* = (X^\top X)^{-1} X^\top y$. The prediction is $\hat{y} = X w^* = X(X^\top X)^{-1} X^\top y$ (projection matrix $P = X(X^\top X)^{-1} X^\top$ projects onto $\text{col}(X)$).

Geometric interpretation. $\text{col}(X)$ is a $\text{rank}(X)$-dimensional subspace of $\mathbb{R}^n$. The prediction $\hat{y}$ is the foot of the perpendicular from $y$ to this subspace, and the residual $r = y - \hat{y}$ is perpendicular to it.

 

Historical Context#

1. Least squares origins (Gauss 1809, Legendre 1805). Legendre published the method in 1805 for fitting orbits. Gauss claimed to have used it since 1795 (controversy over priority). Both recognized that predictions are linear combinations of features.

2. Matrix formulation (Cauchy 1829, Sylvester 1850). Matrix algebra enabled compact notation $\hat{y} = Xw$ instead of writing out sums. Sylvester introduced “matrix” terminology in 1850.

3. Projection interpretation (Schmidt 1907, Courant & Hilbert 1924). Erhard Schmidt formalized orthogonal projections in Hilbert spaces. The least squares solution became understood as projecting $y$ onto $\text{col}(X)$.

4. Modern ML (1990s-present). Regularization (ridge, Lasso) modifies $\text{col}(X)$ by adding penalty terms. Kernel methods (SVMs, kernel ridge regression) work in implicitly mapped feature spaces, where $\text{col}(\Phi(X))$ may be infinite-dimensional but solutions lie in $\text{span}\{k(x_i, \cdot)\}_{i=1}^n$ (finite-dimensional by the representer theorem).

 

History in Machine Learning#

  • 1805: Legendre publishes least squares (linear combinations of features).
  • 1809: Gauss derives normal equations $X^\top X w = X^\top y$.
  • 1907: Schmidt formalizes orthogonal projections (geometric interpretation).
  • 1970: Kimeldorf & Wahba prove the representer theorem (kernel solutions in the span of training points).
  • 1995: Vapnik’s *The Nature of Statistical Learning Theory* connects generalization to the capacity (VC dimension) of the hypothesis class.
  • 2006: Compressed sensing (Candès, Donoho) exploits sparse linear combinations for recovery.
  • 2018: Neural Tangent Kernels (Jacot et al.) show infinite-width networks have predictions in the span of features.

     

Prevalence in Machine Learning#

Universal in supervised learning: Every linear model (linear regression, logistic regression, linear SVM, perceptron) computes predictions as $\hat{y} = Xw$ or $\hat{y} = \sigma(Xw + b)$ (linear combination + nonlinearity).

Deep learning layers: Each fully connected layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$, where $W_l h_l$ is a linear combination of hidden activations (columns of $W_l$ with coefficients from $h_l$).

Generalized linear models (GLMs): Exponential family models (Poisson regression, gamma regression) use $\mathbb{E}[y] = g^{-1}(X w)$ (linear combination inside link function).

Kernel methods: SVMs, kernel ridge regression, Gaussian processes all predict via $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (linear combination of kernel evaluations).

 

Notes and Explanatory Details#

Shape discipline:

  • Feature matrix: $X \in \mathbb{R}^{n \times d}$ (rows = examples, columns = features).
  • Weights: $w \in \mathbb{R}^d$ (one weight per feature).
  • Prediction: $\hat{y} = Xw \in \mathbb{R}^n$ (one prediction per example).
  • Column $j$ of $X$: $X_{:,j} \in \mathbb{R}^n$ (feature $j$ across all examples).

Matrix-vector product identity: $$ Xw = \begin{bmatrix} | & | & & | \\ X_{:,1} & X_{:,2} & \cdots & X_{:,d} \\ | & | & & | \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} = \sum_{j=1}^d w_j X_{:,j} $$

Example: For $X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$, $w = \begin{bmatrix} 2 \\ -1 \end{bmatrix}$: $$ Xw = 2 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + (-1) \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 4 \end{bmatrix} $$

Numerical considerations: For large $d$ (wide data), storing $X$ explicitly may be wasteful if $\text{rank}(X) \ll d$. Low-rank approximations (truncated SVD) reduce storage and computation.
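A sketch of how truncated SVD exploits low rank (the sizes here are illustrative):

```python
import numpy as np

# Build a wide matrix that is exactly rank 2: 100 features that are all
# linear combinations of 2 latent factors.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))    # latent factors
B = rng.normal(size=(2, 100))   # mixing weights
X = A @ B                       # shape (50, 100), rank 2

# Truncated SVD: keep only the top r singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
X_r = (U[:, :r] * s[:r]) @ Vt[:r]   # rank-r reconstruction

print("rank(X) =", np.linalg.matrix_rank(X))
print("reconstruction error:", np.linalg.norm(X - X_r))
# Storage: 50*2 + 2 + 2*100 = 302 numbers instead of 50*100 = 5000.
```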

 

Connection to Machine Learning#

Underfitting vs. overfitting: If $\text{rank}(X) \ll n$ (few effective features), predictions are confined to a low-dimensional subspace of $\mathbb{R}^n$ and the model may underfit. If $d \geq n$ and $\text{rank}(X) = n$, then $\text{col}(X) = \mathbb{R}^n$: the model can fit any target exactly, including pure noise (overfitting).

Regularization shrinks within the span: Ridge regression solves $(X^\top X + \lambda I) w = X^\top y$, shrinking weights toward zero. Predictions still lie in $\text{col}(X)$, but components along low-variance directions of $X$ are damped, reducing the effective degrees of freedom of the fit.

Basis functions and feature expansion: Nonlinear models (polynomial regression, RBF networks) expand features: $\phi(x) = [x, x^2, x^3, \ldots]$. Predictions $\hat{y} = \Phi(X) w$ lie in $\text{col}(\Phi(X))$: linear in feature space, hence nonlinear as functions of the original inputs.
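A minimal sketch of feature expansion, using a quadratic target so the fit in the expanded space is exact:

```python
import numpy as np

# 1-D inputs expanded to polynomial features phi(x) = [1, x, x^2].
x = np.linspace(-1, 1, 20)
Phi = np.column_stack([np.ones_like(x), x, x**2])   # shape (20, 3)
y = 3 * x**2 - x + 0.5                              # target already in col(Phi)

# Least squares in feature space: linear in w, nonlinear in x.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("recovered w:", w)   # ~ [0.5, -1, 3]
```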

 

Connection to Linear Algebra Theory#

Fundamental theorem of linear algebra. For $X \in \mathbb{R}^{n \times d}$: $$ \mathbb{R}^n = \text{col}(X) \oplus \text{null}(X^\top) $$ (direct sum: every vector $y \in \mathbb{R}^n$ decomposes uniquely as $y = y_{\parallel} + y_{\perp}$ where $y_{\parallel} \in \text{col}(X)$ and $y_{\perp} \in \text{null}(X^\top)$).

In least squares, $\hat{y} = y_{\parallel}$ (projection onto $\text{col}(X)$) and $r = y_{\perp}$ (projection onto $\text{null}(X^\top)$). The normal equations, rewritten as $X^\top r = 0$, express this orthogonality.

Rank and dimension: $\dim(\text{col}(X)) = \text{rank}(X) \leq \min(n, d)$. If $\text{rank}(X) = r < d$, there are $d - r$ redundant features (null space has dimension $d - r$).
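A quick numerical check of rank deficiency from a redundant feature; the matrix below is constructed so that its third column equals the sum of the first two:

```python
import numpy as np

# Third column is a linear combination of the first two, so rank < d.
X = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 3., 5.],
              [1., 1., 2.]])   # col 3 = col 1 + col 2

r = np.linalg.matrix_rank(X)
print("rank:", r, "-> redundant features:", X.shape[1] - r)  # rank 2, 1 redundant

# A null-space vector exposes the redundancy: X @ z = 0 for z = [1, 1, -1].
z = np.array([1., 1., -1.])
print("X @ z =", X @ z)
```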

Projection matrix: $P = X(X^\top X)^{-1} X^\top$ (assuming $X$ has full column rank) satisfies:

  • $P^2 = P$ (idempotent: projecting twice is the same as projecting once).
  • $P^\top = P$ (symmetric: orthogonal projection).
  • $\text{col}(P) = \text{col}(X)$ (projects onto column space of $X$).
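All three properties are easy to verify numerically; a sketch with a random $X$ (assumed full column rank):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))               # full column rank, generically
P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection onto col(X)

print("idempotent:", np.allclose(P @ P, P))     # P^2 = P
print("symmetric: ", np.allclose(P.T, P))       # P^T = P
print("fixes col(X):", np.allclose(P @ X, X))   # Pv = v for every v in col(X)
```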

     

Pedagogical Significance#

Concrete visualization. Students can compute $Xw$ by hand for small $X$ (e.g., $3 \times 2$ matrix) and verify it’s a weighted sum of columns. This makes the abstract “linear combination” concept tangible.

Foundation for least squares. Understanding that predictions lie in $\text{col}(X)$ is essential before learning least squares. The geometric interpretation (projecting $y$ onto $\text{col}(X)$) clarifies why least squares works and when it fails.

Debugging linear models. If predictions are poor, check $\text{rank}(X)$: low rank indicates redundant/collinear features. Use np.linalg.matrix_rank(X) to diagnose.

 

References#

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press. Chapter 4: “Orthogonality” (projections, least squares).
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 3: “Linear Methods for Regression.”
  3. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Appendix C: “Numerical Linear Algebra Background” (least squares, QR decomposition).
  4. Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. Chapter 5: “Orthogonalization and Least Squares.”
  5. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11: “Linear Regression.”

Problem. Show $\hat{y} = Xw$ lies in the span of columns of $X$.

Solution (math).

For $X \in \mathbb{R}^{n \times d}$ with columns $X_{:,1}, \ldots, X_{:,d} \in \mathbb{R}^n$ and weights $w = [w_1, \ldots, w_d]^\top \in \mathbb{R}^d$, the prediction is: $$ \hat{y} = Xw = \sum_{j=1}^d w_j X_{:,j} $$

This is a linear combination of the columns of $X$, so $\hat{y} \in \text{span}\{X_{:,1}, \ldots, X_{:,d}\} = \text{col}(X)$.

Solution (Python).

import numpy as np

# Define feature matrix X (3 examples, 2 features)
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])

# Define weight vector w
w = np.array([2., -1.])

# Prediction via matrix-vector product
y_hat_1 = X @ w

# Prediction as explicit linear combination of columns
y_hat_2 = w[0] * X[:, 0] + w[1] * X[:, 1]

print(f"X =\n{X}\n")
print(f"w = {w}\n")
print(f"Method 1 (matrix product): y_hat = X @ w = {y_hat_1}")
print(f"Method 2 (linear combination): y_hat = {w[0]}*X[:,0] + {w[1]}*X[:,1] = {y_hat_2}")
print(f"\nAre they equal? {np.allclose(y_hat_1, y_hat_2)}")
print(f"y_hat lies in span(columns of X): True (by construction)")

Output:

X =
[[1. 2.]
 [3. 4.]
 [5. 6.]]

w = [ 2. -1.]

Method 1 (matrix product): y_hat = X @ w = [0. 2. 4.]
Method 2 (linear combination): y_hat = 2.0*X[:,0] + -1.0*X[:,1] = [0. 2. 4.]

Are they equal? True
y_hat lies in span(columns of X): True (by construction)

Worked Example 2: Kernel ridge solution lies in span(training features)#

Introduction#

The representer theorem states that despite optimizing over an infinite-dimensional RKHS, the optimal solution for kernel ridge regression always has the form $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$—a linear combination of kernel functions evaluated at training points.

 

Purpose#

  • Demonstrate the representer theorem computationally
  • Show that optimization in infinite dimensions reduces to solving $(K + \lambda I)\alpha = y$
  • Verify predictions lie in span of training kernels

Importance#

Kernel methods enable nonlinear learning in implicitly mapped feature spaces while maintaining computational tractability ($O(n^3)$ instead of infinite-dimensional optimization).

 

What This Example Demonstrates#

Compute kernel Gram matrix $K_{ij} = k(x_i, x_j)$, solve for $\alpha$, interpret as linear combination of training kernels.

 

Background#

RKHS and representer theorem (Kimeldorf & Wahba 1970): For loss $\mathcal{L}(f) = \sum_i \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$, the minimizer is $f^*(x) = \sum_i \alpha_i k(x_i, x)$.

 

References#

  1. Kimeldorf & Wahba (1970), Schölkopf et al. (2001), Rasmussen & Williams (2006)

Problem: Compute $\alpha$ for kernel ridge regression and interpret span.

Solution (math): $\alpha = (K + \lambda I)^{-1} y$ where $K_{ij} = k(x_i, x_j)$. Predictions: $f(x) = \sum_i \alpha_i k(x_i, x)$.

Solution (Python):

import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=6, seed=1)
y = np.arange(len(X), dtype=float)
K = toy_kernel_rbf(X, gamma=0.5)
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

print(f"Coefficients alpha: {alpha}")
print(f"Predictions lie in span{{k(x_1, ·), ..., k(x_6, ·)}}")
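If the book’s `scripts.toy_data` helpers are not at hand, the same computation can be sketched self-contained; the RBF kernel and toy data below are stand-ins:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(6, 2))
y = np.arange(6, dtype=float)

# Solve (K + lam I) alpha = y.
K = rbf_kernel(X_train, X_train)
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(6), y)

# A prediction at a new point is a linear combination of kernel evaluations
# at the training points -- exactly the representer-theorem form.
x_new = np.zeros((1, 2))
f_new = rbf_kernel(x_new, X_train) @ alpha
print("alpha:", alpha)
print("f(x_new) =", f_new[0])
```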

Worked Example 3: Attention is a weighted sum#

Introduction#

Attention computes outputs as convex combinations of value vectors: $z = \sum_i \alpha_i v_i$, where $\alpha = \text{softmax}(q K^\top / \sqrt{d_k})$ and the rows of $K$ are the key vectors.

 

Purpose#

Show attention output is a linear combination, verify weights sum to 1, demonstrate constraint to span of values.

 

Importance#

Attention is the core operation in Transformers (GPT, BERT), enabling contextual representations through weighted averaging.

 

References#

Vaswani et al. (2017), Bahdanau et al. (2015)

Problem: Compute attention output as $\sum_i \alpha_i v_i$.

Solution (math): $z = \text{softmax}(q K^\top / \sqrt{d_k})\, V$ (rows of $K$ are keys, rows of $V$ are values).

Solution (Python):

import numpy as np
from scripts.toy_data import scaled_dot_attention

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.], [1., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])
output = scaled_dot_attention(Q, K, V)

print(f"Attention output: {output[0]}")
print(f"Output lies in span(rows of V)")
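For readers without `scripts.toy_data`, a self-contained sketch of scaled dot-product attention (same toy $Q$, $K$, $V$ as above) makes the convex-combination structure explicit:

```python
import numpy as np

def scaled_dot_attention_np(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- a convex combination of the rows of V."""
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.], [1., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])

out, w = scaled_dot_attention_np(Q, K, V)
print("weights:", w[0], "sum =", w[0].sum())   # nonnegative, sum to 1
print("output :", out[0])                      # lies in the row-span of V
```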

Worked Example 4: Overparameterization and null space#

Introduction#

When $d > n$ (more parameters than examples), solutions to $Xw = y$ are non-unique. The solution set forms an affine subspace $w_0 + \text{null}(X)$.

 

Purpose#

Show non-uniqueness, identify solution set structure, and discuss the minimum-norm solution returned by lstsq.

 

Importance#

Modern deep learning is vastly overparameterized. Understanding null space clarifies why multiple parameters give identical predictions yet generalize differently.

 

References#

Bartlett et al. (2020), Belkin et al. (2019)

Problem: Explain non-uniqueness when $d > n$.

Solution (math): If $Xw_0 = y$ and $z \in \text{null}(X)$, then $X(w_0 + z) = y$. Solutions form $w_0 + \text{null}(X)$.

Solution (Python):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))  # n=3, d=5
w0 = rng.normal(size=5)
y = X @ w0
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"Rank(X): {np.linalg.matrix_rank(X)}")
print(f"Null space dim: {X.shape[1] - np.linalg.matrix_rank(X)}")
print(f"||w_hat||_2 = {np.linalg.norm(w_hat):.4f} (minimum norm)")
print(f"w0 - w_hat in null(X): {np.allclose(X @ (w0 - w_hat), 0)}")

Worked Example 5: Word analogy vector arithmetic#

Introduction#

Word2Vec embeddings exhibit linear structure: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (semantic relationships = vector offsets).

Purpose#

Compute analogy as linear combination, demonstrate compositional semantics, motivate embedding arithmetic.

Importance#

Analogies reveal that neural networks learn structured representations where linear algebra operations correspond to semantic operations.

References#

Mikolov et al. (2013), Pennington et al. (2014)

Problem: Compute “king - man + woman” analogy.

Solution (math): $e_{\text{target}} = 1 \cdot e_{\text{king}} + (-1) \cdot e_{\text{man}} + 1 \cdot e_{\text{woman}}$

Solution (Python):

import numpy as np

E = {
    'king': np.array([0.8, 0.2, 0.1]),
    'man': np.array([0.7, 0.1, 0.0]),
    'woman': np.array([0.6, 0.3, 0.0])
}

analogy = E['king'] - E['man'] + E['woman']
print(f"king - man + woman = {analogy}")
print(f"(Find nearest word to this vector → queen)")
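The “find nearest word” step can be sketched with cosine similarity; note the `queen` vector below is a hypothetical entry added purely for illustration:

```python
import numpy as np

E = {
    'king':  np.array([0.8, 0.2, 0.1]),
    'man':   np.array([0.7, 0.1, 0.0]),
    'woman': np.array([0.6, 0.3, 0.0]),
    'queen': np.array([0.7, 0.4, 0.1]),   # hypothetical embedding for illustration
}

target = E['king'] - E['man'] + E['woman']   # [0.7, 0.4, 0.1]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Exclude the query words themselves, as is standard in analogy evaluation.
candidates = {w: v for w, v in E.items() if w not in ('king', 'man', 'woman')}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print("nearest word:", best)
```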
