Introduction#
Span and linear combinations are fundamental building blocks of linear algebra and machine learning. Every prediction $\hat{y} = Xw$, every gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}$, every attention output $\sum_i \alpha_i v_i$, and every neural network layer (before its nonlinearity) is ultimately a linear combination of learned vectors. Understanding span—the set of all possible linear combinations—reveals model expressiveness, training dynamics, and the geometry of learned representations.
The span of a set of vectors $\{v_1, \ldots, v_k\}$ is the smallest subspace containing all of them. Geometrically, it’s all points reachable by scaling and adding the vectors. Algebraically, it’s $\{\sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R}\}$. In ML, span determines:
- Model capacity: What functions can a model represent?
- Feature redundancy: Are some features linear combinations of others?
- Solution uniqueness: When are there multiple parameter vectors giving identical predictions?
- Expressiveness vs. efficiency: Can we reduce dimensionality without losing information?
This chapter adopts an ML-first perspective: we introduce span through concrete algorithms (kernel methods, attention, overparameterization) rather than abstract axioms. The goal is to build geometric intuition (span as reachable points) and computational skill (checking linear independence, computing basis) simultaneously.
Important Ideas#
1. Linear combinations are everywhere in ML. A linear combination of vectors $\{v_1, \ldots, v_k\}$ with coefficients $\{\alpha_1, \ldots, \alpha_k\}$ is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k $$
Examples in ML:
- Linear regression predictions: $\hat{y} = Xw = \sum_{j=1}^d w_j x_j$ (linear combination of feature columns).
- Gradient descent updates: $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ (linear combination of current parameters and gradient).
- Attention outputs: $z = \sum_{i=1}^n \alpha_i v_i$ (weighted sum of value vectors with attention weights $\alpha_i$).
- Kernel predictions: $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$ (representer theorem: optimal solution is a linear combination of training kernels).
- Word embeddings: Analogies $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$ (linear combinations capture semantic relationships).
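The first identity—$\hat{y} = Xw$ as a weighted sum of the columns of $X$—can be verified numerically. A minimal sketch with a toy matrix (all values below are arbitrary illustrative choices):

```python
import numpy as np

# Toy feature matrix: 4 samples, 3 features (hypothetical values).
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 2.0],
              [1.0, 1.0, 1.0]])
w = np.array([2.0, -1.0, 0.5])

# The matrix-vector product...
y_hat = X @ w

# ...equals the linear combination of the columns of X with weights w_j.
y_combo = sum(w[j] * X[:, j] for j in range(X.shape[1]))

assert np.allclose(y_hat, y_combo)
```

This column-wise view is the one that matters for span: the reachable predictions are exactly the linear combinations of the feature columns.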
2. Span determines expressiveness. The span of $\{v_1, \ldots, v_k\}$ is: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$
This is the set of all possible linear combinations—the “reachable subspace” if we’re allowed to scale and add the vectors. Key properties:
- It’s a subspace: Closed under addition and scalar multiplication (adding/scaling linear combinations gives another linear combination).
- It’s the smallest subspace containing $\{v_1, \ldots, v_k\}$: Any subspace containing all $v_i$ must contain their span.
- Dimension = maximal number of linearly independent vectors: If $v_k = \sum_{i=1}^{k-1} c_i v_i$ (linear dependence), adding $v_k$ doesn't enlarge the span.
In ML context:
- Column space of $X$: All predictions $\hat{y} = Xw$ lie in $\text{span}(\text{columns of } X) = \text{col}(X)$. If $y \notin \text{col}(X)$, perfect fit is impossible (residual is nonzero).
- Feature redundancy: If feature $x_j$ is a linear combination of other features, adding it doesn’t increase $\text{span}(\text{columns of } X)$ or model capacity.
- Kernel methods: Predictions lie in $\text{span}\{k(x_1, \cdot), \ldots, k(x_n, \cdot)\}$ (representer theorem). This is typically a finite-dimensional subspace of the (infinite-dimensional) RKHS.
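The column-space point can be sketched numerically: a target inside $\text{col}(X)$ fits exactly, while one outside leaves a nonzero residual. The $3 \times 2$ matrix and targets below are hypothetical illustrative choices:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # col(X) is a 2-D plane in R^3

y_in = np.array([2.0, 3.0, 5.0])    # = 2*col1 + 3*col2, lies in col(X)
y_out = np.array([1.0, 0.0, 0.0])   # does not lie in col(X)

def residual_norm(X, y):
    """Distance from y to col(X): norm of the least-squares residual."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(y - X @ w)

assert residual_norm(X, y_in) < 1e-10   # perfect fit possible
assert residual_norm(X, y_out) > 0.1    # y not reachable by any Xw
```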
3. Linear independence vs. dependence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if the only solution to $\sum_{i=1}^k \alpha_i v_i = 0$ is $\alpha_1 = \cdots = \alpha_k = 0$. Otherwise, they’re linearly dependent (one is a linear combination of others).
Why it matters:
- Basis: A linearly independent set spanning $V$ is a basis for $V$. Every vector in $V$ has a unique representation as a linear combination of basis vectors.
- Rank: $\text{rank}(X) = $ number of linearly independent columns = $\dim(\text{col}(X))$.
- Multicollinearity: In regression, exactly dependent features ($\text{rank}(X) < d$) make $X^\top X$ singular (non-invertible), and nearly dependent features make it ill-conditioned; both call for regularization.
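These rank facts can be checked directly with NumPy. The redundant column and regularization strength below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))                   # three independent features
X = np.column_stack([A, A[:, 0] + 2.0 * A[:, 1]])   # add a redundant column

# rank(X) < d reveals the linear dependence among the 4 columns.
assert np.linalg.matrix_rank(X) == 3

# The Gram matrix X^T X is (numerically) singular...
s = np.linalg.svd(X.T @ X, compute_uv=False)
assert s[-1] / s[0] < 1e-10

# ...but ridge regularization X^T X + lam*I is well-conditioned.
s_reg = np.linalg.svd(X.T @ X + 0.1 * np.eye(4), compute_uv=False)
assert s_reg[-1] / s_reg[0] > 1e-5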
4. Representer theorem: solutions lie in span of training data. For many ML problems (kernel ridge regression, SVMs, Gaussian processes), the optimal solution has the form: $$ f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x) $$
This is a linear combination of kernel functions evaluated at training points. Despite working in an infinite-dimensional space (e.g., RBF kernel), the solution lies in an $n$-dimensional subspace (span of $\{k(x_i, \cdot)\}_{i=1}^n$).
Implications:
- Computational tractability: Optimization over infinite dimensions reduces to solving an $n \times n$ system.
- Overfitting vs. underfitting: More training points ($n$ large) increases capacity but also computational cost ($O(n^3)$ for exact methods).
- Sparse solutions: $\ell_1$ regularization (Lasso) and the hinge loss (SVMs) produce solutions with many $\alpha_i = 0$ (sparse linear combinations).
Relevance to Machine Learning#
Model expressiveness and capacity. The span of a feature matrix $X \in \mathbb{R}^{n \times d}$ determines all possible predictions. For linear regression $\hat{y} = Xw$:
- If $\text{rank}(X) = d$ (full column rank), predictions sweep out a $d$-dimensional subspace of $\mathbb{R}^n$, and the least-squares weights are unique.
- If $\text{rank}(X) < d$, features are redundant: some columns are linear combinations of others, so adding further dependent features doesn't enlarge $\text{col}(X)$.
- If $\text{rank}(X) < n$ (typical in the overdetermined case $n > d$), $\text{col}(X)$ is a proper subspace of $\mathbb{R}^n$, so an exact fit is impossible unless $y \in \text{col}(X)$ (rare).
Attention mechanisms. Transformer attention computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where the output is a convex combination (weighted average with non-negative weights summing to 1) of value vectors. Each output lies in $\text{span}(\text{rows of } V)$. Multi-head attention projects to multiple subspaces (heads), increasing expressiveness.
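A minimal sketch of single-head attention with toy shapes (the dimensions and random inputs below are illustrative, not from any real model), verifying that the weights form a convex combination:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d_k = 8
Q = rng.standard_normal((3, d_k))   # 3 queries
K = rng.standard_normal((5, d_k))   # 5 keys
V = rng.standard_normal((5, 4))     # 5 value vectors in R^4

alpha = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights, shape (3, 5)
Z = alpha @ V                             # each output row: combo of rows of V

# Non-negative weights summing to 1: a convex combination.
assert np.all(alpha >= 0)
assert np.allclose(alpha.sum(axis=1), 1.0)

# Hence outputs lie in span(rows of V): stacking Z doesn't increase rank.
assert np.linalg.matrix_rank(np.vstack([V, Z])) == np.linalg.matrix_rank(V)
```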
Kernel methods and representer theorem. For kernel ridge regression, the optimal solution is: $$ \alpha^* = (K + \lambda I)^{-1} y $$ where $K_{ij} = k(x_i, x_j)$ is the Gram matrix. Predictions are $f(x) = \sum_{i=1}^n \alpha_i^* k(x_i, x)$ (linear combination of training kernels). This holds for any kernel (linear, polynomial, RBF, neural network), enabling implicit infinite-dimensional feature spaces.
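The closed form translates directly to code. A sketch of kernel ridge regression on synthetic 1-D data; the RBF kernel, bandwidth, regularization strength, and data below are all illustrative choices:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Kernel matrix with entries k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-2
K = rbf_kernel(X_train, X_train)                      # Gram matrix K_ij
alpha = np.linalg.solve(K + lam * np.eye(40), y_train)  # alpha* = (K + lam I)^{-1} y

def predict(X_new):
    # Representer theorem: a linear combination of training kernels.
    return rbf_kernel(X_new, X_train) @ alpha

y_fit = predict(X_train)
```

Note that `alpha` is found by solving an $n \times n$ linear system, never by touching the (potentially infinite-dimensional) feature space.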
Word embeddings and analogies. Word2Vec (Mikolov et al., 2013) famously demonstrated that semantic relationships correspond to linear offsets in embedding space: $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}$. This shows embeddings capture compositional structure (adding/subtracting vectors blends meanings).
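The linear-offset behavior can be mimicked with hand-crafted 2-D toy vectors (purely illustrative—real embeddings are learned and high-dimensional; the vocabulary and values below are made up for the demo):

```python
import numpy as np

emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
    "apple": np.array([5.0, 5.0]),
}

def analogy(a, b, c, emb):
    """Word nearest (by cosine similarity) to emb[a] - emb[b] + emb[c]."""
    target = emb[a] - emb[b] + emb[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

assert analogy("king", "man", "woman", emb) == "queen"
```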
Algorithmic Development History#
1. Linear combinations in classical mechanics (Newton, 1687). Newtonian mechanics resolves forces and accelerations into components along coordinate axes—linear combinations of basis directions. Decomposing vectors into Cartesian components enabled solving physical systems.
2. Linear algebra formalization (Grassmann 1844, Peano 1888). Grassmann introduced “extensive magnitudes” (vectors) and exterior algebra (wedge products, spans). Peano axiomatized vector spaces with addition and scalar multiplication, formalizing linear combinations.
3. Least squares and column space (Gauss 1809, Legendre 1805). Gauss used least squares for orbit determination. The key insight: predictions $\hat{y} = Xw$ lie in $\text{col}(X)$, and the best fit minimizes $\|y - \hat{y}\|_2$ by projecting $y$ onto $\text{col}(X)$.
4. Kernel trick and representer theorem (Kimeldorf & Wahba 1970, Schölkopf 1990s). Kimeldorf & Wahba proved the representer theorem for splines: optimal smoothing spline is a linear combination of kernel basis functions. Schölkopf, Smola, and Vapnik extended this to SVMs and kernel ridge regression, enabling nonlinear learning in RKHS.
5. Word embeddings and linear structure (Mikolov et al. 2013). Word2Vec revealed that embeddings exhibit linear compositionality: analogies like “king - man + woman ≈ queen” work because semantic relationships correspond to parallel vectors (linear offsets). This was surprising—neural networks learned a structured linear space without explicit supervision.
6. Attention and weighted sums (Bahdanau 2015, Vaswani 2017). Attention mechanisms compute outputs as convex combinations (weighted averages) of value vectors. The Transformer (Vaswani et al., 2017) replaced recurrence with attention, showing that linear combinations of context (with learned weights) suffice for sequence modeling.
7. Overparameterization and implicit bias (Bartlett 2020, Arora 2019). Modern deep networks are vastly overparameterized ($d \gg n$), so solutions lie in $w_{\min} + \text{null}(X)$ (affine subspace). Gradient descent exhibits implicit regularization, preferring solutions in specific subspaces (e.g., low-rank, sparse). Understanding span and null space clarifies why overparameterized models generalize.
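The affine solution set $w_{\min} + \text{null}(X)$ can be demonstrated directly; the dimensions below are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 12                        # d >> n: overparameterized regime
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolating solution via the pseudoinverse.
w_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_min, y)    # exact fit (X has full row rank here)

# Rows n..d-1 of V^T span null(X); add any null-space vector freely.
U, s, Vt = np.linalg.svd(X)
null_basis = Vt[n:]                 # shape (d - n, d)
w_alt = w_min + 0.7 * null_basis[0] - 1.3 * null_basis[3]

assert np.allclose(X @ w_alt, y)                        # same predictions
assert np.linalg.norm(w_alt) > np.linalg.norm(w_min)    # but larger norm
```

Gradient descent initialized at zero stays in the row space of $X$ and so converges to $w_{\min}$, one concrete form of implicit bias.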
Definitions#
Linear combination. Given vectors $\{v_1, \ldots, v_k\} \subset V$ and scalars $\{\alpha_1, \ldots, \alpha_k\} \subset \mathbb{R}$, the linear combination is: $$ v = \sum_{i=1}^k \alpha_i v_i = \alpha_1 v_1 + \cdots + \alpha_k v_k \in V $$
Span. The span of $\{v_1, \ldots, v_k\}$ is the set of all linear combinations: $$ \text{span}\{v_1, \ldots, v_k\} = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{R} \right\} $$ This is the **smallest subspace** containing $\{v_1, \ldots, v_k\}$.
Linear independence. Vectors $\{v_1, \ldots, v_k\}$ are linearly independent if: $$ \sum_{i=1}^k \alpha_i v_i = 0 \quad \Longrightarrow \quad \alpha_1 = \cdots = \alpha_k = 0 $$ Otherwise, they are **linearly dependent** (at least one $v_j$ is a linear combination of the others).
Basis. A set $\{v_1, \ldots, v_k\}$ is a basis for subspace $S$ if:
- It spans $S$: $\text{span}\{v_1, \ldots, v_k\} = S$.
- It is linearly independent.
Every vector in $S$ has a unique representation as a linear combination of basis vectors.
Column space (range). For $A \in \mathbb{R}^{m \times n}$, the column space is: $$ \text{col}(A) = \{Ax : x \in \mathbb{R}^n\} = \text{span}\{\text{columns of } A\} $$
Dimension. $\dim(S) = $ number of vectors in any basis for $S$. For $\text{col}(A)$, $\dim(\text{col}(A)) = \text{rank}(A)$ (number of linearly independent columns).
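The definitions above can be exercised on a small example (the matrix below is an arbitrary toy basis): independence via rank, uniqueness of coordinates via least squares, and dimension via a thin QR factorization.

```python
import numpy as np

# Columns of B form a basis for a 2-D subspace of R^3.
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
assert np.linalg.matrix_rank(B) == 2       # columns linearly independent

v = B @ np.array([3.0, -1.0])              # a vector in span(columns of B)

# Coordinates of v in this basis are unique; recover them by least squares.
c, *_ = np.linalg.lstsq(B, v, rcond=None)
assert np.allclose(c, [3.0, -1.0])

# dim(col(B)) = rank(B) = 2: thin QR yields an orthonormal basis with 2 columns.
Q, R = np.linalg.qr(B)
assert Q.shape == (3, 2)
```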