Introduction#
Basis and dimension are the language for measuring and reasoning about vector spaces. A basis is a minimal spanning set—a collection of linearly independent vectors that can be scaled and added to represent every vector in the space. The dimension is simply the size of a basis (number of basis vectors). These concepts are ubiquitous in ML:
PCA: Principal components form a basis for a lower-dimensional subspace capturing most data variance.
Autoencoders: Encoder learns a basis for latent representations (bottleneck layer); decoder reconstructs using this basis.
Neural networks: Each layer’s hidden units define learned feature directions that act as an (implicit, not necessarily orthogonal) basis for that layer’s representation space.
Whitening/normalization: Change of basis to decorrelate and rescale features (covariance becomes identity).
Sparse coding: Find a basis (dictionary) that sparsely represents data.
Understanding basis and dimension clarifies model capacity (how many independent directions can the model control?), data complexity (how many dimensions does the data actually occupy?), and numerical stability (are basis vectors nearly parallel, i.e., ill-conditioned?).
Important Ideas#
1. Basis = minimal spanning set. A basis $\{v_1, \ldots, v_d\}$ for subspace $S$ satisfies:
Spans $S$: Every vector in $S$ is a linear combination $s = \sum_{i=1}^d \alpha_i v_i$.
Linearly independent: No $v_j$ is a linear combination of the others. Equivalently, $\sum_{i=1}^d \alpha_i v_i = 0 \Rightarrow \alpha_1 = \cdots = \alpha_d = 0$.
Why minimal? Remove any vector and the set no longer spans $S$; and a basis contains no redundant vector, since by independence none is a linear combination of the others.
Uniqueness of representation: For basis $\{v_1, \ldots, v_d\}$, each vector $s \in S$ has unique coefficients: if $s = \sum_i \alpha_i v_i = \sum_i \beta_i v_i$, then $\alpha_i = \beta_i$ for all $i$ (follows from linear independence).
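Both conditions, and the resulting uniqueness of coordinates, can be checked numerically. A minimal NumPy sketch (the basis vectors and the target $s$ are illustrative choices):

```python
import numpy as np

# Candidate basis for R^2; the columns of V are the basis vectors v1, v2.
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Linear independence <=> V has full column rank.
assert np.linalg.matrix_rank(V) == V.shape[1]

# Unique coordinates of s in this basis: solve V @ alpha = s.
s = np.array([3.0, 2.0])
alpha = np.linalg.solve(V, s)
print(alpha)  # [1. 2.]: s = 1*v1 + 2*v2, and no other coefficients work
```

Because $V$ is square with independent columns, `np.linalg.solve` returns the one and only coefficient vector, which is exactly the uniqueness claim above.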
2. Dimension = basis size. All bases for a subspace have the same number of vectors. This number is the dimension: $\dim(S) = $ number of vectors in any basis for $S$. For matrix $A$: $$ \dim(\text{col}(A)) = \text{rank}(A), \quad \dim(\text{null}(A)) = \text{nullity}(A) = n - \text{rank}(A) $$
Why constant? Different bases may use different vectors, but by the Steinitz exchange lemma a linearly independent set in $S$ can never be larger than a spanning set, so any two bases for $S$ have the same cardinality. Dimension is therefore a property of the subspace, not of any particular basis.
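The rank and nullity formulas above are easy to verify numerically. A small sketch with a deliberately rank-deficient matrix (the entries are an arbitrary example):

```python
import numpy as np

# A 3x4 matrix whose third row is the sum of the first two,
# so its rows (and columns) are not all independent.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 3.0, 1.0]])

r = np.linalg.matrix_rank(A)   # dim(col(A))
nullity = A.shape[1] - r       # rank-nullity: nullity = n - rank
print(r, nullity)              # 2 2
```

Any basis for $\text{col}(A)$ here has exactly 2 vectors, and any basis for $\text{null}(A)$ has exactly 2, regardless of which particular basis vectors are chosen.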
3. Change of basis. The same vector has different coordinates in different bases. For bases $\mathcal{B} = \{v_1, \ldots, v_d\}$ and $\mathcal{B}' = \{v'_1, \ldots, v'_d\}$, the change of basis matrix $P = [v'_1 | \cdots | v'_d]$ (columns are new basis vectors in old basis) satisfies: $$ v_\text{old basis} = P v_\text{new basis}, \quad v_\text{new basis} = P^{-1} v_\text{old basis} $$
In ML: Changing basis = change of representation. PCA rotates to principal component basis. Whitening rotates to decorrelated basis (covariance = identity).
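A small sketch of the coordinate conversion, using a 45-degree rotation as the new basis (an illustrative choice; the columns of $P$ are the new basis vectors in old coordinates):

```python
import numpy as np

# New basis B' in old (standard) coordinates: a 45-degree rotation.
theta = np.pi / 4
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # columns = new basis vectors

v_old = np.array([1.0, 1.0])
v_new = np.linalg.solve(P, v_old)  # = P^{-1} v_old: coordinates in B'
print(v_new)      # ~[sqrt(2), 0]: v lies along the first new axis
print(P @ v_new)  # back to old coordinates: ~[1, 1]
```

The vector itself never changes; only its coordinates do, which is why `P @ v_new` recovers the original components.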
4. Standard basis and explicit coordinates. The standard basis in $\mathbb{R}^d$ is $\{e_1, \ldots, e_d\}$ where $e_i = [0, \ldots, 1, \ldots, 0]^\top$ (1 in position $i$). For this basis: $$ v = [v_1, \ldots, v_d]^\top = \sum_{i=1}^d v_i e_i $$
Coordinates in the standard basis are just the vector’s components. Embedding lookups use one-hot basis: to get embedding of token $i$, multiply by $e_i$ (selection vector).
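A minimal sketch of the one-hot view of embedding lookup (the table values are arbitrary; real embedding layers index the row directly, which is mathematically equivalent):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
# Toy embedding table: row i is the embedding of token i.
E = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

i = 2
e_i = np.zeros(vocab_size)
e_i[i] = 1.0  # standard basis vector e_i (one-hot encoding of token i)

# Multiplying by e_i selects row i of the embedding table.
assert np.allclose(e_i @ E, E[i])
```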
Relevance to Machine Learning#
Model capacity and expressiveness. A model operating in a $d$-dimensional space can control at most $d$ independent directions. Linear regression with $d$ features can fit $d$ linearly independent targets. Deep networks learn hierarchical bases: layer $\ell$ learns a basis for its activation space (dimension = number of hidden units).
Data intrinsic dimensionality. Real data often lies in a lower-dimensional subspace (manifold hypothesis). PCA finds a basis for the dominant subspace; if top $k$ eigenvalues capture 95% of variance, data is approximately $k$-dimensional. This justifies dimensionality reduction without information loss.
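This diagnostic can be sketched on synthetic data constructed to lie near a 2-dimensional plane in $\mathbb{R}^3$ (all sizes and noise levels here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# 2-D latent factors embedded in R^3, plus small isotropic noise.
latent = rng.normal(size=(n, 2))
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])  # maps the 2-D plane into R^3
X = latent @ W.T + 0.01 * rng.normal(size=(n, 3))

# Eigenvalues of the covariance = variance along principal directions.
C = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()
print(explained)  # top 2 directions capture essentially all variance
```

The third eigenvalue is tiny (it reflects only the noise), so the data is approximately 2-dimensional and a 2-vector basis suffices for reconstruction.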
Numerical stability and conditioning. Basis properties affect computation: orthonormal bases (columns have norm 1, mutually perpendicular) are numerically stable (condition number = 1). Nearly parallel basis vectors (ill-conditioned) cause numerical errors in linear algebra operations.
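The contrast can be seen directly with `np.linalg.cond` (the matrices below are illustrative examples):

```python
import numpy as np

# Orthonormal basis: condition number exactly 1.
Q = np.eye(3)

# Nearly parallel basis vectors (the two columns differ only slightly):
# the basis is ill-conditioned.
B = np.array([[1.0, 1.0],
              [0.0, 1e-8]])

print(np.linalg.cond(Q))  # 1.0
print(np.linalg.cond(B))  # on the order of 1e8
```

Solving linear systems in the basis $B$ amplifies relative input errors by roughly the condition number, which is why orthonormal bases are preferred in numerical code.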
Feature engineering and representation learning. Hand-crafted features (e.g., polynomial features, Fourier basis) are explicit bases. Neural networks learn implicit bases (hidden layers) that are optimized for the task. Autoencoders learn an optimal basis for data reconstruction (variational autoencoders add structure to this basis).
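A sketch of an explicit polynomial basis used for least-squares fitting (the target function is an arbitrary example chosen so the true coefficients are known):

```python
import numpy as np

# Explicit polynomial basis {1, x, x^2} evaluated on data points:
# each column of Phi is one basis function.
x = np.linspace(-1.0, 1.0, 50)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)

# Least-squares fit of y = 2 - x^2 in this basis recovers the coefficients.
y = 2.0 - x**2
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(coef)  # approximately [2, 0, -1]
```

Learned representations play the same role as the columns of `Phi`, except that the basis functions themselves are optimized rather than fixed in advance.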
Algorithmic Development History#
1. Coordinate geometry (Descartes & Fermat, 1630s-1640s). Cartesian coordinates introduced the standard basis for Euclidean space. Descartes’ La Géométrie (1637) showed geometry could be expressed algebraically using coordinate axes.
2. Change of basis and coordinate transformations (Euler 1770s, Cauchy 1820s-1830s). Euler rotated coordinate systems to simplify equations. Cauchy formalized change of basis through matrix transformations.
3. Linear independence and dimension (Grassmann 1844, Peano 1888). Grassmann introduced independence axiomatically. Peano formalized dimension as the size of maximal independent sets.
4. Orthonormal bases and orthogonalization (Schmidt 1907, Gram-Schmidt process). Erhard Schmidt proved every finite-dimensional space admits an orthonormal basis. Gram-Schmidt orthogonalization computes one constructively.
5. Eigendecomposition and spectral bases (Schur 1909, spectral theory 1920s-1930s). Schur showed every square matrix is unitarily similar to an upper triangular matrix (the Schur decomposition); for symmetric and, more generally, normal matrices, the spectral theorem gives an orthonormal basis of eigenvectors. Eigenvalues/eigenvectors define natural bases (spectral decomposition).
6. PCA and dimensionality reduction (Pearson 1901, Hotelling 1933, SVD 1960s). PCA finds the basis vectors (principal components) that best capture data variance. SVD algorithm (Golub & Kahan 1965) computes this stably.
7. Neural basis learning (1980s-2010s). Neural networks learn basis implicitly: hidden layers learn representations that act as bases for downstream computations. Deeper networks learn hierarchical bases (abstract high-level concepts from concrete low-level features).
8. Dictionary learning and sparse coding (2000s). Learn overcomplete bases (more basis vectors than dimension) that sparsely represent data. Applications: image denoising (K-SVD, Aharon et al. 2006), signal processing.
Definitions#
Basis. A set $\mathcal{B} = \{v_1, \ldots, v_d\}$ is a basis for subspace $S$ if:
$\text{span}(\mathcal{B}) = S$ (spans the subspace).
Vectors in $\mathcal{B}$ are linearly independent.
Every vector in $S$ has a unique representation: $s = \sum_{i=1}^d \alpha_i v_i$ (coordinates $\alpha_i$ are unique).
Dimension. The dimension of subspace $S$ is: $$ \dim(S) = \text{size of any basis for } S $$ This is well-defined: all bases for $S$ have the same size.
Rank and nullity. For matrix $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) = \dim(\text{col}(A)) = \dim(\text{row}(A)) $$ $$ \text{nullity}(A) = \dim(\text{null}(A)) = n - \text{rank}(A) $$
Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = n$.
Change of basis matrix. For bases $\mathcal{B} = \{v_1, \ldots, v_d\}$ and $\mathcal{B}' = \{v'_1, \ldots, v'_d\}$, the change of basis matrix is $P = [v'_1 | \cdots | v'_d]$, whose columns are the vectors of $\mathcal{B}'$ expressed in coordinates of $\mathcal{B}$. For coordinates $[v]_\mathcal{B}$ (in basis $\mathcal{B}$) and $[v]_{\mathcal{B}'}$ (in basis $\mathcal{B}'$): $$ [v]_\mathcal{B} = P [v]_{\mathcal{B}'}, \quad [v]_{\mathcal{B}'} = P^{-1} [v]_\mathcal{B} $$
Coordinates. For vector $v \in S$ and basis $\mathcal{B} = \{v_1, \ldots, v_d\}$, the coordinates are scalars $[\alpha_1, \ldots, \alpha_d]^\top$ satisfying: $$ v = \sum_{i=1}^d \alpha_i v_i $$ Coordinates are unique (follows from linear independence).