Worked Example 1: Embedding interpolation is still a vector
Introduction
Token embeddings in NLP models (Word2Vec, GloVe, BERT, GPT) map discrete tokens to continuous vectors in $\mathbb{R}^d$. A fundamental property: any linear combination of embeddings remains a valid embedding (closure under vector space operations). This enables semantic arithmetic (“king” - “man” + “woman” ≈ “queen”), interpolation between concepts, and averaging embeddings for sentences or documents.
This example demonstrates that embedding spaces are vector spaces by explicitly computing an interpolation (convex combination) of two token embeddings. The result stays in $\mathbb{R}^d$, illustrating closure under linear combinations.
Purpose
Verify closure: Show that $\alpha e(a) + (1-\alpha) e(b) \in \mathbb{R}^d$ for any embeddings $e(a), e(b)$ and scalar $\alpha \in [0,1]$.
Introduce convex combinations: Interpolation with $\alpha \in [0,1]$ produces points on the line segment between $e(a)$ and $e(b)$.
Connect to ML: Embedding arithmetic is used in analogy tasks, compositional semantics, and prompt engineering (e.g., blending concepts for image generation).
Importance
Semantic compositionality. The vector space structure of embeddings enables composing meanings via linear algebra. Famous examples:
Word2Vec analogies (Mikolov et al., 2013): $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$; skip-gram embeddings answer a substantial fraction of analogy questions via nearest-neighbor search in embedding space.
Sentence embeddings: Average token embeddings $\bar{v} = \frac{1}{n} \sum_{i=1}^n e(t_i)$ (simple but effective baseline for sentence similarity).
Image-text embeddings (CLIP, 2021): Contrastive learning aligns image and text embeddings in a shared vector space. Interpolations blend visual/textual concepts.
Training stability. Gradient descent updates embeddings via $e(t) \leftarrow e(t) - \eta \nabla \mathcal{L}$. Closure ensures embeddings never “leave” $\mathbb{R}^d$ during training.
What This Example Demonstrates
This example shows that embedding spaces are closed under linear combinations, a necessary condition for being a vector space. Interpolation $v = \alpha e(a) + (1-\alpha)e(b)$ produces a point between $e(a)$ and $e(b)$, illustrating that we can “blend” semantic meanings by taking weighted averages.
The geometric interpretation: $e(a)$ and $e(b)$ define a line in $\mathbb{R}^d$; all convex combinations lie on the line segment $[e(a), e(b)]$. This extends to arbitrary linear combinations (not just convex), forming the span $\{\alpha e(a) + \beta e(b) : \alpha, \beta \in \mathbb{R}\}$ (a 2D subspace if $e(a)$ and $e(b)$ are linearly independent).
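Both facts above are easy to check numerically. The sketch below uses two made-up embeddings in $\mathbb{R}^4$ (illustrative values, not from any trained model): convex combinations stay on the segment, and two linearly independent embeddings span a 2D subspace.

```python
import numpy as np

# Two made-up embeddings in R^4 (illustrative values, not from a real model)
ea = np.array([1.0, 0.0, 2.0, -1.0])
eb = np.array([-1.0, 3.0, 0.0, 2.0])

# Convex combinations stay in R^4 and on the segment [e(a), e(b)]
for alpha in np.linspace(0.0, 1.0, 5):
    v = alpha * ea + (1 - alpha) * eb
    assert v.shape == ea.shape                     # closure: still in R^4
    assert np.allclose(v - eb, alpha * (ea - eb))  # on the segment

# Two linearly independent embeddings span a 2D subspace:
# stack them as rows and check the rank
rank = np.linalg.matrix_rank(np.vstack([ea, eb]))
print(rank)  # 2
```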
Background
Distributional semantics. The idea that “words are characterized by the company they keep” (Firth, 1957) led to vector space models in NLP. Early methods (latent semantic analysis, 1990; HAL, 1997) used co-occurrence matrices. Modern neural embeddings (Word2Vec, 2013; GloVe, 2014) learn dense representations by predicting context words.
Vector space models in NLP:
Bag-of-words: Represent documents as sparse vectors in $\mathbb{R}^{|\text{vocab}|}$ (counts or TF-IDF weights).
Word embeddings: Learn dense vectors $e(w) \in \mathbb{R}^d$ ($d \approx 50$-$1000$) capturing semantic similarity. Similar words have nearby vectors (measured by cosine similarity or Euclidean distance).
Contextual embeddings (BERT, GPT): Embeddings depend on context; $e(w | \text{context})$ varies across sentences. Still vectors in $\mathbb{R}^d$ at each layer.
Closure and linearity: The vector space axioms (closure, distributivity) are assumed in embedding models but rarely verified explicitly. This example makes closure concrete: interpolation $\alpha e(a) + (1-\alpha)e(b)$ stays in $\mathbb{R}^d$ because $\mathbb{R}^d$ is a vector space.
Historical Context
1. Distributional hypothesis (1950s-1960s). Harris (1954) and Firth (1957) proposed that word meaning is determined by distribution (co-occurrence patterns). This motivated vector representations based on context counts.
2. Latent Semantic Analysis (Deerwester et al., 1990). Applied SVD to term-document matrices, projecting words/documents into low-dimensional subspaces. Demonstrated that dimensionality reduction (via truncated SVD) preserves semantic relationships.
3. Word2Vec (Mikolov et al., 2013). Introduced skip-gram and CBOW models, training shallow neural networks to predict context words. Showed that embeddings exhibit linear structure: analogies correspond to parallel vectors ($v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$).
4. GloVe (Pennington et al., 2014). Combined global co-occurrence statistics with local context prediction, achieving state-of-the-art performance on analogy and similarity tasks.
5. Contextual embeddings (2018-present). BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) compute embeddings that vary by context, using Transformer architectures. Embeddings at each layer are still vectors in $\mathbb{R}^d$, but $e(w)$ depends on the entire input sequence.
History in Machine Learning
1990: LSA applies SVD to term-document matrices (vector space models).
2013: Word2Vec popularizes dense embeddings; analogy tasks demonstrate linear structure.
2014: GloVe combines global statistics with neural methods.
2017: Transformers (Vaswani et al.) enable contextualized embeddings via attention.
2018: BERT and GPT revolutionize NLP by learning contextual representations at scale.
2021: CLIP (Radford et al.) aligns image and text embeddings in a shared vector space, enabling zero-shot image classification and text-to-image generation.
Prevalence in Machine Learning
Ubiquitous in NLP: Every modern NLP model (BERT, GPT, T5, LLaMA) uses token embeddings in $\mathbb{R}^d$ ($d = 768$ for BERT-base, $d = 12288$ for GPT-3; GPT-4's dimensionality has not been published). Embeddings are the primary representation for text.
Vision and multimodal models:
Vision Transformers (ViT, 2020): Patch embeddings in $\mathbb{R}^d$ replace pixel representations.
CLIP (2021): Image and text embeddings in a shared $\mathbb{R}^{512}$ space enable cross-modal retrieval.
DALL-E, Stable Diffusion (2021-2022): Text embeddings condition diffusion models for image generation.
Recommendation systems: Item embeddings in $\mathbb{R}^d$ capture user preferences. Collaborative filtering factorizes user-item matrices into embeddings.
Notes and Explanatory Details
Shape discipline: $e(a) \in \mathbb{R}^d$, $e(b) \in \mathbb{R}^d$, $\alpha \in \mathbb{R}$. The interpolation $v = \alpha e(a) + (1-\alpha)e(b)$ is a linear combination, so $v \in \mathbb{R}^d$ by closure.
Convex combinations: Restricting $\alpha \in [0,1]$ ensures $v$ lies on the line segment $[e(a), e(b)]$. Allowing $\alpha \in \mathbb{R}$ gives the entire line through $e(a)$ and $e(b)$ (the span).
Geometric interpretation: In 3D, if $e(a) = [1, 0, 2]$ and $e(b) = [-1, 3, 0]$, then $v = 0.3 e(a) + 0.7 e(b)$ lies 30% of the way from $e(b)$ to $e(a)$.
Numerical considerations: Embedding norms vary (typical $\|e(w)\|_2 \approx 1$-$10$ depending on initialization). Normalization (dividing by $\|e(w)\|_2$) is common for cosine similarity metrics.
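The normalization point can be made concrete: after dividing by the L2 norm, cosine similarity reduces to a plain dot product. A short sketch with made-up embedding values:

```python
import numpy as np

# L2 normalization before cosine similarity (made-up embeddings)
ea = np.array([3.0, 4.0, 0.0])
eb = np.array([0.0, 4.0, 3.0])

ea_hat = ea / np.linalg.norm(ea)  # unit-norm versions
eb_hat = eb / np.linalg.norm(eb)

# For unit vectors, cosine similarity is just the dot product
cos_sim = ea_hat @ eb_hat
assert np.isclose(np.linalg.norm(ea_hat), 1.0)
print(cos_sim)  # 16 / 25 = 0.64 (up to float rounding)
```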
Connection to Machine Learning
Analogy tasks. Linear offsets capture semantic relationships: $v_{\text{France}} - v_{\text{Paris}} \approx v_{\text{Germany}} - v_{\text{Berlin}}$ (capital relationship). The vector $v_{\text{France}} - v_{\text{Paris}}$ represents the “capital-of” direction in embedding space.
Prompt interpolation. In text-to-image models, interpolating prompt embeddings generates images blending two concepts. Example: $\alpha e(\text{"dog"}) + (1-\alpha)e(\text{"cat"})$ with $\alpha = 0.5$ might generate an image with both dog-like and cat-like features.
Sentence embeddings. Averaging token embeddings $\bar{v} = \frac{1}{n} \sum_{i=1}^n e(t_i)$ is a simple but effective sentence representation (used in Skip-Thought, InferSent). More sophisticated: weighted averages (TF-IDF weights) or learned aggregations (attention).
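The averaging baseline is a one-liner. A sketch with toy token embeddings (values are illustrative, not from a trained model); the uniform average and a weighted average are both convex combinations, so both stay in the convex hull of the token embeddings:

```python
import numpy as np

# Toy token embeddings for a 3-token sentence, shape (n_tokens, d)
E = np.array([[ 1.0,  0.0, 2.0],
              [ 3.0, -1.0, 0.0],
              [-1.0,  4.0, 1.0]])

# Uniform average: the simple sentence-embedding baseline
v_bar = E.mean(axis=0)

# Weighted average (weights stand in for TF-IDF scores); since the
# weights are nonnegative and sum to 1, this is a convex combination
w = np.array([0.5, 0.3, 0.2])
v_weighted = w @ E

print(v_bar)       # [1. 1. 1.]
print(v_weighted)  # stays in the convex hull of the token embeddings
```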
Connection to Linear Algebra Theory
Vector space axioms. $\mathbb{R}^d$ satisfies all vector space axioms:
Closure: $e(a) + e(b) \in \mathbb{R}^d$ and $\alpha e(a) \in \mathbb{R}^d$.
Associativity: $(e(a) + e(b)) + e(c) = e(a) + (e(b) + e(c))$.
Commutativity: $e(a) + e(b) = e(b) + e(a)$.
Identity: $e(a) + 0 = e(a)$ where $0 = [0, \ldots, 0] \in \mathbb{R}^d$.
Inverses: $e(a) + (-e(a)) = 0$.
Scalar distributivity: $\alpha(e(a) + e(b)) = \alpha e(a) + \alpha e(b)$.
Subspaces. The span $\text{span}\{e(t_1), \ldots, e(t_k)\}$ is a subspace of $\mathbb{R}^d$. If the embedding matrix for a vocabulary of size $|V|$ has rank $k < d$, all embeddings lie in a $k$-dimensional subspace (low-rank embedding matrix).
Convex combinations. Combinations $\sum_{i=1}^k \alpha_i e(t_i)$ with $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$ form the convex hull of the embeddings (a polytope in $\mathbb{R}^d$); dropping the nonnegativity constraint gives affine combinations. Sentence embeddings via averaging lie in this convex hull.
Pedagogical Significance
Concrete verification of closure. Many students learn vector space axioms abstractly but rarely see explicit numerical verification. This example shows that $\alpha e(a) + (1-\alpha)e(b) \in \mathbb{R}^d$ by computing actual numbers.
Geometric intuition. Interpolation visualizes the line segment between two points in $\mathbb{R}^d$. Extending to $\alpha \notin [0,1]$ shows extrapolation (moving beyond $e(a)$ or $e(b)$ along the line).
Foundation for advanced topics. Understanding embedding spaces as vector spaces is prerequisite for:
Analogies: Vector arithmetic $e(a) - e(b) + e(c)$ requires closure.
Dimensionality reduction: Projecting embeddings to lower-dimensional subspaces (PCA, t-SNE).
Alignment: Mapping embeddings between languages (Procrustes alignment, learned transforms).
References
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” NeurIPS 2013. Introduced Word2Vec (skip-gram, CBOW); demonstrated analogy tasks.
Pennington, J., Socher, R., & Manning, C. D. (2014). “GloVe: Global Vectors for Word Representation.” EMNLP 2014. Combined global co-occurrence statistics with local context.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019. Contextual embeddings via masked language modeling.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” ICML 2021. CLIP aligns image/text embeddings in shared space.
Firth, J. R. (1957). “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic Analysis. Introduced distributional hypothesis: “You shall know a word by the company it keeps.”
Problem. Show token embeddings live in a vector space and compute an interpolation.
Solution (math).
Given embeddings $e(a), e(b) \in \mathbb{R}^d$ and $\alpha \in [0,1]$, the interpolation is:
$$
v = \alpha e(a) + (1-\alpha)e(b)
$$
By closure of $\mathbb{R}^d$ under linear combinations, $v \in \mathbb{R}^d$. For $\alpha = 0$, $v = e(b)$; for $\alpha = 1$, $v = e(a)$; for $\alpha = 0.5$, $v$ is the midpoint.
Solution (Python).
import numpy as np
# Define embeddings for tokens 'a' and 'b' in R^3
E = {
'a': np.array([1., 0., 2.]),
'b': np.array([-1., 3., 0.])
}
# Interpolation parameter (0 <= alpha <= 1)
alpha = 0.3
# Compute convex combination
v = alpha * E['a'] + (1 - alpha) * E['b']
print(f"e(a) = {E['a']}")
print(f"e(b) = {E['b']}")
print(f"v = {alpha} * e(a) + {1-alpha} * e(b) = {v}")
print(f"v is in R^3: {v.shape == (3,)}")
Output:
e(a) = [1. 0. 2.]
e(b) = [-1. 3. 0.]
v = 0.3 * e(a) + 0.7 * e(b) = [-0.4 2.1 0.6]
v is in R^3: True
Worked Example 2: Zero-mean subspace projection
Introduction
Centering data (subtracting the mean) is a ubiquitous preprocessing step in machine learning. PCA, covariance estimation, batch normalization, and many other algorithms assume zero-mean data. Mathematically, zero-mean vectors form a subspace: $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ where $\mathbf{1} = [1, \ldots, 1]^\top$ is the all-ones vector.
This example shows that $S$ is the null space of the row vector $\mathbf{1}^\top$, demonstrates projection onto $S$ via the centering matrix $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$, and verifies that the projected vector has zero mean.
Purpose
Demonstrate subspaces defined by constraints: $S = \{x : \mathbf{1}^\top x = 0\}$ is a hyperplane through the origin ($(n-1)$-dimensional subspace).
Introduce projection matrices: $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ projects onto $S$ (removes the mean).
Connect to ML: Centering data is equivalent to projecting onto the zero-mean subspace.
Importance
PCA and covariance estimation. PCA operates on the centered data matrix $X_c = X - \frac{1}{n} \mathbf{1}\mathbf{1}^\top X$ (each column has zero mean). The covariance matrix $C = \frac{1}{n} X_c^\top X_c$ measures variance around the mean; if data is not centered, $C$ mixes mean and variance.
Batch normalization (Ioffe & Szegedy, 2015). Normalizes layer activations by subtracting batch mean and dividing by batch std. The mean-centering step is projection onto the zero-mean subspace.
Regularization and identifiability. In linear regression with an intercept $f(x) = w^\top x + b$, centering inputs $x \mapsto x - \bar{x}$ and targets $y \mapsto y - \bar{y}$ decouples the intercept from the weights, improving numerical stability and interpretability.
What This Example Demonstrates
This example shows that:
Zero-mean vectors form a subspace (closed under addition/scaling, contains zero).
Projection onto $S$ is linear: $x_{\text{proj}} = Px$ with $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$.
The projection removes the mean: $\mathbf{1}^\top (Px) = 0$ for all $x$.
$P$ is idempotent: $P^2 = P$ (projecting twice is the same as projecting once).
$P$ is symmetric: $P^\top = P$ (orthogonal projection).
Background
Affine subspaces vs. linear subspaces. An affine subspace (hyperplane) has the form $\{x : a^\top x = c\}$; for $c \neq 0$ it does not contain the origin and is therefore not a linear subspace. Zero-mean vectors form a linear subspace because $c = 0$ and $\mathbf{1}^\top 0 = 0$.
Centering matrix. The matrix $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ is called the centering matrix or projection onto zero-mean subspace. It satisfies:
$P\mathbf{1} = 0$ (projects all-ones vector to zero).
$Px = x - \bar{x} \mathbf{1}$ where $\bar{x} = \frac{1}{n} \mathbf{1}^\top x$ (subtracts the mean).
$P^2 = P$ (idempotent: projecting twice does nothing).
$P^\top = P$ (symmetric: orthogonal projection).
Null space and column space. $S = \text{null}(\mathbf{1}^\top)$ is the set of vectors perpendicular to $\mathbf{1}$. The orthogonal complement is $S^\perp = \text{span}\{\mathbf{1}\}$ (scalar multiples of $\mathbf{1}$). By the fundamental theorem of linear algebra, $\mathbb{R}^n = S \oplus S^\perp$ (direct sum).
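The direct sum $\mathbb{R}^n = S \oplus S^\perp$ can be verified directly: split a vector into its zero-mean part and its mean part, and check that the parts are orthogonal and recover the original. A minimal sketch with arbitrary values:

```python
import numpy as np

# Orthogonal decomposition x = x_par + x_perp for the zero-mean subspace
n = 4
x = np.array([2.0, -1.0, 5.0, 0.0])  # illustrative values
one = np.ones(n)

x_par = x - x.mean() * one   # zero-mean component (in S)
x_perp = x.mean() * one      # mean component (in span{1})

assert np.isclose(one @ x_par, 0.0)     # x_par lies in S
assert np.isclose(x_par @ x_perp, 0.0)  # the two parts are orthogonal
assert np.allclose(x_par + x_perp, x)   # the decomposition recovers x
print("direct sum decomposition verified")
```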
Historical Context
1. Least squares and centering (Gauss, 1809). Gauss used mean-centering in least squares for astronomical orbit fitting (method of least squares, published in Theoria Motus).
2. PCA (Pearson 1901, Hotelling 1933). Both Pearson’s “lines of closest fit” and Hotelling’s “principal components” assume centered data. The covariance matrix $C = \frac{1}{n} X_c^\top X_c$ is meaningful only after centering; without it, the computation conflates mean and variance.
3. Projection matrices (Penrose 1955, Rao 1955). The theory of orthogonal projections was formalized in the 1950s. The centering matrix $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ is a rank-$(n-1)$ projection matrix.
4. Batch normalization (Ioffe & Szegedy, 2015). Revolutionized deep learning by normalizing layer activations. The first step is centering: $\hat{x}_i = x_i - \frac{1}{B} \sum_{i=1}^B x_i$ (subtract batch mean).
History in Machine Learning
1809: Gauss applies least squares with mean-centering (astronomical data).
1901: Pearson introduces PCA (assumes centered data).
1933: Hotelling formalizes PCA (covariance matrix requires centering).
1955: Penrose and Rao develop theory of projection matrices.
2015: Batch normalization (Ioffe & Szegedy) makes centering a learned layer operation.
2016: Layer normalization (Ba et al.) centers across features instead of batches.
2019: Group normalization (Wu & He) centers within feature groups (used in computer vision).
Prevalence in Machine Learning
Preprocessing: Nearly all classical ML algorithms (PCA, LDA, SVM, ridge regression) assume centered data. Scikit-learn’s StandardScaler first centers ($x \mapsto x - \bar{x}$) then scales ($x \mapsto x / \sigma$).
Deep learning normalization:
Batch norm: Centers and scales mini-batch statistics.
Layer norm: Centers and scales across feature dimension (used in Transformers).
Instance norm: Centers each example independently (style transfer, GANs).
Optimization: The Adam optimizer maintains exponential moving averages of gradients and squared gradients. The first moment $m_t$ is a smoothed (exponentially averaged) gradient estimate.
Notes and Explanatory Details
Shape discipline:
Input: $x \in \mathbb{R}^n$ (column vector).
All-ones vector: $\mathbf{1} \in \mathbb{R}^n$ (column vector).
Mean: $\bar{x} = \frac{1}{n} \mathbf{1}^\top x \in \mathbb{R}$ (scalar).
Centering matrix: $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \in \mathbb{R}^{n \times n}$.
Projected vector: $x_{\text{proj}} = Px \in \mathbb{R}^n$.
Verification of projection properties:
Projects to zero-mean subspace: $\mathbf{1}^\top (Px) = \mathbf{1}^\top \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \right) x = \mathbf{1}^\top x - \frac{1}{n} (\mathbf{1}^\top \mathbf{1})(\mathbf{1}^\top x) = \mathbf{1}^\top x - \mathbf{1}^\top x = 0$.
Idempotent: $P^2 = \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \right)^2 = I - \frac{2}{n} \mathbf{1}\mathbf{1}^\top + \frac{1}{n^2} \mathbf{1}(\mathbf{1}^\top \mathbf{1})\mathbf{1}^\top = I - \frac{2}{n} \mathbf{1}\mathbf{1}^\top + \frac{1}{n} \mathbf{1}\mathbf{1}^\top = P$.
Symmetric: $P^\top = \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top \right)^\top = I - \frac{1}{n} (\mathbf{1}\mathbf{1}^\top)^\top = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top = P$.
Numerical considerations: Computing $P$ explicitly (storing $n \times n$ matrix) is wasteful for large $n$. Instead, compute $Px = x - \bar{x} \mathbf{1}$ directly (subtracting the mean).
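A quick sketch of this efficiency point, with the projection properties from above verified along the way (sizes and values are arbitrary):

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
x = rng.normal(size=n)

# Explicit centering matrix: O(n^2) storage, fine for small n
one = np.ones((n, 1))
P = np.eye(n) - (one @ one.T) / n

# Direct O(n) centering: just subtract the mean
x_centered = x - x.mean()

assert np.allclose(P @ x, x_centered)  # identical result, no n x n matrix
assert np.allclose(P @ P, P)           # idempotent
assert np.allclose(P.T, P)             # symmetric
print("centering checks passed")
```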
Connection to Machine Learning
PCA. The centered data matrix $X_c = PX$ (apply $P$ to each column) ensures principal components capture variance, not mean. The covariance matrix $C = \frac{1}{n} X_c^\top X_c$ would be biased without centering.
Batch normalization. For a mini-batch $\{x_1, \ldots, x_B\}$, batch norm computes:
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \mu_B = \frac{1}{B} \sum_{i=1}^B x_i, \quad \sigma_B^2 = \frac{1}{B} \sum_{i=1}^B (x_i - \mu_B)^2
$$
The centering step $x_i - \mu_B$ is projection onto the zero-mean subspace.
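A minimal sketch of this centering step for a single scalar feature over a mini-batch (toy values; real batch norm applies this per feature and adds learned scale and shift parameters):

```python
import numpy as np

# Batch-norm-style normalization for one scalar feature over a mini-batch
x = np.array([2.0, -1.0, 3.0, 0.0])  # illustrative batch values
eps = 1e-5

mu = x.mean()
x_hat = (x - mu) / np.sqrt(x.var() + eps)

# The centering step is projection onto the zero-mean subspace:
assert np.isclose((x - mu).sum(), 0.0)
# After full normalization the batch has ~zero mean and ~unit variance
assert np.isclose(x_hat.mean(), 0.0)
print("batch-norm centering verified")
```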
Residuals in regression. In ordinary least squares, residuals $r = y - \hat{y} = (I - X(X^\top X)^{-1} X^\top)y$ are projections onto the orthogonal complement of $\text{col}(X)$. If $X$ includes a column of ones (intercept), residuals automatically have zero mean.
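The zero-mean-residual fact is easy to demonstrate on synthetic data: with a column of ones in the design matrix, the residuals are orthogonal to that column, so their mean vanishes. A sketch (data is made up):

```python
import numpy as np

# OLS with an intercept column: residuals automatically have zero mean
rng = np.random.default_rng(1)
n = 20
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=n)  # synthetic linear data

X = np.column_stack([np.ones(n), x])          # include column of ones
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Residuals are orthogonal to col(X), in particular to the ones column,
# so 1^T r = 0 and the residual mean vanishes (up to float error)
assert abs(residuals.mean()) < 1e-10
print("zero-mean residuals verified")
```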
Connection to Linear Algebra Theory
Projection theorem. Every vector $x \in \mathbb{R}^n$ can be uniquely decomposed as $x = x_\parallel + x_\perp$ where $x_\parallel \in S$ and $x_\perp \in S^\perp$. For $S = \{x : \mathbf{1}^\top x = 0\}$, we have:
$$
x_\parallel = Px = x - \bar{x} \mathbf{1}, \quad x_\perp = (I-P)x = \bar{x} \mathbf{1}
$$
Rank of projection matrix. $P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ has rank $n-1$ because:
$\text{null}(P) = \text{span}\{\mathbf{1}\}$ (1D subspace).
By rank-nullity theorem, $\text{rank}(P) + \dim(\text{null}(P)) = n$, so $\text{rank}(P) = n-1$.
Eigenvalues of $P$. $P$ has eigenvalue $\lambda = 1$ with multiplicity $n-1$ (eigenvectors are all vectors orthogonal to $\mathbf{1}$, i.e., the subspace $S$) and $\lambda = 0$ with multiplicity 1 (eigenvector $\mathbf{1}$).
This confirms $P$ is a projection: eigenvalues are 0 or 1, characteristic of projection matrices.
Relation to covariance. The sample covariance matrix is:
$$
C = \frac{1}{n} X_c^\top X_c = \frac{1}{n} (PX)^\top (PX) = \frac{1}{n} X^\top P^\top P X = \frac{1}{n} X^\top P X
$$
using $P^\top = P$ and $P^2 = P$.
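This identity can be checked numerically: apply $P$ inside the product and compare against the covariance of explicitly centered columns (random data, arbitrary sizes):

```python
import numpy as np

# Verify C = (1/n) X^T P X matches the covariance of centered data
rng = np.random.default_rng(2)
n, d = 8, 3
X = rng.normal(size=(n, d))

one = np.ones((n, 1))
P = np.eye(n) - (one @ one.T) / n

C_via_P = (X.T @ P @ X) / n
Xc = X - X.mean(axis=0)      # subtract each column's mean
C_direct = (Xc.T @ Xc) / n

assert np.allclose(C_via_P, C_direct)
print("covariance identities agree")
```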
Pedagogical Significance
Concrete example of a subspace. Students often learn subspaces abstractly (“closed under addition and scaling”). This example gives a geometric and algebraic definition: $S = \{x : \mathbf{1}^\top x = 0\}$ (algebraic) is an $(n-1)$-dimensional hyperplane through the origin (geometric).
Projection as matrix multiplication. Projecting $x$ onto $S$ is simply $x_{\text{proj}} = Px$. This demystifies projections (often introduced with complicated formulas) by showing they’re linear maps.
Foundation for PCA. Understanding centering is essential before learning PCA. Many textbooks jump to “compute eigenvalues of $X^\top X$” without explaining why $X$ must be centered.
Computational perspective. Explicitly forming $P$ (storing $n^2$ entries) is wasteful. Implementing $Px$ as $x - \bar{x} \mathbf{1}$ (computing mean, subtracting) is much faster ($O(n)$ vs. $O(n^2)$).
References
Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Introduced least squares with mean-centering for orbit determination.
Hotelling, H. (1933). “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology, 24(6), 417–441. Formalized PCA (assumes centered data).
Ioffe, S., & Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML 2015. Introduced batch normalization (centering + scaling layer activations).
Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press. Chapter 4 covers projection matrices and least squares.
Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.). Cambridge University Press. Section 2.5 covers idempotent matrices (projections).
Problem. Project $x$ onto $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ (zero-mean subspace).
Solution (math).
$S$ is a subspace (null space of $\mathbf{1}^\top$). The projection matrix is:
$$
P = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top
$$
Applying $P$ to $x$ gives:
$$
x_{\text{proj}} = Px = x - \frac{1}{n} (\mathbf{1}^\top x) \mathbf{1} = x - \bar{x} \mathbf{1}
$$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the mean of $x$. Verification: $\mathbf{1}^\top x_{\text{proj}} = \mathbf{1}^\top (x - \bar{x} \mathbf{1}) = \mathbf{1}^\top x - n \bar{x} = 0$.
Solution (Python).
import numpy as np
# Define vector x in R^5
n = 5
x = np.array([3., 1., 0., -2., 4.])
# Centering matrix P = I - (1/n) * 1 * 1^T
I = np.eye(n)
one = np.ones((n, 1))
P = I - (1 / n) * (one @ one.T)
# Project x onto zero-mean subspace
x_proj = P @ x
print(f"Original x: {x}")
print(f"Mean of x: {x.mean():.2f}")
print(f"Projected x_proj: {x_proj}")
print(f"Mean of x_proj ~ 0: {np.isclose(x_proj.mean(), 0.0)}")
print(f"Verification: 1^T @ x_proj ~ 0: {np.isclose((one.T @ x_proj)[0], 0.0)}")
Output:
Original x: [ 3.  1.  0. -2.  4.]
Mean of x: 1.20
Projected x_proj: [ 1.8 -0.2 -1.2 -3.2  2.8]
Mean of x_proj ~ 0: True
Verification: 1^T @ x_proj ~ 0: True
Worked Example 4: Bias trick (affine → linear)
Introduction
Most machine learning models use affine transformations: $f(x) = Wx + b$ where $W$ is a weight matrix and $b$ is a bias vector. Affine maps are not linear (they don’t map zero to zero if $b \neq 0$), but there’s an elegant trick: augment the input with a constant 1, turning $f(x) = Wx + b$ into a purely linear map $f(x') = W' x'$ in a higher-dimensional space.
This “bias trick” (also called “homogeneous coordinates”) is ubiquitous in ML: neural network layers, logistic regression, SVMs, and computer graphics all use it. It simplifies implementation (one matrix multiply instead of multiply + add) and unifies the treatment of weights and biases.
Purpose
Unify affine and linear maps: Convert $f(x) = Wx + b$ to $f(x') = W'x'$ where $x' = [x; 1]$ (augmented input) and $W' = [W \,|\, b]$ (concatenated weight matrix and bias).
Simplify backpropagation: Gradients w.r.t. $W'$ handle both weights and biases uniformly.
Connect to projective geometry: Homogeneous coordinates in computer graphics use the same augmentation.
Importance
Neural network implementation. Every linear layer $h = Wx + b$ can be written as $h = W'x'$ where $x' = [x; 1]$ and $W' = [W \,|\, b]$. Many frameworks (PyTorch, TensorFlow) handle biases separately for efficiency, but conceptually this augmentation clarifies the math.
Logistic regression. The decision boundary $w^\top x + b = 0$ becomes $w'^\top x' = 0$ in augmented space, a linear classifier (hyperplane through the origin in $\mathbb{R}^{d+1}$).
Conditioning and regularization. Regularizing $\|w\|_2^2$ without penalizing $b$ (common practice) is harder to express if $w$ and $b$ are combined. Keeping them separate maintains flexibility, but the augmentation perspective clarifies that they live in different subspaces.
What This Example Demonstrates
Affine maps become linear in augmented space: $f(x) = Wx + b$ (affine in $\mathbb{R}^d$) equals $f(x') = W'x'$ (linear in $\mathbb{R}^{d+1}$).
Augmentation preserves structure: Adding a constant 1 extends the input space without losing information.
Numerical verification: Compute both $Wx + b$ and $W'x'$, verify they’re identical.
Background
Affine vs. linear. A map $f: \mathbb{R}^d \to \mathbb{R}^m$ is:
Linear if $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$ for all $x, y, \alpha, \beta$. Equivalently, $f(x) = Ax$ for some matrix $A$.
Affine if $f(x) = Ax + b$ for some matrix $A$ and vector $b$. Affine maps preserve affine combinations (weighted averages with weights summing to 1) but not arbitrary linear combinations.
Homogeneous coordinates. In computer graphics, 3D points $(x, y, z)$ are represented as 4-vectors $(x, y, z, 1)$ to handle translations uniformly. The last coordinate acts as a “scaling factor” (1 for ordinary points, 0 for vectors). This is exactly the bias trick.
Historical Context: Homogeneous coordinates were introduced by August Ferdinand Möbius (1827) for projective geometry. Their use in ML is a modern application of this classical idea.
Connection to Machine Learning
Deep neural networks. Each layer computes $h_{l+1} = \sigma(W_l h_l + b_l)$ where $\sigma$ is a nonlinearity (ReLU, sigmoid). The affine transformation $W_l h_l + b_l$ can be written as $W'_l h'_l$ with $h'_l = [h_l; 1]$.
Batch processing. For a mini-batch $X \in \mathbb{R}^{B \times d}$ (rows are examples), the transformation $Y = XW^\top + \mathbf{1} b^\top$ (broadcasting bias) becomes $Y = X' W'^\top$ where $X' = [X \,|\, \mathbf{1}]$ (augmented batch) and $W' = [W \,|\, b]$.
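The batch version can be verified in a few lines: augment the batch with a column of ones, concatenate the bias onto the weight matrix, and compare against the broadcast form (sizes and values are arbitrary):

```python
import numpy as np

# Batch form of the bias trick: X W^T + b  vs  [X | 1][W | b]^T
rng = np.random.default_rng(3)
B, d, m = 4, 3, 2
X = rng.normal(size=(B, d))
W = rng.normal(size=(m, d))
b = rng.normal(size=m)

Y_affine = X @ W.T + b                   # bias broadcast across the batch

X_aug = np.hstack([X, np.ones((B, 1))])  # X' = [X | 1]
W_aug = np.hstack([W, b[:, None]])       # W' = [W | b]
Y_linear = X_aug @ W_aug.T

assert np.allclose(Y_affine, Y_linear)
print("batch bias trick verified")
```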
Regularization subtlety. Ridge regression penalizes $\|w\|_2^2$ but not $b$ (bias is unregularized). If we augment and use $w' = [w; b]$, regularizing $\|w'\|_2^2$ would incorrectly penalize the bias. This is why most implementations keep $w$ and $b$ separate despite the conceptual elegance of augmentation.
References
Möbius, A. F. (1827). Der barycentrische Calcul. Introduced homogeneous coordinates for projective geometry.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 6.1: “Feedforward Networks” (linear layers with biases).
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Section 3.1: “Linear Regression” (discusses augmentation for intercept terms).
Problem. Rewrite $f(x) = Wx + b$ as a linear map in augmented space.
Solution (math).
Define the augmented input $x' \in \mathbb{R}^{d+1}$:
$$
x' = \begin{bmatrix} x \\ 1 \end{bmatrix}
$$
and the augmented weight matrix $W' \in \mathbb{R}^{m \times (d+1)}$:
$$
W' = \begin{bmatrix} W & b \end{bmatrix}
$$
Then:
$$
f(x) = Wx + b = W' x' = \begin{bmatrix} W & b \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} = Wx + b \cdot 1
$$
This is a linear map in $\mathbb{R}^{d+1}$ (no bias term needed).
Solution (Python).
import numpy as np
# Define weight matrix W (2x2) and bias vector b (2,)
W = np.array([[2., 1.],
[-1., 3.]])
b = np.array([0.5, -2.])
# Input vector x (2,)
x = np.array([1., 4.])
# Standard affine transformation: Wx + b
y_affine = W @ x + b
# Augmented transformation: W' @ x'
# W' = [W | b] (concatenate b as a column)
W_aug = np.c_[W, b] # Shape: (2, 3)
# x' = [x; 1] (append 1)
x_aug = np.r_[x, 1.] # Shape: (3,)
# Linear transformation in augmented space
y_linear = W_aug @ x_aug
print(f"W =\n{W}")
print(f"b = {b}")
print(f"x = {x}\n")
print(f"Affine: Wx + b = {y_affine}")
print(f"\nAugmented W' =\n{W_aug}")
print(f"Augmented x' = {x_aug}")
print(f"Linear: W'x' = {y_linear}\n")
print(f"Are they equal? {np.allclose(y_affine, y_linear)}")
Output:
W =
[[ 2. 1.]
[-1. 3.]]
b = [ 0.5 -2. ]
x = [1. 4.]
Affine: Wx + b = [ 6.5 9. ]
Augmented W' =
[[ 2. 1. 0.5]
[-1. 3. -2. ]]
Augmented x' = [1. 4. 1.]
Linear: W'x' = [ 6.5 9. ]
Are they equal? True
Worked Example 5: Attention outputs lie in span(V)
Introduction
The attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017) is the core operation in Transformers, powering modern LLMs (GPT, BERT, LLaMA), vision models (ViT), and multimodal systems (CLIP, Flamingo). Attention computes weighted sums of value vectors $V$, with weights determined by query-key similarities. Crucially, attention outputs are constrained to lie in $\text{span}(V)$ — they cannot “invent” information outside the value subspace.
This example demonstrates that attention is a linear combination operation: the output $z = \sum_{i=1}^n \alpha_i v_i$ (where $\alpha_i$ are softmax-normalized attention scores) always lies in $\text{span}\{v_1, \ldots, v_n\}$, a subspace of $\mathbb{R}^{d_v}$.
Purpose
Understand attention as weighted averaging: Output = $\sum_{i=1}^n \alpha_i v_i$ with $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$ (convex combination).
Identify the constraint: Attention outputs lie in $\text{span}(V)$, limiting expressiveness to the value subspace.
Connect to ML: Multi-head attention learns multiple subspaces (heads) in parallel, increasing capacity.
Importance
Transformer architecture. Attention is the primary operation in Transformers, replacing recurrence (RNNs) and convolution (CNNs). Each layer computes:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$
where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of inputs. The output is a weighted sum of value vectors, constrained to $\text{span}(V)$.
Multi-head attention. Splitting into $h$ heads projects to $h$ different subspaces:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O
$$
where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$. Each head operates in a different $(d_v / h)$-dimensional subspace.
Expressiveness vs. efficiency. Attention can only combine existing values (linear combinations), not generate new directions. This is a feature, not a bug: it provides inductive bias (outputs depend on inputs) and computational efficiency (matrix multiplications).
What This Example Demonstrates
Attention is linear combination: For softmax weights $\alpha \in \mathbb{R}^{1 \times n}$ and value matrix $V \in \mathbb{R}^{n \times d_v}$, the output $z = \alpha V \in \mathbb{R}^{1 \times d_v}$ is a linear combination of rows of $V$.
Outputs lie in subspace: $z \in \text{span}\{v_1, \ldots, v_n\}$ where $v_i \in \mathbb{R}^{d_v}$ are rows of $V$.
Convex combination: Since $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$, $z$ lies in the convex hull of $\{v_1, \ldots, v_n\}$.
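These three facts can be checked numerically. The sketch below uses a single query and a deliberately rank-2 value matrix (all sizes and values are toy choices), so the row space of $V$ is a proper 2D subspace of $\mathbb{R}^3$; the attention output nonetheless fits exactly in that subspace.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 3

# Deliberately rank-2 value matrix: its rows span a 2D subspace of R^3
V = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d_v))
q = rng.normal(size=d_k)
K = rng.normal(size=(n, d_k))

# Single-query scaled dot-product attention
alpha = softmax(K @ q / np.sqrt(d_k))  # nonnegative weights, sum to 1
z = alpha @ V                          # linear combination of rows of V

assert np.isclose(alpha.sum(), 1.0) and np.all(alpha >= 0)
# z must lie in the row space of V: solve V^T c = z and check exact fit
coef, *_ = np.linalg.lstsq(V.T, z, rcond=None)
assert np.allclose(V.T @ coef, z)
print("attention output lies in span of the value vectors")
```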
Background
Attention mechanism history:
Bahdanau attention (2015): Introduced for neural machine translation. Computes alignment scores between encoder hidden states and decoder state, uses weighted sum for context.
Scaled dot-product attention (Vaswani 2017): Simplified to $\text{softmax}(QK^\top / \sqrt{d_k})V$, enabling parallelization and scaling.
Multi-head attention (Vaswani 2017): Projects to multiple subspaces (heads), learns different relationships.
Mathematical interpretation: Attention is a content-based addressing mechanism: queries “look up” relevant keys, retrieve corresponding values. The softmax ensures smooth interpolation (differentiable, convex weights).
Relation to kernels: Attention can be viewed as a kernel method where $K(q, k) = \exp(q^\top k / \sqrt{d_k})$ (unnormalized softmax). The output is a weighted sum in the kernel space (span of values).
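A quick numerical check of the kernel view: normalizing the unnormalized kernel values $K(q, k_i) = \exp(q^\top k_i / \sqrt{d_k})$ reproduces the softmax attention weights exactly. The dimensions and random vectors below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 4
q = rng.standard_normal(d_k)
K = rng.standard_normal((3, d_k))    # 3 keys as rows

# Unnormalized kernel values K(q, k_i) = exp(q^T k_i / sqrt(d_k))
kernel = np.exp(K @ q / np.sqrt(d_k))

# Normalizing the kernel values recovers the softmax attention weights
alpha = kernel / kernel.sum()
scores = K @ q / np.sqrt(d_k)
sm = np.exp(scores) / np.exp(scores).sum()
print(np.allclose(alpha, sm))        # True
```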
Connection to Machine Learning
Self-attention in Transformers. For input sequence $X \in \mathbb{R}^{n \times d}$, self-attention computes $Q = XW^Q$, $K = XW^K$, $V = XW^V$. Each output token is a linear combination of all input tokens’ values, weighted by query-key similarities.
Cross-attention in encoder-decoder models. Decoder queries attend to encoder keys/values. Example: in machine translation, the decoder (target language) attends to encoder representations (source language). Outputs lie in the span of encoder values.
Positional embeddings. Attention treats keys and values as an unordered set: each output is a weighted sum of values, and permuting the key/value pairs leaves that sum unchanged. Position information must therefore be injected via positional encodings $p_i \in \mathbb{R}^d$ added to the embeddings. This augments the value subspace.
Computational complexity. Attention requires $O(n^2 d)$ operations: forming the $n \times n$ score matrix $QK^\top$ costs $O(n^2 d_k)$, and multiplying it by the $n \times d_v$ value matrix costs $O(n^2 d_v)$. For long sequences (e.g., books, genomic data), this quadratic cost becomes prohibitive, motivating sparse attention and low-rank approximations.
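The order-independence of attention outputs is easy to verify numerically: shuffling the key/value pairs leaves a fixed query's output unchanged. This sketch uses a single query and hypothetical random keys and values.

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention: softmax(K q / sqrt(d_k)) weights over rows of V."""
    s = K @ q / np.sqrt(q.shape[0])
    a = np.exp(s - s.max())
    a /= a.sum()
    return a @ V

rng = np.random.default_rng(3)
q = rng.standard_normal(4)
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 6))

perm = rng.permutation(5)
z1 = attend(q, K, V)
z2 = attend(q, K[perm], V[perm])   # shuffle key/value pairs together
print(np.allclose(z1, z2))         # True: the weighted sum ignores input order
```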
Connection to Linear Algebra Theory
Convex combinations and convex hulls. If $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$, then $z = \sum_i \alpha_i v_i$ lies in the convex hull $\text{conv}\{v_1, \ldots, v_n\}$, the smallest convex set containing all $v_i$. Geometrically, this is a polytope (bounded region) in $\mathbb{R}^{d_v}$.
Rank of attention output. The attention output matrix $Z = \text{softmax}(QK^\top / \sqrt{d_k}) V \in \mathbb{R}^{n \times d_v}$ has $\text{rank}(Z) \leq \min(n, \text{rank}(V))$. If $V$ is low-rank, attention cannot increase rank (outputs lie in a low-dimensional subspace).
Selection and averaging. If all attention weight concentrates on a single token ($\alpha_i = 1$, $\alpha_j = 0$ for $j \neq i$), the output equals $v_i$ exactly (hard selection of one value vector). Uniform weights ($\alpha_i = 1/n$) give the centroid $\bar{v} = \frac{1}{n} \sum_i v_i$ of the value cloud. Softmax attention interpolates smoothly between these two extremes.
Orthogonality and attention scores. High query-key similarity $q^\top k$ indicates alignment (small angle between $q$ and $k$). Orthogonal query-key pairs ($q^\top k = 0$) receive low attention weight. This is the same geometric intuition as inner products measuring similarity.
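The claims above can be checked with small NumPy experiments: one-hot weights select a single value vector, uniform weights give the centroid, and attention applied to a rank-1 value matrix cannot raise its rank. The matrices are random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
V = rng.standard_normal((5, 3))          # 5 value vectors in R^3

# One-hot weights select a single value vector exactly
onehot = np.eye(5)[2]
print(np.allclose(onehot @ V, V[2]))     # True

# Uniform weights give the centroid of the value cloud
uniform = np.full(5, 0.2)
print(np.allclose(uniform @ V, V.mean(axis=0)))   # True

# Rank bound: convex attention weights cannot raise the rank of V
A = rng.random((7, 5))
A /= A.sum(axis=1, keepdims=True)        # 7 rows of convex weights
V_lowrank = np.outer(rng.standard_normal(5), rng.standard_normal(3))  # rank 1
Z = A @ V_lowrank
print(np.linalg.matrix_rank(Z) <= np.linalg.matrix_rank(V_lowrank))   # True
```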
Pedagogical Significance
Concrete example of span. Attention outputs visibly demonstrate that linear combinations $\sum_i \alpha_i v_i$ lie in $\text{span}\{v_1, \ldots, v_n\}$. Students can compute actual numbers and verify the output is expressible as a weighted sum.
Geometric visualization. For 2D or 3D value vectors, plot $\{v_1, \ldots, v_n\}$ and the attention output $z$. $z$ lies inside the convex hull (polytope formed by connecting $v_i$).
Foundation for Transformers. Understanding attention as linear combination clarifies:
Why attention is permutation-invariant: Linear combinations don’t depend on order.
Why multi-head attention helps: Different heads explore different subspaces.
Limitations: Attention can only interpolate existing values, not generate new directions (nonlinearity comes from layer stacking and feedforward networks).
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention is All You Need.” NeurIPS 2017. Introduced Transformer architecture with scaled dot-product attention and multi-head attention.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR 2015. First attention mechanism for seq2seq models.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019. Bidirectional self-attention for masked language modeling.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ICLR 2021. Vision Transformers (ViT) apply attention to image patches.
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). “Efficient Transformers: A Survey.” arXiv:2009.06732. Reviews sparse attention, low-rank approximations, and other efficiency techniques.
Problem. Show attention output is in the span of value vectors.
Solution (math).
For value matrix $V \in \mathbb{R}^{n \times d_v}$ (rows $v_1, \ldots, v_n$) and attention weights $\alpha \in \mathbb{R}^{1 \times n}$ (from softmax, so $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$), the attention output is:
$$
z = \alpha V = \sum_{i=1}^n \alpha_i v_i \in \mathbb{R}^{1 \times d_v}
$$
This is a convex combination of rows of $V$, hence $z \in \text{span}\{v_1, \ldots, v_n\} \subseteq \mathbb{R}^{d_v}$.
Solution (Python).
import numpy as np
# Define value matrix V (3 tokens, 2-dimensional values)
V = np.array([[1., 0.],   # v_1
              [0., 1.],   # v_2
              [2., 2.]])  # v_3
# Define attention weights (from softmax, sum to 1)
A = np.array([[0.2, 0.5, 0.3]]) # Shape: (1, 3)
# Compute attention output z = A @ V
z = A @ V # Shape: (1, 2)
print(f"Value matrix V (rows are values):\n{V}\n")
print(f"Attention weights A = {A[0]} (sum = {A.sum():.1f})\n")
print(f"Attention output z = A @ V = {z[0]}")
print(f"\nVerification as linear combination:")
print(f"z = {A[0,0]}*v_1 + {A[0,1]}*v_2 + {A[0,2]}*v_3")
print(f" = {A[0,0]}*{V[0]} + {A[0,1]}*{V[1]} + {A[0,2]}*{V[2]}")
print(f" = {A[0,0]*V[0] + A[0,1]*V[1] + A[0,2]*V[2]}")
print(f"\nz lies in span(V): True (z is a linear combination of rows of V)")
Output:
Value matrix V (rows are values):
[[1. 0.]
[0. 1.]
[2. 2.]]
Attention weights A = [0.2 0.5 0.3] (sum = 1.0)
Attention output z = A @ V = [0.8 1.1]
Verification as linear combination:
z = 0.2*v_1 + 0.5*v_2 + 0.3*v_3
= 0.2*[1. 0.] + 0.5*[0. 1.] + 0.3*[2. 2.]
= [0.8 1.1]
z lies in span(V): True (z is a linear combination of rows of V)
Comments