## Introduction
Inner products and norms provide the geometry for data and models:
- Similarity via inner products $\langle x, y\rangle$ and cosine $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$
- Size and distance via norms $\lVert x\rVert$ and induced metrics $d(x,y) = \lVert x-y\rVert$
- Orthogonality ($\langle x, y\rangle = 0$) and projections onto subspaces
- Positive semidefinite (PSD) Gram matrices and kernels driving SVMs/GPs
- Stability and regularization via $\ell_2$ (Ridge) and $\ell_1$ (Lasso) penalties
- Scaled dot-product attention, which computes many inner products and stabilizes them with a normalization factor $1/\sqrt{d}$
## Important ideas
**Inner product axioms and induced norms.** An inner product $\langle x, y\rangle$ on $\mathbb{R}^d$ is symmetric, bilinear, and positive definite; the induced norm is $\lVert x\rVert = \sqrt{\langle x, x\rangle}$.
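As a quick numerical sanity check (a sketch assuming NumPy; the matrix `M` is an arbitrary SPD matrix constructed for illustration), the weighted inner product $\langle x, y\rangle_M = x^\top M y$ satisfies all three axioms:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Build an SPD matrix M = A^T A + I, so <x, y>_M = x^T M y is a valid inner product.
A = rng.standard_normal((d, d))
M = A.T @ A + np.eye(d)

def ip(x, y):
    return x @ M @ y

x, y, z = rng.standard_normal((3, d))
a, b = 2.0, -3.0

assert np.isclose(ip(x, y), ip(y, x))                                 # symmetry
assert np.isclose(ip(a * x + b * y, z), a * ip(x, z) + b * ip(y, z))  # linearity in the first slot
assert ip(x, x) > 0                                                   # positive definiteness (x != 0)

norm_x = np.sqrt(ip(x, x))                                            # induced norm
```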
**Cauchy–Schwarz and cosine similarity.** $\big|\langle x, y\rangle\big| \le \lVert x\rVert\,\lVert y\rVert$, with equality iff $x$ and $y$ are linearly dependent. This bound makes $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$ a well-defined value in $[-1, 1]$.
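A minimal check of the inequality and the resulting angle (assuming NumPy; random vectors chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 5))

lhs = abs(x @ y)
rhs = np.linalg.norm(x) * np.linalg.norm(y)
assert lhs <= rhs + 1e-12                        # Cauchy–Schwarz

cos_theta = (x @ y) / rhs
assert -1.0 <= cos_theta <= 1.0                  # so arccos is well defined
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # angle between x and y
```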
**Triangle inequality and Minkowski/Hölder.** For $p\in[1,\infty]$, $\lVert x+y\rVert_p \le \lVert x\rVert_p + \lVert y\rVert_p$ (Minkowski); Hölder's inequality gives $|\langle x, y\rangle| \le \lVert x\rVert_p\,\lVert y\rVert_q$ for conjugate exponents with $1/p+1/q=1$.
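Both inequalities are easy to exercise numerically (a sketch assuming NumPy; the exponent pair $p=3$, $q=3/2$ is one arbitrary conjugate choice):

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.standard_normal((2, 6))

# Minkowski: the triangle inequality holds for every l_p norm with p >= 1.
for p in [1, 1.5, 2, 3, np.inf]:
    assert np.linalg.norm(x + y, p) <= np.linalg.norm(x, p) + np.linalg.norm(y, p) + 1e-12

# Hölder with conjugate exponents p = 3, q = 3/2 (1/p + 1/q = 1).
p, q = 3.0, 1.5
assert abs(x @ y) <= np.linalg.norm(x, p) * np.linalg.norm(y, q) + 1e-12
```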
**Dual norms and bounds.** The dual norm is $\lVert z\rVert_* = \sup_{\lVert x\rVert\le 1} \langle z, x\rangle$; e.g., the dual of $\ell_1$ is $\ell_\infty$, and $\ell_2$ is its own dual.
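The supremum over the $\ell_1$ ball is attained at a signed standard basis vector, which is why the dual of $\ell_1$ is $\ell_\infty$; a small demonstration (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(5)

# Extreme points of the l1 ball are +/- e_i, so the sup of <z, x> over it is max_i |z_i|.
candidates = np.concatenate([np.eye(5), -np.eye(5)])
dual_l1 = max(c @ z for c in candidates)
assert np.isclose(dual_l1, np.linalg.norm(z, np.inf))

# Dual of l2 is l2: the maximizer over the l2 ball is z / ||z||_2.
dual_l2 = (z / np.linalg.norm(z)) @ z
assert np.isclose(dual_l2, np.linalg.norm(z))
```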
**Orthogonality, orthonormal bases, and projections.** If $U\in\mathbb{R}^{d\times k}$ has orthonormal columns, the orthogonal projector onto $\mathrm{range}(U)$ is $P = UU^\top$; $Px$ is the closest point of that subspace to $x$ in $\ell_2$ distance, so $P$ minimizes reconstruction error.
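A sketch of the projector and its defining properties (assuming NumPy; orthonormal columns obtained via QR of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # U has orthonormal columns
P = U @ U.T                                       # orthogonal projector onto range(U)

assert np.allclose(P @ P, P)   # idempotent
assert np.allclose(P, P.T)     # symmetric

x = rng.standard_normal(d)
# The residual x - Px is orthogonal to the subspace, so Px is the closest point in range(U).
assert np.allclose(U.T @ (x - P @ x), 0)
```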
**Gram matrices, PSD, and kernels.** For a data matrix $X\in\mathbb{R}^{n\times d}$, $G=X X^\top$ has entries $G_{ij}=\langle x_i, x_j\rangle$ and is PSD. Kernel matrices generalize this to $K_{ij}=k(x_i,x_j)$.
**Mahalanobis norms.** For symmetric positive definite (SPD) $M\succ 0$, $\lVert x\rVert_M = \sqrt{x^\top M x}$ reweights the geometry (whitening, metric learning).
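The "whitening" view: with a Cholesky factorization $M = L L^\top$, the Mahalanobis norm is just the $\ell_2$ norm in transformed coordinates, $\lVert x\rVert_M = \lVert L^\top x\rVert_2$. A sketch assuming NumPy, with `M` an arbitrary SPD matrix built for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
A = rng.standard_normal((d, d))
M = A.T @ A + np.eye(d)           # SPD by construction

x = rng.standard_normal(d)
mah = np.sqrt(x @ M @ x)          # ||x||_M

# Cholesky: M = L L^T, so x^T M x = ||L^T x||_2^2.
L = np.linalg.cholesky(M)
assert np.isclose(mah, np.linalg.norm(L.T @ x))
```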
**Norm-induced stability.** Lipschitz constants, gradient clipping thresholds, and regularization costs all depend on the choice of norm.
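As one concrete instance, a minimal sketch of $\ell_2$-norm gradient clipping (the helper name `clip_by_norm` is illustrative, not from any particular library):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its l2 norm is at most max_norm (standard gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])                 # ||g||_2 = 5
clipped = clip_by_norm(g, 1.0)
assert np.isclose(np.linalg.norm(clipped), 1.0)                           # rescaled to the threshold
assert np.allclose(clip_by_norm(np.array([0.1, 0.2]), 1.0), [0.1, 0.2])  # small gradients pass through
```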
## Relevance to ML
- Similarity search: cosine similarity is the standard for embeddings (IR, recommendation, retrieval, metric learning).
- Regularization: $\ell_2$ (weight decay) controls scale; $\ell_1$ encourages sparsity.
- Optimization: gradient norms determine step sizes; clipping prevents exploding gradients.
- Kernels: SVMs and GPs rely on PSD Gram matrices of inner products.
- Attention: scaled dot-products stabilize softmax logits as dimension grows.
- PCA/covariance: the variance of centered data along a unit direction is a squared $\ell_2$ norm of projections; orthogonal projections minimize $\ell_2$ reconstruction error.
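The attention-scaling point can be made concrete: for i.i.d. unit-variance query and key entries, the raw dot-product logit has variance $d$, while dividing by $\sqrt{d}$ keeps it near 1 regardless of dimension. A Monte Carlo sketch assuming NumPy (sample sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
for d in [16, 256]:
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    logits = np.einsum("nd,nd->n", q, k)        # n raw dot-product logits
    # Var(q . k) = d for i.i.d. unit-variance entries, so raw logits blow up with d ...
    assert abs(logits.var() - d) < 0.2 * d
    # ... while the 1/sqrt(d) scaling keeps variance near 1 for every d.
    scaled = logits / np.sqrt(d)
    assert abs(scaled.var() - 1.0) < 0.2
```

This is exactly why large unscaled logits would push the softmax into near-one-hot, vanishing-gradient territory as $d$ grows.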
## Algorithmic development (select milestones)
- 1850s–1900s: Euclidean geometry formalized; Cauchy–Schwarz inequality.
- 1909: Mercer’s theorem (PSD kernels); foundations of kernel methods.
- 1950: Aronszajn formalizes RKHS; inner products in function spaces.
- 1960s–1970s: robust norms (Huber); convex analysis; optimization bounds.
- 1995: SVMs (Cortes–Vapnik) with the kernel trick.
- 2013–2015: Word2Vec and GloVe popularize cosine similarity on embeddings.
- 2015–2016: BatchNorm/LayerNorm normalize activations (variance/norm control).
- 2017: scaled dot-product attention (Transformers) stabilizes inner-product logits.
- 2020: contrastive learning (SimCLR) uses normalized cosine objectives.
## Definitions
- Inner product: $\langle x, y\rangle = x^\top y$ (standard), or weighted $\langle x, y\rangle_M = x^\top M y$ with $M\succ 0$.
- Induced norm: $\lVert x\rVert = \sqrt{\langle x, x\rangle}$; $\ell_p$ norms: $\lVert x\rVert_1=\sum_i|x_i|$, $\lVert x\rVert_2=\sqrt{\sum_i x_i^2}$, $\lVert x\rVert_\infty=\max_i |x_i|$.
- Cosine similarity: $\cos\theta(x,y) = \dfrac{\langle x,y\rangle}{\lVert x\rVert\,\lVert y\rVert}$.
- Orthogonality: $\langle x, y\rangle = 0$; orthonormal set: $\langle u_i, u_j\rangle = \delta_{ij}$.
- Gram matrix: $G_{ij}=\langle x_i, x_j\rangle$; PSD: $z^\top G z \ge 0$ for all $z$.
- Kernel: $k(x,y)=\langle \phi(x), \phi(y)\rangle$; $K_{ij}=k(x_i,x_j)$ is PSD.
- Mahalanobis norm: $\lVert x\rVert_M = \sqrt{x^\top M x}$ with $M\succ 0$.
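The definitions above can be exercised on a hand-checkable example (assuming NumPy; the vectors are chosen so the norms come out to round numbers):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
assert np.isclose(np.linalg.norm(x, 1), 7.0)        # l1: sum of absolute values
assert np.isclose(np.linalg.norm(x, 2), 5.0)        # l2: sqrt(9 + 16)
assert np.isclose(np.linalg.norm(x, np.inf), 4.0)   # l_inf: max absolute entry

y = np.array([1.0, 2.0, 2.0])                       # ||y||_2 = 3
cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos, -1 / 3)                      # <x, y> = -5, norms 5 and 3
```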