## Introduction
Inner products and norms provide the geometry for data and models:
- Similarity via inner products $\langle x, y\rangle$ and cosine $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$
- Size and distance via norms $\lVert x\rVert$ and induced metrics $d(x,y) = \lVert x-y\rVert$
- Orthogonality ($\langle x, y\rangle = 0$) and projections onto subspaces
- Positive semidefinite (PSD) Gram matrices and kernels driving SVMs/GPs
- Stability and regularization via $\ell_2$ (Ridge) and $\ell_1$ (Lasso) penalties
- Scaled dot-product attention, which computes many inner products and stabilizes them with a normalization factor $1/\sqrt{d}$
## Important ideas
**Inner product axioms and induced norms.** An inner product $\langle x, y\rangle$ on $\mathbb{R}^d$ is symmetric, bilinear, and positive definite; the induced norm is $\lVert x\rVert = \sqrt{\langle x, x\rangle}$.
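As a quick numerical sanity check (a sketch assuming NumPy; the matrix `M` is an arbitrary SPD matrix constructed for illustration), the weighted inner product $\langle x, y\rangle_M = x^\top M y$ satisfies all three axioms:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Build an SPD matrix M = A^T A + I, so <x, y>_M = x^T M y is a valid inner product.
A = rng.standard_normal((d, d))
M = A.T @ A + np.eye(d)

def ip(x, y):
    return x @ M @ y

x, y, z = rng.standard_normal((3, d))
a, b = 2.0, -3.0

assert np.isclose(ip(x, y), ip(y, x))                                 # symmetry
assert np.isclose(ip(a * x + b * y, z), a * ip(x, z) + b * ip(y, z))  # linearity in the first slot
assert ip(x, x) > 0                                                   # positive definiteness (x != 0)

norm_x = np.sqrt(ip(x, x))                                            # induced norm
```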
**Cauchy–Schwarz and cosine similarity.** $\big|\langle x, y\rangle\big| \le \lVert x\rVert\,\lVert y\rVert$, with equality iff $x$ and $y$ are linearly dependent. This bound makes $\cos\theta = \langle x, y\rangle/(\lVert x\rVert\,\lVert y\rVert)$ a well-defined value in $[-1, 1]$.
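A minimal check of the inequality and the resulting angle (assuming NumPy; random vectors chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 5))

lhs = abs(x @ y)
rhs = np.linalg.norm(x) * np.linalg.norm(y)
assert lhs <= rhs + 1e-12                        # Cauchy–Schwarz

cos_theta = (x @ y) / rhs
assert -1.0 <= cos_theta <= 1.0                  # so arccos is well defined
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # angle between x and y
```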
**Triangle inequality and Minkowski/Hölder.** For $p\in[1,\infty]$, $\lVert x+y\rVert_p \le \lVert x\rVert_p + \lVert y\rVert_p$ (Minkowski); Hölder's inequality gives $|\langle x, y\rangle| \le \lVert x\rVert_p\,\lVert y\rVert_q$ for conjugate exponents with $1/p+1/q=1$.
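Both inequalities are easy to exercise numerically (a sketch assuming NumPy; the exponent pair $p=3$, $q=3/2$ is one arbitrary conjugate choice):

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.standard_normal((2, 6))

# Minkowski: the triangle inequality holds for every l_p norm with p >= 1.
for p in [1, 1.5, 2, 3, np.inf]:
    assert np.linalg.norm(x + y, p) <= np.linalg.norm(x, p) + np.linalg.norm(y, p) + 1e-12

# Hölder with conjugate exponents p = 3, q = 3/2 (1/p + 1/q = 1).
p, q = 3.0, 1.5
assert abs(x @ y) <= np.linalg.norm(x, p) * np.linalg.norm(y, q) + 1e-12
```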
**Dual norms and bounds.** The dual norm is $\lVert z\rVert_* = \sup_{\lVert x\rVert\le 1} \langle z, x\rangle$; e.g., the dual of $\ell_1$ is $\ell_\infty$, and $\ell_2$ is its own dual.
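The supremum over the $\ell_1$ ball is attained at a signed standard basis vector, which is why the dual of $\ell_1$ is $\ell_\infty$; a small demonstration (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(5)

# Extreme points of the l1 ball are +/- e_i, so the sup of <z, x> over it is max_i |z_i|.
candidates = np.concatenate([np.eye(5), -np.eye(5)])
dual_l1 = max(c @ z for c in candidates)
assert np.isclose(dual_l1, np.linalg.norm(z, np.inf))

# Dual of l2 is l2: the maximizer over the l2 ball is z / ||z||_2.
dual_l2 = (z / np.linalg.norm(z)) @ z
assert np.isclose(dual_l2, np.linalg.norm(z))
```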
**Orthogonality, orthonormal bases, and projections.** If $U\in\mathbb{R}^{d\times k}$ has orthonormal columns, the orthogonal projector onto $\mathrm{range}(U)$ is $P = UU^\top$; $Px$ is the closest point of that subspace to $x$ in $\ell_2$ distance, so $P$ minimizes reconstruction error.
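A sketch of the projector and its defining properties (assuming NumPy; orthonormal columns obtained via QR of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # U has orthonormal columns
P = U @ U.T                                       # orthogonal projector onto range(U)

assert np.allclose(P @ P, P)   # idempotent
assert np.allclose(P, P.T)     # symmetric

x = rng.standard_normal(d)
# The residual x - Px is orthogonal to the subspace, so Px is the closest point in range(U).
assert np.allclose(U.T @ (x - P @ x), 0)
```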
**Gram matrices, PSD, and kernels.** For a data matrix $X\in\mathbb{R}^{n\times d}$, $G=X X^\top$ has entries $G_{ij}=\langle x_i, x_j\rangle$ and is PSD. Kernel matrices generalize this to $K_{ij}=k(x_i,x_j)$.
**Mahalanobis norms.** For symmetric positive definite (SPD) $M\succ 0$, $\lVert x\rVert_M = \sqrt{x^\top M x}$ reweights the geometry (whitening, metric learning).
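The "whitening" view: with a Cholesky factorization $M = L L^\top$, the Mahalanobis norm is just the $\ell_2$ norm in transformed coordinates, $\lVert x\rVert_M = \lVert L^\top x\rVert_2$. A sketch assuming NumPy, with `M` an arbitrary SPD matrix built for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
A = rng.standard_normal((d, d))
M = A.T @ A + np.eye(d)           # SPD by construction

x = rng.standard_normal(d)
mah = np.sqrt(x @ M @ x)          # ||x||_M

# Cholesky: M = L L^T, so x^T M x = ||L^T x||_2^2.
L = np.linalg.cholesky(M)
assert np.isclose(mah, np.linalg.norm(L.T @ x))
```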
**Norm-induced stability.** Lipschitz constants, gradient clipping thresholds, and regularization costs all depend on the choice of norm.
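As one concrete instance, a minimal sketch of $\ell_2$-norm gradient clipping (the helper name `clip_by_norm` is illustrative, not from any particular library):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its l2 norm is at most max_norm (standard gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])                 # ||g||_2 = 5
clipped = clip_by_norm(g, 1.0)
assert np.isclose(np.linalg.norm(clipped), 1.0)                           # rescaled to the threshold
assert np.allclose(clip_by_norm(np.array([0.1, 0.2]), 1.0), [0.1, 0.2])  # small gradients pass through
```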
## Relevance to ML
- Similarity search: cosine similarity is the standard for embeddings (IR, recommendation, retrieval, metric learning).
- Regularization: $\ell_2$ (weight decay) controls scale; $\ell_1$ encourages sparsity.
- Optimization: gradient norms determine step sizes; clipping prevents exploding gradients.
- Kernels: SVMs and GPs rely on PSD Gram matrices of inner products.
- Attention: scaled dot-products stabilize softmax logits as dimension grows.
- PCA/covariance: the variance of centered data along a unit direction is a squared $\ell_2$ norm of projections; orthogonal projections minimize $\ell_2$ reconstruction error.
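The attention-scaling point can be made concrete: for i.i.d. unit-variance query and key entries, the raw dot-product logit has variance $d$, while dividing by $\sqrt{d}$ keeps it near 1 regardless of dimension. A Monte Carlo sketch assuming NumPy (sample sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
for d in [16, 256]:
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    logits = np.einsum("nd,nd->n", q, k)        # n raw dot-product logits
    # Var(q . k) = d for i.i.d. unit-variance entries, so raw logits blow up with d ...
    assert abs(logits.var() - d) < 0.2 * d
    # ... while the 1/sqrt(d) scaling keeps variance near 1 for every d.
    scaled = logits / np.sqrt(d)
    assert abs(scaled.var() - 1.0) < 0.2
```

This is exactly why large unscaled logits would push the softmax into near-one-hot, vanishing-gradient territory as $d$ grows.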
## Algorithmic development (select milestones)
- 1850s–1900s: Euclidean geometry formalized; Cauchy–Schwarz inequality.
- 1909: Mercer’s theorem (PSD kernels); foundations of kernel methods.
- 1950: Aronszajn formalizes RKHS; inner products in function spaces.
- 1960s–1970s: robust norms (Huber); convex analysis; optimization bounds.
- 1995: SVMs (Cortes–Vapnik) with the kernel trick.
- 2013–2015: Word2Vec and GloVe popularize cosine similarity on embeddings.
- 2015–2016: BatchNorm/LayerNorm normalize activations (variance/norm control).
- 2017: scaled dot-product attention (Transformers) stabilizes inner-product logits.
- 2020: contrastive learning (SimCLR) uses normalized cosine objectives.
## Definitions
- Inner product: $\langle x, y\rangle = x^\top y$ (standard), or weighted $\langle x, y\rangle_M = x^\top M y$ with $M\succ 0$.
- Induced norm: $\lVert x\rVert = \sqrt{\langle x, x\rangle}$; $\ell_p$ norms: $\lVert x\rVert_1=\sum_i|x_i|$, $\lVert x\rVert_2=\sqrt{\sum_i x_i^2}$, $\lVert x\rVert_\infty=\max_i |x_i|$.
- Cosine similarity: $\cos\theta(x,y) = \dfrac{\langle x,y\rangle}{\lVert x\rVert\,\lVert y\rVert}$.
- Orthogonality: $\langle x, y\rangle = 0$; orthonormal set: $\langle u_i, u_j\rangle = \delta_{ij}$.
- Gram matrix: $G_{ij}=\langle x_i, x_j\rangle$; PSD: $z^\top G z \ge 0$ for all $z$.
- Kernel: $k(x,y)=\langle \phi(x), \phi(y)\rangle$; $K_{ij}=k(x_i,x_j)$ is PSD.
- Mahalanobis norm: $\lVert x\rVert_M = \sqrt{x^\top M x}$ with $M\succ 0$.
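The definitions above can be exercised on a hand-checkable example (assuming NumPy; the vectors are chosen so the norms come out to round numbers):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
assert np.isclose(np.linalg.norm(x, 1), 7.0)        # l1: sum of absolute values
assert np.isclose(np.linalg.norm(x, 2), 5.0)        # l2: sqrt(9 + 16)
assert np.isclose(np.linalg.norm(x, np.inf), 4.0)   # l_inf: max absolute entry

y = np.array([1.0, 2.0, 2.0])                       # ||y||_2 = 3
cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos, -1 / 3)                      # <x, y> = -5, norms 5 and 3
```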