Chapter 3
Basis & Dimension

Introduction#

Basis and dimension are the language for measuring and reasoning about vector spaces. A basis is a minimal spanning set—a collection of linearly independent vectors that can be scaled and added to represent every vector in the space. The dimension is simply the size of a basis (number of basis vectors). These concepts are ubiquitous in ML:

  • PCA: Principal components form a basis for a lower-dimensional subspace capturing most data variance.

  • Autoencoders: Encoder learns a basis for latent representations (bottleneck layer); decoder reconstructs using this basis.

  • Neural networks: Each layer’s hidden units define a coordinate system for that layer’s learned representation; the activations are the coordinates in it.

  • Whitening/normalization: Change of basis to decorrelate and rescale features (covariance becomes identity).

  • Sparse coding: Find a basis (dictionary) that sparsely represents data.

Understanding basis and dimension clarifies model capacity (how many independent directions can the model control?), data complexity (how many dimensions does the data actually occupy?), and numerical stability (are basis vectors nearly parallel, i.e., ill-conditioned?).

Important Ideas#

1. Basis = minimal spanning set. A basis $\{v_1, \ldots, v_d\}$ for subspace $S$ satisfies:

  • Spans $S$: Every vector in $S$ is a linear combination $s = \sum_{i=1}^d \alpha_i v_i$.

  • Linearly independent: No $v_j$ is a linear combination of the others. Equivalently, $\sum_{i=1}^d \alpha_i v_i = 0 \Rightarrow \alpha_1 = \cdots = \alpha_d = 0$.

Why minimal? If any vector is removed, the set no longer spans $S$. If any vector is linearly dependent on others, it’s redundant.

Uniqueness of representation: For basis $\{v_1, \ldots, v_d\}$, each vector $s \in S$ has unique coefficients: if $s = \sum_i \alpha_i v_i = \sum_i \beta_i v_i$, then $\alpha_i = \beta_i$ for all $i$ (follows from linear independence).
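As a concrete check (a minimal sketch using NumPy with arbitrary toy values), coordinates in a basis can be computed by solving a linear system, and invertibility of the basis matrix guarantees they are unique:

```python
import numpy as np

# Toy basis for R^2: columns of V are the basis vectors v1, v2
V = np.array([[1.0, 1.0],
              [0.0, 2.0]])
s = np.array([3.0, 4.0])

# Coordinates alpha satisfy V @ alpha = s; V invertible => unique solution
alpha = np.linalg.solve(V, s)
assert np.allclose(V @ alpha, s)   # exact reconstruction
print(alpha)                       # [1. 2.]: s = 1*v1 + 2*v2
```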

2. Dimension = basis size. All bases for a subspace have the same number of vectors. This number is the dimension: $\dim(S) = $ number of vectors in any basis for $S$. For matrix $A$: $$ \dim(\text{col}(A)) = \text{rank}(A), \quad \dim(\text{null}(A)) = \text{nullity}(A) = n - \text{rank}(A) $$

Why constant? By the Steinitz exchange lemma, a linearly independent set cannot contain more vectors than a spanning set, so any two bases for $S$ must have the same cardinality. Dimension is therefore a property of the subspace itself, not of any particular basis.

3. Change of basis. The same vector has different coordinates in different bases. For bases $\mathcal{B} = \{v_1, \ldots, v_d\}$ and $\mathcal{B}' = \{v'_1, \ldots, v'_d\}$, the change of basis matrix $P = [v'_1 | \cdots | v'_d]$ (columns are new basis vectors in old basis) satisfies: $$ v_\text{old basis} = P v_\text{new basis}, \quad v_\text{new basis} = P^{-1} v_\text{old basis} $$

In ML: Changing basis = change of representation. PCA rotates to principal component basis. Whitening rotates to decorrelated basis (covariance = identity).
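For a concrete instance (a sketch assuming NumPy; the rotation angle is an arbitrary choice), take the new basis to be a rotation of the standard basis and verify both directions of the coordinate transform:

```python
import numpy as np

theta = np.pi / 4
# Columns of P are the new basis vectors written in the old (standard) basis
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v_old = np.array([1.0, 0.0])       # coordinates in the old basis
v_new = np.linalg.solve(P, v_old)  # v_new = P^{-1} v_old

assert np.allclose(P @ v_new, v_old)  # round trip: old = P @ new
```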

4. Standard basis and explicit coordinates. The standard basis in $\mathbb{R}^d$ is $\{e_1, \ldots, e_d\}$ where $e_i = [0, \ldots, 1, \ldots, 0]^\top$ (1 in position $i$). For this basis: $$ v = [v_1, \ldots, v_d]^\top = \sum_{i=1}^d v_i e_i $$

Coordinates in the standard basis are just the vector’s components. Embedding lookups use one-hot basis: to get embedding of token $i$, multiply by $e_i$ (selection vector).

Relevance to Machine Learning#

Model capacity and expressiveness. A model operating in a $d$-dimensional space can control at most $d$ independent directions. Linear regression with $d$ features produces predictions confined to the (at most $d$-dimensional) column space of the design matrix. Deep networks learn hierarchical bases: layer $\ell$ learns a basis for its activation space (dimension at most the number of hidden units).

Data intrinsic dimensionality. Real data often lies in a lower-dimensional subspace (manifold hypothesis). PCA finds a basis for the dominant subspace; if the top $k$ eigenvalues capture 95% of variance, the data is approximately $k$-dimensional. This justifies dimensionality reduction with little information loss.

Numerical stability and conditioning. Basis properties affect computation: orthonormal bases (columns have norm 1, mutually perpendicular) are numerically stable (condition number = 1). Nearly parallel basis vectors (ill-conditioned) cause numerical errors in linear algebra operations.
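The conditioning claim is easy to verify numerically (a sketch assuming NumPy; the near-parallel basis below is a contrived toy example):

```python
import numpy as np

Q = np.eye(2)                 # orthonormal basis: condition number exactly 1
B = np.array([[1.0, 1.0],
              [0.0, 1e-8]])   # two nearly parallel basis vectors

print(np.linalg.cond(Q))      # 1.0
print(np.linalg.cond(B))      # ~2e8: tiny perturbations cause huge coordinate changes
```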

Feature engineering and representation learning. Hand-crafted features (e.g., polynomial features, Fourier basis) are explicit bases. Neural networks learn implicit bases (hidden layers) that are optimized for the task. Autoencoders learn an optimal basis for data reconstruction (variational autoencoders add structure to this basis).

Algorithmic Development History#

1. Coordinate geometry (Descartes & Fermat, 1630s-1640s). Cartesian coordinates introduced the standard basis for Euclidean space. Descartes’ La Géométrie (1637) showed geometry could be expressed algebraically using coordinate axes.

2. Change of basis and coordinate transformations (Euler 1770s, Cauchy 1820s-1830s). Euler rotated coordinate systems to simplify equations. Cauchy formalized change of basis through matrix transformations.

3. Linear independence and dimension (Grassmann 1844, Peano 1888). Grassmann introduced independence axiomatically. Peano formalized dimension as the size of maximal independent sets.

4. Orthonormal bases and orthogonalization (Schmidt 1907, Gram-Schmidt process). Erhard Schmidt proved every finite-dimensional space admits an orthonormal basis. Gram-Schmidt orthogonalization computes one constructively.

5. Triangularization and spectral bases (Schur 1909, 1920s-1930s). Schur showed every square matrix is unitarily similar to an upper triangular matrix (Schur decomposition); for symmetric matrices, eigenvectors form an orthonormal basis (spectral theorem), giving natural spectral bases.

6. PCA and dimensionality reduction (Pearson 1901, Hotelling 1933, SVD 1960s). PCA finds the basis vectors (principal components) that best capture data variance. SVD algorithm (Golub & Kahan 1965) computes this stably.

7. Neural basis learning (1980s-2010s). Neural networks learn basis implicitly: hidden layers learn representations that act as bases for downstream computations. Deeper networks learn hierarchical bases (abstract high-level concepts from concrete low-level features).

8. Dictionary learning and sparse coding (2000s). Learn overcomplete bases (more basis vectors than dimension) that sparsely represent data. Applications: image denoising (K-SVD, Aharon et al. 2006), signal processing.

Definitions#

Basis. A set $\mathcal{B} = \{v_1, \ldots, v_d\}$ is a basis for subspace $S$ if:

  1. $\text{span}(\mathcal{B}) = S$ (spans the subspace).

  2. Vectors in $\mathcal{B}$ are linearly independent.

Every vector in $S$ has a unique representation: $s = \sum_{i=1}^d \alpha_i v_i$ (coordinates $\alpha_i$ are unique).

Dimension. The dimension of subspace $S$ is: $$ \dim(S) = \text{size of any basis for } S $$ This is well-defined: all bases for $S$ have the same size.

Rank and nullity. For matrix $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) = \dim(\text{col}(A)) = \dim(\text{row}(A)) $$ $$ \text{nullity}(A) = \dim(\text{null}(A)) = n - \text{rank}(A) $$

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = n$.
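A numerical check of rank-nullity (a sketch assuming NumPy; the matrix below is a toy example with one dependent column):

```python
import numpy as np

# Third column = first + second, so rank < number of columns
A = np.array([[1., 0., 1., 2.],
              [0., 1., 1., 0.],
              [1., 1., 2., 1.]])
n = A.shape[1]

rank = np.linalg.matrix_rank(A)
s = np.linalg.svd(A, compute_uv=False)
nullity = n - np.sum(s > 1e-10)   # rank = count of nonzero singular values

assert rank + nullity == n        # rank-nullity theorem
print(rank, nullity)              # 3 1
```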

Change of basis matrix. For bases $\mathcal{B} = \{v_1, \ldots, v_d\}$ and $\mathcal{B}' = \{v'_1, \ldots, v'_d\}$, the matrix $P = [v'_1 | \cdots | v'_d]$ has as columns the vectors of $\mathcal{B}'$ expressed in coordinates of $\mathcal{B}$. For coordinates $[v]_\mathcal{B}$ (in basis $\mathcal{B}$) and $[v]_{\mathcal{B}'}$ (in basis $\mathcal{B}'$): $$ [v]_\mathcal{B} = P [v]_{\mathcal{B}'}, \quad [v]_{\mathcal{B}'} = P^{-1} [v]_\mathcal{B} $$

Coordinates. For vector $v \in S$ and basis $\mathcal{B} = \{v_1, \ldots, v_d\}$, the coordinates are scalars $[\alpha_1, \ldots, \alpha_d]^\top$ satisfying: $$ v = \sum_{i=1}^d \alpha_i v_i $$ Coordinates are unique (follows from linear independence).


Theoretical Machine Learning — Essential Foundations#

Theorems and formal guarantees:

  1. Rank-nullity theorem. For $A \in \mathbb{R}^{m \times n}$: $$ \text{rank}(A) + \dim(\text{null}(A)) = n $$ Consequences: If $\text{rank}(A) = d < n$, then $\dim(\text{null}(A)) = n - d$ (dimension of solution set). If $\text{rank}(A) = n$ (full rank), solutions are unique (if they exist).

  2. Basis existence and exchange (Steinitz 1913). Every finite-dimensional vector space has a basis, and the Steinitz exchange lemma guarantees all bases have the same size (the dimension). (Basis existence for arbitrary vector spaces requires the axiom of choice.)

  3. Dimension and approximation (approximation theory). Best $k$-dimensional approximation to $x$ is $\text{proj}_{S_k} x$ (projection onto $k$-dimensional subspace). The $k$-dimensional subspace with minimum approximation error is $\text{span}\{u_1, \ldots, u_k\}$ (top $k$ singular vectors of data matrix).

  4. VC dimension and capacity (Vapnik & Chervonenkis 1971). For linear classifiers in $\mathbb{R}^d$: VC dimension = $d + 1$. Capacity grows with dimension (more basis dimensions = more expressiveness).

  5. Matrix rank and complexity. Rank determines complexity: rank-$r$ matrices are described by $O(r(m+n))$ parameters (vs. $mn$ for general matrices), so low-rank approximations are easier to learn (fewer degrees of freedom).

Why essential: These theorems quantify relationships between dimension, expressiveness, and solution complexity.
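Point 3 above can be sanity-checked against a random competitor subspace (a sketch assuming NumPy; the matrix size, rank $k$, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

# Best rank-k approximation: keep the top-k singular triplets (Eckart-Young)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_svd = (U[:, :k] * S[:k]) @ Vt[:k]

# Projection onto a random k-dimensional subspace, for comparison
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
X_rand = X @ Q @ Q.T

err_svd = np.linalg.norm(X - X_svd)
err_rand = np.linalg.norm(X - X_rand)
assert err_svd <= err_rand   # the top singular subspace is optimal
```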

Applied Machine Learning — Essential for Implementation#

Achievements and landmark systems:

  1. PCA for dimensionality reduction (Turk & Pentland 1991). Eigenfaces for face recognition: project face images onto top 50-100 eigenvectors (basis), achieve real-time recognition. Dimension reduction from 10,000 pixels to ~100 PCA coordinates.

  2. Whitening for preprocessing (LeCun et al. 1998). Decorrelate and rescale features: a change of basis that makes the covariance matrix the identity. Improves gradient-based optimization (condition number = 1) and allows larger stable learning rates.

  3. Batch normalization (Ioffe & Szegedy 2015). Normalize layer activations: a per-feature change of basis (center and rescale). Speeds training substantially (the original paper matches baseline accuracy in roughly 14× fewer steps), enables higher learning rates, and reduces internal covariate shift.

  4. Autoencoders for representation learning (Hinton & Salakhutdinov 2006, 2010s-present). Learn non-linear basis via encoder bottleneck. Applications: image compression, anomaly detection, generative models (VAE, 2013).

  5. Transformer attention with multiple bases (Vaswani et al. 2017, Devlin et al. 2018). 8-64 attention heads = 8-64 different basis projections. Achieved state-of-the-art on GLUE (NLP benchmark), ImageNet-scale vision (ViT), multimodal (CLIP).

  6. Dictionary learning and sparse coding (Aharon et al. 2006, Mairal et al. 2009). Learn an overcomplete basis that sparsely represents data. Applications: image denoising (K-SVD, with PSNR improvements of 2-5 dB), face recognition (sparse representation).

Why essential: These systems achieve state-of-the-art by exploiting basis structure (dimensionality reduction, whitening, learned bases, attention heads). Understanding dimension and basis is necessary to design architectures and interpret representations.

Key Ideas: Where It Shows Up#

1. Principal Component Analysis (PCA) — Optimal basis for dimensionality reduction#

Major achievements:

  • Pearson (1901), Hotelling (1933): Formalized finding axes (basis vectors) of maximum variance. Principal components are orthonormal basis vectors.

  • Eigendecomposition solution: Eigenvectors of covariance matrix $C = \frac{1}{n} X_c^\top X_c$ are principal components. Top $k$ eigenvectors form a $k$-dimensional basis capturing maximum variance.

  • SVD connection (Eckart-Young 1936): Truncated SVD $X \approx U_k \Sigma_k V_k^\top$ yields the same basis as PCA (on centered data). The rows of $V_k^\top$ are the principal directions; the coordinates of the data in this basis are given by $U_k \Sigma_k$.

  • Modern applications: Preprocessing (feature whitening), visualization (2D/3D plots from high-D data), compression (keep top $k$ components).

Connection to basis: PCA finds an orthonormal basis $\{u_1, \ldots, u_k\}$ (eigenvectors) for the principal subspace. Each data point $x_i$ has coordinates $(u_1^\top x_i, \ldots, u_k^\top x_i)$ in this basis (dimension reduction from $d$ to $k$).

2. Autoencoders and Latent Representations — Learned bases#

Major achievements:

  • Autoencoders (1980s-1990s): Neural networks learn a bottleneck (low-dimensional basis). Encoder compresses to basis coordinates; decoder reconstructs from coordinates.

  • Variational Autoencoders (Kingma & Welling 2013): Adds probabilistic structure: basis coordinates are drawn from Gaussian prior. Enables generative modeling (sampling new data).

  • Disentangled representations (2010s): Learn bases where each coordinate captures an interpretable factor (e.g., pose, lighting in faces). Beta-VAE encourages disentanglement.

  • Applications: Image generation (basis for visual features), anomaly detection (reconstruction error when input outside learned basis), data compression.

Connection to basis: Encoder learns a basis $\mathcal{B} = \{\text{hidden activations}\}$ for a low-dimensional latent space. Decoder reconstructs using this basis.

3. Deep Neural Networks — Hierarchical basis learning#

Major achievements:

  • Universal approximation (Cybenko 1989): Superpositions of sigmoidal hidden units can approximate any continuous function on a compact set; a single hidden layer suffices, with hidden activations acting as a learned function basis.

  • Hierarchy of bases (Bengio 2013): Deep networks learn multiple bases: low layers learn simple bases (edges, textures), high layers learn complex bases (objects, concepts).

  • Representation learning (LeCun et al. 2015): Deep networks optimize learned representations for the task at hand (a task-specific basis, rather than a generic one such as PCA).

  • Scale (AlexNet 2012, ResNet 2015, Vision Transformers 2020): Depth enables learning of rich hierarchical bases, enabling state-of-the-art performance on complex tasks.

Connection to basis: Layer $\ell$ operates in a $d_\ell$-dimensional activation space (dimension = number of hidden units). Weights $W_\ell$ form a basis (or part of one) for transforming layer $\ell-1$ representations to layer $\ell$ basis.

4. Feature Engineering and Feature Spaces — Hand-crafted vs. learned bases#

Major achievements:

  • Polynomial basis (classical ML): Augment features with powers: $[x_1, x_2, x_1^2, x_1 x_2, x_2^2, \ldots]$ (explicit basis for nonlinear decision boundaries).

  • Fourier basis (signal processing): Decompose signals using Fourier basis $\{\cos(k\omega t), \sin(k\omega t)\}_{k=0}^\infty$ (frequency domain basis).

  • Radial Basis Functions (RBF networks, 1980s): Basis functions centered at data points (e.g., Gaussian bumps). Kernel methods implicitly use infinite RBF basis.

  • Learned bases (deep learning, 2010s-present): End-to-end training optimizes basis (feature space) jointly with downstream task, outperforming hand-crafted bases.

Connection to basis: Features are coordinates in an explicit basis space. Deep learning learns the basis implicitly through weight optimization.

5. Transformer Attention and Multi-Head Projections — Multiple basis subspaces#

Major achievements:

  • Multi-head attention (Vaswani et al. 2017): Project inputs to $h$ different subspaces (heads), each with its own basis. Enables learning multiple relationships in parallel.

  • Scaled dot-product (Vaswani 2017): Each head computes $\text{softmax}(QK^\top / \sqrt{d_k}) V$, where $V$ columns form a basis for the value subspace.

  • BERT (Devlin et al. 2018): Bidirectional Transformers with 12-24 heads. Different heads learn different linguistic bases (syntax, semantics, discourse).

  • Applications: Machine translation (parallel bases for source/target), question answering (basis for understanding), language generation.

Connection to basis: Each attention head projects inputs to a $(d_\text{model}/h)$-dimensional subspace with its own basis. Multi-head concatenation combines these multiple basis representations.

Notation#

Standard Conventions#

1. Basis notation.

  • Basis: $\mathcal{B} = \{v_1, \ldots, v_d\}$ (set of basis vectors).

  • Basis matrix: $V = [v_1 | \cdots | v_d] \in \mathbb{R}^{n \times d}$ (columns are basis vectors).

  • Standard basis: $\{e_1, \ldots, e_d\}$ where $e_i$ has 1 in position $i$, 0 elsewhere.

  • Coordinates: $[v]_\mathcal{B} = [\alpha_1, \ldots, \alpha_d]^\top$ satisfying $v = \sum_i \alpha_i v_i$.

Examples:

  • Standard basis for $\mathbb{R}^3$: $e_1 = [1, 0, 0]^\top, e_2 = [0, 1, 0]^\top, e_3 = [0, 0, 1]^\top$.

  • Vector $v = [3, -1, 2]^\top$ has coordinates $[3, -1, 2]^\top$ in standard basis.

  • PCA basis for 2D data: $\mathcal{B} = \{u_1, u_2\}$ (top 2 eigenvectors). Coordinates $[v]_\mathcal{B} = [u_1^\top v, u_2^\top v]^\top$.

2. Dimension notation.

  • Dimension: $\dim(V)$ or $\dim(S)$ for subspace $S$.

  • Rank: $\text{rank}(A)$ = dimension of column space (equivalently, row space).

  • Nullity: $\text{nullity}(A) = \dim(\text{null}(A)) = n - \text{rank}(A)$.

Examples:

  • For $X \in \mathbb{R}^{100 \times 50}$ (100 examples, 50 features), if $\text{rank}(X) = 30$, then $\dim(\text{col}(X)) = 30$ (data approximately 30-dimensional).

  • If $\text{rank}(X) = 30 < 50$, then $\text{nullity}(X) = 50 - 30 = 20$ (solution set is 20-dimensional affine subspace).

3. Change of basis notation.

  • Basis transition: $\mathcal{B} \to \mathcal{B}'$ (change from basis $\mathcal{B}$ to $\mathcal{B}'$).

  • Transition matrix: $P = [v'_1 | \cdots | v'_d]$ (new basis vectors in old coordinates).

  • Coordinates transform: $[v]_\mathcal{B} = P [v]_{\mathcal{B}'}$ (old basis = transition matrix times new basis coordinates).

Examples:

  • Rotate from standard basis to basis aligned with principal components: $P = [u_1 | \cdots | u_d]$ (eigenvectors of covariance matrix).

  • Whitening: $P = \Lambda^{-1/2} U^\top$ (inverse square root of covariance eigenvalues, times eigenvectors transpose) transforms to decorrelated basis.

4. Orthonormal basis notation.

  • Orthonormal: Basis vectors $\{q_1, \ldots, q_d\}$ satisfy $\|q_i\|_2 = 1$ and $q_i^\top q_j = 0$ for $i \neq j$.

  • Orthonormal matrix: $Q = [q_1 | \cdots | q_d]$ satisfies $Q^\top Q = I$ (columns orthonormal).

  • Orthogonal matrix: $Q \in \mathbb{R}^{d \times d}$ satisfies $Q^\top Q = QQ^\top = I$ (square, invertible, $Q^{-1} = Q^\top$).

Examples:

  • Eigenvectors of symmetric matrices form orthonormal basis.

  • QR decomposition: $A = QR$ where $Q$ has orthonormal columns (basis for $\text{col}(A)$).

  • SVD: $A = U \Sigma V^\top$ where $U, V$ are orthogonal matrices (orthonormal bases for column and row spaces).
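These orthonormality identities can be confirmed directly (a sketch assuming NumPy; the matrix is a random toy example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))

Q, R = np.linalg.qr(A)                  # columns of Q: orthonormal basis for col(A)
assert np.allclose(Q.T @ Q, np.eye(3))  # Q^T Q = I

U, S, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U.T @ U, np.eye(3))  # U has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(3))  # so does V
```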

5. Span and basis rank notation.

  • Full rank: $\text{rank}(A) = \min(m, n)$ (maximum possible rank). Columns (or rows) are linearly independent.

  • Rank deficient: $\text{rank}(A) < \min(m, n)$ (some columns/rows linearly dependent).

  • Column space dimension: $\dim(\text{col}(A)) = \text{rank}(A)$ (basis for column space has $\text{rank}(A)$ vectors).

Examples:

  • $X = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix}$ has rank 1 (second column = 2× first column). Basis for column space: $\{[1, 2, 3]^\top\}$ (1 vector).

  • $X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$ has rank 2 (columns linearly independent). Basis: $\{[1, 0, 0]^\top, [0, 1, 0]^\top\}$ (2 vectors).

6. Projection onto basis notation.

  • Projection matrix: $P_V = V(V^\top V)^{-1} V^\top$ projects onto column space of $V$ (assuming full column rank).

  • Orthogonal projection: If $V$ has orthonormal columns, $P_V = VV^\top$ (simpler form).

  • Coordinates via projection: For orthonormal basis $V$, coordinates are $[v]_\mathcal{B} = V^\top v$.

Examples:

  • PCA projection: $X_\text{proj} = XVV^\top$ (project onto basis of top eigenvectors $V$).

  • Least squares: $\hat{y} = X(X^\top X)^{-1} X^\top y$ (projection of $y$ onto column space of $X$).
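A quick check of the projection identities above (a sketch assuming NumPy; the full-column-rank design is random toy data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))   # full column rank with probability 1
y = rng.normal(size=20)

# Projection onto col(X); equals the least-squares fitted values
P_V = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P_V @ y

assert np.allclose(P_V @ P_V, P_V)        # projection matrices are idempotent
assert np.allclose(X.T @ (y - y_hat), 0)  # residual orthogonal to col(X)
```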

Pitfalls & Sanity Checks#

When working with bases and dimension:

  1. Always center before PCA. Uncentered data gives wrong principal components. Check: mean of centered data should be ~0.

  2. Rank-deficient systems. If rank($X$) < $d$, solution is not unique. Use minimum norm solution or add regularization (Ridge).

  3. Numerical instability from nearly-dependent features. Check condition number: if $\kappa(X) > 10^8$, expect numerical errors. Use SVD/QR instead of explicit inverse.

  4. One-hot trap: With an intercept, $k$ one-hot category columns are linearly dependent (they sum to the intercept column), causing a singular design matrix. Drop one category (keep $k-1$).

  5. Coordinate consistency: When changing basis, verify: (1) old = $P$ × new, (2) $P$ is invertible (full column rank), (3) reconstruction error ~0.
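The one-hot trap (pitfall 4) can be seen directly from matrix rank (a sketch assuming NumPy; intercept included, toy labels):

```python
import numpy as np

labels = np.array([0, 1, 2, 0, 1, 2])
onehot = np.eye(3)[labels]                      # (6, 3) one-hot design
intercept = np.ones((6, 1))

X_full = np.hstack([intercept, onehot])         # 4 columns, but one-hot rows sum to 1
X_drop = np.hstack([intercept, onehot[:, 1:]])  # drop one category

print(np.linalg.matrix_rank(X_full))  # 3, not 4: columns 2..4 sum to column 1
print(np.linalg.matrix_rank(X_drop))  # 3: full column rank
```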

References#

Foundational texts:

  1. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press.

  2. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer.

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge University Press.

  4. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.

PCA and dimensionality reduction:

  1. Pearson, K. (1901). “On lines and planes of closest fit to systems of points.” Philosophical Magazine, 2(11), 559–572.

  2. Hotelling, H. (1933). “Analysis of a complex of statistical variables.” Journal of Educational Psychology, 24(6), 417–441.

  3. Eckart, C., & Young, G. (1936). “The approximation of one matrix by another of lower rank.” Psychometrika, 1(3), 211–218.

  4. Turk, M., & Pentland, A. (1991). “Eigenfaces for recognition.” Journal of Cognitive Neuroscience, 3(1), 71–86.

Neural networks and representation learning:

  1. Cybenko, G. (1989). “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems, 2(4), 303–314.

  2. Bengio, Y. (2013). “Deep learning of representations: Looking forward.” In Statistical Language and Speech Processing (pp. 1–37). Springer.

  3. Hinton, G. E., & Salakhutdinov, R. R. (2006). “Reducing the dimensionality of data with neural networks.” Science, 313(5786), 504–507.

  4. Kingma, D. P., & Welling, M. (2013). “Auto-encoding variational Bayes.” arXiv:1312.6114.

Optimization and normalization:

  1. Robbins, H., & Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22(3), 400–407.

  2. Ioffe, S., & Szegedy, C. (2015). “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” In ICML (pp. 448–456).

  3. Ba, J., Kiros, J. R., & Hinton, G. E. (2016). “Layer normalization.” arXiv:1607.06450.

Transformers and attention:

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). “Attention is all you need.” In NeurIPS (pp. 5998–6008).

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv:1810.04805.

Dictionary learning and sparse coding:

  1. Aharon, M., Elad, M., & Bruckstein, A. (2006). “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation.” IEEE Transactions on Signal Processing, 54(11), 4311–4322.

Five Worked Examples#

Worked Example 1: One-hot basis and embedding lookup#

Problem. Show embedding lookup is matrix multiplication by a one-hot vector.

Solution (math). If $x = e_i$ (the $i$-th standard basis vector), then $Ex$ selects the $i$-th column of $E$.

Solution (Python).

import numpy as np

E = np.array([[1., 2., 3.],
              [0., -1., 1.]])   # columns are token embeddings
x = np.array([0., 1., 0.])      # one-hot selector e_2
print(E @ x)                    # second column of E: [ 2. -1.]

Worked Example 2: Coordinates in a PCA basis#

Problem. Compute 1D PCA coordinate z of a point.

Solution (math). If $u$ is the top principal component, $z = u^\top (x - \mu)$.

Solution (Python).

import numpy as np
from scripts.toy_data import toy_pca_points  # repo-local helper

X = toy_pca_points(n=10, seed=0)
mu = X.mean(0)
Xc = X - mu                        # always center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
u = Vt[0]                          # top principal direction
z = u @ (X[0] - mu)                # 1D PCA coordinate of the first point
print(z)

Worked Example 3: Redundant engineered features via rank#

Problem. Detect redundancy by checking rank.

Solution (math). If $\text{rank}(X) < d$, the columns are linearly dependent (some features are redundant).

Solution (Python).

import numpy as np

X = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [3., 6., 9.]])       # columns 2 and 3 are multiples of column 1
print(np.linalg.matrix_rank(X))    # 1

Worked Example 4: Whitening as change of basis#

Problem. Whiten data using covariance eigendecomposition.

Solution (math). If $\Sigma = U \Lambda U^\top$, then $x_\text{white} = \Lambda^{-1/2} U^\top (x - \mu)$.

Solution (Python).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = X.mean(0)
Xc = X - mu
Sigma = np.cov(Xc, rowvar=False)
lam, U = np.linalg.eigh(Sigma)           # eigendecomposition Σ = U Λ U^T
W = np.diag(1 / np.sqrt(lam)) @ U.T      # whitening matrix Λ^{-1/2} U^T
Xw = (W @ Xc.T).T
print(np.cov(Xw, rowvar=False))          # ≈ identity

Worked Example 5: SGD gradients live in feature span#

Problem. Show each per-example gradient is a scalar times the feature vector.

Solution (math). For squared loss $\tfrac{1}{2}(x^\top w - y)^2$, the gradient is $\nabla_w = (x^\top w - y)\, x$: a scalar residual times the feature vector $x$.

Solution (Python).

import numpy as np

x = np.array([1., -2., 0.5])
w = np.array([0.2, 0.1, -0.3])
y = 1.0
g = (x @ w - y) * x    # residual times feature vector: g lies in span{x}
print(g)
