ex1.ai

Chapter 7

Rank & Nullspace

Key ideas: Introduction

Introduction#

Rank and null space describe how information flows through matrices:

Rank $r$ = number of independent columns/rows (nonzero singular values)
Null space $\text{null}(A)$ = set of inputs mapped to zero (lost information)
Column/row spaces = subspaces where outputs/inputs live; orthogonal complements relate via FTLA
Pseudoinverse solves least squares even when $A$ is rank-deficient (minimal-norm solutions)
Low-rank structure compresses models and reveals latent factors (factorization)

Important ideas#

Row rank equals column rank
- $\operatorname{rank}(A)$ is the dimension of $\text{col}(A)$ and equals that of $\text{row}(A)$.
Rank via singular values
- If $A=U\Sigma V^\top$, then $\operatorname{rank}(A)$ equals the number of nonzero singular values $\sigma_i$.
Rank–nullity theorem
- For $A\in\mathbb{R}^{m\times d}$, $$\operatorname{rank}(A) + \operatorname{nullity}(A) = d.$$
Fundamental theorem of linear algebra (FTLA)
- $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$ and $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$ (orthogonal decompositions).
Rank of products and sums
- $\operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$; subadditivity for sums.
Pseudoinverse $A^+$
- Moore–Penrose $A^+$ gives minimal-norm solutions $x^* = A^+ b$; satisfies $AA^+A = A$.
Numerical rank
- Practical rank uses thresholds on singular values to handle floating-point noise.

Relevance to ML#

Multicollinearity: rank-deficient design $X$ yields non-unique OLS solutions; regularization/pseudoinverse needed.
PCA/compression: low rank captures variance efficiently; truncation yields best rank-$k$ approximation.
Recommendation systems: user–item matrices modeled as low-rank factorization.
Kernels/Gram matrices: rank informs capacity and generalization; $\operatorname{rank}(XX^\top) \le \min(n,d)$.
Attention: score matrix $QK^\top$ has rank bounded by $\min(n, d_k)$; head dimension limits expressivity.
Deep nets: bottleneck layers enforce low-rank mapping; adapters/LoRA factorize weights.

Algorithmic development (milestones)#

1936: Eckart–Young — best rank-$k$ approximation via SVD.
1955: Penrose — Moore–Penrose pseudoinverse.
1990s–2000s: Matrix factorization in recommender systems (SVD-based, ALS).
2009: Candès–Recht — nuclear norm relaxation for matrix completion.
2011: Halko–Martinsson–Tropp — randomized SVD for large-scale low-rank.
2019–2021: Low-rank adapters (LoRA) compress transformer weights.

Definitions#

$\operatorname{rank}(A)$: dimension of $\text{col}(A)$ (or $\text{row}(A)$); number of nonzero singular values.
$\text{null}(A) = \{x: Ax=0\}$; $\text{null}(A^\top)$ similarly.
$\text{col}(A)$: span of columns; $\text{row}(A)$: span of rows.
FTLA decompositions: $\mathbb{R}^n = \text{col}(A) \oplus \text{null}(A^\top)$, $\mathbb{R}^d = \text{row}(A) \oplus \text{null}(A)$.
Pseudoinverse: $A^+ = V \Sigma^+ U^\top$ where $\Sigma^+$ reciprocates nonzero $\sigma_i$.

Essential vs Optional: Theoretical ML

Theoretical (essential theorems/tools)#

Rank–nullity: $$\operatorname{rank}(A)+\operatorname{nullity}(A)=d.$$
FTLA (four subspaces): $\text{col}(A) \perp \text{null}(A^\top)$ and $\text{row}(A) \perp \text{null}(A)$.
Row=column rank: $\dim\text{row}(A) = \dim\text{col}(A)$.
Singular values and rank: $\operatorname{rank}(A)$ is the count of positive $\sigma_i$.
Sylvester’s inequality: $\operatorname{rank}(AB) \ge \operatorname{rank}(A) + \operatorname{rank}(B) - k$ (context-dependent; upper/lower bounds useful).
Eckart–Young–Mirsky: Truncated SVD minimizes error among rank-$k$ approximations.
Moore–Penrose pseudoinverse properties: $AA^+A=A$, $A^+AA^+=A^+$.

Applied (landmark systems/practices)#

PCA: Jolliffe (2002); Shlens (2014).
Stable least squares: Golub–Van Loan (2013).
Matrix completion via nuclear norm: Candès–Recht (2009).
Randomized SVD for scale: Halko–Martinsson–Tropp (2011).
Recommender systems: Koren–Bell–Volinsky (2009).
Low-rank adapters in transformers: Hu et al. (2021).

Key ideas: Where it shows up

PCA and covariance rank

Centered data $X_c$ yields covariance $\Sigma = \tfrac{1}{n} X_c^\top X_c$ with $\operatorname{rank}(\Sigma) \le \min(n-1, d)$.
Achievements: Dimensionality reduction with $k\ll d$; whitening in vision/speech. References: Jolliffe 2002; Shlens 2014; Murphy 2022.

Regression and multicollinearity

If $\operatorname{rank}(X) < d$, normal equations $X^\top X w = X^\top y$ are singular; pseudoinverse/regularization resolve ambiguity.
Achievements: Robust linear modeling; Ridge/Lasso mitigate rank issues. References: Hoerl–Kennard 1970; Tibshirani 1996; Golub–Van Loan 2013.

Low-rank models and compression

Factorize $W \approx AB^\top$ with small inner dimension to reduce parameters and computation (adapters, LoRA).
Achievements: Efficient fine-tuning of large transformers. References: Hu et al. 2021 (LoRA); Tishby & Zaslavsky 2015 (bottlenecks conceptual).

Matrix factorization for recommendation

User–item ratings approximated by low-rank matrices; SVD/ALS used in practice.
Achievements: Netflix Prize-era improvements; widespread deployment. References: Koren et al. 2009; Funk 2006.

Kernels/Gram and attention score rank

$G=XX^\top$ has rank $\le \min(n,d)$; $QK^\top$ rank $\le \min(n,d_k)$. Rank limits expressivity and affects generalization.
Achievements: Scalable kernel methods via low-rank approximations; attention head size trade-offs. References: Schölkopf–Smola 2002; Vaswani et al. 2017.

Notation

Shapes: $A\in\mathbb{R}^{m\times d}$; $X\in\mathbb{R}^{n\times d}$ is data.
Spaces: $\text{col}(A)$, $\text{row}(A)$, $\text{null}(A)$, $\text{null}(A^\top)$.
Rank: $\operatorname{rank}(A)$; Nullity: $\operatorname{nullity}(A)$.
SVD: $A=U\Sigma V^\top$; $U\in\mathbb{R}^{m\times r}$, $V\in\mathbb{R}^{d\times r}$ span column/row spaces; $r=\operatorname{rank}(A)$.
Pseudoinverse: $A^+ = V\Sigma^+ U^\top$; minimal-norm solution $x^* = A^+ b$.
Examples:
- Rank via SVD: count $\sigma_i > \tau$ with threshold $\tau$.
- Projection onto column space: $P_{\text{col}} = U_r U_r^\top$; onto row space: $P_{\text{row}} = V_r V_r^\top$.
- Covariance rank: $\operatorname{rank}(X_c^\top X_c) \le n-1$ for centered data.

Pitfalls & sanity checks

Never invert $X^\top X$ when $\operatorname{rank}(X)<d$; use QR/SVD or regularize.
Diagnose numerical rank via singular values; set thresholds based on scale.
Center data for covariance; otherwise rank properties and PCA directions change.
Beware overfitting: increasing rank (k in PCA/factorization) beyond signal raises variance.
Attention heads: too small $d_k$ may limit expressivity; too large may hurt stability.

References

Foundations and theory

Strang, G. (2016). Introduction to Linear Algebra (5th ed.).
Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).
Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Low-rank approximation and factorization 4. Eckart, C., & Young, G. (1936). Best rank-$k$ approximation. 5. Halko, N., Martinsson, P.-G., & Tropp, J. (2011). Randomized algorithms for matrices. 6. Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. 7. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems.

Regression and pseudoinverse 8. Penrose, R. (1955). A generalized inverse for matrices. 9. Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression. 10. Tibshirani, R. (1996). Lasso.

ML systems and practice 11. Jolliffe, I. (2002). Principal Component Analysis. 12. Shlens, J. (2014). A Tutorial on Principal Component Analysis. 13. Murphy, K. P. (2022). Probabilistic Machine Learning. 14. Vaswani, A. et al. (2017). Attention Is All You Need. 15. Devlin, J. et al. (2019). BERT.

Five worked examples

Worked Example 1: Detecting multicollinearity via null space (non-unique regression)#

Introduction#

Show how null space reveals linear dependencies among features and why OLS becomes non-unique when $\operatorname{rank}(X)<d$.

Purpose#

Compute null space vectors and connect them to redundant directions; use pseudoinverse for a minimal-norm solution.

Importance#

Avoids unstable fits and clarifies identifiability in models.

What this example demonstrates#

If $v\in\text{null}(X)$, $X(w+\alpha v) = Xw$ for all $\alpha$; infinitely many OLS solutions.
Pseudoinverse $w^* = X^+ y$ yields the minimal-norm solution.

Background#

Rank deficiency arises from duplicate/derived features or insufficient data.

Historical context#

Gauss/Legendre least squares; Penrose pseudoinverse enables solutions in singular cases.

Prevalence in ML#

High-dimensional regression, feature engineering pipelines, polynomial expansions.

Notes#

Use SVD to diagnose numerical rank; add Ridge to regularize.

Connection to ML#

Feature selection and regularization strategies hinge on rank awareness.

Connection to Linear Algebra Theory#

FTLA: residuals in $\text{null}(X^\top)$; solution set $w_0 + \text{null}(X)$.

Pedagogical Significance#

Makes the geometry of “non-unique solutions” tangible.

References#

Golub & Van Loan (2013). Matrix Computations.
Penrose (1955). Moore–Penrose pseudoinverse.
Hoerl & Kennard (1970). Ridge regression.

Solution (Python)#

import numpy as np

np.random.seed(0)
n, d = 20, 6
X = np.random.randn(n, d)
X[:, 5] = X[:, 0] + X[:, 1]  # make a perfectly colinear feature
w_true = np.array([1.0, -0.5, 0.3, 0.0, 2.0, 0.3])
y = X @ w_true + 0.1 * np.random.randn(n)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
rank = np.sum(S > 1e-8)
nullspace_basis = Vt[rank:].T  # columns spanning null(X)

print("rank(X)=", rank, " d=", d, " nullity=", d - rank)
print("Nullspace basis shape:", nullspace_basis.shape)

# Minimal-norm solution via pseudoinverse
w_min = Vt.T @ (np.where(S > 1e-12, (U.T @ y) / S, 0.0))
print("||w_min||2=", np.linalg.norm(w_min))
print("OLS residual norm:", np.linalg.norm(y - X @ w_min))

Worked Example 2: Covariance rank ≤ n−1 (PCA in n<d regimes)#

Introduction#

Verify empirically that centered covariance has rank at most $n-1$ regardless of feature dimension.

Purpose#

Explain why PCA cannot produce more than $n-1$ nonzero eigenvalues and how this affects high-dimensional settings.

Importance#

Shapes expectations for PCA on small data; prevents overinterpretation.

What this example demonstrates#

With $X_c\in\mathbb{R}^{n\times d}$ centered, $\operatorname{rank}(X_c) \le \min(n-1, d)$; hence $\operatorname{rank}(\Sigma) \le n-1$.

Background#

Centering imposes a linear constraint across rows, reducing rank by at least one when $n>1$.

Historical context#

PCA theory and practice emphasize centering for correct variance structure.

Prevalence in ML#

Common in text, genomics, and other $d\gg n$ problems.

Notes#

Always center before PCA; whitening depends on accurate rank.

Connection to ML#

Model selection of $k$ principal components must respect $n-1$ limit.

Connection to Linear Algebra Theory#

Row-sum constraint places $\mathbf{1}$ in $\text{null}(X_c^\top)$.

Pedagogical Significance#

Reinforces how constraints reduce rank.

References#

Jolliffe (2002). PCA.
Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 30, 200
X = np.random.randn(n, d)
Xc = X - X.mean(axis=0, keepdims=True)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
rank = np.sum(S > 1e-10)
print("rank(Xc)=", rank, " <= min(n-1,d)=", min(n-1, d))

Worked Example 3: Low-rank matrix factorization for recommendation (synthetic)#

Introduction#

Construct a synthetic user–item rating matrix with known low rank and recover it via truncated SVD.

Purpose#

Demonstrate latent-factor modeling and show reconstruction error scales with tail singular values.

Importance#

Illustrates the power of rank reduction in recommender systems.

What this example demonstrates#

$R\approx U_k \Sigma_k V_k^\top$ captures most variance when spectrum decays.

Background#

Matrix factorization underlies collaborative filtering; ALS/SGD optimize latent vectors.

Historical context#

Post-Netflix Prize, low-rank methods became industry standard.

Prevalence in ML#

Ubiquitous in recommendation and implicit feedback modeling.

Notes#

For missing data, completion requires specialized optimization (not shown here).

Connection to ML#

Latent dimensions reflect user/item factors; rank controls capacity.

Connection to Linear Algebra Theory#

Eckart–Young guarantees best rank-$k$ approximation.

Pedagogical Significance#

Shows direct link from SVD to practical factor models.

References#

Koren, Bell, Volinsky (2009). Matrix factorization techniques for recommender systems.
Candès & Recht (2009). Exact matrix completion via convex optimization.

Solution (Python)#

import numpy as np

np.random.seed(2)
u, i, k = 80, 60, 5
U_true = np.random.randn(u, k)
V_true = np.random.randn(i, k)
R = U_true @ V_true.T + 0.1 * np.random.randn(u, i)

U, S, Vt = np.linalg.svd(R, full_matrices=False)
Rk = (U[:, :k] * S[:k]) @ Vt[:k]
err = np.linalg.norm(R - Rk, 'fro')**2
tail = (S[k:]**2).sum()
print("Fro error:", round(err, 6), " Tail sum:", round(tail, 6), " Close?", np.allclose(err, tail, atol=1e-5))

Worked Example 4: Moore–Penrose pseudoinverse — minimal-norm solutions#

Introduction#

Solve $Ax=b$ when $A$ is rectangular or rank-deficient; verify $AA^+A=A$ and minimal norm among all solutions.

Purpose#

Provide a robust recipe for under-/overdetermined systems.

Importance#

Avoids fragile inverses and clarifies the solution geometry.

What this example demonstrates#

$x^*=A^+ b$ minimizes $\lVert x\rVert_2$ subject to $Ax=b$ for consistent systems.
Penrose conditions hold numerically.

Background#

Pseudoinverse defined via SVD; used in control, signal processing, ML.

Historical context#

Penrose (1955) established the four defining equations.

Prevalence in ML#

Closed-form layers, analytic baselines, and data-fitting routines.

Notes#

Use SVD-backed implementations; threshold small singular values.

Connection to ML#

Stable baselines and analytic steps inside pipelines.

Connection to Linear Algebra Theory#

Projects onto $\text{row}(A)$/$\text{col}(A)$; minimal-norm in $\text{null}(A)$ components.

Pedagogical Significance#

Bridges algebraic definition to numerical practice.

References#

Penrose, R. (1955). A generalized inverse for matrices.
Golub & Van Loan (2013). Matrix Computations.

Solution (Python)#

import numpy as np

np.random.seed(3)
m, d = 10, 12
A = np.random.randn(m, d)
# Make A rank-deficient by zeroing a singular value via colinearity
A[:, 0] = A[:, 1] + A[:, 2]
b = np.random.randn(m)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
S_inv = np.where(S > 1e-10, 1.0 / S, 0.0)
A_plus = Vt.T @ np.diag(S_inv) @ U.T

x_star = A_plus @ b
print("Penrose A A^+ A ~ A?", np.allclose(A @ (A_plus @ A), A, atol=1e-8))
print("Residual ||Ax-b||:", np.linalg.norm(A @ x_star - b))
print("||x_star||2:", np.linalg.norm(x_star))

Worked Example 5: Rank of attention scores QK^T (expressivity bound)#

Introduction#

Show that the attention score matrix $S=QK^\top$ has rank at most $\min(n, d_k)$ and explore implications for head dimension.

Purpose#

Connect feature dimension to expressivity through rank bounds.

Importance#

Head size choices affect the diversity of attention patterns.

What this example demonstrates#

For $Q\in\mathbb{R}^{n\times d_k}$ and $K\in\mathbb{R}^{n\times d_k}$, $\operatorname{rank}(QK^\top) \le \min(n, d_k)$.

Background#

Rank of product bounded by inner dimension; scaled dot-products preserve rank.

Historical context#

Transformers leverage multiple heads to increase effective rank/expressivity.

Prevalence in ML#

All transformer models; multi-head concatenation increases representational capacity.

Notes#

Multi-head attention can be seen as block structures that raise overall rank after concatenation.

Connection to ML#

Guides architecture design (choosing $d_k$ and number of heads).

Connection to Linear Algebra Theory#

Rank bounds and product properties.

Pedagogical Significance#

Links a practical hyperparameter to a crisp linear algebra bound.

References#

Vaswani, A. et al. (2017). Attention Is All You Need.
Devlin, J. et al. (2019). BERT.

Solution (Python)#

import numpy as np

np.random.seed(4)
for n, dk in [(32, 8), (32, 32), (64, 16)]:
	 Q = np.random.randn(n, dk)
	 K = np.random.randn(n, dk)
	 S = Q @ K.T
	 r = np.linalg.matrix_rank(S)
	 print(f"n={n}, d_k={dk}, rank(S)={r}, bound={min(n, dk)}")

Comments

Algorithm Category

Direct Methods

Data Modality

Dense Matrices

Historical & Attribution

Classical Era

Key Concepts & Theorems

Rank & Nullspace

Learning Path & Sequencing

Foundational

Linear Algebra Foundations

Matrix Theory

Theoretical Foundation

Linear Algebra Theory