Chapter 9
PSD & PD Matrices
Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $x^\top A x > 0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues; PD matrices additionally admit Cholesky factorizations and define inner products and metrics. PSD structure underpins convexity, stability, and the validity of kernels.

Important ideas#

  1. Eigenvalue characterization

    • Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.

  2. Quadratic forms and convexity

    • If $A\succeq 0$, then $f(x)=\tfrac{1}{2} x^\top A x$ is convex; more generally, a twice-differentiable function is convex on a convex set iff its Hessian is PSD everywhere on that set.

  3. Gram matrices

    • $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).

  4. Cholesky factorization

    • PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.

  5. Schur complement

    • For the block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, the full matrix is PSD iff the Schur complement $C-B^\top A^{-1}B \succeq 0$ (see the numeric check after this list).

  6. Mahalanobis metric

    • For SPD $M$, $d_M(x,y)^2=(x-y)^\top M (x-y)$ defines a metric; whitening corresponds to $M=\Sigma^{-1}$.

  7. Numerical PSD

    • In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).
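
A minimal numeric check of the Schur complement criterion from item 5; the block sizes and the constructions of $A$, $B$, $C$ below are purely illustrative:

import numpy as np

np.random.seed(0)
A = np.random.randn(3, 3)
A = A @ A.T + np.eye(3)                             # A is SPD by construction
B = np.random.randn(3, 2)
C = B.T @ np.linalg.solve(A, B) + 0.1 * np.eye(2)   # chosen so the Schur complement is 0.1 I

M = np.block([[A, B], [B.T, C]])
S = C - B.T @ np.linalg.solve(A, B)                 # Schur complement of A

print("min eig of block matrix:", np.linalg.eigvalsh(M).min())      # positive: M is PD
print("min eig of Schur complement:", np.linalg.eigvalsh(S).min())  # ~0.1, as the criterion predicts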

Relevance to ML#

  • Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.

  • Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.

  • Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.

  • Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.

  • Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

  • 1909: Mercer’s theorem (kernel PSD) and early quadratic-form characterizations.

  • 1917: Schur’s block-determinant lemma, the basis of Schur complement criteria.

  • 1910–1940s: Cholesky factorization for SPD solves.

  • 1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.

  • 1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

  • PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.

  • Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.

  • Principal minors: PD iff all leading principal minors are positive (Sylvester); a numeric check follows this list.

  • Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.

  • Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.
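
Sylvester’s criterion is easy to verify numerically. A small sketch, with an illustrative SPD construction, comparing leading principal minors against the eigenvalue test:

import numpy as np

np.random.seed(1)
B = np.random.randn(4, 4)
A = B @ B.T + 0.1 * np.eye(4)  # SPD by construction

# Sylvester: PD iff det(A[:k, :k]) > 0 for k = 1..n
minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print("leading principal minors:", np.round(minors, 4))          # all positive
print("min eigenvalue:", round(np.linalg.eigvalsh(A).min(), 4))  # positive, agreeing with Sylvester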

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • PSD/PD definitions via quadratic form and eigenvalues.

  • Sylvester’s criterion (leading principal minors > 0 for PD).

  • Schur complement PSD condition.

  • Mercer’s theorem (kernel PSD).

  • Cholesky existence for SPD.

Applied (landmark systems)#

  • SVM with PSD kernels (Cortes–Vapnik 1995).

  • Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.

  • Kernel Ridge Regression (Murphy 2022) solved via SPD systems.

  • Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).

  • GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up
  1. Covariance and PCA

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; its eigenvalues are the variances along the principal directions. PCA uses this PSD structure. References: Jolliffe 2002; Shlens 2014.

  2. Kernels and SVMs/GPs

  • RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  3. Hessians and convexity

  • For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

  4. Mahalanobis distance & whitening

  • SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

  5. Cholesky-based solvers (KRR/GP)

  • SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation
  • PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.

  • Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.

  • Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.

  • Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.

  • Schur complement: For blocks, $S = C - B^\top A^{-1} B$.

  • Examples:

    • PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$; a clipping-based repair is sketched after these examples.

    • Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.

    • Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.
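
The symmetrize-then-check recipe above extends naturally to a repair step that clips negative eigenvalues to zero; `clip_psd` is an illustrative helper name, not a library routine:

import numpy as np

def clip_psd(A, floor=0.0):
    # Symmetrize, then clip eigenvalues below `floor` (nearest-PSD-style repair)
    A = 0.5 * (A + A.T)
    w, Q = np.linalg.eigh(A)
    w = np.clip(w, floor, None)
    return Q @ np.diag(w) @ Q.T

M = np.array([[1.0, 0.9], [0.9, 0.8]])  # det < 0, so one eigenvalue is negative
print("min eig before:", round(np.linalg.eigvalsh(M).min(), 6))
print("min eig after:", round(np.linalg.eigvalsh(clip_psd(M)).min(), 6))  # ~0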

Pitfalls & sanity checks
  • Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.

  • Near-singular PSD: add jitter; do not invert directly.

  • Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.

  • Kernel parameters: very small/large length-scales can hurt conditioning.

  • Cholesky failures: indicate a non-PSD matrix or insufficient jitter; a retry helper is sketched below.
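
One common defensive pattern, sketched here rather than prescribed, retries Cholesky with geometrically growing jitter; `chol_with_jitter` is a hypothetical helper:

import numpy as np

def chol_with_jitter(K, eps0=1e-10, max_tries=10):
    # Try Cholesky, escalating jitter by 10x until the factorization succeeds
    eps = eps0
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + eps * np.eye(K.shape[0])), eps
        except np.linalg.LinAlgError:
            eps *= 10.0
    raise np.linalg.LinAlgError("matrix appears far from PSD; check its construction")

np.random.seed(2)
X = np.random.randn(8, 2) * 0.1                 # near-duplicate points -> ill-conditioned Gram
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / (2 * 0.5 ** 2))                # RBF kernel matrix
L, eps = chol_with_jitter(K)
print("Cholesky succeeded with jitter:", eps)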

References

Foundations

  1. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models

  4. Mercer, J. (1909). Functions of positive and negative type.

  5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels.

  6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics

  8. Nocedal, J., & Wright, S. (2006). Numerical Optimization.

  9. De Maesschalck, R., et al. (2000). The Mahalanobis distance.

Applications and practice

  10. Cortes, C., & Vapnik, V. (1995). Support-vector networks.

  11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

  12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare two PSD tests, eigenvalues and Cholesky; show failure on a non-PSD matrix and success after a jitter fix.

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

  • Symmetrization, eigenvalue check, and Cholesky factorization.

  • Adding jitter ($\epsilon I$) can restore PD for near-PSD matrices.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

  • Always symmetrize numerically; use small jitter (e.g., $10^{-6}$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; $A=LL^\top$ iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
B = np.random.randn(5, 5)
A = B @ B.T           # Gram construction guarantees symmetric PSD (PD almost surely)
A = 0.5 * (A + A.T)   # symmetrize defensively against round-off
A_bad = A.copy()
A_bad[0, 0] = -5.0    # make indefinite (min eigenvalue <= smallest diagonal entry)

def is_psd_eig(M, tol=1e-10):
    w = np.linalg.eigvalsh(M)
    return np.all(w >= -tol), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.4f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Jitter fix: shift by slightly more than |min eigenvalue| so A_fix is PD
# (a fixed tiny eps cannot repair an eigenvalue near -5)
_, w_bad = is_psd_eig(A_bad)
eps = max(0.0, -w_bad.min()) + 1e-6
A_fix = A_bad + eps * np.eye(A_bad.shape[0])
ok, w = is_psd_eig(A_fix)
print("jittered: min eig=", round(w.min(), 4), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("jittered Cholesky diag min:", round(np.min(np.diag(L)), 4))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify variance becomes identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.

  • Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

  • Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

  • RBF kernel is PSD; small datasets can be nearly singular when points are close.

  • Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

  • Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = np.maximum(A2 + A2.T - 2 * A @ A.T, 0.0)  # clip tiny round-off negatives
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

  • $d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

  • Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

  1. De Maesschalck et al. (2000). The Mahalanobis distance.

  2. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

  • Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

  • Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

  3. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = np.maximum(A2 + B2 - 2 * A @ B.T, 0.0)  # clip tiny round-off negatives
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L z = y, then L^T alpha = z (two triangular solves;
# scipy.linalg.solve_triangular would exploit the triangular structure explicitly)
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))
