Chapter 9
PSD & PD Matrices
Key ideas: Introduction

Introduction#

Positive semidefinite (PSD) means $x^\top A x \ge 0$ for all $x$; positive definite (PD) means $x^\top A x > 0$ for all nonzero $x$. Symmetric PSD matrices have nonnegative eigenvalues; PD matrices additionally admit Cholesky factorizations and define inner products and metrics. PSD structure underpins convexity, stability, and the validity of kernels.

Important ideas#

  1. Eigenvalue characterization

    • Symmetric $A$ is PSD iff all eigenvalues $\lambda_i \ge 0$; PD iff $\lambda_i > 0$.

  2. Quadratic forms and convexity

    • If $A\succeq 0$, then $f(x)=\tfrac{1}{2} x^\top A x$ is convex; more generally, a twice-differentiable function is convex on a convex set iff its Hessian is PSD everywhere on that set.

  3. Gram matrices

    • $G_{ij}=\langle x_i, x_j\rangle$ is PSD; kernels $K_{ij}=k(x_i,x_j)$ must be PSD (Mercer).

  4. Cholesky factorization

    • PD $A=LL^\top$ with $L$ lower-triangular; fast, stable solves vs. explicit inverses.

  5. Schur complement

    • For the block matrix $\begin{pmatrix}A & B\\ B^\top & C\end{pmatrix}$ with $A\succ 0$, the full matrix is PSD iff the Schur complement $C-B^\top A^{-1}B \succeq 0$ (see the numeric check after this list).

  6. Mahalanobis metric

    • For SPD $M$, $d_M(x,y)^2=(x-y)^\top M (x-y)$ defines a metric; whitening corresponds to $M=\Sigma^{-1}$.

  7. Numerical PSD

    • In practice, enforce symmetry, threshold tiny negative eigenvalues, or add jitter to make matrices usable (e.g., kernels, covariances).
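
A minimal numeric check of the Schur complement criterion from item 5; the block sizes and the constructions of $A$, $B$, $C$ below are purely illustrative:

import numpy as np

np.random.seed(0)
A = np.random.randn(3, 3)
A = A @ A.T + np.eye(3)                             # A is SPD by construction
B = np.random.randn(3, 2)
C = B.T @ np.linalg.solve(A, B) + 0.1 * np.eye(2)   # chosen so the Schur complement is 0.1 I

M = np.block([[A, B], [B.T, C]])
S = C - B.T @ np.linalg.solve(A, B)                 # Schur complement of A

print("min eig of block matrix:", np.linalg.eigvalsh(M).min())      # positive: M is PD
print("min eig of Schur complement:", np.linalg.eigvalsh(S).min())  # ~0.1, as the criterion predicts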

Relevance to ML#

  • Covariance/variance: PSD guarantees nonnegative variances; PCA/SVD rely on PSD covariance.

  • Kernels/GPs/SVMs: PSD kernels ensure valid feature maps and convex objectives.

  • Optimization: Hessian PSD/PD gives convexity; PD enables Newton/Cholesky steps.

  • Metrics/whitening: SPD metrics shape distance (Mahalanobis), whitening for decorrelation.

  • Uncertainty: GP posterior covariance must remain PSD for meaningful variances.

Algorithmic development (milestones)#

  • 1909: Mercer’s theorem (kernel PSD) and early quadratic-form characterizations.

  • 1917: Schur’s block-determinant lemma, the basis of Schur complement criteria.

  • 1910–1940s: Cholesky factorization for SPD solves.

  • 1950s–2000s: Convex optimization and interior-point methods rely on PSD cones.

  • 1990s–2000s: SVMs/GPs bring PSD kernels mainstream in ML.

Definitions#

  • PSD: $A\succeq 0$ if $A=A^\top$ and $x^\top A x\ge 0$ for all $x$; PD: $x^\top A x>0$ for all $x\neq 0$.

  • Eigenvalue test: PSD/PD iff eigenvalues nonnegative/positive.

  • Principal minors: PD iff all leading principal minors are positive (Sylvester); a numeric check follows this list.

  • Gram matrix: $G=XX^\top$ is PSD; kernel matrix $K_{ij}=k(x_i,x_j)$ is PSD if $k$ is a valid kernel.

  • Cholesky: $A=LL^\top$ for SPD $A$ with $L$ lower-triangular and positive diagonal.
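
Sylvester’s criterion is easy to verify numerically. A small sketch, with an illustrative SPD construction, comparing leading principal minors against the eigenvalue test:

import numpy as np

np.random.seed(1)
B = np.random.randn(4, 4)
A = B @ B.T + 0.1 * np.eye(4)  # SPD by construction

# Sylvester: PD iff det(A[:k, :k]) > 0 for k = 1..n
minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print("leading principal minors:", np.round(minors, 4))          # all positive
print("min eigenvalue:", round(np.linalg.eigvalsh(A).min(), 4))  # positive, agreeing with Sylvester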

Essential vs Optional: Theoretical ML

Theoretical (essential)#

  • PSD/PD definitions via quadratic form and eigenvalues.

  • Sylvester’s criterion (leading principal minors > 0 for PD).

  • Schur complement PSD condition.

  • Mercer’s theorem (kernel PSD).

  • Cholesky existence for SPD.

Applied (landmark systems)#

  • SVM with PSD kernels (Cortes–Vapnik 1995).

  • Gaussian Processes (Rasmussen–Williams 2006) using PSD kernels and Cholesky.

  • Kernel Ridge Regression (Murphy 2022) solved via SPD systems.

  • Whitening and Mahalanobis metrics in anomaly detection (De Maesschalck 2000).

  • GP uncertainty calibration and jitter practice (Seeger 2004).

Key ideas: Where it shows up
  1. Covariance and PCA

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c \succeq 0$; its eigenvalues are the variances along the principal directions. PCA uses this PSD structure. References: Jolliffe 2002; Shlens 2014.

  2. Kernels and SVMs/GPs

  • RBF/linear/polynomial kernels yield PSD $K$; SVM dual and GP posteriors rely on PSD/PD for convexity and validity. References: Cortes–Vapnik 1995; Schölkopf–Smola 2002; Rasmussen–Williams 2006.

  3. Hessians and convexity

  • For twice-differentiable losses, PSD Hessian implies convex; PD gives strict convexity. References: Boyd–Vandenberghe 2004; Nocedal–Wright 2006.

  4. Mahalanobis distance & whitening

  • SPD metrics reweight features; whitening uses $\Sigma^{-1/2}$. References: De Maesschalck et al. 2000 (Mahalanobis in chemometrics); Murphy 2022.

  5. Cholesky-based solvers (KRR/GP)

  • SPD kernel matrices solved via Cholesky for stability; jitter added for near-singular cases. References: Rasmussen–Williams 2006; Seeger 2004.

Notation
  • PSD/PD: $A\succeq 0$, $A\succ 0$; quadratic form $x^\top A x$.

  • Eigenvalues: $\lambda_\min(A)$, $\lambda_\max(A)$ with $\lambda_\min \ge 0$ for PSD.

  • Gram/kernel: $G=XX^\top$; $K_{ij}=k(x_i,x_j)$.

  • Cholesky: $A=LL^\top$ (SPD); solve $Ax=b$ via forward/back-substitution.

  • Schur complement: For blocks, $S = C - B^\top A^{-1} B$.

  • Examples:

    • PSD check: eigenvalues $\ge -\varepsilon$ after symmetrization $\tfrac{1}{2}(A+A^\top)$; a clipping-based repair is sketched after these examples.

    • Kernel matrix for RBF: $K_{ij}=\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.

    • Mahalanobis: $d_M(x,y)^2=(x-y)^\top M (x-y)$ with SPD $M$.
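
The symmetrize-then-check recipe above extends naturally to a repair step that clips negative eigenvalues to zero; `clip_psd` is an illustrative helper name, not a library routine:

import numpy as np

def clip_psd(A, floor=0.0):
    # Symmetrize, then clip eigenvalues below `floor` (nearest-PSD-style repair)
    A = 0.5 * (A + A.T)
    w, Q = np.linalg.eigh(A)
    w = np.clip(w, floor, None)
    return Q @ np.diag(w) @ Q.T

M = np.array([[1.0, 0.9], [0.9, 0.8]])  # det < 0, so one eigenvalue is negative
print("min eig before:", round(np.linalg.eigvalsh(M).min(), 6))
print("min eig after:", round(np.linalg.eigvalsh(clip_psd(M)).min(), 6))  # ~0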

Pitfalls & sanity checks
  • Symmetry: enforce $A = \tfrac{1}{2}(A+A^\top)$ before PSD checks.

  • Near-singular PSD: add jitter; do not invert directly.

  • Covariance: center data before forming $\Sigma$; otherwise PSD still holds but principal directions change.

  • Kernel parameters: very small/large length-scales can hurt conditioning.

  • Cholesky failures: indicate a non-PSD matrix or insufficient jitter; a retry helper is sketched below.
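
One common defensive pattern, sketched here rather than prescribed, retries Cholesky with geometrically growing jitter; `chol_with_jitter` is a hypothetical helper:

import numpy as np

def chol_with_jitter(K, eps0=1e-10, max_tries=10):
    # Try Cholesky, escalating jitter by 10x until the factorization succeeds
    eps = eps0
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + eps * np.eye(K.shape[0])), eps
        except np.linalg.LinAlgError:
            eps *= 10.0
    raise np.linalg.LinAlgError("matrix appears far from PSD; check its construction")

np.random.seed(2)
X = np.random.randn(8, 2) * 0.1                 # near-duplicate points -> ill-conditioned Gram
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / (2 * 0.5 ** 2))                # RBF kernel matrix
L, eps = chol_with_jitter(K)
print("Cholesky succeeded with jitter:", eps)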

References

Foundations

  1. Horn, R. A., & Johnson, C. R. (2013). Matrix Analysis (2nd ed.).

  2. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.

  3. Golub, G., & Van Loan, C. (2013). Matrix Computations (4th ed.).

Kernels and probabilistic models

  4. Mercer, J. (1909). Functions of positive and negative type.

  5. Schölkopf, B., & Smola, A. (2002). Learning with Kernels.

  6. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.

  7. Seeger, M. (2004). Gaussian processes for machine learning (tutorial).

Optimization and metrics

  8. Nocedal, J., & Wright, S. (2006). Numerical Optimization.

  9. De Maesschalck, R., et al. (2000). The Mahalanobis distance.

Applications and practice

  10. Cortes, C., & Vapnik, V. (1995). Support-vector networks.

  11. Murphy, K. (2022). Probabilistic Machine Learning: An Introduction.

  12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.

Five worked examples

Worked Example 1: Checking PSD/PD and using Cholesky vs eigenvalues#

Introduction#

Compare two PSD tests, eigenvalues and Cholesky; show failure on a non-PSD matrix and success after a jitter fix.

Purpose#

Provide practical PSD diagnostics and repair.

Importance#

Prevents crashes/NaNs in kernel methods and GP solvers.

What this example demonstrates#

  • Symmetrization, eigenvalue check, and Cholesky factorization.

  • Adding jitter ($\epsilon I$) can restore PD for near-PSD matrices.

Background#

Cholesky is preferred for SPD solves; fails fast if not SPD.

Historical context#

Cholesky (early 1900s) for efficient SPD factorization.

Prevalence in ML#

Common in GP, KRR, and covariance handling.

Notes#

  • Always symmetrize numerically; use small jitter (e.g., $10^{-6}$) when needed.

Connection to ML#

Stable training/inference for kernel models requires PSD kernels; jitter is standard.

Connection to Linear Algebra Theory#

Eigenvalue PSD characterization; $A=LL^\top$ iff $A$ is SPD.

Pedagogical Significance#

Hands-on PSD validation and fix.

References#

  1. Golub & Van Loan (2013). Matrix Computations.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(0)
B = np.random.randn(5, 5)
A = B @ B.T           # Gram construction guarantees symmetric PSD (PD almost surely)
A = 0.5 * (A + A.T)   # symmetrize defensively against round-off
A_bad = A.copy()
A_bad[0, 0] = -5.0    # make indefinite (min eigenvalue <= smallest diagonal entry)

def is_psd_eig(M, tol=1e-10):
    w = np.linalg.eigvalsh(M)
    return np.all(w >= -tol), w

for name, M in [("good", A), ("bad", A_bad)]:
    ok, w = is_psd_eig(M)
    print(f"{name}: min eig={w.min():.4f} PSD? {ok}")
    try:
        L = np.linalg.cholesky(M)
        print(f"{name}: Cholesky succeeded, diag min {np.min(np.diag(L)):.4f}")
    except np.linalg.LinAlgError:
        print(f"{name}: Cholesky failed")

# Jitter fix: shift by slightly more than |min eigenvalue| so A_fix is PD
# (a fixed tiny eps cannot repair an eigenvalue near -5)
_, w_bad = is_psd_eig(A_bad)
eps = max(0.0, -w_bad.min()) + 1e-6
A_fix = A_bad + eps * np.eye(A_bad.shape[0])
ok, w = is_psd_eig(A_fix)
print("jittered: min eig=", round(w.min(), 4), "PSD?", ok)
L = np.linalg.cholesky(A_fix)
print("jittered Cholesky diag min:", round(np.min(np.diag(L)), 4))

Worked Example 2: Covariance is PSD; whitening via eigenvalues#

Introduction#

Demonstrate PSD of covariance and perform whitening using eigenvalues; verify variance becomes identity.

Purpose#

Show practical PSD use in preprocessing and stability.

Importance#

Whitening decorrelates features; PSD guarantees nonnegative variances.

What this example demonstrates#

  • $\Sigma=\tfrac{1}{n}X_c^\top X_c$ is PSD.

  • Whitening transform $W=\Lambda^{-1/2} Q^\top$ yields covariance close to identity.

Background#

PCA/whitening are PSD-based transforms; eigenvalues must be nonnegative.

Historical context#

Classical in signal processing; ubiquitous in ML preprocessing.

Prevalence in ML#

Used in vision, speech, and as a step in ICA and some deep pipelines.

Notes#

  • Add small floor to tiny eigenvalues for numerical stability.

Connection to ML#

Stabilizes downstream models; aligns scales across dimensions.

Connection to Linear Algebra Theory#

PSD eigen-structure; square roots and inverse square roots well-defined for PD.

Pedagogical Significance#

Concrete PSD-to-transform pipeline.

References#

  1. Jolliffe (2002). PCA.

  2. Shlens (2014). PCA tutorial.

Solution (Python)#

import numpy as np

np.random.seed(1)
n, d = 200, 5
X = np.random.randn(n, d) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0, keepdims=True)

Sigma = (Xc.T @ Xc) / n
evals, evecs = np.linalg.eigh(Sigma)

# Whitening
floor = 1e-6
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(evals + floor))
W = Lambda_inv_sqrt @ evecs.T
Xw = (W @ Xc.T).T
Sigma_w = (Xw.T @ Xw) / n

print("Sigma PSD? min eig=", round(evals.min(), 6))
print("Whitened covariance diag:", np.round(np.diag(Sigma_w), 4))

Worked Example 3: Kernel matrix PSD and jitter for stability (RBF)#

Introduction#

Build an RBF kernel matrix, verify PSD via eigenvalues/Cholesky, and show how jitter fixes near-singularity.

Purpose#

Connect kernel validity to PSD checks used in SVM/GP/KRR implementations.

Importance#

Kernel methods rely on PSD to remain convex and numerically stable.

What this example demonstrates#

  • RBF kernel is PSD; small datasets can be nearly singular when points are close.

  • Jitter improves conditioning and enables Cholesky.

Background#

Mercer kernels generate PSD Gram matrices; RBF is a classic example.

Historical context#

Kernel trick popularized SVMs/GPs; jitter common in GP practice.

Prevalence in ML#

Standard in GPs, KRR, and kernel SVMs.

Notes#

  • Condition number matters for solves; inspect spectrum.

Connection to ML#

Stable training/inference for kernel models.

Connection to Linear Algebra Theory#

Eigenvalue PSD test; Cholesky existence for SPD.

Pedagogical Significance#

Shows PSD verification and repair on a kernel matrix.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

Solution (Python)#

import numpy as np

np.random.seed(2)
n, d = 12, 3
X = np.random.randn(n, d) * 0.2  # close points -> near-singular

def rbf_kernel(A, sigma=0.5):
    A2 = (A**2).sum(1)[:, None]
    D2 = np.maximum(A2 + A2.T - 2 * A @ A.T, 0.0)  # clip tiny round-off negatives
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X)
evals = np.linalg.eigvalsh(K)
print("min eig:", round(evals.min(), 8), "cond:", float(evals.max() / max(evals.min(), 1e-12)))

try:
    L = np.linalg.cholesky(K)
    print("Cholesky ok, min diag:", round(np.min(np.diag(L)), 6))
except np.linalg.LinAlgError:
    print("Cholesky failed, adding jitter")
    eps = 1e-4
    K = K + eps * np.eye(n)
    L = np.linalg.cholesky(K)
    print("Cholesky ok after jitter, min diag:", round(np.min(np.diag(L)), 6))

Worked Example 4: Mahalanobis distance with SPD metric#

Introduction#

Construct an SPD metric, compute Mahalanobis distances, and relate to whitening.

Purpose#

Show how SPD matrices define learned distances and why PSD/PD matters.

Importance#

Metric learning, anomaly detection, and clustering rely on valid metrics.

What this example demonstrates#

  • $d_M(x,y)^2 = (x-y)^\top M (x-y)$ with $M\succ 0$; relate to Euclidean distance after whitening.

Background#

SPD metrics generalize Euclidean geometry; learned metrics often constrain $M\succeq 0$.

Historical context#

Mahalanobis (1936); modern metric learning enforces PSD.

Prevalence in ML#

Used in k-NN with learned metrics, anomaly scores, and Gaussian modeling.

Notes#

  • Ensure $M$ is SPD (eigenvalues > 0); use Cholesky or eig clip.

Connection to ML#

Improves retrieval and clustering by reweighting feature space.

Connection to Linear Algebra Theory#

SPD defines inner products; whitening via $M^{1/2}$ links to Euclidean distance.

Pedagogical Significance#

Concrete example of PSD/PD defining geometry.

References#

  1. De Maesschalck et al. (2000). The Mahalanobis distance.

  2. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(3)
d = 4
A = np.random.randn(d, d)
M = A.T @ A + 0.5 * np.eye(d)  # SPD

x = np.random.randn(d)
y = np.random.randn(d)
diff = x - y
mah2 = diff.T @ M @ diff
eucl2 = np.dot(diff, diff)

evals = np.linalg.eigvalsh(M)
print("M min eig:", round(evals.min(), 6))
print("Mahalanobis^2:", round(mah2, 6), " Euclidean^2:", round(eucl2, 6))

Worked Example 5: Kernel ridge regression via Cholesky (SPD solve)#

Introduction#

Solve KRR with an SPD kernel matrix using Cholesky; show stability vs. explicit inverse.

Purpose#

Demonstrate practical SPD solve in a common ML method.

Importance#

Avoids numerical issues and is standard in GP/KRR implementations.

What this example demonstrates#

  • Solve $(K+\lambda I)\alpha = y$ via Cholesky; predictions $\hat{y}=K\alpha$.

Background#

Ridge regularization makes $K+\lambda I$ SPD.

Historical context#

Kernel methods mainstream since 1990s; Cholesky is the standard linear solve.

Prevalence in ML#

Regression/classification with kernels; GP regression uses same linear system.

Notes#

  • Regularization size affects conditioning; add jitter if needed.

Connection to ML#

Core kernel regression solve; identical algebra to GP posterior mean.

Connection to Linear Algebra Theory#

SPD guarantees unique solution and stable Cholesky factorization.

Pedagogical Significance#

Shows end-to-end use of SPD property in a solver.

References#

  1. Schölkopf & Smola (2002). Learning with Kernels.

  2. Rasmussen & Williams (2006). Gaussian Processes for ML.

  3. Murphy (2022). Probabilistic Machine Learning.

Solution (Python)#

import numpy as np

np.random.seed(4)
n, d = 40, 3
X = np.random.randn(n, d)

def rbf_kernel(A, B=None, sigma=1.0):
    if B is None:
        B = A
    A2 = (A**2).sum(1)[:, None]
    B2 = (B**2).sum(1)[None, :]
    D2 = np.maximum(A2 + B2 - 2 * A @ B.T, 0.0)  # clip tiny round-off negatives
    return np.exp(-D2 / (2 * sigma**2))

K = rbf_kernel(X, X, sigma=1.2)
w_true = np.random.randn(n)
y = K @ w_true + 0.05 * np.random.randn(n)

lam = 1e-2
K_reg = K + lam * np.eye(n)
L = np.linalg.cholesky(K_reg)

# Solve via Cholesky: L z = y, then L^T alpha = z (two triangular solves;
# scipy.linalg.solve_triangular would exploit the triangular structure explicitly)
z = np.linalg.solve(L, y)
alpha = np.linalg.solve(L.T, z)
y_hat = K @ alpha

rmse = np.sqrt(np.mean((y_hat - y)**2))
print("Cholesky solve RMSE:", round(rmse, 6))
print("SPD? min eig(K_reg)=", round(np.linalg.eigvalsh(K_reg).min(), 6))
