Introduction#
Most real-world data matrices are sparse: social networks are sparse graphs; text is sparse (a million-word vocabulary, but each document contains only hundreds of words); images are sparse in wavelet domains; scientific simulations generate sparse discretizations. Dense matrices require $O(n^2)$ memory and $O(n^3)$ computation for factorization; sparse matrices require only $O(\text{nnz})$ memory. A sparse matrix with $\text{nnz} = O(n)$ or $O(n \log n)$ reduces complexity from cubic to linear or nearly linear, which is transformational: it enables algorithms that would be intractable on dense representations. Exploiting sparsity, however, requires careful algorithm design: level-of-fill control, reordering strategies, specialized data structures, and communication patterns in distributed solvers.
Important ideas#
Sparse matrix representations
Compressed Sparse Row (CSR): store nonzeros by row; three arrays (values, column indices, row pointers); $O(\text{nnz} + n)$ memory.
Compressed Sparse Column (CSC): symmetric role for columns; efficient for column access.
Coordinate format (COO): list of $(i, j, v)$ triples; inefficient for computation but flexible for construction.
Diagonal, tridiagonal, block-structured: exploit special structure for even greater efficiency.
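These layouts can be inspected directly. A minimal sketch, assuming SciPy (one common implementation, used here only for illustration), builds a small matrix from COO triples and converts it to CSR to expose the three-array layout:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Build a 3x3 sparse matrix from (i, j, v) triples (COO is the natural
# construction format), then convert to CSR for computation.
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([0, 2, 1, 0, 1, 2])
vals = np.array([4.0, 1.0, 3.0, 1.0, 2.0, 5.0])

A = coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()

print(A.data)     # values[nnz]: nonzero values, row by row
print(A.indices)  # col[nnz]:    column index of each value
print(A.indptr)   # row_ptr[n+1]: row i spans indptr[i]..indptr[i+1]
```

Note that `row_ptr` has $n + 1$ entries so that the extent of the last row is encoded without a special case.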
Sparsity patterns and fill-in
Reachability: a nonzero $(i, j)$ entry corresponds to an edge from $i$ to $j$ in the matrix's graph; paths in this graph determine reachability, which governs both fill-in during elimination and the communication pattern.
Fill-in: $L, U$ in $A = LU$ can have more nonzeros than $A$; reordering (minimum degree, nested dissection) minimizes fill-in.
Symbolic factorization: determine nonzero pattern of $L, U$ without computing values.
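The effect of ordering on fill-in shows up clearly on an "arrow" matrix, where eliminating the dense first row/column first fills the entire factor. A sketch, assuming SciPy's `splu` (a SuperLU wrapper), compares natural ordering against the COLAMD fill-reducing ordering:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import splu

# "Arrow" matrix: dense first row and column plus a dominant diagonal.
# Eliminating node 0 first connects all remaining nodes -> dense factors.
n = 50
A = lil_matrix((n, n))
A.setdiag(2.0 * n)      # strong diagonal, so no row pivoting occurs
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = 2.0 * n
A = A.tocsc()

lu_nat = splu(A, permc_spec="NATURAL")  # keep the given column order
lu_amd = splu(A, permc_spec="COLAMD")   # fill-reducing reordering

fill_nat = lu_nat.L.nnz + lu_nat.U.nnz
fill_amd = lu_amd.L.nnz + lu_amd.U.nnz
print(fill_nat, fill_amd)  # natural ordering fills ~n^2/2 entries
```

With the reordering, the arrow node is eliminated last and the factors stay about as sparse as $A$ itself.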
Matrix-vector products and efficient computation
Dense $y = A x$ costs $O(n^2)$; sparse costs $O(\text{nnz})$.
SpMV (sparse matrix-vector product) is bandwidth-limited on modern hardware; careful data layout crucial.
GPU sparse kernels: cuSPARSE; CPU sparse kernels: MKL Sparse BLAS, OpenMP parallelization.
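A reference CSR SpMV makes the $O(\text{nnz})$ access pattern explicit: one pass over `values`/`col`, with `row_ptr` delimiting each row. Production kernels (MKL, cuSPARSE) implement this same loop with bandwidth-oriented layouts and parallelization; the sketch below is checked against SciPy's built-in product (SciPy assumed only for the comparison):

```python
import numpy as np
from scipy.sparse import random as sprandom

def csr_spmv(data, indices, indptr, x):
    """y = A @ x for a CSR matrix: O(nnz) work, one pass over the arrays."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):  # nonzeros of row i
            y[i] += data[k] * x[indices[k]]
    return y

A = sprandom(200, 200, density=0.05, format="csr", random_state=0)
x = np.random.default_rng(0).standard_normal(200)
y = csr_spmv(A.data, A.indices, A.indptr, x)
print(np.allclose(y, A @ x))
```

The inner loop's indirect load `x[indices[k]]` is why SpMV is memory-bandwidth-limited rather than compute-limited.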
Direct sparse solvers
LU with partial pivoting: reorder to minimize fill-in, apply pivoting.
Cholesky for SPD: symmetric structure implies symmetric reordering; often very sparse.
Supernodal methods: group small dense blocks for cache efficiency; key to modern solvers (PARDISO, SuperLU, UMFPACK).
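A minimal direct-solve sketch, assuming SciPy's `splu` (which wraps SuperLU), on a tridiagonal SPD system from the 1-D Poisson equation:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu

# 1-D Poisson matrix: tridiagonal, SPD, a classic sparse test problem.
n = 100
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = splu(A)       # sparse LU with a fill-reducing column ordering
x = lu.solve(b)    # the factorization can be reused for many right-hand sides
print(np.linalg.norm(A @ x - b))
```

Factoring once and calling `solve` repeatedly is the main practical advantage of direct methods when many right-hand sides share one matrix.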
Iterative sparse solvers
CG, GMRES, MINRES: require only matrix-vector products; no explicit factorization.
Preconditioning essential: ILU (incomplete LU), IC (incomplete Cholesky), algebraic multigrid (AMG) enable convergence.
Matrix-free methods: supply only matrix-vector product function; never form or store matrix.
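The matrix-free idea can be sketched with SciPy's `LinearOperator` (one API choice, assumed for illustration): only a matvec callback for the 1-D Laplacian is supplied, and CG never sees an explicit matrix:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Matrix-free CG: the 1-D Laplacian (tridiagonal [-1, 2, -1]) is applied
# by a function; no matrix is ever formed or stored.
n = 100

def laplacian_matvec(x):
    y = 2.0 * x
    y[:-1] -= x[1:]   # superdiagonal contribution
    y[1:] -= x[:-1]   # subdiagonal contribution
    return y

A = LinearOperator((n, n), matvec=laplacian_matvec)
b = np.ones(n)
x, info = cg(A, b)    # info == 0 signals convergence
print(info)
```

This is exactly the contract iterative solvers require: a black-box $x \mapsto Ax$, which is why they scale to problems where even CSR storage is infeasible.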
Graph algorithms and adjacency matrices
Adjacency matrix $A$ where $A_{ij} = 1$ if edge $(i, j)$ exists; number of nonzeros = $2 \times$ edges (undirected).
Shortest paths, PageRank, diffusion: powers of the adjacency matrix encode walks and reachability, so these algorithms reduce to repeated sparse matrix-vector products.
Laplacian matrix $L = D - A$ (degree diagonal minus adjacency); eigenvectors yield graph partitions (spectral clustering).
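Spectral partitioning from the Laplacian can be sketched on a toy graph: two triangles joined by a single bridge edge, where the sign pattern of the Fiedler vector (eigenvector of the second-smallest Laplacian eigenvalue) recovers the cut. A dense eigensolver is used since the graph is tiny; node numbering is of course illustrative:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import laplacian

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = lil_matrix((6, 6))
for i, j in edges:
    A[i, j] = 1.0
    A[j, i] = 1.0           # undirected: each edge stored twice

L = laplacian(A.tocsr())             # L = D - A
w, V = np.linalg.eigh(L.toarray())   # tiny graph: dense eigensolver is fine
fiedler = V[:, 1]                    # eigenvector for second-smallest eigenvalue
labels = fiedler > 0                 # sign pattern = 2-way partition
print(labels)
```

The partition splits the graph exactly at the bridge, which is the minimum cut here.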
Storage and computational complexity trade-offs
Dense $A \in \mathbb{R}^{n \times n}$: $O(n^2)$ memory, $O(n^3)$ LU, $O(n^2)$ matrix-vector product.
Sparse $A$ with $\text{nnz} = O(n)$: $O(n)$ memory, $O(n^{1.5})$ LU (with reordering), $O(n)$ SpMV.
For $n = 10^6$: dense requires about 8 terabytes; sparse CSR with $\text{nnz} = 10^7$ requires on the order of 100 megabytes.
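These figures are easy to check directly, assuming 8-byte floating-point values and 4-byte integer indices (index width varies by library):

```python
# Back-of-envelope memory for n = 1e6, nnz = 1e7.
n = 10**6
nnz = 10**7

dense_bytes = 8 * n * n                      # n^2 double-precision entries
csr_bytes = 8 * nnz + 4 * nnz + 4 * (n + 1)  # values + col + row_ptr

print(dense_bytes / 1e12)  # terabytes for the dense matrix
print(csr_bytes / 1e6)     # megabytes for the CSR matrix
```

That is roughly a factor of $10^4$ in storage before any computational savings are counted.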
Relevance to ML#
Text and NLP: Document-term matrices are sparse (million words, hundreds per document); sparse SVD enables topic models.
Graphs and networks: Social networks, citation graphs, knowledge graphs are sparse; graph neural networks leverage adjacency matrix sparsity.
Recommender systems: User-item ratings sparse ($\sim 0.1\%$ observed); matrix completion via sparse factorization.
Inverse problems and imaging: Medical imaging, seismic inversion, tomography discretizations are sparse; sparse solvers essential.
Large-scale optimization: SGD and proximal methods operate on sparse data; distributed solvers exploit communication sparsity.
Algorithmic development (milestones)#
1960s: Sparse matrix computation recognized as practical necessity in structural engineering (finite element analysis).
1967: Tinney & Walker develop minimum degree ordering; George later extends this line of work with nested dissection (1973).
1979: Duff develops efficient symbolic factorization algorithms.
1980s: Supernodal methods developed (Ashcraft, Eisenstat); enable cache-efficient dense kernels within sparse framework.
1990s: PARDISO, SuperLU, UMFPACK emerge as production sparse solvers.
2000s: Algebraic multigrid (AMG) becomes standard preconditioner for sparse iterative methods.
2005–2010: GPU sparse kernels (cuSPARSE) enable acceleration; sparse matrix formats adapt to GPU memory.
2010s: Sparse deep learning (neural networks on graph-structured data); graph convolutional networks (GCN), graph attention networks (GAT) exploit adjacency matrix sparsity.
Definitions#
Sparse matrix: $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A) \ll n^2$ nonzeros; typically $\text{nnz} = O(n)$ or $O(n \log n)$.
Sparsity: $\text{sparsity} = 1 - \frac{\text{nnz}(A)}{n^2}$ (fraction of zeros); $\text{sparsity} > 0.99$ is typical for ML.
Compressed Sparse Row (CSR): Three arrays: values[nnz], col[nnz], row_ptr[n+1]; row_ptr[i] points to the start of row $i$ in values/col.
Fill-in: Nonzero entries created during factorization that were zero in the original matrix.
Reordering: Permutation $P$ such that $P A P^\top$ has lower fill-in; minimum degree, nested dissection are common strategies.
Graph of a matrix: Undirected graph $G$ where vertices are matrix indices, edge $(i, j)$ exists iff $A_{ij} \ne 0$.
Degree: Vertex $i$ has degree = number of nonzeros in row/column $i$.
Laplacian matrix: $L = D - A$ where $D = \text{diag}(d_1, \ldots, d_n)$ is degree matrix, $A$ is adjacency matrix.