Overview#
What Does g-learn Do?#
The main purpose of g-learn is to estimate the algebraic quantity

$$\mathrm{trace}\left(f(\mathbf{A})\right), \tag{1}$$

where $\mathbf{A}$ is a square matrix, $f$ is a matrix function, and $\mathrm{trace}(\cdot)$ is the trace operator. g-learn can also compute variants of (1), such as

$$\mathrm{trace}\left(\mathbf{B} f(\mathbf{A})\right), \tag{2}$$

and

$$\mathrm{trace}\left(\mathbf{B} f(\mathbf{A}) \, \mathbf{C} f(\mathbf{A})\right), \tag{3}$$

where $\mathbf{B}$ and $\mathbf{C}$ are matrices.
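To make (1) concrete, the following plain-NumPy sketch (not g-learn's interface) evaluates $\mathrm{trace}(f(\mathbf{A}))$ for a small symmetric positive-definite matrix by applying $f$ to the eigenvalues of $\mathbf{A}$, which is only feasible for small matrices:

```python
import numpy as np

def trace_function(A, f):
    """trace(f(A)) for a symmetric matrix A, using the spectral identity
    trace(f(A)) = sum_i f(lambda_i) over the eigenvalues lambda_i of A."""
    return np.sum(f(np.linalg.eigvalsh(A)))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 10))
A = X @ X.T + 10.0 * np.eye(10)       # small symmetric positive-definite matrix

print(trace_function(A, np.sqrt))     # trace of the matrix square root of A
print(trace_function(A, np.exp))      # trace of the matrix exponential of A
```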
Where Can g-learn Be Applied?#
Although g-learn is tailored to specific algebraic computations, it addresses a demanding and challenging computational task with a wide range of applications. Namely, the expression (1) is ubiquitous in a variety of applications [R1], and in fact, it is often the most computationally expensive term in such applications. Some common examples of $f$ and the quantities they yield are listed below.
Log-Determinant#
If $f: \lambda \mapsto \log \lambda$, then (1) is the log-determinant of $\mathbf{A}$, which frequently appears in statistics and machine learning, particularly in log-likelihood functions [R2].
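For instance, the identity $\log\det(\mathbf{A}) = \mathrm{trace}(\log \mathbf{A})$ can be verified on a small matrix with plain NumPy (an illustration of the quantity only, not g-learn's interface):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))
A = X @ X.T + 100.0 * np.eye(100)     # symmetric positive-definite

# Log-determinant computed directly ...
sign, logdet = np.linalg.slogdet(A)

# ... agrees with trace(f(A)) for f = log, evaluated via the eigenvalues
logdet_via_trace = np.sum(np.log(np.linalg.eigvalsh(A)))

print(sign, logdet, logdet_via_trace)  # sign = 1.0, the two values match
```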
Trace of Matrix Powers#
If $f: \lambda \mapsto \lambda^p$, $p \in \mathbb{R}$, then (1) is $\mathrm{trace}(\mathbf{A}^p)$. Interesting cases are the negative powers, such as the trace of the inverse, $\mathrm{trace}(\mathbf{A}^{-1})$, where $\mathbf{A}^{-1}$ is only implicitly known. This class of functions frequently appears in statistics and machine learning. For instance, $\mathrm{trace}(\mathbf{A}^{-1})$ and $\mathrm{trace}(\mathbf{A}^{-2})$ appear in the Jacobian and Hessian of log-likelihood functions, respectively [R3].
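As an illustration of the case $f: \lambda \mapsto \lambda^{-1}$, the sketch below computes $\mathrm{trace}(\mathbf{A}^{-1})$ without forming $\mathbf{A}^{-1}$ explicitly, by solving one linear system per unit vector (plain SciPy, unrelated to g-learn's implementation):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
n = 50
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)            # symmetric positive-definite

# trace(A^{-1}) = sum_i e_i^T A^{-1} e_i, computed by solving A x = e_i;
# the inverse itself is never formed explicitly.
factor = cho_factor(A)
identity = np.eye(n)
trace_inv = sum(cho_solve(factor, identity[:, i])[i] for i in range(n))

# Reference value from the eigenvalues: trace(A^{-1}) = sum_i 1 / lambda_i
print(trace_inv, np.sum(1.0 / np.linalg.eigvalsh(A)))
```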
Schatten $p$-norm and Schatten $p$-anti-norm#
If $f: \lambda \mapsto \lambda^p$, then $\left(\mathrm{trace}(\mathbf{A}^p)\right)^{1/p}$ is the Schatten $p$-norm of $\mathbf{A}$ (if $p > 0$), and is the Schatten $p$-anti-norm (if $p < 0$). The Schatten norm has applications in rank-constrained optimization in machine learning.
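A small numerical check of this quantity for a symmetric positive-definite matrix, in plain NumPy for illustration only (note that $p = 2$ recovers the Frobenius norm):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 30))
A = X @ X.T + 30.0 * np.eye(30)        # symmetric positive-definite
eig = np.linalg.eigvalsh(A)

def schatten(p):
    """(trace(A^p))^(1/p): a norm for p > 0, an anti-norm for p < 0."""
    return np.sum(eig ** p) ** (1.0 / p)

print(schatten(2.0), np.linalg.norm(A, 'fro'))  # p = 2 matches the Frobenius norm
print(schatten(-1.0))                           # a Schatten anti-norm
```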
Eigencount and Numerical Rank#
If $f: \lambda \mapsto \mathbf{1}_{[a, b]}(\lambda)$ is the indicator (step) function of the interval $[a, b]$, then (1) estimates the number of non-zero eigenvalues of $\mathbf{A}$ in that interval, which is an inexpensive method to estimate the rank of a large matrix. Eigencount is closely related to Principal Component Analysis (PCA) and low-rank approximations in machine learning.
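The eigencount of an interval can be mimicked on a small matrix by applying the indicator function to the spectrum (plain NumPy, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 20))
A = X @ X.T                       # rank-20 positive semi-definite 200 x 200 matrix

eig = np.linalg.eigvalsh(A)
a, b = 1e-8, np.inf               # count eigenvalues above a small threshold
eigencount = np.sum((eig > a) & (eig < b))

print(eigencount, np.linalg.matrix_rank(A))  # both report the numerical rank, 20
```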
Estrada Index of Graphs#
If $f: \lambda \mapsto \exp(\lambda)$, then (1) is the Estrada index of $\mathbf{A}$, which has applications in computational biology, such as protein folding.
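For a small graph, the Estrada index $\mathrm{trace}(\exp(\mathbf{A}))$ of an adjacency matrix can be checked against SciPy's matrix exponential (again, only an illustration):

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a 4-cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

estrada_via_eig = np.sum(np.exp(np.linalg.eigvalsh(A)))   # trace(f(A)) with f = exp
estrada_via_expm = np.trace(expm(A))                       # direct matrix exponential

print(estrada_via_eig, estrada_via_expm)                   # identical values
```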
Spectral Density#
If $f: \lambda \mapsto \delta(\lambda - \mu)$, where $\delta$ is the Dirac delta function, then (1) yields the spectral density of the eigenvalues of $\mathbf{A}$ at $\mu$. Estimating the spectral density of matrices, also known as the Density of States (DOS), is a common problem in solid-state physics.
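A crude spectral-density (DOS) estimate replaces the Dirac delta with a narrow Gaussian; the sketch below does this with the full spectrum of a small matrix, whereas stochastic methods achieve the same goal without computing any eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 300))
A = (X + X.T) / np.sqrt(2 * 300)       # symmetric random matrix (Wigner-like)
eig = np.linalg.eigvalsh(A)

def spectral_density(mu, sigma=0.05):
    """Smoothed DOS: average of Gaussian bumps replacing delta(lambda - mu)."""
    return np.mean(np.exp(-(eig - mu) ** 2 / (2 * sigma ** 2))
                   / (sigma * np.sqrt(2 * np.pi)))

grid = np.linspace(-2.5, 2.5, 11)
print([round(spectral_density(mu), 3) for mu in grid])  # approximates the semicircle law
```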
Randomized Algorithms For Massive Data#
Calculating (1) and its variants is a computational challenge when:

- Matrices are very large. In practical applications, the matrix size could range from millions to billions.
- $f(\mathbf{A})$ cannot be computed directly, or is only known implicitly, such as $\mathbf{A}^{-1}$.
g-learn employs randomized (stochastic) algorithms [R5] to fulfill the demand for computing (1) on massive data. Such algorithms are fast and scalable to large matrices. g-learn implements the following randomized algorithms:
Hutchinson's Method#
Hutchinson's method is the earliest randomized method employed to estimate the trace of the inverse of an invertible matrix [R6]. g-learn implements Hutchinson's method to compute (1), (2), and (3) for $f: \lambda \mapsto \lambda^{-1}$.
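The idea behind the Hutchinson estimator can be sketched in a few lines of NumPy: with Rademacher probe vectors $\mathbf{z}$, $\mathbb{E}[\mathbf{z}^\intercal \mathbf{A}^{-1} \mathbf{z}] = \mathrm{trace}(\mathbf{A}^{-1})$, so averaging over a number of probes gives an unbiased estimate. This is a bare-bones illustration, not g-learn's optimized implementation:

```python
import numpy as np

def hutchinson_traceinv(A, num_samples=200, seed=0):
    """Monte-Carlo estimate of trace(A^{-1}) using Rademacher probe vectors.
    Each probe requires one linear solve with A; A^{-1} is never formed."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    estimate = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)     # Rademacher probe
        estimate += z @ np.linalg.solve(A, z)   # z^T A^{-1} z
    return estimate / num_samples

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 100))
A = X @ X.T + 100.0 * np.eye(100)               # symmetric positive-definite

exact = np.sum(1.0 / np.linalg.eigvalsh(A))     # exact trace of the inverse
print(hutchinson_traceinv(A), exact)            # stochastic estimate vs exact
```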
Stochastic Lanczos Quadrature Method#
The Stochastic Lanczos Quadrature (SLQ) method [R7] combines two of the greatest algorithms of the century in applied mathematics, namely the Monte-Carlo method and the Lanczos algorithm (also, the Golub-Kahan-Lanczos algorithm) [R8] [R9], together with Gauss quadrature, to estimate (1) for an analytic function $f$ and a symmetric positive-definite matrix $\mathbf{A}$. g-learn provides an efficient and scalable implementation of the SLQ method.
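A minimal SLQ sketch in NumPy may help make the idea concrete: each random probe vector seeds a few Lanczos steps, and a Gauss quadrature rule derived from the resulting tridiagonal matrix approximates that probe's contribution to the trace. This is a didactic sketch (with full reorthogonalization and no convergence control), not g-learn's optimized C++/CUDA implementation:

```python
import numpy as np

def slq_trace(A, f, num_probes=30, lanczos_steps=30, seed=0):
    """Stochastic Lanczos Quadrature sketch: estimate trace(f(A)) for a
    symmetric positive-definite A using Gauss quadrature on the Lanczos
    tridiagonal matrices built from random starting vectors."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(num_probes):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)                  # random unit-norm starting vector
        Q = np.zeros((n, lanczos_steps))
        alpha = np.zeros(lanczos_steps)
        beta = np.zeros(lanczos_steps - 1)
        Q[:, 0] = v
        m = lanczos_steps
        for j in range(lanczos_steps):
            w = A @ Q[:, j]
            alpha[j] = Q[:, j] @ w
            # Full reorthogonalization against all previous Lanczos vectors
            w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
            if j < lanczos_steps - 1:
                beta[j] = np.linalg.norm(w)
                if beta[j] < 1e-10:             # Krylov subspace exhausted
                    m = j + 1
                    break
                Q[:, j + 1] = w / beta[j]
        T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
        theta, U = np.linalg.eigh(T)            # quadrature nodes (theta) ...
        tau = U[0, :]                           # ... and weights (tau^2)
        total += np.sum(tau ** 2 * f(theta))    # approximates v^T f(A) v
    return n * total / num_probes

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 200))
A = X @ X.T + 200.0 * np.eye(200)               # symmetric positive-definite

print(slq_trace(A, np.log), np.sum(np.log(np.linalg.eigvalsh(A))))  # SLQ vs exact
```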
Along with the randomized methods, g-learn also provides direct (non-stochastic) methods which are only for benchmarking purposes to test the accuracy of the randomized methods on small matrices.
Applications in Optimization#
A unique and novel feature of g-learn is the ability to interpolate the trace of arbitrary functions of the affine matrix function $t \mapsto \mathbf{A} + t \mathbf{B}$, for a large number of input hyperparameters $t$. Instead of directly computing the above function for every $t$, g-learn interpolates it from only a few sample evaluations, which leads to significant computational savings when (1) has to be evaluated for many values of $t$ [R4].
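The gist can be conveyed with a toy interpolation scheme. The sketch below is not the interpolation method of [R4] used by g-learn; it merely fits a cubic through a few direct evaluations of $\log\det(\mathbf{A} + t\mathbf{B})$ to illustrate why interpolation over $t$ can replace repeated direct computation:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 100))
A = X @ X.T + 100.0 * np.eye(100)    # symmetric positive-definite
B = np.eye(100)                       # affine perturbation: A + t * B

def logdet(t):
    """Direct (expensive) evaluation of logdet(A + t B) for one value of t."""
    return np.linalg.slogdet(A + t * B)[1]

# Evaluate the expensive function at only a few sample points ...
t_samples = np.array([0.0, 1.0, 3.0, 10.0])
coeffs = np.polyfit(t_samples, [logdet(t) for t in t_samples], deg=3)

# ... then interpolate it cheaply anywhere in the sampled range
t_query = 5.0
print(np.polyval(coeffs, t_query), logdet(t_query))  # interpolant vs direct value
```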
Petascale Computing#
The core of g-learn is a high-performance C++/CUDA library capable of running on parallel CPUs or on a GPU farm with multiple GPU devices. g-learn can perform at petascale; for instance, a cluster node with twenty GPUs of NVIDIA Hopper architecture, each delivering 60 TFLOPS, provides over a petaFLOPS of aggregate throughput.
For a gallery of the performance of g-learn on GPU and CPU on massive matrices, see Performance.
References#
[R1] Ubaru, S. and Saad, Y. (2018). Applications of Trace Estimation Techniques. In: High-Performance Computing in Science and Engineering. HPCSE 2017. Lecture Notes in Computer Science, vol 11087. Springer, Cham. doi: 10.1007/978-3-319-97136-0_2.
[R2] Ameli, S. and Shadden, S. C. (2023). A Singular Woodbury and Pseudo-Determinant Matrix Identities and Application to Gaussian Process Regression. Applied Mathematics and Computation 452, 128032. arXiv: 2207.08038 [math.ST].
[R3] Ameli, S. and Shadden, S. C. (2022). Noise Estimation in Gaussian Process Regression. arXiv: 2206.09976 [cs.LG].
[R4] Ameli, S. and Shadden, S. C. (2022). Interpolating Log-Determinant and Trace of the Powers of Matrix $\mathbf{A} + t \mathbf{B}$.
[R5] Mahoney, M. W. (2011). Randomized algorithms for matrices and data. arXiv: 1104.5557 [cs.DS].
[R6] Hutchinson, M. F. (1990). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Comm. Statist. Simulation Comput. Volume 19, Number 2, pp. 433-450. Taylor & Francis. doi: 10.1080/03610919008812866.
[R7] Golub, G. H. and Meurant, G. (2010). Matrices, Moments and Quadrature with Applications. Princeton University Press. isbn: 0691143412. jstor.org/stable/j.ctt7tbvs.
[R8] Dongarra, J. and Sullivan, F. (2000). The Top 10 Algorithms. Computing in Science and Eng. 2, 1, pp. 22-23. doi: 10.1109/MCISE.2000.814652.
[R9] Higham, N. J. (2016). Nicholas J. Higham on the top 10 algorithms in applied mathematics. The Princeton Companion to Applied Mathematics. Princeton University Press.