Comparison of Randomized Algorithms#

imate implements various deterministic and randomized algorithms on dense and sparse matrices. The goal of the following numerical experiments is to compare the performance, scalability, and accuracy of these algorithms.

Test Description#

The following numerical experiments aim to estimate

(1)#\[\log \det (\mathbf{A}),\]

and

(2)#\[\mathrm{trace}(\mathbf{A}^{-1}),\]

where \(\mathbf{A}\) is symmetric and positive-definite. The above quantities are computationally expensive expressions that frequently appear in likelihood functions and their Jacobians and Hessians.

Algorithms#

The following algorithms were tested on an Intel® Xeon CPU E5-2670 v3 with 24 threads.

Cholesky Decomposition#

This method is implemented by the following functions:

    imate.logdet(method='cholesky')
    imate.traceinv(method='cholesky')

The complexity of computing (1) for matrices obtained from 1D, 2D, and 3D grids is respectively \(\mathcal{O}(n)\), \(\mathcal{O}(n^{\frac{3}{2}})\), and \(\mathcal{O}(n^2)\), where \(n\) is the matrix size. The complexity of computing (2) for sparse matrices is \(\mathcal{O}(\rho n^2)\), where \(\rho\) is the sparse matrix density.
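As a usage sketch (not part of the benchmark scripts), both quantities can be computed with this method as follows; the matrix generator is described under Test on Simple Matrices below, and the gram argument is an assumption about its signature:

    import imate

    # Gramian of a bidiagonal Toeplitz matrix (see "Test on Simple
    # Matrices"); gram=True is assumed to return B^T * B rather than B.
    A = imate.toeplitz(2, 1, size=2**10, gram=True)

    ld = imate.logdet(A, method='cholesky')    # quantity (1)
    ti = imate.traceinv(A, method='cholesky')  # quantity (2)
    print(ld, ti)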

Hutchinson Algorithm#

This method applies only to (2) and is implemented by the imate.traceinv(method='hutchinson') function. The complexity of this method is:

(3)#\[\mathcal{O}(\mathrm{nnz}(\mathbf{A})s),\]

where \(s\) is the number of Monte-Carlo iterations in the algorithm. In this experiment, \(s = 80\).
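A minimal sketch of invoking the Hutchinson estimator with the experiment's sample count is shown below; min_num_samples and max_num_samples are assumed to be the option names that pin the number of Monte-Carlo samples:

    import imate

    A = imate.toeplitz(2, 1, size=2**10, gram=True)

    # Fix the number of Monte-Carlo samples to s = 80, as in the experiment.
    ti = imate.traceinv(A, method='hutchinson',
                        min_num_samples=80, max_num_samples=80)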

Stochastic Lanczos Quadrature Algorithm#

This method is implemented by the following functions:

    imate.logdet(method='slq')
    imate.traceinv(method='slq')

The complexity of this method is:

(4)#\[\mathcal{O} \left( (\mathrm{nnz}(\mathbf{A}) l + n l^2) s \right),\]

where \(l\) is the number of Lanczos iterations, and \(s\) is the number of Monte-Carlo iterations. The numerical experiment is performed with \(l=80\) and \(s=200\).
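A corresponding sketch for the SLQ method with the experiment's parameters \(l = 80\) and \(s = 200\); lanczos_degree and the sample-count options are assumed keyword names:

    import imate

    A = imate.toeplitz(2, 1, size=2**10, gram=True)

    # l = 80 Lanczos iterations, s = 200 Monte-Carlo samples
    ld = imate.logdet(A, method='slq', lanczos_degree=80,
                      min_num_samples=200, max_num_samples=200)
    ti = imate.traceinv(A, method='slq', lanczos_degree=80,
                        min_num_samples=200, max_num_samples=200)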

Arithmetic Types#

The benchmark test also examines the performance and accuracy of imate for various arithmetic types of the matrix data. To this end, the matrices described below are re-cast into 32-bit, 64-bit, and 128-bit floating point types.
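For instance, the test matrix can be generated in each arithmetic type roughly as follows; the dtype argument of imate.toeplitz is an assumption (the returned sparse matrix could equivalently be re-cast with astype()):

    import imate

    # Re-cast the test matrix into each arithmetic type examined here.
    for dtype in ['float32', 'float64', 'float128']:
        A = imate.toeplitz(2, 1, size=2**10, gram=True, dtype=dtype)
        print(dtype, imate.logdet(A, method='slq'))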

Note

Support for 128-bit data types is one of the features of imate that is often not available in numerical libraries such as BLAS, OpenBLAS, and Cholmod.

Test on Simple Matrices#

The Gramian matrix \(\mathbf{A} = \mathbf{B}^{\intercal} \mathbf{B}\) is considered for the test, where \(\mathbf{B}\) is a sparse bi-diagonal Toeplitz matrix defined by

\[\begin{split}B_{ij} = \begin{cases} a, & i = j, \\ b, & i+1 = j. \end{cases}\end{split}\]

The above matrix can be generated by the imate.toeplitz() function. In this experiment, \(a = 2\), \(b = 1\), and the matrix size is varied by powers of two, \(n = 2^8, 2^9, \dots, 2^{24}\).

The Cholesky factor of \(\mathbf{A}\) is \(\mathbf{B}^{\intercal}\). Also, \(\mathrm{nnz}(\mathbf{A}) = 3n\). An advantage of using the above matrix is that analytic formulas for (1) and (2) are known. Namely,

(5)#\[ \log \det \mathbf{A} = 2n \log_e a.\]

See imate.sample_matrices.toeplitz_logdet() for details. Also, if \(n \gg 1\), then

(6)#\[ \mathrm{trace}(\mathbf{A}^{-1}) \approx \frac{1}{a^2 - b^2} \left( n - \frac{q^{2}}{1 - q^2} \right),\]

where \(q = b/a\). See imate.sample_matrices.toeplitz_traceinv() for details. The above analytic formulas are used as the benchmark solution to test the accuracy of the results.
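Putting these together, the accuracy of an estimate can be checked against the analytic values; the sketch below assumes the toeplitz_* helpers accept the same size and gram arguments as imate.toeplitz:

    import imate
    from imate.sample_matrices import toeplitz_logdet, toeplitz_traceinv

    a, b, n = 2, 1, 2**10
    A = imate.toeplitz(a, b, size=n, gram=True)

    # Analytic benchmark values (5) and (6)
    ld_exact = toeplitz_logdet(a, b, size=n, gram=True)
    ti_exact = toeplitz_traceinv(a, b, size=n, gram=True)

    # Relative errors of the randomized estimates
    ld = imate.logdet(A, method='slq')
    ti = imate.traceinv(A, method='hutchinson')
    print(abs(ld - ld_exact) / abs(ld_exact))
    print(abs(ti - ti_exact) / abs(ti_exact))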

Elapsed Time of Computing Log-Determinant#

The elapsed (wall) time of the computations is shown in the figure below. The Cholesky method is faster than the SLQ method by an order of magnitude; hence, it can be considered the preferred algorithm for computing the log-determinant.

For large matrix sizes, \(n \geq 2^{19}\), the elapsed time can be related to the matrix size as \(t \propto n^{\alpha}\). It can be seen from the slope of the fitted lines in the figure that for the SLQ method, \(\alpha\) is close to 1. This result is consistent with the analytic complexity \(\mathcal{O}(n)\) derived from (4) and using \(\mathrm{nnz}(\mathbf{A}) \sim n\) for a tri-diagonal matrix. Similarly, for the Cholesky method, \(\alpha \approx 1\), which corresponds to the analytic complexity of computing log-determinant for matrices obtained from 1D meshes.

../_images/compare_methods_analytic_matrix_logdet_time.png
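The exponent \(\alpha\) is the slope of a least-squares line fitted in log-log scale; a generic sketch with hypothetical timings:

    import numpy

    # Hypothetical (n, t) measurements; substitute the benchmark's timings.
    n = numpy.array([2**19, 2**20, 2**21, 2**22])
    t = numpy.array([0.8, 1.7, 3.3, 6.9])

    # The slope of the fitted line in log-log scale is the exponent alpha.
    alpha, intercept = numpy.polyfit(numpy.log(n), numpy.log(t), deg=1)
    print(alpha)   # expected to be close to 1 here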

Note

The Cholesky method imate.logdet(method='cholesky') uses the logdet function of SuiteSparse's Cholmod library, which only supports 64-bit data types. Because of this, performance tests for the 32-bit and 128-bit types are not available for this method. In contrast, imate's implementation of the randomized algorithms supports all data types shown in the above figure.

Accuracy of Computing Log-Determinant#

The error of computing (1) is shown in the figure below, obtained by comparing the results with the benchmark solution (5). The error of the Cholesky method is close to the machine precision (almost zero) since it is a direct method. On the other hand, the error of the SLQ method, as a randomized algorithm, is non-zero, yet very small. Such a remarkably small error is due to the specific matrix used in the test: the localized distribution of the eigenvalues of \(\mathbf{A}\) makes the SLQ method very effective in estimating (1) with only a few Lanczos iterations. For practical matrices, however, the error of randomized methods is usually larger than shown in the figure below.

../_images/compare_methods_analytic_matrix_logdet_accuracy.png

Elapsed Time of Computing Trace of Inverse#

The elapsed (wall) time of the computations is shown in the figure below. Unlike the above results for the log-determinant, the Cholesky method here is significantly slower than the SLQ method. In fact, computing the trace of the inverse of a matrix is one of the applications where the performance of randomized methods significantly surpasses that of direct methods.

The computational complexity can be quantified by the relation between the elapsed time and the matrix size, \(t \propto n^{\alpha}\). It can be seen from the slope of the fitted lines in the figure that for the Hutchinson and SLQ methods, \(\alpha\) is close to 1. This result is consistent with the analytic complexity \(\mathcal{O}(n)\) derived from (3) and (4) using \(\mathrm{nnz}(\mathbf{A}) \sim n\) for a tri-diagonal matrix. In contrast, for the Cholesky method, \(\alpha \approx 2\), which corresponds to the analytic complexity \(\mathcal{O}(n^2)\) of computing the trace of the matrix inverse.

../_images/compare_methods_analytic_matrix_traceinv_time.png

Accuracy of Computing Trace of Inverse#

The error of computing (2) is obtained from the benchmark solution (6) and shown in the figure below. The error of the Cholesky method is close to the machine precision (almost zero) since it is a direct method. On the other hand, the error of the SLQ method, as a randomized algorithm, is non-zero. As mentioned previously, such a remarkably small error of the SLQ method is due to the specific matrix \(\mathbf{A}\) used in the test, as its eigenvalues have a localized distribution, allowing the SLQ method to estimate (2) with a few Lanczos iterations.

../_images/compare_methods_analytic_matrix_traceinv_accuracy.png

Test on Practical Matrices#

The performance of the algorithms was also examined on practical matrices. The table below shows the matrices used in the test, which are chosen from the SuiteSparse Matrix Collection and are obtained from real applications. All matrices in the table are symmetric positive-definite. From one matrix to the next, the number of nonzero elements (nnz) increases approximately by a factor of 5 on average, while the sparse density remains at the same order of magnitude (except for the first three matrices).

Matrix Name     Size        nnz           Density   Application
nos5            468         5,172         0.02      Structural Problem
mhd4800b        4,800       27,520        0.001     Electromagnetics
bodyy6          19,366      134,208       0.0003    Structural Problem
G2_circuit      150,102     726,674       0.00003   Circuit Simulation
parabolic_fem   525,825     3,674,625     0.00001   Computational Fluid Dynamics
StocF-1465      1,465,137   21,005,389    0.00001   Computational Fluid Dynamics
Bump_2911       2,911,419   127,729,899   0.00001   Structural Problem
Queen_4147      4,147,110   329,499,284   0.00002   Structural Problem

Elapsed Time of Computing Log-Determinant#

The elapsed (wall) time of the computations is shown in the figure below. Unlike the results for the Toeplitz matrices, here the comparison between the Cholesky and SLQ methods is mixed, with the Cholesky method often being marginally faster than the SLQ method. However, the SLQ method scales to larger matrices, whereas the Cholesky method often fails on the largest matrices, whose number of nonzero elements exceeds \(10^8\).

../_images/compare_methods_practical_matrix_logdet_time.png

Accuracy of Computing Log-Determinant#

The error of computing the log-determinant using the SLQ method is shown in the figure below, obtained by comparing with the result of the Cholesky method as a benchmark. For four of the matrices, the error is less than \(1 \%\). However, for the second and the sixth matrices in the figure, the error is above \(10 \%\). To reduce the error, a larger number of Monte-Carlo iterations is needed for these matrices.

../_images/compare_methods_practical_matrix_logdet_accuracy.png

Varying Algorithm Parameters#

A known issue of the Lanczos algorithm is that the eigenvectors computed during the recursive Lanczos iterations lose their orthogonality. A solution to this issue is the re-orthogonalization of each newly computed eigenvector against either some or all of the previous eigenvectors, known respectively as partial or full orthogonalization. imate supports both types of orthogonalization.

In the following tests, the effects of re-orthogonalization of the eigenvectors and of varying the number of Lanczos iterations are examined.

Process Time of Computing Log-Determinant#

The test below corresponds to the tri-diagonal matrix \(\mathbf{A}\) (see Test on Simple Matrices) with \(n = 2^{14}\). The figure below shows the elapsed time (left) and process time (right) of computing the log-determinant versus the number of Lanczos iterations, \(l\) (also known as the Lanczos degree). The orange and red curves respectively show the results with and without orthogonalization of the eigenvectors in the Lanczos algorithm. Without orthogonalization, the process time grows as \(\mathcal{O}(l)\), which is consistent with the complexity given in (4) when \(\mathrm{nnz}(\mathbf{A})\) is large. With orthogonalization, however, the process time grows almost as \(\mathcal{O}(l^{2})\) for \(l < 300\), which is consistent with the complexity of the Gram-Schmidt orthogonalization process.

../_images/vary_lanczos_degree_analytic_matrix_time.png

Note that at \(l > 300\), the process time suddenly increases. This effect does not reflect a mathematical complexity; rather, it occurs because the Gram-Schmidt process becomes memory-bound when the number of vectors is large. Specifically, with full orthogonalization, all previous eigenvectors of the Lanczos iterations must be stored in memory in a single large array. As the number of eigenvectors increases, this array grows beyond the capacity of the CPU cache, making memory access significantly less efficient.
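In imate, the type of re-orthogonalization is selected through the orthogonalize option of the SLQ method; in the sketch below, the values 0 (none) and -1 (full) are assumed conventions for this option:

    import imate

    A = imate.toeplitz(2, 1, size=2**14, gram=True)

    # No re-orthogonalization of the Lanczos eigenvectors
    ld_fast = imate.logdet(A, method='slq', lanczos_degree=100,
                           orthogonalize=0)

    # Full re-orthogonalization against all previous eigenvectors
    ld_accurate = imate.logdet(A, method='slq', lanczos_degree=100,
                               orthogonalize=-1)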

Accuracy of Computing Log-Determinant#

The figure below shows the error of computing the log-determinant, obtained by comparing the results with the analytic value (5). While full orthogonalization in the Lanczos algorithm is an order of magnitude slower than no orthogonalization, it improves the accuracy by an order of magnitude. The figure also implies that only a few Lanczos iterations, \(l = 30 \sim 60\), are sufficient to obtain a reasonable order of accuracy. This result holds for most applications.

../_images/vary_lanczos_degree_analytic_matrix_accuracy.png

Process Time of Computing Trace of Inverse#

The above tests were also performed on practical matrices (see Test on Practical Matrices). The figure below corresponds to computing the trace of the inverse of the matrix jnlbrng1 with \(n = 40000\). The results are very similar to the previous test on simple matrices. Namely, the complexity of the process with and without orthogonalization is \(\mathcal{O}(l^2)\) and \(\mathcal{O}(l)\), respectively. Also, the process with orthogonalization is an order of magnitude slower than without it, and the effect of the cache memory bandwidth can be seen at \(l > 300\).

../_images/vary_lanczos_degree_practical_matrix_time.png

Accuracy of Computing Trace of Inverse#

The error of computing the trace of the inverse of the test matrix is shown below, obtained by comparing the results with the Cholesky method (not shown here). The difference between the errors with and without orthogonalization is insignificant; hence, one may forgo orthogonalization, as it is considerably faster. Also, as the figure suggests, a practical choice for the number of Lanczos iterations is \(l = 20 \sim 50\) for most applications.

../_images/vary_lanczos_degree_practical_matrix_accuracy.png

How to Reproduce Results#

Prepare Matrix Data#

  1. Download all the above-mentioned sparse matrices from the SuiteSparse Matrix Collection. For instance, download Queen_4147.mat from the Queen_4147 page.

  2. Run /imate/benchmark/matrices/read_matrix.m to extract sparse matrix data from Queen_4147.mat:

    read_matrix('Queen_4147.mat');
    
  3. Run /imate/benchmark/matrices/read_matrix.py to convert the output of the above script into a Python pickle file:

    read_matrix.py Queen_4147 float32    # to generate 32-bit data
    read_matrix.py Queen_4147 float64    # to generate 64-bit data
    read_matrix.py Queen_4147 float128   # to generate 128-bit data
    

    The output of the above script will be written to /imate/benchmark/matrices/ and can then be loaded in Python, as sketched below.
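    A minimal sketch for loading a generated file, assuming a hypothetical output name (the actual naming of read_matrix.py's output may differ):

        import pickle

        # Hypothetical file name; adjust to the actual output of read_matrix.py
        with open('/imate/benchmark/matrices/Queen_4147_float64.pickle', 'rb') as f:
            A = pickle.load(f)   # sparse matrix consumed by the benchmark scripts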

Perform Numerical Test#

Run Locally#

Submit to Cluster with SLURM#

Plot Results#

The plotting notebooks store the plots as SVG files in /imate/benchmark/svg_plots/.