imate
C++/CUDA Reference
cuTraceEstimator< DataType > Class Template Reference

A static class to compute the trace of implicit matrix functions using stochastic Lanczos quadrature method. This class acts as a templated namespace, where the member methods is public and static. The internal private member functions are also static. More...

#include <cu_trace_estimator.h>

Static Public Member Functions

static FlagType cu_trace_estimator (cuLinearOperator< DataType > *A, DataType *parameters, const IndexType num_inquiries, const Function *matrix_function, const FlagType gram, const DataType exponent, const FlagType orthogonalize, const int64_t seed, const IndexType lanczos_degree, const DataType lanczos_tol, const IndexType min_num_samples, const IndexType max_num_samples, const DataType error_atol, const DataType error_rtol, const DataType confidence_level, const DataType outlier_significance_level, const IndexType num_threads, const IndexType num_gpu_devices, DataType *trace, DataType *error, DataType **samples, IndexType *processed_samples_indices, IndexType *num_samples_used, IndexType *num_outliers, FlagType *converged, float &alg_wall_time)
 Stochastic Lanczos quadrature method to estimate trace of a function of a linear operator. Both function and the linear operator can be defined with parameters. More...
 

Static Private Member Functions

static void _cu_stochastic_lanczos_quadrature (cuLinearOperator< DataType > *A, DataType *parameters, const IndexType num_inquiries, const Function *matrix_function, const FlagType gram, const DataType exponent, const FlagType orthogonalize, const IndexType lanczos_degree, const DataType lanczos_tol, RandomNumberGenerator &random_number_generator, DataType *random_vector, FlagType *converged, DataType *trace_estimate)
 For a given random input vector, computes one Monte-Carlo sample to estimate trace using Lanczos quadrature method. More...
 

Detailed Description

template<typename DataType>
class cuTraceEstimator< DataType >

A static class to compute the trace of implicit matrix functions using stochastic Lanczos quadrature method. This class acts as a templated namespace, where the member methods is public and static. The internal private member functions are also static.

See also
Diagonalization

Definition at line 39 of file cu_trace_estimator.h.

Member Function Documentation

◆ _cu_stochastic_lanczos_quadrature()

template<typename DataType >
void cuTraceEstimator< DataType >::_cu_stochastic_lanczos_quadrature ( cuLinearOperator< DataType > *  A,
DataType *  parameters,
const IndexType  num_inquiries,
const Function matrix_function,
const FlagType  gram,
const DataType  exponent,
const FlagType  orthogonalize,
const IndexType  lanczos_degree,
const DataType  lanczos_tol,
RandomNumberGenerator random_number_generator,
DataType *  random_vector,
FlagType converged,
DataType *  trace_estimate 
)
staticprivate

For a given random input vector, computes one Monte-Carlo sample to estimate trace using Lanczos quadrature method.

Note
In special case when an eigenvalue relation is known, this function sets the converged inquiries to "not" converged in order to continue updating those inquiries. This is because in this special case, computing for other inquiries is free.
Parameters
[in]AAn instance of a class derived from LinearOperator class. This object will perform the matrix-vector operation and/or transposed matrix-vector operation for a linear operator. The linear operator can represent a fixed matrix, or a combination of matrices together with some given parameters.
[in]parametersThe parameters of the linear operator A. The size of this array is num_parameters*num_inquiries where num_parameters is the number of parameters that define the linear operator A, and num_inquiries is the number of different set of parameters to compute trace on different parametrized operators. The j-th set of parameters are stored in parameters[j*num_parameters:(j+1)*num_parameters]. That is, this array is contiguous over each batch of parameters.
[in]num_inquiriesThe number of batches of parameters. This function computes num_inquiries values of trace corresponding to different batch of parameters of the linear operator A. Hence, the number of output trace is num_inquiries. Hence, it is the number of columns of the output array samples.
[in]matrix_functionAn instance of Function class which has the function function. This function defines the matrix function, and operates on scalar eigenvalues of the matrix.
[in]gramFlag indicating whether the linear operator A is Gramian. If the linear operator is:
  • Gramian, then, Lanczos tridiagonalization method is employed. This method requires only matrix-vector multiplication.
  • not Gramian, then, Golub-Kahn bidiagonalization method is employed. This method requires both matrix and transposed-matrix vector multiplications.
[in]exponentThe exponent parameter p in the trace of the expression \( f((\mathbf{A} + t \mathbf{B})^p) \). The exponent is a real number and by default it is set to 1.0.
[in]orthogonalizeIndicates whether to orthogonalize the orthogonal eigenvectors during Lanczos recursive iterations.
  • If set to 0, no orthogonalization is performed.
  • If set to a negative integer, a newly computed eigenvector is orthogonalized against all the previous eigenvectors (full reorthogonalization).
  • If set to a positive integer, say q less than lanczos_degree, the newly computed eigenvector is orthogonalized against the last q previous eigenvectors (partial reorthogonalization).
  • If set to an integer larger than lanczos_degree, it is cut to lanczos_degree, which effectively orthogonalizes against all previous eigenvectors (full reorthogonalization).
[in]lanczos_degreeThe number of Lanczos recursive iterations. The operator A is reduced to a square tridiagonal (or bidiagonal) matrix of the size lanczos_degree. The eigenvalues (or singular values) of this reduced matrix is computed and used in the stochastic Lanczos quadrature method. The larger Lanczos degre leads to a better estimation. The computational cost is quadratically increases with the lanczos degree.
[in]lanczos_tolThe tolerance to stop the Lanczos recursive iterations before the end of iterations reached. If the tolerance is not met, the iterations (total of lanczos_degree iterations) continue till end.
[in]random_number_generatorGenerates random numbers that fills random_vector. In each parallel thread, an independent sequence of random numbers are generated. This object should be initialized by num_threads.
[in]random_vectorA 1D vector of the size of matrix A. The Lanczos iterations start off from this random vector. Each given random vector is used per a Monte-Carlo computation of the SLQ method. In the Lanczos iterations, other vectors are generated orthogonal to this initial random vector. This array is filled inside this function.
[out]converged1D array of the size of the number of columns of samples. Each element indicates which column of samples has converged to the tolerance criteria. Normally, if the num_samples used is less than max_num_samples, it indicates that the convergence has reached.
[out]trace_estimate1D array of the size of the number of columns of samples array. This array constitures each row of samples array. Each element of trace_estimates is the estimated trace for each parameter inquiry.

Definition at line 419 of file cu_trace_estimator.cu.

433 {
434  // Matrix size
435  IndexType matrix_size = A->get_num_rows();
436 
437  // Fill random vectors with Rademacher distribution (+1, -1), normalized
438  // but not orthogonalized. Setting num_threads to zero indicates to not
439  // create any new threads in RandomNumbrGenerator since the current
440  // function is inside a parallel thread.
441  IndexType num_threads = 0;
443  random_number_generator, random_vector, matrix_size, num_threads);
444 
445  // Allocate diagonals (alpha) and supdiagonals (beta) of Lanczos matrix
446  DataType* alpha = new DataType[lanczos_degree];
447  DataType* beta = new DataType[lanczos_degree];
448 
449  // Define 2D arrays needed to decomposition. All these arrays are
450  // defined as 1D array with Fortran ordering
451  DataType* eigenvectors = NULL;
452  DataType* left_singularvectors = NULL;
453  DataType* right_singularvectors_transposed = NULL;
454 
455  // Actual number of inquiries
456  IndexType required_num_inquiries = num_inquiries;
458  {
459  // When a relation between eigenvalues and the parameters of the linear
460  // operator is known, to compute eigenvalues of for each inquiry, only
461  // computing one inquiry is enough. This is because an eigenvalue for
462  // one parameter setting is enough to compute eigenvalue of another set
463  // of parameters.
464  required_num_inquiries = 1;
465  }
466 
467  // Allocate and initialize theta
468  IndexType i;
469  IndexType j;
470  DataType** theta = new DataType*[num_inquiries];
471  for (j=0; j < num_inquiries; ++j)
472  {
473  theta[j] = new DataType[lanczos_degree];
474 
475  // Initialize components to zero
476  for (i=0; i < lanczos_degree; ++i)
477  {
478  theta[j][i] = 0.0;
479  }
480  }
481 
482  // Allocate and initialize tau
483  DataType** tau = new DataType*[num_inquiries];
484  for (j=0; j < num_inquiries; ++j)
485  {
486  tau[j] = new DataType[lanczos_degree];
487 
488  // Initialize components to zero
489  for (i=0; i < lanczos_degree; ++i)
490  {
491  tau[j][i] = 0.0;
492  }
493  }
494 
495  // Allocate lanczos size for each inquiry. This variable keeps the non-zero
496  // size of the tri-diagonal (or bi-diagonal) matrix. Ideally, this matrix
497  // is of the size lanczos_degree. But, due to the early termination, this
498  // size might be smaller.
499  IndexType* lanczos_size = new IndexType[num_inquiries];
500 
501  // Number of parameters of linear operator A
502  IndexType num_parameters = A->get_num_parameters();
503 
504  // Lanczos iterations, computes theta and tau for each inquiry parameter
505  for (j=0; j < required_num_inquiries; ++j)
506  {
507  // If trace is already converged, do not compute on the new sample.
508  // However, exclude the case where required_num_inquiries is not the
509  // same as num_inquiries, since in this case, we compute one inquiry
510  // for multiple parameters.
511  if ((converged[j] == 1) && (required_num_inquiries == num_inquiries))
512  {
513  continue;
514  }
515 
516  // Set parameter of linear operator A
517  A->set_parameters(&parameters[j*num_parameters]);
518 
519  if (gram)
520  {
521  // Use Golub-Kahn-Lanczos Bi-diagonalization
522  lanczos_size[j] = cu_golub_kahn_bidiagonalization(
523  A, random_vector, matrix_size, lanczos_degree, lanczos_tol,
524  orthogonalize, alpha, beta);
525 
526  // Allocate matrix of singular vectors (1D array, Fortran ordering)
527  left_singularvectors = \
528  new DataType[lanczos_size[j] * lanczos_size[j]];
529  right_singularvectors_transposed = \
530  new DataType[lanczos_size[j] * lanczos_size[j]];
531 
532  // Note: alpha is written in-place with singular values
534  alpha, beta, left_singularvectors,
535  right_singularvectors_transposed, lanczos_size[j]);
536 
537  // theta and tau from singular values and vectors
538  for (i=0; i < lanczos_size[j]; ++i)
539  {
540  theta[j][i] = alpha[i] * alpha[i];
541  tau[j][i] = right_singularvectors_transposed[i];
542  }
543  }
544  else
545  {
546  // Use Lanczos Tri-diagonalization
547  lanczos_size[j] = cu_lanczos_tridiagonalization(
548  A, random_vector, matrix_size, lanczos_degree, lanczos_tol,
549  orthogonalize, alpha, beta);
550 
551  // Allocate eigenvectors matrix (1D array with Fortran ordering)
552  eigenvectors = new DataType[lanczos_size[j] * lanczos_size[j]];
553 
554  // Note: alpha is written in-place with eigenvalues
556  alpha, beta, eigenvectors, lanczos_size[j]);
557 
558  // theta and tau from singular values and vectors
559  for (i=0; i < lanczos_size[j]; ++i)
560  {
561  theta[j][i] = alpha[i];
562  tau[j][i] = eigenvectors[i * lanczos_size[j]];
563  }
564  }
565  }
566 
567  // If an eigenvalue relation is known, compute the rest of eigenvalues
568  // using the eigenvalue relation given in the operator A for its
569  // eigenvalues. If no eigenvalue relation is not known, the rest of
570  // eigenvalues were already computed in the above loop and no other
571  // computation is needed.
572  if (A->is_eigenvalue_relation_known() && num_inquiries > 1)
573  {
574  // When the code execution reaches this function, at least one of the
575  // inquiries is not converged, but some others might have been
576  // converged already. Here, we force-update those that are even
577  // converged already by setting converged to false. The extra update is
578  // free of charge when a relation for the eigenvalues are known.
579  for (j=0; j < num_inquiries; ++j)
580  {
581  converged[j] = 0;
582  }
583 
584  // Compute theta and tau for the rest of inquiry parameters
585  for (j=1; j < num_inquiries; ++j)
586  {
587  // Only j=0 was iterated before. Set the same size for other j-s
588  lanczos_size[j] = lanczos_size[0];
589 
590  for (i=0; i < lanczos_size[j]; ++i)
591  {
592  // Shift eigenvalues by the old and new parameters
593  theta[j][i] = A->get_eigenvalue(
594  &parameters[0],
595  theta[0][i],
596  &parameters[j*num_parameters]);
597 
598  // tau is the same (at least for the affine operator)
599  tau[j][i] = tau[0][i];
600  }
601  }
602  }
603 
604  // Estimate trace using quadrature method
605  DataType quadrature_sum;
606  for (j=0; j < num_inquiries; ++j)
607  {
608  // If the j-th inquiry is already converged, skip.
609  if (converged[j] == 1)
610  {
611  continue;
612  }
613 
614  // Initialize sum for the integral of quadrature
615  quadrature_sum = 0.0;
616 
617  // Important: This loop should iterate till lanczos_size[j], but not
618  // lanczos_degree. Otherwise the computation is wrong for certain
619  // matrices, such as if the input matrix is identity, or rank
620  // deficient. By using lanczos_size[j] instead of lanczos_degree, all
621  // issues with special matrices will resolve.
622  for (i=0; i < lanczos_size[j]; ++i)
623  {
624  quadrature_sum += tau[j][i] * tau[j][i] * \
625  matrix_function->function(pow(theta[j][i], exponent));
626  }
627 
628  trace_estimate[j] = matrix_size * quadrature_sum;
629  }
630 
631  // Release dynamic memory
632  delete[] alpha;
633  delete[] beta;
634  delete[] lanczos_size;
635 
636  for (j=0; j < required_num_inquiries; ++j)
637  {
638  delete[] theta[j];
639  }
640  delete[] theta;
641 
642  for (j=0; j < required_num_inquiries; ++j)
643  {
644  delete[] tau[j];
645  }
646  delete[] tau;
647 
648  if (eigenvectors != NULL)
649  {
650  delete[] eigenvectors;
651  }
652 
653  if (left_singularvectors != NULL)
654  {
655  delete[] left_singularvectors;
656  }
657 
658  if (right_singularvectors_transposed != NULL)
659  {
660  delete[] right_singularvectors_transposed;
661  }
662 }
static int eigh_tridiagonal(DataType *diagonals, DataType *subdiagonals, DataType *eigenvectors, IndexType matrix_size)
Computes all eigenvalues and eigenvectors of a real and symmetric tridiagonal matrix.
static int svd_bidiagonal(DataType *diagonals, DataType *subdiagonals, DataType *U, DataType *Vt, IndexType matrix_size)
Computes all singular-values and left and right eigenvectors of a real and symmetric upper bi-diagona...
static void generate_random_array(RandomNumberGenerator &random_number_generator, DataType *array, const LongIndexType array_size, const IndexType num_threads)
Generates a pseudo-random array with Rademacher distribution where elements are either +1 or -1.
virtual DataType get_eigenvalue(const DataType *known_parameters, const DataType known_eigenvalue, const DataType *inquiry_parameters) const =0
void set_parameters(DataType *parameters_)
Sets the scalar parameter this->parameters. Parameter is initialized to NULL. However,...
LongIndexType get_num_rows() const
FlagType is_eigenvalue_relation_known() const
Returns a flag that determines whether a relation between the parameters of the operator and its eige...
IndexType get_num_parameters() const
IndexType cu_golub_kahn_bidiagonalization(cuLinearOperator< DataType > *A, const DataType *v, const LongIndexType n, const IndexType m, const DataType lanczos_tol, const FlagType orthogonalize, DataType *alpha, DataType *beta)
Bi-diagonalizes the positive-definite matrix A using Golub-Kahn-Lanczos method.
IndexType cu_lanczos_tridiagonalization(cuLinearOperator< DataType > *A, const DataType *v, const LongIndexType n, const IndexType m, const DataType lanczos_tol, const FlagType orthogonalize, DataType *alpha, DataType *beta)
Tri-diagonalizes matrix A to T using the start vector v. is the Lanczos degree, which will be the siz...
int IndexType
Definition: types.h:65

References cu_golub_kahn_bidiagonalization(), cu_lanczos_tridiagonalization(), Diagonalization< DataType >::eigh_tridiagonal(), RandomArrayGenerator< DataType >::generate_random_array(), cLinearOperator< DataType >::get_eigenvalue(), cLinearOperator< DataType >::get_num_parameters(), cLinearOperator< DataType >::get_num_rows(), cLinearOperator< DataType >::is_eigenvalue_relation_known(), cLinearOperator< DataType >::set_parameters(), and Diagonalization< DataType >::svd_bidiagonal().

Referenced by cuTraceEstimator< DataType >::cu_trace_estimator().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ cu_trace_estimator()

template<typename DataType >
FlagType cuTraceEstimator< DataType >::cu_trace_estimator ( cuLinearOperator< DataType > *  A,
DataType *  parameters,
const IndexType  num_inquiries,
const Function matrix_function,
const FlagType  gram,
const DataType  exponent,
const FlagType  orthogonalize,
const int64_t  seed,
const IndexType  lanczos_degree,
const DataType  lanczos_tol,
const IndexType  min_num_samples,
const IndexType  max_num_samples,
const DataType  error_atol,
const DataType  error_rtol,
const DataType  confidence_level,
const DataType  outlier_significance_level,
const IndexType  num_threads,
const IndexType  num_gpu_devices,
DataType *  trace,
DataType *  error,
DataType **  samples,
IndexType processed_samples_indices,
IndexType num_samples_used,
IndexType num_outliers,
FlagType converged,
float &  alg_wall_time 
)
static

Stochastic Lanczos quadrature method to estimate trace of a function of a linear operator. Both function and the linear operator can be defined with parameters.

Multiple batches of parameters of the linear operator can be passed to this function. In such a case, the output trace is an array of the of the number of the inquired parameters.

The stochastic estimator computes multiple samples of trace and the final result is the average of the samples. This function outputs both the samples of estimated trace values (in samples array) and their average (in trace array).

Parameters
[in]AAn instance of a class derived from LinearOperator class. This object will perform the matrix-vector operation and/or transposed matrix-vector operation for a linear operator. The linear operator can represent a fixed matrix, or a combination of matrices together with some given parameters.
[in]parametersThe parameters of the linear operator A. The size of this array is num_parameters*num_inquiries where num_parameters is the number of parameters that define the linear operator A, and num_inquiries is the number of different set of parameters to compute trace on different parametrized operators. The j-th set of parameters are stored in parameters[j*num_parameters:(j+1)*num_parameters]. That is, this array is contiguous over each batch of parameters.
[in]num_inquiriesThe number of batches of parameters. This function computes num_inquiries values of trace corresponding to different batch of parameters of the linear operator A. Hence, the number of output trace is num_inquiries. Hence, it is the number of columns of the output array samples.
[in]matrix_functionAn instance of Function class which has the function function. This function defines the matrix function, and operates on scalar eigenvalues of the matrix.
[in]gramFlag indicating whether the linear operator A is Gramian. If the linear operator is:
  • Gramian, then, Lanczos tridiagonalization method is employed. This method requires only matrix-vector multiplication.
  • not Gramian, then, Golub-Kahn bidiagonalization method is employed. This method requires both matrix and transposed-matrix vector multiplications.
[in]exponentThe exponent parameter p in the trace of the expression \( f((\mathbf{A} + t \mathbf{B})^p) \). The exponent is a real number and by default it is set to 1.0.
[in]orthogonalizeIndicates whether to orthogonalize the orthogonal eigenvectors during Lanczos recursive iterations.
  • If set to 0, no orthogonalization is performed.
  • If set to a negative integer, a newly computed eigenvector is orthogonalized against all the previous eigenvectors (full reorthogonalization).
  • If set to a positive integer, say q less than lanczos_degree, the newly computed eigenvector is orthogonalized against the last q previous eigenvectors (partial reorthogonalization).
  • If set to an integer larger than lanczos_degree, it is cut to lanczos_degree, which effectively orthogonalizes against all previous eigenvectors (full reorthogonalization).
[in]seedA non-negative integer to be used as seed to initiate the generation of sequences of peudo-random numbers in the algorithm. This is useful to make the result of the randomized algorithm to be reproducible. If a negative integer is given, the given seed value is ignored and the current processor time is used as the seed to initiate he generation random number sequences. In this case, the result is not reproducible, rather, is pseudo-random.
[in]lanczos_degreeThe number of Lanczos recursive iterations. The operator A is reduced to a square tridiagonal (or bidiagonal) matrix of the size lanczos_degree. The eigenvalues (or singular values) of this reduced matrix is computed and used in the stochastic Lanczos quadrature method. The larger Lanczos degree leads to a better estimation. The computational cost is quadratically increases with the lanczos degree.
[in]lanczos_tolThe tolerance to stop the Lanczos recursive iterations before the end of iterations reached. If the tolerance is not met, the iterations (total of lanczos_degree iterations) continue till end.
[in]min_num_samplesMinimum number of times that the trace estimation is repeated. Within the min number of samples, the Monte-Carlo continues even if convergence is reached.
[in]max_num_samplesThe number of times that the trace estimation is repeated. The output trace value is the average of the samples. Hence, this is the number of rows of the output array samples. Larger number of samples leads to a better trace estimation. The computational const linearly increases with number of samples.
[in]error_atolAbsolute tolerance criterion for early termination during the computation of trace samples. If the tolerance is not met, then all iterations (total of max_num_samples) proceed till end.
[in]error_rtolRelative tolerance criterion for early termination during the computation of trace samples. If the tolerance is not met, then all iterations (total of max_num_samples) proceed till end.
[in]confidence_levelThe confidence level of the error, which is a number between 0 and 1. This affects the scale of error.
[in]outlier_significance_levelOne minus the confidence level of the uncertainty band of the outlier. This is a number between 0 and 1. Confidence level of outleir and significance level of outlier are complement of each other.
[in]num_threadsNumber of OpenMP parallel processes. The parallelization is implemented over the Monte-Carlo iterations.
[in]num_gpu_devicesNumber of GPU devices to use. This is the number of CPU threads to be created to handle each GPU device in parallel for each CPU thread.
[out]traceThe output trace of size num_inquiries. These values are the average of the rows of samples array.
[out]errorThe error of estimation of trace, which is the standard deviation of the rows of samples array. The size of this array is num_inquiries.
[out]samples2D array of all estimated trace samples. The shape of this array is (max_num_samples*num_inquiries). The average of the rows is also given in trace array.
[out]processed_samples_indicesA 1D array indicating the processing order of rows of the samples. In parallel processing, this order of processing the rows of samples is not necessarly sequential.
[out]num_samples_used1D array of the size of the number of columns of samples. Each element indicates how many iterations were used till convergence is reached for each column of the samples. The number of iterations should be a number between min_num_samples and max_num_samples.
[out]num_outliers1D array with the size of number of columns of samples. Each element indicates how many rows of the samples array were outliers and were removed during averaging rows of samples.
[out]converged1D array of the size of the number of columns of samples. Each element indicates which column of samples has converged to the tolerance criteria. Normally, if the num_samples used is less than max_num_samples, it indicates that the convergence has reached.
[out]alg_wall_timeThe elapsed time that takes for the SLQ algorithm. This does not include array allocation/deallocation.
Returns
A signal to indicate the status of computation:
  • 1 indicates successful convergence within the given tolerances was met. Convergence is achieved when all elements of convergence array are below convergence_atol or convergence_rtol times trace.
  • 0 indicates the convergence criterion was not met for at least one of the trace inquiries.

Definition at line 197 of file cu_trace_estimator.cu.

224 {
225  // Matrix size
226  IndexType matrix_size = A->get_num_rows();
227 
228  // Set the number of threads
229  omp_set_num_threads(num_gpu_devices);
230 
231  // Allocate 1D array of random vectors We only allocate a random vector
232  // per parallel thread. Thus, the total size of the random vectors is
233  // matrix_size*num_threads. On each iteration in parallel threads, the
234  // allocated memory is reused. That is, in each iteration, a new random
235  // vector is generated for that specific thread id.
236  IndexType random_vectors_size = matrix_size * num_gpu_devices;
237  DataType* random_vectors = new DataType[random_vectors_size];
238 
239  // Initialize random number generator to generate in parallel threads
240  // independently.
241  RandomNumberGenerator random_number_generator(num_gpu_devices, seed);
242 
243  // The counter of filled size of processed_samples_indices array
244  // This scalar variable is defined as array to be shared among al threads
245  IndexType num_processed_samples = 0;
246 
247  // Criterion for early termination of iterations if convergence reached
248  // This scalar variable is defined as array to be shared among al threads
249  FlagType all_converged = 0;
250 
251  // Using square-root of max possible chunk size for parallel schedules
252  unsigned int chunk_size = static_cast<int>(
253  sqrt(static_cast<DataType>(max_num_samples) / num_gpu_devices));
254  if (chunk_size < 1)
255  {
256  chunk_size = 1;
257  }
258 
259  // Timing elapsed time of algorithm
260  CudaTimer cuda_timer;
261  cuda_timer.start();
262 
263  // Shared-memory parallelism over Monte-Carlo ensemble sampling
264  IndexType i;
265  #pragma omp parallel for schedule(dynamic, chunk_size)
266  for (i=0; i < max_num_samples; ++i)
267  {
268  if (!static_cast<bool>(all_converged))
269  {
270  // Switch to a device with the same device id as the cpu thread id
271  unsigned int thread_id = omp_get_thread_num();
273 
274  // Perform one Monte-Carlo sampling to estimate trace
276  A, parameters, num_inquiries, matrix_function, gram,
277  exponent, orthogonalize, lanczos_degree, lanczos_tol,
278  random_number_generator,
279  &random_vectors[matrix_size*thread_id], converged,
280  samples[i]);
281 
282  // Critical section
283  #pragma omp critical
284  {
285  // Store the index of processed samples
286  processed_samples_indices[num_processed_samples] = i;
287  ++num_processed_samples;
288 
289  // Check whether convergence criterion has been met to stop.
290  // This check can also be done after another parallel thread
291  // set all_converged to "1", but we continue to update error.
293  samples, min_num_samples, num_inquiries,
294  processed_samples_indices, num_processed_samples,
295  confidence_level, error_atol, error_rtol, error,
296  num_samples_used, converged);
297  }
298  }
299  }
300 
301  // Elapsed wall time of the algorithm (computation only, not array i/o)
302  cuda_timer.stop();
303  alg_wall_time = cuda_timer.elapsed();
304 
305  // Remove outliers from trace estimates and average trace estimates
307  confidence_level, outlier_significance_level, num_inquiries,
308  max_num_samples, num_samples_used, processed_samples_indices,
309  samples, num_outliers, trace, error);
310 
311  // Deallocate memory
312  delete[] random_vectors;
313 
314  return all_converged;
315 }
static FlagType check_convergence(DataType **samples, const IndexType min_num_samples, const IndexType num_inquiries, const IndexType *processed_samples_indices, const IndexType num_processed_samples, const DataType confidence_level, const DataType error_atol, const DataType error_rtol, DataType *error, IndexType *num_samples_used, FlagType *converged)
Checks if the standard deviation of the set of the cumulative averages of trace estimators converged ...
static void average_estimates(const DataType confidence_level, const DataType outlier_significance_level, const IndexType num_inquiries, const IndexType max_num_samples, const IndexType *num_samples_used, const IndexType *processed_samples_indices, DataType **samples, IndexType *num_outliers, DataType *trace, DataType *error)
Averages the estimates of trace. Removes outliers and reevaluates the error to take into account for ...
static void set_device(int device_id)
Sets the current device in multi-gpu applications.
Records elasped time between two CUDA events.
Definition: cuda_timer.h:62
void stop()
Stops the timer.
Definition: cuda_timer.cu:67
void start()
Starts the timer.
Definition: cuda_timer.cu:54
float elapsed() const
Returns the elapsed time in seconds.
Definition: cuda_timer.cu:80
Generates 64-bit integers on multiple parallel threads.
static void _cu_stochastic_lanczos_quadrature(cuLinearOperator< DataType > *A, DataType *parameters, const IndexType num_inquiries, const Function *matrix_function, const FlagType gram, const DataType exponent, const FlagType orthogonalize, const IndexType lanczos_degree, const DataType lanczos_tol, RandomNumberGenerator &random_number_generator, DataType *random_vector, FlagType *converged, DataType *trace_estimate)
For a given random input vector, computes one Monte-Carlo sample to estimate trace using Lanczos quad...
int FlagType
Definition: types.h:68

References cuTraceEstimator< DataType >::_cu_stochastic_lanczos_quadrature(), ConvergenceTools< DataType >::average_estimates(), ConvergenceTools< DataType >::check_convergence(), CudaTimer::elapsed(), cLinearOperator< DataType >::get_num_rows(), CudaInterface< ArrayType >::set_device(), CudaTimer::start(), and CudaTimer::stop().

Here is the call graph for this function:

The documentation for this class was generated from the following files: