imate
C++/CUDA Reference
Loading...
Searching...
No Matches
cuTraceEstimator< DataType > Class Template Reference

A static class to compute the trace of implicit matrix functions using stochastic Lanczos quadrature method. This class acts as a templated namespace, where the member methods is public and static. The internal private member functions are also static. More...

#include <cu_trace_estimator.h>

Static Public Member Functions

static FlagType cu_trace_estimator (cuLinearOperator< DataType > *A, DataType *parameters, const IndexType num_inquiries, const Function *matrix_function, const FlagType gram, const DataType exponent, const FlagType orthogonalize, const int64_t seed, const IndexType lanczos_degree, const DataType lanczos_tol, const IndexType min_num_samples, const IndexType max_num_samples, const DataType error_atol, const DataType error_rtol, const DataType confidence_level, const DataType outlier_significance_level, const IndexType num_threads, const IndexType num_gpu_devices, DataType *trace, DataType *error, DataType **samples, IndexType *processed_samples_indices, IndexType *num_samples_used, IndexType *num_outliers, FlagType *converged, double &alg_wall_time)
 Stochastic Lanczos quadrature method to estimate trace of a function of a linear operator. Both function and the linear operator can be defined with parameters.
 

Static Private Member Functions

static void _cu_stochastic_lanczos_quadrature (cuLinearOperator< DataType > *A, DataType *parameters, const IndexType num_inquiries, const Function *matrix_function, const FlagType gram, const DataType exponent, const FlagType orthogonalize, const IndexType lanczos_degree, const DataType lanczos_tol, RandomNumberGenerator &random_number_generator, DataType *random_vector, FlagType *converged, DataType *trace_estimate)
 For a given random input vector, computes one Monte-Carlo sample to estimate trace using Lanczos quadrature method.
 

Detailed Description

template<typename DataType>
class cuTraceEstimator< DataType >

A static class to compute the trace of implicit matrix functions using stochastic Lanczos quadrature method. This class acts as a templated namespace, where the member methods is public and static. The internal private member functions are also static.

See also
Diagonalization

Definition at line 39 of file cu_trace_estimator.h.

Member Function Documentation

◆ _cu_stochastic_lanczos_quadrature()

template<typename DataType >
void cuTraceEstimator< DataType >::_cu_stochastic_lanczos_quadrature ( cuLinearOperator< DataType > *  A,
DataType *  parameters,
const IndexType  num_inquiries,
const Function matrix_function,
const FlagType  gram,
const DataType  exponent,
const FlagType  orthogonalize,
const IndexType  lanczos_degree,
const DataType  lanczos_tol,
RandomNumberGenerator random_number_generator,
DataType *  random_vector,
FlagType converged,
DataType *  trace_estimate 
)
staticprivate

For a given random input vector, computes one Monte-Carlo sample to estimate trace using Lanczos quadrature method.

Note
In special case when an eigenvalue relation is known, this function sets the converged inquiries to "not" converged in order to continue updating those inquiries. This is because in this special case, computing for other inquiries is free.
Parameters
[in]AAn instance of a class derived from LinearOperator class. This object will perform the matrix-vector operation and/or transposed matrix-vector operation for a linear operator. The linear operator can represent a fixed matrix, or a combination of matrices together with some given parameters.
[in]parametersThe parameters of the linear operator A. The size of this array is num_parameters*num_inquiries where num_parameters is the number of parameters that define the linear operator A, and num_inquiries is the number of different set of parameters to compute trace on different parametrized operators. The j-th set of parameters are stored in parameters[j*num_parameters:(j+1)*num_parameters]. That is, this array is contiguous over each batch of parameters.
[in]num_inquiriesThe number of batches of parameters. This function computes num_inquiries values of trace corresponding to different batch of parameters of the linear operator A. Hence, the number of output trace is num_inquiries. Hence, it is the number of columns of the output array samples.
[in]matrix_functionAn instance of Function class which has the function function. This function defines the matrix function, and operates on scalar eigenvalues of the matrix.
[in]gramFlag indicating whether the linear operator A is Gramian. If the linear operator is:
  • Gramian, then, Lanczos tridiagonalization method is employed. This method requires only matrix-vector multiplication.
  • not Gramian, then, Golub-Kahn bidiagonalization method is employed. This method requires both matrix and transposed-matrix vector multiplications.
[in]exponentThe exponent parameter p in the trace of the expression \( f((\mathbf{A} + t \mathbf{B})^p) \). The exponent is a real number and by default it is set to 1.0.
[in]orthogonalizeIndicates whether to orthogonalize the orthogonal eigenvectors during Lanczos recursive iterations.
  • If set to 0, no orthogonalization is performed.
  • If set to a negative integer, a newly computed eigenvector is orthogonalized against all the previous eigenvectors (full reorthogonalization).
  • If set to a positive integer, say q less than lanczos_degree, the newly computed eigenvector is orthogonalized against the last q previous eigenvectors (partial reorthogonalization).
  • If set to an integer larger than lanczos_degree, it is cut to lanczos_degree, which effectively orthogonalizes against all previous eigenvectors (full reorthogonalization).
[in]lanczos_degreeThe number of Lanczos recursive iterations. The operator A is reduced to a square tridiagonal (or bidiagonal) matrix of the size lanczos_degree. The eigenvalues (or singular values) of this reduced matrix is computed and used in the stochastic Lanczos quadrature method. The larger Lanczos degre leads to a better estimation. The computational cost is quadratically increases with the lanczos degree.
[in]lanczos_tolThe tolerance to stop the Lanczos recursive iterations before the end of iterations reached. If the tolerance is not met, the iterations (total of lanczos_degree iterations) continue till end.
[in]random_number_generatorGenerates random numbers that fills random_vector. In each parallel thread, an independent sequence of random numbers are generated. This object should be initialized by num_threads.
[in]random_vectorA 1D vector of the size of matrix A. The Lanczos iterations start off from this random vector. Each given random vector is used per a Monte-Carlo computation of the SLQ method. In the Lanczos iterations, other vectors are generated orthogonal to this initial random vector. This array is filled inside this function.
[out]converged1D array of the size of the number of columns of samples. Each element indicates which column of samples has converged to the tolerance criteria. Normally, if the num_samples used is less than max_num_samples, it indicates that the convergence has reached.
[out]trace_estimate1D array of the size of the number of columns of samples array. This array constitures each row of samples array. Each element of trace_estimates is the estimated trace for each parameter inquiry.

Definition at line 437 of file cu_trace_estimator.cu.

451{
452 // Matrix size
453 IndexType matrix_size = A->get_num_rows();
454
455 // Fill random vectors with Rademacher distribution (+1, -1), normalized
456 // but not orthogonalized. Setting num_threads to zero indicates to not
457 // create any new threads in RandomNumbrGenerator since the current
458 // function is inside a parallel thread.
459 IndexType num_threads = 0;
461 random_number_generator, random_vector, matrix_size, num_threads);
462
463 // Allocate diagonals (alpha) and supdiagonals (beta) of Lanczos matrix
464 DataType* alpha = new DataType[lanczos_degree];
465 DataType* beta = new DataType[lanczos_degree];
466
467 // Define 2D arrays needed to decomposition. All these arrays are
468 // defined as 1D array with Fortran ordering
469 DataType* eigenvectors = NULL;
470 DataType* left_singularvectors = NULL;
471 DataType* right_singularvectors_transposed = NULL;
472
473 // Actual number of inquiries
474 IndexType required_num_inquiries = num_inquiries;
476 {
477 // When a relation between eigenvalues and the parameters of the linear
478 // operator is known, to compute eigenvalues of for each inquiry, only
479 // computing one inquiry is enough. This is because an eigenvalue for
480 // one parameter setting is enough to compute eigenvalue of another set
481 // of parameters.
482 required_num_inquiries = 1;
483 }
484
485 // Allocate and initialize theta
486 IndexType i;
487 IndexType j;
488 DataType** theta = new DataType*[num_inquiries];
489 for (j=0; j < num_inquiries; ++j)
490 {
491 theta[j] = new DataType[lanczos_degree];
492
493 // Initialize components to zero
494 for (i=0; i < lanczos_degree; ++i)
495 {
496 theta[j][i] = 0.0;
497 }
498 }
499
500 // Allocate and initialize tau
501 DataType** tau = new DataType*[num_inquiries];
502 for (j=0; j < num_inquiries; ++j)
503 {
504 tau[j] = new DataType[lanczos_degree];
505
506 // Initialize components to zero
507 for (i=0; i < lanczos_degree; ++i)
508 {
509 tau[j][i] = 0.0;
510 }
511 }
512
513 // Allocate lanczos size for each inquiry. This variable keeps the non-zero
514 // size of the tri-diagonal (or bi-diagonal) matrix. Ideally, this matrix
515 // is of the size lanczos_degree. But, due to the early termination, this
516 // size might be smaller.
517 IndexType* lanczos_size = new IndexType[num_inquiries];
518
519 // Number of parameters of linear operator A
520 IndexType num_parameters = A->get_num_parameters();
521
522 // Lanczos iterations, computes theta and tau for each inquiry parameter
523 for (j=0; j < required_num_inquiries; ++j)
524 {
525 // If trace is already converged, do not compute on the new sample.
526 // However, exclude the case where required_num_inquiries is not the
527 // same as num_inquiries, since in this case, we compute one inquiry
528 // for multiple parameters.
529 if ((converged[j] == 1) && (required_num_inquiries == num_inquiries))
530 {
531 continue;
532 }
533
534 // Set parameter of linear operator A
535 A->set_parameters(&parameters[j*num_parameters]);
536
537 if (gram)
538 {
539 // Use Golub-Kahn-Lanczos Bi-diagonalization
540 lanczos_size[j] = cu_golub_kahn_bidiagonalization(
541 A, random_vector, matrix_size, lanczos_degree, lanczos_tol,
542 orthogonalize, alpha, beta);
543
544 // Allocate matrix of singular vectors (1D array, Fortran ordering)
545 left_singularvectors = \
546 new DataType[lanczos_size[j] * lanczos_size[j]];
547 right_singularvectors_transposed = \
548 new DataType[lanczos_size[j] * lanczos_size[j]];
549
550 // Note: alpha is written in-place with singular values
552 alpha, beta, left_singularvectors,
553 right_singularvectors_transposed, lanczos_size[j]);
554
555 // theta and tau from singular values and vectors
556 for (i=0; i < lanczos_size[j]; ++i)
557 {
558 theta[j][i] = cu_arithmetics::mul(alpha[i], alpha[i]);
559 tau[j][i] = right_singularvectors_transposed[i];
560 }
561 }
562 else
563 {
564 // Use Lanczos Tri-diagonalization
565 lanczos_size[j] = cu_lanczos_tridiagonalization(
566 A, random_vector, matrix_size, lanczos_degree, lanczos_tol,
567 orthogonalize, alpha, beta);
568
569 // Allocate eigenvectors matrix (1D array with Fortran ordering)
570 eigenvectors = new DataType[lanczos_size[j] * lanczos_size[j]];
571
572 // Note: alpha is written in-place with eigenvalues
574 alpha, beta, eigenvectors, lanczos_size[j]);
575
576 // theta and tau from singular values and vectors
577 for (i=0; i < lanczos_size[j]; ++i)
578 {
579 theta[j][i] = alpha[i];
580 tau[j][i] = eigenvectors[i * lanczos_size[j]];
581 }
582 }
583 }
584
585 // If an eigenvalue relation is known, compute the rest of eigenvalues
586 // using the eigenvalue relation given in the operator A for its
587 // eigenvalues. If no eigenvalue relation is not known, the rest of
588 // eigenvalues were already computed in the above loop and no other
589 // computation is needed.
590 if (A->is_eigenvalue_relation_known() && num_inquiries > 1)
591 {
592 // When the code execution reaches this function, at least one of the
593 // inquiries is not converged, but some others might have been
594 // converged already. Here, we force-update those that are even
595 // converged already by setting converged to false. The extra update is
596 // free of charge when a relation for the eigenvalues are known.
597 for (j=0; j < num_inquiries; ++j)
598 {
599 converged[j] = 0;
600 }
601
602 // Compute theta and tau for the rest of inquiry parameters
603 for (j=1; j < num_inquiries; ++j)
604 {
605 // Only j=0 was iterated before. Set the same size for other j-s
606 lanczos_size[j] = lanczos_size[0];
607
608 for (i=0; i < lanczos_size[j]; ++i)
609 {
610 // Shift eigenvalues by the old and new parameters
611 theta[j][i] = A->get_eigenvalue(
612 &parameters[0],
613 theta[0][i],
614 &parameters[j*num_parameters]);
615
616 // tau is the same (at least for the affine operator)
617 tau[j][i] = tau[0][i];
618 }
619 }
620 }
621
622 // Estimate trace using quadrature method
623 DataType quadrature_sum;
624 for (j=0; j < num_inquiries; ++j)
625 {
626 // If the j-th inquiry is already converged, skip.
627 if (converged[j] == 1)
628 {
629 continue;
630 }
631
632 // Initialize sum for the integral of quadrature
633 quadrature_sum = 0.0;
634
635 // Important: This loop should iterate till lanczos_size[j], but not
636 // lanczos_degree. Otherwise the computation is wrong for certain
637 // matrices, such as if the input matrix is identity, or rank
638 // deficient. By using lanczos_size[j] instead of lanczos_degree, all
639 // issues with special matrices will resolve.
640 for (i=0; i < lanczos_size[j]; ++i)
641 {
642 quadrature_sum += cu_arithmetics::mul(
643 tau[j][i],
644 tau[j][i],
646 matrix_function->function(std::pow(
649 )
650 )
651 )
652 );
653 }
654
655 trace_estimate[j] = \
656 cu_arithmetics::mul(
657 quadrature_sum,
659 static_cast<unsigned long long int>(matrix_size)
660 )
661 );
662 }
663
664 // Release dynamic memory
665 delete[] alpha;
666 delete[] beta;
667 delete[] lanczos_size;
668
669 for (j=0; j < required_num_inquiries; ++j)
670 {
671 delete[] theta[j];
672 }
673 delete[] theta;
674
675 for (j=0; j < required_num_inquiries; ++j)
676 {
677 delete[] tau[j];
678 }
679 delete[] tau;
680
681 if (eigenvectors != NULL)
682 {
683 delete[] eigenvectors;
684 }
685
686 if (left_singularvectors != NULL)
687 {
688 delete[] left_singularvectors;
689 }
690
691 if (right_singularvectors_transposed != NULL)
692 {
693 delete[] right_singularvectors_transposed;
694 }
695}
static int eigh_tridiagonal(DataType *diagonals, DataType *subdiagonals, DataType *eigenvectors, IndexType matrix_size)
Computes all eigenvalues and eigenvectors of a real and symmetric tridiagonal matrix.
static int svd_bidiagonal(DataType *diagonals, DataType *subdiagonals, DataType *U, DataType *Vt, IndexType matrix_size)
Computes all singular-values and left and right eigenvectors of a real and symmetric upper bi-diagona...
virtual float function(const float lambda_) const =0
static void generate_random_array(RandomNumberGenerator &random_number_generator, DataType *array, const LongIndexType array_size, const IndexType num_threads)
Generates a pseudo-random array with Rademacher distribution where elements are either +1 or -1.
IndexType get_num_parameters() const
Returns the number of parameters of the linear operator.
LongIndexType get_num_rows() const
Returns the number of rows of the matrix.
FlagType is_eigenvalue_relation_known() const
Returns a flag that determines whether a relation between the parameters of the operator and its eige...
void set_parameters(DataType *parameters_)
Sets the scalar parameter this->parameters. Parameter is initialized to NULL. However,...
virtual DataType get_eigenvalue(const DataType *known_parameters, const DataType known_eigenvalue, const DataType *inquiry_parameters) const =0
IndexType cu_golub_kahn_bidiagonalization(cuLinearOperator< DataType > *A, const DataType *v, const LongIndexType n, const IndexType m, const DataType lanczos_tol, const FlagType orthogonalize, DataType *alpha, DataType *beta)
Bi-diagonalizes the positive-definite matrix A using Golub-Kahn-Lanczos method.
IndexType cu_lanczos_tridiagonalization(cuLinearOperator< DataType > *A, const DataType *v, const LongIndexType n, const IndexType m, const DataType lanczos_tol, const FlagType orthogonalize, DataType *alpha, DataType *beta)
Tri-diagonalizes matrix A to T using the start vector v. is the Lanczos degree, which will be the siz...
__host__ __device__ DataType mul(const DataType x, const DataType y)
Multiply two floating point numbers in round-to-nearest-even mode.
__host__ __device__ DataType abs(const DataType x)
Absolute value of a floating point number.
int IndexType
Definition types.h:65

References cu_arithmetics::abs(), cu_golub_kahn_bidiagonalization(), cu_lanczos_tridiagonalization(), Diagonalization< DataType >::eigh_tridiagonal(), Function::function(), RandomArrayGenerator< DataType >::generate_random_array(), cuLinearOperator< DataType >::get_eigenvalue(), cLinearOperatorBase::get_num_parameters(), cLinearOperatorBase::get_num_rows(), cLinearOperatorBase::is_eigenvalue_relation_known(), cu_arithmetics::mul(), cuLinearOperator< DataType >::set_parameters(), and Diagonalization< DataType >::svd_bidiagonal().

Referenced by cuTraceEstimator< DataType >::cu_trace_estimator().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ cu_trace_estimator()

template<typename DataType >
FlagType cuTraceEstimator< DataType >::cu_trace_estimator ( cuLinearOperator< DataType > *  A,
DataType *  parameters,
const IndexType  num_inquiries,
const Function matrix_function,
const FlagType  gram,
const DataType  exponent,
const FlagType  orthogonalize,
const int64_t  seed,
const IndexType  lanczos_degree,
const DataType  lanczos_tol,
const IndexType  min_num_samples,
const IndexType  max_num_samples,
const DataType  error_atol,
const DataType  error_rtol,
const DataType  confidence_level,
const DataType  outlier_significance_level,
const IndexType  num_threads,
const IndexType  num_gpu_devices,
DataType *  trace,
DataType *  error,
DataType **  samples,
IndexType processed_samples_indices,
IndexType num_samples_used,
IndexType num_outliers,
FlagType converged,
double &  alg_wall_time 
)
static

Stochastic Lanczos quadrature method to estimate trace of a function of a linear operator. Both function and the linear operator can be defined with parameters.

Multiple batches of parameters of the linear operator can be passed to this function. In such a case, the output trace is an array of the of the number of the inquired parameters.

The stochastic estimator computes multiple samples of trace and the final result is the average of the samples. This function outputs both the samples of estimated trace values (in samples array) and their average (in trace array).

Parameters
[in]AAn instance of a class derived from LinearOperator class. This object will perform the matrix-vector operation and/or transposed matrix-vector operation for a linear operator. The linear operator can represent a fixed matrix, or a combination of matrices together with some given parameters.
[in]parametersThe parameters of the linear operator A. The size of this array is num_parameters*num_inquiries where num_parameters is the number of parameters that define the linear operator A, and num_inquiries is the number of different set of parameters to compute trace on different parametrized operators. The j-th set of parameters are stored in parameters[j*num_parameters:(j+1)*num_parameters]. That is, this array is contiguous over each batch of parameters.
[in]num_inquiriesThe number of batches of parameters. This function computes num_inquiries values of trace corresponding to different batch of parameters of the linear operator A. Hence, the number of output trace is num_inquiries. Hence, it is the number of columns of the output array samples.
[in]matrix_functionAn instance of Function class which has the function function. This function defines the matrix function, and operates on scalar eigenvalues of the matrix.
[in]gramFlag indicating whether the linear operator A is Gramian. If the linear operator is:
  • Gramian, then, Lanczos tridiagonalization method is employed. This method requires only matrix-vector multiplication.
  • not Gramian, then, Golub-Kahn bidiagonalization method is employed. This method requires both matrix and transposed-matrix vector multiplications.
[in]exponentThe exponent parameter p in the trace of the expression \( f((\mathbf{A} + t \mathbf{B})^p) \). The exponent is a real number and by default it is set to 1.0.
[in]orthogonalizeIndicates whether to orthogonalize the orthogonal eigenvectors during Lanczos recursive iterations.
  • If set to 0, no orthogonalization is performed.
  • If set to a negative integer, a newly computed eigenvector is orthogonalized against all the previous eigenvectors (full reorthogonalization).
  • If set to a positive integer, say q less than lanczos_degree, the newly computed eigenvector is orthogonalized against the last q previous eigenvectors (partial reorthogonalization).
  • If set to an integer larger than lanczos_degree, it is cut to lanczos_degree, which effectively orthogonalizes against all previous eigenvectors (full reorthogonalization).
[in]seedA non-negative integer to be used as seed to initiate the generation of sequences of peudo-random numbers in the algorithm. This is useful to make the result of the randomized algorithm to be reproducible. If a negative integer is given, the given seed value is ignored and the current processor time is used as the seed to initiate he generation random number sequences. In this case, the result is not reproducible, rather, is pseudo-random.
[in]lanczos_degreeThe number of Lanczos recursive iterations. The operator A is reduced to a square tridiagonal (or bidiagonal) matrix of the size lanczos_degree. The eigenvalues (or singular values) of this reduced matrix is computed and used in the stochastic Lanczos quadrature method. The larger Lanczos degree leads to a better estimation. The computational cost is quadratically increases with the lanczos degree.
[in]lanczos_tolThe tolerance to stop the Lanczos recursive iterations before the end of iterations reached. If the tolerance is not met, the iterations (total of lanczos_degree iterations) continue till end.
[in]min_num_samplesMinimum number of times that the trace estimation is repeated. Within the min number of samples, the Monte-Carlo continues even if convergence is reached.
[in]max_num_samplesThe number of times that the trace estimation is repeated. The output trace value is the average of the samples. Hence, this is the number of rows of the output array samples. Larger number of samples leads to a better trace estimation. The computational const linearly increases with number of samples.
[in]error_atolAbsolute tolerance criterion for early termination during the computation of trace samples. If the tolerance is not met, then all iterations (total of max_num_samples) proceed till end.
[in]error_rtolRelative tolerance criterion for early termination during the computation of trace samples. If the tolerance is not met, then all iterations (total of max_num_samples) proceed till end.
[in]confidence_levelThe confidence level of the error, which is a number between 0 and 1. This affects the scale of error.
[in]outlier_significance_levelOne minus the confidence level of the uncertainty band of the outlier. This is a number between 0 and 1. Confidence level of outleir and significance level of outlier are complement of each other.
[in]num_threadsNumber of OpenMP parallel processes. The parallelization is implemented over the Monte-Carlo iterations.
[in]num_gpu_devicesNumber of GPU devices to use. This is the number of CPU threads to be created to handle each GPU device in parallel for each CPU thread.
[out]traceThe output trace of size num_inquiries. These values are the average of the rows of samples array.
[out]errorThe error of estimation of trace, which is the standard deviation of the rows of samples array. The size of this array is num_inquiries.
[out]samples2D array of all estimated trace samples. The shape of this array is (max_num_samples*num_inquiries). The average of the rows is also given in trace array.
[out]processed_samples_indicesA 1D array indicating the processing order of rows of the samples. In parallel processing, this order of processing the rows of samples is not necessarly sequential.
[out]num_samples_used1D array of the size of the number of columns of samples. Each element indicates how many iterations were used till convergence is reached for each column of the samples. The number of iterations should be a number between min_num_samples and max_num_samples.
[out]num_outliers1D array with the size of number of columns of samples. Each element indicates how many rows of the samples array were outliers and were removed during averaging rows of samples.
[out]converged1D array of the size of the number of columns of samples. Each element indicates which column of samples has converged to the tolerance criteria. Normally, if the num_samples used is less than max_num_samples, it indicates that the convergence has reached.
[out]alg_wall_timeThe elapsed time that takes for the SLQ algorithm. This does not include array allocation/deallocation.
Returns
A signal to indicate the status of computation:
  • 1 indicates successful convergence within the given tolerances was met. Convergence is achieved when all elements of convergence array are below convergence_atol or convergence_rtol times trace.
  • 0 indicates the convergence criterion was not met for at least one of the trace inquiries.

Definition at line 205 of file cu_trace_estimator.cu.

232{
233 // Matrix size
234 IndexType matrix_size = A->get_num_rows();
235
236 // Set the number of threads
237 #if defined(USE_OPENMP) && (USE_OPENMP == 1)
238 omp_set_num_threads(num_gpu_devices);
239 #else
240 num_threads = 1;
241 #endif
242
243 // Allocate 1D array of random vectors We only allocate a random vector
244 // per parallel thread. Thus, the total size of the random vectors is
245 // matrix_size*num_threads. On each iteration in parallel threads, the
246 // allocated memory is reused. That is, in each iteration, a new random
247 // vector is generated for that specific thread id.
248 IndexType random_vectors_size = matrix_size * num_gpu_devices;
249 DataType* random_vectors = new DataType[random_vectors_size];
250
251 // Initialize random number generator to generate in parallel threads
252 // independently.
253 RandomNumberGenerator random_number_generator(num_gpu_devices, seed);
254
255 // The counter of filled size of processed_samples_indices array
256 // This scalar variable is defined as array to be shared among al threads
257 IndexType num_processed_samples = 0;
258
259 // Criterion for early termination of iterations if convergence reached
260 // This scalar variable is defined as array to be shared among al threads
261 FlagType all_converged = 0;
262
263 // Using square-root of max possible chunk size for parallel schedules
264 unsigned int chunk_size = static_cast<int>(
265 std::sqrt(static_cast<double>(max_num_samples) / \
266 static_cast<double>(num_gpu_devices)));
267 if (chunk_size < 1)
268 {
269 chunk_size = 1;
270 }
271
272 // Timing elapsed time of algorithm
273 CudaTimer cuda_timer;
274 cuda_timer.start();
275
276 // Shared-memory parallelism over Monte-Carlo ensemble sampling
277 IndexType i;
278 #if defined(USE_OPENMP) && (USE_OPENMP == 1)
279 #pragma omp parallel for \
280 schedule(dynamic, chunk_size)
281 #endif
282 for (i=0; i < max_num_samples; ++i)
283 {
284 if (!static_cast<bool>(all_converged))
285 {
286 // Switch to a device with the same device id as the cpu thread id
287 unsigned int thread_id = omp_get_thread_num();
289
290 // Perform one Monte-Carlo sampling to estimate trace
292 A, parameters, num_inquiries, matrix_function, gram,
293 exponent, orthogonalize, lanczos_degree, lanczos_tol,
294 random_number_generator,
295 &random_vectors[matrix_size*thread_id], converged,
296 samples[i]);
297
298 // Critical section
299 #if defined(USE_OPENMP) && (USE_OPENMP == 1)
300 #pragma omp critical
301 #endif
302 {
303 // Store the index of processed samples
304 processed_samples_indices[num_processed_samples] = i;
305 ++num_processed_samples;
306
307 // Check whether convergence criterion has been met to stop.
308 // This check can also be done after another parallel thread
309 // set all_converged to "1", but we continue to update error.
311 samples, min_num_samples, num_inquiries,
312 processed_samples_indices, num_processed_samples,
313 confidence_level, error_atol, error_rtol, error,
314 num_samples_used, converged);
315 }
316 }
317 }
318
319 // Elapsed wall time of the algorithm (computation only, not array i/o)
320 cuda_timer.stop();
321 alg_wall_time = cuda_timer.elapsed();
322
323 // Remove outliers from trace estimates and average trace estimates
325 confidence_level, outlier_significance_level, num_inquiries,
326 max_num_samples, num_samples_used, processed_samples_indices,
327 samples, num_outliers, trace, error);
328
329 // Deallocate memory
330 delete[] random_vectors;
331
332 return all_converged;
333}
static FlagType check_convergence(DataType **samples, const IndexType min_num_samples, const IndexType num_inquiries, const IndexType *processed_samples_indices, const IndexType num_processed_samples, const DataType confidence_level, const DataType error_atol, const DataType error_rtol, DataType *error, IndexType *num_samples_used, FlagType *converged)
Checks if the standard deviation of the set of the cumulative averages of trace estimators converged ...
static void average_estimates(const DataType confidence_level, const DataType outlier_significance_level, const IndexType num_inquiries, const IndexType max_num_samples, const IndexType *num_samples_used, const IndexType *processed_samples_indices, DataType **samples, IndexType *num_outliers, DataType *trace, DataType *error)
Averages the estimates of trace. Removes outliers and reevaluates the error to take into account for ...
static void set_device(int device_id)
Sets the current device in multi-gpu applications.
Definition cuda_api.cu:191
Records elasped time between two CUDA events.
Definition cuda_timer.h:62
void stop()
Stops the timer.
Definition cuda_timer.cu:67
void start()
Starts the timer.
Definition cuda_timer.cu:54
float elapsed() const
Returns the elapsed time in seconds.
Definition cuda_timer.cu:80
Generates 64-bit integers on multiple parallel threads.
static void _cu_stochastic_lanczos_quadrature(cuLinearOperator< DataType > *A, DataType *parameters, const IndexType num_inquiries, const Function *matrix_function, const FlagType gram, const DataType exponent, const FlagType orthogonalize, const IndexType lanczos_degree, const DataType lanczos_tol, RandomNumberGenerator &random_number_generator, DataType *random_vector, FlagType *converged, DataType *trace_estimate)
For a given random input vector, computes one Monte-Carlo sample to estimate trace using Lanczos quad...
void omp_set_num_threads(int num_threads)
int omp_get_thread_num()
int FlagType
Definition types.h:68

References cuTraceEstimator< DataType >::_cu_stochastic_lanczos_quadrature(), ConvergenceTools< DataType >::average_estimates(), ConvergenceTools< DataType >::check_convergence(), CudaTimer::elapsed(), cLinearOperatorBase::get_num_rows(), omp_get_thread_num(), omp_set_num_threads(), CudaAPI< ArrayType >::set_device(), CudaTimer::start(), and CudaTimer::stop().

Here is the call graph for this function:

The documentation for this class was generated from the following files: