imate
C++/CUDA Reference
Loading...
Searching...
No Matches
cu_lanczos_tridiagonalization.cu File Reference

Go to the source code of this file.

Functions

template<typename DataType >
IndexType cu_lanczos_tridiagonalization (cuLinearOperator< DataType > *A, const DataType *v, const LongIndexType n, const IndexType m, const DataType lanczos_tol, const FlagType orthogonalize, DataType *alpha, DataType *beta)
 Tri-diagonalizes matrix A to T using the start vector v. is the Lanczos degree, which will be the size of square matrix T.
 
template IndexType cu_lanczos_tridiagonalization< float > (cuLinearOperator< float > *A, const float *v, const LongIndexType n, const IndexType m, const float lanczos_tol, const FlagType orthogonalize, float *alpha, float *beta)
 
template IndexType cu_lanczos_tridiagonalization< double > (cuLinearOperator< double > *A, const double *v, const LongIndexType n, const IndexType m, const double lanczos_tol, const FlagType orthogonalize, double *alpha, double *beta)
 

Function Documentation

◆ cu_lanczos_tridiagonalization()

template<typename DataType >
IndexType cu_lanczos_tridiagonalization ( cuLinearOperator< DataType > *  A,
const DataType *  v,
const LongIndexType  n,
const IndexType  m,
const DataType  lanczos_tol,
const FlagType  orthogonalize,
DataType *  alpha,
DataType *  beta 
)

Tri-diagonalizes matrix A to T using the start vector v. is the Lanczos degree, which will be the size of square matrix T.

The output of this function is not an explicit matrix T, rather are the two arrays alpha of length m and beta of length m+1. The array alpha[:] represents the diagonal elements and beta[1:] represents the off-diagonal elements of the tri-diagonal (m,m) symmetric and positive-definite matrix T.

Lanczos tridiagonalization vs Golub-Kahn bidiagonalization

  • The Lanczos tri-diagonalization is twice faster (in runtime), as it has only one matrix-vector multiplication. Whereas the Golub-Kahn bi-diagonalization has two matrix-vector multiplications.
  • The Lanczos tri-diagonalization can only be applied to symmetric matrices. Whereas the Golub-Kahn bi-diagonalization can be applied to any matrix.

Algorithm

The algorithm and notations are obtained from [DEMMEL], p. 57, Algorithm 4.6 (see also [SAAD] p. 137, Algorithm 6.5). However there are four ways to implement the iteration. [PAIGE]_ has shown that the iteration that is implemented below is the most stable against loosing orthogonality of the eigenvectors. For details, see [CULLUM]_ p. 46, and p.48, particularly the algorithm denoted by A(2,7). The differences of these implementations are the order in which \( \alpha_j \) and \( \beta_j \) are defined and the order in which vectors are subtracted from r .

References

  • [DEMMEL] Demmel, J., Templates for solution of Algebraic Eigenvalue Problems, p. 57.
  • [SAAD] Saad, Numerical Methods for Large Eigenvalue Problems, p. 137.
  • [PAIGE] Paige (1980) Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigenproblem.
  • [CULLUM] Cullum; Willoughby. Lanczos Algorithms for Large Symmetric Eigenvalue Computations. 1. pp.46-48.
Parameters
[in]AA linear operator that represents a matrix of size (n,n) and can perform matrix-vector operation with dot() method. This matrix should be positive-definite.
[in]vStart vector for the Lanczos tri-diagonalization. Column vector of size c n. It could be generated randomly. Often it is generated by the Rademacher distribution with entries c +1 and -1.
[in]nSize of the square matrix A, which is also the size of the vector v.
[in]mLanczos degree, which is the number of Lanczos iterations.
[in]lanczos_tolThe tolerance of the residual error of the Lanczos iteration.
[in]orthogonalizeIndicates whether to orthogonalize the orthogonal eigenvectors during Lanczos recursive iterations.
  • If set to 0, no orthogonalization is performed.
  • If set to a negative integer, a newly computed eigenvector is orthogonalized against all the previous eigenvectors (full reorthogonalization).
  • If set to a positive integer, say q less than lanczos_degree, the newly computed eigenvector is orthogonalized against the last q previous eigenvectors (partial reorthogonalization).
  • If set to an integer larger than lanczos_degree, it is cut to lanczos_degree, which effectively orthogonalizes against all previous eigenvectors (full reorthogonalization).
[out]alphaThis is a 1D array of size m. The array alpha[:] constitute the diagonal elements of the tri-diagonal matrix T. This is the output and written in place.
[out]betaThis is a 1D array of size m. The array beta[:] constitute the off-diagonals of the tri-diagonal matrix T. This array is the output and written in place.
Returns
Counter for the Lanczos iterations. Normally, the size of the output matrix should be (m,m), which is the Lanczos degree. However, if the algorithm terminates early, the size of alpha and beta, and hence the output tri-diagonal matrix, is smaller. This counter keeps track of the non-zero size of alpha and beta.

Definition at line 139 of file cu_lanczos_tridiagonalization.cu.

148{
149 // Get cublas handle
150 cublasHandle_t cublas_handle = A->get_cublas_handle();
151
152 // buffer_size is number of last orthogonal vectors to keep in the buffer V
153 IndexType buffer_size;
154 if (orthogonalize == 0 || orthogonalize == 1)
155 {
156 // At least two vectors must be stored in buffer for Lanczos recursion
157 buffer_size = 2;
158 }
159 else if ((orthogonalize < 0) ||
160 (orthogonalize > static_cast<FlagType>(m)))
161 {
162 // Using full re-orthogonalization, keep all of the m vectors in buffer
163 buffer_size = m;
164 }
165 else
166 {
167 // Orthogonalize with less than m vectors (0 < orthogonalize < m)
168 buffer_size = orthogonalize;
169 }
170
171 // Allocate 2D array (as 1D array, and coalesced row-wise) to store
172 // the last buffer_size of orthogonalized vectors of length n. New vectors
173 // are stored by cycling through the buffer to replace with old ones.
174 DataType* device_V = CudaAPI<DataType>::alloc(n * buffer_size);
175
176 // Allocate vector r
177 DataType* device_r = CudaAPI<DataType>::alloc(n);
178
179 // Copy v into r
180 CudaAPI<DataType>::copy_to_device(v, n, device_r);
181
182 // Initial beta
183 DataType initial_beta = cuVectorOperations<DataType>::euclidean_norm(
184 cublas_handle, device_r, n);
185
186 // Declare iterators
187 IndexType j;
188 IndexType lanczos_size = 0;
189 IndexType num_ortho;
190
191 // In the following, beta[j] means beta[j-1] in the Demmel text
192 for (j=0; j < m; ++j)
193 {
194 // Update the size of Lanczos tridiagonal matrix
195 ++lanczos_size;
196
197 // Normalize r and copy to the j-th column of V
198 if (j == 0)
199 {
201 cublas_handle, device_r, n,
203 &device_V[(j % buffer_size)*n]);
204 }
205 else
206 {
208 cublas_handle, device_r, n,
210 &device_V[(j % buffer_size)*n]);
211 }
212
213 // Multiply A to the j-th column of V, write into r
214 A->dot(&device_V[(j % buffer_size)*n], device_r);
215
216 // alpha[j] is V[:, j] dot r
218 cublas_handle, &device_V[(j % buffer_size)*n], device_r, n);
219
220 // Subtract V[:,j] * alpha[j] from r
222 cublas_handle, &device_V[(j % buffer_size)*n], n, alpha[j],
223 device_r);
224
225 // Subtract V[:,j-1] * beta[j] from r
226 if (j > 0)
227 {
229 cublas_handle, &device_V[((j-1) % buffer_size)*n], n,
230 beta[j-1], device_r);
231 }
232
233 // Gram-Schmidt process (full re-orthogonalization)
234 if (orthogonalize != 0)
235 {
236 // Find how many column vectors are filled so far in the buffer V
237 if (j < buffer_size)
238 {
239 num_ortho = j+1;
240 }
241 else
242 {
243 num_ortho = buffer_size;
244 }
245
246 // Gram-Schmidt process
248 cublas_handle, &device_V[0], n, buffer_size, j%buffer_size,
249 num_ortho, device_r);
250 }
251
252 // beta is norm of r
254 cublas_handle, device_r, n);
255
256 // Exit criterion when the vector r is zero. If each component of a
257 // zero vector has the tolerance epsilon, (which is called lanczos_tol
258 // here), the tolerance of norm of r is epsilon times sqrt of n.
259 if (beta[j] < cu_arithmetics::mul(
260 lanczos_tol,
262 static_cast<double>(std::sqrt(n)))
263 )
264 )
265 {
266 break;
267 }
268 }
269
270 // Free dynamic memory
271 CudaAPI<DataType>::del(device_V);
272 CudaAPI<DataType>::del(device_r);
273
274 return lanczos_size;
275}
static ArrayType * alloc(const size_t array_size)
Allocates memory on gpu device. This function creates a pointer and returns it.
Definition cuda_api.cu:39
static void del(void *device_array)
Deletes memory on gpu device if its pointer is not NULL, then sets the pointer to NULL.
Definition cuda_api.cu:169
static void copy_to_device(const ArrayType *host_array, const size_t array_size, ArrayType *device_array)
Copies memory on host to device memory.
Definition cuda_api.cu:145
virtual void dot(const DataType *vector, DataType *product)=0
cublasHandle_t get_cublas_handle() const
This function returns a reference to the cublasHandle_t object. The object will be created,...
static void gram_schmidt_process(cublasHandle_t cublas_handle, const DataType *V, const LongIndexType vector_size, const IndexType num_vectors, const IndexType last_vector, const FlagType num_ortho, DataType *r)
Modified Gram-Schmidt orthogonalization process to orthogonalize the vector v against a subset of the...
static void copy_scaled_vector(cublasHandle_t cublas_handle, const DataType *RESTRICT input_vector, const LongIndexType vector_size, const DataType scale, DataType *RESTRICT output_vector)
Scales a vector and stores to a new vector.
static void subtract_scaled_vector(cublasHandle_t cublas_handle, const DataType *RESTRICT input_vector, const LongIndexType vector_size, const DataType scale, DataType *RESTRICT output_vector)
Subtracts the scaled input vector from the output vector.
static DataType inner_product(cublasHandle_t cublas_handle, const DataType *RESTRICT vector1, const DataType *RESTRICT vector2, const LongIndexType vector_size)
Computes Euclidean inner product of two vectors.
static DataType euclidean_norm(cublasHandle_t cublas_handle, const DataType *RESTRICT vector, const LongIndexType vector_size)
Computes the Euclidean 2-norm of a 1D array.
__host__ __device__ DataType mul(const DataType x, const DataType y)
Multiply two floating point numbers in round-to-nearest-even mode.
__host__ __device__ DataType abs(const DataType x)
Absolute value of a floating point number.
int FlagType
Definition types.h:68
int IndexType
Definition types.h:65

References cu_arithmetics::abs(), CudaAPI< ArrayType >::alloc(), cuVectorOperations< DataType >::copy_scaled_vector(), CudaAPI< ArrayType >::copy_to_device(), CudaAPI< ArrayType >::del(), cuLinearOperator< DataType >::dot(), cuVectorOperations< DataType >::euclidean_norm(), cuLinearOperator< DataType >::get_cublas_handle(), cuOrthogonalization< DataType >::gram_schmidt_process(), cuVectorOperations< DataType >::inner_product(), cu_arithmetics::mul(), and cuVectorOperations< DataType >::subtract_scaled_vector().

Referenced by cuTraceEstimator< DataType >::_cu_stochastic_lanczos_quadrature().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ cu_lanczos_tridiagonalization< double >()

template IndexType cu_lanczos_tridiagonalization< double > ( cuLinearOperator< double > *  A,
const double *  v,
const LongIndexType  n,
const IndexType  m,
const double  lanczos_tol,
const FlagType  orthogonalize,
double *  alpha,
double *  beta 
)

◆ cu_lanczos_tridiagonalization< float >()

template IndexType cu_lanczos_tridiagonalization< float > ( cuLinearOperator< float > *  A,
const float *  v,
const LongIndexType  n,
const IndexType  m,
const float  lanczos_tol,
const FlagType  orthogonalize,
float *  alpha,
float *  beta 
)