imate
C++/CUDA Reference
Loading...
Searching...
No Matches
cublas_impl Namespace Reference

Templated implenentations of several BLAS-type functions in CUDA. More...

Functions

template<typename DataType , typename ComputeType >
cudaError_t cublasTgemv (cublasOperation_t trans, int m, int n, const DataType *RESTRICT alpha, const DataType *RESTRICT A, int lda, const DataType *RESTRICT x, int incx, const DataType *RESTRICT beta, DataType *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).
 
template<typename DataType >
cudaError_t cublasTcopy (int n, const DataType *RESTRICT x, int incx, DataType *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \).
 
template<typename DataType >
cudaError_t cublasTaxpy (int n, const DataType *RESTRICT alpha, const DataType *RESTRICT x, int incx, DataType *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).
 
template<typename DataType , typename ComputeType >
cudaError_t cublasTdot (int n, const DataType *RESTRICT x, int incx, const DataType *RESTRICT y, int incy, DataType *RESTRICT result)
 Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).
 
template<typename DataType , typename ComputeType >
cudaError_t cublasTnrm2 (int n, const DataType *RESTRICT x, int incx, DataType *RESTRICT result)
 Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).
 
template<typename DataType >
cudaError_t cublasTscal (int n, const DataType *RESTRICT alpha, DataType *RESTRICT x, int incx)
 Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).
 

Detailed Description

Templated implenentations of several BLAS-type functions in CUDA.

The motivation for re-implementing CuBLAS is that CUDA's CuBLAS library does not supports __half type and __nv_bfloat16 type for some of it functions. For instance, while there is support for level 3 functions, they do not provide level 2 and 1 functions with __half type.

The functions in this namespace provides some level 2 functions by implementing CUDA kernels from scratch. These implementations are templated with mixed precision computations where both the data types and inner computation types are templated. The data type is set by DatatType typename and the inner computation type is set by ComputeType typename.

Despite the generic templated functions, he main intent of these templates are to be used primarily for the missing types in CuBLAS, namely, the __half type (which is float16 type) and __nv_bfloat6 type (which is Google's bfloat16 type). But users may utilize these templates for any data and compute types.

The prefix convension for all functions in this namespace are cublasT (for instance cublasTgemv) where T here denotes template. In the CuBLAS API, this letter a placeholder for data type, such as S for single preicsion and D for double precision.

The functions in this namespace are the host codes. The kernel codes corresponding to each host code can be found in cublas_impl_kernels namespace.

See also
Namespace cublas_api .

Function Documentation

◆ cublasTaxpy()

template<typename DataType >
cudaError_t cublas_impl::cublasTaxpy ( int  n,
const DataType *RESTRICT  alpha,
const DataType *RESTRICT  x,
int  incx,
DataType *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).

This function is a custom implementation of cuBLAS's cublasSaxpy from scratch. The corresponding kernel code can be found at cublas_impl_kernels::cublasTaxpy_kernel .

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
Returns
error CUDA synchronize error code.
See also
cublas_impl_kernels::cublasTaxpy_kernel

Definition at line 223 of file cublas_impl.cu.

230 {
231 // Set number of device threads and blocks
232 const int threads_per_block = 256;
233 int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
234
235 // Call device code
236 cublas_impl_kernels::cublasTaxpy_kernel<DataType><<<
237 blocks_per_grid, threads_per_block>>>(
238 n, *alpha, x, incx, y, incy);
239
240 cudaError_t error = cudaDeviceSynchronize();
241
242 return error;
243 }

◆ cublasTcopy()

template<typename DataType >
cudaError_t cublas_impl::cublasTcopy ( int  n,
const DataType *RESTRICT  x,
int  incx,
DataType *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \).

This function is a custom implementation of cuBLAS's cublasScopy from scratch. The corresponding kernel code can be found at cublas_impl_kernels::cublasTcopy_kernel .

Parameters
[in]nSize of the array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
Returns
error CUDA synchronize error code.
See also
cublas_impl_kernels::cublasTcopy_kernel

Definition at line 169 of file cublas_impl.cu.

175 {
176 // Set number of device threads and blocks
177 const int threads_per_block = 256;
178 int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
179
180 // Call device code
181 cublas_impl_kernels::cublasTcopy_kernel<DataType><<<
182 blocks_per_grid, threads_per_block>>>(
183 n, x, incx, y, incy);
184
185 cudaError_t error = cudaDeviceSynchronize();
186
187 return error;
188 }

◆ cublasTdot()

template<typename DataType , typename ComputeType >
cudaError_t cublas_impl::cublasTdot ( int  n,
const DataType *RESTRICT  x,
int  incx,
const DataType *RESTRICT  y,
int  incy,
DataType *RESTRICT  result 
)

Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).

This function is a custom implementation of cuBLAS's cublasSdot from scratch. The corresponding kernel code can be found at cublas_impl_kernels::cublasTdot_kernel .

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
[out]resultThe dot product of two vectors.
Returns
error CUDA synchronize error code.
See also
cublas_impl_kernels::cublasTdot_kernel

Definition at line 277 of file cublas_impl.cu.

284 {
285 // device pointer to store the result (this is a scalar value)
286 ComputeType *device_result;
287 cudaMalloc(&device_result, sizeof(ComputeType));
288 cudaMemset(device_result, static_cast<ComputeType>(0.0f),
289 sizeof(ComputeType));
290
291 // Set number of device threads and blocks
292 const int threads_per_block = 256;
293 int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
294
295 // Call device code
297 DataType, ComputeType, threads_per_block><<<
298 blocks_per_grid, threads_per_block>>>(
299 n, x, incx, y, incy, device_result);
300
301 cudaError_t error = cudaDeviceSynchronize();
302
303 // Return back result from device and store as higher precision type
304 ComputeType host_result_comp;
305 cudaMemcpy(&host_result_comp, device_result, sizeof(ComputeType),
306 cudaMemcpyDeviceToHost);
307
308 // Convert type to match output type
310 host_result_comp);
311
312 cudaFree(device_result);
313
314 return error;
315 }
cudaError_t cudaFree(void *devPtr)
Definition of CUDA's cudaFree function using dynamically loaded cudart library.
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)
Definition of CUDA's cudaMemcpy function using dynamically loaded cudart library.
cudaError_t cudaMalloc(void **devPtr, size_t size)
Definition of CUDA's cudaMalloc function using dynamically loaded cudart library.
__host__ __device__ DataType abs(const DataType x)
Absolute value of a floating point number.
__global__ void cublasTdot_kernel(const int n, const DataType *RESTRICT x, const int incx, const DataType *RESTRICT y, const int incy, ComputeType *RESTRICT result)
Computes .

References cu_arithmetics::abs(), cublas_impl_kernels::cublasTdot_kernel(), cudaFree(), cudaMalloc(), and cudaMemcpy().

Here is the call graph for this function:

◆ cublasTgemv()

template<typename DataType , typename ComputeType >
cudaError_t cublas_impl::cublasTgemv ( cublasOperation_t  trans,
int  m,
int  n,
const DataType *RESTRICT  alpha,
const DataType *RESTRICT  A,
int  lda,
const DataType *RESTRICT  x,
int  incx,
const DataType *RESTRICT  beta,
DataType *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is a custom implementation of cuBLAS's cublasSgemv from scratch. The corresponding kernel code can be found at cublas_impl_kernels::cublasTgemv_kernel .

Parameters
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
Returns
error CUDA synchronize error code.
See also
cublas_impl_kernels::cublasTgemv_kernel

Definition at line 77 of file cublas_impl.cu.

89 {
90 // Determine array sizes based on operation of A
91 bool trans_;
92 int x_size;
93 int y_size;
94
95 if (trans == CUBLAS_OP_N)
96 {
97 // A is not transposed
98 trans_ = false;
99 y_size = m;
100 x_size = n;
101 }
102 else if (trans == CUBLAS_OP_T)
103 {
104 // A is transposed
105 trans_ = true;
106 y_size = n;
107 x_size = m;
108 }
109 else
110 {
111 throw std::invalid_argument(
112 "'trans' argument must be CUBLAS_OP_N or CUBLAS_OP_T.");
113 }
114
115 // The optimal number of threads per block (here 640) is obtained by
116 // calling cudaOccupancyMaxPotentialBlockSize() in a separate
117 // benchmark.
118 const int threads_per_block = 640;
119 dim3 dim_block(threads_per_block);
120
121 // We assume each thread represents one element of y. That is, the
122 // total number of threads is the size of y.
123 int blocks_per_grid = \
124 (y_size + threads_per_block - 1) / threads_per_block;
125 dim3 dim_grid(blocks_per_grid);
126
127 // Calling kernel code
129 DataType, ComputeType, threads_per_block>
130 <<<dim_grid, dim_block>>>(
131 trans_, y_size, x_size, *alpha, A, lda, x, incx, *beta, y,
132 incy);
133
134 cudaError_t error = cudaDeviceSynchronize();
135
136 return error;
137 }
__global__ void cublasTgemv_kernel(const bool trans, const int m, const int n, const DataType alpha, const DataType *RESTRICT A, const int lda, const DataType *RESTRICT x, const int incx, const DataType beta, DataType *RESTRICT y, const int incy)
Performs the operation .

References cublas_impl_kernels::cublasTgemv_kernel().

Here is the call graph for this function:

◆ cublasTnrm2()

template<typename DataType , typename ComputeType >
cudaError_t cublas_impl::cublasTnrm2 ( int  n,
const DataType *RESTRICT  x,
int  incx,
DataType *RESTRICT  result 
)

Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).

This function is a custom implementation of cuBLAS's cublasSnrm2 from scratch. The corresponding kernel code can be found at cublas_impl_kernels::cublasTnrm2_kernel .

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]resultThe norm squared of a vector.
Returns
error CUDA synchronize error code.
See also
cublas_impl_kernels::cublasTnrm2_kernel

Definition at line 344 of file cublas_impl.cu.

349 {
350 // device pointer to store the result (this is a scalar value)
351 ComputeType *device_result;
352 cudaMalloc(&device_result, sizeof(ComputeType));
353 cudaMemset(device_result, static_cast<ComputeType>(0.0f),
354 sizeof(ComputeType));
355
356 // Set number of device threads and blocks
357 const int threads_per_block = 256;
358 int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
359
360 // Call device code
362 DataType, ComputeType, threads_per_block><<<
363 blocks_per_grid, threads_per_block>>>(
364 n, x, incx, device_result);
365
366 cudaError_t error = cudaDeviceSynchronize();
367
368 // Return back result from device and store as higher precision type
369 ComputeType host_result_comp;
370 cudaMemcpy(&host_result_comp, device_result, sizeof(ComputeType),
371 cudaMemcpyDeviceToHost);
372
373 // Convert type to match output type
375 host_result_comp);
376
377 cudaFree(device_result);
378
379 return error;
380 }
__global__ void cublasTnrm2_kernel(const int n, const DataType *RESTRICT x, const int incx, ComputeType *RESTRICT result)
Computes .

References cu_arithmetics::abs(), cublas_impl_kernels::cublasTnrm2_kernel(), cudaFree(), cudaMalloc(), and cudaMemcpy().

Here is the call graph for this function:

◆ cublasTscal()

template<typename DataType >
cudaError_t cublas_impl::cublasTscal ( int  n,
const DataType *RESTRICT  alpha,
DataType *RESTRICT  x,
int  incx 
)

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).

This function is a custom implementation of cuBLAS's cublasSscale from scratch. The corresponding kernel code can be found at cublas_impl_kernels::cublasTscal_kernel .

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in,out]xInput and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
Returns
error CUDA synchronize error code.
See also
cublas_impl_kernels::cublasTscal_kernel

Definition at line 411 of file cublas_impl.cu.

416 {
417 // Set number of device threads and blocks
418 int threads_per_block = 256;
419 int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
420
421 // Call device code
422 cublas_impl_kernels::cublasTscal_kernel<DataType><<<
423 blocks_per_grid, threads_per_block>>>(
424 n, *alpha, x, incx);
425
426 cudaError_t error = cudaDeviceSynchronize();
427
428 return error;
429 }