![]() |
imate
C++/CUDA Reference
|
Templated kernel code for implenentations of several BLAS-type functions in CUDA. More...
Functions | |
| template<typename DataType , typename ComputeType , unsigned int block_size> | |
| __global__ void | cublasTgemv_kernel (const bool trans, const int m, const int n, const DataType alpha, const DataType *RESTRICT A, const int lda, const DataType *RESTRICT x, const int incx, const DataType beta, DataType *RESTRICT y, const int incy) |
| Performs the operation \( \boldsymbol{y} = \alpha
\mathbf{A} \boldsymbol{x} + \beta \boldsymbol{y} \). | |
| template<typename DataType > | |
| __global__ void | cublasTcopy_kernel (const int n, const DataType *RESTRICT x, const int incx, DataType *RESTRICT y, const int incy) |
| Performs \( \boldsymbol{y} = \boldsymbol{x} \). | |
| template<typename DataType > | |
| __global__ void | cublasTaxpy_kernel (const int n, const DataType alpha, const DataType *RESTRICT x, const int incx, DataType *RESTRICT y, const int incy) |
| Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} +
\boldsymbol{y} \). | |
| template<typename DataType , typename ComputeType , unsigned int block_size> | |
| __global__ void | cublasTdot_kernel (const int n, const DataType *RESTRICT x, const int incx, const DataType *RESTRICT y, const int incy, ComputeType *RESTRICT result) |
| Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \). | |
| template<typename DataType , typename ComputeType , unsigned int block_size> | |
| __global__ void | cublasTnrm2_kernel (const int n, const DataType *RESTRICT x, const int incx, ComputeType *RESTRICT result) |
| Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \). | |
| template<typename DataType > | |
| __global__ void | cublasTscal_kernel (const int n, const DataType alpha, DataType *RESTRICT x, const int incx) |
| Performs \( \boldsymbol{x} = \alpha \boldsymbol{x}
\). | |
Templated kernel code for implenentations of several BLAS-type functions in CUDA.
The motivation for re-implementing CuBLAS is that CUDA's CuBLAS library does not supports DataType type and __nv_bfloat16 type for some of it functions. For instance, while there is support for level 3 functions, they do not provide level 2 and 1 functions with DataType type.
The functions in this namespace provides some level 2 functions by implementing CUDA kernels from scratch. These implementations are templated with mixed precision computations where both the data types and inner computation types are templated. The data type is set by DatatType typename and the inner computation type is set by ComputeType typename.
Despite the generic templated functions, he main intent of these templates are to be used primarily for the missing types in CuBLAS, namely, the DataType type (which is float16 type) and __nv_bfloat6 type (which is Google's bfloat16 type). But users may utilize these templates for any data and compute types.
The prefix convension for all functions in this namespace are cublasT (for instance cublasTgemv) where T here denotes template. In the CuBLAS API, this letter a placeholder for data type, such as S for single preicsion and D for double precision.
The functions in this namespace are the host codes. The kernel codes corresponding to each host code can be found in cublas_impl_kernels namespace.
| __global__ void cublas_impl_kernels::cublasTaxpy_kernel | ( | const int | n, |
| const DataType | alpha, | ||
| const DataType *RESTRICT | x, | ||
| const int | incx, | ||
| DataType *RESTRICT | y, | ||
| const int | incy | ||
| ) |
Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).
This function is a device-code (kernel) for the host code function for cublas_impl::cublasTaxpy().
| [in] | n | Size of array \( \boldsymbol{x} \). |
| [in] | alpha | The scalar parameter \( \alpha \). |
| [in] | x | Input vector \( \boldsymbol{x} \) stored on GPU device. |
| [in] | incx | Stride between consecutive elements of \( \boldsymbol{x} \). |
| [out] | y | Output vector \( \boldsymbol{y} \) stored on GPU device. |
| [in] | incy | Stride between consecutive elements of \( \boldsymbol{y} \). |
Definition at line 267 of file cublas_impl_kernels.cu.
References cu_arithmetics::abs().

| __global__ void cublas_impl_kernels::cublasTcopy_kernel | ( | const int | n, |
| const DataType *RESTRICT | x, | ||
| const int | incx, | ||
| DataType *RESTRICT | y, | ||
| const int | incy | ||
| ) |
Performs \( \boldsymbol{y} = \boldsymbol{x} \).
This function is a device-code (kernel) for the host code function for cublas_impl::cublasTcopy().
| [in] | n | Size of the array \( \boldsymbol{x} \). |
| [in] | x | Input vector \( \boldsymbol{x} \) stored on GPU device. |
| [in] | incx | Stride between consecutive elements of \( \boldsymbol{x} \). |
| [out] | y | Output vector \( \boldsymbol{y} \) stored on GPU device. |
| [in] | incy | Stride between consecutive elements of \( \boldsymbol{y} \). |
Definition at line 223 of file cublas_impl_kernels.cu.
| __global__ void cublas_impl_kernels::cublasTdot_kernel | ( | const int | n, |
| const DataType *RESTRICT | x, | ||
| const int | incx, | ||
| const DataType *RESTRICT | y, | ||
| const int | incy, | ||
| ComputeType *RESTRICT | result | ||
| ) |
Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).
This function is a device-code (kernel) for the host code function for cublas_impl::cublasTdot().
| [in] | n | Size of array \( \boldsymbol{x} \). |
| [in] | x | Input vector \( \boldsymbol{x} \) stored on GPU device. |
| [in] | incx | Stride between consecutive elements of \( \boldsymbol{x} \). |
| [out] | y | Output vector \( \boldsymbol{y} \) stored on GPU device. |
| [in] | incy | Stride between consecutive elements of \( \boldsymbol{y} \). |
| [out] | result | The dot product of two vectors. |
Definition at line 316 of file cublas_impl_kernels.cu.
References cu_arithmetics::abs().
Referenced by cublas_impl::cublasTdot().


| __global__ void cublas_impl_kernels::cublasTgemv_kernel | ( | const bool | trans, |
| const int | m, | ||
| const int | n, | ||
| const DataType | alpha, | ||
| const DataType *RESTRICT | A, | ||
| const int | lda, | ||
| const DataType *RESTRICT | x, | ||
| const int | incx, | ||
| const DataType | beta, | ||
| DataType *RESTRICT | y, | ||
| const int | incy | ||
| ) |
Performs the operation \( \boldsymbol{y} = \alpha \mathbf{A} \boldsymbol{x} + \beta \boldsymbol{y} \).
This function is the device (kernel) code for cublas_impl::cublasTgemv() .
m and n are defined based on the sizes of y and x (respectively), not the size of A or its transpose. The matrix A (regardless of being transposed) is m*n.| [in] | trans | If set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively. |
| [in] | m | Size of y. |
| [in] | n | Size of x. |
| [in] | alpha | Scalar parameter \( \alpha \). |
| [in] | A | Matrix A. The matrix is assumed to be stored as a coalesced 1D array with column-major ordering. The matrix size is m*n. |
| [in] | lda | Leading dimension of A. |
| [in] | x | Input vector x of size n*incx. |
| [in] | incx | Stride between consecutive elements of \( \boldsymbol{x} \). |
| [in] | beta | Scalar parameter \( \beta \). |
| [out] | y | Output vector y of size m*incy. |
| [in] | incy | Stride between consecutive elements of \( \boldsymbol{y} \). |
Definition at line 79 of file cublas_impl_kernels.cu.
References cu_arithmetics::abs().
Referenced by cublas_impl::cublasTgemv().


| __global__ void cublas_impl_kernels::cublasTnrm2_kernel | ( | const int | n, |
| const DataType *RESTRICT | x, | ||
| const int | incx, | ||
| ComputeType *RESTRICT | result | ||
| ) |
Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).
This function is a device-code (kernel) for the host code function for cublas_impl::cublasTnrm2().
| [in] | n | Size of array \( \boldsymbol{x} \). |
| [in] | x | Input vector \( \boldsymbol{x} \) stored on GPU device. |
| [in] | incx | Stride between consecutive elements of \( \boldsymbol{x} \). |
| [out] | result | The norm squared of a vector. |
Definition at line 385 of file cublas_impl_kernels.cu.
References cu_arithmetics::abs().
Referenced by cublas_impl::cublasTnrm2().


| __global__ void cublas_impl_kernels::cublasTscal_kernel | ( | const int | n, |
| const DataType | alpha, | ||
| DataType *RESTRICT | x, | ||
| const int | incx | ||
| ) |
Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).
This function is a device-code (kernel) for the host code function for cublas_impl::cublasTscal().
| [in] | n | Size of array \( \boldsymbol{x} \). |
| [in] | alpha | The scalar parameter \( \alpha \). |
| [in,out] | x | Input and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place. |
| [in] | incx | Stride between consecutive elements of \( \boldsymbol{x} \). |
Definition at line 453 of file cublas_impl_kernels.cu.
References cu_arithmetics::abs().
