imate
C++/CUDA Reference
Loading...
Searching...
No Matches
cublas_api Namespace Reference

A collection of templates to wrapper cublas functions. More...

Functions

template<>
cublasStatus_t cublasXgemv< float > (cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *RESTRICT alpha, const float *RESTRICT A, int lda, const float *RESTRICT x, int incx, const float *RESTRICT beta, float *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).
 
template<>
cublasStatus_t cublasXgemv< double > (cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *RESTRICT alpha, const double *RESTRICT A, int lda, const double *RESTRICT x, int incx, const double *RESTRICT beta, double *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).
 
template<>
cublasStatus_t cublasXcopy< float > (cublasHandle_t handle, int n, const float *RESTRICT x, int incx, float *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \) in __half type.
 
template<>
cublasStatus_t cublasXcopy< double > (cublasHandle_t handle, int n, const double *RESTRICT x, int incx, double *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \) in double type.
 
template<>
cublasStatus_t cublasXaxpy< float > (cublasHandle_t handle, int n, const float *RESTRICT alpha, const float *RESTRICT x, int incx, float *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \) on __half precision.
 
template<>
cublasStatus_t cublasXaxpy< double > (cublasHandle_t handle, int n, const double *RESTRICT alpha, const double *RESTRICT x, int incx, double *RESTRICT y, int incy)
 Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \) on double precision.
 
template<>
cublasStatus_t cublasXdot< float > (cublasHandle_t handle, int n, const float *RESTRICT x, int incx, const float *RESTRICT y, int incy, float *RESTRICT result)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{y} \) on __half precision.
 
template<>
cublasStatus_t cublasXdot< double > (cublasHandle_t handle, int n, const double *RESTRICT x, int incx, const double *RESTRICT y, int incy, double *RESTRICT result)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{y} \) on double precision.
 
template<>
cublasStatus_t cublasXnrm2< float > (cublasHandle_t handle, int n, const float *RESTRICT x, int incx, float *RESTRICT result)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{x} \) on __half precision.
 
template<>
cublasStatus_t cublasXnrm2< double > (cublasHandle_t handle, int n, const double *RESTRICT x, int incx, double *RESTRICT result)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{x} \) on double precision.
 
template<>
cublasStatus_t cublasXscal< float > (cublasHandle_t handle, int n, const float *RESTRICT alpha, float *RESTRICT x, int incx)
 Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \) on __half precision.
 
template<>
cublasStatus_t cublasXscal< double > (cublasHandle_t handle, int n, const double *RESTRICT alpha, double *RESTRICT x, int incx)
 Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \) on double precision.
 
template<typename DataType >
cublasStatus_t cublasXgemv (cublasHandle_t handle, cublasOperation_t trans, int m, int n, const DataType *RESTRICT alpha, const DataType *RESTRICT A, int lda, const DataType *RESTRICT x, int incx, const DataType *RESTRICT beta, DataType *RESTRICT y, int incy)
 
template<typename DataType >
cublasStatus_t cublasXcopy (cublasHandle_t handle, int n, const DataType *RESTRICT x, int incx, DataType *RESTRICT y, int incy)
 
template<typename DataType >
cublasStatus_t cublasXaxpy (cublasHandle_t handle, int n, const DataType *RESTRICT alpha, const DataType *RESTRICT x, int incx, DataType *RESTRICT y, int incy)
 
template<typename DataType >
cublasStatus_t cublasXdot (cublasHandle_t handle, int n, const DataType *RESTRICT x, int incx, const DataType *RESTRICT y, int incy, DataType *RESTRICT result)
 
template<typename DataType >
cublasStatus_t cublasXnrm2 (cublasHandle_t handle, int n, const DataType *RESTRICT x, int incx, DataType *RESTRICT result)
 
template<typename DataType >
cublasStatus_t cublasXscal (cublasHandle_t handle, int n, const DataType *RESTRICT alpha, DataType *RESTRICT x, int incx)
 

Detailed Description

A collection of templates to wrapper cublas functions.

Note
The implementation in the cu file is wrapped inside the namepsace clause. This is not necessary in general, however, it is needed to avoid the old gcc compiler error (this is a gcc bug) which complains "no instance of function template matches the argument list const float".

Function Documentation

◆ cublasXaxpy()

template<typename DataType >
cublasStatus_t cublas_api::cublasXaxpy ( cublasHandle_t  handle,
int  n,
const DataType *RESTRICT  alpha,
const DataType *RESTRICT  x,
int  incx,
DataType *RESTRICT  y,
int  incy 
)

Referenced by cuVectorOperations< DataType >::subtract_scaled_vector().

Here is the caller graph for this function:

◆ cublasXaxpy< double >()

template<>
cublasStatus_t cublas_api::cublasXaxpy< double > ( cublasHandle_t  handle,
int  n,
const double *RESTRICT  alpha,
const double *RESTRICT  x,
int  incx,
double *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \) on double precision.

This function is a half type implementation similar to cuBLAS's cublasSaxpy.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXgemv

Definition at line 956 of file cublas_api.cu.

964 {
965 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
966 // Use in-house implementation
967 cudaError_t error = cublas_impl::cublasTaxpy<double>(
968 n, alpha, x, incx, y, incy);
969
970 if (error != cudaSuccess)
971 {
972 return CUBLAS_STATUS_SUCCESS;
973 }
974 else
975 {
976 return CUBLAS_STATUS_INTERNAL_ERROR;
977 }
978
979 #else
980 return cublasDaxpy(handle, n, alpha, x, incx, y, incy);
981 #endif
982 }
cublasStatus_t cublasDaxpy(cublasHandle_t handle, int n, const double *alpha, const double *x, int incx, double *y, int incy)
Definition of CUDA's cublasDaxpy function using dynamically loaded cublas library.

References cublasDaxpy().

Here is the call graph for this function:

◆ cublasXaxpy< float >()

template<>
cublasStatus_t cublas_api::cublasXaxpy< float > ( cublasHandle_t  handle,
int  n,
const float *RESTRICT  alpha,
const float *RESTRICT  x,
int  incx,
float *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \) on __half precision.

This function is a half type implementation similar to cuBLAS's cublasSaxpy.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXgemv

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \) on __nv_bfloat16 precision.

This function is a half type implementation similar to cuBLAS's cublasSaxpy.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXgemv

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \) on float precision.

This function is a half type implementation similar to cuBLAS's cublasSaxpy.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXgemv

Definition at line 895 of file cublas_api.cu.

903 {
904 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
905 // Use in-house implementation
906 cudaError_t error = cublas_impl::cublasTaxpy<float>(
907 n, alpha, x, incx, y, incy);
908
909 if (error != cudaSuccess)
910 {
911 return CUBLAS_STATUS_SUCCESS;
912 }
913 else
914 {
915 return CUBLAS_STATUS_INTERNAL_ERROR;
916 }
917
918 #else
919 return cublasSaxpy(handle, n, alpha, x, incx, y, incy);
920 #endif
921 }
cublasStatus_t cublasSaxpy(cublasHandle_t handle, int n, const float *alpha, const float *x, int incx, float *y, int incy)
Definition of CUDA's cublasSaxpy function using dynamically loaded cublas library.

References cublasSaxpy().

Here is the call graph for this function:

◆ cublasXcopy()

template<typename DataType >
cublasStatus_t cublas_api::cublasXcopy ( cublasHandle_t  handle,
int  n,
const DataType *RESTRICT  x,
int  incx,
DataType *RESTRICT  y,
int  incy 
)

Referenced by cuVectorOperations< DataType >::copy_scaled_vector(), and cuVectorOperations< DataType >::copy_vector().

Here is the caller graph for this function:

◆ cublasXcopy< double >()

template<>
cublasStatus_t cublas_api::cublasXcopy< double > ( cublasHandle_t  handle,
int  n,
const double *RESTRICT  x,
int  incx,
double *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \) in double type.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of the array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Definition at line 716 of file cublas_api.cu.

723 {
724 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
725 // Use in-house implementation
726 cudaError_t error = cublas_impl::cublasTcopy<double>(
727 n, x, incx, y, incy);
728
729 if (error != cudaSuccess)
730 {
731 return CUBLAS_STATUS_SUCCESS;
732 }
733 else
734 {
735 return CUBLAS_STATUS_INTERNAL_ERROR;
736 }
737
738 #else
739 // Use Nvidia's CuBLAS
740 return cublasDcopy(handle, n, x, incx, y, incy);
741 #endif
742 }
cublasStatus_t cublasDcopy(cublasHandle_t handle, int n, const double *x, int incx, double *y, int incy)
Definition of CUDA's cublasDcopy function using dynamically loaded cublas library.

References cublasDcopy().

Here is the call graph for this function:

◆ cublasXcopy< float >()

template<>
cublasStatus_t cublas_api::cublasXcopy< float > ( cublasHandle_t  handle,
int  n,
const float *RESTRICT  x,
int  incx,
float *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \) in __half type.

This function is not a template wrapper for CuBLAS, since CuBLAS API does not have cublasHcopy (where H is used for __half type). As such, this function is implemented with CUDA, rather than from CuBLAS.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of the array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Performs \( \boldsymbol{y} = \boldsymbol{x} \) in __nv_bfloat16 type.

This function is not a template wrapper for CuBLAS, since CuBLAS API does not have cublasHcopy (where H is used for __nv_bfloat16 type). As such, this function is implemented with CUDA, rather than from CuBLAS.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of the array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Performs \( \boldsymbol{y} = \boldsymbol{x} \) in float type.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of the array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Definition at line 660 of file cublas_api.cu.

667 {
668 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
669 // Use in-house implementation
670 cudaError_t error = cublas_impl::cublasTcopy<float>(
671 n, x, incx, y, incy);
672
673 if (error != cudaSuccess)
674 {
675 return CUBLAS_STATUS_SUCCESS;
676 }
677 else
678 {
679 return CUBLAS_STATUS_INTERNAL_ERROR;
680 }
681
682 #else
683 // Use Nvidia's CuBLAS
684 return cublasScopy(handle, n, x, incx, y, incy);
685 #endif
686 }
cublasStatus_t cublasScopy(cublasHandle_t handle, int n, const float *x, int incx, float *y, int incy)
Definition of CUDA's cublasScopy function using dynamically loaded cublas library.

References cublasScopy().

Here is the call graph for this function:

◆ cublasXdot()

template<typename DataType >
cublasStatus_t cublas_api::cublasXdot ( cublasHandle_t  handle,
int  n,
const DataType *RESTRICT  x,
int  incx,
const DataType *RESTRICT  y,
int  incy,
DataType *RESTRICT  result 
)

Referenced by cuVectorOperations< DataType >::inner_product().

Here is the caller graph for this function:

◆ cublasXdot< double >()

template<>
cublasStatus_t cublas_api::cublasXdot< double > ( cublasHandle_t  handle,
int  n,
const double *RESTRICT  x,
int  incx,
const double *RESTRICT  y,
int  incy,
double *RESTRICT  result 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{y} \) on double precision.

This function is a half type implementation similar to cuBLAS's cublasSdot.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
[out]resultThe dot product of two vectors.
See also
cublasHaxpy

Definition at line 1196 of file cublas_api.cu.

1204 {
1205 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
1206 // Use in-house implementation
1207 cudaError_t error = cublas_impl::cublasTdot<double, double>(
1208 n, x, incx, y, incy, result);
1209
1210 if (error != cudaSuccess)
1211 {
1212 return CUBLAS_STATUS_SUCCESS;
1213 }
1214 else
1215 {
1216 return CUBLAS_STATUS_INTERNAL_ERROR;
1217 }
1218
1219 #else
1220 return cublasDdot(handle, n, x, incx, y, incy, result);
1221 #endif
1222 }
cublasStatus_t cublasDdot(cublasHandle_t handle, int n, const double *x, int incx, const double *y, int incy, double *result)
Definition of CUDA's cublasDdot function using dynamically loaded cublas library.

References cublasDdot().

Here is the call graph for this function:

◆ cublasXdot< float >()

template<>
cublasStatus_t cublas_api::cublasXdot< float > ( cublasHandle_t  handle,
int  n,
const float *RESTRICT  x,
int  incx,
const float *RESTRICT  y,
int  incy,
float *RESTRICT  result 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{y} \) on __half precision.

This function is a half type implementation similar to cuBLAS's cublasSdot.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
[out]resultThe dot product of two vectors.
See also
cublasHaxpy

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{y} \) on __nv_bfloat16 precision.

This function is a half type implementation similar to cuBLAS's cublasSdot.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
[out]resultThe dot product of two vectors.
See also
cublasHaxpy

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{y} \) on float precision.

This function is a half type implementation similar to cuBLAS's cublasSdot.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
[out]resultThe dot product of two vectors.
See also
cublasHaxpy

Definition at line 1135 of file cublas_api.cu.

1143 {
1144 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
1145 // Use in-house implementation
1146 cudaError_t error = cublas_impl::cublasTdot<float, float>(
1147 n, x, incx, y, incy, result);
1148
1149 if (error != cudaSuccess)
1150 {
1151 return CUBLAS_STATUS_SUCCESS;
1152 }
1153 else
1154 {
1155 return CUBLAS_STATUS_INTERNAL_ERROR;
1156 }
1157
1158 #else
1159 return cublasSdot(handle, n, x, incx, y, incy, result);
1160 #endif
1161 }
cublasStatus_t cublasSdot(cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result)
Definition of CUDA's cublasSdot function using dynamically loaded cublas library.

References cublasSdot().

Here is the call graph for this function:

◆ cublasXgemv()

template<typename DataType >
cublasStatus_t cublas_api::cublasXgemv ( cublasHandle_t  handle,
cublasOperation_t  trans,
int  m,
int  n,
const DataType *RESTRICT  alpha,
const DataType *RESTRICT  A,
int  lda,
const DataType *RESTRICT  x,
int  incx,
const DataType *RESTRICT  beta,
DataType *RESTRICT  y,
int  incy 
)

◆ cublasXgemv< double >()

template<>
cublasStatus_t cublas_api::cublasXgemv< double > ( cublasHandle_t  handle,
cublasOperation_t  trans,
int  m,
int  n,
const double *RESTRICT  alpha,
const double *RESTRICT  A,
int  lda,
const double *RESTRICT  x,
int  incx,
const double *RESTRICT  beta,
double *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is a template wrapper for cublasDgemv.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Definition at line 481 of file cublas_api.cu.

494 {
495 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
496 // Use in-house implementation
497 cudaError_t error = cublas_impl::cublasTgemv<double, double>(
498 trans, m, n, alpha, A, lda, x, incx, beta, y, incy);
499
500 if (error != cudaSuccess)
501 {
502 return CUBLAS_STATUS_SUCCESS;
503 }
504 else
505 {
506 return CUBLAS_STATUS_INTERNAL_ERROR;
507 }
508
509 #else
510 // Use Nvidia's CuBLAS
511 return cublasDgemv(handle, trans, m, n, alpha, A, lda, x, incx,
512 beta, y, incy);
513 #endif
514 }

◆ cublasXgemv< float >()

template<>
cublasStatus_t cublas_api::cublasXgemv< float > ( cublasHandle_t  handle,
cublasOperation_t  trans,
int  m,
int  n,
const float *RESTRICT  alpha,
const float *RESTRICT  A,
int  lda,
const float *RESTRICT  x,
int  incx,
const float *RESTRICT  beta,
float *RESTRICT  y,
int  incy 
)

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is not a template wrapper for CuBLAS, since CuBLAS API does not have cublasHgemv (where H is used for __nv_fp8_e5m2 type). As such, this function is implemented with CUDA, rather than from CuBLAS.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is not a template wrapper for CuBLAS, since CuBLAS API does not have cublasHgemv (where H is used for __nv_fp8_e4m3 type). As such, this function is implemented with CUDA, rather than from CuBLAS.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is not a template wrapper for CuBLAS, since CuBLAS API does not have cublasHgemv (where H is used for __half type). As such, this function is implemented with CUDA, rather than from CuBLAS.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is not a template wrapper for CuBLAS, since CuBLAS API does not have cublasHgemv (where H is used for __nv_bfloat16 type). As such, this function is implemented with CUDA, rather than from CuBLAS.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Performs \( \boldsymbol{y} = \alpha \text{op}(\mathbf{A}) \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is a template wrapper for cublasSgemv.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mNumber of rows of matrix \( \mathbf{A} \).
[in]nNumber of columns of matrix \( \mathbf{A} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]ATwo-dimensional matrix \( \mathbf{A} \) stored on GPU device as one-dimensional array with column-major ordering.
[in]ldaLeading dimension of two-dimensional matrix \( \mathbf{A} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaThe scalar parameter \( \beta \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasXaxpy

Definition at line 399 of file cublas_api.cu.

412 {
413
414 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
415 // Use in-house implementation
416 cudaError_t error = cublas_impl::cublasTgemv<float, float>(
417 trans, m, n, alpha, A, lda, x, incx, beta, y, incy);
418
419 if (error != cudaSuccess)
420 {
421 return CUBLAS_STATUS_SUCCESS;
422 }
423 else
424 {
425 return CUBLAS_STATUS_INTERNAL_ERROR;
426 }
427
428 #else
429 // Use Nvidia's CuBLAS
430 return cublasSgemv(handle, trans, m, n, alpha, A, lda, x, incx,
431 beta, y, incy);
432 #endif
433 }

◆ cublasXnrm2()

template<typename DataType >
cublasStatus_t cublas_api::cublasXnrm2 ( cublasHandle_t  handle,
int  n,
const DataType *RESTRICT  x,
int  incx,
DataType *RESTRICT  result 
)

Referenced by cuVectorOperations< DataType >::euclidean_norm().

Here is the caller graph for this function:

◆ cublasXnrm2< double >()

template<>
cublasStatus_t cublas_api::cublasXnrm2< double > ( cublasHandle_t  handle,
int  n,
const double *RESTRICT  x,
int  incx,
double *RESTRICT  result 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{x} \) on double precision.

This function is a half type implementation similar to cuBLAS's cublasSnrm2.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]resultThe norm squared of a vector.
See also
cublasHdot

Definition at line 1410 of file cublas_api.cu.

1416 {
1417 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
1418 // Use in-house implementation
1419 cudaError_t error = cublas_impl::cublasTnrm2<double, double>(
1420 n, x, incx, result);
1421
1422 if (error != cudaSuccess)
1423 {
1424 return CUBLAS_STATUS_SUCCESS;
1425 }
1426 else
1427 {
1428 return CUBLAS_STATUS_INTERNAL_ERROR;
1429 }
1430
1431 #else
1432 return cublasDnrm2(handle, n, x, incx, result);
1433 #endif
1434 }
cublasStatus_t cublasDnrm2(cublasHandle_t handle, int n, const double *x, int incx, double *result)
Definition of CUDA's cublasDnrm2 function using dynamically loaded cublas library.

References cublasDnrm2().

Here is the call graph for this function:

◆ cublasXnrm2< float >()

template<>
cublasStatus_t cublas_api::cublasXnrm2< float > ( cublasHandle_t  handle,
int  n,
const float *RESTRICT  x,
int  incx,
float *RESTRICT  result 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{x} \) on __half precision.

This function is a half type implementation similar to cuBLAS's cublasSnrm2.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]resultThe norm squared of a vector.
See also
cublasHdot

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{x} \) on __nv_bfloat16 precision.

This function is a half type implementation similar to cuBLAS's cublasSnrm2.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]resultThe norm squared of a vector.
See also
cublasHdot

Performs \( \boldsymbol{y} = \boldsymbol{x} \cdot \boldsymbol{x} \) on float precision.

This function is a half type implementation similar to cuBLAS's cublasSnrm2.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]resultThe norm squared of a vector.
See also
cublasHdot

Definition at line 1356 of file cublas_api.cu.

1362 {
1363 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
1364 // Use in-house implementation
1365 cudaError_t error = cublas_impl::cublasTnrm2<float, float>(
1366 n, x, incx, result);
1367
1368 if (error != cudaSuccess)
1369 {
1370 return CUBLAS_STATUS_SUCCESS;
1371 }
1372 else
1373 {
1374 return CUBLAS_STATUS_INTERNAL_ERROR;
1375 }
1376
1377 #else
1378 return cublasSnrm2(handle, n, x, incx, result);
1379 #endif
1380 }
cublasStatus_t cublasSnrm2(cublasHandle_t handle, int n, const float *x, int incx, float *result)
Definition of CUDA's cublasSnrm2 function using dynamically loaded cublas library.

References cublasSnrm2().

Here is the call graph for this function:

◆ cublasXscal()

template<typename DataType >
cublasStatus_t cublas_api::cublasXscal ( cublasHandle_t  handle,
int  n,
const DataType *RESTRICT  alpha,
DataType *RESTRICT  x,
int  incx 
)

◆ cublasXscal< double >()

template<>
cublasStatus_t cublas_api::cublasXscal< double > ( cublasHandle_t  handle,
int  n,
const double *RESTRICT  alpha,
double *RESTRICT  x,
int  incx 
)

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \) on double precision.

This function is a half type implementation similar to cuBLAS's cublasSscal.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in,out]xInput and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
See also
cublasHcopy

Definition at line 1626 of file cublas_api.cu.

1632 {
1633 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
1634 // Use in-house implementation
1635 cudaError_t error = cublas_impl::cublasTscal<double>(
1636 n, alpha, x, incx);
1637
1638 if (error != cudaSuccess)
1639 {
1640 return CUBLAS_STATUS_SUCCESS;
1641 }
1642 else
1643 {
1644 return CUBLAS_STATUS_INTERNAL_ERROR;
1645 }
1646
1647 #else
1648 return cublasDscal(handle, n, alpha, x, incx);
1649 #endif
1650 }
cublasStatus_t cublasDscal(cublasHandle_t handle, int n, const double *alpha, double *x, int incx)
Definition of CUDA's cublasDscal function using dynamically loaded cublas library.

References cublasDscal().

Here is the call graph for this function:

◆ cublasXscal< float >()

template<>
cublasStatus_t cublas_api::cublasXscal< float > ( cublasHandle_t  handle,
int  n,
const float *RESTRICT  alpha,
float *RESTRICT  x,
int  incx 
)

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \) on __half precision.

This function is a half type implementation similar to cuBLAS's cublasSscal.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in,out]xInput and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
See also
cublasHcopy

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \) on __nv_bfloat16 precision.

This function is a half type implementation similar to cuBLAS's cublasSscal.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in,out]xInput and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
See also
cublasHcopy

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \) on float precision.

This function is a half type implementation similar to cuBLAS's cublasSscal.

Parameters
[in]handleHandle object for CuBLAS library context.
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in,out]xInput and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
See also
cublasHcopy

Definition at line 1571 of file cublas_api.cu.

1577 {
1578 #if !defined(USE_CUBLAS) || (USE_CUBLAS != 1)
1579 // Use in-house implementation
1580 cudaError_t error = cublas_impl::cublasTscal<float>(
1581 n, alpha, x, incx);
1582
1583 if (error != cudaSuccess)
1584 {
1585 return CUBLAS_STATUS_SUCCESS;
1586 }
1587 else
1588 {
1589 return CUBLAS_STATUS_INTERNAL_ERROR;
1590 }
1591
1592 #else
1593 return cublasSscal(handle, n, alpha, x, incx);
1594 #endif
1595 }
cublasStatus_t cublasSscal(cublasHandle_t handle, int n, const float *alpha, float *x, int incx)
Definition of CUDA's cublasSscal function using dynamically loaded cublas library.

References cublasSscal().

Here is the call graph for this function: