imate
C++/CUDA Reference
Loading...
Searching...
No Matches
cublas_impl_kernels Namespace Reference

Templated kernel code for implenentations of several BLAS-type functions in CUDA. More...

Functions

template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void cublasTgemv_kernel (const bool trans, const int m, const int n, const DataType alpha, const DataType *RESTRICT A, const int lda, const DataType *RESTRICT x, const int incx, const DataType beta, DataType *RESTRICT y, const int incy)
 Performs the operation \( \boldsymbol{y} = \alpha \mathbf{A} \boldsymbol{x} + \beta \boldsymbol{y} \).
 
template<typename DataType >
__global__ void cublasTcopy_kernel (const int n, const DataType *RESTRICT x, const int incx, DataType *RESTRICT y, const int incy)
 Performs \( \boldsymbol{y} = \boldsymbol{x} \).
 
template<typename DataType >
__global__ void cublasTaxpy_kernel (const int n, const DataType alpha, const DataType *RESTRICT x, const int incx, DataType *RESTRICT y, const int incy)
 Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).
 
template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void cublasTdot_kernel (const int n, const DataType *RESTRICT x, const int incx, const DataType *RESTRICT y, const int incy, ComputeType *RESTRICT result)
 Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).
 
template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void cublasTnrm2_kernel (const int n, const DataType *RESTRICT x, const int incx, ComputeType *RESTRICT result)
 Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).
 
template<typename DataType >
__global__ void cublasTscal_kernel (const int n, const DataType alpha, DataType *RESTRICT x, const int incx)
 Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).
 

Detailed Description

Templated kernel code for implenentations of several BLAS-type functions in CUDA.

The motivation for re-implementing CuBLAS is that CUDA's CuBLAS library does not supports DataType type and __nv_bfloat16 type for some of it functions. For instance, while there is support for level 3 functions, they do not provide level 2 and 1 functions with DataType type.

The functions in this namespace provides some level 2 functions by implementing CUDA kernels from scratch. These implementations are templated with mixed precision computations where both the data types and inner computation types are templated. The data type is set by DatatType typename and the inner computation type is set by ComputeType typename.

Despite the generic templated functions, he main intent of these templates are to be used primarily for the missing types in CuBLAS, namely, the DataType type (which is float16 type) and __nv_bfloat6 type (which is Google's bfloat16 type). But users may utilize these templates for any data and compute types.

The prefix convension for all functions in this namespace are cublasT (for instance cublasTgemv) where T here denotes template. In the CuBLAS API, this letter a placeholder for data type, such as S for single preicsion and D for double precision.

The functions in this namespace are the host codes. The kernel codes corresponding to each host code can be found in cublas_impl_kernels namespace.

See also
Namespace cublas_api .

Function Documentation

◆ cublasTaxpy_kernel()

template<typename DataType >
__global__ void cublas_impl_kernels::cublasTaxpy_kernel ( const int  n,
const DataType  alpha,
const DataType *RESTRICT  x,
const int  incx,
DataType *RESTRICT  y,
const int  incy 
)

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).

This function is a device-code (kernel) for the host code function for cublas_impl::cublasTaxpy().

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasTaxpy

Definition at line 267 of file cublas_impl_kernels.cu.

274 {
275 const int i = threadIdx.x + blockIdx.x * blockDim.x;
276
277 if (i < n)
278 {
279 y[i * incy] = \
280 cu_arithmetics::add<DataType>(
281 cu_arithmetics::mul<DataType>(alpha, x[i * incx]),
282 y[i * incy]
283 );
284 }
285 }
__host__ __device__ DataType abs(const DataType x)
Absolute value of a floating point number.

References cu_arithmetics::abs().

Here is the call graph for this function:

◆ cublasTcopy_kernel()

template<typename DataType >
__global__ void cublas_impl_kernels::cublasTcopy_kernel ( const int  n,
const DataType *RESTRICT  x,
const int  incx,
DataType *RESTRICT  y,
const int  incy 
)

Performs \( \boldsymbol{y} = \boldsymbol{x} \).

This function is a device-code (kernel) for the host code function for cublas_impl::cublasTcopy().

Parameters
[in]nSize of the array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublasTcopy

Definition at line 223 of file cublas_impl_kernels.cu.

229 {
230 int i = threadIdx.x + blockIdx.x * blockDim.x;
231
232 if (i < n)
233 {
234 y[i * incy] = x[i * incx];
235 }
236 }

◆ cublasTdot_kernel()

template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void cublas_impl_kernels::cublasTdot_kernel ( const int  n,
const DataType *RESTRICT  x,
const int  incx,
const DataType *RESTRICT  y,
const int  incy,
ComputeType *RESTRICT  result 
)

Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).

This function is a device-code (kernel) for the host code function for cublas_impl::cublasTdot().

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]yOutput vector \( \boldsymbol{y} \) stored on GPU device.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
[out]resultThe dot product of two vectors.
See also
cublasTdot

Definition at line 316 of file cublas_impl_kernels.cu.

323 {
324 // The size of this array should be exactly the number of blocks (for
325 // this, see the corresponding host code, cublas_impl::cublasTdot)
326 __shared__ ComputeType partial_sum[block_size];
327
328 const int tid = threadIdx.x;
329 int i = blockIdx.x * blockDim.x + threadIdx.x;
330
331 ComputeType sum = static_cast<ComputeType>(0.0f);
332 while (i < n)
333 {
334 sum += cu_arithmetics::cast<DataType, ComputeType>(x[i * incx]) * \
335 cu_arithmetics::cast<DataType, ComputeType>(y[i * incy]);
336
337 i += blockDim.x * gridDim.x;
338 }
339
340 partial_sum[tid] = sum;
341
342 __syncthreads();
343
344 // Reduction in shared memory
345 for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
346 {
347 if (tid < stride)
348 {
349 partial_sum[tid] += partial_sum[tid + stride];
350 }
351 __syncthreads();
352 }
353
354 // Write result for this block to global memory
355 if (tid == 0)
356 {
357 atomicAdd(result, partial_sum[0]);
358 }
359 }

References cu_arithmetics::abs().

Referenced by cublas_impl::cublasTdot().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ cublasTgemv_kernel()

template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void cublas_impl_kernels::cublasTgemv_kernel ( const bool  trans,
const int  m,
const int  n,
const DataType  alpha,
const DataType *RESTRICT  A,
const int  lda,
const DataType *RESTRICT  x,
const int  incx,
const DataType  beta,
DataType *RESTRICT  y,
const int  incy 
)

Performs the operation \( \boldsymbol{y} = \alpha \mathbf{A} \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is the device (kernel) code for cublas_impl::cublasTgemv() .

Note
This function incorporate both non-transposed and transposed operations. To this end, here m and n are defined based on the sizes of y and x (respectively), not the size of A or its transpose. The matrix A (regardless of being transposed) is m*n.
Parameters
[in]transIf set to CUBLAS_OP_N or CUBLAS_OP_T, the operator \( \mathbf{A} \) is not transposed or transposed, respectively.
[in]mSize of y.
[in]nSize of x.
[in]alphaScalar parameter \( \alpha \).
[in]AMatrix A. The matrix is assumed to be stored as a coalesced 1D array with column-major ordering. The matrix size is m*n.
[in]ldaLeading dimension of A.
[in]xInput vector x of size n*incx.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[in]betaScalar parameter \( \beta \).
[out]yOutput vector y of size m*incy.
[in]incyStride between consecutive elements of \( \boldsymbol{y} \).
See also
cublas_impl::cublasTgemv

Definition at line 79 of file cublas_impl_kernels.cu.

91 {
92 // Each thread is dedicated to compute an element of y
93 const unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
94
95 // Device shared memory to cache x only (note: we do not cache A since
96 // the elements of A are read only once. In contrast, x is read several
97 // times).
98 __shared__ DataType x_shared[block_size];
99
100 // Summation for the dot product of i-th row of A (or A transposed)
101 // with the entire x. The sum variable is local to i-th thread only,
102 // and is not shared with other threads of block.
103 ComputeType sum = 0.0f;
104
105 // Iterate over blocks of x elements
106 const unsigned int num_blocks = (n + block_size - 1) / block_size;
107
108 // Each thread (index i) loops over all elements j of x in block by
109 // block manner.
110 #pragma unroll
111 for (unsigned long int block_counter = 0;
112 block_counter < num_blocks;
113 ++block_counter)
114 {
115 // Get j-th index of x. This is only used to read x to copy it to
116 // the cache of x.
117 unsigned long int j = threadIdx.x + \
118 block_counter * static_cast<unsigned long int>(block_size);
119
120 // Fill x cache
121 if (j < n)
122 {
123 // Read x from global memory to shared memory
124 x_shared[threadIdx.x] = x[j * incx];
125 }
126 else
127 {
128 // If block element exceeds x size, fill cache with zeros.
129 x_shared[threadIdx.x] = \
130 cu_arithmetics::cast<ComputeType, DataType>(0.0f);
131 }
132
133 // Sync all threads of block to finish caching x from global memory
134 // to shared memory
135 __syncthreads();
136
137 // Now that one block of cache is filled, perform matrix-vector
138 // multiplication for that one block.
139 #pragma unroll
140 for (unsigned int e = 0; e < block_size; ++e)
141 {
142 // Get the index of x (called e_j) corresponding to the e-th
143 // element of the cached block. This is different than the j
144 // above.
145 unsigned long int e_j = e + \
146 block_counter * static_cast<unsigned long int>(block_size);
147
148 // It is necessary to check indices i and e_j with array sizes
149 // as these indices can exceed the array indices since thread
150 // blocks are in the sizes of multiples of 32 (as wrap size).
151 if ((i < m) && (e_j < n))
152 {
153 // Perform matrix-vector multiplication for the i-th row of
154 // A (or i-th row of transposed A) and the e_j th element
155 // of x.
156 if (trans)
157 {
159 A[i * lda + e_j]) * \
160 cu_arithmetics::cast<DataType, ComputeType>(
161 x_shared[e]);
162 }
163 else
164 {
166 A[i + e_j * lda]) * \
167 cu_arithmetics::cast<DataType, ComputeType>(
168 x_shared[e]);
169 }
170 }
171 }
172
173 // Wait till all threads of block done with their matrix-vector
174 // multiplication (each thread has its own sum variable), but they
175 // all read cached x. This sync barrier makes sure no thread
176 // proceeds the next iteration of filling new cache.
177 __syncthreads();
178 }
179
180 // Update output vector only if thread does not exceed matrix size
181 if (i < m)
182 {
183 y[i * incy] = \
184 cu_arithmetics::add<DataType>(
186 alpha,
188 ),
190 beta,
191 y[i * incy]
192 )
193 );
194 }
195 }

References cu_arithmetics::abs().

Referenced by cublas_impl::cublasTgemv().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ cublasTnrm2_kernel()

template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void cublas_impl_kernels::cublasTnrm2_kernel ( const int  n,
const DataType *RESTRICT  x,
const int  incx,
ComputeType *RESTRICT  result 
)

Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).

This function is a device-code (kernel) for the host code function for cublas_impl::cublasTnrm2().

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]xInput vector \( \boldsymbol{x} \) stored on GPU device.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
[out]resultThe norm squared of a vector.
See also
cublasTnrm2

Definition at line 385 of file cublas_impl_kernels.cu.

390 {
391 // The size of this array should be exactly the number of blocks (for
392 // this, see the corresponding host code, cublas_impl::cublasTnrm2)
393 __shared__ ComputeType partial_sum[block_size];
394
395 const int tid = threadIdx.x;
396 int i = blockIdx.x * blockDim.x + threadIdx.x;
397
398 ComputeType sum = static_cast<ComputeType>(0.0f);
399 while (i < n)
400 {
402 x[i * incx]);
403 sum += val * val;
404 i += blockDim.x * gridDim.x;
405 }
406
407 partial_sum[tid] = sum;
408
409 __syncthreads();
410
411 // Reduction in shared memory
412 for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
413 {
414 if (tid < stride)
415 {
416 partial_sum[tid] += partial_sum[tid + stride];
417 }
418 __syncthreads();
419 }
420
421 // Write result for this block to global memory
422 if (tid == 0)
423 {
424 atomicAdd(result, partial_sum[0]);
425 }
426 }

References cu_arithmetics::abs().

Referenced by cublas_impl::cublasTnrm2().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ cublasTscal_kernel()

template<typename DataType >
__global__ void cublas_impl_kernels::cublasTscal_kernel ( const int  n,
const DataType  alpha,
DataType *RESTRICT  x,
const int  incx 
)

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).

This function is a device-code (kernel) for the host code function for cublas_impl::cublasTscal().

Parameters
[in]nSize of array \( \boldsymbol{x} \).
[in]alphaThe scalar parameter \( \alpha \).
[in,out]xInput and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]incxStride between consecutive elements of \( \boldsymbol{x} \).
See also
cublasTscal

Definition at line 453 of file cublas_impl_kernels.cu.

458 {
459 const int i = threadIdx.x + blockIdx.x * blockDim.x;
460
461 if (i < n)
462 {
463 x[i * incx] = cu_arithmetics::mul<DataType>(x[i * incx], alpha);
464 }
465 }

References cu_arithmetics::abs().

Here is the call graph for this function: