Templated kernel code for implementations of several BLAS-type functions in CUDA.

Functions
template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void	cublasTgemv_kernel (const bool trans, const int m, const int n, const DataType alpha, const DataType *RESTRICT A, const int lda, const DataType *RESTRICT x, const int incx, const DataType beta, DataType *RESTRICT y, const int incy)
	Performs the operation \( \boldsymbol{y} = \alpha \mathbf{A} \boldsymbol{x} + \beta \boldsymbol{y} \).

template<typename DataType >
__global__ void	cublasTcopy_kernel (const int n, const DataType *RESTRICT x, const int incx, DataType *RESTRICT y, const int incy)
	Performs \( \boldsymbol{y} = \boldsymbol{x} \).

template<typename DataType >
__global__ void	cublasTaxpy_kernel (const int n, const DataType alpha, const DataType *RESTRICT x, const int incx, DataType *RESTRICT y, const int incy)
	Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).

template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void	cublasTdot_kernel (const int n, const DataType *RESTRICT x, const int incx, const DataType *RESTRICT y, const int incy, ComputeType *RESTRICT result)
	Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).

template<typename DataType , typename ComputeType , unsigned int block_size>
__global__ void	cublasTnrm2_kernel (const int n, const DataType *RESTRICT x, const int incx, ComputeType *RESTRICT result)
	Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).

template<typename DataType >
__global__ void	cublasTscal_kernel (const int n, const DataType alpha, DataType *RESTRICT x, const int incx)
	Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).

Detailed Description

Templated kernel code for implementations of several BLAS-type functions in CUDA.

The motivation for re-implementing CuBLAS is that CUDA's CuBLAS library does not support the float16 type (__half in CUDA) and the __nv_bfloat16 type in some of its functions. For instance, while the level 3 functions support these types, CuBLAS does not provide level 2 and level 1 functions for the float16 type.

The functions in this namespace provide some level 1 and level 2 functions by implementing CUDA kernels from scratch. These implementations support mixed-precision computation: both the data type and the inner computation type are template parameters. The data type is set by the DataType typename and the inner computation type is set by the ComputeType typename.
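
The benefit of separating the storage type from the accumulation type can be seen in a small host-side sketch. It is not part of this library: the function name accumulate_as is hypothetical, and float/double stand in for a half-precision DataType and a wider ComputeType, since standard C++ has no float16 type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the library): the data is stored as DataType,
// but the running sum is kept in ComputeType.
template <typename DataType, typename ComputeType>
ComputeType accumulate_as(const std::vector<DataType>& x)
{
    ComputeType sum = static_cast<ComputeType>(0);

    for (std::size_t i = 0; i < x.size(); ++i)
    {
        // Promote each element to the compute type before accumulating
        sum += static_cast<ComputeType>(x[i]);
    }

    return sum;
}
```

With x = {16777216, 1, 1} stored as float, accumulating in float loses both small terms to rounding, while accumulating in double keeps them. The kernels below apply the same idea through cu_arithmetics::cast.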

Although the templates are generic, their main intent is to cover the types missing from CuBLAS, namely the float16 type (__half in CUDA) and the __nv_bfloat16 type (which is Google's bfloat16 type). Users may nevertheless instantiate these templates with any data and compute types.

The prefix convention for all functions in this namespace is cublasT (for instance, cublasTgemv), where T denotes template. In the CuBLAS API, this letter is a placeholder for the data type, such as S for single precision and D for double precision.

The functions in this namespace are the device (kernel) codes. The host code corresponding to each kernel can be found in the cublas_impl namespace.

See also: Namespace cublas_api.

Function Documentation

◆ cublasTaxpy_kernel()

template<typename DataType >

__global__ void cublas_impl_kernels::cublasTaxpy_kernel	(	const int	n,
		const DataType	alpha,
		const DataType *RESTRICT	x,
		const int	incx,
		DataType *RESTRICT	y,
		const int	incy
	)

Performs \( \boldsymbol{y} = \alpha \boldsymbol{x} + \boldsymbol{y} \).

This function is the device (kernel) code for the host function cublas_impl::cublasTaxpy().

Parameters

[in]	n	Size of array \( \boldsymbol{x} \).
[in]	alpha	The scalar parameter \( \alpha \).
[in]	x	Input vector \( \boldsymbol{x} \) stored on GPU device.
[in]	incx	Stride between consecutive elements of \( \boldsymbol{x} \).
[in,out]	y	Input and output vector \( \boldsymbol{y} \) stored on GPU device. This vector is updated in-place.
[in]	incy	Stride between consecutive elements of \( \boldsymbol{y} \).

See also: cublasTaxpy

Definition at line 267 of file cublas_impl_kernels.cu.

    {
        const int i = threadIdx.x + blockIdx.x * blockDim.x;
 
        if (i < n)
        {
            y[i * incy] = \
                cu_arithmetics::add<DataType>(
                    cu_arithmetics::mul<DataType>(alpha, x[i * incx]),
                    y[i * incy]
                );
        }
    }
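
Because this kernel has no grid-stride loop, each thread handles at most one element, so the launch grid has to cover all n elements. The following host-side C++ sketch emulates that indexing sequentially. The names num_blocks_for and axpy_emulated are hypothetical and not part of the library, and plain arithmetic operators stand in for cu_arithmetics::add and cu_arithmetics::mul.

```cpp
#include <cassert>
#include <vector>

// Smallest number of blocks such that blocks * threads_per_block >= n.
inline int num_blocks_for(const int n, const int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Sequential emulation of the kernel body: the two outer loops play the role
// of blockIdx.x and threadIdx.x.
template <typename DataType>
void axpy_emulated(const int n, const DataType alpha,
                   const DataType* x, const int incx,
                   DataType* y, const int incy,
                   const int threads_per_block)
{
    const int blocks = num_blocks_for(n, threads_per_block);

    for (int block = 0; block < blocks; ++block)
    {
        for (int thread = 0; thread < threads_per_block; ++thread)
        {
            const int i = thread + block * threads_per_block;

            // Threads past the end of the vector do nothing
            if (i < n)
            {
                y[i * incy] = alpha * x[i * incx] + y[i * incy];
            }
        }
    }
}
```

For n = 5 and 4 threads per block, two blocks are needed and the last three threads of the second block are idle.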

References cu_arithmetics::abs().

◆ cublasTcopy_kernel()

template<typename DataType >

__global__ void cublas_impl_kernels::cublasTcopy_kernel	(	const int	n,
		const DataType *RESTRICT	x,
		const int	incx,
		DataType *RESTRICT	y,
		const int	incy
	)

Performs \( \boldsymbol{y} = \boldsymbol{x} \).

This function is the device (kernel) code for the host function cublas_impl::cublasTcopy().

Parameters

[in]	n	Size of the array \( \boldsymbol{x} \).
[in]	x	Input vector \( \boldsymbol{x} \) stored on GPU device.
[in]	incx	Stride between consecutive elements of \( \boldsymbol{x} \).
[out]	y	Output vector \( \boldsymbol{y} \) stored on GPU device.
[in]	incy	Stride between consecutive elements of \( \boldsymbol{y} \).

See also: cublasTcopy

Definition at line 223 of file cublas_impl_kernels.cu.

    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
 
        if (i < n)
        {
            y[i * incy] = x[i * incx];
        }
    }

◆ cublasTdot_kernel()

template<typename DataType , typename ComputeType , unsigned int block_size>

__global__ void cublas_impl_kernels::cublasTdot_kernel	(	const int	n,
		const DataType *RESTRICT	x,
		const int	incx,
		const DataType *RESTRICT	y,
		const int	incy,
		ComputeType *RESTRICT	result
	)

Computes \( a = \boldsymbol{x} \cdot \boldsymbol{y} \).

This function is the device (kernel) code for the host function cublas_impl::cublasTdot().

Parameters

[in]	n	Size of array \( \boldsymbol{x} \).
[in]	x	Input vector \( \boldsymbol{x} \) stored on GPU device.
[in]	incx	Stride between consecutive elements of \( \boldsymbol{x} \).
[in]	y	Input vector \( \boldsymbol{y} \) stored on GPU device.
[in]	incy	Stride between consecutive elements of \( \boldsymbol{y} \).
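[in,out]	result	The dot product of the two vectors. Each block adds its partial sum to this value with atomicAdd, so it must be initialized to zero before the kernel is launched.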

See also: cublasTdot

Definition at line 316 of file cublas_impl_kernels.cu.

    {
        // The size of this array should be exactly the number of threads per
        // block (for this, see the corresponding host code,
        // cublas_impl::cublasTdot)
        __shared__ ComputeType partial_sum[block_size];
 
        const int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
 
        ComputeType sum = static_cast<ComputeType>(0.0f);
        while (i < n)
        {
            sum += cu_arithmetics::cast<DataType, ComputeType>(x[i * incx]) * \
                   cu_arithmetics::cast<DataType, ComputeType>(y[i * incy]);
 
            i += blockDim.x * gridDim.x;
        }
 
        partial_sum[tid] = sum;
 
        __syncthreads();
 
        // Reduction in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
        {
            if (tid < stride)
            {
                partial_sum[tid] += partial_sum[tid + stride];
            }
            __syncthreads();
        }
 
        // Write result for this block to global memory
        if (tid == 0)
        {
            atomicAdd(result, partial_sum[0]);
        }
    }
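
The three phases of the kernel above (grid-stride accumulation per thread, tree reduction per block, accumulation across blocks) can be traced with a sequential host-side sketch. The name dot_emulated is hypothetical, and the loops merely stand in for threads and blocks. The sketch also makes one property of the reduction explicit: halving the stride only sums every element when the block size is a power of two, which the kernel appears to assume.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sequential emulation of the dot-product kernel (not library code).
template <typename DataType, typename ComputeType>
ComputeType dot_emulated(const int n,
                         const DataType* x, const int incx,
                         const DataType* y, const int incy,
                         const int num_blocks, const int block_size)
{
    // The tree reduction below drops elements unless block_size is 2^k
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);

    // Plays the role of *result, which must start at zero
    ComputeType result = static_cast<ComputeType>(0);
    std::vector<ComputeType> partial_sum(block_size);

    for (int block = 0; block < num_blocks; ++block)
    {
        // Phase 1: each thread sums its elements with a grid-stride loop
        for (int tid = 0; tid < block_size; ++tid)
        {
            ComputeType sum = static_cast<ComputeType>(0);
            for (int i = block * block_size + tid; i < n;
                 i += block_size * num_blocks)
            {
                sum += static_cast<ComputeType>(x[i * incx]) *
                       static_cast<ComputeType>(y[i * incy]);
            }
            partial_sum[tid] = sum;
        }

        // Phase 2: tree reduction within the block
        for (int stride = block_size / 2; stride > 0; stride >>= 1)
        {
            for (int tid = 0; tid < stride; ++tid)
            {
                partial_sum[tid] += partial_sum[tid + stride];
            }
        }

        // Phase 3: one accumulation per block (atomicAdd in the kernel)
        result += partial_sum[0];
    }

    return result;
}
```

With x = (1, ..., 7), y = (2, ..., 2), two blocks and two threads per block, each block contributes 28 and the result is 56.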

References cu_arithmetics::abs().

Referenced by cublas_impl::cublasTdot().

◆ cublasTgemv_kernel()

template<typename DataType , typename ComputeType , unsigned int block_size>

__global__ void cublas_impl_kernels::cublasTgemv_kernel	(	const bool	trans,
		const int	m,
		const int	n,
		const DataType	alpha,
		const DataType *RESTRICT	A,
		const int	lda,
		const DataType *RESTRICT	x,
		const int	incx,
		const DataType	beta,
		DataType *RESTRICT	y,
		const int	incy
	)

Performs the operation \( \boldsymbol{y} = \alpha \mathbf{A} \boldsymbol{x} + \beta \boldsymbol{y} \).

This function is the device (kernel) code for cublas_impl::cublasTgemv().

Note: This function incorporates both non-transposed and transposed operations. To this end, m and n are defined by the sizes of y and x (respectively), not by the stored dimensions of A. That is, the operator applied to x (A or its transpose) is always m-by-n.
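
To make the role of m, n, and trans concrete, here is a plain sequential reference in C++ that uses the same two index expressions as the kernel body further below. It is a sketch for checking results on the host, not the library's implementation. The name gemv_reference is hypothetical, and ordinary operators replace the cu_arithmetics helpers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host reference (not library code). A is column-major with
// leading dimension lda; m is the length of y and n is the length of x.
template <typename DataType, typename ComputeType>
void gemv_reference(const bool trans, const int m, const int n,
                    const DataType alpha, const DataType* A, const int lda,
                    const DataType* x, const int incx,
                    const DataType beta, DataType* y, const int incy)
{
    for (int i = 0; i < m; ++i)
    {
        ComputeType sum = static_cast<ComputeType>(0);

        for (int j = 0; j < n; ++j)
        {
            // trans:     element (j, i) of the stored matrix
            // non-trans: element (i, j) of the stored matrix
            const DataType a = trans ? A[i * lda + j] : A[i + j * lda];

            sum += static_cast<ComputeType>(a) *
                   static_cast<ComputeType>(x[j * incx]);
        }

        y[i * incy] = alpha * static_cast<DataType>(sum) + beta * y[i * incy];
    }
}
```

For a stored 2-by-3 matrix, the non-transposed call uses m = 2, n = 3, while the transposed call on the same array uses m = 3, n = 2.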

Parameters

[in]	trans	If `false`, the operator \( \mathbf{A} \) is not transposed; if `true`, it is transposed. These values correspond to `CUBLAS_OP_N` and `CUBLAS_OP_T`, respectively.
[in]	m	Size of `y`.
[in]	n	Size of `x`.
[in]	alpha	Scalar parameter \( \alpha \).
[in]	A	Matrix `A`. The matrix is assumed to be stored as a contiguous 1D array with column-major ordering. The matrix size is `m*n`.
[in]	lda	Leading dimension of `A`.
[in]	x	Input vector `x` of size `n*incx`.
[in]	incx	Stride between consecutive elements of \( \boldsymbol{x} \).
[in]	beta	Scalar parameter \( \beta \).
[in,out]	y	Input and output vector `y` of size `m*incy`. This vector is updated in-place.
[in]	incy	Stride between consecutive elements of \( \boldsymbol{y} \).

See also: cublas_impl::cublasTgemv

Definition at line 79 of file cublas_impl_kernels.cu.

    {
        // Each thread is dedicated to compute an element of y
        const unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
 
        // Device shared memory to cache x only (note: we do not cache A since
        // the elements of A are read only once. In contrast, x is read several
        // times).
        __shared__ DataType x_shared[block_size];
 
        // Summation for the dot product of i-th row of A (or A transposed)
        // with the entire x. The sum variable is local to i-th thread only,
        // and is not shared with other threads of block.
        ComputeType sum = 0.0f;
 
        // Iterate over blocks of x elements
        const unsigned int num_blocks = (n + block_size - 1) / block_size;
 
        // Each thread (index i) loops over all elements j of x in block by
        // block manner.
        #pragma unroll
        for (unsigned long int block_counter = 0;
             block_counter < num_blocks;
             ++block_counter)
        {
            // Get j-th index of x. This is only used to read x to copy it to
            // the cache of x.
            unsigned long int j = threadIdx.x + \
                block_counter * static_cast<unsigned long int>(block_size);
 
            // Fill x cache
            if (j < n)
            {
                // Read x from global memory to shared memory
                x_shared[threadIdx.x] = x[j * incx];
            }
            else
            {
                // If block element exceeds x size, fill cache with zeros.
                x_shared[threadIdx.x] = \
                    cu_arithmetics::cast<ComputeType, DataType>(0.0f);
            }
 
            // Sync all threads of block to finish caching x from global memory
            // to shared memory
            __syncthreads();
 
            // Now that one block of cache is filled, perform matrix-vector
            // multiplication for that one block.
            #pragma unroll
            for (unsigned int e = 0; e < block_size; ++e)
            {
                // Get the index of x (called e_j) corresponding to the e-th
                // element of the cached block. This is different than the j
                // above.
                unsigned long int e_j = e + \
                    block_counter * static_cast<unsigned long int>(block_size);
 
                // It is necessary to check indices i and e_j against the array
                // sizes, as these indices can exceed the array bounds since
                // thread blocks come in multiples of 32 (the warp size).
                if ((i < m) && (e_j < n))
                {
                    // Perform matrix-vector multiplication for the i-th row of
                    // A (or i-th row of transposed A) and the e_j th element
                    // of x.
                    if (trans)
                    {
                        sum += cu_arithmetics::cast<DataType, ComputeType>(
                                    A[i * lda + e_j]) * \
                               cu_arithmetics::cast<DataType, ComputeType>(
                                    x_shared[e]);
                    }
                    else
                    {
                        sum += cu_arithmetics::cast<DataType, ComputeType>(
                                    A[i + e_j * lda]) * \
                               cu_arithmetics::cast<DataType, ComputeType>(
                                    x_shared[e]);
                    }
                }
            }
 
            // Wait until all threads of the block are done with their
            // matrix-vector multiplication (each thread has its own sum
            // variable, but they all read the cached x). This barrier ensures
            // no thread proceeds to refill the cache in the next iteration.
            __syncthreads();
        }
 
        // Update output vector only if thread does not exceed matrix size
        if (i < m)
        {
            y[i * incy] = \
                cu_arithmetics::add<DataType>(
                    cu_arithmetics::mul<DataType>(
                        alpha,
                        cu_arithmetics::cast<ComputeType, DataType>(sum)
                    ),
                    cu_arithmetics::mul<DataType>(
                        beta,
                        y[i * incy]
                    )
                );
        }
    }
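
The tiling scheme used above (load one tile of x into the cache, let every thread consume it, then move to the next tile) can be followed step by step in a sequential host-side sketch. The name gemv_tiled_emulated is hypothetical. For brevity the sketch covers only the non-transposed case with unit strides and returns the raw sums before alpha and beta are applied. Like the kernel, it assumes the number of threads per block equals the block_size template parameter.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sequential emulation of the tiled kernel (not library code).
template <typename DataType, typename ComputeType, unsigned int block_size>
std::vector<ComputeType> gemv_tiled_emulated(
        const unsigned int m, const unsigned int n,
        const DataType* A, const unsigned int lda, const DataType* x)
{
    assert(lda >= m);
    std::vector<ComputeType> result(m, static_cast<ComputeType>(0));

    const unsigned int num_tiles = (n + block_size - 1) / block_size;
    const unsigned int grid_size = (m + block_size - 1) / block_size;

    for (unsigned int block = 0; block < grid_size; ++block)
    {
        DataType x_shared[block_size];  // stands in for __shared__ memory
        std::vector<ComputeType> sum(block_size, static_cast<ComputeType>(0));

        for (unsigned int tile = 0; tile < num_tiles; ++tile)
        {
            // Step 1: every thread loads one element of x, or zero padding
            for (unsigned int tid = 0; tid < block_size; ++tid)
            {
                const unsigned int j = tid + tile * block_size;
                x_shared[tid] = (j < n) ? x[j] : static_cast<DataType>(0);
            }

            // Step 2 (after the first barrier): every thread consumes the tile
            for (unsigned int tid = 0; tid < block_size; ++tid)
            {
                const unsigned int i = tid + block * block_size;
                for (unsigned int e = 0; e < block_size; ++e)
                {
                    const unsigned int e_j = e + tile * block_size;
                    if ((i < m) && (e_j < n))
                    {
                        sum[tid] += static_cast<ComputeType>(A[i + e_j * lda]) *
                                    static_cast<ComputeType>(x_shared[e]);
                    }
                }
            }
        }

        // Each thread writes its own row sum
        for (unsigned int tid = 0; tid < block_size; ++tid)
        {
            const unsigned int i = tid + block * block_size;
            if (i < m)
            {
                result[i] = sum[tid];
            }
        }
    }

    return result;
}
```

In the sequential version the two inner loops over tid run one after the other, which is the ordering that the two __syncthreads() calls enforce on the GPU.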

References cu_arithmetics::abs().

Referenced by cublas_impl::cublasTgemv().

◆ cublasTnrm2_kernel()

template<typename DataType , typename ComputeType , unsigned int block_size>

__global__ void cublas_impl_kernels::cublasTnrm2_kernel	(	const int	n,
		const DataType *RESTRICT	x,
		const int	incx,
		ComputeType *RESTRICT	result
	)

Computes \( a = \boldsymbol{x} \cdot \boldsymbol{x} \).

This function is the device (kernel) code for the host function cublas_impl::cublasTnrm2().

Parameters

[in]	n	Size of array \( \boldsymbol{x} \).
[in]	x	Input vector \( \boldsymbol{x} \) stored on GPU device.
[in]	incx	Stride between consecutive elements of \( \boldsymbol{x} \).
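[in,out]	result	The squared norm of the vector. Each block adds its partial sum to this value with atomicAdd, so it must be initialized to zero before the kernel is launched.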

See also: cublasTnrm2

Definition at line 385 of file cublas_impl_kernels.cu.

    {
        // The size of this array should be exactly the number of threads per
        // block (for this, see the corresponding host code,
        // cublas_impl::cublasTnrm2)
        __shared__ ComputeType partial_sum[block_size];
 
        const int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
 
        ComputeType sum = static_cast<ComputeType>(0.0f);
        while (i < n)
        {
            ComputeType val = cu_arithmetics::cast<DataType, ComputeType>(
                    x[i * incx]);
            sum += val * val;
            i += blockDim.x * gridDim.x;
        }
 
        partial_sum[tid] = sum;
 
        __syncthreads();
 
        // Reduction in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
        {
            if (tid < stride)
            {
                partial_sum[tid] += partial_sum[tid + stride];
            }
            __syncthreads();
        }
 
        // Write result for this block to global memory
        if (tid == 0)
        {
            atomicAdd(result, partial_sum[0]);
        }
    }
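
Note that the kernel accumulates the sum of squares, not the Euclidean norm itself. A caller that wants the norm would still need a square root. The host-side sketch below shows that last step under the assumption that it is done outside the kernel (presumably in the host function, which is not shown here). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical host reference for what the kernel accumulates into result.
template <typename DataType, typename ComputeType>
ComputeType sum_of_squares(const int n, const DataType* x, const int incx)
{
    ComputeType sum = static_cast<ComputeType>(0);

    for (int i = 0; i < n; ++i)
    {
        const ComputeType val = static_cast<ComputeType>(x[i * incx]);
        sum += val * val;
    }

    return sum;
}

// Assumed final step: convert the accumulated sum of squares to the norm.
template <typename ComputeType>
ComputeType norm_from_sum_of_squares(const ComputeType sum)
{
    assert(sum >= static_cast<ComputeType>(0));
    return std::sqrt(sum);
}
```

For x = (3, 4) stored with stride 2, the accumulated value is 25 and the norm is 5.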

References cu_arithmetics::abs().

Referenced by cublas_impl::cublasTnrm2().

◆ cublasTscal_kernel()

template<typename DataType >

__global__ void cublas_impl_kernels::cublasTscal_kernel	(	const int	n,
		const DataType	alpha,
		DataType *RESTRICT	x,
		const int	incx
	)

Performs \( \boldsymbol{x} = \alpha \boldsymbol{x} \).

This function is the device (kernel) code for the host function cublas_impl::cublasTscal().

Parameters

[in]	n	Size of array \( \boldsymbol{x} \).
[in]	alpha	The scalar parameter \( \alpha \).
[in,out]	x	Input and output vector \( \boldsymbol{x} \) stored on GPU device. This vector is written in-place.
[in]	incx	Stride between consecutive elements of \( \boldsymbol{x} \).

See also: cublasTscal

Definition at line 453 of file cublas_impl_kernels.cu.

    {
        const int i = threadIdx.x + blockIdx.x * blockDim.x;
 
        if (i < n)
        {
            x[i * incx] = cu_arithmetics::mul<DataType>(x[i * incx], alpha);
        }
    }

References cu_arithmetics::abs().
