7.3. Programming Methods¶
The first two sections of this chapter discuss the significance, ideas, and basic principles behind the design of hardware accelerators. Co-optimization of software and hardware, an important guiding principle for building efficient AI systems, requires close coupling between the software stack and the hardware architecture in neural network applications. To fully exploit the advantages of an accelerator, a set of programming methods tailored to its hardware architecture is therefore needed.
7.3.1. Method Classification¶
Programming methods for hardware accelerators are categorized into three approaches: using high-level computation operators, harnessing primitives for specialized hardware units, and employing low-level assembly languages:
High-level computation operators: Hardware accelerators often come equipped with high-level, hardware-accelerated implementations of operators extensively used in numerical computing and deep learning. For instance, NVIDIA provides cuBLAS (CUDA Basic Linear Algebra Subprograms) and cuDNN (CUDA Deep Neural Network library). These libraries offer developers an accessible way to harness the power of NVIDIA GPUs without delving into low-level code. These operators are optimized for efficiency and automatically exploit specific GPU features, such as Tensor Cores.
Primitives for task-specific hardware units: Hardware accelerators typically feature task-specific hardware units (such as the Tensor Cores in NVIDIA GPUs) engineered to execute mixed-precision matrix multiplication operations at high speed. These units have associated programming primitives, such as CUDA’s Warp Matrix Multiply Accumulate (WMMA) API and primitives for loading and storing the tensors processed by these units.
Low-level assembly languages: Hardware accelerators also have low-level assembly language interfaces. For instance, NVIDIA GPUs offer the PTX ISA (Parallel Thread Execution Instruction Set Architecture). It provides explicit control over all aspects of GPU behavior, but it requires a deep understanding of the GPU architecture and is more challenging to use correctly and effectively than the high-level interfaces provided by cuBLAS and cuDNN. PTX code is typically generated by a compiler from a high-level language like CUDA C++.
In essence, the three methods operate at different levels of abstraction. High-level operators in cuBLAS and cuDNN expose easy-to-use interfaces to powerful hardware-accelerated operations; primitives for task-specific hardware units give a more detailed interface to hardware operations; and low-level assembly languages such as the PTX ISA provide the most fine-grained control over accelerator behavior.
7.3.2. Programming Examples¶
We exemplify different programming methods by implementing the General Matrix Multiplication (GEMM) with each approach. The implementation targets an NVIDIA Volta GPU. GEMM follows the equation \(\bf{C} = \alpha \bf{A}\times \bf{B} + \beta \bf{C}\), where \(\bf{A}\in\mathbb{R}^{M\times K}, \bf{B}\in\mathbb{R}^{K\times N}, \bf{C}\in\mathbb{R}^{M\times N}\), and \(\alpha\) and \(\beta\) are parameters provided by users.
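Before turning to the accelerated implementations, a plain C++ reference of this definition helps fix the semantics that every version below must reproduce. Row-major storage and the function name gemm_reference are our own choices for illustration.
// Reference GEMM: C = alpha * A * B + beta * C, with row-major storage.
// A is M x K, B is K x N, C is M x N.
void gemm_reference(int M, int N, int K, float alpha,
                    const float *A, const float *B,
                    float beta, float *C) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}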
7.3.2.1. High-level Computation Operators¶
Using an operator acceleration library directly is the most straightforward method. NVIDIA offers two operator libraries: cuBLAS and cuDNN. cuBLAS provides an interface for leveraging Tensor Cores to accelerate GEMM operations, while cuDNN offers an interface for accelerating neural network operations. To perform GEMM on Tensor Cores via cuBLAS, we can use the function cublasGemmEx, whose signature is shown in Code lst:cublasGemmEx.
lst:cublasGemmEx
cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                            cublasOperation_t transa, cublasOperation_t transb,
                            int m, int n, int k,
                            const void *alpha,
                            const void *A, cudaDataType_t Atype, int lda,
                            const void *B, cudaDataType_t Btype, int ldb,
                            const void *beta,
                            void *C, cudaDataType_t Ctype, int ldc,
                            cublasComputeType_t computeType,
                            cublasGemmAlgo_t algo)
handle is the cuBLAS handle, which is created with the cublasCreate function. transa denotes whether matrix \(\bf{A}\) is transposed, while transb denotes whether matrix \(\bf{B}\) is transposed. m, n, and k describe the shapes of the matrices. alpha and beta scale the matrix multiplication results. A, B, and C are pointers to the starting addresses of the matrices. Atype, Btype, and Ctype describe the data types of the matrices; for example, CUDA_R_16F indicates that the data is stored as real 16-bit floating point. lda, ldb, and ldc are the leading dimensions of the matrices. computeType is the data type used in computation; for instance, CUBLAS_COMPUTE_16F implies the use of Tensor Cores with computation in 16-bit floating point. Notably, if the input data type is 32-bit float, we can use CUBLAS_COMPUTE_32F_FAST_16F to perform the computation in 16-bit floating point and gain acceleration from Tensor Cores. algo selects the algorithm used in computation; CUBLAS_GEMM_DEFAULT is commonly used to choose the default algorithm.
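As an illustration, the sketch below performs the GEMM above with FP16 inputs and FP32 accumulation on Tensor Cores. The wrapper name gemm_cublas, the device buffers d_A, d_B, and d_C, and the choice of alpha = 1 and beta = 0 are assumptions made for brevity; a production version would also check every returned status.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Minimal sketch: C = alpha * A * B + beta * C on Tensor Cores via cuBLAS.
// d_A (M x K) and d_B (K x N) hold FP16 data, d_C (M x N) holds FP32 data,
// all in column-major order as cuBLAS expects; error handling is omitted.
void gemm_cublas(cublasHandle_t handle, int M, int N, int K,
                 const __half *d_A, const __half *d_B, float *d_C) {
  const float alpha = 1.0f;
  const float beta  = 0.0f;
  cublasGemmEx(handle,
               CUBLAS_OP_N, CUBLAS_OP_N,     // neither A nor B is transposed
               M, N, K,
               &alpha,
               d_A, CUDA_R_16F, M,           // A: FP16, leading dimension M
               d_B, CUDA_R_16F, K,           // B: FP16, leading dimension K
               &beta,
               d_C, CUDA_R_32F, M,           // C: FP32, leading dimension M
               CUBLAS_COMPUTE_32F_FAST_16F,  // FP32 accumulation, Tensor Cores allowed
               CUBLAS_GEMM_DEFAULT);         // let cuBLAS pick the algorithm
}
The handle passed in is created once with cublasCreate and can be reused across calls.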
7.3.2.2. Primitives for Hardware Units¶
The second approach to accelerator programming involves the use of programming primitives, such as invoking the CUDA Warp Matrix Multiply Accumulate (WMMA) API on the device. This approach hinges on the collaborative design of software and hardware, meaning that the design of programming APIs at this level is architecture-dependent. For instance, in the Volta architecture, the unit of control for WMMA is a \(16\times16\) matrix block, processed by two Tensor Cores at a time; this notion is tightly linked to how Tensor Cores are integrated into an SM.
In the Volta architecture, NVIDIA offers three distinct sizes of WMMA multiply-accumulate computing interfaces for FP16 input data: \(16\times16\times16\), \(32\times8\times16\), and \(8\times32\times16\).
The basic control unit of the WMMA API is a fragment, a template class that specifies information such as the role of the matrix (multiplier or accumulator), the matrix shape (WMMA_M, WMMA_N, and WMMA_K), the data type (FP16, FP32, etc.), and the layout (row_major or col_major). Code lst:fragment shows the fragment types.
lst:fragment
wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;
The matrix-block data required by the multiplication must first be loaded from memory into registers as fragments, and accumulator fragments must be initialized or cleared; after the multiply-accumulate operations are performed by the Tensor Cores, the result fragments are stored back to global memory. NVIDIA provides the wmma::load_matrix_sync() and wmma::store_matrix_sync() interfaces to load and write the submatrix blocks, the wmma::fill_fragment() interface to initialize the data of a fragment, and the wmma::mma_sync() interface to perform multiply-accumulate operations on fragments.
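To show how these primitives fit together, the kernel sketch below lets each warp compute one \(16\times16\) tile of \(\bf{C} = \bf{A}\times\bf{B}\). The one-tile-per-warp mapping, the fixed \(16\times16\times16\) shape, the row-major A / column-major B layouts, the assumption that M, N, and K are multiples of 16, and the omission of \(\alpha\) and \(\beta\) are all simplifications; it is a sketch rather than a tuned implementation.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16 x 16 tile of C = A * B with Tensor Cores.
// A: row-major M x K (FP16), B: column-major K x N (FP16), C: row-major M x N (FP32).
// M, N, K are assumed to be multiples of 16 and blockDim.x a multiple of warpSize.
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
  constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

  // Row and column of the C tile handled by this warp.
  int warp_row = (blockIdx.y * blockDim.y + threadIdx.y) * WMMA_M;
  int warp_col = ((blockIdx.x * blockDim.x + threadIdx.x) / warpSize) * WMMA_N;
  if (warp_row >= M || warp_col >= N) return;

  wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
  wmma::fill_fragment(acc_frag, 0.0f);  // clear the accumulator fragment

  // Walk along K, performing one 16x16x16 multiply-accumulate per step.
  for (int k = 0; k < K; k += WMMA_K) {
    wmma::load_matrix_sync(a_frag, A + warp_row * K + k, K);  // A tile, row stride K
    wmma::load_matrix_sync(b_frag, B + warp_col * K + k, K);  // B tile, column stride K
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }

  // Write the finished tile back to global memory.
  wmma::store_matrix_sync(C + warp_row * N + warp_col, acc_frag, N,
                          wmma::mem_row_major);
}
With this mapping, a launch such as wmma_gemm<<<dim3(N / 16, M / 16), dim3(32, 1)>>>(A, B, C, M, N, K) assigns exactly one warp, and therefore one output tile, to each thread block.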
7.3.2.3. Low-level Assembly Language Interface¶
The PTX ISA offers another programming interface, for example, the mma.sync.aligned.m8n8k4 instruction in the Volta architecture. This instruction uses the shape configuration \(M=8, N=8, K=4\) to perform matrix multiply-accumulate operations. The basic control unit of this API is the data element. The matrix shape (modifier .m8n8k4), the layouts of matrices A and B (modifier .row or .col), and the data types of the output accumulator D, matrix A, matrix B, and input accumulator C (modifier .f32 or .f16, in that order) need to be specified. NVIDIA’s documentation describes the PTX instruction set in detail, helping programmers write code that follows the corresponding syntax rules, as shown in Code lst:ptx.
lst:ptx
// Per-thread inputs for one m8n8k4 Tensor Core operation (Volta).
// a and b point to this thread's FP16 fragments; each 32-bit unsigned
// register packs two FP16 elements, so two registers hold four values.
__half *a, *b;
float *C, *D;  // C: input accumulator, D: output accumulator (FP32)
unsigned const* A = reinterpret_cast<unsigned const*>(a);
unsigned const* B = reinterpret_cast<unsigned const*>(b);
asm volatile(
    "mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 "
    "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
    "{%12,%13,%14,%15,%16,%17,%18,%19};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3]), "=f"(D[4]),
      "=f"(D[5]), "=f"(D[6]), "=f"(D[7])
    : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]), "f"(C[0]),
      "f"(C[1]), "f"(C[2]), "f"(C[3]), "f"(C[4]), "f"(C[5]),
      "f"(C[6]), "f"(C[7]));
Data elements are used directly as the inputs (the unsigned type is used to hold packed FP16 data elements). Moreover, NVIDIA provides the ldmatrix instruction to load data from shared memory into fragments. Finer-grained mma instructions can form warp-level WMMA APIs with a wider variety of shapes and control the mapping between threads and data within a warp. The PTX instructions therefore offer greater flexibility than using CUDA C++ code directly.