7.3. Programming Methods

The first two sections of this chapter discussed the motivation, ideas, and basic principles behind the design of hardware accelerators. Hardware/software co-optimization, an important guiding principle for building efficient AI systems, requires close coupling between the software algorithms and stack on one side and the hardware architecture on the other in neural network applications. To fully exploit the advantages of an accelerator, a set of programming methods must be designed around its hardware architecture.

7.3.1. Method Classification

Programming methods for hardware accelerators are categorized into three approaches: using high-level computation operators, harnessing primitives for specialized hardware units, and employing low-level assembly languages:

  1. High-level computation operators: Hardware accelerators often come equipped with high-level, hardware-accelerated implementations of operators extensively used in numerical computing and deep learning. For instance, NVIDIA provides cuBLAS (CUDA Basic Linear Algebra Subprograms) and cuDNN (CUDA Deep Neural Network library). These libraries offer developers an accessible way to harness the power of NVIDIA GPUs without delving into low-level code. These operators are optimized for efficiency and automatically exploit specific GPU features, such as Tensor Cores.

  2. Primitives for task-specific hardware units: Hardware accelerators typically feature task-specific hardware units (like the Tensor Cores in NVIDIA GPUs) engineered to execute mixed-precision matrix multiplication operations at high speed. These units have associated programming primitives, such as CUDA’s Warp Matrix Multiply Accumulate (WMMA) API and primitives for loading and storing tensor tiles to and from the units.

  3. Low-level assembly languages: Hardware accelerators also have low-level assembly language interfaces. For instance, NVIDIA GPUs offer the PTX ISA (Parallel Thread Execution Instruction Set Architecture). It provides explicit control over all aspects of GPU behavior, but it requires a deep understanding of the GPU architecture and is more challenging to use correctly and effectively than the high-level interfaces provided by cuBLAS and cuDNN. PTX code is typically generated by a compiler from a high-level language like CUDA C++.

In essence, the above three methods operate at different levels of abstraction. High-level operators like cuBLAS and cuDNN provide easy-to-use interfaces to powerful hardware-accelerated operations, the primitives for task-specific hardware units expose a finer-grained interface to hardware operations, and low-level assembly languages like the PTX ISA provide the most detailed, low-level control over accelerator behavior.

7.3.2. Programming Examples

We exemplify different programming methods by implementing the General Matrix Multiplication (GEMM) with each approach. The implementation targets an NVIDIA Volta GPU. GEMM follows the equation \(\bf{C} = \alpha \bf{A}\times \bf{B} + \beta \bf{C}\), where \(\bf{A}\in\mathbb{R}^{M\times K}, \bf{B}\in\mathbb{R}^{K\times N}, \bf{C}\in\mathbb{R}^{M\times N}\), and \(\alpha\) and \(\beta\) are parameters provided by users.
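
As a point of reference, the computation itself can be written as a straightforward triple loop on the host; the accelerated implementations that follow produce the same result. The sketch below is illustrative only and assumes row-major storage and single-precision data.

// Reference (host-side) GEMM: C = alpha * A * B + beta * C.
// Row-major storage is assumed here; the accelerated versions below
// follow the storage conventions of their respective APIs.
void gemm_reference(int M, int N, int K, float alpha, const float *A,
                    const float *B, float beta, float *C) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}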

7.3.2.1. High-level Computation Operators

Using an operator acceleration library directly is the most straightforward method. NVIDIA offers two such libraries: cuBLAS and cuDNN. cuBLAS provides an interface for leveraging Tensor Cores to accelerate GEMM operations, while cuDNN offers an interface to accelerate neural network operations. To perform GEMM on Tensor Cores via cuBLAS, we can use the function cublasGemmEx, whose signature is shown in Code lst:cublasGemmEx.

lst:cublasGemmEx

cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                            cublasOperation_t transa, cublasOperation_t transb,
                            int m, int n, int k,
                            const void *alpha,
                            const void *A, cudaDataType_t Atype, int lda,
                            const void *B, cudaDataType_t Btype, int ldb,
                            const void *beta,
                            void *C, cudaDataType_t Ctype, int ldc,
                            cublasComputeType_t computeType,
                            cublasGemmAlgo_t algo)

handle is the cuBLAS handle, which is created using the cublasCreate function. transa and transb denote whether the matrices \(\bf{A}\) and \(\bf{B}\), respectively, are transposed. m, n, and k describe the shapes of the matrices: op(\(\bf{A}\)) is \(m\times k\), op(\(\bf{B}\)) is \(k\times n\), and \(\bf{C}\) is \(m\times n\). alpha and beta scale the matrix product and the existing contents of \(\bf{C}\), respectively. A, B, and C are pointers to the starting addresses of the matrices. Atype, Btype, and Ctype describe the data types of the matrices; for example, CUDA_R_16F indicates that the data is stored as real 16-bit floating-point values. lda, ldb, and ldc are the leading dimensions of the matrices. computeType is the data type used in computation; for instance, CUBLAS_COMPUTE_16F implies the use of Tensor Cores for computation in 16-bit floating point. Notably, if the input data type is 32-bit float, we can use CUBLAS_COMPUTE_32F_FAST_16F to perform the computation in 16-bit floating point and achieve acceleration using Tensor Cores. algo selects the algorithm used in computation, and CUBLAS_GEMM_DEFAULT is commonly used to select the default algorithm.
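
For illustration, a minimal sketch of a GEMM call through this interface is shown below. It assumes that d_A, d_B, and d_C are device pointers to column-major FP16 matrices of sizes \(M\times K\), \(K\times N\), and \(M\times N\); error checking is omitted, and the function name gemm_cublas is purely illustrative.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C = alpha * A * B + beta * C on FP16 device matrices.
void gemm_cublas(int M, int N, int K,
                 const __half *d_A, const __half *d_B, __half *d_C) {
  cublasHandle_t handle;
  cublasCreate(&handle);

  // Scaling factors must use the compute type (FP16 here).
  __half alpha = __float2half(1.0f);
  __half beta  = __float2half(0.0f);

  // cuBLAS assumes column-major storage, so the leading dimensions
  // are the numbers of rows: lda = M, ldb = K, ldc = M.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
               &alpha, d_A, CUDA_R_16F, M,
                       d_B, CUDA_R_16F, K,
               &beta,  d_C, CUDA_R_16F, M,
               CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);

  cublasDestroy(handle);
}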

7.3.2.2. Primitives for Hardware Units

The second approach to accelerator programming involves the use of programming primitives, such as invoking the CUDA Warp Matrix Multiply Accumulate (WMMA) API on a device. This approach hinges on the collaborative design of software and hardware, meaning that the programming APIs at this level are architecture-dependent. For instance, in the Volta architecture, the control object of WMMA is a \(16\times16\) matrix block, processed by two Tensor Cores at a time, a notion tightly linked to how Tensor Cores are integrated into an SM.

In the Volta architecture, NVIDIA offers three distinct sizes of WMMA multiply-accumulate computing interfaces for FP16 input data: \(16\times16\times16\), \(32\times8\times16\), and \(8\times32\times16\).
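
The shape is selected at compile time. In the listings that follow, we assume the \(16\times16\times16\) configuration and encode it in three constants (the names match those used in Code lst:fragment below):

// Tile shape assumed in the fragment and kernel sketches below:
// the 16x16x16 FP16 configuration supported on Volta.
constexpr int WMMA_M = 16;
constexpr int WMMA_N = 16;
constexpr int WMMA_K = 16;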

The basic control unit of the WMMA API is a fragment, a template class that specifies information such as the role of the matrix (multiplier or accumulator), the matrix shape (WMMA_M, WMMA_N, and WMMA_K), the data type (FP16, FP32, etc.), and the layout (row_major or col_major). Code lst:fragment shows the fragment types.

lst:fragment

wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;  // multiplier A
wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;  // multiplier B
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;             // running accumulator
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;               // tile of C

The data of the matrix blocks required by the multiplication needs to be loaded into registers as fragments. Accumulator fragments are initialized or cleared before use; after the Tensor Cores have performed the multiply-accumulate operations, the result fragments are stored back to global memory. NVIDIA provides the wmma::load_matrix_sync() and wmma::store_matrix_sync() interfaces to load and write the submatrix blocks, the wmma::fill_fragment() interface to initialize the data of a fragment, and the wmma::mma_sync() interface to perform multiply-accumulate operations on fragments.
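
Putting these pieces together, a minimal warp-level GEMM kernel might look as follows. This is a sketch rather than a tuned implementation: it assumes the \(16\times16\times16\) shape defined earlier, a row-major \(\bf{A}\) and \(\bf{C}\), a column-major \(\bf{B}\), dimensions that are multiples of 16, and a launch with one warp (32 threads) per block over an \((N/16)\times(M/16)\) grid; the kernel name wmma_gemm is illustrative.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K, float alpha, float beta) {
    // Tile of C computed by this warp.
    int tile_row = blockIdx.y;   // index along M
    int tile_col = blockIdx.x;   // index along N

    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;

    wmma::fill_fragment(acc_frag, 0.0f);   // clear the accumulator

    for (int k = 0; k < K; k += WMMA_K) {
        // Load 16x16 sub-blocks of A (row-major, ld = K) and B (column-major, ld = K).
        wmma::load_matrix_sync(a_frag, A + tile_row * WMMA_M * K + k, K);
        wmma::load_matrix_sync(b_frag, B + tile_col * WMMA_N * K + k, K);
        // acc_frag += a_frag * b_frag, executed on the Tensor Cores.
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // C = alpha * A * B + beta * C for this tile (C is row-major, ld = N).
    float *c_tile = C + tile_row * WMMA_M * N + tile_col * WMMA_N;
    wmma::load_matrix_sync(c_frag, c_tile, N, wmma::mem_row_major);
    for (int i = 0; i < c_frag.num_elements; ++i)
        c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
    wmma::store_matrix_sync(c_tile, c_frag, N, wmma::mem_row_major);
}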

7.3.2.3. Low-level Assembly Language Interface

The PTX ISA offers another programming interface, for example, the mma.sync.aligned.m8n8k4 instruction in the Volta architecture. This instruction performs multiply-add operations with the shape configuration \(M=8, N=8, K=4\). The basic control unit of this API is the data element. The matrix shape (modifier .m8n8k4), the layouts of the input matrices (modifiers .row or .col), and the data types of the output accumulator D, matrix A, matrix B, and input accumulator C (modifiers .f32 or .f16) need to be specified. NVIDIA’s documentation describes the PTX instruction set in detail, allowing programmers to write code following the corresponding syntax rules, as shown in Code lst:ptx.

lst:ptx

// A and B are reinterpreted as 32-bit registers, each packing two FP16
// elements ("r" constraints); D and C hold FP32 elements ("f" constraints).
__half *a, *b;
float *C, *D;
unsigned const *A = reinterpret_cast<unsigned const *>(a);
unsigned const *B = reinterpret_cast<unsigned const *>(b);

// D = A * B + C with shape m8n8k4, row-major A and B, FP32 accumulators.
asm volatile(
    "mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 "
    "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
    "{%12,%13,%14,%15,%16,%17,%18,%19};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3]),
      "=f"(D[4]), "=f"(D[5]), "=f"(D[6]), "=f"(D[7])
    : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]),
      "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3]),
      "f"(C[4]), "f"(C[5]), "f"(C[6]), "f"(C[7]));

Data elements are used directly as the inputs; each unsigned 32-bit register holds two packed FP16 elements. Moreover, NVIDIA provides the ldmatrix instruction (available from the Turing architecture onward) to load data from shared memory into fragments.

The finer-grained mma instruction can compose warp-level WMMA operations of more diverse shapes and gives explicit control over the mapping between threads and data within a warp. The PTX instructions therefore offer greater flexibility than the CUDA C++ WMMA interface.