7.5. Chapter Summary

  1. Hardware accelerators provide various types of on-chip caches and specialized computational units that boost the performance of deep learning workloads.

  2. Fully exploiting the performance potential of hardware accelerators requires making them programmable, which in turn drives architectural innovation.

  3. To balance computational efficiency and usability, accelerator programming methods span multiple levels of abstraction: from high-level computation operators, to primitives tied to specific hardware units, down to low-level assembly languages.

  4. Several techniques are crucial for optimizing accelerator performance, including increasing arithmetic intensity, caching data in shared memory, and hiding data load/store latency, as sketched below.
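
As a concrete illustration of point 4, the sketch below shows a tiled matrix-multiplication CUDA kernel; the kernel name and the `TILE` size are illustrative choices, not taken from the chapter. Each thread block caches tiles of the inputs in shared memory, so every value fetched from global memory is reused `TILE` times (raising arithmetic intensity), and launching many blocks gives the hardware scheduler enough parallel work to hide global-memory load latency.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile width; illustrative value

// Tiled matrix multiply C = A * B for square N x N matrices.
// Each thread block stages a TILE x TILE tile of A and B in shared
// memory, so every element loaded from global memory is reused TILE
// times, raising arithmetic intensity; running many blocks at once
// lets the scheduler hide global-memory load latency.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B into shared memory.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial product entirely out of shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}

// Launch example (device allocation and error handling omitted):
//   dim3 block(TILE, TILE);
//   dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
//   matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
```

The same structure underlies the chapter's other optimizations: larger tiles or register blocking further raise arithmetic intensity, and techniques such as double-buffering the shared-memory tiles overlap the next load with the current computation to hide latency more aggressively.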