8.5. Summary

  • For deep learning workloads, accelerators typically combine multiple kinds of on-chip buffers with multiple kinds of compute units to improve performance.

  • Future performance gains will have to come from architectural change; that is, programmable hardware accelerators are needed to deliver the next performance breakthrough.

  • For reasons of both compute efficiency and ease of use, accelerators generally expose several levels of programming interfaces, including the operator-library level, the programming-primitive level, and the instruction level.

  • The lower the programming level, the more flexibly it lets the programmer control the accelerator, but the higher the demands it places on the programmer's skill.
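The trade-off in the last two points can be sketched with a matrix multiplication written at two of these levels. This is an illustrative analogy only: NumPy's `@` stands in for a vendor operator library (such as cuBLAS or an NPU's operator library), and the hand-written tiled loop stands in for primitive-level programming, where the programmer explicitly decides how tiles map onto on-chip buffers. Neither is a real accelerator API.

```python
import numpy as np

# Operator-library level: one call to a pre-tuned kernel; the programmer
# controls nothing about tiling or buffering. (NumPy is a stand-in here.)
def matmul_library(a, b):
    return a @ b

# Primitive level (stand-in): the programmer chooses the tile size and the
# loop order, analogous to scheduling primitives that stage tiles of the
# operands through on-chip scratchpad buffers before computing on them.
def matmul_tiled(a, b, tile=2):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # On a real accelerator, each of these slices would be a
                # tile resident in a software-managed buffer.
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile]
                    @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return c

a = np.arange(16, dtype=np.float64).reshape(4, 4)
b = np.ones((4, 4))
# Both levels compute the same result; the tiled version simply exposes
# more knobs (tile size, loop order) at the cost of more programmer effort.
assert np.allclose(matmul_library(a, b), matmul_tiled(a, b))
```

The library call is one line and already fast, while the tiled version gives the programmer control over data movement but makes correctness and tuning the programmer's problem, which is exactly the flexibility-versus-effort trade-off described above.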

8.6. Further Reading
