6.7. Further Reading
Memory allocation is a central concern for a machine learning backend. For further reading, see Training Deep Nets with Sublinear Memory Cost 1 and Dynamic Tensor Rematerialization 2.
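The rematerialization idea that both papers develop can be illustrated in a few lines. The sketch below is a minimal example, assuming PyTorch's `torch.utils.checkpoint` API; the block count and layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A deep stack of blocks; storing every intermediate activation is what
# dominates training memory for such models.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)
)

def forward(x):
    for block in blocks:
        # checkpoint() frees the block's internal activations after the
        # forward pass and recomputes them during backward, trading compute
        # for memory -- the core idea behind sublinear-memory training
        # and tensor rematerialization.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
forward(x).sum().backward()  # discarded activations are rebuilt here
```

With per-block checkpointing, peak activation memory shrinks to roughly the block boundaries plus one block's internals, at the cost of an extra forward computation during the backward pass.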
For more about runtime scheduling and execution, see A Lightweight Parallel and Heterogeneous Task Graph Computing System 3, Dynamic Control Flow in Large-Scale Machine Learning 4, and Deep Learning with Dynamic Computation Graphs 5.
On operator compilers, see Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines 6, Ansor: Generating High-Performance Tensor Programs for Deep Learning 7, and Polly - Polyhedral optimization in LLVM 8.
One challenge faced by modern deep learning compiler frameworks is achieving performance comparable to manually optimized libraries tailored to the target platform. To address this challenge, auto-tuning frameworks use statistical cost models to optimize code dynamically and efficiently. These frameworks have drawbacks of their own, however, such as the extensive exploration and training overhead required to build the cost model. Recent work such as MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks 9 reduces this overhead by predicting the performance of optimized code from pre-trained model parameters.
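To make the role of the cost model concrete, here is a minimal sketch in plain NumPy; the feature layout, the linear model, and all numbers are illustrative assumptions, not the design of any particular auto-tuner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical featurization: each candidate schedule is summarized by a
# feature vector (e.g. tile sizes, unroll factors, vectorization width).
features = rng.uniform(size=(200, 4))   # 200 schedules already benchmarked
latency = (features @ np.array([3.0, -1.0, 0.5, 2.0])
           + rng.normal(0.0, 0.1, size=200))  # their measured latencies

# A simple statistical cost model: least-squares regression from schedule
# features to measured latency. Production auto-tuners use gradient-boosted
# trees or neural networks, but the role in the search loop is the same.
w, *_ = np.linalg.lstsq(features, latency, rcond=None)

# Rank a large pool of unmeasured candidates by predicted latency and send
# only the most promising ones to real hardware, the expensive step.
candidates = rng.uniform(size=(1000, 4))
top10 = candidates[np.argsort(candidates @ w)[:10]]
```

The exploration and training overhead mentioned above comes from gathering the measured (features, latency) pairs; MetaTune's meta-learned initialization aims to shrink exactly that data requirement.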