8.9. Chapter Summary¶
- The advent of large-scale machine learning models has sparked an exponential increase in the need for computational power and memory, leading to the emergence of distributed training systems. 
- Distributed training systems often utilize data parallelism, model parallelism, or a combination of both, based on memory limitations and computational constraints. 
- Pipeline parallelism is another technique adopted by distributed training systems, which involves partitioning a mini-batch into micro-batches and overlapping the forward and backward propagation of different micro-batches. 
- Although distributed training systems usually function in compute clusters, these networks sometimes lack the sufficient bandwidth for the transmission of substantial gradients produced during training. 
- To meet the demand for comprehensive communication bandwidth, machine learning clusters integrate heterogeneous high-performance networks, such as NVLink, NVSwitch, and InfiniBand. 
- To accomplish synchronous training of a machine learning model, distributed training systems frequently employ a range of collective communication operators, among which the AllReduce operator is popularly used for aggregating the gradients computed by distributed nodes. 
- Parameter servers play a crucial role in facilitating asynchronous training and sparse model training. Moreover, they leverage model replication to address issues related to data hotspots and server failures. 
