Distributed training - pipeline parallelism

Generally speaking, training a larger model tends to achieve better results on a variety of tasks, for example higher accuracy on image classification. However, as the parameter scale grows, the memory capacity of a single AI accelerator card (such as GPU memory) and the coordination of computation across cards become bottlenecks for training very large models. Pipeline parallelism addresses these problems through model partitioning and scheduled execution. The following takes PaddlePaddle's pipeline parallelism as an example to introduce the basic principles and usage.

1. Principle Introduction

(Figure: pipeline parallelism — a four-layer model split across three computing devices)

Unlike data parallelism, pipeline parallelism places different layers of the model on different computing devices, reducing the memory consumption of any single device and thereby making it possible to train very large models. Taking the figure above as an example, the model contains four layers and is split into three parts placed on three different devices: layer 1 on device 0, layers 2 and 3 on device 1, and layer 4 on device 2. Data is transmitted between adjacent devices via communication links. Specifically, during the forward pass, the input data is first processed by layer 1 on device 0 to obtain an intermediate result; that result is sent to device 1, which computes layers 2 and 3 and passes its output to device 2, where layer 4 produces the final output. The backward pass then flows through the same devices in the reverse order.
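The sketch below illustrates this idea with PaddlePaddle's fleet hybrid-parallel API. It is a minimal example, not the exact code from this article, and it assumes Paddle 2.x: config keys such as pp_degree, accumulate_steps, and micro_batch_size, as well as the PipelineLayer/LayerDesc signatures, should be checked against your installed version. Here a four-layer MLP is cut into two pipeline stages across two GPUs.

```python
# Minimal pipeline-parallel sketch with PaddlePaddle fleet (assumes Paddle 2.x).
import paddle
import paddle.nn as nn
from paddle.distributed import fleet
from paddle.distributed.fleet.meta_parallel import LayerDesc, PipelineLayer

# 1. Declare a hybrid-parallel strategy with pipeline degree 2,
#    i.e. the model is cut into 2 stages placed on 2 devices.
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {"dp_degree": 1, "mp_degree": 1, "pp_degree": 2}
strategy.pipeline_configs = {"accumulate_steps": 2, "micro_batch_size": 4}
fleet.init(is_collective=True, strategy=strategy)

# 2. Describe the model as an ordered list of layers; PipelineLayer splits
#    the list into stages and keeps only the local stage on each device.
class MLPPipe(PipelineLayer):
    def __init__(self, **kwargs):
        descs = [
            LayerDesc(nn.Linear, 784, 512),   # expected on stage 0 (device 0)
            LayerDesc(nn.ReLU),
            LayerDesc(nn.Linear, 512, 256),   # expected on stage 1 (device 1)
            LayerDesc(nn.Linear, 256, 10),
        ]
        super().__init__(layers=descs,
                         loss_fn=nn.CrossEntropyLoss(),
                         num_stages=2, **kwargs)

model = MLPPipe()
optimizer = paddle.optimizer.Adam(parameters=model.parameters())

# 3. Wrap the model and optimizer so fleet handles inter-stage communication;
#    intermediate activations are sent between adjacent stages automatically.
model = fleet.distributed_model(model)
optimizer = fleet.distributed_optimizer(optimizer)

data = paddle.randn([8, 784])              # one mini-batch (8 = 2 x 4 micro-batches)
label = paddle.randint(0, 10, [8, 1])
loss = model.train_batch([data, label], optimizer)
```

Such a script would typically be launched on two GPUs with `python -m paddle.distributed.launch --gpus "0,1" train.py`, so that each process holds and executes only its own pipeline stage.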

Source: blog.csdn.net/u013250861/article/details/132509613