The ZeRO Series of DeepSpeed: Taking GPU Memory Optimization to the Limit

Foreword

At present, there are two main technical routes for training large-scale language models: TPU + XLA + TensorFlow/JAX, and GPU + PyTorch + Megatron-LM + DeepSpeed. The former is dominated by Google; because TPUs are deeply tied to Google's own cloud platform, GCP, non-Googlers can only admire it from a distance rather than get hands-on with it. The latter is backed by NVIDIA, Meta, and Microsoft, has an active community, and is far more widely adopted.

The core of DeepSpeed mentioned above is ZeRO (Zero Redundancy Optimizer). In short, it is a memory-optimized data parallelism (DP) scheme. The topic of "optimization" is never-ending: over the past two years, the DeepSpeed team has published three ZeRO-related papers, proposing in turn the removal of redundant model states across data-parallel ranks, offloading to CPU memory, and offloading to NVMe storage, all with one goal: to take GPU memory optimization to the limit.
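As a concrete illustration of these three directions, the sketch below shows how the corresponding knobs typically appear in a DeepSpeed configuration: the zero_optimization stage controls how much redundant state is partitioned away, while the optional offload_optimizer / offload_param entries move optimizer states or parameters into CPU memory or onto NVMe. This is a minimal sketch under assumed settings, not the papers' reference code; the toy model, batch size, learning rate, and NVMe path are illustrative assumptions.

# Minimal sketch of a ZeRO-enabled DeepSpeed setup (illustrative values only).
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # toy model standing in for a real network

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,                              # 1/2/3: partition optimizer states, gradients, parameters
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload direction: optimizer states in host memory
        "offload_param": {                       # ZeRO-Infinity direction: parameters on NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",          # assumed mount point
        },
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

Launched with the deepspeed launcher (one process per GPU), each rank then trains through the returned engine rather than the raw model, and the engine takes care of partitioning and offloading behind the scenes.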

ZeRO: A Data Parallel Scheme for Removing Redundancy

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models was published at SC '20, and the DeepSpeed project originally started as the official implementation of the ZeRO method described in the paper.

Background

Nowadays, training large models is inseparable from a variety of distributed parallel strategies; the most commonly used are data parallelism (DP) and model parallelism.
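For contrast with ZeRO, the sketch below shows plain data parallelism with PyTorch DistributedDataParallel, where every rank keeps a full replica of the parameters, gradients, and optimizer states; this replicated state is precisely the redundancy that ZeRO sets out to eliminate. The toy model size and single-node launch are assumptions for illustration.

# Minimal sketch of vanilla data parallelism with PyTorch DDP (illustrative).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a single-node launch via torchrun, one process per GPU.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Toy model: every rank holds a full copy of the parameters...
model = torch.nn.Linear(4096, 4096).cuda(rank)
ddp_model = DDP(model, device_ids=[rank])
# ...and a full copy of the Adam optimizer states (momentum and variance).
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device=f"cuda:{rank}")
loss = ddp_model(x).sum()
loss.backward()    # gradients are all-reduced, then held in full on every rank
optimizer.step()   # each rank applies the identical update to its own replica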
