Microsoft open-sources deep learning optimization library DeepSpeed, enabling training of models with over 100 billion parameters

The latest trend in artificial intelligence is that larger natural language models deliver better accuracy, but large models are hard to train because of cost, time, and code-integration barriers. Microsoft has open sourced DeepSpeed, a deep learning optimization library that improves scale, speed, and usability while lowering cost, so that deep learning models with more than 100 billion parameters can be trained on current-generation GPU clusters, greatly advancing the training of large models. At the same time, compared with the latest technology, it improves system performance by more than 5 times.

According to Microsoft's description, the DeepSpeed library includes a component named ZeRO (Zero Redundancy Optimizer), a new parallel optimizer that greatly reduces the resources needed for model and data parallelism while substantially increasing the number of parameters that can be trained. Using these breakthroughs, the researchers created the Turing Natural Language Generation model (Turing-NLG), the largest publicly known language model, with 17 billion parameters.

As part of DeepSpeed, ZeRO is a new large-scale memory optimization technique for distributed deep learning. It can train deep learning models with 100 billion parameters on current GPU clusters, with three to five times the throughput of the current best systems. It also provides a clear path toward training models with trillions of parameters.

ZeRO optimization has three main stages, corresponding to the partitioning of the optimizer state, the gradients, and the parameters.

ZeRO overcomes the limitations of data parallelism and model parallelism while achieving the advantages of both. It eliminates the memory redundancy across data-parallel processes by partitioning the model state (parameters, gradients, and optimizer state) across those processes instead of replicating it. During training it uses a dynamic communication schedule to share the necessary state among the distributed devices, while retaining the computational granularity and communication volume of data parallelism.
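To see why this partitioning matters, the sketch below gives a back-of-the-envelope estimate of per-GPU memory for the model state under mixed-precision Adam training, using the accounting from the ZeRO paper (2 bytes per parameter each for fp16 weights and gradients, roughly 12 bytes per parameter of fp32 optimizer state). The helper function and the example numbers are illustrative assumptions, not part of DeepSpeed.

```python
# Illustrative memory accounting for ZeRO-style partitioning of the model state.
# Assumes mixed-precision Adam: 2 bytes/param (fp16 weights) + 2 bytes/param (fp16 grads)
# + 12 bytes/param (fp32 master weights, momentum, variance). These constants follow the
# ZeRO paper's accounting; the function itself is a sketch, not a DeepSpeed API.

def model_state_gb(params, num_gpus, stage):
    """Approximate per-GPU model-state memory (GB) for a given ZeRO stage."""
    p, g, opt = 2 * params, 2 * params, 12 * params   # bytes held per data-parallel replica
    if stage >= 1:                                     # stage 1: partition optimizer state
        opt /= num_gpus
    if stage >= 2:                                     # stage 2: also partition gradients
        g /= num_gpus
    if stage >= 3:                                     # stage 3: also partition parameters
        p /= num_gpus
    return (p + g + opt) / 1e9

params = 1.5e9  # e.g. a GPT-2-sized model
for stage in range(4):
    print(f"ZeRO stage {stage}: {model_state_gb(params, num_gpus=64, stage=stage):.1f} GB per GPU")
```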

The current implementation covers the first stage of ZeRO, optimizer state partitioning (ZeRO-OS), which is powerful enough to support 100-billion-parameter models; this stage is released together with DeepSpeed.
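For orientation, enabling this first stage is done through DeepSpeed's JSON configuration. The sketch below builds such a configuration as a Python dictionary and writes it out; the specific field values (batch size, optimizer settings) are illustrative assumptions, and the exact schema should be checked against the DeepSpeed documentation.

```python
# Minimal sketch of a DeepSpeed configuration enabling ZeRO stage 1 (optimizer state
# partitioning) together with mixed precision. Field values are illustrative only.
import json

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},                       # mixed-precision training
    "zero_optimization": {"stage": 1},               # ZeRO-OS: partition optimizer state
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# DeepSpeed normally reads this from a JSON file passed to the training script.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```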

DeepSpeed is compatible with PyTorch; the DeepSpeed API is a lightweight wrapper on top of PyTorch, which means developers can use everything in PyTorch without having to learn a new platform. In addition, DeepSpeed manages all the boilerplate of state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpointing, so developers can focus on model development. At the same time, developers only need to change a few lines of code in a PyTorch model to take advantage of DeepSpeed's unique efficiency and effectiveness gains in speed and scale.
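A minimal sketch of what those few lines of change look like, assuming the configuration file from the previous sketch; the toy model and random data are placeholders rather than code from Microsoft's release, and argument names can differ slightly between DeepSpeed versions.

```python
# Sketch of wrapping an existing PyTorch model with DeepSpeed. deepspeed.initialize
# returns an "engine" that handles distributed training, mixed precision, gradient
# accumulation, and ZeRO behind the usual forward/backward/step pattern.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)                  # stand-in for a real PyTorch model

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",                         # the configuration sketched above
)

dtype = next(model_engine.parameters()).dtype        # fp16 when mixed precision is on
for step in range(10):                               # stand-in for a real data loader
    inputs = torch.randn(8, 1024, device=model_engine.device, dtype=dtype)
    targets = torch.randn(8, 1024, device=model_engine.device, dtype=dtype)

    loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)

    model_engine.backward(loss)                      # replaces loss.backward()
    model_engine.step()                              # replaces optimizer.step()
```

Such a script is launched with DeepSpeed's command-line launcher rather than plain python, which sets up the distributed environment across the available GPUs.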

DeepSpeed performs well in the following four areas:

  • Scale: state-of-the-art large models such as OpenAI GPT-2, NVIDIA Megatron-LM, and Google T5 have 1.5 billion, 8.3 billion, and 11 billion parameters respectively, while the first stage of ZeRO in DeepSpeed provides the system support to run models with up to 100 billion parameters, 10 times larger than the current most advanced models. Support for the second and third stages of ZeRO is planned, which will provide the ability to train models with up to 200 billion or even trillions of parameters.
  • Speed: on various hardware, the observed throughput is up to 5 times higher than the current state of the art. For example, to train large models on GPT-family workloads, DeepSpeed combines ZeRO-powered data parallelism with NVIDIA Megatron-LM model parallelism; on NVIDIA GPU clusters with low-bandwidth interconnects (without NVIDIA NVLink or InfiniBand), DeepSpeed improves throughput by 3.75 times over using Megatron-LM alone for a standard GPT-2 model with 1.5 billion parameters. On NVIDIA DGX-2 clusters with high-bandwidth interconnects, it is 3 to 5 times faster for models with 20 to 80 billion parameters. These throughput gains come from DeepSpeed's higher memory efficiency and its ability to fit these models using a lower degree of model parallelism and larger batch sizes.
  • Cost: higher throughput means significantly lower training costs; for example, to train a model with 20 billion parameters, DeepSpeed needs only three quarters of the resources originally required.
  • Ease of use: only a few lines of code need to change for a PyTorch model to use DeepSpeed and ZeRO. Compared with current model parallelism libraries, DeepSpeed does not require redesigning the code or refactoring the model, and it places no restrictions on model dimensions, batch size, or any other training parameters. For models with up to 6 billion parameters, the data parallelism provided by ZeRO can be used conveniently on its own, without model parallelism, whereas standard data parallelism runs out of memory for models with more than 1.3 billion parameters. The second and third stages of ZeRO will further increase the model size that can be trained with data parallelism alone. In addition, DeepSpeed supports flexible combination of ZeRO-powered data parallelism with model parallelism.

For a more detailed introduction, see Microsoft's blog:

https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters

Source: www.oschina.net/news/113328/microsoft-opensource-deepspeed