ZeRO & DeepSpeed: optimizations that enable training models with more than 100 billion parameters (Microsoft)

Source: AINLPer WeChat public account
Editor: ShuYini
Proofreading: ShuYini
Date: 2020-02-12

Paper access:
1. Official download address: https://arxiv.org/abs/1910.02054
2. Follow AINLPer and reply: ZeRO

Introduction

    The latest trend in artificial intelligence is to use larger natural language models to obtain better accuracy. However, cost, time, and the difficulty of straightforward code integration (code that has not been specifically optimized) make larger models hard to train. Microsoft has released an open-source library called DeepSpeed that greatly advances large-model training by improving scale, speed, cost, and usability, unlocking the ability to train models with 100 billion parameters. DeepSpeed is compatible with PyTorch. One component of the library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources required for model and data parallelism while greatly increasing the number of parameters that can be trained. Researchers used these breakthroughs to create Turing-NLG, the largest publicly known language model, with 17 billion parameters.

Introduction to ZeRO (The Zero Redundancy Optimizer)

    ZeRO (The Zero Redundancy Optimizer) is a new memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with 100 billion parameters on current GPU clusters at three to five times the throughput of the current best systems. It also provides a clear path toward training models with trillions of parameters, pointing to an unprecedented leap in deep learning system technology. We are releasing ZeRO as part of DeepSpeed, our high-performance distributed library for accelerating deep learning training.

The challenge of training large deep learning models

    Large models deliver significant accuracy improvements, but training models with billions to trillions of parameters frequently hits hardware limitations. To fit these models into memory, existing solutions trade off between computation, communication, and development efficiency:

    • Data parallelism does not help reduce the memory footprint per device: a model with more than one billion parameters runs out of memory even on GPUs with 32GB of memory.

    • Model parallelism cannot scale efficiently beyond a single node, because of fine-grained computation and expensive communication. Model-parallelism frameworks often require extensive code integration, which may be specific to the model architecture. For example, NVIDIA Megatron-LM set a new model-size record of 8.3 billion parameters. It scales well for models that fit on the multiple GPUs of a single node, but performance degrades when scaling across nodes. For example, when running a 40-billion-parameter model across NVIDIA DGX-2 nodes, we observed about five teraflops per GPU.

Overcoming the limitations of data parallelism and model parallelism

    Microsoft developed ZeRO to overcome the limitations of data parallelism and model parallelism while achieving the advantages of both. ZeRO removes the memory redundancy of data-parallel processes by partitioning the model states (parameters, gradients, and optimizer state) across those processes instead of replicating them. During training it uses a dynamic communication schedule to share the necessary state among the distributed devices, preserving the computational granularity and communication volume of data parallelism.

    With ZeRO-powered data parallelism, the per-device memory usage is reduced linearly with the degree of data parallelism, while the communication volume stays similar to that of data parallelism. ZeRO-powered data parallelism can fit models of any size, as long as the aggregate device memory is large enough to hold the shared model states.
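    To make the partitioning idea concrete, here is a toy, self-contained sketch (not DeepSpeed's actual implementation) of how the optimizer state that standard data parallelism would replicate on every rank can instead be split into per-rank shards; each rank allocates and updates only its own shard, and the full state never resides on any single device.

```python
# Toy illustration of ZeRO-style optimizer-state partitioning.
# Conceptual sketch only: it shows how a flat parameter/state vector
# can be sharded across data-parallel ranks instead of replicated.
import numpy as np

def shard_bounds(num_elements: int, world_size: int, rank: int):
    """Return the [start, end) slice of the flat state owned by `rank`."""
    per_rank = (num_elements + world_size - 1) // world_size  # ceil division
    start = min(rank * per_rank, num_elements)
    end = min(start + per_rank, num_elements)
    return start, end

num_params = 10_000_000   # pretend 10M-parameter model, flattened
world_size = 8            # 8 data-parallel ranks

for rank in range(world_size):
    start, end = shard_bounds(num_params, world_size, rank)
    # Each rank allocates Adam state (momentum, variance) only for its shard,
    # instead of holding the full num_params as in standard data parallelism.
    momentum = np.zeros(end - start, dtype=np.float32)
    variance = np.zeros(end - start, dtype=np.float32)
    print(f"rank {rank}: owns params [{start}, {end}) "
          f"-> state memory {(momentum.nbytes + variance.nbytes) / 2**20:.1f} MiB")
```

    In real ZeRO training, each rank performs the optimizer update only on its own shard, and the results are then exchanged with the other ranks according to the dynamic communication schedule described above.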

The three stages of ZeRO and their benefits

    ZeRO has three main optimization stages (shown in Figure 1), corresponding to partitioning the optimizer states, the gradients, and the parameters. When enabled cumulatively:

    1. Optimizer state partitioning (Pos) - 4x memory reduction, with the same communication volume as data parallelism
    2. Add gradient partitioning (Pos+g) - 8x memory reduction, with the same communication volume as data parallelism
    3. Add parameter partitioning (Pos+g+p) - memory reduction scales linearly with the data-parallelism degree Nd. For example, splitting across 64 GPUs (Nd = 64) yields a 64x memory reduction, with a modest 50% increase in communication volume.

    ZeRO eliminates this memory redundancy and makes the full aggregate memory capacity of the cluster available. With all three stages enabled, ZeRO can train a trillion-parameter model on just 1024 NVIDIA GPUs. A trillion-parameter model with an optimizer like Adam at 16-bit precision requires approximately 16 TB of memory to hold the optimizer states, gradients, and parameters. 16 TB divided by 1024 is 16 GB, which is well within reach for a single GPU.
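    These numbers follow from a simple per-parameter accounting. Assuming mixed-precision Adam as in the ZeRO paper, each parameter costs about 2 bytes for the fp16 weights, 2 bytes for the fp16 gradients, and K = 12 bytes for the fp32 optimizer states (master copy, momentum, and variance). The back-of-the-envelope sketch below (an illustration under those assumptions, not DeepSpeed code) reproduces the 4x, 8x, and Nd-fold reductions above and the roughly 16 GB per GPU for a trillion-parameter model on 1024 GPUs.

```python
# Back-of-the-envelope per-GPU model-state memory for the three ZeRO stages,
# using mixed-precision Adam accounting:
#   2 bytes fp16 params + 2 bytes fp16 grads + K = 12 bytes fp32 optimizer state.
PARAM_BYTES, GRAD_BYTES, OPT_BYTES = 2, 2, 12

def per_gpu_gb(num_params: float, nd: int, stage: int) -> float:
    """Model-state memory per GPU (in GB) for data-parallel degree `nd`."""
    p, g, o = PARAM_BYTES, GRAD_BYTES, OPT_BYTES
    if stage == 0:      # plain data parallelism: everything replicated
        per_param = p + g + o
    elif stage == 1:    # Pos: partition optimizer states
        per_param = p + g + o / nd
    elif stage == 2:    # Pos+g: partition optimizer states + gradients
        per_param = p + (g + o) / nd
    else:               # Pos+g+p: partition everything
        per_param = (p + g + o) / nd
    return num_params * per_param / 1e9

# Trillion-parameter model on 1024 GPUs (the example in the text):
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {per_gpu_gb(1e12, 1024, stage):,.1f} GB per GPU")
# stage 3 gives 1e12 * 16 / 1024 / 1e9 ≈ 15.6 GB, i.e. the ~16 GB quoted above.
```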

DeepSpeed: PyTorch compatibility and system performance

    We have implemented the first stage of ZeRO (optimizer state partitioning, referred to as ZeRO-OS), which can support models with 100 billion parameters. The code is released together with our training optimization library, DeepSpeed. DeepSpeed brings state-of-the-art training techniques, such as ZeRO, distributed training, mixed precision, and checkpointing, through lightweight APIs compatible with PyTorch. By changing just a few lines of code in a PyTorch model, you can use DeepSpeed to address the underlying performance challenges and improve training speed and scale. DeepSpeed excels in four aspects (Figure 2):

    • Scale: State-of-the-art large models such as OpenAI GPT-2, NVIDIA Megatron-LM, and Google T5 have 1.5 billion, 8.3 billion, and 11 billion parameters, respectively. The first stage of ZeRO in DeepSpeed provides system support to run models with up to 100 billion parameters, 10 times larger. In the future, we plan to add support for the second and third stages of ZeRO, unlocking the ability to train models with 200 billion parameters up to trillions of parameters.

    • Speed: Across a variety of hardware, we observe throughput up to five times higher than the state of the art. For example, to train large models on GPT-family workloads, DeepSpeed combines ZeRO-powered data parallelism with NVIDIA Megatron-LM model parallelism. On NVIDIA GPU clusters with low-bandwidth interconnect (without NVIDIA NVLink or InfiniBand), we improve throughput by 3.75x over using Megatron-LM alone for a standard GPT-2 model with 1.5 billion parameters. On NVIDIA DGX-2 clusters with high-bandwidth interconnect, models with 20 to 80 billion parameters run three to five times faster. These throughput gains come from DeepSpeed's higher memory efficiency and its ability to fit these models using a lower degree of model parallelism and larger batch sizes.

    • Cost: The improved throughput translates into a significantly lower training cost. For example, to train a model with 20 billion parameters, DeepSpeed requires three times fewer resources.

    • Usability: Only a few lines of code need to change for a PyTorch model to use DeepSpeed and ZeRO (a minimal sketch is shown after this list). Compared with current model-parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also places no restrictions on model dimensions (such as the number of attention heads, hidden size, and others), batch size, or any other training parameters. For models of up to 6 billion parameters, you can conveniently use data parallelism alone (powered by ZeRO) without model parallelism, whereas standard data parallelism runs out of memory for models with more than 1.3 billion parameters. The second and third stages of ZeRO will further increase the model size trainable with data parallelism alone. In addition, DeepSpeed supports flexible combinations of ZeRO-powered data parallelism with model parallelism.
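    As an illustration of what the "few lines of code" change looks like, here is a minimal sketch based on the publicly documented DeepSpeed API (the exact configuration keys and initialize signature may differ between DeepSpeed versions, so treat this as an approximation rather than this article's own code): a standard PyTorch model is handed to deepspeed.initialize together with a config dictionary that enables mixed precision and ZeRO stage 1 (ZeRO-OS).

```python
# Minimal sketch of wrapping a PyTorch model with DeepSpeed + ZeRO stage 1.
# Illustrative only: config keys and API follow the public DeepSpeed docs
# and may differ between versions.
import torch
import deepspeed

# Any ordinary PyTorch model; nothing about it is DeepSpeed-specific.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},                       # mixed-precision training
    "zero_optimization": {"stage": 1},               # ZeRO-OS: partition optimizer states
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that takes over the usual
# optimizer / backward / mixed-precision boilerplate.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    batch = torch.randn(8, 1024, device=model_engine.device).half()
    loss = model_engine(batch).float().pow(2).mean()  # dummy loss for illustration
    model_engine.backward(loss)                       # handles fp16 loss scaling
    model_engine.step()                               # partitioned optimizer update
```

    A script like this would normally be started with DeepSpeed's `deepspeed` launcher so that the distributed environment (ranks, world size) is set up for it.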

Turing-NLG and large-model training with DeepSpeed

    We used ZeRO-OS in DeepSpeed to train a 17-billion-parameter Turing-NLG model with higher accuracy and training efficiency than the current state-of-the-art approaches. The accompanying Turing-NLG announcement describes the new accuracy records the model sets and its broad applications in free-form text generation, summarization, and answer synthesis.

    ZeRO-OS is complementary to and compatible with different types of model parallelism. For large models that do not fit on a single node (roughly 20 billion parameters or more), it offers significant performance gains, resource savings, and model-design flexibility compared with using model parallelism alone.

    We combined ZeRO-OS with NVIDIA's Megatron-LM in DeepSpeed to train the Turing-NLG model. The memory savings from ZeRO-OS allowed the Turing-NLG model to run with a 4x lower model-parallelism degree and a 4x larger batch size than with NVIDIA Megatron-LM alone. As a result, we achieved a 3x throughput gain. In addition, we could train with a batch size of 512 using only 256 GPUs, compared with the 1024 GPUs needed with Megatron-LM alone. Finally, Megatron-LM alone cannot run this exact model: the model structure is not supported because its number of attention heads (= 28) is not divisible by the model-parallelism degree (= 16). DeepSpeed took the model from infeasible to run to feasible and efficient to train!

GitHub project address: https://github.com/microsoft/DeepSpeed

Past highlights

[AAAI 2020] Full list of accepted papers (1)
[AAAI 2020] Full list of accepted papers (2)
[AAAI 2020] Full list of accepted papers (3)
[AAAI 2020] Full list of accepted papers (4)
[AAAI 2020] Full list of accepted papers (5)
[AAAI 2020] Full list of accepted papers (6)
Heavyweight! A roundup of the world's academic "big names" in Natural Language Processing (NLP) (1)!
Heavyweight! A roundup of the world's academic "big names" in Natural Language Processing (NLP) (2)!
Heavyweight! A roundup of the world's academic "big names" in Natural Language Processing (NLP) (3)!

Original article: https://blog.csdn.net/yinizhilianlove/article/details/104303425