Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes | ICLR 2020

Authors | Yang You, Jing Li, et al.

Translator | Liu Chang

Training deep neural networks on massive datasets is very challenging. Recently, many studies have used large-batch stochastic optimization methods to address this problem. The most prominent algorithm in this line of research is LARS, which uses layer-wise adaptive learning rates and can train ResNet on ImageNet within minutes. However, for attention models such as BERT, LARS performs poorly, indicating that its gains are not consistent across tasks. In this paper, the authors first study a principled layer-wise adaptation strategy that allows large mini-batch training to accelerate deep neural networks.

Using this strategy, the authors developed a new layer-wise adaptive large-batch optimization technique called LAMB. They then provide convergence analysis for both LARS and LAMB, showing convergence to a stationary point in general nonconvex settings.

The experimental results show that LAMB performs very well across a variety of tasks (such as BERT and ResNet-50 training) while requiring only minimal hyperparameter tuning. Importantly, for BERT training, this optimizer allows very large batch sizes of 32868 without degrading performance. By scaling the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to only 76 minutes (see Table 1 in the paper). The LAMB implementation has been open sourced.

Code Address:

https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
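For reference, below is a minimal usage sketch of this open-source implementation, assuming TensorFlow 2.x with the tensorflow-addons package installed; the toy model and hyperparameter values are illustrative, not the paper's BERT settings:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# A toy model; any Keras model can be compiled with LAMB the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# LAMB exposes Adam-style moment hyperparameters plus decoupled weight decay.
optimizer = tfa.optimizers.LAMB(
    learning_rate=1e-3,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    weight_decay_rate=0.01,
)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```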

Introduction

With the advent of large-scale datasets, training deep neural networks has become particularly difficult even with efficient optimization methods such as stochastic gradient descent (SGD). For example, training BERT takes 3 days on 16 TPUv3 chips, and training ResNet-50 on ImageNet takes 29 hours on 8 Tesla P100 GPUs. Researchers therefore have a strong interest in developing optimization methods that address this issue.

The purpose of this paper is to research and develop optimization techniques that speed up the training of large deep neural networks, mainly based on variants of SGD. SGD-based methods iteratively update the model parameters in the direction of the scaled negative gradient computed on a mini-batch. However, SGD's scalability is limited by its inherently sequential nature. Because of this limitation, improving training time in deep learning has traditionally relied heavily on asynchronous distributed settings; however, the implicit staleness introduced by asynchronous parallelism usually results in degraded performance. Thanks to recent hardware advances, computing gradients in parallel over large batches of data has become feasible. However, naively increasing the batch size often hurts generalization performance.
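For concreteness, the basic mini-batch SGD update that these methods build on can be sketched in a few lines; this is a toy NumPy version, where the gradient stands in for what backpropagation on one mini-batch would produce:

```python
import numpy as np

def sgd_step(params, grads, lr):
    """Plain mini-batch SGD: move each parameter against its mini-batch gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# One step on a toy weight matrix; in practice the gradient comes from
# backpropagation on a mini-batch of data.
w = np.zeros((4, 4))
g = np.random.randn(4, 4)
(w,) = sgd_step([w], [g], lr=0.1)
```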

Recent studies suggest that, up to a certain mini-batch size, scaling the learning rate linearly with the mini-batch size can further speed up training. That work also highlights two caveats for large-batch synchronous SGD with linear scaling: (i) linear learning-rate scaling is harmful in the initial phase of training, so a hand-tuned slow warmup strategy is needed to ramp the learning rate up gradually; and (ii) linear scaling breaks down beyond a certain batch size. Using these tricks, Goyal et al. were able to train ResNet-50 with a batch size of 8192, significantly reducing training time from 29 hours to 1 hour.
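A minimal sketch of these two tricks as a learning-rate schedule, in the spirit of Goyal et al.; the base learning rate, base batch size, and 5-epoch warmup horizon here are illustrative assumptions, not values from this paper:

```python
def lr_schedule(step, steps_per_epoch, base_lr=0.1, base_batch=256,
                batch=8192, warmup_epochs=5):
    """Linear LR scaling with a gradual warmup phase."""
    target_lr = base_lr * batch / base_batch         # (ii) scale LR linearly with batch size
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:                          # (i) ramp up slowly at the start
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```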

Although these efforts demonstrate that such strategies can reduce the time required to train large deep neural networks, they also underscore the need for an adaptive learning-rate mechanism for large-batch learning.

Layer-wise adaptive learning-rate approaches based on SGD have recently been proposed to address this problem. The most successful algorithm in this line of research is LARS, which was originally proposed for training ResNet. Using LARS, training ResNet-50 on ImageNet can be completed in just a few minutes! However, it has been observed that its performance gains are inconsistent across tasks; for example, LARS performs poorly on attention models such as BERT. In addition, there is largely no theoretical understanding of the adaptivity used in LARS. The authors therefore research and develop a new method for the large-batch setting.

More specifically, this paper makes the following contributions.

  • Inspired by LARS, we study a general layer-wise adaptation strategy specifically for large-batch learning, and provide intuition for the strategy.

  • Based on this adaptation strategy, we develop a new optimization algorithm (LAMB) to achieve adaptive learning rates in SGD-style training. In addition, this paper provides convergence analysis for both LARS and LAMB, focusing on the benefits of these methods in the large-batch setting.

  • This paper demonstrates LAMB's performance on multiple tasks. Using LAMB, the paper scales the BERT training batch size to more than 32K without sacrificing performance, reducing the training time from 3 days to 76 minutes. This is the first work to reduce BERT training time to within a few hours.

  • It also demonstrates LAMB's efficiency for training image classification models such as ResNet. To the authors' knowledge, this is the first adaptive solver that can achieve SOTA accuracy for ResNet-50.

 

Method

 

The authors first discuss the general strategy for adapting the learning rate in the large-batch setting, and then present two specific algorithms that use this strategy, with a focus on deep learning performance.

General strategy: Suppose the base iterative algorithm A (e.g., SGD or Adam) has the following update rule in the mini-batch setting:

x_{t+1} = x_t − η_t · u_t

where u_t is the update computed by algorithm A at step t. For the large-batch setting, two main modifications are made:

1. The update u_t is normalized to unit L2 norm. This normalization is performed layer by layer, i.e., each layer's update u_t^(i) is divided by its norm ||u_t^(i)||.

2. The learning rate is scaled by φ(||x_t^(i)||) for some function φ of the layer's parameter norm. This scaling is also performed layer by layer.

When the base algorithm A is SGD, so that u_t is the stochastic gradient g_t, the modified update becomes:

x_{t+1}^(i) = x_t^(i) − η_t · (φ(||x_t^(i)||) / ||g_t^(i)||) · g_t^(i)

where x_t^(i) and g_t^(i) are the parameters and the gradient of the i-th layer, respectively. In practice, the authors observe that a very simple choice works well: φ(v) = min{max(v, γ_l), γ_u}, for lower and upper thresholds γ_l and γ_u. The detailed theoretical analysis can be found in the paper.
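A minimal NumPy sketch of this layer-wise normalized SGD update follows; the clamp thresholds gamma_l and gamma_u are assumed illustrative values, not the authors' exact settings:

```python
import numpy as np

def phi(v, gamma_l=1e-3, gamma_u=10.0):
    """The simple scaling function: clamp the layer's parameter norm to [gamma_l, gamma_u]."""
    return min(max(v, gamma_l), gamma_u)

def layerwise_sgd_step(params, grads, lr, eps=1e-8):
    """One layer-wise adaptive SGD step; params and grads are lists, one array per layer."""
    new_params = []
    for x, g in zip(params, grads):
        trust = phi(np.linalg.norm(x))               # scale the LR by the layer's weight norm
        direction = g / (np.linalg.norm(g) + eps)    # normalize the update to unit L2 norm
        new_params.append(x - lr * trust * direction)
    return new_params
```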

LARS vs. LAMB

The LARS and LAMB algorithms are both instances of the general strategy above. LARS was originally proposed for large-batch training of ResNet on ImageNet; its base algorithm is SGD with momentum, and it is typically observed that the momentum term reduces the variance of the stochastic gradient at little cost in bias. Unlike LARS, the base algorithm of LAMB is Adam, so it provides two forms of adaptivity: first, the per-dimension normalization by the square root of the second moment, as in Adam; second, the layer-wise normalization that yields layer-wise adaptivity. The pseudocode for both algorithms is given in the paper.
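Since the pseudocode figure does not survive in this translation, here is a hedged NumPy sketch of one LAMB step for a single layer; the bias correction, weight-decay term, and layer-wise scaling follow the description above, while the hyperparameter values are illustrative defaults, not the paper's exact settings:

```python
import numpy as np

def lamb_step(x, g, m, v, t, lr, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01, gamma_l=1e-3, gamma_u=10.0):
    """One LAMB step for a single layer with parameters x and gradient g (step t >= 1)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (as in Adam)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (as in Adam)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    r = m_hat / (np.sqrt(v_hat) + eps)       # per-dimension adaptive update
    update = r + weight_decay * x            # decoupled weight decay
    trust = min(max(np.linalg.norm(x), gamma_l), gamma_u)           # phi(||x||)
    x = x - lr * trust / (np.linalg.norm(update) + 1e-12) * update  # layer-wise scaling
    return x, m, v
```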

Experiments

This paper compares the LAMB optimizer with existing methods on two large-batch training tasks: BERT and ResNet-50 training. LAMB is also compared with conventional optimizers for small batch sizes (<1K) and small datasets (e.g., CIFAR, MNIST). To illustrate the robustness of the method, very little hyperparameter tuning is used for the LAMB optimizer; specific details can be found in the paper.

BERT training experiments

The training data consist of Wikipedia entries and BooksCorpus, and the evaluation focuses on the SQuAD task, using F1 score as the metric and comparing against the BERT baseline. See the paper for the specific training details and the full results table.

The authors also compared LAMB with the LARS algorithm; LAMB can be seen to converge stably even at batch sizes beyond 16K.

In addition, the authors report experimental results on ImageNet, where it can be seen that, in the large-batch setting, the LAMB algorithm converges better than the Adam and AdaGrad algorithms.

For both BERT and ImageNet training, the authors did not tune the LAMB hyperparameters while increasing the batch size. They used the square-root learning-rate scaling rule and linear-epoch warmup to adjust the learning rate automatically; the corresponding settings are tabulated in the paper.
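A sketch of this automatic adjustment, under the assumption that the learning rate grows with the square root of the batch-size ratio (contrast with the linear rule sketched earlier) and warms up linearly; the function name and arguments are illustrative:

```python
import math

def sqrt_scaled_lr(step, warmup_steps, base_lr, base_batch, batch):
    """Square-root LR scaling with linear warmup."""
    target_lr = base_lr * math.sqrt(batch / base_batch)  # square-root scaling rule
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps     # linear warmup
    return target_lr
```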

Conclusion

Large-batch training techniques are essential for speeding up deep neural networks. In this paper, the authors proposed the LAMB optimizer, which supports adaptive element-wise updates and layer-wise learning rates. Moreover, LAMB is a general-purpose optimizer suitable for both small and large batches.

The paper also provides a theoretical analysis of the LAMB optimizer, highlighting situations in which it performs better than standard SGD. Using LAMB, the BERT pre-training batch size can be scaled to 64K without loss of accuracy, reducing BERT training time from 3 days to about 76 minutes. LAMB is also the first adaptive optimizer able to achieve SOTA results for large-batch ResNet-50 training on ImageNet.

Original link:

https://static.aminer.cn/upload/pdf/program/5e5e18b793d709897ce2a20b_0.pdf
