Large-scale neural network training summary

 

A while back Microsoft open-sourced its DeepSpeed training framework. According to their tests it is up to 10x faster, and it also includes a variety of memory optimizations that allow models with up to 100B(illion) parameters to be trained. Alongside it they released Turing-NLG, a 17B-parameter model trained with the framework and the largest language model at the time.

Training a 100B model? Don't even think about it for now (doge); let's first get the 110M-parameter BERT-base trained and deployed. This article covers speed and memory optimization strategies for model training, aimed at situations like the following:

  1. I have to report back tomorrow, and these ten experiments must be finished today

  2. My model is a bit big and barely fits on one card; at this rate I'll be laid off (with N+1 severance) before it finishes training on a hundred million samples

  3. I came up with a brilliant T6 model, but it won't fit on a 12GB card, and I still want best paper of the year

(The above is purely fictional; if any of it sounds familiar, hurry up and read on)

Reality is always cruel. In fact, only two factors limit training a large model: time and space (= GPU = money). Depending on your situation, the following strategies can be used:

1. Gradient Accumulation

If you only have a single card that can hold the model but the batch size is limited, you can use gradient accumulation: run the backward pass N times before each parameter update, which is equivalent to enlarging the batch size by a factor of N.

Normal training code is as follows:

for i, (inputs, labels) in enumerate(training_set):
  loss = model(inputs, labels)              # compute the loss
  optimizer.zero_grad()                     # clear the gradients
  loss.backward()                           # backpropagate to compute gradients
  optimizer.step()                          # update the parameters

With gradient accumulation added:

for i, (inputs, labels) in enumerate(training_set):
  loss = model(inputs, labels)                    # compute the loss
  loss = loss / accumulation_steps                # normalize the loss (if it is averaged)
  loss.backward()                                 # backpropagate; gradients accumulate onto the previous ones
  if (i+1) % accumulation_steps == 0:
      optimizer.step()                            # update the parameters
      model.zero_grad()                           # clear the gradients

Note that when the batch is enlarged this way, if you want each sample to keep the same weight, the learning rate should also be scaled up linearly or otherwise adjusted appropriately. BatchNorm is affected as well: the mean and variance estimated on the small per-step batch are certainly less accurate than those of a genuinely large batch, which can be mitigated by adjusting BN's momentum parameter [2].
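One possible mitigation, sketched below under the assumption that model is the network from the snippets above (the value 0.01 is only illustrative), is to lower BN's momentum so that its running statistics change more slowly across the small per-step batches:

import torch

for m in model.modules():
    if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):
        m.momentum = 0.01   # PyTorch default is 0.1; a smaller momentum gives smoother running estimates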

2. Gradient Checkpointing

If you only have one card and want to train a bigger model, you can try to compress the memory the model occupies.

Gradient checkpointing trades time for space: it saves memory by storing fewer activation values, at the cost of recomputing the activations that were not stored when the gradients are computed.

For details, see Tianqi Chen's Training Deep Nets with Sublinear Memory Cost [3].

 

NOTE: in the figure, the first row shows the forward nodes and the second row the backward nodes.
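A minimal sketch of the idea in PyTorch, using torch.utils.checkpoint on a toy sequential model (the layer sizes and the number of segments below are arbitrary): only the activations at the segment boundaries are stored, and everything in between is recomputed during the backward pass.

import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(24)]).cuda()
x = torch.randn(32, 1024, device="cuda", requires_grad=True)

out = checkpoint_sequential(model, 4, x)   # 4 = number of checkpointed segments
out.sum().backward()                       # segments are re-run forward here to obtain the gradients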

3. Mixed Precision Training

Mixed precision training can be used on both single-card and multi-card setups, and improves compute efficiency through CUDA's half2 type. A half2 is a floating-point type that stores two FP16 values which can largely be operated on simultaneously, so ideally FP16 is twice as fast as FP32. Take the FP16 optimization of GELU as an example:

// GELU in FP32
float gelu(float x)
{
  float cdf = 0.5f * (1.0f + tanhf(0.7978845608028654f * (x + 0.044715f * x * x * x)));
  return x * cdf;
}

// GELU in FP16 using half2
half2 gelu(half2 val)
{
  half2 val_pow3 = __hmul2(val, __hmul2(val, val)); // compute both x * x * x at once
  float2 tmp_pow = __half22float2(val_pow3);
  float2 cdf = __half22float2(val);
  // tanhf does not support half2, so the two halves are computed separately
  cdf.x = 0.5f * (1.0f + tanhf(0.7978845608028654f * (cdf.x + 0.044715f * tmp_pow.x)));
  cdf.y = 0.5f * (1.0f + tanhf(0.7978845608028654f * (cdf.y + 0.044715f * tmp_pow.y)));
  // compute both x * cdf at once
  return __hmul2(val, __float22half2_rn(cdf));
}

Mixed precision training [5] is not hard to understand, but note the following points:

  1. Mixed precision training is not simply converting everything from FP32 to FP16 and computing away; using FP16 alone can cause an 80% loss of accuracy

  2. Loss scaling: gradient values are very small and would underflow in FP16, so the loss is first stored in FP32 and scaled up, which scales the gradients up as well so that they can be stored in FP16; when updating the FP32 weights, the gradients are rescaled back down

  3. For operations that involve accumulation, such as BatchNorm and Softmax, FP16 will overflow, so the results need to be kept in FP32; GPUs with Tensor Cores generally use the FP16 * FP16 + FP32 = FP32 operation for this

The overall flow: FP32 weights -> FP16 weights -> forward pass in FP16 -> loss scaled up in FP32 -> converted to FP16 -> backward pass computes FP16 gradients -> gradients rescaled to update the FP32 weights
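In PyTorch this whole flow is wrapped up by AMP. A minimal sketch, reusing the model / optimizer / training_set names from the earlier snippets:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                      # handles the loss scaling / rescaling automatically

for inputs, labels in training_set:
    optimizer.zero_grad()
    with autocast():                       # forward pass runs in FP16 where it is safe to do so
        loss = model(inputs, labels)
    scaler.scale(loss).backward()          # scale the loss, backpropagate scaled gradients
    scaler.step(optimizer)                 # unscale the gradients and update the FP32 master weights
    scaler.update()                        # adjust the scale factor for the next iteration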

!! Manual dividing line: what follows is how the big spenders play !!

4. Distributed Training

Distributed training means training in parallel on more than one card, generally in one of two settings:

  • Multi-GPU: a single machine with multiple cards, communicating via PCIe, NVLink, or GPU Direct P2P

  • Multi-Node: multiple machines each with multiple cards, communicating via Sockets (Ethernet) or InfiniBand with GPU Direct RDMA

In practice you can use NVIDIA's NCCL communication framework; with IB (InfiniBand), inter-machine communication speed can approach the speed within a single machine [6]. I won't go into the low-level details (I don't understand them either); in practice, for us alchemists it comes down to asking ops for help and filling in the list of server addresses in the configuration of an open-source framework.

There are several strategies for optimizing parallel training, whose main purpose is to reduce parameter synchronization (Sync) and data transfer during computation.

Currently a 32GB card can hold a model of roughly 1.3B parameters. If the model fits, you can use data parallelism; otherwise you can place different layers on different machines for training. Look at the figure from [7] below to get a feel for the difference between the two approaches:

4.1 Data Parallelism

There are two ways to do data parallelism [9]:

Parameter Server

The cluster has one master and multiple workers. The master waits for all nodes to finish computing their gradients, aggregates them, updates the parameters on the master, and then broadcasts the new parameters to the workers. The main bottleneck of this approach is the master, so training can also be made asynchronous: instead of waiting for the other nodes, the master updates the parameters as soon as it receives a gradient from any worker. But then gradients that other workers computed on old parameters get applied to the new parameters, which can push the optimization too far in one direction and land the model in a suboptimal solution.

Ring All-Reduce

All workers in the cluster form a ring. The gradients are split into K chunks; each worker passes an accumulated chunk to its successor while receiving a chunk from its predecessor. After iterating, every worker ends up with identical, fully reduced gradients and the parameters can be updated synchronously. This architecture is more efficient than PS and is now the mainstream approach. The figure below [10] shows its two stages, Scatter-Reduce and All-Gather:

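A minimal sketch of this in PyTorch with DistributedDataParallel, which uses NCCL's ring all-reduce under the hood (model and training_set are assumed as in the earlier snippets, with one process per GPU launched by torchrun, and each rank reading a different shard of the data):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # NCCL performs the ring all-reduce
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(model.cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for inputs, labels in training_set:
    loss = model(inputs.cuda(local_rank), labels.cuda(local_rank))
    optimizer.zero_grad()
    loss.backward()                              # gradients are all-reduced across ranks here
    optimizer.step()                             # every rank applies the same averaged gradients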

4.2 Model Parallelism

Model parallelism is not yet common: first, most models fit on a single card; second, its communication overhead is much higher than that of data parallelism, because backpropagating the loss requires passing the gradient of every activation value back, and with many samples there are a great many activation values.

Pipelined Parallelism

Pipeline parallelism places different layers of the model on different machines and computes the forward and backward passes in sequence. In 2019 Google and Microsoft released the GPipe [11] and PipeDream [12] papers and source code respectively; let me walk through their ideas:

First look at the most naive form of model parallelism, which is rather a waste of life:

NOTE: the backward pass has to compute partial derivatives with respect to both the parameters and the activation values, so it takes longer.
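As a code-level sketch (layer sizes and device names are only illustrative), the naive version simply places different layers on different cards and runs them strictly one after another, so at any moment only one GPU is doing useful work:

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    # Naive model parallelism: the first half lives on cuda:0, the second half on cuda:1.
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))   # cuda:0 sits idle while cuda:1 computes, and vice versa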

So Google's GPipe proposed an improvement: slice the data into micro-batches and, much like all-reduce, pass each slice on to the next node as soon as it has been computed, then synchronize the parameter update at the end. But as you can see, this still does not save our youth:

So Microsoft built PipeDream, which essentially turns the synchronous update over small data slices into an asynchronous one: as soon as the forward pass of a slice is done, its backward pass starts, and the gradients update the parameters as soon as they arrive; nobody waits for anybody, everyone just goes flat out:

However, this creates chaos. For example, worker1 computes the forward pass of mini-batch 5 with the parameters produced by update 1, but by the time mini-batch 5's backward pass finishes, the parameters have long since been updated by the backward passes of 2/3/4. So the authors added a weight stashing mechanism: the parameters used by each mini-batch are stashed away, and worker1 can pull the version it used for mini-batch 5 out of its treasure chest when applying that update:

Then another question arises: the forward pass of mini-batch 5 uses version-1 parameters on worker1 but version-3 parameters on worker3; won't things be a mess when the model is finally assembled? So the authors also added a Vertical Sync mechanism that forces all workers to use the version-1 parameters when computing mini-batch 5, so that the final assembled model has consistent parameters. However, this synchronization wastes a lot of computation: if mini-batch 5 updates from the version-1 weights, the work done with the version 2/3/4 weights is thrown away. So Vertical Sync is off by default; the parameter versions then differ from layer to layer, but thanks to weight stashing they are all valid.

Tensor Slicing

A neural network can be viewed as a composite function; in essence it is a series of computations between tensors, and the CNNs and RNNs we define are really just such computation graphs. Seen from this perspective, model parallelism is really about spreading each tensor computation across different machines. FlexFlow and Mesh-TensorFlow studied this in 2018, and NVIDIA's Megatron [13] also uses this strategy. Below is an example of how to split a Transformer.

A Transformer consists mainly of self-attention and an FFN. The first FFN layer, Y = GeLU(XA), can be split in two ways:

As can be seen, the first scheme requires a synchronization before computing the GeLU, so Megatron slices the tensors according to the second scheme. Self-attention uses a similar strategy, so an aggregation is needed only at g in the forward pass and at f in the backward pass.
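A single-device sketch of why this column/row split gives the same result (the sizes are arbitrary; in Megatron each half would live on its own GPU and the final addition would be the all-reduce at g):

import torch
from torch.nn.functional import gelu

torch.manual_seed(0)
X = torch.randn(4, 8)                    # [batch, hidden]
A = torch.randn(8, 16)                   # first FFN weight
B = torch.randn(16, 8)                   # second FFN weight

Z_ref = gelu(X @ A) @ B                  # the unsplit computation

A1, A2 = A[:, :8], A[:, 8:]              # split A by columns
B1, B2 = B[:8, :], B[8:, :]              # split B by rows
Y1, Y2 = gelu(X @ A1), gelu(X @ A2)      # the GeLU needs no synchronization in this scheme
Z = Y1 @ B1 + Y2 @ B2                    # the "+" is where the all-reduce happens

print(torch.allclose(Z, Z_ref, atol=1e-5))   # True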

The remaining LayerNorm and dropout still have to be computed after a synchronization:

The authors also split the embedding along the vocab dimension, and fused the final MLM prediction with the cross-entropy loss to reduce network traffic (otherwise batch_size * seq_len * vocab_size probabilities would have to be transferred; after the fusion only batch_size * seq_len loss values are transferred).

As models keep growing, distributed training and even distributed inference are clearly the engineering trend, and there are many more points to optimize beyond the distributed strategies above, such as network traffic and memory.

5. Optimizer Acceleration: LAMB

Although the data parallelism described above can improve training speed almost linearly, too large a batch reduces the model's accuracy and convergence speed (it fits the data less well). So in 2019 Google introduced the LAMB optimizer [14] (Layer-wise Adaptive Moments optimizer for Batch training), which is optimized for large batches: in distributed training scenarios it allows batches of 65536/32768 samples, reducing the number of iterations and therefore the training time. Feel the taste of money:

LAMB essentially combines Adam with the learning-rate adjustment of LARS (Layer-wise Adaptive Rate Scaling). As mentioned above, when the batch gets larger the learning rate also needs to get larger, which makes convergence unstable. LARS solves this [15] by scaling the LR with the ratio of the weight norm to the gradient norm:
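As a rough sketch (notation assumed here; see [15] for the exact form), the layer-wise learning rate looks like:

$$\lambda^{(l)} = \eta \cdot \frac{\lVert w^{(l)} \rVert}{\lVert \nabla L(w^{(l)}) \rVert + \beta \lVert w^{(l)} \rVert}, \qquad w^{(l)}_{t+1} = w^{(l)}_t - \gamma \, \lambda^{(l)} \, \nabla L(w^{(l)}_t)$$

where $\gamma$ is the global learning rate, $\eta$ a trust coefficient, $\beta$ the weight decay, and all norms are taken over the weights of layer $l$.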

 

The norms here are computed over the weights of each layer, hence "layer-wise". The formula can be understood as follows: at the start of training the weights are small while the loss and gradients are large, so the learning rate starts small and slowly warms up as the weights grow and the gradients shrink. Once some samples are fitted well, the loss approaches zero and the gradients get small again, so the learning rate grows, which helps avoid getting stuck in local minima and prevents overfitting.

LAMB incorporates this layer-wise adaptation idea.
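A rough sketch of the resulting update (bias correction omitted, notation assumed; see [14] for the exact form):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \qquad r_t = \frac{m_t}{\sqrt{v_t} + \epsilon}$$

$$w^{(l)}_{t+1} = w^{(l)}_t - \eta \cdot \frac{\phi\big(\lVert w^{(l)}_t \rVert\big)}{\lVert r^{(l)}_t + \lambda w^{(l)}_t \rVert} \big( r^{(l)}_t + \lambda w^{(l)}_t \big)$$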

The formula changes slightly compared with LARS: a mapping is applied to the weight norm, which essentially acts as a scaling; and the gradient term additionally includes weight decay, i.e. L2 regularization of the objective function.

Summary

This article described several ways to optimize the speed and memory of model training. In practice they can be combined in all sorts of ways, such as mixed precision + data parallelism, data parallelism + model parallelism, data parallelism + gradient checkpointing, and so on. DeepSpeed covers the strategies discussed in this article, so students using PyTorch can get going with it right away ~

Finally, due to space constraints, the introduction of each strategy omits some assumptions and derivations of the final results; students interested in going deeper can study the references. If any passing experts spot a mistake, please point it out ~

 

 


Origin blog.csdn.net/xixiaoyaoww/article/details/104645796