[NLP] In-depth understanding of Megatron-LM

1. Introduction

NVIDIA Megatron-LM is a PyTorch-based distributed training framework for training large Transformer-based language models. Megatron-LM combines data parallelism, tensor parallelism, and pipeline parallelism to reproduce models such as GPT-3.

In the field of natural language processing (NLP), large models provide more accurate and powerful semantic understanding and reasoning capabilities. As computing resources have become more widely available and datasets have grown, the number of model parameters has increased exponentially. However, training such large-scale models faces several challenges:

  1. Memory limitations:  Even the largest current GPU cannot hold the parameters of these models in its memory. For example, a GPT-3 model with 175 billion parameters needs about 700GB of parameter space, roughly another 700GB for the gradients, and an additional 1400GB for the optimizer state, for a total of about 2.8TB (see the quick arithmetic sketch after this list).
  2. Computational challenges:  Even if we managed to fit the model on a single GPU (e.g. by swapping parameters between host memory and device memory), the heavy computation required by the model would make training impractically slow. For example, training a GPT-3 model with 175 billion parameters on a single NVIDIA V100 GPU would take about 288 years.
  3. Parallel strategy challenges:  Different parallel strategies correspond to different communication patterns and volumes, which is another challenge that must be considered.
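To make the memory numbers above concrete, here is a minimal back-of-the-envelope sketch. It assumes 4 bytes per value for the parameters, the gradients, and each of the two Adam optimizer states (momentum and variance); real mixed-precision training recipes differ in the details.

```python
# Rough memory estimate for a 175-billion-parameter model,
# assuming 4 bytes per value (a simplification).
params = 175e9
gb = 1e9

parameters = params * 4 / gb          # ~700 GB
gradients  = params * 4 / gb          # ~700 GB
optimizer  = params * 4 * 2 / gb      # ~1400 GB (Adam: momentum + variance)

print(parameters + gradients + optimizer)  # ~2800 GB, i.e. about 2.8 TB
```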

2. Introduction to Parallel Strategies and Methods

1 Data parallelism

Data parallelism replicates the model on each worker, so that every worker holds a complete copy of the model. The input dataset is sharded, and a training mini-batch is split among the workers; workers periodically aggregate their gradients to ensure that all of them see a consistent version of the weights. For models that cannot fit on a single worker, data parallelism can also be applied to smaller shards within the model.
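As a concrete illustration, here is a minimal data-parallel sketch using PyTorch's DistributedDataParallel. The model, batch shapes, and launch setup are placeholders for illustration, not Megatron-LM code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via `torchrun --nproc_per_node=N train.py`, which sets LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()        # placeholder model, not a transformer
ddp_model = DDP(model, device_ids=[local_rank])   # every rank holds a full replica
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 1024).cuda()   # this rank's shard of the global mini-batch
loss = ddp_model(x).sum()
loss.backward()                   # DDP all-reduces gradients during backward
optimizer.step()                  # all replicas apply the same update
```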

Data-parallel scaling generally works well, but has two limitations:

  • a) After a certain point, the batch size of each GPU becomes too small, which reduces the utilization of the GPU and increases the communication cost;
  • b) The maximum number of devices that can be used is the batch size, which limits the number of accelerators that can be used for training.

2 Model parallelism

To overcome these limitations of data parallelism, people use memory-management techniques such as activation checkpointing, and they also use model parallelism to partition the model, so that the weights and their associated optimizer state do not all need to reside on the same processor at the same time.

Model parallelism distributes the memory and computation of a model across multiple workers, solving the problem of a model that cannot fit on a single card: the model is placed on multiple devices.

There are two types of model parallelism, distinguished by how the model is split: pipeline parallelism and tensor parallelism.

  • Pipeline model parallelism (pipeline model parallel) places different layers of the model on different devices, for example the first few layers on one device, the middle layers on a second device, and the last few layers on a third.
  • Tensor parallelism is intra-layer splitting: a single layer is split and placed on different devices. Equivalently, a matrix operation is distributed across devices, for example one matrix multiplication is divided into several smaller matrix multiplications placed on different devices.

Specifically, as shown in the figure below, the upper part is inter-layer parallelism (pipeline parallelism): a vertical cut, with the first three layers on the first GPU and the last three layers on the second GPU. The lower part is intra-layer parallelism (tensor parallelism): a horizontal cut, with each tensor split into two pieces that are placed on different GPUs.

These two ways of splitting the model can be used at the same time; they are orthogonal and complementary.
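A toy sketch of the two splits, assuming two GPUs and a made-up six-layer model (illustrative only, not Megatron-LM's actual partitioning code): pipeline parallelism assigns whole layers to devices, while tensor parallelism cuts a single weight matrix across devices.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(6)])

# Pipeline (inter-layer) parallelism: whole layers go to different devices.
for i, layer in enumerate(layers):
    layer.to("cuda:0" if i < 3 else "cuda:1")   # first three layers on GPU0, last three on GPU1

# Tensor (intra-layer) parallelism: a single weight matrix is cut into pieces.
w = torch.randn(1024, 1024)
w0, w1 = torch.chunk(w, 2, dim=0)               # each GPU holds half of the rows
w0, w1 = w0.to("cuda:0"), w1.to("cuda:1")
```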

Communication analysis

Consider the communication behavior of the two model-parallel approaches.

Pipeline parallelism: communication happens at the boundaries between adjacent pipeline stages. The communication type is point-to-point (P2P); each transfer carries little data but transfers are relatively frequent, and because of the pipelined execution there is GPU idle time, known as the pipeline bubble.

For example, in the figure below, the upper part is the original (non-pipelined) execution, the lower part is the pipelined execution, and the bubble positions are marked in the middle.

[Figure: original execution (top) vs. pipelined execution (bottom), with the bubble positions marked]

Tensor parallelism: communication occurs during the forward and backward pass of every layer. The communication type is all-reduce; each communication carries a large amount of data, and communication is frequent.

Tensor parallelism is generally kept within a single machine, so it is accelerated by NVLink. Pipeline parallelism generally spans machines connected through InfiniBand switches.
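A minimal sketch of the two communication patterns using torch.distributed primitives. The process-group setup is omitted, two ranks are assumed, and the tensor shapes are illustrative.

```python
import torch
import torch.distributed as dist

# Assumes a NCCL process group has already been initialized (e.g. via torchrun).
rank = dist.get_rank()

# Tensor parallelism: each rank holds a partial activation, and the full result
# is obtained by summing over ranks in every forward and backward pass.
partial = torch.randn(8, 1024, device="cuda")
dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # collective, large data volume

# Pipeline parallelism: stage i passes activations only to stage i+1,
# a small but frequent point-to-point transfer at each stage boundary.
acts = torch.empty(8, 1024, device="cuda")
if rank == 0:
    dist.send(torch.randn(8, 1024, device="cuda"), dst=1)
elif rank == 1:
    dist.recv(acts, src=0)
```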

2.1 Tensor parallelism

Tensor model parallelism (tensor model parallelism) splits the matrix multiplications within each transformer layer across multiple GPUs. This approach works well for models with up to roughly 20 billion parameters, but problems arise with larger models: because larger models must be split across multiple multi-GPU servers, two issues appear.

  • The all-reduce communication required for tensor parallelism needs to go through the inter-server link, which is slower than the high-bandwidth NVLink inside the multi-GPU server;
  • A high degree of model parallelism produces many small matrix multiplications (GEMMs), which can reduce GPU utilization.

2.2 Pipeline parallelism

Pipelined model parallelism is another technique that supports training large models. In pipeline parallelism, the layers of a model are split across multiple GPUs. A batch is divided into smaller micro-batches, and execution is pipelined across these micro-batches.

Through pipeline parallelism, the layers of a model are spread across multiple devices. When the model repeats the same transformer block, each device can be assigned an equal number of transformer layers. Megatron does not consider more asymmetric model architectures, where assigning layers to pipeline stages is harder. In pipelined model parallelism, training performs one set of operations on a device, then passes the output to the next device in the pipeline, which performs a different set of operations.

Naive pipeline parallelism has the problem that an input may see weight updates in its backward pass that it did not see in its forward pass. A pipelining scheme therefore needs to ensure that each input sees a consistent version of the weights in its forward and backward passes, so that weight updates have well-defined synchronous semantics.

The layers of a model can be assigned to workers in various ways, and different schedules can be used for the forward and backward computation of the inputs. The layer assignment and the scheduling strategy lead to different performance trade-offs. Regardless of the schedule, in order to preserve strict optimizer semantics, the optimizer step must be synchronized across devices, which requires a pipeline flush at the end of every batch: in-flight micro-batches finish executing while no new micro-batches are injected. Megatron-LM introduces periodic pipeline flushes.

At the beginning and end of every batch, devices sit idle. We call this idle time the pipeline bubble and want it to be as small as possible. Depending on how many micro-batches are injected into the pipeline, up to 50% of the time may be spent flushing the pipeline. The larger the ratio of the number of micro-batches to the pipeline depth (size), the less time is spent flushing the pipeline. Therefore, achieving high efficiency usually requires a larger batch size.
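To quantify this, the standard analysis in the Megatron-LM paper gives the bubble time fraction as (p − 1)/m for p pipeline stages and m micro-batches per batch; a minimal sketch, with arbitrary example values:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Ratio of pipeline-bubble (idle) time to ideal compute time, (p - 1) / m,
    for p pipeline stages and m micro-batches per batch."""
    return (p - 1) / m

print(bubble_fraction(p=8, m=8))    # 0.875: bubble almost as large as the useful work
print(bubble_fraction(p=8, m=64))   # ~0.109: more micro-batches shrink the bubble
```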

Some approaches use a parameter server together with pipelining; however, these suffer from inconsistency problems. The GPipe framework for TensorFlow overcomes this inconsistency by using synchronous gradient descent. However, this approach requires extra logic to handle the pipelining of communication and computation operations, and it either suffers from pipeline bubbles that reduce efficiency or from changes to the optimizer itself that affect accuracy.

Certain asynchronous and bounded-staleness methods such as PipeMare, PipeDream and PipeDream-2BW do away with flushing entirely, but this relaxes the weight update semantics. Megatron will consider these options in future work.

2.3 Technology portfolio

Users can train large models with a variety of techniques, each involving different trade-offs, and these techniques can also be used in combination. However, combining techniques can lead to complex interactions, especially in the design of the system topology: the model must be partitioned sensibly according to the characteristics of the algorithm, and the software-and-hardware system architecture must be designed with care, to achieve good performance. The following question is therefore particularly important:

How can parallel techniques be combined to maximize training throughput for large models for a given batch size while preserving strict optimizer semantics?

The developers of Megatron-LM demonstrated a technique called PTD-P, which combines pipeline, tensor, and data parallelism. This technique trains large language models on thousands of GPUs with good computational performance (up to 52% of peak device throughput). PTD-P combines pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism on top, and exploits an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers, enabling the training of a model with one trillion parameters with good scalability.

This technique demonstrates how to take full advantage of different parallel techniques in a large-scale distributed system for efficient large-scale model training.

Achieving throughput at this scale required innovation and careful design on several fronts:

  1. Efficient kernel implementation:  The key is to implement efficient kernels so that most computation is compute-bound rather than memory-bound. Computing tasks then finish faster, improving overall efficiency.
  2. Intelligent computation-graph partitioning:  The computation graph is partitioned across devices intelligently so as to reduce the amount of data sent over the network. Distributing the computation across devices not only reduces communication cost but also limits device idle time, improving overall efficiency.
  3. Communication optimization and high-speed hardware:  Domain-specific communication optimizations and high-speed hardware (state-of-the-art GPUs, high-bandwidth links within the same server and between GPUs of different servers) greatly improve the speed and efficiency of data transfer, which is essential for data exchange in a distributed system.
Through innovation and optimization in the areas above, the throughput of large-scale model training can be improved substantially, achieving higher training efficiency and performance. This requires a combination of domain expertise and careful system design.

2.4 Guiding principles

Megatron's developers studied different combinations of parallel modes and how they interact, and summarized some guiding principles for distributed training:

  1. Interaction of parallel modes:  Different parallelization strategies interact in complex ways. The choice of parallel mode affects communication volume, the efficiency of the compute kernels, and the worker idle time caused by pipeline flushes (pipeline bubbles). For example, tensor model parallelism works well within a multi-GPU server, but for larger models pipeline model parallelism is the better choice.
  2. Scheduling impact of pipeline parallelism:  The schedule used for pipeline parallelism affects the amount of communication, the size of the pipeline bubble, and the memory required to store activations. Megatron proposes a new interleaved schedule that can increase throughput by up to 10% compared with the previous schedule, at the cost of slightly higher memory usage.
  3. The impact of hyperparameters:  Hyperparameter values such as the microbatch size affect the memory footprint, the efficiency of kernel execution on the workers, and the size of the pipeline bubble.
  4. Communication intensity:  Distributed training is a communication-intensive process. Using slower inter-node links or more communication-intensive partitionings limits performance.

Based on these guiding principles, Megatron's developers, by studying in depth how different parallel techniques interact, how hyperparameters should be tuned, and how communication-intensive the workload is, provide a clearer path toward higher training throughput for large-scale models.

Megatron adopts the PTD-P (Pipeline, Tensor, and Data Parallelism) approach when training a large model with a trillion parameters, achieving a high aggregate throughput (502 petaFLOP/s).

In this approach, tensor model parallelism is used for the transformer layers within a node, which runs efficiently on HGX-based systems, while pipeline model parallelism is used for the transformer layers across nodes, making full use of the multiple network cards in the cluster and improving the efficiency of model training.

In addition, data parallelism is added on top of the two parallel strategies above, so that training can scale out to a larger number of GPUs and run faster.
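To illustrate how the three dimensions multiply together, here is a small sketch. In Megatron-LM the first two sizes are set with the --tensor-model-parallel-size and --pipeline-model-parallel-size arguments; the cluster shape below is a made-up example.

```python
# Hypothetical cluster: 48 nodes x 8 GPUs = 384 GPUs in total (made-up numbers).
world_size = 384

tensor_model_parallel_size = 8     # within a node, over NVLink
pipeline_model_parallel_size = 12  # across nodes, over InfiniBand

# The remaining factor of the world size becomes the data-parallel dimension,
# i.e. the number of full model replicas trained in data parallel.
assert world_size % (tensor_model_parallel_size * pipeline_model_parallel_size) == 0
data_parallel_size = world_size // (tensor_model_parallel_size * pipeline_model_parallel_size)
print(data_parallel_size)  # -> 4
```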

3. Tensor Model Parallelism

3.1 Principle

Here we use a GEMM to see how to parallelize the model. The operation to compute is XA = Y, where X is the input, A is the weight, and Y is the output. Mathematically, the linear layer of a neural network can be viewed as splitting the input matrix into blocks, computing on the blocks, and combining the partial results into an output matrix. This process involves matrix multiplication and addition: the input is multiplied by the weight matrix, and a bias vector is then added.

For nonlinear layers (such as activation function layers), no additional design is usually required. These layers simply apply a known nonlinear function, such as ReLU (Rectified Linear Unit), Sigmoid, or Tanh, to the input data: pass the input through the function and you get the output.

Overall, the computation of a neural network can be abstracted as a series of matrix and vector operations, in which linear layers involve matrix multiplication and addition, while nonlinear layers apply specific element-wise functions. These operations are highly optimized in deep learning frameworks to improve computational efficiency and training speed.
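A minimal PyTorch sketch of this abstraction (the shapes are arbitrary illustrative choices): the linear layer is a single GEMM plus a bias, and the nonlinearity is applied element-wise with no special handling.

```python
import torch

X = torch.randn(16, 512)     # input:  batch x in_features
A = torch.randn(512, 1024)   # weight: in_features x out_features
b = torch.randn(1024)        # bias

Y = X @ A + b                # linear layer: one GEMM plus a broadcast add
Z = torch.relu(Y)            # nonlinear layer: element-wise, needs no partitioning logic
```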

3.2 Row Parallelism

Let's look at Row Parallelism first: A is split into two parts along its rows, and to keep the multiplication well defined, X is correspondingly split into two parts along its columns, so that the last dimension of X1 equals the first dimension of A1. Formally:

Y = XA = X1·A1 + X2·A2,  where X = [X1, X2] is the column-wise split of X and A1, A2 are the top and bottom halves of A.

Therefore, X1 and A1 can be placed on GPU1 for computation, X2 and A2 on GPU2, and the two partial results are added.

Let's walk through the computation. In the first step, each GPU computes its partial product: X1 times A1 gives Y1 on GPU1, and X2 times A2 gives Y2 on GPU2 (each entry of Y1 or Y2 is the dot product of a row of the X shard with a column of the A shard). Adding Y1 and Y2 then yields the final output Y.
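A single-process sketch of row parallelism across two GPUs (the sizes and the two-way split are illustrative; in real tensor parallelism each shard lives in its own process and the sum is an all-reduce):

```python
import torch

X = torch.randn(16, 512)      # input
A = torch.randn(512, 1024)    # weight

X1, X2 = torch.chunk(X, 2, dim=1)   # split X along its columns
A1, A2 = torch.chunk(A, 2, dim=0)   # split A along its rows

# Each shard lives on its own GPU and produces a partial product.
Y1 = X1.to("cuda:0") @ A1.to("cuda:0")
Y2 = X2.to("cuda:1") @ A2.to("cuda:1")

# Summing the partial products recovers the full result Y = XA.
Y = Y1 + Y2.to("cuda:0")
assert torch.allclose(Y.cpu(), X @ A, atol=1e-3)
```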

