Key technologies for large model training and deployment

From 2016 to the present, model size has grown roughly 40-fold every 18 months; since 2019, it has grown roughly 340-fold every 18 months.

Hardware, by comparison, has grown far more slowly: since 2016, GPU performance has increased only 1.7-fold every 18 months, so the gap between model size and hardware capability has steadily widened. Bottlenecks such as heavy memory usage, large compute consumption, and high cost seriously hinder the rapid development of the AIGC industry. Against this backdrop, You Yang, founder of Luchen Technology, argues that distributed training is imperative.

Picture: You Yang, founder of Luchen Technology, giving a speech

Basic large-model architectures provide the infrastructure for model training.

First, the Transformer architecture pioneered by Google is the foundation of virtually all of today's large models. The Transformer has become the fourth major deep learning architecture, alongside MLP, CNN, and RNN.
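The core operation that distinguishes the Transformer from MLP, CNN, and RNN is scaled dot-product attention. A minimal NumPy sketch (toy dimensions, random inputs chosen only for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (seq_q, seq_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of values

# Toy example: 3 query tokens attending over 3 key/value tokens of dim 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4)); K = rng.normal(size=(3, 4)); V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Every token attends to every other token in one step, which is what makes the architecture so scalable compared with the sequential recurrence of an RNN.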

Second, Google released BERT, the first large pre-trained model, which set off the trend of pre-training large models. BERT emphasized that it no longer uses a traditional unidirectional language model, or a shallow concatenation of two unidirectional language models, for pre-training; instead, it uses a new masked language model (MLM) objective to learn deep bidirectional language representations.
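The MLM objective can be sketched in a few lines: randomly hide a fraction of the tokens and ask the model to predict them from the surrounding (bidirectional) context. The 15% masking rate follows the BERT paper; the tokenization here is a deliberately simplified whitespace split:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=42):
    """BERT-style masked language modeling: hide a fraction of tokens;
    the model must predict them from bidirectional context."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hidden from the model
            labels.append(tok)          # prediction target
        else:
            masked.append(tok)
            labels.append(None)         # not predicted at this position
    return masked, labels

sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence)
print(masked)
```

Because the target token is visible from both its left and right neighbors, the learned representations are deeply bidirectional, unlike a left-to-right language model.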

Third, Google proposed ViT, the first large-scale vision model built on the Transformer. ViT uses a pure vision Transformer, rather than a CNN or a CNN-Transformer hybrid, to perform image tasks. The authors hypothesize that further pre-training can improve its performance because, compared with other existing models, ViT is relatively scalable.

Fourth, Google replaced the feed-forward network (FFN) layer in the Transformer with an MoE layer and cleverly combined the MoE layer with data parallelism. During data-parallel training, the model is replicated across the training cluster, and MoE is realized by introducing All-to-All communication among the data-parallel replicas.
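The MoE idea can be sketched without any communication machinery: a learned gate scores the experts for each token, only the top-k experts run, and their outputs are combined with the renormalized gate weights. The tanh "expert" below is a stand-in for a real FFN, chosen only to keep the sketch short:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Mixture-of-Experts in place of a dense FFN: a gate picks top-k experts
    per token; only those experts run, so parameter count grows without a
    matching growth in per-token compute."""
    logits = x @ gate_weights                         # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        probs = np.exp(sel - sel.max()); probs /= probs.sum()  # renormalized gate
        for p, e in zip(probs, top[t]):
            out[t] += p * np.tanh(x[t] @ expert_weights[e])    # toy expert "FFN"
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))            # 5 tokens, hidden dim 8
experts = rng.normal(size=(4, 8, 8))   # 4 experts, each a toy 8x8 transform
gate = rng.normal(size=(8, 4))
y = moe_layer(x, experts, gate)
print(y.shape)  # (5, 8)
```

In the distributed setting described above, the experts live on different devices, and the All-to-All step is what ships each token to the device holding its chosen expert and back.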

Building on these basic architectures, several landmark large models have appeared over the past few years, including GPT-3, T5, Swin Transformer, and Switch Transformer.

GPT-3: The first hundred-billion-parameter-scale large model, released by OpenAI, was genuinely groundbreaking; today's large models are all benchmarked against GPT-3. GPT-3 continues the unidirectional (autoregressive) language-model pre-training approach, but this time the model was scaled to 175 billion parameters and trained on 45TB of data.

T5 (Text-To-Text Transfer Transformer): Google's T5 converts all NLP tasks into text-to-text tasks. Its most important contribution is providing a common framework for the entire NLP pre-training field by casting every task into a single form.
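The "one form for every task" idea is just string-in, string-out, with the task named by a text prefix. A minimal sketch (the prefixes follow those reported for T5, though exact strings can vary by checkpoint):

```python
# T5 casts every NLP task as string -> string; the task is named by a prefix.
def to_text_to_text(task, text):
    prefixes = {
        "translate": "translate English to German: ",
        "summarize": "summarize: ",
        "sentiment": "sst2 sentence: ",
    }
    return prefixes[task] + text

inp = to_text_to_text("summarize", "Large models keep growing faster than hardware.")
print(inp)
```

Because classification, translation, and summarization all share this format, one model, one loss, and one decoding procedure cover them all.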

Swin Transformer: A new vision Transformer proposed by Microsoft Research Asia that can serve as a general-purpose backbone for computer vision. Differences between the two domains, such as the large variation in the scale of visual entities and the high resolution of image pixels compared with words in text, pose challenges in adapting the Transformer from language to vision.

Switch Transformer: A sparse large model at the trillion-parameter scale. It is a technique for training language models with more than a trillion parameters, raising the parameter count from GPT-3's 175 billion to 1.6 trillion, while training up to 4 times faster than T5-XXL, the large language model Google had previously developed.
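Switch Transformer's sparsity comes from simplifying MoE routing: each token is sent to exactly one expert (top-1), and each expert has a fixed capacity, with overflow tokens skipping the expert. A routing-only sketch (the gate matrix and capacity are illustrative values, not the paper's configuration):

```python
import numpy as np

def switch_route(x, gate_weights, capacity):
    """Switch Transformer routing sketch: each token goes to exactly ONE
    expert (top-1), and each expert has a fixed capacity; tokens over
    capacity are dropped (they skip the expert layer)."""
    logits = x @ gate_weights
    choice = logits.argmax(axis=-1)                 # top-1 expert per token
    n_experts = gate_weights.shape[1]
    assignments = {e: [] for e in range(n_experts)}
    dropped = []
    for t, e in enumerate(choice):
        if len(assignments[e]) < capacity:
            assignments[e].append(t)
        else:
            dropped.append(t)                       # over capacity: token skips
    return assignments, dropped

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 4))                         # 6 tokens, dim 4
gate = rng.normal(size=(4, 3))                      # 3 experts
assign, dropped = switch_route(x, gate, capacity=2)
print(assign, dropped)
```

Top-1 routing is what lets the parameter count grow to the trillion scale while keeping the compute per token roughly constant.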

In addition, an even more landmark large model is PaLM, the large language model implemented on Pathways.

Distributed framework Pathways: Many of Pathways' key ideas come from existing systems, including XLA for expressing and executing TPU computations, TensorFlow graphs and executors for representing and executing distributed CPU computations, the Python-based programming framework JAX, and the TensorFlow APIs. By reusing these components effectively, Pathways can run without requiring many changes to existing models.

PaLM model: What makes PaLM eye-catching is that it has 540 billion parameters and is trained with the new-generation AI framework Pathways. The model architecture also includes many optimizations drawn from existing research, including the SwiGLU activation function in place of ReLU, parallel layers (Parallel Layers), multi-query attention (Multi-Query Attention), rotary position embeddings (RoPE), shared input and output word embeddings, and the removal of bias parameters (No Biases).
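Of these optimizations, SwiGLU is easy to show concretely: it replaces the ReLU feed-forward layer with a Swish-gated linear unit. A minimal NumPy sketch (toy dimensions; the bias-free form matches PaLM's "No Biases" choice):

```python
import numpy as np

def swiglu(x, W, V):
    """SwiGLU feed-forward gate used in PaLM instead of ReLU:
    SwiGLU(x) = Swish(x W) * (x V), where Swish(z) = z * sigmoid(z)."""
    a = x @ W
    swish = a / (1.0 + np.exp(-a))    # Swish (SiLU) activation
    return swish * (x @ V)            # elementwise gating of the second branch

rng = np.random.default_rng(3)
x = rng.normal(size=(2, 8))
W = rng.normal(size=(8, 16)); V = rng.normal(size=(8, 16))
h = swiglu(x, W, V)
print(h.shape)  # (2, 16)
```

The gating branch lets the network modulate each hidden unit smoothly, which in practice has been reported to improve quality over plain ReLU FFNs at equal compute.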

The PaLM model itself is formed by stacking the decoder part of the Transformer, i.e., it is a decoder-only model.

Current main technical routes for large-scale distributed training

The current main technical route for large-scale distributed training is parallel training. Distributed parallel training speeds up neural network training by using GPU clusters (multiple machines, multiple cards).

Data parallelism: The same setup and model are replicated multiple times; each replica is fed a different shard of the data, the replicas process their shards in parallel, and all replicas are synchronized at the end of each training step.
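The synchronization step is a gradient all-reduce: each replica computes gradients on its own shard, and the averaged gradient equals the gradient over the full batch, so every replica takes an identical update. A NumPy simulation with two "devices" and a toy linear model:

```python
import numpy as np

def local_grad(w, x_shard, y_shard):
    """Gradient of mean squared error for a linear model y ~ x @ w,
    computed on one device's shard of the batch."""
    pred = x_shard @ w
    return 2 * x_shard.T @ (pred - y_shard) / len(x_shard)

rng = np.random.default_rng(4)
w = rng.normal(size=(3,))                   # weights, replicated on all devices
X = rng.normal(size=(8, 3)); y = rng.normal(size=(8,))
shards = [(X[:4], y[:4]), (X[4:], y[4:])]   # batch split across 2 "devices"

grads = [local_grad(w, xs, ys) for xs, ys in shards]
avg = sum(grads) / len(grads)               # simulated all-reduce (average)
full = local_grad(w, X, y)                  # single-device full-batch reference
print(np.allclose(avg, full))  # True: averaging shard grads == full-batch grad
```

In a real framework the averaging is done by a collective (e.g., NCCL all-reduce) rather than in Python, but the arithmetic is the same.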

Tensor parallelism: Each tensor is divided into multiple chunks, so each shard of the tensor lives on its designated GPU. During processing, each shard is processed in parallel on a different GPU, and the results are synchronized at the end of the step.
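A concrete instance is a column-parallel matrix multiply: the weight matrix is split column-wise across two "GPUs", each device multiplies against its shard, and the partial outputs are gathered at the end of the step. A NumPy simulation:

```python
import numpy as np

# Tensor parallelism sketch: the weight matrix W is split column-wise across
# two simulated devices; neither device ever holds the full W.
rng = np.random.default_rng(5)
x = rng.normal(size=(4, 6))           # activations, replicated on both devices
W = rng.normal(size=(6, 10))          # full weight, shown here only as reference

W0, W1 = W[:, :5], W[:, 5:]           # column shards, one per device
y0 = x @ W0                           # computed on "device 0"
y1 = x @ W1                           # computed on "device 1"
y = np.concatenate([y0, y1], axis=1)  # all-gather of the partial results

print(np.allclose(y, x @ W))  # True: sharded matmul matches the full matmul
```

Row-wise splits work symmetrically (each device produces a partial sum that is all-reduced instead of concatenated); real systems combine both to shard attention and FFN layers.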

Pipeline parallelism: The model is split vertically (i.e., layer-wise) across multiple GPUs, so that only one or a few model layers are placed on each GPU; each GPU then processes a different stage of the pipeline in parallel, working on a portion of the batch at a time.
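The idea can be simulated with two stages and a batch cut into micro-batches; in a real pipeline, the stages would work on different micro-batches simultaneously, while this single-process sketch just runs them in order (the layers are trivial arithmetic placeholders):

```python
# Pipeline parallelism sketch: layers are split into stages (one per "GPU");
# the batch is cut into micro-batches that flow through the stages.
stage0 = [lambda v: v + 1, lambda v: v * 2]   # layers placed on "GPU 0"
stage1 = [lambda v: v - 3]                    # layers placed on "GPU 1"

def run_stage(layers, v):
    for f in layers:
        v = f(v)
    return v

micro_batches = [1, 2, 3, 4]                  # batch split into 4 micro-batches
outputs = []
for mb in micro_batches:                      # real pipelines overlap these steps
    h = run_stage(stage0, mb)                 # forward on stage 0
    outputs.append(run_stage(stage1, h))      # then forward on stage 1
print(outputs)  # [1, 3, 5, 7]
```

Micro-batching is what keeps the pipeline full: while stage 1 processes one micro-batch, stage 0 can already start on the next, reducing idle "bubble" time.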

Founded in 2021, Luchen Technology is a global company committed to "liberating AI productivity". Its main business is helping enterprises reduce the cost of deploying large models and improve training and inference efficiency by building a distributed AI development and deployment platform.

Luchen's open-source intelligent system architecture, Colossal-AI, has two major characteristics. First, it minimizes deployment cost: Colossal-AI can significantly improve the efficiency of large-scale AI model training and deployment, and code written on a laptop can be deployed automatically to the cloud or to supercomputers.

Training large models such as GPT-3 typically requires more than 100 GPUs, but with Colossal-AI only about half the computing resources are needed. Even on low-end hardware, Colossal-AI can train models 2-3 times larger.

Second, it maximizes computing efficiency. Backed by parallel computing techniques, Colossal-AI makes full use of the hardware and significantly improves performance. Luchen's open-source goal is to speed up the training of large AI models by more than 10 times.


Origin blog.csdn.net/bruce__ray/article/details/131024027