PTM of AI: Summary and progress of pre-training model technology (updating)


Table of contents

Pre-trained model technology

1. Systematic research on the development and impact of ultra-large-scale intelligent models has taken shape as a field

(1), OpenAI proposes PALMS dataset construction and model fine-tuning methods

(2), Percy Liang, Fei-Fei Li and other scholars put forward the concept of foundation models

(3), DeepMind published a paper on social hazard assessment of language models

2. The ultra-large-scale pre-training model research and development competition has entered a fierce stage

(1), Google developed a trillion-scale pre-training model Switch Transformer

(2), Zhiyuan released the super-large-scale intelligent models WuDao (Enlightenment) 1.0/2.0

(3), Microsoft and Nvidia released the pre-training model Megatron-Turing

(4), DeepMind released the pre-training model Gopher

(5) Other companies continue to develop ultra-large-scale pre-training models

3. Multi-modal pre-training models have become the next key development area for large models

(1), OpenAI proposes large-scale multimodal pre-training models DALL·E and CLIP

(2) The Hebrew University of Jerusalem and others proposed StyleCLIP, a text-driven high-definition image generation model

(3), Zhiyuan, Tsinghua and other researchers proposed the text-to-image model CogView

(4), Facebook researchers proposed a multi-task and multi-modal unified model UniT

(5), Tsinghua and other researchers proposed a cross-modal prompt learning model CPT

(6), researchers from Microsoft Research Asia and Peking University proposed a pre-training model NÜWA (Nuwa) covering three data modalities

4. Acceleration method innovations improve the training efficiency of ultra-large-parameter-scale models

(1) In January 2021, Microsoft and other researchers proposed the ZeRO-Offload heterogeneous training technology

(2) In March 2021, Zhiyuan and Tsinghua researchers jointly developed the FastMoE acceleration system

(3) In September 2021, Zhiyuan and Tsinghua researchers jointly developed the BMInf acceleration system

(4) In October 2021, Microsoft and Nvidia jointly proposed the PTD-P acceleration method

5. Pre-training models are applied in scenarios such as biological research and the Internet

(1), In May 2021, Google proposed the multi-task unified model MUM

(2), In June 2021, researchers from Tsinghua University, Zhiyuan and elsewhere proposed the Chinese-centric language model CPM

(3), In August 2021, researchers from Zhiyuan, Tsinghua University and elsewhere proposed the protein pre-training model ProteinLM


Pre-trained model technology

1. Systematic research on the development and impact of ultra-large-scale intelligent models has taken shape as a field

With the rise of super-large models such as BERT, GPT-3, and DALL·E, the adaptation scheme of "self-supervised pre-training + fine-tuning of pre-trained models" has gradually become mainstream. As ultra-large-scale pre-training models play an increasingly prominent role in scientific research, industry, society, the economy and other fields, their far-reaching impact has become a focus of scientists' attention.

(1), OpenAI proposes PALMS dataset construction and model fine-tuning methods

In June 2021, OpenAI proposed a dataset construction and model fine-tuning method called PALMS, which builds "values-targeted datasets" that can be used to correct biases in GPT-3, playing a driving role in addressing the ethical issues posed by large models.

Source : https://cdn.openai.com/palms.pdf

(2), Percy Liang, Fei-Fei Li and other scholars put forward the concept of foundation models

In August 2021, scholars such as Percy Liang and Fei-Fei Li named large-scale pre-training models "foundation models" and wrote an article discussing the opportunities and challenges they face. The paper is divided into four parts, covering the capabilities, application fields, technical aspects, and social impact of foundation models.

Source : https://arxiv.org/pdf/2108.07258.pdf

(3), DeepMind published a paper on social hazard assessment of language models

In December 2021, DeepMind published a paper on the ethical and social hazards of pre-trained language models. The researchers mainly explored the adverse effects of such models in six areas and highlighted two ethical and social impacts that require continued attention. First, current benchmarking tools are insufficient to assess some ethical and social hazards; for example, when a language model generates false information that humans take to be true, assessing this hazard requires more human interaction with language models. Second, research on risk mitigation is still insufficient; for example, language models learn, reproduce, and amplify social biases, but research on this issue is still in its early stages.

Legend: The six areas of ethical and social harm from language models studied in the DeepMind paper

Source: Language modelling at scale: Gopher, ethical considerations, and retrieval

2. The ultra-large-scale pre-training model research and development competition has entered a fierce stage

The advent of GPT-3 has inspired researchers to explore ultra-large-scale pre-training models with even larger scale and more impressive performance. Major research institutions and enterprises at home and abroad have invested enormous computing power in R&D, pushing model scale to the trillion-parameter level and exploring the boundaries of model size, performance, and general task capability. At present, institutions and companies such as OpenAI, Google, Facebook, Microsoft, Nvidia, Zhiyuan Research Institute, Alibaba DAMO Academy, Huawei, Baidu, and Inspur have joined this "arms race".

(1), Google developed a trillion-scale pre-training model Switch Transformer

In January 2021, Google researchers developed a new language model, Switch Transformer, which contains 1.6 trillion parameters, about nine times as many as the 175-billion-parameter GPT-3. The researchers compared Switch Transformer with Google's T5-Base and T5-Large models; the results showed that, with the same computing resources, the new model achieved a pre-training speedup of up to 7x.

Legend: Switch Transformer coding block structure

(2), Zhiyuan released the super-large-scale intelligent models WuDao (Enlightenment) 1.0/2.0

On March 20, 2021, Zhiyuan Research Institute released China's first ultra-large-scale intelligent information model, WuDao (Enlightenment) 1.0, training a series of models covering Chinese language, multimodality, cognition, and protein prediction, and achieving a number of world-leading technological breakthroughs in pre-training paradigms, model scaling and performance techniques, and training-corpus construction. On June 1, Zhiyuan Research Institute released the WuDao 2.0 model, with a parameter scale of 1.75 trillion, 10 times that of GPT-3, breaking the 1.6-trillion-parameter record set by the Switch Transformer pre-training model and becoming China's first trillion-parameter-scale model.

Legend: Technological innovations in WuDao 2.0

(3), Microsoft and Nvidia released the pre-training model Megatron-Turing

In October 2021, Microsoft and Nvidia launched the Megatron-Turing NLG (MT-NLG) pre-training model. The model is the successor to Microsoft's Turing NLG (T-NLG) and Nvidia's Megatron-LM models and contains 530 billion parameters. The researchers selected 8 tasks across five domains to evaluate MT-NLG; in experiments, the model achieved the best performance on some of these tasks.

Legend: The data set used by the MT-NLG model

Legend: MT-NLG performance on different tasks under zero-shot, one-shot, and few-shot conditions

(4), DeepMind released the pre-training model Gopher

In December 2021, DeepMind released Gopher, a pre-trained language model with 280 billion parameters. The model was trained on 4096 TPUv3 accelerator chips combined with multiple parallel acceleration strategies. The research mainly explores the advantages and disadvantages of models of different sizes, and examines in which areas better performance can be obtained as model parameter counts increase. The researchers found that increasing model size greatly improves tasks such as reading comprehension, fact checking, and toxic speech identification, but brings little improvement on logical reasoning and common-sense tasks. In addition, the researchers also studied the capabilities and shortcomings of the Gopher model in areas such as dialogue.

Legend: Performance of Gopher and other models in different categories on the Massive Multitask Language Understanding (MMLU) benchmark

Source: Language modelling at scale: Gopher, ethical considerations, and retrieval

(5) Other companies continue to develop ultra-large-scale pre-training models

In addition to the above cases, in April 2021 Huawei Cloud, together with Recurrent AI, released the Pangu NLP ultra-large-scale pre-training language model with a parameter scale of 100 billion, and jointly released the PanGu-α ultra-large-scale pre-training model with a parameter scale of 200 billion; Alibaba DAMO Academy released PLUG, a Chinese pre-training language model with 27 billion parameters, and, together with Tsinghua University, released M6, a Chinese multi-modal pre-training model with a parameter scale of 100 billion that has since been scaled beyond 10 trillion parameters;

In July, Baidu launched ERNIE 3.0, a knowledge-enhanced model with a parameter scale in the tens of billions;

In October, Inspur released an ultra-large-scale pre-training model with roughly 250 billion parameters;

In December, Baidu launched the ERNIE 3.0 Titan model with a parameter scale of 260 billion; Google trained a giant BERT model with 481 billion parameters and published the results on the MLPerf v1.1 training list; in addition, Google proposed GLaM, a general sparse language model with 1.2 trillion parameters that outperforms GPT-3 on few-shot learning across 7 task domains.

3. Multi-modal pre-training models have become the next key development area for large models

With the support of big data, large parameter counts, and large computing power, pre-training models can fully learn textual representations and acquire a certain amount of knowledge. If a model can also learn from data of multiple modalities, it performs more strongly on vision-language tasks such as image-text generation and image-based question answering. Multi-modal pre-training was a key research direction in 2021; institutions such as OpenAI, Microsoft, Zhiyuan, Tsinghua University, and the Institute of Automation of the Chinese Academy of Sciences have all released multi-modal pre-training models.

(1), OpenAI proposes large-scale multimodal pre-training models DALL·E and CLIP

In January, OpenAI simultaneously released two large-scale multimodal pre-training models, DALL·E and CLIP. DALL·E can generate corresponding images from short text prompts (such as a sentence or a short paragraph), and CLIP can classify images based on text prompts. OpenAI stated that the goal of developing multi-modal large models is to break through the boundary between natural language processing and computer vision and realize multi-modal artificial intelligence systems.

Caption: "Avocado-shaped chair" generated by DALL·E

Legend: The CLIP model has achieved excellent levels in multiple ImageNet tests
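
The paragraph above notes that CLIP can classify images based on text prompts. Below is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face transformers wrapper around the publicly released CLIP weights; the checkpoint name, image path, and candidate prompts are illustrative assumptions, not details taken from the article.

```python
# A minimal sketch of CLIP-style zero-shot image classification.
# Assumes the `transformers` and `Pillow` packages and the public
# "openai/clip-vit-base-patch32" checkpoint; the image path is a placeholder.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
# Candidate labels are expressed as natural-language prompts.
prompts = ["a photo of a cat", "a photo of a dog", "a photo of an avocado chair"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
print({p: round(float(s), 3) for p, s in zip(prompts, probs[0])})
```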

(2) The Hebrew University of Jerusalem and others proposed StyleCLIP, a text-driven high-definition image generation model

In March, the Hebrew University of Jerusalem, Adobe Research, and others combined the StyleGAN and CLIP models to propose StyleCLIP, a model that can generate and edit high-definition images based on text prompts. The researchers argue that StyleCLIP combines the semantic knowledge learned by the pre-trained model with the image generation ability of the generative adversarial network to create more realistic images, which gives it certain advantages in practical applications.

Legend: StyleCLIP's image processing process

Legend: Image editing results based on text prompts

Source : https://arxiv.org/pdf/2103.17249.pdf
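
To make the StyleGAN + CLIP combination concrete, here is a heavily simplified sketch of text-guided latent optimization in the spirit of StyleCLIP's optimization variant: a pretrained generator's latent code is updated so that CLIP judges the generated image to be closer to the text prompt. The `generator`, `clip_model`, and `tokenize` objects are hypothetical stand-ins for pretrained components; only the general loss structure is meant to reflect the idea, not the authors' exact method.

```python
# Conceptual sketch: optimize a latent code so the generated image matches a
# text prompt under CLIP, in the spirit of StyleCLIP's optimization variant.
# `generator` is a hypothetical pretrained StyleGAN-like module (latent -> image);
# `clip_model` and `tokenize` stand in for a pretrained CLIP encoder and its tokenizer.
import torch
import torch.nn.functional as F

def text_guided_edit(generator, clip_model, tokenize, latent_init, prompt,
                     steps=200, lr=0.05, lambda_reg=0.01):
    latent = latent_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    text_features = F.normalize(clip_model.encode_text(tokenize([prompt])), dim=-1)

    for _ in range(steps):
        image = generator(latent)                         # hypothetical: latent -> RGB image
        image_features = F.normalize(clip_model.encode_image(image), dim=-1)
        clip_loss = 1 - (image_features * text_features).sum(dim=-1).mean()
        # Keep the edited latent close to the starting point so the image
        # changes only in the direction described by the prompt.
        reg_loss = lambda_reg * (latent - latent_init).pow(2).mean()
        loss = clip_loss + reg_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return latent.detach()
```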

(3), Zhiyuan, Tsinghua and other researchers proposed the text-to-image model CogView

In May, researchers from Zhiyuan Research Institute, Tsinghua University, and Alibaba DAMO Academy released a paper on the text-to-image model CogView, which combines a VQ-VAE with a 4-billion-parameter Transformer. The model is fine-tuned on multiple downstream tasks such as style learning, ultra-high-definition image generation, text-image ranking, and fashion design, and uses stabilized pre-training methods such as eliminating NaN losses. Experimental results show that CogView achieves the best FID on the blurred MS COCO dataset, outperforming previous GAN-based models and DALL·E.

Legend: CogView architecture

Legend: CogView generates images according to the prompts
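
As a rough illustration of the VQ-VAE + Transformer recipe described above (not CogView's actual code), the sketch below shows the core idea: a VQ-VAE tokenizer turns an image into discrete codes, the text and image tokens are concatenated into one sequence, and a GPT-style Transformer is trained with a next-token prediction loss. All module names are hypothetical placeholders.

```python
# Conceptual sketch of the text-to-image pre-training recipe described above:
# discretize the image with a VQ-VAE, concatenate text and image tokens, and
# train an autoregressive Transformer on the joint sequence.
# `vqvae`, `text_tokenizer`, and `gpt` are hypothetical placeholder modules.
import torch
import torch.nn.functional as F

def training_step(text, image, text_tokenizer, vqvae, gpt, boi_token_id):
    text_tokens = text_tokenizer(text)                 # (batch, text_len) int tensor
    with torch.no_grad():
        image_tokens = vqvae.encode_to_codes(image)    # (batch, img_len) discrete codes
    boi = torch.full_like(image_tokens[:, :1], boi_token_id)  # begin-of-image marker
    seq = torch.cat([text_tokens, boi, image_tokens], dim=1)

    logits = gpt(seq[:, :-1])                          # predict each next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        seq[:, 1:].reshape(-1),
    )
    return loss
```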

(4), Facebook researchers proposed a multi-task and multi-modal unified model UniT

In August, the Facebook research team proposed UniT, a multi-task, multi-modal unified Transformer model. It is based on a unified Transformer encoder-decoder architecture that can simultaneously solve a series of tasks across the vision, multi-modal, and language domains, including object detection, visual-text reasoning, and natural language understanding. The paper reports that the model performs strongly on 7 tasks.

Legend: A list of the data that the UniT model can learn and the tasks it completes

Legend: UniT model architecture

(5), Tsinghua and other researchers proposed a cross-modal prompt learning model CPT

In September, researchers from Tsinghua University and the National University of Singapore proposed CPT, a cross-modal prompt learning model that uses color-based markers to tune the cross-modal pre-trained model via prompt learning. In few-shot settings on visual grounding and scene graph generation tasks, it achieves significant improvements over baseline models.

Legend: CPT cross-modal prompt learning framework

(6), researchers from Microsoft Research Asia and Peking University proposed a pre-training model NÜWA (Nuwa) covering three data modalities

In November, researchers from Microsoft Research Asia and Peking University proposed NÜWA, a unified multi-modal pre-training model. The model uses a 3D Transformer architecture capable of generating visual (image or video) content. Tested on 8 downstream tasks, the NÜWA model achieves the best performance on tasks such as text-to-image generation, text-to-video generation, and video prediction.

Legend: Downstream tasks supported by Nuwa model

Legend: The structure of Nuwa model

4. Acceleration method innovations improve the training efficiency of ultra-large-parameter-scale models

Restricted by computing resources, the training and inference of ultra-large-scale pre-trained models face serious bottlenecks. In the GShard and Switch Transformer research, Google adopted the Mixture of Experts (MoE) approach, introducing multiple expert networks into the neural network so that only a fraction of the parameters need to be activated for each input. This improves the model's computational efficiency and allowed the parameter count of pre-trained language models to grow to the trillion scale.

Legend: The MoE architecture uses a sparse gating function to determine which expert networks perform the computation

Source : https://arxiv.org/pdf/1701.06538.pdf
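
The following is a minimal PyTorch sketch of the sparse gating idea described above: a gating network scores the experts for each token and only the top-scoring expert is run, so most expert parameters stay inactive for any given input. This is an illustrative top-1 (Switch-style) routing layer written for clarity, not the GShard or Switch Transformer implementation; real systems add capacity limits, load-balancing losses, and distributed expert placement.

```python
# Minimal sketch of a sparsely gated Mixture-of-Experts layer (top-1 routing).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)          # sparse gating function
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: (tokens, d_model)
        gate_probs = F.softmax(self.gate(x), dim=-1)         # score every expert
        top_prob, top_idx = gate_probs.max(dim=-1)           # keep only the best expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                              # tokens routed to expert e
            if mask.any():
                out[mask] = top_prob[mask, None] * expert(x[mask])
        return out

# Example: route 16 tokens of width 64 through 4 experts.
layer = SparseMoE(d_model=64, d_hidden=256, num_experts=4)
y = layer(torch.randn(16, 64))
print(y.shape)  # torch.Size([16, 64])
```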

(1) In January 2021, Microsoft and other researchers proposed the ZeRO-Offload heterogeneous training technology

As the parameter scale of ultra-large-scale pre-training models grows, more computing acceleration and optimization methods for large models have emerged this year, focused on improving model computational efficiency. In January, researchers from Microsoft and the University of California, Merced proposed a heterogeneous deep learning training technology called ZeRO-Offload, which enables models 10x larger to be trained on the same hardware. On a single V100 GPU with 32 GB of memory, users can train a 13-billion-parameter GPT-2 with ZeRO-Offload; on a single DGX-2 server, ZeRO-Offload can train models with over 70 billion parameters, a 4.5x increase in trainable model size on the same hardware.
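
ZeRO-Offload ships as part of Microsoft's open-source DeepSpeed library. A minimal configuration sketch is shown below, assuming the public DeepSpeed API; the model, batch size, and optimizer values are illustrative placeholders, and exact configuration keys should be checked against the DeepSpeed documentation for your installed version.

```python
# Minimal sketch of enabling ZeRO-Offload via DeepSpeed: ZeRO stage 2 with the
# optimizer state offloaded to CPU memory. Values are illustrative placeholders.
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                                  # partition optimizer state and gradients
        "offload_optimizer": {"device": "cpu"},      # ZeRO-Offload: keep them in CPU RAM
    },
}

# deepspeed.initialize returns a wrapped engine that handles offloading,
# partitioning, and mixed precision during engine.backward()/engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```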

(2) In March 2021, Zhiyuan and Tsinghua researchers jointly developed the FastMoE acceleration system

Because Google's MoE implementation is tied to its own hardware and software stack, it cannot be applied directly to open-source frameworks such as PyTorch. To solve this problem, in March, Zhiyuan Research Institute and Tsinghua University jointly developed an acceleration system called FastMoE that lets ordinary users adopt MoE modules by rewriting only a small amount of code; compared with an unoptimized baseline implementation, the optimized FastMoE achieves up to a 47x speedup. The FastMoE system can be used as a module in a PyTorch network or to transform a layer of an existing network, and users need only a few lines of code to call the MoE module (see the sketch after the figures below). FastMoE also supports using any neural network module as an expert network and includes specially optimized CUDA code that makes full use of the GPU's massively parallel computing capability.

Legend: How to call FastMoE code

Source : GitHub - laekov/fastmoe: A fast MoE impl for PyTorch

Legend: Comparison of FastMoE and original PyTorch performance

Legend: Data Parallel Mode of FastMoE

Source : Zhiyuan x Tsinghua open source FastMoE, the cornerstone of the trillion AI model
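
To illustrate the "transform a layer in an existing network" usage pattern described above, here is a conceptual sketch of swapping a Transformer block's feed-forward sublayer for an MoE layer, reusing the toy SparseMoE class from the section introduction as a stand-in. FastMoE provides its own optimized, distributed modules for this pattern; its exact class and argument names should be taken from the repository rather than from this sketch.

```python
# Conceptual sketch: replace the dense feed-forward sublayer of an existing
# Transformer block with a Mixture-of-Experts layer. `SparseMoE` is the toy
# layer sketched in the section introduction; the `blocks`/`mlp` attribute
# names are illustrative and depend on the actual model implementation.
def moeify_transformer(model, d_model=768, d_hidden=3072, num_experts=8):
    for block in model.blocks:
        block.mlp = SparseMoE(d_model=d_model, d_hidden=d_hidden,
                              num_experts=num_experts)
    return model
```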

(3) In September 2021, Zhiyuan and Tsinghua researchers jointly developed the BMInf acceleration system

Pre-trained large models have achieved impressive results in many fields, but applying them requires substantial computing power and suffers from slow response times. In September, Tsinghua University and Zhiyuan researchers jointly released BMInf, a low-resource inference toolkit for large models that can perform efficient inference with models of tens of billions of parameters even on consumer-grade graphics cards.

Legend: Comparison of BMInf and original PyTorch performance

Source: GitHub - OpenBMB/BMInf: Efficient Inference for Big Models

(4) In October 2021, Microsoft and Nvidia jointly proposed the PTD-P acceleration method

In October, Microsoft and Nvidia jointly proposed the PTD-P (inter-node pipeline parallelism, intra-node tensor parallelism, and data parallelism) training acceleration method. Through this "three-pronged" combination of data parallelism, tensor parallelism, and pipeline parallelism, model training throughput can be increased by more than 10%. This parallel approach can train a GPT-architecture model with one trillion parameters on 3072 GPUs at an aggregate throughput of 502 petaFLOP/s, reaching 52% of the theoretical peak per-GPU throughput. Using this technology, Microsoft and Nvidia trained Megatron-Turing NLG, a very large-scale pre-trained language model with 530 billion parameters, on more than 3,000 GPUs.

Legend: The parameter scale and performance level achieved when training the model with PTD-P technology

Source : https://arxiv.org/pdf/2104.04473.pdf
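
How the three parallelism dimensions compose can be made concrete with a little arithmetic: the total GPU count is the product of the tensor-parallel, pipeline-parallel, and data-parallel degrees. The sketch below checks such a decomposition; the specific 8 × 64 × 6 example is one way to reach 3072 GPUs and is illustrative rather than quoted from the paper.

```python
# PTD-P composes three parallelism dimensions; the total number of GPUs is
# tensor_parallel * pipeline_parallel * data_parallel.
def data_parallel_degree(world_size, tensor_parallel, pipeline_parallel):
    model_parallel = tensor_parallel * pipeline_parallel
    assert world_size % model_parallel == 0, "GPU count must divide evenly"
    return world_size // model_parallel

# Illustrative decomposition of a 3072-GPU setup: 8-way tensor parallelism
# within each node, 64-way pipeline parallelism across nodes, leaving
# 6-way data parallelism over model replicas.
dp = data_parallel_degree(world_size=3072, tensor_parallel=8, pipeline_parallel=64)
print(dp)  # 6
```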

5. Pre-training models are applied in scenarios such as biological research and the Internet

As data scale gradually expands and data modalities grow richer, pre-training models will penetrate more fields and complete various types of tasks through the "pre-training + fine-tuning" paradigm. In scientific research, pre-training models combined with domain data will become "foundation models" for downstream tasks, helping to produce more scientific discoveries. In industry, for more complex intelligent decision-making scenarios, large models pre-trained on diverse Internet data and equipped with decision-making capability may be the focus of the next stage of development.

(1), In May 2021, Google proposed the multi-task unified model MUM

In May, Google introduced the Multitask Unified Model (MUM) at its 2021 I/O conference. The MUM model can understand 75 languages and has been pre-trained on a large amount of webpage data. It is good at understanding and answering complex decision-making questions and can find information across cross-language, multimodal webpage data, giving it application value in Internet scenarios such as customer service, Q&A, and marketing.

Legend: The MUM model can retrieve relevant travel guidance from multi-source webpage information based on a user's question

Source: MUM: A new AI milestone for understanding information

(2), In June 2021, researchers from Tsinghua University, Zhiyuan and elsewhere proposed the Chinese-centric language model CPM

In June, researchers from Tsinghua University, Zhiyuan and elsewhere unveiled CPM, a Chinese-centric multilingual pre-training model, at the Beijing Zhiyuan Conference. Compared with existing open-source pre-training models, its overall performance on seven language capability tests, covering areas such as cross-lingual ability, generation, and generalization, is significantly better. The publicly downloadable CPM-2 model comes in 3 versions: an 11-billion-parameter Chinese model, an 11-billion-parameter Chinese-English model, and a 198-billion-parameter Chinese-English MoE model.

Legend: The performance of the CPM model in downstream tasks

Source : https://arxiv.org/pdf/2106.10715.pdf

(3), In August 2021, researchers from Zhiyuan, Tsinghua University and elsewhere proposed the protein pre-training model ProteinLM

In August, the WuDao team at Zhiyuan Research Institute, together with Tsinghua University and Tencent Quantum Lab, proposed ProteinLM, a protein pre-training model, and open-sourced versions with 200 million and 3 billion parameters. The model supports protein secondary structure prediction, fluorescence prediction, contact prediction, folding stability prediction, and remote homology detection tasks. Compared with the 38-million-parameter baseline model TAPE, ProteinLM improves performance on downstream tasks, especially in protein folding prediction, where it improves by 39% over the baseline.

Legend: Performance of the ProteinLM model in downstream tasks

Source : GitHub - BAAI-WuDao/ProteinLM: Protein Language Model

Source : blog.csdn.net/qq_41185868/article/details/131160863