Talking about AI large models, starting from ChatGPT

      Recently, ChatGPT has dominated almost every hot topic, and I believe everyone is already familiar with it. I have been reading up on the subject and have summarized some information about large models below. If anything is missing or could be improved, please point it out.

What is an AI large model?

      The AI large model is a Foundation Model: a model trained on large-scale, broad data that can then be adapted to a wide range of downstream tasks. (The term Foundation Model comes from the paper "On the Opportunities and Risks of Foundation Models" by Fei-Fei Li and other scholars.)

      The AI large model is a milestone technology on the road from artificial intelligence toward general intelligence. Deep learning, the landmark technology of the current generation of AI, relies entirely on models learning knowledge automatically from data; while it has greatly improved performance, it also faces the contradiction between abundant general data and scarce domain-specific data. AI large models combine the attributes of "large scale" and "pre-training": before being adapted to an actual task, they are pre-trained on massive amounts of general data, which greatly improves the generalization, versatility, and practicality of AI.
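      To make the "pre-training + fine-tuning" paradigm concrete, here is a minimal sketch using the Hugging Face transformers library: a backbone pre-trained on general data (bert-base-uncased is used as an assumed example) is loaded and then fine-tuned on a tiny, made-up sentiment task. It is only an illustration of the workflow, not any vendor's actual pipeline.

```python
# Minimal sketch of the "pre-train + fine-tune" paradigm with Hugging Face transformers.
# Assumptions: bert-base-uncased as the pre-trained backbone and a tiny in-memory
# sentiment data set; a real project would use a proper dataset and evaluation.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a wonderful film", "a complete waste of time"]
labels = [1, 0]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dataset = list(zip(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)))
loader = DataLoader(dataset, batch_size=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few fine-tuning epochs are typically enough for adaptation
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```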

Development of large models

      The development of AI large models originated in natural language processing. After the Transformer architecture was proposed in 2017, and as parameter counts kept growing, it gradually became the basic architecture of the field; in 2018, BERT pushed the parameter count to roughly 300 million. At that scale, researchers found that a single model could handle multiple natural language processing tasks at once, which attracted more and more people to the area.

      In the early stage of large-model research, work was still concentrated in natural language processing, producing representative models such as the above-mentioned BERT and GPT-3, whose parameter counts reached the 100-billion scale. Capabilities grew accordingly, with reasoning extending from simple text question answering and text generation to symbolic language. In the past two years, researchers have introduced other modalities (such as vision) in the hope that models can also understand the wider world; at this stage, vision models with hundreds of millions of parameters, such as ViT, were born. The models above can respectively "read" and "see", and researchers hope to unify these two abilities into the kind of multi-modal perception found in the brain. Representative models along this line include CLIP and DALL·E.

      For more information on multimodal models, see https://zhuanlan.zhihu.com/p/460512128

Mainstream models

(1) BERT: Released by Google in October 2018, BERT is the most typical foundation model. It is trained on the unlabeled plain-text portions of BooksCorpus and English Wikipedia using two purpose-designed self-supervised tasks, and after fine-tuning the trained model achieved the best performance at the time on 11 downstream tasks.

(2) Big Transfer (BiT), the visual transfer learning model released by Google in 2021.

(3) GPT-3, released by OpenAI in May 2020, is an autoregressive language model with 175 billion parameters. Trained on Internet text, this foundation model can complete a wide range of tasks from prompt examples alone: give it a task description (such as "Translate English to French:"), an example (such as "sea otter => loutre de mer"), and a prompt (such as "cheese =>"), and GPT-3 generates the French for "cheese". Models like this are becoming the mainstream AI paradigm. (A sketch of how such a prompt can be assembled appears after the GPT series list below.)

      Take the GPT series as an example:

      1) GPT-1 has just over a hundred million parameters; its data set is BookCorpus, with 10,000 books and 2.5 billion words;

      2) GPT-2 grows to 1.5 billion parameters; its data comes from the Internet, using 8 million web pages linked from Reddit, which after cleaning amount to more than 40 GB (WebText);

      3) GPT-3 was the first to push the parameter scale past 100 billion, and its data set expanded to 570 GB of filtered Common Crawl data (400 billion words) plus WebText2 (19 billion words), BookCorpus (67 billion words), and Wikipedia (3 billion words).
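      As a concrete illustration of the "description + example + prompt" usage described for GPT-3 above, here is a minimal sketch that only assembles such a few-shot prompt as plain text. The call that would send it to a completion endpoint is left as a commented placeholder, since the exact client API depends on the provider and version.

```python
# Sketch: assembling a GPT-3-style few-shot prompt for English-to-French translation,
# mirroring the "task description + example + prompt" pattern described above.
task_description = "Translate English to French:"
examples = [("sea otter", "loutre de mer")]
query = "cheese"

prompt_lines = [task_description]
prompt_lines += [f"{en} => {fr}" for en, fr in examples]
prompt_lines.append(f"{query} =>")
prompt = "\n".join(prompt_lines)

print(prompt)
# The assembled prompt would then be sent to a text-completion endpoint, e.g.:
# completion = client.complete(prompt=prompt, max_tokens=16)  # hypothetical client call
```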

(4) FLAN, proposed by Google in 2021, is a fine-tuned model whose architecture is similar to GPT. Unlike GPT-3, it is built from 62 data sets, each with 10 prompt templates, and is fine-tuned on the resulting 620 templated data sets (a small sketch of this templating idea follows below).
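      The idea of turning one data set into many prompt templates can be sketched as follows; the templates here are made up for illustration and are not the actual FLAN templates.

```python
# Sketch: expanding one labelled example into several instruction-style prompts,
# in the spirit of FLAN's "multiple templates per data set" recipe.
# These templates are illustrative only, not the ones used in the FLAN paper.
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the sentence above, is it true that \"{hypothesis}\"?",
    "Read the text: {premise}\nQuestion: does it follow that {hypothesis}?",
]

example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is performing music.",
    "label": "yes",
}

# Each template turns the same labelled example into a different instruction-style pair.
instruction_data = [
    {
        "input": template.format(premise=example["premise"],
                                 hypothesis=example["hypothesis"]),
        "target": example["label"],
    }
    for template in templates
]

for item in instruction_data:
    print(item["input"], "->", item["target"])
```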

    More details are available here: https://zhuanlan.zhihu.com/p/545709881

List of large models outside China

 (Figure: a summary table of the large-model LLMs available today, taken from the article "Summary of the large model LLMs available today".)

Development of large models in China

      In April 2021, Huawei Cloud, together with Recurrent AI, released the PanGu NLP ultra-large-scale pre-trained language model with 100 billion parameters, and together with Peking University released the PanGu-α ultra-large-scale pre-trained model with 200 billion parameters. Alibaba's DAMO Academy released PLUG, a Chinese pre-trained language model with 27 billion parameters, and together with Tsinghua University released M6, a Chinese multi-modal pre-trained model with 100 billion parameters.

      In June 2021, the Beijing Academy of Artificial Intelligence (BAAI) released the ultra-large-scale model "Wudao 2.0" (Enlightenment 2.0) with 1.75 trillion parameters, making it the world's largest pre-trained model at the time.

      In July 2021, Baidu launched the ERNIE 3.0 knowledge-enhanced model with tens of billions of parameters.

      In October 2021, Inspur released "Yuan 1.0" (Source 1.0), an ultra-large-scale pre-trained model with roughly 250 billion parameters. In December 2021, Baidu launched ERNIE 3.0 Titan with 260 billion parameters, and the parameter count of DAMO Academy's M6 model reached 10 trillion, raising large-model parameter counts by an order of magnitude in one step.

      In 2022, large models stayed hot. Initially they were concentrated in natural language, but they have gradually expanded to vision, decision-making, and applications, even covering major scientific problems such as protein structure prediction and aerospace, with Google, Meta, Baidu and other major companies all reporting corresponding results.

List of large models in China

 Comparison of large models and traditional models

1. AI large model: thanks to the "large-scale pre-training + fine-tuning" paradigm, it adapts well to different downstream tasks and shows strong versatility. Traditional AI model: constrained by data scale or model capacity, it can only serve one task or one class of tasks and cannot support others.
2. AI large model: pre-trained on large amounts of general data, it has a range of basic capabilities and can be fine-tuned and adapted to the needs of vertical industries and business scenarios. Traditional AI model: capabilities are fragmented and development is workshop-style.
3. AI large model: it has become the technical base of upper-layer applications and can effectively support smart terminals, systems, platforms and other products. Traditional AI model: the application process has many barriers and deployment is difficult.
4. AI large model: with shared parameters, only light fine-tuning is needed for each downstream task to obtain superior performance. Traditional AI model: it is hard to generalize to other tasks.
5. AI large model: self-supervised learning reduces the need for data labeling, and the advantage grows with parameter scale; developers no longer need to run large-scale training themselves and can train the models they need from small samples, greatly reducing development and usage costs. Traditional AI model: manual labeling is costly, slow, and not very accurate.
6. AI large model: it is expected to further break through the accuracy limits of existing model structures.

Model Accuracy -- Traditional Models

      Looking at the first ten years of deep learning, gains in model accuracy came mainly from changes in network structure: from AlexNet to ResNet-50 and then to the NAS-searched EfficientNet, ImageNet top-1 accuracy rose from about 58% to 84%. But as neural-architecture design has matured and converged, it has become very difficult to break through the accuracy ceiling by optimizing the network structure alone.

 Model Accuracy -- BiT

       Take Big Transfer (BiT), the visual transfer learning model released by Google in 2021, as an example: scaling up the data also improves accuracy. Training ResNet-50 on ILSVRC-2012 (1.28 million images, 1,000 classes) versus JFT-300M (300 million images, 18,291 classes) yields 77% versus 79% accuracy, and training ResNet-152x4 on JFT-300M raises accuracy to 87.5%, 10.5 percentage points higher than ResNet-50 trained on ILSVRC-2012.

      (Figure: how accuracy changes as the model parameter scale grows; the colored text in the figure annotates the data sets.)

 Computing power demand

      To get a feel for how much computing power large models demand, consider the theoretical training time of typical models such as GPT, BERT, and GPT-2 on a single NVIDIA V100 GPU (the original post shows a comparison chart here).

       For example, training GPT-3 used tens of thousands of NVIDIA V100 GPUs, at a total cost as high as 27.6 million US dollars, and an individual wanting to train a PaLM would face a bill of 9 to 17 million US dollars. Although training consumes compute on a much larger scale, inference needs far less: GLM-130B, the bilingual large model open-sourced jointly by Tsinghua University and Zhipu AI, has been compressed via fast inference methods to the point where it can run single-machine inference on an A100 (40G*8) or V100 (32G*8) server. But an 8-card A100 machine still costs several hundred thousand RMB (a single 40G A100 is roughly 70,000 RMB, eight cards about 560,000 RMB, so a complete machine is around 600,000 RMB), which remains very expensive for many AI applications.
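      A rough back-of-the-envelope calculation shows where numbers of this magnitude come from. Using the common heuristic that training needs about 6 x (parameters) x (tokens) floating-point operations, the GPU time follows from a card's sustained throughput. The concrete values below (175 billion parameters, 300 billion tokens, roughly 30% utilisation of a V100's FP16 peak, a 10,000-GPU cluster) are illustrative assumptions, not measured figures.

```python
# Back-of-the-envelope training-time estimate using the ~6 * N * D FLOPs heuristic.
# All concrete numbers here are illustrative assumptions.
params = 175e9              # GPT-3-scale parameter count
tokens = 300e9              # training tokens (assumed)
total_flops = 6 * params * tokens

v100_peak_flops = 125e12    # V100 tensor-core peak (FP16), FLOP/s
utilisation = 0.30          # assumed sustained fraction of peak
effective_flops = v100_peak_flops * utilisation

seconds = total_flops / effective_flops
gpu_years_single_card = seconds / (3600 * 24 * 365)
print(f"Single V100: roughly {gpu_years_single_card:.0f} GPU-years")

n_gpus = 10_000             # spread over a large cluster (ignoring scaling losses)
days_on_cluster = seconds / n_gpus / (3600 * 24)
print(f"On {n_gpus} V100s: roughly {days_on_cluster:.0f} days (idealised)")
```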

      The good news is that computing hardware keeps iterating and the cost of compute keeps falling: NVIDIA's H-series cards, such as the H100, deliver roughly 7 times the FP32 compute of the earlier T4 (the mainstream card of the deep-learning 1.0 era). The bad news is that powerful cards like the H100 are restricted from export to China.

      In the large-model era, accelerator cards and tool chains optimized for the Transformer architecture keep being released. As compute vendors compete for this high ground, they raise computing power and lower cost, making it feasible to put large models into real applications.

Application scenarios in China

      For the Beijing 2022 Winter Olympics, BAAI used its "Wudao" (Enlightenment) large model to power a sign-language digital human that broadcast the Games, providing intelligent sign-language generation so that hearing-impaired viewers could also follow the event coverage, improving their social participation and well-being. The project received strong support from the Beijing Disabled Persons' Federation and the Beijing Association of the Deaf.

       Huawei's PanGu CV model mainly targets intelligent power-line inspection by drone. Take State Grid's Chongqing Yongchuan Power Supply Company as an example: intelligent drone inspection faces two main challenges, namely how to label massive amounts of data efficiently, and how to cover hundreds of defect types, which would otherwise require dozens of separate AI recognition models.

      For data labeling, the PanGu CV large model is pre-trained on large amounts of unlabeled power-grid data and then fine-tuned with a small number of labeled samples, improving sample-screening efficiency by roughly 30 times. For Yongchuan, which collects 50,000 high-definition images per day, this saves about 170 person-days of manual labeling. In terms of versatility, a single model adapts to hundreds of defect types and replaces more than 20 of the original small models, lowering maintenance costs, raising average accuracy by 18.4%, and cutting development costs by 90%.

       Of course, the recent Double Eleven shopping festival deserves a mention as well: it is the busiest day for Taobao's systems, raising the question of how to respond effectively to hundreds of millions of user inquiries.

      Content copy is generated intelligently on top of the M6 large model developed by DAMO Academy, which also helps intelligent customer service understand context and generate answers to questions.

      In addition, the multi-modal feature-extraction capability of the large model supports downstream tasks such as supplementing product attribute labels and cognitive recall.

Large Model Training Framework

      At present, general deep-learning frameworks such as PyTorch and TensorFlow cannot by themselves meet the needs of training super-large models, so Microsoft built DeepSpeed on top of PyTorch, Tencent built PatrickStar on top of PyTorch, and DAMO Academy built the distributed framework Whale on top of TensorFlow. Vendors such as Huawei (Ascend and MindSpore), Baidu (PaddlePaddle), and the domestic OneFlow have also followed up and explored super-large-model training in depth, supporting it on top of their native AI frameworks.
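      To show how such a framework plugs into ordinary PyTorch code, here is a minimal sketch of the DeepSpeed usage pattern. The configuration values are placeholders; real super-large-model runs combine ZeRO stages with tensor and pipeline parallelism plus careful cluster tuning.

```python
# Minimal sketch of wrapping a PyTorch model with DeepSpeed (ZeRO optimisation).
# Config values are placeholders; requires a CUDA GPU and is normally started
# with the `deepspeed` command-line launcher.
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

ds_config = {
    "train_batch_size": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # partition optimiser state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Training step: DeepSpeed handles mixed precision, ZeRO partitioning and
# gradient accumulation behind these calls. The loss here is a dummy example.
x = torch.randn(4, 1024, device=model_engine.device, dtype=torch.half)
loss = model_engine(x).float().pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```

      Launched with the deepspeed launcher, the same script scales from one GPU to many without changing the training loop, which is the main appeal of this family of frameworks.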

Leading vendors for large models

      The main competing stacks are NVIDIA GPUs with Microsoft's DeepSpeed, Google's TPUs with TensorFlow, and of course Huawei's Ascend Atlas 800 with MindSpore, each of which can be optimized end to end. Most other vendors build on NVIDIA GPUs and add their own innovations and optimizations.

 

 Stanford University's large-model research center conducted a comprehensive evaluation of 30 mainstream large models from around the world.

       GLM-130B was the only large model from Asia selected. Compared with the major models from OpenAI, Google Brain, Microsoft, NVIDIA, and Meta AI, the evaluation report shows that GLM-130B is close to or on par with GPT-3 175B (davinci) in accuracy and fairness, and better than GPT-3 175B in robustness, calibration error, and absence of bias.

      Zhipu AI, a company spun out of Tsinghua University's research, has open-sourced a new member of the GLM series: ChatGLM-6B, a Chinese-English bilingual dialogue model that supports inference on a single consumer-grade graphics card. This follows the company's earlier open-sourcing of the 100-billion-parameter base model GLM-130B.

      Open source address: https://github.com/THUDM/ChatGLM-6B
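      A usage sketch following the pattern documented in that repository's README (the model identifier and the chat helper come from the README; verify against the current version of the repository before relying on it):

```python
# Sketch: running ChatGLM-6B on a single GPU, following the repository's README pattern.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# The custom `chat` method (provided by the model's remote code) keeps the
# dialogue history so that follow-up questions stay in context.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "请用一句话介绍大模型", history=history)
print(response)
```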

Outlook

     For AI large models we expect not only a huge number of parameters, but also the ability to understand information from multiple modalities efficiently, to perceive across modalities, and to transfer to and carry out tasks across very different domains.

      The content of this article is also available as a PPT; anyone who wants it can download it from my resources. The slides are fairly rough, so please bear with me.

https://download.csdn.net/download/sunnyrainflower/87642873
