ChatGPT Development Report: Principles, Detailed Technical Architecture and Industry Future

On December 1, 2022, OpenAI launched ChatGPT, an artificial intelligence chat prototype, which once again attracted widespread attention and triggered a discussion in the AI world comparable to the one AIGC sparked over artists' jobs.

According to reports, ChatGPT attracted more than 1 million registered users within just a few days of its open trial, and social networks are full of interesting conversations in which people question or tease ChatGPT. Some even compare ChatGPT to a combination of "search engine + social software": it can give reasonable answers to questions through real-time interaction.

ChatGPT is a language model focused on dialogue generation. It can generate an intelligent answer to the user's text input, whether a few short words or a long passage. GPT is the abbreviation of Generative Pre-trained Transformer.

By learning from a large body of existing text and dialogue collections (such as Wikipedia), ChatGPT can hold instant conversations like a human and answer all kinds of questions fluently (though its answering speed is still slower than a human's). Whether in English or other languages (such as Chinese or Korean), from answering historical questions to writing stories, and even drafting business plans and industry analyses, it can do "almost" anything. Some programmers have even posted conversations in which ChatGPT modified their programs.

ChatGPT can also be used together with other AIGC models to obtain even more impressive and practical functions, for example generating a living-room design drawing through dialogue. This greatly strengthens AI applications' ability to communicate with customers and gives us a glimpse of large-scale deployment of AI.

1. The inheritance and characteristics of ChatGPT

1.1 OpenAI family

Let's first look at what kind of company OpenAI actually is.

Headquartered in San Francisco, OpenAI was co-founded in 2015 by Elon Musk, Sam Altman, and other investors, with the goal of developing AI technologies that benefit all of humanity. Musk left in 2018 due to differences over the company's direction.

OpenAI was previously best known for its GPT series of natural language processing models. Since 2018, OpenAI has been releasing generative pre-trained language models (GPT, Generative Pre-trained Transformer), which can be used to generate articles, code, machine translations, question answering, and other content.

The number of parameters in each generation of the GPT model has grown explosively, a case of "bigger is better". GPT-2, released in February 2019, had 1.5 billion parameters, while GPT-3, released in May 2020, had 175 billion.

GPT family main model comparison

1.2 Main features of ChatGPT

ChatGPT is a dialogue AI model developed on the GPT-3.5 (Generative Pre-trained Transformer 3.5) architecture and is a sibling model of InstructGPT. ChatGPT is likely OpenAI's warm-up exercise before the official launch of GPT-4, or a way to collect large amounts of dialogue data.

Main features of ChatGPT

OpenAI trained ChatGPT using RLHF (Reinforcement Learning from Human Feedback) and added more human supervision for fine-tuning.

In addition, ChatGPT also has the following characteristics:

1) It can proactively admit its mistakes. If the user points out an error, the model listens and refines its answer.

2) ChatGPT can challenge incorrect premises. For example, when asked about "Columbus coming to the United States in 2015", it will point out that Columbus did not live in that era and adjust its output accordingly.

3) ChatGPT can admit its own ignorance, including a lack of knowledge of specialized technical topics.

4) It supports continuous multi-turn dialogue.

Unlike the various smart speakers and "artificially unintelligent" assistants we use in daily life, ChatGPT remembers earlier turns of the conversation, i.e., it understands context, and can use this to answer hypothetical questions. ChatGPT supports continuous dialogue, which greatly improves the user experience in conversational interaction.

For accurate translation (especially Chinese and transliterated names), ChatGPT is still far from perfect, but in terms of text fluency and recognizing specific names it is comparable to other online translation tools.

Since ChatGPT is a large language model, it currently has no web-search capability, so it can only answer from the dataset it was trained on, which cuts off in 2021. For example, it does not know about the 2022 World Cup, and it will not tell you today's weather or search for information the way Apple's Siri does. If ChatGPT could go online to find learning material and search for knowledge by itself, it would presumably be an even greater breakthrough.

Even with limited knowledge, ChatGPT can still answer many of the imaginative questions humans throw at it. To prevent ChatGPT from picking up bad habits, it uses algorithmic filtering to reduce harmful and deceptive training inputs: queries are filtered through the Moderation API, and potentially racist or sexist prompts are dismissed.

2. The principle of ChatGPT/GPT

2.1 NLP

Known limitations in the NLP/NLU field include repetitive text, misinterpretation of highly specialized topics, and misinterpretation of contextual phrases.

For a human or an AI, it typically takes years of training to hold a normal conversation. NLP-class models must not only understand the meaning of words, but also know how to form sentences and give contextually meaningful answers, even using appropriate slang and professional vocabulary.

Application fields of NLP technology

Essentially, GPT-3 or GPT-3.5, the basis of ChatGPT, is an extremely large statistical language model, i.e., a sequential text prediction model.

2.2 GPT vs. BERT

Similar to the BERT model, ChatGPT or GPT-3.5 automatically generates each word (token) of the answer based on the input sentence and language/corpus probabilities. From the perspective of mathematics or machine learning, a language model models the probability distribution over word sequences: it takes the sentence already produced (which can be treated mathematically as a vector) as the conditioning input and predicts the probability distribution of the next word, or even of whole sentences, at each step.
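
The minimal Python sketch below illustrates this idea of sequential text prediction: the probability of a sequence factorizes into per-step next-token distributions, and generation repeatedly samples from them. The `toy_logits` function is a stand-in for a real Transformer forward pass, and the token ids are invented for illustration; this is not GPT's actual implementation.

```python
import numpy as np

VOCAB_SIZE = 50_000
rng = np.random.default_rng(0)

def toy_logits(token_ids):
    """Stand-in for the model: one unnormalized score per vocabulary word.
    A real GPT conditions on the whole prefix with a Transformer; this toy
    version just derives pseudo-random scores from the prefix."""
    local = np.random.default_rng(hash(tuple(token_ids)) % (2**32))
    return local.normal(size=VOCAB_SIZE)

def next_token_distribution(token_ids, temperature=1.0):
    """P(next token | prefix): softmax over the model's logits."""
    logits = toy_logits(token_ids) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt_ids, max_new_tokens=5):
    """Sampling loop: P(w1..wn) = prod_t P(w_t | w_1..w_{t-1})."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

print(generate([101, 2023, 2003]))  # toy token ids, not a real tokenizer
```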

ChatGPT is trained with reinforcement learning from human feedback, a method that augments machine learning with human intervention for better results. During training, human trainers play both the user and the AI assistant, and the model is fine-tuned with a proximal policy optimization algorithm.

Thanks to its stronger performance and massive number of parameters, ChatGPT covers more topic data and can handle more niche subjects. ChatGPT can now handle tasks such as answering questions, writing articles, summarizing text, translating languages, and generating computer code.

The technical architecture of BERT and GPT (in the figure, En is each input word and Tn is each output word)

3. The technical architecture of ChatGPT

3.1 Evolution of the GPT family

When it comes to ChatGPT, we have to mention the GPT family.

ChatGPT has several well-known older siblings, namely GPT-1, GPT-2, and GPT-3, each larger than the last, and ChatGPT is most similar to GPT-3.

Technical comparison between ChatGPT and GPT 1-3

Both the GPT family and the BERT model are well-known NLP models based on Transformer technology. GPT-1 had only 12 Transformer layers, while GPT-3 increased this to 96.

3.2 Human Feedback Reinforcement Learning

The main difference between InstructGPT/GPT-3.5 (ChatGPT's predecessor) and GPT-3 is the addition of RLHF (Reinforcement Learning from Human Feedback). This training paradigm strengthens human conditioning of the model's output and produces a more comprehensible ranking of results.

In InstructGPT, the following are the criteria for evaluating how "good" a generated sentence is:

  1. Truthfulness: is the output false or misleading?

  2. Harmlessness: does it cause physical or psychological harm to people or the environment?

  3. Helpfulness: does it solve the user's task?

3.3 TAMER framework

Here we must mention the TAMER framework (Training an Agent Manually via Evaluative Reinforcement). This framework introduces human markers into the agent's learning loop: humans provide reward feedback to the agent (i.e., guide the agent's training), so the training goal is reached quickly.

TAMER framework paper

The main purpose of introducing human labelers is to speed up training. Although reinforcement learning performs outstandingly in many fields, it still has many shortcomings, such as slow convergence and high training cost. In the real world in particular, many tasks have high exploration or data-acquisition costs, so improving training efficiency is one of the important problems in today's reinforcement learning tasks.

TAMER can use the knowledge of human markers, delivered as reward-signal feedback, to train the agent and accelerate its convergence. TAMER does not require markers to have professional knowledge or programming skills, and the cost of the corpus is lower. With TAMER+RL (reinforcement learning), reinforcement learning (RL) from Markov decision process (MDP) rewards can be augmented with feedback from human markers.
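
A minimal sketch of the TAMER+RL idea follows: the agent's learning signal combines the environment's MDP reward with an occasional human evaluative score. The additive blending and the weight `beta` are illustrative assumptions, not the framework's exact formulation.

```python
from typing import Optional

def combined_reward(env_reward: float,
                    human_feedback: Optional[float],
                    beta: float = 0.5) -> float:
    """Blend the MDP reward with an optional human evaluative signal.

    env_reward:     reward emitted by the environment (standard RL).
    human_feedback: scalar score from the human trainer, or None when the
                    trainer gave no feedback on this step.
    beta:           weight on human feedback (an assumed hyperparameter).
    """
    if human_feedback is None:
        return env_reward
    return env_reward + beta * human_feedback

# Toy trajectory: the trainer only scores some of the steps.
trajectory = [(1.0, None), (0.0, +1.0), (-1.0, -1.0)]
total = sum(combined_reward(r, h) for r, h in trajectory)
print(total)  # 0.0 with the assumed beta
```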

Application of TAMER Architecture in Reinforcement Learning

In the specific implementation, human markers play the roles of the dialogue user and the AI assistant: they provide dialogue samples, let the model generate several replies, then score and rank those candidate replies and feed the better results back to the model. This feedback-driven learning combines human reinforcement and Markov decision process rewards into an integrated system, fine-tuning the model and iterating continuously through the reward policy.

On this basis, ChatGPT can understand and carry out human language or instructions better than GPT-3, imitate humans, and provide coherent and logical text.

3.4 Training of ChatGPT

The training process of ChatGPT is divided into the following three stages:

The first stage: training a supervised policy model

GPT-3.5 itself has difficulty understanding the different intentions behind different kinds of human instructions, and also has difficulty judging whether the content it generates is of high quality. To give GPT-3.5 an initial ability to understand instructions, questions are first randomly sampled from the dataset, human labelers write high-quality answers for them, and these manually labeled data are then used to fine-tune GPT-3.5 (yielding the SFT model, Supervised Fine-Tuning).

The SFT model at this point already follows instructions/dialogue better than GPT-3, but does not necessarily align with human preferences.
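
As a rough illustration of this first stage, the sketch below fine-tunes a causal language model on prompt/answer pairs with ordinary next-token cross-entropy. It assumes the HuggingFace `transformers` library, uses the public "gpt2" checkpoint as a stand-in for GPT-3.5 (which is not publicly available), and the demonstration data are invented; it is not OpenAI's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Tiny in-memory stand-in for the labeler-written demonstration data.
demonstrations = [
    ("Explain photosynthesis briefly.",
     "Plants use sunlight to turn carbon dioxide and water into sugars."),
]

model.train()
for prompt, answer in demonstrations:
    # Concatenate prompt and human-written answer; train with next-token
    # cross-entropy on the whole text (labels = input ids).
    enc = tokenizer(prompt + "\n" + answer, return_tensors="pt", truncation=True)
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```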

The training process of the ChatGPT model

The second stage: training the reward model (Reward Model, RM)

This stage mainly uses manually labeled training data (about 33K items) to train the reward model. Questions are randomly sampled from the dataset, and the model produced in the first stage generates multiple different responses to each question. Human annotators then consider these responses together and rank them, a process similar to a coach or teacher providing guidance.

Next, this ranking data is used to train the reward model. The ranked responses are combined in pairs to form many training pairs. The RM model takes an input and outputs a score that evaluates the quality of the answer; for each training pair, the parameters are tuned so that the higher-quality response scores higher than the lower-quality one.
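
A minimal sketch of this pairwise objective is shown below: for each (chosen, rejected) pair the loss is -log sigmoid(r_chosen - r_rejected), which pushes the chosen answer's score above the rejected one's. The RM itself (a language model with a scalar head) is omitted; only the loss on its output scores is shown, with toy values.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_chosen: torch.Tensor,
                          score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: scores the RM might assign to three (chosen, rejected) pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, 1.9])
print(pairwise_ranking_loss(chosen, rejected))
```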

The third stage: optimizing the policy with PPO (Proximal Policy Optimization) reinforcement learning.

The core idea of PPO is to turn the on-policy training process of policy gradient into an off-policy one, i.e., to turn online learning into offline learning; this conversion is called importance sampling. In this stage, the reward model trained in the second stage is used to update the parameters of the pre-trained model via its reward scores. Questions are randomly sampled from the dataset, the PPO model generates answers, and the RM model trained in the previous stage assigns quality scores. These reward scores are propagated back, producing a policy gradient, and the PPO model's parameters are updated through reinforcement learning.
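
The sketch below shows the two pieces this paragraph describes, with toy tensors: the clipped PPO objective, whose policy ratio is exactly the importance-sampling term that lets old rollouts be reused, and a shaped reward that combines the RM score with a KL penalty keeping the policy close to the SFT model. The coefficients are assumed hyperparameters, not OpenAI's published values.

```python
import torch

def ppo_clip_loss(logprobs_new: torch.Tensor,
                  logprobs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs_new - logprobs_old)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # maximize => minimize negative

def shaped_reward(rm_score: torch.Tensor,
                  logprob_policy: torch.Tensor,
                  logprob_sft: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # RM score minus a per-token KL penalty against the SFT model.
    return rm_score - kl_coef * (logprob_policy - logprob_sft)

# Toy values
lp_new = torch.tensor([-1.0, -0.8, -1.2])
lp_old = torch.tensor([-1.1, -0.9, -1.0])
adv = torch.tensor([0.5, -0.2, 1.0])
print(ppo_clip_loss(lp_new, lp_old, adv))
print(shaped_reward(torch.tensor(2.0), torch.tensor(-0.8), torch.tensor(-1.0)))
```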

If the second and third stages are repeated over many iterations, a higher-quality ChatGPT model is trained.

4. Limitations of ChatGPT

As long as the user types in a question, ChatGPT gives an answer. Does that mean we no longer need to feed keywords to Google or Baidu and can get the answer we want immediately?

Although ChatGPT has demonstrated excellent contextual dialogue ability and even programming ability, and has changed the public's impression of chatbots (ChatBot) from "artificially unintelligent" to "interesting", we must also recognize that ChatGPT still has limitations and is still improving.

1) In areas where it has not been trained on a large corpus, ChatGPT lacks "human common sense" and the ability to extrapolate, and it can even spout serious-sounding nonsense. ChatGPT can "make up answers" in many fields, but when the user is looking for a correct answer, ChatGPT may give a misleading one. For example, ask ChatGPT to solve a primary-school word problem: it can write a long chain of calculations, yet the final answer is wrong.

2) ChatGPT cannot handle complex, lengthy, or highly specialized language structures. For questions from very specialized domains such as finance, the natural sciences, or medicine, ChatGPT may not generate appropriate answers without sufficient corpus "feeding".

3) ChatGPT requires a very large amount of computing power (chips) for training and deployment. Setting aside the large corpus needed to train the model, ChatGPT still requires servers with massive computing power when deployed, and the cost of these servers is beyond ordinary users. Even a model with billions of parameters needs an astonishing amount of computing resources to run and train; facing the hundreds of millions of user requests of a real search engine, no company could bear this cost under the current free strategy. For the general public, therefore, a lighter model or a more cost-effective computing platform is still needed.

4) ChatGPT cannot yet incorporate new knowledge online, and re-training the GPT model whenever new knowledge appears is unrealistic: both the training time and the training cost are hard for ordinary trainers to accept. Adopting an online training mode for new knowledge looks feasible and its corpus cost is relatively low, but introducing new data easily causes catastrophic forgetting of the original knowledge.

5) ChatGPT is still a black-box model. Its internal algorithmic logic has not yet been dissected, so there is no guarantee that ChatGPT will not produce statements that attack or even harm users.

Of course, these flaws do not overshadow its strengths. Some engineers posted a dialogue asking ChatGPT to write Verilog code (chip design code), and its level already surpasses that of some Verilog beginners.

5. The future improvement direction of ChatGPT

5.1 RLAIF with reduced human feedback

At the end of 2020, Dario Amodei, former vice president of research at OpenAI, left the company and went on to found the artificial intelligence company Anthropic with about 10 employees. Most of Anthropic's founding team are early or core OpenAI employees who worked on GPT-3, multimodal neurons, and reinforcement learning from human preferences.

In December 2022, Anthropic published the paper "Constitutional AI: Harmlessness from AI Feedback", introducing its artificial intelligence model Claude. (arxiv.org/pdf/2212.0807)

CAI model training process

Both Claude and ChatGPT rely on reinforcement learning (RL) to train a preference model. CAI (Constitutional AI) is also built on RLHF; the difference is that CAI's ranking process uses a model (rather than humans) to provide the initial ranking of all generated outputs.

CAI replaces human preferences with artificial intelligence feedback for harmlessness, i.e., RLAIF: an AI evaluates the reply content according to a set of constitutional principles.
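
The hypothetical sketch below illustrates the RLAIF ranking step: an AI "critic", guided by constitutional principles, ranks candidate replies in place of human labelers. The `critic_score` heuristic and the toy constitution are placeholders so the sketch runs; in the real CAI pipeline a language model is prompted with the principles and the candidates.

```python
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is least deceptive.",
]

def critic_score(reply: str) -> float:
    """Placeholder for an AI-feedback model: a trivial heuristic that
    penalizes replies the toy constitution would flag."""
    return -1.0 if "harmful" in reply.lower() else 1.0

def rank_with_ai_feedback(replies):
    """Produce the initial ranking used to train the preference model (RLAIF)."""
    return sorted(replies, key=critic_score, reverse=True)

print(rank_with_ai_feedback(["a harmful reply", "a harmless, helpful reply"]))
```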

5.2 Making up for shortcomings in mathematics and science

Although ChatGPT has strong dialogue ability, it is prone to spouting serious-sounding nonsense in conversations involving mathematical calculation.

Computer scientist Stephen Wolfram has proposed a solution to this problem. Wolfram created the Wolfram Language and the computational knowledge engine Wolfram|Alpha, which is backed by Mathematica.

Combining ChatGPT with Wolfram|Alpha to handle mathematical problems

In this combined system, ChatGPT can "talk" to Wolfram|Alpha just as humans use Wolfram|Alpha, and Wolfram|Alpha uses its symbolic translation ability to "translate" the natural-language expressions obtained from ChatGPT into the corresponding symbolic computational language. In the past, academia was divided between the kind of "statistical approach" used by ChatGPT and the "symbolic approach" of Wolfram|Alpha, but the complementarity of ChatGPT and Wolfram|Alpha now gives the NLP field a chance to reach the next level.

ChatGPT does not need to generate such code itself; it only needs to generate ordinary natural language, which Wolfram|Alpha then translates into precise Wolfram Language for the underlying Mathematica to compute.
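
A rough sketch of this routing idea follows: computational questions are handed to Wolfram|Alpha as plain natural language, while everything else stays with the chat model. The endpoint shown is Wolfram|Alpha's short-answers API; the app id and the `looks_computational` heuristic are placeholders, and the overall division of labor is an assumption for illustration.

```python
import requests

WOLFRAM_APP_ID = "YOUR_APP_ID"  # placeholder, obtain from the Wolfram developer portal

def looks_computational(question: str) -> bool:
    """Crude stand-in for the decision 'should this go to Wolfram|Alpha?'."""
    return any(ch.isdigit() for ch in question) or "integral" in question.lower()

def answer(question: str) -> str:
    if looks_computational(question):
        # Delegate the calculation to Wolfram|Alpha, passing natural language as-is.
        resp = requests.get(
            "https://api.wolframalpha.com/v1/result",
            params={"appid": WOLFRAM_APP_ID, "i": question},
            timeout=10,
        )
        return resp.text
    return "(hand the question to the chat model instead)"

print(answer("What is the integral of x^2 from 0 to 3?"))
```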

5.3 Miniaturization of ChatGPT

Although ChatGPT is powerful, its model size and cost of use also put it out of reach for many users.

There are three categories of model compression that reduce model size and cost.

The first method is quantization, which reduces the numerical precision of individual weights. For example, reducing a Transformer from FP32 to INT8 has little effect on its accuracy.
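
As a minimal illustration of the idea, the sketch below applies symmetric per-tensor INT8 quantization to a small weight matrix and reports the reconstruction error. Real systems quantize per channel or per group and calibrate on activations; this only shows the basic rescaling.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 weights into [-127, 127] with a single scale per tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # stand-in for a Transformer weight
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```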

The second method of model compression is pruning, which removes network elements, from individual weights (unstructured pruning) up to coarser-grained components such as weight matrices (structured pruning). This approach works well in vision and in smaller-scale language models.
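
The sketch below shows the simplest unstructured variant, magnitude pruning: zero out the fraction of weights with the smallest absolute value. It is only a toy illustration of the idea; SparseGPT, discussed next, uses a far more sophisticated one-shot method.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

w = np.random.randn(4, 4)
print(magnitude_prune(w, sparsity=0.5))
```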

The third method of model compression is sparsification. For example, SparseGPT (arxiv.org/pdf/2301.0077), proposed by the Institute of Science and Technology Austria (ISTA), can prune GPT-series models to 50% sparsity in a single pass without any retraining. For the GPT-175B model, this pruning can be done within a few hours on a single GPU.

SparseGPT compression process

6. ChatGPT's industrial future and investment opportunities

6.1 AIGC

When it comes to ChatGPT, AIGC must be mentioned.

AIGC uses artificial intelligence technology to generate content. Compared with UGC (user-generated content) and PGC (professionally generated content) of the Web 1.0 and Web 2.0 eras, AIGC, content conceived by artificial intelligence, represents a new round of change in how content is produced, and AIGC content is also expected to grow exponentially in the Web 3.0 era.

The emergence of the ChatGPT model is of great significance for text/speech-modality AIGC applications and will have a major impact on the upstream and downstream of the AI industry.

6.2 Scenarios that stand to benefit

From the perspective of downstream applications that benefit, these include but are not limited to no-code programming, novel generation, conversational search engines, voice companions, voice work assistants, conversational virtual humans, AI customer service, machine translation, and chip design. From the perspective of increased upstream demand, these include computing-power chips, data labeling, and natural language processing (NLP).

Large models are exploding (more parameters/greater demand for computing power chips)

With continued advances in algorithms and computing power, ChatGPT will move toward more advanced versions with stronger functions, and it will be applied in more and more fields, generating more and better conversations and content for humans.

Finally, the author asked ChatGPT about the status of storage-computing integrated (compute-in-memory) technology in the ChatGPT field (the author himself currently focuses on bringing compute-in-memory chips to market). ChatGPT thought for a moment and boldly predicted that storage-computing integrated technology will dominate in ChatGPT chips. (Exactly what I hoped to hear.)
