A look at ChatGPT's key technologies and the path to implementation

Speaker | Liu Huanyong

Transcript edited by | William


1. The origin and essence of ChatGPT from the perspective of large language models

ChatGPT can be understood in two parts: Chat, which is the application form, and GPT, which is the generative model underneath. Baidu Encyclopedia defines ChatGPT as a natural language processing tool driven by artificial-intelligence technology: it conducts conversations by learning and understanding human language, interacts according to the context of the chat, chats and communicates like a human, and can even complete tasks such as writing emails, video scripts, copy, translations, code, and papers. At present, everything said about it is interpretation and guesswork, since there is no official report or paper. Some key terms need to be understood first:

Figure 1 Key terms

ChatGPT is essentially a natural language processing model based on GPT: it uses a Transformer to predict the probability distribution of the next word, and learns that pattern from a large-scale text corpus. From GPT-1 to GPT-3, capability improved steadily: GPT-1 had only 117 million parameters, GPT-2 grew to 1.5 billion, and GPT-3 reached 175 billion. Later, because the generation quality was sometimes not particularly good, instruction fine-tuning was introduced to make the model conform as much as possible to a set of criteria (being helpful, truthful, and harmless), that is, to learn more skills, with results further improved through feedback learning.

The development of ChatGPT is really the evolution of the Transformer family as a whole. NLP work since the 1950s started with rule-based processing of small amounts of data; machine learning then fitted parameters on data of a certain scale; later, CNNs were used to encode and represent features; and then it was discovered that the encoding ability of multi-head attention is clearly stronger than that of earlier neural networks. From there the field split into three routes: the GPT series, iterated roughly once a year; T5; and BERT. The two current mainstream architectures are shown in Figure 2. The essential difference between them lies in how each position attends to the input: in the GPT route every token attends only to the tokens before it, while the BERT route can fully capture contextual information on both sides, making it better suited to understanding.

Figure 2 Two mainstream architectures
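To make the split concrete, the sketch below (a toy illustration in PyTorch, not code from any of these models; the sequence length is an arbitrary assumption) shows the masking difference between the two routes:

```python
# Toy illustration of the attention masks behind the two routes.
import torch

seq_len = 5

# GPT-style (decoder-only): a lower-triangular causal mask lets position i
# attend only to positions <= i, so each token is predicted from its left
# context alone, which is what makes autoregressive generation possible.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

# BERT-style (bidirectional): no causal mask, so every position attends to
# every other position and full context is captured, which helps
# understanding but rules out left-to-right generation in the same way.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
print(bidirectional_mask)
```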

When you use ChatGPT, its internal working principle is essentially repeated comparison. For example, when asked to write an article, it repeatedly asks what the next word in the current text should be, computes the probabilities, and picks the most likely one, until the task is complete. So the output is controlled by probability, and that probability in turn is controlled by a language model.
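As a concrete illustration, here is a minimal sketch of next-word probability prediction using the public GPT-2 checkpoint via the Hugging Face transformers library (GPT-2 stands in here as an assumption, since ChatGPT's own weights are not public):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits      # shape: (1, seq_len, vocab_size)

# The distribution over the next word is the softmax of the logits at the
# last position; generation keeps appending a likely word and repeating.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r}  p={p.item():.3f}")
```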

In other words, the model does not know what it has written at all, but it is very confident that each next word is the one most likely to follow the words before it.

This is what leads to the serious, confident-sounding nonsense that can be seen in the current ChatGPT.

Let's look first at GPT-1 in the GPT series. It established a paradigm of general-purpose multi-task pre-training followed by fine-tuning. Compared with the original Transformer, it made significant changes: first, only a 12-layer decoder is trained; second, unlike Google's BERT, it predicts using only the preceding text and never sees the following context. Because only the decoder part is used, the structure is simple; it is well suited to text generation, but it has relatively large defects in general language understanding and conversational communication.

GPT-2 made many improvements on the basis of GPT-1. First, more sources of information: the training data was expanded to 40GB. Second, the number of layers increased to 48 and the hidden dimension to 1600, reaching 1.5 billion parameters. Third, it no longer fine-tunes separate models for different tasks, instead framing every task as language modeling. But some problems remained. First, from a practical point of view, each new task still requires large labeled datasets, which limits the applicability of the language model. Second, the pre-train-then-fine-tune approach generalizes relatively poorly. Third, human learning does not require large supervised datasets, so the paradigm itself has certain limitations.

In order to solve these problems, GPT-3 adopted the idea of in-context learning to achieve better results, then used even more data, pushing the parameters straight to a scale of 175B.
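A sketch of what in-context (few-shot) learning looks like in practice: the "training examples" sit directly in the prompt, and the model continues the pattern without any gradient update (the translation examples below follow the style of the GPT-3 paper):

```python
# In-context learning: no fine-tuning, the examples live in the prompt.
few_shot_prompt = """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# Sent to a sufficiently large language model, the expected continuation is
# "fromage", even though the weights were never updated for this task.
print(few_shot_prompt)
```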

2. Key technical issues, open-source projects, and implementation challenges of ChatGPT

2.1 The core algorithm and core data of ChatGPT

Instruct-GPT adds human feedback for guidance. Why add this feedback? Because GPT-3 still has problems, including uneven generation quality, a tendency to produce fluent but useless or even harmful results, and weak zero-shot comprehension. Simply increasing the size of the language model cannot solve these problems, so fine-tuning with reinforcement learning from human feedback is required.

Fine-tuning involves these steps: first, some fine-tuning data is collected; then the reward model is trained; then reinforcement learning optimizes the SFT model, and the process iterates. People can give feedback in several ways: one is to write answers for the required prompts; another is to have humans rank the quality of the model's generated results, and then use reinforcement learning to steer the model in a better direction.

Instruct-GPT is trained with reinforcement learning from human feedback, and Figure 3 shows the whole process very clearly. There are three main stages. The first is the cold-start policy model: instructions or questions submitted by users (i.e., prompts) are randomly sampled and manually labeled, and these prompts with their high-quality answers are used to fine-tune the GPT-3 model so that it fits the desired generations as closely as possible. The second is the reward-model training stage: the outputs of the pre-trained model are scored, where a higher score means a better-quality answer. The third stage uses reinforcement learning to strengthen the pre-trained model, which amounts to feeding the score back to the original SFT model so that it generates higher-quality answers.

Figure 3 Instruct-GPT process

SFT (supervised fine-tuning), a core algorithm here, mainly collects a dataset of manually written expected model outputs and uses it to train the generative model. The data consists of prompt-answer pairs. Part of this dataset comes from users of OpenAI's services; the rest was written by OpenAI engineers themselves and by hired annotators. If you wanted to build such a dataset, there are several options: first, collect data from Q&A sites such as Baidu Zhidao or forums, though it needs careful screening; second, draw on people inside the company who know the domain and the code; third, and least cost-effective, hire annotators to produce the dataset.
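A sketch of what such prompt-answer pairs could look like; the field names and the training note are illustrative assumptions, since OpenAI's actual schema has not been published (the first prompt is an example cited in the InstructGPT paper):

```python
# Hypothetical SFT training pairs; field names are assumptions.
sft_examples = [
    {
        "prompt": "Explain the moon landing to a 6 year old in a few sentences.",
        "answer": "Some people went to the moon in a big rocket ship...",
    },
    {
        "prompt": "Write a short email apologizing for a late delivery.",
        "answer": "Dear customer, we are very sorry that your order arrived late...",
    },
]

# For SFT, prompt and answer are concatenated and the model is trained with
# the ordinary next-token cross-entropy loss, typically on the answer tokens.
```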

Another core algorithm is the reward model (RM), trained on manually labeled rankings among multiple outputs of the model. The fine-tuned SFT model generates predictions, yielding N candidates per prompt; the ranked data is used to train the reward model, and the loss function is a pair-wise loss computed from the human rankings. The specific steps: first, the SFT model randomly generates K candidates for each prompt; second, pairs of candidates are selected and combined with the prompt into tuples, expanding each prompt into C(K,2) labeled examples; third, the two candidates in each pair are ranked by quality; fourth, the C(K,2) groups of training data are sent as a batch to the RM for pair-wise training. The loss takes a pair as input and is the log of the sigmoid of the difference between the two candidates' reward values, so the RM can be regarded as a regression model.
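A minimal PyTorch sketch of that pair-wise loss, i.e. -log(sigmoid(r_chosen - r_rejected)) as in the InstructGPT paper; the reward model producing the scalar rewards is assumed and not shown:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pair-wise loss over (preferred, rejected) reward pairs, shape (batch,)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# The loss shrinks as the human-preferred candidate's reward pulls ahead.
print(pairwise_rm_loss(torch.tensor([2.0]), torch.tensor([0.5])))   # ~0.20
print(pairwise_rm_loss(torch.tensor([0.5]), torch.tensor([2.0])))   # ~1.70
```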

The last core algorithm is PPO reinforcement learning, which uses the reward model as the reward function to fine-tune the supervised model with PPO. At each step, the KL divergence against the generative model trained in the first stage is computed; the goal is to keep the reinforcement-learned policy from deviating too far from the original generative model. The loss function is as follows:
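This loss corresponds to the optimization objective published in the InstructGPT paper (reconstructed here from the paper, not from the original slide):

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \left[ r_\theta(x,y)
      - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \right]
  + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
    \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

Here r_θ is the reward model, the β term penalizes KL divergence between the RL policy and the SFT model, and the γ term (the paper's "PPO-ptx" variant) mixes the original pre-training objective back in.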

The specific steps are as follows. First, initialize the PPO policy model from the fine-tuned SFT model, and initialize the value function from the trained RM. Second, randomly sample a prompt from the PPO dataset and generate an output with the current PPO policy model. Third, feed the prompt and output into the RM to compute the reward value. Fourth, use the reward to update the PPO policy model's parameters. Finally, repeat steps 2 to 4 until the PPO policy model converges.
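A runnable toy sketch of this loop; every component here (the dummy policy, the stub reward and KL functions) is a placeholder for illustration, and real implementations, such as the open-source trl library, differ considerably in detail:

```python
import random

class DummyPolicy:
    def generate(self, prompt: str) -> str:      # stand-in for sampling
        return prompt + " ... some generated answer"

sft_model = DummyPolicy()
policy, ref_policy = sft_model, DummyPolicy()    # step 1: init from SFT
beta = 0.02                                      # weight of the KL penalty
ppo_prompts = ["Explain PPO simply.", "Write a haiku about data."]

def reward_model(prompt, response):              # stub: a trained RM in reality
    return random.random()

def kl_divergence(p, q, prompt, response):       # stub: per-token KL in reality
    return 0.1 * random.random()

for step in range(4):
    prompt = random.choice(ppo_prompts)          # step 2: sample a prompt
    response = policy.generate(prompt)           #         roll out the policy
    reward = reward_model(prompt, response)      # step 3: score with the RM
    kl = kl_divergence(policy, ref_policy, prompt, response)
    shaped_reward = reward - beta * kl           # KL-penalized reward
    # step 4: a real PPO update of the policy parameters would go here
    print(f"step {step}: reward={reward:.3f}, kl={kl:.3f}")
```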

Next, some problems around the training data. GPT-3 was trained on about 300B tokens in total, 60% of which come from Common Crawl; the rest includes WebText2, Books1, Books2, and Wikipedia. But more data is not automatically better; what matters more is data quality: the higher the quality, the better the results. So building GPT-3's training data involved a whole data-engineering effort, covering three stages: classifying the data, cleaning it, and sampling it. Another part is deduplicating the dataset: deduplication prevents the pre-trained model from memorizing or overfitting after encountering the same data many times, and thus helps improve generalization. There is also diversity, including domain diversity, format diversity, and language diversity, covering as many kinds of data as possible.
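A minimal sketch of the deduplication idea, using exact content hashing; note that GPT-3's actual pipeline used fuzzy deduplication (MinHash-based document overlap), which this simplified version does not capture:

```python
import hashlib

def dedup(documents):
    """Keep the first occurrence of each (whitespace-normalized) document."""
    seen, unique_docs = set(), []
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["the same article", "the  same  article", "a different article"]
print(dedup(docs))   # -> ['the same article', 'a different article']
```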

GPT-3's training data went through processing and cleaning. First, the crawled data is filtered based on similarity to a series of high-quality reference corpora. Then fuzzy deduplication is performed on documents within and across datasets. Finally, known high-quality reference corpora are added to the training mix to enhance its diversity.

The distribution after cleaning is shown in Figure 4.

Figure 4 Dataset distribution for training GPT3

Given that the proportion of Chinese in GPT-3's corpus is not high, why is its Chinese performance better than that of some large models built in China? The underlying question is how GPT acquires its cross-lingual ability. From the official data, Chinese accounts for only 0.12% of documents and about 0.1% of words. Does that mean the knowledge was recorded well in English, and that during training the model automatically learned a translation-style alignment? Possibly, but this is only a guess.

The second is its ability to draw tables. When using GPT you can discover, through debugging, that the underlying output is written in Markdown. As for code ability, a plausible path is that official comments or docstrings on functions are treated as the question and the function body as the answer, so that a large dataset can be generated this way and then trained on.

Another core point is the construction of the RLHF data. The first aspect is strict quality control over the annotator pool; the second is the data sources. The data comes in three kinds: first, manually written prompts, including some random ones, while ensuring task diversity as much as possible; second, annotators write not only the prompts but also the corresponding answers; third, prompts designed around specific use cases. Labeling then follows three criteria: helpful, truthful, and harmless. Three datasets were labeled: the SFT dataset, only about 13K examples; the RM data, mainly real prompts, about 33K; and the PPO dataset, entirely real prompts covering 10 different types of generation tasks submitted by different users, about 31K in total.

2.2 Some Thoughts on ChatGPT’s Computing Power, Team and Technology

Perhaps everyone wonders how much ChatGPT costs. Guosheng Securities issued a report estimating ChatGPT's training cost at about 1.4 million US dollars; for some larger models it may run between 2 million and 12 million US dollars. Based on ChatGPT's average of 13 million unique visitors in January, the corresponding chip demand is more than 30,000 NVIDIA A100 GPUs, an initial investment of about 800 million US dollars, and daily electricity costs of about 50,000 US dollars. If the current ChatGPT were deployed into every Google search, 512,820 A100 HGX servers, totaling 4,102,568 A100 GPUs, would be required, and the server and networking cost alone would exceed 100 billion US dollars in capex. On the data side, the engineering pipeline of data collection, cleaning, and manual labeling also requires heavy investment.

The OpenAI team behind it is a global AI leader and also has products for other multimodal tasks. OpenAI was founded at the end of 2015 with the goal of opening its patents and research results to the public and "freely cooperating" with other institutions and researchers. In 2019, OpenAI transformed from a non-profit into a "capped-profit" company, with returns capped at 100 times any investment.

Why is ChatGPT so effective? First, the value of instruction fine-tuning: it does not inject new capabilities into the model, which means the quality of the base model directly determines ChatGPT's results; what it does is differentiate the model into different skill trees, such as in-context learning and dialogue, and the ability to respond to human commands is likewise a product of instruction fine-tuning. As for tracing the model's capabilities: language generation, basic world knowledge, and in-context learning all come from pre-training; the ability to follow instructions and generalize to new tasks comes from scaling up the number of instructions in instruction learning; and the model's complex reasoning ability most likely comes from training on code.

Why not a Chat-BERT? BERT's AE (auto-encoding) approach, paired with the masked-LM pre-training task, can see context on both sides, but it leaves a gap between training and inference. The AR (auto-regressive) architecture, by contrast, matches the process of human thinking and replying, conforms to "first principles," and can answer a great many questions.

InstructGPT brings real benefits: its outputs are more truthful, its harmlessness is improved, and it has very strong coding ability. Of course, there are also disadvantages. First, it reduces the model's performance on general NLP tasks. Second, it still gives some absurd outputs: because the supervised language-modeling task dominates the model's behavior and humans only play a corrective role, it is likely limited by the small amount of correction data, leading to untruthful generated content. Third, it is very sensitive to how instructions are phrased, mainly because the amount of data labeled by annotators is insufficient. Fourth, it over-explains simple concepts, probably because labelers tended to give higher rewards to longer outputs when comparing generations. Finally, harmful instructions may still elicit harmful replies.

There are three main directions for later improvement of ChatGPT. First, reducing the cost and raising the efficiency of manual annotation: finding ways for humans to provide feedback more effectively, and combining human and model performance organically and ingeniously, are both very important. Second, the model's ability to generalize over instructions and to correct erroneous instructions: improving both is key to improving the model experience. Third, avoiding performance degradation on general tasks, which may require designing a more reasonable way of using human feedback, or a more advanced model structure.

In fact, apart from the problems just mentioned, what I am most dissatisfied with now is timeliness and accuracy, which lead directly to nonsense. The new Bing solved this to a certain extent. How? First, it integrates Bing's search, aggregating all relevant answers and doing some summarization and ranking; second, search itself addresses timeliness, quickly retrieving newly published material.

2.3 Open-source projects for ChatGPT implementation, analysis, detection, and application

First, a low-cost implementation framework is ColossalAI, which proposes an open-source, low-cost process equivalent to ChatGPT and provides out-of-the-box training code; the address is https://github.com/hpcaitech/ColossalAI. It also includes an open-source implementation of RLHF, useful for getting a feel for how the data should be labeled and what it looks like. There is also PaLM-rlhf-pytorch, which implements RLHF (reinforcement learning from human feedback) on top of the PaLM architecture; it is basically a ChatGPT equivalent, the difference being the use of PaLM. Project address: https://github.com/lucidrains/PaLM-rlhf-pytorch.

There is also the ChatGPT-Comparison-Detection project. By collecting tens of thousands of comparative question-and-answer pairs from human experts and ChatGPT, it built the Human-ChatGPT comparison corpus, studied the characteristics of ChatGPT's answers and their differences and gaps from human ones, and released a ChatGPT detector. Project address: https://github.com/Hello-SimpleAI/chatgpt-comparison-detection. It yielded some striking findings. First, ChatGPT's answers generally focus strictly on the given question, while human answers are divergent and easily turn to other topics. Second, ChatGPT provides objective answers, while humans prefer subjective expression. Third, ChatGPT's responses are usually formal, while human responses are more colloquial; humans also love humor, sarcasm, metaphor, and examples, while ChatGPT never uses irony. Finally, ChatGPT expresses less emotion in its responses, while humans use many punctuation and grammatical devices in context to convey their feelings.

There are also application-oriented open-source projects that use ChatGPT directly for NLP tasks. The first is unlocking-the-power-of-llms, at https://github.com/howl-anderson/unlocking-the-power-of-llms. It mainly covers data augmentation for corpus expansion, and can also clean corpora, correct data errors, and explain each correction. There are also prompt-collection projects: for the same task, different prompts bring different results, so finding a good prompt for a specific task is very important. Project address: https://github.com/f/awesome-chatgpt-prompts. Some other application-side open-source projects are shown in Figure 5.

Figure 5 Other application-side related open source projects

3. The relationship between ChatGPT, NLP, and KG, and directions for combining them

3.1 ChatGPT and NLP: an overview of pre-trained models

The pre-trained model is a concept derived from transfer learning. It uses large models and large amounts of compute to exploit unlabeled or weakly labeled data; it has few-shot and zero-shot learning ability, and it can also support multi-modal interaction. The language architecture of NLP is shown in Figure 6.

Figure 6 Architecture of NLP Natural Language Processing

Pre-trained language models automatically learn a general language model from large-scale unlabeled data; they generalize strongly and can serve a variety of downstream tasks. Transfer learning uses the pre-training/fine-tuning framework to achieve knowledge acquisition followed by knowledge transfer, which comes in two types, feature-representation transfer and parameter transfer, and the approach also draws on self-supervised learning from the vision field.

The language model itself solves the problem of language representation. At the beginning, one-hot vectors were used; later, large models replaced the intermediate feature-encoding layer to produce distributed representations, on top of which downstream tasks are performed. The evolution is shown in Figure 7.

Figure 7 Evolution of language representation for natural language processing

The core of a language model is maximizing an overall probability, that is, estimating the probability of the next word in a text. But there is a big problem: the space of contexts is enormous, and many contexts are never observed, which raises the question of whether a model can learn the context instead. At the beginning, representation was symbolic; then came statistical language models; by factorizing the document-term matrix one obtains bag-of-words models. Later came embeddings, representing words directly as low-dimensional, continuous, dense vectors. Another is the NNLM model, which obtains word representations as a by-product of neural language modeling. ELMo then offered a solution to dynamic word vectors, training a forward and a backward language model separately and addressing polysemy.
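A small PyTorch sketch of the contrast between one-hot and distributed (embedding) representations; the three-word vocabulary and the dimensions are toy assumptions:

```python
import torch
import torch.nn.functional as F

vocab = {"cat": 0, "dog": 1, "runs": 2}

# One-hot: as wide as the vocabulary, sparse, and every pair of distinct
# words is equally distant, so there is no notion of similarity.
one_hot_cat = F.one_hot(torch.tensor(vocab["cat"]), num_classes=len(vocab))
print(one_hot_cat)                    # tensor([1, 0, 0])

# Embedding: low-dimensional, dense, and learned, so related words can end
# up close together in the vector space.
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)
print(embedding(torch.tensor(vocab["cat"])))   # e.g. tensor([0.31, -1.02, ...])
```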

At present, the pre-trained language model has become the new paradigm of NLP: first construct self-supervised pre-training tasks, then adapt the model to downstream tasks with labeled data. Prompt tuning is currently popular, and you can choose different prompts for different tasks.

This unifies NLP tasks to a certain extent, but template construction is a big problem: with different template structures, the generated results will differ. The core of the pre-trained language model remains the self-supervised learning task.
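A sketch of choosing different prompts for different tasks; the template wording is purely illustrative, and as noted above, small changes to a template can noticeably change the results:

```python
templates = {
    "sentiment":   "Review: {text}\nQuestion: is this review positive or negative?",
    "translation": "Translate the following sentence into French:\n{text}",
    "summary":     "Summarize the following article in one sentence:\n{text}",
}

def build_prompt(task: str, text: str) -> str:
    return templates[task].format(text=text)

print(build_prompt("sentiment", "The movie was a delight from start to finish."))
```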

3.2 ChatGPT and KG: a pre-trained language model that integrates knowledge graphs

Knowledge graph: a knowledge base built on binary relations, used to describe entities or concepts in the real world and the relations between them; its basic unit is the [entity-relation-entity] triple. Fundamentally, a knowledge graph is a knowledge representation method: by defining a domain ontology, it accurately represents the knowledge structure of a business domain (concepts, entity attributes, entity relations, event attributes, and relations between events), making it a canonical representation of knowledge in that field. Entity knowledge graphs and event knowledge graphs are compared as follows:

Figure 8 Comparison of entity knowledge and event knowledge graphs

What are the differences between a knowledge graph and ChatGPT, and where do they combine? The difference is that they are not substitutes but parallel: in essence, the knowledge graph is a formal representation of knowledge, while ChatGPT is a language model whose knowledge is parameterized. KG's advantage is explainability, and it may in fact later be used to explain why ChatGPT is effective. Points of combination include, first, the exchange of reasoning abilities; then the mapping of various graph tasks; and implementation via deep learning.

Another angle: language models are relatively good at computation over data but face challenges in reasoning, so knowledge graphs are currently used to solve reasoning problems. Usually, structured knowledge is difficult to build but easy to reason over, while unstructured knowledge is easy to build (just store it) but hard to use for reasoning. Language models, however, provide a new way to easily extract knowledge from unstructured text and reason over it efficiently without a predefined schema.

Another is the system-level ability to integrate external knowledge. Fixing ChatGPT's shortcomings at the root may take enormous effort; it is better to combine it with something like the Wolfram|Alpha knowledge engine, since the latter has powerful structured computation and also understands natural language.

Integrating knowledge graphs into ChatGPT can be achieved in many ways: give it enough correct knowledge, introduce knowledge-management and knowledge-injection techniques such as knowledge graphs, and constrain its data range and application scenarios so that the content it generates is more reliable.

First, there is the embedding representation of the knowledge graph: entities and relations in the graph can be represented as embedding vectors and incorporated into the model as additional features to improve performance. Second, context understanding based on the knowledge graph can help the model understand the conversation and provide more accurate information for answering questions. There is also automatic question generation based on knowledge graphs: by combining information from the graph, questions can be generated automatically to help users better understand the semantics and context of entities and relations. The main knowledge-infused pre-trained language models at present are shown in Figure 9.

Figure 9 Pre-trained language model with knowledge fusion
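As a sketch of the first idea above, the embedding representation of a knowledge graph: below is a TransE-style setup (one common choice among many KG-embedding methods), where a triple is plausible when head + relation lands near tail; the tiny graph and the dimensions are toy assumptions:

```python
import torch

entities = {"Beijing": 0, "China": 1, "Paris": 2, "France": 3}
relations = {"capital_of": 0}

ent_emb = torch.nn.Embedding(len(entities), 8)
rel_emb = torch.nn.Embedding(len(relations), 8)

def transe_distance(head: str, relation: str, tail: str) -> torch.Tensor:
    """TransE score ||h + r - t||: lower means more plausible after training."""
    h = ent_emb(torch.tensor(entities[head]))
    r = rel_emb(torch.tensor(relations[relation]))
    t = ent_emb(torch.tensor(entities[tail]))
    return torch.norm(h + r - t, p=2)

print(transe_distance("Beijing", "capital_of", "China"))
# After training, such entity/relation vectors can be fed to a language
# model as the additional knowledge features described above.
```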

4. The possibilities and application prospects of ChatGPT

ChatGPT can sustain multi-turn dialogue and supports many kinds of tasks. Putting it into application depends first on the company: OpenAI's business model consists of membership fees, open APIs, and strategic cooperation with Microsoft.

ChatGPT itself is a pre-trained model, so its commercialization can borrow from existing models for deploying large pre-trained models. Last year, an artificial-intelligence research institute released a report dividing the ultra-large-model industry into three parts: the upstream, mainly covering low-level training architecture; the midstream, covering technology R&D, management, and operations; and the downstream, focused on applying and refining large models.

Another landing mode is open source, monetized through registration or membership. The third is the PaaS model, integrating the model into specific software.

Once applications land, the distribution of what users actually want to do is shown in Figure 10.

Figure 10 Users' usage functions and frequency

Of course, some companies want to integrate it into documents, to reply to emails, draft quotations, and so on. It can also be combined with search engines, answering search queries through ChatGPT and innovatively improving search-engine efficiency.

There are also learning applications: searching published papers, finding the most relevant abstracts among a large number of research papers, and making grammar corrections. There are creative applications: searching keywords for the most popular topics on Google and generating content from the collected data to win higher readership, as well as automatic AI writing.

In entertainment, AI can automatically generate stories, offering brand-new event feedback and game experiences, and converse with users. In daily life, there are tax assistants that use different models to extract text information and classify transaction types.

The current products competing with ChatGPT are shown in Figure 11. Of course, this competition also drives ChatGPT's development.

Figure 11 Competing products of ChatGPT
