ChatGPT Research


The instructor wants us to investigate ChatGPT, so I took a look at the relevant principles to see whether there are research angles here. This post discusses ChatGPT from four aspects.

Background: A brief introduction to ChatGPT

ChatGPT is an artificial-intelligence chatbot developed by OpenAI and released in November 2022. It is based on OpenAI's GPT-3.5 large language model and fine-tuned with supervised learning and reinforcement learning techniques. It can answer general and technical questions and hold engaging conversations. Compared with previous models, its comprehension and text-generation capabilities are a qualitative leap that is obvious at a glance. Let us start with how ChatGPT is technically implemented.
The new techniques behind ChatGPT are concentrated in a handful of papers. Rather than the specific formulas, let us look at the ideas.

FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS


This paper introduces instruction tuning, one of the techniques behind ChatGPT. The FLAN model it proposes is a large transformer-decoder model. The essence of the method is to convert the inputs of NLP tasks into natural-language instructions and train the model on them, strengthening its ability to understand the intent of an input; by giving the model instructions and answer options, zero-shot performance improves.
The main motivation is that for traditional large language models with prompt-based learning (GPT-3 and the like), there is a large performance gap between zero-shot and few-shot. The paper argues that without fine-tuning on a few examples, the zero-shot input differs from anything seen in training, so performance on unseen tasks is poor. Put simply, the model's ability to understand the input intent is not strong enough.
To strengthen that ability, the paper proposes an intuitive idea: convert the task itself into natural-language input and improve zero-shot performance through training. As shown in the figure below, all kinds of tasks can be converted into instruction fine-tuning on top of the pre-trained model; after such training, the model understands task intent more accurately on unseen tasks.

[Figure: converting a variety of NLP tasks into natural-language instructions for instruction tuning]

The difference between this approach and the pretrain-finetune and prompt-tuning methods is shown in the figure below.

[Figure: comparison of fine-tuning, prompt tuning, and instruction tuning]

A simple summary of the figure above:

  • Fine-tuning: pre-train on a large corpus, then fine-tune on a downstream task; e.g., BERT, T5
  • Prompt tuning: pick a general large-scale pre-trained model, then craft a prompt template for a specific task to adapt it to the large model; e.g., GPT-3, which needs few-shot examples or prompt engineering
  • Instruction tuning: combines characteristics of the two above; still starting from the pre-trained language model, first fine-tune on many known tasks expressed in natural language (this is supervised learning), then run zero-shot on new tasks at inference time

The later experiments in the paper mainly concern how to choose the instruction template and what its concrete form should be. In short, the authors gather a large number of datasets and adapt them into a FLAN-compatible dataset. The amount of data is staggering; only large companies can afford this, so I won't go into details. The main template, i.e., the form of the training data, is as follows:

[Figure: FLAN instruction template, i.e., the form of the training data]

Note that this process fine-tunes a pre-trained large language model, specifically LaMDA-PT, which means the foundation is already very strong; adding the ability to understand the input language greatly improves accuracy in zero-shot scenarios, provided that the zero-shot scenario also poses its questions in natural language. Instruction tuning is about strengthening the ability to understand input language. The paper follows with a series of ablation experiments showing that the gains really come from instruction tuning rather than from multi-task fine-tuning, which is convincing.
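To make the data format concrete, here is a minimal sketch of FLAN-style instruction formatting; the template wording is my own illustration, not the paper's exact phrasing.

```python
# A minimal sketch of FLAN-style instruction formatting. The template text
# below is illustrative, not the exact wording used in the paper.

def to_instruction_example(premise: str, hypothesis: str, label: str) -> dict:
    """Convert one NLI example into an instruction-tuning (input, target) pair."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis?\n"
        "OPTIONS:\n- yes\n- no\n- it is not possible to tell"
    )
    return {"input": prompt, "target": label}

example = to_instruction_example(
    premise="A dog is running through the snow.",
    hypothesis="An animal is outside.",
    label="yes",
)
print(example["input"])
print("->", example["target"])
```

The key design point is that the task description and the label options both live inside the natural-language input, so an unseen task expressed the same way needs no new head or fine-tuning.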

A simple analogy: the original large model is taught specific subjects, say biology; then it can only take the biology exam and scores zero on the physics exam. Instruction tuning instead teaches it how to read books, after which it can handle anything better.

Note: the paper shows that instruction tuning only produces a qualitative change once the model reaches a certain scale; below that it actually hurts performance. One way to understand this is that the instruction text itself already saturates the capacity of a small model. The threshold is somewhere around 8B to 68B parameters, which is why this is unrealistic to reproduce in a university setting.

Fine-Tuning Language Models from Human Preferences


This is the other key technology behind ChatGPT: Reinforcement Learning from Human Feedback (RLHF), i.e., how to learn from users' explicit preferences. The paper proposes a method for fine-tuning large language models (LLMs) with human feedback so that they generate text more in line with human preferences. It should be the first work to fine-tune large models with reinforcement learning instead of supervised learning. The process is as follows:

[Figure: RLHF fine-tuning pipeline]

1. First, start with a pre-trained language model (such as GPT-2) and fine-tune it on a large-scale text dataset (such as WebText) to obtain a base model.
2. To adapt to a specific task (such as story generation), fine-tune on a small task-related dataset (such as WritingPrompts) to obtain a task model.
3. Design some human-feedback questions (e.g., which text is more interesting, more reasonable, more fluent) and use an online platform (such as Scale's API) to have human annotators compare and score different texts generated by the task model.
4. Use this human feedback as the reward signal and further fine-tune the task model with reinforcement learning (RL) to obtain a preference model (a minimal sketch of this reward-modeling step follows the list).
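The reward-modeling core of steps 3-4 can be sketched in a few lines. This is a toy, assuming a tiny bag-of-embeddings encoder in place of the real language model backbone; the pairwise log-sigmoid loss is the standard form used in this line of work.

```python
import torch
import torch.nn as nn

# Minimal sketch of reward modeling from pairwise human preferences: given a
# preferred and a rejected text, train a scalar reward model so that the
# preferred text scores higher. The tiny encoder is a stand-in for a real LM.

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 32):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pool token embeddings
        self.head = nn.Linear(dim, 1)                  # scalar reward

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_ids)).squeeze(-1)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch: token ids for the human-preferred and the rejected text.
preferred = torch.randint(0, 1000, (4, 16))
rejected = torch.randint(0, 1000, (4, 16))

r_pref, r_rej = model(preferred), model(rejected)
# -log sigmoid(r_preferred - r_rejected): minimized when the reward model
# ranks the human-preferred text above the rejected one.
loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(f"pairwise reward loss: {loss.item():.4f}")
```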

The method also has an online mode: while fine-tuning the preference model, human feedback is collected in real time and used to update the reward signal. This mode has the following characteristics:
1. It adapts to changes in human preferences faster, because each iteration uses the latest feedback data.
2. It makes better use of the annotators' time, because it always shows them the most informative text comparisons.
3. It makes it easier to explore different text styles and content, because new text is generated from the current reward signal each time.
In short, this paper proposes a method for fine-tuning large language models with human feedback: reinforcement learning fits the model to human preferences so that its output better matches human intuition, and the model can be updated continuously.

Learning to Summarize with Human Feedback


The overall structure of this paper is a bit like the systems engineering behind ChatGPT, but tested first on a narrower task. The goal is to train a language model that generates high-quality summaries without depending on task-specific data and metrics. The approach is to use human feedback to guide reinforcement learning in optimizing a pre-trained language model, as follows:

[Figure: summarization-from-human-feedback training pipeline]

It is divided into three steps:
1. Collect a human-preference dataset by having annotators choose the better of two summaries.
2. Using this dataset, train a reward model (RM) via supervised learning to score each summary.
3. Using the reward model, fine-tune a pre-trained language model via reinforcement learning (RL) to generate high-scoring summaries (see the sketch after this list).
This process can run in a loop, using newly generated summaries to collect more human feedback data.
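To make step 3 concrete, below is a minimal sketch of the reward that is optimized in the RL step: the reward model's score minus a KL penalty that keeps the fine-tuned policy close to the supervised baseline, which is how the paper prevents reward hacking. All tensors here are random stand-ins for real model outputs, and the coefficient is illustrative.

```python
import torch

# Sketch of the RL-step reward: reward-model score minus a KL penalty to the
# frozen supervised (SFT) policy. Random tensors stand in for model outputs.

beta = 0.02                        # illustrative KL penalty coefficient
logp_policy = torch.randn(4, 20)   # log-probs of sampled tokens under the RL policy
logp_ref = torch.randn(4, 20)      # log-probs of the same tokens under the frozen SFT model
rm_score = torch.randn(4)          # reward-model score for each full summary

# Per-sequence KL estimate on the sample: sum over tokens of log(pi / pi_ref).
kl = (logp_policy - logp_ref).sum(dim=-1)
reward = rm_score - beta * kl      # high RM score, but stay close to the SFT model
print(reward)
```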

Training language models to follow instructions with human feedback


This is InstructGPT, the predecessor of ChatGPT. The paper's purpose is to improve the alignment of large language models, such as GPT-3, with user intentions, i.e., to make the model generate satisfactory output according to the instructions the user gives. The paper is actually similar to the previous one. The main training process is as follows.


The approach is to fine-tune the language model with human feedback: human annotators provide instructions and example outputs, which are used as supervision signals to update the model parameters.
The innovation discussed here is HIR (Hindsight Instruction Relabeling), an algorithm that converts human feedback into instructions and uses them to relabel the original instructions, improving the model's ability to understand and follow instructions.
The HIR process is as follows. First, an instruction is randomly selected from an instruction set and an output is generated with the preference model. Then an online platform has human annotators score this output, i.e., give a feedback signal. That signal determines whether the original instruction needs to be relabeled, i.e., replaced with a more suitable instruction. Finally, the new instruction-output pair is used as supervised data to update the preference model's parameters. HIR is quite powerful; it can be seen as an upgraded version of RLHF. The experimental results show that the language model fine-tuned with human feedback improves significantly on multiple tasks, such as story generation, question answering, and summarization, and that HIR can surpass both the baseline algorithm and the supervised fine-tuning algorithm. Compared with the previous paper, the application domain is wider.

ChatGPT

Having said all that: although it is not open source, ChatGPT evidently combines the preceding technologies. The diagram on its official website is nearly identical to InstructGPT's; change the colors and you could pass one off as the other. In the pre-training stage it uses instruction fine-tuning to strengthen its understanding of input, then RLHF to make its output fluent and aligned with human requirements, and finally PPO reinforcement learning to improve the model's overall ability. The figure on the official website (for the GPT-3.5-based model) deserves a 9 out of 10: it really is concise and intuitively easy to understand. The first fine-tuning step is replaced with instruction fine-tuning, and the evolved GPT-3 plus the evolved RM are used for reinforcement learning, finally forming ChatGPT.

[Figure: ChatGPT training diagram from the official website]
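Putting the pieces together, the diagram boils down to three stages. A high-level sketch in pseudocode follows; every function is a placeholder for an entire training job, not OpenAI's actual code or any real API.

```python
# Pseudocode sketch of the three-stage recipe the diagram describes. Each
# function body is elided; these are placeholders, not real training APIs.

def supervised_finetune(base_model, instruction_data): ...   # stage 1: instruction SFT
def train_reward_model(sft_model, preference_comparisons): ...  # stage 2: RM from rankings
def ppo_finetune(sft_model, reward_model, prompts): ...      # stage 3: PPO against the RM

sft = supervised_finetune("gpt-3.5-base", instruction_data="human demonstrations")
rm = train_reward_model(sft, preference_comparisons="ranked model outputs")
chat_model = ppo_finetune(sft, rm, prompts="user prompts")
```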

ChatGPT ability research

What can ChatGPT actually do? Since its release, many people have explored how it performs on NLP tasks and how it compares with earlier BERT and GPT models; this is a question worth exploring. Several papers have studied it in detail:

  • Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT
  • IS CHATGPT A GENERAL-PURPOSE NATURAL LANGUAGE PROCESSING TASK SOLVER?
  • ChatGPT: Jack of all trades, master of none

Natural language processing (NLP) mainly comprises natural language understanding (NLU) and natural language generation (NLG) tasks. On the surface, ChatGPT is naturally strongest at NLG, which needs no elaboration; the main comparison with previous large models is in natural language understanding.

GLUE

To benchmark NLU tasks, institutions including New York University and the University of Washington created a multi-task natural language understanding benchmark and analysis platform: GLUE (General Language Understanding Evaluation). GLUE has nine tasks: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI. As shown in the figure below, they fall into three categories: single-sentence tasks, similarity and paraphrase tasks, and inference tasks.

[Figure: the nine GLUE tasks and their three categories]

Through these tasks, ChatGPT's natural-language-understanding ability can be evaluated objectively. ChatGPT obviously works from instructions, so a set of templates must be designed for each task before the experiment. The specific conversion is as follows:

[Figure: converting GLUE tasks into instruction templates]
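For illustration, here is what such zero-shot templates might look like; the exact wording the paper's authors used may differ.

```python
# Hypothetical zero-shot templates in the spirit of the paper's conversion;
# the wording is my own assumption, not the authors' exact templates.

GLUE_TEMPLATES = {
    "SST-2": 'Is the sentiment of the sentence "{sentence}" positive or negative? '
             'Answer exactly "positive" or "negative".',
    "MRPC": 'Sentence 1: "{sentence1}"\nSentence 2: "{sentence2}"\n'
            'Are the two sentences semantically equivalent? Answer "yes" or "no".',
    "RTE": 'Premise: "{premise}"\nHypothesis: "{hypothesis}"\n'
           'Does the premise entail the hypothesis? Answer "yes" or "no".',
}

prompt = GLUE_TEMPLATES["SST-2"].format(sentence="A warm, funny and well-acted film.")
print(prompt)
```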

For the detailed experimental procedure, see the papers; here we only summarize their results and conjectures. Comparing ChatGPT with BERT-base on specific tasks, the papers find:

  • ChatGPT performs poorly on paraphrase and similarity tasks, namely MRPC and STS-B, with performance drops of up to 24 percentage points.
  • ChatGPT surpasses all BERT-style models on natural language inference tasks (MNLI and RTE), showing an advantage in inference.
  • ChatGPT is comparable to BERT-base on single-sentence classification tasks, namely sentiment analysis (SST-2) and linguistic acceptability (CoLA), and on the QA-related task QNLI.

The specific results are shown in the table below

[Table: ChatGPT vs. BERT-base results on GLUE]

Let us analyze each category of problem, starting with the worst-performing similarity tasks.

[Figure: MRPC failure examples]

On MRPC, ChatGPT is not sensitive to semantic differences between a pair of sentences, which may be due to the lack of human feedback on this distinction during model training.

[Figure: comparison of BERT-base and ChatGPT on STS-B; the x-axis is the gold similarity distribution of STS-B and the y-axis the absolute difference between prediction and ground truth]

The error is larger at low similarity, and largest for near-borderline pairs (the middle of the score range), consistent with the previous task. One reason is that ChatGPT has not been fine-tuned on STS-B and cannot determine the correct decision boundary.
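The bucketed-error analysis behind this figure is easy to reproduce in a few lines; the gold scores and predictions below are made-up stand-ins, not the paper's data.

```python
import statistics

# Sketch of the figure's analysis: bucket STS-B examples by gold similarity
# and compute mean absolute error per bucket. Values here are fabricated.

gold = [0.4, 1.2, 2.6, 2.9, 3.1, 4.5, 4.8]   # gold similarity scores (0-5)
pred = [2.0, 2.2, 1.0, 4.4, 1.8, 4.1, 4.9]   # model predictions

buckets = {"low (0-2)": [], "middle (2-4)": [], "high (4-5)": []}
for g, p in zip(gold, pred):
    key = "low (0-2)" if g < 2 else "middle (2-4)" if g < 4 else "high (4-5)"
    buckets[key].append(abs(g - p))

for name, errs in buckets.items():
    print(name, round(statistics.mean(errs), 2))
```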

Reasoning

Now for the better-performing reasoning part. The experiments evaluate per-class accuracy on the reasoning tasks; * means surpassing all BERT models.

[Figure: per-class accuracy on inference tasks; * = surpasses all BERT models]

Although ChatGPT's overall reasoning ability is strong, accuracy clearly differs between positive and negative samples. The papers also explore reasoning types in detail, dividing them into arithmetic, commonsense, symbolic, and logical reasoning.

The experiments distinguish CoT from no-CoT settings, i.e., whether prompts help the large model think step by step; CoT often reveals the model's latent potential. Instruction-style prompting is also used for comparison.

Chain-of-Thought Prompting
Chain-of-thought (CoT) prompting asks the LLM to generate intermediate reasoning steps before answering.
Recent research has focused on how to improve manual CoT, including optimizing prompt selection and the quality of reasoning chains [Khot et al., 2022; Chen et al., 2022]. Researchers have also investigated the feasibility of CoT in multilingual scenarios and in smaller language models, and multimodal CoT, which incorporates visual features into CoT reasoning, has recently been proposed.
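As a concrete illustration, a direct prompt and a hand-written CoT prompt for the same question might look like this; the demonstration text is my own, not taken from the papers.

```python
# Direct prompting vs. a one-shot chain-of-thought prompt for the same
# arithmetic word problem. The CoT demonstration is an illustrative example.

question = ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
            "How many balls does he have now?")

direct_prompt = f"Q: {question}\nA:"

cot_prompt = (
    "Q: A juggler has 16 balls. Half of them are golf balls. "
    "How many golf balls are there?\n"
    "A: There are 16 balls in total. Half of 16 is 8. The answer is 8.\n"
    f"Q: {question}\n"
    "A: Let's think step by step."
)
print(cot_prompt)
```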

On the arithmetic reasoning datasets, the test results are as follows:

[Figure: arithmetic reasoning test results]

ChatGPT's arithmetic is stronger than ordinary large models', and its ability to understand the problem is stronger: word problems that mix text and numbers are understood very well, but accuracy on pure arithmetic is still not high, suggesting it has not internalized the rules of mathematical formulas. A specific arithmetic example follows:

[Figure: example arithmetic problem and ChatGPT's answer]

The numerical calculations that plagued large models seem to have been addressed; on Bing they are now computed correctly. Are they calling external rules?


Performance on the other three types of reasoning is as follows:

[Figure: accuracy on commonsense, symbolic, and logical reasoning datasets]

The figure above shows ChatGPT's accuracy compared to popular LLMs on seven commonsense, symbolic, and logical reasoning datasets. Two points follow from the results:

  • Using CoT may not always help in commonsense reasoning tasks. CoT methods often produce flexible, plausible-sounding rationales yet still predict incorrectly, suggesting that commonsense reasoning tasks may require more fine-grained background knowledge.
  • Unlike in arithmetic reasoning, ChatGPT performs worse than GPT-3.5 in many cases, suggesting GPT-3.5 is correspondingly more capable here. Note that ChatGPT performs much worse than GPT-3.5 on COPA, which requires commonsense reasoning. The figure below shows several failure cases: ChatGPT readily generates non-committal responses, leading to poor performance.

[Figure: ChatGPT failure cases on COPA]

Let's look at some other common tasks

Natural language inference tasks

[Figure/Table: zero-shot NLI results; per-class accuracies reported in Table 5]

ChatGPT achieves better zero-shot performance than GPT-3.5, FLAN, T0, and PaLM, demonstrating superior zero-shot ability to infer sentence relationships. To examine why ChatGPT outperforms GPT-3.5 by such a large margin, the per-class accuracies of both models are reported in Table 5. When the premise does entail the hypothesis, ChatGPT outperforms GPT-3.5; however, it does worse than GPT-3.5 on the "Not Entailment" class (-14%). It appears ChatGPT is better suited to processing factual inputs (which humans also generally favor), which may stem from the human-feedback preferences in its RLHF training.

A specific example is shown in the figure below, and the lack of common sense is strange: if Jane were really hungry, she would not give Joan the candy but eat it herself. A similar phenomenon appears in the second case, where the logic of ChatGPT's answer is confused. In general, ChatGPT can generate fluent responses that follow a certain pattern, but seems limited in actually reasoning about sentences. One piece of evidence is that ChatGPT cannot even answer some questions that are easy for humans.

[Figure: NLI failure examples]

Named entity recognition

[Figure/Table: NER results and per-class comparison across entity types]

ChatGPT is not as good as fine-tuned models here; sequence labeling remains problematic. Specifically, ChatGPT outperforms GPT-3.5 on the "Loc" (Location) and "Per" (Person) classes but performs worse than GPT-3.5 on the "Org" (Organization) class. Neither model shows practical value in identifying the "Misc" (miscellaneous entities) class. The figure above illustrates several failure cases for Misc. In the left part, the LLM identifies "Bowling" as a miscellaneous entity while the ground truth is "None"; yet "Bowling" does belong to the entity type "ball", which could be seen as miscellaneous. On the right, "AMERICAN FOOTBALL CONFERENCE" is indeed an organization, but the ground-truth annotation does not recognize it, suggesting the ground-truth annotations may need cleaning (though such cases are rare). So the poor performance on miscellaneous entities is partly because LLMs understand the scope of entities differently from the ground-truth annotations of task-specific datasets. Here is an example:

[Figure: NER failure example on miscellaneous entities]
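A hypothetical sketch of what zero-shot NER prompting and its fragile parsing can look like; the output format and regex below are my assumptions, not the paper's protocol.

```python
import re

# Sketch of zero-shot NER via prompting plus brittle output parsing.
# The answer format and regex are illustrative assumptions.

def ner_prompt(sentence: str) -> str:
    return (
        f'Extract all person, location, organization and miscellaneous '
        f'entities from: "{sentence}". Answer as lines of "entity -> type".'
    )

def parse_entities(reply: str) -> list:
    """Pull (entity, type) pairs out of the model's free-form reply."""
    return re.findall(r"(.+?)\s*->\s*(\w+)", reply)

# A made-up reply illustrating the mismatch discussed above.
reply = "AMERICAN FOOTBALL CONFERENCE -> organization\nBowling -> miscellaneous"
print(parse_entities(reply))
```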

Sentiment analysis

Negative predictions are accurate but the classes are unbalanced; overall, GPT-3.5 is more balanced and more accurate. ChatGPT's performance across classes is quite unbalanced: it performs almost perfectly on negative samples but much worse on positively labeled data, dragging down overall performance. In contrast, GPT-3.5's results are more balanced, showing it handles sentiment analysis more effectively than ChatGPT. The difference presumably comes from the two models' different training data.

[Table: per-class sentiment analysis results]
The papers note that ChatGPT and GPT-3.5 still output answers such as "neutral" and "mixed" even though the task instruction explicitly requires exactly "positive" or "negative", which partially explains why their performance is so much worse than FLAN's.
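In practice such an evaluation needs a post-processing step that maps free-form replies onto the allowed label set; here is a minimal sketch, with mapping rules that are my own assumption.

```python
# Sketch of the post-processing a constrained-label evaluation needs: map the
# model's free-form reply onto {"positive", "negative"}, marking everything
# else ("neutral", "mixed", hedged sentences) as invalid.

def normalize_sentiment(reply: str) -> str:
    text = reply.strip().lower()
    if "positive" in text and "negative" not in text:
        return "positive"
    if "negative" in text and "positive" not in text:
        return "negative"
    return "invalid"  # off-label answers are what drags the scores down

for reply in ["Positive.", "The sentiment is negative", "Neutral, leaning mixed"]:
    print(reply, "->", normalize_sentiment(reply))
```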

The figure below is the summary table from the second evaluation paper.

[Table: summary of results across tasks from the second paper]
For generative tasks such as dialogue and summarization, ChatGPT crushes other zero-shot models but still falls short of fine-tuned models.

Summary

Overall, the third evaluation paper's phrase "Jack of all trades, master of none" sums it up: broad but not best at anything. In the zero-shot setting, ChatGPT all but demolishes other LLMs, yet it cannot match the latest LLMs fine-tuned for specific domains. It is far ahead of other models in language fluency, question understanding, multi-turn dialogue, QA, and so on, but it stumbles on specific natural-language-understanding tasks. Its reasoning ability is generally strong but it still lacks common sense, and on symbolic and logical reasoning it performs worse than ordinary LLMs. Named entity recognition is unbalanced across positive and negative samples, sentiment analysis skews heavily negative, classification problems go well while regression problems go poorly. This can be summarized as follows:

  • It is easily misled by false information
  • The more complex the task, the lower the accuracy
  • It lacks common sense
  • It cannot distinguish highly similar content
  • Tasks that require understanding semantics in order to perform a specified job (sentiment analysis, named entity recognition) have lower accuracy than plain semantic-understanding tasks, and are strongly affected by RLHF
  • Output is unstable, and instructions are still sometimes misunderstood
  • Every field has reached a usable level, but none matches the latest fine-tuned LLMs

The impact of ChatGPT on knowledge graphs

Referring to Liu Huanyong's blog and Wang Haofen's interview: given the emergence shown by ChatGPT and LLMs, people will reconsider whether we need the fully structured (symbolic) representation of traditional knowledge graphs. Many traditional KG tasks, such as knowledge extraction, knowledge fusion, knowledge reasoning and computation, as well as the question answering, search, and recommendation built on top of them, will in fact be affected.

Affected areas

  • In the short term, building encyclopedic general-knowledge graphs matters little, because the search paradigm has changed fundamentally, and knowledge graphs in this role lag behind new-generation search experiences such as ChatGPT and the new Bing.
  • Search, question answering, recommendation: using knowledge graphs in these areas likewise has limited significance in the short term.
  • Knowledge extraction and construction should fully embrace LLMs.

Temporarily unaffected areas

  • Graph analysis and graph-structure visualization, because they are discrete
  • Inductive prediction over temporal knowledge graphs: this rule-based prediction is for now more accurate than ChatGPT, though it may need ChatGPT's help in collecting and updating information
  • Multimodal

Points where the two can be combined

  1. Combining ChatGPT's reasoning (commonsense and domain reasoning) with business-system interaction, hyper-automation, and access to and updating of time-sensitive content; there is much to do here.
  2. Mapping various KG tasks to text generation, plus prompt engineering.
  3. Retrieval-augmented deep learning in which the retrieval store contains a large KG, used to select examples, constrain prompts, and improve reasoning. That is, the KG constrains ChatGPT's reasoning and provides retrieval enhancement, reasoning assistance, and decision support, alleviating weaknesses such as the lack of common sense (a toy sketch follows this list).
  4. KG keeping its focus on what symbols suit best, including numerical calculation and rule-based reasoning, because these are areas where LLMs are actually weak, or learn too inefficiently.
  5. KG as a meta-ontology that arranges and integrates various AIs, especially MaaS, into a more complete chain.
  6. Mining schemas with LLMs.
  7. The parts connected to DB or KRR: here one can think about how to better manage LLMs and collaborate effectively while managing graph data, and KRR should consider broader reasoning and new knowledge representations.
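As a toy illustration of point 3 above, facts retrieved from a KG can be prepended to the prompt to constrain the answer; the triple store and lookup below are stand-ins for a real graph database.

```python
# Sketch of KG-grounded prompting: retrieve triples about an entity and
# prepend them so the model answers from the facts. Toy data, not a real KG.

KG = [
    ("AlphaGo", "developed_by", "DeepMind"),
    ("DeepMind", "acquired_by", "Google"),
]

def retrieve(entity: str) -> list:
    """Return human-readable facts mentioning the entity."""
    return [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in KG if entity in (h, t)]

def grounded_prompt(question: str, entity: str) -> str:
    facts = "\n".join(retrieve(entity))
    return f"Known facts:\n{facts}\n\nUsing only the facts above, answer: {question}"

print(grounded_prompt("Which company developed AlphaGo?", "AlphaGo"))
```

This is exactly the kind of constraint that would prevent the AlphaGo-misattribution failure discussed later in this post.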

Adjustments needed

Scientific research must adjust: move past the one-task, one-model mindset and think about how to innovate standing on the shoulders of giants. In particular, evaluate and compare LLMs, identify their strengths and weaknesses, and avoid doing pseudo-research on problems their strengths have already solved; this mirrors the adjustment everyone had to make after BERT swept the field. In the short term, work can continue on the original KG; in the medium term it should be ChatGPT-enhanced or ChatGPT-based KG; in the long term, a new KG research and development route. Competition and cooperation always coexist, which is a good thing. On the Gartner curve, KG has already passed its peak and is descending; the field should evolve from KG research toward KG toolsets and even a KG ecosystem, becoming, like the Internet, more usable and easier to use. That accessibility is precisely why ChatGPT broke into the mainstream.

Summary

ChatGPT, built on a large-scale pre-trained language model, has successfully broken into the mainstream; in recent days it has driven several limit-up trading days in the AI stock sector, which is enough to show that this wave has arrived. For the once-popular knowledge graph, finding its differences from ChatGPT and its own position is now especially important.

Formal knowledge and parametric knowledge have always been a trade-off in representation, and the two technologies should each have their own positioning and value.

Knowledge graph construction is typically a careful process, often including a series of knowledge-conflict detection and resolution steps, and the whole process is traceable. Taking such knowledge as input can go a long way toward fixing ChatGPT's factual fallacies, and it is interpretable. Reasoning over knowledge graphs can also strengthen current models' reasoning. In turn, ChatGPT can improve knowledge acquisition, so the two technologies can iterate and improve together.

Fundamentally, a knowledge graph is a knowledge-representation method: it defines a domain ontology to accurately represent the knowledge structure of a business domain (concepts, entity attributes, entity relations, event attributes, and relations between events), making it a canonical representation of knowledge in that domain. Structured data is then extracted from various sources via entity recognition, relation extraction, event extraction, and so on; the knowledge is filled in and finally stored in property-graph or RDF format.

From a problem-oriented perspective, the large model gets semantic understanding right but does not truly understand the meaning behind it, and its factual correctness needs improvement. The factual correctness of a human-built knowledge graph is controllable, but it is costly and less convenient to use.

Of course, ChatGPT has obvious shortcomings. Reference 1 holds the widely shared view that it excels at talking nonsense with a straight face: ChatGPT is a black-box computation, and the credibility and controllability of its content have real limits. "We need to give it enough correct knowledge, then introduce knowledge-management and information-injection technologies such as knowledge graphs, and also limit its data range and application scenarios, to make the content it generates more reliable."

As for ChatGPT itself, its flaws are real.

First, without internet access it lacks up-to-date information, and its answers often contain factual fallacies: for example, it claims AlphaGo is OpenAI's technology, misattributes historical figures and works, and will happily give logical-sounding explanations of made-up technical terms. (Bing has addressed this.)

Second, its reasoning and computation are insufficient: it struggles to give reliable predictions and inferences or to establish latent associations, and it often gives extremely confident wrong answers to even slightly complicated mathematical problems.

In addition, its interpretability is weak and it cannot give the sources of its knowledge and information. It also lacks embodiment, so it cannot really touch the physical human world and can only interact with humans through a "language interface". And it lacks a privacy-protection mechanism.

Meanwhile, if ChatGPT generates large amounts of content that is then imported into knowledge graphs as a data source, it will affect the graphs' accuracy; this undoubtedly deserves attention.

Conclusion

ChatGPT has changed everyone's life; one could call it part of the fourth industrial revolution. Although there is still a gap between it and SOTA models on specific professional tasks, and even though many experts say there is no innovation here, only the productization of existing technology, in the eyes of every ordinary person it is already amazing enough. That is how technology works: only when it finally lands can it set off a wave of change; otherwise it remains talk on paper. ChatGPT is the crest of our era's wave and a bright starting point for the next decade of AI. Ride the momentum.


Origin: blog.csdn.net/qq_44799683/article/details/129409386