Google Research Scientist: The Evolution and Limitations of ChatGPT's Secret Weapon


Source | TalkRL

OneFlow compilation and translation | Xu Jiayu, Jia Chuan


ChatGPT is built on the same GPT pre-training foundation as its predecessors, so why does it so far outperform earlier models such as GPT-3? The answer has been revealed: ChatGPT's secret weapon is RLHF, reinforcement learning from human feedback.

In the pre-training stage, the GPT model learns about the world from text; in the RLHF stage, the focus shifts to making the model's outputs correct, helpful, and appropriate, and to continuously fine-tuning the model toward those goals.

Specifically, RLHF tuning proceeds in three steps. Step one: fine-tune the LLM with supervised learning on "ideal" human answers to different prompts. Step two: have the LLM produce multiple answers per prompt and let human evaluators rank them; the rankings are used to train a reward model. Step three: optimize the LLM against that reward model using proximal policy optimization (PPO), as sketched below.
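
The following is a minimal, heavily simplified sketch of that three-step pipeline. All function and model names are hypothetical placeholders chosen for illustration, not OpenAI's actual implementation, and the "models" are toy stand-ins so the script runs end to end.

```python
import random

# Step 1: supervised fine-tuning on "ideal" human demonstrations.
def supervised_finetune(base_model, demos):
    # In practice: minimize cross-entropy of the human-written answer given the prompt.
    return {"name": base_model, "demos_seen": len(demos)}

def sample(policy, prompt):
    # Toy stand-in for sampling an answer from the policy.
    return prompt + " -> " + " ".join(random.choice(["ok", "sure", "indeed"]) for _ in range(3))

# Step 2: fit a reward model from human rankings of sampled answers.
def train_reward_model(policy, prompts, rank_fn):
    comparisons = []
    for p in prompts:
        a, b = sample(policy, p), sample(policy, p)   # two candidate answers
        better, worse = rank_fn(p, a, b)              # a human picks the better one
        comparisons.append((p, better, worse))
    # In practice: fit r(prompt, answer) so that r(better) > r(worse) on these pairs.
    return lambda prompt, answer: float(len(answer))  # toy stand-in reward model

# Step 3: optimize the policy against the reward model (PPO in the real pipeline).
def rl_finetune(policy, reward_model, prompts):
    for p in prompts:
        candidates = [sample(policy, p) for _ in range(4)]
        best = max(candidates, key=lambda a: reward_model(p, a))
        # In practice: a PPO update pushes the policy toward higher-reward samples
        # like `best`, while a KL penalty keeps it close to the supervised model.
    return policy

if __name__ == "__main__":
    policy = supervised_finetune("gpt-base", demos=["Q: What is RLHF? A: ..."])
    rm = train_reward_model(policy, ["Explain RLHF."], rank_fn=lambda p, a, b: (a, b))
    policy = rl_finetune(policy, rm, ["Explain RLHF."])
    print("pipeline finished:", policy)
```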

Earlier, John Schulman, who leads the ChatGPT effort at OpenAI, described the origin of the RLHF idea. The key was applying reinforcement learning to language models and using human feedback to define the reward function. Many of the techniques in OpenAI's RLHF also build on earlier research, including the work of Natasha Jaques.

Natasha Jaques is a Senior Research Scientist at Google Brain, and her reinforcement-learning papers on RLHF and dialogue models are cited in several of OpenAI's papers. In a recent episode of the TalkRL podcast hosted by Robin Ranjit Singh Chauhan, she offered a third-party perspective on RLHF and its reward model, along with her views on reinforcement learning research and AGI development.

Her current research focuses on Social Reinforcement Learning: developing algorithms that combine insights from social learning and multi-agent training to improve AI agents' learning, generalization, collaboration, and human-AI interaction. In January 2024, she will join the UW School of Computer Science as an assistant professor.

(The following content has been compiled and is published by OneFlow with authorization. For translation or reprint permission, please contact OneFlow. Source: https://www.talkrl.com/episodes/natasha-jaques-2)

1

RLHF-Related Research and Cost-Effectiveness

Robin Chauhan: You started working very early on reinforcement learning from human feedback (RLHF) and related research on dialogue models, and your work has been cited in many important papers published by OpenAI. Can you talk about how your research connects to OpenAI's current research and these models?

Natasha Jaques: Back in 2016, I was thinking about how to fine-tune pre-trained language models. Specifically, I was focusing on LSTM models and trying to fine-tune them with reinforcement learning. At the time my focus was not on language per se, but on tasks like music generation and molecule generation, for example generating drug-like molecules.

Molecule generation is a good example, in my opinion. We can train a supervised model on a dataset of known molecules and generate new ones, but those molecules may lack the properties we need, such as being easy to synthesize. So we also need to evaluate a molecule's "synthetic accessibility". Training on the dataset alone is not enough, because it does not yield optimized molecules; yet if we optimize only for synthetic accessibility, we may generate useless molecules.

Therefore, we need to evaluate and optimize both aspects. We can use reinforcement learning to optimize drug-likeness or synthetic accessibility, but on its own this approach is not perfect because of flaws in the data.

We proposed a solution: first pre-train on the dataset, then use reinforcement learning to optimize some reward while minimizing the KL divergence between the current policy and the pre-trained policy. This approach flexibly combines supervised learning and reinforcement learning: supervised learning extracts the useful information in the dataset, while reinforcement learning optimizes for high-reward sequences within the data distribution. You can see how closely this is related to the RLHF methods in use today.
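
For reference, the KL-regularized objective she describes can be written as follows. The notation here is illustrative rather than taken from any specific paper: the fine-tuned policy, the pre-trained prior, the reward, and a coefficient that controls how strongly the policy is kept close to the prior.

```latex
% KL-regularized fine-tuning objective (a sketch; notation is illustrative)
\[
  J(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\bigl[ r(x) \bigr]
  \;-\; \beta \, D_{\mathrm{KL}}\!\bigl( \pi_\theta(x) \,\|\, \pi_0(x) \bigr)
\]
% Equivalently, each sampled sequence x can be scored with the shaped reward
% r(x) - \beta \sum_t \bigl[ \log \pi_\theta(a_t \mid s_t) - \log \pi_0(a_t \mid s_t) \bigr],
% which penalizes drifting away from the pre-trained model token by token.
```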

In this technique, we first pre-train a large language model on a dataset, and then optimize the model with human feedback while minimizing the KL divergence between the optimized model and the pre-trained prior model, which is of real significance for the RLHF framework.

At the same time, I was also working on methods that learn from human feedback. Around 2019, we used the same KL-control approach: rather than having humans rate the quality of the dialogue, we let the dialogue model optimize signals obtained from talking with humans, and we handled preference ranking with a method different from OpenAI's RLHF algorithm.

Our goal was to learn from implicit signals in conversations with humans, rather than relying solely on explicit human evaluations. We do not ask people to provide additional feedback; instead, we derive reward signals for the model by analyzing implicit signals such as the sentiment of the text.

For example, when the person in a conversation sounds happy, we treat that as a positive reward signal for training the model. Conversely, when they sound frustrated or confused, the model has probably said something nonsensical, and we treat that as a negative reward signal. We then use the same technique to optimize these signals and improve the model's performance.
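
A toy sketch of this idea of implicit reward is shown below. The word lists and scoring rule are purely illustrative stand-ins for a real sentiment model; the point is only that the user's next utterance is scored, and that score becomes the reward for the bot's previous turn.

```python
# Hypothetical sketch: derive a reward for the bot's last turn from the
# sentiment of the user's reply. A real system would use a trained sentiment
# model; these word lists are toy stand-ins.

POSITIVE = {"great", "thanks", "haha", "love", "cool"}
NEGATIVE = {"what", "wrong", "confused", "nonsense", "ugh"}

def implicit_reward(user_reply: str) -> float:
    words = [w.strip("!?.,'\"").lower() for w in user_reply.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return float(pos - neg)  # > 0: user sounds happy; < 0: frustrated or confused

if __name__ == "__main__":
    print(implicit_reward("Haha thanks, that's great!"))            #  3.0
    print(implicit_reward("What? I'm confused, that's nonsense."))  # -3.0
```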

Robin Chauhan: This sounds a lot like what ChatGPT is doing right now. Maybe the function approximator is slightly different, or the way to get the feedback is different, but under the hood it's actually based on RLHF.

Natasha Jaques: True, but there are some key differences. OpenAI's approach to human feedback differs from what we used in our 2019 paper, chiefly in that they train a reward model: they ask a group of people to compare and score two outputs, and then train a model to approximate those ratings. This idea actually goes back to OpenAI's earlier work on deep reinforcement learning from human preferences.
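
Reward models of this kind are typically trained with a pairwise comparison loss; a minimal numeric sketch is below. The exact objective in any particular paper may differ; this is just the standard Bradley-Terry-style form, with made-up reward values.

```python
import math

def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    # Standard pairwise comparison objective: -log sigmoid(r_preferred - r_rejected).
    # Minimizing it pushes the reward model to score the human-preferred output higher.
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# If the reward model already scores the preferred answer higher, the loss is small;
# if it scores the two answers equally, the loss is log 2.
print(round(pairwise_loss(2.0, 0.5), 3))  # ~0.201
print(round(pairwise_loss(1.0, 1.0), 3))  # ~0.693
```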

In contrast, my 2019 work used offline reinforcement learning (offline RL). At the time, I used actual human ratings of specific outputs as reward samples for training, without a general reward model. Because a trained reward model can be sampled from many times, that approach scales much better.

Robin Chauhan: John Schulman, co-founder of OpenAI and inventor of the PPO algorithm, worked on RLHF. He has talked about how ChatGPT's sibling model, InstructGPT, requires a lot of human feedback, and how evaluators need detailed and lengthy scoring instructions, which comes at considerable cost. Will this cost limit the application of RLHF? Or is the cost unimportant and totally worth it in terms of return?

Natasha Jaques: Before InstructGPT, OpenAI had already done a lot of research on summarization. In that summarization work, one of the key factors in using RLHF effectively was investing a lot of effort in obtaining high-quality human data.

In OpenAI's summarization paper, they took a better approach to recruiting evaluators: the researchers shared a Slack group with the evaluators and answered their questions, to keep the evaluators aligned with the researchers. That kind of investment is obviously very expensive.

It is worth mentioning a striking result in InstructGPT: the 1.3-billion-parameter model trained with RLHF outperforms the 175-billion-parameter model trained with supervised learning. In other words, with RLHF alone the model can catch up to one more than a hundred times its size, and the compute required to train a model a hundred times larger is extremely expensive. Although OpenAI does not disclose exactly how much it spends on collecting human data versus training giant models, it is not hard to see that, since RLHF can avoid the cost of training much larger models, it may actually be more cost-effective.

Robin Chauhan: In my opinion, they usually use the on-policy PPO (Proximal Policy Optimization) method to process the dataset. Such approaches cannot reuse data, because they rely on samples from the current model or data very close to it. If the model has shifted after training on this data, is the dataset still valid? Or can this dataset be used to train other models?

Natasha Jaques: These datasets are not one-off. Training the reward model essentially amounts to comparing, say, two text summaries, and the result of such a comparison does not depend only on the policy model itself; it is a more objective, shared judgment. In that sense it has an off-policy character, and the data can be reused.

2

Limitations of Reward Models

Robin Chauhan: John Schulman pointed out that while human feedback is effective during training, if the same reward model is used for too long, performance may drop at some point. So I think additional human feedback needs to be collected after each stage, and further improvements may require an entirely new dataset. What do you think?

Natasha Jaques: I'm not very familiar with the details of OpenAI's work, but I have seen this phenomenon in my own: we try to optimize the reward to achieve the goal while staying within the feasible range of the data, yet it is easy to over-optimize against the reward function and become overly dependent on it.

For example, when training a dialogue model, we used a reward function that encouraged the model to converse with humans and to output text with high positive sentiment. But with limited data, we are likely to overfit to the data and the reward, causing the model to perform poorly on new data.

Our goal is to maximize the reward while keeping the model close to the data distribution. We used a maximum entropy reinforcement learning (maximum entropy RL) formulation to find the optimal policy, but even with the behavior constrained in this way, the reward function can still be over-exploited. As a result, an agent trained on these rewards can end up excessively positive, polite, and agreeable.

The behavioral diversity of an agent rests on the diversity of the text it outputs. I wonder whether their results show a similar problem: over-training against a reward model can lead to diminishing and eventually even negative returns. Moreover, the reward model itself is not perfect; on validation data its accuracy is only around 70%, so overfitting is likely during training. It is also unclear whether reward models are comprehensive enough to capture what makes an output high quality.

Robin Chauhan: Existing models are not very good at ignoring distractors, but this is mainly a function approximation problem, not a reinforcement learning problem. We don't seem to have found a solution to the distractor problem yet.

Natasha Jaques: More symbol-based representations may be needed for generalization, so that objects like trucks and haystacks can be understood as concepts in their own right. We cannot rely solely on inductive deep learning, for instance recognizing trucks only from the truck examples in the training dataset, because that approach fails when faced with trucks outside the range of the training data.

Integrating language models into reinforcement learning agents has great potential, because language is compositional and may provide a compositional representation that helps with generalization. Generating realistic images from linguistic prompts already demonstrates the potential advantages of compositional representations.

3

Token-Level Reinforcement Learning

Robin Chauhan: You have done similar work in this area before, doing reinforcement learning at the token level and treating each token as an independent action, in methods such as "Sequence Tutor" and "Side Learning".

Natasha Jaques: Exactly, and the same goes for InstructGPT if you dig a little deeper. The policy-gradient method is the easier one to use here: by computing the log-probability of each token and summing them, you obtain the log-probability of the entire sequence. But whichever method you use, the loss ultimately reaches the model by increasing or decreasing token-level probabilities.

Robin Chauhan: Your paper describes it as a "bandit algorithm". In my opinion, that may give the impression that the whole sequence of tokens is a single action. But your take is that the way it is organized still lets us analyze the probability of each token individually.

Natasha Jaques: You can score the entire sequence with a single reward and weight it by the probability of the whole output. In practice, though, the log-probability of the whole sequence is obtained by summing the token-level log-probabilities, so the way the model is actually influenced is by modifying probabilities at the token level.
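
A tiny numeric sketch of this point follows, with made-up numbers and no real model involved.

```python
import math

# Per-token probabilities p(token_t | prefix) under the policy (made-up numbers).
token_probs = [0.9, 0.5, 0.8]

# The sequence log-probability is the sum of token-level log-probabilities...
seq_logprob = sum(math.log(p) for p in token_probs)
# ...which is the same as the product of the token probabilities.
seq_prob = math.exp(seq_logprob)            # 0.9 * 0.5 * 0.8 = 0.36

# In a bandit-style (sequence-level) policy gradient, one scalar reward for the
# whole output ends up scaling the gradient of every token's log-probability.
reward = 1.5
per_token_grad_scale = [reward] * len(token_probs)

print(round(seq_prob, 2), per_token_grad_scale)   # 0.36 [1.5, 1.5, 1.5]
```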

Robin Chauhan: So does that mean there is no benefit to doing analysis at the token level? Because I remember John saying that analyzing the dataset as a whole is more tractable.

Natasha Jaques: They take a different approach from token-level reinforcement learning: they set the discount factor to 1 and apply the same reward, undiscounted, to every token in the sequence. That is, a reward received at the end of the sequence counts just as much at the beginning of the sequence. This approach works pretty well.

If I recall correctly, we had experiments where we tried to design rewards at the sequence level and at the whole-dialogue level, for example rewarding the length of the dialogue, which spans multiple dialogue turns.

In addition, we also spread sentence-level rewards uniformly across the tokens in the sentence. For dialogue length, however, we still used a discount factor, because there is no way of knowing in advance how long a conversation will last, so those rewards need to be discounted; if the dialogue goes on long enough, the reward grows accordingly. Even so, optimizing discounted rewards over a conversation is quite difficult.
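
To make the two credit-assignment schemes discussed here concrete, the short sketch below computes per-token returns for a reward that arrives only at the end of a sequence, first with discount factor 1 (every token sees the full reward, as in the approach described above) and then with a discount factor below 1. The numbers are illustrative only.

```python
def returns(rewards, gamma):
    # Compute the discounted return at every position of a reward sequence.
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

terminal_only = [0.0, 0.0, 0.0, 1.0]      # reward arrives only at the last token
print(returns(terminal_only, gamma=1.0))  # [1.0, 1.0, 1.0, 1.0]
print(returns(terminal_only, gamma=0.9))  # ~[0.729, 0.81, 0.9, 1.0]
```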

4

AGI and AI Embodiment

Robin Chauhan: Do you think the current discussion and thinking about artificial general intelligence (AGI) is necessary, or is it just a distant dream that is not worth mentioning?

Natasha Jaques: I get a little frustrated when people talk about artificial general intelligence (AGI), because they often don't know what they are talking about.

The definition of AGI is not clear, and trying to pin it down leads to circular arguments. For example, someone may tell me that AGI will arrive in five years, but when I ask why the CEO of a self-driving-car company thinks it will take 20 years to ship a fully self-driving car, the two claims contradict each other.

In my view, AGI should be able to do everything a human can do, perhaps even better than a human; if it cannot drive a car, it can hardly be called AGI. Some argue that AGI does not need to take any concrete physical form, but then what is the point?

Those debates aside, I am genuinely surprised, and even a little concerned, by how fast AI is developing. If we define AGI as a highly disruptive, rapidly developing AI technology, then we have already reached that stage. Take ChatGPT: universities now have to redesign their writing courses, because ChatGPT writes better essays than some undergraduates.

Robin Chauhan: It is true that AGI cannot replace all jobs, but something like ChatGPT undoubtedly has huge prospects, and it is the first technology I have seen that achieves real generality. The self-driving cars you mentioned are also a good example: many people predicted that fully autonomous vehicles would arrive within two or three years, yet the actual launch has been repeatedly delayed.

Natasha Jaques: It is indeed difficult to launch a fully self-driving car in a short period of time, as the Tesla accident mentioned by Andrej Karpathy shows. The accident happened because the Tesla Autopilot system could not make sense of one semi-trailer being carried on top of another; in short, a truck was hauling a second semi-trailer stacked on its own trailer.

Such accidents occur because Tesla's Autopilot system cannot perceive situations outside its training data. We know that when models go beyond the support of the training data, their performance often degrades. So how do you create a dataset that covers every situation that might occur in the real world? In practice it is impossible, because the world keeps changing and new things keep emerging.

I have been researching how to train reinforcement learning agents through adversarial environment design, or unsupervised environment design. With these methods, we can find the situations that may cause the model to fail and train on them. These newer methods are more feasible than supervised learning that relies only on a limited dataset.

Robin Chauhan: There are still many problems with the AI embodiment you mentioned. But what ChatGPT shows is that if we can create and express ourselves freely in the abstract world of text, many problems can be solved.

Natasha Jaques: For me, the most fascinating thing is embodied intelligence that also understands language. Take AGI: if we want to define it, it must not only understand text but also understand how that text maps onto the world; only then can it generalize fully. It would be nice to have an agent that encodes all of this in a single network.

Robin Chauhan: With existing technology, we have become vastly better at many things we could not do before. In the past we mainly focused on text, abstract thinking, code, and abstract symbols, but reality shows that robotics and animal-like intelligence are the genuinely hard parts, while the abstract thinking we regard as uniquely human has turned out to be easier to achieve. We have now reached a goal that was previously thought unattainable, and ChatGPT lets us see how much generality robots still lack.

Natasha Jaques: I remember a saying that activities that are difficult for humans, such as chess and Go, are easy for AI, while low-level manipulation, such as picking something up off the ground with a hand, is the real challenge for AI.

I would like to share an interesting anecdote that illustrates why embodiment is so difficult. I have been working on language-conditioned RL agents, which aim to get machines to do real things guided by natural language.

Around that time, I read a paper from DeepMind on imitating interactive intelligence. They created a simulated world, like a low-resolution video game, in which robots can walk around freely and, given instructions, do things such as picking up an orange and putting it on the bed, or picking up a cup and putting it on the table.

A research team of about 30 people spent two years and millions of dollars on the project. They collected a huge amount of human data and applied it in the simulated environment; given the sheer volume, perhaps half of it was duplicated. They used this data to train the robot. In the end, guess what their success rate at executing instructions was: about 50%.

I think that is relatively low. An instruction like "put the orange on the bed" may seem simple, and given how much money the team had invested, you would expect a higher success rate. This shows how challenging embodied tasks are: even though we have successfully connected text to images, and text-to-image generative models perform well, manipulating physical entities is hard to control, and completing even simple tasks from visual and textual input remains very difficult.

5

Back to Academics: Studying Social Reinforcement Learning

Robin Chauhan: I heard that you plan to return to academia as an assistant professor at the University of Washington. What are you going to study?

Natasha Jaques: I already have a clear idea; when hiring, if you cannot clearly describe your plan, they will not hire you. What I want to do is social reinforcement learning: improving AI's performance when it learns in multi-agent environments. Most AI deployments today involve humans, and humans are very smart and have many ways of accomplishing tasks.

So we should think not only about how to make AI learn flexibly from humans, but also about human social-learning skills: how to identify which models are worth learning from, and when to rely on learning from others rather than on independent exploration. What I want to develop is AI that can interact with humans and be genuinely useful.

That raises questions such as: How do you collaborate on a task with someone you have never met? How do you understand the goals a human wants to achieve? How do you learn from human feedback, including implicit feedback? How do you use natural language to communicate with humans to solve tasks? How do you use human feedback to train language models? These are the language-grounded reinforcement learning problems I have been working on.

Robin Chauhan: Returning to academia after working in a leading laboratory is an interesting choice. I bet many people will make the opposite choice, especially considering the limited academic budget. Research is a big challenge because scale is important for AI, but scaling is expensive.

Natasha Jaques: Some people might think that to contribute to AI you need a huge compute budget and large-scale model training, and ask how academia could possibly afford that. But in reality, industry teams of 30-50 people are often working on ideas that have already been proven to work, so researchers join in to scale them up into large projects. For example, some large teams at Google are running RLHF projects; their approach is similar to OpenAI's, and they are all scaling up and writing their own infrastructure.

OpenAI and DeepMind are now increasingly focused on scaling rather than just publishing research. If you want to pursue innovative research directions, exploring new ideas and confirming them through experiments, there may be more obstacles in industry.

What I care about more is research freedom and the ability to think independently and experiment. Academia's role is to come up with new research ideas and do proof-of-concept work, while industry turns those ideas into practical systems.

Take my work on KL control as an example: exploratory work in academia has actively driven technological development in industry. So what matters is what you like doing, whether that is joining an infrastructure team or doing more research. Personally, I prefer work that is more research-oriented.

Robin Chauhan: Your contributions to AI have been recognized by academia but are little known to the general public. People see only OpenAI's achievements; they do not realize that those were reached by standing on the shoulders of predecessors.

Natasha Jaques: That's true. But my goal is to put my ideas into practice and verify whether they work, so as to contribute to the development of AI, not just to pursue glory.

Related Papers

1. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog (https://arxiv.org/abs/1907.00456)

2. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control (https://arxiv.org/abs/1611.02796)

3. PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning (https://arxiv.org/abs/2102.12)

4. Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience (https://arxiv.org/abs/2208.04919)

5. Fine-Tuning Language Models from Human Preferences (https://arxiv.org/abs/1909.08593), Daniel M. Ziegler et al., 2019

6. Learning to summarize from human feedback (https://arxiv.org/abs/2009.01325), Nisan Stiennon et al., 2020

7. Training language models to follow instructions with human feedback (https://arxiv.org/abs/2203.02155), Long Ouyang et al., 2022
