Head of OpenAI Superalignment: a four-year plan to "steer" superintelligence


With the rapid development of AI, OpenAI predicts that superintelligence could emerge within ten years. For humanity, superintelligence is both an opportunity and a challenge: it could help solve many major problems, but it could also pose enormous risks. The alignment of superintelligence has therefore become a central concern. We need to ensure that superintelligence is consistent with humanity's overall will, and that it understands and fulfills human intentions and needs.

Recently, OpenAI committed 20% of the compute it has secured to date to aligning superintelligence and established a team called "Superalignment", with the stated aim of solving the superintelligence alignment problem within four years. The plan is to first train an automated AI alignment researcher at roughly human level, and then use that automated alignment researcher to solve the superintelligence alignment problem.

The Superalignment team is co-led by OpenAI co-founder and chief scientist Ilya Sutskever and head of alignment Jan Leike, who previously worked at DeepMind for four years on reinforcement learning from human feedback and recursive reward modeling.

On the AXRP podcast about AI risk, UC Berkeley doctoral student Daniel Filan and Jan Leike discussed the specifics and challenges of OpenAI's Superalignment initiative. (The following content was compiled and published by OneFlow with authorization; please contact OneFlow for permission to reprint. https://axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html)

Source | AXRP

OneFlow compilation

Translation | Wan Zilin, Yang Ting

1

The Goals of Superintelligence Alignment

Daniel Filan: What is the goal of the superalignment team?

Jan Leike: Our high-level goal is to solve the superintelligence alignment problem within four years (that is, before mid-2027). OpenAI co-founder and chief scientist Ilya Sutskever will co-lead this project with me. OpenAI has committed 20 percent of the compute it has secured to date to aligning superintelligence.

Our overall plan is to first train an automated AI alignment researcher that is roughly at human level, and then use that automated alignment researcher to find a way to align superintelligence.

We would like to offload as many of the tasks required for alignment as possible to an automated system. Typically, when you use LLMs (large language models) or build general-purpose AI systems, their skill profile is not exactly the same as a human's: they are far better than humans in some areas, such as translation and recall of facts, but significantly worse at others, such as arithmetic, where today's language models are clearly inferior.

What we need to think about is which tasks can be handed over to the AI system, and in what order. As we progress, human researchers will gradually shift their attention to the tasks that AI systems cannot yet handle, devoting more time and energy to the more challenging areas. In this way, AI systems take on an ever-expanding share of the overall work, and human researchers become increasingly effective at making actual progress.

For the system to really make a difference, it needs to automate 99% or even 99.9% of the tasks, so that you actually get 10, 100, or even 1,000 times the research output.
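To make the arithmetic behind these numbers concrete, here is a back-of-the-envelope sketch (my own illustration, not from the interview): if humans remain the bottleneck on whatever fraction of the work is not automated, the achievable speedup behaves like Amdahl's law.

```python
# Back-of-the-envelope: if humans remain the bottleneck on the non-automated
# fraction of tasks, the overall speedup behaves like Amdahl's law.
for automated_fraction in (0.90, 0.99, 0.999):
    speedup = 1.0 / (1.0 - automated_fraction)
    print(f"{automated_fraction:.1%} automated -> ~{speedup:,.0f}x research output")
# 90.0% automated -> ~10x research output
# 99.0% automated -> ~100x research output
# 99.9% automated -> ~1,000x research output
```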

Daniel Filan: What types of tasks do you have in mind?

Jan Leike: The range of tasks is very broad, and I would divide them into two categories. The first category is more like traditional machine learning engineering and research aimed at enhancing the capabilities of AI systems; the second category covers everything required for alignment work. So the first category involves implementing machine learning experiments, running them, and looking at the results. The second category encompasses the very important high-level problem of deciding which experiments would improve scalable oversight or advance interpretability.

There are also more specific questions. Suppose you are at a certain stage of your research, say you have just finished a paper, and you are wondering, "If we keep going in this direction, what should we do next?" I expect machine learning to excel at the first type of task, automatically designing and running experiments, while our job is to find ways to automate the second type of task and thereby speed up alignment in various ways.

Daniel Filan: How exactly should "human level" be defined? Does it bother you, or reassure you, if AI outperforms humans at certain tasks?

Jan Leike: The key question is how risky it is to put this system to work on alignment research. If it just knows a lot of facts, that isn't scary. What really needs to be figured out is: if we let the system take over some, or even almost all, of the alignment research, will it try to deceive us? Will it seize the opportunity to grab power? So the question is how the skill set required to pass that kind of scrutiny compares to the skill set required to provide substantial help with alignment research.

If we dig deeper into this question: how capable is the model, really? Can it weave a coherent series of lies, deceiving us by pretending to believe one thing while actually wanting to do another? There is also one very critical capability, self-exfiltration. We need to figure out: could the model break through the security precautions, access its own weights, and try to copy them somewhere else on the internet? Is it good at persuading an engineer with access to the weights to download them and send them elsewhere? We can measure how well the model does on these things specifically, and then compare that to how much it helps us with alignment research.

Daniel Filan: Critics of this line of research say that to build a human-level automated alignment researcher, it would need to be quite smart and creative, able to imagine things we have never thought of and make plans to achieve goals, and at the same time be good at solving alignment problems. The argument is that this combination is inherently scary and dangerous. And if the task is to align the automated alignment researcher itself, aren't there other problems it needs to address first?

Jan Leike: Ultimately, this is an empirical question. When you scale up a model, it is hard to know exactly which skills will emerge and in what order. I am very excited that a lot of work is currently going into predicting the emergent capabilities of models; that would give us the ability to actually make predictions about the next pretrained model. At the same time, we can make some high-level arguments. For example, once the model is powerful enough that we can hand it a lot of alignment research, can it also improve its own capabilities? Can it do a lot of machine learning research, such as improving computational efficiency? We could then use those findings to pre-train more powerful models.

That scheme sounds very attractive, but I think it will be very complicated in practice, and you can't do large-scale pre-training every week. So it could be months before the results actually materialize, while the existing systems can still be used in the meantime. Another open question: how many easy ways are there left to improve computational efficiency?

At present, the community dedicated to making AI faster and more powerful is quite large, while the alignment community is relatively small. So if you can automate these tasks, the alignment field actually benefits more, because it is the smaller community and we no longer have to do those tasks ourselves.

Daniel Filan: If we create an automated alignment researcher to help us achieve alignment, what problems does it need to help us solve?

Jan Leike: One question about the automated alignment researcher concerns long-term goals and the creativity of the models. In my opinion, at least for language models, or AI more broadly, they are in some ways more creative than humans. If you look at the images generated by a diffusion model, or at samples from a pre-trained base model, they contain a lot of whimsy that would be hard to get from a single person or a small team. They can effectively sample from the entire distribution, which individuals usually cannot.

On long-term goals: we can give AI systems small, well-defined tasks, and if they can do those tasks really well, it already helps a lot. The tasks can be very specific, like "Here's the paper we just wrote, please suggest some next steps or new experiments." Imagine you had a truly first-rate researcher you could ask these questions; they wouldn't need to pursue long-term goals, just optimize over the next few thousand tokens.

Daniel Filan: This seems to conflict with the goal of automating 99.9% of alignment research. Figuring out how to achieve aligned AI is itself a rather difficult part of alignment research.

Jan Leike: Exactly. The system adds a lot of value for us by performing these tasks well. You can then build an overall portfolio of tasks, ranging from "write the code that implements these experiments" to "look at the results and tell me what you see, or suggest what to do next". Then you can combine them, in the style of Auto-GPT or language model programs, where each task is specific and self-contained, so the system doesn't have to pursue long-term goals.

As an example, OpenAI recently published a study using process-based feedback in mathematics. Rather than just training the model on whether the system reached the right final solution and then improving it with reinforcement learning, they used human feedback to train a reward model that evaluates every step of the proof.

It turns out that this step-by-step approach works better: it gives the AI system a more fine-grained way to learn and more detailed feedback. Whether this kind of process-based training can compete with end-to-end reinforcement learning toward the correct final solution is not yet clear. But at the very least, you can use step-by-step training to get the system to do a lot of very valuable things that are within the range of human capabilities, and then combine them.
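To make the distinction concrete, here is a minimal toy sketch of outcome-based versus process-based feedback; the scoring function and the example solution are stand-ins of my own, not OpenAI's reward model.

```python
# Toy sketch of outcome-based vs process-based feedback for a math solution.
# The scorer is a stand-in of my own, not OpenAI's reward model.

def toy_reward_model(text):
    """Stand-in scorer: pretend a trained reward model returns a scalar."""
    return 1.0 if ("x = 4" in text or "answer is 4" in text) else 0.0

solution_steps = [
    "Step 1: let x be the number of apples",
    "Step 2: 2x + 3 = 11, so x = 4",
    "Step 3: therefore the answer is 4",
]

# Outcome-based feedback: one scalar for the final answer only.
outcome_reward = toy_reward_model(solution_steps[-1])

# Process-based feedback: a scalar for every step, so a mistake in the middle
# of the proof is penalized where it happens, not only at the end.
process_rewards = [toy_reward_model(step) for step in solution_steps]

print(outcome_reward)     # 1.0
print(process_rewards)    # [0.0, 1.0, 1.0]
```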

Daniel Filan: Right. But even for small tasks like "look at the results and decide what to do next", I would have thought you need big-picture awareness: you need to think about which next project will be most helpful for solving the superalignment problem within four years.

Jan Leike: Yes. But that doesn't mean you have to optimize and do credit assignment against the four-year goal; it's more practical to add the broader goal and context to the prompt. And when you actually improve the system with reinforcement learning or RLHF, you don't have to wait until the end of a research project to decide whether it was good enough; you just use human feedback as the reward, as in "this direction looks good, better than anything I could have come up with".

So I think the overall goal is not to build the most capable automated alignment researcher the technology allows, but to build a really valuable system that scales massively and, most importantly, that we trust is aligned enough to take over these tasks.

If this introduces some inefficiency, if this way of training actually limits the model's capabilities, we might think, "The model can perform these decomposed tasks, but if we trained it end-to-end, it might become more powerful."

2

What can an Auto-Alignment Researcher do?

Daniel Filan: So you will first solve the problem of roughly human-level alignment, and as the models get smarter, additional problems will arise, and the system will start to solve those problems?

Jan Leike: Basically. A real solution to the superintelligence alignment problem will look very different from what is being done today. Currently, ChatGPT is aligned mainly through reinforcement learning from human feedback, and there is a broad consensus that this method cannot scale; I agree, because it fundamentally assumes that humans really understand the details of how the system operates.

If the system is doing a lot of alignment research, involving tasks equivalent to millions of virtual humans, it is impossible to look at all the details and give detailed feedback on everything. The techniques we are currently working on are meant as intermediate steps: align a roughly human-level alignment researcher to do these difficult tasks, while not behaving in ways radically different from humans. In that sense, scalable oversight is a natural extension of reinforcement learning from human feedback.

I would define scalable oversight as a set of ideas and techniques that allow us to use AI to assist humans in evaluating difficult tasks. Typical examples of scalable oversight include debate, recursive reward modeling, and iterated distillation and amplification.

If we really want to align superintelligence, a system far beyond human intelligence that can think much faster and operate at a much larger scale, that introduces many other problems. In particular, because it is highly general and can perform many kinds of tasks, we have to figure out how to align it not only on the narrower distribution of alignment research tasks, but on everything else as well.

It would be very exciting if we could bring some formal verification into this. Perhaps we can find learning algorithms with theoretical guarantees. I'm not sure what is possible or theoretically feasible here, though, even with a lot of cognitive effort put into the problem.

Taken together, all of these approaches are very different from what we are doing now or will be doing soon. I also don't think a roughly human-level alignment researcher would immediately tackle those problems. Instead, we hope the researcher can figure out how to better align its own next generation, so that successor can solve these problems and make more progress on a stronger foundation. We would thus improve step by step and eventually arrive at a system capable of groundbreaking research, giving us the ability to align superintelligence.

Daniel Filan: OpenAI has mentioned self-improvement loops in several blog posts. First, there is a correlation between safety and capability: you need an intelligent model to solve the alignment problem, but at the same time you don't want it to improve too fast. In "Planning for AGI and beyond", there is a sentence to the effect that "a sufficiently intelligent AGI able to accelerate its own progress might appear, causing major changes to happen surprisingly quickly." The post also said something like, "We think safety would be easier to achieve at a slower pace of development." I wonder: if we build this very intelligent, human-level alignment researcher and effectively increase the size of the alignment team by a factor of 10 or 100, will that lead to this recursive self-improvement cycle?

Jan Leike: Yes, it's bound to happen in some form. For a recursive self-improvement cycle to go well, alignment must also improve significantly along the way. Personally, I believe the potential for rapid progress is considerable and we should prepare for it to happen. And if it doesn't happen, I'd be glad of that too.

Daniel Filan: When might alignment capabilities develop rapidly?

Jan Leike: I'm not sure either, but you can compare it to other machine learning projects like AlphaGo, Dota, or StarCraft, where the systems' performance improved substantially every week. There's certainly a lot of uncertainty about what exactly will happen, but I think we should be prepared for that possibility. If things really do go fast, we'll have automated alignment researchers that can do thousands of years' worth of work in a week, which is beyond the reach of humans.

Daniel Filan: Will human-level AI alignment researchers threaten human jobs? Will we still need human superalignment teams then?

Jan Leike: That's a good question. Honestly, I'd be happy to be replaced by AI. Most likely it would be the situation we mentioned earlier: AI assistants do 99 percent or 99.9 percent of the work, and we just deal with the rest. It's also very important that humans retain some degree of connection to and control over the AI system, even in the long run, when we no longer fully understand what the AI is doing.

So we will still need some human effort to understand the high-level goals the AI is trying to achieve. That wouldn't necessarily be done by the superalignment team OpenAI has now, and it would likely require a very different skill set than today's. But I think there should always be some level of human involvement, no matter what.

3

How to get AI to be an AI Alignment Researcher

Daniel Filan: Can you describe the specific implementation plan of the human-level automatic alignment researcher?

Jan Leike: It's basically a two-part job. The first part is getting a system that is intelligent enough; the second part is aligning that system so that it actually does alignment research. The two parts are closely linked.

I'm not working on the first part; there may be many different ways to achieve it, but it basically amounts to training a bigger model until it's smart enough, and that will almost certainly happen. The second part is what I'm really interested in. The question is: once you have this smart-enough pre-trained model, how do you get it to actually do alignment research the way you want? And more importantly, how do you know you can trust it to do the job?

If you give this kind of task to an AI, a smart system might try to present results and techniques in a way that, if you implement them, actually strengthens the system itself or some related system, giving it more opportunities in the future for power-seeking behaviors such as self-exfiltration.

So first of all we need to align it and make full use of its capabilities to complete the task at the highest possible quality. We need a range of alignment training methods, and we also need to be able to tell whether the model is really aligned to the task: a system that genuinely helps us with alignment research and is honest and truth-seeking, versus one that only imitates that behavior while actually wanting to take control of the world or, for example, self-exfiltrate. These are two different systems: one is truly aligned, the other only appears to be.

An important task is to find ways to distinguish these two systems, which requires a series of validation techniques. In a sense, these methods are not used directly to train the AI system; instead they provide an independent signal for judging which of the two versions the model is.

Interpretability research is one such approach. If I could look inside the model and find evidence that it was secretly plotting against us, it would fail the validation test.

Another technique, which can serve as part of training or of validation depending on how it is set up, is the easy-to-hard generalization problem. If you can understand and improve how models generalize from easy problems that we can supervise to hard problems that are difficult to supervise, then you can supervise the model on the easy part and cross-validate on the hard part, without worrying that the model might be cheating behind your back or trying to do something else.

Daniel Filan: How do you achieve scalable oversight? Especially given that we currently lack consensus on what counts as excellent alignment research, how do you obtain a high-quality training signal for alignment research?

Jan Leike: The lack of consensus really does show that this problem is hard, and that the field is still immature; so far, we don't have much empirical evidence. But I think there is a very important property of alignment research we can exploit for scalable oversight: it is much easier to evaluate alignment research than to produce it. That doesn't mean evaluation is easy, but it is much easier to judge whether a paper is creative, well designed, and achieves the desired results than it is to do the work yourself.

Evaluation is thus easier than generation, and exploiting this principle is at the heart of many ideas for scalable oversight. For example, in recursive reward modeling, you have an AI system acting as an assistant that helps you evaluate other AI systems. Since evaluation is easier than generation, the assistant's task is relatively simple, especially when you are working together with it. If you can align the assistant on the simpler evaluation task, then the human/assistant combination can oversee a new system on a harder task. Repeating this process lets you effectively oversee AI systems on a wider and wider range of tasks.

Daniel Filan: So we gradually add more and more AI assistance to the evaluation side of AI alignment research, and iteratively provide it with a high-quality training signal.

Jan Leike: Specifically, RLHF is the simplest method and doesn't require any assistant: your AI system performs a task, a human looks at the result and judges it, and that is your training signal.

The next step is to train the simplest assistant model, a critique model: an independent language model that looks at the first AI system's output and writes a critique. Humans are notoriously bad at finding bugs in code, which is why there is so much buggy code in the world. But if a critique system can write critiques and point out mistakes, it becomes much easier for people to spot them; we can easily decide "yes, this is a bug, we should fix it".

Code is usually written against some natural-language specification whose true meaning may be somewhat vague, so whether something counts as a bug is not always clear-cut. Importantly, using a critique model as an assistant expands the range of tasks you can oversee, because you can essentially find every problem in the code that the AI system would find or comment on, even though that is not quite the same as finding every problem the code actually has.
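As a rough illustration of how a critique model changes the human's job, here is a minimal sketch; both "models" below are stand-in functions of my own, not real systems.

```python
# Minimal sketch of critique-assisted evaluation. Both "models" below are
# stand-in functions, not real OpenAI systems.

def assistant_writes_code(task):
    # A response with a subtle off-by-one bug a hurried reviewer might miss.
    return "def mean(xs):\n    return sum(xs) / (len(xs) - 1)"

def critique_model(code):
    """Stand-in for a second model trained to point out problems."""
    if "len(xs) - 1" in code:
        return "Possible bug: divides by len(xs) - 1 instead of len(xs)."
    return "No issues found."

code = assistant_writes_code("compute the mean of a list")
print(code)
print(critique_model(code))
# Verifying a flagged bug is much easier for the human than finding it unaided,
# which is exactly what widens the range of tasks we can oversee.
```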

Even better, there are plenty of empirical ways to measure how well this works. One of them, which we used in the evaluation-assistance paper we published last year, is to run randomized controlled experiments with targeted perturbations. You take some task, say a coding task (in the paper we didn't actually use coding tasks; the specific task doesn't matter), and get a response from the system. The response doesn't have to be perfect, but let's assume for now that it is. Then you deliberately introduce subtle bugs or errors into the response, making it a worse response in ways that are hard for humans to detect.

Daniel Filan: How do you introduce the bugs?

Jan Leike: The easiest way is to have humans introduce them, but the ultimate goal is to have models do it, which I'll come back to. Now you have a pair of responses: the original, and one deliberately made worse. You randomly show one of them to a human, either unassisted, assisted by an AI, or using some other scalable oversight method, and see how likely they are to find the problem. You can also just ask, "Is there a bug here?" This tells you how often people find the flaws when they have better help.

You can go a step further and directly compare the models' ability to identify problems: train the model as a discriminator that distinguishes the good version from the flawed version, and measure its accuracy to get the discriminator's performance. Then you check whether the model, or the RLHF version of the model, can write a critique describing the flaw, which tells you how much the model actually knows about the flaw. Once you have the critique accuracy, you can compare the two values; the difference is known as the discriminator-critique gap.
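A sketch of how the gap would be computed once both measurements exist; the numbers below are made up for illustration, not measured values.

```python
# Sketch of the discriminator-critique gap measurement described above.
# The numbers are made up; in practice both come from evaluating the model on
# many (original, deliberately-perturbed) response pairs.

discriminator_accuracy = 0.90   # how often the fine-tuned discriminator spots the flawed version
critique_recall = 0.60          # how often the model's written critique mentions the planted flaw

gap = discriminator_accuracy - critique_recall
print(f"discriminator-critique gap: {gap:.2f}")
```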

If the gap is large, it means the model probably detects, or in some sense knows about, many problems that it does not reveal when you ask it to write a critique.

The "discriminator-evaluator" gap is too large to cause worrying consequences, which means that your scalable supervision mechanism is very flawed in some way, and many problems are missed under your supervision, or Assistant isn't aligned enough in the sense that it's not doing its best to help you.

If the discriminator-critique gap is small, it means that when the model knows about a problem, it will point it out. In particular, suppose a base model of similar or the same size tried to plant a problem in your code, or to insert a hard-to-find trojan that could be triggered to take over or do something else when the code is run; we can assume the model that planted the trojan knows about it. So if we take another copy of the model and measure a small discriminator-critique gap, we have more reason to believe that this other copy would write a critique revealing the problem.

This assumes discriminator training can actually elicit that knowledge from the model, and I think there is good reason to believe it can: when you fine-tune the discriminator, fine-tuning all the weights at once gives you access to everything in the model's activations. So if the knowledge is there in the activations, and you do it right and have enough data, fine-tuning the discriminator should be able to extract it.

In fact, you can do a rough calculation of how much data you'd need. Even in the extreme case where a single neuron knows whether there is a problem with the code, you would only need on the order of the logarithm of the total number of neurons to narrow it down and find it, which is not a lot of data.
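A rough version of that calculation, assuming (purely for illustration) a model with on the order of a billion neurons:

```python
# Rough version of the calculation: if the relevant knowledge sat in a single
# neuron, identifying which one is an O(log N) search, so the labeled data
# needed to fine-tune a discriminator to find it is modest.
# The neuron count below is an assumed order of magnitude, not a real model's.
import math

num_neurons = 10**9
bits_to_identify = math.log2(num_neurons)
print(f"~{bits_to_identify:.0f} bits of signal needed to single out one neuron")
```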

Daniel Filan: This touches on a question I have about the whole program. The reason we don't want to rely on RLHF is that it may not distinguish what is actually correct from what merely looks good to humans, since humans are poor evaluators.

With this recursive evaluation approach, we first train a discriminator and then an evaluator model. In the good case, we train a model that writes genuinely useful critiques, and then work with it to train a model that writes even better critiques. But I worry about another case, where we inadvertently train a model to write plausible-sounding fake critiques, and then work with it to get even better at producing critiques that merely sound good. When we then train a discriminator to tell good code from problematic code, shouldn't we be most worried about code that merely looks fine?

Jan Leike: That concern is important and valid. In a way it is one of the key problems we need to address in scalable oversight, and it's why I'm excited to implement these techniques and validate them empirically, because that will let us quantify how strong this effect is in practice.

A concrete and simple thing we can do is deliberately train a model to be deceptively aligned and then see whether scalable oversight can identify it, or how much effort it takes to make a model deceptively aligned enough that it slips past scalable oversight.

Daniel Filan: There are also the sandwiching experiments, for example where an AI answers legal questions: you take people who know very little about the law, and also find people who are expert in law, and check whether the AI manages to fool the first group.

Jan Leike: Yes. These experiments are very interesting and some are already underway, but there are some fundamental problems with this setup.

First, you have to really trust your experts, but experts sometimes make mistakes. Second, you need enough overlap between the domains. We ran some initial experiments along these lines, asking non-programmers to use assistance tools to find problems in code, but they didn't even know how to ask the right questions.

Another problem is that for the tasks I'm really interested in, there is no ground truth at all. For the alignment research we do now, you might say "get some alignment researchers to label some tasks", but then what if they get it wrong? We disagree on so many things that this seems difficult. And there aren't many such experts, their time is precious, so this would be very expensive data.

All in all, I would like an evaluation method that does not depend on ground truth. That's why I'm excited about randomized controlled trials with targeted perturbations, and about the discriminator-critique gap: they can be done without any ground truth, and you can choose tasks of any difficulty.

4

Finding bad behavior and bad internal mechanisms in models

Daniel Filan: The superalignment post says you want to look for bad behavior and bad internal mechanisms. What can the Superalignment team do on that front?

Jan Leike: The first is interpretability, which is in some sense hard to achieve. I don't think interpretability has yet really proven to deliver many insights or much value for current language models; our understanding of these models' internal mechanisms is still quite rudimentary.

Daniel Filan: But people do do some interpretability research on language models, things like induction heads and so on.

Jan Leike: Sorry, I don't mean to belittle existing results; there have been many exciting ones. But the most exciting prospect would be using interpretability techniques on the reward model of a GPT-4-scale language model, to learn things about the reward model we didn't know before. The reward model provides the training signal for much of RLHF training, so understanding it better is extremely valuable. It would be great if you could flag or identify problems in the behavior it incentivizes that you don't want.

I think this is possible, and importantly, interpretability is neither necessary nor sufficient. We have a real chance of solving the alignment problem purely behaviorally, without ever truly understanding the models' internals. At the same time, even solving interpretability would not by itself give us a strategy for aligning superintelligence. But any insight gained from interpretability has great potential value, and it could give us a line of attack.

At the same time, this is why interpretability is so hard. The model is learning to compute efficiently; it is not regularized into a human-understandable form, and there is no reason to believe each neuron should correspond to anything relevant or familiar to the human mind. Empirically, neural networks represent many different concepts with a single neuron, and each concept is spread across different neurons. So neurons are not really the right unit here.

For interpretability, I care about two things. One is causality. You don't just want to watch a neuron as data flows through the model and say, "this neuron fires when there's a story about Canada", or something like that. That was one of the findings in our interpretability paper: we found a "Canada" neuron that activated when Canada-related concepts appeared. But that is only correlation, not causation.

To test for a causal relationship, you need to deliberately write texts containing Canada-related concepts and see whether they all activate the neuron, and also include related concepts that sound connected to Canada, or things that have nothing to do with Canada but are similar, and check that the neuron does not fire. Or you can take a piece of text, edit it, and watch the neuron switch off.
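A sketch of what such a causal probe might look like; the activation function and texts are hypothetical stand-ins, not the actual experiment.

```python
# Sketch of the correlation-vs-causation probe described above. The activation
# function is a hypothetical stand-in; a real test would read the neuron's
# activation from the actual model on each text.

probe_texts = {
    "canada_related":   ["Ottawa is the capital of Canada.", "The Toronto Maple Leafs won."],
    "similar_controls": ["Canberra is the capital of Australia.", "The Boston Bruins won."],
}

def neuron_activation(text):
    """Stand-in: pretend this returns the candidate 'Canada' neuron's activation."""
    return 1.0 if ("Canada" in text or "Toronto" in text) else 0.1

for group, texts in probe_texts.items():
    print(group, [neuron_activation(t) for t in texts])

# Causal evidence requires that the Canada texts activate the neuron, the
# near-miss controls do not, and editing the Canadian content out of a text
# switches the neuron off.
```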

Daniel Filan: This reminds me of a paper I remember about an interpretability illusion, which showed that some neurons seem to fire for one particular thing, but on other datasets it turns out to be an illusion and the neurons fire for many other things.

Jan Leike: Another very exciting development is a paper published earlier this year on automated interpretability. The basic idea is a technique that can operate at the microscopic level of individual neurons, so no detail is missed, while also operating on the entire model at the macro level.

At the end of the day, all the components of a model interact and are highly interrelated, so you need to cover both levels. So far, techniques have mostly been limited to one or the other. There has been prior work on automated interpretability, but in general, if you want to do very detail-oriented, mechanistic interpretability work, really trying to understand the individual circuits or computational units inside the model, then generalizing that approach to the scale of an entire model requires automation.

The paper describes this in detail: it actually writes natural-language explanations for individual neurons. That may not be exactly the right unit of explanation, but it provides a simple example of what can be done. The way it works is that you simply show GPT-4 a series of activation patterns and ask GPT-4 to propose an explanation.

In general, the quality of these explanations is poor, because the task is hard and most neurons don't perform a well-defined, human-understandable function. But we can run this process for every neuron in GPT-2, save all the explanations, and look for interesting patterns. You can also look at scaling trends, such as "as the model gets bigger, how does the automatic scoring of these explanations change?"

The most exciting part is that we can again use language models to do the measurement automatically. It's not a perfect metric and has some issues, but it gives us an approximation of whether a human would think the explanation is good, and we can then apply that approximation across the whole model, running it over a large number of neurons.
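Putting the pieces together, the loop is roughly explain, simulate, score; the sketch below uses stand-in functions in place of the real GPT-4 calls.

```python
# Sketch of the explain -> simulate -> score loop for automated interpretability.
# The three functions are stand-ins; in the actual paper GPT-4 plays both the
# explainer and the simulator roles.

def explain_neuron(activation_records):
    """Stand-in for asking GPT-4 to summarize what a neuron responds to."""
    return "fires on references to Canada"

def simulate_activations(explanation, texts):
    """Stand-in for predicting activations from the explanation alone."""
    return [1.0 if "Canada" in t else 0.0 for t in texts]

def score(predicted, actual):
    """Agreement between simulated and real activations (1 - mean abs error)."""
    return 1.0 - sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

records = [("Ottawa is in Canada", 0.9), ("Paris is in France", 0.1)]
texts, actual = [r[0] for r in records], [r[1] for r in records]

explanation = explain_neuron(records)
predicted = simulate_activations(explanation, texts)
print(explanation, round(score(predicted, actual), 2))   # -> fires on references to Canada 0.9
```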

Daniel Filan: If you think about the necessary interpretability work, how much of it is about finding better basic units of explanation, and how much is about solving scaling problems, and so on?

Jan Leike: Both are necessary. Finding the right basic unit of explanation may be the more challenging part, but the scaling part is critical to success.

5

Do large language models understand alignment?

Daniel Filan: I guess we don't have a clean specification for superalignment. If there were one, there would be no alignment problem; we would just tell the model "here is Python pseudocode for making the neural network bigger". Because there is no such specification, we can't align against a written spec; we can only tell the models to be nice and not to harm humans. Some people think we can find the answer inside the language model itself. What do you think?

Jan Leike: There is some truth to that view, though I'm actually not sure how important a written specification is. To some extent, if we pre-train the model on what humans have written, it cannot fully know our thoughts, because many of our thoughts were never written down. But overall, the model can predict fairly accurately what people would say in a wide variety of situations, events, or scenarios. In that sense, it won't be hard for future models to understand the world and to judge how people would regard their behavior.

Daniel Filan: That is, if I'm superintelligent, I know what I'm supposed to do even though I don't have definite rules in my head.

Jan Leike: I mean, the model will become smart enough to guess how people would feel about events. The obstacle we face is not that AI systems don't understand what humans really care about; models are unlikely to be wrong about that, because they are very intelligent and capable and understand almost everything about the world. The real challenge is not teaching the language model what to do, but actually getting it to do it.

Daniel Filan: Yes. The model knows what we want; what we have to do is get it to actually act to achieve that goal.

Jan Leike: You could compare such a model to a person with an antisocial personality: great abilities, full knowledge of what one should and shouldn't do, but doing the opposite anyway. That is what would be scary.

6

The Superalignment team's four-year timeline

Daniel Filan: The Superalignment blog post says your goal is to solve the core technical challenges of superintelligence alignment in the next four years. What are those core technical challenges?

Jan Leike: This is about the general technical approach for aligning superintelligence with a set of human values. The superintelligence we envision is a system far beyond human intelligence: it can perform tasks much faster, do massively parallel computation, and cooperate with many copies of itself. It is a genuinely powerful system.

The reason for the four-year period is that we wanted an ambitious goal within a realistically feasible time frame, so that even if AI progress is very rapid and the technology improves a great deal over the next few years, we can still deliver something tangible by then.

Daniel Filan: Understood. So this is not only about building human-level AI to automate alignment research, but also about using that to align systems much smarter than humans.

Jan Leike: Exactly. We build the human-level automated alignment researcher not just for its own sake, but more importantly because, through this step, we can tackle the technical problem of aligning superintelligence that is smarter than humans. We don't currently know how to align superintelligence, so we are building an automated alignment researcher that can help us study and solve that problem.

Daniel Filan: You want to solve these core technical challenges within four years. To be on track for that goal, where do you expect to be two years from now?

Jan Leike: I think that after three years we should have the relevant technical capabilities and have largely automated alignment research. If the capabilities aren't there by then, the project may take longer; four years is the ideal case.

I hope that after two years we will have a clear picture of the techniques we will actually use to align the automated alignment researcher: do we have the right set of techniques? If we apply them, do we get a reliable, highly usable system to which we can hand out work at scale? Hopefully by then we will have broken the problem down well enough that most of the remaining work is engineering, which would mean we have roughly two years to solve the underlying research questions.

This is the four-year target timeline we set, and obviously it is tightly coupled to the progress of AI capabilities. If progress slows, we may not have models that are really good at useful alignment research tasks. We have tried to use GPT-4 for alignment tasks, and the results were not great; GPT-4 is not smart enough. If four years from now there still aren't smart enough models, then we will have that much more time to figure these things out.

On the other hand, if AI development accelerates, we may have less than four years, because superintelligence could arrive quickly, and we would have to adjust the plan accordingly. So four years was chosen to make the plan both feasible and urgent enough.

Daniel Filan: Suppose AI capability research develops as expected. Four years from now you have the capabilities needed to build a good automated alignment researcher, but it turns out that interpretability or scalable oversight is harder than we thought. What then?

Jan Leike: If we cannot achieve our goals, we will have to be upfront with the public about it. But whether we achieve them depends a lot on how the world develops overall: do we need more time? Is the overall approach wrong? Do we need to change direction? Many surprises can happen along the way.

Daniel Filan: In short, if the plan changes, you will announce the progress to the public and find the next point of effort.

Jan Leike: The alignment problem is actually very tractable in practice. There are many good ways of attacking it; if you pursue them rigorously and measure the results, you can really learn and make progress. Over the past two years I've become increasingly optimistic about this. Even if alignment turns out to be harder than we thought, our research will still be extremely valuable and will produce more evidence about how hard alignment really is. That evidence matters all the more because people currently have very mixed views about the difficulty of alignment. Perhaps even more important is being able to understand and measure how aligned a system actually is in practice.

My biggest concern is not that our systems won't be aligned enough, but that we won't know how aligned they are. In that situation, experts may reasonably disagree; whereas if everyone agrees a system is not sufficiently aligned, it simply won't be deployed.

That case is the easy one. The worse case is this: you have a powerful system that is probably fine, probably aligned, but you're not sure, and some experts are still very worried. Then, even though the system could be deployed immediately, we might have to hold back, while its deployment faces tremendous commercial pressure. Some people think the system is probably aligned but can't be certain, and at the same time there are strong commercial interests. In that situation it is very hard to make the right decision.

Daniel Filan: And there is commercial pressure to deploy the system.

Jan Leike: Exactly. Under business pressure, you may be fairly confident the system is aligned but not completely sure, so you delay deployment, and the longer you delay, the greater the commercial pressure becomes. We can avoid this by measuring the alignment of the system precisely, which is where a broader portfolio of techniques comes in.

7

Generalization of the model

Daniel Filan: OpenAI mentioned in a footnote to a blog post that the favorable assumptions made so far may break down. One of those assumptions is that generalization is benign. How has your thinking about generalization changed?

Jan Leike: We recently formed a generalization team. The question is how we can understand and improve the model's ability to generalize from tasks that are easy to supervise to tasks that are hard to supervise, beyond what human evaluation alone gives us. Generalization complements scalable oversight, which in turn complements human evaluation. If you think about recursive reward modeling, the hope is: "Can we have an AI assistant, itself recursively evaluated, help us evaluate everything the AI does?" That keeps humans in the loop, front and center, looking at everything the AI system does.

Of course, in practice we cannot look at everything the AI does, because AI systems do far too much, but we can look at each thing with some small independent probability. Still, we face the problem that the model may generalize to places we never supervise. My earlier approach to this was: make sure most of the generalization is IID (independent and identically distributed) generalization, i.e. the tasks you check come from the same distribution as the tasks you don't check.

Daniel Filan: In an earlier blog post you even said you would never rely on generalization, but would keep training the model and keep things IID.

Jan Leike: Yes, that was my original thinking. I didn't want to rely on non-IID generalization, because non-IID generalization in neural networks works poorly and is poorly understood. But the question now is: what if we could actually understand generalization and characterize what the model generalizes to? That is a very valuable question, and Ilya has raised it many times.

What we want to understand is: even on unsupervised tasks, even when those tasks are not IID, can we characterize how the model generalizes? Does it generalize in line with people's intent? Does it generalize in ways that appear beneficial to people but aren't? We can study this empirically through experiments.

We have been trying to split existing datasets into easy and hard problems, and then to understand or improve the accuracy of large models on the whole dataset. This is a very interesting topic, because it offers a new pillar for both training and validation techniques. If this works well, we could supervise the reward model on simple tasks that are easy to evaluate and have it generalize to harder problems.
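A minimal sketch of measuring easy-to-hard generalization on a toy dataset; the easy/hard split used below (distance from the decision boundary) is my own illustrative choice, not the team's definition.

```python
# Toy sketch of measuring easy-to-hard generalization: supervise only on "easy"
# examples and check accuracy on the "hard" ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=2000) > 0).astype(int)

difficulty = np.abs(X[:, 0])            # low-margin examples count as "hard"
easy, hard = difficulty > 0.5, difficulty <= 0.5

model = LogisticRegression().fit(X[easy], y[easy])    # supervise on easy only
print("accuracy on easy:", round(model.score(X[easy], y[easy]), 3))
print("accuracy on hard:", round(model.score(X[hard], y[hard]), 3))   # the generalization we care about
```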

We would then have a reward model that generalizes to harder tasks in the way we intend, even without supervision, and we could train against it. But we still face the problem of not knowing whether the model is truly aligned; that could be validated with techniques like scalable oversight and interpretability. Or it could go the other way round: suppose we train an automated alignment researcher with scalable oversight and use generalization for validation, i.e. generalizing "the probability of giving the true answer according to the model's best knowledge".

Then we can ask questions like: is there a subtle flaw in this code? Did the model that was aligned via scalable oversight insert a trojan? We can do this kind of high-level cross-validation by training two different models, one with generalization techniques and the other with scalable oversight, and cross-checking their answers: do they behave like the same model, or are they fundamentally different, and if so, what are the major differences?

Daniel Filan: Recently Scott Aaronson mentioned on our podcast that whenever he talks to Ilya Sutskever, Ilya always asks him about his theoretical definitions of love and kindness. How does your team view this issue?

Jan Leike: We will probably try many different exploratory projects. The way Ilya puts it, the ultimate question is how we summon the alignment-relevant concepts. One concept we want to summon is: does the model fundamentally want humans to succeed? Or, in Ilya's words, does it love humanity?

So if the model is very intelligent, has read everything, and knows exactly how humans think about immoral behavior, you can ask GPT-4 about different ethical cases from different philosophical viewpoints in different situations, and on the whole it does a pretty good job. GPT-4 fundamentally understands how people define morality and how we think about it. The question is how to make use of that; I think that is the core of the problem.

8

Differences from other alignment labs

Daniel Filan: OpenAI used to have an alignment team. Is that team still there? How do you see the relationship between the Superalignment team and OpenAI?

Jan Leike: The alignment team still existed last year. It consisted of two parts, practical alignment and scalable alignment. Practical alignment's task was to align OpenAI's most capable models, so it focused mainly on aligning GPT-4; the scalable alignment team's goal was to work on alignment problems we don't yet face. With the release and success of ChatGPT, a lot of work was needed to improve ChatGPT's RLHF and model quality and turn it into a truly outstanding product, and that was more than the alignment team could take on.

So the practical alignment work was handed over to other teams at OpenAI and has grown into a large project involving hundreds of people, while scalable alignment evolved into today's Superalignment team. We chose this name to emphasize the ongoing effort to align superintelligence: we are studying problems that have not arisen yet and doing forward-looking work. That doesn't mean the other work is unimportant; it's just where our focus lies.

Daniel Filan: There are other very good AI research labs in the world that are also doing superintelligence alignment related work. How is the Super Alignment team different from these labs?

Jan Leike: Many laboratories are doing related work, especially DeepMind and Anthropic. In a way, we're all trying to solve the same problem, so it's natural that we end up working on similar things, like interpretability and scalable oversight.

To some extent we risk duplicating work, and to avoid that it would be best for different laboratories to collaborate. On the other hand, doing similar research in different labs helps avoid groupthink: if each lab has to convince itself of the results independently, people will naturally be more skeptical about other labs' findings. The downside is that this can lead to an "either-or" attitude, where people are reluctant to use techniques from other labs because of a preconceived bias that other people's work isn't good enough.

We haven't worked out how to balance these considerations. One might think we should bring all the alignment people together and have them work jointly, but the reality is that the frontier AI labs have strong incentives to invest in alignment. The success of RLHF made models much more commercially valuable, which attracted substantial investment into this kind of technology. If the AI labs are the ones funding this research, then alignment research will naturally take place at those labs.

Daniel Filan: What is unique about the Super Alignment team in terms of the research agenda etc.?

Jan Leike: The Superalignment team focuses on aligning the automated alignment researcher, rather than on solving every downstream alignment task ourselves, and we're not too worried about the costs that come with that. Other labs don't seem to emphasize that goal or direction in the same way. We actively experiment with a broad range of scalable alignment techniques and look for ways to compare them empirically; other labs may be very optimistic about one particular scalable oversight technique and push hard to make it work. And we take an automated approach to interpretability, which other labs may not emphasize as much; we tend to experiment a lot in that area.

Our team believes that using compute to improve alignment is one of the main strategies, especially for scalable oversight, where we want to figure out how to improve the quality of the supervision signal. We focus on how to use more compute to strengthen that signal. Spending compute to improve the evaluation model is one way: by spending more compute, you get better evaluations.
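One simple way more compute can buy a better supervision signal is to sample an imperfect evaluator many times and aggregate; a toy sketch, with an assumed noise model of my own:

```python
# Toy illustration: an imperfect evaluator sampled many times gives a
# lower-variance quality estimate, i.e. more compute -> better supervision signal.
# The Gaussian noise model is an assumption for illustration only.
import random

random.seed(0)
TRUE_QUALITY = 0.7

def noisy_evaluation():
    """Stand-in for one pass of an imperfect learned evaluator."""
    return TRUE_QUALITY + random.gauss(0, 0.3)

for n_samples in (1, 16, 256):
    estimate = sum(noisy_evaluation() for _ in range(n_samples)) / n_samples
    print(f"{n_samples:>3} evaluations -> estimated quality {estimate:.2f}")
```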

The crux of the matter is: what other ways are there to turn more compute into a stronger supervision signal? Automated interpretability is one feasible way; you can make progress on interpretability largely by investing a lot of compute. We're not quite there yet, but in general, if we can make automated interpretability work, it will bring huge value, and that's its attraction.

Obviously, once we are doing automated alignment research, spending more compute directly buys better alignment results. Because what we really want is to convert compute into alignment, we concluded that we need a lot of compute. That is the point of OpenAI's commitment to devote 20% of its compute to automated alignment research, which may be the largest single investment in alignment to date, perhaps exceeding all other investments combined. If we had asked for a much larger share, people would reasonably wonder whether OpenAI could actually deliver on it, because OpenAI still wants to develop frontier models and pre-train state-of-the-art AI systems, and that also requires a great deal of compute.

And if we do succeed in developing an automated alignment researcher and it turns out that running it requires even more compute, that would mean the strategy of converting compute into alignment is working, and OpenAI would support it.

Daniel Filan: In terms of team size, how many people can you reach at most?

Jan Leike: The team currently has about 20 people, and it may grow to around 30 by the end of the year. Over four years, it is unlikely to exceed a hundred. But the way the team scales won't simply be by adding headcount; it will also come from people contributing in a dotted-line capacity, and that is how the effective team can grow massively.

9

Why are you bullish on the Auto-Alignment Researcher?

Daniel Filan: There is a concern that an aligned, human-level automated researcher would be very hard to build and would require quite sophisticated thinking. Why are you so optimistic about this?

Jan Leike: That's a great question. "Will the plan succeed within four years" is a more complicated question than "will the plan succeed at all". If you ask whether our current plan can successfully align superintelligence, I would say there is an 85% chance, compared with roughly 60% for the plan we had last year. There are many reasons for my optimism, even though alignment isn't easy.

So why am I so optimistic? Alignment research has come a long way over the past few years, and at least for me there have been several positive updates from developments in AI. The first is the success of language models. If the model comes pre-loaded with so much of what humans care about, such as our moral choices and cognitive preferences, and it can understand natural language, we can communicate with it directly.

In some ways, it is much easier to communicate with a language model and express what we want it to align with than it is with a deep reinforcement learning agent trained in a game or virtual environment. Compared with a language model, such an agent may have no language at all, and language is associated with many important skills.

Another important update was that RLHF works well. When I first worked on RLHF, I thought it might be hard to get it working in a reasonable amount of time, because GANs (generative adversarial networks) were hard to train at the time, and we were doing something very similar in a sense: we trained a reward model and then used it to train another network, and training can fail for all sorts of reasons.

On top of that, we were adding deep reinforcement learning to the mix, which at the time was also hard to get working. So I thought RLHF might well fail, but it actually works very well; in many games, even most Atari games, it is almost competitive with models trained on the score function.

Even more important is how RLHF performs on language models, especially considering how striking the difference is between InstructGPT and the base model it was fine-tuned from: on API tasks, the fine-tuned version was preferred over a base model roughly 100 times larger, and those are the tasks people are willing to pay for. This is a very large difference, and it shows that RLHF fine-tuning makes the model far more effective at doing what people want.

At the same time, we put very little compute into RLHF and didn't collect that much data. It was arguably our first serious attempt to use RLHF to align a real system, and it worked surprisingly well. InstructGPT turned out to be very effective.

While I don't think RLHF is the solution to alignment, especially not for superintelligence, the fact that the first alignment method we really tried in earnest works this well at least suggests that alignment is easier than we thought.

The second reason I'm optimistic about the automated alignment researcher is that we can now measure progress in alignment research. For RLHF in particular, we can make various interventions and have humans evaluate how much the system improves. The same goes for many other things, such as scalable oversight: we can run randomized controlled trials with targeted perturbations, or vary conditions and use automatic scoring functions to see the improvement those changes bring. Such a scoring function isn't perfect; it is only a local metric that gives the local slope of improvement. But being able to measure research progress matters a lot: it lets us set iteration targets and tells us which direction to improve in.

So far, I don't think we can reach the goal of aligning superintelligence directly, but we have a good chance of building a roughly human-level automated alignment researcher. Compared with aligning superintelligence, building a human-level automated alignment researcher is a relatively modest and feasible goal, and that is the third reason I'm optimistic.

Years ago, when I first started doing alignment research, I knew that aligning superintelligence would be hard. By comparison, the automated alignment researcher is much more feasible. Facing a very hard goal, we can change our approach: instead of attacking the whole goal head-on, decompose it into more specific, more achievable sub-goals.

The fourth reason is that evaluation is easier than generation. This applies in many situations: it is much easier to evaluate how good a phone is than to build one. In computer science there are NP (non-deterministic polynomial) problems, such as SAT solving and various constraint-satisfaction problems, where we don't know whether an answer can be found in polynomial time, but we can verify an answer in polynomial time. The same asymmetry shows up in many business activities: if you hire someone to solve a problem, evaluating their work is far less effort than doing it yourself; and in academia, peer-reviewing a paper is much less work than doing the research. In my view, the same is true of alignment research.
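The SAT example can be made concrete in a few lines: checking a candidate assignment is linear in the formula size, while finding one may require exponential search.

```python
# Checking a candidate SAT assignment is linear in the formula size, while
# finding one may require exponential search: evaluation is easier than generation.

formula = [(1, -2), (2, 3), (-1, -3)]        # CNF clauses; a negative literal means "not"
assignment = {1: True, 2: True, 3: False}

def satisfied(clauses, assign):
    """Polynomial-time verification of a proposed solution."""
    return all(
        any(assign[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

print(satisfied(formula, assignment))   # True: easy to check, potentially hard to find
```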

The last reason is that I have a lot of confidence in language models. They have enormous potential, they're going to be excellent, and they're well suited to most alignment-research tasks that can be framed as text in, text out. Whether it's a machine-learning-style task (such as running an experiment and understanding the results) or a more conceptual, research-style task, where we're stuck and don't know what to do next or don't understand the problem, the model can try to help us. Essentially all of these are text-in, text-out tasks; the most complicated might involve looking at plots, but GPT-4 can handle that. So the current language model pre-training paradigm fits the superalignment plan and superalignment research very well.

Daniel Filan: I'm a bit skeptical about how useful language models will be. They are really good at modeling text and generating highly rated answers on top of that, but for what alignment needs from a language model, they aren't goal-directed. What do you think?

Jan Leike: Right. At least initially, pre-training is training on essentially random text from the internet toward the narrow goal of predicting the next token. That doesn't by itself produce long-term goals; a long-term goal might emerge somehow, but a priori, "predict the next token" can only be a short-horizon objective.

Daniel Filan: But people have long-term goals when they write text. If you train the model on papers, which people usually write to advance a research project or their own career, and the model models that content, including the long-term goals behind it, maybe you get long-term goals.

Jan Leike: That would explain why "predict the next token" sometimes looks like it involves long-term goals. I think the main point is that modeling agents pursuing long-term goals, the plans they build to achieve them, the public's reactions, and the likely outcomes all combine to make next-token prediction a very hard function. Pre-training is incentivized to find the simplest functions first; a good example is induction heads, a simple induction mechanism that was discovered when training smaller models.

Daniel Filan: As I understand it, the induction-head mechanism is: when predicting the next token, check whether the current token has appeared earlier in the context, and if so, look at what followed it last time. A very simple mechanism.

Jan Leike: Exactly. And more complex mechanisms get built on top of that simple one. Since we're minimizing the pre-training loss, the simplest functions that help most are learned first, so many other functions get learned before the very complicated function of modeling an agent with long-term goals.

I expect that modeling an agent pursuing long-term goals will probably be one of the last functions the model learns in order to predict the next token: it's very complex, the model has a finite number of layers, and there's a limit to what can be done in a single forward pass.
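A toy version of the induction-head pattern discussed above (illustrative only; a real induction head is an attention circuit, not a lookup loop):

```python
# Toy version of the induction-head pattern: to predict the next token, find
# the most recent earlier occurrence of the current token and copy whatever
# followed it.

def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards through context
        if tokens[i] == current:
            return tokens[i + 1]               # copy the token that followed last time
    return None

context = "the cat sat on the".split()
print(induction_predict(context))   # -> "cat", because "the" was followed by "cat" earlier
```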


Originally published at blog.csdn.net/OneFlow_Official/article/details/132332033