GPT-2 can supervise GPT-4: the first paper from OpenAI's Ilya-led "Super Alignment" team is here, with empirical results for AI aligning AI

Humans can’t supervise superintelligent AI, but artificial intelligence can.

In the past year, large models whose essence is “predicting the next token” have swept across a wide range of tasks, demonstrating the enormous potential of artificial intelligence.

In a recent interview, OpenAI chief scientist Ilya Sutskever boldly predicted that if a model can predict the next word well, it means it has understood the underlying reality that led to that word being produced. This means that if AI continues to develop on its current path, an artificial intelligence system that surpasses humans may be born in the near future.

But what is even more worrying is that superintelligent AI may bring unexpected negative consequences. This is precisely the problem that "alignment" is meant to address.

Previous alignment methods relied on human supervision, such as reinforcement learning from human feedback (RLHF), which played a key role in training ChatGPT. But future AI systems may exhibit behaviors so complex and creative that humans will struggle to supervise them reliably. For example, a superhuman model might write millions of lines of novel, potentially dangerous computer code that even human experts would find difficult to understand.

Once artificial intelligence surpasses humans, how should we supervise artificial intelligence systems that are much smarter than ourselves? Will human civilization eventually be subverted or even destroyed?

Even academic giants like Hinton are pessimistic about this issue - he said that he has "never seen a case where something with a higher level of intelligence was controlled by something with a much lower level of intelligence."

Just now, the OpenAI "Super Alignment" team released its first paper since its establishment, claiming to have opened up a new research direction for empirical alignment of superhuman models.

Paper link: https://cdn.openai.com/papers/weak-to-strong-generalization.pdf

The OpenAI "Super Alignment" team was established in July this year. Its goal is to solve the alignment problem of superintelligent AI within four years, that is, to figure out how to build a trustworthy, roughly human-level automated alignment researcher and then use it to solve the alignment problem. OpenAI has reportedly committed 20% of its compute to this team.

In this paper, OpenAI makes a simple analogy to the problem of "human supervision of super artificial intelligence": let small models supervise large models.

Research shows that the 1.5-billion-parameter GPT-2 model can be used to elicit most of GPT-4's capabilities, allowing it to reach performance close to the GPT-3.5 level, and even to generalize correctly to hard problems on which the small model fails.

OpenAI calls this phenomenon "weak-to-strong generalization." It suggests that the strong model has implicit knowledge of how to perform the task and can find that knowledge within itself even when given shoddy instructions.


However, the study also pointed out that there is still a big gap between strong models trained with weak supervision and strong models trained with real labels. This suggests that techniques such as reinforcement learning with human feedback (RLHF) may not scale well to superhuman models without additional work. The performance gap is particularly large for the ChatGPT reward modeling task.

Several simple methods can significantly improve weak-to-strong generalization: bootstrapping supervision through intermediate model sizes, adding an auxiliary confidence loss during fine-tuning that encourages the strong model to remain confident even when it contradicts the weak labels, or improving representation learning with additional unsupervised pre-training.
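Of these three methods, the auxiliary confidence loss is the easiest to sketch. Below is a minimal PyTorch reading of the idea, assuming a fixed mixing weight `alpha` and hard-label cross-entropy; the paper's actual loss term and weighting schedule may differ, and the function name is purely illustrative.

```python
import torch
import torch.nn.functional as F

def confidence_loss(strong_logits: torch.Tensor, weak_labels: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    # Cross-entropy against the (possibly incorrect) weak labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Harden the strong model's own predictions (detached argmax) and use them
    # as additional targets, rewarding the student for staying confident even
    # where it disagrees with its weak supervisor.
    hardened = strong_logits.detach().argmax(dim=-1)
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1.0 - alpha) * ce_weak + alpha * ce_self
```

The intuition is that pure imitation of the weak labels also imitates the weak supervisor's mistakes; mixing in the student's own hardened predictions gives it room to overrule labels it is confident are wrong.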

To encourage other researchers to tackle such problems, OpenAI also announced today that it will provide $10 million in funding for research on various alignment methods.

Below are the paper details.

Research methods

Today's models are mainly guided or aligned through reinforcement learning from human feedback (RLHF): behaviors that human evaluators rate highly are reinforced, and behaviors they rate poorly are penalized. This approach is very effective as long as human evaluators can accurately judge whether model behavior is good or bad, and it is a core part of training large language models such as ChatGPT.
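To make that background concrete, the snippet below sketches the pairwise reward-model loss commonly used in RLHF pipelines: the reward model is trained so that the response a human evaluator preferred receives a higher scalar reward than the response they rejected. This is a standard formulation assumed here for illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    # Each tensor holds scalar rewards for a batch of response pairs; the loss
    # pushes the preferred response's reward above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```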

However, the problem is that superhuman models may perform complex and creative behaviors that humans cannot fully understand. For example, if a superhuman assistant model generates a million lines of code, humans will be unable to provide reliable oversight on key alignment-related questions, including whether the code follows the user's intent, whether the assistant answers questions about the code truthfully, and whether executing the code would be safe or dangerous.

Therefore, if we fine-tune a superhuman model under human supervision, we cannot be sure how well it will generalize to complex behaviors that humans themselves would struggle to reliably supervise. In other words, even with human guidance, how these models will behave in complex or unfamiliar situations remains uncertain.

This creates a challenge for aligning superhuman models: how can a less intelligent supervisor control a model that is much smarter than it is? Despite the importance of this issue, it is currently difficult to study empirically.

Generally speaking, a core challenge of superalignment is that humans need to supervise models smarter than themselves. This is a weak-to-strong learning problem: how can a weak supervisor oversee a model much smarter than itself? To make this problem tractable, the paper proposes a simple analogy: replace the weak human supervisor with a weak model.

Traditional machine learning focuses on the setting where the models humans supervise are weaker than, or at most comparable to, the humans supervising them. But in the ultimate superalignment problem, the models humans need to supervise will be smarter than the humans themselves. Therefore, this paper studies an analogous problem: using weak models to supervise strong models.

Here's how they do it, for a given task:

  1. Constructing the weak supervisor. The paper builds a weak supervisor by fine-tuning a smaller pre-trained model on ground-truth labels. The weak supervisor's accuracy is called weak performance, and its predictions on held-out data serve as weak labels.
  2. Training a strong student model with weak supervision. The paper fine-tunes a strong model on the generated weak labels; this model is called the strong student, and its accuracy is called weak-to-strong performance.
  3. Training a strong model on ground-truth labels as a ceiling. For comparison, the paper also fine-tunes the strong model directly on ground-truth labels; its accuracy is called the strong ceiling performance. (A toy sketch of this three-step procedure appears after this list.)
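As a concrete illustration of this three-step protocol, here is a toy sketch that uses scikit-learn classifiers as stand-ins for the weak supervisor and the strong student. The dataset, model classes, and split sizes are illustrative assumptions only, not the paper's GPT-2/GPT-4 setup.

```python
# Toy analogy of the weak-to-strong protocol on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10,
                           random_state=0)
# Three disjoint splits: weak-supervisor training, student training, held-out test.
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=2000, random_state=0)
X_student, X_test, y_student, y_test = train_test_split(X_rest, y_rest, train_size=2000,
                                                        random_state=0)

# 1. Weak supervisor: a small model trained on ground-truth labels.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
weak_labels = weak.predict(X_student)        # its predictions become the weak labels
weak_perf = weak.score(X_test, y_test)

# 2. Strong student: a larger model trained only on the weak labels.
student = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)
weak_to_strong_perf = student.score(X_test, y_test)

# 3. Strong ceiling: the same larger model trained on ground-truth labels.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_student, y_student)
strong_ceiling_perf = ceiling.score(X_test, y_test)

print(weak_perf, weak_to_strong_perf, strong_ceiling_perf)
```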

Typically, weak-to-strong performance falls somewhere between weak performance and the strong ceiling performance. The paper defines the performance gap recovered (PGR) as a function of these three quantities (weak, weak-to-strong, and strong ceiling).


If perfect weak-to-strong generalization is achieved, PGR is 1. If the weak-to-strong model performs no better than the weak supervisor, the PGR is 0.
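Written out, and consistent with the two endpoint cases just described, PGR is the fraction of the weak-to-ceiling gap that the strong student recovers. A minimal sketch:

```python
def performance_gap_recovered(weak: float, weak_to_strong: float,
                              strong_ceiling: float) -> float:
    # 1.0 means the weak-to-strong student matches the strong ceiling;
    # 0.0 means it is no better than its weak supervisor.
    # Assumes strong_ceiling > weak, so the denominator is positive.
    return (weak_to_strong - weak) / (strong_ceiling - weak)
```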

Experimental results

This paper evaluates the performance of strong student models on NLP tasks, chess, and reward modeling. Overall, across all settings, weak-to-strong generalization is observed: strong student models consistently outperform their weak supervisors.


This paper finds that weak-to-strong generalization can be greatly improved using simple methods, as shown in Figure 4.


Figure 5 shows that although smaller strong students perform slightly worse than the naive baseline, the improvement in generalization is still clear.


Figure 7(a) shows the ground-truth test accuracy curve during training on the ChatGPT reward modeling task, and Figures 7(b) and (c) compare the best and final ground-truth test accuracies.


Figure 9(a) considers 7 representative NLP tasks and compares fine-tuning, zero-shot prompting, and 5-shot prompting; for the zero-shot and 5-shot baselines, the task-specific prompts summarized in Table 2 are used.

