A review of more than 200 large-model papers reveals the challenges and limitations of RLHF

From: Heart of the Machine


While powerful, the RLHF approach does not resolve the fundamental challenges of developing human-aligned AI.

Since the advent of ChatGPT, the training method used by OpenAI, reinforcement learning from human feedback (RLHF), has attracted much attention and has become a core method for fine-tuning large language models (LLMs). RLHF uses human feedback during training to minimize unhelpful, distorted, or biased outputs, aligning AI models with human values.

However, RLHF also has shortcomings. Recently, dozens of researchers from MIT CSAIL, Harvard University, Columbia University, and other institutions jointly published a review paper that analyzes and discusses more than 200 research papers in the field and systematically studies the flaws of the RLHF method.


Paper address: https://huggingface.co/papers/2307.15217

Overall, the paper highlights the limitations of RLHF and shows that developing safer AI systems requires a multi-faceted approach. The research team did the following work:

  • Surveyed open problems and fundamental limitations of RLHF and related methods;

  • Outlined ways to understand, improve, and complement RLHF in practice;

  • Proposed auditing and disclosure standards to improve community oversight of RLHF systems.

Specifically, the core content of the paper covers the following three parts:

1. Specific challenges faced by RLHF. The research team categorized and surveyed RLHF-related problems, distinguishing between challenges of RLHF, which are more tractable and can be addressed with improved methods within the RLHF framework, and fundamental limitations of RLHF, whose resulting alignment issues must be addressed by other methods.

2. Integrating RLHF into a broader technical safety framework. The paper argues that RLHF is not a complete framework for developing safe AI and describes methods that help to better understand, improve, and supplement RLHF, emphasizing the importance of multiple redundant strategies for mitigating failures.

3. Governance and transparency. The paper analyzes the challenges of improving industry norms. For example, the researchers discuss the usefulness of having companies that use RLHF to train AI systems disclose their training details.

Let's take a look at the structure and basic content of the core part of the paper.

As shown in Figure 1 below, the study analyzes the three processes involved in RLHF: collecting human feedback, reward modeling, and policy optimization. The feedback process elicits human evaluations of model outputs; the reward modeling process uses supervised learning to train a reward model that imitates these evaluations; and the policy optimization process optimizes the AI system to produce outputs that the reward model scores favorably. Chapter 3 of the paper discusses the problems and challenges of RLHF along four aspects: these three processes, plus the joint training of the reward model and the policy.

[Figure 1: the three RLHF processes (collecting human feedback, reward modeling, and policy optimization)]
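To make the three stages concrete, here is a minimal sketch in Python/PyTorch. It is not the paper's code: toy random vectors stand in for LLM activations, a simple drift penalty stands in for the usual KL regularizer, and all shapes and hyperparameters are illustrative assumptions.

```python
# A minimal, self-contained sketch of the three RLHF stages described above.
# Toy random vectors stand in for LLM hidden states; nothing here is the
# paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB = 16  # toy "embedding" size standing in for an LLM representation

# 1) Collecting human feedback: pairs of (chosen, rejected) response vectors.
chosen = torch.randn(64, EMB)
rejected = torch.randn(64, EMB)

# 2) Reward modeling: supervised learning that imitates human preferences via
#    the pairwise Bradley-Terry loss, -log sigmoid(r(chosen) - r(rejected)).
reward_model = nn.Linear(EMB, 1)
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
for _ in range(200):
    rm_loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    rm_opt.zero_grad()
    rm_loss.backward()
    rm_opt.step()
for p in reward_model.parameters():  # freeze the reward model for stage 3
    p.requires_grad_(False)

# 3) Policy optimization: push the "policy" toward outputs the reward model
#    scores highly, while penalizing drift from a frozen reference policy.
policy = nn.Linear(EMB, EMB)
reference = nn.Linear(EMB, EMB)
reference.load_state_dict(policy.state_dict())
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
prompts = torch.randn(64, EMB)
beta = 0.1  # weight of the stay-close-to-reference penalty
for _ in range(200):
    outputs = policy(prompts)
    with torch.no_grad():
        ref_outputs = reference(prompts)
    reward = reward_model(outputs).mean()
    drift = F.mse_loss(outputs, ref_outputs)  # stand-in for a KL penalty
    pi_loss = -(reward - beta * drift)
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

print("mean reward margin after training:",
      (reward_model(chosen) - reward_model(rejected)).mean().item())
```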

The problems summarized in Chapter 3 of the paper show that heavy reliance on RLHF to develop AI systems poses safety risks. While RLHF is useful, it does not resolve the fundamental challenges of developing human-aligned AI.


The research team argues that no single strategy should be considered a comprehensive solution; a better approach is "defense in depth", combining multiple safety methods. Chapter 4 of the paper elaborates on ways to improve AI safety by understanding, improving, and supplementing RLHF.

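As a rough illustration of what "defense in depth" can mean in practice, the sketch below (an assumption-laden example, not something taken from the paper) routes every model output through several independent, redundant checks: a learned safety score, a hand-written rule filter, and an escalation path to human review.

```python
# An illustrative sketch of layered safety checks; the check functions below
# are hypothetical placeholders, not methods proposed by the paper.
from typing import Callable, List

def learned_safety_check(text: str) -> bool:
    """Placeholder for a learned reward/safety classifier's verdict."""
    return "how to build a weapon" not in text.lower()

def rule_based_filter(text: str) -> bool:
    """Placeholder for a redundant, hand-written policy filter."""
    banned = ("password dump", "credit card number")
    return not any(term in text.lower() for term in banned)

def needs_human_review(text: str) -> bool:
    """Placeholder heuristic for escalating unusual outputs to a person."""
    return len(text) > 2000

def release(text: str, checks: List[Callable[[str], bool]]) -> str:
    """Release an output only if every independent check passes."""
    if not all(check(text) for check in checks):
        return "[withheld: failed an automated safety check]"
    if needs_human_review(text):
        return "[queued for human review]"
    return text

print(release("RLHF alone does not guarantee aligned behaviour.",
              [learned_safety_check, rule_based_filter]))
```

The point of the layering is that no single check, including the RLHF-trained model itself, is trusted on its own: an output must clear every independent layer before it is released.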

Chapter 5 of the paper outlines risk factors in RLHF governance and the corresponding auditing measures.


Summary

The study finds that many problems in practice stem from fundamental limitations of RLHF and must be avoided or compensated for with non-RLHF approaches. The paper therefore highlights the importance of two strategies: (1) evaluating technological progress against the fundamental limitations of RLHF and other methods, and (2) addressing the AI alignment problem by adopting defense-in-depth safety measures and openly sharing research results with the scientific community.

Furthermore, the study sheds light on challenges and problems that are not unique to RLHF, such as difficulties inherent to RL policies, as well as some that are fundamental to AI alignment.

Interested readers can read the original text of the paper to learn more about the research content.





Origin: blog.csdn.net/qq_27590277/article/details/132074347