Dozens of researchers survey the challenges and limitations of RLHF in a single paper, analyzing more than 200 large-model studies


Source: 机器之心 (Machine Heart)
This article is about 1,140 words; suggested reading time is 5 minutes. Although RLHF is powerful, it does not solve the fundamental challenges of developing human-aligned AI.

Since the release of ChatGPT, reinforcement learning from human feedback (RLHF), the training method used by OpenAI, has drawn wide attention and has become a core method for fine-tuning large language models (LLMs). RLHF uses human feedback during training to minimize unhelpful, distorted, or biased outputs and to align AI models with human values.

However, RLHF also has shortcomings. Recently, dozens of researchers from MIT CSAIL, Harvard University, Columbia University, and other institutions jointly published a survey that analyzes more than 200 research papers in the field and systematically examines the flaws of the RLHF approach.


Paper link:

https://huggingface.co/papers/2307.15217

Overall, the paper highlights the limitations of RLHF and argues that developing safer AI systems requires a multi-faceted approach. The research team did the following:

  • surveyed open problems and fundamental limitations of RLHF and related methods;

  • outlined methods for understanding, improving, and complementing RLHF in practice;

  • proposed auditing and disclosure standards to improve societal oversight of RLHF systems.

Specifically, the core of the paper consists of three parts:

1. Concrete challenges facing RLHF. The team categorizes and surveys RLHF-related problems, distinguishing challenges facing RLHF from fundamental limitations of RLHF: the former are more tractable and can be addressed with improvements within the RLHF framework, while the latter require other approaches to the alignment problem.

2. Incorporating RLHF into a broader technical safety framework. The paper argues that RLHF is not a complete framework for developing safe AI, describes methods that help to better understand, improve, and complement RLHF, and emphasizes the importance of multiple, redundant strategies for reducing failures.

3. Governance and transparency. The paper analyzes the challenges of improving industry norms. For example, the researchers discuss whether it would be useful for companies that train AI systems with RLHF to disclose details of their training process.

Let's look at the structure and main content of the paper's core sections.

As shown in Figure 1 below, the study analyzes the three processes involved in RLHF: collecting human feedback, reward modeling, and policy optimization. The feedback process elicits human evaluations of model outputs; the reward modeling process uses supervised learning to train a reward model that imitates those evaluations; and the policy optimization process optimizes the AI system to produce outputs that the reward model rates highly. Chapter 3 of the paper discusses the problems and challenges of RLHF across four aspects: these three processes plus jointly training the reward model and the policy.

(Figure 1 from the paper: the RLHF pipeline of human feedback collection, reward modeling, and policy optimization.)
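To make the three stages concrete, below is a minimal PyTorch sketch of the pipeline described above. It is not code from the paper: random tensors stand in for a real language model and preference data, the names (reward_model, policy, chosen, rejected) are illustrative assumptions, and plain gradient ascent replaces the PPO-with-KL-penalty setup that production RLHF systems typically use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB = 16  # toy embedding size standing in for an LLM hidden state

# Stage 1: collecting human feedback -- a fake preference dataset: for each
# prompt embedding, the embeddings of the human-preferred and rejected responses.
prompts  = torch.randn(32, EMB)
chosen   = torch.randn(32, EMB)
rejected = torch.randn(32, EMB)

# Stage 2: reward modeling -- supervised training of a scalar reward head with
# the pairwise (Bradley-Terry) preference loss, so it imitates human judgments.
reward_model = nn.Sequential(nn.Linear(2 * EMB, 64), nn.ReLU(), nn.Linear(64, 1))
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(200):
    opt_rm.zero_grad()
    r_chosen   = reward_model(torch.cat([prompts, chosen], dim=-1))
    r_rejected = reward_model(torch.cat([prompts, rejected], dim=-1))
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()  # preferred response should score higher
    loss.backward()
    opt_rm.step()

# Stage 3: policy optimization -- the "policy" maps a prompt to a response
# embedding and is updated to increase the learned reward.
for p in reward_model.parameters():
    p.requires_grad_(False)  # freeze the reward model while optimizing the policy
policy = nn.Sequential(nn.Linear(EMB, 64), nn.ReLU(), nn.Linear(64, EMB))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(200):
    opt_pi.zero_grad()
    response = policy(prompts)
    reward = reward_model(torch.cat([prompts, response], dim=-1)).mean()
    (-reward).backward()  # gradient ascent on the learned reward
    opt_pi.step()
```

The sketch also illustrates a failure mode the paper discusses: because stage 3 optimizes against a learned proxy of human preferences, the policy can exploit errors in the reward model (reward hacking) rather than genuinely improving its outputs.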

The issues summarized in Chapter 3 show that relying heavily on RLHF to develop AI systems poses safety risks. While RLHF is useful, it does not solve the fundamental challenges of developing human-aligned AI.


The research team argues that no single strategy should be treated as a comprehensive solution; a better approach is "defense in depth," combining multiple safety methods. Chapter 4 of the paper elaborates on ways to improve AI safety by understanding, improving, and complementing RLHF.


Chapter 5 of the paper outlines risk factors and auditing measures for the governance of RLHF.


Summary

The study finds that many problems encountered in practice stem from fundamental limitations of RLHF and must be avoided or compensated for with non-RLHF approaches. The paper therefore emphasizes two strategies: (1) evaluating technical progress against the fundamental limitations of RLHF and other methods, and (2) addressing AI alignment by adopting defense-in-depth safety measures and openly sharing research results with the scientific community.

Furthermore, the study highlights that some of these challenges are not unique to RLHF, such as difficulties inherent in RL policies, while others are fundamental problems of AI alignment.

Editor: Wen Jing

