RLHF: Reinforcement Learning from Human Feedback for Large Language Models

Around 2017, deep reinforcement learning (deep RL) was gaining prominence and attracting widespread attention. In June 2017, OpenAI and Google DeepMind jointly published the paper "Deep Reinforcement Learning from Human Preferences", the work from which reinforcement learning from human feedback (RLHF) originates. The research addresses an important challenge in deep reinforcement learning: how to design reward functions efficiently to guide an agent's learning. Traditional reinforcement learning usually relies on hand-designed reward functions, but these often struggle to capture the nuances of complex tasks and the preferences of the humans behind them.
Paper link: https://arxiv.org/pdf/1706.03741

The paper proposes a novel approach that exploits human preferences to train deep reinforcement learning models. Human participants watch the agent perform the task and indicate which of two short behavior clips they prefer; these comparisons are used to train a reward model whose predictions then serve as the reward signal for the agent. In this way the agent can better learn the goals and priorities of the task. The approach not only improves the performance of reinforcement learning models, but also brings their behavior more in line with human expectations and preferences.
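
To make the idea concrete, here is a minimal, illustrative sketch (not the authors' code; the toy reward network, names, and hyperparameters are my own assumptions) of how pairwise human preferences over two behavior segments can train a reward model with a Bradley-Terry style cross-entropy loss:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps an observation vector to a scalar per-step reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs):                  # obs: (T, obs_dim)
        return self.net(obs).squeeze(-1)     # per-step rewards: (T,)

def preference_loss(rm, segment_a, segment_b, pref):
    """Bradley-Terry loss: pref = 1.0 if the human preferred segment_a, else 0.0."""
    sum_a = rm(segment_a).sum()              # total predicted reward of segment A
    sum_b = rm(segment_b).sum()              # total predicted reward of segment B
    log_probs = torch.log_softmax(torch.stack([sum_a, sum_b]), dim=0)
    target = torch.tensor([pref, 1.0 - pref])
    return -(target * log_probs).sum()       # cross-entropy on the human preference

# Usage with randomly generated stand-in data
rm = RewardModel(obs_dim=8)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(25, 8), torch.randn(25, 8)   # two 25-step segments
loss = preference_loss(rm, seg_a, seg_b, pref=1.0)       # the human preferred segment A
loss.backward(); opt.step()
```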

  • In the paper "Fine-Tuning Language Models from Human Preferences" (Ziegler et al., 2019), the authors introduce a method that uses human input to train a reward model. Human annotators are shown several continuations generated by the policy model, for example four candidate answers (y0, y1, y2, y3), and pick the best one; these choices serve as labels for the reward model. Trained in this way, the reward model learns more accurate criteria for evaluating text and can guide the training of the policy model (a rough sketch of this objective is given below).
    Paper link: https://arxiv.org/pdf/1909.08593

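As a rough sketch of this labelling objective (the function and variable names are assumptions, not the paper's implementation): the reward model assigns a scalar score to each of the four candidate answers, and the annotator's choice is the target of a softmax cross-entropy over those four scores.

```python
import torch

def four_way_loss(scores: torch.Tensor, chosen: int) -> torch.Tensor:
    """scores: (4,) reward-model scores for candidates y0..y3.
    chosen: index of the answer picked by the human annotator.
    The loss is -log softmax(scores)[chosen], i.e. 4-class classification."""
    return -torch.log_softmax(scores, dim=0)[chosen]

# Usage with stand-in scores for the four candidate answers
scores = torch.tensor([0.3, 1.2, -0.5, 0.1], requires_grad=True)
loss = four_way_loss(scores, chosen=1)   # the annotator picked y1
loss.backward()
```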

  • In the paper "Learning to summarize from human feedback" (Stiennon et al., 2020), the OpenAI team proposed a training pipeline similar to the one later used for InstructGPT and ChatGPT. Human annotators interact with the model, providing feedback and guidance that helps it learn how to perform the summarization task. Through this interaction, the model gradually improves its generation ability and produces more accurate and reasonable summaries.
    Paper link: https://arxiv.org/pdf/2009.01325
    The paper outlines three key steps: fine-tuning a supervised model on human-annotated data, training a reward function, and optimizing the policy with PPO.

  • Fine-tune a supervised model on human-annotated data: after pre-training a language model, human-annotated data is used to fine-tune it for the specific task. Task-related examples are fed to the model and its parameters are adjusted via back-propagation so that it better fits the task requirements. This step is essentially traditional supervised learning, with human-labeled data as the training samples and the model's task performance as the objective.

  • Train a reward function: in traditional reinforcement learning, the reward function is usually designed by hand. Here, a reward model is trained instead. Its purpose is to score generated text according to the goal of the task; trained on high-quality human comparisons, it evaluates the model's generations in a way that better matches the task requirements.

  • Optimize the policy with PPO: PPO (Proximal Policy Optimization) is a policy optimization algorithm. Here, PPO updates the model's policy so as to maximize the expected value of the learned reward. In addition, to keep the policy from over-optimizing the reward model, a KL penalty that keeps the policy close to the supervised model is added to the reward; a minimal sketch of this objective follows the list. (I plan to write a separate blog post on reinforcement learning that covers this part in detail; please stay tuned!)
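
Below is a minimal sketch of the KL-penalized reward and the clipped PPO objective described above (illustrative only, assuming a PyTorch setup; the function names and the beta coefficient are assumptions, not the papers' code):

```python
import torch

def penalized_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for PPO: a KL penalty at every token plus the
    reward-model score added at the final token of the response.
    rm_score:        scalar reward-model score for the whole response
    policy_logprobs: (T,) log-probs of the generated tokens under the current policy
    ref_logprobs:    (T,) log-probs of the same tokens under the frozen supervised model
    beta:            KL penalty coefficient
    """
    kl = policy_logprobs - ref_logprobs       # per-token log-ratio (approximate KL contribution)
    rewards = -beta * kl                      # penalize drifting away from the supervised model
    rewards[-1] = rewards[-1] + rm_score      # reward-model score arrives at the end of generation
    return rewards

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Usage with stand-in numbers for a 6-token response
policy_lp, ref_lp = torch.randn(6), torch.randn(6)
rewards = penalized_rewards(torch.tensor(0.8), policy_lp, ref_lp)
advantages = rewards - rewards.mean()         # crude stand-in for GAE advantages
loss = ppo_clip_loss(policy_lp, policy_lp.detach(), advantages)
```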

These papers all explore using human participation to guide and improve the training of language models. Incorporating human expertise and judgment provides more accurate labels and feedback, which improves the quality of what the models generate. This human-in-the-loop training approach has proved a fruitful direction for the development and advancement of language models.

I am still studying the relevant material; this post is to be continued and should be completed in a day or two.
