An update to large-model RLHF: DeepMind proposes ReST, a self-training offline reinforcement learning framework

Article link: https://arxiv.org/abs/2308.08998

The rise of large language models (LLMs) has been built on a range of fundamental techniques, such as the Transformer architecture, autoregressive language modeling, prompt learning, and instruction tuning. These techniques produced base generative models such as GPT-3 and PaLM. On top of these base models, researchers introduced reinforcement learning from human feedback (RLHF) to build reliable models aligned with human preferences, such as ChatGPT, and it is these chat models that truly brought LLMs into the public eye. However, because RLHF depends on online policy updates, it incurs a large training compute cost and is prone to "external attacks".

To address these problems, a research team from Google DeepMind proposed Reinforced Self-Training (ReST), a new algorithm that aligns LLM outputs with human preferences more efficiently than RLHF. The design of ReST is inspired by viewing language model alignment as a growing-batch reinforcement learning problem: starting from an initial LLM policy, the method generates an offline dataset from that policy and then applies an offline RL algorithm to these samples to update the policy. The authors focus their evaluation on machine translation, a basic NLP task, and the experimental results show that ReST significantly improves the model's translation quality compared with RLHF.

01. Introduction

How to efficiently align the output of LLMs with human preferences or values is currently a key issue in improving LLM performance. Without proper alignment, LLMs may produce high-risk or outright wrong content, which can have a devastating impact on downstream applications. Commonly used RLHF methods learn a reward model from human-annotated feedback and then use it as the reinforcement learning objective for fine-tuning and aligning the LLM. However, RLHF usually relies on online RL methods such as PPO [1] and A2C [2], which repeatedly query the reward model during training to score new samples drawn from the updated policy, incurring a high computational cost. To address this problem, this paper proposes ReST, a self-training reinforcement learning algorithm. ReST removes human annotators from the feedback training loop and instead generates and reuses offline data for feedback training. The authors cleverly design an outer/inner loop mechanism, as shown in the figure below.

The outer loop, called the Grow step, samples from the current policy to generate an alignment dataset. The inner loop, called the Improve step, filters the dataset produced by the outer loop (ranking and filtering samples with a human-preference scoring function) and then fine-tunes the policy on the filtered data. The two loops alternate, reducing the training cost of data sampling. Because ReST no longer depends on an online RL loss, it becomes a general reinforcement learning framework that allows different offline RL losses to be plugged into the Improve step, making the overall framework more flexible. A minimal sketch of this loop structure is given below.
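The following Python sketch only illustrates how the Grow and Improve steps might be organized; all names (policy.sample, policy.finetune, reward_fn, the threshold schedule and its default values) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the ReST Grow/Improve loop structure (assumed interfaces).
# `policy` is any object exposing sample(prompt, n) and finetune(pairs);
# `reward_fn` stands in for the learned human-preference scoring function.

def rest(policy, prompts, reward_fn, num_grow_steps=2, num_improve_steps=3,
         samples_per_prompt=8, initial_threshold=0.0, threshold_increment=0.1):
    """Self-training loop: Grow generates offline data, Improve filters and fine-tunes."""
    for _ in range(num_grow_steps):
        # Grow (outer loop): sample candidate outputs from the current policy
        # and score each one once with the reward model.
        dataset = []
        for prompt in prompts:
            for output in policy.sample(prompt, n=samples_per_prompt):
                dataset.append((prompt, output, reward_fn(prompt, output)))

        # Improve (inner loop): repeatedly filter the fixed dataset by an
        # increasing reward threshold and fine-tune the policy on what survives.
        threshold = initial_threshold
        for _ in range(num_improve_steps):
            filtered = [(p, o) for p, o, r in dataset if r >= threshold]
            policy = policy.finetune(filtered)  # e.g. with a BC or other offline RL loss
            threshold += threshold_increment
    return policy
```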

02. Method

2.1 The overall process of ReST

2.2 Grow outer loop

2.3 Improve inner loop

03. Experimental results

The experiments are conducted mainly on machine translation benchmarks. The authors select three datasets: IWSLT 2014, WMT 2020, and Web Domain. The first two are standard machine translation datasets, while the last is an internal test set. Each dataset contains a set of source texts together with reference translations produced by human annotators. The authors compare several offline reinforcement learning losses as baselines, including OAC, BVMPO, GOLD, and BC.

3.1 Analysis of the Improve loop

The authors first analyze how the two loop steps of ReST affect final performance, for example whether increasing the number of Improve steps raises the reward model's score. As shown in the figure below, the grey bars show the score of the supervised learning baseline, while the purple bars show different ReST variants formed by varying the loss function, the number of Improve steps (I), and the number of Grow steps (G).

It can be seen that as the number of Improve steps increases, ReST's average reward score rises on all three datasets.
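To make the variant naming concrete, each variant can be thought of as a small configuration of loss type, Grow-step count, and Improve-step count. The sketch below is purely illustrative; the field names are assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ReSTVariant:
    """Illustrative description of one ReST experimental variant."""
    loss: str           # offline RL loss used in the Improve step, e.g. "BC" or "GOLD"
    grow_steps: int     # G: number of Grow (data-generation) steps
    improve_steps: int  # I: number of Improve (filter + fine-tune) steps

# e.g. a variant using the BC loss with one Grow step and three Improve steps:
variant = ReSTVariant(loss="BC", grow_steps=1, improve_steps=3)
```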

3.2 Analysis of the Grow loop

Each Grow step adds more samples for offline training, so the authors compare model performance after a single Grow step with that after two Grow steps, as shown in the figure below. The ReST variant with two Grow steps shows significant improvements on the IWSLT 2014 and Web Domain datasets.

3.3 Analysis of the loss function

In the figure below, the authors compare the average reward scores of the supervised training model with ReST variants using different loss functions. Even with only a single Grow step, the different ReST variants (purple) obtain reward scores significantly better than those of the supervised learning model (grey).

In addition, in the single-Grow-step setting, the BC loss clearly outperforms the other loss functions; a minimal sketch of this loss is given below.
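For intuition, the BC (behavioral cloning) loss in this setting is simply the supervised negative log-likelihood of the reward-filtered samples under the current policy. The PyTorch-style sketch below is an assumption-level illustration; the function and argument names are not from the paper.

```python
import torch.nn.functional as F

def bc_loss(logits, target_ids, pad_id=0):
    """Behavioral-cloning loss: token-level negative log-likelihood of the
    filtered (high-reward) target sequences under the current policy.

    logits:     (batch, seq_len, vocab) scores produced by the policy model
    target_ids: (batch, seq_len) token ids of the filtered samples
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab)
        target_ids.reshape(-1),               # flatten to (batch*seq_len,)
        ignore_index=pad_id,                  # do not penalize padding tokens
    )
```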

3.4 Comparison between ReST and online RL algorithms

The authors choose PPO, which is widely used in RLHF pipelines, as the online RL algorithm for comparison. In the experiments, PPO is given access to an amount of training data comparable to what ReST sees with a single Grow step. The comparison results are shown in the table below.

It can be seen that the average reward score of online PPO is roughly the same as that of ReST, but only in the single-Grow-step setting. When ReST uses multiple Grow and Improve steps (with the same amount of training data), its performance improves significantly.

04. Summary

This article presents ReST, a self-training offline reinforcement learning algorithm built around a new outer/inner loop mechanism (a Grow outer loop and an Improve inner loop) that efficiently schedules policy generation and policy updates in the RL process. The framework also scales well and can be flexibly combined with a variety of RL losses. The authors' experiments on machine translation benchmarks show that the commonly used BC loss lets ReST achieve higher reward scores across a variety of settings. ReST also signals to the community that, when aligning LLMs with human preferences, there are more RL optimization methods worth trying beyond online RL algorithms such as PPO.

References

[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[2] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

