Wombat: 93% of ChatGPT's Performance! Aligning Language Models with Humans Without RLHF


Text | zzy

Paper:
https://arxiv.org/abs/2304.05302v1

Training code:
https://github.com/GanjinZero/RRHF

Model weights:
https://huggingface.co/GanjinZero/wombat-7b-delta

The paper proposes RRHF, an alignment method that trains language models without reinforcement learning. The authors use ChatGPT or GPT-4 as the scoring model to train two language models, Wombat-7B and Wombat-7B-GPT4. Wombat-7B reaches 93% of ChatGPT's performance on part of Vicuna's test set (without GPT-4 API access, the full set could not be evaluated): GPT-4 gave ChatGPT's replies an average score of 8.5 and Wombat-7B's replies an average of 7.9.

OpenAI's ChatGPT understands a wide variety of human instructions and responds well to the needs of different language tasks. This remarkable ability comes from a novel fine-tuning approach for large language models: RLHF (Reinforcement Learning from Human Feedback). Unlike traditional supervised fine-tuning, RLHF trains the LLM with reinforcement learning, unlocking the model's ability to follow human instructions and aligning its capabilities with human needs and values.

Current RLHF research mainly uses the PPO algorithm to optimize the language model. PPO involves many hyperparameters and requires several separate models to cooperate during each training iteration, so incorrect implementation details can easily lead to poor training results.


Is a reinforcement learning algorithm actually necessary for aligning with humans? The authors, from Alibaba DAMO Academy, propose a ranking-based human-preference alignment method that requires no reinforcement learning: responses generated by different language models (which can be ChatGPT, GPT-4, or the model currently being trained) are scored, and a ranking loss aligns the model's output probabilities with those preferences. Unlike PPO, RRHF's training process can use outputs from human experts or GPT-4 as comparisons, and the trained model can serve as a generative language model and a reward model at the same time.
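For reference, the objective can be written compactly. This is my reading of the paper's formulation (check the paper for exact normalization details): given a query x and k candidate responses y_1, ..., y_k with scores r_1, ..., r_k, the model's length-normalized conditional log-probability p_i serves as a proxy reward, and a pairwise ranking loss is combined with a cross-entropy term on the best-scored response:

```latex
p_i = \frac{1}{\|y_i\|}\sum_{t}\log P\!\left(y_{i,t}\mid x,\, y_{i,<t}\right),
\qquad
L_{\mathrm{rank}} = \sum_{r_i < r_j}\max\!\left(0,\; p_i - p_j\right),

L_{\mathrm{ft}} = -\sum_{t}\log P\!\left(y_{i',t}\mid x,\, y_{i',<t}\right),
\quad i' = \arg\max_i r_i,
\qquad
L = L_{\mathrm{rank}} + L_{\mathrm{ft}}.
```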

Suhail, CEO of Playground AI, called this the most exciting paper he had read recently.


The figure below, taken from the paper, compares the PPO and RRHF pipelines.

[Figure: comparison of the PPO training pipeline and the RRHF training pipeline]

The RRHF algorithm effectively aligns the language model's output probabilities with human preferences, and the training idea is very simple. The trained model has several notable properties:

  • It needs only 1 or 2 models, whereas PPO requires 4.

  • Supervised fine-tuning (SFT) can be seen as a special case of RRHF, and RRHF itself has few hyperparameters.

  • It can be used directly as both a language model and a reward model.

  • It fits the reward model's preferences with much lower training difficulty, matching the effect of PPO (see the PyTorch sketch after this list).
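A minimal PyTorch sketch of this combined objective is shown below. It is an illustration under my own assumptions, not the authors' implementation (the real training code is in the RRHF repository linked above); the function name rrhf_loss and its tensor layout are hypothetical.

```python
# Minimal sketch of the RRHF objective (not the authors' code; see
# https://github.com/GanjinZero/RRHF for the real implementation).
import torch

def rrhf_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss plus a fine-tuning loss on the best response.

    logprobs: shape (k,), length-normalized log-probability p_i that the
              current model assigns to each of the k candidate responses.
    rewards:  shape (k,), score r_i given to each response by the scoring
              model (e.g. ChatGPT/GPT-4) or by human annotators.
    """
    # Ranking term: for every pair with r_i < r_j, penalize the model when
    # it gives the lower-scored response the higher likelihood (p_i > p_j).
    diff_p = logprobs.unsqueeze(1) - logprobs.unsqueeze(0)  # [i, j] = p_i - p_j
    diff_r = rewards.unsqueeze(1) - rewards.unsqueeze(0)    # [i, j] = r_i - r_j
    rank_loss = torch.clamp(diff_p[diff_r < 0], min=0).sum()
    # Fine-tuning term: cross-entropy on the best-scored response (simplified
    # here to the negative normalized log-likelihood of that response).
    ft_loss = -logprobs[rewards.argmax()]
    return rank_loss + ft_loss
```

Note that the ranking term is hinge-style: it only produces a gradient when a lower-scored response currently out-ranks a higher-scored one, which is what lets RRHF get by with ordinary gradient descent instead of PPO's clipped policy-gradient machinery.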

Wombat, the newly open-sourced pre-trained model, aligns better with human preferences. The authors found experimentally that Wombat-7B possesses complex abilities such as role-playing and counterfactual reasoning. They have also released the Wombat weights on Hugging Face for everyone to try. Below is Wombat's self-introduction.

  • Give an introduction of wombat

  • Wombat is a marsupial mammal native to Australia. It is typically found in mountainous areas and is known for its burrowing behavior. Wombats are typically around two feet long and weigh up to 30 pounds. They have a gray coat with white on their chests, and they have thick and powerful hind legs which they use for digging. Wombats are mostly nocturnal and eat a variety of insects, leaves, and grasses.


Link:
https://huggingface.co/GanjinZero/wombat-7b-delta
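If you want to try it locally, a minimal generation script with the Hugging Face transformers library might look like the sketch below. Note that wombat-7b-delta is released as delta weights, so you would first recover the full model by merging them with the base LLaMA-7B weights; the local path ./wombat-7b and the bare prompt format are assumptions for illustration.

```python
# Hypothetical usage sketch: assumes the wombat-7b-delta weights have already
# been merged with the base LLaMA-7B weights and saved to ./wombat-7b
# (the delta repo itself is not directly loadable as a full model).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./wombat-7b")
model = AutoModelForCausalLM.from_pretrained("./wombat-7b", device_map="auto")

prompt = "Give an introduction of wombat"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```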



Source: https://blog.csdn.net/xixiaoyaoww/article/details/130164695