[AI] Algorithm cheat sheet - the most complete RLHF framework at present: AlpacaFarm

The main purpose of this series of articles is to quickly clarify the differences in principle and the application scenarios of different methods.

For theoretical details, please refer to the References at the end of the article, which also point to more rigorous and detailed descriptions.

Among the many methods for fine-tuning large models, RLHF has long been considered the key to ChatGPT's success, but it also carries the highest cost and training barrier. The RLHF pipeline of the GPT series has never been open-sourced, so teams researching this step could only build heavily modified versions of Fine-Tuning Language Models from Human Preferences, a process that is both complicated and expensive.

AlpacaFarm was proposed precisely to address this pain point for the open-source community. The main goal of the framework is to integrate the common techniques for training instruction-following models from human feedback and to provide a complete, unified pipeline, greatly lowering the cost and barrier to training. The overall framework is shown below; it is claimed that a complete RLHF run takes only 24 hours and costs about $200.
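As a rough orientation before the details, the pipeline can be read as four stages chained together. The sketch below only fixes the data flow; the stage implementations are passed in as callables because the real ones live inside alpaca_farm, so every name here is a placeholder rather than the actual API:

```python
from typing import Callable, Any

# High-level shape of the pipeline AlpacaFarm unifies:
# SFT -> pairwise preference collection -> reward model -> RL optimization (e.g. PPO).
# The stage functions are injected as callables; this sketch only fixes the data flow.
def rlhf_pipeline(
    sft_stage: Callable[[], Any],               # supervised fine-tune the base model
    collect_preferences: Callable[[Any], Any],  # simulated or human pairwise labels
    fit_reward_model: Callable[[Any], Any],     # train a reward model on the pairs
    rl_stage: Callable[[Any, Any], Any],        # optimize the policy against the reward model
) -> Any:
    policy = sft_stage()
    pairs = collect_preferences(policy)
    reward_model = fit_reward_model(pairs)
    return rl_stage(policy, reward_model)
```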

Briefly, the training process continues that of Alpaca: it uses the 52K Alpaca dataset, extracting 10K of it for SFT (supervised fine-tuning) and leaving the remaining 42K for preference labeling and evaluation. The dataset is already available on Hugging Face.
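A minimal sketch of that split, assuming the Hugging Face `datasets` library and the `tatsu-lab/alpaca` dataset id (the official AlpacaFarm splits are published separately under `tatsu-lab/alpaca_farm`, so treat the id and the manual split below as illustrative):

```python
# Sketch of the data split described above: 52K Alpaca instructions,
# 10K reserved for SFT and the remaining ~42K for preference labeling / evaluation.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")  # ~52K instruction examples
alpaca = alpaca.shuffle(seed=42)

sft_data = alpaca.select(range(10_000))                 # supervised fine-tuning set
pref_data = alpaca.select(range(10_000, len(alpaca)))   # ~42K left for preference labeling

print(len(sft_data), len(pref_data))
```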

The key to this cost reduction is the use of simulated annotation, which is somewhat similar to the self-instruct approach: an API LLM stands in for the human labelers. This process is about 45 times cheaper than manual labeling, and comparing models trained on the simulated labels with those trained on actual human-labeled data shows that the overall results are highly consistent.
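The idea can be sketched as a single prompted comparison. The prompt wording and model name below are illustrative assumptions, not AlpacaFarm's actual annotator pool (the paper combines several prompted API annotators and injects noise to mimic human variability):

```python
# A toy simulated preference annotator: an API LLM is asked to pick the better
# of two candidate responses, standing in for a human labeler.
# Prompt wording and model choice are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()

def simulated_preference(instruction: str, output_a: str, output_b: str) -> str:
    prompt = (
        f"Instruction: {instruction}\n\n"
        f"Response A: {output_a}\n\n"
        f"Response B: {output_b}\n\n"
        "Which response follows the instruction better? Answer with 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # any capable API model can play the annotator
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    return resp.choices[0].message.content.strip()  # expected: "A" or "B"
```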

Conclusion

The main contributions of AlpacaFarm include:

  1. Simulated annotation: reduces cost and improves efficiency
  2. An automated model evaluation suite: integrates Alpaca interaction data and public datasets to evaluate RLHF results
  3. Implementations of mainstream RLHF methods, including PPO, Expert Iteration, Best-of-n sampling, etc. (see the sketch after this list)
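Of these, Best-of-n is the easiest to illustrate: sample n candidate responses from the policy and keep the one the reward model scores highest. A minimal sketch, where generate() and score() are assumed interfaces rather than the actual alpaca_farm API:

```python
# Best-of-n sampling sketch: draw n candidates from the policy and return the one
# with the highest reward-model score. generate() and score() are assumed
# interfaces, not the actual alpaca_farm API.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # samples one response from the policy
    score: Callable[[str, str], float],   # reward model: (prompt, response) -> scalar
    n: int = 16,
) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: score(prompt, resp))
```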

The full code is available on GitHub: tatsu-lab/alpaca_farm - A Simulation Framework for RLHF and alternatives (https://github.com/tatsu-lab/alpaca_farm).

References

https://crfm.stanford.edu/2023/05/22/alpaca-farm.html

https://github.com/tatsu-lab/stanford_alpaca

Stanford releases AlpacaFarm, which can reduce RLHF labor costs by 45 times (open source) - Zhihu
