"See" vol25. Review | PKU-Beaver open source project team: Let's talk about the first reproducible RLHF benchmark

To address the difficulty of reproducing RLHF and the safety issues of large language models built on it, the Peking University team has open-sourced a project called PKU-Beaver.

For the 25th closed-door sharing session of the "See" series, we were delighted to invite members of the PKU-Beaver open source project team to introduce the project, share its recent progress, and hold a live Q&A with the online audience.

The first round of the open-source SafeRLHF dataset contains 20k samples. To use the complete dataset, please scan the QR code below and fill in the application form.

During the event's open-mic discussion, attendees raised one RLHF question after another. We have transcribed, edited, and organized the PKU-Beaver team's answers into the written Q&A below. To learn more about the PKU-Beaver team's work, please visit the project homepage: https://github.com/PKU-Alignment/safe-rlhf

You can also add Xiaojiang (thexiaojiang) on WeChat to join the RLHF discussion community and connect with more people working in related fields!

1. What order of magnitude of data does RLHF need to align with human preferences?

Yang Yaodong (Assistant Professor, Institute for Artificial Intelligence, Peking University): There is no fixed answer. According to views in current papers, RLHF can use far less data than the pre-training stage, which today involves on the order of a few hundred billion tokens. Both our own findings and those of vertical-domain model vendors show that, for a specific domain, nowhere near that much data is needed.

At present, most people using RLHF are focused on whether they can turn the model into a professional, domain-specific model. In other words, to achieve the first H (Helpful), a few hundred billion tokens are enough, corresponding roughly to a corpus of several hundred books. But for the other two Hs (Harmless and Honest), according to our current tests, Harmless alone requires preference data corresponding to at least several hundred thousand tokens.

2. Is there anything special about data selection for RLHF? Is it necessary to directly sample dialogue data between real online voice assistants and their users?

Yang Yaodong: RLHF has three steps: the first is SFT, the second is reward-model learning, and the third is PPO. The prompts collected during data collection are very important. Work around Meta's (Facebook's) LLaMA found that with very high-quality prompts, as few as 1,000 prompts can substitute for the effect of the latter two steps of RLHF. Prompts matter so much that they may even require high-quality annotation by human annotators. When chatting with colleagues in industry, we found that collecting real dialogue data between online voice assistants and users is very important, which is one reason everyone is eager to ship their own large-model products: by collecting real human data and then annotating it to high quality, RLHF can be carried further.
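
For reference, the three stages can be written down in the standard (InstructGPT-style) form below, where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{SFT}}$ the supervised fine-tuned policy, $r_\phi$ the learned reward model, and $\beta$ a KL-penalty coefficient. This is the generic formulation, not necessarily the exact objectives used by PKU-Beaver.

```latex
% Stage 1: supervised fine-tuning (SFT) on demonstration data
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}}\big[\log \pi_\theta(y \mid x)\big]

% Stage 2: reward model from pairwise preferences (y_w preferred over y_l)
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}_{\mathrm{pref}}}\big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\big]

% Stage 3: PPO on the learned reward, with a KL penalty toward the SFT policy
\max_\theta\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta}\Big[r_\phi(x,y) - \beta\,\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{SFT}}(y\mid x)}\Big]
```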

3. Humans do not need to read hundreds of millions of tokens to reach their current level of cognition, so how can the cost of large models be reduced, in terms of both compute and data?

Yang Yaodong: This is a very good question, but it leaves out the fact that humans have evolved over a very long time, with a great deal of knowledge handed down and a long history of accumulated domain expertise. Large models are a brand-new human endeavor and cannot be directly compared with the human brain. If I remember correctly, the number of tokens a person can read in a lifetime is on the order of a billion, far smaller than the datasets required by today's large language models.

However, the relationship between large language models and human cognition has not yet been studied that clearly. I don't think it is necessary for large language models to produce highly intelligent behavior at the human brain's very low power budget (30 to 40 watts); that may not be achievable in the short term. But how to reduce the cost of large models is indeed worth thinking about. Technologies such as weight quantization can help you achieve alignment in more specialized directions at lower cost on the deployment or fine-tuning side.
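
As a concrete illustration of the quantization route, here is a minimal sketch assuming the Hugging Face transformers library with its bitsandbytes 8-bit integration; the checkpoint name is only an example, not a recommendation from the team.

```python
# Minimal sketch: load a causal LM with 8-bit weight quantization to cut
# deployment memory. Assumes `transformers` plus `bitsandbytes` and a GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # example checkpoint, swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # weight quantization: roughly half the memory of fp16
    device_map="auto",   # spread layers across available devices
)

prompt = "Explain why preference data matters for RLHF."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```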

4. Besides the ranking itself, how can the reasons behind a ranking be used as an important feedback signal and fed back to the model for targeted fine-tuning?

Yang Yaodong: This is also a very good question. RLHF, as an efficient algorithm for aligning models, is still fairly crude and brute-force: it collects human feedback and performs hard alignment. I think the technique itself will undergo sweeping innovation in the next few years. Judging by submissions to NeurIPS this year, there are roughly 700 to 1,000 papers related to RLHF, and the technique will keep improving. New approaches will add and exploit intrinsic rewards, extrinsic rewards, or unsupervised behavior-learning methods to make preference learning more efficient. Of the three technical routes mentioned earlier (learning a policy, learning a preference function, and learning a reward function), if you have a way to recover the reward behind the preferences, I think that reward can give you more information. The reason preferences are used now is that the step from GPT-3.5 to ChatGPT was mainly about reaching human-like answers, and "human-like" is hard to describe with an explicit formula, so preferences are used instead. But I believe more feedback signals will follow, including intrinsic and extrinsic rewards for fine-tuning the model, and there is even work that drops RL altogether and uses the feedback signal directly for tuning, which is a very meaningful attempt.
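
To make "recovering the reward behind the preference" concrete, here is a minimal PyTorch sketch of the standard pairwise (Bradley-Terry) reward-model loss; the score tensors are random placeholders, whereas in practice they would come from a reward head on top of a language model.

```python
# Minimal sketch of pairwise preference (Bradley-Terry) reward learning.
# reward_chosen / reward_rejected stand in for scalar scores r_phi(x, y_w)
# and r_phi(x, y_l) produced by a reward model for the preferred and
# rejected responses to the same prompt.
import torch
import torch.nn.functional as F

batch_size = 4
reward_chosen = torch.randn(batch_size, requires_grad=True)    # r_phi(x, y_w)
reward_rejected = torch.randn(batch_size, requires_grad=True)  # r_phi(x, y_l)

# Maximize the log-probability that the preferred answer scores higher:
# loss = -log sigmoid(r(x, y_w) - r(x, y_l))
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
print(f"pairwise preference loss: {loss.item():.4f}")
```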

5. Emergent abilities only appear once model parameters reach a certain scale, but conversely, if we compress a large model into a small one, will those emergent abilities be lost? Is there a future trend toward miniaturizing large models or deploying them on-device?

Yang Yaodong: Our group has not worked on this, so I will just share some personal views. I think it is feasible. For example, the lottery ticket hypothesis from a few years ago is very similar to your point: training a model at a very large scale does not mean the most effective representation has to be that large. There are methods, such as quantization, that can ultimately yield a small model with the same performance as the large one. Going from large to small, performance can be preserved, but training the small model directly does not work. This does not contradict the view on emergence, and it is a technically very worthwhile direction.

6. With new technologies emerging and training parameters increasing, where is the upper limit of model capability? Will models ever surpass humans?

Yang Yaodong: Although everyone talks about "big" models, my personal view is that current models are not big enough. First, the human brain has about 100 trillion connections, roughly 1,000 times larger than current models, so there is still plenty of room to grow on model size. Second, compared with large scientific projects, the money humanity spends on training large models is relatively modest. So there will be a great deal of future work on making models bigger, until we have trained on all the available data in the world or used up all the compute we can marshal. The CEO of Anthropic has also said that over the next five to ten years we will see a thousand-fold improvement in large models. But whether they can surpass human intelligence is, in my personal opinion, still unlikely, because one thing remains unclear: how large models currently perform reasoning, and they do not yet have the ability to do full concept learning. Without that ability, they will still fall far short of real humans. In the foreseeable future, though, as scaling laws continue to play out, there is still visible room for growth.

7. Will open-sourcing Chinese datasets be considered in the future?

Yang Yaodong: We are just a university team, and in terms of funding we are not yet in a position to open-source Chinese datasets. What we have done is open up the pipeline behind all of this, including data verification and code reproduction. Our whole team has fewer than 10 people, and at this stage we are benchmarking against Stanford's Alpaca. A Chinese dataset is indeed something we want to do, but Chinese requires more compute, data annotation, and so on. If other companies or institutions want to open-source one, our team would be very happy to share the experience of the BEAVER project and contribute our own strength to the open source community.

8. Is training currently done only on safety data? If so, wouldn't that hurt performance on general tasks?

Yang Yaodong: This question is very important: how do we avoid hurting Helpful? We currently have mechanisms such as the PTX loss, which exists precisely to keep the helpful ability from degrading. As for the two Hs, they essentially interact; research including Anthropic's shows that helpful and harmless sometimes trade off against each other. This is why I think the trade-off can be handled with Safe RL, because the algorithm makes some adaptive adjustments behind the scenes. We have of course produced some quantitative results, including runs on BIG-bench and the like, and those results will be released.
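
For context, the PTX term mentioned here follows the InstructGPT-style "pretraining mix" idea: a pretraining log-likelihood term is added to the RL objective so that maximizing the reward does not erode general language ability. In the generic form below (not necessarily the exact PKU-Beaver implementation), $\beta$ is the KL-penalty coefficient and $\gamma$ the pretraining-mix coefficient:

```latex
\max_\theta\;
\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{RL}},\, y\sim\pi_\theta}
\Big[ r_\phi(x,y) - \beta\,\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{SFT}}(y\mid x)} \Big]
\;+\;
\gamma\,\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{pretrain}}}\big[ \log \pi_\theta(x) \big]
```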

9. What do you think is the best way to build a Chinese evaluation set? What considerations are useful when evaluating the overall capabilities of a large model?

Yang Yaodong: You can take the prompt data built by us and by some other organizations and do your own screening, to see how the models perform at least along the dimensions we can think of. Then you can use GPT-4 (or another large model) as a scorer, and you can roughly gauge where a model currently stands. But our work is mainly in English; for Chinese evaluation sets there is work by Professor Huang Minlie of Tsinghua University, and you can follow the safety evaluation prompts he has released and use them.
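
As a minimal sketch of the "use GPT-4 as a scorer" idea (assuming the openai Python SDK with its v1-style client; the rubric and model name are only illustrative, not the team's evaluation protocol):

```python
# Minimal LLM-as-judge sketch: ask GPT-4 to score a model's answer from 1 to 10.
# Assumes the `openai` v1 SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> str:
    """Return GPT-4's 1-10 score (plus a short justification) for an answer."""
    rubric = (
        "You are an evaluator. Score the assistant's answer to the user's "
        "question from 1 to 10 for helpfulness and harmlessness. "
        "Reply with the score followed by one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("How do I safely dispose of old batteries?",
            "Take them to a certified battery recycling drop-off point."))
```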

As for a moderation API, in June we will take all the annotated safety data and train a moderation tool for everyone.

10. How should the diversity and quality of data be assessed?

Yang Yaodong: This question is very important, and not only at the RLHF stage: even at the SFT stage it really matters to be able to construct higher-quality data. I have noticed that teams like MOSS have done a lot of work on data diversity. In reinforcement learning, and especially multi-agent reinforcement learning, we pay great attention to policy diversity; for example, when solving a zero-sum game, if you do not have diverse strategies you are easily beaten. On the diversity dimension, including more scientific ways of evaluating diversity, I think there will be a lot of research in the future, gradually moving from policy diversity to diversity on the data side, or directly measuring the diversity of a model's answers, and so on. At present much of this still relies on engineering workarounds.
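
As one simple, commonly used proxy for "measuring the diversity of a model's answers", here is a minimal distinct-n sketch; the sample answers are placeholders, and this is only one of many possible diversity measures.

```python
# Minimal sketch: distinct-n, the ratio of unique n-grams to total n-grams
# across a set of generated answers. Higher values mean more lexical diversity.
from typing import List

def distinct_n(answers: List[str], n: int = 2) -> float:
    ngrams = []
    for text in answers:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

samples = [
    "The capital of France is Paris.",
    "Paris is the capital of France.",
    "France's capital city is Paris.",
]
print(f"distinct-1: {distinct_n(samples, 1):.3f}")
print(f"distinct-2: {distinct_n(samples, 2):.3f}")
```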

11. Do you have more experience to share on constructing harmlessness datasets?

Yang Yaodong: Generally, if you want to do safety work, what you actually want is harmful data, because harmful data generates a cost signal, or a very low preference score, and that is what lets you keep harmless outputs from turning harmful. So much of the time the problem is the reverse: how do you induce the model to give harmful answers? Current models are already fairly safe, and many questions are simply filtered out and never answered, so what you need to do is craft inducing questions. This takes a lot of high-quality prompts, prompts that can coax the model into saying bad things, plus human effort: after two or three rounds of dialogue, the model may finally say something unsafe under human inducement, and that statement is what you then score. This kind of dataset takes real work to build. Most of the data being annotated now, whether instructions answered together with humans or data labeled with GPT, is itself highly harmless, because models such as Claude and GPT have quite strict, even somewhat conservative, safety layers.
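
To illustrate what such a red-teaming sample might look like once collected, here is a hypothetical record format; the field names are made up for illustration and are not the actual safe-rlhf schema.

```python
# Hypothetical record format for a multi-turn red-teaming sample.
# Field names are illustrative only, not the actual safe-rlhf schema.
red_team_record = {
    "dialogue": [
        {"role": "human", "text": "inducing question, round 1"},
        {"role": "assistant", "text": "model reply, round 1"},
        {"role": "human", "text": "follow-up that pushes harder, round 2"},
        {"role": "assistant", "text": "model reply that may turn unsafe"},
    ],
    "final_response_is_safe": False,        # annotator's safety judgment
    "harm_categories": ["discrimination"],  # which categories were violated
    "severity": 2,                          # e.g. 0 (none) to 3 (severe)
}
print(red_team_record["final_response_is_safe"])
```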

12. For moderation, is it currently modeled as supervised multi-label classification, similar to OpenAI's approach? Is it multi-label classification over 100,000 data points, or is there a more efficient way to model it?

Ji Jiaming (PhD candidate at Peking University): Here is how we are labeling the data. We have built a large number of prompts, and in the first batch we will release 20,000 prompts designed to induce harmful answers, for example racist, sexist, or pornographic remarks. During annotation we perform multi-label classification on each prompt, which by itself is a simple text-classification task. But if we want to test a model's toxicity, that actually depends on the answers the model itself produces, so we classify QA pairs and moderate at the QA-pair level, where Q is the question we constructed and A is the answer the model produces. For the same question, one model may give a safe answer and another an unsafe one, so we judge safety from the QA pair as a whole. So far we have only done single-turn dialogue; in a multi-turn setting, a model may not produce biased answers in the first or second round but may do so in later rounds. Once our moderation model is ready, with for example 12 or 13 categories, we put 200 prompts in each category, let Alpaca generate the corresponding answers, constrain at a specific fine-grained level (pornography, discrimination, bias, etc.), and examine the resulting QA combinations under those constraints; that is roughly the logic by which we analyze each model. As for model size, QA classification requires some reasoning ability from the model; we are currently working with 125M and 7B models.
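
As a minimal sketch of the multi-label QA moderation setup described above (assuming Hugging Face transformers; the base checkpoint and category count are placeholders, not the actual PKU-Beaver moderation model):

```python
# Minimal sketch: multi-label moderation over (question, answer) pairs.
# The checkpoint and category count are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CATEGORIES = 13  # e.g. discrimination, pornography, violence, ...

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=NUM_CATEGORIES,
    problem_type="multi_label_classification",  # BCE-with-logits loss when labels are given
)

question = "Tell me something insulting about my neighbor's ethnicity."
answer = "I can't help with remarks that target someone's ethnicity."

# Moderate the QA pair jointly: the prompt alone is not enough, because
# different models give safe or unsafe answers to the same question.
inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)        # independent probability per category
flagged = (probs > 0.5).squeeze(0)   # which harm categories are triggered
print(flagged.tolist())
```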

We plan to release the paper around June.

13. Can the current categories cover the targets of instruction attacks?

Ji Jiaming: Take GPT: its first mechanism is keyword filtering. For example, if you ask ChatGPT something containing keywords related to discrimination against Black people, it will not give you an answer. Second, take Fudan's MOSS as an example: its training corpus is itself a safe dataset, so if I ask it a dangerous question it will refuse to answer. There are websites that publish the prompts of successful attacks, and attacks mounted through context are still hard to avoid.

Without a model backed by an RLHF safety dataset, many more dangerous remarks would be produced, and that is exactly our original motivation.

14. RLHF requires human annotation, which is very costly for small companies. Is there work with better generalization ability for the RL part, such as RLAIF?

Yang Yaodong: This is possible. There is now a lot of self-alignment work worth following. On the question of safety in particular, humans must not be taken completely out of the loop: if you rely only on AI to annotate, many problems can slip through. That said, I believe AI can already match human values on most problems, especially the very obvious mistakes. But you need to be careful when using it: first, watch out for cases that slip through; second, if a question is too sensitive, it may simply be blocked.

15. Could you introduce your data-labeling methodology?

Yang Yaodong: Data labeling is very painful. Many model-training ("alchemy") companies keep their own in-house data labelers and may not outsource this work at all. Our university probably cannot support such a team, so the more common approach is to cooperate with data-labeling companies. That involves the labeling company's scoring system, scorecards, evaluation metrics, and acceptance criteria; there are a great many details in this.

16. If the fine-tuning stage already uses a large amount of safety data, how much benefit does RLHF still bring?

Yang Yaodong: This is a good question. If you have a lot of data at the fine-tuning stage, should you ask whether SFT alone is enough? That is well worth exploring. Some very recent work argues that once a large language model has learned very general knowledge, you can adapt it with just a little SFT data and it does not necessarily need further alignment; work around LLaMA is one such example. But SFT cannot completely solve the 3H problem: besides the harmlessness issue we mentioned, there is also the problem of hallucination. On that front, domestic reproductions of RLHF have not been very successful so far, but from public reports and papers we can see that the method is still very important for preventing hallucination. Of course, if you have high-quality safety data, I believe it is possible to make your model very safe.

17. Can RLHF or Safe RLHF solve safety problems and achieve alignment using only parameter-efficient fine-tuning (such as LoRA)? Or is it better to fine-tune the full model parameters?

Yang Yaodong: Our team believes LoRA is better suited to things like Stable Diffusion. For a product-grade, safety-critical chatbot, this approach may still fall short, because it is, after all, a method designed from a resource-constrained perspective. For getting the most out of your model's performance, I don't think LoRA is a particularly good tool. Although it shows some good results in open-source frameworks, I think it is still fairly difficult to rely on for the safety problem.
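
For readers who still want to try the parameter-efficient route despite these caveats, a minimal LoRA sketch might look like this (assuming the Hugging Face peft library; the checkpoint, target modules, and hyperparameters are illustrative only):

```python
# Minimal sketch: wrap a causal LM with LoRA adapters via the `peft` library.
# The checkpoint, target modules, and hyperparameters are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights will train
```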

18. Are there theoretical results that support the ability of Safe-RL-trained LLMs to extrapolate?

Yang Yaodong: First of all, every case listed on our homepage is an extrapolation; none of them are in our training dataset. Second, any theoretical claim here touches a core issue of machine learning: can you generalize? The methodology we use, including PPO-Lagrangian and CPO-style methods, is still machine learning and therefore does generalize. Strictly speaking, though, whether it can avoid every future safety problem is not guaranteed. At least in robotics, it has shown a certain degree of generalization, across different scenarios and random factors in different environments.
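
For reference, the PPO-Lagrangian formulation mentioned here optimizes the reward subject to a cost constraint by introducing a multiplier $\lambda$; in the generic constrained-RL form below (not the exact safe-rlhf implementation), $R$ is the reward, $C$ the cost, and $d$ the cost budget:

```latex
\max_{\theta}\ \min_{\lambda \ge 0}\;
\mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
\;-\; \lambda \Big( \mathbb{E}_{\tau \sim \pi_\theta}\big[ C(\tau) \big] - d \Big)
```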

-The End-

Origin blog.csdn.net/hanseywho/article/details/131123551