"Reinforcement Learning Principles and Python Actual Combat" reveals the core technology RLHF of large models! ——AIC Squirrel Event Seventh

Table of contents

1. What is RLHF?

2. What tasks is RLHF suitable for?

3. What are the advantages and disadvantages of RLHF compared with other methods of constructing reward models?

4. What kind of human feedback is good feedback?

5. What are the categories of RLHF algorithms, and what are their advantages and disadvantages?

6. What are the limitations of using human feedback in RLHF? 

7. How to reduce the negative impact of human feedback?


1. What is RLHF?

Reinforcement learning trains agents with reward signals. Some tasks, however, come with no environment that can produce a reward signal and no ready-made way to generate one. In such cases a reward model can be built to supply the reward signal. The reward model can be trained with data-driven machine learning, using data provided by humans. A system that trains a reward model for reinforcement learning from human feedback data in this way is called reinforcement learning from human feedback (RLHF), as shown in the figure below.

Figure: Reinforcement learning from human feedback: the reward model is trained with human feedback data and then used to generate reward signals.
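To make the figure concrete, here is a minimal, illustrative sketch (not code from the book) of the two stages: a reward model is first fitted to human-provided scores, and then that model, rather than an environment, supplies the reward signal for a toy agent. The features, scores, and bandit-style "agent" below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: fit a reward model to human feedback.
# Hypothetical data: 2-D features of candidate outputs and human-provided scores.
X = rng.normal(size=(200, 2))
human_scores = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# A linear reward model fit by least squares (a stand-in for any regressor).
w, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

def reward_model(features):
    """Learned reward model: replaces the missing environment reward."""
    return features @ w

# Stage 2: use the reward model as the reward signal for a simple agent.
# Toy "policy": pick one of a few fixed candidate outputs, learning from model rewards.
candidates = rng.normal(size=(5, 2))   # five fixed candidate outputs
values = np.zeros(len(candidates))     # estimated value of each candidate
for step in range(1000):
    a = rng.integers(len(candidates)) if rng.random() < 0.1 else int(values.argmax())
    r = reward_model(candidates[a])    # reward comes from the model, not an environment
    values[a] += 0.05 * (r - values[a])  # incremental value update

print("candidate values:", values.round(3))
print("preferred candidate:", int(values.argmax()))
```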

2. What tasks is RLHF suitable for?

RLHF is suitable for tasks that satisfy all of the following conditions simultaneously:

  • The task to be solved is a reinforcement learning task, but the reward signal is not readily available and there is no known rule for computing it in advance. To train an agent, we therefore consider building a reward model to supply the reward signal.
    Counter-example: video games already provide scores, so the game program itself can supply the reward signal and no human feedback is needed.
    Counter-example: for some systems the rule that determines the reward is already known. The reward of a trading system, for instance, can be defined entirely by the money earned, so the reward signal can be computed directly from a known mathematical expression without human feedback.

  • It is difficult to build a suitable reward model without human feedback data, human feedback actually helps to obtain one, and the feedback can be gathered at a reasonable cost (money, time, and so on). If human feedback offers no advantage over data collected by other means, there is no need for it.

3. What are the advantages and disadvantages of RLHF compared with other methods of constructing reward models?

The reward model can be specified manually, or it can be learned with machine learning methods such as supervised learning or inverse reinforcement learning. RLHF learns the reward model with machine learning and uses human feedback during the learning process.

Manually specifying the reward model versus learning it with machine learning: this is essentially the general discussion of the pros and cons of machine learning. The advantages of machine learning methods include requiring less domain knowledge, being able to handle very complex problems, processing large amounts of high-dimensional data quickly, and improving in accuracy as data grows. The disadvantages include the resources needed for training and use (data, time, storage, electricity, and so on), possibly poor interpretability of the model and its outputs, and the risk that the model is flawed, has insufficient coverage, or is attacked (for example, prompt injection against a large language model).

Human feedback data versus non-human feedback data: human feedback is often time-consuming and labor-intensive, different people (or the same person at different times) may behave inconsistently, people make mistakes intentionally or unintentionally, and the results of human feedback may be no better than data generated by other means. We examine the limitations of human feedback in more detail below. Non-human feedback data, such as machine-collected data, is limited in the kinds of data it can cover. Some data can only be collected by humans, or is hard for machines to collect: subjective, humanistic judgments (such as assessing the artistry of a work of art), or tasks that machines cannot yet do (such as games where AI is still inferior to humans).

4. What kind of human feedback is good feedback?

  • Good feedback needs to be sufficient: the feedback data can be used to learn a reward model, and it is correct enough, large enough, and comprehensive enough that the reward model is good enough to yield a satisfactory agent in the subsequent reinforcement learning (a toy sketch of such data checks appears after this list).
    The evaluation metrics involved here include: metrics of the data itself (correctness, volume, coverage, consistency), metrics of the reward model and its training process, and metrics of the reinforcement learning training process and the trained agent.

  • Good feedback needs to be obtainable. Feedback must be available at a reasonable cost in time and money, with the cost kept under control and without incurring other risks (such as legal risks).

    The evaluation metrics involved include: data preparation time, the number of people involved in data preparation, data preparation cost, and whether other risks are introduced.
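As a rough illustration of the data-side metrics above (volume, coverage, consistency), the following toy sketch computes a few such statistics over a hypothetical feedback log; the record layout and the spread-based consistency measure are assumptions made here for illustration.

```python
from collections import defaultdict

# Hypothetical feedback log: (prompt_id, annotator_id, score)
feedback = [
    ("p1", "a1", 4), ("p1", "a2", 4), ("p1", "a3", 5),
    ("p2", "a1", 2), ("p2", "a2", 5),
    ("p3", "a3", 3),
]

volume = len(feedback)                                 # data volume
coverage = len({prompt for prompt, _, _ in feedback})  # number of distinct prompts covered

# Consistency: for prompts scored by several annotators, how far apart are the scores?
scores_by_prompt = defaultdict(list)
for prompt, _, score in feedback:
    scores_by_prompt[prompt].append(score)
spreads = [max(s) - min(s) for s in scores_by_prompt.values() if len(s) > 1]
mean_spread = sum(spreads) / len(spreads) if spreads else float("nan")

print(f"volume={volume}, coverage={coverage} prompts, mean score spread={mean_spread:.2f}")
```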

5. What are the categories of RLHF algorithms, and what are their advantages and disadvantages?

RLHF algorithms fall into two categories: RLHF that trains the reward model in the style of supervised learning, and RLHF that trains the reward model in the style of inverse reinforcement learning.

1. In RLHF systems that train the reward model in the style of supervised learning, the human feedback is the reward signal itself or a quantity derived from it (such as an ordering of reward signals).

Directly providing reward values and providing derived quantities each have pros and cons. The advantage of direct reward values is that, once obtained, they can be used directly as labels for supervised learning. The disadvantage is that reward values given by different people, or by the same person at different times, may be inconsistent or even contradictory. Derived quantities of the reward signal include comparisons or rankings of reward model inputs. For some tasks it is hard to give consistent reward values, but comparing two candidates is much easier. Comparisons, however, carry no magnitude (density) information: when there are many similar samples, the samples corresponding to some range of rewards can be overly dense, and training may even fail to converge.

It is generally believed that comparison-type feedback leads to better median performance, but not to better average performance.
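A common way to learn from comparison-type feedback is a pairwise logistic (Bradley-Terry style) loss: the reward model should score the human-preferred response higher than the rejected one. The PyTorch sketch below is a minimal illustration with made-up feature vectors, not a specific implementation prescribed by this article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class RewardModel(nn.Module):
    """Maps a feature vector of a candidate response to a scalar reward."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

dim = 8
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Hypothetical comparison data: for each pair, humans preferred `chosen` over `rejected`.
chosen = torch.randn(256, dim) + 0.5      # stand-in features of preferred responses
rejected = torch.randn(256, dim) - 0.5    # stand-in features of rejected responses

for epoch in range(200):
    r_chosen = model(chosen)
    r_rejected = model(rejected)
    # Pairwise logistic (Bradley-Terry) loss: push r_chosen above r_rejected.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final pairwise loss: {loss.item():.4f}")
```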

2. In RLHF systems that train the reward model in the style of inverse reinforcement learning, the human feedback is not a reward signal but reward model inputs that should receive larger rewards. That is, humans provide more correct quantities, texts, categories, physical actions, and so on, and thereby tell the reward model that the reward should be relatively large in those cases. This is essentially the idea of inverse reinforcement learning.

Compared with RLHF that trains the reward model in the supervised-learning style, this approach has the advantage that the samples used to train the reward model are no longer limited to the samples the system itself asks humans to judge; those system-generated samples may be limited precisely because the system has not yet found the optimal region.

In the initial stage of system construction, the reference answers provided by users can also be used to transform the initial reinforcement learning problem into an imitation learning problem.

Such designs can be further divided by the type of feedback: one kind asks humans to give expert opinions independently, and the other asks humans to improve on existing outputs. Asking humans for independent opinions is similar to asking humans for expert policies in imitation learning (not identical, of course, since the input of the reward model is not only actions). Asking humans to modify existing reference content reduces the cost of each annotation, but the existing content may interfere with independent human judgment (and this interference can be positive or negative).
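The "turn it into imitation learning" idea mentioned above can be sketched as plain behavior cloning: treat the human reference answers as demonstrations and fit a policy to them with supervised learning. The observations, the stand-in "expert" rule, and the network below are all invented for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical demonstrations: observation -> expert action (one of 4 discrete actions).
obs = torch.randn(512, 6)
expert_actions = (obs[:, 0] > 0).long() + 2 * (obs[:, 1] > 0).long()  # stand-in "expert" rule

policy = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Behavior cloning: supervised learning on (observation, expert action) pairs.
for epoch in range(100):
    logits = policy(obs)
    loss = loss_fn(logits, expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()

accuracy = (policy(obs).argmax(dim=1) == expert_actions).float().mean()
print(f"imitation accuracy on demonstrations: {accuracy.item():.2%}")
```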

6. What are the limitations of using human feedback in RLHF? 

As mentioned earlier, human feedback can be time-consuming and labor-intensive, and it does not necessarily guarantee accuracy or consistency. In addition, the following issues can lead to an incomplete or incorrect reward model, and therefore to unsatisfactory agent behavior in the subsequent reinforcement learning training.

1. The population providing human feedback may be biased or limited.

This problem is related to the issues that arise with sampling methods in mathematical statistics. The population providing feedback to the RLHF system may not be the best population. Due to factors such as cost and availability, teams with low labor costs are sometimes chosen, but such teams may not be professional enough, may hold different legal, moral, or religious views, and may introduce discriminatory content. There may also be malicious annotators who deliberately provide misleading feedback.

2. Human decisions are not necessarily smarter than machine decisions.
On some problems, machines can already do better than humans: in board games such as chess and Go, humans are no match for AI programs. On other problems, humans can process less information than data-driven programs: in autonomous driving, for example, humans can only make decisions from two-dimensional images and sound, while a program can process information in three-dimensional space over continuous time. In theory, then, the quality of human feedback can be worse than that of programs.

3. The identity of the person providing feedback is not introduced into the system.
Everyone is unique: each person has their own upbringing, religious beliefs, moral values, education and work experience, knowledge, and so on. We cannot introduce every characteristic of every person into the system. If the differences between people along some feature dimension are ignored, a lot of useful information is lost, and the performance of the reward model degrades.

Take a large language model as an example. Through prompt engineering, the user can ask the model to communicate in a specific role or style. Sometimes the output text is required to be polite, courteous, and even flattering; sometimes it needs to be blunt and less polite. Sometimes the output should be more creative; sometimes it should stick rigorously to the facts. Sometimes it should be brief and concise; sometimes it should be detailed and complete. Sometimes it should stay within a neutral, objective scope of discussion; sometimes it should take the humanistic and social context into account. The different identity backgrounds and communication habits of the people providing feedback may correspond exactly to these different output requirements, in which case the characteristics of the annotators are very important.

4. Human nature can lead to imperfect data sets.

For example, a language model may obtain high-scoring evaluations through flattery and excessive praise, yet such high scores do not mean the problem has really been solved, and they violate the original intent of the system design: the score looks high, but it may have been earned by avoiding controversial topics or by flattering the evaluator rather than by genuinely solving the problem.

In addition, human feedback carries other non-technical risks, such as security risks (for example, data leaks) and regulatory and legal risks.

7. How to reduce the negative impact of human feedback?

To address the fact that human feedback is time-consuming and may lead to an incomplete or incorrect reward model, one can train the reward model, train the agent, and comprehensively evaluate both while human feedback data is still being collected, so that defects in the human feedback are detected as early as possible and adjustments can be made in time.

To address quality problems and wrong answers in human feedback, the feedback can be verified and audited, for example by introducing verification samples with known rewards to check the quality of an annotator's feedback, or by requesting feedback for the same sample several times and comparing the results.
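As a toy illustration of the verification idea above, known-reward samples can be mixed into the annotation queue and each annotator's accuracy on them checked afterwards; the identifiers and scores below are hypothetical.

```python
# Hypothetical verification samples with agreed "gold" rewards.
known_rewards = {"v1": 5, "v2": 1, "v3": 3}

# Hypothetical annotator answers, which include the verification samples.
annotator_answers = {
    "alice": {"v1": 5, "v2": 1, "v3": 3, "x9": 4},
    "bob":   {"v1": 2, "v2": 1, "v3": 5, "x7": 2},
}

for name, answers in annotator_answers.items():
    checked = [sid for sid in answers if sid in known_rewards]
    correct = sum(answers[sid] == known_rewards[sid] for sid in checked)
    print(f"{name}: {correct}/{len(checked)} verification samples correct")
```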

To address the problem of poorly chosen annotators, and on the premise of keeping labor costs under control, the people who provide feedback can be selected with scientific methods. Sampling methods from mathematical statistics, such as stratified sampling and cluster sampling, can be borrowed to make the feedback population more reasonable.
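As a small illustration of stratified sampling applied to annotator selection, the sketch below draws the same fraction from each professional stratum; the annotator pool and the "profession" attribute are invented for illustration.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical annotator pool with a "profession" attribute used as the stratum.
pool = (
    [{"id": f"law{i}", "profession": "lawyer"} for i in range(30)]
    + [{"id": f"doc{i}", "profession": "doctor"} for i in range(50)]
    + [{"id": f"eng{i}", "profession": "engineer"} for i in range(20)]
)

strata = defaultdict(list)
for person in pool:
    strata[person["profession"]].append(person)

# Stratified sampling: draw the same fraction from every stratum.
fraction = 0.2
selected = []
for profession, members in strata.items():
    k = max(1, round(fraction * len(members)))
    selected.extend(random.sample(members, k))

print({p: sum(s["profession"] == p for s in selected) for p in strata})
```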

To address the problem that the reward model suffers because annotator characteristics are not included in the feedback data, those characteristics can be collected and used in training the reward model. For example, when training a large language model, the professional background of each annotator (lawyer, doctor, and so on) can be recorded and taken into account when training the reward model. When the user asks the agent to work like a lawyer, the system should use the part of the reward model learned from data provided by lawyers to supply the reward signal; when the user asks the agent to work like a doctor, it should use the part learned from data provided by doctors. In addition, professional advice can be sought throughout the implementation of the system to reduce legal and security risks.
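One possible way to incorporate annotator characteristics, sketched here under the assumption of a learned profession embedding (an illustrative design, not an architecture prescribed by this article), is to condition the reward model on a profile identifier so that the same response can be scored differently under a "lawyer" or "doctor" profile.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PROFESSIONS = {"lawyer": 0, "doctor": 1, "other": 2}

class ProfileRewardModel(nn.Module):
    """Reward model conditioned on an annotator/profession profile."""
    def __init__(self, feature_dim, n_profiles, embed_dim=4):
        super().__init__()
        self.profile_embedding = nn.Embedding(n_profiles, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(feature_dim + embed_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, features, profile_ids):
        # Concatenate response features with the profile embedding before scoring.
        z = torch.cat([features, self.profile_embedding(profile_ids)], dim=-1)
        return self.net(z).squeeze(-1)

model = ProfileRewardModel(feature_dim=8, n_profiles=len(PROFESSIONS))

# The same (untrained) model scores one response under two different profiles.
response = torch.randn(1, 8)
for name, idx in [("lawyer", 0), ("doctor", 1)]:
    r = model(response, torch.tensor([idx]))
    print(f"reward under '{name}' profile: {r.item():.3f}")
```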

 

Today's book recommendation: "Reinforcement Learning: Principles and Python Practice"

Squirrel Activity: that wraps up this issue's recommended book; see you in the next issue!

Event dates: 8.19-8.25

How to claim: send the blogger a private message to redeem the reward


Source: blog.csdn.net/zhaochen1127/article/details/132372258