Model Training Basics: What is Reinforcement Learning?

Table of contents

Reinforcement Learning Basic Concepts

Reinforcement Learning Elements

A Reinforcement Learning Path

Value Function

State Value Function

Action Value Function

Difficulties in Combining Reinforcement Learning with NLP

ChatGPT and Reinforcement Learning

Reinforcement Learning Concept Mapping for ChatGPT

Create a reward model

Summary


Reinforcement learning last drew wide public attention in 2017, when the Go program AlphaGo defeated Ke Jie. AlphaGo, an artificial intelligence program developed by Google DeepMind, uses a deep reinforcement learning algorithm and continuously improves its play through self-learning and accumulated game experience, fully demonstrating the power of reinforcement learning. ChatGPT introduces reinforcement learning into the field of NLP, producing a human-like intelligence effect.

This section briefly introduces the basic concepts of reinforcement learning and how it is modeled in NLP, laying a foundation for learning the RLHF method.

Reinforcement Learning Basic Concepts

Reinforcement learning is a machine learning method in which an agent (Agent, the artificial intelligence model) learns how to make optimal decisions (Policy) through interaction with its environment. In reinforcement learning, the agent acts on the environment (Environment) by performing actions (Action) according to the current state (State), and in return receives rewards (Reward) or punishments from the environment. The goal of the agent is to learn a policy that maximizes the long-term cumulative reward.

Reinforcement learning is very similar to the evolution of organisms: genes mutate continuously, the environment screens the results, and the organisms that adapt to the environment survive. Reinforcement learning is applied very widely, mainly in fields such as games and natural language processing.

Its basic modeling diagram is shown below. In the Super Mario game, the protagonist Mario is the agent, controlled by a human or a model, and each level of the game is the environment in reinforcement learning. The process of playing a Mario game is actually an excellent example of reinforcement learning: at first we have no idea how to play, but after repeated attempts and failures, we eventually become masters of the game and clear the level. This illustrates the essence of reinforcement learning: it is a trial-and-error learning model. Put plainly, it means failing again and again, summing up experience, and finally succeeding. In a common saying, reinforcement learning is "a fall into the pit, a gain in your wit".

Reinforcement Learning Elements

Next, we use the example of Mario to explain several major elements of reinforcement learning.

  • State: The overall state composed of the environment and the agent. The state is related to time: at every moment it can change, because the agent may change and the environment may also change. A state is generally denoted by the letter s, and the states at all moments constitute a set, s ∈ S, where S is a finite set of states.

In the Super Mario game, the position of the player at each moment, as well as the various objects and dangerous obstacles on the game screen, belong to the state of the game.

In Go, the distribution of chess pieces on the board is the current state of Go.

  • Action: A description of the actions the agent can take. All actions constitute a set, a ∈ A, where A is a finite set of actions.

In the Super Mario game, the player can use the up, down, left, and right arrow keys on the controller, plus the shoot and jump buttons, six keys in total, to control Mario's actions.

In Go, a player can place a piece at a certain position on the board. The Go board is 19×19, so the agent can choose among at most 361 actions.

  • Policy: The mapping from the perceived state of the environment to an action, π(s) → a. That is, based on the current state, a decision function is used to choose a certain action.

In the Super Mario game, the player observes that there are gold coins (state) in front of him, so he presses the jump button (action) to get gold coins. This is a game operation strategy.

In Go, one player observes the situation on the board (state) and decides where to place the pieces (action).

  • Reward (feedback): The environment's feedback on the agent's action. After making a decision based on the current state, the agent moves to the next state and receives a feedback value.

In the Super Mario game, Mario has just eaten a gold coin (current state); the player presses the forward button (action), and Mario runs straight into the turtle ahead (next state) and the game fails (feedback). This failure is the game environment's negative feedback to the player, that is, a punishment (negative reward).

In Go, a player places a piece in the right position (action) and directly wins the game (feedback); this is the reward feedback of Go to the player.
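To make these elements concrete, here is a minimal sketch in Python; the state names, action names, policy rules, and reward values are illustrative assumptions based on the Mario example, not an actual game interface:

```python
# A minimal, illustrative sketch of the four elements using the Mario example.
# State names, actions, and reward values here are assumptions for illustration.

STATES = ["flat_road", "turtle_ahead", "trap_ahead", "princess_rescued", "game_over"]
ACTIONS = ["forward", "shoot", "jump"]  # a small, finite action set A

def policy(state: str) -> str:
    """A hand-written deterministic policy pi(s) -> a."""
    if state == "flat_road":
        return "forward"
    if state == "turtle_ahead":
        return "shoot"
    if state == "trap_ahead":
        return "jump"
    return "forward"

def reward(state: str) -> float:
    """Feedback from the environment: +1 for rescuing the princess, -1 for failing."""
    if state == "princess_rescued":
        return 1.0
    if state == "game_over":
        return -1.0
    return 0.0
```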

A Reinforcement Learning Path

Now, we can obtain a state-action-reward path:

$s_0 \to a_0 \to r_0 \to s_1 \to a_1 \to r_1 \to \dots \to s_t \to a_t \to r_t$

Such a path is also called a sample, or a trajectory.

Taking the Mario game as an example, this path is actually the whole process of the player playing the game:

Flat road => press the forward button => there is a turtle ahead => press the shooting button => there is a trap ahead => press the jump button => rescue the princess => win

Every play of the game is one reinforcement learning experiment, which can be regarded mathematically as one path sampling of reinforcement learning. In this example, the reward feedback is whether the player rescued the princess or died, for instance by hitting a turtle. Therefore, a complete game has to be played through before the reward value can be determined. Suppose a successful pass of the level is recorded as 1 and a failure as 0; for the successful path above, the reward value is 1.
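In code, such a sampled path can simply be recorded as a list of (state, action, reward) triples. The sketch below writes down the trajectory described above; the state names and the terminal reward of 1 are illustrative assumptions:

```python
# One sampled trajectory (path) from the Mario example, as (state, action, reward) triples.
# State names and reward values are illustrative assumptions; only the final step is rewarded.
trajectory = [
    ("flat_road",    "forward", 0.0),
    ("turtle_ahead", "shoot",   0.0),
    ("trap_ahead",   "jump",    1.0),  # jumping the trap leads to rescuing the princess: reward 1
]
```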

Continue to observe this path. When we are in state s and make the next action decision, the experience we actually rely on is based entirely on the current state. This property of depending only on the current state is called the Markov property.

Taking the Mario game as an example, when the player decides to press the jump button to jump over the trap, whether there was a turtle earlier on the game path has no effect on the subsequent state and decision-making.

Value Function

From the definitions of the basic concepts above, we can see that an agent needs to constantly update its own policy function in order to achieve the optimal effect. But how is this effect defined? This is where the value function comes in.

For example, in a game of Go, if I play a black stone at a certain position, the impact of this move on whether I can win is mainly reflected in the rest of the game.

In other words, the value of a policy needs to measure its impact on the subsequent steps, so the design of the value function must also account for the policy's impact on future operations.

State Value Function

State Value Function: This function is defined for a policy π and refers to the expected return that can be obtained starting from state s and then following policy π.

$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

where:

$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^T R_{t+T} = \sum_{k=0}^{T} \gamma^k R_{t+k}$

First, the state value function V_π(s) is not the return of a single trajectory that has already played out. It is an expected value describing the cumulative feedback reward G_t over the trajectories following state s, computed as a weighted sum of the feedback rewards over the next T steps. γ ∈ (0,1) is a discount factor whose weight shrinks for steps further in the future, which means the current policy pays more attention to immediate rewards and cares less about the rewards of distant future steps.

Take the Mario game as an example:

Trajectory: flat road => press the forward button => there is a turtle ahead => press the shooting button => there is a trap ahead => press the jump button => rescue the princess => win

Three states are involved: a flat road, a turtle ahead, and a trap ahead.

Three actions are involved: press the forward key, press the shooting key, and press the jump key.

Rescue the princess and get reward value 1.

Based on this, we can calculate the value of the "flat road" state.
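As a rough illustration, the following sketch computes the discounted return G_0 for this single trajectory; since there is only one sampled path, this discounted sum also serves as a one-sample Monte Carlo estimate of the "flat road" state value (the value γ = 0.9 is an assumption):

```python
# Discounted return G_0 for the Mario trajectory above (gamma and rewards are assumed values).
gamma = 0.9
rewards = [0.0, 0.0, 1.0]  # flat road -> turtle ahead -> trap ahead -> princess rescued

G_0 = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_0)  # 0.9 ** 2 * 1.0 = 0.81, a one-sample estimate of V_pi("flat road")
```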

Action Value Function

Similar to the state value function, the action value function refers to the expected reward that can be obtained by executing action a in the current state s and then following policy π.

$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

Assume Xiao Li holds 10,000 shares of a stock (state). The stock price will fluctuate in the future, and its potential value is the average of its possible future prices. The action value function then corresponds to the situation where Xiao Li now sells 5,000 shares (action): it covers the potential value of the remaining 5,000 shares, and of course it also includes the money Xiao Li directly earns by selling the 5,000 shares.

The relationship between the action value function and the state value function is as follows:

$Q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_\pi(s')$

where s′ refers to the state that follows state s. The action value function includes the immediate feedback reward brought by the current action, plus the potential value of the state that follows.

In the Mario game, when an action is selected according to the current state, the resulting next state is actually deterministic: when the player encounters a trap and presses the jump button, Mario jumps over the trap with certainty. Therefore, in this deterministic setting, the relationship between the two value functions simplifies to:

$Q_\pi(s, a) = r(s, a) + \gamma V_\pi(s')$
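Here is a small numeric sketch of this deterministic relation; the immediate reward and next-state value used below are made-up numbers for illustration only:

```python
# Illustrative: Q(s, a) = r(s, a) + gamma * V(s') when the next state s' is deterministic.
# The reward and state-value numbers below are assumptions, not learned values.
gamma = 0.9

def q_value(immediate_reward: float, next_state_value: float) -> float:
    return immediate_reward + gamma * next_state_value

# e.g. pressing "jump" in the "trap ahead" state: no immediate reward,
# but it leads deterministically to a state whose value is 0.9.
print(q_value(0.0, 0.9))  # 0.81
```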

Of course, the connotation and extension of reinforcement learning go far beyond the scope of the basic concepts explained above. We only need to clarify the essence and ideas of reinforcement learning to facilitate subsequent understanding of the methods adopted by ChatGPT.

Difficulties in Combining Reinforcement Learning with NLP

Reinforcement learning was first applied to board and card games such as Go and to various large-scale video games. For example, Glory of Kings has a reinforcement-learning-based AI for human-versus-machine play.

Reinforcement learning is easier to apply in these scenarios because these scenarios are artificially constructed virtual environments, and the ultimate goal is relatively simple. In other words, for games, the environment is easy to create, and rewards are easy to construct.

For the Super Mario game, the levels created by the game software constitute its reinforcement learning environment, and the game software immediately tells the player whether the level was passed or failed, which serves as the final reward value.

For AlphaGo, the environment is Go: the Go board is its whole world, and the final reward value is the judgment of which player wins, which for a computer is nothing more than executing a simple program.

Applying reinforcement learning to NLP, however, is far more difficult.

This is because natural language is essentially a channel for describing the world. Everything in the real world can be expressed through natural language, forming abstract concepts and abstract relationships. Therefore, the environment that NLP relies on is the entire real world, whose complexity far exceeds that of a 19×19 chessboard. At the same time, a reward function could not be designed in the NLP field: before ChatGPT was created, no computer program could accurately judge the quality of an NLP program's output. In the field of NLP, reward values can only be given manually, one by one.

Building a reward function for an NLP model is a bit like a chicken-and-egg cycle.

We hope to be able to produce an advanced language model that passes the Turing test through reinforcement learning.

In order to implement reinforcement learning, we need an advanced language model that passes the Turing test to serve as the reward.

ChatGPT and Reinforcement Learning

After introducing the basic concepts and modeling forms of reinforcement learning, we use two examples (Mario and Go) to introduce how reinforcement learning is applied to specific tasks. You can draw inferences from one instance and imagine how reinforcement learning should be applied in natural language processing.

In Go, the board is the environment, and there are many electronic Go programs that can automatically score a finished game and determine which side wins and which side loses. The same is true for the Mario game: the conditions for judging success or failure are very simple, and detecting that Mario hits a dangerous object or falls into a pit only requires a simple program. According to the learning principle of reinforcement learning, "a fall into the pit, a gain in your wit", the agent can only gain intelligence when it receives feedback about victory or failure.

In natural language, ChatGPT is the agent, and optimizing its model likewise requires a program or a person to tell ChatGPT whether the output it generated is good or not. If the output is not good, ChatGPT takes this negative feedback and tries again, following the same learning principle of "a fall into the pit, a gain in your wit".

Reinforcement Learning Concept Mapping for ChatGPT

Let's first review the ChatGPT workflow.

In the introduction to reinforcement learning above, we took the Mario game as an example to explain how to pass a level and win. That setting involves the concept of time: as Mario moves forward, the state keeps changing over time.

However, there is no concept of time in the ChatGPT modeling. Therefore, when ChatGPT is cast in the reinforcement learning framework, the corresponding reinforcement learning elements change as follows (a schematic sketch of the mapping follows the list):

  • Agent: the ChatGPT model itself;

  • Environment: The entire real world, described and abstracted through natural language. In practice ChatGPT interacts with people, so for ChatGPT, human users are its environment;

  • State: In ChatGPT, according to the ChatGPT input and output flow chart, the state is the prompt. The state no longer depends on time (unlike the Mario game); time is dropped, and the modeling uses a single timeless state. ChatGPT only pays attention to input and output; that is, we only care about the feedback for the current response, and there is no need to track the next action and the next feedback as in the Mario game;

  • Policy, action: The policy in reinforcement learning is essentially a probability distribution over responses to the environment, and ChatGPT itself is the probability distribution of a large language model (introduced in Section 3). Therefore, when the ChatGPT model returns an output text for a given input, it is performing an action according to the current model policy;

  • Reward (feedback): People's evaluation of whether ChatGPT's output is good or bad. Note that, as mentioned in the previous section, the cost of manually labeling data is very high; this is also why the reward is difficult to build.
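Putting the mapping above into code, a ChatGPT "episode" collapses into a single step: the prompt is the state, the generated text is the action, and a scalar evaluation of the response is the reward. The sketch below is only schematic; generate_response and human_score are hypothetical placeholders rather than real APIs:

```python
# Schematic single-step episode for ChatGPT-style RL (no time dimension).
# generate_response() and human_score() are hypothetical placeholders.

def generate_response(prompt: str) -> str:
    """Stands in for the language model's policy: produce an output text for the prompt."""
    return "..."  # the model's generated answer

def human_score(prompt: str, response: str) -> float:
    """Stands in for human (or reward-model) feedback on the output quality."""
    return 0.0

state = "Explain reinforcement learning in one sentence."  # prompt = state
action = generate_response(state)                          # output text = action
reward = human_score(state, action)                        # evaluation = reward
# One (state, action, reward) triple is a complete episode; there is no next state.
```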

Create a reward model

The main difficulty in applying reinforcement learning to ChatGPT lies in evaluating the output produced by the model. There is no convenient program or mechanism to give a proper evaluation, so it can only rely on manual feedback, one example at a time.

OpenAI, being rich and powerful, was willing to spend money on this seemingly unglamorous work. The company hired about 40 contractors and labeled a large amount of data. Using these labeled data, it built a reward model, which solved the problem of designing the reward function in one fell swoop.
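A common way to build such a reward model is from pairwise comparisons: labelers rank two responses to the same prompt, and the model learns to give the preferred response a higher score. The following PyTorch sketch shows only the pairwise ranking loss under that assumption; the tiny scoring network is a stand-in for illustration, not OpenAI's actual architecture:

```python
import torch
import torch.nn as nn

# A stand-in reward model: maps a feature vector of a (prompt, response) pair to a scalar score.
# In practice this would be a language model with a scalar head, not a tiny MLP.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

def pairwise_loss(chosen_features, rejected_features):
    """Pairwise ranking loss: push the score of the preferred response above the other one."""
    chosen_score = reward_model(chosen_features)
    rejected_score = reward_model(rejected_features)
    return -torch.log(torch.sigmoid(chosen_score - rejected_score)).mean()

# Usage with dummy features standing in for encoded (prompt, response) pairs.
chosen = torch.randn(8, 128)
rejected = torch.randn(8, 128)
loss = pairwise_loss(chosen, rejected)
loss.backward()
```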

Summary

  • Reinforcement learning lets the agent (Agent, i.e., the artificial intelligence model) learn how to make optimal decisions (Policy) through interaction with the environment.
  • The difficulty of combining NLP with reinforcement learning lies in the complex environment (there are infinitely many things in the real world) and in the reward function being hard to design (the quality of model outputs can only be evaluated manually).
