A brief tutorial on the policy gradient algorithm

Why we need policy gradient

Value-based reinforcement learning methods are generally deterministic: given a state, the value of each possible action can be computed and the best-valued action is chosen. However, such deterministic methods cannot handle some real-life problems, such as the rock-paper-scissors game. In rock-paper-scissors, the best strategy is to play rock, scissors, and paper at random with equal probability, because any gesture played more often than the others will be noticed by the opponent, who can then play the counter-gesture and win the game.

For another example, suppose we need to explore the maze in the picture above to reach the money bag. With a value-based approach, a given state always produces the same feedback, so the next action (left or right) chosen in the gray square (state) is fixed, always left or always right, which may trap us in a loop (one step left from the white square and two steps left from the gray squares) so that we never reach the money bag. Some readers may object that the state should not be represented by a single square but by all the squares of the maze. But consider a huge maze whose full layout we cannot observe: if we always make the same decision in the same perceivable state, we will still end up going in circles in some part of it. In fact, many practical problems, and board games in particular, have a similar character: seemingly identical states call for different actions, for example the opening of a chess game.

Policy gradient was created to solve problems like these, and its secret weapon is randomness. Randomness provides non-deterministic behavior, but this non-determinism is not completely random: it follows a probability distribution. Policy gradient does not compute values; it outputs a probability distribution over all actions and then samples an action according to that distribution. The basic principle of training is to adjust the policy through feedback: when a positive reward is received, increase the probability of the corresponding action, and when a negative reward is received, decrease it. In the figure below, the green dots on the left represent actions that received positive rewards, and the right side shows the updated policy: the probability of the regions that generated positive rewards has increased (they are closer to the center of the circle).
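To make this intuition concrete, here is a minimal sketch (not from the original post) of the "raise the probability of rewarded actions" idea, applied to rock-paper-scissors. The opponent model, learning rate, and number of steps are illustrative assumptions; against the assumed rock-heavy opponent the preferences drift toward paper.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(3)          # preferences for rock (0), paper (1), scissors (2)
lr = 0.1
beats = {0: 2, 1: 0, 2: 1}   # the move each gesture defeats

for step in range(5000):
    probs = softmax(theta)                         # stochastic policy over the three gestures
    action = rng.choice(3, p=probs)                # sample by probability, not argmax
    opponent = rng.choice(3, p=[0.6, 0.2, 0.2])    # assumed opponent who plays rock too often
    if action == opponent:
        reward = 0
    elif beats[action] == opponent:
        reward = 1
    else:
        reward = -1
    # positive reward raises the chosen action's probability, negative reward lowers it
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0                   # gradient of log softmax w.r.t. theta
    theta += lr * reward * grad_log_prob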

Let’s take a closer look at the policy gradient algorithm.

Basic concepts

Object system: the object that the policy gradient learns from and interacts with. This can be a system, such as a car or a game, or an opponent, such as a rock-paper-scissors player or a professional Go player.

Policy: \pi_{\theta}(\alpha|s) denotes the probability of taking action \alpha in state s under the parameters \theta.

Episode: one run in which a given policy generates actions and interacts with the object system, from the starting state until some terminal state is reached. For example, an episode of Go runs from the first move on the board until the outcome of the game is decided, and an episode of autonomous driving runs from the moment the car starts until it successfully reaches the designated destination; of course, crashing or driving into a pond is also a terminal state, just an undesirable one.

Trajectory: \tau denotes the sequence of states s, actions \alpha, and rewards r produced in one episode of policy gradient learning. For example: \tau=((s_0,\alpha_0,r_0),(s_1,\alpha_1,r_1),...). Since the policy produces non-deterministic actions, the same policy can generate many different trajectories over multiple episodes.

Round reward: r(\tau)=\sum_{t} r_t denotes the total reward produced by the sequence of actions in one episode. In the implementation, the expected reward of a policy is evaluated by averaging this quantity over multiple episodes.
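As a small made-up illustration of these concepts in code, a trajectory can simply be stored as the list of (state, action, reward) tuples of one episode, and the round reward is their sum:

trajectory = [((0, 0), "left", -1),     # step 0: state, action, reward
              ((0, 1), "left", -1),     # step 1
              ((0, 2), "right", 10)]    # step 2: reached the money bag

round_reward = sum(r for _, _, r in trajectory)      # r(tau) = 8

# With several episodes sampled from the same policy, the expected round reward
# is approximated by the average over those episodes:
trajectories = [trajectory]                          # normally holds many episodes
average_reward = sum(sum(r for _, _, r in t) for t in trajectories) / len(trajectories)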

The learning process of policy gradient is a process of policy optimization. A policy is generated randomly at the start; it knows nothing about the object system, so the actions it produces will most likely receive negative rewards. To do better, the policy has to be improved gradually. Policy gradient uses the same policy throughout a round of learning until the round ends, updates the policy by gradient ascent, then starts the next round of learning, and so on until the cumulative round reward converges.

Objective function

According to the basic principles of policy gradient mentioned above, we can formally describe its goal as the following expression:

J(\theta)=E[r_0+r_1+r_2+\cdots|\pi_\theta]

This function represents the expected cumulative reward obtained by following the policy \pi_\theta from step 0 onward. It is an expectation because the reward at each step is an expected reward under the policy (the probability distribution over action choices), rather than the certain reward of one chosen action. The goal of policy gradient is to find the parameters \theta of the policy that maximize J(\theta), that is:

\theta^*=argmax_\theta J(\theta)

The policy gradient algorithm uses gradient ascent to update the parameters \theta. By the definition of mathematical expectation:

J(\theta)=E_{\tau\sim \pi_\theta(\tau)}[r(\tau)]=\int_{\tau} r(\tau)\pi_\theta(\tau)d\tau

Taking the gradient with respect to \theta:

\nabla_\theta J(\theta)=\int_{\tau}r(\tau)\nabla_\theta\pi_\theta(\tau)d\tau

We cannot go any further here: \pi_\theta(\tau) depends on \theta in a way that prevents us from evaluating the gradient directly, so we use a small trick and convert it with \nabla log f(x)=\frac{\nabla f(x)}{f(x)}:

\nabla_\theta \pi_\theta(\tau)=\pi_\theta(\tau)\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}=\pi_\theta(\tau)\nabla_\theta log\pi_\theta(\tau)

Substitute:

\nabla_\theta J(\theta)=\int_{\tau}r(\tau)\pi_\theta(\tau)\nabla_\theta log\pi_\theta(\tau) d\tau

Now the integrand again contains the factor \pi_\theta(\tau), so by the definition of expectation we can convert back and obtain:

\nabla_\theta J(\theta)=E_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta log\pi_\theta(\tau)r(\tau)]

because:

log\pi_\theta(\tau)=log p(s_1)+\sum_{t=1}^{T} [log\pi_\theta(\alpha_t|s_t)+log p(s_{t+1}|s_t,\alpha_t)]

r(\tau)=\sum_{t=1}^{T} r(s_t,\alpha_t)

Since the initial state distribution and the transition probabilities p(s_{t+1}|s_t,\alpha_t) do not depend on \theta, their gradients vanish, and the final result is:

\nabla_\theta J(\theta)=E_{\tau \sim \pi_\theta(\tau)}[\sum_{t=1}^{T}\nabla_\theta log\pi_\theta(\alpha_t|s_t)(\sum_{t^{'}=1}^{T} r(s_{t^{'}},\alpha_{t^{'}}))]

Finally, the expectation is approximated by the sample mean over N trajectories:

\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^{N}[\sum_{t=1}^{T}\nabla_\theta log\pi_\theta(\alpha_t|s_t)(\sum_{t^{'}=1}^{T} r(s_{t^{'}},\alpha_{t^{'}}))]
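Written as code, this estimator is just an average over the N sampled trajectories. The sketch below assumes the policy model can supply a helper grad_log_prob(state, action) that returns \nabla_\theta log\pi_\theta(\alpha|s) as a NumPy array; that helper is an assumption for illustration, not part of the derivation.

def policy_gradient_estimate(trajectories, grad_log_prob):
    # trajectories: list of episodes, each a list of (state, action, reward) tuples
    # grad_log_prob: assumed helper, (state, action) -> grad_theta log pi_theta(action | state)
    grad = 0.0
    for tau in trajectories:
        episode_return = sum(r for _, _, r in tau)              # sum_t r(s_t, a_t)
        score = sum(grad_log_prob(s, a) for s, a, _ in tau)     # sum_t grad log pi(a_t | s_t)
        grad = grad + episode_return * score
    return grad / len(trajectories)                             # average over the N samples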

After all this work, the derivation is finally complete, and this is the objective we optimize. Even if you do not follow every step, you can take away the intuition: when the reward is high, the update increases the probability of the corresponding actions, and when the reward is low, it decreases them. The learning process of policy gradient is quite similar to traditional supervised learning: each round consists of a forward pass and a backward pass, the forward pass computes the objective function, the backward pass updates the parameters, and repeated rounds of learning drive the result to converge stably. The main difference is that the objective of supervised learning is straightforward, namely the difference between the predicted value and the true value, which can be obtained from a single forward pass, while the objective of policy gradient is built from all the rewards collected in a round and requires the mathematical transformations above to compute. In addition, since sampling is used to approximate the expectation, the same set of parameters also needs to be sampled over multiple episodes to improve the accuracy of the approximation.

Applications

Below we show how to apply PG to a concrete problem: learning to play the Atari game PONG. PONG simulates table tennis: the player controls a paddle on one side of the screen, moving it up and down like a table-tennis bat to return the ball. If the opponent fails to return the ball, your score increases by one; otherwise the opponent scores. The basic idea of using policy gradient to learn PONG is to let one side, controlled by the algorithm, play against the other side, controlled by the game, and to adjust the probability distribution over the actions (up or down) by observing the game state and the score changes so as to maximize the algorithm's score. The learning process can be written as the following code:

policy = build_policy_model()      # randomly initialized policy parameters theta
game.start()
trajectory = []                    # (state, prob, action, reward) tuples of the current episode
trajectories = []                  # episodes sampled with the current policy
count = 0                          # number of episodes sampled so far
while True:
    state = game.currentState()
    action, prob = policy.feedforward(state)
    reward = game.play(action)
    trajectory.append((state, prob, action, reward))
    if game.terminated():
        trajectories.append(trajectory)
        trajectory = []
        count += 1
        if count < SAMPLE_COUNT:
            game.restart()         # sample another episode with the same policy
        else:
            policy.backpropagation(trajectories)   # gradient ascent update of theta
            game.restart()
            trajectories = []
            count = 0

Line 1 constructs a policy model and randomly initializes the model parameters \theta. The model's forward pass computes a probability distribution over all actions from the state information, for example (up 90%, down 10%), and an action is then sampled from this distribution and sent to the game as an instruction.

Line 2 starts the game, and lines 3-5 initialize the trajectory buffer, the collection of sampled trajectories, and the episode counter.

Line 7 obtains the current state, such as the position of the paddle and the speed and direction of the ball.

Line 8 passes the state into the policy model to compute the action. The probability \pi_\theta(\alpha|s) of that action is also recorded here so that the gradient of the objective function can be computed in the backward phase.

Line 9 plays one step of the game using the action computed on line 8 and receives the reward.

Line 10 stores the interaction information of this step (state, action probability, action, and reward) into the current trajectory \tau.

Line 11 checks whether the game has ended. If it has not (the ball is still being returned by both sides), the loop continues to use the current policy model for the next interaction.

Lines 12-16: if the game has ended (one side failed to return the ball), the trajectory of the finished episode is saved and, as long as fewer than SAMPLE_COUNT episodes have been collected, a new game is started with the same policy model. That is, multiple samples are generated for the same policy model in order to reduce the influence of individual sample differences.

Once enough samples have been collected, line 18 updates the parameters through the backward pass of the policy model, and a new round of learning starts with the updated policy.
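For reference, here is one possible way the policy.backpropagation step could be implemented, sketched with PyTorch under the assumption that the policy model is a network policy_net mapping a numeric state to action logits, that actions are integer indices, and that trajectories holds (state, prob, action, reward) tuples as in the pseudocode above; the names policy_net and optimizer are illustrative, not from the original post. Minimizing the negative of the objective derived earlier is equivalent to gradient ascent on J(\theta).

import torch

def backpropagation(policy_net, optimizer, trajectories):
    loss = 0.0
    for trajectory in trajectories:
        states = torch.stack([torch.as_tensor(s, dtype=torch.float32)
                              for s, _, _, _ in trajectory])
        actions = torch.tensor([a for _, _, a, _ in trajectory])      # integer action indices
        episode_return = sum(r for _, _, _, r in trajectory)          # r(tau)
        logits = policy_net(states)                                    # shape (T, num_actions)
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        loss = loss - log_probs.sum() * episode_return                 # minimize -J(theta)
    loss = loss / len(trajectories)                                     # average over N samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()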

Issues and improvements

Although policy gradient can in theory handle complex problems that value-based methods cannot, it relies on sampled trajectories to optimize the policy, so its gradient estimate suffers from a relatively large variance caused by individual differences between samples, and the learning effect does not improve and converge steadily. A basic idea for improvement is to reduce the variance by removing terms that carry no information. Since the current action cannot affect rewards that were received in the past, the objective function can be changed to:

\nabla_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^{N}[\sum_{t=1}^{T}\nabla_\theta log\pi_\theta(\alpha_t|s_t)(\sum_{t^{'}=t}^{T} r(s_{t^{'}},\alpha_{t^{'}}))]

You can also use the classic discount factor to reduce the influence of rewards that lie far in the future:

\nabla_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^{N}[\sum_{t=1}^{T}\nabla_\theta log\pi_\theta(\alpha_t|s_t)(\sum_{t^{'}=t}^{T} \gamma ^{|t-t^{'}|}r(s_{t^{'}},\alpha_{t^{'}}))]
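Both improvements only change the weight attached to each log-probability term, namely the discounted reward-to-go from step t onward. A small helper (a sketch, not from the original post) that computes these weights for one episode's reward list could look like this:

def discounted_rewards_to_go(rewards, gamma=0.99):
    # returns, for each step t, sum over t' >= t of gamma**(t'-t) * rewards[t']
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# example: discounted_rewards_to_go([0, 0, 1], gamma=0.9) -> [0.81, 0.9, 1.0]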

In addition, we need to consider another problem: the round rewards obtained in practice do not accurately reflect the quality of the policy. For example, when the policy is already quite good, a below-average sample may produce a smaller round reward; since that reward is still non-negative, the plain policy gradient algorithm will nevertheless increase the probability of the actions that produced this trajectory, and the learning effect decreases instead of increasing. We therefore introduce a baseline value so that the algorithm increases the probability of actions that do better than the baseline and decreases the probability of actions that do worse, that is:

\nabla_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^{N}[\sum_{t=1}^{T}\nabla_\theta log\pi_\theta(\alpha_t|s_t)(\sum_{t^{'}=t}^{T} r(s_{t^{'}},\alpha_{t^{'}})-b)]

The baseline b is currently usually taken to be the mean return, that is b=\frac{1}{N}\sum_{i=1}^{N}r(\tau_i). Note that this baseline also changes dynamically as the policy is updated. Researchers are still exploring other ways to produce better baselines. In fact, if you think about it, dynamically estimating the baseline is itself a value-function estimation problem, so the policy gradient algorithm can be combined with value-based algorithms to achieve better results. Actor-Critic, in my opinion, is essentially such a combination of the policy gradient method and DQN. We will talk about that another time~
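A minimal sketch of this mean-return baseline in its simplest form, assuming each episode is stored as (state, action, reward) tuples: subtract the average total return of the N sampled episodes from each episode's total return before it is used to weight the gradient.

import numpy as np

def returns_with_baseline(trajectories):
    # episode return r(tau_i) for each sampled episode
    episode_returns = np.array([sum(r for _, _, r in tau) for tau in trajectories])
    baseline = episode_returns.mean()          # b = mean return over the N samples
    return episode_returns - baseline          # used in place of the raw r(tau)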

