Policy Gradient (PG)

Reference: EasyRL (the Mushroom Book)

Table of contents

1. Policy gradient algorithm

2. Policy gradient techniques

2.1 Adding a baseline

2.2 Assigning suitable credit

2.3 Critic

3. REINFORCE: Monte Carlo policy gradient


1. Policy gradient algorithm

Reinforcement learning has three components: the actor, the environment, and the reward function. The environment and the reward function are fixed before learning, so all we can do is adjust the actor's policy to obtain the maximum reward.

The policy is denoted \pi. In deep reinforcement learning, for example, the policy is a network with parameters \theta: a state (a vector or matrix) is fed into the network, and it outputs a probability distribution over actions. A policy can therefore be understood as the probability distribution the actor assigns to each action in a given state.
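As a minimal sketch of such a policy network in PyTorch, assuming a discrete action space; the state dimension, number of actions, and hidden size below are hypothetical placeholders:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),   # turn the final scores into a probability distribution
        )

    def forward(self, state):
        # state: a vector describing the current state; output: probabilities over actions
        return self.net(state)

policy = PolicyNetwork()
probs = policy(torch.randn(4))   # e.g. a distribution such as tensor([0.2, 0.5, 0.3])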

The interaction between the actor and the environment proceeds as follows: the environment gives a state s_1, the actor outputs an action a_1 upon seeing this state, the environment then produces a new state s_2 according to a_1, and this process continues until the environment terminates the episode.

During one episode, the sequence of environment outputs s and actor outputs a forms a trajectory:

\tau =\left \{ s_1,a_1,s_2,a_2,\ldots ,s_t,a_t \right \}

Given the actor parameters \theta, the probability of a particular trajectory \tau occurring can be calculated.

First compute the probability p\left ( s_1 \right ) that the environment outputs s_1, then the probability p_\theta \left ( a_1|s_1 \right ) that the actor executes a_1 in s_1. The environment then generates s_2 according to s_1,a_1, and so on. Whether these transitions follow a probability distribution depends on the environment's internal settings; generally they do, so the probability of a certain trajectory occurring is

p_\theta \left ( \tau \right )=p\left ( s_1 \right )\prod_{t=1}^{T}p_\theta \left ( a_t|s_t \right )p\left ( s_{t+1}|s_t,a_t \right )

In this formula, p\left ( s_{t+1}|s_t,a_t \right ) represents the environment and is preset, while p_\theta \left ( a_t|s_t \right ) represents the actor and depends on the parameters \theta, which are what we can optimize and control.

Besides the environment and the actor, the reward is a very important component. Every time the actor outputs an action given the state from the environment, it receives a reward; adding up all the rewards along a trajectory gives R\left ( \tau \right ). Because the actions taken by the actor in a given state are random, and the observations returned by the environment are also random, R\left ( \tau \right ) is a random variable. We can therefore only work with the expected reward \bar{R}_\theta =\sum_{\tau }R\left ( \tau \right )p_\theta \left ( \tau \right ), and the training goal is to maximize this expected reward using gradient ascent.

To perform gradient ascent, we first need the gradient of the expected reward, which can be derived through a series of mathematical steps:

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R\left ( \tau ^{n} \right )\nabla \log p_\theta \left ( a_{t}^{n}|s_{t}^{n} \right )

where R\left ( \tau ^{n} \right ) is the reward of the n-th trajectory and \log p_\theta \left ( a_{t}^{n}|s_{t}^{n} \right ) is the log-probability of taking a certain action in a certain state. The parameters \theta are then updated by gradient ascent. If R\left ( \tau ^{n} \right ) turns out to be positive at the end of the episode, the probability of executing a_t in s_t is increased; otherwise, it is reduced.
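As a rough illustration of this update (a sketch, not code from the original text), the loss below assumes log_probs is the list of \log p_\theta \left ( a_t|s_t \right ) tensors collected while sampling one trajectory and total_reward is its R\left ( \tau \right ); the minus sign turns the optimizer's gradient descent into gradient ascent.

import torch

def policy_gradient_loss(log_probs, total_reward):
    # Every log-probability in the trajectory is weighted by the same R(tau);
    # minimizing the negative value performs gradient ascent on the expected reward.
    return -total_reward * torch.stack(log_probs).sum()

# loss = policy_gradient_loss(log_probs, total_reward); loss.backward(); optimizer.step()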

2. Policy gradient techniques

2.1 Adding a baseline

If action a is taken in a given state s and the whole game yields a positive reward, the probability of (s, a) should be increased; if it yields a negative reward, the probability of (s, a) should be reduced. But in many games the reward is always positive, with a minimum of 0. Suppose there are three actions a, b, and c available in some state. Since every reward is positive, the log-probabilities of all three actions are pushed up during the gradient update, just with different weights: an action with a small weight has its probability raised less, and an action with a large weight has its probability raised more. Because the probabilities of a, b, and c must sum to 1, after normalization the action whose probability was raised the least actually ends up with a lower probability, and only the actions raised more end up with a higher probability.

This is only the ideal situation. In practice we estimate the expectation by sampling, and some actions may never be sampled. Suppose a is not sampled: the log-probabilities of b and c become larger, so the probability of a correspondingly becomes smaller, yet a is not necessarily a bad action; it simply was not sampled.

To solve this problem, we make the weight not always positive by subtracting a term b from the reward:

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left ( R\left ( \tau ^{n} \right )-b \right )\nabla \log p_\theta \left ( a_{t}^{n}|s_{t}^{n} \right )

b is called the baseline, so that R\left ( \tau \right )-b can be either positive or negative. If the total reward R\left ( \tau \right )>b, the probability of (s, a) is increased. If R\left ( \tau \right )<b, then even though R\left ( \tau \right ) is positive, such a small value is still bad, and we let the probability of (s, a) decrease, i.e. the score for taking this action in this state goes down. b can be taken to be the mean of R\left ( \tau \right ), so R\left ( \tau \right ) is recorded continuously during training.
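As a minimal sketch of this suggestion (keep a running record of R\left ( \tau \right ) and use its mean as b); the class and method names are just for illustration:

class RunningBaseline:
    """Running mean of R(tau), used as the baseline b."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def weight(self, episode_return):
        # Record R(tau), then return R(tau) - b, which can be positive or negative
        self.total += episode_return
        self.count += 1
        b = self.total / self.count
        return episode_return - b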

2.2 Assigning suitable credit

In the gradient-ascent formula, every (s, a) pair within an episode has the same weight, which is clearly unreasonable. In the same game, some actions may be good and some may be bad. Even if the overall result of the game is good, it does not mean that every action in that game was good; and if the overall result is bad, it does not mean that every action was bad. So we would like to multiply each action by a different weight, and these weights should reflect whether each individual action was good or bad.

In the first game of the example, the weight of \left ( s_b,a_2 \right ) is 3, but that 3 is not the merit of executing a_2 in s_b; on the contrary, after executing a_2 and entering the next state, the action taken there yields -2. In the second game, the weight of \left ( s_b,a_2 \right ) is -7, and that -7 is not the fault of executing a_2 in s_b either, but of executing a_2 in s_a. Therefore the result of the whole game does not represent the quality of each individual action.

One approach is to compute the reward of each state-action pair by summing only the rewards obtained after that action was executed, instead of summing all the rewards of the whole game. What happened in the game before the action was executed has nothing to do with the action, so rewards obtained before it cannot be counted as its contribution; the sum of all rewards obtained after the action is its real contribution. With this change, the weight of \left ( s_b,a_2 \right ) in the example above becomes -2.

Going one step further, a discount factor is applied to future rewards, so that the influence of an action on future rewards decays as the number of steps grows. For example, the weight of \left ( s_a,a_1 \right ) in the first game of the example becomes 5+\gamma *0+\gamma ^{2}*\left ( -2 \right ).
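A small sketch of these two refinements (summing only the rewards obtained after each step, and discounting them by \gamma per step); the function name is illustrative:

def rewards_to_go(rewards, gamma=0.99):
    # rewards: the per-step rewards of one episode, in time order
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g          # discounted sum of this and all later rewards
        returns.insert(0, g)
    return returns

# For the first game above with rewards 5, 0, -2, the first weight is
# rewards_to_go([5, 0, -2])[0] == 5 + 0.99 * 0 + 0.99 ** 2 * (-2)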

2.3 Critic

The b discussed above can also be estimated by a network, i.e. it is the output of a network. R-b is called the advantage function, written A^{\theta }\left ( s_t,a_t \right ). To compute the value of the advantage function, some model has to interact with the environment so that the subsequent rewards are known. The meaning of the advantage function is: supposing we take action a_t in state s_t, how much better a_t is than the other possible actions. What the advantage function cares about is not absolute goodness but relative goodness, that is, relative advantage, because a baseline b is subtracted inside it; the action is only relatively good, not absolutely good. A^{\theta }\left ( s_t,a_t \right ) is usually estimated by a network called the critic.
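As a hedged sketch of how a critic could provide the baseline: the snippet below assumes the critic is a small value network estimating V(s) and approximates the advantage as G_t - V(s_t); the network sizes are placeholders and this is only one common choice, not the only possible form of critic.

import torch
import torch.nn as nn

# Hypothetical critic: a value network estimating V(s) for a 4-dimensional state
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

def advantage(state, g_t):
    # A(s_t, a_t) ~ G_t - V(s_t): how much better this action turned out
    # compared with what the critic expects from this state on average
    v = value_net(state).squeeze(-1)
    return g_t - v.detach()   # detach so the actor's update does not change the critic here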

3. REINFORCE: Monte Carlo policy gradient

The Monte Carlo method can be understood as follows: after completing one episode, the algorithm uses the data of that entire episode to learn and perform one update. Because we have the data of the whole episode, we also have the reward at each step, and we can easily compute the total future reward of each step, i.e. the return G_{t}. G_{t} is the total future reward, the sum of the rewards we can obtain from this step onward. Compared with the Monte Carlo method, which updates once per episode, the temporal-difference method updates once per step, so it has a higher update frequency; because the future reward is not yet known at each step, it uses the Q value to approximate G_{t}.

REINFORCE is the simplest and most classic algorithm among policy gradient methods. In code, the reward of each step is obtained first, then the total future reward G_{t}^{n} of each step is computed, and G_{t}^{n} is substituted into the following formula to optimize the output of each action; G_{t}^{n} is easy to compute from the full-episode data:

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}G_{t}^{n}\nabla \log p_\theta \left ( a_{t}^{n}|s_{t}^{n} \right )

The most important part of REINFORCE is the last four lines of its pseudocode: first generate one episode of data (s, a, G), then for each step compute the gradient \nabla \ln \pi \left ( a_t|s_t,\theta \right ) and update the parameters with G_t as the weight.

The network outputs the probability distribution over actions for each state, for example 0.2, 0.5, and 0.3. What is actually sent to the environment is one action sampled at random from this distribution; if, say, "go right" is chosen, its one-hot vector is (0, 0, 1). Substituting the network's output and the actually taken action into the cross-entropy formula measures the gap between the output action probabilities and the actual action. However, the actual action a_t is only the action we happened to output, not necessarily the correct action; unlike in handwritten-digit recognition, it cannot serve as a ground-truth label that guides the network to update in the right direction, so we multiply by the return G_t. G_t acts as an evaluation of the action actually taken: if G_t is large, the total future reward is large, meaning the current output of this action is good and its loss should be given more weight; if G_t is small, the total future reward is small, meaning the output of this action is worse, so its weight is smaller and it is optimized less.
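To make the cross-entropy view concrete, the small check below (with illustrative numbers) shows that, for a one-hot "label", the cross entropy reduces to the negative log-probability of the action actually taken, which is then weighted by G_t:

import torch

probs = torch.tensor([0.2, 0.5, 0.3])   # network output for the current state
action = torch.tensor(2)                # "go right", one-hot label (0, 0, 1)
g_t = 1.5                               # illustrative return for this step

ce = -torch.log(probs[action])          # cross entropy with a one-hot label
loss = g_t * ce                         # weight the loss by the return G_t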

The overall workflow of the REINFORCE algorithm is as follows. First we need a policy model that outputs action probabilities; after getting the action probabilities we obtain a concrete action through the sample() function, interact with the environment, and collect the data of the whole episode. With the episode data, we then call the learn() function, in which these data are used to construct a loss function that is handed to the optimizer to update our policy model.
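Putting these pieces together, a compact REINFORCE loop might look like the sketch below; the environment, network sizes, learning rate, and episode count are assumptions for illustration, and env.step() is written against the classic Gym 4-tuple interface.

import torch
import torch.nn as nn
from torch.distributions import Categorical
import gym

env = gym.make("CartPole-v1")                     # hypothetical environment
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                       nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    state, log_probs, rewards, done = env.reset(), [], [], False
    while not done:                               # sample one full episode
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        m = Categorical(probs)
        action = m.sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(m.log_prob(action))
        rewards.append(reward)

    # learn(): compute the discounted returns G_t, build the loss, update the policy
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()  # minus sign -> ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()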

The corresponding methods and code are given in the PyTorch documentation: Probability distributions - torch.distributions — PyTorch 1.12 documentation.

from torch.distributions import Categorical

probs = policy_network(state)            # action probabilities for the current state
m = Categorical(probs)                   # build a categorical distribution from them
action = m.sample()                      # sample an action from the distribution
next_state, reward = env.step(action)
# The minus sign is needed because the optimizer performs gradient descent by default,
# while gradient ascent is required here
loss = -m.log_prob(action) * reward
loss.backward()
  • Categorical class: creates a categorical distribution, i.e. a discrete probability distribution over the given probabilities
  • sample(): a method of the Categorical class that draws a random sample from the distribution
  • log_prob(): a method of the Categorical class that returns the log-probability of a given value
from torch.distributions import Categorical
import torch
probs = torch.tensor([0.2, 0.3, 0.5])

# Create the categorical distribution
m = Categorical(probs)
m
Out[6]: Categorical(probs: torch.Size([3]))

# Randomly sample an action, which serves as the action actually taken
action = m.sample()
action
Out[15]: tensor(1)

# Log-probability of the sampled action, used to raise its probability during the update
m.log_prob(action)
Out[16]: tensor(-1.2040)


Source: blog.csdn.net/weixin_45526117/article/details/126330222