(6) Deterministic Policy Gradient (DPG)

Deterministic policy gradient (DPG) can be used to solve continuous control problems.

Consider the continuous control of a robotic arm. The arm has two movable joints, i.e., two degrees of freedom, so the action is described by two variables. The action space is therefore two-dimensional and continuous: it contains infinitely many actions.

DPG was proposed in 2014. About two years later it was combined with deep neural networks, which gave rise to DDPG.

DPG is an actor-critic method with a policy network and a value network.
The policy network controls the agent's movement, so it is called the actor; it selects an action a based on the state s. The value network does not control the agent; it scores the action a given the state s, thereby guiding the policy network to improve, so it is called the critic.

  • Use a deterministic policy network (actor): $a = \pi(s; \theta)$.
    The policy network is a deterministic function, denoted $\pi(s; \theta)$, where $\theta$ is the parameter of the neural network. The policy network is also called the actor because it makes the decisions.
    Its input is the state s, and its output is not a probability distribution but a specific action a. Given the state s, the output action is deterministic and has no randomness, which is why the method is called deterministic. The output of the policy network can be a real number or a vector; in the robot arm example above, the output action is two-dimensional.


  • Use a value network (critic): $q(s, a; w)$.
    The value network is also called the critic, denoted $q(s, a; w)$, where $w$ is its parameter. It takes two inputs, the state s and the action a, and based on the state s it evaluates how good or bad the action a is.


  • The critic outputs a scalar that evaluates how good the action $a$ is.
    The output of the value network is a single real number that scores the quality of the action: the better the action, the larger the number. (A minimal sketch of both networks follows this list.)
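As a concrete illustration, here is a minimal sketch of the two networks in PyTorch. The layer sizes, the `state_dim`/`action_dim` names, and the `Tanh` output squashing are illustrative assumptions, not part of the original text:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network: a = pi(s; theta)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # outputs an action vector, not a distribution
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Value network: q(s, a; w), a scalar score for action a in state s."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),  # a single real number: larger means a better action
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

The `Tanh` here only keeps the output in a bounded range; a real robot-arm controller would rescale it to the actual joint limits.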

We need to train both neural networks so that they improve together: the policy network makes better and better decisions, and the value network scores actions more and more accurately.

Next we see how to train these two neural networks.

Updating Value Network by TD

First, use the TD algorithm to update the value network. The steps are listed below, followed by a code sketch.

  • Transition: $(s_t, a_t, r_t, s_{t+1})$.
    Each observed transition is one piece of training data.

  • Value network makes a prediction for time $t$:
    $q_t = q(s_t, a_t; w)$.
    Let the value network predict the action value at time $t$, denoted $q_t$.

  • Value network makes a prediction for time $t+1$:
    $q_{t+1} = q(s_{t+1}, a'_{t+1}; w)$, where $a'_{t+1} = \pi(s_{t+1}; \theta)$.
    Let the value network also predict the action value at time $t+1$. We know the state $s_{t+1}$; feed $s_{t+1}$ into the policy network $\pi$ to compute the next action, denoted $a'_{t+1}$. This action is not actually executed by the agent; $a'_{t+1}$ is used only to update the value network. Feed $s_{t+1}$ and $a'_{t+1}$ into the value network to compute the action value at time $t+1$, denoted $q_{t+1}$.

  • TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
    The quantity $(r_t + \gamma \cdot q_{t+1})$ is the TD target. Part of it is the actually observed reward $r_t$, and the other part is the value network's prediction $q_{t+1}$. We regard the TD target as closer to the truth than the pure prediction $q_t$, so we encourage $q_t$ to move toward the TD target, i.e., we make the TD error as small as possible.

  • Update: $w \longleftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; w)}{\partial w}$.
    This is gradient descent that shrinks $\delta_t^2$, i.e., it moves the value network's prediction closer to the TD target.
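Putting the five steps together, here is a minimal sketch of one TD update of the value network, assuming the `Actor`/`Critic` modules sketched earlier, a single transition stored as tensors, and an optimizer named `critic_opt` (all of these names are illustrative):

```python
import torch

gamma = 0.99  # discount factor (illustrative value)

def td_update_critic(actor, critic, critic_opt, s_t, a_t, r_t, s_next):
    # q_t = q(s_t, a_t; w)
    q_t = critic(s_t, a_t)
    with torch.no_grad():
        a_next = actor(s_next)            # a'_{t+1} = pi(s_{t+1}; theta), not executed by the agent
        q_next = critic(s_next, a_next)   # q_{t+1}
        td_target = r_t + gamma * q_next  # r_t + gamma * q_{t+1}
    # Gradient descent on the squared TD error delta_t = q_t - td_target
    loss = ((q_t - td_target) ** 2).mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```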


Updating Policy Network by DPG

Training the policy network requires the deterministic policy gradient.
The policy network computes the action a from the input state s, and this action controls the agent's movement.

  • The critic $q(s, a; w)$ evaluates how good the action $a$ is.
    Training the policy network relies on the value network: the value network evaluates the quality of the action a, which guides the policy network to improve.
  • Improve $\theta$ so that the critic believes $a = \pi(s; \theta)$ is better.
    The parameter of the policy network is $\theta$. The better $\theta$ is, the better the output action a will be. The policy network itself does not know whether an action is good or bad; it relies entirely on the value network's evaluation.
  • Update $\theta$ so that $q(s, a; w) = q(s, \pi(s; \theta); w)$ increases.
    The larger the output of the value network, the better the action. So we improve the policy network's parameter $\theta$ to make the value network's output as large as possible.
  • Goal: increase $q(s, a; w)$, where $a = \pi(s; \theta)$.
    In summary, the goal of training the policy network is to increase the value network's output q. The inputs of the value network are the state s and the action a, and the action a is computed by the policy network $\pi$: for a given state s, the policy network outputs a specific action a.

If the input state is fixed and the value network is fixed, then the only factor that affects the value q is the policy network's parameter $\theta$. We want to update $\theta$ to make q larger, so we compute the gradient of q with respect to $\theta$ and perform gradient ascent on $\theta$.

This gradient is called the deterministic policy gradient (DPG): the gradient of the value q with respect to the policy network's parameter $\theta$. It can be computed with the chain rule: the gradient equals the derivative of the action a with respect to $\theta$ multiplied by the derivative of q with respect to a. In effect, the gradient propagates from the value q to the action a, and then from a to the policy network, as written out below.
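Written out, the chain rule described in the previous paragraph gives (in the notation used above, for a fixed state s):

$$
g \;=\; \frac{\partial\, q\big(s, \pi(s;\theta); w\big)}{\partial \theta}
\;=\; \frac{\partial\, \pi(s;\theta)}{\partial \theta}\cdot
\frac{\partial\, q(s, a; w)}{\partial a}\Bigg|_{a=\pi(s;\theta)}.
$$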


  • Gradient ascent: $\theta \longleftarrow \theta + \beta \cdot g$.
    Finally, perform gradient ascent to update $\theta$, where $\beta$ is the learning rate. Updating $\theta$ this way makes the value larger, i.e., the value network considers the policy to have improved. A code sketch of this update follows.
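A minimal sketch of this update in PyTorch, continuing the assumed `Actor`/`Critic` modules from above. In practice one usually minimizes the negated value with an optimizer rather than hand-coding gradient ascent; `actor_opt` is an assumed optimizer over the actor's parameters only:

```python
def dpg_update_actor(actor, critic, actor_opt, s_t):
    # We want q(s, pi(s; theta); w) to increase, so we minimize its negation.
    a = actor(s_t)                       # a = pi(s; theta)
    actor_loss = -critic(s_t, a).mean()  # gradient ascent on q == gradient descent on -q
    actor_opt.zero_grad()
    actor_loss.backward()                # chain rule: dq/da propagated back through a to theta
    actor_opt.step()                     # theta <- theta + beta * g (handled by the optimizer)
    # Only the actor's parameters are updated here; the critic is updated by the TD step above.
```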

Improvement: Using Target Network

Training the value network with the algorithm above does not work very well in practice; some techniques can improve it, such as a target network. (A target network also improves the training of DQN.)
First, review how the value network was trained above.

Bootstrapping when training DQN introduces bias, and there the bias is an overestimation.
Bootstrapping here also introduces bias, but it is not necessarily an overestimation; it can also be an underestimation. If the values are overestimated at the beginning they tend to stay overestimated, and if they are underestimated at the beginning they tend to stay underestimated. The reason is this: if $q_{t+1}$ is underestimated, then the TD target $r_t + \gamma \cdot q_{t+1}$ is also too low, and regressing $q_t$ toward this target propagates the underestimation back into the value network itself, so the underestimation persists.

Bootstrapping causes this problem. The solution is to use different networks to compute the TD target, which avoids part of the bootstrapping and makes training much more stable.

Computing TD Target using Target network
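The figures for this part are not reproduced here; what follows is a hedged reconstruction of the standard DDPG-style setup, which may differ slightly in notation from the original slides. Keep a target policy network $\pi(s; \theta^-)$ and a target value network $q(s, a; w^-)$ whose parameters lag behind the current networks, and use them only to compute the TD target:

$$
a'_{t+1} = \pi(s_{t+1}; \theta^-), \qquad
\widehat{y}_t = r_t + \gamma \cdot q\big(s_{t+1}, a'_{t+1}; w^-\big).
$$

The target parameters are then refreshed as a weighted average of the current parameters, e.g. $\theta^- \leftarrow \tau\,\theta + (1-\tau)\,\theta^-$ and $w^- \leftarrow \tau\,w + (1-\tau)\,w^-$ for a small $\tau \in (0, 1)$.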


Observe carefully: updating the target networks uses the policy network and the value network, so the target networks' parameters still depend on them, and the TD target computed by the target networks is still related to the policy and value networks. Bootstrapping is therefore not eliminated completely, but using target networks is still better than not using them.
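A minimal sketch of the weighted-average ("soft") target update mentioned above, in PyTorch; the `tau` value is an illustrative assumption:

```python
import torch

@torch.no_grad()
def soft_update(net, target_net, tau=0.005):
    # target <- tau * current + (1 - tau) * target, parameter by parameter
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau)
        p_targ.add_(tau * p)
```

The same function would be applied to both the policy network and the value network after each training step.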

Other techniques that improve training can also be applied to DPG.


Stochastic Policy VS Deterministic Policy

A stochastic policy $\pi(a \mid s; \theta)$ outputs a probability distribution over actions and the agent samples an action from it, which is natural for discrete action spaces. A deterministic policy $a = \pi(s; \theta)$ directly outputs one specific action with no randomness, which makes it suitable for continuous action spaces such as the robotic-arm control problem above.


Origin blog.csdn.net/weixin_49716548/article/details/131689185