Reinforcement Learning - Policy Gradient

Introduction

      Common reinforcement learning methods are based either on value functions or on policy gradients.

Value function: once the value function is optimal, the optimal policy is obtained; that is, in state s, the optimal action is the one that maximizes the action-value function, max Q(s,a).

      However, for a robot whose action space is continuous, value-function-based methods run into the following problems:

  1. When the action space is large, or the actions form a continuous set, value-function-based methods cannot handle the problem effectively.
  2. When improving the policy from a value function, the action value must be obtained for every state-action pair in order to pick the optimal action \arg\max_{a\in A} Q(s,a). In this setting every state-action pair is treated as strictly independent, and it becomes impractical to determine which action should be taken in a given state.

Summary: a value function Q can still be used for continuous action spaces, but it is awkward to apply, which is why policy gradient methods appeared.
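To make the argmax issue concrete, here is a minimal sketch (not from the original post; the Q-table values are made up) showing why greedy action selection is trivial for a small discrete action set but becomes a per-step optimization problem when actions are continuous:

```python
import numpy as np

# Hypothetical tabular Q-function: 3 states x 4 discrete actions (values made up).
Q = np.array([
    [0.1, 0.5, 0.2, 0.0],
    [0.3, 0.1, 0.4, 0.2],
    [0.0, 0.2, 0.1, 0.6],
])

def greedy_action(state: int) -> int:
    # With a finite action set, the greedy policy is a simple argmax over a row.
    return int(np.argmax(Q[state]))

print(greedy_action(0))  # -> 1

# With a continuous action a in R^n there is no finite row to scan:
# argmax_a Q(s, a) becomes an inner optimization problem that must be solved
# at every time step, which is the difficulty described in the list above.
```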

1. Policy gradient

Stochastic policy gradient: we use P(a,s;\theta) to directly approximate the policy \pi(a,s), so the quantity to learn is the neural network parameter vector \theta. To solve for \theta, we design an objective function J(\theta)=G(\theta) (the cumulative return). The \theta update formula is:

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

      This method updates the policy parameters based on the gradient of the objective function J(θ).

J(θ) comes in two forms:

1. In MC (Monte Carlo), where a complete episode is available, the objective is the value of the start state:

J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[G_1]

2. In TD, where there is no complete episode and updates are made step by step (the average-reward setting), the objective is the average reward per time step:

J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, R_s^a

      In the formula, d^{\pi_\theta}(s) is the distribution of states s generated by following the policy \pi_\theta.

      Further, the policy gradient theorem gives the expression of the gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
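For completeness (a standard derivation, not spelled out in the original post), the expectation form follows from the likelihood-ratio identity \nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta:

\nabla_\theta J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)
                        = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)
                        = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]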

2. Actor

      The policy gradient corresponds to the Actor, i.e. the "A" in Actor-Critic (AC).

Actor: the Actor takes as input what the agent observes (for a computer, an image represented as a matrix or vector) and outputs a probability distribution over the actions the agent may take.
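As a concrete illustration (a minimal PyTorch sketch, not the original post's code; the layer sizes and class name are assumptions), the Actor can be a small network whose output defines \pi_\theta(a|s) as a categorical distribution over discrete actions:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an observation vector to a probability distribution over actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)
        # Softmax over the logits gives pi(a | s); Categorical wraps it for sampling.
        return torch.distributions.Categorical(logits=logits)

# Usage: sample an action and keep its log-probability for the policy gradient.
actor = Actor(obs_dim=4, n_actions=2)
obs = torch.randn(4)
dist = actor(obs)
action = dist.sample()
log_prob = dist.log_prob(action)
```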

      Further, we need to measure how good an Actor is. By running the Actor we obtain a series of episode returns; averaging them gives the expected return \bar{R}_\theta, which lets us compare policies. This \bar{R} plays the role of the J(\theta) above:

\bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)
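A sketch of estimating \bar{R}_\theta by sampling N episodes (assuming a Gymnasium-style environment API and the Actor sketched above; both are illustrative assumptions, not from the original post):

```python
import torch

def average_return(env, actor, n_episodes: int = 20) -> float:
    """Estimate R-bar by rolling out the current Actor and averaging episode returns."""
    total = 0.0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            dist = actor(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            episode_return += reward
            done = terminated or truncated
        total += episode_return
    return total / n_episodes
```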


     Then we want to find the optimal Actor, i.e. the parameters \theta that maximize \bar{R}_\theta (\bar{R} being the counterpart of J above), using gradient ascent:

\theta \leftarrow \theta + \eta \nabla_\theta \bar{R}_\theta, \qquad \nabla_\theta \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\, \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)
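A minimal REINFORCE-style sketch of this gradient-ascent update in PyTorch (assuming trajectories were collected with the Actor above and that each trajectory stored its per-step log-probabilities; the helper name and data layout are assumptions):

```python
import torch

def reinforce_update(actor_optimizer, trajectories):
    """One gradient-ascent step on R-bar from sampled trajectories.

    `trajectories` is a list of (log_probs, total_return) pairs, where
    `log_probs` holds the log pi_theta(a_t | s_t) tensors saved during the
    rollout and `total_return` is R(tau) for that episode.
    """
    # Maximizing R-bar is equivalent to minimizing the negative surrogate loss below.
    loss = torch.tensor(0.0)
    for log_probs, total_return in trajectories:
        loss = loss - total_return * torch.stack(log_probs).sum()
    loss = loss / len(trajectories)

    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()  # theta <- theta + eta * grad R-bar (handled by the optimizer)
```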

3. Further extensions


Origin blog.csdn.net/weixin_48878618/article/details/134336260