Reinforcement learning from basic to advanced - frequently asked interview questions and must-know answers [7]: Detailed explanation of the deep deterministic policy gradient (DDPG) algorithm and the twin delayed deep deterministic policy gradient (TD3) algorithm

Column details: [Reinforcement Learning Principles + Project Column] Must-see series: single-agent and multi-agent algorithm principles + project practice, related skills (parameter tuning, plotting, etc.), interesting project implementations, and academic/applied project implementations.

The plan for deep reinforcement learning is:

  • Basic single-agent algorithm teaching (based on the gym environment)
  • Mainstream multi-agent algorithm teaching (based on the gym environment)
    • Mainstream algorithms: DDPG, DQN, TD3, SAC, PPO, RainbowDQN, QLearning, A2C and other algorithm projects
  • Some interesting projects (Super Mario, playing backgammon, Fight the Landlord, and various game applications)
  • Hands-on single-agent and multi-agent problems (paper reproduction and partial business applications such as UAV optimal scheduling, power resource scheduling, and other projects)

This column is mainly intended to help entry-level students quickly grasp single-agent and multi-agent reinforcement learning algorithm principles together with project practice. Follow-up posts will continue to analyze the underlying knowledge and principles involved in deep learning, so that readers build up background knowledge while practicing projects: knowing what works, and knowing why it works.

Disclaimer: Some projects are classic online projects provided so that everyone can get started quickly; practical content (competitions, papers, real-world applications, etc.) will be added later.

Reinforcement learning from basic to advanced - frequently asked interview questions and must-know answers [7]: Detailed explanation of the deep deterministic policy gradient (DDPG) algorithm and the twin delayed deep deterministic policy gradient (TD3) algorithm

See the top of the article for the full series of ultra-detailed algorithm code sources for reinforcement learning

1. Core vocabulary

Deep deterministic policy gradient (DDPG): a classic reinforcement learning algorithm in the field of continuous control. "Deep" refers to the use of a deep neural network; "deterministic" means that its output is a deterministic action, which can be used in continuous action environments; "policy gradient" means that it uses a policy network that is updated at every step, i.e., a single-step-update policy network. Like the deep Q-network, it uses the target network and experience replay tricks; the experience replay part is identical, while the update of the target network differs slightly.
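To make that last difference concrete, here is a minimal sketch (PyTorch is assumed; the network names `critic` and `critic_target` are illustrative, not from the original post): the deep Q-network typically copies the online network into the target network every fixed number of steps, whereas DDPG softly updates the target network at every step with a small coefficient $\tau$.

```python
import copy
import torch
import torch.nn as nn

def hard_update(target: nn.Module, source: nn.Module) -> None:
    """DQN-style update: copy all parameters every C steps."""
    target.load_state_dict(source.state_dict())

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """DDPG-style update: the target slowly tracks the online network every step."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)

# Illustrative networks (names and sizes are assumptions, not from the original post)
critic = nn.Linear(4, 1)
critic_target = copy.deepcopy(critic)
soft_update(critic_target, critic, tau=0.005)
```

The soft update keeps the target values changing slowly, which stabilizes the bootstrapped TD targets.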

2. Summary of frequently asked questions

2.1 Please explain the stochastic policy and the deterministic policy. What is the difference between them?

(1) For a stochastic policy $\pi_\theta(a_t \mid s_t)$, when we input a certain state $s$, the probability of taking a particular action $a$ is not 100%; instead there is a probability distribution, and an action is randomly sampled according to that distribution, much like drawing a lottery.

(2) For a deterministic policy $\mu_\theta(s_t)$, there is no probability involved. When the parameters of the neural network are fixed, the same input state always produces the same output action; this is a deterministic policy (see the short sketch below).
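Here is a minimal sketch of the two kinds of policy (PyTorch is assumed; the state dimension and layer sizes are illustrative assumptions): the stochastic policy samples an action from a distribution conditioned on the state, while the deterministic policy maps the same state to the same action as long as the weights are fixed.

```python
import torch
import torch.nn as nn

state = torch.randn(1, 4)  # an illustrative 4-dimensional state

# (1) Stochastic policy pi_theta(a_t | s_t): output a distribution, then sample
logits_net = nn.Linear(4, 3)  # 3 discrete actions (illustrative)
dist = torch.distributions.Categorical(logits=logits_net(state))
action = dist.sample()        # different calls may return different actions

# (2) Deterministic policy mu_theta(s_t): same state -> same action
mu_net = nn.Sequential(nn.Linear(4, 2), nn.Tanh())  # 2-dimensional continuous action
action = mu_net(state)        # identical output for identical input (weights fixed)
```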

2.2 For a continuous action space and for a discrete action space, if we adopt a policy network in both cases, how should we operate in each case?

First of all, it should be noted that algorithms such as Q-learning and the deep Q-network cannot handle continuous action spaces, so we use a neural network for the policy, because it can output either action probabilities or a deterministic policy $\mu_\theta(s_t)$.

(1) To output discrete actions, use softmax as the final activation function of the output layer. It guarantees that the output is a probability over actions and that all action probabilities sum to 1.

(2) To output continuous actions, add a tanh activation function to the output layer, which limits the output to $[-1,1]$. After we obtain this output, we can scale it to the actual action range before sending it to the environment. For example, suppose the neural network outputs the floating-point number 2.8; the tanh activation squashes it into $[-1,1]$, giving 0.99. If the speed range of the car is $[-2,2]$, we then rescale from $[-1,1]$ to $[-2,2]$: 0.99 times 2 gives 1.98, which is sent to the environment as the speed of the car or the force applied to it (see the sketch below).
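The sketch below (PyTorch is assumed; layer sizes are illustrative, and the $[-2,2]$ speed range follows the car example above) shows both output heads: a softmax head whose action probabilities sum to 1 for discrete actions, and a tanh head whose output is rescaled from $[-1,1]$ to the real action range for continuous actions.

```python
import torch
import torch.nn as nn

state = torch.randn(1, 3)  # illustrative 3-dimensional state

# (1) Discrete actions: softmax turns raw scores into probabilities that sum to 1
discrete_head = nn.Sequential(nn.Linear(3, 4), nn.Softmax(dim=-1))
probs = discrete_head(state)                 # e.g. tensor([[0.1, 0.4, 0.3, 0.2]])

# (2) Continuous actions: tanh squashes to [-1, 1], then rescale to the real range
continuous_head = nn.Sequential(nn.Linear(3, 1), nn.Tanh())
max_speed = 2.0                              # the car's speed range is [-2, 2]
speed = continuous_head(state) * max_speed   # e.g. 0.99 * 2 = 1.98
```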

3. Must-know interview questions and answers

3.1 Friendly interviewer: Please briefly describe the deep deterministic policy gradient algorithm.

The deep deterministic policy gradient algorithm uses the actor-critic structure, but its output is not a probability over actions; it is a specific action, which can be used for continuous action prediction. Its optimization objective is to extend the deep Q-network to continuous action spaces. In addition, its meaning follows from its name:

(1) "Deep" refers to the use of a deep neural network;

(2) "Deterministic" means that the output is a definite action, which can be used in continuous action environments;

(3) "Policy gradient" indicates that it uses a policy network. The REINFORCE algorithm updates the network once per episode, whereas the deep deterministic policy gradient algorithm updates the policy network at every step; it is a single-step-update policy network (a compressed update sketch follows).
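A compressed sketch of one DDPG update step is shown below (PyTorch is assumed; the networks, optimizers, and batch tensors are placeholders rather than code from the original post): the critic is trained toward a TD target built from the target networks, and the actor is trained to output actions that maximize the critic's value, once per environment step.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99):
    """One single-step DDPG update; all arguments are assumed to be provided elsewhere."""
    s, a, r, s_next, done = batch

    # Critic: regress Q(s, a) toward the TD target built from the *target* networks
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient -- maximize Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```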

3.2 Friendly interviewer: Is the deep deterministic policy gradient algorithm an on-policy algorithm or an off-policy algorithm? Please explain the specific reasons and analyze them.

It is an off-policy algorithm. (1) The deep deterministic policy gradient algorithm is an optimized deep Q-network and uses experience replay, so it is off-policy. (2) To ensure a certain amount of exploration, the algorithm adds noise to the output action, so the behavior policy is no longer the policy being optimized (see the sketch below).
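A minimal sketch of why the behavior policy differs from the policy being optimized (the `actor` callable and the noise scale here are illustrative assumptions): the action actually executed and stored in the replay buffer carries exploration noise, while the learned policy itself stays deterministic.

```python
import numpy as np

def behavior_action(actor, state, noise_std=0.1, low=-1.0, high=1.0):
    """Action used to interact with the environment (behavior policy):
    deterministic output plus Gaussian exploration noise, then clipped."""
    a = actor(state)                                        # deterministic mu_theta(s)
    a = a + np.random.normal(0.0, noise_std, size=np.shape(a))
    return np.clip(a, low, high)                            # keep inside the valid action range
```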

3.3 Friendly interviewer: Have you heard of the distributed distributional deep deterministic policy gradient algorithm (distributed distributional deep deterministic policy gradient, D4PG)? Please describe it.

Compared with the deep deterministic policy gradient algorithm, the distributed distributional deep deterministic policy gradient algorithm (D4PG) makes the following improvements.

(1) Distributional critic: instead of estimating only the expected value of the Q-value, it estimates the distribution of the Q-value, that is, it treats the Q-value as a random variable.

(2) $N$-step returns: when computing the temporal-difference error, D4PG uses the $N$-step TD target rather than a single-step one, so that rewards from more future steps can be taken into account (a small sketch follows this list).

(3) Multiple distributed parallel actors: D4PG uses $K$ independent actors to collect training data in parallel and store it in the same replay buffer.

(4) Prioritized experience replay (PER): data is sampled from the replay buffer with non-uniform probability.
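As a concrete illustration of the $N$-step target in point (2), here is a small sketch under the usual discounted-return definition (the function and variable names are illustrative assumptions):

```python
def n_step_td_target(rewards, q_bootstrap, gamma=0.99):
    """N-step TD target: r_t + gamma*r_{t+1} + ... + gamma^(N-1)*r_{t+N-1}
    + gamma^N * Q_target(s_{t+N}, mu_target(s_{t+N})).
    `rewards` holds the N observed rewards; `q_bootstrap` is the target-critic value."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * q_bootstrap

# Example: 3-step target with rewards [1.0, 0.0, 1.0] and bootstrap value 5.0
print(n_step_td_target([1.0, 0.0, 1.0], 5.0))  # 1 + 0 + 0.99**2 * 1 + 0.99**3 * 5 ≈ 6.83
```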


Source: blog.csdn.net/sinat_39620217/article/details/131426727