Reinforcement learning DDPG: Interpretation of Deep Deterministic Policy Gradient

1. DDPG

Compared with the traditional PG algorithm, the DDPG method has four main improvements:

A. off-policy strategy

The traditional PG algorithm generally adopts an on-policy approach: the overall reinforcement learning process is divided into multiple epochs, each epoch performs one update of the policy model and the value model, and each epoch must re-sample trajectories with the current decision model to obtain that round's training samples.

However, when interacting with the environment is expensive, this on-policy approach is inefficient. DDPG therefore adopts an off-policy method that can reuse historical samples: for a historical transition \{s,a,r,s'\}, DDPG re-estimates the target value according to the current target policy.

G(s,a)=r + \gamma Q_{\phi_{targ}}(s', \mu_{\theta_{targ}}(s'))

Therefore, the objective of DDPG for the value estimation model Q_{\phi}(s,a) is the following, where B denotes a batch randomly drawn from all historical samples:

J(\phi )=\sum_{(s,a,r,s')\sim B} [Q_{\phi }(s,a) - (r +\gamma Q_{\phi _{targ}}(s', \mu_{\theta_{targ} }(s')))]^2

The objective of the traditional on-policy strategy, by contrast, is the following, where R(s,a) is the cumulative return obtained from Monte Carlo sampling and D denotes the samples collected in the current epoch:

J(\phi )=\sum_{(s,a,r)\sim D} [Q_{\phi }(s,a) - R(s,a)]^2
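To make the off-policy critic update concrete, here is a minimal PyTorch sketch; the names `critic`, `target_critic`, `target_actor` and the batch layout are illustrative assumptions rather than anything fixed by DDPG itself:

```python
import torch
import torch.nn.functional as F

def ddpg_critic_loss(batch, critic, target_critic, target_actor, gamma=0.99):
    """Off-policy critic loss: regress Q(s, a) toward r + gamma * Q_targ(s', mu_targ(s'))."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer B (hypothetical layout)

    with torch.no_grad():
        a_next = target_actor(s_next)               # mu_targ(s')
        q_next = target_critic(s_next, a_next)      # Q_targ(s', mu_targ(s'))
        # (1 - done) zeroes out the bootstrap term at terminal states (a standard practical detail)
        target = r + gamma * (1.0 - done) * q_next  # G(s, a)

    q = critic(s, a)                                # Q_phi(s, a)
    return F.mse_loss(q, target)
```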

B. Deterministic continuous action policy modeling

The traditional PG algorithm models actions through a distribution \pi(a|s). This distribution is generally discrete, or the action is modeled as a Gaussian whose mean and standard deviation are fitted by a neural network.

DDPG instead employs an action generation network \mu_{\theta}(s) that outputs deterministic, continuous action values.
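A minimal sketch of such a deterministic action network in PyTorch (the layer sizes and the `act_limit` scaling are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """mu_theta(s): maps a state directly to a continuous action, with no distribution involved."""
    def __init__(self, obs_dim, act_dim, act_limit=1.0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, s):
        return self.act_limit * self.net(s)          # scale to the valid action range
```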

C. Target Networks

As described above, DDPG's off-policy strategy re-estimates the value according to the current target policy. This estimate is computed by separate target networks, mainly to avoid the instability that arises when the networks being optimized are also used directly to compute their own targets.

Traditional DQN periodically copies the parameters of the optimized network to the target network all at once, after a relatively long interval. DDPG instead adopts a smoother approach, letting the target networks continuously track the parameters of the optimized networks:

\theta_{targ}=\rho \theta_{targ} + (1-\rho )\theta \\ \phi_{targ}=\rho \phi_{targ} + (1-\rho )\phi
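A minimal PyTorch sketch of this soft (Polyak) update, assuming `net` and `target_net` have identical architectures:

```python
import torch

@torch.no_grad()
def soft_update(target_net, net, rho=0.995):
    """theta_targ <- rho * theta_targ + (1 - rho) * theta, applied parameter by parameter."""
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.mul_(rho).add_((1.0 - rho) * p)
```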

D. Exploration

DDPG makes deterministic action decisions, so to ensure exploration it adds Gaussian noise to the selected action, and clips the result to avoid invalid action values.

a=\text{clip}(\mu_{\theta}(s)+\epsilon,\ a_{low},\ a_{high}),\quad \epsilon\sim \mathcal{N}
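Action selection with exploration noise might look like the following sketch; the `noise_scale` value and the action bounds are illustrative assumptions:

```python
import torch

@torch.no_grad()
def select_action(actor, s, noise_scale=0.1, a_low=-1.0, a_high=1.0):
    """a = clip(mu_theta(s) + eps, a_low, a_high), with eps ~ N(0, noise_scale^2)."""
    a = actor(s)
    a = a + noise_scale * torch.randn_like(a)    # Gaussian exploration noise
    return a.clamp(a_low, a_high)                # truncate to the valid action range
```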

Figure: the overall process of the DDPG algorithm.

2. Twin Delayed DDPG

Twin Delayed DDPG, also known as the TD3 algorithm, makes two main upgrades on top of DDPG:

A. target policy smoothing

As mentioned above, DDPG's off-policy strategy re-estimates the value based on the current target policy. In DDPG, no exploration noise is added to the action generated by the target policy, because that action is only used to estimate the value, not to explore.

The TD3 algorithm, however, does add noise to the target policy's action. The main motivation is regularization: this smooths out incorrect sharp peaks in the value estimate that may appear for particular actions during training.

a'(s'|\mu_{\theta_{targ}})=\text{clip}(\mu_{\theta_{targ}}(s')+\text{clip}(\epsilon, -c, c),\ a_{low},\ a_{high}),\quad \epsilon\sim \mathcal{N}
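A minimal sketch of this smoothed target action in PyTorch (the noise scale `sigma` and clipping bound `c` are illustrative values):

```python
import torch

@torch.no_grad()
def smoothed_target_action(target_actor, s_next, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """a'(s') = clip(mu_targ(s') + clip(eps, -c, c), a_low, a_high), with eps ~ N(0, sigma^2)."""
    a_next = target_actor(s_next)                             # mu_targ(s')
    eps = (sigma * torch.randn_like(a_next)).clamp(-c, c)     # clipped Gaussian noise
    return (a_next + eps).clamp(a_low, a_high)
```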

B. clipped double-Q learning

DDPG is based on Q-learning. Because the deterministic action is chosen to maximize the estimated value, it suffers from maximization bias (roughly, since the value estimates are noisy, their maximum tends to be larger than the true expected value). This problem can be addressed with double Q-learning.

Building on DDPG, TD3 applies the idea of double Q-learning: it introduces two target value estimation models, produces a value estimate with each, and takes the smaller of the two as the final estimate.

G(s,a)=r + \gamma\, \text{Min}_i\ Q_{\phi_{targ,i}}(s', a'(s'|\mu_{\theta_{targ}})),\ i=1,2

Correspondingly, there are two value models to optimize, each regressed toward this shared target:

J(\phi_i )=\sum_{(s,a,r,s')\sim D} [Q_{\phi_i }(s,a) - G(s,a)]^2,\ i=1,2
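A minimal PyTorch sketch of the clipped double-Q target and the two critic losses; `smoothed_target_action` is the helper sketched above, and the other names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def td3_critic_loss(batch, critics, target_critics, target_actor, gamma=0.99):
    """Both critics regress toward the same target built from the *smaller* target-Q value."""
    s, a, r, s_next, done = batch

    with torch.no_grad():
        a_next = smoothed_target_action(target_actor, s_next)
        q1_t = target_critics[0](s_next, a_next)
        q2_t = target_critics[1](s_next, a_next)
        target = r + gamma * (1.0 - done) * torch.min(q1_t, q2_t)   # clipped double-Q target

    loss1 = F.mse_loss(critics[0](s, a), target)
    loss2 = F.mse_loss(critics[1](s, a), target)
    return loss1 + loss2
```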

However, there is only one policy model, and it is optimized using only the first value model:

J(\theta)=\sum_{(s)\sim D} Q_{\phi_1 }(s,\mu_\theta(s))
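The corresponding policy update in sketch form, maximizing only the first critic's estimate (implemented as minimizing its negative):

```python
def td3_actor_loss(batch, actor, critic1):
    """Policy objective: maximize Q_phi1(s, mu_theta(s)) over states from the replay buffer."""
    s = batch[0]
    return -critic1(s, actor(s)).mean()
```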

Figure: the overall process of the TD3 algorithm.

3. Soft Actor-Critic

Since DDPG can only produce deterministic actions, Soft Actor-Critic (SAC) instead implements a stochastic policy that produces probabilistic action decisions. Compared with TD3, the SAC algorithm has two main differences:

A. entropy regularization

Entropy regularization is the core of SAC. Because SAC makes probabilistic action decisions a'\sim \pi(a'|s'), the main risk is that the policy collapses prematurely into a near-deterministic distribution and stops exploring; SAC therefore adds an entropy bonus to the value target to keep the action distribution suitably spread out.

G(s,a)=r + \gamma\, \text{Min}_i\ [Q_{\phi_{targ,i}}(s', a')+\alpha H(\pi(\cdot |s'))],\ i=1,2

G(s,a)=r + \gamma\, \text{Min}_i\ [Q_{\phi_{targ,i}}(s', a')-\alpha \log\pi(a'|s')],\ i=1,2
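A minimal sketch of this entropy-regularized target, assuming a hypothetical stochastic `policy(s)` that returns a sampled action together with its log-probability:

```python
import torch

@torch.no_grad()
def sac_target(batch, policy, target_critics, gamma=0.99, alpha=0.2):
    """G(s, a) = r + gamma * (min_i Q_targ,i(s', a') - alpha * log pi(a'|s')), with a' ~ pi(.|s')."""
    s, a, r, s_next, done = batch
    a_next, logp_next = policy(s_next)            # sample a' and its log-probability
    q1_t = target_critics[0](s_next, a_next)
    q2_t = target_critics[1](s_next, a_next)
    q_min = torch.min(q1_t, q2_t)
    # shapes of r, done, q_min and logp_next are assumed to broadcast consistently
    return r + gamma * (1.0 - done) * (q_min - alpha * logp_next)
```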

At the same time, the policy model's update objective also includes the entropy regularization term, and unlike TD3 it uses both value models (taking the minimum) rather than only the first:

J(\theta)=\sum_{(s)\sim D} [\text{Min}_i\ Q_{\phi_i }(s,a_\theta )-\alpha \log\pi(a_\theta|s)],\ i=1,2
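And a sketch of the corresponding policy loss, which minimizes the negative of the objective above (critic output and log-probability shapes are assumed to match):

```python
import torch

def sac_actor_loss(batch, policy, critics, alpha=0.2):
    """Maximize min_i Q_phi_i(s, a_theta) - alpha * log pi(a_theta | s), a_theta reparameterized."""
    s = batch[0]
    a, logp = policy(s)                                   # reparameterized sample keeps gradients
    q_min = torch.min(critics[0](s, a), critics[1](s, a))
    return (alpha * logp - q_min).mean()
```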

B. Exploration

SAC has natural exploration characteristics because its action decision function is stochastic, so it does not need a target network for the policy model. In addition, the degree of exploration can be adjusted during training by tuning the entropy coefficient \alpha.
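For completeness, a minimal sketch of what such a stochastic (tanh-squashed Gaussian) policy could look like; the network layout is an illustrative assumption:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a|s): outputs a tanh-squashed Gaussian action and its log-probability."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, s):
        h = self.body(s)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                                # reparameterized sample
        a = torch.tanh(u)                                 # squash to [-1, 1]
        # log-probability with the tanh change-of-variables correction
        logp = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, logp
```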

 
