Reinforcement learning study notes: DDPG

These notes are summarized from the reinforcement learning tutorial at datawhalechina.github.io

0x01 Discrete action and continuous action

This doesn't need much explanation.

  • Discrete actions: for example left, right, and fire in Atari games.
  • Continuous actions: for example the steering angle when a player uses a joystick to drive a vehicle in PUBG.

For discrete actions, the network outputs a probability for every action in the action set, so a softmax is usually placed on the last layer. For continuous actions, the network outputs a concrete value, and it is usually best to put a tanh on this output so that the action stays within a bounded range.
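As a concrete illustration of these two output layers, here is a minimal sketch in PyTorch (the framework, layer sizes, and names are my own assumptions, not from the original notes): a softmax head for a discrete action set and a tanh head for a bounded continuous action.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Discrete actions: output a probability for every action (softmax head)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # probabilities over the action set

class ContinuousPolicy(nn.Module):
    """Continuous actions: output a concrete value, squashed by tanh into a bounded range."""
    def __init__(self, state_dim, action_dim, max_action=1.0, hidden=64):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.max_action * torch.tanh(self.net(state))  # action in [-max_action, max_action]
```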

0x02 DDPG (Deep Deterministic Policy Gradient)

Deep Deterministic Policy Gradient (DDPG) is a classic reinforcement learning algorithm for continuous control and an extension of DQN. During training it borrows two techniques from DQN: the target network and experience replay. However, the target network is updated differently than in DQN.

  • Deep: because neural networks are used.
  • Deterministic: the output is a deterministic action, which makes the method usable for continuous actions.
  • Policy Gradient: a policy gradient method is used; updates are made at every single step rather than once per episode.
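Concretely, DQN copies the online weights into the target network every so often, while DDPG moves the target weights a small step towards the online weights after every update (a soft, or Polyak, update). A minimal sketch, assuming two PyTorch modules `net` and `target_net` with the same architecture:

```python
def soft_update(net, target_net, tau=0.005):
    """Soft (Polyak) update: move each target parameter a small step tau towards the online one."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau)
        p_targ.data.add_(tau * p.data)
```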

The following is the network structure diagram of DDPG.
(Figure: DDPG network structure)
At first glance, this structure looks just like Pathwise Derivative Policy Gradient.

DDPG adds an actor (that is, a policy network) on top of DQN to output the action directly. We therefore need to learn the policy network at the same time as the Q network; this is an actor-critic structure.

The following describes what this network structure does:

  • The Q network is the critic: given the current state and an action, it returns a score.
  • The actor network chooses the next action based on the current state (both networks are sketched in code after this list).
  • At the beginning of training, the Q network scores more or less at random, and the actor also chooses more or less at random. As training progresses, the Q network's scores become increasingly accurate, and the actor gradually learns to pick actions that the Q network scores highly.
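A minimal sketch of this actor-critic structure in PyTorch (layer sizes and names are illustrative assumptions): the actor maps a state to a deterministic action, and the critic maps a state-action pair to a score.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: state -> deterministic continuous action."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q network: (state, action) -> scalar score."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```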

Difference from Classic DQN: Continuous Case

In classic DQN the action space is discrete: the Q network outputs a value for each action, we see which action has the highest Q value, and we choose that action. In the continuous case this mechanism breaks down, because the action space is continuous and cannot be enumerated exhaustively. So instead we let a network learn how to find high-reward actions directly.
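The difference in action selection can be sketched as follows (hypothetical `q_net` and `actor` networks, PyTorch assumed):

```python
import torch

def select_action_dqn(q_net, state):
    """Classic DQN: score every discrete action, then take the argmax."""
    with torch.no_grad():
        return int(q_net(state).argmax())

def select_action_ddpg(actor, state):
    """DDPG: the actor outputs the continuous action directly, no enumeration needed."""
    with torch.no_grad():
        return actor(state)
```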

Training process
(Figure: DDPG training process)
As shown above, DDPG builds a target network for each of the two networks.

The steps for updating the Q network need little explanation: it uses the replay buffer and a temporal-difference (TD) target.
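For reference, a minimal sketch of that Q-network update (PyTorch; the names `critic`, `critic_target`, `actor_target`, and the batch layout are assumptions):

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_target, actor_target, critic_opt, batch, gamma=0.99):
    state, action, reward, next_state, done = batch   # sampled from the replay buffer
    with torch.no_grad():
        # TD target: r + gamma * Q_targ(s', mu_targ(s')), computed with the target networks
        next_action = actor_target(next_state)
        td_target = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)
    loss = F.mse_loss(critic(state, action), td_target)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```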

The policy network (the actor), in contrast, is updated by gradient ascent on the Q value. When the Q network is being trained, the policy network's parameters are held fixed.
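A minimal sketch of that policy update (names are assumptions): gradient ascent on Q(s, μ(s)) is implemented as gradient descent on -Q, and only the actor's optimizer takes a step, so the critic's weights stay put.

```python
def update_actor(actor, critic, actor_opt, state):
    """Maximize Q(s, actor(s)) by minimizing its negative; only the actor is stepped."""
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()   # the critic's optimizer is not stepped here, so its weights are unchanged
```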

Exploration & Exploitation

Because the policy is deterministic, an actor that always explores with the same policy will most likely not try enough actions at the beginning to find useful learning signals. The original authors therefore added time-correlated noise (Ornstein-Uhlenbeck, or OU, noise), but in practice uncorrelated Gaussian white noise works just as well. The noise can be gradually reduced as training progresses.

At test time we do not add noise to the action, so that we can see how the actor itself behaves.
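A sketch of this exploration scheme with Gaussian noise (the names `actor`, `max_action`, and `noise_std` are assumptions):

```python
import numpy as np
import torch

def act(actor, state, max_action=1.0, noise_std=0.1, training=True):
    with torch.no_grad():
        action = actor(state).cpu().numpy()
    if training:
        # exploration: perturb the deterministic action with Gaussian noise
        action = action + np.random.normal(0.0, noise_std * max_action, size=action.shape)
    # at test time no noise is added, so we see the actor's own behaviour
    return np.clip(action, -max_action, max_action)
```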

TD3: an improvement on DDPG

A common problem with DDPG is that the learned Q function starts to significantly overestimate Q values, which then corrupts the policy, because the policy exploits the errors in the Q function.

A simple way to check this is to compare the Q value output by the network with the actual Q value, which can be estimated with Monte Carlo (MC): for example, take on the order of 1000 samples and average their returns to estimate the true Q value.
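A rough sketch of this diagnostic, assuming a Gymnasium-style environment (the rollout count, `env`, `actor`, and `gamma` are illustrative assumptions): estimate the true value of the current policy by averaging Monte Carlo returns and compare it with the critic's prediction at the initial state.

```python
import numpy as np
import torch

def mc_value_estimate(env, actor, gamma=0.99, n_episodes=1000):
    """Average of discounted Monte Carlo returns from the initial state under the current actor."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, g, discount = False, 0.0, 1.0
        while not done:
            with torch.no_grad():
                action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return np.mean(returns)   # compare against critic(s0, actor(s0)) to gauge overestimation
```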

Twin Delayed DDPG (TD3) addresses this problem by introducing three key tricks:

  1. Clipped double-Q learning: TD3 learns two Q networks simultaneously by minimizing mean squared error, and both regress towards a single shared target; the smaller of the two target-Q values is the one used to form that target.
  2. Delayed policy updates: if the actor network is held fixed, the critic (i.e. the Q network) learns better; training the actor and critic in lockstep works worse. TD3 therefore updates the actor at a lower frequency than the Q networks, typically once for every two Q-network updates.
  3. Target policy smoothing: TD3 adds noise to the target action, which smooths out Q along changes in the action and makes it harder for the policy to exploit errors in the Q function.

The principle of target policy smoothing is as follows:

$$a'(s') = \mathrm{clip}\Big(\mu_{\theta_{\mathrm{targ}}}(s') + \mathrm{clip}(\epsilon, -c, c),\; a_{\mathrm{low}},\; a_{\mathrm{high}}\Big), \qquad \epsilon \sim \mathcal{N}(0, \sigma)$$

where $\epsilon$ is essentially noise, sampled from a Gaussian and clipped to stay within $[-c, c]$.
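Putting the three tricks together, here is a minimal sketch of how the TD3 critic target is formed (PyTorch; all names and hyperparameter values are illustrative assumptions):

```python
import torch

def td3_critic_target(critic1_targ, critic2_targ, actor_targ, reward, next_state, done,
                      gamma=0.99, sigma=0.2, c=0.5, max_action=1.0):
    with torch.no_grad():
        a_next = actor_targ(next_state)
        # (3) target policy smoothing: add clipped Gaussian noise to the target action
        noise = (torch.randn_like(a_next) * sigma).clamp(-c, c)
        a_next = (a_next + noise).clamp(-max_action, max_action)
        # (1) clipped double-Q: evaluate both target critics and keep the smaller value
        q1 = critic1_targ(next_state, a_next)
        q2 = critic2_targ(next_state, a_next)
        target = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    return target
# (2) delayed policy update: in the training loop, the actor and the target networks
# would be updated only once for every couple of critic updates.
```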

One additional point:
The DDPG re-implemented by the TD3 authors ("our DDPG" in the paper) performs differently from the official DDPG, which shows that DDPG is very sensitive to initialization and hyperparameter tuning. TD3 is not nearly as sensitive to its parameters.

Source: blog.csdn.net/weixin_43466027/article/details/119535881