[Reinforcement Learning] Detailed Explanation of Deep Deterministic Policy Gradient (DDPG) Algorithm

1 Introduction to DDPG

DDPG absorbs the essence of Actor-Critic, which lets Policy Gradient update at every single step, and also absorbs the essence of DQN, the algorithm that taught computers to play games, and merges them into a new algorithm called Deep Deterministic Policy Gradient. What kind of algorithm is DDPG? Let's take it apart and analyze it. We can split DDPG into 'Deep' and 'Deterministic Policy Gradient', and the latter can be further split into 'Deterministic' and 'Policy Gradient'. Let's analyze them one by one.

1.1 Deep and DQN

Deep, as the name suggests, means going deeper. DQN uses a memory bank and two neural networks with the same structure but different parameter-update frequencies, which greatly helps learning. We apply the same idea to DDPG, so that DDPG inherits this excellent style. DDPG's neural networks are just a little more complicated than DQN's. If you need to review DQN, you can read [Reinforcement Learning] Deep Q Network (DQN) here: https://blog.csdn.net/shoppingend/article/details/124379079?spm=1001.2014.3001.5502

1.2 Deterministic Policy Gradient

Compared with other reinforcement learning methods, Policy Gradient can select actions in a continuous action space. However, it selects them randomly, according to the learned action distribution, and that randomness is exactly what 'Deterministic' removes. Deterministic means the policy no longer needs to be uncertain or hesitant when choosing an action: instead of sampling, it must simply output one definite action value in the end. So 'Deterministic' changes the way actions are produced and categorically outputs a single action value on the continuous action space.
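As a tiny toy illustration of the difference (a sketch with made-up numbers, not from the article's code):

import numpy as np

state_feature = 0.7   # a made-up learned feature of the current state

# Stochastic Policy Gradient: sample an action from a learned distribution
mu, sigma = 2.0 * state_feature, 0.5
stochastic_action = np.random.normal(mu, sigma)   # a different value on every call

# Deterministic policy: one definite action value per state
deterministic_action = 2.0 * state_feature        # always the same value for this state

print(stochastic_action, deterministic_action)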

1.3 DDPG neural network

DDPG's neural networks are similar in spirit to the Actor-Critic setup we discussed before: we need a policy-based neural network (the Actor) and a value-based neural network (the Critic). However, to incorporate the ideas of DQN, each of these networks is further split into two. On the Policy Gradient (Actor) side, we have an estimation (eval) network and a target network. The eval network outputs real-time actions for the actor to execute in the environment, while the target network is used when updating the value system. On the value (Critic) side, we also have an eval network and a target network. Both output the value of a state-action pair, but their inputs differ: the target value network takes the next-state observation together with the action produced by the Actor's target network, while the eval value network takes the action that the Actor actually applied at the time. In practice, this design of DDPG does make learning noticeably more effective.

2 DDPG algorithm

2.1 Main points

DDPG in one sentence: an Actor-Critic structure proposed by Google DeepMind, except that the output is not a probability over behaviors but a concrete behavior, which makes it suitable for predicting continuous actions. DDPG combines the previously successful DQN structure and improves the stability and convergence of Actor-Critic. If Actor-Critic is not completely clear to you, you can read about it here: https://blog.csdn.net/shoppingend/article/details/124341639?spm=1001.2014.3001.5502

2.2 Algorithm

The algorithm of DDPG is actually an Actor-Critic. Regarding the Actor part, its parameter update will also involve the Critic.
(Formula: the Actor parameter update, proportional to grad[Q] * grad[μ], i.e. the gradient of Q with respect to the action times the gradient of the Actor's action with respect to its parameters)
The above is the update of the Actor's parameters. The first half, grad[Q], comes from the Critic and says: how should the Actor move this time in order to get a bigger Q. The second half, grad[μ], comes from the Actor and says: how should the Actor modify its own parameters so that it is more likely to perform this action. Combining the two means the Actor should modify its parameters in the direction that is more likely to obtain a large Q.
(Formula: the Critic update, with target value Q_target = r + γ · Q'(s', μ'(s')))
The above is the Critic's update. It borrows ideas from DQN and Double Q-Learning: there are two neural networks for estimating Q. In Q_target, an Actor is used to choose the action for the next state, and that Actor is the Actor's target network (whose parameters are an older copy of the Actor's). The Q_target obtained this way cuts off the correlations, just like in DQN, and improves convergence.
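The snippets in the next section mark the target networks as trainable=False but do not show how they ever receive those older parameters. In the original DDPG paper this is done with a "soft" update, where the target parameters slowly track the eval parameters. A minimal sketch, assuming the variable-scope names used in the code of section 2.3 and an assumed value of TAU, might look like this:

import tensorflow as tf

TAU = 0.01  # soft-replacement rate (an assumed value)

# Collect the eval/target parameters by their variable scopes ('Actor/eval_net', etc.)
ae = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net')
at = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net')
ce = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net')
ct = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net')

# theta_target <- TAU * theta_eval + (1 - TAU) * theta_target
soft_replace = [tf.assign(t, (1 - TAU) * t + TAU * e)
                for t, e in zip(at + ct, ae + ce)]
# sess.run(soft_replace) is then executed once per learning step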

2.3 Main code structure

We use TensorFlow to build the neural networks. The main structure can be seen in this TensorBoard graph.
(Figure: TensorBoard graph of the overall DDPG structure)
Look at the structure of Actor and Critic separately.
(Figure: TensorBoard graphs of the Actor and Critic networks)
The code that builds them is here:

class Actor(object):
    def __init__(self):
        ...
        with tf.variable_scope('Actor'):
            # this net updates its parameters promptly (the eval net)
            self.a = self._build_net(S, scope='eval_net', trainable=True)
            # this net does not update its parameters promptly; it is used to predict
            # the action in the Critic's Q_target
            self.a_ = self._build_net(S_, scope='target_net', trainable=False)
        ...

class Critic(object):
    def __init__(self, a, a_):  # a and a_ are passed in from the Actor (abridged signature)
        with tf.variable_scope('Critic'):
            # this net updates its parameters promptly (the eval net)
            self.a = a  # this a comes from the Actor, but when the Critic is being updated,
                        # self.a is the previously chosen a (from memory), not the Actor's current a
            self.q = self._build_net(S, self.a, 'eval_net', trainable=True)
            # this net does not update its parameters promptly; it is used to compute Q_target
            self.q_ = self._build_net(S_, a_, 'target_net', trainable=False)
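The _build_net helper called above is not shown in the article. A minimal sketch of what it could look like is given here; the hidden-layer size of 30 and the attributes self.s_dim, self.a_dim and self.a_bound are assumptions:

# A possible _build_net inside the Actor class (an assumed sketch)
def _build_net(self, s, scope, trainable):
    with tf.variable_scope(scope):
        net = tf.layers.dense(s, 30, activation=tf.nn.relu, name='l1', trainable=trainable)
        a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, name='a', trainable=trainable)
        return tf.multiply(a, self.a_bound, name='scaled_a')   # scale the tanh output to the action range

# A possible _build_net inside the Critic class (an assumed sketch)
def _build_net(self, s, a, scope, trainable):
    with tf.variable_scope(scope):
        w1_s = tf.get_variable('w1_s', [self.s_dim, 30], trainable=trainable)
        w1_a = tf.get_variable('w1_a', [self.a_dim, 30], trainable=trainable)
        b1 = tf.get_variable('b1', [1, 30], trainable=trainable)
        net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)
        return tf.layers.dense(net, 1, trainable=trainable)    # Q(s, a)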

2.4 Actor-Critic

Now that we understand the two-network structure inside each of the Actor and the Critic, let's look at how they communicate and pass information to each other. Let's start with how the Actor learns and updates.
(Figure: information flow used in the Actor's update)
From this picture you can see at a glance what the Actor's update is based on. You can see that it uses the two eval_nets, so we write the training code in the Actor class as follows:

with tf.variable_scope('policy_grads'):
    # compute (dQ/da) * (da/dparams)
    self.policy_grads = tf.gradients(
        ys=self.a, xs=self.e_params,  # gradients of ys with respect to xs
        grad_ys=a_grads               # this is dQ/da, coming from the Critic
    )
with tf.variable_scope('A_train'):
    opt = tf.train.AdamOptimizer(-self.lr)  # negative learning rate so the update ascends the gradient,
                                            # the same trick as in Policy Gradient
    self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params))  # update the eval_net parameters

At the same time, the a_grads that gets passed to the Actor above also has to be computed with TensorFlow. This a_grads lives in the Critic class, and the a here is the action the Actor computes from s:

with tf.variable_scope('a_grad'):
    self.a_grads = tf.gradients(self.q, self.a)[0]   # dQ/da

On the Critic side, things are simpler.
(Figure: information flow used in the Critic's update)
Below is the update code for the Critic.

# compute the target Q
with tf.variable_scope('target_q'):
    self.target_q = R + self.gamma * self.q_    # self.q_ uses the action from the Actor's target_net
# compute the TD error and back-propagate it
with tf.variable_scope('TD_error'):
    self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q))  # self.q comes from the eval_net, using the stored action
with tf.variable_scope('C_train'):
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

Finally, when we build and wire the Actor and Critic together, we write this:

actor = Actor(...)
critic = Critic(..., actor.a, actor.a_)  # pass the a / a_ produced by the actor's eval_net / target_net to the Critic
actor.add_grad_to_graph(critic.a_grads)  # add the dQ/da produced by the critic into the Actor's graph

2.5 Memory bank (Memory)

Below is the memory-bank code, similar to the one in DQN. We build it with a class.

class Memory(object):
    def __init__(self, capacity, dims):
        """Initialize the memory bank with numpy"""

    def store_transition(self, s, a, r, s_):
        """Store each transition in a numpy array"""

    def sample(self, n):
        """Randomly sample n transitions from the memory bank for learning"""

2.6 The algorithm for each episode

The per-episode loop here only shows the most important parts, omitting some unnecessary details.

var = 3  # initialize a variance here to strengthen the actor's exploration

for i in range(MAX_EPISODES):
    ...
    for j in range(MAX_EP_STEPS):
        ...

        a = actor.choose_action(s)
        a = np.clip(np.random.normal(a, var), -2, 2)  # add exploration noise
        s_, r, done, info = env.step(a)

        M.store_transition(s, a, r / 10, s_)   # memory bank

        if M.pointer > MEMORY_CAPACITY:  # once the memory bank has been filled for the first time
            var *= .9998    # gradually reduce exploration
            b_M = M.sample(BATCH_SIZE)
            ...   # split b_M into the inputs below
            critic.learn(b_s, b_a, b_r, b_s_)
            actor.learn(b_s)

        s = s_

        if j == MAX_EP_STEPS-1:
            break
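The choose_action and learn methods used in this loop are not shown in the article. A rough sketch of how they might be wired up, assuming self.sess holds the TensorFlow session and S, S_, R are the globally defined placeholders, could be:

# Hypothetical sketch of the methods called in the loop above
class Actor(object):
    ...
    def choose_action(self, s):
        # run the eval_net for a single state and return one concrete action
        return self.sess.run(self.a, feed_dict={S: s[np.newaxis, :]})[0]

    def learn(self, b_s):
        self.sess.run(self.train_op, feed_dict={S: b_s})

class Critic(object):
    ...
    def learn(self, b_s, b_a, b_r, b_s_):
        self.sess.run(self.train_op,
                      feed_dict={S: b_s, self.a: b_a, R: b_r, S_: b_s_})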

Article source: Mofan Reinforcement Learning https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/

Origin: blog.csdn.net/shoppingend/article/details/124344083