[Reinforcement Learning] Deep Q Network (DQN)

1 Introduction to DQN

1.1 Reinforcement Learning and Neural Networks

Deep Q Network (DQN) is a reinforcement learning method that combines a neural network with Q-Learning.
Q-Learning uses a table to store the Q value of every action in every state. Today's problems, however, are far too complicated: there can be as many states as there are stars in the sky (think of playing Go). If we tried to store them all in a table, our computer would run out of memory, and searching such an enormous table for the current state every time would be very slow. But machine learning has one tool that is well suited to this kind of problem: the neural network. We can feed the state and the action into a neural network and let it output the Q value of that action, so that instead of recording Q values in a table we generate them directly with the network. There is another form in which we input only the state and the network outputs the values of all actions; we then follow the Q-Learning principle and pick the action with the largest value. You can picture the neural network as receiving external information, much like collecting it through eyes and ears, processing it in the brain to produce a value for each action, and finally choosing the action by reinforcement learning.
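
To make the second form concrete, here is a minimal sketch, assuming a tiny two-layer network with random, untrained weights standing in for the real model; the sizes n_features and n_actions and the function name q_values are made up for the example. The state goes in, one Q value per action comes out, and the action is chosen greedily:

import numpy as np

# a minimal sketch (not the tutorial's network): state in, one Q value per action out
n_features, n_actions, n_hidden = 4, 3, 10      # made-up sizes for the example
W1 = np.random.randn(n_features, n_hidden) * 0.1
b1 = np.zeros(n_hidden)
W2 = np.random.randn(n_hidden, n_actions) * 0.1
b2 = np.zeros(n_actions)

def q_values(state):
    # forward pass: state -> hidden layer -> one Q value per action
    hidden = np.maximum(0, state @ W1 + b1)     # ReLU hidden layer
    return hidden @ W2 + b2

state = np.random.rand(n_features)              # stand-in for an observation
q = q_values(state)                             # array of n_actions Q values
action = int(np.argmax(q))                      # pick the action with the largest Q value, as in Q-Learning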

1.2 Updating the neural network


Next, let's analyze the second form of the network (state in, all action values out). We know that a neural network must be trained before it can predict accurate values, so how is it trained in reinforcement learning? First of all, we need the correct Q values of the actions a1 and a2; these play the role of the "Q reality" (the Q target) from Q-Learning. We also need a "Q estimate" to drive the update of the network: the new network parameters are the old parameters plus the learning rate α times the gap between Q reality and Q estimate. To tidy this up: we use the network to predict the values of Q(s2, a1) and Q(s2, a2); this is the Q estimate. We then choose the action with the largest estimated value and receive a reward from the environment. The Q reality is also built from Q estimates produced by the network, but those estimates are for the next state s'. Finally, the network parameters are updated with the rule just mentioned. This alone, however, is not the fundamental reason why DQN can play video games. Two other ingredients support DQN and make it extremely powerful: Experience replay and Fixed Q-targets.
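
To see the arithmetic of this update, here is a tiny worked example with made-up numbers (two actions, an illustrative reward and discount factor); it is only a sketch of the calculation, not the tutorial's code:

import numpy as np

gamma = 0.9                               # discount factor (illustrative value)

q_eval_s = np.array([1.0, 2.0])           # network's Q estimates at s: Q(s, a1), Q(s, a2)
a = int(np.argmax(q_eval_s))              # act on the largest estimate -> a2
reward = 1.0                              # reward returned by the environment

q_eval_s_next = np.array([0.5, 1.5])      # network's Q estimates at the next state s'
q_reality = reward + gamma * np.max(q_eval_s_next)   # "Q reality" = r + gamma * max_a' Q(s', a')

td_error = q_reality - q_eval_s[a]        # the gap between Q reality and Q estimate
# new parameters = old parameters + alpha * (update driven by this gap);
# in the real code this step is handled by the optimizer on the loss (q_target - q_eval)^2
print(q_reality, td_error)                # 2.35 and 0.35 with these made-up numbers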

1.3 Two major weapons of DQN

In simple terms, DQN has a memory bank for learning from previous experiences. Q-Learning is an off-policy method: it can learn from what it is experiencing now, from what it experienced in the past, and even from other people's experience. So at every update, DQN can randomly draw some earlier experiences and learn from them. This random sampling breaks the correlation between consecutive experiences and makes the neural network updates more efficient. Fixed Q-targets is another mechanism for breaking correlation. With fixed Q-targets, we use two neural networks in DQN that have the same structure but different parameters: the network that predicts the Q estimate uses the latest parameters, while the network that predicts the Q reality (the target) uses parameters that are quite old. With these two improvements, DQN can surpass humans in some games.
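
As a rough sketch of these two ideas (a simplified stand-in, not the tutorial's implementation, which stores memories in a numpy array and syncs TensorFlow variables; ReplayBuffer, capacity and replace_every below are illustrative names), experience replay is just a bounded buffer sampled at random, and fixed Q-targets means keeping a second, rarely refreshed copy of the parameters:

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    # experience replay: store transitions, then sample them in random order
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)     # oldest memories are dropped automatically

    def store(self, s, a, r, s_):
        self.buffer.append((s, a, r, s_))

    def sample(self, batch_size=32):
        # random sampling breaks the correlation between consecutive experiences
        return random.sample(self.buffer, batch_size)

memory = ReplayBuffer(capacity=2000)
memory.store([0.0, 1.0], 1, 0.5, [1.0, 1.0])    # one (s, a, r, s_) transition
# batch = memory.sample(batch_size=32)          # called once enough memories have accumulated

# fixed Q-targets: keep a second parameter set that is refreshed only occasionally
eval_params = np.array([0.1, 0.2])      # updated at every learning step (stands in for eval_net)
target_params = eval_params.copy()      # used to compute "Q reality" (stands in for target_net)

for step in range(1, 1001):
    # ... learning updates would modify eval_params here ...
    if step % 200 == 0:                 # every 200 steps
        target_params = eval_params.copy()   # sync target_net with eval_net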

2 DQN algorithm update

2.1 Main points

DQN is short for Deep Q Network, which combines the advantages of Q-Learning with neural networks. With tabular Q-Learning, every state-action value has to be stored in a q_table. If, as in real problems, there are tens of millions of states, putting all of their values in a table is limited by our computer hardware, and looking data up in the table and updating it becomes inefficient. This is the motivation for DQN: we use a neural network to estimate the value of a state, so no table is needed.

2.2 Algorithm

The whole algorithm is the Q-Learning algorithm with a few modifications. The Q-Learning algorithm can be reviewed here: https://blog.csdn.net/shoppingend/article/details/124291112?spm=1001.2014.3001.5501
These modifications are: a memory bank (for repeated learning from stored experience), a neural network that computes the Q values, and a temporarily frozen q_target (which cuts correlations), as sketched below.
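
Since the original figure is not reproduced here, the following is a schematic sketch of how those pieces fit together, using a linear stand-in "network" and a dummy random environment rather than the tutorial's code (the real code appears in sections 2.3 and 3; all names and sizes here are illustrative):

import numpy as np

gamma, lr = 0.9, 0.01
n_features, n_actions = 2, 3
batch_size, memory_size, replace_every = 4, 50, 20

def q(params, s):                                  # linear stand-in for the Q network
    return s @ params

eval_params = np.random.randn(n_features, n_actions) * 0.1   # "eval_net": updated every learning step
target_params = eval_params.copy()                           # "target_net": temporarily frozen
memory, step = [], 0                                         # memory bank

for episode in range(3):
    s = np.random.rand(n_features)                           # dummy env.reset()
    for t in range(30):
        a = int(np.argmax(q(eval_params, s)))                # greedy action from eval_net
        s_, r = np.random.rand(n_features), float(np.random.rand() < 0.1)   # dummy env.step(a)

        memory.append((s, a, r, s_))                         # store the transition
        memory = memory[-memory_size:]                       # old memories are overwritten

        if len(memory) >= batch_size:                        # repeated learning from random samples
            for i in np.random.choice(len(memory), batch_size):
                bs, ba, br, bs_ = memory[i]
                q_target = br + gamma * np.max(q(target_params, bs_))   # target uses the frozen net
                td_error = q_target - q(eval_params, bs)[ba]
                eval_params[:, ba] += lr * td_error * bs                # update eval_net only

        if step % replace_every == 0:
            target_params = eval_params.copy()               # every so often, copy eval_net -> target_net
        s, step = s_, step + 1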

2.3 The algorithm in code

The following code is the most important part: DQN's interaction with the environment.

def run_maze():
    step = 0    # used to control when to start learning
    for episode in range(300):
        # initialize the environment
        observation = env.reset()

        while True:
            # refresh the environment
            env.render()

            # DQN chooses an action based on the observation
            action = RL.choose_action(observation)

            # the environment returns the next state, the reward, and whether the episode is done
            observation_, reward, done = env.step(action)

            # DQN stores the transition
            RL.store_transition(observation, action, reward, observation_)

            # control when learning starts and how often (accumulate some memories before learning)
            if (step > 200) and (step % 5 == 0):
                RL.learn()

            # the next state_ becomes the current state of the next loop
            observation = observation_

            # break out of the loop when the episode terminates
            if done:
                break
            step += 1   # total step counter

    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    env = Maze()
    RL = DeepQNetwork(env.n_actions, env.n_features,
                      learning_rate=0.01,
                      reward_decay=0.9,
                      e_greedy=0.9,
                      replace_target_iter=200,  # replace the target_net parameters every 200 steps
                      memory_size=2000, # upper limit of the memory size
                      # output_graph=True   # whether to output a tensorboard file
                      )
    env.after(100, run_maze)
    env.mainloop()
    RL.plot_cost()  # view the cost curve of the neural network

3 DQN decision making

The main code structure:

class DeepQNetwork:
    # covered last time
    def _build_net(self):

    # covered this time:
    # initial values
    def __init__(self):

    # store memories
    def store_transition(self, s, a, r, s_):

    # choose an action
    def choose_action(self, observation):

    # learn
    def learn(self):

    # view the learning result (optional)
    def plot_cost(self):

Initialization:

import numpy as np
import tensorflow as tf   # TensorFlow 1.x style API is used below


class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,
    ):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy     # maximum value of epsilon
        self.replace_target_iter = replace_target_iter  # replace the target_net parameters every this many steps
        self.memory_size = memory_size  # upper limit of the memory size
        self.batch_size = batch_size    # how many memories to sample from memory at each update
        self.epsilon_increment = e_greedy_increment # increment of epsilon
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max # whether to start in exploration mode and gradually reduce exploration

        # count the learning steps (used to decide when to replace the target_net parameters)
        self.learn_step_counter = 0

        # initialize the memory to all zeros, one row per [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features*2+2)) # unlike the video, numpy is used directly here because pandas is slower

        # build [target_net, evaluate_net]
        self._build_net()

        # op that replaces the target net parameters
        t_params = tf.get_collection('target_net_params')  # extract the target_net parameters
        e_params = tf.get_collection('eval_net_params')    # extract the eval_net parameters
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)] # copy eval_net parameters into target_net

        self.sess = tf.Session()

        # output a tensorboard file
        if output_graph:
            # $ tensorboard --logdir=logs
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []  # record every cost so the curve can be plotted at the end

Storing memories: one essential ingredient of DQN is to record every step it has experienced, so that these steps can be learned from repeatedly. This makes it an off-policy method: you could even play the game yourself, record your own playing experience, and let the DQN learn how you pass the levels.

class DeepQNetwork:
    def __init__(self):
        ...
    def store_transition(self, s, a, r, s_):
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        # record one transition [s, a, r, s_]
        transition = np.hstack((s, [a, r], s_))

        # the total memory size is fixed; when it is exceeded, old memories are replaced by new ones
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition # replacement step

        self.memory_counter += 1

Choosing an action:

class DeepQNetwork:
    def __init__(self):
        ...
    def store_transition(self, s, a, r, s_):
        ...
    def choose_action(self, observation):
        # unify the shape of observation to (1, size_of_observation)
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # let eval_net produce the values of all actions and choose the action with the largest value
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)   # choose randomly
        return action

Learning is the most important step: this is where the Deep Q Network learns and updates its parameters, and it is where the interplay between target_net and eval_net comes in.

class DeepQNetwork:
    def __init__(self):
        ...
    def store_transition(self, s, a, r, s_):
        ...
    def choose_action(self, observation):
        ...
    def _replace_target_params(self):
        ...
    def learn(self):
        # check whether to replace the target_net parameters
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # randomly sample batch_size memories from the memory bank
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        # get q_next (produced by target_net) and q_eval (produced by eval_net)
        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],
                self.s: batch_memory[:, :self.n_features]
            })

        # The next few steps are very important. q_next and q_eval contain the values of all actions,
        # but we only need the value of the action that was actually chosen; the others are not needed.
        # So we set the error of every other action to 0 and back-propagate only the error of the chosen action as the update signal.
        # This is what we ultimately want, e.g. q_target - q_eval = [1, 0, 0] - [-1, 0, 0] = [2, 0, 0]
        # q_eval = [-1, 0, 0] means that in this memory action 0 was chosen and gave Q(s, a0) = -1, so the other values are Q(s, a1) = Q(s, a2) = 0.
        # q_target = [1, 0, 0] means that r + gamma * maxQ(s_) = 1 for this memory, and no matter which action is taken at s_,
        # the value must line up with the position of the chosen action in q_eval, so the 1 is placed at the position of action 0.

        # The code below reaches the same goal, though the route is slightly different so that it is easier to compute.
        # We first copy q_eval into q_target, so that q_target - q_eval is all zeros;
        # then, using the action column in batch_memory, we overwrite the corresponding memory-action position in q_target
        # with reward + gamma * maxQ(s_), so that q_target - q_eval becomes exactly what we need.
        # A concrete example is given below.

        q_target = q_eval.copy()
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        """
        假如在这个 batch 中, 我们有2个提取的记忆, 根据每个记忆可以生产3个 action 的值:
        q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        q_target = q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        然后根据 memory 当中的具体 action 位置来修改 q_target 对应 action 上的值:
        比如在:
            记忆 0 的 q_target 计算值是 -1, 而且我用了 action 0;
            记忆 1 的 q_target 计算值是 -2, 而且我用了 action 2:
        q_target =
        [[-1, 2, 3],
         [4, 5, -2]]

        所以 (q_target - q_eval) 就变成了:
        [[(-1)-(1), 0, 0],
         [0, 0, (-2)-(6)]]

        最后我们将这个 (q_target - q_eval) 当成误差, 反向传递会神经网络.
        所有为 0 的 action 值是当时没有选择的 action, 之前有选择的 action 才有不为0的值.
        我们只反向传递之前选择的 action 的值,
        """

        # train eval_net
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost) # record the cost

        # gradually increase epsilon to reduce the randomness of actions
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

Finally, to see how well the learning went, we plot how the cost changed during training.

class DeepQNetwork:
    def __init__(self):
        ...
    def store_transition(self, s, a, r, s_):
        ...
    def choose_action(self, observation):
        ...
    def _replace_target_params(self):
        ...
    def learn(self):
        ...
    def plot_cost(self):
        import matplotlib.pyplot as plt
        plt.plot(np.arange(len(self.cost_his)), self.cost_his)
        plt.ylabel('Cost')
        plt.xlabel('training steps')
        plt.show()

Article source: Mofan Reinforcement Learning https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/
