DQN(1)

References

莫烦PYTHON
DeepMind
《强化学习精要》 (Essentials of Reinforcement Learning)
Deep Reinforcement Learning basics (the DQN part)
Playing Flappy Bird with TensorFlow and Deep Q-Learning (DQN)
Human-level control through deep reinforcement learning

Why we need DQN

Q-learning needs a Q table to hold the Q values. If the numbers of states and actions are huge, this table becomes enormous, and both storing it and looking values up become extremely expensive in time and space. The idea is to replace the table with a neural network. Think of the network as a function, q_value(s, a) = f(state, action); or, going one step further, we only need to feed in the state, because Q-learning is greedy when choosing actions and simply picks the action with the largest Q value. We give the network a state, the network outputs a Q value for every action, and we then find the action with the largest Q value in that output tensor ourselves, which completes the action selection.
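
A minimal sketch of this idea, using a hypothetical linear "network" in plain NumPy just to show the shapes involved (this is not the implementation used later in this post):

import numpy as np

n_features, n_actions = 4, 2                      # e.g. a state with 4 features, 2 possible actions
W = np.random.randn(n_features, n_actions) * 0.1  # stand-in for the network's weights

def q_net(state):
    # maps one state to a Q value for every action, replacing one row of the Q table
    return state @ W                              # shape: (n_actions,)

state = np.random.randn(n_features)
q_values = q_net(state)                           # Q values of all actions for this state
action = int(np.argmax(q_values))                 # greedy choice: the action with the largest Q value
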
So how are the Q values updated, i.e. how is the network trained? One problem is that supervised learning needs a large amount of data; another is that reinforcement learning produces data as an ordered sequence, while a neural network expects independent and identically distributed (i.i.d.) samples.
The solution is to set up a replay buffer that stores the transitions (s, a, r, s_) generated by the agent's interaction with the environment. Its capacity is large, and once the buffer is full, new data overwrites the oldest data. At every training step a batch of transitions is drawn at random from the replay buffer. This breaks the correlation between samples and moves the data closer to i.i.d., and it also improves data efficiency.
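
As a sketch, a replay buffer only needs three behaviors: store a transition, drop the oldest one when full, and sample a random batch. The class below is illustrative only; the implementation later in this post keeps the transitions in a fixed-size NumPy array with a rolling index instead, which is the same idea:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # a deque with maxlen automatically discards the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_):
        self.buffer.append((s, a, r, s_))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of the trajectory
        return random.sample(self.buffer, batch_size)
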
The other problem to solve is instability. In Q-learning, the current update is determined by the previous Q value and the Q target.

$$q_T(s, a) = (1 - \alpha)\, q_{T-1}(s, a) + \alpha \left[ r(s') + \gamma \max_{a'} q_{T-1}(s', a') \right]$$

If a single neural network performs this update, the differences between samples introduce fluctuations. Because of this inherent noise in the data, the iterates may oscillate, and any oscillation is carried forward through later iterations, so we cannot obtain a stable model.
To make the model more stable, the two parts of the update are split apart and decoupled.
We therefore add an identical second network, the target network. The target network has exactly the same architecture as the behavior network and is initialized with the same parameters. The behavior network interacts with the environment to collect (s, a, r, s_) and decides which action to take, while the target network has only one job: to supply the bootstrap term in the formula above.
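
A rough sketch of this decoupling, again with a hypothetical linear network in plain NumPy (all names here are illustrative):

import numpy as np

n_features, n_actions, gamma = 4, 2, 0.9

theta_behavior = np.random.randn(n_features, n_actions) * 0.1  # behavior (eval) network parameters
theta_target = theta_behavior.copy()                           # target network starts as an exact copy

def q(state, theta):
    return state @ theta

s_next, r = np.random.randn(n_features), 1.0
# the behavior network picks actions and is trained at every step;
# the target network only supplies the bootstrap term of the update target:
td_target = r + gamma * np.max(q(s_next, theta_target))

# every C steps the target network is re-synchronized with the behavior network
theta_target = theta_behavior.copy()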

Pseudocode

Deep Q-learning with experience replay:

Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do:
- Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
- For t = 1, T do:
- With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(φ_t, a; θ)
- Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
- Set s_{t+1} = x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
- Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
- Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
- Set:

$$y_j = \begin{cases} r_j & \text{if the episode terminates at step } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-) & \text{otherwise} \end{cases}$$

- Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters θ
- Set s_t = s_{t+1}
- Every C steps reset Q̂ = Q
- End for
- End for
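
The "Set y_j" step can be computed for a whole minibatch in one vectorized operation. The sketch below uses hypothetical array names and, unlike the simplified implementation further down, also masks out the bootstrap term on terminal transitions:

import numpy as np

gamma = 0.9
# hypothetical minibatch: rewards r_j, terminal flags for step j+1,
# and the target network's outputs Q̂(φ_{j+1}, a'; θ⁻) with shape (batch, n_actions)
rewards = np.array([1.0, 0.0, -1.0])
done = np.array([False, True, False])
q_next = np.array([[0.5, 0.2],
                   [0.1, 0.3],
                   [0.0, 0.4]])

# y_j = r_j                                      if the episode terminates at step j+1
# y_j = r_j + gamma * max_a' Q̂(φ_{j+1}, a')     otherwise
y = rewards + gamma * np.max(q_next, axis=1) * (~done)
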
The errors in the paper have been corrected in the pseudocode above. The algorithm as printed in the paper is the following:
[Figure: the DQN algorithm as printed in the original paper]

What is needed

A container D with capacity N that stores a large number of (s, a, r, s') transitions, together with a batch_size that controls how many samples are drawn from the container at a time;
Two networks with the same structure, Q and Q', initialized with the same parameters.

Reproducing the core code from 莫烦PYTHON

# coding:utf-8
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'


class DeepQNet:

    def __init__(
            self,
            n_features,
            n_actions,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy_max=0.9,
            e_greedy_increment=None,
            replace_target_iter=300,
            memory_pool_size=500,
            batch_size=32,
            output_graph=False,
    ):
        # Reinforcement-learning hyperparameters
        self.gamma = reward_decay  # discount factor
        self.e_greedy_max = e_greedy_max  # e.g. 0.9: exploit 90% of the time, explore 10%; "e" stands for epsilon
        self.e_greedy_increment = e_greedy_increment  # step size for annealing epsilon
        self.e_greedy = 0 if e_greedy_increment is not None else e_greedy_max  # with an increment, start from pure exploration and anneal towards exploitation; otherwise keep epsilon fixed

        # Neural-network hyperparameters
        self.lr = learning_rate  # learning rate alpha
        self.n_features = n_features  # number of features in state and state_
        self.n_actions = n_actions  # number of actions

        self.replace_target_iter = replace_target_iter  # update the q target network's parameters every replace_target_iter learning steps
        self.learn_step_counter = 0  # counts learning steps, used to decide when to update the q target network

        # Replay memory
        self.memory_pool_size = memory_pool_size  # capacity of the replay memory; usually large, e.g. one million
        self.memory_pool_counter = 0
        self.memory_pool = np.zeros((memory_pool_size, n_features * 2 + 2))  # zero-initialized replay memory; each row stores (state, action, reward, state_)
        self.batch_size = batch_size  # number of transitions sampled from the memory each time

        self._build_net()  # build the q target net and the q evaluate net

        target_params = tf.get_collection('target_net_params')  # fetch the q target net's parameters from its collection
        eval_params = tf.get_collection('eval_net_params')  # fetch the q eval net's parameters
        self.replace_q_target_op_params = [tf.assign(t, e) for t, e in zip(target_params, eval_params)]

        self.cost_history = []  # history of the cost, used to monitor training

        self.sess = tf.Session()

        if output_graph:
            # The full path from the root directory is needed.
            # FIXME: could write the graph only once instead of regenerating it on every run (skip if it already exists)
            # tensorboard --logdir=name1:/Users/tu/PycharmProjects/myFirstPythonDir/DQN/logs
            tf.summary.FileWriter('logs/', self.sess.graph)

        self.sess.run(tf.global_variables_initializer())

    # Build the neural networks
    def _build_net(self):
        self.state = tf.placeholder(tf.float32, [None, self.n_features], name='state')
        self.state_ = tf.placeholder(tf.float32, [None, self.n_features], name='state_')
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # the q target values computed from the target net

        w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)

        with tf.variable_scope('eval_net'):
            my_collections = ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, 20], initializer=w_initializer, collections=my_collections)
                b1 = tf.get_variable('b1', [1, 20], initializer=b_initializer, collections=my_collections)
                l1 = tf.nn.relu(tf.matmul(self.state, w1) + b1)

            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [20, self.n_actions], initializer=w_initializer, collections=my_collections)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=my_collections)
                self.q_eval = tf.matmul(l1, w2) + b2  # outputs the action values for the given state

        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))

        with tf.variable_scope('train'):
            self.train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

        with tf.variable_scope('target_net'):  # FIXME: as written, the two networks start from different random parameters; they are meant to start identical
            my_collections = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, 20], initializer=w_initializer, collections=my_collections)
                b1 = tf.get_variable('b1', [1, 20], initializer=b_initializer, collections=my_collections)
                l1 = tf.nn.relu(tf.matmul(self.state_, w1) + b1)

            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [20, self.n_actions], initializer=w_initializer, collections=my_collections)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=my_collections)
                self.q_next = tf.matmul(l1, w2) + b2

    # Store a transition in the replay memory
    def store_memory(self, state, action, reward, state_):
        transition = np.hstack((state, action, reward, state_))
        index = self.memory_pool_counter % self.memory_pool_size
        self.memory_pool[index, :] = transition
        self.memory_pool_counter += 1

    # Choose an action (epsilon-greedy)
    def choose_action(self, observation):
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.e_greedy:
            actions_value = self.sess.run(self.q_eval, feed_dict={self.state: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

    # Update the Q network
    def learn(self):
        # sample a random batch of data from the memory pool
        if self.memory_pool_counter >= self.memory_pool_size:
            sample_index = np.random.choice(self.memory_pool_size, self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_pool_counter, self.batch_size)
        batch_memory = self.memory_pool[sample_index, :]

        # compute the eval-net and target-net outputs for the batch, then build the targets
        q_eval, q_next = self.sess.run([self.q_eval, self.q_next],
                                       feed_dict={self.state: batch_memory[:, :self.n_features],
                                                  self.state_: batch_memory[:, -self.n_features:]})
        q_target = q_eval.copy()
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_action_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]
        q_target[batch_index, eval_action_index] = reward + self.gamma * np.max(q_next, axis=1)  # axis=1 takes the max over actions for each sample

        # the targets and the predictions form the loss; update the q eval net's parameters
        _, cost = self.sess.run([self.train_op, self.loss], feed_dict={self.state: batch_memory[:, :self.n_features],
                                                                       self.q_target: q_target})

        # every replace_target_iter learning steps, update the q target network
        if (self.learn_step_counter % self.replace_target_iter) == 0:
            self.sess.run(self.replace_q_target_op_params)
            print('q target net has been updated')

        # record the loss
        self.cost_history.append(cost)

        # anneal epsilon towards e_greedy_max
        self.e_greedy = self.e_greedy + self.e_greedy_increment \
            if self.e_greedy < self.e_greedy_max else self.e_greedy_max

        self.learn_step_counter += 1

    # Plot the cost curve
    def plot_cost(self):
        plt.plot(np.arange(len(self.cost_history)), self.cost_history)
        plt.xlabel('my training steps')
        plt.ylabel('my cost')
        plt.show()
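
The class above only defines the agent. A typical loop that drives it, appended after the class in the same script, looks roughly like the sketch below; DummyEnv is a made-up stand-in for the maze environment from 莫烦PYTHON's tutorial, included only so the sketch runs end to end:

class DummyEnv:
    """A trivial random environment (2 state features, 4 actions) used only for this sketch."""

    def reset(self):
        self.t = 0
        return np.random.randn(2)

    def step(self, action):
        self.t += 1
        # returns (next observation, reward, done); each episode ends after 50 steps
        return np.random.randn(2), float(np.random.randn()), self.t >= 50


if __name__ == '__main__':
    env = DummyEnv()
    agent = DeepQNet(n_features=2, n_actions=4, e_greedy_increment=0.001)

    step = 0
    for episode in range(50):
        observation = env.reset()
        while True:
            action = agent.choose_action(observation)
            observation_, reward, done = env.step(action)
            agent.store_memory(observation, action, reward, observation_)
            if step > 200 and step % 5 == 0:  # start learning once the memory pool holds some data
                agent.learn()
            observation = observation_
            step += 1
            if done:
                break

    agent.plot_cost()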

Results

The cost curve:
[Figure: cost over training steps]
You can see that it does indeed fluctuate.
Wins and losses:
[Figure: win/loss record]
The game being played is fairly simple:
[Figure: screenshot of the game environment]

Next step

Flappy Bird

Source: blog.csdn.net/cluster1893/article/details/80476575