Paddle reinforcement learning from entry to practice (Day3) based on deep learning method: DQN

Background introduction

There are many situations in real scenes. We cannot simply abstract into discrete states (or too many discrete states). As a result, we cannot use table-based methods. At this time, we should introduce deep learning methods to help. We perceive the state and act as a Q function to find the Q of each Action in this state.

The value function is the Q function. The function of the Q table is to look up the table and output the Q value according to the action of the input state
. Disadvantages of the table method:

  • Tables can take up a lot of memory
  • When the table is very large, the table look-up efficiency is low

So in fact, we can use the Q function with parameters to approximate the Q table. For example, we can use a polynomial function or a neural network to
approximate the advantages of a value function:

  • Only need to store limited parameters
  • State generalization, similar states can output the same

basic structure

Difference between DQN and Q-Learning

From this, it can be seen that Q-learning artificially updates the table according to the environment, and then looks up the table to obtain the value of each A in this state; while DQN uses a neural network to replace the manual process of maintaining this table, and directly let the nerves The network calculates the value of each A in this state from the environment.

In analogy to the training of supervised neural network, DQN training is just a little difference in loss.

Two key issues of DQN

problem

  • The neural network requires that each sample be independently distributed during training, but the state in reinforcement learning is dependent on the front and back.
  • After the neural network is optimized, the same input may not have the same output, so our target value will also change accordingly, which is not convenient for optimization.

solution

Experience playback

After sampling some state disorder, input the neural network to obtain the corresponding value, and then optimize the neural network according to the target value, and put the sampled state in an experience pool for reuse, so that the sampled sample can be reused. And shuffle the samples to make the samples approximately independent.

Fixed Q target

After several iterations, we fixed the parameters of the neural network, and used the output of the model as the target value to continue to optimize the neural network.

Summarizing the DQN process can be expressed as the following figure:

PARL framework agent structure

View from right to left in three layers:

agent: Responsible for directly interacting with the environment, sample() uses the e-greedyc strategy to sample decisions during training, and predict() is the actual decision based on the environment after the training is completed. learn() is responsible for calling the update algorithm and controlling the training process.

algorithm: Responsible for the establishment and training of neural network models, and synchronization of target models.

model: The specific structure of the neural network model.

Detailed code

Model definition

class Model(parl.Model):
    def __init__(self, act_dim):

        hide1_size = 128
        hide2_size = 256
        hide3_size = 128

        self.fc1 = layers.fc(size = hide1_size,act='relu')
        self.fc2 = layers.fc(size = hide2_size,act='relu')
        self.fc3 = layers.fc(size = hide3_size,act='relu')
        self.fc4 = layers.fc(size = act_dim,act=None)

    def value(self, obs):
        # 定义网络
        # 输入state,输出所有action对应的Q,[Q(s,a1), Q(s,a2), Q(s,a3)...]
        
        h1 = self.fc1(obs)
        h2 = self.fc2(h1)
        h3 = self.fc3(h2)
        Q = self.fc4(h3)

        return Q

Algorithm definition

Call the already implemented directly from PALL

#from parl.algorithms import DQN 


# from parl.algorithms import DQN # 也可以直接从parl库中导入DQN算法

class DQN(parl.Algorithm):
    def __init__(self, model, act_dim=None, gamma=None, lr=None):
        """ DQN algorithm
        
        Args:
            model (parl.Model): 定义Q函数的前向网络结构
            act_dim (int): action空间的维度,即有几个action
            gamma (float): reward的衰减因子
            lr (float): learning rate 学习率.
        """
        self.model = model
        self.target_model = copy.deepcopy(model)

        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr

    def predict(self, obs):
        """ 使用self.model的value网络来获取 [Q(s,a1),Q(s,a2),...]
        """
        return self.model.value(obs)

    def learn(self, obs, action, reward, next_obs, terminal):
        """ 使用DQN算法更新self.model的value网络
        """
        # 从target_model中获取 max Q' 的值,用于计算target_Q
        next_pred_value = self.target_model.value(next_obs)
        best_v = layers.reduce_max(next_pred_value, dim=1)
        best_v.stop_gradient = True  # 阻止梯度传递
        terminal = layers.cast(terminal, dtype='float32')
        target = reward + (1.0 - terminal) * self.gamma * best_v

        pred_value = self.model.value(obs)  # 获取Q预测值
        # 将action转onehot向量,比如:3 => [0,0,0,1,0]
        action_onehot = layers.one_hot(action, self.act_dim)
        action_onehot = layers.cast(action_onehot, dtype='float32')
        # 下面一行是逐元素相乘,拿到action对应的 Q(s,a)
        # 比如:pred_value = [[2.3, 5.7, 1.2, 3.9, 1.4]], action_onehot = [[0,0,0,1,0]]
        #  ==> pred_action_value = [[3.9]]
        pred_action_value = layers.reduce_sum(
            layers.elementwise_mul(action_onehot, pred_value), dim=1)

        # 计算 Q(s,a) 与 target_Q的均方差,得到loss
        cost = layers.square_error_cost(pred_action_value, target)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(learning_rate=self.lr)  # 使用Adam优化器
        optimizer.minimize(cost)
        return cost

    def sync_target(self):
        """ 把 self.model 的模型参数值同步到 self.target_model
        """
        self.model.sync_weights_to(self.target_model)

Agent definition

class Agent(parl.Agent):
    def __init__(self,
                 algorithm,
                 obs_dim,
                 act_dim,
                 e_greed=0.1,
                 e_greed_decrement=0):
        assert isinstance(obs_dim, int)
        assert isinstance(act_dim, int)
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(Agent, self).__init__(algorithm)

        self.global_step = 0
        self.update_target_steps = 200  # 每隔200个training steps再把model的参数复制到target_model中

        self.e_greed = e_greed  # 有一定概率随机选取动作,探索
        self.e_greed_decrement = e_greed_decrement  # 随着训练逐步收敛,探索的程度慢慢降低

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):  # 搭建计算图用于 预测动作,定义输入输出变量
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.value = self.alg.predict(obs)

        with fluid.program_guard(self.learn_program):  # 搭建计算图用于 更新Q网络,定义输入输出变量
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            action = layers.data(name='act', shape=[1], dtype='int32')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            next_obs = layers.data(
                name='next_obs', shape=[self.obs_dim], dtype='float32')
            terminal = layers.data(name='terminal', shape=[], dtype='bool')
            self.cost = self.alg.learn(obs, action, reward, next_obs, terminal)

    def sample(self, obs):
        sample = np.random.rand()  # 产生0~1之间的小数
        if sample < self.e_greed:
            act = np.random.randint(self.act_dim)  # 探索:每个动作都有概率被选择
        else:
            act = self.predict(obs)  # 选择最优动作
        self.e_greed = max(
            0.01, self.e_greed - self.e_greed_decrement)  # 随着训练逐步收敛,探索的程度慢慢降低
        return act

    def predict(self, obs):  # 选择最优动作
        obs = np.expand_dims(obs, axis=0)
        pred_Q = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.value])[0]
        pred_Q = np.squeeze(pred_Q, axis=0)
        act = np.argmax(pred_Q)  # 选择Q最大的下标,即对应的动作
        return act

    def learn(self, obs, act, reward, next_obs, terminal):
        # 每隔200个training steps同步一次model和target_model的参数
        if self.global_step % self.update_target_steps == 0:
            self.alg.sync_target()
        self.global_step += 1

        act = np.expand_dims(act, -1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int32'),
            'reward': reward,
            'next_obs': next_obs.astype('float32'),
            'terminal': terminal
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]  # 训练一次网络
        return cost

Experience pool

import random
import collections
import numpy as np

class ReplayMemory(object):
    def __init__(self, max_size):
        self.buffer = collections.deque(maxlen=max_size)

    # 增加一条经验到经验池中
    def append(self, exp):
        self.buffer.append(exp)

    # 从经验池中选取N条经验出来
    def sample(self, batch_size):
        mini_batch = random.sample(self.buffer, batch_size)
        obs_batch, action_batch, reward_batch, next_obs_batch, done_batch = [], [], [], [], []

        for experience in mini_batch:
            s, a, r, s_p, done = experience
            obs_batch.append(s)
            action_batch.append(a)
            reward_batch.append(r)
            next_obs_batch.append(s_p)
            done_batch.append(done)

        return np.array(obs_batch).astype('float32'), \
            np.array(action_batch).astype('float32'), np.array(reward_batch).astype('float32'),\
            np.array(next_obs_batch).astype('float32'), np.array(done_batch).astype('float32')

    def __len__(self):
        return len(self.buffer)

Training and testing

# 训练一个episode
def run_episode(env, agent, rpm):
    total_reward = 0
    obs = env.reset()
    step = 0
    while True:
        step += 1
        action = agent.sample(obs)  # 采样动作,所有动作都有概率被尝试到
        next_obs, reward, done, _ = env.step(action)
        rpm.append((obs, action, reward, next_obs, done))

        # train model
        if (len(rpm) > MEMORY_WARMUP_SIZE) and (step % LEARN_FREQ == 0):
            (batch_obs, batch_action, batch_reward, batch_next_obs,
             batch_done) = rpm.sample(BATCH_SIZE)
            train_loss = agent.learn(batch_obs, batch_action, batch_reward,
                                     batch_next_obs,
                                     batch_done)  # s,a,r,s',done

        total_reward += reward
        obs = next_obs
        if done:
            break
    return total_reward


# 评估 agent, 跑 5 个episode,总reward求平均
def evaluate(env, agent, render=False):
    eval_reward = []
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            action = agent.predict(obs)  # 预测动作,只选最优动作
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if done:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)

Total process control

# 创建环境
env = gym.make('CartPole-v0') # CartPole-v0: expected reward > 180                MountainCar-v0 : expected reward > -120
action_dim = env.action_space.n  # MountainCar-v0: 3
obs_shape = env.observation_space.shape  # MountainCar-v0: (2,)

# 创建经验池
rpm = ReplayMemory(MEMORY_SIZE)  # DQN的经验回放池


# 根据parl框架构建agent
model = Model(act_dim = action_dim)
algorithm = DQN(model,act_dim = action_dim,gamma=GAMMA,lr = LEARNING_RATE)
agent = Agent(algorithm
        ,obs_dim=obs_shape[0]
        ,act_dim=action_dim
        ,e_greed=0.1
        ,e_greed_decrement=1e-6)



# 加载模型
# save_path = './dqn_model.ckpt'
# agent.restore(save_path)

# 先往经验池里存一些数据,避免最开始训练的时候样本丰富度不够
while len(rpm) < MEMORY_WARMUP_SIZE:
    run_episode(env, agent, rpm)

max_episode = 2000

# 开始训练
episode = 0
while episode < max_episode:  # 训练max_episode个回合,test部分不计算入episode数量
    # train part
    for i in range(0, 50):
        total_reward = run_episode(env, agent, rpm)
        episode += 1

    # test part
    eval_reward = evaluate(env, agent, render=False)  # render=True 查看显示效果
    logger.info('episode:{}    e_greed:{}   test_reward:{}'.format(
        episode, agent.e_greed, eval_reward))

# 训练结束,保存模型
save_path = './dqn_model.ckpt'
agent.save(save_path)

Guess you like

Origin blog.csdn.net/fan1102958151/article/details/106879093