Hands-On DQN for CartPole

Introduction

This post shows how to use PyTorch to train a Deep Q-Learning (DQN) agent on the CartPole-v0 task from OpenAI Gym.


Task

The agent must choose between two actions, moving the cart left or right, so that the pole attached to the cart stays upright. You can find the official leaderboard, with various algorithms and visualizations, on the Gym website.
When the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and returns a reward indicating the consequence of that action. In this task, the reward is +1 for every time step, and the episode terminates if the pole falls over too far or the cart moves more than 2.4 units away from the center. This means that better-performing policies run for longer, accumulating a larger return.
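
To make this interaction concrete, here is a minimal sketch of the raw environment loop with a purely random policy (the demo_env name is just for illustration; it assumes an older Gym release where env.step returns a 4-tuple, which is what the rest of the code in this post also assumes):

import gym

demo_env = gym.make('CartPole-v0')
demo_env.reset()
total_reward, done = 0.0, False
while not done:
    action = demo_env.action_space.sample()      # choose left (0) or right (1) at random
    _, reward, done, _ = demo_env.step(action)   # reward is +1 for every surviving time step
    total_reward += reward                       # so the return equals the episode length
print(total_reward)
demo_env.close()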

The CartPole task is designed so that the agent's inputs are 4 real values representing the environment state (position, velocity, and so on). However, a neural network can solve the task purely by looking at the scene, so we will use a patch of the screen centered on the cart as input. Because of this, our results are not directly comparable to the ones on the official leaderboard; our task is harder. Unfortunately, this also slows down training, because we have to render all the frames.

Strictly speaking, we will present the state as the difference between the current screen patch and the previous one. This allows the agent to take the pole's velocity into account from a single image.

First, let's import the required packages. We need gym for the environment (install it with pip install gym). From PyTorch we will also use:

  • neural networks (torch.nn)
  • optimization (torch.optim)
  • automatic differentiation (torch.autograd)
  • utilities for vision tasks (torchvision - a separate package).
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple
from itertools import count
from PIL import Image

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T


env = gym.make('CartPole-v0').unwrapped

# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Replay Memory

We will use experience replay memory to train our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it at random, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.

For this, we need two classes:

  • Transition: a named tuple representing a single transition in the environment. It essentially maps (state, action) pairs to their (next_state, reward) result, with the state being the screen-difference image described later.
  • ReplayMemory: a cyclic buffer of bounded size that holds the recently observed transitions. It also implements a .sample() method for selecting a random batch of transitions for training.
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
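
As a quick, hypothetical usage sketch (not part of the training code), the following shows the circular overwrite performed by push() once capacity is reached, and the random batch returned by sample():

scratch = ReplayMemory(2)                     # tiny capacity, just for illustration
for i in range(3):
    s = torch.full((1, 1), float(i))
    scratch.push(s, torch.tensor([[i % 2]]), s + 1, torch.tensor([1.0]))
print(len(scratch))       # 2 -- the oldest transition has been overwritten
print(scratch.sample(2))  # a list of 2 randomly chosen Transition tuples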

Q-network

Our model will be a convolutional neural network that takes in the difference between the current and previous screen patches. It has two outputs, representing Q(s, left) and Q(s, right) (where s is the input to the network). In effect, the network is trying to predict the expected return of taking each action given the current input.

class DQN(nn.Module):

    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size = 5, stride = 2):
            return (size - (kernel_size - 1) - 1) // stride  + 1
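        # For example, with the roughly 3x40x90 patches produced by get_screen() below,
        # this recurrence gives width 90 -> 43 -> 20 -> 8 and height 40 -> 18 -> 7 -> 2,
        # so linear_input_size comes out to 8 * 2 * 32 = 512.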
        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))
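
As a small sanity check (a hypothetical snippet, assuming the roughly 3x40x90 patches produced later by get_screen()), the network maps a single screen-difference patch to one Q-value per action:

check_net = DQN(h=40, w=90, outputs=2).eval()  # eval() so BatchNorm uses running statistics
dummy = torch.zeros(1, 3, 40, 90)              # one RGB screen-difference patch in BCHW order
print(check_net(dummy).shape)                  # torch.Size([1, 2]) -> Q(s, left), Q(s, right)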

Input extraction

The code below contains utilities for extracting and processing the rendered images from the environment. It uses the torchvision package, which makes image transforms easy to compose. Once you run the cell, it will display an example patch that it extracted.

resize = T.Compose([T.ToPILImage(),
                    T.Resize(40, interpolation=Image.BICUBIC),
                    T.ToTensor()])


def get_cart_location(screen_width):
    world_width = env.x_threshold * 2
    scale = screen_width / world_width
    return int(env.state[0] * scale + screen_width / 2.0)  # MIDDLE OF CART

def get_screen():
    # Returned screen requested by gym is 400x600x3, but is sometimes larger
    # such as 800x1200x3. Transpose it into torch order (CHW).
    screen = env.render(mode='rgb_array').transpose((2, 0, 1))
    # Cart is in the lower half, so strip off the top and bottom of the screen
    _, screen_height, screen_width = screen.shape
    screen = screen[:, int(screen_height*0.4):int(screen_height * 0.8)]
    view_width = int(screen_width * 0.6)
    cart_location = get_cart_location(screen_width)
    if cart_location < view_width // 2:
        slice_range = slice(view_width)
    elif cart_location > (screen_width - view_width // 2):
        slice_range = slice(-view_width, None)
    else:
        slice_range = slice(cart_location - view_width // 2,
                            cart_location + view_width // 2)
    # Strip off the edges, so that we have a square image centered on a cart
    screen = screen[:, :, slice_range]
    # Convert to float, rescale, convert to torch tensor
    # (this doesn't require a copy)
    screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
    screen = torch.from_numpy(screen)
    # Resize, and add a batch dimension (BCHW)
    return resize(screen).unsqueeze(0).to(device)


env.reset()
plt.figure()
plt.imshow(get_screen().cpu().squeeze(0).permute(1, 2, 0).numpy(),
           interpolation='none')
plt.title('Example extracted screen')
plt.show()
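
If you want to confirm the exact patch size that the DQN will receive, a quick check such as the following (the numbers assume the default 400x600 render buffer) should print something close to the 3x40x90 mentioned in the next cell, plus the batch dimension:

print(get_screen().shape)  # e.g. torch.Size([1, 3, 40, 90])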

Training

Hyperparameters and utilities

This cell instantiates our model and its optimizer, and defines some utilities:

  • select_action: selects an action according to an epsilon-greedy policy. Simply put, we will sometimes use our model to choose the action, and sometimes we will sample one uniformly at random. The probability of choosing a random action starts at EPS_START and decays exponentially towards EPS_END; EPS_DECAY controls the rate of the decay.
  • plot_durations: a helper for plotting the duration of episodes, along with an average over the last 100 episodes (the measure used in the official evaluations). The plot will sit below the cell containing the main training loop and will update after every episode.
BATCH_SIZE = 128
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10

# Get screen size so that we can initialize layers correctly based on shape
# returned from AI gym. Typical dimensions at this point are close to 3x40x90
# which is the result of a clamped and down-scaled render buffer in get_screen()
init_screen = get_screen()
_, _, screen_height, screen_width = init_screen.shape

# Get number of actions from gym action space
n_actions = env.action_space.n

policy_net = DQN(screen_height, screen_width, n_actions).to(device)
target_net = DQN(screen_height, screen_width, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.RMSprop(policy_net.parameters())
memory = ReplayMemory(10000)


steps_done = 0


def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
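    # With the defaults above (EPS_START=0.9, EPS_END=0.05, EPS_DECAY=200), eps_threshold is
    # about 0.90 at steps_done=0, about 0.36 after 200 steps, and about 0.056 after 1000 steps.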
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        return torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)


episode_durations = []


def plot_durations():
    plt.figure(2)
    plt.clf()
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        display.clear_output(wait=True)
        display.display(plt.gcf())

Training loop

Finally, the code for training our model.

Here you can find an optimize_model function that performs a single step of the optimization. It first samples a batch, concatenates all the tensors into a single one, computes Q(s_t, a_t) and V(s_{t+1}) = max_a Q(s_{t+1}, a), and combines them into our loss. By definition, V(s) = 0 if s is a terminal state. For added stability, we also use a target network to compute V(s_{t+1}). The target network's weights are kept frozen most of the time, but it is updated with the policy network's weights every so often. This is usually a set number of steps, but we will use episodes for simplicity.
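
For reference, the quantity minimized below is the Huber (smooth L1) loss applied to the temporal-difference error, which can be written as

\delta = Q(s, a) - \left( r + \gamma \max_{a'} Q_{\text{target}}(s', a') \right)

\mathcal{L} = \frac{1}{|B|} \sum_{(s, a, s', r) \in B} L(\delta),
\qquad
L(\delta) =
\begin{cases}
  \frac{1}{2}\delta^{2} & \text{for } |\delta| \le 1, \\
  |\delta| - \frac{1}{2} & \text{otherwise,}
\end{cases}

where B is the batch sampled from the replay memory, \gamma is GAMMA, Q_target is evaluated by target_net, and the max term is taken to be 0 for terminal next states.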

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()

Below, you can find the main training loop. At the beginning we reset the environment and initialize the state tensor. Then we sample an action, execute it, observe the next screen and the reward (always 1), and optimize our model once. When the episode ends (our model fails), we restart the loop.

Below, num_episodes is set to a small value. You should download the notebook and run many more episodes, such as 300+, to see meaningful improvements in duration.

num_episodes = 50
for i_episode in range(num_episodes):
    # Initialize the environment and state
    env.reset()
    last_screen = get_screen()
    current_screen = get_screen()
    state = current_screen - last_screen
    for t in count():
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the target network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

print('Complete')
env.render()
env.close()
plt.ioff()
plt.show()

The diagram below illustrates the overall resulting data flow.
(Figure: overall DQN training data flow)
Actions are chosen either at random or based on a policy, and the next step's sample is obtained from the gym environment. We record the results in the replay memory and also run an optimization step on every iteration. The optimization picks a random batch from the replay memory to train the new policy. The "older" target_net is also used in the optimization to compute the expected Q values; it is updated occasionally to keep it current.


Reprinted from blog.csdn.net/weixin_42990464/article/details/112345535