Algorithm idea
TD3 tackles the Q-value overestimation problem in the actor-critic framework, a problem that can be traced back to double Q-learning:
$$y = r + \gamma Q_{\theta_1}\left(s', \arg\max_a Q_{\theta_2}(s', a)\right)$$
The max operation treats the largest Q-value at $s'$ as an estimate of the state value, but the true V is a weighted average of Q-values under the policy, so V is less than or equal to the max and the max operation overestimates. For another example, let $x_1, x_2, \ldots$ be original signals and add zero-mean noise with variance $\sigma^2$ to obtain $x_{noise1}, x_{noise2}, \ldots$. Each signal's expectation is unchanged, but in expectation $\max\{x_{noise1}, x_{noise2}, \ldots\} \ge \max\{x_1, x_2, \ldots\}$, because the max preferentially picks up positive noise. Ideal reinforcement learning should approximate the true Q, so TD3 sets out to solve this thorny problem.
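This bias is easy to verify numerically. A self-contained sketch (mine, not from the original post): add zero-mean noise to three fixed values; each per-value mean is unchanged, but the expected max drifts upward.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])                    # true Q-values for three actions
noise = rng.normal(0.0, 1.0, size=(100_000, 3))  # zero-mean estimation noise
x_noisy = x + noise

print(x_noisy.mean(axis=0))        # ~[1, 2, 3]: each estimate is unbiased
print(x_noisy.max(axis=1).mean())  # > 3: E[max] exceeds the true max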
Twin critic networks
TD3 is less a standalone algorithm than a set of "buffs" for an algorithm; TD3 is what you get when these buffs are applied to DDPG.
TD3's core idea is to use twin critic networks to eliminate overestimation, following the principle of "better to underestimate than to overestimate". DDPG's original target_q is changed to the following computation:
$$y_1 = r + \gamma \min_{i=1,2} Q_{\theta_i'}\left(s', \pi_{\phi'}(s')\right)$$
The smaller of the two critics' outputs is used to compute q_target; this q_target forms a loss with each critic's output, and backpropagation updates both networks, so whichever critic overestimates gets "pulled" back down.
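In code this is essentially a one-line change to the target computation. A minimal sketch (the function and argument names here are illustrative, not taken from the implementations below):

import torch

def td3_q_target(reward, done, gamma, next_s, actor_target, critic1_target, critic2_target):
    # "better to underestimate": take the elementwise min of the twin target critics
    with torch.no_grad():
        next_a = actor_target(next_s)
        next_q = torch.min(critic1_target(next_s, next_a),
                           critic2_target(next_s, next_a))
        return reward + gamma * (1.0 - done) * next_q

Both critics are then regressed toward this single shared target.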
Target policy smoothing regularization
In DDPG, the target actor takes s' and feeds the resulting target action straight into the target critic. TD3 argues that adding a small perturbation to that action makes the value function smoother along the action dimension, so:
$$\begin{aligned} y &= r + \gamma \min_{i=1,2} Q_{\theta_i'}\left(s', \pi_{\phi'}(s') + \epsilon\right) \\ \epsilon &\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c) \end{aligned}$$
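A sketch of that perturbation ($\sigma$ and $c$ are the hyperparameters from the formula; 0.2 and 0.5 are the defaults used in the implementations below):

import torch

def smoothed_target_action(actor_target, next_s, sigma=0.2, c=0.5):
    a = actor_target(next_s)                          # deterministic target action
    eps = (torch.randn_like(a) * sigma).clamp(-c, c)  # epsilon ~ clip(N(0, sigma), -c, c)
    return (a + eps).clamp(-1.0, 1.0)                 # keep the action in the valid range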
TTUR
This trick comes from GANs (TTUR: Two Time-scale Update Rule). Applied to an actor-critic architecture, it means lowering the actor's update frequency so the critics converge before the actor is updated; typically the critics update twice for each actor update. Whenever the actor updates, TD3's three target networks are soft-updated:
$$\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$$
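Combining the delayed update with the soft update, the skeleton of the training loop looks roughly like this (a sketch; policy_delay = 2 and tau = 0.005 are the values suggested in the paper):

import torch.nn as nn

def soft_update(target_net: nn.Module, net: nn.Module, tau: float = 0.005):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.data.mul_(1.0 - tau)
        p_targ.data.add_(tau * p.data)

# inside the training loop, with step counter i:
#     update both critics every step
#     if i % policy_delay == 0:
#         update the actor once, then soft_update() all target networks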
In 2018 the TD3 paper claimed state-of-the-art performance, and a well-known Zhihu user claimed that wherever DDPG applies, TD3 can be dropped in as a replacement; still, you only know for your own problem after trying it.
Pseudocode
TD3's pseudocode is clear and concise:
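(The original pseudocode figure is not reproduced here; the following is a text rendition of Algorithm 1 from Fujimoto et al., 2018.)

Initialize critic networks Q_θ1, Q_θ2 and actor network π_φ with random parameters θ1, θ2, φ
Initialize target networks θ1' ← θ1, θ2' ← θ2, φ' ← φ
Initialize replay buffer B
for t = 1 to T:
    Select action with exploration noise a ~ π_φ(s) + ε, ε ~ N(0, σ); observe reward r and new state s'
    Store transition tuple (s, a, r, s') in B
    Sample mini-batch of N transitions (s, a, r, s') from B
    ã ← π_φ'(s') + ε, ε ~ clip(N(0, σ̃), -c, c)
    y ← r + γ min_{i=1,2} Q_θi'(s', ã)
    Update critics θi ← argmin_θi N⁻¹ Σ (y − Q_θi(s, a))²
    if t mod d == 0:
        Update φ by the deterministic policy gradient:
            ∇_φ J(φ) = N⁻¹ Σ ∇_a Q_θ1(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)
        Update target networks:
            θi' ← τ θi + (1 − τ) θi'
            φ' ← τ φ + (1 − τ) φ'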
Minimal high-performance Python implementations
I have organized two TD3 PyTorch implementations, one from ElegantRL and one from spinningup; both can be found on GitHub. ElegantRL also has an endearing nickname: "小雅" (Xiaoya).
In the ElegantRL implementation, TD3's twin critics share part of their network, which speeds up training; in the spinningup implementation, the twin critics are fully independent networks, which trains slightly slower but performs slightly better.
ElegantRL TD3
main.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import gym
from TD3Model import AgentTD3, ReplayBuffer
import matplotlib.pyplot as plt
import numpy as np

if __name__ == "__main__":
    env = gym.make('Pendulum-v0')
    state_dim, action_dim = env.observation_space.shape[0], env.action_space.shape[0]
    td3 = AgentTD3()
    td3.init(256, state_dim, action_dim)
    buffer = ReplayBuffer(int(1e6), state_dim, action_dim, False, True)
    MAX_EPISODE = 100
    MAX_STEP = 500
    batch_size = 100
    gamma = 0.99
    reward_list = []

    for episode in range(MAX_EPISODE):
        s = env.reset()
        ep_reward = 0
        for j in range(MAX_STEP):
            # if episode > 130: env.render()
            if episode > 20:
                a = td3.select_action(s) * 2  # scale [-1, 1] output to Pendulum's [-2, 2]
            else:
                a = env.action_space.sample()  # random actions to warm up the buffer
            s_, r, d, _ = env.step(a)
            mask = 0.0 if d else gamma  # mask folds the done flag and the discount together
            other = (r, mask, *a)
            buffer.append_buffer(s, other)
            if episode > 20 and j % 50 == 0:
                td3.update_net(buffer, 50, batch_size, 1)
            ep_reward += r
            s = s_
            if d:
                break
        reward_list.append(ep_reward)
        print('Episode:', episode, 'Reward:%f' % ep_reward)

    plt.figure()
    plt.plot(np.arange(len(reward_list)), reward_list)
    plt.show()
TD3Model.py
# !/usr/bin/python
# -*- coding: utf-8 -*-
import os
import numpy as np
import numpy.random as rd
import torch
import torch.nn as nn
from copy import deepcopy  # deepcopy target_network


def layer_norm(layer, std=1.0, bias_const=1e-6):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
class ReplayBuffer:
    def __init__(self, max_len, state_dim, action_dim, if_on_policy, if_gpu):
        """Experience Replay Buffer

        saves environment transitions in contiguous memory for high-performance training;
        trajectories are saved in order, with state and other (reward, mask, action, ...) stored separately.

        :int max_len: the maximum capacity of ReplayBuffer. First In First Out
        :int state_dim: the dimension of state
        :int action_dim: the dimension of action (action_dim==1 for discrete action)
        :bool if_on_policy: on-policy or off-policy
        :bool if_gpu: create buffer space on GPU or in CPU RAM
        """
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.max_len = max_len
        self.now_len = 0
        self.next_idx = 0
        self.if_full = False
        self.action_dim = action_dim  # for self.sample_all(
        self.if_on_policy = if_on_policy
        self.if_gpu = if_gpu

        if if_on_policy:
            self.if_gpu = False
            other_dim = 1 + 1 + action_dim * 2
        else:
            other_dim = 1 + 1 + action_dim

        if self.if_gpu:
            self.buf_other = torch.empty((max_len, other_dim), dtype=torch.float32, device=self.device)
            self.buf_state = torch.empty((max_len, state_dim), dtype=torch.float32, device=self.device)
        else:
            self.buf_other = np.empty((max_len, other_dim), dtype=np.float32)
            self.buf_state = np.empty((max_len, state_dim), dtype=np.float32)

    def append_buffer(self, state, other):  # CPU array to CPU array
        if self.if_gpu:
            state = torch.as_tensor(state, device=self.device)
            other = torch.as_tensor(other, device=self.device)
        self.buf_state[self.next_idx] = state
        self.buf_other[self.next_idx] = other

        self.next_idx += 1
        if self.next_idx >= self.max_len:
            self.if_full = True
            self.next_idx = 0

    def extend_buffer(self, state, other):  # CPU array to CPU array
        if self.if_gpu:
            state = torch.as_tensor(state, dtype=torch.float32, device=self.device)
            other = torch.as_tensor(other, dtype=torch.float32, device=self.device)

        size = len(other)
        next_idx = self.next_idx + size
        if next_idx > self.max_len:  # wrap around: fill the tail, then the head
            self.buf_state[self.next_idx:self.max_len] = state[:self.max_len - self.next_idx]
            self.buf_other[self.next_idx:self.max_len] = other[:self.max_len - self.next_idx]
            self.if_full = True
            next_idx = next_idx - self.max_len
            self.buf_state[0:next_idx] = state[-next_idx:]
            self.buf_other[0:next_idx] = other[-next_idx:]
        else:
            self.buf_state[self.next_idx:next_idx] = state
            self.buf_other[self.next_idx:next_idx] = other
        self.next_idx = next_idx

    def sample_batch(self, batch_size) -> tuple:
        """randomly sample a batch of data for training

        :int batch_size: the number of data in a batch for Stochastic Gradient Descent
        :return torch.Tensor reward: reward.shape==(batch_size, 1)
        :return torch.Tensor mask:   mask.shape  ==(batch_size, 1), mask = 0.0 if done else gamma
        :return torch.Tensor action: action.shape==(batch_size, action_dim)
        :return torch.Tensor state:  state.shape ==(batch_size, state_dim)
        :return torch.Tensor state:  state.shape ==(batch_size, state_dim), next state
        """
        indices = torch.randint(self.now_len - 1, size=(batch_size,), device=self.device) if self.if_gpu \
            else rd.randint(self.now_len - 1, size=batch_size)
        r_m_a = self.buf_other[indices]
        return (r_m_a[:, 0:1],                # reward
                r_m_a[:, 1:2],                # mask = 0.0 if done else gamma
                r_m_a[:, 2:],                 # action
                self.buf_state[indices],      # state
                self.buf_state[indices + 1])  # next state

    def sample_all(self) -> tuple:
        """sample all the data in ReplayBuffer (for on-policy)

        :return torch.Tensor reward: reward.shape==(now_len, 1)
        :return torch.Tensor mask:   mask.shape  ==(now_len, 1), mask = 0.0 if done else gamma
        :return torch.Tensor action: action.shape==(now_len, action_dim)
        :return torch.Tensor noise:  noise.shape ==(now_len, action_dim)
        :return torch.Tensor state:  state.shape ==(now_len, state_dim)
        """
        all_other = torch.as_tensor(self.buf_other[:self.now_len], device=self.device)
        return (all_other[:, 0],
                all_other[:, 1],
                all_other[:, 2:2 + self.action_dim],
                all_other[:, 2 + self.action_dim:],
                torch.as_tensor(self.buf_state[:self.now_len], device=self.device))

    def update_now_len_before_sample(self):
        """update the pointer `now_len`, the current number of transitions in the ReplayBuffer"""
        self.now_len = self.max_len if self.if_full else self.next_idx

    def empty_buffer_before_explore(self):
        """empty the buffer by setting now_len=0. On-policy methods need to empty the buffer before exploration"""
        self.next_idx = 0
        self.now_len = 0
        self.if_full = False

    def print_state_norm(self, neg_avg=None, div_std=None):  # non-essential
        max_sample_size = 2 ** 14

        '''check if pass'''
        state_shape = self.buf_state.shape
        if len(state_shape) > 2 or state_shape[1] > 64:
            print(f"| print_state_norm(): state_dim: {state_shape} is too large to print its norm. ")
            return None

        '''sample state'''
        indices = np.arange(self.now_len)
        rd.shuffle(indices)
        indices = indices[:max_sample_size]  # len(indices) = min(self.now_len, max_sample_size)
        batch_state = self.buf_state[indices]

        '''compute state norm'''
        if isinstance(batch_state, torch.Tensor):
            batch_state = batch_state.cpu().data.numpy()
        assert isinstance(batch_state, np.ndarray)
        if batch_state.shape[1] > 64:
            print(f"| _print_norm(): state_dim: {batch_state.shape[1]:.0f} is too large to print its norm. ")
            return None

        if np.isnan(batch_state).any():  # 2020-12-12
            batch_state = np.nan_to_num(batch_state)  # nan to 0

        ary_avg = batch_state.mean(axis=0)
        ary_std = batch_state.std(axis=0)
        fix_std = ((np.max(batch_state, axis=0) - np.min(batch_state, axis=0)) / 6 + ary_std) / 2

        if neg_avg is not None:  # norm transfer
            ary_avg = ary_avg - neg_avg / div_std
            ary_std = fix_std / div_std

        print(f"| print_norm: state_avg, state_fix_std")
        print(f"| avg = np.{repr(ary_avg).replace('=float32', '=np.float32')}")
        print(f"| std = np.{repr(ary_std).replace('=float32', '=np.float32')}")
class AgentBase:
    def __init__(self):
        self.learning_rate = 1e-4
        self.soft_update_tau = 2 ** -8  # 5e-3 ~= 2 ** -8
        self.state = None  # set for self.update_buffer(), initialize before training
        self.device = None

        self.act = self.act_target = None
        self.cri = self.cri_target = None
        self.act_optimizer = None
        self.cri_optimizer = None
        self.criterion = None

    def init(self, net_dim, state_dim, action_dim):
        """
        :int net_dim: net width
        :int state_dim
        :int action_dim
        """

    def select_action(self, state) -> np.ndarray:
        """
        :array state: state.shape==(state_dim, )
        :return array action: action.shape==(action_dim, ), (action.min(), action.max())==(-1, +1)
        """
        states = torch.as_tensor((state,), dtype=torch.float32, device=self.device).detach_()
        action = self.act(states)[0]
        return action.cpu().numpy()

    def select_actions(self, states) -> np.ndarray:
        """
        :array states: (state, ) or (state, state, ...) or state.shape==(n, *state_dim)
        :return array action: action.shape==(-1, action_dim), (action.min(), action.max())==(-1, +1)
        """
        states = torch.as_tensor(states, dtype=torch.float32, device=self.device).detach_()
        actions = self.act(states)
        return actions.cpu().numpy()  # -1 < action < +1

    def explore_env(self, env, buffer, target_step, reward_scale, gamma) -> int:
        """
        :env: RL training environment. env.reset() env.step()
        :buffer: Experience Replay Buffer. buffer.append_buffer() buffer.extend_buffer()
        :int target_step: explore target_step number of steps in env
        :float reward_scale: scale reward, 'reward * reward_scale'
        :float gamma: discount factor, 'mask = 0.0 if done else gamma'
        :return int target_step: collected target_step number of steps in env
        """
        for _ in range(target_step):
            action = self.select_action(self.state)
            next_s, reward, done, _ = env.step(action)
            other = (reward * reward_scale, 0.0 if done else gamma, *action)
            buffer.append_buffer(self.state, other)
            self.state = env.reset() if done else next_s
        return target_step

    def update_net(self, buffer, target_step, batch_size, repeat_times) -> (float, float):
        """
        :buffer: Experience Replay Buffer. buffer.append_buffer() buffer.extend_buffer()
        :int target_step: explore target_step number of steps in env
        :int batch_size: sample batch_size of data for Stochastic Gradient Descent
        :float repeat_times: the number of sampled batches = int(target_step * repeat_times) in off-policy
        :return float obj_a: the objective value of actor
        :return float obj_c: the objective value of critic
        """

    def save_load_model(self, cwd, if_save):
        """
        :str cwd: current working directory, the model file is saved here
        :bool if_save: True to save the model, False to load it
        """
        act_save_path = '{}/actor.pth'.format(cwd)
        cri_save_path = '{}/critic.pth'.format(cwd)

        def load_torch_file(network, save_path):
            network_dict = torch.load(save_path, map_location=lambda storage, loc: storage)
            network.load_state_dict(network_dict)

        if if_save:
            if self.act is not None:
                torch.save(self.act.state_dict(), act_save_path)
            if self.cri is not None:
                torch.save(self.cri.state_dict(), cri_save_path)
        elif (self.act is not None) and os.path.exists(act_save_path):
            load_torch_file(self.act, act_save_path)
            print("Loaded act:", cwd)
        elif (self.cri is not None) and os.path.exists(cri_save_path):
            load_torch_file(self.cri, cri_save_path)
            print("Loaded cri:", cwd)
        else:
            print("FileNotFound when load_model: {}".format(cwd))

    @staticmethod
    def soft_update(target_net, current_net, tau):
        """
        :nn.Module target_net: target network, updated by slowly tracking the current network; more stable
        :nn.Module current_net: current network, updated by an optimizer
        """
        for tar, cur in zip(target_net.parameters(), current_net.parameters()):
            tar.data.copy_(cur.data * tau + tar.data * (1 - tau))
class Actor(nn.Module):  # DPG: Deterministic Policy Gradient
    def __init__(self, mid_dim, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, mid_dim), nn.ReLU(),
                                 nn.Linear(mid_dim, mid_dim), nn.ReLU(),
                                 nn.Linear(mid_dim, mid_dim), nn.ReLU(),
                                 nn.Linear(mid_dim, action_dim))

    def forward(self, state):
        return self.net(state).tanh()  # action.tanh()

    def get_action(self, state, action_std):
        # target policy smoothing: add clipped Gaussian noise to the target action
        action = self.net(state).tanh()
        noise = (torch.randn_like(action) * action_std).clamp(-0.5, 0.5)
        return (action + noise).clamp(-1.0, 1.0)


class CriticTwin(nn.Module):
    def __init__(self, mid_dim, state_dim, action_dim):
        super().__init__()
        lay_dim = mid_dim
        # the twin critics share the state-action encoder net_sa and differ only in their heads
        self.net_sa = nn.Sequential(nn.Linear(state_dim + action_dim, mid_dim), nn.ReLU(),
                                    nn.Linear(mid_dim, lay_dim), nn.ReLU())
        self.net_q1 = nn.Linear(lay_dim, 1)
        self.net_q2 = nn.Linear(lay_dim, 1)
        layer_norm(self.net_q1, std=0.1)
        layer_norm(self.net_q2, std=0.1)

    def forward(self, state, action):
        tmp = self.net_sa(torch.cat((state, action), dim=1))
        return self.net_q1(tmp)  # one Q value

    def get_q1_q2(self, state, action):
        tmp = self.net_sa(torch.cat((state, action), dim=1))
        return self.net_q1(tmp), self.net_q2(tmp)  # two Q values
class AgentTD3(AgentBase):
    def __init__(self):
        super().__init__()
        self.explore_noise = 0.1  # standard deviation of exploration noise
        self.policy_noise = 0.2  # standard deviation of policy (target smoothing) noise
        self.update_freq = 2  # delayed update frequency, for soft target update

    def init(self, net_dim, state_dim, action_dim):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.act = Actor(net_dim, state_dim, action_dim).to(self.device)
        self.act_target = deepcopy(self.act)
        self.cri = CriticTwin(net_dim, state_dim, action_dim).to(self.device)
        self.cri_target = deepcopy(self.cri)

        self.criterion = torch.nn.MSELoss()
        self.act_optimizer = torch.optim.Adam(self.act.parameters(), lr=self.learning_rate)
        self.cri_optimizer = torch.optim.Adam(self.cri.parameters(), lr=self.learning_rate)

    def select_action(self, state) -> np.ndarray:
        states = torch.as_tensor((state,), dtype=torch.float32, device=self.device).detach_()
        action = self.act(states)[0]
        action = (action + torch.randn_like(action) * self.explore_noise).clamp(-1, 1)
        return action.cpu().detach().numpy()

    def update_net(self, buffer, target_step, batch_size, repeat_times) -> (float, float):
        buffer.update_now_len_before_sample()
        obj_critic = obj_actor = None
        for i in range(int(target_step * repeat_times)):
            '''objective of critic (loss function of critic)'''
            with torch.no_grad():
                reward, mask, action, state, next_s = buffer.sample_batch(batch_size)
                next_a = self.act_target.get_action(next_s, self.policy_noise)  # policy noise
                next_q = torch.min(*self.cri_target.get_q1_q2(next_s, next_a))  # twin critics
                q_label = reward + mask * next_q
            q1, q2 = self.cri.get_q1_q2(state, action)
            obj_critic = self.criterion(q1, q_label) + self.criterion(q2, q_label)  # twin critics

            self.cri_optimizer.zero_grad()
            obj_critic.backward()
            self.cri_optimizer.step()

            # An alternative (kept here commented out): update the actor every step
            # and delay only the soft updates.
            # if i % self.update_freq == 0:  # delay update
            #     self.soft_update(self.cri_target, self.cri, self.soft_update_tau)
            #
            # '''objective of actor'''
            # q_value_pg = self.act(state)  # policy gradient
            # obj_actor = -self.cri_target(state, q_value_pg).mean()
            #
            # self.act_optimizer.zero_grad()
            # obj_actor.backward()
            # self.act_optimizer.step()
            # if i % self.update_freq == 0:  # delay update
            #     self.soft_update(self.act_target, self.act, self.soft_update_tau)

            if i % self.update_freq == 0:  # delayed actor and target updates
                '''objective of actor'''
                q_value_pg = self.act(state)  # policy gradient
                obj_actor = -self.cri_target(state, q_value_pg).mean()

                self.act_optimizer.zero_grad()
                obj_actor.backward()
                self.act_optimizer.step()

                self.soft_update(self.cri_target, self.cri, self.soft_update_tau)
                self.soft_update(self.act_target, self.act, self.soft_update_tau)
        return obj_actor.item(), obj_critic.item() / 2
spinningup TD3
main.py
from td3 import TD3
import gym
import matplotlib.pyplot as plt
import numpy as np

if __name__ == '__main__':
    env = gym.make('Pendulum-v0')
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.shape[0]
    td3 = TD3(obs_dim, act_dim)

    MAX_EPISODE = 100
    MAX_STEP = 500
    update_every = 50
    batch_size = 100
    rewardList = []

    for episode in range(MAX_EPISODE):
        o = env.reset()
        ep_reward = 0
        for j in range(MAX_STEP):
            if episode > 20:
                a = td3.get_action(o, td3.act_noise) * 2  # scale [-1, 1] to Pendulum's [-2, 2]
            else:
                a = env.action_space.sample()  # random actions to warm up the buffer
            o2, r, d, _ = env.step(a)
            td3.replay_buffer.store(o, a, r, o2, d)

            if episode >= 20 and j % update_every == 0:
                td3.update(batch_size, update_every)

            o = o2
            ep_reward += r
            if d:
                break
        print('Episode:', episode, 'Reward:%i' % int(ep_reward))
        rewardList.append(ep_reward)

    plt.figure()
    plt.plot(np.arange(len(rewardList)), rewardList)
    plt.show()
td3.py
from copy import deepcopy
import numpy as np
import torch
from torch.optim import Adam
import core


class ReplayBuffer:
    """
    A simple FIFO experience replay buffer for TD3 agents.
    """

    def __init__(self, obs_dim, act_dim, size):
        self.obs_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
        self.obs2_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros(core.combined_shape(size, act_dim), dtype=np.float32)
        self.rew_buf = np.zeros(size, dtype=np.float32)
        self.done_buf = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        self.obs_buf[self.ptr] = obs
        self.obs2_buf[self.ptr] = next_obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size=32):
        idxs = np.random.randint(0, self.size, size=batch_size)
        batch = dict(obs=self.obs_buf[idxs],
                     obs2=self.obs2_buf[idxs],
                     act=self.act_buf[idxs],
                     rew=self.rew_buf[idxs],
                     done=self.done_buf[idxs])
        return {k: torch.as_tensor(v, dtype=torch.float32) for k, v in batch.items()}
class TD3:
    def __init__(self, obs_dim, act_dim, actor_critic=core.MLPActorCritic,
                 replay_size=int(1e6), gamma=0.99, polyak=0.995, pi_lr=1e-3, q_lr=1e-3,
                 act_noise=0.1, target_noise=0.2, noise_clip=0.5, policy_delay=2):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.gamma = gamma
        self.polyak = polyak
        self.act_noise = act_noise
        self.target_noise = target_noise
        self.noise_clip = noise_clip
        self.policy_delay = policy_delay

        # Experience buffer
        self.replay_buffer = ReplayBuffer(obs_dim=obs_dim, act_dim=act_dim, size=replay_size)

        self.ac = actor_critic(obs_dim, act_dim)
        self.ac_targ = deepcopy(self.ac)
        # Target networks are updated only by polyak averaging, never by gradients
        for p in self.ac_targ.parameters():
            p.requires_grad = False

        # Parameters of both Q-networks, kept as a list (an iterator such as
        # itertools.chain would be exhausted once Adam consumes it, turning the
        # freeze/unfreeze loops in update() into silent no-ops)
        self.q_params = list(self.ac.q1.parameters()) + list(self.ac.q2.parameters())

        # Set up optimizers for policy and q-functions
        self.pi_optimizer = Adam(self.ac.pi.parameters(), lr=pi_lr)
        self.q_optimizer = Adam(self.q_params, lr=q_lr)

    def compute_loss_q(self, data):
        o, a, r, o2, d = data['obs'], data['act'], data['rew'], data['obs2'], data['done']

        q1 = self.ac.q1(o, a)
        q2 = self.ac.q2(o, a)

        # Bellman backup for Q functions
        with torch.no_grad():
            pi_targ = self.ac_targ.pi(o2)

            # Target policy smoothing
            epsilon = torch.randn_like(pi_targ) * self.target_noise
            epsilon = torch.clamp(epsilon, -self.noise_clip, self.noise_clip)
            a2 = pi_targ + epsilon
            a2 = torch.clamp(a2, -1, 1)

            # Target Q-values: take the minimum of the twin target critics
            q1_pi_targ = self.ac_targ.q1(o2, a2)
            q2_pi_targ = self.ac_targ.q2(o2, a2)
            q_pi_targ = torch.min(q1_pi_targ, q2_pi_targ)
            backup = r + self.gamma * (1 - d) * q_pi_targ

        # MSE loss against Bellman backup
        loss_q1 = ((q1 - backup) ** 2).mean()
        loss_q2 = ((q2 - backup) ** 2).mean()
        loss_q = loss_q1 + loss_q2
        return loss_q

    def compute_loss_pi(self, data):
        o = data['obs']
        q1_pi = self.ac.q1(o, self.ac.pi(o))
        return -q1_pi.mean()

    def update(self, batch_size, repeat_times):
        for i in range(int(repeat_times)):
            data = self.replay_buffer.sample_batch(batch_size)

            # First run one gradient descent step for Q1 and Q2
            self.q_optimizer.zero_grad()
            loss_q = self.compute_loss_q(data)
            loss_q.backward()
            self.q_optimizer.step()

            # Possibly update pi and target networks
            if i % self.policy_delay == 0:
                # Freeze Q-networks so you don't waste computational effort
                # computing gradients for them during the policy learning step.
                for p in self.q_params:
                    p.requires_grad = False

                # Next run one gradient descent step for pi.
                self.pi_optimizer.zero_grad()
                loss_pi = self.compute_loss_pi(data)
                loss_pi.backward()
                self.pi_optimizer.step()

                # Unfreeze Q-networks so you can optimize them at the next step.
                for p in self.q_params:
                    p.requires_grad = True

                # Finally, update target networks by polyak averaging.
                with torch.no_grad():
                    for p, p_targ in zip(self.ac.parameters(), self.ac_targ.parameters()):
                        # NB: We use in-place operations "mul_", "add_" to update target
                        # params, as opposed to "mul" and "add", which would make new tensors.
                        p_targ.data.mul_(self.polyak)
                        p_targ.data.add_((1 - self.polyak) * p.data)

    def get_action(self, o, noise_scale):
        a = self.ac.act(torch.as_tensor(o, dtype=torch.float32))
        a += noise_scale * np.random.randn(self.act_dim)
        return np.clip(a, -1, 1)
core.py
import numpy as np
import torch
import torch.nn as nn


def combined_shape(length, shape=None):
    if shape is None:
        return (length,)
    return (length, shape) if np.isscalar(shape) else (length, *shape)


def mlp(sizes, activation, output_activation=nn.Identity):
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act()]
    return nn.Sequential(*layers)


def count_vars(module):
    return sum([np.prod(p.shape) for p in module.parameters()])


class MLPActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        pi_sizes = [obs_dim] + list(hidden_sizes) + [act_dim]
        self.pi = mlp(pi_sizes, activation, nn.Tanh)

    def forward(self, obs):
        # Return output from network scaled to action space limits.
        return self.pi(obs)


class MLPQFunction(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        self.q = mlp([obs_dim + act_dim] + list(hidden_sizes) + [1], activation)

    def forward(self, obs, act):
        q = self.q(torch.cat([obs, act], dim=-1))
        return torch.squeeze(q, -1)  # Critical to ensure q has right shape.


class MLPActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256),
                 activation=nn.ReLU):
        super().__init__()
        # build the policy and the two fully independent Q-functions
        self.pi = MLPActor(obs_dim, act_dim, hidden_sizes, activation)
        self.q1 = MLPQFunction(obs_dim, act_dim, hidden_sizes, activation)
        self.q2 = MLPQFunction(obs_dim, act_dim, hidden_sizes, activation)

    def act(self, obs):
        with torch.no_grad():
            return self.pi(obs).numpy()
Note that the TD3 implementations above output actions in [-1, 1] by default; map them to your own action space when using them.
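For a box action space [low, high], the mapping is a simple affine transform. A hypothetical helper (a sketch, not from either repo):

import numpy as np

def scale_action(a, low, high):
    # map a in [-1, 1] to the environment's [low, high]
    return low + (a + 1.0) * 0.5 * (high - low)

For Pendulum-v0, whose action range is [-2, 2], this reduces to the `* 2` used in both main.py scripts above.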
Comparison of the two implementations
Left plot: ElegantRL; right plot: spinningup. Test environment: gym Pendulum-v0.