[DataWhale check-in] DDPG algorithm: Deep Deterministic Policy Gradient
Video reference from: https://www.bilibili.com/video/BV1yv411i7xd?p=19
1. Mind map
2. Details
DDPG is an algorithm for continuous control problems. Unlike PPO, whose policy outputs a probability distribution over actions, DDPG's policy outputs a single deterministic action.
DDPG adopts the Actor-Critic architecture and builds on DQN. DQN requires a discrete action space, so it cannot handle continuous actions. DDPG extends it by introducing an actor network whose output is a continuous action.
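To make the contrast between the two kinds of policy output concrete, here is a toy sketch that is not part of the original code. The two "policies" and their coefficients are made up for illustration: one samples an action from a Gaussian distribution (as a stochastic policy would for continuous actions), the other maps the state directly to one action.

```python
import math
import random


def stochastic_policy(state, rng):
    """Stochastic policy: the state defines a distribution and the action is sampled."""
    mean = 0.5 * state      # hypothetical mean of a Gaussian policy
    std = 0.2               # hypothetical fixed standard deviation
    return rng.gauss(mean, std)


def deterministic_policy(state):
    """DDPG-style policy: the state maps directly to one action."""
    return math.tanh(0.5 * state)   # tanh keeps the action in [-1, 1]


rng = random.Random(0)
state = 1.0
sampled = [stochastic_policy(state, rng) for _ in range(3)]
fixed = [deterministic_policy(state) for _ in range(3)]
print(sampled)  # three different values
print(fixed)    # the same value three times
```

For the same state, the stochastic policy returns a different action on each call, while the deterministic policy always returns the same one.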
Comparison | AC | DDPG |
---|---|---|
Actor | Outputs a probability distribution over actions | Outputs an action |
Critic | Estimates the V value | Estimates the Q value |
Update | Policy gradient weighted by the critic's estimate | Gradient ascent on the Q value |
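The DDPG update in the last row of the table moves the actor in the direction that increases the critic's Q value. A minimal sketch with a made-up one-dimensional Q function follows; the function `q_value`, its maximum at 1.0, and the step size are assumptions chosen purely for illustration.

```python
def q_value(action):
    """Hypothetical critic: Q is largest when action == 1.0."""
    return -(action - 1.0) ** 2


def q_gradient(action):
    """Analytic derivative dQ/da of the toy critic above."""
    return -2.0 * (action - 1.0)


start_action = -0.5        # initial action proposed by the "actor"
action = start_action
learning_rate = 0.1
for _ in range(100):
    # Gradient ascent: step along +dQ/da so that Q increases.
    action += learning_rate * q_gradient(action)

print(action, q_value(action))
```

In the PyTorch code later in this post the same step is written as minimizing `-Q`, which is equivalent.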
When optimizing the Q network, a constantly changing Q-target makes the updates unstable. As in DQN, DDPG therefore uses target networks: the target network is held fixed while the online network's parameters are updated, and the updated parameters are then transferred to the target network (in the code below, blended in gradually through a soft update). This requires four networks:
- actor
- critic
- target actor
- target critic
As can be seen from the figure above, DDPG (also an Actor-Critic method) is a temporal-difference (TD) method that combines value-based and policy-based approaches. The policy is the actor, which proposes an action; the value function is the critic, which evaluates the quality of the actor's action and produces a temporal-difference signal that guides the updates of both the value function and the policy.
3. Code
This section walks through the main modules of the DDPG implementation:
3.1 Background
The problem solved with DDPG here is the pendulum task, Pendulum-v0. In this version of the problem, the pendulum starts at a random position, and the goal is to swing it up and keep it upright. It is a continuous control problem.
State representation:
Action space:
Reward evaluation:
$$-(\theta^2 + 0.1 \cdot \theta_{dt}^2 + 0.001 \cdot \text{action}^2)$$
It can be seen that the goal is to maintain a zero angle, that is, to be vertical, while requiring the smallest rotation speed and the smallest force.
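The reward expression above can be evaluated directly. The sketch below is only an illustration: the function name `pendulum_reward` is hypothetical, and it assumes `theta` is the angle from upright in radians, already normalized to [-pi, pi].

```python
import math


def pendulum_reward(theta, theta_dt, action):
    """Reward from the formula above: penalizes angle, angular speed and torque."""
    return -(theta ** 2 + 0.1 * theta_dt ** 2 + 0.001 * action ** 2)


upright = pendulum_reward(0.0, 0.0, 0.0)         # upright, still, no torque
hanging = pendulum_reward(math.pi, 0.0, 0.0)     # hanging straight down
moving = pendulum_reward(0.0, 1.0, 2.0)          # upright but moving, with torque
print(upright, hanging, moving)
```

The reward is never positive, and it reaches its maximum of zero only when the pendulum is upright, motionless, and no torque is applied.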
3.2 Actor
The role of the Actor is to receive the state description and output an action. Because the action space in DDPG is continuous, the output layer uses tanh, which bounds each action value to the range [-1, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, n_obs, n_actions, hidden_size, init_w=3e-3):
        super(Actor, self).__init__()
        self.linear1 = nn.Linear(n_obs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, n_actions)
        # Initialize the output layer with small random values
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        # tanh bounds each action component to [-1, 1]
        x = torch.tanh(self.linear3(x))
        return x
The implementation is a small network of fully connected layers whose output is a continuous value.
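Because tanh bounds the output to [-1, 1], the value usually has to be rescaled to the environment's action range before it is applied. A minimal sketch, assuming Pendulum's torque range of [-2, 2]; the helper name `scale_action` is hypothetical and not part of the original code.

```python
import math


def scale_action(tanh_output, low, high):
    """Map a value in [-1, 1] linearly onto [low, high]."""
    return low + (tanh_output + 1.0) * 0.5 * (high - low)


raw = math.tanh(0.3)                    # stand-in for the actor's output
torque = scale_action(raw, -2.0, 2.0)   # assumed Pendulum torque bounds
print(raw, torque)
```

With symmetric bounds such as [-2, 2] this reduces to multiplying the actor output by 2.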
3.3 Critic
The Critic in DDPG takes the current state and the action produced by the Actor, and outputs an estimate of the Q value, that is, the expected return of taking that action in that state.
class Critic(nn.Module):
    def __init__(self, n_obs, n_actions, hidden_size, init_w=3e-3):
        super(Critic, self).__init__()
        # The first layer takes the state and the action together
        self.linear1 = nn.Linear(n_obs + n_actions, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, 1)
        # Randomly initialize the output layer to small values
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)

    def forward(self, state, action):
        # Concatenate along dimension 1 (the feature dimension)
        x = torch.cat([state, action], 1)
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        # No activation on the output: Q values are unbounded
        x = self.linear3(x)
        return x
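The concatenation in the Critic's forward pass is what lets a single network score a (state, action) pair. The short sketch below only illustrates the shapes involved; the batch size is arbitrary, and the sizes 3 and 1 are assumed to match Pendulum's three observations and one action.

```python
import torch

batch_size, n_obs, n_actions = 4, 3, 1     # illustrative sizes
state = torch.randn(batch_size, n_obs)
action = torch.randn(batch_size, n_actions)

# Concatenate along dimension 1 so each row holds one (state, action) pair.
critic_input = torch.cat([state, action], 1)
print(critic_input.shape)   # torch.Size([4, 4])
```

This is why the first linear layer of the Critic has `n_obs + n_actions` input features.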
3.4 Replay Buffer
The Replay Buffer stores transitions of the form (state, action, reward, next_state, done) until they are sampled for learning.
import random

import numpy as np


class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0  # index of the next slot to write

    def push(self, state, action, reward, next_state, done):
        # Grow the list until it reaches capacity
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        # Wrap around so the oldest entry is overwritten next
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Draw batch_size transitions uniformly at random, without replacement
        batch = random.sample(self.buffer, batch_size)
        # Regroup the transitions field by field and stack each field into an array
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = map(np.stack, zip(*batch))
        return state_batch, action_batch, reward_batch, next_state_batch, done_batch

    def __len__(self):
        return len(self.buffer)
The capacity of the Replay Buffer is configurable. The push function adds one transition to the buffer, overwriting the oldest entry once the buffer is full; sample draws batch_size transitions uniformly at random from the buffer.
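The overwrite behavior of push is easiest to see with a very small capacity. The class below is a stripped-down stand-in written for this illustration (the name `TinyRingBuffer` is hypothetical); it keeps only the position logic and stores plain integers instead of transitions.

```python
class TinyRingBuffer:
    """Stripped-down buffer that keeps only the overwrite logic."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def push(self, item):
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)            # grow until capacity is reached
        self.buffer[self.position] = item       # write into the current slot
        self.position = (self.position + 1) % self.capacity


buf = TinyRingBuffer(capacity=3)
for step in range(5):
    buf.push(step)
print(buf.buffer)   # [3, 4, 2]
```

After five pushes into a buffer of capacity three, the two oldest items (0 and 1) have been overwritten by the two newest (3 and 4).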
3.5 DDPG
The DDPG agent combines all of the objects above: the Critic, Target Critic, Actor, Target Actor, and the replay memory.
The init function is as follows:
import torch.optim as optim


class DDPG:
    def __init__(self, n_states, n_actions, hidden_dim=30, device="cpu", critic_lr=1e-3,
                 actor_lr=1e-4, gamma=0.99, soft_tau=1e-2, memory_capacity=100000, batch_size=128):
        self.device = device
        # Online networks and their target copies
        self.critic = Critic(n_states, n_actions, hidden_dim).to(device)
        self.actor = Actor(n_states, n_actions, hidden_dim).to(device)
        self.target_critic = Critic(n_states, n_actions, hidden_dim).to(device)
        self.target_actor = Actor(n_states, n_actions, hidden_dim).to(device)
        # Start each target network with the same weights as its online network
        for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
            target_param.data.copy_(param.data)
        for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
            target_param.data.copy_(param.data)
        # Only the online networks are trained by the optimizers
        self.critic_optimizer = optim.Adam(
            self.critic.parameters(), lr=critic_lr)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.memory = ReplayBuffer(memory_capacity)
        self.batch_size = batch_size
        self.soft_tau = soft_tau  # interpolation factor for the soft target update
        self.gamma = gamma        # discount factor
The core function is the update function:
    def update(self):
        # Wait until the buffer holds at least one full batch
        if len(self.memory) < self.batch_size:
            return
        state, action, reward, next_state, done = self.memory.sample(
            self.batch_size)
        # Convert all variables to tensors
        state = torch.FloatTensor(state).to(self.device)
        next_state = torch.FloatTensor(next_state).to(self.device)
        action = torch.FloatTensor(action).to(self.device)
        reward = torch.FloatTensor(reward).unsqueeze(1).to(self.device)
        done = torch.FloatTensor(np.float32(done)).unsqueeze(1).to(self.device)
        # Actor loss: note that the critic takes (s_t, a) as input, with a = actor(s_t).
        # Minimizing -Q is gradient ascent on Q.
        policy_loss = self.critic(state, self.actor(state))
        policy_loss = -policy_loss.mean()
        # TD target computed with the target networks
        next_action = self.target_actor(next_state)
        target_value = self.target_critic(next_state, next_action.detach())
        expected_value = reward + (1.0 - done) * self.gamma * target_value
        # With infinite bounds this clamp leaves the values unchanged
        expected_value = torch.clamp(expected_value, -np.inf, np.inf)
        # Critic loss: MSE between Q(s, a) and the TD target
        value = self.critic(state, action)
        value_loss = nn.MSELoss()(value, expected_value.detach())
        # Update the actor
        self.actor_optimizer.zero_grad()
        policy_loss.backward()
        self.actor_optimizer.step()
        # Update the critic (zero_grad also clears gradients left by policy_loss)
        self.critic_optimizer.zero_grad()
        value_loss.backward()
        self.critic_optimizer.step()
        # Soft update: target <- (1 - tau) * target + tau * online
        for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
            target_param.data.copy_(
                target_param.data * (1.0 - self.soft_tau) +
                param.data * self.soft_tau
            )
        for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
            target_param.data.copy_(
                target_param.data * (1.0 - self.soft_tau) +
                param.data * self.soft_tau
            )
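The two loops at the end of update blend each target parameter toward its online counterpart by the factor soft_tau. The pure-Python sketch below applies the same formula to a single made-up scalar "parameter" to show how slowly the target follows; the helper name `soft_update` and the values 0.0 and 1.0 are assumptions for illustration only.

```python
def soft_update(target_value, online_value, tau):
    """One soft update step: target <- (1 - tau) * target + tau * online."""
    return target_value * (1.0 - tau) + online_value * tau


target, online, tau = 0.0, 1.0, 1e-2   # tau matches the soft_tau default above
history = []
for _ in range(300):
    target = soft_update(target, online, tau)
    history.append(target)

print(history[0], history[99], history[299])
```

With tau = 0.01 the target has covered roughly 63% of the gap after 100 updates and about 95% after 300, so the Q-target changes smoothly rather than jumping with every training step.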
The overall process is as follows:
- Sample a batch of transitions from memory.
- Compute the policy loss: pass the state to the actor to get an action, then pass the state and that action to the critic. The policy loss is the negated mean of the critic's output.
    policy_loss = self.critic(state, self.actor(state))
    policy_loss = -policy_loss.mean()
- Compute the target value in the same way, using the target actor and target critic on the next state:
    next_action = self.target_actor(next_state)
    target_value = self.target_critic(next_state, next_action.detach())
- Calculate the expected value based on the target value:
$$r + \gamma Q$$
The implementation is as follows:
    expected_value = reward + (1.0 - done) * self.gamma * target_value
    expected_value = torch.clamp(expected_value, -np.inf, np.inf)
If done is 1, the episode has ended, so the factor (1.0 - done) removes the discounted future value and only the reward remains. The second line clamps the expected value; with bounds of -np.inf and np.inf it leaves the values unchanged.
- Next, compute the critic's value for the actions stored in the sampled batch:
    value = self.critic(state, action)
- Compute the loss for the Q network using the mean squared error (MSELoss):
    value_loss = nn.MSELoss()(value, expected_value.detach())
Compare the picture below:
- Backpropagate the policy loss and the value loss, and update the trainable parameters.
- Finally, soft-update the target critic and target actor toward the online networks using soft_tau.
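The expected-value and loss steps can be checked by hand on a tiny batch. The sketch below uses plain Python lists instead of tensors, and every number in it is made up for illustration; the lists stand in for the outputs of the target critic and the critic.

```python
gamma = 0.99

# Hypothetical batch of three transitions.
rewards = [-1.0, -0.5, -2.0]
dones = [0.0, 0.0, 1.0]               # the last transition ends its episode
target_values = [-10.0, -8.0, -6.0]   # stand-in for target_critic(next_state, next_action)
values = [-11.0, -8.0, -2.5]          # stand-in for critic(state, action)

# expected = r + (1 - done) * gamma * Q_target
expected = [r + (1.0 - d) * gamma * q
            for r, d, q in zip(rewards, dones, target_values)]

# Mean squared error between the critic's estimate and the expected value.
value_loss = sum((v - e) ** 2 for v, e in zip(values, expected)) / len(values)
print(expected, value_loss)
```

For the terminal transition the expected value equals the reward alone, because the (1 - done) factor zeroes out the bootstrapped term.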
The training results are as follows:
4. References
The code is taken from johnjim's implementation, thanks.
https://www.jianshu.com/p/af3a7853268f
https://datawhalechina.github.io/leedeeprl-notes/#/chapter12/project3
https://www.bilibili.com/video/BV1yv411i7xd?p=19