HW12
1. Job description
In this HW, you will implement two deep reinforcement learning methods yourself:
1. Policy Gradient
2. Actor-Critic
The environment of this HW is the lunar lander from OpenAI Gym. The goal is for the lunar lander to land between the two flags.
What is a lunar lander?
"LunarLander-v2" simulates what happens when a vehicle lands on the lunar surface.
The task was to enable the plane to land "safely" on the tarmac between the two yellow flags. The landing pad is always located at coordinates (0,0). The coordinates are the first two numbers in the state vector. "LunarLander-v2" actually includes "agent" and "environment". In this assignment, we will use the function "step()" to control the actions of the "agent".
then 'step()' will return the observation/state and reward given by the "environment"...
Box(8,) means the observation is an 8-dimensional vector.
'Discrete(4)' means that the agent can take one of four actions.
- 0 means the agent takes no action
- 2 means the agent fires the main engine (slowing its descent)
- 1, 3 mean the agent fires the side engines to accelerate to the left or right
Next, we'll try to get the agent to interact with the environment.
Before taking any action, we recommend calling the 'reset()' function to reset the environment. Additionally, the function will return the initial state of the environment.
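A minimal interaction sketch (assuming env is the 'LunarLander-v2' environment and gym is already imported):
state = env.reset()                    # reset the environment and get the initial 8-dim state
print(env.observation_space)           # Box(8,)
print(env.action_space)                # Discrete(4)

action = env.action_space.sample()     # pick one of the 4 actions at random
next_state, reward, done, info = env.step(action)   # the environment returns the next state and the reward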
1、Policy Gradient
Policy Gradient outputs an action, or the probability of each action, directly from the state. The simplest way to do this is with a neural network. How should the network be trained so that it eventually converges? Backpropagation needs a loss function that we minimize by gradient descent, but in reinforcement learning we do not know whether an action was "correct"; we can only judge its relative quality through the reward. So if an action earns more reward, we increase the probability of it occurring, and if an action earns less reward, we decrease its probability.
A large amount of data has to be collected in each cycle to perform one parameter update.
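A minimal sketch of this update rule (toy numbers, not the homework code itself): each step's log-probability is weighted by its reward, so the gradient pushes up the probability of actions that earned more.
import torch

# toy example: log-probabilities of three sampled actions and their rewards
log_probs = torch.tensor([-1.2, -0.7, -2.3], requires_grad=True)
rewards = torch.tensor([1.0, 5.0, -2.0])

# policy gradient loss: minimizing it raises the probability of high-reward actions
# and lowers the probability of actions with negative reward
loss = (-log_probs * rewards).sum()
loss.backward()
print(log_probs.grad)  # equals -rewards: larger reward -> stronger upward push on the log-prob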
2、Actor-Critic
Add a baseline (the critic) to judge whether an action is really good.
Assign different weights to each action's log-probability according to how much its return exceeds the baseline.
Combine this with a decay (discount) factor on future rewards.
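A minimal sketch of the baseline idea (toy tensors, not the homework code): the weight on each log-probability becomes the advantage, i.e. how much the discounted return exceeds the critic's estimate.
import torch

returns = torch.tensor([8.0, 3.0, -1.0])      # discounted returns G_t
values = torch.tensor([5.0, 4.0, 0.5])        # critic's value estimates V(s_t), the baseline
log_probs = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)

advantage = returns - values                  # positive if the action did better than expected
actor_loss = (-log_probs * advantage).sum()   # weight each action by its advantage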
2. Experiment
1、simple
# torch.set_deterministic(True)  # deprecated alias
torch.use_deterministic_algorithms(True)  # make PyTorch operations deterministic for reproducibility
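The fix(env, seed) helper called in the testing code later in this report is not shown in the excerpt; a typical version (a sketch, assuming torch, numpy as np, and random are imported) seeds every source of randomness:
def fix(env, seed):
    # seed every source of randomness so runs are reproducible
    env.seed(seed)
    env.action_space.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True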
training result:
testing:
test reward:
server:
score:
2、medium
……
NUM_BATCH = 500 # update the agent 500 times in total
rate = 0.99
……
while True:
    action, log_prob = agent.sample(state)  # at, log(at|st)
    next_state, reward, done, _ = env.step(action)

    log_probs.append(log_prob)  # [log(a1|s1), log(a2|s2), ..., log(at|st)]
    seq_rewards.append(reward)
    state = next_state
    total_reward += reward
    total_step += 1

    if done:
        final_rewards.append(reward)
        total_rewards.append(total_reward)
        # calculate accumulative (discounted) rewards, working backwards through the episode
        for i in range(2, len(seq_rewards) + 1):
            seq_rewards[-i] += rate * (seq_rewards[-i + 1])
        rewards += seq_rewards
        break
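As a quick sanity check of the backward loop (a toy example, not part of the homework code):
rate = 0.99
seq_rewards = [1.0, 2.0, 3.0]
for i in range(2, len(seq_rewards) + 1):
    seq_rewards[-i] += rate * seq_rewards[-i + 1]
print(seq_rewards)  # [1 + 0.99*(2 + 0.99*3), 2 + 0.99*3, 3] = [5.9203, 4.97, 3.0]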
training result:
testing:
test reward:
server:
score:
3、strong
from torch.optim.lr_scheduler import StepLR

class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        # shared feature extractor
        self.fc = nn.Sequential(
            nn.Linear(8, 16),
            nn.Tanh(),
            nn.Linear(16, 16),
            nn.Tanh()
        )
        self.actor = nn.Linear(16, 4)    # action logits
        self.critic = nn.Linear(16, 1)   # state-value estimate V(s)
        self.values = []
        self.optimizer = optim.SGD(self.parameters(), lr=0.001)

    def forward(self, state):
        hid = self.fc(state)
        self.values.append(self.critic(hid).squeeze(-1))  # remember V(s) for learn()
        return F.softmax(self.actor(hid), dim=-1)

    def learn(self, log_probs, rewards):
        values = torch.stack(self.values)
        # actor loss: weight each log-probability by the advantage (reward - baseline)
        actor_loss = (-log_probs * (rewards - values.detach())).sum()
        # critic loss: regress V(s) toward the observed rewards so the baseline is actually learned
        critic_loss = F.mse_loss(values, rewards.float())
        loss = actor_loss + critic_loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.values = []

    def sample(self, state):
        action_prob = self(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob
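Since the class above plays both roles, the agent used in the training loop can simply be an ActorCritic instance. A minimal usage sketch (assuming env is the LunarLander environment created earlier):
agent = ActorCritic()

state = env.reset()
action, log_prob = agent.sample(state)   # forward() also records V(state) in agent.values
next_state, reward, done, _ = env.step(action)
# ...collect a batch as before, then learn() subtracts the stored values as the baseline:
# agent.learn(torch.stack(log_probs), torch.from_numpy(rewards))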
training result:
testing:
test reward:
server:
score:
3. Code
Preparations
First, we need to install all the necessary packages.
One of these is gym built by OpenAI, which is a toolkit for developing reinforcement learning algorithms.
"step()" can be used to make the agent act according to a randomly selected "random_action".
The "step()" function will return four values:
- observation/state
- reward
- done (true/false)
- additional information
observation, reward, done, info = env.step(random_action)
print(done)
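The install and import cells are not reproduced here; a minimal setup sketch (package names and the exact import list are assumptions based on the code below) might look like:
# !pip install gym box2d-py   # Box2D physics is required for LunarLander
import gym
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
from tqdm import tqdm
import matplotlib.pyplot as plt
from IPython import display

env = gym.make('LunarLander-v2')
initial_state = env.reset()               # reset before taking any actions
random_action = env.action_space.sample()
observation, reward, done, info = env.step(random_action)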
Bonus
The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero velocity is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. If the lander crashes or comes to rest, the episode ends with an additional -100 or +100 points. Each leg touching the ground is +10 points. Firing the main engine costs -0.3 points per frame. The environment counts as solved at 200 points.
Random Agent
Before we start training, we can see if a random agent can successfully land on the moon.
env.reset()

img = plt.imshow(env.render(mode='rgb_array'))

done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)

    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())      # show the current figure
    display.clear_output(wait=True)
Policy Gradient
Now, we can build a simple policy network. The network takes the state as input and outputs a probability distribution over the action space.
class PolicyGradientNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)
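As a quick check (not part of the original code), feeding an 8-dimensional state through the network gives a probability vector over the 4 actions:
network = PolicyGradientNetwork()
dummy_state = torch.zeros(8)   # any 8-dimensional state vector
probs = network(dummy_state)
print(probs, probs.sum())      # 4 probabilities that sum to 1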
Then, we need to build a simple agent. The agent acts according to the output of the policy network described above. It can do a few things:
- learn(): update the policy network from the log probabilities and rewards.
- sample(): given an observation from the environment, use the policy network to decide which action to take; it returns the sampled action and its log probability.
from torch.optim.lr_scheduler import StepLR

class PolicyGradientAgent():
    def __init__(self, network):
        self.network = network
        self.optimizer = optim.SGD(self.network.parameters(), lr=0.001)

    def forward(self, state):
        return self.network(state)

    def learn(self, log_probs, rewards):
        loss = (-log_probs * rewards).sum()  # You don't need to revise this to pass the simple baseline (but you can)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def sample(self, state):
        action_prob = self.network(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob
Training the Agent
Now let's start training our agent.
All of the interactions between the agent and the environment are collected as training data, so the policy network can learn from every attempt.
agent.network.train()  # Switch network into training mode
EPISODE_PER_BATCH = 5  # update the agent every 5 episodes
NUM_BATCH = 500        # update the agent 500 times in total

avg_total_rewards, avg_final_rewards = [], []

prg_bar = tqdm(range(NUM_BATCH))  # progress bar
for batch in prg_bar:

    log_probs, rewards = [], []
    total_rewards, final_rewards = [], []

    # collect trajectory
    for episode in range(EPISODE_PER_BATCH):
        state = env.reset()
        total_reward, total_step = 0, 0
        seq_rewards = []
        while True:
            action, log_prob = agent.sample(state)  # at, log(at|st)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob)  # [log(a1|s1), log(a2|s2), ..., log(at|st)]
            # seq_rewards.append(reward)
            state = next_state
            total_reward += reward
            total_step += 1
            rewards.append(reward)  # change here
            # ! IMPORTANT !
            # Current reward implementation: immediate reward, given action_list : a1, a2, a3, ...
            #                                                   rewards          : r1, r2, r3, ...
            # medium: change "rewards" to accumulative decaying reward, given action_list : a1, a2, a3, ...
            #         rewards : r1+0.99*r2+0.99^2*r3+..., r2+0.99*r3+0.99^2*r4+..., r3+0.99*r4+0.99^2*r5+...
            # boss  : implement Actor-Critic
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                break

    print("rewards looks like", np.shape(rewards))
    print("log_probs looks like", np.shape(log_probs))

    # record training process
    avg_total_reward = sum(total_rewards) / len(total_rewards)
    avg_final_reward = sum(final_rewards) / len(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    prg_bar.set_description(f"Total: {avg_total_reward: 4.1f}, Final: {avg_final_reward: 4.1f}")

    # update agent
    # rewards = np.concatenate(rewards, axis=0)
    rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-9)  # normalize the rewards (np.std computes the standard deviation)
    # torch.from_numpy creates a tensor from the NumPy array; torch.stack concatenates the
    # sequence of log-prob tensors along a new dimension (they must all have the same shape)
    agent.learn(torch.stack(log_probs), torch.from_numpy(rewards))
    print("logs prob looks like", torch.stack(log_probs).size())
    print("torch.from_numpy(rewards) looks like", torch.from_numpy(rewards).size())
Training Results
During training, we recorded "avg_total_reward", the average total reward over the batch of episodes collected before each policy-network update. In theory, if the agent improves, avg_total_reward will increase.
plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()
Also, "avg_final_reward" represents the average final reward of the set. Specifically, the final reward is the reward received at the end of an episode, indicating whether or not the craft landed successfully.
plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()
Test
The test result will be the average reward over 5 test episodes.
fix(env, seed)
agent.network.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5  # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
    actions = []
    state = env.reset()

    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False
    while not done:
        action, _ = agent.sample(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)

        total_reward += reward

        img.set_data(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)

    print(total_reward)
    test_total_reward.append(total_reward)

    action_list.append(actions)  # save the result of testing
Action Distribution
distribution = {}
for actions in action_list:
    for action in actions:
        if action not in distribution.keys():
            distribution[action] = 1
        else:
            distribution[action] += 1
print(distribution)
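Before running the server simulation below, the recorded actions have to be saved to the .npy file that gets uploaded; a minimal sketch (the file name is an assumption, use whatever PATH your submission expects):
PATH = "action_list.npy"   # hypothetical file name
np.save(PATH, np.array(action_list, dtype=object))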
Server
The code below simulates the environment on the judge server and can be used for testing before you submit.
action_list = np.load(PATH, allow_pickle=True)  # The action list you upload
seed = 543  # Do not revise this
fix(env, seed)

agent.network.eval()  # set network to evaluation mode

test_total_reward = []
if len(action_list) != 5:
    print("Wrong format of file !!!")
    exit(0)
for actions in action_list:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False

    for action in actions:
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

    print("Your reward is : %.2f" % total_reward)
    test_total_reward.append(total_reward)