【Li Hongyi】HW12


1. Assignment description

In this HW, you will implement two deep reinforcement learning methods yourself:
1. Policy Gradient
2. Actor-Critic
The environment of this HW is the lunar lander from OpenAI gym. The goal is to make the lunar lander touch down between the two flags.
What is a lunar lander?
"LunarLander-v2" simulates what happens when a vehicle lands on the lunar surface.
The task is to make the lander touch down "safely" on the landing pad between the two yellow flags. The landing pad is always located at coordinates (0,0), and these coordinates are the first two numbers in the state vector. "LunarLander-v2" consists of an "agent" and an "environment". In this assignment, we will use the function "step()" to control the actions of the "agent";
"step()" will then return the observation/state and reward given by the "environment".
'Box(8,)' means the observation is an 8-dimensional vector.


'Discrete(4)' means that the agent can take four actions.

  • 0 means the agent will take no action
  • 2 means the agent will accelerate downward
  • 1 and 3 mean the agent will accelerate left and right, respectively

Next, we'll try to get the agent to interact with the environment.
Before taking any action, we recommend calling the 'reset()' function to reset the environment. Additionally, the function will return the initial state of the environment.
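For reference, here is a minimal sketch of creating the environment and inspecting it (this assumes the classic gym API used in this homework, where reset() returns only the observation):

import gym

env = gym.make('LunarLander-v2')
print(env.observation_space)  # Box(8,)     -> the state is an 8-dimensional vector
print(env.action_space)       # Discrete(4) -> four possible actions

initial_state = env.reset()   # reset the environment and get the initial state
print(initial_state)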

1、Policy Gradient

The policy network outputs an action (or a probability distribution over actions) directly from the state. The simplest way to do this is with a neural network. But how should the network be trained so that it eventually converges? For backpropagation we need a loss function to minimize by gradient descent, yet in reinforcement learning we do not know whether an action is "correct"; we can only judge its relative quality through the reward. So if an action leads to more reward, we increase the probability of taking it, and if it leads to less reward, we decrease that probability.
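Written out, the gradient being estimated is the usual REINFORCE form (a sketch in standard notation, where τ^n is the n-th sampled trajectory and R(τ^n) its total reward):

$$
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta\!\left(a_t^n \mid s_t^n\right)
$$

and the parameters are updated by gradient ascent, θ ← θ + η ∇R̄_θ.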

A lot of data has to be collected in every cycle just to perform one parameter update.

2、Actor-Critic

Add a baseline to judge whether an action is really good!
Assign different weights: credit each action only with the rewards obtained after it was taken.

Combine with a decay (discount) factor, so that rewards further in the future are weighted less.
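Combining the baseline and the decay factor, each log probability is weighted by the discounted reward-to-go minus a baseline b (in Actor-Critic, the critic's value estimate V(s_t) plays the role of b); a sketch in standard notation:

$$
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}
\left( \sum_{t'=t}^{T_n} \gamma^{\,t'-t} r_{t'}^{n} - b \right)
\nabla \log p_\theta\!\left(a_t^n \mid s_t^n\right)
$$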

2. Experiment

1、simple

# torch.set_deterministic(True)           # deprecated in newer PyTorch versions
torch.use_deterministic_algorithms(True)  # use this instead to keep runs reproducible

training result: (figure)
testing: (figure)
test reward: (figure)
server: (figure)
score: (figure)

2、medium

……
NUM_BATCH = 500        # update the agent 500 times in total
rate = 0.99            # discount factor for the accumulative reward
……
        while True:

            action, log_prob = agent.sample(state) # at, log(at|st)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob) # [log(a1|s1), log(a2|s2), ...., log(at|st)]
            seq_rewards.append(reward)
            state = next_state
            total_reward += reward
            total_step += 1
            
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                # compute the discounted accumulative rewards, working backwards from the end of the episode
                for i in range(2, len(seq_rewards)+1):
                    seq_rewards[-i] += rate * (seq_rewards[-i+1])
                rewards += seq_rewards
                
                break
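A quick sanity check of the backward accumulation above, using made-up reward values:

# hypothetical episode rewards r1, r2, r3
seq_rewards = [1.0, 2.0, 3.0]
rate = 0.99
for i in range(2, len(seq_rewards) + 1):
    seq_rewards[-i] += rate * seq_rewards[-i + 1]
print(seq_rewards)
# ≈ [5.9203, 4.97, 3.0], i.e. [r1 + 0.99*r2 + 0.99^2*r3, r2 + 0.99*r3, r3]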

training result: (figures)

testing: (figure)

test reward: (figures)

server: (figure)

score: (figure)

3、strong

from torch.optim.lr_scheduler import StepLR
# relies on the imports from the sample code (torch, nn, F, optim, Categorical)
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        # shared feature extractor for both heads
        self.fc = nn.Sequential(
            nn.Linear(8, 16),
            nn.Tanh(),
            nn.Linear(16, 16),
            nn.Tanh()
        )
        
        self.actor = nn.Linear(16, 4)   # action logits
        self.critic = nn.Linear(16, 1)  # state-value estimate V(s)
        
        self.values = []                # V(s) collected while sampling, consumed in learn()
        self.optimizer = optim.SGD(self.parameters(), lr=0.001)
        
    def forward(self, state):
        hid = self.fc(state)
        self.values.append(self.critic(hid).squeeze(-1))  # record V(s) for this step
        return F.softmax(self.actor(hid), dim=-1)
    
    def learn(self, log_probs, rewards):
        values = torch.stack(self.values)
        # policy-gradient loss with the critic's value as baseline
        loss = (-log_probs * (rewards - values.detach())).sum()
        self.optimizer.zero_grad()
        loss.backward()
        
        self.optimizer.step()
        
        self.values = []
        
    def sample(self, state):
        action_prob = self(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob
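Note that because of values.detach(), this loss only updates the shared layers and the actor head; the critic head itself is never fitted to the returns. A common variant (just a sketch, not the code used for the results below, and assuming rewards holds the discounted returns) adds a value-regression term:

    def learn(self, log_probs, rewards):
        values = torch.stack(self.values)
        advantage = rewards - values.detach()
        actor_loss = (-log_probs * advantage).sum()
        critic_loss = F.smooth_l1_loss(values, rewards.float(), reduction='sum')  # fit V(s) to the returns
        loss = actor_loss + critic_loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.values = []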
        

training result: (figures)

testing: (figure)

test reward: (figures)

server: (figure)

score: (figure)

3. Code

Preparations
First, we need to install all the necessary packages.
One of them is gym, built by OpenAI, a toolkit for developing reinforcement learning algorithms.
"step()" can be used to make the agent act according to a randomly selected "random_action".
The "step()" function will return four values:
- observation/status
- reward
- done (true/false)
- additional information

observation, reward, done, info = env.step(random_action)
print(done)

Bonus
The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad at zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses reward. If the lander crashes or comes to rest, the episode ends with an additional -100 or +100 points. Each leg that touches the ground is worth +10 points. Firing the main engine costs -0.3 points per frame. The problem is considered solved at 200 points.
Random Agent
Before we start training, we can see if a random agent can successfully land on the moon.

env.reset()

img = plt.imshow(env.render(mode='rgb_array'))

done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)

    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())  # show the current figure
    display.clear_output(wait=True)

Policy Gradient
Now, we can build a simple policy network. Given a state, the network outputs a probability distribution over the actions in the action space.

class PolicyGradientNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)
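As a quick shape check (a hypothetical sanity test, not part of the assignment code), feeding a dummy 8-dimensional state should give a probability vector over the 4 actions that sums to 1:

dummy_state = torch.zeros(8)
probs = PolicyGradientNetwork()(dummy_state)
print(probs.shape, probs.sum())  # torch.Size([4]), sum ≈ 1.0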

Then, we need to build a simple agent. The agent will act according to the output of the policy network described above. The agent can do two things:

  • learn(): update the policy network from the recorded log probabilities and rewards.
  • sample(): given an observation from the environment, use the policy network to decide which action to take. This function returns the sampled action and its log probability.

from torch.optim.lr_scheduler import StepLR
class PolicyGradientAgent():
    
    def __init__(self, network):
        self.network = network
        self.optimizer = optim.SGD(self.network.parameters(), lr=0.001)
        
    def forward(self, state):
        return self.network(state)
    def learn(self, log_probs, rewards):
        loss = (-log_probs * rewards).sum() # You don't need to revise this to pass simple baseline (but you can)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
    def sample(self, state):
        action_prob = self.network(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob

Training the Agent
Now let's start training our agent.
By having all interactions between the agent and the environment as training data, the policy network can learn from all these attempts.
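The loop below assumes the network and agent have already been constructed from the classes above, i.e. something like:

network = PolicyGradientNetwork()
agent = PolicyGradientAgent(network)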

agent.network.train()  # switch the network into training mode
EPISODE_PER_BATCH = 5  # update the agent every 5 episodes
NUM_BATCH = 500        # update the agent 500 times in total

avg_total_rewards, avg_final_rewards = [], []

prg_bar = tqdm(range(NUM_BATCH))  # progress bar
for batch in prg_bar:

    log_probs, rewards = [], []
    total_rewards, final_rewards = [], []

    # collect trajectory
    for episode in range(EPISODE_PER_BATCH):
        
        state = env.reset()
        total_reward, total_step = 0, 0
        seq_rewards = []
        while True:

            action, log_prob = agent.sample(state) # at, log(at|st)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob) # [log(a1|s1), log(a2|s2), ...., log(at|st)]
            # seq_rewards.append(reward)
            state = next_state
            total_reward += reward
            total_step += 1
            rewards.append(reward) # change here
            # ! IMPORTANT !
            # Current reward implementation: immediate reward
            #     action list : a1, a2, a3, ...
            #     rewards     : r1, r2, r3, ...
            # medium: change "rewards" to the accumulative decaying reward
            #     action list : a1,                          a2,                          a3, ...
            #     rewards     : r1+0.99*r2+0.99^2*r3+...,    r2+0.99*r3+0.99^2*r4+...,    r3+0.99*r4+0.99^2*r5+...
            # boss  : implement Actor-Critic
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                
                break

    print(f"rewards looks like ", np.shape(rewards))  
    print(f"log_probs looks like ", np.shape(log_probs))     
    # record training process
    avg_total_reward = sum(total_rewards) / len(total_rewards)
    avg_final_reward = sum(final_rewards) / len(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    prg_bar.set_description(f"Total: {avg_total_reward: 4.1f}, Final: {avg_final_reward: 4.1f}")

    # update agent
    # rewards = np.concatenate(rewards, axis=0)
    rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-9)  # normalize the rewards (np.std computes the standard deviation)
    agent.learn(torch.stack(log_probs), torch.from_numpy(rewards))  # torch.from_numpy builds a tensor from the numpy array; torch.stack joins a sequence of same-shaped tensors along a new dimension
    print("logs prob looks like ", torch.stack(log_probs).size())
    print("torch.from_numpy(rewards) looks like ", torch.from_numpy(rewards).size())

Training Results
During training, we recorded "avg_total_reward", which represents the average total reward of the batch collected before each update of the policy network. In theory, if the agent gets better, avg_total_reward will increase.

plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()

Also, "avg_final_reward" represents the average final reward of the set. Specifically, the final reward is the reward received at the end of an episode, indicating whether or not the craft landed successfully.

plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()

Test
The test result is the average reward over 5 test episodes.

fix(env, seed)
agent.network.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5 # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
  actions = []
  state = env.reset()

  img = plt.imshow(env.render(mode='rgb_array'))

  total_reward = 0

  done = False
  while not done:
      action, _ = agent.sample(state)
      actions.append(action)
      state, reward, done, _ = env.step(action)

      total_reward += reward

      img.set_data(env.render(mode='rgb_array'))
      display.display(plt.gcf())
      display.clear_output(wait=True)
      
  print(total_reward)
  test_total_reward.append(total_reward)

  action_list.append(actions) # save the result of testing 
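The test result is then the average of the five episode totals, which can be printed with, for example:

print(f"Average reward over {NUM_OF_TEST} tests: {np.mean(test_total_reward):.2f}")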

Action distribution

distribution = {}
for actions in action_list:
  for action in actions:
    if action not in distribution.keys():
      distribution[action] = 1
    else:
      distribution[action] += 1
print(distribution)
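The same tally can also be written more compactly with collections.Counter (an equivalent alternative, not the assignment's code):

from collections import Counter

distribution = Counter(action for actions in action_list for action in actions)
print(dict(distribution))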

Server
The code below simulates the environment on the judge server and can be used for testing.

action_list = np.load(PATH,allow_pickle=True) # The action list you upload
seed = 543 # Do not revise this
fix(env, seed)

agent.network.eval()  # set network to evaluation mode

test_total_reward = []
if len(action_list) != 5:
  print("Wrong format of file !!!")
  exit(0)
for actions in action_list:
  state = env.reset()
  img = plt.imshow(env.render(mode='rgb_array'))

  total_reward = 0

  done = False

  for action in actions:
  
      state, reward, done, _ = env.step(action)
      total_reward += reward
      if done:
        break

  print(f"Your reward is : %.2f"%total_reward)
  test_total_reward.append(total_reward)


Origin blog.csdn.net/Raphael9900/article/details/128549103