Li Hongyi 2022 Machine Learning HW12 Analysis

Preparation

Homework 12 uses reinforcement learning to complete the Lunar Lander task, i.e. training an agent to land a spacecraft on the moon. The assignment is based on OpenAI's gym framework (Linux only). The TA's sample code is needed throughout the homework; follow this official account to get the code (including the solution code; see the end of the article for how).

Submission address

https://ml.ee.ntu.edu.tw/hw12/. Students who want to discuss can join the QQ group: 156013866. The assignment analysis follows.

Simple Baseline

Method: run the TA code directly. The TA code uses Policy Gradient. You may run into version incompatibilities; the snippet below shows the issue I hit (the old call is commented out) and the fix. After the code runs, the final total reward is -71.65.

# torch.set_deterministic(True)             # old API, deprecated in newer PyTorch versions
torch.use_deterministic_algorithms(True)    # current replacement
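
For context, the core of a Policy Gradient update is to weight each sampled action's log-probability by the reward it earned. The sketch below is a minimal illustration rather than the TA's exact code; the log_probs and rewards tensors are assumed to come from sampled episodes.

import torch

def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style objective: maximize sum of log pi(a|s) * reward,
    # so we minimize its negative
    return -(log_probs * rewards).sum()

# example with dummy data: 10 sampled steps
log_probs = torch.randn(10, requires_grad=True)  # log-probabilities of the sampled actions
rewards = torch.randn(10)                        # reward signal for each step
policy_gradient_loss(log_probs, rewards).backward()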

Medium Baseline

Method: accumulate rewards. Building on the simple baseline, change the rewards to a cumulative (discounted) form, as in the code below. After the code runs, the final total reward is 8.49.

rate = 0.99  # discount factor
      ......
      while True:
            ......
            seq_rewards.append(reward)
            ......
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                # walk backwards through the episode so each step's reward becomes
                # the discounted sum of that reward and all later rewards
                for i in range(2, len(seq_rewards) + 1):
                    seq_rewards[-i] += rate * seq_rewards[-i + 1]
                rewards += seq_rewards
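
The same backward accumulation can be written as a stand-alone function, which is easier to sanity-check in isolation. This is a small sketch under the same 0.99 discount rate; the function name is only for illustration.

def discounted_returns(seq_rewards, rate=0.99):
    # G_t = r_t + rate * G_{t+1}, computed from the last step backwards
    returns = list(seq_rewards)
    for i in range(2, len(returns) + 1):
        returns[-i] += rate * returns[-i + 1]
    return returns

# the final reward is propagated backwards with discounting
print(discounted_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]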

Strong Baseline

Method: use Actor-Critic. Compared with Policy Gradient, the Actor-Critic model has two branches at the back of the network: one predicts the action and the other predicts the reward (state value). The loss function also needs an extra term for the reward prediction. See the answer code for details. After the code runs, a lucky total reward of 106.57 is obtained.
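
To illustrate the two-branch idea, here is a minimal sketch of a network with a shared backbone, an actor head, and a critic head. The layer sizes (an 8-dimensional observation and 4 discrete actions, matching LunarLander-v2) and the hidden width are assumptions; the answer code may be structured differently.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    # shared backbone with two heads: action probabilities and a value estimate
    def __init__(self, obs_dim: int = 8, n_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # branch 1: action logits
        self.critic = nn.Linear(hidden, 1)         # branch 2: predicted reward/value

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        return torch.softmax(self.actor(h), dim=-1), self.critic(h).squeeze(-1)

# example forward pass on a dummy observation
probs, value = ActorCritic()(torch.zeros(1, 8))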

Boss Baseline 

Method: use Advantage Actor-Critic (A2C). The Actor-Critic loss above corresponds to the "3.5" version the teacher described in class, where the Critic's output is used as the baseline; A2C is the "4.0" version, i.e. "average minus average". This method is more principled, but the loss function is more complicated, the model is harder to train, and hyperparameter tuning is needed. See the answer code for the detailed changes. After the code runs, the final total reward is 128.11, averaged over 5 runs. The improvement over Actor-Critic is not large, but individual runs often reach good rewards; the fluctuation is simply large, which is tied to how hard the model is to converge.
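
One possible shape of the A2C loss, where the advantage r_t + gamma * V(s_{t+1}) - V(s_t) replaces the raw return, is sketched below. The tensor names, the done mask, and the 0.5 critic weight are assumptions for illustration, not the exact answer code.

import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, next_values, rewards, dones, gamma=0.99, value_coef=0.5):
    # TD target and advantage: r_t + gamma * V(s_{t+1}) * (1 - done) - V(s_t)
    targets = rewards + gamma * next_values * (1.0 - dones)
    advantages = targets - values
    # actor is trained with the (detached) advantage, critic with the TD target
    actor_loss = -(log_probs * advantages.detach()).mean()
    critic_loss = F.mse_loss(values, targets.detach())
    return actor_loss + value_coef * critic_loss

In practice the inputs would come from a batch of rollout steps collected with the two-headed network described above.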

How to get the answer to homework 12:

  1. Follow the WeChat public account "Machine Learning Craftsman"

  2. Reply to the account with the keyword: 202212

Original post: blog.csdn.net/weixin_42369818/article/details/126119360