8. Reinforcement Learning: Imitation Learning

Table of Contents

Course Outline

Introduction & Behavioral Cloning

The DAGGER algorithm to improve BC [introduces online iteration into BC; 2011]

Inverse RL & GAIL

Inverse RL

GAIL

Connection between IRL & GAIL

Improve the performance of imitation learning

Combination of imitation learning and reinforcement learning

(1) The simplest and most direct combination: Pretrain and Finetune [very widely used]

(2) IL combined with Off-Policy RL: an improvement to Pretrain and Finetune

(3) Another combination method: use IL as an auxiliary loss function

An interesting case study: motion imitation

Problems with IL itself

Summary


Course Outline

Introduction to Imitation Learning

Behavioral cloning (BC) and the DAGGER algorithm

Inverse reinforcement learning (IRL) and generative adversarial imitation learning (GAIL)

Improve the performance of imitation learning

Combine imitation learning and reinforcement learning

Introduction & Behavioral Cloning

Start with the simplest method, behavioral cloning (BC): the basic idea is to treat policy learning as supervised learning, e.g. training a policy network to map observations to the expert's actions.
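
A minimal sketch of behavioral cloning as supervised learning on (observation, expert action) pairs; the network sizes and the MSE loss for continuous actions are illustrative assumptions, not the course's reference implementation:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """A small policy network mapping observations to actions."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def behavioral_cloning(policy, demo_obs, demo_acts, epochs=100, lr=1e-3):
    """Fit the policy to expert (observation, action) pairs by regression."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # for discrete actions, use a cross-entropy loss instead
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(demo_obs), demo_acts)
        loss.backward()
        opt.step()
    return policy
```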

Directly treating this as a supervised problem is actually problematic: the assumptions about the data distribution are contradictory. Supervised learning assumes the data are i.i.d., but the data collected in a sequential decision-making process are correlated. Moreover, if the model drifts into an off-distribution state (one never seen during training), it does not know how to get back.

One possible solution: keep adding new data, turning training into an online process.

The DAGGER algorithm to improve BC [introduces online iteration into BC; 2011]

The disadvantage of DAGGER is that the third step (asking the expert to label the states the policy actually visited) is very time-consuming. Can DAGGER be improved? Can another algorithm do the labeling in the third step?
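
A minimal sketch of the DAGGER loop; the callables `run_policy`, `expert_label`, and `train_supervised` are placeholders supplied by the caller (assumptions, not a fixed API), and the step numbering follows the standard four-step description:

```python
def dagger(policy, demo_obs, demo_acts,
           train_supervised, run_policy, expert_label, n_iters=10):
    """Iteratively aggregate expert labels on the states the policy visits."""
    obs_data, act_data = list(demo_obs), list(demo_acts)
    for _ in range(n_iters):
        # 1) train the policy on the aggregated dataset (plain behavioral cloning)
        policy = train_supervised(policy, obs_data, act_data)
        # 2) run the trained policy and record the observations it visits
        visited = run_policy(policy)
        # 3) ask the expert to label every visited observation -- the
        #    time-consuming step discussed above
        labels = [expert_label(o) for o in visited]
        # 4) aggregate the newly labeled data into the dataset and repeat
        obs_data += visited
        act_data += labels
    return policy
```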

Improve DAGGER:

Inverse RL & GAIL

Inverse RL

Comparison of IRL and RL: RL is given a reward function and learns a policy, while IRL is given expert demonstrations and recovers the reward function (from which a policy can then be learned).
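
For reference, a standard way to formalize IRL (the maximum-entropy formulation, a common textbook version rather than necessarily the exact one on the slides) is to fit a parameterized reward $r_\psi$ so that the expert's trajectories are as likely as possible under the trajectory distribution it induces:

$$
\max_{\psi}\; \mathbb{E}_{\tau \sim \pi_{\text{expert}}}\!\Big[\sum_{t} r_{\psi}(s_t, a_t)\Big] \;-\; \log Z(\psi),
\qquad
Z(\psi) = \int \exp\!\Big(\sum_{t} r_{\psi}(s_t, a_t)\Big)\, \mathrm{d}\tau
$$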

Examples of IRL:

GAIL

GAIL borrows the idea of GANs: similar to how IRL learns a reward, a GAN learns an objective function for its generator. In GAIL, a discriminator is trained to tell expert data apart from the policy's data, and the policy (playing the role of the generator) is trained to fool it.
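
A rough sketch of this idea in a PyTorch-style setup: the discriminator classifies (state, action) pairs as expert-like or policy-generated, and its output is turned into a reward for the policy, which is then updated with an ordinary policy-gradient method such as TRPO/PPO. The layer sizes and the particular reward transform below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Classifies (state, action) pairs: expert-like vs. policy-generated."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # raw logit

def gail_reward(disc, obs, act):
    """Reward the policy for pairs the discriminator thinks look like expert data."""
    with torch.no_grad():
        d = torch.sigmoid(disc(obs, act))  # probability of "policy-generated"
        return -torch.log(d + 1e-8)        # one common surrogate reward
```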

Connection between IRL & GAIL

Improve the performance of imitation learning

How can we improve our policy model?

Question 1: Multimodal behavior

Solutions:

① Output a mixture-of-Gaussians model, i.e. a superposition of multiple peaks (see the sketch after this list)

② Latent-variable model

③ Autoregressive discretization
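
A rough sketch of option ①: a policy head that outputs a mixture of Gaussians instead of a single mean, so the BC loss becomes the negative log-likelihood of the expert action under the mixture. The layer sizes and number of modes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixtureGaussianPolicy(nn.Module):
    """Outputs K Gaussian modes so the policy can represent multimodal behavior."""
    def __init__(self, obs_dim, act_dim, n_modes=5):
        super().__init__()
        self.n_modes, self.act_dim = n_modes, act_dim
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, n_modes * act_dim)     # a mean per mode
        self.log_std_head = nn.Linear(64, n_modes * act_dim)  # a log-std per mode
        self.weight_head = nn.Linear(64, n_modes)             # mixture weights

    def forward(self, obs):
        h = self.body(obs)
        means = self.mean_head(h).view(-1, self.n_modes, self.act_dim)
        stds = self.log_std_head(h).view(-1, self.n_modes, self.act_dim).exp()
        weights = torch.softmax(self.weight_head(h), dim=-1)
        return weights, means, stds
```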

Question 2: Non-Markovian behavior

Solution:

① Model the entire observation history, e.g. with an LSTM (a sketch follows below)

An example of using an LSTM with demonstration data for robotic-arm grasping [AAAI 2018]
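
A generic sketch of option ① (not the AAAI 2018 architecture): an LSTM summarizes the whole observation history, and the action is read off the last hidden state. Dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Conditions the action on the whole observation history via an LSTM."""
    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_history, state=None):
        # obs_history: (batch, time, obs_dim) -- the observations seen so far
        out, state = self.lstm(obs_history, state)
        action = self.head(out[:, -1])  # act on the summary of the history
        return action, state
```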

So in robotics, how to scale up demonstration data has always been a big problem

The crowdsourcing approach proposed by Fei-Fei Li's group at Stanford collects demonstration data from a very large number of people; the RoboTurk project developed such a solution

There are still some problems with imitation learning:

① The data are provided by humans, and such data are inherently limited

② People sometimes cannot provide good demonstrations, e.g. when teaching drones or complex robots

③ An agent can also explore the environment freely on its own; can we learn from that as well?

So next we want to combine imitation learning with reinforcement learning

Combination of imitation learning and reinforcement learning

Comparison of the respective characteristics of imitation learning and reinforcement learning

How can we combine the two, using both demonstrations and rewards?

(1) The simplest and most direct combination: Pretrain and Finetune [very widely used]

That is, use demonstrations to pre-train a policy (which addresses the exploration problem), then use RL to improve the policy and handle the off-distribution states, and finally the policy can even exceed the demonstrator's performance

The process of Pretrain and Finetune is as follows: 

Here is the previous DAGGER algorithm, which can be compared with Pretrain and Finetune:

Application of Pretrain and Finetune:

① Applied to AlphaGo [Silver et al., Nature 2016]

② Applied to StarCraft II [DeepMind work]

Problems with Pretrain and Finetune:

① In the third step, when the pretrained (better) policy is further trained with reinforcement learning, we may face the problem of inconsistent distributions

② The first experience collected during RL may be very bad, and it can destroy the pretrained policy network during training.

Solution to the problems of Pretrain and Finetune: keep the demonstrations available throughout training, i.e. use off-policy RL

 

(2) IL combined with Off-Policy RL: an improvement to Pretrain and Finetune

Off-policy RL can use any experience data. For example, in Q-Learning, demonstrations can always be used as long as they are placed in the replay buffer
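
A small sketch of the replay-buffer idea: demonstration transitions are inserted first and kept alongside the agent's own experience, so every off-policy update can sample both. The buffer layout is an illustrative assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions for off-policy updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
# 1) seed the buffer with demonstration transitions before training starts:
#      for t in demo_transitions: buffer.add(t)
# 2) during training, the agent's own transitions are added alongside them,
#    so every Q-learning update mixes expert and self-collected experience.
```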

① Form 1: Policy Gradient with Demonstration

Application examples:

② Form 2: Q-Learning with Demonstration

(3) Another combination method: use IL as an auxiliary loss function

    Optimize the expected return of RL plus the maximum likelihood of IL on the demonstrations (a weighted sum of the two objectives)

    Application examples: [2017]
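
A sketch of the combined objective in (3), assuming the policy exposes a `log_prob(obs, act)` method (an assumed interface), that `rl_loss` comes from whatever RL algorithm is in use, and that `lam` weights the imitation term:

```python
def combined_loss(rl_loss, policy, demo_obs, demo_acts, lam=0.1):
    """RL objective plus a maximum-likelihood imitation term on the demos."""
    # negative log-likelihood of the expert actions under the current policy
    il_loss = -policy.log_prob(demo_obs, demo_acts).mean()
    # joint loss: minimizing it maximizes expected return and demo likelihood
    return rl_loss + lam * il_loss
```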

An interesting case study: motion imitation

Sensors can be attached to real human joints to collect motion data, and data can even be extracted from video via pose estimation to train the agent

For details, go to Teacher Zhou's class~

Problems with IL itself

(1) How to collect Demonstration 

         ① Crowdsourcing

         ② Guided policy search or optimal control for trajectory optimization

(2) How to optimize the policy so that the agent can handle off-distribution states

         ① Model these off-distribution states and label them

         ② Use off-policy learning with the already collected samples

         ③ Combine IL and RL 

Summary

 

Note: All the content in this article comes from the reinforcement learning outline course created by Mr. Zhou Bolei on Bilibili. I benefited a lot from listening to it, and this article shares my lecture notes. Teacher Zhou's Bilibili homepage: https://space.bilibili.com/511221970?spm_id_from=333.788.b_765f7570696e666f.2

Thanks to Teacher Zhou:)
