Table of Contents
Introduction & Behavioral Cloning
DAGGER algorithm to improve BC [introduces online iteration into BC, 2011]
Improve the performance of imitation learning
Combination of imitation learning and reinforcement learning
(1) The simplest and most direct combination: Pretrain and Finetune [very widely applied]
(2) IL combined with Off-Policy RL: an improvement to Pretrain and Finetune
(3) Another combination method: use IL as an auxiliary loss function
An interesting case study: motion imitation
Course Outline
Introduction to Imitation Learning
Behavioral cloning (BC) and the DAGGER algorithm
Inverse reinforcement learning (IRL) and generative adversarial imitation learning (GAIL)
Improve the performance of imitation learning
Combine imitation learning and reinforcement learning
Introduction & Behavioral Cloning
Start with the simplest method, behavioral cloning: the simple idea is to treat policy learning as supervised learning, e.g., learning a policy network
Directly treating this as a supervised problem is actually problematic: the data-distribution assumption is violated (supervised learning assumes the data is IID, but data collected in a sequential decision-making process is correlated), and if the model drifts into an off-course state (a state never seen during training), it does not know how to get back
One possible solution: keep adding data, turning this into an online process
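As a rough illustration of this supervised view, here is a minimal behavioral-cloning sketch in PyTorch; the observation/action dimensions, network sizes, and the demonstration tensors are all made-up placeholders, and MSE on continuous actions is just one possible loss choice:

```python
import torch
import torch.nn as nn

# Hypothetical demonstration data: states and expert actions (continuous control assumed).
demo_states = torch.randn(1000, 8)    # 1000 observed states, 8-dim observations
demo_actions = torch.randn(1000, 2)   # the expert's action for each state, 2-dim actions

# Policy network: maps a state to an action, trained exactly like a regressor.
policy = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    pred_actions = policy(demo_states)
    # Supervised loss: match the expert's actions (MSE for continuous actions).
    loss = nn.functional.mse_loss(pred_actions, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```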
DAGGER algorithm to improve BC [introduces online iteration into BC, 2011]
The disadvantage of DAGGER is that the third step (asking the expert to label the states the policy visited) is too time-consuming. Can DAGGER be improved? Can other algorithms be used for the labeling in the third step?
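To make the loop and its costly third step concrete, here is a self-contained toy sketch of DAGGER on a made-up 1-D task; the environment, the linear "policy fit", and all numbers are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Toy 1-D environment and expert, just to make the DAGGER loop concrete.
def step(state, action):
    return state + action + np.random.normal(0, 0.05)

def expert_label(state):
    return -0.5 * state   # the expert pushes the state back toward 0

def train_policy(dataset):
    # Step 1: supervised fit (here a 1-D least-squares fit: action = w * state).
    states = np.array([s for s, _ in dataset])
    actions = np.array([a for _, a in dataset])
    w = (states * actions).sum() / (states * states).sum()
    return lambda s: w * s

# Initial dataset from expert demonstrations.
dataset = [(s, expert_label(s)) for s in np.random.normal(0, 1.0, 50)]

for iteration in range(5):
    policy = train_policy(dataset)                        # 1. train on the aggregated data
    state, visited = np.random.normal(0, 1.0), []
    for _ in range(20):                                   # 2. roll out the current policy
        visited.append(state)
        state = step(state, policy(state))
    dataset += [(s, expert_label(s)) for s in visited]    # 3. expert labels the visited states (costly)
    # 4. aggregate and repeat
```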
Improve DAGGER:
Inverse RL & GAIL
Inverse RL
Comparison of IRL and RL:
Examples of IRL:
GAIL
Similar to IRL, a GAN learns the objective function for its generative model, and GAIL imitates this idea from GANs
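A minimal sketch of that borrowed idea, assuming a PyTorch discriminator over (state, action) pairs and the common -log(1 - D) surrogate reward; all shapes and data here are placeholders:

```python
import torch
import torch.nn as nn

# GAIL borrows the GAN idea: a discriminator D(s, a) is trained to tell expert
# state-action pairs from the policy's, and a reward derived from D is then used
# to train the policy with any RL algorithm (e.g. TRPO/PPO).
disc = nn.Sequential(nn.Linear(8 + 2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

expert_sa = torch.randn(256, 10)   # expert (state, action) pairs, 8 + 2 dims (placeholder data)
policy_sa = torch.randn(256, 10)   # (state, action) pairs sampled from the current policy

# Discriminator step: expert pairs get label 1, policy pairs get label 0.
loss = bce(disc(expert_sa), torch.ones(256, 1)) + \
       bce(disc(policy_sa), torch.zeros(256, 1))
opt.zero_grad()
loss.backward()
opt.step()

# Learned reward for the policy's samples (one common convention).
with torch.no_grad():
    reward = -torch.log(1 - torch.sigmoid(disc(policy_sa)) + 1e-8)
```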
Connection between IRL & GAIL
Improve the performance of imitation learning
How to improve our policy model?
Question 1: Multimodal behavior
Solutions:
① Output a mixture of Gaussians, i.e., a superposition of multiple modes (see the sketch after this list)
② Latent variable model
③ Autoregressive discretization
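A rough sketch of option ①, a mixture-of-Gaussians (mixture density) output head; all sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Mixture-of-Gaussians output head, so the policy can represent multi-modal expert behavior.
K, act_dim, obs_dim = 5, 2, 8      # 5 mixture components (illustrative)

backbone = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
head_logits = nn.Linear(64, K)               # mixture weights (before softmax)
head_means = nn.Linear(64, K * act_dim)      # one mean per component
head_logstd = nn.Linear(64, K * act_dim)     # one log-std per component

obs = torch.randn(32, obs_dim)
h = backbone(obs)
weights = torch.softmax(head_logits(h), dim=-1)     # (32, K)
means = head_means(h).view(-1, K, act_dim)          # (32, K, act_dim)
stds = head_logstd(h).view(-1, K, act_dim).exp()    # (32, K, act_dim)

# Training maximizes the mixture log-likelihood of the expert action;
# sampling picks a component by weight, then draws from its Gaussian.
```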
Question 2: Non-Markovian behavior
Solution:
① Model the entire observation history, e.g., with an LSTM
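A rough sketch of such an LSTM policy over the observation history (PyTorch assumed; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Feed the whole observation history through an LSTM, so the action can depend
# on more than the current frame.
obs_dim, act_dim, hidden = 8, 2, 64

lstm = nn.LSTM(input_size=obs_dim, hidden_size=hidden, batch_first=True)
action_head = nn.Linear(hidden, act_dim)

# A batch of 16 trajectories, each with 50 time steps of observations (placeholder data).
obs_history = torch.randn(16, 50, obs_dim)
features, _ = lstm(obs_history)            # (16, 50, hidden): one feature per step
actions = action_head(features)            # (16, 50, act_dim): one action per step

# Behavioral cloning then matches `actions` against the expert's actions,
# e.g. with an MSE or log-likelihood loss at every time step.
```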
An example of using an LSTM with demonstration data to complete robotic-arm grasping [AAAI 2018]
So in robotics, how to scale up data collection has always been a big problem
The crowdsourcing approach proposed by Fei-Fei Li's group at Stanford collects demonstration data from many, many people; the RoboTurk project developed such a solution
There are still some problems with imitation learning
① The data is provided by humans, and the data itself is limited
② People sometimes cannot provide good demonstrations, e.g., when teaching drones or complex robots
③ Humans can learn by exploring the environment freely; can our agents learn from such exploration as well?
So next we want to combine imitation learning with reinforcement learning
Combination of imitation learning and reinforcement learning
Comparison of the respective characteristics of imitation learning and reinforcement learning
How to combine the two, with both Demonstration and Rewards?
(1) The simplest and most direct combination: Pretrain and Finetune [very widely applied]
That is, use demonstrations to pre-train a policy (solving the exploration problem), then use RL to improve the policy and handle the states outside the demonstration distribution, finally exceeding the demonstrator's performance
The process of Pretrain and Finetune is as follows:
Here is the previous DAGGER algorithm, which can be compared with Pretrain and Finetune:
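For concreteness, here is a toy sketch of the pretrain-then-finetune recipe with a discrete-action policy; the data, the cross-entropy pretraining loss, and the single REINFORCE-style finetuning step are illustrative assumptions rather than the specific setup used in the lecture:

```python
import torch
import torch.nn as nn

# Phase 1 clones the demonstrator (maximum likelihood); phase 2 continues with RL.
obs_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Phase 1: pretrain on demonstrations (supervised cross-entropy = maximum likelihood).
demo_obs = torch.randn(512, obs_dim)                  # placeholder demonstration states
demo_act = torch.randint(0, n_actions, (512,))        # placeholder demonstration actions
for _ in range(200):
    loss = nn.functional.cross_entropy(policy(demo_obs), demo_act)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: finetune with RL; here a single REINFORCE-style step on pretend rollout data.
rollout_obs = torch.randn(64, obs_dim)
dist = torch.distributions.Categorical(logits=policy(rollout_obs))
rollout_act = dist.sample()
returns = torch.randn(64)                             # placeholder returns from the environment
pg_loss = -(dist.log_prob(rollout_act) * returns).mean()
opt.zero_grad()
pg_loss.backward()
opt.step()
```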
Applications of Pretrain and Finetune:
① Applied to AlphaGo [Silver et al., Nature 2016]
② Applied to StarCraft II [DeepMind]
Problems with Pretrain and Finetune:
① In the third step, when the better policy obtained from pretraining is further trained with reinforcement learning, we may face a distribution-mismatch problem
② The initial RL experience may be very bad, so bad that it can destroy the pretrained policy network during training
Solution to the problems of Pretrain and Finetune: consider how to keep using the Demonstration throughout training, i.e., Off-Policy RL
(2) IL combined with Off-Policy RL: an improvement to Pretrain and Finetune
Off-policy RL can use any experience data. For Q-Learning, for example, the demonstrations can always be used as long as they are placed in the replay buffer (see the sketch after Form 2 below)
① Form 1: Policy Gradient with Demonstration
Application examples:
② Form 2: Q-Learning with Demonstration
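One simple way to realize this "keep the demonstrations forever" idea (similar in spirit to DQfD-style methods) is a separate demonstration buffer that is mixed into every minibatch; the transition format, buffer sizes, and mixing ratio below are illustrative assumptions:

```python
import random
from collections import deque

# Demonstration transitions live in their own buffer (never overwritten) and each
# minibatch mixes them with the agent's own replay data.
demo_buffer = [(f"s{i}", 0, 1.0, f"s{i + 1}", False) for i in range(500)]   # (s, a, r, s', done) placeholders
agent_buffer = deque(maxlen=100_000)                                        # ordinary replay buffer
agent_buffer.extend((f"t{i}", 1, 0.0, f"t{i + 1}", False) for i in range(2000))

def sample_batch(batch_size=64, demo_fraction=0.25):
    n_demo = int(batch_size * demo_fraction)
    batch = random.sample(demo_buffer, n_demo)                        # demonstration transitions
    batch += random.sample(list(agent_buffer), batch_size - n_demo)   # agent's own transitions
    return batch   # fed into a standard Q-learning / DQN update

minibatch = sample_batch()
```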
(3) Another combination method: use IL as an auxiliary loss function
Optimize the expected return of RL plus the maximum likelihood of IL (sketched below)
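Written out, the combined objective roughly looks like the following, where λ is an assumed weighting coefficient between the two terms:

```latex
\max_{\theta}\;
\underbrace{\mathbb{E}_{\tau \sim \pi_{\theta}}\!\Big[\sum_{t} r(s_t, a_t)\Big]}_{\text{RL: expected return}}
\;+\;
\lambda \underbrace{\sum_{(s,a)\in \mathcal{D}_{\text{demo}}} \log \pi_{\theta}(a \mid s)}_{\text{IL: log-likelihood of the demonstrations}}
```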
Application examples: [2017]
An interesting case study: motion imitation
Sensors can be attached to a real person's joints to collect motion data, and data can even be collected from video via pose estimation to train the agent
For details, go to Teacher Zhou's class~
Problems with IL itself
(1) How to collect Demonstration
① Crowdsourcing
② Guided policy search or optimal control for trajectory optimization
(2) How to optimize the policy so that the agent can handle off-course states
① Model these off-course states and label them
② Use off-policy learning with the already collected samples
③ Combine IL and RL
Summary
Note: All the content in this article is derived from the reinforcement learning outline course by Mr. Zhou Bolei on Bilibili. After listening to it, I have benefited a lot; this article shares my lecture notes. Teacher Zhou's Bilibili homepage: https://space.bilibili.com/511221970?spm_id_from=333.788.b_765f7570696e666f.2
Thanks to Teacher Zhou:)