Reincarnating Reinforcement Learning [Reincarnating RL] Paper Notes

Foreword:

It has been a long time since I wrote a speed-reading article. Recently, my group chat recommended two very interesting papers: one on resetting networks and one on reincarnating RL, which approach the goal of making reinforcement learning train faster from two different angles.

Since I have also been doing related work recently, I read both papers carefully. For the first one I recorded a video: [Interpretation and discussion of ResetNet - The Primacy Bias in Deep Reinforcement Learning - 哔哩哔哩] https://b23.tv/GJQ85N5

After discussing the second paper with the group this Saturday night, there should be another video. For now, let me sort out my thoughts.

First of all, the OpenReview score of this paper is not high, because its core method is relatively trivial. However, the story is very good, the framing is grand, the reported improvements are clear, and the experimental results are rich.

In addition, Zhihu once had a hot RL question: why are there so few pretrained models in reinforcement learning? "NLP and CV have some very well-known pretrained models (such as BERT and ResNet), but reinforcement learning seems to have no comparable backbone." This paper tries to build such a backbone for reinforcement learning and advocates for this pretrained-model paradigm.

Based on this, the paper provides one concrete example of transferring a policy to a value-based agent. In essence, a suboptimal policy plus some suboptimal data is used to pre-train a tabula-rasa RL agent; behavior cloning (policy distillation) is then used as a constraint to help the new policy converge quickly; finally, a decaying coefficient is applied to the distillation loss to "wean" the student off the teacher, so that the new agent can eventually surpass the suboptimal teacher policy. This scheme works well in the examples given in the paper, but I personally feel it may not be so suitable for actor-critic algorithms.
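To make the "weaning" idea concrete, here is a minimal sketch of my own (not the authors' released code) of a DQN-style update that combines the usual TD loss with a teacher-distillation term whose weight decays over training. Names such as `q_student`, `teacher_probs`, and `decay_steps` are hypothetical, and the linear decay schedule is just for illustration; the paper's actual schedule may differ.

```python
import torch
import torch.nn.functional as F

def reincarnation_loss(q_student, q_target, batch, teacher_probs, step,
                       decay_steps=100_000, gamma=0.99, temperature=1.0):
    s, a, r, s_next, done = batch

    # Standard DQN-style TD loss on the student's Q-network.
    q_sa = q_student(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values
    td_loss = F.smooth_l1_loss(q_sa, target)

    # Distillation term: push the student's softmax-Q policy toward the
    # teacher's action distribution on the same states.
    student_log_probs = F.log_softmax(q_student(s) / temperature, dim=1)
    distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # Decaying coefficient: full distillation at the start, pure RL at the end.
    alpha = max(0.0, 1.0 - step / decay_steps)
    return td_loss + alpha * distill_loss
```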

Article link:

https://agarwl.github.io/reincarnating_rl

Introduction of the author team:

The team is from Google Brain's Montreal lab. After watching the first author's talk, he also comes across as a cheerful person, which is quite interesting.

Professor Shen Xiangyang's ten questions for reading a paper:

Q1: What problem does the paper try to solve?
A1: This paper attempts to formally define how reinforcement learning can reuse past training results, ideally breaking through the limitations of a fixed network architecture.

Q2: Is this a new question?
A2: It is not a new problem; reusing past training results has been tried by many people before. But the authors give a formal definition for the first time and provide a fairly general solution that can accelerate a tabula-rasa value-based RL agent using only a suboptimal policy and its data.

Q3: What scientific hypothesis is this article going to verify?
A3: That a pre-training paradigm for reinforcement learning is feasible. As for the name "reincarnating RL", it feels more like a gimmick, though it does sound a lot grander.

Q4: What related research is there? How to classify? Who are noteworthy researchers in this field?
A4: Five are mentioned in the paper: Rehearsal, JSRL, Offline RL Pretraining, Kickstarting, and DQfD. I feel one more is missing: residual RL. Using the guide policy directly for exploration is not a good idea, because the state distributions of the guide policy and the learned policy will obviously mismatch; the random-mixing scheme used by the self-guided exploration strategy in my previous work, RHER, does not have this problem. As for the algorithm this paper proposes for reincarnating RL, PVRL (policy(+data)-to-value RL), its similarities with the five schemes above are that, like offline RL, it uses the teacher policy's data for offline pre-training followed by online fine-tuning, and, like Kickstarting, it uses a policy distillation loss. The difference is that this work adds a decaying coefficient to the distillation loss as a "weaning" operation. There are only three steps in total, and the final difference really amounts to one more hyperparameter. But they are the first to formally define the large-model pre-training paradigm for RL, and it is verified on many tasks, so it can be regarded as solid work.

Q5: What is the key to the solution mentioned in the paper?
A5: At first I thought the solution was simply pre-training plus a behavior cloning loss. But reading carefully, the first three pages do not mention the specific method at all; they only name it "policy-to-value RL" and claim that a single suboptimal policy can help a brand-new value-based RL agent train quickly, without caring about the new agent's network architecture or hyperparameters at all. That description sounded amazing, and after reading it I still had no idea how it could be implemented. Then, when I reached the method section, I realized it is actually just a policy distillation loss... and it even needs some data from the teacher agent. In the reviewers' words, it should not be called PVRL but (Policy+Data)-to-Value RL. So the key technique is the three steps from A4: pre-training, distillation, and a decaying coefficient to "wean" the student off the teacher policy.
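As a rough picture of those three steps, here is a high-level sketch under my own assumptions; the `student`, `teacher`, `offline_buffer`, and `env` objects and their methods are hypothetical placeholders, not the paper's actual API.

```python
def reincarnate(student, teacher, offline_buffer, env,
                pretrain_steps=50_000, online_steps=500_000, decay_steps=100_000):
    # Step 1: offline pre-training on the teacher's data with a distillation loss.
    for step in range(pretrain_steps):
        batch = offline_buffer.sample()
        student.update(batch, teacher=teacher, distill_weight=1.0)

    # Steps 2 + 3: online fine-tuning; the distillation weight decays to zero,
    # so the student is gradually "weaned" off the teacher and can surpass it.
    obs = env.reset()
    for step in range(online_steps):
        action = student.act(obs)
        next_obs, reward, done, _ = env.step(action)
        student.replay.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        alpha = max(0.0, 1.0 - step / decay_steps)
        student.update(student.replay.sample(), teacher=teacher, distill_weight=alpha)
```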

Q6: How are the experiments in the paper designed?
A6: The experimental design paradigm is quite different from ordinary RL papers.
[Figure from the paper: normalized performance curves, with a vertical line marking the reincarnation point.]
First of all, the axes are unusual. The horizontal axis is split into different panels according to different training modes, and the vertical axis is a normalized score: the random policy's performance is 0 and the teacher policy's performance is 1, with linear interpolation in between, which highlights the gap between teacher and student policies. Secondly, regarding the curves, the dashed vertical line marks the reincarnation point: to its left is the teacher policy's performance (or the pre-training curve), and to its right is the student policy's online fine-tuning curve. Titles and legends denote the different settings.
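For clarity, here is a tiny sketch of the normalization I believe is being used (the example numbers are made up):

```python
# Rescale scores so that the random policy maps to 0 and the teacher policy maps to 1.
def normalized_score(score, random_score, teacher_score):
    return (score - random_score) / (teacher_score - random_score)

# e.g. if random = 200, teacher = 5000, and the student reaches 6200,
# the normalized score is (6200 - 200) / (5000 - 200) = 1.25, i.e. 125% of the teacher.
```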

Q7: What is the dataset used for quantitative evaluation? Is the code open source?
A7: Several resource-hungry benchmarks: the Atari games of the Arcade Learning Environment (ALE; arcade tasks, discrete actions, DQN family); the humanoid control task humanoid:run (the only continuous-action task, using TD3, with no pre-training, which is very strange; I strongly suspect TD3 pre-training simply did not work well~); and a "real-world" task (actually also simulated): the Balloon Learning Environment (BLE, stratospheric balloon flight control, also discrete actions, again with the DQN family).

Not only is the code open source (in fact, there is not much in the code...), but because they want to promote an RL pre-training backbone, they also open-sourced a lot of Google's stashed-away trained model checkpoints and data.

Q8: Do the experiments and results in the paper well support the scientific hypothesis that needs to be verified?
A8: On the tasks they show, the hypothesis is well supported: with pre-training, a distillation loss, and a decaying distillation coefficient, a teacher policy plus a small amount of data is enough to quickly transfer knowledge into a brand-new value-based RL network.

Q9: What is the contribution of this paper?
A9: The first contribution is the formal definition of reincarnating RL (which, strictly speaking, feels like essentially a pre-training paradigm for reinforcement learning). Second, it provides a policy-to-value transfer scheme that is probably most useful for discrete actions. Finally, as advocates of the paradigm, they have open-sourced some of Google's stashed-away trained models and data, which is undoubtedly good news for labs and individuals with limited compute.

Q10: What's next? Is there any work that can be continued in depth?
A10: A next step is to follow the practice of NLP and CV and propose a better, more general transfer scheme, for example how to quickly transfer actor-critic algorithms while avoiding the problems offline RL encounters in the AC setting, such as OOD actions and the performance drop that can occur during offline-to-online fine-tuning.

That's all for today. Interested readers can read the original paper themselves, or join the discussion in the Tencent Meeting at 7:00 p.m. on Saturday.

Contact information:

PS: Students working on reinforcement learning are welcome to join the groups to study together:

Deep Reinforcement Learning - DRL: 799378128

MuJoCo modeling: 818977608

Students who play other physics engines are welcome to play together~

Feel free to follow my Zhihu account: "The alchemy apprentice who has not yet started"

CSDN account: https://blog.csdn.net/hehedadaq

My two GitHub repositories, welcome to star~

Minimalist spinup+HER+PER code implementation: https://github.com/kaixindelele/DRLib

Self-Guided Exploration - Relay HER (the most exploration-efficient HER variant that I know of): https://github.com/kaixindelele/RHER
