A Minimalist Approach to Offline Reinforcement Learning [TD3+BC] Reading Notes

Foreword:

Recently I have been curious about which offline RL methods out there are both reliable and concise.
Several friends in my group then recommended the latest work from the author of TD3: TD3+BC.
The funny thing is that I had already come across this paper while looking into BC, but I didn't read it carefully; my reaction at the time was, "that's it?"

Only after reading its experimental results, code, and OpenReview discussion did I realize there really is something to it.

At the very least the method is concise, and the change amounts to two lines of code:
[figure: the two-line change to the actor update, from the paper/official repo]
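For reference, here is a minimal sketch of what that change looks like in a TD3-style actor update, written from the paper's description (function and variable names are my own; the official repo at https://github.com/sfujim/TD3_BC is the authoritative version):

```python
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic_q1, state, action, alpha=2.5):
    """Actor loss for TD3+BC: the usual TD3 policy loss plus a BC (MSE) term.

    `actor` maps states to actions, `critic_q1` maps (state, action) to Q-values;
    `state`, `action` are a mini-batch sampled from the offline dataset.
    """
    pi = actor(state)                        # actions proposed by the current policy
    q = critic_q1(state, pi)                 # critic's estimate of their value
    lmbda = alpha / q.abs().mean().detach()  # rescale Q so it is comparable to the BC term
    return -lmbda * q.mean() + F.mse_loss(pi, action)  # BC term keeps pi(s) near dataset actions
```

In plain TD3 the actor loss is just `-q.mean()`; the BC term and the `lmbda` rescaling are the two added lines.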

There are only three main files in the entire code base, without even a messy tangle of inherited folders, which is much cleaner than the baselines next door.

Classic moments:

1. Dissing the competition:

While there are many proposed approaches to offline RL, we remark that few are truly “simple”, and even the algorithms which claim to work with minor additions to an underlying online RL algorithm make a significant number of implementation-level adjustments.

2. How to defend your idea against reviewers:

A Minimalist Approach to Offline Reinforcement Learning-OpenReview
The second reviewer:

First, it seems that the novelty of the method is a bit limited. The authors seem to directly adapt RL+BC to the offline setting except that they add the state normalization, which is also not new. The authors also didn’t theoretically justify the approach. For example, the authors should show that the method can guarantee safe policy improvement and moreover enjoys comparable or better policy improvement guarantees w.r.t. prior approaches. Without the theoretical justification and given the current form of the method, I think the method is a bit incremental.

The reviewer then gave a score of 5, a weak reject...
The authors' reply is also very good:

On novelty: We don’t disagree at all that our algorithm is incremental in novelty (we highlight a number of similar algorithms in the related work). However, our main claim/contribution is not so much that this is the best possible offline RL algorithm, or that it is particularly novel, but rather the surprising observation that the use of very simple techniques can match/outperform current algorithms. The hope is that TD3+BC could be used as an easy-to-implement baseline or starting point for other additions (such as S4RL), while eliminating a lot of unnecessary complexity, hyperparameter tuning, or computational cost, required by more sophisticated methods.

In other words: the algorithm doesn't contain much innovation, but it is concise and effective. Maybe only a big name can win that kind of battle with reviewers...
I also read the paper's related work, and it left me a bit confused: many earlier algorithms already use BC to keep the policy from deviating from the dataset's action distribution, so is the difference really that this one uses only BC?

3. Structure and performance compared with the SOTA algorithms:

[figure from the paper: comparison of algorithm components and performance against prior methods]
Setting novelty aside and looking only at structure and performance: isn't this method simple and strong at the same time?

Background on offline RL:

Since I have written reading notes on offline RL before, I will only say a few words about it here.

Generally speaking, reinforcement learning has to interact with the environment. For (s, a) pairs that have not been experienced, the value estimates may be wrong at first, especially the overestimated ones; in principle they get corrected by the rewards actually received from the environment.
In the offline setting, however, the agent never interacts with the environment except when the final policy π is evaluated; all data comes from a fixed dataset. For (s, a) pairs that are not in the dataset, a wrong overestimate will never be corrected, and a policy network optimized by gradient ascent against this wrong value function is bound to go off the rails. That is the so-called distributional shift problem.

Current offline algorithms, this paper included, therefore add constraints of one kind or another so that the policy network's outputs do not stray too far from the actions in the dataset.

Earlier methods enforced this in all sorts of convoluted ways: some increased the computational cost, others introduced redundant hyperparameters.
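Very roughly, and in my own notation (not the paper's), most of these constraints can be seen as some variant of a behavior-regularized objective, where TD3+BC simply picks the cheapest possible penalty, an MSE to the dataset action:

```latex
% Rough summary of behavior-regularized offline RL (my own notation):
% maximize the value of the policy while penalizing divergence from the behavior policy \pi_\beta
\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}} \Big[
    \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q(s, a) \big]
    \;-\; \alpha \, D\big( \pi(\cdot \mid s) \,\Vert\, \pi_\beta(\cdot \mid s) \big)
\Big]
```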

The core content of TD3+BC:

Although I didn't fully understand its performance comparison chart, that doesn't stop me from taking its performance as SOTA...
With that settled, let's look at Section 5 of the paper: the entire algorithm is explained on a single page, yet the whole article still runs to 17 pages... the workload expected of a top-conference paper is really something...

  1. When updating the policy network, one extra term is added: a BC loss (with the Q term rescaled by a weight λ).

π = argmax_π E_{(s,a)∼D}[ λ Q(s, π(s)) − (π(s) − a)² ]   (the policy objective from the paper)

One thing to keep in mind is that the two terms should not differ too much in order of magnitude. The BC term depends on the action values, and actions generally lie in [-1, 1], so the largest squared error per dimension is only (1 − (−1))² = 4; the Q term therefore also needs to be kept within a comparable range.
I have recently been combining BC and −Q myself, and had not even considered weighting the two terms.

The paper's weighting of Q simply divides the Q values of the current mini-batch by the mean of their absolute values and then multiplies by a coefficient α, set to 2.5 in the paper. That is, the Q term is kept roughly within [-2.5, 2.5] in magnitude. But when the gradient flows from Q back through the critic network to the actor's output, it doesn't seem like the story is only about the loss scale? I hadn't thought this through at the time; I hope someone can enlighten me...
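A quick back-of-the-envelope illustration of that scaling, with made-up numbers, just for intuition:

```latex
% Suppose the Q-values in the current mini-batch have a mean magnitude of about 200.
% With the paper's \alpha = 2.5, the weight on the Q term becomes
\lambda = \frac{\alpha}{\frac{1}{N}\sum_{i} |Q(s_i, a_i)|} = \frac{2.5}{200} = 0.0125,
% so \lambda Q is on the order of 2.5, comparable to the BC term (\pi(s) - a)^2,
% which is at most about 4 per action dimension when actions lie in [-1, 1].
```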

  2. The second ingredient is state normalization. Although many algorithms already use this trick, the author lists it separately to be transparent about every modification made to TD3. What I am curious about, though: the mean and variance are computed on the offline dataset and then carried over to the online evaluation scene; is that really reasonable?
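A minimal sketch of what that normalization looks like (my own code; I assume the offline dataset's states are a NumPy array, and the small epsilon just avoids division by zero):

```python
import numpy as np

def compute_state_stats(states: np.ndarray, eps: float = 1e-3):
    """Per-dimension mean/std computed once over the entire offline dataset."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps
    return mean, std

def normalize_state(state: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Applied both to training batches and to observations during (online) evaluation."""
    return (state - mean) / std
```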

Analysis of Experimental Results – Updated Version

I didn't understand what the "random", "medium", "expert" labels on those datasets meant, and the paper doesn't spell it out.
In the end I opened the D4RL paper and put together a summary of these cursed datasets (a small loading sketch follows the list):

  1. The “medium” dataset is generated by first training a policy online using Soft Actor-Critic (Haarnoja et al., 2018a), early-stopping the training, and collecting 1M samples from this partially-trained policy. That is, it is all medium-level data, with no random flailing mixed in.
  2. The “medium-replay” dataset consists of recording all samples in the replay buffer observed during training until the policy reaches the “medium” level of performance. That is, everything from random flailing up to medium level.
  3. The “random” datasets are generated by unrolling a randomly initialized policy on these three domains. That is, pure random flailing.
  4. We further introduce a “medium-expert” dataset by mixing equal amounts of expert demonstrations and suboptimal data, generated via a partially trained policy or by unrolling a uniform-at-random policy. That is, medium + expert.
  5. The “expert” dataset is a large amount of data from a fully trained (fine-tuned) RL policy. That is, expert-level data.
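For anyone who wants to poke at these datasets directly, here is a minimal loading sketch using the D4RL package (the exact environment names and version suffixes may differ depending on your D4RL release):

```python
import gym
import d4rl  # noqa: F401  (importing d4rl registers the offline MuJoCo envs with gym)

# Task names combine an agent (halfcheetah / hopper / walker2d) with a dataset type
# (random / medium / medium-replay / medium-expert / expert).
env = gym.make("halfcheetah-medium-v0")

# Dict with 'observations', 'actions', 'rewards', 'next_observations', 'terminals'.
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape, dataset["actions"].shape)
```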

With this prior knowledge, Figures 1, 2, and 3 become much easier to read.
In Figure 1 they measure the percent difference in performance when the extra components of CQL and FisherBRC are removed one by one, and performance drops significantly on many tasks.
Honestly, I still can't quite work out what "percent difference" means here (my guess: the relative change in score compared to the full, unablated algorithm).

About the comparison with Decision Transformer

Although I don't quite understand how DT is adapted from the offline to the online setting, in this author's telling DT can also be treated as a SOTA-level offline baseline.
The paper compares against it as well: DT needs per-task tuning while TD3+BC does not, TD3+BC's performance is slightly better, and it is far faster than DT.

Some other doubts:

The whole paper is validated only on the three cursed MuJoCo tasks. From my rough experience of running experiments on them, these tasks carry a fair amount of randomness; even if a method runs well here, it won't necessarily work as well on other tasks.

Contact information:

PS: students working on reinforcement learning are welcome to join the QQ groups and study together:

Deep Reinforcement Learning - DRL: 799378128

Mujoco modeling: 818977608

Students who play other physics engines are welcome to play together~

Welcome to pay attention to the Zhihu account: The alchemy apprentice who has not yet started

CSDN account: https://blog.csdn.net/hehedadaq

My personal blog:
the uninitiated alchemy apprentice
The URL is easy to remember: metair.top

Minimalist spinup+HER+PER code implementation: https://github.com/kaixindelele/DRLib


Original post: https://blog.csdn.net/hehedadaq/article/details/122161632