DDPG: the influence of the final-state done flag on results

Foreword:

pass

First, an important discovery: in gym, if you don't manually take env.unwrapped, the environment is wrapped with extra logic, such as a maximum episode length. In the Fetch-series environments the limit is 50 steps per episode: at step 50 the env returns done=True, and done is False at every other step.
I hadn't noticed this while tuning my HER algorithm.
When I debugged the baselines code earlier, I found that their done triggers at a fixed step 50, not on completing the goal.
I found that odd at the time, since it differs from what done means in my understanding.
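A minimal sketch of this wrapper behavior, assuming the pre-0.26 gym API (4-tuple step return) and that the robotics environments are installed; 'FetchSlide-v1' is just an example id:

```python
import gym

env = gym.make('FetchSlide-v1')     # gym wraps this in a TimeLimit by default
print(env._max_episode_steps)       # 50 for the Fetch series

obs = env.reset()
done, t = False, 0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    t += 1

# done=True here only because the 50-step limit was hit, regardless of success
print(t, info['is_success'])

raw_env = env.unwrapped             # the raw env, with no step limit attached
```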

Next is my own HER code. I have not managed to reproduce the baselines results: my HER hyperparameters follow baselines, yet training always gets stuck in a "local worst" position.
The problem is serious, and for a long time I couldn't find the cause. Today I may have found it.
As for unwrapped, I learned it from Mofan's reinforcement learning tutorial, but I didn't expect a landmine to be buried there...

  1. Spinning Up: in the pseudo-final state, the done returned because the episode hit the maximum step count is manually reset to False. The Bellman target in the update is y = r + gamma * (1 - d) * Q_target(s', a'); see the sketch below.
  2. baselines: does not process the done information at all. Although its transitions carry an is_success value, there is no d term in the update!
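A minimal sketch of these two done conventions, not the libraries' actual code; gamma, q_next, and the function names are placeholders:

```python
def spinup_done(env_done, ep_len, max_ep_len=50):
    # Spinning Up convention: a timeout is not a true terminal state,
    # so done is forced to False at the step-limit boundary.
    return False if ep_len == max_ep_len else env_done

def spinup_target(r, q_next, done, gamma=0.98):
    # Bellman target with the done mask: y = r + gamma * (1 - d) * Q_target(s', a')
    return r + gamma * (1.0 - float(done)) * q_next

def baselines_target(r, q_next, gamma=0.98):
    # baselines-style HER update: no done term at all, i.e. d is always 0
    return r + gamma * q_next
```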

Below I record the different done settings, tested in the FetchSlide environment:
1. Pseudo final state: done at the last step of the episode is set to True, all others False. Final result: I forgot to run this one. Damn.
2. Spinup: done at the last step of the episode is forced to False, and every other step keeps its own value (normally False). So although done appears in the algorithm, every done is effectively False, and the result should match baselines. Final-epoch success rate is 0, avgQ1=-7.79, minQ1=-88.3, maxQ1=17.3. At first I didn't understand: the maximum reward is 0, so any consistent Q estimate should also be at most 0; why can Q exceed 0? After many twists and turns with the plot, I found that under normal circumstances the maximum Q value is indeed 0, and only a few outliers exceed it, which should be a leftover from earlier in training.

3. Baselines: done is directly set to False; the effect is the same as the Spinup setting. Looking at the final-epoch result: success rate 0, avgQ1=-7.79, minQ1=-65, maxQ1=20.

All of the above are cases without convergence; as you can see, the results are poor. Although three forms of done were set, the final results are all similar: the Q values settle around -1/(1-0.98), the discounted return of a policy that never succeeds under a constant -1 step reward.
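For reference, that plateau is just the geometric series of a constant -1 reward under gamma = 0.98:

```python
gamma = 0.98
print(-1 / (1 - gamma))   # -50.0: the value a never-succeeding policy converges to
```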

Next I will test the other settings.


Origin: blog.csdn.net/hehedadaq/article/details/113799677