DDPG: how the final-state done flag influences the result
Foreword:
pass
Set category
First, I found an important problem. In gym, unless you explicitly use env.unwrapped, the environment comes wrapped with extra behaviour, such as a maximum number of steps per episode. In the Fetch series of environments the maximum episode length is 50 steps: at step 50 the environment returns done=True, and at every other step done is False.
This is something I had not noticed while tuning the HER algorithm.
When I stepped through the baselines code earlier, I found that their done only fires at step=50, not when the target task is completed.
I thought it was weird at the time, because it differs from what I understood done to mean.
Then there is my own HER code, which has not reproduced the baselines results. My HER hyperparameters are taken from baselines, but training always gets stuck in a "local worst" position.
This is a big problem. I could not find the cause for a long time, and I may have found it today.
As for env.unwrapped, I learned it from Mofan's intensive tutorial, and I did not expect it to hide a pitfall like this...
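To make the wrapping behaviour described above concrete, here is a minimal sketch in plain Python. `ToyEnv`, `ToyTimeLimit` and `collect_dones` are hypothetical stand-ins written for this note, not gym's own classes. They only mimic the idea that a time-limit wrapper forces done=True at the step cap, while the bare environment reached through `unwrapped` never does.

```python
class ToyEnv:
    """Hypothetical stand-in for a Fetch-style task that never sets done itself."""

    def reset(self):
        return 0  # dummy observation

    def step(self, action):
        # Sparse reward of -1 while the goal is not reached; done stays False.
        return 0, -1.0, False, {"is_success": 0.0}


class ToyTimeLimit:
    """Hypothetical wrapper: forces done=True once the step cap is reached."""

    def __init__(self, env, max_episode_steps=50):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed = 0

    @property
    def unwrapped(self):
        # Hands back the bare environment, bypassing the step cap.
        return self.env

    def reset(self):
        self._elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed += 1
        if self._elapsed >= self.max_episode_steps:
            done = True  # time limit hit, not task success
        return obs, reward, done, info


def collect_dones(env, n_steps=50):
    """Run n_steps with a dummy action and return the done flags."""
    env.reset()
    return [env.step(None)[2] for _ in range(n_steps)]


wrapped = ToyTimeLimit(ToyEnv(), max_episode_steps=50)
wrapped_dones = collect_dones(wrapped)          # True only at step 50
raw_dones = collect_dones(wrapped.unwrapped)    # never True
```

In this toy setup the only done=True comes from the wrapper at step 50, which is the situation the paragraph above describes.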
- Spinning Up: for the pseudo-final state, the done returned when the episode hits the maximum number of steps is manually reset to False. The RL update uses q_target(s,a) = r + gamma*(1-d)*q_target_net(s',a').
- baselines: the done information is not processed at all. Although each transition carries an is_success value, there is no d term in the RL update!
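The difference between the two targets can be sketched in a few lines. This is only an illustration of the formula above, not code from either library; the function names and the sample numbers are made up for the example.

```python
GAMMA = 0.98  # discount factor used in this post


def target_with_done(r, d, q_next, gamma=GAMMA):
    """Target with a done mask: r + gamma * (1 - d) * Q_targ(s', a')."""
    return r + gamma * (1.0 - d) * q_next


def target_without_done(r, q_next, gamma=GAMMA):
    """Target with no d term: always bootstraps from the next state."""
    return r + gamma * q_next


# Illustrative numbers only: reward -1 and a next-state Q estimate of -10.
r, q_next = -1.0, -10.0
masked = target_with_done(r, d=1.0, q_next=q_next)      # bootstrap cut off
unmasked = target_with_done(r, d=0.0, q_next=q_next)    # bootstrap kept
no_d_term = target_without_done(r, q_next)              # same as d = 0
```

With d=1 the target collapses to the immediate reward. With d=0 it is identical to the target that has no d term, which is why resetting every done to False makes the two update rules behave the same way.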
Here are the different done settings I recorded, all tested in the fetch-slide environment:
1. Pseudo-final state: done at the last step of the episode is set to True, and all other steps are False. Final result: not recorded, I forgot to run this one.
2. Spinning Up: done at the last step of the episode is set to False, and every other step keeps its own value (usually False). So although the algorithm has a done term, every done is False, and the result should be the same as baselines. Final epoch: success rate 0, avgQ1=-7.79, minQ1=-88.3, maxQ1=17.3. At first I didn't understand: the maximum reward is 0, so how can Q be greater than 0? After looking at the plot I could see that training had gone through a lot of twists and turns... Under normal circumstances the maximum Q value is still 0, and only a few outliers exceed 0, which should be regarded as a legacy issue.
3. Baselines: done is set directly to False, so the effect is the same as Spinning Up. Let's look at the final epoch result: success rate 0, avgQ1=-7.79, minQ1=-65, maxQ1=20.
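The three settings in the list can be written as one small relabelling function. This is a sketch under assumptions: the scheme names and `relabel_dones` are hypothetical and not taken from Spinning Up or baselines, and the episode is the failure case from this post, where the environment only reports done=True at the 50-step cap.

```python
def relabel_dones(env_dones, scheme):
    """Return the done flags to store in the replay buffer for one episode.

    env_dones: flags as returned by the time-limited environment.
    scheme: one of "pseudo_final", "spinup", "baselines" (hypothetical names).
    """
    last = len(env_dones) - 1
    if scheme == "pseudo_final":
        # Last step is True, every other step is False.
        return [i == last for i in range(len(env_dones))]
    if scheme == "spinup":
        # Last step is reset to False, other steps keep their own value.
        return [False if i == last else d for i, d in enumerate(env_dones)]
    if scheme == "baselines":
        # done is ignored entirely, which is equivalent to all False.
        return [False] * len(env_dones)
    raise ValueError(f"unknown scheme: {scheme}")


# A failed 50-step episode: the only done=True comes from the step cap.
episode = [False] * 49 + [True]
pseudo = relabel_dones(episode, "pseudo_final")
spinup = relabel_dones(episode, "spinup")
baselines = relabel_dones(episode, "baselines")
```

For an episode like this one, settings 2 and 3 produce exactly the same flags, which matches the expectation that their results should be the same.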
None of the runs above converged. As the results show, they are all poor: although three forms of done were set, the final results are similar. The Q value that deviates from the target is basically around -1/(1-0.98) = -50, which is the discounted sum of a -1 reward at every step when bootstrapping never stops.
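The -1/(1-0.98) figure can be checked numerically. The sketch below assumes a constant reward of -1 at every step and gamma=0.98; the function names are made up for the example. It compares the fixed point of Q = r + gamma*Q (what an all-False done drives Q toward for a state that never succeeds) with the discounted return of a 50-step episode that really ends.

```python
GAMMA = 0.98


def bootstrapped_fixed_point(reward=-1.0, gamma=GAMMA, iters=5000):
    """Iterate Q <- r + gamma * Q, i.e. a backup that never sees done=True."""
    q = 0.0
    for _ in range(iters):
        q = reward + gamma * q
    return q


def truncated_return(reward=-1.0, gamma=GAMMA, horizon=50):
    """Discounted return of an episode that truly ends after `horizon` steps."""
    return sum(reward * gamma ** t for t in range(horizon))


closed_form = -1.0 / (1.0 - GAMMA)       # the value quoted in the text
fixed_point = bootstrapped_fixed_point()  # converges to the closed form
finite_sum = truncated_return()           # less negative than the closed form
```

The gap between `finite_sum` and `closed_form` is the extra pessimism that comes from always bootstrapping past the end of the episode.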
Now test the other settings