I looked at your setup, which uses a high initial epsilon; you should adjust the loss function to account for it. A high initial epsilon is useless: even when the agent has learned something, it rarely gets to act on it, so what it has learned cannot be used to generate more useful experience. As for the surge in loss midway through training, I think it may be because the agent has learned a policy but epsilon is still too high: if, say, two random actions are followed by one optimal action, this can produce a large loss.
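One common way to act on this advice, which the comment itself does not spell out, is to start epsilon at a moderate value and anneal it, so that the agent increasingly acts on what it has learned. The sketch below is only an illustration under assumptions: the names `epsilon_at` and `select_action` and the values `eps_start=0.3`, `eps_end=0.05`, and `decay_steps=10_000` are hypothetical and are not taken from the original setup.

```python
import random


def epsilon_at(step, eps_start=0.3, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps.

    All default values are illustrative assumptions, not tuned settings.
    """
    frac = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def select_action(q_values, step, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else argmax."""
    eps = epsilon_at(step)
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Greedy branch: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Calling `select_action(q_values, step)` once per environment step, with `step` counting total steps so far, gives mostly exploratory behaviour early on and mostly greedy behaviour later. Whether this also removes the loss surge depends on the rest of the training setup.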
A note recording someone else's parameter-tuning experience.
Origin www.cnblogs.com/awgn/p/12339929.html