Reinforcement learning: the loss function does not decline

Problem Description

Training a PPO agent on the gym.make('CartPole-v0') environment.
Parameters are as follows:

hidden_units = 50
layers = 3
learning_rate = 0.001 # critic and actor use the same learning rate
max_train_episodes = int(1e4)
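
The post only lists the hyperparameters and does not say which framework is used. Below is a minimal setup sketch under the assumption of PyTorch; build_mlp is a hypothetical helper that stacks the stated number of hidden layers for both networks:

import gym
import torch
import torch.nn as nn

hidden_units = 50
layers = 3
learning_rate = 0.001   # critic and actor use the same learning rate
max_train_episodes = int(1e4)

env = gym.make('CartPole-v0')
obs_dim = env.observation_space.shape[0]   # 4 observation features
n_actions = env.action_space.n             # 2 discrete actions

def build_mlp(in_dim, out_dim):
    # Hypothetical helper: `layers` hidden layers of `hidden_units` units, then a linear head.
    mods, dim = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(dim, hidden_units), nn.Tanh()]
        dim = hidden_units
    mods.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*mods)

actor = build_mlp(obs_dim, n_actions)   # outputs action logits
critic = build_mlp(obs_dim, 1)          # outputs the state value V(s)

actor_opt = torch.optim.Adam(actor.parameters(), lr=learning_rate)
critic_opt = torch.optim.Adam(critic.parameters(), lr=learning_rate)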

During training the agent gradually improves, with the average reward increasing by about 50 steps, but the loss function does not decline.
[Figure: average reward curve during training]

However, the critic loss and actor loss curves logged to TensorBoard do not decline over the course of training.

[Figure: TensorBoard curves of critic loss and actor loss]
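
The logging code is not shown in the post; a minimal sketch of how such curves could be written with PyTorch's SummaryWriter (log directory and argument names are hypothetical):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/ppo_cartpole')   # hypothetical log directory

def log_metrics(writer, episode, actor_loss, critic_loss, avg_reward):
    # Each scalar shows up as a curve in TensorBoard, like the screenshots above.
    writer.add_scalar('loss/actor', float(actor_loss), episode)
    writer.add_scalar('loss/critic', float(critic_loss), episode)
    writer.add_scalar('reward/average', float(avg_reward), episode)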

Cause Analysis

As training progresses, the data in the buffer keeps changing: the actor and critic are trained on batches regenerated by the current policy, so their training set is dynamic. This is different from supervised learning, where the dataset is fixed, which is why the loss does not show a downward trend even though the policy is improving.
Reference:
https://stackoverflow.com/questions/47036246/dqn-q-loss-not-converging
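
To make the point concrete, here is a sketch of one PPO update, continuing the PyTorch setup above; collect_rollouts is a hypothetical helper that gathers a fresh batch with the current policy, and the iteration count is illustrative. Because both the batch and the return targets are regenerated every iteration, neither loss is measured against a fixed dataset:

import torch

for iteration in range(1000):
    # Fresh data every iteration: the buffer contents depend on the current policy.
    obs, actions, returns, old_log_probs = collect_rollouts(env, actor, critic)

    values = critic(obs).squeeze(-1)
    advantages = returns - values.detach()

    # The critic regresses onto returns produced by an ever-improving policy,
    # so its regression target shifts as training progresses.
    critic_loss = ((values - returns) ** 2).mean()

    # The clipped surrogate actor loss depends on the current advantages,
    # not on the distance to any fixed optimum, so it need not trend downward.
    dist = torch.distributions.Categorical(logits=actor(obs))
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    actor_loss = -torch.min(ratio * advantages,
                            torch.clamp(ratio, 0.8, 1.2) * advantages).mean()

    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()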
