Problem Description
A PPO agent is trained on the gym.make('CartPole-v0') environment.
Parameters are as follows:
hidden_units = 50
layers = 3
learning_rate = 0.001 # critic and actor share the same learning rate
max_train_episodes = int(1e4)
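As a hypothetical sketch (the original network code is not shown), these parameters imply actor and critic MLPs of 3 hidden layers with 50 units each; the layer shapes can be derived as follows. CartPole-v0 has a 4-dimensional observation and 2 discrete actions.

```python
# Sketch of the network dimensions implied by the parameters above.
# HIDDEN_UNITS, LAYERS and the helper below are assumptions for illustration.
HIDDEN_UNITS = 50
LAYERS = 3
LEARNING_RATE = 1e-3   # shared by actor and critic

def mlp_shapes(in_dim, out_dim, hidden=HIDDEN_UNITS, layers=LAYERS):
    """Weight-matrix shapes for an MLP with `layers` hidden layers."""
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return list(zip(dims[:-1], dims[1:]))

# CartPole-v0: 4-dim observation; 2 discrete actions
actor_shapes = mlp_shapes(4, 2)    # policy head: action logits
critic_shapes = mlp_shapes(4, 1)   # value head: scalar V(s)
print(actor_shapes)   # → [(4, 50), (50, 50), (50, 50), (50, 2)]
print(critic_shapes)  # → [(4, 50), (50, 50), (50, 50), (50, 1)]
```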
During training the agent steadily improves: the average episode reward climbs by about 50 steps. However, the critic loss and actor loss curves in TensorBoard show no downward trend.
Cause Analysis
As training progresses, the data in the buffer keeps changing: each batch is collected under the current, constantly updating policy. The actor and critic are therefore trained on a non-stationary dataset. This is unlike supervised learning on a fixed dataset, so the losses need not show a downward trend even though the policy is improving.
Reference:
https://stackoverflow.com/questions/47036246/dqn-q-loss-not-converging