Deep Reinforcement Learning with Policy Gradients (Part 2): DDPG


   Before discussing DDPG, recall that we previously applied DQN to play Atari games. However, those tasks take place in discrete environments with a limited number of actions. Now consider an environment with a continuous action space, such as training a robot to walk. In this setting we cannot use Q-learning directly, because the greedy policy would require solving an expensive optimization over actions at every time step. Even if we discretize the continuous action space, we may lose important features of the problem and still end up with a huge action space, in which case it is difficult to ensure convergence.
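A rough back-of-the-envelope sketch of the discretization problem: binning each dimension of a continuous action vector multiplies the number of discrete actions. The dimension count and bin size below are made-up illustrative numbers, not values from the original text.

```python
# Discretizing a continuous action space blows up combinatorially.
# Hypothetical example: a 7-joint robot where each joint torque is split into 10 bins.
action_dims = 7      # assumed number of continuous action dimensions
bins_per_dim = 10    # assumed discretization granularity per dimension

num_discrete_actions = bins_per_dim ** action_dims
print(num_discrete_actions)  # 10,000,000 actions to compare at every greedy step
```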

   To handle this, we use a new architecture called the actor-critic architecture, which consists of two networks: an actor network and a critic network. The actor-critic architecture combines the policy gradient with the state-action value function. The actor network's behavior is determined by its adjustable parameters $\theta$.
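A minimal sketch of the two networks, assuming PyTorch; the layer sizes and environment dimensions are illustrative choices, not taken from the original text. The actor maps a state to a continuous action (its weights are the parameters $\theta$), and the critic scores that state-action pair.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action; its weights are theta."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash output to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """State-action value function Q(s, a): evaluates the action chosen by the actor."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Usage: the critic evaluates the actor's chosen action; during training the actor's
# parameters theta are adjusted to increase the critic's estimate (the policy gradient).
actor = Actor(state_dim=3, action_dim=1)
critic = Critic(state_dim=3, action_dim=1)
state = torch.randn(1, 3)
q_value = critic(state, actor(state))
```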


Source: blog.csdn.net/weixin_43283397/article/details/105144144