A3C (Asynchronous Advantage Actor-Critic) algorithm

  • Recall that the DQN algorithm uses experience replay to help training converge. Can Actor-Critic use experience replay as well? Of course it can! But A3C goes a step further and also fixes some of the problems of experience replay. What problems? The samples in the replay buffer can be too strongly correlated, and training on them may then work poorly. It is like learning chess by always playing against the same person: you can improve up to a point, but beyond that it becomes hard, and the best way forward is to find other masters to play against.
  •  A3C follows the same idea. It runs multiple threads, each interacting with its own copy of the environment and learning at the same time. Each thread periodically pushes its learning results to a shared global model, and periodically pulls everyone's combined results back from that shared model to guide its own subsequent interaction with the environment (a minimal sketch of this worker pattern appears after the training steps below).
  • In this way, A3C avoids the problem of overly correlated samples in experience replay, while training the model concurrently and asynchronously.
  •  Here a neural network is used as the function approximator; in principle the same functions could also be approximated with a linear model, kernel methods, and so on.
  •  Actor (player): To play the game and collect as much reward as possible, we need a function that takes the state as input and outputs an action. A neural network can be used to approximate this function; the remaining task is to train the network so that it performs better (earns higher reward). This network is called the actor.
  •  Critic (judge): To train the actor, we need to know how well the actor is doing, so that the actor network's parameters can be adjusted according to its performance. This is where reinforcement learning's Q-value comes in. The Q-value is itself an unknown function, so it too can be approximated with a neural network. This network is called the critic (a minimal sketch of both networks follows this list).
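To make the two roles concrete, below is a minimal sketch of what the two networks might look like. PyTorch, the layer sizes, and the names (ActorNet, CriticNet, state_dim, n_actions, hidden) are choices made here for illustration and do not come from the original post; note also that although the text speaks of a Q-value, the A3C critic is usually implemented as a state-value function V(s), which is what this sketch estimates.

```python
import torch.nn as nn

class ActorNet(nn.Module):
    """Actor (player): maps a state to a probability for each possible action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)        # action probabilities (the policy)


class CriticNet(nn.Module):
    """Critic (judge): maps a state to a single score, the estimated value V(s)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)        # scalar value estimate
```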

  • Actor-Critic training (a minimal sketch of one update step follows these steps):
  1.        The actor observes the current state of the game and takes an action.
  2.        The critic looks at the state and the action just taken and gives the actor's move a score.
  3.        The actor adjusts its policy (the parameters of the actor network) according to the critic's score, trying to do better next time.
  4.        The critic adjusts its scoring policy (the parameters of the critic network) according to the reward given by the system (which acts as the ground truth) and the scores of other critics (the critic target).
  5.        At first the actor acts randomly and the critic scores randomly, but because of the reward signal the critic's scores become more and more accurate, and the actor's performance gets better and better.
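Putting the five steps together, here is a minimal sketch of a single actor-critic update, assuming the ActorNet/CriticNet classes sketched above and optimizers built over their parameters (e.g. torch.optim.Adam); the function name ac_update, the one-step TD target, and the discount gamma are illustrative choices rather than the post's exact procedure.

```python
import torch
import torch.nn.functional as F

def ac_update(actor, critic, actor_opt, critic_opt,
              state, action, reward, next_state, done, gamma=0.99):
    """One actor-critic update for a single (state, action, reward, next_state) transition."""
    s = torch.as_tensor(state, dtype=torch.float32)
    s_next = torch.as_tensor(next_state, dtype=torch.float32)

    # Step 4: the critic adjusts its scoring using the environment reward
    # (the "ground truth") through a one-step TD target.
    value = critic(s)
    with torch.no_grad():
        target = reward + gamma * critic(s_next) * (0.0 if done else 1.0)
    critic_loss = F.mse_loss(value, target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 3: the actor adjusts its policy in the direction the critic's
    # score favours; here the score is the advantage (target - value).
    advantage = (target - value).detach()
    log_prob = torch.log(actor(s)[action])
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```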

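Finally, a minimal sketch of the asynchronous worker pattern described earlier: several threads, each with its own environment and its own local copies of the two networks, pulling the shared parameters before acting and pushing gradients back to the shared networks afterwards. The lock, the make_env factory, the classic Gym-style reset()/step() API, and the single-step updates are all simplifying assumptions for illustration; the original A3C accumulates gradients over several steps and applies them to shared memory without a lock (Hogwild-style).

```python
import copy
import threading

import torch
import torch.nn.functional as F

def worker(global_actor, global_critic, actor_opt, critic_opt,
           lock, make_env, n_steps=10_000, gamma=0.99):
    """One A3C worker thread with its own environment and local network copies."""
    local_actor = copy.deepcopy(global_actor)
    local_critic = copy.deepcopy(global_critic)
    env = make_env()                      # hypothetical factory: one env per worker
    state = env.reset()

    for _ in range(n_steps):
        # Pull: fetch the latest shared parameters before interacting.
        with lock:
            local_actor.load_state_dict(global_actor.state_dict())
            local_critic.load_state_dict(global_critic.state_dict())

        s = torch.as_tensor(state, dtype=torch.float32)
        probs = local_actor(s)
        action = torch.multinomial(probs, 1).item()
        next_state, reward, done, _ = env.step(action)   # classic Gym-style API assumed

        # Local losses, same idea as ac_update above.
        value = local_critic(s)
        with torch.no_grad():
            target = reward + gamma * local_critic(
                torch.as_tensor(next_state, dtype=torch.float32)) * (0.0 if done else 1.0)
        advantage = (target - value).detach()
        loss = F.mse_loss(value, target) - torch.log(probs[action]) * advantage

        # Push: gradients are computed on the local copies but applied to the
        # shared global networks, so every thread contributes to one model.
        local_actor.zero_grad()
        local_critic.zero_grad()
        loss.backward()
        with lock:
            for g, l in zip(global_actor.parameters(), local_actor.parameters()):
                g.grad = l.grad
            for g, l in zip(global_critic.parameters(), local_critic.parameters()):
                g.grad = l.grad
            actor_opt.step()    # optimizers must be built over the *global* parameters
            critic_opt.step()

        state = env.reset() if done else next_state

# Usage sketch: one pair of shared networks and optimizers, several worker threads.
# lock = threading.Lock()
# workers = [threading.Thread(target=worker,
#                             args=(actor, critic, actor_opt, critic_opt, lock, make_env))
#            for _ in range(4)]
# for t in workers:
#     t.start()
```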
 
