[Reinforcement learning paper notes (6)]: A3C

Asynchronous Methods for Deep Reinforcement Learning

Paper link:

A3C

Notes

Point of departure:

The state data an online agent observes is non-stationary, and consecutive observations are highly correlated.

DQN handles this with experience replay: transitions are stored in a buffer and random minibatches are sampled from it, which breaks the correlation and makes RL training look much more like ordinary supervised deep learning.
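To make that concrete, here is a minimal sketch of the experience-replay idea (the class and parameter names are my own, not from either paper): transitions go into a fixed-size buffer, and training draws uniform random minibatches from it.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; random sampling breaks temporal correlation."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch, DQN-style; list() keeps sampling simple.
        return random.sample(list(self.buffer), batch_size)
```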

A3C naturally raises some criticisms of experience replay:

  • It costs more memory and more computation.
  • It forces learning from off-policy data generated by one or more older policies. (Though this is not necessarily a drawback; DQN counts being off-policy among its merits.)

A3C is asynchronous, multi-threaded Actor-Critic (AC; see the earlier notes in this series). Everyone loves multithreading: each agent happily plays in its own thread and then updates the globally shared parameters. Although each thread is effectively online and on-policy, every update draws on data from many parallel threads that is largely uncorrelated, which is very good for training.

However, when updating, it completely disregards the thread safety, locks, and so on that we worked so hard to learn in class: no locking is done at all. In practice, this turns out not to be a problem.
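Below is a minimal, runnable sketch of this structure, not the paper's actual implementation: several Python threads share one parameter vector and write updates to it without any lock. The environment interaction and the actor-critic gradient are replaced by a toy `pseudo_gradient` stand-in (my own name), and a real Python implementation would more likely use multiprocessing because of the GIL; the point here is only the lock-free shared-parameter pattern.

```python
import threading
import numpy as np

# Globally shared parameter vector; every worker updates it in place, without locks.
shared_theta = np.zeros(4)

def pseudo_gradient(theta, rng):
    """Stand-in for 'run the local actor-critic for a few steps and compute a gradient'.
    A real A3C worker would interact with its own copy of the environment and form
    n-step policy/value gradients; here we return a noisy pull toward zero so the
    script runs on its own."""
    return -theta + rng.normal(scale=0.1, size=theta.shape)

def worker(worker_id, steps=2000, lr=0.01):
    rng = np.random.default_rng(worker_id)           # each thread behaves differently
    for _ in range(steps):
        grad = pseudo_gradient(shared_theta, rng)    # read the (possibly stale) shared params
        shared_theta[:] = shared_theta + lr * grad   # lock-free write back to shared memory

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final shared parameters:", shared_theta)
```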

A3C naturally puts forward its own advantages as the big news:

  • It saves computing hardware: training runs on CPU cores rather than a GPU (the paper compares against a K40 GPU).
  • On-policy algorithms can also benefit from this approach.
  • Experience replay is no longer needed.
  • Different agents are likely to explore differently, which gives natural "exploration" diversity (see the sketch after this list).
  • Training time and the number of threads are in a roughly inverse-linear relationship.
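As a small illustration of the exploration point above, each worker can draw its own exploration rate when it starts, so the threads behave differently by construction. The candidate values and probabilities below are placeholders, not necessarily the exact ones used in the paper.

```python
import random

def sample_final_epsilon(rng=random):
    # Each learner thread gets its own final epsilon, so the set of threads as a
    # whole covers the environment more diversely than a single agent would.
    return rng.choices([0.1, 0.01, 0.5], weights=[0.4, 0.3, 0.3], k=1)[0]

per_thread_epsilon = [sample_final_epsilon() for _ in range(16)]
print(per_thread_epsilon)
```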

Source: www.cnblogs.com/Lzqayx/p/12141966.html