[5-Minute Paper] Asynchronous Methods for Deep Reinforcement Learning

  • Paper title: Asynchronous Methods for Deep Reinforcement Learning

Paper title and author information

What problem does it solve?

  In reinforcement learning, the data observed by an online agent is non-stationary and strongly correlated in time. Storing transitions in a replay memory reduces the non-stationarity and decorrelates the updates, but this restricts such methods to off-policy RL algorithms and adds extra memory and computation.
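  For reference, the replay-memory approach mentioned above looks roughly like the minimal sketch below. This is background rather than the paper's contribution, and the class name, capacity, and batch size are illustrative choices, not taken from any particular implementation.

```python
# Minimal illustration of an experience replay buffer (the mechanism the
# asynchronous methods replace); names and sizes are arbitrary.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the temporal correlation
        # between consecutive transitions
        return random.sample(self.buffer, batch_size)
```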

  Instead, the authors run multiple agents in parallel on separate instances of the environment. This decorrelates the agents' combined data into a more stationary process, and it also allows on-policy algorithms to be used.

Background

  Prior to this work there were several related efforts. For example, in the General Reinforcement Learning Architecture (Gorila), actors on multiple machines interact with the environment and store their samples in a replay memory; learners draw data from the replay memory and compute gradients of the DQN loss, but instead of applying these gradients themselves, they send them asynchronously to a central parameter server, which updates a central copy of the model. The updated policy parameters are sent back to the actor-learners at fixed intervals (each learner's target network is also periodically refreshed from the central parameter server).

Gorila network structure

  Other work introduced the MapReduce framework to speed up large matrix operations (not to parallelize the collection of experience). There is also work in which learners share parameter information with each other through communication.

What method does it use?

  The authors' method is similar to the Gorila framework, but instead of multiple machines and a parameter server it uses multiple actor-learner threads running on a single multi-core machine, one actor-learner per thread. Because the parallel actor-learners explore different parts of the environment, the data they collect is more diverse, and the online gradients from several updates are accumulated before being applied, which effectively breaks the correlation between successive samples. The authors therefore do not need a replay memory; instead, each actor-learner uses a different exploration policy, which also means on-policy reinforcement learning algorithms such as Sarsa can be used. Applying the idea to Q-learning gives the following pseudocode for a single learner thread:

 one-step Q-learning algorithm pseudocode
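  Since the pseudocode figure is not reproduced here, the following is a minimal, self-contained sketch of what a single asynchronous one-step Q-learning thread roughly does. Everything in it is an illustrative assumption: a toy tabular ChainEnv stands in for Atari, a Q-table stands in for the network weights, and the hyperparameter names and values are made up. The paper's implementation uses neural-network function approximation and Hogwild!-style lock-free shared updates; the lock below is only to keep the toy example simple.

```python
import threading
import numpy as np

# --- shared (global) quantities ---------------------------------------------
N_STATES, N_ACTIONS = 8, 2
GAMMA, LR = 0.99, 0.1
TARGET_SYNC, ASYNC_UPDATE = 200, 5       # roughly I_target and I_AsyncUpdate

theta = np.zeros((N_STATES, N_ACTIONS))  # shared Q-table (stands in for network weights)
theta_target = theta.copy()              # shared target parameters
global_T = [0]                           # shared global step counter T
lock = threading.Lock()

class ChainEnv:
    """Toy chain MDP: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = int(np.clip(self.s + (1 if a == 1 else -1), 0, N_STATES - 1))
        done = self.s == N_STATES - 1
        return self.s, float(done), done

# --- one actor-learner thread -------------------------------------------------
def worker(epsilon, n_steps=20_000):
    global theta, theta_target
    env, grad, t = ChainEnv(), np.zeros_like(theta), 0
    s = env.reset()
    for _ in range(n_steps):
        # each thread uses a different epsilon, i.e. a different exploration policy
        if np.random.rand() < epsilon:
            a = np.random.randint(N_ACTIONS)
        else:
            a = int(np.argmax(theta[s]))
        s2, r, done = env.step(a)
        # one-step Q-learning target computed from the shared *target* parameters
        y = r if done else r + GAMMA * np.max(theta_target[s2])
        grad[s, a] += y - theta[s, a]      # accumulate the TD error (gradient of the squared loss)
        s = env.reset() if done else s2
        t += 1
        with lock:
            global_T[0] += 1
            if global_T[0] % TARGET_SYNC == 0:
                theta_target = theta.copy()   # periodically refresh the target parameters
        if t % ASYNC_UPDATE == 0 or done:
            with lock:
                theta += LR * grad            # apply the accumulated update asynchronously
            grad[:] = 0.0

# --- launch several actor-learners with different exploration rates -----------
threads = [threading.Thread(target=worker, args=(eps,)) for eps in (0.5, 0.3, 0.1, 0.05)]
for th in threads: th.start()
for th in threads: th.join()
print("greedy policy per state:", np.argmax(theta, axis=1))
```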

  For the actor-critic framework, the pseudocode for a single learner thread is as follows:

A3C algorithm pseudocode
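  The A3C pseudocode image is also missing, so here is a rough NumPy sketch of the core per-segment A3C update: n-step returns, a policy gradient weighted by the advantage, a value-regression step, and an entropy bonus. It uses a tabular softmax policy rather than a neural network, and all names and constants (BETA, LR_PI, and so on) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

GAMMA, LR_PI, LR_V, BETA = 0.99, 0.1, 0.1, 0.01   # BETA weighs the entropy bonus

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def a3c_segment_update(policy_logits, values, states, actions, rewards, bootstrap):
    """Accumulate and apply the gradients for one t_max-step rollout segment.

    policy_logits : (n_states, n_actions) table of actor parameters (shared)
    values        : (n_states,) table of critic estimates V(s) (shared)
    bootstrap     : V(s_{t_max}) if the segment ended in a non-terminal state, else 0.0
    """
    d_logits = np.zeros_like(policy_logits)
    d_values = np.zeros_like(values)
    R = bootstrap
    # walk the segment backwards, forming n-step returns R_t = r_t + gamma * R_{t+1}
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r + GAMMA * R
        advantage = R - values[s]
        pi = softmax(policy_logits[s])
        # actor: gradient of log pi(a|s) w.r.t. the logits is onehot(a) - pi
        grad_logp = -pi
        grad_logp[a] += 1.0
        d_logits[s] += grad_logp * advantage
        # entropy bonus (encourages exploration): dH/dlogits = -pi * (log pi + H)
        H = -(pi * np.log(pi + 1e-8)).sum()
        d_logits[s] += BETA * (-pi * (np.log(pi + 1e-8) + H))
        # critic: gradient step on 0.5 * (R - V(s))^2
        d_values[s] += advantage
    # in A3C these accumulated gradients would be applied asynchronously to the
    # shared global parameters; here we simply apply them in place
    policy_logits += LR_PI * d_logits
    values += LR_V * d_values

# made-up usage on a 4-state, 2-action toy problem
logits, V = np.zeros((4, 2)), np.zeros(4)
a3c_segment_update(logits, V, states=[0, 1, 2], actions=[1, 1, 1],
                   rewards=[0.0, 0.0, 1.0], bootstrap=0.0)
```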

What results does it achieve?

  The required computing resources are much smaller: a single multi-core CPU is enough for training. The paper compares the learning speed of DQN trained on an Nvidia K40 GPU with the asynchronous methods trained using 16 CPU cores on five Atari 2600 games.

Comparison of experimental results

  For the robustness analysis, please refer to the original paper; I will not repeat it here. In the discussion, the authors emphasize that experience replay is not a bad thing in itself: combining it with these asynchronous methods could improve data efficiency and might further improve performance.

Publication and author information

  The paper was published at ICML 2016. The first author, Volodymyr Mnih, received his Ph.D. in machine learning from the University of Toronto under Geoffrey Hinton and is a researcher at Google DeepMind. He completed his master's degree at the University of Alberta under Csaba Szepesvari.

Volodymyr Mnih

References

  1. The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals.
  • Reference: Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, Maria, Alessandro De, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, Legg, Shane, Mnih, Volodymyr, Kavukcuoglu, Koray, and Silver, David. Massively parallel methods for deep reinforcement learning. In ICML Deep Learning Workshop, 2015.
  2. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).
  • Reference: Chavez, Kevin, Ong, Hao Yi, and Hong, Augustus. Distributed deep Q-learning. Technical report, Stanford University, June 2015.
  3. In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning.
  • Reference: Li, Yuxi and Schuurmans, Dale. MapReduce for parallel reinforcement learning. In Recent Advances in Reinforcement Learning - 9th European Workshop, EWRL 2011, Athens, Greece, September 9-11, 2011, Revised Selected Papers, pp. 309–320, 2011.
  4. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.
  • Reference: Grounds, Matthew and Kudenko, Daniel. Parallel reinforcement learning with linear function approximation. In Proceedings of the 5th, 6th and 7th European Conference on Adaptive and Learning Agents and Multi-agent Systems: Adaptation and Multi-agent Learning, pp. 60–74. Springer-Verlag, 2008.

Further reading

  Critic-based methods, which rely on value estimation, are widely used in many fields, but they have shortcomings that limit their application, such as:

  1. They are difficult to apply to stochastic policies and continuous action spaces.
  2. A small change in the value function can cause a large change in the policy, which can prevent training from converging. In particular, after introducing function approximation (FA), generalization improves but bias is also introduced, making convergence even harder to guarantee.

  Actor-based methods learn the policy directly by parameterizing it. The advantage is better convergence properties than the former, and they are well suited to high-dimensional or continuous action spaces and stochastic policies. The disadvantages are that the gradient estimates have relatively high variance and that training can easily converge to a non-optimal solution. In addition, because each gradient estimate does not depend on previous estimates, old information cannot be fully reused.
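  To make the variance point concrete, here are the two gradients in their standard textbook form (generic notation, not the paper's): REINFORCE weights the score function by the full Monte-Carlo return R_t, while an actor-critic subtracts the critic's value estimate as a baseline, which keeps the gradient unbiased but lowers its variance.

```latex
% REINFORCE: Monte-Carlo policy gradient (high variance)
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\right]

% Actor-critic: the critic V_w(s_t) serves as a baseline; A3C uses an
% n-step estimate of the advantage R_t - V_w(s_t)
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(R_t - V_w(s_t)\bigr)\right]
```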

  The actor-critic (AC) architecture itself dates back 30 to 40 years. Witten first proposed a similar algorithm in 1977, and around 1983 Barto, Sutton, and Anderson introduced the actor-critic architecture. However, because AC algorithms were difficult to analyze, and due to some historical accidents, the research community shifted its focus to value-based methods. For a while, value-based and policy-based methods each developed vigorously: the former is the classic family of TD methods, including Sarsa and Q-learning; the latter includes the classic REINFORCE algorithm. Later, AC algorithms combined the advances of both lines, and their theory and practice made great progress again. In the deep learning (DL) era, AC methods combined with DNN function approximation produced a batch of advanced algorithms such as DDPG and A3C, along with further improvements and variants built on them. As you can see, it is a story of splitting apart and coming back together.

  • Reference article : http://www.voidcn.com/article/p-mihgmljj-wy.html

My WeChat official account: Deep Learning and Advanced Intelligent Decision-Making
WeChat official account ID: MultiAgent1024
Account description: Mainly shares deep learning, computer game playing, reinforcement learning, and related content. Welcome to follow, learn, and make progress together!



Origin blog.csdn.net/weixin_39059031/article/details/104572749