- Paper title: Asynchronous Methods for Deep Reinforcement Learning
The problem solved?
In reinforcement learning, the data observed by an agent is non-stationary and strongly correlated. An experience replay memory can reduce the non-stationarity and decorrelate the updates, but it restricts these methods to off-policy RL algorithms and adds extra memory and computation. Instead, the authors run multiple agents in parallel on separate instances of the environment, which decorrelates the agents' data into a more stationary process, and also allows on-policy algorithms to be used.
Background
Prior to this work there were several related studies, for example the General Reinforcement Learning Architecture (Gorila) (Nair et al., 2015). In Gorila, actors on multiple machines interact with their own environment instances and store samples in a replay memory; learners sample data from the replay memory and compute gradients of the loss defined by the DQN algorithm. The gradients are not used to update the learners' parameters directly; instead they are asynchronously sent to a central parameter server, which updates a central copy of the model. The updated policy parameters are sent back to the actor-learners at fixed intervals (each learner's target network is also refreshed from the central parameter server).
Other research introduced the MapReduce framework to speed up the matrix operations of batch RL methods (not to parallelize the collection of experience). There is also work in which parallel learners share parameter information with one another through peer-to-peer communication.
The method used?
The author's method is similar to the Gorila framework, but instead of using multiple machines and a parameter server, it runs multiple actor-learner threads on the CPU of a single machine, one actor-learner per thread. The parallel actor-learners observe different parts of the environment, so the data they sample is much richer, and the gradients from several learner updates are accumulated before being applied, which effectively cuts the correlation between the data. The author therefore does not use a replay memory, and instead gives each learner a different exploration policy, so this approach can also use on-policy reinforcement learning algorithms such as Sarsa. Applied to Q-learning, the pseudocode for a single learner thread is obtained (a sketch is given below):
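The original post embedded the paper's pseudocode as an image, which did not survive extraction. As a stand-in, here is a minimal runnable Python sketch of the same structure: per-thread actor-learners, updates accumulated and applied to shared parameters every few steps, a shared target network, and a different exploration rate per thread. The tabular Q-function, the toy `ChainEnv`, and all hyperparameter values are my own simplifying assumptions, not the paper's.

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 8, 2
GAMMA, ALPHA = 0.99, 0.1
T_MAX = 20000          # global step budget shared by all threads
ASYNC_UPDATE = 5       # apply accumulated updates every few steps
TARGET_SYNC = 1000     # refresh the target network periodically

Q = np.zeros((N_STATES, N_ACTIONS))   # shared parameters
Q_target = Q.copy()                   # shared target network
global_t = 0
lock = threading.Lock()

class ChainEnv:
    """Toy chain: move left/right, reward 1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else min(N_STATES - 1, self.s + 1)
        done = self.s == N_STATES - 1
        return self.s, (1.0 if done else 0.0), done

def actor_learner(epsilon):
    """One thread: its own env instance and its own exploration rate."""
    global global_t, Q, Q_target
    env = ChainEnv()
    s = env.reset()
    dQ = np.zeros_like(Q)             # accumulated update
    local_t = 0
    while True:
        with lock:
            if global_t >= T_MAX:
                return
            global_t += 1
        # epsilon-greedy action with respect to the shared Q
        a = np.random.randint(N_ACTIONS) if np.random.rand() < epsilon \
            else int(np.argmax(Q[s]))
        s2, r, done = env.step(a)
        # one-step Q-learning target, bootstrapped from the target network
        y = r if done else r + GAMMA * np.max(Q_target[s2])
        dQ[s, a] += ALPHA * (y - Q[s, a])
        s = env.reset() if done else s2
        local_t += 1
        if local_t % ASYNC_UPDATE == 0:
            with lock:                # asynchronous update of shared Q
                Q += dQ
            dQ[:] = 0.0
        if local_t % TARGET_SYNC == 0:
            with lock:
                Q_target = Q.copy()

# Each thread uses a different exploration rate, which is what
# decorrelates the data across actor-learners.
threads = [threading.Thread(target=actor_learner, args=(eps,))
           for eps in (0.5, 0.3, 0.1, 0.05)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("learned greedy actions:", Q.argmax(axis=1))
```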
For the actor-critic framework, the pseudocode for a single learner thread is as follows (again, a sketch is given below):
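The pseudocode image is again missing, so below is a minimal runnable sketch of the n-step advantage actor-critic inner loop with tabular policy and value parameters. For brevity it runs in a single thread; in the full A3C the same loop runs in several threads that accumulate `d_theta` and `d_V` and apply them asynchronously to shared parameters, exactly as in the Q-learning sketch above. The toy environment and all hyperparameters are again my own assumptions.

```python
import numpy as np

N_STATES, N_ACTIONS = 8, 2
GAMMA, LR, T_MAX, N_STEPS = 0.99, 0.05, 5000, 5

theta = np.zeros((N_STATES, N_ACTIONS))   # policy (actor) parameters
V = np.zeros(N_STATES)                    # value (critic) parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ChainEnv:
    """Same toy chain as above: reward 1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else min(N_STATES - 1, self.s + 1)
        done = self.s == N_STATES - 1
        return self.s, (1.0 if done else 0.0), done

env = ChainEnv()
s = env.reset()
t = 0
while t < T_MAX:
    # roll out up to N_STEPS transitions with the current policy
    traj, done = [], False
    for _ in range(N_STEPS):
        a = int(np.random.choice(N_ACTIONS, p=softmax(theta[s])))
        s2, r, done = env.step(a)
        traj.append((s, a, r))
        s = env.reset() if done else s2
        t += 1
        if done:
            break
    # n-step return, bootstrapped from the critic at the last state
    R = 0.0 if done else V[s]
    d_theta, d_V = np.zeros_like(theta), np.zeros_like(V)
    for (si, ai, ri) in reversed(traj):
        R = ri + GAMMA * R
        adv = R - V[si]                    # advantage estimate
        # accumulate policy gradient: grad log pi(a|s) * advantage
        grad_log = -softmax(theta[si])
        grad_log[ai] += 1.0
        d_theta[si] += LR * adv * grad_log
        d_V[si] += LR * adv                # critic moves toward R
    theta += d_theta                       # in A3C: async shared update
    V += d_V
print("greedy policy:", theta.argmax(axis=1))
```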
The effect achieved?
The computing resources required are modest: a single multi-core CPU suffices for training. The paper compares the learning speed of the DQN algorithm trained on an Nvidia K40 GPU with the asynchronous methods trained using 16 CPU cores on five Atari 2600 games.
For the robustness analysis, refer to the original paper; it is not repeated here. In the discussion, the authors emphasize that experience replay is not inherently bad: incorporating it into the asynchronous framework could improve sample efficiency and possibly the final results.
Where published? Author information?
This paper appeared at ICML 2016. The first author, Volodymyr Mnih, holds a Ph.D. in machine learning from the University of Toronto under Geoffrey Hinton, and is a researcher at Google DeepMind. He obtained his Master's degree at the University of Alberta under Csaba Szepesvari.
Reference links
- The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals.
- Reference: Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, Maria, Alessandro De, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, Legg, Shane, Mnih, Volodymyr, Kavukcuoglu, Koray, and Silver, David. Massively parallel methods for deep reinforcement learning. In ICML Deep Learning Workshop. 2015.
- We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).
- Reference: Chavez, Kevin, Ong, Hao Yi, and Hong, Augustus. Distributed deep q-learning. Technical report, Stanford University, June 2015.
- In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning.
- Reference: Li, Yuxi and Schuurmans, Dale. Mapreduce for parallel reinforcement learning. In Recent Advances in Reinforcement Learning - 9th European Workshop, EWRL 2011, Athens, Greece, September 9-11, 2011, Revised Selected Papers, pp. 309–320, 2011.
- (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.
- Reference: Grounds, Matthew and Kudenko, Daniel. Parallel reinforcement learning with linear function approximation. In Proceedings of the 5th, 6th and 7th European Conference on Adaptive and Learning Agents and Multi-agent Systems: Adaptation and Multi-agent Learning, pp. 60–74. Springer-Verlag, 2008.
Further reading
Critic methods, based on value estimation, are widely used in many fields, but they have shortcomings that limit their application. For example:
- They are difficult to apply to stochastic policies and continuous action spaces. A small change in the value function can cause a large change in the policy, which can prevent training from converging. In particular, after introducing function approximation (FA), although the generalization ability of the algorithm improves, bias is also introduced, making convergence even harder to guarantee.

Actor methods instead learn the policy directly by parameterizing it. The advantage is better convergence properties than the former, and suitability for high-dimensional or continuous action spaces and stochastic policies. The disadvantages include relatively high variance in the gradient estimates and a tendency to converge to non-optimal solutions. In addition, because each gradient estimate does not depend on the previous ones, old information cannot be fully reused. The standard policy-gradient expressions below make this variance point concrete.
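For reference, these are the standard textbook expressions (not specific to this paper): REINFORCE uses the Monte-Carlo return $G_t$ directly, while actor-critic subtracts a learned baseline $V(s_t)$, which reduces variance without adding bias.

```latex
% REINFORCE: unbiased but high-variance gradient estimate
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]

% Actor-critic: the critic V(s_t) acts as a baseline; the advantage
% estimate (G_t - V(s_t)) lowers the variance while keeping the
% estimator unbiased
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - V(s_t)\bigr)\right]
```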
As for the actor-critic (AC) algorithm, its architecture can be traced back thirty or forty years. Witten first proposed a similar algorithm in 1977, and around 1983 Barto, Sutton, and Anderson introduced the actor-critic architecture. However, because AC algorithms were difficult to study, together with some historical accidents, the research community shifted its focus to value-based methods. For a while, value-based and policy-based methods both developed vigorously: the former is typified by TD methods, with the classical Sarsa, Q-learning, etc. belonging to this line; the latter includes the classic REINFORCE algorithm. Later, the AC algorithm combined the dividends of both lines of development, and its theory and practice made great progress again. In the era of deep learning (DL), AC methods combined with DNNs as function approximators (FA) produced a chemical reaction, giving rise to a batch of advanced algorithms such as DDPG and A3C, along with further improvements and variants based on them. As you can see, this is a success story of splitting apart and coming back together.
- Reference article: http://www.voidcn.com/article/p-mihgmljj-wy.html
My WeChat official account name: Deep Learning and Advanced Intelligent Decision-Making
WeChat official account ID: MultiAgent1024
Account description: Mainly shares deep learning, computer games, reinforcement learning, and related content! Looking forward to your follow; welcome to learn and make progress together!