1 Introduction to A3C
The full name of A3C is Asynchronous Advantage Actor-Critic. As the name suggests, it builds on the Actor-Critic architecture (for a refresher, see [Reinforcement Learning] Actor-Critic (actor-critic) algorithm detailed explanation). To train one Actor-Critic pair, A3C makes several copies of it and places them in separate parallel environments, letting each copy play its own episodes. Each copy then reports back to the central Actor-Critic how things are going in its world and which experiences are worth sharing, and the central network aggregates the experience from all of the copies. In this way, an efficient reinforcement learning method is formed.
Most current computers have 2, 4, or even 6 or 8 cores. Ordinary learning methods only let the agent train on a single core, but with A3C we can assign the workers to different cores and run them in parallel. Experimentally, this is often many times faster than the traditional single-threaded approach.
2 Asynchronous Advantage Actor-Critic (A3C) Detailed Explanation
2.1 Main points
One sentence summarizes A3C: an algorithm proposed by Google DeepMind to address the convergence problems of Actor-Critic. It creates multiple parallel environments and lets multiple agents, each with its own local copy of the network, simultaneously update the parameters of a shared main (global) network. The parallel agents do not interfere with each other, and because the global network receives its updates asynchronously from these independent local copies, consecutive updates are less correlated, which improves convergence.
2.2 Algorithm
A3C's algorithm essentially runs Actor-Critic in multiple threads for asynchronous training. Imagine several people playing the same game at the same time: their gameplay experience is uploaded to a central brain, and from time to time they download the newest way to play from that central brain.
For these players, the benefit is clear: the central brain pools everyone's experience and is therefore the best player of all, and each player can periodically fetch its latest tricks and use them in their own game.
The benefit for the central brain is this: what it fears most is a continuous stream of correlated updates from a single player, and receiving asynchronous updates from many independent players breaks that continuity. The central brain can therefore be updated stably without a replay memory of the kind used by DQN and DDPG.
To achieve this, we need two kinds of networks. The central brain holds the global net and its parameters. Each player holds a copy of the global net, the local net, which periodically pushes its updates to the global net and periodically pulls the latest aggregated parameters back from it.
Viewed in TensorBoard, the system we want to build looks like this:
W_0 is the 0th worker, and every worker shares the global_net.
If the pull operation in sync is called, the worker gets the latest parameters from global_net.
If we call the push operation in sync, the worker pushes its own updates to global_net.
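The pull/push idea can be sketched without any deep-learning framework. The following is a minimal toy, assuming the parameters are plain NumPy arrays; the class and method names here are illustrative, not Mofan's actual TensorFlow ops.

```python
import numpy as np

class GlobalNet:
    def __init__(self, n_params=4, lr=0.1):
        self.params = np.zeros(n_params)   # the central brain's parameters
        self.lr = lr

class LocalNet:
    def __init__(self, global_net):
        self.global_net = global_net
        self.params = global_net.params.copy()

    def pull(self):
        # pull: overwrite local parameters with the latest global ones
        self.params = self.global_net.params.copy()

    def push(self, grads):
        # push: apply locally computed gradients to the *global* parameters
        self.global_net.params -= self.global_net.lr * grads

g = GlobalNet()
w = LocalNet(g)
w.push(np.ones(4))   # the worker's gradient moves the global net ...
w.pull()             # ... and pull copies the result back locally
print(w.params)      # -> [-0.1 -0.1 -0.1 -0.1]
```

Note that push updates the global parameters, not the local ones; the local copy only changes when the worker pulls.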
2.3 Main structure
We use TensorFlow to build the neural network. For our Actor, TensorBoard shows clearly how it is built:
We use a Normal distribution to select actions, so the actor side of the network needs to output the mean and variance of the action, which are then fed into a Normal distribution to sample the action. When computing the actor loss, we also need the TD error provided by the critic as the guide for gradient ascent.
The critic is very simple: it only needs to output the value of the state, which is used to compute the TD error.
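For a single transition (s, a, r, s'), the way the critic's TD error drives both losses can be sketched in plain NumPy. This is an illustrative calculation with made-up values, not Mofan's TensorFlow graph.

```python
import numpy as np

GAMMA = 0.9

def normal_log_prob(a, mu, sigma):
    # log-density of Normal(mu, sigma) evaluated at action a
    return -0.5 * np.log(2 * np.pi * sigma**2) - (a - mu)**2 / (2 * sigma**2)

# pretend outputs for one step
v_s, v_s_ = 0.5, 1.0    # V(s), V(s') from the critic
r = 1.0                 # reward
mu, sigma = 0.2, 1.0    # actor's Normal-distribution outputs
a = 0.5                 # the action that was actually taken

td_error = r + GAMMA * v_s_ - v_s   # critic's TD error
critic_loss = td_error ** 2         # critic minimises the squared TD error
# the actor does gradient *ascent* on log pi(a|s) * td_error,
# i.e. it minimises the negative of that quantity
actor_loss = -normal_log_prob(a, mu, sigma) * td_error
print(round(td_error, 2))           # -> 1.4
```

A positive TD error means the action turned out better than the critic expected, so the actor loss pushes the policy toward that action; a negative one pushes it away.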
2.4 Actor-Critic Network
We merge the Actor and the Critic into one complete system, which makes it easy to run.
# This class can be called to create a global net,
# or to create a worker's net; since their structures are identical,
# the class can be reused for both.
class ACNet(object):
    def __init__(self, scope, globalAC=None):
        # when creating a worker net, we pass in the previously created globalAC
        if scope == GLOBAL_NET_SCOPE:   # decide whether this net is global or local
            with tf.variable_scope('Global_Net'):
                self._build_net()
        else:
            with tf.variable_scope(scope):
                self._build_net()
                # then compute the critic loss and the actor loss,
                # and use these two losses to compute the gradients to push
            with tf.name_scope('sync'):   # synchronisation
                with tf.name_scope('pull'):
                    pass   # ops that fetch the latest parameters from global
                with tf.name_scope('push'):
                    pass   # ops that push local gradients to global
    def _build_net(self):
        # build the Actor and Critic networks here
        return mean, variance, state_value
    def update_global(self, feed_dict):
        pass   # perform the push operation
    def pull_global(self):
        pass   # perform the pull operation
    def choose_action(self, s):
        pass   # choose an action given s
That covers creating the networks. The workers get their own class, used to perform the work in each thread.
2.5 Worker
Each worker has its own class, which contains its work routine:
class Worker(object):
    def __init__(self, name, globalAC):
        self.env = gym.make(GAME).unwrapped   # each worker creates its own environment
        self.name = name                      # its own name
        self.AC = ACNet(name, globalAC)       # its own local net, bound to globalAC
    def work(self):
        total_step = 1   # step counter used to time the sync
        # buffers for s, a, r, used for the n-step update
        buffer_s, buffer_a, buffer_r = [], [], []
        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:
            s = self.env.reset()
            for ep_t in range(MAX_EP_STEP):
                a = self.AC.choose_action(s)
                s_, r, done, info = self.env.step(a)
                buffer_s.append(s)   # fill the buffers
                buffer_a.append(a)
                buffer_r.append(r)
                # every UPDATE_GLOBAL_ITER steps, or at episode end, run the sync
                if total_step % UPDATE_GLOBAL_ITER == 0 or done:
                    # get the value of the next state, needed for the TD error
                    if done:
                        v_s_ = 0   # terminal state
                    else:
                        v_s_ = SESS.run(self.AC.v, {
                            self.AC.s: s_[np.newaxis, :]})[0, 0]
                    buffer_v_target = []   # buffer of state-value targets for the TD error
                    for r in buffer_r[::-1]:   # n-step forward view, computed backwards
                        v_s_ = r + GAMMA * v_s_
                        buffer_v_target.append(v_s_)
                    buffer_v_target.reverse()
                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)
                    feed_dict = {
                        self.AC.s: buffer_s,
                        self.AC.a_his: buffer_a,
                        self.AC.v_target: buffer_v_target,
                    }
                    self.AC.update_global(feed_dict)   # push the update to globalAC
                    buffer_s, buffer_a, buffer_r = [], [], []   # empty the buffers
                    self.AC.pull_global()   # fetch the latest parameters from globalAC
                s = s_
                total_step += 1
                if done:
                    GLOBAL_EP += 1   # one more episode finished
                    break            # end this episode
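The inner loop that builds buffer_v_target can be pulled out and tested on its own. Below is a minimal, framework-free version; the function name is mine, not Mofan's, and v_s_ is the bootstrap value of the state after the last buffered step (0 at a terminal state).

```python
GAMMA = 0.9

def n_step_targets(buffer_r, v_s_):
    # walk the rewards backwards, discounting from the bootstrap value,
    # so that targets[i] = r_i + GAMMA * r_{i+1} + ... + GAMMA^k * v_s_
    targets = []
    for r in reversed(buffer_r):
        v_s_ = r + GAMMA * v_s_
        targets.append(v_s_)
    targets.reverse()
    return targets

# rewards 1, 0, 2 with a terminal bootstrap of 0:
# the last target is 2.0, the middle one 0 + 0.9*2.0 = 1.8,
# and the first 1 + 0.9*1.8 = 2.62
print(n_step_targets([1.0, 0.0, 2.0], v_s_=0.0))
```

Computing the targets backwards like this reuses each partial sum, so the whole buffer costs O(n) instead of O(n²).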
2.6 Worker parallel work
Here is the real focus: the parallel computation of the workers.
with tf.device("/cpu:0"):
    GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)   # create the global AC
    workers = []
    for i in range(N_WORKERS):            # create the workers, to run in parallel later
        workers.append(Worker('W_%i' % i, GLOBAL_AC))   # every worker shares this global AC

COORD = tf.train.Coordinator()            # TensorFlow's tool for coordinating threads
worker_threads = []
for worker in workers:
    # pass the bound method directly, so each thread keeps its own worker
    t = threading.Thread(target=worker.work)   # add a worker thread
    t.start()
    worker_threads.append(t)
COORD.join(worker_threads)                # TF thread scheduling: wait for all workers
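The worker/coordinator pattern above can be imitated with nothing but the standard library. The toy below runs several threads that each repeatedly "push" an update into one shared global value; the names and the update rule are illustrative, not the A3C loss, and the lock only serialises this toy's push (the original A3C applies updates asynchronously without one).

```python
import threading

global_param = 0.0
lock = threading.Lock()

def work(n_updates):
    # each worker thread repeatedly pushes a small update to the global value
    global global_param
    for _ in range(n_updates):
        grad = 1.0                 # stand-in for a locally computed gradient
        with lock:                 # serialise the push itself
            global_param += 0.01 * grad

threads = [threading.Thread(target=work, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # same role as COORD.join(worker_threads)
print(round(global_param, 2))      # -> 4.0
```

Because Python threads share memory, the "global net" here is just a module-level variable; the interesting part is that each thread works independently and only synchronises at the moment of the push.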
Article source: Mofan Reinforcement Learning https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/