[Reinforcement Learning] Asynchronous Advantage Actor-Critic (A3C)

1 Introduction to A3C

The full name of A3C is Asynchronous Advantage Actor-Critic. As the name suggests, it keeps the Actor-Critic form (if you need a refresher, see [Reinforcement Learning] Actor-Critic algorithm detailed explanation). To train one pair of Actor and Critic, we make several copies of them and drop the copies into different parallel environments at the same time, letting each copy play its own game. Each copy then quietly reports back to the central Actor-Critic how it has been playing in its own world and which experiences are worth sharing. After weighing the experience from all of the copies, the central Actor-Critic ends up with the strategy for beating the game. Put together, this forms an efficient reinforcement learning method.
Most computers today have 2, 4, or even 6 or 8 CPU cores. Ordinary training methods only let the agent learn on a single core. With the A3C approach, we can assign the copies to different cores and run them in parallel; in practice this is often many times faster than the traditional single-core method.
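For example, the number of parallel workers can simply be matched to the number of CPU cores on the machine. A minimal sketch (N_WORKERS is the variable the code later in this post uses for this count):

import multiprocessing

N_WORKERS = multiprocessing.cpu_count()  # one worker thread per CPU core
print(N_WORKERS)                         # e.g. 4 or 8, depending on the machine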

2 Asynchronous Advantage Actor-Critic (A3C) Detailed Explanation

2.1 Main points

One sentence summarizes A3C: an algorithm proposed by Google DeepMind to address the convergence problems of Actor-Critic. It creates multiple parallel environments and lets multiple agents, each with its own copy of the network structure, update the parameters of a shared main structure at the same time. The parallel agents do not interfere with each other, and because the updates they push to the main structure arrive in a non-continuous, decorrelated fashion, the correlation between consecutive updates is reduced and convergence improves.

2.2 Algorithm

A3C essentially runs Actor-Critic in multiple threads that train in parallel. Imagine several people playing the same game at the same time: the experience each of them gains is uploaded to a central brain, and from time to time each of them downloads the newest way of playing from that central brain.
For these players, the benefit is that the central brain aggregates everyone's experience and is therefore the best player of all; every so often they can grab its latest tricks and use them in their own games.
For the central brain, the benefit is that what it fears most is a continuous, correlated stream of updates from a single player; having many players push their updates independently breaks that continuity. The central brain can therefore be trained well without a replay memory of the kind DQN and DDPG rely on.
To achieve this we need two levels of networks: the central brain holds the global net and its parameters, and every player holds a local net, a copy of the global net. Each local net regularly pushes its updates to the global net and regularly pulls the latest, aggregated parameters back from the global net.
If we look at the system we want to build in TensorBoard:
W_0 is the 0th worker, and every worker shares the same global_net.
When the pull op inside the sync scope is run, the worker fetches the latest parameters from global_net.
When the push op inside the sync scope is run, the worker pushes its own updates (its gradients) to global_net.
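As a rough sketch of what these two sync ops can look like in TensorFlow 1.x (an illustrative helper, not the article's exact code; the parameter lists, the gradients and the optimizer are assumed to have been created elsewhere):

import tensorflow as tf

def make_sync_ops(local_params, global_params, local_grads, optimizer):
    # Build the pull/push ops for one worker (illustrative helper only).
    with tf.name_scope('sync'):
        with tf.name_scope('pull'):
            # overwrite the local parameters with the latest global parameters
            pull_op = [l.assign(g) for l, g in zip(local_params, global_params)]
        with tf.name_scope('push'):
            # apply the locally computed gradients to the global parameters
            push_op = optimizer.apply_gradients(zip(local_grads, global_params))
    return pull_op, push_op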

2.3 Main structure

We build the neural networks with TensorFlow. For our Actor, TensorBoard shows clearly how it is constructed:
We use a Normal distribution to select actions, so when building the network, the actor side needs to output the mean and the variance of the action; these are then fed into the Normal distribution from which the action is sampled. When computing the actor loss, we also need the TD error provided by the critic as the guide for gradient ascent.
The critic is very simple: it only needs to output its value estimate for the state, which is used to compute the TD error.
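Putting the two together, the loss terms can be sketched roughly as follows in TensorFlow 1.x (a simplified illustration under the assumption that mu, sigma, v and the placeholders a_his, v_target come from the network built in _build_net; the entropy bonus used in the full A3C paper is left out):

import tensorflow as tf

def build_losses(mu, sigma, v, a_his, v_target):
    # Illustrative actor/critic losses for a continuous-action A3C worker.
    td = tf.subtract(v_target, v, name='TD_error')          # TD error from the critic
    with tf.name_scope('c_loss'):
        c_loss = tf.reduce_mean(tf.square(td))               # critic: minimise the squared TD error
    with tf.name_scope('a_loss'):
        # older TF 1.x versions expose this as tf.contrib.distributions.Normal
        normal_dist = tf.distributions.Normal(mu, sigma)     # policy as a Normal distribution
        log_prob = normal_dist.log_prob(a_his)
        # gradient ascent on log-prob weighted by the TD error (advantage estimate),
        # written as minimising the negative objective
        a_loss = tf.reduce_mean(-log_prob * tf.stop_gradient(td))
    return a_loss, c_loss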

2.4 Actor-Critic Network

We merge the Actor and the Critic into one complete system so that it is easy to run.

import tensorflow as tf

# This class can be called to create the global net.
# It can also be called to create a worker's net, because the two
# have exactly the same structure, so the class is reused for both.
class ACNet(object):
    def __init__(self, scope, globalAC=None):
        # when creating a worker net, we pass in the previously created globalAC
        if scope == GLOBAL_NET_SCOPE:   # decide whether this net is the global net or a local one
            with tf.variable_scope('Global_Net'):
                self._build_net()
        else:
            with tf.variable_scope(scope):   # each worker gets its own variable scope
                self._build_net()

            # then compute the critic loss and the actor loss,
            # and use these two losses to compute the gradients to push

            with tf.name_scope('sync'):  # synchronisation with the global net
                with tf.name_scope('pull'):
                    pass  # pull the latest parameters from the global net
                with tf.name_scope('push'):
                    pass  # push the local gradients to the global net

    def _build_net(self):
        # build the Actor and Critic networks here;
        # return the action mean, the action variance (sigma) and the state value
        pass

    def update_global(self, feed_dict):
        # perform the push operation
        pass

    def pull_global(self):
        # perform the pull operation
        pass

    def choose_action(self, s):
        # choose an action based on state s
        pass

The code above builds the networks; next, the workers each get their own class to carry out the work in their own thread.

2.5 Worker

Each worker has its own class, which defines the work it performs:

import gym
import numpy as np

class Worker(object):
    def __init__(self, name, globalAC):
        self.env = gym.make(GAME).unwrapped  # create this worker's own environment
        self.name = name                     # this worker's name
        self.AC = ACNet(name, globalAC)      # this worker's local net, bound to globalAC

    def work(self):
        global GLOBAL_EP
        # buffers for s, a, r, used for the n-step update
        buffer_s, buffer_a, buffer_r = [], [], []
        total_step = 1
        while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP:
            s = self.env.reset()

            for ep_t in range(MAX_EP_STEP):
                a = self.AC.choose_action(s)
                s_, r, done, info = self.env.step(a)

                buffer_s.append(s)  # fill the buffers
                buffer_a.append(a)
                buffer_r.append(r)

                # every UPDATE_GLOBAL_ITER steps, or when the episode ends, do a sync
                if total_step % UPDATE_GLOBAL_ITER == 0 or done:
                    # get the value of the next state, used to compute the TD error
                    if done:
                        v_s_ = 0   # terminal state has value 0
                    else:
                        v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]

                    buffer_v_target = []        # buffer of state-value targets, used for the TD error
                    for r in buffer_r[::-1]:    # n-step forward view, computed backwards
                        v_s_ = r + GAMMA * v_s_
                        buffer_v_target.append(v_s_)
                    buffer_v_target.reverse()

                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)

                    feed_dict = {
                        self.AC.s: buffer_s,
                        self.AC.a_his: buffer_a,
                        self.AC.v_target: buffer_v_target,
                    }

                    self.AC.update_global(feed_dict)    # push the update to globalAC
                    buffer_s, buffer_a, buffer_r = [], [], []   # clear the buffers
                    self.AC.pull_global()               # pull the latest parameters from globalAC

                s = s_
                total_step += 1
                if done:
                    GLOBAL_EP += 1  # one more episode finished
                    break           # end this episode
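The reversed loop above is just the discounted n-step return computed backwards from the bootstrap value v_s_. A tiny standalone illustration with made-up numbers and GAMMA = 0.9:

GAMMA = 0.9
buffer_r = [1.0, 0.0, 2.0]   # rewards collected since the last sync (example values)
v_s_ = 5.0                   # critic's value estimate for the state after the last step

buffer_v_target = []
for r in buffer_r[::-1]:     # walk backwards through the rewards
    v_s_ = r + GAMMA * v_s_
    buffer_v_target.append(v_s_)
buffer_v_target.reverse()

print(buffer_v_target)  # approximately [6.265, 5.85, 6.5], i.e. r_t + GAMMA*r_{t+1} + ... + GAMMA^n * V(s_n)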

2.6 Running the Workers in parallel

Now for the real point of the method: running the Workers in parallel.

import threading
import tensorflow as tf

with tf.device("/cpu:0"):
    GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE)  # build the global AC net
    workers = []
    for i in range(N_WORKERS):  # create the workers, to be run in parallel afterwards
        workers.append(Worker('W_%i' % i, GLOBAL_AC))   # every worker shares this global AC

COORD = tf.train.Coordinator()  # TensorFlow's tool for coordinating threads

worker_threads = []
for worker in workers:
    job = lambda worker=worker: worker.work()  # bind the current worker to this thread's job
    t = threading.Thread(target=job)           # one working thread per worker
    t.start()
    worker_threads.append(t)
COORD.join(worker_threads)  # let the coordinator wait for all threads
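For the snippet above to actually run, the full program also needs the shared session, the optimizers and the constants that the earlier snippets reference. A rough sketch of that remaining setup, following common TensorFlow 1.x practice (the environment name and hyperparameter values here are illustrative, not the article's):

import multiprocessing
import tensorflow as tf

GAME = 'Pendulum-v0'                      # example continuous-action environment
N_WORKERS = multiprocessing.cpu_count()   # one worker per CPU core
MAX_GLOBAL_EP = 2000                      # total number of training episodes
MAX_EP_STEP = 200                         # steps per episode
UPDATE_GLOBAL_ITER = 10                   # sync with the global net every 10 steps
GAMMA = 0.9                               # reward discount
GLOBAL_NET_SCOPE = 'Global_Net'
GLOBAL_EP = 0                             # episode counter shared by all workers

SESS = tf.Session()
OPT_A = tf.train.RMSPropOptimizer(0.0001, name='RMSPropA')  # actor optimizer
OPT_C = tf.train.RMSPropOptimizer(0.001, name='RMSPropC')   # critic optimizer

# ... build GLOBAL_AC and the workers as shown above, then:
SESS.run(tf.global_variables_initializer())  # initialise all parameters before the threads start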

Article source: Mofan Reinforcement Learning https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/


Reposted from: blog.csdn.net/shoppingend/article/details/124403514