A brief description of actor-critic related algorithms

This article briefly introduces actor-critic related algorithms in deep reinforcement learning, based on notes from Li Hongyi's machine learning course.

Bilibili (B station) link to Li Hongyi's course:
Li Hongyi, Deep Reinforcement Learning, Actor-Critic

Related Notes:
A brief description of the policy gradient algorithm
A brief description of the proximal policy optimization algorithm
A brief description of the DQN (deep Q-network) algorithm


asynchronous advantage actor-critic (A3C)


References:
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016

Among the related algorithms based on actor-critic, the most well-known method is asynchronous advantage actor-critic, or A3C for short.
If you remove the "asynchronous" part, you get advantage actor-critic, i.e., A2C.

First, recall the policy gradient method:
$\triangledown \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t^{\prime}=t}^{T_n} \gamma^{t^{\prime} - t} r_{t^{\prime}}^n - b \right) \triangledown \ln p_\theta(a^n_t | s^n_t)$

Denote the cumulative reward as $G$:
$G^n_t = \sum_{t^{\prime}=t}^{T_n} \gamma^{t^{\prime} - t} r_{t^{\prime}}^n$
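As a quick illustration (a minimal sketch in plain Python/NumPy with hypothetical variable names, not code from the course), the discounted return $G_t$ can be computed from a recorded reward sequence like this:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards collected from one episode
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```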

However, because the interaction process is itself random, the cumulative reward $G$ can be very unstable, and since only a limited number of samples can be collected before each parameter update, this high variance becomes a problem:

(Figure: the sampled values of $G$ are unstable)

So consider replacing the sampled value of $G$ with its expected value; the expected value can be estimated with a value-based method, namely DQN.

In DQN there are two kinds of critics, namely the state value function $V^\pi(s)$ and the state-action value function $Q^\pi(s, a)$.
(Figure: the two critics in DQN, $V^\pi(s)$ and $Q^\pi(s, a)$)

Both the Monte Carlo based approach (MC) and the temporal-difference approach (TD) can be used for this estimation; TD is relatively stable, while MC is more accurate but has higher variance.

In fact, the expected value of the cumulative reward $G$ is exactly $Q$, i.e.:
$E[G^n_t] = Q^{\pi_\theta}(s^n_t, a^n_t)$

As for the baseline $b$, a common practice is to represent it with the state value function $V^{\pi_\theta}(s^n_t)$, so the weighting term becomes $Q^{\pi_\theta}(s^n_t, a^n_t) - V^{\pi_\theta}(s^n_t)$:
(Figure: replacing $G^n_t - b$ with $Q^{\pi_\theta}(s^n_t, a^n_t) - V^{\pi_\theta}(s^n_t)$)

The drawback of this approach is that two networks, $Q$ and $V$, must be estimated, so the risk of an inaccurate estimate is doubled. In fact, we can estimate only one network, the $V$ network, and use the value of $V$ to represent the value of $Q$, since:
$Q^{\pi}(s^n_t, a^n_t) = E[r^n_t + V^\pi(s^n_{t+1})]$

Dropping the expectation and using $V$ directly to represent $Q$:
$Q^{\pi}(s^n_t, a^n_t) = r^n_t + V^\pi(s^n_{t+1})$

This yields the expression of the advantage function (advantage):
$r^n_t + V^\pi(s^n_{t+1}) - V^\pi(s^n_t)$

The update of the policy $\pi$ then becomes:
$\triangledown \bar R_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(r^n_t + V^\pi(s^n_{t+1}) - V^\pi(s^n_t)\right) \triangledown \ln p_\theta(a^n_t | s^n_t)$
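To make the update concrete, here is a minimal PyTorch-style sketch (not from the course; `policy_net`, `value_net`, and the `dones` mask are assumptions, and the discount on $V(s_{t+1})$ is a practical addition) of how the advantage weights the log-probability term:

```python
import torch
import torch.nn.functional as F

def a2c_actor_loss(policy_net, value_net, states, actions, rewards,
                   next_states, dones, gamma=0.99):
    """Policy-gradient loss weighted by the advantage r_t + gamma * V(s_{t+1}) - V(s_t)."""
    with torch.no_grad():
        v_s = value_net(states).squeeze(-1)          # V(s_t)
        v_next = value_net(next_states).squeeze(-1)  # V(s_{t+1})
        # gamma and the (1 - done) mask are practical additions not written in the formula above
        advantage = rewards + gamma * (1.0 - dones) * v_next - v_s

    logits = policy_net(states)                      # unnormalized action scores
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # ln p_theta(a_t | s_t)

    # Maximizing advantage-weighted log-probs == minimizing their negative mean
    return -(advantage * chosen).mean()
```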

As for why the expectation can simply be dropped, this choice comes from the original A3C paper, whose authors tried various alternatives and found this one to work best in their experimental comparison.

The overall process is as follows:
(Figure: overall flow of the actor-critic method)

In the policy gradient method, the policy is updated directly after the data are collected;
in the actor-critic method, the data are not used to update the policy directly. Instead, they are first used to estimate the value function $V$ (either the Monte Carlo based method or the temporal-difference based method can be used), and then, based on the value function, the policy $\pi$ is updated with the formula above.
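Correspondingly, a minimal sketch of fitting the critic with a one-step TD target (same assumed `value_net` as above; an MC variant would simply regress toward the sampled return $G^n_t$ instead):

```python
import torch
import torch.nn.functional as F

def td_critic_loss(value_net, states, rewards, next_states, dones, gamma=0.99):
    """Regress V(s_t) toward the one-step TD target r_t + gamma * V(s_{t+1})."""
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)
    return F.mse_loss(value_net(states).squeeze(-1), target)
```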

There are two tricks here.

First, when implementing actor-critic we need to estimate two networks: the network for the value function $V$ and the network for the policy $\pi$. Since both networks take the state $s$ as input, their first few layers can be shared, which is especially useful for games where the input is an image:

(Figure: tip 1, sharing the front layers between the value network and the policy network)
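A sketch of what such a shared-trunk network might look like (PyTorch; the layer sizes and the 84x84, 4-frame input are illustrative assumptions, not values from the course):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic heads on top of a shared convolutional trunk."""
    def __init__(self, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(                 # shared image encoder
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 64 * 9 * 9                       # assumes 84x84 input frames
        self.policy_head = nn.Linear(feat_dim, n_actions)  # logits for pi(a|s)
        self.value_head = nn.Linear(feat_dim, 1)           # scalar V(s)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)
```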

Second, we need a mechanism for exploration. A common approach is to put a constraint on the output of the policy $\pi$ so that the entropy of the output distribution is not too small, i.e., the probabilities of the different actions should not be too concentrated. This way the agent tries a variety of actions and explores the environment better.
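One common way to implement this constraint (a sketch; `entropy_coef` is an assumed hyperparameter) is to add an entropy bonus to the actor loss:

```python
import torch
import torch.nn.functional as F

def entropy_bonus(logits, entropy_coef=0.01):
    """Loss term that rewards high-entropy (more exploratory) action distributions."""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return -entropy_coef * entropy  # added to the actor loss; minimizing it raises entropy

# total_loss = actor_loss + 0.5 * critic_loss + entropy_bonus(logits)
```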

The algorithm above is advantage actor-critic (A2C). Finally, let us explain what "asynchronous" means.

As shown in the figure below, every actor (worker) runs in parallel, that is, "each does its own thing, regardless of the others". By the time a worker finishes and is ready to send its update back, the global parameters may already have been overwritten by other workers, but that does not matter: the worker simply applies its update directly. This is what "asynchronous" means. (Note: the $\triangle \theta$ in the figure should be $\triangledown \theta$.)

(Figure: A3C, multiple parallel workers asynchronously updating the global parameters)
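A heavily simplified sketch of a single worker's loop (Python/PyTorch; `global_net`, `make_env`, and `rollout_and_loss` are hypothetical, and a real A3C implementation shares parameters and an optimizer across processes, Hogwild-style):

```python
import copy
import torch

def worker_loop(global_net, make_env, n_steps=5, gamma=0.99):
    """One asynchronous worker: act with a local copy, push gradients to the global net."""
    env = make_env()
    local_net = copy.deepcopy(global_net)
    optimizer = torch.optim.RMSprop(global_net.parameters(), lr=7e-4)
    state = env.reset()
    while True:
        # Sync with the (possibly already-updated) global weights
        local_net.load_state_dict(global_net.state_dict())
        loss, state = rollout_and_loss(local_net, env, state, n_steps, gamma)  # hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        # Copy local gradients onto the global parameters and step -- no locking ("asynchronous")
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp.grad = lp.grad
        optimizer.step()
```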


pathwise derivative policy gradient


References:
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, “Deterministic Policy Gradient Algorithms”, ICML 2014
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR 2016

Next, we introduce the pathwise derivative policy gradient algorithm. The idea is that the critic not only evaluates how good an action is, but also tells the actor which action is good, namely the action that solves:
$\arg \max_a Q^\pi(s, a)$

But in a continuous action setting, this optimization problem is hard to solve directly, so we can learn another network to solve it (analogous to the generator in a GAN):

(Figure: pathwise derivative policy gradient, learning an actor network to solve $\arg \max_a Q^\pi(s, a)$)
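A sketch of that idea in the deterministic-actor style of the DDPG paper cited above (`actor` and `critic` are hypothetical networks): the actor is trained so that its output action maximizes the critic's value.

```python
import torch

def actor_loss(actor, critic, states):
    """Train the actor to output actions that the (frozen) critic scores highly."""
    actions = actor(states)              # a = pi(s), continuous actions
    q_values = critic(states, actions)   # Q(s, pi(s))
    return -q_values.mean()              # ascend Q by descending its negative

# actor_optimizer.zero_grad(); actor_loss(actor, critic, states).backward(); actor_optimizer.step()
```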

The algorithm flow is: first, a policy $\pi$ interacts with the environment and the $Q$ function is estimated; then $Q$ is fixed and the policy $\pi$ is learned to obtain a better actor; and so on, alternating, as shown in the figure below:

(Figure: pathwise derivative policy gradient, alternating between estimating $Q$ and improving the actor)

Just as DQN has both a real $Q$ network and a target $Q$ network (see the DQN note), this algorithm also has two actors: the real $\pi$ and the target $\hat \pi$.
In addition, the techniques mentioned before, such as the replay buffer and exploration, can also be used.
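For illustration, a sketch of maintaining a target network with a soft (Polyak) update, one common choice in DDPG-style implementations (`tau` is an assumed hyperparameter; a hard periodic copy as in DQN works as well):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Slowly track the online network: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * op)

# Called after each gradient step, for both the target critic and the target actor:
# soft_update(target_critic, critic); soft_update(target_actor, actor)
```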

The specific process is as follows (compared with the original DQN):

(Figure: pathwise derivative policy gradient, the detailed procedure compared with the original DQN)


The relationship between actor-critic and GAN


Actor-critic is very similar to GAN. For details, please refer to the paper:
David Pfau, Oriol Vinyals, "Connecting Generative Adversarial Networks and Actor-Critic Methods", arXiv preprint 2016

Both are known to be difficult to train, so the paper collects various techniques for training GANs. And because GAN and actor-critic are studied by two largely separate communities, the paper lists which of those techniques have been tried on each of the two families of algorithms, as shown in the figure below:

(Figure: comparison table of training techniques tried for actor-critic and for GAN)

Origin blog.csdn.net/Zhang_0702_China/article/details/123501178