(5) Advantage Actor-Critic (A2C)

Advantage Actor-Critic (A2C)

  • Policy network (actor): $\pi(a \mid s; \theta)$
       It is an approximation to the policy function $\pi(a \mid s)$.
       It controls the agent.
  • Value network (critic): $v(s; w)$
       It is an approximation to the state-value function $V_{\pi}(s)$.
       It evaluates how good the state $s$ is.

Note: A2C differs from the basic actor-critic (AC). AC uses the action value Q, while A2C uses the state value V. The action value Q depends on both the state s and the action a, whereas the state value V depends only on the state s, so V is easier to train than Q.

Both the policy network and the value network of A2C take the state s as input. The policy network outputs a probability for each action, and the agent's action is sampled from this distribution; the value network outputs a single real number that evaluates how good the current state is. The two networks can use the same architecture; the difference lies in how they are trained.
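As a concrete illustration (not from the original post), here is a minimal PyTorch sketch of the two networks; the class names, hidden size, and the assumption of a discrete action space are all illustrative choices:

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Actor: maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, s):
        # Softmax turns the logits into action probabilities pi(a|s; theta).
        return torch.softmax(self.net(s), dim=-1)


class ValueNet(nn.Module):
    """Critic: maps a state to a scalar estimate v(s; w) of the state value."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s):
        # A single real number per state, evaluating how good the state is.
        return self.net(s).squeeze(-1)
```

Apart from the output head (a softmax over actions versus a single scalar), the two sketches share the same architecture, matching the description above.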

Training of A2C

  • Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
  • TD target: $y_t = r_t + \gamma \cdot v(s_{t+1}; w)$.
  • TD error: $\delta_t = v(s_t; w) - y_t$.
  • Update the policy network (actor) by:
    $\theta \longleftarrow \theta - \beta \cdot \delta_t \cdot \dfrac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}.$
  • Update the value network (critic) by:
    $w \longleftarrow w - \alpha \cdot \delta_t \cdot \dfrac{\partial v(s_t; w)}{\partial w}.$
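These steps translate almost line by line into code. Below is a hedged sketch of one A2C update from a single transition, assuming the `PolicyNet`/`ValueNet` sketch earlier and plain SGD optimizers with learning rates α (critic) and β (actor); the function name and signature are my own:

```python
import torch


def a2c_update(policy_net, value_net, policy_opt, value_opt,
               s_t, a_t, r_t, s_next, gamma=0.99):
    """One A2C update from a single transition (s_t, a_t, r_t, s_{t+1})."""
    # TD target: y_t = r_t + gamma * v(s_{t+1}; w); no gradient flows through it.
    with torch.no_grad():
        y_t = r_t + gamma * value_net(s_next)

    # TD error: delta_t = v(s_t; w) - y_t; the advantage is -delta_t.
    delta_t = value_net(s_t) - y_t

    # Critic: minimizing 0.5 * delta_t^2 by SGD gives
    #   w <- w - alpha * delta_t * dv(s_t; w)/dw.
    value_opt.zero_grad()
    (0.5 * delta_t.pow(2)).backward()
    value_opt.step()

    # Actor: minimizing delta_t * ln pi(a_t|s_t; theta) by SGD gives
    #   theta <- theta - beta * delta_t * d ln pi(a_t|s_t; theta)/d theta.
    policy_opt.zero_grad()
    log_prob = torch.log(policy_net(s_t)[a_t])
    (delta_t.detach() * log_prob).backward()
    policy_opt.step()
```

The `detach()` call keeps the actor's update from pushing gradients into the critic, and the `torch.no_grad()` block keeps the TD target fixed, matching the formulas above.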

Approximate Policy Gradient

$$g(a_t) \approx \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta} \cdot \big(r_t + \gamma \cdot v(s_{t+1}; w) - v(s_t; w)\big).$$

  • The term $r_t + \gamma \cdot v(s_{t+1}; w)$ depends on $a_t$.
  • The term $v(s_t; w)$ is independent of $a_t$.
  • If $a_t$ is good, then $r_t + \gamma \cdot v(s_{t+1}; w)$ tends to be bigger than $v(s_t; w)$, so their difference is positive.
  • This difference, $y_t - v(s_t; w) = -\delta_t$, is called the advantage. It is evaluated by the critic and is used to update the policy network.
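Why is it legitimate to subtract the action-independent term $v(s_t; w)$? A short derivation of this standard baseline property (added here for completeness; it is not spelled out in the original post): because the term does not depend on $a_t$, its contribution to the expected policy gradient vanishes,

$$\mathbb{E}_{a_t \sim \pi(\cdot \mid s_t;\theta)}\!\left[\frac{\partial \ln \pi(a_t \mid s_t;\theta)}{\partial \theta}\, v(s_t;w)\right]
= v(s_t;w) \sum_{a} \pi(a \mid s_t;\theta)\,\frac{\partial \ln \pi(a \mid s_t;\theta)}{\partial \theta}
= v(s_t;w) \sum_{a} \frac{\partial \pi(a \mid s_t;\theta)}{\partial \theta}
= v(s_t;w)\, \frac{\partial}{\partial \theta} \sum_{a} \pi(a \mid s_t;\theta)
= v(s_t;w)\, \frac{\partial}{\partial \theta} 1 = 0.$$

So the baseline changes only the variance of the gradient estimate, not its expectation.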

Procedure

The policy network is also called the actor: given a state s, it computes a probability distribution over actions a and thereby controls the agent.
The agent interacts with the environment; after it executes action a, the environment performs a state transition and produces a new state.

We also need the value network to evaluate how good action a is; it is called the critic. The value network evaluates the two adjacent states and from them derives the advantage brought by action a, which it passes to the policy network.

Training the policy network uses the state s, the action a, and the advantage provided by the value network. The policy gradient is used to update the parameter $\theta$ of the policy network so that the actions it chooses become better and better.

Training the value network uses the state s and the reward r. The TD algorithm is used to update the parameter w of the value network so that its predictions of the state value become more and more accurate.
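To tie the procedure together, here is a hedged end-to-end sketch assuming the Gymnasium `CartPole-v1` environment and the `PolicyNet`, `ValueNet`, and `a2c_update` sketches above; the learning rates, episode count, and the use of plain SGD are illustrative, not tuned:

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy_net = PolicyNet(state_dim=4, action_dim=2)   # CartPole: 4-dim state, 2 actions
value_net = ValueNet(state_dim=4)
policy_opt = torch.optim.SGD(policy_net.parameters(), lr=1e-3)  # beta
value_opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)    # alpha

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        s_t = torch.as_tensor(obs, dtype=torch.float32)

        # Actor: sample a_t from pi(. | s_t; theta) to control the agent.
        with torch.no_grad():
            probs = policy_net(s_t)
        a_t = torch.multinomial(probs, num_samples=1).item()

        # Environment: state transition and reward.
        obs, r_t, terminated, truncated, _ = env.step(a_t)
        done = terminated or truncated
        s_next = torch.as_tensor(obs, dtype=torch.float32)

        # Critic evaluates the two adjacent states; both networks are updated.
        # (For simplicity this sketch bootstraps even at terminal states;
        #  a full implementation would use y_t = r_t when the episode ends.)
        a2c_update(policy_net, value_net, policy_opt, value_opt,
                   s_t, a_t, float(r_t), s_next)
```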

Reprinted from blog.csdn.net/weixin_49716548/article/details/131684764