Deep Reinforcement Learning: Notes on Sorting Out the Actor-Critic Update Logic

Foreword:

A few days ago, while explaining the update logic of the actor-critic architecture to my younger brother, I got stuck on the actor's optimization logic for a while and never fully sorted it out. Today, with a clear head, I dug out an old slide deck, and with its help the update logic of the AC architecture finally became clear, so I am writing it down as a note.

Actor-Critic Architecture Introduction:

Regarding the AC architecture, let me briefly give my understanding. The goal of reinforcement learning is to find an optimal policy model that maximizes the cumulative return along its action trajectories. Naturally, there is a policy model, here called the actor, whose input is the current state and whose output is an action. Without an evaluation network (the critic), the actor's parameters can only be updated using the cumulative return of a whole trajectory: one trajectory gives one update, which is inefficient. So someone had the idea of training a separate evaluation model that directly outputs a value for a specific state-action pair and uses it to guide the actor's optimization direction.

Critic's update logic:

And what is the critic's evaluation based on? If you have the basic RL background, the evaluation of a pair (s, a) is Q(s, a) = r + γ·Q(s', a'), which is the Q-value form of the Bellman equation (I believe I have written separate notes on understanding the Bellman equation).

With the equation above, assume Q(s', a') is known, and that s, a, r, s' are all known from collected transitions. Then we only need to update the critic's parameters φ so that the output of Q(s, a|φ) gets close to the target r + γ·Q(s', a'). This is just supervised learning, a basic operation in deep learning.
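
To make this concrete, here is a minimal sketch of one such supervised critic update, assuming PyTorch; `critic`, `critic_target`, `actor_target`, `critic_opt`, and the batch layout are placeholder names I am introducing for illustration, not code from the original post.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, critic, critic_target, actor_target, critic_opt, gamma=0.99):
    """One supervised update of Q(s, a | phi) toward r + gamma * Q(s', a')."""
    s, a, r, s_next, done = batch              # tensors sampled from a replay buffer

    with torch.no_grad():                      # the TD target is treated as a fixed label
        a_next = actor_target(s_next)          # a' = pi(s' | theta_target)
        q_target = r + gamma * (1.0 - done) * critic_target(s_next, a_next)

    q_pred = critic(s, a)                      # Q(s, a | phi)
    loss = F.mse_loss(q_pred, q_target)        # regression: pull Q(s, a) toward the target

    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```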

As for whether Q(s', a') is accurate: in the DDPG algorithm, the target network Q(s', a'|φ_target) is updated more slowly than Q(s, a|φ). It is like walking: the left foot stays planted while the right foot steps forward, then the left foot catches up, moving forward step by step~
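
For the "slower" target network, DDPG uses a soft (Polyak) update. A minimal sketch, again assuming PyTorch modules, with `tau` a small constant:

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau=0.005):
    """Move phi_target a small step toward phi: the planted 'left foot'."""
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```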

So now we have the critic's update rule: as long as we collect enough transitions (s, a, r, s'), we can train a good evaluation model that gives a "fair" evaluation for any specific (s, a).

Actor's update logic:

How is this evaluation used to update the actor model?
Let's look at the expressions:
Actor's formula: $a = \pi(s|\theta)$
Critic's formula: $q = Q(s, a|\phi) = Q(s, \pi(s|\theta)|\phi)$

The idea for updating the actor is: for a specific state s, adjust the actor's parameters θ so that when the actor's output $\pi(s|\theta)$ is fed into the critic, the critic's output $Q(s, \pi(s|\theta)|\phi)$ moves in the direction of becoming larger.

Obviously this requires the chain rule for differentiating a composite function.
Overall, with the critic's parameters φ held fixed and for a specific state s, $Q(s, \pi(s|\theta)|\phi)$ is a composite function of the actor's parameters θ, and the chain rule gives:

$\nabla_\theta J = \delta Q / \delta \theta = \left( \delta Q(s, a) / \delta \pi(s|\theta) \right) \cdot \left( \delta \pi(s|\theta) / \delta \theta \right)$

With this derivative in hand, θ is updated along the gradient by definition, so that $Q(s, \pi(s))$ becomes larger, which means the actor's output becomes better.
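
A minimal sketch of this actor step, assuming PyTorch; maximizing Q is written as minimizing -Q, and autograd carries out the chain rule above automatically. The names mirror the critic sketch and are placeholders:

```python
import torch

def actor_update(states, actor, critic, actor_opt):
    """One gradient-ascent step on Q(s, pi(s | theta)) with phi held fixed."""
    actions = actor(states)                        # a = pi(s | theta)
    actor_loss = -critic(states, actions).mean()   # ascend Q  ==  descend -Q

    actor_opt.zero_grad()
    actor_loss.backward()                          # chain rule: dQ/dpi * dpi/dtheta
    actor_opt.step()                               # actor_opt only holds theta, so phi is untouched
    return actor_loss.item()
```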

The DDPG algorithm flowchart:
[Figure: DDPG algorithm flowchart]

The core formula is the following:
[Figure: DDPG's core actor-update formula]
Here a = μ(s); μ(s) is just $\pi(s)$ written as μ(s) because the policy is deterministic, and the log term is simply a trick to make the computation easier.
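
Purely as a usage sketch, the three pieces above could be wired into one DDPG training step roughly like this (the replay buffer, networks, and optimizers are all assumed to be defined elsewhere):

```python
def ddpg_train_step(replay_buffer, actor, actor_target, critic, critic_target,
                    actor_opt, critic_opt, batch_size=256, gamma=0.99, tau=0.005):
    batch = replay_buffer.sample(batch_size)       # (s, a, r, s', done)
    s = batch[0]

    critic_update(batch, critic, critic_target, actor_target, critic_opt, gamma)
    actor_update(s, actor, critic, actor_opt)

    # drag both target networks slowly behind the online networks
    soft_update(critic, critic_target, tau)
    soft_update(actor, actor_target, tau)
```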

Original post: blog.csdn.net/hehedadaq/article/details/122514197