Reinforcement Learning: Policy Gradients

The idea of the policy gradient method

  Previously we expressed policies in the form of tables, but we can also express them as functions. All the methods learned before are called value-based; the method introduced next is policy-based. Let's take a look at the idea of the policy gradient method. The policies learned before are all represented as tables, as follows:
[image: a policy represented as a table of $\pi(a \mid s)$ values, one entry per state-action pair]
  Now, we change the table into a function, and the way $\pi$ is written changes accordingly:

[image: the policy written as a parameterized function $\pi(a \mid s, \theta)$]

Here, $\theta$ is a vector containing the parameters of the function $\pi$.

  One difference between the tabular and functional representations is how the probability of an action is obtained. With a table, you can look up the probability directly by index; with a function this is slightly more involved, since there is no index to read from and you must compute the corresponding $\pi(a \mid s, \theta)$.

  The two representations also differ in how the policy is updated. In the tabular case you can modify the values in the table directly. When the policy is expressed as a parameterized function, $\pi$ can only be updated by changing the parameter $\theta$. A minimal sketch of this distinction is given below.
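To make the contrast concrete, here is a minimal Python sketch (not from the original post; the state/action counts and the softmax parameterization are illustrative assumptions): a tabular policy is read by index, while a parameterized policy must be evaluated.

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)

# Tabular policy: probabilities are stored and read by index.
pi_table = rng.dirichlet(np.ones(n_actions), size=n_states)   # shape (|S|, |A|)
p_tabular = pi_table[2, 1]                                     # pi(a=1 | s=2), direct lookup

# Parameterized policy: probabilities must be computed from theta.
theta = rng.normal(size=(n_states, n_actions))                 # one preference per (s, a)

def pi(a, s, theta):
    """pi(a|s,theta) via softmax over the action preferences for state s."""
    prefs = theta[s]
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return probs[a]

p_param = pi(1, 2, theta)                                       # must be computed, not indexed
print(p_tabular, p_param)
```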

The idea of the policy gradient method:
  When the policy is represented by a function, we define a scalar objective function $J(\theta)$ and optimize it so that the policy $\pi$ becomes optimal, as follows:
[image: maximize $J(\theta)$ by gradient ascent on $\theta$, e.g. $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$, to obtain an optimal policy]

Selection of scalar objective function

  We saw above that we need to define a scalar objective function. So what should this objective be? Two types of scalar objective functions are commonly used.

  The first is the average state value, or simply the average value, which is a weighted average of the state values, as follows:
$$\bar{v}_\pi = \sum_{s} d(s)\, v_\pi(s)$$

$\bar{v}_\pi$ is the weighted average of the state values;
$d(s)$ is the probability that state $s$ is selected.

The above can also be written more concisely as an inner product of two vectors:
$$\bar{v}_\pi = d^{\top} v_\pi$$
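A quick numerical sketch, with made-up values for $d$ and $v_\pi$, showing that the weighted sum and the inner product give the same number:

```python
import numpy as np

d = np.array([0.2, 0.5, 0.3])        # distribution over states, sums to 1
v_pi = np.array([1.0, 4.0, -2.0])    # state values under the current policy

v_bar_sum = np.sum(d * v_pi)         # weighted average  sum_s d(s) v_pi(s)
v_bar_dot = d @ v_pi                 # same quantity as an inner product d^T v_pi
assert np.isclose(v_bar_sum, v_bar_dot)
print(v_bar_dot)                     # 1.6
```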

  So how do we choose $d(s)$? There are two situations: one where $d$ is independent of the policy $\pi$, and one where $d$ depends on $\pi$.

  When $d$ is independent of $\pi$, we denote it by $d_0$ and the corresponding objective by $\bar{v}_\pi^0$. For example, we can take the uniform distribution $d_0(s) = 1/|\mathcal{S}| = 1/n$; if a certain state is more important, we can give it a larger weight.

  When $d$ depends on $\pi$, the common choice is the stationary distribution, as follows:
$$d_\pi^{\top} P_\pi = d_\pi^{\top}$$

where $d_\pi$ is the stationary distribution of the Markov chain under $\pi$ and $P_\pi$ is the corresponding state transition probability matrix.
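As an illustration (the transition matrix $P_\pi$ below is invented), the stationary distribution can be approximated by power iteration, i.e. repeatedly applying $d^{\top} \leftarrow d^{\top} P_\pi$:

```python
import numpy as np

# State transition matrix P_pi under some fixed policy (rows sum to 1; values are made up).
P_pi = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

d = np.full(3, 1.0 / 3.0)            # start from a uniform distribution
for _ in range(1000):                # power iteration: d^T <- d^T P_pi
    d = d @ P_pi

print(d)                              # approximately satisfies d^T P_pi = d^T
print(np.allclose(d @ P_pi, d))       # True
```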
  The second is the average of the immediate rewards, which is a weighted average of the immediate rewards, as follows:
$$\bar{r}_\pi = \sum_{s} d_\pi(s)\, r_\pi(s)$$

where $r_\pi(s)$ is the expected immediate reward in state $s$ under $\pi$.

The above is the first form of this objective; another form is also often seen, as follows:
$$\bar{r}_\pi = \lim_{n\to\infty} \frac{1}{n}\, \mathbb{E}\!\left[\sum_{k=1}^{n} R_{t+k}\right]$$

Here we assume that, following the given policy, the agent generates a trajectory and receives a sequence of rewards $(R_{t+1}, R_{t+2}, \dots)$. After running an infinite number of steps, the starting state $s_0$ no longer matters, so the conditioning on $s_0$ is dropped.
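A small simulation sketch, again with an invented two-state chain, checking that the two forms agree: the weighted average over the stationary distribution versus the long-run average reward along one trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Markov chain induced by a fixed policy (made-up numbers).
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, -0.5])          # expected immediate reward r_pi(s) in each state

# Form 1: weighted average under the stationary distribution d_pi.
d = np.array([0.5, 0.5])
for _ in range(1000):
    d = d @ P_pi
r_bar_weighted = d @ r_pi

# Form 2: long-run average reward along one simulated trajectory.
s, total, n_steps = 0, 0.0, 200_000
for _ in range(n_steps):
    total += r_pi[s]
    s = rng.choice(2, p=P_pi[s])
r_bar_longrun = total / n_steps

print(r_bar_weighted, r_bar_longrun)   # the two values should be close
```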

  Having introduced the two choices of scalar objective function, we can summarize their common properties:
  1. Both are functions of the policy $\pi$.
  2. The policy $\pi$ is a function with parameter $\theta$; different values of $\theta$ give different objective values.
  3. We can therefore search for the optimal $\theta$ that maximizes the scalar objective.
  4. $\bar{r}_\pi$ and $\bar{v}_\pi$ are equivalent: when one is optimized, the other is optimized as well. In the discounted case, with $\gamma < 1$, we have $\bar{r}_\pi = (1-\gamma)\,\bar{v}_\pi$ (a numeric check is sketched below).
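A quick numerical sanity check of point 4 (toy chain and rewards invented for illustration), solving the Bellman equation for $v_\pi$ and comparing $\bar{r}_\pi$ with $(1-\gamma)\,\bar{v}_\pi$:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, -0.5])

# State values from the Bellman equation v = r + gamma * P v  =>  (I - gamma P) v = r.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Stationary distribution d_pi via power iteration.
d = np.array([0.5, 0.5])
for _ in range(1000):
    d = d @ P_pi

r_bar = d @ r_pi
v_bar = d @ v_pi
print(r_bar, (1 - gamma) * v_bar)     # the two numbers match
```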

Computing the Policy Gradient

  Once we have a scalar objective, we compute its gradient and then apply gradient-based methods to optimize it. The gradient computation is one of the most complicated parts, because, first, we need to distinguish between the different objectives $\bar{v}_\pi$, $\bar{r}_\pi$, and $\bar{v}_\pi^0$; second, we need to distinguish between the discounted and undiscounted cases. Here we only give a brief introduction to the gradient calculation.

$$\nabla_\theta J(\theta) = \sum_{s} \eta(s) \sum_{a} \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a)$$

$J(\theta)$ can be $\bar{v}_\pi$, $\bar{r}_\pi$, or $\bar{v}_\pi^0$;
$\eta$ is a distribution (probability weight) over the states;
the "=" here may stand for strict equality, approximation, or proportionality, depending on which objective is used.

v ˉ π \bar v_πvˉpr ˉ π \bar r_πrˉpv ˉ π 0 \bar v_π^0vˉPi0The corresponding gradient formula is as follows:
insert image description here
.
Analysis of the gradient formula:
  The formula above can be rewritten in expectation form:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta)\, q_\pi(S, A)\big]$$

where $S$ follows the distribution $\eta$ and $A$ follows the distribution $\pi(A \mid S, \theta)$.

Why do we need this form? Because the true gradient contains an expectation $\mathbb{E}[\cdot]$ that we cannot compute exactly, we approximate it by sampling and use the sample-based estimate for optimization:

$$\nabla_\theta J(\theta) \approx \nabla_\theta \ln \pi(a \mid s, \theta)\, q_\pi(s, a)$$
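To see why the sampled quantity is a valid estimate, here is a sketch (toy sizes; a made-up $\eta$ and random numbers standing in for $q_\pi$) comparing the exact expectation with a Monte Carlo average of $\nabla_\theta \ln \pi(a \mid s, \theta)\, q_\pi(s, a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
theta = rng.normal(size=(n_states, n_actions))   # softmax preferences, one per (s, a)
q = rng.normal(size=(n_states, n_actions))       # stand-in for the action values q_pi(s, a)
eta = np.array([0.2, 0.5, 0.3])                  # state distribution eta

def pi_s(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def grad_log_pi(s, a):
    """Gradient of ln pi(a|s,theta) w.r.t. theta for a softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -pi_s(s)
    g[s, a] += 1.0
    return g

# Exact gradient: E[ grad ln pi(A|S) * q(S,A) ] with S ~ eta, A ~ pi(.|S).
exact = sum(eta[s] * pi_s(s)[a] * q[s, a] * grad_log_pi(s, a)
            for s in range(n_states) for a in range(n_actions))

# Monte Carlo estimate from sampled pairs (S, A).
est, n = np.zeros_like(theta), 50_000
for _ in range(n):
    s = rng.choice(n_states, p=eta)
    a = rng.choice(n_actions, p=pi_s(s))
    est += q[s, a] * grad_log_pi(s, a)
est /= n

print(np.abs(exact - est).max())     # small: the sample average approximates the expectation
```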

Supplementary note:
  Because we need to compute $\ln \pi(a \mid s, \theta)$, we require $\pi(a \mid s, \theta) > 0$. How do we ensure that $\pi$ is positive for every action $a$? Use the softmax function for normalization, so the expression of $\pi$ is as follows:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{a'} e^{h(s, a', \theta)}}$$

where $h(s, a, \theta)$ is another function, usually produced by a neural network.
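A minimal sketch of this parameterization, assuming (for illustration only) a linear preference function $h(s, a, \theta) = \theta^{\top} x(s, a)$ in place of a neural network; it simply confirms that the softmax output is strictly positive and sums to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, feat_dim = 3, 4, 6

# Made-up state-action features; in practice a network would output h(s, a, theta) directly.
x = rng.normal(size=(n_states, n_actions, feat_dim))
theta = rng.normal(size=feat_dim)                 # policy parameters

def h(s, a, theta):
    return theta @ x[s, a]                        # linear preference h(s, a, theta) = theta^T x(s, a)

def pi(s, theta):
    prefs = np.array([h(s, a, theta) for a in range(n_actions)])
    e = np.exp(prefs - prefs.max())               # subtract the max for numerical stability
    return e / e.sum()

probs = pi(0, theta)
print(probs)            # every entry is strictly positive, so ln pi(a|s,theta) is well defined
print(probs.sum())      # 1.0
```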

Gradient Ascent and REINFORCE

  The basic idea of the gradient ascent algorithm is that, since the true gradient involves an expectation $\mathbb{E}[\cdot]$, we replace it with a stochastic gradient. However, the true action value $q_\pi(s, a)$ under the current policy $\pi$ is also unknown, so we approximate it by sampling as well. Combining this with Monte Carlo estimation gives the REINFORCE algorithm, as follows:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t)$$

We use stochastic gradients in place of the true gradient, so how do we sample the random variables $(S, A)$? First, $S$ should be sampled from the distribution $\eta$; this requires a large amount of data and a stationary state is hard to reach in reality, so it is usually not treated strictly in practice. What about sampling $A$? Since $A$ follows the distribution $\pi(A \mid S, \theta)$, at state $s_t$ the action $a_t$ should be sampled according to the policy $\pi(\theta)$. Therefore, all the policy gradient methods here are on-policy algorithms.
[image: illustration of the on-policy sampling of $(S, A)$ for the stochastic gradient update]

Algorithm understanding

[image: the update rewritten as $\theta_{t+1} = \theta_t + \alpha\,\beta_t\,\nabla_\theta \pi(a_t \mid s_t, \theta_t)$ with $\beta_t = \dfrac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}$]

Assuming $\alpha\beta_t$ is small, we can see that $\beta_t$ balances exploration and exploitation. Because $\beta_t$ is proportional to $q_t(s_t, a_t)$, when $q_t(s_t, a_t)$ is large, $\beta_t$ is also large, meaning the probability of selecting $a_t$ in $s_t$ is increased (exploitation). Because $\beta_t$ is inversely proportional to $\pi(a_t \mid s_t, \theta_t)$, when $\pi(a_t \mid s_t, \theta_t)$ is small, $\beta_t$ is large, meaning an action that currently has a low probability of being chosen is given a larger push so that it is more likely to be selected at the next step (exploration). The short derivation below shows where this $\beta_t$ form comes from.
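For reference, a short standard derivation (not in the original images) showing that the $\beta_t$ form is just a rewriting of the $\nabla_\theta \ln \pi$ update:

$$
\begin{aligned}
\theta_{t+1}
&= \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t) \\
&= \theta_t + \alpha\,\frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)}\, q_t(s_t, a_t) \\
&= \theta_t + \alpha \underbrace{\frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}}_{\beta_t} \nabla_\theta \pi(a_t \mid s_t, \theta_t).
\end{aligned}
$$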

When $\beta_t > 0$, the update performs gradient ascent on $\pi(a_t \mid s_t, \theta)$, so we have:

$$\pi(a_t \mid s_t, \theta_{t+1}) \geq \pi(a_t \mid s_t, \theta_t)$$

When $\beta_t < 0$, the update performs gradient descent on $\pi(a_t \mid s_t, \theta)$, so we have:

$$\pi(a_t \mid s_t, \theta_{t+1}) \leq \pi(a_t \mid s_t, \theta_t)$$

The REINFORCE algorithm

  We use $q_t(s_t, a_t)$ to approximately replace $q_\pi(s_t, a_t)$, where $q_t(s_t, a_t)$ is obtained by the Monte Carlo method: starting from $(s_t, a_t)$, generate the rest of an episode and assign the return of that episode to $q_t$. This is the REINFORCE algorithm.
$$q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{\,k-t-1}\, r_k$$
The pseudocode of the REINFORCE algorithm is as follows:

[image: pseudocode of the REINFORCE algorithm]

In the $k$-th iteration, we first select an initial state and follow the current policy $\pi(\theta_k)$ to interact with the environment and generate an episode, then go through every element of this episode. Each element is processed in two steps. The first step is the value update: use the Monte Carlo method to estimate $q_t$ by summing up all the rewards obtained from $(s_t, a_t)$ onwards. The second step is the policy update: substitute the resulting $q_t$ into the update formula to update $\theta_t$, and finally use the resulting $\theta_T$ as the new $\theta_k$ for the next iteration. A sketch in code is given below.
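Below is a compact Python sketch of this pseudocode, using an invented five-state chain environment and a tabular softmax policy; the environment, hyperparameters, and names are illustrative assumptions, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain environment (made up): states 0..4, state 4 is terminal.
# Action 0 moves left (floor at 0), action 1 moves right; -0.1 per step, +1 on reaching the goal.
N, GOAL, n_actions = 5, 4, 2

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == GOAL else -0.1
    return s2, r, s2 == GOAL

def pi_s(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

alpha, gamma, n_episodes = 0.1, 0.9, 500
theta = np.zeros((N, n_actions))

for k in range(n_episodes):
    # 1) Generate an episode (s_0, a_0, r_1, s_1, a_1, r_2, ...) following pi(theta_k).
    s, traj, done, t = 0, [], False, 0
    while not done and t < 100:
        a = rng.choice(n_actions, p=pi_s(theta, s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s, t = s2, t + 1

    # 2) Walk the episode backwards: value update (Monte Carlo return q_t),
    #    then policy update theta <- theta + alpha * q_t * grad ln pi(a_t|s_t, theta).
    g = 0.0
    for s_t, a_t, r_next in reversed(traj):
        g = r_next + gamma * g                   # q_t(s_t, a_t): discounted return from (s_t, a_t)
        grad_log = -pi_s(theta, s_t)
        grad_log[a_t] += 1.0                     # d/dtheta[s_t] of ln pi(a_t|s_t,theta), softmax policy
        theta[s_t] += alpha * g * grad_log

print(np.round([pi_s(theta, s)[1] for s in range(GOAL)], 3))  # P(move right) should approach 1
```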

Origin blog.csdn.net/qq_50086023/article/details/131397020