Brief description of the policy gradient algorithm

This article briefly introduces the policy gradient method in deep reinforcement learning, based on notes from Mr. Li Hongyi's machine learning course.

Bilibili link to Li Hongyi's course:
Li Hongyi, Deep Reinforcement Learning, Policy Gradient

Related Notes:
Proximal Policy Optimization Algorithm Brief
DQN (deep Q-network) Algorithm Brief
Actor-Critic Algorithm Brief


Assume:
the trajectory of one game (trajectory): $\tau$
the player's (actor's) strategy (policy): $\theta$

Then the expected reward can be estimated by sampling $N$ times (the reward $R$ is a random variable):
$$\bar R_{\theta} = \sum_{\tau} R(\tau)\, P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})$$
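As a minimal sketch of this sampling estimate (assuming a hypothetical `sample_trajectory()` that plays one episode with the current policy $\theta$ and returns its total reward $R(\tau)$), the estimator is just an empirical mean:

```python
import numpy as np

def estimate_expected_reward(sample_trajectory, n_samples=1000):
    """Monte Carlo estimate of the expected reward: average R(tau) over N sampled episodes."""
    returns = [sample_trajectory() for _ in range(n_samples)]
    return float(np.mean(returns))
```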

The optimal strategy is:
$$\theta^{*} = \arg \max_{\theta} \bar R_{\theta}$$


Gradient ascent is then performed using the following gradient:
$$\nabla \bar R_{\theta} = \sum_{\tau} R(\tau)\, \nabla P(\tau \mid \theta) = \sum_{\tau} R(\tau)\, P(\tau \mid \theta)\, \frac{\nabla P(\tau \mid \theta)}{P(\tau \mid \theta)} = \sum_{\tau} R(\tau)\, P(\tau \mid \theta)\, \nabla \ln P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})\, \nabla \ln P(\tau^{n} \mid \theta)$$

Here we used the derivative rule for the logarithm:
$$\frac{d \ln f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$$

The probability of a trajectory occurring under policy $\theta$ is:
$$P(\tau \mid \theta) = p(s_1)\, p(a_1 \mid s_1, \theta)\, p(r_1, s_2 \mid s_1, a_1)\, p(a_2 \mid s_2, \theta)\, p(r_2, s_3 \mid s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t \mid s_t, \theta)\, p(r_t, s_{t+1} \mid s_t, a_t)$$

Here $s$ is the game state at each moment and $a$ is the player's action.
Only the $p(a_t \mid s_t, \theta)$ terms depend on the player's policy $\theta$; the other terms, $p(s_1)$ and $p(r_t, s_{t+1} \mid s_t, a_t)$, are determined by the environment and are independent of the policy.

Taking the logarithm and then the gradient:
$$\ln P(\tau \mid \theta) = \ln p(s_1) + \sum_{t=1}^{T} \left[ \ln p(a_t \mid s_t, \theta) + \ln p(r_t, s_{t+1} \mid s_t, a_t) \right]$$
$$\nabla \ln P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \ln p(a_t \mid s_t, \theta)$$

Combining the above, we obtain:
$$\nabla \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})\, \nabla \ln P(\tau^{n} \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \sum_{t=1}^{T_n} \nabla \ln p(a^{n}_t \mid s^{n}_t, \theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n})\, \nabla \ln p(a^{n}_t \mid s^{n}_t, \theta)$$
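In code, this gradient is usually obtained by minimizing a surrogate loss whose gradient matches the estimator above. The following PyTorch-style sketch assumes the log-probabilities $\ln p(a^{n}_t \mid s^{n}_t, \theta)$ were collected during the rollouts (for example via `torch.distributions.Categorical(logits=...).log_prob(action)`); the function and variable names are illustrative, not from the course:

```python
import torch

def reinforce_loss(log_probs, episode_returns):
    """Surrogate loss for the estimator above.

    log_probs[n]       -- list of 0-dim tensors ln p(a_t^n | s_t^n, theta) for episode n
    episode_returns[n] -- total reward R(tau^n) of episode n (a plain float)

    Minimizing this loss performs gradient ascent on the average reward.
    """
    total = 0.0
    for logp_tau, ret in zip(log_probs, episode_returns):
        # weight every log-probability in the episode by the whole-episode return R(tau^n)
        total = total + ret * torch.stack(logp_tau).sum()
    return -total / len(log_probs)
```

Calling `backward()` on this loss and taking an optimizer step then moves $\theta$ in the direction of $\nabla \bar R_{\theta}$.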

Note the following points:

First, the reward multiplied into the formula above is the total return of the whole episode, not the single-step reward; otherwise the actions that only pay off at later time steps could never be learned. (This will be refined in the fourth point below.)

Second, the reason for taking the logarithm:

Taking the logarithm and then the gradient is equivalent to dividing the gradient of the probability by the probability itself:
$$\nabla \ln p(a^{n}_t \mid s^{n}_t, \theta) = \frac{\nabla p(a^{n}_t \mid s^{n}_t, \theta)}{p(a^{n}_t \mid s^{n}_t, \theta)}$$

Dividing by the probability prevents actions that are sampled many times simply because they are already likely, yet yield low rewards, from accumulating an excessive share of the update:
(Figure: the reason for taking the logarithm)
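A small numerical sketch of this normalization effect (the probabilities and rewards below are made up for illustration): accumulating raw rewards over-weights the frequently sampled action `a`, while dividing each sample's reward by its probability, which is exactly what the $\nabla \ln p$ form does, makes the rarely sampled but better action `b` stand out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Action "a" is sampled often but poorly rewarded; "b" is rare but well rewarded.
probs   = {"a": 0.9, "b": 0.1}
rewards = {"a": 1.0, "b": 10.0}

samples = rng.choice(["a", "b"], size=10_000, p=[probs["a"], probs["b"]])

raw        = {k: 0.0 for k in probs}   # accumulate R            (no division by p)
normalized = {k: 0.0 for k in probs}   # accumulate R / p(a|s)   (what the ln form gives)
for act in samples:
    raw[act]        += rewards[act]
    normalized[act] += rewards[act] / probs[act]

print(raw)         # ~ {'a': 9000, 'b': 10000}:   'a' looks almost as good as 'b'
print(normalized)  # ~ {'a': 10000, 'b': 100000}: 'b' is clearly the better action
```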

Third, introduce a baseline:

When the game's rewards are always non-negative, the probability of a high-reward action that happens not to be sampled would only ever decrease (because the probabilities of all the sampled actions are pushed up). To prevent this, a baseline is subtracted:

(Figures: reasons for introducing a baseline)
One of the simplest choices of baseline is the average of $R(\tau)$:
$$b \approx E[R(\tau)]$$
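A minimal sketch of this baseline, assuming the per-episode returns have already been collected (the function name is illustrative):

```python
import numpy as np

def subtract_mean_baseline(episode_returns):
    """Use b = mean of the sampled returns R(tau) as the baseline."""
    episode_returns = np.asarray(episode_returns, dtype=np.float64)
    return episode_returns - episode_returns.mean()

# subtract_mean_baseline([3.0, 5.0, 10.0]) -> array([-3., -1.,  4.])
```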

Fourth, assign appropriate credit to each action:

The action at each time step is weighted only by the sum of the rewards from that time step until the end of the episode:
$$\nabla \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} r_{t'}^{n} - b \right) \nabla \ln p(a^{n}_t \mid s^{n}_t, \theta)$$
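A short sketch of this 'reward-to-go' credit assignment, computed backwards over one episode's per-step rewards:

```python
def reward_to_go(rewards):
    """Weight the action at step t by the sum of rewards from step t to the end of the episode."""
    total, out = 0.0, []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return list(reversed(out))

# reward_to_go([1, 0, 2]) -> [3.0, 2.0, 2.0]
```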

Further, future rewards are discounted, so that the further away a reward is in time, the smaller its influence:
$$\nabla \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^{n} - b \right) \nabla \ln p(a^{n}_t \mid s^{n}_t, \theta)$$

Here the discount factor $\gamma$ takes values in $[0, 1]$, typically $0.9$ or $0.99$. Taking $\gamma = 0$ means only the immediate reward matters; taking $\gamma = 1$ means future rewards count as much as the immediate reward.
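Adding the discount to the reward-to-go above gives the weight that multiplies each $\nabla \ln p(a^{n}_t \mid s^{n}_t, \theta)$; a minimal sketch:

```python
def discounted_reward_to_go(rewards, gamma=0.99):
    """r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    gamma = 0 keeps only the immediate reward; gamma = 1 recovers plain reward-to-go."""
    total, out = 0.0, []
    for r in reversed(rewards):
        total = r + gamma * total
        out.append(total)
    return list(reversed(out))

# discounted_reward_to_go([1, 0, 2], gamma=0.9) ≈ [2.62, 1.8, 2.0]
```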



Original article: blog.csdn.net/Zhang_0702_China/article/details/122528740