This article briefly introduces the policy gradient method in deep reinforcement learning, based on Mr. Li Hongyi's machine learning course.
Bilibili link to Li Hongyi's course:
Li Hongyi, deep reinforcement learning, policy gradient
Related Notes:
Proximal Policy Optimization Algorithm Brief
DQN (deep Q-network) Algorithm Brief
Actor-Critic Algorithm Brief
Assume:
the trajectory of one game (trajectory): $\tau$
the player's (actor's) strategy (policy): $\theta$
Then the expected reward (reward $R$ is a random variable) can be estimated by sampling $N$ times:

$$\bar R_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})$$
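This Monte Carlo estimate can be sketched numerically; the one-step "game" and its reward below are assumptions for illustration only:

```python
import random

def sample_trajectory(theta, rng):
    # Toy one-step game: the policy theta is the probability of taking
    # action 1, which pays reward 1; action 0 pays reward 0.
    return 1.0 if rng.random() < theta else 0.0

def estimate_expected_reward(theta, n_samples, seed=0):
    # \bar R_theta ≈ (1/N) * sum_n R(tau^n)
    rng = random.Random(seed)
    return sum(sample_trajectory(theta, rng) for _ in range(n_samples)) / n_samples

# With theta = 0.7 the true expected reward is 0.7; the sample
# average approaches it as N grows.
print(estimate_expected_reward(0.7, 100_000))
```

As in the formula, the estimator never needs $P(\tau \mid \theta)$ explicitly; sampling trajectories from the policy already weights them by their probability.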
The optimal policy is:

$$\theta^{*} = \arg\max_{\theta} \bar R_{\theta}$$
To apply gradient ascent, compute the gradient of the expected reward:

$$\nabla \bar R_{\theta} = \sum_{\tau} R(\tau) \nabla P(\tau \mid \theta) = \sum_{\tau} R(\tau) P(\tau \mid \theta) \frac{\nabla P(\tau \mid \theta)}{P(\tau \mid \theta)} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \nabla \ln P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \nabla \ln P(\tau^{n} \mid \theta)$$
Here the derivative rule for the logarithm is used:

$$\frac{d \ln f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$$
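The rule can be verified numerically with a central finite difference; the function $f(x) = x^2 + 1$ below is an assumed example:

```python
import math

def f(x):
    return x * x + 1.0

def grad_ln_f(x, h=1e-6):
    # d ln f(x) / dx approximated by a central finite difference
    return (math.log(f(x + h)) - math.log(f(x - h))) / (2 * h)

x = 1.5
analytic = (2 * x) / f(x)   # (1/f(x)) * df/dx = 2x / (x^2 + 1)
numeric = grad_ln_f(x)
print(abs(analytic - numeric) < 1e-6)  # True
```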
The probability of a trajectory occurring under a given policy is:

$$P(\tau \mid \theta) = p(s_1)\, p(a_1 \mid s_1, \theta)\, p(r_1, s_2 \mid s_1, a_1)\, p(a_2 \mid s_2, \theta)\, p(r_2, s_3 \mid s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t \mid s_t, \theta)\, p(r_t, s_{t+1} \mid s_t, a_t)$$

where $s$ is the game state at each time step and $a$ is the player's action.
Only the factor $p(a_t \mid s_t, \theta)$ depends on the player's policy $\theta$; the other two terms, $p(s_1)$ and $p(r_t, s_{t+1} \mid s_t, a_t)$, come from the environment and are independent of the policy.
Taking the logarithm and then the gradient:

$$\ln P(\tau \mid \theta) = \ln p(s_1) + \sum_{t=1}^{T} \left[ \ln p(a_t \mid s_t, \theta) + \ln p(r_t, s_{t+1} \mid s_t, a_t) \right]$$

$$\nabla \ln P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \ln p(a_t \mid s_t, \theta)$$
We then have:

$$\nabla \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \nabla \ln P(\tau^{n} \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \sum_{t=1}^{T_n} \nabla \ln p(a^n_t \mid s^n_t, \theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n}) \nabla \ln p(a^n_t \mid s^n_t, \theta)$$
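This estimator can be sketched end to end for a one-parameter policy; the sigmoid/Bernoulli policy and the one-step game below are assumptions chosen so the true gradient is known in closed form:

```python
import math
import random

def policy_prob(theta, a):
    # p(a=1 | theta) = sigmoid(theta); p(a=0 | theta) = 1 - sigmoid(theta)
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return p1 if a == 1 else 1.0 - p1

def grad_ln_policy(theta, a):
    # d/dtheta ln p(a | theta) for the sigmoid/Bernoulli policy:
    # 1 - sigmoid(theta) if a == 1, else -sigmoid(theta)
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p1) if a == 1 else -p1

def policy_gradient_estimate(theta, n, rng):
    # grad Rbar ≈ (1/N) * sum_n R(tau^n) * grad ln p(a^n | theta)
    # One-step game: action 1 -> reward 1, action 0 -> reward 0.
    total = 0.0
    for _ in range(n):
        a = 1 if rng.random() < policy_prob(theta, 1) else 0
        reward = float(a)
        total += reward * grad_ln_policy(theta, a)
    return total / n

rng = random.Random(0)
g = policy_gradient_estimate(theta=0.0, n=50_000, rng=rng)
# True gradient: E[R] = sigmoid(theta), whose derivative at theta=0 is 0.25;
# the sample estimate g should be close to that.
print(g)
```

A deep-RL implementation replaces `grad_ln_policy` with automatic differentiation of the network's log-probability output, but the averaging structure is the same.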
Note the following points:
First, the reward multiplying the gradient above is the return of the whole trajectory, not the single-step reward; otherwise actions whose payoff only appears at later time steps could never be learned. (The fourth point refines this further.)
Second, the reason for taking the logarithm:
Taking the logarithm before differentiating is equivalent to dividing the gradient of the probability by the probability itself:

$$\nabla \ln p(a^n_t \mid s^n_t, \theta) = \frac{\nabla p(a^n_t \mid s^n_t, \theta)}{p(a^n_t \mid s^n_t, \theta)}$$

Dividing by the probability normalizes for sampling frequency: an action that yields a low reward but happens to be sampled many times will not accumulate an outsized update merely because it appears often.
Third, introduce a baseline:
When the game's rewards are always non-negative, every sampled action has its probability pushed up, so the probability of a high-reward action that happens not to be sampled would decrease in relative terms; a baseline is subtracted to prevent this.
One of the simplest choices is to set the baseline to the average of $R(\tau)$:

$$b \approx E[R(\tau)]$$
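A minimal sketch of this baseline, with assumed toy trajectory returns: the sample mean serves as $b$, and subtracting it makes below-average trajectories push their actions' probabilities down even though all raw returns are positive.

```python
def mean_baseline(returns):
    # b ≈ E[R(tau)], estimated from the sampled trajectory returns
    return sum(returns) / len(returns)

returns = [3.0, 1.0, 2.0, 6.0]   # toy R(tau^n) values (assumed)
b = mean_baseline(returns)
weights = [r - b for r in returns]  # R(tau^n) - b
print(b)        # 3.0
print(weights)  # [0.0, -2.0, -1.0, 3.0]
```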
Fourth, assign appropriate credit to each action:
For the action at each time step, count only the sum of rewards from that time point until the end of the game:

$$\nabla \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} r_{t'}^{n} - b \right) \nabla \ln p(a^n_t \mid s^n_t, \theta)$$
Further, discount future rewards, so that the farther away a reward is in time, the smaller its influence:

$$\nabla \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n} - b \right) \nabla \ln p(a^n_t \mid s^n_t, \theta)$$

where the discount factor $\gamma$ lies in $[0, 1]$ and is usually $0.9$ or $0.99$. Taking $\gamma = 0$ means caring only about the immediate reward; taking $\gamma = 1$ means future rewards count as much as immediate ones.
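The inner sum $\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n}$ for all $t$ can be computed in a single backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$; a sketch, with an assumed example reward sequence:

```python
def discounted_rewards_to_go(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards from the final step
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))
# G_2 = 2.0, G_1 = 0 + 0.5*2.0 = 1.0, G_0 = 1 + 0.5*1.0 = 1.5
# -> [1.5, 1.0, 2.0]
```

These per-step returns (minus the baseline) are exactly the weights applied to each $\nabla \ln p(a^n_t \mid s^n_t, \theta)$ term in the formula above.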