Introduction to Deep Reinforcement Learning (DRL) and Classification of Common Algorithms (DQN, DDPG, PPO, TRPO, SAC)

This article briefly introduces the basic concepts of deep reinforcement learning, common algorithms and their workflows, and how they are classified (continuously updated), so that readers can better understand and apply reinforcement learning algorithms to cutting-edge problems in their own fields. Comments and discussion are welcome.

(PS: If you are only interested in the algorithms themselves, you can skip directly to Parts 3 and 4.)

1. Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning, distinct from supervised learning and unsupervised learning. Through continuous interaction with the environment (that is, by taking actions), the agent obtains rewards and continuously optimizes its action policy, with the goal of maximizing its long-term return (the sum of rewards). Reinforcement learning is particularly well suited to sequential decision problems (problems involving a series of ordered decisions).

In practical applications, for some tasks we cannot accurately label every data point or state, but we can still evaluate how good or bad the current situation is, which makes reinforcement learning applicable. Games such as Go and StarCraft II are typical examples.

1.1 Definition of Reinforcement Learning

The agent interacts with its surroundings, known as the environment. The agent receives a reward from the environment once it takes an action in the current state; meanwhile, the environment evolves to the next state. The goal of the agent is to maximize its total reward (the return) in the long run.

The agent continuously interacts with the environment (i.e., takes an action in a given state) and thus obtains a reward, while the environment moves from one state to the next. By continuously optimizing its own action policy, the agent aims to maximize its long-term reward or return (the sum of rewards).

Schematic diagram of the interaction between the agent and the environment
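To make this loop concrete, here is a minimal sketch of the agent-environment interaction, assuming a Gymnasium-style environment; the environment id `CartPole-v1` and the random `select_action` placeholder are illustrative choices, not part of the original text.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

def select_action(observation):
    # Placeholder policy: sample a random action (a real agent would use its learned policy here).
    return env.action_space.sample()

observation, info = env.reset()
total_reward = 0.0
done = False
while not done:
    action = select_action(observation)                                   # agent takes an action
    observation, reward, terminated, truncated, info = env.step(action)   # environment returns reward and next state
    total_reward += reward
    done = terminated or truncated                                        # episode ends at a terminal state

print("Return of this episode:", total_reward)
env.close()
```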

1.2 Related concepts of reinforcement learning

(1) State ($S$): the agent's observation of the environment;
(2) Action ($A$): the means by which the agent interacts with the environment;
(3) Reward ($R_t$): the feedback the agent receives after taking an action in the environment at time step $t$. The return is the sum of rewards obtained by the agent;
(4) Transition probability ($P$): the probability with which the environment evolves from one state to another. The transition can be deterministic, e.g. $S_{t+1} = f(S_t, A_t)$, or stochastic, e.g. $S_{t+1} \sim p\left(S_{t+1} \mid S_t, A_t\right)$;
(5) Discount factor ($\gamma$): measures the importance of future rewards to the agent in its current state;
(6) Trajectory: a sequence of states, actions, and rewards, which can be written as
$$\tau = \left(S_0, A_0, R_0, S_1, A_1, R_1, \ldots\right)$$

The trajectory $\tau$ is used to record how the agent interacts with the environment. The initial state of a trajectory is randomly sampled from the start-state distribution. A trajectory is sometimes called an episode or a round: a sequence from the initial state (e.g., the opening of a game) to the terminal state (e.g., death or victory in the game).
(7) Exploration-exploitation tradeoff: exploration means the agent gathers more information by interacting with the environment, while exploitation means using the currently known information to act as well as possible (for example, a greedy policy). At any single step only one of the two can be chosen, so how to balance exploration and exploitation to maximize the long-term return is a central problem in reinforcement learning.

Therefore, the tuple $(S, A, P, R, \gamma)$ can be used to describe the reinforcement learning process.

1.3 Mathematical Modeling for Reinforcement Learning

(1) A Markov Process (MP) is a discrete-time stochastic process with the Markov property.

The Markov property means that the next state $S_{t+1}$ depends only on the current state $S_t$:

$$p\left(S_{t+1} \mid S_t\right) = p\left(S_{t+1} \mid S_0, S_1, S_2, \ldots, S_t\right)$$

A Markov process can therefore be characterized by a finite state set $\mathcal{S}$ and a state transition matrix $\mathbf{P}$, i.e., the tuple $\langle \mathcal{S}, \mathbf{P} \rangle$.
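As a small illustration of $\langle \mathcal{S}, \mathbf{P} \rangle$, the sketch below samples a trajectory from a toy three-state Markov process; the transition-matrix values are made up for illustration.

```python
import numpy as np

# Toy Markov process <S, P>: 3 states, row s of P is the distribution p(s' | s).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

rng = np.random.default_rng(0)
state = 0                                # start in state 0
trajectory = [state]
for _ in range(10):
    state = rng.choice(3, p=P[state])    # next state depends only on the current state (Markov property)
    trajectory.append(int(state))

print(trajectory)
```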

(2) Markov Reward Process (MRP)

To describe the rewards the environment feeds back to the agent, the Markov reward process extends the MP above from $\langle \mathcal{S}, \mathbf{P} \rangle$ to $\langle \mathcal{S}, \mathbf{P}, R, \gamma \rangle$. Here, $R$ denotes the reward function and $\gamma$ the reward discount factor:

$$R_t = R(S_t)$$

The return is the cumulative reward of the agent along a trajectory. The discounted return is defined as:

$$G_{0:T} = R(\tau) = \sum_{t=0}^{T} \gamma^t R_t$$

The value function $V(s)$ is the expected return of the agent starting from state $s$:
$$V(s) = \mathbb{E}\left[R(\tau) \mid S_0 = s\right]$$
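A short sketch of computing the discounted return $G$ for a finite list of rewards, matching the formula above; the reward values are arbitrary examples.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * R_t over a finite trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Example: rewards collected along one trajectory (arbitrary numbers).
print(discounted_return([1.0, 0.0, 0.5, 1.0], gamma=0.9))
```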

(3) Markov Decision Process (MDP)
MDPs are widely used in economics, control theory, queuing theory, robotics, network analysis, and many other fields.
In a Markov decision process the immediate reward $R$ depends on both the state and the action. An MDP can be characterized by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathbf{P}, R, \gamma \rangle$, where $\mathcal{A}$ denotes a finite action set. The immediate reward then becomes:
$$R_t = R(S_t, A_t)$$

A policy describes how the agent takes actions based on its observation of the environment. A policy is a mapping from a state $s \in \mathcal{S}$ to a probability distribution over actions $\pi(a \mid s)$, where $\pi(a \mid s)$ denotes the probability of taking action $a \in \mathcal{A}$ in state $s$:
$$\pi(a \mid s) = p(A_t = a \mid S_t = s), \quad \text{for any time step } t$$

The expected return is the expected value of the return over all possible trajectories under a given policy, which can be expressed as:
$$J(\pi) = \int_{\tau} p(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right]$$

Here, $p(\tau \mid \pi)$ denotes the probability of a $T$-step trajectory $\tau$ occurring in the Markov decision process, given the initial state distribution $\rho_0$ and the policy $\pi$:

$$p(\tau \mid \pi) = \rho_0(S_0) \prod_{t=0}^{T-1} p(S_{t+1} \mid S_t, A_t)\, \pi(A_t \mid S_t)$$

The reinforcement learning optimization problem is to improve the policy via optimization methods so as to maximize the expected return. The optimal policy $\pi^*$ can be expressed as:

$$\pi^* = \arg\max_{\pi} J(\pi)$$

Given a policy $\pi$, the value function $V(s)$, i.e., the expected return starting from a given state, can be expressed as:

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid S_0 = s\right] = \mathbb{E}_{A_t \sim \pi(\cdot \mid S_t)}\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \,\Big|\, S_0 = s\right]$$

In an MDP, given an action, there is also an action-value function, which is the expected return conditioned on both the state and the action. It is defined as:

$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid S_0 = s, A_0 = a\right] = \mathbb{E}_{A_t \sim \pi(\cdot \mid S_t)}\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \,\Big|\, S_0 = s, A_0 = a\right]$$

From the above definitions we obtain:
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a)\right]$$
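To illustrate the relation $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a)\right]$, the sketch below computes $V$ from a small tabular $Q$ and a stochastic policy; all numbers are made-up examples.

```python
import numpy as np

# Tabular Q^pi(s, a) for 2 states x 3 actions (illustrative values).
Q = np.array([
    [1.0, 2.0, 0.5],
    [0.0, 1.5, 3.0],
])

# Policy pi(a | s): each row is a probability distribution over actions.
pi = np.array([
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
])

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
V = (pi * Q).sum(axis=1)
print(V)   # expected return of each state under pi
```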

2. Deep Reinforcement Learning

Deep Learning + Reinforcement Learning = Deep Reinforcement Learning (DRL)
Deep learning (DL) has strong abstraction and representation capabilities, making it especially suitable for modeling value functions in RL, for example the action-value function $Q^{\pi}(s, a)$.
The combination of the two greatly expands the application range of RL.

3. Common Deep Reinforcement Learning Algorithms

There are many deep reinforcement learning algorithms; common ones include DQN, DDPG, PPO, TRPO, A3C, SAC, and so on.
(The brief process of each algorithm will be supplemented later)

3.1 Deep Q-Networks (DQN)

DQN combines Q-learning with deep learning and introduces two techniques to address the instability caused by using a nonlinear function approximator (such as a neural network) to represent the action-value function $Q(s, a)$:
Technique 1: Experience replay buffer: the agent's experience is stored in a buffer, and mini-batches of samples are drawn uniformly (or with prioritized sampling) from the buffer for the Q-learning update;
Technique 2: Target network: an independent network, rather than the online Q network, is used to generate the Q-learning targets, further improving the stability of training.

Technique 1 improves sample efficiency, reduces the correlation between samples, and smooths the learning process; Technique 2 keeps the target values from being affected by the latest parameters, greatly reducing divergence and oscillation.

Tested on a variety of Atari games, the DQN algorithm shows excellent performance.

The specific description of the DQN algorithm is as follows:

1. Hyperparameters: replay buffer capacity $B$, reward discount factor $\gamma$, target network update delay $C$;
2. Input: replay buffer $\mathcal{B}$; initialize the action-value network $Q(s, a \mid \theta^Q)$ and the target network $Q^{\prime}(s, a \mid \theta^{Q^{\prime}})$;
3. $\qquad$ Initialize the target network parameters: $\theta^{Q^{\prime}} \leftarrow \theta^{Q}$
4. for $episode = 1, \cdots, M$ do
5. $\qquad$ Initialize the environment;
6. $\qquad$ Obtain the initial state $S_0$;
7. $\qquad$ for $t = 1, \cdots, T$ do
8. $\qquad\qquad$ With probability $\epsilon$ choose a random action $A_t$; otherwise choose $A_t = \arg\max\limits_{a} Q\left(S_t, a \mid \theta^Q\right)$;
9. $\qquad\qquad$ Execute action $A_t$, obtain reward $R_t$, and observe the next state $S_{t+1}$;
10. $\qquad\qquad$ Store the transition $\left(S_t, A_t, R_t, S_{t+1}\right)$ in $\mathcal{B}$;
11. $\qquad\qquad$ Sample a random mini-batch of $N$ transitions $\left(S_i, A_i, R_i, S_{i+1}\right)$ from $\mathcal{B}$;
12. $\qquad\qquad$ Set $Y_i = R_i + \gamma \max\limits_{a} Q^{\prime}\left(S_{i+1}, a \mid \theta^{Q^{\prime}}\right)$;
13. $\qquad\qquad$ Update the network $Q(s, a \mid \theta^Q)$ by minimizing the loss:
14. $\qquad\qquad$ $L = \frac{1}{N}\sum_i \left(Y_i - Q(S_i, A_i \mid \theta^Q)\right)^2$
15. $\qquad\qquad$ Every $C$ steps, synchronize the target network $Q^{\prime}(s, a \mid \theta^{Q^{\prime}})$:
16. $\qquad\qquad$ $\theta^{Q^{\prime}} \leftarrow \theta^{Q}$
17. $\qquad$ end for
18. end for


Note: the random-action probability $\epsilon$ usually decays gradually as the episode and time-step counts increase; the purpose is to reduce the influence of the random policy and gradually increase the influence of the Q network on the agent's action selection.
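A common way to implement this decay is an exponential (or linear) schedule; the sketch below is one possible choice, with illustrative hyperparameter values.

```python
import math

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 10_000   # illustrative values

def epsilon_by_step(step):
    # Exponential decay from EPS_START towards EPS_END as the global step grows.
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)

print(epsilon_by_step(0), epsilon_by_step(10_000), epsilon_by_step(100_000))
```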

Solving line 14 amounts to a single gradient step:
$$\theta^Q \leftarrow \theta^Q + \beta \sum\limits_{i \in \mathcal{N}} \frac{\partial Q(S_i, A_i \mid \theta^Q)}{\partial \theta^Q}\left[Y_i - Q(S_i, A_i \mid \theta^Q)\right]$$
where $\mathcal{N}$ is the mini-batch of $N$ transition samples $\left(S_t, A_t, R_t, S_{t+1}\right)$ and $\beta$ is the step size of one gradient iteration.
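The following is a minimal PyTorch sketch of lines 11-16: computing the target $Y_i$, the mean-squared loss of line 14, one gradient step, and the periodic target-network synchronization. The networks, optimizer, and the terminal-state mask `dones` (a standard practical addition not written in the pseudocode above) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update on a mini-batch of transitions (S_i, A_i, R_i, S_{i+1}, done_i)."""
    states, actions, rewards, next_states, dones = batch

    # Q(S_i, A_i | theta^Q) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Y_i = R_i + gamma * max_a Q'(S_{i+1}, a | theta^{Q'}); target network, no gradient.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = nn.functional.mse_loss(q_values, targets)   # L = (1/N) * sum_i (Y_i - Q)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    # Every C steps: theta^{Q'} <- theta^{Q}.
    target_net.load_state_dict(q_net.state_dict())
```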

References:
[1] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.

3.2 Deep Deterministic Policy Gradient (DDPG)

The DDPG algorithm can be regarded as the combination of the Deterministic Policy Gradient (DPG) algorithm with deep neural networks; it is an extension of the deep Q-network (DQN) above to continuous action spaces.

DDPG maintains both a Q-value function (the critic) and a policy function (the actor). The critic is updated with the temporal-difference (TD) method, as in DQN, while the actor is updated via the policy gradient using the critic's estimate.

The DDPG algorithm is specifically described as follows:

1. Hyperparameters: soft update factor $\rho$, reward discount factor $\gamma$;
2. Input: replay buffer $\mathcal{B}$; initialize the critic network $Q(s, a \mid \theta^Q)$ and the actor network $\pi(s \mid \theta^{\pi})$,
3. $\qquad$ as well as their target networks $Q^{\prime}$ and $\pi^{\prime}$;
4. $\qquad$ Initialize the target network parameters of $Q^{\prime}$ and $\pi^{\prime}$:
5. $\qquad$ $\theta^{Q^{\prime}} \leftarrow \theta^{Q}$, $\theta^{\pi^{\prime}} \leftarrow \theta^{\pi}$
6. for $episode = 1, \cdots, M$ do
7. $\qquad$ Initialize a random process $\mathcal{N}$ for action exploration;
8. $\qquad$ Obtain the initial state $S_1$;
9. $\qquad$ for $t = 1, \cdots, T$ do
10. $\qquad\qquad$ Select the action $A_t = \pi(S_t \mid \theta^{\pi}) + \mathcal{N}_t$;
11. $\qquad\qquad$ Execute action $A_t$, obtain reward $R_t$, and observe the next state $S_{t+1}$;
12. $\qquad\qquad$ Store the transition $\left(S_t, A_t, R_t, S_{t+1}\right)$ in $\mathcal{B}$;
13. $\qquad\qquad$ Sample a random mini-batch of $N$ transitions $\left(S_i, A_i, R_i, S_{i+1}\right)$ from $\mathcal{B}$;
14. $\qquad\qquad$ Set $Y_i = R_i + \gamma Q^{\prime}\left(S_{i+1}, \pi^{\prime}(S_{i+1} \mid \theta^{\pi^{\prime}}) \mid \theta^{Q^{\prime}}\right)$;
15. $\qquad\qquad$ Update the critic network by minimizing the loss:
16. $\qquad\qquad$ $L = \frac{1}{N}\sum_i \left(Y_i - Q(S_i, A_i \mid \theta^Q)\right)^2$
17. $\qquad\qquad$ Update the actor network via the sampled policy gradient:
18. $\qquad\qquad$ $\nabla_{\theta^{\pi}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=S_i,\, a=\pi(S_i)}\, \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi})\big|_{S_i}$
19. $\qquad\qquad$ Update the target networks:
20. $\qquad\qquad$ $\theta^{Q^{\prime}} \leftarrow \rho\,\theta^{Q} + (1-\rho)\,\theta^{Q^{\prime}}$
21. $\qquad\qquad$ $\theta^{\pi^{\prime}} \leftarrow \rho\,\theta^{\pi} + (1-\rho)\,\theta^{\pi^{\prime}}$
22. $\qquad$ end for
23. end for

The original paper uses an Ornstein-Uhlenbeck (O-U) process for the added noise term $\mathcal{N}$; time-uncorrelated zero-mean Gaussian noise can also be used (practice shows it works just as well).
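A minimal PyTorch sketch of one DDPG update (lines 13-21 of the pseudocode above): critic TD target and MSE loss, actor update via the deterministic policy gradient, and soft (Polyak) target updates. The network and optimizer objects, the critic signature `critic(states, actions)`, and the terminal mask `dones` are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, rho=0.005):
    states, actions, rewards, next_states, dones = batch

    # Critic target: Y_i = R_i + gamma * Q'(S_{i+1}, pi'(S_{i+1})).
    with torch.no_grad():
        next_actions = target_actor(next_states)
        targets = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions).squeeze(1)
    q_values = critic(states, actions).squeeze(1)
    critic_loss = nn.functional.mse_loss(q_values, targets)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, pi(s))  <=>  minimize -Q(s, pi(s))  (sampled policy gradient).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: theta' <- rho * theta + (1 - rho) * theta'.
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - rho).add_(rho * p.data)
```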

References:
[1] Lillicrap, Timothy P., et al. “Continuous control with deep reinforcement learning”, arXiv preprint, 2015, online: https://arxiv.org/pdf/1509.02971.pdf

3.3 Proximal Policy Optimization (PPO)

The PPO algorithm is an improvement on Trust Region Policy Optimization (TRPO); it uses a simpler and more effective way of keeping the new policy $\pi_{\theta}^{\prime}$ close to the old policy $\pi_{\theta}$.
TRPO solves a constrained optimization problem:
$$\max\limits_{\pi_{\theta}^{\prime}} \; \mathcal{L}_{\pi_{\theta}}\left(\pi_{\theta}^{\prime}\right) \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\theta}}}\left[D_{KL}\left(\pi_{\theta} \,\|\, \pi_{\theta}^{\prime}\right)\right] \le \delta$$
whereas PPO directly optimizes a regularized version of the above problem, namely:

$$\max\limits_{\pi_{\theta}^{\prime}} \; \mathcal{L}_{\pi_{\theta}}\left(\pi_{\theta}^{\prime}\right) - \lambda\, \mathbb{E}_{s \sim \rho_{\pi_{\theta}}}\left[D_{KL}\left(\pi_{\theta} \,\|\, \pi_{\theta}^{\prime}\right)\right]$$

Here, $\lambda$ is a regularization coefficient. For each $\delta$ there is a corresponding $\lambda$ such that the two optimization problems have the same solution. However, the value of $\lambda$ depends on $\pi_{\theta}$, so PPO must use a dynamically adjusted $\lambda$. Specifically, there are two variants:
(1) Increase or decrease $\lambda$ by checking the value of the KL divergence; this version of the algorithm is called PPO-Penalty;
(2) Directly clip the objective function used for the policy gradient to obtain a more conservative update; this version is called PPO-Clip.

The specific description of the PPO-Clip algorithm is as follows:

1. Hyperparameters: clipping factor $\epsilon$, numbers of sub-iterations $M, B$;
2. Input: initial policy parameters $\theta$, initial value-function parameters $\phi$;
3. for $k = 1, \cdots$ do
4. $\qquad$ Run policy $\pi_{\theta_k}$ in the environment and collect the set of trajectories $\mathcal{D}_k = \left\{\tau_i\right\}$;
5. $\qquad$ Compute the rewards-to-go $\hat{G}_t$;
6. $\qquad$ Compute the advantage estimates $\hat{A}_t$ based on the current value function $V_{\phi_k}$ (using any advantage-estimation method);
7. $\qquad$ for $m = 1, \cdots, M$ do
8. $\qquad\qquad$ $\ell_t(\theta^{\prime}) = \frac{\pi_{\theta}(A_t \mid S_t)}{\pi_{\theta_{old}}(A_t \mid S_t)}$
9. $\qquad\qquad$ Update the policy by maximizing the PPO-Clip objective with Adam (stochastic gradient ascent):
10. $\qquad\qquad$ $\theta_{k+1} = \arg\max\limits_{\theta} \frac{1}{|\mathcal{D}_k| T} \sum\limits_{\tau \in \mathcal{D}_k} \sum\limits_{t=0}^{T} f(\theta^{\prime})$
11. $\qquad$ end for
12. $\qquad$ for $b = 1, \cdots, B$ do
13. $\qquad\qquad$ Fit the value function by minimizing the mean-squared error via gradient descent:
14. $\qquad\qquad$ $\phi_{k+1} = \arg\min\limits_{\phi} \frac{1}{|\mathcal{D}_k| T} \sum\limits_{\tau \in \mathcal{D}_k} \sum\limits_{t=0}^{T} \left(V_{\phi}(S_t) - \hat{G}_t\right)^2$
15. $\qquad$ end for
16. end for


where
$$f(\theta^{\prime}) = \min\left(\ell_t(\theta^{\prime})\, A^{\pi_{\theta_{old}}}(S_t, A_t),\; \mathrm{clip}\left(\ell_t(\theta^{\prime}), 1-\epsilon, 1+\epsilon\right) A^{\pi_{\theta_{old}}}(S_t, A_t)\right)$$

Here, $\mathrm{clip}(x, 1-\epsilon, 1+\epsilon)$ means truncating $x$ to the interval $[1-\epsilon, 1+\epsilon]$.
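A minimal PyTorch sketch of the PPO-Clip objective $f(\theta^{\prime})$ above, given log-probabilities under the new and old policies and advantage estimates; the sign is flipped because optimizers minimize, and the batch averaging is an implementation choice of this sketch.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO-Clip surrogate: negative of f(theta') averaged over the batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # l_t(theta') = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize f  <=>  minimize -f
```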

References:
[1] Schulman, J., et al. “Proximal Policy Optimization Algorithms”, arXiv preprint, 2017, online: https://arxiv.org/pdf/1707.06347.pdf
[2] Schulman J, Levine S, Abbeel P, et al. “Trust region policy optimization”, International Conference on Machine Learning. PMLR, 2015: 1889-1897, online: http://proceedings.mlr.press/v37/schulman15.pdf

4. Classification of Deep Reinforcement Learning Algorithms

There are many classification standards for deep reinforcement learning algorithms. The common classification methods and specific categories are summarized as follows:

4.1 Classification by whether the policy used to generate data and the policy being evaluated and improved are the same

4.1.1 Off-policy

The policy $\pi_1$ used by the agent during training (to generate data) differs from the policy $\pi_2$ being evaluated and improved.

For example, in the DQN algorithm, the $\epsilon$-greedy policy is usually used during training, while the greedy policy $a^* = \arg\max\limits_{a} Q^{\pi}\left(s, a\right)$ is used when evaluating performance or in actual deployment.

Common algorithms are: DDPG, TD3, Q-learning, DQN, etc.

4.1.2 On-policy

The policy used by the agent during training (to generate data) and the policy being evaluated and improved are the same policy $\pi$.

Common algorithms include: Sarsa, Policy Gradient, TRPO, PPO, A3C, etc.

4.2 Classification by policy-optimization approach

4.2.1 Value-based algorithms

Value-based methods optimize the action-value function $Q^{\pi}(s, a)$; the optimal policy then selects the action corresponding to the maximum of $Q^{\pi}(s, a)$, i.e., $\pi^*(s) \approx \arg\max\limits_{a} Q^{\pi}(s, a)$, where $\approx$ accounts for the function-approximation error.

Value-based algorithms have the advantages of relatively high sample efficiency, low variance in the value estimates, and less tendency to get stuck in local optima. Their disadvantage is that they usually cannot handle continuous action spaces, and the resulting policy is usually deterministic.

Common algorithms include Q-learning, DQN, Double DQN, etc., which are suited to discrete action spaces. DQN, for instance, selects the optimal action based on the state-action value function $Q(s, a)$.
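As a value-based example, here is a minimal tabular Q-learning update (the non-deep ancestor of DQN) for a discrete state/action space; the table size, learning rate, and example transition are illustrative.

```python
import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))      # tabular Q(s, a)

def q_learning_update(s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition (illustrative numbers).
q_learning_update(s=0, a=1, r=1.0, s_next=2, done=False)
print(Q[0])
```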

4.2.2 Policy-based algorithms

Policy-based methods optimize the policy directly, iteratively updating it to maximize the cumulative reward (return). They have the advantages of simple policy parameterization and fast convergence, and are suitable for continuous or high-dimensional action spaces.

The policy gradient method (PGM) is a class of reinforcement learning methods that optimize the policy directly with respect to the expected return via gradient ascent (equivalently, gradient descent on the negated objective). It does not need to solve a value-maximization problem over the action space, so it is better suited to continuous and high-dimensional action spaces and can also naturally model stochastic policies.

The PGM optimizes the agent's policy directly over the parameters of a neural network via gradient ascent.

According to the policy gradient theorem, the gradient of the expected return $J(\pi_{\theta})$ with respect to the parameters $\theta$ can be expressed as:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t=0}^{T} R_t\, \nabla_{\theta} \sum\limits_{t^{\prime}=0}^{T} \log \pi_{\theta}(A_{t^{\prime}} \mid S_{t^{\prime}})\right] = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t^{\prime}=0}^{T} \nabla_{\theta} \log \pi_{\theta}(A_{t^{\prime}} \mid S_{t^{\prime}}) \sum\limits_{t=0}^{T} R_t\right]$$

When $T \rightarrow \infty$, the above formula can be written as:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t^{\prime}=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(A_{t^{\prime}} \mid S_{t^{\prime}})\, \gamma^{t^{\prime}} \sum\limits_{t=t^{\prime}}^{\infty} \gamma^{t-t^{\prime}} R_t\right]$$
In practice, the factor $\gamma^{t^{\prime}}$ is usually dropped, so as to avoid overemphasizing the early states of the trajectory.

The above estimator tends to have a large variance in the gradient estimate (the randomness of the reward $R_t$ may grow exponentially with the trajectory length). A common remedy is to introduce a baseline function $b(S_{t^{\prime}})$, a function of the state $S_{t^{\prime}}$ only. The gradient above can then be modified to:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t^{\prime}=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}(A_{t^{\prime}} \mid S_{t^{\prime}})\left(\sum\limits_{t=t^{\prime}}^{\infty} \gamma^{t-t^{\prime}} R_t - b(S_{t^{\prime}})\right)\right]$$

Common PGM algorithms include REINFORCE, PG, PPO, TRPO, etc.
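A minimal REINFORCE-style sketch of the gradient above for a discrete-action policy in PyTorch: it computes the rewards-to-go with the $\gamma^{t-t^{\prime}}$ weighting and subtracts a simple constant baseline (the batch mean of the returns) instead of a learned state-dependent $b(S_{t^{\prime}})$; that baseline choice and the shapes are assumptions of this sketch.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: tensor of log pi_theta(A_t | S_t) along one trajectory; rewards: list of R_t."""
    T = len(rewards)
    returns = torch.zeros(T)
    g = 0.0
    for t in reversed(range(T)):             # rewards-to-go: sum_{k >= t} gamma^{k-t} R_k
        g = rewards[t] + gamma * g
        returns[t] = g
    baseline = returns.mean()                # simple constant baseline b
    # Gradient ascent on sum_t log pi(A_t|S_t) * (G_t - b)  =>  minimize the negative.
    return -(log_probs * (returns - baseline)).sum()
```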

4.2.3 Actor-Critic algorithms

The actor-critic method combines the value-based and policy-based methods above: it uses a value-based method to learn a Q function or state-value function $V$ to improve sample efficiency (the critic), and a policy-based method to learn the policy function (the actor), making it suitable for continuous or high-dimensional action spaces. Its shortcomings are also inherited from both: for example, the critic suffers from overestimation, while the actor may suffer from insufficient exploration.

Common algorithms include DDPG, A3C, TD3, SAC, etc., which are suited to continuous and high-dimensional action spaces.

4.3 Classification by parameter-update method


4.3.1 Monte Carlo method

Monte Carlo methods: the update cannot be performed until a complete trajectory $\tau_k$ (the true return) has been generated.

Common algorithms include: Policy Gradient, TRPO, PPO, etc.

4.3.2 Temporal Difference method

Temporal-difference methods: using bootstrapping (estimated values), the update can be performed immediately at every step of action execution.

Common algorithms are: DDPG, Q-learning, DQN, etc.
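To contrast the two update styles, the sketch below shows a Monte Carlo value update (which needs the full-trajectory return $G_t$) versus a bootstrapped TD(0) update; the learning rate $\alpha$ and the toy numbers are illustrative.

```python
def mc_update(V, s, G_t, alpha=0.1):
    """Monte Carlo: wait for the full return G_t of the trajectory, then update V(s)."""
    V[s] += alpha * (G_t - V[s])

def td0_update(V, s, r, s_next, gamma=0.99, alpha=0.1):
    """TD(0): bootstrap with the current estimate V(s_next) at every step."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {0: 0.0, 1: 0.0}                   # toy state-value table
mc_update(V, s=0, G_t=1.5)             # only possible after a whole episode
td0_update(V, s=0, r=1.0, s_next=1)    # possible immediately after one step
print(V)
```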
