[Paper Reading] Reinforcement Learning - Proximal Policy Optimization Algorithms (PPO)

(1) Title

Written in front: This article introduces the PPO optimization method and the derivation of some of its formulas. In the original paper, the authors present three optimization methods; the third is an extension of the first, and these two are widely used and work well. The second method did not perform as well in the experiments, but it is still a useful trick and is also analyzed in the paper.

(2) Abstract

Deep reinforcement learning training is hard to stabilize: performance tends to degrade during training and is difficult to recover, leading to unstable training. The Natural Policy Gradient (NPG) [1] algorithm addresses the convergence problems of policy gradient methods, but it requires computing a second-order derivative matrix, which limits its performance and scalability in practice. Much existing work focuses on reducing algorithmic complexity by approximating the second-order optimization. This article discusses the Proximal Policy Optimization (PPO) [2] algorithm and analyzes an optimization strategy with a different idea: instead of imposing a hard constraint, PPO treats the constraint as a penalty term in the objective function and optimizes the model with a first-order algorithm, which greatly reduces the complexity of the method. Experimental results show that PPO is more general, simpler to implement, and has better sample complexity than TRPO (Trust Region Policy Optimization).

(3) Introduction

In recent years, researchers have proposed several methods that use neural network function approximators for reinforcement learning, such as Deep Q-Learning [3], "vanilla" policy gradient methods [4], and trust region / natural policy gradient methods [5]. Each has its deficiencies: Deep Q-Learning fails on many simple tasks and is poorly understood; the "vanilla" policy gradient method has poor data efficiency and robustness; TRPO performs better than the others, but it is relatively complicated and is not easily compatible with architectures that share parameters or include noise. The PPO [2] algorithm studied in this paper attains the data efficiency and reliable performance of TRPO while using only first-order optimization.

The PPO algorithm proposes a new objective function with a clipped probability ratio, which yields a pessimistic estimate of performance, i.e., a lower bound on it. To optimize the policy, PPO alternates between sampling data with the policy and performing several epochs of optimization on the sampled data. In addition, PPO uses importance sampling, so the data used for optimization need not come from the policy currently being optimized; allowing multiple epochs of training on a single batch of samples increases data efficiency.

The authors propose three optimization methods to implement the PPO algorithm. The final experiments show that PPO performs better and has lower complexity than other algorithms, especially on continuous control tasks.

(4) Background: Policy Optimization

1. Policy Gradient Methods

Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The following objective is usually used for gradient ascent; in implementations it is typically negated so that a gradient descent optimizer can be used:
$$\hat{g}=\hat{\mathbb{E}}_t\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\hat{A}_t\right]\tag{1}$$
Here $\pi_{\theta}$ is a stochastic policy, $\hat{A}_t$ is an estimate of the advantage function at timestep $t$, and $\hat{\mathbb{E}}_t$ denotes the empirical average over a finite batch of samples. In practice, automatic differentiation software can be used to construct an objective function whose gradient is the policy gradient estimator; the estimator is obtained by differentiating the following objective:
$$L^{PG}(\theta)=\hat{\mathbb{E}}_t\left[\log\pi_{\theta}(a_t\mid s_t)\hat{A}_t\right]\tag{2}$$
Although it is tempting to perform multiple steps of optimization on $L^{PG}(\theta)$ using the same trajectory, the drawback of this method is obvious: after each parameter update, the agent must interact with the environment again and compute the advantage function of the new policy before the parameters can be updated again. In other words, a trajectory can only be used for a single parameter update, so most of the time is spent interacting with the environment.
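As an illustration (not from the paper), the following is a minimal PyTorch-style sketch of the objective in Equation 2: the loss is the negative of $L^{PG}$, so minimizing it with gradient descent performs gradient ascent on the objective. The tensor names `log_probs` and `advantages` are assumptions made for this example.

```python
import torch

def vanilla_pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative of L^PG (Eq. 2); minimizing this loss ascends the policy gradient.

    log_probs:  log pi_theta(a_t | s_t) of the sampled actions, shape [T]
    advantages: advantage estimates A_hat_t, shape [T]
    """
    # Advantages are treated as constants, so gradients flow only through log pi_theta.
    return -(log_probs * advantages.detach()).mean()
```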

2. Trust Region Methods

In TRPO [5], the objective function is maximized subject to constraints on the policy update size.
$$\underset{\theta}{\text{maximize}}\quad\hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}\hat{A}_t\right]\tag{3}$$

$$\text{subject to}\quad\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot\mid s_t),\pi_{\theta}(\cdot\mid s_t)\right]\right]\leq\delta\tag{4}$$

Here $\theta_{old}$ is the vector of policy parameters before the update. The main difference from Equation 2 is the added constraint, which requires the KL divergence between the policies before and after the update to be smaller than a threshold $\delta$. After a linear approximation of the objective and a quadratic approximation of the constraint, the problem can be solved approximately and efficiently with the conjugate gradient algorithm.

Since TRPO needs to approximate second-order derivatives, its complexity is high. An intuitive alternative is to move the constraint into the objective as a penalty term, so that a first-order algorithm can be used directly. In fact, the theory justifying TRPO suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem in Equation 5:

$$\underset{\theta}{\text{maximize}}\quad\hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}\hat{A}_t-\beta\,\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot\mid s_t),\pi_{\theta}(\cdot\mid s_t)\right]\right]\tag{5}$$

TRPO allows a trajectory to update the parameters multiple times, which saves time. However, the constraint that the old and new policies must not differ too much is difficult to handle in practice. Even if the constraint is moved into the objective as a penalty term, the coefficient $\beta$ must be adjusted adaptively to obtain good performance: it is hard to choose a single $\beta$ that works well across different problems, or even within one problem, whose characteristics change over the course of learning. Therefore, simply fixing a penalty coefficient and optimizing the penalized objective of Equation 5 with SGD is not enough; additional modifications are required.

(5) Clipped Surrogate Objective

Define the probability ratio $r_t(\theta)=\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}$; then $r_t(\theta_{old})=1$.

​ The objective function of TRPO can be further rewritten as:
$$L^{CPI}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}\hat{A}_t\right]=\hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]\tag{6}$$
The superscript CPI refers to Conservative Policy Iteration [6]. Without a constraint, maximizing $L^{CPI}$ would lead to an excessively large policy update. The authors therefore consider how to modify the objective so as to penalize policy updates that move $r_t(\theta)$ far from 1: when $r_t(\theta)$ is far from 1, the policies before and after the update differ significantly, and such large updates are exactly what should be avoided.

To ensure that the new policy does not differ too much from the old one, the authors clip the probability ratio to limit the change between them. The proposed objective is as follows:
$$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]\tag{7}$$
Here $\epsilon$ is a hyperparameter, set to 0.2. The first term inside the $\min$ is $L^{CPI}$, and the second term clips the probability ratio, which keeps $r_t$ within the interval $[1-\epsilon,\,1+\epsilon]$. Taking the minimum of the clipped and unclipped terms makes the final objective a lower bound on the unclipped objective. With this mechanism, a change in the probability ratio is ignored when it would improve the objective, and is included when it would make the objective worse. The authors illustrate this with the schematic in Figure 1:

[Figure 1 image not available]

Figure 1  Schematic of the clipped objective $L^{CLIP}$ as a function of the probability ratio $r$, for $A>0$ and $A<0$

From Figure 1, the following observations can be made (a code sketch of the clipped objective follows this list):

  1. When $A>0$, maximizing the objective $L^{CPI}$ amounts to maximizing $r$. Once $r$ becomes too large, i.e. $r>1+\epsilon$, $L$ becomes a constant and stops increasing; the derivative of $L$ with respect to $\theta$ is then 0, the policy is no longer updated, and the new policy is forced to stay close to the old one.

  2. When $A<0$, maximizing the objective $L^{CPI}$ amounts to minimizing $r$. Once $r$ becomes too small, i.e. $r<1-\epsilon$, $L$ becomes a constant and no longer changes with $r$; the derivative of $L$ with respect to $\theta$ is then 0, the policy is no longer updated, and the new policy is forced to stay close to the old one.

  3. $r=1$ is the starting point of the optimization, where the new and old policies are identical.

  4. Based on empirical results, $\epsilon$ is set to 0.2.
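A minimal sketch of the clipped surrogate objective of Equation 7, written as a loss to be minimized (the negative of $L^{CLIP}$). It assumes per-timestep log-probabilities under the new and old policies and advantage estimates are already available; the names are illustrative, not from the paper.

```python
import torch

def clipped_surrogate_loss(log_probs_new: torch.Tensor,
                           log_probs_old: torch.Tensor,
                           advantages: torch.Tensor,
                           epsilon: float = 0.2) -> torch.Tensor:
    """Negative of L^CLIP (Eq. 7); minimizing it maximizes the clipped objective."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Pessimistic bound: take the element-wise minimum of the two terms.
    return -torch.min(unclipped, clipped).mean()
```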

Figure 2 shows how several objective functions change as we interpolate along the policy update direction obtained by proximal policy optimization on a continuous control problem. It can be seen intuitively that $L^{CLIP}$ is a lower bound on $L^{CPI}$ and penalizes overly large policy updates.

[Figure 2 image not available]

Figure 2  Surrogate objectives as we interpolate along the policy update direction, on a continuous control problem

(6) Adaptive KL Penalty Coefficient

Another way to optimize the TRPO objective is to use a penalty on the KL divergence and adaptively adjust the penalty coefficient $\beta$ so that the KL divergence stays close to a target value $d_{targ}$ at each policy update. Follow-up experiments found that this method does not perform as well as the first one (the clipped objective), but it is analyzed here and used as a baseline for comparison.

In each policy update, several epochs of minibatch SGD are used to optimize the KL-penalized objective:
$$L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}\hat{A}_t-\beta\,\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot\mid s_t),\pi_{\theta}(\cdot\mid s_t)\right]\right]\tag{8}$$
Then $d=\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot\mid s_t),\pi_{\theta}(\cdot\mid s_t)\right]\right]$ is computed and used to adjust the adaptive penalty coefficient $\beta$:

  • If $d < d_{targ}/1.5$, then $\beta \leftarrow \beta/2$

  • If $d > d_{targ}\times 1.5$, then $\beta \leftarrow \beta\times 2$

The updated $\beta$ is used for the next policy update. With this scheme, the KL divergence occasionally ends up far from $d_{targ}$ after an update, but such cases are rare, and $\beta$ quickly adjusts. The parameters 1.5 and 2 above were chosen empirically, and the initial value of $\beta$ is another hyperparameter, but neither matters much in practice because the algorithm adjusts $\beta$ quickly.

The general idea of this method is still that the new policy should not differ too much from the old one, yet should not remain completely unchanged. Intuitively:

  1. When the difference is small, the penalty coefficient is decreased, reducing this part of the penalty; the objective $L^{KLPEN}(\theta)$ is then allowed to change somewhat more, its gradient with respect to the parameters $\theta$ becomes somewhat larger, and a certain difference between the old and new policies is permitted rather than keeping them identical;

  2. When the difference is large, the penalty coefficient is increased, enlarging this part of the penalty; the change of $L^{KLPEN}(\theta)$ slows down, its gradient with respect to $\theta$ becomes smaller, and the difference between the old and new policies is forced to stay small. (A minimal sketch of this adaptive-$\beta$ scheme follows.)
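A minimal sketch of this adaptive KL penalty scheme, under the assumption that the probability ratio, advantages, per-timestep KL divergence, and the measured mean KL `d` are already computed; the function and variable names are illustrative only.

```python
import torch

def kl_penalty_loss(ratio: torch.Tensor, advantages: torch.Tensor,
                    kl: torch.Tensor, beta: float) -> torch.Tensor:
    """Negative of L^KLPEN (Eq. 8): the clipping is replaced by a KL penalty weighted by beta."""
    return -(ratio * advantages - beta * kl).mean()

def update_beta(beta: float, d: float, d_targ: float) -> float:
    """Adapt beta after a policy update from the measured mean KL divergence d."""
    if d < d_targ / 1.5:
        beta = beta / 2.0   # policies barely moved: relax the penalty
    elif d > d_targ * 1.5:
        beta = beta * 2.0   # policies moved too far: strengthen the penalty
    return beta
```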

(7) Algorithm

Entropy measures the degree of disorder of a distribution. In reinforcement learning, entropy reflects the unpredictability of actions: the larger the entropy, the more spread out the distribution, the more random the actions, and the stronger the model's exploration; the smaller the entropy, the more concentrated the distribution, the more deterministic the policy becomes, so the model stops exploring new solutions and loses the chance to find better ones. Standard reinforcement learning methods learn a sequence of actions by optimizing the expected discounted reward, but this process tends to reduce the policy's entropy, making the model converge prematurely to a local optimum. One solution is to add an entropy bonus regularization term that encourages the policy's entropy to increase, ensuring enough exploration to escape local optima.

The methods above are all based on the policy gradient. However, the solution space of such methods is very large and difficult to sample fully, which leads to high variance. To reduce the variance of the advantage estimate, a state-value function $V(s)$ is generally introduced. If a neural network architecture that shares parameters between the policy and the value function is used, a loss function that combines the policy surrogate and a value-function error term must be used. On this basis, an entropy bonus can be added to the objective to ensure sufficient exploration. Optimization is therefore performed by maximizing the following objective:
$$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]\tag{9}$$
Here $L_t^{CLIP}$ is the clipped objective of Equation 7, $L_t^{VF}=(V_{\theta}(s_t)-V^{targ}_t)^2$ is the squared error of the value function, and $S[\pi_{\theta}](s_t)=-\sum_{i}\pi_{\theta}(a_i\mid s_t)\log\pi_{\theta}(a_i\mid s_t)$ is the entropy bonus term; $c_1$ and $c_2$ are coefficients.
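A minimal sketch of the combined objective of Equation 9, again written as a loss to be minimized. The coefficients `c1` and `c2` are example values, not taken from the paper's hyperparameter tables, and the input tensors are assumed to be precomputed per timestep.

```python
import torch

def ppo_total_loss(log_probs_new, log_probs_old, advantages,
                   values, value_targets, entropy,
                   epsilon=0.2, c1=0.5, c2=0.01):
    """Negative of L^{CLIP+VF+S} (Eq. 9) for a network with policy and value heads.

    entropy: per-timestep entropy S[pi_theta](s_t) of the action distribution.
    c1, c2:  value-loss and entropy-bonus coefficients (illustrative values).
    """
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    clip_term = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages)
    value_error = (values - value_targets).pow(2)   # L^VF = squared error
    # Maximize the clipped term and the entropy, minimize the value error -> negate.
    return -(clip_term - c1 * value_error + c2 * entropy).mean()
```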

​ The estimator of the advantage function A is determined below.

A3C [4] is a popular policy-gradient-based algorithm, well suited for use with recurrent neural networks: it runs the policy for T timesteps and uses the collected samples for the update. This approach requires an advantage estimator that does not look beyond timestep T. The estimator used in A3C is:
$$\hat{A}_t^{A3C}=\sum_{i=0}^{k-1}\gamma^{i}r_{t+i}+\gamma^{k}V(s_{t+k})-V(s_t)\tag{10}$$
Let $k=T-t$; then
$$\hat{A}_t^{A3C}=-V(s_t)+r_t+\gamma r_{t+1}+\cdots+\gamma^{T-t-1}r_{T-1}+\gamma^{T-t}V(s_T)\tag{11}$$
​ Generalizing the above formula, we can use the truncated version of generalized advantage estimation:
$$\hat{A}_t^{PPO}=\delta_t+(\gamma\lambda)\delta_{t+1}+\cdots+(\gamma\lambda)^{T-t-1}\delta_{T-1}\tag{12}$$

$$\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)\tag{13}$$

When $\lambda=1$, we find that
$$\hat{A}_{t,\lambda=1}^{PPO}=\hat{A}_t^{A3C}\tag{14}$$
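A minimal sketch of the truncated advantage estimator of Equations 12–13 over a single T-step segment, assuming the segment does not cross an episode boundary and that `values` already contains the bootstrap value $V(s_T)$.

```python
import torch

def truncated_gae(rewards: torch.Tensor, values: torch.Tensor,
                  gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Truncated generalized advantage estimate (Eqs. 12-13) for one T-step segment.

    rewards: r_t for t = 0..T-1, shape [T]
    values:  V(s_t) for t = 0..T, shape [T+1] (the last entry is the bootstrap value V(s_T))
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)      (Eq. 13)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate (gamma * lambda)-discounted residuals               (Eq. 12)
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting `lam=1.0` in this sketch recovers the estimator of Equation 11, consistent with Equation 14.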
The pseudocode below shows the proximal policy optimization (PPO) algorithm using fixed-length trajectory segments. In each iteration, each of N (parallel) actors collects T timesteps of data. The loss function is then built on these $N\times T$ timesteps of data and optimized for K epochs with minibatch SGD (or, for better performance, Adam [7]).

[Algorithm 1 (PPO, Actor-Critic style) pseudocode image not available]
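Since the pseudocode image is not available, the following is an illustrative sketch of the loop it describes (PPO, Actor-Critic style). It reuses the `truncated_gae` and `ppo_total_loss` sketches above, and assumes a hypothetical helper `collect_segment` that runs the current policy in an environment for T timesteps; it is not the authors' implementation.

```python
import torch

def ppo_iteration(envs, policy, optimizer, T, K, minibatch_size,
                  gamma=0.99, lam=0.95, epsilon=0.2):
    """One PPO iteration: collect N*T timesteps, then K epochs of minibatch updates.

    `collect_segment(env, policy, T)` is an assumed helper returning a dict with
    observations, actions, old log-probs, rewards (length T) and values (length T+1).
    `policy(obs)` is assumed to return (action distribution, state-value estimate);
    for continuous action spaces the per-dimension log-probs/entropies would be summed first.
    """
    obs, acts, logp_old, advs, v_targ = [], [], [], [], []
    # 1. Each of the N (parallel) actors runs the old policy for T timesteps.
    for env in envs:
        seg = collect_segment(env, policy, T)                       # assumed helper
        adv = truncated_gae(seg["rewards"], seg["values"], gamma, lam)
        obs.append(seg["obs"]); acts.append(seg["actions"]); logp_old.append(seg["log_probs"])
        advs.append(adv); v_targ.append(adv + seg["values"][:-1])
    obs, acts = torch.cat(obs), torch.cat(acts)
    logp_old, advs, v_targ = torch.cat(logp_old), torch.cat(advs), torch.cat(v_targ)

    # 2. Optimize the surrogate on the N*T samples for K epochs of minibatch SGD/Adam.
    for _ in range(K):
        for idx in torch.randperm(obs.shape[0]).split(minibatch_size):
            dist, value = policy(obs[idx])
            loss = ppo_total_loss(dist.log_prob(acts[idx]), logp_old[idx], advs[idx],
                                  value, v_targ[idx], dist.entropy(), epsilon)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # 3. theta_old <- theta: the log-probs collected in the next iteration play this role.
```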

(8) Experiments

1. Comparison of Surrogate Objectives

First, several different surrogate objectives are compared under different hyperparameters. Because suitable hyperparameters must be found for each algorithm and its variants, a benchmark with low computational cost is chosen: seven simulated robotics tasks implemented in OpenAI Gym [8], each trained for one million timesteps. For the objectives based on a KL penalty term, either a fixed penalty coefficient $\beta$ or the adaptive coefficient driven by the KL target value $d_{targ}$ described in the previous section can be used. The other hyperparameters are shown in Table 1.

[Table 1 image not available]

Table 1 Hyperparameter settings

To represent the policy, a fully connected MLP with two hidden layers of 64 units and $\tanh$ nonlinearities is used, outputting the mean of a Gaussian distribution with variable standard deviation [9]. In these experiments, no parameters are shared between the policy and value functions, and no entropy bonus is used.
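A sketch of such a policy network in PyTorch, under the common assumption (consistent with [9]) that the log standard deviation is a state-independent learned parameter.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """MLP policy: two 64-unit tanh hidden layers outputting the mean of a Gaussian;
    the (state-independent) log standard deviation is a separate learned parameter."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())
```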

Each algorithm is run on all 7 environments, with 3 random seeds per environment. Each run is scored by the average total reward of the last 100 episodes. For each environment the scores are shifted and scaled so that a random policy scores 0 and the best result scores 1, and the results are averaged over the 21 runs to produce a single scalar per algorithm setting.
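The normalization just described can be written as a one-line helper; the reference scores for the random policy and the best result are assumed to be known for each environment.

```python
def normalized_score(avg_reward: float, random_reward: float, best_reward: float) -> float:
    """Shift and scale so that a random policy scores 0 and the best observed result scores 1."""
    return (avg_reward - random_reward) / (best_reward - random_reward)
```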

The results are shown in Table 2. Note that the score is negative for the setting with no clipping and no penalty, because on one environment (half-cheetah) it produces a very low score, worse than the initial random policy.

[Table 2 image not available]

Table 2  Continuous control benchmark results: mean normalized scores for each algorithm/hyperparameter setting (each algorithm was run 21 times, across 7 environments); β is initialized to 1

2. Comparison to Other Algorithms in the Continuous Domain

The PPO algorithm (with the clipped surrogate objective described above) is compared with several other methods that are considered effective for continuous control problems: Trust Region Policy Optimization [5]; the Cross-Entropy Method [10]; vanilla policy gradient with an adaptive step size; A2C [4]; and A2C with trust region [11]. A2C stands for Advantage Actor-Critic; it is a synchronous version of A3C and performs the same as or better than the asynchronous version. For PPO, the hyperparameters from the previous section were used, with $\epsilon=0.2$. Figure 3 shows that PPO outperforms the previous methods in almost all continuous control environments.
[Figure 3 image not available]

Figure 3  Comparison of several algorithms across multiple MuJoCo environments, trained for one million timesteps

3. Showcase in the Continuous Domain: Humanoid Running and Steering

To demonstrate the performance of PPO on high-dimensional continuous control problems, we train on a set of problems involving 3D humanoid robots, where the robot must run, steer, and get up off the ground. The three tasks, in increasing order of difficulty, are: (1) RoboschoolHumanoid: forward locomotion only; (2) RoboschoolHumanoidFlagrun: the target position changes randomly every 200 timesteps or whenever the goal is reached; (3) RoboschoolHumanoidFlagrunHarder: the robot is pelted with cubes and needs to get up off the ground. Figure 4 shows the learning curves for the three tasks, Figure 5 shows still frames of a learned policy, and Table 3 gives the hyperparameters. In concurrent work, Heess et al. [12] use the adaptive-KL variant of PPO (Section 6) to learn locomotion policies for 3D robots.

[Figure 4 image not available]

Figure 4  Learning curves of PPO on the Roboschool 3D humanoid control tasks

[Figure 5 image not available]

Figure 5  Still frames of the policy learned on RoboschoolHumanoidFlagrun. In the first six frames, the robot runs towards a target; then the target position changes randomly, and the robot turns and runs towards the new target

[Table 3 image not available]

Table 3 Hyperparameter settings

4. Comparison to Other Algorithms on the Atari Domain

PPO is also run on the Arcade Learning Environment [13] benchmark and compared with well-tuned implementations of A2C [4] and ACER [11]. All three algorithms use the same policy network architecture as in [4]. Table 5 gives the PPO hyperparameters; for the other two algorithms, hyperparameters tuned to maximize performance on this benchmark were used.

Two scoring metrics are considered: (1) the average reward per episode over the entire training period (favors fast learning); and (2) the average reward per episode over the last 100 episodes of training (favors final performance). Table 4 shows the number of games each algorithm "won", where the winner is determined by averaging the scoring metric over three trials.
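For illustration, a small helper computing the two scoring metrics from a list of per-episode rewards collected during training; the representation of the training history is an assumption of this sketch.

```python
def atari_scores(episode_rewards):
    """(1) average reward per episode over all of training (favors fast learning);
    (2) average reward over the last 100 episodes (favors final performance)."""
    avg_all_training = sum(episode_rewards) / len(episode_rewards)
    last_100 = episode_rewards[-100:]
    avg_last_100 = sum(last_100) / len(last_100)
    return avg_all_training, avg_last_100
```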

[Table 4 image not available]

Table 4 Number of games "won" by each algorithm, where the score metric is averaged over three trials

[Table 5 image not available]

Table 5  Hyperparameter settings

(9) Conclusion

This article has mainly examined proximal policy optimization (PPO) as a policy optimization method for deep reinforcement learning. From the traditional policy gradient algorithm, through the natural policy gradient algorithm and TRPO, to the current PPO algorithm, successive iterations of refinement have made PPO one of the mainstream algorithms in reinforcement learning.

Looking at this lineage of methods, PPO's improvement over the policy gradient algorithm mainly targets how the size of each parameter update is limited. The natural policy gradient algorithm introduces a KL divergence constraint, TRPO uses a line search with improvement checks to guarantee feasibility under that constraint, and PPO limits the range of policy change through the clip function. Compared with the previous two kinds of algorithms, PPO strikes the right balance among speed, rigor, and usability, and has better overall performance.

References:

[1] Kakade S M .A Natural Policy Gradient[C]//Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada].2001.

[2] Schulman J, Wolski F, Dhariwal P, et al. Proximal Policy Optimization Algorithms[J]. 2017.DOI:10.48550/arXiv.1707.06347.

[3] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (2015), pp. 529–533.

[4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. “Asynchronous methods for deep reinforcement learning”. In: arXiv preprint arXiv:1602.01783 (2016).

[5] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust region policy optimization”. In: CoRR, abs/1502.05477 (2015).

[6] S. Kakade and J. Langford. “Approximately optimal approximate reinforcement learning”. In: ICML. Vol. 2. 2002, pp. 267–274.

[7] D. Kingma and J. Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[8] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W.Zaremba. “OpenAI Gym”. In: arXiv preprint arXiv:1606.01540 (2016)

[9] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. “Benchmarking Deep Reinforcement Learning for Continuous Control”. In: arXiv preprint arXiv:1604.06778 (2016).

[10] I. Szita and A. Lőrincz. "Learning Tetris using the noisy cross-entropy method". In: Neural computation 18.12 (2006), pp. 2936–2941.

[11] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. “Sample Efficient Actor-Critic with Experience Replay”. In: arXiv preprint arXiv:1611.01224 (2016).

[12] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, et al. “Emergence of Locomotion Behaviours in Rich Environments”. In: arXiv preprint arXiv:1707.02286 (2017).

[13] M. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The arcade learning environment: An evaluation platform for general agents”. In: Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.


Source: blog.csdn.net/weixin_46084134/article/details/131286622