Basic principles and flowchart of the PPO algorithm (the KL penalty and Clip methods)

Basic principles of the PPO algorithm

PPO (Proximal Policy Optimization) is a policy-based reinforcement learning algorithm. It can be treated as off-policy in the sense that, through importance sampling, it reuses data collected by an old policy to update the current one.

For the detailed mathematical derivation, why it can be treated as an off-policy algorithm, the design of the advantage function, and importance sampling, please refer to Li Hongyi's reinforcement learning course series. I have also written up my study notes in another blog post: Basic principles of the PPO algorithm (study notes on Li Hongyi's course), https://blog.csdn.net/ningmengzhihe/article/details/131457536. Interested readers are welcome to discuss!

KL penalty and Clip

The core of the PPO algorithm is the policy update. There are two mainstream variants: one penalizes the KL divergence between the new and old policies, and the other clips the probability ratio. Both limit how far each policy update can move, and each leads to a different rule for updating the network parameters.

If the KL penalty is used, the policy parameters are updated by maximizing

$$J_{\mathrm{PPO}}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\,\mathrm{KL}(\theta,\theta^k),\qquad
J^{\theta^k}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta^k}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)}\,A^{\theta^k}(s_t,a_t)\right].$$

If the Clip method is used, the objective is instead

$$J_{\mathrm{PPO2}}^{\theta^k}(\theta) \approx \sum_{(s_t,a_t)}\min\!\left(\frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)}\,A^{\theta^k}(s_t,a_t),\;
\mathrm{clip}\!\left(\frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)},\,1-\varepsilon,\,1+\varepsilon\right)A^{\theta^k}(s_t,a_t)\right),$$

where $\theta^k$ are the parameters of the old (data-collecting) policy, $A^{\theta^k}$ is the advantage estimate, and $\varepsilon$ is the clip range.
The PPO algorithm with the (adaptive) KL penalty proceeds, in outline, as follows:

- Initialize the policy parameters $\theta^0$.
- In each iteration $k$: use $\theta^k$ to interact with the environment, collect transitions $\{(s_t,a_t)\}$, and estimate the advantages $A^{\theta^k}(s_t,a_t)$.
- Update $\theta$ for several epochs to optimize $J_{\mathrm{PPO}}(\theta) = J^{\theta^k}(\theta) - \beta\,\mathrm{KL}(\theta,\theta^k)$.
- Adapt the penalty weight: if $\mathrm{KL}(\theta,\theta^k) > \mathrm{KL}_{\max}$, increase $\beta$; if $\mathrm{KL}(\theta,\theta^k) < \mathrm{KL}_{\min}$, decrease $\beta$.
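To make this concrete, here is a minimal PyTorch-style sketch of the actor loss with an adaptive $\beta$. The function names are illustrative, not taken from the referenced code; the adaptation factors 1.5 and 2 follow the PPO paper, but other choices work as well.

```python
import torch

def kl_penalty_actor_loss(logp_new, logp_old, advantage, beta):
    """Surrogate objective with a KL penalty, written as a loss to minimize.
    logp_new / logp_old are log-probabilities of the taken actions under the new
    and old actors; advantage is A^{theta_k}(s_t, a_t)."""
    ratio = torch.exp(logp_new - logp_old)        # p_theta / p_theta_k
    surrogate = (ratio * advantage).mean()        # importance-weighted advantage
    approx_kl = (logp_old - logp_new).mean()      # simple sample-based KL estimate
    return -(surrogate - beta * approx_kl)

def adapt_beta(beta, kl, kl_target):
    """Adaptive KL penalty: enlarge beta when the KL is too large, shrink it when too small."""
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta / 2.0
    return beta
```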

The PPO algorithm with the Clip objective follows the same outline, except that there is no KL term and no $\beta$ to adapt: in each iteration the policy is updated for several epochs to optimize $J_{\mathrm{PPO2}}^{\theta^k}(\theta)$ given above.
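Correspondingly, a minimal sketch of the clipped surrogate loss (again an illustration rather than the referenced code; `eps` is the clip range $\varepsilon$):

```python
import torch

def clip_actor_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective (PPO-clip), written as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)               # p_theta / p_theta_k
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip(ratio, 1-eps, 1+eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```

Taking the minimum means the objective gains nothing from pushing the ratio outside $[1-\varepsilon,\,1+\varepsilon]$ in the direction the advantage favors, which is exactly the limiting effect described above.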

Algorithm flowchart

The following algorithm flowchart is based on the PPO implementation from Mofan Python (Morvan Python), and also draws on the algorithm flow of code found online. It does not use a replay memory: each PPO update uses a batch of consecutive transitions (containing the current states, the executed actions, and the cumulative discounted rewards), and it uses two actor networks (an actor_old and an actor).
[Flowchart: the overall PPO training loop]
The PPO class contains the following four parts, i.e. four methods:
[Figure: structure of the PPO class and its four methods]
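As an orientation aid, the class roughly has the shape below. The method names follow the headings (1) to (4); they are assumed names, not necessarily those used in the referenced code.

```python
class PPO:
    """Skeleton of the PPO class; the four methods are sketched in (1)-(4) below."""

    def __init__(self, state_dim, action_dim):
        ...  # build actor, actor_old, critic and their optimizers

    def choose_action(self, state):
        ...  # sample an action from the actor's output distribution

    def get_v(self, state):
        ...  # return the critic's estimate of the state value V(s)

    def update(self, states, actions, discounted_rewards):
        ...  # copy actor -> actor_old, then train the actor and the critic
```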

(1) Initialization

[Code screenshot: the initialization method]
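A minimal sketch of the initialization, assuming a continuous-action Gaussian policy; the network sizes, learning rates, and hyperparameters are illustrative rather than values from the referenced code.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy: maps a state to a Normal distribution over continuous actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mu = self.mu(self.body(state))
        return torch.distributions.Normal(mu, self.log_std.exp())

class Critic(nn.Module):
    """State-value network V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state)

class PPO:
    def __init__(self, state_dim, action_dim, actor_lr=1e-4, critic_lr=2e-4,
                 eps=0.2, actor_steps=10, critic_steps=10):
        # Two actor networks, as in the flowchart: `actor` is trained,
        # while `actor_old` is a frozen copy that supplies p_theta_old for the ratio.
        self.actor = Actor(state_dim, action_dim)
        self.actor_old = copy.deepcopy(self.actor)
        self.critic = Critic(state_dim)
        self.actor_optim = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optim = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        self.eps, self.actor_steps, self.critic_steps = eps, actor_steps, critic_steps
```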

(2) Select action

[Code screenshot: the action-selection method]
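A sketch of what this method does, written here as a standalone function over the Actor from (1). Which actor (old or new) is used for sampling varies between implementations; the current actor is assumed below.

```python
import torch

def choose_action(actor, state):
    """Sample an action from the policy's Normal distribution for environment interaction."""
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        dist = actor(state_t)        # the Actor from (1) returns a torch Normal distribution
        action = dist.sample()
    return action.squeeze(0).numpy()
```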

(3) Compute the state value

[Code screenshot: the state-value method]
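A sketch of the state-value computation and of how it is typically used to build the cumulative discounted rewards mentioned above; the function names and the discount factor are illustrative.

```python
import torch

def get_v(critic, state):
    """Return the critic's value estimate V(s) for a single state."""
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        return critic(state_t).item()

def discounted_rewards(rewards, v_last, gamma=0.9):
    """Accumulate discounted returns backwards, bootstrapping from V of the last state."""
    out, running = [], v_last
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out
```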

(4) The update method, which updates/trains the networks

[Code screenshot: the update method]
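A minimal sketch of the update step under the assumptions above, shown with the Clip loss (substituting the KL-penalty loss from earlier gives the other variant). It illustrates the flow in the chart rather than reproducing the referenced code.

```python
import torch

def update(ppo, states, actions, discounted_r):
    """Sketch of PPO.update: sync actor_old, compute advantages, then train actor and critic."""
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.float32)
    returns = torch.as_tensor(discounted_r, dtype=torch.float32).view(-1, 1)

    # 1) Copy the trained actor into actor_old so it provides p_theta_old for this update.
    ppo.actor_old.load_state_dict(ppo.actor.state_dict())

    # 2) Advantage estimate: cumulative discounted reward minus the critic's prediction.
    with torch.no_grad():
        advantage = (returns - ppo.critic(states)).squeeze(-1)
        logp_old = ppo.actor_old(states).log_prob(actions).sum(-1)

    # 3) Train the actor for several steps on the clipped surrogate loss (the yellow box).
    for _ in range(ppo.actor_steps):
        logp_new = ppo.actor(states).log_prob(actions).sum(-1)
        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1 - ppo.eps, 1 + ppo.eps)
        actor_loss = -torch.min(ratio * advantage, clipped * advantage).mean()
        ppo.actor_optim.zero_grad()
        actor_loss.backward()
        ppo.actor_optim.step()

    # 4) Train the critic separately to regress the discounted returns.
    for _ in range(ppo.critic_steps):
        critic_loss = torch.nn.functional.mse_loss(ppo.critic(states), returns)
        ppo.critic_optim.zero_grad()
        critic_loss.backward()
        ppo.critic_optim.step()
```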
The KL penalty and Clip variants differ only in how the actor network is updated, which corresponds to the yellow box in the flowchart:

- KL penalty algorithm: the actor is trained on the penalized objective $J_{\mathrm{PPO}}^{\theta^k}(\theta)$ given above.
- Clip algorithm: the actor is trained on the clipped objective $J_{\mathrm{PPO2}}^{\theta^k}(\theta)$ given above.

How the actor and critic networks are updated is not fixed. The flow above updates the actor and the critic separately; some implementations instead weight actor_loss and critic_loss into a single loss and update the network jointly (see simple_ppo.py for such code), and their network architectures also differ.
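In the joint-update variant the combination is typically just a weighted sum of the two losses; the weight below is an arbitrary example, not a value taken from simple_ppo.py.

```python
def joint_loss(actor_loss, critic_loss, value_coef=0.5):
    """Weighted sum of the actor and critic losses for a single backward pass."""
    return actor_loss + value_coef * critic_loss
```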

There is no theoretical result saying which approach is more effective; in practice you usually have to run the code both ways and choose according to the specific problem.

References

(1) Paper: Emergence of Locomotion Behaviours in Rich Environments
(2) Paper: Proximal Policy Optimization Algorithms
(3) Mofan Python (莫烦Python) reinforcement learning tutorials
(4) PPO2 code with the PyTorch framework, on Zhihu: https://zhuanlan.zhihu.com/p/538486008 (the code runs as-is and is very good)
(5) Li Hongyi's reinforcement learning course on Bilibili, which explains the mathematical principles very well
(6) An in-depth walkthrough building on Li Hongyi's reinforcement learning lectures, also good: "ChatGPT and PPO" (in Chinese) on Bilibili: https://www.bilibili.com/video/BV1sg4y1p7hw/?spm_id_from=333.1007.top_right_bar_window_custom_collection.content.click&vd_source=1565223f5f03f44f5674538ab582448c
