Basic principles of PPO algorithm
PPO (Proximal Policy Optimization) is a policy-based reinforcement learning algorithm; by reusing data collected under an old policy through importance sampling, it can be treated as an off-policy algorithm.
For the detailed mathematical derivation (why it can be treated as off-policy, the design of the advantage function, and importance sampling), please refer to Teacher Li Hongyi's reinforcement learning course series. I have also shared my study notes in another blog post: Basic principles of the PPO algorithm (study notes on Li Hongyi's course), https://blog.csdn.net/ningmengzhihe/article/details/131457536. Interested readers are welcome to discuss!
KL penalty and Clip
The core of the PPO algorithm is how the policy update is constrained. There are two mainstream variants: one penalizes the KL divergence between the new and old policies, and the other clips the probability ratio. Both limit the magnitude of each policy update, which leads to different parameter-update rules for the neural network.
If the KL-penalty variant is used, the neural network parameters are updated by maximizing

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta \, KL(\theta, \theta'), \quad J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}} \left[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right]$$

If the Clip variant is used, the parameters are updated by maximizing

$$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min \left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} A^{\theta^k}(s_t, a_t), \; \text{clip}\left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}, 1-\varepsilon, 1+\varepsilon \right) A^{\theta^k}(s_t, a_t) \right)$$
The pseudo-code of the PPO algorithm with the KL penalty is as follows:
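As a concrete illustration of the KL-penalty update, here is a minimal PyTorch sketch of the surrogate objective written as a loss to minimize. The function name, the fixed β, and the sample-based KL estimate are my own simplifications; the full algorithm also adapts β depending on whether the measured KL is above or below a target.

```python
import torch

def ppo_kl_loss(new_logp, old_logp, advantages, beta=0.5):
    """KL-penalized surrogate objective (PPO1), returned as a loss.

    ratio * advantage rewards actions that became more likely under
    the new policy; beta * KL penalizes drifting from the old policy.
    """
    ratio = torch.exp(new_logp - old_logp)      # p_theta / p_theta_old
    surrogate = ratio * advantages              # importance-weighted advantage
    approx_kl = (old_logp - new_logp).mean()    # simple sample-based KL estimate
    return -(surrogate.mean() - beta * approx_kl)
```

In a training loop, `new_logp` comes from the actor being trained and `old_logp` from the frozen `actor_old`, so the gradient flows only through the new policy.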
The pseudo-code of the PPO algorithm with clipping is as follows:
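Correspondingly, a minimal sketch of the clipped surrogate loss (PPO2), again written as a loss to minimize; `torch.clamp` implements the clip operation, and the function name is illustrative:

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective (PPO2), returned as a loss.

    Taking the elementwise min of the unclipped and clipped terms
    removes the incentive to move the ratio outside [1-eps, 1+eps].
    """
    ratio = torch.exp(new_logp - old_logp)                  # p_theta / p_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Note that the clipping only bounds the objective, not the ratio itself; the pessimistic `min` is what keeps the update conservative.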
Algorithm flowchart
The following algorithm flow chart is based on the Mofan Python implementation of PPO, and also draws on other open-source implementations. It does not use a replay memory: the data for each PPO update is a batch of consecutive transitions (each containing the current state, the executed action, and the cumulative discounted reward), and it uses two actor networks (one actor_old and one actor).
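The cumulative discounted reward mentioned above can be computed with a single backward pass over the batch. This sketch assumes, as in Mofan-style implementations, that the batch is bootstrapped from the critic's value estimate of the state reached after the last transition; the function name is illustrative:

```python
def discounted_rewards(rewards, last_value, gamma=0.9):
    """Cumulative discounted return for a batch of consecutive
    transitions, bootstrapped from last_value (the critic's value
    of the state following the final transition)."""
    returns = []
    running = last_value
    for r in reversed(rewards):        # accumulate from the end of the batch
        running = r + gamma * running
        returns.append(running)
    returns.reverse()                  # restore chronological order
    return returns
```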
The PPO class contains the following four parts, i.e., four methods:
(1) Initialization
(2) Select action
(3) Calculate status value
(4) Update, i.e., train the networks
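The four parts above can be sketched as a PPO class skeleton in PyTorch. The layer sizes, the discrete Categorical policy, and the hyper-parameters are illustrative assumptions, not the exact Mofan implementation; this version uses the Clip update for the actor and a mean-squared error for the critic:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PPO:
    def __init__(self, state_dim, n_actions, lr=1e-3):
        # (1) Initialization: actor, frozen old actor, and critic
        self.actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                   nn.Linear(32, n_actions))
        self.actor_old = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                       nn.Linear(32, n_actions))
        self.actor_old.load_state_dict(self.actor.state_dict())
        self.critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                    nn.Linear(32, 1))
        self.opt_a = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.opt_c = torch.optim.Adam(self.critic.parameters(), lr=lr)

    def choose_action(self, state):
        # (2) Select an action by sampling from the old policy
        with torch.no_grad():
            dist = Categorical(logits=self.actor_old(state))
        action = dist.sample()
        return action, dist.log_prob(action)

    def get_value(self, state):
        # (3) Critic's estimate of the state value
        with torch.no_grad():
            return self.critic(state).squeeze(-1)

    def update(self, states, actions, old_logp, returns, eps=0.2):
        # (4) Train: clipped surrogate for the actor, MSE for the critic
        adv = returns - self.critic(states).squeeze(-1).detach()
        dist = Categorical(logits=self.actor(states))
        ratio = torch.exp(dist.log_prob(actions) - old_logp)
        a_loss = -torch.min(ratio * adv,
                            torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
        c_loss = ((self.critic(states).squeeze(-1) - returns) ** 2).mean()
        self.opt_a.zero_grad(); a_loss.backward(); self.opt_a.step()
        self.opt_c.zero_grad(); c_loss.backward(); self.opt_c.step()
        # Sync the old actor so the next batch is sampled from the new policy
        self.actor_old.load_state_dict(self.actor.state_dict())
```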
The KL penalty and Clip algorithms differ only in how the actor network is updated, which corresponds to the yellow box in the flow chart below.
How the actor and critic networks are updated is not fixed. The algorithm above updates the two networks separately; some implementations instead form a weighted sum of actor_loss and critic_loss and update both networks together (see simple_ppo.py for the code). Their network architectures also differ.
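A weighted joint update typically combines the two losses (often with an entropy bonus to encourage exploration) into one objective. The 0.5 and 0.01 coefficients below are common illustrative choices, not values taken from simple_ppo.py, and the loss tensors are placeholders:

```python
import torch

# Hypothetical per-batch losses, standing in for the real computed values
actor_loss = torch.tensor(0.3)
critic_loss = torch.tensor(1.2)
entropy = torch.tensor(0.8)

# One weighted objective; a single optimizer over both networks would
# then call total_loss.backward() once instead of two separate updates.
total_loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
```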
There is no theoretical result establishing which method is more effective; in practice you often need to run the code and choose according to the specific problem.
References
(1) Paper: Emergence of Locomotion Behaviors in Rich Environments
(2) Paper: Proximal Policy Optimization Algorithms
(3) Mofan Python tutorials
(4) PPO2 code, PyTorch framework - Zhihu https://zhuanlan.zhihu.com/p/538486008 , a runnable implementation, highly recommended
(5) Li Hongyi's reinforcement learning course on Bilibili, which explains the mathematical principles very well
(6) Another good in-depth walkthrough of Li Hongyi's reinforcement learning lectures: ChatGPT and PPO (in Chinese) - Bilibili https://www.bilibili.com/video/BV1sg4y1p7hw/?spm_id_from=333.1007.top_right_bar_window_custom_collection.content.click&vd_source=1565223f5f03f44f5674538ab582448c