Li Hongyi Reinforcement Learning (Mandarin) Course (2018) Notes (2): Proximal Policy Optimization (PPO)

   Li Hongyi Reinforcement Learning (Mandarin) Course (2018)_哔哩哔哩_bilibili

        on-policy: the agent being trained and the agent interacting with the environment are the same one; that is, the agent learns from its own interaction with the environment;

        off-policy: the agent being trained and the agent interacting with the environment are different; in other words, the agent learns by watching someone else play.

        The purpose of going from on-policy to off-policy is to improve the efficiency of data utilization.

        The derivation behind on-policy→off-policy is importance sampling: instead of sampling data from the distribution p that we actually care about, we sample data from another distribution q and correct for the mismatch.
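        Written out (this is the standard importance-sampling identity used in the lecture), an expectation under p becomes an expectation under q once every sample is reweighted by p(x)/q(x):

        E_{x\sim p}\left[ f(x) \right] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = E_{x\sim q}\left[ f(x)\, \frac{p(x)}{q(x)} \right]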

        In practice, the distributions p and q must not differ too much, otherwise problems arise. The reason is that while the expectations are equal, the variances are not necessarily equal. The derivation is as follows.
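        Using \mathrm{Var}[X] = E[X^{2}] - (E[X])^{2} (following the lecture's derivation):

        \mathrm{Var}_{x\sim p}\left[ f(x) \right] = E_{x\sim p}\left[ f(x)^{2} \right] - \left( E_{x\sim p}\left[ f(x) \right] \right)^{2}

        \mathrm{Var}_{x\sim q}\left[ f(x)\frac{p(x)}{q(x)} \right] = E_{x\sim p}\left[ f(x)^{2}\frac{p(x)}{q(x)} \right] - \left( E_{x\sim p}\left[ f(x) \right] \right)^{2}

        The extra factor \frac{p(x)}{q(x)} in the first term is what makes the variance blow up when p and q are far apart.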

         If the number of samples is not large enough, there will still be problems: the samples drawn from q mostly fall where q is large, and the few samples falling where p is large carry huge importance weights, so a small sample can give a badly wrong estimate (this is the situation illustrated with a figure in the lecture).
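        A toy check of this point (my own example, not from the lecture): take p = N(0,1), q = N(1,1) and f(x) = x, so the true value of E_{x\sim p}[f(x)] is 0, and compare the importance-weighted estimate with few versus many samples.

import numpy as np

# Estimate E_{x~p}[f(x)] using samples drawn from q, weighted by p(x)/q(x).
rng = np.random.default_rng(0)
f = lambda x: x
p_pdf = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)          # p = N(0, 1)
q_pdf = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)  # q = N(1, 1)

for n in (10, 100_000):
    x = rng.normal(1.0, 1.0, size=n)                # samples from q
    estimate = np.mean(f(x) * p_pdf(x) / q_pdf(x))  # importance-weighted average
    print(f"n={n:>6}: estimate {estimate:+.3f} (true value is 0)")

        With only a handful of samples the weighted average can land far from 0; with many samples it settles down.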

        Since it is θ', not θ, that interacts with the environment, the data sampled by θ' does not depend on θ. Therefore, once θ' has interacted with the environment and generated a large batch of data, θ can be updated with that data many times; only after θ has been trained for a while does θ' need to interact with the environment again.
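        Applying importance sampling to the policy gradient, the objective optimized for θ with data collected by θ' is (in the lecture's notation):

        J^{\theta'}(\theta) = E_{(s_t, a_t)\sim \pi_{\theta'}}\left[ \frac{p_{\theta}(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\, A^{\theta'}(s_t, a_t) \right]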

        A^{\theta}(s_t, a_t) is the accumulated reward minus a baseline; it estimates how good the action a_t is relative to average in state s_t. If it is positive, the probability of that action is increased; if it is negative, the probability is decreased.
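        A minimal sketch of that quantity for one trajectory, assuming the baseline is a per-step value b(s_t) supplied by the caller (the function name and interface are made up for illustration):

import numpy as np

def advantage(rewards, baseline, gamma=0.99):
    """Discounted accumulated reward minus a baseline for one trajectory,
    i.e. a simple estimate of A(s_t, a_t); `baseline` holds b(s_t) per step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted reward-to-go
        returns[t] = running
    return returns - np.asarray(baseline)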

        There is an assumption here: the distributions p_{\theta}(s_t) and p_{\theta'}(s_t) are taken to be similar, so the two terms cancel out. The other reason is that this term cannot be computed anyway.

        The assumption above requires that p_{\theta}(a_t|s_t) and p_{\theta'}(a_t|s_t) do not differ too much, otherwise the result becomes inaccurate. How to keep the difference from becoming too large is exactly what PPO does: it adds an extra constraint during training. This constraint is the KL divergence between the action distributions output by the two models θ and θ'. TRPO is the predecessor of PPO; the difference between the two is where the constraint is placed.
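        In formulas (following the lecture): PPO puts the KL term into the objective as a penalty, while TRPO leaves the objective alone and imposes the KL term as a separate constraint, which is much harder to optimize.

        J_{\mathrm{PPO}}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}(\theta, \theta')

        J_{\mathrm{TRPO}}^{\theta'}(\theta) = J^{\theta'}(\theta), \quad \text{subject to } \mathrm{KL}(\theta, \theta') < \delta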

        Note: PPO is much easier to implement than TRPO, and its performance is similar. The KL divergence here is not the distance between the parameters θ and θ', but the distance in behavior: given the same state, it measures the gap between the action probability distributions that the two policies output.
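        A small sketch of what "distance in behavior" means, assuming two policy networks that map a batch of states to logits over a discrete action set (the names here are made up for illustration):

import torch
import torch.nn.functional as F

def behavior_kl(policy_new, policy_old, states):
    """KL divergence between the action distributions of two policies on the
    same states -- a distance in behavior, not between parameter vectors."""
    logp_new = F.log_softmax(policy_new(states), dim=-1)   # log p_theta(a|s)
    logp_old = F.log_softmax(policy_old(states), dim=-1)   # log p_theta'(a|s)
    # KL(p_theta' || p_theta), averaged over the batch of states
    return (logp_old.exp() * (logp_old - logp_new)).sum(dim=-1).mean()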

        PPO algorithm process:
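        Summarized from the lecture (the adaptive-KL version; the original slide may word it slightly differently):

        \begin{aligned}
        &\text{Initialize the policy parameters } \theta^{0}.\\
        &\text{In each iteration } k:\\
        &\quad 1.\ \text{Use } \theta^{k} \text{ to interact with the environment, collect } \{(s_t, a_t)\} \text{ and compute } A^{\theta^{k}}(s_t, a_t).\\
        &\quad 2.\ \text{Update } \theta \text{ several times to maximize } J_{\mathrm{PPO}}(\theta) = J^{\theta^{k}}(\theta) - \beta\, \mathrm{KL}(\theta, \theta^{k}).\\
        &\quad 3.\ \text{If } \mathrm{KL}(\theta, \theta^{k}) > \mathrm{KL}_{\max}, \text{ increase } \beta;\ \text{if } \mathrm{KL}(\theta, \theta^{k}) < \mathrm{KL}_{\min}, \text{ decrease } \beta.
        \end{aligned}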

         The PPO2 formula looks complicated, but it is simple to implement.
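        Written out, the clipped objective is:

        J_{\mathrm{PPO2}}^{\theta^{k}}(\theta) \approx \sum_{(s_t, a_t)} \min\left( \frac{p_{\theta}(a_t\mid s_t)}{p_{\theta^{k}}(a_t\mid s_t)}\, A^{\theta^{k}}(s_t, a_t),\ \mathrm{clip}\left( \frac{p_{\theta}(a_t\mid s_t)}{p_{\theta^{k}}(a_t\mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon \right) A^{\theta^{k}}(s_t, a_t) \right)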

In the formula, the clip function bounds the importance ratio p_{\theta}(a_t|s_t)/p_{\theta^{k}}(a_t|s_t) to the interval [1-\varepsilon, 1+\varepsilon]: values below 1-\varepsilon become 1-\varepsilon, and values above 1+\varepsilon become 1+\varepsilon.

In the formula, the min function takes the smaller of the clipped and unclipped terms: when the advantage is positive, the objective gains nothing from pushing the ratio above 1+\varepsilon; when the advantage is negative, it gains nothing from pushing the ratio below 1-\varepsilon. Either way, θ is kept from drifting too far from θ^{k}.
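How simple the operation is can be seen from a minimal sketch, assuming the log-probabilities and advantages have already been computed as tensors (the names are made up for illustration):

import torch

def ppo2_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    Maximize this quantity (or minimize its negative as the training loss)."""
    ratio = torch.exp(logp_new - logp_old)               # p_theta / p_theta^k
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip to [1-eps, 1+eps]
    return torch.min(ratio * advantage, clipped * advantage).mean()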


Origin blog.csdn.net/qq_22749225/article/details/125491056