Reinforcement Learning PPO: Interpretation of Proximal Policy Optimization Algorithms

The PPO algorithm is a Policy Gradient reinforcement learning method. A classic Policy Gradient algorithm uses a parameterized policy model \pi(a|s,\theta) to determine actions given the state, and its parameters are updated with the following formula:

\theta_{t+1} = \theta_{t} + \alpha \partial_{\theta_{t}} J(\theta_t)

J(\theta_t) measures how good the policy model is. The optimization goal is to find the policy that maximizes the overall value of its decisions:

\text{max}_{\pi}\ J(\theta) = E_{s,a\sim \pi}[\pi(a|s,\theta)Q_{\pi}(s,a)]

Because the optimal policy \pi is unknown, a simple idea is to directly optimize the current parameterized model \pi_\theta. This approach is called Vanilla Policy Gradient.
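As a minimal sketch (not from the original post), the gradient-ascent update \theta_{t+1} = \theta_{t} + \alpha \partial_{\theta_{t}} J(\theta_t) can be written in PyTorch by descending on the negative objective; the toy network and learning rate here are illustrative assumptions.

```python
import torch

# A toy parameterized policy pi(a|s, theta): a small softmax network.
# Architecture and learning rate are illustrative assumptions.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)  # alpha

def ascent_step(objective: torch.Tensor):
    """One step of theta <- theta + alpha * grad J(theta).

    Optimizers minimize, so descending on -J gives the same update.
    """
    optimizer.zero_grad()
    (-objective).backward()
    optimizer.step()
```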

1. Vanilla Policy Gradient

Vanilla Policy Gradient defines the optimization goal as:

\text{max}_{\theta }\ J(\theta) \\= E_{s,a\sim {\pi_{\theta }}}[\pi(a|s,\theta)A(s,a)]\\=\sum_\tau \sum_{t=0}^T \sum_a\pi(a|s_t,\theta)A(s_t,a)

  • \tau denotes an episode, t denotes a time step within an episode, and T denotes the final step of the episode.
  • A(s_t,a_t)=Q(s_t,a_t)-v(s_t|w)=G(s_t,a_t)-v(s_t|w) is called the advantage estimate. It subtracts the state-value estimate v(s_t|w) (a baseline) from the original estimate Q(s_t,a_t). Doing so corrects for the differences in value between states and speeds up convergence; a small sketch follows this list. For details, please refer to this article.
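A minimal sketch of this advantage computation, assuming Monte Carlo returns G_t and a learned value head v(s_t|w); the discount factor and helper names are illustrative assumptions.

```python
import torch

def compute_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t for every step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))

def compute_advantages(returns, values):
    """A(s_t, a_t) = G(s_t, a_t) - v(s_t | w); detached so the advantage
    acts as a fixed weight in the policy loss."""
    return returns - values.detach()
```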

Gradient calculation formula:

\partial_{\theta} J(\theta) \\ =\sum_\tau \sum_{t=0}^T \sum_a\partial_{\theta}\pi(a|s_t,\theta)A(s_t,a) \\=\sum_\tau \sum_{t=0}^T \sum_a \pi(a|s_t,\theta) A(s_t,a) \frac{\partial_{\theta}\pi(a|s_t,\theta)}{\pi(a|s_t,\theta)} \\\doteq \sum_\tau \sum_{t=0}^T A(s_t,a_t) \frac{\partial_{\theta}\pi(a_t|s_t,\theta)}{\pi(a_t|s_t,\theta)} \\= \sum_\tau \sum_{t=0}^T A(s_t,a_t) \partial_{\theta}\ln\pi(a_t|s_t,\theta)

Parameter update formula:

\theta_{t+1}=\theta_t + \alpha\partial_{\theta} J(\theta_t)

In addition, the state-value estimate v(s_t|w) is optimized with a value-error (VE) loss, and the policy and value updates are carried out in two steps following the actor-critic (AC) framework.
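A minimal sketch of one such two-step actor-critic update, assuming policy_net returns action logits and value_net returns v(s|w); the function and argument names are illustrative assumptions, not from the paper.

```python
import torch
import torch.nn.functional as F

def vanilla_pg_update(policy_net, value_net, policy_opt, value_opt,
                      states, actions, returns):
    """One Vanilla Policy Gradient step in two-step AC style:
    ascend on log pi(a|s) * A for the policy, then fit v(s|w)
    to the returns (the VE loss) for the value network."""
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()              # A = G - v(s|w)

    dist = torch.distributions.Categorical(logits=policy_net(states))
    policy_loss = -(dist.log_prob(actions) * advantages).mean()  # minimize -J
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    value_loss = F.mse_loss(values, returns)            # VE loss for v(s|w)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```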

2. Trust Region Policy Optimization

The Vanilla Policy Gradient algorithm described above divides training into multiple epochs. In each epoch, several episodes are sampled with the current policy model to form a batch, and multiple rounds of training are performed on that batch within the epoch.

Because the batch for an epoch is sampled before that epoch's updates, after several rounds of updates the policy being trained no longer matches the policy that generated the samples. The problem caused by this inconsistency is similar to the off-policy setting and can bias model training.

Although Vanilla Policy Gradient is on-policy, performing multiple rounds of training on the same batch effectively creates a mismatch between the target policy and the behavior policy. Trust Region Policy Optimization (TRPO) therefore adds importance sampling to the original objective, as in off-policy methods:

J(\theta) = E_{s,a\sim {\pi_{\theta^k }}}[\frac{\pi(a|s,\theta)}{\pi(a|s,\theta^k)}A^{\theta^k}(s,a)]

In the formula above, \theta^k denotes the parameters of the model from the previous round. To avoid the target policy \pi(a|s,\theta) drifting too far from the behavior policy \pi(a|s,\theta^k), which would increase the variance of the optimization objective, a KL-divergence constraint is added to keep \pi(a|s,\theta) within a small region around \pi(a|s,\theta^k), hence the name Trust Region:

D_{KL}(\pi(a|s,\theta )||\pi(a|s,\theta^k )) \leq \delta
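For discrete action spaces this constraint can be estimated from sampled states. A small sketch, assuming both policies are categorical distributions over actions and that the constraint is averaged over a batch of states (an assumption about how it is estimated in practice):

```python
import torch

def mean_kl(new_logits, old_logits):
    """D_KL(pi_theta || pi_theta_k), averaged over sampled states,
    following the argument order in the formula above."""
    new_dist = torch.distributions.Categorical(logits=new_logits)
    old_dist = torch.distributions.Categorical(logits=old_logits)
    return torch.distributions.kl_divergence(new_dist, old_dist).mean()
```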

Optimizing this objective under the constraint directly is difficult, so TRPO simplifies the formulas with a first-order approximation of the objective and a second-order approximation of the KL constraint:

\text{max}_{\theta}\ J(\theta) \approx g^T[\theta - \theta^k], \quad g=\partial_{\theta} E_{s,a\sim {\pi_{\theta^k }}}[\frac{\pi(a|s,\theta)}{\pi(a|s,\theta^k)}A^{\theta^k}(s,a)]\big|_{\theta=\theta^k} \\ s.t. \ D_{KL} \approx \frac{1}{2}[\theta - \theta^k]^T H[\theta - \theta^k] \leq \delta

TRPO also gives the update formula for the policy parameters, in which \alpha^j is used to enforce the KL-divergence constraint. A backtracking line search is used here: several candidate factors \alpha^j are tried during training, and the largest one that still satisfies the constraint is selected, so that the gradient update is as large as possible while remaining feasible:

\theta_{t+1}=\theta_{t} + \alpha^j \sqrt{\frac{2\delta }{g^T_tH^{-1}g_t}}H^{-1}g_t
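A sketch of this update with backtracking line search, assuming the gradient g and the product H^{-1}g have already been computed (in practice via conjugate gradient); the backtracking coefficient, the number of backtracks, and the callback names kl_after_step / improvement_after_step are illustrative assumptions.

```python
import numpy as np

def trpo_step(theta, g, H_inv_g, delta, kl_after_step, improvement_after_step,
              backtrack_coef=0.5, max_backtracks=10):
    """theta_{t+1} = theta_t + alpha^j * sqrt(2*delta / (g^T H^{-1} g)) * H^{-1} g,
    shrinking alpha^j = backtrack_coef**j until the KL constraint holds
    and the surrogate objective improves."""
    max_step = np.sqrt(2.0 * delta / (g @ H_inv_g)) * H_inv_g
    for j in range(max_backtracks):
        candidate = theta + (backtrack_coef ** j) * max_step
        if kl_after_step(candidate) <= delta and improvement_after_step(candidate) > 0:
            return candidate
    return theta  # no acceptable step found; keep the old parameters
```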

3. Proximal Policy Optimization 

A big problem with TRPO is that when the KL constraint is added to the objective as a penalty term, the penalty weight is a hyperparameter whose setting affects the overall training result. Although TRPO's backtracking line search can select this hyperparameter somewhat adaptively, it still hurts the overall training effect.

PPO therefore argues that since the KL constraint only exists to keep \pi(a|s,\theta) from changing too much relative to \pi(a|s,\theta^k), and the optimization objective already contains both terms, the constraint can be folded into the objective itself: the ratio r(\theta|\theta_k) is truncated (clipped) to a fixed range, achieving an effect similar to the KL-divergence constraint.

r(\theta |\theta_k)=\frac{\pi(a|s,\theta)}{\pi(a|s,\theta_k)}

J(\theta) = E_{s,a\sim {\pi_{\theta^k }}}[\min(r(\theta|\theta_k)A^{\theta^k}(s,a),\ \text{clip}(r(\theta|\theta_k),1-\epsilon, 1+\epsilon)A^{\theta^k}(s,a))]
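A minimal sketch of this clipped objective in PyTorch; old_log_probs are assumed to be stored from the sampling policy \pi(a|s,\theta^k), and \epsilon is the clip range (0.2 by default in the paper).

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate objective (minimized by the optimizer).

    ratio = pi(a|s, theta) / pi(a|s, theta_k), computed from log-probs.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)           # r(theta | theta_k)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Taking the minimum of the unclipped and clipped terms means the ratio earns no extra reward for moving outside [1-\epsilon, 1+\epsilon], which is what keeps the new policy close to the sampling policy.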

The paper also shows experimentally that the clipping method adopted by PPO outperforms both the fixed-coefficient and the adaptive-coefficient KL-penalty variants.

Origin blog.csdn.net/tostq/article/details/131216089