[CHANG - reinforcement learning notes] p1-p2, PPO

1. Policy gradient review

The core quantity is the expected reward, i.e. the reward weighted by the probability of each trajectory:
$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}[R(\tau)]$
  PPO is an improved version of the policy gradient, so we first review the policy gradient and the two tricks it introduces. Policy gradient background: we have N trajectories of data and use them to optimize the agent, i.e. the policy π. Each trajectory is:
                 τ = {s1, a1, r1, s2, a2, r2, ..., sT, aT, rT}

Differentiating the expected reward gives the stepping value for the agent's parameters:
$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\!\left[R(\tau)\,\nabla \log p_\theta(\tau)\right] \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta(a_t^n \mid s_t^n)$
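  To make the update concrete, here is a minimal PyTorch sketch of this vanilla policy-gradient step. The network architecture and the names (PolicyNet, vanilla_pg_loss) are my own illustrative choices, not from the notes.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """A tiny policy for a discrete action space; architecture is illustrative."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, states):
        return torch.distributions.Categorical(logits=self.net(states))

def vanilla_pg_loss(policy, states, actions, trajectory_returns):
    """R(tau^n) * grad log p_theta(a_t^n | s_t^n), averaged over all steps.

    states:  (N_steps, state_dim) tensor of visited states
    actions: (N_steps,) tensor of the actions actually taken
    trajectory_returns: (N_steps,) tensor where every step of trajectory n
        carries the same total reward R(tau^n).
    """
    log_probs = policy(states).log_prob(actions)
    # Minimizing the negative weighted log-likelihood ascends the gradient above.
    return -(trajectory_returns * log_probs).mean()
```

Calling .backward() on this loss and taking an optimizer step ascends the sampled estimate of $\nabla \bar{R}_\theta$.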
In this form every action in a trajectory contributes to the parameter update with the same weight R(τ), which is somewhat unreasonable, so the following improvements are made:
 
  Improvement 1: the agent can take different actions in the same state/scene, but because the amount of data is limited, we may not sample every possibility, and actions that merely went unsampled would see their probability pushed down (falling behind through no fault of their own). The fix is to subtract a baseline from the total reward of each trajectory, so that only trajectories whose reward exceeds the baseline have their actions reinforced. Take table tennis as an example: if the agent scores 6 points or fewer we say its decisions were bad, and only scores above 7 strengthen the corresponding decisions. This alleviates the unsampled-action problem. The choice of baseline is not fixed; the mean reward of the data can be used, for example.
  
  Improvement 2: take Chinese history as an example. Many dynasties had periods of prosperity and periods of decline, and we judge a dynasty both as a whole and locally. Take the Han dynasty: it is generally considered powerful, but it does not follow that every Han emperor was great and that later generations should study each one's character, because the dynasty had wise rulers as well as foolish ones. An emperor's merit should be judged by what happened from his reign onward, by his own policies and his descendants, not by what came before him; should everything still be credited to Liu Bang? That is the first point. The second point is that a person's influence should also carry a decay coefficient: if an unworthy descendant appears many generations later, the blame cannot be laid on this emperor's head. In reward terms: the credit assigned to an action at time t is the sum of rewards from time t onward, discounted by a factor γ for rewards far in the future.
  
  To sum up, the policy gradient is modified to the following form:
$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t'=t}^{T_n}\gamma^{\,t'-t}\, r_{t'}^{\,n} - b\Big)\nabla \log p_\theta(a_t^n \mid s_t^n)$
  The weighting term (discounted return minus baseline) is usually written as the advantage function $A^{\theta}(s_t, a_t)$. It characterizes how good taking action $a_t$ in state $s_t$ is relative to the other choices; in the analogy, whom the throne is better passed to.
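  As a small sketch of the two tricks combined, the helpers below (hypothetical names rewards_to_go and advantages) compute the discounted reward-to-go for every step and subtract the batch mean as the baseline:

```python
# Discounted reward-to-go per step, minus a simple mean baseline, giving a
# crude advantage estimate; gamma is the decay coefficient from Improvement 2.
def rewards_to_go(rewards, gamma=0.99):
    """sum_{t' >= t} gamma^(t'-t) * r_{t'} for every step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def advantages(rewards, gamma=0.99):
    rtg = rewards_to_go(rewards, gamma)
    baseline = sum(rtg) / len(rtg)   # baseline choice is not fixed; the mean is used here
    return [g - baseline for g in rtg]

# Only steps whose discounted return beats the baseline get a positive weight.
print(advantages([0.0, 0.0, 1.0, 0.0, -1.0]))
```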

2. From on-policy to off-policy

First, an everyday analogy: on the 8th, A takes the subway from Nanjing station to Fuqiao; on the 9th, B plans to take the same train between the same two stations. If the ride from Nanjing station to Fuqiao takes 10 minutes, then B can refer to A's travel time to arrange his own trip sensibly.

A adjusting his own plan the next time he rides, based on his own trip, is called on-policy; B planning from A's experience is called off-policy. Mapped onto reinforcement learning, the policy gradient is an on-policy method: once a batch of data has been used to update the agent, that data cannot be used again, and new data has to be collected with the new agent, over and over, which is inefficient. This is why the off-policy approach is introduced. Below we first cover the theoretical basis of off-policy learning, importance sampling, and then its application to the policy gradient.

2.1 importance sampling

$E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^i), \quad x^i \sim p(x)$
  p is the PDF of x and we want the expectation of f(x). If computing it by integration is inconvenient, it can be approximated by sampling: draw x from p, evaluate f(x), and the average of those values is the expectation of f(x). Further, if sampling from p(x) is inconvenient and we can only sample from q(x), the calculation turns into the identity below: sample x from q(x), substitute each sample into f, p and q, and the weighted mean again gives the expectation of f(x).
$E_{x\sim p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\, q(x)\, dx = E_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$
The catch with sampling: when the number of samples is insufficient, the computed value can deviate badly. For example:
(figure: p(x) concentrated where f(x) < 0, q(x) concentrated where f(x) > 0)
  Under the distribution p(x), the expectation of f(x) should clearly be negative. With enough samples, importance sampling still guarantees this, even when p and q differ a lot. But when too few samples are drawn from q, the estimated expectation of f may come out positive:
(figure: with only a few samples drawn from q, all landing where f(x) > 0, the weighted average is positive)
  So two things must be kept in mind when using importance sampling: first, the number of samples must be large enough; second, the two distributions should be as close as possible.
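  A quick numerical sketch of both caveats, with distributions chosen purely for illustration: p = N(-1, 1), q = N(+1, 1) and f(x) = x, so the true expectation under p is -1.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def is_estimate(n_samples):
    x = rng.normal(loc=1.0, scale=1.0, size=n_samples)   # samples from q
    weights = normal_pdf(x, -1.0) / normal_pdf(x, 1.0)    # importance weights p(x)/q(x)
    return np.mean(x * weights)                           # mean of f(x) * p(x)/q(x)

for n in (10, 1_000, 1_000_000):
    print(n, is_estimate(n))
# With only 10 samples the estimate is unreliable (it can even be positive);
# with enough samples it approaches the true value of -1.
```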

2.2 off-policy

Now we can analyze off-policy. Suppose θ' is the agent responsible for generating the experimental data, and θ' has since been updated to θ. How can we still use the old experimental data? Clearly:
$\nabla \bar{R}_\theta = E_{\tau\sim p_\theta(\tau)}\!\left[R(\tau)\,\nabla\log p_\theta(\tau)\right] = E_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\,\nabla\log p_\theta(\tau)\right]$
  That is, for the new agent we still use the previous data τ; the only difference is that the reward is multiplied by a coefficient. If the new agent is more likely to produce this trajectory, i.e. the numerator is larger than the denominator, the stepping value is amplified. This is easy to understand: if the numerator were twice the denominator, then collecting data with the new agent would make τ show up roughly twice as often, so it would be reinforced twice. In other words, we can now reuse τ.
$\nabla \bar{R}_\theta = E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\, A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\right]$
  This part was not explained very clearly in the lecture, so I record my own understanding of how the formula above comes about. Following the earlier treatment we have:
$\nabla \bar{R}_\theta = E_{(s_t,a_t)\sim\pi_{\theta}}\!\left[A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\right]$
  so:
$\nabla \bar{R}_\theta = E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}\, A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\right]$
  which is to say, assuming the state distributions of the two agents are close, $p_\theta(s_t)\approx p_{\theta'}(s_t)$:
$\nabla \bar{R}_\theta = E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\, A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\right]$
  The final objective function is:
$J^{\theta'}(\theta) = E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\, A^{\theta'}(s_t,a_t)\right]$
But why the objective function takes exactly this form, and what the gradient above really means, is still not entirely clear to me. I hope to come back later and fill this in.
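  As a concrete reading of $J^{\theta'}(\theta)$, here is a minimal PyTorch sketch of the surrogate objective: the old log-probabilities and advantages come from data collected with θ', and only the new policy's log-probabilities carry gradients. The function and variable names are my own, for illustration only.

```python
import torch

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """J^{theta'}(theta): importance ratio times the advantage, averaged.

    new_log_probs: log p_theta(a_t|s_t), differentiable w.r.t. theta
    old_log_probs: log p_theta'(a_t|s_t), recorded when the data was collected
    advantages:    A^{theta'}(s_t, a_t), computed from the old agent's data
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages.detach()).mean()
```

Differentiating this objective reproduces the gradient above, since $\nabla_\theta \frac{p_\theta}{p_{\theta'}} = \frac{p_\theta}{p_{\theta'}}\,\nabla_\theta \log p_\theta$.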

2.3 PPO

We noted above the caveat for importance sampling: the two distributions must not differ too much. Now we build that into the algorithm. The implementation adds a constraint to the objective function:
$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta')$
  J is the objective we want to optimize, and its gradient determines the magnitude and direction of the agent's update. With the divergence term added, if the two agents differ too much the reward signal is weakened: suppose J is a large positive number, then after the regularization it becomes much smaller, meaning the two agents are so different that the reward is not a reliable reference, and the resulting update magnitude drops. (This does seem to raise a problem, though: if J was negative to begin with, the explanation no longer goes through.)
  Adjusting β: set two thresholds and update it according to the following rule:
If $KL(\theta, \theta') > KL_{\max}$, increase β; if $KL(\theta, \theta') < KL_{\min}$, decrease β.
  When the divergence is too large, increase the penalty term to weaken the influence of the old data; when the divergence is very small, lower the penalty so that it does not interfere with learning. Only when the two distributions are close does J itself dominate the update.
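  A sketch of how this might look in code: the same ratio-times-advantage surrogate as above, now with a β-weighted KL penalty, plus the two-threshold rule for β. The KL estimate and the threshold values are illustrative assumptions, not the lecture's exact recipe.

```python
import torch

def ppo_kl_objective(new_log_probs, old_log_probs, advantages, beta):
    """J_PPO(theta) = surrogate objective minus beta * KL penalty."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = (ratio * advantages).mean()
    # Crude sample-based estimate of KL(theta' || theta) over the taken actions.
    approx_kl = (old_log_probs - new_log_probs).mean()
    return surrogate - beta * approx_kl

def adapt_beta(beta, measured_kl, kl_min=0.005, kl_max=0.02):
    if measured_kl > kl_max:   # policies drifted too far apart: penalize more
        return beta * 2.0
    if measured_kl < kl_min:   # policies very close: relax the penalty
        return beta / 2.0
    return beta
```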

2.3.1 PPO algorithm

(figure: the PPO algorithm from the slide: initialize θ^0; in each iteration, collect data with θ^k and compute the advantages A^{θ^k}(s_t, a_t); then update θ several times by optimizing J_PPO(θ) = J^{θ^k}(θ) - β KL(θ, θ^k), adjusting β with the rule above)
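  Putting the pieces together, an illustrative training loop in the spirit of the figure. It assumes the helpers sketched earlier in these notes (PolicyNet, advantages, ppo_kl_objective, adapt_beta) and a gym-style environment; none of this is the lecture's reference code.

```python
import numpy as np
import torch

def collect_trajectory(env, policy):
    """Run one episode with the current policy theta^k and record everything."""
    states, actions, rewards, log_probs = [], [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        dist = policy(torch.as_tensor(s, dtype=torch.float32))
        a = dist.sample()
        s_next, r, terminated, truncated, _ = env.step(a.item())
        states.append(s); actions.append(a.item()); rewards.append(r)
        log_probs.append(dist.log_prob(a).item())
        s, done = s_next, terminated or truncated
    return states, actions, rewards, log_probs

def ppo_iteration(env, policy, optimizer, beta, gamma=0.99, epochs=5):
    states, actions, rewards, old_log_probs = collect_trajectory(env, policy)
    states_t = torch.tensor(np.asarray(states), dtype=torch.float32)
    actions_t = torch.as_tensor(actions)
    old_lp = torch.tensor(old_log_probs, dtype=torch.float32)
    adv = torch.tensor(advantages(rewards, gamma), dtype=torch.float32)
    for _ in range(epochs):                      # reuse the same batch several times
        new_lp = policy(states_t).log_prob(actions_t)
        loss = -ppo_kl_objective(new_lp, old_lp, adv, beta)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    measured_kl = (old_lp - policy(states_t).log_prob(actions_t)).mean().item()
    return adapt_beta(beta, measured_kl)         # new beta for the next iteration
```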

2.3.2 PPO2 algorithm

$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t,a_t)} \min\!\left(\frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)}\, A^{\theta^k}(s_t,a_t),\ \mathrm{clip}\!\left(\frac{p_\theta(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta^k}(s_t,a_t)\right)$
  What does this mean? In the two figures below the horizontal axis is the ratio of the two probabilities; the green curve is the original function, i.e. the first term inside the min, and the blue curve is the clipped version, i.e. the second term. The objective takes the smaller of the green and blue lines. In the end it is just another way of adjusting the weight given to each sample.
(figure: the two terms inside the min plotted against the probability ratio, one panel for A > 0 and one for A < 0)
  When A > 0 the reward is positive, indicating a good action. As long as the ratio grows within a certain range, the new agent really is more likely to sample this kind of data, and the function lets the obtained reward increase linearly, giving a larger update step; but when the ratio is too large, the two distributions differ too much and the reward's confidence is low, so an upper limit is set to keep the update step from growing too much.
  
  When A < 0 the reward is negative, indicating a bad action. As long as the ratio shrinks within a certain range, the new agent really is less likely to sample this kind of data, so the function lets the reward it receives move gradually toward 0, weakening the influence of the old experimental data; but when the ratio is too small, the two distributions differ too much and the reward's confidence is low, so a lower limit is set to keep the update step from becoming too small (with too small a step, nothing is learned).
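  For completeness, a sketch of the clipped objective itself; ε = 0.2 is just a common illustrative value, not prescribed by the notes.

```python
import torch

def ppo2_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the smaller (more pessimistic) of the two terms for every (s_t, a_t).
    return torch.min(unclipped, clipped).mean()
```

In the training loop above, one would simply swap ppo_kl_objective for ppo2_objective; there is no β or KL term left to tune.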
