One, Policy Gradient Review
G is the total reward obtained after taking action At at state St. But G is itself a sampled quantity: its value can fluctuate significantly from one episode to another. With enough data this is not a problem, but when data is scarce training becomes very noisy, so we would rather use the expected value than the sampled value. That is to say, we train a network that takes s as input and outputs the expected reward.
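To make the fluctuation concrete, here is a minimal sketch of computing the sampled return G for an episode (the reward lists and the discount factor 0.99 are made-up values for illustration, not from the original notes):

```python
# Minimal sketch: G at step t is the discounted sum of rewards from t onward.
# Because it is a single sampled value, it can differ a lot between episodes.
def discounted_returns(rewards, gamma=0.99):
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

print(discounted_returns([0, 0, 1]))   # roughly [0.98, 0.99, 1.0]
print(discounted_returns([0, 0, -1]))  # roughly [-0.98, -0.99, -1.0]
```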
Two, Q-learning Review
V evaluates how good a situation (state) is; Q guides the choice of action.
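Spelled out with the standard definitions (my notation, not from the original notes): V scores a state, Q scores a state-action pair.

```latex
V^{\pi}(s)   = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s,\ a_t = a \,\right]
```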
Three, Actor-Critic
That is, the fluctuating weighting term used before is now computed by two networks: Q characterizes the value of the currently chosen action, V characterizes the average value, so their difference can be positive or negative. The difficulty is that two networks have to be trained; how can this be simplified?
Express Q in terms of V: in theory this requires taking an expectation, but now only the one-step reward introduces fluctuation, so to simplify the problem the expectation is dropped. This gives the advantage Rt + V(St+1) − V(St).
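Written out (this is the standard advantage actor-critic simplification; the symbols are mine):

```latex
A^{\pi}(s_t, a_t)
  = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
  \approx r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)
```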
First, let the agent interact with the environment to collect a batch of data, then use MC or TD to train Vπ. Once V is trained well, it can be plugged into the formula above to update the agent.
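As one possible illustration of the TD option, here is a minimal TD(0) sketch for training V (the dict-based value table, learning rate, and dummy batch are assumptions for illustration, not the author's code):

```python
# TD(0) sketch: move V(s) toward the one-step target r + gamma * V(s').
def td0_update(V, batch, alpha=0.1, gamma=0.99):
    for s, r, s_next, done in batch:          # (state, reward, next state, done) tuples
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

V = td0_update({}, [("s0", 0.0, "s1", False), ("s1", 1.0, None, True)])
```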
This is the "investigate first, then speak" principle: no investigation, no right to speak. Two tips:
1. Both the actor's behavior and the critic's evaluation stem from the same thing (the state), so the two can share part of the network.
2. We want the actor's output distribution to have larger entropy, which encourages exploration (both tips are illustrated in the sketch after this list).
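A minimal PyTorch sketch of both tips (the layer sizes, the discrete-action setup, and the 0.01 entropy coefficient are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # tip 1: shared trunk
        self.actor = nn.Linear(hidden, n_actions)   # policy head
        self.critic = nn.Linear(hidden, 1)          # value head

    def forward(self, obs):
        h = self.shared(obs)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

model = ActorCritic()
dist, value = model(torch.zeros(1, 4))
entropy_bonus = 0.01 * dist.entropy().mean()  # tip 2: add to the objective to keep exploring
```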
Finally, a word on A3C: it is a multithreaded (parallel) version of this method. Each worker copies the global parameters, interacts with its own environment, computes gradients, and pushes an update back to the global parameters (copy parameters → compute gradients → update the global parameters).
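A rough sketch of one worker's loop (assumptions: the `ActorCritic` class above, a hypothetical `collect_rollout_and_compute_loss` helper, and an optimizer over the global model's parameters; real A3C also handles asynchrony and locking, which is omitted here):

```python
def worker_loop(global_model, local_model, optimizer, env, n_updates=1000):
    for _ in range(n_updates):
        local_model.load_state_dict(global_model.state_dict())      # copy global parameters
        loss = collect_rollout_and_compute_loss(local_model, env)   # hypothetical helper
        optimizer.zero_grad()
        loss.backward()                                             # gradients on the local copy
        for gp, lp in zip(global_model.parameters(), local_model.parameters()):
            gp.grad = lp.grad                                       # push gradients to the global model
        optimizer.step()                                            # update the global parameters
```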
Four, Pathwise Derivative
Before getting into the specific technique, let me first use an example to roughly illustrate the idea by analogy. I couldn't think of a good real-world example, so I'll just make one up. Suppose there are two people, A and B, who are about to fight a battle; to win, A and B decide to cooperate. A is responsible for studying a large number of past battles and learning how good or bad different battle plans are in different situations, i.e. scoring the different plans. B is responsible for producing the best plan given the current battlefield situation.
The reason this counts as cooperation is that B needs A's evaluation before he can keep improving the plans he produces, while A needs the large number of battle examples that B generates. Of course there is also a third party, the environment: the reward returned by the environment is the very basis on which A can learn. Over time A keeps improving, and B keeps improving as well. In the end, as soon as B sees a situation, it can immediately find the best battle plan.
One more thing worth pointing out: since A can score plans, in theory A could generate the best plan by itself. That is indeed true, except that A would have to keep analyzing and simulating (gradient ascent) to come up with a good plan, whereas B outputs one directly. So we do not let A do this job; it amounts to trading manpower for time (on an actual computer this means trading space for time, since there is one extra network).
4.1 Introduction
Pathwise Derivative can be seen as a special kind of actor-critic, and also as a way to use Q-learning to solve continuous-action problems. We know that Q takes s as input and outputs the expected return of each possible a, but that only works in the discrete case. By comparison, Pathwise Derivative is cooler. Why? Because it directly outputs the best action.
The input is s, and the output is directly the best action. Let's look at how it works.
First, the agent sees St, takes At, receives reward Rt, and moves to the next state St+1; this transition is stored in a buffer.
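A minimal replay-buffer sketch (the capacity, the dummy transition, and the batch size are made-up values):

```python
import random
from collections import deque

buffer = deque(maxlen=10_000)                           # replay buffer
buffer.append((0.0, 1, 0.5, 0.1))                       # one (St, At, Rt, St+1) transition
batch = random.sample(buffer, k=min(32, len(buffer)))   # later: sample a batch for training
```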
Then a batch of data (si, ai, ri, si+1) is sampled from the buffer. For this batch there are two objectives:
1: Feed si+1 into the big network (the target network, which is already fixed) to produce a value; add this value to ri and use the sum as the regression target of Q at si and ai. This yields the gradient for Q.
2: Adjust π so that the action it outputs maximizes Q. This yields the gradient for π (see the sketch below).
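Here is a hedged sketch of these two objectives in the style of DDPG, the usual concrete instance of the pathwise derivative method (the network sizes, learning rates, and discount are assumptions, and the target networks would normally be slowly updated copies rather than the same objects):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 1, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_actor, target_critic = actor, critic  # placeholder: real code keeps delayed copies
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next):
    # Objective 1: regress Q(s, a) toward r + gamma * Q_target(s', pi_target(s')).
    with torch.no_grad():
        target = r + gamma * target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
    critic_loss = ((critic(torch.cat([s, a], dim=1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Objective 2: adjust pi so that the action it outputs maximizes Q.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

update(torch.zeros(4, obs_dim), torch.zeros(4, act_dim), torch.ones(4, 1), torch.zeros(4, obs_dim))
```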
4.2 Algorithm