[CHANG - reinforcement learning notes] p6, Actor-Critic

1. Policy Gradient Review

[Figure: policy gradient update formula (lecture slide)]
G is the cumulative reward obtained after seeing s_t and then taking a_t. However, G is a sampled value drawn from a distribution, and it can fluctuate a great deal between different episodes. With enough data this is not a problem, but when data is scarce the model becomes very unstable, so we would like to replace the sampled value with its expectation. In other words, train a network that takes the state s as input and outputs the expected reward.
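To make the fluctuation problem concrete, here is a minimal sketch of how the sampled return G_t is computed from one episode (the discount factor, function name and reward numbers are illustrative, not from the lecture):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one sampled episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Two episodes collected with the same policy can yield very different G_t,
# which is exactly the fluctuation described above.
print(discounted_returns([1.0, 0.0, 0.0, 10.0]))
print(discounted_returns([1.0, 0.0, 0.0, -10.0]))
```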

2. Q-learning Review

[Figure: definitions of Vπ(s) and Qπ(s, a) (lecture slide)]
Vπ(s) is an assessment of how good a situation is; Qπ(s, a) guides the selection of an action.
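As a minimal sketch of that distinction, assuming a discrete action space (PyTorch; the class names and layer sizes are illustrative, not the lecture's networks):

```python
import torch.nn as nn

class VNet(nn.Module):
    """State-value critic: 'how good is this situation?' -> a single scalar V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s)

class QNet(nn.Module):
    """Action-value critic: one value per discrete action, used to guide the choice."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)
```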

3. Actor-Critic

[Figure: actor-critic objective written with Qπ and Vπ (lecture slide)]
In other words, the fluctuating term from before is now computed by two networks: Q characterizes the value of the current choice, while V characterizes the average, so their difference can be positive or negative. The difficulty is that two networks now have to be trained. How can this be simplified?
[Figure: expressing the advantage in terms of Vπ only (lecture slide)]
Q is expressed in terms of V. In theory this requires taking an expectation, but only the single-step reward fluctuates, so to simplify the problem the expectation is dropped: Qπ(st, at) ≈ rt + Vπ(st+1), and the advantage becomes rt + Vπ(st+1) − Vπ(st). This gives:
[Figure: the simplified advantage rt + Vπ(st+1) − Vπ(st) (lecture slide)]
First, let the agent interact with the environment to collect a batch of data, then use MC or TD to train Vπ. Once V is trained, it is plugged into the equation above and used to update the agent (a sketch of one such update follows).
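Here is a minimal sketch of one update with the simplified advantage rt + γVπ(st+1) − Vπ(st), assuming the policy and value networks above, a single optimizer over both, and a batch of transitions where `done` is a 0/1 float tensor (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def actor_critic_update(policy, value, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One update step using advantage = r + gamma * V(s') - V(s)."""
    v_s = value(s).squeeze(-1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * value(s_next).squeeze(-1)
    advantage = target - v_s

    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    actor_loss = -(advantage.detach() * log_prob).mean()  # push up actions with positive advantage
    critic_loss = F.mse_loss(v_s, target)                 # TD regression for V (MC would regress on G_t instead)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```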
As the saying goes: investigate first, then speak; without investigation there is no right to speak. Two tips:
[Figure: actor-critic training tips (lecture slide)]
1. The actor (behavior) and the critic (evaluation) both work from the same underlying state, so the two can share part of the network (see the sketch after this list).
2. The actor's output distribution is encouraged to have larger entropy, which facilitates exploration.
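A minimal sketch of both tips, again assuming a discrete action space (the shared trunk, layer sizes and entropy weight are illustrative choices):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Tip 1: actor and critic share the first layers, then split into two heads."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, s):
        h = self.trunk(s)
        return self.policy_head(h), self.value_head(h)

def entropy_regularizer(logits, beta=0.01):
    """Tip 2: add -beta * entropy to the loss, so minimizing the loss raises entropy."""
    return -beta * torch.distributions.Categorical(logits=logits).entropy().mean()
```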
Finally, a brief mention of A3C: it is a multithreaded method in which each worker copies the global parameters, computes its own update locally, and then pushes the update back to the global model (copy parameters → update parameters).
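A conceptual sketch of one A3C worker, assuming a hypothetical `rollout_and_loss` helper that runs a few environment steps with the local model and returns the actor-critic loss (the real method runs many such workers asynchronously; this only shows the copy → compute → push cycle):

```python
def a3c_worker(global_model, local_model, global_optimizer, env, rollout_and_loss, n_updates):
    for _ in range(n_updates):
        # 1. copy the latest global parameters into the local model
        local_model.load_state_dict(global_model.state_dict())
        # 2. interact with the environment and compute the loss locally
        loss = rollout_and_loss(local_model, env)
        local_model.zero_grad()
        loss.backward()
        # 3. apply the locally computed gradients to the global parameters
        for g_param, l_param in zip(global_model.parameters(), local_model.parameters()):
            g_param.grad = l_param.grad
        global_optimizer.step()
        global_optimizer.zero_grad()
```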

4. Pathwise Derivative

Before getting into the concrete technique, here is a rough analogy for the idea. I could not think of a good real-world example, so here is a made-up one. Suppose there are two people, A and B, who are about to fight a battle and decide to cooperate in order to win. A is responsible for studying a large number of past battles and learning how good or bad different battle plans are in different situations, i.e. scoring the different plans. B is responsible for producing the best plan given the current battlefield situation.
It is called cooperation because B needs A's evaluations in order to keep improving his plan-making, while A needs the large number of battle records that B generates. Of course there is also a third party, the environment: the reward it returns is what allows A to learn in the first place. A keeps improving, and so does B. In the end, as soon as B sees a situation, he can immediately come up with the best battle plan.
One more point: since A can score plans, in theory A alone could generate the best plan. That is indeed true, but A would have to keep analysing and simulating (gradient ascent) to arrive at a good plan, whereas B outputs one directly. So A is not asked to do this; it amounts to trading manpower for time (on an actual computer, trading space for time, because one extra network is added).

4.1 Introduction

This is regarded as a special kind of actor-critic, and also as a way for Q-learning to handle continuous problems. We know that Q takes a state s and outputs the expected return of each action a, but only for the discrete case. By comparison, the pathwise derivative method is cooler. Why? Because it directly outputs the best action.
[Figure: actor network that maps s directly to the best action (lecture slide)]
The input is s, and the output is directly the best action. The principle is analysed below.
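As a minimal sketch of such an actor, assuming a continuous action space bounded in [−1, 1] (the tanh squashing and layer sizes are illustrative):

```python
import torch.nn as nn

class DeterministicActor(nn.Module):
    """Pathwise-derivative actor: input s, directly output a (continuous) action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # keep actions in [-1, 1]
        )

    def forward(self, s):
        return self.net(s)
```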
[Figure: pathwise derivative policy gradient training procedure (lecture slide)]
First, the agent sees s_t, takes a_t, receives the reward r_t, and moves to the next state s_{t+1}; the transition is stored in the buffer.
Then a batch of data (s_i, a_i, r_i, s_{i+1}) is sampled from the buffer. For this batch there are two objectives:
1. Feed s_{i+1} into the big network (the target network, which is fixed) to produce a value, add that value to r_i, and use the sum as the regression target for Q at (s_i, a_i). This gives the gradient for Q.
2. Adjust π so that the action it outputs maximizes Q. This gives the gradient for π. (A sketch of both objectives follows.)
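A minimal sketch of the two objectives on one sampled batch, assuming a critic `critic(s, a)` that takes both state and action, fixed target copies of both networks (the slides use a fixed target network; the exact DDPG-style form below is an assumption), and the deterministic actor sketched above:

```python
import torch
import torch.nn.functional as F

def pathwise_losses(actor, critic, target_actor, target_critic, s, a, r, s_next, gamma=0.99):
    # Objective 1: regress Q(s, a) toward r + gamma * Q_target(s', pi_target(s')); targets are fixed.
    with torch.no_grad():
        q_target = r + gamma * target_critic(s_next, target_actor(s_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), q_target)

    # Objective 2: adjust pi so that the action it outputs maximizes Q (gradient ascent on Q).
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```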

4.2 Algorithm

[Figures: the full pathwise derivative policy gradient algorithm (lecture slides)]
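Since the algorithm itself only appears here as slide images, the following is a hedged reconstruction of the overall loop in the style of the sketches above, reusing the `pathwise_losses` helper (the replay buffer handling, the old-style gym `env.step` interface, the soft target update rate `tau`, and the omission of exploration noise are all illustrative assumptions, not the slides' exact procedure):

```python
import random
import torch

def train(actor, critic, target_actor, target_critic, actor_opt, critic_opt, env,
          steps=10_000, batch_size=64, gamma=0.99, tau=0.005):
    buffer, s = [], env.reset()
    for _ in range(steps):
        with torch.no_grad():
            a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
        s_next, r, done, _ = env.step(a)           # exploration noise omitted for brevity
        buffer.append((s, a, r, s_next))
        s = env.reset() if done else s_next

        if len(buffer) < batch_size:
            continue
        batch = random.sample(buffer, batch_size)  # sample (s_i, a_i, r_i, s_{i+1}) from the buffer
        s_b, a_b, r_b, sn_b = (torch.as_tensor(x, dtype=torch.float32)
                               for x in map(list, zip(*batch)))

        critic_loss, actor_loss = pathwise_losses(
            actor, critic, target_actor, target_critic, s_b, a_b, r_b, sn_b, gamma)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # slowly move the fixed target networks toward the trained ones
        for t_net, net in ((target_actor, actor), (target_critic, critic)):
            for t_p, p in zip(t_net.parameters(), net.parameters()):
                t_p.data.mul_(1.0 - tau).add_(tau * p.data)
```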
