[CHANG - Reinforcement Learning Notes] An Introduction to Deep Reinforcement Learning

Bilibili lecture notes; instructor: Professor CHANG of National Taiwan University.
The goal of this post is to build a basic understanding of reinforcement learning; deeper study will follow later.

First, an intuitive understanding:

An intuitive picture: imagine fighting a war. The situation on the battlefield changes constantly, and officers and soldiers must make real-time decisions based on the current situation: sometimes attacking, sometimes retreating; sometimes feinting, sometimes committing the main force. This is how a war is fought to final victory. Now suppose the war is won. The commander will naturally reinforce the series of decisions made during the war, so that when a similar situation arises again, the earlier way of responding is strengthened. This may not be the optimal solution, but it is a feasible one. Reinforcement learning is about learning to win the battle, and the bigger the reward the better.
Modeling analysis:

Figure 1.1 reinforcement learning structure diagram

  As shown in Figure 1.1, the agent receives an observation (or state) from the environment as input, its behavior policy π takes an action, and the environment feeds a reward back to the agent; the goal of the model is to obtain the largest total reward.

Second, classification of reinforcement learning:

1. Model-free: policy-based, value-based, and A3C (actor + critic)
2. Model-based

Third, characteristics of reinforcement learning:

**1. Reward delay:** some rewards are not immediate, i.e. Rt = 0 at many steps, yet actions that earn no immediate reward may still be essential for maximizing the reward of the whole process (episode). This requires the agent to be far-sighted, which is achieved by considering the cumulative reward.
**2. Sequential decisions:** earlier actions affect later ones, i.e. reinforcement learning is a sequential decision process. This requires the agent to balance exploration and exploitation: exploitation means learning from experience for stability, while exploration means trying new things in order to find a better agent.

Fourth, a practical analysis of reinforcement learning

Reinforcement learning can be broken down into three steps:

4.1 Determine the form of the agent

The traditional method is to use a look-up table, but exhaustively enumerating states is a major problem. If we instead use a neural network such as a CNN or RNN as the model, we get the deep reinforcement learning introduced later. Taking the Space Invaders game as an example, a CNN is used as the model as shown in Figure 4.1: the agent sees the pixels and decides to move left, move right, or fire. It is worth noting that the agent does not necessarily take the action with the largest probability; it samples an action according to the probabilities. This is also the basis of exploration.

Figure 4.1 The agent for Space Invaders
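
To make the "network as agent" idea concrete, here is a minimal sketch, not taken from the lecture, of a tiny linear softmax policy over a flattened screen. The class name, sizes, and the three-action space {left, right, fire} are illustrative assumptions.

```python
import numpy as np

class PolicyAgent:
    """A tiny linear softmax policy: screen pixels -> probabilities over
    the actions {left, right, fire}."""

    def __init__(self, n_inputs, n_actions=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_actions, n_inputs))

    def action_probs(self, state):
        # numerically stable softmax over the action scores
        scores = self.W @ state
        scores -= scores.max()
        e = np.exp(scores)
        return e / e.sum()

    def act(self, state):
        # sample an action according to its probability instead of always
        # taking the arg-max -- this stochasticity is what gives exploration
        p = self.action_probs(state)
        return int(self.rng.choice(len(p), p=p)), p
```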

4.2 Evaluation Methods

τ denotes the sequence of observations, actions, and rewards that make up one complete decision process (an episode), that is:

    τ = {s1, a1, r1, s2, a2, r2, ..., sT, aT, rT}

If the task is playing a game, τ is a complete record of one play of the game. We can then compute the total reward of the whole episode:

    R(τ) = r1 + r2 + ... + rT

The expectation of R is:

    R̄_θ = Σ_τ R(τ) P(τ | θ)

In theory, computing this expectation requires enumerating every possible sequence τ, obtaining its probability, and summing according to the formula. This is usually infeasible, so the expectation is approximated by sampling: let the agent play the game N times, then

    R̄_θ ≈ (1/N) Σ_{n=1}^{N} R(τ^n)

This simplification, replacing the weighted sum by sampling, will appear again later. So we can simply let the agent play the game N times and take the average reward as the expectation.
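
As a small illustration of this sampling approximation, the sketch below plays N episodes and averages their total rewards. The env object with reset()/step() returning (state, reward, done) is an assumed, gym-like interface, and agent.act follows the PolicyAgent sketch from section 4.1.

```python
def episode_return(env, agent, max_steps=1000):
    """Play one episode and return R(tau) = r1 + r2 + ... + rT."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action, _ = agent.act(state)
        state, reward, done = env.step(action)   # assumed env interface
        total_reward += reward
        if done:
            break
    return total_reward

def estimate_expected_return(env, agent, n_episodes=100):
    """Approximate the expected reward by the average over N sampled episodes."""
    returns = [episode_return(env, agent) for _ in range(n_episodes)]
    return sum(returns) / n_episodes
```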

4.3 Optimization

Notation: θ denotes the parameters of the agent (the actor), τ^n denotes the n-th sampled sequence, and R̄_θ denotes the expected total reward under parameters θ.

The problem above can therefore be cast as a reward-maximization problem, namely:

    θ* = arg max_θ R̄_θ

That is, by adjusting the agent's parameters θ we want the final expected reward to be as large as possible. So we naturally differentiate the expected reward with respect to θ:

    ∇R̄_θ = Σ_τ R(τ) ∇P(τ | θ)

Continuing by splitting the probability of τ into per-step terms (and using ∇P = P ∇log P):

    ∇R̄_θ = Σ_τ R(τ) P(τ | θ) ∇log P(τ | θ)
          ≈ (1/N) Σ_{n=1}^{N} R(τ^n) ∇log P(τ^n | θ)
          = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{Tn} R(τ^n) ∇log p(a_t^n | s_t^n, θ)

  Understanding: from the last two equations, used as the parameter-update rule, we can see that when the reward R(τ^n) of a decision process is positive, that decision process is strengthened; otherwise it is weakened. Dividing by the probability (which is what turns ∇p into ∇log p) removes the preference for frequent actions, because the most frequent actions do not necessarily carry the largest reward.
  When the reward is always non-negative, a problem arises: every process that gets sampled is strengthened (only by different amounts), which is equivalent to weakening the processes that were not sampled. The remedy is to subtract a baseline b:

    ∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{Tn} (R(τ^n) − b) ∇log p(a_t^n | s_t^n, θ)

Fifth, policy gradient

Combining the above, the agent's update rule is:

    θ_new = θ_old + η ∇R̄_θ ≈ θ_old + η (1/N) Σ_n Σ_t R(τ^n) ∇log p(a_t^n | s_t^n, θ_old)

where η is the learning rate.

Understanding the differentiated term:
  The probability p is a positive number between 0 and 1, so log p is negative, but the derivative of the logarithm is positive everywhere.
  If the reward is positive, we enter the strengthening channel: in state st the agent moves one step towards at. The step size is proportional to the learning rate and to the reward, and inversely proportional to the probability of the action; in other words, the rarer the action, the larger the step, because the derivative of log x grows as x approaches 0, and the larger the reward obtained, the larger the step.
  If the reward is negative, we enter the weakening channel: in state st the agent moves one step away from at. As before, the step size is proportional to the learning rate and the magnitude of the reward, and inversely proportional to the probability of the action.
Comparison with ordinary deep learning: take image classification as an example, as shown in Figure 5.1.
            Figure 5.1 Understanding policy gradient
Comparative analysis:
  The figure uses a CNN for classification. Suppose the label is "left"; then, according to the cross-entropy loss, since ŷ has a single dimension equal to 1 and the rest equal to 0, training is equivalent to maximizing log p("left" | s). The corresponding update formula is shown at the bottom of the figure.
  Suppose deep learning iterates for N epochs over T images in total, while reinforcement learning iterates for M rounds, with the reward expectation of each round estimated from N episodes; we can then compare the total update the agent accumulates during training in the two settings.
  It can be seen that the innovation of reinforcement learning is the extra reward term, which is an overall evaluation of the n-th decision process: if the agent performs well, the parameters step forward; if it performs badly, they step backward. If we extract individual frames from the game, then from a micro point of view the deep-learning formula treats each image independently: no matter how the order of the images is shuffled, the final model parameters do not change. Reinforcement learning, in contrast, considers the whole sequence, and the result is absolutely tied to the order in which the frames appear. My understanding of reinforcement learning is still limited; my tentative conclusion is that reinforcement learning takes the context, the temporal order, and the environment into account.
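
The comparison can be made concrete with a REINFORCE-style update for the linear softmax policy sketched in section 4.1. This is a generic sketch under my own assumptions (a whole-episode reward R(τ) and an average-return baseline), not the exact formula on the lecture slide.

```python
import numpy as np

def grad_log_policy(W, state, action):
    """Gradient of log p(action | state) for a linear softmax policy W."""
    scores = W @ state
    p = np.exp(scores - scores.max())
    p /= p.sum()
    grad = -np.outer(p, state)    # -p_k * state for every action k ...
    grad[action] += state         # ... plus +state for the action actually taken
    return grad

def reinforce_update(W, episodes, lr=0.01):
    """episodes: list of (trajectory, total_reward), trajectory = [(state, action), ...].
    Applies W <- W + lr * (1/N) * sum_n sum_t (R(tau^n) - b) * grad log p(a_t | s_t)."""
    baseline = sum(R for _, R in episodes) / len(episodes)   # simple baseline b
    grad = np.zeros_like(W)
    for trajectory, R in episodes:
        for state, action in trajectory:
            grad += (R - baseline) * grad_log_policy(W, state, action)
    return W + lr * grad / len(episodes)
```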
  
So far, the following questions remain:
  1. Reinforcement learning and RNNs are both sequential models; how do the two compare?
  2. For the game example, how should the process be divided, i.e., how are the time slices segmented?

Sixth, value-based learning: learning a critic

The previous sections described how an actor learns. This section introduces the value-based approach, which learns a critic. Unlike an actor, a critic does not output actions; instead, for the current actor (denoted π), it estimates the expected total reward that can still be obtained in the remainder of the episode after seeing a given scenario. As an analogy: faced with the same problem, it evaluates how well different people would solve it. In other words, a critic is always tied to a particular actor.

A critic therefore evaluates how well a given agent fits the current problem. How do we train a critic? There are two main methods:
  1. Monte-Carlo based approach
  2. Temporal-difference approach

6.1 Monte-Carlo based approach

The idea is to let the machine review past episodes of the game, so that the agent learns the expected reward it will receive in the future after seeing state Sa. The process is shown in Figure 6.1:
              Figure 6.1 A Monte-Carlo based critic
   As can be seen, the Monte-Carlo based method needs the reward of the entire episode; when episodes are very long, this makes learning very slow. The method below addresses this problem.
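
A minimal sketch of the Monte-Carlo idea, assuming episodes are recorded as lists of (state, reward) pairs generated by a fixed actor π; Vπ(s) is simply the average of the total reward observed from s to the end of the episode.

```python
from collections import defaultdict

def monte_carlo_values(episodes):
    """episodes: list of [(state, reward), ...] produced by a fixed actor pi.
    V_pi(s) is estimated as the average total reward observed after seeing s."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (state, _) in enumerate(episode):
            future_return = sum(rewards[t:])   # reward from s until the end
            totals[state] += future_return
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```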

6.2 Temporal-difference approach

By considering only two consecutive scenarios, the model can be trained directly. The framework is shown in Figure 6.2:
            Figure 6.2 A temporal-difference critic
  Taking the reward as the boundary, the scenario before it and the scenario after it are both fed to the critic; the difference between the two outputs should then equal that reward.
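
A minimal temporal-difference (TD(0)) sketch using the same dictionary representation: only st, rt, and st+1 are needed to nudge V(st) towards rt + V(st+1). The step size α and discount γ are assumed hyper-parameters; the notes above use no discount, which corresponds to γ = 1.

```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One temporal-difference step: move V(s_t) towards r_t + gamma * V(s_{t+1})."""
    v_s = V.get(state, 0.0)
    v_next = V.get(next_state, 0.0)          # V(END) defaults to 0
    V[state] = v_s + alpha * (reward + gamma * v_next - v_s)
    return V
```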
  
Comparison of the two methods:
  Example: there are 8 episodes; one of them is Sa, r = 0, Sb, r = 0, END; six of them are Sb, r = 1, END; and one is Sb, r = 0, END.
  With the first method, Vπ(Sb) = 6/8 = 3/4 and Vπ(Sa) = 0 (the statistics are computed over all episodes). With the second method, setting V(END) = 0 gives Vπ(Sa) = Vπ(Sb) = 3/4. The results differ because Sa is rare: for a rare scenario, the Monte-Carlo method treats the special case in isolation, while the temporal-difference method blends the special case into the common ones. I think the difference method is more reasonable: after all, the episodes show that the probability of obtaining a reward after Sb is high, so the preceding Sa should also be credited with some value.
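
A quick check of the Sa/Sb numbers, reusing monte_carlo_values from the Monte-Carlo sketch above; the exact set of 8 episodes is my reconstruction of the missing figure, so treat it as an assumption.

```python
# Assumed reconstruction of the 8 episodes, as (state, reward) pairs.
episodes = ([[("Sa", 0), ("Sb", 0)]]      # the single episode that visits Sa
            + [[("Sb", 1)]] * 6
            + [[("Sb", 0)]])

print(monte_carlo_values(episodes))       # {'Sa': 0.0, 'Sb': 0.75}

# Temporal-difference view, with V(END) = 0:
# V(Sa) = r + V(Sb) = 0 + 3/4 = 3/4, so the rare state Sa also gets credit.
```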

6.3 State-action value function

Another kind of critic does not estimate how many points the agent can still get after a scenario, but instead gives the cumulated reward the agent can obtain for each possible action taken in that scenario. The ordinary critic evaluates the situation; this one ranks the choices, as shown in Figure 6.3.
              Figure 6.3 State-action value function
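
A toy illustration of a state-action critic stored as a table: Q gives one cumulated-reward estimate per action, and the suggested "choice" is the action with the largest value. The states, actions, and numbers are made up.

```python
def greedy_action(Q, state, actions):
    """The choice suggested by a state-action critic: the action with the
    largest estimated cumulated reward Q(s, a)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# A hand-made Q table for a single scenario s0 (values are made up):
Q = {("s0", "left"): 1.2, ("s0", "right"): 0.4, ("s0", "fire"): 2.1}
print(greedy_action(Q, "s0", ["left", "right", "fire"]))   # -> "fire"
```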

Seventh, Q-learning

I find "copycat" a fitting description of Q-learning. Before the formal statement, here is an analogy. Two students are working on the same set of multiple-choice questions. Student A has practised on many past papers, so every time he sees a question he reasons it through and then answers. Student B knows nothing, but for every question he asks A, "Which option is most likely correct here?", and we assume A always answers honestly. We can then conclude that B's score is greater than or equal to A's. The formal statement is given in Figure 7.4.
             Figure 7.4 The idea behind Q-learning
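
Since Figure 7.4 is not reproduced here, the sketch below shows the standard tabular Q-learning update as a stand-in: the critic Q is improved from experience, and the greedy policy argmax_a Q(s, a) plays the role of student B, who simply follows the critic's best answer. α and γ are assumed hyper-parameters.

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Tabular Q-learning: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((next_state, a2), 0.0) for a2 in actions)
    q_sa = Q.get((state, action), 0.0)
    Q[(state, action)] = q_sa + alpha * (reward + gamma * best_next - q_sa)
    return Q

def greedy_policy(Q, state, actions):
    """Student B: always pick the answer the critic currently rates highest."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```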

Eighth, Actor-Critic

I did not fully understand this part, nor the later Pathwise Derivative Policy Gradient, so I will leave a placeholder here and come back to it.

Ninth, Inverse Reinforcement Learning (IRL)

IRL is a form of imitation learning. Ordinary reinforcement learning starts from a reward function and learns an actor that obtains the largest reward. In many domains, however, it is hard to specify a reward function, for example chatbots or autonomous driving; in these cases IRL can be used. As an example, suppose we want to teach a robot to pour water. Because the robot arm's range of motion is very large, writing rules by hand is difficult; instead, we can first guide the robot through the pouring motion by hand and then let it imitate the demonstration. Figure 9.1 compares RL and IRL.
               Figure 9.1 Comparison of RL and IRL
               Figure 9.2 IRL applied to a video game
  Figure 9.2 illustrates IRL applied to a video game. First, there are already N game records from an expert; then the actor also plays the game N times; a reward function is constructed under which the teacher's (expert's) score is higher, and the actor and the reward function are then adjusted alternately and repeatedly.
  This concludes the introductory part.


Origin blog.csdn.net/weixin_43522964/article/details/104167241