[CHANG - reinforcement learning notes] p3-p5, Q_learning

一、Introduction to Q_learning

First of all, a simple way to understand Q_learning. Remember Niu Gensheng of Mengniu? He once said: do not ask how many things my two hands can do, ask how many hands it takes to move Mount Tai; do not ask how many jin of rice one pot can cook, ask how many pots it takes to feed a thousand troops; do not ask how much road one lamp can light, ask how many lamps it takes to light up the world. This passage came to mind because it shares the same spirit as Q_learning.

Niu Gensheng's words encourage us to aim high and not feel powerless or frustrated when facing obstacles and difficulties; only in this way can we obtain a satisfying life. Q_learning, likewise, keeps updating itself into a qualified critic that tells the agent the maximum reward obtainable by taking each action, so that based on this critic the agent knows how to choose in order to obtain the maximum benefit.

What the two have in common is making the best of the present moment and making the most sensible choice. Now back to the problem itself: our task is to train a critic that can tell us the best action.
The question is, how do we train this critic? Suppose we have a piece of data:
(s_t, a_t, r_t, s_{t+1})

You ask the critic: after seeing s_t, how much return can I get at most by taking a_t?
Critic: the maximum return you can get is r_t plus the maximum return attainable after seeing s_{t+1}.
Me: sounds like a truism, but it is actually very reasonable.

Based on this we can train a network, which is the target network to be introduced later. Before that, let's review the value-based critic.
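Written out as an equation, the critic's answer above is the usual one-step target (written here without a discount factor to match the dialogue; in practice a discount γ usually multiplies the max term):

$$Q(s_t, a_t) = r_t + \max_{a} Q(s_{t+1}, a)$$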

1.1、Critic

Unlike the actor, the critic does not output an action; instead it evaluates the current actor (denoted by π): after seeing a state, it estimates the expected sum of rewards that can be obtained over the remaining portion of the episode.

It is worth noting that the critic and the actor are tied together. Just as the same everyday problem handled by different people yields different results, the critic evaluates how well a particular individual suits the current problem. So, how do we train a critic? There are two main methods:

**1.** Monte-Carlo based approach
**2.** Temporal-difference approach

1.1.1、Monte-carlo based approach

The idea is to let the machine watch previous plays of the game; the agent then learns, when it sees state s_a, the expected future reward it can obtain. The process is shown below:
[Figure 1: critic based on Monte Carlo]
It can be seen that the Monte-Carlo based approach considers the reward of the entire episode; when episodes are long, learning becomes very slow. The following approach solves this problem.
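As a minimal sketch of the training objective (G_a denotes the cumulated reward actually observed from s_a until the end of the episode; the symbol is introduced here just for illustration):

$$V^{\pi}(s_a) \approx G_a, \qquad \text{train by minimizing } \big(V^{\pi}(s_a) - G_a\big)^2$$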

1.1.2、Temporal-difference approach

By considering two consecutive states, the model can be trained directly; the framework is shown in Figure 2:
[Figure 2: Temporal-difference critic]
Here the states before and after the reward are both fed in as input, and the difference of the two outputs is regressed toward the reward value r_t.
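In equation form (again without a discount factor, matching the description above):

$$V^{\pi}(s_t) = r_t + V^{\pi}(s_{t+1}) \;\;\Rightarrow\;\; \text{train by minimizing } \big(V^{\pi}(s_t) - V^{\pi}(s_{t+1}) - r_t\big)^2$$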

Comparison of the two methods:
[Figure: comparison of the MC and TD methods]
In fact, every time the same state is seen, the eventual cumulated reward is not a fixed value but a distribution of values. This means the target used by the MC method has a relatively large variance; the TD method clearly has a smaller variance, but V(s_{t+1}) may be biased, i.e. its accuracy is not high. A later tip will combine the two methods.

1.1.3、State-action value function

Alternatively, instead of evaluating how many points this agent can get in a state, the critic can give the cumulated reward obtainable after the agent takes each of the different actions. The critic still evaluates the situation, but now a way of choosing an action is also given, as shown in Figure 3.
[Figure 3: state-action value function]
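For reference, the state-action value function can be written as (a sketch of the standard definition, again without discounting to stay consistent with the earlier equations):

$$Q^{\pi}(s, a) = \mathbb{E}\big[\, r_t + r_{t+1} + \cdots \mid s_t = s,\ a_t = a,\ \text{then follow } \pi \,\big]$$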

1.2、target network

[Figure: target network]
First copy the parameters of Q and freeze the copy; it is used at time t+1 to estimate the maximum obtainable return. Then, at time t, regression is used to approximate this value. After several update steps, the target network's parameters are synchronized with the trained network. This is a steady, step-by-step strategy.
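A minimal PyTorch sketch of this idea, assuming a tiny Q-network with 4-dimensional states and 2 actions and a discount factor gamma (all sizes, names and the discount are illustrative assumptions, not the course's code):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # assumed state_dim=4, n_actions=2
target_net = copy.deepcopy(q_net)                     # frozen copy of Q
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99                                          # discount factor (assumption)

def update(s, a, r, s_next, step, sync_every=100):
    # regression target: r + gamma * max_a' Q'(s_{t+1}, a'), computed with the frozen network
    # (terminal states are ignored here for brevity)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    # current estimate Q(s_t, a_t)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # every N steps, copy the trained parameters into the target network
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```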

1.3 Exploration and buffer

1.3.1 exploration

Given my current finances, I can only afford cigarettes in the 20-yuan range. The first time I bought a pack of Furongwang it felt good, so from then on I only ever bought Furongwang, because it gave a stable reward (nicotine satisfaction). Until one day someone handed me a Yuxi, and I found it even better than Furongwang, so I decided to buy Yuxi from then on. In other words, a bit of exploration may sometimes bring greater benefits; of course it may also bring a very bad experience, just as the saying goes, fortune favors the bold. Two common exploration methods are as follows:
[Figure: two exploration methods]

1.3.1.1 Epsilon Greedy

With probability 1-epsilon, choose the stable (greedy) option; with probability epsilon, try something new.

1.3.1.2 Boltzmann Exploration

Instead of always choosing the best option, assign a probability to every possible action.
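A minimal sketch of both strategies over a vector of Q-values (function names and default values are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # with probability epsilon explore at random, otherwise take the greedy action
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # every action gets a probability proportional to exp(Q / temperature)
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```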

1.3.2 buffer

The so-called buffer is a storage area for episode fragments; each time, a batch of data is sampled from it to update the Q-function. The reason this data is valuable and can be reused is that we are no longer training an actor as in the earlier policy-based methods (where off-policy data would need importance weights); we are now training an evaluator. What we want to capture is: after seeing s and taking an action at time t, what s is at time t+1. Such data is determined by the environment and has nothing to do with the agent.
Actor: the doer → true knowledge comes from practice.
Critic: the theorist → experience broadens judgment.

In each iteration: 1. sample a batch; 2. update the Q-function.
[Figure: sampling a batch from the replay buffer]
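A minimal sketch of such a buffer (class and parameter names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) fragments and hands out random batches."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)       # oldest fragments are dropped automatically

    def push(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        batch = random.sample(self.storage, batch_size)
        s, a, r, s_next = zip(*batch)
        return list(s), list(a), list(r), list(s_next)
```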

1.4 Q_learning algorithm

[Figure: the typical Q_learning algorithm]
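Putting the pieces above together, a rough sketch of the loop in the figure, assuming a gym-style environment `env` (reset/step) and reusing the illustrative `q_net`, `epsilon_greedy`, `ReplayBuffer` and `update` from the earlier sketches (all names are assumptions, not the course's reference code):

```python
# Rough sketch only: `env` and `num_episodes` are assumed to exist.
buffer = ReplayBuffer()
step = 0
for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        # act with exploration, based on the current Q estimates
        q_values = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).detach().numpy()[0]
        a = epsilon_greedy(q_values)
        s_next, r, done, info = env.step(a)
        buffer.push(s, a, r, s_next)                   # store the interaction fragment
        if len(buffer.storage) >= 32:
            s_b, a_b, r_b, sn_b = buffer.sample(32)    # 1. sample a batch
            update(torch.as_tensor(s_b, dtype=torch.float32),
                   torch.as_tensor(a_b),
                   torch.as_tensor(r_b, dtype=torch.float32),
                   torch.as_tensor(sn_b, dtype=torch.float32),
                   step)                               # 2. update the Q-function
        s = s_next
        step += 1
```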

二、Tips for Q_learning

In practice, Q_learning does not work very well: the values output by the network are always far larger than the actual values, i.e. the network overestimates the maximum return an action can obtain. Hence a series of improvements.

2.1 Double DQN

[Figure: Q-value estimates of DQN vs. Double DQN compared with the actual values]
As can be seen, DQN's estimates are far larger than the actual values. The reason the teacher analyzes in the video is that the network has some error, and the max operation tends to pick whichever value happens to be overestimated. Double DQN therefore uses two Q networks:
Q: the network whose parameters are being updated
Q': the target network
Each time, Q selects the action and Q' computes the expected reward of that action; as long as the two networks do not overestimate the same choice at the same time, the estimate will not be too far off. Personally, though, I do not think this solves the bias problem introduced by the TD method, and in theory the output curves of the two networks should look roughly like shifted copies of each other, because their parameters are synchronized every N steps. So I think the following could be tried: train with the original DQN directly; after the model converges, measure the expected gap between estimated and actual values as a bias and subtract it from the network's output.
Papers:
1. Hado van Hasselt, "Double Q-learning", NIPS 2010
2. Hado van Hasselt, Arthur Guez, David Silver, "Deep Reinforcement Learning with Double Q-learning", AAAI 2016
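A minimal sketch of the Double DQN target, reusing the illustrative `q_net`, `target_net` and `gamma` from the section 1.2 sketch (tensor names are assumptions):

```python
# Q chooses the action, Q' (the target network) evaluates it.
with torch.no_grad():
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)                     # selection by Q
    target = r + gamma * target_net(s_next).gather(1, best_a).squeeze(1)   # evaluation by Q'
```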

2.2 Dueling DQN

As I analyzed above, to address DQN's bias problem, Dueling DQN uses the network to produce a separate bias-like term and adds it to another output value.
[Figure: Dueling DQN architecture]
To prevent the network from ignoring V(s), some constraints are added, for example forcing the outputs of A to sum to 0. Then, if the target value happens to be exactly 1 larger than the current one, the network is forced to modify V, because modifying A alone cannot achieve it. There are more details worth studying; see the paper:
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas, “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015
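A minimal sketch of the dueling head (sizes and names are illustrative; subtracting the mean of A is one common way to realize a constraint of the "A sums to zero" kind mentioned above):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Sketch of the dueling architecture: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, feature_dim=64, n_actions=4):      # sizes are illustrative
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)             # scalar V(s)
        self.advantage = nn.Linear(feature_dim, n_actions) # A(s, a) for every action

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        # removing the mean keeps A centered, so the network cannot ignore V(s)
        return v + a - a.mean(dim=1, keepdim=True)
```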

2.3 Prioritized Replay

The idea is that, during training, data that are hard to fit are given a larger sampling probability, a sort of "meet the difficulty head-on" spirit. The parameter-update procedure in the original paper is also adjusted accordingly. Paper: https://arxiv.org/abs/1511.05952?context=cs
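For reference, in the proportional variant of the paper each transition i is sampled with probability (δ_i is its TD error, α and ε are hyperparameters):

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon$$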

2.4 multi-step

Change the one-step difference of the original network into a multi-step difference.
[Figure: multi-step targets]
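Written out as an N-step target (γ is the discount factor and Q' the target network; a sketch of the usual form):

$$Q(s_t, a_t) \leftarrow \sum_{i=0}^{N-1} \gamma^{\,i}\, r_{t+i} + \gamma^{N} \max_{a} Q'(s_{t+N}, a)$$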

2.5 Noisy-Net

Epsilon greedy was mentioned earlier when discussing exploration: with a small probability the action is chosen at random rather than by maximizing the return. Noisy Net is a similar idea but with a difference: it adds noise at the parameter level, and the noise is not re-sampled until the episode ends.

| Noise on the action | Noise on the parameters |
| --- | --- |
| May take different actions in the same state | Takes the same action in the same state |
| Random trial and error | Systematic exploration |

After training converges, the output of a parameter-noise network is fixed. Although it differs from the noise-free output, this can be understood as giving the model a certain tendency (essentially equivalent to one more parameter update). With action-level noise, on the other hand, the result remains unpredictable.
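A simplified sketch of a noisy layer under these assumptions: independent Gaussian noise on every weight (the paper also describes a factorized variant), re-sampled only at the start of each episode; all names and initial values are illustrative:

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Each weight is mu + sigma * eps; eps is kept fixed within an episode."""
    def __init__(self, in_features, out_features, sigma0=0.1):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma0))
        self.reset_noise()

    def reset_noise(self):
        # call once per episode: the same noise is kept until the episode ends
        self.w_eps = torch.randn_like(self.w_mu)
        self.b_eps = torch.randn_like(self.b_mu)

    def forward(self, x):
        w = self.w_mu + self.w_sigma * self.w_eps
        b = self.b_mu + self.b_sigma * self.b_eps
        return nn.functional.linear(x, w, b)
```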

2.6 Distributional Q-function

We know that, in a given state, the return the agent gets from taking an action is actually a distribution. DQN outputs its expected value; a Distributional Q-function instead outputs the actual distribution of returns for each action (discretized into several values, for example five candidate return values per action).
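A minimal sketch of such a head (C51-style; the number of actions, the five atoms and the support range are all illustrative assumptions):

```python
import torch
import torch.nn as nn

n_actions, n_atoms = 3, 5
support = torch.linspace(-10.0, 10.0, n_atoms)      # the 5 candidate return values
head = nn.Linear(64, n_actions * n_atoms)           # on top of some feature extractor

def action_distributions(features):
    logits = head(features).view(-1, n_actions, n_atoms)
    probs = torch.softmax(logits, dim=-1)            # one distribution per action
    q_values = (probs * support).sum(dim=-1)         # expected value, if still needed
    return probs, q_values
```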

2.7 rainbow

A combination of all of the above methods. Paper: https://arxiv.org/abs/1710.02298

三、Continuous action

What is discrete and what is continuous? This is with respect to the action. Take Go for example: each action is the placement of a stone on the board, and all possibilities can be enumerated. For autonomous driving, however, the action that can be taken is continuous: turn left 5 degrees, 5.05 degrees, and so on, with infinitely many possibilities.

3.1 solution

3.1.1 Sampling actions

3.1.2 Gradient ascent for optimization

3.1.3 Design the network to simplify optimization

Consider a model that outputs three things: a scalar, a vector, and a matrix, from which Q(s, a) is then computed. Clearly, if Σ is a positive definite matrix, Q reaches its maximum when a equals the vector output. What does that mean? Something like this: fix V, initialize a, and update it through the network, as shown below:
[Figure: network design for continuous actions]
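For reference, the three outputs are commonly combined in the following way (a NAF-style sketch; μ(s) is the vector, Σ(s) the matrix, V(s) the scalar, and the notation here is mine):

$$Q(s, a) = -\big(a - \mu(s)\big)^{\top}\, \Sigma(s)\, \big(a - \mu(s)\big) + V(s)$$

so that when Σ(s) is positive definite, the maximizing action is simply a = μ(s).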

3.1.4 Using other methods

Origin blog.csdn.net/weixin_43522964/article/details/104266890