Q-Learning and DQN in Reinforcement Learning: This Article Is All You Need for the Interview!

1. What is reinforcement learning

Most other machine learning algorithms learn how to perform a given task, whereas reinforcement learning (RL) learns, by trial and error, which action to choose in a particular situation to obtain the maximum return. In many scenarios the current action affects not only the immediate reward but also the states, and therefore the rewards, that follow. The three most important characteristics of RL are:

  1. Its basic form is a closed loop;
  2. The learner is not told directly which actions to take;
  3. The consequences of actions, including the reward signals, unfold over an extended period of time.

Reinforcement learning (RL), also referred to as evaluative learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning a policy through interaction with its environment so as to maximize return or achieve a specific goal [1].

In the figure above, the agent represents the learner itself. In autonomous driving the agent is the car; in a game it is the character you are currently controlling, such as Mario. As Mario moves forward the environment keeps changing: small monsters or obstacles appear and he has to jump to dodge them, which means taking an action (such as moving forward or jumping). For a self-driving car the actions are steering left, steering right, braking, and so on. The agent interacts with the environment all the time: every action is fed back into the environment and changes it. If the self-driving car's goal is to travel 100 meters and it has driven 10 meters forward, the environment has already changed. So every action leads to a change in the environment, and that change is in turn fed back to the agent, forming a loop. The feedback comes in two forms:

  1. Doing well earns a reward, i.e. positive feedback;
  2. Doing badly incurs a punishment, i.e. negative feedback.

The agent may do well or badly, and the environment always feeds back to it. Through this repeated cycle the agent keeps trying to make decisions that benefit itself, and it gradually does better and better, just as a child grows up by learning right from wrong. This is reinforcement learning.

2. Reinforcement Learning Model

As shown on the left of the figure above, an agent (for example, a player) performs an action, which affects the environment, i.e. changes its state; the environment then feeds back to the agent, which receives a reward (for example, points or a score). This cycle repeats until the episode ends.

The above process is a Markov decision process. Why is it called that? Because it satisfies the Markov assumption:

  • The current state St is determined only by the previous state St-1 and the action taken there; it does not depend on earlier states.

As shown on the right of the figure, in state S0 the agent takes action a0, receives reward r1, and moves to state S1; it then takes action a1, receives reward r2, and moves to state S2; and so on, until the end.
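As a minimal sketch of this interaction loop (the `env` object with `reset`/`step` methods and the `choose_action` function are hypothetical placeholders, not a specific library's API), one episode could be written roughly as:

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `choose_action` are hypothetical placeholders, not a library API.

def run_episode(env, choose_action):
    state = env.reset()                 # initial state s0
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(state)                 # agent picks a_t in s_t
        next_state, reward, done = env.step(action)   # environment feeds back r and s_{t+1}
        total_reward += reward
        state = next_state              # Markov property: only s_t and a_t matter next
    return total_reward
```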

2.1 Discounted Future Rewards

From the description above we have established one idea: the decision the agent makes at the current moment must maximize future return. For a Markov decision process, the corresponding total reward is:

\[R=r_1+r_2+r_3+...+r_n\]

The future return at time t considers only the rewards from t onward, not the rewards already collected before t:

\[R_t=r_t+r_{t+1}+r_{t+2}+...+r_n\]

The current action produces a definite immediate result, but its effect on the future is uncertain, which matches the real world; for example, no one knows whether a single flap of a butterfly's wings will end up influencing a hurricane (the butterfly effect). Because the current action's effect on the future is uncertain, we apply a discount by introducing a factor gamma with a value between 0 and 1:

\[R_t=r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+...+\gamma^{n-t}r_n\]

The further into the future a reward lies, the more heavily gamma discounts it, i.e. the more uncertain it is. The purpose is to strike a balance between present and future decisions. Taking gamma as 0 amounts to ignoring the future and considering only the present, which is very short-sighted; taking gamma as 1 counts the future in full, which overweights it. So gamma is usually set to a value between 0 and 1.

Rt can be expressed in terms of Rt+1, giving the recursive form:

\[R_t=r_t+\gamma(r_{t+1}+\gamma(r_{t+2}+...))=r_t+\gamma R_{t+1}\]
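As a small numerical illustration of this recursion (the reward sequence and gamma below are made-up values), the return can be accumulated backwards from the last reward:

```python
# Compute the discounted return R_t = r_t + gamma * R_{t+1} by walking
# backwards over an example reward sequence (the values are made up).
rewards = [1.0, 0.0, 2.0, 3.0]   # r_t, r_{t+1}, r_{t+2}, r_{t+3}
gamma = 0.9

R = 0.0
for r in reversed(rewards):
    R = r + gamma * R            # recursive form: R_t = r_t + gamma * R_{t+1}

print(R)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*3.0 = 4.807
```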

2.2 The Q-Learning Algorithm

The Q(s, a) function (for Quality) represents the discounted future reward the agent obtains by taking action a in state s and then acting optimally afterwards (ignore for now how the future actions are chosen):

\[Q(s_t,a_t)=\max R_{t+1}\]

Assuming we have this Q function, at the current time t we can compute the maximum return of every possible decision, and by comparing these values we can identify which decision yields the highest return at time t:

\[\pi(s)=\arg\max_a Q(s,a)\]

Then, from the recursive formula for the return, we obtain the recursion for the Q function:

\[Q(s_t,a_t)=r_t+\gamma \max_{a_{t+1}}Q(s_{t+1},a_{t+1})\]

This is the famous Bellman equation. The Bellman equation is actually very intuitive: for a given state, maximizing the future reward is equivalent to maximizing the sum of the immediate reward and the maximum future reward of the next state.

The core idea of Q-learning is that we can iteratively approximate the Q function using the Bellman equation.
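A minimal tabular sketch of that iteration, assuming a hypothetical environment with `reset()`/`step()` methods and illustrative hyperparameters (none of this comes from the original article), might look like:

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch: iteratively move Q(s, a) towards the Bellman
# target r + gamma * max_a' Q(s', a'). The `env` interface and all
# hyperparameters here are illustrative assumptions.

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore occasionally, otherwise exploit argmax_a Q(s, a)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman target: r + gamma * max_a' Q(s', a'), zero at terminal states
            target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in actions))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```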

2.3 Deep Q Learning(DQN)

Deep Q Learning (DQN) is a method that combines neural networks with Q-Learning.

2.3.1 The Role of the Neural Network

Traditional Q-Learning uses a table to store the Q value of every state and of every action in that state. The problem is that today's tasks are far too complex: there can be more states than stars in the sky (for example, in the game of Go). If we stored them all in a table, no amount of computer memory would be enough, and looking up the corresponding state in such a huge table each time would also be very time-consuming. Fortunately, machine learning has a technique that is very good at exactly this: neural networks.

We can feed the state and the action into a neural network as input and let the network produce that action's Q value. Then there is no need to record Q values in a table; the neural network generates them directly.

Another form is to input only the state and output the Q values of all actions, then, following the Q-learning rule, simply pick the action with the largest value as the next action.

We can picture it like this: the neural network receives external information, much as eyes, nose and ears collect information; the brain then processes it and outputs the value of each action; finally, an action is chosen in the reinforcement learning manner.
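A minimal PyTorch sketch of the second form described above, where the network takes only the state and outputs a Q value for every action (the layer sizes and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch of the state-in, Q-values-out form of DQN (layer sizes are illustrative).
class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per action
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                   # a dummy state, just for illustration
best_action = q_net(state).argmax(dim=1)    # pick the action with the largest Q value
```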

2.3.2 Computing Q Values with a Neural Network

This part works just like a neural network in supervised learning: the input is the state and the output is the Q value. We train the network's parameters on a large amount of data and finally obtain a Q-Learning computation model, which we can then use for reinforcement learning.
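Concretely, a single training step might look like the following sketch, which treats the Bellman target as the regression label; the network, batch tensors and hyperparameters are all illustrative assumptions, and standard DQN refinements such as a target network and experience replay are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One illustrative DQN training step (no target network or replay buffer shown).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.9

states = torch.randn(32, 4)              # batch of states s_t (made-up data)
actions = torch.randint(0, 2, (32, 1))   # batch of actions a_t
rewards = torch.randn(32)                # batch of rewards r_t
next_states = torch.randn(32, 4)         # batch of next states s_{t+1}
dones = torch.zeros(32)                  # 1.0 where the episode terminated

# Predicted Q(s_t, a_t): take the Q value of the action actually chosen
q_pred = q_net(states).gather(1, actions).squeeze(1)

# Bellman target: r_t + gamma * max_a Q(s_{t+1}, a), with no gradient through it
with torch.no_grad():
    q_next = q_net(next_states).max(dim=1).values
    target = rewards + gamma * (1.0 - dones) * q_next

loss = F.mse_loss(q_pred, target)        # trained like a supervised regression
optimizer.zero_grad()
loss.backward()
optimizer.step()
```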

3. Differences Between Reinforcement Learning, Supervised Learning and Unsupervised Learning

  1. Supervised learning is like studying with a tutor at your side who knows what is right and what is wrong.

    Reinforcement learning, with no labels at all, first tries some actions and obtains an outcome; from the feedback on whether that outcome is right or wrong, it adjusts its earlier behavior. By adjusting continually in this way, the algorithm learns which action to choose in which situation to get the best result.

  2. Supervised learning learns the relationship between inputs and outputs: it can tell the algorithm which output corresponds to which input, and a bad choice is fed back to the algorithm immediately.

    Reinforcement learning learns from the feedback given to the machine, the reward function, which judges whether an action is good or bad. Moreover, the feedback in reinforcement learning is delayed: sometimes only after many steps do you learn whether an earlier choice was good or bad.

  3. The inputs to supervised learning are independent and identically distributed (i.i.d.).

    The inputs that reinforcement learning faces are always changing: every time the algorithm takes an action, it affects the input of the next decision.

  4. Supervised learning algorithms do not consider this trade-off; they are purely exploitative.

    In reinforcement learning, an agent can trade off between exploration and exploitation and choose the action with the maximum return.

  5. Unsupervised learning does not learn a mapping from inputs to outputs, but rather patterns (an automatic mapping).

    Reinforcement learning, by contrast, learns from training examples that carry no concept labels but are associated with a delayed reward or utility (which can be seen as a delayed concept label), in order to obtain a mapping from states to actions.

The essential difference between reinforcement learning and the other two: it lacks the explicit notion of a dataset that they rely on; it does not know the outcomes, only the goal. By "dataset" we mean large amounts of data: supervised and unsupervised learning need large amounts of data to train and optimize the model you build.

|        | Supervised learning | Unsupervised learning | Reinforcement learning |
| ------ | ------------------- | --------------------- | ---------------------- |
| Labels | Correct, strict labels | No labels | No labels; behavior is adjusted via feedback on outcomes |
| Input  | i.i.d. | i.i.d. | Always changing: each action the algorithm takes affects the input of the next decision |
| Output | The output corresponding to an input | A self-learned mapping (patterns) | A reward function, i.e. the outcome used to judge whether an action was good or bad |

4. What Is Multi-Task Learning

In machine learning we usually care about optimizing one particular metric, whether it is a benchmark score or a business KPI. To achieve this we train a single model, or an ensemble of models, to carry out the given task, and then fine-tune it until its performance no longer improves. Although this can give acceptable performance on that one task, we may be ignoring information that would help us do better on the metric we care about, namely the supervision signals of related tasks. By sharing representations across related tasks, our model generalizes better on the original task. This approach is called Multi-Task Learning.

Different tasks share some commonality, and this commonality is the connection point of multi-task learning: every task relies on it to produce its result. For example, in e-commerce, the click-through rate and the conversion rate both depend on the same input data and the same neural network layers; multilingual speech recognition is another example.
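A minimal PyTorch sketch of such a shared-bottom setup, with one shared representation feeding two task-specific heads (for example a CTR head and a CVR head; all names and layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Shared-bottom multi-task sketch: two tasks (e.g. CTR and CVR) share the same
# input and lower layers, then branch into task-specific heads.
class SharedBottomModel(nn.Module):
    def __init__(self, input_dim, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(            # layers shared by both tasks
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
        )
        self.ctr_head = nn.Linear(hidden, 1)    # task 1: click-through rate
        self.cvr_head = nn.Linear(hidden, 1)    # task 2: conversion rate

    def forward(self, x):
        h = self.shared(x)
        return torch.sigmoid(self.ctr_head(h)), torch.sigmoid(self.cvr_head(h))

model = SharedBottomModel(input_dim=16)
x = torch.randn(8, 16)                          # dummy feature batch
ctr, cvr = model(x)
# A joint loss would typically sum the two task losses, so gradients from both
# tasks update the shared layers.
```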


5. References

[1] Machine Learning Made Easy series of articles (机器学习通俗易懂系列文章)



Author: @mantchs

GitHub:https://github.com/NLP-LOVE/ML-NLP

Everyone is welcome to join the discussion and help improve this project! QQ group: 541954936 (NLP Interview Study Group)
