Applications of Reinforcement Learning in Robotics

Reinforcement learning is a subfield of machine learning in which an agent interacts with its environment, observes the outcomes of those interactions, and receives corresponding rewards. This style of learning mimics the way humans and animals learn.

We humans have direct sensory contact with our environment: we can perform actions and witness the effects those actions produce. This idea can be summed up as "cause and effect," and it is undoubtedly key to how we build up our understanding of the environment over a lifetime. This article introduces reinforcement learning in robotics from the following angles:

Where reinforcement learning is currently applied
Why reinforcement learning is closely related to robotics
Introduction to robot reinforcement learning
Function approximation
Challenges of robot reinforcement learning
Curse of dimensionality
Curse of real-world samples
Curse of under-modeling and model uncertainty
Principles of robot reinforcement learning
Effective representations
Approximate models
Prior knowledge
Where reinforcement learning is currently applied
Many problems have already been tackled with reinforcement learning, because an RL agent does not need to be supervised by an expert. RL is best suited to complex problems for which there is no obvious or easily engineered procedural solution, for example:

Game playing – making the best move in a game often depends on many factors, especially when the number of possible game states is very large. Covering that many states with traditional methods would require an enormous number of hand-crafted rules. RL does away with manually specified rules: the agent learns simply by playing the game. For two-player adversarial games such as backgammon, the agent can be trained by playing against human players or against other RL agents.
Control problems – for example, elevator dispatching. Again, there is no obvious policy that provides the best, most time-efficient elevator service. For control problems like this, an RL agent can learn in a simulated environment and eventually arrive at a good control policy. Some advantages of RL for control problems are that it adapts easily to changes in the environment and that it can keep training continuously, improving its own performance all the while (a minimal tabular sketch of such an agent follows the example below).
A good recent example is DeepMind's paper Human-level control through deep reinforcement learning.
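To make the control-problem case concrete, here is a minimal tabular Q-learning sketch (Watkins, 1989). The environment interface (`env.actions`, `env.reset()`, `env.step()`) and the hyperparameters are assumptions made purely for illustration, not any particular library's API.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (Watkins, 1989) for a small discrete control problem.

    Assumed (hypothetical) environment interface:
      env.actions            -- list of discrete actions
      env.reset() -> state   -- start a new episode
      env.step(a) -> (next_state, reward, done)
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # one-step update toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```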

Why reinforcement learning is closely related to robotics
J. Kober, J. Andrew (Drew) Bagnell, and J. Peters point out in Reinforcement Learning in Robotics: A Survey:

Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors 

At the same time, reinforcement learning is gradually becoming a ubiquitous tool in the real world. Generally speaking, problems that are complex for humans may turn out to be easy for a robot to solve, while problems that seem simple to us can be very hard for a robot. In other words:

What’s complex for a human, robots can do easily and viceversa - Víctor Mayoral Vilches

As a simple example, imagine a three-joint manipulator on our desk performing some repetitive task. Traditionally, a robotics engineer handling such a task would either program the whole application from scratch or use existing tools (provided by the manufacturer) to program the use case. Regardless of how sophisticated the tools or the task are, we will run into:

Errors at every motor (joint) introduced when solving the inverse kinematics (a toy kinematics sketch follows this list)
Model accuracy when closing the control loop
Designing the entire control pipeline
Frequent, routine calibration
All of this is done so that the robot produces a deterministic motion in a controlled environment.
But the truth is: the real environment is not controllable.
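To see how the first item bites in practice, here is a toy sketch of planar forward kinematics for a three-joint arm; the link lengths, joint angles, and the 1-degree encoder offset are made-up numbers chosen only to show how small per-joint errors propagate to the end-effector position the controller believes in.

```python
import numpy as np

def forward_kinematics(joint_angles, link_lengths=(0.3, 0.25, 0.15)):
    """End-effector (x, y) position of a planar 3-joint arm (toy model).

    Illustrates where kinematic errors enter: any bias in the measured
    joint angles or assumed link lengths propagates directly into the
    end-effector position the controller reasons about.
    """
    x = y = angle = 0.0
    for theta, length in zip(joint_angles, link_lengths):
        angle += theta                    # joint angles accumulate along the chain
        x += length * np.cos(angle)
        y += length * np.sin(angle)
    return np.array([x, y])

# A hypothetical 1-degree encoder offset on every joint already shifts the tool tip:
ideal = forward_kinematics([0.4, -0.6, 0.3])
offset = np.radians(1.0)
biased = forward_kinematics([0.4 + offset, -0.6 + offset, 0.3 + offset])
print(np.linalg.norm(ideal - biased))     # positioning error in metres
```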

Problems in robotics are often best represented with high-dimensional, continuous states and actions (note that the 10-30 dimensional continuous actions common in robot reinforcement learning are considered large (Powell, 2012)). In robotics, it is often unrealistic to assume that the true state is completely observable and noise-free. 

Coming back to J. Kober et al.'s article Reinforcement Learning in Robotics: A Survey:

Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. Instead of explicitly detailing the solution to a problem, in reinforcement learning the designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot. 

This makes a lot of sense. Take shooting a 3-pointer in basketball as an example:

1. I get myself behind the 3-point line and get ready for a shot.
2. At this point, my consciousness has no information whatsoever about the exact distance to the basket, nor the strength I should use to make the shot, so my brain produces an estimate based on the model I have (built upon years of trial-and-error shots).
3. With this estimate, I produce a shot. Let's assume I miss, which I notice through my eyes (sensors). Again, the information perceived through my eyes is not accurate, so what I pass to my brain is not "I missed the shot by 5.45 cm to the right" but more like "the shot was slightly too far to the right and I missed." This information updates the model in my brain, which receives a negative reward. We could debate why my estimate failed. Was it because the model is wrong, despite the fact that I've made hundreds of 3-pointers before with that exact model? Was it because of the wind (playing outdoors, generally)? Or was it because I didn't eat properly that morning? It could easily be all of those or none, but the fact is that many of those aspects can't really be controlled by me. So I keep iterating.
4. With the updated model, I make another shot; if it fails, I go back to step 2), and if I make it, I proceed to step 5).
5. Making a shot means that my model did a good job, so my brain strengthens the links that produced a proper shot by giving them a positive reward.
Introduction to robot reinforcement learning
The goal of reinforcement learning is to find a mapping from states to actions, called the policy $\pi$, which picks an action $a$ in a given state $x$ so that the cumulative expected reward $r$ is maximized. In other words, reinforcement learning looks for the optimal policy $\pi^*$, a mapping from the state (or observation) space to the action space, that maximizes the expected return $J$:
$$J_\pi = E\left[R(\tau \mid \pi)\right] = \int R(\tau)\, p_\pi(\tau)\, d\tau$$
where $p_\pi(\tau)$ denotes the distribution over trajectories $\tau = (x_0, a_0, x_1, a_1, \ldots)$ induced by the policy, and $R(\tau)$ is the accumulated discounted reward along the trajectory:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t\, r(x_t, a_t)$$
where $\gamma \in [0, 1)$ is the discount factor. With this mathematical formulation, many tasks in robotics can be naturally formalized as reinforcement learning problems. Traditional reinforcement learning methods typically estimate, for a given policy $\pi$, the long-term expected return of each state $x$ at time step $t$, called the value function $V_t^\pi(x)$. Value-function methods are sometimes called *critic-only methods*: the core idea is to first observe and evaluate the performance of the chosen controls (the value function), and then derive a policy from the acquired knowledge. In contrast, *policy search* methods directly infer the optimal policy $\pi^*$ and are therefore sometimes called *actor-only methods*.
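As a minimal illustration of these definitions, the sketch below computes $R(\tau)$ for a sampled trajectory and estimates $J_\pi$ by averaging returns over rollouts; `sample_trajectory` is a placeholder for whatever actually generates trajectories under the policy.

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one sampled trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_expected_return(sample_trajectory, n_samples=100, gamma=0.99):
    """Monte Carlo estimate of J(pi) = E[R(tau | pi)].

    `sample_trajectory` is a placeholder callable that rolls the policy out
    once and returns the list of rewards it collected.
    """
    returns = [discounted_return(sample_trajectory(), gamma) for _ in range(n_samples)]
    return sum(returns) / len(returns)
```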
Function approximation
Function approximation is a family of mathematical and statistical techniques used to represent a function of interest when it is computationally or information-theoretically intractable to represent it exactly or completely. As J. Kober et al. put it:
Typically, in reinforcement learning the function approximation is based on sample data collected during interaction with the environment. Function approximation is critical in nearly every RL problem, and becomes inevitable in continuous state ones. In large discrete spaces it is also often impractical to visit or even represent all states and actions, and function approximation in this setting can be used as a means to generalize to neighboring states and actions. Function approximation can be employed to represent policies, value functions, and forward models. 
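As a minimal sketch of value-function approximation, the code below runs TD(0) with a linear approximator $V(x) = w^\top \phi(x)$. The `features` function and the episode format are assumptions standing in for a real feature construction (tile coding, radial basis functions, a neural network, etc.) and real interaction data.

```python
import numpy as np

def td0_linear(episodes, features, n_features, alpha=0.05, gamma=0.99):
    """TD(0) with a linear value-function approximator V(x) = w . phi(x).

    `episodes` is an iterable of trajectories [(x, r, x_next, done), ...] and
    `features(x)` maps a (possibly continuous) state to an n_features vector;
    both are placeholders for real interaction data and a chosen feature map.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        for x, r, x_next, done in episode:
            phi = features(x)
            v = w @ phi
            v_next = 0.0 if done else w @ features(x_next)
            td_error = r + gamma * v_next - v
            w += alpha * td_error * phi   # nearby states share features, so the update generalizes
    return w
```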

Challenges of robot reinforcement learning
Curse of dimensionality
Bellman coined the term *curse of dimensionality* in 1957, when he observed in optimal control that exploring states and actions in discretized high-dimensional spaces quickly becomes intractable. As the number of dimensions grows, covering the whole state-action space requires ever more data and computation. For example, when controlling a 7-degree-of-freedom robot arm, the robot's state has to represent the joint angle and velocity of each degree of freedom, plus the Cartesian position and velocity of the *end effector*:
$$\text{number of state dimensions} = 2 \times (7 + 3) = 20, \qquad \text{number of action dimensions} = 7$$
If every state dimension is discretized into just 10 levels, this robot already has $10^{20}$ distinct states (the quick arithmetic below spells it out).
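The arithmetic behind that figure:

```python
levels_per_dimension = 10
state_dimensions = 2 * (7 + 3)   # joint angles/velocities plus end-effector pose/velocity
n_states = levels_per_dimension ** state_dimensions
print(n_states)                  # 10**20 -- far too many cells for any lookup table
```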
Curse of real-world samples
As J. Kober et al. point out, robot reinforcement learning has to face real-world constraints: robot hardware is usually expensive, suffers wear and tear, needs careful maintenance, and repairing it costs money, physical labor, and long waiting periods. The survey gives examples of this curse of real-world samples, mainly in the following respects:
Applying reinforcement learning in robotics demands safe exploration which becomes a key issue of the learning process, a problem often neglected in the general reinforcement learning community (due to the use of simulated environments).
While learning, the dynamics of a robot can change due to many external factors ranging from temperature to wear thereby the learning process may never fully converge (i.e. how light conditions affect the performance of the vision system and, as a result, the task’s performance). This problem makes comparing algorithms particularly hard.
Reinforcement learning algorithms are implemented on a digital computer where the discretization of time is unavoidable despite that physical systems are inherently continuous time systems. Time discretization of the actuation can generate undesirable artifacts (e.g., the distortion of distance between states) even for idealized physical systems, which cannot be avoided.
Curse of under-modeling and model uncertainty
Simulation with an accurate model could be used in place of real-world interaction. Robot simulation nowadays commonly relies on Gazebo, which integrates with ROS, is compatible with several physics engines, and gives robotics researchers a very useful tool.
In an ideal setting, this approach would allow us to learn behaviors in simulation and subsequently transfer them to the real robot. Unfortunately, creating a sufficiently accurate model of the robot and its environment is challenging and requires many data samples, and because of small modeling errors within that limited accuracy, the simulated robot diverges from the real one.
Transferring from simulation to the real robot generally falls into two broad scenarios:
1. Tasks where the system is self-stabilizing (that is, where the robot does not require active control to remain in a safe state or return to it); here, transferring policies often works well.
2. Unstable tasks where small variations have drastic consequences; in such scenarios transferred policies often perform poorly.

Principles of robot reinforcement learning
Given the challenges just described, one might conclude that applying reinforcement learning to robots is doomed to fail. In practice, for robot reinforcement learning to perform well, the following principles need to be considered:

Effective representations
Approximate models
Prior knowledge or information 
Each of these is discussed in turn below; see J. Kober et al.'s survey for more detail.
Effective representations
Much of the success of reinforcement learning has come from cleverly exploiting approximate representations. Because table-based representations do not scale, the need for such approximations is especially pressing in robotics:
- Smart state-action discretization: Reducing the dimensionality of states or actions by smart state-action discretization is a representational simplification that may enhance both policy search and value function-based methods.
- Value function approximation: A value function-based approach requires an accurate and robust but general function approximator that can capture the value function with sufficient precision while maintaining stability during learning (e.g. ANNs).
- Pre-structured policies: Policy search methods require a choice of policy representation that controls the complexity of representable policies to enhance learning speed.
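As one hedged illustration of the last point, a pre-structured policy can be as simple as a PD-like controller whose gains are the only free parameters left for policy search to tune; the gains and dimensions below are illustrative placeholders, not values from the survey.

```python
import numpy as np

class PreStructuredPolicy:
    """A hand-chosen policy structure with only a few free parameters.

    Instead of searching over arbitrary mappings from a 20-dimensional state
    to 7 joint torques, policy search only tunes the gains of this fixed
    PD-like structure; the structure itself encodes prior knowledge.
    """
    def __init__(self, kp, kd):
        self.kp = kp   # proportional gains, one per joint
        self.kd = kd   # derivative gains, one per joint

    def action(self, joint_error, joint_velocity):
        return self.kp * joint_error - self.kd * joint_velocity

# Policy search now explores a 14-dimensional parameter space (7 kp + 7 kd)
# instead of a full state-to-torque mapping. Gains here are placeholders.
policy = PreStructuredPolicy(kp=np.full(7, 5.0), kd=np.full(7, 0.5))
```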

Approximate models
Experience collected in the real world can be used to learn a forward model from data (Åström and Wittenmark, 1989). We would like to drastically reduce the amount of learning done on the real robot, because learning in simulation is faster and safer. In robot reinforcement learning, learning with such a learned or simulated model is often called *mental rehearsal*.

The core issues of *mental rehearsal* are:
1. Simulation biases
2. The complexity of the real world
3. Efficient optimization using sample data from the simulated environment
These issues are typically embodied and addressed in approaches such as Iterative Learning Control, Value Function Methods with Learned Models, and Locally Linear Quadratic Regulators (a minimal forward-model sketch follows).
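A minimal sketch of mental rehearsal, assuming a deliberately simple linear forward model fit by least squares (real systems would need far richer model classes): fit the model on logged transitions, then roll the policy out in the model instead of on the real robot.

```python
import numpy as np

def fit_forward_model(states, actions, next_states):
    """Least-squares linear forward model: x_{t+1} ~= W @ [x_t, a_t, 1].

    `states`, `actions`, `next_states` are arrays of logged transitions
    (one row per time step) collected on the real robot.
    """
    inputs = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)
    return W.T

def rehearse(model, policy, x0, horizon=50):
    """Roll the policy out in the learned model instead of on the real robot."""
    x, trajectory = np.asarray(x0, dtype=float), []
    for _ in range(horizon):
        a = policy(x)
        x = model @ np.concatenate([x, a, [1.0]])   # simulated transition
        trajectory.append((x.copy(), a))
    return trajectory
```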

*Simulation biases*: Given how hard it is to obtain a forward model that is accurate enough to simulate a complex real-world robot system, many robot RL policies learned in simulation perform poorly on the real robot. This is known as simulation bias. It is analogous to over-fitting in supervised learning: the algorithm does its job well on the model and the training data, respectively, but does not generalize well to the real system or to novel data. It has been shown that simulation biases can be addressed by introducing stochastic models or distributions over models, even if the system is very close to deterministic.
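Continuing the linear-model convention of the previous sketch, one hedged way to apply that idea is to roll out under a small ensemble of learned models with added noise, so that a policy cannot over-fit any single deterministic model; the ensemble construction and noise scale here are assumptions of the sketch.

```python
import numpy as np

def stochastic_rollout(models, policy, x0, horizon=50, noise_std=0.01, rng=None):
    """Roll a policy out under a distribution over models instead of one model.

    `models` is a small ensemble of linear forward models (e.g. fit on
    bootstrapped transition data); at each step one model is sampled and
    Gaussian noise is added to its prediction.
    """
    if rng is None:
        rng = np.random.default_rng()
    x, trajectory = np.asarray(x0, dtype=float), []
    for _ in range(horizon):
        a = policy(x)
        model = models[rng.integers(len(models))]            # sample a model
        x = model @ np.concatenate([x, a, [1.0]])            # predicted next state
        x = x + rng.normal(0.0, noise_std, size=x.shape)     # model noise
        trajectory.append((x.copy(), a))
    return trajectory
```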

Prior knowledge or information
Prior knowledge can significantly help the learning process: these approaches effectively shrink the search space and speed up learning.

Prior Knowledge Through Demonstration: Providing a (partially) successful initial policy allows a reinforcement learning method to focus on promising regions in the value function or in policy space (a minimal sketch follows this list).
Prior Knowledge Through Task Structuring: Pre-structuring a complex task such that it can be broken down into several more tractable ones can significantly reduce the complexity of the learning task.
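For the demonstration case above, here is a minimal sketch, under the assumption of a simple linear policy fit to logged demonstration states and actions, of how an initial policy might be obtained before reinforcement learning refines it.

```python
import numpy as np

def clone_demonstrations(demo_states, demo_actions):
    """Fit an initial linear policy a ~= K @ [x, 1] to demonstration data.

    A minimal stand-in for learning from demonstration: the cloned policy is
    only used to start reinforcement learning in a promising region of policy
    space, not as the final controller.
    """
    X = np.hstack([demo_states, np.ones((len(demo_states), 1))])
    K, *_ = np.linalg.lstsq(X, demo_actions, rcond=None)
    return lambda x: K.T @ np.concatenate([x, [1.0]])
```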
References
Chris Watkins, Learning from Delayed Rewards, Cambridge, 1989
Awesome Reinforcement Learning repository
J. Kober, J. Andrew (Drew) Bagnell, and J. Peters, “Reinforcement Learning in Robotics: A Survey,” International Journal of Robotics Research, July, 2013.
Marc Peter Deisenroth, Gerhard Neumann and Jan Peters, A Survey on Policy Search for Robotics
Author: yangchao_THU
Source: CSDN
Original post: https://blog.csdn.net/yangchao_emigmo/article/details/53994936
