[Computer Science] [2016] Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs

This post presents a PhD thesis from the University of California, Berkeley (author: John Schulman), 103 pages in total.

Abstract (from the thesis):

This thesis is mostly focused on reinforcement learning, which is viewed as an optimization problem: maximize the expected total reward with respect to the parameters of the policy. The first part of the thesis is concerned with making policy gradient methods more sample-efficient and reliable, especially when used with expressive nonlinear function approximators such as neural networks. Chapter 3 considers how to ensure that policy updates lead to monotonic improvement, and how to optimally update a policy given a batch of sampled trajectories. After providing a theoretical analysis, we propose a practical method called trust region policy optimization (TRPO), which performs well on two challenging tasks: simulated robotic locomotion, and playing Atari games using screen images as input. Chapter 4 looks at improving the sample complexity of policy gradient methods in a way that is complementary to TRPO: reducing the variance of policy gradient estimates using a state-value function. Using this method, we obtain state-of-the-art results for learning locomotion controllers for simulated 3D robots. Reinforcement learning can be viewed as a special case of optimizing an expectation, and similar optimization problems arise in other areas of machine learning; for example, in variational inference, and when using architectures that include mechanisms for memory and attention. Chapter 5 provides a unifying view of these problems, with a general calculus for obtaining gradient estimators of objectives that involve a mixture of sampled random variables and differentiable operations. This unifying view motivates applying algorithms from reinforcement learning to other prediction and probabilistic modeling problems.
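For quick reference, the main quantities the abstract mentions can be written compactly. The notation below is a sketch following the standard presentations of these methods; symbols such as the trust-region size $\delta$ and the GAE parameter $\lambda$ are the conventional ones and may differ slightly from the thesis.

The overall objective is to maximize the expected total reward with respect to the policy parameters $\theta$:

$$
\max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} r_t\Big].
$$

The TRPO update of Chapter 3 maximizes a surrogate objective subject to a KL-divergence trust region around the current policy $\pi_{\theta_{\text{old}}}$:

$$
\max_\theta \; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t\right]
\quad \text{s.t.} \quad
\hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right] \le \delta.
$$

The variance reduction of Chapter 4 builds the advantage estimate $\hat{A}_t$ from a learned state-value function $V$ (generalized advantage estimation):

$$
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta^{V}_{t+l},
\qquad
\delta^{V}_t = r_t + \gamma V(s_{t+1}) - V(s_t).
$$

The score-function (likelihood-ratio) estimator, which Chapter 5 generalizes to arbitrary stochastic computation graphs, differentiates an expectation through a sampled random variable:

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)]
= \mathbb{E}_{x \sim p_\theta}\big[f(x)\, \nabla_\theta \log p_\theta(x)\big].
$$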

Contents:

1 Introduction
2 Background
3 Trust Region Policy Optimization
4 Generalized Advantage Estimation
5 Stochastic Computation Graphs
6 Conclusion
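To make the last formula above concrete, below is a minimal, self-contained Python sketch (a toy example written for this post, not code from the thesis): it estimates $\nabla_\theta\,\mathbb{E}[f(x)]$ for $x \sim \mathrm{Bernoulli}(\sigma(\theta))$ with the score-function estimator and compares the Monte-Carlo value against the exact gradient. The function names and the choice of $f$ are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_function_gradient(theta, f, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of d/dtheta E_{x ~ Bernoulli(sigmoid(theta))}[f(x)]
    via the score-function (likelihood-ratio) estimator:
        grad = E[ f(x) * d/dtheta log p(x; theta) ].
    """
    rng = np.random.default_rng(seed)
    p = sigmoid(theta)
    x = (rng.random(n_samples) < p).astype(float)  # samples in {0, 1}
    # For this parameterization, d/dtheta log p(x; theta) simplifies to (x - p),
    # since dp/dtheta = p * (1 - p) for the sigmoid.
    return np.mean(f(x) * (x - p))

if __name__ == "__main__":
    theta = 0.3
    f = lambda x: (x - 0.2) ** 2        # arbitrary function of the sample
    p = sigmoid(theta)
    # Exact value: E[f(x)] = p*f(1) + (1-p)*f(0), so the gradient is
    # (f(1) - f(0)) * dp/dtheta = (f(1) - f(0)) * p * (1 - p).
    exact = (f(1.0) - f(0.0)) * p * (1.0 - p)
    estimate = score_function_gradient(theta, f)
    print(f"exact gradient:          {exact:.4f}")
    print(f"score-function estimate: {estimate:.4f}")
```

Stochastic computation graphs extend this single-node case to graphs that freely mix such sampled nodes with ordinary differentiable operations, which is what lets the same machinery cover the variational inference and memory/attention architectures mentioned in the abstract.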




Reposted from blog.csdn.net/weixin_42825609/article/details/103619412