RL (Chapter 1): The Reinforcement Learning Problem

These are study notes on reinforcement learning, based mainly on the following references:

The Gym Library

A common tool for hands-on reinforcement learning programming today is the gym library released by OpenAI.

A major feature of the gym library is visualization: it can render the interaction between the agent and the environment as an animation.
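
As a quick, hedged illustration, here is a minimal random-agent loop on CartPole. It assumes the classic gym interface (versions before 0.26); newer gym/Gymnasium releases return different tuples from reset() and step().

```python
import gym

# Minimal sketch: a random agent interacting with CartPole under the classic
# gym API (gym < 0.26); newer Gymnasium versions change reset()/step() signatures.
env = gym.make("CartPole-v1")
obs = env.reset()                        # initial observation of the environment
done = False
total_reward = 0.0
while not done:
    env.render()                         # draw the current frame as an animation
    action = env.action_space.sample()   # random action; a real agent would follow a policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```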

Reinforcement Learning (RL)

Characteristics of RL:

  • Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.
  • Moreover, the learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them out. (trial and error)

One of the challenges that arise in reinforcement learning is the trade-off between exploration and exploitation. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
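
To make this trade-off concrete, one simple and common recipe is epsilon-greedy action selection. The sketch below is only illustrative; the name q_estimates is an assumed placeholder for the agent's current estimate of each action's value.

```python
import numpy as np

# Epsilon-greedy sketch: with probability epsilon explore (random action),
# otherwise exploit (the action currently believed to be best).
def epsilon_greedy(q_estimates, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore: gather new information
    return int(np.argmax(q_estimates))               # exploit: use what is already known
```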

  • In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. (delayed reward)
  • Reinforcement learning explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment.

A full specification of reinforcement learning problems in terms of optimal control of Markov decision processes must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem (sensation, action, and goal / state, action, and reward) facing a learning agent interacting with its environment to achieve a goal. The formulation is intended to include these three aspects in their simplest possible forms without trivializing any of them.

Elements of Reinforcement Learning

  • Agent & Environment

  • Policy
    Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. In general, policies may be stochastic.
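
As a toy illustration (not a method from the text), a stochastic policy can be written down as an explicit table from states to action probabilities; every state, action, and number below is made up.

```python
import numpy as np

# A stochastic policy as a lookup table: each state maps to a distribution over actions.
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def select_action(policy, state, rng=None):
    rng = rng or np.random.default_rng()
    actions, probs = zip(*policy[state].items())
    return str(rng.choice(actions, p=probs))   # sample an action according to the policy
```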

  • Reward signal
    A reward signal defines the goal in a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number, a reward.
    In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
    The agent’s sole objective is to maximize the total reward it receives over the long run.

Reinforcement learning rests on the "reward hypothesis": the goal of any problem can be described as the maximization of cumulative reward.
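
One standard way to make "cumulative reward" precise is the discounted return, formalized in Chapter 3. The helper below is only an illustrative computation; gamma weights future rewards less than immediate ones.

```python
# Discounted return: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# e.g. discounted_return([1.0, 1.0, 1.0], gamma=0.9) -> 1.0 + 0.9 + 0.81 = 2.71
```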

  • Value function
    Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.

Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run.

Unfortunately, it is much harder to determine values than it is to determine rewards. In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.
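
As a hedged sketch of what "estimating values" can look like, the update below nudges a state's estimated value toward the reward just received plus the discounted estimate of the next state, in the spirit of the temporal-difference methods developed later in the book; all names and constants are illustrative.

```python
from collections import defaultdict

values = defaultdict(float)   # state -> estimated long-run reward, initialized to 0

def td_update(values, state, reward, next_state, alpha=0.1, gamma=0.99):
    target = reward + gamma * values[next_state]        # bootstrapped estimate of long-run reward
    values[state] += alpha * (target - values[state])   # move the estimate toward the target
```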

Notice that methods like policy gradient methods do not appeal to value functions. They estimate the directions in which the parameters should be adjusted in order to most rapidly improve a policy's performance. Even so, some of these methods take advantage of value-function estimates to improve their gradient estimates.

  • Model of the environment
    This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.

Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners.
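
To make the distinction concrete, here is a toy sketch (not a method from the text) of the model-based idea: a table predicting the next state and reward for each state-action pair, which a deliberately crude one-step planner queries before acting. All states, actions, and numbers are made up.

```python
# Toy model of the environment: (state, action) -> (predicted next state, predicted reward)
model = {
    ("s0", "left"):  ("s1", 0.0),
    ("s0", "right"): ("s2", 1.0),
}

def plan_one_step(model, state, actions):
    # one-step lookahead: simulate each action with the model and pick the one
    # whose predicted immediate reward is highest
    return max(actions, key=lambda a: model[(state, a)][1])

print(plan_one_step(model, "s0", ["left", "right"]))   # -> "right"
```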

Source: blog.csdn.net/weixin_42437114/article/details/109187788