The Difference and Connection Between MDP and RL

The general relationship between RL and MDP is that RL is a framework for solving problems that can be expressed as MDPs.

DP requires a complete description of the MDP: the transition probabilities and reward distributions must be known, because the DP algorithm uses them directly. That is what makes it model-based.

DP: value-based, model-based, and bootstrapping; it sweeps over the known states and backs up each value estimate from the values of its successor states.

RL: typically model-free. It covers value-based methods (such as Q-learning) as well as policy-gradient methods, updates that bootstrap (temporal-difference learning) as well as updates that do not (Monte Carlo), and both on-policy and off-policy algorithms.

The purpose of RL is to solve an MDP when you don't know the MDP: you don't know the set of states you can visit, and you don't know the transition function from each state. One way to solve such a problem is to first learn the MDP and then solve it using algorithms such as value iteration and policy iteration, both of which use the Bellman equation. The MDP can be learnt by simulating different actions from each state until you have a high degree of confidence in the learned transition function and learned reward function, but this is usually computationally unrealistic.
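As a concrete illustration, here is a minimal sketch of value iteration on a tiny, fully known MDP. The two-state MDP below (the arrays P and R and the discount gamma) is made up purely for illustration; only the Bellman optimality backup itself is the standard algorithm.

```python
import numpy as np

# Toy MDP, fully specified (all numbers are made up for illustration):
# P[s, a, s'] is the probability of moving from s to s' under action a,
# R[s, a] is the expected immediate reward for taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1
])
R = np.array([
    [1.0, 0.0],                # rewards in state 0 for actions 0 and 1
    [0.0, 2.0],                # rewards in state 1 for actions 0 and 1
])
gamma = 0.9                    # discount factor

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)    # Q[s, a] = R[s, a] + gamma * E[V(next state)]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)      # greedy policy w.r.t. the converged values
print("V* =", V_new, "policy =", policy)
```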

RL algorithms such as Q-learning try to do both things at the same time: learn the MDP and solve it to find the optimal policy. To do that, the algorithm needs to handle the exploration/exploitation trade-off. Exploration means trying random actions, which helps discover the underlying MDP; exploitation means following the best policy found so far, which helps maximize reward. So, basically, RL is a technique for learning an MDP and solving it for the optimal policy at the same time.
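Below is a minimal sketch of tabular Q-learning with an epsilon-greedy policy, which makes the exploration/exploitation trade-off explicit. The environment interface assumed here (reset(), step(), action_space_n) is a gym-style convention invented for this sketch, not a specific library API.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    `env` is assumed (hypothetically) to expose a gym-style interface:
    reset() -> state, step(action) -> (next_state, reward, done),
    and a discrete action count `action_space_n`.
    """
    Q = defaultdict(float)                       # Q[(state, action)] estimates
    actions = list(range(env.action_space_n))

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation: a random action with
            # probability epsilon, otherwise the current greedy action.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])

            s_next, r, done = env.step(a)

            # Off-policy bootstrapped update: back up from max_a' Q(s', a').
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```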

The RL model consists of: 1) a set of environment and agent states S, 2) a set of actions A available to the agent, 3) policies mapping states to actions, and 4) rules that describe what the agent observes.
Beyond the agent and the environment, a reinforcement learning system has four sub-elements: 1) a policy, which defines the learning agent's way of behaving at a given time; 2) a reward function, which defines the goal of the reinforcement learning problem; 3) a value function, which specifies what is good in the long run; and 4) a model of the environment (optional), which is used for planning. The sketch below shows where each element lives.
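Here is a hypothetical agent-environment interaction loop that makes those elements concrete; the Agent and Environment interfaces (act, update, reset, step) are invented names for illustration.

```python
def run_episode(agent, env, max_steps=1000):
    """One episode of the agent-environment loop (hypothetical interfaces).

    The policy lives in agent.act, the reward function in env.step,
    and the value-function estimates are maintained by agent.update;
    a model of the environment, being optional, does not appear here.
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                        # policy: state -> action
        next_state, reward, done = env.step(action)      # reward signal
        agent.update(state, action, reward, next_state)  # learn value estimates
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```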

Reinforcement learning is all about finding the optimal way of making decisions, i.e., choosing actions so as to maximize the reward R.

The reward signal indicates what is good in the immediate sense, while the value function is more indicative of how good it is to be in the state in the long run.
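The standard way to formalize this distinction is to define the value of a state as the expected discounted sum of future rewards, so the immediate reward is just the first term of that sum. With discount factor gamma (0 <= gamma < 1), the state-value function under policy pi is:

```latex
% Reward is immediate; value aggregates rewards over the long run:
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right]
```

A state can therefore yield a low immediate reward yet still have high value because it leads to states with high rewards, and vice versa.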

Reposted from www.cnblogs.com/lifengfan/p/10306989.html