Lecture 2: Markov Decision Processes

Part 1

I. Markov Chains

The Markov property means that the future transition is independent of the past given the present: the next state depends only on the current state.
State transition matrix: entry (i, j) gives the probability of moving from state $s_i$ to state $s_j$,
$P = \begin{pmatrix} P(s_1 \mid s_1) & \cdots & P(s_N \mid s_1) \\ \vdots & \ddots & \vdots \\ P(s_1 \mid s_N) & \cdots & P(s_N \mid s_N) \end{pmatrix}$
Given a Markov chain, we can sample from it to obtain trajectories.
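As a minimal sketch of this sampling process (the state names and transition probabilities below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical 3-state Markov chain; values are illustrative only.
states = ["s1", "s2", "s3"]
P = np.array([
    [0.5, 0.3, 0.2],   # transition probabilities out of s1
    [0.1, 0.6, 0.3],   # out of s2
    [0.2, 0.2, 0.6],   # out of s3
])

def sample_trajectory(start=0, length=10, seed=0):
    """Sample a trajectory: the next state depends only on the current state."""
    rng = np.random.default_rng(seed)
    s = start
    traj = [states[s]]
    for _ in range(length):
        s = rng.choice(len(states), p=P[s])
        traj.append(states[s])
    return traj

print(sample_trajectory())
```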

II. Markov Reward Processes

A Markov reward process (MRP) is a Markov chain plus a reward function.
The reward function is an expectation.
As an analogy, think of an unpowered paper boat drifting on a river: when it drifts to a certain position, it receives the reward associated with that position.
The value function is the expectation of the future discounted return:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, and $V(s) = \mathbb{E}[G_t \mid s_t = s]$.
Why introduce the discount factor $\gamma$:
it prevents the return from blowing up when the chain contains cycles, and it expresses a preference for receiving reward sooner rather than only in the distant future.

Setting $\gamma = 0$: only the immediate reward matters.
Setting $\gamma = 1$: future rewards are given full weight (more concern for the future).
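A tiny illustration of how $\gamma$ trades off immediate against future reward (the reward sequence is made up):

```python
# Hypothetical reward sequence along one trajectory.
rewards = [1.0, 0.0, 0.0, 10.0]

def discounted_return(rewards, gamma):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.0))  # 1.0  -> only the immediate reward counts
print(discounted_return(rewards, gamma=0.5))  # 2.25 -> future reward is discounted
print(discounted_return(rewards, gamma=1.0))  # 11.0 -> future reward counts in full
```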

To compute the value of a state, we can sample many trajectories starting from that state and average their returns (Monte Carlo),
or use the Bellman equation:
$V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s)\, V(s')$
The Bellman equation relates the value of the current state to the values of its successor states.
$R(s)$ here is the reward obtained at the current moment for being in state $s$; it does not depend on the next moment.

The Bellman equation can be written in matrix form:
$V = R + \gamma P V \;\Rightarrow\; V = (I - \gamma P)^{-1} R$
The value function can thus be obtained by matrix inversion, but when there are millions of states the inversion becomes prohibitively expensive.
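A minimal sketch of this closed-form solution, reusing the small hypothetical transition matrix from above together with a made-up reward vector; np.linalg.solve plays the role of the matrix inversion:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
R = np.array([1.0, 0.0, 5.0])   # hypothetical reward for each state
gamma = 0.9

# Solve (I - gamma * P) V = R, i.e. V = (I - gamma * P)^{-1} R.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```

The direct solve costs roughly cubic time in the number of states, which is why the iterative methods below are preferred for large problems.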

The simpler approach is to use iterative methods:

Dynamic Programming
Monte Carlo
TD learning

(1) Monte Carlo
Starting from a given state, sample many trajectories, compute the return G for each, and average them to obtain the value of that state (a minimal sketch follows below).
(2) Dynamic programming
Iterate the Bellman equation on the value function until it converges.
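A minimal Monte Carlo sketch for point (1), reusing the hypothetical 3-state MRP from above: sample many trajectories from a start state, compute the return G for each, and average:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
R = np.array([1.0, 0.0, 5.0])
gamma, horizon, n_episodes = 0.9, 50, 5000
rng = np.random.default_rng(0)

def mc_value(start):
    """Average the discounted return over many sampled trajectories."""
    returns = []
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]           # reward for being in the current state
            discount *= gamma
            s = rng.choice(len(R), p=P[s])
        returns.append(g)
    return np.mean(returns)

print([round(mc_value(s), 2) for s in range(len(R))])
```

The averages should approach the values obtained from the closed-form solution above.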

III. Markov Decision Processes

Compared with a Markov reward process, a Markov decision process adds actions.
The transition probability function and the reward function each gain an extra argument, the action $a$: $P(s' \mid s, a)$ and $R(s, a)$.

With actions comes the notion of a policy. A policy can take two forms: a probabilistic form, which gives the probability with which each action is selected (the probabilities are assumed to be stationary), or a deterministic form, which directly specifies the action to take.

Converting between Markov decision processes and Markov reward processes:
Given an MDP and a policy, summing over the actions directly yields the transition probability of a Markov reward process; the action can be marginalised out of the reward function in the same way:
$P^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s)\, P(s' \mid s, a)$
$R^{\pi}(s) = \sum_{a} \pi(a \mid s)\, R(s, a)$
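A small sketch of this conversion, with hypothetical arrays: P[s, a, s'] is the MDP transition tensor, Rsa[s, a] the reward function, and pi[s, a] a stochastic policy:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'], rows sum to 1
Rsa = rng.normal(size=(n_states, n_actions))                      # R(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform stochastic policy

# Marginalise out the action under the policy to obtain an MRP.
P_pi = np.einsum("sa,sat->st", pi, P)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
R_pi = np.einsum("sa,sa->s", pi, Rsa)   # R^pi(s)    = sum_a pi(a|s) R(s,a)
print(P_pi.sum(axis=1))                 # each row of P^pi still sums to 1
```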

The value function of a Markov decision process is now redefined with the expectation taken under the policy $\pi$ (because the return $G$ depends on $\pi$): $v^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s]$.
The Q-function is likewise defined with respect to $\pi$: $q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]$.
The relationship between the two is that $v^{\pi}$ is the q-function summed over all actions: $v^{\pi}(s) = \sum_{a} \pi(a \mid s)\, q^{\pi}(s, a)$.

The Bellman equation under a policy $\pi$ is called the Bellman expectation equation; the expectation sums over all possible actions:
$v^{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, v^{\pi}(s_{t+1}) \mid s_t = s]$, $q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a]$
Note that this is again written as a probability-weighted sum (also an expectation): the value function is obtained by summing over the possibilities.

Substituting the two into each other gives:
$v^{\pi}(s) = \sum_{a} \pi(a \mid s)\Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v^{\pi}(s') \Big)$
$q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q^{\pi}(s', a')$

The backup diagrams of these two equations show how the value of a state (or state-action pair) is backed up from the values of its successors.

Part 2

Prediction (computing the value function) and control (finding the optimal policy and the optimal value function) in a Markov decision process.

I. Solving with Dynamic Programming

Dynamic programming decomposes a problem into subproblems: if the subproblems can be solved, the original problem can be solved. A Markov decision process satisfies this structure, because the Bellman equation decomposes it into a recursive sequence of subproblems.

1. Policy Iteration

(1) Policy evaluation

With the current policy fixed, iterate the Bellman expectation equation repeatedly until it converges:
$v_{k+1}(s) = \sum_{a} \pi(a \mid s)\Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_k(s') \Big)$
Given the value function from the previous iteration, we can compute the value at the current iteration.

After the sum over actions is eliminated, the problem reduces to a Markov reward process, and a more streamlined update (in the style of value iteration) gives the value of each state:
$v_{k+1}(s) = R^{\pi}(s) + \gamma \sum_{s'} P^{\pi}(s' \mid s)\, v_k(s')$
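A minimal policy-evaluation sketch of this update, assuming P_pi and R_pi have already been obtained from the policy as in the conversion sketch above:

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup until the values stop changing."""
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V   # v_{k+1} = R^pi + gamma * P^pi v_k
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```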

(2) Policy improvement

Solving an MDP means obtaining the optimal value function; there may be more than one optimal policy:
$v^{*}(s) = \max_{\pi} v^{\pi}(s)$, $\pi^{*}(s) = \arg\max_{\pi} v^{\pi}(s)$

How do we find it?
After the value function $v$ has converged, computing the q-function and taking the maximising action in every state gives the optimal policy:
$\pi'(s) = \arg\max_{a} q^{\pi}(s, a)$
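A one-step greedy improvement sketch, assuming the hypothetical MDP arrays P[s, a, s'] and Rsa[s, a] from earlier and a converged value function V:

```python
import numpy as np

def greedy_policy(P, Rsa, V, gamma=0.9):
    """Compute q(s, a) from V, then act greedily in every state."""
    Q = Rsa + gamma * np.einsum("sat,t->sa", P, V)  # q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    return np.argmax(Q, axis=1)                     # deterministic improved policy
```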
For a proof, see Reinforcement Learning: An Introduction.

When the improvement stops, we obtain the Bellman optimality equation:
$v^{\pi}(s) = \max_{a} q^{\pi}(s, a)$

At the same time we obtain transfer equations between the q-function and the v-function; the equation relating v to itself is the basis of value iteration, and the equation relating q to itself is the basis of Q-learning:
$v^{*}(s) = \max_{a}\Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v^{*}(s') \Big)$
$q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} q^{*}(s', a')$

2. Value Iteration

By repeatedly iterating the optimal value function (the Bellman optimality backup), we eventually obtain the optimal values:
$v_{k+1}(s) = \max_{a}\Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_k(s') \Big)$
To recover the optimal policy, reconstruct the q-function from the converged values and take the argmax over actions in each state, extracting a policy once the iterations finish.
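A minimal value-iteration sketch under the same hypothetical P and Rsa arrays: iterate the Bellman optimality backup, then extract a greedy policy from the reconstructed q-function:

```python
import numpy as np

def value_iteration(P, Rsa, gamma=0.9, tol=1e-8):
    """Iterate v_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        Q = Rsa + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)                   # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, np.argmax(Q, axis=1)  # optimal values and a greedy policy
        V = V_new
```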

3. Comparison of the Two

[Figure: table comparing policy iteration and value iteration]

4. Summary

[Figure: summary table of dynamic-programming algorithms for prediction and control]

Origin: blog.csdn.net/def_init_myself/article/details/105298200