Lecture 2: Markov Decision Processes
Part 1
I. Markov chains
The Markov property: the future is independent of the past given the present; the next state depends only on the current state.
State transition matrix
Given a Markov chain, we can sample from it to obtain trajectories.
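As a concrete illustration, a Markov chain can be sampled directly from its transition matrix. The 3-state chain below is a made-up example, not one from the lecture:

```python
import numpy as np

# Toy 3-state Markov chain (illustrative numbers, not from the lecture).
# Row s of P gives the distribution over next states: P[s, s'] = Pr(s' | s).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.1, 0.9],
])

def sample_trajectory(P, start, length, rng):
    """Sample a trajectory of `length` states starting from `start`."""
    traj = [start]
    for _ in range(length - 1):
        traj.append(int(rng.choice(len(P), p=P[traj[-1]])))
    return traj

print(sample_trajectory(P, start=0, length=8, rng=np.random.default_rng(0)))
```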
II. Markov reward processes
Markov Chain + reward function
The reward function is an expectation: the expected reward received in a given state.
Analogy: an unpowered paper boat drifting along; when it drifts to a certain position, it receives the corresponding reward.
The value function is the expectation of future reward.
Reasons for introducing the discount factor gamma:
It avoids infinite returns when the chain contains cycles, and it expresses a preference for receiving reward sooner rather than only later.
Setting gamma = 0: only the immediate reward matters.
Setting gamma = 1: future rewards are weighted fully.
To compute the value of a state, sample many trajectories from it and average their returns (Monte Carlo),
or use the Bellman equation:
The Bellman equation relates the value of the current state to the values of the next states: V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s').
Here R(s) is the reward obtained on reaching state s at the current moment; it does not depend on the next time step.
The Bellman equation in matrix form: V = R + gamma * P V, so V = (I - gamma * P)^(-1) R.
The value can thus be obtained by matrix inversion, but when there are on the order of millions of states the inversion is far too expensive.
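A sketch of the closed-form solution on a toy MRP (the transition matrix, rewards, and gamma are all assumed for illustration); solving the linear system avoids forming the inverse explicitly:

```python
import numpy as np

# Toy MRP (illustrative numbers): solve V = (I - gamma * P)^(-1) R.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.1, 0.9],
])
R = np.array([1.0, 0.0, -1.0])  # R[s]: reward for reaching state s
gamma = 0.9

# Solving the linear system is cheaper and more stable than inverting.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```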
The simplest alternative is iterative methods:
Dynamic Programming
Monte Carlo
TD learning
(1) Monte Carlo
Starting from a given state, sample many trajectories, compute the return G of each, and average them to obtain the value.
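A minimal Monte Carlo sketch on a toy MRP (all numbers are assumptions): sample trajectories, accumulate each discounted return G, and average.

```python
import numpy as np

# Toy MRP (illustrative numbers, not from the lecture).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.1, 0.9],
])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

def mc_value(state, n_episodes=1000, horizon=60, seed=0):
    """Estimate V(state) by averaging discounted returns of sampled trajectories."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        s, G, discount = state, 0.0, 1.0
        for _ in range(horizon):        # truncate: gamma**60 is negligible
            G += discount * R[s]
            discount *= gamma
            s = int(rng.choice(len(P), p=P[s]))
        returns.append(G)
    return float(np.mean(returns))

print(mc_value(0))
```

With enough episodes the estimate approaches the closed-form value of the same MRP.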
(2) Dynamic programming
Iterate the value function with the Bellman equation until convergence.
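The dynamic-programming variant as a sketch (same kind of assumed toy MRP): apply the Bellman backup V <- R + gamma * P V until the update is tiny.

```python
import numpy as np

# Toy MRP (illustrative numbers). Iterate the Bellman backup to convergence.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.1, 0.9],
])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

V = np.zeros(len(R))
while True:
    V_new = R + gamma * P @ V          # Bellman backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V_new)
```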
III. Markov decision processes
Compared with a Markov reward process, an MDP has one more element: actions.
The transition probability and the reward function each gain an action argument a: P(s'|s, a) and R(s, a).
With actions comes the notion of a policy. Policies come in two forms: stochastic, giving the probability of each action being selected (the probabilities are assumed stationary), or deterministic.
Converting between Markov decision processes and Markov reward processes:
Given an MDP and a policy, summing over actions directly yields the transition probability of the induced MRP: P_pi(s'|s) = sum_a pi(a|s) P(s'|s, a); the reward function converts the same way: R_pi(s) = sum_a pi(a|s) R(s, a).
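The conversion can be sketched directly (the MDP numbers and the uniform policy below are assumptions for illustration):

```python
import numpy as np

# Toy 3-state, 2-action MDP (illustrative numbers, not from the lecture).
# P_mdp[a, s, s'] = Pr(s' | s, a); R_mdp[s, a] = reward for action a in s.
P_mdp = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]],  # action 1
])
R_mdp = np.array([[1.0, 0.0], [0.0, 0.5], [-1.0, 2.0]])
pi = np.full((3, 2), 0.5)  # uniform stochastic policy pi(a|s)

# Sum out the action: P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a); same for R.
P_pi = np.einsum('sa,ast->st', pi, P_mdp)
R_pi = np.einsum('sa,sa->s', pi, R_mdp)
print(P_pi)
print(R_pi)
```

The resulting P_pi is again a stochastic matrix, so all the MRP machinery above applies to it unchanged.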
The value function of an MDP is redefined with the expectation taken under the policy pi (because the return G depends on pi): v_pi(s) = E_pi[G_t | s_t = s].
The q-function is defined similarly, also under pi: q_pi(s, a) = E_pi[G_t | s_t = s, a_t = a].
Relationship between the two: v is the sum of q over all actions, v_pi(s) = sum_a pi(a|s) q_pi(s, a).
The Bellman equation under policy pi is called the Bellman expectation equation; it sums over all possible actions.
Note that this is likewise written as a probability-weighted sum (also an expectation: the value is obtained by summing up).
Substituting the two into each other gives the Bellman expectation equations in terms of v_pi alone and q_pi alone:
Backup diagrams of the two equations.
Part 2
Prediction (computing the value function) and control (finding the optimal policy and the optimal value function) for Markov decision processes.
I. Dynamic programming
Decompose a problem into subproblems; if the subproblems can be solved, the original problem can be solved. MDPs satisfy the structure that dynamic programming requires, because the Bellman equation decomposes them into a recursive series.
1. Policy iteration
(1) Policy evaluation
Repeatedly iterate the Bellman expectation equation with the current policy until convergence:
From the value function at the previous step, we can compute the value function at the current step.
After eliminating the sum over actions (i.e., converting to a Markov reward process), a more streamlined iteration yields the value of each state:
(2) Policy improvement
Solving an MDP means obtaining the optimal value function; there may be more than one optimal policy.
How do we find it?
After the value function v has converged, compute q and take, in each state, the action that maximizes it; this greedy policy is the optimal policy.
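Policy iteration can be sketched end to end on a toy MDP (all numbers below are assumptions): exact evaluation of the induced MRP, then greedy improvement via q, repeated until the policy stops changing.

```python
import numpy as np

# Toy 3-state, 2-action MDP (illustrative numbers, not from the lecture).
P = np.array([  # P[a, s, s']
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]],
])
R = np.array([[1.0, 0.0], [0.0, 0.5], [-1.0, 2.0]])  # R[s, a]
gamma, n_states = 0.9, 3

policy = np.zeros(n_states, dtype=int)  # deterministic policy: state -> action
while True:
    # (1) Policy evaluation: solve the MRP induced by the current policy.
    P_pi = P[policy, np.arange(n_states)]          # P_pi[s, s']
    R_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # (2) Policy improvement: greedy w.r.t. q(s,a) = R(s,a) + gamma * E[v(s')].
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                                      # improvement has stopped
    policy = new_policy
print(policy, v)
```

When the loop exits, v equals max_a q(s, a) in every state, which is exactly the stopping condition described above.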
For the proof, see Reinforcement Learning: An Introduction.
When the improvement stops, we obtain the Bellman optimality equation:
This also yields the conversion equations between the q-function and the v-function: the equation written in terms of v is the basis of value iteration, while the one written in terms of q is the basis of Q-Learning:
2. Value iteration
By repeatedly iterating the optimal value function (the Bellman optimality backup), we eventually obtain the optimal v.
To find the optimal policy, reconstruct q from the converged value function and take the argmax to read off the policy.
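A matching value-iteration sketch (same kind of assumed toy MDP): iterate the Bellman optimality backup, then reconstruct q once at the end to read off the greedy policy.

```python
import numpy as np

# Toy 3-state, 2-action MDP (illustrative numbers, not from the lecture).
P = np.array([  # P[a, s, s']
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]],
])
R = np.array([[1.0, 0.0], [0.0, 0.5], [-1.0, 2.0]])  # R[s, a]
gamma = 0.9

v = np.zeros(3)
while True:
    q = R + gamma * np.einsum('ast,t->sa', P, v)  # q[s, a]
    v_new = q.max(axis=1)                         # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
policy = q.argmax(axis=1)  # extract the greedy policy from the final q
print(v_new, policy)
```

Unlike policy iteration, the intermediate iterates need not correspond to any policy; the policy is extracted once after convergence.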