In a previous article we introduced the Markov Process underlying the Markov decision process: Markov Processes of the Markov Decision Process.
In this article, we summarize the Markov Reward Process, the value function, and related concepts of the Markov decision process.
1. Markov Reward Process
A Markov reward process extends a Markov process with a reward function R and a discount factor γ, giving the tuple <S, P, R, γ>.
R is the reward function: the reward for state s is the expectation of the reward received at the next time step (t+1), given that the process is in state s at time t, as follows:

R_s = E[ R_{t+1} | S_t = s ]
You may wonder why the reward is written R_{t+1} rather than R_t. We prefer to read it as the reward obtained on leaving a state, rather than on entering it. In the course video, a student asked David about exactly this point.
David's answer: it is only a convention, chosen to make it more convenient to describe the observation O, action A, and reward R involved in the RL problem.
He also pointed out that if the reward were written R_t instead of R_{t+1}, the essence would be the same as long as the convention is clearly specified; in that phrasing the reward would be described as "the reward you receive on entering a state." So just treat it as a convention.
The detailed definition is as follows:
2. Example: Student MRP
The following figure is an example "Markov Reward Process" diagram: on the basis of the earlier "Markov Process" example, a reward is attached to each state.
For example, when a student is in the first class (Class1), the reward received at the next step is -2, whether he/she continues to the second class (Class2) or switches to browsing Facebook.
While in the Facebook state, the reward received at the next step is -1, whether he/she keeps browsing or returns to class.
When the student is in the second class (Class2), the reward received at the next step is -2, whether he/she continues to the third class (Class3) or goes to sleep.
In the third class (Class3) the reward is likewise -2; the Pass state then yields +10, the Pub +1, and Sleep 0. In every case the reward depends only on the state being left, matching the convention R_s = E[ R_{t+1} | S_t = s ].
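To make the example concrete, the Student MRP can be sketched in Python as plain dictionaries. The transition probabilities and per-state rewards below are read off the course slides; treat the exact numbers as an assumption if your copy of the slides differs, and the state names are our own shorthand.

```python
# A sketch of the Student MRP from David Silver's Lecture 2 slides.
# P[s] maps each successor state to its transition probability.
P = {
    "Class1":   {"Class2": 0.5, "Facebook": 0.5},
    "Facebook": {"Facebook": 0.9, "Class1": 0.1},
    "Class2":   {"Class3": 0.8, "Sleep": 0.2},
    "Class3":   {"Pass": 0.6, "Pub": 0.4},
    "Pub":      {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4},
    "Pass":     {"Sleep": 1.0},
    "Sleep":    {},  # terminal state
}

# R[s]: expected reward received at the next step when leaving state s.
R = {"Class1": -2, "Class2": -2, "Class3": -2,
     "Facebook": -1, "Pub": 1, "Pass": 10, "Sleep": 0}

# Sanity check: every non-terminal row of P must sum to 1.
for s, row in P.items():
    assert not row or abs(sum(row.values()) - 1.0) < 1e-9, s
```

Encoding the chain this way makes it easy to sample paths from it later when estimating values.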
3. Return
Definition: the return G_t is the discounted sum of all rewards from time t onward along a Markov reward chain.
The definition formula is as follows:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
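For a finite reward sequence, the definition translates directly into a short helper; `discounted_return` is our own name for it, not something from the course.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    Computed by folding from the back, using the recursion
    G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```

The backward fold is the same recursion used later by the Bellman equation, which is why it is the idiomatic way to compute returns.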
4. Why Discount?
Why does the calculation of the return need a discount factor? David gave the following reasons:
- Mathematical convenience (and this is the most important one)
- It avoids infinite returns in cyclic Markov processes
- Long-term rewards are uncertain
- In finance, immediate rewards earn more interest than delayed rewards
- It matches the human (and animal) trait of preferring immediate rewards
The slide is as follows:
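The second point can be checked numerically: in a cyclic chain that pays a constant reward of 1 forever, the undiscounted return grows without bound, while with γ < 1 the partial sums converge to 1/(1-γ). A minimal sketch:

```python
# Partial sums of the return for a constant reward of 1 at every step.
# With gamma < 1 they converge to 1/(1-gamma); with gamma = 1 they
# just keep growing with the number of steps.
def partial_return(gamma, steps):
    return sum(gamma ** k for k in range(steps))

print(partial_return(0.9, 1000))  # ~10.0, i.e. 1/(1 - 0.9)
print(partial_return(1.0, 1000))  # 1000.0 -> diverges as steps grow
```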
5. Value Function
The value function gives the long-term value of a certain state or behavior.
Definition: the value function v(s) of a state s in a Markov reward process is the expected return starting from that state:

v(s) = E[ G_t | S_t = s ]
Why is there an expectation symbol? Because, as noted above for G_t, there is more than one Markov chain leading from state s to a terminal state; each chain has its own probability and its own return, so weighting each return by its probability naturally introduces the expectation. The slide is as follows:
6. Example: Student MRP Returns
Let's look at the example of G1:
The calculation in the figure above is carried out over the following Markov Reward Process graph:
We can see that G1 is computed over 4 sample paths, each with a corresponding probability. This is exactly why the value function needs an expectation when evaluating the value of a state.
In the above example, if the value function were computed as (assuming there are only these four paths, each with probability 1/4):
v(s) = (-2.25 + (-3.125) + (-3.41) + (-3.20)) / 4 = -2.996
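The first two sample returns in the figure can be checked mechanically. The reward sequences below are read off the corresponding paths in the course slides (γ = 1/2), and `discounted_return` is our own helper, not course code:

```python
def discounted_return(rewards, gamma):
    """G_1 = R_2 + gamma*R_3 + gamma^2*R_4 + ... for a finite reward list,
    folded from the back via G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.5
# Path C1 -> C2 -> C3 -> Pass -> Sleep: rewards -2, -2, -2, +10
print(discounted_return([-2, -2, -2, 10], gamma))      # -2.25
# Path C1 -> FB -> FB -> C1 -> C2 -> Sleep: rewards -2, -1, -1, -2, -2
print(discounted_return([-2, -1, -1, -2, -2], gamma))  # -3.125
```

Averaging many such sampled returns, each weighted by how often its path occurs, is precisely the Monte Carlo view of the expectation v(s) = E[G_t | S_t = s].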
That's all for now; the next article summarizes the Bellman Equation, the Markov Decision Process, and related topics~
Reference:
David Silver's reinforcement learning course, Lecture 2: Markov Decision Processes
Ye Qiang: "Reinforcement Learning" lecture notes on the Markov Decision Process