Markov Reward Process

The Markov process part of the Markov decision process was introduced in a previous post, which you can find here:
Markov Processes of the Markov Decision Process

In this article, we summarize the Markov Reward Process (MRP), the value function, and related concepts of the Markov decision process.

1. Markov Reward Process


A Markov reward process adds a reward function R and a discount factor γ to a Markov process, giving the tuple <S, P, R, γ>.

R is the reward function. The reward of state s is the expected reward obtained at the next time step (t+1), given that the process is in state s at time t:
R_s = E[R_{t+1} | S_t = s]
Here you may wonder why the reward is written as R_{t+1} rather than R_t. I tend to understand it as: the reward is received upon leaving the state rather than upon entering it. In the lecture video, students also asked David about this.

David's answer: he pointed out that this is merely a convention, adopted to make it more convenient to describe the observation O, action A, and reward R involved in the RL problem.

He also pointed out that writing the reward as R_t instead of R_{t+1} is essentially the same, as long as it is specified consistently; in that formulation the reward would be described as "you receive the corresponding reward when entering a state." Just treat it as a convention.

The detailed definition is as follows:

A Markov Reward Process is a tuple <S, P, R, γ>:
  • S is a finite set of states
  • P is the state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]
  • R is the reward function, R_s = E[R_{t+1} | S_t = s]
  • γ is the discount factor, γ ∈ [0, 1]
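To make the tuple concrete, here is a minimal Python sketch of an MRP as a plain data container (this representation is my own illustration, not something from the lecture): states are strings, P is a nested dict of transition probabilities, and R maps each state to its expected immediate reward.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MRP:
    """A Markov Reward Process <S, P, R, gamma>."""
    states: List[str]                  # S: finite set of states
    P: Dict[str, Dict[str, float]]     # P[s][s'] = Pr(S_{t+1} = s' | S_t = s)
    R: Dict[str, float]                # R[s]     = E[R_{t+1} | S_t = s]
    gamma: float                       # discount factor, 0 <= gamma <= 1
```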

2. Example: Student MRP


The figure below shows an example Markov Reward Process diagram: on top of the Markov process from the previous post, a reward is attached to each state.
[Figure: Student MRP — the Markov process graph with a reward attached to each state]

For example: when a student is in the first class (Class1), the reward obtained on moving on to the second class (Class2) is -2; likewise, the reward obtained on switching to browsing Facebook is also -2, because the reward depends only on the state being left (Class1).

When the student is in the Facebook-browsing state, the reward obtained for continuing to browse at the next step is -1, and the reward obtained for returning to class is likewise -1.

When the student is in the second class (Class2), the reward obtained on continuing to the third class (Class3) is -2, and the reward obtained on going to Sleep instead is also -2.

When the student is in the third class (Class3), the reward for passing the test is +10; the rewards in the other states are read off the diagram in the same way.
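Using the MRP class sketched above, the Student MRP can be written down directly. The transition probabilities and rewards below are my reading of the diagram in David Silver's lecture (e.g. from Class1 the student moves to Class2 or Facebook with probability 0.5 each), so treat the exact numbers as an assumption rather than something stated in this text:

```python
student_mrp = MRP(
    states=["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"],
    P={
        "C1":    {"C2": 0.5, "FB": 0.5},
        "C2":    {"C3": 0.8, "Sleep": 0.2},
        "C3":    {"Pass": 0.6, "Pub": 0.4},
        "Pass":  {"Sleep": 1.0},
        "Pub":   {"C1": 0.2, "C2": 0.4, "C3": 0.4},
        "FB":    {"FB": 0.9, "C1": 0.1},
        "Sleep": {"Sleep": 1.0},   # terminal state
    },
    R={"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0},
    gamma=0.5,   # the discount used in the sample-returns example later in this post
)
```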

3. Return


Definition: the return G_t is the discounted sum of all rewards from time step t onward along a Markov reward chain.
The definition formula is as follows:
G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... = Σ_{k=0}^{∞} γ^k · R_{t+k+1}

Here the discount factor γ ∈ [0, 1]: a reward R received k+1 time steps in the future is worth γ^k · R at time t. A γ close to 0 gives a "myopic" evaluation that cares mostly about immediate rewards, while a γ close to 1 gives a "far-sighted" evaluation that weights future rewards almost fully.
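In code, the return of a single sampled reward sequence is just this discounted sum. The small helper below (a hypothetical name, not from the lecture) makes the formula explicit and reproduces the -2.25 figure used later in this post for the chain Class1 → Class2 → Class3 → Pass → Sleep with γ = 1/2:

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one sampled chain."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected along C1 -> C2 -> C3 -> Pass -> Sleep (reading -2, -2, -2, +10)
print(discounted_return([-2, -2, -2, 10], gamma=0.5))   # -2.25
```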

4. Why discount?


Why does the calculation of the return need a discount factor? David gave the following explanations:

  • The convenience of mathematical expression (David noted this is the most important reason)
  • It avoids infinite returns in cyclic Markov processes
  • Uncertainty about the future: long-term rewards are less certain
  • If the rewards are financial, immediate rewards may earn more interest than delayed rewards
  • Animal and human behavior shows a preference for immediate reward
    The slide is as follows:
    [Slide: Why discount?]

5. Value function


The value function gives the long-term value of a given state (or action).
Definition: the value function v(s) of a state in a Markov reward process is the expected return of the Markov chain starting from that state:
v(s) = E[G_t | S_t = s]
Why is there an expectation symbol? Because, as described above for G_t, there is more than one possible Markov chain from time t to the terminal state; each chain has its own probability and its own return, so weighting each return by its probability naturally introduces the expectation. The slide is as follows:

[Slide: definition of the state value function]
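Because v(s) is an average over all the chains that can unfold from s, a simple (if crude) way to estimate it is Monte Carlo sampling: roll out many chains from s, compute each one's return with the discounted_return helper above, and average. The sketch below continues with the student_mrp object defined earlier and is only an illustration of the idea:

```python
import random

def sample_rewards(mrp: MRP, start: str, max_len: int = 200) -> list:
    """Sample one chain from `start`, recording the reward received on leaving each state."""
    s, rewards = start, []
    for _ in range(max_len):
        rewards.append(mrp.R[s])
        if s == "Sleep":               # terminal state: stop the chain
            break
        succ = mrp.P[s]
        s = random.choices(list(succ), weights=list(succ.values()))[0]
    return rewards

def mc_value(mrp: MRP, s: str, n_episodes: int = 20000) -> float:
    """Estimate v(s) = E[G_t | S_t = s] by averaging sampled returns."""
    total = sum(discounted_return(sample_rewards(mrp, s), mrp.gamma)
                for _ in range(n_episodes))
    return total / n_episodes

print(mc_value(student_mrp, "C1"))     # roughly -2.9 under the numbers assumed above
```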

6. Example: Student MRP Returns


Let's look at the example of G1:
[Figure: sample returns G_1 for the Student MRP, starting from Class1 with γ = 1/2]
The calculation in the figure above is carried out on the following Markov Reward Process graph:
[Figure: Student MRP diagram]

We can see that G_1 is computed along 4 sample paths here, each with its own probability of occurring. This is exactly why the value function carries an expectation when evaluating the value of a state.
In the example above, if we pretend these are the only four paths and each occurs with probability 1/4, the value function works out to:
v(s) = (-2.25 + (-3.125) + (-3.41) + (-3.20)) / 4 ≈ -2.996
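The same arithmetic can be checked mechanically with the discounted_return helper from earlier: reproduce the individual sample returns (the reward sequences below are my reading of the two shortest chains in the figure), then average the four values with equal weight, as assumed above.

```python
gamma = 0.5

# C1 -> C2 -> C3 -> Pass -> Sleep   and   C1 -> FB -> FB -> C1 -> C2 -> Sleep
print(discounted_return([-2, -2, -2, 10], gamma))       # -2.25
print(discounted_return([-2, -1, -1, -2, -2], gamma))   # -3.125

# Equal-weight average of the four sampled returns from the figure
returns = [-2.25, -3.125, -3.41, -3.20]
print(sum(returns) / len(returns))                      # about -2.996
```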

That is all for now; the next post will cover the Bellman equation, the Markov Decision Process, and related topics~

Reference:
David Silver's reinforcement learning course, Lecture 2 - Markov Decision Processes
Ye Qiang: "Reinforcement Learning" lecture notes, Markov Decision Process

Recommended reading:

Markov Processes of Markov Decision Process
[Deep Learning in Practice] How to handle padding of variable-length RNN input sequences in PyTorch
[Machine Learning Fundamentals] A detailed explanation of maximum a posteriori (MAP) estimation

You are welcome to follow the WeChat official account to learn and discuss~



Origin blog.51cto.com/15009309/2554225