Markov Decision Process (MDP), Markov Reward Process (MRP)

Introduction

In probability theory and statistics, a Markov process is a stochastic process with the Markov property, named after the Russian mathematician Andrey Markov. A Markov process is memoryless: the conditional probability distribution of its future states depends only on the current state of the system and is independent of its past history.

1. Markov Decision Process

Among machine learning methods (supervised, unsupervised, and weakly supervised), the Markov decision process underlies a form of weak supervision known as reinforcement learning. Reinforcement learning differs from traditional supervised and unsupervised methods in that those methods produce a final result in a single step: they cannot describe a decision-making process, and they cannot directly score the quality of each individual decision. The feedback available for each decision is therefore weak, so to some extent reinforcement learning can also be viewed as weakly supervised learning. From a modeling perspective, it belongs to the family of Markov models and is closely related to the hidden Markov model.

Commonly used Markov models are usually divided according to whether the system states are fully observable and whether the process is controlled (i.e., whether actions are taken):

  • Fully observable states, no actions: Markov chain / Markov reward process
  • Fully observable states, with actions: Markov decision process (MDP)
  • Partially observable states, no actions: hidden Markov model (HMM)
  • Partially observable states, with actions: partially observable Markov decision process (POMDP)

1.1 MDP definition

An MDP is a Markov reward process with decisions (actions). We give the definition of a Markov decision process directly:

  • State: the set of states the agent can be in at each step
  • Action: the set of actions the agent can perform at each step
  • Transition probability: the probability that the agent transitions to state s' after performing action a in state s
  • Reward: the reward obtained immediately after the agent, in state s, performs action a and transitions to state s'
  • Policy: the probability with which the agent performs action a in state s

It is worth noting that in the Markov decision process considered here, the state set is discrete, the action set is discrete, the transition probabilities are known, and the rewards are known. Learning under these conditions is called model-based learning.
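As a concrete illustration (a minimal sketch with made-up names, not code from the original article), such a small finite MDP with known transitions and rewards can be stored in plain Python dictionaries:

```python
# A tiny illustrative MDP with two states and two actions.
# transitions[s][a] is a list of (probability, next_state, reward) triples.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

transitions = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "move": [(1.0, "s0", 0.0)],
    },
}
```

This tabular representation is assumed by the algorithm sketches below.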


1.2 Problem solving 1


1.2.1 Policy iteration algorithm

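In outline, policy iteration alternates policy evaluation (compute the value function of the current policy) with policy improvement (act greedily with respect to that value function), until the policy stops changing. A minimal Python sketch, assuming the tabular transitions / gamma representation shown in section 1.1 (illustrative code, not the article's original):

```python
def policy_iteration(states, actions, transitions, gamma, theta=1e-8):
    # Start from an arbitrary deterministic policy and zero values.
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        # Expected one-step reward plus discounted value of the successor state.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: q_value(s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V
```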

1.2.2 Value iteration algorithm

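Value iteration instead applies the Bellman optimality backup directly to the value function and only extracts a greedy policy at the end. A minimal sketch under the same assumptions as above:

```python
def value_iteration(states, actions, transitions, gamma, theta=1e-8):
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: value of the best action.
            v_new = max(q_value(s, a) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    # Greedy policy extraction from the converged value function.
    policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
    return policy, V
```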

1.3 Examples

1.3.1 Policy iteration example

This car rental example is solved with the policy iteration algorithm for Markov decision processes; the detailed procedure is described at

https://github.com/persistforever/ReinforcementLearning/tree/master/carrental

1.3.2 Value iteration example

Gambler's problem: a gambler places bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as much money as he staked on that flip; if it comes up tails, he loses his stake. The game ends when the gambler either loses all his money or reaches $100. The probability that the coin comes up heads is p. The gambling process is an undiscounted, episodic, finite Markov decision problem.

The problem is solved with the value iteration algorithm for Markov decision processes. For details, see
https://github.com/persistforever/ReinforcementLearning/tree/master/gambler
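A minimal value-iteration sketch of the gambler's problem is shown below. It follows the description above (goal of $100, undiscounted, reward 1 only for reaching the goal); the function name and the default p_heads = 0.4 are illustrative choices, not taken from the linked repository:

```python
def gambler_value_iteration(p_heads=0.4, goal=100, theta=1e-9):
    # States are the gambler's capital 0..goal; 0 and goal are terminal.
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # reaching the goal yields reward 1, everything else 0

    def action_values(s):
        # A stake can be at most min(s, goal - s).
        for stake in range(1, min(s, goal - s) + 1):
            yield p_heads * V[s + stake] + (1 - p_heads) * V[s - stake]

    while True:
        delta = 0.0
        for s in range(1, goal):
            v_new = max(action_values(s))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    # Greedy policy: the stake that maximizes the chance of reaching the goal.
    policy = {s: max(range(1, min(s, goal - s) + 1),
                     key=lambda a: p_heads * V[s + a] + (1 - p_heads) * V[s - a])
              for s in range(1, goal)}
    return V, policy
```

Under this formulation the converged value V[s] is the probability of reaching $100 starting with capital s.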

1.4 Problem solving 2

1.4.1 Policies

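For reference, a policy is defined as a distribution over actions given the current state:

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$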

1.4.2 Policy-based Value Function

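The standard definitions are: the state-value function is the expected return starting from state s and then following policy π, and the action-value function is the expected return starting from s, taking action a, and then following π:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$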

1.4.3 Bellman Expectation Equation

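In standard form, the Bellman expectation equation decomposes the value function into the immediate reward plus the discounted value of the successor state, averaged over the policy and the transition dynamics:

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\Big(R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_\pi(s')\Big)$$

$$q_\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$$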

1.4.4 Optimal Value Function

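The optimal value functions are the best values achievable by any policy:

$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a)$$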

1.4.5 Theorem of MDP

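The standard result is: for any Markov decision process there exists an optimal policy π* that is at least as good as every other policy; all optimal policies achieve the optimal state-value function v* and the optimal action-value function q*; and there always exists a deterministic optimal policy.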

1.4.6 Finding an Optimal Policy

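An optimal policy can be found by acting greedily with respect to the optimal action-value function:

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} q_*(s, a') \\ 0 & \text{otherwise} \end{cases}$$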

1.4.7 Bellman Optimality Equation

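In standard form, the Bellman optimality equations relate the optimal value functions to themselves through a maximization over actions:

$$v_*(s) = \max_a \Big(R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_*(s')\Big)$$

$$q_*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} q_*(s', a')$$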

1.4.7.1 Solving the Bellman Optimality Equation

The Bellman optimality equation is nonlinear and, in general, has no closed-form solution. Several well-known iterative solution methods exist:

  • Value Iteration
  • Policy Iteration
  • Q-learning
  • Sarsa

These methods are covered in more detail later.

1.5 Optimal decision

The objective function above may still be unclear: how do we solve for the optimal decision, i.e., how do we maximize the cumulative return?

The following example shows how to solve this objective function. It also makes the distinction explicit: the cumulative return is the return accumulated over an entire process, while the reward function gives the reward of each individual step.
Consider again the optimization problem above: with s as the initial state, we maximize the cumulative return obtained by following the decision function (policy) from the initial state to the terminal state.
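Written out, with γ the discount factor and R(s_t, a_t) the reward at step t, the cumulative return and the optimal policy are:

$$G = \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t), \qquad \pi^* = \arg\max_\pi \mathbb{E}\big[\,G \mid s_0 = s, \pi\,\big]$$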

1.6 Value iteration

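For reference, each sweep of value iteration applies the Bellman optimality backup to every state:

$$v_{k+1}(s) = \max_a \Big(R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_k(s')\Big)$$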

1.7 Policy iteration

Value iteration iterates directly on the optimal cumulative return value, while policy iteration exploits the equivalence between the optimal cumulative return and the optimal policy and iterates on the policy itself.
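For reference, each round of policy iteration consists of a policy evaluation step followed by a greedy policy improvement step:

$$\text{evaluate } v_{\pi_k}, \qquad \pi_{k+1}(s) = \arg\max_a \Big(R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_{\pi_k}(s')\Big)$$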

1.8 Parameter estimation in MDP

Recall the Markov decision process defined earlier as a five-tuple. Normally, when we build a Markov decision model for a concrete problem, the five-tuple should be fully determined, and the optimal decision is then solved on that basis. Therefore, before solving for the optimal decision, we need to build a Markov model of the practical problem. The modeling process is the process of determining the five-tuple; here we only consider the state transition probabilities, whose determination is a parameter estimation problem (the other elements are usually easy to determine or are set by hand).

Suppose that, over time, we have observed a number of state transition trajectories of the form s0, a0, s1, a1, s2, a2, ... The maximum-likelihood estimate of a transition probability is then simply the empirical frequency: the number of times that taking action a in state s led to state s', divided by the number of times action a was taken in state s.
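A minimal sketch of this maximum-likelihood estimate from observed trajectories (illustrative names; each trajectory is assumed to be a list of (state, action, next_state) triples):

```python
from collections import defaultdict

def estimate_transition_probs(trajectories):
    # Count how often action a taken in state s led to next state s2.
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for trajectory in trajectories:
        for s, a, s2 in trajectory:
            counts[(s, a)][s2] += 1
            totals[(s, a)] += 1

    # P_hat(s2 | s, a) = count(s, a, s2) / count(s, a)
    P_hat = {}
    for (s, a), next_counts in counts.items():
        P_hat[(s, a)] = {s2: n / totals[(s, a)] for s2, n in next_counts.items()}
    return P_hat
```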

2. Markov Reward Process

2.1 MRP

Simply put, a Markov reward process is a Markov chain with rewards. To understand the MRP equations, we first need to understand where the reward function comes from. The reward can be expressed as the reward obtained after entering a given state.
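In the usual MRP notation, the reward function is the expected immediate reward received upon leaving state s:

$$R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$$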

2.2 Return

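The return G_t is the total discounted reward from time step t onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$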

2.3 Value Function

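The value function of an MRP is the expected return starting from state s:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$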

2.4 Bellman Equation

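The Bellman equation for an MRP decomposes the value function into the immediate reward plus the discounted value of the successor state:

$$v(s) = R_s + \gamma \sum_{s'} P_{ss'}\, v(s')$$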

https://zhuanlan.zhihu.com/p/271221558

Original article: blog.csdn.net/Anne033/article/details/109562802