RL - Reinforcement Learning: Markov Decision Process (MDP) to Markov Reward Process (MRP)

Welcome to my CSDN: https://spike.blog.csdn.net/
This article address: https://blog.csdn.net/caroline_wendy/article/details/131097165

There is a conversion relationship between the Markov decision process (MDP) and the Markov reward process (MRP). A Markov decision process is a mathematical model used to describe the randomness and uncertainty in a sequential decision process. An MDP consists of 5 elements: a state set (S), an action set (A), a state transition probability function (P), a reward function (R), and a discount factor (γ). To analyze the rewards obtained under a fixed policy, the MDP is converted into a Markov reward process (MRP): an MRP is an MDP with the action set and the policy removed, keeping only the states, transitions, rewards, and discount factor.

Below are the steps to convert an MDP, together with a policy π, into an MRP (a small code sketch after the list illustrates steps 2 and 3):

  1. State set (S): the state set remains unchanged. The action set (A) does not appear in the MRP; it is averaged out by the policy in the next two steps.
  2. State transition probability function (P): for each state s, average the MDP transition probabilities over the actions selected by the policy. That is, for each s and s', compute P(s'|s) = Σ_{a∈A} π(a|s) P(s'|s, a).
  3. Reward function (R): for each state s, average the action rewards over the policy. That is, for each s, compute r'(s) = Σ_{a∈A} π(a|s) r(s, a).
  4. Discount factor (γ): keep the discount factor γ of the original MDP unchanged.
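
As a minimal sketch of steps 2 and 3 (assuming the MDP is stored as NumPy arrays indexed by integer states and actions, which is not the dictionary representation used in the example code later in this article):

import numpy as np

def mdp_to_mrp(P, R, pi):
    """Average an MDP over a policy pi to obtain an MRP.

    P:  shape (S, A, S), P[s, a, t] = transition probability P(t|s, a)
    R:  shape (S, A),    R[s, a]    = reward r(s, a)
    pi: shape (S, A),    pi[s, a]   = policy probability pi(a|s)
    Returns (P_mrp, r_mrp) with shapes (S, S) and (S,).
    """
    P_mrp = np.einsum("sa,sat->st", pi, P)  # P(t|s) = sum_a pi(a|s) P(t|s, a)
    r_mrp = np.sum(pi * R, axis=1)          # r'(s)  = sum_a pi(a|s) r(s, a)
    return P_mrp, r_mrp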

After this conversion, the MDP has become an MRP. In the MRP we no longer consider actions and policies, but focus only on the state transition probabilities and the reward function, which lets us concentrate on modeling and analyzing the reward process.

Note that the conversion cannot be undone: an MRP contains no actions and no policy, so the original MDP cannot be recovered from it. The conversion therefore goes from MDP to MRP only, not the other way around.


A Markov decision process (MDP) extends a Markov reward process (MRP) by introducing actions; it is the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$.

That is, the action set $\mathcal{A}$ is introduced, the reward changes from $r(s)$ to $r(s, a)$, and the state transition probability changes from $\mathcal{P}(s'|s)$ to $\mathcal{P}(s'|s, a)$; both now depend on the action $a$. At the same time, the concept of a policy $\pi$ is introduced.

The policy $\pi$ comes in 2 types: the deterministic policy (Deterministic Policy) and the stochastic policy (Stochastic Policy). A deterministic policy yields a state transition chain, i.e., each state moves to one determined next state; this can be understood as that transition having probability 1 and all others probability 0. A stochastic policy yields a state transition distribution, i.e., from state $s_1$ the process moves to one of $[s_1, s_2, s_3, \ldots]$ with some probability, which is the more general case.
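
As a rough illustration (using hypothetical states and actions, unrelated to the example later in this article), the two policy types can be written as probability tables over actions:

# Deterministic policy: in each state, exactly one action has probability 1.
pi_deterministic = {
    "s1": {"go_s2": 1.0},
    "s2": {"go_s3": 1.0},
}

# Stochastic policy: in each state, a probability distribution over actions.
pi_stochastic = {
    "s1": {"stay_s1": 0.3, "go_s2": 0.7},
    "s2": {"go_s1": 0.5, "go_s3": 0.5},
}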

Markov decision process ($\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$) + policy $\pi$ = Markov reward process ($\langle \mathcal{S}, \mathcal{P}, r, \gamma \rangle$).

The state transition probability matrix of the MRP is obtained with the following conversion formula:
$P(s'|s) = \sum_{a \in A} \pi(a|s) \, P(s'|s, a)$
The state rewards of the MRP are obtained with the following conversion formula:
$r'(s) = \sum_{a \in A} \pi(a|s) \, r(s, a)$
In this way, the Bellman equation of the MRP can be used to compute the state value $V^{\pi}(s)$ under the policy $\pi$. In matrix form, the Bellman equation is as follows:

$\mathcal{V} = \mathcal{R} + \gamma \mathcal{P} \mathcal{V}$
$\mathcal{V} = (\mathcal{I} - \gamma \mathcal{P})^{-1} \mathcal{R}$
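
For reference, this matrix form simply stacks the per-state Bellman equation of the MRP, written here with the converted $P(s'|s)$ and $r'(s)$ from above:

$V^{\pi}(s) = r'(s) + \gamma \sum_{s' \in S} P(s'|s) \, V^{\pi}(s')$

Collecting all states into the vector $\mathcal{V}$, the reward vector $\mathcal{R}$, and the transition matrix $\mathcal{P}$ gives $\mathcal{V} = \mathcal{R} + \gamma \mathcal{P} \mathcal{V}$, which has the closed-form solution above whenever $\mathcal{I} - \gamma \mathcal{P}$ is invertible.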

With the state values $V^{\pi}(s')$ available, the action value $Q^{\pi}(s, a)$ can be computed, i.e., the value of executing action $a$ in state $s$. In a given state, we try to take the action with the highest value; similar to autonomous driving, when a certain situation is encountered, the optimal action is used to handle it, because its action value is the highest.

$Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, V^{\pi}(s')$

An example is as follows:

[Figure: the example MDP]

The source code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/6/7
"""

import numpy as np

def compute(P, rewards, gamma, states_num):
    """
    Solve the state values analytically with the Bellman equation
    """
    rewards = np.array(rewards).reshape((-1, 1))  # convert to a column vector
    value = np.dot(np.linalg.inv(np.eye(states_num, states_num) - gamma * P), rewards)
    return value


def main():
    S = ["s1", "s2", "s3", "s4", "s5"]  # 状态集合
    A = ["保持s1", "前往s1", "前往s2", "前往s3", "前往s4", "前往s5", "概率前往"]  # 动作集合
    # 状态转移函数
    P = {
    
    
        "s1-保持s1-s1": 1.0,
        "s1-前往s2-s2": 1.0,
        "s2-前往s1-s1": 1.0,
        "s2-前往s3-s3": 1.0,
        "s3-前往s4-s4": 1.0,
        "s3-前往s5-s5": 1.0,
        "s4-前往s5-s5": 1.0,
        "s4-概率前往-s2": 0.2,
        "s4-概率前往-s3": 0.4,
        "s4-概率前往-s4": 0.4,
    }
    # Reward function
    R = {
        "s1-保持s1": -1,
        "s1-前往s2": 0,
        "s2-前往s1": -1,
        "s2-前往s3": -2,
        "s3-前往s4": -2,
        "s3-前往s5": 0,
        "s4-前往s5": 10,
        "s4-概率前往": 1,
    }
    gamma = 0.5  # discount factor
    MDP = (S, A, P, R, gamma)

    # Policy 1: a random (stochastic) policy
    Pi_1 = {
        "s1-保持s1": 0.5,
        "s1-前往s2": 0.5,
        "s2-前往s1": 0.5,
        "s2-前往s3": 0.5,
        "s3-前往s4": 0.5,
        "s3-前往s5": 0.5,
        "s4-前往s5": 0.5,
        "s4-概率前往": 0.5,
    }
    # Policy 2
    Pi_2 = {
        "s1-保持s1": 0.6,
        "s1-前往s2": 0.4,
        "s2-前往s1": 0.3,
        "s2-前往s3": 0.7,
        "s3-前往s4": 0.5,
        "s3-前往s5": 0.5,
        "s4-前往s5": 0.1,
        "s4-概率前往": 0.9,
    }

    # Join two strings with "-" so the P and R dictionaries defined above can be indexed
    def join(str1, str2):
        return str1 + '-' + str2

    # Policy 1 (the random policy)
    # Row 1: "s1-s1": 0.5, "s1-s2": 0.5
    # Row 2: "s2-s1": 0.5, "s2-s3": 0.5
    # Row 3: "s3-s4": 0.5, "s3-s5": 0.5
    # Row 4: "s4-s5": 0.5, "s4-概率前往": 0.5 * ["s4-概率前往-s2": 0.2, "s4-概率前往-s3": 0.4, "s4-概率前往-s4": 0.4]
    # Row 5: terminal state
    P_from_mdp_to_mrp = [
        [0.5, 0.5, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.5, 0.5],
        [0.0, 0.1, 0.2, 0.2, 0.5],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ]
    P_from_mdp_to_mrp = np.array(P_from_mdp_to_mrp)

    # Rewards under policy 1:
    # 1st value: 0.5 * -1 + 0.5 * 0 = -0.5
    # 2nd value: 0.5 * -1 + 0.5 * -2 = -1.5
    # 3rd value: 0.5 * -2 + 0.5 * 0 = -1.0
    # 4th value: 0.5 * 10 + 0.5 * 1 = 5.5
    # 5th value: terminal state = 0
    R_from_mdp_to_mrp = [-0.5, -1.5, -1.0, 5.5, 0]
    V = compute(P_from_mdp_to_mrp, R_from_mdp_to_mrp, gamma, 5)
    print("MDP中每个状态价值分别为\n", V)


if __name__ == '__main__':
    main()

The output, i.e., the state values $V^{\pi}(s)$ of $s_1 \sim s_5$, is as follows:

 [[-1.22555411]
 [-1.67666232]
 [ 0.51890482]
 [ 6.0756193 ]
 [ 0.        ]]

The state-action value $Q^{\pi}(s_4, \text{概率前往})$ is:
$Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, V^{\pi}(s')$
$2.152 = 1 + 0.5 \times [0.2 \times (-1.68) + 0.4 \times 0.52 + 0.4 \times 6.08]$
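
As a quick sketch, this number can be checked against the (rounded) state values printed above:

# Verify Q^pi(s4, "概率前往") using the rounded state values printed above.
r_s4 = 1                                      # r(s4, 概率前往)
gamma = 0.5
p_next = {"s2": 0.2, "s3": 0.4, "s4": 0.4}    # P(s'|s4, 概率前往)
v = {"s2": -1.68, "s3": 0.52, "s4": 6.08}     # rounded V^pi(s')
q = r_s4 + gamma * sum(p * v[s] for s, p in p_next.items())
print(q)  # ≈ 2.152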
