[Reinforcement Learning] 03 - Markov Decision Process

Recommended related article: [Autonomous Driving Decision Planning] Introduction to POMDP

1. Markov Decision Process (MDP)

  • It provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision-maker.
  • MDP formally describes a reinforcement learning environment
    • The environment is fully observable
    • The current state can completely characterize the process (Markov property)
  • Almost all RL problems can be converted to MDP to solve
    • Optimal control mainly deals with continuous MDP
    • Partially observable problems can be converted into MDPs
    • A multi-armed bandit is a single-state MDP

1.1. Markov properties

“The future is independent of the past given the present”

The probability distribution over future states depends only on the current state and is independent of past states.

Definition:

  • A state $S_t$ is Markov if and only if $\mathbb{P}\left[S_{t+1} \mid S_t\right]=\mathbb{P}\left[S_{t+1}\mid S_1,\ldots,S_t\right]$

Properties:

  • The state captures all relevant information from the history
  • Once the state is known, the history can be discarded
  • In other words, the current state is a sufficient statistic of the future

1.2. State transition matrix

$\boldsymbol{P}_{ss^{\prime}}$ is the probability of transitioning from state $s$ to state $s^{\prime}$, also known as the one-step state transition probability; $\boldsymbol{P}$ is the one-step state transition matrix.

$$\begin{gathered} P[S_{t+1}\mid S_t]=P[S_{t+1}\mid S_1,\ldots,S_t] \\ \boldsymbol{P}_{ss^{\prime}}=P[S_{t+1}=s^{\prime}\mid S_{t}=s] \\ \boldsymbol{P}=\begin{bmatrix}P_{11}&P_{12}&\ldots&P_{1n}\\P_{21}&P_{22}&\ldots&P_{2n}\\\vdots& & &\vdots\\P_{n1}&P_{n2}&\ldots&P_{nn}\end{bmatrix} \end{gathered}$$

The matrix has the following properties:

  1. Non-negativity: $P_{ij}\geq 0$
  2. Each row sums to 1: $\sum_{j} P_{ij}=1,\ i=1,2,\ldots,n$

[Figure: a two-state Markov chain]
Taking the figure above as an example, $S_1$ transitions to itself and to $S_2$ with probabilities 0.1 and 0.9, respectively, and $S_2$ transitions to itself and to $S_1$ with probabilities 0.2 and 0.8, respectively. The state transition matrix is $\boldsymbol{P}=\begin{bmatrix}0.1&0.9\\0.8&0.2\end{bmatrix}$.
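
As a quick check, here is a minimal numpy sketch (the variable names are ours) that builds this matrix, verifies the two properties above, and computes multi-step transition probabilities with matrix powers:

import numpy as np

# Two-state transition matrix from the figure above (row order: S1, S2)
P = np.array([[0.1, 0.9],
              [0.8, 0.2]])

# Property 1: non-negative entries; Property 2: each row sums to 1
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)

# k-step transition probabilities are given by the k-th matrix power
P3 = np.linalg.matrix_power(P, 3)
print("P(S_{t+3}=S2 | S_t=S1) =", P3[0, 1])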

1.3. Markov process

A Markov process is a stochastic process with the Markov property, also called a Markov chain. A Markov chain is described by the tuple $\langle S, P\rangle$, and its state transition probabilities do not change over time.

1.3.1. A simple example

[Figure: the student Markov chain, with states Class1, Class2, Class3, Pass, Pub, Facebook (FB), and Sleep]

$Class1$ is the initial state, $Sleep$ is the terminal state, and the numbers on the arrows are the transition probabilities between states. The probabilities of transitioning out of each state sum to 1.

We can write the state transition matrix of this Markov process:

[Figure: the state transition matrix of this Markov chain; it matches the matrix `P` in the code below]

Given a Markov process, we can start from some state and generate a state sequence (an episode) according to its state transition matrix; this step is also called sampling. For example, we might get the following sampled episodes (a minimal sampling sketch follows the examples below):

  • C1 C2 C3 Pass Sleep
  • C1 FB FB C1 C2 Sleep
  • C1 C2 C3 Pub C2 C3 Pass Sleep
  • C1 FB FB C1 C2 C3 Pub C1 FB FB
  • FB C1 C2 C3 Pub C2 Sleep
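
Here is the minimal sampling sketch mentioned above (self-contained; the 7x7 matrix and the state order C1, C2, C3, Pass, Pub, FB, Sleep match the transition matrix `P` used in the code further below, and `SampleEpisode` is our own helper name):

import numpy as np

States = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

def SampleEpisode(P, rng, start="C1", terminal="Sleep", max_steps=50):
    """Sample one episode (state sequence) of the Markov chain."""
    episode = [start]
    s = States.index(start)
    for _ in range(max_steps):
        if States[s] == terminal:
            break
        # Draw the next state according to row s of the transition matrix
        s = rng.choice(len(States), p=P[s])
        episode.append(States[s])
    return episode

rng = np.random.default_rng(0)
print(SampleEpisode(P, rng))  # prints one sampled episode, e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']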

2. Markov reward process

Adding a reward function $\mathcal{R}$ and a discount factor $\gamma$ to a Markov process gives a Markov reward process (MRP). A Markov reward process is described by the tuple $\langle\mathcal{S},\mathcal{P},\mathcal{R},\gamma\rangle$, whose elements are as follows.

  • $\mathcal{S}$: state set, $\mathcal{S}=\{s_1, s_2, ..., s_n\}$
  • $\mathcal{P}$: state transition probability, the probability of transitioning from state $s$ to state $s^{\prime}$, $s, s^{\prime}\in\mathcal{S}$: $\mathcal{P}_{ss^{\prime}}=\mathbb{P}\left[S_{t+1}=s^{\prime}\mid S_t=s\right]$
  • $\mathcal{R}$: immediate reward function, the expected immediate reward obtained in state $s$: $\mathcal{R}_s=\mathbb{E}\left[R_{t+1}\mid S_t=s\right]$
  • $\gamma\in(0,1)$: discount factor, weighting current rewards against future rewards.
    • If $\gamma\rightarrow 0$, the agent trusts mainly the immediate reward, i.e. it is "myopic" (short-sighted);
    • If $\gamma\rightarrow 1$, the agent weights future rewards more heavily, i.e. it is "far-sighted".

2.1. Return

Definition:
In a Markov reward process, the return $G_t$ is the discounted sum of all rewards obtained from time step $t$ until the terminal state:

$$G_t=R_{t+1}+\gamma R_{t+2}+...=\sum_{k=0}^\infty\gamma^kR_{t+k+1}$$

Still using the example above: entering state $Class2$ yields reward $-2$, indicating that we do not want to enter it; entering $Pass$ yields the highest reward, $10$; after entering $Sleep$ the reward is zero and the sequence terminates. A calculation example with $\gamma=0.5$ for the sequence $Class1\rightarrow Class2\rightarrow Class3\rightarrow Pass\rightarrow Sleep$:
$$-2 + 0.5\times(-2)+(0.5)^2\times(-2)+(0.5)^3\times 10=-2.25$$
[Figure: the student MRP, with a reward attached to each state]
Expressed in code:

import numpy as np

# Define the transition Matrix
P = [
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
]
P = np.array(P)
RewardVector = [-2, -2, -2, 10, 1, -1, 0]

# Given a sequence, compute the return from a given start index to the end of the sequence (the terminal state)
def ComputeSequenceReward(Start_idx, Sequence, RewardVector, gamma=0.5):
    TotalReward = 0.0
    for i in reversed(range(Start_idx, len(Sequence))):
        TotalReward = gamma * TotalReward + RewardVector[Sequence[i] - 1]
    return TotalReward

def test01():
    chain = [1, 2, 3, 4, 7]
    start_index = 0
    print("根据本序列计算得到回报为:%s。"% ComputeSequenceReward(start_index, chain, RewardVector, gamma=0.5))
    
if __name__ == "__main__":
    test01()
The return computed from this sequence is: -2.25.

Examples with $\gamma=0,\ 0.9,\ 1$ (the numbers in the figures are the values of each node):
[Figures: state values of the student MRP for $\gamma=0$, $\gamma=0.9$, and $\gamma=1$]

2.2. Value function

In a Markov reward process, the expected return of a state (i.e. the expectation of the cumulative future reward starting from that state) is called the value of that state. The values of all states form the value function: its input is a state and its output is that state's value. We write the value function as $V(s)=\mathbb{E}\left[G_t\mid S_t=s\right]$, which expands to
$$\begin{aligned} V(s)& =\mathbb{E}\left[G_t\mid S_t=s\right] \\ &=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...\mid S_t=s\right] \\ &=\mathbb{E}\left[R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+...\right)\mid S_t=s\right] \\ &=\mathbb{E}\left[R_{t+1}+\gamma G_{t+1}\mid S_t=s\right] \\ &=\mathbb{E}\left[R_{t+1}+\gamma V(S_{t+1})\mid S_t=s\right] \end{aligned}$$

In the last line above, the expectation of the immediate reward is exactly the output of the reward function, i.e. $\mathbb{E}[R_{t+1}\mid S_t=s]=\mathcal{R}_s$; the remaining term $\mathbb{E}[\gamma V(S_{t+1})\mid S_t=s]$ can be computed from the transition probabilities out of state $s$, which gives
$$V(s)=\mathcal{R}_s+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}V(s^{\prime})$$

The above formula is the well-known Bellman equation for the Markov reward process, and it holds for every state. In matrix form:
$$\mathcal{V}=\mathcal{R}+\gamma\mathcal{P}\mathcal{V}$$
$$\begin{bmatrix}V(s_1)\\\vdots\\V(s_n)\end{bmatrix}=\begin{bmatrix}\mathcal{R}_1\\\vdots\\\mathcal{R}_n\end{bmatrix}+\gamma\begin{bmatrix}\mathcal{P}_{11}&\ldots&\mathcal{P}_{1n}\\\vdots& &\vdots\\\mathcal{P}_{n1}&\ldots&\mathcal{P}_{nn}\end{bmatrix}\begin{bmatrix}V(s_1)\\\vdots\\V(s_n)\end{bmatrix}$$
We can solve this directly with matrix operations, obtaining the analytical solution:
$$\begin{aligned}\mathcal{V}&=\mathcal{R}+\gamma\mathcal{P}\mathcal{V}\\(I-\gamma\mathcal{P})\mathcal{V}&=\mathcal{R}\\\mathcal{V}&=(I-\gamma\mathcal{P})^{-1}\mathcal{R}\end{aligned}$$

The computational complexity of this analytical solution is $O(n^3)$, where $n$ is the number of states, so the method is only suitable for small Markov reward processes. For larger Markov reward processes, the value function can be computed with dynamic programming, Monte-Carlo methods (see 3.6), or temporal-difference (TD) learning.

Next, we implement the analytical solution in code and use it to compute the value of every state in the Markov reward process.

# Exploit Bellman equation to compute value of all states
def ComputeValue(RewardVector, Statesize, TransitionMatrix=P, gamma=0.5):
    RewardVector = np.array(RewardVector).reshape(-1, 1)
    Value = np.dot(np.linalg.inv(np.eye(Statesize, Statesize) - gamma * TransitionMatrix),
                   RewardVector)
    return Value

print("MRP中每个状态价值分别为\n", ComputeValue(RewardVector, 7))
MRP中每个状态价值分别为
 [[-2.90815722]
 [-1.55006913]
 [ 1.12482718]
 [10.        ]
 [ 0.62413589]
 [-2.08255975]
 [ 0.        ]]

3. Markov decision process

3.1. MDP quintuple

MDP (Markov Decision Process) is a mathematical model describing the interaction between an agent and its environment. An MDP can be expressed as a five-tuple $(S, A, P, R, \gamma)$:

  • $S$: state set, $S=\{s_1, s_2, ..., s_n\}$, e.g. lane, environment, and world-model information.
  • $A$: action set, $A=\{a_1, a_2, ..., a_m\}$, e.g. the vehicle's decision space: lane changing, car following, overtaking, etc.
  • $P(s^{\prime}\mid s,a)$: state transition probability function, the probability of transitioning to state $s^{\prime}$ after taking action $a$ in state $s$, $s, s^{\prime}\in S$, $a\in A$: $\mathcal{P}_{ss^{\prime}}^{a}=\mathbb{P}\left[S_{t+1}=s^{\prime}\mid S_t=s,A_t=a\right]$
  • $R(s,a,s^{\prime})$: immediate reward function, the immediate reward obtained after taking action $a$ in state $s$ and transitioning to state $s^{\prime}$, $s, s^{\prime}\in S$, $a\in A$: $\mathcal{R}_{s}^{a}=\mathbb{E}\left[R_{t+1}\mid S_{t}=s,A_{t}=a\right]$
  • $\gamma\in(0,1)$: discount factor, weighting current rewards against future rewards.

Note: in the MDP definition above we no longer use a state transition matrix as in the MRP definition, but express the transitions directly as a state transition function.

  • One reason is that the transition now also depends on the action, so it becomes a three-dimensional array rather than a matrix (a two-dimensional array);
  • The other reason is that a state transition function is more general: if the state set is not finite, it cannot be represented by an array, but it can still be represented by a function.

Unlike a Markov reward process, a Markov decision process usually has an agent that performs the actions. The Markov decision process is an ongoing process in time, with continuous interaction between the agent and the environment. In general, their interaction is the loop shown below: the agent selects an action $A_t$ based on the current state $S_t$; given state $S_t$ and action $A_t$, the environment produces $S_{t+1}$ and $R_{t+1}$ according to the reward function and the state transition function and feeds them back to the agent. The agent's goal is to maximize the cumulative reward it obtains. The function by which the agent selects an action from the action set based on the current state is called a policy.

[Figure: the agent-environment interaction loop]

3.2. Policy

The agent's policy is usually denoted by $\pi$. A policy gives the probability of taking action $a$ in state $s$: $\pi(a\mid s)=\mathbb{P}\left[A_t=a\mid S_t=s\right]$

  • The policy depends only on the current state; it does not need to consider the history.
  • The policy is stationary (time-independent): $A_t\sim\pi(\cdot\mid S_t),\ \forall t>0$
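
As an illustration, a stochastic policy $\pi(a\mid s)$ can be stored as a table of probabilities and sampled from. This is a minimal sketch with illustrative probabilities (`SampleAction` is our own helper name); the full code listing later instead keys its policy dictionaries by "state-action" strings, which is an equivalent representation:

import numpy as np

# pi[state][action] = probability of taking `action` in `state` (illustrative values)
pi = {
    "C1":   {"Study": 0.5, "Facebook": 0.5},
    "C2":   {"Study": 0.5, "Sleep": 0.5},
    "Pass": {"Study": 0.5, "Pub": 0.5},
    "FB":   {"Facebook": 0.5, "Quit": 0.5},
}

def SampleAction(pi, s, rng):
    """Draw A_t ~ pi(. | S_t = s)."""
    actions = list(pi[s].keys())
    probs = list(pi[s].values())
    return rng.choice(actions, p=probs)

rng = np.random.default_rng(0)
print(SampleAction(pi, "C1", rng))  # 'Study' or 'Facebook', each with probability 0.5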

3.3. Value function

3.3.1. State value function

Based on the MDP, we define the state-value function of policy $\pi$, written $V^\pi(s)$, as the expected return obtained by starting from state $s$ and following policy $\pi$:
$$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s]$$

3.3.2. Action value function

Unlike the MRP, the MDP has actions, so we additionally define an action-value function. We write $Q^\pi(s,a)$ for the expected return of taking action $a$ in the current state $s$ and following policy $\pi$ thereafter:
$$Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a]$$

3.4. Bellman’s expectation equation

The word "expectation" distinguishes this equation from the Bellman optimality equation introduced later.

The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:
$$V_\pi(s)=\mathbb{E}_\pi\left[R_{t+1}+\gamma V_\pi(S_{t+1})\mid S_t=s\right]$$

The action-value function can be decomposed similarly:
$$Q_\pi(s,a)=\mathbb{E}_\pi\left[R_{t+1}+\gamma Q_\pi(S_{t+1},A_{t+1})\mid S_t=s,A_t=a\right]$$

Relationship between the state-value function and the action-value function: under policy $\pi$, the value of state $s$ equals the sum over all actions $a_i$ of the probability of taking that action under $\pi$ times the corresponding action value:
$$V_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)Q_\pi(s,a)$$

Under policy $\pi$, the value of taking action $a$ in state $s$ equals the immediate reward plus the discounted sum, over all possible next states, of the transition probability times the corresponding state value:
$$Q_\pi(s,a)=\mathcal{R}_s^a+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}^aV_\pi(s^{\prime})$$

Substituting one into the other, we obtain
$$V_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\left(\mathcal{R}_s^a+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}^aV_\pi(s^{\prime})\right)$$

as well as

$$Q_\pi(s,a)=\mathcal{R}_s^a+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}^a\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})Q_\pi(s^{\prime},a^{\prime})$$

These are the Bellman expectation equations for the two value functions, obtained by a simple derivation.

We again use an example to illustrate. In the figure, the policy $\pi$ takes each action $a_i$ with probability $\pi(a\mid s)=0.5$. To compute the state value of $Class3$, we need the values of its successor states, which gives the formula shown in the figure: $7.4=0.5\times(1+0.2\times(-1.3)+0.4\times 2.7+0.4\times 7.4)+0.5\times 10$

[Figure: state values of the student MDP under the uniform random policy]

We can also solve the MDP by reducing it to an MRP: marginalize out the policy's action selection to obtain an MRP without actions. Specifically, for a given state we weight the rewards of all actions by the policy's action probabilities; the resulting sum can be regarded as the reward of an MRP in that state:
$$\mathcal{R}^\pi_s=\sum_{a\in\mathcal{A}}\pi(a\mid s)\mathcal{R}_s^a$$
In the same way, we weight the probability of moving from $s$ to $s^{\prime}$ under each action by the probability of taking that action and sum over actions; the result is the MRP's transition probability from $s$ to $s^{\prime}$:
$$\mathcal{P}^\pi_{ss^{\prime}}=\sum_{a\in\mathcal{A}}\pi(a\mid s)\mathcal{P}_{ss^{\prime}}^{a}$$
We can then reuse the MRP solution from before:
$$V_\pi=\mathcal{R}^\pi+\gamma\mathcal{P}^\pi V_\pi, \qquad V_\pi=(I-\gamma\mathcal{P}^\pi)^{-1}\mathcal{R}^\pi$$
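
This marginalization can also be done in code. Below is a minimal sketch (`MarginalizeMDP` is our own helper name) that assumes the `S`, `A`, `P`, `R` dictionaries, the random policy `Pi_1`, and the "-"-joined key convention from `Set_MDPParameterAndPolicy` in the full listing at the end:

import numpy as np

def join(str1, str2):  # same helper as in the full listing
    return str1 + '-' + str2

def MarginalizeMDP(S, A, P, R, Policy):
    """Collapse an MDP plus a policy into an MRP: returns the vector R^pi and the matrix P^pi."""
    n = len(S)
    R_pi = np.zeros(n)
    P_pi = np.zeros((n, n))
    for i, s in enumerate(S):
        for a in A:
            pi_sa = Policy.get(join(s, a), 0.0)
            R_pi[i] += pi_sa * R.get(join(s, a), 0.0)
            for j, s_next in enumerate(S):
                P_pi[i, j] += pi_sa * P.get(join(join(s, a), s_next), 0.0)
    return R_pi, P_pi

# Usage sketch:
# MDP, Pi_1 = Set_MDPParameterAndPolicy()
# S, A, P, R, gamma = MDP
# R_pi, P_pi = MarginalizeMDP(S, A, P, R, Pi_1)
# This reproduces R_TransformMDP2MRP / P_TransformMDP2MRP from test02 below, except that the
# terminal state Sleep gets an all-zero row here instead of an absorbing one; since its reward
# is 0, the resulting state values are identical.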

def test02():
    # Define the transition matrix of the induced MRP
    # C1 C2  Pass FB Sleep
    P_TransformMDP2MRP = [
        [0.0, 0.5, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.5, 0.0, 0.5],
        [0.1, 0.2, 0.2, 0.0, 0.5],
        [0.5, 0.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0]
    ]
    P_TransformMDP2MRP = np.array(P_TransformMDP2MRP)
    R_TransformMDP2MRP = [-1.5, -1, 5.5, -0.5, 0.0]
    # gamma=1 matches the worked example above; note that (I - P) is then singular because
    # Sleep is absorbing, so ComputeValue in the full listing below adds a tiny regularization
    print("The value of each state in the MDP is\n", ComputeValue(R_TransformMDP2MRP, 5, P_TransformMDP2MRP, gamma=1))
The value of each state in the MDP is
 [[-1.30769231]
 [ 2.69230769]
 [ 7.38461538]
 [-2.30769231]
 [ 0.        ]]

3.5. Optimal policy

The goal of reinforcement learning is usually to find a policy that maximizes the expected return the agent obtains from the initial state. We first define a partial ordering over policies: we write $\pi\geq\pi^{\prime}$ if and only if $V^{\pi}(s)\geq V^{\pi^{\prime}}(s)$ for every state $s$. In an MDP with finite state and action sets, there is at least one policy that is no worse than every other policy; such a policy is an optimal policy. There may be more than one optimal policy, and we denote them all by $\pi^*(s)$.

All optimal policies share the same state-value function, which we call the optimal state-value function:
$$V^*(s)=\max_\pi V^\pi(s),\quad\forall s\in\mathcal{S}$$

In the same way, we define the optimal action-value function:
$$Q^*(s,a)=\max_{\pi}Q^\pi(s,a),\quad\forall s\in\mathcal{S},a\in\mathcal{A}$$

Example of optimal state values:
[Figure: optimal state values of the student MDP]
Example of optimal action values:
[Figure: optimal action values of the student MDP]
By maximizing over $Q^*(s,a)$, we obtain an optimal policy:
$$\pi_*(a\mid s)=\begin{cases}1&\text{if }a=\operatorname{argmax}_{a}\ q_*(s,a)\\0&\text{otherwise}\end{cases}$$

  • Every MDP has a deterministic optimal policy
  • Once we know $Q^*(s,a)$, we immediately have an optimal policy
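
A sketch of this extraction step in code (`GreedyPolicy` is our own helper and the `Q_star` numbers are illustrative placeholders, not the values from the figure):

# Hypothetical optimal action values Q*(s, a) stored as a nested dict
Q_star = {
    "C1":   {"Study": 6.0, "Facebook": 5.0},
    "C2":   {"Study": 8.0, "Sleep": 0.0},
    "Pass": {"Study": 10.0, "Pub": 9.4},
    "FB":   {"Facebook": 5.0, "Quit": 6.0},
}

def GreedyPolicy(Q):
    """pi*(a|s) puts probability 1 on argmax_a Q*(s, a)."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

print(GreedyPolicy(Q_star))  # {'C1': 'Study', 'C2': 'Study', 'Pass': 'Study', 'FB': 'Quit'}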

The red arcs in the figure below indicate the actions chosen by the optimal policy.

[Figure: the optimal policy on the student MDP, indicated by red arcs]

3.5.1. Bellman optimality equation

The equation is also derived recursively:

The optimal state value is obtained by choosing the action that maximizes the optimal action value:
$$V_*(s)=\max_aQ_*(s,a)$$

The optimal action value in terms of the optimal state value:
$$Q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}^aV_*(s^{\prime})$$
Substituting one into the other, we obtain
$$V_*(s)=\max_{a\in\mathcal{A}}\left\{\mathcal{R}_s^a+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}^aV_*(s^{\prime})\right\}$$
$$Q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathcal{P}_{ss^{\prime}}^a\max_{a^{\prime}\in\mathcal{A}}Q_*(s^{\prime},a^{\prime})$$
These are the Bellman optimality equations.

3.5.2. Solving for the optimal policy

  • The Bellman optimality equation is nonlinear
  • In general, it has no closed-form solution
  • Some iterative solution methods (a minimal value-iteration sketch follows this list):
    • Value Iteration
    • Policy Iteration
    • Q-learning
    • Sarsa
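
To give a flavor of these methods, here is a minimal value-iteration sketch for the tabular MDP in the full listing at the end (it assumes that listing's `S`, `A`, `P`, `R` dictionaries, its "state-action" key convention, and its `gamma`; `ValueIteration` is our own helper name). It repeatedly applies the Bellman optimality backup $V(s)\leftarrow\max_a\{\mathcal{R}_s^a+\gamma\sum_{s^{\prime}}\mathcal{P}_{ss^{\prime}}^aV(s^{\prime})\}$ until the values stop changing:

def ValueIteration(S, A, P, R, gamma, theta=1e-6, max_iters=1000):
    """Minimal tabular value iteration using the Bellman optimality backup."""
    V = {s: 0.0 for s in S}
    for _ in range(max_iters):
        delta = 0.0
        for s in S:
            # Actions available in s are those with an "s-a" entry in the reward dict
            q_values = []
            for a in A:
                key = s + '-' + a
                if key not in R:
                    continue
                q = R[key] + gamma * sum(P.get(key + '-' + s_next, 0.0) * V[s_next] for s_next in S)
                q_values.append(q)
            if not q_values:  # terminal state (e.g. Sleep) keeps value 0
                continue
            new_v = max(q_values)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    return V

# Usage sketch:
# MDP, _ = Set_MDPParameterAndPolicy()
# S, A, P, R, gamma = MDP
# print(ValueIteration(S, A, P, R, gamma))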

3.5.3. Extended MDP

  • Infinite and continuous MDPs
  • Partially observable MDPs (POMDPs)
  • Undiscounted, average-reward MDPs (ergodic Markov processes)

3.6 Monte-Carlo method

For larger Markov reward processes, the value function can be computed with dynamic programming, Monte-Carlo methods, or temporal-difference learning. This section introduces the Monte-Carlo method.

Monte-Carlo methods, also known as statistical simulation methods, are numerical methods based on probability and statistics. When using a Monte Carlo method, we usually draw repeated random samples and then use statistical techniques to extract from the samples a numerical estimate of the target quantity. A simple example is using the Monte Carlo method to compute the area of a circle: randomly generate a number of points inside the square shown below and count how many fall inside the circle; the ratio of the circle's area to the square's area equals the ratio of the number of points inside the circle to the total number of points in the square. The more points we generate, the closer the estimated area gets to the true area of the circle.
[Figure: estimating the area of a circle by uniform random sampling inside a square]
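
A minimal sketch of the circle-area example: sample points uniformly in the unit square and use the fraction that lands inside the inscribed circle to estimate the ratio of the two areas:

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
points = rng.random((N, 2))  # uniform points in the unit square [0, 1] x [0, 1]
# Count the points inside the circle of radius 0.5 centered at (0.5, 0.5)
inside = np.sum((points[:, 0] - 0.5) ** 2 + (points[:, 1] - 0.5) ** 2 <= 0.25)
circle_area = inside / N  # square area is 1, so this ratio estimates the circle's area
print(circle_area, "vs. true area", np.pi * 0.5 ** 2)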
We now describe how to use the Monte Carlo method to estimate the state-value function of a policy in a Markov decision process. Recall that the value of a state is its expected return, so an intuitive idea is to sample many sequences on the MDP using the policy, compute the return starting from the state of interest in each, and then average:
$$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s]\approx\frac{1}{N}\sum_{i=1}^NG_t^{(i)}$$

In a given sequence, a state may not appear at all, may appear exactly once, or may appear many times. The Monte Carlo value-estimation method introduced here computes a return for the state every time it occurs. An alternative is to compute the return only once per sequence, i.e. compute the cumulative reward at the state's first occurrence in the sequence and ignore later occurrences.

Suppose we now use policy $\pi$ to sample sequences starting from state $s$ and compute state values from them. We maintain a counter and a total return for each state. The procedure is as follows.

  1. Use policy $\pi$ to sample several sequences: $s_0^{(i)}\xrightarrow{a_0^{(i)}}r_0^{(i)},s_1^{(i)}\xrightarrow{a_1^{(i)}}r_1^{(i)},s_2^{(i)}\xrightarrow{a_2^{(i)}}\cdots\xrightarrow{a_{T-1}^{(i)}}r_{T-1}^{(i)},s_T^{(i)}$
  2. For the state $s$ at each time step $t$ of each sequence:
  • update the counter of state $s$: $N(s)=N(s)+1$;
  • update the total return of state $s$: $M(s)=M(s)+G_t$;
  • or use an incremental update: $V(s)\leftarrow V(s)+\frac1{N(s)}(G_t-V(s))$
  3. The value of each state is estimated as the average return $V(s)=M(s)/N(s)$. By the law of large numbers, as $N(s)\rightarrow\infty$, $V(s)\to V_{\pi}(s)$.

Some sampled sequences produced by the code (full listing at the end):

[('C2', 'Study', -2, 'Pass'), ('Pass', 'Pub', 1, 'Pass'), ('Pass', 'Pub', 1, 'Pass'), ('Pass', 'Pub', 1, 'Pass'), ('Pass', 'Study', 10, 'Sleep')]
[('C1', 'Study', -2, 'C2'), ('C2', 'Sleep', 0, 'Sleep')]
[('FB', 'Quit', 0, 'C1'), ('C1', 'Facebook', -1, 'FB'), ('FB', 'Quit', 0, 'C1'), ('C1', 'Facebook', -1, 'FB'), ('FB', 'Quit', 0, 'C1'), ('C1', 'Study', -2, 'C2'), ('C2', 'Study', -2, 'Pass'), ('Pass', 'Pub', 1, 'Pass'), ('Pass', 'Pub', 1, 'Pass')]
[('C1', 'Study', -2, 'C2'), ('C2', 'Sleep', 0, 'Sleep')]
[('C2', 'Study', -2, 'Pass'), ('Pass', 'Pub', 1, 'C2'), ('C2', 'Sleep', 0, 'Sleep')]

Comparing the Monte Carlo estimates with the analytical solution (via the MDP-to-MRP reduction) below, the results are fairly close:

The state values of the MDP computed with the Monte Carlo method are
 {'C1': -1.6584167352261565, 'C2': 0.5744913689985154, 'Pass': 6.330419227770518, 'FB': -1.1820907116805823, 'Sleep': 0}
The value of each state in the MDP is
 [[-1.67666232]
 [ 0.51890482]
 [ 6.0756193 ]
 [-1.22555411]
 [ 0.        ]]

3.7. Occupancy measure

Different policies have different value functions. This is because, for the same MDP, different policies visit states with different probability distributions; this fact is what affects a policy's value function.

First we define the initial state distribution of the MDP as $\nu_0(s)$; in some materials the initial state distribution is included among the elements of the MDP. Let $P_t^\pi(s)$ denote the probability that, following policy $\pi$, the agent is in state $s$ at time $t$, so that $P_0^\pi(s)=\nu_0(s)$. We can then define the state visitation distribution:

$$\nu^\pi(s)=(1-\gamma)\sum_{t=0}^\infty\gamma^tP_t^\pi(s)$$
Here $1-\gamma$ is a normalization factor that makes the probabilities sum to 1. The state visitation distribution describes which states a policy visits when interacting with the MDP. Note that in theory the sum runs over infinitely many steps, while in practice the interaction within a sequence is finite; nevertheless, the formula expresses the idea of state visitation probability. It has the following property:
$$\nu^\pi(s^{\prime})=(1-\gamma)\nu_0(s^{\prime})+\gamma\int P(s^{\prime}|s,a)\pi(a|s)\nu^\pi(s)\,ds\,da$$
In addition, we define the occupancy measure of a policy:
$$\rho^\pi(s,a)=(1-\gamma)\sum_{t=0}^\infty\gamma^tP_t^\pi(s)\pi(a|s)$$
It represents the probability that the state-action pair $(s,a)$ is visited. The two quantities are related by
$$\rho^\pi(s,a)=\nu^\pi(s)\pi(a|s)$$
This leads to the following two theorems.
Theorem 1: The occupancy measures $\rho^{\pi_1}$ and $\rho^{\pi_2}$ obtained when policies $\pi_1$ and $\pi_2$ interact with the same MDP satisfy $\rho^{\pi_1}=\rho^{\pi_2}\iff\pi_1=\pi_2$
Theorem 2: Given a legal occupancy measure $\rho$, the unique policy that generates it is $\pi_\rho=\frac{\rho(s,a)}{\sum_{a^{\prime}}\rho(s,a^{\prime})}$
Note: a "legal" occupancy measure is one for which there exists a policy whose interaction with the MDP visits state-action pairs with exactly those probabilities.

# Occupancy
def test04():
    # Policy 2
    Policy_2 = {
        "C1-Study": 0.6,
        "C1-Facebook": 0.4,
        "FB-Facebook": 0.3,
        "FB-Quit": 0.7,
        "C2-Study": 0.5,
        "C2-Sleep": 0.5,
        "Pass-Study": 0.1,
        "Pass-Pub": 0.9,
    }
    MAXTimeStep = 8
    MDP, Policy_1 = Set_MDPParameterAndPolicy()
    Sequences1 = MonteCarloSampling(MDP, Policy_1, MAXTimeStep, SamplingNum=1000)
    Sequences2 = MonteCarloSampling(MDP, Policy_2, MAXTimeStep, SamplingNum=1000)
    rho1 = ComputeOccupancy("Pass", "Pub", Sequences1, MAXTimeStep, MDP)
    rho2 = ComputeOccupancy("Pass", "Pub", Sequences2, MAXTimeStep, MDP)
    print(rho1, rho2)

The estimated occupancy measures of the two policies differ:

0.058 0.1145

Complete code:

import numpy as np

# Given a sequence, compute the return from a given start index to the end of the sequence (the terminal state)
def ComputeSequenceReward(Start_idx, Sequence, RewardVector, gamma=0.5):
    TotalReward = 0.0
    for i in reversed(range(Start_idx, len(Sequence))):
        TotalReward = gamma * TotalReward + RewardVector[Sequence[i] - 1]
    return TotalReward

# Exploit Bellman equation to compute value of all states
def ComputeValue(RewardVector, Statesize, TransitionMatrix, gamma=0.5):
    RewardVector = np.array(RewardVector).reshape(-1, 1)
    try:
        Value = np.dot(np.linalg.inv(np.eye(Statesize, Statesize) - gamma * TransitionMatrix),
                   RewardVector)
    except np.linalg.LinAlgError:
        print("------------- (I - gamma * P) is singular; adding a tiny regularization before inverting -------------")
        # Perturb the terminal (absorbing) state's diagonal entry; its reward is 0, so its value stays 0
        TransitionMatrix[Statesize - 1][Statesize - 1] += 1e-7
        I = np.eye(Statesize, Statesize)
        Value = np.dot(np.linalg.inv(I - gamma * TransitionMatrix),
                   RewardVector)
    return Value

def Set_MDPParameterAndPolicy():
    # State set
    S = ["C1", "C2", "Pass", "FB", "Sleep"]
    # Action set
    A = ["Facebook", "Study", "Sleep", "Pub", "Quit"]
    # State transition function
    P = {
        "C1-Study-C2": 1.0,
        "C1-Facebook-FB": 1.0,
        "FB-Facebook-FB": 1.0,
        "FB-Quit-C1": 1.0,
        "C2-Study-Pass": 1.0,
        "C2-Sleep-Sleep": 1.0,
        "Pass-Study-Sleep": 1.0,
        "Pass-Pub-C1": 0.2,
        "Pass-Pub-C2": 0.4,
        "Pass-Pub-Pass": 0.4,
    }
    # Reward function
    R = {
        "C1-Study": -2,
        "C1-Facebook": -1,
        "FB-Facebook": -1,
        "FB-Quit": 0,
        "C2-Study": -2,
        "C2-Sleep": 0,
        "Pass-Study": 10,
        "Pass-Pub": 1,
    }
    # Discount factor
    gamma = 0.5
    MDP = (S, A, P, R, gamma)

    # Policy 1: uniform random policy
    Pi_1 = {
        "C1-Study": 0.5,
        "C1-Facebook": 0.5,
        "FB-Facebook": 0.5,
        "FB-Quit": 0.5,
        "C2-Study": 0.5,
        "C2-Sleep": 0.5,
        "Pass-Study": 0.5,
        "Pass-Pub": 0.5,
    }
    # Policy 2 (defined for comparison; not returned)
    Pi_2 = {
        "C1-Study": 0.7,
        "C1-Facebook": 0.3,
        "FB-Facebook": 0.3,
        "FB-Quit": 0.7,
        "C2-Study": 0.5,
        "C2-Sleep": 0.5,
        "Pass-Study": 0.2,
        "Pass-Pub": 0.8,
    }
    return MDP, Pi_1

# Join two input strings with '-' so that the P and R dictionaries above can be indexed
def join(str1, str2):
    return str1 + '-' + str2

def MonteCarloSampling(MDP, Policy, MAXTimeStep, SamplingNum):
    ''' Sampling function: policy Policy, maximum episode length MAXTimeStep, number of sampled sequences SamplingNum '''
    S, A, P, R, gamma = MDP
    StateNum = len(S)
    Sequences = []
    for _ in range(SamplingNum):
        Sequence = []
        TimeStep = 0
        # Randomly choose a starting state other than Sleep
        s = S[np.random.randint(StateNum - 1)]
        # One sample ends when the current state is terminal or the episode gets too long
        while s != "Sleep" and TimeStep <= MAXTimeStep:
            TimeStep += 1
            rand, temp = np.random.rand(), 0
            # Choose an action in state s according to the policy
            for a_ in A:
                temp += Policy.get(join(s, a_), 0.0)
                if temp >= rand:
                    a = a_
                    r = R.get(join(s, a_), 0.0)
                    break
            rand, temp = np.random.rand(), 0
            # Sample the next state s_next according to the transition probabilities
            for s_ in S:
                temp += P.get(join(join(s, a), s_), 0.0)
                if temp >= rand:
                    s_next = s_
                    break
            # Append the (s, a, r, s_next) tuple to the sequence
            Sequence.append((s, a, r, s_next))
            # s_next becomes the current state for the next loop iteration
            s = s_next
        Sequences.append(Sequence)
    return Sequences

# Compute the value of every state from all sampled sequences
def MonteCarloComputeValue(Sequences, MDP):
    gamma = MDP[4]
    V = {"C1": 0, "C2": 0, "Pass": 0, "FB": 0, "Sleep": 0}
    N = {"C1": 0, "C2": 0, "Pass": 0, "FB": 0, "Sleep": 0}
    for Sequence in Sequences:
        G = 0
        # Walk through each sequence backwards
        for i in reversed(range(len(Sequence))):
            s, r = Sequence[i][0], Sequence[i][2]
            G = r + gamma * G
            N[s] = N[s] + 1
            V[s] = V[s] + (G - V[s]) / N[s]
    return V

def ComputeOccupancy(s, a, Sequences, MAXTimeStep, MDP):
    ''' Estimate the policy's occupancy measure from the visit frequency of the state-action pair (s, a) '''
    gamma = MDP[4]
    rho = 0
    total_times = np.zeros(MAXTimeStep)  # how many times each time step t was experienced
    occur_times = np.zeros(MAXTimeStep)  # how many times (s_t, a_t) == (s, a) at time step t
    for Sequence in Sequences:
        for i in range(len(Sequence)):
            try:
                s_, a_ = Sequence[i][0], Sequence[i][1]
                total_times[i] += 1
                if s_ == s and a_ == a:
                    occur_times[i] += 1
            except IndexError:
                # sampled sequences can be one step longer than MAXTimeStep; ignore the extra step
                continue
    for i in reversed(range(MAXTimeStep)):
        if total_times[i]:
            # accumulate the discounted visit frequency over all time steps
            rho += gamma ** i * occur_times[i] / total_times[i]
    return (1 - gamma) * rho

def SampleTEXT():
    MDP, Policy = Set_MDPParameterAndPolicy()
    Sequences = MonteCarloSampling(MDP, Policy, MAXTimeStep=8, SamplingNum=5)
    for Sequence in Sequences:
        print(Sequence)

def MonteCarloTEXT():
    MDP, Policy = Set_MDPParameterAndPolicy()
    Sequences = MonteCarloSampling(MDP, Policy, MAXTimeStep=8, SamplingNum=5000)
    V = MonteCarloComputeValue(Sequences, MDP)
    print("使用蒙特卡洛方法计算MDP的状态价值为\n", V)

def test01():
    # Define the transition Matrix
    # C1 C2 C3 Pass Pub FB Sleep
    P = [
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
    ]
    P = np.array(P)
    RewardVector = [-2, -2, -2, 10, 1, -1, 0]
    chain = [1, 6, 6, 1, 2, 7]
    start_index = 0
    print("根据本序列计算得到回报为:%s。"% ComputeSequenceReward(start_index, chain, RewardVector, gamma=0.5))
    print("MRP中每个状态价值分别为\n", ComputeValue(RewardVector, 7, P))

# MDP2MRP
def test02():
    # Define the transition Matrix
    # C1 C2  Pass FB Sleep
    P_TransformMDP2MRP = [
        [0.0, 0.5, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.5, 0.0, 0.5],
        [0.1, 0.2, 0.2, 0.0, 0.5],
        [0.5, 0.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0]
    ]
    P_TransformMDP2MRP = np.array(P_TransformMDP2MRP)
    R_TransformMDP2MRP = [-1.5, -1, 5.5, -0.5, 0.0]
    print("MDP中每个状态价值分别为\n", ComputeValue(R_TransformMDP2MRP, 5, P_TransformMDP2MRP, gamma=0.5))

# MonteCarlo
def test03():
    # SampleTEXT()
    MonteCarloTEXT()
    test02()

# Occupancy
def test04():
    # Policy 2
    Policy_2 = {
        "C1-Study": 0.6,
        "C1-Facebook": 0.4,
        "FB-Facebook": 0.3,
        "FB-Quit": 0.7,
        "C2-Study": 0.5,
        "C2-Sleep": 0.5,
        "Pass-Study": 0.1,
        "Pass-Pub": 0.9,
    }
    MAXTimeStep = 8
    MDP, Policy_1 = Set_MDPParameterAndPolicy()
    Sequences1 = MonteCarloSampling(MDP, Policy_1, MAXTimeStep, SamplingNum=1000)
    Sequences2 = MonteCarloSampling(MDP, Policy_2, MAXTimeStep, SamplingNum=1000)
    rho1 = ComputeOccupancy("Pass", "Pub", Sequences1, MAXTimeStep, MDP)
    rho2 = ComputeOccupancy("Pass", "Pub", Sequences2, MAXTimeStep, MDP)
    print(rho1, rho2)

if __name__ == "__main__":
    test04()

