Markov Process (MP) -> Markov Reward Process (MRP) -> Markov Decision Process (MDP)

1. The Markov property - only the current state matters

The Markov property means that the current state contains all the information needed to predict the future; once the current state is known, past information is no longer needed. Strictly speaking, a state is Markovian if it summarizes all relevant history, so that the current state alone determines the future. The Markov property is described by the following formula:
P(S_{t+1} | S_t) = P(S_{t+1} | S_1, S_2, ..., S_t)

According to the formula, the future state S_{t+1} does not depend on the past states, only on the current state S_t.
We describe this with the state transition probability P_{ss'} = P(S_{t+1} = s' | S_t = s).

2. Markov Process

A Markov process, also called a Markov chain, is a memoryless random process. It can be represented by a tuple <S, P>, where S is a finite set of states and P is the state transition probability matrix.
The state transition probability matrix describes all possible transitions, and each row of the matrix sums to 1.
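As a sketch of what this tuple looks like in code, here is a tiny Python example (the state names and probabilities are invented for illustration):

```python
import numpy as np

# A minimal sketch of a 3-state Markov chain <S, P>; the states and
# probabilities below are made up for illustration.
states = ["sunny", "cloudy", "rainy"]

# Transition probability matrix: P[i, j] = P(S_{t+1} = j | S_t = i).
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])

# Each row is a probability distribution over next states, so every row sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Sample a trajectory: the next state depends only on the current state (memorylessness).
rng = np.random.default_rng(0)
s = 0  # start in "sunny"
trajectory = [states[s]]
for _ in range(10):
    s = rng.choice(len(states), p=P[s])
    trajectory.append(states[s])
print(" -> ".join(trajectory))
```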

3. Markov Reward Process

The Markov reward process extends the Markov process by adding a reward function R and a discount factor γ.

  • The reward R_s of state s is the expected reward received at the next time step (t+1) given that the state at time t is s: R_s = E(R_{t+1} | S_t = s)

Definition: the return G_t is the discounted sum of all rewards on a Markov reward chain starting from time t:
G_t = R_{t+1} + γR_{t+2} + ⋯ + γ^{n-1}R_{t+n}
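A quick numeric sketch of this formula, with an arbitrary reward sequence and discount factor:

```python
# A minimal sketch of computing the return G_t = R_{t+1} + γR_{t+2} + ...;
# the rewards and γ below are arbitrary illustration values.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# G_t = sum_k gamma^k * R_{t+1+k}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```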

  • The discount factor γ reflects the uncertainty of long-term rewards: it is the value of a future reward expressed at the current moment. The closer γ is to 1, the more weight long-term rewards receive. With the return defined, we can now measure the value of a state, defined as follows:

Definition: the value function of a state s in a Markov reward process is the expected return of the Markov chain starting from that state:
V(s) = E(G_t | S_t = s)

The Bellman equation
V(s) = E(G_t | S_t = s)
     = E(R_{t+1} + γR_{t+2} + ⋯ + γ^{n-1}R_{t+n} | S_t = s)
     = E(R_{t+1} + γV(S_{t+1}) | S_t = s)
     = E(R_{t+1} | S_t = s) + γE(V(S_{t+1}) | S_t = s)
The first term is the expectation of the immediate reward, and the second is the expected value of the state at the next time step, which can be computed from the transition probability matrix. This gives the Bellman equation:
V(s) = R_s + γ Σ_{s'∈S} P_{ss'} V(s')
where S is the set of possible next states and s' is a particular possible state at the next time step.
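Because this equation is linear in V, a small MRP can be solved directly in matrix form as V = (I - γP)^{-1} R. A minimal sketch with invented P and R values:

```python
import numpy as np

# The MRP Bellman equation V = R + γPV is linear, so for a small state space
# it can be solved exactly; P and R below are made-up values for a 3-state MRP.
gamma = 0.9
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])
R = np.array([1.0, 0.0, -1.0])  # R_s = E(R_{t+1} | S_t = s)

# Solve (I - γP) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)

# Sanity check against the Bellman equation: V(s) = R_s + γ Σ_{s'} P_{ss'} V(s')
assert np.allclose(V, R + gamma * P @ V)
```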

4. Markov Decision Process

The Markov decision process adds decision making to the Markov reward process: an action set A is introduced, giving the tuple <S, A, P, R, γ>.
Here both P and R are conditioned on a specific action a, instead of depending only on the state as in the Markov reward process; A is a finite set of actions.

Expressed as formulas:
P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a)
R^a_s = E(R_{t+1} | S_t = s, A_t = a)

4.1 Policy

We use π to denote a policy. For each state s it gives the probability of taking each possible action a, written π(a|s) = P(A_t = a | S_t = s).

4.2 [MDP and policy] State transition probability and reward function

Given an MDP and a policy π(a|s), the state sequence it generates is a Markov process <S, P^π>; similarly, the state and reward sequence is a Markov reward process, and this reward process satisfies the following two equations:
state transition probability: P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}
reward function: R^π_s = Σ_a π(a|s) R^a_s
The state transition probability can be read as follows: under the policy, the probability of moving from s to s' equals the sum, over all actions, of the probability of taking each action in state s multiplied by the probability that this action moves the state from s to s'.
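A minimal sketch of this collapse from an MDP plus a policy into an MRP; the transition tensor, rewards, and policy probabilities below are invented for illustration:

```python
import numpy as np

# A fixed policy π turns an MDP <S, A, P, R, γ> into an MRP <S, P^π, R^π, γ>.
n_states, n_actions = 3, 2

# P[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],   # action 1
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]],
])
# R[a, s] = E(R_{t+1} | S_t = s, A_t = a)
R = np.array([
    [1.0, 0.0, -1.0],
    [0.5, 0.5,  0.0],
])
# pi[s, a] = π(a | s); each row is a distribution over actions.
pi = np.array([
    [0.7, 0.3],
    [0.5, 0.5],
    [0.2, 0.8],
])
assert P.shape == (n_actions, n_states, n_states)

# P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}   and   R^π_s = Σ_a π(a|s) R^a_s
P_pi = np.einsum("sa,asn->sn", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # still a valid transition matrix
print(P_pi)
print(R_pi)
```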

4.3 Value functions under a policy [state value / action value]

Definition: v_π(s) is the state-value function of an MDP under policy π. It is the expected return obtained by starting from state s and following policy π:
V_π(s) = E_π(G_t | S_t = s)

Definition: q_π(s,a) is the action-value function under policy π. It is the expected return obtained by taking action a in state s and then following policy π:
q_π(s,a) = E_π(G_t | S_t = s, A_t = a)

A similar derivation with the Bellman equation gives:
V_π(s) = E_π(G_t | S_t = s)
       = E_π(R_{t+1} + γV_π(S_{t+1}) | S_t = s)
q_π(s,a) = E_π(G_t | S_t = s, A_t = a)
         = E_π(R_{t+1} + γV_π(S_{t+1}) | S_t = s, A_t = a)
From these we obtain the relationship between the state-value function and the action-value function:
V_π(s) = Σ_a π(a|s) q_π(s,a)

q_π(s,a) = E_π(R_{t+1} + γV_π(S_{t+1}) | S_t = s, A_t = a)
         = E_π(R_{t+1} | S_t = s, A_t = a) + γE_π(V_π(S_{t+1}) | S_t = s, A_t = a)
Since R^a_s = E(R_{t+1} | S_t = s, A_t = a) and P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a), it follows that
q_π(s,a) = R^a_s + γ Σ_{s'} P^a_{ss'} V_π(s')
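A small sketch of iterative policy evaluation on an invented 2-state, 2-action MDP: it computes V_π by repeatedly applying the Bellman expectation equation, recovers q_π from it, and checks the two identities above:

```python
import numpy as np

# Invented MDP: P[a, s, s'] = P(S_{t+1}=s' | S_t=s, A_t=a), R[a, s] = E(R_{t+1} | S_t=s, A_t=a)
gamma = 0.9
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.4, 0.6], [0.7, 0.3]],   # action 1
])
R = np.array([
    [1.0, -1.0],
    [0.0,  2.0],
])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])  # pi[s, a] = π(a|s)

# Collapse the MDP under π into an MRP, then iterate V <- R^π + γ P^π V.
P_pi = np.einsum("sa,asn->sn", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)
V = np.zeros(2)
for _ in range(1000):
    V = R_pi + gamma * P_pi @ V

# Action-value function: q_π(s,a) = R^a_s + γ Σ_{s'} P^a_{ss'} V_π(s')
Q = R.T + gamma * np.einsum("asn,n->sa", P, V)

# Check the relationship V_π(s) = Σ_a π(a|s) q_π(s,a)
assert np.allclose(V, (pi * Q).sum(axis=1))
print(V)
print(Q)
```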

5. Optimal policy

The optimal state-value function is obtained by taking, for each state s, the largest value over the state-value functions generated by all policies:
V*(s) = max_π V_π(s)
Similarly, the optimal action-value function is obtained by taking, for each state-action pair <s, a>, the largest value over the action-value functions generated by all policies:
q*(s, a) = max_π q_π(s, a)
If for every state s the value of following policy π is not less than the value of following policy π', then policy π is better than policy π'.

Theorem: For any MDP, the following points hold:
1. There exists an optimal policy that is better than or equal to every other policy;
2. All optimal policies achieve the same optimal value function;
3. All optimal policies achieve the same optimal action-value function.
According to this theorem, an optimal policy can be found by maximizing the optimal action-value function, i.e. acting greedily with respect to q*(s, a).

The Bellman optimality equation is nonlinear and has no general closed-form solution; it is solved with iterative methods such as value iteration, policy iteration, Q-learning, and Sarsa, as in the sketch below.
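A minimal sketch of value iteration, one of the iterative methods listed above, again on an invented 2-state, 2-action MDP; it repeatedly applies V(s) ← max_a [ R^a_s + γ Σ_{s'} P^a_{ss'} V(s') ] and then reads off a greedy policy:

```python
import numpy as np

# Invented MDP, same conventions as before:
# P[a, s, s'] = P(S_{t+1}=s' | S_t=s, A_t=a), R[a, s] = E(R_{t+1} | S_t=s, A_t=a)
gamma = 0.9
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.4, 0.6], [0.7, 0.3]],   # action 1
])
R = np.array([
    [1.0, -1.0],
    [0.0,  2.0],
])

V = np.zeros(2)
for _ in range(1000):
    # Q[s, a] = R^a_s + γ Σ_{s'} P^a_{ss'} V(s')
    Q = R.T + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.max(axis=1)           # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)    # optimal deterministic action in each state
print("V* =", V)
print("pi*(s) =", greedy_policy)
```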

