1. Reinforcement learning---Markov decision process

Markov process

The known history of the process is:
$$h_t = \{s_1, s_2, s_3, \dots, s_t\}$$
A state has the Markov property if:
$$\begin{aligned} p(s_{t+1}\mid s_{t}) &= p(s_{t+1}\mid h_t)\\ p(s_{t+1}\mid s_t,a_t) &= p(s_{t+1}\mid h_t,a_t) \end{aligned}$$
The state transition matrix is:
[Figure: state transition matrix]
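For a finite state space $\{s_1,\dots,s_N\}$, the transition matrix simply collects the probabilities $P(s_j \mid s_i)$ row by row; written out in the standard form (for reference):

$$P = \begin{pmatrix} P(s_1\mid s_1) & P(s_2\mid s_1) & \cdots & P(s_N\mid s_1) \\ P(s_1\mid s_2) & P(s_2\mid s_2) & \cdots & P(s_N\mid s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1\mid s_N) & P(s_2\mid s_N) & \cdots & P(s_N\mid s_N) \end{pmatrix}$$

Each row is a probability distribution over next states, so every row sums to 1.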

Markov Reward Process (MRP)

MRP is Markov chain + reward.
MRP is defined as:

  1. $S$ is a finite set of states;
  2. $P$ is a dynamics (transition) model $P(s_{t+1} = s' \mid s_t = s)$;
  3. $R$ is a reward function $R(s_t = s) = \mathbb{E}[r_t \mid s_t = s]$;
  4. $\gamma \in [0,1]$ is the discount factor.

If the state space is finite, $R$ can be represented as a vector.

MRP example:
[Figure: MRP example]
On understanding $r_t$ and $R$: $R$ is defined by taking an expectation of $r_t$, so it is a function of the state only and has nothing to do with time. In other words, $r$ is a random process, $r_t$ is a random variable, and what we usually call the reward refers to $R$.

value function

The definition of the return: the discounted cumulative reward from time $t$ to the end of one episode, denoted by $G_t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_{T}$$
Note: $R_t$ is a random variable, $R$ is not!
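As a small worked illustration (my own example, with made-up rewards and $\gamma$, not taken from the post), a finite-horizon return can be computed by folding the reward sequence from the back:

```python
# Minimal sketch: compute G_t = R_{t+1} + gamma*R_{t+2} + ... for one finite episode.
# The reward sequence and gamma below are illustrative values only.
rewards = [1.0, 0.0, 2.0, 5.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

G = 0.0
for r in reversed(rewards):      # work backwards: G = r + gamma * G_next
    G = r + gamma * G
print(G)                         # 1 + 0.9*0 + 0.81*2 + 0.729*5 ≈ 6.265
```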
The definition of the value function for the MRP process is:
$$\begin{aligned} V_t(s) &= \mathbb{E}[G_t\mid s_t = s]\\ &= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_{T}\mid s_t = s] \end{aligned}$$
From this expression, $G_t$ is a random variable, while $V_t(s)$ is a binary function of $t$ and $s$: it is the expectation of $G_t$ given the current time $t$ and the current state $s$, hence a scalar, and it changes with time and state. The size of the value function reflects how much reward can be expected starting from the current state at the current time. (The expectation integrates over the distribution of the random variable $G_t$.)

MRP example:
[Figure: MRP example with sampled returns]
As can be seen from the above example, since $G_t$ is a random variable that depends on $t$, the returns obtained from the same state are different at different times and in different sampled episodes, and they vary greatly with the length of the episode and with the size of $\gamma$.

Bellman equation of MRP:

The following recursive formula can be obtained through the definition of the value function:
$$V(s) = R(s) + \gamma \sum_{s'\in S} P(s'\mid s)\,V(s')$$
Proof: first a lemma: the expectation of a sum equals the sum of the expectations.
$$\begin{aligned} \mathbb{E}[X+Y\mid S] &= \iint(x+y)f(x,y\mid s)\,dx\,dy \\ &=\iint x f(x,y\mid s)\,dx\,dy + \iint y f(x,y\mid s)\,dx\,dy \\ &=\int x f_X(x\mid s)\,dx +\int y f_Y(y\mid s)\,dy\\ & = \mathbb{E}[X\mid S] + \mathbb{E}[Y\mid S] \end{aligned}$$
Therefore
$$\begin{aligned} V(s) &= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_{T}\mid s_t = s] \\ &= \mathbb{E}[R_{t+1}\mid s_t = s] +\gamma\,\mathbb{E}[G_{t+1}\mid s_t = s] \end{aligned}$$
From the above, $G_t$ is formed by accumulating the rewards $R_t$. From the definition ($G_t = R_{t+1} + \gamma R_{t+2} + \dots$), $G_{t+1}$ only involves what happens from time $t+1$ onward, so once $s_{t+1}$ is known, additionally conditioning on $s_t$ changes nothing; this is where the Markov property is used.

$$\mathbb{E}[G_{t+1}\mid s_t = s] = \sum_{s'\in S} P(s'\mid s)\,\mathbb{E}[G_{t+1}\mid s_{t+1} = s',\, s_t = s] = \sum_{s'\in S} P(s'\mid s)\,\mathbb{E}[G_{t+1}\mid s_{t+1} = s'] = \sum_{s'\in S} P(s'\mid s)\,V(s')$$
Substituting this back gives the Bellman equation above.

We can also write the recursion in matrix form:
$$V = R + \gamma P V \quad\Longrightarrow\quad (I - \gamma P)\,V = R \quad\Longrightarrow\quad V = (I - \gamma P)^{-1}R$$
The vector $V$ can be obtained by solving this linear system directly, but since the complexity of the matrix inversion is $O(|S|^3)$, this method is generally not used for large state spaces.
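For a small MRP the closed-form solution is easy to check numerically; the sketch below (my own code) plugs in the transition matrix and rewards of exercise 1 further down, with $\gamma = 0.5$:

```python
import numpy as np

# Closed-form MRP evaluation: solve (I - gamma*P) V = R instead of inverting explicitly.
# P and R follow the exercise-1 chain given later in this post; gamma = 0.5.
P = np.zeros((7, 7))
for s in range(1, 7):
    P[s, s - 1] = 1.0                 # states 1..6 step deterministically to s-1
R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)
gamma = 0.5

V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(V)   # expected: [5.0, 2.5, 1.25, 0.625, 0.3125, 0.15625, 10.078125]
```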

Iterative algorithm to find the value function of MRP

A. Monte Carlo algorithm
[Figure: Monte Carlo algorithm for estimating the MRP value function]
(Here $t$ refers to the current time: what is being estimated is the value function at the current time, after $N$ sampled iterations.)
The MC method replaces the expectation by sampling and averaging: the sample average is an unbiased estimate of the expectation.
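A minimal Monte Carlo sketch (my own code, with an assumed episode/terminal convention, not the pseudocode from the figure): sample episodes from the MRP and average their returns.

```python
import numpy as np

# Estimate V(start) = E[G | s_0 = start] by averaging sampled returns.
# Convention assumed here: reward R[s] is collected in the state s itself,
# and an all-zero row of P marks a terminal state.
def mc_value(P, R, gamma, start, n_episodes=1000, horizon=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(R)
    returns = []
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]          # accumulate the discounted reward
            discount *= gamma
            if P[s].sum() == 0:           # terminal state: stop the episode
                break
            s = rng.choice(n, p=P[s])     # sample the next state
        returns.append(g)
    return np.mean(returns)               # sample mean approximates the expectation
```

On the deterministic chain of exercise 1 this just reproduces the exact value (e.g. 10.078125 when started from state 6); on a stochastic chain such as exercise 2 the sampled returns genuinely differ and the average converges to $V(s)$ as the number of episodes grows.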

B. Iterative solution
[Figure: iterative algorithm based on the Bellman equation]
According to the Bellman equation of MRP, iterate until the value function vector becomes stable.

Markov Decision Process (MDP)

  1. $S$ is a finite set of states;
  2. $A$ is a finite set of actions;
  3. $P^a$ is a transition model $P(s_{t+1} = s' \mid s_t = s, a_t = a)$;
  4. $R$ is a reward function $R(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$;
  5. $\gamma \in [0,1]$ is the discount factor.

An MDP is thus composed of the tuple $(S, A, P, R, \gamma)$.
In an MDP, the reward is related not only to the state but also to the action taken.

Policy in MDP

A policy is a distribution over actions in a given state.
Policy: $\pi(a\mid s) = P(a_t = a\mid s_t = s)$

Given a policy, the MDP $(S, A, P, R, \gamma)$ together with the policy $\pi$ is equivalent to the MRP $(S, P^\pi, R^\pi, \gamma)$ with:
$$P^\pi(s'\mid s) = \sum_{a\in A}\pi(a\mid s)\,P(s'\mid s,a), \qquad R^\pi(s) = \sum_{a\in A}\pi(a\mid s)\,R(s,a)$$
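A short sketch of this reduction in code (the array layout is my own assumption, not from the post): with $P$, $R$ and $\pi$ stored as arrays, $P^\pi$ and $R^\pi$ are just weighted sums over actions.

```python
import numpy as np

# Collapse an MDP plus a fixed policy into the equivalent MRP.
# Assumed array layout: P[a, s, s2] = P(s2 | s, a), R[s, a], pi[s, a] = pi(a | s).
def mdp_to_mrp(P, R, pi):
    P_pi = np.einsum('sa,ast->st', pi, P)   # P^pi(s2|s) = sum_a pi(a|s) P(s2|s,a)
    R_pi = (pi * R).sum(axis=1)             # R^pi(s)   = sum_a pi(a|s) R(s,a)
    return P_pi, R_pi
```

The resulting $(P^\pi, R^\pi, \gamma)$ can then be evaluated with any of the MRP tools above (closed form, Monte Carlo, or Bellman iteration).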
Schematic diagram comparing the MP/MRP process and the MDP process:
[Figure: comparison of the MP/MRP transition and the MDP transition]
Compared with MP/MRP, an MDP goes through one extra step: an action is first sampled from the policy's action distribution, and this action then determines the transition probabilities of the next state.

MDP value function

The state-value function of an MDP, $v^\pi(s)$, is the expected return starting from state $s$ under policy $\pi$. The action-value function is $q^\pi(s,a)$:
$$v^\pi(s) = \mathbb{E}[G_t\mid s_t = s], \qquad q^\pi(s,a) = \mathbb{E}[G_t\mid s_t = s, A_t = a]$$
The relationship between $v^\pi(s)$ and $q^\pi(s,a)$:
$$v^\pi(s) = \sum_{a\in A} \pi(a\mid s)\,q^\pi(s,a), \qquad q^\pi(s,a) = R(s,a) +\gamma \sum_{s'\in S} P(s'\mid s,a)\,v^\pi(s')$$

Bellman expectation equation

$$\begin{aligned} v^\pi(s) &= \mathbb{E}_\pi[R_{t+1}+\gamma v^\pi(s_{t+1})\mid s_t = s]\\ q^\pi(s,a) &= \mathbb{E}_\pi[R_{t+1}+\gamma q^\pi(s_{t+1},A_{t+1})\mid s_t = s, A_t = a] \end{aligned}$$
According to the previous Bellman equation of MRP, the Bellman equation of MDP can be easily obtained:
$$\begin{aligned} v^\pi(s) &= \sum_{a\in A}\pi(a\mid s)\Big(R(s,a)+\gamma \sum_{s'\in S}P(s'\mid s,a)\,v^\pi(s')\Big) \\ q^\pi(s,a) &= R(s,a) +\gamma\sum_{s'\in S}P(s'\mid s,a)\sum_{a'\in A}\pi(a'\mid s')\,q^\pi(s',a') \end{aligned}$$
The following diagrams give an intuitive picture of what these equations express:
[Figures: diagrams illustrating the Bellman expectation equations]
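As a rough sketch of how the first of these equations can be turned into an algorithm (the array layout is my own assumption, not from the post), iterating it to a fixed point evaluates a fixed policy:

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply
#   v(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s') ]
# Assumed array layout: P[a, s, s2] = P(s2|s,a), R[s, a], pi[s, a].
def policy_evaluation(P, R, pi, gamma, tol=1e-6):
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q(s,a) under the current v
        v_new = (pi * q).sum(axis=1)                   # average over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```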
The code for the two exercises in the figure below is attached:
[Figure: descriptions of exercises 1 and 2]

# Exercise 1
import numpy as np

S = list(range(7))
V1 = np.zeros(7)                       # current value estimate (float, not int)
V = np.full(7, 999.0)                  # previous estimate, initialised far away
R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)
gamma = 0.5
# pro[s][s_] = P(s_ | s); state 0 has an all-zero row, i.e. it is terminal
pro = np.array([[0,0,0,0,0,0,0],[1,0,0,0,0,0,0],[0,1,0,0,0,0,0],[0,0,1,0,0,0,0],
                [0,0,0,1,0,0,0],[0,0,0,0,1,0,0],[0,0,0,0,0,1,0]], dtype=float)

while np.max(np.abs(V1 - V)) > 1e-4:   # stop when the estimates stop changing
    V = V1.copy()                      # copy! "V = V1" would make V and V1 the same array
    for s in S:
        all_sum = 0.0
        for s_ in S:
            all_sum += pro[s][s_] * V[s_]
        V1[s] = R[s] + gamma * all_sum
print(V1)

# Exercise 2
S = list(range(7))
V1 = np.zeros(7)                       # current value estimate
V = np.full(7, 999.0)                  # previous estimate
R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)
gamma = 0.5
# pro[s][s_] = P(s_ | s); here the transitions are stochastic
pro = np.array([[0.5,0.5,0,0,0,0,0],[0.5,0,0.5,0,0,0,0],[0,0.5,0,0.5,0,0,0],[0,0.5,0,0.5,0,0,0],
                [0,0,0,0.5,0,0.5,0],[0,0,0,0,0.5,0,0.5],[0,0,0,0,0,0.5,0.5]])

while np.max(np.abs(V1 - V)) > 1e-4:
    V = V1.copy()                      # again: copy, do not alias
    for s in S:
        all_sum = 0.0
        for s_ in S:
            all_sum += pro[s][s_] * V[s_]
        V1[s] = R[s] + gamma * all_sum
print(V1)
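With these fixes both loops run a standard Jacobi-style Bellman iteration. For exercise 1 the result should match the closed-form solution computed earlier, roughly $[5,\ 2.5,\ 1.25,\ 0.625,\ 0.3125,\ 0.15625,\ 10.078]$.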

optimal value function

The optimal state-value function and optimal action-value function are obtained by searching over all policies for the one that makes $v_\pi(s)$ (respectively $q_\pi(s,a)$) largest, and taking that maximum value as the optimal value function:
$$v_*(s) = \max_{\pi}v_{\pi}(s), \qquad q_*(s,a) = \max_{\pi}q_{\pi}(s,a)$$
The optimal value function describes the best possible performance in the MDP.

optimal policy

Definition: if $v_\pi(s) \ge v_{\pi'}(s)$ for every state $s$, then $\pi \ge \pi'$.
Theorem: For any MDP, the following properties exist:

  • There must exist an optimal policy $\pi_*$;
  • The optimal policy achieves the optimal value function: $v_{\pi_*}(s) = v_*(s)$;
  • The optimal policy achieves the optimal action-value function: $q_{\pi_*}(s,a) = q_*(s,a)$.

The optimal policy can be obtained by maximizing the action value function:
$$\pi_{*}(a \mid s) =\begin{cases} 1 & \text{if } a=\underset{a \in A}{\operatorname{argmax}}\; q_{*}(s, a) \\ 0 & \text{otherwise} \end{cases}$$
Note: for any MDP there always exists an optimal policy that is deterministic.
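A minimal sketch of this extraction with a made-up $q_*$ table (my own example; the shape `(n_states, n_actions)` is assumed):

```python
import numpy as np

# Greedy (deterministic) policy extraction from an optimal q-table.
q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.5]])                        # illustrative values only

best_action = q_star.argmax(axis=1)                    # argmax_a q*(s, a) per state
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), best_action] = 1.0     # pi*(a|s) = 1 for the greedy action

v_star = q_star.max(axis=1)                            # v*(s) = max_a q*(s, a)
print(best_action)   # [1 0 1]
print(v_star)        # [2.  0.5 3.5]
```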

Prediction and control in MDP problems

1. prediction

  • Input: an MDP $\langle S,A,P,R,\gamma\rangle$ and a policy $\pi$, or equivalently an MRP $\langle S,P^\pi,R^\pi,\gamma\rangle$
  • Output: the value function $v^\pi$

2. control

  • Input: an MDP $\langle S,A,P,R,\gamma\rangle$
  • Output: the optimal value function $v^*$ and the optimal policy $\pi^*$

Both of the above problems can be solved with dynamic programming, because the original problem can be decomposed recursively into sub-problems: a solution that is optimal for the whole problem is also optimal on each sub-problem.

Origin blog.csdn.net/weixin_42988382/article/details/105448467