
# Reinforcement learning from basic to advanced - case and practice [2]: Markov decision process, Bellman equation, dynamic programming, policy iteration and value iteration

Figure 2.1 illustrates the interaction between the agent and the environment in reinforcement learning. After the agent receives the state from the environment, it takes an action and sends that action back to the environment. After the environment receives the action, it moves to the next state and passes the new state to the agent. This is how the agent and the environment interact in reinforcement learning, and this interaction can be described by a Markov decision process, so the Markov decision process is the basic framework of reinforcement learning.

Figure 2.1 Interaction between the agent and the environment

This chapter introduces the Markov decision process. Before doing so, we first introduce its simplified versions: the Markov process (MP) and the Markov reward process (MRP). By comparing with these two processes, it is easier to understand the Markov decision process. Next, we introduce policy evaluation in the Markov decision process, i.e., how to compute the value function once a policy is given. Finally, we introduce control in the Markov decision process, with two concrete algorithms: policy iteration and value iteration. In a Markov decision process the environment is fully observable. In many cases, however, some quantities in the environment are not observable; such partially observable problems can also be converted into Markov decision process problems.

2.1 Markov Process

2.1.1 Markov Property

In a stochastic process, the Markov property means that, given the present state and all past states, the conditional probability distribution of the future state depends only on the current state. Take a discrete stochastic process as an example: suppose the random variables $X_0, X_1, \cdots, X_T$ form a stochastic process. The set of all possible values of these random variables is called the state space. If the conditional distribution of $X_{t+1}$ given the past states is a function of $X_t$ only, then
$$p\left(X_{t+1}=x_{t+1} \mid X_{0:t}=x_{0:t}\right)=p\left(X_{t+1}=x_{t+1} \mid X_{t}=x_{t}\right)$$
where $X_{0:t}$ denotes the set of variables $X_{0}, X_{1}, \cdots, X_{t}$ and $x_{0:t}$ is a sequence of states $x_{0}, x_{1}, \cdots, x_{t}$ in the state space. The Markov property can also be described as the conditional independence of future states from past states given the current state. If a process satisfies the Markov property, then future transitions are independent of the past and depend only on the present. The Markov property is the basis of all Markov processes.

2.1.2 Markov chain

A Markov process is a sequence of random variables $s_1, \cdots, s_t$ with the Markov property, in which the state at the next moment $s_{t+1}$ depends only on the current state $s_t$. If we denote the state history as $h_{t}=\left\{s_{1}, s_{2}, s_{3}, \ldots, s_{t}\right\}$ ($h_t$ contains all previous states), the Markov process satisfies the condition
$$p\left(s_{t+1} \mid s_{t}\right) =p\left(s_{t+1} \mid h_{t}\right) \tag{2.1}$$
That is, the probability of transitioning from the current state $s_t$ to $s_{t+1}$ equals the probability of transitioning to $s_{t+1}$ conditioned on all previous states.

A discrete-time Markov process is also called a Markov chain. The Markov chain is the simplest Markov process, and its states are finite. For example, Figure 2.2 shows 4 states that transition among $s_1, s_2, s_3, s_4$. Starting from $s_1$, there is a probability of 0.1 of staying in $s_1$, a probability of 0.2 of transitioning to $s_2$, and a probability of 0.7 of transitioning to $s_4$. If $s_4$ is the current state, there is a probability of 0.3 of transitioning to $s_2$, a probability of 0.2 of transitioning to $s_3$, and a probability of 0.5 of staying in the current state.

Figure 2.2 Example of a Markov chain

We can describe the state transition $p\left(s_{t+1}=s^{\prime} \mid s_{t}=s\right)$ with the state transition matrix $\boldsymbol{P}$:
$$\boldsymbol{P}=\left(\begin{array}{cccc} p\left(s_{1} \mid s_{1}\right) & p\left(s_{2} \mid s_{1}\right) & \ldots & p\left(s_{N} \mid s_{1}\right) \\ p\left(s_{1} \mid s_{2}\right) & p\left(s_{2} \mid s_{2}\right) & \ldots & p\left(s_{N} \mid s_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ p\left(s_{1} \mid s_{N}\right) & p\left(s_{2} \mid s_{N}\right) & \ldots & p\left(s_{N} \mid s_{N}\right) \end{array}\right)$$
The state transition matrix is similar to a conditional probability table: given that we are currently in state $s_t$, it gives the probability of reaching each possible next state. Each row of the matrix therefore describes the probabilities of going from one state to all the other states.
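
As a concrete illustration, here is a minimal sketch of such a transition matrix in NumPy for the 4-state chain of Figure 2.2. Only the rows for $s_1$ and $s_4$ come from the text; the rows for $s_2$ and $s_3$ are hypothetical, chosen just so that each row sums to 1.

```python
import numpy as np

# State transition matrix P of a 4-state Markov chain
# (rows: current state, columns: next state).
P = np.array([
    [0.1, 0.2, 0.0, 0.7],   # from s1 (from the text)
    [0.4, 0.3, 0.3, 0.0],   # from s2 (hypothetical)
    [0.0, 0.5, 0.3, 0.2],   # from s3 (hypothetical)
    [0.0, 0.3, 0.2, 0.5],   # from s4 (from the text)
])

# Each row is a conditional distribution p(s' | s), so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)
print(P[0])   # distribution over next states when starting from s1
```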

2.1.3 Examples of Markov Processes

Figure 2.3 shows an example of a Markov process with seven states. Starting from $s_1$, for example, there is a probability of 0.4 of transitioning to $s_2$ and a probability of 0.6 of staying in the current state. From $s_2$, there is a probability of 0.4 of transitioning to $s_1$, a probability of 0.4 of transitioning to $s_3$, and a probability of 0.2 of staying in the current state. Given a Markov chain and its state transitions, we can sample from it and obtain sequences of states, i.e., trajectories. For example, starting from state $s_3$, we might obtain 3 trajectories:

  • $s_3, s_4, s_5, s_6, s_6$
  • $s_3, s_2, s_3, s_2, s_1$
  • $s_3, s_4, s_4, s_5, s_5$

By sampling the states, we can generate many such trajectories.
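
The following sketch samples trajectories from a Markov chain given its transition matrix. The 7-state matrix below is hypothetical (only the probabilities mentioned above for $s_1$ and $s_2$ are taken from the text); it is meant only to show the sampling mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(P, start, length):
    """Sample a state trajectory of the given length from transition matrix P."""
    traj = [start]
    for _ in range(length - 1):
        # Draw the next state from the row of P for the current state.
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

# Hypothetical 7-state chain (indices 0..6 stand for s1..s7).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])

for _ in range(3):
    print([f"s{i + 1}" for i in sample_trajectory(P, start=2, length=5)])  # start from s3
```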

Figure 2.3 Example of a Markov process

2.2 Markov Reward Process

A Markov reward process (MRP) is a Markov chain plus a reward function. In a Markov reward process, the states and the state transition matrix are the same as in a Markov chain; there is just an additional reward function. The reward function $R$ is an expectation that tells us how much reward we obtain when we reach a certain state. A discount factor $\gamma$ is also defined here. If the number of states is finite, $R$ can be represented as a vector.

2.2.1 Return and Value Function

Here we define a few more concepts. The horizon is the length of an episode (the maximum number of time steps per episode), which is a finite number of steps.
The return is defined as the discounted sum of rewards. Suppose the reward sequence after time $t$ is $r_{t+1},r_{t+2},r_{t+3},\cdots$; then the return is
$$G_{t}=r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\gamma^{3} r_{t+4}+\ldots+\gamma^{T-t-1} r_{T}$$
where $T$ is the final time step and $\gamma$ is the discount factor; the later a reward is obtained, the more heavily it is discounted. This reflects that we prefer rewards obtained sooner and discount rewards that lie in the future. Once we have the return, we can define the value of a state, which is the state-value function. For a Markov reward process, the state-value function is defined as the expected return, that is,
$$\begin{aligned} V^{t}(s) &=\mathbb{E}\left[G_{t} \mid s_{t}=s\right] \\ &=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\ldots+\gamma^{T-t-1} r_{T} \mid s_{t}=s\right] \end{aligned}$$
where $G_t$ is the discounted return defined above. Taking the expectation of $G_t$ gives how much value we can expect to obtain from this state; the expectation can be viewed as the present value of possible future rewards, i.e., how much value we have once we enter a certain state.

We use a discount factor for the following reasons. First, some Markov processes have cycles and never terminate, and we want to avoid infinite returns. Second, we cannot build a model that simulates the environment perfectly; our assessment of the future may not be accurate and we may not fully trust the model, so because of this uncertainty we discount our assessment of the future, preferring to obtain reward as soon as possible rather than at some point in the future. Third, if the reward has monetary value, we may prefer an immediate reward to a delayed one (money now is worth more than money later). Finally, we also simply tend to prefer immediate gratification. The discount factor can be set to 0 ($\gamma=0$), in which case we only care about the current reward, or to 1 ($\gamma=1$), in which case future rewards are not discounted and count the same as current rewards. The discount factor can be tuned as a hyperparameter of the reinforcement learning agent; different discount factors produce agents with different behavior.

How do we compute values in a Markov reward process? As shown in Figure 2.4, a Markov reward process still consists of state transitions, and its reward function can be defined as follows: the agent receives a reward of 5 when it enters the first state $s_1$, a reward of 10 when it enters the seventh state $s_7$, and no reward in any other state. We can represent the reward function with a vector:

$$\boldsymbol{R}=[5,0,0,0,0,0,10]$$

Figure 2.4 Example of a Markov reward process

We sample the return $G$ over 4-step episodes with $\gamma=0.5$:

(1) Return of $s_{4}, s_{5}, s_{6}, s_{7}$: $0+0.5\times 0+0.25 \times 0+ 0.125\times 10=1.25$

(2) Return of $s_{4}, s_{3}, s_{2}, s_{1}$: $0+0.5 \times 0+0.25\times 0+0.125 \times 5=0.625$

(3) Return of $s_{4}, s_{5}, s_{6}, s_{6}$: $0+0.5\times 0 +0.25 \times 0+0.125 \times 0=0$

We can now compute the return of each trajectory. Take the trajectory $s_4,s_5,s_6,s_7$ with discount factor 0.5: at $s_4$ the reward is 0; the next state is $s_5$, and since we have moved one step forward, the reward at $s_5$ (also 0) is discounted once; then comes $s_6$, whose reward is again 0 and whose discount factor is 0.25; when we reach $s_7$ we finally obtain a reward, but because the reward at state $s_7$ is a future reward, it is discounted three times (by 0.125). The final return of this trajectory is 1.25. In the same way we can obtain the returns of the other trajectories.

This raises a question: given the actual returns of some trajectories, how do we compute the value function? For example, we want to know the value of $s_4$, i.e., how much value we have once we enter $s_4$. One feasible approach is to generate many trajectories and aggregate them: starting from $s_4$, we sample many trajectories, compute the return of each of them, and take the average as the value of entering $s_4$. This is indeed one way to compute the value function, namely computing the value of $s_4$ by Monte Carlo (MC) sampling.
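
A minimal Monte Carlo sketch of this idea, using the reward vector $R=[5,0,0,0,0,0,10]$ and $\gamma=0.5$ from the text together with a hypothetical 7-state transition matrix: sample many 4-step trajectories from $s_4$, compute each discounted return, and average. The return convention below follows the worked example above (the reward of the starting state is undiscounted).

```python
import numpy as np

rng = np.random.default_rng(0)

# Reward per state for s1..s7 (from the text) and a hypothetical transition matrix.
R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])

def trajectory_return(traj, R, gamma):
    """G = R(s_0) + gamma R(s_1) + gamma^2 R(s_2) + ..., as in the worked example above."""
    return sum((gamma ** k) * R[s] for k, s in enumerate(traj))

def mc_value(P, R, start, gamma=0.5, steps=4, n_episodes=5000):
    """Monte Carlo estimate of V(start): average the returns of sampled trajectories."""
    total = 0.0
    for _ in range(n_episodes):
        traj = [start]
        for _ in range(steps - 1):
            traj.append(rng.choice(len(P), p=P[traj[-1]]))
        total += trajectory_return(traj, R, gamma)
    return total / n_episodes

# Check one of the returns above: the trajectory s4, s5, s6, s7 gives 1.25.
print(trajectory_return([3, 4, 5, 6], R, gamma=0.5))   # 1.25
print(mc_value(P, R, start=3))                         # MC estimate of V(s4)
```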

2.2.2 Bellman equation

Here, however, we adopt another approach and derive the Bellman equation from the value function:
$$V(s)=\underbrace{R(s)}_{\text{immediate reward}}+\underbrace{\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right)}_{\text{discounted sum of future rewards}}$$
where:

  • $s'$ can be regarded as any future state;
  • $p(s'|s)$ is the probability of transitioning from the current state to a future state;
  • $V(s')$ represents the value of a future state. Starting from the current state, we have some probability of going to each of the future states, so we must include $p\left(s^{\prime} \mid s\right)$. After obtaining a future state, we multiply by $\gamma$ so that future rewards are discounted;
  • $\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right)$ can be viewed as the discounted sum of future rewards.

The Bellman equation defines the relationship between the current state and future states: the immediate reward plus the discounted sum of future rewards makes up the Bellman equation.

1. Law of total expectation

Before deriving the Bellman equation, we first follow the proof of the law of total expectation to show that
$$\mathbb{E}[V(s_{t+1})|s_t]=\mathbb{E}[\mathbb{E}[G_{t+1}|s_{t+1}]|s_t]=\mathbb{E}[G_{t+1}|s_t]$$

The law of total expectation is also known as the law of iterated expectations (LIE).
If $A_i$ is a finite or countable partition of the sample space, the law of total expectation can be written as
$$\mathbb{E}[X]=\sum_{i} \mathbb{E}\left[X \mid A_{i}\right] p\left(A_{i}\right)$$

Proof:
For brevity and readability, we drop the subscripts and let $s=s_t$, $g'=G_{t+1}$, $s'=s_{t+1}$. By the definition of conditional expectation, we can rewrite the expected return as

$$\begin{aligned} \mathbb{E}\left[G_{t+1} \mid s_{t+1}\right] &=\mathbb{E}\left[g^{\prime} \mid s^{\prime}\right] \\ &=\sum_{g^{\prime}} g^{\prime}~p\left(g^{\prime} \mid s^{\prime}\right) \end{aligned} \tag{2.2}$$

If $X$ and $Y$ are both discrete random variables, the conditional expectation $\mathbb{E}[X|Y=y]$ is defined as
$$\mathbb{E}[X \mid Y=y]=\sum_{x} x p(X=x \mid Y=y)$$

Letting $s_t = s$, we can take the expectation of Equation (2.2):
$$\begin{aligned} \mathbb{E}\left[\mathbb{E}\left[G_{t+1} \mid s_{t+1}\right] \mid s_{t}\right] &=\mathbb{E} \left[\mathbb{E}\left[g^{\prime} \mid s^{\prime}\right] \mid s\right] \\ &=\mathbb{E} \left[\sum_{g^{\prime}} g^{\prime}~p\left(g^{\prime} \mid s^{\prime}\right)\mid s\right]\\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} g^{\prime} p\left(g^{\prime} \mid s^{\prime}, s\right) p\left(s^{\prime} \mid s\right) \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} \frac{g^{\prime} p\left(g^{\prime} \mid s^{\prime}, s\right) p\left(s^{\prime} \mid s\right) p(s)}{p(s)} \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} \frac{g^{\prime} p\left(g^{\prime} \mid s^{\prime}, s\right) p\left(s^{\prime}, s\right)}{p(s)} \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} \frac{g^{\prime} p\left(g^{\prime}, s^{\prime}, s\right)}{p(s)} \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} g^{\prime} p\left(g^{\prime}, s^{\prime} \mid s\right) \\ &=\sum_{g^{\prime}} \sum_{s^{\prime}} g^{\prime} p\left(g^{\prime}, s^{\prime} \mid s\right) \\ &=\sum_{g^{\prime}} g^{\prime} p\left(g^{\prime} \mid s\right) \\ &=\mathbb{E}\left[g^{\prime} \mid s\right]=\mathbb{E}\left[G_{t+1} \mid s_{t}\right] \end{aligned}$$

2. Bellman equation derivation

The derivation process of the Bellman equation is as follows:

$$\begin{aligned} V(s)&=\mathbb{E}\left[G_{t} \mid s_{t}=s\right]\\ &=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\ldots \mid s_{t}=s\right] \\ &=\mathbb{E}\left[r_{t+1}|s_t=s\right] +\gamma \mathbb{E}\left[r_{t+2}+\gamma r_{t+3}+\gamma^{2} r_{t+4}+\ldots \mid s_{t}=s\right]\\ &=R(s)+\gamma \mathbb{E}[G_{t+1}|s_t=s] \\ &=R(s)+\gamma \mathbb{E}[V(s_{t+1})|s_t=s]\\ &=R(s)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right) \end{aligned}$$

The Bellman equation is the recursive relationship between the current state and future states: the value function of the current state can be computed from the value function of the next state. The Bellman equation is named after its proposer Richard Bellman, the founder of dynamic programming, and is also known as the "dynamic programming equation".

The Bellman equation defines this recursive relationship between states, that is,
$$V(s)=R(s) +\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right)$$

Suppose we have a Markov chain as shown in Figure 2.5a; the Bellman equation describes the transition from the current state to future states. As shown in Figure 2.5b, suppose we are currently at $s_1$; then we can only go to 3 future states: with probability 0.1 we stay where we are, with probability 0.2 we go to state $s_2$, and with probability 0.7 we go to state $s_4$. So we multiply each state transition probability by the value of the corresponding future state, add the immediate reward, and obtain the value of the current state. The Bellman equation defines the recursive relationship between the current state and future states.

Figure 2.5 State transition

We can write the Bellman equation in matrix form:
$$\left(\begin{array}{c} V\left(s_{1}\right) \\ V\left(s_{2}\right) \\ \vdots \\ V\left(s_{N}\right) \end{array}\right)=\left(\begin{array}{c} R\left(s_{1}\right) \\ R\left(s_{2}\right) \\ \vdots \\ R\left(s_{N}\right) \end{array}\right)+\gamma\left(\begin{array}{cccc} p\left(s_{1} \mid s_{1}\right) & p\left(s_{2} \mid s_{1}\right) & \ldots & p\left(s_{N} \mid s_{1}\right) \\ p\left(s_{1} \mid s_{2}\right) & p\left(s_{2} \mid s_{2}\right) & \ldots & p\left(s_{N} \mid s_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ p\left(s_{1} \mid s_{N}\right) & p\left(s_{2} \mid s_{N}\right) & \ldots & p\left(s_{N} \mid s_{N}\right) \end{array}\right)\left(\begin{array}{c} V\left(s_{1}\right) \\ V\left(s_{2}\right) \\ \vdots \\ V\left(s_{N}\right) \end{array}\right)$$

The state values form the vector $[V(s_1),V(s_2),\cdots,V(s_N)]^\mathrm{T}$. Row by row, multiplying the vector $\boldsymbol{V}$ by one row of the state transition matrix and adding the reward available in that state gives the value of that state.

Once the Bellman equation is written in matrix form, we can solve it directly:
$$\begin{aligned} \boldsymbol{V} &= \boldsymbol{R}+ \gamma \boldsymbol{P}\boldsymbol{V} \\ \boldsymbol{I}\boldsymbol{V} &= \boldsymbol{R}+ \gamma \boldsymbol{P}\boldsymbol{V} \\ (\boldsymbol{I}-\gamma \boldsymbol{P})\boldsymbol{V}&=\boldsymbol{R} \\ \boldsymbol{V}&=(\boldsymbol{I}-\gamma \boldsymbol{P})^{-1}\boldsymbol{R} \end{aligned}$$

That is, we obtain the analytic solution directly:
$$\boldsymbol{V}=(\boldsymbol{I}-\gamma \boldsymbol{P})^{-1} \boldsymbol{R}$$

By inverting the matrix we can compute $\boldsymbol{V}$ directly. The problem is that the complexity of matrix inversion is $O(N^3)$. When there are many states, for example when we go from 10 states to 1000 states or even to 1 million states, the state transition matrix becomes a 1-million-by-1-million matrix, and inverting such a large matrix is very difficult. Therefore solving via the analytic solution is only suitable for Markov reward processes with a small number of states.
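
A minimal sketch of this closed-form solution, reusing the hypothetical 7-state chain and the reward vector from the text. Rather than forming the inverse of $I-\gamma P$ explicitly, it is numerically preferable to solve the linear system.

```python
import numpy as np

gamma = 0.5
R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)

# Hypothetical 7-state random-walk chain, built programmatically: stay with
# prob. 0.2, move to each neighbour with prob. 0.4 (the mass of a missing
# neighbour at the two ends is added to "stay").
N = len(R)
P = np.zeros((N, N))
for s in range(N):
    P[s, s] = 0.2
    for nxt in (s - 1, s + 1):
        if 0 <= nxt < N:
            P[s, nxt] = 0.4
        else:
            P[s, s] += 0.4

# V = (I - gamma P)^{-1} R; solving the linear system avoids an explicit inverse.
V = np.linalg.solve(np.eye(N) - gamma * P, R)
print(V)
```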

2.2.3 Iterative Algorithm for Computing Markov Reward Process Value

For a Markov reward process with many states (a large MRP), we can apply iterative methods such as dynamic programming, the Monte Carlo method (which estimates values by sampling), and temporal-difference learning (TD learning), which is a combination of the dynamic programming and Monte Carlo methods.

First we use the Monte Carlo method to compute values. As shown in Figure 2.6, once we have a Markov reward process, we can start from some state, put a little boat into the state transition matrix and let it "drift with the current", which produces a trajectory. From the trajectory we obtain rewards and directly compute the discounted reward, i.e., the return $g$, which we accumulate into $G_t$. After accumulating a certain number of trajectories, we divide $G_t$ by the number of trajectories and obtain the value of that state.

Figure 2.6 The Monte Carlo method for computing the value of a Markov reward process

For example, to compute the value of state $s_4$, we can start from $s_4$ and randomly generate many trajectories: we put the little boat into the state transition matrix and let it "drift with the current", producing trajectories. Each trajectory yields a return. We collect a large number of returns, say 100 or 1000, and take their average, which is equivalent to the value of $s_4$, because the value $V(s_4)$ is defined as how much reward we may obtain in the future. This is the Monte Carlo sampling method.

As shown in Figure 2.7, we can also use the dynamic programming method: we keep iterating the Bellman equation until the value function converges, which gives us the value of each state. We iterate the Bellman equation by bootstrapping; when the newly updated values differ only slightly from those of the previous iteration, the updates can stop and we output the latest $V'(s)$ as the value of each state. Here the Bellman equation is turned into a Bellman update, which gives us the values of the states.

The dynamic programming method updates the estimate of the current state value based on estimates of the successor state values (line 3 of the algorithm in Figure 2.7 updates $V$ with $V'$). This idea of updating an estimate from other estimates is called bootstrapping.

Figure 2.7 The dynamic programming algorithm for computing the value of a Markov reward process

The original meaning of "bootstrap" is a boot strap (bootlace). The term alludes to the episode of pulling oneself up by one's own bootstraps in the German tale The Surprising Adventures of Baron Munchausen, which is why it is translated into Chinese as 自举 ("self-lifting").
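
A minimal sketch of this Bellman-update iteration for an MRP, under the same assumptions as above (hypothetical chain, reward vector from the text): keep applying $V \leftarrow R + \gamma P V$ until the change is small.

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman update V <- R + gamma * P @ V until convergence."""
    V = np.zeros_like(R)
    for _ in range(max_iter):
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:   # stop when the update barely changes V
            return V_new
        V = V_new
    return V

# Same hypothetical 7-state chain and reward vector as above.
R = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)
N = len(R)
P = np.zeros((N, N))
for s in range(N):
    P[s, s] = 0.2
    for nxt in (s - 1, s + 1):
        if 0 <= nxt < N:
            P[s, nxt] = 0.4
        else:
            P[s, s] += 0.4

print(mrp_value_iterative(P, R, gamma=0.5))   # should match the analytic solution above
```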

2.2.4 Example of a Markov Reward Process

As shown in Figure 2.8, if we add rewards to the Markov chain, we receive a reward upon reaching each state. We can set the corresponding rewards, for example: when the agent reaches state $s_1$ it receives a reward of 5, when it reaches $s_7$ it receives a reward of 10, and reaching any other state yields no reward.
Because the number of states here is finite, we can use the vector $\boldsymbol{R}=[5,0,0,0,0,0,10]$ to represent the reward function, where $\boldsymbol{R}$ gives the reward of each state.

We can use a vivid example to understand the Markov reward process. A paper boat placed in a river flows with the current and has no motive power of its own, so we can regard the Markov reward process as an example of drifting with the current: starting from some point, the paper boat flows according to the predefined state transitions, and after it reaches each state we may obtain some reward.

Figure 2.8 Example of a Markov Reward Process

2.3 Markov decision process

Compared with the Markov reward process, the Markov decision process has decisions in addition (decisions refer to actions); the other definitions are similar to those of the Markov reward process. Moreover, the state transition is conditioned on one more quantity and becomes $p\left(s_{t+1}=s^{\prime} \mid s_{t}=s, a_{t}=a\right)$: the future state depends not only on the current state but also on the action the agent takes in the current state. The Markov decision process satisfies the condition
$$p\left(s_{t+1} \mid s_{t}, a_{t}\right) =p\left(s_{t+1} \mid h_{t}, a_{t}\right)$$

The reward function also depends on the current action and becomes $R\left(s_{t}=s, a_{t}=a\right)=\mathbb{E}\left[r_{t} \mid s_{t}=s, a_{t}=a\right]$: the current state and the action taken determine how much reward the agent may obtain at the current moment.

2.3.1 Policies in the Markov Decision Process

A policy defines what action should be taken in each state. Given the current state, we can substitute it into the policy function and obtain a probability:
$$\pi(a \mid s)=p\left(a_{t}=a \mid s_{t}=s\right)$$
This probability describes how an action is chosen among all possible actions; for example, there may be a probability of 0.7 of going left and a probability of 0.3 of going right. This is the probabilistic representation. The policy may also be deterministic: it may directly output a value, i.e., directly tell us which action to take now rather than a probability over actions. Assuming the policy function is stationary, the actions we take at different time steps are in fact samples from the policy function.

Given a Markov decision process and a policy $\pi$, we can convert the Markov decision process into a Markov reward process. In the Markov decision process, the state transition function $P(s'|s,a)$ depends on the current state and the current action. Because we now know the policy function, i.e., the probability of each action being taken in each state, we can sum over the actions and remove $a$; this gives us the transition for a Markov reward process, which involves no actions:
$$P_{\pi}\left(s^{\prime} \mid s\right)=\sum_{a \in A} \pi(a \mid s) p\left(s^{\prime} \mid s, a\right)$$

For the reward function, we can also remove the action and obtain a reward function like that of a Markov reward process:
$$r_{\pi}(s)=\sum_{a \in A} \pi(a \mid s) R(s, a)$$
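
A minimal sketch of this conversion for a tabular MDP stored as NumPy arrays `P[s, a, s']` and `R[s, a]` with a policy `pi[s, a]` (the tiny 2-state, 2-action MDP below is hypothetical, chosen only to exercise the formulas):

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Convert an MDP plus a policy into an MRP.

    P:  array of shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    R:  array of shape (S, A),    R[s, a]     = expected reward for (s, a)
    pi: array of shape (S, A),    pi[s, a]    = pi(a | s)
    """
    P_pi = np.einsum("sa,sax->sx", pi, P)   # P_pi(s' | s) = sum_a pi(a|s) p(s'|s,a)
    r_pi = np.sum(pi * R, axis=1)           # r_pi(s)      = sum_a pi(a|s) R(s,a)
    return P_pi, r_pi

# Tiny hypothetical 2-state, 2-action MDP.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from s0 for actions a0, a1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from s1 for actions a0, a1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])     # uniform random policy

P_pi, r_pi = mdp_to_mrp(P, R, pi)
print(P_pi)   # rows sum to 1
print(r_pi)
```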

2.3.2 Differences between the Markov Decision Process and the Markov Process/Markov Reward Process

The difference between state transitions in the Markov decision process and those in the Markov process/Markov reward process is shown in Figure 2.9. The state transitions of the Markov process/Markov reward process are direct: if the current state is $s$, the next state is determined directly by the transition probabilities. In the Markov decision process there is an extra layer of actions $a$: when the agent is in the current state, it first decides to take a certain action, which brings us to one of the black nodes. After reaching a black node, because of the uncertainty in the environment, even when the agent's current state and current action are fixed, the future state the agent enters is still a probability distribution. There is thus an extra layer of decision-making in the transition from the current state to the future state, which is the key difference between the Markov decision process and the Markov process/Markov reward process. In a Markov decision process, actions are decided by the agent, and the agent's actions determine the future state transitions.

Figure 2.9 Comparison of state transitions between Markov decision process and Markov process/Markov reward process

2.3.3 Value function in Markov decision process

The value function in the Markov decision process can be defined as
$$V_{\pi}(s)=\mathbb{E}_{\pi} \left[G_{t} \mid s_{t}=s\right] \tag{2.3}$$
where the expectation is taken with respect to the policy we follow. Once the policy is determined, we obtain the expectation by sampling from the policy and thereby compute the value function.

Here we additionally introduce the Q-function, also called the action-value function. The Q-function is defined as the expected return obtained by taking a certain action in a certain state:
$$Q_{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s, a_{t}=a\right] \tag{2.4}$$
The expectation here is also based on the policy function, so we need to sum over the policy function to obtain the value.
Summing over the actions in the Q-function gives the value function:
$$V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q_{\pi}(s, a) \tag{2.5}$$

Here we derive the Bellman equation for the Q-function:
$$\begin{aligned} Q(s,a)&=\mathbb{E}\left[G_{t} \mid s_{t}=s,a_{t}=a\right]\\ &=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\ldots \mid s_{t}=s,a_{t}=a\right] \\ &=\mathbb{E}\left[r_{t+1}|s_{t}=s,a_{t}=a\right] +\gamma \mathbb{E}\left[r_{t+2}+\gamma r_{t+3}+\gamma^{2} r_{t+4}+\ldots \mid s_{t}=s,a_{t}=a\right]\\ &=R(s,a)+\gamma \mathbb{E}[G_{t+1}|s_{t}=s,a_{t}=a] \\ &=R(s,a)+\gamma \mathbb{E}[V(s_{t+1})|s_{t}=s,a_{t}=a]\\ &=R(s,a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s,a\right) V\left(s^{\prime}\right) \end{aligned}$$

2.3.4 Bellman Expectation Equation

We can decompose the state-value function and the Q-function into two parts: the immediate reward and the discounted value of the successor state.
By decomposing the state-value function, we obtain a Bellman equation similar to the one for the Markov reward process, the Bellman expectation equation:
$$V_{\pi}(s)=\mathbb{E}_{\pi}\left[r_{t+1}+\gamma V_{\pi}\left(s_{t+1}\right) \mid s_{t}=s\right] \tag{2.6}$$

For the Q-function we can make a similar decomposition and obtain the Bellman expectation equation for the Q-function:
$$Q_{\pi}(s, a)=\mathbb{E}_{\pi}\left[r_{t+1}+\gamma Q_{\pi}\left(s_{t+1}, a_{t+1}\right) \mid s_{t}=s, a_{t}=a\right] \tag{2.7}$$
The Bellman expectation equation defines the relationship between the current state and future states.

We now decompose a bit further. First, Equation (2.8):

$$V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q_{\pi}(s, a) \tag{2.8}$$

Next, Equation (2.9):
$$Q_{\pi}(s, a)=R(s,a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right) \tag{2.9}$$

Equations (2.8) and (2.9) express the relationship between the state-value function and the Q-function.

Substituting Equation (2.9) into Equation (2.8), we obtain
$$V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)\right) \tag{2.10}$$

Equation (2.10) represents the relationship between the value of the current state and the value of the future state.

Substituting Equation (2.8) into Equation (2.9), we obtain
$$Q_{\pi}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q_{\pi}\left(s^{\prime}, a^{\prime}\right) \tag{2.11}$$

Equation (2.11) represents the relationship between the Q function at the current moment and the Q function at the future moment.

Equations (2.10) and (2.11) are another form of the Bellman expectation equation.
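
A minimal numerical sketch of these relationships, reusing the tiny hypothetical MDP and uniform policy from above: solve for $V_\pi$ exactly via the induced MRP, then check Equations (2.8) and (2.9).

```python
import numpy as np

# Tiny hypothetical 2-state, 2-action MDP and uniform random policy (as above).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

# V_pi from the induced MRP: V = (I - gamma P_pi)^{-1} r_pi.
P_pi = np.einsum("sa,sax->sx", pi, P)
r_pi = np.sum(pi * R, axis=1)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Equation (2.9): Q_pi(s, a) = R(s, a) + gamma * sum_s' p(s'|s,a) V_pi(s')
Q = R + gamma * np.einsum("sax,x->sa", P, V)

# Equation (2.8): V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
assert np.allclose(V, np.sum(pi * Q, axis=1))
print(V, Q)
```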

2.3.5 Backup Diagram

Next we introduce the concept of backup. A backup is similar to the iterative relationship in bootstrapping: for a given state, its current value is linearly related to the values of its future states.
We call diagrams like Figure 2.10 backup diagrams (or backtracking diagrams) because the relationships they show form the basis of the update or backup operations that are at the heart of reinforcement learning methods. These operations transfer value information back to a state (or state-action pair) from its successor states (or state-action pairs).
Each open circle represents a state, and each solid circle represents a state-action pair.

Figure 2.10 Backup diagram of $V_{\pi}$

As shown in Equation (2.12), there are two levels of summation. The first level sums over the leaf nodes; backing up one level, we back up the future values (the values of $s'$) to the black nodes.
The second level sums over the actions: once we have the values of the black nodes, backing up one more level gives the value of the root node, i.e., the value of the current state.
$$V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)\right) \tag{2.12}$$

Figure 2.11 shows the decomposition of the computation of the state-value function. The formula illustrated in Figure 2.11b is
$$V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q_{\pi}(s, a) \tag{2.13}$$

Figure 2.11b shows the relationship between the state-value function and the Q-function. Figure 2.11c computes the Q-function as

$$Q_{\pi}(s,a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right) \tag{2.14}$$

Substituting Equation (2.14) into Equation (2.13), we obtain
$$V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)\right)$$

So the backup diagram defines the relationship between the state-value function at the next moment and the state-value function at the current moment.

Figure 2.11 Calculation decomposition of state value function

For the Q-function we can carry out a similar derivation. As shown in Figure 2.12, the root node is now a Q-function node; the Q-function corresponds to the black nodes. The Q-functions at the next moment correspond to the leaf nodes, of which there are 4 black ones.
$$Q_{\pi}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q_{\pi}\left(s^{\prime}, a^{\prime}\right) \tag{2.15}$$

As shown in Equation (2.15), there are again two levels of summation. The first level of summation backs the leaf nodes up from the black nodes to the open-circle nodes, i.e., to the states represented by the open circles.
The second level then sums over the open-circle nodes, backing them up to the Q-function at the current moment.

Figure 2.12 Backup diagram of $Q_{\pi}$

In Figure 2.13c,
$$V_{\pi}\left(s^{\prime}\right)=\sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q_{\pi}\left(s^{\prime}, a^{\prime}\right) \tag{2.16}$$

Substituting Equation (2.16) into Equation (2.14), we obtain the relationship between the future Q-function and the current Q-function:
$$Q_{\pi}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q_{\pi}\left(s^{\prime}, a^{\prime}\right)$$

Figure 2.13 Decomposition of the computation of the Q-function

2.3.6 Policy Evaluation

Given a Markov decision process and the policy $\pi$ to follow, the process of computing the value function $V_{\pi}(s)$ is called policy evaluation. Policy evaluation is also called (value) prediction: predicting how much value the current policy will eventually produce. As shown in Figure 2.14a, we can think of the Markov decision process as having a ferryman on the boat who can control the movement of the boat and keep it from simply drifting with the current, because at every moment the action taken by the ferryman determines the direction of the boat. As shown in Figure 2.14b, in the Markov reward process and the Markov process, the paper boat drifts with the current and thereby produces a trajectory. The difference in a Markov decision process is that there is an agent controlling the boat, so that we can obtain as much reward as possible.

Figure 2.14 The difference between Markov decision process and Markov process/Markov reward process

Let us look at an example of policy evaluation and see how to compute the value of each state in the decision process. As shown in Figure 2.15, suppose there are two actions in the environment: go left and go right. The reward function should now be a function of both the action and the state, but here we stipulate that no matter what action the agent takes, reaching state $s_1$ yields a reward of 5, reaching state $s_7$ yields a reward of 10, and reaching any other state yields no reward. We can write the reward function as $\boldsymbol{R}=[5,0,0,0,0,0,10]$. Suppose the agent now follows the policy of always going left regardless of the state, i.e., the deterministic policy $\pi(s)=\text{left}$. Suppose the discount factor is $\gamma=0$; then for this deterministic policy, the value function we finally estimate is $\boldsymbol{V}_{\pi}=[5,0,0,0,0,0,10]$.

Figure 2.15 Example of policy evaluation

We can obtain the value function directly from the Bellman equation:
$$V^{k}_{\pi}(s)=r(s, \pi(s))+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, \pi(s)\right) V^{k-1}_{\pi}\left(s^{\prime}\right)$$
where $k$ is the iteration index. We can keep iterating, and the value function will eventually converge. After convergence, the values of the value function are the values of the states.

Consider another example: if the discount factor is $\gamma=0.5$, we can iterate Equation (2.17):
$$V^{t}_{\pi}(s)=\sum_{a} p(\pi(s)=a)\left(r(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{t-1}_{\pi}\left(s^{\prime}\right)\right) \tag{2.17}$$
where $t$ is the iteration index. We then obtain the state values.

Finally, suppose we instead follow a random policy: in every state, we go left with probability 0.5 and right with probability 0.5, i.e., $p(\pi(s)= \text{left})=0.5$ and $p(\pi(s)= \text{right})=0.5$. How do we find the state values under this policy? We can proceed as follows: at the beginning we initialize the values $V(s')$, so that each $V(s')$ has some value; then we plug $V(s')$ into the Bellman expectation equation and iterate, which computes the state values.
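
A minimal sketch of this iteration for the 7-state example, under stated assumptions: the text does not fully specify what the left/right actions do at the two ends of the chain, so the boundary behaviour below (staying put when a move would leave the chain) and the convention that the reward is received for the state being entered are assumptions.

```python
import numpy as np

N, gamma = 7, 0.5
R_state = np.array([5, 0, 0, 0, 0, 0, 10], dtype=float)   # reward for entering each state

def step(s, a):
    """Deterministic left/right dynamics; moves off the ends stay put (assumption)."""
    return min(max(s + a, 0), N - 1)

def evaluate_random_policy(tol=1e-8):
    V = np.zeros(N)
    while True:
        V_new = np.zeros(N)
        for s in range(N):
            # 0.5 probability of going left (a = -1), 0.5 of going right (a = +1).
            for a in (-1, +1):
                s_next = step(s, a)
                V_new[s] += 0.5 * (R_state[s_next] + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(evaluate_random_policy())
```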

2.3.7 Prediction and Control

Prediction and control are the core problems of the Markov decision process. The input of prediction (evaluating a given policy) is a Markov decision process $<S,A,P,R,\gamma>$ and a policy $\pi$, and the output is the value function $V_{\pi}$. Prediction means that, given a Markov decision process and a policy $\pi$, we compute its value function, i.e., the value of each state.

The input of control (searching for the optimal policy) is a Markov decision process $<S,A,P,R,\gamma>$, and the output is the optimal value function $V^*$ and the optimal policy $\pi^*$. Control means that we find an optimal policy and output both its optimal value function and the optimal policy itself.

In the Markov decision process, both prediction and control can be solved by dynamic programming. It should be emphasized that the difference between the two is that in the prediction problem a policy is given and we need to determine its value function, whereas in the control problem no policy is given and we need to determine the optimal value function and the corresponding optimal policy. The relationship between the two is in fact progressive: in reinforcement learning, we solve the control problem by solving prediction problems.

Let us give an example to illustrate the difference between prediction and control. First, the prediction problem: in the grid of Figure 2.16a, we specify that moving from A $\to$ A' yields a reward of +10, moving from B $\to$ B' yields a reward of +5, and every other step yields a reward of $-1$. As shown in Figure 2.16b, we now fix a policy: in every state the agent acts randomly, i.e., up, down, left, and right each have probability 0.25. The prediction problem is to find the value function under this policy. Figure 2.16c shows the corresponding value function.

Figure 2.16 Grid world example: prediction

Next, the control problem. In the control problem, the setting is the same as in the prediction problem; the only difference is that the policy is no longer fixed, i.e., the action pattern is unknown and we have to determine it ourselves. By solving the control problem, we obtain the optimal value function of every state, as shown in Figure 2.17b, and the optimal policy, as shown in Figure 2.17c.
The control problem is: given the same conditions, find the optimal value function over all possible policies, and the corresponding optimal policy.

Figure 2.17 Grid world example: control

2.3.8 Dynamic Programming

Dynamic programming (DP) is suitable for problems that have two properties: optimal substructure and overlapping subproblems. Optimal substructure means that the problem can be decomposed into subproblems, and by solving these subproblems we can combine their answers to obtain the answer to the original problem, i.e., the optimal solution. Overlapping subproblems means that subproblems recur and their solutions can be reused: we can cache the result of the first computation of a subproblem and use it directly when it is needed again.

The Markov decision process satisfies these requirements of dynamic programming: the Bellman equation can be decomposed into a recursive structure. When we decompose it into a recursive structure, if the sub-state of a subproblem obtains a value, then its future states, which are directly related to that sub-state, can also be computed. The value function stores and reuses the optimal solutions of the subproblems. Dynamic programming applies to the planning problem of the Markov decision process rather than to the learning problem: to use dynamic programming we must know the environment completely, that is, the state transition probabilities and the corresponding rewards. Using dynamic programming is a very effective way to solve the prediction and control problems of the Markov decision process.

2.3.9 Policy Evaluation in the Markov Decision Process

Policy evaluation means that, given a Markov decision process and a policy, we evaluate how much value we can obtain, i.e., how much value the current policy yields. We can turn the Bellman expectation backup directly into an iterative process and iterate repeatedly until convergence. This iterative process can be viewed as a synchronous backup.

A synchronous backup means that every iteration updates all the states, which demands a lot of computational resources. The idea of an asynchronous backup is to arrange, in some way, for each iteration not to update all of the states, because in fact many states do not need to be updated.

Equation (2.18) says that we can turn the Bellman expectation backup into a dynamic programming iteration. Once we have $V^t$ from the previous iteration, we can obtain the value at the next iteration through this recursive relationship. Iterating repeatedly, the value goes from $V^1$ to $V^2$ and eventually converges to $V_{\pi}$, which is the value function corresponding to the given policy $\pi$.

$$V^{t+1}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{t}\left(s^{\prime}\right)\right) \tag{2.18}$$

The core idea of policy evaluation is to iterate the Bellman expectation backup of Equation (2.18) repeatedly until the value function converges. Because the policy function is given, we can simplify the backup directly into the expression for a Markov reward process, which amounts to removing $a$:
$$V_{t+1}(s)=r_{\pi}(s)+\gamma \sum_{s^{\prime} \in S} P_{\pi}\left(s^{\prime} \mid s\right) V_{t}\left(s^{\prime}\right) \tag{2.19}$$
This iterative formula contains only the value function and the state transition function. By iterating Equation (2.19), we can also obtain the value of each state, because whether in a Markov reward process or in a Markov decision process, the value function $V$ depends only on the state: it indicates how much value the agent may obtain in the future once it enters a certain state. For example, suppose the environment is a small grid world. The agent's goal is to walk from some state to a terminal state, where the terminal states are the upper-left and lower-right corners (the shaded squares in Figure 2.18 (right)). The small grid world has 14 non-terminal states: $1,\cdots,14$. Each position is represented by a state. As shown in Figure 2.18 (left), the agent's policy function is given directly: in each state it walks randomly, i.e., up, down, left, and right, following a uniform random policy, $\pi(\mathrm{l} \mid .)=\pi(\mathrm{r} \mid .)=\pi(\mathrm{u} \mid .)=\pi(\mathrm{d} \mid .)=0.25$. When the agent is at a boundary state, for example when it goes left in state 4, it stays in state 4: we impose the restriction that an action which would leave the grid does not change the state, and the corresponding probability is set to 1, e.g., $p(7\mid 7,\mathrm{r})=1$.
The reward function is that every step the agent takes yields a reward of $-1$, i.e., the reward is $-1$ for every step before reaching a terminal state, so the agent needs to reach a terminal state as quickly as possible.

Given an action, the transition between states is deterministic, e.g., $p(2 \mid 6, \mathrm{u})=1$, that is, going up from state 6 leads directly to state 2. In many cases the environment is stochastic; for example, when the agent is in state 6 and chooses to go up, the floor might be slippery and it might slide into state 3 or state 1, which would be a probabilistic transition. But we have simplified the environment here: going up from state 6 always reaches state 2. Because we know every probability and every transition in the environment, we can iterate Equation (2.19) directly, and the value of each state will be computed.

Figure 2.18 Small grid world environment
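
A minimal sketch of iterating this backup on the small grid world, assuming the standard 4×4 layout (cells 0 and 15 are the shaded terminal states, reward $-1$ per step, uniform random policy, no discounting as in the classic version of this example):

```python
import numpy as np

N = 4                      # 4x4 grid; cells 0 and 15 are the shaded terminal states
gamma = 1.0                # no discounting in this classic example (assumption)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Move deterministically; moves off the grid leave the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s

terminal = {0, N * N - 1}
V = np.zeros(N * N)
for _ in range(1000):                      # sweep until (approximately) converged
    V_new = np.zeros(N * N)
    for s in range(N * N):
        if s in terminal:
            continue
        # Uniform random policy: each action with probability 0.25, reward -1 per step.
        V_new[s] = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in actions)
    V = V_new

print(V.reshape(N, N).round(1))   # should approach the well-known values -14, -18, -20, -22
```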

Let us look at a dynamic example. We recommend a web page from Stanford University that simulates how the state values of all the cells change during the one-step updates of Equation (2.18).

As shown in Figure 2.19a, the grid world contains many cells, and each cell represents a state. Each cell has an initial value of 0 and an arrow indicating which policy the agent should follow in that state. We use a random policy here: no matter which state the agent is in, the probabilities of going up, down, left, and right are all the same, 0.25 each, so its actions are completely random. In such an environment we want to compute the value of every state. We have also defined a reward function: some cells contain a value $R$, and some of these values are negative. We can see several cells with a reward of $-1$ and only one cell with a reward of +1; in the middle of the grid world there is a cell whose $R$ value is 1. So each such state has a reward value, and the states without a value have a reward of 0.

As shown in Figure 2.19b, we start policy evaluation, which is an iterative process. At initialization, all the $V(s)$ are 0. After one iteration, the values of some states have already changed: for example, some states have an $R$ value of $-1$, so after one iteration they obtain the reward $-1$; the green cell in the middle has a positive reward, so its value becomes +1. After the first iteration, some states already have different values.

Figure 2.19 Grid World: An Example of Dynamic Programming

As shown in Figure 2.20a, after one more iteration the states surrounding those that already had values also begin to obtain values, because they are adjacent to the previously valued states; the values are, in effect, propagated to the neighboring states. As shown in Figure 2.20b, we iterate step by step and the values keep changing.

Figure 2.20 Grid World: Example of a Policy Evaluation Process

After many iterations, the value functions of states farther away also obtain values, and the whole process is one of gradual diffusion; this is in fact a visualization of policy evaluation. With each iteration step, distant states obtain some value, and values diffuse gradually outward from the states that already have rewards. When we perform many iterations, the value of each state gradually stabilizes and finally stops changing. After convergence, the value of each state is its state value.

2.3.10 Markov decision process control

Policy evaluation means that, given a Markov decision process and a policy, we can estimate the value function. If we only have a Markov decision process, how should we find the optimal policy and thus obtain the optimal value function?

The optimal value function is defined as
$$V^{*}(s)=\max _{\pi} V_{\pi}(s)$$
The optimal value function means that we search for a policy $\pi$ that maximizes the value of every state; $V^*$ is the maximum value attainable at each state.
In this maximizing case, the policy we obtain is the optimal policy, that is,
$$\pi^{*}(s)=\underset{\pi}{\arg \max }~ V_{\pi}(s)$$
The optimal policy makes the value function of every state attain its maximum value. So if we can obtain the optimal value function, we can consider the environment of the Markov decision process solved. In this case the optimal value function is unique: the attainable upper limit of value in the environment is the same, but there may be multiple optimal policies, and multiple optimal policies can attain the same optimal value.

After obtaining the optimal value function, we can obtain the optimal policy by maximizing the Q-function:
$$\pi^{*}(a \mid s)=\left\{\begin{array}{ll} 1, & a=\underset{a \in A}{\arg \max}~ Q^{*}(s, a) \\ 0, & \text {otherwise} \end{array}\right.$$

When the Q-function has converged, because the Q-function is a function of state and action, if taking a certain action in a certain state maximizes the Q-function, then that action is the optimal action. If we can obtain the optimal Q-function $Q^{*}(s, a)$, we can directly take, in each state, the action that maximizes the Q-function, and thereby extract the optimal policy.

Q: How do we search over policies?

A: The simplest method of policy search is exhaustive search. Assuming that both the states and the actions are finite, we can choose among $|A|$ actions in each state, giving $|A|^{|S|}$ possible policies in total. We can enumerate the policies exhaustively, compute the value function of each policy, and compare them to obtain the best policy.

But exhaustive search is very inefficient, so we need other methods. There are two common approaches to searching for the optimal policy: policy iteration and value iteration.

The process of finding the optimal policy is the control process of the Markov decision process. Markov decision process control means finding an optimal policy that gives us the maximum value function, i.e.,

$$\pi^{*}(s)=\underset{\pi}{\arg \max } ~ V_{\pi}(s)$$

For a given Markov decision process, when the agent follows the optimal policy, the optimal policy is generally deterministic and stationary (it does not change over time). But the optimal policy is not necessarily unique: multiple actions may attain the same value.

We can solve the control problem of the Markov decision process through policy iteration and value iteration.

2.3.11 Policy iteration

Policy iteration consists of two steps: policy evaluation and policy improvement. As shown in Figure 2.21a, the first step is policy evaluation: we are currently optimizing a policy $\pi$, and during the optimization we obtain the latest version of the policy. We first keep this policy fixed and estimate its value, i.e., given the current policy function, we estimate the state-value function.
The second step is policy improvement. Having obtained the state-value function, we can further compute the Q-function. Having obtained the Q-function, we maximize it directly: by acting greedily with respect to the Q-function, we improve the policy further. These two steps are carried out iteratively. As shown in Figure 2.21b, in policy iteration we initialize a state-value function $V$ and a policy $\pi$ and then iterate between these two steps. The upper line in Figure 2.21c shows the value of the current state-value function, and the lower line shows the policy.
The process of policy iteration is like kicking a football back and forth. We first fix the current policy function and compute its state-value function. After computing the state-value function, we obtain a Q-function. We act greedily with respect to the Q-function, which is like "kicking the ball back" to the policy and improving it further. The improved policy is still not optimal, so we evaluate it again and obtain a new value function, and again maximize the Q-function based on this new value function. With this gradual iteration, the state-value function and the policy converge.

Figure 2.21 Policy iteration

Let us look at the second step, policy improvement, and see how we improve the policy. After obtaining the state-value function, we can compute the Q-function from the reward function and the state transition function:
$$Q_{\pi_{i}}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi_{i}}\left(s^{\prime}\right)$$

For each state, policy improvement yields the policy of the next round: in each state, we take the action that obtains the maximum value, that is,
$$\pi_{i+1}(s)=\underset{a}{\arg \max } ~Q_{\pi_{i}}(s, a)$$

As shown in Figure 2.22, we can view the Q-function as a Q-table: the horizontal axis lists all the states and the vertical axis lists the possible actions. Once we have the Q-function, we also have the Q-table. For each state, we take the maximum value in the corresponding column, and the action associated with that maximum value is the action we should now take. So the arg max operation picks, for each state, the action that maximizes the value of the Q-function in that column.

Figure 2.22 Q-table
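
A minimal sketch of this improvement step on a Q-table with hypothetical numbers: for each state we pick the arg max over the actions and turn it into a deterministic policy.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([
    [1.0, 2.5, 0.3],
    [0.0, 0.1, 4.2],
    [3.3, 3.3, 1.0],
])

# Greedy improved policy: for each state, the action maximizing Q.
greedy_actions = np.argmax(Q, axis=1)
print(greedy_actions)            # e.g. [1 2 0]

# The same policy written as pi(a|s): probability 1 on the arg max action, 0 elsewhere.
pi = np.zeros_like(Q)
pi[np.arange(len(Q)), greedy_actions] = 1.0
print(pi)
```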

Bellman optimality equation

When we keep applying the arg max operation, we obtain a monotonic improvement. By acting greedily (taking the arg max operation), we obtain a policy that is better or at least no worse, without making the value function worse. So when the improvement stops, we have obtained an optimal policy. When the improvement stops, the action we take maximizes the value of the Q-function, and the Q-function directly becomes the value function, that is,
$$Q_{\pi}\left(s, \pi^{\prime}(s)\right)=\max _{a \in A} Q_{\pi}(s, a)=Q_{\pi}(s, \pi(s))=V_{\pi}(s)$$

We thus also obtain the Bellman optimality equation:
$$V_{\pi}(s)=\max _{a \in A} Q_{\pi}(s, a)$$
The Bellman optimality equation states that the value of a state under the optimal policy must equal the expected return of taking the best action in that state. When the Markov decision process satisfies the Bellman optimality equation, the whole Markov decision process has reached its optimum.

The Bellman optimality equation is satisfied only when the whole process has converged and we have obtained the optimal value function. Once the Bellman optimality equation is satisfied, we can use the maximization operation:

$$V^{*}(s)=\max _{a} Q^{*}(s, a) \tag{2.20}$$
When we take the value corresponding to the action that maximizes the Q-function, that value is the value of the optimal value function at the current state. In addition, we give the Bellman equation for the Q-function:

$$Q^{*}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{*}\left(s^{\prime}\right) \tag{2.21}$$

Substituting Equation (2.20) into Equation (2.21), we obtain

$$\begin{aligned} Q^{*}(s, a)&=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{*}\left(s^{\prime}\right) \\ &=R(s,a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \max _{a^{\prime}} Q^{*}(s^{\prime}, a^{\prime}) \end{aligned}$$

We can get the transition between Q functions. Q learning is based on the Bellman optimal equation, when the state with the largest Q function value ( max ⁡ a ′ Q ∗ ( s ′ , a ′ ) \underset{a'}{\max} Q^{*} \left(s^{\prime}, a^{\prime}\right)amaxQ(s,a )) can be obtained

Q ∗ ( s , a ) = R ( s , a ) + γ ∑ s ′ ∈ S p ( s ′ ∣ s , a ) max ⁡ a ′ Q ∗ ( s ′ , a ′ ) Q^{*}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \max _{a^{\prime}} Q^{*}\left(s^{\prime}, a^{\prime}\right) Q(s,a)=R(s,a)+csSp(ss,a)amaxQ(s,a )
We will cover the details of Q-learning in Chapter 3. Conversely, substituting Equation (2.21) into Equation (2.20) gives

$$
\begin{aligned}
V^{*}(s)&=\max _{a} Q^{*}(s, a) \\
&=\max_{a} \mathbb{E}\left[G_t \mid s_t=s,a_t=a\right]\\
&=\max_{a}\mathbb{E}\left[r_{t+1}+\gamma G_{t+1} \mid s_t=s,a_t=a\right]\\
&=\max_{a}\mathbb{E}\left[r_{t+1}+\gamma V^*(s_{t+1}) \mid s_t=s,a_t=a\right]\\
&=\max_{a}\left(R(s,a) + \gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{*}\left(s^{\prime}\right)\right)
\end{aligned}
$$

This gives the recursion between state-value functions, i.e., the Bellman optimality equation written in terms of $V^*$ alone.
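As a small illustration of Equations (2.20) and (2.21), here is a minimal sketch assuming the model is stored as NumPy arrays `R[s, a]` and `P[s, a, s']` (these array names and shapes are assumptions made for the example, not notation from the text):

```python
import numpy as np

def q_from_v(R, P, V, gamma=0.99):
    """Equation (2.21): Q*(s, a) = R(s, a) + gamma * sum_{s'} p(s'|s, a) V*(s')."""
    return R + gamma * np.einsum("sap,p->sa", P, V)

def v_from_q(Q):
    """Equation (2.20): V*(s) = max_a Q*(s, a)."""
    return Q.max(axis=1)

# Tiny usage example with a random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
R = rng.normal(size=(3, 2))
P = rng.dirichlet(np.ones(3), size=(3, 2))   # each P[s, a, :] sums to 1
V = np.zeros(3)
print(v_from_q(q_from_v(R, P, V)))           # one Bellman optimality backup of V
```

Applying `v_from_q(q_from_v(...))` repeatedly is exactly the value iteration update described in the next subsection.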

2.3.12 Value iteration

1. The principle of optimality

Let us look at the problem from another angle. Dynamic programming splits the optimization problem into two parts: first, take the optimal action in the current state; then, from every subsequent state onward, follow the optimal policy. If both parts are optimal, the overall result is optimal.

The principle of optimality theorem: a policy $\pi(a|s)$ attains the optimal value in state $s$, i.e., $V_{\pi}(s) = V^{*}(s)$, if and only if, for every state $s'$ reachable from $s$, $\pi$ attains the optimal value, i.e., $V_{\pi}(s') = V^{*}(s')$ holds for all such $s'$.

2. Deterministic value iteration

If we know the solution to the subproblems, i.e., $V^{*}(s')$, we can obtain the optimal $V^{*}(s)$ with a one-step lookahead. Value iteration treats the Bellman optimality equation as an update rule, that is,
$$
V(s) \leftarrow \max _{a \in A}\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p \left(s^{\prime} \mid s, a\right) V\left(s^{\prime}\right)\right) \tag{2.22}
$$

Equation (2.22) holds with equality only when the Markov decision process has reached its optimum, but we can turn it into a backup equation, i.e., an iterative update. By repeatedly applying the Bellman optimality backup, the value function gradually approaches the optimal value function; this is the essence of the value iteration algorithm.

To obtain the optimal $V^*$, we directly iterate the Bellman optimality equation on the value $V(s)$ of every state; after many iterations the value function converges. This value iteration algorithm is also called deterministic value iteration.

3. Value iteration algorithm

The value iteration algorithm proceeds as follows.

(1) Initialization: set $k=1$ and, for all states $s$, $V_0(s)=0$.

(2) For $k = 1:H$ (where $H$ is the number of iterations required for $V(s)$ to converge):

    (a) For all states $s$:
$$
Q_{k+1}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{k}\left(s^{\prime}\right) \tag{2.23}
$$

$$
V_{k+1}(s)=\max _{a} Q_{k+1}(s, a) \tag{2.24}
$$

    (b) $k \leftarrow k+1$

(3) Extract the optimal policy after iteration:
$$
\pi(s)=\underset{a}{\arg \max } \left[R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{H+1}\left(s^{\prime}\right)\right]
$$

The value iteration algorithm gives us the optimal policy $\pi$. We simply iterate Equation (2.22); the value obtained after the iteration converges is the optimal value.

At the start of value iteration, all values are initialized; then we iterate over every state. Substituting Equation (2.23) into Equation (2.24) recovers Equation (2.22), so with Equations (2.23) and (2.24) in hand, we just keep iterating. After many iterations the value function converges, and the converged values are $V^*$. Once we have $V^*$, the remaining question is how to recover the optimal policy from it. We can extract it directly with the arg max operation: we first reconstruct the Q function from $V^*$, and then, for each state, the action with the largest Q value is the optimal action. In this way the optimal policy is extracted from the optimal value function. Note that we are solving a planning problem here, not a reinforcement learning problem, because we know exactly how the environment changes.
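A minimal sketch of the value iteration algorithm above, again assuming the model is given as NumPy arrays `R[s, a]` and `P[s, a, s']` (illustrative names for a tabular MDP, not anything defined in the text):

```python
import numpy as np

def value_iteration(R, P, gamma=0.99, tol=1e-6, max_iter=1000):
    """Iterate Equations (2.23) and (2.24) until V converges, then extract the policy.

    R: array of shape (S, A), expected reward R(s, a)
    P: array of shape (S, A, S), transition probabilities p(s'|s, a)
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)                               # step (1): V_0(s) = 0
    for _ in range(max_iter):                            # step (2)
        Q = R + gamma * np.einsum("sap,p->sa", P, V)     # Eq. (2.23)
        V_new = Q.max(axis=1)                            # Eq. (2.24)
        if np.max(np.abs(V_new - V)) < tol:              # stop once V has converged
            V = V_new
            break
        V = V_new
    # step (3): extract the greedy policy from the converged values
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    policy = Q.argmax(axis=1)
    return V, policy
```

On any tabular MDP for which `R` and `P` are known, this returns the converged values and the greedy policy extracted from them.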

The work done by value iteration resembles back-propagating values: each iteration propagates value information one step, so the policy and value function obtained in the intermediate steps are not meaningful on their own. In contrast, each iteration of policy iteration produces a meaningful, complete policy. Figure 2.23 visualizes the process on a shortest-path problem. In a grid world we set a goal at the upper-left corner; no matter where we start, we want to reach it (in fact the goal is not required by the iteration itself, it just makes the demonstration clearer). The value iteration process is like back-propagating from one state (here, the goal) to all other states, because each iteration can only affect states directly connected to it.

Recall the principle of optimality theorem: the value $V_{k+1}(s)$ computed for a state $s$ in some iteration is optimal only if the values of all successor states $s'$ are already optimal; if they are not, the iteration is merely propagating value information, not yet producing the optimal values.

Figure 2.23 Example: Shortest Path

As shown in Figure 2.23, we can in fact treat every state as such an endpoint of the propagation. At every iteration we recompute each state's value with the Bellman optimality backup; when a neighbor's value improves, a state's own value improves as well, until the neighbors stop changing. Before the iteration reaches $V_7$, that is, before the value of the goal has been propagated to all other states, the intermediate values are only temporary, incomplete data that do not yet represent each state's true value, so a policy generated from them is not meaningful. Value iteration is an iterative process, and Figure 2.23 visualizes how each state's value changes from $V_1$ to $V_7$. Because the agent receives a negative reward ($-1$) for every step it takes, it should reach the goal as quickly as possible, and the farther a state is from the goal, the smaller its value. After convergence at $V_7$, the value in the lower-right corner is $-6$, meaning it takes 6 steps to reach the goal; the closer the agent is to the goal, the larger the value. Once we have the optimal values, we obtain the optimal policy through policy extraction.
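A small sketch of the Figure 2.23 demo under assumed settings: a 4×4 grid world, goal at the top-left corner, reward $-1$ per step, and $\gamma = 1$ (the grid size and reward are assumptions chosen so that the far corner converges to $-6$, as described above):

```python
import numpy as np

N = 4
goal = (0, 0)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right

V = np.zeros((N, N))
for k in range(1, 8):                               # V_1 ... V_7
    V_new = np.zeros_like(V)
    for i in range(N):
        for j in range(N):
            if (i, j) == goal:
                continue                            # goal value stays 0
            best = -np.inf
            for di, dj in actions:
                ni = min(max(i + di, 0), N - 1)     # bumping into a wall keeps you in place
                nj = min(max(j + dj, 0), N - 1)
                best = max(best, -1 + V[ni, nj])    # Bellman optimality backup, gamma = 1
            V_new[i, j] = best
    V = V_new
print(V)   # the bottom-right entry converges to -6 (6 steps from the goal)
```

The printed values reproduce the pattern described above: each state's value equals the negative of its step distance to the goal.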

2.3.13 The difference between strategy iteration and value iteration

Let's look at a dynamic demonstration of Markov decision process control. Figure 2.24 shows the initialization interface of the grid world.

Figure 2.24 Grid World: Initialization Interface

First, let's look at policy iteration. The earlier example used a fixed random policy in every state: each state moves up, down, left, or right with probability 0.25, and the policy never changes. Now we want to iterate the policy, so the policy of each state can change. As shown in Figure 2.25a, we first perform policy evaluation to obtain a value for every state. As shown in Figure 2.25b, we then improve the policy by clicking "policy update". At this point the policies in some grid cells have changed. For example, for the state with value $-1$ in the middle, the best action is now to go down: once we reach that state, we should move down to obtain the best value. The policy of the cell to the right of the green one has also changed; its best action is now to go left, i.e., in that state the best action is to move left.

Figure 2.25 Markov Decision Process Control: Policy Iteration Example

As shown in Figure 2.26a, we run the next round of policy evaluation, and the values in the grid change again; after enough rounds they converge. As shown in Figure 2.26b, we perform policy update again, and the policy in almost every state changes: the actions are no longer random moves up, down, left, and right, but are chosen according to the best policy.

Figure 2.26 Markov Decision Process Control: Policy Iteration Example

As shown in Figure 2.27a, we run policy evaluation again; the grid values change once more and then converge. As shown in Figure 2.27b, we perform another policy update; the grid values change again, and the optimal policy in some states changes as well. As shown in Figure 2.28a, when we execute the policy update once more, the grid values no longer change, which shows that the whole Markov decision process has converged. At this point the value of each state is the value of the optimal value function, and the policy associated with each state is the optimal policy.

Figure 2.27 Markov Decision Process Control: Policy Iteration Example

The example above shows that policy iteration can "solve" the grid world: no matter which state we are in, we can follow the optimal policy for that state to reach the states that yield the most reward.

As shown in Figure 2.28b, we can also solve the Markov decision process with value iteration by clicking "Switch to value iteration". Once the grid values have converged, the corresponding optimal policy is generated, and the policy extracted this way is consistent with the optimal policy obtained by policy iteration. In each state, we follow the optimal policy to reach the states that yield the most reward.

Figure 2.28 Markov Decision Process Control: Policy Iteration Example

Let's compare policy iteration and value iteration; both algorithms can solve the control problem of a Markov decision process. Policy iteration consists of two steps: first, policy evaluation, i.e., evaluating the current policy to obtain its value function; then, policy improvement, i.e., computing the Q function and improving the policy greedily with respect to it. These two steps are repeated until the policy converges. Value iteration instead iterates the Bellman optimality equation directly to find the optimal value function, and then extracts the optimal policy from it.
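To make the contrast concrete, here is a minimal policy iteration sketch, again assuming a tabular MDP given as NumPy arrays `R[s, a]` and `P[s, a, s']` (illustrative names, not from the text). It alternates an exact policy evaluation with greedy improvement, whereas the value iteration sketch earlier applies the Bellman optimality backup directly.

```python
import numpy as np

def policy_iteration(R, P, gamma=0.99):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)                   # start from an arbitrary policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V as a linear system.
        R_pi = R[np.arange(n_states), policy]                # (S,)
        P_pi = P[np.arange(n_states), policy]                # (S, S)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q_pi.
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):               # policy stable -> optimal
            return V, policy
        policy = new_policy
```

Both this sketch and the earlier `value_iteration` converge to the same optimal policy on a tabular MDP; policy iteration typically needs fewer but more expensive iterations, while value iteration performs many cheap backups.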

2.3.14 Summary of Prediction and Control in Markov Decision Processes

As summarized in Table 2.1, we use dynamic programming algorithms to solve prediction and control in Markov decision processes, relying on different Bellman equations. For the prediction problem, i.e., policy evaluation, we repeatedly apply the Bellman expectation equation to evaluate a given policy and obtain its value function. For the control problem: if we use policy iteration, we rely on the Bellman expectation equation (together with greedy policy improvement); if we use value iteration, we rely on the Bellman optimality equation.

Table 2.1 Dynamic programming algorithms

| Problem | Bellman equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman expectation equation | Iterative policy evaluation |
| Control | Bellman expectation equation + greedy policy improvement | Policy iteration |
| Control | Bellman optimality equation | Value iteration |



Origin blog.csdn.net/sinat_39620217/article/details/131304485