RL (Zhao) Part 2 - Model-based: the Bellman equation [used to compute the state value under a given policy π: ① solving the linear equations, ② iterative solution] and the action value [obtained from state values; used in turn to evaluate how good an action is]

State value: the average return that an agent can obtain if it follows a given policy π, i.e., the expectation of the returns over all possible trajectories starting from that state: $v_\pi(s)\doteq\mathbb{E}[G_t|S_t=s]$.
Return: the discounted sum of all the rewards collected along a trajectory: $G_t\doteq R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots$

$v_\pi(s)$: the expected return obtained by starting from state $s$ and following policy $\pi$.

  • Starting from a given state $s_i$, a given policy π contains multiple possible trajectories because, at each state, the action the agent takes next is sampled according to the probabilities that the policy assigns to the actions.
    If the action taken at every state is fixed (indicated by a single arrow), then each state has only one trajectory under that policy.

  • $v_\pi(s)$ depends on $s$: this is because its definition is a conditional expectation with the condition that the agent starts from $S_t=s$.

  • $v_\pi(s)$ depends on $\pi$: this is because the trajectories are generated by following the policy $\pi$. For a different policy, the state values may be different.

  • $v_\pi(s)$ does not depend on $t$: if the agent moves in the state space, $t$ represents the current time step, but the value of $v_\pi(s)$ is determined once the policy is given.

The relationship between state values and returns can be stated more precisely as follows:

  • When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained starting from a state is equal to the value of that state.
  • By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories are different, and the state value is the mean of these returns.
  • The greater the state value is, the better the corresponding policy is.
  • State values can be used as a metric to evaluate whether a policy is good or not.

While state values are important, how can we analyze them?

  • The answer is the Bellman equation, which is an important tool for analyzing state values.
  • In a nutshell, the Bellman equation describes the relationships between the values of all states.
  • By solving the Bellman equation, we can obtain the state values. This process is called policy evaluation, which is a fundamental concept in reinforcement learning.

Finally, this chapter introduces another important concept called the action value.

2.1 Motivating example 1: Why are returns important?

The previous chapter introduced the concept of returns. In fact, returns play a fundamental role in reinforcement learning since they can evaluate whether a policy is good or not. This is demonstrated by the following examples.
[Figure 2.2: three policies that differ at state $s_1$.]
Consider the three policies shown in Figure 2.2. It can be seen that the three policies are different at s1. Which is the best and which is the worst? Intuitively,

  • The leftmost policy is the best because the agent starting from $s_1$ can avoid the forbidden area.
  • The middle policy is intuitively worse because the agent starting from $s_1$ moves into the forbidden area.
  • The rightmost policy is in between the other two because it has a probability of 0.5 of entering the forbidden area.

While the above analysis is based on intuition, a question that immediately follows is whether we can use mathematics to describe such intuition. The answer is yes, and it relies on the concept of the return.

In particular, suppose that the agent starts from $s_1$.

  • Following the first policy, the trajectory is $s_1 \to s_3 \to s_4 \to s_4 \cdots$. The corresponding discounted return is
$$\begin{aligned} \text{return}_1 &= 0+\gamma\cdot 1+\gamma^2\cdot 1+\ldots \\ &= \gamma(1+\gamma+\gamma^2+\ldots) \\ &= \frac{\gamma}{1-\gamma}, \end{aligned}$$
    where $\gamma\in(0,1)$ is the discount rate.
  • Following the second policy, the trajectory is $s_1 \to s_2 \to s_4 \to s_4 \cdots$. The discounted return is

$$\begin{aligned} \text{return}_2 &= -1+\gamma\cdot 1+\gamma^2\cdot 1+\ldots \\ &= -1+\gamma(1+\gamma+\gamma^2+\ldots) \\ &= -1+\frac{\gamma}{1-\gamma}. \end{aligned}$$

  • Following the third policy, two trajectories can possibly be obtained: one is $s_1 \to s_3 \to s_4 \to s_4 \cdots$, and the other is $s_1 \to s_2 \to s_4 \to s_4 \cdots$. Each of the two trajectories occurs with probability 0.5. Then, the average return that can be obtained starting from $s_1$ is

$$\begin{aligned} \text{return}_3 &= 0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right) \\ &= -0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

By comparing the returns of the three policies, we notice that
$$\text{return}_1 > \text{return}_3 > \text{return}_2$$

for any value of $\gamma\in(0,1)$. This inequality suggests that the first policy is the best because its return is the greatest, and the second policy is the worst because its return is the smallest.

This mathematical conclusion is consistent with the aforementioned intuition: the first policy is the best since it can avoid entering the forbidden area, and the second policy is the worst because it leads to the forbidden area.
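This ordering can be verified numerically. A minimal Python sketch (the choice $\gamma=0.9$ is only illustrative; any $\gamma\in(0,1)$ preserves the ordering):

```python
# Compare the returns of the three policies of Figure 2.2 for a concrete discount rate.
gamma = 0.9  # illustrative choice; any value in (0, 1) gives the same ordering

return_1 = gamma / (1 - gamma)               # s1 -> s3 -> s4 -> s4 ...
return_2 = -1 + gamma / (1 - gamma)          # s1 -> s2 -> s4 -> s4 ...
return_3 = 0.5 * return_2 + 0.5 * return_1   # average of the two possible trajectories

print(return_1, return_3, return_2)          # 9.0  8.5  8.0
assert return_1 > return_3 > return_2
```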

2.2 Motivating example 2: How to calculate returns?

While we have demonstrated the importance of returns, a question that immediately follows is how to calculate the returns when following a given policy.
[Figure 2.3: an example for demonstrating how to calculate returns.]

There are two ways to calculate returns.

  • The first way is simply by definition: a return equals the discounted sum of all the rewards collected along a trajectory. Consider the example in Figure 2.3, and let $v_i$ denote the return obtained by starting from $s_i$ for $i=1,2,3,4$. Then, the returns obtained when starting from the four states in Figure 2.3 can be calculated as
    [Equations (2.2): each $v_i$ written out by definition as the discounted sum of the rewards collected along its trajectory.]
  • The second way, which is more important, is based on the idea of bootstrapping. By observing the expressions of the returns in (2.2), we can rewrite them as
    [Equations (2.3): the same returns written recursively as $v_1=r_1+\gamma v_2$, $v_2=r_2+\gamma v_3$, $v_3=r_3+\gamma v_4$, $v_4=r_4+\gamma v_1$, where $r_i$ is the immediate reward collected when leaving $s_i$.]
    The above equations indicate an interesting phenomenon that the values of the returns rely on each other. More specifically, v1 relies on v2, v2 relies on v3, v3 relies on v4, and v4 relies on v1. This reflects the idea of bootstrapping, which is to obtain the values of some quantities from themselves.

At first glance, bootstrapping is an endless loop because the calculation of an unknown value relies on another unknown value. In fact, bootstrapping is easier to understand if we view it from a mathematical perspective. In particular, the equations in (2.3) can be reformed into a linear matrix-vector equation:
$$\begin{bmatrix}v_1\\v_2\\v_3\\v_4\end{bmatrix}=\begin{bmatrix}r_1\\r_2\\r_3\\r_4\end{bmatrix}+\gamma\begin{bmatrix}0&1&0&0\\0&0&1&0\\0&0&0&1\\1&0&0&0\end{bmatrix}\begin{bmatrix}v_1\\v_2\\v_3\\v_4\end{bmatrix},$$
which can be written compactly as
$$v=r+\gamma Pv.$$
Thus, the value of $v$ can be calculated easily as $v=(I-\gamma P)^{-1}r$, where $I$ is the identity matrix with appropriate dimensions. One may ask whether $I-\gamma P$ is always invertible. The answer is yes, as explained in Section 2.7.1.
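A minimal numerical sketch of this closed-form solution, assuming the cyclic transition structure $s_1\to s_2\to s_3\to s_4\to s_1$ implied above and an arbitrary illustrative reward vector (not the specific rewards of Figure 2.3):

```python
import numpy as np

gamma = 0.9
# Cyclic transition structure of the simple example: s1 -> s2 -> s3 -> s4 -> s1.
P = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [1., 0., 0., 0.]])
r = np.array([0., 1., 1., 1.])   # illustrative immediate rewards, one per state

# v = (I - gamma * P)^{-1} r, solved without forming the inverse explicitly.
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)
```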

In fact, (2.3) is the Bellman equation for this simple example. Although it is simple, (2.3) demonstrates the core idea of the Bellman equation: the return obtained by starting from one state depends on those obtained when starting from other states.

2.3 State values

We mentioned that returns can be used to evaluate policies. However, they are inapplicable to stochastic systems, because starting from one state may lead to different returns.

Motivated by this problem, we introduce the concept of state value in this section.

First, we need to introduce some necessary notations.

Consider a sequence of time steps $t=0,1,2,\ldots$ At time $t$, the agent is at state $S_t$, and the action taken following a policy $\pi$ is $A_t$. The next state is $S_{t+1}$, and the immediate reward obtained is $R_{t+1}$. This process can be expressed concisely as

$$S_t\xrightarrow{A_t}S_{t+1},R_{t+1}.$$

Note that $S_t, S_{t+1}, A_t, R_{t+1}$ are all random variables. Moreover, $S_t, S_{t+1}\in\mathcal{S}$, $A_t\in\mathcal{A}(S_t)$, and $R_{t+1}\in\mathcal{R}(S_t,A_t)$.

Starting from $t$, we can obtain a state-action-reward trajectory:

$$S_t\xrightarrow{A_t}S_{t+1},R_{t+1}\xrightarrow{A_{t+1}}S_{t+2},R_{t+2}\xrightarrow{A_{t+2}}S_{t+3},R_{t+3}\ldots$$

By definition, the discounted return along such a trajectory (note: under a given $\pi$, many different trajectories are possible) is
$$G_t\doteq R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots,$$
where $\gamma\in(0,1)$ is the discount rate. Note that $G_t$ is a random variable since $R_{t+1},R_{t+2},\ldots$ are all random variables.

Since $G_t$ is a random variable, we can calculate its expected value (also called the expectation or mean):

$$v_\pi(s)\doteq\mathbb{E}[G_t|S_t=s].$$

Here, $v_\pi(s)$ is called the state-value function, or simply the state value, of $s$.

Some important remarks are given below.

  • $v_\pi(s)$ depends on $s$: this is because its definition is a conditional expectation with the condition that the agent starts from $S_t=s$.
  • $v_\pi(s)$ depends on $\pi$: this is because the trajectories are generated by following the policy $\pi$. For a different policy, the state values may be different.
  • $v_\pi(s)$ does not depend on $t$: if the agent moves in the state space, $t$ represents the current time step, but the value of $v_\pi(s)$ is determined once the policy is given.

The relationship between state values and returns can be stated more precisely as follows:

  • When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained starting from a state is equal to the value of that state.
  • By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories are different, and the state value is the mean of these returns (a small numerical sketch of this averaging is given below).
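The following Monte Carlo sketch illustrates the averaging, reusing the third (stochastic) policy from Section 2.1: from $s_1$ the agent moves to $s_2$ (reward $-1$) or $s_3$ (reward $0$) with probability 0.5 each, and all subsequent rewards equal $1$. The episode is truncated at a finite horizon because the discounted tail is negligible; the averaged sample returns approach $v_\pi(s_1)=-0.5+\gamma/(1-\gamma)$:

```python
import random

gamma = 0.9
horizon = 200          # truncation; gamma**200 is negligible
num_episodes = 20000

def sample_return():
    """Sample one discounted return G_t starting from s1 under the stochastic policy."""
    g, discount = 0.0, 1.0
    # First step: reward -1 (towards s2) or 0 (towards s3), each with probability 0.5.
    g += discount * (-1.0 if random.random() < 0.5 else 0.0)
    discount *= gamma
    # All subsequent steps collect reward 1 (entering and then staying at the target s4).
    for _ in range(horizon - 1):
        g += discount * 1.0
        discount *= gamma
    return g

estimate = sum(sample_return() for _ in range(num_episodes)) / num_episodes
print(estimate, -0.5 + gamma / (1 - gamma))   # both are close to 8.5
```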

2.4 Bellman equation

This section derives the Bellman equation, a mathematical tool for analyzing state values. In a nutshell, the Bellman equation is a set of linear equations that describe the relationships between the values of all the states.

Derivation of the Bellman equation:

  • First, note that $G_t$ can be rewritten as
$$\begin{aligned} G_{t} &= R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots \\ &= R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\ldots) \\ &= R_{t+1}+\gamma G_{t+1}, \end{aligned}$$

where $G_{t+1}=R_{t+2}+\gamma R_{t+3}+\ldots$ This equation establishes the relationship between $G_t$ and $G_{t+1}$.

The state value can then be expressed as (2.4):
$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t|S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s] \\ &=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\mathbb{E}[G_{t+1}|S_t=s]. \end{aligned}$$

  • The first term, $\mathbb{E}[R_{t+1}|S_t=s]$, is the expectation of the immediate reward. By the law of total expectation, it can be rewritten as (2.5):
$$\begin{aligned} \mathbb{E}[R_{t+1}|S_t=s]&=\sum_{a\in\mathcal{A}}\pi(a|s)\mathbb{E}[R_{t+1}|S_t=s,A_t=a]\\ &=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r. \end{aligned}$$
    • Here, $\mathcal{A}$ and $\mathcal{R}$ are the sets of possible actions and rewards, respectively.
    • It should be noted that $\mathcal{A}$ may be different for different states, in which case it should be written as $\mathcal{A}(s)$. Similarly, $\mathcal{R}$ may also depend on $(s,a)$.
    • We drop the dependence on $s$ or $(s,a)$ for the sake of simplicity in this book.
    • Nevertheless, the conclusions are still valid in the presence of such dependence.
  • The second term, $\mathbb{E}[G_{t+1}|S_t=s]$, is the expectation of the future rewards. It can be rewritten as (2.6):
$$\begin{aligned} \mathbb{E}[G_{t+1}|S_t=s]&=\sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_t=s,S_{t+1}=s']p(s'|s)\\ &=\sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t+1}=s']p(s'|s)\quad\text{(due to the Markov property)}\\ &=\sum_{s'\in\mathcal{S}}v_\pi(s')p(s'|s) \\ &=\sum_{s'\in\mathcal{S}}v_\pi(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s). \end{aligned}$$
    • The above derivation uses the fact that $\mathbb{E}[G_{t+1}|S_t=s,S_{t+1}=s']=\mathbb{E}[G_{t+1}|S_{t+1}=s']$, which follows from the Markov property: the future rewards depend only on the present state, not on the previous ones.
    • $s'$ denotes the state at the next time step.

Substituting (2.5) and (2.6) into (2.4) gives (2.7):

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\mathbb{E}[G_{t+1}|S_t=s] \\ &=\underbrace{\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r}_{\text{mean of immediate rewards}}+\underbrace{\gamma\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')}_{\text{mean of future rewards}}\\ &=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right],\quad\text{for all }s\in\mathcal{S}. \end{aligned}$$
This equation is the Bellman equation, which characterizes the relationships of state values. It is a fundamental tool for designing and analyzing reinforcement learning algorithms.

  • $v_\pi(s)$ and $v_\pi(s')$ are unknown state values to be calculated.
    • The Bellman equation refers to a set of linear equations for all states rather than a single equation.
    • If we put these equations together, it becomes clear how to calculate all the state values.
  • $\pi(a|s)$ is a given policy.
    • Since state values can be used to evaluate a policy, solving the state values from the Bellman equation is a policy evaluation process.
  • $p(r|s,a)$ and $p(s'|s,a)$ represent the system model.
    • We will first show how to calculate the state values with this model and then show how to do that without the model by using model-free algorithms later in this book.
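To make the element-wise form (2.7) concrete, here is a minimal sketch that evaluates its right-hand side for a single state given a tabular model. The dictionary layout (`policy`, `p_r`, `p_s`) is an assumption made for illustration, not a fixed API:

```python
def bellman_rhs(s, v, policy, p_r, p_s, gamma=0.9):
    """Right-hand side of (2.7) for state s.

    policy[s][a]     = pi(a|s)
    p_r[(s, a)][r]   = p(r|s, a)
    p_s[(s, a)][s_n] = p(s_n|s, a)
    v[s_n]           = current estimate of v_pi(s_n)
    """
    total = 0.0
    for a, pi_a in policy[s].items():
        immediate = sum(p * r for r, p in p_r[(s, a)].items())
        future = sum(p * v[s_next] for s_next, p in p_s[(s, a)].items())
        total += pi_a * (immediate + gamma * future)
    return total
```

Sweeping this update over all states and repeating it is precisely the iterative policy-evaluation algorithm introduced later in Section 2.7.2.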

In addition to the expression in (2.7), readers may also encounter other expressions of the Bellman equation in the literature. We next introduce two equivalent expressions.

  • First, it follows from the law of total probability that
$$p(s'|s,a)=\sum_{r\in\mathcal{R}}p(s',r|s,a),\qquad p(r|s,a)=\sum_{s'\in\mathcal{S}}p(s',r|s,a).$$
    Then, equation (2.7) can be rewritten as
$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r|s,a)\left[r+\gamma v_{\pi}(s')\right].$$
  • Second, the reward $r$ may depend solely on the next state $s'$ in some problems. In that case, we can write the reward as $r(s')$ and hence $p(r(s')|s,a)=p(s'|s,a)$. Substituting this into (2.7) gives
$$v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}p(s'|s,a)\left[r(s')+\gamma v_\pi(s')\right].$$

2.5 Examples for illustrating the Bellman equation

We next use two examples to demonstrate how to write out the Bellman equation and calculate the state values step by step. Readers are advised to go through the examples carefully to gain a better understanding of the Bellman equation.

If we compare the state values of the two policies in the following two examples, it can be seen that
$$v_{\pi_1}(s_i)\geq v_{\pi_2}(s_i),\quad i=1,2,3,4,$$

  • which indicates that the policy in Figure 2.4 is better because it has greater state values.
  • This mathematical conclusion is consistent with the intuition that the first policy is better because it can avoid entering the forbidden area when the agent starts from $s_1$.
  • As a result, these two examples demonstrate that state values can be used to evaluate policies.

2.5.1 Example 1: a deterministic policy

[Figure 2.4: an example with a deterministic policy.]
Consider the first example shown in Figure 2.4, where the policy is deterministic.

We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy:

  • The probabilities of taking the actions are $\pi(a=a_3|s_1)=1$ and $\pi(a\neq a_3|s_1)=0$.
  • The state transition probabilities are $p(s'=s_3|s_1,a_3)=1$ and $p(s'\neq s_3|s_1,a_3)=0$.
  • The reward probabilities are $p(r=0|s_1,a_3)=1$ and $p(r\neq 0|s_1,a_3)=0$.

Substituting these values into the Bellman equation (2.7) gives
$$v_\pi(s_1)=0+\gamma v_\pi(s_3).$$

Similarly, it can be obtained that
$$\begin{aligned} v_{\pi}(s_2) &= 1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{3}) &= 1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{4}) &= 1+\gamma v_\pi(s_4). \end{aligned}$$

We can solve the state values from these equations. Since the equations are simple, we can solve them manually:
$$v_\pi(s_4)=\frac{1}{1-\gamma},\quad v_{\pi}(s_{3})=\frac{1}{1-\gamma},\quad v_{\pi}(s_{2})=\frac{1}{1-\gamma},\quad v_{\pi}(s_{1})=\frac{\gamma}{1-\gamma}.$$

Furthermore, if we set $\gamma=0.9$, then
$$v_\pi(s_4)=\frac{1}{1-0.9}=10,\quad v_{\pi}(s_{3})=10,\quad v_{\pi}(s_{2})=10,\quad v_{\pi}(s_{1})=\frac{0.9}{1-0.9}=9.$$
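The manual solution above can be cross-checked by solving the four linear equations numerically, a small sketch using `numpy`:

```python
import numpy as np

gamma = 0.9
# Bellman equations of Example 1:
#   v1 = 0 + gamma*v3,  v2 = 1 + gamma*v4,  v3 = 1 + gamma*v4,  v4 = 1 + gamma*v4
A = np.array([[1, 0, -gamma, 0],
              [0, 1, 0, -gamma],
              [0, 0, 1, -gamma],
              [0, 0, 0, 1 - gamma]], dtype=float)
b = np.array([0., 1., 1., 1.])

v = np.linalg.solve(A, b)
print(v)   # [ 9. 10. 10. 10.]
```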

2.5.2 Example 2: a stochastic policy

[Figure 2.5: an example with a stochastic policy.]
Consider the second example shown in Figure 2.5, where the policy is stochastic.

We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy:

  • At state $s_1$, the probabilities of going right and down both equal 0.5, i.e., $\pi(a=a_2|s_1)=0.5$ and $\pi(a=a_3|s_1)=0.5$.
  • The state transitions are deterministic: $p(s'=s_3|s_1,a_3)=1$ and $p(s'=s_2|s_1,a_2)=1$.
  • The rewards are also deterministic: $p(r=0|s_1,a_3)=1$ and $p(r=-1|s_1,a_2)=1$.

Substituting these values into the Bellman equation (2.7) gives
$$v_\pi(s_1)=0.5[0+\gamma v_\pi(s_3)]+0.5[-1+\gamma v_\pi(s_2)].$$

Similarly, it can be obtained that

$$\begin{aligned} v_{\pi}(s_{2}) &= 1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{3}) &= 1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{4}) &= 1+\gamma v_\pi(s_4). \end{aligned}$$

The state values can be solved from the above equations. Since the equations are simple, we can solve the state values manually and obtain

$$\begin{aligned} v_\pi(s_4) &= \frac{1}{1-\gamma}, \\ v_{\pi}(s_{3}) &= \frac{1}{1-\gamma}, \\ v_{\pi}(s_{2}) &= \frac{1}{1-\gamma}, \\ v_{\pi}(s_{1}) &= 0.5[0+\gamma v_\pi(s_3)]+0.5[-1+\gamma v_\pi(s_2)] = -0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

Furthermore, if we set $\gamma=0.9$, then

$$v_\pi(s_4)=10,\quad v_{\pi}(s_{3})=10,\quad v_{\pi}(s_{2})=10,\quad v_{\pi}(s_{1})=-0.5+9=8.5.$$

2.6 Matrix-vector form of the Bellman equation

The Bellman equation in (2.7) is in an element-wise form. Since it is valid for every state, we can combine all these equations and write them concisely in a matrix-vector form, which will frequently be used to analyze the Bellman equation.

To derive the matrix-vector form, we first rewrite the Bellman equation (2.7) as

$$v_{\pi}(s)=r_{\pi}(s)+\gamma\sum_{s'\in\mathcal{S}}p_{\pi}(s'|s)v_{\pi}(s'), \qquad (2.8)$$

where
$$r_{\pi}(s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r,\qquad p_{\pi}(s'|s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)p(s'|s,a).$$

  • $r_\pi(s)$ denotes the mean of the immediate rewards;
  • $p_\pi(s'|s)$ is the probability of transitioning from $s$ to $s'$ under policy $\pi$.

Suppose that the states are indexed as $s_i$ with $i=1,\ldots,n$, where $n=|\mathcal{S}|$. For state $s_i$, (2.8) can be written as
$$v_{\pi}(s_{i})=r_{\pi}(s_{i})+\gamma\sum_{s_{j}\in\mathcal{S}}p_{\pi}(s_{j}|s_{i})v_{\pi}(s_{j}). \qquad (2.9)$$

Let $v_\pi=[v_\pi(s_1),\ldots,v_\pi(s_n)]^T\in\mathbb{R}^n$, $r_\pi=[r_\pi(s_1),\ldots,r_\pi(s_n)]^T\in\mathbb{R}^n$, and $P_\pi\in\mathbb{R}^{n\times n}$ with $[P_\pi]_{ij}=p_\pi(s_j|s_i)$. Then, (2.9) can be written in the following matrix-vector form:

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}, \qquad (2.10)$$

where $v_\pi$ is the unknown to be solved, and $r_\pi$ and $P_\pi$ are known.

The matrix $P_\pi$ has some interesting properties.

  • First, it is a nonnegative matrix, meaning that all its elements are greater than or equal to zero. This property is denoted as $P_\pi\geq 0$, where $0$ denotes a zero matrix with appropriate dimensions. In this book, $\geq$ or $\leq$ represents an elementwise comparison.
  • Second, $P_\pi$ is a stochastic matrix, meaning that the sum of the values in every row is equal to one. This property is denoted as $P_\pi\mathbf{1}=\mathbf{1}$, where $\mathbf{1}=[1,\ldots,1]^T$ has appropriate dimensions.

[Figure 2.6: the stochastic-policy example used to illustrate the matrix-vector form.]
Consider the example shown in Figure 2.6. The matrix-vector form of the Bellman equation is

$$\underbrace{\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}}_{v_\pi}=\underbrace{\begin{bmatrix}r_\pi(s_1)\\r_\pi(s_2)\\r_\pi(s_3)\\r_\pi(s_4)\end{bmatrix}}_{r_\pi}+\gamma\underbrace{\begin{bmatrix}p_\pi(s_1|s_1)&p_\pi(s_2|s_1)&p_\pi(s_3|s_1)&p_\pi(s_4|s_1)\\p_\pi(s_1|s_2)&p_\pi(s_2|s_2)&p_\pi(s_3|s_2)&p_\pi(s_4|s_2)\\p_\pi(s_1|s_3)&p_\pi(s_2|s_3)&p_\pi(s_3|s_3)&p_\pi(s_4|s_3)\\p_\pi(s_1|s_4)&p_\pi(s_2|s_4)&p_\pi(s_3|s_4)&p_\pi(s_4|s_4)\end{bmatrix}}_{P_\pi}\underbrace{\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}}_{v_\pi}.$$

Substituting the specific values into the above equation gives

$$\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}=\begin{bmatrix}0.5(0)+0.5(-1)\\1\\1\\1\end{bmatrix}+\gamma\begin{bmatrix}0&0.5&0.5&0\\0&0&0&1\\0&0&0&1\\0&0&0&1\end{bmatrix}\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}.$$

It can be seen that $P_\pi$ satisfies $P_\pi\mathbf{1}=\mathbf{1}$.
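This specific system can also be checked numerically (a small sketch using `numpy`, with $\gamma=0.9$ as in Example 2):

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1., 1., 1.])   # [-0.5, 1, 1, 1]
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

# P_pi is a stochastic matrix: every row sums to one.
assert np.allclose(P_pi.sum(axis=1), 1.0)

# Solve v = r_pi + gamma * P_pi v.
v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)   # [ 8.5 10. 10. 10.]
```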

2.7 Solving state values from the Bellman equation

Calculating the state values of a given policy is a fundamental problem in reinforcement learning. This problem is often referred to as policy evaluation.

In this section, we present two methods for calculating state values from the Bellman equation.

2.7.1 Closed-form solution

Since $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ is a simple linear equation, its closed-form solution can be easily obtained as

$$v_\pi=(I-\gamma P_\pi)^{-1}r_\pi.$$

Some properties of $(I-\gamma P_{\pi})^{-1}$ are given below.

  • $I-\gamma P_{\pi}$ is invertible. The proof is as follows. According to the Gershgorin circle theorem [4], every eigenvalue of $I-\gamma P_{\pi}$ lies within at least one of the Gershgorin circles. The $i$th Gershgorin circle has its center at $[I-\gamma P_{\pi}]_{ii}=1-\gamma p_{\pi}(s_{i}|s_{i})$ and a radius equal to $\sum_{j\neq i}\left|[I-\gamma P_{\pi}]_{ij}\right|=\sum_{j\neq i}\gamma p_{\pi}(s_{j}|s_{i})$. Since $\gamma<1$, the radius is less than the magnitude of the center: $\sum_{j\neq i}\gamma p_\pi(s_j|s_i)<1-\gamma p_\pi(s_i|s_i)$. Therefore, no Gershgorin circle encircles the origin, and hence no eigenvalue of $I-\gamma P_{\pi}$ is zero.

  • $(I-\gamma P_{\pi})^{-1}\geq I$, meaning that every element of $(I-\gamma P_{\pi})^{-1}$ is nonnegative and, more specifically, no less than the corresponding element of the identity matrix. This is because $P_\pi$ has nonnegative entries, and hence $(I-\gamma P_{\pi})^{-1}=I+\gamma P_{\pi}+\gamma^{2}P_{\pi}^{2}+\cdots\geq I\geq 0$.

  • For any vector $r\geq 0$, it holds that $(I-\gamma P_{\pi})^{-1}r\geq r\geq 0$. This property follows from the second property because $[(I-\gamma P_{\pi})^{-1}-I]r\geq 0$. As a consequence, if $r_1\geq r_2$, we have $(I-\gamma P_{\pi})^{-1}r_{1}\geq(I-\gamma P_{\pi})^{-1}r_{2}$.
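These properties can be checked numerically for the $P_\pi$ of the example in Section 2.6 (a sketch; the Neumann-series truncation length is an arbitrary choice):

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

inv = np.linalg.inv(np.eye(4) - gamma * P_pi)

# (I - gamma*P_pi)^{-1} >= I elementwise.
assert np.all(inv >= np.eye(4) - 1e-12)

# The inverse equals the Neumann series I + gamma*P_pi + gamma^2*P_pi^2 + ...
series = sum(np.linalg.matrix_power(gamma * P_pi, k) for k in range(300))
assert np.allclose(inv, series)
```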

2.7.2 Iterative solution

Although the closed-form solution is useful for theoretical analysis, it is not convenient in practice because it involves a matrix inversion, which itself must be computed by other numerical algorithms. In fact, we can solve the Bellman equation directly with the following iterative algorithm:

$$v_{k+1}=r_{\pi}+\gamma P_{\pi}v_{k},\quad k=0,1,2,\ldots \qquad (2.11)$$

This algorithm generates a sequence of values $\{v_0,v_1,v_2,\ldots\}$, where $v_0\in\mathbb{R}^n$ is an initial guess of $v_\pi$.

It holds that

$$v_k\to v_\pi=(I-\gamma P_\pi)^{-1}r_\pi \quad\text{as }k\to\infty. \qquad (2.12)$$
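A minimal sketch of iteration (2.11), reusing $r_\pi$ and $P_\pi$ from the example in Section 2.6 (the stopping tolerance is an arbitrary choice):

```python
import numpy as np

gamma = 0.9
r_pi = np.array([-0.5, 1., 1., 1.])
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

v = np.zeros(4)                      # v_0: initial guess
for k in range(1000):
    v_new = r_pi + gamma * P_pi @ v  # iteration (2.11)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_new, v_closed)               # both approximately [8.5, 10, 10, 10]
```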

2.7.3 Illustrative examples

We next apply the algorithm in (2.11) to solve the state values of some examples.

The examples are shown in Figure 2.7. The orange cells represent forbidden areas, and the blue cell represents the target area. The reward settings are $r_{\text{boundary}}=r_{\text{forbidden}}=-1$ and $r_{\text{target}}=1$. Here, the discount rate is $\gamma=0.9$.
[Figure 2.7: (a) two "good" policies and their state values; (b) two "bad" policies and their state values.]

  • Figure 2.7(a) shows two “good” policies and their corresponding state values obtained by (2.11). The two policies have the same state values but differ at the top two states in the fourth column. Therefore, we know that different policies may have the same state values.
  • Figure 2.7(b) shows two “bad” policies and their corresponding state values. These two policies are bad because the actions of many states are intuitively unreasonable. Such intuition is supported by the obtained state values. As can be seen, the state values of these two policies are negative and much smaller than those of the good policies in Figure 2.7(a).

2.8 From state value to action value

While we have been discussing state values thus far in this chapter, we now turn to the action value, which indicates the “value” of taking an action at a state. While the concept of action value is important, the reason why it is introduced in the last section of this chapter is that it heavily relies on the concept of state values. It is important to understand state values well first before studying action values.

State value:
$$v_\pi(s)\doteq\mathbb{E}[G_t|S_t=s].$$

Action value:
$$q_\pi(s,a)\doteq\mathbb{E}[G_t|S_t=s,A_t=a].$$

As can be seen, the action value is defined as the expected return that can be obtained after taking an action at a state.

It must be noted that $q_\pi(s,a)$ depends on a state-action pair $(s,a)$ rather than an action alone. It may be more rigorous to call this value a state-action value, but it is conventionally called an action value for simplicity.

What is the relationship between action values and state values?

  • First, it follows from the properties of conditional expectation that
$$\underbrace{\mathbb{E}[G_t|S_t=s]}_{v_\pi(s)}=\sum_{a\in\mathcal{A}}\underbrace{\mathbb{E}[G_t|S_t=s,A_t=a]}_{q_\pi(s,a)}\pi(a|s).$$
    It then follows that
$$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_{\pi}(s,a). \qquad (2.13)$$
    As a result, a state value is the expectation of the action values associated with that state.
  • Second, since the state value is given by
$$v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\Big[\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_\pi(s')\Big],$$
    comparing it with (2.13) leads to
$$q_\pi(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_\pi(s'). \qquad (2.14)$$
    It can be seen that the action value consists of two terms:
    • the first term is the mean of the immediate rewards;
    • the second term is the mean of the future rewards.

Both (2.13) and (2.14) describe the relationship between state values and action values. They are the two sides of the same coin:

  • (2.13) shows how to obtain state values from action values (action values → state values);
  • (2.14) shows how to obtain action values from state values (state values → action values).
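A minimal vectorized sketch of the two conversions, assuming a tabular model stored as arrays (`pi[s, a]` for $\pi(a|s)$, `r_mean[s, a]` for $\sum_r p(r|s,a)r$, and `P[s, a, s']` for $p(s'|s,a)$; these array names are illustrative, not a standard API):

```python
import numpy as np

def q_from_v(v, r_mean, P, gamma):
    """Equation (2.14): q(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')."""
    return r_mean + gamma * np.einsum('sax,x->sa', P, v)

def v_from_q(q, pi):
    """Equation (2.13): v(s) = sum_a pi(a|s) q(s,a)."""
    return np.sum(pi * q, axis=1)
```

Applying `v_from_q(q_from_v(v, r_mean, P, gamma), pi)` to the true $v_\pi$ reproduces $v_\pi$ itself, which is exactly the element-wise Bellman equation (2.7).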

2.8.1 Illustrative examples

[Figure 2.8: a stochastic policy used to illustrate action values.]
We next present an example to illustrate the process of calculating action values and discuss a common mistake that beginners may make.

Consider the stochastic policy shown in Figure 2.8. We next examine only the actions of $s_1$; the other states can be examined similarly. The action value of $(s_1,a_2)$ is

$$q_\pi(s_1,a_2)=-1+\gamma v_\pi(s_2),$$

where $s_2$ is the next state. Similarly, it can be obtained that

$$q_\pi(s_1,a_3)=0+\gamma v_\pi(s_3).$$

A common mistake that beginners may make concerns the values of the actions that the given policy does not select. For example, the policy in Figure 2.8 can only select $a_2$ or $a_3$ at $s_1$ and cannot select $a_1$, $a_4$, or $a_5$. One may argue that, since the policy does not select $a_1, a_4, a_5$, we do not need to calculate their action values, or that we can simply set $q_\pi(s_1,a_1)=q_\pi(s_1,a_4)=q_\pi(s_1,a_5)=0$. This is wrong.

  • First, even if an action would not be selected by a policy, it still has an action value. In this example, although the policy $\pi$ does not take $a_1$ at $s_1$, we can still calculate its action value by observing what we would obtain after taking this action. Specifically, after taking $a_1$, the agent is bounced back to $s_1$ (hence, the immediate reward is $-1$) and then continues moving in the state space starting from $s_1$ by following $\pi$ (hence, the future reward is $\gamma v_\pi(s_1)$). As a result, the action value of $(s_1,a_1)$ is
$$q_\pi(s_1,a_1)=-1+\gamma v_\pi(s_1).$$
    Similarly, for $a_4$ and $a_5$, which cannot be selected by the given policy either, we have

$$q_\pi(s_1,a_4)=-1+\gamma v_\pi(s_1),\qquad q_\pi(s_1,a_5)=0+\gamma v_\pi(s_1).$$

  • Second, why do we care about the actions that the given policy would not select?
    Although some actions cannot be selected by a given policy, this does not mean that these actions are not good. It is possible that the given policy itself is not good, so it fails to select the best action. The purpose of reinforcement learning is to find optimal policies, and to that end we must keep exploring all actions to determine better actions for each state.

Finally, after computing the action values, we can also calculate the state value according to (2.13):
$$\begin{aligned} v_{\pi}(s_1)&=0.5q_\pi(s_1,a_2)+0.5q_\pi(s_1,a_3)\\ &=0.5[0+\gamma v_\pi(s_3)]+0.5[-1+\gamma v_\pi(s_2)]. \end{aligned}$$
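Plugging in $\gamma=0.9$ and the state values computed in Example 2 (which uses the same stochastic policy at $s_1$: $v_\pi(s_1)=8.5$, $v_\pi(s_2)=v_\pi(s_3)=10$) gives a quick numerical check (a sketch):

```python
gamma = 0.9
v = {'s1': 8.5, 's2': 10.0, 's3': 10.0}   # state values of Example 2 (gamma = 0.9)

q_s1 = {
    'a1': -1 + gamma * v['s1'],   # bounced back to s1: 6.65
    'a2': -1 + gamma * v['s2'],   # moves to s2 (forbidden area): 8.0
    'a3':  0 + gamma * v['s3'],   # moves to s3: 9.0
    'a4': -1 + gamma * v['s1'],   # bounced back to s1: 6.65
    'a5':  0 + gamma * v['s1'],   # stays at s1: 7.65
}
print(q_s1)

# (2.13): v(s1) is the policy-weighted average of the action values the policy selects.
assert abs(0.5 * q_s1['a2'] + 0.5 * q_s1['a3'] - v['s1']) < 1e-9
```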

2.8.2 The Bellman equation in terms of action values

The Bellman equation that we previously introduced was defined based on state values. In fact, it can also be expressed in terms of action values.

In particular, substituting (2.13) into (2.14) yields

$$q_\pi(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\sum_{a'\in\mathcal{A}(s')}\pi(a'|s')q_\pi(s',a'),$$

which is an equation of action values. The above equation is valid for every state-action pair. If we put all these equations together, their matrix-vector form is

$$q_{\pi}=\tilde{r}+\gamma P\Pi q_{\pi}, \qquad (2.15)$$

where $q_\pi$ is the action value vector indexed by the state-action pairs: its $(s,a)$th element is $[q_\pi]_{(s,a)}=q_\pi(s,a)$. Here, $\tilde{r}$ is the immediate reward vector indexed by the state-action pairs: $[\tilde{r}]_{(s,a)}=\sum_{r\in\mathcal{R}}p(r|s,a)r$. The matrix $P$ is the probability transition matrix, whose rows are indexed by the state-action pairs and whose columns are indexed by the states: $[P]_{(s,a),s'}=p(s'|s,a)$. Moreover, $\Pi$ is a block diagonal matrix in which each block is a $1\times|\mathcal{A}|$ vector: $\Pi_{s',(s',a')}=\pi(a'|s')$, and the other entries of $\Pi$ are zero.

Compared to the Bellman equation defined in terms of state values, the equation defined in terms of action values has some unique features. For example, $\tilde{r}$ and $P$ are independent of the policy and are merely determined by the system model. The policy is embedded in $\Pi$. It can be verified that (2.15) is also a contraction mapping and has a unique solution that can be iteratively solved. More details can be found in [5].
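The bookkeeping in (2.15) can be checked on a small synthetic tabular MDP (a sketch with randomly generated $p(s'|s,a)$, $\tilde r$, and $\pi$, used only to verify the structure): it builds $P$ and $\Pi$, solves $q_\pi=(I-\gamma P\Pi)^{-1}\tilde r$, and confirms that $v_\pi=\Pi q_\pi$ satisfies the state-value Bellman equation (2.10).

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9

# Synthetic tabular model: p(s'|s,a) and mean immediate rewards r~(s,a).
P_sas = rng.random((n_s, n_a, n_s))
P_sas /= P_sas.sum(axis=2, keepdims=True)
r_tilde = rng.random((n_s, n_a))
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)

# P: rows indexed by (s, a), columns by s'.  Pi: block diagonal with Pi[s, (s, a)] = pi(a|s).
P = P_sas.reshape(n_s * n_a, n_s)
Pi = np.zeros((n_s, n_s * n_a))
for s in range(n_s):
    Pi[s, s * n_a:(s + 1) * n_a] = pi[s]

# Solve (2.15): q = r~ + gamma * P * Pi * q.
q = np.linalg.solve(np.eye(n_s * n_a) - gamma * P @ Pi, r_tilde.reshape(-1))

# v = Pi q must satisfy the state-value Bellman equation (2.10): v = r_pi + gamma * P_pi v.
v = Pi @ q
r_pi = (pi * r_tilde).sum(axis=1)
P_pi = np.einsum('sa,sax->sx', pi, P_sas)
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```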

2.9 Summary

The most important concept introduced in this chapter is the state value. Mathematically, a state value is the expected return that the agent can obtain by starting from a state.

The values of different states are related to each other. That is, the value of state $s$ relies on the values of some other states, which may in turn rely on the value of state $s$ itself.

This phenomenon might be the most confusing part of this chapter for beginners.

It is related to an important concept called bootstrapping, which involves calculating something from itself. Although bootstrapping may be intuitively confusing, it becomes clear if we examine the matrix-vector form of the Bellman equation.

In particular, the Bellman equation is a set of linear equations that describe the relationships between the values of all states.

Since state values can be used to evaluate whether a policy is good or not, the process of solving the state values of a policy from the Bellman equation is called policy evaluation.

As we will see later in this book, policy evaluation is an important step in many reinforcement learning algorithms.

Another important concept, the action value, was introduced to describe the value of taking an action at a state.

As we will see later in this book, action values play a more direct role than state values when we attempt to find optimal policies.

Finally, the Bellman equation is not restricted to the reinforcement learning field. It exists widely in many fields, such as control theory and operations research. In different fields, the Bellman equation may have different expressions. In this book, the Bellman equation is studied in the context of discrete Markov decision processes. More information about this topic can be found in [2].

2.10 Q&A

Q: What is the relationship between state values and returns?
A: The value of a state is the mean of the returns that can be obtained if the agent starts from that state.

Q: Why do we care about state values?
A: State values can be used to evaluate policies. In fact, optimal policies are defined based on state values. This point will become clearer in the next chapter.

Q: Why do we care about the Bellman equation?
A: The Bellman equation describes the relationships among the values of all states.
It is the tool for analyzing state values.

Q: Why is the process of solving the Bellman equation called policy evaluation?
A: Solving the Bellman equation yields state values. Since state values can be used to evaluate a policy, solving the Bellman equation can be interpreted as evaluating the corresponding policy.

Q: Why do we need to study the matrix-vector form of the Bellman equation?
A: The Bellman equation refers to a set of linear equations established for all the states. To solve the state values, we must put all the linear equations together. The matrix-vector form is a concise expression of these linear equations.

Q: What is the relationship between state values and action values?
A: On the one hand, a state value is the mean of the action values for that state. On the other hand, an action value relies on the values of the next states that the agent may transition to after taking the action.

Q: Why do we care about the values of the actions that a given policy cannot select?
A: Although a given policy cannot select some actions, this does not mean that these actions are not good. On the contrary, it is possible that the given policy is not good and misses the best action. To find better policies, we must keep exploring different actions even though some of them may not be selected by the given policy.



