ADPRL - Approximate Dynamic Programming and Reinforcement Learning - Note 8 - Approximate Policy Iteration

Note 8 Approximate Policy Iteration

Approximate policy iteration

In Note 7, we introduced the concept of parametric function approximation and its application in approximate value iteration. Although the convergence properties of AVI have proven promising, it is not identical to the original VI, and the inherent limitations of the VI algorithm still exist. In this note, we develop a framework for an approximate policy iteration (PI) algorithm.

8.1 A Generic Framework

Similar to the approximate VI algorithm, we can build a scheme that approximates both the policy evaluation and the policy improvement steps, as follows:

  1. For a given policy $\pi_{k}$, our goal is to find an approximation $J_{k}$ of the true total cost $J^{\pi_{k}}$ such that
$$\left\|J_{k}-J^{\pi_{k}}\right\|_{\infty} \leq \delta \tag{8.1}$$
     Note that the true total cost $J^{\pi_{k}}$ is in general not available; the idea of Bellman residual minimization can be used here.

  2. In the same spirit as the approximate greedy step in (7.31), we can also relax the policy improvement step into an approximate policy improvement. That is, given the $k$-th value function estimate $J_{k}$, we find a policy $\pi_{k+1}$ such that
$$\left\|\mathrm{T}_{\pi_{k+1}} J_{k}-\mathrm{T}_{\mathfrak{g}} J_{k}\right\|_{\infty} \leq \epsilon, \tag{8.2}$$
     where $\epsilon>0$ is the accuracy of the inexact policy improvement.

Such a general approximate PI algorithm is given in Algorithm 10.
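Since Algorithm 10 is only referenced above, the following is a minimal Python-style sketch of the generic loop; `approx_policy_evaluation` and `approx_greedy_improvement` are hypothetical placeholders for any routines meeting the tolerances in (8.1) and (8.2).

```python
# A minimal sketch of the generic approximate PI loop of this section.
# `approx_policy_evaluation` and `approx_greedy_improvement` are hypothetical
# placeholders for any routines satisfying the tolerances in (8.1) and (8.2).
def approximate_policy_iteration(pi0, approx_policy_evaluation,
                                 approx_greedy_improvement, num_iters=100):
    pi = pi0
    for k in range(num_iters):
        J_k = approx_policy_evaluation(pi)     # ||J_k - J^{pi_k}||_inf <= delta
        pi = approx_greedy_improvement(J_k)    # ||T_{pi_{k+1}} J_k - T_g J_k||_inf <= eps
        # note: the produced policies may oscillate (see Figure 14), so one
        # typically returns the last policy or the best one seen so far
    return pi
```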

In order to determine the error bound of the approximate PI algorithm, we need the following two lemmas.

Lemma 8.1 Error bound under monotonicity

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$. Let $J \in \mathbb{R}^{K}$ satisfy
$$\mathrm{T}_{\pi} J \leq J+c \mathbf{1} \tag{8.3}$$

for some $c>0$. Then the total cost function of the policy $\pi$ is bounded as
$$J^{\pi} \leq J+\frac{c}{1-\gamma} \mathbf{1} \tag{8.4}$$


Proof.

The constant shift property of the Bellman operator $\mathrm{T}_{\pi}$ implies that for all $k \in \mathbb{N}$,
$$\mathrm{T}_{\pi}^{k} J \leq \mathrm{T}_{\pi}^{k-1} J+\gamma^{k-1} c \mathbf{1} \tag{8.5}$$

Then, for any $k$, we have
$$\begin{aligned} \mathrm{T}_{\pi}^{k} J-J &=\mathrm{T}_{\pi}^{k} J-\mathrm{T}_{\pi}^{k-1} J+\mathrm{T}_{\pi}^{k-1} J-\ldots+\mathrm{T}_{\pi} J-J \\ &=\sum_{t=1}^{k}\left(\mathrm{T}_{\pi}^{t} J-\mathrm{T}_{\pi}^{t-1} J\right) \\ & \leq \sum_{t=1}^{k} \gamma^{t-1} c \mathbf{1} \end{aligned} \tag{8.6}$$

The result follows by letting $k \rightarrow \infty$, since $\mathrm{T}_{\pi}^{k} J \rightarrow J^{\pi}$.


Lemma 8.2 Error bound of single approximate PI sweep

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, an estimate $J \in \mathbb{R}^{K}$, and two fixed policies $\pi$ and $\pi^{\prime}$. If the following two conditions hold for some $\delta \geq 0$ and $\epsilon \geq 0$,

$$\left\|J-J^{\pi}\right\|_{\infty} \leq \delta, \quad \text{and} \quad\left\|\mathrm{T}_{\pi^{\prime}} J-\mathrm{T}_{\mathfrak{g}} J\right\|_{\infty} \leq \epsilon \tag{8.7}$$

Then we have

$$\left\|J^{\pi^{\prime}}-J^{*}\right\|_{\infty} \leq \gamma\left\|J^{\pi}-J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \tag{8.8}$$


Proof.

By the contraction property of $\mathrm{T}_{\mathfrak{g}}$ and $\mathrm{T}_{\pi^{\prime}}$, the first inequality in (8.7) implies
$$\left\|\mathrm{T}_{\pi^{\prime}} J-\mathrm{T}_{\pi^{\prime}} J^{\pi}\right\|_{\infty} \leq \gamma \delta, \quad \text{and} \quad\left\|\mathrm{T}_{\mathfrak{g}} J-\mathrm{T}_{\mathfrak{g}} J^{\pi}\right\|_{\infty} \leq \gamma \delta \tag{8.9}$$

In particular, componentwise,
$$\mathrm{T}_{\pi^{\prime}} J^{\pi} \leq \mathrm{T}_{\pi^{\prime}} J+\gamma \delta \mathbf{1}, \quad \text{and} \quad \mathrm{T}_{\mathfrak{g}} J-\mathrm{T}_{\mathfrak{g}} J^{\pi} \leq \gamma \delta \mathbf{1} \tag{8.10}$$

Similarly, the second inequality in equation (8.7) yields
$$\mathrm{T}_{\pi^{\prime}} J \leq \mathrm{T}_{\mathfrak{g}} J+\epsilon \mathbf{1} \tag{8.11}$$

Then we obtain

$$\begin{aligned} \mathrm{T}_{\pi^{\prime}} J^{\pi} & \leq \mathrm{T}_{\pi^{\prime}} J+\gamma \delta \mathbf{1} \\ & \leq \mathrm{T}_{\mathfrak{g}} J+(\epsilon+\gamma \delta) \mathbf{1} \\ & \leq \mathrm{T}_{\mathfrak{g}} J^{\pi}+(\epsilon+2 \gamma \delta) \mathbf{1} \\ & \leq J^{\pi}+(\epsilon+2 \gamma \delta) \mathbf{1} \end{aligned} \tag{8.12}$$

where the second inequality is due to (8.11), the third inequality follows from the second inequality in (8.10), and the last inequality is due to the policy improvement property of $\mathrm{T}_{\mathfrak{g}}$, namely $\mathrm{T}_{\mathfrak{g}} J^{\pi} \leq \mathrm{T}_{\pi} J^{\pi}=J^{\pi}$.

By Lemma 8.1 (with $c=\epsilon+2 \gamma \delta$), we obtain
$$J^{\pi^{\prime}} \leq J^{\pi}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \mathbf{1} \tag{8.13}$$

Applying the Bellman operator $\mathrm{T}_{\pi^{\prime}}$ to both sides of this inequality further gives

$$\mathrm{T}_{\pi^{\prime}} J^{\pi^{\prime}}=J^{\pi^{\prime}} \leq \mathrm{T}_{\pi^{\prime}} J^{\pi}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \gamma \mathbf{1}. \tag{8.14}$$

Subtracting $J^{*}$ from both sides of the inequality, we get
$$\begin{aligned} J^{\pi^{\prime}}-J^{*} & \leq \mathrm{T}_{\pi^{\prime}} J^{\pi}-J^{*}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \gamma \mathbf{1} \\ & \leq \mathrm{T}_{\mathfrak{g}} J^{\pi}+(\epsilon+2 \gamma \delta) \mathbf{1}-J^{*}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \gamma \mathbf{1} \\ &=\mathrm{T}_{\mathfrak{g}} J^{\pi}-\mathrm{T}_{\mathfrak{g}} J^{*}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \mathbf{1} \end{aligned} \tag{8.15}$$

where the second inequality follows from the third inequality in (8.12), and the equality holds because $J^{*}$ is the unique fixed point of the optimal Bellman operator $\mathrm{T}_{\mathfrak{g}}$. Finally, applying the infinity norm to (8.15) yields

$$\begin{aligned} \left\|J^{\pi^{\prime}}-J^{*}\right\|_{\infty} & \leq\left\|\mathrm{T}_{\mathfrak{g}} J^{\pi}-\mathrm{T}_{\mathfrak{g}} J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \\ & \leq \gamma\left\|J^{\pi}-J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \end{aligned} \tag{8.16}$$

This completes the proof.


Finally, we summarize the error bounds of the approximate PI algorithm as follows.

Proposition 8.1 Error bound of the approximate PI algorithm

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, the sequence of policies $\pi_{k}$ produced by the approximate PI method satisfies
$$\limsup_{k \rightarrow \infty}\left\|J^{\pi_{k}}-J^{*}\right\|_{\infty} \leq \frac{\epsilon+2 \gamma \delta}{(1-\gamma)^{2}}. \tag{8.17}$$


Proof.

Given an arbitrary $\pi_{0}$, Lemma 8.2 implies
$$\left\|J^{\pi_{1}}-J^{*}\right\|_{\infty} \leq \gamma\left\|J^{\pi_{0}}-J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \tag{8.18}$$

By a straightforward induction argument, for any $k$ we have
$$\left\|J^{\pi_{k}}-J^{*}\right\|_{\infty} \leq \gamma^{k}\left\|J^{\pi_{0}}-J^{*}\right\|_{\infty}+\left(\sum_{i=0}^{k-1} \gamma^{i}\right) \frac{\epsilon+2 \gamma \delta}{1-\gamma} \tag{8.19}$$

The result follows by letting $k \rightarrow \infty$.


It should be noted that the policies produced by the approximate PI algorithm are not guaranteed to converge in policy space. That is, the approximate PI algorithm can oscillate among a set of policies, see Figure 14.

Figure 14: Illustration of potential convergence modes of the approximate PI algorithm. When the error tolerances are loose, the policy produced by the approximate PI algorithm may oscillate among several candidates, such as $\left\{\pi_{1}, \pi_{2}, \pi_{3}, \pi_{4}\right\}$. When the error tolerances are tight enough, the produced policy may converge to a single policy, such as $\pi_{1}$.

However, in some cases the algorithm does converge to a single policy. In the remainder of this note, we determine the error bound of the approximate PI algorithm when the policy converges.

Proposition 8.2 Error bound of approximate PI under convergence in policy space

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $\pi^{\prime}$ be a policy to which the approximate PI algorithm converges. Then we have

$$\left\|J^{\pi^{\prime}}-J^{*}\right\|_{\infty} \leq \frac{\epsilon+2 \gamma \delta}{1-\gamma} \tag{8.20}$$


Proof.

Let $J^{\prime} \in \mathbb{R}^{K}$ be the estimate produced by the approximate policy evaluation of $\pi^{\prime}$, so that $J^{\prime}$ and $\pi^{\prime}$ satisfy the conditions of the approximate PI algorithm

$$\left\|J^{\prime}-J^{\pi^{\prime}}\right\|_{\infty} \leq \delta, \quad \text{and} \quad\left\|\mathrm{T}_{\pi^{\prime}} J^{\prime}-\mathrm{T}_{\mathfrak{g}} J^{\prime}\right\|_{\infty} \leq \epsilon. \tag{8.21}$$

Then we have

$$\begin{aligned} \left\|\mathrm{T}_{\mathfrak{g}} J^{\pi^{\prime}}-J^{\pi^{\prime}}\right\|_{\infty} & \leq\left\|\mathrm{T}_{\mathfrak{g}} J^{\pi^{\prime}}-\mathrm{T}_{\mathfrak{g}} J^{\prime}\right\|_{\infty}+\left\|\mathrm{T}_{\mathfrak{g}} J^{\prime}-\mathrm{T}_{\pi^{\prime}} J^{\prime}\right\|_{\infty}+\left\|\mathrm{T}_{\pi^{\prime}} J^{\prime}-J^{\pi^{\prime}}\right\|_{\infty} \\ & \leq \gamma\left\|J^{\pi^{\prime}}-J^{\prime}\right\|_{\infty}+\left\|\mathrm{T}_{\mathfrak{g}} J^{\prime}-\mathrm{T}_{\pi^{\prime}} J^{\prime}\right\|_{\infty}+\gamma\left\|J^{\prime}-J^{\pi^{\prime}}\right\|_{\infty} \\ & \leq \epsilon+2 \gamma \delta \end{aligned} \tag{8.22}$$

where the first inequality follows from the triangle inequality for the infinity norm, the second from the contraction property of $\mathrm{T}_{\mathfrak{g}}$ and $\mathrm{T}_{\pi^{\prime}}$, and the last simply uses the conditions in (8.21). The inequality in (8.20) then follows by a direct application of Lemma 3.4.


Obviously, the error bound of the approximate PI algorithm under convergence in policy space is much tighter than in the oscillating case, especially when the discount factor $\gamma$ is close to 1.
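For a rough sense of the gap, consider arbitrarily chosen values $\gamma=0.9$ and $\delta=\epsilon=0.01$ (these numbers are illustrative only). The asymptotic bound of Proposition 8.1 and the bound under policy convergence of Proposition 8.2 then evaluate to

$$\frac{\epsilon+2 \gamma \delta}{(1-\gamma)^{2}}=\frac{0.028}{0.01}=2.8, \qquad \frac{\epsilon+2 \gamma \delta}{1-\gamma}=\frac{0.028}{0.1}=0.28,$$

i.e., convergence in policy space tightens the guarantee by a factor of $1/(1-\gamma)=10$ in this example.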

8.2 Approximate Policy Evaluation

The analysis of the convergence properties of the generic API shows the importance of accurate approximate policy evaluation. Strategies similar to those used to develop AVI, namely Bellman residual minimization, can also be applied to policy evaluation.

Definition 8.1 Approximate total cost function

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, a fixed policy $\pi$ and a total cost function space $\mathcal{J}$, an approximation $J \in \mathcal{J}$ of the total cost function $J^{\pi}$ is given by minimizing the Bellman residual, i.e.,

$$J_{B}^{\pi} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\infty}. \tag{8.23}$$

The error bound for the estimate $J_{B}^{\pi}$ obtained by Bellman residual minimization is given as follows.

Lemma 8.3 Approximate cost function bounds

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J^{\pi}$ be the total cost function of a fixed policy $\pi$. Then, for any total cost function $J \in \mathbb{R}^{K}$, the following inequality holds

$$\left\|J-J^{\pi}\right\|_{\infty} \leq \frac{1}{1-\gamma}\left\|J-\mathrm{T}_{\pi} J\right\|_{\infty}. \tag{8.24}$$


Proof.
Directly, we have
$$\begin{aligned} \left\|J-J^{\pi}\right\|_{\infty} &=\left\|J-\mathrm{T}_{\pi} J+\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\infty} \\ & \leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\infty}+\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\infty} \\ & \leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\infty}+\gamma\left\|J-J^{\pi}\right\|_{\infty} \end{aligned} \tag{8.25}$$

Rearranging gives (8.24).


Proposition 8.3 Bound between the estimate and the true total cost function

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, a fixed policy $\pi$ and a total cost function space $\mathcal{J}$. Let $J_{B}^{\pi} \in \mathcal{J}$ be a global minimizer of the Bellman residual minimization problem (8.23). Then the error between the estimate and the true total cost function $J^{\pi}$ is bounded as

$$\left\|J_{B}^{\pi}-J^{\pi}\right\|_{\infty} \leq \frac{1+\gamma}{1-\gamma} \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\infty}. \tag{8.26}$$


Proof.

By applying the triangle inequality of the infinity norm, we get
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J-J\right\|_{\infty} & \leq\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\infty}+\left\|J^{\pi}-J\right\|_{\infty} \\ & \leq(1+\gamma)\left\|J-J^{\pi}\right\|_{\infty}. \end{aligned} \tag{8.27}$$

It follows directly that
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J_{B}^{\pi}-J_{B}^{\pi}\right\|_{\infty} &=\min _{J \in \mathcal{J}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\infty} \\ & \leq(1+\gamma) \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\infty}. \end{aligned} \tag{8.28}$$

Combining this inequality with the result in Lemma 8.3 completes the proof.


Obviously, the objective in (8.23) is still numerically difficult to optimize due to the infinity norm. Therefore, similar to AVI, we can define the following mean squared Bellman error (MSBE) minimization problem
$$J_{2}^{\pi} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{2}. \tag{8.29}$$

If we adopt the matrix form of the Bellman operator and choose a linear function approximation space, i.e., $J=\Phi^{\top} h$ and $\mathrm{T}_{\pi} J=G_{\pi}+\gamma P_{\pi} \Phi^{\top} h$, then the above problem admits the closed-form solution

$$J_{2}^{\pi}=\Phi^{\top}\left(W_{\pi}^{\top} W_{\pi}\right)^{-1} W_{\pi}^{\top} G_{\pi} \tag{8.30}$$

where $W_{\pi}=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top}$. While this solution is simple and available in closed form, unfortunately there is no meaningful error bound describing the quality of this approximation.
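As a numerical sanity check of (8.30), the following minimal NumPy sketch evaluates the closed-form solution on a hypothetical 3-state MDP with hand-picked features; $P_{\pi}$, $G_{\pi}$ and $\Phi$ are illustrative assumptions, not values from the text.

```python
# A minimal numerical sketch of the closed-form MSBE solution (8.30),
# assuming a hypothetical 3-state MDP under a fixed policy and a
# hand-picked linear feature matrix Phi (m = 2 features, K = 3 states).
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],      # transition matrix under policy pi
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
G_pi = np.array([1.0, 2.0, 0.5])       # expected stage costs under pi
Phi = np.array([[1.0, 1.0, 1.0],       # feature matrix, shape (m, K)
                [0.0, 1.0, 2.0]])

# W_pi = (I_K - gamma * P_pi) Phi^T, as in the text
W = (np.eye(3) - gamma * P_pi) @ Phi.T

# Normal equations: W^T W h = W^T G_pi  ->  J_2 = Phi^T h
h = np.linalg.solve(W.T @ W, W.T @ G_pi)
J2 = Phi.T @ h

J_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, G_pi)  # exact J^pi
print("J_2^pi :", J2)
print("J^pi   :", J_exact)
```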

8.3 Approximate Policy Evaluation with Ergodicity

Although the MSBE minimization problem is well defined and admits a simple numerical solution, it inherits a property of DP, namely the requirement of model information. In many practical applications of SDM, there is a great need for efficient solutions to problems for which no explicit model is available. To this end, we study a special class of MDPs that enables the development of model-free DP algorithms.

8.3.1 Ergodic MDP

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, it is well known that the system transitions can be modeled as a Markov chain. In order to retrieve complete model information by sampling, every state must be reachable from every other state, so that the chain has a unique stationary distribution over the states. We therefore impose the following assumption on the Markov chain of state transitions specified by the underlying MDP model and the policy $\pi$.

Assumption 8.1 Ergodicity of the transition matrix $P_{\pi}$

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the Markov chain defined by the transition matrix $P_{\pi}$ is ergodic.

Let us use $\xi_{i}$ to denote the stationary probability of the $i$-th state. The ergodicity assumption implies that $\xi_{i}$ is positive for all $i=1, \ldots, K$, i.e., the Markov chain has a unique stationary distribution. Let us define $\xi:=\left[\xi_{1}, \ldots, \xi_{K}\right]^{\top} \in \mathbb{R}^{K}$. The relation between $\xi$ and the transition matrix $P_{\pi}$ is characterized by

$$P_{\pi}^{\top} \xi=\xi \tag{8.31}$$

Clearly, the vector $\xi$ is an eigenvector of $P_{\pi}^{\top}$ associated with the eigenvalue 1. Moreover, since all entries of $\xi$ are positive, we can define the $\xi$-weighted norm of $x \in \mathbb{R}^{K}$ as

$$\|x\|_{\xi}=\sqrt{\sum_{i=1}^{K} \xi_{i} x_{i}^{2}} \tag{8.32}$$
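For a hypothetical 3-state chain, the following NumPy sketch computes $\xi$ from (8.31) and evaluates the weighted norm (8.32); it also prints the quantities appearing in Lemma 8.4 below.

```python
# A small numerical sketch (hypothetical 3-state chain): compute the
# stationary distribution xi from P_pi^T xi = xi and evaluate the
# xi-weighted norm of (8.32).
import numpy as np

P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])

# xi is the eigenvector of P_pi^T for eigenvalue 1, normalized to sum to 1
vals, vecs = np.linalg.eig(P_pi.T)
xi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
xi = xi / xi.sum()

def xi_norm(x, xi):
    """xi-weighted norm: sqrt(sum_i xi_i * x_i^2)."""
    return np.sqrt(np.sum(xi * x**2))

J = np.array([1.0, -2.0, 3.0])
print("xi        :", xi)                     # all entries positive (ergodic)
print("||P J||_xi:", xi_norm(P_pi @ J, xi))  # never exceeds ||J||_xi (Lemma 8.4)
print("||J||_xi  :", xi_norm(J, xi))
```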

Lemma 8.4 $\xi$-weighted norm

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$. For any $K \times K$ transition probability matrix $P_{\pi}$ with an invariant distribution $\xi=\left(\xi_{1}, \ldots, \xi_{K}\right)$ having positive components, and any $J \in \mathbb{R}^{K}$, we have

$$\left\|P_{\pi} J\right\|_{\xi} \leq\|J\|_{\xi} \tag{8.33}$$


Proof.
Writing $P_{\pi}=\left\{p_{i j}\right\}$, we have
$$\begin{aligned} \left\|P_{\pi} J\right\|_{\xi}^{2} &=\sum_{i=1}^{K} \xi_{i}\left(\sum_{j=1}^{K} p_{i j} J_{j}\right)^{2} && \text{(definition)} \\ & \leq \sum_{i=1}^{K} \xi_{i} \sum_{j=1}^{K} p_{i j} J_{j}^{2} && \text{(convexity)} \\ &=\sum_{j=1}^{K} \sum_{i=1}^{K} \xi_{i} p_{i j} J_{j}^{2} && \\ &=\sum_{j=1}^{K} \xi_{j} J_{j}^{2} && \text{(stationarity of } \xi\text{)} \\ &=\|J\|_{\xi}^{2} && \text{(definition)} \end{aligned} \tag{8.34}$$


Proposition 8.4 Contraction of the Bellman operator under the $\xi$-weighted norm

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the Bellman operator $\mathrm{T}_{\pi}$ is a contraction of modulus $\gamma$ with respect to the $\xi$-weighted norm, i.e.,
$$\left\|\mathrm{T}_{\pi} J-\mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} \leq \gamma\left\|J-J^{\prime}\right\|_{\xi}. \tag{8.35}$$


Proof.
For simplicity, we use the compact representation of the Bellman operator $\mathrm{T}_{\pi} J:=G_{\pi}+\gamma P_{\pi} J$. Then we get
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J-\mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} &=\left\|\gamma P_{\pi}\left(J-J^{\prime}\right)\right\|_{\xi} \\ & \leq \gamma\left\|J-J^{\prime}\right\|_{\xi} \end{aligned} \tag{8.36}$$
which follows directly from Lemma 8.4.


Exploiting this property, we can define the mean squared Bellman error (MSBE) under the $\xi$-weighted norm as
$$J_{\beta}^{\pi} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\xi} \tag{8.37}$$

Similar to the analysis in Section 8.2, we can derive the error bounds for MSBE minimization under the $\xi$-weighted norm as follows.

Lemma 8.5 Bound under the $\xi$-weighted norm

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J^{\pi}$ be the total cost function of a fixed policy $\pi$. Then, for any total cost function $J \in \mathbb{R}^{K}$, the following inequality holds
$$\left\|J-J^{\pi}\right\|_{\xi} \leq \frac{1}{1-\gamma}\left\|J-\mathrm{T}_{\pi} J\right\|_{\xi} \tag{8.38}$$


Proof.
Directly, we have

$$\begin{aligned} \left\|J-J^{\pi}\right\|_{\xi} &=\left\|J-\mathrm{T}_{\pi} J+\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\xi} \\ & \leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\xi}+\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\xi} \\ & \leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\xi}+\gamma\left\|J-J^{\pi}\right\|_{\xi} \end{aligned} \tag{8.39}$$

Rearranging gives (8.38).


Proposition 8.5 Bound between the estimate under the $\xi$-weighted norm and the true total cost function

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, a fixed policy $\pi$ and a total cost function space $\mathcal{J}$. Let $J_{\beta}^{\pi} \in \mathcal{J}$ be a global minimizer of the $\xi$-weighted MSBE problem (8.37). Then the error between the estimate and the true total cost function $J^{\pi}$ is bounded as

$$\left\|J_{\beta}^{\pi}-J^{\pi}\right\|_{\xi} \leq \frac{1+\gamma}{1-\gamma} \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\xi}. \tag{8.40}$$


Proof.
By applying the triangle inequality of the $\xi$-weighted norm, we get

$$\begin{aligned} \left\|\mathrm{T}_{\pi} J-J\right\|_{\xi} & \leq\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\xi}+\left\|J^{\pi}-J\right\|_{\xi} \\ & \leq(1+\gamma)\left\|J-J^{\pi}\right\|_{\xi}. \end{aligned} \tag{8.41}$$

It follows directly that

$$\begin{aligned} \left\|\mathrm{T}_{\pi} J_{\beta}^{\pi}-J_{\beta}^{\pi}\right\|_{\xi} &=\min _{J \in \mathcal{J}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\xi} \\ & \leq(1+\gamma) \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\xi} \end{aligned} \tag{8.42}$$

Combining this inequality with the result of Lemma 8.5 completes the proof.

8.3.2 Mean Squared Projected Bellman Error

Finally, if we restrict ourselves to a linear function approximation scheme, we need an orthogonal projection onto $\mathcal{J}_{l}$ with respect to the $\xi$-weighted norm. Specifically, we need to solve the following minimization problem
$$\Pi_{\Phi}(J):=\Phi^{\top} \underset{h \in \mathbb{R}^{m}}{\operatorname{argmin}}\left\|J-\Phi^{\top} h\right\|_{\xi}^{2} \tag{8.43}$$

Since the least squares objective is convex, the solution is characterized by the $h$ that solves
$$\Phi \Xi \Phi^{\top} h=\Phi \Xi J \tag{8.44}$$

Since $\operatorname{rk}(\Phi)=m$, the orthogonal projection is well defined as
$$\Pi_{\Phi}(J):=\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi J \tag{8.45}$$
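The following minimal NumPy sketch builds $\Pi_{\Phi}$ as in (8.45) for the same hypothetical 3-state example and checks that it is idempotent and non-expansive in the $\xi$-weighted norm (cf. Lemma 8.6 below).

```python
# A minimal sketch of the xi-weighted orthogonal projection (8.45) on the
# hypothetical 3-state example used earlier.
import numpy as np

P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
Phi = np.array([[1.0, 1.0, 1.0],
                [0.0, 1.0, 2.0]])            # shape (m, K), rank m = 2

vals, vecs = np.linalg.eig(P_pi.T)            # stationary distribution xi
xi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
xi /= xi.sum()
Xi = np.diag(xi)

# Pi_Phi = Phi^T (Phi Xi Phi^T)^{-1} Phi Xi, as in (8.45)
Pi_Phi = Phi.T @ np.linalg.solve(Phi @ Xi @ Phi.T, Phi @ Xi)

xi_norm = lambda x: np.sqrt(np.sum(xi * x ** 2))
J, Jp = np.random.randn(3), np.random.randn(3)

print(np.allclose(Pi_Phi @ Pi_Phi, Pi_Phi))                   # idempotent
print(xi_norm(Pi_Phi @ (J - Jp)) <= xi_norm(J - Jp) + 1e-12)  # non-expansive
```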

Lemma 8.6 Non-expansiveness of the projection operator $\Pi_{\Phi}$

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$. Then the projection $\Pi_{\Phi}$ is a non-expansive operator under the $\xi$-norm, i.e.,
$$\left\|\Pi_{\Phi} J-\Pi_{\Phi} J^{\prime}\right\|_{\xi} \leq\left\|J-J^{\prime}\right\|_{\xi}. \tag{8.46}$$


Proof.
It is easy to see that

$$\begin{aligned} \left\|\Pi_{\Phi} J-\Pi_{\Phi} J^{\prime}\right\|_{\xi}^{2} &=\left\|\Pi_{\Phi}\left(J-J^{\prime}\right)\right\|_{\xi}^{2} \\ & \leq\left\|\Pi_{\Phi}\left(J-J^{\prime}\right)\right\|_{\xi}^{2}+\left\|\left(I-\Pi_{\Phi}\right)\left(J-J^{\prime}\right)\right\|_{\xi}^{2} \\ &=\left\|J-J^{\prime}\right\|_{\xi}^{2} \end{aligned} \tag{8.47}$$

The last equality follows from the Pythagorean theorem. This completes the proof.


Proposition 8.6 Contraction of the projected Bellman operator $\Pi_{\Phi} \mathrm{T}_{\pi}$

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the projected Bellman operator $\Pi_{\Phi} \mathrm{T}_{\pi}$ is a contraction of modulus $\gamma$ with respect to $\|\cdot\|_{\xi}$.


Proof.

Directly from Lemma 8.6 and Proposition 8.4, we have
$$\begin{aligned} \left\|\Pi_{\Phi} \mathrm{T}_{\pi} J-\Pi_{\Phi} \mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} & \leq\left\|\mathrm{T}_{\pi} J-\mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} \\ & \leq \gamma\left\|J-J^{\prime}\right\|_{\xi}. \end{aligned} \tag{8.48}$$


This proposition shows that there exists a unique fixed point $\widetilde{J}_{\pi} \in \mathcal{J}_{l}$ such that
$$\widetilde{J}_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi} \widetilde{J}_{\pi}.$$

Since $h \mapsto \Phi^{\top} h$ is injective, there exists a unique $h_{\pi} \in \mathbb{R}^{m}$ such that $\Phi^{\top} h_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)$. This naturally leads to another popular objective function, the mean squared projected Bellman error (MSPBE)

$$\min _{h \in \mathbb{R}^{m}}\left\|\Phi^{\top} h-\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h\right)\right\|_{\xi} \tag{8.49}$$

In the following, we derive the error bound for minimizing the MSPBE function.

Proposition 8.7.

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, let $h_{\pi}$ be defined by $\Phi^{\top} h_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)$. Then

$$\left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi} \leq \frac{1}{\sqrt{1-\gamma^{2}}}\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi} \tag{8.50}$$


Proof.
Directly, we have

$$\begin{aligned} \left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}^{2} &=\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi}^{2}+\left\|\Pi_{\Phi} J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}^{2} \\ &=\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi}^{2}+\left\|\Pi_{\Phi} \mathrm{T}_{\pi} J^{\pi}-\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)\right\|_{\xi}^{2} \\ & \leq\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi}^{2}+\gamma^{2}\left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}^{2} \end{aligned} \tag{8.51}$$

The first equality follows from the Pythagorean theorem, the second from the construction of $h_{\pi}$ and $J^{\pi}=\mathrm{T}_{\pi} J^{\pi}$, and the inequality from the contraction property of $\Pi_{\Phi} \mathrm{T}_{\pi}$. Rearranging gives (8.50).


When the true total cost function $J^{\pi}$ does not lie in the linear function approximation space, i.e., $\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi} \neq 0$, the error $\left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}$ can become large when $\gamma$ is close to 1. Therefore, it is crucial that the total cost function lies in the linear approximation space, i.e., $J^{\pi} \in \mathcal{J}_{l}$.

Since both the MSBE objective and the MSPBE objective are convex, a global minimum is guaranteed for both problems. It is therefore worthwhile to compare the quality of their solutions. To this end, we define the difference of the error bound factors as

$$l(\gamma):=\frac{1+\gamma}{1-\gamma}-\frac{1}{\sqrt{1-\gamma^{2}}} \tag{8.52}$$

Obviously, $l(0)=0$. Taking the derivative of $l$, we get

$$l^{\prime}(\gamma)=\frac{2}{(1-\gamma)^{2}}-\frac{\gamma}{\left(\sqrt{1-\gamma^{2}}\right)^{3}} \tag{8.53}$$

which is positive for all $\gamma \in[0,1)$. This means that the difference function $l$ increases monotonically as $\gamma$ goes from 0 to 1. The evaluation in Figure 15 clearly shows that as $\gamma$ approaches 1, the gap between the error bounds of MSBE minimization and MSPBE minimization grows without bound. In other words, minimizing the MSPBE function enjoys a tighter error bound than minimizing the MSBE function.
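The short sketch below simply evaluates the two bound factors for a few (arbitrarily chosen) discount factors, making the widening gap explicit.

```python
# Evaluate the MSBE bound factor (1+g)/(1-g) and the MSPBE bound factor
# 1/sqrt(1-g^2) for a few discount factors, to illustrate the growing gap.
import math

for g in (0.5, 0.9, 0.99, 0.999):
    msbe = (1 + g) / (1 - g)
    mspbe = 1 / math.sqrt(1 - g ** 2)
    print(f"gamma={g}: MSBE factor={msbe:8.1f}  MSPBE factor={mspbe:7.2f}  diff={msbe - mspbe:8.1f}")
```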


Figure 15: Error bound quotient for MSBE minimization and MSPBE minimization.

8.4 API supplement

8.4.1 Approximate PI (API)


  • We will show three different APE methods: $\ell_{2}$ MSBE, MSBE with ergodicity, and MSPBE with ergodicity.
  • In the E-Bus example, there is no approximation in the policy improvement step.
  • Policy networks in deep reinforcement learning: approximate policy improvement.

8.4.2 APE via Bellman Residual Minimisation


  • In Policy Iteration, Policy Evaluation (PE) via T π T_{\pi} Tπ leads to a fixed point J π J^{\pi} Jπ . (Quiz 2)
  • In Approximate PE, there is a Bellman error since we restrict $J$ to a subspace $\left(\Phi^{\top} h\right)$ if we apply Linear Function Approximation (LFA).

8.4.3 $\ell_{2}$ Based Bellman Residual Minimisation


  • What is the difference between $\|\cdot\|_{2}^{2}$ and $\|\cdot\|_{2}$?
  • $\|x\|_{2}^{2}=x^{\top} x, \quad \|x\|_{2}=\sqrt{x^{\top} x} \quad \left(x \in \mathbb{R}^{n}\right)$.
  • Both forms are strictly convex and share the same global minima. We do not make a strict distinction between these two terms since we only focus on the analytical solution.
  • They are quite different in numerical calculations, e.g., in their gradients.
  • In this exercise, we keep using $\|\cdot\|_{2}^{2}$, which is also more consistent with the name 'Squared' BE.

8.4.4 Recap: Closed form policy evaluation



Preliminaries: matrix derivatives

  • Matrix calculus
  • Layout conventions: given $y \in \mathbb{R}^{m}, x \in \mathbb{R}^{n}$,

$$\text{Numerator layout: } \frac{\partial y}{\partial x}:=\begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \ldots & \frac{\partial y_{1}}{\partial x_{n}} \\ & \ddots & \\ \frac{\partial y_{m}}{\partial x_{1}} & \ldots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad \text{Denominator layout: } \frac{\partial y}{\partial x}:=\begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \ldots & \frac{\partial y_{m}}{\partial x_{1}} \\ & \ddots & \\ \frac{\partial y_{1}}{\partial x_{n}} & \ldots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} \in \mathbb{R}^{n \times m}$$

  • This exercise follows the denominator-layout convention.
  • This exercise involves two kinds of matrix derivatives:
  • The derivative of a scalar $y$ by a vector $x$: gradient (vector).
  • The derivative of a vector $y$ by a vector $x$: Jacobian (matrix). A quick numerical check of the convention is sketched below.
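As a quick numerical check of the denominator-layout gradient used in the next subsections, the following sketch compares the analytic gradient $2W^{\top}(Wh-G)$ against central finite differences on randomly generated stand-ins for $W_{\pi}$ and $G_{\pi}$.

```python
# A small numerical check (hypothetical sizes) that, in denominator layout,
# the gradient of u(h)^T u(h) with u = W h - G is 2 W^T u, as used below.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))      # plays the role of W_pi (K=5, m=3)
G = rng.standard_normal(5)           # plays the role of G_pi
h = rng.standard_normal(3)

f = lambda h: (W @ h - G) @ (W @ h - G)          # scalar objective u^T u
grad_analytic = 2 * W.T @ (W @ h - G)            # denominator-layout gradient

eps = 1e-6                                       # central finite differences
grad_fd = np.array([(f(h + eps * e) - f(h - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True
```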

8.4.5 $\ell_{2}$ Based Bellman Residual Minimisation

  • $\ell_{2}$ least squares objective:

$$\begin{aligned} J_{2}^{\pi} &\in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{2}^{2}, \quad \text{where } J=\Phi^{\top} h, \\ \left\|\mathrm{T}_{\pi} J-J\right\|_{2}^{2} &=\left\|J-\mathrm{T}_{\pi} J\right\|_{2}^{2}=\left\|J-G_{\pi}-\gamma P_{\pi} J\right\|_{2}^{2} \\ &=\left\|\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} h-G_{\pi}\right\|_{2}^{2} \end{aligned}$$

  • Let $W_{\pi}=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \in \mathbb{R}^{K \times m}$, we have

$$\left\|\mathrm{T}_{\pi} J-J\right\|_{2}^{2}=\left\|W_{\pi} h-G_{\pi}\right\|_{2}^{2}=\left(W_{\pi} h-G_{\pi}\right)^{\top}\left(W_{\pi} h-G_{\pi}\right)$$

  • Since the least squares objective is convex, the minimum is attained where the derivative equals zero.

  • Let $\mathbf{u}=W_{\pi} h-G_{\pi} \in \mathbb{R}^{K \times 1}$, we get

$$\frac{\partial \mathbf{u}^{\top} \mathbf{u}}{\partial \mathbf{u}}=2 \mathbf{u}, \quad \frac{\partial \mathbf{u}^{\top} \mathbf{u}}{\partial h}=2 \frac{\partial \mathbf{u}}{\partial h} \mathbf{u}, \quad \text{where } \frac{\partial \mathbf{u}}{\partial h}=W_{\pi}^{\top} \quad \text{(denominator layout)}$$

$$\begin{aligned} \frac{\partial\left(W_{\pi} h-G_{\pi}\right)^{\top}\left(W_{\pi} h-G_{\pi}\right)}{\partial h} &=2 W_{\pi}^{\top}\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \in \mathbb{R}^{m \times 1} \\ \Rightarrow \quad W_{\pi}^{\top} W_{\pi} h-W_{\pi}^{\top} G_{\pi} &=\mathbf{0} \\ W_{\pi}^{\top} W_{\pi} h &=W_{\pi}^{\top} G_{\pi} \end{aligned}$$

  • $W_{\pi}^{\top}$ is not a square matrix (hence not invertible), so we instead invert $\left(W_{\pi}^{\top} W_{\pi}\right) \in \mathbb{R}^{m \times m}$:

$$\begin{aligned} h &=\left(W_{\pi}^{\top} W_{\pi}\right)^{-1} W_{\pi}^{\top} G_{\pi} \\ J_{2}^{\pi} &=\Phi^{\top} h=\Phi^{\top}\left(W_{\pi}^{\top} W_{\pi}\right)^{-1} W_{\pi}^{\top} G_{\pi} \end{aligned}$$
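A small numerical aside, again with randomly generated stand-ins for $W_{\pi}$ and $G_{\pi}$: the normal equations above and `np.linalg.lstsq` solve the same least-squares problem; the latter is usually preferred when $W_{\pi}^{\top} W_{\pi}$ is ill-conditioned, which is one concrete way in which $\|\cdot\|_{2}$ and $\|\cdot\|_{2}^{2}$ differ numerically.

```python
# The normal equations and np.linalg.lstsq give the same least-squares
# solution; lstsq (QR/SVD-based) is usually the numerically safer choice.
import numpy as np

rng = np.random.default_rng(1)
W_pi = rng.standard_normal((6, 3))       # stand-in for (I - gamma P_pi) Phi^T
G_pi = rng.standard_normal(6)            # stand-in for the stage-cost vector

h_normal = np.linalg.solve(W_pi.T @ W_pi, W_pi.T @ G_pi)   # normal equations
h_lstsq, *_ = np.linalg.lstsq(W_pi, G_pi, rcond=None)      # direct LS solver
print(np.allclose(h_normal, h_lstsq))                      # True
```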

8.4.6 Approximate PI (API) with LFA + MSBE


  • What is $\xi$? $\rightarrow$ Ergodic MDP.
  • $\Xi \in \mathbb{R}^{K \times K}$: a diagonal matrix with diagonal elements $\xi_{i}$. (The 14th Greek letter: $\Xi, \xi$.)
  • Similar to before, let $W_{\pi}:=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \in \mathbb{R}^{K \times m}$:

$$\left\|\mathrm{T}_{\pi}\left(\Phi^{\top} h\right)-\Phi^{\top} h\right\|_{\xi}^{2}=\left\|\Phi^{\top} h-G_{\pi}-\gamma P_{\pi} \Phi^{\top} h\right\|_{\xi}^{2}=\left\|W_{\pi} h-G_{\pi}\right\|_{\xi}^{2}$$

  • The $\xi$-norm is defined as:

$$\left\|W_{\pi} h-G_{\pi}\right\|_{\xi}^{2}=\left(W_{\pi} h-G_{\pi}\right)^{\top} \Xi\left(W_{\pi} h-G_{\pi}\right)$$

  • $\Xi \in \mathbb{R}^{K \times K}$: a diagonal matrix with diagonal elements $\xi_{i}$.

  • Again, the least squares objective is convex, so the derivative should equal zero. Let $\mathbf{u}=W_{\pi} h-G_{\pi} \in \mathbb{R}^{K \times 1}$, we get

$$\begin{gathered} \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial \mathbf{u}}=2 \Xi \mathbf{u}, \quad \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial h}=2 \frac{\partial \mathbf{u}}{\partial h} \Xi \mathbf{u}, \quad \text{where } \frac{\partial \mathbf{u}}{\partial h}=W_{\pi}^{\top} \\ \frac{\partial\left(W_{\pi} h-G_{\pi}\right)^{\top} \Xi\left(W_{\pi} h-G_{\pi}\right)}{\partial h}=2 W_{\pi}^{\top} \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \in \mathbb{R}^{m \times 1} \\ W_{\pi}^{\top} \Xi W_{\pi} h=W_{\pi}^{\top} \Xi G_{\pi} \\ h=\left(W_{\pi}^{\top} \Xi W_{\pi}\right)^{-1} W_{\pi}^{\top} \Xi G_{\pi} \\ \Rightarrow \quad J_{\xi}^{\pi}=\Phi^{\top}\left(W_{\pi}^{\top} \Xi W_{\pi}\right)^{-1} W_{\pi}^{\top} \Xi G_{\pi} \end{gathered}$$

  • When $\Xi$ is the identity matrix, we recover the same result as for the $\ell_{2}$ MSBE.

8.4.7 Approximate PI (API) with LFA + $\xi$-weighted MSBE


8.4.8 Mean Squared Projected Bellman Error (MSPBE)


  • Since $\Pi_{\Phi} J=J$ for $J=\Phi^{\top} h$,

$$\begin{aligned} \left\|\Pi_{\Phi} \mathrm{T}_{\pi} J-J\right\|_{\xi}^{2} &=\left\|J-\Pi_{\Phi}\left(G_{\pi}+\gamma P_{\pi} J\right)\right\|_{\xi}^{2}=\left\|\Pi_{\Phi} J-\gamma \Pi_{\Phi} P_{\pi} J-\Pi_{\Phi} G_{\pi}\right\|_{\xi}^{2} \\ &=\left\|\Pi_{\Phi}\left(\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} h-G_{\pi}\right)\right\|_{\xi}^{2} \end{aligned}$$

  • Let $W_{\pi}=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \in \mathbb{R}^{K \times m}$, we have $\left\|\Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right)\right\|_{\xi}^{2}$.

  • The orthogonal projector is $\Pi_{\Phi}:=\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \in \mathbb{R}^{K \times K}$.

  • Similarly to before, let $\mathbf{u}=\Pi_{\Phi} W_{\pi} h-\Pi_{\Phi} G_{\pi} \in \mathbb{R}^{K \times 1}$, we get

$$\begin{gathered} \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial \mathbf{u}}=2 \Xi \mathbf{u}, \quad \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial h}=2 \frac{\partial \mathbf{u}}{\partial h} \Xi \mathbf{u}, \quad \text{where } \frac{\partial \mathbf{u}}{\partial h}=\left(\Pi_{\Phi} W_{\pi}\right)^{\top} \\ \frac{\partial\left(\Pi_{\Phi} W_{\pi} h-\Pi_{\Phi} G_{\pi}\right)^{\top} \Xi\left(\Pi_{\Phi} W_{\pi} h-\Pi_{\Phi} G_{\pi}\right)}{\partial h}=2 W_{\pi}^{\top} \Pi_{\Phi}^{\top} \Xi \Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \in \mathbb{R}^{m \times 1} \end{gathered}$$

  • Since $\left(\Phi \Xi \Phi^{\top}\right)^{-1}$ is symmetric, we have $\Pi_{\Phi}^{\top}=\Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi$, hence:

$$\begin{gathered} W_{\pi}^{\top} \overbrace{\Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi}^{\Pi_{\Phi}^{\top}}\, \Xi\, \overbrace{\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi}^{\Pi_{\Phi}}\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0}, \\ \underbrace{W_{\pi}^{\top} \Xi \Phi^{\top}}_{\text{full rank, invertible}}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \quad\Rightarrow\quad \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0}, \\ \Phi \Xi W_{\pi} h=\Phi \Xi G_{\pi} \quad\Rightarrow\quad h=\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \quad\Rightarrow\quad J_{\xi}^{\pi}=\Phi^{\top}\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \end{gathered}$$

  • We have proved that $\Pi_{\Phi} \mathrm{T}_{\pi}$ is a contraction mapping, which leads to a fixed point; at this fixed point the MSPBE equals zero:

$$\Pi_{\Phi} \mathrm{T}_{\pi} J-J=\mathbf{0} \in \mathbb{R}^{K \times 1} \quad\Rightarrow\quad \Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right)=\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0}$$

  • Left-multiply with $\Phi \Xi \in \mathbb{R}^{m \times K}$ on both sides:

$$\begin{gathered} \Phi \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \\ \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \\ \Phi \Xi W_{\pi} h=\Phi \Xi G_{\pi} \\ \Rightarrow \quad h=\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \\ \Rightarrow \quad J_{\xi}^{\pi}=\Phi^{\top}\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \end{gathered}$$
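The following minimal NumPy sketch evaluates this closed-form MSPBE solution on the same hypothetical 3-state MDP used earlier and checks the fixed-point relation $\Phi^{\top} h_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)$.

```python
# A minimal sketch of the closed-form MSPBE solution, reusing the
# hypothetical 3-state MDP and features from the earlier snippets.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
G_pi = np.array([1.0, 2.0, 0.5])
Phi = np.array([[1.0, 1.0, 1.0],
                [0.0, 1.0, 2.0]])

vals, vecs = np.linalg.eig(P_pi.T)                    # stationary distribution
xi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
xi /= xi.sum()
Xi = np.diag(xi)

W = (np.eye(3) - gamma * P_pi) @ Phi.T                # W_pi
h = np.linalg.solve(Phi @ Xi @ W, Phi @ Xi @ G_pi)    # h = (Phi Xi W)^{-1} Phi Xi G
J_mspbe = Phi.T @ h

# Fixed-point check: Phi^T h should equal Pi_Phi T_pi (Phi^T h)
Pi_Phi = Phi.T @ np.linalg.solve(Phi @ Xi @ Phi.T, Phi @ Xi)
T_J = G_pi + gamma * P_pi @ J_mspbe
print(np.allclose(J_mspbe, Pi_Phi @ T_J))             # True
```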

8.4.9 Approximate PI (API) with LFA + $\xi$-weighted MSPBE


8.4.10 Approximate PI Summary


  • Three different APE methods in closed form: $\ell_{2}$ MSBE, MSBE with ergodicity, MSPBE with ergodicity;
  • The estimation error bound $\delta$ for the above three APE methods is discussed in the lecture.


Origin blog.csdn.net/qq_37266917/article/details/122315269