Note 8 Approximate Policy Iteration
Contents

- 8.1 A Generic Framework
- Lemma 8.1 Error bound under monotonicity
- Lemma 8.2 Error bound of a single approximate PI sweep
- Proposition 8.1 Error bound of the approximate PI algorithm
- Proposition 8.2 Error bounds of approximate PI under convergence in policy space
- 8.2 Approximate Policy Evaluation
- 8.3 Approximate Policy Evaluation with Ergodicity
- 8.3.1 Ergodic MDP
- Assumption 8.1 Ergodicity of the transition matrix $P_{\pi}$
- Lemma 8.4 $\xi$-weighted norm
- Proposition 8.4 Contraction of the Bellman operator under the $\xi$-weighted norm
- Lemma 8.5 Bound under the $\xi$-weighted norm
- Proposition 8.5 Bound between the estimate under the $\xi$-weighted norm and the true total cost function
- 8.3.2 Mean Squared Projected Bellman Error
- 8.4 API supplement
- 8.4.1 Approximate PI (API)
- 8.4.2 APE via Bellman Residual Minimisation
- 8.4.3 $\ell_2$-Based Bellman Residual Minimisation
- 8.4.4 Recap: Closed-form policy evaluation
- 8.4.5 $\ell_2$-Based Bellman Residual Minimisation
- 8.4.6 Approximate PI (API) with LFA + MSBE
- 8.4.7 Approximate PI (API) with LFA + $\xi$-weighted MSBE
- 8.4.8 Mean Squared Projected Bellman Error (MSPBE)
- 8.4.9 Approximate PI (API) with LFA + $\xi$-weighted MSPBE
- 8.4.10 Approximate PI Summary
In Note 7, we introduced parametric function approximation and its application in approximate iterative algorithms. Although the convergence properties of AVI have proven promising, AVI is not the same as the original VI algorithm, whose inherent limitations still remain. In this note, we develop a framework for an approximate policy iteration (PI) algorithm.
8.1 A Generic Framework
Similar to the approximate VI algorithm, we can build a scheme that approximates both the policy evaluation and the policy improvement steps, as follows.

- For a given policy $\pi_k$, our goal is to find an approximation $J_k$ of the true total cost $J^{\pi_k}$ satisfying
$$\left\|J_{k}-J^{\pi_{k}}\right\|_{\infty} \leq \delta \tag{8.1}$$
Note that the true total cost $J^{\pi_k}$ is in general not available; the idea of Bellman residual minimization can be used here.
- Analogously to the approximate greedy step in (7.31), we can relax the policy improvement into an approximate policy improvement. That is, given the $k$-th value function estimate $J_k$, we find a policy $\pi_{k+1}$ such that
$$\left\|\mathrm{T}_{\pi_{k+1}} J_{k}-\mathrm{T}_{\mathfrak{g}} J_{k}\right\|_{\infty} \leq \epsilon, \tag{8.2}$$
where $\epsilon > 0$ is the accuracy of the inexact policy improvement.
Such a general approximate PI algorithm is given in Algorithm 10.
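To make the loop concrete, here is a minimal Python sketch of the generic scheme on a small tabular MDP. This is an illustration only, not Algorithm 10 verbatim: the inexact evaluation step is simulated by perturbing the exact evaluation within the tolerance $\delta$, and the improvement step is exactly greedy (so $\epsilon = 0$). The array shapes and the noise model are our own assumptions.

```python
import numpy as np

def approx_policy_iteration(P, G, gamma, n_iters=50, delta=1e-3, rng=None):
    """Generic approximate PI on a tabular MDP (illustrative sketch).

    P: (U, K, K) transition matrices, one per action.
    G: (K, U) stage costs.
    Inexact evaluation is mimicked by noise bounded by `delta`.
    """
    rng = np.random.default_rng(rng)
    K, U = G.shape
    pi = np.zeros(K, dtype=int)                     # arbitrary initial policy
    J = np.zeros(K)
    for _ in range(n_iters):
        # exact evaluation, then perturb to satisfy ||J_k - J^{pi_k}||_inf <= delta
        P_pi = P[pi, np.arange(K)]                  # (K, K): row k follows action pi[k]
        g_pi = G[np.arange(K), pi]
        J_exact = np.linalg.solve(np.eye(K) - gamma * P_pi, g_pi)
        J = J_exact + rng.uniform(-delta, delta, size=K)
        # greedy improvement: Q[k, u] = G[k, u] + gamma * sum_j P[u, k, j] J[j]
        Q = G + gamma * np.einsum('ukj,j->ku', P, J)
        pi = Q.argmin(axis=1)
    return pi, J
```

On a toy MDP where one action is strictly cheaper everywhere, the loop settles on that action and the evaluation error stays within the prescribed tolerance.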
To determine the error bound of the approximate PI algorithm, we need the following two lemmas.
Lemma 8.1 Error bound under monotonicity
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, let $J \in \mathbb{R}^{K}$ satisfy
$$\mathrm{T}_{\pi} J \leq J+c \mathbf{1} \tag{8.3}$$
for some $c>0$. Then the total cost function of policy $\pi$ satisfies
$$J^{\pi} \leq J+\frac{c}{1-\gamma} \mathbf{1} \tag{8.4}$$
Proof.
The constant shift property of the Bellman operator $\mathrm{T}_{\pi}$ implies that for all $k \in \mathbb{N}$,
$$\mathrm{T}_{\pi}^{k} J \leq \mathrm{T}_{\pi}^{k-1} J+\gamma^{k-1} c \mathbf{1} \tag{8.5}$$
Then, for any $k$, the telescoping sum gives
$$\begin{aligned} \mathrm{T}_{\pi}^{k} J-J &=\mathrm{T}_{\pi}^{k} J-\mathrm{T}_{\pi}^{k-1} J+\mathrm{T}_{\pi}^{k-1} J-\ldots+\mathrm{T}_{\pi} J-J \\ &=\sum_{t=1}^{k}\left(\mathrm{T}_{\pi}^{t} J-\mathrm{T}_{\pi}^{t-1} J\right) \\ & \leq \sum_{t=1}^{k} \gamma^{t-1} c \mathbf{1} \end{aligned} \tag{8.6}$$
The result follows by letting $k \rightarrow \infty$, since $\mathrm{T}_{\pi}^{k} J \rightarrow J^{\pi}$.
Lemma 8.2 Error bound of single approximate PI sweep
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, an estimate $J \in \mathbb{R}^{K}$, and two fixed policies $\pi$ and $\pi^{\prime}$, suppose the following two conditions hold for some $\delta \geq 0$ and $\epsilon \geq 0$:
$$\left\|J-J^{\pi}\right\|_{\infty} \leq \delta, \quad \text{and} \quad\left\|\mathrm{T}_{\pi^{\prime}} J-\mathrm{T}_{\mathfrak{g}} J\right\|_{\infty} \leq \epsilon \tag{8.7}$$
Then we have
$$\left\|J^{\pi^{\prime}}-J^{*}\right\|_{\infty} \leq \gamma\left\|J^{\pi}-J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma}\tag{8.8}$$
Proof.
By the contraction property of $\mathrm{T}_{\mathfrak{g}}$ and $\mathrm{T}_{\pi^{\prime}}$, the first inequality in (8.7) implies
$$\left\|\mathrm{T}_{\pi^{\prime}} J-\mathrm{T}_{\pi^{\prime}} J^{\pi}\right\|_{\infty} \leq \gamma \delta, \quad \text{and} \quad\left\|\mathrm{T}_{\mathfrak{g}} J-\mathrm{T}_{\mathfrak{g}} J^{\pi}\right\|_{\infty} \leq \gamma \delta \tag{8.9}$$
so that
$$\mathrm{T}_{\pi^{\prime}} J^{\pi} \leq \mathrm{T}_{\pi^{\prime}} J+\gamma \delta \mathbf{1}, \quad \text{and} \quad \mathrm{T}_{\mathfrak{g}} J-\mathrm{T}_{\mathfrak{g}} J^{\pi} \leq \gamma \delta \mathbf{1} \tag{8.10}$$
Similarly, the second inequality in equation (8.7) yields
$$\mathrm{T}_{\pi^{\prime}} J \leq \mathrm{T}_{\mathfrak{g}} J+\epsilon \mathbf{1} \tag{8.11}$$
Then we get
$$\begin{aligned} \mathrm{T}_{\pi^{\prime}} J^{\pi} & \leq \mathrm{T}_{\pi^{\prime}} J+\gamma \delta \mathbf{1} \\ & \leq \mathrm{T}_{\mathfrak{g}} J+(\epsilon+\gamma\delta) \mathbf{1} \\ & \leq \mathrm{T}_{\mathfrak{g}} J^{\pi}+(\epsilon+2 \gamma \delta) \mathbf{1} \\ & \leq J^{\pi}+(\epsilon+2 \gamma \delta) \mathbf{1} \end{aligned} \tag{8.12}$$
where the second inequality is due to (8.11), the third follows from the second inequality in (8.10), and the last is due to the policy improvement property of $\mathrm{T}_{\mathfrak{g}}$, namely $\mathrm{T}_{\mathfrak{g}} J^{\pi} \leq \mathrm{T}_{\pi} J^{\pi}=J^{\pi}$.
Applying Lemma 8.1 then yields
$$J^{\pi^{\prime}} \leq J^{\pi}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \mathbf{1} \tag{8.13}$$
Applying the Bellman operator $\mathrm{T}_{\pi^{\prime}}$ to both sides of this inequality gives
$$\mathrm{T}_{\pi^{\prime}} J^{\pi^{\prime}}=J^{\pi^{\prime}} \leq \mathrm{T}_{\pi^{\prime}} J^{\pi}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \gamma \mathbf{1} \tag{8.14}$$
Subtracting $J^{*}$ from both sides, we get
$$\begin{aligned} J^{\pi^{\prime}}-J^{*} & \leq \mathrm{T}_{\pi^{\prime}} J^{\pi}-J^{*}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \gamma \mathbf{1} \\ & \leq \mathrm{T}_{\mathfrak{g}} J^{\pi}+(\epsilon+2 \gamma \delta) \mathbf{1}-J^{*}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \gamma \mathbf{1} \\ &=\mathrm{T}_{\mathfrak{g}} J^{\pi}-\mathrm{T}_{\mathfrak{g}} J^{*}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \mathbf{1} \end{aligned} \tag{8.15}$$
where the second inequality follows from the third inequality in (8.12), and the equality holds because $J^{*}$ is the unique fixed point of the optimal Bellman operator $\mathrm{T}_{\mathfrak{g}}$. Finally, taking the infinity norm in (8.15),
$$\begin{aligned} \left\|J^{\pi^{\prime}}-J^{*}\right\|_{\infty} & \leq\left\|\mathrm{T}_{\mathfrak{g}} J^{\pi}-\mathrm{T}_{\mathfrak{g}} J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \\ & \leq \gamma\left\|J^{\pi}-J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \end{aligned} \tag{8.16}$$
This completes the proof.
Finally, we summarize the error bounds of the approximate PI algorithm as follows.
Proposition 8.1 Error bound of the approximate PI algorithm
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, the policies $\pi_k$ produced by the approximate PI method satisfy
$$\lim_{k \rightarrow \infty}\left\|J^{\pi_{k}}-J^{*}\right\|_{\infty} \leq \frac{\epsilon+2 \gamma \delta}{(1-\gamma)^{2}} \tag{8.17}$$
Proof.
Given an arbitrary $\pi_0$, Lemma 8.2 implies
$$\left\|J^{\pi_{1}}-J^{*}\right\|_{\infty} \leq \gamma\left\|J^{\pi_{0}}-J^{*}\right\|_{\infty}+\frac{\epsilon+2 \gamma \delta}{1-\gamma} \tag{8.18}$$
By a straightforward induction argument, for any $k$ we have
$$\left\|J^{\pi_{k}}-J^{*}\right\|_{\infty} \leq \gamma^{k}\left\|J^{\pi_{0}}-J^{*}\right\|_{\infty}+\left(\sum_{i=0}^{k-1} \gamma^{i}\right) \frac{\epsilon+2 \gamma\delta}{1-\gamma} \tag{8.19}$$
The result follows by letting $k \rightarrow \infty$.
It should be noted that the policies produced by the approximate PI algorithm are not guaranteed to converge in the policy space. That is, the approximate PI algorithm can oscillate among a set of policies; see Figure 14.
Figure 14: Illustration of potential convergence modes of the approximate PI algorithm. When the error tolerances are loose, the policy produced by the approximate PI algorithm may oscillate among several candidates, such as $\left\{\pi_{1}, \pi_{2}, \pi_{3}, \pi_{4}\right\}$. When the error tolerances are tight enough, the resulting policy may converge to a single policy, such as $\pi_{1}$.
However, in some cases the algorithm does converge to a single policy. In the remainder of this note, we determine the error bound of the approximate PI algorithm when the policy converges.
Proposition 8.2 Error bounds of approximate PI under convergence in policy space
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $\pi^{\prime}$ be a policy to which the approximate PI algorithm converges. Then we have
$$\left\|J^{\pi^{\prime}}-J^{*}\right\|_{\infty} \leq \frac{\epsilon+2\gamma\delta}{1-\gamma}\tag{8.20}$$
Proof.
Let $J^{\prime} \in \mathbb{R}^{K}$ be the estimate produced by the approximate policy evaluation of $\pi^{\prime}$. Then $J^{\prime}$ and $\pi^{\prime}$ satisfy the conditions of the approximate PI algorithm,
$$\left\|J^{\prime}-J^{\pi^{\prime}}\right\|_{\infty} \leq \delta, \quad \text{and} \quad\left\|\mathrm{T}_{\pi^{\prime}} J^{\prime}-\mathrm{T}_{\mathfrak{g}} J^{\prime}\right\|_{\infty} \leq \epsilon \tag{8.21}$$
Then, we have
$$\begin{aligned} \left\|\mathrm{T}_{\mathfrak{g}} J^{\pi^{\prime}}-J^{\pi^{\prime}}\right\|_{\infty} & \leq \left\|\mathrm{T}_{\mathfrak{g}} J^{\pi^{\prime}}-\mathrm{T}_{\mathfrak{g}} J^{\prime}\right\|_{\infty}+\left\|\mathrm{T}_{\mathfrak{g}} J^{\prime}-\mathrm{T}_{\pi^{\prime}} J^{\prime}\right\|_{\infty}+\left\|\mathrm{T}_{\pi^{\prime}} J^{\prime}-J^{\pi^{\prime}}\right\|_{\infty} \\ & \leq \gamma\left\|J^{\pi^{\prime}}-J^{\prime}\right\|_{\infty}+\left\|\mathrm{T}_{\mathfrak{g}} J^{\prime}-\mathrm{T}_{\pi^{\prime}} J^{\prime}\right\|_{\infty}+\gamma\left\|J^{\prime}-J^{\pi^{\prime}}\right\|_{\infty} \\ & \leq \epsilon+2 \gamma \delta \end{aligned} \tag{8.22}$$
where the first inequality follows from the triangle inequality of the infinity norm, the second from the contraction property of $\mathrm{T}_{\mathfrak{g}}$ and $\mathrm{T}_{\pi^{\prime}}$, and the last simply recalls the conditions in (8.21). The inequality in (8.20) is then a direct application of Lemma 3.4.
Clearly, the error bound of the approximate PI algorithm under convergence in policy space is much tighter than in the oscillating case, especially when the discount factor $\gamma$ is close to 1.
8.2 Approximate Policy Evaluation
The analysis of the convergence properties of the generic API shows the importance of accurate approximate policy evaluation. Strategies similar to those used to develop AVI, such as minimizing Bellman residuals, can also be applied to policy evaluation.
Definition 8.1 Approximate total cost function
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, a fixed policy $\pi$, and a total cost function space $\mathcal{J}$, the approximate total cost function $J_B^{\pi} \in \mathcal{J}$ of $J^{\pi}$ is given by minimizing the Bellman residual, i.e.,
$$J_{B}^{\pi} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\infty} \tag{8.23}$$
The error bound of the estimate $J_{B}^{\pi}$ obtained by minimizing the Bellman residual is as follows.
Lemma 8.3 Approximate cost function bounds
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J^{\pi}$ be the total cost function of a fixed policy $\pi$. Then, for any total cost function $J\in \mathbb{R}^{K}$, the following inequality holds
$$\left\|J-J^{\pi}\right\|_{\infty} \leq \frac{1}{1-\gamma}\left\|J-\mathrm{T}_{\pi} J\right\|_{\infty} \tag{8.24}$$
Proof.
We directly have
$$\begin{aligned} \left\|J-J^{\pi}\right\|_{\infty} &=\left\|J-\mathrm{T}_{\pi} J+\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\infty} \\ &\leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\infty}+\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\infty} \\ & \leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\infty}+\gamma\left\|J-J^{\pi}\right\|_{\infty} \end{aligned} \tag{8.25}$$
where the first inequality is the triangle inequality and the second uses $J^{\pi}=\mathrm{T}_{\pi} J^{\pi}$ together with the contraction property of $\mathrm{T}_{\pi}$. Rearranging yields (8.24).
Proposition 8.3 Bound between the estimate and the true total cost function
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, a fixed policy $\pi$, and a total cost function space $\mathcal{J}$, let $J_{B}^{\pi} \in \mathcal{J}$ be the global minimizer of problem (8.23). Then the error between the estimate and the true total cost function $J^{\pi}$ satisfies
$$\left\|J_{B}^{\pi}-J^{\pi}\right\|_{\infty} \leq \frac{1+\gamma}{1-\gamma} \min_{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\infty} \tag{8.26}$$
Proof.
By the triangle inequality for the infinity norm, for any $J \in \mathcal{J}$ we get
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J-J\right\|_{\infty} & \leq\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\infty}+\left\|J^{\pi}-J\right\|_{\infty} \\ & \leq(1+\gamma)\left\|J-J^{\pi}\right\|_{\infty} \end{aligned} \tag{8.27}$$
It follows directly that
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J_{B}^{\pi}-J_{B}^{\pi}\right\|_{\infty} &=\min _{J \in \mathcal{J}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\infty} \\ & \leq(1+\gamma) \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\infty} \end{aligned} \tag{8.28}$$
Combining this inequality with the result in Lemma 8.3 completes the proof.
Clearly, the objective in (8.23) is still numerically difficult to optimize. Therefore, similar to AVI, we can define the following mean squared Bellman error (MSBE) minimization problem
$$J_{2}^{\pi} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{2} \tag{8.29}$$
If we adopt the matrix form of the Bellman operator and choose a linear function approximation space, that is, $\mathrm{T}_{\pi} J=G_{\pi}+\gamma P_{\pi} \Phi^{\top} h$ with $J=\Phi^{\top} h$, then the above problem admits the closed-form solution
$$J_{2}^{\pi}=\Phi^{\top} h_{2}^{\pi}, \quad h_{2}^{\pi}=\left(W_{\pi}^{\top} W_{\pi}\right)^{-1} W_{\pi}^{\top} G_{\pi} \tag{8.30}$$
where $W_{\pi}=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top}$. While this solution is simple and always well defined, unfortunately there are no meaningful error bounds describing the quality of this approximation.
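As a sanity check of the closed form above, the following Python sketch builds $W_\pi$ and solves the least-squares problem; the function name and array conventions are our own.

```python
import numpy as np

def brm_closed_form(P_pi, G_pi, Phi_T, gamma):
    """ell_2 Bellman residual minimisation with LFA (cf. eq. 8.30).

    P_pi: (K, K) transition matrix of the fixed policy,
    G_pi: (K,) stage costs, Phi_T: (K, m) feature matrix Phi^T.
    Returns the parameter h minimising ||T_pi(Phi^T h) - Phi^T h||_2.
    """
    K = P_pi.shape[0]
    W = (np.eye(K) - gamma * P_pi) @ Phi_T       # W_pi = (I_K - gamma P_pi) Phi^T
    # least-squares solution of W h ~= G_pi, i.e. h = (W^T W)^{-1} W^T G_pi
    h, *_ = np.linalg.lstsq(W, G_pi, rcond=None)
    return h
```

With full-rank features ($\Phi^\top = I_K$), the residual can be driven to zero and $\Phi^\top h$ recovers $J^\pi = (I_K - \gamma P_\pi)^{-1} G_\pi$ exactly.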
8.3 Approximate Policy Evaluation with Ergodicity
Although the MSBE minimization problem is well defined and admits a simple numerical solution, it inherits a key limitation of DP: the requirement for model information. In many practical SDM applications, efficient solutions are needed for problems where no explicit model is available. In the following, we study a special class of MDPs that enables the development of model-free DP algorithms.
8.3.1 Ergodic MDP
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the system transitions can be modeled as a Markov chain. To recover complete model information by sampling, every state must be reachable from any other state, so that the chain has a unique stationary distribution over the states. Therefore, we impose the following assumption on the Markov chain of state transitions induced by the underlying MDP model and the policy $\pi$.
Assumption 8.1 Ergodicity of the transition matrix $P_{\pi}$
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the Markov chain defined by the transition matrix $P_{\pi}$ is ergodic.
Let $\xi_{i}$ denote the stationary probability of state $i$. The ergodicity assumption implies that $\xi_{i} > 0$ for all $i=1, \ldots, K$; that is, the Markov chain has a unique stationary distribution. Define $\xi:=\left[\xi_{1}, \ldots, \xi_{K}\right]^{\top} \in \mathbb{R}^{K}$. The relation between $\xi$ and the transition matrix $P_{\pi}$ is characterized by
$$P_{\pi}^{\top} \xi=\xi \tag{8.31}$$
Clearly, the vector $\xi$ is the eigenvector of $P_{\pi}^{\top}$ associated with the eigenvalue 1. Moreover, since all entries of $\xi$ are positive, we can define the $\xi$-weighted norm as
$$\|x\|_{\xi}=\sqrt{\sum_{i=1}^{K} \xi_{i} x_{i}^{2}} \tag{8.32}$$
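For a concrete illustration, the following Python sketch computes the stationary distribution of an ergodic chain via the eigenvector relation (8.31) and evaluates the $\xi$-weighted norm (8.32); the helper names are our own.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution xi with P_pi^T xi = xi (eq. 8.31)."""
    vals, vecs = np.linalg.eig(P_pi.T)
    xi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])  # eigenvector for eigenvalue 1
    xi = np.abs(xi)                                        # Perron vector has one sign
    return xi / xi.sum()                                   # normalise to a distribution

def xi_norm(x, xi):
    """xi-weighted norm (eq. 8.32)."""
    return np.sqrt(np.sum(xi * x ** 2))
```

On a two-state chain one can also verify numerically that $\|P_\pi J\|_\xi \le \|J\|_\xi$, i.e. the statement of Lemma 8.4 below.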
Lemma 8.4 $\xi$-weighted norm
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, for any $K \times K$ transition probability matrix $P_{\pi}$ with invariant distribution $\xi=\left(\xi_{1}, \ldots, \xi_{K}\right)$ whose components are all positive, we have
$$\left\|P_{\pi} J\right\|_{\xi} \leq\|J\|_{\xi} \tag{8.33}$$
Proof.
Let $P_{\pi}=\left\{p_{ij}\right\}$. Then
$$\begin{aligned} \left\|P_{\pi} J\right\|_{\xi}^{2} & =\sum_{i=1}^{K} \xi_{i}\left(\sum_{j=1}^{K} p_{ij} J_{j}\right)^{2} && \text{(definition)} \\ & \leq \sum_{i=1}^{K} \xi_{i} \sum_{j=1}^{K} p_{ij} J_{j}^{2} && \text{(convexity)} \\ & =\sum_{j=1}^{K} \sum_{i=1}^{K}\xi_{i}p_{ij}J_{j}^{2} \\ &=\sum_{j=1}^{K}\xi_{j}J_{j}^{2} = \|J\|_{\xi}^{2} && \text{(invariance)} \end{aligned} \tag{8.34}$$
where the convexity step is Jensen's inequality applied to the probability weights $p_{ij}$, and the invariance step uses $P_{\pi}^{\top} \xi=\xi$.
Proposition 8.4 Contraction of the Bellman operator under the $\xi$-weighted norm
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the Bellman operator $\mathrm{T}_{\pi}$ is a contraction of modulus $\gamma$ with respect to the $\xi$-weighted norm, i.e.,
$$\left\|\mathrm{T}_{\pi} J-\mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} \leq \gamma\left\|J-J^{\prime}\right\|_{\xi} \tag{8.35}$$
Proof.
For simplicity, we use the compact representation of the Bellman operator, $\mathrm{T}_{\pi} J:=G_{\pi}+\gamma P_{\pi} J$. Then we get
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J-\mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} &=\left\|\gamma P_{\pi}\left(J-J^{\prime}\right)\right\|_{\xi} \\ & \leq \gamma\left\|J-J^{\prime}\right\|_{\xi} \end{aligned} \tag{8.36}$$
where the inequality follows directly from Lemma 8.4.
Using this property, we can define the mean squared Bellman error minimization problem under the $\xi$-weighted norm as
$$J_{\beta}^{\pi} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\xi} \tag{8.37}$$
Similar to the analysis in Section 8.2, we can derive the error bounds of MSBE minimization under the $\xi$-weighted norm as follows.
Lemma 8.5 Bound under the $\xi$-weighted norm
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J^{\pi}$ be the total cost function of a fixed policy $\pi$. Then, for any total cost function $J \in \mathbb{R}^{K}$, the following inequality holds
$$\left\|J-J^{\pi}\right\|_{\xi} \leq \frac{1}{1-\gamma}\left\|J-\mathrm{T}_{\pi} J\right\|_{\xi} \tag{8.38}$$
Proof.
We directly have
$$\begin{aligned} \left\|J-J^{\pi}\right\|_{\xi} &=\left\|J-\mathrm{T}_{\pi} J+\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\xi} \\ &\leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\xi}+\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\xi} \\ & \leq\left\|J-\mathrm{T}_{\pi} J\right\|_{\xi}+\gamma\left\|J-J^{\pi}\right\|_{\xi} \end{aligned} \tag{8.39}$$
where the first inequality is the triangle inequality and the second uses Proposition 8.4 together with $J^{\pi}=\mathrm{T}_{\pi} J^{\pi}$. Rearranging yields (8.38).
Proposition 8.5 Bound between the estimate under the $\xi$-weighted norm and the true total cost function
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, a fixed policy $\pi$, and a total cost function space $\mathcal{J}$, let $J_{\beta}^{\pi} \in \mathcal{J}$ be the global minimizer of problem (8.37). Then the error between the estimate and the true total cost function $J^{\pi}$ satisfies
$$\left\|J_{\beta}^{\pi}-J^{\pi}\right\|_{\xi} \leq \frac{1+\gamma}{1-\gamma} \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\xi} \tag{8.40}$$
Proof.
By the triangle inequality for the $\xi$-weighted norm, for any $J \in \mathcal{J}$ we get
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J-J\right\|_{\xi} & \leq\left\|\mathrm{T}_{\pi} J-J^{\pi}\right\|_{\xi}+\left\|J^{\pi}-J\right\|_{\xi} \\ & \leq(1+\gamma)\left\|J-J^{\pi}\right\|_{\xi} \end{aligned} \tag{8.41}$$
It follows directly that
$$\begin{aligned} \left\|\mathrm{T}_{\pi} J_{\beta}^{\pi}-J_{\beta}^{\pi}\right\|_{\xi} &=\min _{J \in \mathcal{J}}\left\|\mathrm{T}_{\pi} J-J\right\|_{\xi} \\ & \leq(1+\gamma) \min _{J \in \mathcal{J}}\left\|J-J^{\pi}\right\|_{\xi} \end{aligned} \tag{8.42}$$
Combining this inequality with the result of Lemma 8.5 completes the proof.
8.3.2 Mean Squared Projected Bellman Error
Finally, if we restrict ourselves to a linear function approximation scheme, we need the orthogonal projection onto $\mathcal{J}_{l}$ with respect to the $\xi$-weighted norm. Specifically, we need to solve the following minimization problem
$$\Pi_{\Phi}(J):=\Phi^{\top} \underset{h \in \mathbb{R}^{m}}{\operatorname{argmin}}\left\|J-\Phi^{\top} h\right\|_{\xi}^{2} \tag{8.43}$$
Since the least-squares objective is convex, the solution is characterized by the $h$ solving the normal equations
$$\Phi \Xi \Phi^{\top} h=\Phi \Xi J \tag{8.44}$$
where $\Xi:=\operatorname{diag}\left(\xi_{1}, \ldots, \xi_{K}\right)$. Since $\operatorname{rk}(\Phi)=m$, the orthogonal projection is well defined as
$$\Pi_{\Phi}(J):=\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi J \tag{8.45}$$
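As a numerical sketch (naming conventions ours), the projection matrix in (8.45) can be formed explicitly and checked for the defining properties of an orthogonal projection.

```python
import numpy as np

def projection_matrix(Phi_T, xi):
    """Orthogonal projection onto span(Phi^T) in the xi-weighted norm (eq. 8.45)."""
    Xi = np.diag(xi)
    Phi = Phi_T.T                                   # (m, K)
    # Pi_Phi = Phi^T (Phi Xi Phi^T)^{-1} Phi Xi
    return Phi_T @ np.linalg.solve(Phi @ Xi @ Phi_T, Phi @ Xi)
```

The resulting matrix is idempotent ($\Pi_\Phi^2 = \Pi_\Phi$) and leaves every element $\Phi^\top h$ of the approximation space unchanged.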
Lemma 8.6 Non-expansiveness of the projection operator $\Pi_{\Phi}$
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the projection $\Pi_{\Phi}$ is a non-expansive operator under the $\xi$-norm, that is,
$$\left\|\Pi_{\Phi} J-\Pi_{\Phi} J^{\prime}\right\|_{\xi} \leq\left\|J-J^{\prime}\right\|_{\xi} \tag{8.46}$$
Proof.
It is easy to see that
$$\begin{aligned} \left\|\Pi_{\Phi} J-\Pi_{\Phi} J^{\prime}\right\|_{\xi}^{2} &=\left\|\Pi_{\Phi}\left(J-J^{\prime}\right)\right\|_{\xi}^{2} \\ & \leq\left\|\Pi_{\Phi}\left(J-J^{\prime}\right)\right\|_{\xi}^{2}+\left\|\left(I-\Pi_{\Phi}\right)\left(J-J^{\prime}\right)\right\|_{\xi}^{2} \\ &=\left\|J-J^{\prime}\right\|_{\xi}^{2} \end{aligned} \tag{8.47}$$
where the last equality follows from the Pythagorean theorem. This completes the proof.
Proposition 8.6 Contraction of the projected Bellman operator $\Pi_{\Phi} \mathrm{T}_{\pi}$
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the projected Bellman operator $\Pi_{\Phi} \mathrm{T}_{\pi}$ is a contraction of modulus $\gamma$ with respect to $\|\cdot\|_{\xi}$.
Proof.
Directly from Lemma 8.6, we have
$$\begin{aligned} \left\|\Pi_{\Phi} \mathrm{T}_{\pi} J-\Pi_{\Phi} \mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} & \leq\left\|\mathrm{T}_{\pi} J-\mathrm{T}_{\pi} J^{\prime}\right\|_{\xi} \\ & \leq \gamma\left\|J-J^{\prime}\right\|_{\xi} \end{aligned} \tag{8.48}$$
This proposition shows that there exists a unique fixed point $\widetilde{J}_{\pi} \in \mathcal{J}_{l}$ such that
$$\widetilde{J}_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi} \widetilde{J}_{\pi}.$$
Since $h \mapsto \Phi^{\top} h$ is injective, there exists a unique $h_{\pi} \in \mathbb{R}^{m}$ such that $\Phi^{\top} h_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)$. This naturally leads to another popular objective function, the mean squared projected Bellman error (MSPBE),
$$\min _{h \in \mathbb{R}^{m}}\left\|\Phi^{\top} h-\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h\right)\right\|_{\xi} \tag{8.49}$$
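Substituting (8.45) into the fixed-point condition and multiplying through by $\Phi \Xi$ reduces the MSPBE fixed point to the linear system $\Phi \Xi (I_K - \gamma P_\pi)\Phi^\top h = \Phi \Xi G_\pi$; this system is not written out above, but follows directly, and it is the model-based form of the LSTD solution. A minimal Python sketch, with our own naming:

```python
import numpy as np

def mspbe_fixed_point(P_pi, G_pi, Phi_T, xi, gamma):
    """Solve Phi Xi (I - gamma P_pi) Phi^T h = Phi Xi G_pi, whose solution
    satisfies the fixed point Phi^T h = Pi_Phi T_pi(Phi^T h) of Prop. 8.6."""
    Xi = np.diag(xi)
    Phi = Phi_T.T                                        # (m, K)
    A = Phi @ Xi @ (np.eye(len(xi)) - gamma * P_pi) @ Phi_T
    b = Phi @ Xi @ G_pi
    return np.linalg.solve(A, b)
```

With full-rank square features ($\Phi^\top = I_K$) the projection is the identity and the solution coincides with the exact evaluation $J^\pi$.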
In the following, we give the error bound for minimizing the MSPBE function.
Proposition 8.7.
Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, let $h_{\pi}$ be defined by $\Phi^{\top} h_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)$. Then
$$\left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi} \leq \frac{1}{\sqrt{1-\gamma^{2}}}\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi} \tag{8.50}$$
Proof.
We directly have
$$\begin{aligned} \left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}^{2} &=\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi}^{2}+\left\|\Pi_{\Phi} J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}^{2} \\ &=\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi}^{2}+\left\|\Pi_{\Phi} \mathrm{T}_{\pi} J^{\pi}-\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)\right\|_{\xi}^{2} \\ & \leq\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi}^{2}+\gamma^{2}\left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}^{2} \end{aligned} \tag{8.51}$$
where the first equality follows from the Pythagorean theorem, the second from $J^{\pi}=\mathrm{T}_{\pi} J^{\pi}$ and the fixed-point construction $\Phi^{\top} h_{\pi}=\Pi_{\Phi} \mathrm{T}_{\pi}\left(\Phi^{\top} h_{\pi}\right)$, and the inequality from the contraction property of $\Pi_{\Phi} \mathrm{T}_{\pi}$. Rearranging yields (8.50).
When the true total cost function $J^{\pi}$ does not lie in the linear function approximation space, i.e., $\left\|J^{\pi}-\Pi_{\Phi} J^{\pi}\right\|_{\xi} \neq 0$, the error $\left\|J^{\pi}-\Phi^{\top} h_{\pi}\right\|_{\xi}$ can be severely amplified when $\gamma$ is close to 1. It is therefore crucial to ensure that the total cost function lies in the linear approximation space, that is, $J^{\pi}\in \mathcal{J}_{l}$.
Since both the MSBE and the MSPBE objectives are convex, each admits a global minimum, so it is worthwhile to compare the quality of their respective solutions. To this end, we define the difference of their error bounds as
l(\gamma):=\frac{1+\gamma}{1-\gamma}-\frac{1}{\sqrt{1-\gamma^{2}}} \tag{8.52}
Obviously, l(0)=0. Taking the derivative of l gives
l^{\prime}(\gamma)=\frac{2}{(1-\gamma)^{2}}-\frac{\gamma}{\left(\sqrt{1-\gamma^{2}}\right)^{3}} \tag{8.53}
which is positive for all \gamma \in [0,1). Hence the gap function l increases monotonically from l(0)=0 and diverges as \gamma \to 1. Figure 15 illustrates that when \gamma is close to 1, the difference between the error bounds of MSBE minimization and MSPBE minimization becomes arbitrarily large. In other words, minimizing the MSPBE is more advantageous than minimizing the MSBE.
Figure 15: Error bound quotient for MSBE minimization and MSPBE minimization.
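The divergence of the gap (8.52) is easy to check numerically; the following sketch (plain Python, no assumptions beyond the formula for l) evaluates l at a few discount factors.

```python
import math

# Gap between the MSBE and MSPBE error bounds, eq. (8.52):
# l(gamma) = (1 + gamma)/(1 - gamma) - 1/sqrt(1 - gamma^2)
def l(gamma: float) -> float:
    return (1 + gamma) / (1 - gamma) - 1 / math.sqrt(1 - gamma ** 2)

for gamma in (0.0, 0.5, 0.9, 0.99, 0.999):
    print(f"gamma = {gamma:6.3f}  ->  l = {l(gamma):10.2f}")
```

Already at \gamma = 0.99 the gap is on the order of 10^2, consistent with the blow-up shown in Figure 15.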
8.4 API supplement
8.4.1 Approximate PI (API)
- We will show three different APE methods: \ell_{2} MSBE, MSBE with ergodicity, and MSPBE with ergodicity.
- In the E-Bus case, no approximation is used in the policy improvement step.
- Policy networks in deep reinforcement learning: Approximate policy improvements.
8.4.2 APE via Bellman Residual Minimisation
- In Policy Iteration, Policy Evaluation (PE) via \mathrm{T}_{\pi} leads to a fixed point J^{\pi}. (Quiz 2)
- In Approximate PE, there is a Bellman error, since we restrict J to a subspace \left(J=\Phi^{\top} h\right) when applying Linear Function Approximation (LFA).
8.4.3 ℓ 2 \ell_{2} ℓ2 Based Bellman Residual Minimisation
- What is the difference between \|\cdot\|_{2}^{2} and \|\cdot\|_{2}?
- \|x\|_{2}^{2}=x^{\top} x, \quad \|x\|_{2}=\sqrt{x^{\top} x} \quad \left(x \in \mathbb{R}^{n}\right).
- Both objectives are convex and share the same global minimizer. We do not make a strict distinction between the two, since we only focus on the analytical solution.
- They are quite different in numerical computations, however, e.g., in their gradients.
- In this exercise, we keep using \|\cdot\|_{2}^{2}, which is also more consistent with the name "Squared" BE.
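To make the "different gradients" remark concrete, here is a small illustrative snippet (not from the notes): the gradient of \|x\|_{2}^{2} is 2x and is smooth everywhere, while the gradient of \|x\|_{2} is x/\|x\|_{2} and is undefined at x=0.

```python
import numpy as np

x = np.array([3.0, 4.0])

grad_squared = 2 * x                 # gradient of x^T x
grad_norm = x / np.linalg.norm(x)    # gradient of sqrt(x^T x); undefined at x = 0

print(grad_squared)   # [6. 8.]
print(grad_norm)      # [0.6 0.8]
```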
8.4.4 Recap: Closed form policy evaluation
Preliminaries: matrix derivation
- Matrix calculus
- Layout conventions: given y \in \mathbb{R}^{m}, x \in \mathbb{R}^{n}.
\text{Numerator layout: } \frac{\partial y}{\partial x}:=\left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \ldots & \frac{\partial y_{1}}{\partial x_{n}} \\ & \ddots & \\ \frac{\partial y_{m}}{\partial x_{1}} & \ldots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] \in \mathbb{R}^{m \times n}, \qquad \text{Denominator layout: } \frac{\partial y}{\partial x}:=\left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \ldots & \frac{\partial y_{m}}{\partial x_{1}} \\ & \ddots & \\ \frac{\partial y_{1}}{\partial x_{n}} & \ldots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] \in \mathbb{R}^{n \times m}
- This exercise follows denominator layout convention.
- This exercise has two kinds of matrix derivation:
- The derivative of a scalar y by a vector x : gradient (vector)
- The derivative of a vector y by a vector x : Jacobian (matrix)
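As a sanity check on the denominator-layout convention, the sketch below (with random stand-in matrices, not data from the notes) compares the analytic gradient 2W^{\top}(Wh-G) of \|Wh-G\|_{2}^{2} against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m = 5, 3
W = rng.normal(size=(K, m))   # stand-in for W_pi
G = rng.normal(size=K)        # stand-in for G_pi
h = rng.normal(size=m)

def f(h):
    u = W @ h - G
    return u @ u              # ||Wh - G||_2^2

grad = 2 * W.T @ (W @ h - G)  # denominator layout: shape (m,)

# Central finite differences along each coordinate of h
eps = 1e-6
fd = np.array([(f(h + eps * e) - f(h - eps * e)) / (2 * eps) for e in np.eye(m)])
print(np.max(np.abs(grad - fd)))
```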
8.4.5 ℓ 2 \ell_{2} ℓ2 Based Bellman Residual Minimisation
- \ell_{2} least-squares objective:
\begin{aligned} J_{2}^{\pi} &\in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\pi} J-J\right\|_{2}^{2}, \quad \text{where } J=\Phi^{\top} h. \\ \left\|\mathrm{T}_{\pi} J-J\right\|_{2}^{2} &=\left\|J-\mathrm{T}_{\pi} J\right\|_{2}^{2}=\left\|J-G_{\pi}-\gamma P_{\pi} J\right\|_{2}^{2} \\ &=\left\|\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} h-G_{\pi}\right\|_{2}^{2} \end{aligned}
- Let W_{\pi}=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \in \mathbb{R}^{K \times m}; then we have
\begin{aligned} \left\|\mathrm{T}_{\pi} J-J\right\|_{2}^{2} &=\left\|W_{\pi} h-G_{\pi}\right\|_{2}^{2} \\ &=\left(W_{\pi} h-G_{\pi}\right)^{\top}\left(W_{\pi} h-G_{\pi}\right) \end{aligned}
- Since the least-squares objective is convex, the minimum is attained where the derivative is zero.
- Let \mathbf{u}=W_{\pi} h-G_{\pi} \in \mathbb{R}^{K \times 1}; then we get
\frac{\partial \mathbf{u}^{\top} \mathbf{u}}{\partial \mathbf{u}}=2 \mathbf{u}, \quad \frac{\partial \mathbf{u}^{\top} \mathbf{u}}{\partial h}=2 \frac{\partial \mathbf{u}}{\partial h} \mathbf{u}, \quad \text{where } \frac{\partial \mathbf{u}}{\partial h}=W_{\pi}^{\top} \quad \text{(denominator layout)}
\begin{aligned} \frac{\partial\left(W_{\pi} h-G_{\pi}\right)^{\top}\left(W_{\pi} h-G_{\pi}\right)}{\partial h} &=2 W_{\pi}^{\top}\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \in \mathbb{R}^{m \times 1} \\ \Rightarrow \quad W_{\pi}^{\top} W_{\pi} h-W_{\pi}^{\top} G_{\pi} &=\mathbf{0} \\ W_{\pi}^{\top} W_{\pi} h &=W_{\pi}^{\top} G_{\pi} \end{aligned}
- W_{\pi} is not a square matrix (hence not invertible), so instead we invert \left(W_{\pi}^{\top} W_{\pi}\right) \in \mathbb{R}^{m \times m} and move it to the RHS:
\begin{array}{c} h=\left(W_{\pi}^{\top} W_{\pi}\right)^{-1} W_{\pi}^{\top} G_{\pi} \\ J_{2}^{\pi}=\Phi^{\top} h=\Phi^{\top}\left(W_{\pi}^{\top} W_{\pi}\right)^{-1} W_{\pi}^{\top} G_{\pi} \end{array}
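The closed form above can be computed directly on a small synthetic MDP. All quantities below (P, G, Phi) are random stand-ins, not data from the notes; a minimal sketch assuming the tabular setting with K states and m features:

```python
import numpy as np

rng = np.random.default_rng(1)
K, m, gamma = 4, 2, 0.9

P = rng.random((K, K))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic stand-in for P_pi
G = rng.normal(size=K)                  # stand-in stage costs G_pi
PhiT = rng.normal(size=(K, m))          # Phi^T in R^{K x m}

W = (np.eye(K) - gamma * P) @ PhiT      # W_pi = (I_K - gamma P_pi) Phi^T
h = np.linalg.solve(W.T @ W, W.T @ G)   # normal equations: h = (W^T W)^{-1} W^T G
J2 = PhiT @ h                           # J_2^pi = Phi^T h
```

Solving the normal equations is equivalent to `np.linalg.lstsq(W, G)`, which is numerically preferable when W^{\top}W is ill-conditioned.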
8.4.6 Approximate PI (API) with LFA + MSBE
- What is \xi? \rightarrow It comes from the ergodic MDP setting (Section 8.3.1).
- \Xi \in \mathbb{R}^{K \times K}: a diagonal matrix with diagonal elements \xi_{i}. (The 14th Greek letter: \Xi, \xi.)
- Similarly as before, let W_{\pi}:=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \in \mathbb{R}^{K \times m}; then
\left\|\mathrm{T}_{\pi}\left(\Phi^{\top} h\right)-\Phi^{\top} h\right\|_{\xi}^{2}=\left\|\Phi^{\top} h-G_{\pi}-\gamma P_{\pi} \Phi^{\top} h\right\|_{\xi}^{2}=\left\|W_{\pi} h-G_{\pi}\right\|_{\xi}^{2}
- The \xi-weighted norm is defined as:
\left\|W_{\pi} h-G_{\pi}\right\|_{\xi}^{2}=\left(W_{\pi} h-G_{\pi}\right)^{\top} \Xi\left(W_{\pi} h-G_{\pi}\right)
- Again, the least-squares objective is convex, so we set the derivative to zero. Let \mathbf{u}=W_{\pi} h-G_{\pi} \in \mathbb{R}^{K \times 1}; then we get
\begin{array}{c} \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial \mathbf{u}}=2 \Xi \mathbf{u}, \quad \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial h}=2 \frac{\partial \mathbf{u}}{\partial h} \Xi \mathbf{u}, \quad \text{where } \frac{\partial \mathbf{u}}{\partial h}=W_{\pi}^{\top} \\ \frac{\partial\left(W_{\pi} h-G_{\pi}\right)^{\top} \Xi\left(W_{\pi} h-G_{\pi}\right)}{\partial h}=2 W_{\pi}^{\top} \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \in \mathbb{R}^{m \times 1} \\ W_{\pi}^{\top} \Xi W_{\pi} h=W_{\pi}^{\top} \Xi G_{\pi} \\ h=\left(W_{\pi}^{\top} \Xi W_{\pi}\right)^{-1} W_{\pi}^{\top} \Xi G_{\pi} \\ \Rightarrow J_{\xi}^{\pi}=\Phi^{\top}\left(W_{\pi}^{\top} \Xi W_{\pi}\right)^{-1} W_{\pi}^{\top} \Xi G_{\pi} \end{array}
- When \Xi is the identity matrix, we recover the same result as \ell_{2} MSBE.
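A quick numerical check of this reduction (again with random stand-in matrices, not data from the notes): the \xi-weighted solution h=(W^{\top}\Xi W)^{-1}W^{\top}\Xi G with \Xi=I coincides with the plain \ell_{2} solution.

```python
import numpy as np

rng = np.random.default_rng(2)
K, m, gamma = 4, 2, 0.9

P = rng.random((K, K))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic stand-in for P_pi
G = rng.normal(size=K)
PhiT = rng.normal(size=(K, m))
W = (np.eye(K) - gamma * P) @ PhiT      # W_pi

def h_weighted(Xi):
    # h = (W^T Xi W)^{-1} W^T Xi G
    return np.linalg.solve(W.T @ Xi @ W, W.T @ Xi @ G)

xi = rng.random(K)
Xi = np.diag(xi / xi.sum())             # stand-in weights on the diagonal

h_xi = h_weighted(Xi)                   # xi-weighted MSBE solution
h_l2 = h_weighted(np.eye(K))            # Xi = I: plain l2 solution
```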
8.4.7 Approximate PI (API) with LFA + ξ -weighted MSBE \text { Approximate PI (API) with LFA }+\xi \text {-weighted MSBE } Approximate PI (API) with LFA +ξ-weighted MSBE
8.4.8 Mean Squared Projected Bellman Error (MSPBE)
- Since \Pi_{\Phi} J=J,
\begin{aligned} \left\|\Pi_{\Phi} \mathrm{T}_{\pi} J-J\right\|_{\xi}^{2} &=\left\|J-\Pi_{\Phi}\left(G_{\pi}+\gamma P_{\pi} J\right)\right\|_{\xi}^{2}=\left\|\Pi_{\Phi} J-\gamma \Pi_{\Phi} P_{\pi} J-\Pi_{\Phi} G_{\pi}\right\|_{\xi}^{2} \\ &=\left\|\Pi_{\Phi}\left(\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} h-G_{\pi}\right)\right\|_{\xi}^{2} \end{aligned}
- Let W_{\pi}=\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \in \mathbb{R}^{K \times m}; then we have \left\|\Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right)\right\|_{\xi}^{2}.
- The orthogonal projector is \Pi_{\Phi}:=\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \in \mathbb{R}^{K \times K}.
- Similarly as before, let \mathbf{u}=\Pi_{\Phi} W_{\pi} h-\Pi_{\Phi} G_{\pi} \in \mathbb{R}^{K \times 1}; then we get
\begin{array}{c} \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial \mathbf{u}}=2 \Xi \mathbf{u}, \quad \frac{\partial \mathbf{u}^{\top} \Xi \mathbf{u}}{\partial h}=2 \frac{\partial \mathbf{u}}{\partial h} \Xi \mathbf{u}, \quad \text{where } \frac{\partial \mathbf{u}}{\partial h}=\left(\Pi_{\Phi} W_{\pi}\right)^{\top} \\ \frac{\partial\left(\Pi_{\Phi} W_{\pi} h-\Pi_{\Phi} G_{\pi}\right)^{\top} \Xi\left(\Pi_{\Phi} W_{\pi} h-\Pi_{\Phi} G_{\pi}\right)}{\partial h}=2 W_{\pi}^{\top} \Pi_{\Phi}^{\top} \Xi \Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \in \mathbb{R}^{m \times 1} \end{array}
- Since \left(\Phi \Xi \Phi^{\top}\right)^{-1} is symmetric, we have \Pi_{\Phi}^{\top}=\Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi, hence:
\begin{aligned} W_{\pi}^{\top} \Pi_{\Phi}^{\top} \Xi \Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right) &=\underbrace{W_{\pi}^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1}}_{\text{full rank, invertible}} \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0} \\ \Rightarrow \Phi \Xi\left(W_{\pi} h-G_{\pi}\right) &=\mathbf{0}, \quad \Phi \Xi W_{\pi} h=\Phi \Xi G_{\pi} \\ \Rightarrow h &=\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \\ \Rightarrow J_{\xi}^{\pi} &=\Phi^{\top}\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \end{aligned}
- We have proved that \Pi_{\Phi} \mathrm{T}_{\pi} is a contraction mapping, which leads to a fixed point; at this fixed point the MSPBE equals zero:
\Pi_{\Phi} \mathrm{T}_{\pi} J-J=\mathbf{0} \in \mathbb{R}^{K \times 1} \Rightarrow \Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right)=\Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi\left(W_{\pi} h-G_{\pi}\right)=\mathbf{0}
- Left-multiplying both sides with \Phi \Xi \in \mathbb{R}^{m \times K}:
\begin{aligned} \Phi \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi\left(W_{\pi} h-G_{\pi}\right) &=\mathbf{0}, \\ \Phi \Xi\left(W_{\pi} h-G_{\pi}\right) &=\mathbf{0}, \\ \Phi \Xi W_{\pi} h &=\Phi \Xi G_{\pi}, \\ \Rightarrow h &=\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi}, \\ \Rightarrow J_{\xi}^{\pi} &=\Phi^{\top}\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} \end{aligned}
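The derivation can be verified numerically: at h=\left(\Phi \Xi W_{\pi}\right)^{-1} \Phi \Xi G_{\pi} the projected Bellman residual \Pi_{\Phi}\left(W_{\pi} h-G_{\pi}\right) vanishes. The sketch below uses random stand-in matrices (not data from the notes).

```python
import numpy as np

rng = np.random.default_rng(3)
K, m, gamma = 5, 2, 0.9

P = rng.random((K, K))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic stand-in for P_pi
G = rng.normal(size=K)
PhiT = rng.normal(size=(K, m))          # Phi^T in R^{K x m}
Phi = PhiT.T                            # Phi in R^{m x K}

xi = rng.random(K)
Xi = np.diag(xi / xi.sum())             # stand-in weight matrix

W = (np.eye(K) - gamma * P) @ PhiT                       # W_pi
h = np.linalg.solve(Phi @ Xi @ W, Phi @ Xi @ G)          # MSPBE solution
Pi = PhiT @ np.linalg.solve(Phi @ Xi @ PhiT, Phi @ Xi)   # projector Pi_Phi
residual = Pi @ (W @ h - G)                              # should be ~0
print(np.max(np.abs(residual)))
```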
8.4.9 Approximate PI (API) with LFA + ξ \xi ξ-weighted MSPBE
8.4.10 Approximate PI Summary
- Three different APE methods in closed form: \ell_{2} MSBE, MSBE with ergodicity, MSPBE with ergodicity;
- The estimation error bounds \delta for the three APE methods above are discussed in the lecture.