ADPRL - Approximate Dynamic Programming and Reinforcement Learning - Note 12 - Numerical Temporal Difference Learning (Numerical TD Learning)

Note 12 Numerical TD Learning

As discussed in the previous two chapters, TD learning is a theoretically sound sampling-based mechanism for overcoming the curse of modeling. In the LFA setting, a common practice in DP is to use the policy iteration framework to obtain an optimal policy: the TD algorithm with LFA is used to evaluate the total cost of a given policy, and a policy improvement step then completes one sweep of the sampling-based PI framework.

12.1 A Brief Description of Off-Policy Learning

An obvious risk of the sampling-based PI algorithm with LFA is that it may not converge at all, or may not converge quickly enough to a useful region. A practical need in RL is therefore to exploit the interactions sampled while following one given policy in order to evaluate the total cost of a different policy. Such tasks are called off-policy learning. More specifically, given an $\operatorname{MDP}(\mathcal{X}, \mathcal{U}, g, p, \gamma)$ and a so-called behavior policy $\pi_{b}$, the task of off-policy learning is to evaluate the total cost of another policy $\pi_{t}$, called the target policy. As the counterpart of off-policy learning, so-called on-policy learning refers to RL algorithms that estimate the total cost function of the same policy that generates the samples.

[Reinforcement Learning (4) - Monte Carlo Methods and Examples]
…The only general way to ensure that all actions are selected infinitely often is to have the agent continue to select them. There are two approaches to ensure this, leading to what we call on-policy methods and off-policy methods. The on-policy method attempts to evaluate or improve the very policy that is used to make decisions, while the off-policy method evaluates or improves a policy different from the one used to generate the data…
In on-policy learning the target policy and the behavior policy are the same policy, which has the benefit of simplicity: the policy can be optimized directly from the collected data. However, this tends to drive the policy toward a local optimum, because an on-policy method cannot maintain exploration and exploitation at the same time. Off-policy learning separates the target policy $\pi_{t}$ from the behavior policy $\pi_{b}$, so it can search for the global optimum while maintaining exploration…

Let us recall the definition of the total cost function of a policy in Equation (3.4). Off-policy learning can clearly be viewed as a distribution-mismatch problem: we need interactions sampled from the distribution induced by the target policy $\pi_{t}$, while the available trajectories are drawn from the distribution induced by the behavior policy $\pi_{b}$. Importance sampling is the conventional tool for dealing with distribution mismatch. Consider the task of estimating the expected value of a random variable $x$ distributed according to $\mu$, using samples drawn from another distribution $\mu^{\prime}$. If $\mu^{\prime}(x)>0$ for all $x$, it is easy to see that

$$\begin{aligned} \underset{x \sim \mu}{\mathbb{E}}[x] &=\int_{\mathcal{X}} x \mu(x)\, d x \\ &=\int_{\mathcal{X}} x \frac{\mu(x)}{\mu^{\prime}(x)} \mu^{\prime}(x)\, d x \\ &=\underset{x \sim \mu^{\prime}}{\mathbb{E}}\left[\frac{\mu(x)}{\mu^{\prime}(x)} x\right] \end{aligned} \tag{12.1}$$

Let us express the ratio of the two density functions as

$$\psi(x)=\frac{\mu(x)}{\mu^{\prime}(x)} \tag{12.2}$$
The expected value of the random variable $x$ can then be approximated by the empirical average

$$\underset{x \sim \mu}{\mathbb{E}}[x] \approx \frac{1}{N} \sum_{i=1}^{N} \psi\left(x_{i}\right) x_{i} \tag{12.3}$$
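
As a minimal numerical illustration of Equations (12.2)-(12.3), the following Python sketch estimates the mean of a Gaussian target distribution from samples drawn under a different Gaussian; both densities and the sample size are hypothetical choices made only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, std):
    # Density of a univariate Gaussian, used for both mu and mu'.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical target density mu = N(1, 1) and sampling density mu' = N(0, 2).
N = 100_000
x = rng.normal(loc=0.0, scale=2.0, size=N)              # samples drawn from mu'

psi = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance ratio (12.2)
estimate = np.mean(psi * x)                             # empirical average (12.3)
print(estimate)                                         # close to E_{x ~ mu}[x] = 1.0
```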

Obviously, applying importance sampling to a specific MDP with target policy $\pi_{t}$ and behavior policy $\pi_{b}$ requires that the behavior policy $\pi_{b}$ has the same action coverage as the target policy. With a slight abuse of notation, if we regard the policies as conditional distributions $\pi_{t}(u \mid x)$ and $\pi_{b}(u \mid x)$, we can define

$$\psi(x, u)=\frac{\pi_{t}(u \mid x)}{\pi_{b}(u \mid x)} \tag{12.4}$$

We can then use importance sampling to approximate the total cost function of the target policy as follows

$$\begin{aligned} J^{\pi_{t}}(x) &=\mathbb{E}_{p_{\pi_{b}}\left(x^{\prime} \mid x\right)}\left[\psi(x, u)\left(g\left(x, \pi_{b}(x), x^{\prime}\right)+\gamma J^{\pi_{b}}\left(x^{\prime}\right)\right)\right] \\ & \approx \frac{1}{N} \sum_{i=1}^{N} \psi(x, u)\left(g\left(x, \pi_{b}(x), x^{\prime}\right)+\gamma J^{\pi_{b}}\left(x^{\prime}\right)\right) \end{aligned} \tag{12.5}$$

Following the same derivation as for TD learning, an off-policy TD(0) algorithm is obtained with the following update rule

$$J_{k+1}(x)=J_{k}(x)+\alpha_{k}\, \psi(x, u)\left(g\left(x, u, x^{\prime}\right)+\gamma J_{k}\left(x^{\prime}\right)-J_{k}(x)\right) \tag{12.6}$$
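
The update rule (12.6) can be turned into a tabular procedure as in the following sketch; the environment interface `env.step`, the array-based policy representations `pi_b` and `pi_t`, and the step-size schedule are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def off_policy_td0(env, pi_b, pi_t, gamma, n_steps, alpha0=0.1, seed=0):
    """Tabular off-policy TD(0), cf. Equation (12.6).

    pi_b[x] and pi_t[x] are assumed to be arrays of action probabilities;
    env.step(x, u) is assumed to return (stage_cost, next_state).
    """
    rng = np.random.default_rng(seed)
    J = np.zeros(env.num_states)
    x = env.reset()
    for k in range(n_steps):
        u = rng.choice(env.num_actions, p=pi_b[x])      # act with the behavior policy
        g, x_next = env.step(x, u)
        psi = pi_t[x][u] / pi_b[x][u]                   # importance ratio (12.4)
        alpha = alpha0 / (1.0 + k / 1000.0)             # hypothetical step-size schedule
        # importance-weighted TD(0) update, Equation (12.6)
        J[x] += alpha * psi * (g + gamma * J[x_next] - J[x])
        x = x_next
    return J
```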

Although the development of off-policy TD follows exactly the same philosophy as the original TD algorithm, its practical use turns out to be rather limited. In particular, off-policy TD algorithms often fail to converge. This phenomenon is commonly attributed to the so-called deadly triad: function approximation, bootstrapping, and off-policy learning.

A true stochastic gradient descent algorithm minimizing the off-policy MSPBE was the first successful attempt to break the deadly triad. For a given target policy $\pi_{t}$ and behavior policy $\pi_{b}$ traversing the MDP, let $\xi_{t}$ and $\xi_{b}$ denote the steady-state distributions of $\pi_{t}$ and $\pi_{b}$, respectively. The off-policy MSPBE function is then defined as

$$f_{t}(h)=\left\|\Pi_{\pi_{t}} \mathrm{T}_{\pi_{t}} \Phi^{\top} h-\Phi^{\top} h\right\|_{\xi_{t}}^{2}, \tag{12.7}$$

where $\Xi_{t}=\Xi_{b} \Psi$ with $\Psi=\operatorname{diag}\left(\psi\left(x_{1}\right), \ldots, \psi\left(x_{K}\right)\right)$, and the orthogonal projector $\Pi_{\pi_{t}}$ can be expressed as

$$\begin{aligned} \Pi_{\pi_{t}} &=\Phi^{\top}\left(\Phi \Xi_{t} \Phi^{\top}\right)^{-1} \Phi \Xi_{t} \\ &=\Phi^{\top}\left(\Phi \Xi_{b} \Psi \Phi^{\top}\right)^{-1} \Phi \Xi_{b} \Psi \end{aligned} \tag{12.8}$$
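
To make the quantities in (12.7)-(12.8) concrete, the sketch below evaluates the off-policy MSPBE directly from (assumed known) model quantities; the shapes follow the convention that $\Phi$ is $m \times K$ with columns $\phi(x_k)$, and the state-wise ratios `psi` reflect the abuse of notation in $\Psi$.

```python
import numpy as np

def off_policy_mspbe(h, Phi, xi_b, psi, P_t, G_t, gamma):
    """Evaluate the off-policy MSPBE (12.7) using the projector (12.8).

    Phi: (m, K) feature matrix with columns phi(x_k); xi_b, psi: length-K
    vectors (behavior steady-state distribution and importance ratios);
    P_t, G_t: transition matrix and stage-cost vector of the target policy.
    """
    Xi_t = np.diag(xi_b * psi)                     # Xi_t = Xi_b * Psi
    V = Phi.T @ h                                  # approximate total cost Phi^T h
    TV = G_t + gamma * P_t @ V                     # Bellman operator T_{pi_t} applied to V
    # Orthogonal projector onto the LFA space w.r.t. the xi_t-weighted norm (12.8)
    Pi_t = Phi.T @ np.linalg.solve(Phi @ Xi_t @ Phi.T, Phi @ Xi_t)
    r = Pi_t @ TV - V                              # projected Bellman residual
    return float(r @ Xi_t @ r)                     # squared xi_t-weighted norm (12.7)
```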

For simplicity, we derive the GTD algorithm that minimizes the on-policy MSPBE function.

12.2 Gradient TD Learning

Define $\delta(h):=\mathrm{T}_{\pi} \Phi^{\top} h-\Phi^{\top} h$. The MSPBE function can then be written as

$$\begin{aligned} \left\|\Pi_{\pi} \mathrm{T}_{\pi} \Phi^{\top} h-\Phi^{\top} h\right\|_{\xi}^{2} &=(\delta(h))^{\top} \Pi_{\pi}^{\top} \Xi \Pi_{\pi} \delta(h) \\ &=(\delta(h))^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \end{aligned}\tag{12.9}$$

The gradient of the MSPBE function (up to a constant factor) is then

$$\begin{aligned} \nabla f(h) &=\left(\nabla_{h}\, \delta(h)\right)^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \\ &=\left(\gamma P_{\pi} \Phi^{\top}-\Phi^{\top}\right)^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \\ &=\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h)-\Phi \Xi \delta(h) \end{aligned} \tag{12.10}$$

Since the MSPBE function is strongly convex, it has a unique global minimum, which satisfies the critical-point condition $\nabla f(h)=0$. Equivalently, the global minimum is characterized by the following equation in the unknown $h$:

$$\left(\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h)=\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \tag{12.11}$$

Obviously, since we would need to compute the inverses of $\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}$ and $\Phi \Xi \Phi^{\top}$, the classical stochastic approximation technique fails here. To alleviate this difficulty, we introduce an auxiliary variable $\omega$ as

$$\omega:=\left(\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \tag{12.12}$$

Combined with the critical-point condition (12.11), this definition leads to the constraint

$$\Phi \Xi \delta(h)=\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top} \omega=\Phi \Xi \Phi^{\top} \omega . \tag{12.13}$$

The critical-point condition is thus equivalent to the following system of equations:

$$\left\{\begin{array}{lll}\Phi \Xi \delta(h)-\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top} \omega & = & 0 \\ \Phi \Xi \delta(h)-\Phi \Xi \Phi^{\top} \omega & = & 0\end{array}\right. \tag{12.14}$$

We define the TD error as
$$\delta_{h}\left(x_{k}, u_{k}, x_{k}^{\prime}\right):=g\left(x_{k}, u_{k}, x_{k}^{\prime}\right)+\gamma h^{\top} \phi\left(x_{k}^{\prime}\right)-h^{\top} \phi\left(x_{k}\right) \tag{12.15}$$

Applying the stochastic approximation recipe to the system (12.14) with a single sample per step, we obtain the coupled updates

$$\left\{\begin{array}{l} h_{k+1}=h_{k}+\alpha_{k}\left(\delta_{h}\left(x_{k}, u_{k}, x_{k}^{\prime}\right) \phi\left(x_{k}\right)-\gamma\, \omega_{k}^{\top} \phi\left(x_{k}\right) \phi\left(x_{k}^{\prime}\right)\right) \\ \omega_{k+1}=\omega_{k}+\alpha_{k}\left(\delta_{h}\left(x_{k}, u_{k}, x_{k}^{\prime}\right)-\omega_{k}^{\top} \phi\left(x_{k}\right)\right) \phi\left(x_{k}\right)\end{array}\right. \tag{12.16}$$

Note that this SA algorithm coincides with the TDC algorithm.
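
A minimal sketch of the coupled updates (12.16), i.e., TDC/GTD(0) with LFA, is given below; the transition format, the feature map `phi`, and the constant step sizes are assumptions made for illustration.

```python
import numpy as np

def gtd0_with_lfa(transitions, phi, m, gamma, alpha=0.01, beta=0.01):
    """GTD(0)/TDC with LFA, cf. Equations (12.15)-(12.16).

    transitions: iterable of (x, u, g, x_next) samples; phi(x) is assumed to
    return a length-m feature vector.
    """
    h = np.zeros(m)    # main weight vector
    w = np.zeros(m)    # auxiliary weight vector
    for (x, u, g, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        delta = g + gamma * (h @ f_next) - h @ f                 # TD error (12.15)
        h = h + alpha * (delta * f - gamma * (w @ f) * f_next)   # first update in (12.16)
        w = w + beta * (delta - w @ f) * f                       # second update in (12.16)
    return h
```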

Theorem 12.1 (Convergence of GTD(0) with LFA).

Let the step sizes $\alpha_{k}$ satisfy the Robbins-Monro conditions. Then the sequence of vectors $h_{k}$ produced by the GTD(0) algorithm with LFA converges with probability 1 to the fixed point of the projected Bellman operator.

Remark 12.1 Difficulty in practical convergence

Although asymptotic convergence theorems dominate the theoretical analysis of RL, stochastic approximation has a significant practical drawback: its convergence behavior depends strongly on the construction of the step-size sequence, and asymptotic convergence is, by definition, only reached in the limit. Advanced numerical methods, such as the stochastic Nesterov accelerated gradient algorithm, have been developed to address this. In the next subsection, we discuss an alternative numerical method that solves the PE problem equally well.

12.3 Least Squares TD Learning

As discussed in the previous section, one of the most challenging technical issues of TD or GTD learning algorithms is the fragile asymptotic convergence they inherit from SA methods. Let us take a closer look at the PE problem with LFA, as shown in Figure 18.


Figure 18: Geometry of policy evaluation with LFA.

Instead of using numerical optimization to find the fixed point $\Phi^{\top} h^{*}$ of the projected Bellman operator $\Pi_{\pi} \mathrm{T}_{\pi}$, we characterize the fixed point directly.

As shown in Figure 18, the residual vector $\mathrm{T}_{\pi} \Phi^{\top} h^{*}-\Phi^{\top} h^{*}$ is orthogonal to the approximation space $\mathcal{J}$ with respect to the inner product $\langle\cdot, \cdot\rangle_{\xi}$. In other words, we have

$$\Phi \Xi\left(\mathrm{T}_{\pi} \Phi^{\top} h^{*}-\Phi^{\top} h^{*}\right)=0 \tag{12.17}$$

Applying the compact expression of the Bellman operator, $\mathrm{T}_{\pi} \Phi^{\top} h:=G_{\pi}+\gamma P_{\pi} \Phi^{\top} h$, we finally obtain

$$\Phi \Xi\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} h^{*}=\Phi \Xi G_{\pi} . \tag{12.18}$$

Simply put, the task is now to solve the above system of linear equations in $h$, that is, $A h=b$ with

$$\left\{\begin{aligned} A &=\Phi \Xi\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \\ b &=\Phi \Xi G_{\pi} \end{aligned}\right. \tag{12.19}$$

Under the assumption $\operatorname{rk}(\Phi)=m$, it is easy to see that the linear system has a unique solution, namely

$$h^{*}=\left(\Phi \Xi\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top}\right)^{-1} \Phi \Xi G_{\pi} . \tag{12.20}$$

To enable model-free online learning, we rewrite these quantities in expectation form as

$$\left\{\begin{aligned} A &=\mathbb{E}_{p_{\pi}\left(x^{\prime} \mid x\right)}\left[\phi(x)\left(\phi(x)-\gamma \phi\left(x^{\prime}\right)\right)^{\top}\right] \\ b &=\mathbb{E}_{p_{\pi}\left(x^{\prime} \mid x\right)}\left[g\left(x, u, x^{\prime}\right) \phi(x)\right] \end{aligned}\right. \tag{12.21}$$

By taking the empirical averages of the two expectations above, a sampling-based realization of the solution $h^{*}$ given in Equation (12.20) can be written as

$$h_{k+1}=\left(\sum_{i=1}^{k} \phi\left(x_{i}\right)\left(\phi\left(x_{i}\right)-\gamma \phi\left(x_{i}^{\prime}\right)\right)^{\top}\right)^{-1}\left(\sum_{i=1}^{k} g\left(x_{i}, u_{i}, x_{i}^{\prime}\right) \phi\left(x_{i}\right)\right) \tag{12.22}$$
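
A batch implementation of Equation (12.22) could look as follows; the transition format, the feature map, and the small ridge term guarding against a singular matrix are assumptions made for illustration.

```python
import numpy as np

def lstd(transitions, phi, m, gamma, reg=1e-6):
    """Batch LSTD with LFA, cf. Equations (12.21)-(12.22).

    transitions: list of (x, u, g, x_next) samples collected under the
    evaluated policy; phi(x) is assumed to return a length-m feature vector.
    """
    A_hat = np.zeros((m, m))
    b_hat = np.zeros(m)
    for (x, u, g, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        A_hat += np.outer(f, f - gamma * f_next)   # phi(x) (phi(x) - gamma phi(x'))^T
        b_hat += g * f                             # g(x, u, x') phi(x)
    # A small regularization keeps A_hat invertible for short trajectories.
    return np.linalg.solve(A_hat + reg * np.eye(m), b_hat)
```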

This update scheme is called Least Squares Temporal Difference (LSTD) learning. Obviously, the bottleneck of the LSTD algorithm is the repeated inversion of a square matrix. To reduce this computational burden, we can use the Sherman-Morrison formula to maintain the inverse of the matrix via rank-one updates.

Proposition 12.1 (Sherman-Morrison formula)

Let $A$ be an invertible square matrix and let $u, v$ be column vectors. Assume $1+v^{\top} A^{-1} u \neq 0$. Then the inverse of the rank-one update $A+u v^{\top}$ is given by

$$\left(A+u v^{\top}\right)^{-1}=A^{-1}-\frac{A^{-1} u v^{\top} A^{-1}}{1+v^{\top} A^{-1} u} \tag{12.23}$$
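
A quick numerical check of Proposition 12.1 on random data (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
A = rng.normal(size=(m, m)) + m * np.eye(m)      # well-conditioned invertible matrix
u = rng.normal(size=m)
v = rng.normal(size=m)

A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u                      # assumed nonzero
direct = np.linalg.inv(A + np.outer(u, v))       # inverse of the rank-one update
sherman = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / denom   # formula (12.23)
print(np.allclose(direct, sherman))              # True
```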

Obviously, the LSTD algorithm is not an SA algorithm but a pure Monte Carlo algorithm. Since the $1/k$ factors in the arithmetic means of $A$ and $b$ cancel, the LSTD update in Equation (12.22) does not require any tuning of hyperparameters or learning rates. Nonetheless, the performance of the LSTD algorithm is more strongly affected by the properties of the LFA space discussed in Section 8.2.

Thus, with the help of the Sherman-Morrison formula, we can derive a recursive LSTD algorithm with LFA; a sketch of such a recursion is given below.
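Since the algorithm listing is not reproduced here, the following sketch shows one way a recursive LSTD with LFA could be implemented, maintaining the inverse $A_k^{-1}$ directly through Sherman-Morrison rank-one updates; the variable names and the initialization constant `eps` are assumptions.

```python
import numpy as np

def recursive_lstd(transitions, phi, m, gamma, eps=1e3):
    """Recursive LSTD with LFA via Sherman-Morrison rank-one updates.

    B tracks the inverse of the accumulated matrix A_k in Equation (12.22);
    it is initialized as eps * I so that the first inverse exists (a common
    regularization choice).
    """
    B = eps * np.eye(m)     # running estimate of A_k^{-1}
    b = np.zeros(m)         # running sum of g_i * phi(x_i)
    h = np.zeros(m)
    for (x, u, g, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        uvec = f                          # A_k = A_{k-1} + uvec vvec^T
        vvec = f - gamma * f_next
        Bu = B @ uvec
        vB = vvec @ B
        B = B - np.outer(Bu, vB) / (1.0 + vvec @ Bu)   # Sherman-Morrison (12.23)
        b = b + g * f
        h = B @ b                         # current LSTD estimate, cf. (12.22)
    return h
```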


Origin blog.csdn.net/qq_37266917/article/details/122757971