ADPRL - Approximate Dynamic Programming and Reinforcement Learning - Note 11 - Temporal Difference Learning (Theory of TD learning)

Note 11 - Theory of TD learning


In the last Note, we reviewed the basic concepts of RL, namely TD learning and its extension with eligibility traces. Given the simplicity and strong empirical performance of the TD algorithm, combining the TD mechanism with linear function approximation (LFA) promises great advantages in tackling the curse of dimensionality.

Let us recall the typical linear function approximation (LFA): $J = \Phi^{\top} h \in \mathbb{R}^{K}$, with $\Phi := \left[\phi_{1}, \ldots, \phi_{K}\right] \in \mathbb{R}^{m \times K}$. If the TD error $\delta\left(x, u, x^{\prime}\right)$ is regarded as describing the information carried by the transition $\left(x, u, x^{\prime}\right)$, the classic TD update with LFA can be constructed as

$$h_{k+1}=h_{k}+\alpha_{k}\left(g\left(x_{k}, u_{k}, x_{k}^{\prime}\right)+\gamma h_{k}^{\top} \phi\left(x_{k}^{\prime}\right)-h_{k}^{\top} \phi\left(x_{k}\right)\right) \phi\left(x_{k}\right) \tag{11.1}$$

This update is performed after each transition $\left(x, u, x^{\prime}\right)$ is triggered. Note that the classic TD update revises the approximation of the total cost at each individual state, while TD with LFA updates the weight vector, i.e., it implicitly adjusts the approximation of the entire total cost function at each step. It may come as a pleasant surprise that the algorithm associated with the update rule (11.1) can be shown to converge asymptotically to a fixed point of the projected Bellman operator, as introduced in Proposition 8.6.

The first TD learning algorithms with LFA were proposed as variants of gradient descent. However, these TD learning algorithms were subsequently recognized not to be true gradient descent algorithms, which weakened and limited the early attempts to establish the convergence and robustness of classical TD learning. Although recent attempts have been made to reinterpret TD through the notion of semi-gradients, a concise explanation of TD learning is still somewhat lacking. In this chapter, we aim to reveal the mathematical foundation of the TD learning algorithm and give a concise interpretation of its convergence properties, in contrast to the hand-waving notion of bootstrapping that is often used to justify the approximation of the total cost.

11.1 The Stochastic Approximation Algorithm in a Nutshell

In this subsection, we review some basic results from Stochastic Approximation (SA) theory so that we can better understand the classic TD learning algorithm with LFA. Specifically, the purpose of the SA algorithm is to find a root of some nonlinear self-mapping $z: \mathbb{R}^{m} \rightarrow \mathbb{R}^{m}$, that is, to solve the following equation for $h \in \mathbb{R}^{m}$

$$z(h)=0. \tag{11.2}$$

Here, the self-mapping $z$ is usually assumed to be continuous in its argument $h$. The SA algorithm deals with a very interesting and challenging situation: the function $z$ is unknown, and only some "noisy measurements" of $z$ can be obtained. In particular, a noisy measurement is modeled as

$$y=z(h)+w \tag{11.3}$$

where $w$ is a zero-mean random variable representing the noise, with probability density function $p(w)$. Obviously, for a fixed $h$, $y$ is a random variable whose expected value equals $z(h)$, that is,

$$\mathbb{E}_{p(y)}[y]=\mathbb{E}_{p(w)}[z(h)+w]=z(h), \tag{11.4}$$

where $p(y)$ denotes the probability density function of the noisy measurement. The classic SA algorithm iterates the following update rule

$$h_{k+1}=h_{k}+\alpha_{k} y_{k}, \tag{11.5}$$
where $y_{k}$ is a noisy measurement of the function $z$ at $h_{k}$. Under some appropriate conditions, the SA algorithm converges to a root of $z$.
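As a concrete illustration, the following minimal Python sketch runs the SA update (11.5) on a hypothetical one-dimensional mapping $z(h) = 2 - h$, whose root is $h = 2$, using only noisy measurements and the step size $\alpha_k = 1/(k+1)$:

import numpy as np

# SA update (11.5) on a toy mapping z(h) = 2 - h (a hypothetical example), whose root is h = 2;
# only noisy measurements y = z(h) + w are available
rng = np.random.default_rng(0)
h = 0.0
for k in range(5000):
    y = (2.0 - h) + rng.normal(scale=0.5)    # noisy measurement of z at the current iterate
    h = h + (1.0 / (k + 1)) * y              # update (11.5) with step size alpha_k = 1/(k+1)
print(h)                                     # approaches the root h = 2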

A particularly interesting variant of SA concerns fixed point algorithms. Let $\mathrm{T}: \mathbb{R}^{K} \rightarrow \mathbb{R}^{K}$ be a contraction on $\mathbb{R}^{K}$ with a unique fixed point, i.e., $\mathrm{T}(h)=h$. The unique fixed point of $\mathrm{T}$ can be described as the root of the following self-mapping

$$z_{\mathrm{T}}(h):=\mathrm{T}(h)-h. \tag{11.6}$$

Similarly, the contraction $\mathrm{T}$ is assumed to be unavailable, and only some "noisy measurements" of it are given,
$$y_{\mathrm{T}}=z_{\mathrm{T}}(h)+w, \tag{11.7}$$

We end up with the following SA algorithm
$$h_{k+1}=h_{k}+\alpha_{k} y_{\mathrm{T}}^{(k)} \tag{11.8}$$

where $y_{\mathrm{T}}^{(k)}$ is a noisy measurement of the function $z_{\mathrm{T}}$.
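A minimal sketch of the fixed-point variant (11.8), using a hypothetical contraction $\mathrm{T}(h) = 0.5\,h + 1$ with fixed point $h = 2$, only differs in the measured quantity:

import numpy as np

# fixed-point SA (11.8): y_T is a noisy measurement of z_T(h) = T(h) - h,
# where T(h) = 0.5 * h + 1 is a hypothetical contraction with fixed point h = 2
T = lambda h: 0.5 * h + 1.0
rng = np.random.default_rng(0)
h = 0.0
for k in range(5000):
    y_T = (T(h) - h) + rng.normal(scale=0.5)   # noisy measurement of z_T at the current iterate
    h = h + (1.0 / (k + 1)) * y_T              # update (11.8) with step size alpha_k = 1/(k+1)
print(h)                                       # approaches the fixed point h = 2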

11.1.1 Understanding TD with Linear Function Approximation

Let us recall that, for a given policy $\pi$, the fixed point property of the projected Bellman operator reads
$$\Phi^{\top} h=\Pi_{\pi} \mathrm{T}_{\pi} \Phi^{\top} h. \tag{11.9}$$

Since the feature matrix $\Phi \in \mathbb{R}^{m \times K}$ is assumed to have full row rank, multiplying both sides of the above equation from the left by $\Phi \Xi_{\pi}$ obviously does not change its solution, i.e.,
$$\Phi \Xi_{\pi} \Phi^{\top} h=\Phi \Xi_{\pi} \Pi_{\pi} \mathrm{T}_{\pi} \Phi^{\top} h. \tag{11.10}$$

Let us recall the definition of the projection
$$\Pi_{\pi}(J):=\Phi^{\top}\left(\Phi \Xi_{\pi} \Phi^{\top}\right)^{-1} \Phi \Xi_{\pi} J, \tag{11.11}$$

Then we get
$$\Phi \Xi_{\pi} \Phi^{\top} h=\Phi \Xi_{\pi} \mathrm{T}_{\pi} \Phi^{\top} h, \tag{11.12}$$

Let us define the self-mapping
$$\begin{aligned} z_{0} &: \mathbb{R}^{m} \rightarrow \mathbb{R}^{m} \\ z_{0}(h) &:=\Phi \Xi_{\pi} \mathrm{T}_{\pi} \Phi^{\top} h-\Phi \Xi_{\pi} \Phi^{\top} h. \end{aligned} \tag{11.13}$$

More specifically, the self-mapping $z_{0}$ can be computed as
$$z_{0}(h):=\mathbb{E}_{p_{\pi}\left(x^{\prime} \mid x\right)}\left[\left(g\left(x, u, x^{\prime}\right)+\gamma h^{\top} \phi\left(x^{\prime}\right)-h^{\top} \phi(x)\right) \phi(x)\right]. \tag{11.14}$$

An SA algorithm for solving the root-finding problem $z_{0}(h)=0$ is given as follows
$$h_{k+1}=h_{k}+\alpha_{k}\left(g\left(x_{k}, u_{k}, x_{k}^{\prime}\right)+\gamma h_{k}^{\top} \phi\left(x_{k}^{\prime}\right)-h_{k}^{\top} \phi\left(x_{k}\right)\right) \phi\left(x_{k}\right), \tag{11.15}$$

This is the original form of the TD learning algorithm with LFA, see Algorithm 14.

[Algorithm 14: TD(0) learning with LFA]
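As a minimal sketch of Algorithm 14, assuming hypothetical `env.reset()`, `env.step(u)`, `policy(x)`, and feature map `phi(x)` interfaces (not specified in the notes), the update (11.15) can be implemented as follows:

import numpy as np

def td0_lfa(env, policy, phi, m, gamma=0.9, num_episodes=500):
    """TD(0) with LFA, following the update rule (11.15).

    Assumed (hypothetical) interfaces: env.reset() -> x, env.step(u) -> (x_next, g, done),
    policy(x) -> u, and phi(x) -> feature vector of length m.
    """
    h = np.zeros(m)
    k = 0
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            u = policy(x)
            x_next, g, done = env.step(u)
            alpha = 1.0 / (k + 1)                                  # Robbins-Monro step size
            delta = g + gamma * h @ phi(x_next) - h @ phi(x)       # TD error
            h = h + alpha * delta * phi(x)                         # update (11.15)
            x = x_next
            k += 1
    return h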
Note that if $\Phi=I_{K}$, i.e., the feature matrix is the $K$-dimensional identity matrix, then the update rule (11.1) simply reduces to the classic $\mathrm{TD}(0)$ algorithm. Finally, the convergence properties of the $\mathrm{TD}(0)$ learning algorithm with LFA follow directly from the convergence theory of SA. More details and discussion are given in Section 11.2.

Theorem 11.1 (Convergence of $\mathrm{TD}(0)$ with LFA)

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let the step size $\alpha_{k}$ satisfy the Robbins-Monro condition. Then the vector $h_{k}$ generated by the $\mathrm{TD}(0)$ learning algorithm with LFA converges with probability 1 to the fixed point of the projected Bellman operator.

11.1.2 Eligibility Traces with Linear Function Approximation

As an interesting side effect of LFA, the $\mathrm{TD}(0)$ learning algorithm has a better chance of avoiding the locality of updates, because the weight vector $h$ is updated globally for all states. In this subsection, we aim to extend the concept of eligibility traces to LFA. Let us recall the $\lambda$-geometric mean Bellman operator; we give the following result without proof.

Proposition 11.1

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, the projected $\lambda$-geometric mean Bellman operator $\Pi_{\pi} \circ \mathrm{T}_{\pi, \lambda}^{\infty}$ is a contraction with modulus $\frac{\gamma(1-\lambda)}{1-\lambda \gamma}$ with respect to the $\xi$-weighted norm.

We apply the same technique as in Section 11.1 to construct the following root-finding problem
$$z_{\lambda}(h):=\Phi \Xi \mathrm{T}_{\pi, \lambda}^{\infty} \Phi^{\top} h-\Phi \Xi \Phi^{\top} h=0 \tag{11.16}$$

The calculation of empirical averages results in
$$\begin{aligned} \Phi \Xi \mathrm{T}_{\pi, \lambda}^{\infty} \Phi^{\top} h-\Phi \Xi \Phi^{\top} h & \approx \frac{1}{k+1} \sum_{t=0}^{k} \phi\left(x_{t}\right) \sum_{m=t}^{k} \gamma^{m-t} \lambda^{m-t} \delta\left(x_{m}, u_{m}, x_{m+1}\right) \\ &=\frac{1}{k+1} \sum_{t=0}^{k} \sum_{m=t}^{k} \gamma^{m-t} \lambda^{m-t} \phi\left(x_{t}\right) \delta\left(x_{m}, u_{m}, x_{m+1}\right) \\ &=\frac{1}{k+1} \sum_{m=0}^{k} \sum_{t=0}^{m} \gamma^{m-t} \lambda^{m-t} \phi\left(x_{t}\right) \delta\left(x_{m}, u_{m}, x_{m+1}\right) \\ &=\frac{1}{k+1} \sum_{m=0}^{k} \delta\left(x_{m}, u_{m}, x_{m+1}\right) \sum_{t=0}^{m} \gamma^{m-t} \lambda^{m-t} \phi\left(x_{t}\right). \end{aligned} \tag{11.17}$$

The trick here is to reorder the double sum over the index set $0 \leq t \leq m \leq k$. Obviously, the third sum is exactly the same quantity as the second one, only the enumeration of the samples differs. Letting the index $k$ tend to infinity, it is clear that, for each sampled trajectory, the vector
$$\varepsilon=\lim _{m \rightarrow \infty} \sum_{t=0}^{m} \gamma^{m-t} \lambda^{m-t} \phi\left(x_{t}\right) \tag{11.18}$$

does not correspond to an individual visited state, as in the tabular case, but only to the time step. It is easy to see that this sum of features can be computed efficiently from the samples; it is exactly the eligibility trace vector. Clearly, the eligibility trace is the discounted sum of the feature vectors encountered as the interaction proceeds. Therefore, we define

$$\epsilon_{k+1}:=\phi\left(x_{k}\right)+\lambda \gamma \epsilon_{k}. \tag{11.19}$$
The $\mathrm{TD}(\lambda)$ learning update with LFA is then given as

$$h_{k+1}=h_{k}+\alpha_{k}\left(g\left(x_{k}, u_{k}, x_{k}^{\prime}\right)+\gamma h_{k}^{\top} \phi\left(x_{k}^{\prime}\right)-h_{k}^{\top} \phi\left(x_{k}\right)\right) \epsilon_{k+1}. \tag{11.20}$$

More details about the algorithm are given in Algorithm 15.
[Algorithm 15: TD(λ) learning with LFA]
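A minimal sketch of Algorithm 15, under the same hypothetical `env`, `policy`, and `phi` interfaces as above, combines the trace recursion (11.19) with the weight update (11.20):

import numpy as np

def td_lambda_lfa(env, policy, phi, m, gamma=0.9, lmbda=0.5, num_episodes=500):
    """TD(lambda) with LFA: trace recursion (11.19) and weight update (11.20)."""
    h = np.zeros(m)
    k = 0
    for _ in range(num_episodes):
        x = env.reset()
        eps = np.zeros(m)                                          # eligibility trace vector
        done = False
        while not done:
            u = policy(x)
            x_next, g, done = env.step(u)
            eps = phi(x) + lmbda * gamma * eps                     # trace recursion (11.19)
            alpha = 1.0 / (k + 1)                                  # Robbins-Monro step size
            delta = g + gamma * h @ phi(x_next) - h @ phi(x)       # TD error
            h = h + alpha * delta * eps                            # update (11.20)
            x = x_next
            k += 1
    return h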

If $\lambda=0$, the $\mathrm{TD}(\lambda)$ learning update with LFA simply degenerates into the classic $\mathrm{TD}(0)$ update with LFA in (11.1). Finally, we state the classical convergence result in the theorem below.

Theorem 11.2 (Convergence of $\mathrm{TD}(\lambda)$ with LFA)

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a fixed policy $\pi$, let the step size $\alpha_{k}$ satisfy the Robbins-Monro condition. Then the vector $h_{k}$ produced by the $\mathrm{TD}(\lambda)$ learning algorithm with LFA converges with probability 1 to the fixed point of the projected $\lambda$-geometric mean Bellman operator $\Pi_{\pi} \circ \mathrm{T}_{\pi, \lambda}^{\infty}$.

11.2 Convergence of TD Learning

In the previous chapter, we introduced the concept of TD learning and its extension with eligibility traces, namely multi-step TD learning. Despite the heuristic nature of these developments, all of these algorithms can ultimately be shown to converge asymptotically to their desired solutions. The main developments in analyzing the convergence properties of TD algorithms are based on stochastic approximation theory, originally proposed by Robbins and Monro. In this section, we present a general framework that has been developed to study the convergence properties of the family of TD learning algorithms.

We adopt the formalism of a stochastic process and study its convergence properties. Specifically, we focus on the following random iterative process on a finite set $\mathcal{X}$,
$$\Delta_{k+1}(x)=\left(1-\alpha_{k}(x)\right) \Delta_{k}(x)+\alpha_{k}(x) R_{k}(x) \tag{11.21}$$

where $x \in \mathcal{X}$ and $\Delta_{k+1}(x) \in \mathbb{R}^{m}$. Let us define

$$\mathcal{R}_{k}=\left\{x_{1}, \alpha_{1}\left(x_{1}\right), R_{1}\left(x_{1}\right), \ldots, x_{k}, \alpha_{k}\left(x_{k}\right)\right\} \tag{11.22}$$

as the history of the stochastic iterative process up to step $k$. The convergence properties of this stochastic iterative process are given in the following theorem.

Theorem 11.3

Under the following assumptions, the stochastic iterative process defined in Equation (11.21) converges to zero with probability 1.
(1) $\sum_{k=1}^{\infty} \alpha_{k}(x)=\infty$ and $\sum_{k=1}^{\infty}\left(\alpha_{k}(x)\right)^{2}<\infty$;

(2) $\left\|\mathbb{E}\left[R_{k}(x) \mid \mathcal{R}_{k}\right]\right\|_{W} \leq \gamma\left\|\Delta_{k}\right\|_{W}$ with $\gamma<1$;

(3) $\operatorname{var}\left[R_{k}(x) \mid \mathcal{R}_{k}\right] \leq C\left(1+\left\|\Delta_{k}\right\|_{W}^{2}\right)$ with $C>0$.

Here, $\left\|\Delta_{k}\right\|_{W}$ represents some appropriate norm.
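For example, the commonly used step size $\alpha_{k}(x)=1 / k$, where $k$ counts the visits to state $x$, satisfies condition (1), since

$$\sum_{k=1}^{\infty} \frac{1}{k}=\infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^{2}}=\frac{\pi^{2}}{6}<\infty.$$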

Remark 11.1. This result can be applied to most TD learning algorithms, with either tabular features or LFA. In the remainder of this section, we employ the theorem to show the asymptotic convergence of the $Q$-learning algorithm.

In short, the asymptotic convergence of the $Q$-learning algorithm shown in Algorithm 12 is given in the following proposition.

Proposition 11.2

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, as long as the learning rate satisfies the Robbins-Monro condition, the $Q$-learning algorithm converges to the optimal state-action value function $Q^{*}$.

Proof.
Let us rewrite the update rule of the $Q$-learning algorithm as

$$Q_{k+1}\left(x_{k}, u_{k}\right)=\left(1-\alpha_{k}\left(x_{k}, u_{k}\right)\right) Q_{k}\left(x_{k}, u_{k}\right)+\alpha_{k}\left(x_{k}, u_{k}\right)\left(g\left(x_{k}, u_{k}, x_{k}^{\prime}\right)+\gamma \min _{u_{k}^{\prime}} Q_{k}\left(x_{k}^{\prime}, u_{k}^{\prime}\right)\right) \tag{11.23}$$

We subtract the term $Q^{*}\left(x_{k}, u_{k}\right)$ from both sides of the equation and then define
$$\Delta_{k}(x, u):=Q_{k}(x, u)-Q^{*}(x, u) \tag{11.24}$$

and

$$R_{k}(x, u):=g\left(x, u, x^{\prime}\right)+\gamma \min _{u^{\prime}} Q_{k}\left(x^{\prime}, u^{\prime}\right)-Q^{*}(x, u) \tag{11.25}$$

Obviously, we have

$$\Delta_{k+1}\left(x_{k}, u_{k}\right)=\left(1-\alpha_{k}\left(x_{k}, u_{k}\right)\right) \Delta_{k}\left(x_{k}, u_{k}\right)+\alpha_{k}\left(x_{k}, u_{k}\right) R_{k}\left(x_{k}, u_{k}\right) \tag{11.26}$$

This is the stochastic iterative process we are interested in.

Now, let $x^{\prime} \in \mathcal{X}$ be a randomly sampled successor state obtained from the MDP model. Then we compute

$$\begin{aligned} \mathbb{E}\left[R_{k}(x, u) \mid \mathcal{R}_{k}\right] &=\mathbb{E}_{p\left(x^{\prime}\right)}\left[g\left(x, u, x^{\prime}\right)+\gamma \min _{u^{\prime}} Q_{k}\left(x^{\prime}, u^{\prime}\right)-Q^{*}(x, u)\right] \\ &=\mathrm{H}_{\mathfrak{g}} Q_{k}(x, u)-Q^{*}(x, u) \end{aligned} \tag{11.27}$$

By the fixed point property of the optimal Bellman operator, i.e., $\mathrm{H}_{\mathfrak{g}} Q^{*}=Q^{*}$, we have

$$\begin{aligned} \left\|\mathbb{E}\left[R_{k}(x, u) \mid \mathcal{R}_{k}\right]\right\|_{\infty} &=\left\|\mathrm{H}_{\mathfrak{g}} Q_{k}(x, u)-\mathrm{H}_{\mathfrak{g}} Q^{*}(x, u)\right\|_{\infty} \\ & \leq\left\|\mathrm{H}_{\mathfrak{g}} Q_{k}-\mathrm{H}_{\mathfrak{g}} Q^{*}\right\|_{\infty} \\ & \leq \gamma\left\|Q_{k}-Q^{*}\right\|_{\infty} \\ &=\gamma\left\|\Delta_{k}\right\|_{\infty} \end{aligned} \tag{11.28}$$

which satisfies condition (2) in Theorem 11.3.

Finally, we get

$$\begin{aligned} \operatorname{var}\left[R_{k}(x, u) \mid \mathcal{R}_{k}\right] &=\mathbb{E}_{p\left(x^{\prime}\right)}\left[\left(R_{k}(x, u)-\mathrm{H}_{\mathfrak{g}} Q_{k}(x, u)+Q^{*}(x, u)\right)^{2}\right] \\ &=\mathbb{E}_{p\left(x^{\prime}\right)}\left[\left(g\left(x, u, x^{\prime}\right)+\gamma \min _{u^{\prime}} Q_{k}\left(x^{\prime}, u^{\prime}\right)-\mathrm{H}_{\mathfrak{g}} Q_{k}(x, u)\right)^{2}\right] \\ &=\operatorname{var}\left[g\left(x, u, x^{\prime}\right)+\gamma \min _{u^{\prime}} Q_{k}\left(x^{\prime}, u^{\prime}\right) \mid \mathcal{R}_{k}\right] \end{aligned} \tag{11.29}$$

Since the cost $g$ and the $Q$-function are both bounded, condition (3) in Theorem 11.3 is also satisfied. Therefore, applying Theorem 11.3 completes the proof.
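To connect the proof with practice, here is a minimal sketch of the tabular update (11.23), assuming hypothetical `env.reset()` and `env.step(u)` interfaces with integer-encoded states and actions (the notes refer to Algorithm 12 for the actual listing):

import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.9, num_steps=100_000, rng=None):
    """Tabular Q-learning for a cost-minimizing MDP, following update (11.23)."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))       # per-pair visit counts for alpha_k
    x = env.reset()
    for _ in range(num_steps):
        u = rng.integers(num_actions)                   # exploratory (uniformly random) behavior
        x_next, g, done = env.step(u)
        visits[x, u] += 1
        alpha = 1.0 / visits[x, u]                      # Robbins-Monro step size per (x, u)
        target = g + gamma * Q[x_next].min()            # min over actions: cost minimization
        Q[x, u] = (1 - alpha) * Q[x, u] + alpha * target   # update (11.23)
        x = env.reset() if done else x_next
    return Q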

11.3 Example: $\text{TD}(\lambda)$ with Eligibility Traces

As shown in the figure below, the benchmark environment is a $4 \times 3$ grid with 11 states and one obstacle. The agent starts from the "Start" state in the lower left corner and stops at one of the two terminal states.
[Figure: the $4 \times 3$ grid world with the start state at the lower left, an obstacle, and two terminal states at the upper right]

There are four available actions: up, down, left, and right. Each action is stochastic: with probability 0.8 the agent moves one step and with probability 0.2 it moves two steps, both in the desired direction. The local cost of every transition is 0.04, and the cost at the terminal states is $\pm 1$.

Task 1: Given a fixed policy, apply $\text{TD}(\lambda)$ to estimate the total cost of all states.
Task 2: Given a randomly generated feature matrix $\Phi$, apply $\text{TD}(\lambda)$ with linear function approximation to estimate the total cost of all states.

11.3.1 $\text{TD}(\lambda)$

import random
import numpy as np

class GridWorld:
    def __init__(self, width=4, height=3, obstacle=[(1,1)]):
        self.width = width
        self.height = height

        self.obstacle = obstacle
        self.terminal = [(0, width-1), (1, width-1)]  # the terminal states are always at right top

        self.row = height - 1  # the start point is always at left bottom
        self.col = 0

        # define the MDP
        self.actions = self.act_space()
        self.states = set(self.actions.keys()) | set(self.terminal)
        self.J = self.init_J()
        self.local_cost = 0.04


    def act_space(self):
        act_space = {}

        for row in range(self.height):
            for col in range(self.width):
                possible_acts = []
                if (row, col) not in self.obstacle + self.terminal:
                    if row - 1 >= 0 and (row-1, col) not in self.obstacle:
                        possible_acts.append('U')
                    if row + 1 < self.height and (row+1, col) not in self.obstacle:
                        possible_acts.append('D')
                    if col - 1 >=0 and (row, col-1) not in self.obstacle:
                        possible_acts.append('L')
                    if col + 1 < self.width and (row, col+1) not in self.obstacle:
                        possible_acts.append('R')
                    act_space[(row, col)] = possible_acts
        return act_space

    def init_J(self, init_J_value=0):
        J = {}
        for row in range(self.height):
            for col in range(self.width):
                if (row, col) not in self.obstacle + self.terminal:
                    J[(row, col)] = init_J_value
        J[self.terminal[0]] = -1  # J(x_N) = g(x_N)
        J[self.terminal[1]] = +1
        return J

    def move(self, action, deterministic=False):
        # check if legal move first
        if action in self.actions[(self.row, self.col)]:
            # probabilistic transition: move one step with probability 0.8,
            # two steps with probability 0.2 (or one step if two steps would leave the grid)
            if action == 'U':
                if deterministic or random.uniform(0, 1) < 0.8 or (self.row-2, self.col) not in self.states:
                    self.row -= 1
                else:
                    self.row -= 2
            elif action == 'D':
                if deterministic or random.uniform(0, 1) < 0.8 or (self.row+2, self.col) not in self.states:
                    self.row += 1
                else:
                    self.row += 2
            elif action == 'R':
                if deterministic or random.uniform(0, 1) < 0.8 or (self.row, self.col+2) not in self.states:
                    self.col += 1
                else:
                    self.col += 2
            elif action == 'L':
                if deterministic or random.uniform(0, 1) < 0.8 or (self.row, self.col-2) not in self.states:
                    self.col -= 1
                else:
                    self.col -= 2

        if (self.row, self.col) == self.terminal[0]:
            return -1
        elif (self.row, self.col) == self.terminal[1]:
            return +1
        else:
            return self.local_cost

    def set_state(self, s):
        self.row = s[0]
        self.col = s[1]

    def current_state(self):
        return (self.row, self.col)

    def game_over(self):
        return (self.row, self.col) not in self.actions

    def print_J(self):
        for row in range(self.height):
            print("---------------------------")
            for col in range(self.width):
                J = self.J.get((row, col), 0)
                if J >= 0:
                    print(" %.2f|" % J, end="")
                else:
                    print("%.2f|" % J, end="")
            print("")
        print("---------------------------")


## Task 1: TD(lambda)
# %% TD

def play_game(grid, policy):
    start_states = list(grid.actions.keys())
    start_idx = np.random.choice(len(start_states))
    grid.set_state(start_states[start_idx])

    # generate traj
    x = grid.current_state()
    traj = [(x,0)]
    while not grid.game_over():
        u = policy[x]
        g = grid.move(u)
        x = grid.current_state()
        traj.append((x, g))  # save the trajectory
    return traj

# define the constants
gamma = 0.9
alpha = 0.1
lmbda = 0.5

env = GridWorld()

policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'R',
    (2, 1): 'R',
    (2, 2): 'R',
    (2, 3): 'U',
}

for _ in range(200):
    # get the random trajectory from the policy
    traj_all = play_game(env, policy)

    # updates the total cost for each state included by the trajectory
    for idx in range(len(traj_all) - 1):
        # get current state and the successive state as well as the state cost
        x, _ = traj_all[idx]
        x_, g = traj_all[idx+1]

        # Compute the TD error of the current state
        TD_err = g + gamma * env.J[x_] - env.J[x]

        # get the eligibility trace, i.e. trajectory until state x
        e_trace = traj_all[:idx+1]

        # update the total cost for each former state
        for e_idx in range(len(e_trace)):
            x_e_trace, _ = e_trace[e_idx]
            e_x = (gamma * lmbda) ** (idx - e_idx)
            env.J[x_e_trace] = env.J[x_e_trace] + alpha * e_x * TD_err

print("\nTotal cost by TD(lambda)")
env.print_J()

The output is


Total cost by TD(lambda)
---------------------------
-1.63|-1.72|-1.90|-1.00|
---------------------------
-1.29| 0.00| 1.71| 1.00|
---------------------------
-1.13|-0.51|-0.48|-0.63|
---------------------------

11.3.2 $\text{TD}(\lambda)$ with LFA


## Task 2: TD(lambda) with LFA
m = 7
K = len(env.states)

Phi = np.random.rand(m, K)  # Feature matrix
h = np.random.rand(m, 1)  # weight vector

# build the state index to retrieve the corresponding feature matrix
state_idx = {}
for i, key in enumerate(env.states):
    state_idx[key] = i

# initialize the world
env = GridWorld()

# TD-lambda
for _ in range(200):
    traj_all = play_game(env, policy)
    for idx in range(len(traj_all)-1):
        # get current state and the successive state as well as the state cost
        x, _ = traj_all[idx]
        x_, g = traj_all[idx+1]

        # compute TD error delta
        delta = g + gamma * h.T @ Phi[:, state_idx[x_]] - h.T @ Phi[:, state_idx[x]]

        # rebuild the eligibility trace vector from the trajectory up to state x, cf. (11.18)
        e_trace_traj = traj_all[:idx+1]
        e_vector = np.zeros(m)
        for e_idx in range(idx+1):
            x_e, _ = e_trace_traj[e_idx]

            # accumulate the discounted feature vectors of the visited states, cf. (11.19)
            e_vector = Phi[:, state_idx[x_e]] + lmbda * gamma * e_vector

        # update the weight vector once per transition with the full trace, cf. (11.20)
        h = h + alpha * delta[0] * e_vector.reshape(m, 1)

J_pi = Phi.T @ h

for key in state_idx:
    env.J[key] = float(J_pi[state_idx[key]])

# reset the terminal state
env.J[(0,3)] = -1
env.J[(1,3)] = 1

print("\nTotal cost by TD lambda with LFA")
env.print_J()

The output is


Total cost by TD lambda with LFA
---------------------------
-1.79|-2.31|-2.19|-1.00|
---------------------------
-0.94| 0.00| 0.70| 1.00|
---------------------------
-0.92| 0.06|-1.36|-1.13|
---------------------------



Origin blog.csdn.net/qq_37266917/article/details/122660270