ADPRL - Approximate Dynamic Programming and Reinforcement Learning - Note 12 - Numerical Temporal Difference Learning (Numerical TD Learning)

Note 12 Numerical TD Learning

As discussed in the previous two chapters, TD learning is a theoretically sound sampling-based mechanism for overcoming the curse of modeling. In the LFA setting, a common practice in DP is to use the policy iteration framework to obtain an optimal policy: the TD algorithm with LFA is used to evaluate the total cost of a given policy, and a policy improvement step then completes one sweep of the sampling-based PI framework.

12.1 A Brief Description of Off-Policy Learning

An obvious risk of the sampling-based PI algorithm with LFA is that it may not converge at all, or may not converge quickly enough to a useful region. A practical need in RL is therefore to exploit the interactions sampled while following one given policy in order to evaluate the total cost of a different policy. Such tasks are called off-policy learning. More specifically, given an $\operatorname{MDP}(\mathcal{X}, \mathcal{U}, g, p, \gamma)$ and a so-called behavior policy $\pi_{b}$, the task of off-policy learning is to evaluate the total cost of another policy $\pi_{t}$, called the target policy. As the counterpart of off-policy learning, so-called on-policy learning refers to RL algorithms that estimate the total cost function of the same policy that generates the samples.

[Reinforcement Learning (4) - Monte Carlo Methods and Examples]
…The only general way to ensure that all actions are selected infinitely often is to have the agent continue to select them. There are two approaches to ensure this, leading to what we call on-policy methods and off-policy methods. The on-policy method attempts to evaluate or improve the very policy that is used to make decisions, while the off-policy method evaluates or improves a policy different from the one used to generate the data…
In on-policy learning the target policy and the behavior policy are the same policy, which has the benefit of simplicity: the policy can be optimized directly from the collected data. However, this tends to drive the policy toward a local optimum, because an on-policy method cannot maintain exploration and exploitation at the same time. Off-policy learning separates the target policy $\pi_{t}$ from the behavior policy $\pi_{b}$, so it can search for the global optimum while maintaining exploration…

Let us recall the definition of the total cost function of a policy in Equation (3.4). Off-policy learning can clearly be viewed as a distribution-mismatch problem: we need interactions sampled from the distribution induced by the target policy $\pi_{t}$, while the available trajectories are drawn from the distribution induced by the behavior policy $\pi_{b}$. Importance sampling is the conventional tool for dealing with distribution mismatch. Consider the task of estimating the expected value of a random variable $x$ distributed according to $\mu$, using samples drawn from another distribution $\mu^{\prime}$. If $\mu^{\prime}(x)>0$ for all $x$, it is easy to see that

$$\begin{aligned} \underset{x \sim \mu}{\mathbb{E}}[x] &=\int_{\mathcal{X}} x \mu(x)\, d x \\ &=\int_{\mathcal{X}} x \frac{\mu(x)}{\mu^{\prime}(x)} \mu^{\prime}(x)\, d x \\ &=\underset{x \sim \mu^{\prime}}{\mathbb{E}}\left[\frac{\mu(x)}{\mu^{\prime}(x)} x\right] \end{aligned} \tag{12.1}$$

Let us express the ratio of the two density functions as

$$\psi(x)=\frac{\mu(x)}{\mu^{\prime}(x)} \tag{12.2}$$
The expected value of the random variable $x$ can then be approximated by the empirical average

$$\underset{x \sim \mu}{\mathbb{E}}[x] \approx \frac{1}{N} \sum_{i=1}^{N} \psi\left(x_{i}\right) x_{i} \tag{12.3}$$
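
As a minimal numerical illustration of Equations (12.2)-(12.3), the following Python sketch estimates the mean of a Gaussian target distribution from samples drawn under a different Gaussian; both densities and the sample size are hypothetical choices made only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, std):
    # Density of a univariate Gaussian, used for both mu and mu'.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical target density mu = N(1, 1) and sampling density mu' = N(0, 2).
N = 100_000
x = rng.normal(loc=0.0, scale=2.0, size=N)              # samples drawn from mu'

psi = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance ratio (12.2)
estimate = np.mean(psi * x)                             # empirical average (12.3)
print(estimate)                                         # close to E_{x ~ mu}[x] = 1.0
```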

Obviously, applying importance sampling to a specific MDP with target policy $\pi_{t}$ and behavior policy $\pi_{b}$ requires that the behavior policy $\pi_{b}$ has the same action coverage as the target policy. With a slight abuse of notation, if we regard the policies as conditional distributions $\pi_{t}(u \mid x)$ and $\pi_{b}(u \mid x)$, we can define

$$\psi(x, u)=\frac{\pi_{t}(u \mid x)}{\pi_{b}(u \mid x)} \tag{12.4}$$

We can then use importance sampling to approximate the total cost function of the target policy as follows

$$\begin{aligned} J^{\pi_{t}}(x) &=\mathbb{E}_{p_{\pi_{b}}\left(x^{\prime} \mid x\right)}\left[\psi(x, u)\left(g\left(x, \pi_{b}(x), x^{\prime}\right)+\gamma J^{\pi_{b}}\left(x^{\prime}\right)\right)\right] \\ & \approx \frac{1}{N} \sum_{i=1}^{N} \psi(x, u)\left(g\left(x, \pi_{b}(x), x^{\prime}\right)+\gamma J^{\pi_{b}}\left(x^{\prime}\right)\right) \end{aligned} \tag{12.5}$$

Following the same derivation as for TD learning, an off-policy TD(0) algorithm is obtained with the following update rule

$$J_{k+1}(x)=J_{k}(x)+\alpha_{k}\, \psi(x, u)\left(g\left(x, u, x^{\prime}\right)+\gamma J_{k}\left(x^{\prime}\right)-J_{k}(x)\right) \tag{12.6}$$
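
The update rule (12.6) can be turned into a tabular procedure as in the following sketch; the environment interface `env.step`, the array-based policy representations `pi_b` and `pi_t`, and the step-size schedule are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def off_policy_td0(env, pi_b, pi_t, gamma, n_steps, alpha0=0.1, seed=0):
    """Tabular off-policy TD(0), cf. Equation (12.6).

    pi_b[x] and pi_t[x] are assumed to be arrays of action probabilities;
    env.step(x, u) is assumed to return (stage_cost, next_state).
    """
    rng = np.random.default_rng(seed)
    J = np.zeros(env.num_states)
    x = env.reset()
    for k in range(n_steps):
        u = rng.choice(env.num_actions, p=pi_b[x])      # act with the behavior policy
        g, x_next = env.step(x, u)
        psi = pi_t[x][u] / pi_b[x][u]                   # importance ratio (12.4)
        alpha = alpha0 / (1.0 + k / 1000.0)             # hypothetical step-size schedule
        # importance-weighted TD(0) update, Equation (12.6)
        J[x] += alpha * psi * (g + gamma * J[x_next] - J[x])
        x = x_next
    return J
```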

Although the development of off-policy TD follows exactly the same philosophy as the original TD algorithm, its practical use turns out to be rather limited. In particular, off-policy TD algorithms often fail to converge. This phenomenon is commonly attributed to the so-called deadly triad: function approximation, bootstrapping, and off-policy learning.

A true stochastic gradient descent algorithm minimizing the off-policy MSPBE was the first successful attempt to break the deadly triad. For a given target policy $\pi_{t}$ and behavior policy $\pi_{b}$ traversing the MDP, let $\xi_{t}$ and $\xi_{b}$ denote the steady-state distributions of $\pi_{t}$ and $\pi_{b}$, respectively. The off-policy MSPBE function is then defined as

$$f_{t}(h)=\left\|\Pi_{\pi_{t}} \mathrm{T}_{\pi_{t}} \Phi^{\top} h-\Phi^{\top} h\right\|_{\xi_{t}}^{2}, \tag{12.7}$$

where $\Xi_{t}=\Xi_{b} \Psi$ with $\Psi=\operatorname{diag}\left(\psi\left(x_{1}\right), \ldots, \psi\left(x_{K}\right)\right)$, and the orthogonal projector $\Pi_{\pi_{t}}$ can be expressed as

$$\begin{aligned} \Pi_{\pi_{t}} &=\Phi^{\top}\left(\Phi \Xi_{t} \Phi^{\top}\right)^{-1} \Phi \Xi_{t} \\ &=\Phi^{\top}\left(\Phi \Xi_{b} \Psi \Phi^{\top}\right)^{-1} \Phi \Xi_{b} \Psi \end{aligned} \tag{12.8}$$
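
To make the quantities in (12.7)-(12.8) concrete, the sketch below evaluates the off-policy MSPBE directly from (assumed known) model quantities; the shapes follow the convention that $\Phi$ is $m \times K$ with columns $\phi(x_k)$, and the state-wise ratios `psi` reflect the abuse of notation in $\Psi$.

```python
import numpy as np

def off_policy_mspbe(h, Phi, xi_b, psi, P_t, G_t, gamma):
    """Evaluate the off-policy MSPBE (12.7) using the projector (12.8).

    Phi: (m, K) feature matrix with columns phi(x_k); xi_b, psi: length-K
    vectors (behavior steady-state distribution and importance ratios);
    P_t, G_t: transition matrix and stage-cost vector of the target policy.
    """
    Xi_t = np.diag(xi_b * psi)                     # Xi_t = Xi_b * Psi
    V = Phi.T @ h                                  # approximate total cost Phi^T h
    TV = G_t + gamma * P_t @ V                     # Bellman operator T_{pi_t} applied to V
    # Orthogonal projector onto the LFA space w.r.t. the xi_t-weighted norm (12.8)
    Pi_t = Phi.T @ np.linalg.solve(Phi @ Xi_t @ Phi.T, Phi @ Xi_t)
    r = Pi_t @ TV - V                              # projected Bellman residual
    return float(r @ Xi_t @ r)                     # squared xi_t-weighted norm (12.7)
```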

For simplicity, we derive the GTD algorithm that minimizes the on-policy MSPBE function.

12.2 Gradient TD Learning

Define $\delta(h):=\mathrm{T}_{\pi} \Phi^{\top} h-\Phi^{\top} h$. The MSPBE function can then be written as

$$\begin{aligned} \left\|\Pi_{\pi} \mathrm{T}_{\pi} \Phi^{\top} h-\Phi^{\top} h\right\|_{\xi}^{2} &=(\delta(h))^{\top} \Pi_{\pi}^{\top} \Xi \Pi_{\pi} \delta(h) \\ &=(\delta(h))^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \end{aligned}\tag{12.9}$$

The gradient of the MSPBE function (up to a constant factor) is then

$$\begin{aligned} \nabla f(h) &=\left(\nabla_{h}\, \delta(h)\right)^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \\ &=\left(\gamma P_{\pi} \Phi^{\top}-\Phi^{\top}\right)^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \\ &=\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h)-\Phi \Xi \delta(h) \end{aligned} \tag{12.10}$$

Since the MSPBE function is strongly convex, it has a unique global minimum, which satisfies the critical-point condition $\nabla f(h)=0$. Equivalently, the global minimum is characterized by the following equation in the unknown $h$:

$$\left(\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h)=\left(\Phi \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \tag{12.11}$$

Obviously, since we would need to compute the inverses of $\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}$ and $\Phi \Xi \Phi^{\top}$, the classical stochastic approximation technique fails here. To alleviate this difficulty, we introduce an auxiliary variable $\omega$ as

$$\omega:=\left(\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top}\right)^{-1} \Phi \Xi \delta(h) \tag{12.12}$$

Combined with the critical-point condition (12.11), this definition leads to the constraint

$$\Phi \Xi \delta(h)=\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top} \omega=\Phi \Xi \Phi^{\top} \omega . \tag{12.13}$$

The critical-point condition is thus equivalent to the following system of equations:

$$\left\{\begin{array}{lll}\Phi \Xi \delta(h)-\gamma \Phi P_{\pi}^{\top} \Xi \Phi^{\top} \omega & = & 0 \\ \Phi \Xi \delta(h)-\Phi \Xi \Phi^{\top} \omega & = & 0\end{array}\right. \tag{12.14}$$

We define the TD error as
$$\delta_{h}\left(x_{k}, u_{k}, x_{k}^{\prime}\right):=g\left(x_{k}, u_{k}, x_{k}^{\prime}\right)+\gamma h^{\top} \phi\left(x_{k}^{\prime}\right)-h^{\top} \phi\left(x_{k}\right) \tag{12.15}$$

Applying the stochastic approximation recipe to the system (12.14) with a single sample per step, we obtain the coupled updates

$$\left\{\begin{array}{l} h_{k+1}=h_{k}+\alpha_{k}\left(\delta_{h}\left(x_{k}, u_{k}, x_{k}^{\prime}\right) \phi\left(x_{k}\right)-\gamma\, \omega_{k}^{\top} \phi\left(x_{k}\right) \phi\left(x_{k}^{\prime}\right)\right) \\ \omega_{k+1}=\omega_{k}+\alpha_{k}\left(\delta_{h}\left(x_{k}, u_{k}, x_{k}^{\prime}\right)-\omega_{k}^{\top} \phi\left(x_{k}\right)\right) \phi\left(x_{k}\right)\end{array}\right. \tag{12.16}$$

Note that this SA algorithm coincides with the TDC algorithm.
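
A minimal sketch of the coupled updates (12.16), i.e., TDC/GTD(0) with LFA, is given below; the transition format, the feature map `phi`, and the constant step sizes are assumptions made for illustration.

```python
import numpy as np

def gtd0_with_lfa(transitions, phi, m, gamma, alpha=0.01, beta=0.01):
    """GTD(0)/TDC with LFA, cf. Equations (12.15)-(12.16).

    transitions: iterable of (x, u, g, x_next) samples; phi(x) is assumed to
    return a length-m feature vector.
    """
    h = np.zeros(m)    # main weight vector
    w = np.zeros(m)    # auxiliary weight vector
    for (x, u, g, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        delta = g + gamma * (h @ f_next) - h @ f                 # TD error (12.15)
        h = h + alpha * (delta * f - gamma * (w @ f) * f_next)   # first update in (12.16)
        w = w + beta * (delta - w @ f) * f                       # second update in (12.16)
    return h
```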

Theorem 12.1 (Convergence of GTD(0) with LFA).

Let the step sizes $\alpha_{k}$ satisfy the Robbins-Monro conditions. Then the sequence of vectors $h_{k}$ produced by the GTD(0) algorithm with LFA converges with probability 1 to the fixed point of the projected Bellman operator.

Remark 12.1 Difficulty in practical convergence

Although asymptotic convergence theorems dominate the theoretical analysis of RL, stochastic approximation has a significant practical drawback: its convergence behavior depends strongly on the construction of the step-size sequence, and asymptotic convergence is, by definition, only reached in the limit. Advanced numerical methods, such as the stochastic Nesterov accelerated gradient algorithm, have been developed to address this. In the next subsection, we discuss an alternative numerical method that solves the PE problem equally well.

12.3 Least Squares TD Learning

As discussed in the previous section, one of the most challenging technical issues of TD or GTD learning algorithms is the fragile asymptotic convergence they inherit from SA methods. Let us take a closer look at the PE problem with LFA, as shown in Figure 18.


Figure 18: Geometry of policy evaluation with LFA.

Instead of using numerical optimization to find the fixed point $\Phi^{\top} h^{*}$ of the projected Bellman operator $\Pi_{\pi} \mathrm{T}_{\pi}$, we characterize the fixed point directly.

As shown in Figure 18, the residual vector $\mathrm{T}_{\pi} \Phi^{\top} h^{*}-\Phi^{\top} h^{*}$ is orthogonal to the approximation space $\mathcal{J}$ with respect to the inner product $\langle\cdot, \cdot\rangle_{\xi}$. In other words, we have

$$\Phi \Xi\left(\mathrm{T}_{\pi} \Phi^{\top} h^{*}-\Phi^{\top} h^{*}\right)=0 \tag{12.17}$$

Applying the compact expression of the Bellman operator, $\mathrm{T}_{\pi} \Phi^{\top} h:=G_{\pi}+\gamma P_{\pi} \Phi^{\top} h$, we finally obtain

$$\Phi \Xi\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} h^{*}=\Phi \Xi G_{\pi} . \tag{12.18}$$

Simply put, the task is now to solve the above system of linear equations in $h$, that is, $A h=b$ with

$$\left\{\begin{aligned} A &=\Phi \Xi\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top} \\ b &=\Phi \Xi G_{\pi} \end{aligned}\right. \tag{12.19}$$

Under the assumption $\operatorname{rk}(\Phi)=m$, it is easy to see that the linear system has a unique solution, namely

$$h^{*}=\left(\Phi \Xi\left(I_{K}-\gamma P_{\pi}\right) \Phi^{\top}\right)^{-1} \Phi \Xi G_{\pi} . \tag{12.20}$$

To enable model-free online learning, we rewrite these quantities in expectation form as

$$\left\{\begin{aligned} A &=\mathbb{E}_{p_{\pi}\left(x^{\prime} \mid x\right)}\left[\phi(x)\left(\phi(x)-\gamma \phi\left(x^{\prime}\right)\right)^{\top}\right] \\ b &=\mathbb{E}_{p_{\pi}\left(x^{\prime} \mid x\right)}\left[g\left(x, u, x^{\prime}\right) \phi(x)\right] \end{aligned}\right. \tag{12.21}$$

By taking the empirical averages of the two expectations above, a sampling-based realization of the solution $h^{*}$ given in Equation (12.20) can be written as

$$h_{k+1}=\left(\sum_{i=1}^{k} \phi\left(x_{i}\right)\left(\phi\left(x_{i}\right)-\gamma \phi\left(x_{i}^{\prime}\right)\right)^{\top}\right)^{-1}\left(\sum_{i=1}^{k} g\left(x_{i}, u_{i}, x_{i}^{\prime}\right) \phi\left(x_{i}\right)\right) \tag{12.22}$$
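
A batch implementation of Equation (12.22) could look as follows; the transition format, the feature map, and the small ridge term guarding against a singular matrix are assumptions made for illustration.

```python
import numpy as np

def lstd(transitions, phi, m, gamma, reg=1e-6):
    """Batch LSTD with LFA, cf. Equations (12.21)-(12.22).

    transitions: list of (x, u, g, x_next) samples collected under the
    evaluated policy; phi(x) is assumed to return a length-m feature vector.
    """
    A_hat = np.zeros((m, m))
    b_hat = np.zeros(m)
    for (x, u, g, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        A_hat += np.outer(f, f - gamma * f_next)   # phi(x) (phi(x) - gamma phi(x'))^T
        b_hat += g * f                             # g(x, u, x') phi(x)
    # A small regularization keeps A_hat invertible for short trajectories.
    return np.linalg.solve(A_hat + reg * np.eye(m), b_hat)
```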

This update scheme is called Least Squares Temporal Difference (LSTD) learning. Obviously, the bottleneck of the LSTD algorithm is the repeated inversion of a square matrix. To reduce this computational burden, we can use the Sherman-Morrison formula to maintain the inverse of the matrix via rank-one updates.

Proposition 12.1 (Sherman-Morrison formula)

Let $A$ be an invertible square matrix and let $u, v$ be column vectors. Assume $1+v^{\top} A^{-1} u \neq 0$. Then the inverse of the rank-one update $A+u v^{\top}$ is given by

$$\left(A+u v^{\top}\right)^{-1}=A^{-1}-\frac{A^{-1} u v^{\top} A^{-1}}{1+v^{\top} A^{-1} u} \tag{12.23}$$
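
A quick numerical check of Proposition 12.1 on random data (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
A = rng.normal(size=(m, m)) + m * np.eye(m)      # well-conditioned invertible matrix
u = rng.normal(size=m)
v = rng.normal(size=m)

A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u                      # assumed nonzero
direct = np.linalg.inv(A + np.outer(u, v))       # inverse of the rank-one update
sherman = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / denom   # formula (12.23)
print(np.allclose(direct, sherman))              # True
```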

Obviously, the LSTD algorithm is not an SA algorithm but a pure Monte Carlo algorithm. Since the $1/k$ factors in the arithmetic means of $A$ and $b$ cancel, the LSTD update in Equation (12.22) does not require any tuning of hyperparameters or learning rates. Nonetheless, the performance of the LSTD algorithm is more strongly affected by the properties of the LFA space discussed in Section 8.2.

Thus, with the help of the Sherman-Morrison formula, we can derive a recursive LSTD algorithm with LFA; a sketch of such a recursion is given below.
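Since the algorithm listing is not reproduced here, the following sketch shows one way a recursive LSTD with LFA could be implemented, maintaining the inverse $A_k^{-1}$ directly through Sherman-Morrison rank-one updates; the variable names and the initialization constant `eps` are assumptions.

```python
import numpy as np

def recursive_lstd(transitions, phi, m, gamma, eps=1e3):
    """Recursive LSTD with LFA via Sherman-Morrison rank-one updates.

    B tracks the inverse of the accumulated matrix A_k in Equation (12.22);
    it is initialized as eps * I so that the first inverse exists (a common
    regularization choice).
    """
    B = eps * np.eye(m)     # running estimate of A_k^{-1}
    b = np.zeros(m)         # running sum of g_i * phi(x_i)
    h = np.zeros(m)
    for (x, u, g, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        uvec = f                          # A_k = A_{k-1} + uvec vvec^T
        vvec = f - gamma * f_next
        Bu = B @ uvec
        vB = vvec @ B
        B = B - np.outer(Bu, vB) / (1.0 + vvec @ Bu)   # Sherman-Morrison (12.23)
        b = b + g * f
        h = B @ b                         # current LSTD estimate, cf. (12.22)
    return h
```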


Origin blog.csdn.net/qq_37266917/article/details/122757971