【RL-Notes】The Dynamic Programming Algorithm

DP Algorithms

DP algorithms rest on the principle of optimality, which can be stated roughly as follows:
Let $\{u_0^*, \dots, u_{N-1}^*\}$ be an optimal control sequence, which together with $x_0$ determines the corresponding state sequence $\{x_1^*, \dots, x_N^*\}$ via the system equation $x_{k+1}=f_k(x_k, u_k)$. Consider the subproblem whereby we start at $x_k^*$ at time $k$ and wish to minimize the cost-to-go from time $k$ to time $N$:
$$g_k(x_k^*, u_k)+\sum_{m=k+1}^{N-1}g_m(x_m, u_m)+g_N(x_N)$$
over $\{u_k, \dots, u_{N-1}\}$ with $u_m\in U_m(x_m)$, $m=k, \dots, N-1$. Then the truncated optimal control sequence $\{u_k^*, \dots, u_{N-1}^*\}$ is optimal for this subproblem.

The DP algorithm is based on this idea: it proceeds sequentially, solving all tail subproblems of a given time length by using the solutions of the tail subproblems of shorter time length.

Optimal Sequence

The DP algorithm constructs the cost-to-go functions sequentially, proceeding backward in time:
$$J_N^*(x_N),\ J_{N-1}^*(x_{N-1}),\ \dots,\ J_0^*(x_0)$$

DP Algorithm for Deterministic Finite Horizon Problems

Start with
$$J_N^*(x_N)=g_N(x_N)$$
for all $x_N$, and for $k=0, \dots, N-1$, let
$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)}\bigg[g_k(x_k, u_k)+J_{k+1}^*\big(f_k(x_k, u_k)\big)\bigg]$$
The key fact about the DP algorithm is that for every initial state $x_0$, the number $J_0^*(x_0)$ obtained at the last step is equal to the optimal cost $J^*(x_0)$. More generally, for all states $x_k$ at time $k$, we have
$$J_k^*(x_k)=\min_{u_m\in U_m(x_m),\, m=k,\dots,N-1} J(x_k; u_k, \dots, u_{N-1})$$
where
$$J(x_k; u_k, \dots, u_{N-1})=g_N(x_N)+\sum_{m=k}^{N-1}g_m(x_m, u_m)$$
That is, $J_k^*(x_k)$ is the optimal cost of the $(N-k)$-stage tail subproblem that starts at state $x_k$ and time $k$, and ends at time $N$.

Proof:
We argue by induction. The assertion holds for $k=N$ in view of the initial condition $J_N^*(x_N)=g_N(x_N)$. Assuming it holds for index $k+1$, we have for index $k$:
$$
\begin{aligned}
J_k^*(x_k)&=\min_{u_m\in U_m(x_m),\, m=k,\dots, N-1}\bigg[g_N(x_N)+\sum_{m=k}^{N-1}g_m(x_m, u_m)\bigg]\\
&=\min_{u_k\in U_k(x_k)}\bigg[g_k(x_k, u_k)+\min_{u_m\in U_m(x_m),\, m=k+1,\dots, N-1}\bigg[g_N(x_N)+\sum_{m=k+1}^{N-1} g_m(x_m, u_m)\bigg]\bigg]\\
&=\min_{u_k\in U_k(x_k)}\bigg[g_k(x_k, u_k)+J_{k+1}^*\big(f_k(x_k, u_k)\big)\bigg]
\end{aligned}
$$
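
As a concrete illustration, here is a minimal sketch of the backward recursion in Python for a small problem with finite state and control spaces; the problem data (`states`, `controls`, `f`, `g`, `g_N`, and the horizon `N`) are made-up placeholders, not taken from the text.

```python
# A minimal sketch of the backward DP recursion for a deterministic
# finite-horizon problem with finite state and control spaces.
# All problem data below (states, controls, f, g, g_N, N) are hypothetical.

N = 3                                         # horizon
states = [0, 1, 2]                            # possible values of x_k
controls = {x: [-1, 0, 1] for x in states}    # U_k(x_k), here independent of k

def f(k, x, u):
    """System equation x_{k+1} = f_k(x_k, u_k); keep the state in range."""
    return max(0, min(2, x + u))

def g(k, x, u):
    """Stage cost g_k(x_k, u_k)."""
    return x**2 + u**2

def g_N(x):
    """Terminal cost g_N(x_N)."""
    return 2 * x**2

# J[k][x] stores J_k^*(x); start from J_N^*(x_N) = g_N(x_N) and go backward.
J = [dict() for _ in range(N + 1)]
for x in states:
    J[N][x] = g_N(x)

for k in range(N - 1, -1, -1):
    for x in states:
        J[k][x] = min(g(k, x, u) + J[k + 1][f(k, x, u)] for u in controls[x])

print(J[0])   # J_0^*(x_0), the optimal cost for every initial state x_0
```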

Once the functions $J_0^*, \dots, J_N^*$ have been obtained, an optimal control sequence $\{u_0^*, \dots, u_{N-1}^*\}$ can be constructed by the following forward algorithm:
Construction of Optimal Control Sequence:
Set
$$u_0^*\in \arg\min_{u_0\in U_0(x_0)}\bigg[g_0(x_0, u_0)+J_1^*\big(f_0(x_0, u_0)\big)\bigg]$$
and
$$x_1^*=f_0(x_0, u^*_0)$$
Sequentially, for $k=1, 2, \dots, N-1$, set
$$u_k^*\in\arg\min_{u_k\in U_k(x_k^*)}\bigg[g_k(x_k^*, u_k)+J_{k+1}^*\big(f_k(x_k^*, u_k)\big)\bigg]$$
and
$$x_{k+1}^*=f_k(x_k^*, u_k^*)$$
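
Continuing the sketch above (same hypothetical problem data, with the tables `J[k]` already computed), the forward pass is a direct transcription of these equations:

```python
# Forward construction of {u_0^*, ..., u_{N-1}^*} and {x_1^*, ..., x_N^*},
# reusing the tables J[k] from the backward sketch above.

x = 0                                   # initial state x_0
u_opt, x_traj = [], [x]
for k in range(N):
    # u_k^* minimizes g_k(x_k^*, u_k) + J_{k+1}^*(f_k(x_k^*, u_k))
    u_star = min(controls[x], key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
    u_opt.append(u_star)
    x = f(k, x, u_star)                 # x_{k+1}^* = f_k(x_k^*, u_k^*)
    x_traj.append(x)

print(u_opt, x_traj)
```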

Approximation in Value Space

The optimal control sequence construction is possible only after we have computed $J_k^*(x_k)$ by DP for all $x_k$ and $k$, but this is often prohibitively time-consuming. A similar forward algorithmic process can be used if the optimal cost-to-go functions $J_k^*$ are replaced by some approximations $\tilde{J}_k$.
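
As a sketch, the forward pass above can be run with any cheap substitute for $J_{k+1}^*$; the particular `J_tilde` below is an arbitrary made-up heuristic (not from the text), used only to show where the approximation plugs in:

```python
# One-step lookahead with an approximate cost-to-go function J_tilde
# in place of J_{k+1}^* (same hypothetical problem data as above).

def J_tilde(k, x):
    # Any cheap approximation of J_k^*(x): a heuristic, a rollout estimate,
    # or a trained model; here just an arbitrary quadratic guess.
    return 1.5 * x**2

def lookahead_control(k, x):
    return min(controls[x], key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))
```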

Q-Factors and Q-Learning

Given approximations $\tilde{J}_{k+1}$ of the cost-to-go functions, define the approximate Q-factors
$$\tilde{Q}_k(x_k, u_k)=g_k(x_k, u_k)+\tilde{J}_{k+1}\big(f_k(x_k, u_k)\big)$$
The associated computation of the approximately optimal control can be done through the Q-factor minimization:
$$\tilde{u}_k\in\arg\min_{u_k\in U_k(\tilde{x}_k)}\tilde{Q}_k(\tilde{x}_k, u_k)$$
This suggests the possibility of using Q-factors in place of cost functions in approximation in value space schemes. Methods of this type use as their starting point an alternative form of the DP algorithm which, instead of the optimal cost-to-go functions $J_k^*$, generates the optimal Q-factors, defined for all pairs $(x_k, u_k)$ and $k$ by
$$Q^*_k(x_k, u_k)=g_k(x_k, u_k)+J_{k+1}^*\big(f_k(x_k, u_k)\big)$$
Thus the optimal Q-factors are simply the expressions that are minimized on the right-hand side of the DP equation. The optimal cost functions $J_k^*$ can be recovered from the optimal Q-factors $Q_k^*$ by means of
$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)} Q_k^*(x_k, u_k)$$
Moreover, using this relation, the DP algorithm can be written in an essentially equivalent form that involves Q-factors only:
$$Q_k^*(x_k, u_k)=g_k(x_k, u_k)+\min_{u_{k+1}\in U_{k+1}(f_k(x_k, u_k))} Q_{k+1}^*\big(f_k(x_k, u_k), u_{k+1}\big)$$
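
The Q-factor form of the recursion can be sketched the same way, again with the hypothetical problem data from the earlier sketches; note that the tail minimization is exactly the recovery relation $J_{k+1}^*(x_{k+1})=\min_{u_{k+1}} Q_{k+1}^*(x_{k+1}, u_{k+1})$:

```python
# Backward recursion written in terms of Q-factors only
# (same hypothetical problem data as in the earlier sketches).

Q = [dict() for _ in range(N)]
for k in range(N - 1, -1, -1):
    for x in states:
        for u in controls[x]:
            x_next = f(k, x, u)
            if k == N - 1:
                tail = g_N(x_next)   # J_N^*(x_N) = g_N(x_N)
            else:                    # J_{k+1}^*(x_next) = min over u' of Q_{k+1}^*(x_next, u')
                tail = min(Q[k + 1][(x_next, u1)] for u1 in controls[x_next])
            Q[k][(x, u)] = g(k, x, u) + tail

# Recover J_0^*(x_0) = min_u Q_0^*(x_0, u); this matches J[0] from the first sketch.
print({x: min(Q[0][(x, u)] for u in controls[x]) for x in states})
```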

Stochastic Dynamic Programming

The system now includes a random "disturbance" $w_k$, which is characterized by a probability distribution $P_k(\cdot\mid x_k, u_k)$ that may depend explicitly on $x_k$ and $u_k$, but not on the values of prior disturbances $w_{k-1}, \dots, w_0$. The system has the form
$$x_{k+1}=f_k(x_k, u_k, w_k), \quad k=0, 1, \dots, N-1$$

An important difference is that we optimize not over control sequences $\{u_0, \dots, u_{N-1}\}$, but rather over policies that consist of a sequence of functions
$$\pi=\{\mu_0, \dots, \mu_{N-1}\}$$
where $\mu_k$ maps states $x_k$ into controls $u_k=\mu_k(x_k)$ and satisfies the control constraints. Policies are more general objects than control sequences, and in the presence of stochastic uncertainty, they can result in improved cost, since they allow choices of controls $u_k$ that incorporate knowledge of the state $x_k$. Without this knowledge, the controller cannot adapt appropriately to unexpected values of the state, and as a result the cost can be adversely affected.
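
For the stochastic case, a minimal sketch of the backward recursion now takes an expectation over $w_k$ and records the minimizing control as a policy $\mu_k$; the disturbance distribution and costs below are made up for illustration, and the state/control sets are reused from the earlier sketches:

```python
# Stochastic DP sketch: the recursion averages over w_k and stores the
# minimizing control as a policy mu_k(x_k). Disturbance values/probabilities
# and costs are hypothetical; states, controls, N, g_N reuse the earlier sketch.

w_vals, w_probs = [-1, 0, 1], [0.25, 0.5, 0.25]    # P_k(. | x_k, u_k), here fixed

def f_s(k, x, u, w):
    return max(0, min(2, x + u + w))               # x_{k+1} = f_k(x_k, u_k, w_k)

def g_s(k, x, u, w):
    return x**2 + u**2                             # stage cost

J_s = [dict() for _ in range(N + 1)]
mu = [dict() for _ in range(N)]                    # policy pi = {mu_0, ..., mu_{N-1}}
for x in states:
    J_s[N][x] = g_N(x)

for k in range(N - 1, -1, -1):
    for x in states:
        def expected_cost(u):
            return sum(p * (g_s(k, x, u, w) + J_s[k + 1][f_s(k, x, u, w)])
                       for w, p in zip(w_vals, w_probs))
        mu[k][x] = min(controls[x], key=expected_cost)
        J_s[k][x] = expected_cost(mu[k][x])

print(mu[0], J_s[0])
```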

Source

Dimitri P. Bertsekas, *Reinforcement Learning and Optimal Control*, Athena Scientific, 2019.

Reposted from blog.csdn.net/qq_18822147/article/details/121096397