【RL-Notes】The Dynamic Programming Algorithm

DP Algorithms

DP algorithms rest on the principle of optimality, which can be stated roughly as follows:
Let $\{u_0^*, \dots, u_{N-1}^*\}$ be an optimal control sequence, which together with $x_0$ determines the corresponding state sequence $\{x_1^*, \dots, x_N^*\}$ via the system equation $x_{k+1}=f_k(x_k, u_k)$. Consider the subproblem whereby we start at $x_k^*$ at time $k$ and wish to minimize the cost-to-go from time $k$ to time $N$:
$$g_k(x_k^*, u_k)+\sum_{m=k+1}^{N-1}g_m(x_m, u_m)+g_N(x_N)$$
over $\{u_k, \dots, u_{N-1}\}$ with $u_m\in U_m(x_m)$, $m=k, \dots, N-1$. Then the truncated optimal control sequence $\{u_k^*, \dots, u_{N-1}^*\}$ is optimal for this subproblem.

The DP algorithm is based on this idea: it proceeds sequentially, solving all tail subproblems of a given time length by using the solutions of the tail subproblems of shorter time length.

Optimal Sequence

The DP algorithm constructs the cost-to-go functions sequentially, proceeding backward in time:
$$J_N^*(x_N),\ J_{N-1}^*(x_{N-1}),\ \dots,\ J_0^*(x_0)$$

DP Algorithm for Deterministic Finite Horizon Problems

Start with
$$J_N^*(x_N)=g_N(x_N)$$
for all $x_N$, and for $k=0, \dots, N-1$, let
$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)}\bigg[g_k(x_k, u_k)+J_{k+1}^*\big(f_k(x_k, u_k)\big)\bigg]$$
The key fact about the DP algorithm is that for every initial state $x_0$, the number $J_0^*(x_0)$ obtained at the last step is equal to the optimal cost $J^*(x_0)$. More generally, for all states $x_k$ at time $k$, we have
$$J_k^*(x_k)=\min_{u_m\in U_m(x_m),\, m=k,\dots,N-1} J(x_k; u_k, \dots, u_{N-1})$$
where
$$J(x_k; u_k, \dots, u_{N-1})=g_N(x_N)+\sum_{m=k}^{N-1}g_m(x_m, u_m)$$
That is, $J_k^*(x_k)$ is the optimal cost of the $(N-k)$-stage tail subproblem that starts at state $x_k$ and time $k$, and ends at time $N$.

Proof:
We argue by induction. The assertion holds for $k=N$ in view of the initial condition $J_N^*(x_N)=g_N(x_N)$. Assuming it holds for index $k+1$, we have for index $k$:
$$
\begin{aligned}
J_k^*(x_k)&=\min_{u_m\in U_m(x_m),\, m=k,\dots, N-1}\bigg[g_N(x_N)+\sum_{m=k}^{N-1}g_m(x_m, u_m)\bigg]\\
&=\min_{u_k\in U_k(x_k)}\bigg[g_k(x_k, u_k)+\min_{u_m\in U_m(x_m),\, m=k+1,\dots, N-1}\bigg[g_N(x_N)+\sum_{m=k+1}^{N-1} g_m(x_m, u_m)\bigg]\bigg]\\
&=\min_{u_k\in U_k(x_k)}\bigg[g_k(x_k, u_k)+J_{k+1}^*\big(f_k(x_k, u_k)\big)\bigg]
\end{aligned}
$$
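
As a concrete illustration, here is a minimal sketch of the backward recursion in Python for a small problem with finite state and control spaces; the problem data (`states`, `controls`, `f`, `g`, `g_N`, and the horizon `N`) are made-up placeholders, not taken from the text.

```python
# A minimal sketch of the backward DP recursion for a deterministic
# finite-horizon problem with finite state and control spaces.
# All problem data below (states, controls, f, g, g_N, N) are hypothetical.

N = 3                                         # horizon
states = [0, 1, 2]                            # possible values of x_k
controls = {x: [-1, 0, 1] for x in states}    # U_k(x_k), here independent of k

def f(k, x, u):
    """System equation x_{k+1} = f_k(x_k, u_k); keep the state in range."""
    return max(0, min(2, x + u))

def g(k, x, u):
    """Stage cost g_k(x_k, u_k)."""
    return x**2 + u**2

def g_N(x):
    """Terminal cost g_N(x_N)."""
    return 2 * x**2

# J[k][x] stores J_k^*(x); start from J_N^*(x_N) = g_N(x_N) and go backward.
J = [dict() for _ in range(N + 1)]
for x in states:
    J[N][x] = g_N(x)

for k in range(N - 1, -1, -1):
    for x in states:
        J[k][x] = min(g(k, x, u) + J[k + 1][f(k, x, u)] for u in controls[x])

print(J[0])   # J_0^*(x_0), the optimal cost for every initial state x_0
```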

Once the functions $J_0^*, \dots, J_N^*$ have been obtained, an optimal control sequence $\{u_0^*, \dots, u_{N-1}^*\}$ can be constructed by the following forward algorithm:
Construction of Optimal Control Sequence:
Set
$$u_0^*\in \arg\min_{u_0\in U_0(x_0)}\bigg[g_0(x_0, u_0)+J_1^*\big(f_0(x_0, u_0)\big)\bigg]$$
and
$$x_1^*=f_0(x_0, u^*_0)$$
Sequentially, for $k=1, 2, \dots, N-1$, set
$$u_k^*\in\arg\min_{u_k\in U_k(x_k^*)}\bigg[g_k(x_k^*, u_k)+J_{k+1}^*\big(f_k(x_k^*, u_k)\big)\bigg]$$
and
$$x_{k+1}^*=f_k(x_k^*, u_k^*)$$
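
Continuing the sketch above (same hypothetical problem data, with the tables `J[k]` already computed), the forward pass is a direct transcription of these equations:

```python
# Forward construction of {u_0^*, ..., u_{N-1}^*} and {x_1^*, ..., x_N^*},
# reusing the tables J[k] from the backward sketch above.

x = 0                                   # initial state x_0
u_opt, x_traj = [], [x]
for k in range(N):
    # u_k^* minimizes g_k(x_k^*, u_k) + J_{k+1}^*(f_k(x_k^*, u_k))
    u_star = min(controls[x], key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
    u_opt.append(u_star)
    x = f(k, x, u_star)                 # x_{k+1}^* = f_k(x_k^*, u_k^*)
    x_traj.append(x)

print(u_opt, x_traj)
```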

Approximation in Value Space

The optimal control sequence construction is possible only after we have computed $J_k^*(x_k)$ by DP for all $x_k$ and $k$, but this is often prohibitively time-consuming. A similar forward algorithmic process can be used if the optimal cost-to-go functions $J_k^*$ are replaced by some approximations $\tilde{J}_k$.
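
As a sketch, the forward pass above can be run with any cheap substitute for $J_{k+1}^*$; the particular `J_tilde` below is an arbitrary made-up heuristic (not from the text), used only to show where the approximation plugs in:

```python
# One-step lookahead with an approximate cost-to-go function J_tilde
# in place of J_{k+1}^* (same hypothetical problem data as above).

def J_tilde(k, x):
    # Any cheap approximation of J_k^*(x): a heuristic, a rollout estimate,
    # or a trained model; here just an arbitrary quadratic guess.
    return 1.5 * x**2

def lookahead_control(k, x):
    return min(controls[x], key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))
```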

Q-Factors and Q-Learning

Given approximations $\tilde{J}_{k+1}$ of the cost-to-go functions, define the approximate Q-factors
$$\tilde{Q}_k(x_k, u_k)=g_k(x_k, u_k)+\tilde{J}_{k+1}\big(f_k(x_k, u_k)\big)$$
The associated computation of the approximately optimal control can be done through the Q-factor minimization:
$$\tilde{u}_k\in\arg\min_{u_k\in U_k(\tilde{x}_k)}\tilde{Q}_k(\tilde{x}_k, u_k)$$
This suggests the possibility of using Q-factors in place of cost functions in approximation in value space schemes. Methods of this type use as their starting point an alternative form of the DP algorithm which, instead of the optimal cost-to-go functions $J_k^*$, generates the optimal Q-factors, defined for all pairs $(x_k, u_k)$ and $k$ by
$$Q^*_k(x_k, u_k)=g_k(x_k, u_k)+J_{k+1}^*\big(f_k(x_k, u_k)\big)$$
Thus the optimal Q-factors are simply the expressions that are minimized on the right-hand side of the DP equation. The optimal cost functions $J_k^*$ can be recovered from the optimal Q-factors $Q_k^*$ by means of
$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)} Q_k^*(x_k, u_k)$$
Moreover, using this relation, the DP algorithm can be written in an essentially equivalent form that involves Q-factors only:
$$Q_k^*(x_k, u_k)=g_k(x_k, u_k)+\min_{u_{k+1}\in U_{k+1}(f_k(x_k, u_k))} Q_{k+1}^*\big(f_k(x_k, u_k), u_{k+1}\big)$$
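
The Q-factor form of the recursion can be sketched the same way, again with the hypothetical problem data from the earlier sketches; note that the tail minimization is exactly the recovery relation $J_{k+1}^*(x_{k+1})=\min_{u_{k+1}} Q_{k+1}^*(x_{k+1}, u_{k+1})$:

```python
# Backward recursion written in terms of Q-factors only
# (same hypothetical problem data as in the earlier sketches).

Q = [dict() for _ in range(N)]
for k in range(N - 1, -1, -1):
    for x in states:
        for u in controls[x]:
            x_next = f(k, x, u)
            if k == N - 1:
                tail = g_N(x_next)   # J_N^*(x_N) = g_N(x_N)
            else:                    # J_{k+1}^*(x_next) = min over u' of Q_{k+1}^*(x_next, u')
                tail = min(Q[k + 1][(x_next, u1)] for u1 in controls[x_next])
            Q[k][(x, u)] = g(k, x, u) + tail

# Recover J_0^*(x_0) = min_u Q_0^*(x_0, u); this matches J[0] from the first sketch.
print({x: min(Q[0][(x, u)] for u in controls[x]) for x in states})
```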

Stochastic Dynamic Programming

The system now includes a random "disturbance" $w_k$, which is characterized by a probability distribution $P_k(\cdot\mid x_k, u_k)$ that may depend explicitly on $x_k$ and $u_k$, but not on the values of prior disturbances $w_{k-1}, \dots, w_0$. The system has the form
$$x_{k+1}=f_k(x_k, u_k, w_k), \quad k=0, 1, \dots, N-1$$

An important difference is that we optimize not over control sequences $\{u_0, \dots, u_{N-1}\}$, but rather over policies that consist of a sequence of functions
$$\pi=\{\mu_0, \dots, \mu_{N-1}\}$$
where $\mu_k$ maps states $x_k$ into controls $u_k=\mu_k(x_k)$ and satisfies the control constraints. Policies are more general objects than control sequences, and in the presence of stochastic uncertainty, they can result in improved cost, since they allow choices of controls $u_k$ that incorporate knowledge of the state $x_k$. Without this knowledge, the controller cannot adapt appropriately to unexpected values of the state, and as a result the cost can be adversely affected.
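
For the stochastic case, a minimal sketch of the backward recursion now takes an expectation over $w_k$ and records the minimizing control as a policy $\mu_k$; the disturbance distribution and costs below are made up for illustration, and the state/control sets are reused from the earlier sketches:

```python
# Stochastic DP sketch: the recursion averages over w_k and stores the
# minimizing control as a policy mu_k(x_k). Disturbance values/probabilities
# and costs are hypothetical; states, controls, N, g_N reuse the earlier sketch.

w_vals, w_probs = [-1, 0, 1], [0.25, 0.5, 0.25]    # P_k(. | x_k, u_k), here fixed

def f_s(k, x, u, w):
    return max(0, min(2, x + u + w))               # x_{k+1} = f_k(x_k, u_k, w_k)

def g_s(k, x, u, w):
    return x**2 + u**2                             # stage cost

J_s = [dict() for _ in range(N + 1)]
mu = [dict() for _ in range(N)]                    # policy pi = {mu_0, ..., mu_{N-1}}
for x in states:
    J_s[N][x] = g_N(x)

for k in range(N - 1, -1, -1):
    for x in states:
        def expected_cost(u):
            return sum(p * (g_s(k, x, u, w) + J_s[k + 1][f_s(k, x, u, w)])
                       for w, p in zip(w_vals, w_probs))
        mu[k][x] = min(controls[x], key=expected_cost)
        J_s[k][x] = expected_cost(mu[k][x])

print(mu[0], J_s[0])
```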

Source

Dimitri P. Bertsekas, *Reinforcement Learning and Optimal Control*, Athena Scientific, 2019.

Reposted from blog.csdn.net/qq_18822147/article/details/121096397