5. Reinforcement learning--approximate representation of value function

The previous lectures covered the basic theory of reinforcement learning. That material only scales to small and medium-sized problems: every value has to be stored in a large table, and obtaining the value of a state or behavior requires a table lookup. This becomes infeasible when the state space or behavior space is large, and many practical problems have exactly such large spaces. This lecture therefore focuses on how to handle these practical problems.
This lecture mainly addresses the approximate representation and learning of value functions. In practical applications, when the state and behavior spaces are large, it is almost impossible to obtain every v(s) and q(s,a) exactly. Instead, we look for an approximate function; concretely, linear combinations of features, neural networks, and other methods can be used to approximate the value function:
$$v(S) \approx \hat{v}(S, \mathbf{w})$$
where $\mathbf{w}$ is an introduced parameter, usually a vector or a matrix.
Through function approximation, a relatively small number of parameters $\mathbf{w}$ can fit a wide range of actual value functions. Approximation methods fall into two broad categories. One is the incremental method: at every step the approximate function receives some new information and is optimized immediately, which is mainly used for online learning. The other is the batch method, which fits the approximate function to a batch of historical data. There is no sharp boundary between the two, and ideas can be borrowed from one to the other.

Value Function Approximation

So far we have used the table lookup method: each state, or each state-behavior pair, has its own value entry. For large-scale problems this requires too much memory, and learning a value for every individual state can also be a slow process.
For large-scale problems, the solution can be as follows:

  1. Estimate the actual value function through functional approximation:
    $$\hat{v}(s, \mathbf{w}) \approx v_\pi(s), \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$
  2. Generalize functions learned from known states to unencountered states.
  3. Use MC or TD learning to update function parameters.

For reinforcement learning, the approximate function can have the following three architectures depending on the input and output:
(Figure: three architectures for the approximate value function, distinguished by their inputs and outputs)

  1. For the state itself, output the approximate value of this state;
  2. For a state-behavior pair, output the approximate value of the state-behavior pair;
  3. For the state itself, a vector is output. Each element in the vector is the value of a possible action in the state.
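As a minimal illustration (not from the original lecture), the three interfaces might look like this in Python, assuming a hand-crafted feature map and made-up dimensions:

```python
import numpy as np

# Sketch of the three interfaces; feature sizes and names are illustrative only.
N_FEATURES, N_ACTIONS = 8, 4

def v_hat(x_s: np.ndarray, w: np.ndarray) -> float:
    """Architecture 1: state (features x(s)) in, scalar state value out."""
    return float(x_s @ w)                       # w: (N_FEATURES,)

def q_hat(x_sa: np.ndarray, w: np.ndarray) -> float:
    """Architecture 2: state-behavior pair (features x(s, a)) in, scalar value out."""
    return float(x_sa @ w)                      # w: (N_FEATURES,)

def q_all(x_s: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Architecture 3: state in, vector of values, one per possible behavior."""
    return W @ x_s                              # W: (N_ACTIONS, N_FEATURES)
```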

In principle, any machine learning method that produces an approximating function can be applied to reinforcement learning. Linear regression and neural networks are the most widely used, mainly because both yield approximate functions that are differentiable with respect to their parameters $\mathbf{w}$.
The data in reinforcement learning applications is usually non-stationary and non-i.i.d. (not independent and identically distributed): states arrive as a stream, and each state is typically highly correlated with the previous one. We therefore need training methods that are suitable for non-stationary, non-i.i.d. data when learning the approximate function.

Incremental Methods

gradient descent

Assume that J(w) is a differentiable function with respect to parameter w, and define the gradient of J(w) as follows:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \begin{pmatrix} \dfrac{\partial J(\mathbf{w})}{\partial w_1} \\ \vdots \\ \dfrac{\partial J(\mathbf{w})}{\partial w_n} \end{pmatrix}$$
Adjust the parameters in the direction of the negative gradient to find the local minimum of J(w):
$$\Delta\mathbf{w} = -\frac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w})$$
Goal: find the parameter vector $\mathbf{w}$ that minimizes the mean squared error between the approximate function $\hat{v}(S,\mathbf{w})$ and the actual value function $v_\pi(S)$, defined as:
$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\left(v_\pi(S) - \hat{v}(S,\mathbf{w})\right)^2\right]$$
Using stochastic gradient descent, each update is based on a single sample, which approximates the expectation of the gradient:
$$\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})$$
That is, the true value $v_\pi(S)$ guides the direction in which the estimated value function $\hat{v}(S,\mathbf{w})$ is optimized.
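A minimal sketch of this update for a linear approximator $\hat{v}(s,\mathbf{w}) = \mathbf{x}(s)^\top\mathbf{w}$, assuming the target value is given; the feature values and numbers below are purely illustrative:

```python
import numpy as np

def sgd_update(w, x_s, v_target, alpha=0.01):
    """One stochastic gradient step for a linear v_hat(s, w) = x(s) . w.

    For a linear approximator the gradient of v_hat w.r.t. w is just the
    feature vector x(s), so the step is alpha * (target - prediction) * x(s).
    """
    v_hat = x_s @ w
    return w + alpha * (v_target - v_hat) * x_s

# Illustrative usage with made-up features and a made-up target value.
w = np.zeros(4)
x_s = np.array([1.0, 0.5, 0.0, 2.0])
w = sgd_update(w, x_s, v_target=3.0)
```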

Prediction – Incremental Algorithm

None of the formulas above can be used directly in reinforcement learning, because they contain the actual value function $v_\pi(s)$ (or a concrete target value), and reinforcement learning has no supervised data: there are only immediate rewards. We therefore need a substitute target for $v_\pi(s)$ so that a supervised-learning-style algorithm can be used to learn the parameters of the approximate function.
For the MC algorithm, the target value is the return $G_t$:
$$\Delta\mathbf{w} = \alpha\left(G_t - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$
For TD(0), the target value is the TD target:
$$\Delta\mathbf{w} = \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$
For TD(λ), the target value is the λ-return $G_t^\lambda$:
$$\Delta\mathbf{w} = \alpha\left(G_t^\lambda - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$

MC applied to state value function approximation

The return $G_t$ is a noisy but unbiased sample of the true value $v_\pi(S_t)$, so it can be treated as label data for supervised learning. The training data set is then:
$$\langle S_1, G_1\rangle, \langle S_2, G_2\rangle, \ldots, \langle S_T, G_T\rangle$$
If linear Monte Carlo policy evaluation is used, the correction to the parameters at each step is:
$$\begin{aligned}\Delta\mathbf{w} &= \alpha\left(G_t - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \\ &= \alpha\left(G_t - \hat{v}(S_t,\mathbf{w})\right)\mathbf{x}(S_t)\end{aligned}$$
Conclusion: Monte Carlo policy evaluation converges to a local optimum even when a nonlinear function approximator is used; with a linear approximator the squared-error objective is convex, so it converges to the global optimum.
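A sketch of this Monte Carlo update applied over one episode, assuming a linear approximator and a user-supplied `feature_fn`; both the function names and the episode format are assumptions of this example:

```python
import numpy as np

def mc_evaluate_episode(episode, w, feature_fn, alpha=0.01, gamma=1.0):
    """Every-visit gradient Monte Carlo evaluation with a linear v_hat.

    `episode` is a list of (S_t, R_{t+1}) pairs in the order visited;
    `feature_fn(s)` returns the feature vector x(s).
    """
    G = 0.0
    # Work backwards so the return G_t = R_{t+1} + gamma * G_{t+1} accumulates incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        x = feature_fn(state)
        w = w + alpha * (G - x @ w) * x   # delta_w = alpha * (G_t - v_hat) * x(S_t)
    return w
```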

TD applied to state value function approximation

The TD target is a noisy, biased sample of the true value $v_\pi(S_t)$. The training data set in this case is:
$$\left\langle S_1, R_2 + \gamma\hat{v}(S_2,\mathbf{w})\right\rangle, \left\langle S_2, R_3 + \gamma\hat{v}(S_3,\mathbf{w})\right\rangle, \ldots, \left\langle S_{T-1}, R_T\right\rangle$$
With linear TD(0), the parameter update is:
$$\begin{aligned}\Delta\mathbf{w} &= \alpha\left(R + \gamma\hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \\ &= \alpha\,\delta\,\mathbf{x}(S)\end{aligned}$$
Conclusion: The linear TD(0) method will converge to the global optimum.
Note: the gradient here acts only on $\hat{v}(S,\mathbf{w})$ and not on $\hat{v}(S',\mathbf{w})$. As Silver explains, taking the gradient through the target would be a bit like travelling back in time, which is inconsistent with reality.
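A sketch of this semi-gradient TD(0) step for a linear approximator; the function and argument names below are illustrative assumptions:

```python
import numpy as np

def td0_update(w, x_s, r, x_s_next, done, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) step for a linear v_hat(s, w) = x(s) . w.

    The TD target r + gamma * v_hat(S', w) is treated as a constant:
    the gradient is taken only through v_hat(S, w), as noted above.
    """
    v_next = 0.0 if done else x_s_next @ w   # bootstrap unless S' is terminal
    delta = r + gamma * v_next - x_s @ w     # TD error
    return w + alpha * delta * x_s
```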

TD(λ) applied to state value function approximation

The TD(λ) target (the λ-return) is a noisy, biased sample of the true value. The training data set in this case is:
$$\left\langle S_1, G_1^\lambda\right\rangle, \left\langle S_2, G_2^\lambda\right\rangle, \ldots, \left\langle S_{T-1}, G_{T-1}^\lambda\right\rangle$$
Using the forward view of linear TD(λ), the parameter update is:
$$\begin{aligned}\Delta\mathbf{w} &= \alpha\left(G_t^\lambda - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \\ &= \alpha\left(G_t^\lambda - \hat{v}(S_t,\mathbf{w})\right)\mathbf{x}(S_t)\end{aligned}$$
Using the backward view of linear TD(λ), with an eligibility trace $E_t$:
$$\begin{aligned}\delta_t &= R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w}) \\ E_t &= \gamma\lambda E_{t-1} + \mathbf{x}(S_t) \\ \Delta\mathbf{w} &= \alpha\,\delta_t E_t\end{aligned}$$
For a complete episode, the forward view and the backward view of TD(λ) produce equivalent total changes to $\mathbf{w}$.
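A sketch of the backward-view update with an accumulating eligibility trace, again assuming a linear approximator; names and hyperparameters are illustrative:

```python
import numpy as np

def td_lambda_step(w, e, x_s, r, x_s_next, done,
                   alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with an accumulating eligibility trace,
    for a linear v_hat(s, w) = x(s) . w. `e` is the trace vector E_t."""
    v_next = 0.0 if done else x_s_next @ w
    delta = r + gamma * v_next - x_s @ w     # delta_t
    e = gamma * lam * e + x_s                # E_t = gamma * lambda * E_{t-1} + x(S_t)
    w = w + alpha * delta * e                # delta_w = alpha * delta_t * E_t
    return w, e                              # reset e to zeros at the start of each episode
```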

Control – Incremental Algorithm

Using reinforcement learning for model-free control requires two components: policy evaluation and policy improvement. How can approximate functions be introduced into this control process? We need a value function approximation over state-behavior pairs, not just a value function approximation over states.
(Figure: generalized policy iteration with an approximate action-value function and ε-greedy improvement)
Starting from some initial parameters, we obtain an approximate state-behavior value function, generate a behavior under an ε-greedy policy, and receive an immediate reward for executing it. A target value is computed from this data and the parameters of the approximate function are updated. The policy is then followed to obtain the next state and its corresponding target value, and the parameters are updated each time a state is experienced. Repeating this process gradually improves the policy and approaches the optimal value function.

Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot,\cdot,\mathbf{w}) \approx q_\pi$. The error can be large early on, and this approximation generally cannot converge exactly to the action-value function of the optimal policy; it can only oscillate around it. Ways to improve this are described later.
Policy improvement: ε-greedy exploration.

The behavior value function is approximately expressed as:
$$\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$$
Define the objective function (mean squared error):
$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)^2\right]$$
Use stochastic gradient descent to find a local minimum:
$$\begin{aligned} -\frac{1}{2}\nabla_{\mathbf{w}} J(\mathbf{w}) &= \left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w}) \\ \Delta\mathbf{w} &= \alpha\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w})\end{aligned}$$
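Putting the pieces together, here is a sketch of one incremental control step in the style of SARSA, where the unknown $q_\pi(S,A)$ is replaced by a TD target (as in the prediction case) and behavior is chosen ε-greedily; the linear features and all names are assumptions of this sketch:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random behavior with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def sarsa_update(w, x_sa, r, x_sa_next, done, alpha=0.01, gamma=0.99):
    """Semi-gradient SARSA step for a linear q_hat(s, a, w) = x(s, a) . w.
    The TD target stands in for q_pi(S, A), and the gradient is taken only
    through q_hat(S, A, w)."""
    q_next = 0.0 if done else x_sa_next @ w
    delta = r + gamma * q_next - x_sa @ w
    return w + alpha * delta * x_sa
```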

Convergence analysis

For prediction problems:

MC uses a noisy but unbiased estimate of the actual value. Although it often performs worse, it always converges to a local or global optimum. TD usually performs better, but does that mean TD always converges? The answer is no: the original lecture notes give an example where TD learning does not converge, which is not repeated here. The table below summarizes whether the various algorithms converge when different kinds of approximate functions are used for prediction.
(Table: convergence of prediction algorithms with table lookup, linear, and nonlinear function approximation, for on-policy and off-policy learning)
As the table shows, without function approximation (table lookup) all algorithms converge; with linear function approximation, on-policy learning converges, but off-policy only MC converges; with nonlinear function approximation, only MC converges whether learning on-policy or off-policy. Since the MC algorithm is rarely used in practice, this poses a challenge for applying reinforcement learning. Fortunately, there are ways to improve the TD algorithm.

Convergence for control problems:
(Table: convergence of control algorithms with table lookup, linear, and nonlinear function approximation)
For control algorithms, most can obtain a fairly good policy, but strictly speaking none of them is guaranteed to converge once function approximation is involved. It is common to oscillate around the optimal policy: gradually approaching it, suddenly diverging, then approaching again. Nonlinear function approximation behaves considerably worse than linear approximation in this respect, and this is also what is observed in practice.

Batch Methods

The incremental algorithms above operate on a stream of data: after one step and one update, the data from that step is discarded. These algorithms are simple, but they are not always sample efficient. Batch methods, by contrast, collect data over a period of time and then fit the parameters to all of the data in that batch. The "batch" of training data here corresponds to an agent's accumulated experience.

least squares prediction

Suppose there is an approximation of the value function:
$$\hat{v}(s,\mathbf{w}) \approx v_\pi(s)$$
and a period of experience D including <state, value>:
$$\mathcal{D} = \left\{\left\langle s_1, v_1^\pi\right\rangle, \left\langle s_2, v_2^\pi\right\rangle, \ldots, \left\langle s_T, v_T^\pi\right\rangle\right\}$$
The least squares algorithm finds the parameters $\mathbf{w}$ that minimize:
$$\begin{aligned} LS(\mathbf{w}) &= \sum_{t=1}^{T}\left(v_t^\pi - \hat{v}(s_t,\mathbf{w})\right)^2 \\ &= \mathbb{E}_{\mathcal{D}}\!\left[\left(v^\pi - \hat{v}(s,\mathbf{w})\right)^2\right]\end{aligned}$$
This is equivalent to experience replay: the experience gathered over a period of time is replayed and the parameters are updated from it. The algorithm is straightforward to implement (see the sketch after the list); simply repeat:

  1. Take a <s,v> from experience:
    $$\langle s, v^\pi\rangle \sim \mathcal{D}$$
  2. Apply stochastic gradient descent to update parameters:
    $$\Delta\mathbf{w} = \alpha\left(v^\pi - \hat{v}(s,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w})$$
    This converges to the parameters that minimize the squared error over the batch:
    $$\mathbf{w}^\pi = \underset{\mathbf{w}}{\operatorname{argmin}}\; LS(\mathbf{w})$$
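A sketch of this experience-replay loop for a linear approximator, with `dataset` holding ⟨state, value target⟩ pairs and `feature_fn` supplied by the user; both are assumptions of this example:

```python
import numpy as np

def experience_replay_fit(dataset, w, feature_fn, alpha=0.01, n_steps=10_000):
    """Least-squares prediction by experience replay, as described above:
    repeatedly sample <state, value target> pairs from the stored data D
    and apply a stochastic gradient step to the linear v_hat."""
    for _ in range(n_steps):
        s, v_target = dataset[np.random.randint(len(dataset))]  # <s, v_pi> ~ D
        x = feature_fn(s)
        w = w + alpha * (v_target - x @ w) * x
    return w
```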

Batch method applied to DQN

It was mentioned earlier that TD combined with a nonlinear neural network approximator may fail to converge, yet DQN achieves stable, robust learning by using experience replay and a fixed Q target. Before explaining why, the key points of the DQN algorithm are listed first:

  1. At time t, generate a behavior according to an $\epsilon$-greedy policy;
  2. Store a large amount of experience (for example, millions of transitions) $(s_t, a_t, r_{t+1}, s_{t+1})$ in a replay memory $\mathcal{D}$;
  3. Randomly sample a small batch of data (for example, 64 transitions) $(s, a, r, s')$ from $\mathcal{D}$;
  4. Maintain two neural networks, DQN1 and DQN2. One network has fixed (frozen) parameters and is used only to produce target values, which play the role of label data; the other network is used to evaluate the policy and has its parameters updated;
  5. Optimize the mean squared error between the Q network and the Q target:
    $$\mathcal{L}_i(w_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}_i}\!\left[\left(r + \gamma\max_{a'} Q(s', a'; w_i^-) - Q(s, a; w_i)\right)^2\right]$$
    where $w^-$ are the parameters kept frozen during this batch of learning and $w_i$ are the parameters being updated;
  6. Update parameters using stochastic gradient descent.

First, random sampling breaks the correlation between successive states. Second, the target values come from a network whose parameters are temporarily frozen rather than from the network currently being updated, which increases the stability of the algorithm. After a batch of updates, the frozen network is replaced with the updated parameters and frozen again to generate the targets for the next round.
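A minimal sketch of the DQN update using PyTorch (a choice made for this example, not mandated by the lecture; the original DQN used a convolutional network on Atari frames). The network size, hyperparameters, and variable names are all illustrative:

```python
import random
from collections import deque
import torch
import torch.nn as nn

# Small fully connected Q network; state/action sizes are placeholders.
def make_q_net(n_states=4, n_actions=2):
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_net()                      # network being updated (parameters w_i)
target_net = make_q_net()                 # frozen network that produces targets (w^-)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)            # stores (s, a, r, s', done) tuples
gamma = 0.99

def dqn_update(batch_size=64):
    """One gradient step; call only once the replay holds >= batch_size transitions."""
    batch = random.sample(replay, batch_size)             # random sampling breaks correlations
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; w_i)
    with torch.no_grad():                                      # target uses frozen w^-
        target = r + gamma * target_net(s2).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically copy the updated parameters into the frozen target network:
# target_net.load_state_dict(q_net.state_dict())
```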

least squares control

The principle of least squares policy iteration is shown in the figure:
(Figure: least squares policy iteration; policy evaluation by least squares Q-learning, policy improvement by a greedy policy)
Policy evaluation uses least squares Q-learning, and policy improvement uses a greedy policy. To apply least squares to policy control, we use a linear approximation of the behavior value function:
$$\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s,a)^\top\mathbf{w} \approx q_\pi(s,a)$$
and minimize the squared error between $\hat{q}(s,a,\mathbf{w})$ and $q_\pi(s,a)$ over the following experience:
$$\mathcal{D} = \left\{\left\langle (s_1,a_1), v_1^\pi\right\rangle, \left\langle (s_2,a_2), v_2^\pi\right\rangle, \ldots, \left\langle (s_T,a_T), v_T^\pi\right\rangle\right\}$$
Because we also want to improve the policy while evaluating it, the experience comes from many different policies, so learning must be done off-policy, using an approach similar to Q-learning.
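For the batch fit itself, here is a sketch of solving the least squares problem directly for a linear $\hat{q}$, with `dataset` holding ⟨(state, behavior), target⟩ pairs and `feature_fn` assumed to be provided by the user:

```python
import numpy as np

def least_squares_q_fit(dataset, feature_fn):
    """Fit w for a linear q_hat(s, a, w) = x(s, a) . w by ordinary least
    squares over a batch of experience; `dataset` holds ((s, a), q_target)
    pairs and `feature_fn(s, a)` returns x(s, a)."""
    X = np.array([feature_fn(s, a) for (s, a), _ in dataset])   # one feature row per sample
    y = np.array([target for _, target in dataset])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)                   # minimises ||X w - y||^2
    return w
```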


Origin blog.csdn.net/weixin_42988382/article/details/105669357