[Reinforcement Learning in Practice] Function Approximation: Linear Approximation and the Convergence of Function Approximation

Linear approximation

The most commonly used function approximators are linear approximation and artificial neural networks. This section introduces linear approximation. Linear approximation represents the value function as a linear combination of many features, where the features depend on the input (that is, the state or the state-action pair). Taking action-value approximation as an example, we can define several different features for each state-action pair, $x(s,a) = (x_j(s,a) : j \in \mathcal{J})$, and then define the approximate action-value function as a linear combination of these features:

$$q(s,a;w) = \sum_{j \in \mathcal{J}} w_j x_j(s,a) = w^\mathsf{T} x(s,a).$$

The state-value function can be approximated in the same way:

$$v(s;w) = \sum_{j \in \mathcal{J}} w_j x_j(s) = w^\mathsf{T} x(s).$$
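As a concrete illustration, here is a minimal NumPy sketch of a linear action-value approximator. The feature function `featurize` and its dimension are hypothetical stand-ins for problem-specific features.

```python
import numpy as np

N_FEATURES = 8  # hypothetical feature dimension

def featurize(state, action):
    """Hypothetical feature function x(s, a): a fixed random projection
    standing in for real, problem-specific features."""
    rng = np.random.default_rng(abs(hash((state, action))) % (2**32))
    return rng.standard_normal(N_FEATURES)

def q_value(state, action, w):
    """Linear approximation q(s, a; w) = w^T x(s, a)."""
    return w @ featurize(state, action)

w = np.zeros(N_FEATURES)     # weight vector to be learned
print(q_value(0, 1, w))      # 0.0 before any learning
```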

Relationship between the exact lookup table and linear approximation

For action values, the lookup table can be viewed as a special case of linear approximation with $|\mathcal{S}| \times |\mathcal{A}|$ feature vectors: for each state-action pair $(s,a)$ there is one feature that equals 1 at that pair and 0 everywhere else,

$$x_{(s,a)}(s',a') = \begin{cases} 1, & (s',a') = (s,a), \\ 0, & \text{otherwise.} \end{cases}$$

The linear combination of all these one-hot features is exactly the action-value function, and each combination coefficient equals the corresponding action value.
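A tiny sketch of this equivalence, using made-up state and action counts purely for illustration:

```python
import numpy as np

n_states, n_actions = 4, 2   # toy sizes, illustration only

def one_hot(s, a):
    """One-hot feature vector: 1 at the (s, a) entry, 0 elsewhere."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

# With one-hot features the weights ARE the tabular action values:
# q(s, a; w) = w^T x_(s,a) picks out exactly the (s, a) entry of w.
w = np.arange(n_states * n_actions, dtype=float)   # pretend these were learned
assert w @ one_hot(2, 1) == w[2 * n_actions + 1]
```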

Linear least squares policy evaluation

When linear approximation is used, policy evaluation can be done not only with stochastic gradient descent but also with linear least squares. Linear least squares is a batch method: instead of processing one sample at a time, it seeks the best estimate over an entire set of experience samples at once.

Applying linear least squares to Monte Carlo (episode-by-episode) updates gives the linear least squares Monte Carlo update (Linear Least Squares Monte Carlo, Linear LSMC). Linear LSMC tries to minimize

$$\sum_t \left[ G_t - q(S_t, A_t; w) \right]^2,$$

where $G_t$ is the Monte Carlo return. With linear approximation, the gradient of this objective is

$$-2 \sum_t \left[ G_t - w^\mathsf{T} x(S_t, A_t) \right] x(S_t, A_t).$$

Substituting the weights to be found, $w_{\text{LSMC}}$, into this expression and setting it to zero gives the linear system

$$\sum_t \left[ G_t - w_{\text{LSMC}}^\mathsf{T} x(S_t, A_t) \right] x(S_t, A_t) = 0,$$

whose solution is

$$w_{\text{LSMC}} = \left[ \sum_t x(S_t, A_t)\, x(S_t, A_t)^\mathsf{T} \right]^{-1} \sum_t G_t\, x(S_t, A_t).$$

This is the update formula for linear least squares Monte Carlo; in practice the weights are computed directly from this formula.
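A minimal NumPy sketch of the LSMC solution, assuming the features (one row per sample) and the Monte Carlo returns have already been collected; the names `X` and `G` are placeholders, and a pseudo-inverse is used in case the feature matrix is rank-deficient:

```python
import numpy as np

def lsmc_weights(X, G):
    """Linear LSMC: w = [sum_t x_t x_t^T]^{-1} sum_t G_t x_t.

    X : (T, d) array, row t is the feature vector x(S_t, A_t)
    G : (T,)   array, entry t is the Monte Carlo return G_t
    """
    A = X.T @ X                   # sum_t x_t x_t^T
    b = X.T @ G                   # sum_t G_t x_t
    return np.linalg.pinv(A) @ b  # pseudo-inverse for numerical robustness

# Toy usage with random data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
G = rng.standard_normal(100)
w_lsmc = lsmc_weights(X, G)
```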
Applying linear least squares to temporal-difference updates gives linear least squares temporal difference (Linear Least Squares Temporal Difference, Linear LSTD). For the one-step temporal difference case, linear LSTD tries to minimize

$$\sum_t \left[ U_t - q(S_t, A_t; w) \right]^2,$$

where $U_t = R_{t+1} + \gamma q(S_{t+1}, A_{t+1}; w)$. With linear approximation, the semi-gradient of this objective (treating $U_t$ as a constant) is

$$-2 \sum_t \left[ R_{t+1} + \gamma w^\mathsf{T} x(S_{t+1}, A_{t+1}) - w^\mathsf{T} x(S_t, A_t) \right] x(S_t, A_t).$$

Substituting the weights to be found, $w_{\text{LSTD}}$, into this expression and setting it to zero gives the linear system

$$\sum_t \left[ R_{t+1} + \gamma w_{\text{LSTD}}^\mathsf{T} x(S_{t+1}, A_{t+1}) - w_{\text{LSTD}}^\mathsf{T} x(S_t, A_t) \right] x(S_t, A_t) = 0,$$

whose solution is

$$w_{\text{LSTD}} = \left[ \sum_t x(S_t, A_t) \left( x(S_t, A_t) - \gamma x(S_{t+1}, A_{t+1}) \right)^\mathsf{T} \right]^{-1} \sum_t R_{t+1}\, x(S_t, A_t).$$

This is the update formula for linear least squares temporal difference; in practice the weights are computed directly from this formula.
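The LSTD solution has the same shape; the sketch below additionally assumes the features of the successor state-action pairs (`X_next`) and the rewards (`R`) are available, and again these names and the discount value are placeholders:

```python
import numpy as np

def lstd_weights(X, R, X_next, gamma=0.99):
    """Linear LSTD: w = [sum_t x_t (x_t - gamma x'_t)^T]^{-1} sum_t R_{t+1} x_t.

    X      : (T, d) features x(S_t, A_t)
    R      : (T,)   rewards R_{t+1}
    X_next : (T, d) features x(S_{t+1}, A_{t+1})
    """
    A = X.T @ (X - gamma * X_next)  # sum_t x_t (x_t - gamma x'_t)^T
    b = X.T @ R                     # sum_t R_{t+1} x_t
    return np.linalg.pinv(A) @ b

# Toy usage with random data, for illustration only.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
X_next = rng.standard_normal((100, 8))
R = rng.standard_normal(100)
w_lstd = lstd_weights(X, R, X_next, gamma=0.99)
```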

Solving for the optimal policy with linear least squares

Least squares can also be used to find the optimal policy. This section introduces a least squares algorithm for finding the optimal policy that is based on Q-learning.
In Q-learning, the return is estimated as

$$U_t = R_{t+1} + \gamma \max_{a} q(S_{t+1}, a; w).$$

Compared with the temporal-difference policy evaluation of the previous section, the term $q(S_{t+1}, A_{t+1}; w)$ in the return estimate $R_{t+1} + \gamma q(S_{t+1}, A_{t+1}; w)$ is replaced by $\max_a q(S_{t+1}, a; w)$. Accordingly, the least squares solution changes from

$$w = \left[ \sum_t x(S_t, A_t) \left( x(S_t, A_t) - \gamma x(S_{t+1}, A_{t+1}) \right)^\mathsf{T} \right]^{-1} \sum_t R_{t+1}\, x(S_t, A_t)$$

to

$$w = \left[ \sum_t x(S_t, A_t) \left( x\left(S_t, A_t\right) - \gamma x\left(S_{t+1}, \pi(S_{t+1})\right) \right)^\mathsf{T} \right]^{-1} \sum_t R_{t+1}\, x(S_t, A_t),$$

where $\pi(S_{t+1}) = \arg\max_a q(S_{t+1}, a; w)$ is the greedy action at $S_{t+1}$. Solving this least squares problem estimates the optimal value function, from which the policy can then be improved. Iterating these two steps in policy-iteration fashion gives the linear least squares Q-learning algorithm (see Algorithm 8).

Algorithm 8 Linear least squares Q-learning algorithm for finding the optimal policy
Input: a batch of experience.
Output: an estimate of the optimal action values $q(s,a;w)$, $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and an estimate of a deterministic optimal policy $\pi$.
1. (Initialization) $w \leftarrow$ arbitrary values; determine the greedy policy $\pi$ from $q(s,a;w)$, $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$.
2. (Iterative update) Repeat the following steps:
2.1 (Value update) Compute new weights with the least squares formula, $w' \leftarrow \left[ \sum_t x(S_t,A_t) \left( x(S_t,A_t) - \gamma x(S_{t+1}, \pi(S_{t+1})) \right)^\mathsf{T} \right]^{-1} \sum_t R_{t+1}\, x(S_t,A_t)$, where $\pi(S_{t+1})$ is the action selected by the deterministic policy $\pi$ in state $S_{t+1}$.
2.2 (Policy improvement) Determine the deterministic greedy policy $\pi'$ from $q(s,a;w')$, $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$.
2.3 If the termination condition is met (for example, $w$ and $w'$ are close enough, or $\pi$ and $\pi'$ are close enough), stop the iteration; otherwise set $w \leftarrow w'$, $\pi \leftarrow \pi'$ and go to the next iteration.
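Below is a compact NumPy sketch of this iteration. It assumes the batch of experience has already been turned into arrays: `X` holds the features $x(S_t,A_t)$, `R` the rewards $R_{t+1}$, and `S_next` the successor states; `feat(s, a)` is a hypothetical feature function and `actions` the (shared) action set. All of these names and the tolerance are illustrative assumptions.

```python
import numpy as np

def linear_ls_q_learning(X, R, S_next, feat, actions, gamma=0.99,
                         tol=1e-6, max_iter=100):
    """Linear least squares Q-learning, iterated in policy-iteration style.

    X      : (T, d) array, row t is x(S_t, A_t)
    R      : (T,)   array of rewards R_{t+1}
    S_next : length-T sequence of successor states S_{t+1}
    feat   : callable feat(s, a) -> (d,) feature vector
    actions: iterable of actions, assumed available in every state
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # 2.1 Value update: greedy action of the current weights at S_{t+1}.
        greedy = [max(actions, key=lambda a: w @ feat(s, a)) for s in S_next]
        X_next = np.array([feat(s, a) for s, a in zip(S_next, greedy)])
        A = X.T @ (X - gamma * X_next)      # sum_t x_t (x_t - gamma x'_t)^T
        b = X.T @ R                         # sum_t R_{t+1} x_t
        w_new = np.linalg.pinv(A) @ b
        # 2.3 Termination test on successive weight vectors.
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new                           # 2.2 policy improvement is implicit
    return w
```

Step 2.2 is implicit here: improving the policy simply means acting greedily with respect to the new weights, which is exactly what the next pass of the loop does.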

Convergence of function approximation

Because linear approximation has a simple linear structure, it enjoys additional convergence guarantees. Table 1 and Table 2 summarize the convergence of policy evaluation algorithms and of optimal policy algorithms, respectively. In these tables, "lookup table" refers to methods that do not use function approximation; under the usual conditions they converge to the true value function or the optimal value function. For function approximation algorithms, convergence is generally guaranteed only for gradient-descent Monte Carlo updates, and is not guaranteed for semi-gradient temporal-difference methods. Restricting the function approximator to a linear one improves convergence in some cases.

Of course, all of these convergence results assume that the learning rates satisfy the Robbins-Monro conditions, namely (1) $\alpha_t \ge 0$, $t = 0, 1, \ldots$; (2) $\sum_{t=0}^{+\infty} \alpha_t = +\infty$; (3) $\sum_{t=0}^{+\infty} \alpha_t^2 < +\infty$. Linear approximation can also be combined with batch linear least squares, which may lead to better convergence. Where convergence is guaranteed, it can generally be proved by verifying the conditions of the Robbins-Monro stochastic approximation algorithm.
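For example, the harmonic step sizes $\alpha_t = \frac{1}{t+1}$ satisfy all three conditions, since

$$\alpha_t \ge 0, \qquad \sum_{t=0}^{+\infty} \frac{1}{t+1} = +\infty, \qquad \sum_{t=0}^{+\infty} \frac{1}{(t+1)^2} = \frac{\pi^2}{6} < +\infty,$$

whereas a constant step size $\alpha_t = c > 0$ violates condition (3), because $\sum_{t=0}^{+\infty} c^2 = +\infty$.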
In addition, the convergence proofs for the optimal policy algorithms use the stochastic optimization version of this result.

Table 1 Convergence of policy evaluation algorithms (table not reproduced)
Table 2 Convergence of optimal policy algorithms (table not reproduced)
It is worth mentioning that for off-policy Q-learning, convergence is still not guaranteed even when linear approximation is used. Researchers have found that whenever off-policy learning, bootstrapping, and function approximation appear together, convergence cannot be guaranteed. A well-known example is Baird's counterexample, which interested readers may look up.


Source: blog.csdn.net/wangyifan123456zz/article/details/109248924