3. Reinforcement learning: model-free decision-making

Introduction

We already know how to solve a known MDP in theory: evaluate a given policy with dynamic programming, obtain the optimal value function, and derive the optimal policy from it; alternatively, we can iterate on state values directly, without evaluating any particular policy, and still obtain the optimal value function and optimal policy.
This lecture discusses how to solve a problem that can be modeled as an MDP whose details (transitions and rewards) are unknown, that is, how to estimate the optimal value function and optimal policy directly from the Agent's interaction with the environment. The material is again split into two parts. The first part focuses on policy evaluation, i.e. prediction: estimating the return the Agent will eventually obtain under a given policy when the details of the MDP are unknown.

Monte Carlo algorithm

Monte Carlo learning estimates state values directly from complete Episodes, without knowing the MDP's state transitions or immediate rewards. The value of a state is taken to be the average of the returns computed for that state over many Episodes.
Note: the return is not a property of an entire Episode; it is defined within an Episode for a particular state, as the sum of discounted immediate rewards obtained from that state until the end of the Episode. From one Episode we can therefore obtain the returns of all states visited in it. When a state appears more than once in an Episode, its return can be computed in different ways (see below).
A complete Episode starts from some state, the Agent interacts with the Environment until a terminal state is reached, and the environment gives the immediate reward at termination. A complete Episode does not require a particular starting state; it only requires that the Agent eventually reaches a terminal state recognized by the environment.
Monte Carlo learning has the following characteristics: it is not based on a model, but learns directly from experienced Episodes; it requires complete Episodes; the idea is to replace the value with the average return. In theory, the more Episodes, the more accurate the result. This relies on the law of large numbers: expectations are replaced by sample averages, which are unbiased estimates of those expectations.

Monte Carlo policy evaluation

Goal: given a policy, learn its state-value function from a series of complete Episode experiences.
The mathematical description is as follows:
An Episode generated under a specific policy $\pi$ can be expressed as the following sequence:
$$S_1, A_1, R_2, S_2, A_2, \ldots, S_t, A_t, R_{t+1}, \ldots, S_k \sim \pi$$
The return of state $S_t$ at time $t$ is:
$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$$
where $T$ is the termination time.
The value of a state $s$ under this policy is:
$$v_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]$$
Often the immediate reward appears only at the end of an Episode, but intermediate states may also yield immediate rewards. Note that $R_t$ in the formula denotes the immediate reward obtained in any state; this deserves special attention.
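As a concrete illustration, here is a minimal sketch in plain Python (with a made-up reward list and a hypothetical `episode_returns` helper) showing how the return $G_t$ of every time step in one complete Episode can be computed by working backwards from the terminal state:

```python
def episode_returns(rewards, gamma=0.9):
    """Compute G_t for every time step of one complete Episode.

    rewards[t] is the immediate reward R_{t+1} received after leaving state S_t.
    Working backwards, G_t = R_{t+1} + gamma * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical Episode with rewards along the way and at the end.
print(episode_returns([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))
```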
During state transitions, an Episode may return to the same state one or more times. How should we then count the number of visits to that state and compute its return from the Episode? Two methods are available:

First-visit Monte Carlo policy evaluation

To evaluate a state $s$ under a given policy using a series of complete Episodes, only the first occurrence of $s$ in each Episode is included in the calculation:
(Figure: first-visit MC update formula.)

Every-visit Monte Carlo policy evaluation

To evaluate a state $s$ under a given policy using a series of complete Episodes, every occurrence of $s$ in an Episode's state-transition chain is included in the calculation. The calculation formula is the same as above; only the meaning of the visit count differs.
(Figure: every-visit MC update formula.)
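The two variants differ only in which visits contribute a return to the average. Below is a minimal illustrative sketch, assuming each Episode is given as a list of (state, reward) pairs where the reward is the immediate reward received after leaving that state; the function name and data format are made up for this example:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Monte Carlo policy evaluation.

    episodes: list of episodes; each episode is a list of (state, reward)
              pairs, where reward is the immediate reward R_{t+1} received
              after leaving that state. The policy is fixed and implicit
              in how the episodes were generated.
    Returns a dict mapping state -> estimated value V(s).
    """
    returns_sum = defaultdict(float)   # S(s): total return observed for s
    visit_count = defaultdict(int)     # N(s): number of counted visits to s

    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        g = 0.0
        returns = []
        for _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue           # first-visit: count only the first occurrence
            seen.add(state)
            visit_count[state] += 1
            returns_sum[state] += returns[t]

    return {s: returns_sum[s] / visit_count[s] for s in visit_count}

# Tiny hypothetical example: two episodes over states 'A', 'B', terminal 'T'.
eps = [[('A', 0.0), ('B', 1.0)],               # A -> B -> T
       [('A', 0.0), ('A', 0.0), ('B', 2.0)]]   # A -> A -> B -> T
print(mc_prediction(eps, gamma=0.9, first_visit=True))
print(mc_prediction(eps, gamma=0.9, first_visit=False))
```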

Incremental mean updates

Here we describe a method commonly used in practice to update a mean incrementally, so that when computing the average return there is no need to store all previous returns; instead, the mean is updated each time a new return is obtained.
The formula is as follows:
$$\begin{aligned} \mu_{k} &= \frac{1}{k} \sum_{j=1}^{k} x_{j} \\ &= \frac{1}{k}\left(x_{k} + \sum_{j=1}^{k-1} x_{j}\right) \\ &= \frac{1}{k}\left(x_{k} + (k-1)\mu_{k-1}\right) \\ &= \mu_{k-1} + \frac{1}{k}\left(x_{k} - \mu_{k-1}\right) \end{aligned}$$
where $\mu_k$ is the mean of the first $k$ elements and $x_k$ is the $k$-th element.
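A quick sketch (with arbitrary sample values) verifying that the incremental form gives the same result as the batch mean:

```python
xs = [3.0, 7.0, 2.0, 8.0, 5.0]   # arbitrary sample values

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu = mu + (x - mu) / k       # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

print(mu, sum(xs) / len(xs))     # both print the same mean, 5.0
```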
The incremental-mean formula is straightforward. Applying it to Monte Carlo policy evaluation yields the following incremental Monte Carlo update.

Incremental Monte Carlo updates

For each Episode $S_1, A_1, R_2, S_2, A_2, \ldots, S_t, A_t, R_{t+1}, \ldots, S_k$ in a series of Episodes:
for each state $S_t$ in the Episode with return $G_t$, each time $S_t$ is encountered, update the mean value $V(S_t)$ of that state using:
$$\begin{aligned} N\left(S_t\right) &\leftarrow N\left(S_t\right) + 1 \\ V\left(S_t\right) &\leftarrow V\left(S_t\right) + \frac{1}{N\left(S_t\right)}\left(G_t - V\left(S_t\right)\right) \end{aligned}$$
When dealing with non-stationary problems, it is useful to track a running mean in this way and to discard Episode information that has already been processed. In that case a step-size parameter $\alpha$ can be introduced to update the state value:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha\left(G_t - V\left(S_t\right)\right)$$
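A minimal sketch of these incremental updates applied after each Episode, showing both the exact $1/N(S_t)$ running-mean form and the constant-$\alpha$ form for non-stationary problems; the Episode format (a list of (state, reward) pairs) is the same hypothetical one as in the earlier sketch:

```python
from collections import defaultdict

V = defaultdict(float)   # state-value estimates
N = defaultdict(int)     # visit counts

def incremental_mc_update(episode, gamma=1.0, alpha=None):
    """Update V in place from one complete episode of (state, reward) pairs.

    If alpha is None, use the exact running mean 1/N(s); otherwise use a
    constant step-size, which gradually forgets old episodes.
    """
    g = 0.0
    # Walk the episode backwards so the return G_t is available at each step.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (g - V[state])

incremental_mc_update([('A', 0.0), ('B', 1.0)], gamma=0.9)
incremental_mc_update([('A', 0.0), ('B', 2.0)], gamma=0.9, alpha=0.1)
print(dict(V))
```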
This concludes the main idea and description of Monte Carlo learning. Because Monte Carlo learning has several shortcomings (discussed in detail later), it is not widely used in practice. Next we introduce the commonly used TD learning method.

Temporal difference learning (TD-learning)

Introduction

Temporal-difference learning is abbreviated TD learning. Its characteristics are as follows: like Monte Carlo learning, it learns from Episodes and does not need a model of the environment; but it can learn from incomplete Episodes, because it bootstraps: it makes a guess about the outcome of the Episode and keeps updating that guess as it goes. In other words, an estimated value function is used to update the value function.
We have already seen that Monte Carlo learning uses the actual return $G_t$ to update the value:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha\left(G_t - V\left(S_t\right)\right)$$
In TD learning, the target used to estimate the value of a state is composed of the immediate reward $R_{t+1}$ plus the estimated value of the next state, $V(S_{t+1})$, multiplied by the discount factor $\gamma$, which matches the Bellman equation:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha\left(R_{t+1} + \gamma V\left(S_{t+1}\right) - V\left(S_t\right)\right)$$
For example, suppose I travel from point A through point B and end at point C. Originally I could not update the value of A while at B; I could only update it after reaching C. Now, while at B, I can use the estimated value of B (the expected reward from B to C) to update the value of A immediately. $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target (TD estimate), and $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
Bootstrapping here refers to using the TD target $R_{t+1} + \gamma V(S_{t+1})$ in place of the full cumulative return.
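A minimal TD(0) prediction sketch. It assumes experience arrives one transition at a time as (state, reward, next_state, done) tuples; the transition data here are hypothetical:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    When the transition ends the episode, the value of the terminal
    state is taken to be 0, so the TD target is just the reward.
    """
    td_target = r + (0.0 if done else gamma * V[s_next])
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

V = defaultdict(float)

# Hypothetical stream of transitions from one episode: A -> B -> C (terminal).
transitions = [('A', 0.0, 'B', False),
               ('B', 1.0, 'C', True)]
for s, r, s_next, done in transitions:
    td0_update(V, s, r, s_next, done)
print(dict(V))
```

Note that V('A') is already updated the moment we reach B, using the current estimate of V('B'), exactly as in the A-B-C example above.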

Bootstrapping

Bootstrapping literally means "pulling oneself up by one's bootstraps"; in statistics, the bootstrap method estimates properties of a population by repeatedly resampling from a sample.
https://zhuanlan.zhihu.com/p/54201828 See this post on Zhihu for details.

MC and TD comparison example

Imagine you are driving home after work and need to estimate how long the whole trip will take. Suppose that on the way a dangerous situation suddenly arises: a car coming from the opposite direction seems about to collide with you; in the worst case it could be fatal, but in the end both drivers take action and no collision actually occurs. With Monte Carlo learning, the large negative reward this dangerous situation might have triggered is never taken into account and does not affect the predicted total time. With TD learning, on encountering such a situation the person immediately updates the value of that state; realizing it is worse than the previous state, he immediately considers slowing down to buy time. In other words, you do not have to wait until after the (possibly fatal) outcome, as in Monte Carlo learning, to update the state value; in that case there would be no way to update it at all. Throughout the whole process of returning home (one Episode), the TD algorithm continually updates its estimate of the total time to get home, based on the time already spent and the time still expected to be needed.
(Table: driving-home example: at each state, the elapsed time, the predicted time still to go, and the predicted total time.)
Based on the data in the table above, the figure below shows how the two learning methods, Monte Carlo learning and TD learning, update the value function (the value of each state). Here, the estimated time still needed to get home from a given state is used to reflect that state's value indirectly: the longer the estimated time from a location, the lower the value of that location, and the more an optimized policy should avoid entering that state. In the Monte Carlo process, the driver does not update the estimated time to get home as he encounters various situations on the road. Only after arriving home and obtaining the actual total travel time does he revise the estimated time-to-home of each major node along the way, and these new estimates help him make decisions on the next trip. With TD learning, when you first leave the office you might estimate the total trip at 30 minutes; but when you reach the car and find it is raining, you immediately conclude that the original estimate was too optimistic, because experience tells you that rain lengthens the trip, so you update the current state's estimate from 30 to 40 minutes. Likewise, when you drive off the highway, you re-estimate the remaining time based on the current state (location, road conditions, and so on), and so on until you get home and obtain the actual total time. Throughout this process you update the state values in real time as the state changes.
(Figure: value updates in the driving-home example under MC and TD.)

Comparison between MC and TD algorithms

(Figure: learning curves of MC and TD.)
The figure above compares the efficiency of the MC and TD algorithms. The abscissa is the number of Episodes experienced; the ordinate is the mean-square error between the estimated state-value function and the true state-value function. The black curves show the MC algorithm for different step-sizes, and the grey curves show the TD algorithm. TD is clearly more efficient than MC.

Comparison of convergence between MC and TD

The MC algorithm converges to the solution that minimizes the mean-squared error, i.e. it fits the value function as closely as possible to the observed returns:
$$\sum_{k=1}^{K} \sum_{t=1}^{T_k}\left(G_t^k - V\left(s_t^k\right)\right)^2$$
In this formula, $k$ is the Episode index, $K$ is the total number of Episodes, $t$ is the index of a state within an Episode (the 1st, 2nd, 3rd, ... state), $T_k$ is the total number of states in the $k$-th Episode, $G_t^k$ is the return eventually obtained from state $s_t$ at time $t$ in the $k$-th Episode, and $V(s_t^k)$ is the value of state $s_t$.
The TD algorithm converges to the state values of the maximum-likelihood Markov model constructed from the available experience. That is, the TD algorithm implicitly first estimates the transition probabilities between states from experience, together with the expected immediate reward of each state-action pair:
$$\begin{aligned} \hat{\mathcal{P}}_{s, s^{\prime}}^{a} &= \frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\left(s_t^k, a_t^k, s_{t+1}^k = s, a, s^{\prime}\right) \\ \hat{\mathcal{R}}_{s}^{a} &= \frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\left(s_t^k, a_t^k = s, a\right) r_t^k \end{aligned}$$
The comparison shows that the TD algorithm exploits the Markov property of the MDP and is therefore more effective in Markov environments, whereas the MC algorithm does not exploit the Markov property and is usually more effective in non-Markov environments.
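A sketch of the counting estimates above, assuming experience is given as lists of (state, action, reward, next_state) transitions; the state and action names are purely illustrative:

```python
from collections import defaultdict

def estimate_mdp(episodes):
    """Maximum-likelihood estimates of transitions and rewards from experience.

    episodes: list of episodes, each a list of (s, a, r, s_next) transitions.
    Returns (P_hat, R_hat) where
      P_hat[(s, a)][s_next] ~ estimated probability of s -> s_next under a,
      R_hat[(s, a)]         ~ estimated expected immediate reward of (s, a).
    """
    counts = defaultdict(int)                             # N(s, a)
    next_counts = defaultdict(lambda: defaultdict(int))   # N(s, a, s')
    reward_sum = defaultdict(float)                       # sum of r over (s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)] += 1
            next_counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r

    P_hat = {sa: {sn: c / counts[sa] for sn, c in d.items()}
             for sa, d in next_counts.items()}
    R_hat = {sa: reward_sum[sa] / counts[sa] for sa in counts}
    return P_hat, R_hat

# Hypothetical experience: two tiny episodes.
eps = [[('A', 'go', 0.0, 'B'), ('B', 'go', 1.0, 'T')],
       [('A', 'go', 0.0, 'T')]]
print(estimate_mdp(eps))
```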

TD($\lambda$)

The TD algorithms introduced so far are actually TD(0) algorithms. The 0 in parentheses means looking one step ahead from the current state. What happens if we look two or more steps ahead when updating the state value? This introduces the concept of the n-step return.
(Figure: the spectrum of n-step lookahead, from TD(0) through n-step TD to MC.)
Define the n-step return:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^{n} V\left(S_{t+n}\right)$$
The update formula of the n-step TD state-value function is then:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha\left(G_t^{(n)} - V\left(S_t\right)\right)$$
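A sketch of the n-step return computed from one Episode's rewards and the current value estimates (all inputs are made up):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards[i] is R_{i+1}, values[i] is the current estimate V(S_i).
    If t + n runs past the end of the episode, the remaining terms are
    simply the rewards up to termination (the terminal value is 0).
    """
    T = len(rewards)                     # episode length
    g = 0.0
    for i in range(n):
        if t + i >= T:
            return g                     # episode ended before n steps
        g += (gamma ** i) * rewards[t + i]
    if t + n < len(values):
        g += (gamma ** n) * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]          # R_1 .. R_5 (hypothetical)
values = [0.5, 0.4, 0.6, 0.2, 0.1, 0.0]      # V(S_0) .. V(S_5), terminal V = 0
print(n_step_return(rewards, values, t=0, n=2))   # 2-step return from S_0
```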
Given n-step prediction, which value of n works best?
Here we introduce a new parameter, $\lambda$. With it, the predictions of all step lengths can be combined without increasing computational complexity. This gives the $\lambda$-prediction and the $\lambda$-return.

  • $\lambda$-return
    The $\lambda$-return weights the n-step returns: the n-step return $G_t^{(n)}$ receives weight $(1-\lambda)\lambda^{n-1}$. With this weighting we obtain:
    $$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

The TD($\lambda$) update formula is:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha\left(G_t^{\lambda} - V\left(S_t\right)\right)$$
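A sketch of the forward-view $\lambda$-return for a finite Episode. For a terminating Episode the infinite sum collapses: all weight beyond the final step goes to the full Monte Carlo return $G_t$, so $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$. The data and function name are illustrative:

```python
def lambda_return(rewards, values, t, lam=0.8, gamma=0.9):
    """Forward-view lambda-return G_t^lambda for a finite episode.

    rewards[i] is R_{i+1}; values[i] is the current estimate V(S_i),
    with values[-1] the terminal state's value (0). For a terminating
    episode, all weight beyond the final step goes to the full return G_t.
    """
    T = len(rewards)

    def n_step(n):
        # G_t^(n): n discounted rewards plus the bootstrapped tail value.
        g = sum((gamma ** i) * rewards[t + i] for i in range(n))
        return g + (gamma ** n) * values[t + n]

    g_lam = 0.0
    for n in range(1, T - t):                       # truncated n-step returns
        g_lam += (1 - lam) * (lam ** (n - 1)) * n_step(n)
    g_lam += (lam ** (T - t - 1)) * n_step(T - t)   # full return G_t (terminal V = 0)
    return g_lam

rewards = [0.0, 1.0, 2.0]              # hypothetical episode, T = 3
values = [0.5, 0.4, 0.3, 0.0]          # V(S_0..S_2) and terminal V(S_3) = 0
print(lambda_return(rewards, values, t=0, lam=0.8))
```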

  • Distribution of the TD($\lambda$) weights
    (Figure: geometric weighting of the n-step returns in the $\lambda$-return.)
    This picture is fairly easy to understand. For example, the 3-step return ($n = 3$) receives a weight in the $\lambda$-return equal to the area of the corresponding shaded strip on the left, while the final return obtained at the terminal step $T$ receives all the remaining weight after $T$. The areas of all strips sum to 1. This geometric-series design is also chosen for the computational convenience of implementing the algorithm.

Two ways of understanding TD($\lambda$)

Forward view

After introducing $\lambda$, you will find that to update the value of a state you must go through the entire Episode to obtain the immediate rewards of every state, including the terminal one. This is the same requirement as the MC algorithm, so from this view TD($\lambda$) has the same drawbacks as the MC method. $\lambda$ takes values in $[0, 1]$; when $\lambda = 1$ it corresponds to the MC algorithm. This makes practical computation inconvenient.
(Figure: forward view of TD($\lambda$).)

Backward view

The backward view of TD($\lambda$), on the other hand, provides a mechanism for single-step updates, illustrated by the following example.
A rat receives an electric shock after three consecutive bell rings followed by one light signal. When attributing the cause of the shock, is the bell or the light the more important factor?
(Figure: the rat example: bell, bell, bell, light, then shock.)
Frequency heuristic: attribute the cause to the states that occurred most frequently.
Recency heuristic: attribute the cause to the most recent states.
We introduce a quantity for each state, the eligibility trace, which combines the two heuristics above. It is defined as:
$$\begin{aligned} E_0(s) &= 0 \\ E_t(s) &= \gamma\lambda E_{t-1}(s) + \mathbf{1}\left(S_t = s\right) \end{aligned}$$
Here $E_t(s)$ is kept per state: it is incremented by 1 each time the state is visited and decays over time. Note that $E$ here is not a matrix.

(Figure: eligibility trace $E_t(s)$ of a single state over time; the vertical ticks below the time axis mark visits to state $s$.)
In the figure, the abscissa is time, the vertical ticks below the axis mark the moments at which state $s$ is visited, and the ordinate is the eligibility-trace value $E$. When a state is visited repeatedly in quick succession, its $E$ value is incremented by one unit on top of the decayed previous value; the state is then credited with a larger share of the final return, so its value update takes the eventual outcome more into account. Conversely, the further a visit lies from the final state, the smaller its contribution to the final return, and the less the final outcome needs to be considered when updating that state.
In the backward view, the value update triggered at the current state is related to all previously visited states.
(Figure: backward view of TD($\lambda$): the current TD error is propagated back to earlier states.)
The update formula based on eligibility traces is as follows:
$$\begin{aligned} \delta_t &= R_{t+1} + \gamma V\left(S_{t+1}\right) - V\left(S_t\right) \\ V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \end{aligned}$$
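A minimal backward-view TD($\lambda$) sketch: after every single step the TD error $\delta_t$ is computed and every state is updated in proportion to its eligibility trace. The transition format (state, reward, next_state, done) is the same hypothetical one as in the TD(0) sketch:

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Run backward-view TD(lambda) over one episode, updating V in place.

    transitions: list of (s, r, s_next, done) tuples.
    An eligibility trace E(s) is kept per state, bumped by 1 on each visit
    and decayed by gamma * lambda at every step.
    """
    E = defaultdict(float)                    # eligibility traces, reset per episode
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else V[s_next]   # terminal state has value 0
        delta = r + gamma * v_next - V[s]     # TD error delta_t
        E[s] += 1.0                           # accumulate the trace of the visited state
        for state in list(E):                 # credit every eligible state
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lam           # decay all traces
    return V

V = defaultdict(float)
episode = [('A', 0.0, 'B', False), ('B', 0.0, 'A', False), ('A', 1.0, 'T', True)]
td_lambda_episode(episode, V)
print(dict(V))
```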
When $\lambda = 0$, only the current state is updated, that is:
$$\begin{aligned} E_t(s) &= \mathbf{1}\left(S_t = s\right) \\ V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \end{aligned}$$
This is exactly the TD(0) update:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha \delta_t$$
When $\lambda = 1$, it is equivalent to the MC update. The proof is as follows. Assume state $s$ is visited for the first time at time-step $k$; the eligibility trace of TD(1) then evolves as:
$$E_t(s) = \gamma E_{t-1}(s) + \mathbf{1}\left(S_t = s\right) = \begin{cases} 0 & \text{if } t < k \\ \gamma^{t-k} & \text{if } t \geq k \end{cases}$$

Therefore, to prove that TD(1) is equivalent to MC, we only need to prove that the following holds:
$$\sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha\left(G_k - V\left(S_k\right)\right)$$
The proof proceeds as follows. The TD error is defined as:
$$\delta_t \doteq R_{t+1} + \gamma V\left(S_{t+1}\right) - V\left(S_t\right)$$
Then:
$$\begin{aligned} G_t - V\left(S_t\right) &= R_{t+1} + \gamma G_{t+1} - V\left(S_t\right) + \gamma V\left(S_{t+1}\right) - \gamma V\left(S_{t+1}\right) \\ &= \delta_t + \gamma\left(G_{t+1} - V\left(S_{t+1}\right)\right) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2\left(G_{t+2} - V\left(S_{t+2}\right)\right) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}\left(G_T - V\left(S_T\right)\right) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0 - 0) \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k \end{aligned}$$
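A quick numerical sanity check of this telescoping identity on made-up numbers: the discounted sum of TD errors over the rest of the Episode equals $G_t - V(S_t)$ (with the convention $G_T = V(S_T) = 0$):

```python
import random

random.seed(0)
gamma = 0.9
T = 6
rewards = [random.uniform(-1, 1) for _ in range(T)]        # R_1 .. R_T
V = [random.uniform(-1, 1) for _ in range(T)] + [0.0]      # V(S_0..S_{T-1}), V(S_T)=0

# Returns G_t computed backwards (G_T = 0).
G = [0.0] * (T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]

# TD errors delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t).
delta = [rewards[t] + gamma * V[t + 1] - V[t] for t in range(T)]

t = 2
lhs = G[t] - V[t]
rhs = sum((gamma ** (k - t)) * delta[k] for k in range(t, T))
print(lhs, rhs)   # the two numbers agree up to floating-point error
```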

So far we have discussed the two cases $\lambda = 0$ and $\lambda = 1$. To establish the equivalence in general, i.e. for any $\lambda$, we need to prove the following theorem:
(Figure: theorem on the equivalence of the forward and backward views of TD($\lambda$).)
The eligibility trace now updates as:
$$E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}\left(S_t = s\right) = \begin{cases} 0 & \text{if } t < k \\ (\gamma\lambda)^{t-k} & \text{if } t \geq k \end{cases}$$
That is, we must show:
$$\sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T} (\gamma\lambda)^{t-k} \delta_t = \alpha\left(G_k^{\lambda} - V\left(S_k\right)\right)$$
This is essentially the same as the $\lambda = 1$ case. The proof is as follows:
$$\begin{aligned} G_t^{\lambda} - V\left(S_t\right) &= -V\left(S_t\right) + (1-\lambda)\lambda^{0}\left(R_{t+1} + \gamma V\left(S_{t+1}\right)\right) \\ &\quad + (1-\lambda)\lambda^{1}\left(R_{t+1} + \gamma R_{t+2} + \gamma^{2} V\left(S_{t+2}\right)\right) \\ &\quad + (1-\lambda)\lambda^{2}\left(R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3} V\left(S_{t+3}\right)\right) + \cdots \\ &= -V\left(S_t\right) + (\gamma\lambda)^{0}\left(R_{t+1} + \gamma V\left(S_{t+1}\right) - \gamma\lambda V\left(S_{t+1}\right)\right) \\ &\quad + (\gamma\lambda)^{1}\left(R_{t+2} + \gamma V\left(S_{t+2}\right) - \gamma\lambda V\left(S_{t+2}\right)\right) \\ &\quad + (\gamma\lambda)^{2}\left(R_{t+3} + \gamma V\left(S_{t+3}\right) - \gamma\lambda V\left(S_{t+3}\right)\right) + \cdots \\ &= (\gamma\lambda)^{0}\left(R_{t+1} + \gamma V\left(S_{t+1}\right) - V\left(S_t\right)\right) \\ &\quad + (\gamma\lambda)^{1}\left(R_{t+2} + \gamma V\left(S_{t+2}\right) - V\left(S_{t+1}\right)\right) \\ &\quad + (\gamma\lambda)^{2}\left(R_{t+3} + \gamma V\left(S_{t+3}\right) - V\left(S_{t+2}\right)\right) + \cdots \\ &= \delta_t + \gamma\lambda \delta_{t+1} + (\gamma\lambda)^{2} \delta_{t+2} + \cdots \end{aligned}$$
The following table shows how the different algorithms relate to one another for the various values of $\lambda$.
(Table: relationship between MC, TD(0), and TD($\lambda$) under offline and online updates for different values of $\lambda$.)

I still do not fully understand the difference between offline updates and online updates.


Origin: blog.csdn.net/weixin_42988382/article/details/105490583