4. Reinforcement Learning -- Model-Free Control


All the content of this course ultimately builds towards this lecture. Through it, we will learn how to train an agent so that, in a completely unknown environment, it can complete tasks well and collect as much reward as possible.
The previous lecture explained how to make predictions when the model is unknown. Prediction means evaluating a given policy, that is, determining the value function of a state (or state-action pair) under that policy. This lecture is mainly about how to optimize the value function when the model is unknown. This process is also called model-free control.

Policy iteration with a known model

Insert image description here
The core of generalized policy iteration is the alternation of two processes: one is policy evaluation, the other is policy improvement.
Note that improving the policy with the dynamic programming algorithm requires knowing all the successor states of a given state and the transition probabilities between states:
$$\pi^{\prime}(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\; \mathcal{R}_{s}^{a} + \mathcal{P}_{ss^{\prime}}^{a} V\left(s^{\prime}\right)$$

  • For model-free policy iteration
    Is this method suitable for Monte Carlo learning when the model is unknown? The answer is no; there are at least two problems. The first is that when the model is unknown, it is impossible to know all the successor states of the current state, and therefore impossible to determine which action is more appropriate in that state. The way to solve this problem is to use the state-action value Q(s, a) instead of the state value:
    The purpose of this is to improve the policy without knowing the entire model; we only need to know which action has the greatest value in a given state. Specifically, starting from an initial Q and policy π, we first update the Q(s, a) value of every state-action pair under this policy, and then derive an improved greedy policy from the updated Q (see the sketch after this list).
    Even so, at least one problem remains: if we always improve the policy greedily, insufficient sampling experience will very likely lead to a suboptimal policy. We need to try new actions from time to time; this is exploration.
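Below is a minimal sketch of this kind of model-free greedy improvement, assuming the action values are stored in a tabular NumPy array `Q` of shape (num_states, num_actions); the names are illustrative, not from the lecture:

```python
import numpy as np

def greedy_improvement(Q):
    """Model-free policy improvement: for every state, pick the action with
    the highest estimated Q(s, a). No transition model P or reward R is needed."""
    return np.argmax(Q, axis=1)  # new deterministic policy: state index -> action index

# Tiny usage example with made-up numbers.
Q = np.array([[0.1, 0.5],    # state 0: action 1 currently looks better
              [0.7, 0.2]])   # state 1: action 0 currently looks better
print(greedy_improvement(Q))  # -> [1 0]
```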

Ɛ-Greedy exploration (MC-control)

The goal of ε-greedy exploration is to ensure that every action available in a state has a non-zero probability of being selected, which guarantees continued exploration: with probability 1 − ε the action currently believed to be best is chosen, and with probability ε an action is chosen uniformly among all possible actions (including the currently best one). The mathematical expression is as follows:
$$\pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = a^{*} = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s, a) \\ \epsilon/m & \text{otherwise} \end{cases}$$
m is the total number of actions.
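A minimal sketch of this selection rule, assuming a tabular `Q` stored as a NumPy array of shape (num_states, num_actions); the function and variable names are illustrative:

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """Sample an action from the epsilon-greedy distribution pi(a|s):
    epsilon/m probability on every action, plus 1 - epsilon on the greedy one."""
    rng = rng or np.random.default_rng()
    m = Q.shape[1]                               # total number of actions
    probs = np.full(m, epsilon / m)              # epsilon/m for every action
    probs[np.argmax(Q[state])] += 1.0 - epsilon  # extra mass on the greedy action
    return rng.choice(m, p=probs)
```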

ε-greedy policy improvement theorem:
For any ε-greedy policy π, the ε-greedy policy π' derived from the corresponding $q_\pi$ is a policy improvement on π, that is, $v_{\pi'}(s) \ge v_{\pi}(s)$.

Proof:
Refer to https://zhuanlan.zhihu.com/p/54272316.
Of course, I feel the above reference is not complete and has flaws, so I will fill in the gaps here. First introduce the ε-soft concept: for all states and actions, $\pi(a \mid s) \geq \frac{\epsilon}{m}$, where $\epsilon > 0$; an ε-greedy policy is ε-soft. The key step is the inequality between the max term and the weighted average in the derivation from the slides:
$$\begin{aligned} q_{\pi}\left(s, \pi^{\prime}(s)\right) &= \sum_{a \in \mathcal{A}} \pi^{\prime}(a \mid s)\, q_{\pi}(s, a) \\ &= \epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a) + (1-\epsilon) \max_{a \in \mathcal{A}} q_{\pi}(s, a) \\ &\geq \epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a) + (1-\epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s)-\frac{\epsilon}{m}}{1-\epsilon}\, q_{\pi}(s, a) \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a) = v_{\pi}(s) \end{aligned}$$
The inequality step, which is the least obvious part, can be justified as follows:
By the ε-soft property, $\pi(a \mid s) \geq \frac{\epsilon}{m}$, so we may write $\pi(a \mid s) = \frac{\epsilon}{m} + \Delta$ with $\Delta \geq 0$, and consider a policy of the following form:
$$\pi(a \mid s) = \begin{cases} 1 - \epsilon - (m-1)\Delta + \frac{\epsilon}{m} & \text{for the greedy action} \\ \frac{\epsilon}{m} + \Delta & \text{for every non-greedy action} \end{cases}$$
It is assumed here:

  1. There are m − 1 non-greedy actions and 1 greedy action, and the probabilities π(a|s) sum to 1, so for the greedy action:
    $$\pi(a \mid s) = 1 - \epsilon + \frac{\epsilon}{m} - (m-1)\Delta$$
  2. In addition, all non-greedy Q(s, a) values are replaced by the largest non-greedy Q value, denoted $q_s$ (an upper bound); the Q value of the greedy action is denoted $q_m$:
    $$\max_{a \in \mathcal{A}} q_{\pi}(s, a) = q_m$$
    So we have the following:
    $$\begin{aligned} \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s)-\frac{\epsilon}{m}}{1-\epsilon}\, q_{\pi}(s, a) &\leq \frac{\left(1-\epsilon-\Delta m+\Delta+\frac{\epsilon}{m}\right)-\frac{\epsilon}{m}}{1-\epsilon}\, q_m + \frac{\left(\frac{\epsilon}{m}+\Delta\right)-\frac{\epsilon}{m}}{1-\epsilon}\, q_s (m-1) \\ &= \frac{(1-\epsilon-\Delta m+\Delta)\, q_m + \Delta(m-1)\, q_s}{1-\epsilon} \\ &= \frac{(1-\epsilon)\, q_m - \Delta(m-1)\, q_m + \Delta(m-1)\, q_s}{1-\epsilon} \\ &= q_m + \frac{\Delta(m-1)\left(q_s - q_m\right)}{1-\epsilon} \\ &\leq q_m = \max_{a \in \mathcal{A}} q_{\pi}(s, a) \end{aligned}$$
    The above shows that $q_{\pi}(s, \pi'(s))$ is at least as large as $v_{\pi}(s)$, but it does not yet show that $v_{\pi'}(s)$ is at least as large as $v_{\pi}(s)$, so one more step is needed:
    $$\begin{aligned} q_{\pi}(s, \pi'(s)) &= \mathbb{E}\left[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = \pi'(S_t)\right] \\ &\leq \mathbb{E}\left[R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s, A_t = \pi'(S_t)\right] \\ &= \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) \mid \ldots\right] \\ &\leq \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\pi}(S_{t+2}, \pi'(S_{t+2})) \mid \ldots\right] \\ &\;\;\vdots \\ &\leq v_{\pi'}(s) \end{aligned}$$
    This chain of inequalities completes the proof that $v_{\pi'}(s) \ge v_{\pi}(s)$.
    After solving the above two problems, we finally see the whole picture of Monte Carlo control: use the Q function for policy evaluation and ε-greedy exploration for policy improvement. This method eventually converges to the optimal policy.
    Insert image description here
    Each upward or downward arrow in the figure corresponds to many episodes: in general, we perform a Q-function update or a policy improvement only after experiencing multiple episodes. In fact, we can also update the Q function or improve the policy after every single episode. Either way, under ε-greedy exploration we can only ever obtain an approximate Q function for a given policy, and the algorithm has no termination condition because it keeps exploring. We therefore have to pay attention to two things: on the one hand, we do not want to lose any information about potentially better states; on the other hand, as the policy improves we ultimately want to end up with an optimal policy, because the optimal policy should not contain any random action choices. Another theoretical concept is introduced for this purpose: GLIE.

GLIE(Greedy in the Limit with Infinite Exploration)

The idea of this approach is to keep exploring without limit while becoming greedy in the limit. Concretely:

  1. All state-action pairs that are ever encountered are explored infinitely often;
  2. In addition, as exploration continues indefinitely, the ε value of the greedy policy tends to 0. For example, if we take $\epsilon = 1/k$ (where k is the number of episodes explored), then ε-greedy Monte Carlo control has the GLIE property.

The Monte Carlo control process based on GLIE is as follows:
Insert image description here
Theorem about GLIE: GLIE Monte Carlo control converges to the optimal action-value function, that is, $Q(s, a)$ converges to $q_{*}(s, a)$. (Find relevant papers to supplement the proof.)
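A compact sketch of GLIE Monte Carlo control, assuming an episodic environment with discrete states and actions exposed through an older gym-style interface (`env.reset()` returns the state, `env.step(a)` returns `(state, reward, done, info)`); it uses every-visit updates with ε_k = 1/k and a 1/N(s, a) step size. The interface and all names are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0):
    Q = defaultdict(float)   # Q[(s, a)] -> action-value estimate
    N = defaultdict(int)     # visit counts, used as the 1/N step size
    n_actions = env.action_space.n

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k        # GLIE schedule: epsilon -> 0 as k -> infinity
        # Generate one episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Every-visit MC update, working backwards through the episode.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```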

TD Control

on-policy TD Control

As mentioned in the previous lecture, TD has many advantages over MC: lower variance, online real-time learning, and learning from incomplete episodes.
So it is natural to ask whether we can use TD learning instead of MC learning for control problems. The answer is yes; this is the SARSA algorithm explained below.
Insert image description here

SARSA

The name SARSA comes from the sequence shown in the figure below: for a state S and a chosen action A, the state-action pair (S, A) interacts with the environment; after receiving the agent's action, the environment returns the immediate reward R and the next state S'. The agent then follows its current policy to produce an action A', reads off the value Q(S', A') of the next state-action pair from the current action-value function, and uses this Q(S', A') to update the value Q(S, A) of the previous state-action pair.
The flow chart of the algorithm is shown below:
Insert image description here
A more intuitive explanation is this: an agent is in some state S, in which it can try various actions. Following its current policy, it chooses an action A and actually executes it, interacting with the environment. The environment returns an immediate reward R according to the action and moves to the next state S'. In this new state S', the agent again follows its current policy and produces an action A'; at this point it does not execute A', but looks up the value of the pair (S', A') in its current action-value function and uses this value, together with the immediate reward obtained for taking A in state S, to update the (state-)action value of taking A in S.

Pseudocode of SARSA strategy:
Insert image description here
Note:
1. Q(s,a) in the algorithm is stored in a large table, which is not suitable for solving large-scale problems;
2. For each episode, action A is generated from the current policy and is the action actually taken in the episode. In the loop that updates the value of the state-action pair Q(S, A), the agent does not actually execute action A' in S'; executing A' is left to the next iteration of the loop. In fact, the last line of the pseudocode, assigning A' to A, can be regarded as redundant, because the action for the next state could equally be selected in the next iteration under the policy derived from the newly updated value function. This is the meaning of on-policy: the policy derived from the current value function is used to select the action for this step, the value function is then updated, and the updated values provide a new policy for selecting the next action.
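A runnable tabular sketch of the SARSA loop described by the pseudocode, under the same assumptions as the earlier sketches (small discrete state/action spaces, older gym-style `reset`/`step` interface); names are illustrative:

```python
import numpy as np

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q(s, a) kept in one big table, as note 1 above points out (not scalable).
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def policy(s):
        # epsilon-greedy with respect to the current Q
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = policy(s)                        # choose A from S using the current policy
        done = False
        while not done:
            s2, r, done, _ = env.step(a)     # take A, observe R and S'
            a2 = policy(s2)                  # choose A' from S' using the same policy
            # On-policy TD target: bootstrap from Q(S', A') for the action actually chosen.
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2                    # S <- S', A <- A'
    return Q
```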

Regarding the convergence theorem of SARSA: the SARSA algorithm converges to the optimal action-value function when the following two conditions are met.
Condition 1: the sequence of policies $\pi_t(a \mid s)$ satisfies the GLIE property;
Condition 2: the step sizes $\alpha_t$ satisfy:
$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
However, according to what was taught in class and to practical experience, convergence is often observed even when the above two conditions are not strictly satisfied.

n-step SARSA

The n-step return was introduced in TD(λ):
Insert image description here
A similar idea can be used in SARSA to define the n-step Q-return:
$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q\left(S_{t+n}\right)$$
Here $q_t$ corresponds to a state-action pair $\langle s_t, a_t \rangle$ and represents the value of taking a particular action in a particular state. If n = 1, the Q value of the pair $\langle s_t, a_t \rangle$ can be expressed in two parts: one part is the immediate reward $R_{t+1}$ obtained on leaving state $s_t$ (the immediate reward is associated only with the state, not with the action taken there); the other part is the discounted Q value of the new state-action pair $\langle s_{t+1}, a_{t+1} \rangle$: the environment gives the agent the new state $s_{t+1}$, and $a_{t+1}$ is the action produced by the current policy when $s_{t+1}$ is observed, giving $Q(s_{t+1}, a_{t+1})$ weighted by the discount factor. When n = 2, the immediate rewards of the first two steps are used before bootstrapping from the Q value of the state then reached; when $n = \infty$, the Q return is computed from immediate rewards all the way to the end of the episode, when the agent enters the terminal state and receives the terminal reward.
In addition, this definition does not explicitly show the state-action pair, so it is easy to confuse it with the earlier n-step return $G$. In fact, Q itself already includes the action, namely the action produced in the given state under the current policy. There is a definite relationship between the Q-return and the G-return, which can be understood via the Bellman equation.

With the above definition, the n-step SARSA update moves Q towards the n-step Q-return, as follows:
$$Q\left(S_t, A_t\right) \leftarrow Q\left(S_t, A_t\right) + \alpha\left(q_t^{(n)} - Q\left(S_t, A_t\right)\right)$$
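A small sketch of how the n-step Q-return and this update could be computed from a stored segment of trajectory; it is illustrative only (a full online n-step SARSA also has to handle the end-of-episode boundary), and it bootstraps from the state-action pair reached after n steps, as in the discussion above:

```python
def n_step_q_return(rewards, Q, s_n, a_n, gamma):
    """q_t^{(n)} = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n}
                   + gamma^n * Q(S_{t+n}, A_{t+n}).
    `rewards` holds R_{t+1} ... R_{t+n}; (s_n, a_n) is the pair reached after n steps."""
    G = sum(gamma**i * r for i, r in enumerate(rewards))
    return G + gamma**len(rewards) * Q[s_n][a_n]

def n_step_sarsa_update(Q, s_t, a_t, q_n, alpha):
    """Q(S_t, A_t) <- Q(S_t, A_t) + alpha * (q_t^{(n)} - Q(S_t, A_t))"""
    Q[s_t][a_t] += alpha * (q_n - Q[s_t][a_t])
```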

SARSA(λ)

Define the λ-weighted Q-return:
$$q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$$
The update formula is:
$$Q\left(S_t, A_t\right) \leftarrow Q\left(S_t, A_t\right) + \alpha\left(q_t^{\lambda} - Q\left(S_t, A_t\right)\right)$$
Forward view: updating the Q value with this return requires traversing the complete episode, as shown in the figure below:
Insert image description here
Backward view: as with the backward view of TD(λ), the concept of an eligibility trace is introduced. The difference is that this time the E value is defined not for a state but for a state-action pair:
$$\begin{aligned} E_0(s, a) &= 0 \\ E_t(s, a) &= \gamma \lambda E_{t-1}(s, a) + \mathbf{1}\left(S_t = s, A_t = a\right) \end{aligned}$$
The trace reflects the causal contribution of each state-action pair to a result: the pairs visited closest to the result, and those visited most frequently before it, have the greatest influence on the result.
The following formulas introduce $E_t$ into the Q-value update of SARSA(λ):
$$\begin{aligned} \delta_t &= R_{t+1} + \gamma Q\left(S_{t+1}, A_{t+1}\right) - Q\left(S_t, A_t\right) \\ Q(s, a) &\leftarrow Q(s, a) + \alpha\, \delta_t\, E_t(s, a) \end{aligned}$$
Introducing the concept of $E_t$, SARSA(λ) enables more effective online learning, because it is not necessary to wait for a complete episode, and the data can be discarded once it has been used. $E_t$ is therefore mostly used in online learning algorithms.
The pseudocode of the SARSA(λ) algorithm is as follows:
Insert image description here
It should be mentioned here that E(s, a) needs to be reset to 0 at the start of each episode, which reflects that $E_t$ only acts within a single episode. Also, every previously visited (s, a) is updated at once on each step (see the sketch below), whereas plain SARSA only updates the value function of the current state-action pair.
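A backward-view SARSA(λ) sketch with one eligibility trace per (s, a) pair, matching the update equations above (same assumed discrete, older gym-style environment as the earlier sketches; names are illustrative):

```python
import numpy as np

def sarsa_lambda(env, num_episodes, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    nS, nA = env.observation_space.n, env.action_space.n
    Q = np.zeros((nS, nA))

    def policy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(nA)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        E = np.zeros((nS, nA))          # eligibility traces, reset every episode
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = policy(s2)
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            E[s, a] += 1.0              # accumulating trace for the visited pair
            Q += alpha * delta * E      # every previously visited (s, a) gets a share of delta
            E *= gamma * lam            # decay all traces
            s, a = s2, a2
    return Q
```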

off-policy

The characteristic of on-policy learning is that the policy currently being followed is also the policy the agent is learning to improve. Off-policy learning means following one policy $\mu(a \mid s)$ while evaluating another policy $\pi(a \mid s)$, that is, computing the state value function $v_{\pi}(s)$ and the action value function $q_{\pi}(s, a)$ of that other policy. Why do this? Because it makes it easier to learn from human experience or from the experience of other agents; we can also learn from old policies, and we can compare the strengths and weaknesses of two policies. Perhaps the main reason is to follow an exploratory policy while optimizing an existing target policy. Off-policy methods can again be divided into Monte Carlo-based and TD-based, according to whether complete episodes are required.

The importance of off-policy is:
1. We can learn by observing humans or other agents;
2. We can reuse experience generated by previous or old policies;
3. We can learn the optimal policy while following an exploratory policy;
4. We can learn about multiple policies while following one policy.

Importance sampling

Before using off-policy methods, we need to understand importance sampling:
$$\begin{aligned} \mathbb{E}_{X \sim P}[f(X)] &= \sum P(X) f(X) \\ &= \sum Q(X) \frac{P(X)}{Q(X)} f(X) \\ &= \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)} f(X)\right] \end{aligned}$$
That is, using a different distribution to estimate expectations.
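A tiny numerical illustration of this identity: estimate $\mathbb{E}_{X \sim P}[f(X)]$ from samples drawn from a different distribution Q by weighting each sample with P(x)/Q(x). The particular distributions and f below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):   # target density P: standard normal
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q(x):   # sampling density Q: normal with mean 1
    return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)

def f(x):   # E_P[f(X)] = E_P[X^2] = 1 for a standard normal
    return x**2

x = rng.normal(loc=1.0, size=100_000)   # samples drawn from Q, not from P
estimate = np.mean(p(x) / q(x) * f(x))  # importance-weighted average
print(estimate)                          # should be close to 1
```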

For the MC method, we generally do not use off-policy corrections, because the variance is too large: we have to sample all the way to the end of an episode, and the importance weights of the two policies compound over the whole episode. As Li Hongyi's explanation of importance sampling shows, two distributions that differ too much lead to excessive variance and are not suitable for importance sampling.
Insert image description here

The off-policy TD method

The task of off-policy TD learning is to use the TD method to follow one policy $\mu(a \mid s)$ while evaluating another policy $\pi(a \mid s)$. The mathematical expression is:
$$V\left(S_t\right) \leftarrow V\left(S_t\right) + \alpha\left(\frac{\pi\left(A_t \mid S_t\right)}{\mu\left(A_t \mid S_t\right)}\left(R_{t+1} + \gamma V\left(S_{t+1}\right)\right) - V\left(S_t\right)\right)$$
Note: the V function here is the value function of the target policy $\pi$!
This formula can be read as follows: the agent is in state $S_t$ and produces an action $A_t$ from the behaviour policy $\mu$; after executing this action it enters the new state $S_{t+1}$. How should the value of the original state be adjusted using the value of the new state? The off-policy approach compares, in state $S_t$, the probability of producing action $A_t$ under the target policy $\pi$ with the probability under the currently followed policy $\mu$. If the probability under $\pi$ is close to the probability under $\mu$, then updating the value of $S_t$ from the value of $S_{t+1}$ is supported by both policies, so the update is more convincing; it also shows that in state $S_t$ the two policies choose action $A_t$ with similar probability. If the ratio is small, it means the evaluated policy would rarely choose $A_t$, so when updating the value of $S_t$ we should not rely too heavily on the value of the state $S_{t+1}$ reached under the behaviour policy. Similarly, when the ratio is greater than 1, more weight is given to the evaluated policy. This amounts to drawing on the experience of the evaluated policy to update our own estimates.
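A one-step sketch of this off-policy TD(0) update, assuming a tabular V and that the two policies expose their action probabilities through functions `pi_target(a, s)` and `mu_behaviour(a, s)` (these names are illustrative):

```python
def off_policy_td0_update(V, s, a, r, s_next, pi_target, mu_behaviour,
                          alpha=0.1, gamma=0.99):
    """One off-policy TD(0) step for evaluating pi while following mu:
    the TD target is scaled by the importance ratio pi(a|s) / mu(a|s)."""
    rho = pi_target(a, s) / mu_behaviour(a, s)   # importance sampling ratio
    td_target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (td_target - V[s])
    return V
```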

Q-learning

The key point is that when updating the Q value of a state-action pair, the Q value used for the next state-action pair is not the one for the action the currently followed policy would take, but the one for the action produced by the target policy being evaluated. The formula is as follows:
$$Q\left(S_t, A_t\right) \leftarrow Q\left(S_t, A_t\right) + \alpha\left(R_{t+1} + \gamma Q\left(S_{t+1}, A^{\prime}\right) - Q\left(S_t, A_t\right)\right)$$
Here the action $A_{t+1}$ actually taken is produced by the behaviour policy $\mu$, while the target action $A^{\prime}$ is produced by the target policy $\pi$. There is no need for importance sampling in Q-learning, because the update bootstraps from a Q value rather than a V value.
The main form of Q-learning is: the policy the agent actually follows is an ε-greedy policy with respect to the current action-value function $Q(s, a)$, while the target policy $\pi$ is the plain greedy policy (without ε) with respect to $Q(s, a)$:
$$\pi\left(S_{t+1}\right) = \underset{a^{\prime}}{\operatorname{argmax}}\, Q\left(S_{t+1}, a^{\prime}\right)$$
In this way, the TD target value of Q learning can be greatly simplified:
$$\begin{aligned} R_{t+1} + \gamma Q\left(S_{t+1}, A^{\prime}\right) &= R_{t+1} + \gamma Q\left(S_{t+1}, \underset{a^{\prime}}{\arg\max}\, Q\left(S_{t+1}, a^{\prime}\right)\right) \\ &= R_{t+1} + \gamma \max_{a^{\prime}} Q\left(S_{t+1}, a^{\prime}\right) \end{aligned}$$
In this way, in state $S_t$ the agent follows the ε-greedy behaviour policy to obtain action $A_t$, and the corresponding Q value is updated some proportion of the way towards the maximum Q value attainable from state $S_{t+1}$. This algorithm makes the greedy target policy $\pi$ eventually converge to the optimal policy, while the agent actually follows the ε-greedy policy, which guarantees that sufficiently many new states are experienced.

The update formula of the Q function is as follows:
$$Q(S, A) \leftarrow Q(S, A) + \alpha\left(R + \gamma \max_{a^{\prime}} Q\left(S^{\prime}, a^{\prime}\right) - Q(S, A)\right)$$
The algorithm pseudocode of Q-learning is as follows:
Insert image description here
Theorem: Q-learning converges to the optimal action-value function: $Q(s, a) \rightarrow q_{*}(s, a)$.
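A tabular Q-learning sketch matching the pseudocode: the behaviour policy is ε-greedy, while the update bootstraps from the greedy (max) value of the next state. It assumes the same discrete, older gym-style environment as the earlier sketches:

```python
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # Target policy: greedy -> bootstrap from max_a' Q(S', a').
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
            s = s2
    return Q
```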

Summarize the relationship between DP and TD

The following two figures summarize the various DP algorithms and the various TD algorithms, and also reveal the differences and connections between them. In general, TD combines sampling with bootstrapping, while DP uses full-width backups over the true model. Viewed from the Bellman expectation equation for the state value function, we get iterative policy evaluation (DP) and TD learning; from the Bellman expectation equation for the state-action value function, we get Q-policy iteration (DP) and SARSA; and from the Bellman optimality equation for the state-action value function, we get Q-value iteration (DP) and Q-learning.
Insert image description here
Insert image description here


Origin blog.csdn.net/weixin_42988382/article/details/105596023