Systematic Machine Learning: Reinforcement Learning (5) - Solving Markov Decision Process Policies with TD

Reposted from: https://www.cnblogs.com/pinard/p/9529828.html

Temporal Difference (Temporal-Difference, TD)

The Monte Carlo method requires that all sampled experience sequences be complete state sequences. If we do not have complete state sequences, the Monte Carlo method cannot be used. In this article we discuss a method that can solve reinforcement learning problems without complete state sequences: temporal difference (Temporal-Difference, TD).

1. Introduction to Temporal Difference (TD)

    The temporal difference method, like the Monte Carlo method, is a model-free approach to solving reinforcement learning problems, so the model-free definitions of the prediction and control problems given earlier still apply.

    Prediction problem: given the five elements of reinforcement learning (the state set S, the action set A, the immediate reward R, the discount factor γ, and a given policy π), solve for that policy's state-value function v_π.

    Control problem: find the optimal value function and the optimal policy. Given the five elements of reinforcement learning (the state set S, the action set A, the immediate reward R, the discount factor γ, and the exploration rate ε), find the optimal action-value function q* and the optimal policy π*.

    Recall that the Monte Carlo method computes the return of a state as:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2R_{t+3}+... + \gamma^{T-t-1}R_{T}

    For the temporal difference method we do not have complete state sequences, only partial ones. How, then, can we approximate the return of a given state? Recall the Bellman equation:

v_{\pi}(s) = \mathbb{E}_{\pi}(R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s)

    Inspired by this, we can use R_{t+1} + γV(S_{t+1}) to approximate the return G_t; this quantity is called the TD target. The quantity R_{t+1} + γV(S_{t+1}) - V(S_t) is called the TD error, and replacing the return G_t with an approximation based on the TD target is called bootstrapping. In this way we only need two consecutive states and the corresponding reward to attempt to solve the reinforcement learning problem.

    Now that we have an approximate expression for the return G_t, we can use temporal difference to solve the prediction and control problems.

2. Solving the Prediction Problem with Temporal Difference

    Solving the prediction problem with temporal difference is similar to the Monte Carlo method, but there are two major differences. The first is that the expression for the return G_t is different; the temporal difference expression for G(t) is:

G(t) = R_{t+1} + \gamma V(S_{t+1})

    The second is that the coefficient in the iterative formula is slightly different. Recall the iterative formula of the Monte Carlo method:

V(S_t) = V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t) )

    Since temporal difference does not have complete sequences, there is no corresponding visit count N(S_t); a coefficient α in [0,1] is generally used instead. The value function iteration formulas of temporal difference are therefore:

V(S_t) = V(S_t) + \alpha(G_t - V(S_t) )

Q(S_t, A_t) = Q(S_t, A_t) +\alpha(G_t - Q(S_t, A_t) )
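
    As a minimal sketch of this update, assuming a tabular value function stored in a Python dict (the function name and arguments are illustrative, not from the original code):

    def td0_update(V, s, r, s_next, alpha, gamma=1.0, terminal=False):
        # One TD(0) update: V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma*V(S_{t+1}) - V(S_t))
        v_s = V.get(s, 0.0)
        td_target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
        td_error = td_target - v_s          # the TD error
        V[s] = v_s + alpha * td_error
        return td_error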

    Let us use a simple example to look at the difference between the Monte Carlo method and the temporal difference method when solving the prediction problem.

    Suppose we have a reinforcement learning problem with two states A and B, an unknown model, and no policies or actions involved; only state transitions and immediate rewards are involved. There are eight complete state sequences, as follows (the number after each state is the immediate reward):

    ① A,0,B,0 ②B,1 ③B,1 ④ B,1 ⑤ B,1 ⑥B,1 ⑦B,1 ⑧B,0

    Only the first sequence contains a state transition; the remaining seven each contain only a single state. Set the discount factor γ = 1.

    First we solve the prediction problem with the Monte Carlo method. Since A appears only in the first sequence, its value is computed from that sequence alone, which is equivalent to computing the return of state A in that sequence:

V(A) = G(A) = R_A + \gamma R_B = 0

    For B, we need to average its returns over the eight sequences, which gives 6/8.

    Now let us look at how the temporal difference method solves it. When computing the return of a state in a sequence, the estimated value of the subsequent state is used. For B, it is always the terminal state and has no subsequent state, so its value is still the average of its returns over the 8 sequences, which is 6/8.

    For A, which appears only in the first sequence, its value is:

V(A) = R_A + \gamma V(B) = 6/8
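
    To check these numbers in code, here is a small sketch, assuming the eight episodes are encoded as lists of (state, reward) pairs with γ = 1; the TD values are approximated by repeatedly sweeping the same batch of episodes with a small step size:

    from collections import defaultdict

    gamma = 1.0
    # episode ① A,0,B,0 plus the seven single-state B episodes
    episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

    # Monte Carlo: average the observed returns of each state
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        for state, reward in reversed(ep):
            G = reward + gamma * G
            returns[state].append(G)
    V_mc = {s: sum(g) / len(g) for s, g in returns.items()}   # {'A': 0.0, 'B': 0.75}

    # Batch TD(0): repeatedly sweep the same eight episodes with a small step size
    V_td = defaultdict(float)
    for _ in range(2000):
        for ep in episodes:
            for i, (s, r) in enumerate(ep):
                v_next = V_td[ep[i + 1][0]] if i + 1 < len(ep) else 0.0
                V_td[s] += 0.01 * (r + gamma * v_next - V_td[s])
    # V_td ends up near {'A': 0.75, 'B': 0.75}

    Both approaches agree that V(B) = 6/8, while V(A) is 0 under Monte Carlo and 6/8 under temporal difference.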

    From this example we can see the differences between the Monte Carlo method and the temporal difference method when solving the prediction problem.

    First, the temporal difference method can learn before the final outcome is known; it can learn without a final outcome, and it can learn in a continuing environment, whereas the Monte Carlo method must wait until the final result is available. The temporal difference method can therefore update state value estimates faster and more flexibly, which is of great practical importance in some settings.

    Second, when updating a state's value the temporal difference method uses the TD target, which replaces the return that the current state might obtain at the end of the sequence with the immediate reward plus the estimated value of the next state; it is a biased estimate of the current state's value. The Monte Carlo method updates the state's value with the actual return, which is an unbiased estimate of the state's value under the given policy. On this point the Monte Carlo method has the advantage.

    Third, although the value obtained by the temporal difference method is a biased estimate, its variance is lower than that of the Monte Carlo method; it is sensitive to initial values, but it is usually more efficient than the Monte Carlo method.

    As the discussion above shows, the advantages of the temporal difference method are substantial, so the mainstream methods for solving reinforcement learning problems are now based on temporal difference. Later articles will mainly build on the temporal difference method as well.

3. n-Step Temporal Difference

    In the temporal difference method of Section 2, we used R_{t+1} + γV(S_{t+1}) to approximate the return G_t. That is, we looked one step ahead to approximate the return. Can we look two steps ahead? Certainly; in that case the approximation of the return G_t is:

G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2V(S_{t+2})

    Going from two steps to three steps and then to n steps, we can write the n-step temporal difference return G_t^{(n)} as:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^nV(S_{t+n})
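
    A small sketch of this quantity, assuming rewards[k] holds R_{k+1} and values[k] holds the current estimate V(S_k) for one recorded episode (these names are illustrative):

    def n_step_return(rewards, values, t, n, gamma=1.0):
        # G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
        T = len(rewards)                       # episode length
        G, discount = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            G += discount * rewards[k]         # rewards[k] is R_{k+1}
            discount *= gamma
        if t + n < T:                          # bootstrap only if the episode has not ended
            G += discount * values[t + n]
        return G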

    As n increases and tends to infinity, that is, as it tends to using the complete state sequence, n-step temporal difference becomes equivalent to the Monte Carlo method.

    For n-step temporal difference, the difference from ordinary temporal difference lies in how the return is computed. Now that the step count n is a parameter, which n is best, and how do we measure how good a choice is? We discuss this in the next section.

4. TD(λ)

    The number of steps n used as the parameter of n-step temporal difference is a hyperparameter that needs to be tuned. To take all step counts into account without increasing the computational complexity, we introduce a new parameter λ in [0,1] and define the λ-return as the weighted sum of all the n-step returns for n from 1 to ∞, where the n-step return has weight (1-λ)λ^(n-1). The λ-return is then expressed as:

G_t^{\lambda} = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
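
    As a sketch of this weighted sum for a finite episode, reusing the n_step_return helper above; the weight that remains when the terminal state is reached is given to the full Monte Carlo return, consistent with the weighting discussed below:

    def lambda_return(rewards, values, t, lam, gamma=1.0):
        # G_t^lambda = (1 - lam) * sum_n lam^(n-1) * G_t^(n); the tail weight
        # lam^(T-t-1) is assigned to the final (non-bootstrapped) return.
        T = len(rewards)
        G_lam, weight = 0.0, 1.0 - lam
        for n in range(1, T - t):              # n-step returns that still bootstrap
            G_lam += weight * n_step_return(rewards, values, t, n, gamma)
            weight *= lam
        G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
        return G_lam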

    We can then obtain the iterative formulas for the value functions of TD(λ):

V(S_t) = V(S_t) + \alpha(G_t^{\lambda} - V(S_t) )

Q(S_t, A_t) = Q(S_t, A_t) +\alpha(G_t^{\lambda}- Q(S_t, A_t) )

    Why is the weight of each n-step return defined as (1-λ)λ^(n-1)? As n increases, the weight of the n-step return decays geometrically (see the figure in the original post). When the terminal state is reached at time T, all the remaining unallocated weight is given to the return of the terminal state. In this way, the weights of all the n-step returns of a complete state sequence sum to 1, and returns farther from the current state receive smaller weights.

    Viewing TD(λ) from the forward view, the value V(S_t) of a state is obtained from G_t^λ, and G_t^λ is computed indirectly from the values of all subsequent states, so updating one state's value can be regarded as requiring the values of all subsequent states. In other words, one must go through the complete state sequence, including the immediate reward at the terminal state, to update the current state's value. This requirement is the same as the Monte Carlo method's, so forward-view TD(λ) has the same drawback as the Monte Carlo method. When λ = 0 it reduces to the ordinary temporal difference method of Section 2, and when λ = 1 it is the Monte Carlo method.

    Viewing TD(λ) from the backward view, we can analyze how a state influences subsequent states. For example, suppose a mouse receives an electric shock after hearing three bells in succession and then seeing one light signal. When analyzing the cause of the shock, is the bell or the light the more important factor? If we attribute the shock to the large number of bells heard beforehand, this is called the frequency heuristic; if we attribute it to the few most recent states, this is called the recency heuristic.

    If we introduce a value for each state, its eligibility (E), to represent that state's influence on subsequent states, both heuristics can be used at the same time. The eligibility values of all states are collectively called eligibility traces (ES), defined as:

E_0(s) = 0

E_t(s) = \gamma\lambda E_{t-1}(s) +1(S_t=s) = \begin{cases} 0& {t<k}\\ (\gamma\lambda)^{t-k}& {t\geq k} \end{cases}, \;\;s.t.\; \lambda,\gamma \in [0,1], s\; is\; visited \;once\;at\; time\; k

    The value function update equations of TD(λ) can then be expressed as:

\delta_t = R_{t+1} + \gamma v(S_{t+1}) -V(S_t)

V(S_t) = V(S_t) + \alpha\delta_tE_t(s)
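
    A minimal sketch of this backward-view update, assuming the value function V and the eligibility traces E are Python dicts over states and that transitions are fed in one at a time as (s, r, s_next, terminal); the function name is illustrative:

    def td_lambda_step(V, E, s, r, s_next, terminal, alpha=0.1, gamma=1.0, lam=0.8):
        E[s] = E.get(s, 0.0) + 1.0                   # accumulate the trace of the visited state
        delta = r + (0.0 if terminal else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
        for state in list(E):                        # every visited state moves in proportion to its trace
            V[state] = V.get(state, 0.0) + alpha * delta * E[state]
            E[state] *= gamma * lam                  # decay all traces
        return delta

    The trace dict E is reset to empty at the start of each episode.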

     Some readers may ask: the forward and backward update equations look different, so don't they follow different logic? In fact the two are equivalent. Let us derive the backward update formula from the forward view.

\begin{align} G_t^{\lambda} - V(S_t) &= - V(S_t) + (1-\lambda)\lambda^{0}(R_{t+1} + \gamma V(S_{t+1})) \\ &+ (1-\lambda)\lambda^{1}(R_{t+1} + \gamma R_{t+2} + \gamma^2V(S_{t+2})) \\ &+ (1-\lambda)\lambda^{2}(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3V(S_{t+3})) \\ &+... \\& = - V(S_t) + (\gamma\lambda)^0(R_{t+1} + \gamma V(S_{t+1}) - \gamma\lambda V(S_{t+1}) ) \\ & + (\gamma\lambda)^1(R_{t+2} + \gamma V(S_{t+2}) - \gamma\lambda V(S_{t+2}) ) \\ & + (\gamma\lambda)^2(R_{t+3} + \gamma V(S_{t+3}) - \gamma\lambda V(S_{t+3}) ) \\ &+... \\ & = (\gamma\lambda)^0(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \\ & + (\gamma\lambda)^1(R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})) \\ & + (\gamma\lambda)^2(R_{t+3} + \gamma V(S_{t+3}) - V(S_{t+2})) \\ & + ... \\ & = \delta_t + \gamma\lambda \delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + ... \end{align}

    As we can see, the forward-view TD error and the backward-view TD error are in fact consistent.

5. Solving the Control Problem with Temporal Difference

    Now let us return to ordinary temporal difference and look at how it solves the control problem. Recall the Monte Carlo on-policy control method from the previous article: we used the ε-greedy method for value iteration. For temporal difference we can also use the ε-greedy method for value iteration; the main difference from Monte Carlo on-policy control lies only in how the return is computed. The most common on-policy temporal difference control algorithm is SARSA.

    Besides on-policy control, we can also do off-policy control. The main difference between the two is that on-policy control generally uses a single policy (most commonly ε-greedy), whereas off-policy control generally uses two policies: one policy (most commonly ε-greedy) to select new actions, and another policy (most commonly the greedy policy) to update the value function. The most common off-policy temporal difference control algorithm is Q-Learning, which we cover separately in the next article.

5.1. Introduction to the SARSA Algorithm

    SARSA is a method that uses temporal difference to solve the reinforcement learning control problem. Recall that the control problem here can be stated as: given the five elements of reinforcement learning (the state set S, the action set A, the immediate reward R, the discount factor γ, and the exploration rate ε), find the optimal action-value function q* and the optimal policy π*.

    This class of reinforcement learning problems does not require a state transition model of the environment; it is a model-free approach. Its control problem is solved, as with the Monte Carlo method, by value iteration: the value function is updated, the updated value function is used to update the current policy, the new policy generates new states and immediate rewards, which in turn update the value function, and so on until both the value function and the policy converge.

    Our SARSA algorithm belongs to the on-policy class: a single policy is used both to update the value function and to select new actions, and that policy is ε-greedy. In Systematic Machine Learning: Reinforcement Learning (4) - Solving Markov Decision Process Policies with MC we covered the ε-greedy method in detail: with a small ε, we greedily choose the action currently believed to have the largest action value with probability 1-ε, and choose uniformly at random among all m available actions with probability ε. In formula form:

\pi(a|s)= \begin{cases} \epsilon/m + 1- \epsilon & {if\; a^{*} = \arg\max_{a \in A}Q(s,a)}\\ \epsilon/m & {else} \end{cases}
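
    A small sketch of this selection rule, assuming a tabular Q stored in a dict keyed by (state, action) pairs (the helper name is illustrative and is reused in the sketches below):

    import numpy as np

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # with probability epsilon pick uniformly at random, otherwise pick a greedy action;
        # a greedy action therefore has total probability epsilon/m + 1 - epsilon
        if np.random.rand() < epsilon:
            return np.random.choice(actions)
        q_vals = [Q.get((state, a), 0.0) for a in actions]
        best = max(q_vals)
        return np.random.choice([a for a, q in zip(actions, q_vals) if q == best])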

5.2. Overview of the SARSA Algorithm

    The name SARSA itself comes from the letters S, A, R, S, A, where S, A, and R stand for State, Action, and Reward, the same symbols we have been using throughout. This flow is illustrated in the figure in the original post.

    During an iteration, we first choose an action A in the current state S using the ε-greedy method; the system then moves to a new state S′ and gives us an immediate reward R. In the new state S′ we again use the ε-greedy method to choose an action A′, but note that we do not execute A′ at this point; it is only used to update our value function. The value function update formula is:

Q(S,A) = Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))

    Here γ is the discount factor and α is the iteration step size. The main difference from the iterative formula of Monte Carlo on-policy control is the expression for the return G_t: for temporal difference, the return G_t is R + γQ(S′,A′).

    Apart from the different expression for the return G_t, the SARSA algorithm is essentially similar to the Monte Carlo on-policy control algorithm.

5.3. SARSA Algorithm Flow

    Below we summarize the flow of the SARSA algorithm.

    Algorithm input: number of iterations T, state set S, action set A, step size α, discount factor γ, exploration rate ε

    Output: the value Q of every state-action pair

    1. Randomly initialize the value Q of all state-action pairs; initialize the Q value of terminal states to 0.

    2. for i from 1 to T, iterate:

      a) Initialize S as the first state of the current episode. Set A to the action chosen by the ε-greedy method in the current state S.

      b) Execute the current action A in state S, obtaining the new state S′ and the reward R

      c) Use the ε-greedy method to choose a new action A′ in state S′

      d) Update the value function Q(S,A):

Q(S,A) = Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A))

      e) S=S′,A=A′

      f) If S′ is a terminal state, the current episode is finished; otherwise go to step b)

    One thing to note here is that the step size α generally needs to decrease gradually as the iterations proceed in order to guarantee that the action-value function Q converges. When Q converges, our ε-greedy policy converges as well.
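
    Putting the flow above into code, here is a sketch of a tabular SARSA loop, assuming a minimal environment object with env.reset() returning a (hashable) state and env.step(a) returning (next_state, reward, done), and reusing the epsilon_greedy helper sketched earlier:

    from collections import defaultdict

    def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)                     # Q[(state, action)], zero-initialized
        for _ in range(num_episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, actions, epsilon)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
                s, a = s_next, a_next
        return Q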

5.4. SARSA Example: Windy GridWorld

    Below we use the well-known Windy GridWorld example to study the SARSA algorithm.

    As shown in the figure in the original post, we have a 10×7 rectangular grid world with a marked start position S and a terminal goal position G. The numbers below the grid indicate the wind strength of the corresponding columns. When the agent enters a cell in such a column, it is automatically moved the indicated number of cells in the direction of the arrows, simulating the effect of wind. The grid world has boundaries, and at any time the agent can only occupy a cell inside the world. The agent knows nothing about the structure of the world or the wind: it does not know the grid is rectangular, where the boundaries are, or the relative position of the next cell to the previous one after it moves, nor does it know the exact locations of the start and goal positions. However, the agent remembers the cells it has visited, and the next time it enters such a cell it can recognize exactly when it was there before. The available actions are moving one step up, down, left, or right. Every step that does not enter the goal position incurs a reward of -1; upon entering the goal position the agent receives a reward of 0 and stays there permanently. The problem to solve is: what policy should the agent follow to get from the start position to the goal position as quickly as possible?

    The logic is not complicated; the complete code is on my GitHub. Here we look at the key parts of the code.

    Step 2a of the algorithm, initializing S and choosing an action in the current state S with the ε-greedy method:


    # initialize state
    state = START

    # choose an action based on epsilon-greedy algorithm
    if np.random.binomial(1, EPSILON) == 1:
        action = np.random.choice(ACTIONS)
    else:
        values_ = q_value[state[0], state[1], :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])


    Step 2b of the algorithm, executing the current action A in state S to obtain the new state S′; since the reward is -1 except at termination, it does not need to be computed separately:


def step(state, action):
    # apply the chosen move while the wind in the current column pushes the
    # agent upward by WIND[j] cells; clamp the result to stay inside the grid
    i, j = state
    if action == ACTION_UP:
        return [max(i - 1 - WIND[j], 0), j]
    elif action == ACTION_DOWN:
        return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]
    elif action == ACTION_LEFT:
        return [max(i - WIND[j], 0), max(j - 1, 0)]
    elif action == ACTION_RIGHT:
        return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]
    else:
        assert False


    Step 2c of the algorithm, using the ε-greedy method to choose the new action A′ in state S′:


        next_state = step(state, action)
        if np.random.binomial(1, EPSILON) == 1:
            next_action = np.random.choice(ACTIONS)
        else:
            values_ = q_value[next_state[0], next_state[1], :]
            next_action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])


    Steps 2d and 2e of the algorithm, updating the value function Q(S,A) and moving on to the next state and action:


        # Sarsa update
        q_value[state[0], state[1], action] += \
            ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -
                     q_value[state[0], state[1], action])
        state = next_state
        action = next_action


    The code is simple. If you follow along with the algorithm and run the code, you can easily obtain the optimal solution to this problem and thereby understand the whole flow of the SARSA algorithm.
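
    For reference, here is a sketch of how the fragments above might fit together into a single episode function. The constant values below are assumptions taken from the standard Windy GridWorld setup and may differ from the original repository:

    import numpy as np

    WORLD_HEIGHT, WORLD_WIDTH = 7, 10
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]          # assumed wind strength per column
    ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT = 0, 1, 2, 3
    ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]
    START, GOAL = [3, 0], [3, 7]                   # assumed start and goal cells
    EPSILON, ALPHA, REWARD = 0.1, 0.5, -1.0        # gamma = 1, so it is omitted in the update

    def choose_action(state, q_value):
        # epsilon-greedy selection, as in the excerpts above
        if np.random.binomial(1, EPSILON) == 1:
            return np.random.choice(ACTIONS)
        values_ = q_value[state[0], state[1], :]
        return np.random.choice([a for a, v in enumerate(values_) if v == np.max(values_)])

    def episode(q_value):
        # run one SARSA episode and return how many time steps it took
        time = 0
        state = START
        action = choose_action(state, q_value)
        while state != GOAL:
            next_state = step(state, action)
            next_action = choose_action(next_state, q_value)
            q_value[state[0], state[1], action] += ALPHA * (
                REWARD + q_value[next_state[0], next_state[1], next_action] -
                q_value[state[0], state[1], action])
            state, action = next_state, next_action
            time += 1
        return time

    # q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, len(ACTIONS)))
    # for _ in range(500):
    #     episode(q_value)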

5.5. SARSA(λ)

    In the sections on TD(λ) above we described the value function iteration of multi-step temporal difference TD(λ). Correspondingly, the multi-step on-policy temporal difference control algorithm is SARSA(λ).

    TD(λ) has a forward view and a backward view of value function iteration, and of course they are equivalent. For solving the control problem, SARSA(λ) based on the backward view can learn effectively online, and the data can be discarded as soon as it has been learned from. Therefore SARSA(λ) is by default based on the backward view for value function iteration.

    Above we gave the backward-view iteration of the TD(λ) state-value function, namely:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) -V(S_t)

V(S_t) = V(S_t) + \alpha\delta_tE_t(S)

    The corresponding iteration formulas for the action-value function can be written analogously:

\delta_t = R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) -Q(S_t, A_t)

Q(S_t, A_t) = Q(S_t, A_t) + \alpha\delta_tE_t(S,A)

    Apart from the update of the action-value function Q(S,A), the multi-step parameter λ, and the eligibility trace E(S,A) introduced by the backward view, the rest of the algorithm follows the same ideas as SARSA. Here we summarize the flow of SARSA(λ).

    Algorithm input: number of iterations T, state set S, action set A, step size α, discount factor γ, exploration rate ε, multi-step parameter λ

    Output: the value Q of every state-action pair

    1. Randomly initialize the value Q of all state-action pairs; initialize the Q value of terminal states to 0.

    2. for i from 1 to T, iterate:

      a) Initialize the eligibility trace E of all state-action pairs to 0. Initialize S as the first state of the current episode. Set A to the action chosen by the ε-greedy method in the current state S.

      b) Execute the current action A in state S, obtaining the new state S′ and the reward R

      c) Use the ε-greedy method to choose a new action A′ in state S′

      d) Update the eligibility trace function E(S,A) and the TD error δ:

E(S,A) = E(S,A)+1

\delta = R + \gamma Q(S',A') - Q(S,A)

      e) For every state s and corresponding action a that has appeared in the current episode, update the value function Q(s,a) and the eligibility trace function E(s,a):

Q(s,a) = Q(s,a) + \alpha\delta E(s,a)

E(s,a) = \gamma\lambda E(s,a)

      f) S=S′,A=A′

      g) If S′ is a terminal state, the current episode is finished; otherwise go to step b)

      As with SARSA, the step size α generally needs to decrease gradually as the iterations proceed to guarantee that the action-value function Q converges.
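
    Here is a sketch of one episode of backward-view SARSA(λ) with accumulating eligibility traces, reusing the epsilon_greedy helper and the minimal env.reset()/env.step() interface assumed earlier; Q is expected to be a defaultdict(float) keyed by (state, action):

    from collections import defaultdict

    def sarsa_lambda_episode(env, Q, actions, alpha=0.1, gamma=1.0, lam=0.8, epsilon=0.1):
        E = defaultdict(float)                      # eligibility traces, reset every episode
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            E[(s, a)] += 1.0                        # accumulate the trace of the visited pair
            for key in list(E):                     # update every visited (s, a) in proportion to its trace
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam               # decay all traces
            s, a = s_next, a_next
        return Q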

6. SARSA Summary

    Compared with dynamic programming, SARSA does not require a state transition model of the environment; compared with the Monte Carlo method, it does not require complete state sequences, so it is more flexible. It is widely used among traditional reinforcement learning methods.

    However, SARSA also shares a problem common to traditional reinforcement learning methods: it cannot handle problems that are too complex. In the SARSA algorithm, the values Q(S,A) are stored in a large table; if the numbers of states and actions reach the millions or tens of millions, the memory needed to hold this table becomes enormous and may even overflow, so SARSA is not well suited to solving large-scale problems. Of course, for problems that are not especially complex, SARSA is still a very good method for solving reinforcement learning problems.

    Next we discuss SARSA's sibling algorithm, the off-policy temporal difference control algorithm Q-Learning.

7. Temporal Difference Summary

    Temporal difference is more flexible and has a stronger learning capability than the Monte Carlo method, so it is the current mainstream approach to solving reinforcement learning problems; even most of today's deep reinforcement learning methods are based on the idea of temporal difference. We will therefore focus on it in the articles that follow.

8. Conclusion

    First, the Markov decision process discussed here is nondeterministic: the reward function and the action transition function are probabilistic, so after taking an action in state s, the next state s′ is also determined probabilistically. Second, Q-learning is an important concept in reinforcement learning; its essence is to convert the state-related V(s) into the state-action-related Q. The final chapter of Tom Mitchell's "Machine Learning" is highly recommended; it covers Q-learning and related material. Finally, regarding the Bellman equation mentioned here, the Bellman-Ford dynamic programming algorithm in "Introduction to Algorithms" can be used to find shortest paths in graphs with negative edge weights; what is most worth studying there is its convergence proof, which is very valuable. Some researchers have carefully analyzed the relationship between reinforcement learning and dynamic programming.


Origin: blog.csdn.net/App_12062011/article/details/92082148