Reinforcement Learning (4): Monte Carlo Methods

Monte Carlo Methods

MC methods do not require complete knowledge of the environment; they only need the ability to sample from it. MC methods are based on averaging sample returns. Typically, MC methods are applied to episodic tasks.

Monte Carlo Prediction

Two variants: first-visit MC estimates \(v_\pi(s)\) as the average of the returns following the first visit to s in each episode, while every-visit MC averages the returns following every visit to s.

# First-visit MC prediction, for estimating V = v_pi

Input: a policy pi to be evaluated
Initialize:
    V(s), arbitrarily, for all s in S
    Returns(s) = list(), for all s in S
    
While True:
    Generate an episode following pi: S0,A0,R1,S1,A1,R2,...,ST-1,AT-1,RT
    G = 0
    for t in range(T-1, -1, -1):              # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        if St not in S0, S1, ..., S_{t-1}:    # first visit of St in this episode
            Returns(St).append(G)
            V(St) = mean(Returns(St))
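
Below is a minimal runnable Python sketch of first-visit MC prediction, assuming each episode is already available as a list of (S_t, R_{t+1}) pairs; the function name and input format are assumptions for illustration, not part of the pseudocode above.

# First-visit MC prediction as a runnable Python sketch (hypothetical input format)
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    # episodes: list of trajectories, each a list of (S_t, R_{t+1}) pairs
    returns = defaultdict(list)   # Returns(s)
    V = defaultdict(float)        # V(s)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # t = T-1, ..., 0
            s, r = episode[t]
            G = gamma * G + r                       # G = gamma*G + R_{t+1}
            if s not in states[:t]:                 # first visit of s in this episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V

Replacing the first-visit test with an unconditional append/update would give every-visit MC.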

Monte Carlo Estimation of Action Values

Exploring starts: every state-action pair (s, a) has a nonzero probability of being selected as the start of an episode.

# Monte Carlo ES(Exploring Starts), for estimating pi = pi*

# Initialize:
pi(s) for all s in S
Q(s,a) for all s in S,a in A(s)
Returns(s,a) = list() for all s in S, a in A(s)
While True:
    Choose S0 in S and A0 in A(S0) randomly, such that every pair (s,a) has probability > 0  # exploring starts
    Generate an episode from S0, A0, following pi: S0,A0,R1,...,ST-1,AT-1,RT
    G = 0
    for t in range(T-1, -1, -1):              # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        if (St,At) not in (S0,A0), (S1,A1), ..., (S_{t-1},A_{t-1}):  # first visit of (St,At)
            Returns(St,At).append(G)
            Q(St,At) = mean(Returns(St,At))
            pi(St) = argmax_a Q(St,a)         # greedy policy improvement
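
A sketch of the per-episode update of Monte Carlo ES in Python, assuming an episode generated with an exploring start is already given as a list of (S_t, A_t, R_{t+1}) triples; the function name and container layout are assumptions for illustration.

# One Monte Carlo ES update over a single episode (hypothetical data layout)
def mc_es_update(episode, Q, returns, policy, actions, gamma=1.0):
    # episode: list of (S_t, A_t, R_{t+1}); Q and returns: dicts keyed by (s, a);
    # policy: dict s -> greedy action; actions: dict s -> list of available actions
    visited = [(s, a) for s, a, _ in episode]
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):       # t = T-1, ..., 0
        s, a, r = episode[t]
        G = gamma * G + r
        if (s, a) not in visited[:t]:               # first visit of (s, a)
            returns.setdefault((s, a), []).append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            policy[s] = max(actions[s], key=lambda x: Q.get((s, x), 0.0))
    return Q, policy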

Monte Carlo Control

MC control follows the pattern of generalized policy iteration (GPI), alternating policy evaluation (E) and policy improvement (I):
\[ \pi_0\xrightarrow{\quad E\quad}q_{\pi_0}\xrightarrow{\quad I\quad}\pi_1\xrightarrow{\quad E\quad}\cdots\xrightarrow{\quad I\quad}\pi_*\xrightarrow{\quad E\quad}q_* \]

Monte Carlo Control without Exploring Starts

The exploring-starts assumption often fails to hold in practice, so it is not broadly applicable. To drop this somewhat unrealistic premise, we must instead ensure that every action can be selected infinitely often. There are two ways to guarantee this: on-policy and off-policy methods. On-policy methods evaluate and improve the same policy that is used to make decisions, whereas off-policy methods evaluate and improve a policy different from the one used to generate the data. On-policy methods are generally simpler and are usually tried first; off-policy methods, because a second, different policy is involved, require extra machinery, and compared with on-policy methods they tend to have higher variance and converge more slowly.

On-policy methods typically use a soft policy, i.e. \(\pi(a|s)>0\ \ \forall s \in S,\ a\in A(s)\), which is gradually shifted toward a deterministic optimal policy.

# on-policy first-visit MC control (for epsilon-soft policies), estimating pi = pi*

# Initialize:
pi = an arbitrary epsilon-soft policy
Q(s,a) arbitrarily for all s in S, a in A(s)
Returns(s,a) = list() for all s in S, a in A(s)

while True:
    Generate an episode following pi: S0,A0,R1,...,ST-1,AT-1,RT
    G = 0
    for t in range(T-1, -1, -1):              # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        if (St,At) not in (S0,A0), (S1,A1), ..., (S_{t-1},A_{t-1}):  # first visit of (St,At)
            Returns(St,At).append(G)
            Q(St,At) = mean(Returns(St,At))
            A* = argmax_a Q(St,a)
            for a in A(St):                   # epsilon-greedy policy improvement
                if a == A*:
                    pi(a|St) = 1 - epsilon + epsilon/|A(St)|
                else:
                    pi(a|St) = epsilon/|A(St)|
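
The epsilon-greedy improvement step above, written as a small Python helper; the function name and dictionary layout are assumptions for illustration.

# Epsilon-greedy action probabilities for one state (illustrative helper)
def epsilon_greedy_probs(Q, s, actions, epsilon):
    # actions: list of actions available in state s; Q: dict keyed by (s, a)
    best = max(actions, key=lambda a: Q.get((s, a), 0.0))
    probs = {a: epsilon / len(actions) for a in actions}
    probs[best] += 1.0 - epsilon      # greedy action gets 1 - epsilon + epsilon/|A(s)|
    return probs                      # pi(.|s)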

Off-policy Prediction via Importance Sampling

All learning control methods face the exploration-exploitation dilemma: on one hand, to learn the value of each action, all subsequent behavior should be optimal; on the other hand, to find the optimal actions one has to try all kinds of actions, which contradicts always choosing the optimal one. Off-policy methods use two policies at once: one that is being learned toward the optimal policy, called the target policy, and another that generates the data (behavior), called the behavior policy.

Off-policy learning rests on one assumption: every action that can occur under the target policy must also be able to occur under the behavior policy, i.e. if \(\pi(a|s)>0\) then \(b(a|s)>0\). This is called the assumption of coverage.

Importance Sampling

Importance sampling (IS) is a general technique for estimating expected values under one distribution using samples drawn from another. Applied to off-policy learning, it weights the returns by the relative probability of the sampled trajectory under the target and behavior policies, the importance-sampling ratio (IS ratio).

Under the target policy, the probability of the trajectory starting at \(S_t\) is
\[ \begin{aligned} \Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} &= \pi(A_t|S_t)\,p(S_{t+1}|S_t,A_t)\,\pi(A_{t+1}|S_{t+1})\cdots p(S_T|S_{T-1},A_{T-1}) \\ &= \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k) \end{aligned} \]
where \(p\) is the state-transition probability.

The IS ratio is then
\[ \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)} \]
Note that the IS ratio ultimately depends only on the two policies and the sampled sequence, not on the MDP dynamics (the state-transition probabilities cancel).
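
As a minimal sketch, the ratio can be computed directly from a sampled trajectory; the function name and the probability callbacks pi_prob and b_prob are assumptions for illustration.

# Importance-sampling ratio rho_{t:T-1} for one trajectory (illustrative)
def importance_ratio(trajectory, pi_prob, b_prob, t=0):
    # trajectory: list of (S_k, A_k) pairs for k = 0, ..., T-1
    # pi_prob(a, s), b_prob(a, s): action probabilities under the target / behavior policy
    rho = 1.0
    for s, a in trajectory[t:]:
        rho *= pi_prob(a, s) / b_prob(a, s)
    return rho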

What we want to estimate is the expected return under the target policy, but the returns we can actually observe are generated by the behavior policy, so their plain expectation is the wrong quantity: \(E[G_t|S_t = s] = v_b(s)\). This is where the IS ratio comes in: \(E[\rho_{t:T-1}G_t|S_t = s] = v_{\pi}(s)\).

The value function can then be estimated as
\[ V(s) \doteq \frac{\sum_{t\in J(s)}\rho_{t:T(t)-1}G_t}{|J(s)|} \]
where \(J(s)\) is the set of all time steps at which state \(s\) is visited, and \(T(t)\) is the termination time of the episode containing step \(t\). This is called ordinary IS; the alternative is weighted IS:
\[ V(s) \doteq \frac{\sum_{t\in J(s)}\rho_{t:T(t)-1}G_t}{\sum_{t\in J(s)}\rho_{t:T(t)-1}} \]
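
The two estimators side by side as a short Python sketch, assuming the per-visit ratios and returns for a fixed state have already been collected into parallel lists (an assumed data layout):

# Ordinary vs. weighted importance-sampling estimates of V(s) (illustrative)
def ordinary_is(rhos, returns):
    # rhos[i], returns[i]: rho_{t:T(t)-1} and G_t for the i-th visit to s
    return sum(r * g for r, g in zip(rhos, returns)) / len(returns)

def weighted_is(rhos, returns):
    denom = sum(rhos)
    return sum(r * g for r, g in zip(rhos, returns)) / denom if denom != 0 else 0.0

Ordinary IS is unbiased but can have very high (even infinite) variance; weighted IS is biased (the bias vanishes asymptotically) but typically has far lower variance, so it is usually preferred in practice.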

Incremental Implementation

\[ V_{n+1} = \frac{\sum_{k=1}^{n}W_kG_k}{\sum_{k=1}^n W_k} = V_n + \frac{W_n}{C_n}\bigl[G_n - V_n\bigr], \quad n \ge 1, \qquad \text{where } C_{n+1} = C_n + W_{n+1},\ C_0 = 0 \]
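
A quick numeric check (with made-up weights and returns) that the incremental rule reproduces the direct weighted average:

# Verify V_{n+1} = V_n + (W_n/C_n)(G_n - V_n) against the direct weighted average
W = [1.0, 0.5, 2.0, 0.25]       # made-up importance-sampling weights
G = [3.0, 1.0, 4.0, 2.0]        # made-up returns

V, C = 0.0, 0.0
for w, g in zip(W, G):
    C += w                      # C_n = C_{n-1} + W_n
    V += (w / C) * (g - V)      # incremental weighted-average update

direct = sum(w * g for w, g in zip(W, G)) / sum(W)
assert abs(V - direct) < 1e-9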

# off-policy MC prediction (policy evaluation) for estimating Q = q_pi

# Initialize, for all s in S, a in A(s):
Q(s,a) arbitrarily
C(s,a) = 0
while True:
    b = any policy with coverage of pi
    Generate an episode following b: S0,A0,R1,...,ST-1,AT-1,RT
    G = 0
    W = 1
    for t in range(T-1, -1, -1):              # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        C(St,At) = C(St,At) + W
        Q(St,At) = Q(St,At) + W/C(St,At) * [G - Q(St,At)]
        W = W * pi(At|St)/b(At|St)            # update the importance-sampling weight
        if W == 0:                            # once W hits 0, later updates have no effect
            break

# off-policy MC control for estimating pi=pi*

# Initialize, for all s in S, a in A(s):
Q(s,a) arbitrarily
C(s,a) = 0
pi(s) = argmax_a Q(s,a) (with ties broken consistently)
while True:
    b = any soft policy
    Generate an episode using b: S0,A0,R1,...,ST-1,AT-1,RT
    G = 0
    W = 1
    for t in range(T-1, -1, -1):              # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        C(St,At) = C(St,At) + W
        Q(St,At) = Q(St,At) + W/C(St,At) * [G - Q(St,At)]
        pi(St) = argmax_a Q(St,a) (with ties broken consistently)
        if At != pi(St):                      # the tail of the episode no longer follows pi
            break
        else:
            W = W * 1/b(At|St)                # pi is greedy (deterministic), so pi(At|St) = 1

Reposted from www.cnblogs.com/vpegasus/p/mc.html