Chapter 5 Monte Carlo Methods

Unlike the previous chapters, here we do not assume complete knowledge of the environment.

   A perfect model is not required; experience alone is enough, organized into episodes. An episode is a complete sequence of states, actions, and rewards from start to termination. The defining feature of Monte Carlo methods is that they use the whole sequence: the return can only be computed after an episode has finished.
   Monte Carlo methods therefore update episode-by-episode, not step-by-step (online).
   Monte Carlo methods here are based on averaging complete returns. They must also deal with nonstationarity, since the return from a state depends on actions chosen later under a policy that is itself changing.

5.1 Monte Carlo Prediction

First consider Monte Carlo methods for learning the state-value function under a given policy, the same setting as Policy Evaluation (Prediction). The underlying principle is the law of large numbers, which is the foundation of all Monte Carlo methods.

  • First-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following first visits to s.
    • To evaluate state s
    • The first time-step t that state s is visited in an episode,
    • Increment counter $N(s) \leftarrow N(s) + 1$
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by mean return $V(s) = S(s)/N(s)$
    • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
  • Every-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following every visit to s.
    • To evaluate state s
    • Every time-step t that state s is visited in an episode,
    • Increment counter $N(s) \leftarrow N(s) + 1$
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by mean return $V(s) = S(s)/N(s)$
    • Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

Here, a "visit" to s means one occurrence of state s within an episode.

First-visit MC prediction
Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits to s goes to infinity.
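
Below is a minimal Python sketch of first-visit MC prediction. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions for illustration, not anything defined in the text.

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging the returns that follow first visits to s."""
    returns_sum = defaultdict(float)    # S(s): sum of first-visit returns
    returns_count = defaultdict(int)    # N(s): number of first visits
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one complete episode under the policy: [(S_0, R_1), (S_1, R_2), ...]
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Index of the first visit to each state in this episode.
        first_visit = {}
        for i, (s, _) in enumerate(episode):
            first_visit.setdefault(s, i)

        # Walk backwards through the episode, accumulating the return G_t.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:               # only the first visit counts
                returns_count[s] += 1             # N(s) <- N(s) + 1
                returns_sum[s] += G               # S(s) <- S(s) + G_t
                V[s] = returns_sum[s] / returns_count[s]
    return V
```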

5.3 Monte Carlo Control

Policy improvement is done by making the policy greedy with respect to the current action-value function:

$$\pi(s) \doteq \arg\max_a q(s, a).$$

Applying the policy improvement theorem to $\pi_k$ and $\pi_{k+1}$:

$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\!\big(s, \arg\max_a q_{\pi_k}(s, a)\big) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) \ge v_{\pi_k}(s).$$

Monte Carlo ES
Exploring starts means that every state-action pair has a nonzero probability of being selected as the start of an episode, so that all pairs keep being visited.
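
A sketch of Monte Carlo ES under the same assumed environment interface as before, plus a hypothetical `env.reset_to(state)` helper used to force an arbitrary starting state; the greedy step is the argmax improvement from the policy improvement argument above.

```python
import random
from collections import defaultdict

def monte_carlo_es(env, states, actions, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts: every (state, action) pair
    has a nonzero probability of starting an episode."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: pick a random state-action pair to begin the episode.
        state = random.choice(states)
        start_action = random.choice(actions)
        env.reset_to(state)                 # hypothetical helper, not a real API

        episode, done, first_step = [], False, True
        while not done:
            action = start_action if first_step else policy[state]
            first_step = False
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit averaging of returns for each (state, action) pair.
        first_visit = {}
        for i, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), i)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_count[(s, a)] += 1
                returns_sum[(s, a)] += G
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Greedy policy improvement: pi(s) <- argmax_a Q(s, a)
                policy[s] = max(actions, key=lambda a_: Q[(s, a_)])
    return policy, Q
```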

5.4 Monte Carlo Control without Exploring Starts

There are two ways to avoid the need for exploring starts:

  • On-policy learning
    • "Learn on the job"
    • Learn about policy π from experience sampled from π
    • On-policy: the policy being updated is the same as the policy that generates the samples
  • Off-policy learning
    • "Look over someone's shoulder"
    • Learn about policy π from experience sampled from μ
    • Off-policy: the policy being updated is different from the policy that generates the samples

The definitions of on-policy and off-policy, and the relationship between them, are central to the approximation methods covered later.

On-policy first-visit MC control
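
A sketch of on-policy first-visit MC control using an ε-greedy policy, under the same assumed environment interface; the policy that generates the episodes is the same ε-greedy policy that is being improved, so no exploring starts are needed.

```python
import random
from collections import defaultdict

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick a greedy action with probability 1 - epsilon, otherwise a random one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def on_policy_first_visit_mc_control(env, actions, num_episodes,
                                     gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for _ in range(num_episodes):
        # Behavior and target are the same epsilon-greedy policy (on-policy).
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit averaging of returns for each (state, action) pair.
        first_visit = {}
        for i, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), i)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_count[(s, a)] += 1
                returns_sum[(s, a)] += G
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```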

5.5 Off-policy Prediction via Importance Sampling

Off-policy methods have higher variance and converge more slowly.

   The on-policy approach is actually a compromise: it learns action values not for the optimal policy itself but for a near-optimal policy that still explores.
   The off-policy approach is more direct: it uses two policies, one that is being learned and becomes the optimal policy, and another, more exploratory one, that is used to generate behavior.

   The policy being learned about is called the target policy, here $\pi$; the policy used to generate behavior is called the behavior policy, here $b$.
   In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.

Because the behavior policy must be more stochastic and more exploratory, it can, for example, be an ε-greedy policy.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

Given a starting state $S_t$, the probability of the subsequent state-action trajectory occurring under any policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \pi(A_t|S_t)\,p(S_{t+1}|S_t, A_t)\,\pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1}, A_{T-1}) = \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k, A_k),$$

Note the notion of a trajectory; Monte Carlo tree search will use this concept later.

The importance-sampling ratio is then

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}.$$

Applying the importance-sampling ratio: given only returns $G_t$ generated by the behavior policy, we want the expected return (value) under the target policy:

$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s).$$

In particular, we can define the set of all time steps in which state s is visited, denoted $\mathcal{J}(s)$, and let $T(t)$ denote the first time of termination following time $t$. This is for an every-visit method; for a first-visit method, $\mathcal{J}(s)$ would only include time steps that were first visits to s within their episodes.

Ordinary importance sampling:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{J}(s)|}$$

Weighted importance sampling:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1}}$$
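
A small sketch contrasting the two estimators. Each visit to s is assumed to be summarized by its return $G_t$ and the ratio $\rho_{t:T-1}$, which is the product of $\pi(A_k|S_k)/b(A_k|S_k)$ along the rest of the trajectory; all function and variable names are illustrative.

```python
def importance_ratio(trajectory, target_pi, behavior_b):
    """rho_{t:T-1}: product of pi(A_k|S_k) / b(A_k|S_k) over a list of (S_k, A_k)."""
    rho = 1.0
    for s, a in trajectory:
        rho *= target_pi(a, s) / behavior_b(a, s)
    return rho

def ordinary_is(samples):
    """V(s) = sum(rho * G) over visits, divided by the number of visits |J(s)|."""
    return sum(rho * g for g, rho in samples) / len(samples) if samples else 0.0

def weighted_is(samples):
    """V(s) = sum(rho * G) / sum(rho); defined as 0 when the weights sum to 0."""
    denom = sum(rho for _, rho in samples)
    return sum(rho * g for g, rho in samples) / denom if denom else 0.0

# Three visits to s, each summarized as (G_t, rho_{t:T-1}).
samples = [(1.0, 2.0), (0.0, 0.5), (1.0, 0.0)]
print(ordinary_is(samples))   # (2.0 + 0.0 + 0.0) / 3
print(weighted_is(samples))   # (2.0 + 0.0 + 0.0) / (2.0 + 0.5 + 0.0)
```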

5.6 Incremental Implementation

Let $W_k = \rho_{t_k:T(t_k)-1}$ be the importance-sampling weight attached to the $k$-th return $G_k$ observed from state $s$. The weighted-importance-sampling estimate is then

$$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2.$$

Writing this weighted average as an incremental update:
$$V_{n+1} \doteq V_n + \frac{W_n}{C_n}\big[G_n - V_n\big], \qquad n \ge 1,$$

where $C_{n+1} \doteq C_n + W_{n+1}$ and $C_0 \doteq 0$.

Off-policy MC prediction
This algorithm is just the incremental implementation of weighted importance sampling described above; it makes explicit the connection between the incremental implementation and importance sampling.
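
A sketch of that incremental rule for a single state: `C` accumulates the weights seen so far and `V` is nudged toward each new weighted return; the class and variable names are illustrative.

```python
class IncrementalWeightedIS:
    """Incrementally maintain the weighted importance-sampling average:
    V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n),  C_{n+1} = C_n + W_{n+1}."""

    def __init__(self):
        self.V = 0.0
        self.C = 0.0      # C_0 = 0: cumulative sum of the weights seen so far

    def update(self, G, W):
        if W == 0.0:      # a zero weight leaves the estimate unchanged
            return self.V
        self.C += W
        self.V += (W / self.C) * (G - self.V)
        return self.V

est = IncrementalWeightedIS()
for G, W in [(1.0, 2.0), (0.0, 0.5), (1.0, 1.0)]:
    est.update(G, W)
print(est.V)   # 3.0 / 3.5, i.e. the weighted average sum(W*G) / sum(W)
```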

5.7 Off-policy Monte Carlo Control

Off-policy MC control
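
A sketch of off-policy MC control with weighted importance sampling, under the same assumed environment interface: the behavior policy is ε-greedy with respect to the current Q, the target policy is greedy, and the backward loop stops as soon as the taken action disagrees with the greedy one, since the importance-sampling ratio would be zero from that point on.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    C = defaultdict(float)          # cumulative sum of weights per (s, a)

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        # Behavior policy b: epsilon-greedy with respect to the current Q.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = greedy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Backward pass with weighted importance sampling.
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            if a != greedy(s):      # the greedy target policy would never take a here,
                break               # so the ratio is zero for all earlier time steps
            # pi(a|s) = 1 under the greedy target; b(a|s) = 1 - epsilon + epsilon/|A|
            W *= 1.0 / (1 - epsilon + epsilon / len(actions))
    return Q
```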

5.8 *Discounting-aware Importance Sampling

Take into account the internal structure of the return as a sum of discounted rewards; this can reduce the variance of the off-policy estimators.

The essence of the idea is to think of discounting as determining a probability of termination or, equivalently, a degree of partial termination.

$$\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \dots + R_h, \qquad 0 \le t < h \le T,$$

The conventional full return $G_t$ can be viewed as a sum of flat partial returns:

$$\begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T \\ &= (1-\gamma) R_{t+1} \\ &\quad + (1-\gamma)\gamma\,(R_{t+1} + R_{t+2}) \\ &\quad + (1-\gamma)\gamma^2 (R_{t+1} + R_{t+2} + R_{t+3}) \\ &\quad \;\;\vdots \\ &\quad + (1-\gamma)\gamma^{T-t-2}(R_{t+1} + R_{t+2} + \dots + R_{T-1}) \\ &\quad + \gamma^{T-t-1}(R_{t+1} + R_{t+2} + \dots + R_T) \\ &= (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{h-t-1} \bar{G}_{t:h} + \gamma^{T-t-1} \bar{G}_{t:T} \end{aligned}$$

Then the ordinary importance-sampling estimator becomes

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \Big)}{|\mathcal{J}(s)|},$$

and the weighted importance-sampling estimator becomes

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \Big)}{\sum_{t \in \mathcal{J}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \Big)}.$$
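
A quick numerical check, with made-up rewards and γ, that the flat-partial-return decomposition above reproduces the ordinary discounted return $G_t$.

```python
gamma = 0.9
rewards = [1.0, -2.0, 0.5, 3.0]            # R_{t+1}, ..., R_T, so T - t = 4

# Ordinary discounted return G_t.
G = sum(gamma**k * r for k, r in enumerate(rewards))

# Flat partial return G_bar_{t:t+h}: the first h rewards, undiscounted.
def flat_return(h):
    return sum(rewards[:h])

n = len(rewards)                            # n = T - t
decomposed = ((1 - gamma) * sum(gamma**(h - 1) * flat_return(h) for h in range(1, n))
              + gamma**(n - 1) * flat_return(n))

print(abs(G - decomposed) < 1e-12)          # True: both forms give the same value
```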

5.9 *Per-decision Importance Sampling

There is another way of taking the structure of the return into account, this time as a sum of rewards, that can be used in off-policy importance sampling; it too can reduce variance.

$$\begin{aligned} \rho_{t:T-1} G_t &= \rho_{t:T-1}\big(R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T\big) \\ &= \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T. \end{aligned}$$

The first term above can be written as
$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t|S_t)}{b(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})} \frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})} \cdots \frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})} R_{t+1}.$$

Of all these factors, only the first one and the last one (the reward) are correlated; the other factors are independent random variables whose expected value is 1:

$$\mathbb{E}\!\left[\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\right] \doteq \sum_a b(a|S_k)\,\frac{\pi(a|S_k)}{b(a|S_k)} = \sum_a \pi(a|S_k) = 1.$$

Of all the ratio factors, only the first one survives in expectation, so

$$\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}].$$

Repeating this analysis for each term of the return gives

$$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}[\tilde{G}_t],$$

where

$$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T.$$

We call this idea per-decision importance sampling.

An ordinary-importance-sampling estimator using $\tilde{G}_t$:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \tilde{G}_t}{|\mathcal{J}(s)|}.$$
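
A sketch of the per-decision return: each reward $R_{t+k+1}$ is weighted only by the ratio up to its own time step, $\rho_{t:t+k}$, rather than by the full $\rho_{t:T-1}$. The per-step ratios $\pi(A_k|S_k)/b(A_k|S_k)$ are assumed to be given as a list; the names are illustrative.

```python
def per_decision_return(rewards, step_ratios, gamma=1.0):
    """G_tilde_t = sum_k gamma^k * rho_{t:t+k} * R_{t+k+1}, where
    step_ratios[k] = pi(A_{t+k}|S_{t+k}) / b(A_{t+k}|S_{t+k})."""
    g_tilde, rho = 0.0, 1.0
    for k, (r, ratio) in enumerate(zip(rewards, step_ratios)):
        rho *= ratio                      # rho_{t:t+k}, the cumulative ratio up to step k
        g_tilde += (gamma ** k) * rho * r
    return g_tilde

def per_decision_ordinary_is(per_decision_returns):
    """V(s): average of the per-decision returns over all visits in J(s)."""
    return sum(per_decision_returns) / len(per_decision_returns)

# One episode fragment: three rewards and the three per-step ratios.
g = per_decision_return([1.0, 0.0, 2.0], [0.5, 2.0, 1.5], gamma=0.9)
print(g)   # 0.5*1 + 0.9*(0.5*2)*0 + 0.81*(0.5*2*1.5)*2 = 2.93
```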


Reprinted from blog.csdn.net/dengyibing/article/details/80464699