[Reinforcement Learning] 02: Exploration and Exploitation

1. Exploration and exploitation

The exploration-exploitation trade-off is a central issue in sequential decision-making tasks: should we choose the decision currently known to be best, or try other decisions?

  • Exploitation: select the decision currently known to be optimal.
  • Exploration: try other decisions, which may turn out to be optimal later. Exploration keeps driving the current policy toward the optimal policy, $\pi \rightarrow \pi^*$. Expressed as a formula:
    $$\mathcal{E}_{t}=\left\{\pi_{t}^{i}\mid i=1,\ldots,n\right\}\xrightarrow{\text{explore}}\mathcal{E}_{t+1}=\left\{\pi_{t}^{i}\mid i=1,\ldots,n\right\}\cup\left\{\pi_{e}^{j}\mid j=1,\ldots,m\right\}$$
    Here $\mathcal{E}_{t}$ is the current policy pool and $\mathcal{E}_{t+1}$ is the policy pool after exploration. Exploration produces new policies $\pi_{e}^{j}$; in the enlarged pool we hope to find a policy whose value exceeds what the original pool could reach without exploration:
    $$\exists\, V^{\star}\bigl(\cdot\mid\pi_{t}^{i}\sim\mathcal{E}_{t}\bigr)\leq V^{\star}\bigl(\cdot\mid\pi_{t+1}^{i}\sim\mathcal{E}_{t+1}\bigr),\quad \pi_{t+1}^{i}\sim\bigl\{\pi_{e}^{j}\mid j=1,\ldots,m\bigr\}$$

2. Exploration strategies

  • Naive Exploration: add noise to the policy, e.g. the $\epsilon$-greedy strategy;
  • Optimistic Initialization: give actions a high initial value estimate to encourage exploration;
  • Uncertainty Measurement: prefer actions whose value estimates are more uncertain;
  • Probability Matching: select actions by sampling, picking the one that looks best in the sample;
  • State Searching: explore states and actions that have not been tried yet (assumes a known environment).

3. The multi-armed bandit

In the multi-armed bandit (MAB) problem, a player faces a slot machine with $K$ levers. Each lever has its own reward probability distribution $\mathcal{R}$, and every pull of a lever yields a reward $r$ drawn from that lever's distribution. The task is to obtain the largest possible cumulative reward within a limited horizon (at most $T$ pulls) through repeated trials and exploration. Since the reward distributions are unknown, we must trade off between "exploring how rewarding each lever is" and "exploiting the lever that has paid out the most so far". "Which operating strategy yields the highest cumulative reward" is precisely the multi-armed bandit problem.

3.1. Formal description

The multi-armed bandit problem can be described by a tuple $\langle\mathcal{A},\mathcal{R}\rangle$, where:

  • $\mathcal{A}$ is the action set, $a_i \in \mathcal{A},\ i=1,2,\ldots,K$;
  • $\mathcal{R}$ is the reward probability distribution, $\mathcal{R}(r\mid a_i)=\mathbb{P}(r\mid a_i)$;
  • Assuming only one lever can be pulled per time step, the goal is to maximize the cumulative reward over $T$ steps: $\max\sum_{t=1}^{T}r_t,\ r_t\sim\mathcal{R}(\cdot\mid a_t)$.

As noted above, the reward distribution is unknown, so in practice we work with an estimate $\hat{\mathcal{R}}(r\mid a_i)$ of $\mathcal{R}$.

3.2. Estimating expected rewards

The expected reward is estimated from the rewards sampled so far:
$$Q_n(a_i)=\frac{r_1+r_2+\dots+r_{n-1}}{n-1}$$
Computing this directly requires storing all past rewards, i.e. $O(n)$ space; an incremental update reduces this to $O(1)$:
$$\begin{aligned} Q_{k}&=\frac1k\sum_{i=1}^k r_i \\ &=\frac1k\left(r_k+\sum_{i=1}^{k-1}r_i\right) \\ &=\frac1k\bigl(r_k+(k-1)Q_{k-1}\bigr) \\ &=\frac1k\bigl(r_k+kQ_{k-1}-Q_{k-1}\bigr) \\ &=Q_{k-1}+\frac1k\bigl[r_k-Q_{k-1}\bigr] \end{aligned}$$

The algorithm flow is as follows (a minimal code sketch is given after the list).

  • For every $a \in \mathcal{A}$, initialize the counter $N(a)=0$ and the expected reward estimate $\hat{Q}(a)=0$
  • for $t = 1 \rightarrow T$ do
  •   Select an action $a_t$ according to the policy $\pi$
  •   Receive the reward $r_t = \mathrm{Bandit}(a_t)$
  •   Update the counter: $N(a_t)=N(a_t)+1$
  •   Update the expected reward estimate: $\hat{Q}(a_t)=\hat{Q}(a_t)+\frac{1}{N(a_t)}\bigl[r_t-\hat{Q}(a_t)\bigr]$
  • end for
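
A minimal sketch of this loop in Python (names like `run_bandit` and `policy` are illustrative, not from the original post; the `bandit.step(a)` interface matches the full implementation at the end of this post):

```python
import numpy as np

def run_bandit(bandit, K, T, policy):
    """Incremental estimation of the expected reward of K arms over T steps."""
    N = np.zeros(K)                  # counters N(a)
    Q = np.zeros(K)                  # expected-reward estimates Q_hat(a)
    for t in range(T):
        a = policy(Q, N)             # choose an action a_t according to the policy pi
        r = bandit.step(a)           # observe the reward r_t = Bandit(a_t)
        N[a] += 1                    # update the counter
        Q[a] += (r - Q[a]) / N[a]    # incremental update: Q += (r - Q) / N
    return Q, N

# Example policy: pull a lever uniformly at random
random_policy = lambda Q, N: np.random.randint(len(Q))
```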

3.3. Regret function

For each action, define its expected reward $Q(a_i)=\mathbb{E}_{r\sim\mathcal{R}(\cdot\mid a_i)}\left[r\mid a_i\right]$.

Therefore, at least one lever has an expected reward no smaller than that of any other lever. Denote the optimal expected reward by $Q^*=\max_{a_i\in\mathcal{A}}Q(a_i)$.

To observe more directly the gap between the expected reward of a pulled lever and that of the optimal lever, we introduce the notion of regret. The regret of pulling lever $a_i$ is the difference between the optimal expected reward and that lever's expected reward: $R(a_i)=Q^*-Q(a_i)$.

Cumulative regret is the total regret accumulated over $T$ pulls: $\sigma_R=\sum_{t=1}^{T}R(a_t)$.

The goal of the MAB problem, maximizing the cumulative reward, is therefore equivalent to minimizing the cumulative regret: $\min\sigma_R \;\Leftrightarrow\; \max\mathbb{E}_{a\sim\pi}\left[\sum_{t=1}^{T}Q(a_t)\right]$.

  • If we keep exploring new policies all the time: $\sigma_R\propto T\cdot R$, so the cumulative regret grows linearly and does not converge.
  • If we never explore new policies: $\sigma_R\propto T\cdot R$, so the cumulative regret also grows linearly.

Therefore we ask whether sublinear growth of the regret is achievable, which would guarantee convergence. Lai & Robbins gave a lower bound:
using the regret $R(a)=Q^*-Q(a)$ and the similarity between reward distributions, measured by the KL divergence $D_{KL}\bigl(\mathcal{R}(r\mid a)\,\|\,\mathcal{R}^{\star}(r\mid a)\bigr)$,

$$\lim_{T\to\infty}\sigma_{R}\geq\log T\sum_{a\mid R(a)>0}\frac{R(a)}{D_{KL}\bigl(\mathcal{R}(r\mid a)\,\|\,\mathcal{R}^{\star}(r\mid a)\bigr)}$$
so the theoretically asymptotically optimal regret grows as $O(\log T)$.

Feedback distribution similarity
Suppose two feedback functions $f(x)$ and $g(x)$ have density functions $p_f(x)$ and $p_g(x)$ on the interval $[a,b]$. Their similarity can be described by the Kullback-Leibler divergence:

$$D_{KL}(p_f\,\|\,p_g)=\int_a^b p_f(x)\log\frac{p_f(x)}{p_g(x)}\,dx$$

Here $D_{KL}(p_f\,\|\,p_g)$ measures the difference between the distributions of $f(x)$ and $g(x)$: the smaller the value, the more similar the two distributions.
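
For the Bernoulli reward distributions used in the code at the end of this post, the integral reduces to a sum over the two outcomes. A small sketch (the probabilities passed to `kl_bernoulli` are made-up example values):

```python
import numpy as np

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)), the KL divergence between two Bernoulli reward distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

print(kl_bernoulli(0.6, 0.5))   # small value: the two distributions are similar
print(kl_bernoulli(0.9, 0.1))   # large value: the two distributions are very different
```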

Now consider which policy $\pi$ to adopt in order to maximize the reward.

4. Greedy and $\epsilon$-greedy strategies

The greedy strategy always chooses the currently best-looking action, i.e. pure exploitation. As shown above, its cumulative regret grows linearly. The $\epsilon$-greedy strategy introduces noise $\epsilon$: with probability $1-\epsilon$ it exploits (selects the lever with the largest estimated expected reward), and with probability $\epsilon$ it explores (selects a lever uniformly at random). The rule is:
$$a_t=\begin{cases}\arg\max_{a\in\mathcal{A}}\hat{Q}(a), & \text{with probability } 1-\epsilon\\ \text{a lever drawn uniformly from }\mathcal{A}, & \text{with probability } \epsilon\end{cases}$$
The cumulative regret still grows linearly, but with a smaller slope. A sketch of the selection rule is shown below.
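
The selection rule fits in a few lines. A sketch assuming `Q` holds the current expected-reward estimates (the `EpsilonGreedy` class in the code section below adds the incremental update on top of this):

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon):
    """With probability epsilon explore (random lever); otherwise exploit (argmax of Q)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))   # exploration
    return int(np.argmax(Q))               # exploitation
```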

Decaying $\epsilon$-greedy strategy
As the number of explorations grows, our estimates of each action's reward become more and more accurate, and we no longer need to spend as much effort on exploration. In a concrete implementation of $\epsilon$-greedy we can therefore let $\epsilon$ decay over time, so that the probability of exploring keeps shrinking. Note, however, that $\epsilon$ will not decay to 0 within a finite number of steps: a fully greedy algorithm based on finitely many observations still acts only on local information and in general keeps a fixed gap to the optimal solution.

One possible decay schedule (finding a suitable schedule is generally difficult; see the sketch below):
$$c\geq 0,\quad d=\min_{a\mid\Delta_a>0}\Delta_a,\quad \epsilon_t=\min\left\{1,\frac{c\,|\mathcal{A}|}{d^2 t}\right\}$$
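
A sketch of this schedule; `c`, the gap `d` and the action-set size below are illustrative values (in practice `d` depends on the unknown reward gaps, which is one reason a good schedule is hard to pick):

```python
def decayed_epsilon(t, num_actions, c=1.0, d=0.1):
    """epsilon_t = min{1, c*|A| / (d^2 * t)}: roughly a 1/t decay once t is large."""
    return min(1.0, c * num_actions / (d * d * t))

# epsilon stays at 1 early on, then shrinks like 1/t
print([round(decayed_epsilon(t, 10), 3) for t in (1, 10, 100, 1000, 10000)])
```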

Figure: the effect of different $\epsilon$ values on average reward and on the percentage of optimal actions selected.

As the figure shows, when $\epsilon=0$ there is no exploration, only exploitation, and the average reward stays roughly flat over the time steps. When $\epsilon=0.1$, exploration is added: the reward rises sharply at first and then stays at a higher level, and the optimal action is selected more often. With $\epsilon=0.01$ the behaviour lies between the two.

5. Optimistic initialization

Give $Q(a_i)$ a higher initial value and use the same incremental update.

Figure: the effect of $\epsilon$-greedy versus optimistic initialization on the percentage of optimal actions selected. Optimistic initialization ends up choosing the optimal action a larger fraction of the time than $\epsilon$-greedy.

  • It is a biased estimate; the influence of the bias decreases as the number of samples grows.
  • It may get stuck in a local optimum (this can be mitigated by also adjusting $\epsilon$). A code sketch of the initialization follows.
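
In code, optimistic initialization amounts to a high starting value for the estimates. A sketch; the value 1.0 matches the `init_prob` default used in the classes below, since a Bernoulli reward never exceeds 1:

```python
import numpy as np

K = 10
Q = np.full(K, 1.0)   # every lever initially looks (overly) good
# Greedy selection now explores on its own: a lever whose estimate has been
# pulled down by real rewards loses to levers that have not been tried yet.
```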

6. Explicitly consider the value distribution of actions

Figure: value distributions of three candidate actions.
Given the distributions of these three actions, how should we choose?

  • Encourage uncertainty;
  • Explicitly select actions by sampling from the distributions.

7. Upper confidence bound (UCB) algorithm

The greater the uncertainty in $Q(a_i)$, the more valuable that action is to explore. We therefore introduce an uncertainty measure $U(a)$, which decreases as the action is tried more often. An uncertainty-based strategy then combines the current expected-reward estimate with the uncertainty; the core question is how to estimate the uncertainty.

The upper confidence bound (UCB) algorithm is a classic uncertainty-based strategy. Its idea rests on a well-known mathematical result: Hoeffding's inequality.

Hoeffding's inequality is an important inequality in probability theory that quantifies the convergence rate of the law of large numbers. Roughly speaking, it bounds the probability that the empirical mean of bounded, independent and identically distributed random variables deviates from its expectation. It is stated as follows:

Let $X_1, X_2,\ldots,X_n$ be $n$ independent and identically distributed random variables with $0\leq X_i\leq 1$, and let $\bar{x}_n=\frac{1}{n}\sum_{j=1}^{n}X_j$ be their empirical mean. Then
$$\mathbb{P}\left\{\mathbb{E}\left[X\right]\geq\bar{x}_n+u\right\}\leq e^{-2nu^2}$$
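
A quick Monte Carlo sanity check of the bound with made-up values of `n` and `u` (fair Bernoulli samples, so $\mathbb{E}[X]=0.5$); the empirical frequency of the event should stay below $e^{-2nu^2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, u, trials = 50, 0.2, 100_000
p_true = 0.5                                   # E[X] for a fair Bernoulli variable
x_bar = rng.binomial(1, p_true, size=(trials, n)).mean(axis=1)
violations = np.mean(p_true >= x_bar + u)      # empirical P{E[X] >= x_bar + u}
print(violations, np.exp(-2 * n * u ** 2))     # empirical rate vs. Hoeffding bound
```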

Now apply Hoeffding's inequality to the multi-armed bandit problem: substitute $\hat Q_t(a)$ for $\bar{x}_n$ and take the deviation $u=\hat U_t(a)$ as the uncertainty measure. Fix a probability $p=e^{-2N_t(a)U_t(a)^2}$. By the inequality, $Q_t(a)<\hat Q_t(a)+\hat U_t(a)$ holds with probability at least $1-p$; if $p$ is small, this holds with very high probability, so $\hat Q_t(a)+\hat U_t(a)$ serves as an upper bound on the expected reward.

The UCB algorithm then selects the action whose upper bound on the expected reward is largest, $a=\arg\max_{a\in\mathcal{A}}\hat{Q}(a)+\hat{U}(a)$, where solving $p=e^{-2N_t(a)U_t(a)^2}$ for the uncertainty gives $\hat U_t(a)=\sqrt{\frac{-\log p}{2N_t(a)}}$.

Thus, once a probability $p$ is fixed, the corresponding uncertainty measure can be computed. More intuitively, the UCB algorithm first estimates an upper bound $\hat Q_t(a)+\hat U_t(a)$ on the expected reward of each lever, such that the true expected reward exceeds this bound only with the small probability $p$, and then pulls the lever with the largest upper bound, i.e. the lever most likely to yield the largest expected reward.

Another common way to write this is: $A_t\doteq\arg\max_a\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right]$
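
A sketch of this second form, assuming `Q` and `N` are the running estimates and counts (the +1 in the denominator is only there to avoid division by zero for untried levers; the `UCB` class below is the full version):

```python
import numpy as np

def ucb_action(Q, N, t, c=1.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    bonus = c * np.sqrt(np.log(t) / (N + 1))   # exploration bonus, large for rarely tried levers
    return int(np.argmax(Q + bonus))
```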

Figure: comparison of the average reward of $\epsilon$-greedy and UCB. Except for the first few steps, UCB attains a higher average reward than $\epsilon$-greedy.

8. Thompson sampling algorithm

  • Choose actions based on the probability that each action will be optimal

Mathematically: $p(a)=\int\mathbb{I}\left[\mathbb{E}_{p(Q(a))}\left[Q(a;\theta)\right]=\max_{a'\in\mathcal{A}}\mathbb{E}_{p(Q(a'))}\bigl(Q(a';\theta)\bigr)\right]d\theta$
Thompson sampling works by sampling: based on the current reward distribution $p(Q(a_i))$ of each action, perform one round of sampling to get one reward sample per lever, and then select the action $a$ whose sample is largest. Thompson sampling is thus a Monte Carlo method that uses sampling to estimate which lever has the highest probability of giving the largest reward.

Having understood the basic idea of Thompson sampling, one problem remains: how do we obtain the reward distribution of each action $a$ and keep it updated during the process? In practice, the current reward distribution of each action is usually modelled with a Beta distribution. Concretely, if a lever has been selected $k$ times, of which $m_1$ pulls gave reward 1 and $m_2$ gave reward 0, then the lever's reward probability is modelled as a Beta distribution with parameters $(m_1+1, m_2+1)$.
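
A sketch of one Thompson-sampling step under this Beta model, assuming `wins[a]` and `losses[a]` count the 1-rewards and 0-rewards observed so far for lever `a` (the `ThompsonSampling` class below keeps the same counters):

```python
import numpy as np

def thompson_step(bandit, wins, losses):
    """Sample one value per lever from Beta(wins+1, losses+1) and pull the best-looking lever."""
    samples = np.random.beta(wins + 1, losses + 1)   # one draw from each lever's posterior
    a = int(np.argmax(samples))                      # probability matching via sampling
    r = bandit.step(a)                               # Bernoulli reward: 1 or 0
    if r == 1:
        wins[a] += 1
    else:
        losses[a] += 1
    return a, r
```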

The Beta distribution is a probability distribution on the interval $[0,1]$. It can be used to describe the probability of a random event and is widely applied in statistics, machine learning, Bayesian inference and other fields.

The probability density function of the Beta distribution is as follows:

$$f(x;\alpha,\beta)=\frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$$

where $x$ is the value of the random variable, $\alpha$ and $\beta$ are the parameters of the distribution, and $B(\alpha,\beta)$ is the Beta function:

$$B(\alpha,\beta)=\frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

where $\Gamma$ is the gamma function.

The shape of the Beta distribution is determined by the parameters $\alpha$ and $\beta$. When $\alpha=\beta=1$ it degenerates to the uniform distribution; when $\alpha>1$ and $\beta>1$ it is unimodal and bell-shaped; when $\alpha<1$ and $\beta<1$ the density piles up at the two endpoints (a U shape). The mean of the Beta distribution is $\frac{\alpha}{\alpha+\beta}$ and its variance is $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.
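
These properties can be checked numerically, for example with `scipy.stats.beta` (the parameters 3 and 7 are arbitrary illustrative values):

```python
from scipy.stats import beta

dist = beta(3.0, 7.0)
print(dist.mean())    # alpha / (alpha + beta) = 0.3
print(dist.var())     # alpha*beta / ((alpha+beta)^2 * (alpha+beta+1)) ~= 0.0191
print(dist.pdf(0.3))  # density of Beta(3, 7) at x = 0.3
```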

Summary

  • Exploration and exploitation are an integral part of trial-and-error learning in reinforcement learning;
  • A major difference between the multi-armed bandit problem and full reinforcement learning is that interacting with the environment does not change the environment: each interaction with the bandit is independent of previous actions, so it can be regarded as stateless reinforcement learning;
  • The multi-armed bandit is the cleanest setting for studying the theory of exploration and exploitation (with a theoretically asymptotically optimal regret of $O(\log T)$);
  • The exploration and exploitation methods above are commonly used in RL in general, and especially in multi-armed bandits.

Figure: summary comparison of the exploration strategies discussed above.

Image source: https://staticcdn.boyuai.com/comment/upload/PzjhxfGWOkCb4KdXTZDik/502/2020/07/24/4yjXIv48Dtqdn84LEySmD.jpg

Code

import numpy as np
import matplotlib.pyplot as plt

class BernoulliBandit:
    """Bernoulli multi-armed bandit; K is the number of levers."""
    def __init__(self, K):
        # Randomly draw K numbers in [0, 1) as the winning probability of each lever
        self.probs = np.random.uniform(size=K)
        # Index of the lever with the largest winning probability
        self.best_idx = np.argmax(self.probs)
        # Largest winning probability
        self.best_prob = self.probs[self.best_idx]
        self.K = K

    def step(self, Kth):
        # After the player chooses lever Kth, return 1 (win) or 0 (no win)
        # according to that lever's winning probability
        if np.random.rand() < self.probs[Kth]:
            return 1
        else:
            return 0

class ProblemSolver:
    """Basic framework for multi-armed bandit algorithms."""
    def __init__(self, bandit):
        self.bandit = bandit
        # Number of times each lever has been tried
        self.counts = np.zeros(self.bandit.K)
        # Cumulative regret at the current step
        self.regret = 0.0
        # List recording the action taken at each step
        self.actions = []
        # List recording the cumulative regret at each step
        self.regrets = []

    def UpdateRegret(self, Kth):
        # Update and record the cumulative regret; Kth is the lever chosen this step
        self.regret += self.bandit.best_prob - self.bandit.probs[Kth]
        self.regrets.append(self.regret)

    def RunOnce(self):
        # Return the lever chosen for the current action;
        # implemented by each concrete strategy (must be overridden)
        raise NotImplementedError

    def RunLoop(self, NumofSteps):
        # Run for NumofSteps steps in total
        for _ in range(NumofSteps):
            Kth = self.RunOnce()
            self.counts[Kth] += 1
            self.UpdateRegret(Kth)
            self.actions.append(Kth)

class EpsilonGreedy(ProblemSolver):
    """Epsilon-greedy algorithm, inheriting from ProblemSolver."""
    def __init__(self, bandit, epsilon=0.01, init_prob=1.0):
        super(EpsilonGreedy, self).__init__(bandit)
        self.epsilon = epsilon
        # Initial estimate of the expected reward of every lever
        self.EstimateReward = np.array([init_prob] * self.bandit.K)

    def RunOnce(self):
        if np.random.rand() < self.epsilon:
            # Explore: choose a lever at random
            Kth = np.random.randint(0, self.bandit.K)
        else:
            # Exploit: choose the lever with the largest estimated expected reward
            Kth = np.argmax(self.EstimateReward)
        # Obtain the reward of this action
        Reward = self.bandit.step(Kth)
        # Incrementally update the expected reward estimate
        # (also done for explored actions, otherwise their estimates never change)
        self.EstimateReward[Kth] += 1.0 / (self.counts[Kth] + 1) * (Reward - self.EstimateReward[Kth])
        return Kth

class DecayingEpsilonGreedy(ProblemSolver):
    """Epsilon-greedy with epsilon decaying over time (epsilon_t = 1/t), inheriting from ProblemSolver."""
    def __init__(self, bandit, init_prob=1.0):
        super(DecayingEpsilonGreedy, self).__init__(bandit)
        # Initial estimate of the expected reward of every lever
        self.EstimateReward = np.array([init_prob] * self.bandit.K)
        self.TimeCount = 0

    def RunOnce(self):
        self.TimeCount += 1
        if np.random.rand() < 1.0 / self.TimeCount:
            # Explore: choose a lever at random
            Kth = np.random.randint(0, self.bandit.K)
        else:
            # Exploit: choose the lever with the largest estimated expected reward
            Kth = np.argmax(self.EstimateReward)
        # Obtain the reward of this action
        Reward = self.bandit.step(Kth)
        # Incrementally update the expected reward estimate
        self.EstimateReward[Kth] += 1.0 / (self.counts[Kth] + 1) * (Reward - self.EstimateReward[Kth])
        return Kth

class DecayingEpsilonGreedy2(ProblemSolver):
    """Epsilon-greedy with another decay schedule, epsilon_t = min{1, cK/(d^2 t)}, inheriting from ProblemSolver."""
    def __init__(self, bandit, init_prob=1.0, coef=1.0):
        super(DecayingEpsilonGreedy2, self).__init__(bandit)
        # Initial estimate of the expected reward of every lever
        self.EstimateReward = np.array([init_prob] * self.bandit.K)
        self.TimeCount = 0
        self.coef = coef

    def RunOnce(self):
        self.TimeCount += 1
        # The true gap d is unknown; the current cumulative regret is used as a rough stand-in
        if self.regret > 0:
            d = self.regret
        else:
            d = 1
        coef_epsilon = min(1, (self.coef * self.bandit.K) / (d * d * self.TimeCount))
        if np.random.rand() < coef_epsilon:
            # Explore: choose a lever at random
            Kth = np.random.randint(0, self.bandit.K)
        else:
            # Exploit: choose the lever with the largest estimated expected reward
            Kth = np.argmax(self.EstimateReward)
        # Obtain the reward of this action
        Reward = self.bandit.step(Kth)
        # Incrementally update the expected reward estimate
        self.EstimateReward[Kth] += 1.0 / (self.counts[Kth] + 1) * (Reward - self.EstimateReward[Kth])
        return Kth

class UCB(ProblemSolver):
    """Upper confidence bound (UCB) algorithm, inheriting from ProblemSolver."""
    def __init__(self, bandit, init_prob=1.0, coef=1.0):
        super(UCB, self).__init__(bandit)
        # Initial estimate of the expected reward of every lever
        self.EstimateReward = np.array([init_prob] * self.bandit.K)
        self.TimeCount = 0
        self.coef = coef

    def RunOnce(self):
        self.TimeCount += 1
        # Compute the upper confidence bound of every lever
        # (the +1 on the counts avoids division by zero for untried levers)
        ucb = self.EstimateReward + self.coef * np.sqrt(
            np.log(self.TimeCount) / (2 * (self.counts + 1))
        )
        # Choose the lever with the largest upper confidence bound
        Kth = np.argmax(ucb)
        # Obtain the reward of this action
        Reward = self.bandit.step(Kth)
        # Incrementally update the expected reward estimate
        self.EstimateReward[Kth] += 1.0 / (self.counts[Kth] + 1) * (Reward - self.EstimateReward[Kth])
        return Kth

class ThompsonSampling(ProblemSolver):
    """Thompson sampling algorithm, inheriting from ProblemSolver."""
    def __init__(self, bandit):
        super(ThompsonSampling, self).__init__(bandit)
        # Number of times each lever returned reward 1
        self.SuccessCounter = np.zeros(self.bandit.K)
        # Number of times each lever returned reward 0
        self.FailureCounter = np.zeros(self.bandit.K)

    def RunOnce(self):
        # Draw one reward sample per lever from its Beta posterior
        Samples = np.random.beta(self.SuccessCounter + 1, self.FailureCounter + 1)
        # Choose the lever with the largest sampled reward
        Kth = np.argmax(Samples)
        # Obtain the reward of this action
        Reward = self.bandit.step(Kth)
        if Reward == 1:
            self.SuccessCounter[Kth] += 1
        else:
            self.FailureCounter[Kth] += 1
        return Kth

def PlotResults(solvers, solver_names):
    """Plot cumulative regret over time. solvers is a list of strategy instances;
    solver_names is a list with the name of each strategy."""
    plt.style.use('seaborn-v0_8-paper')
    for idx, solver in enumerate(solvers):
        time_list = range(len(solver.regrets))
        plt.plot(time_list, solver.regrets, label=solver_names[idx])
    plt.xlabel('Time steps')
    plt.ylabel('Cumulative regrets')
    plt.title('%d-armed bandit' % solvers[0].bandit.K)
    plt.legend()
    plt.show()

# test01
def test01():
    # Fix the random seed so the experiment is reproducible
    np.random.seed(1)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    print("Generated a random %d-armed Bernoulli bandit" % K)
    print("Lever %d has the largest winning probability: %.4f" %
          (bandit_10_arm.best_idx, bandit_10_arm.best_prob))

# test02 - epsilon-greedy
def test02():
    np.random.seed(0)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    EpsilonGreedySolver = EpsilonGreedy(bandit_10_arm, epsilon=0.01)
    EpsilonGreedySolver.RunLoop(5000)
    print('Cumulative regret of epsilon-greedy:', EpsilonGreedySolver.regret)
    PlotResults([EpsilonGreedySolver], ["EpsilonGreedy"])

# test03 - several epsilon values
def test03():
    np.random.seed(0)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    epsilon_lists = [1e-4, 0.01, 0.1, 0.25, 0.5]
    EpsilonGreedySolvers = [EpsilonGreedy(bandit_10_arm, e) for e in epsilon_lists]
    EpsilonGreedySolversNames = ["epsilon={}".format(e) for e in epsilon_lists]
    for Solver in EpsilonGreedySolvers:
        Solver.RunLoop(5000)
        print('Cumulative regret of epsilon-greedy:', Solver.regret)
    PlotResults(EpsilonGreedySolvers, EpsilonGreedySolversNames)

# test04 - decaying epsilon-greedy
def test04():
    np.random.seed(0)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    DecayingEpsilonGreedySolver = DecayingEpsilonGreedy(bandit_10_arm)
    DecayingEpsilonGreedySolver.RunLoop(5000)
    print('Cumulative regret of decaying epsilon-greedy:', DecayingEpsilonGreedySolver.regret)
    # print('Per-step cumulative regrets of decaying epsilon-greedy:', DecayingEpsilonGreedySolver.regrets)
    PlotResults([DecayingEpsilonGreedySolver], ["DecayingEpsilonGreedy"])

# test05 - decaying epsilon-greedy (second schedule)
def test05():
    np.random.seed(0)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    DecayingEpsilonGreedySolver2 = DecayingEpsilonGreedy2(bandit_10_arm, coef=1.0)
    DecayingEpsilonGreedySolver2.RunLoop(5000)
    print('Cumulative regret of decaying epsilon-greedy (schedule 2):', DecayingEpsilonGreedySolver2.regret)
    PlotResults([DecayingEpsilonGreedySolver2], ["DecayingEpsilonGreedy2"])

# test06 - UCB
def test06():
    np.random.seed(0)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    UCBSolver = UCB(bandit_10_arm, coef=2.0)
    UCBSolver.RunLoop(5000)
    print('Cumulative regret of UCB:', UCBSolver.regret)
    PlotResults([UCBSolver], ["UCB"])

# test07 - Thompson sampling
def test07():
    np.random.seed(0)
    K = 10
    bandit_10_arm = BernoulliBandit(K)
    ThompsonSamplingSolver = ThompsonSampling(bandit_10_arm)
    ThompsonSamplingSolver.RunLoop(5000)
    print('Cumulative regret of Thompson sampling:', ThompsonSamplingSolver.regret)
    PlotResults([ThompsonSamplingSolver], ["ThompsonSampling"])

if __name__ == '__main__':
    test07()

Some results:

(Figure: cumulative regret curves produced by the tests above.)

