[Reinforcement Learning] An introduction to the UCB algorithm for the multi-armed bandit (MAB) problem

UCB algorithm

UCB handles the EE (Exploit-Explore) trade-off well, but it is a context-free bandit algorithm: it uses no contextual information and simply grinds away, never looking at what kind of arm it is facing.

The problem the UCB algorithm solves is:

Faced with a fixed set of K items (ads or recommendation candidates), and with no prior knowledge of each item's reward, we must choose one item per trial. How do we maximize the total reward over the course of this selection process?

The idea UCB uses to solve this multi-armed bandit problem is the confidence interval. A confidence interval can be understood simply as a measure of uncertainty: the wider the interval, the more uncertain the estimate, and vice versa.

Each item's mean reward has a confidence interval; as the number of trials increases, the interval narrows (we gradually learn whether the item's reward is good or poor).

Before each selection, re-estimate each item's mean and confidence interval based on the results observed so far.

Select the item whose confidence interval has the largest upper bound.

"Choose the upper bound of the confidence interval of the maximum item" This sentence reflects several meanings:

  1. If an item's confidence interval is very wide (it has been selected only a few times, so its value is still uncertain), it tends to get selected more often; this is the risk-taking part of the algorithm.
  2. If an item's confidence interval is very narrow (it has been selected many times, so its quality is fairly well determined), it tends to get selected again only when its mean is large; this is the conservative part of the algorithm.
  3. UCB is an optimistic algorithm: it ranks items by the upper bound of the confidence interval. A pessimistic, conservative approach would rank by the lower bound instead.
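For instance, consider the following purely hypothetical numbers (made up for illustration only):

\[
\text{item } A:\ \bar{x}_A = 0.55,\ \text{interval } 0.55 \pm 0.02 \Rightarrow \text{upper bound } 0.57; \qquad
\text{item } B:\ \bar{x}_B = 0.50,\ \text{interval } 0.50 \pm 0.15 \Rightarrow \text{upper bound } 0.65.
\]

Although A currently has the higher mean, B has been tried far less often and its interval is much wider, so its upper bound is larger and UCB selects B. As B accumulates trials its interval narrows, and if its mean stays below A's, the choice switches back to A.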

UCB1 algorithm

Here we introduce the most common UCB strategy, UCB1. The spirit of the algorithm is "optimism in the face of uncertainty": we first guess the reward each arm might give and select the arm with the highest guess. If the actual reward turns out to be lower, we quickly lower our guess for that arm; conversely, if the reward holds up, we keep choosing that arm. In effect, we maintain an index of each arm's reward, and by dynamically adjusting this index we eventually settle on the arm with the highest expected reward.

UCB1 algorithm: in the first K rounds, select each arm once; then in each round \(t = K+1, K+2, \ldots\) (a minimal code sketch follows the two steps below):

  1. Select the arm with the largest index \(I_i\), where \(I_i = \bar{x}_i + \sqrt{\frac{2\log t}{n_i}}\), \(\bar{x}_i\) is the empirical mean reward of arm \(i\), and \(n_i\) is the number of times arm \(i\) has been selected so far
  2. Record the reward and update \(\bar{x}_i\) and \(n_i\)
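To make the two steps concrete, here is a minimal sketch of UCB1 in Python; the Bernoulli reward simulation, the arm probabilities (0.2, 0.5, 0.7), and the horizon T = 10000 are made-up illustration values, not taken from the text above.

```python
import math
import random

def ucb1(arms, T):
    """Run UCB1 for T rounds; arms[i]() returns a stochastic reward in [0, 1]."""
    K = len(arms)
    n = [0] * K          # n_i: how many times arm i has been selected
    mean = [0.0] * K     # x̄_i: empirical mean reward of arm i

    # First K rounds: select each arm once.
    for i in range(K):
        mean[i] = arms[i]()
        n[i] = 1

    # Rounds t = K+1, ..., T: pick the arm with the largest index I_i.
    for t in range(K + 1, T + 1):
        i = max(range(K),
                key=lambda a: mean[a] + math.sqrt(2 * math.log(t) / n[a]))
        r = arms[i]()
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]   # incremental update of x̄_i
    return mean, n

# Hypothetical usage: three Bernoulli arms with made-up success probabilities.
arms = [lambda p=p: 1.0 if random.random() < p else 0.0 for p in (0.2, 0.5, 0.7)]
means, counts = ucb1(arms, T=10000)
print(means, counts)  # the p = 0.7 arm should receive most of the selections
```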

When UCB1 is run, the following theorem holds, where \(\Delta_i = \mu^{*} - \mu_i\) (\(\mu^{*}\) is the mean reward of the best arm and \(\mu_i\) that of arm \(i\)):

Theorem: the expected cumulative regret of UCB1 does not exceed

\[8\sum_{i:\mu_i\lt \mu^{*}}\frac{\log T}{\Delta_i}+(1+\pi^2/3)(\sum_{j=1}^K \Delta_j)\]

The proof of the theorem is not reproduced here; see Finite-time Analysis of the Multiarmed Bandit Problem for details.

We see that the expected cumulative regret of UCB1 is \(O(\log T)\). Is this enough? Of course not: if the cumulative regret in the worst case is too high, the algorithm is not actually meaningful.

UCB1 worst case

Theorem: the worst-case expected cumulative regret of UCB1 does not exceed \(O(\sqrt{KT\log T})\)

We give a simple proof by analyzing the expected cumulative regret as a function of the gaps. First, take the partial derivative of the expected cumulative regret \(R\) with respect to \(\Delta_i\):

\[\frac{\partial R}{\partial \Delta_i}=-\frac{8\log T}{\Delta_i^2}+1+\frac{\pi^2}{3}\]

Setting it equal to 0 gives \(\Delta_i = \sqrt{\frac{8\log T}{1+\pi^2/3}} = O(\sqrt{\log T})\), and at this point \(R\) attains its minimum, \(R = O(K\sqrt{\log T})\). Meanwhile, if we make \(\Delta_i\) as small as possible, the bound \(R\) becomes arbitrarily large, but then all arms give nearly the same reward, so the actual regret is in fact small. If instead we let \(\Delta_i\) equal 1, we get \(R = O(K\log T)\).

In fact, if we set all gaps equal, \(\Delta_i = \Delta\), then the expected cumulative regret is also at most \(\Delta T\) (each round loses at most \(\Delta\)); balancing \(\Delta T\) against the logarithmic bound above yields the worst-case cumulative regret \(O(\sqrt{KT\log T})\).
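Spelled out, the balancing step looks like this (keeping only the dominant \(\log T\) term of the bound above):

\[
R \le \min\Bigl(\Delta T,\ \frac{8K\log T}{\Delta}\Bigr), \qquad
\Delta T = \frac{8K\log T}{\Delta}\ \Rightarrow\ \Delta = \sqrt{\frac{8K\log T}{T}}, \qquad
R \le \Delta T = \sqrt{8KT\log T} = O(\sqrt{KT\log T}).
\]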

Paper: Finite-time Analysis of the Multiarmed Bandit Problem

Comparison with the UCB algorithm


Let \(X_1, \ldots, X_n\) be independent, identically distributed random variables such that \(X_i \in [0, 1]\) and \(\mathbb{E}[X_i] = \mu\). If \(S_n = X_1 + \cdots + X_n\), then:

\[P(S_n \ge n\mu + a) \le e^{-2a^2/n}\]
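Assuming this is the standard Chernoff-Hoeffding bound used in the UCB1 paper, taking \(a = \sqrt{2n\log t}\) shows where the width \(\sqrt{2\log t / n_i}\) of the UCB1 index comes from:

\[
P\Bigl(\tfrac{S_n}{n} \ge \mu + \sqrt{\tfrac{2\log t}{n}}\Bigr)
= P\bigl(S_n \ge n\mu + \sqrt{2n\log t}\bigr)
\le e^{-2(2n\log t)/n} = t^{-4},
\]

so the chance that an arm's empirical mean overestimates its true mean by more than the confidence width decays polynomially in \(t\).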

The posterior distribution's parameters are derived from the prior distribution's parameters and the number of samples observed for that arm; from this point of view, we can see why this alternative method does not depend on the total number of samples.

Because we have \(P(\beta)\) (the prior) and \(P(x_i, i \in [1, n] \mid \beta)\) (the likelihood), while \(P(\beta \mid x)\) (the posterior) is not yet known;

and what determines the width of \(P(\beta \mid x)\) is the length of \(x\) (this arm's own samples), not the total sample length.
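As a concrete and purely illustrative instance of this, assume a Bernoulli reward with a Beta prior (this specific choice is an assumption, not stated above); the posterior parameters then depend only on that arm's own prior and its own observations:

```python
# Hypothetical Beta-Bernoulli example: the posterior over one arm's parameter
# is determined by that arm's prior and its own samples, not by the total
# number of trials across all arms (unlike the UCB index, which uses t).
alpha0, beta0 = 1.0, 1.0           # prior P(beta) = Beta(alpha0, beta0)
x = [1, 0, 1, 1, 0]                # made-up rewards x_1..x_n for this arm
alpha = alpha0 + sum(x)            # successes update alpha
beta = beta0 + len(x) - sum(x)     # failures update beta
print(alpha, beta)                 # posterior P(beta | x) = Beta(4.0, 3.0)
```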


Source: www.cnblogs.com/Ryan0v0/p/11366578.html