Upper-Confidence-Bound (UCB) Action Selection

Background

In the ε-greedy method, non-greedy actions are chosen at random for exploration, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain.

Upper-Confidence-Bound

In order to take into account both how close their estimates are to being maximal and the uncertainty in those estimates, one effective way is to select actions according to:

$$A_t \doteq \underset{a}{\arg\max}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]$$

  • $N_t(a)$ denotes the number of times that action $a$ has been selected prior to time $t$. If $N_t(a)=0$, then $a$ is considered to be a maximizing action.
  • $c>0$ controls the degree of exploration and determines the confidence level.
  • The use of the natural logarithm $\ln t$ means that the increases get smaller over time but are unbounded: all actions will eventually be selected. However, actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
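
As a concrete illustration of the selection rule above, here is a minimal Python/NumPy sketch. The function name `ucb_select` and its arguments (`q_values` for the estimates $Q_t(a)$, `counts` for $N_t(a)$) are illustrative assumptions, not part of the original post:

```python
import numpy as np

def ucb_select(q_values, counts, t, c=2.0):
    """Pick an action by the UCB rule: argmax_a [Q_t(a) + c*sqrt(ln t / N_t(a))]."""
    # An action with N_t(a) = 0 is considered maximizing, so every action
    # is tried once before the square-root bonus comes into play.
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))
```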

The idea of UCB action selection is that the square-root term $c\sqrt{\frac{\ln t}{N_t(a)}}$ is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$. Each time $a$ is selected, $N_t(a)$ grows and the uncertainty term shrinks. Conversely, each time an action other than $a$ is selected, $t$ increases while $N_t(a)$ does not, so the uncertainty term grows.
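
To see this dynamic in action, the short simulation below (continuing the `ucb_select` sketch above) runs the rule on a hypothetical 3-armed Gaussian bandit with incremental sample-average estimates; the arm means and the choice $c=2$ are made up for illustration:

```python
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hypothetical arm values
q_values = np.zeros(3)
counts = np.zeros(3)

for t in range(1, 1001):
    a = ucb_select(q_values, counts, t, c=2.0)
    reward = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    # Incremental sample-average update of Q_t(a).
    q_values[a] += (reward - q_values[a]) / counts[a]

print(counts)  # the best arm (index 2) should dominate, but every arm gets tried
```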

Limitations

  1. UCB is more difficult than the ε-greedy method to extend beyond bandit problems to more general reinforcement learning settings.
  2. UCB has difficulty dealing with large state spaces and with nonstationary problems.

Reposted from blog.csdn.net/lun55423/article/details/111997844