Contextual Bandits (CB) sit between full reinforcement learning (RL) and multi-armed bandits (MAB):
- RL: the action changes the state; the reward is determined by both state and action
- CB: the action does not change the state; the reward is still determined by both state and action
- MAB: the action does not change the state; the reward is determined by the action alone
LinUCB is a Contextual Bandit method. The basic idea is to approximate the expected reward with a function: for each action, learn such an estimator. When a new state s arrives, first estimate the expected reward of each action, then select an action according to the UCB rule, which balances exploration and exploitation (greedy choice plus an uncertainty bonus).
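
A minimal sketch of the disjoint-model variant of this idea, assuming a fixed feature vector per state and one ridge-regression estimator per action (class and parameter names like `LinUCB`, `n_actions`, `dim`, and `alpha` are illustrative, not from the original text):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB sketch: one linear reward model per action."""

    def __init__(self, n_actions, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength (UCB width)
        # Per-action sufficient statistics for ridge regression:
        # A = I + sum(x x^T), b = sum(reward * x)
        self.A = [np.eye(dim) for _ in range(n_actions)]
        self.b = [np.zeros(dim) for _ in range(n_actions)]

    def select(self, x):
        """Given state features x, score every action and pick the highest UCB."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                       # estimated reward weights
            mean = theta @ x                        # expected reward (exploitation)
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty (exploration)
            scores.append(mean + bonus)
        return int(np.argmax(scores))

    def update(self, action, x, reward):
        """Update only the chosen action's statistics with the observed reward."""
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x
```

In use, each round you build the feature vector for the current state, call `select`, observe the reward for the chosen action, and call `update`; the bonus term shrinks as an action's model sees more data, so the policy gradually shifts from exploring to exploiting.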