A few words to sum up the Q-Learning and SARSA algorithms

  • The difference from Policy Gradients is that, instead of learning a probability distribution over actions, the algorithm evaluates the expected reward of performing action a in state s, i.e., Q(s, a) (the contrast is sketched below).
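
A rough illustration of this contrast, as a sketch over a made-up four-action environment (none of these names come from the original post):

```python
# Hypothetical 4-action example of what each family of methods outputs.

# Policy Gradients: the model maps a state to a probability distribution
# over actions, and an action is sampled from that distribution.
policy_output = [0.1, 0.6, 0.2, 0.1]           # probabilities, sum to 1

# Q-learning / SARSA: the model maps each (state, action) pair to a single
# scalar, the estimated reward Q(s, a); actions are chosen by comparing them.
q_output = {"left": 1.4, "right": -0.2, "up": 0.8, "down": 0.3}
best_action = max(q_output, key=q_output.get)  # "left"
```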

  • There are two ways to compute Q(s, a). The first is a direct estimate from a look-up table or a model, Q(s, a) = checkTable(s, a); at the beginning of training this estimate is very inaccurate. The second is obtained by sampling: suppose that after performing a in state s we reach state s', where action a' is performed; then Q'(s, a) = current reward + attenuation coefficient × Q(s', a'). This recursion resembles a dynamic-programming problem, and when the game ends only the current reward remains. Unlike dynamic programming, however, this recursive relation does not wait until after the game ends; the update is performed one step at a time (a minimal sketch follows).
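
A minimal sketch of both estimates, assuming a tabular representation; the names check_table, GAMMA and the done flag are illustrative, not from the original post:

```python
from collections import defaultdict

GAMMA = 0.9                    # attenuation (discount) coefficient, assumed value
q_table = defaultdict(float)   # look-up table; every entry starts at 0, so it is
                               # very inaccurate at the beginning of training

def check_table(s, a):
    """First method: read the current estimate Q(s, a) straight from the table."""
    return q_table[(s, a)]

def one_step_target(reward, s_next, a_next, done):
    """Second method, by sampling: Q'(s, a) = reward + GAMMA * Q(s', a').
    When the game ends (done), only the current reward remains."""
    if done:
        return reward
    return reward + GAMMA * check_table(s_next, a_next)
```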

  • Q(s, a) is the model's prediction of the reward based on historical data, while Q'(s, a) is the prediction based on the current action's reward. In a good model, Q(s, a) and Q'(s, a) should be as close as possible; but for stable iteration, the new Q(s, a) that replaces the old one is a weighted average of the old Q(s, a) and Q'(s, a), with the weight controlled by the learning rate (see the update rule below).
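
In update form this weighted average is the rule Q(s, a) ← (1 - α)·Q(s, a) + α·Q'(s, a). Continuing the table sketch above, with an assumed learning rate:

```python
ALPHA = 0.1   # learning rate, assumed value

def update(s, a, target):
    """Blend the old estimate Q(s, a) with the new one-step target Q'(s, a)."""
    q_table[(s, a)] = (1 - ALPHA) * q_table[(s, a)] + ALPHA * target
    # Equivalent incremental form:
    # q_table[(s, a)] += ALPHA * (target - q_table[(s, a)])
```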

  • Because this method estimates reward values rather than a probability distribution, the action with the largest estimated reward is generally chosen. This poses a problem during training: in some states, the same single action may be selected forever. To solve this, epsilon-greedy is introduced: with high probability the maximum-reward action is selected, keeping exploitation focused, while with a small probability an action is selected at random, ensuring the action space is explored completely (sketched below).
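
A sketch of epsilon-greedy selection over the same hypothetical table; the action set and epsilon value here are assumptions:

```python
import random

EPSILON = 0.1             # small exploration probability, assumed value
ACTIONS = [0, 1, 2, 3]    # hypothetical discrete action set

def epsilon_greedy(s):
    """With probability EPSILON pick a random action (completeness of
    exploration); otherwise pick the maximum-reward action (focused
    exploitation)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(s, a)])
```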

  • Once epsilon-greedy is introduced, two options appear for choosing the action a' at s' in the iterative formula for Q'(s, a): select the action with the maximum reward, or stay consistent with the behavior policy currently acting at s, i.e., with a small probability select a random action.

  • If the greedy strategy of selecting the maximum-reward action at s' is used, the algorithm is Q-learning, which is called off-policy; if the selection stays consistent with the behavior policy currently used at s, it is SARSA, which is called on-policy (the two targets are contrasted in the sketch after this list).
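
Reusing the hypothetical helpers above, the entire difference between the two algorithms sits in how a' enters the target:

```python
def q_learning_target(reward, s_next, done):
    """Off-policy: evaluate s' with the greedy (maximum-reward) action,
    regardless of what the behavior policy actually does next."""
    if done:
        return reward
    return reward + GAMMA * max(q_table[(s_next, a)] for a in ACTIONS)

def sarsa_target(reward, s_next, a_next, done):
    """On-policy: evaluate s' with the action a' that the epsilon-greedy
    behavior policy actually selected (hence S, A, R, S', A')."""
    if done:
        return reward
    return reward + GAMMA * q_table[(s_next, a_next)]
```

In a training loop, SARSA would first choose a' = epsilon_greedy(s_next) and then compute its target, while Q-learning can compute its target before, or entirely without, choosing a'.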

Source: www.cnblogs.com/daniel-D/p/11002870.html