Reinforcement learning based on temporal difference methods: Sarsa and Q-learning

Temporal difference (TD) learning is a class of algorithms widely used in reinforcement learning to learn value functions or policies. Sarsa and Q-learning are two important algorithms based on the temporal difference method, used to solve reinforcement learning problems formulated as Markov decision processes (MDPs).

Here is the simplest TD method, the TD(0) update of the state-value function:

V(s) ← V(s) + α [ r + γ V(s') − V(s) ]

It uses only the reward received after the current action and the value of the next state to form the update target. Sarsa (State-Action-Reward-State-Action) and Q-learning are both reinforcement learning methods built on this temporal difference idea.
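As a rough sketch of how this looks in code (the tabular value representation and variable names here are illustrative assumptions, not code from the article):

```python
import numpy as np

# Illustrative sketch of the TD(0) update above; names and the tabular V are assumptions.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, done=False):
    """One TD(0) step: move V[s] toward the target r + gamma * V[s_next]."""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

# Toy usage: 5 states, one transition from state 2 to state 3 with reward -1.
V = np.zeros(5)
V = td0_update(V, s=2, r=-1.0, s_next=3)
```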

The difference between Sarsa and Q-learning

Sarsa stands for State-Action-Reward-State-Action. It is an on-policy method: the training data is generated by the same policy that is being learned. Q-learning is an off-policy method: the behavior policy used to generate training data can differ from the policy implied by the value function being learned.

The update rule for Sarsa is:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]

where:

  • Q(s, a) is the value function estimate for taking action a in state s.
  • α is the learning rate, which controls the step size for each update.
  • r is the immediate reward obtained after taking action a in state s.
  • γ is the discount factor, which represents the discount rate of future rewards.
  • s' is the new state obtained after performing action a.
  • a' is the next action chosen in the new state s'.
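A minimal sketch of this update as a function, assuming a tabular Q of shape (n_states, n_actions); the names are illustrative:

```python
import numpy as np

# Illustrative sketch of the Sarsa update above; Q is assumed to be a (n_states, n_actions) table.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, done=False):
    """Sarsa target uses Q[s_next, a_next], where a_next is the action actually chosen in s'."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```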

Q-learning is another reinforcement learning algorithm based on the temporal difference method. It learns an action-value function that estimates the expected cumulative reward obtained by acting optimally from state s onward. The update rule of Q-learning is:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

where max_{a'} Q(s', a') is the maximum value-function estimate over the actions a' available in the new state s'.
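The corresponding sketch for the Q-learning update, under the same tabular assumption:

```python
import numpy as np

# Illustrative sketch of the Q-learning update above; Q is assumed to be a (n_states, n_actions) table.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0, done=False):
    """Q-learning target uses the greedy (maximum) action value in the next state s'."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```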

From the updates above, the two methods are clearly very similar; the main difference is the update target. Sarsa's update uses the next action actually taken in the new state, while Q-learning's update always uses the action in the new state that maximizes the value function. As a result, Sarsa tends to learn about the policy it is currently following, while Q-learning tends to learn the optimal policy directly.

Performance in the cliff walking environment

This is a simple environment described in the RL book, as shown in the screenshot below.

  • An episode starts in state S, where the agent begins
  • An episode ends in state G, the terminal state
  • The states in the bottom row between S and G are the cliff states
  • A transition from any non-cliff state gives a reward of -1, and the agent moves to the adjacent state
  • Stepping into a cliff state gives a reward of -100 and sends the agent back to the start state S, which also ends the episode
  • An episode ends when the agent reaches the terminal state G, has taken 100 steps, or falls into a cliff state
  • The blue path in the figure is safe but not optimal, because it takes many steps to reach the goal state
  • The red path is optimal, but it is dangerous because the agent walks along the edge of the cliff

From the description of the environment, the agent's goal is to maximize the cumulative reward, i.e., take as few steps as possible, since each step yields -1. The optimal path is the one just above the cliff, because it takes only 13 steps and has a return of -13. I use the two TD(0)-based methods above to see whether they can find this optimal path.

The experimental environment is as follows:
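As a minimal, self-contained sketch, an environment consistent with the rules above could look roughly like this; the 4x12 grid size, class name, and 0-3 action encoding are assumptions rather than details taken from the article:

```python
# Illustrative sketch only: grid size, class name, and action encoding are assumptions.
class CliffWalk:
    """Start S at (3, 0), goal G at (3, 11); the bottom-row cells between them are the cliff.
    Rewards: -1 per step, -100 for stepping into the cliff (which also ends the episode)."""
    HEIGHT, WIDTH = 4, 12
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def reset(self):
        self.pos = (3, 0)   # start state S
        self.steps = 0
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.HEIGHT - 1)
        col = min(max(self.pos[1] + dc, 0), self.WIDTH - 1)
        self.pos = (row, col)
        self.steps += 1
        if row == 3 and 1 <= col <= 10:          # fell into the cliff
            self.pos = (3, 0)
            return self.pos, -100.0, True
        done = self.pos == (3, 11) or self.steps >= 100  # reached G or hit the 100-step limit
        return self.pos, -1.0, done
```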

The following hyperparameters are used in training:

  • episodes: 2000
  • discount factor γ: 1
  • α: 0.1, the learning rate
  • ε: 0.1, the exploration probability of the ε-greedy policy (with probability ε an action is chosen uniformly at random)
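With these settings, a training loop might look roughly like the sketch below, which reuses the hypothetical CliffWalk environment above and switches between the two update targets; the state flattening and function names are assumptions:

```python
import numpy as np

# Illustrative training-loop sketch; it assumes the hypothetical CliffWalk class defined earlier.
def epsilon_greedy(Q, s, epsilon=0.1, n_actions=4):
    """With probability epsilon choose a random action, otherwise a greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def train(env, method="q_learning", episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((env.HEIGHT * env.WIDTH, 4))
    idx = lambda pos: pos[0] * env.WIDTH + pos[1]   # flatten (row, col) into a table index
    steps_per_episode = []
    for _ in range(episodes):
        s = idx(env.reset())
        a = epsilon_greedy(Q, s, epsilon)
        done = False
        steps = 0
        while not done:
            pos, r, done = env.step(a)
            s_next = idx(pos)
            a_next = epsilon_greedy(Q, s_next, epsilon)
            if method == "sarsa":
                target = r if done else r + gamma * Q[s_next, a_next]   # Sarsa target
            else:
                target = r if done else r + gamma * np.max(Q[s_next])   # Q-learning target
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
            steps += 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode

# Example usage: Q_sarsa, steps_sarsa = train(CliffWalk(), method="sarsa")
```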

Results:

Sarsa and Q-learning take roughly the same time to converge, but only Q-learning learns the optimal 13-step path. Sarsa does not learn the optimal path; it chooses to stay away from the cliff. This is because its update uses the next state-action value of the action actually chosen by the ε-greedy policy, so the values of the states just above the cliff are dragged down by occasional exploratory falls.

Q-learning, in contrast, uses the maximum action value of the next state in its update, so it can move right along the edge of the cliff to the goal state G. The figure below shows the number of steps per episode during training. To make the chart smoother, the step counts are averaged in groups of 20. We can clearly see that Q-learning is able to find the optimal path.
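For reference, the group-of-20 averaging used for the plots can be done with a small helper like this (illustrative only, not the article's plotting code):

```python
import numpy as np

# One way to do the group-of-20 averaging mentioned above; the function name is an assumption.
def group_average(values, group=20):
    """Average consecutive values in groups of `group` to smooth the curve."""
    n = len(values) // group * group
    return np.asarray(values[:n], dtype=float).reshape(-1, group).mean(axis=1)
```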

The graph below shows the online performance of the two algorithms (these values are again averaged in groups of 20). We can see that Sarsa performs better than Q-learning online. This is because, even as Q-learning learns the optimal path, it occasionally falls off the cliff: the state-action pairs used for its updates are generated by an ε-greedy behavior policy. Sarsa, on the other hand, learns to stay away from the cliff edge, which reduces the chance of falling in.

Summary

This simple example compares Sarsa and Q-learning. To close, here is a summary of the differences between the two algorithms:

Both Sarsa and Q-learning are reinforcement learning algorithms based on temporal difference methods, but they differ in several important ways when solving reinforcement learning problems in Markov decision processes (MDPs).

  1. Update rule
     - Sarsa: the update follows "state-action-reward-next state-next action", i.e. the target uses the action actually taken in the new state. Because Sarsa evaluates the policy it is actually following, its learning process is more stable and it learns the characteristics of that policy, exploration included.
     - Q-learning: the update follows "state-action-reward-max next action", i.e. the target uses the action with the largest value estimate in the new state. This pushes Q-learning toward the optimal policy, but it can also make the learning process less stable and more susceptible to noise.
  2. Learning style
     - Sarsa: because the update uses the action actually executed in the new state, Sarsa is well suited to online learning while interacting with the environment. In practice it tends to be more stable, but it may converge more slowly.
     - Q-learning: it is geared toward learning the optimal policy; since its update does not depend on the action actually executed next, it can converge faster in some cases, but it is also more susceptible to noise.
  3. Exploration
     - Sarsa: since it accounts for the next action actually performed in the new state, it tends to explore in line with the current policy, which may suit tasks that call for more exploration.
     - Q-learning: its update is not tied to the current behavior policy, so it explores more freely during learning. However, this decoupled exploration can cause Q-learning to over-explore in some cases and fail to converge.
  4. Typical use cases
     - Sarsa: tasks that require a stable learning process, care about behavior during exploration, or involve online learning while interacting with the environment.
     - Q-learning: tasks whose goal is the optimal policy, or where fast convergence is required.

Sarsa and Q-learning are just two of the many algorithms in the field of reinforcement learning. There are more advanced algorithms such as Deep Q-Networks (DQN) and Actor-Critic methods, and the appropriate algorithm can be chosen according to the complexity and requirements of the problem.

Finally, if you want to experiment on your own, here is the source code for the two experiments in this article:

https://avoid.overfit.cn/post/b7ecfa32ef354a4e9e0c9e2e5da7376d

By Kim Rodgers


Source: https://blog.csdn.net/m0_46510245/article/details/132244489