[Reinforcement Learning] One of the commonly used algorithms: SARSA

 


I will continue to share machine-learning, deep-learning, and Python-related content, solutions to everyday bugs, and practical tips for Windows & Linux.

If you find an error in the article, please point it out and I will correct it promptly. If you have other needs, you can message me privately or email me at: [email protected]

        Reinforcement learning is a machine learning method that learns optimal behavioral strategies through continuous interaction with the environment. The SARSA (State-Action-Reward-State-Action) algorithm is one of the classic algorithms in reinforcement learning; it is used to find the optimal policy in a Markov decision process (MDP). This article introduces the development history, algorithm principle, properties, and usage of the SARSA algorithm in detail, with sample code and running results.

This article will explain in detail one of the commonly used algorithms for reinforcement learning, "SARSA".


 

Table of contents

1. Introduction

2. History

3. Algorithm formula

1. SARSA algorithm formula

2. The principle of SARSA algorithm

4. Algorithm properties

5. Example code

6. Summary


1. Introduction

        Reinforcement learning is a method of maximizing cumulative reward by learning through interaction with the environment. In reinforcement learning, an agent in a given environment chooses an action based on the current state; after the action is executed, the environment transitions to a new state and the agent receives a reward. The goal of reinforcement learning is for the agent to learn to choose a sequence of actions that obtains the maximum cumulative reward, that is, to find the optimal policy. The SARSA algorithm is a reinforcement learning algorithm based on state-action values that is used to learn this optimal policy.

2. History

        The SARSA algorithm was originally proposed by G. A. Rummery and M. Niranjan (under the name "Modified Connectionist Q-Learning"), and the name SARSA was popularized by Richard Sutton and Andrew Barto in their book "Reinforcement Learning: An Introduction". SARSA is closely related to the Q-learning algorithm, and like Q-learning it is an algorithm based on a value function.

        The Q-learning algorithm is a reinforcement learning algorithm based on state-action values: it learns the optimal strategy by maintaining a Q-value table that stores the value of each state-action pair. Q-learning is an off-policy algorithm: its update bootstraps on the greedy (maximum-value) action in the next state, regardless of which action the agent actually takes next. SARSA, in contrast, is an on-policy algorithm: it updates toward the value of the action that the current behavior policy actually selects in the next state.

        In its basic form, SARSA is therefore also a tabular algorithm based on a value function and applies, like tabular Q-learning, to discrete state and action spaces. Combined with function approximation, both SARSA and Q-learning can be extended to problems with large or continuous state spaces; in that case the policy (for example, ε-greedy) is derived from the approximated Q-value function.
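
        For comparison (restated here in standard notation rather than quoted from the original article), the two update targets differ only in how the next action is chosen:

Q-learning target:  r + γ max_a' Q(s', a')
SARSA target:       r + γ Q(s', a'),  where a' is the action actually selected by the policy in s'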

3. Algorithm formula

1. SARSA algorithm formula

        The update formula of the SARSA algorithm is as follows:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]

        where Q(s, a) is the state-action value of taking action a in state s, r is the immediate reward obtained after executing action a, α is the learning rate, γ is the discount factor, s' is the new state, and a' is the action chosen in the new state s'.
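
        As a small worked example (with values chosen purely for illustration): if α = 0.1, γ = 0.9, r = 1, Q(s, a) = 0.5 and Q(s', a') = 0.8, the update gives Q(s, a) ← 0.5 + 0.1 × (1 + 0.9 × 0.8 − 0.5) = 0.622.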

2. The principle of SARSA algorithm

        The core idea of the SARSA algorithm is to learn the optimal policy by continuously updating the state-action value function Q(s, a). The algorithm proceeds in the following steps:

  • Initialize the value of the state-action value function Q(s, a) and the policy function π(a|s).
  • In each time step t, an action a is selected according to the current state s and the policy function π.
  • Perform action a, observe the obtained immediate reward r and new state s'.
  • Choose a new action a' according to the new state s' and the policy function π.
  • Update the value of the state-action value function Q(s, a), using the SARSA update formula.
  • Take the new state s' and new action a' as the next state s and action a.
  • Repeat the above steps until the termination condition is met.

        By continuously iteratively updating the state-action value function Q(s, a), the SARSA algorithm can gradually approach the optimal state-action value function, thereby obtaining the optimal policy.
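
        As an illustration, here is a minimal Python sketch of the loop described above. It assumes a hypothetical tabular environment object env whose reset() method returns an integer state and whose step(action) method returns (next_state, reward, done); this interface is an assumption made for illustration and is not part of the original article.

import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # With probability epsilon explore; otherwise act greedily with respect to Q
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, episodes=500):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()                          # initial state
        action = epsilon_greedy(Q, state, epsilon)   # initial action from the current policy
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon)
            # SARSA update: bootstrap on the action actually selected by the policy
            Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])
            state, action = next_state, next_action
    return Q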

4. Algorithm properties

        The SARSA algorithm has the following properties:

  1. Model-free: the SARSA algorithm does not require a model of the environment; it learns the optimal strategy purely by interacting with the environment.
  2. Convergence: under certain conditions (sufficient exploration of every state-action pair and an appropriately decaying learning rate), the SARSA algorithm converges to the optimal strategy.
  3. Applicability: in its basic tabular form, SARSA applies to discrete state and action spaces; combined with function approximation, it can also be applied to problems with large or continuous state spaces.

5. Example code

import numpy as np

# Define the maze environment: 0 = empty cell, -1 = trap (penalty), 1 = goal (reward)
maze = np.array([
    [0, 0, 0, 0],
    [0, -1, 0, -1],
    [0, 0, 0, -1],
    [-1, 0, 0, 1]
])

# Define the start state and the goal state
start_state = (3, 0)
goal_state = (3, 3)

# Define the action space: up, down, left, right (matching the printed labels below)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

# Initialize the state-action value function
Q = np.zeros((4, 4, 4))

# Define the parameters
alpha = 0.1       # learning rate
gamma = 0.9       # discount factor
epsilon = 0.1     # exploration rate of the epsilon-greedy policy
max_episodes = 100

# SARSA algorithm
for episode in range(max_episodes):
    state = start_state
    # Choose the first action with an epsilon-greedy policy
    action = np.random.choice(range(4)) if np.random.rand() < epsilon else np.argmax(Q[state])

    while state != goal_state:
        # Apply the action and clamp the result to the maze boundaries
        a = min(max(state[0] + actions[action][0], 0), 3)
        b = min(max(state[1] + actions[action][1], 0), 3)
        next_state = (a, b)
        reward = maze[next_state]
        # Choose the next action with the same epsilon-greedy policy (on-policy)
        next_action = np.random.choice(range(4)) if np.random.rand() < epsilon else np.argmax(Q[next_state])
        # SARSA update
        Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])

        state = next_state
        action = next_action

# Print the learned state-action values
for i in range(4):
    for j in range(4):
        print("State:", (i, j))
        print("Up:", Q[i][j][0])
        print("Down:", Q[i][j][1])
        print("Left:", Q[i][j][2])
        print("Right:", Q[i][j][3])
        print()

        The output of one run is as follows (exact values vary from run to run because action selection is stochastic):

State: (0, 0)
Up: -0.008042294056935573
Down: -0.007868742418369764
Left: -0.016173595452674966
Right: 0.006662566560762523

State: (0, 1)
Up: 0.048576025675988774
Down: -0.0035842473161881465
Left: 0.024420015715567546
Right: -0.46168987981312615

State: (0, 2)
Up: 0.04523751845081987
Down: 0.04266319340558091
Left: 0.044949583791193154
Right: 0.026234839551098416

State: (0, 3)
Up: 0.01629652821649763
Down: 0.050272192325180515
Left: -0.009916869922464355
Right: -0.4681667868865369

State: (1, 0)
Up: -0.09991342319696966
Down: 0.0
Left: 0.0
Right: 0.036699099068340166

State: (1, 1)
Up: 0.008563965102313987
Down: 0.0
Left: 0.0
Right: 0.3883250678150607

State: (1, 2)
Up: -0.3435187267522706
Down: -0.2554776873673874
Left: 0.05651543121932354
Right: 0.004593450910446022

State: (1, 3)
Up: -0.1
Down: -0.013616634831997914
Left: 0.01298827764814053
Right: 0.0

State: (2, 0)
Up: 0.28092113053540924
Down: 0.0
Left: 0.0024286388798406364
Right: 0.06302299434701504

State: (2, 1)
Up: 0.0
Down: 0.0
Left: -0.16509175606504775
Right: 1.9146361697676122

State: (2, 2)
Up: -0.1
Down: 0.0
Left: 0.03399106390140035
Right: 0.0

State: (2, 3)
Up: -0.3438668479533914
Down: 0.004696957810272524
Left: -0.19
Right: 0.0

State: (3, 0)
Up: 3.3060693607932445
Down: 0.8893977121867367
Left: 0.0
Right: 0.13715553550041798

State: (3, 1)
Up: 4.825854511712306
Down: -0.03438123168566812
Left: 0.10867882029322147
Right: 1.0015572397722665

State: (3, 2)
Up: 5.875704328143301
Down: 0.9315770230698863
Left: 0.0006851481810742227
Right: 0.47794799892127526

State: (3, 3)
Up: 5.4028951599661275
Down: 2.6989177956329757
Left: -0.6454474033238188
Right: 0.018474082554518417

        By running the sample code, we can get the optimal action in each state and the corresponding state-action value.
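
        As a small follow-up sketch (assuming the Q array learned by the sample code above is still in scope), the greedy action in each state can be read off with np.argmax:

# Derive the greedy policy from the learned Q table
action_names = ["Up", "Down", "Left", "Right"]
for i in range(4):
    for j in range(4):
        best = int(np.argmax(Q[i, j]))
        print("State", (i, j), "-> best action:", action_names[best])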

6. Summary

        This article has introduced the SARSA algorithm in reinforcement learning in detail, including its development history, algorithm principle, properties, and usage, and has given sample code for solving a small maze problem. The SARSA algorithm is model-free, converges to the optimal policy under suitable conditions, and, when combined with function approximation, can be applied to problems with large or continuous state spaces. By iteratively updating the state-action value function, SARSA gradually approaches the optimal state-action values and thereby learns the optimal behavior policy through interaction with the environment.

 

 


Origin blog.csdn.net/Code_and516/article/details/131445162