(Reinforcement Learning) Q-Learning code practice

Table of contents

Basic knowledge

MDP (Markov Decision Process)

Elements and architecture of reinforcement learning

Algorithmic thinking

Code


Basic knowledge

MDP (Markov Decision Process)

MDP (Markov Decision Process) is a mathematical framework for describing sequential decision-making problems with randomness. In an MDP, the agent repeatedly takes actions and observes feedback from the environment, and through this interaction learns a policy for choosing the best action in each state. MDPs are commonly used in the field of reinforcement learning.

MDP contains the following elements:

State: The state of the agent at a certain moment.

Action: The action taken by the agent in a certain state.

Reward: The reward or punishment that an agent receives after taking an action in a certain state.

Transition probability: The probability of moving from one state to another after taking a given action.

Discount factor: used to measure the value of future rewards, usually between 0 and 1.

Policy: The rules for an agent to take actions in each state.

In an MDP, the agent's goal is to find an optimal policy that maximizes its accumulated reward in the long run. To achieve this, the agent must learn, through trial and error, the value of taking different actions in different states, and choose the best action according to these values. The value function and the Q function are the functions commonly used in MDPs to represent value.
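
For reference, the standard textbook definitions of these two functions for a policy π with discount factor γ can be written as follows (general background, not specific to the code later in this post):

V_π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s ]

Q_π(s, a) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, a_t = a ]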

Elements and architecture of reinforcement learning

Reinforcement learning systems generally include four elements: policy, reward, value, and environment or model.

Policy: Policy defines the behavior of an agent for a given state.

Reward: The reward signal defines the goal of the reinforcement learning problem.

Value: The value function. Unlike the immediate nature of rewards, the value function measures long-term benefit.

Environment (model): The external environment, also called the model, i.e., a simulation of how the environment responds to the agent's actions.

[Figure: Reinforcement learning architecture]
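
To make the interaction loop concrete, here is a tiny, self-contained sketch of how these pieces fit together (the toy 1-D environment and the random policy below are purely illustrative and are not part of the original post):

import random

# Toy 1-D environment: 5 cells in a row; the agent starts at cell 0 and the goal is cell 4.
def step(state, action):
    next_state = min(max(state + action, 0), 4)   # move left (-1) or right (+1), clipped to the grid
    reward = 1 if next_state == 4 else -1         # the reward signal encodes the goal
    done = next_state == 4
    return next_state, reward, done

state, total_reward = 0, 0
while True:
    action = random.choice([-1, 1])               # policy: how the agent picks an action in a state
    state, reward, done = step(state, action)     # the environment returns the next state and a reward
    total_reward += reward                        # value would be the long-run accumulation of these rewards
    if done:
        break
print('episode return:', total_reward)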

Algorithmic thinking

The Q-learning algorithm is a value-iteration-based reinforcement learning algorithm used to learn the optimal policy for an agent interacting with its environment. The basic idea of the algorithm is to guide action selection by learning a Q-value function, where the Q-value represents the expected return of taking a given action in a given state.

In the Q-learning algorithm, the agent continuously updates the Q-value function by interacting with the environment. Specifically, at each time step t the agent observes the current state s_t, selects an action a_t based on the current state and the Q-value function, executes the action, observes the next state s_{t+1} and the corresponding reward r_{t+1}, and then updates the Q-value function according to the Q-learning update rule. The update rule is as follows:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

Here Q(s, a) is the Q-value of taking action a in state s (i.e., the value of the current action), α is the learning rate, γ is the discount factor, and max_{a′} Q(s′, a′) is the maximum Q-value obtainable in the next state s′ by taking the best action there.
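
As a quick sanity check with made-up numbers (not from the original post): suppose α = 0.1, γ = 0.9, the current estimate is Q(s, a) = 2.0, the observed reward is r = −1, and the best Q-value in the next state is max_{a′} Q(s′, a′) = 3.0. Then

Q(s, a) ← 2.0 + 0.1 × [ −1 + 0.9 × 3.0 − 2.0 ] = 2.0 + 0.1 × (−0.3) = 1.97

so the estimate is nudged slightly toward the temporal-difference target −1 + 0.9 × 3.0 = 1.7.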

The core idea of the Q-learning algorithm is to guide action selection by continuously updating the Q-value function, eventually learning an optimal policy. In practical applications, the algorithm requires the state space and action space to be discretized so that they can be represented as a Q-value table. In addition, to improve the stability and convergence speed of the algorithm, techniques such as experience replay and exploration strategies (e.g., ε-greedy) can be used.

Code
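
The snippets below refer to a Q-table Q and helper functions get_state, move and get_action that are only defined in the complete code linked at the end. So that the excerpts can be run on their own, here is a rough, hypothetical sketch of those pieces for a 4×12 cliff-walking-style grid; the actual layout, rewards and exploration settings in the original code may differ:

import random
import numpy as np

# Hypothetical stand-ins for the pieces used by the snippets below; the real
# definitions are in the complete code linked at the end and may differ.
Q = np.zeros([4, 12, 4])            # Q-table: 4 rows x 12 columns x 4 actions

def get_state(row, col):
    # Assumed layout: the bottom-right cell is the goal, the rest of the bottom row
    # (except the start column) is a trap, everything else is ordinary ground.
    if row == 3 and col == 11:
        return 'terminal'
    if row == 3 and col > 0:
        return 'trap'
    return 'ground'

def move(row, col, action):
    # Actions: 0 = up, 1 = down, 2 = left, 3 = right, clipped at the grid border.
    if action == 0:
        row = max(row - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = max(col - 1, 0)
    else:
        col = min(col + 1, 11)
    # Assumed rewards: -1 per step, -100 for falling into the trap, +100 at the goal.
    if get_state(row, col) == 'trap':
        reward = -100
    elif get_state(row, col) == 'terminal':
        reward = 100
    else:
        reward = -1
    return row, col, reward

def get_action(row, col, epsilon=0.1):
    # Epsilon-greedy exploration: random action with probability epsilon,
    # otherwise the action with the highest Q-value in this state.
    if random.random() < epsilon:
        return random.choice(range(4))
    return int(Q[row, col].argmax())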

Update rule

def get_update(row, col, action, reward, next_row, next_col):
    # target is the best score of the next cell; the 0.9 is the discount factor gamma,
    # and this calculation does not depend on which action is actually taken next
    target = 0.9 * Q[next_row, next_col].max()
    # add the reward obtained on this step
    target += reward

    # value is the current score of this state-action pair
    value = Q[row, col, action]

    # temporal-difference: the current (state, action) score should equal
    # the next state's score * gamma + reward; here we take the difference
    # between the two, which should be as close to 0 as possible
    update = target - value

    # this 0.1 acts as the learning rate
    update *= 0.1

    return update


get_update(0, 0, 3, -1, 0, 1)

train

def train():
    for epoch in range(1500):
        # initialize the current position (random row, leftmost column)
        row = random.choice(range(4))
        col = 0

        # choose the first action
        action = get_action(row, col)

        # accumulate this episode's rewards; the total should improve as training progresses
        reward_sum = 0

        # loop until reaching the terminal cell or falling into a trap
        while get_state(row, col) not in ['terminal', 'trap']:

            # execute the action
            next_row, next_col, reward = move(row, col, action)
            reward_sum += reward

            # choose an action for the new position
            next_action = get_action(next_row, next_col)

            # compute the update
            update = get_update(row, col, action, reward, next_row, next_col)

            # apply the update to the Q-table
            Q[row, col, action] += update

            # move to the new position
            row = next_row
            col = next_col
            action = next_action

        if epoch % 100 == 0:
            print(epoch, reward_sum)


train()
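
After training, a simple greedy rollout (not part of the original excerpt) can be used to check what the Q-table has learned. It reuses the same Q, get_state and move as above; the start position and step cap are assumptions:

def test(row=0, col=0, max_steps=200):
    # Follow the greedy policy (no exploration) and report the total reward.
    reward_sum = 0
    for _ in range(max_steps):                 # step cap so a poor policy cannot loop forever
        if get_state(row, col) in ['terminal', 'trap']:
            break
        action = int(Q[row, col].argmax())     # always take the highest-valued action
        row, col, reward = move(row, col, action)
        reward_sum += reward
    return reward_sum


print(test())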

Complete code (4 - temporal-difference algorithm): https://download.csdn.net/download/qq_46684028/88076627

Related code: using the Q-learning algorithm to learn to play a maze game: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/2_Q_Learning_maze


Original post: https://blog.csdn.net/qq_46684028/article/details/131871777