Principle of the DQN Algorithm (Deep Q Network)

1. Reinforcement Learning Concepts

Unlike many other forms of machine learning, the learning system is not told which actions to take.

It must discover which behaviors lead to the greatest reward by trying them out.

The current action may affect not only the immediate reward, but also the next state and, through it, all subsequent rewards.


Every action can affect the agent's future state.

Success is measured by a scalar reward signal.

Goal: choose a sequence of actions that maximizes future reward.

The concrete process is a loop: observe, then act, then observe again, and so on.


State

Experience is the sequence of observations, actions, and rewards.

The state is a summary of that experience.
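As a concrete illustration, here is a minimal sketch of this observe-act-observe loop written against the classic (pre-0.26) gym API; CartPole is chosen only because it is a readily available environment, and the same pattern applies to any environment:

import gym

# a minimal observe -> act -> observe loop (classic gym API; CartPole is just an example)
env = gym.make('CartPole-v0')
observation = env.reset()        # first observation
experience = []                  # experience = sequence of (observation, action, reward)
done = False

while not done:
    action = env.action_space.sample()                        # act (here: a random action)
    next_observation, reward, done, info = env.step(action)   # observe the result
    experience.append((observation, action, reward))
    observation = next_observation

env.close()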


2. Markov Decision Process

A Markov decision process requires that:

  1. the desired (goal) state can be detected

  2. trials can be repeated many times

  3. the next state of the system depends only on the current state and the action currently taken, not on any earlier states (the Markov property)

A Markov decision process consists of 5 elements:

S: the set of states

A: the set of actions

P: the state transition probability. It gives the probability distribution over next states after taking an action a ∈ A in the current state s ∈ S; the probability of moving to s' after executing action a in state s is written p(s' | s, a)

R: the reward function, the immediate reward the agent receives after taking an action

γ: the discount factor, which makes the current reward count more than future rewards
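As a minimal sketch, the five elements above (S, A, P, R, γ) can be collected in a plain Python structure; the two-state numbers below are purely illustrative:

from collections import namedtuple

MDP = namedtuple('MDP', ['states', 'actions', 'P', 'R', 'gamma'])

toy_mdp = MDP(
    states=[0, 1],                      # S
    actions=[0, 1],                     # A
    # P[(s, a)][s'] = probability of moving to s' after taking action a in state s
    P={(0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8],
       (1, 0): [0.0, 1.0], (1, 1): [0.9, 0.1]},
    # R[(s, a)] = immediate reward for taking action a in state s
    R={(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): -1.0},
    gamma=0.9,                          # discount factor
)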

State value function: v(s) = E[U_t | S_t = s], where U_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … is the discounted return

It is the expectation of the future reward obtainable from state s at time t

The value function measures how good a state (or state-action pair) is, as the expectation of cumulative reward

Optimal value function: the best cumulative reward expectation over all policies, v*(s) = max_π v_π(s)

Policy: the probability distribution over the possible actions in a given state

3. Bellman equation

Bellman equation: the value of the current state is determined by the immediate reward and the value of the next state.

The value function is thus decomposed into two parts: the immediate reward and the discounted value of the next state.
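Written out in the notation used above, this is the standard state-value form v(s) = E[ R_{t+1} + γ·v(S_{t+1}) | S_t = s ]: the expected immediate reward plus the discounted value of the successor state.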

This is usually solved by an iterative method: each iteration updates the value function of every state until it converges.

Value iteration solution:

Value iteration is a method for solving the Bellman equation. Its basic idea is to update the value function of each state iteratively until it converges to the optimal solution. The specific steps are as follows:

  1. Initialize the value function V(s) to 0, or any non-negative value.

  2. For each state s, the value function is updated according to the following formula:

V(s) = max_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V(s') ], where a ranges over the actions available in state s, s' over the possible next states, and γ is the discount factor.

  3. Repeat step 2 until the value function converges.

Each sweep of value iteration costs on the order of O(N²·|A|) operations for N states, so T iterations cost roughly O(N²·|A|·T). Its advantage is that each update is simple and cheap; for a discounted problem (γ < 1) it is guaranteed to converge to the optimal value function, although many sweeps may be needed.

The prerequisite is to have gym installed; a simple pip install gym is enough.

import numpy as np
import sys
from gym.envs.toy_text import discrete

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4, 4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape
        # ... (the rest of the GridworldEnv class, which builds the transition table P, is omitted here)

# A standalone toy example of value iteration (the MDP below uses illustrative toy numbers)
# Define the state space, action space, transition probabilities and immediate rewards
state_space = [0, 1, 2, 3, 4]
action_space = [0, 1]
# transition_probabilities[(s, a)][s'] = probability of landing in s' after taking a in s
transition_probabilities = {
    (0, 0): [0.5, 0.5, 0.0, 0.0, 0.0],
    (0, 1): [0.0, 0.5, 0.5, 0.0, 0.0],
    (1, 0): [0.1, 0.8, 0.1, 0.0, 0.0],
    (1, 1): [0.0, 0.1, 0.8, 0.1, 0.0],
    (2, 0): [0.0, 0.5, 0.5, 0.0, 0.0],
    (2, 1): [0.0, 0.0, 0.5, 0.5, 0.0],
    (3, 0): [0.0, 0.0, 0.8, 0.1, 0.1],
    (3, 1): [0.0, 0.0, 0.1, 0.1, 0.8],
    (4, 0): [0.0, 0.0, 0.0, 0.5, 0.5],
    (4, 1): [0.0, 0.0, 0.0, 0.0, 1.0]
}
# reward_matrix[(s, a)][s'] = immediate reward of the transition; reaching state 4 pays 10, everything else -1
reward_matrix = {
    (s, a): [10 if next_s == 4 else -1 for next_s in state_space]
    for s in state_space for a in action_space
}

# Initial value function and discount factor
V = {s: 0 for s in state_space}
gamma = 0.9

# Value iteration
T = 1000  # number of iterations
for t in range(T):
    for s in state_space:
        Q = {a: 0 for a in action_space}
        for a in action_space:
            for next_s in state_space:
                Q[a] += transition_probabilities[(s, a)][next_s] * (reward_matrix[(s, a)][next_s] + gamma * V[next_s])
        V[s] = max(Q.values())

# Print the optimal value function and the greedy (optimal) policy
print("Optimal value function:")
for s in state_space:
    print("V(%d) = %f" % (s, V[s]))

print("Optimal policy:")
for s in state_space:
    # recompute the action values for state s under the converged V and pick the greedy action
    Q = {a: sum(transition_probabilities[(s, a)][next_s] * (reward_matrix[(s, a)][next_s] + gamma * V[next_s])
                for next_s in state_space)
         for a in action_space}
    max_action = max(Q.items(), key=lambda x: x[1])[0]
    print("Policy for state %d: take action %d" % (s, max_action))

Handwritten case:

import numpy as np
from gridworld import GridworldEnv

env = GridworldEnv()

def value_iteration(env, theta=0.0001, discount_factor=1.0):
    def one_step_lookahead(state, v):
        # expected value of each action in the given state
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * v[next_state])
        return A

    v = np.zeros(env.nS)

    # iterative update of the value function
    while True:
        delta = 0

        for s in range(env.nS):
            # Do a one step lookahead to find the best action value
            A = one_step_lookahead(s, v)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - v[s]))
            # Update the value function
            v[s] = best_action_value
        # Check if we can stop
        if delta < theta:
            break

    # extract a deterministic greedy policy from the converged value function
    policy = np.zeros((env.nS, env.nA))
    for s in range(env.nS):
        A = one_step_lookahead(s, v)
        best_action = np.argmax(A)
        policy[s, best_action] = 1.0
    return policy, v

policy, v = value_iteration(env)

print("Policy Probability Distribution")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

4. Q-learning

(Figure: the example environment, with state No. 5 as the goal state.)

As shown in the figure, we want the agent to reach goal state No. 5. We attach reward to the paths close to state 5 so that the agent is attracted toward it and eventually reaches the goal.

Q-learning is one of the main reinforcement learning algorithms and is a model-free method. It rests on a key assumption: the interaction between the agent and the environment can be regarded as a Markov decision process (MDP), in which the agent's current state and chosen action determine a fixed state transition probability distribution, the next state, and an immediate reward. The goal of Q-learning is to find a policy that maximizes future rewards.

The core idea of Q-learning is to select the most valuable action using a value table or value function. Q(s, a) denotes the expected future return when starting from state s and taking action a. The Q-learning algorithm maintains a Q-table that records the Q value obtained by taking each action a (a ∈ A) in each state s (s ∈ S). Before exploring the environment, the Q-table is initialized. As the agent interacts with the environment, the algorithm uses the Bellman equation to iteratively update Q(s, a), producing a new Q-table after each round. Through continued interaction the table eventually converges, and the agent can then use it to decide which action to take in a given state to obtain the maximum Q value.

Q-learning iterative calculation:

Step 1: Given the learning parameter γ and the reward matrix R

Step 2: Initialize Q = 0

Step 3: For each episode:

Step 3 can be broken down further: first choose an initial state s at random. Then, while the goal state has not been reached, select an action a from the actions available in the current state s, execute it to obtain the next state s1, update Q(s, a) according to the update rule, and finally assign s1 to s for the next iteration (see the sketch below).

This may take thousands of episodes to converge.
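A minimal tabular sketch of this loop (the 6-state chain environment and its step function below are made up purely for illustration, echoing the goal-state-5 example; the update is the standard Q-learning rule Q(s,a) ← Q(s,a) + α·(r + γ·max Q(s1,·) − Q(s,a))):

import numpy as np
import random

N_STATES, N_ACTIONS = 6, 2
GOAL = 5
gamma, alpha, epsilon = 0.9, 0.1, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))    # Step 2: Q = 0

def step(s, a):
    # made-up dynamics: action 1 moves one state toward the goal, action 0 stays put
    s_next = min(s + 1, GOAL) if a == 1 else s
    r = 10 if s_next == GOAL else -1   # reward attracts the agent toward the goal state
    return s_next, r

for episode in range(1000):            # Step 3: for each episode
    s = random.randrange(N_STATES - 1) # random initial state (anything but the goal)
    while s != GOAL:
        # epsilon-greedy choice of an action in the current state
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = int(np.argmax(Q[s]))
        s1, r = step(s, a)
        # standard Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max Q(s1,.) - Q(s,a))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s1]) - Q[s, a])
        s = s1                         # move on to the new state

print(Q)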

5. Deep Q Network


The Q-table is a key concept in the Q-learning algorithm: a table that records the Q value corresponding to each state and action.

Each row in the Q-table represents a state, each column represents an action, and each element Q(s, a) in the table represents the expected maximum return obtainable by taking action a in state s. In the Q-learning algorithm, the agent continuously explores and interacts with the environment, updating the Q-table, and thereby gradually learns which actions to take in a given state to obtain the greatest return. When the state is a raw game image, however, the state space is far too large for a table, so a Deep Q Network replaces the Q-table with a deep neural network that takes the preprocessed frames as input and outputs one Q value per action. The frames are prepared as follows:

  1. Convert image to grayscale

  2. Resize image to 80 * 80

  3. Stack the last 4 frames to produce an 80 * 80 * 4 input array for the network (a sketch follows below)
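A minimal sketch of these three steps with OpenCV and NumPy (frame here stands for one raw RGB game frame and frames for the last four processed frames; the dummy frame size is just for illustration):

import cv2
import numpy as np

def preprocess(frame):
    # 1. convert the image to grayscale
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # 2. resize the image to 80 * 80
    return cv2.resize(gray, (80, 80))

def stack_frames(frames):
    # 3. stack the last 4 processed frames into an 80 * 80 * 4 network input
    return np.stack(frames[-4:], axis=2)

# usage with a dummy frame (a real frame would come from the game)
frame = np.zeros((512, 288, 3), dtype=np.uint8)
state = stack_frames([preprocess(frame)] * 4)
print(state.shape)   # (80, 80, 4)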

Exploration vs. exploitation: we need both.

ε-greedy exploration: with probability ε the agent takes a random action (explores) instead of the currently best-valued action (exploits).

6. DQN environment construction

We use the Flappy Bird game (the 'bird' of the code below) as the running example.


import tensorflow as tf             # NOTE: this example uses the TensorFlow 1.x API
import cv2
import sys
sys.path.append('game')
import wrapped_flappy_bird as game  # the Flappy Bird wrapper inside the 'game' directory (assumed available)
import random
import numpy as np
from collections import deque

GAME = 'bird'
# two possible actions: flap (up) or do nothing
ACTIONS = 2
GAMMA = 0.99
OBSERVE = 1000
EXPLORE = 3000000
FINAL_EPSILON = 0.0001
INITIAL_EPSILON = 0.1
REPLAY_MEMORY = 50000
BATCH = 32
FRAME_PER_ACTION = 1

def createNetwork():
    # three convolutional layers (note: the pooling layer has no parameters)
    W_conv1 = weights_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])

    W_conv2 = weights_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])

    W_conv3 = weights_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])

    W_fc1 = weights_variable([1600, 512])
    b_fc1 = bias_variable([512])

    W_fc2 = weights_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])

    # input: 4 stacked 80*80 grayscale frames
    s = tf.placeholder('float', [None, 80, 80, 4])

    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)

    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    # h_pool2 = max_pool_2x2(h_conv2)
    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)

    # reshape flattens the 3-D feature map into a vector (5*5*64 = 1600)
    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])

    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)

    # readout layer: one Q value per action
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2
    return s, readout, h_fc1

def weights_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.01)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.01, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides=[1, stride, stride, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

def trainNetwork(s, readout, h_fc1, sess):
    # placeholders for the chosen action (one-hot) and the target Q value
    a = tf.placeholder('float', [None, ACTIONS])
    y = tf.placeholder('float', [None])

    # Q value of the action that was actually taken
    readout_action = tf.reduce_sum(tf.multiply(readout, a), axis=1)
    cost = tf.reduce_mean(tf.square(y - readout_action))
    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)

    game_state = game.GameState()

    # replay memory
    D = deque()
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1

    x_t, r_0, terminal = game_state.frame_step(do_nothing)
    # resize the frame to 80*80, convert to grayscale, then threshold to a 0/255 binary image
    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, x_t = cv2.threshold(x_t, 1, 255, cv2.THRESH_BINARY)

    # initial state: the same frame stacked 4 times
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    saver = tf.train.Saver()
    sess.run(tf.initialize_all_variables())
    checkpoint = tf.train.get_checkpoint_state('saved_networks')

    if checkpoint and checkpoint.model_checkpoint_path:
        saver.restore(sess, checkpoint.model_checkpoint_path)
        print('Successfully loaded')
    else:
        print('load failed')

    epsilon = INITIAL_EPSILON
    t = 0
    while 'flappy bird' != 'angry bird':
        # Q values predicted by the network for the current state
        readout_t = readout.eval(feed_dict={s: [s_t]})[0]
        a_t = np.zeros([ACTIONS])
        action_index = 0

        if t % FRAME_PER_ACTION == 0:
            if random.random() <= epsilon:
                print('Random Action')
                action_index = random.randrange(ACTIONS)
                a_t[action_index] = 1
            else:
                # decide whether the bird flaps upward or does nothing
                action_index = np.argmax(readout_t)
                a_t[action_index] = 1

        # anneal epsilon from INITIAL_EPSILON toward FINAL_EPSILON
        if epsilon > FINAL_EPSILON and t > OBSERVE:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
        x_t1 = np.reshape(x_t1, (80, 80, 1))
        # new state: the new frame plus the 3 most recent previous frames
        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)

        # store the transition in replay memory
        # s_t: current state, a_t: action taken, r_t: reward received,
        # s_t1: new state, terminal: whether the episode ended
        D.append((s_t, a_t, r_t, s_t1, terminal))
        if len(D) > REPLAY_MEMORY:
            D.popleft()

        if t > OBSERVE:
            # sample a random minibatch of transitions from replay memory
            minibatch = random.sample(D, BATCH)

            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            y_batch = []

            # network output (Q values) for the next states
            readout_j1_batch = readout.eval(feed_dict={s: s_j1_batch})
            for i in range(0, len(minibatch)):
                terminal_i = minibatch[i][4]

                if terminal_i:
                    y_batch.append(r_batch[i])
                else:
                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            train_step.run(feed_dict={
                y: y_batch,
                a: a_batch,
                s: s_j_batch,
            })

        # update information
        s_t = s_t1
        t += 1
        if t % 10000 == 0:
            saver.save(sess, './' + GAME + '-dqn', global_step=t)

        state = ''
        if t <= OBSERVE:
            state = 'OBSERVE'
        else:
            state = 'train'

        print('TIMESTEP %d / STATE %s / EPSILON %f / ACTION %d / REWARD %f' % (t, state, epsilon, action_index, r_t))
    
def playGame():
    sess = tf.InteractiveSession()
    s, readout, h_fc1 = createNetwork()
    # train the network
    trainNetwork(s, readout, h_fc1, sess)

def main():
    playGame()

if __name__ == '__main__':
    main()

 


Origin blog.csdn.net/Williamtym/article/details/132454886