Reinforcement learning: Q-learning analysis and demonstration (introductory)

For background, see the following references:

https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/1_command_line_reinforcement_learning/treasure_on_right.py

https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake.ipynb

https://www.cnblogs.com/hhh5460/p/10134018.html

http://baijiahao.baidu.com/s?id=1597978859962737001&wfr=spider&for=pc

https://www.jianshu.com/p/29db50000e3f

Problem statement

The goal is to design a path so that a robot can travel on its own to a target while avoiding obstacles.

As illustrated below, the robot may be placed in any grid cell of the map; it has to work out its surroundings and eventually reach the target position.

(Figure omitted: a 5x5 grid with red obstacle squares and a blue target square.)

Here is the result of one run:

First, the grid cells are numbered 0 to 24, row by row, to identify the positions (the numbering figure is omitted here).
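The numbering is assumed here to run row by row from the top-left corner (cell 0) to the bottom-right corner (cell 24), which is consistent with the movement rules used later in the code (right = +1, down = +5). A minimal sketch of that mapping:

# Assumed row-major numbering of the 5x5 grid:
# 0 is the top-left corner, 4 the top-right, 24 the bottom-right target.
def cell_to_row_col(state, n_cols=5):
    return divmod(state, n_cols)   # (row, column) of a cell index

print(cell_to_row_col(4))    # (0, 4): top-right corner
print(cell_to_row_col(24))   # (4, 4): bottom-right corner, the target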

 

Then the Q-learning reward mechanism is used to update the values in the table; after training it ends up as follows:

(Figure omitted: the final Q-table after training.)

The robot's actual path is then chosen by following the largest value in the table at each step; it eventually arrives at position 24 while avoiding the red squares.

For example, with 4 as the starting position, the largest value first selects 'left' to cell 3, then 'down' from 3 to cell 8, then 'down' again, and so on until the path is complete. This selection is what Q-learning provides.

The idea behind Q-learning

Reward mechanism

In an unfamiliar environment, the robot first picks its direction at random; starting from the start point, it tries all kinds of ways to complete the path.

But when the robot steps onto a red square it is punished, so after several rounds it learns to avoid the punished positions.

When the robot reaches the blue square it is rewarded, so after repeated rounds it tends to head for the blue square's location.

The specific formula

The whole reward-and-punishment process is expressed through the values in a table.

Initially the table is empty, that is, every value in it is 0:

(Figure omitted: the initial Q-table, 25 rows (one per position) by 4 action columns, all zeros.)

After every action the table is updated according to the reward or punishment received, which is what constitutes the learning process. In the implementation, the rewards and punishments themselves are also written down as a table of the same shape.

The update formula for the values is:

Bellman equation (the Q-learning update rule):

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') - Q(s, a) ]

Here Q is the current Q-table, a table of 25 rows (one per position) and 4 columns (one per action); α is the learning rate; r is the reward or punishment obtained for the next action; and γ is the greed (discount) factor: the larger it is, the more the update leans toward rewards in the distant future.

(The formula is written a little differently in different references; for example, the greed factor is sometimes increased or decreased as the number of steps grows.)
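As a small numeric illustration (a sketch using the α = 0.8 and γ = 0.9 values set in the code below, with made-up Q-values), a single update of one table entry looks like this:

ALPHA, GAMMA = 0.8, 0.9   # learning rate and greed (discount) factor, as in the code below

# hypothetical values for one step
q_current   = 0.0   # Q(s, a) before the update
reward      = 1.0   # r, the reward received for taking action a in state s
best_next_q = 2.0   # max over a' of Q(s', a')

q_target  = reward + GAMMA * best_next_q                 # 1.0 + 0.9 * 2.0 = 2.8
q_current = q_current + ALPHA * (q_target - q_current)   # 0.0 + 0.8 * 2.8 = 2.24
print(q_current)   # 2.24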

 

 Recommended reading:

https://www.jianshu.com/p/29db50000e3f

 

That page also walks through several rounds of Q-table updates in detail.

Code implementation - preparation

 

 

It should be said that this code builds on an earlier example: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/1_command_line_reinforcement_learning/treasure_on_right.py

His explanation of that code: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-1-general-rl/

 

His program shows how a robot learns to move in a one-dimensional world; it does not involve obstacles, and it uses somewhat more advanced programming to display the path process as it runs.

This article focuses instead on how the path is found and displayed, and walks through the complete idea with an example.

Import the required libraries and set up the model parameters:

import numpy as np
import pandas as pd
import time

 

N_STATES = 25        # number of cells in the 5x5 grid world
ACTIONS = ['left', 'right', 'up', 'down']     # available actions
EPSILON = 0.3        # greedy factor: probability of exploiting the best known action
ALPHA = 0.8          # learning rate
GAMMA = 0.9          # discount ("greed") factor
MAX_EPISODES = 100   # maximum number of training episodes
FRESH_TIME = 0.00001 # refresh time for one move (not used in the code below)

 

A function to create the Q-table:

def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # the actions' names
    )
    return table
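As a quick check (not part of the original code), the freshly built table has one row per grid cell and one column per action, with every value still at 0:

q = build_q_table(N_STATES, ACTIONS)
print(q.shape)    # (25, 4)
print(q.head(3))  # first three rows: columns left, right, up, down, all zeros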

The function that selects an action:

The selection uses such a long chain of conditions so that, at the border of the grid, the robot cannot pick a move that leaves the board, which would otherwise push the state index outside the table.

When the greedy coefficient is small, the random scheme is used more often; the action is also picked at random when all the values for the current state are still 0, as they are at the start.

When np.random.uniform() <= EPSILON, the best action learned so far is used, i.e. the greedy Q-learning choice, so the robot does not keep wandering toward far-off unknown cells. (Note that the role of this greedy coefficient is the opposite of the usual convention: here a larger EPSILON means more exploitation, not more exploration.)

def choose_action(state, q_table):
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy or state-action have no value
        if state==0:
            action_name=np.random.choice(['right','down'])
        elif state>0 and state<4:
            action_name=np.random.choice(['right','down','left'])
        elif state==4:
            action_name=np.random.choice(['left','down'])
        elif state==5 or state==15 or state==10 :
            action_name=np.random.choice(['right','up','down'])
        elif state==9 or state==14 or state==19 :
            action_name=np.random.choice(['left','up','down'])
        elif state == 20:
            action_name=np.random.choice(['right','up'])
        elif state>20 and state<24:    
            action_name=np.random.choice(['right','up','left'])
        elif state==24:
            action_name=np.random.choice(['left','up'])
        else:
            action_name=np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.idxmax()    # replace argmax to idxmax as argmax means a different function in newer version of pandas
    return action_name
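As an aside (this helper is not used by the code above), the long if/elif chain simply enumerates which moves stay inside the 5x5 grid; the same set of legal actions can be derived from the cell's row and column, assuming the row-major numbering described earlier:

def valid_actions(state, n_cols=5, n_rows=5):
    # Moves that keep the robot on the board, derived from its row and column.
    row, col = divmod(state, n_cols)
    actions = []
    if col > 0:
        actions.append('left')
    if col < n_cols - 1:
        actions.append('right')
    if row > 0:
        actions.append('up')
    if row < n_rows - 1:
        actions.append('down')
    return actions

print(valid_actions(0))    # ['right', 'down']  (top-left corner)
print(valid_actions(12))   # ['left', 'right', 'up', 'down']  (interior cell)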

The reward table:

The function's parameters are S, the state, and a, the action; actions 0 to 3 stand for left, right, up and down respectively. The table gives the reward or punishment for moving in that direction from the current state.

def get_init_feedback_table(S, a):
    # Reward table: every move earns +1 by default; moving into one of the
    # red obstacle cells (9, 12, 15, 17) costs -10; moving into the blue
    # target cell (24) earns +50.
    tab = np.ones((25, 4))
    tab[8][1] = -10;  tab[4][3] = -10;  tab[14][2] = -10
    tab[11][1] = -10; tab[13][0] = -10; tab[7][3] = -10;  tab[17][2] = -10
    tab[16][0] = -10; tab[20][2] = -10; tab[10][3] = -10
    tab[18][0] = -10; tab[16][1] = -10; tab[22][2] = -10; tab[12][3] = -10
    tab[23][1] = 50;  tab[19][3] = 50
    return tab[S, a]
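The red and blue cells are not listed explicitly in the code, but they can be read off from the table: every penalised (state, action) pair leads into one of the cells 9, 12, 15 or 17, and both rewarded pairs lead into cell 24. A small check (assuming the move offsets used below: left = -1, right = +1, up = -5, down = +5):

moves = [-1, 1, -5, 5]   # left, right, up, down as cell-index offsets
penalised, rewarded = set(), set()
for s in range(25):
    for a in range(4):
        r = get_init_feedback_table(s, a)
        if r < 0:
            penalised.add(s + moves[a])
        elif r > 1:
            rewarded.add(s + moves[a])
print(sorted(penalised))   # [9, 12, 15, 17] - the red obstacle cells
print(sorted(rewarded))    # [24] - the blue target cell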

Getting the reward and the next state:

This function calls the reward table above to obtain the reward; the parameters S and A have the same meaning as before.

When taking action A in state S reaches the final goal, the state becomes 'terminal', indicating that the task is complete. Otherwise the position S is updated according to the chosen direction.

def get_env_feedback(S, A):
    action = {'left': 0, 'right': 1, 'up': 2, 'down': 3}
    R = get_init_feedback_table(S, action[A])
    if (S==19 and action[A]==3) or (S==23 and action[A]==1):
        S = 'terminal'
        return S,R
    if action[A]==0:
        S-=1
    elif action[A]==1:
        S+=1
    elif action[A]==2:
        S-=5
    else:
        S+=5  
    return S, R
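A quick sanity check of the two functions above (cell numbers follow the scheme described earlier):

print(get_env_feedback(19, 'down'))   # next state 'terminal', reward 50: the target below 19 is reached
print(get_env_feedback(4, 'down'))    # next state 9, reward -10: stepping into a red square
print(get_env_feedback(0, 'right'))   # next state 1, reward 1: an ordinary move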

Code implementation - Start training

First the Q-table is initialised, and the starting position of each path is set to 0 (that is, the robot starts every episode from position 0).

The number of training iterations, MAX_EPISODES, was set earlier.

In each episode of training, an action is chosen (at random, or from the current Q-table), and together with the current position it is used to obtain the next state and the reward, S_ and R.

When the chosen action does not yet reach the final destination, these two lines are used:

q_target = R + GAMMA * q_table.iloc[S_, :].max()
q_table.loc[S, A] += ALPHA * (q_target - q_table.loc[S, A])

These two lines complete the Q-table update (compare them with the Bellman equation above).

When the goal is reached and the episode terminates, training moves on to the next episode.

 

def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        S = 0
        is_terminated = False

        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal: no future value to add
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_table.loc[S, A])  # update
            S = S_  # move to next state
    return q_table

if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)

Effects - Summary

In fact the effect is the same as the one shown at the beginning; adjusting the parameters naturally changes the final Q-table that is printed.

One thing that is easy to see is that the greed factor affects the training time.

All of the code is shown above. It can be run and debugged in, for example, Eclipse with PyDev. The step-by-step walk along the table is not displayed here; a minimal sketch of how the learned path could be read out of the table follows below.
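For completeness, here is one possible way to read the learned route out of the trained Q-table. This helper is not part of the original post; it assumes training has converged, otherwise the greedy walk may wander or loop (hence the step limit):

def greedy_path(q_table, start=0, goal=24, max_steps=50):
    # Follow the highest-valued action from each cell until the goal is reached
    # (or give up after max_steps to avoid an endless walk).
    moves = {'left': -1, 'right': 1, 'up': -5, 'down': 5}
    path = [start]
    state = start
    for _ in range(max_steps):
        action = q_table.iloc[state, :].idxmax()
        state += moves[action]
        path.append(state)
        if state == goal:
            break
    return path

print(greedy_path(q_table))   # the cell sequence from the start (0) to the target (24)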

 


Source: https://www.cnblogs.com/bai2018/p/11517584.html