Some references:
https://www.cnblogs.com/hhh5460/p/10134018.html
http://baijiahao.baidu.com/s?id=1597978859962737001&wfr=spider&for=pc
https://www.jianshu.com/p/29db50000e3f
Problem statement
The goal is to design a path so that a robot can drive itself to a target while avoiding obstacles.
As illustrated, the robot may start in any cell of the grid; it must learn about its surrounding environment and eventually reach the target position.
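To make the figure easier to follow without the image, the grid numbering used throughout this article can be printed like this (the row-by-row layout is implied by the movement code later, which uses ±5 for up/down moves):

```python
# The 5x5 world is numbered row by row, 0 at the top-left and 24 at the
# bottom-right, so left/right change the cell number by 1 and up/down by 5.
for row in range(5):
    print(' '.join(f'{5 * row + col:2d}' for col in range(5)))
```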
Here is a sample run result:
First, number the grid cells as shown, to identify positions.
Then update the table using Q-learning's reward mechanism; the table eventually converges to the following:
The actual robot selects its path by following the maximum value in the table at each step; it eventually arrives at position 24 while avoiding the red squares.
With 4 as the initial position, the maximum value points left, so cell 3 is selected first; at 3 the maximum points down, so 8 is selected; at 8 the maximum again points down, and so on until the path is complete. This selection is what Q-learning implements.
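The "follow the maximum value" rule can be sketched as follows. The Q values here are toy numbers chosen only to reproduce the path just described (4 → 3 → 8 → 13 → ...), not the table actually learned later:

```python
import numpy as np
import pandas as pd

ACTIONS = ['left', 'right', 'up', 'down']
MOVES = {'left': -1, 'right': 1, 'up': -5, 'down': 5}  # row-by-row numbering

# Toy Q table (25 states x 4 actions); only the path's entries are non-zero.
q = pd.DataFrame(np.zeros((25, 4)), columns=ACTIONS)
q.loc[4, 'left'] = 1.0
for s in (3, 8, 13, 18):
    q.loc[s, 'down'] = 1.0
q.loc[23, 'right'] = 1.0

state, path = 4, [4]
while state != 24:
    best = q.iloc[state].idxmax()   # follow the largest value in the row
    state += MOVES[best]
    path.append(state)
print(path)  # [4, 3, 8, 13, 18, 23, 24]
```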
The idea behind Q-learning
Reward mechanism
In an unfamiliar environment, the robot first picks its direction at random: starting from the start point, it tries many different ways of completing the path.
Whenever the robot hits a red square it is punished, so after several runs it learns to avoid those positions.
Whenever the robot reaches the blue square it is rewarded, so after enough repetitions it tends to run toward the blue square.
The formula
The process of reward and punishment is expressed through the values in a table.
Initially the Q table is empty, that is, all of its values are 0:
After each run, the table is updated according to the rewards and punishments received, which completes the learning process. In the implementation, the rewards and punishments themselves are also compiled into a table, similar to the one in the figure.
The update formula is the Bellman equation:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ]

Here Q is the current Q table, which as shown in the figure has 25 rows and 4 columns. α is the learning rate, r is the reward or punishment that the next action brings, and γ is the greed factor: in this formula, the larger its value, the more weight is given to rewards in the distant future.
(The formula has no single fixed form across different articles; for example, the greed factor is sometimes made to increase or decrease with the number of steps.)
Recommended reading:
https://www.jianshu.com/p/29db50000e3f
This update is applied to the Q table over and over, across many runs.
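As a worked example of a single update (the numbers are made up for illustration, but α and γ match the values used in the code below):

```python
ALPHA, GAMMA = 0.8, 0.9   # learning rate and greed factor, as in the code below

q_sa = 0.0       # current Q(s, a), still at its initial value
r = 1.0          # reward received for the move
max_next = 2.0   # max over a' of Q(s', a'), the best value of the next state

# one application of the Bellman update
q_sa += ALPHA * (r + GAMMA * max_next - q_sa)
print(round(q_sa, 4))  # 0.8 * (1 + 0.9 * 2 - 0) = 2.24
```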
Code implementation - preparation
I should mention the code this is based on: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/1_command_line_reinforcement_learning/treasure_on_right.py
His explanation of that code: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-1-general-rl/
His program shows how a robot moves in a one-dimensional world; it does not involve obstacles, and it uses more advanced programming to animate the path as it runs.
This article focuses on how the path is found, walking through the complete idea with an example.
Import the corresponding libraries and set up the model parameters:
import numpy as np
import pandas as pd
import time

N_STATES = 25        # number of cells in the 5x5 two-dimensional world
ACTIONS = ['left', 'right', 'up', 'down']  # available actions
EPSILON = 0.3        # greedy policy
ALPHA = 0.8          # learning rate
GAMMA = 0.9          # discount factor
MAX_EPISODES = 100   # maximum episodes
FRESH_TIME = 0.00001 # fresh time for one move
A function to create the Q table:
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),  # q_table initial values
        columns=actions,                     # actions' names
    )
    return table
The action-selection function:
The long if/elif chain in the selection process handles the borders: at the edge of the grid some directions cannot be chosen, otherwise the index would run off the table.
When the greedy coefficient is small, the random scheme is used more often; an action is also chosen at random when all the entries in the table are still 0.
When np.random.uniform() < EPSILON, the best action learned so far is used to make the Q-learning choice, i.e. the robot stops exploring toward unknown targets. (Note that EPSILON here plays the opposite role to the usual epsilon-greedy convention: a larger EPSILON means more greed, not more exploration.)
def choose_action(state, q_table):
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):
        # act non-greedily, or this state's actions have no value yet;
        # restrict the random choice to moves that stay on the grid
        if state == 0:
            action_name = np.random.choice(['right', 'down'])
        elif state > 0 and state < 4:
            action_name = np.random.choice(['right', 'down', 'left'])
        elif state == 4:
            action_name = np.random.choice(['left', 'down'])
        elif state == 5 or state == 10 or state == 15:
            action_name = np.random.choice(['right', 'up', 'down'])
        elif state == 9 or state == 14 or state == 19:
            action_name = np.random.choice(['left', 'up', 'down'])
        elif state == 20:
            action_name = np.random.choice(['right', 'up'])
        elif state > 20 and state < 24:
            action_name = np.random.choice(['right', 'up', 'left'])
        elif state == 24:
            action_name = np.random.choice(['left', 'up'])
        else:
            action_name = np.random.choice(ACTIONS)
    else:
        # act greedily; idxmax replaces argmax, which means a different
        # function in newer versions of pandas
        action_name = state_actions.idxmax()
    return action_name
The reward table:
The function's parameter S is the state and a is the action; actions 0-3 are left, right, up, and down respectively. The table gives, for the current state, the reward or punishment for moving in each direction.
def get_init_feedback_table(S, a):
    tab = np.ones((25, 4))   # every ordinary move earns a small reward of 1
    # moves that enter a red square are punished
    tab[8][1] = -10;  tab[4][3] = -10;  tab[14][2] = -10
    tab[11][1] = -10; tab[13][0] = -10; tab[7][3] = -10;  tab[17][2] = -10
    tab[16][0] = -10; tab[20][2] = -10; tab[10][3] = -10
    tab[18][0] = -10; tab[16][1] = -10; tab[22][2] = -1;  tab[12][3] = -10
    # moves that reach the blue square (cell 24) are rewarded
    tab[23][1] = 50
    tab[19][3] = 50
    return tab[S, a]
Getting the reward or punishment:
This function calls the previous one to obtain the reward information; its parameters S and A are as above.
When taking action A in state S reaches the final goal, it returns 'terminal' to indicate the task is complete. Otherwise, it updates the position S.
def get_env_feedback(S, A):
    action = {'left': 0, 'right': 1, 'up': 2, 'down': 3}
    R = get_init_feedback_table(S, action[A])
    if (S == 19 and action[A] == 3) or (S == 23 and action[A] == 1):
        S = 'terminal'
        return S, R
    if action[A] == 0:
        S -= 1
    elif action[A] == 1:
        S += 1
    elif action[A] == 2:
        S -= 5
    else:
        S += 5
    return S, R
Code implementation - Start training
First initialize the Q table, then set the initial position of the path to 0 (that is, the robot starts every episode from position 0).
The number of training iterations, MAX_EPISODES, was set earlier.
Within each episode, an action is chosen (at random at first, while the Q table is still all zeros), and the chosen action together with the current position yields the next state and the reward: S_, R.
When the chosen action does not reach the final destination, use:
q_target = R + GAMMA * q_table.iloc[S_, :].max()
q_table.loc[S, A] += ALPHA * (q_target - q_table.loc[S, A])
These two lines complete the update of the Q table (compare them with the Bellman equation above).
When the goal is reached, the episode terminates and training moves on to the next episode.
def rl():
    # main part of the RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        S = 0
        is_terminated = False
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()  # next state is not terminal
            else:
                q_target = R          # next state is terminal
                is_terminated = True  # terminate this episode
            q_table.loc[S, A] += ALPHA * (q_target - q_table.loc[S, A])  # update
            S = S_                    # move to the next state
    return q_table

if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)
Effects - Summary
The effect is essentially the same as shown at the beginning; adjusting the parameters appropriately naturally changes the Q table that is finally output.
One thing that is easy to observe is that the greed factor affects the training time.
All the code is above. It can be run and debugged with PyDev in Eclipse, and the resulting path does not step on any red square.
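To check end to end that the learned values really avoid the red squares, here is a deterministic sketch. Instead of the article's sampled Q-learning it uses plain value iteration over the same reward table (a simplification: with enough sweeps both settle on the same greedy path), and then walks the path greedily from position 0:

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 25, 4   # actions: 0 left, 1 right, 2 up, 3 down
MOVES = [-1, 1, -5, 5]

# The same rewards as get_init_feedback_table (including the -1 at tab[22][2]).
R = np.ones((N_STATES, N_ACTIONS))
for s, a in [(8, 1), (4, 3), (14, 2), (11, 1), (13, 0), (7, 3), (17, 2),
             (16, 0), (20, 2), (10, 3), (18, 0), (16, 1), (12, 3)]:
    R[s][a] = -10
R[22][2] = -1
R[23][1] = 50
R[19][3] = 50

def valid(s, a):
    # the same border rules that choose_action spells out case by case
    if a == 0: return s % 5 != 0   # no left in the first column
    if a == 1: return s % 5 != 4   # no right in the last column
    if a == 2: return s >= 5       # no up in the top row
    return s <= 19                 # no down in the bottom row

# Deterministic value iteration in place of sampled Q-learning.
Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            if not valid(s, a):
                continue
            if (s == 19 and a == 3) or (s == 23 and a == 1):
                Q[s][a] = R[s][a]   # reaching the goal ends the episode
            else:
                Q[s][a] = R[s][a] + GAMMA * Q[s + MOVES[a]].max()

# Walk greedily from position 0, always taking the largest valid Q value.
s, path = 0, [0]
while s != 24:
    a = max((a for a in range(N_ACTIONS) if valid(s, a)), key=lambda a: Q[s][a])
    s += MOVES[a]
    path.append(s)
print(path)   # ends at 24 without entering a red square (9, 12, 15 or 17)
```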