文章目录
Unit 2: Understanding the ROS + OpenAI structure
SUMMARY |
---|
Estimated time to completion: 4 hours
In this unit, you are going to follow, step by step, the full workflow of a CartPole simulated environment, including all the environments and scripts involved in its training.
END OF SUMMARY |
---|
The openai_ros package provides a ROS based training structure in order to organize the training of your robots in simulations.
In general, this structure can be divided into two big parts:
- Training Script: The training script defines and sets up the Reinforcement Learning (RL) algorithm that you are going to use to train your agent.
- Training Environment: The training environment is the ones in charge of providing all the needed data to your learning algorithm from the simulation and the RL Agent environment, in order to make the agent learn. There are different types of Training Environments:
- Gazebo Environment
- Robot Environment
- Task Environment
For the whole explanation, we will be using the CartPole problem. The CartPole problem consists of a cart that moves along a horizontal axis and a pole that is connected with one joint on top of the moving cart. So, in any state, the cart only has two possible actions: move to the left or move to the right.
Keep in mind that although we will be using the CartPole simulation for this explanation, the structure that you are going to learn is applicable to any other ROS robot. You will just need to adapt some parts of the code, depending on the problem you want to solve. So, with the proper introductions made, let’s start!
Clarification
In order to not confuse you along the course, we need to clarify the following:
- OpenAI Gym, is the training environment created by OpenAI to train systems with reinforcement learning. It is not prepared to work with ROS nor Gazebo.
- The openai_ros package, is a ROS package created by The Construct, which provides the integration of OpenAI Gym with ROS and Gazebo in a simple way.
- The OpenAI Baselines are a set of already made RL algorithms, created by OpenAI, which are ready to be used with OpenAI Gym
In this course, we are going to see how to use the openai_ros package so you can train ROS based robots with the OpenAI Baselines.
Training Script
Demo 2.1 |
---|
a) Launch the training script and see how the CartPole starts training.
Execute in WebShell #1
roslaunch cartpole_v0_training start_training.launch
You should see how the CartPole starts moving and learning, as in the video below.
Demo 2.1 |
---|
This is the code you launched:
<launch>
<rosparam command="load" file="$(find cartpole_v0_training)/config/cartpole_qlearn_params.yaml" />
<!-- Launch the training system -->
<node pkg="cartpole_v0_training" name="cartpole_gym" type="start_training.py" output="screen"/>
</launch>
So basically, in the previous demo, you:
- Loaded the training parameters from the cartpole_qlearn_params.yaml file.
- Launched a Python script called start_training.py, which activated the whole learning process.
But… what’s the whole structure that holds this training? Let’s have a look.
start_training.py
First of all, let’s talk about this start_training.py script. This script is the one that sets up the algorithm that you are going to use in order to make your robot learn. In this case, we are going to use the Q-learning algorithm.
Q-learning is a reinforcement learning technique used in machine learning. The goal of Q-Learning is to learn a policy, which tells an agent (in our case, a robot) which action to take under which circumstances (check the link above for more details).
Next, you can see an example of a training script that uses Qlearn. Try to read the code and understand what it does, by reading the included comments. Then, we’ll have a detailed look at the most important parts of it.
start_training.py |
---|
#!/usr/bin/env python
import gym
import time
import numpy
import random
import time
import qlearn
from gym import wrappers
# ROS packages required
import rospy
import rospkg
# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up
if __name__ == '__main__':
rospy.init_node('cartpole_gym', anonymous=True, log_level=rospy.WARN)
# Create the Gym environment
env = gym.make('CartPoleStayUp-v0')
rospy.loginfo ( "Gym environment done")
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('cartpole_v0_training')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/cartpole_v0/alpha")
Epsilon = rospy.get_param("/cartpole_v0/epsilon")
Gamma = rospy.get_param("/cartpole_v0/gamma")
epsilon_discount = rospy.get_param("/cartpole_v0/epsilon_discount")
nepisodes = rospy.get_param("/cartpole_v0/nepisodes")
nsteps = rospy.get_param("/cartpole_v0/nsteps")
running_step = rospy.get_param("/cartpole_v0/running_step")
# Initialises the algorithm that we are going to use for learning
qlearn = qlearn.QLearn(actions=range(env.action_space.n),
alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
initial_epsilon = qlearn.epsilon
start_time = time.time()
highest_reward = 0
# Starts the main training loop: the one about the episodes to do
for x in range(nepisodes):
rospy.logdebug("############### START EPISODE => " + str(x))
cumulated_reward = 0
done = False
if qlearn.epsilon > 0.05:
qlearn.epsilon *= epsilon_discount
# Initialize the environment and get first state of the robot
observation = env.reset()
state = ''.join(map(str, observation))
# Show on screen the actual situation of the robot
# for each episode, we test the robot for nsteps
for i in range(nsteps):
rospy.loginfo("############### Start Step => "+str(i))
# Pick an action based on the current state
action = qlearn.chooseAction(state)
rospy.loginfo ("Next action is: %d", action)
# Execute the action in the environment and get feedback
observation, reward, done, info = env.step(action)
rospy.loginfo(str(observation) + " " + str(reward))
cumulated_reward += reward
if highest_reward < cumulated_reward:
highest_reward = cumulated_reward
nextState = ''.join(map(str, observation))
# Make the algorithm learn based on the results
#rospy.logwarn("############### State we were => " + str(state))
#rospy.logwarn("############### Action that we took => " + str(action))
#rospy.logwarn("############### Reward that action gave => " + str(reward))
#rospy.logwarn("############### State in which we will start next step => " + str(nextState))
qlearn.learn(state, action, reward, nextState)
if not(done):
state = nextState
else:
rospy.loginfo ("DONE")
last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
break
rospy.loginfo("############### END Step =>" + str(i))
#raw_input("Next Step...PRESS KEY")
#rospy.sleep(2.0)
m, s = divmod(int(time.time() - start_time), 60)
h, m = divmod(m, 60)
rospy.logwarn ( ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+" Time: %d:%02d:%02d" % (h, m, s)))
rospy.loginfo ( ("\n|"+str(nepisodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |"))
l = last_time_steps.tolist()
l.sort()
#print("Parameters: a="+str)
rospy.loginfo("Overall score: {:0.2f}".format(last_time_steps.mean()))
rospy.loginfo("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))
env.close()
start_training.py |
---|
Code Analysis
Let’s analyze the most important parts of this script.
from openai_ros.task_envs.cartpole_stay_up import stay_up
In the line above, we are importing from the package openai_ros the Python file that contains the Task Environment that we want to use to learn. The Task Environment is the class that defines the task that the robot has to solve (in this case, the task is to maintain the pole upright). We’ll have a closer look later at this Task Environment that we are importing here.
rospy.init_node('cartpole_gym', anonymous=True, log_level=rospy.WARN)
In the line above, we are initializing our ROS node. Since we’ll be working with Gazebo + ROS, we need to initialize a ROS node.
env = gym.make('CartPoleStayUp-v0')
In the above line, we are creating an instance of the Task Environment that we imported before. We use the method defined by OpenAI Gym.
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('cartpole_v0_training')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
Here, we are basically indicating where to store the training results we’ll get when the training finishes. In this case, it will be in the training_results folder, inside the cartpole_v0_training ROS package.
qlearn = qlearn.QLearn(actions=range(env.action_space.n),
alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
initial_epsilon = qlearn.epsilon
Here, we are initializing the Qlearn algorithm.
for x in range(nepisodes):
cumulated_reward = 0
done = False
if qlearn.epsilon > 0.05:
qlearn.epsilon *= epsilon_discount
observation = env.reset()
state = ''.join(map(str, observation))
Here we are starting the main training loop (or episodes loop). As you probably know from Reinforcement Learning, every RL exercise is just a loop within a loop. First loop, the main training loop, is the one in charge to initialise the environment and robot for every episode.
At the beginning of the loop, for every episode, we reset the environment and get the initial state of the robot.
for i in range(nsteps):
Next loop is the steps loop. For each episode, the robot is allowed to try nsteps actions and see the results of those actions. There are several interesting things here.
action = qlearn.chooseAction(state)
In the above line, the Qlearn algorithm chooses the next action to take (based on its current knowledge), given the current state of the robot.
observation, reward, done, info = env.step(action)
Here, we execute the step. Executing a step means:
- We are applying the action decided above to the robot. That is, the robot executes the action in the simulated environment (for example, move the cart to the right at 0.25m/s).
- Based on the result of the action, we get the new state of the robot and environment (for example, the new location of the cart and the new angle of the pole).
- We check if the episode is finished, either because the goal has been achieved or because it has failed so much that the robot will not be able to recover from this point.
- Based on all that, we compute a score of the action taken, also called the reward.
Hence, after the execution of the step, the environment will return four variables: observation, reward, done, and info.
- observation: It gives information about the status of the robot, using its sensors. What it will exactly return will be defined in the Task Environment.
- reward: It gives a reward for the action taken. How this reward will be calculated will also be defined in the Task Environment.
- done: This variable tells if the step has finished or not. You can consider that a step has finished, for instance, when a certain condition is met. You will also define this variable in the Task Environment.
- info: It provides some extra information regarding the step.
qlearn.learn(state, action, reward, nextState)
In the above line, we are actually making the Qlearn algorithm learn. We provide the initial state, the action taken in the step and its reward, and the state of the robot after the step to Qlearn.
The MOST IMPORTANT THING you need to understand from the training script is that it is totally independent from the environments. The purpose of this training script is to set up the learning algorithm that you want to use in order to make your agent learn, regardless of what is being done in the environments. This means that you can change the algorithm you use to learn in the training script without having to worry about modifying your environment’s struture. And this is very powerful! Let’s test it with the following exercises.
Exercise 2.1 |
---|
a) In your workspace, create a new package called my_cartpole_training.
Execute in WebShell #1
catkin_create_pkg my_cartpole_training rospy openai_ros
b) Inside the package, create two new folders called launch and config.
c) Inside the launch folder, create a new file named start_training.launch. You can copy the following contents into it:
start_training.launch |
---|
<launch>
<rosparam command="load" file="$(find my_cartpole_training)/config/cartpole_qlearn_params.yaml" />
<!-- Launch the training system -->
<node pkg="my_cartpole_training" name="cartpole_gym" type="start_training.py" output="screen"/> -->
</launch>
start_training.launch |
---|
As you can see, the launch file is quite self-explanatory. What you are doing is, first, loading some parameters that Qlearn needs in order to train, and second, launching the training script that you’ve reviewed in the above section.
d) The next step will be to create this parameters file for Qlearn. Inside the config folder, create a new file named cartpole_qlearn_params.yaml. You can copy the following contents into it:
cartpole_qlearn_params.yaml |
---|
cartpole_v0: #namespace
n_actions: 4
control_type: "velocity"
#environment variables
min_pole_angle: -0.7 #-23°
max_pole_angle: 0.7 #23°
max_base_velocity: 50
max_base_pose_x: 2.5
min_base_pose_x: -2.5
# those parameters are very important. They are affecting the learning experience
# They indicate how fast the control can be
# If the running step is too large, then there will be a long time between 2 ctrl commans
# If the pos_step is too large, then the changes in position will be very abrupt
running_step: 0.04 # amount of time the control will be executed
pos_step: 0.016 # increment in position for each command
#qlearn parameters
alpha: 0.5
gamma: 0.9
epsilon: 0.1
epsilon_discount: 0.999
nepisodes: 1000
nsteps: 1000
number_splits: 10 #set to change the number of state splits for the continuous problem and also the number of env_variable splits
init_pos: 0.0 # Position in which the base will start
wait_time: 0.1 # Time to wait in the reset phases
cartpole_qlearn_params.yaml |
---|
In this file, we are just setting some parameters that we’ll need for our training. These parameters will first be loaded into the ROS parameter server, and later retrieved by our Training Environments.
e) Now, you will need to place a file that defines the Qlearn algorithm inside the src folder of your package. Create a file named qlearn.py here, and place the following contents into it:
qlearn.py |
---|
'''
Q-learning approach for different RL problems
as part of the basic series on reinforcement learning @
https://github.com/vmayoral/basic_reinforcement_learning
Inspired by https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
@author: Victor Mayoral Vilches <[email protected]>
'''
import random
class QLearn:
def __init__(self, actions, epsilon, alpha, gamma):
self.q = {}
self.epsilon = epsilon # exploration constant
self.alpha = alpha # discount constant
self.gamma = gamma # discount factor
self.actions = actions
def getQ(self, state, action):
return self.q.get((state, action), 0.0)
def learnQ(self, state, action, reward, value):
'''
Q-learning:
Q(s, a) += alpha * (reward(s,a) + max(Q(s') - Q(s,a))
'''
oldv = self.q.get((state, action), None)
if oldv is None:
self.q[(state, action)] = reward
else:
self.q[(state, action)] = oldv + self.alpha * (value - oldv)
def chooseAction(self, state, return_q=False):
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)
if random.random() < self.epsilon:
minQ = min(q); mag = max(abs(minQ), abs(maxQ))
# add random values to all the actions, recalculate maxQ
q = [q[i] + random.random() * mag - .5 * mag for i in range(len(self.actions))]
maxQ = max(q)
count = q.count(maxQ)
# In case there're several state-action max values
# we select a random one among them
if count > 1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)
action = self.actions[i]
if return_q: # if they want it, give it!
return action, q
return action
def learn(self, state1, action1, reward, state2):
maxqnew = max([self.getQ(state2, a) for a in self.actions])
self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)
qlearn.py |
---|
This file basically defines the Qlearn algorithm. It is imported into our training script in the following line:
import qlearn
Q-Learn is a very basic RL algorithm. If you want to learn more about how Q-Learn works, check this link.
f) Finally, also inside the src folder of your package, you’ll have to place the start_training.py script that we’ve been reviewing. You will need to modify the name of the package where to store the training results:
pkg_path = rospack.get_path('my_cartpole_training')
g) Also create the training_results folder inside your package.
Execute in WebShell #1
roscd my_cartpole_training
mkdir training_results
f) And that’s it! Now you have created a package in order to reproduce the demo you executed in the first step. Run now your launch file and see how the CartPole starts to learn.
End of Exercise 2.1 |
---|
Exercise 2.2 |
---|
a) Now, let’s change the learning algorithm that we use for learning. In this case, instead of the Qlearn algorithm, we’ll use the sarsa algorithm. Below, you have the file that defines the sarsa learning algorithm. Create a new file named sarsa.py and include its content.
b)Make the necessary changes to your package in order to use the sarsa algorithm for training, instead of the Q-learn one.
sarsa.py |
---|
import random
class Sarsa:
def __init__(self, actions, epsilon=0.1, alpha=0.2, gamma=0.9):
self.q = {}
self.epsilon = epsilon
self.alpha = alpha
self.gamma = gamma
self.actions = actions
def getQ(self, state, action):
return self.q.get((state, action), 0.0)
def learnQ(self, state, action, reward, value):
oldv = self.q.get((state, action), None)
if oldv is None:
self.q[(state, action)] = reward
else:
self.q[(state, action)] = oldv + self.alpha * (value - oldv)
def chooseAction(self, state):
if random.random() < self.epsilon:
action = random.choice(self.actions)
else:
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)
count = q.count(maxQ)
if count > 1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)
action = self.actions[i]
return action
def learn(self, state1, action1, reward, state2, action2):
qnext = self.getQ(state2, action2)
self.learnQ(state1, action1, reward, reward + self.gamma * qnext)
sarsa.py |
---|
c) Launch your Training script again and check if you see any differences between the performances of Qlearn and sarsa.
End of Exercise 2.2 |
---|
Training Environments
In the previous code, you saw that we were starting an environment called CartPoleStayUp-v0. But what is this environment for? What does it do?
An environment is, basically, a problem (like keeping the CartPole stand up) with a minimal interface that an agent (a robot) can interact with. The environments in the OpenAI Gym are designed to allow objective testing and bench-marking of an agent’s abilities.
In OpenAI Gym, you can usually define everything with just one environment. But ours, is a more complex situation, since we want to integrate OpenAI with ROS and Gazebo.
That’s why we have created a structure to organize and split the OpenAI Gym environment into three different types:
-
Task Environment: This environment is dedicated to implementing all the functions that a specific training session will use. For instance, using the same Robot Environment, you could create different Task Environments in order to train your robot in different tasks, or using different methods (actions).
• You need to have different Task Environments for different tasks to be learnt -
Robot Environment: This environment is dedicated to implementing all the functions that a specific robot will use during its training. It will not include, though, the specific tasks that depend on the training that you are going to implement.
• You need to have different Robot Environments for different Robots to learn -
Gazebo Environment: This environment is common for any training or robot that you want to implement. This environment will generate all the connections between your robot and Gazebo, so that you don’t have to worry about it.
It’s very important that you understand that this environment structure will always be the same, regardless of:
- the problem you want to solve
- the robot you want to train
- the learning algorithm that you want to use
Also, these three environments will be connected to each other in the sense that each environment will inherit from the previous one. It goes like this:
- Task Environment inherits from the Robot Environment.
- Robot Environment inherits from the Gazebo Environment.
- Gazebo Environment inherits from the Gym Environment. The Gym Environment (gym.Env) is the most basic Environment structure provided by OpenAI Gym framework.
Tons of information, right? But don’t worry too much, since we’ll have an extensive look at all of these environments. For now, let’s focus on the Task Environment.
First environment class: Task Environment
The Task Environment is the highest one in the inheritance structure, which also means that it is the most specific one. This environent fully depends on the task we want the robot learn.
Next, you can see an example of a Training Environment that trains the Pole to stay up on the Cart.
stay_up.py |
---|
from gym import utils
from openai_ros.robot_envs import cartpole_env
from gym.envs.registration import register
from gym import error, spaces
import rospy
import math
import numpy as np
# The path is __init__.py of openai_ros, where we import the CartPoleStayUpEnv directly
register(
id='CartPoleStayUp-v0',
entry_point='openai_ros:CartPoleStayUpEnv',
timestep_limit=1000,
)
class CartPoleStayUpEnv(cartpole_env.CartPoleEnv):
def __init__(self):
self.get_params()
self.action_space = spaces.Discrete(self.n_actions)
cartpole_env.CartPoleEnv.__init__(
self, control_type=self.control_type
)
def get_params(self):
#get configuration parameters
self.n_actions = rospy.get_param('/cartpole_v0/n_actions')
self.min_pole_angle = rospy.get_param('/cartpole_v0/min_pole_angle')
self.max_pole_angle = rospy.get_param('/cartpole_v0/max_pole_angle')
self.max_base_velocity = rospy.get_param('/cartpole_v0/max_base_velocity')
self.min_base_pose_x = rospy.get_param('/cartpole_v0/min_base_pose_x')
self.max_base_pose_x = rospy.get_param('/cartpole_v0/max_base_pose_x')
self.pos_step = rospy.get_param('/cartpole_v0/pos_step')
self.running_step = rospy.get_param('/cartpole_v0/running_step')
self.init_pos = rospy.get_param('/cartpole_v0/init_pos')
self.wait_time = rospy.get_param('/cartpole_v0/wait_time')
self.control_type = rospy.get_param('/cartpole_v0/control_type')
def _set_action(self, action):
# Take action
if action == 0: #LEFT
rospy.loginfo("GO LEFT...")
self.pos[0] -= self.pos_step
elif action == 1: #RIGHT
rospy.loginfo("GO RIGHT...")
self.pos[0] += self.pos_step
elif action == 2: #LEFT BIG
rospy.loginfo("GO LEFT BIG...")
self.pos[0] -= self.pos_step * 10
elif action == 3: #RIGHT BIG
rospy.loginfo("GO RIGHT BIG...")
self.pos[0] += self.pos_step * 10
# Apply action to simulation.
rospy.loginfo("MOVING TO POS=="+str(self.pos))
# 1st: unpause simulation
rospy.logdebug("Unpause SIM...")
self.gazebo.unpauseSim()
self.move_joints(self.pos)
rospy.logdebug("Wait for some time to execute movement, time="+str(self.running_step))
rospy.sleep(self.running_step) #wait for some time
rospy.logdebug("DONE Wait for some time to execute movement, time=" + str(self.running_step))
# 3rd: pause simulation
rospy.logdebug("Pause SIM...")
self.gazebo.pauseSim()
def _get_obs(self):
data = self.joints
# base_postion base_velocity pole angle pole velocity
obs = [round(data.position[1],1), round(data.velocity[1],1), round(data.position[0],1), round(data.velocity[0],1)]
return obs
def _is_done(self, observations):
done = False
data = self.joints
rospy.loginfo("BASEPOSITION=="+str(observations[0]))
rospy.loginfo("POLE ANGLE==" + str(observations[2]))
if (self.min_base_pose_x >= observations[0] or observations[0] >= self.max_base_pose_x): #check if the base is still within the ranges of (-2, 2)
rospy.logerr("Base Outside Limits==>min="+str(self.min_base_pose_x)+",pos="+str(observations[0])+",max="+str(self.max_base_pose_x))
done = True
if (self.min_pole_angle >= observations[2] or observations[2] >= self.max_pole_angle): #check if pole has toppled over
rospy.logerr(
"Pole Angle Outside Limits==>min=" + str(self.min_pole_angle) + ",pos=" + str(observations[2]) + ",max=" + str(
self.max_pole_angle))
done = True
return done
def _compute_reward(self, observations, done):
"""
Gives more points for staying upright, gets data from given observations to avoid
having different data than other previous functions
:return:reward
"""
if not done:
reward = 1.0
elif self.steps_beyond_done is None:
# Pole just fell!
self.steps_beyond_done = 0
reward = 1.0
else:
if self.steps_beyond_done == 0:
logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
self.steps_beyond_done += 1
reward = 0.0
return reward
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
self.steps_beyond_done = None
def _set_init_pose(self):
"""
Sets joints to initial position [0,0,0]
:return:
"""
self.check_publishers_connection()
# Reset Internal pos variable
self.init_internal_vars(self.init_pos)
self.move_joints(self.pos)
stay_up.py |
---|
Code Analysis
First of all, as you may have already guessed, this environment will inherit from the previous one, the Robot Environment (in this case, it is called CartPoleEnv).
class CartPoleStayUpEnv(cartpole_env.CartPoleEnv)
In the initialization of this Robot Environment, we will also need to pass its required variables to the Robot Environment: in this case, it’s the control_type.
cartpole_env.CartPoleEnv.__init__(self, control_type=self.control_type)
Also, in the initialization of the Task Environment class, we are defining the action space of our environment. This is, basically, the number of actions (self.n_actions) that the robot will be able to take. This number will be taken from the ROS Parameter Server, and it is loaded from a YAML file.
self.action_space = spaces.Discrete(self.n_actions)
Let’s now have a look at the different functions that we have in this Task Environment.
def get_params(self):
self.n_actions = rospy.get_param('/cartpole_v0/n_actions')
self.min_pole_angle = rospy.get_param('/cartpole_v0/min_pole_angle')
self.max_pole_angle = rospy.get_param('/cartpole_v0/max_pole_angle')
self.max_base_velocity = rospy.get_param('/cartpole_v0/max_base_velocity')
self.min_base_pose_x = rospy.get_param('/cartpole_v0/min_base_pose_x')
self.max_base_pose_x = rospy.get_param('/cartpole_v0/max_base_pose_x')
self.pos_step = rospy.get_param('/cartpole_v0/pos_step')
self.running_step = rospy.get_param('/cartpole_v0/running_step')
self.init_pos = rospy.get_param('/cartpole_v0/init_pos')
self.wait_time = rospy.get_param('/cartpole_v0/wait_time')
self.control_type = rospy.get_param('/cartpole_v0/control_type')
The get_params(self) function, gets all the parameters needed by Qlearn, which are loaded from a YAML file, and stores them into variables of the environment class.
def _set_action(self, action):
# Take action
if action == 0: #LEFT
rospy.loginfo("GO LEFT...")
self.pos[0] -= self.pos_step
elif action == 1: #RIGHT
rospy.loginfo("GO RIGHT...")
self.pos[0] += self.pos_step
elif action == 2: #LEFT BIG
rospy.loginfo("GO LEFT BIG...")
self.pos[0] -= self.pos_step * 10
elif action == 3: #RIGHT BIG
rospy.loginfo("GO RIGHT BIG...")
self.pos[0] += self.pos_step * 10
# Apply action to simulation.
rospy.loginfo("MOVING TO POS=="+str(self.pos))
# 1st: unpause simulation
rospy.logdebug("Unpause SIM...")
self.gazebo.unpauseSim()
self.move_joints(self.pos)
rospy.logdebug("Wait for some time to execute movement, time="+str(self.running_step))
rospy.sleep(self.running_step) #wait for some time
rospy.logdebug("DONE Wait for some time to execute movement, time=" + str(self.running_step))
# 3rd: pause simulation
rospy.logdebug("Pause SIM...")
self.gazebo.pauseSim()
The _set_action(self, action) function defines what each action will do. Each action will be represented by an integer between 0 and 3. And depending on the action number that this function receives, the CartPole will do one thing or another.
def _get_obs(self):
data = self.joints
# base_postion base_velocity pole angle pole velocity
obs = [round(data.position[1],1), round(data.velocity[1],1), round(data.position[0],1), round(data.velocity[0],1)]
return obs
The _get_obs(self) function returns an array with all the data we need from the robot sensors in order to learn. In this case, we are getting the position and velocity of the Cart, and the angle and velocity of the Pole, which is the most relevant data to have on whether the agent is doing well or not.
def _is_done(self, observations):
done = False
data = self.joints
rospy.loginfo("BASEPOSITION=="+str(observations[0]))
rospy.loginfo("POLE ANGLE==" + str(observations[2]))
if (self.min_base_pose_x >= observations[0] or observations[0] >= self.max_base_pose_x): #check if the base is still within the ranges of (-2, 2)
rospy.logerr("Base Outside Limits==>min="+str(self.min_base_pose_x)+",pos="+str(observations[0])+",max="+str(self.max_base_pose_x))
done = True
if (self.min_pole_angle >= observations[2] or observations[2] >= self.max_pole_angle): #check if pole has toppled over
rospy.logerr(
"Pole Angle Outside Limits==>min=" + str(self.min_pole_angle) + ",pos=" + str(observations[2]) + ",max=" + str(
self.max_pole_angle))
done = True
return done
The _is_done(self, observations) function decides whether the current step has finished or not, and returns a variable (boolean) that says so. For the CartPole case, we say the step has finished either when the Cart is close to the bar limits or when the angle of the Pole is too low.
def _compute_reward(self, observations, done):
"""
Gives more points for staying upright, gets data from given observations to avoid
having different data than other previous functions
:return:reward
"""
if not done:
reward = 1.0
elif self.steps_beyond_done is None:
# Pole just fell!
self.steps_beyond_done = 0
reward = 1.0
else:
if self.steps_beyond_done == 0:
logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
self.steps_beyond_done += 1
reward = 0.0
return reward
The _compute_reward(self, observations, done) function decides the reward to give for each step. In this specific case of the CartPole, it will give one reward point for each step (steps_beyond_done) that the Pole is upright.
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
self.steps_beyond_done = None
The _init_env_variables(self) function initializes all the variables that your particular system is using after each step. In the CartPole case, as you can see, we are initializing the steps_beyond_done variable.
def _set_init_pose(self):
"""
Sets joints to initial position [0,0,0]
:return:
"""
self.check_publishers_connection()
# Reset Internal pos variable
self.init_internal_vars(self.init_pos)
self.move_joints(self.pos)
The _set_init_pose(self) function sets the initial position of the joints of your robot. In this case, the initial position will be with the Cart parallel to the rail, and the Pole perpendicular to them at an angle of 90º.
It’s very important that you understand that you can add any function that you want into this Task Environment. However, there are some functions that will ALWAYS need to be defined here because they are required for the rest of the Environments to work properly. These functions are:
- The _set_action() function
- The _get_obs() function
- The _is_done() function
- The _compute_reward() function
- The _set_init_pose() function
- The _init_env_variables() function
If for your specific task, you don’t need to implement some of these functions, you can just make them pass (like we do in this case with the _set_init_pose() and _init_env_variables() functions). But you will need to add them to your code anyway, or you will get a NotImplementedError().
Exercise 2.3 |
---|
a) Inside src folder of the my_cartpole_training package, add a new script called my_cartpole_task_env.py. Into this new file, copy the contents of the stay_up.py script that you have just reviewed.
b) Now, modify the Task Environment code so that it takes only two actions: move right and move left.
NOTE: Also, you will need to modify the registration id of the environment, in order to avoid conflicts. You just need to change the v0 for a v1.
register(
id='CartPoleStayUp-v1',
entry_point='my_cartpole_task_env:CartPoleStayUpEnv',
timestep_limit=1000,
)
c) Next, you’ll need to also modify the file start_training.py, so that it takes your Task Environment file insted of the default one.
import my_cartpole_task_env
Also, remember to modify the name of the Environment, which you have updated in the previous step.
env = gym.make('CartPoleStayUp-v1')
d) Finally, launch the training scrcipt and check out how the modifications affect the learning process.
End of Exercise 2.3 |
---|
Exercise 2.4 |
---|
a) Modify the Task Environment so that it now gives more reward for going to the left than for going to the right.
End of Exercise 2.4 |
---|
Second environment class: Robot Environment
The robot environment will then contain all the functions associated with the specific “robot” that you want to train. This means that it will contain all the functionalities that your robot will need in order to be controlled.
For instance, let’s focus on the CartPole example. Let’s say that in order to be able to control a CartPole environment, we will basically need three things:
- A way to check that all the systems (ROS topics, Publishers, etc.) are ready and working okay
- A way to move the Cart alongside the rails
- A way to get all the data about all the sensors that we want to evaluate (including the position of the Pole)
For this case, then, we would need to define all these functions into the Robot Environment. Let’s see an example of a Robot Environment script that takes into account the three points introduced above. You can check out the code here:
cartpole_env.py |
---|
#!/usr/bin/env python
import gym
import rospy
import roslaunch
import time
import numpy as np
from gym import utils, spaces
from geometry_msgs.msg import Twist
from std_srvs.srv import Empty
from gym.utils import seeding
from gym.envs.registration import register
import copy
import math
import os
from sensor_msgs.msg import JointState
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint
from std_msgs.msg import Float64
from gazebo_msgs.srv import SetLinkState
from gazebo_msgs.msg import LinkState
from rosgraph_msgs.msg import Clock
from openai_ros import robot_gazebo_env
class CartPoleEnv(robot_gazebo_env.RobotGazeboEnv):
def __init__(
self, control_type
):
self.publishers_array = []
self._base_pub = rospy.Publisher('/cartpole_v0/foot_joint_velocity_controller/command', Float64, queue_size=1)
self._pole_pub = rospy.Publisher('/cartpole_v0/pole_joint_velocity_controller/command', Float64, queue_size=1)
self.publishers_array.append(self._base_pub)
self.publishers_array.append(self._pole_pub)
rospy.Subscriber("/cartpole_v0/joint_states", JointState, self.joints_callback)
self.control_type = control_type
if self.control_type == "velocity":
self.controllers_list = ['joint_state_controller',
'pole_joint_velocity_controller',
'foot_joint_velocity_controller',
]
elif self.control_type == "position":
self.controllers_list = ['joint_state_controller',
'pole_joint_position_controller',
'foot_joint_position_controller',
]
elif self.control_type == "effort":
self.controllers_list = ['joint_state_controller',
'pole_joint_effort_controller',
'foot_joint_effort_controller',
]
self.robot_name_space = "cartpole_v0"
self.reset_controls = True
# Seed the environment
self._seed()
self.steps_beyond_done = None
super(CartPoleEnv, self).__init__(
controllers_list=self.controllers_list,
robot_name_space=self.robot_name_space,
reset_controls=self.reset_controls
)
def joints_callback(self, data):
self.joints = data
def _seed(self, seed=None):
self.np_random, seed = seeding.np_random(seed)
return [seed]
# RobotEnv methods
# ----------------------------
def _env_setup(self, initial_qpos):
self.init_internal_vars(self.init_pos)
self.set_init_pose()
self.check_all_systems_ready()
def init_internal_vars(self, init_pos_value):
self.pos = [init_pos_value]
self.joints = None
def check_publishers_connection(self):
"""
Checks that all the publishers are working
:return:
"""
rate = rospy.Rate(10) # 10hz
while (self._base_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
rospy.logdebug("No susbribers to _base_pub yet so we wait and try again")
try:
rate.sleep()
except rospy.ROSInterruptException:
# This is to avoid error when world is rested, time when backwards.
pass
rospy.logdebug("_base_pub Publisher Connected")
while (self._pole_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
rospy.logdebug("No susbribers to _pole_pub yet so we wait and try again")
try:
rate.sleep()
except rospy.ROSInterruptException:
# This is to avoid error when world is rested, time when backwards.
pass
rospy.logdebug("_pole_pub Publisher Connected")
rospy.logdebug("All Publishers READY")
def _check_all_systems_ready(self, init=True):
self.base_position = None
while self.base_position is None and not rospy.is_shutdown():
try:
self.base_position = rospy.wait_for_message("/cartpole_v0/joint_states", JointState, timeout=1.0)
rospy.logdebug("Current cartpole_v0/joint_states READY=>"+str(self.base_position))
if init:
# We Check all the sensors are in their initial values
positions_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.position)
velocity_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.velocity)
efforts_ok = all(abs(i) <= 1.0e-01 for i in self.base_position.effort)
base_data_ok = positions_ok and velocity_ok and efforts_ok
rospy.logdebug("Checking Init Values Ok=>" + str(base_data_ok))
except:
rospy.logerr("Current cartpole_v0/joint_states not ready yet, retrying for getting joint_states")
rospy.logdebug("ALL SYSTEMS READY")
def move_joints(self, joints_array):
joint_value = Float64()
joint_value.data = joints_array[0]
rospy.logdebug("Single Base JointsPos>>"+str(joint_value))
self._base_pub.publish(joint_value)
def get_clock_time(self):
self.clock_time = None
while self.clock_time is None and not rospy.is_shutdown():
try:
self.clock_time = rospy.wait_for_message("/clock", Clock, timeout=1.0)
rospy.logdebug("Current clock_time READY=>" + str(self.clock_time))
except:
rospy.logdebug("Current clock_time not ready yet, retrying for getting Current clock_time")
return self.clock_time
cartpole_env.py |
---|
Code Analysis
We can differentiate two main parts in the CartPoleEnv class: the initialization of the class and the specific functions defined below.
First of all, you can see that this class inherits from the RobotGazeboEnv class:
class CartPoleEnv(robot_gazebo_env.RobotGazeboEnv):
This means that all the functions that are defined in the Gazebo Environment will also be accessible from this class.
In the init() function of the class, we are doing several things. First, we are creating all the Publishers and Subscribers required for this specific Cartpole environment:
self.publishers_array = []
self._base_pub = rospy.Publisher('/cartpole_v0/foot_joint_velocity_controller/command', Float64, queue_size=1)
self._pole_pub = rospy.Publisher('/cartpole_v0/pole_joint_velocity_controller/command', Float64, queue_size=1)
self.publishers_array.append(self._base_pub)
self.publishers_array.append(self._pole_pub)
rospy.Subscriber("/cartpole_v0/joint_states", JointState, self.joints_callback)
These are the Publishers needed to move the Cart alongside the rails, and the Subscribers needed to know the positions of the joints at any given time.
In the following section:
self.control_type = control_type
if self.control_type == "velocity":
self.controllers_list = ['joint_state_controller',
'pole_joint_velocity_controller',
'foot_joint_velocity_controller',
]
elif self.control_type == "position":
self.controllers_list = ['joint_state_controller',
'pole_joint_position_controller',
'foot_joint_position_controller',
]
elif self.control_type == "effort":
self.controllers_list = ['joint_state_controller',
'pole_joint_effort_controller',
'foot_joint_effort_controller',
]
We are defining which kind of controllers we will use for our joints: velocity, position, or effort controllers. We decide this depending on the contents of the control_type variable, which has been defined in the Task Environment. Depending on this variable, then, we will fill the value of the controllers_list variable, which needs to be passed to the Gazebo Environment.
Alongside the controllers_list variable, we will need to pass two other variables to the Gazebo Environment: robot_name_space and reset_controls. This is done here:
super(CartPoleEnv, self).__init__(
controllers_list=self.controllers_list,
robot_name_space=self.robot_name_space,
reset_controls=self.reset_controls
)
The values of the robot_name_space and reset_controls variables are assigned right above:
self.robot_name_space = "cartpole_v0"
self.reset_controls = True
And this is basically it for the initialization of the environment. Now, let’s have a look at some of the most important functions of the environment.
def joints_callback(self, data):
self.joints = data
The joints_callback(self, data) function gets the data from the Subscriber of the topic /joint_states and stores it in the self.joints class variable. If you remember from the previous section, this self.joints class variable will be used in order to generate the observation (inside the _get_obs() function).
def check_publishers_connection(self):
"""
Checks that all the publishers are working
:return:
"""
rate = rospy.Rate(10) # 10hz
while (self._base_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
rospy.logdebug("No susbribers to _base_pub yet so we wait and try again")
try:
rate.sleep()
except rospy.ROSInterruptException:
# This is to avoid error when world is rested, time when backwards.
pass
rospy.logdebug("_base_pub Publisher Connected")
while (self._pole_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
rospy.logdebug("No susbribers to _pole_pub yet so we wait and try again")
try:
rate.sleep()
except rospy.ROSInterruptException:
# This is to avoid error when world is rested, time when backwards.
pass
rospy.logdebug("_pole_pub Publisher Connected")
rospy.logdebug("All Publishers READY")
The check_publishers_connection(self) basically checks that the Publishers are connected and working.
def _check_all_systems_ready(self, init=True):
self.base_position = None
while self.base_position is None and not rospy.is_shutdown():
try:
self.base_position = rospy.wait_for_message("/cartpole_v0/joint_states", JointState, timeout=1.0)
rospy.logdebug("Current cartpole_v0/joint_states READY=>"+str(self.base_position))
if init:
# We Check all the sensors are in their initial values
positions_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.position)
velocity_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.velocity)
efforts_ok = all(abs(i) <= 1.0e-01 for i in self.base_position.effort)
base_data_ok = positions_ok and velocity_ok and efforts_ok
rospy.logdebug("Checking Init Values Ok=>" + str(base_data_ok))
except:
rospy.logerr("Current cartpole_v0/joint_states not ready yet, retrying for getting joint_states")
rospy.logdebug("ALL SYSTEMS READY")
The _check_all_systems_ready(self, init=True) function checks that all the joints are at their initial values before starting each step.
def move_joints(self, joints_array):
joint_value = Float64()
joint_value.data = joints_array[0]
rospy.logdebug("Single Base JointsPos>>"+str(joint_value))
self._base_pub.publish(joint_value)
Finally, the move_joints(self, joints_array) functions are in charge of moving the joints (the Cart, in this case) of the simulation.
It’s very important that you understand that you can add all the functions that you want to this Robot Environment. However, there are some functions that will ALWAYS need to be defined here because they are required for the rest of the Environments to work properly. These functions include:
- The init function
- The _check_all_systems_ready() function
Third environment class: Gazebo Environment
The Gazebo Environment is used to connect the Robot Environment to the Gazebo simulator. For instance, it takes care of the resets of the simulator after each step, or the resets of the controllers (if needed).
The most important thing you need to know about this environment is that it will be transparent to you. And what does this mean? Well, it basically means that you don’t have to worry about it. This environment will always be the same, regardless of the robot or the kind of task to solve. So, you won’t have to change it or work over it. Good news, right?
Here you can have a look at how this environment looks:
robot_gazebo_env.py |
---|
import rospy
import gym
from gym.utils import seeding
from .gazebo_connection import GazeboConnection
from .controllers_connection import ControllersConnection
#https://bitbucket.org/theconstructcore/theconstruct_msgs/src/master/msg/RLExperimentInfo.msg
from theconstruct_msgs.msg import RLExperimentInfo
# https://github.com/openai/gym/blob/master/gym/core.py
class RobotGazeboEnv(gym.Env):
def __init__(self, robot_name_space, controllers_list, reset_controls):
# To reset Simulations
self.gazebo = GazeboConnection()
self.controllers_object = ControllersConnection(namespace=robot_name_space, controllers_list=controllers_list)
self.reset_controls = reset_controls
self.seed()
# Set up ROS related variables
self.episode_num = 0
self.reward_pub = rospy.Publisher('/openai/reward', RLExperimentInfo, queue_size=1)
# Env methods
def seed(self, seed=None):
self.np_random, seed = seeding.np_random(seed)
return [seed]
def step(self, action):
"""
Function executed each time step.
Here we get the action execute it in a time step and retrieve the
observations generated by that action.
:param action:
:return: obs, reward, done, info
"""
"""
Here we should convert the action num to movement action, execute the action in the
simulation and get the observations result of performing that action.
"""
self.gazebo.unpauseSim()
self._set_action(action)
self.gazebo.pauseSim()
obs = self._get_obs()
done = self._is_done(obs)
info = {}
reward = self._compute_reward(obs, done)
self._publish_reward_topic(reward, self.episode_num)
return obs, reward, done, info
def reset(self):
rospy.logdebug("Reseting RobotGazeboEnvironment")
self._reset_sim()
self._init_env_variables()
self._update_episode()
obs = self._get_obs()
return obs
def close(self):
"""
Function executed when closing the environment.
Use it for closing GUIS and other systems that need closing.
:return:
"""
rospy.logdebug("Closing RobotGazeboEnvironment")
rospy.signal_shutdown("Closing RobotGazeboEnvironment")
def _update_episode(self):
"""
Increases the episode number by one
:return:
"""
self.episode_num += 1
def _publish_reward_topic(self, reward, episode_number=1):
"""
This function publishes the given reward in the reward topic for
easy access from ROS infrastructure.
:param reward:
:param episode_number:
:return:
"""
reward_msg = RLExperimentInfo()
reward_msg.episode_number = episode_number
reward_msg.episode_reward = reward
self.reward_pub.publish(reward_msg)
# Extension methods
# ----------------------------
def _reset_sim(self):
"""Resets a simulation
"""
if self.reset_controls :
self.gazebo.unpauseSim()
self.controllers_object.reset_controllers()
self._check_all_systems_ready()
self._set_init_pose()
self.gazebo.pauseSim()
self.gazebo.resetSim()
self.gazebo.unpauseSim()
self.controllers_object.reset_controllers()
self._check_all_systems_ready()
self.gazebo.pauseSim()
else:
self.gazebo.unpauseSim()
self._check_all_systems_ready()
self._set_init_pose()
self.gazebo.pauseSim()
self.gazebo.resetSim()
self.gazebo.unpauseSim()
self._check_all_systems_ready()
self.gazebo.pauseSim()
return True
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
raise NotImplementedError()
def _check_all_systems_ready(self):
"""
Checks that all the sensors, publishers and other simulation systems are
operational.
"""
raise NotImplementedError()
def _get_obs(self):
"""Returns the observation.
"""
raise NotImplementedError()
def _init_env_variables(self):
"""Inits variables needed to be initialised each time we reset at the start
of an episode.
"""
raise NotImplementedError()
def _set_action(self, action):
"""Applies the given action to the simulation.
"""
raise NotImplementedError()
def _is_done(self, observations):
"""Indicates whether or not the episode is done ( the robot has fallen for example).
"""
raise NotImplementedError()
def _compute_reward(self, observations, done):
"""Calculates the reward to give based on the observations given.
"""
raise NotImplementedError()
def _env_setup(self, initial_qpos):
"""Initial configuration of the environment. Can be used to configure initial state
and extract information from the simulation.
"""
raise NotImplementedError()
robot_gazebo_env.py |
---|
Code Analysis
Within the RobotGazeboEnv class, the most important functions are:
- step() function: It defines how to handle a simulation step during training. When to unpause the simulation, when to pause it again, etc.
- reset() function: It defines how to reset the simulation after each episode of the training.
- close() function: It closes the whole ROS system.
- publish_reward_topic() function: It publishes the reward for each step in the topic /openai/reward, for visualization purposes.
One important thing you need to know about this environment is that it requires three parameters in its constructor:
def __init__(self, robot_name_space, controllers_list, reset_controls):
- robot_name_space: A string with the namespace of our robot.
- controllers_list: An array containing the names of the controllers of the robot.
- reset_controls: A boolean that tells whether or not to reset the controllers of the robot each time we reset the simulation.
So, when creating the Robot Environment, bear in mind that you’ll need to fill these three variables.
But also, within this environment, we have some functions that are defined, but not implemented (they raise a NotImplementedError()). These functions need to be implemented either in the Robot Environment or in the Task Environment. This is MANDATORY. These functions are:
- _set_init_pose()
- _check_all_systems_ready()
- _get_obs()
- _init_env_variables()
- _set_action()
- _is_done()
- _compute_reward()
- _env_setup()
Most of them have already been explained in the previous Robot and Task Environments. But you can review the function of each one in the robot_gazebo_env.py file.
Besides the functions introduced above, we also have the _reset_sim() function. Although this function is actually implemented, it can also be re-implemented in any of the other Environments. Let’s say that within the Gazebo Environment we have a default implementation, which should work for most of the cases. But if you need to change the behavior of how your simulation resets, you can re-implement this function either in the Robot or the Task Environments, and they will override the one defined in the Gazebo Environment.
Exercise 2.5 |
---|
a) Add a new script to the src folder of your my_cartpole_training package. You can name it, for instance, my_cartpole_robot_env.py. Into this new file, copy the contents of the cartpole_env.py script that you have just reviewed.
b) Next, you will need to make sure that the Task Environment is correctly importing and using your new Robot Environment file, my_cartpole_robot_env.py. Also, you will need to modify some references in the Task Environment to the old Robot Environment file.
c) Add a function in the Robot Environment file that changes the way in which the CartPole simulation is reset. For instance, make it reset to a position where the Pole is at an angle of 30º in respect to the Cart.
End of Exercise 2.5 |
---|
所有的文档及其位置
详见下图左边
Visualizing the reward
As you have seen in the previous section, where we discussed about the Gazebo Environment, this environment publishes the reward into a topic named /openai/reward. This is very important, because it will allow us to visualize our reward using regular ROS tools. For instance, rqt_multiplot.
rqt_multiplot is similar to rqt_plot, since it also provides a GUI for visualizing 2D plots, but it’s a little bit more complex and provides more features. For instance, you can visualize multiple plots at the same time, or you can also customize the axis for your plot.
In the next exercise, we are going to see how to visualize the reward of our training using the rqt_multiplot tool.
Exercise 2.6 |
---|
a) First of all, let’s start the rqt_multiplot tool.
Execute in WebShell #1
rosrun rqt_multiplot rqt_multiplot
b) Hit the icon with a screen in the top-right corner of the IDE window in order to open the Graphic Interface window.
具体位置如下~~
You should now see something like this in this new window.
c) Next, let’s launch our training script, so that we start getting the rewards published into the /openai/reward topic.
Execute in WebShell #1
roslaunch cartpole_v0_training start_training.launch
d) Next, let’s configure our plot. For that, first, you’ll need to click on the Configure plot button, at the top-right side of the window.
e) Give a name to your new plot, and click on the Add curve icon.
f) Next, set the topic to /openai/reward. Automatically, the rest of the field will be autocompleted. For the X-Axis, select the episode_number field. For the Y-Axis, select the episode_reward field. Like in the picture below.
When you are done, just click on the Enter key on your keyboard.
g) Finally, in the plot visualization window, you will need to click on the Run plot button in order to start visualizing your new plot.
If everything went OK, you should visualize something like this:
You can also save, if you want, your plot configuration. This way, you will be able to load it at any time later. Remember to save it into your catkin_ws/srcworkspace, so that it will also be saved when you close the Course.
End of Exercise 2.6 |
---|
Congratulations!! You are now capable of creating your own Environments structure and make it learn using simple learning algorithms, like Qlearn.