【openai_ros】2 - Exploring the OpenAI Structure: RoboCube. Part 1


Unit 2: Understanding the ROS + OpenAI structure

SUMMARY

Estimated time to completion: 4 hours

In this unit, you are going to follow, step by step, the full workflow of a CartPole simulated environment, including all the environments and scripts involved in its training.

END OF SUMMARY

The openai_ros package provides a ROS-based training structure to organize the training of your robots in simulation.

In general, this structure can be divided into two big parts:

  • Training Script: The training script defines and sets up the Reinforcement Learning (RL) algorithm that you are going to use to train your agent.
  • Training Environment: The training environments are the ones in charge of providing your learning algorithm with all the data it needs from the simulation, in order to make the agent learn. There are different types of Training Environments:
    • Gazebo Environment
    • Robot Environment
    • Task Environment

For the whole explanation, we will be using the CartPole problem. The CartPole problem consists of a cart that moves along a horizontal axis and a pole that is connected with one joint on top of the moving cart. So, in any state, the cart only has two possible actions: move to the left or move to the right.
Keep in mind that although we will be using the CartPole simulation for this explanation, the structure that you are going to learn is applicable to any other ROS robot. You will just need to adapt some parts of the code, depending on the problem you want to solve. So, with the proper introductions made, let’s start!

Clarification

In order to not confuse you along the course, we need to clarify the following:

  • OpenAI Gym is the training toolkit created by OpenAI to train systems with reinforcement learning. It is not prepared to work with ROS or Gazebo.
  • The openai_ros package is a ROS package created by The Construct, which provides the integration of OpenAI Gym with ROS and Gazebo in a simple way.
  • The OpenAI Baselines are a set of ready-made RL algorithms, created by OpenAI, which are ready to be used with OpenAI Gym.

In this course, we are going to see how to use the openai_ros package so you can train ROS based robots with the OpenAI Baselines.

Training Script

Demo 2.1

a) Launch the training script and see how the CartPole starts training.

Execute in WebShell #1

roslaunch cartpole_v0_training start_training.launch

You should see how the CartPole starts moving and learning, as in the video below.

Demo 2.1

This is the code you launched:

<launch>
    <rosparam command="load" file="$(find cartpole_v0_training)/config/cartpole_qlearn_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="cartpole_v0_training" name="cartpole_gym" type="start_training.py" output="screen"/>
</launch>

So basically, in the previous demo, you:

  1. Loaded the training parameters from the cartpole_qlearn_params.yaml file.
  2. Launched a Python script called start_training.py, which activated the whole learning process.

But… what’s the whole structure that holds this training? Let’s have a look.

start_training.py

First of all, let’s talk about this start_training.py script. This script is the one that sets up the algorithm that you are going to use in order to make your robot learn. In this case, we are going to use the Q-learning algorithm.

Q-learning is a reinforcement learning technique used in machine learning. The goal of Q-learning is to learn a policy, which tells an agent (in our case, a robot) which action to take under which circumstances.
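
Concretely, Q-learning keeps a table of values Q(s, a) and, after every step, updates the entry for the state-action pair it has just tried:

Q(s, a) <- Q(s, a) + alpha * ( r + gamma * max_a' Q(s', a') - Q(s, a) )

where alpha is the learning rate, gamma is the discount factor, r is the reward received, and s' is the state reached after taking action a. This is the update implemented in the qlearn.py file that you will create later in this unit.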

Next, you can see an example of a training script that uses Qlearn. Try to read the code and understand what it does, by reading the included comments. Then, we’ll have a detailed look at the most important parts of it.

start_training.py
#!/usr/bin/env python

import gym
import time
import numpy
import random
import qlearn
from gym import wrappers

# ROS packages required
import rospy
import rospkg

# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up


if __name__ == '__main__':
    
    rospy.init_node('cartpole_gym', anonymous=True, log_level=rospy.WARN)

    # Create the Gym environment
    env = gym.make('CartPoleStayUp-v0')
    rospy.loginfo ( "Gym environment done")
        
    # Set the logging system
    rospack = rospkg.RosPack()
    pkg_path = rospack.get_path('cartpole_v0_training')
    outdir = pkg_path + '/training_results'
    env = wrappers.Monitor(env, outdir, force=True) 
    rospy.loginfo ( "Monitor Wrapper started")

    last_time_steps = numpy.ndarray(0)

    # Loads parameters from the ROS param server
    # Parameters are stored in a yaml file inside the config directory
    # They are loaded at runtime by the launch file
    Alpha = rospy.get_param("/cartpole_v0/alpha")
    Epsilon = rospy.get_param("/cartpole_v0/epsilon")
    Gamma = rospy.get_param("/cartpole_v0/gamma")
    epsilon_discount = rospy.get_param("/cartpole_v0/epsilon_discount")
    nepisodes = rospy.get_param("/cartpole_v0/nepisodes")
    nsteps = rospy.get_param("/cartpole_v0/nsteps")
    running_step = rospy.get_param("/cartpole_v0/running_step")

    # Initialises the algorithm that we are going to use for learning
    qlearn = qlearn.QLearn(actions=range(env.action_space.n),
                    alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
    initial_epsilon = qlearn.epsilon

    start_time = time.time()
    highest_reward = 0

    # Starts the main training loop: the one about the episodes to do
    for x in range(nepisodes):
        rospy.logdebug("############### START EPISODE => " + str(x))
        
        cumulated_reward = 0  
        done = False
        if qlearn.epsilon > 0.05:
            qlearn.epsilon *= epsilon_discount
        
        # Initialize the environment and get first state of the robot
        
        observation = env.reset()
        state = ''.join(map(str, observation))
        
        # Show on screen the actual situation of the robot
        # for each episode, we test the robot for nsteps
        for i in range(nsteps):
            
            rospy.loginfo("############### Start Step => "+str(i))
            # Pick an action based on the current state
            action = qlearn.chooseAction(state)
            rospy.loginfo ("Next action is: %d", action)
            # Execute the action in the environment and get feedback
            observation, reward, done, info = env.step(action)
            rospy.loginfo(str(observation) + " " + str(reward))
            cumulated_reward += reward
            if highest_reward < cumulated_reward:
                highest_reward = cumulated_reward

            nextState = ''.join(map(str, observation))

            # Make the algorithm learn based on the results
            #rospy.logwarn("############### State we were => " + str(state))
            #rospy.logwarn("############### Action that we took => " + str(action))
            #rospy.logwarn("############### Reward that action gave => " + str(reward))
            #rospy.logwarn("############### State in which we will start next step => " + str(nextState))
            qlearn.learn(state, action, reward, nextState)

            if not(done):
                state = nextState
            else:
                rospy.loginfo ("DONE")
                last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
                break
            rospy.loginfo("############### END Step =>" + str(i))
            #raw_input("Next Step...PRESS KEY")
            #rospy.sleep(2.0)
        m, s = divmod(int(time.time() - start_time), 60)
        h, m = divmod(m, 60)
        rospy.logwarn ( ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+"     Time: %d:%02d:%02d" % (h, m, s)))

    
    rospy.loginfo ( ("\n|"+str(nepisodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |"))

    l = last_time_steps.tolist()
    l.sort()

    #print("Parameters: a="+str)
    rospy.loginfo("Overall score: {:0.2f}".format(last_time_steps.mean()))
    rospy.loginfo("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))

    env.close()
start_training.py

Code Analysis

Let’s analyze the most important parts of this script.

from openai_ros.task_envs.cartpole_stay_up import stay_up

In the line above, we are importing from the package openai_ros the Python file that contains the Task Environment that we want to use to learn. The Task Environment is the class that defines the task that the robot has to solve (in this case, the task is to maintain the pole upright). We’ll have a closer look later at this Task Environment that we are importing here.

rospy.init_node('cartpole_gym', anonymous=True, log_level=rospy.WARN)

In the line above, we are initializing our ROS node. Since we’ll be working with Gazebo + ROS, we need to initialize a ROS node.

env = gym.make('CartPoleStayUp-v0')

In the above line, we are creating an instance of the Task Environment that we imported before, using the gym.make() method provided by OpenAI Gym.

rospack = rospkg.RosPack()
pkg_path = rospack.get_path('cartpole_v0_training')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True) 

Here, we are basically indicating where to store the training results we’ll get when the training finishes. In this case, it will be in the training_results folder, inside the cartpole_v0_training ROS package.

qlearn = qlearn.QLearn(actions=range(env.action_space.n),
                       alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
initial_epsilon = qlearn.epsilon

Here, we are initializing the Qlearn algorithm.

for x in range(nepisodes):
        
        cumulated_reward = 0  
        done = False
        if qlearn.epsilon > 0.05:
            qlearn.epsilon *= epsilon_discount
        
        observation = env.reset()
        state = ''.join(map(str, observation))

Here we are starting the main training loop (or episodes loop). As you probably know from Reinforcement Learning, every RL exercise is basically a loop within a loop. The first loop, the main training loop, is the one in charge of initializing the environment and the robot for every episode.

At the beginning of the loop, for every episode, we reset the environment and get the initial state of the robot.

for i in range(nsteps):

The next loop is the steps loop. For each episode, the robot is allowed to try nsteps actions and see the results of those actions. There are several interesting things here.

action = qlearn.chooseAction(state)

In the above line, the Qlearn algorithm chooses the next action to take (based on its current knowledge), given the current state of the robot.

observation, reward, done, info = env.step(action)

Here, we execute the step. Executing a step means:

  1. We are applying the action decided above to the robot. That is, the robot executes the action in the simulated environment (for example, move the cart to the right at 0.25m/s).
  2. Based on the result of the action, we get the new state of the robot and environment (for example, the new location of the cart and the new angle of the pole).
  3. We check if the episode is finished, either because the goal has been achieved or because it has failed so much that the robot will not be able to recover from this point.
  4. Based on all that, we compute a score of the action taken, also called the reward.

Hence, after the execution of the step, the environment will return four variables: observation, reward, done, and info.

  • observation: It gives information about the status of the robot, using its sensors. What it will exactly return will be defined in the Task Environment.
  • reward: It gives a reward for the action taken. How this reward will be calculated will also be defined in the Task Environment.
  • done: This variable tells whether the episode has finished or not. You can consider that an episode has finished, for instance, when the Pole has fallen past its limits. The conditions that end an episode are also defined in the Task Environment.
  • info: It provides some extra information regarding the step.
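
For the CartPole Task Environment used in this unit, for instance, the observation is a plain Python list with four rounded values, [base_position, base_velocity, pole_angle, pole_velocity], as you will see later in the _get_obs() function of the Task Environment.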

qlearn.learn(state, action, reward, nextState)

In the above line, we are actually making the Qlearn algorithm learn. We provide the initial state, the action taken in the step and its reward, and the state of the robot after the step to Qlearn.

The MOST IMPORTANT THING you need to understand from the training script is that it is totally independent from the environments. The purpose of this training script is to set up the learning algorithm that you want to use in order to make your agent learn, regardless of what is being done in the environments. This means that you can change the algorithm you use to learn in the training script, without having to worry about modifying your environments' structure. And this is very powerful! Let's test it with the following exercises.

Exercise 2.1

a) In your workspace, create a new package called my_cartpole_training.

Execute in WebShell #1

catkin_create_pkg my_cartpole_training rospy openai_ros

b) Inside the package, create two new folders called launch and config.

c) Inside the launch folder, create a new file named start_training.launch. You can copy the following contents into it:

start_training.launch
<launch>
    <rosparam command="load" file="$(find my_cartpole_training)/config/cartpole_qlearn_params.yaml" />
    <!-- Launch the training system -->
    <node pkg="my_cartpole_training" name="cartpole_gym" type="start_training.py" output="screen"/> -->
</launch>
start_training.launch

As you can see, the launch file is quite self-explanatory. What you are doing is, first, loading some parameters that Qlearn needs in order to train, and second, launching the training script that you’ve reviewed in the above section.

d) The next step will be to create this parameters file for Qlearn. Inside the config folder, create a new file named cartpole_qlearn_params.yaml. You can copy the following contents into it:

cartpole_qlearn_params.yaml
cartpole_v0: #namespace

    n_actions: 4
    control_type: "velocity"
    
    #environment variables
    min_pole_angle: -0.7 #-23°
    max_pole_angle: 0.7 #23°
    
    max_base_velocity: 50
    max_base_pose_x: 2.5
    min_base_pose_x: -2.5
    
    # those parameters are very important. They are affecting the learning experience
    # They indicate how fast the control can be
    # If the running step is too large, then there will be a long time between 2 control commands
    # If the pos_step is too large, then the changes in position will be very abrupt
    running_step: 0.04 # amount of time the control will be executed
    pos_step: 0.016     # increment in position for each command
    
    #qlearn parameters
    alpha: 0.5
    gamma: 0.9
    epsilon: 0.1
    epsilon_discount: 0.999
    nepisodes: 1000
    nsteps: 1000
    number_splits: 10 #set to change the number of state splits for the continuous problem and also the number of env_variable splits

    init_pos: 0.0 # Position in which the base will start
    wait_time: 0.1 # Time to wait in the reset phases
cartpole_qlearn_params.yaml

In this file, we are just setting some parameters that we’ll need for our training. These parameters will first be loaded into the ROS parameter server, and later retrieved by our Training Environments.
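
Since these parameters are loaded into the ROS Parameter Server under the /cartpole_v0 namespace, you can verify them at any time (once the launch file has loaded them) with the standard ROS tools, for example:

rosparam list /cartpole_v0
rosparam get /cartpole_v0/alpha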

e) Now, you will need to place a file that defines the Qlearn algorithm inside the src folder of your package. Create a file named qlearn.py here, and place the following contents into it:

qlearn.py
'''
Q-learning approach for different RL problems
as part of the basic series on reinforcement learning @
https://github.com/vmayoral/basic_reinforcement_learning
 
Inspired by https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
 
        @author: Victor Mayoral Vilches <[email protected]>
'''
import random

class QLearn:
    def __init__(self, actions, epsilon, alpha, gamma):
        self.q = {}
        self.epsilon = epsilon  # exploration constant
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.actions = actions

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)

    def learnQ(self, state, action, reward, value):
        '''
        Q-learning:
            Q(s, a) += alpha * (reward(s, a) + gamma * max(Q(s', a')) - Q(s, a))
        '''
        oldv = self.q.get((state, action), None)
        if oldv is None:
            self.q[(state, action)] = reward
        else:
            self.q[(state, action)] = oldv + self.alpha * (value - oldv)

    def chooseAction(self, state, return_q=False):
        q = [self.getQ(state, a) for a in self.actions]
        maxQ = max(q)

        if random.random() < self.epsilon:
            minQ = min(q); mag = max(abs(minQ), abs(maxQ))
            # add random values to all the actions, recalculate maxQ
            q = [q[i] + random.random() * mag - .5 * mag for i in range(len(self.actions))] 
            maxQ = max(q)

        count = q.count(maxQ)
        # In case there're several state-action max values 
        # we select a random one among them
        if count > 1:
            best = [i for i in range(len(self.actions)) if q[i] == maxQ]
            i = random.choice(best)
        else:
            i = q.index(maxQ)

        action = self.actions[i]        
        if return_q: # if they want it, give it!
            return action, q
        return action

    def learn(self, state1, action1, reward, state2):
        maxqnew = max([self.getQ(state2, a) for a in self.actions])
        self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)
qlearn.py

This file basically defines the Qlearn algorithm. It is imported into our training script in the following line:

import qlearn

Q-Learn is a very basic RL algorithm; if you want to learn more about how it works, any introductory Reinforcement Learning reference covers it.
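
Just to illustrate how the training script uses this class, here is a tiny, self-contained example with made-up states and rewards (it assumes qlearn.py is on the Python path, for instance in the same src folder; it is not part of the course code):

import qlearn

agent = qlearn.QLearn(actions=range(2), alpha=0.5, gamma=0.9, epsilon=0.1)

state = "start"
action = agent.chooseAction(state)         # epsilon-greedy choice between actions 0 and 1
agent.learn(state, action, 1.0, "middle")  # reward of 1.0, next state "middle"
print(agent.getQ(state, action))           # the stored Q-value for (state, action)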

f) Finally, also inside the src folder of your package, you’ll have to place the start_training.py script that we’ve been reviewing. You will need to modify the name of the package where to store the training results:

pkg_path = rospack.get_path('my_cartpole_training')

g) Also create the training_results folder inside your package.

Execute in WebShell #1

roscd my_cartpole_training
mkdir training_results

h) And that’s it! Now you have created a package in order to reproduce the demo you executed in the first step. Run now your launch file and see how the CartPole starts to learn.

End of Exercise 2.1
Exercise 2.2

a) Now, let’s change the learning algorithm that we use for learning. In this case, instead of the Qlearn algorithm, we’ll use the sarsa algorithm. Below, you have the file that defines the sarsa learning algorithm. Create a new file named sarsa.py and copy the contents below into it.

b) Make the necessary changes to your package in order to use the sarsa algorithm for training, instead of the Q-learn one (a sketch of the required change to the steps loop is shown after the sarsa.py code below).

sarsa.py
import random


class Sarsa:
    def __init__(self, actions, epsilon=0.1, alpha=0.2, gamma=0.9):
        self.q = {}

        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.actions = actions

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)

    def learnQ(self, state, action, reward, value):
        oldv = self.q.get((state, action), None)
        if oldv is None:
            self.q[(state, action)] = reward 
        else:
            self.q[(state, action)] = oldv + self.alpha * (value - oldv)

    def chooseAction(self, state):
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            q = [self.getQ(state, a) for a in self.actions]
            maxQ = max(q)
            count = q.count(maxQ)
            if count > 1:
                best = [i for i in range(len(self.actions)) if q[i] == maxQ]
                i = random.choice(best)
            else:
                i = q.index(maxQ)

            action = self.actions[i]
        return action

    def learn(self, state1, action1, reward, state2, action2):
        qnext = self.getQ(state2, action2)
        self.learnQ(state1, action1, reward, reward + self.gamma * qnext)
sarsa.py
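
Since sarsa is an on-policy algorithm, its learn() function also needs the action that will actually be taken in the next state. This is a minimal sketch of how the steps loop of start_training.py could change, assuming you import sarsa and create the agent as sarsa_agent = sarsa.Sarsa(actions=range(env.action_space.n), ...) instead of using qlearn:

action = sarsa_agent.chooseAction(state)
for i in range(nsteps):
    # Execute the current action and observe the result
    observation, reward, done, info = env.step(action)
    cumulated_reward += reward
    nextState = ''.join(map(str, observation))
    # Choose the next action BEFORE learning: sarsa needs it
    nextAction = sarsa_agent.chooseAction(nextState)
    sarsa_agent.learn(state, action, reward, nextState, nextAction)
    if not done:
        state = nextState
        action = nextAction
    else:
        break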

c) Launch your Training script again and check if you see any differences between the performances of Qlearn and sarsa.

End of Exercise 2.2

Training Environments

In the previous code, you saw that we were starting an environment called CartPoleStayUp-v0. But what is this environment for? What does it do?

An environment is, basically, a problem (like keeping the CartPole's pole upright) with a minimal interface that an agent (a robot) can interact with. The environments in OpenAI Gym are designed to allow objective testing and benchmarking of an agent's abilities.

In OpenAI Gym, you can usually define everything with just one environment. But ours is a more complex situation, since we want to integrate OpenAI Gym with ROS and Gazebo.

That’s why we have created a structure to organize and split the OpenAI Gym environment into three different types:

  • Task Environment: This environment is dedicated to implementing all the functions that a specific training session will use. For instance, using the same Robot Environment, you could create different Task Environments in order to train your robot in different tasks, or using different methods (actions).
    • You need to have different Task Environments for different tasks to be learnt

  • Robot Environment: This environment is dedicated to implementing all the functions that a specific robot will use during its training. It will not include, though, the specific tasks that depend on the training that you are going to implement.
    • You need to have different Robot Environments for different Robots to learn

  • Gazebo Environment: This environment is common for any training or robot that you want to implement. This environment will generate all the connections between your robot and Gazebo, so that you don’t have to worry about it.

It’s very important that you understand that this environment structure will always be the same, regardless of:

  • the problem you want to solve
  • the robot you want to train
  • the learning algorithm that you want to use

Also, these three environments will be connected to each other in the sense that each environment will inherit from the previous one. It goes like this:

  • Task Environment inherits from the Robot Environment.
  • Robot Environment inherits from the Gazebo Environment.
  • Gazebo Environment inherits from the Gym Environment. The Gym Environment (gym.Env) is the most basic Environment structure provided by the OpenAI Gym framework.

Tons of information, right? But don’t worry too much, since we’ll have an extensive look at all of these environments. For now, let’s focus on the Task Environment.

First environment class: Task Environment

The Task Environment is the highest one in the inheritance structure, which also means that it is the most specific one. This environment fully depends on the task we want the robot to learn.

Next, you can see an example of a Task Environment that trains the CartPole to keep the Pole up.

stay_up.py
from gym import utils
from openai_ros.robot_envs import cartpole_env
from gym.envs.registration import register
from gym import error, spaces
import rospy
import math
import numpy as np

# The path is __init__.py of openai_ros, where we import the CartPoleStayUpEnv directly
register(
        id='CartPoleStayUp-v0',
        entry_point='openai_ros:CartPoleStayUpEnv',
        timestep_limit=1000,
    )

class CartPoleStayUpEnv(cartpole_env.CartPoleEnv):
    def __init__(self):
        
        self.get_params()
        
        self.action_space = spaces.Discrete(self.n_actions)
        
        cartpole_env.CartPoleEnv.__init__(
            self, control_type=self.control_type
            )
            
    def get_params(self):
        #get configuration parameters
        self.n_actions = rospy.get_param('/cartpole_v0/n_actions')
        self.min_pole_angle = rospy.get_param('/cartpole_v0/min_pole_angle')
        self.max_pole_angle = rospy.get_param('/cartpole_v0/max_pole_angle')
        self.max_base_velocity = rospy.get_param('/cartpole_v0/max_base_velocity')
        self.min_base_pose_x = rospy.get_param('/cartpole_v0/min_base_pose_x')
        self.max_base_pose_x = rospy.get_param('/cartpole_v0/max_base_pose_x')
        self.pos_step = rospy.get_param('/cartpole_v0/pos_step')
        self.running_step = rospy.get_param('/cartpole_v0/running_step')
        self.init_pos = rospy.get_param('/cartpole_v0/init_pos')
        self.wait_time = rospy.get_param('/cartpole_v0/wait_time')
        self.control_type = rospy.get_param('/cartpole_v0/control_type')
        
    def _set_action(self, action):
        
        # Take action
        if action == 0: #LEFT
            rospy.loginfo("GO LEFT...")
            self.pos[0] -= self.pos_step
        elif action == 1: #RIGHT
            rospy.loginfo("GO RIGHT...")
            self.pos[0] += self.pos_step
        elif action == 2: #LEFT BIG
            rospy.loginfo("GO LEFT BIG...")
            self.pos[0] -= self.pos_step * 10
        elif action == 3: #RIGHT BIG
            rospy.loginfo("GO RIGHT BIG...")
            self.pos[0] += self.pos_step * 10
            
            
        # Apply action to simulation.
        rospy.loginfo("MOVING TO POS=="+str(self.pos))

        # 1st: unpause simulation
        rospy.logdebug("Unpause SIM...")
        self.gazebo.unpauseSim()

        self.move_joints(self.pos)
        rospy.logdebug("Wait for some time to execute movement, time="+str(self.running_step))
        rospy.sleep(self.running_step) #wait for some time
        rospy.logdebug("DONE Wait for some time to execute movement, time=" + str(self.running_step))

        # 3rd: pause simulation
        rospy.logdebug("Pause SIM...")
        self.gazebo.pauseSim()

    def _get_obs(self):
        
        data = self.joints
        #       base_position               base_velocity              pole angle                 pole velocity
        obs = [round(data.position[1],1), round(data.velocity[1],1), round(data.position[0],1), round(data.velocity[0],1)]

        return obs
        
    def _is_done(self, observations):
        done = False
        data = self.joints
        
        rospy.loginfo("BASEPOSITION=="+str(observations[0]))
        rospy.loginfo("POLE ANGLE==" + str(observations[2]))
        if (self.min_base_pose_x >= observations[0] or observations[0] >= self.max_base_pose_x): #check if the base is still within its allowed limits
            rospy.logerr("Base Outside Limits==>min="+str(self.min_base_pose_x)+",pos="+str(observations[0])+",max="+str(self.max_base_pose_x))
            done = True
        if (self.min_pole_angle >= observations[2] or observations[2] >= self.max_pole_angle): #check if pole has toppled over
            rospy.logerr(
                "Pole Angle Outside Limits==>min=" + str(self.min_pole_angle) + ",pos=" + str(observations[2]) + ",max=" + str(
                    self.max_pole_angle))
            done = True
            
        return done
        
    def _compute_reward(self, observations, done):

        """
        Gives more points for staying upright, gets data from given observations to avoid
        having different data than other previous functions
        :return:reward
        """
        
        if not done:
            reward = 1.0
        elif self.steps_beyond_done is None:
            # Pole just fell!
            self.steps_beyond_done = 0
            reward = 1.0
        else:
            if self.steps_beyond_done == 0:
                logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
            self.steps_beyond_done += 1
            reward = 0.0
        
        return reward
        
    def _init_env_variables(self):
        """
        Inits variables needed to be initialised each time we reset at the start
        of an episode.
        :return:
        """
        self.steps_beyond_done = None
        
    def _set_init_pose(self):
        """
        Sets joints to initial position [0,0,0]
        :return:
        """
        
        self.check_publishers_connection()
        
        # Reset Internal pos variable
        self.init_internal_vars(self.init_pos)
        self.move_joints(self.pos)
stay_up.py

Code Analysis

First of all, as you may have already guessed, this environment will inherit from the previous one, the Robot Environment (in this case, it is called CartPoleEnv).

class CartPoleStayUpEnv(cartpole_env.CartPoleEnv)

In the initialization of this Task Environment, we also need to pass the required variables to the Robot Environment: in this case, it's the control_type.

cartpole_env.CartPoleEnv.__init__(self, control_type=self.control_type)

Also, in the initialization of the Task Environment class, we are defining the action space of our environment. This is, basically, the number of actions (self.n_actions) that the robot will be able to take. This number will be taken from the ROS Parameter Server, and it is loaded from a YAML file.

self.action_space = spaces.Discrete(self.n_actions)
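
As a quick illustration (this snippet is not part of the environment code), spaces.Discrete(n) simply means that the valid actions are the integers from 0 to n-1, which is why the training script can build its action list with range(env.action_space.n):

from gym import spaces

action_space = spaces.Discrete(4)
print(action_space.n)         # 4
print(action_space.sample())  # a random integer between 0 and 3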

Let’s now have a look at the different functions that we have in this Task Environment.

def get_params(self):
        
        self.n_actions = rospy.get_param('/cartpole_v0/n_actions')
        self.min_pole_angle = rospy.get_param('/cartpole_v0/min_pole_angle')
        self.max_pole_angle = rospy.get_param('/cartpole_v0/max_pole_angle')
        self.max_base_velocity = rospy.get_param('/cartpole_v0/max_base_velocity')
        self.min_base_pose_x = rospy.get_param('/cartpole_v0/min_base_pose_x')
        self.max_base_pose_x = rospy.get_param('/cartpole_v0/max_base_pose_x')
        self.pos_step = rospy.get_param('/cartpole_v0/pos_step')
        self.running_step = rospy.get_param('/cartpole_v0/running_step')
        self.init_pos = rospy.get_param('/cartpole_v0/init_pos')
        self.wait_time = rospy.get_param('/cartpole_v0/wait_time')
        self.control_type = rospy.get_param('/cartpole_v0/control_type')

The get_params(self) function gets all the configuration parameters that the environment needs, which were loaded from the YAML file into the ROS Parameter Server, and stores them in class variables of the environment.

def _set_action(self, action):
        
        # Take action
        if action == 0: #LEFT
            rospy.loginfo("GO LEFT...")
            self.pos[0] -= self.pos_step
        elif action == 1: #RIGHT
            rospy.loginfo("GO RIGHT...")
            self.pos[0] += self.pos_step
        elif action == 2: #LEFT BIG
            rospy.loginfo("GO LEFT BIG...")
            self.pos[0] -= self.pos_step * 10
        elif action == 3: #RIGHT BIG
            rospy.loginfo("GO RIGHT BIG...")
            self.pos[0] += self.pos_step * 10
            
            
        # Apply action to simulation.
        rospy.loginfo("MOVING TO POS=="+str(self.pos))

        # 1st: unpause simulation
        rospy.logdebug("Unpause SIM...")
        self.gazebo.unpauseSim()

        self.move_joints(self.pos)
        rospy.logdebug("Wait for some time to execute movement, time="+str(self.running_step))
        rospy.sleep(self.running_step) #wait for some time
        rospy.logdebug("DONE Wait for some time to execute movement, time=" + str(self.running_step))

        # 3rd: pause simulation
        rospy.logdebug("Pause SIM...")
        self.gazebo.pauseSim()

The _set_action(self, action) function defines what each action will do. Each action will be represented by an integer between 0 and 3. And depending on the action number that this function receives, the CartPole will do one thing or another.

def _get_obs(self):
        
        data = self.joints
        #       base_position               base_velocity              pole angle                 pole velocity
        obs = [round(data.position[1],1), round(data.velocity[1],1), round(data.position[0],1), round(data.velocity[0],1)]

        return obs

The _get_obs(self) function returns an array with all the data we need from the robot sensors in order to learn. In this case, we are getting the position and velocity of the Cart, and the angle and velocity of the Pole, which is the most relevant data to have on whether the agent is doing well or not.

def _is_done(self, observations):
        done = False
        data = self.joints
        
        rospy.loginfo("BASEPOSITION=="+str(observations[0]))
        rospy.loginfo("POLE ANGLE==" + str(observations[2]))
        if (self.min_base_pose_x >= observations[0] or observations[0] >= self.max_base_pose_x): #check if the base is still within its allowed limits
            rospy.logerr("Base Outside Limits==>min="+str(self.min_base_pose_x)+",pos="+str(observations[0])+",max="+str(self.max_base_pose_x))
            done = True
        if (self.min_pole_angle >= observations[2] or observations[2] >= self.max_pole_angle): #check if pole has toppled over
            rospy.logerr(
                "Pole Angle Outside Limits==>min=" + str(self.min_pole_angle) + ",pos=" + str(observations[2]) + ",max=" + str(
                    self.max_pole_angle))
            done = True
            
        return done

The _is_done(self, observations) function decides whether the episode has finished or not, and returns a boolean that says so. For the CartPole case, we say the episode has finished either when the Cart goes beyond the limits of the rail or when the Pole tilts past its angle limits.

def _compute_reward(self, observations, done):

        """
        Gives more points for staying upright, gets data from given observations to avoid
        having different data than other previous functions
        :return:reward
        """

        if not done:
            reward = 1.0
        elif self.steps_beyond_done is None:
            # Pole just fell!
            self.steps_beyond_done = 0
            reward = 1.0
        else:
            if self.steps_beyond_done == 0:
                logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
            self.steps_beyond_done += 1
            reward = 0.0
        
        return reward

The _compute_reward(self, observations, done) function decides the reward to give for each step. In this specific case of the CartPole, it gives one reward point for each step in which the Pole is still upright (the steps_beyond_done variable keeps track of the steps taken after the Pole has fallen).

def _init_env_variables(self):
        """
        Inits variables needed to be initialised each time we reset at the start
        of an episode.
        :return:
        """
        self.steps_beyond_done = None

The _init_env_variables(self) function initializes all the variables that your particular task needs to reset at the start of each episode. In the CartPole case, as you can see, we are initializing the steps_beyond_done variable.

def _set_init_pose(self):
        """
        Sets joints to initial position [0,0,0]
        :return:
        """
        
        self.check_publishers_connection()
        
        # Reset Internal pos variable
        self.init_internal_vars(self.init_pos)
        self.move_joints(self.pos)

The _set_init_pose(self) function sets the initial position of the joints of your robot. In this case, the initial position will be with the Cart parallel to the rail, and the Pole perpendicular to it, at an angle of 90º.

It’s very important that you understand that you can add any function that you want into this Task Environment. However, there are some functions that will ALWAYS need to be defined here because they are required for the rest of the Environments to work properly. These functions are:

  • The _set_action() function
  • The _get_obs() function
  • The _is_done() function
  • The _compute_reward() function
  • The _set_init_pose() function
  • The _init_env_variables() function

If, for your specific task, you don't need to implement some of these functions, you can just make them pass. But you will need to add them to your code anyway, or you will get a NotImplementedError().
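
For example, if your particular task did not need any custom episode initialization, a minimal placeholder like this (not the CartPole implementation shown above) would be enough:

def _init_env_variables(self):
    # Nothing specific to initialize for this task
    pass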

Exercise 2.3

a) Inside the src folder of the my_cartpole_training package, add a new script called my_cartpole_task_env.py. Into this new file, copy the contents of the stay_up.py script that you have just reviewed.

b) Now, modify the Task Environment code so that it takes only two actions: move right and move left.

NOTE: Also, you will need to modify the registration id of the environment, in order to avoid conflicts. You just need to change the v0 for a v1.

register(
        id='CartPoleStayUp-v1',
        entry_point='my_cartpole_task_env:CartPoleStayUpEnv',
        timestep_limit=1000,
    )

c) Next, you’ll need to also modify the file start_training.py, so that it takes your Task Environment file instead of the default one.

import my_cartpole_task_env

Also, remember to modify the name of the Environment, which you have updated in the previous step.

env = gym.make('CartPoleStayUp-v1')

d) Finally, launch the training script and check out how the modifications affect the learning process.

End of Exercise 2.3
Exercise 2.4

a) Modify the Task Environment so that it now gives more reward for going to the left than for going to the right.

End of Exercise 2.4

Second environment class: Robot Environment

The robot environment will then contain all the functions associated with the specific “robot” that you want to train. This means that it will contain all the functionalities that your robot will need in order to be controlled.

For instance, let’s focus on the CartPole example. Let’s say that in order to be able to control a CartPole environment, we will basically need three things:

  • A way to check that all the systems (ROS topics, Publishers, etc.) are ready and working okay
  • A way to move the Cart alongside the rails
  • A way to get all the data about all the sensors that we want to evaluate (including the position of the Pole)

For this case, then, we would need to define all these functions in the Robot Environment. Let's see an example of a Robot Environment script that takes into account the three points introduced above. You can check out the code here:

cartpole_env.py
#!/usr/bin/env python

import gym
import rospy
import roslaunch
import time
import numpy as np
from gym import utils, spaces
from geometry_msgs.msg import Twist
from std_srvs.srv import Empty
from gym.utils import seeding
from gym.envs.registration import register
import copy
import math
import os

from sensor_msgs.msg import JointState
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint
from std_msgs.msg import Float64
from gazebo_msgs.srv import SetLinkState
from gazebo_msgs.msg import LinkState
from rosgraph_msgs.msg import Clock
from openai_ros import robot_gazebo_env

class CartPoleEnv(robot_gazebo_env.RobotGazeboEnv):
    def __init__(
        self, control_type
    ):
        
        self.publishers_array = []
        self._base_pub = rospy.Publisher('/cartpole_v0/foot_joint_velocity_controller/command', Float64, queue_size=1)
        self._pole_pub = rospy.Publisher('/cartpole_v0/pole_joint_velocity_controller/command', Float64, queue_size=1)
        self.publishers_array.append(self._base_pub)
        self.publishers_array.append(self._pole_pub)
        
        rospy.Subscriber("/cartpole_v0/joint_states", JointState, self.joints_callback)
        
        self.control_type = control_type
        if self.control_type == "velocity":
            self.controllers_list = ['joint_state_controller',
                                    'pole_joint_velocity_controller',
                                    'foot_joint_velocity_controller',
                                    ]
                                    
        elif self.control_type == "position":
            self.controllers_list = ['joint_state_controller',
                                    'pole_joint_position_controller',
                                    'foot_joint_position_controller',
                                    ]
                                    
        elif self.control_type == "effort":
            self.controllers_list = ['joint_state_controller',
                                    'pole_joint_effort_controller',
                                    'foot_joint_effort_controller',
                                    ]

        self.robot_name_space = "cartpole_v0"
        self.reset_controls = True

        # Seed the environment
        self._seed()
        self.steps_beyond_done = None
        
        super(CartPoleEnv, self).__init__(
            controllers_list=self.controllers_list,
            robot_name_space=self.robot_name_space,
            reset_controls=self.reset_controls
            )

    def joints_callback(self, data):
        self.joints = data

    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
        
    # RobotEnv methods
    # ----------------------------

    def _env_setup(self, initial_qpos):
        self.init_internal_vars(self.init_pos)
        self.set_init_pose()
        self.check_all_systems_ready()
        
    def init_internal_vars(self, init_pos_value):
        self.pos = [init_pos_value]
        self.joints = None
        
    def check_publishers_connection(self):
        """
        Checks that all the publishers are working
        :return:
        """
        rate = rospy.Rate(10)  # 10hz
        while (self._base_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
            rospy.logdebug("No susbribers to _base_pub yet so we wait and try again")
            try:
                rate.sleep()
            except rospy.ROSInterruptException:
                # This is to avoid error when world is rested, time when backwards.
                pass
        rospy.logdebug("_base_pub Publisher Connected")

        while (self._pole_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
            rospy.logdebug("No susbribers to _pole_pub yet so we wait and try again")
            try:
                rate.sleep()
            except rospy.ROSInterruptException:
                # This is to avoid error when world is rested, time when backwards.
                pass
        rospy.logdebug("_pole_pub Publisher Connected")

        rospy.logdebug("All Publishers READY")

    def _check_all_systems_ready(self, init=True):
        self.base_position = None
        while self.base_position is None and not rospy.is_shutdown():
            try:
                self.base_position = rospy.wait_for_message("/cartpole_v0/joint_states", JointState, timeout=1.0)
                rospy.logdebug("Current cartpole_v0/joint_states READY=>"+str(self.base_position))
                if init:
                    # We Check all the sensors are in their initial values
                    positions_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.position)
                    velocity_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.velocity)
                    efforts_ok = all(abs(i) <= 1.0e-01 for i in self.base_position.effort)
                    base_data_ok = positions_ok and velocity_ok and efforts_ok
                    rospy.logdebug("Checking Init Values Ok=>" + str(base_data_ok))
            except:
                rospy.logerr("Current cartpole_v0/joint_states not ready yet, retrying for getting joint_states")
        rospy.logdebug("ALL SYSTEMS READY")
        
            
    def move_joints(self, joints_array):
        joint_value = Float64()
        joint_value.data = joints_array[0]
        rospy.logdebug("Single Base JointsPos>>"+str(joint_value))
        self._base_pub.publish(joint_value)

        
    def get_clock_time(self):
        self.clock_time = None
        while self.clock_time is None and not rospy.is_shutdown():
            try:
                self.clock_time = rospy.wait_for_message("/clock", Clock, timeout=1.0)
                rospy.logdebug("Current clock_time READY=>" + str(self.clock_time))
            except:
                rospy.logdebug("Current clock_time not ready yet, retrying for getting Current clock_time")
        return self.clock_time
cartpole_env.py

Code Analysis

We can differentiate two main parts in the CartPoleEnv class: the initialization of the class and the specific functions defined below.

First of all, you can see that this class inherits from the RobotGazeboEnv class:

class CartPoleEnv(robot_gazebo_env.RobotGazeboEnv):

This means that all the functions that are defined in the Gazebo Environment will also be accessible from this class.

In the init() function of the class, we are doing several things. First, we are creating all the Publishers and Subscribers required for this specific Cartpole environment:

self.publishers_array = []
self._base_pub = rospy.Publisher('/cartpole_v0/foot_joint_velocity_controller/command', Float64, queue_size=1)
self._pole_pub = rospy.Publisher('/cartpole_v0/pole_joint_velocity_controller/command', Float64, queue_size=1)
self.publishers_array.append(self._base_pub)
self.publishers_array.append(self._pole_pub)

rospy.Subscriber("/cartpole_v0/joint_states", JointState, self.joints_callback)

These are the Publishers needed to move the Cart alongside the rails, and the Subscribers needed to know the positions of the joints at any given time.
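
If you want to inspect these topics by hand (assuming the simulation is running and unpaused, and the velocity controllers are loaded), you can publish a command and watch the joint states directly from a WebShell:

rostopic pub -1 /cartpole_v0/foot_joint_velocity_controller/command std_msgs/Float64 "data: 1.0"
rostopic echo /cartpole_v0/joint_states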

In the following section:

self.control_type = control_type
if self.control_type == "velocity":
    self.controllers_list = ['joint_state_controller',
                            'pole_joint_velocity_controller',
                            'foot_joint_velocity_controller',
                            ]

elif self.control_type == "position":
    self.controllers_list = ['joint_state_controller',
                            'pole_joint_position_controller',
                            'foot_joint_position_controller',
                            ]

elif self.control_type == "effort":
    self.controllers_list = ['joint_state_controller',
                            'pole_joint_effort_controller',
                            'foot_joint_effort_controller',
                            ]

We are defining which kind of controllers we will use for our joints: velocity, position, or effort controllers. We decide this depending on the contents of the control_type variable, which has been defined in the Task Environment. Depending on this variable, then, we will fill the value of the controllers_list variable, which needs to be passed to the Gazebo Environment.

Alongside the controllers_list variable, we will need to pass two other variables to the Gazebo Environment: robot_name_space and reset_controls. This is done here:

super(CartPoleEnv, self).__init__(
            controllers_list=self.controllers_list,
            robot_name_space=self.robot_name_space,
            reset_controls=self.reset_controls
            )

The values of the robot_name_space and reset_controls variables are assigned right above:

self.robot_name_space = "cartpole_v0"
self.reset_controls = True

And this is basically it for the initialization of the environment. Now, let’s have a look at some of the most important functions of the environment.

def joints_callback(self, data):
        self.joints = data

The joints_callback(self, data) function gets the data from the Subscriber of the /cartpole_v0/joint_states topic and stores it in the self.joints class variable. If you remember from the previous section, this self.joints class variable is used in order to generate the observation (inside the _get_obs() function).

def check_publishers_connection(self):
        """
        Checks that all the publishers are working
        :return:
        """
        rate = rospy.Rate(10)  # 10hz
        while (self._base_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
            rospy.logdebug("No susbribers to _base_pub yet so we wait and try again")
            try:
                rate.sleep()
            except rospy.ROSInterruptException:
                # This is to avoid error when world is rested, time when backwards.
                pass
        rospy.logdebug("_base_pub Publisher Connected")

        while (self._pole_pub.get_num_connections() == 0 and not rospy.is_shutdown()):
            rospy.logdebug("No susbribers to _pole_pub yet so we wait and try again")
            try:
                rate.sleep()
            except rospy.ROSInterruptException:
                # This is to avoid error when world is rested, time when backwards.
                pass
        rospy.logdebug("_pole_pub Publisher Connected")

        rospy.logdebug("All Publishers READY")

The check_publishers_connection(self) function basically checks that the Publishers are connected and working.

def _check_all_systems_ready(self, init=True):
        self.base_position = None
        while self.base_position is None and not rospy.is_shutdown():
            try:
                self.base_position = rospy.wait_for_message("/cartpole_v0/joint_states", JointState, timeout=1.0)
                rospy.logdebug("Current cartpole_v0/joint_states READY=>"+str(self.base_position))
                if init:
                    # We Check all the sensors are in their initial values
                    positions_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.position)
                    velocity_ok = all(abs(i) <= 1.0e-02 for i in self.base_position.velocity)
                    efforts_ok = all(abs(i) <= 1.0e-01 for i in self.base_position.effort)
                    base_data_ok = positions_ok and velocity_ok and efforts_ok
                    rospy.logdebug("Checking Init Values Ok=>" + str(base_data_ok))
            except:
                rospy.logerr("Current cartpole_v0/joint_states not ready yet, retrying for getting joint_states")
        rospy.logdebug("ALL SYSTEMS READY")

The _check_all_systems_ready(self, init=True) function checks that the joint_states topic is being received and, at initialization, that all the joints are at their initial values.

def move_joints(self, joints_array):
        joint_value = Float64()
        joint_value.data = joints_array[0]
        rospy.logdebug("Single Base JointsPos>>"+str(joint_value))
        self._base_pub.publish(joint_value)

Finally, the move_joints(self, joints_array) function is in charge of moving the joints (the Cart, in this case) in the simulation.

It’s very important that you understand that you can add all the functions that you want to this Robot Environment. However, there are some functions that will ALWAYS need to be defined here because they are required for the rest of the Environments to work properly. These functions include:

  • The init function
  • The _check_all_systems_ready() function

Third environment class: Gazebo Environment

The Gazebo Environment is used to connect the Robot Environment to the Gazebo simulator. For instance, it takes care of the resets of the simulator after each step, or the resets of the controllers (if needed).

The most important thing you need to know about this environment is that it will be transparent to you. And what does this mean? Well, it basically means that you don’t have to worry about it. This environment will always be the same, regardless of the robot or the kind of task to solve. So, you won’t have to change it or work over it. Good news, right?

Here you can have a look at how this environment looks:

robot_gazebo_env.py
import rospy
import gym
from gym.utils import seeding
from .gazebo_connection import GazeboConnection
from .controllers_connection import ControllersConnection
#https://bitbucket.org/theconstructcore/theconstruct_msgs/src/master/msg/RLExperimentInfo.msg
from theconstruct_msgs.msg import RLExperimentInfo

# https://github.com/openai/gym/blob/master/gym/core.py
class RobotGazeboEnv(gym.Env):

    def __init__(self, robot_name_space, controllers_list, reset_controls):

        # To reset Simulations
        self.gazebo = GazeboConnection()
        self.controllers_object = ControllersConnection(namespace=robot_name_space, controllers_list=controllers_list)
        self.reset_controls = reset_controls
        self.seed()

        # Set up ROS related variables
        self.episode_num = 0
        self.reward_pub = rospy.Publisher('/openai/reward', RLExperimentInfo, queue_size=1)

    # Env methods
    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        """
        Function executed each time step.
        Here we get the action execute it in a time step and retrieve the
        observations generated by that action.
        :param action:
        :return: obs, reward, done, info
        """

        """
        Here we should convert the action num to movement action, execute the action in the
        simulation and get the observations result of performing that action.
        """
        self.gazebo.unpauseSim()
        self._set_action(action)
        self.gazebo.pauseSim()
        obs = self._get_obs()
        done = self._is_done(obs)
        info = {}
        reward = self._compute_reward(obs, done)
        self._publish_reward_topic(reward, self.episode_num)

        return obs, reward, done, info

    def reset(self):
        rospy.logdebug("Reseting RobotGazeboEnvironment")
        self._reset_sim()
        self._init_env_variables()
        self._update_episode()
        obs = self._get_obs()
        return obs

    def close(self):
        """
        Function executed when closing the environment.
        Use it for closing GUIS and other systems that need closing.
        :return:
        """
        rospy.logdebug("Closing RobotGazeboEnvironment")
        rospy.signal_shutdown("Closing RobotGazeboEnvironment")

    def _update_episode(self):
        """
        Increases the episode number by one
        :return:
        """
        self.episode_num += 1

    def _publish_reward_topic(self, reward, episode_number=1):
        """
        This function publishes the given reward in the reward topic for
        easy access from ROS infrastructure.
        :param reward:
        :param episode_number:
        :return:
        """
        reward_msg = RLExperimentInfo()
        reward_msg.episode_number = episode_number
        reward_msg.episode_reward = reward
        self.reward_pub.publish(reward_msg)

    # Extension methods
    # ----------------------------

    def _reset_sim(self):
        """Resets a simulation
        """
        if self.reset_controls :
            self.gazebo.unpauseSim()
            self.controllers_object.reset_controllers()
            self._check_all_systems_ready()
            self._set_init_pose()
            self.gazebo.pauseSim()
            self.gazebo.resetSim()
            self.gazebo.unpauseSim()
            self.controllers_object.reset_controllers()
            self._check_all_systems_ready()
            self.gazebo.pauseSim()
            
        else:
            self.gazebo.unpauseSim()
            
            self._check_all_systems_ready()
            self._set_init_pose()
            self.gazebo.pauseSim()
            self.gazebo.resetSim()
            self.gazebo.unpauseSim()
            
            self._check_all_systems_ready()
            self.gazebo.pauseSim()
        

        return True

    def _set_init_pose(self):
        """Sets the Robot in its init pose
        """
        raise NotImplementedError()

    def _check_all_systems_ready(self):
        """
        Checks that all the sensors, publishers and other simulation systems are
        operational.
        """
        raise NotImplementedError()

    def _get_obs(self):
        """Returns the observation.
        """
        raise NotImplementedError()

    def _init_env_variables(self):
        """Inits variables needed to be initialised each time we reset at the start
        of an episode.
        """
        raise NotImplementedError()

    def _set_action(self, action):
        """Applies the given action to the simulation.
        """
        raise NotImplementedError()

    def _is_done(self, observations):
        """Indicates whether or not the episode is done ( the robot has fallen for example).
        """
        raise NotImplementedError()

    def _compute_reward(self, observations, done):
        """Calculates the reward to give based on the observations given.
        """
        raise NotImplementedError()

    def _env_setup(self, initial_qpos):
        """Initial configuration of the environment. Can be used to configure initial state
        and extract information from the simulation.
        """
        raise NotImplementedError()
robot_gazebo_env.py

Code Analysis

Within the RobotGazeboEnv class, the most important functions are:

  • step() function: It defines how to handle a simulation step during training. When to unpause the simulation, when to pause it again, etc.
  • reset() function: It defines how to reset the simulation after each episode of the training.
  • close() function: It closes the whole ROS system.
  • _publish_reward_topic() function: It publishes the reward of each step in the topic /openai/reward, for visualization purposes.

One important thing you need to know about this environment is that it requires three parameters in its constructor:

def __init__(self, robot_name_space, controllers_list, reset_controls):

  • robot_name_space: A string with the namespace of our robot.
  • controllers_list: An array containing the names of the controllers of the robot.
  • reset_controls: A boolean that tells whether or not to reset the controllers of the robot each time we reset the simulation.

So, when creating the Robot Environment, bear in mind that you’ll need to fill these three variables.

But also, within this environment, we have some functions that are defined, but not implemented (they raise a NotImplementedError()). These functions need to be implemented either in the Robot Environment or in the Task Environment. This is MANDATORY. These functions are:

  • _set_init_pose()
  • _check_all_systems_ready()
  • _get_obs()
  • _init_env_variables()
  • _set_action()
  • _is_done()
  • _compute_reward()
  • _env_setup()

Most of them have already been explained in the previous Robot and Task Environments. But you can review the function of each one in the robot_gazebo_env.py file.

Besides the functions introduced above, we also have the _reset_sim() function. Although this function is actually implemented, it can also be re-implemented in any of the other Environments. Let’s say that within the Gazebo Environment we have a default implementation, which should work for most of the cases. But if you need to change the behavior of how your simulation resets, you can re-implement this function either in the Robot or the Task Environments, and they will override the one defined in the Gazebo Environment.
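
As an illustration, overriding it is just a matter of redefining the function in your Robot or Task Environment. The following minimal sketch simply reproduces the default reset_controls=True behavior, so you can add or change steps where you need to:

def _reset_sim(self):
    self.gazebo.unpauseSim()
    self.controllers_object.reset_controllers()
    self._check_all_systems_ready()
    self._set_init_pose()      # your custom initial pose logic could go here
    self.gazebo.pauseSim()
    self.gazebo.resetSim()
    self.gazebo.unpauseSim()
    self.controllers_object.reset_controllers()
    self._check_all_systems_ready()
    self.gazebo.pauseSim()
    return True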

Exercise 2.5

a) Add a new script to the src folder of your my_cartpole_training package. You can name it, for instance, my_cartpole_robot_env.py. Into this new file, copy the contents of the cartpole_env.py script that you have just reviewed.

b) Next, you will need to make sure that the Task Environment is correctly importing and using your new Robot Environment file, my_cartpole_robot_env.py. Also, you will need to modify some references in the Task Environment to the old Robot Environment file.

c) Add a function in the Robot Environment file that changes the way in which the CartPole simulation is reset. For instance, make it reset to a position where the Pole is at an angle of 30º with respect to the Cart.

End of Exercise 2.5


Visualizing the reward

As you have seen in the previous section, where we discussed the Gazebo Environment, this environment publishes the reward into a topic named /openai/reward. This is very important, because it will allow us to visualize our reward using regular ROS tools. For instance, rqt_multiplot.
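
Before setting up any plot, you can also check the raw values directly from a WebShell with:

rostopic echo /openai/reward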

rqt_multiplot is similar to rqt_plot, since it also provides a GUI for visualizing 2D plots, but it’s a little bit more complex and provides more features. For instance, you can visualize multiple plots at the same time, or you can also customize the axis for your plot.

In the next exercise, we are going to see how to visualize the reward of our training using the rqt_multiplot tool.

Exercise 2.6

a) First of all, let’s start the rqt_multiplot tool.

Execute in WebShell #1

rosrun rqt_multiplot rqt_multiplot

b) Hit the icon with a screen in the top-right corner of the IDE window in order to open the Graphic Interface window.
You should now see the Graphic Interface window open, with the rqt_multiplot GUI inside it.
c) Next, let’s launch our training script, so that we start getting the rewards published into the /openai/reward topic.

Execute in WebShell #1

roslaunch cartpole_v0_training start_training.launch

d) Next, let’s configure our plot. For that, first, you’ll need to click on the Configure plot button, at the top-right side of the window.
e) Give a name to your new plot, and click on the Add curve icon.
f) Next, set the topic to /openai/reward. The rest of the fields will be autocompleted automatically. For the X-Axis, select the episode_number field. For the Y-Axis, select the episode_reward field.
When you are done, just press the Enter key on your keyboard.

g) Finally, in the plot visualization window, you will need to click on the Run plot button in order to start visualizing your new plot.
If everything went OK, you should see the reward curve being plotted and updated as the training progresses.
You can also save your plot configuration if you want. This way, you will be able to load it at any time later. Remember to save it into your catkin_ws/src workspace, so that it will also be saved when you close the Course.

End of Exercise 2.6

Congratulations!! You are now capable of creating your own Environments structure and making it learn using simple learning algorithms, like Qlearn.

Reposted from blog.csdn.net/weixin_42828571/article/details/104275781