Application of the Gym platform in reinforcement learning experiments

Original source: https://zhuanlan.zhihu.com/p/114392519 (Zhihu)

The implementation of reinforcement learning algorithms requires suitable platforms and tools. This case will first introduce the basic usage method of Gym, a commonly used reinforcement learning implementation platform, and then introduce the basic operation method of the experimental tool TensorFlow, which will lay a solid foundation for the construction and evaluation of powerful reinforcement learning algorithms.

Table of Contents

1. Introduction to common reinforcement learning experimental platforms
2. Experimental platform Gym
   2.1 Installation of Gym
   2.2 Built-in environments in Gym
   2.3 Basic usage of Gym
3. Experimental tool TensorFlow
   3.1 Installation of TensorFlow
   3.2 Use TensorFlow to build a fully connected neural network to approximate the state value function
4. Summary

1. Introduction to common reinforcement learning experimental platforms

How do we verify the quality of a reinforcement learning algorithm? Just as supervised learning relies on recognized benchmark data sets, reinforcement learning needs recognized platforms for simulating environments and for building, rendering, and experimenting with algorithms in them. There are many such experimental platforms:

DeepMind Lab

DeepMind Lab is an excellent research platform for reinforcement learning, providing a rich set of simulated environments. It is also highly customizable and extensible, with rich visuals in a sci-fi style and realistic effects.

Project Malmo

Project Malmo is a reinforcement learning experimental platform developed by Microsoft on top of Minecraft. It offers good flexibility for customizing environments and is also suitable for complex environments. However, Malmo currently only provides the Minecraft game environment.

VizDoom

VizDoom is a reinforcement learning experimental platform based on the game Doom. It supports multiple agents and testing agents in competitive environments, and it also provides off-screen rendering and single/multi-player modes. However, VizDoom only supports the Doom game environment.

OpenAI Gym

Gym is currently the most widely used reinforcement learning experimental platform. Below we will focus on how to use it.

2. Experimental platform Gym

OpenAI is a non-profit artificial intelligence research company founded by Elon Musk and Sam Altman. Gym is a reinforcement learning experimental environment library launched by OpenAI. It can be used to simulate real environments, build reinforcement learning algorithms, and test agents in these environments. Gym is compatible with algorithms written under TensorFlow, Theano, Keras, and other frameworks. Apart from a small number of commercial libraries it depends on, the entire project is open source and free.

2.1 Installation of Gym

Gym supports Windows, Linux, and macOS. Install the Gym library in an Anaconda 3 environment (Python 3.5+ is required) with the command pip install gym. After the installation is complete, simply import it in Python:

import gym

2.2 Built-in environment in Gym

The Gym library has hundreds of built-in experimental environments for reinforcement learning, including:

  • Classic control environment
  • Simple text environment
  • Algorithm environment
  • Box2D environment
  • Atari gaming environment
  • Mechanical control environment
  • ……

These environments are all encapsulated in the envs submodule. The following code lists all the included environments; since there are so many, we only print the first 10:

from gym import envs
env_spaces = envs.registry.all()
env_ids = [env_space.id for env_space in env_spaces]
print(env_ids[:10])

['Copy-v0', 'RepeatCopy-v0', 'ReversedAddition-v0', 'ReversedAddition3-v0', 'DuplicatedInput-v0', 'Reverse-v0', 'CartPole-v0', 'CartPole-v1', 'MountainCar-v0', 'MountainCarContinuous-v0']

Each environment has an ID of the form "xxxxx-vd" (the environment name plus a version number), such as "CartPole-v0". Below we select one environment as the experimental subject to further introduce the basic usage of the Gym library.

2.3 Basic usage of Gym

We choose "CliffWalking-v0" (Chinese name "Cliff Pathfinder") as the experimental object. The problem that this environment needs to solve is in a 4×12 grid. The agent starts at the bottom left of the grid (number It is 36), and I hope to move to the grid (numbered 47) in the lower right corner with the least number of steps, as shown in the figure below, 37~46 represent the cliff:

The agent can move using four actions (up, down, left, and right):

  • Reaching any cell other than the cliff gives a reward of -1
  • Stepping into the cliff gives a reward of -100 and sends the agent back to the starting point
  • An action that would move the agent off the grid leaves the state unchanged and gives a reward of -1

First, use the make function to load the "cliff walking" environment. To load a different environment, simply replace the argument of make with the ID of that environment:

env = gym.make('CliffWalking-v0')

Each environment defines its own state space and action space. After loading the environment, use the environment's observation_space attribute to view the state space and the action_space attribute to view the action space:

print('State space:', env.observation_space)
print('Action space:', env.action_space)

State space: Discrete(48)
Action space: Discrete(4)

In Gym, discrete spaces are generally represented by the gym.spaces.Discrete class, and continuous spaces by the gym.spaces.Box class.

In the cliff walking problem, Discrete(48) means that the state space is discrete and takes the values 0, 1, ..., 47, and Discrete(4) means that the action space is discrete and takes the values 0, 1, 2, 3.
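
As a small illustration (a sketch constructed directly from gym.spaces, separate from the CliffWalking environment), the two classes can be instantiated and sampled like this:

import numpy as np
from gym import spaces

# a discrete space with 4 possible values: 0, 1, 2, 3
discrete_space = spaces.Discrete(4)
print(discrete_space.n)          # 4
print(discrete_space.sample())   # a random integer between 0 and 3

# a continuous space: 2-dimensional vectors with every component in [-1.0, 1.0]
box_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
print(box_space.shape)           # (2,)
print(box_space.sample())        # a random 2-dimensional vector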

Use the P attribute to view the transition relationships between states under different actions. It is a nested dictionary: the outer keys are states, and each value is itself a dictionary. Take state 30 as an example:

env.P[30]

{0: [(1.0, 18, -1, False)],
 1: [(1.0, 31, -1, False)],
 2: [(1.0, 36, -100, False)],
 3: [(1.0, 29, -1, False)]}

In this dictionary, each key represents an action, and each value is a list of tuples. The elements of each tuple are the transition probability under that action, the next state, the reward received, and the flag indicating whether the terminal state has been reached.

For example, in state 30 the actions 0, 1, 2, and 3 mean moving up, right, down, and left, respectively. Choosing action 2 means moving down: with probability 1 the agent steps into the cliff, receives a reward of -100, is sent back to the starting state 36, and does not reach the terminal state.
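
To make these tuples easier to read, the following sketch (an addition, not part of the original example) loops over env.P[30] and prints each transition using the action names given above:

# action indices of CliffWalking-v0, as described above
action_names = {0: 'up', 1: 'right', 2: 'down', 3: 'left'}

for action, transitions in env.P[30].items():
    for prob, next_state, reward, done in transitions:
        print("action {} ({}): prob={}, next state={}, reward={}, done={}".format(
            action, action_names[action], prob, next_state, reward, done))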

Next, we introduce the core method for interacting with the environment, the step method, which takes the agent's action as a parameter and returns the following four values:

  • observation: the state reached after taking the current action
  • reward: the reward obtained by taking the current action
  • done: Boolean variable, indicating whether the end state is reached
  • info: dictionary type value, contains some debugging information, such as the transition probability between states

At each time step, the agent chooses an action and then receives the next state and a reward.

Before using the step method, you need to call the reset method to initialize the environment; it returns the agent's initial state. Whenever a round ends, call reset again to start the next round:

env.reset()

The step method requires an action as a parameter. You can use the sample method to randomly select an action from the action space:

env.action_space.sample()

After calling the step method, you can use the render method to graphically display the current environment:

env.render()

Each call to the step method advances the environment by only one step, so a loop is usually used to call step repeatedly and complete a whole round.

env.reset()  # initialize the environment
for t in range(10):
    # randomly select an action from the action space
    action = env.action_space.sample()
    # take the action
    observation, reward, done, info = env.step(action)
    print("action:{}, observation:{}, reward:{}, done:{}, info:{}".format(action, observation, reward, done, info))
    # graphically display the environment
    env.render()
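
The loop above runs for a fixed 10 steps and may stop before the round actually ends. Below is a minimal sketch (an addition, not from the original article) of running one complete round with a randomly acting agent, accumulating the total reward until done becomes True; a step cap is included because a purely random agent may wander for a very long time:

observation = env.reset()            # initialize the environment
total_reward = 0
done = False
steps = 0
while not done and steps < 1000:     # step cap: a random agent may take very long to finish
    action = env.action_space.sample()               # pick a random action
    observation, reward, done, info = env.step(action)
    total_reward += reward
    steps += 1
print("steps:", steps, "total reward:", total_reward)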

3. Experimental tool TensorFlow

TensorFlow is an open source software library that uses data flow graphs for numerical calculations. Its flexible architecture can be used for calculations on multiple platforms. TensorFlow was originally used for research on machine learning and deep neural networks, but the versatility of this system allows it to be widely used in other computing fields.

In reinforcement learning, when the state space is huge or the action space is continuous, a model is used to approximate the value function. For example, the DQN algorithm uses a deep neural network to estimate the value function. In such cases, we need TensorFlow to build the deep neural network and combine it with Gym to implement the DQN algorithm.

The data flow graph, also known as the computation graph, is the basic computational framework of TensorFlow and is used to define the structure of deep learning networks. The basic data flow graph in TensorFlow is a static graph: once created, it does not support dynamic modification, although TensorFlow also provides a dynamic graph mechanism (eager execution).

A data flow graph contains Operation objects, called compute nodes, and Tensor objects, which represent the data passed between operations. When defining a graph, you can create different name scopes and define variables and operations inside them to make them easier to find later.
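
As a minimal sketch of these ideas (usable once TensorFlow is installed, as described in the next subsection; the node names here are arbitrary), the following builds a tiny static graph inside a name scope and then runs it in a session, which is covered in more detail later in Section 3.2:

import tensorflow as tf

with tf.name_scope('demo'):            # group related nodes under one name scope
    a = tf.constant(2.0, name='a')     # data node (Tensor)
    b = tf.constant(3.0, name='b')     # data node (Tensor)
    c = tf.add(a, b, name='c')         # compute node (Operation) producing a Tensor

print(c)                               # only a symbolic Tensor, no value yet
with tf.Session() as sess:             # values are produced only when the graph is run
    print(sess.run(c))                 # 5.0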

3.1 Installation of TensorFlow

TensorFlow has a CPU version and a GPU version. Taking Anaconda 3 as an example, to install the stable CPU version 1.12 on Windows, first use conda to create a Python 3.6 environment, then activate this environment and install with the command pip install tensorflow==1.12. To install GPU version 1.12 on Windows, complete the following steps in order:

  1. Install VS2015
  2. Install CUDA and cuDNN (the versions must match your graphics card) and add the environment variables
  3. Use conda to create a Python 3.6 environment
  4. Use the command pip install tensorflow-gpu==1.12 to install

3.2 Use TensorFlow to build a fully connected neural network to approximate the state value function

We will introduce how to build a neural network with TensorFlow through an example. Since neural networks are known as "universal function approximators", we can use a neural network to approximate the state value function, namely:

$$\hat{v}(s, \boldsymbol{w}) \approx v_\pi(s)$$

where $\boldsymbol{w}$ denotes the parameters of the neural network, the input of the neural network is the state $s$, and the output is the approximate state value $\hat{v}(s, \boldsymbol{w})$.

Here we take a fully connected neural network as an example. The specific network structure is as follows (the forward computation it performs is written out after the list):

  • Input layer: the state $s$, with dimension 1×1
  • Hidden layer: 5 neurons
  • Output layer: the state value $\hat{v}(s, \boldsymbol{w})$, with dimension 1×1
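
Written out with the notation above (a sketch; ReLU is the activation function chosen later in this section), the forward computation this network performs is:

$$\boldsymbol{h} = \mathrm{ReLU}\left(s\,\boldsymbol{W}_{\text{hidden}} + \boldsymbol{b}_{\text{hidden}}\right), \qquad \hat{v}(s, \boldsymbol{w}) = \boldsymbol{h}\,\boldsymbol{W}_{\text{out}} + \boldsymbol{b}_{\text{out}}$$

where $\boldsymbol{W}_{\text{hidden}}$ (1×5), $\boldsymbol{b}_{\text{hidden}}$ (1×5), $\boldsymbol{W}_{\text{out}}$ (5×1), and $\boldsymbol{b}_{\text{out}}$ (1×1) are the parameters collected in $\boldsymbol{w}$.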

First read the training data:

import numpy as np
import pandas as pd
data = pd.read_csv('./input/value_function.csv')
data.head()

From the output above, we can see that the data contains two columns: the state column represents the state, and the value column represents the value corresponding to that state. We save the two columns separately:

# input data of the neural network
x = data['state'].values
# output data of the neural network
y = data['value'].values

Define placeholders

Since the basic data flow graph of TensorFlow is a static graph, placeholders are needed to reserve fixed positions when building a deep neural network. A placeholder only defines the type and dimensions of a Tensor and does not assign it a value. In TensorFlow, placeholders are created with the placeholder function; its shape parameter specifies the data dimensions, and if shape is set to None, data of any dimension can be fed in. We first use placeholders to define the input and output of the neural network:

import tensorflow as tf
# reset the default computation graph
tf.reset_default_graph()
# define the input placeholder
x_ = tf.placeholder(shape=[None, 1], dtype=tf.float32, name='x_')
# define the output placeholder
y_ = tf.placeholder(shape=[None, 1], dtype=tf.float32, name='y_')

Define parameters

In TensorFlow, a constant is a Tensor whose value cannot be changed once it has been assigned. You can use the constant function to create a TensorFlow constant.

A variable is a mutable Tensor that serves as input to other operations in the graph. The parameters of a neural network can be regarded as variables, and the Variable function can be used to create TensorFlow variables.
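
As a standalone sketch (separate from the value-function network being built here), the difference between a constant and a variable can be seen as follows:

# a constant: its value is fixed once the graph is built
c = tf.constant([1.0, 2.0], name='c')

# a variable: its value can be updated during training (e.g., network parameters)
v = tf.Variable([0.0, 0.0], name='v')
assign_op = v.assign([3.0, 4.0])       # an operation that changes the variable's value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # variables must be initialized first
    print(sess.run(c))      # [1. 2.]
    print(sess.run(v))      # [0. 0.]
    sess.run(assign_op)
    print(sess.run(v))      # [3. 4.]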

In a complex neural network, the connections between layers and between nodes involve many variables and operations, which can easily become confusing. You can use the variable_scope function to set a variable scope, gathering related variables and operations into one scope, which helps keep the model organized and easier to understand.

Below we define the weights and biases of the hidden layer and output layer of the neural network:

# define the weights and bias of the hidden layer
with tf.variable_scope('hidden'):
    # initialize the weights with a truncated normal distribution
    w_hidden = tf.Variable(tf.truncated_normal(shape=[1, 5], dtype=tf.float32), name='w_hidden')
    # define the bias
    b_hidden = tf.Variable(tf.truncated_normal(shape=[5], dtype=tf.float32), name='b_hidden')

# define the weights and bias of the output layer
with tf.variable_scope('out'):
    # define the weights
    w_out = tf.Variable(tf.truncated_normal(shape=[5, 1], dtype=tf.float32), name='w_out')
    # define the bias
    b_out = tf.Variable(tf.truncated_normal(shape=[1], dtype=tf.float32), name='b_out')

Define forward propagation

After defining the input, output, and parameters of the neural network, we define the forward propagation computation. TensorFlow provides basic Tensor operation functions, such as matmul for computing the matrix product of Tensors and add for computing their sum.

During forward propagation, the input of each neuron is mapped non-linearly through an activation function. The nn module of TensorFlow encapsulates some commonly used activation functions; here we use ReLU as the activation function:

# define forward propagation
layer_1 = tf.nn.relu(tf.add(tf.matmul(x_, w_hidden), b_hidden))
y_pred = tf.add(tf.matmul(layer_1, w_out), b_out)

The commonly used activation functions in TensorFlow are called as follows:

  • ReLU: tf.nn.relu
  • Sigmoid: tf.nn.sigmoid
  • tanh: tf.nn.tanh
  • Softmax: tf.nn.softmax
  • Softplus: tf.nn.softplus

Define loss function and optimizer

TensorFlow also encapsulates the loss functions needed when training a neural network. Mean squared error is commonly used as the loss function in regression problems, and cross entropy in classification problems. Approximating the value function can be regarded as a regression problem, so we use the mean squared error as the loss function.

When training a neural network, it is very important to choose a suitable optimization method, which will directly affect the training effect of the neural network. The commonly used algorithms in the gradient descent algorithm family are encapsulated in the train module of TensorFlow. Here we use the Adam method as the optimizer.

# define the loss function
loss = tf.losses.mean_squared_error(predictions=y_pred, labels=y_)
# define the optimizer: learning rate 0.01, objective is to minimize the loss function
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

The commonly used loss functions in TensorFlow are called as follows:

  • Mean squared error: tf.losses.mean_squared_error
  • Binary cross entropy: tf.nn.sigmoid_cross_entropy_with_logits
  • Multi-class cross entropy: tf.nn.softmax_cross_entropy_with_logits_v2
  • Multi-class sparse cross entropy: tf.nn.sparse_softmax_cross_entropy_with_logits

The commonly used optimizers in TensorFlow are called as follows:

  • Gradient descent: tf.train.GradientDescentOptimizer
  • Momentum: tf.train.MomentumOptimizer
  • RMSProp: tf.train.RMSPropOptimizer
  • Adam: tf.train.AdamOptimizer
  • Adadelta: tf.train.AdadeltaOptimizer
  • Adagrad: tf.train.AdagradOptimizer

Create a session and train the network

To execute the computations defined in a data flow graph, the graph must be launched in a session (Session); the session assigns the graph's operations to devices such as the CPU and GPU for execution.

To launch the computation graph, first use the Session class to create a session object, and then call its run method to execute the graph. When you are done with the session, call the close method to release resources. Alternatively, you can use Python's context management protocol (with ... as) so that the session is closed automatically.
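
A minimal sketch of the two equivalent forms (evaluating a trivial constant node, just for illustration):

node = tf.constant(1.0)

# explicitly create and close the session
sess = tf.Session()
print(sess.run(node))
sess.close()                 # release resources

# equivalent form with the context manager: the session is closed automatically
with tf.Session() as sess:
    print(sess.run(node))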

Let's create a session and start training the network:

# set the number of training epochs
training_epochs = 500
# set the batch size
batch_size = 10
# create a session
with tf.Session() as sess: 
    # initialize the variables
    sess.run(tf.global_variables_initializer())  
    for epoch in range(training_epochs):
        for i in range(10):         
            # split all the training data into 10 batches of size 10
            # reshape batch_x and batch_y to match the dimensions of the placeholders x_ and y_
            batch_x = x[i*batch_size:(i+1)*batch_size].reshape(-1, 1)
            batch_y = y[i*batch_size:(i+1)*batch_size].reshape(-1, 1)          
            # feed the data via feed_dict and update the parameters by backpropagation
            _, cost = sess.run([train_op, loss], feed_dict={x_:batch_x, y_:batch_y})         
        # print the loss on the training set every 20 epochs
        if epoch % 20 == 0:
            print("epoch", epoch, "training loss", sess.run(loss, feed_dict={x_:x.reshape(-1, 1), y_:y.reshape(-1, 1)}))         
    print("Hidden layer weights:", w_hidden.eval())
    print("Hidden layer bias:", b_hidden.eval())
    print("Output layer weights:", w_out.eval())
    print("Output layer bias:", b_out.eval())

It can be seen that as the number of training epochs increases, the training loss keeps decreasing; finally, the parameters of the neural network are printed, completing the training of the fully connected neural network.
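
To actually use the approximated value function, you can evaluate y_pred for new states. The sketch below is an addition, meant to be placed inside the with tf.Session() block above, after the training loop and before the session closes; the three states fed in here are only illustrative:

    # approximate the value of a few states with the trained network
    new_states = np.array([0.0, 1.0, 2.0]).reshape(-1, 1)   # illustrative states
    predicted_values = sess.run(y_pred, feed_dict={x_: new_states})
    print("approximate state values:", predicted_values.ravel())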

4. Summary

This case first introduced the basic usage of OpenAI Gym, the most widely used reinforcement learning experimental platform, including how to install Gym and how to use its built-in environments. In subsequent cases, we will use Gym as the experimental platform for evaluating and debugging reinforcement learning algorithms.

Then we introduced the experimental tool TensorFlow and explained the process of building a neural network through an example. In the subsequent practice of reinforcement learning algorithms, we will use TensorFlow to build deep neural networks and combine them with Gym to implement some classic reinforcement learning algorithms.

I hope everyone can have a basic understanding of TensorFlow and Gym through this case, and prepare for the future practice of reinforcement learning algorithms!

"Viewing this case the complete data, code and reports please login data kuke ( cookdata.cn ) Case section.
