1. Installation environment
gym is a toolkit for developing and comparing reinforcement learning algorithms. It is easy to install the gym library and its sub-scenarios in python.
Install gym:
pip install gym
Install the autopilot module, using the package highway-env published on github by Edouard Leurent (link: https://github.com/eleurent/highway-env):
pip install --user git+https://github.com/eleurent/highway-env
It contains 6 scenarios:
-
Highway - "highway-v0"
-
Import - "merge-v0"
-
Roundabout - "roundabout-v0"
-
Parking - "parking-v0"
-
intersection - "intersection-v0"
-
Race Track - "racetrack-v0"
Detailed documentation can be found here: https://highway-env.readthedocs.io/en/latest/
Second, the configuration environment
After installation, you can experiment in the code (take the highway scene as an example):
import gym
import highway_env
%matplotlib inline
env = gym.make('highway-v0')
env.reset()
for _ in range(3):
action = env.action_type.actions_indexes["IDLE"]
obs, reward, done, info = env.step(action)
env.render()
After running, the following scene will be generated in the simulator:
Green is the ego vehicle env class. There are many parameters that can be configured. For details, please refer to the original document.
Third, the training model
1. Data processing
(1)state
There are no sensors defined in the highway-env package, and all vehicle states (observations) are read from the underlying code, saving a lot of up-front work. According to the documentation, state (ovservations) has three output methods: Kinematics, Grayscale Image and Occupancy grid.
Kinematics
Output a matrix of V*F, where V represents the number of vehicles to be observed (including the ego vehicle itself), and F represents the number of features to be counted. example:
When the data is generated, it will be normalized by default. The value range is [100, 100, 20, 20]. You can also set the vehicle attributes other than the ego vehicle to be the absolute coordinates of the map or the relative coordinates to the ego vehicle.
When defining the environment, you need to set the parameters of the feature:
config = \
{
"observation":
{
"type": "Kinematics",
#选取5辆车进行观察(包括ego vehicle)
"vehicles_count": 5,
#共7个特征
"features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
"features_range":
{
"x": [-100, 100],
"y": [-100, 100],
"vx": [-20, 20],
"vy": [-20, 20]
},
"absolute": False,
"order": "sorted"
},
"simulation_frequency": 8, # [Hz]
"policy_frequency": 2, # [Hz]
}
Grayscale Image
Generate a grayscale image of W*H, where W represents the width of the image and H represents the height of the image
Occupancy grid
Generate a three-dimensional matrix of WHF, and use a W*H table to represent the vehicle situation around the ego vehicle, and each grid contains F features.
(2) action
Actions in the highway-env package are divided into two types: continuous and discrete. Continuous actions can directly define the values of throttle and steering angle, and discrete actions include 5 meta actions:
ACTIONS_ALL = {
0: 'LANE_LEFT',
1: 'IDLE',
2: 'LANE_RIGHT',
3: 'FASTER',
4: 'SLOWER'
}
(3) reward
The same reward function is used in the highway-env package except for the parking scene:
This function can only be changed in its source code, and only the weights can be adjusted in the outer layer. (The reward function of the parking scene is in the original document, too lazy to type the formula...)
2. Build the model
The structure and construction process of the DQN network have been discussed in another article of mine, so I will not explain it in detail here. I use the first state representation - Kinematics for demonstration.
Due to the small amount of state data (5 vehicles * 7 features), you can directly convert the size[5,7] of the two-dimensional data to [1,35] without considering the use of CNN, and the input of the model is 35, The output is the number of discrete actions, 5 in total.
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as T
from torch import FloatTensor, LongTensor, ByteTensor
from collections import namedtuple
import random
Tensor = FloatTensor
EPSILON = 0 # epsilon used for epsilon greedy approach
GAMMA = 0.9
TARGET_NETWORK_REPLACE_FREQ = 40 # How frequently target netowrk updates
MEMORY_CAPACITY = 100
BATCH_SIZE = 80
LR = 0.01 # learning rate
class DQNNet(nn.Module):
def __init__(self):
super(DQNNet,self).__init__()
self.linear1 = nn.Linear(35,35)
self.linear2 = nn.Linear(35,5)
def forward(self,s):
s=torch.FloatTensor(s)
s = s.view(s.size(0),1,35)
s = self.linear1(s)
s = self.linear2(s)
return s
class DQN(object):
def __init__(self):
self.net,self.target_net = DQNNet(),DQNNet()
self.learn_step_counter = 0
self.memory = []
self.position = 0
self.capacity = MEMORY_CAPACITY
self.optimizer = torch.optim.Adam(self.net.parameters(), lr=LR)
self.loss_func = nn.MSELoss()
def choose_action(self,s,e):
x=np.expand_dims(s, axis=0)
if np.random.uniform() < 1-e:
actions_value = self.net.forward(x)
action = torch.max(actions_value,-1)[1].data.numpy()
action = action.max()
else:
action = np.random.randint(0, 5)
return action
def push_memory(self, s, a, r, s_):
if len(self.memory) < self.capacity:
self.memory.append(None)
self.memory[self.position] = Transition(torch.unsqueeze(torch.FloatTensor(s), 0),torch.unsqueeze(torch.FloatTensor(s_), 0),\
torch.from_numpy(np.array([a])),torch.from_numpy(np.array([r],dtype='float32')))#
self.position = (self.position + 1) % self.capacity
def get_sample(self,batch_size):
sample = random.sample(self.memory,batch_size)
return sample
def learn(self):
if self.learn_step_counter % TARGET_NETWORK_REPLACE_FREQ == 0:
self.target_net.load_state_dict(self.net.state_dict())
self.learn_step_counter += 1
transitions = self.get_sample(BATCH_SIZE)
batch = Transition(*zip(*transitions))
b_s = Variable(torch.cat(batch.state))
b_s_ = Variable(torch.cat(batch.next_state))
b_a = Variable(torch.cat(batch.action))
b_r = Variable(torch.cat(batch.reward))
q_eval = self.net.forward(b_s).squeeze(1).gather(1,b_a.unsqueeze(1).to(torch.int64))
q_next = self.target_net.forward(b_s_).detach() #
q_target = b_r + GAMMA * q_next.squeeze(1).max(1)[0].view(BATCH_SIZE, 1).t()
loss = self.loss_func(q_eval, q_target.t())
self.optimizer.zero_grad() # reset the gradient to zero
loss.backward()
self.optimizer.step() # execute back propagation for one step
return loss
Transition = namedtuple('Transition',('state', 'next_state','action', 'reward'))
3. Running results
After each part is completed, the model can be combined to train the model. The process is similar to using CARLA, so I won't go into details.
Initialize the environment (just add the DQN class to it):
import gym
import highway_env
from matplotlib import pyplot as plt
import numpy as np
import time
config = \
{
"observation":
{
"type": "Kinematics",
"vehicles_count": 5,
"features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
"features_range":
{
"x": [-100, 100],
"y": [-100, 100],
"vx": [-20, 20],
"vy": [-20, 20]
},
"absolute": False,
"order": "sorted"
},
"simulation_frequency": 8, # [Hz]
"policy_frequency": 2, # [Hz]
}
env = gym.make("highway-v0")
env.configure(config)
Train the model:
dqn=DQN()
count=0
reward=[]
avg_reward=0
all_reward=[]
time_=[]
all_time=[]
collision_his=[]
all_collision=[]
while True:
done = False
start_time=time.time()
s = env.reset()
while not done:
e = np.exp(-count/300) #随机选择action的概率,随着训练次数增多逐渐降低
a = dqn.choose_action(s,e)
s_, r, done, info = env.step(a)
env.render()
dqn.push_memory(s, a, r, s_)
if ((dqn.position !=0)&(dqn.position % 99==0)):
loss_=dqn.learn()
count+=1
print('trained times:',count)
if (count%40==0):
avg_reward=np.mean(reward)
avg_time=np.mean(time_)
collision_rate=np.mean(collision_his)
all_reward.append(avg_reward)
all_time.append(avg_time)
all_collision.append(collision_rate)
plt.plot(all_reward)
plt.show()
plt.plot(all_time)
plt.show()
plt.plot(all_collision)
plt.show()
reward=[]
time_=[]
collision_his=[]
s = s_
reward.append(r)
end_time=time.time()
episode_time=end_time-start_time
time_.append(episode_time)
is_collision=1 if info['crashed']==True else 0
collision_his.append(is_collision)
I added some drawing functions to the code, and I can master some key indicators during the running process, and count the average every 40 times of training.
Average collision rate:
Average epoch duration (s):
Average reward:
It can be seen that the average collision rate will gradually decrease as the number of training increases, and the duration of each epoch will gradually increase (if a collision occurs, the epoch will end immediately)
-END-
Scan the code to add please note: python