If you don't want to drive yourself, Python can help you with autonomous driving

1. Installing the environment

gym is a toolkit for developing and comparing reinforcement learning algorithms. The gym library and its sub-scenarios are easy to install in Python.

Install gym:

pip install gym

Install the autopilot module, using the package highway-env published on github by Edouard Leurent (link: https://github.com/eleurent/highway-env):

pip install --user git+https://github.com/eleurent/highway-env

It contains 6 scenarios:

  • Highway - "highway-v0"

  • Merge - "merge-v0"

  • Roundabout - "roundabout-v0"

  • Parking - "parking-v0"

  • Intersection - "intersection-v0"

  • Race Track - "racetrack-v0"

Detailed documentation can be found here: https://highway-env.readthedocs.io/en/latest/
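As a quick check, any of these scenarios can be created directly from its id (a minimal sketch, using the roundabout scene):

import gym
import highway_env

env = gym.make("roundabout-v0")  # any of the ids listed above works here
env.reset()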

2. Configuring the environment

After installation, you can experiment with it in code (taking the highway scene as an example):

import gym
import highway_env
%matplotlib inline

env = gym.make('highway-v0')
env.reset()
for _ in range(3):
    action = env.action_type.actions_indexes["IDLE"]  # index of the "IDLE" meta-action
    obs, reward, done, info = env.step(action)
    env.render()

After running, the following scene will be generated in the simulator:

The green vehicle is the ego vehicle. The env class has many parameters that can be configured; for details, please refer to the official documentation.
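For example, the current configuration can be inspected and individual entries overridden before resetting the environment (a minimal sketch; "lanes_count" is just one of the documented keys):

print(env.config)                  # dictionary of all configurable parameters
env.configure({"lanes_count": 2})  # override a single parameter
env.reset()                        # changes take effect after reset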

3. Training the model

1. Data processing

(1)state

There are no sensors defined in the highway-env package; all vehicle states (observations) are read directly from the underlying code, which saves a lot of up-front work. According to the documentation, the state (observations) has three output formats: Kinematics, Grayscale Image and Occupancy grid.

Kinematics

Outputs a V*F matrix, where V is the number of vehicles to be observed (including the ego vehicle itself) and F is the number of features recorded. For example:

When the data is generated, it is normalized by default using the value ranges x: [-100, 100], y: [-100, 100], vx: [-20, 20], vy: [-20, 20]. You can also choose whether the attributes of vehicles other than the ego vehicle are given in absolute map coordinates or in coordinates relative to the ego vehicle.
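Roughly speaking, each feature is mapped linearly from its range to [-1, 1]; a small sketch of this kind of normalization (the exact implementation lives inside the package):

def normalize(value, low, high):
    # map value from [low, high] linearly to [-1, 1] (clipping omitted)
    return 2 * (value - low) / (high - low) - 1

print(normalize(50, -100, 100))   # x = 50 m    -> 0.5
print(normalize(-10, -20, 20))    # vx = -10 m/s -> -0.5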

When defining the environment, you need to set these feature parameters:

config = \
    {
    "observation": 
         {
        "type": "Kinematics",
        # observe 5 vehicles (including the ego vehicle)
        "vehicles_count": 5,  
        # 7 features in total
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"], 
        "features_range": 
            {
            "x": [-100, 100],
            "y": [-100, 100],
            "vx": [-20, 20],
            "vy": [-20, 20]
            },
        "absolute": False,
        "order": "sorted"
        },
    "simulation_frequency": 8,  # [Hz]
    "policy_frequency": 2,  # [Hz]
    }
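Applying this configuration and resetting the environment gives an observation of shape (5, 7), matching vehicles_count and the feature list (a minimal sketch):

env = gym.make("highway-v0")
env.configure(config)
obs = env.reset()
print(obs.shape)  # (5, 7): 5 vehicles x 7 features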

Grayscale Image

Generates a W*H grayscale image, where W is the width of the image and H is the height.
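A sketch of the corresponding observation configuration, based on the example in the documentation (key names may vary between versions):

config = \
    {
    "observation": 
         {
        "type": "GrayscaleObservation",
        "observation_shape": (128, 64),       # W*H of the image
        "stack_size": 4,                      # number of stacked frames
        "weights": [0.2989, 0.5870, 0.1140],  # RGB to grayscale conversion weights
        "scaling": 1.75,
        },
    "policy_frequency": 2,  # [Hz]
    }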

Occupancy grid

Generates a three-dimensional W*H*F matrix: a W*H grid describes the vehicle situation around the ego vehicle, and each cell contains F features.
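Again a sketch based on the documentation example (grid size and step are in meters; check the docs for the exact key names):

config = \
    {
    "observation": 
         {
        "type": "OccupancyGrid",
        "vehicles_count": 15,
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
        "grid_size": [[-27.5, 27.5], [-27.5, 27.5]],  # area covered around the ego vehicle
        "grid_step": [5, 5],                          # size of each cell
        "absolute": False,
        },
    }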

(2) action

Actions in the highway-env package are divided into two types: continuous and discrete. Continuous actions directly define the throttle and steering angle values, while discrete actions consist of 5 meta-actions:

ACTIONS_ALL = {
        0: 'LANE_LEFT',
        1: 'IDLE',
        2: 'LANE_RIGHT',
        3: 'FASTER',
        4: 'SLOWER'
    }
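The action space is also chosen through the configuration; a sketch of the two documented action types (only one should be active at a time):

# discrete meta-actions (used in this article)
config = {"action": {"type": "DiscreteMetaAction"}}

# or continuous throttle / steering control:
# config = {"action": {"type": "ContinuousAction"}}

env.configure(config)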

(3) reward

The same reward function is used in the highway-env package except for the parking scene:
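From the highway-env documentation, the reward used for the driving scenes is roughly of the form

R(s, a) = a * (v - v_min) / (v_max - v_min) - b * collision

where v is the ego vehicle's speed, collision is 1 if a crash occurred and 0 otherwise, and a and b are the adjustable weights.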

This function can only be changed in the source code; only the weights can be adjusted from the outside. (The reward function of the parking scene can be found in the official documentation.)

2. Building the model

The structure and construction process of the DQN network have been discussed in another article of mine, so I will not explain them in detail here. I use the first state representation, Kinematics, for this demonstration.

Since the amount of state data is small (5 vehicles * 7 features), there is no need for a CNN; the two-dimensional data of size [5, 7] can simply be flattened to [1, 35]. The input size of the model is therefore 35, and the output size is the number of discrete actions, 5 in total.
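A quick sketch of that flattening step (plain numpy, placeholder values):

import numpy as np

obs = np.zeros((5, 7))     # Kinematics observation: 5 vehicles x 7 features
flat = obs.reshape(1, -1)  # -> shape (1, 35), the input size of the network
print(flat.shape)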

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as T
from torch import FloatTensor, LongTensor, ByteTensor
from collections import namedtuple
import random
import numpy as np   # used by the DQN class below
Tensor = FloatTensor

EPSILON = 0    # epsilon used for epsilon greedy approach
GAMMA = 0.9
TARGET_NETWORK_REPLACE_FREQ = 40       # How frequently the target network updates
MEMORY_CAPACITY = 100
BATCH_SIZE = 80
LR = 0.01         # learning rate

class DQNNet(nn.Module):
    def __init__(self):
        super(DQNNet, self).__init__()
        self.linear1 = nn.Linear(35, 35)
        self.linear2 = nn.Linear(35, 5)   # one output per discrete meta-action
    def forward(self, s):
        s = torch.FloatTensor(s)
        s = s.view(s.size(0), 1, 35)      # flatten the 5*7 observation into a length-35 vector
        s = self.linear1(s)
        s = self.linear2(s)
        return s
                         
class DQN(object):
    def __init__(self):
        self.net,self.target_net = DQNNet(),DQNNet()        
        self.learn_step_counter = 0      
        self.memory = []
        self.position = 0 
        self.capacity = MEMORY_CAPACITY       
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()

    def choose_action(self, s, e):
        # epsilon-greedy: with probability 1-e exploit the network, otherwise explore randomly
        x = np.expand_dims(s, axis=0)
        if np.random.uniform() < 1 - e:
            actions_value = self.net.forward(x)
            action = torch.max(actions_value, -1)[1].data.numpy()  # index of the best action
            action = action.max()
        else:
            action = np.random.randint(0, 5)  # random action among the 5 meta-actions
        return action

    def push_memory(self, s, a, r, s_):
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(torch.unsqueeze(torch.FloatTensor(s), 0),torch.unsqueeze(torch.FloatTensor(s_), 0),\
                                                torch.from_numpy(np.array([a])),torch.from_numpy(np.array([r],dtype='float32')))#
        self.position = (self.position + 1) % self.capacity
       
    def get_sample(self,batch_size):
        sample = random.sample(self.memory,batch_size)
        return sample
      
    def learn(self):
        if self.learn_step_counter % TARGET_NETWORK_REPLACE_FREQ == 0:
            self.target_net.load_state_dict(self.net.state_dict())  # periodically sync the target network
        self.learn_step_counter += 1
        
        transitions = self.get_sample(BATCH_SIZE)
        batch = Transition(*zip(*transitions))

        b_s = Variable(torch.cat(batch.state))
        b_s_ = Variable(torch.cat(batch.next_state))
        b_a = Variable(torch.cat(batch.action))
        b_r = Variable(torch.cat(batch.reward))    
             
        q_eval = self.net.forward(b_s).squeeze(1).gather(1, b_a.unsqueeze(1).to(torch.int64))  # Q(s, a) from the online network
        q_next = self.target_net.forward(b_s_).detach()  # next-state Q-values from the target network (no gradient)
        q_target = b_r + GAMMA * q_next.squeeze(1).max(1)[0].view(BATCH_SIZE, 1).t()  # TD target
        loss = self.loss_func(q_eval, q_target.t())
        self.optimizer.zero_grad() # reset the gradient to zero        
        loss.backward()
        self.optimizer.step() # execute back propagation for one step       
        return loss
Transition = namedtuple('Transition',('state', 'next_state','action', 'reward'))

3. Running results

Once each part is complete, they can be combined to train the model. The process is similar to using CARLA, so I won't go into detail.

Initialize the environment (just include the DQN class defined above):

import gym
import highway_env
from matplotlib import pyplot as plt
import numpy as np
import time
config = \
    {
    "observation": 
         {
        "type": "Kinematics",
        "vehicles_count": 5,
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
        "features_range": 
            {
            "x": [-100, 100],
            "y": [-100, 100],
            "vx": [-20, 20],
            "vy": [-20, 20]
            },
        "absolute": False,
        "order": "sorted"
        },
    "simulation_frequency": 8,  # [Hz]
    "policy_frequency": 2,  # [Hz]
    }
    
env = gym.make("highway-v0")
env.configure(config)

Train the model:

dqn=DQN()
count=0

reward=[]
avg_reward=0
all_reward=[]

time_=[]
all_time=[]

collision_his=[]
all_collision=[]
while True:
    done = False    
    start_time=time.time()
    s = env.reset()
    
    while not done:
        e = np.exp(-count/300)  # probability of choosing a random action; decreases as training progresses
        a = dqn.choose_action(s,e)
        s_, r, done, info = env.step(a)
        env.render()
        
        dqn.push_memory(s, a, r, s_)
        
        if ((dqn.position !=0)&(dqn.position % 99==0)):
            loss_=dqn.learn()
            count+=1
            print('trained times:',count)
            if (count%40==0):
                avg_reward=np.mean(reward)
                avg_time=np.mean(time_)
                collision_rate=np.mean(collision_his)
                                
                all_reward.append(avg_reward)
                all_time.append(avg_time)
                all_collision.append(collision_rate)
                                
                plt.plot(all_reward)
                plt.show()
                plt.plot(all_time)
                plt.show()
                plt.plot(all_collision)
                plt.show()
                
                reward=[]
                time_=[]
                collision_his=[]
                
        s = s_
        reward.append(r)      
    
    end_time=time.time()
    episode_time=end_time-start_time
    time_.append(episode_time)
        
    is_collision=1 if info['crashed']==True else 0
    collision_his.append(is_collision)

I added some plotting functions to the code so that key indicators can be tracked while it runs, with averages computed every 40 training steps.

Average collision rate:

Average epoch duration (s):

Average reward:

It can be seen that the average collision rate gradually decreases as training progresses, and the duration of each epoch gradually increases (an epoch ends immediately if a collision occurs).

-END-


Origin blog.csdn.net/qiqi1220/article/details/125797443