MATLAB Reinforcement Learning in Practice (3): Training a DQN Agent for Lane Keeping Assist (LKA) Using Parallel Computing

This example shows how to use parallel training to train a deep Q-learning network (DQN) agent for lane keeping assist (LKA) in Simulink®. For an example showing how to train the agent without parallel training, see Train DQN Agent for Lane Keeping Assist.

MATLAB version: R2020b.

Overview of DQN parallel training

In DQN parallel training, each worker generates new experiences from its own copy of the agent and the environment. Every N steps, the worker sends these experiences to the host agent. The host agent updates its parameters as follows.

  1. For asynchronous training, the host agent learns from received experiences without waiting for all workers to send theirs, and then sends the updated parameters back to the worker that provided the experiences. That worker then continues to generate experiences from its environment using the updated parameters.

  2. For synchronous training, the host agent waits until it has received experiences from all workers and then learns from them. The host then sends the updated parameters to all workers at the same time, and all workers continue to generate experiences using the updated parameters.
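As a minimal sketch (using the same rlTrainingOptions interface that is configured in full later in this example), the scheme is selected through the Mode field of the parallelization options:

opts = rlTrainingOptions('UseParallel',true);   % enable parallel training
opts.ParallelizationOptions.Mode = "async";     % asynchronous; use "sync" for synchronous training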

Simulink model of Ego Car

The reinforcement learning environment for this example is a simple bicycle model for the vehicle dynamics. The goal of training is to keep the ego vehicle traveling along the centerline of the lane by adjusting the front steering angle. This example uses the same vehicle model as Train DQN Agent for Lane Keeping Assist.

m = 1575;   % total vehicle mass (kg)
Iz = 2875;  % yaw moment of inertia (mNs^2)
lf = 1.2;   % longitudinal distance from center of gravity to front tires (m)
lr = 1.6;   % longitudinal distance from center of gravity to rear tires (m)
Cf = 19000; % cornering stiffness of front tires (N/rad)
Cr = 33000; % cornering stiffness of rear tires (N/rad)
Vx = 15;    % longitudinal velocity (m/s)

Define the sampling time Ts and the simulation duration T in seconds.

Ts = 0.1;
T = 15;

The output of the LKA system is the front steering angle of the vehicle. To simulate the physical steering limits of the vehicle, constrain the steering angle to the range [–0.5, 0.5] rad.

u_min = -0.5;
u_max = 0.5;

The curvature of the road is defined as a constant 0.001 m^(-1). The initial value of the lateral deviation is 0.2 m, and the initial value of the relative yaw angle is -0.1 rad.

rho = 0.001;
e1_initial = 0.2;
e2_initial = -0.1;

Open the model.

mdl = 'rlLKAMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];

[Figure: the rlLKAMdl Simulink model with the RL Agent block]

For this model:

  1. The steering angle action signal from the agent to the environment ranges from -15 degrees to 15 degrees.
  2. The observations from the environment are the lateral deviation $e_1$, the relative yaw angle $e_2$, their derivatives $\dot{e}_1$ and $\dot{e}_2$, and their integrals $\int e_1$ and $\int e_2$.
  3. The simulation terminates when the lateral deviation $|e_1| > 1$.
  4. The reward $r_t$, provided at every time step $t$, is a function of the lateral deviation, the relative yaw angle, and the control input $u$ from the previous time step $t-1$.

Create environment interface

Create a reinforcement learning environment interface for the ego vehicle.

Define observation information.

observationInfo = rlNumericSpec([6 1],'LowerLimit',-inf*ones(6,1),'UpperLimit',inf*ones(6,1));
observationInfo.Name = 'observations';
observationInfo.Description = 'information on lateral deviation and relative yaw angle';

Define the action information.

actionInfo = rlFiniteSetSpec((-15:15)*pi/180);
actionInfo.Name = 'steering';

Create an environmental interface.

env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);

The interface has a discrete action space in which the agent can apply one of 31 possible steering angles from -15 degrees to 15 degrees. The observation is a six-dimensional vector containing the lateral deviation and relative yaw angle, together with their derivatives and integrals with respect to time.
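As an optional sanity check that is not part of the original example, you can query the specification objects to confirm these dimensions:

numel(actionInfo.Elements)   % 31 discrete steering angles
observationInfo.Dimension    % [6 1] observation vector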

To define the initial conditions for the lateral deviation and relative yaw angle, specify an environment reset function using an anonymous function handle. The localResetFcn function defined at the end of this example randomizes the initial lateral deviation and relative yaw angle.

env.ResetFcn = @(in)localResetFcn(in);

Fix the random generator seed for reproducibility.

rng(0)

Create DQN agent

DQN agents can use a multi-output Q-value critic approximator, which is generally more efficient. A multi-output approximator takes the observation as input and the state-action values as output. Each output element represents the expected cumulative long-term reward for taking the corresponding discrete action from the state indicated by the observation input.

To create the critic, first create a deep neural network with one input (the six-dimensional observed state) and one output vector with 31 elements (steering angles evenly spaced from -15 to 15 degrees). For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations.

nI = observationInfo.Dimension(1);  % number of inputs (6)
nL = 120;                           % number of neurons
nO = numel(actionInfo.Elements);    % number of outputs (31)

dnn = [
    featureInputLayer(nI,'Normalization','none','Name','state')
    fullyConnectedLayer(nL,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(nL,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(nO,'Name','fc3')];

Check the network configuration.

figure
plot(layerGraph(dnn))
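Optionally (this step is not in the original example), analyzeNetwork from Deep Learning Toolbox shows a layer-by-layer summary of activation sizes and learnable parameters:

analyzeNetwork(layerGraph(dnn))   % optional: interactive summary of the critic network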

Use rlRepresentationOptions to specify options for the critic representation.

criticOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor',1e-4);

Create the critic representation using the specified deep neural network and options. You must also specify the critic's action and observation information, which you obtain from the environment interface. For more information, see rlQValueRepresentation.

critic = rlQValueRepresentation(dnn,observationInfo,actionInfo,'Observation',{'state'},criticOptions);
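To confirm that the critic returns one Q-value per discrete action, you can evaluate it on a random observation. This optional check is not part of the original example; getValue returns the state-action values of a multi-output critic:

q = getValue(critic,{rand(nI,1)});   % optional check on a random observation
size(q)                              % expected: 31-by-1, one value per steering angle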

To create a DQN agent, first use rlDQNAgentOptions to specify DQN agent options.

agentOpts = rlDQNAgentOptions(...
    'SampleTime',Ts,...
    'UseDoubleDQN',true,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',0.99,...
    'ExperienceBufferLength',1e6,...
    'MiniBatchSize',256);

agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-4;

Then create the DQN agent using the specified critic representation and agent options. For more information, see rlDQNAgent.

agent = rlDQNAgent(critic,agentOpts);
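As another optional check (not in the original example), you can query the untrained agent for an action from a random observation using getAction:

a = getAction(agent,{rand(nI,1)});   % one of the 31 steering angles, in rad
                                     % (the output may be wrapped in a cell array, depending on the release)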

Training options

To train the agent, first specify the training options. For this example, use the following options.

  1. Run each training for at most 10000 episodes, with each episode lasting at most ceil(T/Ts) time steps.

  2. Display the training progress only in the Episode Manager dialog (set the Plots and Verbose options accordingly).

  3. Stop training when the episode reward reaches -1.

  4. Save a copy of the agent for each episode where the cumulative reward is greater than 100.

maxepisodes = 10000;
maxsteps = ceil(T/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes, ...
    'MaxStepsPerEpisode',maxsteps, ...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue', -1,...
    'SaveAgentCriteria','EpisodeReward',...
    'SaveAgentValue',100);

Parallel computing options

To train the agent in parallel, specify the following training options.

  1. Set the UseParallel option to true.

  2. Train the agent asynchronously and in parallel by setting the ParallelizationOptions.Mode option to "async".

  3. Each worker sends experiences to the host every 32 steps.

  4. DQN agents require workers to send experiences to the host.

trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "experiences";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
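Parallel training requires Parallel Computing Toolbox™. As an optional step that is not in the original code, you can open a worker pool explicitly before calling train; if no pool is open when training starts, one is created automatically using the default cluster profile.

% Optional: open a parallel pool explicitly before training.
if isempty(gcp('nocreate'))
    parpool;   % uses the default number of workers for your machine
end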

Training agent

Train the agent using the train function. Training the agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to the randomness of parallel training, your training results can differ from those in the chart below.

doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load pretrained agent for the example.
    load('SimulinkLKADQNParallel.mat','agent')
end

[Figure: training progress displayed in the Episode Manager]

DQN agent simulation

To verify the performance of the trained agent, uncomment the following two lines and simulate the agent in the environment. For more information about agent simulation, see rlSimulationOptions and sim .

% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);

If the simulation errors because localResetFcn is undefined, create the file localResetFcn.m in the current directory and paste in the following code:

function in = localResetFcn(in)
% Randomize the initial conditions for each episode.
in = setVariable(in,'e1_initial', 0.5*(-1+2*rand)); % random lateral deviation in [-0.5, 0.5] m
in = setVariable(in,'e2_initial', 0.1*(-1+2*rand)); % random relative yaw angle in [-0.1, 0.1] rad
end

Re-execute the simulation command.
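Once the simulation runs successfully, you can inspect the experience output of sim; for example, the following sketch (assuming the sim call above produced the experience variable) sums the logged reward to get the cumulative episode reward:

totalReward = sum(experience.Reward.Data);   % cumulative reward over the episode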

To demonstrate the trained agent using deterministic initial conditions, simulate the model in Simulink.

e1_initial = -0.4;
e2_initial = 0.2;
sim(mdl)

As shown in the figures below, both the lateral error (middle plot) and the relative yaw angle (bottom plot) are driven to zero. The vehicle starts with an offset from the centerline (-0.4 m) and a nonzero yaw angle error (0.2 rad). The lane keeping assist brings the vehicle back to the centerline within about 2.5 seconds, and the steering angle (top plot) shows that the controller reaches steady state within about 2 seconds.
[Figures: steering angle (top), lateral deviation (middle), and relative yaw angle (bottom) simulation results]

Original article: blog.csdn.net/wangyifan123456zz/article/details/109569081