MATLAB Reinforcement Learning in Practice (7): Train a DDPG Agent to Control an Inverted Pendulum System in Simulink

This example shows how to train a deep deterministic policy gradient (DDPG) agent to control an inverted pendulum system modeled with Simscape™ Multibody™.

MATLAB version: R2020b

Simscape model of inverted pendulum

The reinforcement learning environment for this example is a pole attached to an unactuated joint on a cart that moves along a frictionless track. The training goal is to make the pole stand upright without falling over, using minimal control effort.

Open the model

mdl = 'rlCartPoleSimscapeModel';
open_system(mdl)

The inverted pendulum system is modeled using Simscape Multibody.

For this model:

  1. The upward balanced pole position is 0 radians, and the downward hanging position is pi radians.
  2. The force signal that the agent applies to the environment ranges from –15 to 15 N.
  3. The observations from the environment are the position and velocity of the cart, along with the sine, cosine, and derivative of the pole angle.
  4. The episode terminates if the cart moves more than 3.5 meters from its original position.
  5. The reward r_t provided at every time step is

     r_t = -0.1*(5*θ_t^2 + x_t^2 + 0.05*u_{t-1}^2) - 100*B

     (a small MATLAB sketch of this reward follows the list of symbols below).

where:

  1. θ_t is the angle of displacement of the pole from the upright position.
  2. x_t is the displacement of the cart from the center position.
  3. u_{t-1} is the control effort from the previous time step.
  4. B is a flag (1 or 0) indicating whether the cart is out of bounds.
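
As a rough illustration, the reward above can be expressed as a plain MATLAB function handle (a sketch only; the variable names here are hypothetical and the actual reward is computed inside the Simulink model):

% Illustrative only: the real reward is computed inside the Simscape model.
% theta: pole angle from upright (rad), x: cart position (m),
% uPrev: previous control effort (N), B: 1 if the cart left the track, else 0.
cartPoleReward = @(theta,x,uPrev,B) -0.1*(5*theta.^2 + x.^2 + 0.05*uPrev.^2) - 100*B;

For example, cartPoleReward(0,0,0,0) returns 0 (the best possible per-step reward), while leaving the track costs an extra 100.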

For more information about this model, see Load Predefined Simulink Environments.

Create environment interface

Create a predefined environment interface for the pole.

env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous')

The interface has a continuous action space in which the agent can apply force values from –15 to 15 N to the cart.

Obtain observation and action information from the environment interface.

obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
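
If you want to confirm the sizes and limits reported by these specification objects, you can inspect them directly (an optional check, not part of the original walkthrough):

% Optional: inspect the observation and action specifications.
numActions = actInfo.Dimension(1);                      % a single force value
forceLimits = [actInfo.LowerLimit actInfo.UpperLimit]   % expected [-15 15]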

Specify the simulation time Tf and the agent sampling time Ts in seconds.

Ts = 0.02;
Tf = 25;

Fix the random generator seed for reproducibility.

rng(0)

Create DDPG agent

A DDPG agent uses a critic value function representation to estimate the long-term reward given observations and actions. To create the critic, first create a deep neural network with two inputs (the state and the action) and one output. The input size of the action path is [1 1 1] because the agent applies its action to the environment as a single force value. For more information on creating deep neural network value function representations, see Create Policy and Value Function Representations.

statePath = [
    featureInputLayer(numObservations,'Normalization','none','Name','observation')
    fullyConnectedLayer(128,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(200,'Name','CriticStateFC2')];

actionPath = [
    featureInputLayer(1,'Normalization','none','Name','action')
    fullyConnectedLayer(200,'Name','CriticActionFC1','BiasLearnRateFactor',0)];

commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];

criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
    
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');

View the critic network configuration.

figure
plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.

criticOptions = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);

Create the critic representation using the specified deep neural network and options. You must also specify the action and observation information for the critic, which you already obtained from the environment interface. For more information, see rlQValueRepresentation.

critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);
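
As an optional sanity check (not in the original post), you can evaluate the untrained critic on a random observation-action pair; the returned Q-value is meaningless until training has run:

% Optional: query the untrained critic with random inputs.
obsSample = {rand(obsInfo.Dimension)};
actSample = {rand(actInfo.Dimension)};
q0 = getValue(critic,obsSample,actSample)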

A DDPG agent uses an actor representation to decide which action to take, given observations. To create the actor, first create a deep neural network with one input (the observation) and one output (the action).

Construct the actor in a manner similar to the critic. For more information, see rlDeterministicActorRepresentation.

actorNetwork = [
    featureInputLayer(numObservations,'Normalization','none','Name','observation')
    fullyConnectedLayer(128,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(200,'Name','ActorFC2')
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(1,'Name','ActorFC3')
    tanhLayer('Name','ActorTanh1')
    scalingLayer('Name','ActorScaling','Scale',max(actInfo.UpperLimit))];

actorOptions = rlRepresentationOptions('LearnRate',5e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling'},actorOptions);
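
Similarly, you can verify that the untrained actor already outputs a force within the expected range, because the tanh and scaling layers bound it (an optional check, not in the original post):

% Optional: the actor output should lie within [-15, 15] N.
a0 = getAction(actor,{rand(obsInfo.Dimension)})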

To create a DDPG agent, first use rlDDPGAgentOptions to specify the DDPG agent options.

agentOptions = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'MiniBatchSize',128);
agentOptions.NoiseOptions.Variance = 0.4;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
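
The noise settings above configure the Ornstein-Uhlenbeck exploration model, whose variance decays geometrically. Assuming the decay is applied once per agent sample step (an assumption about the option semantics, not stated in the original post), you can estimate how long the exploration noise stays significant:

% Rough estimate (assumption: variance is multiplied by (1 - VarianceDecayRate)
% at every agent step), so it halves after roughly 6.9e4 steps.
halfLifeSteps = log(0.5)/log(1 - agentOptions.NoiseOptions.VarianceDecayRate)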

Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOptions);

Train the agent

To train the agent, first specify the training options. For this example, use the following options.

  1. Run the training for at most 2000 episodes, with each episode lasting at most ceil(Tf/Ts) time steps.

  2. Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).

  3. Stop training when the agent's average cumulative reward over five consecutive episodes is greater than –400. At that point, the agent can quickly balance the pole in the upright position using minimal control effort.

  4. Save a copy of the agent for each episode in which the cumulative reward is greater than –400.

For more information, see rlTrainingOptions.

maxepisodes = 2000;
maxsteps = ceil(Tf/Ts);
trainingOptions = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',5,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-400,...
    'SaveAgentCriteria','EpisodeReward',...
    'SaveAgentValue',-400);

Train the agent using the train function. Training this agent is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;

if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainingOptions);
else
    % Load the pretrained agent for the example.
    load('SimscapeCartPoleDDPG.mat','agent')
end
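
If you do train the agent yourself, you could keep a copy of the result for later reuse (the file name below is only an example, not from the original post):

% Optional: save a newly trained agent under an example file name.
if doTraining
    save('myCartPoleDDPGAgent.mat','agent')
end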


Simulate the DDPG agent

To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
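
To get a quick quantitative read on the episode (an optional step, not in the original post), you can pull the reward signal out of the returned experience structure, which stores it as a timeseries:

% Optional: total reward collected during the simulated episode.
totalReward = sum(experience.Reward.Data)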


bdclose(mdl)

Source: blog.csdn.net/wangyifan123456zz/article/details/109606175