Train a DDPG agent to control an inverted pendulum system in Simulink
This example shows how to train a deep deterministic policy gradient (DDPG) agent to control an inverted pendulum system modeled with Simscape™ Multibody™.
MATLAB version: R2020b
Simscape model of the inverted pendulum
The reinforcement learning environment for this example is a pole attached to an unactuated joint on a cart that moves along a frictionless track. The training goal is to make the pole stand upright without falling over, using minimal control effort.
Open the model
mdl = 'rlCartPoleSimscapeModel';
open_system(mdl)
The inverted pendulum system is modeled using Simscape Multibody.
For this model:
- The upright, balanced pole position is 0 radians, and the downward hanging position is pi radians.
- The force signal from the agent to the environment ranges from –15 to 15 N.
- The observations from the environment are the position and velocity of the cart, as well as the sine, cosine, and derivative of the pole angle.
- The episode terminates if the cart moves more than 3.5 meters from its original position.
The reward r_t provided at every time step is:

r_t = −0.1 (5θ_t² + x_t² + 0.05 u_{t−1}²) − 100 B

where:
- θ_t is the angle of displacement from the upright position of the pole.
- x_t is the displacement of the cart from its center position.
- u_{t−1} is the control effort from the previous time step.
- B is a flag (1 or 0) indicating whether the cart is out of bounds.
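As a sanity check, the reward can be expressed as a small MATLAB helper. This is a hypothetical illustration, not part of the predefined environment, and it assumes the standard cart-pole swing-up reward described above.

```matlab
% Hypothetical helper implementing the reward described above
% (not part of the predefined environment).
rewardFcn = @(theta,x,uPrev,B) -0.1*(5*theta^2 + x^2 + 0.05*uPrev^2) - 100*B;

% Upright, centered, no effort, in bounds: reward is 0 (the maximum).
rewardFcn(0,0,0,0)

% Hanging down (theta = pi) is strongly penalized.
rewardFcn(pi,0,0,0)
```

The quadratic penalties on angle, position, and effort explain why the stopping criterion later in this example is a negative threshold (–400): the best achievable episode reward is near zero, not positive.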
For more information about this model, see Load Predefined Simulink Environments.
Create environment interface
Create a predefined environment interface for the cart-pole system.
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous')
The interface has a continuous action space in which the agent can apply force values from –15 to 15 N to the cart.
Obtain observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
Specify the simulation time Tf and the agent sampling time Ts in seconds.
Ts = 0.02;
Tf = 25;
Fix the random generator seed for reproducibility.
rng(0)
Create DDPG agent
The DDPG agent uses a critic value function representation to estimate the long-term reward given observations and actions. To create the critic, first create a deep neural network with two inputs (the state and the action) and one output. The input size of the action path is [1 1 1] because the agent applies its action to the environment as a single force value. For more information on creating deep neural network value function representations, see Create Policy and Value Function Representations.
statePath = [
featureInputLayer(numObservations,'Normalization','none','Name','observation')
fullyConnectedLayer(128,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(200,'Name','CriticStateFC2')];
actionPath = [
featureInputLayer(1,'Normalization','none','Name','action')
fullyConnectedLayer(200,'Name','CriticActionFC1','BiasLearnRateFactor',0)];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
View the critic network configuration.
figure
plot(criticNetwork)
Specify options for the critic representation using rlRepresentationOptions.
criticOptions = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
Create the critic representation using the specified deep neural network and options. You must also specify the action and observation information for the critic, which you obtained from the environment interface. For more information, see rlQValueRepresentation.
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);
The DDPG agent uses an actor representation to decide which action to take given observations. To create the actor, first create a deep neural network with one input (the observation) and one output (the action).
Construct the actor in a manner similar to the critic. For more information, see rlDeterministicActorRepresentation.
actorNetwork = [
featureInputLayer(numObservations,'Normalization','none','Name','observation')
fullyConnectedLayer(128,'Name','ActorFC1')
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(200,'Name','ActorFC2')
reluLayer('Name','ActorRelu2')
fullyConnectedLayer(1,'Name','ActorFC3')
tanhLayer('Name','ActorTanh1')
scalingLayer('Name','ActorScaling','Scale',max(actInfo.UpperLimit))];
actorOptions = rlRepresentationOptions('LearnRate',5e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling'},actorOptions);
To create a DDPG agent, first use rlDDPGAgentOptions to specify the DDPG agent options.
agentOptions = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'TargetSmoothFactor',1e-3,...
'ExperienceBufferLength',1e6,...
'MiniBatchSize',128);
agentOptions.NoiseOptions.Variance = 0.4;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
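To get a feel for how quickly the exploration noise fades, the decay can be sketched as follows. This assumes the variance decays geometrically once per sample, as Variance_k = Variance·(1 − VarianceDecayRate)^k; consult the rlDDPGAgentOptions noise documentation for the exact behavior in your release.

```matlab
% Sketch (assumption): geometric decay of the noise variance per sample,
%   Variance_k = Variance * (1 - VarianceDecayRate)^k
% Number of samples for the variance to halve:
halfLifeSteps = log(0.5)/log(1 - 1e-5);   % roughly 6.9e4 samples
```

With Ts = 0.02 s, that half-life corresponds to on the order of a thousand seconds of simulated time, so exploration remains significant across many episodes.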
Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train the agent
To train the agent, first specify the training options. For this example, use the following options.
- Run the training for at most 2000 episodes, with each episode lasting at most ceil(Tf/Ts) time steps.
- Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
- Stop training when the agent receives an average cumulative reward greater than –400 over five consecutive episodes. At that point, the agent can quickly balance the pole in the upright position using minimal control effort.
- Save a copy of the agent for each episode in which the cumulative reward is greater than –400.
For more information, see rlTrainingOptions .
maxepisodes = 2000;
maxsteps = ceil(Tf/Ts);
trainingOptions = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-400,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-400);
Train the agent using the train function. Training this agent is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOptions);
else
% Load the pretrained agent for the example.
load('SimscapeCartPoleDDPG.mat','agent')
end
Simulate the DDPG agent
To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
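The output of sim can be inspected to check how the agent performed. This sketch assumes that the experience structure stores the reward as a timeseries in a Reward field, which is the typical layout of sim output in Reinforcement Learning Toolbox.

```matlab
% Sketch (assumption): experience.Reward is a timeseries logged by sim,
% so the cumulative reward of the simulated episode is:
totalReward = sum(experience.Reward.Data)
```

A well-trained agent should produce a cumulative reward close to zero, since every nonzero angle, displacement, or control effort is penalized.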
bdclose(mdl)