MATLAB Reinforcement Learning Toolbox (8): Pendulum Modeling and DDPG Training

This example shows how to set up a pendulum model in Simulink and train a DDPG agent to control it.

For how to load the model, refer to my previous blog post on DQN.

Open the model and create the environment interface

Open the model

mdl = 'rlSimplePendulumModel';
open_system(mdl)

Create a predefined environment interface for the pendulum.

env = rlPredefinedEnv('SimplePendulumModel-Continuous')

The interface has a continuous action space in which the agent can apply a torque between -2 and 2 N·m to the pendulum.
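
To confirm that the action space is continuous, you can inspect the action specification returned by the environment. This is a quick optional check; the limits noted in the comments assume the default predefined environment.

% Optional check: the action specification is an rlNumericSpec whose limits
% give the torque range (expected: -2 and 2 N·m for this environment).
actSpec = getActionInfo(env);
disp(actSpec.LowerLimit)
disp(actSpec.UpperLimit)
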
Set the observations of the pendulum to the sine of the pendulum angle, the cosine of the pendulum angle, and the pendulum angle derivative.

numObs = 3;
set_param('rlSimplePendulumModel/create observations','ThetaObservationHandling','sincos');

To define the initial condition of the pendulum as hanging downward, specify an environment reset function using an anonymous function handle. This reset function sets the model workspace variable theta0 to pi.

env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
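
As an illustrative variation (an assumption, not part of this example), the same reset mechanism could start each episode at a random angle, which can encourage exploration:

% Hypothetical alternative reset (not used in this example): start each
% episode at a random angle in (-pi, pi) instead of hanging straight down.
randomReset = @(in)setVariable(in,'theta0',pi*(2*rand-1),'Workspace',mdl);
% env.ResetFcn = randomReset;   % uncomment to use the randomized reset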

Specify the simulation time Tf and the agent sample time Ts in seconds.

Ts = 0.05;
Tf = 20;

Fix the random generator seed for reproducibility.

rng(0)

Create DDPG agent

A DDPG agent estimates the long-term reward, given observations and actions, using a critic value function representation. To create the critic, first create a deep neural network with two inputs (the state and the action) and one output. For more information on creating deep neural network value function representations, see Create Policy and Value Function Representations.

statePath = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(400,'Name','CriticStateFC1')
    reluLayer('Name', 'CriticRelu1')
    fullyConnectedLayer(300,'Name','CriticStateFC2')];
actionPath = [
    featureInputLayer(1,'Normalization','none','Name','action')
    fullyConnectedLayer(300,'Name','CriticActionFC1','BiasLearnRateFactor',0)];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];

criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
    
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');

View the critic network configuration.

figure
plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.

criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);

Create the critic representation using the specified deep neural network and options. You must also specify the action and observation information for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOpts);
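
As an optional sanity check (a sketch; the input formats for representation queries can vary slightly between toolbox releases), you can evaluate the untrained critic for a random observation and a zero-torque action:

% The critic should return a single scalar Q-value estimate.
qValue = getValue(critic,{rand(numObs,1)},{0});
disp(qValue)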

A DDPG agent decides which action to take, given observations, using an actor representation. To create the actor, first create a deep neural network with one input (the observation) and one output (the action). Construct the actor in a manner similar to the critic. For more information, see rlDeterministicActorRepresentation.

actorNetwork = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(400,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(300,'Name','ActorFC2')
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(1,'Name','ActorFC3')
    tanhLayer('Name','ActorTanh')
    scalingLayer('Name','ActorScaling','Scale',max(actInfo.UpperLimit))];

actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling'},actorOpts);
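
As another optional check (a sketch, assuming the upper torque limit is 2 N·m), query the untrained actor with a random observation. Because the tanh layer is followed by the scaling layer, the returned torque should already lie within the [-2, 2] N·m action range:

% The tanh layer bounds the raw output to [-1,1]; the scaling layer then
% multiplies by max(actInfo.UpperLimit), so the action stays in [-2, 2].
sampleAction = getAction(actor,{rand(numObs,1)});
disp(sampleAction)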

To create a DDPG agent, first use rlDDPGAgentOptions to specify DDPG agent options.

agentOpts = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'MiniBatchSize',128);
agentOpts.NoiseOptions.Variance = 0.6;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
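
The two noise settings control exploration: the agent adds Ornstein-Uhlenbeck noise with the given variance to the actor's action, and the variance decays over time. The short sketch below estimates how quickly exploration shrinks, assuming the variance is multiplied by (1 - VarianceDecayRate) at every agent step (the exact schedule may differ between releases):

% Rough estimate of the exploration decay (sketch; assumes multiplicative
% decay of the noise variance at every agent sample time).
v0    = agentOpts.NoiseOptions.Variance;            % 0.6
decay = agentOpts.NoiseOptions.VarianceDecayRate;   % 1e-5
halfLifeSteps = log(0.5)/log(1 - decay);            % ~6.9e4 steps to halve
fprintf('Noise variance halves after roughly %.0f agent steps\n',halfLifeSteps)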

Then create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOpts);

Train the agent

To train the agent, first specify the training options. For this example, use the following options.

  1. Run the training for at most 5000 episodes, with each episode lasting at most ceil(Tf/Ts) time steps.

  2. Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).

  3. Stop training when the agent receives an average cumulative reward greater than –740 over five consecutive episodes. At that point, the agent can quickly balance the pendulum in the upright position using minimal control effort.

  4. Save a copy of the agent for each episode where the cumulative reward is greater than –740.

For more information, see rlTrainingOptions .

maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',5,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-740,...
    'SaveAgentCriteria','EpisodeReward',...
    'SaveAgentValue',-740);

Train the agent using the train function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('SimulinkPendulumDDPG.mat','agent')
end
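
If you do train the agent yourself, you can save it under the same file name the example loads from (a convenience sketch, not part of the original example):

if doTraining
    % Save the newly trained agent so the load call above can reuse it later.
    save('SimulinkPendulumDDPG.mat','agent')
end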


Simulate the DDPG agent

To verify the performance of the trained agent, simulate it within the pendulum environment. For more information about agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
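
To put a number on the result (a sketch; it assumes the experience output stores the reward as a timeseries, as in recent toolbox releases), sum the rewards collected during the simulation:

% Cumulative reward over the simulated episode (closer to 0 is better here,
% since the environment uses negative step rewards).
totalReward = sum(experience.Reward.Data);
disp(totalReward)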


Original post: blog.csdn.net/wangyifan123456zz/article/details/109499651