MATLAB Reinforcement Learning in Practice (2): Training an Inverted Pendulum System Using Parallel Computing

This example shows how to train an actor-critic (AC) agent with asynchronous parallel training to balance a cart-pole system modeled in MATLAB®. For an example that trains the agent without parallel computing, see Training an AC agent to balance an inverted pendulum system.

MATLAB version: R2020b.

Parallel training of actors

When you use parallel computing with AC agents, each worker generates experiences from its own copy of the agent and the environment. Every N steps, the worker computes gradients from those experiences and sends the computed gradients back to the host agent. The host agent updates its parameters as follows (a conceptual sketch appears after this list).

  1. For asynchronous training, the host agent applies the received gradients without waiting for all workers to send gradients, then sends the updated parameters back to the worker that provided the gradients. That worker then continues to generate experiences from its environment using the updated parameters.

  2. For synchronous training, the host agent waits until it has received gradients from all workers and uses these gradients to update its parameters. The host then sends the updated parameters to all workers at the same time, and all workers continue to generate experiences using the updated parameters.
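
The following is a conceptual sketch of the asynchronous scheme in plain MATLAB (not Reinforcement Learning Toolbox code): gradients arrive from one worker at a time, and the host applies each one immediately rather than waiting for the rest. The worker count, learning rate, and parameter vector are hypothetical placeholders.

% Conceptual sketch only: an asynchronous parameter-server loop.
numWorkers = 3;          % hypothetical number of workers
learnRate  = 1e-2;       % hypothetical learning rate
hostParams = zeros(4,1); % hypothetical shared parameter vector

for update = 1:6
    w = mod(update-1,numWorkers) + 1;               % worker whose gradient arrives next
    workerGrad = randn(4,1);                        % stand-in for gradients computed from experiences
    hostParams = hostParams - learnRate*workerGrad; % applied immediately, no waiting for other workers
    fprintf('Update %d: applied gradient from worker %d\n',update,w);
end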

Create Cart-Pole MATLAB environment interface

Create a predefined environment interface for the cart-pole system. For more information about this environment, see Loading a predefined control system environment.

env = rlPredefinedEnv("CartPole-Discrete");
env.PenaltyForFalling = -10;

Obtain observation and action information from the environment interface.

obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
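
As a quick optional check (not part of the original example), you can inspect the action specification to confirm the discrete action set; for the predefined cart-pole environment it holds the two force values -10 and 10.

numActions = numel(actInfo.Elements)   % expected: 2
actInfo.Elements                       % expected: [-10 10]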

Fix the random generator seed for reproducibility.

rng(0)

Create AC agent

The AC agent uses a critic value function representation to estimate the long-term reward based on observations. To create the critic, first create a deep neural network with one input (the observation) and one output (the state value). The input size of the critic network is 4 because the environment provides 4 observations. For more information on creating deep neural network value function representations, see Create Policy and Value Function Representations.

criticNetwork = [
    featureInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1, 'Name', 'CriticFC')];

criticOpts = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);

critic = rlValueRepresentation(criticNetwork,obsInfo,'Observation',{'state'},criticOpts);

The AC agent uses an actor representation to decide which action to take, given the observation. To create the actor, create a deep neural network with one input (the observation) and one output (the action). The output size of the actor network is 2 because the agent can apply one of two force values, -10 or 10, to the environment.

actorNetwork = [
    featureInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(32, 'Name','ActorStateFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','action')];

actorOpts = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);

actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'state'},actorOpts);

To create an AC agent, first use rlACAgentOptions to specify AC agent options.

agentOpts = rlACAgentOptions(...
    'NumStepsToLookAhead',32,...
    'EntropyLossWeight',0.01,...
    'DiscountFactor',0.99);

Then create the agent using the actor representation, the critic representation, and the specified agent options. For more information, see rlACAgent.

agent = rlACAgent(actor,critic,agentOpts);

Parallel training options

To train the agent, first specify the training options. For this example, use the following options.

  1. Run each training session for at most 1000 episodes, with each episode lasting at most 500 time steps.

  2. Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option).

  3. Stop training when the average cumulative reward that the agent receives over 10 consecutive episodes is greater than 500. At that point, the agent can balance the pendulum in the upright position.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode', 500,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',500,...
    'ScoreAveragingWindowLength',10); 

You can use the plot function to visualize the inverted pendulum system during training or simulation.

plot(env)

[Figure: cart-pole environment visualization]
To train the agent using parallel computing, specify the following training options.

  1. Set the UseParallel option to true.

  2. Train the agent in parallel asynchronously by setting the ParallelizationOptions.Mode option to "async".

  3. Every 32 steps, each worker computes gradients from its experiences and sends them to the host.

  4. AC agents require workers to send "gradients" to the host (the DataToSendFromWorkers option).

  5. AC agents require StepsUntilDataIsSent to be equal to agentOpts.NumStepsToLookAhead.

trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
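
When UseParallel is true, train opens a parallel pool automatically if one is not already running, using your default cluster profile. To fix the number of workers yourself (the training figure below was produced with six workers), you can open the pool explicitly before training. A minimal sketch, assuming the Parallel Computing Toolbox is installed:

if isempty(gcp('nocreate'))   % only open a pool if none is running
    parpool(6);               % six local workers, matching the figure in this example
end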

Train the agent

Use the train function to train the agent. Training the agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Because of the randomness in asynchronous parallel training, your results may differ from the training figure below, which shows the results of training with six workers.

doTraining = false;

if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABCartpoleParAC.mat','agent');
end
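
If you train the agent yourself and want to reuse it later, you can save it with the standard save command. The file name below is just an example placeholder:

save('myCartpoleParAC.mat','agent');   % save the trained agent to a MAT-file of your choice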

[Figure: training progress with six workers]

AC agent simulation

You can use the plot function to visualize the inverted pendulum system during the simulation.

plot(env)

To validate the performance of the trained agent, simulate it in the cart-pole environment. For more information about agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

[Figure: cart-pole visualization during simulation]

totalReward = sum(experience.Reward)

totalReward = 500

References

[1] Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. ‘Asynchronous Methods for Deep Reinforcement Learning’. ArXiv:1602.01783 [Cs], 16 June 2016. https://arxiv.org/abs/1602.01783.
