MATLAB Reinforcement Learning in Practice (9): Multi-Agent Area Exploration

This example demonstrates a multi-agent collaborative-competitive task in which three Proximal Policy Optimization (PPO) agents are trained to explore all cells of a grid world environment.

Multi-agent training is supported only in the Simulink® environment. As shown in this example, if you define the environment behavior using a MATLAB® System object, you can incorporate it into the Simulink environment with a MATLAB System (Simulink) block.

Create the environment

The environment in this example is a 12x12 grid world containing obstacles. Unexplored cells are marked white and obstacles are marked black. The red, green, and blue circles represent the three robots in the environment. Three Proximal Policy Optimization (PPO) agents with discrete action spaces control the robots. To learn more about PPO agents, see Proximal Policy Optimization Agents.

Each agent provides one of five possible movement actions (wait, up, down, left, or right) to its robot, and the environment determines whether the action is legal or illegal. For example, when a robot is next to the left boundary of the environment, moving left is illegal. Similarly, collisions with obstacles or with the other robots in the environment are illegal and are penalized. The environment dynamics are deterministic: legal actions are executed with probability 1 and illegal actions with probability 0. The overall goal is to explore all cells as quickly as possible.
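
The legality check itself is implemented inside the environment (GridWorld.m). The function below is only a minimal sketch of that logic; the name isActionLegal, its argument list, and the direction convention are assumptions for illustration.

function legal = isActionLegal(pos,action,obsMat,otherPos,gridSize)
    % Map the discrete action (0..4) to a row/column offset
    % (direction convention assumed here).
    moves = [0 0; -1 0; 1 0; 0 -1; 0 1];   % WAIT, UP, DOWN, LEFT, RIGHT
    newPos = pos + moves(action+1,:);
    insideGrid   = all(newPos >= 1) && all(newPos <= gridSize);
    hitsObstacle = any(all(obsMat == newPos,2));
    hitsRobot    = any(all(otherPos == newPos,2));
    legal = insideGrid && ~hitsObstacle && ~hitsRobot;
end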

At each time step, an agent observes the state of the environment through a set of four images that identify the cells with obstacles, the current position of the robot being controlled, the positions of the other robots, and the cells explored so far during the episode. These images are combined into a 12x12x4 image observation. The figure below shows an example of what the agent controlling the green robot observes at a given time step.
[Figure: example 12x12x4 observation for the agent controlling the green robot]
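
The shipped System object builds these channels internally. The snippet below is only a rough sketch of how such a 12x12x4 observation could be assembled; the variable names and the channel order are assumptions, and obsMat is the obstacle matrix defined later in this example.

gridSize = [12 12];
selfPos  = [2 2];          % controlled robot (example position)
otherPos = [11 4; 3 12];   % the other two robots (example positions)

obsObstacle = zeros(gridSize);                         % channel 1: obstacles
obsObstacle(sub2ind(gridSize,obsMat(:,1),obsMat(:,2))) = 1;
obsSelf = zeros(gridSize);                             % channel 2: own position
obsSelf(selfPos(1),selfPos(2)) = 1;
obsOther = zeros(gridSize);                            % channel 3: other robots
obsOther(sub2ind(gridSize,otherPos(:,1),otherPos(:,2))) = 1;
obsVisited = zeros(gridSize);                          % channel 4: explored cells
observation = cat(3,obsObstacle,obsSelf,obsOther,obsVisited);   % 12x12x4
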
For the grid world environment:

  1. The search area is a 12x12 grid with obstacles.

  2. The observation result of each agent is a 12x12x4 image.

  3. The discrete action set is a set of five actions (WAIT = 0, UP = 1, DOWN = 2, LEFT = 3, RIGHT = 4).

  4. The simulation will terminate when the grid is fully explored or the maximum number of steps is reached.

At each time step, an agent receives the following rewards and penalties; a minimal sketch of this reward logic appears after the list.

  1. +1 for moving to a previously unexplored (white) cell.

  2. -0.5 for an illegal action (attempting to move off the grid or colliding with another robot or an obstacle).

  3. -0.05 for an action that results in movement (movement cost).

  4. -0.1 for an action that does not result in movement (laziness penalty).

  5. If the grid is fully explored, +200 times the robot's coverage contribution during the episode (the ratio of cells it explored to the total number of cells).
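
The actual reward computation lives in GridWorld.m. The function below is only a sketch of the rules listed above, with hypothetical inputs (moved, legal, newCell, coverage, done).

function r = stepReward(moved,legal,newCell,coverage,done)
    r = 0;
    if ~legal
        r = r - 0.5;              % illegal action penalty
    elseif moved
        r = r - 0.05;             % movement cost
        if newCell
            r = r + 1;            % previously unexplored cell
        end
    else
        r = r - 0.1;              % laziness penalty
    end
    if done
        r = r + 200*coverage;     % coverage contribution when fully explored
    end
end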

Define the locations of the obstacles within the grid using an index matrix. The first column contains the row indices and the second column contains the column indices.

obsMat = [4 3; 5 3; 6 3; 7 3; 8 3; 9 3; 5 11; 6 11; 7 11; 8 11; 5 12; 6 12; 7 12; 8 12];

Initialize the robot positions.

sA0 = [2 2];
sB0 = [11 4];
sC0 = [3 12];
s0 = [sA0; sB0; sC0];

Specify the sampling time, simulation time and the maximum number of steps for each episode.

Ts = 0.1;
Tf = 100;
maxsteps = ceil(Tf/Ts);

Open the Simulink model.

mdl = "rlAreaCoverage";
open_system(mdl)

The GridWorld block is the MATLAB System block that represents the training environment. The System object for this environment is defined in GridWorld.m; a sketch of its general shape follows.
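
GridWorld.m ships with the example files and is not reproduced here. The skeleton below only illustrates the general shape of such an environment System object; the class name, properties, and port layout are assumptions.

classdef GridWorldSketch < matlab.System
    properties
        ObstacleMatrix   % N-by-2 row/column obstacle indices
        InitialPositions % 3-by-2 initial robot positions
    end
    properties (Access = private)
        Visited          % logical map of explored cells
    end
    methods (Access = protected)
        function setupImpl(obj)
            obj.Visited = false(12,12);
        end
        function [obs,reward,isDone] = stepImpl(obj,actA,actB,actC)
            % Apply the three actions, update Visited, and build the
            % observations, rewards, and termination flag (placeholder
            % values here; the real logic lives in GridWorld.m).
            obs = zeros(12,12,4,3);
            reward = zeros(1,3);
            isDone = all(obj.Visited(:));
        end
        function resetImpl(obj)
            obj.Visited = false(12,12);
        end
    end
end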

In this example, the agents are homogeneous and share the same observation and action specifications. Create the observation and action specifications for the environment. For more information, see rlNumericSpec and rlFiniteSetSpec.

% Define observation specifications.
obsSize = [12 12 4];
oinfo = rlNumericSpec(obsSize);
oinfo.Name = 'observations';

% Define action specifications.
numAct = 5;
actionSpace = {0,1,2,3,4};
ainfo = rlFiniteSetSpec(actionSpace);
ainfo.Name = 'actions';

Specify the block paths for the agents.

blks = mdl + ["/Agent A (Red)","/Agent B (Green)","/Agent C (Blue)"];

Create the environment interface, specifying the same observation and action specifications for all three agents.

env = rlSimulinkEnv(mdl,blks,{oinfo,oinfo,oinfo},{ainfo,ainfo,ainfo});

Specify the reset function for the environment. The reset function resetMap ensures that the robots start from random initial positions at the beginning of each episode. Random initialization makes the agents robust to different starting positions and improves training convergence. A sketch of what such a reset function might do appears after the assignment below.

env.ResetFcn = @(in) resetMap(in, obsMat);
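
resetMap is provided with the example files. The sketch below shows what such a reset function might do with the Simulink.SimulationInput object it receives; the function name and the model variable s0 passed through setVariable are assumptions.

function in = resetMapSketch(in,obsMat)
    % Pick three distinct obstacle-free cells as random initial positions.
    occupied = sub2ind([12 12],obsMat(:,1),obsMat(:,2));
    free = setdiff((1:144)',occupied);
    idx = free(randperm(numel(free),3));
    [r,c] = ind2sub([12 12],idx);
    % Hand the new initial state to the model, for example as a variable.
    in = setVariable(in,'s0',[r c]);
end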

Create the agents

PPO agents rely on actor and critic representations to learn the optimal policy. In this example, the agents maintain deep neural network-based function approximators for the actor and critic. Both the actor and critic have similar network structures with convolutional and fully connected layers. The critic outputs a scalar estimate of the state value V(s). The actor outputs the probability π(a|s) of taking each of the five actions. For more information, see rlValueRepresentation and rlStochasticActorRepresentation.

Set a random seed for reproducibility.

rng(0)

Use the following steps to create the actor and critic representations.

  1. Create deep neural networks for the actor and critic.

  2. Specify the representation options for the actor and critic. In this example, specify the learning rate and gradient threshold. For more information, see rlRepresentationOptions.

  3. Create the actor and critic representation objects.

Use the same network structure and representation options for all three agents.

for idx = 1:3
    % Create actor deep neural network.
    actorNetWork = [
        imageInputLayer(obsSize,'Normalization','none','Name','observations')
        convolution2dLayer(8,16,'Name','conv1','Stride',1,'Padding',1,'WeightsInitializer','he')
        reluLayer('Name','relu1')
        convolution2dLayer(4,8,'Name','conv2','Stride',1,'Padding','same','WeightsInitializer','he')
        reluLayer('Name','relu2')
        fullyConnectedLayer(256,'Name','fc1','WeightsInitializer','he')
        reluLayer('Name','relu3')
        fullyConnectedLayer(128,'Name','fc2','WeightsInitializer','he')
        reluLayer('Name','relu4')
        fullyConnectedLayer(64,'Name','fc3','WeightsInitializer','he')
        reluLayer('Name','relu5')
        fullyConnectedLayer(numAct,'Name','output')
        softmaxLayer('Name','action')];
    
    % Create critic deep neural network.
    criticNetwork = [
        imageInputLayer(obsSize,'Normalization','none','Name','observations')
        convolution2dLayer(8,16,'Name','conv1','Stride',1,'Padding',1,'WeightsInitializer','he')
        reluLayer('Name','relu1')
        convolution2dLayer(4,8,'Name','conv2','Stride',1,'Padding','same','WeightsInitializer','he')
        reluLayer('Name','relu2')
        fullyConnectedLayer(256,'Name','fc1','WeightsInitializer','he')
        reluLayer('Name','relu3')
        fullyConnectedLayer(128,'Name','fc2','WeightsInitializer','he')
        reluLayer('Name','relu4')
        fullyConnectedLayer(64,'Name','fc3','WeightsInitializer','he')
        reluLayer('Name','relu5')
        fullyConnectedLayer(1,'Name','output')];
    
    % Specify representation options for the actor and critic.
    actorOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
    criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
    
    % Create the actor and critic representations.
    actor(idx) = rlStochasticActorRepresentation(actorNetWork,oinfo,ainfo,...
        'Observation',{'observations'},actorOpts);
    critic(idx) = rlValueRepresentation(criticNetwork,oinfo,...
        'Observation',{'observations'},criticOpts);
end

Use rlPPOAgentOptions to specify the agent options. Use the same options for all three agents. During training, the agents collect experience until they reach an experience horizon of 128 steps and then train on mini-batches of 64 experiences. A clip factor of 0.2 improves training stability, and a discount factor of 0.995 encourages long-term rewards.

opt = rlPPOAgentOptions(...
    'ExperienceHorizon',128,...
    'ClipFactor',0.2,...
    'EntropyLossWeight',0.01,...
    'MiniBatchSize',64,...
    'NumEpoch',3,...
    'AdvantageEstimateMethod','gae',...
    'GAEFactor',0.95,...
    'SampleTime',Ts,...
    'DiscountFactor',0.995);

Create the agents using the defined actors, critics, and options.

agentA = rlPPOAgent(actor(1),critic(1),opt);
agentB = rlPPOAgent(actor(2),critic(2),opt);
agentC = rlPPOAgent(actor(3),critic(3),opt);
agents = [agentA,agentB,agentC];

Train the agents

Specify the following options to train the agents.

  1. Train for at most 1000 episodes, with each episode lasting at most maxsteps (1000) time steps.

  2. Stop training an agent when its average reward over 100 consecutive episodes reaches 80 or more.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',maxsteps,...
    'Plots','training-progress',...
    'ScoreAveragingWindowLength',100,...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',80); 

To train multiple agents, pass an array of agents to the train function. The order of the agents in the array must match the order of the agent block paths specified when creating the environment. Doing so ensures that each agent object is linked to the correct action and observation specifications in the environment.

Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load the pretrained agent parameters by setting doTraining to false. To train the agents yourself, set doTraining to true.

doTraining = false;
if doTraining
    stats = train(agents,env,trainOpts);
else
    load('rlAreaCoverageParameters.mat');
    setLearnableParameters(agentA,agentAParams);
    setLearnableParameters(agentB,agentBParams);
    setLearnableParameters(agentC,agentCParams);
end

The figure below shows a screenshot of the training progress. Due to the randomness of the training process, you may get different results.
[Figure: training progress plot]

Simulate the agents

Simulate the trained agents in the environment.

rng(0) % reset the random seed
simOpts = rlSimulationOptions('MaxSteps',maxsteps);
experience = sim(env,agents,simOpts);

These agents successfully covered the entire grid world.
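
One way to check the result is to total each agent's reward from the output of sim. The field names below follow the documented experience structure, but treat this as a quick sketch.

% Sum the reward collected by each agent during the simulation.
totalReward = zeros(1,3);
for idx = 1:3
    totalReward(idx) = sum(experience(idx).Reward.Data);
end
disp(totalReward)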


Source: blog.csdn.net/wangyifan123456zz/article/details/109625104