MATLAB Reinforcement Learning in Practice (8): Training Multiple Agents to Perform a Collaborative Task

Train multiple agents to perform collaborative tasks

This example shows how to set up multi-agent training in the Simulink® environment. In this example, you train two agents to collaboratively perform the task of moving an object.

In this example, the environment is a frictionless two-dimensional surface containing elements represented by circles. The target object C is represented by a blue circle with a radius of 2 m, and robots A (red) and B (green) are represented by smaller circles, each with a radius of 1 m. The robots attempt to push object C outside a ring of radius 8 m by applying collision forces. All elements in the environment have mass and obey Newton's laws of motion. In addition, the contact forces between the elements and with the environment boundaries are modeled as spring-mass-damper systems. The elements can move over the surface when external forces are applied in the X and Y directions; there is no motion in the third dimension, and the total energy of the system is conserved.
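As a rough illustration of the spring-damper contact model mentioned above (not the exact implementation inside the Simulink model), a linear spring-damper maps a penetration depth and penetration rate between two overlapping elements to a repulsive normal force. The stiffness and damping values below are the ones listed later in this section; the parameterization by delta and dDelta is an assumption for illustration only.

% Illustrative sketch of a linear spring-damper contact force. The actual
% contact implementation in the Simulink model may differ in form and sign
% conventions; delta and dDelta are assumed names for the penetration depth
% and penetration rate.
kContact = 100;    % contact stiffness (N/m)
cContact = 0.1;    % contact damping (N/(m/s))
contactForce = @(delta,dDelta) kContact*delta + cContact*dDelta;   % normal force (N)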

Create the parameter set required for this example.

rlCollaborativeTaskParams
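The script rlCollaborativeTaskParams ships with the example and is not listed in this post. Based on the geometry described above and the variable names used later (RA, RB, RC, boundaryR, maxF, Ts), a minimal sketch of the workspace variables it is assumed to define looks like the following; the sample time Ts in particular is an assumed value, so check the shipped script for the actual numbers.

% Minimal sketch of the parameters assumed to be set by rlCollaborativeTaskParams
RA = 1;           % radius of robot A (m)
RB = 1;           % radius of robot B (m)
RC = 2;           % radius of object C (m)
boundaryR = 8;    % radius of the ring that object C must leave (m)
maxF = 1.0;       % maximum external force magnitude (N)
Ts = 0.1;         % agent sample time (s) -- assumed value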

Open the Simulink model.

mdl = "rlCollaborativeTask";
open_system(mdl)

For this environment:

  1. The two-dimensional space ranges from –12 m to 12 m in both the X and Y directions.

  2. The contact spring stiffness and damping values are 100 N/m and 0.1 N/m/s, respectively.

  3. The agents share the same observations: the positions and velocities of A, B, and C, and the action values from the previous time step.

  4. The simulation ends when object C moves outside the ring.

  5. At each time step, the agents receive the following rewards (a structural sketch of this reward computation follows the variable definitions below):

$r_A = r_{global} + r_{local,A}$
$r_B = r_{global} + r_{local,B}$

Here:

  1. $r_A$ and $r_B$ are the rewards received by agent A and agent B, respectively.
  2. $r_{global}$ is a team reward, which both agents receive as object C moves closer to the boundary of the ring.
  3. $r_{local,A}$ and $r_{local,B}$ are local penalties received by agent A and agent B, based on the distance between each agent and object C and on the magnitude of the action from the previous time step.
  4. $d_C$ is the distance of object C from the center of the ring.
  5. $d_{AC}$ and $d_{BC}$ are the distances between agent A and object C and between agent B and object C, respectively.
  6. $u_A$ and $u_B$ are the action values of agent A and agent B from the previous time step.
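The exact reward weights live inside the Simulink model and are not reproduced in this post. The sketch below only captures the structure described above: a shared team reward that grows as object C approaches the ring boundary, plus per-agent penalties on distance to the object and on the size of the previous action. The weights w1, w2, and w3 are placeholders, not the values used in the shipped model.

% Structural sketch of the reward signal (placeholder weights, not the
% values used in the actual model)
w1 = 0.001;  w2 = 0.005;  w3 = 0.01;                 % assumed example weights
rGlobal = @(dC) w1*dC;                                % team reward grows as C nears the boundary
rLocalA = @(dAC,uA) -(w2*dAC + w3*norm(uA)^2);        % penalty: distance to C and action size
rLocalB = @(dBC,uB) -(w2*dBC + w3*norm(uB)^2);
rA = @(dC,dAC,uA) rGlobal(dC) + rLocalA(dAC,uA);      % total reward for agent A
rB = @(dC,dBC,uB) rGlobal(dC) + rLocalB(dBC,uB);      % total reward for agent B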

This example uses proximal policy optimization (PPO) agents with a discrete action space. To learn more about PPO agents, see Proximal Policy Optimization Agents. The agents apply external forces to the robots, which causes them to move. At each time step, each agent selects an action $u_{A,B} = [F_X, F_Y]$, where $F_X$ and $F_Y$ each take one of the following external force values.
$F_X, F_Y \in \{-1,\ 0,\ 1\}$ N, giving the nine $[F_X, F_Y]$ pairs listed in the action specification below.

Create the environment

To create a multi-agent environment, specify the block paths of the agents using a string array. In addition, specify the observation and action specification objects using cell arrays. The order of the specification objects in each cell array must match the order specified in the block path array. When agents are available in the MATLAB workspace at the time the environment is created, the observation and action specification arrays are optional. For more information on creating multi-agent environments, see rlSimulinkEnv.

Create the I/O specifications for the environment. In this example, the agents are homogeneous and have the same I/O specifications.

% Number of observations
numObs = 16;

% Number of actions
numAct = 2;

% Maximum value of externally applied force (N)
maxF = 1.0;

% I/O specifications for each agent
oinfo = rlNumericSpec([numObs,1]);
ainfo = rlFiniteSetSpec({
    [-maxF -maxF]
    [-maxF  0   ]
    [-maxF  maxF]
    [ 0    -maxF]
    [ 0     0   ]
    [ 0     maxF]
    [ maxF -maxF]
    [ maxF  0   ]
    [ maxF  maxF]});
oinfo.Name = 'observations';
ainfo.Name = 'forces';

Create the Simulink environment interface.

blks = ["rlCollaborativeTask/Agent A", "rlCollaborativeTask/Agent B"];
obsInfos = {oinfo,oinfo};
actInfos = {ainfo,ainfo};
env = rlSimulinkEnv(mdl,blks,obsInfos,actInfos);

Specify the reset function of the environment. The reset function resetRobots ensures that the robots start from random initial positions at the beginning of each episode.

env.ResetFcn = @(in) resetRobots(in,RA,RB,RC,boundaryR);
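The helper function resetRobots ships with the example and is not listed in this post. A hedged sketch of what such a reset function could look like is shown below: it draws random initial positions for the two robots inside the ring and writes them into the Simulink.SimulationInput object. The model variable names xA0, yA0, xB0, and yB0 are assumptions for illustration; the shipped function may use different names and logic.

% Hedged sketch of a reset function in the spirit of resetRobots (not the
% shipped implementation). Save as resetRobotsSketch.m.
function in = resetRobotsSketch(in,RA,RB,RC,boundaryR)
    % Robot A: random angle, random distance between the object surface and the ring
    thA = 2*pi*rand;
    dA  = (RC+RA) + ((boundaryR-RA)-(RC+RA))*rand;
    % Robot B: same sampling scheme
    thB = 2*pi*rand;
    dB  = (RC+RB) + ((boundaryR-RB)-(RC+RB))*rand;
    % Assumed model variable names (xA0, yA0, xB0, yB0) -- illustration only
    in = setVariable(in,'xA0',dA*cos(thA));
    in = setVariable(in,'yA0',dA*sin(thA));
    in = setVariable(in,'xB0',dB*cos(thB));
    in = setVariable(in,'yB0',dB*sin(thB));
end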

Create the agents

PPO agents rely on actor and critic representations to learn the optimal policy. In this example, each agent maintains neural network-based function approximators for the actor and the critic.

Create the critic neural networks and representations. The output of the critic network is the state value function $V(s)$ for state $s$.

% Reset the random seed to improve reproducibility
rng(0)

% Critic networks
criticNetwork = [...
    featureInputLayer(oinfo.Dimension(1),'Normalization','none','Name','observation')
    fullyConnectedLayer(128,'Name','CriticFC1','WeightsInitializer','he')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(64,'Name','CriticFC2','WeightsInitializer','he')
    reluLayer('Name','CriticRelu2')
    fullyConnectedLayer(32,'Name','CriticFC3','WeightsInitializer','he')
    reluLayer('Name','CriticRelu3')
    fullyConnectedLayer(1,'Name','CriticOutput')];

% Critic representations
criticOpts = rlRepresentationOptions('LearnRate',1e-4);
criticA = rlValueRepresentation(criticNetwork,oinfo,'Observation',{'observation'},criticOpts);
criticB = rlValueRepresentation(criticNetwork,oinfo,'Observation',{'observation'},criticOpts);

The output of the actor network is the probability $\pi(a \mid s)$ of taking each possible action pair when in state $s$. Create the actor neural networks and representations.

% Actor networks
actorNetwork = [...
    featureInputLayer(oinfo.Dimension(1),'Normalization','none','Name','observation')
    fullyConnectedLayer(128,'Name','ActorFC1','WeightsInitializer','he')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(64,'Name','ActorFC2','WeightsInitializer','he')
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(32,'Name','ActorFC3','WeightsInitializer','he')
    reluLayer('Name','ActorRelu3')
    fullyConnectedLayer(numel(ainfo.Elements),'Name','Action')
    softmaxLayer('Name','SM')];

% Actor representations
actorOpts = rlRepresentationOptions('LearnRate',1e-4);
actorA = rlStochasticActorRepresentation(actorNetwork,oinfo,ainfo,...
    'Observation',{'observation'},actorOpts);
actorB = rlStochasticActorRepresentation(actorNetwork,oinfo,ainfo,...
    'Observation',{'observation'},actorOpts);

Create the agents. Both agents use the same options.

agentOptions = rlPPOAgentOptions(...
    'ExperienceHorizon',256,...
    'ClipFactor',0.125,...
    'EntropyLossWeight',0.001,...
    'MiniBatchSize',64,...
    'NumEpoch',3,...
    'AdvantageEstimateMethod','gae',...
    'GAEFactor',0.95,...
    'SampleTime',Ts,...
    'DiscountFactor',0.9995);
agentA = rlPPOAgent(actorA,criticA,agentOptions);
agentB = rlPPOAgent(actorB,criticB,agentOptions);

During training, the agents collect experiences until they reach the experience horizon of 256 steps or the episode terminates, and then train on mini-batches of 64 experiences. This example uses an objective function clip factor of 0.125 to improve training stability and a discount factor of 0.9995 to encourage long-term rewards.

Train the agents

Specify the following training options to train the agents.

  1. Train for at most 1000 episodes, with each episode lasting at most 5000 time steps.

  2. Stop training an agent when its average reward over 100 consecutive episodes is –10 or higher.

maxEpisodes = 1000;
maxSteps = 5e3;
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxEpisodes,...
    'MaxStepsPerEpisode',maxSteps,...
    'ScoreAveragingWindowLength',100,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-10);

To train multiple agents, pass an array of agents to the train function. The order of the agents in the array must match the order of the agent block paths specified during environment creation. Doing so ensures that each agent object is linked to its corresponding I/O interface in the environment. Training these agents can take several hours to complete, depending on the available computing power.

The MAT file rlCollaborativeTaskAgents.mat contains a set of pretrained agents. You can load the file and view the agents' performance. To train the agents yourself, set doTraining to true.

doTraining = false;
if doTraining
    stats = train([agentA, agentB],env,trainOpts);
else
    load('rlCollaborativeTaskAgents.mat');
end

The figure below shows a screenshot of the training progress. Due to the randomness of the training process, you may get different results.
[Figure: training progress]

Simulate the agents

Simulate the trained agents within the environment.

simOptions = rlSimulationOptions('MaxSteps',maxSteps);
exp = sim(env,[agentA agentB],simOptions);
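The output exp contains the recorded experiences. Assuming it follows the usual Reinforcement Learning Toolbox experience format (one structure per agent, with timeseries-valued fields such as Reward), the totals can be inspected as sketched below; the exact layout of the output is an assumption here, so check exp in your workspace.

% Hedged sketch: inspect the simulation output, assuming one experience
% structure per agent with a timeseries-valued Reward field
totalRewardA = sum(exp(1).Reward.Data);    % cumulative reward of agent A
totalRewardB = sum(exp(2).Reward.Data);    % cumulative reward of agent B
fprintf('Agent A: %.2f, Agent B: %.2f total reward\n',totalRewardA,totalRewardB);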

For more information about agent simulation, see rlSimulationOptions and sim .

Source: blog.csdn.net/wangyifan123456zz/article/details/109614456