MATLAB Reinforcement Learning in Practice (14): Motion Control of a Quadruped Robot Using a DDPG Agent


This example shows how to train a quadruped robot to walk using a Deep Deterministic Policy Gradient (DDPG) agent. The robot in this example is modeled with Simscape™ Multibody™. For more information about DDPG agents, see Deep Deterministic Policy Gradient (DDPG) Agents.

Load the necessary parameters into the MATLAB base workspace.

initializeRobotParameters

Quadruped robot model

The environment of this example is a quadruped robot, and the goal of training is to make the robot walk in a straight line with the minimum control force.

The robot is modeled using Simscape Multibody and the Simscape Multibody Contact Forces Library. The main structural components are four legs and a torso. The legs are connected to the torso through revolute joints. The action values provided by the RL Agent block are scaled and converted into joint torque values, which the revolute joints then use to compute the motion.

Open the model.

mdl = 'rlQuadrupedRobot';
open_system(mdl)

Observations

The robot environment provides 44 observations to the agent, each normalized between -1 and 1. These observations are:

  1. Y (vertical) and Z (lateral) position of the torso center of mass
  2. Quaternion representing the orientation of the torso
  3. X (forward), Y (vertical), and Z (lateral) velocities of the torso center of mass
  4. Roll, pitch, and yaw angular velocities of the torso
  5. Angular positions and velocities of the hip and knee joints of each leg
  6. Normal and friction forces for each leg due to ground contact
  7. Action values (torque for each joint) from the previous time step

For all four legs, the initial values of the hip and knee joint angles are set to -0.8234 and 1.6468 rad, respectively. The neutral position of the joints is at 0 rad; a leg is in the neutral position when it is fully extended and perpendicular to the ground.
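
These initial angles are presumably set up by the initializeRobotParameters script called earlier, whose contents are not shown in this post. As a minimal sketch, they could be defined in the base workspace as follows (the variable names here are hypothetical):

init_angle_hip  = -0.8234;   % initial hip joint angle (rad), same for all four legs
init_angle_knee =  1.6468;   % initial knee joint angle (rad), same for all four legs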

Actions

The agent generates 8 actions, normalized between -1 and 1. After multiplication by a scaling factor, these correspond to the torque signals for the eight revolute joints. The torque limit for each joint is ±10 N·m.
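
As a minimal sketch of this scaling (the actual scaling is implemented inside the Simulink model; the variable names and example values below are illustrative):

maxTorque   = 10;                                     % torque limit per joint (N*m)
action      = [0.2 -0.5 0.1 0.0 0.7 -0.3 0.4 -0.1]';  % example normalized actions in [-1, 1]
jointTorque = maxTorque*action;                       % torque commands for the eight revolute joints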

Reward

At each time step during training, the agent receives the following reward. This reward function encourages the agent to move forward by providing a positive reward for positive forward velocity. It also encourages the agent to avoid early termination by providing a constant reward (25·T_s/T_f) at each time step. The remaining terms in the reward function are penalties that discourage undesirable states, such as large deviations from the desired height and orientation, or excessive use of joint torque. A MATLAB sketch of this reward computation follows the symbol definitions below.
r_t = v_x + 25·(T_s/T_f) - 50·ŷ² - 20·θ² - 0.02·Σ_i (u^i_{t-1})²

where:

  1. v_x is the velocity of the torso center of mass in the x (forward) direction.

  2. T_s and T_f are the sample time and final simulation time of the environment, respectively.

  3. ŷ is the scaled height error between the torso center of mass and the desired height of 0.75 m.

  4. θ is the pitch angle of the torso.

  5. u^i_{t-1} is the action value of joint i from the previous time step.
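
As a rough MATLAB sketch of this reward computation (the values below are placeholders for illustration; in the example, Ts and Tf are set by initializeRobotParameters and the signals come from the Simulink model):

vx    = 0.5;                  % example forward velocity of the torso center of mass (m/s)
Ts    = 0.025;                % example environment sample time (s)
Tf    = 20;                   % example final simulation time (s)
yHat  = 0.02;                 % example scaled height error
theta = 0.05;                 % example pitch angle (rad)
uPrev = 0.1*ones(8,1);        % example action values from the previous time step

reward = vx + 25*Ts/Tf - 50*yHat^2 - 20*theta^2 - 0.02*sum(uPrev.^2);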

Episode Termination

During training or simulation, the episode terminates if any of the following conditions occur (a sketch of these checks follows the list).

  1. The height of the torso center of mass above the ground is less than 0.5 m (the robot falls).

  2. The head or tail of the torso is below the ground.

  3. Any knee joint is below the ground.

  4. The roll, pitch, or yaw angle is outside the bounds of ±0.1745, ±0.1745, and ±0.3491 rad, respectively.
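
The following is a minimal sketch of how these termination checks might be expressed; in the example, this logic is implemented inside the Simulink model, and the signal names below are hypothetical:

zTorso = 0.72;  zHead = 0.70;  zTail = 0.70;   % example heights above the ground (m)
zKnee  = [0.30 0.30 0.30 0.30];                % example knee joint heights (m)
roll = 0.02;  pitch = -0.05;  yaw = 0.10;      % example torso orientation angles (rad)

isDone = (zTorso < 0.5) ...                    % torso center of mass drops below 0.5 m
    || (zHead < 0) || (zTail < 0) ...          % head or tail of the torso below the ground
    || any(zKnee < 0) ...                      % any knee joint below the ground
    || abs(roll)  > 0.1745 ...                 % roll limit exceeded
    || abs(pitch) > 0.1745 ...                 % pitch limit exceeded
    || abs(yaw)   > 0.3491;                    % yaw limit exceeded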

Create environment interface

Specify the parameters of the observation set.

numObs = 44;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';

Specify the parameters of the action set.

numAct = 8;
actInfo = rlNumericSpec([numAct 1],'LowerLimit',-1,'UpperLimit', 1);
actInfo.Name = 'torque';

Create the environment interface using the Simulink model and the RL Agent block.

blk = [mdl, '/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);

During training, the reset function introduces random deviations into the initial joint angles and angular velocities.

env.ResetFcn = @quadrupedResetFcn;
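
The quadrupedResetFcn helper itself is not listed in this post. A minimal sketch of what such a reset function could look like is shown below; the model variable names it perturbs are assumptions for illustration:

function in = exampleResetFcn(in)
% Hypothetical sketch of a reset function that randomizes the initial joint state.
% The actual quadrupedResetFcn (and the model variable names it sets) may differ.
hip0  = -0.8234 + 0.05*(2*rand - 1);          % hip angle with a small random deviation (rad)
knee0 =  1.6468 + 0.05*(2*rand - 1);          % knee angle with a small random deviation (rad)
in = setVariable(in,'init_angle_hip',hip0);   % hypothetical model parameter names
in = setVariable(in,'init_angle_knee',knee0);
end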

Create a DDPG agent

The DDPG agent approximates the long-term reward, given observations and actions, by using a critic value function representation. The agent also uses an actor representation to decide which action to take given the observations. The actor and critic networks in this example are inspired by [2].

For more information on creating deep neural network value function representations, see Create Policy and Value Function Representations. For an example that creates neural networks for a DDPG agent, see Train DDPG Agent to Control Double Integrator System.

Use the createNetworks helper function to create networks in the MATLAB workspace.

createNetworks

You can also create your actor and critic networks interactively using the Deep Network Designer app.
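
The createNetworks helper defines the actual architectures, which are not reproduced here. As a rough sketch of how a DDPG critic and actor for this observation and action space might be assembled (layer sizes and names below are illustrative, not the example's actual networks):

% Critic: approximates Q(s,a) with separate observation and action paths merged by addition.
obsPath = [
    featureInputLayer(numObs,'Name','observation')
    fullyConnectedLayer(128,'Name','obsFC')];
actPath = [
    featureInputLayer(numAct,'Name','action')
    fullyConnectedLayer(128,'Name','actFC')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','QValue')];
criticNetwork = layerGraph(obsPath);
criticNetwork = addLayers(criticNetwork,actPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'obsFC','add/in1');
criticNetwork = connectLayers(criticNetwork,'actFC','add/in2');

% Actor: maps observations to normalized actions in [-1, 1].
actorNetwork = [
    featureInputLayer(numObs,'Name','observation')
    fullyConnectedLayer(128,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(numAct,'Name','fc2')
    tanhLayer('Name','tanh')];

% Wrap the networks in critic and actor representations.
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'action'});
actor  = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'tanh'});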

Check the critic network configuration.

plot(criticNetwork)

Specify the agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.MiniBatchSize = 250;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.NoiseOptions.MeanAttractionConstant = 0.15;
agentOptions.NoiseOptions.Variance = 0.1;

Create an rlDDPGAgent object for the agent.

agent = rlDDPGAgent(actor,critic,agentOptions);

Specify training options

To train the agent, first specify the following training options:

  1. Run each training session for at most 10,000 episodes, with each episode lasting at most maxSteps time steps.

  2. Display the training progress in the Episode Manager dialog box (set the Plots option) and enable the command-line display (set the Verbose option).

  3. Stop training when the agent's average cumulative reward over 250 consecutive episodes exceeds 190.

  4. Save a copy of the agent for each episode with a cumulative reward greater than 200.

maxEpisodes = 10000;
maxSteps = floor(Tf/Ts);  
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxEpisodes,...
    'MaxStepsPerEpisode',maxSteps,...
    'ScoreAveragingWindowLength',250,...
    'Verbose',true,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',190,...                   
    'SaveAgentCriteria','EpisodeReward',... 
    'SaveAgentValue',200);                 

To train the agent in parallel, specify the following training options. Parallel training requires Parallel Computing Toolbox™ software. If you do not have Parallel Computing Toolbox™ installed, set UseParallel to false.

  1. Set the UseParallel option to true.

  2. Train the agent in parallel and asynchronously.

  3. Have each worker send experiences to the host after every 32 steps.

  4. DDPG agents require workers to send experiences to the host.

trainOpts.UseParallel = true;                    
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';

Train the agent

Train the agent using the train function. Because the robot model is complex, this process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to the randomness of parallel training, you can expect different training results from the plot below.

doTraining = false;
if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load a pretrained agent for the example.
    load('rlQuadrupedAgent.mat','agent')
end

[Figure: training progress in the Episode Manager]

Agent simulation

Fix the random generator seed for reproducibility.

rng(0)

To verify the performance of the trained agent, simulate it in the robot environment. For more information about agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',maxSteps);
experience = sim(env,agent,simOptions);

For examples showing how to train a DDPG agent to walk a biped robot and a humanoid walking robot modeled in Simscape Multibody, see Train Biped Robot to Walk Using Reinforcement Learning Agents and Train a Humanoid Walker, respectively.

References

[1] Heess, Nicolas, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, et al. ‘Emergence of Locomotion Behaviours in Rich Environments’. ArXiv:1707.02286 [Cs], 10 July 2017. https://arxiv.org/abs/1707.02286.

[2] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. ‘Continuous Control with Deep Reinforcement Learning’. ArXiv:1509.02971 [Cs, Stat], 5 July 2019. https://arxiv.org/abs/1509.02971.
