MATLAB Reinforcement Learning Toolbox (5): Create a MATLAB Environment Using Custom Functions

This example shows how to create an environment by providing custom dynamic functions in MATLAB®.

Using the rlFunctionEnv function, you can create a MATLAB reinforcement learning environment from an observation specification, an action specification, and user-defined step and reset functions. You can then train reinforcement learning agents in this environment. The required step and reset functions are already defined for this example.

Creating an environment with custom functions is useful for environments with relatively simple dynamics, environments with no special visualization requirements, or environments with interfaces to third-party libraries. For more complex environments, you can create an environment object using a template class instead.
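For the template-class route, the toolbox provides the rlCreateEnvTemplate function, which generates a skeleton class file for you to fill in. A minimal sketch (the class name "MyEnvironment" is just a placeholder):

% Generate MyEnvironment.m, a skeleton environment class, in the current folder.
rlCreateEnvTemplate("MyEnvironment");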

Cart-Pole MATLAB environment

The cart-pole environment is a pole attached to an unactuated joint on a cart, which moves along a frictionless track. The training goal is to make the pole stand upright without falling over.
[Figure: diagram of the cart-pole system]
For this environment:

  1. The upright balanced pendulum position is 0 radians; the downward hanging position is π radians.
  2. The pendulum starts upright, with an initial angle between -0.05 and 0.05 radians.
  3. The force action signal from the agent to the environment is either -10 N or 10 N.
  4. The observations from the environment are the cart position, cart velocity, pendulum angle, and pendulum angle derivative.
  5. The episode terminates if the pole is more than 12 degrees from vertical, or if the cart moves more than 2.4 m from its original position.
  6. A reward of +1 is provided for every time step that the pole remains upright. A penalty of -10 is applied when the pendulum falls.

Observation and action specifications

The observations from the environment are the cart position, cart velocity, pendulum angle, and pendulum angle derivative.

ObservationInfo = rlNumericSpec([4 1]);
ObservationInfo.Name = 'CartPole States';
ObservationInfo.Description = 'x, dx, theta, dtheta';
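The observation vector is unbounded by default. If you want the specification to carry explicit bounds, rlNumericSpec exposes LowerLimit and UpperLimit properties. The limits below are a sketch that mirrors the termination thresholds; this example does not require them:

% Optional: bound the observations (assumed limits matching the thresholds).
ObservationInfo.LowerLimit = [-2.4; -Inf; -12*pi/180; -Inf];
ObservationInfo.UpperLimit = [ 2.4;  Inf;  12*pi/180;  Inf];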

The environment has a discrete action space in which the agent can apply one of two possible force values to the cart: -10 N or 10 N.

ActionInfo = rlFiniteSetSpec([-10 10]);
ActionInfo.Name = 'CartPole Action';

Create environment with function name

To define a custom environment, first specify custom step and reset functions. These functions must be in the current working folder or on the MATLAB path.

The custom reset function sets the default state of the environment. This function must have the following signature.

 [InitialObservation,LoggedSignals] = myResetFunction()

In MATLAB, you can open the example folder to view these files, or create your own .m files.
To pass information from one step to the next, such as the environment state, use LoggedSignals. For this example, LoggedSignals contains the state of the cart-pole environment: the position and velocity of the cart, the pendulum angle, and the pendulum angle derivative. The reset function sets the pendulum angle to a random value each time the environment is reset.

For this example, use the custom reset function defined in myResetFunction.m.

type myResetFunction.m

The following content appears

function [InitialObservation, LoggedSignal] = myResetFunction()
% Reset function to place custom cart-pole environment into a random
% initial state.
% Theta (randomize)
T0 = 2 * 0.05 * rand() - 0.05;
% Thetadot
Td0 = 0;
% X
X0 = 0;
% Xdot
Xd0 = 0;
% Return initial environment state variables as logged signals.
LoggedSignal.State = [X0;Xd0;T0;Td0];
InitialObservation = LoggedSignal.State;
end
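As a quick sanity check (a sketch, assuming myResetFunction.m is on the path), you can call the reset function directly:

% Call the reset function on its own and inspect the initial observation.
[obs0,logged0] = myResetFunction();
disp(obs0)   % 4x1 state vector with a small random pole angle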

The custom step function specifies how the environment advances to the next state according to the given operation. This function must have the following signature.

[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)

To obtain the new state, the environment applies the dynamic equations to the current state stored in LoggedSignals, much like supplying an initial condition to a differential equation, and advances it by one sample time using forward Euler integration. The new state is stored in LoggedSignals and returned as an output.

For this example, use the custom step function defined in myStepFunction.m. To keep the implementation simple, this function redefines physical constants, such as the cart mass, at every time step.

type myStepFunction.m

The following content appears

function [NextObs,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
% Custom step function to construct cart-pole environment for the function
% name case.
%
% This function applies the given action to the environment and evaluates
% the system dynamics for one simulation step.
% Define the environment constants.
% Acceleration due to gravity in m/s^2
Gravity = 9.8;
% Mass of the cart
CartMass = 1.0;
% Mass of the pole
PoleMass = 0.1;
% Half the length of the pole
HalfPoleLength = 0.5;
% Max force the input can apply
MaxForce = 10;
% Sample time
Ts = 0.02;
% Pole angle at which to fail the episode
AngleThreshold = 12 * pi/180;
% Cart distance at which to fail the episode
DisplacementThreshold = 2.4;
% Reward each time step the cart-pole is balanced
RewardForNotFalling = 1;
% Penalty when the cart-pole fails to balance
PenaltyForFalling = -10;
% Check if the given action is valid.
if ~ismember(Action,[-MaxForce MaxForce])
    error('Action must be %g for going left and %g for going right.',...
        -MaxForce,MaxForce);
end
Force = Action;
% Unpack the state vector from the logged signals.
State = LoggedSignals.State;
XDot = State(2);
Theta = State(3);
ThetaDot = State(4);
% Cache to avoid recomputation.
CosTheta = cos(Theta);
SinTheta = sin(Theta);
SystemMass = CartMass + PoleMass;
temp = (Force + PoleMass*HalfPoleLength*ThetaDot*ThetaDot*SinTheta)/SystemMass;
% Apply motion equations.
ThetaDotDot = (Gravity*SinTheta - CosTheta*temp) / ...
    (HalfPoleLength*(4.0/3.0 - PoleMass*CosTheta*CosTheta/SystemMass));
XDotDot = temp - PoleMass*HalfPoleLength*ThetaDotDot*CosTheta/SystemMass;
% Perform Euler integration.
LoggedSignals.State = State + Ts.*[XDot;XDotDot;ThetaDot;ThetaDotDot];
% Transform state to observation.
NextObs = LoggedSignals.State;
% Check terminal condition.
X = NextObs(1);
Theta = NextObs(3);
IsDone = abs(X) > DisplacementThreshold || abs(Theta) > AngleThreshold;
% Get reward.
if ~IsDone
Reward = RewardForNotFalling;
else
Reward = PenaltyForFalling;
end
end
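You can also exercise the step function directly before wiring it into an environment (a sketch, assuming both files are on the path):

% Reset to get a valid LoggedSignals struct, then take one manual step.
[~,logged] = myResetFunction();
[obs1,reward1,done1,logged] = myStepFunction(10,logged);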

Construct the custom environment using the defined observation specification, action specification, and function names.

env = rlFunctionEnv(ObservationInfo,ActionInfo,'myStepFunction','myResetFunction');

To verify the operation of the environment, rlFunctionEnv automatically calls validateEnvironment after the environment is created.
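You can also call validateEnvironment manually at any time; it runs a short reset/step cycle and throws an error if the outputs do not match the specifications:

% Optional manual check; errors if the environment violates its specifications.
validateEnvironment(env)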

Use function handles to create an environment

You can also define custom functions that take additional input arguments beyond the minimum required set. For example, to pass the extra arguments arg1 and arg2 to the step and reset functions, use the following code.

[InitialObservation,LoggedSignals] = myResetFunction(arg1,arg2)
[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals,arg1,arg2)

To use these functions with rlFunctionEnv, you must use anonymous function handles.

ResetHandle = @()myResetFunction(arg1,arg2);
StepHandle = @(Action,LoggedSignals) myStepFunction(Action,LoggedSignals,arg1,arg2);

Using additional input arguments can make the environment implementation more efficient. For example, myStepFunction2.m contains a custom step function that takes the environment constants as an input argument (envConstants). This way, the function avoids redefining the environment constants at every step.

type myStepFunction2.m

The following content appears

function [NextObs,Reward,IsDone,LoggedSignals] = myStepFunction2(Action,LoggedSignals,EnvConstants)
% Custom step function to construct cart-pole environment for the function
% handle case.
%
% This function applies the given action to the environment and evaluates
% the system dynamics for one simulation step.
% Check if the given action is valid.
if ~ismember(Action,[-EnvConstants.MaxForce EnvConstants.MaxForce])
    error('Action must be %g for going left and %g for going right.',...
        -EnvConstants.MaxForce,EnvConstants.MaxForce);
end
Force = Action;
% Unpack the state vector from the logged signals.
State = LoggedSignals.State;
XDot = State(2);
Theta = State(3);
ThetaDot = State(4);
% Cache to avoid recomputation.
CosTheta = cos(Theta);
SinTheta = sin(Theta);
SystemMass = EnvConstants.MassCart + EnvConstants.MassPole;
temp = (Force + EnvConstants.MassPole*EnvConstants.Length*ThetaDot*ThetaDot*SinTheta)/SystemMass;
% Apply motion equations.
ThetaDotDot = (EnvConstants.Gravity*SinTheta - CosTheta*temp)...
    /(EnvConstants.Length*(4.0/3.0 - EnvConstants.MassPole*CosTheta*CosTheta/SystemMass));
XDotDot = temp - EnvConstants.MassPole*EnvConstants.Length*ThetaDotDot*CosTheta/SystemMass;
% Perform Euler integration.
LoggedSignals.State = State + EnvConstants.Ts.*[XDot;XDotDot;ThetaDot;ThetaDotDot];
% Transform state to observation.
NextObs = LoggedSignals.State;
% Check terminal condition.
X = NextObs(1);
Theta = NextObs(3);
IsDone = abs(X) > EnvConstants.XThreshold || abs(Theta) > EnvConstants.ThetaThresholdRadians;
% Get reward.
if ~IsDone
Reward = EnvConstants.RewardForNotFalling;
else
Reward = EnvConstants.PenaltyForFalling;
end
end

Create a structure that contains the environment constants.

% Acceleration due to gravity in m/s^2
envConstants.Gravity = 9.8;
% Mass of the cart
envConstants.MassCart = 1.0;
% Mass of the pole
envConstants.MassPole = 0.1;
% Half the length of the pole
envConstants.Length = 0.5;
% Max force the input can apply
envConstants.MaxForce = 10;
% Sample time
envConstants.Ts = 0.02;
% Angle at which to fail the episode
envConstants.ThetaThresholdRadians = 12 * pi/180;
% Distance at which to fail the episode
envConstants.XThreshold = 2.4;
% Reward each time step the cart-pole is balanced
envConstants.RewardForNotFalling = 1;
% Penalty when the cart-pole fails to balance
envConstants.PenaltyForFalling = -5;

Create an anonymous function handle to the custom step function, passing envConstants as an additional input argument. Because envConstants is available when StepHandle is created, the function handle captures those values. The values are preserved inside the function handle even if you clear the variable.

StepHandle = @(Action,LoggedSignals) myStepFunction2(Action,LoggedSignals,envConstants);
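As a quick illustration of that capture (a sketch; the zero state below is just a placeholder), the handle can be called on its own, with no reference to envConstants in the calling code:

% The handle carries its own copy of envConstants, so this works standalone.
[obsTest,~,~,~] = StepHandle(10,struct('State',zeros(4,1)));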

Use the same reset function as before, specifying it as a function handle rather than by name.

ResetHandle = @myResetFunction;

Create the environment using the custom function handles.

env2 = rlFunctionEnv(ObservationInfo,ActionInfo,StepHandle,ResetHandle);

Verify custom functions

Before you train an agent in your environment, it is best practice to verify the behavior of the custom functions. To do so, you can initialize the environment with the reset function and run one simulation step with the step function. For reproducibility, set the random generator seed before verification.

Verify the environment created with the function name.

rng(0);
InitialObs = reset(env)

InitialObs =

         0
         0
    0.0315
         0

[NextObs,Reward,IsDone,LoggedSignals] = step(env,10);
NextObs

NextObs =

         0
    0.1947
    0.0315
   -0.2826
Verify the environment created with the function handle.

rng(0);
InitialObs2 = reset(env2)

InitialObs2 =

         0
         0
    0.0315
         0

[NextObs2,Reward2,IsDone2,LoggedSignals2] = step(env2,10);
NextObs2

NextObs2 =

         0
    0.1947
    0.0315
   -0.2826
Both environments initialize and simulate successfully, producing the same state values in NextObs.
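Beyond a single step, a short random rollout is a cheap way to exercise the termination logic before training (a sketch; the random-action policy and the 100-step cap are arbitrary choices, not part of the original example):

% Roll out a random policy until the episode terminates or 100 steps elapse.
rng(0);
obs = reset(env);
for k = 1:100
    action = ActionInfo.Elements(randi(2));   % pick -10 or 10 at random
    [obs,reward,isdone] = step(env,action);
    if isdone
        fprintf('Episode ended after %d steps.\n',k);
        break
    end
end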

Origin blog.csdn.net/wangyifan123456zz/article/details/109472901