MATLAB Reinforcement Learning Toolbox (13): Create Policy and Value Function Representations

A reinforcement learning policy is a mapping that selects the action the agent takes based on observations of the environment. During training, the agent tunes the parameters of its policy representation to maximize the expected cumulative long-term reward.

Reinforcement learning agents use function approximators called actor and critic representations to estimate the policy and value functions. The actor represents the policy that selects the best action based on the current observation. The critic represents the value function that estimates the expected cumulative long-term reward of the current policy.

Before creating an agent, you must create the required actor and critic representations using a deep neural network, a linear basis function, or a lookup table. The type of function approximator to use depends on your application.

For more information about agents, see Reinforcement Learning Agents.

Actor and critic representations

The Reinforcement Learning Toolbox™ software supports the following representations:

  1. V(S|θ_V) - Critics that estimate the expected cumulative long-term reward based on a given observation S. You can create these critics using rlValueRepresentation.
  2. Q(S,A|θ_Q) - Critics that estimate the expected cumulative long-term reward for a given observation S and action A. You can create these critics using rlQValueRepresentation.
  3. Q(S|θ_Q) - Multi-output critics that estimate the expected cumulative long-term reward of every possible discrete action A_i for a given observation S. You can create these critics using rlQValueRepresentation.
  4. μ(S|θ_μ) - Actors that select an action based on a given observation S. You can create these actors using rlDeterministicActorRepresentation or rlStochasticActorRepresentation.

Each representation uses a function approximator with a corresponding set of parameters (θ_V, θ_Q, θ_μ), which are computed during the learning process.
For systems with a limited number of discrete observations and discrete actions, the value function can be stored in a lookup table. For systems that have many discrete observations and actions, or that have continuous observation and action spaces, storing observations and actions this way is impractical. For such systems, you can represent the actors and critics using deep neural networks or custom (linear in the parameters) basis functions.

The following table summarizes the four types of representation available in the Reinforcement Learning Toolbox software, depending on the action and observation spaces of your environment and on the approximator and agent that you plan to use.
[Table: supported representations by observation space, action space, approximator, and agent type]

Table approximator

Lookup table representations are suitable for environments with a limited number of discrete observations and actions. You can create two types of lookup table representations:

  1. Value tables, which store rewards for the corresponding observations

  2. Q tables, which store rewards for the corresponding observation-action pairs

To create a table representation, first create a value table or Q table using the rlTable function. Then create a representation for the table using an rlValueRepresentation or rlQValueRepresentation object. To configure the learning rate and optimizer used by the representation, use an rlRepresentationOptions object.
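As a minimal sketch of this workflow, the following code creates a tabular Q-value critic for an environment env whose observation and action spaces are assumed to be finite (obsInfo and actInfo are assumed to be rlFiniteSetSpec objects; the learning rate value is illustrative).

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

qTable = rlTable(obsInfo,actInfo);                  % one entry per observation-action pair
tableOpts = rlRepresentationOptions('LearnRate',1); % learning rate for the table entries (illustrative)
critic = rlQValueRepresentation(qTable,obsInfo,actInfo,tableOpts);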

Deep Neural Network Approximator

You can create actor and critic function approximators using deep neural networks. To do so, use the Deep Learning Toolbox™ software.

The input and output dimensions of the network

The dimensions of your actor and critic networks must match the corresponding action and observation specifications of the training environment object. To obtain the action and observation dimensions for the environment env, use the getActionInfo and getObservationInfo functions, respectively. Then access the Dimensions property of the specification objects.

actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;

obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;

The network for a value function critic (such as those used in AC, PG, or PPO agents) must take only observations as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see rlValueRepresentation.
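For example, a minimal value function critic might look like the following sketch, assuming an observation size of [4 1 1]; the layer names are illustrative and obsInfo comes from getObservationInfo(env).

valueNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1,'Name','StateValue')];    % single scalar output

critic = rlValueRepresentation(valueNetwork,obsInfo,'Observation',{'observation'});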

The network for a single-output Q-value function critic (such as those used in Q, DQN, SARSA, DDPG, TD3, and SAC agents) must take both observations and actions as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment specifications for both observations and actions. For more information, see rlQValueRepresentation.

The network for a multi-output Q-value function critic (such as those used in Q, DQN, and SARSA agents) takes only observations as inputs and must have a single output layer whose output size equals the number of discrete actions. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see rlQValueRepresentation.
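A minimal multi-output Q-value critic might look like the following sketch, assuming an observation size of [4 1 1] and a discrete action specification actInfo with two possible actions; the layer names are illustrative.

criticNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(2,'Name','QValues')];       % one Q-value output per discrete action

critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'observation'});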

For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications.

  1. The network used in an actor with a discrete action space (such as the actors in PG, AC, and PPO agents) must have a single output layer with an output size equal to the number of possible discrete actions.

  2. The network used in a deterministic actor with a continuous action space (such as the actors in DDPG and TD3 agents) must have a single output layer whose output size matches the dimension of the action space defined in the environment action specification. A minimal sketch of such a network appears after this list.

  3. The network used in a stochastic actor with a continuous action space (such as the actors in PG, AC, PPO, and SAC agents) must have a single output layer whose output size is twice the dimension of the action space defined in the environment action specification. These networks must have two separate paths: the first produces the mean values (which must be scaled to the output range), and the second produces the standard deviations (which must be nonnegative).
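For instance, a deterministic actor for a one-dimensional continuous action bounded in [-2, 2] might look like the following sketch. The observation size [4 1 1], the action range, and the layer names are assumptions for illustration.

actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(1,'Name','ActorFC2')        % one output per action element
    tanhLayer('Name','ActorTanh')                   % bound the output to [-1, 1]
    scalingLayer('Name','ActorScaling','Scale',2)]; % rescale to the assumed action range [-2, 2]

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'ActorScaling'});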

Building a deep neural network

A deep neural network consists of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a complete list of available layers, see List of Deep Learning Layers.
[Table: common deep learning layers used in reinforcement learning applications]
The Reinforcement Learning Toolbox software does not support the bilstmLayer or batchNormalizationLayer layers.

The Reinforcement Learning Toolbox software also provides the following layers, which contain no tunable parameters (that is, parameters that change during training).
[Table: Reinforcement Learning Toolbox layers without tunable parameters, including scalingLayer, quadraticLayer, and softplusLayer]

You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers.

For reinforcement learning applications, you build a deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the connectLayers function.

You can also create your deep neural network using the Deep Network Designer app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.

When creating a deep neural network, you must specify a name for the first layer of each input path and the last layer of the output path.

The following code creates and connects these input and output paths:

  1. Observation input path observationPath, whose first layer is named "observation".

  2. Action input path actionPath, whose first layer is named "action".

  3. Estimated value function output path commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named "output".

% Observation input path
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];

% Action input path
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];

% Common output path: add the two input paths, then produce a scalar output
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];

% Assemble the layer graph and connect both input paths to the addition layer
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');

For all observation and action input paths, you must specify an imageInputLayer as the first layer in the path.

You can view the structure of your deep neural network using the plot function.

plot(criticNetwork)

For PG and AC agents, the final output layers of the actor deep neural network representation are a fullyConnectedLayer and a softmaxLayer. When you specify the layers for the network, you must specify the fullyConnectedLayer, and you can optionally specify the softmaxLayer. If you omit the softmaxLayer, the software automatically adds one for you.
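As a sketch, an actor network for a PG or AC agent with a discrete action space might look like the following (an observation size of [4 1 1] and two possible actions are assumed; the layer names are illustrative).

actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','ActorOutput')     % one output per discrete action
    softmaxLayer('Name','ActionProb')];             % optional; added automatically if omitted

actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'observation'});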

Determining the number, type, and size of the layers for your deep neural network representation can be difficult and is application dependent. However, the most critical factor is whether the function approximator can approximate the optimal policy or discounted value function for your application; that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.

When building a network, consider the following tips.

  1. For continuous action spaces, bound the actions with a tanhLayer followed by a scalingLayer, if necessary.

  2. Deep dense networks with reluLayer layers can approximate many different functions well. Therefore, they are usually a good first choice.

  3. Start with the smallest network that you think can approximate the optimal policy or value function.

  4. When approximating strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the approximator to represent more complex functions grows only polynomially with the size of the layers, but grows exponentially with the number of layers. In other words, more layers allow the network to approximate more complex and nonlinear composite functions, although this usually requires more data and longer training times. In contrast, a network with fewer layers can require exponentially more units to approximate the same class of functions, and may fail to learn and generalize correctly.

  5. For on-policy agents (agents that learn only from experience collected while following the current policy), such as AC and PG agents, parallel training works better when your networks are large (for example, a network with two hidden layers of 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume that each worker updates a different part of the network, such as when the workers explore different regions of the observation space. If the network is small, the worker updates can become correlated with one another and make training unstable.

Create and configure representation

To create a critic representation for your deep neural network, use an rlValueRepresentation or rlQValueRepresentation object. To create an actor representation for your deep neural network, use an rlDeterministicActorRepresentation or rlStochasticActorRepresentation object. To configure the learning rate and optimizer used by the representation, use an rlRepresentationOptions object.

For example, create a Q-value representation object for the critic network criticNetwork, specifying a learning rate of 0.0001. When creating the representation, pass the environment action and observation specifications to the rlQValueRepresentation object, and specify the names of the network layers to which the observations and actions are connected (in this case, "observation" and "action").

opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'action'},opt);

When creating a deep neural network and configuring representation objects, consider using the following methods as a starting point.

  1. Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see whether the agent quickly converges to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs per layer. Your goal is to find a network structure that is just large enough, does not learn too fast, and shows signs of learning (an improving trajectory in the reward graph) after an initial training period.

  2. Once you settle on a good network architecture, a low initial learning rate lets you see whether the agent is on the right track and helps you check that the architecture is satisfactory for the problem. A low learning rate makes tuning the parameters much easier, especially for difficult problems.

In addition, consider the following tips when configuring your deep neural network representation.

  1. Be patient with DDPG and DQN agents, since they may not learn anything for a while during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.

  2. For DDPG and DQN agents, promoting the exploration of the agent is critical.

  3. For agents that have both an actor network and a critic network, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.

Recurrent neural network

When creating representations for PPO or DQN agents, you can use recurrent neural networks. These networks are deep neural networks with a sequenceInputLayer input layer and at least one layer that has hidden state information, such as an lstmLayer. They can be especially useful when the environment has states that cannot be included in the observation vector. For more information and examples, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
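A recurrent value function critic might look like the following sketch, assuming a 4-element observation vector; the layer names and the number of hidden units are illustrative.

recurrentNetwork = [
    sequenceInputLayer(4,'Name','observation')
    fullyConnectedLayer(32,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    lstmLayer(8,'OutputMode','sequence','Name','CriticLSTM')  % carries hidden state across time steps
    fullyConnectedLayer(1,'Name','StateValue')];

critic = rlValueRepresentation(recurrentNetwork,obsInfo,'Observation',{'observation'});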

Custom basis function approximator

A custom (linear in the parameters) basis function approximator has the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function that you must create. The learnable parameters of a linear basis function representation are the elements of W.

For value function critic representations (for example, those used in AC, PG, or PPO agents), f is a scalar value, so W must be a column vector of the same length as B, and B must be a function of the observations. For more information, see rlValueRepresentation.
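For example, a value critic with a custom basis function might look like the following sketch, assuming a 4-element observation vector; the basis function and the initial weights are illustrative.

basisFcn = @(obs) [obs(1); obs(2); obs(3); obs(4); obs(1)*obs(2)];  % B(obs), a column vector
W0 = zeros(5,1);                                                    % initial weights, same length as B
critic = rlValueRepresentation({basisFcn,W0},obsInfo);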

For single-output Q-value function critic representations (such as those used in Q, DQN, SARSA, DDPG, TD3, and SAC agents), f is a scalar value, so W must be a column vector of the same length as B, and B must be a function of both the observations and the actions. For more information, see rlQValueRepresentation.

For multi-output Q-value function critic representations with discrete action spaces (such as those used in Q, DQN, and SARSA agents), f is a vector with as many elements as the number of possible actions. Therefore, W must be a matrix with as many columns as the number of possible actions and as many rows as the length of B. B must be a function of only the observations. For more information, see rlQValueRepresentation.

For actors with discrete action spaces (such as those in PG, AC, and PPO agents), f must be a column vector whose length equals the number of possible discrete actions.

  1. For deterministic actors with continuous action spaces (such as the actors in DDPG and TD3 agents), the dimensions of f must match the dimensions of the agent action specification, which is either a scalar or a column vector.

  2. Stochastic actors with continuous action spaces cannot rely on custom basis functions (they can use only neural network approximators, because positivity must be enforced for the standard deviations).

  3. For any actor representation, W must have as many columns as the number of elements in f and as many rows as the number of elements in B. B must be a function of only the observations. For more information, see rlDeterministicActorRepresentation and rlStochasticActorRepresentation. A minimal sketch of a custom basis actor appears after this list.
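As a hedged sketch, a deterministic actor with a custom basis function for a 2-element continuous action might look like the following (a 4-element observation vector is assumed; the basis function and the initial weights are illustrative).

basisFcn = @(obs) [obs(1); obs(2); obs(3); obs(4)];  % B(obs), a column vector
W0 = zeros(4,2);                                     % rows = numel(B(obs)), columns = numel(f)
actor = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo);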

Create an agent or specify agent representations

After you create your actor and critic representations, you can create a reinforcement learning agent that uses them. For example, create a PG agent using previously created actor and baseline (critic) representations.

agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);

For more information about the different types of reinforcement learning agents, see Reinforcement Learning Agents.

You can obtain the actor and critic representations from an existing agent using getActor and getCritic, respectively.

You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. When you use these functions to specify a representation for an existing agent, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
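For example, assuming agent is a previously created agent that has both an actor and a critic, you can retrieve, modify, and reassign its representations as follows.

actor = getActor(agent);    % current actor representation
critic = getCritic(agent);  % current critic representation

% ... adjust the representations as needed, then set them back ...
agent = setActor(agent,actor);
agent = setCritic(agent,critic);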


Source: blog.csdn.net/wangyifan123456zz/article/details/109554380