MATLAB Reinforcement Learning in Practice (1): Overview of Training Reinforcement Learning Agents

Introduction

After creating the environment and the reinforcement learning agent, you can use the train function to train the agent in the environment. To configure the training, use the rlTrainingOptions function. For example, create a training option set opt and train the agent in the environment env.

opt = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',1000,...
    'StopTrainingCriteria',"AverageReward",...
    'StopTrainingValue',480);
trainStats = train(agent,env,opt);

For more information on creating agents, see Reinforcement Learning Agents. For more information on creating an environment, see Creating a MATLAB Environment for Reinforcement Learning and Creating a Simulink Environment for Reinforcement Learning.

Training updates the agent as it progresses. To preserve the original agent parameters for later use, save the agent to a MAT-file.

save("initialAgent.mat","agent")

When the conditions specified in the StopTrainingCriteria and StopTrainingValue options of the rlTrainingOptions object are met, training terminates automatically. To terminate training in progress manually, press Ctrl+C, or click Stop Training in the Reinforcement Learning Episode Manager. Because train updates the agent at each episode, you can resume training by calling train(agent,env,opt) again without losing the parameters learned during the first training session.
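
For example, if you stopped training manually, you can continue from where the agent left off simply by calling train again with the same arguments:

% Resume training; the agent keeps the parameters it has learned so far.
trainStats = train(agent,env,opt);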

Training algorithm

Generally, training performs the following steps.

  1. Initialize the agent.

  2. For each episode:

    a. Reset the environment.

    b. Obtain the initial observation s_0 from the environment.

    c. Compute the initial action a_0 = μ(s_0), where μ(s) is the current policy.

    d. Set the current action to the initial action (a ← a_0) and set the current observation to the initial observation (s ← s_0).

    e. While the episode has not ended or terminated, perform the following steps:

    1. Apply action a to the environment and obtain the next observation s' and the reward r.

    2. Learn from the experience set (s, a, r, s').

    3. Compute the next action a' = μ(s').

    4. Update the current action with the next action (a ← a') and update the current observation with the next observation (s ← s').

    5. If the termination conditions defined in the environment are met, terminate the episode.

  3. If the training termination condition is met, terminate training. Otherwise, begin the next episode.

The details of how the software performs these steps depend on the configuration of the agent and the environment. For example, depending on how you configured the environment, resetting it at the beginning of each episode can include randomizing the initial state values. For more information about agents and their training algorithms, see Reinforcement Learning Agents.
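
The following is only a conceptual MATLAB sketch of this loop for a single agent in a MATLAB environment, not the actual implementation used by train: the learning update is internal to the agent, and maxEpisodes and maxStepsPerEpisode stand in for the corresponding training options.

for episode = 1:maxEpisodes
    obs = reset(env);                        % reset and obtain the initial observation s0
    for stepCount = 1:maxStepsPerEpisode
        act = getAction(agent,{obs});        % a = mu(s), using the current policy
        if iscell(act)
            act = act{1};                    % getAction may return a cell array, depending on release
        end
        [nextObs,reward,isDone] = step(env,act);   % apply a, observe s' and reward r
        % The agent learns from the experience (s,a,r,s') at this point.
        obs = nextObs;                       % s <- s'
        if isDone
            break                            % environment termination condition met
        end
    end
    % train then checks the stop-training criteria before starting the next episode.
end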

Episode Manager

By default, calling the train function opens the Reinforcement Learning Episode Manager, which lets you visualize the training progress. The Episode Manager plot shows the reward for each episode (EpisodeReward) and a running average reward value (AverageReward). In addition, for agents that have critics, the plot shows the critic's estimate of the discounted long-term reward at the start of each episode (EpisodeQ0). The Episode Manager also displays various episode and training statistics. You can also use the train function to return episode and training information.

[Figure: Reinforcement Learning Episode Manager showing EpisodeReward, AverageReward, and EpisodeQ0 during training]
For agents that have a critic, Episode Q0 is the critic's estimate of the discounted long-term reward at the start of each episode, based on the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward, as shown in the figure above.

To turn off the Reinforcement Learning Episode Manager display, set the Plots option of rlTrainingOptions to "none".
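
For example:

opt = rlTrainingOptions('Plots',"none");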

Save candidate agents

During training, you can save candidate agents that meet the conditions specified in the SaveAgentCriteria and SaveAgentValue options of the rlTrainingOptions object. For instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training has not yet been met. For example, save the agent when the episode reward is greater than 100.

opt = rlTrainingOptions('SaveAgentCriteria',"EpisodeReward",'SaveAgentValue',100);

The train function stores the saved agents in MAT-files in the folder specified by the SaveAgentDirectory option of rlTrainingOptions. Saved agents can be useful, for example, for testing candidate agents generated during a long-running training process. For details about the saving criteria and the save location, see rlTrainingOptions.
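
As a sketch, assuming you also set SaveAgentDirectory explicitly, a training run and a later reload of one candidate could look like the following; the file name Agent100.mat is hypothetical, and the variable names stored inside the MAT-file depend on your MATLAB release.

opt = rlTrainingOptions(...
    'SaveAgentCriteria',"EpisodeReward",...
    'SaveAgentValue',100,...
    'SaveAgentDirectory',"savedAgents");
trainStats = train(agent,env,opt);

data = load(fullfile("savedAgents","Agent100.mat"));   % hypothetical file name
disp(fieldnames(data))                                 % inspect the stored variables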

After training is complete, you can use the save function to save the final trained agent from the MATLAB® workspace. For example, save the agent to the file finalAgent.mat in the folder specified by the SaveAgentDirectory option.

save(opt.SaveAgentDirectory + "/finalAgent.mat",'agent')

By default, the experience buffer data is not saved when you save DDPG and DQN agents. If you plan to further train your saved agent, you can start training with the previous experience buffer as a starting point. In this case, set the SaveExperienceBufferWithAgent option to true. For some agents, such as those with large experience buffers and image-based observations, the memory required to save the experience buffer is large. In these cases, you must ensure that enough memory is available for the saved agents.
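
As a minimal sketch, assuming a DQN agent, SaveExperienceBufferWithAgent is set in the agent options (rlDQNAgentOptions here; rlDDPGAgentOptions has the same option), together with the option that keeps the buffer when training resumes:

agentOpts = rlDQNAgentOptions(...
    'SaveExperienceBufferWithAgent',true,...          % store the buffer in the saved agent
    'ResetExperienceBufferBeforeTraining',false);     % reuse the buffer when training resumes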

Parallel Computing

You can speed up the training of agents by running parallel training simulations. If you have Parallel Computing Toolbox™ software, you can run parallel simulations on multi-core computers. If you have MATLAB Parallel Server™ software, you can run parallel simulations on computer clusters or cloud resources.

When you train an agent using parallel computing, the host client sends a copy of the agent and the environment to each parallel worker. Each worker simulates the agent in the environment and sends its simulation data back to the host. The host agent learns from the data sent by the workers and sends the updated policy parameters back to the workers.
[Figure: parallel training, with the host exchanging simulation data and updated policy parameters with the workers]
To create a parallel pool composed of N workers, use the following syntax.

pool = parpool(N);

If you did not use parpool (Parallel Computing Toolbox) to create a parallel pool, the training function will automatically create a parallel pool using the default parallel pool preferences. For more information about specifying these preferences, see Specifying Parallel Preferences (Parallel Computing Toolbox).

For off-policy agents, such as DDPG and DQN agents, do not use all of your cores for parallel training. For example, if your CPU has six cores, use four workers. Doing so provides more resources for the host client to compute gradients from the experiences sent back by the workers. For on-policy agents, such as PG and AC agents, the gradients are computed on the workers, so limiting the number of workers is not necessary.

For more information on configuring your training to use parallel computing, see the UseParallel and ParallelizationOptions options in rlTrainingOptions .
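
For example, something like the following enables parallel training; the asynchronous Mode setting is optional and shown only as an illustration.

opt = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',1000,...
    'UseParallel',true);
opt.ParallelizationOptions.Mode = "async";   % update the host asynchronously
trainStats = train(agent,env,opt);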

To benefit from parallel computing, the computational cost of simulating the environment must be relatively high compared with the cost of optimizing the parameters when experiences are sent back to the host. If the environment simulation is not costly enough, the workers sit idle while waiting for the host to learn and send back updated parameters.

When experiences are sent back from the workers, sample efficiency improves when the ratio R of the environment step complexity to the learning complexity is large. If the environment is fast to simulate (R is small), you are unlikely to gain any benefit from experience-based parallelization. If the environment is expensive to simulate but learning is also expensive (for example, if the mini-batch size is large), you are also unlikely to improve sample efficiency. However, in this case, for off-policy agents you can reduce the mini-batch size to make R larger, thereby improving sample efficiency.
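
For example, as a sketch assuming a DDPG agent, reducing the mini-batch size in the agent options makes each learning step cheaper relative to an environment step (larger R):

agentOpts = rlDDPGAgentOptions('MiniBatchSize',32);   % smaller batch than you would otherwise use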

For an example of using parallel computing to train an agent in MATLAB, see Train AC Agent to Balance Cart-Pole System Using Parallel Computing. For an example of using parallel computing to train an agent in Simulink®, see Train DQN Agent for Lane Keeping Assist Using Parallel Computing.

GPU acceleration

When you use deep neural network function approximators for your actor or critic representations, you can speed up training by performing the representation operations on a GPU instead of the CPU. To do so, set the UseDevice option to "gpu".

opt = rlRepresentationOptions('UseDevice',"gpu");
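
As a minimal sketch, this option set is then passed when the actor or critic representation is created; the network criticNet, the observation specification obsInfo, and the input name 'state' below are hypothetical and must match your own network.

critic = rlValueRepresentation(criticNet,obsInfo,'Observation',{'state'},opt);   % hypothetical critic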

The degree of performance improvement depends on your specific application and network configuration.

Validate the policy after training

To validate your trained agent, you can simulate the agent in the training environment using the sim function. To configure the simulation, use rlSimulationOptions.
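
For example, a short validation run might look like the following; the MaxSteps and NumSimulations values are arbitrary.

simOpts = rlSimulationOptions('MaxSteps',500,'NumSimulations',5);
experiences = sim(env,agent,simOpts);   % one experience structure per simulation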

When validating the agent, consider checking how it handles the following:

  1. Changes to the initial conditions of the simulation. To change the initial conditions of the model, modify the reset function of the environment; a minimal reset-function sketch is shown after this list. For examples of reset functions, see Creating a MATLAB Environment Using Custom Functions, Creating a Custom MATLAB Environment from a Template, and Creating a Simulink Environment for Reinforcement Learning.

  2. Mismatches between the training environment dynamics and the simulation environment dynamics. To check for such mismatches, create test environments in the same way you created the training environment, modifying the environment behavior.
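
The following is a minimal sketch of a custom reset function for a MATLAB environment created with rlFunctionEnv; the two-element state and the LoggedSignals field are hypothetical and must match your own step function.

function [initialObs,loggedSignals] = myResetFunction()
    % Randomize the initial state at the start of each episode.
    state = [0.05*randn; 0];          % hypothetical two-element state
    initialObs = state;
    loggedSignals.State = state;      % passed to the step function on the next step
end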

As with parallel training, if you have Parallel Computing Toolbox software, you can run multiple parallel simulations on a multi-core computer. If you have MATLAB Parallel Server software, you can run multiple parallel simulations on computer clusters or cloud resources. For more information on configuring simulation to use parallel computing, see UseParallel and ParallelizationOptions in rlSimulationOptions .

Environment visualization

If your training environment implements the plot method, you can visualize the environment behavior during training and simulation. If you call plot(env) before training or simulation, where env is your environment object, the visualization is updated during training so that you can visually follow the progress of each episode or simulation.
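
For example, with the environment and training options from earlier:

plot(env)                              % open the environment visualization
trainStats = train(agent,env,opt);     % the visualization updates during training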

When using parallel computing to train or simulate agents, environment visualization is not supported.

For custom environments, you must implement your own plot method. For more information on creating a custom environment with a plot function, see Creating a Custom MATLAB Environment from a Template.


Source: blog.csdn.net/wangyifan123456zz/article/details/109563543