Reinforcement learning tutorial resources: https://stable-baselines3.readthedocs.io/en/master/guide/rl.html
SB3 Tutorial: https://github.com/araffin/rl-tutorial-jnrr19/blob/sb3/1_getting_started.ipynb
Unlike supervised learning, where training uses a fixed data set, in reinforcement learning the training data comes from the agent's interaction with the environment. Because results vary from run to run, you need a quantitative evaluation over several independent runs to draw reliable conclusions.
Good results depend on proper hyperparameters. Recent algorithms such as PPO, SAC and TD3 require relatively little tuning, but do not expect the default parameters to work in every environment.
Therefore, it is highly recommended to look at the RL Zoo for tuned hyperparameters (and to use its automatic hyperparameter optimization).
When customizing an environment, normalize the input to the agent and look at the preprocessing commonly applied to similar environments (e.g. frame stacking for Atari).
Current RL Limitations
- Model-free algorithms are sample-inefficient: they often need millions of interaction samples, so most RL results are currently limited to simulations and games.
- To reach good performance on ordinary hardware, you usually need to increase the number of training steps of the agent.
- To achieve the desired performance, expert knowledge is often needed to design an appropriate reward function. (An example of reward shaping: the Deep Mimic paper, which combines imitation learning and reinforcement learning to perform acrobatic moves.)
- Finally, training can be unstable: performance may drop sharply during training, especially with DDPG, which is the problem its extension TD3 was designed to address. Other approaches such as TRPO and PPO use a trust region to mitigate this by avoiding overly large updates.
How to evaluate an RL algorithm
When evaluating your agents or comparing experimental results, be aware that changes in environment wrappers, episode rewards or episode lengths can also affect evaluation results. It is recommended to use the evaluate_policy helper function described in the Evaluation Helper chapter.
Since most algorithms use exploration noise during training, you need a separate test environment to evaluate the agent. For a reliable estimate, it is recommended to periodically evaluate the agent for n episodes (usually 5 < n < 20) and average the per-episode reward.
SB3 provides the EvalCallback interface for such periodic evaluation; see the Callbacks chapter for details. For algorithms that use a stochastic policy by default (such as A2C and PPO), set deterministic=True when calling the .predict() method; this usually yields better performance.
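The evaluation loop that evaluate_policy performs can be sketched in plain Python. DummyEnv and the constant policy below are hypothetical stand-ins, not SB3 code; the point is the pattern of running n full episodes and averaging episode rewards:

```python
import statistics

class DummyEnv:
    """Hypothetical toy environment: each episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return 0.0  # observation

    def step(self, action):
        self.t += 1
        reward = 1.0
        done = self.t >= 10
        return 0.0, reward, done, {}

def evaluate(policy, env, n_eval_episodes=10):
    """Run the policy for n episodes; return mean and std of episode rewards."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            total += reward
        episode_rewards.append(total)
    return statistics.mean(episode_rewards), statistics.pstdev(episode_rewards)

mean_reward, std_reward = evaluate(lambda obs: 0, DummyEnv())
print(mean_reward, std_reward)  # 10.0 0.0
```

With a real SB3 model you would use evaluate_policy on a dedicated evaluation environment instead of a training one.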
Algorithm selection
- Choose according to whether the action space is discrete or continuous. For example, DQN only supports discrete actions, while SAC is limited to continuous actions.
- Discrete actions, single process: use extended DQN algorithms (e.g. QR-DQN, double DQN, prioritized replay, ...). DQN trains slowly, but it is the most sample-efficient option (thanks to its replay buffer).
- Discrete actions, multiprocessed: try PPO or A2C.
- Continuous actions, single process: the current state-of-the-art (SOTA) algorithms are SAC, TD3 and TQC. For best results, use the hyperparameters from the RL Zoo.
- Continuous actions, multiprocessed: try PPO, TRPO or A2C, again with the hyperparameters from the RL Zoo.
- Choose according to whether you need parallel training: if the wall-clock training time matters, lean toward A2C and PPO. Take a look at the Vectorized Environments chapter to learn more about training with multiple workers.
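The idea behind vectorized environments is to step several copies of the environment in lockstep, so the policy receives a whole batch of observations per step. The ToyEnv below is a hypothetical stand-in for what SB3's VecEnv classes automate:

```python
class ToyEnv:
    """Hypothetical one-dimensional environment: the observation counts steps."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, False, {}

n_envs = 4
envs = [ToyEnv() for _ in range(n_envs)]
obs_batch = [env.reset() for env in envs]

# One "vectorized" step: the policy acts on the whole batch at once,
# and every environment copy advances by one step.
actions = [0] * n_envs  # a real policy would map obs_batch -> actions
results = [env.step(a) for env, a in zip(envs, actions)]
obs_batch = [obs for obs, reward, done, info in results]
print(obs_batch)  # [1, 1, 1, 1]
```

Collecting n_envs transitions per policy call is what lets A2C and PPO trade sample efficiency for wall-clock speed.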
- Goal environments: if your environment follows the GoalEnv interface (see HER), choose HER + (SAC/TD3/DDPG/DQN/QR-DQN/TQC) according to the action space. Also note that batch_size is an important hyperparameter for HER.
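The core of HER is goal relabeling: transitions from a failed episode are stored again with an achieved goal substituted for the desired one, turning them into successful training examples. Below is a minimal sketch of the "final" goal-selection strategy on a toy trajectory; the transition format is hypothetical, not SB3's replay buffer layout:

```python
def relabel_final(trajectory):
    """HER 'final' strategy: replace each transition's desired goal with the
    goal actually achieved at the end of the episode, and recompute the
    sparse reward (1.0 if the achieved goal matches, else 0.0)."""
    final_achieved = trajectory[-1]["achieved_goal"]
    relabeled = []
    for t in trajectory:
        new_t = dict(t)
        new_t["desired_goal"] = final_achieved
        new_t["reward"] = 1.0 if t["achieved_goal"] == final_achieved else 0.0
        relabeled.append(new_t)
    return relabeled

# Toy episode: the agent never reached desired goal 5, so every reward is 0.
episode = [
    {"achieved_goal": 1, "desired_goal": 5, "reward": 0.0},
    {"achieved_goal": 2, "desired_goal": 5, "reward": 0.0},
    {"achieved_goal": 3, "desired_goal": 5, "reward": 0.0},
]
virtual = relabel_final(episode)
print([t["reward"] for t in virtual])  # [0.0, 0.0, 1.0]
```

These relabeled "virtual" transitions are stored alongside the real ones, which is why HER makes sparse-reward goal environments learnable.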
Create a custom environment
To learn how to create a custom environment, see this page.
Example of creating a custom gym environment: colab notebook
Some advice:
- When you know the bounds, normalize your observation space as much as possible
- Normalize the action space and make it symmetric when the action is continuous, for example by rescaling actions to [-1, 1]
- Start with a shaped reward (i.e. an informative reward) and a simplified version of your problem
- Use random actions to check that your environment works and follows the gym interface
Be careful not to break the Markov assumption when creating a custom environment, and handle termination caused by timeout (reaching the maximum number of steps per episode) correctly. For example, if there is a delay between action and observation (such as wifi communication), a history of observations should be given as input.
Termination due to timeout needs to be handled separately: you need to manually add the key info["TimeLimit.truncated"] = True to the info dict. If you use gym's TimeLimit wrapper, this is done automatically.
You can read Time Limit in RL or watch the RL Tips and Tricks video for more details.
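A minimal sketch of that timeout handling, using a plain Python class that mimics the gym reset/step interface. CountEnv and its 5-step limit are hypothetical; in practice you would subclass gym.Env or rely on the TimeLimit wrapper:

```python
class CountEnv:
    """Hypothetical environment: the true goal state is never reached here,
    so every episode ends by timeout after max_steps."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        obs, reward = self.t, 0.0
        terminated = False               # true task termination (goal reached)
        timeout = self.t >= self.max_steps
        done = terminated or timeout
        info = {}
        if timeout and not terminated:
            # Tell the algorithm this is a time limit, not a real terminal
            # state, so it can still bootstrap the value of the last state.
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info

env = CountEnv()
env.reset()
done, info = False, {}
while not done:
    obs, reward, done, info = env.step(0)
print(info)  # {'TimeLimit.truncated': True}
```

Distinguishing truncation from true termination matters because bootstrapping through a timeout state with done=True would wrongly treat it as having zero future value.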
A method for checking that the environment is correct:
from stable_baselines3.common.env_checker import check_env
env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)
Quickly run a random agent in the environment to check that it works:
env = YourEnv()
obs = env.reset()
n_steps = 10
for _ in range(n_steps):
    # Random action
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
Why is it necessary to normalize the action space?
For continuous actions, most reinforcement learning algorithms rely on a Gaussian distribution (centered at 0 with a standard deviation of 1), so an unnormalized action space will hurt the learning process and make debugging difficult.
Another consequence of using a Gaussian is that the action range is unbounded, so clipping is usually applied to keep actions within reasonable limits.
A better solution is to use a squashing function (cf. SAC) or a Beta distribution (cf. issue #112).
The above does not apply to DDPG and TD3, because they do not rely on any probability distribution.
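A sketch of the two common options: linearly rescaling a policy action from [-1, 1] into the environment's own bounds, and squashing an unbounded Gaussian sample with tanh as SAC's policy does. The function names here are illustrative, not SB3 API:

```python
import math

def rescale_action(action, low, high):
    """Map an action from the symmetric range [-1, 1] to [low, high]."""
    return low + (action + 1.0) * 0.5 * (high - low)

def squash(unbounded):
    """Squash an unbounded sample into (-1, 1), as a tanh squashing
    function does in SAC-style policies."""
    return math.tanh(unbounded)

# A policy output of 0.0 maps to the middle of a [0, 10] action range.
print(rescale_action(0.0, 0.0, 10.0))  # 5.0
# Even a large Gaussian sample stays inside (-1, 1) after squashing.
print(squash(3.7) < 1.0)  # True
```

Keeping the learned distribution in a symmetric normalized range and rescaling only at the environment boundary is what the earlier advice about [-1, 1] action spaces amounts to.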
Tips for running an RL algorithm
When trying to reproduce an RL paper by running an algorithm, it is recommended to take a look at the nuts and bolts of RL research by John Schulman ( video ).
We recommend following those steps to have a working RL algorithm:
- Read the original paper several times
- Read existing implementations (if available)
- Try to have some “sign of life” on toy problems
- Run in more and more complex environments to further verify the feasibility of the algorithm (the results can be compared with RL zoo). Hyperparameter optimization is usually required at this step
You need to be particularly careful about the shapes of the different objects you manipulate (a broadcasting mistake will fail silently, cf. issue #75) and about when to stop gradient propagation.
Continuous-action environments of increasing difficulty:
- Pendulum (easy to solve)
- HalfCheetahBullet (medium difficulty with local minima and shaped reward)
- BipedalWalkerHardcore (if it works on that one, then you can have a cookie)
Discrete-action environments of increasing difficulty:
- CartPole-v1 (easy to be better than random agent, harder to achieve maximal performance)
- LunarLander
- Pong (one of the easiest Atari games)
- other Atari games (e.g. Breakout)