Reinforcement learning tutorial resources: https://stable-baselines3.readthedocs.io/en/master/guide/rl.html
SB3 Tutorial: https://github.com/araffin/rl-tutorial-jnrr19/blob/sb3/1_getting_started.ipynb
Unlike supervised learning, where training uses a fixed data set, in reinforcement learning the training data comes from the agent's interaction with the environment. Because results vary from run to run, you need a quantitative evaluation over several independent runs to draw reliable conclusions.
Good results depend on proper hyperparameters. Recent algorithms such as PPO, SAC and TD3 require relatively little tuning, but do not expect the default parameters to work in every environment.
Therefore, it is highly recommended to look at the RL Zoo for tuned hyperparameters (and to use its automatic hyperparameter optimization).
When customizing an environment, normalize the input to the agent and look at the preprocessing commonly applied to similar environments (e.g. frame stacking for Atari).
Current RL Limitations
- Model-free algorithms are sample-inefficient: they often need millions of interaction samples, so most RL results are currently limited to simulations and games.
- To reach good performance on ordinary hardware, you usually need to increase the number of training steps of the agent.
- To achieve the desired performance, expert knowledge is often needed to design an appropriate reward function. (An example of reward shaping: the Deep Mimic paper, which combines imitation learning and reinforcement learning to perform acrobatic moves.)
- Finally, training can be unstable: performance may drop sharply during training, especially with DDPG, which is the problem its extension TD3 was designed to address. Other approaches such as TRPO and PPO use a trust region to mitigate this by avoiding overly large updates.
How to evaluate an RL algorithm
When evaluating your agents or comparing experimental results, be aware that changes in environment wrappers, episode rewards or episode lengths can also affect evaluation results. It is recommended to use the evaluate_policy helper function described in the Evaluation Helper chapter.
Since most algorithms use exploration noise during training, you need a separate test environment to evaluate the agent. For a reliable estimate, it is recommended to periodically evaluate the agent for n episodes (usually 5 < n < 20) and average the per-episode reward.
SB3 provides the EvalCallback interface for such periodic evaluation; see the Callbacks chapter for details. For algorithms that use a stochastic policy by default (such as A2C and PPO), set deterministic=True when calling the .predict() method; this usually yields better performance.
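The evaluation loop that evaluate_policy performs can be sketched in plain Python. DummyEnv and the constant policy below are hypothetical stand-ins, not SB3 code; the point is the pattern of running n full episodes and averaging episode rewards:

```python
import statistics

class DummyEnv:
    """Hypothetical toy environment: each episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return 0.0  # observation

    def step(self, action):
        self.t += 1
        reward = 1.0
        done = self.t >= 10
        return 0.0, reward, done, {}

def evaluate(policy, env, n_eval_episodes=10):
    """Run the policy for n episodes; return mean and std of episode rewards."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            total += reward
        episode_rewards.append(total)
    return statistics.mean(episode_rewards), statistics.pstdev(episode_rewards)

mean_reward, std_reward = evaluate(lambda obs: 0, DummyEnv())
print(mean_reward, std_reward)  # 10.0 0.0
```

With a real SB3 model you would use evaluate_policy on a dedicated evaluation environment instead of a training one.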
Algorithm selection
- Choose according to whether the action space is discrete or continuous. For example, DQN only supports discrete actions, while SAC is limited to continuous actions.
- Discrete actions, single process: use extended DQN algorithms (e.g. QR-DQN, double DQN, prioritized replay, ...). DQN trains slowly, but it is the most sample-efficient option (thanks to its replay buffer).
- Discrete actions, multiprocessed: try PPO or A2C.
- Continuous actions, single process: the current state-of-the-art (SOTA) algorithms are SAC, TD3 and TQC. For best results, use the hyperparameters from the RL Zoo.
- Continuous actions, multiprocessed: try PPO, TRPO or A2C, again with the hyperparameters from the RL Zoo.
- Choose according to whether you need parallel training: if the wall-clock training time matters, lean toward A2C and PPO. Take a look at the Vectorized Environments chapter to learn more about training with multiple workers.
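The idea behind vectorized environments is to step several copies of the environment in lockstep, so the policy receives a whole batch of observations per step. The ToyEnv below is a hypothetical stand-in for what SB3's VecEnv classes automate:

```python
class ToyEnv:
    """Hypothetical one-dimensional environment: the observation counts steps."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, False, {}

n_envs = 4
envs = [ToyEnv() for _ in range(n_envs)]
obs_batch = [env.reset() for env in envs]

# One "vectorized" step: the policy acts on the whole batch at once,
# and every environment copy advances by one step.
actions = [0] * n_envs  # a real policy would map obs_batch -> actions
results = [env.step(a) for env, a in zip(envs, actions)]
obs_batch = [obs for obs, reward, done, info in results]
print(obs_batch)  # [1, 1, 1, 1]
```

Collecting n_envs transitions per policy call is what lets A2C and PPO trade sample efficiency for wall-clock speed.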
- Goal environments: if your environment follows the GoalEnv interface (see HER), choose HER + (SAC/TD3/DDPG/DQN/QR-DQN/TQC) according to the action space. Also note that batch_size is an important hyperparameter for HER.
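The core of HER is goal relabeling: transitions from a failed episode are stored again with an achieved goal substituted for the desired one, turning them into successful training examples. Below is a minimal sketch of the "final" goal-selection strategy on a toy trajectory; the transition format is hypothetical, not SB3's replay buffer layout:

```python
def relabel_final(trajectory):
    """HER 'final' strategy: replace each transition's desired goal with the
    goal actually achieved at the end of the episode, and recompute the
    sparse reward (1.0 if the achieved goal matches, else 0.0)."""
    final_achieved = trajectory[-1]["achieved_goal"]
    relabeled = []
    for t in trajectory:
        new_t = dict(t)
        new_t["desired_goal"] = final_achieved
        new_t["reward"] = 1.0 if t["achieved_goal"] == final_achieved else 0.0
        relabeled.append(new_t)
    return relabeled

# Toy episode: the agent never reached desired goal 5, so every reward is 0.
episode = [
    {"achieved_goal": 1, "desired_goal": 5, "reward": 0.0},
    {"achieved_goal": 2, "desired_goal": 5, "reward": 0.0},
    {"achieved_goal": 3, "desired_goal": 5, "reward": 0.0},
]
virtual = relabel_final(episode)
print([t["reward"] for t in virtual])  # [0.0, 0.0, 1.0]
```

These relabeled "virtual" transitions are stored alongside the real ones, which is why HER makes sparse-reward goal environments learnable.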
Create a custom environment
To learn how to create a custom environment, see this page.
Example of creating a custom gym environment: colab notebook
Some advice:
- When you know the bounds, normalize your observation space as much as possible
- Normalize the action space and make it symmetric when the action is continuous, for example by rescaling actions to [-1, 1]
- Start with a shaped reward (i.e. an informative reward) and a simplified version of your problem
- Use random actions to check that your environment works and follows the gym interface
Be careful not to break the Markov assumption when creating a custom environment, and handle termination caused by timeout (reaching the maximum number of steps per episode) correctly. For example, if there is a delay between action and observation (such as wifi communication), a history of observations should be given as input.
Termination due to timeout needs to be handled separately: you need to manually add the key info["TimeLimit.truncated"] = True to the info dict. If you use gym's TimeLimit wrapper, this is done automatically.
You can read Time Limit in RL or watch the RL Tips and Tricks video for more details.
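A minimal sketch of that timeout handling, using a plain Python class that mimics the gym reset/step interface. CountEnv and its 5-step limit are hypothetical; in practice you would subclass gym.Env or rely on the TimeLimit wrapper:

```python
class CountEnv:
    """Hypothetical environment: the true goal state is never reached here,
    so every episode ends by timeout after max_steps."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        obs, reward = self.t, 0.0
        terminated = False               # true task termination (goal reached)
        timeout = self.t >= self.max_steps
        done = terminated or timeout
        info = {}
        if timeout and not terminated:
            # Tell the algorithm this is a time limit, not a real terminal
            # state, so it can still bootstrap the value of the last state.
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info

env = CountEnv()
env.reset()
done, info = False, {}
while not done:
    obs, reward, done, info = env.step(0)
print(info)  # {'TimeLimit.truncated': True}
```

Distinguishing truncation from true termination matters because bootstrapping through a timeout state with done=True would wrongly treat it as having zero future value.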
A method for checking that the environment is correct:
from stable_baselines3.common.env_checker import check_env
env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)
Quickly run a random agent in the environment to check that it works:
env = YourEnv()
obs = env.reset()
n_steps = 10
for _ in range(n_steps):
    # Random action
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
Why is it necessary to normalize the action space?
For continuous actions, most reinforcement learning algorithms rely on a Gaussian distribution (centered at 0 with a standard deviation of 1), so an unnormalized action space will hurt the learning process and make debugging difficult.
Another consequence of using a Gaussian is that the action range is unbounded, so clipping is usually applied to keep actions within reasonable limits.
A better solution is to use a squashing function (cf. SAC) or a Beta distribution (cf. issue #112).
The above does not apply to DDPG and TD3, because they do not rely on any probability distribution.
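A sketch of the two common options: linearly rescaling a policy action from [-1, 1] into the environment's own bounds, and squashing an unbounded Gaussian sample with tanh as SAC's policy does. The function names here are illustrative, not SB3 API:

```python
import math

def rescale_action(action, low, high):
    """Map an action from the symmetric range [-1, 1] to [low, high]."""
    return low + (action + 1.0) * 0.5 * (high - low)

def squash(unbounded):
    """Squash an unbounded sample into (-1, 1), as a tanh squashing
    function does in SAC-style policies."""
    return math.tanh(unbounded)

# A policy output of 0.0 maps to the middle of a [0, 10] action range.
print(rescale_action(0.0, 0.0, 10.0))  # 5.0
# Even a large Gaussian sample stays inside (-1, 1) after squashing.
print(squash(3.7) < 1.0)  # True
```

Keeping the learned distribution in a symmetric normalized range and rescaling only at the environment boundary is what the earlier advice about [-1, 1] action spaces amounts to.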
Tips for running an RL algorithm
When trying to reproduce an RL paper by running an algorithm, it is recommended to take a look at the nuts and bolts of RL research by John Schulman ( video ).
We recommend following those steps to have a working RL algorithm:
- Read the original paper several times
- Read existing implementations (if available)
- Try to have some “sign of life” on toy problems
- Run in more and more complex environments to further verify the feasibility of the algorithm (the results can be compared with RL zoo). Hyperparameter optimization is usually required at this step
You need to be particularly careful about the shapes of the different objects you manipulate (a broadcasting mistake will fail silently, cf. issue #75) and about when to stop gradient propagation.
Continuous-action environments of increasing difficulty:
- Pendulum (easy to solve)
- HalfCheetahBullet (medium difficulty with local minima and shaped reward)
- BipedalWalkerHardcore (if it works on that one, then you can have a cookie)
Discrete-action environments of increasing difficulty:
- CartPole-v1 (easy to be better than random agent, harder to achieve maximal performance)
- LunarLander
- Pong (one of the easiest Atari games)
- other Atari games (e.g. Breakout)