A comprehensive collection of reinforcement learning tuning experience: TD3, PPO+GAE, SAC, discrete-action noise exploration, and common hyperparameters of off-policy and on-policy algorithms

1. General parameter settings for reinforcement learning

(1) Selection of reinforcement learning algorithm

The currently recommended algorithms are mainly:

Recommended algorithms for discrete control problems:

①D3QN: D3 stands for Dueling Double DQN, which integrates the architectures of Double DQN and Dueling DQN. It can also be combined with Noisy DQN, alongside the ε-greedy method, to improve exploration efficiency (a minimal sketch of ε-greedy selection and the Double DQN target appears after this list).

②SAC-Discrete: proposed to extend SAC to discrete action spaces; the policy's output vector is treated as the execution probability of each action (a categorical distribution). Reports on its practical effectiveness are mixed.

③H-PPO: a PPO-based algorithm adapted for handling discrete action components.
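
As mentioned under D3QN above, here is a minimal sketch of ε-greedy action selection and the Double DQN target that the Dueling Double architecture builds on. The network interfaces (`q_online`, `q_target` called on batched states) and the default `gamma` are illustrative assumptions, not details from the original post.

```python
import random
import torch

def select_action(q_online, state, epsilon, num_actions):
    """Epsilon-greedy exploration: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        # state: 1-D observation tensor; add a batch dimension for the network
        return int(q_online(state.unsqueeze(0)).argmax(dim=1).item())

def double_dqn_target(q_online, q_target, reward, next_state, done, gamma=0.99):
    """Double DQN: the online net picks the next action, the target net evaluates it."""
    with torch.no_grad():
        next_action = q_online(next_state).argmax(dim=1, keepdim=True)
        next_q = q_target(next_state).gather(1, next_action).squeeze(1)
        # Zero out the bootstrap term at episode termination
        return reward + gamma * (1.0 - done) * next_q
```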

Recommended algorithms for continuous control problems:

PPO+GAE: PPO is a simplified version of TRPO, characterized by easy hyperparameter tuning and strong robustness. GAE stands for Generalized Advantage Estimation; it produces an estimate of the advantage function from the experienced trajectories and then lets the Critic fit that value, so the current policy can be characterized with only a small amount of trajectory data. Experience shows that although GAE can be combined with many RL algorithms, it pairs best with PPO: the combination gives the most stable training and the easiest tuning.
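
To make the GAE computation concrete, here is a minimal sketch of the usual γ/λ recursion over one rollout. The function name, array layout, and the default values of `gamma` and `lam` are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE(lambda) over one rollout of length T.

    rewards, dones: arrays of length T collected from the environment
    values:         critic estimates V(s_t) for t = 0..T-1
    last_value:     critic estimate V(s_T) used to bootstrap the final step
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - float(dones[t])
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Returns used as the Critic's regression target: R_t = A_t + V(s_t)
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns
```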

SAC (with automatic temperature parameter α): it adjusts the temperature coefficient automatically to keep the policy entropy dynamically balanced. Experience shows, however, that it is not suitable for tasks whose optimal policy contains many boundary actions: if a large fraction of the actions under the optimal policy sit at the boundary values, performance degrades. For example, when driving a robot at full speed is usually the optimal solution, SAC is not a good fit. The main reason is that SAC uses the derivative of tanh() when computing the policy entropy: the tanh-squashing correction term log(1 − tanh²(u)) becomes numerically problematic as actions approach the bounds.
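
To illustrate the automatic temperature adjustment mentioned above, here is a minimal sketch of the α update, assuming a PyTorch setup; the variable names, learning rate, and the -|A| target-entropy heuristic are illustrative assumptions rather than details from the original post.

```python
import torch

action_dim = 6
target_entropy = -float(action_dim)          # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob):
    """log_prob: log pi(a|s) for actions sampled from the current policy."""
    # Pushes the policy entropy toward target_entropy by adjusting alpha
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return log_alpha.exp().item()            # alpha used in the actor/critic losses
```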


Source: blog.csdn.net/sinat_39620217/article/details/131730358