PPO algorithm with an action mask (with code implementation)

The Actor network outputs a probability for every action. When applying the PPO algorithm to practical problems, we often encounter situations where the action space is restricted: in a given state only some actions are legal, so when the agent samples an action in select_action, it must sample only from the set of legal actions.

There are two common solutions: one is to add a penalty reward for illegal actions, and the other is to use an action mask. The basic idea of the action mask is to add a mask layer on top of the action probabilities output by the actor network: a mask value of 1 for a legal action means its probability is kept, while a mask value of 0 for an illegal action means its probability is suppressed.
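As a minimal sketch of this idea (the numbers and variable names here are made up for illustration), masking can be thought of as zeroing out the probabilities of illegal actions and renormalizing:

```python
import torch

# Hypothetical example: 4 discrete actions; actions 1 and 3 are illegal.
action_probs = torch.tensor([0.1, 0.4, 0.3, 0.2])  # raw actor output (after softmax)
action_mask = torch.tensor([1.0, 0.0, 1.0, 0.0])   # 1 = legal, 0 = illegal

# Zero out illegal actions and renormalize so the legal probabilities sum to 1.
masked_probs = action_probs * action_mask
masked_probs = masked_probs / masked_probs.sum()
print(masked_probs)  # tensor([0.2500, 0.0000, 0.7500, 0.0000])
```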

The following sections introduce the principle of the action mask and its implementation in the PPO algorithm step by step, drawing on key excerpts from the reference materials cited along the way.

Generally speaking, the action mask method is much more effective during training than adding penalties to illegal actions (Reference: DRL Algorithm Implementation Notes - Zhihu, https://zhuanlan.zhihu.com/p/412520739).

Explanation of the action mask principle

Reference: How should invalid actions be masked in reinforcement learning? - Zhihu, https://zhuanlan.zhihu.com/p/538953546

How the PPO algorithm uses the action mask

Taking the PPO algorithm with discrete actions as an example, the output of the Actor network is still the probability over all actions. There are two places where the action mask needs to be applied: one is when sampling an action in select_action, and the other is when training the actor network. The second place is easy to forget: when recomputing the action distribution during the actor update, you also need to apply the same action mask that was used during sampling (Reference: What does the action mask of Tencent AI Juewu mean? - Zhihu, https://www.zhihu.com/question/446176024#).
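Below is a rough sketch of both places, assuming a PyTorch actor that outputs logits over discrete actions; the function and variable names (masked_dist, select_action, actor_loss, etc.) are hypothetical, not from the original post:

```python
import torch
from torch.distributions import Categorical

def masked_dist(logits, action_mask):
    # Replace the logits of illegal actions (mask == 0) with a very negative
    # value so that their probability after softmax is numerically zero.
    very_neg = torch.finfo(logits.dtype).min
    masked_logits = logits.masked_fill(action_mask == 0, very_neg)
    return Categorical(logits=masked_logits)

# Place 1: action sampling during rollout (select_action).
def select_action(actor, state, action_mask):
    logits = actor(state)
    dist = masked_dist(logits, action_mask)
    action = dist.sample()  # only legal actions can be sampled
    return action, dist.log_prob(action)

# Place 2: the actor update. The action mask stored in the buffer must be
# applied again when recomputing log-probabilities for the PPO ratio.
def actor_loss(actor, states, actions, action_masks, old_log_probs, advantages, clip_eps=0.2):
    logits = actor(states)
    dist = masked_dist(logits, action_masks)
    new_log_probs = dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()
```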

Action mask code implementation

Initially, I implemented the action mask like this: during sampling, the logits at the positions of illegal actions were replaced with a negative number of large absolute value, so that after the softmax function the probability of each illegal action became 0 (Reference: How Softmax performs mask operations - Zhihu, https://zhuanlan.zhihu.com/p/543736799). This runs into an error: the neural network outputs nan. Generally speaking, the network outputs nan because a division by 0 or a log of 0 occurs during the gradient update.
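A sketch of this problematic manual approach (the numbers are illustrative only), showing where the nan can come from:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -0.5], requires_grad=True)
action_mask = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = legal, 0 = illegal

# Push the logits of illegal actions to a large negative value, then apply
# softmax manually: their probabilities underflow to exactly 0.
masked_logits = logits + (1.0 - action_mask) * (-1e9)
probs = torch.softmax(masked_logits, dim=-1)

# Taking the log of a probability that is exactly 0 yields -inf at the masked
# positions; once -inf (or a division by 0) enters the computation graph,
# the gradients in the actor update can turn into nan.
log_probs = torch.log(probs)
print(log_probs)  # -inf at the illegal positions
```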

The solution is not to compute the softmax and its log manually, but to use the built-in torch.distributions.Categorical (Reference: PPO Practice Guide - Zhihu, https://zhuanlan.zhihu.com/p/627389144).
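A minimal sketch of this fix, letting Categorical handle the masked probability computation (its internal log-softmax keeps the log-probabilities of legal actions finite); the variable names are again only illustrative:

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([2.0, 1.0, 0.5, -0.5], requires_grad=True)
action_mask = torch.tensor([True, False, True, False])

# Mask at the logits level, then hand the masked logits to Categorical
# instead of computing softmax and log manually.
masked_logits = logits.masked_fill(~action_mask, torch.finfo(logits.dtype).min)
dist = Categorical(logits=masked_logits)

action = dist.sample()            # masked actions have (numerically) zero probability
log_prob = dist.log_prob(action)  # finite, safe to backpropagate through
entropy = dist.entropy()
```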

References

Special thanks to the bloggers referenced above for their sincere sharing!


Reprinted from: blog.csdn.net/ningmengzhihe/article/details/131515927