Reinforcement Learning
1. Concept
What is reinforcement learning? An agent interacts with an environment: at each step it observes a state, takes an action, and receives a reward; the goal is to learn a policy that maximizes cumulative reward.
1. Difficulties
- Rewards are delayed: the consequence of an action may only appear many steps later.
- The agent's actions influence the subsequent observations and the rewards it receives.
2. Classification
policy-based => learn an actor; value-based => learn a critic
2.1 policy based
The policy-based training process is outlined below (the original notes show it as a figure):
Neural network as actor:
The neural network outputs a probability distribution over actions; at each step the highest-probability action is taken (or, during training, an action is sampled from the distribution).
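A minimal sketch of such an actor, assuming a one-layer network with a 4-dimensional observation and 3 actions (the shapes, weights, and greedy action choice are illustrative assumptions, not from the original notes):

```python
import numpy as np

def softmax(z):
    """Turn raw network outputs (logits) into a probability distribution."""
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def actor(observation, W):
    """Hypothetical one-layer actor: observation -> action probabilities."""
    logits = W @ observation
    return softmax(logits)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))          # assumed: 3 actions, 4-dim observation
obs = rng.normal(size=4)
probs = actor(obs, W)                # probabilities of the different actions
action = int(np.argmax(probs))       # greedy: take the highest-probability action
```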
Start with observation S1;
Machine decides to take a1
Machine obtains reward r1
Machine sees observation S2
Machine decides to take a2
Machine obtains reward r2
...
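The interaction loop above can be sketched as follows; the toy environment and the 2-action uniform policy are placeholders invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def env_step(state, action):
    """Toy stand-in environment: returns (next observation, reward)."""
    next_state = rng.normal(size=state.shape)
    reward = 1.0 if action == 0 else 0.0   # arbitrary reward rule for the sketch
    return next_state, reward

def run_episode(policy, s1, T=5):
    """Collect one trajectory: s1, a1, r1, s2, a2, r2, ..."""
    trajectory, state, total_reward = [], s1, 0.0
    for _ in range(T):
        probs = policy(state)                     # actor outputs probabilities
        action = rng.choice(len(probs), p=probs)  # sample a_t
        next_state, reward = env_step(state, action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
    return trajectory, total_reward

uniform_policy = lambda s: np.array([0.5, 0.5])   # placeholder actor
traj, R_total = run_episode(uniform_policy, np.zeros(4))
```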
The total reward of one episode (trajectory $\tau$) is $R(\tau) = \sum_{t=1}^{T} r_t$. Because both the actor's choices and the game are random, $R(\tau)$ is a random variable, so we use the expected value of $R$ to evaluate the quality of the actor.
- Goodness of actor

Given an actor with parameters $\theta$, a trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots)$ occurs with probability $P(\tau|\theta)$. The expected reward is
$$\bar{R}_\theta = \sum_\tau R(\tau)\, P(\tau|\theta)$$
That is, the reward of each trajectory is multiplied by the probability that the actor produces it, then summed. Since all trajectories cannot be enumerated, we sample $N$ of them and approximate $\bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)$.
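As a concrete check of the expected-reward idea, here is a tiny numeric example with made-up probabilities and rewards (all numbers are purely illustrative):

```python
import numpy as np

# Pretend only two trajectories exist, with known probabilities under the actor:
P_tau = np.array([0.3, 0.7])    # P(tau | theta)
R_tau = np.array([1.0, 2.0])    # R(tau)

# Exact expectation: R_bar = sum over tau of R(tau) * P(tau | theta)
R_bar = float(np.dot(P_tau, R_tau))          # 0.3*1 + 0.7*2 = 1.7

# In practice tau cannot be enumerated, so sample N episodes and average:
sampled_R = np.array([2.0, 1.0, 2.0, 2.0])   # total rewards of 4 sampled episodes
R_bar_mc = float(sampled_R.mean())           # Monte Carlo estimate = 1.75
```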
So how do we find the best actor? The method used is gradient ascent.
- Gradient ascent

We look for the parameters that maximize the expected reward, $\theta^* = \arg\max_\theta \bar{R}_\theta$. The process: initialize $\theta^0$, then repeat the update
$$\theta^{i+1} = \theta^i + \eta\, \nabla \bar{R}_{\theta^i}$$
Using $\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, P(\tau|\theta)\, \nabla \log P(\tau|\theta)$ and sampling $N$ trajectories, the gradient is approximated as
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
The idea of the formula: when a trajectory's total reward is positive, increase the probability of the actions taken in it; when it is negative, decrease them.
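The update rule can be demonstrated end-to-end on a deliberately tiny stateless problem (a 3-armed bandit standing in for the game, an assumption made so the example stays short). For a softmax policy, $\nabla \log p_\theta(a) = \text{onehot}(a) - p$, and gradient ascent steadily shifts probability toward the rewarded action:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)          # policy parameters (logits), one per action
eta = 0.1                    # learning rate

for _ in range(500):
    p = softmax(theta)
    a = rng.choice(3, p=p)           # sample an action from the actor
    R = 1.0 if a == 1 else 0.0       # assumed reward: only action 1 pays off
    grad_log = -p                    # grad of log softmax: onehot(a) - p
    grad_log[a] += 1.0
    theta += eta * R * grad_log      # gradient *ascent* on expected reward

p_final = softmax(theta)             # action 1 now dominates
```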
But this is not perfect. If all rewards are non-negative (in the shooting game, even an aircraft that keeps firing in place collects some reward), then every sampled action has its probability increased, and only the size of the increase differs. How do we make poor actions' probabilities actually go down? Add a baseline.
- Add a baseline
A realistic example: because the output probabilities are normalized, if every sampled action's weight increases, the one with the smallest increase still loses probability mass. As the original figure shows, even if the weight of $a$ increases, $a$ can end up with the smallest probability after normalization when other actions increase more.
There is also the opposite problem: an action that is never sampled receives no increase at all, so after normalization its probability drops even if it was a good action. As shown in the original figure, $a$ does not appear in the samples, so its probability decreases.
Then we need to choose a baseline $b$ so that the weight $R(\tau) - b$ can be either positive or negative.
The formula becomes
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
where a common choice is $b \approx E[R(\tau)]$, the average sampled reward.
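A small numeric illustration of why subtracting $b$ helps, with made-up sampled rewards; the mean-reward baseline is one common choice, not the only one:

```python
import numpy as np

# Total rewards of N=4 sampled trajectories; all positive, so without a
# baseline every sampled action would have its probability pushed up:
R = np.array([3.0, 1.0, 2.0, 2.0])

b = float(R.mean())          # baseline: average sampled reward (= 2.0)
advantages = R - b           # weights R(tau) - b used in the gradient

# advantages = [1., -1., 0., 0.]: below-average trajectories now get a
# negative weight, so the actions in them become *less* likely.
```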
2.2 policy gradient
The action probabilities $p_\theta(a_t \mid s_t)$ used in the gradient are obtained from the neural network (the actor).