Chapter 1: Reinforcement Learning

1. Concept

What is reinforcement learning? An agent interacts with an environment: it observes a state, takes an action, and receives a reward, and it learns to choose actions that maximize the total reward it collects.

1. Difficulties

  1. The reward often appears with a delay: an action taken now may only pay off many steps later
  2. The agent's actions change what the environment shows it next, and therefore the rewards it can receive

2. Classification

Policy-based => learning an actor; value-based => learning a critic. A small sketch contrasting the two is given below.
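As a minimal illustration (not from the original post), the two approaches learn different objects: a policy network that outputs action probabilities, versus a value network that scores a situation. The layer sizes here are arbitrary placeholders.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # hypothetical observation/action sizes

# Policy-based: the network *is* the actor; it maps an observation
# to a probability distribution over actions.
actor = nn.Sequential(
    nn.Linear(obs_dim, 32), nn.ReLU(),
    nn.Linear(32, n_actions), nn.Softmax(dim=-1),
)

# Value-based: the network is a critic; it maps an observation
# to a single number estimating how good that situation is.
critic = nn.Sequential(
    nn.Linear(obs_dim, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

obs = torch.randn(obs_dim)
print(actor(obs))   # action probabilities, e.g. tensor([0.47, 0.53])
print(critic(obs))  # scalar value estimate, e.g. tensor([0.08])
```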

2.1 policy based

The policy-based process framework:
[Figure: policy-based process framework]

The neural network serves as the actor:
[Figure: neural network outputting action probabilities]
The network outputs a probability for each action, and the action with the highest probability is taken at this step (in practice the action is often sampled from this distribution instead).
Start with observation $s_1$;
Machine decides to take action $a_1$;
Machine obtains reward $r_1$;
Machine sees observation $s_2$;
Machine decides to take action $a_2$;
Machine obtains reward $r_2$;
...
This interaction produces a trajectory $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$, whose total reward is

$$R(\tau) = \sum_{t=1}^{T} r_t$$

Because the actor's behavior and the game are both random, $R$ is a random variable. We therefore use the expected value $\bar{R}_\theta$ of $R$ to evaluate how good the actor is.
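A runnable sketch of the loop above, using nothing beyond the post itself: a toy stand-in environment (the real task would be a game) and a placeholder random actor. It records one trajectory and sums its rewards into $R(\tau)$.

```python
import random

class ToyEnv:
    """Toy stand-in environment so the sketch runs on its own."""
    def reset(self):
        self.t = 0
        return 0.0  # observation s_1

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0   # arbitrary toy reward r_t
        done = self.t >= 5                     # episode ends after 5 steps
        return float(self.t), reward, done     # s_{t+1}, r_t, game over?

def actor(obs):
    # Placeholder for the neural network's output distribution:
    # here we simply act at random.
    return random.choice([0, 1])

env = ToyEnv()
obs, done, trajectory = env.reset(), False, []
while not done:
    action = actor(obs)                        # machine decides a_t from s_t
    next_obs, reward, done = env.step(action)  # obtains r_t, sees s_{t+1}
    trajectory.append((obs, action, reward))
    obs = next_obs

R = sum(r for _, _, r in trajectory)           # total reward R(tau)
print(trajectory, R)
```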

  • Goodness of an actor
    Given the actor's parameters $\theta$, the probability of a whole trajectory $\tau$ is
    $$P(\tau|\theta) = p(s_1)\prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t)$$
    where $p(a_t|s_t,\theta)$ is the probability that the actor chooses action $a_t$ when it sees $s_t$.

Then $\bar{R}_\theta = \sum_\tau R(\tau)\,P(\tau|\theta)$ is the expected reward. That is, the reward of each trajectory is multiplied by the probability of that trajectory, and the products are added up. Since enumerating every trajectory is impossible, in practice the actor plays the game $N$ times and we average: $\bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)$.
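A tiny Monte Carlo sketch of that approximation (my illustration, not the post's code): the "episode" here is faked with random rewards, but the averaging step is exactly the $\frac{1}{N}\sum_n R(\tau^n)$ estimate.

```python
import random

def sample_return():
    # Stand-in for letting the actor play one episode and summing
    # its rewards: returns R(tau) for one sampled trajectory.
    return sum(random.choice([0.0, 1.0]) for _ in range(5))

# We cannot sum over all trajectories, so approximate the expectation
# R_bar = sum over tau of R(tau) * P(tau|theta) by sampling N episodes:
N = 1000
R_bar = sum(sample_return() for _ in range(N)) / N
print(R_bar)  # Monte Carlo estimate of the actor's expected reward
```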

So how do we find the best actor? The method used is gradient ascent.

  • Gradient ascent
    We want the parameters that maximize the expected reward:
    $$\theta^* = \arg\max_\theta \bar{R}_\theta$$

    The process: first initialize $\theta^0$ (e.g., randomly), then repeat the update
    $$\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla \bar{R}_{\theta^{old}}$$
    A small sketch of this update follows.
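A self-contained sketch of the ascent loop on a made-up one-dimensional "expected reward" whose maximum is known, just to show the direction of the update (plus, not minus, since we maximize):

```python
# Toy objective: R_bar(theta) = -(theta - 3)^2 is maximized at theta = 3.
grad_R = lambda theta: -2.0 * (theta - 3.0)   # its gradient

theta = 0.0   # theta^0: arbitrary initialization
eta = 0.01    # learning rate
for _ in range(300):
    theta = theta + eta * grad_R(theta)       # ascend: theta + eta * grad
print(theta)  # approaches 3.0, the maximum of R_bar
```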

The gradient is computed as follows:
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\,\nabla P(\tau|\theta) = \sum_\tau R(\tau)\,P(\tau|\theta)\,\nabla \log P(\tau|\theta) \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\,\nabla \log P(\tau^n|\theta)$$

Expanding $\log P(\tau|\theta)$, only the $p(a_t|s_t,\theta)$ terms depend on $\theta$, so:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p(a_t^n|s_t^n,\theta)$$

The idea of the formula: when the trajectory containing an action earned a positive reward, increase the probability of that action; when the reward was negative, reduce it. The sketch below implements one such update.
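A minimal PyTorch sketch of one update with this formula (my illustration; the network shape and the fake episode data are placeholders). Minimizing the negative of $R(\tau)\sum_t \log p(a_t|s_t,\theta)$ is the same as ascending the gradient above.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

obs = torch.randn(5, 4)                  # s_1..s_T (fake data for the sketch)
actions = torch.tensor([0, 1, 1, 0, 1])  # a_1..a_T actually taken
R = 2.0                                  # total reward R(tau) of the episode

log_probs = torch.log_softmax(policy(obs), dim=-1)  # log p(a|s, theta)
taken = log_probs[torch.arange(5), actions]         # log p(a_t|s_t, theta)
loss = -(R * taken).sum()  # minimizing this ascends R * grad log p
optimizer.zero_grad()
loss.backward()
optimizer.step()
# If R > 0 the taken actions become more probable; if R < 0, less probable.
```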

But this is not perfect. In some games the reward is never negative (the aircraft can keep firing and at worst gains nothing), so every sampled action has its probability increased and nothing is ever explicitly pushed down. How do we still penalize the actions whose increase is too low? Add a baseline.

  • Add a baseline
    A realistic example: when every action's probability is increased, normalization means an action can still end up relatively lowest. As shown in the figure, even if $a$'s probability was increased the most, it can still have the smallest probability after normalization. [Figure: after normalization, a ends up with the smallest probability]
    There are also situations where an action that is never sampled has its probability shrink: since its probability is never increased, normalization pushes it down, as shown below. [Figure: a is not sampled, so its probability decreases]
    We therefore need to design a good value $b$ so that the weight on the gradient, $R(\tau) - b$, can be either positive or negative. The formula becomes (see the sketch after this list):
    $$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\,\nabla \log p(a_t^n|s_t^n,\theta)$$
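A tiny sketch (my own numbers) of what subtracting a baseline does to the weights: with $b$ set to the mean return, better-than-average episodes get positive weight and worse-than-average ones get negative weight, even though every raw return is positive.

```python
import torch

returns = torch.tensor([10.0, 12.0, 8.0])  # R(tau^n) for N = 3 episodes
b = returns.mean()                         # one simple choice of baseline
weights = returns - b                      # R(tau^n) - b
print(weights)  # tensor([ 0.,  2., -2.]) -- signs now differ
# Each episode's sum of log p(a_t|s_t) gradients is scaled by its weight,
# so below-average episodes push their actions' probabilities down.
```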

2.2 policy gradient

The term $\nabla \log p(a_t|s_t,\theta)$ is obtained from the neural network: the softmax output gives $p(a_t|s_t,\theta)$, and backpropagation gives its gradient with respect to $\theta$. The end-to-end sketch below ties the whole section together.
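A complete minimal policy-gradient loop on a made-up two-action task (my illustration; the task, the reward, and the network size are all placeholders). Action 1 is rewarded and action 0 is not, so after training the actor should prefer action 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=0.05)

def run_episode(T=5):
    """Play one episode: sample a_t from the actor, collect total reward."""
    obs_list, act_list, R = [], [], 0.0
    obs = torch.zeros(1)
    for t in range(T):
        probs = torch.softmax(policy(obs), dim=-1)
        a = torch.multinomial(probs, 1).item()   # sample a_t ~ p(.|s_t, theta)
        R += 1.0 if a == 1 else 0.0              # toy reward r_t
        obs_list.append(obs)
        act_list.append(a)
        obs = torch.tensor([float(t + 1)])       # next observation
    return torch.stack(obs_list), torch.tensor(act_list), R

for it in range(100):
    episodes = [run_episode() for _ in range(8)]         # N sampled tau^n
    b = sum(R for _, _, R in episodes) / len(episodes)   # baseline = mean R
    loss = torch.tensor(0.0)
    for obs, acts, R in episodes:
        logp = torch.log_softmax(policy(obs), dim=-1)
        logp_taken = logp[torch.arange(len(acts)), acts]
        loss = loss - (R - b) * logp_taken.sum()         # (R - b) log p term
    optimizer.zero_grad()
    (loss / len(episodes)).backward()
    optimizer.step()

print(torch.softmax(policy(torch.zeros(1)), dim=-1))  # action 1 dominates
```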

3. How to interact with the environment

Source: blog.csdn.net/qq_53982314/article/details/131334121