1 Introduction
Thank you, Professor Li Hongyi, for the explanation!
2 The sample() function-strategies for exploring actions
The sample() function plays the role of exploration in the training process: instead of always taking the action that currently looks best, it sometimes tries a random one (e.g., ε-greedy), so the agent keeps collecting diverse experience;
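As a concrete illustration, here is a minimal sketch of what such a sample() might look like, assuming a NumPy Q table; the names Q, obs, n_actions, and epsilon are illustrative, not Professor Li's actual code:

```python
import numpy as np

def sample(Q, obs, n_actions, epsilon=0.1):
    """Epsilon-greedy exploration: mostly exploit, occasionally explore.

    Q       : 2-D array of shape (n_states, n_actions), the Q table
    obs     : current (discrete) observation, used as a row index
    epsilon : probability of taking a random action
    """
    if np.random.rand() < epsilon:
        # Explore: pick a uniformly random action.
        return np.random.randint(n_actions)
    # Exploit: pick the action with the largest Q value.
    return int(np.argmax(Q[obs]))
```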
3 Sarsa and Q-Learning-the classic reinforcement learning algorithms
3.1 Reinforcement learning with Q-Learning-using a Q table for action selection
In fact, the idea of Q-Learning is very simple; like the old joke about putting an elephant in a refrigerator, it only takes three steps.
The basic steps are (see the sketch after this list):
- Observe the environment and get the observation;
- Query the Q table according to obs and select the action with the largest Q value;
- Perform the action.
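The three steps map almost one-to-one onto code. A minimal sketch, again assuming a NumPy Q table and a gym-style environment (both are assumptions, not part of the original notes):

```python
import numpy as np

def predict(Q, obs):
    """Greedy action selection from a Q table.

    Step 1: the caller observes the environment and obtains `obs`;
    Step 2: look up the row of the Q table for `obs` and take the argmax;
    Step 3: the caller executes the returned action.
    """
    return int(np.argmax(Q[obs]))

# Hypothetical usage with a gym-style environment:
# obs, _ = env.reset()
# action = predict(Q, obs)
# obs, reward, terminated, truncated, info = env.step(action)
```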
3.2 The different objectives of Sarsa and Q-Learning
In fact, the two algorithms optimize different objectives, which leads to different learned behavior:
Sarsa (on-policy): maximize the average return of the behavior actually produced by sample(), exploration included, so it tends toward safer, more conservative policies;
Q-Learning (off-policy): maximize the return of the greedy maxQ() behavior, regardless of how the samples were collected, so it tends toward bolder policies;
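To make the difference concrete, here is a sketch of the two update rules side by side; Q is assumed to be a NumPy table, and alpha (learning rate) and gamma (discount factor) are the usual hyperparameters, all names illustrative:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses a_next, the action sample() actually took."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the greedy action max_a' Q(s', a'),
    regardless of what the behavior policy actually did."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```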
4 DQN-replacing the Q table with a neural network
4.1 Why use a neural network to replace the Q table?
If the state or action space is continuous, it may be impossible to express it with a Q table (a continuous state has infinitely many possible values).
So we instead regard "state → Q value" as a mapping, that is to say: we use the idea of function approximation to describe the "state → Q value" relationship;
And since what we need is a function approximator, it is our DNN's turn to take the stage~
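A minimal sketch of such a network, assuming PyTorch; the layer sizes and the CartPole-like dimensions are illustrative:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per action,
    replacing the row lookup of a Q table."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Greedy action for a continuous 4-dimensional state (e.g., CartPole):
q_net = QNetwork(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)                  # a batch with one state
action = q_net(obs).argmax(dim=1).item()
```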
5 Actor-Critic algorithm
In my opinion, the Actor and the Critic have these characteristics:
- Actor: instinct. It is the policy itself, directly deciding which action to take;
- Critic: experience. It evaluates how good the Actor's actions are, and its concrete form is a Q function.
We use TD (temporal difference) learning to estimate Q (this is also the method Professor Li teaches).
I feel the Critic plays the role of directing the reward rule: it converts the raw reward into a judgment of each individual action.
Put perceptually: the Critic expresses the model's learned understanding of the environment's rules, and at the same time it turns a single reward signal into a richer, more varied training signal for the Actor.
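To tie this together, here is a minimal one-transition actor-critic update, assuming PyTorch. Note one substitution: where the notes say the Critic takes the form of a Q function, this sketch uses the closely related state-value critic V(s) with the TD error as the learning signal, which is the common textbook variant; all names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.9

# Actor ("instinct"): outputs a probability for each action.
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions), nn.Softmax(dim=-1))
# Critic ("experience"): estimates the state value V(s).
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
opt = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(obs, action, reward, next_obs, done):
    """One TD-based actor-critic step on a single transition."""
    v = critic(obs)
    with torch.no_grad():
        v_next = torch.zeros_like(v) if done else critic(next_obs)
        td_target = reward + gamma * v_next
    td_error = td_target - v                     # the Critic's "judgment"
    critic_loss = td_error.pow(2).mean()         # move V(s) toward the TD target
    log_prob = torch.log(actor(obs)[action])     # log-prob of the chosen action
    actor_loss = -(td_error.detach() * log_prob).mean()  # push up good actions
    opt.zero_grad()
    (critic_loss + actor_loss).backward()
    opt.step()

# Hypothetical single transition:
# update(torch.randn(4), action=0, reward=1.0,
#        next_obs=torch.randn(4), done=False)
```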