The difference between reinforcement learning and supervised learning:
(1) There are no labels in the training data, only the reward function (Reward Function).
(2) The training data is not given ready-made, but obtained from actions.
(3) The current behavior (Action) not only affects the acquisition of subsequent training data, but also affects the value of the reward function (Reward Function).
(4) The purpose of training is to construct a "state -> action" function, where the state (State) describes the current internal and external environment of the agent (Agent) being operated. Given the current state, this function determines the action (Action) to take, with the goal that the actions taken eventually yield the maximum cumulative reward.
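The points above can be sketched as a generic agent-environment interaction loop. The `ToyEnv` environment and the `env.reset()`/`env.step()` interface here are illustrative (loosely modeled on common RL library conventions), not from the original text:

```python
class ToyEnv:
    """Minimal illustrative environment: reach state 3 to earn a reward of 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action            # action is 0 or 1
        done = self.state >= 3
        return self.state, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=1000):
    """Run one episode: the policy is the "state -> action" function being
    learned; each action yields a reward and affects the next state."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)          # state -> action mapping
        state, reward, done = env.step(action)
        total_reward += reward          # rewards, not labels, drive learning
        if done:
            break
    return total_reward
```

Note how the training data (the visited states) is not given in advance but produced by the actions themselves, which is exactly difference (2) above.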
Supervised learning obtains a mapping from data to labels through training.
Some definitions
Some assumptions
Markov Decision Process (MDP)
Objective function to be optimized
The objective function to be optimized in reinforcement learning is the cumulative reward, a weighted sum of the reward function over a period of time:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

Here, γ (gamma) ∈ [0, 1] is a decay (discount) factor.
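The discounted cumulative reward can be computed efficiently by folding from the last reward backwards; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative reward: a weighted sum of rewards, where a reward
    k steps in the future is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # G_t = r_{t+1} + gamma * G_{t+1}
    return g
```

The backward recursion avoids recomputing the powers of gamma for each step.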
Q-Learning
The functions that are already known in reinforcement learning are:
The functions that need to be learned are:
Following a decision-making mechanism (Policy), we can obtain a path (trajectory):

s_0, a_0, r_1, s_1, a_1, r_2, s_2, …
Definition 1: the Value Function V(s) measures the cumulative reward that can eventually be obtained starting from a certain state:
Definition 2: the Q function Q(s, a) measures the cumulative reward that can be obtained after taking a certain action in a certain state:
The relationship between Q and V:
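In standard notation (a sketch of the usual definitions; the policy π and the symbols V, Q follow the conventions above, and the original formula here was lost):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a),
\qquad
V^{*}(s) = \max_{a} Q^{*}(s, a)
```

That is, V is the expectation of Q over the actions the policy would choose, and the optimal value is the Q value of the best action.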
Recursion: an action a is generated from s with some probability, and s' is generated from s and a with some probability; summing over both levels of probability establishes the relationship between the value of s and the value of s', from which the best policy can be found.
Iterative algorithm:
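The iterative algorithm can be sketched as tabular value iteration on a small MDP; the table layout (`P[s][a]` as a list of `(prob, next_state)` pairs, `R[s][a]` as the reward) is an illustrative choice, not from the original text:

```python
def value_iteration(num_states, num_actions, P, R, gamma=0.9, iters=100):
    """Tabular value iteration.

    P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the reward.
    Repeatedly applies the recursion
        V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
    which sums over the two levels of probability described above.
    """
    V = [0.0] * num_states
    for _ in range(iters):
        V = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(num_actions)
            )
            for s in range(num_states)
        ]
    return V
```

Each sweep updates every state from its successors, so the table of values converges toward the fixed point of the recursion.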
Disadvantages of this algorithm:
This approach is not practical when the number of states and behaviors is large.
For example: for an Atari game, the number of states is the number of combinations of the values of all pixels in adjacent frames, which is an astronomical number!
The number of actions, by contrast, is only 6 to 20.
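To make "astronomical" concrete: even a single 210x160 grayscale Atari frame with 256 possible pixel values already has 256^(210*160) possible states. A quick way to gauge the order of magnitude (the frame dimensions are the standard Atari screen size; the choice of a single grayscale frame is a simplifying assumption):

```python
import math

# Number of decimal digits in 256^(210*160), the state count
# for one grayscale Atari frame.
pixels = 210 * 160
digits = int(pixels * math.log10(256)) + 1
print(digits)   # a number with tens of thousands of decimal digits
```

A tabular Q function over that state space is plainly infeasible, which motivates the function approximation introduced next.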
Optimization of Q-learning: Deep Q-Network (DQN)
The defining relation is the Bellman Equation:
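The Bellman Equation yields the Q-learning update target r + γ·max_a' Q(s', a'). A tabular sketch (the dictionary-based Q table and the parameter names are illustrative):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s,a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

DQN replaces the table `Q[(s, a)]` with a neural network and minimizes the squared difference between the network's output and this same target.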
Example
DQN settings for playing Atari games:
DQN settings for more difficult Atari games:
DQN algorithm process
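A key ingredient of the DQN algorithm process is experience replay: transitions are stored and random minibatches are sampled for training, which breaks the correlation between consecutive frames. A minimal sketch (the capacity and the 5-tuple layout are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions, then sample random
    minibatches so consecutive (correlated) frames are not trained
    on back to back."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In the training loop, each environment step pushes one transition, and each gradient step samples a batch to compute the Bellman targets.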
Disadvantages of Q-learning:
(1) In some applications, when the number of states or actions is large, the Q function becomes very complex and hard to converge. In image applications, for example, the number of states is (number of possible pixel values)^(number of pixels). Such a method has no real understanding of images or tasks and relies only on big data to achieve convergence.
(2) In many programs, such as chess programs, the reward is only the final outcome (win or loss); there is no reward to compute for the intermediate steps.
Policy gradient
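Policy gradient methods sidestep the Q function and adjust the policy parameters directly, in the direction that makes high-reward actions more likely. A minimal REINFORCE sketch on a two-armed bandit (the bandit payoffs, step count, and learning rate are hypothetical):

```python
import math
import random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0.

    theta parameterizes a softmax policy over the two arms; stepping
    along reward * grad(log pi(a)) shifts probability toward the
    better-paying arm."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    probs = [0.5, 0.5]
    for _ in range(steps):
        z = [math.exp(t) for t in theta]
        total = sum(z)
        probs = [x / total for x in z]          # softmax policy
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.0              # only arm 1 is rewarded
        # grad of log pi(a): (1 - pi(a)) for the chosen arm, -pi(i) otherwise
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * r * grad
    return probs
```

Note that the update uses only the sampled reward, matching disadvantage (2) above: no per-step Q values are needed.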
Actor-Critic algorithm:
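In the Actor-Critic algorithm, a critic learns state values and supplies a TD error, which the actor uses in place of the raw return to update its policy parameters. A single-update sketch (the table-based critic `V`, the per-state parameters `theta[s]`, and the learning rates are illustrative; `probs` is assumed to be the actor's current softmax distribution at state `s`):

```python
def actor_critic_step(V, theta, s, a, r, s_next, probs, gamma=0.99,
                      lr_actor=0.01, lr_critic=0.1):
    """One actor-critic update.

    The critic V estimates state values and produces the TD error;
    the actor's parameters theta[s] move along grad(log pi(a|s))
    weighted by that TD error."""
    td_error = r + gamma * V[s_next] - V[s]   # critic's evaluation signal
    V[s] += lr_critic * td_error              # critic update (TD(0))
    for i in range(len(theta[s])):            # actor update
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += lr_actor * td_error * grad
    return td_error
```

Compared with plain policy gradient, the TD error gives a lower-variance learning signal at every step instead of waiting for the episode's final reward.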
Summary
(1) The current development status of reinforcement learning: it can reach the human level or surpass humans in some specific tasks, but there is a gap between it and humans in some relatively complex tasks, such as autonomous driving.
(2) The gap with humans may not be entirely attributable to the algorithm; physical limitations of sensors and machinery are also decisive factors.
(3) Another gap between machines and humans is that humans have some basic concepts. Based on these concepts, humans can learn a lot with very little training, but machines can only learn through large-scale data.
(4) However, machines are fast and never tire. As long as there is a steady stream of data, they can be expected to do better than humans on specific tasks.