【ZJU-Machine Learning】Reinforcement Learning

The differences between reinforcement learning and supervised learning:

(1) There are no labels in the training data, only a reward function (Reward Function).
(2) The training data are not given ready-made; they are obtained by taking actions.
(3) The current action (Action) affects not only the subsequent training data that can be collected, but also the value of the reward function (Reward Function).
(4) The purpose of training is to construct a "state -> action" function. The state (State) describes the current internal and external environment in which an agent (Agent) operates; given this state, the function determines the action to take, in the hope that the actions taken will eventually yield the maximum cumulative reward.

Supervised learning obtains a mapping from data to labels through training.
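As a rough sketch of the interaction described in (1)-(4) (the names `env`, `policy`, and `learner` are illustrative placeholders, not objects from the lecture):

```python
def run_episode(env, policy, learner=None):
    """One episode of the agent-environment interaction loop."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                        # the "state -> action" function
        next_state, reward, done = env.step(action)   # training data comes from acting
        if learner is not None:
            learner.update(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward                               # cumulative reward to be maximized
```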

Some definitions

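For reference, stated in standard notation: the state $s_t \in S$ describes the agent's internal and external situation at time $t$; the action $a_t \in A$ is the behavior taken at time $t$; the reward $r_t = r(s_t, a_t)$ is the scalar feedback received; and the policy $\pi(a \mid s)$ is the "state -> action" mapping to be learned.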

Some assumptions

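The key assumption behind the MDP formulation below is the Markov property: the next state depends only on the current state and action, not on the earlier history,

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t).$$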

Markov Decision Process (MDP)

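Formally, an MDP is a tuple $(S, A, P, r, \gamma)$: a set of states $S$, a set of actions $A$, transition probabilities $P(s' \mid s, a)$, a reward function $r(s, a)$, and a discount factor $\gamma \in [0, 1)$.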

Objective function to be optimized

The objective function to be optimized in reinforcement learning is the cumulative reward, i.e. the discounted (weighted) sum of the reward function over a period of time; here $\gamma$ (GAMMA) is the decay factor.
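Written out, the cumulative reward from time $t$ is

$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad 0 \le \gamma < 1.$$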

Q-Learning

The functions that are already known in reinforcement learning, and the functions that need to be learned, are summarized below.
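One common way to draw this split (the exact split here is an assumption): the environment's reward function $r(s, a)$ and state-transition distribution $p(s' \mid s, a)$ are treated as given, while the policy $\pi(a \mid s)$, together with the value function and Q function induced by it, is what has to be learned.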

According to a decision-making mechanism (Policy), we can obtain a path (trajectory).
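In the notation above, such a path can be written as

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, \dots), \qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim p(\cdot \mid s_t, a_t).$$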
Definition 1: The value function (Value Function) measures the cumulative reward that can eventually be obtained starting from a given state.
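In the standard formulation,

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s\right].$$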
Definition 2: The Q function measures the cumulative reward that can be obtained after taking a given action in a given state.
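In the same notation,

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s,\; a_t = a\right].$$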
The relationship between Q and V is as follows.
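In standard notation (with $V^{*}$ and $Q^{*}$ denoting the optimal versions),

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a), \qquad V^{*}(s) = \max_{a} Q^{*}(s, a).$$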

Recursion: an action $a$ is generated from $s$ with some probability (the policy), and a next state $s'$ is generated from $s$ and $a$ with some probability (the transition). Summing over these two levels of probability relates the value function of $s$ to the value function of $s'$, and this recursion is used to find the best policy.
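Written out, this recursion is

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \big[\, r(s, a) + \gamma\, V^{\pi}(s') \,\big],$$

and replacing the sum over actions with a maximum gives the corresponding equation for the optimal value function $V^{*}$.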
Iterative algorithm:
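A minimal sketch of such an iterative algorithm (value iteration) is given below: starting from an arbitrary value function, the recursion above is applied repeatedly as an update until the values converge. The toy MDP (the transition tensor `P`, reward matrix `R`, and all hyperparameters) is an illustrative assumption, not the lecture's example.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[a, s, s'] = probability of moving from state s to s' under action a
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],
])
# R[a, s] = expected immediate reward for taking action a in state s
R = np.array([
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)        # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
    V_new = Q.max(axis=0)          # Bellman optimality backup over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)          # greedy policy with respect to the final Q
print("V:", V, "policy:", policy)
```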
Disadvantages of this algorithm:

This approach is not practical when the number of states and actions is large.

For example: for an ATARI game, the number of states is the number of combinations of all pixel values across adjacent frames, which is an astronomical number, while the number of actions (ACTIONs) is only around 6 to 20.

Optimization of Q-learning: Deep Q-Network (DQN)

DQN learns the optimal Q function, whose definition and recursive form (the Bellman Equation) are as follows.
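In standard form, the optimal Q function and its Bellman Equation are

$$Q^{*}(s, a) = \max_{\pi}\, \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s,\; a_t = a,\; \pi\right],$$

$$Q^{*}(s, a) = \mathbb{E}_{s'}\!\big[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\big|\, s, a \,\big].$$

DQN approximates $Q^{*}(s, a)$ with a neural network $Q(s, a; \theta)$ and trains it by minimizing the squared Bellman error

$$L(\theta) = \mathbb{E}\!\left[\Big(\, r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \,\Big)^{2}\right],$$

where $\theta^{-}$ are the parameters of a periodically updated target network, as in the standard DQN setup.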

Example

DQN settings for playing Atari games.

DQN settings for a more difficult Atari game.
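For reference, the standard Atari DQN setup (from the DeepMind DQN papers): the last 4 frames are converted to grayscale, downsampled to 84×84, and stacked as the network input; a convolutional network outputs one Q value per legal action; and rewards are clipped to $[-1, 1]$ during training.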

DQN algorithm process

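As a rough illustration of this process, here is a minimal sketch of a DQN training loop in PyTorch with experience replay, epsilon-greedy exploration, and a periodically refreshed target network. The `ToyEnv` environment, the network sizes, and all hyperparameters are illustrative assumptions rather than the lecture's actual setup.

```python
import random, collections
import torch
import torch.nn as nn

class ToyEnv:
    """A made-up stand-in for an Atari emulator: 4-dim state, 2 actions."""
    def reset(self):
        self.t, self.state = 0, torch.zeros(4)
        return self.state
    def step(self, action):
        self.t += 1
        self.state = self.state + 0.1 * torch.randn(4)
        reward = 1.0 if action == 0 else 0.0      # dummy reward signal
        return self.state, reward, self.t >= 50   # (next state, reward, done)

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = collections.deque(maxlen=10_000)         # experience replay memory
gamma, eps, batch_size = 0.99, 0.1, 32

env = ToyEnv()
for episode in range(100):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.randrange(2)
        else:
            with torch.no_grad():
                a = q_net(s).argmax().item()
        s2, r, done = env.step(a)
        buffer.append((s, a, r, s2, done))        # store the transition
        s = s2

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            bs, ba, br, bs2, bd = zip(*batch)
            bs, bs2 = torch.stack(bs), torch.stack(bs2)
            ba = torch.tensor(ba)
            br = torch.tensor(br)
            bd = torch.tensor(bd, dtype=torch.float32)
            # TD target: r + gamma * max_a' Q_target(s', a'), zero beyond terminal states
            with torch.no_grad():
                target = br + gamma * (1.0 - bd) * target_net(bs2).max(dim=1).values
            q_sa = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    if episode % 10 == 0:                         # periodically refresh the target network
        target_net.load_state_dict(q_net.state_dict())
```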
Disadvantages of Q-learning:

(1) In some applications, when the number of states or actions is large, the Q function becomes very complex and hard to make converge. For example, in image applications the number of states is (number of possible pixel values)^(number of pixels). Such a method has no understanding of the images or the task, and relies purely on large amounts of data to converge.

(2) In many programs, such as chess programs, the reward (REWARD) is only the final result (win or loss), and there is no reward to compute for every intermediate step.

Policy gradient

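In the standard policy-gradient (REINFORCE) formulation, the policy $\pi_{\theta}(a \mid s)$ is optimized directly by gradient ascent on the expected return $J(\theta)$:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t\right],$$

where $G_t$ is the return following time $t$. Only trajectory-level returns are needed, which suits problems such as chess where the reward is just the final outcome.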
Actor-Critic algorithm:
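In one common form, a critic network estimates a value function $V_{w}(s)$, and the actor $\pi_{\theta}$ is updated using the one-step TD error as an estimate of the advantage:

$$\delta_t = r_t + \gamma\, V_{w}(s_{t+1}) - V_{w}(s_t), \qquad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t),$$

while the critic parameters $w$ are trained to reduce $\delta_t^{2}$.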

Summary

(1) The current development status of reinforcement learning: it can reach the human level or surpass humans in some specific tasks, but there is a gap between it and humans in some relatively complex tasks, such as autonomous driving.

(2) The gap with humans may not be entirely attributable to the algorithm; physical limitations of sensors and machinery are also decisive factors.

(3) Another gap between machines and humans is that humans have some basic concepts. Based on these concepts, humans can learn a lot with very little training, but machines can only learn through large-scale data.

(4) However, machines are fast and never tire. As long as there is a steady stream of data, it can be expected that machines will do better than humans on specific tasks.


Original article: blog.csdn.net/qq_45654306/article/details/113448807