RL Classification


1. Classification Standard One

From "whether environmental modeling 'departure, RL can be divided into Model-Freeand Model-Based.
The difference between the two is the agent can not be environmental modeling, that is, to learn a predictable state transition function and benefits.
If we as environmental modeling, the agent can predict the situation in the various options in advance, and the process from these projections learn more experience, and then applied to the actual behavior. Is the most famous AlphaZeroexample, in sample efficiencya significant advantage. In this case, though we know all the possible opponents of the environment, but opponents do not know what will come true position. Therefore, the opponent's tactics imagine the state of the environment can transition probability.
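A minimal sketch of this idea (the `model`, `reward_model`, and `value_fn` callables are illustrative assumptions, not anything from the original post): with a learned environment model, the agent can score each action by an imagined one-step look-ahead before acting in the real environment.

```python
import numpy as np

def plan_one_step(state, actions, model, reward_model, value_fn, gamma=0.99):
    """Model-based look-ahead: score each action with the learned model
    instead of interacting with the real environment."""
    scores = []
    for a in actions:
        next_state = model(state, a)        # learned state-transition prediction
        r = reward_model(state, a)          # learned reward prediction
        scores.append(r + gamma * value_fn(next_state))
    return actions[int(np.argmax(scores))]  # act greedily w.r.t. imagined outcomes
```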

Sample usage
Every time the policy changes, the previously generated samples must be discarded. If each interaction between the agent and the environment takes a long time, throwing away a large number of samples at every update is not worthwhile (see the sketch below).
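A rough illustration of that point, assuming hypothetical `collect_trajectories` and `update_policy` helpers passed in as callables: an on-policy loop drops its sample batch after every policy update, because the old samples no longer match the new policy.

```python
def on_policy_training(policy, env, collect_trajectories, update_policy,
                       num_iterations=100):
    """On-policy loop: samples are only valid for the policy that generated them."""
    for _ in range(num_iterations):
        batch = collect_trajectories(policy, env)  # roll out the *current* policy
        policy = update_policy(policy, batch)      # one improvement step
        # After the update, the old batch no longer reflects the new policy,
        # so it is discarded and fresh samples are collected next iteration.
    return policy
```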

The obvious drawback of Model-Based methods is that modeling itself is very difficult: there is a trade-off between variance and bias, and the cost of learning the model is also very high.

Model-Free methods do not have this concern, so they are easier to implement and more widely applied.

2. Classification Standard Two

From the "learning objectives" set out, RL can be divided into learning policy, learning state value function, a function of the value of learning, learning environment.

2.1 Model-Free Methods under Classification Standard Two

Policy Optimization

That is, the goal of RL is to learn an optimal policy, denoted $\pi_{\theta}(a|s)$. There are two ways of learning it: one determines the optimal parameters directly by gradient ascent (taking the derivative of $J(\pi_{\theta})$), and the other finds a local maximum of $J(\pi_{\theta})$.
Both optimization methods have to be carried out On-Policy, that is, only samples generated by the latest policy are used.

Specific optimization algorithms include A2C/A3C and PPO.
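A minimal REINFORCE-style sketch of the gradient route (i.e., following the derivative of $J(\pi_{\theta})$). A tabular softmax policy over discrete states and actions is assumed here for illustration; this is not any specific algorithm from the post.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update. theta[s] holds action preferences for state s;
    episode is a list of (state, action, reward) generated by the current policy."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                # return from this step onward
        probs = softmax(theta[state])
        grad_log = -probs                     # d log pi(a|s) / d theta[state]
        grad_log[action] += 1.0
        theta[state] += alpha * G * grad_log  # gradient ascent on J(pi_theta)
    return theta
```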

Q-Learning

That is, the goal of RL is to learn an optimal action-value function, denoted $Q_{\theta}(s, a)$. As mentioned before, the value function can be written in recursive form: assuming the value function has stabilized, the value of any state can be obtained from the values of the other states, i.e., it satisfies the Bellman equation. Optimization based on this form is off-policy: the samples used in each update are not restricted to those generated by the current policy at the current time. Q-Learning is the algorithm that, at the next step, chooses the action with the maximum value:

$a(s) = \arg\max_{a} Q_{\theta}(s, a)$

Specific Q-learning algorithms include DQN and C51.
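A minimal tabular sketch of the update implied by the formula above (the Q table indexed by discrete states and actions is an illustrative assumption):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy target: bootstrap with max_a' Q(s', a'),
    regardless of which action the behaviour policy actually takes next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def greedy_action(Q, s):
    """a(s) = argmax_a Q(s, a), as in the equation above."""
    return int(np.argmax(Q[s]))
```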

On-policy and Off-policy

The terms on-policy (same policy) and off-policy (different policy) mean the following:

  • On-policy: the policy used to generate data is the same as the policy being evaluated and improved. For example, using a greedy policy both for sampling and for improvement.
  • Off-policy: the policy used to generate data is not the same as the policy being evaluated and improved.

Note

  1. Generating data refers to the sampling method used by the policy;
  2. Evaluating the policy refers to how the value function is computed;
  3. Improving the policy means obtaining a better policy $\pi(a|s)$.

Each of the two settings has its own corresponding policy evaluation algorithm. The former (on-policy) updates the value function entirely from the interaction sequence: we assume the values along the sampled sequence can be used directly in the estimate. The latter (off-policy) does not follow the interaction sequence exactly when updating the value function; instead, it replaces part of the original interaction sequence with segments generated by other policies. The idea behind Q-learning goes further: it plugs in the best value for that segment, so its update looks more like a step of value iteration, with the hope that each update builds on the accumulated optimal results of previous iterations.
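To make the contrast concrete, here is a side-by-side sketch of the two TD targets (tabular and purely illustrative): SARSA, the on-policy case, bootstraps with the action the behaviour policy actually takes next, while Q-learning, the off-policy case, bootstraps with the greedy maximum.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: uses the action a_next actually chosen by the current policy.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    # Off-policy: uses the best action at s_next, whatever the behaviour policy does.
    return r + gamma * np.max(Q[s_next])
```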
