Reinforcement Learning, Chapter VII

1. Policy iteration, value iteration, and generalized policy iteration all share a prerequisite: the agent must know the environment's state transition probabilities, i.e., they address the model-based problem.
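
As a reminder of what "knowing the transition probabilities" buys us, here is a minimal value-iteration sketch. The transition model P[s][a] as a list of (prob, next_state, reward) tuples is an assumed data structure for illustration, not something from the book.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Value iteration for a known MDP (model-based setting).

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
    i.e. the transition model the agent is required to know.
    """
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                Q[s, a] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)  # state values and the greedy policy
```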

2. Monte Carlo method: estimate an expected value by random sampling, i.e., approximate the expectation by averaging over real sample trajectories. Theoretical justification: the law of large numbers.
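
A minimal illustration of the law-of-large-numbers argument: average many sampled returns and the mean approaches the expectation. `sample_episode_return` below is a hypothetical stand-in for running one episode in an environment.

```python
import random

def sample_episode_return():
    # Hypothetical stand-in for one episode: a noisy return
    # whose true expectation is 1.0.
    return 1.0 + random.gauss(0.0, 1.0)

def monte_carlo_estimate(n_samples=100_000):
    # Law of large numbers: the sample mean converges to the expectation.
    total = 0.0
    for _ in range(n_samples):
        total += sample_episode_return()
    return total / n_samples

print(monte_carlo_estimate())  # gets closer to 1.0 as n_samples grows
```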

3. Exploration and exploitation: exploration means choosing actions that differ from the current policy in order to gather information we do not yet have; exploitation means continuing to use the current best policy so as to collect as much return as possible.
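
A common way to balance the two is an epsilon-greedy policy; the sketch below is a generic version of that idea, not something specific to this chapter.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: list of action values for the current state."""
    if random.random() < epsilon:
        # Exploration: try a (possibly non-greedy) random action.
        return random.randrange(len(q_values))
    # Exploitation: stick with the currently best-looking action.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```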

4. Drawback of Monte Carlo methods: the value estimates have high variance, so the mean estimate takes longer to converge. The variance comes from the sampling itself: every sampled trajectory is different (like repeated dice rolls), and the same state can be reached several times within one trajectory. If the computation does not distinguish the first arrival at a state from later arrivals, that is the every-visit approach; switching to the first-visit approach reduces the variance, but not by much.

Advantage: given a large enough amount of data, the estimate of the expected value is unbiased.
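
A sketch of the first-visit vs. every-visit distinction, assuming each episode is given as a list of (state, reward) pairs; this data layout is my own choice for illustration.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9, first_visit=True):
    """Monte Carlo state-value estimation.

    episodes: list of episodes, each a list of (state, reward) pairs.
    first_visit=True counts only the return from the first occurrence of a
    state in each episode; first_visit=False counts every occurrence.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t for every step, front to back.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()
        seen = set()
        for state, G in step_returns:
            if first_visit and state in seen:
                continue  # skip repeated visits within this episode
            seen.add(state)
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```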

5. Temporal-difference (TD) methods and Sarsa: the TD method combines dynamic programming with the Monte Carlo method, using the idea of optimal substructure (bootstrapping from existing value estimates).

However, TD reduces variance at the price of a larger bias, whereas the Monte Carlo method has very small bias but large variance; so TD's results are not necessarily better than MC's.
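
The one-step Sarsa update written out as a sketch; Q is assumed to be a dict mapping (state, action) pairs to values.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One-step TD (Sarsa) update.

    The target r + gamma * Q[(s_next, a_next)] bootstraps from the current
    estimate instead of waiting for the full Monte Carlo return, which
    lowers variance but introduces bias.
    """
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```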

6. Q-learning: it differs from Sarsa in only one place. Sarsa follows the real interaction sequence and estimates values using the action that was actually taken, while Q-learning uses the action with the largest value at the next state, which need not follow the interaction sequence.

There is "overestimation" of the problem, the time for action instead of using interactive action using the best value. 200 in two steps, with regard to convergence proof temporarily did not understand? ? ? ? ?

7. The DQN algorithm has two salient points:

(1) Replay buffer (experience replay):

  Q-learning interacts and updates based on the current policy; each time, the model learns from the data produced by the interaction and the samples are then discarded. This causes two problems. First, the samples within an interaction sequence are correlated: machine-learning models trained by maximum likelihood assume the training samples are independent and identically distributed, and when that assumption fails the performance drops sharply. Second, the interaction data are used inefficiently: training takes many iterations to converge, and throwing samples away after a single use wastes the time spent collecting them.

  Experience replay saves the interaction samples, i.e. the current state s, the action a, and the long-term cumulative reward v. The buffer size is set fairly large, up to around one million samples, so that new samples gradually overwrite old ones, and learning then draws random samples from the buffer.
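
A minimal replay-buffer sketch. Storing full transitions (s, a, r, s_next, done) is a common variant and my own assumption here; random mini-batches break the correlation in the interaction sequence and let each sample be reused many times.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; new samples overwrite the oldest ones."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of the
        # interaction sequence and reuses each stored sample many times.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```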

(2) Target network:

  A target network with the same structure as the behavior (online) network is introduced; its parameters are a delayed copy of the behavior network's parameters, and the target value is computed by the target network. The behavior network's estimated value is compared against this target value, and that comparison is used to update the behavior network's parameters.
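
A sketch of the delayed (hard) target-network update, assuming PyTorch-style Q-networks (torch.nn.Module mapping a batch of states to per-action values); the function names and the idea of syncing every fixed number of steps are my own assumptions for illustration.

```python
import copy
import torch

def make_target_network(online_net):
    # Same architecture as the behavior network; starts as a copy and is
    # only refreshed by delayed synchronization, never by gradients.
    target_net = copy.deepcopy(online_net)
    target_net.eval()
    return target_net

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # Target values come from the target network, not the behavior network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def sync_target(online_net, target_net):
    # Delayed (hard) update: copy over the behavior network's parameters.
    target_net.load_state_dict(online_net.state_dict())
```

The loss then compares these targets with the behavior network's own Q-estimates for the actions actually taken, and only the behavior network receives gradient updates.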


Origin: www.cnblogs.com/lin-kid/p/11520194.html