Reinforcement learning (DRL) -- value learning (DQN and the SARSA algorithm)

Look at the title first: this post covers value learning (the other approach is policy learning).

1. DQN and Q-learning

1. DQN

  • We want to know the optimal action-value function $Q_\star$, because it is like a prophet that can predict the future: at time $t$ it gives the expectation of the cumulative reward from $t$ to the end of the episode. The most effective way to approximately learn this "prophet" $Q_\star$ is the deep Q network (DQN), denoted $Q(s, a; \mathbf{w})$.
  • The parameter $\mathbf{w}$ is initialized randomly and then learned from "experience". The goal of learning: for all $s$ and $a$, make the DQN prediction $Q(s, a; \mathbf{w})$ as close as possible to $Q_\star(s, a)$.


  • The output of DQN is one Q value for each action in the discrete action space $\mathcal{A}$ (see the sketch below).
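As a concrete illustration, here is a minimal sketch of such a network; the sizes `state_dim=4` and `num_actions=2` and the two-layer architecture are illustrative assumptions, not taken from the original post.

```python
import torch
from torch import nn

# Minimal DQN sketch: input is a state s, output is one Q value per action
# in the discrete action space A (sizes are illustrative placeholders).
state_dim, num_actions = 4, 2
dqn = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),   # Q(s, a; w) for every a in A
)

s = torch.zeros(state_dim)        # a dummy state
q_values = dqn(s)                 # shape: (num_actions,)
greedy_action = q_values.argmax().item()
```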

2. Temporal difference (TD) algorithm

  • The most commonly used algorithms for training DQN are temporal difference (TD) algorithms.
  • TD is a large class of algorithms; the most common are Q-learning and SARSA. The purpose of Q-learning is to learn the optimal action-value function $Q_\star$, and the purpose of SARSA is to learn the action-value function $Q_\pi$.
  • The purpose of a TD algorithm is to decrease the loss $L(\mathbf{w}) = \frac{1}{2}(\hat{q}-\hat{y})^2$, where $\hat{q}$ is the network's prediction and $\hat{y}$ is the TD target.
  • Training process: compute the TD target from the observed reward, form the TD error, and update $\mathbf{w}$ by gradient descent (a reconstructed version of the formulas is given after this list).
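The original post left the training formulas in figures; as a reconstruction, the standard Q-learning TD update for DQN (with learning rate $\alpha$ as an assumed symbol) is:

$$
\hat{y}_t = r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a; \mathbf{w}), \qquad
\hat{q}_t = Q(s_t, a_t; \mathbf{w}), \qquad
\mathbf{w} \leftarrow \mathbf{w} - \alpha\,(\hat{q}_t - \hat{y}_t)\,\nabla_{\mathbf{w}} Q(s_t, a_t; \mathbf{w}).
$$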

3. Q-learning

  • Tabular method: store the action values in a table, denoted $\tilde{Q}$.

  • Collect training data with $\epsilon$-greedy (also known as the behavior policy).

  • Use experience replay to update the table $\tilde{Q}$ (a reconstructed form of the update is shown after this list).
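For reference, a standard form of the tabular updates the figures referred to (a reconstruction, not copied from the original): with probability $1-\epsilon$ the $\epsilon$-greedy behavior policy takes $\arg\max_a \tilde{Q}(s_t, a)$ and otherwise samples an action uniformly at random; each quadruple $(s_t, a_t, r_t, s_{t+1})$ then updates the table by

$$
\hat{y}_t = r_t + \gamma \max_{a \in \mathcal{A}} \tilde{Q}(s_{t+1}, a), \qquad
\tilde{Q}(s_t, a_t) \leftarrow (1-\alpha)\,\tilde{Q}(s_t, a_t) + \alpha\,\hat{y}_t .
$$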

4. On-policy and Off-policy

First, introduce the behavior policy and the target policy:

  • Behavior policy: its role is to collect experience, i.e., the observed states, actions, and rewards. The most commonly used behavior policy is $\epsilon$-greedy.
  • Target policy: the purpose of reinforcement learning is to obtain a policy function and use it to control the agent; this policy function is called the target policy.
  • The Q-learning algorithm in this chapter can use any behavior policy to collect quadruples $(s_t, a_t, r_t, s_{t+1})$ and then use them to train the target policy, i.e., the DQN.

On-policy (same policy) vs. off-policy (different policy)

  • On-policy means the behavior policy and the target policy are the same.
  • Off-policy means the behavior policy and the target policy are different; the DQN training in this chapter is off-policy.

2. SARSA Algorithm

The purpose of Q-learning is to learn the optimal action-value function $Q_\star$. The purpose of SARSA is to learn the action-value function $Q_\pi(s, a)$. Nowadays $Q_\pi$ is usually used to evaluate how good a policy is, rather than to control the agent directly.

1. SARSA with the tabular method

The goal of the SARSA algorithm is to learn a table $q$ as an approximation of the action-value function $Q_\pi$.
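A standard form of the tabular SARSA update the figure referred to (a reconstruction, with learning rate $\alpha$): observe $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$, where $a_{t+1}$ is sampled from the policy $\pi(\cdot \mid s_{t+1})$, then

$$
\hat{y}_t = r_t + \gamma\, q(s_{t+1}, a_{t+1}), \qquad
q(s_t, a_t) \leftarrow (1-\alpha)\, q(s_t, a_t) + \alpha\,\hat{y}_t .
$$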

2. SARSA with a neural network

Use a neural network $q(s, a; \mathbf{w})$ to approximate $Q_\pi(s, a)$; the network $q(s, a; \mathbf{w})$ is called the value network. Initialize $\mathbf{w}$ randomly at first, and then update $\mathbf{w}$ with the SARSA algorithm.
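A minimal sketch of one SARSA update step for such a value network, assuming the illustrative sizes from the DQN sketch above; names like `value_net` and the learning rate are placeholders, not from the original post.

```python
import torch

# q(s, a; w): a small value network over a discrete action space (illustrative sizes).
value_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-3)
gamma = 0.99

def sarsa_step(s, a, r, s_next, a_next):
    """One SARSA TD update: y = r + gamma * q(s', a'; w)."""
    q_sa = value_net(s)[a]                        # prediction q(s, a; w)
    with torch.no_grad():                         # the TD target is treated as a constant
        y = r + gamma * value_net(s_next)[a_next]
    loss = 0.5 * (q_sa - y) ** 2                  # L(w) = 1/2 (q_hat - y_hat)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```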

3. Advanced Techniques for Value Learning

1. Experience replay

  • One benefit of experience replay is that it breaks the dependence between consecutive samples in a sequence.
  • Another benefit is that collected experience is reused instead of being discarded after a single use, so the same performance can be reached with fewer samples (a minimal buffer sketch follows below).
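A minimal replay-buffer sketch; the capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores quadruples (s_t, a_t, r_t, s_{t+1}) and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experience is evicted when full

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```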

Note:

  • The data in the experience-replay array are all collected by controlling the agent with a behavior policy. While collecting experience we keep improving the policy, so the behavior policy that collected earlier experience becomes outdated and differs from the current policy we want to update, i.e., the target policy. The target policy we really want to learn is therefore different from the outdated behavior policy.
  • For example, Q-learning and the deterministic policy gradient (DPG) are off-policy. Since they allow the behavior policy to differ from the target policy, experience collected by outdated behavior policies can be reused, so experience replay is applicable to them.
  • In contrast, SARSA, REINFORCE, and A2C are all on-policy. They require that experience be collected by the current target policy rather than reusing outdated experience, so experience replay does not apply to them.

Prioritized experience replay:

  • Prioritized experience replay gives each quadruple a weight and then samples non-uniformly according to the weights. If DQN's value estimate for $(s_j, a_j)$ is inaccurate, i.e., $Q(s_j, a_j; \mathbf{w})$ is far from $Q_\star(s_j, a_j)$, then the quadruple $(s_j, a_j, r_j, s_{j+1})$ should receive a higher weight.
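One common choice of weights, taken from the prioritized experience replay literature rather than from the original figures: set the priority of quadruple $j$ proportional to its absolute TD error $|\delta_j|$ and sample it with probability

$$
p_j \propto |\delta_j| + \epsilon, \qquad P(j) = \frac{p_j^{\alpha}}{\sum_k p_k^{\alpha}},
$$

where $\epsilon$ and $\alpha$ are small tuning constants; frequently sampled quadruples then have their learning rate scaled down by an importance-sampling weight to keep the updates roughly unbiased.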

2. The overestimation problem and its solutions

DQNs trained with Q-learning overestimate the true action values, and the overestimation is usually non-uniform. Q-learning produces overestimation for two reasons:

  • First, bootstrapping leads to the propagation of bias;
  • Second, maximization causes the TD target to overestimate the true value.
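A one-line justification of the second point (added here for clarity, not in the original): suppose the DQN estimates carry zero-mean noise, $Q(s, a; \mathbf{w}) = Q_\star(s, a) + \epsilon_a$ with $\mathbb{E}[\epsilon_a] = 0$; since the max is convex,

$$
\mathbb{E}\Big[\max_{a} Q(s, a; \mathbf{w})\Big] \;\ge\; \max_{a}\,\mathbb{E}\big[Q(s, a; \mathbf{w})\big] \;=\; \max_{a} Q_\star(s, a),
$$

so the maximization inside the TD target is biased upward even when the individual estimates are unbiased.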

(1) Target network

To cut off bootstrapping, use another neural network to compute the TD target, instead of letting DQN compute its own TD target. This other network is called the target network, denoted $Q(s, a; \mathbf{w}^-)$.
Its architecture is exactly the same as DQN's, but its parameters $\mathbf{w}^-$ are different from $\mathbf{w}$.

Using the target network to compute $\hat{y}$ avoids updating DQN with DQN's own estimates and reduces the harm caused by bootstrapping. However, this cannot completely eliminate bootstrapping, since the parameters of the target network still depend on DQN.
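For concreteness, the TD target computed with a target network takes the standard form below (a reconstruction, since the original formulas were in figures); the periodic synchronization $\mathbf{w}^- \leftarrow \mathbf{w}$, or a soft update $\mathbf{w}^- \leftarrow \tau\,\mathbf{w} + (1-\tau)\,\mathbf{w}^-$, is what keeps $\mathbf{w}^-$ dependent on DQN:

$$
\hat{y}_t = r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a; \mathbf{w}^-), \qquad
\mathbf{w} \leftarrow \mathbf{w} - \alpha\,\big(Q(s_t, a_t; \mathbf{w}) - \hat{y}_t\big)\,\nabla_{\mathbf{w}} Q(s_t, a_t; \mathbf{w}).
$$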

(2) Double Q-learning algorithm

Using a target network in the Q-learning algorithm alleviates the bias caused by bootstrapping, but it does not help with the overestimation caused by maximization. Double Q-learning (DDQN) builds on the target network to alleviate the overestimation caused by maximization.
Note:

  • Double Q-learning (the so-called DDQN) is just a TD algorithm; it simply trains DQN better.
  • Double Q-learning does not use a model different from DQN. There is only one model in this section, which is DQN.
  • We discuss only three TD algorithms for training DQN: the original Q-learning, Q-learning with a target network, and double Q-learning.

Introducing DDQN: action selection uses DQN itself, $a^\star = \arg\max_{a \in \mathcal{A}} Q(s_{t+1}, a; \mathbf{w})$, while action evaluation uses the target network, giving the TD target $\hat{y}_t = r_t + \gamma\, Q(s_{t+1}, a^\star; \mathbf{w}^-)$.
Gradient update: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\,\big(Q(s_t, a_t; \mathbf{w}) - \hat{y}_t\big)\,\nabla_{\mathbf{w}} Q(s_t, a_t; \mathbf{w})$.
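A minimal sketch of computing the double Q-learning TD target; `dqn` and `target_net` are assumed to be two networks with identical architecture (e.g., the DQN sketch from section 1), and `r`, `s_next`, `gamma` are illustrative arguments.

```python
import torch

def ddqn_target(dqn, target_net, r, s_next, gamma=0.99):
    """Double Q-learning target: select with DQN, evaluate with the target network."""
    with torch.no_grad():
        a_star = dqn(s_next).argmax()                   # selection: current DQN
        y_hat = r + gamma * target_net(s_next)[a_star]  # evaluation: target network
    return y_hat
```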


Note: if you use the SARSA algorithm (for example, in actor-critic methods), the bootstrapping problem still exists, but there is no overestimation caused by maximization. For SARSA only the bootstrapping problem needs to be solved, so a target network should be applied to SARSA.

3. Dueling Network

The dueling network, like DQN, is an approximation of the optimal action-value function $Q_\star$; the only difference between the two is the neural network architecture.
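The original text does not spell the architecture out; in the standard dueling design the network splits into a state-value head $V(s)$ and an advantage head $A(s, a)$, combined as $Q(s, a) = V(s) + A(s, a) - \operatorname{mean}_a A(s, a)$. A minimal sketch with illustrative sizes:

```python
import torch
from torch import nn

class DuelingDQN(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_dim=4, num_actions=2, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_head = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, s):
        h = self.feature(s)
        v = self.value_head(h)
        a = self.advantage_head(h)
        return v + a - a.mean(dim=-1, keepdim=True)            # Q(s, a)
```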

4. Noisy network

A noisy network is a special neural network structure in which the parameters carry random noise. Noisy networks can be used in many deep reinforcement learning models, such as DQN. The noise encourages exploration, letting the agent try different actions, which helps it learn better policies.
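For concreteness, the standard parameterization from the noisy-network literature (added here, not spelled out in the original) replaces each parameter $w$ with

$$
w = \mu + \sigma \circ \xi,
$$

where $\mu$ and $\sigma$ are learned parameters and $\xi$ is noise (e.g., Gaussian) freshly sampled on each forward pass, so the amount of exploration is itself learned.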


Origin blog.csdn.net/qq_45889056/article/details/129654837