Read the RL paper with Dr. Zhang---DQN (ICML version)

Paper: Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." (2015). (ICML version)

Summary

This paper presents the first deep learning model (a CNN trained with a variant of Q-learning) that learns control policies directly from high-dimensional sensory input, reaching and in several games surpassing human-level performance.

Introduction

Deep learning has achieved excellent results in vision and speech recognition, but those successes had not yet carried over to reinforcement learning.
Reinforcement learning has no large hand-labelled training sets. RL algorithms must learn from sparse, noisy, and delayed scalar rewards: thousands of steps may separate an action from the reward it eventually produces, whereas in supervised learning each input is directly associated with its label. Another problem is that deep learning algorithms assume independent data samples, while the sequences encountered in reinforcement learning are highly correlated. In addition, in RL the data distribution changes as the agent interacts with the environment and learns new behaviours, unlike in DL, where the underlying data distribution is assumed fixed.
This paper combines a variant of Q-learning with a CNN architecture to solve sequential control problems directly from raw video data. To break the correlations between samples and cope with the non-stationary distribution, it uses experience replay, which smooths the training distribution over many past behaviours.
The goal is a single neural network architecture that requires no game-specific features and can be applied to many different games.

Background

The internal state of the game emulator cannot be observed. The input is only an image, and the true current state cannot be determined from a single frame alone. The input in this paper is therefore a sequence of images and actions, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and the aim is to learn strategies from such sequences.
The agent's goal is to select actions that maximize future rewards through interaction with the emulator. The discounted future return from time step $t$ is
$R_{t}=\sum_{t^{\prime}=t}^{T} \gamma^{t^{\prime}-t} r_{t^{\prime}}$
where $T$ is the time step at which the game terminates.
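As a tiny illustration of this return (not from the paper; the rewards and $\gamma = 0.99$ below are made up), it can be computed directly from a list of per-step rewards:

```python
# Minimal sketch: discounted return R_t for a finite episode,
# R_t = sum_{t'=t..T} gamma^(t'-t) * r_{t'} (rewards and gamma are made-up values).
def discounted_return(rewards, gamma=0.99, t=0):
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

print(discounted_return([1.0, 0.0, 1.0]))  # 1.0 + 0.99*0.0 + 0.99**2 * 1.0 = 1.9801
```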
The paper uses the Bellman equation as an iterative (value iteration) update of the Q value:
$Q_{i+1}(s, a)=\mathbb{E}_{s^{\prime} \sim \mathcal{E}}\left[r+\gamma \max _{a^{\prime}} Q_{i}\left(s^{\prime}, a^{\prime}\right) \mid s, a\right]$
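As an aside, this Bellman backup can be run exactly on a small tabular problem. The following numpy sketch uses a made-up 2-state, 2-action MDP purely for illustration:

```python
import numpy as np

# Toy tabular value iteration (made-up 2-state, 2-action MDP, not from the paper).
# P[s, a, s'] are transition probabilities, R[s, a] expected immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for i in range(200):
    # Q_{i+1}(s, a) = R(s, a) + gamma * E_{s'}[ max_{a'} Q_i(s', a') ]
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)  # approaches Q* as the number of iterations grows
```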
In practice the action-value function cannot be maintained separately for every sequence, so a function approximator is used to estimate it, $Q(s, a; \theta) \approx Q^{*}(s, a)$. A neural network is used as the approximator, which is therefore called a Q-network. It is trained by iteratively minimizing the following loss function:
$L_{i}\left(\theta_{i}\right)=\mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_{i}-Q\left(s, a ; \theta_{i}\right)\right)^{2}\right]$
where $y_{i}=\mathbb{E}_{s^{\prime} \sim \mathcal{E}}\left[r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta_{i-1}\right) \mid s, a\right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over states and actions (the behaviour distribution). When optimizing $L_{i}(\theta_{i})$, the parameters $\theta_{i-1}$ from the previous iteration are held fixed. Differentiating the loss with respect to the weights gives the following gradient:
$\nabla_{\theta_{i}} L_{i}\left(\theta_{i}\right)=\mathbb{E}_{s, a \sim \rho(\cdot) ; s^{\prime} \sim \mathcal{E}}\left[\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta_{i-1}\right)-Q\left(s, a ; \theta_{i}\right)\right) \nabla_{\theta_{i}} Q\left(s, a ; \theta_{i}\right)\right]$
Note: This method is model-free and off-policy.
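To make the loss and its gradient concrete, here is a rough PyTorch sketch of one minimization step of $L_i(\theta_i)$ (my own illustration, not the paper's code). `q_net` holds the current parameters $\theta_i$, and `q_net_prev` is a frozen copy holding $\theta_{i-1}$, used only to compute the target $y_i$:

```python
import torch
import torch.nn.functional as F

def dqn_loss_step(q_net, q_net_prev, optimizer, batch, gamma=0.99):
    """One gradient step on L_i(theta_i) for a sampled minibatch.

    batch = (states, actions, rewards, next_states, dones); dones marks
    terminal transitions, for which the target reduces to just r.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta_i) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), theta_{i-1} held fixed
    with torch.no_grad():
        max_next_q = q_net_prev(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q

    loss = F.mse_loss(q_sa, y)  # (y_i - Q(s, a; theta_i))^2 averaged over the batch
    optimizer.zero_grad()
    loss.backward()             # gradient w.r.t. theta_i only
    optimizer.step()
    return loss.item()
```

The `(1 - dones)` factor simply truncates the target to $r$ at terminal transitions, which the expectation above leaves implicit.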

Deep reinforcement learning

In contrast to the purely online updates of TD-Gammon, this paper uses experience replay: the agent's experience at each time step, $e_t = (s_t, a_t, r_t, s_{t+1})$, is stored in a replay memory $D = e_1, \ldots, e_N$. The algorithm proceeds as follows:
[Figure: Algorithm 1 from the paper, deep Q-learning with experience replay (pseudocode)]
In the inner loop of the algorithm, Q-learning updates are applied to minibatches sampled from the replay memory. A preprocessing function $\phi$ maps the variable-length history to a fixed-length input representation.
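As a rough sketch of what $\phi$ might look like (the paper grayscales and downsamples each frame to about 84x84 and stacks the last 4 frames; the resizing below is a naive simplification of that):

```python
import numpy as np
from collections import deque

def preprocess(frame_rgb, out_hw=(84, 84)):
    """Simplified stand-in for the paper's frame preprocessing:
    grayscale an RGB frame and nearest-neighbour resize it to 84x84."""
    gray = frame_rgb.mean(axis=2)
    rows = np.linspace(0, gray.shape[0] - 1, out_hw[0]).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, out_hw[1]).astype(int)
    return gray[np.ix_(rows, cols)]

class PhiStack:
    """phi: map the recent history to a fixed-length input by stacking
    the last k preprocessed frames into a k x 84 x 84 array."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def __call__(self, frame_rgb):
        self.frames.append(preprocess(frame_rgb))
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(self.frames[-1])  # pad at the start of an episode
        return np.stack(self.frames, axis=0)
```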
Compared with standard online Q-learning, this approach has several advantages:
1. Each step of experience can potentially be reused in many weight updates, which improves data efficiency.
2. Learning directly from consecutive samples is inefficient because the samples are strongly correlated; sampling randomly from the replay memory breaks these correlations.
3. With on-policy learning, the current parameters determine the next data sample used to train those same parameters. This feedback loop can trap the parameters in a poor local optimum or cause divergence. With experience replay the learning is off-policy: the parameters that generated a stored sample generally differ from the parameters currently being updated, and the behaviour distribution is averaged over many previous states, which smooths out learning.
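Putting the pieces together, below is a condensed sketch of deep Q-learning with experience replay in the spirit of Algorithm 1. It is my own reconstruction under several assumptions: `env` is a hypothetical emulator interface with `reset()`, `step(action)` and `num_actions`, `phi` is a frame-stacking preprocessor as above, and `dqn_loss_step` is the loss sketch from the background section.

```python
import random
from collections import deque

import numpy as np
import torch

class ReplayMemory:
    """Replay memory D storing transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform random minibatch of stored transitions
        states, actions, rewards, next_states, dones = zip(*random.sample(self.buffer, batch_size))
        return (torch.as_tensor(np.stack(states), dtype=torch.float32),
                torch.as_tensor(actions, dtype=torch.int64),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(np.stack(next_states), dtype=torch.float32),
                torch.as_tensor(dones, dtype=torch.float32))

def train(env, q_net, q_net_prev, optimizer, phi,
          num_episodes=1000, batch_size=32, epsilon=0.1, gamma=0.99):
    memory = ReplayMemory()
    for episode in range(num_episodes):
        state = phi(env.reset())
        done = False
        while not done:
            # epsilon-greedy action selection on Q(s, .; theta_i)
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                with torch.no_grad():
                    q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q_values.argmax(dim=1))
            obs, reward, done = env.step(action)
            next_state = phi(obs)
            memory.push((state, action, reward, next_state, float(done)))
            state = next_state
            # inner loop: sample a minibatch from D and take one Q-learning step
            if len(memory.buffer) >= batch_size:
                dqn_loss_step(q_net, q_net_prev, optimizer, memory.sample(batch_size), gamma)
        # refresh the frozen copy theta_{i-1} used to compute the targets
        q_net_prev.load_state_dict(q_net.state_dict())
```

Refreshing `q_net_prev` once per episode is an arbitrary choice for this sketch; in the paper $\theta_{i-1}$ is simply the parameter vector from the previous iteration.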

Origin blog.csdn.net/weixin_42988382/article/details/107135281