(2) Deep Reinforcement Learning Foundations [Value Learning]

Value-Based Reinforcement Learning

Review

Definition: Discounted return (cumulative discounted future reward)

· $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$ (a short computational sketch follows this list).

· The return depends on actions $A_t, A_{t+1}, A_{t+2}, \ldots$ and states $S_t, S_{t+1}, S_{t+2}, \ldots$
· Actions are random: $P[A = a \mid S = s] = \pi(a \mid s)$.    (policy function)
· States are random: $P[S' = s' \mid S = s, A = a] = p(s' \mid s, a)$.    (state transition)
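
As a small worked example, the sketch below computes $U_t$ from a list of rewards; the reward values and the discount factor $\gamma = 0.9$ are made up purely for illustration:

```python
# Minimal sketch: computing the discounted return U_t from a reward sequence.
# The reward list and gamma below are illustrative, not from the text.

def discounted_return(rewards, gamma=0.9):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    u = 0.0
    # Sum backwards so each step is one multiply-add: U_t = R_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

print(discounted_return([1.0, 0.0, 2.0, 3.0]))  # 1 + 0.9*0 + 0.81*2 + 0.729*3 = 4.807
```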

Definition: Action-value function for policy $\pi$.
· $Q_{\pi}(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$.

· The expectation is taken w.r.t. actions $A_{t+1}, A_{t+2}, A_{t+3}, \ldots$ and states $S_{t+1}, S_{t+2}, S_{t+3}, \ldots$
· Integrate out everything except the observations $A_t = a_t$ and $S_t = s_t$.

Definition: Optimal action-value function
· $Q^{*}(s_t, a_t) = \max_{\pi} Q_{\pi}(s_t, a_t)$.
· No matter which policy function $\pi$ is used, the result of taking $a_t$ in state $s_t$ cannot be better than $Q^{*}(s_t, a_t)$.

1. Deep Q-Network (DQN)

Goal: Win the game ($\approx$ maximize the total reward).

Question: If we know $Q^{*}(s, a)$, what is the best action?
· Obviously, the best action is $a^{*} = \arg\max_{a} Q^{*}(s, a)$.
    ($Q^{*}$ is an indication of how good it is for the agent to pick action $a$ while being in state $s$.)
$Q^{*}$ is like a prophet that can always tell the agent which action to take. In reality, however, we do not have such an omniscient prophet, so we have to approximate it.

Challenge: We do not know $Q^{*}(s, a)$.
· Solution: Deep Q-Network (DQN).
· Use a neural network $Q^{*}(s, a; w)$ to approximate $Q^{*}(s, a)$.

Here $w$ denotes the parameters of the neural network, $s$ is the input, and the output is a vector of values: the predicted scores of all possible actions. We train the network through rewards, and its scoring gradually improves and becomes better and better.

Deep Q-Network:
· Input shape: the size of the screenshot.
· Output shape: the dimension of the action space (a score for each action).

Question: Based on the predictions, what should the action be?
Answer: The action with the highest predicted score, i.e., $a = \arg\max_{a} Q^{*}(s, a; w)$.
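
As a concrete illustration, here is a minimal PyTorch sketch of such a network; the state dimension, action dimension, and hidden-layer size are hypothetical placeholders rather than values from the text:

```python
# Minimal DQN sketch (PyTorch). state_dim / action_dim / layer sizes are
# illustrative assumptions, not values from the original post.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        # Input: a flattened state (e.g., features of a screenshot).
        # Output: one score Q*(s, a; w) for every action a.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)          # shape: (batch, action_dim)

q_net = DQN(state_dim=4, action_dim=2)
s = torch.randn(1, 4)               # one observed state
scores = q_net(s)                   # predicted score for each action
a = scores.argmax(dim=1)            # pick the action with the highest score
```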

2. Temporal Difference (TD) Learning

The most commonly used method for training a DQN is the TD algorithm.

Example

· I want to drive from NYC to Atlanta.
· The model $Q(w)$ estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

· Make a prediction: $q = Q(w)$, e.g., $q = 1000$.

· Finish the trip and get the target $y$, e.g., $y = 860$.

· Loss: $L = \frac{1}{2}(q - y)^{2}$.

· Gradient: $\frac{\partial L}{\partial w} = \frac{\partial q}{\partial w} \cdot \frac{\partial L}{\partial q} = (q - y) \cdot \frac{\partial Q(w)}{\partial w}$.

· Gradient descent: $w_{t+1} = w_{t} - \alpha \cdot \frac{\partial L}{\partial w}\big|_{w = w_{t}}$ (a numeric sketch of this update appears after this list).

· Can I update the model before finishing the trip?
· Can I get a better $w$ as soon as I arrive at DC?
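
Before answering these questions, here is a minimal numeric sketch of the full-trip update above; modelling $Q(w)$ as a single scalar parameter and the learning rate $\alpha = 0.1$ are assumptions made purely for illustration:

```python
# Naive update: wait until the trip is finished, then fit the prediction to the
# observed total time. Treating Q(w) as the scalar w itself is an illustrative
# simplification, and alpha is chosen arbitrarily.
w = 1000.0                  # current estimate of NYC -> Atlanta travel time
alpha = 0.1                 # learning rate

q = w                       # prediction: q = Q(w) = 1000
y = 860.0                   # target, known only after arriving in Atlanta

grad = (q - y) * 1.0        # dL/dw = (q - y) * dQ(w)/dw, and dQ(w)/dw = 1 here
w = w - alpha * grad        # gradient descent step
print(w)                    # 986.0 -- moved toward the observed 860
```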

Temporal Difference (TD) Learning

· Model's estimate:

    NYC to Atlanta: 1000 minutes (estimate).

· I arrive at DC; actual time cost:

    NYC to DC: 300 minutes (actual).

· The model's estimate for the remaining trip:

    DC to Atlanta: 600 minutes (estimate).

· Model's original estimate: $Q(w) = 1000$ minutes.

· Updated estimate: $300 + 600 = 900$ minutes (TD target).

· The TD target $y = 900$ is a more reliable estimate than $1000$.

· Loss: $L = \frac{1}{2}\big(\underbrace{Q(w) - y}_{\text{TD error}}\big)^{2}$.

· Gradient: $\frac{\partial L}{\partial w} = \underbrace{(1000 - 900)}_{\text{TD error}} \cdot \frac{\partial Q(w)}{\partial w}$.

· Gradient descent: $w_{t+1} = w_{t} - \alpha \cdot \frac{\partial L}{\partial w}\big|_{w = w_{t}}$.
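
The same toy model, now updated at DC with the TD target instead of waiting for the final travel time (again a sketch with a scalar parameter and an arbitrary learning rate):

```python
# TD update: improve w as soon as we reach DC, before the trip is over.
w = 1000.0                          # Q(w): estimated NYC -> Atlanta time
alpha = 0.1

actual_nyc_to_dc = 300.0            # observed segment
est_dc_to_atl = 600.0               # model's estimate for the remaining segment
y = actual_nyc_to_dc + est_dc_to_atl    # TD target = 900

td_error = w - y                    # 1000 - 900 = 100
w = w - alpha * td_error * 1.0      # dQ(w)/dw = 1 for this scalar model
print(w)                            # 990.0
```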

3. Why does TD learning work?

· Model's estimates:
    NYC to Atlanta: 1000 minutes.
    DC to Atlanta: 600 minutes.
    $\Rightarrow$ NYC to DC: $1000 - 600 = 400$ minutes.

· Ground truth:
    NYC to DC: 300 minutes.

· TD error: $\delta = 400 - 300 = 100$.

4. How to apply TD learning to DQN?

· In the "driving time" example, we have the relation:
    $\underbrace{T_{\text{NYC}\to\text{ATL}}}_{\text{model's estimate}} \approx \underbrace{T_{\text{NYC}\to\text{DC}}}_{\text{actual time}} + \underbrace{T_{\text{DC}\to\text{ATL}}}_{\text{model's estimate}}.$

This relation is exactly the form used by the TD algorithm.

· In deep reinforcement learning, the analogous relation is:
    $Q(s_t, a_t; w) \approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; w).$
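
Here is a minimal PyTorch sketch of one TD training step built on this relation. The transition values, discount factor, and optimizer settings below are made up for illustration, and using the greedy next action $a_{t+1} = \arg\max_a Q(s_{t+1}, a; w)$ for the TD target is the standard DQN choice, assumed here:

```python
# One TD training step for a DQN (sketch). The transition (s_t, a_t, r_t, s_{t+1})
# below is fabricated; taking the max over next-state scores is the usual DQN choice.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

s_t = torch.randn(1, 4)            # current state
a_t = torch.tensor([0])            # action that was taken
r_t = torch.tensor([1.0])          # observed reward
s_next = torch.randn(1, 4)         # next state

q_pred = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; w)

with torch.no_grad():              # the TD target is treated as a constant
    y = r_t + gamma * q_net(s_next).max(dim=1).values         # r_t + gamma * max_a Q(s_{t+1}, a; w)

loss = 0.5 * (q_pred - y).pow(2).mean()    # L = 1/2 * (TD error)^2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```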

Proof
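
A sketch of the standard argument: the return satisfies the recursion

$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma\, U_{t+1}.$

Taking conditional expectations given $S_t = s_t, A_t = a_t$ (and using the tower property of conditional expectation for the second term) gives

$Q_{\pi}(s_t, a_t) = E[R_t + \gamma\, U_{t+1}] = E[R_t] + \gamma\, E[Q_{\pi}(S_{t+1}, A_{t+1})].$

Approximating both expectations with a single observed transition $(s_t, a_t, r_t, s_{t+1})$ (a Monte Carlo estimate) and replacing $Q_{\pi}$ with the network $Q(\cdot,\cdot\,;w)$ yields

$Q(s_t, a_t; w) \approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; w),$

which is the TD relation above.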




5. Summary

Definition: Optimal action-value function.

· $Q^{*}(s_t, a_t) = \max_{\pi}\, E[U_t \mid S_t = s_t, A_t = a_t].$

The $Q^{*}$ function can score every action given the current state, and the score reflects how good each action is. If we had the $Q^{*}$ function, we could use it to control the agent: at every time step, the agent simply selects and executes the action with the highest score. However, we do not have the $Q^{*}$ function. The purpose of value learning is to learn a function that approximates $Q^{*}$, which is what DQN does.

DQN: Approximate $Q^{*}(s, a)$ using a neural network (the DQN).

· $Q^{*}(s, a; w)$ is a neural network parameterized by $w$.

· Input: the observed state $s$.

· Output: scores for all actions $a \in \mathcal{A}$.
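
As a closing sketch of how such a network controls the agent, here is a minimal greedy control loop; the Gymnasium `CartPole-v1` environment and the untrained network are stand-ins chosen only to make the snippet self-contained:

```python
# Greedy control with a (randomly initialized) Q-network, as a structural sketch.
# CartPole-v1 is used purely as a stand-in environment; any Gymnasium env with a
# discrete action space would do. A real agent would train the network first.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))

state, _ = env.reset()
done = False
while not done:
    with torch.no_grad():
        scores = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    action = int(scores.argmax(dim=1))     # a_t = argmax_a Q(s_t, a; w)
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```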


Original post: blog.csdn.net/weixin_49716548/article/details/125964312