Reinforcement Learning [Getting Started]

Concepts

Action: $a$, the action the agent takes.

State: $s$, the current state.

Reward: $R$, the reward returned by the environment.

$\pi(a|s)$: the policy function, a probability density function that gives the probability of taking each action in the current state.

$S' \sim p(\cdot|s, a)$: the state transition function, also a probability density function. Given the current state and action, it returns the probabilities of the possible next states (the environment contains randomness, e.g. the mobs move randomly).

Observe $s_1$ → sample $a_1$ from the policy function → the environment generates the next state $s_2$ and returns a reward $r_1$.
Then feed the new state $s_2$ back in and sample $a_2$ from the policy function, and so on.

Trajectory:
$$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T$$
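This interaction loop maps directly onto the OpenAI Gym library mentioned later in this post. Below is a minimal sketch, assuming the classic Gym `reset`/`step` signatures (newer Gym/Gymnasium versions return extra values) and using a random action in place of a learned policy:

```python
# Collect one trajectory (s, a, r), (s, a, r), ... from a Gym environment.
# CartPole-v1 and the random policy are stand-ins for illustration only.
import gym

env = gym.make("CartPole-v1")
s = env.reset()                        # observe s1
trajectory = []                        # will hold (s_t, a_t, r_t) triples

done = False
while not done:
    a = env.action_space.sample()      # stand-in for sampling a_t ~ pi(.|s_t)
    s_next, r, done, info = env.step(a)  # environment returns s_{t+1} and r_t
    trajectory.append((s, a, r))
    s = s_next                         # the new state becomes the next input

print(f"trajectory length: {len(trajectory)}")
```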

Return: the cumulative future reward
$$U_t = R_t + R_{t+1} + R_{t+2} + \cdots$$
$R_t$ and $R_{t+1}$ are not equally important; obviously $R_t$ matters more.

Discounted return:
$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$$
The discount rate $\gamma$ is a hyperparameter that has to be tuned by hand.
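A small sketch of how the discounted return can be computed for a finished trajectory, walking backwards with the recursion $U_t = R_t + \gamma U_{t+1}$ (the reward numbers are made up for illustration):

```python
# Compute the discounted return U_t for every step of a finite trajectory.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards: U_t = r_t + gamma * U_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
# [2.62, 1.8, 2.0]: the last return is 2.0, then 0 + 0.9*2.0, then 1 + 0.9*1.8
```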

Action-value function:

With the discounted return we can tell whether the game is heading toward a win or a loss; the larger the future return, the better.

But the discounted return is a random variable: it depends on all the states and actions from time $t$ onward. How do we evaluate it? Take the expectation! The future states and actions are integrated out by the expectation. The result also depends on the policy function, because different policies give different expectations.
$$Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid S_t = s_t, A_t = a_t\right].$$
Optimal action-value function:

How do we remove the dependence on the policy function? Take the maximum over policies!
$$Q^\star(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t).$$
This function evaluates the current action $a$: it tells us whether the current action is good or not.

State-value function:

Discrete case:
$$V_\pi(s_t) = \mathbb{E}_A\left[Q_\pi(s_t, A)\right] = \sum_a \pi(a|s_t)\cdot Q_\pi(s_t, a)$$
Continuous case:
$$V_\pi(s_t) = \mathbb{E}_A\left[Q_\pi(s_t, A)\right] = \int \pi(a|s_t)\cdot Q_\pi(s_t, a)\,da$$
The action $A$ is treated as a random variable, and taking the expectation eliminates it.

The state-value function evaluates the current situation and tells us whether we are likely to win or lose.

It can also evaluate how good the policy function $\pi$ is.
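A tiny numeric sketch of the discrete-case formula $V_\pi(s_t)=\sum_a \pi(a|s_t)\cdot Q_\pi(s_t,a)$; the probabilities and Q values below are made up purely for illustration:

```python
# V_pi(s_t) as the expectation of Q_pi(s_t, A) over the action A.
import numpy as np

pi_given_s = np.array([0.2, 0.5, 0.3])      # pi(a|s_t) over 3 actions
q_given_s = np.array([1.0, 4.0, -2.0])      # Q_pi(s_t, a) for the same actions

v_s = float(np.dot(pi_given_s, q_given_s))  # sum_a pi(a|s_t) * Q_pi(s_t, a)
print(v_s)                                  # 0.2*1 + 0.5*4 + 0.3*(-2) = 1.6
```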

How to control the agent? (A code sketch of both options follows this list.)

  1. Based on the policy function $\pi(a|s)$:
    • Observe the state $s_t$.
    • Sample a random action $a_t \sim \pi(\cdot|s_t)$.
  2. Based on the $Q^\star(s,a)$ function:
    • Observe the state $s_t$.
    • Evaluate every action and take the one that maximizes $Q^\star$: $a_t = \operatorname{argmax}_a Q^\star(s_t, a)$.
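A sketch of both control modes, assuming hypothetical helpers `policy(s)` (returning the vector $\pi(\cdot|s)$) and `q_star(s)` (returning $Q^\star(s,a)$ for every action) that would come from the learned functions:

```python
# Two ways of choosing an action, given hypothetical learned functions.
import numpy as np

rng = np.random.default_rng()

def act_with_policy(policy, s):
    probs = policy(s)                        # the vector pi(.|s_t)
    return rng.choice(len(probs), p=probs)   # sample a_t ~ pi(.|s_t)

def act_with_q_star(q_star, s):
    q_values = q_star(s)                     # Q*(s_t, a) for every action a
    return int(np.argmax(q_values))          # a_t = argmax_a Q*(s_t, a)
```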

**Reinforcement learning library:** OpenAI Gym

Reinforcement learning mainly learns either the policy function $\pi(a|s)$ or $Q^\star(s,a)$; with either of these two functions, the agent can be controlled.

Value Learning

Deep Q Network (DQN)

Use a neural network to approximate the $Q^\star(s,a)$ function.

Goal: the larger the total reward obtained by the end of the game, the better.

$Q^\star(s,a)$ can evaluate each action: it is the expectation of the future return, and the action with the larger expected value is the better one.

Use a neural network $Q(s,a;\mathbf{w})$ to approximate the $Q^\star(s,a)$ function, where $\mathbf{w}$ denotes the network parameters.
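A minimal sketch of such a network (PyTorch and the layer sizes are assumptions made here for illustration, not something the post prescribes): the input is the state vector and the output is one Q value per action, so a single forward pass scores every action.

```python
# Q(s, a; w) as a small neural network: state in, one Q value per action out.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q value for each action
        )

    def forward(self, s):
        return self.net(s)                  # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=2)  # e.g. CartPole-like dimensions
s = torch.randn(1, 4)                       # a made-up state
a = q_net(s).argmax(dim=1)                  # greedy action from Q(s, .; w)
```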

How to train DQN?

Temporal Difference (TD) Learning

Suppose I want a model to predict how long it takes to travel from point A to point B, and the model gives a prediction. If the only real measurement I have is the trip from A to C (C is a point between A and B), how can I train the model? With the TD algorithm! The model can still predict the time from C to B; add the actual time spent going from A to C, and use this sum (the TD target) for gradient descent.

So the TD algorithm can update the parameters without waiting for the game to finish.
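A numeric sketch of the driving example (all numbers are made up): the TD target mixes the observed A→C time with the model's remaining C→B estimate, and the prediction is nudged toward it.

```python
# One TD-style update for the travel-time example.
predict_A_to_B = 60.0      # model's estimate before driving (minutes)
observed_A_to_C = 25.0     # real, measured segment
predict_C_to_B = 30.0      # model's estimate for the rest of the trip

td_target = observed_A_to_C + predict_C_to_B   # 55.0, partly real
td_error = predict_A_to_B - td_target          # 5.0
lr = 0.1
predict_A_to_B -= lr * td_error                # nudge the estimate toward the target
print(predict_A_to_B)                          # 59.5
```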

The TD algorithm always relies on a formula of this shape: the left-hand side is the model's prediction, and the right-hand side is the sum of a piece of observed (real) value and a piece of predicted value.

In deep reinforcement learning there happens to be exactly such a formula:
$$Q(s_t, a_t; \mathbf{w}) \approx r_t + \gamma\cdot Q(s_{t+1}, a_{t+1}; \mathbf{w})$$
It comes from the identity $U_t = R_t + \gamma\cdot U_{t+1}$, and $Q$ is an estimate of $U$, which gives the formula above.

  • Prediction: at time $t$, the model makes the prediction $Q(s_t, a_t; \mathbf{w}_t)$.
  • TD target: at time $t+1$, the real reward $r_t$ and the state $s_{t+1}$ are observed; $a_{t+1}$ can then be computed with the DQN, so the TD target can be calculated:
$$y_t = r_t + \gamma\cdot Q(s_{t+1}, a_{t+1}; \mathbf{w}_t) = r_t + \gamma\cdot \max_a Q(s_{t+1}, a; \mathbf{w}_t).$$
  • Loss (one full TD update step is sketched after this list):
$$L_t = \frac{1}{2}\left[Q(s_t, a_t; \mathbf{w}) - y_t\right]^2$$
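Putting prediction, TD target, and loss together, one TD update might look like this. A sketch assuming PyTorch, the `QNetwork` class from the earlier sketch, and a single made-up transition:

```python
# One TD update step for a DQN on a single transition (s_t, a_t, r_t, s_{t+1}).
import torch

q_net = QNetwork(state_dim=4, n_actions=2)          # the class sketched earlier
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

s_t  = torch.randn(1, 4)                             # placeholder transition data
a_t  = torch.tensor([0])
r_t  = torch.tensor([1.0])
s_t1 = torch.randn(1, 4)

q_pred = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t; w)
with torch.no_grad():                                       # y_t is a constant target
    y_t = r_t + gamma * q_net(s_t1).max(dim=1).values       # TD target

loss = 0.5 * (q_pred - y_t).pow(2).mean()            # L_t = 1/2 [Q - y_t]^2
optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # gradient descent on w
```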

Policy Learning

Approximate the policy function with a neural network $\pi(a|s;\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ denotes the network parameters.

  • $\sum_{a\in\mathcal{A}} \pi(a|s;\boldsymbol{\theta}) = 1$

With the policy function replaced by the neural network, the state-value function
$$V_\pi(s_t) = \mathbb{E}_A\left[Q_\pi(s_t, A)\right] = \sum_a \pi(a|s_t)\cdot Q_\pi(s_t, a)$$
correspondingly becomes
$$V(s_t; \boldsymbol{\theta}) = \sum_a \pi(a|s_t;\boldsymbol{\theta})\cdot Q_\pi(s_t, a).$$
$V$ evaluates the current state and how good the policy function is. How do we make the policy function better and better?

Adjust the neural network parameters $\boldsymbol{\theta}$ so that this value gets larger and larger.

Define the objective function as $J(\boldsymbol{\theta}) = \mathbb{E}_S\left[V(S;\boldsymbol{\theta})\right]$: the state $S$ is treated as a random variable and removed by taking the expectation, so only an evaluation of the policy network remains.

  • Observe a state $s$.
  • Compute the gradient and perform gradient ascent on the objective function above.

This is a stochastic gradient: the randomness comes from $s$, which is obtained by random sampling.

We have the following two equivalent forms of the policy gradient:
$$\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_a \frac{\partial \pi(a|s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\cdot Q_\pi(s,a).$$

$$\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \mathbb{E}_{A\sim\pi(\cdot|s;\boldsymbol{\theta})}\left[\frac{\partial \log \pi(A|s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\cdot Q_\pi(s,A)\right]$$

If the actions are discrete, use the first formula: compute the term for each action separately and add them up.

If the actions are continuous, use the second formula. The integral over the neural network is hard to compute, so use a Monte Carlo approximation: draw one or more samples of the action to approximate the expectation.

How to compute $Q_\pi$? Two ways (a sketch of the first follows this list).

  1. Play the game from beginning to end with the policy function and record every state, action, and reward; then use the recorded returns as the approximation.
  2. Approximate $Q_\pi$ with another neural network (this leads to the actor-critic method below).
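A sketch of the first way: play one episode with the policy network, record the rewards, and use the observed discounted returns $u_t$ in place of $Q_\pi(s_t,a_t)$ in the policy gradient (this is often called REINFORCE). PyTorch and the layer sizes are assumptions made here for illustration.

```python
# REINFORCE-style policy gradient update from one recorded episode.
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                           nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """states: list of 1-D state tensors; actions: ints; rewards: floats."""
    # Discounted returns, computed backwards: u_t = r_t + gamma * u_{t+1}
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)

    # Minimizing -log pi(a_t|s_t) * u_t performs gradient ascent on
    # dlog(pi)/dtheta * u_t, the Monte Carlo policy gradient.
    loss = torch.tensor(0.0)
    for s, a, u in zip(states, actions, returns):
        loss = loss - torch.log(policy_net(s)[a]) * u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```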

Actor-Critic method

Actor: the policy network, used to control the agent's movements.

Critic: the value network, used to score the actions.
$$V_\pi(s) = \sum_a \pi(a|s)\cdot Q_\pi(s,a)$$
Use two neural networks: one approximates $\pi$ (the policy network, also called the actor), and one approximates $Q_\pi$ (the value network, also called the critic); the value network's job is to score actions.

Thus $V_\pi$ is approximated as
$$V_\pi(s) = \sum_a \pi(a|s)\cdot Q_\pi(s,a) \approx \sum_a \pi(a|s;\boldsymbol{\theta})\cdot q(s,a;\mathbf{w}),$$
the product of the policy network and the value network, summed over the actions.

Policy network: $\pi(a|s;\boldsymbol{\theta})$, which takes the state as input and outputs a probability for each action.

Value network: $q(s,a;\mathbf{w})$, which takes the state and an action as input and outputs a score for that action.

Network Training

$$V(s;\boldsymbol{\theta},\mathbf{w}) = \sum_a \pi(a|s;\boldsymbol{\theta})\cdot q(s,a;\mathbf{w})$$

Training means updating the parameters $\boldsymbol{\theta}$ and $\mathbf{w}$.

  • The policy network is updated to increase $V$, so that the policy gets better and better and the critic's score $q$ gets higher and higher.
  • The value network $q$ is updated to make its scores more accurate, so that it better estimates the sum of future rewards.

Five-step update loop:

  1. Observe the state $s_t$.
  2. Randomly sample an action $a_t$ according to the policy network.
  3. Execute the action $a_t$; the state is updated to $s_{t+1}$ and the reward $r_t$ is obtained.
  4. Use the TD algorithm to update the parameters $\mathbf{w}$ of the value network.
  5. Use the policy gradient algorithm to update the parameters $\boldsymbol{\theta}$ of the policy network, using the value network's output.

Algorithm steps (a code sketch follows the list):

  1. Observe the state $s_t$ and randomly sample $a_t \sim \pi(\cdot|s_t;\boldsymbol{\theta}_t)$.
  2. Execute the action $a_t$; the environment gives the new state $s_{t+1}$ and the reward $r_t$.
  3. According to $s_{t+1}$ and the policy network, randomly sample a new action $\tilde{a}_{t+1}$ (it is not actually executed).
  4. Evaluate the value network: $q_t = q(s_t, a_t;\mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1};\mathbf{w}_t)$.
  5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma\cdot q_{t+1})$.
  6. Differentiate the value network: $\mathbf{d}_{w,t} = \frac{\partial q(s_t,a_t;\mathbf{w})}{\partial \mathbf{w}}\big|_{\mathbf{w}=\mathbf{w}_t}$.
  7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha\cdot\delta_t\cdot\mathbf{d}_{w,t}$.
  8. Differentiate (the log of) the policy network: $\mathbf{d}_{\theta,t} = \frac{\partial \log \pi(a_t|s_t;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}$.
  9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta\cdot\delta_t\cdot\mathbf{d}_{\theta,t}$.
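Below is a sketch of one pass through these nine steps. PyTorch and the classic Gym `env.step` signature are assumptions; `policy_net` (actor, outputs $\pi(\cdot|s;\boldsymbol{\theta})$) and `value_net` (critic, sketched here as a network that outputs one score per action, so indexing with the action picks out $q(s,a;\mathbf{w})$) are assumed to be small networks built elsewhere, e.g. like the earlier sketches.

```python
# One actor-critic update step following the nine steps above.
import torch

def actor_critic_step(policy_net, value_net, actor_opt, critic_opt,
                      env, s_t, gamma=0.99):
    # 1. sample a_t ~ pi(.|s_t; theta_t)
    a_t = torch.multinomial(policy_net(s_t), 1).item()

    # 2. execute a_t; the environment returns s_{t+1} and r_t
    s_t1, r_t, done, _ = env.step(a_t)
    s_t1 = torch.as_tensor(s_t1, dtype=torch.float32)

    # 3. sample a~_{t+1} from the policy network (not executed)
    a_t1 = torch.multinomial(policy_net(s_t1), 1).item()

    # 4.-5. evaluate the critic and compute the TD error delta_t
    q_t = value_net(s_t)[a_t]                         # q(s_t, a_t; w_t)
    with torch.no_grad():
        q_t1 = value_net(s_t1)[a_t1]                  # q(s_{t+1}, a~_{t+1}; w_t)
    delta = (q_t - (r_t + gamma * q_t1)).detach()     # delta_t, held constant

    # 6.-7. critic update: w <- w - alpha * delta_t * dq/dw
    critic_opt.zero_grad()
    (delta * q_t).backward()
    critic_opt.step()

    # 8.-9. actor update: theta <- theta + beta * delta_t * dlog(pi)/dtheta
    actor_opt.zero_grad()
    (-delta * torch.log(policy_net(s_t)[a_t])).backward()
    actor_opt.step()

    return s_t1, done
```

Looping this function over time steps (and resetting the environment when `done` is true) carries out the five-step update loop described above.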

Reference:

https://www.bilibili.com/video/BV1We4y1w7Us?p=6&spm_id_from=pageDriver&vd_source=77cb10b9cc158f4815e8d992103d448b
