Reinforcement learning study notes: Q-learning

This article is organized based on the reinforcement learning tutorial at datawhalechina.github.io and Mofan Python's reinforcement learning tutorial.

This article is meant to be short; if it ran long, every study session would take a few hours, and I really can't stand that.

0x01 Introduction

[Figure: Q-learning algorithm flow chart]
In reinforcement learning, $Q^\pi(s, a)$ denotes the expected return obtained by taking action $a$ at the current step in state $s$ and then following the given policy $\pi$ for all subsequent steps until the end of the episode.

In order to maximize the reward, the agent's policy $\pi$ here can simply be: in the current state $s$, greedily choose the action $a$ that makes $Q^\pi(s, a)$ largest.

The figure above is the algorithm flow chart of Q-learning. Given the agent's policy $\pi$, we use this algorithm to update $Q^\pi(s, a)$. To maximize the return, in each state $s$ the agent greedily chooses the action $a$ that maximizes $Q^\pi(s, a)$.
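
As a minimal sketch (not taken from the original tutorials), the Q-function of a small, discrete problem can be stored as a table, and the greedy choice above is just an argmax over one row. The state and action counts below are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes for a small, discrete problem (assumed, not from the tutorials).
n_states, n_actions = 6, 4

# Q[s, a] approximates the expected return of taking action a in state s.
Q = np.zeros((n_states, n_actions))

def greedy_action(Q, s):
    """Greedy policy: pick the action a that maximizes Q(s, a) in state s."""
    return int(np.argmax(Q[s]))
```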

Suppose the environment is in state $s$. Q-learning then proceeds as follows.

First, select an action $a$ according to the current state $s$. The selection strategy for $a$ is $\epsilon$-greedy: with high probability, choose the action $a^*$ that maximizes the $Q(s, a)$ value (even though $Q(s, a)$ has not been fully trained yet), but with probability $\epsilon$ choose an action $a$ at random.
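
A minimal sketch of this $\epsilon$-greedy rule, assuming the tabular $Q$ from the sketch above; the default $\epsilon = 0.1$ is only an illustrative value.

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """epsilon-greedy: explore with probability epsilon, otherwise exploit argmax_a Q(s, a)."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))  # explore: uniformly random action
    return int(np.argmax(Q[s]))                   # exploit: current best action a*
```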

Next, having chosen action $a$, take $a$ in the environment, receive the reward $r$, and arrive at the next state $s'$.

Next, update $Q(s, a)$:

$$Q(s, a) = (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. Suppose we initialize the $Q(s, a)$ values completely at random; the rewards $r$ are then introduced gradually through the updates. Because of the learning rate $\alpha$, the initial random terms are slowly diluted, and an ever larger share of each $Q(s, a)$ value comes from the accumulated rewards $r$.
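
The update formula can be written as a short function. This is a sketch assuming the tabular NumPy representation used above, with illustrative values for $\alpha$ and $\gamma$; the name `q_update` is mine.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    target = r + gamma * np.max(Q[s_next])        # bootstrapped estimate of the return
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```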

Finally, repeat the steps above, and the $Q(s, a)$ values gradually converge.
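
Putting the steps together, a full training loop might look like the following sketch. It assumes a Gym-style environment interface (`reset()` returning a state index, `step(a)` returning `(next_state, reward, done)`), which is my assumption rather than anything specified in the original notes; adapt the unpacking to whatever environment you use. The hyperparameters are illustrative defaults.

```python
import numpy as np

def train_q_learning(env, n_states, n_actions, episodes=500,
                     alpha=0.1, gamma=0.9, epsilon=0.1):
    """Repeat the steps above for many episodes; Q(s, a) gradually converges."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                      # assumed: returns the initial state index
        done = False
        while not done:
            # 1) epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = int(np.random.randint(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            # 2) act, observe reward and next state (assumed step() signature)
            s_next, r, done = env.step(a)
            # 3) Q-learning update; at terminal states the target is just r.
            #    The incremental form below is algebraically equivalent to
            #    (1 - alpha) * Q[s, a] + alpha * target.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```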

Origin blog.csdn.net/weixin_43466027/article/details/117081189