Deep Reinforcement Learning - Chapters 6-8: Q-Learning

Reference notes and columns:

  1. Tianjin Baozi Stuffing's deep reinforcement learning column on Zhihu
  2. Datawhale deep reinforcement learning notes

1. Q-Learning Concept

Q-learning is an off-policy temporal-difference (TD) method.
Here, "off-policy" refers to the policy the agent uses to choose actions, namely the $\epsilon$-greedy policy in line 5 of the pseudocode below.
Note that this behavior policy is not the policy being evaluated: the update in line 6 evaluates the greedy policy implied by the max operator.
(Figure: Q-learning pseudocode)
The temporal-difference part refers to using a TD target to update the current state-action value function $Q$. The TD target is $r_t + \gamma \max_a Q(s_{t+1}, a)$.
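To make this concrete, here is a minimal tabular Q-learning sketch (my own illustration, not from the referenced notes); the Q-table layout and hyperparameters are assumptions. It separates the $\epsilon$-greedy behavior policy (line 5) from the max-based TD target (line 6).

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Behavior policy (pseudocode line 5): with probability epsilon explore,
    otherwise act greedily with respect to the current Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update (pseudocode line 6): the TD target uses max_a Q(s', a),
    i.e. the greedy target policy, regardless of how the next action is actually chosen."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

In a training loop, `Q` would be an array of shape `(n_states, n_actions)`; actions are chosen with `epsilon_greedy`, and each transition `(s, a, r, s_next)` is passed to `q_learning_update`.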

What Q-learning needs to learn is a policy evaluation function, i.e. a function that evaluates how good the policy the agent uses to choose actions is.

2. Q-Learning Function

2.1 State Value Function Estimation $V^{\pi}(s)$


Given a state $s$, and assuming the actor $\pi$ keeps interacting with the environment from that state onward, $V^{\pi}(s)$ is the expected value of the cumulative reward obtained until the end of the interaction.

The input of $V^{\pi}(s)$ is a state $s$, and the output is a scalar (namely the expected cumulative reward obtained until the end of the interaction, assuming the actor interacting with the environment from here on is $\pi$).
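As an illustration (not from the original notes), a state value network with this input/output signature might look as follows; the state dimension, hidden size, and class name are assumptions.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """V^pi(s): takes a state vector, outputs a single scalar value estimate."""
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar output: expected cumulative reward
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)
```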

  1. Approximate with a Monte-Carlo (MC) based method

    Assume the value network's parameters are the weights $\theta$ of each layer. With MC, the training target is the observed cumulative reward $G$: when the input is state $s_a$, the correct output should be $G_a$, the cumulative reward actually obtained from $s_a$ until the end of the episode (both the MC and TD targets are sketched in code after this list).

  2. Approximate with the Temporal-Difference (TD) method

    In a given state $s_t$, the agent takes action $a_t$, receives reward $r_t$, and transitions to state $s_{t+1}$; TD can then be applied:

    $V^{\pi}(s_t) = r_t + \gamma V^{\pi}(s_{t+1})$

    Feed $s_t$ into the network to get $V^{\pi}(s_t)$, and feed $s_{t+1}$ into the network to get $V^{\pi}(s_{t+1})$. Training updates the parameters of $V^{\pi}(s)$ so that the difference $V^{\pi}(s_t) - \gamma V^{\pi}(s_{t+1})$ gets closer and closer to $r_t$; once it does, $V^{\pi}(s)$ has been learned.
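Below is a sketch of the two training targets (my own illustration, reusing a value network like the hypothetical `ValueNet` above); the batch format and discount factor are assumptions.

```python
import torch
import torch.nn.functional as F

def mc_loss(value_net, state, mc_return):
    """MC target: regress V(s_a) toward the observed episode return G_a."""
    v = value_net(state)
    return F.mse_loss(v, mc_return)

def td_loss(value_net, state, reward, next_state, gamma=0.99):
    """TD target: make V(s_t) - gamma * V(s_{t+1}) close to r_t."""
    v_t = value_net(state)
    with torch.no_grad():  # do not backpropagate through the bootstrap target
        v_next = value_net(next_state)
    td_target = reward + gamma * v_next
    return F.mse_loss(v_t, td_target)
```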

(Figure: comparison of MC-based and TD-based estimation)
Because the MC method is affected by the randomness of $G_a$ (a sum of many rewards) much more than the TD method is affected by the randomness of a single $r_t$, the TD method is more commonly used.

2.2 State-Action Value Function $Q^{\pi}(s, a)$

The input of $Q^{\pi}(s, a)$ is a state $s$ and an action $a$, and the output is the expected value of the accumulated reward.

There are two ways to write the Q-function (both forms are sketched in code below):

  • The input is a state $s$ and an action $a$, and the output is a scalar;
  • The input is a state $s$ only, and the output is one Q-value for each (discrete) action.
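A minimal sketch of the two forms (my own illustration; the dimensions and class names are assumptions):

```python
import torch
import torch.nn as nn

class QNetScalar(nn.Module):
    """Form 1: input (state, action), output a single scalar Q(s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

class QNetPerAction(nn.Module):
    """Form 2: input state only, output one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)
```

The second form requires a discrete action space, but it is convenient because a single forward pass yields the Q-values of all actions at once.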

For example:
(Figure: a ball-and-paddle game, showing the Q-values of the three actions in four different game states)

Suppose there are 3 actions: stay still, move up, and move down.

In the first state, no matter which action is taken, the expected cumulative reward at the end of the game is roughly the same: with the ball where it is, even if the paddle moves down now, there is still time to move back and catch the ball, so the choice of action makes little difference.

In the second state, the ball has bounced back very close to the edge. Here, moving up lets the paddle catch the ball and obtain a positive reward, while staying still or moving down means missing the ball, so the expected reward is negative.

In the third state, the ball is very close, so the paddle should move up.

In the fourth state, the ball has bounced back.

Source: blog.csdn.net/weixin_45549370/article/details/109479511