Deep Reinforcement Learning - Policy Learning (3)

The focus of this article: using a neural network to approximate the policy function π

We use a neural network to approximate the policy function. This neural network is called the policy network, and it can be used to control the agent's movements. The policy network is trained with the policy gradient algorithm.

The input of the policy function π is the current state s, and its output is a probability distribution, e.g. (0.2, 0.1, 0.7), which assigns a probability to each action a

In the Super Mario example, the input of the π function is the state s and the output is a three-dimensional vector, where each element is the probability of one action. Given these three probabilities, the agent randomly samples an action a. All three actions can be drawn, but "up", with probability 0.7, is the most likely to be drawn
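
A minimal sketch of this sampling step (the action names and probabilities are just the illustrative values from the example above):

```python
import numpy as np

# Illustrative output of the policy network for the current state s
actions = ["left", "right", "up"]
probs = np.array([0.2, 0.1, 0.7])  # non-negative, sums to 1

# Randomly sample one action according to these probabilities
a = np.random.choice(actions, p=probs)
print(a)  # "up" is drawn most often, but all three actions are possible
```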

1. How do we get the policy function?

If a game has only 5 states and 10 actions, this is easy to handle: draw a 5 × 10 table in which each cell holds one probability, and estimate the 50 probability values by playing the game. But a game like Super Mario has countless states, so such a table cannot hold them all and we cannot compute the policy function π directly. Instead we do function approximation: we learn a function that approximates the policy function. There are many ways to do function approximation, such as linear models or neural networks; here we use a neural network. θ denotes the parameters of the neural network. At the beginning θ is randomly initialized; afterwards we improve θ through learning.

2. For the game Super Mario, we can design the policy network like this:

The input is the state s, that is, the picture currently displayed on the screen (or the pictures of the last few frames). One or several convolutional layers turn the picture into a feature vector. A fully connected layer then maps the feature vector to a three-dimensional vector; since there are 3 actions, the output dimension is 3. Finally, the softmax activation function is applied, so the output is a probability distribution: a three-dimensional vector in which each element corresponds to one action and gives the probability of that action

Softmax makes all the outputs positive, with a sum equal to 1
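
A minimal PyTorch sketch of such a policy network (the number of input frames, image resolution, and layer sizes below are illustrative assumptions, not values from the lecture):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a stack of screen frames to a probability distribution over 3 actions."""
    def __init__(self, in_frames=4, num_actions=3):
        super().__init__()
        # Convolutional layers turn the picture into a feature vector
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Fully connected layer maps the feature vector to 3 action scores
        self.fc = nn.LazyLinear(num_actions)

    def forward(self, s):
        scores = self.fc(self.conv(s))
        # Softmax turns the scores into a probability distribution (positive, sums to 1)
        return torch.softmax(scores, dim=-1)

# Example: one observation made of 4 stacked 84x84 frames
probs = PolicyNetwork()(torch.rand(1, 4, 84, 84))
print(probs, probs.sum())  # three probabilities that sum to 1
```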

3. Review: the action-value and state-value functions

The action-value function Qπ(st, at) is the conditional expectation of the return Ut; the value of Qπ depends on the policy function π, the action at, and the state st

The state-value function Vπ is the expectation of Qπ, with the action A averaged out. Here A is treated as a random variable whose probability distribution is given by π. Once A is eliminated, Vπ depends only on the policy function π and the state st

In this way, once the policy function π is given, Vπ can evaluate how good the current state st is: the larger Vπ, the greater the current chance of winning

Conversely, given the state s, Vπ can evaluate how good the policy function π is: if π is good, Vπ will be relatively large, indicating a high chance of winning
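
Written out, these are the standard definitions being reviewed:

```latex
Q_\pi(s_t, a_t) = \mathbb{E}\!\left[ U_t \mid S_t = s_t,\, A_t = a_t \right],
\qquad
V_\pi(s_t) = \mathbb{E}_{A \sim \pi(\cdot \mid s_t)}\!\left[ Q_\pi(s_t, A) \right]
           = \sum_{a} \pi(a \mid s_t)\, Q_\pi(s_t, a).
```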

4. The main idea of policy learning:

We have just used the policy network π(a | s; θ) to approximate the policy function. The value function can then be written as V(s; θ), and V evaluates both the quality of the state s and the quality of the policy network

5. How to make the policy network better and better?

》 Continuously improve and adjust the parameter θ of the policy network

This expectation is taken over the state S. Here the state S is treated as a random variable and averaged out by the expectation, so only θ remains as a variable. The objective function J(θ) evaluates the policy network: the better the policy network, the larger J(θ). The goal of policy learning is therefore to improve θ so that J(θ) becomes as large as possible
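
In symbols, the two quantities involved are (standard definitions):

```latex
V(s; \theta) = \sum_{a} \pi(a \mid s; \theta)\, Q_\pi(s, a),
\qquad
J(\theta) = \mathbb{E}_{S}\!\left[ V(S; \theta) \right].
```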

6. How to improve θ?

"Use the policy gradient algorithm

We let the agent play the game, and at each step it observes a different state s; this s is equivalent to a random sample from the probability distribution over states. When the state s is observed, we differentiate V(s; θ) with respect to θ to obtain a gradient, and then use gradient ascent to update θ, where β is the learning rate. This is in fact stochastic gradient ascent: we are not computing the true gradient, which would be the derivative of the objective function J(θ) with respect to θ. What we compute here is the derivative of V with respect to θ, a stochastic gradient whose randomness comes from S. Why gradient ascent? Because we want the objective function J(θ) to become larger and larger. The derivative of V with respect to θ is called the policy gradient
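
The stochastic gradient-ascent update described above is:

```latex
\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}.
```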

7. Why use gradient ascent to update θ?

"Because walking along the direction of gradient ascent, our objective function J(θ) will be closer to the maximum value!

Differentiating V at one sampled state gives a stochastic gradient; its expectation over the state S is the true gradient of the objective J

Here's how to approximate the policy gradient:

To simplify the derivation, assume that Qπ does not depend on θ (in fact it may); then Qπ can be treated as a constant and pulled outside the differentiation
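
Under that simplification, the policy gradient takes the two standard equivalent forms:

```latex
\frac{\partial V(s; \theta)}{\partial \theta}
  = \sum_{a} \frac{\partial \pi(a \mid s; \theta)}{\partial \theta}\, Q_\pi(s, a)
  = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)}\!\left[
      \frac{\partial \log \pi(A \mid s; \theta)}{\partial \theta}\, Q_\pi(s, A)
    \right].
```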

In practice, however, this formula is usually not used to compute the policy gradient directly; instead, a Monte Carlo approximation of the policy gradient is used.

Discrete action: when the action space is discrete and small, the summation form of the policy gradient can be evaluated directly, by computing the term for every action a and adding the results.

Continuous action:

A is a continuous variable, so computing this expectation directly would require an integral. That integral cannot be evaluated, because the π function is a neural network, which is far too complicated to integrate analytically, so we can only use a Monte Carlo approximation

The process of Monte Carlo approximation:

1. Randomly sample an action â according to the probability density function π(· | s; θ)

2. Compute g(â, θ). Since â is a specific, known action, the value of g(â, θ) can be computed directly.

Since â is randomly sampled according to the probability density function π, g(â, θ) is an unbiased estimate of the policy gradient

Since g(â, θ) is an unbiased estimate of the policy gradient, it can be used to approximate the policy gradient. This is called Monte Carlo approximation: draw one or several random samples and use them to approximate the expectation. When updating the model parameter θ, g(â, θ) is used as the approximate gradient
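
For reference, the single-sample Monte Carlo estimator g used here has the standard form:

```latex
g(\hat{a}, \theta) = \frac{\partial \log \pi(\hat{a} \mid s; \theta)}{\partial \theta}\, Q_\pi(s, \hat{a}),
\qquad
\mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)}\!\left[ g(A, \theta) \right]
  = \frac{\partial V(s; \theta)}{\partial \theta}.
```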

To summarize the policy gradient algorithm (a code sketch of one update step follows this list):

1. At time step t, observe the state st

2. To form the Monte Carlo approximation of the policy gradient, use the policy network π(· | st; θt) as a probability density function and randomly sample an action at from it; for example, at might be the "left" action

3. Compute the value of the action-value function Qπ(st, at) and record the result as qt

4. Differentiate the policy network: compute the derivative of log π(at | st; θ) with respect to θ at θ = θt and record the result as dθ,t. dθ,t is a vector, matrix, or tensor with the same shape as θ: if θ is a 100 × 100 matrix, then dθ,t is also a 100 × 100 matrix. Both TensorFlow and PyTorch support automatic differentiation: given the current at, st, and the current parameter θt, the framework can compute the gradient dθ,t automatically

5. Compute the approximate policy gradient g(at, θt) = qt · dθ,t, i.e., the Monte Carlo approximation of the policy gradient based on the single sample at

6. Finally, use the approximate policy gradient to update the parameter of the policy network: θt+1 = θt + β · qt · dθ,t
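
A minimal PyTorch sketch of one such update step (how to obtain the value qt is the question discussed next):

```python
import torch

def policy_gradient_step(policy_net, optimizer, s_t, a_t, q_t):
    """One update: theta <- theta + beta * q_t * d(log pi(a_t | s_t; theta)) / d(theta)."""
    probs = policy_net(s_t)                          # pi(. | s_t; theta_t), shape (1, 3)
    dist = torch.distributions.Categorical(probs)
    log_prob = dist.log_prob(a_t)                    # log pi(a_t | s_t; theta_t)
    loss = -(q_t * log_prob).mean()                  # ascent on q_t*log pi == descent on its negative
    optimizer.zero_grad()
    loss.backward()                                  # autograd computes d(theta),t (step 4)
    optimizer.step()                                 # theta <- theta + beta * q_t * d(theta),t (steps 5-6)

# During play (step 2), the action itself is sampled from the same distribution:
#   probs = policy_net(s_t); a_t = torch.distributions.Categorical(probs).sample()
# The learning rate beta lives in the optimizer, e.g.:
#   optimizer = torch.optim.SGD(policy_net.parameters(), lr=beta)
```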

At this point, one question remains unresolved: what is Qπ?

We do not know Qπ, so we cannot compute qt. How, then, can qt be approximated?

Method 1: Use the policy network π to control the agent's movement. Play one game from beginning to end and record the whole trajectory s1, a1, r1, s2, a2, r2, ..., sn, an, rn. Having observed all the rewards r, we can compute the return ut

Since the value function Qπ is the expectation of the random return Ut, we can use the observed value ut (a realization of the random variable Ut) to approximate Qπ. The REINFORCE algorithm therefore substitutes the observed ut for the Qπ function. REINFORCE has to finish a whole game and observe all of its rewards before the policy network can be updated
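
A small sketch of computing these returns from the recorded rewards; the discount factor γ is an assumption here (the return Ut is taken as the discounted sum of rewards from step t onward):

```python
def compute_returns(rewards, gamma=0.99):
    """u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., computed backwards."""
    returns = [0.0] * len(rewards)
    u = 0.0
    for t in reversed(range(len(rewards))):
        u = rewards[t] + gamma * u
        returns[t] = u
    return returns

# Example: rewards observed over one (very short) game
print(compute_returns([1.0, 0.0, 2.0], gamma=0.9))  # approx. [2.62, 1.8, 2.0]
```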

Method 2: Use another neural network to approximate Qπ:

The policy function π is already approximated by one neural network; now a second neural network is used to approximate the value function Qπ. With two neural networks, one called the actor and the other called the critic, we arrive at the actor-critic method.

To summarize:

It is difficult to obtain the policy function directly, so we use a neural network to approximate the policy function

θ is the parameter of the neural network. It is randomly initialized at the beginning, and then the parameter θ is learned through the policy gradient algorithm.

The main content of this lesson is deriving and computing the policy gradient

The policy gradient is the derivative of the value function V with respect to θ. After computing the policy gradient, we do gradient ascent to update the parameter θ. Why gradient ascent? Because we want the value function V to be as large as possible. The objective function is E[V], which can be understood as the agent's average chance of winning when it uses the policy function π: the better the policy function, the larger the objective function, and the greater the agent's chance of winning

If you are confused about the content of this lesson, I will give you some pointers in the next Q&A, I hope it can help you!

Origin blog.csdn.net/Tandy12356_/article/details/130176217