6. Reinforcement learning: policy gradient


The previous lecture focused on approximating the value function and then deriving a policy from it. In this lecture the policy P(a|s) itself becomes the object of learning, written as the parameterized function π(s,a). By following the gradient of an objective function defined over the policy, we can find an extremum of that objective and thereby obtain the optimal policy.
The lecture is organized as follows. First, it is shown that value functions cannot solve every problem well, and that learning directly in policy space has advantages that value-based methods cannot replace in some situations. Then, the requirements for direct policy-based learning are introduced: the design of the objective function and the concept of the policy gradient. The computation of the policy gradient is explained both via the finite difference method and analytically, and the gradient formulas of two basic policy classes are given. Building on this, the Actor-Critic algorithm, which applies the policy gradient to reinforcement learning, is presented along with its algorithm flow and several improvements. As with value function approximation, with the development of deep learning libraries the hand-derived gradient formulas given here are not widely used in practice, but they remain very helpful for theoretical understanding. Like approximate optimization of the value function, optimization based on the policy function is also model-free.

Introduction

The previous lecture covered how to approximate the value function with a parameterized expression, including the state value function and the action (behavior) value function:
$$V_\theta(s) \approx V^{\pi}(s), \qquad Q_\theta(s,a) \approx Q^{\pi}(s,a)$$
A policy can then be generated directly from the value function, e.g. using Ɛ-greedy exploration methods.
This lecture instead parameterizes the policy itself. The parameterized policy is no longer a table of probabilities but a function, i.e. we move from a discrete representation to a continuous one:
$$\pi_\theta(s,a) = \mathbb{P}[a \mid s, \theta]$$
This formula defines the parameterized policy function $\pi_\theta$: given a state and a particular parameter setting, it gives the probability of taking each possible action, so it is in fact a probability distribution (a probability density for continuous actions). When the policy is used to generate behavior, actions are sampled from this distribution. The parameters of the policy function determine the shape of the distribution.
The purpose of parameterization is to handle large-scale problems. In a large-scale problem it is impossible to enumerate every state and specify which action should be taken in each one, so we use parameterization to approximate the true function reasonably well with a small number of parameters.
Our task is to adjust the parameters of this policy function so as to obtain a better policy, one whose generated behavior earns more reward. The concrete mechanism is to design an objective function and use gradient ascent to optimize the parameters so that reward is maximized.

Value-based and policy-based reinforcement learning

Comparing value-based and policy-based reinforcement learning: the former learns a value function and derives the policy from it (e.g. acting Ɛ-greedily with respect to it); the latter learns the policy directly, with no value function; a third family learns both a value function and a policy, and is called Actor-Critic reinforcement learning.

Advantages and disadvantages of the policy-based approach

Advantages:

  1. Policy-based learning may have better convergence properties: although each update improves the policy only a little, it always improves in a good direction; value-function-based methods, by contrast, may oscillate slightly around the optimal value function in later stages without converging.

  2. Policy-based methods are more effective in high-dimensional or continuous action spaces. With value-function-based learning, deriving a policy requires comparing the values of all possible actions; when the action space is high-dimensional or continuous, this maximization over actions becomes difficult. In such cases policy-based learning is much more efficient.

  3. Policy-based methods can learn stochastic policies (a good example is given below); value-function-based learning usually cannot.

  4. Sometimes the value function is very complex to compute. For example, if a ball falls from some position in the air and you must move left or right to catch it, it is hard to compute the value of each action for each ball position; a policy-based description is much simpler: just move toward where the ball will land, and adjust the policy accordingly.

Disadvantages:

  1. Naive (unimproved) policy-based learning is sometimes inefficient and can have high variance. Value-function-based methods push the agent to choose the maximum-value action at every step, whereas policy-based methods typically move the parameters only a small amount along the gradient, which makes learning smoother but less efficient. Moreover, the sampled gradient increment can have high variance, which slows the whole algorithm down; with some modifications, however, this can be improved.
  2. When solving a specific problem, you need to assess its characteristics to decide whether to rely mainly on value-based or policy-based learning.

Value-function-based policies sometimes fail to obtain the optimal policy

When state aliasing occurs (different states look identical to the agent), a stochastic policy can be better than any deterministic one. Previous theory tells us that every MDP has a deterministic optimal policy, but that holds only when the state is fully observable, or when the features used describe the state perfectly. When states are aliased and cannot be distinguished, or when the features of the approximating function cannot describe the state completely, the information available to the agent amounts to a partial observation of the environment, and the problem loses the Markov property. In that case the optimal policy is no longer deterministic. Learning directly in policy space can still learn the (stochastic) optimal policy, which is one reason for doing reinforcement learning directly on the policy.

Policy objective function

Three forms of the policy objective function

So how does direct policy-based learning optimize the policy? To answer this, we must first be clear about the ultimate goal of optimizing the policy: to obtain as much reward as possible. We therefore design an objective function that measures the quality of a policy. For different problem types there are three objective functions to choose from:

Goal: given a policy $\pi_\theta(s,a)$, find the best parameters $\theta$.

  1. Start value: in an environment that produces complete episodes, i.e. where the agent can reach a terminal state, we can measure the quality of the whole policy by the cumulative reward the agent obtains from some start state s1 until termination. This quantity is called the start value. Its meaning is: if the agent always starts from state s1 (or starts from states drawn from some distribution), what total reward will it collect from the start of the episode to its end? What the algorithm really cares about is finding a policy such that, when the agent is placed in state s1 and executes that policy, it obtains a large start value. Our goal thus becomes maximizing the start value.
    $$J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$$
  2. Average value: for continuing environments there is no start state, and the average value can be used instead. The idea is to consider the probability of the agent being in each state at some moment, i.e. the stationary state distribution under the policy, compute for every state the value obtainable by continuing to interact with the environment from there, and sum these values weighted by the state distribution:
    $$J_{\mathrm{avV}}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$
    where $d^{\pi_\theta}(s)$ is the stationary distribution of states of the Markov chain induced by the current policy.
  3. Average reward per time-step: alternatively, we can use the average reward obtainable per time step. At a given time step, consider the probability of the agent being in each state and, for every state, the immediate reward obtainable from each action; sum all of these rewards weighted by their probabilities:
    $$J_{\mathrm{avR}}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_s^a$$
    In fact the goals of these three formulations are the same: each of them measures the value the agent obtains from its states under the policy. A small numerical sketch follows.
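To make these objectives concrete, here is a minimal sketch (my own illustration, not part of the lecture) that computes the average-value and average-reward objectives for a small tabular MDP with a known model under a softmax policy; the transition tensor `P`, reward table `R`, and discount are assumed toy choices.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'], assumed toy model
R = rng.normal(size=(n_states, n_actions))                        # R[s, a], assumed toy rewards
theta = rng.normal(size=(n_states, n_actions))                    # one preference per (s, a)

def softmax_policy(theta):
    """pi[s, a]: softmax over the action preferences of each state."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

pi = softmax_policy(theta)

# Markov chain induced by the policy: transition matrix and expected reward per state
P_pi = np.einsum("sa,sap->sp", pi, P)
r_pi = np.einsum("sa,sa->s", pi, R)

# Stationary state distribution d^pi: left eigenvector of P_pi for eigenvalue 1
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d = d / d.sum()

# State values V^pi from the (discounted) Bellman expectation equation
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

J_avV = d @ V      # average value objective
J_avR = d @ r_pi   # average reward per time-step objective
print(J_avV, J_avR)
```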

Optimize objective function

After choosing an objective function, the next step is to optimize the policy parameters so as to maximize its value. Policy-based reinforcement learning is therefore really an optimization problem: find the parameters θ that maximize the objective function. Some algorithms use gradients and some do not; when gradients are available, gradient-based algorithms (gradient ascent) are usually superior. Once the gradient-based approach is understood, it is also easy to apply non-gradient-based algorithms to policy optimization.
This lecture focuses on gradient-based policy optimization, using methods based on fragments of the sequential (episodic) structure. What does this mean? We do not let the agent interact with the environment for its entire lifetime and only then use the final outcome to optimize the policy; that would be of little use to the agent. Instead we select a fragment of the agent-environment interaction sequence, learn from that fragment, improve the policy, and use the improved policy to guide the subsequent interaction.
That concludes the introduction; the rest of the lecture covers the objective function, gradient ascent, and the related algorithms in detail.

Finite difference policy gradient

policy gradient

Let J(θ) be any of the policy objective functions. The policy gradient algorithm searches for a local maximum of J(θ) by ascending the gradient of the objective with respect to the parameters θ:
$$\Delta\theta = \alpha \nabla_\theta J(\theta)$$
where $\nabla_\theta J(\theta)$ is the policy gradient:
$$\nabla_\theta J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$$

Finite difference method to calculate policy gradient

This is a very common numerical method, useful especially when the gradient function itself is hard to obtain. The idea is to approximate the gradient for each component $\theta_k$ of the parameter vector θ using the following formula:
$$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$$
Here $u_k$ is the unit vector whose k-th component is 1 and whose remaining components are 0. The finite difference method is simple, does not require the policy function to be differentiable, and works for any policy; but it is noisy and inefficient most of the time.
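Below is a minimal sketch (my own illustration) of this estimator. `evaluate_policy` is an assumed placeholder for any routine that returns an estimate of J(θ), for example the average return of a few rollouts run with parameters θ.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Approximate grad J(theta) one parameter component at a time."""
    grad = np.zeros_like(theta)
    J_theta = evaluate_policy(theta)
    for k in range(theta.size):
        u_k = np.zeros_like(theta)
        u_k.flat[k] = 1.0  # unit vector in the k-th dimension
        grad.flat[k] = (evaluate_policy(theta + eps * u_k) - J_theta) / eps
    return grad

# One gradient-ascent step with an assumed evaluation routine:
# theta = theta + alpha * finite_difference_gradient(evaluate_policy, theta)
```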

Monte Carlo Policy Gradient

We now derive the policy gradient analytically. This requires that the policy is differentiable (at least wherever it is non-zero) and that its gradient can be computed.
We borrow the idea of likelihood ratios: the gradient of the policy with respect to θ equals the policy value multiplied by the gradient of the logarithm of the policy:
$$\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)$$
The score function is defined as $\nabla_\theta \log \pi_\theta(s,a)$.

Softmax policy

The softmax policy is commonly used for discrete actions. We want a smooth parameterized policy that decides, for each discrete action, the probability with which it should be taken.
To this end, each state-action pair is scored by a linear combination of features with weights θ:
$$\phi(s,a)^\top \theta$$
and the probability of taking a particular action is proportional to the exponential of this score:
$$\pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta}$$
The corresponding score function is:
$$\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$$
Derivation (here $\phi$ and $\theta$ are vectors, and the sums run over the actions $a'$):
$$\frac{d}{d\theta}\ln\pi_\theta(s,a) = \frac{d}{d\theta}\left[\phi(s,a)^\top\theta - \ln\sum_{a'} e^{\phi(s,a')^\top\theta}\right] = \phi(s,a) - \frac{\sum_{a'}\phi(s,a')\,e^{\phi(s,a')^\top\theta}}{\sum_{a'} e^{\phi(s,a')^\top\theta}} = \phi(s,a) - \sum_{a'}\phi(s,a')\,\pi_\theta(s,a') = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$$
If the agent chooses to go left in state s and receives a positive immediate reward, it will increase the probability of that action being sampled, i.e. increase the score for going left in state s. How are the parameters that determine this score adjusted? Each parameter is adjusted in proportion to its corresponding feature value: if the feature value is positive the parameter increases, and if it is negative the parameter decreases.
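The sketch below (my own illustration, with an assumed random feature table standing in for φ(s,a)) implements a softmax policy with linear features together with its score function φ(s,a) − E_π[φ(s,·)].

```python
import numpy as np

n_states, n_actions, n_features = 5, 3, 4
rng = np.random.default_rng(0)
features = rng.normal(size=(n_states, n_actions, n_features))  # assumed toy phi(s, a)
theta = rng.normal(size=n_features)

def softmax_probs(s, theta):
    prefs = features[s] @ theta        # phi(s, a)^T theta for every action
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def score(s, a, theta):
    """Score function: phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_probs(s, theta)
    return features[s, a] - probs @ features[s]

s = 0
probs = softmax_probs(s, theta)
a = rng.choice(n_actions, p=probs)     # sample an action from the policy
print(probs, score(s, a, theta))
```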

Gaussian policy

Unlike the softmax policy, the Gaussian policy is typically used in continuous action spaces.
When using a Gaussian policy, the mean is usually parameterized, for example as a linear combination of state features: $\mu(s) = \phi(s)^\top\theta$.
The variance σ² can be a fixed value or itself parameterized. The action is a real value sampled from a Gaussian distribution with mean μ(s) and standard deviation σ:
$$a \sim \mathcal{N}\left(\mu(s), \sigma^2\right)$$
so that:
$$\pi_\theta(s,a) \propto \exp\left(-\frac{\left(a - \phi(s)^\top\theta\right)^2}{2\sigma^2}\right)$$
The corresponding score function is:
$$\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$$
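A matching sketch (my own illustration) for the Gaussian policy: the mean is linear in assumed state features φ(s), the standard deviation σ is fixed, and the score function is (a − μ(s)) φ(s) / σ².

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
sigma = 0.5

def phi(s):
    """Assumed toy state features for a scalar state s."""
    return np.array([1.0, s, s**2, np.sin(s)])

def sample_action(s):
    mu = phi(s) @ theta
    return rng.normal(mu, sigma)

def score(s, a):
    """grad_theta log pi_theta(s, a) for the Gaussian policy."""
    mu = phi(s) @ theta
    return (a - mu) * phi(s) / sigma**2

s = 0.3
a = sample_action(s)
print(a, score(s, a))
```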

policy gradient theorem

First consider a very simple one-step MDP: a state s is sampled from a distribution $d(s)$; starting from s, one action a is taken, an immediate reward $r = \mathcal{R}_{s,a}$ is received, and the process terminates. The whole MDP consists of a single state, action, and immediate reward. How do we maximize reward in this MDP?
Since it is a single-step process, the forms of the three objective functions are the same:
$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s\in\mathcal{S}} d(s) \sum_{a\in\mathcal{A}} \pi_\theta(s,a)\, \mathcal{R}_{s,a}$$
Note: here $d(s)$ and $\mathcal{R}_{s,a}$ are not functions of θ. Taking the gradient with respect to θ and applying the likelihood ratio trick gives:
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} d(s) \sum_{a\in\mathcal{A}} \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]$$

The gradient of the objective function is therefore the expectation of the product of the score function (the gradient of the log policy) and the immediate reward. As shown above, both parts are relatively easy to compute, so updating the parameters becomes easy. Does the single-step result extend to multi-step MDPs? Yes: the only change needed is to replace the immediate reward with the long-term action value $Q^{\pi_\theta}(s,a)$, and this holds for all three objective functions. We have the following theorem:
Theorem: for any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions defined above (start value, average value, average reward per time-step), the policy gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$$
With the above formula, we can start designing algorithms to solve practical problems. Remember that in reinforcement learning, when talking about learning algorithms, three major categories of algorithms should immediately come to mind: dynamic programming (DP), Monte Carlo (MC) learning, and temporal difference (TD) learning. DP is suitable for small and medium-sized problems and is not the focus of this lecture. Let’s start with MC learning.
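As a sanity check of the single-step result, here is a minimal sketch (my own illustration) that compares the sampled estimate of E[∇_θ log π_θ(a) · r] with the exact gradient of J(θ) on a toy one-state bandit with a softmax policy; the reward vector `R` is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
R = np.array([1.0, 2.0, 0.5])        # assumed deterministic reward per action
theta = rng.normal(size=n_actions)   # one preference per action

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = softmax(theta)
exact_grad = p * (R - p @ R)         # dJ/dtheta_k = pi_k * (R_k - E_pi[R])

# Sample-based estimate: average of score(a) * r over actions sampled from the policy
est = np.zeros(n_actions)
n_samples = 100_000
for _ in range(n_samples):
    a = rng.choice(n_actions, p=p)
    score = -p.copy()
    score[a] += 1.0                  # grad_theta log pi(a) for softmax preferences
    est += score * R[a]
est /= n_samples

print(exact_grad)
print(est)                           # should be close to exact_grad
```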

Monte Carlo policy gradient

When complete episodes are available, we apply the policy gradient theorem and update the parameters by stochastic gradient ascent. The expectation in the formula is replaced by sampling: the return $v_t$ at time t is used as an unbiased estimate of the action value under the current policy.
The algorithm proceeds as follows: first randomly initialize the policy parameters θ; then, given an episode generated under the current policy,
$$\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$$
update the parameters θ at each time step t = 1, …, T−1 using the return $v_t$. Repeat this for every episode. The specific algorithm is as follows:
(Figure: pseudocode of the Monte Carlo policy gradient algorithm: for every step of each sampled episode, θ ← θ + α ∇_θ log π_θ(s_t, a_t) v_t.)

Note: $v_t$ in the description above is the cumulative (discounted) return defined previously.
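A minimal sketch (my own illustration) of the update used in the algorithm above. `score(s, a, theta)` is assumed to return ∇_θ log π_θ(s,a) (for example the softmax score sketched earlier), an episode is assumed to be a list of (state, action, reward) tuples, and the discount and learning rate are illustrative.

```python
import numpy as np

def reinforce_update(theta, episode, score, alpha=0.01, gamma=0.99):
    """One Monte Carlo policy gradient pass over a single episode."""
    rewards = [r for (_, _, r) in episode]
    T = len(episode)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # v_t = r_{t+1} + gamma * r_{t+2} + ...
        running = rewards[t] + gamma * running
        returns[t] = running
    for t, (s, a, _) in enumerate(episode):
        theta = theta + alpha * score(s, a, theta) * returns[t]
    return theta
```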

Actor-Critic policy gradient

The Monte Carlo policy gradient method uses the return as an estimate of the action value. Although this estimate is unbiased, it is noisy, i.e. it has high variance. If we could estimate the action value relatively accurately and use that estimate to guide policy updates, would learning work better? This is the main idea of the Actor-Critic policy gradient.
Actor-Critic literally means "actor and critic": while the actor performs, the critic provides guidance, and the actor's performance keeps improving. Concretely, a Critic is used to estimate the action value:

$$Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$$
Actor-Critic policy gradient learning is divided into two parts:

  1. Critic: updates the parameters w of the action value function $Q_w(s,a)$.
  2. Actor: updates the policy parameters θ in the direction suggested by the Critic's value estimate.

Specifically, Actor-Critic follows an approximate policy gradient:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\right], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)$$
Clearly, what the Critic does is something we have already seen: policy evaluation. It tells the agent how well the policy $\pi_\theta$ determined by the parameters θ is performing. We already know how to do policy evaluation: Monte Carlo policy evaluation, TD learning, TD(λ), and so on, as well as the least-squares methods introduced in the previous lecture.
A simple Actor-Critic algorithm can use an action-value Critic that approximates the action value function linearly in state-action features: $Q_w(s,a) = \phi(s,a)^\top w$.
The Critic updates w by linear TD(0), and the Actor updates θ by the policy gradient. The specific algorithm flow is as follows:
(Figure: pseudocode of the action-value Actor-Critic algorithm: at each step, the Critic updates w by linear TD(0) and the Actor updates θ ← θ + α ∇_θ log π_θ(s,a) Q_w(s,a).)
Note: This algorithm is only an approximate Actor-Critic algorithm based on the linear value function.
This is an online, real-time algorithm, updated at every step without waiting for the end of the episode. Because the action value function is linear in the features, the factor appearing in its update is $\phi(s,a)$; if in doubt, refer to the summary of the previous lecture.
In policy-based learning the algorithm does not need Ɛ-greedy exploration when selecting actions: the policy follows directly from the parameters θ. The policy parameter update also involves a learning rate α, which is the step size taken in the gradient direction; each update moves the parameters only by an amount determined by α. For example, if the current gradient direction favors the "left" action, the policy parameters are moved a certain amount in that direction. If α is too large, decisions quickly become heavily biased toward "left", which is effectively greedy decision-making without exploration. As long as learning continues, changes in the gradient still make it possible to try other actions; throughout this process, α controls how smoothly the policy is updated. A code sketch of this Actor-Critic loop follows.
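Below is a minimal sketch (my own illustration, not the lecture's reference implementation) of one episode of this approximate Actor-Critic: a linear Critic Q_w(s,a) = φ(s,a)ᵀw updated by TD(0), and an Actor updated with the score times Q_w. The helpers `phi`, `sample_action`, `score`, and the `env.step(s, a) -> (r, s_next, done)` interface are all assumptions.

```python
def actor_critic_episode(env, s, theta, w, phi, sample_action, score,
                         alpha=0.01, beta=0.1, gamma=0.99):
    """Run one episode, updating the actor (theta) and the linear critic (w) at every step."""
    a = sample_action(s, theta)
    done = False
    while not done:
        r, s_next, done = env.step(s, a)
        a_next = sample_action(s_next, theta)
        q_sa = phi(s, a) @ w
        q_next = 0.0 if done else phi(s_next, a_next) @ w
        td_error = r + gamma * q_next - q_sa                 # TD(0) error for the critic
        theta = theta + alpha * score(s, a, theta) * q_sa    # actor: policy gradient step
        w = w + beta * td_error * phi(s, a)                  # critic: linear TD(0) step
        s, a = s_next, a_next
    return theta, w
```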

Q: If I use the policy gradient method, is it still guaranteed to find a unique global optimal solution, or will I fall into a local optimal solution?
The answer: if the policy is derived from a value function represented by a lookup table, convergence to the global optimum is guaranteed. The same is true for learning directly based on the policy when a table-lookup representation is used, for example with a softmax policy. But if a general function approximator such as a neural network is used, then whether the method is value-based or policy-based, it may get stuck in a local optimum. For representations in between, there are no complete results.
Approximating $Q_w(s,a)$ with a linear combination of features and then computing the policy gradient from it introduces bias, and a policy gradient obtained from a biased value estimate will not necessarily lead to a better solution. For example, if the approximate $Q_w(s,a)$ uses features that alias states, can we still solve the gridworld problem mentioned earlier (finding the money bag in the aliased grid world)? Not necessarily. Fortunately, if we carefully design the approximate $Q_w(s,a)$, this bias can be avoided, and we are then effectively following the exact policy gradient.

Compatible Function Approximation

So what counts as a carefully designed $Q_w(s,a)$?
Theorem (compatible function approximation): if the following two conditions are satisfied, then the policy gradient computed with the approximate value function $Q_w(s,a)$ equals the exact policy gradient, so gradient ascent can still converge to a (local) optimum:

  1. The gradient of the approximate value function is exactly the score function of the policy (the approximator is "compatible" with the policy):
    $$\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)$$
  2. The value function parameter w minimizes the mean square error:
    $$\varepsilon = \mathbb{E}_{\pi_\theta}\left[\left(Q^{\pi_\theta}(s,a) - Q_w(s,a)\right)^2\right]$$

Proof:
(Figure: proof of the compatible function approximation theorem.)

Reducing Variance Using Baseline

The basic idea is to subtract a baseline function B(s) from the action value in the policy gradient. B(s) must depend only on the state, not on the action, so that it does not change the expected gradient while it can reduce its variance. When B(s) has this property, the following derivation holds:
$$\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\right] = \sum_{s\in\mathcal{S}} d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s,a)\, B(s) = \sum_{s\in\mathcal{S}} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_{a\in\mathcal{A}} \pi_\theta(s,a) = 0$$
Explanation of the derivation: by the definition of expectation, the expected product of the score function and the baseline can be written, as in the first expression, as a sum over the state distribution and over actions of the policy gradient times B(s). Since B(s) does not depend on the action, it can be pulled out of the sum over actions, and the gradient can likewise be pulled outside the sum (the sum of gradients equals the gradient of the sum). The remaining sum of the policy over all actions is 1 by the definition of a policy, and the gradient of a constant is 0, so the whole expression equals 0.

In principle any function that does not depend on the action can serve as B(s). A good choice of B(s) is the state value function of the current state:
$$B(s) = V^{\pi_\theta}(s)$$
In this way we can define an advantage function $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$, and the gradient of the objective function can be written as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)\right]$$

The advantage function can significantly reduce the variance of the gradient estimate, so the Critic part of the algorithm can estimate the advantage function instead of only the action value function. In that case we would need two approximators, i.e. two sets of parameters: one approximating the state value function and one approximating the action value function, both updated by TD learning, with the advantage computed from them:
$$V_v(s) \approx V^{\pi_\theta}(s), \qquad Q_w(s,a) \approx Q^{\pi_\theta}(s,a), \qquad A(s,a) = Q_w(s,a) - V_v(s)$$
However, in practice this is not necessary, for the following reason. Define the TD error $\delta^{\pi_\theta}$ with respect to the true state value function $V^{\pi_\theta}(s)$:
$$\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$$
The following derivation shows that $\delta^{\pi_\theta}$ is an unbiased estimate of the advantage function $A^{\pi_\theta}(s,a)$:
$$\mathbb{E}_{\pi_\theta}\left[\delta^{\pi_\theta} \mid s,a\right] = \mathbb{E}_{\pi_\theta}\left[r + \gamma V^{\pi_\theta}(s') \mid s,a\right] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s,a)$$
We can therefore write the policy gradient using the TD error:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \delta^{\pi_\theta}\right]$$
In practice we use an approximate TD error, replacing the true state value function with its approximation:
$$\delta_v = r + \gamma V_v(s') - V_v(s)$$
The advantage of doing this is that only one set of parameters, describing the state value function, is needed; there is no longer any need to approximate the action value function. A code sketch of this TD-error Actor-Critic update follows.
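Here is the corresponding minimal sketch (my own illustration): the Critic is a linear state value function V_v(s) = φ_s(s)ᵀv, and the TD error replaces the advantage in the Actor update. `phi_s`, `sample_action`, `score`, and the environment interface are assumed helpers, as before.

```python
def advantage_actor_critic_step(s, theta, v, env, phi_s, sample_action, score,
                                alpha=0.01, beta=0.1, gamma=0.99):
    """One online step: TD(0) update of the state-value critic, TD-error update of the actor."""
    a = sample_action(s, theta)
    r, s_next, done = env.step(s, a)
    v_s = phi_s(s) @ v
    v_next = 0.0 if done else phi_s(s_next) @ v
    td_error = r + gamma * v_next - v_s                      # estimate of the advantage
    theta = theta + alpha * score(s, a, theta) * td_error    # actor update
    v = v + beta * td_error * phi_s(s)                       # critic (TD(0)) update
    return s_next, theta, v, done
```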

Using TD(λ) for the critic process

(Figure: Critic update targets for MC (the return $G_t$), TD(0), forward-view TD(λ), and backward-view TD(λ) with eligibility traces.)
Note: MC must wait until the end of the episode (the $v_t$ in the formula is $G_t$); TD(0) looks only one step ahead; the forward view of TD(λ) also requires waiting until the end of the episode; the backward view of TD(λ) updates in real time, using eligibility traces that combine frequency and recency memory.

Using TD(λ) for the actor process

(Figure: the corresponding Actor updates using the MC return, the TD(0) error, and the backward-view TD(λ) error with an eligibility trace.)

For both the Critic and the Actor, the backward-view TD(λ) algorithm can be applied to practical problems online, in real time, without requiring complete episodes.

The policy gradient comes in many forms, all using stochastic gradient ascent; the Critic part amounts to policy evaluation and, as discussed in the previous lectures, can use MC, TD, TD(λ), and so on to estimate the state value function $V^\pi(s)$, the action value function $Q^\pi(s,a)$, or the advantage function $A^\pi(s,a)$.

Final summary:
(Figure: summary of the policy gradient forms, in which the score function $\nabla_\theta \log \pi_\theta(s,a)$ is multiplied by the return, the action value, the advantage, or the TD error.)
