Reinforcement Learning Study Notes-13: Policy Gradient Methods

Reinforcement learning algorithms are ultimately about learning optimal decisions. So far, the action choices we have discussed have all been made indirectly through a value estimation function. This section discusses using a parameterized policy model \pi(a|s,\theta) to select actions directly from the state s, rather than indirectly through a value function v(s|w).

To learn the parameters of the policy model \pi(a|s,\theta), we can define the following policy gradient update rule, where J(\theta_t) is the objective function used to measure how good or bad the policy is:

\theta_{t+1} = \theta_{t} + \beta\, \partial_{\theta_{t}} J(\theta_t)

1. Policy Approximation and its Advantages

There are two modeling approaches for the parameterized policy model \pi(a|s,\theta): discriminative or generative.

When the action space is discrete and small, a discriminative model h(s,a,\theta) can be used to score how good each state-action pair (s,a) is, and \pi(a|s,\theta) can then be expressed by the formula below. This approach turns the action preferences into selection probabilities through a softmax, which already plays the role of ε-greedy exploration. Moreover, in many problems the optimal policy is itself stochastic, and the softmax parameterization can represent such policies.

\pi(a|s,\theta )=\frac{e^{h(s,a,\theta)}}{\sum_{a'}e^{h(s,a',\theta)}}
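As a concrete illustration, here is a minimal sketch (not from the original notes) of this softmax parameterization, assuming linear action preferences h(s,a,\theta)=\theta^\top x(s,a) with a hypothetical feature function x(s,a):

```python
import numpy as np

def feature(s, a):
    """Hypothetical state-action feature vector x(s, a)."""
    return np.array([s, a, s * a, 1.0])

def softmax_policy(s, theta, actions):
    """pi(a|s,theta): softmax over linear preferences h(s,a,theta) = theta^T x(s,a)."""
    prefs = np.array([theta @ feature(s, a) for a in actions])
    prefs -= prefs.max()                 # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()   # normalize over all actions a'

# Sampling according to these probabilities keeps the policy stochastic (exploration)
actions = [0, 1, 2]
probs = softmax_policy(s=0.5, theta=np.zeros(4), actions=actions)
a = np.random.choice(actions, p=probs)
```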

When the action space is continuous, a generative model is the better choice. A simple approach is to model the action distribution \pi(a|s,\theta) as a Gaussian. Because the distribution has nonzero variance, this parameterization also provides action exploration.

\pi(a|s,\theta)=\frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)
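A minimal sampling sketch for such a Gaussian policy, assuming \mu(s,\theta) and \sigma(s,\theta) are linear in a hypothetical state-feature vector x(s), with an exponential to keep \sigma positive:

```python
import numpy as np

def state_feature(s):
    """Hypothetical state feature vector x(s)."""
    return np.array([s, 1.0])

def gaussian_policy_sample(s, theta_mu, theta_sigma):
    """Sample a continuous action a ~ N(mu(s,theta), sigma(s,theta)^2)."""
    x = state_feature(s)
    mu = theta_mu @ x                  # mean of the action distribution
    sigma = np.exp(theta_sigma @ x)    # std. dev.; exp() keeps sigma strictly positive
    return np.random.normal(mu, sigma), mu, sigma

a, mu, sigma = gaussian_policy_sample(s=0.3, theta_mu=np.zeros(2), theta_sigma=np.zeros(2))
```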

2. The Policy Gradient Theorem

The next key question is how to define an objective function J(\theta) that measures how good the policy is. Intuitively, under the optimal policy the value of every state should also be optimal, so J(\theta) can be defined as:

J(\theta)=\sum_s \mu(s) \sum_a \pi(a|s,\theta) Q(s,a)

By the policy gradient theorem, the gradient of this objective can be written without differentiating the state distribution \mu(s) or the action values Q(s,a):

\partial_{\theta} J(\theta)\\=\sum_s \mu(s) \sum_a Q(s,a)\, \partial_{\theta}\pi(a|s,\theta) \\=\sum_s \mu(s) \sum_a \pi(a|s,\theta)\, Q(s,a)\, \frac{\partial_{\theta}\pi(a|s,\theta)}{\pi(a|s,\theta)}\\=E\left[G_t(s,a)\frac{\partial_{\theta}\pi(a|s,\theta)}{\pi(a|s,\theta)}\right]

Here the expectation is over states drawn from \mu and actions drawn from \pi, and the sampled return G_t(s,a) is an unbiased estimate of Q(s,a). This gives the following stochastic gradient ascent update for the parameters:

\theta_{t+1}=\theta_{t} + \beta G_t(s,a) \frac{\partial_{\theta}\pi(a|s,\theta_t)}{\pi(a|s,\theta_t)}

Combining this update with Monte Carlo estimation of the return G_t gives the Monte Carlo policy gradient (REINFORCE) algorithm, sketched below.
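The sketch below is a minimal, assumed implementation: the environment exposes a simplified, hypothetical env.reset()/env.step(a) interface returning (next_state, reward, done), and grad_log_pi(s, a, theta) computes \partial_{\theta}\pi(a|s,\theta)/\pi(a|s,\theta) = \partial_{\theta}\log\pi(a|s,\theta).

```python
import numpy as np

def reinforce(env, policy_probs, grad_log_pi, theta, num_actions,
              episodes=1000, beta=0.01, gamma=1.0):
    """Monte Carlo policy gradient (REINFORCE) following the update above.

    policy_probs(s, theta)   -> vector of pi(.|s, theta)
    grad_log_pi(s, a, theta) -> gradient of log pi(a|s,theta) w.r.t. theta
    """
    for _ in range(episodes):
        # 1. Generate one full episode by following the current policy
        states, acts, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=policy_probs(s, theta))
            s_next, r, done = env.step(a)
            states.append(s); acts.append(a); rewards.append(r)
            s = s_next
        # 2. Walk backwards to accumulate the return G_t, then update theta
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            theta = theta + beta * G * grad_log_pi(states[t], acts[t], theta)
    return theta
```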

3. Baseline

Because the same parameters \theta are updated from samples of many different state-action pairs (s,a), and the cumulative return G of some states is much larger than that of others, the size of each update is dominated by which state the sample came from rather than by how good the chosen action was in that state. This state-dependent offset in G introduces variance that harms learning.

Therefore, the state-dependent part can be subtracted from the cumulative return G as a baseline, that is,

\hat{G}_t(s,a)=G_t(s,a)-v_t(s)

At this point the parameter update can be changed to:

\theta_{t+1}=\theta_{t} + \beta (G_t(s,a) - v_t(s))\frac{\partial_{\theta}\pi(a|s,\theta_t)}{\pi(a|s,\theta_t)}

In addition, subtracting this state-dependent baseline leaves the gradient of the objective J(\theta) unchanged, because \sum_a \partial_{\theta}\pi(a|s,\theta)=\partial_{\theta}\sum_a \pi(a|s,\theta)=\partial_{\theta}1=0:

\partial_{\theta} J(\theta)\\=\sum_s \mu(s) \sum_a (G(s,a) -v(s))\,\partial_{\theta}\pi(a|s,\theta)\\=\sum_s \mu(s) \left(\sum_a G(s,a)\,\partial_{\theta}\pi(a|s,\theta) - v(s)\sum_a \partial_{\theta}\pi(a|s,\theta)\right)\\ =\sum_s \mu(s) \sum_a G(s,a)\,\partial_{\theta}\pi(a|s,\theta)
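A per-episode update sketch for REINFORCE with this baseline, assuming hypothetical helpers v(s, w) and grad_v(s, w) for the learned state-value baseline, and grad_log_pi(s, a, theta) for \partial_{\theta}\log\pi(a|s,\theta):

```python
def reinforce_with_baseline(episode, theta, w, grad_log_pi, v, grad_v,
                            beta=0.01, alpha=0.05, gamma=1.0):
    """Update theta and w from one episode given as a list of (s, a, r) tuples.

    The baseline v(s|w) is trained toward the return G_t, and the policy is
    updated with the centered return G_t - v_t(s), as in the formula above.
    """
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        delta = G - v(s, w)                                       # G_t(s,a) - v_t(s)
        w = w + alpha * delta * grad_v(s, w)                      # baseline (value) update
        theta = theta + beta * delta * grad_log_pi(s, a, theta)   # policy update
    return theta, w
```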

4. Actor–Critic Methods 

Replacing the Monte Carlo return with the one-step TD target R_{t+1} + \gamma v(s_{t+1}|w) turns the method into an actor–critic algorithm: the actor \pi(a|s,\theta) is updated with the TD error, while the critic v(s|w) is updated by semi-gradient TD(0).

\theta_{t+1}=\theta_{t} + \beta (R_{t+1} + \gamma v(s_{t+1}|w_t) - v(s_t|w_t))\frac{\partial_{\theta_t}\pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)} \\ w_{t+1}=w_t + \alpha (R_{t+1} + \gamma v(s_{t+1}|w_t) - v(s_t|w_t))\partial_{w_t} v(s_t|w_t)
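A one-step sketch of these two coupled updates, again with hypothetical helpers v(s, w), grad_v(s, w), and grad_log_pi(s, a, theta):

```python
def actor_critic_step(s, a, r, s_next, done, theta, w,
                      grad_log_pi, v, grad_v,
                      beta=0.01, alpha=0.05, gamma=0.99):
    """One TD(0) actor-critic update, mirroring the two formulas above."""
    # TD target bootstraps from v(s_{t+1}|w); at episode end the bootstrap term is 0
    target = r + (0.0 if done else gamma * v(s_next, w))
    delta = target - v(s, w)                                  # TD error
    theta = theta + beta * delta * grad_log_pi(s, a, theta)   # actor: policy parameters
    w = w + alpha * delta * grad_v(s, w)                      # critic: value parameters
    return theta, w
```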

 


Origin blog.csdn.net/tostq/article/details/131212697