Reinforcement Learning: Policy Gradients

Policy gradient

In value-based reinforcement learning we approximated the value function. Policy-based methods apply the same idea to the policy itself, describing \(\pi\) as a function with parameters \(\theta\):

\[\pi_{\theta}(s, a)=P(a | s, \theta) \approx \pi(a | s) \]

Suppose we have a policy \(\pi_\theta(a|s)\). There is also a state transition probability \(p(s'|s,a)\). An entire trajectory, whose distribution depends on the parameters \(\theta\), is denoted \(\tau\):

\[\underbrace{p_{\theta}\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}\right)}_{p_{\theta}(\tau)}=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \]

Our goal is then to optimize \(\theta\) so that the expected total return is maximized:

\[\theta^* = \arg\max_\theta E_{\tau\sim p_\theta(\tau)} \left[\sum_t r(s_t,a_t)\right] \]

Since we cannot compute this expectation directly, we estimate it by sampling: we interact with the environment many times and average the results.

This gives an objective function analogous to the ones in ordinary machine learning, which we approximate with multiple samples:

\[J(\theta)=E_{\tau\sim p_\theta(\tau)} \left[\sum_t r(s_t,a_t)\right]\approx \frac{1}{N}\sum_{i=1}^N \sum_t r(s_{i,t},a_{i,t}) \]

(The slides are slightly inconsistent in notation: \(p_\theta(\tau)\) and \(\pi_\theta(\tau)\) are equivalent and both denote the trajectory distribution induced by the policy.) Through a chain of reasoning we can obtain the gradient of the objective function:

\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(\tau)\, r(\tau)\right] \]
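This follows from the standard log-derivative (likelihood-ratio) trick; spelling out the chain of reasoning, with \(r(\tau)=\sum_t r(s_t,a_t)\) denoting the total return of a trajectory:

\[\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = E_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] \]

using the identity \(\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\).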

We can expand \(\log p_\theta(\tau)\) and drop the terms whose derivative with respect to \(\theta\) is zero in order to obtain a more explicit formula.

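Spelling out that step: by the factorization of \(p_\theta(\tau)\) given earlier, the initial-state and transition terms do not depend on \(\theta\), so their gradients vanish:

\[\nabla_\theta \log p_\theta(\tau) = \nabla_\theta\left[\log p(\mathbf{s}_1) + \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^{T} \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\right] = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \]

Substituting this into the expectation gives: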

\[\nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right] \]

With the gradient formula above, we can start from an initial \(\theta\), interact with the environment, estimate the gradient of the objective, and update \(\theta\) to increase the expected return. In other words, \(\theta\) is updated from samples.

So what form does the policy \(\pi\) actually take?

For discrete action spaces, a softmax policy is commonly used. The policy must satisfy \(\sum_a \pi(a|s) = 1\) for every state \(s \in S\); to guarantee this, we introduce an action preference function and pass its values through a softmax to obtain the policy:

\[\pi(a | s ; \theta)=\frac{e^{h(s, a ; \theta)}}{\sum_{a^{\prime}} e^{h\left(s, a^{\prime} ; \theta\right)}} \]

Assuming the action preferences are linear in the parameters, \(h(s, a; \theta)=\theta^{\mathrm{T}} \phi(s, a)\) for a feature vector \(\phi(s, a)\), the gradient of the log-policy (the score function) is:

\[\nabla_{\theta} \log \pi_{\theta}(s, a)=\phi(s, a)-\mathbb{E}_{\pi_{\theta}}[\phi(s, a)] \]
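As a sanity check, here is a minimal PyTorch sketch (the feature map \(\phi\), dimensions, and the choice of action are invented for illustration) that compares the autograd gradient of \(\log \pi_\theta(a|s)\) with the closed form above:

```python
import torch

# Hypothetical setup: a fixed state s with 4 discrete actions and 6-dim features phi(s, a).
num_actions, feat_dim = 4, 6
phi = torch.randn(num_actions, feat_dim)            # rows are phi(s, a)
theta = torch.randn(feat_dim, requires_grad=True)   # policy parameters

# Linear preferences h(s, a; theta) = theta^T phi(s, a), softmax policy.
h = phi @ theta
pi = torch.softmax(h, dim=0)

a = 2  # an arbitrary action
log_pi_a = torch.log(pi[a])
log_pi_a.backward()

# Closed form: phi(s, a) - E_{a' ~ pi(.|s)}[phi(s, a')]
closed_form = phi[a] - (pi.detach().unsqueeze(1) * phi).sum(dim=0)
print(torch.allclose(theta.grad, closed_form, atol=1e-5))  # expected: True
```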

For continuous action spaces, a Gaussian policy is common: actions are drawn from a Gaussian distribution \(\mathcal{N}\left(\phi(\mathbf{s})^{\mathrm{T}} \theta, \sigma^{2}\right)\). The corresponding gradient of the log-policy is:

\[\nabla_{\theta} \log \pi_{\theta}(s, a)=\frac{\left(a-\phi(s)^{\mathrm{T}} \theta\right) \phi(s)}{\sigma^{2}} \]
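A similarly minimal sketch (again with an invented \(\phi(s)\) and \(\sigma\), not from the original post) of the Gaussian policy, checking the log-probability gradient against the formula above:

```python
import torch

# Hypothetical setup: 5-dimensional features phi(s) for a fixed state s.
feat_dim = 5
phi_s = torch.randn(feat_dim)                        # phi(s)
theta = torch.randn(feat_dim, requires_grad=True)    # policy parameters
sigma = 0.5

# Gaussian policy: a ~ N(phi(s)^T theta, sigma^2)
mean = phi_s @ theta
dist = torch.distributions.Normal(mean, sigma)
a = dist.sample()

dist.log_prob(a).backward()

# Closed form: (a - phi(s)^T theta) phi(s) / sigma^2
closed_form = (a - mean.detach()) * phi_s / sigma**2
print(torch.allclose(theta.grad, closed_form, atol=1e-5))  # expected: True
```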


Recall the objective from earlier:

\[J(\theta)=E_{\tau\sim p_\theta(\tau)} \left[\sum_t r(s_t,a_t)\right]\approx \frac{1}{N}\sum_{i=1}^N \sum_t r(s_{i,t},a_{i,t}) \]

Sampling \(N\) trajectories, we can estimate the gradient of the objective function:

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{i, t} | \mathbf{s}_{i, t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)\right) \]

At this point we can state our reinforcement learning algorithm (a code sketch follows the list):

  1. sample \(\{\tau^i\}\) from \(\pi_\theta(a_t|s_t)\) (run the policy)
  2. compute the gradient \(\nabla_\theta J(\theta)\) with the formula above
  3. \(\theta \leftarrow \theta+\alpha\nabla_\theta J(\theta)\)
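A condensed PyTorch sketch of this loop, loosely modeled on the REINFORCE example linked at the end of this post. The CartPole-v1 environment, network sizes, learning rate, and the classic Gym API are assumptions made here for illustration; it uses the total-return estimator exactly as written above, with one trajectory per update:

```python
import gym
import torch
import torch.nn as nn

# Assumes the classic Gym API: reset() -> obs, step() -> (obs, reward, done, info).
env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    # 1. Sample a trajectory by running the current policy.
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # 2. Surrogate loss whose gradient matches the estimator above:
    #    (sum_t grad log pi(a_t|s_t)) * (sum_t r(s_t, a_t)) for the sampled trajectory.
    loss = -torch.stack(log_probs).sum() * sum(rewards)

    # 3. theta <- theta + alpha * grad J(theta)  (gradient ascent via minimizing -J).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```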

You may be wondering: what exactly is the policy \(\pi_\theta(a_t|s_t)\)? Take autonomous driving as an example: the state \(s\) is the current road conditions, the actions are turning left, turning right, and going straight, and the parameters \(\theta\) are the weights and biases of a neural network.

Because of the randomness of sampling, the algorithm above has high variance.

So how can we reduce the variance of the algorithm?

A basic principle is causality: the policy at time \(t'\) cannot affect the reward at time \(t\) when \(t < t'\) (an action taken later cannot change a reward already received).


Let \(\hat{Q}_{i,t}\) denote the sum of rewards obtained from time \(t\) to the end of the \(i\)-th sampled trajectory (the reward-to-go). The gradient estimate becomes:

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{i, t} | \mathbf{s}_{i, t}\right) \hat{Q}_{i, t} \]
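As a concrete illustration, a plain Python sketch (a hypothetical helper, undiscounted, matching the formula above) of how \(\hat{Q}_{i,t}\) can be computed from one trajectory's reward sequence:

```python
def rewards_to_go(rewards):
    """Q_hat[t] = sum of rewards from time t to the end of the trajectory."""
    q_hat = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q_hat[t] = running
    return q_hat

print(rewards_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```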

A further improvement: the baseline

Reinforcement learning should increase the probability of good actions and decrease the probability of bad ones. But if a good action earns a reward of 10001 and a bad one earns 10000, the difference is tiny relative to the rewards themselves and the learning signal barely distinguishes them. An obvious improvement is to subtract the average reward as a baseline:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) \left[r(\tau_i)-b \right] \]

\[b = \frac{1}{N} \sum_{i=1}^N r(\tau_i) \]
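A minimal sketch of this baselined estimator as a surrogate loss (a hypothetical helper with hypothetical inputs: `log_prob_sums[i]` is \(\sum_t \log \pi_\theta(a_{i,t}|s_{i,t})\) for trajectory \(i\), and `returns[i]` is its total reward \(r(\tau_i)\)):

```python
import torch

def policy_gradient_loss(log_prob_sums, returns):
    """Surrogate loss whose gradient is the baselined policy-gradient estimate."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    baseline = returns.mean()            # b = (1/N) sum_i r(tau_i)
    advantages = returns - baseline      # r(tau_i) - b
    # Minimizing this loss performs gradient ascent on J(theta).
    return -(torch.stack(log_prob_sums) * advantages).mean()
```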

Policy gradient is on-policy

This means that every time the policy changes, we need to interact with the environment again to collect new samples.

A PyTorch implementation of REINFORCE: https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py
