p4
recapping policy gradients.
the gradient is computed from a sampling estimate of the original objective, averaged over N trajectories with T time steps each:
∇_θ J(θ) ≈ (1/N) Σ_{i=1..N} Σ_{t=1..T} ∇_θ log π_θ(a_{i,t} | s_{i,t}) Q̂_{i,t}.
the 'reward to go' Q̂_{i,t} = Σ_{t'=t..T} r(s_{i,t'}, a_{i,t'}) is the return of trajectory i from time t onward, i.e. the quantity summed by the inner sigma.
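The reward-to-go sum above can be sketched in a few lines; this is a generic illustration (the function name is mine, not from the slides), computed backwards so each step reuses the running total:

```python
def reward_to_go(rewards):
    """Q_hat[t] = sum of rewards from time t to the end of one trajectory."""
    total = 0.0
    out = []
    for r in reversed(rewards):  # accumulate from the end of the trajectory
        total += r
        out.append(total)
    return out[::-1]  # flip back to forward time order
```

For example, rewards [1, 2, 3] give reward-to-go values [6, 5, 3].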
p5
If we interpret Q̂_{i,t} as an estimate of the expected return of a trajectory starting from (s_{i,t}, a_{i,t}), this estimate is obviously not accurate, because it is computed from a single trajectory.
In this sense, we recover the Q-function: Q^π(s_t, a_t) is the expected return given that we take a_t in s_t and then follow π. The reward-to-go term is an unbiased estimate of the Q-function, but it has high variance because the sample consists of only one trajectory (i).
p6
recall that a way to reduce variance is to subtract a baseline, and a typical choice for this baseline is the average return under the current policy, which is exactly the value function V^π(s_t).
So with a value-function baseline, the original 'reward to go' term reduces to the advantage function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t).
p7
now we know that the final objective policy gradients use is still a single-sample, high-variance, though unbiased, estimate of the advantage term.
p8
now we think about how to get a good estimate of V, Q, or A.
As Q and A can both be expressed in terms of V, we first consider fitting the value function:
Q^π(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}}[V^π(s_{t+1})] ≈ r(s_t, a_t) + V^π(s_{t+1}).
note that here, representing Q with V involves a single-sample approximation of the expectation over s_{t+1}, which keeps it unbiased but with high variance.
p9-10
policy evaluation evaluates how good a policy is, but does not change the policy.
Monte Carlo evaluation uses the same estimate as policy gradients. Strictly, for any state s_t, a multiple-sample estimate should be applied to estimate V^π(s_t), e.g. an average of Σ_{t'=t..T} r(s_{t'}, a_{t'}) over several rollouts from s_t. This is impractical, because it means sampling several trajectories from the prescribed state s_t, and the simulator cannot be controlled to roll back or roll out to this state.
This does not mean there is only one trajectory in total. Multiple trajectories are sampled from the simulator; it is just that for any state in any trajectory, there is only one sample of the 'reward to go', since the simulator cannot afford to sample multiple rollouts from every state in a trajectory.
Some variance remains, since the 'reward to go' is still a single-sample estimate, but it is smaller than when using one trajectory overall. The reason is that although the same state cannot appear in more than one trajectory, states across trajectories are similar, which approximates the case where multiple 'reward to go' samples can be estimated for each state.
Although the estimate of the reward to go is the same as in policy gradients, here we use it to fit the value function, not to compute gradients for the policy update.
The regression for the value function follows the ordinary supervised style, with training pairs (s_{i,t}, y_{i,t}) where y_{i,t} = Σ_{t'=t..T} r(s_{i,t'}, a_{i,t'}).
To reiterate, the training data is composed of multiple trajectories with one-way forward reward records; no rollback is involved.
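A minimal sketch of this supervised fit, assuming a linear value function fit by least squares as a stand-in for the slides' neural network with an MSE loss (the function name and the linear model are my illustration, not the lecture's):

```python
import numpy as np

def fit_value_function(states, rewards):
    """Monte Carlo policy evaluation as ordinary supervised regression.

    states:  list of trajectories, each an array of shape (T, d) of state features
    rewards: list of trajectories, each an array of shape (T,) of rewards

    Each state s_{i,t} is paired with its single-sample reward-to-go target
    y_{i,t} = sum_{t'=t}^{T} r_{i,t'}; a linear value function V(s) = w.s is
    then fit by least squares.
    """
    xs, ys = [], []
    for s_traj, r_traj in zip(states, rewards):
        rtg = np.cumsum(r_traj[::-1])[::-1]  # reward-to-go, accumulated backwards
        xs.append(s_traj)
        ys.append(rtg)
    X = np.concatenate(xs)
    y = np.concatenate(ys)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w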
p11
This slide introduces the bootstrapped estimate.
The only difference is that the regression target is no longer the sum of the real trajectory rewards. It uses a recursive form: only the one-step reward r(s_{i,t}, a_{i,t}) is needed, and the subsequent cumulative reward is given by the current estimate V̂^π_φ(s_{i,t+1}) at the next state:
y_{i,t} ≈ r(s_{i,t}, a_{i,t}) + V̂^π_φ(s_{i,t+1}).
This form enjoys lower variance but is biased.
I suddenly find that this form is the same as what we used to approximate the Q-function from the value function. For the Q-function, the expectation E_{s_{t+1}}[V^π(s_{t+1})] is replaced directly with the sampled V^π(s_{t+1}); for the bootstrapped estimate of the value function, the Monte Carlo target Σ_{t'=t..T} r is replaced with r(s_{i,t}, a_{i,t}) + V̂^π_φ(s_{i,t+1}).
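To see the bootstrapped regression target in action, here is a tabular sketch (my own toy, not from the slides): repeatedly move each V(s) toward r + γ·V(s') instead of toward the full Monte Carlo return.

```python
import numpy as np

def td0_evaluate(transitions, n_states, gamma, alpha=0.5, sweeps=200):
    """Tabular bootstrapped policy evaluation (TD(0)-style).

    transitions: list of (s, r, s_next) tuples, with s_next = None at episode end.
    Each sweep regresses V(s) toward the bootstrapped target r + gamma * V(s_next).
    """
    V = np.zeros(n_states)
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])  # move the estimate toward the target
    return V
```

On a two-state chain (state 0 gives reward 1 then goes to state 1, which gives reward 2 and terminates) with γ = 0.5, this converges to V = [2.0, 2.0], matching the true discounted values. The bias of the method shows up when V̂ is wrong and the target inherits that error.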
p12
examples of TD-Gammon and AlphaGo
p13
batch-mode actor-critic algorithm:
- generate samples (run the policy)
- fit a model (i.e., the value function), via the bootstrapped estimate or Monte Carlo evaluation.
- evaluate the advantage function based on the fitted value function.
- improve the policy via gradient ascent on the objective with the estimated (and fixed) advantage.
- go back to 1) until convergence.
The regression for the value function uses a supervised MSE loss with gradient descent.
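The loop above can be sketched end to end on a toy problem. This is a minimal illustration under strong simplifying assumptions (a one-state two-action bandit, a critic that is just the batch mean return, names all mine), not the lecture's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One state, two actions; action 1 pays reward 1, action 0 pays 0.
logits = np.zeros(2)
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    # 1) generate samples: run the policy for a small batch
    actions = rng.choice(2, size=32, p=probs)
    rewards = (actions == 1).astype(float)
    # 2) fit the critic: here simply the average return of the batch
    V = rewards.mean()
    # 3) evaluate advantages with the fitted, now fixed, critic
    adv = rewards - V
    # 4) improve the policy: gradient ASCENT on the surrogate objective
    grad = np.zeros(2)
    for a, A in zip(actions, adv):
        grad += (np.eye(2)[a] - probs) * A  # grad log softmax, weighted by A
    logits += lr * grad / len(actions)

final_probs = softmax(logits)  # the policy should strongly prefer action 1
```

The key structural point is that the advantage is held fixed while the policy step is taken, exactly as in the batch algorithm above.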
p14
If the episode length has no upper bound, a discount factor on rewards is required, otherwise the value can grow infinitely large.
For the bootstrapped estimate in the recursive form, plug the discount factor γ in before V̂^π_φ(s_{i,t+1}): y_{i,t} ≈ r(s_{i,t}, a_{i,t}) + γ V̂^π_φ(s_{i,t+1}). The intuition for this is that getting rewards sooner is better. A typical value of γ is 0.99.
This is like adding a dead state from which no real state and reward can be reached, entered with probability 1 − γ at every step.
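A quick numerical check of why the discount keeps the value finite (a generic geometric-sum computation, not from the slides): an endless stream of constant reward r is worth r / (1 − γ) once discounted.

```python
def discounted_value(r, gamma, horizon):
    """Partial sum of the discounted return for a constant reward stream.

    With gamma < 1 this converges to r / (1 - gamma) as horizon grows;
    without discounting (gamma = 1) the partial sums diverge linearly.
    """
    return sum(r * gamma**t for t in range(horizon))
```

With r = 1 and γ = 0.99, the value converges to 100, which also matches the dead-state reading: 1/(1 − γ) is the expected lifetime before 'death'.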
p15
As mentioned previously, the Q-function approximation has the same form as the value-function regression target, so when evaluating the advantage, γ is likewise plugged in before the estimated V̂^π_φ(s_{t+1}).
The next question is where to plug the discount factor in the Monte Carlo policy gradients.
We have two options:
- plug it into the 'reward to go' term, as our intuition suggests: Σ_{t'=t..T} γ^{t'-t} r(s_{i,t'}, a_{i,t'}).
- re-derive the original formula with γ plugged in before each reward, which yields an extra factor γ^{t-1} in front of each time step's gradient term.
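The two options can be computed side by side; a small sketch (function names mine) using 0-based time indexing, where option 2's weight is γ^t times option 1's:

```python
import numpy as np

def rtg_option1(rewards, gamma):
    """Option 1: discount relative to t: sum_{t'>=t} gamma^(t'-t) * r[t']."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def rtg_option2(rewards, gamma):
    """Option 2: discount from the episode start: sum_{t'>=t} gamma^t' * r[t'],
    i.e. option 1 additionally down-weighted by gamma^t at each time step."""
    return (gamma ** np.arange(len(rewards))) * rtg_option1(rewards, gamma)
```

For rewards [1, 1, 1] and γ = 0.5, option 1 gives [1.75, 1.5, 1.0] while option 2 gives [1.75, 0.75, 0.25]: the later steps shrink under option 2, which is exactly the effect discussed next.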
p16
Although option 2 is the natural and rigorous derivation, option 1 is used in practice, because option 2 additionally down-weights the gradient at later time steps by γ^{t-1}, while option 1 has no such discount on the remote future.
Option 1 can also be seen as an approximation to the policy gradient of the expected return without discount: the discount is added only in the 'reward to go' term, to reduce the variance contributed by the remote future, which may be noisy or uncertain.
p17
Batch-mode actor-critic with discount is almost the same as before, except that the discount is plugged into the advantage estimate and the regression target.
Online actor-critic differs from the previous algorithm as follows:
- each generated sample is indeed a single transition (s, a, r, s').
- the regression for the value function must use the bootstrapped estimate only, because no real data for the future trajectory is known in this case.
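A single online step can be sketched as follows, assuming a tabular critic and a tabular softmax actor (all names and the tabular setup are my illustration; the lecture uses function approximators):

```python
import numpy as np

def online_ac_update(V, policy_logits, s, a, r, s_next, done,
                     gamma=0.99, alpha_v=0.1, alpha_pi=0.1):
    """One online actor-critic update from a single transition (s, a, r, s').

    The critic target must be bootstrapped, since the rest of the
    trajectory is not available online.
    """
    target = r + (0.0 if done else gamma * V[s_next])
    adv = target - V[s]                  # one-transition advantage estimate
    V[s] += alpha_v * (target - V[s])    # critic regression step (MSE gradient)
    probs = np.exp(policy_logits[s] - policy_logits[s].max())
    probs /= probs.sum()
    grad_logp = -probs
    grad_logp[a] += 1.0                  # grad of log softmax w.r.t. the logits
    policy_logits[s] += alpha_pi * adv * grad_logp  # actor ascent step
    return V, policy_logits
```

A transition with positive advantage raises both V(s) and the logit of the taken action, which is the whole online loop in miniature.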