Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

Copyright: Hazekiah Wang ([email protected]). Original post: https://blog.csdn.net/u010909964/article/details/84450913

motivation

Model-based approaches enjoy 1) sample efficiency (they learn quickly) and 2) a reward-independent dynamics model (whereas model-free approaches need the reward function to update), but they lag behind model-free approaches in asymptotic performance (they converge to sub-optimal solutions).

This work is based on two observations:

  1. model capacity matters
    GPs are data-efficient but lack expressiveness; NNs are expressive but learn slowly (in terms of samples)?
  2. the above issue can be mitigated by incorporating uncertainty estimates.
    (Actually, I did not find explicit reasoning for this claim in the paper.)

Regarding related work, the paper claims that the deterministic NNs used in many prior works suffer from overfitting in the early stages of learning.

The author mentions a major challenge in model-based RL: the model should perform well in both low and high data regimes.

Q2:

What causes this? Is it specific to the model-based RL setting?

pipeline

probabilistic ensemble dynamics model

dynamics model

  1. probabilistic NN
    a parametrized conditional distribution model $f_\theta(s_{t+1}\mid s_t, a_t)$, optimized by maximizing the likelihood of environment-produced trajectories.

    A typical choice of the distribution is a diagonal multivariate Gaussian. This is similar to models that predict actions given states in a continuous action space: the network outputs a state mean vector and a state variance vector, and the next state is produced by sampling from that Gaussian.

  2. deterministic NN
    $f(s_{t+1}\mid s_t,a_t) = \delta\big(s_{t+1}-g_\theta(s_t,a_t)\big)$

    here what is estimated is $g_\theta$. There lies a latent assumption that the next state concentrates at a single point (which accords with intuition for near-deterministic systems), so the author writes the model with the Dirac delta $\delta(\cdot)$.

    The formula says that $s_{t+1}$ sits exactly at $g_\theta(s_t,a_t)$: the delta distribution puts all probability mass on the point prediction, leaving no room for noise.

  3. ensemble
    I know this concept mainly from Random Forest. With bootstrapped ensembles of trees, 1) the less correlated the trees are, and 2) the more trees there are, the smaller the variance of their averaged prediction becomes. Bootstrapping means randomly sampling a subset from the same dataset with replacement.

  4. put together
    the dataset for bootstrapping consists of trajectories collected from the environment.

    the ensembling can be applied to either the probabilistic or the deterministic model (a minimal sketch of a probabilistic ensemble follows this list).

    Once this step is done, we have an approximation of the environment's transition function. This learned model should be what the "model" in model-based refers to, though I am not very familiar with the concept.
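
To make the probabilistic-ensemble idea concrete, here is a minimal PyTorch sketch (my own, not the authors' code). The network sizes, the SiLU activation, the log-variance clamp, and predicting the state delta are my assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ProbabilisticDynamics(nn.Module):
    """One ensemble member: predicts mean and log-variance of the state change."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.logvar_head = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean_head(h), self.logvar_head(h).clamp(-10.0, 2.0)

def gaussian_nll(mean, logvar, target):
    # Negative log-likelihood of a diagonal Gaussian (constant term dropped).
    return (((target - mean) ** 2) * torch.exp(-logvar) + logvar).sum(-1).mean()

def train_bootstrapped_ensemble(models, optimizers, states, actions, next_states, epochs=5):
    # Each member gets its own bootstrap: sample the dataset with replacement.
    n = states.shape[0]
    for model, opt in zip(models, optimizers):
        idx = torch.randint(0, n, (n,))
        s, a, target = states[idx], actions[idx], next_states[idx] - states[idx]
        for _ in range(epochs):
            mean, logvar = model(s, a)
            loss = gaussian_nll(mean, logvar, target)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Example construction of a 5-member ensemble (sizes are arbitrary):
# models = [ProbabilisticDynamics(state_dim, action_dim) for _ in range(5)]
# optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]
```

Because each member is trained on its own resample of the data, the members disagree most where data is scarce, which is exactly the epistemic uncertainty discussed next.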

Two uncertainties

  1. aleatoric uncertainty
    noise inherent to the interaction with the environment (it remains no matter how much data we collect)

    the probabilistic output of each model captures such uncertainty.

  2. epistemic uncertainty
    error in the approximation of the transition model, caused by lack of data

    the ensemble captures and mitigates such uncertainty

Planning and control with learned dynamics

This part is a headache for me, as I had never been exposed to these topics before, so it may need a re-check in the future.

As I interpret it: the goal of RL is to find a policy that is optimal in the sense of cumulative reward. Now that we have a transition model of the environment, we can predict future states ourselves, so the agent acts by predicting the future.

To answer the question of how to act, the author uses Model Predictive Control (MPC): at each time step, solve for an optimal action sequence over a finite horizon, apply only its first action, and re-plan at the next step.

The natural question that follows is: where do we get an optimal action sequence?
$$\arg\max_{a_{t:t+T}} \sum_{\tau=t}^{t+T} \mathbb{E}\left[r(s_\tau, a_\tau)\right]$$

At least we know there is a pool of candidate action sequences. The paper obtains them with CEM (the cross-entropy method) rather than random-sampling shooting: CEM samples action sequences from a distribution that is iteratively refit toward the previous samples that yielded high reward.
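
As a simplified illustration of CEM-based planning, here is a NumPy sketch; the population size, elite count, and the `evaluate_returns` callback are hypothetical names of my own, and in PETS the returns would come from propagating particles through the learned ensemble.

```python
import numpy as np

def cem_plan(evaluate_returns, act_dim, horizon=25, pop=400, n_elite=40, iters=5):
    """evaluate_returns(seqs) -> predicted return per candidate action sequence."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample a population of action sequences from the current Gaussian.
        seqs = mean + std * np.random.randn(pop, horizon, act_dim)
        returns = evaluate_returns(seqs)
        elite = seqs[np.argsort(returns)[-n_elite:]]  # keep highest predicted return
        # Refit the sampling distribution toward the elite samples.
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    return mean[0]  # MPC: execute only the first action, then re-plan next step
```

Note the MPC pattern in the last line: only the first action of the optimized sequence is executed before re-planning.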

(Note the expectation is taken across bootstraps.)
Then the paper claims

computing the expected trajectory reward using state prediction in closed-form is generally intractable.

Q1:

This appears weird: since we have the state transition model, isn't it a simple feedforward process to apply the actions and get the rewards?

Anyway, the paper then discusses trajectory sampling for state propagation. They use particle-based propagation. The distinctive part is that, since we have multiple bootstraps of the transition model, each particle can be propagated by a different bootstrap.

There are P particles altogether, each beginning from the initial state. At any time step within the horizon, a particle samples its next state from one bootstrap of the transition model. The proposed $TS1$ and $TS\infty$ differ in whether a particle's bootstrap assignment can change over time: under $TS1$ the bootstrap is re-drawn at every step, while under $TS\infty$ it stays fixed for the whole trajectory.
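
A small sketch of the difference, with assumed interfaces (the `models` list and the `sample_next` callback are hypothetical, standing in for one Gaussian draw from a bootstrap model):

```python
import numpy as np

def propagate_particles(models, sample_next, s0, action_seq, num_particles=20, ts_inf=False):
    """models: list of B bootstrapped dynamics models.
    sample_next(model, state, action) -> one sampled next state.
    Returns trajectories of shape (T + 1, num_particles, state_dim)."""
    B = len(models)
    particles = np.repeat(s0[None, :], num_particles, axis=0)
    fixed = np.random.randint(B, size=num_particles)  # TS-infinity: one bootstrap per particle, forever
    traj = [particles]
    for a in action_seq:
        if ts_inf:
            idx = fixed
        else:
            idx = np.random.randint(B, size=num_particles)  # TS1: re-draw the bootstrap every step
        particles = np.stack([sample_next(models[i], s, a) for i, s in zip(idx, particles)])
        traj.append(particles)
    return np.stack(traj)
```

Keeping the bootstrap identity fixed per particle ($TS\infty$) is what allows the aleatoric/epistemic separation quoted below.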

Questions

  1. Q1 above
  2. Q2 above
  3. why is the following:

aleatoric state variance is the average, over bootstraps, of the variance of particles within the same bootstrap, while epistemic state variance is the variance, across bootstraps, of the per-bootstrap particle means.
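
For concreteness, a small NumPy sketch of the decomposition described in the quote; the array shape is my assumption (bootstraps × particles per bootstrap × state dimension):

```python
import numpy as np

def decompose_state_variance(particles):
    """particles: array of shape (B, P, state_dim) -- B bootstraps, P particles each."""
    within_bootstrap_var = particles.var(axis=1)      # (B, state_dim)
    aleatoric = within_bootstrap_var.mean(axis=0)     # average of within-bootstrap variances
    per_bootstrap_mean = particles.mean(axis=1)       # (B, state_dim)
    epistemic = per_bootstrap_mean.var(axis=0)        # variance of the per-bootstrap means
    return aleatoric, epistemic
```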
