7. Reinforcement Learning: Model-Based Reinforcement Learning

Table of Contents

Course Outline

Overview of model-based reinforcement learning

Model-based value optimization

Model-based policy optimization

Summary


Course Outline

model-based RL overview

model-based value optimization

model-based policy optimization

Overview of model-based reinforcement learning

In the earlier lessons on model-free RL, we learned to:

(1) Use policy gradients to learn a policy directly from experience

(2) Use MC or TD methods to learn a value function

The MDP material from the first lesson already involved two algorithms that are model-based:

(1) Policy iteration

(2) Value iteration

In practice we usually cannot obtain the true model of the MDP, which is why the previous lessons introduced model-free methods. This lesson returns to algorithms that rely on an environment model, but with a key difference from the two MDP algorithms above: the model is no longer given. Instead, we estimate an environment model and use it to improve the estimates of the value function and the policy. In other words, in this lesson we obtain a model of the environment by interacting with the environment, and then use that model to improve the value function and the policy.

This lesson covers model-based RL [recall from the MDP lesson that when the model is available, we can run policy iteration and value iteration]:

(1) Learn a model of the environment from experience (this is the difference from our earlier MDP setting, where the model was given)

(2) Use the learned model to improve value/policy optimization

Here is a picture to illustrate the difference:

(1) Figure 1 is model-free: the agent can only interact with the real environment

(2) Figure 2 is model-based: the agent can interact with the real environment or with the model (the model is a learned representation of the real environment)

Modeling the environment enables planning, i.e. obtaining a better policy. Planning is any computation that takes the model as input and, by interacting with that model, generates or improves a policy.

We first gather a lot of observation experience from the environment. From this experience we learn a transition model (a state transition matrix and/or a reward function), and then use planning methods on the learned model to obtain a better policy.

Reinforcement learning algorithms based on an environment model can be divided into two families:

(1) Model-based value optimization: use the model to optimize the value function

(2) Model-based policy optimization: use the model to optimize the policy

The framework of model-based reinforcement learning:

(1) The relationship between learning, planning, and acting

(2) How should we use the real experience we get?

          ① It can be used, as before, to estimate the value function and the policy

          ② It can be used to estimate the environment model

What are the advantages and disadvantages of model-based RL?

(1) The advantage is better sample efficiency, which is especially important for real robots. (Evolutionary algorithms use no gradient and essentially rely on massive trial and error, so their sample efficiency is very low; on-policy methods are also inefficient, because each policy update requires freshly collected data; actor-critic methods sit in the middle; off-policy methods such as SAC are much more efficient because they keep separate behavior and target policies and can reuse old data; model-based methods are the most sample-efficient of all.)

(2) The disadvantage is that we must fit not only the value function or the policy function but also a model of the environment, which introduces two sources of error; moreover, convergence can no longer be guaranteed.

So what exactly is an environment model?

Definition:

① M = {P, R} denotes the environment model, consisting of the state transition function (or matrix) P and the reward function R

② We usually assume that the state transition and the reward are conditionally independent given the current state and action
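In the standard notation for this material (my own transcription of the definition above), the model and the independence assumption can be written as:

```latex
M = \langle \mathcal{P}, \mathcal{R} \rangle, \quad
S_{t+1} \sim \mathcal{P}(S_{t+1} \mid S_t, A_t), \quad
R_{t+1} \sim \mathcal{R}(R_{t+1} \mid S_t, A_t)

\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t]
  = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]
```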

In fact, some models do not need to be learned at all: the rules of games such as chess are known exactly, and some physical dynamics models are also known in advance.

Model-based value optimization

Goal

Learn the environment model M from experience (S1, A1, S2, A2, ..., St). This can in fact be treated quite simply as a supervised learning problem.

Learning the reward function can be regarded as a regression problem: given S and A, regress onto R.

The state transition function is handled in a similar way: given S and A, predict S'.

Either fit can be obtained and optimized with a loss such as MSE or KL divergence.
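As a minimal sketch of this supervised-learning view (the network sizes, variable names, and the use of PyTorch are my own illustrative choices, not from the course), one could fit a dynamics model and a reward model by MSE regression on batches of real transitions:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2   # illustrative dimensions

# Dynamics model f(s, a) -> s' and reward model r(s, a) -> r, both simple MLPs.
dynamics_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                             nn.Linear(64, state_dim))
reward_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(dynamics_net.parameters()) +
                             list(reward_net.parameters()), lr=1e-3)

def fit_model_step(s, a, s_next, r):
    """One supervised (MSE) regression step on a batch of transitions (s, a, s', r)."""
    x = torch.cat([s, a], dim=-1)
    loss = nn.functional.mse_loss(dynamics_net(x), s_next) \
         + nn.functional.mse_loss(reward_net(x).squeeze(-1), r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```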

Possible forms of the environment model:

(1) Table-lookup model

Algorithm framework:

For example:

Step ①: How do we obtain the environment model?

From the 8 trajectories we can count that the transition from A to B occurs 100% of the time, that a reward of 1 is received in 6 of the 8 cases (75%), and a reward of 0 in the other 2 cases (25%). The reward function is therefore also known, and this counting is exactly where the labels in the figure below come from.

Step ②: How do we plan once we have the environment model?

A simple method is to sample from the estimated model to generate many trajectories, and then apply the model-free methods from earlier lessons to them (a small sketch of this counting-and-sampling idea follows the list below).

(2) Linear model

(3) Linear Gaussian model

(4) Gaussian process model

(5) Neural network model
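Here is the counting-and-sampling sketch promised above for the table-lookup model of item (1). Counting the 8 trajectories of the A/B example in this way reproduces the 100% / 75% / 25% figures quoted earlier; the function names are my own, and the action argument is only a placeholder since the original example has no actions:

```python
import random
from collections import defaultdict

# Table-lookup model: count how often each (s, a) leads to each s',
# and average the rewards observed at (s, a).
transition_counts = defaultdict(lambda: defaultdict(int))  # N(s, a, s')
reward_sum = defaultdict(float)                            # total reward seen at (s, a)
visit_count = defaultdict(int)                             # N(s, a)

def record(s, a, r, s_next):
    """Update the table-lookup model with one real transition."""
    transition_counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visit_count[(s, a)] += 1

def sample(s, a):
    """Sample a simulated transition (r, s') from the learned model,
    drawing s' with its empirical frequency and using the mean reward."""
    next_states = list(transition_counts[(s, a)])
    weights = [transition_counts[(s, a)][ns] for ns in next_states]
    s_next = random.choices(next_states, weights=weights)[0]
    r_hat = reward_sum[(s, a)] / visit_count[(s, a)]
    return r_hat, s_next
```

Trajectories generated by repeatedly calling sample can then be fed to any of the model-free methods from the earlier lessons.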

However, reinforcement learning based on an estimated model still has a problem.

What is the problem? The learned model is generally not perfectly accurate, so its estimated P and R are inconsistent with those of the real environment.
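In symbols (using hats for the learned estimates, my notation rather than the original slide's), planning is then carried out in an approximate MDP instead of the true one:

```latex
\hat{\mathcal{P}} \neq \mathcal{P}, \qquad \hat{\mathcal{R}} \neq \mathcal{R},
\qquad
\langle \mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}, \hat{\mathcal{R}} \rangle
  \approx \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle
```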

Therefore, this kind of RL is limited by the accuracy of the model itself. When the model is not accurate enough, planning can only produce a policy that is sub-optimal for the real environment.

Some possible solutions to this problem are:

We can distinguish two kinds of experience and compare them:

(1) Simulated experience: sampled from the learned model, i.e. from an approximate MDP

(2) Real experience: sampled from the real environment, i.e. from the true MDP
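In symbols (standard notation, with hats denoting the learned model):

```latex
\text{Simulated experience:}\quad S' \sim \hat{\mathcal{P}}(S' \mid S, A), \quad R = \hat{\mathcal{R}}(S, A)

\text{Real experience:}\quad S' \sim \mathcal{P}(S' \mid S, A), \quad R = \mathcal{R}(S, A)
```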

Having said all this, is it possible to combine model-based and model-free methods? Yes: the Dyna algorithm! [Proposed by Sutton in 1991; it is not used that much nowadays.]

The structure of the Dyna algorithm:
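A minimal Dyna-Q-style sketch of this structure (the env.reset()/env.step() interface, the hyper-parameters, and the deterministic table model are simplifying assumptions of mine, not the course's exact listing):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=100, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q sketch: learn Q from real experience, learn a table model from the
    same experience, then do extra Q-updates on transitions replayed from the model."""
    Q = defaultdict(float)          # Q[(state, action)]
    model = {}                      # model[(s, a)] = (r, s_next, done)
    seen = []                       # previously observed (s, a) pairs

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action from the current Q estimate
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)        # real experience
            q_update(s, a, r, s_next, done)      # direct RL update
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (r, s_next, done)    # model learning
            for _ in range(n_planning):          # planning with simulated experience
                ps, pa = random.choice(seen)
                pr, ps_next, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return Q
```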

Model-based policy optimization

In the previous section we did model-based RL through the value function: sample experience, learn a model from it, use the model to estimate a value function, and then derive the policy from that value function.

So can we use the model to optimize and improve the policy directly, without estimating the value function? [This feels like going back to the earlier policy-gradient lessons.]

How can such a model be integrated into the policy gradient methods we saw before (those were model-free and made no use of an environment model)? Can introducing an environment model into a model-free policy gradient improve policy optimization?

Model-based policy optimization is heavily influenced by optimal control, because control likewise optimizes a controller using a model of the system: it minimizes a cost function subject to the dynamics constraints.
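In standard optimal-control notation (assuming a cost function c and deterministic dynamics f), this objective can be written as:

```latex
\min_{a_1, \ldots, a_T} \; \sum_{t=1}^{T} c(s_t, a_t)
\quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
```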

The role of optimal control in trajectory optimization:

Algorithm 1:

Combine model learning and trajectory optimization

Algorithm 2:

An improvement on Algorithm 1: add an iterative loop, refitting the model as new data is collected

Algorithm 3:

An improvement on Algorithm 2: planning the whole trajectory in advance lets model errors accumulate and the deviation becomes large, so model predictive control (MPC) is introduced, replanning at every step

Algorithm 4:

Combine the ideas of the three algorithms above (a sketch of this loop is given below)
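As a rough illustration of what Algorithms 2 to 4 amount to when put together (the environment interface, fit_dynamics, predict_cost, and the random-shooting planner are all assumed placeholders, not the course's exact algorithm):

```python
import numpy as np

def model_based_mpc_loop(env, fit_dynamics, predict_cost,
                         n_iterations=10, steps_per_iter=200,
                         horizon=15, n_candidates=500):
    """Iteratively: collect real data, refit the dynamics model on everything
    collected so far, and act via MPC, i.e. replan a short action sequence at
    every step and execute only its first action."""
    data, model = [], None
    for _ in range(n_iterations):
        s = env.reset()
        for _ in range(steps_per_iter):
            if model is None:
                a = env.action_space_sample()            # warm-up: random action
            else:
                # Random-shooting MPC: sample candidate action sequences, score each
                # by its predicted cumulative cost under the learned model, then
                # execute only the first action of the best sequence and replan.
                candidates = [np.stack([env.action_space_sample() for _ in range(horizon)])
                              for _ in range(n_candidates)]
                costs = [predict_cost(model, s, seq) for seq in candidates]
                a = candidates[int(np.argmin(costs))][0]
            s_next, done = env.step(a)
            data.append((s, a, s_next))                  # store real transition
            s = env.reset() if done else s_next
        model = fit_dynamics(data)                       # refit on all real data
    return model
```

The key design choice is that only the first action of each planned sequence is executed (MPC), so small model errors are corrected at the next replanning step, and the model is refit on all real data after each round of interaction.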

How we parameterize the model is also worth some care.

The shaded region in the figure shows the model's uncertainty: where there are observations the uncertainty is close to zero, while in regions with few or no observation points the uncertainty is very high.
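A small illustration of the Gaussian-process behaviour described here, using scikit-learn with made-up one-dimensional data (the data and kernel settings are purely illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up 1-D "dynamics" data: a few observed (state, next_state) pairs.
X = np.array([[0.1], [0.4], [0.5], [0.9]])
y = np.sin(3 * X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-4)
gp.fit(X, y)

# Predict on a grid: near the training points the standard deviation
# (the "shaded part") is close to zero, far from them it grows.
X_test = np.linspace(0, 1, 5).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
print(np.round(std, 3))
```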

Summary

We collect trajectories not only to learn the policy function and the value function, but also to learn an environment model (the state transition and reward functions). With an environment model, the policy and the value function can be optimized more effectively. A major advantage of model-based RL is its sample efficiency, which is very important in robotics applications.

 

Note: All of the content in this article comes from the reinforcement learning course published by Zhou Bolei on Bilibili. I benefited a lot from it, and this article shares my lecture notes. Teacher Zhou's Bilibili homepage: https://space.bilibili.com/511221970?spm_id_from=333.788.b_765f7570696e666f.2

Thanks to Teacher Zhou:)
