"Reinforcement Learning and Optimal Control" Study Notes (3): Overview of Reinforcement Learning Median Space Approximation and Policy Space Approximation

Preface

Link to previous chapter:

"Reinforcement Learning and Optimal Control" Study Notes (2): Comparison of Some Terms between Reinforcement Learning and Optimal Control

This chapter is mainly a brief introduction to the approximation in value space and approximation in policy space discussed in the first chapter of the book.

In the first chapter of the book, it is pointed out that it is usually impossible to solve the optimal control problem exactly with DP, because of the "curse of dimensionality": as the scale of the problem grows, the required computation and memory grow rapidly. Furthermore, in many cases the structure of a given problem is known in advance, but some data, such as various system parameters, may not be known until shortly before control begins, which severely limits the time available for DP computation. So we usually cannot find the optimal solution, but we can find a suboptimal one, striking a reasonable balance between ease of implementation and performance.

Approximation Methods in Reinforcement Learning

There are two methods for suboptimal control based on DP, namely Approximation in value space and Approximation in policy space.

Value space approximation means that we approximate the optimal cost-to-go functions, or the cost functions of a given policy, and then derive a suboptimal policy from that approximation.

Policy space approximation means that we construct a space of policies and optimize over it to select the best policy.

Value space approximation - one-step lookahead

First, let's talk about approximation in value space with one-step lookahead: we use suboptimal functions \tilde{J}_{k} to approximate the optimal cost-to-go functions J_{k}^{*}. There are several ways to compute \tilde{J}_{k}, which will be discussed later. We can then obtain a suboptimal policy \{\tilde{\mu}_{0},...,\tilde{\mu}_{N-1}\} through the following formula:
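
Sketched here in the book's standard finite-horizon notation, assuming the system x_{k+1} = f_k(x_k, u_k, w_k), stage cost g_k(x_k, u_k, w_k), and control constraint u_k \in U_k(x_k), it should look roughly like:

\tilde{\mu}_{k}(x_{k}) \in \arg\min_{u_{k}\in U_{k}(x_{k})} E\left\{ g_{k}(x_{k},u_{k},w_{k}) + \tilde{J}_{k+1}\big(f_{k}(x_{k},u_{k},w_{k})\big) \right\}, \quad k=0,...,N-1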

We can notice that the expectation on the right-hand side was actually mentioned in the first chapter of my study notes; it can be regarded as an approximate Q-factor:
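
In the same sketched notation (with \tilde{Q}_{k} used here as my own shorthand for the approximate Q-factor), it would be, roughly:

\tilde{Q}_{k}(x_{k},u_{k}) = E\left\{ g_{k}(x_{k},u_{k},w_{k}) + \tilde{J}_{k+1}\big(f_{k}(x_{k},u_{k},w_{k})\big) \right\}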

Then we can obtain the policy directly through this approximate Q-factor:
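
Again as a sketch in the same notation:

\tilde{\mu}_{k}(x_{k}) \in \arg\min_{u_{k}\in U_{k}(x_{k})} \tilde{Q}_{k}(x_{k},u_{k})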

In other words, through this approximate Q-factor (for example, one fitted by a neural network), we can obtain the policy directly, without even the intermediate step of obtaining \tilde{J}_{k}.

Value space approximation - Multistep lookahead

The examples we discussed above are all one-step lookahead, that is, we only look one step ahead. There is another variant in which we look several steps ahead, namely multistep lookahead, as shown in the figure below:

This is a more ambitious approach, but it increases the amount of computation (because it looks more steps ahead).
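
To make this concrete, a sketch of the l-step lookahead minimization at state x_{k}, still in the assumed notation above (minimizing over the first control u_{k} and over policies \mu_{k+1},...,\mu_{k+l-1} for the remaining lookahead stages, with \tilde{J}_{k+l} capping the tail), would be roughly:

\min_{u_{k},\,\mu_{k+1},...,\mu_{k+l-1}} E\left\{ g_{k}(x_{k},u_{k},w_{k}) + \sum_{m=k+1}^{k+l-1} g_{m}\big(x_{m},\mu_{m}(x_{m}),w_{m}\big) + \tilde{J}_{k+l}(x_{k+l}) \right\}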

Take two-step lookahead as an example, that is, l = 2 in the figure above. Then we need to express \tilde{J}_{k+1} as:
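
A sketch of that expression in the same assumed notation, where \tilde{J}_{k+2} is the approximation used two steps ahead:

\tilde{J}_{k+1}(x_{k+1}) = \min_{u_{k+1}\in U_{k+1}(x_{k+1})} E\left\{ g_{k+1}(x_{k+1},u_{k+1},w_{k+1}) + \tilde{J}_{k+2}\big(f_{k+1}(x_{k+1},u_{k+1},w_{k+1})\big) \right\}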

Then we substitute this expression into the first formula and obtain \tilde{\mu}_{k}. Notice that only \tilde{\mu}_{k} is actually used to act on the system; \tilde{\mu}_{k+1} is computed only for the purpose of finding \tilde{\mu}_{k} and never directly affects the system. How to compute \tilde{J}_{k+2} (or more generally \tilde{J}_{k+l}) will be discussed later.

So why use multistep lookahead? Intuitively, it is like playing chess: to make one move, you often need to consider the next several moves in order to win, but it is usually too hard to think dozens of moves ahead, let alone all the way to the final move. Thus, this method can achieve better performance even with a less accurate approximation \tilde{J}_{k+l}, which is why it is more ambitious. But is a larger l always better? I think that is debatable.

Policy space approximation

In addition to value space approximation, we can use policy space approximation to obtain suboptimal policies. Policy space approximation works by constructing a parameterized policy space, containing a class of reasonable (suitably constrained) policies to choose from, of the form:

 \mu_k(x_k,r_k),     k=0,...,N-1

Here r_k is the parameter vector; if the policy is a neural network, r_k consists of the weights inside it.
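
As a minimal sketch (my own illustration, not code from the book), assuming the policy is a small one-hidden-layer network and that the state, hidden, and control dimensions below are arbitrary choices, a parameterized policy \mu_k(x_k, r_k) could look like this:

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the book).
STATE_DIM, HIDDEN_DIM, CONTROL_DIM = 4, 16, 2

def unpack(r_k):
    """Split the flat parameter vector r_k into the two weight matrices."""
    n1 = HIDDEN_DIM * STATE_DIM
    W1 = r_k[:n1].reshape(HIDDEN_DIM, STATE_DIM)
    W2 = r_k[n1:].reshape(CONTROL_DIM, HIDDEN_DIM)
    return W1, W2

def mu_k(x_k, r_k):
    """Parameterized policy: maps the current state x_k to a control u_k."""
    W1, W2 = unpack(r_k)
    h = np.tanh(W1 @ x_k)   # hidden layer
    return W2 @ h           # control vector

# Example: evaluate the policy at a state with randomly initialized parameters.
r_k = 0.1 * np.random.randn(HIDDEN_DIM * STATE_DIM + CONTROL_DIM * HIDDEN_DIM)
x_k = np.array([0.5, -0.2, 0.1, 0.0])
u_k = mu_k(x_k, r_k)
```

Once r_k is fixed, computing the control is just a function evaluation at the current state.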

So why does the policy space approximation method exist? Because it computes the control more directly than lookahead minimization: once the policy space has been built, we only need to feed in the current state to obtain the control. This makes it more suitable for online (on-line) use, which will be discussed later.

Our goal is then, of course, to use some optimization method to optimize over this policy space so that we obtain a better policy; this part will also be covered later.
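
As a toy illustration of what such an optimization could look like (again my own sketch, not the book's method), one could tune the parameter vector by simulating the policy and keeping perturbations that lower the resulting cost. Here simulate_cost is a hypothetical user-supplied function that rolls out \mu_k(\cdot, r) on the system (or a simulator) and returns the average total cost:

```python
import numpy as np

def random_search(simulate_cost, r0, num_iters=200, step=0.05, seed=0):
    """Toy optimizer over the policy space: perturb the parameter vector and
    keep the perturbation whenever the simulated cost decreases.
    `simulate_cost(r)` is a hypothetical user-supplied function that rolls out
    the policy mu_k(., r) and returns its average total cost."""
    rng = np.random.default_rng(seed)
    r_best = np.asarray(r0, dtype=float)
    cost_best = simulate_cost(r_best)
    for _ in range(num_iters):
        r_try = r_best + step * rng.standard_normal(r_best.shape)
        cost_try = simulate_cost(r_try)
        if cost_try < cost_best:      # keep parameters that lower the cost
            r_best, cost_best = r_try, cost_try
    return r_best

# Example call with a stand-in quadratic cost, just to show the interface.
r_star = random_search(lambda r: float(np.sum(r ** 2)), r0=np.ones(5))
```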

Afterword

Next chapter link:

"Reinforcement Learning and Optimal Control" Study Notes (4): Overview of Model-Based and Model-Free Implementation and Off-line and On-line Method

Origin blog.csdn.net/qq_42286607/article/details/123464578