"Reinforcement Learning and Optimal Control" Study Notes (1): Deterministic Dynamic Programming and Stochastic Dynamic Programming

Foreword

The author of this book is Professor Dimitri Panteli Bertsekas, born in Athens, Greece in 1942. He is a member of the U.S. National Academy of Engineering and a professor of electrical engineering and computer science at the Massachusetts Institute of Technology. Professor Bertsekas is known worldwide for having written some 16 monographs on optimization algorithms, control, and applied probability, and he is among the 100 most-cited computer science authors in the CiteSeer academic database. He is also a co-founder of the publisher Athena Scientific.

We know that dynamic programming and optimal control can solve large multi-stage decision-making problems. This book focuses on how to obtain approximate solutions when computing resources are limited, while still requiring the approximate solution to meet certain performance requirements. Such methods are often collectively referred to as reinforcement learning, and are sometimes called approximate dynamic programming or neuro-dynamic programming.

The main inspiration for the book comes from the meeting of the fields of optimal control and artificial intelligence. One of its main purposes is to explore the boundary between these two fields and to build a bridge for people working in either of them.

Related resources for this book: 

Official website: REINFORCEMENT LEARNING AND OPTIMAL CONTROL: BOOKS, VIDEOLECTURES, AND COURSE MATERIAL

The official website of the book provides a PDF (only a draft containing the first four chapters; buying the book is recommended, though it is currently available only in English), as well as the related slides and lecture videos (which may require a VPN to access from mainland China).

The author has also posted the lecture videos on Bilibili:

Dimitri Bertsekas——Bilibili

On Zhihu, others have also shared their study notes for this book:

[Reinforcement Learning and Optimal Control] Notes (1) Dynamic Programming for Deterministic Problems

That author comes from a control background and is very familiar with optimal control formulations and classical mathematical optimization. I have just started studying this book; I have some prior foundation in reinforcement learning and am working on robotics applications. The book is very valuable, so I am recording my learning process here, both to push myself to keep studying and in the hope of discussing and learning together with everyone.

This article corresponds to Sections 1.1 (Deterministic Dynamic Programming) and 1.2 (Stochastic Dynamic Programming) of the book.

Deterministic Dynamic Programming

All dynamic programming (hereafter DP) problems involve a discrete-time dynamic system of the following form:

x_{k+1} = f_k(x_k, u_k),   k = 0, 1, ..., N-1

Here k is the time index, x_k is the state of the system, and u_k is the control or decision variable, chosen at time k from a set U_k(x_k). The set appears because u_k is generally subject to constraints; for example, a robot on a grid map can move up, down, left, or right, but in some states it may not be able to choose, say, the upward move. f_k(x_k, u_k) is the function describing the dynamics of the system, and N is the number of time steps (the horizon). Here we first discuss the finite-horizon case.

This type of problem also involves a cost function; that is, we care about the cost incurred, such as the total path length in a shortest-path problem or the amount of fuel consumed in a fuel-minimization problem. The book uses g_k(x_k, u_k) to denote the cost incurred at time k. The cost accumulates additively over time, so starting from the initial state x_0, the total cost of the control sequence {u_0, ..., u_{N-1}} is:

J(x_0; u_0, ..., u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)

where g_N(x_N) is the terminal cost incurred at the final state x_N.

Our goal, of course, is to choose at each state x_k a control u_k so that the total cost J(x_0) is minimized; optimal quantities are generally marked with an asterisk (*):

J^*(x_0) = \min_{u_k \in U_k(x_k), k = 0, ..., N-1} J(x_0; u_0, ..., u_{N-1})

The DP problem can therefore be depicted as a diagram of the state evolving stage by stage under the chosen controls (see the corresponding figure in the book).
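To make the formulation concrete, here is a minimal Python sketch (my own toy example, not one from the book) of a finite-horizon deterministic problem: a point moving on the integer line {0,...,4} with controls ±1 and a quadratic cost. The brute-force minimization below is literally the definition of J^*(x_0), and its exponential cost in N is exactly what the DP algorithm of the next section avoids.

```python
import itertools

# A tiny hypothetical deterministic problem (not from the book): a point moving
# on the integer line {0,...,4}, controls u in {-1,+1}, horizon N = 3.
N = 3

def U(k, x):                      # admissible controls U_k(x_k): stay within [0, 4]
    return [u for u in (-1, +1) if 0 <= x + u <= 4]

def f(k, x, u):                   # dynamics x_{k+1} = f_k(x_k, u_k)
    return x + u

def g(k, x, u):                   # stage cost g_k(x_k, u_k) (assumed quadratic in the state)
    return x ** 2 + abs(u)

def g_term(x):                    # terminal cost g_N(x_N)
    return 10 * x ** 2

def total_cost(x0, controls):
    """J(x0; u_0,...,u_{N-1}) = g_N(x_N) + sum_k g_k(x_k, u_k); inf if infeasible."""
    x, J = x0, 0.0
    for k, u in enumerate(controls):
        if u not in U(k, x):
            return float("inf")
        J += g(k, x, u)
        x = f(k, x, u)
    return J + g_term(x)

# Brute-force minimization over every control sequence: this is exactly the
# definition of J*(x0), but the number of sequences grows exponentially in N.
x0 = 2
best_seq = min(itertools.product((-1, +1), repeat=N), key=lambda s: total_cost(x0, s))
print("best sequence:", best_seq, " cost:", total_cost(x0, best_seq))
```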

Dynamic programming algorithm

Principle of Optimality

The dynamic programming algorithm decomposes the problem into a sequence of subproblems and solves them stage by stage, with the solution of each subproblem providing useful information for solving the next. Let us look at this in detail.

Let {u^{*}_{0}, ..., u^{*}_{N-1}} be an optimal control sequence; starting from the initial state x_0 it generates the corresponding optimal state trajectory {x^{*}_{1}, ..., x^{*}_{N}}. Now consider the tail of the problem that starts at time k in the state x^{*}_{k}, i.e. the problem of minimizing the cost-to-go from k to N:

g_N(x_N) + \sum_{m=k}^{N-1} g_m(x_m, u_m)

Clearly this is a subproblem of the original problem, and its optimal solution turns out to be {u^{*}_{k}, ..., u^{*}_{N-1}}.

The principle of optimality states that the tail {u^{*}_{k}, ..., u^{*}_{N-1}} of the optimal solution {u^{*}_{0}, ..., u^{*}_{N-1}} of the original problem is itself an optimal solution of the corresponding subproblem. Note that all of these subproblems end at the same final stage as the original problem: their starting points differ, but their endpoint is the same. Such subproblems are therefore called tail subproblems (see the corresponding figure in the book).

There is a more intuitive example in the book: if the shortest path from Los Angeles to Chicago passes through Boston, then the Boston-to-Chicago portion of that path is itself the shortest path from Boston to Chicago.
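As a quick sanity check of the principle, the following sketch builds a tiny hypothetical road graph (the city names match the example above, but the distances are invented), finds the shortest LA-to-Chicago path by brute force, and then verifies that its tail starting at Boston is itself the shortest Boston-to-Chicago path.

```python
import itertools

# Hypothetical road graph: (city, city) -> distance (all numbers made up).
DIST = {
    ("LA", "Denver"): 1000, ("LA", "Phoenix"): 400,
    ("Denver", "Boston"): 2100, ("Phoenix", "Boston"): 2600,
    ("Boston", "Cleveland"): 650, ("Boston", "Detroit"): 700,
    ("Cleveland", "Chicago"): 350, ("Detroit", "Chicago"): 280,
}

def path_length(path):
    """Total length of a path, or infinity if some leg does not exist."""
    return sum(DIST.get((a, b), float("inf")) for a, b in zip(path, path[1:]))

def shortest_path(src, dst, intermediates):
    """Brute force: try every ordering of intermediate cities (fine for a toy graph)."""
    best = (src, dst)
    for r in range(len(intermediates) + 1):
        for mid in itertools.permutations(intermediates, r):
            cand = (src, *mid, dst)
            if path_length(cand) < path_length(best):
                best = cand
    return best

cities = ["Denver", "Phoenix", "Boston", "Cleveland", "Detroit"]
full = shortest_path("LA", "Chicago", cities)

# Principle of optimality: the portion of the optimal path starting at any
# intermediate city (here Boston) is itself optimal for the tail subproblem.
tail = full[full.index("Boston"):]
assert tail == shortest_path("Boston", "Chicago", ["Cleveland", "Detroit"])
print("LA -> Chicago:", full, " tail from Boston:", tail)
```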

Dynamic programming to find the optimal control sequence

From the above, we see that solving the original problem can be decomposed into optimally solving the different tail subproblems. Denote by J_{k}^{*}(x_k) the optimal cost of the tail subproblem that starts at state x_k at time k; in other words, J_{k}^{*}(x_k) is the smallest cost that can be accumulated from time k to time N starting from x_k. Computing each J_{k}^{*}(x_k) directly from this definition would mean re-evaluating the whole tail cost from k to N every time. To avoid this, the DP algorithm computes these quantities by backward induction: it starts from

J_N^{*}(x_N) = g_N(x_N)

and then, moving backward for k = N-1, ..., 0, computes

J_{k}^{*}(x_k) = \min_{u_k \in U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}^{*}( f_k(x_k, u_k) ) ].

In this way, once J_{k+1}^{*} has been computed, J_{k}^{*}(x_k) is obtained simply by adding g_k and minimizing. This idea is also used in reinforcement learning: g_k is the immediate cost, while J_{k+1}^{*} represents the future cost and is called the cost-to-go function.

Once J^{*}_{0}, ..., J^{*}_{N} have been obtained, we can use these functions to recover the optimal control sequence by moving forward in time from x_0: at each stage we simply pick the control u_k that attains the minimum in the DP equation,

u_k^{*} \in \arg\min_{u_k \in U_k(x_k^{*})} [ g_k(x_k^{*}, u_k) + J_{k+1}^{*}( f_k(x_k^{*}, u_k) ) ],   with x_0^{*} = x_0 and x_{k+1}^{*} = f_k(x_k^{*}, u_k^{*}).
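Here is a minimal sketch of the exact DP algorithm on the same kind of toy problem as before (again my own illustration, not the book's): a backward pass fills in the tables J_k^*, and a forward pass then reads off an optimal control sequence by picking the minimizing u_k at each step.

```python
# Minimal sketch of the exact DP algorithm on a toy finite-state problem
# (integer line {0,...,4}, controls ±1, horizon N = 3; all hypothetical).
N, STATES = 3, range(5)
U = lambda k, x: [u for u in (-1, +1) if 0 <= x + u <= 4]   # U_k(x_k)
f = lambda k, x, u: x + u                                   # x_{k+1} = f_k(x_k, u_k)
g = lambda k, x, u: x ** 2 + abs(u)                         # stage cost g_k
g_term = lambda x: 10 * x ** 2                              # terminal cost g_N

# Backward pass: J[k][x] holds J_k^*(x), computed from k = N down to k = 0.
J = [{} for _ in range(N + 1)]
for x in STATES:
    J[N][x] = g_term(x)                                     # J_N^* = g_N
for k in range(N - 1, -1, -1):
    for x in STATES:
        J[k][x] = min(g(k, x, u) + J[k + 1][f(k, x, u)] for u in U(k, x))

# Forward pass: from x0, pick the control attaining the minimum in the DP
# equation at each step; this recovers an optimal control sequence.
x0 = 2
x, u_star = x0, []
for k in range(N):
    u = min(U(k, x), key=lambda uu: g(k, x, uu) + J[k + 1][f(k, x, uu)])
    u_star.append(u)
    x = f(k, x, u)

print("J*(x0) =", J[0][x0], " optimal controls:", u_star)
```

On this toy instance the forward pass returns the same sequence and cost as the brute-force enumeration earlier, but the work grows only linearly with N instead of exponentially.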

Approximation in value space

From the above, we also see that solving for an optimal sequence requires computing J_{k}^{*}(x_k) for all x_k and all k, i.e. the whole family of functions J^{*}_k, which can be very expensive. For this reason an approximation is introduced: J^{*}_k is replaced by an approximation \tilde{J}_{k}. There are many ways to build this approximation; for example, since neural networks can fit functions, \tilde{J}_{k} can be represented by a neural network that takes x_k as input and outputs the approximate cost-to-go directly, without tabulating all x_k and k. The control is then chosen by one-step lookahead:

\tilde{u}_k \in \arg\min_{u_k \in U_k(x_k)} [ g_k(x_k, u_k) + \tilde{J}_{k+1}( f_k(x_k, u_k) ) ].
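A minimal sketch of what this one-step lookahead looks like in code; here the approximation \tilde{J} is just a hand-made heuristic standing in for a trained neural network, and the whole setup (problem, heuristic, policy) is hypothetical.

```python
# Sketch of approximation in value space: replace J_{k+1}^* by an approximation
# J_tilde and choose controls by one-step lookahead.  J_tilde below is a crude
# hand-made heuristic; in practice it could be the output of a neural network.
N = 3
U = lambda k, x: [u for u in (-1, +1) if 0 <= x + u <= 4]
f = lambda k, x, u: x + u
g = lambda k, x, u: x ** 2 + abs(u)

def J_tilde(k, x):
    """Approximate cost-to-go (hypothetical heuristic, stands in for J_k^*)."""
    return 10 * x ** 2 if k == N else x ** 2

def lookahead_control(k, x):
    """One-step lookahead: argmin_u [ g_k(x,u) + J_tilde_{k+1}(f_k(x,u)) ]."""
    return min(U(k, x), key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))

# Roll the system forward using the (generally suboptimal) lookahead controller.
x = 2
for k in range(N):
    u = lookahead_control(k, x)
    print(f"k={k}: x={x}, chosen u={u}")
    x = f(k, x, u)
```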

The book uses the term Q-factor for the bracketed expression inside the lookahead minimization above (the right-hand side of formula (1.9) in the book), that is:

\tilde{Q}_{k}(x_k, u_k) = g_k(x_k, u_k) + \tilde{J}_{k+1}( f_k(x_k, u_k) ),

and the lookahead control is then simply the u_k \in U_k(x_k) that minimizes \tilde{Q}_{k}(x_k, u_k).

These approximate Q-factors can also be represented by, for example, a neural network; the family of reinforcement learning methods that compute Q-factors is known as Q-learning.

Alongside the approximate \tilde{Q}_{k} there are naturally the optimal Q-factors Q_{k}^{*}, and our goal is to make the two as close as possible (more on this later). The optimal Q-factor is obtained from the same formula, with the true cost-to-go in place of the approximation:

Q_{k}^{*}(x_k, u_k) = g_k(x_k, u_k) + J_{k+1}^{*}( f_k(x_k, u_k) ),   so that   J_{k}^{*}(x_k) = \min_{u_k \in U_k(x_k)} Q_{k}^{*}(x_k, u_k).

The optimal Q-factors also satisfy a recursion of their own. The derivation is very simple and is worth working out yourself to reinforce understanding; a sketch is given below.
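A sketch of the derivation in the same notation: substitute J_{k+1}^{*}(x_{k+1}) = \min_{u_{k+1} \in U_{k+1}(x_{k+1})} Q_{k+1}^{*}(x_{k+1}, u_{k+1}) into the definition of the optimal Q-factor.

```latex
\begin{aligned}
Q_k^*(x_k,u_k) &= g_k(x_k,u_k) + J_{k+1}^*\bigl(f_k(x_k,u_k)\bigr)
  && \text{(definition of the optimal Q-factor)} \\
               &= g_k(x_k,u_k) + \min_{u_{k+1}\in U_{k+1}(x_{k+1})} Q_{k+1}^*(x_{k+1},u_{k+1}),
  && \text{where } x_{k+1} = f_k(x_k,u_k).
\end{aligned}
```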

Stochastic Dynamic Programming

In fact, compared with the deterministic problem, the stochastic problem simply has one additional random disturbance variable w_k, which follows a probability distribution P_k(\cdot | x_k, u_k). The system now takes the form:

x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N-1.

The other definitions (stage cost, terminal cost, and control constraint sets) carry over from the deterministic case, with the stage cost now written g_k(x_k, u_k, w_k); see the corresponding figure in the book.

An important difference from the deterministic problem is that what we optimize over is no longer the control sequence {u_{0}, ..., u_{N-1}} but a policy:

\pi = \{ \mu_0, ..., \mu_{N-1} \}

Here \mu_{k} is the mapping from the state x_k to the control u_k = \mu_{k}(x_k), and it satisfies the control constraints, i.e. \mu_k(x_k) \in U_k(x_k). Policies are more general than control sequences: in the presence of stochastic uncertainty they can reduce cost, because the control u_k is chosen on the basis of the current state x_k, which carries knowledge of the disturbances encountered so far. Without this knowledge the controller cannot adapt to contingencies caused by the random variable w_k, which adversely affects the cost. This is a fundamental difference between deterministic and stochastic optimal control problems.

Another difference is that the cost J is now an expectation, again because of the random variables w_k; in practice this expectation is generally estimated by Monte Carlo simulation:

J_{\pi}(x_0) = E\Big[ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \Big].
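As a sketch of how this expectation is estimated in practice, the snippet below evaluates a fixed policy on a toy stochastic version of the earlier problem by averaging many simulated trajectories; the disturbance distribution and the policy are both invented for illustration.

```python
import random

# Monte Carlo estimation of the expected cost J_pi(x0) of a fixed policy
# pi = (mu_0, ..., mu_{N-1}) on a toy stochastic problem (hypothetical):
# the disturbance w_k pushes the state by -1, 0, or +1 with equal probability.
N = 3
f = lambda k, x, u, w: min(max(x + u + w, 0), 4)      # x_{k+1} = f_k(x_k, u_k, w_k)
g = lambda k, x, u, w: x ** 2 + abs(u)                # stage cost g_k(x_k, u_k, w_k)
g_term = lambda x: 10 * x ** 2                        # terminal cost g_N(x_N)

def mu(k, x):
    """A simple hypothetical policy: always move toward state 0."""
    return -1 if x > 0 else +1

def rollout_cost(x0):
    """Cost of one simulated trajectory under the policy mu."""
    x, J = x0, 0.0
    for k in range(N):
        u = mu(k, x)
        w = random.choice([-1, 0, +1])                # sample w_k
        J += g(k, x, u, w)
        x = f(k, x, u, w)
    return J + g_term(x)

# Average over many independent rollouts to estimate the expectation.
x0, M = 2, 10_000
estimate = sum(rollout_cost(x0) for _ in range(M)) / M
print("Monte Carlo estimate of J_pi(x0):", estimate)
```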

The remaining parts differ little from the deterministic problem, so I just list them here for comparison:

The DP algorithm for the stochastic problem:

J_N^{*}(x_N) = g_N(x_N),
J_{k}^{*}(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\big[ g_k(x_k, u_k, w_k) + J_{k+1}^{*}( f_k(x_k, u_k, w_k) ) \big].

Q-factors for the stochastic problem; as can be seen, the only change is the extra random variable w_k:

Q_{k}^{*}(x_k, u_k) = E_{w_k}\big[ g_k(x_k, u_k, w_k) + J_{k+1}^{*}( f_k(x_k, u_k, w_k) ) \big].
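Finally, a minimal sketch of the stochastic DP recursion on the same toy problem: because the disturbance w_k takes only three values with known probabilities, the expectation inside the minimization is just a weighted sum, and the backward pass produces both J_k^* and an optimal policy \mu_k. Everything here is again a hypothetical illustration.

```python
# Stochastic DP recursion on a toy problem: state on {0,...,4}, controls ±1,
# disturbance w in {-1, 0, +1} with equal probability (all hypothetical).
N, STATES = 3, range(5)
W = [(-1, 1 / 3), (0, 1 / 3), (+1, 1 / 3)]            # (w, P(w)) pairs
U = lambda k, x: [-1, +1]                             # dynamics below clip to [0, 4]
f = lambda k, x, u, w: min(max(x + u + w, 0), 4)      # x_{k+1} = f_k(x_k, u_k, w_k)
g = lambda k, x, u, w: x ** 2 + abs(u)
g_term = lambda x: 10 * x ** 2

# Backward pass: J[k][x] = J_k^*(x) = min_u E_w[ g_k + J_{k+1}^*(f_k(x,u,w)) ].
J = [{} for _ in range(N + 1)]
mu = [{} for _ in range(N)]                           # the resulting optimal policy
for x in STATES:
    J[N][x] = g_term(x)
for k in range(N - 1, -1, -1):
    for x in STATES:
        def q(u):                                     # stochastic Q-factor Q_k^*(x, u)
            return sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)]) for w, p in W)
        mu[k][x] = min(U(k, x), key=q)
        J[k][x] = q(mu[k][x])

print("J*(x0=2) =", J[0][2], " mu_0(2) =", mu[0][2])
```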

Afterword

Link to the next chapter:

"Reinforcement Learning and Optimal Control" Study Notes (2): Comparison of Some Terms between Reinforcement Learning and Optimal Control


Source: blog.csdn.net/qq_42286607/article/details/123446666