Foreword
The author of this book is Professor Dimitri P. Bertsekas, born in Athens, Greece in 1942, a member of the US National Academy of Engineering and a professor of electrical engineering and computer science at the Massachusetts Institute of Technology. Professor Bertsekas is known worldwide for his 16-plus monographs on optimization algorithms, control, and applied probability, and he ranks among the 100 most cited computer science authors in the CiteSeer academic database. He is also a co-founder of the publisher Athena Scientific.
We know that dynamic programming and optimal control can solve large multi-stage decision-making problems. This book focuses on how to obtain approximate solutions that still meet certain performance requirements when computing resources are limited. Such methods are often collectively referred to as reinforcement learning, and are sometimes called approximate dynamic programming or neuro-dynamic programming.
The main inspiration for this book comes from combining the fields of optimal control and artificial intelligence. One of its main purposes is to explore the boundary between these two fields and build a bridge for those who work in them.
Related resources for this book:
Official website: REINFORCEMENT LEARNING AND OPTIMAL CONTROL: BOOKS, VIDEOLECTURES, AND COURSE MATERIAL
The official website of this book offers a PDF (only a draft covering the first four chapters; buying the book directly is recommended, though it is currently available only in English), as well as the corresponding courseware and videos (which may require a VPN to access from mainland China).
Lecture videos for the course have also been uploaded to Bilibili:
On Zhihu, others have shared their study notes as well:
The author of those notes comes from a control background and is well versed in optimal control and classical mathematical optimization. I have just started studying this book; I have some prior foundation in reinforcement learning and am working on robotics applications. The book is very valuable, so I am recording my learning process here, both to push myself to keep studying and in the hope of discussing it with everyone.
This article corresponds to Sections 1.1 (Deterministic Dynamic Programming) and 1.2 (Stochastic Dynamic Programming) of the book.
Deterministic Dynamic Programming
All dynamic programming (hereinafter referred to as DP) problems involve a discrete-time dynamic system of the following form:

$$x_{k+1} = f_k(x_k, u_k), \quad k = 0, 1, \ldots, N-1$$
Here $k$ is the time index, $x_k$ is the state of the system, and $u_k$ is the control or decision variable, selected at time $k$ from a constraint set $U_k(x_k)$. The constraint set is needed because controls are generally restricted: for example, a robot may be able to move up, down, left, and right on a map, but at the top edge it cannot choose the upward action. The function $f_k$, which depends on $(x_k, u_k)$, describes the dynamics of the system, and $N$ is the number of time steps (the horizon). Here we first discuss the finite-horizon case.
This type of problem also includes a cost function: we need to account for cost, such as the total path length in a pathfinding problem, or minimizing the fuel consumed in a fuel problem. The book uses $g_k(x_k, u_k)$ to denote the cost incurred at time $k$. The cost accumulates additively over time, so starting from the initial state $x_0$, the total cost of the control sequence $\{u_0, \ldots, u_{N-1}\}$ is:

$$J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)$$
where $g_N(x_N)$ is the cost incurred at the terminal state $x_N$.
In the whole system, our goal is of course to select the control sequence that minimizes the total cost from the initial state $x_0$. The optimum is generally denoted with an asterisk (*):

$$J^*(x_0) = \min_{u_k \in U_k(x_k),\ k = 0, \ldots, N-1} J(x_0; u_0, \ldots, u_{N-1})$$
So the DP problem can be described by the following diagram:
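The formulation above can be made concrete with a tiny numerical example. The following sketch evaluates the total cost of a given control sequence; the scalar dynamics $x_{k+1} = x_k + u_k$ and the quadratic costs are purely illustrative choices, not taken from the book:

```python
# Hypothetical toy problem: scalar linear system x_{k+1} = x_k + u_k,
# stage cost g_k(x, u) = x^2 + u^2, terminal cost g_N(x) = x^2.
N = 3  # horizon

def f(k, x, u):
    return x + u  # system dynamics f_k

def g(k, x, u):
    return x**2 + u**2  # stage cost g_k

def g_N(x):
    return x**2  # terminal cost

def total_cost(x0, controls):
    """Total cost J(x0; u_0, ..., u_{N-1}) of a control sequence."""
    x, J = x0, 0.0
    for k, u in enumerate(controls):
        J += g(k, x, u)   # accumulate the stage cost
        x = f(k, x, u)    # advance the state
    return J + g_N(x)     # add the terminal cost

print(total_cost(1.0, [-0.5, -0.25, -0.25]))  # → 1.6875
```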
Dynamic programming algorithm
Principle of Optimality
The dynamic programming algorithm decomposes the problem into several subproblems and solves the stages in order, with the solution of one subproblem providing useful information for the next. Let's go through this in detail:
Define $\{u_0^*, \ldots, u_{N-1}^*\}$ as the optimal control sequence; together with the initial state $x_0$ it determines the optimal state trajectory $\{x_1^*, \ldots, x_N^*\}$. Now consider the subproblem that starts at an intermediate time $k$, namely minimizing the cost from time $k$ onward:

$$g_N(x_N) + \sum_{m=k}^{N-1} g_m(x_m, u_m)$$
Obviously this is a subproblem of the original problem, and its optimal solution is $\{u_k^*, \ldots, u_{N-1}^*\}$.
The principle of optimality states that this subsequence of the original problem's optimal solution ($\{u_k^*, \ldots, u_{N-1}^*\}$, taken from $\{u_0^*, \ldots, u_{N-1}^*\}$) is the optimal solution of the corresponding subproblem. Note that every such subproblem must end at the same final time as the original problem: the starting points can differ, but the end must coincide with the end of the original problem. Such subproblems are called tail subproblems, as shown in the following figure:
There is a more intuitive example in the book: if the shortest path from Los Angeles to Chicago passes through Boston, then the Boston-to-Chicago portion of that path is also a shortest path from Boston to Chicago.
Dynamic programming to find the optimal control sequence
Through the above study, we know that the optimal solution of the original problem can be decomposed into the optimal solutions of the different tail subproblems:

$$J^*(x_0) = \min_{u_0, \ldots, u_{N-1}} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k) \right]$$
where:

$$J_k^*(x_k) = \min_{u_k, \ldots, u_{N-1}} \left[ g_N(x_N) + \sum_{m=k}^{N-1} g_m(x_m, u_m) \right]$$
In other words, $J_k^*(x_k)$ is the smallest cost of going from $x_k$ to the end. From the formula, computing $J_k^*$ directly would require evaluating the total cost from time $k$ all the way to $N$ for every control sequence. To avoid this, backward induction is used:

$$J_N^*(x_N) = g_N(x_N)$$
$$J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \left[ g_k(x_k, u_k) + J_{k+1}^*(f_k(x_k, u_k)) \right], \quad k = N-1, \ldots, 0$$
In this way, as long as $J_{k+1}^*$ has been computed, $J_k^*$ is obtained by adding one stage cost and minimizing. This idea is also used in reinforcement learning: $g_k(x_k, u_k)$ is the immediate cost, and $J_{k+1}^*$ represents the future cost, which is called the cost-to-go function.
After obtaining the $J_k^*$, we can use these functions to recover the optimal control sequence; in fact, at each stage we just find the $u_k^*$ that attains the minimum:

$$u_k^* \in \arg\min_{u_k \in U_k(x_k^*)} \left[ g_k(x_k^*, u_k) + J_{k+1}^*(f_k(x_k^*, u_k)) \right]$$
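The backward induction above can be sketched on a small finite problem, where the minimization over $u_k$ is just a loop. The states, controls, dynamics, and costs below are all illustrative toy choices:

```python
# Minimal backward DP on a toy finite problem (all names are illustrative).
import math

N = 3
states = [0, 1, 2, 3]
controls = [-1, 0, 1]

def f(k, x, u):
    return min(max(x + u, 0), 3)  # dynamics, clipped to the state set

def g(k, x, u):
    return abs(u) + x  # stage cost: moving costs 1, high states cost more

def g_N(x):
    return 10 * x  # terminal cost: penalize ending far from state 0

# Backward induction: J[k][x] holds J_k^*(x), policy[k][x] the minimizing u
J = [dict() for _ in range(N + 1)]
policy = [dict() for _ in range(N)]
for x in states:
    J[N][x] = g_N(x)
for k in range(N - 1, -1, -1):
    for x in states:
        best_u, best_val = None, math.inf
        for u in controls:
            val = g(k, x, u) + J[k + 1][f(k, x, u)]
            if val < best_val:
                best_u, best_val = u, val
        J[k][x] = best_val
        policy[k][x] = best_u

print(J[0])  # → {0: 0, 1: 2, 2: 5, 3: 9}
```

Once `J` is filled in, the optimal sequence from any initial state is recovered by following `policy` forward through the dynamics, exactly as in the arg-min formula above.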
Approximation in value space
From the above study, we also know that solving for an optimal sequence requires computing $J_{k+1}^*$, i.e., the term $\min_{u_k} \left[ g_k(x_k, u_k) + J_{k+1}^*(f_k(x_k, u_k)) \right]$, which requires considering all states and controls and takes a lot of time to compute. To address this, an approximation is adopted: $J_{k+1}^*$ is replaced by an approximate function $\tilde{J}_{k+1}$. There are many ways to construct this approximation. For example, we know that neural networks can fit functions, so $\tilde{J}_{k+1}$ can be represented by a neural network, which maps a state directly to an approximate cost-to-go without considering all states and controls.
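A minimal sketch of this idea follows. A piecewise-linear interpolant fitted to a few sampled (state, cost-to-go) pairs stands in for the neural network mentioned above; the sample values and the toy dynamics/cost are hypothetical:

```python
# Approximation in value space: replace the exact J_{k+1}^* with a surrogate
# J_tilde fitted to sampled (state, cost-to-go) pairs. A piecewise-linear
# interpolant stands in here for a neural network; all numbers are illustrative.

# Hypothetical samples of the exact cost-to-go at a few states
samples = [(0.0, 1.0), (0.5, 1.25), (1.0, 2.0), (1.5, 3.25), (2.0, 5.0)]

def J_tilde(x):
    """Piecewise-linear surrogate for J_{k+1}^* built from the samples."""
    pts = sorted(samples)
    if x <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return pts[-1][1]

# One-step lookahead using the surrogate instead of the exact cost-to-go
def lookahead_control(x, controls, f, g):
    return min(controls, key=lambda u: g(x, u) + J_tilde(f(x, u)))

u = lookahead_control(1.0, [-0.5, 0.0, 0.5],
                      f=lambda x, u: x + u,        # toy dynamics
                      g=lambda x, u: x**2 + u**2)  # toy stage cost
print(u)  # → -0.5
```

The point of the surrogate is that `J_tilde` is cheap to evaluate at decision time, so the lookahead minimization no longer requires solving any tail subproblem exactly.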
This book uses Q-factors to denote the bracketed expression inside the one-step lookahead minimization (formula (1.9) in the book):

$$\tilde{Q}_k(x_k, u_k) = g_k(x_k, u_k) + \tilde{J}_{k+1}(f_k(x_k, u_k))$$

That is:

$$\tilde{u}_k \in \arg\min_{u_k \in U_k(x_k)} \tilde{Q}_k(x_k, u_k)$$
These approximate Q-factors can also be realized by methods such as neural networks; in reinforcement learning, the class of methods that solves for Q-factors is called Q-learning.
Corresponding to the approximate $\tilde{Q}_k$ there are naturally optimal Q-factors $Q_k^*$, and our goal is to make the two as close as possible (we will come back to this later). They follow naturally from the formula above:

$$Q_k^*(x_k, u_k) = g_k(x_k, u_k) + J_{k+1}^*(f_k(x_k, u_k))$$
The optimal Q-factors also have an inductive form. The derivation is actually very simple and is worth working through yourself to deepen understanding: since $J_{k+1}^*(x_{k+1}) = \min_{u_{k+1}} Q_{k+1}^*(x_{k+1}, u_{k+1})$, we get

$$Q_k^*(x_k, u_k) = g_k(x_k, u_k) + \min_{u_{k+1} \in U_{k+1}(x_{k+1})} Q_{k+1}^*(x_{k+1}, u_{k+1}), \quad x_{k+1} = f_k(x_k, u_k)$$
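The Q-factor recursion can be sketched the same way as the cost-to-go recursion: tabulate $Q_k$ backward, then derive $J_k$ from it by minimizing over the control. The toy dynamics and costs below are illustrative, not from the book:

```python
# Q-factors on a toy finite problem: Q_k(x, u) = g_k(x, u) + J_{k+1}(f_k(x, u)).
# Once Q_k is tabulated, choosing a control is just an arg-min over u;
# no model call is needed at decision time. (All numbers are illustrative.)
N = 2
states = [0, 1, 2]
controls = [-1, 0, 1]
f = lambda k, x, u: min(max(x + u, 0), 2)  # clipped dynamics
g = lambda k, x, u: abs(u)                 # stage cost: moving costs 1
g_N = lambda x: x**2                       # terminal cost

J = {x: g_N(x) for x in states}  # J_{k+1}, initialized at k = N
Q = {}
for k in range(N - 1, -1, -1):
    Q = {(x, u): g(k, x, u) + J[f(k, x, u)] for x in states for u in controls}
    J = {x: min(Q[(x, u)] for u in controls) for x in states}  # J_k from Q_k

best_u = min(controls, key=lambda u: Q[(2, u)])  # greedy control at state 2, time 0
print(best_u)  # → -1
```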
Stochastic Dynamic Programming
In fact, compared with the deterministic problem, the stochastic problem simply has one more random disturbance $w_k$, which obeys a probability distribution $P_k(\cdot \mid x_k, u_k)$. Such a system is represented as follows:

$$x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1$$
Some other definitions are shown in the figure:
An important difference from the deterministic problem is that what we want to optimize is no longer a control sequence $\{u_0, \ldots, u_{N-1}\}$ but a policy:

$$\pi = \{\mu_0, \ldots, \mu_{N-1}\}$$
Here $\mu_k$ is a mapping from the state $x_k$ to the control $u_k = \mu_k(x_k)$, and it satisfies the control constraints, $\mu_k(x_k) \in U_k(x_k)$. Policies are more general than control sequences, and in the presence of stochastic uncertainty they can reduce cost, because the control is chosen with knowledge of the current state. Without this knowledge, the controller cannot adapt to contingencies caused by the random variable $w_k$, which adversely affects the cost. This is a fundamental difference between deterministic and stochastic optimal control problems.
Another difference is that the cost is expressed as an expectation, again because of the random variable $w_k$; in practice the expectation is often estimated by Monte Carlo simulation:

$$J_\pi(x_0) = \mathbb{E}\left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right]$$
The remaining parts are not very different from the deterministic problem, so I list them side by side for comparison:
The DP algorithm for the stochastic problem:

$$J_N^*(x_N) = g_N(x_N)$$
$$J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}_{w_k}\left[ g_k(x_k, u_k, w_k) + J_{k+1}^*(f_k(x_k, u_k, w_k)) \right]$$
Q-factors in the stochastic problem; as can be seen, there is only the added random variable $w_k$:

$$Q_k^*(x_k, u_k) = \mathbb{E}_{w_k}\left[ g_k(x_k, u_k, w_k) + J_{k+1}^*(f_k(x_k, u_k, w_k)) \right]$$
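A sketch of the stochastic version follows. The only change from the deterministic backward induction is the expectation over $w_k$; with a discrete disturbance it is just a probability-weighted sum. The two-valued disturbance and the toy dynamics/costs are illustrative assumptions:

```python
# Stochastic DP sketch: same backward recursion, but each candidate control is
# scored by an expectation over the disturbance w (here a weighted sum over
# two values). All numbers are illustrative.
N = 2
states = [0, 1, 2]
controls = [-1, 0, 1]
disturbances = [(0, 0.7), (1, 0.3)]  # (w value, probability)

def f(k, x, u, w):
    return min(max(x + u + w, 0), 2)  # clipped dynamics with disturbance

def g(k, x, u, w):
    return abs(u) + w  # stage cost now depends on w as well

g_N = lambda x: x**2

J = {x: g_N(x) for x in states}  # initialize with the terminal cost
for k in range(N - 1, -1, -1):
    J = {
        x: min(
            sum(p * (g(k, x, u, w) + J[f(k, x, u, w)]) for w, p in disturbances)
            for u in controls
        )
        for x in states
    }
print(J)  # optimal expected cost-to-go from each initial state
```

Note that the dict comprehension reads the previous `J` (stage $k+1$) while building the new one (stage $k$), mirroring the backward sweep of the deterministic algorithm.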
Afterword
Link to the next chapter: