Reinforcement learning from basic to advanced - must-know questions and interview topics [2]: Markov decision processes, the Bellman equation, dynamic programming, policy iteration and value iteration

Column details: [Reinforcement Learning Principles + Project Column] must-see series: single-agent and multi-agent algorithm principles + project practice, related skills (parameter tuning, plotting, etc.), interesting project implementations, and academic/applied project implementations.

The plan for the deep reinforcement learning column is:

  • Basic single-agent algorithm tutorials (based on the gym environment)
  • Mainstream multi-agent algorithm tutorials (based on the gym environment)
    • Mainstream algorithms: DDPG, DQN, TD3, SAC, PPO, Rainbow DQN, Q-learning, A2C, and other algorithm projects
  • Some interesting projects (Super Mario, backgammon, Fight the Landlord, and various other game applications)
  • Practical single-agent and multi-agent problems (paper reproductions of real business scenarios such as UAV scheduling optimization, power resource scheduling, and other applied projects)

This column is mainly intended to help beginners quickly grasp the principles and project practice of single-agent and multi-agent reinforcement learning algorithms. In follow-up posts we will continue to explain the underlying knowledge involved in deep reinforcement learning, so that readers can build up theory while practicing on projects, knowing not only what works but also why it works.

Disclaimer: some of the projects are classic projects found online, included so that readers can get up to speed quickly; practical components (competitions, papers, real-world applications, etc.) will be added later.

Reinforcement learning from basic to advanced - must-know questions and interview topics [2]: Markov decision processes, the Bellman equation, dynamic programming, policy iteration and value iteration

1. Core Vocabulary of Markov Decision Processes

  • Markov property (MP) : if the future state of a process is independent of its past states and determined only by its current state, the process has the Markov property. In other words, the next state depends only on the current state and has nothing to do with any state before the current one.

  • Markov chain : in probability theory and statistics, a Markov chain is a stochastic process that has the Markov property and is defined on a discrete index set and state space.

  • State transition matrix : the state transition matrix is analogous to a conditional probability table; it gives the probabilities of reaching every other state once the agent is in a given state. Each row of the matrix describes the probabilities of going from one state to all other states.

  • Markov reward process (MRP) : essentially a Markov chain plus a reward function. In a Markov reward process, the states and the state transition matrix are the same as in a Markov chain; there is just one additional reward function. The reward function is an expectation, i.e. how much reward can be obtained in a given state.

  • Horizon : the length of an episode (a complete trajectory), determined by a finite number of time steps.

  • Return : the cumulative reward along a trajectory after discounting, i.e. the sum of discounted rewards (see the sketch after this list).

  • Bellman equation : it defines the iterative relationship between the current state and future states, showing that the value function of the current state can be computed from the value function of the next state. The equation is named after its proposer, the founder of dynamic programming Richard Bellman, and is therefore also called the "dynamic programming equation". The Bellman equation is $V(s)=R(s)+\gamma \sum_{s' \in S}P(s'|s)V(s')$; in particular, its matrix form is $V=R+\gamma PV$.

  • Monte Carlo algorithm (MC algorithm) : it can be used to estimate the value function. Using the boat example from this section: after obtaining a Markov reward process, we can start from some state, put the boat in the water and let it drift with the current, which generates one trajectory and hence one discounted return $g$. After accumulating the returns of many trajectories, we divide by the number of trajectories to obtain an estimate of the value function.

  • Dynamic programming algorithm (DP) : it can also be used to compute the value function. We keep iterating the corresponding Bellman equation until it converges; when the newly updated values differ little from the previous ones, the iteration can be stopped.

  • Q-function : it defines the expectation of the return obtained by taking a given action in a given state.

  • Prediction problem in a Markov decision process : i.e. the policy evaluation problem. Given a Markov decision process and a policy $\pi$, compute its value function, i.e. the value of each state. It can be solved with a dynamic programming algorithm.

  • Control problem in a Markov decision process : i.e. finding an optimal policy. The input is a Markov decision process, and the output is the optimal value function and the optimal policy. It can also be solved with a dynamic programming algorithm.

  • Optimal value function : we search over policies $\pi$ to maximize the value of every state; $V^*$ is the maximum value attainable in each state, and the policy that attains this maximum is the optimal policy. The optimal policy maximizes the value function of every state, so when we say the environment of a Markov decision process is solved, we mean that we can obtain an optimal value function.
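
To make the vocabulary above concrete, here is a minimal sketch of a toy Markov reward process (the state names, transition probabilities, and rewards are made up purely for illustration): a state transition matrix, a reward function, a discount factor, a finite horizon, and the discounted return of one sampled trajectory.

```python
import numpy as np

# A toy Markov reward process with 3 states (all numbers are illustrative).
states = ["sunny", "cloudy", "rainy"]
P = np.array([                   # state transition matrix: row i gives P(s' | s_i)
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])
R = np.array([1.0, 0.0, -1.0])   # reward function R(s): expected reward in each state
gamma = 0.9                      # discount factor

def sample_return(start: int, horizon: int, seed: int = 0) -> float:
    """Sample one trajectory of the given horizon and compute its discounted return."""
    rng = np.random.default_rng(seed)
    s, g = start, 0.0
    for t in range(horizon):
        g += (gamma ** t) * R[s]             # accumulate the discounted reward
        s = rng.choice(len(states), p=P[s])  # step according to the transition matrix
    return g

print(sample_return(start=0, horizon=20))
```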


2. Summary of frequently asked questions

2.1 Why is there a need for a discount factor in the Markov reward process?

(1) First of all, some Markov processes are cyclic and never terminate, so we want to avoid infinite returns.

(2) In addition, we want to express uncertainty about the future: we would rather obtain the reward as soon as possible than at some uncertain point in the future.

(3) Furthermore, if the reward has real value, we may prefer to receive it immediately rather than later.

(4) Also, in some cases the discount factor can be set to 0; then we only care about the immediate reward. It can also be set to 1, meaning that rewards obtained in the future count the same as rewards obtained now.

Therefore, the discount factor can be tuned as a hyperparameter of the reinforcement learning agent, and different settings yield agents with different behaviors.
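
Written out, the discounted return starting from time step $t$ is (this is the standard definition, consistent with the return described in Section 1):

$$
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
$$

Setting $\gamma = 0$ leaves only the immediate reward $r_{t+1}$, while setting $\gamma = 1$ weights all future rewards equally, which can diverge on cyclic, never-terminating processes; this is exactly point (1) above.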

2.2 Why is it difficult to obtain the analytical solution of the Bellman equation in matrix form?

By inverting a matrix we can obtain the analytical solution of $V$: rearranging $V=R+\gamma PV$ gives $V=(I-\gamma P)^{-1}R$. But the complexity of matrix inversion is $O(N^3)$, so when there are many states, say growing from 10 states to 1000 states to 1 million states, the transition matrix becomes a 1-million-by-1-million matrix, and inverting such a large matrix is very difficult. Therefore the analytical solution is only applicable to Markov reward processes with a small number of states.
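
A minimal sketch of this closed-form solution on a toy MRP (the numbers are illustrative); `np.linalg.solve` avoids forming the inverse explicitly but still has the same cubic cost:

```python
import numpy as np

# Toy MRP (illustrative numbers); from V = R + gamma * P V we get V = (I - gamma * P)^{-1} R.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

n = P.shape[0]
V = np.linalg.solve(np.eye(n) - gamma * P, R)   # O(n^3): impractical for ~10^6 states
print(V)
```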

2.3 What are the common methods for computing the Bellman equation, and what are their differences?

(1) Monte Carlo method: it can be used to estimate the value function. Take the boat example in this book: after obtaining a Markov reward process, we start from some state, put the boat into the water and let it "drift with the current", which generates one trajectory and hence one discounted return $g$. After accumulating the returns of many trajectories, we divide by the number of trajectories to obtain the value of the value function.

(2) Dynamic programming method: it can also be used to compute the value function. We keep iterating the corresponding Bellman equation until it converges; when the newly updated values differ from the previous ones by less than a small threshold, the iteration can be stopped.

(3) Combination of the above two methods: we can also use temporal-difference learning, which is a combination of the dynamic programming method and the Monte Carlo method. A sketch of the first two methods on a toy example follows.
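
Here is a minimal sketch (toy MRP, illustrative numbers) contrasting the Monte Carlo estimate with the iterative dynamic-programming evaluation of the same value function:

```python
import numpy as np

# The same style of toy MRP as before (illustrative numbers).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0, -1.0])
gamma, n = 0.9, 3

def mc_value(episodes=2000, horizon=60, seed=0):
    """Monte Carlo: average the discounted returns of sampled trajectories from each start state."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n)
    for s0 in range(n):
        total = 0.0
        for _ in range(episodes):
            s, g = s0, 0.0
            for t in range(horizon):
                g += (gamma ** t) * R[s]
                s = rng.choice(n, p=P[s])
            total += g
        V[s0] = total / episodes   # accumulated returns divided by the number of trajectories
    return V

def dp_value(tol=1e-8):
    """Dynamic programming: iterate V <- R + gamma * P V until the update falls below a threshold."""
    V = np.zeros(n)
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(mc_value())   # noisy estimate from sampled trajectories
print(dp_value())   # converged estimate from the Bellman equation
```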

2.4 What is the difference between a Markov reward process and a Markov decision process?

Compared with a Markov reward process, a Markov decision process has an additional decision step (the action); the other definitions are similar. Because there is an extra action, the state transition has an extra condition: executing an action changes the future state, which now depends not only on the current state but also on the action taken in that state. The value function likewise has an extra condition, the current action: the current state together with the action taken determines the reward that can currently be obtained.

In addition, the two can be converted into each other. Specifically, given a Markov decision process and a policy $\pi$, we can convert the Markov decision process into a Markov reward process. In the Markov decision process, the state transition function $P(s'|s,a)$ depends on both the current state and the current action. Since we know the policy function, i.e. the probability of taking each action in each state, we can sum over the action to obtain the transition probability of a Markov reward process. Similarly, by averaging over actions we obtain a reward function of the form used in a Markov reward process.
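
A minimal sketch of this conversion (all arrays are hypothetical, with shapes chosen only for illustration): given an MDP transition tensor $P(s'|s,a)$, a reward $R(s,a)$, and a policy $\pi(a|s)$, marginalize out the action to obtain an MRP transition matrix and reward function.

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical MDP components (illustrative numbers).
P_sas = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P_sas[s, a, s'] = P(s'|s, a)
R_sa = np.array([[1.0, 0.5],
                 [0.0, 0.2],
                 [-1.0, 0.3]])          # R(s, a)
pi = np.array([[0.7, 0.3],
               [0.5, 0.5],
               [0.2, 0.8]])             # pi(a | s)

# Marginalize over actions: the MDP plus the policy becomes an MRP.
P_mrp = np.einsum("sa,sat->st", pi, P_sas)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
R_mrp = np.einsum("sa,sa->s", pi, R_sa)      # R^pi(s)    = sum_a pi(a|s) R(s,a)

print(P_mrp.sum(axis=1))   # each row sums to 1, as a transition matrix should
print(R_mrp)
```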

2.5 What are the structural or computational differences between the state transition in the Markov decision process and the state transition in the Markov reward process?

For a Markov chain, the transition is direct: the state at the next moment is obtained from the state at the current moment through the transition probability. For a Markov decision process, there is an additional layer of action in between: in the current state the agent must first decide to take a certain action, and only then does it move to another state through the state transition function. So there is an extra layer of decision making in the transition between the current state and the future state, and this is the difference between the Markov decision process and the earlier Markov processes. In a Markov decision process the action is chosen by the agent, so there is an extra component: the agent's action helps determine the future state transition.

2.6 How do we find the optimal policy, and what methods are there for finding it?

Essentially, once we have the optimal value function, we can obtain the optimal values by maximizing the Q-function; taking, in each state, the action that maximizes the Q-function directly yields the optimal policy. The specific methods are as follows.

(1) Exhaustive search (generally not used): suppose the numbers of states and actions are both finite; then each state has $|A|$ possible actions, so there are $|A|^{|S|}$ possible policies in total. We can enumerate them all, compute the value function of each policy, and compare them to obtain the optimal policy. But this method is extremely inefficient.

(2) Policy iteration: an iterative method consisting of two steps that are repeated until convergence; the process is somewhat similar to the EM algorithm (expectation-maximization algorithm) in machine learning. The first step is policy evaluation: we evaluate the current policy $\pi$ to obtain its value function. The second step is policy improvement: after obtaining the value function, we compute its Q-function and act greedily with respect to it, i.e. take its maximum, to obtain an improved policy.

(3) Value iteration: we keep iterating the Bellman optimality equation; through iteration the values gradually tend toward the optimum, which is the core of the value iteration method. To obtain the optimal $V^*$, we iterate the Bellman optimality equation directly on the $V^*$ value of each state; after many iterations it converges to the optimal values, from which the corresponding optimal policy can be read off. No explicit policy function is maintained during the iteration. A runnable sketch follows.
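
A minimal sketch of value iteration on a toy MDP (the transition probabilities and rewards are made up for illustration), with the greedy policy extracted from the converged values; policy iteration would alternate full policy evaluation with the same greedy improvement step:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s, a)
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality equation V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V            # Q[s, a]; P @ V sums over the next state s'
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and the greedy (optimal) policy
        V = V_new

V_star, pi_star = value_iteration()
print(V_star, pi_star)
```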


3. Must-know interview questions and answers

3.1 Friendly interviewer: What is a Markov process? What is a Markov decision process? What is the most important property of a Markov process?

A Markov process is a 2-tuple $\langle S,P\rangle$, where $S$ is the set of states and $P$ is the state transition function;

A Markov decision process is a 5-tuple $\langle S,P,A,R,\gamma\rangle$, where $R$ is the expected reward obtained when going from state $S$ to state $S'$, $\gamma$ is the discount factor, and $A$ is the set of actions;

The most important Markov property is that the next state depends only on the current state and is independent of earlier states, i.e. $p(s_{t+1} \mid s_t) = p(s_{t+1} \mid s_1, s_2, \ldots, s_t)$.

3.2 Friendly interviewer: How do we usually solve the Markov decision process?

When solving a Markov decision process, we can directly solve the Bellman equation, also known as the dynamic programming equation:

$$V(s)=R(s)+ \gamma \sum_{s' \in S}p(s'|s)V(s')$$

In particular, its matrix form is $V=R+\gamma PV$. However, the Bellman equation is difficult to solve directly and the computational complexity is high, so methods such as dynamic programming, Monte Carlo, and temporal-difference learning are used instead.
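
As a complement to the methods listed above, here is a minimal sketch (toy MRP, illustrative numbers) of temporal-difference learning, TD(0), which updates the value estimate from sampled transitions rather than from the full model:

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0, -1.0])
gamma, alpha = 0.9, 0.05
rng = np.random.default_rng(0)

V = np.zeros(3)
s = 0
for _ in range(50_000):
    r = R[s]                          # reward observed in the current state
    s_next = rng.choice(3, p=P[s])    # sampled transition; no access to the full model is needed
    V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) update toward the bootstrapped target
    s = s_next
print(V)
```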

3.3 Friendly interviewer: What if the data flow does not have the Markov property? How should it be handled?

If the data do not have the Markov property, i.e. the next state also depends on earlier states, then making decisions based only on the current state will inevitably lead to decisions that generalize poorly. To address this, a recurrent neural network can be used to model the historical information and obtain a state representation that contains the history; the representation can also use mechanisms such as attention. The Markov decision process is then solved in this representation state space.
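
One possible realization of this idea, sketched with PyTorch (the GRU choice, network sizes, and dimensions are assumptions for illustration, not prescribed by the text): encode the observation history with a recurrent network and treat the final hidden state as the (approximately Markov) state representation.

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Encode a sequence of observations into a state representation via a GRU."""
    def __init__(self, obs_dim: int, hidden_dim: int = 64, n_actions: int = 4):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim); the last hidden state summarizes the history.
        _, h_n = self.gru(obs_seq)
        return self.policy_head(h_n.squeeze(0))   # action logits from the history summary

# Usage with dummy data: 8 trajectories of 20 observations of dimension 10.
logits = HistoryEncoder(obs_dim=10)(torch.randn(8, 20, 10))
print(logits.shape)   # torch.Size([8, 4])
```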

3.4 Friendly interviewer: Please write the Bellman equation based on the state value function and the Bellman equation based on the action value function.

(1) Bellman equation based on the state-value function: $$V_{\pi}(s) = \sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)\left[r(s,a)+\gamma V_{\pi}(s')\right]$$

(2) Bellman equation based on the action-value function: $$Q_{\pi}(s,a)=\sum_{s',r}p(s',r|s,a)\left[r(s,a)+\gamma V_{\pi}(s')\right]$$
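
The two are linked by averaging the action value over the policy:

$$
V_{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q_{\pi}(s, a)
$$

so substituting equation (2) for $Q_{\pi}(s,a)$ into this identity recovers equation (1), the Bellman equation for the state-value function.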

3.5 Friendly interviewer: Why are the optimal value function $V^*$ and the optimal policy $\pi^*$ equivalent?

The optimal value function is defined as $V^*(s)=\max_{\pi} V_{\pi}(s)$, i.e. we search for a policy $\pi$ that maximizes the value of every state. $V^*$ is the maximum value attainable in each state, and the policy that attains it is the optimal policy, i.e. $\pi^{*}(s)=\underset{\pi}{\arg\max}~V_{\pi}(s)$. The optimal policy maximizes the value function of every state, so if we can obtain an optimal value function, we can say that the environment of the Markov decision process is solved. In that case the optimal value function is unique, i.e. the upper bound of the values that is reached is the same, but there may be multiple optimal policies corresponding to the same optimal value.

3.6 Friendly interviewer: Can you write the $n$-step value function update formula by hand? In addition, as $n$ gets larger, do the expectation and variance of the value estimate become larger or smaller?

The larger $n$ is, the larger the variance and the smaller the bias of the expectation. The value function update formula is as follows:

$$Q\left(S, A\right) \leftarrow Q\left(S, A\right)+\alpha\left[\sum_{i=1}^{n} \gamma^{i-1} r_{t+i}+\gamma^{n} \max_{a} Q\left(S',a\right)-Q\left(S, A\right)\right]$$
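
A minimal sketch of this update in the tabular case (the state/action indices, rewards, and hyperparameters below are made up for illustration):

```python
import numpy as np

def n_step_q_update(Q, s, a, rewards, s_prime, gamma=0.9, alpha=0.1):
    """n-step Q-learning update for a tabular Q.

    Q       : (n_states, n_actions) array of action values
    s, a    : the state-action pair being updated (S, A in the formula)
    rewards : [r_{t+1}, ..., r_{t+n}], the n rewards observed after taking a in s
    s_prime : the state reached after n steps (S' in the formula)
    """
    n = len(rewards)
    g = sum(gamma ** (i - 1) * r for i, r in enumerate(rewards, start=1))  # sum_{i=1}^{n} gamma^{i-1} r_{t+i}
    target = g + gamma ** n * np.max(Q[s_prime])                           # bootstrap from max_a Q(S', a)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Usage with made-up numbers: 3 states, 2 actions, a 2-step update.
Q = np.zeros((3, 2))
Q = n_step_q_update(Q, s=0, a=1, rewards=[1.0, 0.5], s_prime=2)
print(Q)
```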
