[Autonomous Driving Decision Planning] Introduction to POMDP

Preface

This article contains my notes from learning about POMDPs. Given my limited ability there may be mistakes; criticism and corrections are welcome. I am a beginner and hope to learn together with everyone.

Markov Property

Markov Property or Markov assumption:
The probability distribution of future states depends only on the current state and is independent of past states.

$\boldsymbol{P}_{ss'}$ is the probability of transitioning from state $s$ to state $s'$, also known as the one-step state transition probability. $\boldsymbol{P}$ is the one-step state transition matrix.

$$
\begin{gathered}
P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t] \\
\boldsymbol{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s] \\
\boldsymbol{P} = \begin{bmatrix} P_{11} & P_{12} & \ldots & P_{1n} \\ P_{21} & P_{22} & \ldots & P_{2n} \\ \vdots & \vdots & & \vdots \\ P_{n1} & P_{n2} & \ldots & P_{nn} \end{bmatrix}
\end{gathered}
$$

The matrix has the following properties:

  1. Non-negativity: $P_{ij} \geq 0$
  2. Each row sums to 1: $\sum_{j} P_{ij} = 1,\ i = 1, 2, \ldots, n$

[Figure: two-state Markov chain example]

Taking the figure above as an example, $S_1$ transitions to itself and to $S_2$ with probabilities 0.1 and 0.9, respectively; $S_2$ transitions to itself and to $S_1$ with probabilities 0.2 and 0.8, respectively. The state transition matrix can be written as: $\boldsymbol{P}=\begin{bmatrix}0.1&0.9\\0.8&0.2\end{bmatrix}$

Markov Chain

A Markov process in which both time and state are discrete is called a Markov chain. A Markov chain can be described by the pair $(S, P)$. The state transition probabilities do not change over time.

Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical model that describes a Markov process with hidden, unobserved parameters. In layman's terms, on top of an unobservable Markov chain, the system emits observable events according to its hidden internal states, so that only the system's outputs can be observed. HMMs have been widely used in speech recognition, natural language processing, bioinformatics and other fields.

In an HMM, it is assumed that only the observations are visible, while the internal states of the model are hidden. Each state emits observations according to an observation distribution, and the model is parameterized by these observation distributions together with the state transition probabilities. Specifically, an HMM consists of the following parts (a small numerical sketch follows the list):

  • State set: $S = \{S_1, S_2, \ldots, S_N\}$; the hidden state at each time step takes a value in $S$.
  • Observation set: $O = \{O_1, O_2, \ldots, O_M\}$; each observation $o$ takes a value in $O$.
  • State transition probabilities: $P_{ij}$ is the probability of transitioning from state $S_i$ to state $S_j$; $P = \{P_{ij}\}$ is an $N \times N$ matrix, and $\sum_{j=1}^{N} P_{ij} = 1$ for all $i$.
  • Observation probabilities: $B_i(o)$ is the probability of observing $o$ in state $S_i$; $B = \{B_i(o)\}$ is an $N \times M$ matrix, and $\sum_{o=1}^{M} B_i(o) = 1$ for all $i$.
  • Initial state probabilities: $\pi_i$ is the probability that the initial state is $S_i$; $\pi = \{\pi_i\}$ is a vector of length $N$, and $\sum_{i=1}^{N} \pi_i = 1$.
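To make the notation concrete, here is a minimal numerical sketch (in Python with NumPy) of an HMM and the forward algorithm for computing the probability of an observation sequence. The two-state model, its probabilities, and the observation sequence are all invented for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-observation HMM; all numbers are invented for illustration.
pi = np.array([0.6, 0.4])           # initial state probabilities, sums to 1
P = np.array([[0.7, 0.3],           # P[i, j] = Pr(next state S_j | current state S_i)
              [0.4, 0.6]])          # each row sums to 1
B = np.array([[0.9, 0.1],           # B[i, o] = Pr(observation o | state S_i)
              [0.2, 0.8]])          # each row sums to 1

def forward(obs):
    """Forward algorithm: probability of the observation sequence under the model."""
    alpha = pi * B[:, obs[0]]                # alpha_1(i) = pi_i * B_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ P) * B[:, o]        # alpha_{t+1}(j) = sum_i alpha_t(i) P_ij * B_j(o)
    return alpha.sum()

print(forward([0, 1, 1]))                    # Pr(o_1=0, o_2=1, o_3=1)
```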

Markov Decision Process

A Markov Decision Process (MDP) is a mathematical model used to describe the interaction between an agent and its environment. An MDP can be expressed as a five-tuple $(S, A, P, R, \gamma)$:

  • $S$: state set, $S = \{s_1, s_2, \ldots, s_n\}$, including lane, environment, world model and other information.
  • $A$: action set, $A = \{a_1, a_2, \ldots, a_m\}$, the vehicle's decision-making space, including lane changing, following, overtaking, etc.
  • $P(s'|s, a)$: state transition probability, the probability of transitioning to state $s'$ after taking action $a$ in the current state $s$, where $s, s' \in S$, $a \in A$.
  • $R(s, a, s')$: immediate reward function, the immediate reward obtained after taking action $a$ in the current state $s$ and transitioning to state $s'$, where $s, s' \in S$, $a \in A$.
  • $\gamma \in (0, 1)$: discount factor, indicating the relative weight of current rewards and future rewards.

The MDP turns the agent's actions into choices between states, allowing the agent to continuously learn and optimize its decisions from experience. MDPs are therefore widely used in fields such as reinforcement learning.


The decision-making problem of an autonomous vehicle is essentially to find an optimal policy for the current state $S$ that reaches the target state through a sequence of actions $A$. We can therefore maximize the total reward with the objective function $\sum_{t=0}^{\infty}\gamma^t R(S_{t+1}|S_t)$. The action $a$ is given by a policy $\pi$: $a = \pi(s)$; this policy is the solution of the MDP we are looking for.
The problem can be solved by dynamic programming: iteratively solve for the maximum reward and finally backtrack to obtain the solution. Assuming the state transition probability matrix $P$ and the reward function $R$ are known, the optimal solution can be obtained by iterating the following process until convergence:

$$
\begin{aligned}
\pi(s_t) &\leftarrow \operatorname{argmax}_{a}\left\{\sum_{s_{t+1}} P_a(S_{t+1}\mid S_t)\left(R_a(S_{t+1}\mid S_t)+\gamma V(s_{t+1})\right)\right\} \\
V(s_t) &\leftarrow \sum_{s_{t+1}} P_{\pi(s_t)}(S_{t+1}\mid S_t)\left(R_{\pi(s_t)}(S_{t+1}\mid S_t)+\gamma V(s_{t+1})\right)
\end{aligned}
$$

$V(s_t)$ is the value function (Value Function), which represents the accumulated discounted reward. The solution iterates over $s_t$ and $s_{t+1}$ until convergence.
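As a concrete illustration of the iteration above, here is a minimal value-iteration sketch in Python. The tiny two-state, two-action MDP (its transition probabilities and rewards) is invented purely for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
# P[a, s, s'] = Pr(s' | s, a);  R[a, s, s'] = immediate reward for that transition.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Q[a, s] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=0)                    # greedy value: best action in each state
    if np.max(np.abs(V_new - V)) < 1e-8:     # stop once the values converge
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=0)                    # pi(s) = argmax_a Q(a, s)
print(V, policy)
```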

Regarding the process of iterative solution by dynamic programming, this article gives a brief introduction to the relevant principles (the Value Iteration Algorithm): Brief Introduction to the Value Iteration Algorithm.

One of the most important parts of an MDP is the design of the reward function. The following factors usually need to be considered:

  1. Reach the target point without deviating from the path;
  2. Safety;
  3. Comfort.

In addition, the design of state space, state transition matrix, etc. is very important.

Partially Observable Markov Decision Process

A Partially Observable Markov Decision Process (POMDP) is an extension of the Markov Decision Process (MDP) in which the state is not fully observable. A POMDP can be expressed as a seven-tuple $(S, A, P, R, Z, O, \gamma)$, where:

  • $S$: state set, $S = \{s_1, s_2, \ldots, s_n\}$, including lane, environment, world model and other information.
  • $A$: action set, $A = \{a_1, a_2, \ldots, a_m\}$, the vehicle's decision-making space, including lane changing, following, overtaking, etc.
  • $P(s'|s, a)$: state transition probability, the probability of transitioning to state $s'$ after taking action $a$ in the current state $s$, where $s, s' \in S$, $a \in A$.
  • $R(s, a, s')$: immediate reward function, the immediate reward obtained after taking action $a$ in the current state $s$ and transitioning to state $s'$, where $s, s' \in S$, $a \in A$.
  • $\gamma \in (0, 1)$: discount factor, indicating the relative weight of current rewards and future rewards.
  • $Z$: observation set, $Z = \{z_1, z_2, \ldots, z_n\}$.
  • $O(z|s', a)$: observation function, $O: S \times A \times Z \rightarrow [0, 1]$, the probability of receiving observation $z \in Z$ in the resulting state $s' \in S$ after performing action $a \in A$.

Insert image description here

In a POMDP, a set of observations is added to the model. We cannot directly observe the current state; instead, we estimate it from the observations produced by the model. The estimate of the current state is a probability distribution over states, so a corresponding observation model needs to be established.

Background on Solving POMDPs

This part of the content refers to https://cs.brown.edu/research/ai/pomdp/tutorial/pomdp-solving.html.

For the CO-MDP (completely observable MDP) problem, we need to find a mapping from states to actions; for the POMDP problem, we need to find a mapping from probability distributions over states to actions. In the following sections, we use the term belief state in place of "probability distribution over states", and belief space in place of "probability space" (the set of all possible probability distributions).

The next picture describes the concept of belief space. To simplify the problem, we use a 2-state POMDP. As seen above, a belief state is a probability distribution, and all the probabilities must sum to 1. Therefore, in the 2-state POMDP, if the probability of one state is $p$, the probability of the other state is $1-p$, so the entire belief space can be described by a line segment (the width of the segment in the figure is irrelevant).

[Figure: 1D belief space for a 2-state POMDP]

As shown in the figure, if the belief state is at the far left of the segment, the probability of being in state $s_2$ is 1; at the far right, the probability of being in state $s_1$ is 1. It is worth noting that the same idea extends to higher-dimensional belief spaces when there are more than two states.

Next, suppose we start from a specified belief state $b$, take action $a_1$, and receive observation $z_1$; then the next belief state can be completely determined. In fact, assuming a finite number of actions and a finite number of observations, the next belief state is determined by the current belief state together with the action taken and the observation received. A sketch of this belief update is shown below.
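A minimal sketch of this belief update, the standard Bayes-filter form $b'(s') \propto O(z \mid s', a)\sum_s P(s' \mid s, a)\, b(s)$; the two-state transition and observation matrices below are invented for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action, 3-observation POMDP pieces; numbers invented for illustration.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],              # T[a, s, s'] = Pr(s' | s, a)
              [[0.5, 0.5], [0.4, 0.6]]])
O = np.array([[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],    # O[a, s', z] = Pr(z | s', a)
              [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]])

def belief_update(b, a, z):
    """tau(b, a, z): the next belief after taking action a and observing z."""
    predicted = b @ T[a]                  # sum_s b(s) * Pr(s' | s, a)
    unnorm = predicted * O[a][:, z]       # multiply by Pr(z | s', a)
    return unnorm / unnorm.sum()          # normalize; the denominator is Pr(z | b, a)

b = np.array([0.5, 0.5])                  # starting belief
print(belief_update(b, a=0, z=0))         # the next belief is fully determined by (b, a, z)
```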

The following figure shows this process for two states ($s_1, s_2$), two actions ($a_1, a_2$), and three observations ($z_1, z_2, z_3$); the black vertical line segment is the starting belief state.
[Figure]
The diagram above shows all possible next belief states. To sum up: for a starting belief state, once an action is taken and the corresponding observation is received, the next belief state can be determined. However, in a POMDP the observation is not known before it arrives, so each possible next belief state occurs with some probability. There are many such belief states (one for each observation), but for a given action the probabilities of the possible next belief states sum to 1.

Clearly, the above process satisfies the Markov property: the next belief state depends only on the previous belief state (together with the action and observation). In fact, we can transform the discrete POMDP problem into a CO-MDP problem over a continuous space, where the continuous space is the belief space. This means that we can use the value iteration (VI) algorithm to solve it, although the algorithm still needs further adaptation.

In some tutorials, this is described as a Belief-MDP.

An MDP can be solved well by value iteration because its states are discrete and the value function assigns one value to each discrete state. The POMDP problem is different: the state (the belief state) is continuous, so a suitable representation of the value function over the belief space is needed. The figure below is an example of such a value function:
[Figure]
The POMDP formulation imposes useful structure on this problem. The core point is: for each horizon length, the value function is piecewise linear and convex (PWLC). This means that in each value-iteration step we only need to find a finite number of line segments to construct the value function.

The figure below is a corresponding example. Where the line segments overlap, the value function takes the maximum.
[Figure]

These amount to nothing more than lines or, more generally, hyper-planes through belief space. We can simply represent each hyper-plane with a vector of numbers, which are the coefficients of the equation of the hyper-plane. The value at any given belief state is found by plugging the belief state into the hyper-plane's equation.

If we represent the hyper-plane as a vector (that is, the coefficients of its equation, also known as an alpha vector) and each belief state as a vector (the probability of each state), then the value of a belief state is simply the dot product of the two vectors ($\alpha \cdot b$).
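A tiny sketch of this representation: with a handful of alpha vectors (the coefficients below are invented for illustration), the value at a belief point is the maximum dot product over the vectors, and the maximizing vector identifies the region of the belief space the point falls in.

```python
import numpy as np

# Hypothetical alpha vectors for a 2-state POMDP; each row is one hyper-plane (line)
# over the belief space. The coefficients are invented for illustration.
alphas = np.array([[1.0, 0.0],
                   [0.0, 1.5],
                   [0.8, 0.8]])

def value(b):
    """V(b) = max_alpha (alpha . b): the PWLC value function."""
    return np.max(alphas @ b)

b = np.array([0.25, 0.75])     # belief state: [Pr(s1), Pr(s2)]
print(value(b))                # 1.125
print(np.argmax(alphas @ b))   # 1 -> the second vector is maximal, i.e. b lies in its region
```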

Here is another way to segment the belief space. The two are actually equivalent:

Instead of linear segments over belief space, another way to view the function is that it partitions belief space into a finite number of segments. We will be using both the value function and this partitioning representation to explain the algorithms. Keep in mind that they are more or less interchangeable.

[Figure]

The problem now boils down to one stage of value iteration: given a set of vectors representing the value function for horizon 'h', we just need to generate the set of vectors for the value function of horizon 'h+1'.

To summarize: given a set of vectors representing the value function for horizon 'h', we need to derive the set of vectors for the value function of horizon 'h+1'.

However, the continuous state space does pose a further problem. In each value-iteration step over a discrete state space, we can find the new value of a state by looping over all possible next states. For continuous-state CO-MDPs, however, we cannot enumerate all possible states, because there are infinitely many of them.

Next, a series of methods will be introduced to solve the above problems.

POMDP Value Iteration Example

Next is an example with a horizon length of 3. As before, there are two states ($s_1, s_2$), two actions ($a_1, a_2$), and three observations ($z_1, z_2, z_3$).

Horizon 1 value function

For horizon = 1, we only care about the immediate reward, so the value function is just the immediate reward, without considering the impact of future rewards (that is, the discount factor $\gamma$ plays no role).

For this problem, we have two states and two actions, so there are four state-action combinations, each with its own value: $r(s_1, a_1),\ r(s_2, a_1),\ r(s_1, a_2),\ r(s_2, a_2)$.
These values are defined on the discrete states (the POMDP is built on discrete states), but the value of performing a specific action in a specific belief state is easy to obtain: we simply weight each state's value by that state's probability in the belief.

A simple example to further the understanding: suppose that executing action $a_1$ gives $r(s_1, a_1) = 1,\ r(s_2, a_1) = 0$, and executing action $a_2$ gives $r(s_1, a_2) = 0,\ r(s_2, a_2) = 1.5$. If the current belief state is $[0.25, 0.75]$ (meaning we "believe" there is a 75% chance of being in $s_2$ and a 25% chance of being in $s_1$), then the value of executing $a_1$ is $0.25 \times 1 + 0.75 \times 0 = 0.25$, and the value of executing $a_2$ is $0.25 \times 0 + 0.75 \times 1.5 = 1.125$. We can display these values over the belief space using the diagram below.

[Figure: horizon 1 value function]

It can be seen that the immediate reward of each action defines a linear function over the belief space. In a given belief state, we choose the action that brings the higher value. In the figure above, for the green part of the belief space, choosing $a_2$ clearly brings the higher value, and for the blue part, choosing $a_1$ brings the higher value.
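The computation in the example above is just one dot product per action; a minimal sketch reproducing those numbers:

```python
import numpy as np

# Immediate rewards from the example above: rows are states, columns are actions.
r = np.array([[1.0, 0.0],      # r(s1, a1) = 1,  r(s1, a2) = 0
              [0.0, 1.5]])     # r(s2, a1) = 0,  r(s2, a2) = 1.5

b = np.array([0.25, 0.75])     # belief state: [Pr(s1), Pr(s2)]

values = b @ r                 # horizon-1 value of each action in belief b
print(values)                  # [0.25, 1.125]
print(values.argmax())         # 1 -> a2 is the better immediate action at this belief
```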

Horizon 2 value function
Next, we will introduce the Horizon 2 value function.

Our goal is to find, for each belief state, the optimal strategy consisting of only two actions. This is fairly complicated, so let's break it down into the following three questions:

  1. How to calculate the value of a single belief state given an action and observation.
  2. How to calculate the value of a belief state given only an action.
  3. How to calculate the value of a belief state.

V(b) given a and z

Let's start with the first question: given a specific belief state $b$, how do we calculate its value when we execute action $a_1$ and receive observation $z_1$?

For horizon = 2, the value of the current belief state is the value of the immediate action plus the value of the next action. Because the immediate action is fixed, the overall process is actually similar to the horizon = 1 case.

Let's look at the example in the figure below: the left side is the immediate reward function, and the right side is the horizon 1 value function. The immediate reward function of $a_2$ is the dotted line; we do not have to consider it (because the action $a_1$ is already fixed).

[Figure: value of a fixed action and observation]

We define $\tau$ as the function that, given $a_1$ and $z_1$, maps the belief state $b$ to $b'$. From $b'$'s perspective, it is easy to decide which action brings the higher value. In the figure above, $b'$ falls in the green region, which means that for horizon 2, if we first choose $a_1$ and observe $z_1$, then choosing $a_2$ next brings us the greater value. Clearly, we know everything needed to calculate this value: we know the immediate reward and we know the best value after transferring to $b'$, and adding the two gives the value of belief state $b$ given $a_1$ and $z_1$.

By repeating the steps above, we can calculate the value of any single belief state (immediate reward + transformed value). Next we want to find the value of all belief states (the value at every point of [0, 1] on the horizontal axis).

The horizon 1 value function is a function of the transformed belief state $b'$, and $b'$ is obtained from the initial belief state $b$ through the transition function $b' = \tau(b, a_1, z_1)$. Therefore, from the horizon 1 value function we can construct a function over the entire belief space that has the transition function $b' = \tau(b, a_1, z_1)$ built in.

[Figure: transformed value function]

We use $S(a_i, z_j)$ to denote the function above (the transformed value function). Now, if we want to find the value of a belief state for a given $(a_i, z_j)$, we just add $S(a_i, z_j)$ and the immediate reward $r(s_i, a_i)$.

PS: It is worth noting that the function $S(a_i, z_j)$ is also PWLC.

V(b) given a

Following the question above: for a given $(a_i, z_j)$, we can compute the value of a belief state $b$ using the method above. Typically, however, the observation $z$ is not known in advance; we cannot guarantee that a particular $z_j$ will be observed. The following example explains the problem further:

As shown in the figure below, given action $a_1$, the belief state $b$ may produce any of the three observations $z_1, z_2, z_3$, leading to three possible outcomes.

Solving this problem is simple: if we know the value of the resulting belief state for each given observation $z_j$, then even without knowing which observation will occur, we can compute the value of the belief state: we simply weight each outcome by the probability that observation $z_j$ occurs. A sketch of this computation is given after the figure.
[Figure]
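In other words, for a fixed action the value is an expectation over observations: the immediate reward plus $\sum_z \Pr(z \mid b, a)\, V_{h-1}(\tau(b, a, z))$. A minimal sketch under invented two-state numbers (the matrices and rewards below are hypothetical, and discounting is ignored for simplicity):

```python
import numpy as np

# Hypothetical 2-state, 2-action, 3-observation POMDP; all numbers invented for illustration.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],              # T[a, s, s'] = Pr(s' | s, a)
              [[0.5, 0.5], [0.4, 0.6]]])
O = np.array([[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],    # O[a, s', z] = Pr(z | s', a)
              [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]])
r = np.array([[1.0, 0.0],                            # r[s, a]: immediate reward
              [0.0, 1.5]])

def V_h1(b):
    """Horizon-1 value: the best immediate reward in belief b."""
    return np.max(b @ r)

def V_given_a(b, a):
    """Horizon-2 value of belief b for a fixed action a: immediate reward plus the
    expectation over observations of the horizon-1 value of the updated belief."""
    value = b @ r[:, a]                       # immediate reward of action a in belief b
    predicted = b @ T[a]                      # predicted state distribution after a
    for z in range(O.shape[2]):
        pz = predicted @ O[a][:, z]           # Pr(z | b, a)
        if pz > 0:
            b_next = predicted * O[a][:, z] / pz   # tau(b, a, z)
            value += pz * V_h1(b_next)
    return value

b = np.array([0.5, 0.5])
print(V_given_a(b, a=0), V_given_a(b, a=1))
```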

PS: The $S(a_1, z_1)$ function we showed before actually includes a factor for the observation probability. We claimed that it gives, for every belief state $b$, the value of the next belief state $b'$ under a fixed action $a_i$ and a given observation $z_j$; in fact, the $S(a_i, z_j)$ function is not exactly that: it also has the probability of the observation built in.

In this example (with three possible observations), each $z_i$ has a corresponding $S$ function. The obvious conclusion that follows is that the best next action depends not only on the initial belief state, but also on the exact observation we receive.

[Figure: transformed value function for all observations]

The value at horizon 2 depends not only on the given action $a_1$, but also on the next action. For a given belief state and observation, we can look at the partitions induced by the $S$ functions to decide which action is best to take next.
[Figure]

As shown in the figure below, given the belief state $b$ and action $a_1$, we can easily read the next strategy off the picture: if $z_1$ is observed, execute $a_2$; if $z_2$ or $z_3$ is observed, execute $a_1$. For each point in the belief space, we can draw such a vertical line to obtain the strategy at that belief state: if $z_i$ is observed, select $a_j$. The strategy in the picture below is $(z_1: a_2,\ z_2: a_1,\ z_3: a_1)$.

[Figure]
V(b)

However, the above process only gives the optimal strategy for the particular belief state $b$; it does not give the optimal strategy for every belief state $b_i$. Now look at the picture below:

In the figure below, the different colors in the upper rectangle represent the ranges of $b$ over which different strategies are optimal. With three observations $z_1, z_2, z_3$ and two actions $a_1, a_2$, there are $2^3 = 8$ possible strategies, but not every strategy is useful (some are not optimal under any circumstances). In the example in the figure, only 4 strategies remain.

Similarly, a line segment can be constructed for each of these four strategies, and finally a piecewise linear and convex function is obtained, as shown in the figure below.
[Figure]

Note that each line segment in the figure above represents two actions: the immediate action $a_1$ and a next action that depends on the observation.

[Figure]
If $a_1$ were the only action in our model, the above process would be complete and we would already have the horizon 2 value function. However, we have another action. As shown in the figure above, repeating the above process for $a_2$ gives $a_2$'s value function and its partition into two regions.

Next, we superimpose the two value functions of $a_1$ and $a_2$ to find which action brings the greater value.
[Figure]
When constructing the PWLC function, notice that some line segments are never optimal for either action and are therefore pruned. Finally, the horizon 2 value function is obtained, as shown in the figure below.

[Figure]

PS: Several regions above are labeled $a_1$; this only means that their first action is the same, and which region the belief falls in determines the corresponding full strategy. For example, for the magenta region, our first step is to select $a_1$, and the remaining action is then chosen based on the observation received. A simple pruning sketch follows.
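The pruning step can be approximated very simply by removing vectors that are pointwise dominated by another vector. Full solvers also use a linear program to remove vectors dominated only by combinations of others, but the check below conveys the idea; the vectors are invented for illustration.

```python
import numpy as np

# Hypothetical alpha vectors for a 2-state value function; invented for illustration.
alphas = np.array([[2.0, 0.0],
                   [1.5, 1.0],
                   [1.4, 0.9],    # pointwise dominated by [1.5, 1.0] -> pruned
                   [0.0, 2.5]])

def prune_pointwise(vs):
    """Keep only the vectors that are not pointwise dominated by some other vector."""
    keep = []
    for i, v in enumerate(vs):
        dominated = any(np.all(w >= v) and np.any(w > v)
                        for j, w in enumerate(vs) if j != i)
        if not dominated:
            keep.append(v)
    return np.array(keep)

print(prune_pointwise(alphas))    # the dominated vector [1.4, 0.9] is removed
```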

Horizon 3 value function
The horizon 3 case is similar to the horizon 2 case. First, construct the value function for $a_1$ (again using the $S$ functions to transform the horizon 2 value function). As shown below, there are six useful strategies.

[Figure]
Then construct the value function for $a_2$, which, as shown in the figure below, has 4 useful strategies.

[Figure]
By superimposing the two (see the figure below), we can combine the two possible actions and prune the strategies that are never optimal. The final picture on the right is then the horizon 3 value function.
[Figure]

[Figure]


The current method for solving the POMDP is the predictive-model method: introduce the belief (a confidence over states) and convert the problem into a statistical MDP to solve.



The picture below nicely shows the relationship between the four Markov models.
[Figure: relationship between the four Markov models]

Recommended reading and reference

[1] https://cs.brown.edu/research/ai/pomdp/tutorial/ (explains MDP and POMDP, very easy to understand)
[2] http://cbl.eng.cam.ac.uk/pub/Intranet/MLG/ReadingGroup/pomdp.pdf
[3] https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13692
[4] [Planning Problem] 1 Intuitive understanding of POMDP value iteration
[5] https://www.cs.cmu.edu/~ggordon/780-fall07/lectures/POMDP_lecture.pdf
