Mobile Robot Path Planning (7) --- MDP-Based Planning

Table of contents

1 What is MDP-Based Planning

2 Worst-Case Analysis for the Nondeterministic Model

3 Expected Cost Planning

4 Real-Time Dynamic Programming (RTDP)


1 What is MDP-Based Planning

        There are many executable paths from the start point to the goal point, and we can choose the optimal one according to how the environment changes during execution.

        So far, we have assumed that the robot plans under ideal conditions: its execution is perfect and its state estimation is perfect.

        Using the two pictures above, we plan a route from one location to another. We assume that if we command the robot to move one grid cell, it moves exactly one grid cell; that the picture on the right accurately reflects the environment; and that after pose estimation the robot reaches the goal position exactly, with no unexpected situations.

        This is not actually the case:

        In practical applications, execution and state estimation are not perfect.

         • Execution uncertainty: slippage, rough terrain, wind, air resistance, control errors, etc.

         • State estimation uncertainty: sensor noise, calibration error, imperfect estimation, partial observability, etc.

        From the robot's perspective, uncertainty can be divided into two categories according to how much information the robot can utilize.

        Uncertainty model

        • Non-determinism: the robot does not know what kind of uncertainty or disturbance will be added to its action at the next step (the disturbed target point can end up far from the intended one, depending on the natural environment).

         • Probabilistic: the robot has an estimate of the uncertainty, obtained by observing and collecting statistics (for example, from the disturbances recorded over previous runs).

        To formalize this concept, we first introduce two decision-makers to model how uncertainty is generated, and then the types of planning under uncertainty.

        Decision maker (game participant):

         • The robot is the primary decision-maker, planning as if states were fully known and execution were perfect.

         • Nature adds uncertainty to the plans made by the robot, in a way the robot cannot predict.

        Formalization-7.1: Game with nature (independent game)
• The non-empty set U is called the robot action space. Each u ∈ U is called a robot action.
• The non-empty set Θ is called the nature action space. Every θ ∈ Θ is called a nature action.
• Function L: U × Θ → R ∪ {∞}, called cost function or negative reward function.

        Formalization-7.2: Nature knows the robot's action (dependent game)
        • The non-empty set U is called the robot action space. Each u ∈ U is called a robot action.
        • For each u ∈ U, there is a non-empty set Θ(u) called the nature action space.
        • Function L: U × Θ(u) → R ∪ {∞}, called the cost function or negative reward function.
        When a robot plays a game with nature, what is the best decision for the robot?

        One-step Worst-Case Analysis
        • Under the nondeterministic model, P(θ) in the independent game and P(θ|u_k) in the dependent game are unknown;
        • The robot cannot predict the behavior of nature and assumes that it maliciously chooses actions that will make the cost as high as possible;
        • Therefore, it is reasonable to make decisions assuming the worst-case scenario.

        We enumerate all of nature's actions, find the most unfavorable outcome for each robot action, and let the robot take the action that minimizes this worst outcome, as written below.
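        In symbols (using the cost function L of Formalization-7.1/7.2), the one-step worst-case decision is the minimax choice

        u^{*} = \arg\min_{u \in U} \; \max_{\theta \in \Theta(u)} L(u, \theta)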

        • Under the probabilistic model, P(θ) in the independent game and P(θ|u_k) in the dependent game are known;
        • Assume that nature's actions have been observed repeatedly and that nature uses a randomized strategy when choosing them.
        • Therefore, we optimize the average cost to be received (the expected cost).

        Every time the robot performs an action, the environment accordingly exerts various influences. We take the expectation over these influences and choose the action that minimizes the expected value, as written below.
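        Correspondingly, the one-step expected-cost decision under the probabilistic model is

        u^{*} = \arg\min_{u \in U} E_{\theta}\big[L(u, \theta)\big] = \arg\min_{u \in U} \sum_{\theta \in \Theta(u)} P(\theta \mid u) \, L(u, \theta)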

        What about multi-step situations?

        Formalization-7.3: Discrete Planning with Nature
1. A non-empty state space X with initial state x_s and target set X_F ⊂ X.
2. For each state x ∈ X, there is a finite and non-empty robot action space U(x). For every x ∈ X and u ∈ U(x), there is a finite and non-empty nature action space Θ(x, u).

3. A state transition function f that yields a new state x_{k+1} = f(x_k, u_k, θ_k) for each x ∈ X, u ∈ U(x), and θ ∈ Θ(x, u).

4. A set of stages, each denoted by k, beginning at k = 1 and either continuing indefinitely or ending at a maximum stage k = K + 1 = F (a plan is generally not completed in a single stage).

5. A stage-additive cost functional L. Let \tilde{x}_F, \tilde{u}_K, \tilde{\theta}_K denote the histories of states, robot actions, and nature actions up to stage K:

L(\tilde{x}_F, \tilde{u}_K, \tilde{\theta}_K) = \sum_{k=1}^{K} l(x_k, u_k, \theta_k) + l_F(x_F)

        Markov Decision Process (MDP)
        In the learning field, MDP is a 4-tuple (S, A, P, R), and in the planning field it is (X, U, P, L):
        • S or X is the state space,
        • A or U is the (robot) action space,
        • P(x_{k+1} | x_k, u_k) is the state transition function under the probabilistic model, which degenerates into a successor set X_{k+1}(x_k, u_k) under the nondeterministic model,
        • R(x_k, x_{k+1}) is the immediate reward, i.e., the negative one-step cost −l(x_k, u_k, θ_k) incurred when u_k and θ_k move the system from x_k to x_{k+1}.
        The first difficulty in planning in the face of uncertainty is to appropriately formalize our problem using an MDP model.

        When the robot moves from x_I to x_G, the state space is the set of grid points (the black dots in the figure). There are five robot actions: stay in place, up, down, left, and right.

        Nature’s action space:

        1. Continuous: we assume that when the robot executes action u_k at position x_k, a random Gaussian error is added.

        2. Discrete: we discretize nature's action space, so that when the robot executes action u_k at position x_k, an additional disturbance action is added.

        Cost function l: the distance between the next state and the current state.

        We hope to find a path (move to the target location with the minimum cost).
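        As a minimal sketch of this formalization in Python (the grid size, the disturbance probabilities, and all names below are illustrative assumptions, not taken from the original post):

```python
import itertools

# Hypothetical encoding of the grid MDP described above; the grid size and
# the disturbance probabilities are assumptions for illustration only.

# Five robot actions: stay in place, up, down, left, right.
ACTIONS = {"stay": (0, 0), "up": (0, 1), "down": (0, -1),
           "left": (-1, 0), "right": (1, 0)}

# Nature's action space, discretized (option 2 above): a small offset is
# added to the commanded move, with assumed probabilities summing to 1.
NATURE = {(0, 0): 0.8, (1, 0): 0.05, (-1, 0): 0.05,
          (0, 1): 0.05, (0, -1): 0.05}

W, H = 10, 10  # assumed grid dimensions
STATES = set(itertools.product(range(W), range(H)))

def f(x, u, theta):
    """State transition x_{k+1} = f(x_k, u_k, theta_k)."""
    nx = (x[0] + ACTIONS[u][0] + theta[0],
          x[1] + ACTIONS[u][1] + theta[1])
    return nx if nx in STATES else x  # stay put if pushed off the grid

def l(x, u, theta):
    """One-step cost: distance between the next state and the current one."""
    nx = f(x, u, theta)
    return abs(nx[0] - x[0]) + abs(nx[1] - x[1])
```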

        A policy \pi is a mapping from the state space to the action space: it specifies which action to perform in which state, and takes the form of a discrete lookup table.

        Define the variables that measure the quality of the policy:
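        The figure with these definitions is not reproduced here; a standard pair of definitions, consistent with the cost-to-go G used in the rest of the post (this completion is an assumption), is: under the nondeterministic model, the worst-case cost-to-go of a policy \pi,

        G^{\pi}(x_1) = \max_{\tilde{\theta}} \Big\{ \sum_{k=1}^{K} l(x_k, \pi(x_k), \theta_k) + l_F(x_F) \Big\},

        and under the probabilistic model, the expected cost-to-go,

        G^{\pi}(x_1) = E_{\tilde{\theta}} \Big[ \sum_{k=1}^{K} l(x_k, \pi(x_k), \theta_k) + l_F(x_F) \Big].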


2 Worst-Case Analysis for the Nondeterministic Model

        Let’s take a specific example:

        Suppose the optimal cost-to-go at stage k+1 is already known and we are in the current state x_k. When the robot performs a particular action u_k, the specific x_{k+1} it reaches from x_k is determined by \theta_k. For each candidate u_k we find the \theta_k that makes the single-step cost plus the stage-(k+1) cost-to-go largest, and then choose the u_k whose worst case is smallest.

        We assume that the cost-to-go at the end point is 0 and is unknown everywhere else. For s_3, the cost-to-go is 0 + 1. For s_1, there is only one u but two \theta: one branch is a single step of 2 + the end point's 0; the other branch is 2 + the cost-to-go of s_2 (not calculated yet).

        Solve iteratively from the end point to the starting point.
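        Written out, this backward recursion is the minimax Bellman update

        G^{*}_{k}(x_k) = \min_{u_k \in U(x_k)} \Big\{ \max_{\theta_k \in \Theta(x_k, u_k)} \big\{ l(x_k, u_k, \theta_k) + G^{*}_{k+1}\big(f(x_k, u_k, \theta_k)\big) \big\} \Big\}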

        As an example:

        First we set G(x_F) \leftarrow 0, set all other states to infinity, and initialize the open list with s_g.

        The next state to expand is s_g. The next step is to find its predecessor nodes.

        For s_3, compute G(x_k = s_3) = 1 + G(x_{k+1} = s_g), and add s_3 to the open list.

        For s_1, G^{*}_{k}(x_k=s_1) = min \{ max \{2+0, 2+\infty \} \} = \infty (the \infty branch means s_1 cannot be updated yet, so it is not released into the open list); G^{*}_{k}(x_k=s_3) = min \{ 1+0 \} = 1.

        For s_4, the same.

        For s_2, the same again; however, it has two predecessor nodes, s_s and s_1. Process s_1 first:

        Finally, update s_s:
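        The example above expands states Dijkstra-style through an open list. A minimal Python sketch of the same worst-case backward update, using plain sweeps instead of an open list and with all interfaces assumed, could look like this:

```python
INF = float("inf")

def worst_case_backward(states, goal, U, Theta, f, l, sweeps=1000):
    """Worst-case (minimax) cost-to-go under the nondeterministic model.

    Assumed interfaces: U(x) gives robot actions at x; Theta(x, u) gives
    nature's actions at (x, u); f(x, u, theta) is the successor state;
    l(x, u, theta) is the one-step cost.
    """
    G = {x: INF for x in states}
    G[goal] = 0.0
    for _ in range(sweeps):
        changed = False
        for x in states:
            if x == goal:
                continue
            best = INF
            for u in U(x):
                # Nature picks the worst theta for this action.
                worst = max(l(x, u, th) + G[f(x, u, th)]
                            for th in Theta(x, u))
                best = min(best, worst)
            if best < G[x]:
                G[x] = best
                changed = True
        if not changed:  # no state improved in a full sweep: done
            break
    return G
```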

        Advantages & disadvantages:

 

3 Expected Cost Planning

        So here is the problem:

        Let’s look at the algorithm description:
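        The original algorithm figure is not reproduced here; the expected-cost value-iteration update it describes has the standard form (a reconstruction, not copied from the post):

        G(x) \leftarrow \min_{u \in U(x)} \sum_{\theta \in \Theta(x, u)} P(\theta \mid x, u) \big[ l(x, u, \theta) + G\big(f(x, u, \theta)\big) \big],

        applied repeatedly to every state in the chosen iteration order.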

        Let's look at an example:

        First we initialize the G values to 0 and choose an iteration order s_1 -> s_2 -> s_3 -> s_4 -> s_s.

        Let's take a look at the update of s_1 first:

        Next, the update of s_2:

        Next, the update of s_3:

        Next, the update of s_4:

        Finally, the update of s_s:

        After one round we have G. We then proceed to the second iteration:

        ... and then a third iteration, continuing in the same way.

        How do we judge convergence? What are the boundary conditions? How can the method be improved? How can the iteration order be improved? One simple convergence test is sketched below.
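        A common answer to the convergence question, sketched in Python under the same assumed interfaces as before: stop when the largest change in G over a full sweep falls below a threshold. Sweeping states roughly from the goal outward usually speeds up convergence (Gauss-Seidel-style iteration).

```python
def expected_cost_vi(states, goal, U, Theta, P, f, l,
                     eps=1e-6, max_iters=1000):
    """Expected-cost value iteration with a simple convergence test.

    P(theta, x, u) is the assumed probability of nature's action theta at
    (x, u); the other interfaces are as in the worst-case sketch above.
    """
    G = {x: 0.0 for x in states}  # initialize G to 0, as in the example
    for _ in range(max_iters):
        delta = 0.0
        for x in states:  # iteration order affects speed, not the fixed point
            if x == goal:
                continue
            new_g = min(sum(P(th, x, u) * (l(x, u, th) + G[f(x, u, th)])
                            for th in Theta(x, u))
                        for u in U(x))
            delta = max(delta, abs(new_g - G[x]))
            G[x] = new_g
        if delta < eps:  # largest update in this sweep is tiny: converged
            break
    return G
```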

        Advantages & disadvantages:

        1. Reflects the average level of performance.

        2. Not necessarily the best in any individual run.

4 Real-Time Dynamic Programming (RTDP)

        Look at the actual example:

        G is updated based on each node's step count to s_g.
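        A minimal sketch of the RTDP loop (interfaces assumed as before): run repeated trials from the start state, back up G only at the states actually visited, act greedily, and sample nature's action. Initializing G with each node's step distance to s_g, as the example suggests, gives an optimistic starting estimate.

```python
import random

def rtdp_trial(x_start, goal, U, Theta, P, f, l, G, max_steps=1000):
    """One RTDP trial: back up G only along the trajectory actually visited.

    G is a dict over states; the interfaces are the same assumed ones as
    in the earlier sketches.
    """
    x = x_start
    for _ in range(max_steps):
        if x == goal:
            break
        # Expected cost of each action under the current estimate G.
        def q(u):
            return sum(P(th, x, u) * (l(x, u, th) + G[f(x, u, th)])
                       for th in Theta(x, u))
        u_best = min(U(x), key=q)
        G[x] = q(u_best)  # Bellman backup at the visited state only
        # Sample nature's action and move to the next state.
        thetas = list(Theta(x, u_best))
        weights = [P(th, x, u_best) for th in thetas]
        x = f(x, u_best, random.choices(thetas, weights=weights)[0])
    return G
```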


Origin: blog.csdn.net/qq_41694024/article/details/134578821