Robot Learning - Path Planning Based on Samples and Probability (2)

10. Markov decision process

Example of recycling robot

As an example, consider a recycling robot. The robot's goal is to navigate its surroundings and pick up as many cans as possible. It has a set of possible states and a set of possible actions. The robot is rewarded for picking up a can, and receives a negative reward (a penalty) if it runs out of battery or gets stuck.

The robot has a non-deterministic transition model (sometimes called one-step dynamics). This means that an action is not guaranteed to move the robot from one state to another; instead, each possible resulting state has an associated probability.

Suppose that at some time step t, the robot's battery level is high (S_t = high). In response, the agent decides to search for cans (A_t = search). In this case, there is a 70% chance that the robot's battery stays high and a 30% chance that it drops to low.
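To make the one-step dynamics concrete, here is a minimal sketch of how such a transition model could be written down in Python. Only the 0.7 / 0.3 split for searching with a high battery comes from the example above; the other entries are illustrative assumptions.

```python
# Non-deterministic transition model (one-step dynamics) for the recycling robot.
# Each (state, action) pair maps to a list of (probability, next_state) outcomes.
transition_model = {
    ("high", "search"): [(0.7, "high"), (0.3, "low")],      # from the example above
    ("low", "search"):  [(0.6, "low"), (0.4, "depleted")],  # assumed, for illustration
    ("high", "wait"):   [(1.0, "high")],                    # assumed, for illustration
}

# Sanity check: the outcome probabilities for every (state, action) pair sum to 1.
for outcomes in transition_model.values():
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
```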

Before moving on, let's review the definition of an MDP.

MDP definition
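The definition itself appears as a figure in the original post and is not reproduced here. For reference, a standard formulation of an MDP consists of the following components (this summary is a reconstruction, not the original figure):

```latex
% A (finite) Markov decision process is the tuple
%   S      : a set of states
%   A      : a set of actions
%   T      : the transition model (one-step dynamics),
%            T(s, a, s') = P(S_{t+1} = s' | S_t = s, A_t = a)
%   R      : the reward function, here R(s), the reward for a state
%   \gamma : a discount factor (optional in episodic tasks such as this one)
\text{MDP} = \langle S,\ A,\ T,\ R,\ \gamma \rangle
```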

Combined path planning scheme

If we apply A* search to this discrete, 4-connected environment, the resulting path has the robot move two cells to the right, then two cells down, and then right again to reach the goal (or RRDRD, an equally short path). This is indeed the shortest path; however, it takes the robot through a very dangerous area (the pond). There is a good chance that the robot will fall into the pond and fail to complete its task.

If we use MDPs for path planning, we might get better results!

In each state (cell), the robot receives a certain reward R(s). This reward can be positive or negative, but it must be finite. Typically, the following rewards are assigned:
-- small penalties for non-goal states, representing the cost of time passing (a slow-moving robot incurs a larger total penalty than a fast one),
-- a large reward for the goal state, and
-- large penalties for dangerous states, which will hopefully convince the robot to avoid them.

Given the uncertainty in the robot's motion, these rewards will help guide the robot to an efficient and safe path.

The diagram below shows an environment for assigning appropriate rewards.

As you can see, entering a non-goal state has a reward of -1 if the cell is flat and -3 if it is mountainous. The dangerous pond has a reward of -50, while the goal has a reward of +100.
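In code, this reward assignment is just a lookup table keyed by cell type. A minimal sketch using the values above:

```python
# Rewards assigned by the environment to each cell type, as described above.
REWARDS = {
    "flat": -1,      # small time penalty for ordinary cells
    "mountain": -3,  # larger penalty for slow, mountainous cells
    "pond": -50,     # large penalty for the dangerous pond
    "goal": 100,     # large reward for reaching the goal
}
```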

With the robot's transition model determined and appropriate rewards distributed to all regions of the environment, we can now construct a policy. Read on to see how this is done in probabilistic path planning!

11. Policies

In reinforcement learning, the solution to a Markov decision process is called a policy and is denoted by the letter π.

Definition

A policy is a mapping from states to actions. For each state, the policy tells the robot which action it should take. The optimal policy, denoted π*, tells the robot the best action to take from any state in order to maximize the overall reward. We will study the optimal policy in more detail below.

Optional reading:
Wikipedia - Reinforcement Learning (https://en.wikipedia.org/wiki/Reinforcement_learning)
Reinforcement Learning 101 - solve the gridworld state-value function (https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Freinforcement-learning-rl-101-with-python-e1aa0d37d43b)

Develop a policy

The diagram below shows the sequence of actions a robot can take in its environment. Note that there are no arrows pointing to the pond, because the robot is considered DOA (dead on arrival) after entering the pond. Likewise, when the robot reaches the goal, there are no arrows to leave the goal because the path planning problem has already been completed—after all, this is an episodic task.

From this set of actions, a policy can be generated by selecting one action for each state. Before we get to the process of selecting the appropriate action for each state, let's see how some of the values above were calculated. After all, -5.9 seems like an odd number!

Calculate expected reward

Recall that the reward for entering an empty cell is -1, the reward for entering a mountainous cell is -3, the reward for entering the pond is -50, and the reward for entering the goal is +100. These are the rewards defined by the environment. However, when our robot tries to move from one cell to another, success is not guaranteed. We therefore have to calculate the expected reward, which takes into account not only the rewards set by the environment but also the robot's transition model.

Let's look at the bottom mountain cell first. From here, it is intuitively clear that moving right is the best action, so let's do the math. If the robot's movement were deterministic, the cost of this move would be trivial (the reward for moving into an open cell is -1). However, since our actions are non-deterministic, we need to evaluate the expected reward of the action. The probability that the robot successfully moves to the open cell is 0.8, the probability that it moves to the cell above is 0.1, and the probability that it bumps into the wall and stays in its current cell is 0.1.

expected reward = 0.8 * (-1) + 0.1 * (-3) + 0.1 * (-3)
expected reward = -1.4

All expected rewards are calculated this way, taking into account the transition model for this particular robot.
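The same calculation can be written as a short helper function. This is a sketch; the 0.8 / 0.1 / 0.1 split and the cell rewards are the ones from the example above.

```python
def expected_reward(outcomes):
    """Expected reward of one action, given (probability, reward_of_resulting_cell) pairs."""
    return sum(p * r for p, r in outcomes)

# Moving right from the bottom mountain cell:
#   0.8 -> the open cell to the right (-1)
#   0.1 -> the cell above (-3)
#   0.1 -> bounce off the wall and stay in the current mountain cell (-3)
print(round(expected_reward([(0.8, -1), (0.1, -3), (0.1, -3)]), 2))  # -1.4
```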

You may have noticed that some expected rewards are missing from the above diagram. Can you calculate their values?

Select a policy

Now that we know our expected rewards, we can choose a policy and evaluate how efficient it is. Again, a policy is just a mapping from states to actions. If we look back at the set of actions in the diagram above and choose only one action for each state (that is, exactly one arrow leaves each cell, except for the hazard and goal states), then we have ourselves a policy.
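As a data structure, such a policy can be as simple as a dictionary with one action per state. The cell coordinates below are hypothetical, just to show the shape of the mapping:

```python
# A policy: exactly one action per (row, col) state. Hazard and goal cells are
# omitted, since no action leaves them. Coordinates are hypothetical.
policy = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 1): "right",
}

def act(state):
    """Follow the policy: look up the action for the current state."""
    return policy[state]
```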

However, we are not looking for just any policy; we want to find the optimal policy. To do so, we need to examine the utility of each state and then determine the best action to take from each state. That is the next concept!

12. Utility of state

Definition

The utility of a state (also known as the state value) represents how attractive the state is relative to the goal. Recall that for each state, the state-value function yields the expected total reward the agent (robot) will collect if it starts in that state and then follows the policy at every time step. In mathematical notation, this can be expressed like this:
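The formula appears as a figure in the original post; a common way to write it, consistent with the simplified recursion Uπ(s) = R(s) + Uπ(s′) used later in this lesson, is the following reconstruction:

```latex
% Utility (state value) of state s under policy \pi: the expected sum of the
% rewards collected when starting in s and following \pi at every time step.
U^{\pi}(s) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} R(s_t) \;\Big|\; \pi,\ s_0 = s \right]
```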

As you can see here, computing the utility of a state is an iterative process. It involves all states that the agent will visit between the current state and the goal, as dictated by the policy.

Also, it should be clear that the utility of a state depends on the policy. If you change the policy, the utility of each state will change, because the sequence of states visited before the goal may change.

Determine the optimal policy

Recall that the optimal policy, denoted π*, tells the robot the best action to take from any state in order to maximize the overall reward. That is to say,
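The formula from the original figure is not shown here; a standard way to state this, using the utility notation from the following paragraphs, is (a reconstruction, not the original image):

```latex
% The optimal policy maximizes the utility (expected total reward) of every state:
\pi^{*} = \arg\max_{\pi} \; U^{\pi}(s)
```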

It may not be clear from the start which action is optimal for each state, especially for states that are far from the goal but have many available paths. It is often helpful to start with a goal and work your way back.

If you look at the two cells adjacent to the goal, their best action is trivial: move into the goal! Recall that in RL the goal state has a utility of 0. This is because, if the agent starts at the goal, the task is already complete and no further reward is received. The expected reward for moving into the goal from a neighboring cell is 79.8, so the utility of that state is 79.8 + 0 = 79.8 (based on Uπ(s) = R(s) + Uπ(s′)).

If we look at the lower mountain cell, it is also easy to guess which action should be taken in this state. With an expected reward of -1.2, moving right is more rewarding than taking any indirect route (up or left). The utility of this state is -1.2 + 79.8 = 78.6.
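The same backward calculation, written out with the numbers quoted above (a small sketch, nothing more):

```python
# Utilities computed backward from the goal using U(s) = expected reward of the
# chosen action + U(next state). The goal itself has utility 0.
U_goal = 0
U_goal_neighbor = 79.8 + U_goal        # 79.8 + 0    = 79.8
U_mountain = -1.2 + U_goal_neighbor    # -1.2 + 79.8 = 78.6
```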

Now it is your turn!

Test

Can you calculate the utility of the state to the right of the middle mountain cell if the action with the highest expected reward is chosen?

Under the optimal policy, what is the utility of the state to the right of the center mountain cell?

The process of selecting the most rewarding action in each state continues until every state is mapped to an action. These mappings are what make up the policy.
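In code, "selecting the most rewarding action in each state" is an argmax over the actions available in that state. The sketch below assumes helper functions actions_in(s) and expected_reward_and_next(s, a), which would come from the environment model described in this lesson; their names are placeholders.

```python
def extract_policy(states, actions_in, expected_reward_and_next, U):
    """For every non-terminal state, pick the action that maximizes
    expected reward + utility of the resulting state."""
    policy = {}
    for s in states:
        best_action, best_value = None, float("-inf")
        for a in actions_in(s):  # empty for terminal states (goal, pond)
            r, s_next = expected_reward_and_next(s, a)
            value = r + U[s_next]
            if value > best_value:
                best_action, best_value = a, value
        if best_action is not None:
            policy[s] = best_action
    return policy
```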

It is highly recommended that you pause the lesson here and work out the optimal policy yourself using the set of actions shown above. Working through this example on your own will give you a better sense of the challenges involved and will help you retain the material more effectively. When you're done, you can compare your result with the image below.

Apply the policy

Once this process is complete, the agent (robot) will be able to make optimal path planning decisions from each state and successfully navigate from any starting position to the goal. The optimal policy for this environment and this robot is provided below.

The figure below shows the set of actions with only the best action kept in each state. Note that in the upper-left cell, the agent can move either down or right, since both choices have the same reward.

13. Value iteration algorithm

Our process of determining the optimal policy for the mountainous environment was fairly straightforward, but it did require some intuition to identify the optimal action for each state. In larger and more complex environments, intuition may not be enough. In such environments, an algorithm should be applied to handle all of the computations and find the optimal solution to the MDP. One such algorithm is the value iteration algorithm. Iteration is a key word here, and you'll soon see why!
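As a preview, here is a minimal, generic sketch of value iteration (an outline under stated assumptions, not the exact formulation the lesson goes on to use): it repeatedly sweeps over all states, recomputing each utility from the current utilities of its possible successors until the values stop changing. The helpers actions_in, transitions, and reward are placeholders for the environment model.

```python
def value_iteration(states, actions_in, transitions, reward, gamma=1.0, tol=1e-6):
    """Generic value iteration sketch.
    transitions(s, a) -> list of (probability, next_state) pairs,
    reward(s)         -> reward for entering state s."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            actions = actions_in(s)
            if not actions:          # terminal states (goal, pond) keep utility 0
                continue
            best = max(
                sum(p * (reward(s_next) + gamma * U[s_next])
                    for p, s_next in transitions(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < tol:              # stop once no utility changes by more than tol
            return U
```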

Origin blog.csdn.net/jeffliu123/article/details/130041291