What Is Reinforcement Learning: the Markov Decision Process (MDP)

In the first half of 2016, the "human versus machine" match between Lee Sedol and AlphaGo set off a wave of enthusiasm for artificial intelligence and sparked heated discussion about it. This article focuses on reinforcement learning, a branch of artificial intelligence in which a computer learns by "trial and error": it interacts with the environment, and the rewards it receives guide its behavior. The goal is for the computer to obtain the greatest cumulative reward.

Taking Go as an example, a reinforcement learning problem usually consists of the following elements (a small code sketch of these elements follows the list):

  • Action space A: the set of all legal actions that can be taken; in Go, all legal moves.
  • State space S: the set of all possible states; in Go, all board configurations.
  • Reward R: a positive reward for winning, a negative reward for losing.
  • State transition probability matrix P: a prediction of the opponent's possible moves, assigning a probability to each resulting position.
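
As a toy illustration of these four elements, the sketch below writes them down for a tiny two-state problem in Python. The state names, actions, rewards, and probabilities here are invented purely for illustration and are not taken from the article.

```python
# A minimal sketch of the four elements of a reinforcement learning problem.
# All states, actions, rewards, and probabilities below are illustrative only.

# State space S: all possible states.
states = ["s1", "s2"]

# Action space A: all legal actions.
actions = ["left", "right"]

# Reward R: reward received for reaching each state.
rewards = {"s1": 0.0, "s2": 1.0}

# Transition probabilities P[(state, action)] -> {next_state: probability}.
transitions = {
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s1", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 0.9, "s2": 0.1},
}

# Every outgoing distribution must sum to 1.
for dist in transitions.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```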

The purpose of reinforcement learning is for the agent, through continuous learning, to find the best sequence of actions for solving a problem, where "best" is measured by the cumulative reward the agent obtains after performing that sequence of actions.

A Markov decision process (MDP) is a formal description of the environment in reinforcement learning, that is, a model of the environment in which the agent lives. In reinforcement learning, almost all problems can be formally represented as Markov decision processes.

The "Frozen Lake" game takes place on a frozen lake represented as a 4×4 grid; the agent must walk from the starting point "Start" to the target point "Goal" without falling into any of the ice holes.

The game has two modes: "windy" and "no wind". The difference is that in "windy" mode the agent's movement is affected by the wind. For example, suppose the agent is currently at S3 and chooses to take a step to the right. In "no wind" mode it reliably reaches state S4, but in "windy" mode its resulting position is uncertain: the wind may blow it to some other state, such as S7.

In the "Frozen Lake" game, the agent needs to go through a sequence of intermediate states from "Start" to the target point "Goal", and also needs to make a series of actions according to the strategy. Usually, the pros and cons of the strategy are judged according to the cumulative reward obtained by the agent after performing a sequence of actions. The greater the cumulative reward, the better the strategy.

There are two ways to calculate the cumulative reward. The first is to sum all reward values from the current state to the terminal state:

$G_t = r_{t+1} + r_{t+2} + \cdots + r_{t+T}$

The formula above applies to reinforcement learning with a finite time horizon (finite-horizon). In the infinite-horizon case, however, the agent may perform a task that lasts indefinitely, such as autonomous driving, and it is clearly unreasonable to use the formula above to compute the cumulative reward, since the sum may not converge.

To keep this value finite, a discount factor γ is usually introduced, giving the second way of calculating the cumulative reward:

$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
In the formula above, 0 ≤ γ ≤ 1. When γ = 0, the agent considers only the reward of the next step; the closer γ is to 1, the more weight future rewards receive. Note that sometimes we care more about the immediate reward and sometimes more about future rewards; the way to adjust this is to change the value of γ.
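
To make the effect of γ concrete, here is a small Python sketch that computes the discounted cumulative reward; the reward sequence is invented for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Illustrative reward sequence r_{t+1}, r_{t+2}, ...
rewards = [1.0, 2.0, 4.0, 8.0]

print(discounted_return(rewards, 0.0))   # 1.0  -> only the next reward counts
print(discounted_return(rewards, 0.9))   # future rewards weighted more heavily
print(discounted_return(rewards, 1.0))   # 15.0 -> plain (undiscounted) sum
```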

Simplify "Frozen Lake" (no wind mode) first, and ignore the start and end points, as shown in Figure 1.

Simplified "Frozen Lake" game


The state transition diagram in Figure 1 shows the probability of transitioning from each state to the next and the corresponding reward. For example, from state S1 the agent may move to S2, move to S3, or stay in S1, with probabilities 0.3, 0.5 and 0.2 and corresponding rewards 2, 2 and 1, respectively. For each state, the probabilities on its outgoing edges must sum to 1.
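
The outgoing edges of S1 described above can be written down directly as data. The Python sketch below records only those probabilities and rewards (the other states are omitted) and checks that the probabilities sum to 1.

```python
# Outgoing transitions of state S1 from Figure 1:
# next state -> (transition probability, reward)
s1_transitions = {
    "S2": (0.3, 2),
    "S3": (0.5, 2),
    "S1": (0.2, 1),  # stay in place
}

# Probabilities on the outgoing edges of a state must sum to 1.
total = sum(p for p, _ in s1_transitions.values())
assert abs(total - 1.0) < 1e-9

# Expected one-step reward when leaving S1.
expected_reward = sum(p * r for p, r in s1_transitions.values())
print(expected_reward)  # 0.3*2 + 0.5*2 + 0.2*1 = 1.8
```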


1. Markov Process
Consider a random process s0, s1, ..., sn, and suppose the state si at time ti is known. If the state si+1 at time ti+1 depends only on the state si, and is independent of the states before time ti, the process is called a Markov process.

For example, in Figure 1, after the agent moves from state S1 to state S3, the next state no longer depends on S1; it depends only on the current state S3. This property is known as the Markov property (also called the "no aftereffect" property) of a stochastic process. A random process s0, s1, ..., sn with the Markov property is called a Markov chain.
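
The Markov property shows up directly in simulation: the sampled next state depends only on the current state, never on earlier history. In the minimal sketch below, S1's distribution follows Figure 1, while the distributions for the other states are assumptions made up purely so that the chain can be run.

```python
import random

# Transition probabilities: current state -> {next state: probability}.
# S1 follows Figure 1; the other rows are assumed for illustration.
chain = {
    "S1": {"S2": 0.3, "S3": 0.5, "S1": 0.2},
    "S2": {"S4": 0.6, "S2": 0.4},
    "S3": {"S4": 1.0},
    "S4": {"S4": 1.0},
}

def step(state):
    """Sample the next state using only the current state (Markov property)."""
    next_states = list(chain[state].keys())
    probs = list(chain[state].values())
    return random.choices(next_states, weights=probs)[0]

state = "S1"
trajectory = [state]
for _ in range(5):
    state = step(state)
    trajectory.append(state)
print(trajectory)
```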

2. Markov Reward Process
In the simplest case, after each action is performed, the next state reached is deterministic, so it is only necessary to add up the rewards the agent obtains at each step.

However, in many cases the next state is uncertain. For example, in the "windy" mode of the "Frozen Lake" game, the agent transitions to another state with a certain probability after performing an action, so the reward obtained also depends on that probability. Therefore, when calculating the cumulative reward, we usually calculate its expectation, denoted V. The expected return of state s is then: $V(s) = \mathbb{E}[G_t \mid S_t = s]$

So $G_t = r_{t+1} + r_{t+2} + \cdots + r_{t+T}$ can be substituted in, giving:

$V(s) = \mathbb{E}[r_{t+1} + r_{t+2} + \cdots + r_{t+T} \mid S_t = s]$
The second, discounted form of the cumulative reward is accordingly expressed as:

$V(s) = \mathbb{E}[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid S_t = s]$
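
This expectation can be estimated by Monte Carlo sampling. The Python sketch below averages discounted returns over many sampled episodes of a small Markov reward process; S1's outgoing edges follow Figure 1, while treating S2 and S3 as terminal states is an assumption made only to keep the example short.

```python
import random

# Markov reward process: state -> list of (next_state, probability, reward).
# S1's outgoing edges follow Figure 1; S2 and S3 are assumed terminal here.
mrp = {
    "S1": [("S2", 0.3, 2), ("S3", 0.5, 2), ("S1", 0.2, 1)],
    "S2": [],  # assumed terminal
    "S3": [],  # assumed terminal
}

def sample_return(state, gamma=0.9, max_steps=100):
    """Sample one episode and return its discounted cumulative reward G_t."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        edges = mrp[state]
        if not edges:          # terminal state
            break
        probs = [e[1] for e in edges]
        idx = random.choices(range(len(edges)), weights=probs)[0]
        next_state, _, reward = edges[idx]
        g += discount * reward
        discount *= gamma
        state = next_state
    return g

# V(S1) = E[G_t | S_t = S1], estimated by averaging many sampled returns.
estimate = sum(sample_return("S1") for _ in range(10000)) / 10000
print(estimate)
```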
3. Markov Decision Process
The simplified game so far has considered only the "no wind" mode of the "Frozen Lake" game, because in "no wind" mode the next state reached after the agent performs an action is deterministic, so only state transitions need to be considered, without modeling specific actions. In "windy" mode, however, the state transition probabilities vary according to the action performed.

Still using the simplified "Frozen Lake" game, if the current state is S1, then in "windy" mode the state transition probabilities depend on the action performed, as shown in Table 1.


What is a Markov decision process? We define a Markov decision process as a five-tuple $M = (S, A, R, P, \gamma)$, whose elements are listed below (a minimal code representation follows the list):

  • S: the state space; in the "Frozen Lake" game there are 16 states in total (Start, S2, ..., S15, Goal);
  • A: the action space; in the "Frozen Lake" game there are four actions available in each state (up, down, left and right);
  • R: the reward function; if an action is performed in state St and the agent transitions to the next state St+1, a corresponding reward rt+1 is obtained;
  • P: the state transition rule, which can be understood as the state transition probability matrix introduced earlier; when an action is performed in state St, the agent transitions to the next state St+1 with a certain probability;
  • γ: the discount factor, with 0 ≤ γ ≤ 1, used to discount future rewards when computing the cumulative reward.
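
Put together, the five-tuple can be represented directly in code. The Python sketch below only fixes the structure; the concrete entries shown for R and P are placeholders, not the real "Frozen Lake" values.

```python
from collections import namedtuple

# M = (S, A, R, P, gamma)
MDP = namedtuple("MDP", ["S", "A", "R", "P", "gamma"])

S = ["Start"] + [f"S{i}" for i in range(2, 16)] + ["Goal"]   # 16 states
A = ["up", "down", "left", "right"]                          # 4 actions

# R[(s, a, s_next)] -> reward r_{t+1}; placeholder entry for illustration.
R = {("Start", "right", "S2"): 0.0}

# P[(s, a)] -> {s_next: probability}; placeholder entry for illustration.
P = {("Start", "right"): {"S2": 1.0}}

frozen_lake = MDP(S=S, A=A, R=R, P=P, gamma=0.9)
print(len(frozen_lake.S), len(frozen_lake.A), frozen_lake.gamma)
```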


To sum up, the problem reinforcement learning has to solve is the following: the agent needs to learn a policy π, which defines a mapping from states to actions, $\pi: S \to A$; that is, in any state $s_t$ the agent executes the action $a_t = \pi(s_t)$, and likewise $a_{t+1} = \pi(s_{t+1})$, $a_{t+2} = \pi(s_{t+2})$, and so on.

The value function $V_\pi$ is used to measure the quality of a policy π: $V_\pi(s_t)$ denotes the expected cumulative reward the agent obtains by starting from state $s_t$ and performing a series of actions while following policy π. (In fact, once the policy π is fixed, the state transition probabilities of the MDP are also fixed; the process can then simply be regarded as a Markov reward process, and the value can be solved with the methods used for Markov reward processes.)

$V_\pi(s_t) = \mathbb{E}_\pi\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid S_t = s_t\right]$
The value here is the value if policy π is followed.
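
One common way to compute $V_\pi$ for a fixed policy π is iterative policy evaluation, which repeatedly applies the expectation above as an update rule. The sketch below uses a tiny made-up MDP (the states, rewards and transition probabilities are assumptions, not the "Frozen Lake" values) just to show the shape of the computation.

```python
# Iterative policy evaluation on a tiny, made-up MDP.
# P[s][a] -> list of (probability, next_state, reward); all numbers are illustrative.
P = {
    "s1": {"right": [(0.8, "s2", 1.0), (0.2, "s1", 0.0)]},
    "s2": {"right": [(1.0, "s2", 0.0)]},   # absorbing state
}
policy = {"s1": "right", "s2": "right"}    # the fixed policy pi: S -> A
gamma = 0.9

V = {s: 0.0 for s in P}                    # initialise V_pi(s) = 0
for _ in range(1000):                      # repeat until (approximately) converged
    delta = 0.0
    for s, a in policy.items():
        # V_pi(s) = sum over s' of P(s'|s,a) * [ r + gamma * V_pi(s') ]
        new_v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-8:
        break

print(V)  # expected discounted return of each state under policy pi
```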
