Reinforcement Learning: Basic Concepts

[Figure: the grid-world example, with white, yellow, and blue cells]
  This section introduces the basic concepts of reinforcement learning using grid-world as a running example. As shown in the figure, a robot lives in a grid world in which different cells serve different purposes: white cells can be entered freely, yellow cells are traps (once the robot enters one, it is forced back to the starting point), and the blue cell is the goal. The task is for the robot to learn, by itself, the shortest path from the start to the goal.

State

State: a possible situation $s_i$ of the system at each stage. The grid world in the figure has 9 states: $s_1, s_2, \cdots, s_9$.

State set: the set of all possible states of the system, $S = \{s_i\}$.

Action

Action: a possible action $a_i$ in each state. In grid-world the robot has 5 actions:
  $a_1$: move up;
  $a_2$: move right;
  $a_3$: move down;
  $a_4$: move left;
  $a_5$: stay in place.

[Figure: the five actions available to the robot]

Action set: the set of all actions available in state $s_i$, $A(s_i) = \{a_i\}$.

State transition

State transition: taking an action moves the system from one state to another, e.g. $s_1 \xrightarrow{\,a_2\,} s_2$.

State transition probability: state transitions are described with probabilities.
   $p(s_j \mid s_i, a_n)$: the probability of reaching state $s_j$ by taking action $a_n$ in state $s_i$.
   For example: $p(s_2 \mid s_1, a_2) = 1$.
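To make this concrete, here is a minimal Python sketch of deterministic transitions. The layout details are assumptions for illustration (states numbered row by row from $s_1$ at the top-left to $s_9$ at the bottom-right, and bumping into the boundary leaving the state unchanged), not something fixed by the figure.

```python
# A minimal sketch of deterministic state transitions in a 3x3 grid-world.
# Assumptions: states s1..s9 are numbered row by row (s1 top-left, s9 bottom-right);
# actions are a1=up, a2=right, a3=down, a4=left, a5=stay, as listed above.
N = 3  # the grid is N x N

def next_state(s, a):
    """State reached by taking action a in state s (1-based indices)."""
    row, col = divmod(s - 1, N)
    if a == 1:
        row -= 1            # up
    elif a == 2:
        col += 1            # right
    elif a == 3:
        row += 1            # down
    elif a == 4:
        col -= 1            # left
    # a == 5: stay in place
    if not (0 <= row < N and 0 <= col < N):
        return s            # trying to leave the grid keeps the robot where it is
    return row * N + col + 1

def transition_prob(s_next, s, a):
    """Deterministic p(s' | s, a): 1 if s' is where action a leads from s, else 0."""
    return 1.0 if s_next == next_state(s, a) else 0.0

print(transition_prob(2, 1, 2))  # p(s2 | s1, a2) = 1.0, matching the example above
```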

Policy

Policy: with the final goal in mind, a policy tells the robot which action to take in its current state; it assigns an action (or a distribution over actions) to every state.
   For example, given the policy indicated by the green arrows in the figures below, starting from different cells the robot follows the corresponding paths to the goal.

[Figure: a policy shown as green arrows, and the paths it generates from different starting points]

In concrete problems, a policy is usually represented mathematically as a conditional probability. In reinforcement learning the policy is conventionally denoted $\pi$, and $\pi(a_n \mid s_i)$ is the probability of taking action $a_n$ in state $s_i$.
   Taking state $s_1$ as an example, the policy given by the arrows above can be written as a deterministic probability distribution:
$\pi(a_1 \mid s_1) = 0,\quad \pi(a_2 \mid s_1) = 1,\quad \pi(a_3 \mid s_1) = 0,\quad \pi(a_4 \mid s_1) = 0,\quad \pi(a_5 \mid s_1) = 0$
   Again taking state $s_1$ as an example, a stochastic (non-deterministic) policy could be:
[Figure: a stochastic policy at $s_1$]
$\pi(a_1 \mid s_1) = 0,\quad \pi(a_2 \mid s_1) = 0.5,\quad \pi(a_3 \mid s_1) = 0.5,\quad \pi(a_4 \mid s_1) = 0,\quad \pi(a_5 \mid s_1) = 0$

In programming, policies are usually expressed in the form of an array (matrix):
[Figure: the policy stored as a state-by-action probability matrix]
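As a concrete illustration of the matrix form, here is a small sketch using NumPy (the use of NumPy and the default "stay" policy for the remaining states are assumptions for illustration, not prescribed by the text):

```python
import numpy as np

# pi stored as a 9 x 5 matrix: entry [i, j] is pi(a_{j+1} | s_{i+1}).
policy = np.zeros((9, 5))
policy[:, 4] = 1.0                      # illustrative default: every state takes a5 (stay)

# Deterministic policy at s1 from the example above: always take a2 (move right).
policy[0] = [0.0, 1.0, 0.0, 0.0, 0.0]

# Stochastic policy at s1: move right (a2) or down (a3) with probability 0.5 each.
stochastic = policy.copy()
stochastic[0] = [0.0, 0.5, 0.5, 0.0, 0.0]

# Each row is a probability distribution over actions, so it must sum to 1.
assert np.allclose(policy.sum(axis=1), 1.0)
assert np.allclose(stochastic.sum(axis=1), 1.0)

# Sampling an action for state s1 under the stochastic policy:
a = np.random.choice(5, p=stochastic[0]) + 1   # +1 to report the 1-based action name
print(f"sampled a{a} in s1")
```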

Reward

   In reinforcement learning, the reward is a real number (a scalar) obtained after taking an action. Rewards can be positive or negative: a positive reward means we encourage the behavior, while a negative reward means we discourage it and is essentially a punishment for taking that action.
   Rewards can be understood as a means of interaction between humans and the robot: through rewards we guide the robot to act according to our expectations and achieve our goals. A reward depends on the current state and the action taken in it, and is given on the basis of the action taken.

   In grid-world, the reward rules are as follows:
[Figure: the reward settings in grid-world]

	1. If the robot tries to step outside the boundary, the reward is -1.
	2. If the robot tries to enter a forbidden cell, the reward is -1.
	3. If the robot reaches the goal cell, the reward is +1.
	4. In all other cases, the reward is 0.

   In reinforcement learning, a conditional probability is used to describe the reward obtained by taking an action.
$p(r \mid s_i, a_n)$: the probability of obtaining reward $r$ when action $a_n$ is taken in state $s_i$.
   Taking state $s_1$ as an example (because we designed the reward rules ourselves, the rewards here are deterministic; in real problems they need not be): $p(r = -1 \mid s_1, a_1) = 1$.
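A hedged sketch of these reward rules in Python follows. The positions of the forbidden and goal cells (FORBIDDEN and GOAL below) are hypothetical placements, since the actual layout comes from the figure, and the state numbering follows the same row-by-row assumption as the earlier transition sketch.

```python
# A sketch of the four reward rules above for a 3x3 grid-world.
# FORBIDDEN and GOAL are hypothetical cell placements, not taken from the figure.
N = 3
FORBIDDEN = {6, 7}    # hypothetical forbidden (yellow) cells
GOAL = 9              # hypothetical goal (blue) cell

def reward(s, a):
    """Deterministic reward r(s, a), so p(r | s, a) = 1 for the value returned here."""
    row, col = divmod(s - 1, N)
    if a == 1:
        row -= 1          # up
    elif a == 2:
        col += 1          # right
    elif a == 3:
        row += 1          # down
    elif a == 4:
        col -= 1          # left
    if not (0 <= row < N and 0 <= col < N):
        return -1         # rule 1: tried to step outside the boundary
    s_next = row * N + col + 1
    if s_next in FORBIDDEN:
        return -1         # rule 2: tried to enter a forbidden cell
    if s_next == GOAL:
        return +1         # rule 3: reached the goal cell
    return 0              # rule 4: everything else

print(reward(1, 1))  # from s1 (top-left), moving up leaves the grid, so the reward is -1
```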

Trajectory and return

   In reinforcement learning, a trajectory is a state-action-reward chain that records everything from the starting point to the end point: the states visited, the actions taken, and the rewards received. The return is an important quantity defined for a single trajectory: it adds up all the rewards obtained along that trajectory.
[Figure: two policies and the trajectories they generate, together with their returns]

   As the figures above show, different policies produce different trajectories. How, then, do we judge which policy is better? In reinforcement learning, the return is usually used to evaluate how good a policy is.

Introducing the discount factor γ
[Figure: a policy under which the robot keeps re-entering the goal cell]
   As shown in the figure above, if the policy keeps running after the robot reaches the goal, the robot re-enters the goal cell over and over and the return diverges. To solve this problem, we introduce a discount factor $\gamma \in [0, 1)$. With the discount factor we obtain the discounted return:
[Figure: the discounted return, $r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$]
The discount factor $\gamma$ plays two roles:
   1. It guarantees that the return is finite.
   2. It balances long-term and short-term rewards:
    If $\gamma$ is small, the robot pays more attention to rewards obtained in the near future, i.e. the return is dominated by the earliest rewards, which makes the robot short-sighted.
    If $\gamma$ is large, the robot pays more attention to rewards obtained in the far future, which makes the robot far-sighted.
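A small sketch of how the discount factor tames an otherwise divergent reward stream; the reward sequence below is hypothetical and simply mimics "reaching the goal and then staying there forever".

```python
# Discounted return: G = r1 + gamma*r2 + gamma^2*r3 + ...
def discounted_return(rewards, gamma):
    """Sum the rewards of one trajectory, discounting each step by gamma."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Hypothetical trajectory that reaches the goal after 4 steps and then keeps
# collecting +1 forever; truncating after many steps approximates the infinite sum.
rewards = [0, 0, 0, 1] + [1] * 200
print(discounted_return(rewards, gamma=0.9))   # about 7.29: finite, even though the rewards never stop
```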

Episode

   When the robot interacts with the environment according to a policy and stops when it reaches the goal, the resulting finite trajectory is called an episode.
   In grid-world, must the robot stop once it reaches the goal? In fact, the stopping and non-stopping cases can be treated within a unified mathematical framework.
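Below is a self-contained sketch of generating one episode under a uniformly random policy. It reuses the same row-by-row state numbering and hypothetical goal position as the earlier sketches and is only meant to illustrate an episode terminating at the goal, not a method from the text.

```python
import random

# Generate one episode in a 3x3 grid-world under a uniformly random policy.
# Layout assumptions (row-major states, goal at s9) are illustrative only.
N, GOAL = 3, 9

def step(s, a):
    """Apply action a in state s; return (next_state, reward)."""
    row, col = divmod(s - 1, N)
    dr, dc = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1), 5: (0, 0)}[a]
    row, col = row + dr, col + dc
    if not (0 <= row < N and 0 <= col < N):
        return s, -1                       # bounced off the boundary
    s_next = row * N + col + 1
    return s_next, (1 if s_next == GOAL else 0)

s, episode = 1, []
while s != GOAL:                           # the episode ends when the goal is reached
    a = random.choice([1, 2, 3, 4, 5])     # uniformly random action
    s_next, r = step(s, a)
    episode.append((s, a, r))
    s = s_next

print(f"episode of {len(episode)} steps, return = {sum(r for _, _, r in episode)}")
```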

Markov decision process (MDP)

   A Markov decision process consists of three elements:
1. Sets:
   State set: $S$;  action set: $A(s)$;  reward set: $R(s, a)$.

2. Probabilities:
   State transition probability: $p(s' \mid s, a)$;  reward probability: $p(r \mid s, a)$.

3. Policy:
   $\pi(a \mid s)$.
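The three kinds of elements can be grouped into a single object in code. This is only a sketch: the field names mirror the list above, and the tiny "stand-in" functions are placeholders rather than a real grid-world model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[int]                                   # state set S
    actions: Callable[[int], List[int]]                 # action set A(s)
    transition_prob: Callable[[int, int, int], float]   # p(s' | s, a)
    reward_prob: Callable[[float, int, int], float]     # p(r | s, a)
    policy: Dict[int, Dict[int, float]]                 # pi(a | s)

# Toy instance: 9 states, 5 actions everywhere, and a policy that always stays put (a5).
grid_mdp = MDP(
    states=list(range(1, 10)),
    actions=lambda s: [1, 2, 3, 4, 5],
    transition_prob=lambda s_next, s, a: 1.0 if (a == 5 and s_next == s) else 0.0,  # stand-in
    reward_prob=lambda r, s, a: 1.0 if r == 0 else 0.0,                             # stand-in
    policy={s: {5: 1.0} for s in range(1, 10)},
)
print(grid_mdp.states, grid_mdp.policy[1])
```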

  Markov property: the "no after-effect" property, meaning that the current decision depends only on the current state and the goal, and has nothing to do with the past.
[Figure: grid-world abstracted as a Markov process, with circles for states and arrows for state transitions]
The grid world can be abstracted into a more general model, the Markov process: circles represent states and arrowed links represent state transitions. Once a policy is given, a Markov decision process reduces to a Markov chain.

