Reinforcement Learning: The Bellman Optimality Equation

A Policy Improvement Example

  The purpose of reinforcement learning is to find an optimal policy. This involves two core concepts, the optimal state value and the optimal policy, and one tool: the Bellman optimality equation.
  First, we use a familiar example to show how the Bellman equation can be used to improve a policy.
[Figure: a grid-world example with the given policy]
According to the given policy, we can easily write down the Bellman equations:
v_π(s_1) = -1 + γ v_π(s_2)
v_π(s_2) = +1 + γ v_π(s_4)
v_π(s_3) = +1 + γ v_π(s_4)
v_π(s_4) = +1 + γ v_π(s_4)
When γ = 0.9, solving these equations gives
v_π(s_4) = v_π(s_3) = v_π(s_2) = 10,  v_π(s_1) = 8
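These four equations form a small linear system v_π = r_π + γ P_π v_π, so the values can be checked directly; a minimal numpy sketch, with the rewards and transitions laid out exactly as in the equations above:

```python
import numpy as np

gamma = 0.9
# Immediate rewards under the given policy, for states s1..s4
r = np.array([-1.0, 1.0, 1.0, 1.0])
# Transitions under the given policy: s1->s2, s2->s4, s3->s4, s4->s4
P = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

# v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # [ 8. 10. 10. 10.]
```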
  Once the state values are known, we can compute the action values. Taking state s_1 as an example, the results are as follows:
q_π(s_1, a_1) = -1 + γ v_π(s_1) = 6.2
q_π(s_1, a_2) = -1 + γ v_π(s_2) = 8
q_π(s_1, a_3) = 0 + γ v_π(s_3) = 9
q_π(s_1, a_4) = -1 + γ v_π(s_1) = 6.2
q_π(s_1, a_5) = 0 + γ v_π(s_1) = 7.2
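Continuing the sketch above, the same action values can be computed from the state values; the rewards and successor states below are read off the q_π(s_1, a_i) formulas, with v[0]..v[3] standing for s_1..s_4:

```python
# (immediate reward, index of successor state) for each action at s1
actions_at_s1 = {
    "a1": (-1.0, 0),
    "a2": (-1.0, 1),
    "a3": ( 0.0, 2),
    "a4": (-1.0, 0),
    "a5": ( 0.0, 0),
}
q_s1 = {a: r_sa + gamma * v[s_next] for a, (r_sa, s_next) in actions_at_s1.items()}
print(q_s1)  # a1 ≈ 6.2, a2 = 8.0, a3 = 9.0, a4 ≈ 6.2, a5 ≈ 7.2
```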
  From the above calculations (and from intuition) we can see that this policy is not good. So how should a policy be improved when it is not good? The answer depends on the action values. The current policy can be written mathematically as follows:
π(a|s_1) = 1 if a = a_2,  and π(a|s_1) = 0 if a ≠ a_2

  From the action values computed above for state s_1, we can see that q_π(s_1, a_3) is the largest, so we can choose a_3 under the new policy. The new policy is expressed as follows:
π_new(a|s_1) = 1 if a = a*,  and π_new(a|s_1) = 0 if a ≠ a*
The probability of a = a* is 1, meaning the new policy will definitely choose a*; in this example a* = arg max_a q_π(s_1, a) = a_3.
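In code, this improvement step is just an argmax over the action values computed above:

```python
# Greedy policy improvement at s1: put probability 1 on the action with the largest q
a_star = max(q_s1, key=q_s1.get)                          # 'a3'
pi_new_s1 = {a: (1.0 if a == a_star else 0.0) for a in q_s1}
```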

Bellman Optimality Equation: Definition

  With the calculations from the previous example in mind, we can now give the definition of an optimal policy:
If v_{π*}(s) ≥ v_π(s) for every policy π and for all s ∈ S, then π* is an optimal policy.

This definition of the optimal policy raises several questions:
	Does an optimal policy exist?
	Is the optimal policy unique?
	Is the optimal policy stochastic or deterministic?
	How can the optimal policy be obtained?

To answer these questions, we need to study the Bellman optimality equation.
[Figure: the Bellman equation and the Bellman optimality equation, side by side]
  The second formula in the figure above is the Bellman optimality equation. Comparing it with the Bellman equation, we can see that the Bellman optimality equation is simply the Bellman equation under the condition that the policy is optimal, i.e., with a maximization over π. Its elementwise and matrix-vector forms are shown in the figure below:
[Figure: elementwise and matrix-vector forms of the Bellman optimality equation]
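For reference, the Bellman optimality equation in its usual elementwise and matrix-vector forms (written here in standard notation) is:

v(s) = max_π Σ_a π(a|s) ( Σ_r p(r|s,a) r + γ Σ_{s'} p(s'|s,a) v(s') ) = max_π Σ_a π(a|s) q(s,a),  for all s ∈ S

v = max_π ( r_π + γ P_π v )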
  Notice that there are two unknowns, v and π, in the Bellman optimality equation. How can one solve an equation with two unknowns? The following examples show that it can be done by first maximizing the right-hand side over π.
[Figure: two examples of maximizing the right-hand side over one variable while the other is held fixed]
  From the above two examples we know that if q(s,a) is known, the maximum of the right-hand side over π equals max_a q(s,a), attained when the policy always chooses a* = arg max_a q(s,a). The mathematical expression and conditions are as follows:
[Figure: maximizing Σ_a π(a|s) q(s,a) over π yields max_a q(s,a)]
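Spelled out, the maximization works as follows: Σ_a π(a|s) q(s,a) is a weighted average of the action values, with weights π(a|s) ≥ 0 summing to 1, so it can never exceed the largest action value. Therefore

max_π Σ_a π(a|s) q(s,a) = max_a q(s,a),

and the maximum is attained by the deterministic policy with π(a|s) = 1 for a = a* = arg max_a q(s,a) and π(a|s) = 0 otherwise.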

Bellman Optimality Equation: Solving

  To solve the Bellman optimality equation, we introduce a function f(v), whose form is as follows:
[Figure: definition of f(v)]
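In the matrix-vector notation used above, this function is

f(v) = max_π ( r_π + γ P_π v ),

so the Bellman optimality equation can be written compactly as v = f(v).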
  Before solving this equation, we introduce two concepts: fixed point and contraction mapping.

Fixed point: for a function f(x), if there exists x such that f(x) = x, then x is called a fixed point of f.
Contraction mapping: let (X, d_X) and (Y, d_Y) be metric spaces and let f: X → Y be a mapping. If there exists a constant k ∈ [0, 1) such that d_Y(f(x_1), f(x_2)) ≤ k · d_X(x_1, x_2) for all x_1, x_2 ∈ X, then f is called a contraction mapping and k is called the contraction coefficient.
  With these two concepts in hand, we can state the Banach fixed-point theorem (also called the contraction mapping theorem), an important theorem in mathematical analysis: every contraction mapping on a complete metric space has exactly one fixed point.

According to the Banach fixed-point theorem, for any equation of the form x = f(x) in which f is a contraction mapping, the following properties hold:
  1. There exists a fixed point x* such that f(x*) = x*.
  2. The fixed point x* is unique.
  3. The fixed point can be obtained by iteration: x_{k+1} = f(x_k), with x_k → x* as k → ∞ (as in the small example below).
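A quick numerical illustration (a toy example of my own, not from the original post): f(x) = 0.5x + 1 is a contraction on the real line with coefficient 0.5, and iterating it from any starting point converges to its unique fixed point x* = 2.

```python
def f(x):
    # A contraction mapping on R with contraction coefficient 0.5
    return 0.5 * x + 1.0

x = 100.0            # arbitrary starting point
for k in range(50):  # iterate x_{k+1} = f(x_k)
    x = f(x)
print(x)  # ~2.0, the unique fixed point, since f(2) = 2
```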

  Now we can use the Banach fixed-point theorem to solve the Bellman optimality equation. Before applying the theorem, one has to prove that f(v) is a contraction mapping; here we only use this fact, and readers interested in the proof can look it up themselves. By the Banach fixed-point theorem, there must exist a unique v*, and it can be obtained by iteration:
[Figure: the iteration v_{k+1} = f(v_k) = max_π ( r_π + γ P_π v_k )]
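Applied to v = f(v), this iteration is the value iteration algorithm. A minimal tabular sketch follows; the array names, shapes, and stopping rule are my own conventions, not from the original post:

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, num_iters=1000):
    """Iterate v_{k+1}(s) = max_a [ R[s, a] + gamma * sum_s' P[s, a, s'] * v_k(s') ].

    R: (num_states, num_actions) immediate rewards
    P: (num_states, num_actions, num_states) transition probabilities
    """
    num_states, num_actions = R.shape
    v = np.zeros(num_states)
    for _ in range(num_iters):
        # q[s, a] = R[s, a] + gamma * E[v(s') | s, a]
        q = R + gamma * P @ v          # P @ v has shape (num_states, num_actions)
        v_new = q.max(axis=1)          # greedy maximization over actions
        if np.max(np.abs(v_new - v)) < 1e-10:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)          # deterministic greedy (optimal) policy
    return v, policy
```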

Bellman Optimality Equation: A Worked Example

  We use a simple example to deepen our understanding of the process of solving the Bellman optimality equation. As shown in the figure below, the robot has three actions: move left a_l, move right a_r, and stay in place a_0. The rewards are set to r_target = +1 for reaching the target cell and r_boundary = -1 for bumping into the boundary.
[Figure: a one-row grid world with a target cell and boundaries]
  According to the rules above, we can write down the robot's action values q(s,a):
[Table: q(s,a) for each state and action]
The question now is how to find the optimal state values v*(s_i) and the optimal policy π*.
Using the method introduced above with γ = 0.9: at k = 0 we first choose an arbitrary initial value v_0, and then iterate:
[Figure: the initial values v_0 and the subsequent iterates, together with the corresponding greedy policies]
It can be seen that after a few iterations we have found the optimal policy.
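The intermediate iterates are only shown in the figures, but the same kind of example can be run end to end with the value_iteration sketch above. The layout below is a hypothetical 1×3 row of cells with the target in the middle cell (my own assumption, since the original grid appears only in the figure); entering the target gives r_target = +1, bumping into a boundary gives r_boundary = -1, and every other move gives 0.

```python
import numpy as np

# States 0, 1, 2 laid out left to right; state 1 is the (assumed) target cell.
# Actions: 0 = move left, 1 = stay, 2 = move right.
gamma, r_target, r_boundary = 0.9, 1.0, -1.0
num_states, num_actions = 3, 3

R = np.zeros((num_states, num_actions))
P = np.zeros((num_states, num_actions, num_states))
for s in range(num_states):
    for a, move in enumerate([-1, 0, +1]):
        s_next = s + move
        if s_next < 0 or s_next >= num_states:   # would leave the grid
            s_next = s
            R[s, a] = r_boundary
        elif s_next == 1:                        # enters (or stays in) the target
            R[s, a] = r_target
        P[s, a, s_next] = 1.0

v_star, policy = value_iteration(R, P, gamma)
print(v_star)   # optimal state values
print(policy)   # [2 1 0]: right in the left cell, stay at the target, left in the right cell
```

Under these assumptions the iteration converges to the deterministic optimal policy shown in the comment, with all optimal state values equal to 10, since the agent can keep collecting +1 at the target.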

Properties of the Optimal Policy

  What factors affect the optimal policy? From the following formula we can see that the optimal policy is determined by the rewards r, the discount rate γ, and the system model (the transition probabilities); for example, changing the discount rate γ alone can change which actions the optimal policy prefers.
[Figure: the Bellman optimality equation, annotated with the quantities r, γ, and the system model that determine the optimal policy]


Source: blog.csdn.net/qq_50086023/article/details/130749955