Policy Improvement Example
The goal of reinforcement learning is to find the optimal policy. This involves two core concepts, the optimal state value and the optimal policy, and one tool: the Bellman optimality equation.
First, we work through a familiar example of how the Bellman equation can be used to improve a policy.
According to the given policy, we can easily write down the Bellman equation:
$$
\begin{aligned}
v_\pi(s_1) &= -1 + \gamma v_\pi(s_2)\\
v_\pi(s_2) &= +1 + \gamma v_\pi(s_4)\\
v_\pi(s_3) &= +1 + \gamma v_\pi(s_4)\\
v_\pi(s_4) &= +1 + \gamma v_\pi(s_4)
\end{aligned}
$$
When $\gamma = 0.9$, solving gives
$$v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = 10, \quad v_\pi(s_1) = 8$$
Once the state values are known, the action values follow directly. Taking state $s_1$ as an example:
$$
\begin{aligned}
q_\pi(s_1, a_1) &= -1 + \gamma v_\pi(s_1) = 6.2\\
q_\pi(s_1, a_2) &= -1 + \gamma v_\pi(s_2) = 8\\
q_\pi(s_1, a_3) &= 0 + \gamma v_\pi(s_3) = 9\\
q_\pi(s_1, a_4) &= -1 + \gamma v_\pi(s_1) = 6.2\\
q_\pi(s_1, a_5) &= 0 + \gamma v_\pi(s_1) = 7.2
\end{aligned}
$$
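These numbers can be checked with a short script. The transitions below are read off from the equations above (under the given policy, $s_1 \to s_2$ with reward $-1$, and so on); the specific next states and rewards for each action at $s_1$ are taken from the five $q_\pi$ expressions, while the 0-based state indexing is just a convention of this sketch.

```python
import numpy as np

gamma = 0.9

# Under the given policy: s1->s2 (r=-1), s2->s4 (r=+1), s3->s4 (r=+1), s4->s4 (r=+1).
# Solve v = r + gamma * P v  <=>  (I - gamma * P) v = r
P = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)
r = np.array([-1.0, 1.0, 1.0, 1.0])
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # ≈ [8. 10. 10. 10.]

# Action values at s1: q(s1, a) = r(s1, a) + gamma * v(next state),
# with (next state index, immediate reward) read off from the text.
transitions = {"a1": (0, -1), "a2": (1, -1), "a3": (2, 0), "a4": (0, -1), "a5": (0, 0)}
q = {a: rew + gamma * v[s] for a, (s, rew) in transitions.items()}
print(q)  # ≈ {'a1': 6.2, 'a2': 8.0, 'a3': 9.0, 'a4': 6.2, 'a5': 7.2}
```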
These calculations confirm the intuition that this policy is not good. How, then, should we improve a poor policy? The answer depends on the action values. The current policy can be written mathematically as:
$$
\pi(a \mid s_1) =
\begin{cases}
1, & a = a_2\\
0, & a \neq a_2
\end{cases}
$$
From the action values computed above for state $s_1$, we see that $q_\pi(s_1, a_3)$ is the largest, so we can choose $a_3$ as the new action at $s_1$. The new policy is:
$$
\pi_{\text{new}}(a \mid s_1) =
\begin{cases}
1, & a = a^*\\
0, & a \neq a^*
\end{cases}
$$
The probability of $a = a^*$ is 1, meaning the new policy deterministically selects $a^*$; in this example, $a^* = \arg\max_a q_\pi(s_1, a) = a_3$.
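The greedy improvement step at $s_1$ is just an argmax over the action values computed in the text:

```python
# Action values q_pi(s1, a) computed in the text
q = {"a1": 6.2, "a2": 8.0, "a3": 9.0, "a4": 6.2, "a5": 7.2}

# Greedy improvement: put all probability on a* = argmax_a q(s1, a)
a_star = max(q, key=q.get)
pi_new = {a: (1.0 if a == a_star else 0.0) for a in q}
print(a_star)        # a3
print(pi_new)        # {'a1': 0.0, 'a2': 0.0, 'a3': 1.0, 'a4': 0.0, 'a5': 0.0}
```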
Bellman Optimality Equation: Definition
With the calculations of the previous example in hand, we can now define the optimal policy:
If $v_{\pi^*}(s) \ge v_\pi(s)$ for all $s \in S$ and for every policy $\pi$, then $\pi^*$ is an optimal policy.
This definition of the optimal policy raises several questions:

- Does an optimal policy exist?
- Is the optimal policy unique?
- Is the optimal policy stochastic or deterministic?
- How can the optimal policy be obtained?
To answer these questions, we need to study the Bellman optimality equation.
Compared with the Bellman equation, the Bellman optimality equation is simply the Bellman equation under the optimal-policy condition. In elementwise form it reads
$$v(s) = \max_\pi \sum_a \pi(a \mid s)\Big(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\Big), \quad \forall s \in S,$$
and in matrix-vector form
$$v = \max_\pi (r_\pi + \gamma P_\pi v).$$
There are two unknowns in the Bellman optimality equation, $v$ and $\pi$. How can a single equation with two unknowns be solved? The following shows that it can.
From the two examples above, we know that if $q(s, a)$ is known, then
$$\max_\pi \sum_a \pi(a \mid s)\, q(s, a) = \max_a q(s, a),$$
with the maximum attained when $\pi$ puts all probability on $a^* = \arg\max_a q(s, a)$.
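A quick numeric check of this fact, reusing the $q(s_1, \cdot)$ values from the earlier example: any probability vector $\pi(\cdot \mid s)$ yields a weighted average of $q(s, a)$ that is never larger than $\max_a q(s, a)$.

```python
import numpy as np

q = np.array([6.2, 8.0, 9.0, 6.2, 7.2])  # q(s1, a) from the earlier example
rng = np.random.default_rng(0)

# Sample random stochastic policies pi(.|s1); each gives a weighted
# average of q that cannot exceed max_a q(s1, a).
for _ in range(1000):
    pi = rng.dirichlet(np.ones(len(q)))
    assert pi @ q <= q.max() + 1e-12

# The maximum is attained by the deterministic policy at argmax_a q.
print(q.max(), q.argmax())  # 9.0 2  (i.e. a3)
```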
Bellman Optimality Equation: Solving
To solve the Bellman optimality equation, we write its right-hand side as a function $f(v)$:
$$f(v) = \max_\pi (r_\pi + \gamma P_\pi v),$$
so that the equation becomes $v = f(v)$.
Before solving this equation, we introduce two concepts: the fixed point and the contraction mapping.
Given a function $f$, if there exists $x^*$ such that $f(x^*) = x^*$, then $x^*$ is called a fixed point of $f$. Let $(X, d)$ be a metric space and $f: X \to X$ a mapping. If there exists a constant $\gamma \in [0, 1)$ such that $d(f(x_1), f(x_2)) \le \gamma\, d(x_1, x_2)$ for all $x_1, x_2 \in X$, then $f$ is called a contraction mapping and $\gamma$ is called the contraction coefficient.

With these two concepts in hand, we can state the Banach fixed-point theorem (also known as the contraction mapping theorem), an important theorem in mathematical analysis: a contraction mapping on a complete metric space has exactly one fixed point.
From the Banach fixed-point theorem, we obtain the following: for any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then

1. A fixed point $x^*$ exists, i.e. $f(x^*) = x^*$.
2. The fixed point $x^*$ is unique.
3. The fixed point can be obtained by iteration: $x_{k+1} = f(x_k)$, with $x_k \to x^*$ as $k \to \infty$.
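A minimal illustration of property 3 with a scalar contraction (this toy $f$ is our own choice, not from the original): $f(x) = 0.5x + 1$ has contraction coefficient $0.5$ and unique fixed point $x^* = 2$, since $f(2) = 2$.

```python
def f(x):
    # Contraction with coefficient 0.5: |f(x1) - f(x2)| = 0.5 * |x1 - x2|
    return 0.5 * x + 1

x = 10.0                 # arbitrary starting point x_0
for k in range(60):      # iterate x_{k+1} = f(x_k)
    x = f(x)
print(x)                 # 2.0, the unique fixed point
```

The error shrinks by the contraction coefficient at every step, so convergence is exponentially fast regardless of the starting point.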
Now we can use the Banach fixed-point theorem to solve the Bellman optimality equation. Before doing so, one must prove that $f(v)$ is a contraction mapping; here we only use the result, and readers interested in the proof can look it up themselves. By the Banach fixed-point theorem, a unique $v^*$ must exist, and it can be obtained by iteration.
Bellman Optimality Equation: Solution Example
We use a simple example to deepen our understanding of the solution process. As shown in the figure below, the robot has three actions: move left $a_l$, move right $a_r$, and stay in place $a_0$. The rewards are set to $r_{\text{target}} = +1$ and $r_{\text{boundary}} = -1$.
According to these rules, we can write down the robot's action values $q(s, a)$.
The question now is how to find the optimal state value $v^*(s_i)$ and the optimal policy $\pi^*$.
Using the method introduced above, take $\gamma = 0.9$. At $k = 0$, we first pick an arbitrary initial value, then iterate $v_{k+1} = f(v_k)$ until convergence.
It can be seen that we have found the optimal policy.
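The whole procedure can be sketched as value iteration. Since the original figure is not reproduced here, the world below is a hypothetical stand-in: a 1-D strip of 5 cells with the target in the rightmost cell, where bouncing off the boundary costs $-1$ and entering or staying at the target pays $+1$; the cell count and target location are assumptions of this sketch.

```python
import numpy as np

gamma = 0.9
n, target = 5, 4          # 5 cells, target at the rightmost cell (assumption)
moves = [-1, 0, +1]       # actions: 0 = left (a_l), 1 = stay (a_0), 2 = right (a_r)

def step(s, a):
    """Deterministic dynamics: returns (next state, reward)."""
    s2 = s + moves[a]
    if s2 < 0 or s2 >= n:                # hit the boundary: bounce back, r = -1
        return s, -1.0
    return s2, (1.0 if s2 == target else 0.0)

v = np.zeros(n)                          # k = 0: arbitrary initial value
for k in range(100):                     # v_{k+1}(s) = max_a [r + gamma * v_k(s')]
    q = np.array([[step(s, a)[1] + gamma * v[step(s, a)[0]]
                   for a in range(3)] for s in range(n)])
    v = q.max(axis=1)

policy = q.argmax(axis=1)                # greedy policy w.r.t. the converged v
print(v.round(2))    # ≈ [7.29  8.1   9.  10.  10.]
print(policy)        # [2 2 2 2 1]: move right everywhere, stay at the target
```

The iteration converges because $f$ is a contraction with coefficient $\gamma$; the greedy policy read off from the converged values always heads toward the target.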
Properties of Optimal Policies
Which factors affect the optimal policy? From the Bellman optimality equation, the influencing factors are the reward $r$, the discount factor $\gamma$, and the system model (the environment dynamics $p(s' \mid s, a)$ and $p(r \mid s, a)$).