RL-Zhao-(9)-Policy-Based02: Selection of objective function/Metrics [①average state value; ②average one-step reward], gradient calculation of objective function

1. Selection of objective functions (Metrics to define optimal policies) [Category 2]

There are two types of objective functions/metrics:

  • The average state value
  • Average one-step reward

1、The average state value

Two equivalent expressions:
$$\bar{v}_{\pi} = \sum_{s\in\mathcal{S}} d(s)\, v_{\pi}(s) \doteq \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\right]$$


The first category is the average state value, or simply the average value. This metric is defined as
$$\bar{v}_{\pi} = \sum_{s\in\mathcal{S}} d(s)\, v_{\pi}(s)$$

  • where $\bar{v}_{\pi}$ is a weighted average of the state values;
  • $d(s) \ge 0$ is the weight of state $s$;
  • since $\sum_{s\in\mathcal{S}} d(s) = 1$, we can interpret $d(s)$ as a probability distribution. The metric can then be written as
    $$\bar{v}_{\pi} = \mathbb{E}[v_{\pi}(S)]$$
    where $S \sim d$.

Clearly, $\bar{v}_{\pi}$ is a function of the policy $\pi$: different policies $\pi$ correspond to different values, so we can optimize over $\pi$ to find an optimal policy that maximizes this value. This is a very natural way to choose a metric.

Vector (inner-product) form:
$$\bar{v}_{\pi} = \sum_{s\in\mathcal{S}} d(s)\, v_{\pi}(s) = d^T v_{\pi}$$
where

  • $v_{\pi} = [\ldots, v_{\pi}(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$, whose elements $v_{\pi}(s)$ are the state values of the corresponding states $s$;
  • $d = [\ldots, d(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$, whose elements $d(s)$ are the weights (or probabilities) of the states $s$.

This form is very helpful later when analyzing the gradient.
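As a quick illustration, the inner-product form can be evaluated directly with NumPy. This is a minimal sketch; the three state values and weights below are made-up numbers, not from the lecture.

```python
import numpy as np

# Hypothetical example: 3 states with state values v_pi(s) and weights d(s).
v_pi = np.array([1.0, 2.0, 4.0])   # v_pi(s) for s = 0, 1, 2
d = np.array([0.2, 0.3, 0.5])      # d(s): non-negative, sums to 1

# Average state value: the weighted sum, i.e. the inner product d^T v_pi.
v_bar = d @ v_pi
print(v_bar)   # 0.2*1 + 0.3*2 + 0.5*4 = 2.8
```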


How should the distribution $d$ be chosen? There are two cases:

  • The first case: $d$ is independent of the policy $\pi$.
    • This case is relatively simple because the gradient of the metric is easy to compute. If $d$ is independent of $\pi$, then when we take the gradient of $\bar{v}_{\pi}$ we only need the gradient of $v_{\pi}$, since $d$ contributes no gradient terms (if $d$ did depend on $\pi$, taking the gradient would also require the gradient of $d$ with respect to $\pi$, which is more troublesome).
    • In this case, to emphasize that $d$ is independent of $\pi$, we write $d$ as $d_0$ and $\bar{v}_{\pi}$ as $\bar{v}_{\pi}^0$.
    • How do we choose $d_0$?
      • A simple way is to treat all states equally, i.e., choose $d_0(s) = \frac{1}{|\mathcal{S}|}$, which is the uniform distribution.
      • Another important case is when we only care about a specific state $s_0$. For example, in some tasks every episode starts from the same state $s_0$ (e.g., some games always start from the same screen, which corresponds to a specific state $s_0$), and we want the return obtained from that start to be as large as possible. In this case we cannot treat all states equally; in this extreme case we only care about $s_0$ and focus on the long-term return obtained starting from $s_0$. We then set
        $$d_0(s_0)=1, \qquad d_0(s\neq s_0)=0$$
        With this choice, $\bar{v}_{\pi}$ becomes $\bar{v}_{\pi}^0$, and maximizing $\bar{v}_{\pi}^0$ is in fact maximizing the return obtained starting from $s_0$.
  • The second case: $d$ depends on the policy $\pi$.
    • A common choice is to take $d$ to be $d_{\pi}(s)$, the stationary distribution under $\pi$. [Intuitively: given a policy, the agent keeps interacting with the environment by following it; after executing the policy for a long time, the probability of being in each state settles down to a steady state, and this probability can be computed directly from the equation $d_{\pi}^T P_{\pi} = d_{\pi}^T$; a small sketch of this computation follows this list.]
      • A basic property of $d_{\pi}$ is that it satisfies
        $$d_{\pi}^T P_{\pi} = d_{\pi}^T$$
        where $P_{\pi}$ is the state transition probability matrix.
    • Interpretation of choosing $d_{\pi}$:
      • If a state is frequently visited in the long run, it is more important and should be given more weight;
      • If a state is rarely visited, its weight is naturally smaller.
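Below is a minimal sketch of obtaining $d_{\pi}$ from $P_{\pi}$ by power iteration and checking the defining equation; the transition matrix is a made-up example, not from the lecture.

```python
import numpy as np

# Hypothetical state transition matrix under some policy pi:
# P_pi[s, s'] = probability of moving from s to s' (each row sums to 1).
P_pi = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
])

# Power iteration: start from any distribution and repeatedly apply d^T <- d^T P_pi.
d = np.ones(3) / 3
for _ in range(10_000):
    d = d @ P_pi

print(d)                          # the stationary distribution d_pi
print(np.allclose(d @ P_pi, d))   # True: d_pi^T P_pi = d_pi^T
```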

2、The average one-step reward

Two equivalent expressions:
$$\bar{r}_{\pi} \doteq \sum_{s\in\mathcal{S}} d_{\pi}(s)\, r_{\pi}(s) \doteq \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right]$$

The second category is the average one-step reward, or simply the average reward. Specifically, the metric is
$$\bar{r}_{\pi} \doteq \sum_{s\in\mathcal{S}} d_{\pi}(s)\, r_{\pi}(s) = \mathbb{E}[r_{\pi}(S)]$$
where:

  • $d_{\pi}(s)$, the weight of state $s$, is the stationary distribution, which depends on the policy $\pi$;
  • $S \sim d_{\pi}$;
  • $r_{\pi}(s) \doteq \sum_{a\in\mathcal{A}} \pi(a|s)\, r(s,a)$ is the average of the one-step immediate rewards obtained in state $s$, and
    $$r(s,a) = \mathbb{E}[R|s,a] = \sum_r r\, p(r|s,a)$$

As its name suggests, $\bar{r}_{\pi}$ is a weighted average of the one-step immediate rewards (the bar over $\bar{r}_{\pi}$ also indicates an average).
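Here is a small sketch of computing $r_{\pi}(s)$ and $\bar{r}_{\pi}$ directly from these definitions; all numbers are made up for illustration, and $d_{\pi}$ is simply assumed to be given.

```python
import numpy as np

# Hypothetical example: 2 states, 2 actions.
pi = np.array([              # pi[s, a] = pi(a|s)
    [0.7, 0.3],
    [0.4, 0.6],
])
r_sa = np.array([            # r_sa[s, a] = r(s, a) = E[R | s, a]
    [1.0, 0.0],
    [0.5, 2.0],
])
d_pi = np.array([0.6, 0.4])  # stationary distribution under pi (assumed given)

# r_pi(s) = sum_a pi(a|s) * r(s, a)
r_pi = (pi * r_sa).sum(axis=1)

# r_bar_pi = sum_s d_pi(s) * r_pi(s)
r_bar = d_pi @ r_pi
print(r_pi)    # [0.7, 1.4]
print(r_bar)   # 0.6*0.7 + 0.4*1.4 = 0.98
```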

The second form of the average reward above:

  • Suppose an agent generates a trajectory by following a given policy, and the rewards it receives along the way are $(R_{t+1}, R_{t+2}, \ldots)$.
  • The average single-step reward along this trajectory is
    $$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\Big[R_{t+1}+R_{t+2}+\cdots+R_{t+n}\,\Big|\,S_t=s_0\Big] = \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\,\Big|\,S_t=s_0\right]$$
    where $s_0$ is the starting state of the trajectory.

Furthermore, this quantity equals $\bar{r}_{\pi}$:
$$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\,\Big|\,S_t=s_0\right] = \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right] = \sum_s d_{\pi}(s)\, r_{\pi}(s) = \bar{r}_{\pi}$$
Why does $s_0$ no longer appear here? Because $s_0$ no longer plays a role: after running infinitely many steps, it no longer matters where you started.
Note:

  • As $n$ approaches infinity, the starting state $s_0$ no longer matters.
  • The two expressions for $\bar{r}_{\pi}$ above are therefore equal.

This formula is one that you may often see in papers.

3、Remarks

A few points to emphasize about the above two metrics:

  • Remark 1
    • These metrics are functions of the policy $\pi$;
    • since the policy $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$;
    • in other words, different $\theta$ generate different metric values;
    • therefore, we can search for an optimal $\theta$ that maximizes these metrics.
  • Remark 2
    • These metrics can be divided into two cases: the discounted case, where $\gamma \in [0,1)$, and the undiscounted case, where $\gamma = 1$.
    • Here we only consider the discounted case.
  • Remark 3
    • Intuitively, $\bar{r}_{\pi}$ seems short-sighted because it only considers the immediate rewards, whereas $\bar{v}_{\pi}$ considers the total reward over all steps. 【$\color{red}{×}$】
    • However, these two metrics are in fact equivalent. Specifically, in the discounted case with $\gamma < 1$ (and with $\bar{v}_{\pi}$ weighted by the stationary distribution $d_{\pi}$),
      $$\bar{r}_{\pi} = (1-\gamma)\,\bar{v}_{\pi}$$
      A small numerical check of this identity follows this list.
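The following is a small numerical check of the identity on a made-up MDP, assuming (as noted above) that $\bar{v}_{\pi}$ uses the stationary distribution $d_{\pi}$ as its weight.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state example: P_pi and r_pi are the transition matrix and
# expected one-step reward under some fixed policy pi.
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.3, 0.0, 0.7],
])
r_pi = np.array([1.0, 0.0, 2.0])

# State values from the Bellman equation v_pi = r_pi + gamma * P_pi v_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Stationary distribution d_pi via power iteration.
d_pi = np.ones(3) / 3
for _ in range(10_000):
    d_pi = d_pi @ P_pi

r_bar = d_pi @ r_pi   # average one-step reward
v_bar = d_pi @ v_pi   # average state value, weighted by d_pi

print(np.isclose(r_bar, (1 - gamma) * v_bar))   # True
```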

4、Exercise (another form of the objective function)

Exercise: consider the metric $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\right]$, where the trajectory starts from $S_0 \sim d$ and follows the policy $\pi(\theta)$. How does this metric relate to the metrics introduced above?

Answer: first, analyze and understand this metric.

  • It starts from $S_0 \sim d$ and then generates $A_0, R_1, S_1, A_1, R_2, S_2, \ldots$
  • $A_t \sim \pi(S_t)$, and $R_{t+1}, S_{t+1} \sim p(R_{t+1}|S_t,A_t),\ p(S_{t+1}|S_t,A_t)$.

Then, we can see that this metric is the same as the average state value, because
$$\begin{aligned} J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\right] &= \sum_{s\in\mathcal{S}} d(s)\,\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\,\Big|\,S_0=s\right] \\ &= \sum_{s\in\mathcal{S}} d(s)\, v_{\pi}(s) \\ &= \bar{v}_{\pi} \end{aligned}$$

2. Gradient of the objective function (Gradients of the metrics)

Given a metric, we then:

  • derive its gradient
  • Then, apply gradient-based methods to optimize this metric.

Calculating the gradient is the most complicated part of policy gradient methods!

  • First, we need to distinguish between the different objective functions/metrics:
    • $\bar{v}_{\pi} = \sum_{s\in\mathcal{S}} d(s)\, v_{\pi}(s) \doteq \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\right]$
    • $\bar{v}_{\pi}^0$
    • $\bar{r}_{\pi} \doteq \sum_{s\in\mathcal{S}} d_{\pi}(s)\, r_{\pi}(s) \doteq \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right]$
  • Second, we need to distinguish between the discounted and undiscounted cases.

The gradients of the metrics:
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$$
where

  • the objective function $J(\theta)$ can be:
    • $\bar{v}_{\pi} = \sum_{s\in\mathcal{S}} d(s)\, v_{\pi}(s) \doteq \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\right]$
    • $\bar{v}_{\pi}^0$
    • $\bar{r}_{\pi} \doteq \sum_{s\in\mathcal{S}} d_{\pi}(s)\, r_{\pi}(s) \doteq \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right]$
  • the "$=$" here can denote strict equality ($=$), approximation ($\approx$), or proportionality ($\propto$);
  • $\sum_{s\in\mathcal{S}}$ denotes a sum over the states $s$;
  • $\eta$ is a distribution (or set of weights) over the states, so each state has a weight $\eta(s)$; for different objective functions, $\eta$ is a different distribution;
  • $\sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$ is, for each state $s$, a sum over the actions of $\nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$, where
    • $\nabla_\theta \pi(a|s,\theta)$ is the gradient of the policy $\pi$;
    • $q_\pi(s,a)$ is the action value of $(s,a)$.

In short, the gradients obtained in all these cases are very similar, so this single formula is used to summarize them. For most students, this formula is enough; only if you need to research and develop new algorithms should you read the detailed derivations in the book.

More specifically, the gradients of $\bar{r}_{\pi}$, $\bar{v}_{\pi}$, and $\bar{v}_{\pi}^0$ are, stated somewhat loosely (the details are not given here; interested readers can read the book):
$$\begin{aligned} \nabla_\theta \bar{r}_{\pi} &\simeq \sum_s d_{\pi}(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a), \\ \nabla_\theta \bar{v}_{\pi} &= \frac{1}{1-\gamma}\,\nabla_\theta \bar{r}_{\pi}, \\ \nabla_\theta \bar{v}_{\pi}^0 &= \sum_{s\in\mathcal{S}} \rho_{\pi}(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \end{aligned}$$

1、Analysis of the gradient formula

The gradient
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$$
can be written in a compact and useful form:

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}\big[\nabla_\theta \ln\pi(A|S,\theta)\, q_\pi(S,A)\big]$$

where $S \sim \eta$ and $A \sim \pi(A|S,\theta)$.

  • All the sums $\sum$ are removed and the gradient is written in the form of an expectation $\mathbb{E}[\cdot]$;
  • here $S$ and $A$ are both random variables, with $S \sim \eta$ and $A \sim \pi(A|S,\theta)$:
    • $S$ follows the distribution $\eta$;
    • $A$ follows the distribution $\pi(A|S,\theta)$.

Why do we want an expression of the form $\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \ln\pi(A|S,\theta)\, q_\pi(S,A)\big]$?

Because we can use sampling to approximate this gradient!
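Below is a minimal sketch of that idea for a tabular softmax policy: the exact sum $\sum_{s}\eta(s)\sum_a \nabla_\theta\pi(a|s,\theta)\,q_\pi(s,a)$ is compared with a Monte Carlo average of samples of $\nabla_\theta \ln\pi(A|S,\theta)\, q_\pi(S,A)$ with $S \sim \eta$, $A \sim \pi(\cdot|S,\theta)$. The distribution $\eta$ and the values $q_\pi(s,a)$ are made-up numbers for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 3, 2

theta = rng.normal(size=(n_s, n_a))   # parameters of a tabular softmax policy
eta = np.array([0.5, 0.3, 0.2])       # hypothetical state distribution eta(s)
q = rng.normal(size=(n_s, n_a))       # hypothetical action values q_pi(s, a)

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)   # pi(a|s, theta), shape (n_s, n_a)

p = softmax_policy(theta)

# Exact gradient: sum_s eta(s) sum_a grad_theta pi(a|s, theta) * q_pi(s, a).
# For a softmax row, d pi(a|s) / d theta[s, b] = pi(a|s) * (1{a=b} - pi(b|s)).
exact = np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        exact[s] += eta[s] * p[s, a] * (np.eye(n_a)[a] - p[s]) * q[s, a]

# Sample-based estimate: average of grad_theta ln pi(A|S, theta) * q_pi(S, A),
# with S ~ eta and A ~ pi(.|S, theta).
est = np.zeros_like(theta)
N = 100_000
for _ in range(N):
    s = rng.choice(n_s, p=eta)
    a = rng.choice(n_a, p=p[s])
    grad_log = np.zeros_like(theta)
    grad_log[s] = np.eye(n_a)[a] - p[s]   # grad of ln pi(a|s) w.r.t. theta[s, :]
    est += grad_log * q[s, a]
est /= N

print(np.abs(exact - est).max())   # small sampling error, close to 0
```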

3. Gradient-ascent algorithm

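Sketching the idea briefly from the gradient expression above (the full treatment is in the book): to maximize $J(\theta)$, gradient ascent updates the parameter along the gradient direction,
$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta J(\theta_t) = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)\big],$$
and since this expectation is not available in practice, it is replaced by a stochastic sample,
$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t),$$
where $q_t(s_t,a_t)$ is some estimate of $q_\pi(s_t,a_t)$; different choices of this estimate lead to different algorithms.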
1、REINFORCE algorithm

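In REINFORCE, the estimate $q_t(s_t,a_t)$ in the stochastic update above is taken to be the Monte Carlo return $g_t$ computed from a sampled episode. Below is a minimal Python sketch for a tabular softmax policy; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) is a hypothetical placeholder, not something defined in the lecture.

```python
import numpy as np

def reinforce_episode(env, theta, alpha=0.01, gamma=0.9, rng=None):
    """Run one episode with the tabular softmax policy pi(.|s, theta) and apply the
    REINFORCE update, using the Monte Carlo return g_t as the estimate of q_pi(s_t, a_t).

    `env` is a hypothetical episodic environment: env.reset() -> initial state (int),
    env.step(a) -> (next_state, reward, done). `theta` has shape (n_states, n_actions)."""
    rng = rng or np.random.default_rng()
    n_a = theta.shape[1]

    def pi(s):
        z = np.exp(theta[s] - theta[s].max())   # softmax over actions in state s
        return z / z.sum()

    # 1) Generate an episode by following pi(theta).
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = rng.choice(n_a, p=pi(s))
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # 2) Compute the discounted returns g_t backwards along the episode.
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # 3) Per-step gradient ascent: theta[s] <- theta[s] + alpha * grad ln pi(a|s) * g_t.
    for s, a, g_t in zip(states, actions, returns):
        grad_log = -pi(s)
        grad_log[a] += 1.0                      # d/d theta[s, :] of ln pi(a|s, theta)
        theta[s] += alpha * grad_log * g_t
    return theta
```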




