In-depth understanding of reinforcement learning - Markov decision process: occupancy measure - [Basic knowledge]

As mentioned in the article "In-depth understanding of reinforcement learning - Markov decision process: Bellman expectation equation - [Basic knowledge]", the value functions of different policies are different. This is because, for the same Markov decision process, different policies visit states with different probability distributions. Imagine a policy in the Markov decision process shown in the figure below whose actions drive the agent to the terminal state $s_5$ as quickly as possible: when the agent is in state $s_3$ it never takes the action "go to $s_4$", but instead takes the action "go to $s_5$" with probability 1, so it never collects the large reward of 10 that can be obtained by taking "go to $s_5$" in state $s_4$. It is easy to see from the Bellman equation that the value of this policy at state $s_3$ will be relatively small, because the agent can never reach state $s_4$. We therefore need to recognize that different policies make the agent visit states with different probability distributions, and that this affects the value function of the policy.
Figure: A simple example of a Markov decision process
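
To make this concrete, here is a minimal sketch in Python of a hypothetical 5-state chain. It is not the exact MDP from the figure: the transition rules, action encoding, and placement of the reward of 10 are assumptions made purely for illustration. It rolls out two deterministic policies, one that jumps from $s_3$ straight to the terminal state $s_5$ and one that detours through $s_4$, and shows that they visit different states and collect different returns.

```python
# Minimal sketch, NOT the exact MDP from the figure: a hypothetical 5-state
# chain in which s3 can either detour through s4 (where "go to s5" pays a
# reward of 10) or jump straight to the terminal state s5.
S1, S2, S3, S4, S5 = "s1", "s2", "s3", "s4", "s5"   # s5 is terminal

def step(state, action):
    """Toy deterministic dynamics: the action simply names the next state."""
    reward = 10.0 if (state == S4 and action == S5) else 0.0
    return action, reward

# "rush" heads for the terminal state as fast as possible; "detour" passes s4.
policies = {
    "rush":   {S1: S2, S2: S3, S3: S5, S4: S5},
    "detour": {S1: S2, S2: S3, S3: S4, S4: S5},
}

for name, policy in policies.items():
    state, visited, ret = S1, [S1], 0.0
    while state != S5:
        state, reward = step(state, policy[state])
        visited.append(state)
        ret += reward
    print(name, "visits:", visited, "return:", ret)
```

The "rush" policy never visits $s_4$, so it never collects the reward of 10, and by the Bellman equation its value at $s_3$ is correspondingly lower.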

First, we define the initial state distribution of the Markov decision process as $\nu_0(s)$; in some references the initial state distribution is included among the elements of the Markov decision process. We use $P_t^\pi(s)$ to denote the probability that, following policy $\pi$, the agent is in state $s$ at time step $t$, so that $P_0^\pi(s)=\nu_0(s)$. We can then define a policy's state visitation distribution:
$$\nu^\pi(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P_t^\pi(s)$$

Here, $1-\gamma$ is a normalization factor that makes the distribution sum to 1. The state visitation distribution describes the distribution of the states visited when a policy interacts with the Markov decision process. Note that, in theory, computing this distribution requires interacting for infinitely many time steps, whereas in practice the interaction between the agent and the Markov decision process within a single sequence is finite; nevertheless, the formula above still captures the idea of the state visitation distribution. The state visitation distribution has the following property:
$$\nu^\pi(s')=(1-\gamma)\nu_0(s')+\gamma\int P(s'|s,a)\pi(a|s)\nu^\pi(s)\,\mathrm{d}s\,\mathrm{d}a$$
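
As a numerical sanity check, the sketch below uses a randomly generated 3-state, 2-action MDP of my own (all numbers are assumptions, not taken from the figure; sums replace the integrals since the state and action spaces are finite). It computes $P_t^\pi$ exactly by iterating the transition matrix, forms $\nu^\pi$ from the truncated discounted sum, and verifies that the property above holds up to truncation error.

```python
# A small numerical check of the state visitation distribution, assuming a
# randomly generated 3-state, 2-action MDP (illustrative numbers only).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# P[a, s, s'] = transition probability, pi[s, a] = pi(a|s), nu0 = initial distribution
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
nu0 = np.array([1.0, 0.0, 0.0])

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum("sa,asn->sn", pi, P)

# nu^pi(s) = (1 - gamma) * sum_{t >= 0} gamma^t P_t^pi(s), truncated at T steps
T = 500
p_t, nu = nu0.copy(), np.zeros(n_states)
for t in range(T):
    nu += (1 - gamma) * gamma**t * p_t
    p_t = p_t @ P_pi           # P_{t+1}^pi follows from P_t^pi and the dynamics

# Property: nu(s') = (1-gamma) nu0(s') + gamma * sum_{s,a} P(s'|s,a) pi(a|s) nu(s)
rhs = (1 - gamma) * nu0 + gamma * np.einsum("sa,asn,s->n", pi, P, nu)
print("nu^pi:", nu.round(4), " sums to:", nu.sum().round(6))
print("max residual of the property:", np.abs(nu - rhs).max())
```

The residual is on the order of floating-point error, which is what the property predicts for a sufficiently long truncation horizon.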

In addition, we can define a policy's occupancy measure:
$$\rho^\pi(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P_t^\pi(s)\pi(a|s)$$

It represents the probability that the state-action pair $(s,a)$ is visited. The two quantities are related as follows:
$$\rho^\pi(s,a)=\nu^\pi(s)\pi(a|s)$$
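
This relation can also be checked by sampling. The sketch below uses a made-up 2-state, 2-action MDP (every number is an illustrative assumption): it estimates $\rho^\pi(s,a)$ by Monte Carlo, weighting a visit at time step $t$ by $(1-\gamma)\gamma^t$, compares the estimate with $\nu^\pi(s)\pi(a|s)$ computed from the same samples, and confirms that $\rho^\pi$ sums to 1.

```python
# Monte Carlo estimate of the occupancy measure for a made-up 2-state,
# 2-action MDP (illustrative numbers only).
import numpy as np

gamma, n_states, n_actions = 0.9, 2, 2
P = np.array([[[0.8, 0.2], [0.3, 0.7]],       # P[a, s, s']
              [[0.1, 0.9], [0.6, 0.4]]])
pi = np.array([[0.5, 0.5], [0.2, 0.8]])       # pi[s, a] = pi(a|s)
nu0 = np.array([1.0, 0.0])
rng = np.random.default_rng(0)

n_episodes, horizon = 5_000, 100              # the horizon truncates the infinite sum
rho = np.zeros((n_states, n_actions))
nu = np.zeros(n_states)
for _ in range(n_episodes):
    s = rng.choice(n_states, p=nu0)
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        w = (1 - gamma) * gamma**t            # discounted visitation weight
        nu[s] += w
        rho[s, a] += w
        s = rng.choice(n_states, p=P[a, s])
rho /= n_episodes
nu /= n_episodes

print("Monte Carlo rho^pi:\n", rho.round(3))
print("nu^pi(s) * pi(a|s):\n", (nu[:, None] * pi).round(3))
print("rho^pi sums to:", rho.sum().round(3))  # ~1 up to truncation and sampling error
```

Because $\nu^\pi$ here is estimated from the same trajectories, the comparison mainly checks the factorization $\rho^\pi(s,a)=\nu^\pi(s)\pi(a|s)$ rather than the sampling itself.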

Furthermore, we can derive the following two theorems:

  • Theorem 1: The occupancy measures obtained when an agent interacts with the same Markov decision process using policies $\pi_1$ and $\pi_2$ satisfy $\rho^{\pi_1}=\rho^{\pi_2}\Leftrightarrow\pi_1=\pi_2$.
  • Theorem 2: Given a legal occupancy measure $\rho$, the unique policy that generates it is $\pi_\rho(a|s)=\dfrac{\rho(s,a)}{\sum_{a'}\rho(s,a')}$ (see the sketch after this list).

The "legitimate" occupancy metric mentioned above refers to the probability that a state-action pair generated by the interaction of the agent with the Markov decision process is accessed by a policy.

