[Reinforcement Learning Theory] Derivation of the State Value Function and Action Value Function Formulas

Because the definitions of the state value function and the action value function, and the relationships between their formulas, are often confused, this post sorts them out and records the derivations.

To follow the derivations, you first need to understand several basic definitions.

Basic Definitions

Reward Function

There are two notations for the reward function.

① Written as $r(s)$: the reward of a state $s$, defined as the expectation of the reward received in that state, namely:
$$r(s) = \mathbb{E}[R_t \mid S_t = s]$$

Why use the expectation of $R_t$ instead of $R_t$ itself to represent the reward of this state?

Because in the same state, different actions may be taken, and the resulting reward $R_t$ may be different.

② Written as $r(s, a)$: the reward for taking action $a$ in state $s$, defined as the expectation of the reward obtained when the agent is in this state and takes this action, namely:
$$r(s, a) = \mathbb{E}[R_t \mid S_t = s, A_t = a]$$

Why use the expectation of $R_t$ instead of $R_t$ itself to represent the reward for taking this action in this state?

Because for the same state, even if the same action is taken, the next state $s'$ may be different, the subsequent reward $R_{t+1}$ may be different, and the final return $G_t$ is naturally different as well.

Return

The return, denoted $G_t$, is the discounted sum of all rewards obtained from state $s_t$ at time $t$ until the terminal state:
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$$
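
As a small illustration (my own sketch, not part of the original derivation), the return of a finite episode can be computed directly from its reward sequence; the rewards and discount factor below are made-up example values.

```python
# A minimal sketch: computing the discounted return G_t for a finite episode.
# The reward sequence and discount factor are made-up example values.

def discounted_return(rewards, gamma):
    """Return G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 2.0, 5.0]   # R_t, R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.9
print(discounted_return(rewards, gamma))  # 1 + 0 + 0.9^2 * 2 + 0.9^3 * 5 = 6.265
```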

Value

Value is a state-based concept. The value of a state is the expectation of the cumulative discounted reward (that is, the return $G_t$) obtained from that state $s$ at some time step until the terminal state.

1. Why use **the expectation of $G_t$** instead of $G_t$ directly?

Because for the same starting state $s$, the return $G_t$ can be different. To evaluate the value of a state objectively, we need to account as fully as possible for the different returns it can yield.

2. Why can $G_t$ be different?

Because during interaction with the environment, the initial state $s_t$ may transition to different states $s'$, the rewards $R_t$ obtained along the way differ, and the final return $G_t$ is naturally different.

Value Function

The value function, denoted $V(s)$, can be understood as a mapping whose input is a state $s$ and whose output is the value of that state, namely:
$$V(s) = \mathbb{E}[G_t \mid S_t = s]$$

What is the difference between a reward function and a value function?

In my own understanding, the reward function only measures the immediate payoff obtainable in the current state, while the value function measures all payoffs from the current state onward into the future.

State Transition Matrix

The state transition matrix can be written as $P(s' \mid s)$, the probability that state $s$ transitions to state $s'$.

When the state set is finite, it can be represented by a matrix; when the state set is not finite, it is called a state transition function.

Policy

A policy, denoted $\pi$, can be understood as the probability of taking action $a$ given the input state $s$, namely:
$$\pi(s, a) = \pi(a \mid s) = P(A_t = a \mid S_t = s)$$
The policy $\pi$ depends only on the current state $s$ and is independent of the states that came before it.

For the same state $s$, different policies $\pi$ lead to different actions $a$, and therefore to different values.
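
For intuition, here is a sketch of my own (not from the original post) that stores a stochastic policy $\pi(a \mid s)$ over a finite state and action set as a row-stochastic array and samples an action from it; all probabilities are made-up example values.

```python
import numpy as np

# A stochastic policy pi(a|s) for 3 states and 2 actions,
# stored as a row-stochastic array: pi[s, a] = P(A_t = a | S_t = s).
# All probabilities here are made-up example values.
pi = np.array([
    [0.7, 0.3],
    [0.5, 0.5],
    [0.1, 0.9],
])
assert np.allclose(pi.sum(axis=1), 1.0)  # each row is a probability distribution

rng = np.random.default_rng(0)
s = 1
a = rng.choice(len(pi[s]), p=pi[s])  # sample an action according to pi(.|s)
print(f"in state {s}, sampled action {a}")
```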

State Transition Function

The state transition function can be written as $P(s' \mid s, a)$, the probability of reaching state $s'$ after performing action $a$ in state $s$.

In contrast to state transition matrices, state transition functions can represent situations where the set of states is not finite.

There are two forms of the state transition probability: one is $P(s' \mid s)$, the other is $P(s' \mid s, a)$. The bridge connecting them is the policy $\pi$, namely:
$$P(s' \mid s) = \sum_{a \in A} \pi(a \mid s)\, P(s' \mid s, a)$$
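
This bridge formula is easy to check numerically. The sketch below is my own, with a made-up 2-state, 2-action MDP: it marginalizes $P(s' \mid s, a)$ over the policy to obtain $P(s' \mid s)$.

```python
import numpy as np

# Tiny made-up MDP: 2 states, 2 actions.
# P_sa[s, a, s'] = P(s' | s, a); pi[s, a] = pi(a | s).
P_sa = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.3, 0.7]],   # transitions from state 1 under actions 0 and 1
])
pi = np.array([
    [0.6, 0.4],
    [0.2, 0.8],
])

# P(s'|s) = sum_a pi(a|s) * P(s'|s, a)
P_s = np.einsum('sa,sax->sx', pi, P_sa)
print(P_s)
print(P_s.sum(axis=1))  # each row still sums to 1
```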

State Value Function

The state value function, denoted $V^{\pi}(s)$, is defined as follows: in a Markov decision process, the agent starts from state $s$ and follows policy $\pi$; $V^{\pi}(s)$ is the expectation of the resulting return $G_t$, namely:
$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$

It looks similar to the value function, except that the value function does not emphasize the policy.

The following two questions and their answers are similar to those asked when understanding value.

1. Why use the expectation of the return $G_t$ instead of just using the return $G_t$?

Because for the same state $s$ and a given policy $\pi$, the return $G_t$ may be different. To objectively evaluate the value of a state under a given policy, we need to account as fully as possible for the different returns it can yield.

2. Why can $G_t$ be different?

Because for the same state $s$ and a given policy $\pi$, the action $a$ taken in the current state may differ (especially when the policy is stochastic), so the reward $R_t$ differs and the final return $G_t$ may also differ.

Action Value Function

The action value function, denoted $Q^{\pi}(s, a)$, is defined as follows: in a Markov decision process, the agent starts from state $s$, follows policy $\pi$, and performs action $a$; $Q^{\pi}(s, a)$ is the expectation of the resulting return $G_t$, namely:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$$

It looks very similar to the state value function, except that the state value function does not emphasize the action.

The following two questions and their answers are similar to those asked when understanding the state value function.

1. Why use the expectation of the return $G_t$ instead of just using the return $G_t$?

Because for the same state $s$, a given policy $\pi$, and a given action $a$, the return $G_t$ may be different. To objectively evaluate the value of taking an action in a state under a given policy, we need to account as fully as possible for the different returns it can yield.

2. Why can $G_t$ be different?

Because for the same state $s$, a given policy $\pi$, and a given action $a$, the next state $s'$ may be different (the environment may transition differently), the subsequent actions may be different (especially when the policy is stochastic), so the rewards $R_t$ differ and the final return $G_t$ may also differ.

The Relationship Between the State Value Function and the Action Value Function

Relationship 1

$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a)$$

The derivation of Relationship 1 relies on: ① the definition of the state value function; ② the definition of the action value function. The derivation is as follows:
$$\begin{aligned} V^{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t = s] \\ &= \sum_{a \in A} \pi(a \mid s)\, \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \\ &= \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a) \end{aligned}$$
Line 1 uses the definition of the state value function;

The step from line 2 to line 3 uses the definition of the action value function.

Here is my own explanation of why line 2 converts $\mathbb{E}_{\pi}[G_t \mid S_t = s]$ into $\sum_{a \in A} \pi(a \mid s)\, \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$ rather than $\sum_{a \in A} \pi(a \mid s)\, G_t$ (the only difference between the two is whether we take the expectation of $G_t$). The latter may look closer to the usual form of a mathematical expectation: state $s$ selects action $a$ with probability $\pi(a \mid s)$, so the factor it multiplies should be the return $G_t$ obtained by taking action $a$ in state $s$. However, as mentioned above when explaining the definition of the action value function: for the same state $s$, a given policy $\pi$, and a given action $a$, the return $G_t$ may be different. That is, $(s, a)$ and $G_t$ are not in one-to-one correspondence, so they cannot simply be multiplied together. We therefore need a quantity that does correspond one-to-one with $(s, a)$ and summarizes the payoff of $(s, a)$. The action value is exactly such a quantity: it takes the mathematical expectation over the multiple possible $G_t$ corresponding to $(s, a)$ and thus represents the overall payoff of $(s, a)$. (From this point of view, line 2 could actually be skipped and line 3 written directly.)
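
As a numerical sanity check of Relationship 1 (a sketch of my own with made-up numbers): if the action values $Q^{\pi}(s, a)$ are already known, the state value is just the policy-weighted average over actions.

```python
import numpy as np

# Made-up Q^pi values for 2 states and 2 actions, and a made-up policy.
Q = np.array([
    [1.0, 3.0],
    [0.5, 2.0],
])
pi = np.array([
    [0.6, 0.4],
    [0.2, 0.8],
])

# Relationship 1: V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
V = (pi * Q).sum(axis=1)
print(V)  # [0.6*1 + 0.4*3, 0.2*0.5 + 0.8*2] = [1.8, 1.7]
```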

Relationship 2

$$Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s')$$

The derivation of Relationship 2 relies on: ① the definition of the action value function; ② the definition of the return; ③ the definition of the reward function; ④ the definition of the state value function. The derivation is as follows:
$$\begin{aligned} Q^{\pi}(s, a) &= \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \\ &= \mathbb{E}_{\pi}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s, A_t = a] \\ &= \mathbb{E}_{\pi}[R_t + \gamma (R_{t+1} + \gamma R_{t+2} + \cdots) \mid S_t = s, A_t = a] \\ &= \mathbb{E}_{\pi}[R_t \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_{\pi}[G_{t+1} \mid S_t = s, A_t = a] \\ &= r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s'] \\ &= r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \end{aligned}$$
In the derivation, line 1 uses the definition of the action value function;

The steps from line 1 to line 2 and from line 3 to line 4 use the definition of the return;

The step from line 4 to line 5 uses the definition of the reward function;

The step from line 5 to line 6 uses the definition of the state value function.

Here is my own explanation of why the second term in line 5 decomposes $\mathbb{E}_{\pi}[G_{t+1} \mid S_t = s, A_t = a]$ into $\sum_{s' \in S} P(s' \mid s, a)\, \mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s']$ rather than $\sum_{s' \in S} P(s' \mid s, a)\, G_{t+1}$ (the only difference between the two is whether we take the expectation of $G_{t+1}$). The latter may look closer to the usual form of a mathematical expectation: from state $s$ and action $a$ we transition to state $s'$ with probability $P(s' \mid s, a)$, so the factor it multiplies should be the return $G_{t+1}$ of the corresponding state $s'$. However, as mentioned above when explaining the definition of value: for the same initial state $s$, the return $G_t$ can be different. That is, $s'$ and $G_{t+1}$ are not in one-to-one correspondence, so they cannot simply be multiplied together. We therefore need a quantity that corresponds one-to-one with $s'$ and summarizes the payoff of the state $s'$. The value is exactly such a quantity: it takes the mathematical expectation over the multiple possible $G_{t+1}$ corresponding to $s'$ and thus represents the overall payoff of the state $s'$. (Viewed this way, line 5 could actually be skipped and line 6 derived directly.)

In addition, a special reminder:
$$\begin{aligned} Q^{\pi}(s, a) &= \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \\ &= \sum_{s' \in S} P(s' \mid s, a)\, \mathbb{E}[G_{t+1} \mid S_{t+1} = s'] \end{aligned}$$
You cannot transform it like this! This jumps directly from $G_t$ to $G_{t+1}$ without accounting for $R_t$.
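
Relationship 2 can be sketched in the same way (my own example, made-up numbers): given $r(s, a)$, $P(s' \mid s, a)$, and already-known values $V^{\pi}(s')$, the action values follow from one expectation over next states.

```python
import numpy as np

# Made-up quantities for a 2-state, 2-action MDP.
r_sa = np.array([       # r(s, a)
    [1.0, 0.0],
    [0.5, 2.0],
])
P_sa = np.array([       # P(s' | s, a)
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
])
V = np.array([1.8, 1.7])  # assumed known V^pi(s') values
gamma = 0.9

# Relationship 2: Q^pi(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s,a) * V^pi(s')
Q = r_sa + gamma * np.einsum('sax,x->sa', P_sa, V)
print(Q)
```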

Bellman Equation

The following formula is the Bellman equation:
$$V(s) = r(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')$$
Compared with the original definition of the value function, the Bellman equation lets us compute an analytical solution for the value function once the reward function and the state transition matrix are known.

The derivation of the Bellman equation relies on: ① the definition of the return; ② the definition of the reward function; ③ the definition of the value function. The derivation is as follows:
$$\begin{aligned} V(s) &= \mathbb{E}[G_t \mid S_t = s] \\ &= \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s] \\ &= \mathbb{E}[R_t + \gamma (R_{t+1} + \gamma R_{t+2} + \cdots) \mid S_t = s] \\ &= \mathbb{E}[R_t + \gamma G_{t+1} \mid S_t = s] \\ &= \mathbb{E}[R_t \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s] \\ &= r(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, \mathbb{E}[G_{t+1} \mid S_{t+1} = s'] \\ &= r(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s') \end{aligned}$$
In the derivation, the steps from line 1 to line 2 and from line 3 to line 4 use the definition of the return;

The step from line 5 to line 6 uses the definition of the reward function;

The step from line 6 to line 7 uses the definition of the value function.

Here is my own explanation of why the second term in line 6 converts $\mathbb{E}[G_{t+1} \mid S_t = s]$ into $\sum_{s' \in S} P(s' \mid s)\, \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ rather than $\sum_{s' \in S} P(s' \mid s)\, G_{t+1}$ (the only difference between the two is whether we take the expectation of $G_{t+1}$). The latter may look closer to the usual form of a mathematical expectation: state $s$ transitions to state $s'$ with probability $P(s' \mid s)$, so the factor it multiplies should be the return $G_{t+1}$ of the corresponding state $s'$. However, as mentioned above when explaining the definition of value: for the same initial state $s$, the return $G_t$ can be different. That is, $s'$ and $G_{t+1}$ are not in one-to-one correspondence, so they cannot simply be multiplied together. We therefore need a quantity that corresponds one-to-one with $s'$ and summarizes the payoff of the state $s'$. The value is exactly such a quantity: it takes the mathematical expectation over the multiple possible $G_{t+1}$ corresponding to $s'$ and thus represents the overall payoff of the state $s'$. (Viewed this way, line 6 could actually be skipped and line 7 derived directly.)
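
The "analytical solution" claim can be made concrete. In matrix form the Bellman equation reads $V = r + \gamma P V$, so $V = (I - \gamma P)^{-1} r$. Below is a sketch of my own with a made-up reward vector and transition matrix.

```python
import numpy as np

# Made-up example: 3 states, reward vector r(s) and transition matrix P(s'|s).
r = np.array([1.0, 0.0, 2.0])
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
gamma = 0.9

# Bellman equation in matrix form: V = r + gamma * P @ V
# => (I - gamma * P) V = r, solved directly as a linear system.
V = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
print(V)

# Check that V indeed satisfies the Bellman equation.
assert np.allclose(V, r + gamma * P @ V)
```

Solving the linear system directly like this is only practical when the state set is small; for large state spaces, iterative methods are used instead.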

Bellman Expectation Equation

In fact, the Bellman expectation equation is just the Bellman equation above, made more complete by explicitly introducing the action $a$.

From the two relationships between the state value function and the action value function, the Bellman expectation equations for both value functions can be derived.

Equation 1

$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \right]$$
The derivation is as follows. Substituting Relationship 2 into Relationship 1 gives:
$$\begin{aligned} V^{\pi}(s) &= \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a) \\ &= \sum_{a \in A} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \right] \end{aligned}$$
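
Equation 1 can be solved as the same kind of linear system: defining $r_{\pi}(s) = \sum_a \pi(a \mid s)\, r(s, a)$ and $P_{\pi}(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$, it becomes $V^{\pi} = r_{\pi} + \gamma P_{\pi} V^{\pi}$. The sketch below is my own, with made-up numbers.

```python
import numpy as np

# Made-up 2-state, 2-action MDP and a made-up policy.
r_sa = np.array([[1.0, 0.0], [0.5, 2.0]])                  # r(s, a)
P_sa = np.array([[[0.8, 0.2], [0.1, 0.9]],
                 [[0.5, 0.5], [0.3, 0.7]]])                # P(s' | s, a)
pi = np.array([[0.6, 0.4], [0.2, 0.8]])                    # pi(a | s)
gamma = 0.9

# Fold the policy into the reward and transition: r_pi(s), P_pi(s'|s).
r_pi = (pi * r_sa).sum(axis=1)
P_pi = np.einsum('sa,sax->sx', pi, P_sa)

# Equation 1 in matrix form: V = r_pi + gamma * P_pi @ V
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)
assert np.allclose(V, (pi * (r_sa + gamma * P_sa @ V)).sum(axis=1))  # Equation 1 holds
```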

Equation 2

$$Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s')\, Q^{\pi}(s', a')$$

The derivation is as follows. Substituting Relationship 1 into Relationship 2 gives:
$$\begin{aligned} Q^{\pi}(s, a) &= r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \\ &= r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s')\, Q^{\pi}(s', a') \end{aligned}$$
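
Equation 2 likewise characterizes $Q^{\pi}$ as the fixed point of a linear map, and with $\gamma < 1$ the map is a contraction. A simple way to verify the equation numerically (my own sketch, made-up numbers, same MDP and policy as in the previous sketch) is to iterate the right-hand side until it converges.

```python
import numpy as np

# Same made-up MDP and policy as in the previous sketch.
r_sa = np.array([[1.0, 0.0], [0.5, 2.0]])
P_sa = np.array([[[0.8, 0.2], [0.1, 0.9]],
                 [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])
gamma = 0.9

# Iterate Equation 2: Q <- r(s,a) + gamma * sum_{s'} P(s'|s,a) * sum_{a'} pi(a'|s') Q(s',a')
Q = np.zeros_like(r_sa)
for _ in range(1000):
    V = (pi * Q).sum(axis=1)     # Relationship 1: V^pi(s') from the current Q
    Q = r_sa + gamma * P_sa @ V  # Relationship 2 / Equation 2 update
print(Q)
print((pi * Q).sum(axis=1))  # should match the V^pi obtained by solving Equation 1 directly
```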

Bellman Optimality Equation

To be added……
