Markov decision processes in reinforcement learning: a review of common formulas

0. Basic knowledge

0.1 Bellman equation:

V(s)=R(s)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right)

       This formula is the core of reinforcement learning.

       Here, s′ denotes some future state, and p(s′|s) is the probability of transitioning from the current state s to that future state. V(s′) is the value of that future state. Starting from the current state, there is some probability of reaching each possible future state, which is why we weight by p(s′|s). The resulting future values are multiplied by γ, which discounts future rewards. The term after the plus sign can therefore be read as the discounted sum of future rewards.

       The Bellman equation relates the value of the current state to the values of future states: the immediate reward plus the discounted sum of future rewards gives the value of the current state.
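       As a small illustration, the following sketch (Python with NumPy; the transition matrix P, reward vector R, and discount gamma are made-up example values, not from this article) solves this Bellman equation for a tiny Markov reward process, both by fixed-point iteration and in closed form.

```python
import numpy as np

# Hypothetical 3-state Markov reward process (example values only).
P = np.array([[0.7, 0.2, 0.1],    # row s holds p(s' | s)
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, -1.0])    # immediate reward R(s)
gamma = 0.9

# Fixed-point iteration on V(s) = R(s) + gamma * sum_s' p(s'|s) V(s').
V = np.zeros(3)
for _ in range(1000):
    V = R + gamma * P @ V

# The same equation can also be solved directly: V = (I - gamma * P)^{-1} R.
V_closed = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V, V_closed)
```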

       The Q function satisfies an analogous Bellman equation:

Q_{\pi}(s, a)=R(s,a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)
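       Given a value function V_π and the model, one application of this equation yields Q_π. A minimal sketch follows; the arrays p, r, gamma, and V_pi are hypothetical example values.

```python
import numpy as np

# Hypothetical model with 2 states and 2 actions (example values only).
p = np.array([[[0.9, 0.1], [0.3, 0.7]],   # p[a, s, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                 # r[s, a] = R(s, a)
              [0.5, 2.0]])
gamma = 0.9
V_pi = np.array([3.0, 5.0])               # assume V_pi has already been computed

# Q_pi(s, a) = R(s, a) + gamma * sum_s' p(s'|s,a) V_pi(s')
Q_pi = r + gamma * np.einsum('ast,t->sa', p, V_pi)
```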

1. Markov decision process

1.1 Definition of state transition function and reward function

       When the policy function is known, i.e., the probability of each action being taken in each state is given, we can sum over the actions and marginalize out a. This yields the transition function of a Markov reward process, in which no action appears.

P_\pi\left(s^{\prime} \mid s\right)=\sum_{a \in A} \pi(a \mid s) p\left(s^{\prime} \mid s, a\right)

r_\pi(s)=\sum_{a \in A} \pi(a \mid s) R(s, a)

        Note that the quantities here all carry the subscript π: they are the state transition matrix, reward function, value function, and action-value function induced by the policy π in the Markov decision process.
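        A minimal sketch of this marginalization, assuming hypothetical arrays p (indexed as p[a, s, s']), R (indexed as R[s, a]), and a policy pi (indexed as pi[s, a]); all values are made up for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP and a fixed policy (example values only).
p = np.array([[[0.9, 0.1], [0.3, 0.7]],   # p[a, s, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = pi(a | s)
               [0.2, 0.8]])

# P_pi(s' | s) = sum_a pi(a | s) p(s' | s, a)
P_pi = np.einsum('sa,ast->st', pi, p)
# r_pi(s) = sum_a pi(a | s) R(s, a)
r_pi = np.einsum('sa,sa->s', pi, R)
```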

1.2 Definition of the value function and the action-value function
 

V_{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s\right]

Q_{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s, a_{t}=a\right]
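      These definitions say that V_π and Q_π are expected discounted returns. As an illustration, here is a small sketch of a Monte Carlo estimate of V_π(s) from a few hypothetical reward sequences sampled starting in state s; the episodes and gamma are made-up example values.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical reward sequences of episodes that all start in the same state s.
episodes = [[1.0, 0.0, 2.0],
            [0.0, 1.0, 1.0],
            [2.0, 0.0, 0.0]]
gamma = 0.9

# Monte Carlo estimate of V_pi(s) = E_pi[G_t | s_t = s].
V_estimate = np.mean([discounted_return(ep, gamma) for ep in episodes])
```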

     

1.3 The relationship between Q and V

       Taking the policy-weighted sum of the Q function over actions gives the value function.

V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q_{\pi}(s, a)
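      A one-line sketch of this weighting, with a hypothetical policy pi[s, a] and action-value table Q_pi[s, a] (example values only):

```python
import numpy as np

pi = np.array([[0.6, 0.4],        # pi[s, a] = pi(a | s)
               [0.2, 0.8]])
Q_pi = np.array([[3.0, 1.0],      # Q_pi[s, a]
                 [0.5, 4.0]])

# V_pi(s) = sum_a pi(a | s) Q_pi(s, a)
V_pi = np.sum(pi * Q_pi, axis=1)
```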

1.4 Writing the value function and action-value function in recursive form

      The relationship between the value of the current state and the values of future states.

V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)\right)

      The relationship between the Q function at the current time step and the Q function at the next time step.

Q_{\pi}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q_{\pi}\left(s^{\prime}, a^{\prime}\right)
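      Applying these two equations repeatedly gives iterative policy evaluation. A minimal sketch, reusing the hypothetical model and policy from the sketches above (all values are made up):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP and a fixed policy (example values only).
p = np.array([[[0.9, 0.1], [0.3, 0.7]],   # p[a, s, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = pi(a | s)
               [0.2, 0.8]])
gamma = 0.9

# Iterative policy evaluation: sweep the Bellman expectation equations to a fixed point.
V = np.zeros(2)
for _ in range(1000):
    # Q_pi(s, a) = R(s, a) + gamma * sum_s' p(s'|s,a) V(s')
    Q = R + gamma * np.einsum('ast,t->sa', p, V)
    # V_pi(s) = sum_a pi(a | s) Q_pi(s, a)
    V = np.sum(pi * Q, axis=1)
```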

2. Bellman optimality equation

2.1 Definition of the optimal state-value function and the optimal action-value function

      Optimal state value function.

V^*(s)=\max _\pi V_{\pi}(s), \quad \forall s \in \mathcal{S}

      Optimal action value function.

Q^*(s, a)=\max _\pi Q_{\pi}(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}

2.2 The relationship between the two

      When we repeatedly apply the arg max operation, the policy improves monotonically: the greedy operation (arg max) yields a policy that is better than or equal to the previous one, and never makes the value function worse. Therefore, when the improvement stops, we have obtained an optimal policy. At that point, the policy takes the action that maximizes the Q function, so the Q function evaluated at that action equals the value function.

Q_{\pi}\left(s, \pi^{*}(s)\right)=\max _{a \in A} Q_{\pi}(s, a)=Q_{\pi}(s, \pi(s))=V_{\pi}(s)

      From this we obtain the relationship between the optimal state-value function and the optimal action-value function: the value of a state under the optimal policy equals the expected return of taking the best action in that state.

V^{*}(s)=\max _{a} Q^{*}(s, a)

      The Bellman equation for the optimal Q function, written in terms of V*.

Q^*(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^*\left(s^{\prime}\right)

2.3 Bellman optimality equation

      Written entirely in terms of the V function, this is the Bellman optimality equation for the V function.

V^{*}(s)=\max_{a} \left(R(s,a) + \gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{*}\left(s^{\prime}\right)\right)    

      Written entirely in terms of the Q function (the form Q-learning is based on), this is the Bellman optimality equation for the Q function.

Q^{*}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \max _{a^{\prime}} Q^{*}\left(s^{\prime}, a^{\prime}\right)

     When V^{k+1} and V^{k} are the same, we have reached a fixed point of the Bellman optimality equation, which corresponds to the optimal state-value function V^{*}. Once the iteration has converged, the optimal policy is extracted from it:

\pi(s)=\underset{a}{\arg \max } \left[R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{k+1}\left(s^{\prime}\right)\right]
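     Putting the last two steps together gives value iteration followed by greedy policy extraction. A minimal sketch on the same hypothetical model used in the earlier sketches (example values only):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (example values only).
p = np.array([[[0.9, 0.1], [0.3, 0.7]],   # p[a, s, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.5, 2.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality equation for V.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', p, V)   # Q(s, a) from current V
    V_new = np.max(Q, axis=1)                      # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:           # fixed point: V^{k+1} == V^k
        V = V_new
        break
    V = V_new

# Extract the greedy (optimal) policy from the converged values.
pi_star = np.argmax(R + gamma * np.einsum('ast,t->sa', p, V), axis=1)
```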

