In-depth understanding of reinforcement learning - Markov decision process: policy iteration - [Basic knowledge]

Category catalog: General Catalog of "In-depth Understanding of Reinforcement Learning"


Policy iteration consists of two steps: policy evaluation and policy improvement. As shown in Figure (a) below, the first step is policy evaluation. We are currently optimizing a policy $\pi$, and the optimization process produces a latest policy. We first keep this policy fixed and estimate its value; that is, given the current policy function, we estimate the state value function. The second step is policy improvement. After obtaining the state value function, we can further compute the Q function. Once we have the Q function, we maximize it directly: by doing a greedy search over the Q function, we further improve the policy. These two steps are carried out iteratively.

As shown in Figure (b) below, in policy iteration we initialize a state value function $V$ and a policy $\pi$, and then iterate between the two steps. The upper line in Figure (b) is the value of the current state value function, and the lower line is the value of the policy. The process of policy iteration is like kicking a ball back and forth. We first take the current policy function and compute its state value function. From the state value function we obtain a Q function. We then act greedily with respect to the Q function, which "kicks" the ball back to the policy and improves it. The improved policy is still not the best policy, so we evaluate it again and obtain a new value function, and based on this new value function we again maximize the Q function. With gradual iteration, the state value function and the policy converge.
Policy iteration
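
To make the two alternating steps concrete, below is a minimal policy iteration sketch in Python for a small tabular MDP. The MDP itself (the transition tensor `P`, the reward matrix `R`, and the discount `gamma`) is a made-up placeholder for illustration, not an example from the text.

```python
import numpy as np

# A tiny made-up MDP: 3 states, 2 actions (placeholder values for illustration).
n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = p(s' | s, a); each P[s, a] row sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])

def policy_evaluation(pi, tol=1e-8):
    """Estimate V_pi for a fixed deterministic policy pi (policy evaluation step)."""
    V = np.zeros(n_states)
    while True:
        # Bellman expectation backup under the fixed policy pi.
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(V):
    """Greedy policy w.r.t. the Q function induced by V (policy improvement step)."""
    Q = R + gamma * P @ V  # Q(s, a) = R(s, a) + gamma * sum_{s'} p(s'|s,a) V(s')
    return np.argmax(Q, axis=1)

# Alternate evaluation and improvement until the policy stops changing.
pi = np.zeros(n_states, dtype=int)
while True:
    V = policy_evaluation(pi)
    new_pi = policy_improvement(V)
    if np.array_equal(new_pi, pi):
        break
    pi = new_pi

print("converged policy:", pi)
print("state values:   ", np.round(V, 3))
```

The outer loop stops once the greedy policy no longer changes, which is exactly the point at which the value function and the policy have converged, as described above.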
Let's take a look at the second step, policy improvement, to see how the policy is improved. After obtaining the state value function, we can compute the Q function from the reward function and the state transition function:
$$Q_{\pi_i}(s, a)=R(s, a)+\gamma \sum_{s' \in S} p\left(s' \mid s, a\right) V_{\pi_i}\left(s'\right)$$
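
Written as code, this computation is a single batched product. Here is a minimal sketch, assuming `R`, `P`, `V`, and `gamma` are shaped as in the sketch above (these names are illustrative placeholders, not fixed by the text):

```python
import numpy as np

def q_from_v(R, P, V, gamma):
    """Q_{pi_i}(s, a) = R(s, a) + gamma * sum_{s'} p(s' | s, a) V_{pi_i}(s')."""
    # R: (n_states, n_actions), P: (n_states, n_actions, n_states), V: (n_states,)
    return R + gamma * P @ V  # result has shape (n_states, n_actions)
```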

Policy improvement then produces a new policy: for each state, we take the action that yields the maximum Q value, that is:
$$\pi_{i+1}(s)=\arg\max_{a} Q_{\pi_i}(s, a)$$

As shown in the figure below, we can think of the Q function as a Q-table: the horizontal axis lists all the states and the vertical axis lists the possible actions. Once we have the Q function, we also have the Q-table. For a given state, we take the maximum value in its column, and the action corresponding to that maximum is the action to take in that state. So the $\arg\max$ operation means choosing, in each state, the action that maximizes the Q value of that column.
[Figure: the Q-table, with states on the horizontal axis and actions on the vertical axis]
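
To make the $\arg\max$ concrete, here is a small made-up Q-table (the numbers are illustrative, not taken from the figure; the array is laid out with states as rows and actions as columns, i.e. the transpose of the figure's layout) and the greedy policy it induces:

```python
import numpy as np

# Made-up Q-table: rows are states, columns are actions.
Q = np.array([[0.5, 1.2, 0.3],   # state 0
              [2.0, 0.1, 0.7],   # state 1
              [0.0, 0.4, 0.9]])  # state 2

# For each state, pick the action with the largest Q value.
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)  # -> [1 0 2]
```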

