[Reinforcement Learning Theory] Dynamic Programming Algorithm


The following are my own summary notes, not official conclusions.

Two representative dynamic programming algorithms

① Policy iteration;

② Value iteration.

Application Conditions of Dynamic Programming Algorithms

① A complete MDP model is available, i.e. the reward function $r$ and the state transition function $p$ are known explicitly. If they are not known, the agent can only interact with the environment to collect trajectories and estimate them from samples (a minimal toy example of such a fully specified model is sketched after this list).

② The state space and action space are discrete and finite.
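As a concrete illustration of such a fully specified model, here is a minimal sketch in Python. The states, actions, transition probabilities and rewards below are hypothetical toy values chosen only for demonstration; they are not from the original post.

```python
# Toy MDP with discrete, finite state and action spaces.
# All states, actions, probabilities and rewards below are hypothetical.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
gamma = 0.9  # discount factor

# p[(s, a)] lists the possible next states with their probabilities,
# i.e. a tabular version of the state transition function p(s'|s, a).
p = {
    ("s0", "left"):  [("s0", 1.0)],
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s2", 0.9), ("s1", 0.1)],
    ("s2", "left"):  [("s2", 1.0)],
    ("s2", "right"): [("s2", 1.0)],
}

# r[(s, a)] is the expected immediate reward for taking action a in state s.
r = {
    ("s0", "left"): 0.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.0, ("s1", "right"): 1.0,
    ("s2", "left"): 0.0, ("s2", "right"): 0.0,
}
```

With $p$ and $r$ written out as tables like this, both dynamic programming algorithms below can be run purely from the model, without ever sampling from the environment.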

The differences between the two dynamic programming algorithms

① Policy iteration consists of two parts: policy evaluation (many sweeps) + policy improvement (once). The two parts form one round, and rounds are repeated until the policy stops changing, i.e. $\pi_{\tau-1} = \pi_{\tau}$.

Policy evaluation may take multiple sweeps. This step computes the value of each state under the current policy, yielding $V_{\pi}(s)$.

Then comes policy improvement (one improvement after the multiple sweeps of policy evaluation): policy improvement computes $q(s,a)$ for every action in each state $s$ and designates the action with the largest $q(s,a)$ as the action taken in $s$. This designation step is policy improvement (a runnable sketch of the whole evaluate-then-improve loop is given at the end of this subsection).

Principles used: the Bellman expectation equation, the relationship between the state value function and the action value function, and the policy improvement theorem.

Advantages: the process is easy to understand.

Disadvantages: each improvement requires many evaluation sweeps, and this evaluate-then-improve cycle repeats many times, so a large state space and action space consumes a lot of computation and time; an initial policy $\pi$ must be provided in advance.
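As a rough illustration of the loop described above, here is a minimal policy iteration sketch in Python. It assumes the toy tables `states`, `actions`, `gamma`, `p` and `r` from the earlier sketch; the function names and the tolerance `theta` are my own illustrative choices, not the original author's code.

```python
def policy_evaluation(pi, V, theta=1e-6):
    """Sweep the Bellman expectation backup under policy pi until the
    largest value change in a sweep falls below theta."""
    while True:
        delta = 0.0
        for s in states:
            a = pi[s]
            # V_pi(s) = r(s, a) + gamma * sum_{s'} p(s'|s, a) * V_pi(s')
            new_v = r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


def policy_improvement(V):
    """Make the policy greedy with respect to q(s, a) computed from V."""
    new_pi = {}
    for s in states:
        q = {a: r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)])
             for a in actions}
        new_pi[s] = max(q, key=q.get)  # action with the largest q(s, a)
    return new_pi


def policy_iteration():
    pi = {s: actions[0] for s in states}  # an initial policy is required
    V = {s: 0.0 for s in states}
    while True:
        V = policy_evaluation(pi, V)    # policy evaluation: many sweeps
        new_pi = policy_improvement(V)  # policy improvement: once
        if new_pi == pi:                # stop when pi_{tau-1} == pi_tau
            return new_pi, V
        pi = new_pi
```

Note how `policy_evaluation` runs many sweeps before a single call to `policy_improvement`, and how the outer loop only stops once the improved policy no longer changes.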

② Value iteration also consists of two parts: policy evaluation (once) + policy improvement (once). The two parts form one round, and rounds are repeated until the policy $\pi_{\tau-1} = \pi_{\tau}$.

Each time the value $v(s)$ of a state is updated, $q(s,a)$ is computed for all actions, and the largest $q(s,a)$ is taken directly as the new $v(s)$ (a sketch is given at the end of this subsection).

Principle used: the Bellman optimality equation.

Advantages: the number of iteration rounds is smaller than in policy iteration; no initial policy $\pi$ is needed in advance (because the computation never uses the probability of choosing each action in the current state).

Disadvantage: the policy $\pi$ is only obtained at the very end, by deriving it from the final value function.
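For comparison, here is a minimal value iteration sketch under the same assumptions (the toy `states`, `actions`, `gamma`, `p` and `r` from the first sketch); again the function name and tolerance are illustrative choices. It uses only the Bellman optimality backup and extracts the policy once, at the very end.

```python
def value_iteration(theta=1e-6):
    """Apply the Bellman optimality backup until convergence, then
    extract the greedy policy from the final value function."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- max_a [ r(s, a) + gamma * sum_{s'} p(s'|s, a) * v(s') ]
            q = {a: r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)])
                 for a in actions}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # The policy is only derived here, at the end, greedily from V.
    pi = {}
    for s in states:
        q = {a: r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)])
             for a in actions}
        pi[s] = max(q, key=q.get)
    return pi, V
```

Calling `pi, V = value_iteration()` returns the greedy policy and the converged value function; unlike policy iteration, no initial policy appears anywhere before the final extraction step.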


Origin blog.csdn.net/Mocode/article/details/130591534