Reinforcement learning: a detailed explanation of policy evaluation in the policy iteration algorithm

1 Dynamic programming algorithm

       Algorithm background: the policy iteration algorithm is a typical dynamic programming algorithm.

  • Dynamic programming applies to the planning problem of a Markov decision process rather than the learning problem. To use dynamic programming we must know the environment completely, that is, the state transition probabilities and the corresponding rewards (see the sketch after this list).
  • In such a white-box environment, there is no need for the agent to learn through a large amount of interaction with the environment; dynamic programming can be used directly to solve for the state value function. However, white-box environments are rare in practice, which is the main limitation of dynamic programming: we cannot apply it to many practical scenarios. In addition, policy iteration and value iteration are usually applicable only to finite Markov decision processes, that is, processes whose state space and action space are discrete and finite.
  • Dynamic programming is a very effective way to solve prediction and control problems in a Markov decision process.
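       As a concrete (and purely illustrative) picture of what "knowing the environment completely" means, a finite white-box MDP can be written down as plain arrays; the array names and numbers below are assumptions for this sketch, not part of any particular library.

import numpy as np

# A toy white-box MDP with 2 states and 2 actions (illustrative values only).
# P[s, a, s'] = p(s' | s, a)   and   R[s, a] = R(s, a)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# With P and R fully known, dynamic programming needs no interaction with the environment.
assert np.allclose(P.sum(axis=2), 1.0)   # every p(.|s, a) is a proper distribution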

        The policy iteration algorithm has two steps: policy evaluation and policy improvement. This article focuses on the first step, policy evaluation, and explains it in detail together with code.

2 Policy evaluation formula

        Suppose we are currently optimizing a policy π and have obtained the latest policy during the optimization process. We first hold this policy fixed and estimate its value: given the current policy function, we estimate the state value function V. Policy evaluation estimates the value of each state by iterating the value function, as shown in the following formula, which turns the Bellman expectation backup into a dynamic programming iteration.

V^{t+1}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{t}\left(s^{\prime}\right)\right)-(0)

       You may be a little confused when you first see this formula, but its derivation is actually very simple. First, in a Markov decision process, the Bellman equation for the Q function is:

Q_{\pi}(s, a)=R(s,a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)-(1)

       Next, we know that by taking the expectation over the policy \pi(a \mid s), the Q function can be converted into the V function:

V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q_{\pi}(s, a)-(2)

       Substituting equation (1) into equation (2), we get:

V_{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V_{\pi}\left(s^{\prime}\right)\right)-(3)

        Writing equation (3) in iterative form gives exactly equation (0)!
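       Before turning to the grid world, here is a minimal, environment-agnostic sketch of equation (0) as code. The array names pi, P and R are assumptions for illustration: pi[s, a] holds \pi(a \mid s), P[s, a, s'] holds p(s' \mid s, a), and R[s, a] holds R(s, a).

import numpy as np

def policy_evaluation(pi, P, R, gamma=0.9, tol=1e-6):
    # Iterative policy evaluation, i.e. equation (0), for a finite MDP.
    n_states = P.shape[0]
    v = np.zeros(n_states)                 # V^0(s) = 0
    while True:
        # Q^t(s, a) = R(s, a) + gamma * sum_{s'} p(s'|s, a) * V^t(s')
        q = R + gamma * (P @ v)            # shape (S, A)
        # V^{t+1}(s) = sum_a pi(a|s) * Q^t(s, a)
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new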

3 Policy evaluation code

3.1 Background introduction

       Grid World rules: each cell of the grid corresponds to a state of the environment. In each cell, four actions are possible: move north, move south, move east, and move west. Each action deterministically moves the agent one cell in the corresponding direction. If an action would take the agent off the grid, the agent's position stays unchanged and the reward is −1. Apart from these, all actions yield a reward of 0, except the actions taken from the special states A and B: in state A, all four actions yield a reward of +10 and take the agent to A'; in state B, all four actions yield a reward of +5 and take the agent to B'.

3.2 Code

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table

matplotlib.use('Agg')

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9

# left, up, right, down
ACTIONS = [np.array([0, -1]),
           np.array([-1, 0]),
           np.array([0, 1]),
           np.array([1, 0])]
ACTIONS_FIGS = ['←', '↑', '→', '↓']


ACTION_PROB = 0.25  # uniform random policy: pi(a|s) = 0.25 for each of the 4 actions



def step(state, action):
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    next_state = (np.array(state) + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
        next_state = state  # hitting the wall: the agent stays in place
    else:
        reward = 0
    return next_state, reward


def draw_image(image):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = image.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    for (i, j), val in np.ndenumerate(image):

        # add state labels
        if [i, j] == A_POS:
            val = str(val) + " (A)"
        if [i, j] == A_PRIME_POS:
            val = str(val) + " (A')"
        if [i, j] == B_POS:
            val = str(val) + " (B)"
        if [i, j] == B_PRIME_POS:
            val = str(val) + " (B')"
        
        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor='white')
        

    # Row and column labels...
    for i in range(len(image)):
        tb.add_cell(i, -1, width, height, text=i+1, loc='right',
                    edgecolor='none', facecolor='none')
        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
                    edgecolor='none', facecolor='none')

    ax.add_table(tb)



def draw_policy(optimal_values):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = optimal_values.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    for (i, j), val in np.ndenumerate(optimal_values):
        next_vals=[]
        for action in ACTIONS:
            next_state, _ = step([i, j], action)
            next_vals.append(optimal_values[next_state[0],next_state[1]])

        best_actions=np.where(next_vals == np.max(next_vals))[0]
        val=''
        for ba in best_actions:
            val+=ACTIONS_FIGS[ba]
        
        # add state labels
        if [i, j] == A_POS:
            val = str(val) + " (A)"
        if [i, j] == A_PRIME_POS:
            val = str(val) + " (A')"
        if [i, j] == B_POS:
            val = str(val) + " (B)"
        if [i, j] == B_PRIME_POS:
            val = str(val) + " (B')"
        
        tb.add_cell(i, j, width, height, text=val,
                loc='center', facecolor='white')

    # Row and column labels...
    for i in range(len(optimal_values)):
        tb.add_cell(i, -1, width, height, text=i+1, loc='right',
                    edgecolor='none', facecolor='none')
        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
                   edgecolor='none', facecolor='none')

    ax.add_table(tb)


def figure_3_2():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iteration until convergence
        new_value = np.zeros_like(value)
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # bellman equation
                    new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])
        if np.sum(np.abs(value - new_value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('../images/figure_3_2.png')
            plt.close()
            break
        value = new_value


if __name__ == '__main__':
    figure_3_2()

3.3 Code explanation

       The code is rather long, but most of it is visualization code; the core part is really just one line.

new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])

       Compare it with formula (0) again and it suddenly becomes clear. Note that this is a deterministic environment, so the transition probability p(s' | s, a) is always equal to 1.

V^{t+1}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{t}\left(s^{\prime}\right)\right)-(0)
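       To make the correspondence with formula (0) fully explicit, here is a small annotated helper (a sketch, not part of the original script) that applies one backup to a single state; it assumes step, ACTIONS, ACTION_PROB and DISCOUNT from the script in section 3.2 are in scope.

def backup(value, i, j):
    # One application of formula (0) to the single state s = (i, j).
    v_new = 0.0
    for action in ACTIONS:                                 # the sum over a
        (next_i, next_j), reward = step([i, j], action)    # deterministic env: p(s'|s,a) = 1
        # pi(a|s) = ACTION_PROB = 0.25 for the uniform random policy
        v_new += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])
    return v_new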

       If it is still not clear, read on. To see exactly how the value of each cell is updated, let us combine the formula with a few concrete cells.

3.4 First update

      Let's look at the formula and see how the value is updated step by step.

      First, given an action, the state transition here is deterministic, that is, p\left(s^{\prime} \mid s, a\right)=1. For example, if the agent is in the cell in row 1, column 1 and moves right, it will certainly reach the cell in row 1, column 2. In many environments, however, transitions are stochastic: for example, if the floor is slippery, an agent in row 1, column 1 that chooses to go right may instead slide down to the cell in row 2, column 1.
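       As an aside, a hypothetical "slippery" variant of the step function from section 3.2 might look like the sketch below; it is not used anywhere in this article, it only illustrates what a stochastic p(s' | s, a) would mean here.

def slippery_step(state, action_index):
    # Hypothetical stochastic dynamics: 80% of the time the intended action is
    # executed, 20% of the time the agent slips to the next action in ACTIONS,
    # so p(s'|s, a) is no longer always 0 or 1.
    if np.random.random() < 0.8:
        chosen = ACTIONS[action_index]
    else:
        chosen = ACTIONS[(action_index + 1) % len(ACTIONS)]
    return step(state, chosen)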

V^{t+1}(s)=\sum_{a \in A} \pi(a \mid s)\left(R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{t}\left(s^{\prime}\right)\right)

For the (1,1) grid.

  • Left = 0.25 * (-1 + 0.9 * 1 * 0) = -0.25. Going left hits the wall, so the reward is -1 and the agent stays where it is; the last term V^{t}\left(s^{\prime}\right) is 0.
  • Up = 0.25 * (-1 + 0.9 * 1 * 0) = -0.25. Going up also hits the wall, so the reward is -1 and the agent stays where it is; the last term V^{t}\left(s^{\prime}\right) is 0.
  • Right = 0.25 * (0 + 0.9 * 1 * 0) = 0.
  • Down = 0.25 * (0 + 0.9 * 1 * 0) = 0.
  • Adding these four terms gives a value of -0.5 for the (1,1) grid.

       Now calculate the (1,2) grid, i.e. state A. No matter what action the agent takes, it receives R(s, a)=10 and is transferred to the grid in the fifth row and second column, so V^{t}\left(s^{\prime}\right)=V(5,2)=0.

  • Left = 0.25 * (10 + 0.9 * 1 * 0) = 2.5
  • Up = 0.25 * (10 + 0.9 * 1 * 0) = 2.5
  • Right = 0.25 * (10 + 0.9 * 1 * 0) = 2.5
  • Down = 0.25 * (10 + 0.9 * 1 * 0) = 2.5
  • Adding these four terms gives a value of 10 for the (1,2) grid.

       Readers can try to calculate the values of the other grids themselves and verify them against the resulting figure, or with the quick check below.
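       As a quick check of these hand calculations, the following snippet performs one synchronous sweep over all states using the backup helper sketched in section 3.3 (so it assumes that helper and the script from section 3.2 are in scope).

value = np.zeros((WORLD_SIZE, WORLD_SIZE))        # V^0
v1 = np.array([[backup(value, i, j) for j in range(WORLD_SIZE)]
               for i in range(WORLD_SIZE)])       # V^1 after the first update
print(v1[0, 0], v1[0, 1])                         # -0.5 for (1,1) and 10.0 for (1,2), i.e. state A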

3.5 Second update

       Next comes the second round of updates. It is easy to make mistakes here when first learning, so try computing it yourself as well.

       For the (1,1) grid.

  • Left = 0.25 * (-1 + 0.9 * 1 * -0.5) = -0.3625, with V^{t}\left(s^{\prime}\right)=V(1,1)=-0.5
  • Up = 0.25 * (-1 + 0.9 * 1 * -0.5) = -0.3625, with V^{t}\left(s^{\prime}\right)=V(1,1)=-0.5
  • Right = 0.25 * (0 + 0.9 * 1 * 10) = 2.25, with V^{t}\left(s^{\prime}\right)=V(1,2)=10
  • Down = 0.25 * (0 + 0.9 * 1 * -0.25) = -0.05625, with V^{t}\left(s^{\prime}\right)=V(2,1)=-0.25
  • Adding these four terms gives 1.46875.

      Now calculate the (1,2) grid (state A) again; a quick code check follows the list below.

  • Left = 0.25 * (10 + 0.9 * 1 * -0.25) = 2.44375, with V^{t}\left(s^{\prime}\right)=V(5,2)=-0.25
  • Up = 0.25 * (10 + 0.9 * 1 * -0.25) = 2.44375, with V^{t}\left(s^{\prime}\right)=V(5,2)=-0.25
  • Right = 0.25 * (10 + 0.9 * 1 * -0.25) = 2.44375, with V^{t}\left(s^{\prime}\right)=V(5,2)=-0.25
  • Down = 0.25 * (10 + 0.9 * 1 * -0.25) = 2.44375, with V^{t}\left(s^{\prime}\right)=V(5,2)=-0.25
  • Adding these four terms gives 9.775.
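       These numbers can be checked in code as well. Continuing from the snippet at the end of section 3.4 (so v1 holds the values after the first update), one more sweep gives the values after the second update.

v2 = np.array([[backup(v1, i, j) for j in range(WORLD_SIZE)]
               for i in range(WORLD_SIZE)])       # V^2 after the second update
print(v2[0, 0], v2[0, 1])                         # approximately 1.46875 and 9.775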
