Reinforcement learning notes: policy iteration for policy-based learning (Python implementation)

Table of contents

 

1. Introduction

2. Algorithm process

3. Code and simulation results

3.1 class PolicyIterationPlanner()

3.2 Test code

3.3 Running results

3.3.1 Value estimation results

3.3.2 The final policy obtained by policy iteration


1. Introduction


        In reinforcement learning, methods can be divided into model-based learning and model-free learning, depending on whether they rely on a model of the environment. According to the criterion used to decide actions, they can also be divided into value-based learning and policy-based learning.

        In value-based learning, action decisions are made according to the state-value function or the action-value function: from the current state, the agent always chooses the action leading to the reachable next state with the highest value.
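
        For intuition only, value-based greedy action selection can be sketched as follows. This is a minimal, hypothetical snippet: next_state_of (a deterministic transition function) and the value table V are placeholders for illustration and are not part of the planner code below.

def greedy_action(state, actions, next_state_of, V):
    # Choose the action whose successor state has the highest estimated value.
    # next_state_of(state, action) and V (a dict mapping states to values)
    # are hypothetical stand-ins for an environment model.
    return max(actions, key=lambda a: V[next_state_of(state, a)])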

        The previous two articles introduced, respectively, solving the Bellman equation directly for value calculation and approximating the value function iteratively (value iteration):

        Reinforcement learning notes: value calculation in value-based learning (Python implementation)

        Reinforcement learning notes: value iteration in value-based learning (Python implementation)

        This article goes one step further and introduces the principle and implementation of the policy iteration algorithm in policy-based learning.

2. Algorithm process

        Policy iteration consists of two alternating steps: policy evaluation and policy improvement.

        Policy evaluation means evaluating the value function under the current policy; this evaluation can itself be solved iteratively. Because it averages over the policy's action distribution, it computes an expected value, which differs slightly from the value estimation used in value iteration. (As shown below, however, the policy produced by policy iteration is a deterministic policy, so this expectation degenerates into the value of the single action selected by the current policy.) When policy evaluation is performed in each iteration, the value function is initialized with the value function of the previous policy; this usually speeds up the convergence of policy evaluation, because the value functions of two adjacent policies differ only slightly.
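
        Written as a formula, the iterative policy evaluation update (the Bellman expectation backup) is

V_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s'} p(s'|s,a) [ r(s,a,s') + \gamma V_k(s') ]

        and the sweep is repeated until the largest change in V(s) falls below a given threshold.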

        Policy improvement refers to updating the policy whenever the action derived from the current policy disagrees with the best action according to the estimated values.
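
        The corresponding greedy policy improvement step selects, in each state, the action with the highest one-step lookahead value:

\pi'(s) = \arg\max_{a} \sum_{s'} p(s'|s,a) [ r(s,a,s') + \gamma V_{\pi}(s') ]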

        The policy iteration procedure given in Sutton's book ([1], Section 4.3) can be summarized in plain language as follows:

        Value and policy initialization: the value function is initialized to 0, and in each state every action is assigned equal probability (the initialization could in fact be arbitrary, but this is a reasonable choice).

        Iterate until convergence (a self-contained toy example is given after this list):

  1. Evaluate the state values (expected values) under the current policy. In practice, since the policy in policy iteration is deterministic, this expectation degenerates into the state value of the action the policy selects.
  2. Update the policy based on the value-maximization criterion. For each state:
    1. Get the action prescribed by the current policy (i.e., the action with the highest probability): action_by_policy
    2. Evaluate the action value Q(s, a) of every action and take the action with the highest value: best_action_by_value
    3. Check whether the two agree (action_by_policy == best_action_by_value?)
      1. If they agree in every state, stop iterating
      2. Otherwise, update the policy greedily (is greediness the only option? see the discussion at the end): set policy[s][a] to 1 for the best action in each state s and to 0 for the rest
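
        To make the loop above concrete, here is a self-contained toy example of the same procedure on a hypothetical two-state MDP (the states, actions, and transitions are invented purely for illustration; the actual implementation on the grid environment is given in section 3.1):

GAMMA = 0.9
THRESHOLD = 1e-4
states = ['s0', 's1']
actions = ['stay', 'go']
# transitions[(s, a)] = list of (prob, next_state, reward)
transitions = {
    ('s0', 'stay'): [(1.0, 's0', 0.0)],
    ('s0', 'go'):   [(1.0, 's1', 1.0)],
    ('s1', 'stay'): [(1.0, 's1', 2.0)],
    ('s1', 'go'):   [(1.0, 's0', 0.0)],
}
# Start from a uniform random policy.
policy = {s: {a: 1 / len(actions) for a in actions} for s in states}

def evaluate_policy():
    # Iterative policy evaluation (Bellman expectation backup).
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(policy[s][a] * prob * (r + GAMMA * V[ns])
                    for a in actions
                    for prob, ns, r in transitions[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < THRESHOLD:
            return V

while True:
    V = evaluate_policy()                                   # 1. policy evaluation
    stable = True
    for s in states:                                        # 2. policy improvement
        action_by_policy = max(policy[s], key=policy[s].get)
        q = {a: sum(prob * (r + GAMMA * V[ns])
                    for prob, ns, r in transitions[(s, a)])
             for a in actions}
        best_action_by_value = max(q, key=q.get)
        if action_by_policy != best_action_by_value:
            stable = False
        for a in actions:                                   # greedy update
            policy[s][a] = 1 if a == best_action_by_value else 0
    if stable:
        break

print(policy)  # expected: 's0' -> 'go', 's1' -> 'stay'

        This should converge after two improvement rounds to the deterministic policy shown in the final print statement.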

        More simply, the process can be represented by the following schematic diagram:

        The relationship between policy iteration and value iteration [3]:

        Policy iteration contains value iteration as an ingredient: policy evaluation is based on value estimation, and that value estimation is itself carried out iteratively. Conversely, the value iteration algorithm can be viewed as a policy iteration algorithm in which the policy evaluation step performs only a single sweep.
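
        In update-rule form, value iteration simply replaces the policy-weighted expectation of the evaluation backup above with a direct maximum,

V_{k+1}(s) = \max_{a} \sum_{s'} p(s'|s,a) [ r(s,a,s') + \gamma V_k(s') ]

        and performs only one such sweep before (implicitly) improving the policy.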

3. Code and simulation results

3.1 class PolicyIterationPlanner()

        estimate_by_policy() estimates the expected state values under the current policy; these values are then used for the value-based best-action selection in step 2. It differs slightly from the value estimation in ValueIterationPlanner: here the expectation over the policy's action distribution is estimated (although, since the resulting policy is deterministic, this expectation degenerates into the value of the action chosen by the current policy), whereas ValueIterationPlanner directly estimates the value of the best action.

        plan() implements the procedure described in the previous section. During the iteration, the value estimates after each round are printed out to make the results easier to observe and understand.

        print_policy() is used to print out the final policy.

class PolicyIterationPlanner(Planner):

    def __init__(self, env):
        super().__init__(env)
        self.policy = {}

    def initialize(self):
        super().initialize()
        self.policy = {}
        actions = self.env.actions
        states = self.env.states
        for s in states:
            self.policy[s] = {}
            for a in actions:
                # Initialize policy.
                # At first, each action is taken uniformly.
                self.policy[s][a] = 1 / len(actions)

    def estimate_by_policy(self, gamma, threshold):
        V = {}
        for s in self.env.states:
            # Initialize each state's expected reward.
            V[s] = 0

        while True:
            delta = 0
            for s in V:
                expected_rewards = []
                for a in self.policy[s]:
                    action_prob = self.policy[s][a]
                    r = 0
                    for prob, next_state, reward in self.transitions_at(s, a):
                        r += action_prob * prob * \
                             (reward + gamma * V[next_state])
                    expected_rewards.append(r)
                value = sum(expected_rewards)
                delta = max(delta, abs(value - V[s]))
                V[s] = value
            if delta < threshold:
                break

        return V

    def plan(self, gamma=0.9, threshold=0.0001):
        self.initialize()
        states = self.env.states
        actions = self.env.actions

        def take_max_action(action_value_dict):
            return max(action_value_dict, key=action_value_dict.get)

        while True:
            update_stable = True
            # Estimate expected rewards under current policy.
            V = self.estimate_by_policy(gamma, threshold)
            self.log.append(self.dict_to_grid(V))

            for s in states:
                # Get the action prescribed by the current policy.
                policy_action = take_max_action(self.policy[s])

                # Compare with other actions.
                action_rewards = {}
                for a in actions:
                    r = 0
                    for prob, next_state, reward in self.transitions_at(s, a):
                        r += prob * (reward + gamma * V[next_state])
                    action_rewards[a] = r
                best_action = take_max_action(action_rewards)
                if policy_action != best_action:
                    update_stable = False

                # Update policy (set best_action prob=1, otherwise=0 (greedy))
                for a in self.policy[s]:
                    prob = 1 if a == best_action else 0
                    self.policy[s][a] = prob

            # Turn dictionary to grid
            self.V_grid = self.dict_to_grid(V)
            self.iters = self.iters + 1
            print('PolicyIteration: iters = {0}'.format(self.iters))
            self.print_value_grid()
            print('******************************')

            if update_stable:
                # If policy isn't updated, stop iteration
                break

    def print_policy(self):
        print('PolicyIteration: policy = ')
        actions = self.env.actions
        states = self.env.states
        for s in states:
            print('\tstate = {}'.format(s))
            for a in actions:
                print('\t\taction = {0}, prob = {1}'.format(a, self.policy[s][a]))

3.2 Test code

        The test code is as follows (for comparison, ValueIterationPlanner has been slightly modified so that its intermediate results are also printed):

if __name__ == "__main__":

    # Create grid environment
    grid = [
        [0, 0, 0, 1],
        [0, 9, 0, -1],
        [0, 0, 0, 0]
    ]
        
    env1 = Environment(grid)
    valueIterPlanner = ValueIterationPlanner(env1)
    valueIterPlanner.plan(0.9,0.001)
    valueIterPlanner.print_value_grid()

    env2 = Environment(grid)
    policyIterPlanner = PolicyIterationPlanner(env2)
    policyIterPlanner.plan(0.9,0.001)
    policyIterPlanner.print_value_grid()    
    policyIterPlanner.print_policy()    

3.3 Running results

3.3.1 Value estimation results

ValueIteration: iters = 1
-0.040 -0.040 0.792 0.000
-0.040 0.000 0.434 0.000
-0.040 -0.040 0.269 0.058
******************************
......
ValueIteration: iters = 10
 0.610 0.766 0.928 0.000
 0.487 0.000 0.585 0.000
 0.374 0.327 0.428 0.189
******************************

PolicyIteration: iters = 1
-0.270 -0.141 0.102 0.000
-0.345 0.000 -0.487 0.000
-0.399 -0.455 -0.537 -0.728
******************************
......
PolicyIteration: iters = 4
 0.610 0.766 0.928 0.000
 0.487 0.000 0.585 0.000
 0.374 0.327 0.428 0.189
******************************

        The final value estimates obtained by value iteration and policy iteration are identical, but policy iteration converged in only 4 iterations, while value iteration needed 10.

3.3.2 The final policy obtained by policy iteration

        The final policy result of policy iteration is as follows:

PolicyIteration: policy =
        state = <State: [0, 0]>
                action = Action.UP, prob = 0
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 1
        state = <State: [0, 1]>
                action = Action.UP, prob = 0
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 1
        state = <State: [0, 2]>
                action = Action.UP, prob = 0
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 1
        state = <State: [0, 3]>
                action = Action.UP, prob = 1
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 0
        state = <State: [1, 0]>
                action = Action.UP, prob = 1
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 0
        state = <State: [1, 2]>
                action = Action.UP, prob = 1
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 0
        state = <State: [1, 3]>
                action = Action.UP, prob = 1
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 0
        state = <State: [2, 0]>
                action = Action.UP, prob = 1
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 0
        state = <State: [2, 1]>
                action = Action.UP, prob = 0
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 1
        state = <State: [2, 2]>
                action = Action.UP, prob = 1
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 0
                action = Action.RIGHT, prob = 0
        state = <State: [2, 3]>
                action = Action.UP, prob = 0
                action = Action.DOWN, prob = 0
                action = Action.LEFT, prob = 1
                action = Action.RIGHT, prob = 0

        This result can be illustrated by the figure in [4] (ignore the entries for states [0, 3] and [1, 3] in the results above).

【Thoughts】

        Must the policy update in the above policy iteration be done greedily? With a greedy update, the final policy is necessarily a deterministic policy: in each state, the single best action has probability 1 and all other actions have probability 0.
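
        As an aside (this is not part of the implementation above), one non-greedy alternative would be a softmax (Boltzmann) update over the action values, which keeps the policy stochastic. A minimal sketch, reusing the action_rewards dictionary that plan() already computes:

import math

def softmax_policy_update(policy, s, action_rewards, temperature=1.0):
    # Softmax update: higher-valued actions get higher probability, but every
    # action keeps a non-zero probability, so the policy remains stochastic
    # instead of collapsing to a single action.
    exps = {a: math.exp(action_rewards[a] / temperature) for a in action_rewards}
    total = sum(exps.values())
    for a in action_rewards:
        policy[s][a] = exps[a] / total

        Note, however, that the standard convergence argument for policy iteration relies on greedy improvement, so such a variant changes the algorithm's properties.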

Reinforcement learning notes: general catalog of the reinforcement learning notes series

See the complete code: reinforcement-learning/value_eval at main chenxy3791/reinforcement-learning (github.com)

References:

[1] Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2020)

[2] Takahiro Kubo, Hands-on Learning with Python for Reinforcement Learning

[3] Policy Iteration and Value Iteration - Zhihu (zhihu.com)

[4] Policy iteration - Introduction to Reinforcement Learning (gibberblot.github.io)
