The simplest implementation of reinforcement learning -- the grid path-finding problem

Federated learning + reinforcement learning + blockchain, all mashed together... you really can't publish a paper these days without some imagination.
To make sense of those papers I decided to pick up the basics of reinforcement learning,
and found a tutorial and its code online.
The code is presumably the tutorial code built around gym (Python's reinforcement learning package),
but it's in English and the comments are sparse, which is rough for a beginner like me,
so I rewrote all of the comments in my own words and added some of my own understanding.

I also found a video tutorial on Bilibili, but its source code wasn't complete (you had to copy it by hand), so I've put the code used in the video, plus my study notes, below.
If anything is unclear, feel free to get in touch with questions or pointers.

The code comes in two parts: gridworld.py and ValueIteration2.py.
Put both files in the same folder and run the latter with Python.
I run it straight from the command line: python "path/ValueIteration2.py"

Code:
gridworld.py

import numpy as np
import sys
from io import StringIO
from gym.envs.toy_text import discrete
 
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3
 
class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
 
    For example, a 4x4 grid looks as follows:
 
    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T
 
    x is your position and T are the two terminal states.
 
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """
 
    metadata = {'render.modes': ['human', 'ansi']}
 
    def __init__(self, shape=[4,4]):
        # shape must be a list/tuple of exactly two numbers (rows, columns); otherwise raise an error
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')
 
        self.shape = shape
        # total number of grid cells (states) nS
        nS = np.prod(shape)
        # number of available actions nA
        nA = 4
 
        MAX_Y = shape[0]
        MAX_X = shape[1]
  
        P = {}
        # give every cell a numeric name:
        #  array([[ 0,  1,  2,  3],
        #         [ 4,  5,  6,  7],
        #         [ 8,  9, 10, 11],
        #         [12, 13, 14, 15]])
        grid = np.arange(nS).reshape(shape)
        # iterate over the numbers in grid
        # flags=['multi_index']: the iterator also keeps track of the (row, column) index
        it = np.nditer(grid, flags = ['multi_index'])
 
        while not it.finished:
            # s is the state number
            s = it.iterindex
            # y is the row index, x is the column index
            y, x = it.multi_index
            # create the action slots, nA of them (up/right/down/left): P = {0: {0: [], 1: [], 2: [], 3: []}, 1: ...}
            P[s] = {a : [] for a in range(nA)}
            # is this a terminal state (the top-left or bottom-right corner)?
            is_done = lambda s: s == 0 or s == (nS - 1)
            # reward: 0 at a terminal state, -1 everywhere else
            reward = 0.0 if is_done(s) else -1.0
 
            # We're stuck in a terminal state
            # (one of the two exit corners)
            # note: UP/RIGHT/DOWN/LEFT were initialized as module-level constants above
            if is_done(s):
            	# [(prob, next_state, reward, done)]
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                # compute the number of the next state in each direction
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            # be sure to keep this line: it advances the iterator to the next cell
            it.iternext()
        # When the loop is done, P looks roughly like this:
        # states 0 and 15 are the terminals: every action keeps you where you are, no more transitions
        # for every other state, walking into a wall keeps you in place (next state = s)
        # the last element says whether the next state is a terminal
        # the third element is the reward
        # the second element is the state you transition to
        # the first element is the probability of that transition (always 1.0 here, the grid is deterministic)
 		# {0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)], 2: [(1.0, 0, 0.0, True)], 3: [(1.0, 0, 0.0, True)]},
 		#  1: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 2, -1.0, False)], 2: [(1.0, 5, -1.0, False)], 3: [(1.0, 0, -1.0, True)]},
 		#  2: {0: [(1.0, 2, -1.0, False)], 1: [(1.0, 3, -1.0, False)], 2: [(1.0, 6, -1.0, False)], 3: [(1.0, 1, -1.0, False)]},
 		#  3: {0: [(1.0, 3, -1.0, False)], 1: [(1.0, 3, -1.0, False)], 2: [(1.0, 7, -1.0, False)], 3: [(1.0, 2, -1.0, False)]}, 
 		#  4: {0: [(1.0, 0, -1.0, True)], 1: [(1.0, 5, -1.0, False)], 2: [(1.0, 8, -1.0, False)], 3: [(1.0, 4, -1.0, False)]}, 
 		#  5: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 6, -1.0, False)], 2: [(1.0, 9, -1.0, False)], 3: [(1.0, 4, -1.0, False)]},
 		#  6: {0: [(1.0, 2, -1.0, False)], 1: [(1.0, 7, -1.0, False)], 2: [(1.0, 10, -1.0, False)], 3: [(1.0, 5, -1.0, False)]},
 		#  7: {0: [(1.0, 3, -1.0, False)], 1: [(1.0, 7, -1.0, False)], 2: [(1.0, 11, -1.0, False)], 3: [(1.0, 6, -1.0, False)]},
 		#  8: {0: [(1.0, 4, -1.0, False)], 1: [(1.0, 9, -1.0, False)], 2: [(1.0, 12, -1.0, False)], 3: [(1.0, 8, -1.0, False)]},
 		#  9: {0: [(1.0, 5, -1.0, False)], 1: [(1.0, 10, -1.0, False)], 2: [(1.0, 13, -1.0, False)], 3: [(1.0, 8, -1.0, False)]},
 		#  10: {0: [(1.0, 6, -1.0, False)], 1: [(1.0, 11, -1.0, False)], 2: [(1.0, 14, -1.0, False)], 3: [(1.0, 9, -1.0, False)]},
 		#  11: {0: [(1.0, 7, -1.0, False)], 1: [(1.0, 11, -1.0, False)], 2: [(1.0, 15, -1.0, True)], 3: [(1.0, 10, -1.0, False)]},
 		#  12: {0: [(1.0, 8, -1.0, False)], 1: [(1.0, 13, -1.0, False)], 2: [(1.0, 12, -1.0, False)], 3: [(1.0, 12, -1.0, False)]},
 		#  13: {0: [(1.0, 9, -1.0, False)], 1: [(1.0, 14, -1.0, False)], 2: [(1.0, 13, -1.0, False)], 3: [(1.0, 12, -1.0, False)]},
 		#  14: {0: [(1.0, 10, -1.0, False)], 1: [(1.0, 15, -1.0, True)], 2: [(1.0, 14, -1.0, False)], 3: [(1.0, 13, -1.0, False)]},
 		#  15: {0: [(1.0, 15, 0.0, True)], 1: [(1.0, 15, 0.0, True)], 2: [(1.0, 15, 0.0, True)], 3: [(1.0, 15, 0.0, True)]}}
        # Initial state distribution is uniform
        # i.e. every cell is equally likely to be the starting state
        isd = np.ones(nS) / nS
 
        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P
        # hand the parameters to the parent class constructor
        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    # note that this method is also inside the class!
    # my guess is that this hook comes from gym and is simply overridden here
    def _render(self, mode='human', close=False):
        # if the caller says close, there is nothing to do, just return
        if close:
            return
        # sys.stdout is Python's standard output stream
        # print() is basically sys.stdout.write()
        # StringIO() reads and writes strings in memory
        # think of it as a txt or csv file that lives in RAM
        outfile = StringIO() if mode == 'ansi' else sys.stdout
        # build the map
        grid = np.arange(self.nS).reshape(self.shape)
        # set up the iterator
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            # s is the state number
            s = it.iterindex
            # y is the row index, x is the column index
            y, x = it.multi_index
            # the agent is currently at this cell
            if self.s == s:
                output = " x "
            # this cell is a terminal state
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                # an ordinary cell the agent is not standing on
                output = " o "
            # first cell of a row
            if x == 0:
                # strip the extra space on the left of output
                output = output.lstrip()
            if x == self.shape[1] - 1:
                # strip the extra space on the right of output
                output = output.rstrip()
 
            outfile.write(output)
 
            if x == self.shape[1] - 1:
                outfile.write("\n")
 
            it.iternext()
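
Before moving on to value iteration, it is worth poking at the environment interactively to confirm that P really has the shape described in the comments above. Here is a minimal sketch of such a check (it assumes the old gym version whose DiscreteEnv sets self.s in its constructor, the same one this tutorial code targets):

from gridworld import GridworldEnv, UP, LEFT

env = GridworldEnv()

# state 1 sits just to the right of the top-left terminal; moving LEFT ends the episode
print(env.P[1][LEFT])   # [(1.0, 0, -1.0, True)]

# state 5 is an interior cell; no single action can terminate from there
print(env.P[5][UP])     # [(1.0, 1, -1.0, False)]

# draw the grid: 'x' marks the agent's current cell, 'T' marks the two terminals
env._render()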

====================== divider =================
ValueIteration2.py

import numpy as np
from gridworld import GridworldEnv


env = GridworldEnv()
print(str(env.P))

def value_iteration(env, theta=0.0001, discount_factor=1.0):
    # env: the grid environment initialized above
    # theta: convergence threshold
    # discount_factor: the discount rate (gamma)
    """
    Value Iteration Algorithm.
     
    Args:
        env: OpenAI environment. env.P represents the transition probabilities of the environment.
        theta: Stopping threshold. If the value of all states changes less than theta
            in one iteration we are done.
        discount_factor: lambda time discount factor.
         
    Returns:
        A tuple (policy, V) of the optimal policy and the optimal value function.
    """
    # take one step of lookahead
    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all action in a given state.
         
        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, Vector of length env.nS
         
        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        # state: the state whose actions we are considering
        # V: the current value estimate, a vector with one entry per grid cell
        # env.nA: the number of action choices available in a cell
        # A: will hold the value of each action
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                # Bellman backup: transition probability * (immediate reward R + discount_factor * estimated value of the next state)
                ######## ******* #######
                A[a] += prob * (reward + discount_factor * V[next_state])
                ######## ******* #######
        return A

    # V: the value estimate, a vector with one entry per grid cell; it records the best total
    # reward currently known from each state, i.e. V(s) = max over a of Q(s, a)
    V = np.zeros(env.nS)
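
    # In symbols, each sweep of the loop below applies the value iteration update to every state s:
    #   V_{k+1}(s) = max over a of: sum over s' of P(s'|s,a) * ( R(s,a,s') + discount_factor * V_k(s') )
    # one_step_lookahead computes that inner quantity for each action a; since this environment
    # is deterministic, the sum over s' collapses to a single term.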

    while True:
        # Stopping condition
        # keep sweeping until the values stop changing
        delta = 0
        # Update each state...
        # one update per grid cell
        for s in range(env.nS):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V)
            print("S:"+str(s)+"\t A:"+str(A))
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            # delta lives across the whole sweep and records the largest change in V
            # once the result has converged, best_action_value (the best value found in A)
            # no longer differs from V[s], so the difference below is 0, delta stays
            # below theta, and the iteration finally stops
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function
            # V[0] never changes; V[1] becomes -1 on the first sweep (minimum cost of one step),
            # and from there the updates chain outward sweep by sweep
            V[s] = best_action_value
        # Check if we can stop
        # once the best action values stop changing, delta is 0 < theta and the loop ends
        if delta < theta:
            break

    # Create a deterministic policy using the optimal value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One step lookahead to find the best action for this state
        A = one_step_lookahead(s, V)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s, best_action] = 1.0
     
    return policy, V
 
policy, v = value_iteration(env)
 
print("Policy Probability Distribution:")
print(policy)
print("")
 
print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

The output should look like this:

Policy Probability Distribution:
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]

Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):
[[0 3 3 2]
 [0 0 0 2]
 [0 0 1 2]
 [0 1 1 0]]
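
The script only prints the policy, but the value function v returned by value_iteration is also worth a look. A small extra print (my addition, not part of the original script) reshapes it into the grid:

print("Reshaped Grid Value Function:")
print(np.reshape(v, env.shape))

With discount_factor = 1.0 and a reward of -1 per step, each entry is simply minus the number of steps to the nearest exit: 0 at the two terminal corners, down to -3 at the cells three steps away from both exits.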

The "Reshaped Grid Policy" above is your guide map: in this grid, where only the top-left and bottom-right corners are exits, 0/1/2/3 stand for up/right/down/left, and if you follow the numbers you walk out of the grid along a shortest path.
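
That "follow the numbers" walk can also be written in code, reusing env, policy and np from ValueIteration2.py (walk_policy is my own little helper, not part of the tutorial):

def walk_policy(env, policy, start_state):
    # follow the greedy policy from start_state and return the list of visited states
    path = [start_state]
    s = start_state
    done = s == 0 or s == env.nS - 1
    while not done:
        # pick the action the policy assigns to this state
        a = np.argmax(policy[s])
        # env.P is deterministic: exactly one (prob, next_state, reward, done) tuple per action
        prob, next_s, reward, done = env.P[s][a][0]
        s = next_s
        path.append(s)
    return path

print(walk_policy(env, policy, 9))   # with the policy above, this prints [9, 5, 1, 0]
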
In essence:
may the flame guide thee

Reposted from blog.csdn.net/qq_42511414/article/details/109962364