[Artificial Intelligence] UC Berkeley Spring 2021 CS188 Project 6: Reinforcement Learning

Project 6: Reinforcement Learning

Code Link
Introduction
Solve Process
Conclusion
Reference

Code Link

Introduction

Article Intro

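This article was last revised on October 31, 2021, half a year before it was published (April 30, 2022), so the environment setup described below may since have changed with newer versions. Please bear with any discrepancies.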

Project Intro

This assignment comes from UC Berkeley Spring 2021 CS188 Artificial Intelligence, Project 6: Reinforcement Learning. The full project description is on the course page: UC Berkeley Spring 2021 Project 6: Reinforcement Learning

Tools Intro

Python 3.9 + PyCharm

Files Intro

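Files you'll edit:	Function
valueIterationAgents.py	A value iteration agent for solving known MDPs
qlearningAgents.py	Q-learning agents for Gridworld, Crawler and Pacman
analysis.py	A file to put your answers to the questions given in the project
model.py	A deep Q-network that helps Pacman compute Q-values in large MDPs

Files you should read but NOT edit:	Function
mdp.py	Defines methods on general MDPs
learningAgents.py	Defines the base classes ValueEstimationAgent and QLearningAgent, which your agents will extend
util.py	Utilities, including util.Counter, which is particularly useful for Q-learners
gridworld.py	The Gridworld implementation
featureExtractors.py	Classes for extracting features on (state, action) pairs. Used for the approximate Q-learning agent (in qlearningAgents.py)
deepQLearningAgents.py	The training loop for the deep Q-learning agent

The remaining files are support files and can be ignored.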

Solve Process

Process Preparation

Environment Installation

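The pip module for Python was recently updated again, and this caused trouble when running pip install followed by a package name. We therefore had to reinstall pip.
While uninstalling the module, a WinError "access denied" message appeared and suggested using user-level permissions, but installing with user permissions then failed with "No module named pip".
We had run into this problem before. The fix is to go to the relevant directory inside the Python 3.9 installation folder, delete the remains of the uninstalled old version, and then install pip again.
Following the advice of a technical blog, we can also install pip first and upgrade it afterwards; if the upgrade fails, switching to a different mirror source resolves it.

Once the numpy scientific computing library installs successfully, the problem is solved.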

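On the first trial run, the command line reported another error: AttributeError: module 'cgi' has no attribute 'escape'. The error originates in grading.py. To fix it, import the html library in that file and replace the cgi.escape call with html.escape at the reported location.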
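The replacement described above can be sketched as follows. The exact line in grading.py is not reproduced here, so the `message` string is a made-up stand-in; the only point is that `html.escape` is the standard-library successor of `cgi.escape`, which was removed in Python 3.8.

```python
import html

# Hypothetical message of the kind a grader might print; not taken from grading.py.
message = "expected <N> but got 'S' & 'E'"

# cgi.escape(message) raises AttributeError on Python 3.8+, where it was removed.
# html.escape is the standard-library replacement. Unlike cgi.escape it also
# escapes quotes by default, so quote=False keeps the old default behaviour.
escaped = html.escape(message, quote=False)
print(escaped)
```

Passing `quote=False` is a design choice for this sketch; leaving the default is also harmless when the text is only printed to a console.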

MDP

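The two commands given below run Gridworld under manual control and with the default agent, which moves randomly. Like the Pacman agent we implemented in Project 2, the Gridworld agent needs an algorithm to decide its game strategy, but Gridworld has its own dynamics. When the commands are run, the window shows a layout with two exits. The blue dot is the agent, and when the up key is pressed the agent actually moves north only 80% of the time. The difference between the two modes is that in manual mode the player controls the agent's moves, whereas in random mode the agent bounces around the grid until it reaches an exit. The task in this project is to make the Gridworld agent learn a good policy for moving through the grid.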

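Note that in this MDP we must first enter a pre-terminal state (the double boxes shown in the GUI, that is, the exits mentioned above) and then take the special "exit" action before the episode actually ends (in the true terminal state, TERMINAL_STATE, which is not shown in the GUI). When running an episode manually, the total return may be lower than expected because of the discount rate.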
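The effect of the discount rate on the total return can be illustrated with a short calculation. This is a minimal sketch: the reward sequence is invented for illustration, and the discount of 0.9 is an assumption rather than a value read from the article.

```python
def discounted_return(rewards, discount):
    """Sum of rewards where the reward at step t is weighted by discount**t."""
    return sum((discount ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: four moves with zero living reward, then an exit worth +1.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]

undiscounted = discounted_return(rewards, 1.0)   # what one might naively expect
discounted = discounted_return(rewards, 0.9)     # what a discounted episode reports
print(undiscounted, discounted)
```

The later the exit reward arrives, the more heavily it is discounted, which is why a long manual episode ends with less than the nominal exit payoff.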

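The assignment states clearly that we need to work with a Markov Decision Process (MDP). An MDP models the stochastic policies and returns available to an agent in an environment whose state has the Markov property. It is built around the interaction between agent and environment, and its elements are states, actions, a policy and rewards. MDPs are widely used to model reinforcement learning problems, which is the core of this project.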

python gridworld.py -m

python gridworld.py -g MazeGrid


Question 1 (5 points): Value Iteration

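This question asks for a value iteration agent in valueIterationAgents.py. The agent is an offline planner rather than a reinforcement learner, so the relevant training option is the number of value iteration sweeps it runs in its initial planning phase. ValueIterationAgent takes an MDP in its constructor and runs value iteration for the specified number of iterations before the constructor returns.
ValueIterationAgent inherits from ValueEstimationAgent. On initialization it takes a Markov decision process (see mdp.py) and runs value iteration for the given number of iterations using the supplied discount factor. After that it acts according to the resulting policy.
Value iteration computes k-step estimates of the optimal values, $V_k$. Besides running value iteration, the following methods of ValueIterationAgent are implemented using $V_k$: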

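computeActionFromValues(state) computes the best action according to the value function stored in self.values. The policy is the best action in the given state; ties may be broken in any way we see fit. If there are no legal actions, which is the case at the terminal state, it should return None.

computeQValueFromValues(state, action) returns the Q-value of the (state, action) pair given by the value function stored in self.values.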

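Each MDP state is the root of an expectimax-like search tree, which shows how these quantities relate. To compute the optimal value $V^*(s)$ of the root state $s$, we take the maximum over every action $a$ available in $s$. For each action, $T(s, a, s')$ is the probability that $a$ leads from $s$ to $s'$, $R(s, a, s')$ is the reward received, and $V^*(s')$ is computed recursively:

$$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$

Value iteration approximates this by repeatedly applying the update

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$$

where $\gamma$ is the discount factor. The runValueIteration() function in valueIterationAgents.py implements this process.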
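The update can be tried on a tiny hand-made problem before touching the project files. The sketch below uses a made-up three-state MDP stored in a plain dictionary (not the project's mdp.py classes) and applies the batch form of the update, where each sweep reads only the values from the previous sweep.

```python
# Toy MDP, made up for illustration (not part of the project files).
# transitions[state][action] is a list of (next_state, probability, reward).
transitions = {
    "A": {"go": [("B", 0.8, 0.0), ("A", 0.2, 0.0)]},
    "B": {"exit": [("END", 1.0, 1.0)]},
    "END": {},  # terminal state: no actions, value stays 0
}

def value_iteration(transitions, discount, iterations):
    values = {s: 0.0 for s in transitions}
    for _ in range(iterations):
        new_values = {}
        for state, actions in transitions.items():
            if not actions:
                new_values[state] = 0.0
                continue
            # Batch update: every Q-value is computed from the previous sweep's values.
            new_values[state] = max(
                sum(p * (r + discount * values[nxt]) for nxt, p, r in outcomes)
                for outcomes in actions.values()
            )
        values = new_values
    return values

v1 = value_iteration(transitions, discount=0.9, iterations=1)
v2 = value_iteration(transitions, discount=0.9, iterations=2)
print(v1, v2)
```

After one sweep only B has a non-zero value, because it is one step from the reward; the value reaches A on the second sweep. This is the same outward spread of values that the Gridworld GUI shows as the iteration count grows.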

At the start of the question, the assignment lists the methods we can use:

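Functions	Use
mdp.getStates()	Returns a list of all states of the MDP
mdp.getPossibleActions(state)	Returns the list of possible actions from the given state
mdp.getTransitionStatesAndProbs(state, action)	Returns a list of (nextState, prob) pairs: the states reachable from "state" by taking "action", together with their transition probabilities. In general Q-learning and reinforcement learning we do not know these probabilities and do not model them directly
mdp.getReward(state, action, nextState)	Returns the reward for the state, action, nextState transition. (Not available in reinforcement learning)
mdp.isTerminal(state)	Returns true if the given state is a terminal state. By convention, a terminal state has zero future reward. Sometimes the terminal state has no possible actions. It is also common to treat the terminal state as having a self-loop "pass" action with zero reward; the two formulations are equivalent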
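To see how these methods fit together, here is a minimal sketch of a one-step lookahead written against the same method names. `TinyMDP` and `q_value` are hypothetical names invented for this illustration; they are not the classes from mdp.py or gridworld.py, and the rewards are made up.

```python
class TinyMDP:
    """Hypothetical two-state MDP exposing the method names listed in the table."""

    def getStates(self):
        return ["s", "TERMINAL_STATE"]

    def getPossibleActions(self, state):
        return [] if self.isTerminal(state) else ["stay", "exit"]

    def getTransitionStatesAndProbs(self, state, action):
        if action == "stay":
            return [("s", 1.0)]
        return [("TERMINAL_STATE", 1.0)]

    def getReward(self, state, action, nextState):
        return 5.0 if action == "exit" else 0.0

    def isTerminal(self, state):
        return state == "TERMINAL_STATE"

def q_value(mdp, values, state, action, discount):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(
        prob * (mdp.getReward(state, action, nxt) + discount * values.get(nxt, 0.0))
        for nxt, prob in mdp.getTransitionStatesAndProbs(state, action)
    )

mdp = TinyMDP()
values = {"s": 2.0}  # made-up current estimates; missing states default to 0
q_stay = q_value(mdp, values, "s", "stay", discount=0.9)
q_exit = q_value(mdp, values, "s", "exit", discount=0.9)
print(q_stay, q_exit)
```

The terminal state contributes nothing because it is absent from `values` and therefore defaults to zero, which matches the zero-future-reward convention in the table.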

Based on the formulas and method hints above, our implementation is as follows:

import mdp
import util

from learningAgents import ValueEstimationAgent
import collections

class ValueIterationAgent(ValueEstimationAgent):
    """
        * Please read learningAgents.py before reading this.*

        A ValueIterationAgent takes a Markov decision process
        (see mdp.py) on initialization and runs value iteration
        for a given number of iterations using the supplied
        discount factor.
    """
    def __init__(self, mdp, discount = 0.9, iterations = 100):
        """
          Your value iteration agent should take an mdp on
          construction, run the indicated number of iterations
          and then act according to the resulting policy.

          Some useful mdp methods you will use:
              mdp.getStates()
              mdp.getPossibleActions(state)
              mdp.getTransitionStatesAndProbs(state, action)
              mdp.getReward(state, action, nextState)
              mdp.isTerminal(state)
        """
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        # self.values maps each state to its current value estimate V_k(state)
        self.values = util.Counter() # A Counter is a dict with default 0
        self.runValueIteration()

    def runValueIteration(self):
        # Write value iteration code here
        "*** YOUR CODE HERE ***"

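        # Run the requested number of value-iteration sweeps.
        for i in range(self.iterations):
            states = self.mdp.getStates()
            # Batch update: new values go into a fresh Counter so that every
            # Q-value in this sweep is computed from the previous sweep's values.
            counter = util.Counter()
            for state in states:
                maxVal = float('-inf')
                # Possible actions in Gridworld: 'north', 'west', 'south', 'east', 'exit'
                for action in self.mdp.getPossibleActions(state):
                    Q = self.computeQValueFromValues(state, action)
                    if Q > maxVal:
                        maxVal = Q
                # States without actions (the terminal state) keep the
                # Counter default of 0; all others store their best Q-value.
                if maxVal != float('-inf'):
                    counter[state] = maxVal
            self.values = counter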

    def getValue(self, state):
        """
          Return the value of the state (computed in __init__).
        """
        return self.values[state]

    def computeQValueFromValues(self, state, action):
        """
          Compute the Q-value of action in state from the
          value function stored in self.values.
        """
        "*** YOUR CODE HERE ***"
        # util.raiseNotDefined()
        TransitionStatesAndProbs = self.mdp.getTransitionStatesAndProbs(state,action)
        value = 0
        # Sum over every possible successor state, weighted by its transition probability
        for nextState,prob in TransitionStatesAndProbs:
            reward = self.mdp.getReward(state,action,nextState)
            value += prob * (reward + self.discount * self.values[nextState])
        return value

    def computeActionFromValues(self, state):
        """
          The policy is the best action in the given state
          according to the values currently stored in self.values.

          You may break ties any way you see fit.  Note that if
          there are no legal actions, which is the case at the
          terminal state, you should return None.
        """
        "*** YOUR CODE HERE ***"
        # util.raiseNotDefined()
        # With no possible actions (terminal state) the loop is skipped and None is returned
        bestAction = None
        maxVal = float('-inf')
        for action in self.mdp.getPossibleActions(state):
            QValue = self.computeQValueFromValues(state,action)
            if QValue > maxVal:
                maxVal = QValue
                bestAction = action
        # Return argmax_a Q(s, a)
        return bestAction

    def getPolicy(self, state):
        return self.computeActionFromValues(state)

    def getAction(self, state):
        "Returns the policy at the state (no exploration)."
        return self.computeActionFromValues(state)

    def getQValue(self, state, action):
        return self.computeQValueFromValues(state, action)

Commands used to run the autograder and to inspect the result in the GUI:

python autograder.py -q q1


python gridworld.py -a value -i 100 -k 10


python gridworld.py -a value -i 5


Question 2 (5 points): Policies

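This question requires filling in each of the functions question2a through question2e in analysis.py. By setting the given parameters, value iteration should produce the specified policy.

The question is built on the DiscountGrid layout, which the project description shows in a figure.
The grid has two terminal states with positive payoff (in the middle row): a close exit with payoff +1 and a distant exit with payoff +10. The bottom row of the grid consists of terminal states with negative payoff (shown in red); each state in this "cliff" region has payoff -10. The starting state is the yellow square.

We distinguish between two types of paths:

(1) Paths that "risk the cliff" and travel near the bottom row of the grid. These paths are shorter but risk a large negative payoff, and are drawn as red arrows in the project's figure.
(2) Paths that "avoid the cliff" and travel along the top edge of the grid. These paths are longer but are less likely to incur a large negative payoff, and are drawn as green arrows in the project's figure.

In this question we choose settings of the discount (answerDiscount), noise (answerNoise) and living reward (answerLivingReward) parameters of the MDP to produce optimal policies of several different types. Each setting should have the property that an agent following its optimal policy in the MDP exhibits the given behavior. If no setting of the parameters achieves a particular behavior, the function should assert that the policy is impossible by returning the string 'NOT POSSIBLE'.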

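For the five functions question2a to question2e, the assignment specifies the following target policies:

·question2a(): prefer the close exit (+1), risking the cliff (-10)
·question2b(): prefer the close exit (+1), but avoiding the cliff (-10)
·question2c(): prefer the distant exit (+10), risking the cliff (-10)
·question2d(): prefer the distant exit (+10), avoiding the cliff (-10)
·question2e(): avoid both exits and the cliff (so an episode should never terminate)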
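How the discount alone can flip the preference between the two exits is easy to check numerically. The sketch below is a simplification under stated assumptions: it ignores noise, and the step counts `NEAR_STEPS` and `FAR_STEPS` are hypothetical values chosen for illustration rather than measured on DiscountGrid.

```python
def path_return(steps_before_exit, exit_reward, discount, living_reward=0.0):
    """Return of a noise-free path: a living reward on each move, then the exit payoff."""
    living = sum(living_reward * discount ** t for t in range(steps_before_exit))
    return living + exit_reward * discount ** steps_before_exit

# Step counts are assumptions for illustration, not measured on DiscountGrid.
NEAR_STEPS, FAR_STEPS = 3, 5

for discount in (0.3, 0.9):
    near = path_return(NEAR_STEPS, 1.0, discount)    # close exit, payoff +1
    far = path_return(FAR_STEPS, 10.0, discount)     # distant exit, payoff +10
    print(discount, "near" if near > far else "far")
```

A small discount makes the distant +10 shrink faster than the close +1, so the close exit wins; a discount near 1 reverses this. Noise and the living reward then decide whether the path along the cliff is worth the risk.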

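The expected policies for question2a to question2e can be found in the test cases. Each test case contains a GridWorld specification and a Policy specification. In the GridWorld specification:

'-' is empty space, the black cells in the GridWorld figure;
a number is a terminal state with that value;
'#' is a wall, the gray cells in the figure;
'S' is the start state, the yellow cell in the figure.

In the Policy specification:

'-' marks a cell for which no policy is checked;
'N', 'E', 'S', 'W' mean that the policy should move north, east, south or west, respectively.

To choose the three parameters, discount (answerDiscount), noise (answerNoise) and living reward (answerLivingReward), we consult the expected answers in the test cases and check our policy in the GUI.
Taking the test case for question2a() as an example, the arrows at (3, 0) and (3, 1) should point east and the arrow at (3, 2) should point north, so suitable values for each parameter can be found by manual experimentation and calculation.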

The code for this question is as follows:

def question2a():
    answerDiscount = 0.5
    answerNoise = 0
    answerLivingReward = -1
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2b():
    answerDiscount = 0.2
    answerNoise = 0.2
    answerLivingReward = -1
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2c():
    answerDiscount = 0.95
    answerNoise = 0
    answerLivingReward = -1
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2d():
    answerDiscount = 0.2
    answerNoise = 0.5
    answerLivingReward = 2
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2e():
    answerDiscount = 0.95
    answerNoise = 0
    answerLivingReward = 1000
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

if __name__ == '__main__':
    print('Answers to analysis questions:')
    import analysis
    for q in [q for q in dir(analysis) if q.startswith('question')]:
        response = getattr(analysis, q)()
        print('  Question %s:\t%s' % (q, str(response)))

The command used to run the autograder for this question:

python autograder.py -q q2


Question 3 (5 points): Q-Learning

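This question requires implementing Q-learning in qlearningAgents.py.
Our earlier value iteration agent, ValueIterationAgent, does not actually learn from experience. Instead, it works from its MDP model to arrive at a complete policy before ever interacting with the real environment. When it does interact with the environment, it simply follows the precomputed policy (in other words, it becomes a reflex agent). This distinction may be subtle in a simulated environment like Gridworld, but it matters a great deal in the real world, where the true MDP is not available. This is why we introduce the Q-learning agent.
A Q-learning agent learns by trial and error from its interactions with the environment, through its update(state, action, nextState, reward) method.
The agent updates directly from the Bellman equation for the Q-function, which we already know from the course:

$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$

Because $T$ and $R$ are unknown during learning, each observed transition $(s, a, s', r)$ is used as a sample, and the estimate is moved toward it with learning rate $\alpha$:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$$

The update(), computeValueFromQValues(), getQValue() and computeActionFromQValues() methods that follow are implemented directly from these two equations.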
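A single update step can be worked through by hand. This is a minimal sketch with made-up numbers; `q_update` is a hypothetical helper for illustration, not a function from qlearningAgents.py.

```python
def q_update(old_q, reward, next_value, alpha, discount):
    """Blend the old estimate with the new sample r + discount * max_a' Q(s', a')."""
    sample = reward + discount * next_value
    return (1 - alpha) * old_q + alpha * sample

# Made-up numbers: Q(s, a) = 0, reward 1, best Q-value in the next state 2.
new_q = q_update(old_q=0.0, reward=1.0, next_value=2.0, alpha=0.5, discount=0.9)

# Transition into a terminal state: the next state's value is 0, so only the reward counts.
exit_q = q_update(old_q=0.0, reward=10.0, next_value=0.0, alpha=0.5, discount=0.9)
print(new_q, exit_q)
```

With a learning rate of 0.5 the estimate moves halfway from the old value toward the sample, so repeated visits are needed before the Q-value settles.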

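getQValue(): returns Q(state, action). It should return 0.0 if the state has never been seen, and the stored Q-node value otherwise.
computeValueFromQValues(): returns max_action Q(state, action), where the maximum is taken over the legal actions. If there are no legal actions, which is the case at the terminal state, it should return 0.0.
computeActionFromQValues(): computes the best action to take in a state. If there are no legal actions, which is the case at the terminal state, it should return None.
update(): performs the Q-value update.
*getAction(): this function is not required for this question, but it is worth mentioning. It computes the action to take in the current state: with probability self.epsilon it should take a random action, and otherwise it should take the best policy action.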
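The action selection rule described for getAction() is usually called epsilon-greedy. Below is a minimal standalone sketch; the function name `epsilon_greedy` and the per-state dictionary of Q-values are assumptions made for illustration, whereas the project code works with self.epsilon, self.getLegalActions(state) and util.flipCoin.

```python
import random

def epsilon_greedy(q_values, legal_actions, epsilon, rng=random):
    """Pick a random legal action with probability epsilon, else the greedy one.

    q_values maps action -> Q(state, action) for a single state (hypothetical layout).
    """
    if not legal_actions:
        return None  # terminal state: nothing to choose
    if rng.random() < epsilon:
        return rng.choice(legal_actions)
    # Unseen actions default to a Q-value of 0.0.
    return max(legal_actions, key=lambda a: q_values.get(a, 0.0))

q = {"north": 0.4, "east": 0.9}
actions = ["north", "east", "south"]
greedy_choice = epsilon_greedy(q, actions, epsilon=0.0)   # never explores
random_choice = epsilon_greedy(q, actions, epsilon=1.0)   # always explores
print(greedy_choice, random_choice)
```

Note that the random branch may also return the greedy action, since it draws uniformly from all legal actions.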

In this section we use self.getLegalActions(state) to obtain the legal actions in the current state.

The code for this question is as follows:

class QLearningAgent(ReinforcementAgent):
    """
      Q-Learning Agent
      Functions you should fill in:
        - computeValueFromQValues
        - computeActionFromQValues
        - getQValue
        - getAction
        - update
      Instance variables you have access to
        - self.epsilon (exploration prob)
        - self.alpha (learning rate)
        - self.discount (discount rate)
      Functions you should use
        - self.getLegalActions(state)
          which returns legal actions for a state
    """
    def __init__(self, **args):
        "You can initialize Q-values here..."
        ReinforcementAgent.__init__(self, **args)
        "*** YOUR CODE HERE ***"
        self.qValues = util.Counter()

    def getQValue(self, state, action):
        """
          Returns Q(state,action)
          Should return 0.0 if we have never seen a state
          or the Q node value otherwise
        """
        "*** YOUR CODE HERE ***"
        # util.raiseNotDefined()
        # util.Counter returns 0 for (state, action) pairs that have never been seen
        return self.qValues[(state, action)]

    def computeValueFromQValues(self, state):
        """
          Returns max_action Q(state,action)
          where the max is over legal actions.  Note that if
          there are no legal actions, which is the case at the
          terminal state, you should return a value of 0.0.
        """
        "*** YOUR CODE HERE ***"
        # util.raiseNotDefined()
        legalActions = self.getLegalActions(state)
        if len(legalActions) == 0:
            return 0.0
        count = util.Counter()
        for legalAction in legalActions:
            count[legalAction] = self.getQValue(state,legalAction)
        return count[count.argMax()]

    def computeActionFromQValues(self, state):
        """
          Compute the best action to take in a state.  Note that if there
          are no legal actions, which is the case at the terminal state,
          you should return None.
        """
        "*** YOUR CODE HERE ***"
        # util.raiseNotDefined()
        actions = self.getLegalActions(state)
        bestAction = None
        maxVal = float('-inf')
        for action in actions:
            # Go through getQValue so that subclasses overriding it still work
            qValue = self.getQValue(state, action)
            if maxVal < qValue:
                maxVal = qValue
                bestAction = action
        return bestAction


    def getAction(self, state):
        """
          Compute the action to take in the current state.  With
          probability self.epsilon, we should take a random action and
          take the best policy action otherwise.  Note that if there are
          no legal actions, which is the case at the terminal state, you
          should choose None as the action.
          HINT: You might want to use util.flipCoin(prob)
          HINT: To pick randomly from a list, use random.choice(list)
        """
        # Pick Action
        legalActions = self.getLegalActions(state)
        "*** YOUR CODE HERE ***"
        # No legal actions (terminal state): there is nothing to choose.
        if not legalActions:
            return None
        # Explore with probability self.epsilon, otherwise act greedily.
        # `random` is assumed to be imported at the top of qlearningAgents.py.
        if util.flipCoin(self.epsilon):
            return random.choice(legalActions)
        return self.getPolicy(state)

    def update(self, state, action, nextState, reward):
        """
          The parent class calls this to observe a
          state = action => nextState and reward transition.
          You should do your Q-Value update here
          NOTE: You should never call this function,
          it will be called on your behalf
        """
        "*** YOUR CODE HERE ***"
        # util.raiseNotDefined()
        # Sample estimate: reward plus the discounted value of the next state.
        # getValue() returns 0.0 when nextState has no legal actions, so the
        # terminal state needs no special case.
        sample = reward + self.discount * self.getValue(nextState)
        oldQValue = self.getQValue(state, action)
        self.qValues[(state, action)] = (1 - self.alpha) * oldQValue + self.alpha * sample

    def getPolicy(self, state):
        return self.computeActionFromQValues(state)

    def getValue(self, state):
        return self.computeValueFromQValues(state)

python gridworld.py -a q -k 5 -m

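After running the command above, we can steer the agent with the keyboard and watch the Q-learner learn under manual control, which shows how the agent builds up knowledge of the states it visits.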

As we move with the keyboard, the console shows the agent's current state, the action taken, the resulting state and the reward received.

The command used to run the autograder for this question:

python autograder.py -q q3


Conclusion

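In this lab I learned about Markov decision processes, the Bellman equation and the application of reinforcement learning, and applied these concepts and formulas in practice with the GridWorld agent. A Markov decision process introduces a reward mechanism to measure how good any sequence is, that is, to evaluate sequential decisions. The Bellman equation describes the recursive relationships within and between the value function and the action-value function. Value iteration combines policy improvement with policy evaluation, while Q-learning is based on temporal-difference learning. Their mechanisms are different yet related, and each has its own place in artificial intelligence. The material is hard and fairly abstract to implement, but during debugging, as the GridWorld agent moved around, I could really see the theory come to life, which gave me a deeper understanding of reinforcement learning.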

Reference

1. A complete fix for: No module named pip
2. AttributeError: module 'cgi' has no attribute 'escape'
3、CS 188 Project3(RL) Q1: Value Iteration
4、CS 188 Reinforcement Learning