Multi-Agent Reinforcement Learning

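When a reinforcement learning system contains more than one agent, we call it multi-agent reinforcement learning (MARL). In reinforcement learning, an agent interacts with the environment to collect data and feedback signals, and through repeated interaction and feedback it eventually reaches its control objective.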

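In the multi-agent case, from the point of view of any single agent, the environment also contains other agents, and those agents are running reinforcement learning algorithms too. We call such an environment a non-stationary environment: the distributions of states, actions, and rewards keep changing, because the other agents in the environment are also updating their own parameters.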

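The game AI we are most familiar with is centralized: a single agent schedules all the resources in the game. But in many cases we would also like to design multi-agent communication and co-learning algorithms for elaborate collective game intelligence, because coordination among multiple agents can be a very powerful form of collective intelligence.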

Reference: Peng, Peng, et al. "Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games." NIPS Workshop, 2017.

Difficulty in Multi-Agent Learning

Compared with single-agent reinforcement learning, what makes multi-agent reinforcement learning (MARL) hard?

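The difficulty is that every agent changes its own policy in response to the actions taken by the other agents, so the whole environment becomes chaotic, and yet we still have to find a solution in that chaotic setting.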

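Of course, you can apply a single-agent reinforcement learning algorithm by brute force, ignoring the interactions between agents and treating everything else as part of the environment. This approach has no theoretical convergence guarantee, and performance may even get worse as learning proceeds.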

So we have to reason about how the other agents will cooperate with us, or how they will compete against us. This is where game theory comes in.

Sequential Decision Making

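Stepping outside the MDP setting of reinforcement learning, the broader concept for this kind of problem in computer game playing is sequential decision making.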

Sequential decision making can be roughly divided into three classes:

Markov decision processes

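The first class is the Markov decision process (MDP), the usual setting in reinforcement learning. There is only one decision maker in the environment, but multiple states are involved. The chosen action generally differs from state to state, and the agent has to plan so as to maximize the expected return.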

Repeated games

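The second setting is the repeated game. Since it is a game, there must be multiple decision makers that cooperate or compete with each other. But there is only one state: after every decision the players receive their rewards and the game starts over. Rock-paper-scissors is an example.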

Although the game restarts after every action, we can still use the game data to model the opponent and learn its policy (an opponent model).
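To make the idea of an opponent model concrete, here is a minimal sketch for rock-paper-scissors. It assumes the simplest possible model, the empirical frequency of the opponent's past actions, and then picks a best response to it. The names `opponent_model` and `best_response` and the sample history are made up for illustration; they are not taken from any particular paper or library.

```python
from collections import Counter

ACTIONS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value


def payoff(mine, theirs):
    """Reward for playing `mine` against `theirs`: +1 win, 0 draw, -1 loss."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def opponent_model(history):
    """Empirical distribution of the opponent's past actions (uniform if no data)."""
    if not history:
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    counts = Counter(history)
    return {a: counts[a] / len(history) for a in ACTIONS}


def best_response(model):
    """Action with the highest expected reward against the modeled opponent."""
    def expected_reward(mine):
        return sum(p * payoff(mine, theirs) for theirs, p in model.items())
    return max(ACTIONS, key=expected_reward)


# Hypothetical record of the opponent's moves in earlier rounds.
history = ["rock", "rock", "paper", "rock"]
model = opponent_model(history)
response = best_response(model)
print(model, response)
```

Against an opponent that mostly plays rock, the best response is paper. A frequency count is only one possible opponent model; it ignores any dependence of the opponent's move on earlier rounds.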

Stochastic games (Markov games)

The most general setting is the stochastic game, also called a Markov game, which has both multiple decision makers and multiple states.

Let us briefly explain the stochastic game. Anyone who has studied game theory will have come across the matrix game.

[Figure: the payoff matrix of the matrix game in each state]

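As shown in the figure above, in the matrix for state 1, if player 1 takes action a1 and player 2 takes action a2, then player 1 receives a reward of 0 and player 2 receives a reward of 3. If the players keep taking actions in state 1 over and over, the game is a repeated game; if, after the actions are taken, the environment transitions to the matrix of some other state, the game is a stochastic game.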

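In other words, when choosing an action an agent has to consider not only the immediate reward, but also which state the game moves to and how that new state affects every agent. This setting is more complex, but also more general.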

Reference: Shapley, Lloyd S. "Stochastic games." Proceedings of the National Academy of Sciences 39.10 (1953): 1095-1100.

Definition of Stochastic Games

Mathematically, a stochastic game can be defined as the following tuple:

$(\mathcal{S},\mathcal{A}^{1},\cdots, \mathcal{A}^{N}, r^{1}, \cdots,r^{N},p,\gamma)$

where:

State space: $\mathcal{S}$
Action space of agent $j$: $\mathcal{A}^{j},j \in \{1,\cdots,N\}$
Reward function of agent $j$: $r^{j}: \mathcal{S} \times \mathcal{A}^{1} \times \cdots \times \mathcal{A}^{N} \rightarrow \mathbb{R}$
Transition probability: $p: \mathcal{S} \times \mathcal{A}^{1} \times \cdots \times \mathcal{A}^{N} \rightarrow \Omega(\mathcal{S})$, where $\Omega(\mathcal{S})$ is the collection of probability distributions over $\mathcal{S}$
Discount factor across time: $\gamma \in [0,1)$

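As the definition shows, the only difference from an MDP is that one agent becomes several agents, and each agent may have a different action space. In the current state the agents take a joint action and each agent receives its own reward, so there are $N$ actions and $N$ rewards. The state then transitions to the next state according to the joint action. The key difference is that multiple agents act simultaneously and affect one another.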
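The tuple can be written down directly as a small data structure. The sketch below is only an illustration under stated assumptions: the class name `StochasticGame`, the two states `s1` and `s2`, and all payoffs and transition probabilities are invented, except the entry for state 1 with actions (a1, a2), which reuses the rewards (0, 3) from the matrix-game example earlier in the text.

```python
import random
from dataclasses import dataclass


@dataclass
class StochasticGame:
    states: list          # S
    action_spaces: list   # [A^1, ..., A^N], one list of actions per agent
    reward: dict          # (state, joint_action) -> tuple of N rewards (r^1, ..., r^N)
    transition: dict      # (state, joint_action) -> {next_state: probability}
    gamma: float          # discount factor in [0, 1)

    def step(self, state, joint_action, rng):
        """Return the N rewards and a sampled next state for one joint action."""
        rewards = self.reward[(state, joint_action)]
        dist = self.transition[(state, joint_action)]
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights)[0]
        return rewards, next_state


# Toy two-player, two-state game; every number below is an illustrative assumption.
joint_actions = [(a, b) for a in ("a1", "a2") for b in ("a1", "a2")]
payoffs_s1 = {("a1", "a1"): (1, 1), ("a1", "a2"): (0, 3),
              ("a2", "a1"): (3, 0), ("a2", "a2"): (2, 2)}
reward = {("s1", ja): r for ja, r in payoffs_s1.items()}
reward.update({("s2", ja): (0, 0) for ja in joint_actions})
transition = {}
for ja in joint_actions:
    transition[("s1", ja)] = {"s1": 0.5, "s2": 0.5}
    transition[("s2", ja)] = {"s1": 1.0}

game = StochasticGame(
    states=["s1", "s2"],
    action_spaces=[["a1", "a2"], ["a1", "a2"]],
    reward=reward,
    transition=transition,
    gamma=0.9,
)

rng = random.Random(0)
rewards, next_state = game.step("s1", ("a1", "a2"), rng)
print(rewards, next_state)
```

If every transition led back to `s1` with probability 1, the same structure would describe a repeated game; with a single agent it would reduce to an MDP.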

Policies in Stochastic Games

For agent $j$ , the corresponding policy is:

$\pi^{j}: \mathcal{S} \rightarrow \Omega(\mathcal{A}^{j})$

where $\Omega(\mathcal{A}^{j})$ is the collection of probability distributions over $\mathcal{A}^{j}$. Each agent $j$ chooses an action based on the current state $s \in \mathcal{S}$, or more generally outputs a distribution over actions.

The joint policy of all agents is $\boldsymbol{\pi} \triangleq\left[\pi^{1}, \ldots, \pi^{N}\right]$. The policies $\pi^{j}$ of all agents are collected into the joint policy, written in bold as $\boldsymbol{\pi}$. Put simply, this is again a single policy, just a very complicated one.

State value function of agent $j$ :

$v_{\boldsymbol{\pi}}^{j}(s) = v^{j}(s;\boldsymbol{\pi}) = \sum_{t=0}^{\infty} \gamma^{t} \mathbb{E}_{\boldsymbol{\pi},p}[r_{t}^{j}|s_{0}=s,\boldsymbol{\pi}]$

Given the current joint policy, the whole environment can be rolled forward, and for any particular agent $j$ its own value function is the expression above.

Action value function of agent $j$: $Q_{\boldsymbol{\pi}}^{j}:\mathcal{S} \times \mathcal{A}^{1} \times \cdots \times \mathcal{A}^{N} \rightarrow \mathbb{R}$

$Q_{\boldsymbol{\pi}}^{j}(s,\boldsymbol{a}) = r^{j}(s,\boldsymbol{a})+\gamma \mathbb{E}_{s^{\prime} \sim p}[v_{\boldsymbol{\pi}}^{j}(s^{\prime})]$

where $\boldsymbol{a} = [a^{1},\cdots, a^{N}]$.

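Similarly, the value of a state-action pair is given by the expression above: the input is the state and the joint action, and from then on the agents follow the joint policy, which contributes the term $\gamma \mathbb{E}_{s^{\prime} \sim p}[v_{\boldsymbol{\pi}}^{j}(s^{\prime})]$. The function can then be learned and updated with a TD algorithm.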
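As a small numerical illustration of the two definitions, the sketch below computes the discounted return of one reward sequence of agent $j$ (a single-sample estimate of $v^{j}$) and the one-step backup $r^{j} + \gamma \mathbb{E}[v^{j}(s')]$. The function names and all numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one trajectory of agent j's rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def one_step_q(reward, gamma, next_state_dist, values):
    """Q^j(s, a) = r^j(s, a) + gamma * E_{s' ~ p}[v^j(s')] for one (s, a)."""
    expected_next = sum(p * values[s] for s, p in next_state_dist.items())
    return reward + gamma * expected_next


# Agent j receives rewards 1, 0, 2 over three steps with gamma = 0.5.
g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)

# Reward 1, then a 50/50 transition to states worth 2 and 4 under the joint policy.
q = one_step_q(1.0, 0.9, {"s1": 0.5, "s2": 0.5}, {"s1": 2.0, "s2": 4.0})
print(g, q)
```

Here `g` is 1 + 0 + 0.25 * 2 = 1.5 and `q` is 1 + 0.9 * 3 = 3.7, up to floating-point rounding.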

Independent Learning in SG

Now that the Q-function is defined, we can directly apply Q-learning as independent learning in a stochastic game, treating the other agents as part of the environment.

For each agent $j$, assume the other agents' policies are stationary, so that the environment is stationary from $j$'s point of view and $j$ can perform Q-learning:

$\begin{aligned} Q\left(s, a^{j}, a^{-j}\right) \leftarrow &Q\left(s, a^{j}, a^{-j}\right)+ \\ & \alpha\left(r+\gamma \max _{a^{j^{\prime}}} Q\left(s^{\prime}, a^{j^{\prime}}, a^{-j^{\prime}}\right)-Q\left(s, a^{j}, a^{-j}\right)\right) \end{aligned}$

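Here $a^{j}$ is the action of the current agent, and $a^{-j}$ is the joint action of all the other agents. However, this approach has no theoretical convergence guarantee, and it does not account for the other agents or for how to cooperate with them.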

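In rock-paper-scissors, for example, the learners may end up cycling through rock, paper, and scissors in turn, instead of each playing every action with probability $\frac{1}{3}$.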
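A minimal tabular sketch of the independent update follows. It makes one simplifying assumption that should be stated loudly: the learner stores only $Q(s, a^{j})$, so the other agents' actions $a^{-j}$ are absorbed into the environment rather than appearing as an index of the table. The function name `q_update` and the numbers are illustrative.

```python
def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One Q-learning step for a single agent that ignores the other agents.

    q maps (state, own_action) -> value; missing entries count as 0.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (td_target - old)
    return q[(state, action)]


ACTIONS = ("rock", "paper", "scissors")
q_table = {}

# Rock-paper-scissors has a single state, labeled 0 here.
first = q_update(q_table, 0, "rock", 1.0, 0, ACTIONS, alpha=0.5, gamma=0.9)
second = q_update(q_table, 0, "rock", 1.0, 0, ACTIONS, alpha=0.5, gamma=0.9)
print(first, second)
```

The first update moves $Q(0, \text{rock})$ from 0 to 0.5; the second uses the bootstrapped target 1 + 0.9 * 0.5 = 1.45 and moves it to 0.975. If the opponent then switches to paper, the same table entry starts receiving reward -1, which is exactly the non-stationarity that breaks the convergence argument.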

Nash Equilibrium in SG

So how do we run an algorithm on a stochastic game that actually works?

First we need the concept of Nash equilibrium. Recall the value function of agent $j$:

$v_{\boldsymbol{\pi}}^{j}(s) = v^{j}(s;\boldsymbol{\pi}) = \sum_{t=0}^{\infty} \gamma^{t} \mathbb{E}_{\boldsymbol{\pi},p}[r_{t}^{j}|s_{0}=s,\boldsymbol{\pi}]$

The value $v_{\boldsymbol{\pi}}^{j}(s)$ that agent $j$ wants to optimize depends on the entire joint policy $\boldsymbol{\pi}$, not just on its own policy.

In a stochastic game, the joint policy at a Nash equilibrium is written as:

$\boldsymbol{\pi}_{*} \triangleq\left[\pi_{*}^{1}, \ldots, \pi_{*}^{N}\right]$

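At a Nash equilibrium no agent has an incentive to unilaterally change its policy. Formally, for every agent $j$, every state $s$, and every alternative policy $\pi^{j}$: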

$\begin{aligned} v^{j}\left(s ; \boldsymbol{\pi}_{*}\right)=v^{j}\left(s ; \pi_{*}^{j}, \boldsymbol{\pi}_{*}^{-j}\right) & \geq v^{j}\left(s ; \pi^{j}, \boldsymbol{\pi}_{*}^{-j}\right) \end{aligned}$

That is, holding the other agents' policies fixed, the agent's current policy is optimal for it, and this holds for every agent. Here $\boldsymbol{\pi}_{*}^{-j}$ denotes:

$\boldsymbol{\pi}_{*}^{-j} \triangleq\left[\pi_{*}^{1}, \ldots, \pi_{*}^{j-1}, \pi_{*}^{j+1}, \ldots, \pi_{*}^{N}\right]$

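As long as my opponents' policies stay the same, I will not change mine either, because changing it cannot earn a larger reward. Since this is true for every agent, no agent changes its policy, and an equilibrium is reached.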
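The equilibrium condition can be checked numerically in the simplest case, a single-state two-player matrix game. The sketch below is an illustration, not a general solver: it tests whether a given pair of mixed strategies is a Nash equilibrium by trying every pure deviation, which is enough because a player's expected payoff is linear in its own mixed strategy. The function names are hypothetical.

```python
def expected_payoff(payoff, x, y):
    """Expected payoff when the row player mixes with x and the column player with y."""
    return sum(x[i] * y[k] * payoff[i][k]
               for i in range(len(x)) for k in range(len(y)))


def pure(index, dim):
    """Pure strategy that plays action `index` with probability 1."""
    return [1.0 if m == index else 0.0 for m in range(dim)]


def is_nash(payoff1, payoff2, x, y, tol=1e-9):
    """True if neither player can gain by a unilateral deviation from (x, y)."""
    v1 = expected_payoff(payoff1, x, y)
    v2 = expected_payoff(payoff2, x, y)
    for i in range(len(x)):
        if expected_payoff(payoff1, pure(i, len(x)), y) > v1 + tol:
            return False
    for k in range(len(y)):
        if expected_payoff(payoff2, x, pure(k, len(y))) > v2 + tol:
            return False
    return True


# Rock-paper-scissors; rows and columns are ordered rock, paper, scissors.
RPS_ROW = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
RPS_COL = [[-v for v in row] for row in RPS_ROW]  # zero-sum: column payoff is the negative

uniform = [1 / 3, 1 / 3, 1 / 3]
rock = pure(0, 3)
print(is_nash(RPS_ROW, RPS_COL, uniform, uniform))  # uniform play for both players
print(is_nash(RPS_ROW, RPS_COL, rock, rock))        # both always play rock
```

Uniform play passes the check, while "both always play rock" fails because either player gains by switching to paper.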

Nash Q-learning

If we find a Nash equilibrium in every state, we can run a Q-learning-style algorithm in a much more stable way.

Given the Nash policy $\boldsymbol{\pi}_{*}$, the Nash value function is:

$\boldsymbol{v}^{\text {Nash }}(s) \triangleq\left[v_{\boldsymbol{\pi}_{*}}^{1}(s), \ldots, v_{\boldsymbol{\pi}_{*}}^{N}(s)\right]$

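What does Nash Q-learning do? In every state it solves for a Nash equilibrium of the game at that state. In other words, for each matrix game it asks: given the current state, what is the equilibrium policy of this matrix game?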

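If policies are allowed to be stochastic rather than only deterministic, so that actions are drawn from some probability distribution, then at least one Nash equilibrium exists in every finite game. This is a classical theorem of game theory (Nash's existence theorem).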

So for any state we can compute a Nash equilibrium, which gives us the joint policy $\boldsymbol{\pi}_{*}$. With this policy in hand we iterate in the spirit of policy iteration and improve the value estimates. The steps are:

Nash Q-learning defines an iterative procedure

Solving the Nash equilibrium $\boldsymbol{\pi_{*}}$ of the current stage defined by $\{Q_{t}\}$ .
Improving the estimation of the Q-function with the new Nash value $\boldsymbol{v}^{\text{Nash}}$.

Iterating these two steps is all there is to it.
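The two steps can be sketched for two agents with tabular Q-functions. This is a simplified illustration under a strong assumption: the stage game at the next state is solved only by enumerating pure-strategy equilibria, so the sketch fails when none exists (as in rock-paper-scissors), where a mixed-strategy solver would be needed. The names `pure_nash` and `nash_q_update` and all numbers are hypothetical.

```python
def pure_nash(q1, q2):
    """First pure-strategy Nash equilibrium (i, k) of a bimatrix stage game, or None."""
    rows, cols = len(q1), len(q1[0])
    for i in range(rows):
        for k in range(cols):
            row_is_best = all(q1[i][k] >= q1[m][k] for m in range(rows))
            col_is_best = all(q2[i][k] >= q2[i][m] for m in range(cols))
            if row_is_best and col_is_best:
                return i, k
    return None


def nash_q_update(Q1, Q2, s, a, rewards, s_next, alpha, gamma):
    """Step 1: solve the stage game at s_next. Step 2: move Q(s, a) toward r + gamma * Nash value."""
    eq = pure_nash(Q1[s_next], Q2[s_next])
    if eq is None:
        raise ValueError("no pure equilibrium; a mixed-strategy solver is required")
    i, k = eq
    nash_values = (Q1[s_next][i][k], Q2[s_next][i][k])
    for Q, r, v in zip((Q1, Q2), rewards, nash_values):
        old = Q[s][a[0]][a[1]]
        Q[s][a[0]][a[1]] = old + alpha * (r + gamma * v - old)
    return nash_values


# Q[state][action of agent 1][action of agent 2]; state 1 looks like a prisoner's dilemma.
Q1 = {0: [[0.0, 0.0], [0.0, 0.0]], 1: [[3.0, 0.0], [5.0, 1.0]]}
Q2 = {0: [[0.0, 0.0], [0.0, 0.0]], 1: [[3.0, 5.0], [0.0, 1.0]]}

# In state 0 the agents played (action 0, action 1), got rewards (0, 3), and moved to state 1.
nash_values = nash_q_update(Q1, Q2, s=0, a=(0, 1), rewards=(0.0, 3.0),
                            s_next=1, alpha=0.5, gamma=0.9)
print(nash_values, Q1[0][0][1], Q2[0][0][1])
```

The stage game at state 1 has the pure equilibrium (1, 1) with values (1, 1), so the targets are 0 + 0.9 * 1 and 3 + 0.9 * 1. Note that each agent needs the other agent's Q-table to solve the stage game.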

At this point every state follows a Nash equilibrium, so no agent has any motivation to change its policy, and the policies converge.

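This is an interesting way of bringing game theory into multi-agent reinforcement learning, but it has obvious drawbacks. A Nash equilibrium has to be computed in every state, which is computationally very expensive; computing Nash equilibria is a research area of its own in game theory, involves many specialized algorithms, and is hard in itself. In addition, we need to know the other agents' policies in order to compute the equilibrium. If the other agents are opponents, their policies are usually hard to obtain, so in that case we typically need some form of opponent modeling.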

To summarize, there are two main issues:

Very high computational complexity
May not work when other agents' policies are unavailable

From Multi- to Many-Agent RL

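Now let us look back and ask what happens when the number of agents becomes very large (usually it is small, around 20 or fewer). The following two components become especially complex: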

Reward function of agent $j$: $r^{j}: \mathcal{S} \times \mathcal{A}^{1} \times \cdots \times \mathcal{A}^{N} \rightarrow \mathbb{R}$
Transition probability: $p: \mathcal{S} \times \mathcal{A}^{1} \times \cdots \times \mathcal{A}^{N} \rightarrow \Omega(\mathcal{S})$

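The dimension of their domain grows directly: every extra agent adds one more dimension. Higher dimension means more data is needed for learning, and the learning here is higher-order because the agents interact, which involves even more information, so a large amount of data is needed to sample useful state-action pairs. Reinforcement learning is essentially learning by trial: you can only learn from what you happen to try. Because a multi-agent problem is defined over joint actions, whose components must fit together, it is much harder for the agents to jointly stumble upon useful data.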
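A quick calculation shows how fast the domain grows. The sketch assumes, purely for illustration, that every agent has the same number of discrete actions, so the number of joint actions per state is $|\mathcal{A}|^{N}$.

```python
def joint_action_count(actions_per_agent, num_agents):
    """Number of joint actions when every agent has the same number of actions."""
    return actions_per_agent ** num_agents


# Five actions per agent (an arbitrary choice) and a growing number of agents.
counts = {n: joint_action_count(5, n) for n in (1, 2, 10, 20)}
for n, c in counts.items():
    print(f"N = {n:2d}: {c} joint actions per state")
```

With 5 actions each, 2 agents give 25 joint actions, 10 agents give 9,765,625, and 20 agents give 95,367,431,640,625, so even visiting each joint action once per state is out of reach.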

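Because the environment model is harder, the environment becomes more dynamic and more sensitive, and learned policies overfit more easily, so a huge amount of exploration data is required. This is why teams such as DeepMind and OpenAI reportedly start their multi-agent projects with something like a thousand GPUs.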

Mean Field Multi-Agent RL

Can we simplify things at the level of the model?

Taking Other Agents as A Whole

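A natural source of inspiration is biology, for example schools of fish or flocks of birds. When such groups migrate, an individual does not interact with every other individual.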

In some many-body systems, the interaction between an agent and others can be approximated as that between the agent and the “mean agent” of others.

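In a large choir, for example, a singer only needs to know roughly where the group as a whole is in the song, not exactly where each individual singer is.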

Mean Field Multi-Agent RL

Mean field multi-agent RL approximates the joint action value by factorizing the Q-function into pairwise interactions:

$Q^{j}(s,\boldsymbol{a}) = \frac{1}{N^{j}} \sum_{k \in \mathcal{N}(j)} Q^{j}(s,a^{j},a^{k})$

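where $\mathcal{N}(j)$ is the set of neighboring agents of $j$ and $N^{j} = |\mathcal{N}(j)|$ is its size. The pairwise-interaction Q-values between the agent and each of its neighbors are averaged, which is the simplification.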


Under this assumption, we now specify the action representation.

Consider a discrete action space and represent each action in one-hot form. The action $a^{j}$ of agent $j$ is one-hot encoded as:

$a^{j} \triangleq [a_{1}^{j},\cdots,a_{D}^{j}]$

Averaging the actions taken by the neighbors then gives the mean action based on the neighborhood of $j$:

$\bar{a}^{j}=\frac{1}{N^{j}} \sum_{k} a^{k}$

It describes the fraction of neighbors taking each action, that is, the empirical distribution of the neighbors' actions.

The action of each individual neighbor $k$ is still one-hot, but we can recover that one-hot vector from the mean action plus a residual. Thus the action $a^{k}$ of each neighbor $k$ can be represented as:

$a^{k} = \bar{a}^{j} + \delta a^{j,k}$

where the residuals sum to zero:

$\frac{1}{N^{j}} \sum_{k} \delta a^{j,k} = 0$
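The mean action and the residuals are easy to compute by hand. The sketch below uses a hypothetical neighborhood of four agents with $D = 3$ actions, and checks that the residuals $\delta a^{j,k} = a^{k} - \bar{a}^{j}$ cancel in every coordinate. The helper names are made up for illustration.

```python
def one_hot(index, dim):
    """One-hot encoding of a discrete action."""
    return [1.0 if d == index else 0.0 for d in range(dim)]


def mean_action(neighbor_actions):
    """Coordinate-wise average of the neighbors' one-hot actions."""
    n = len(neighbor_actions)
    dim = len(neighbor_actions[0])
    return [sum(a[d] for a in neighbor_actions) / n for d in range(dim)]


def residuals(neighbor_actions, mean):
    """delta a^{j,k} = a^k - mean action, one vector per neighbor k."""
    return [[a[d] - mean[d] for d in range(len(mean))] for a in neighbor_actions]


# Four neighbors choosing actions 0, 0, 2, 1 out of D = 3 (illustrative values).
neighbors = [one_hot(i, 3) for i in (0, 0, 2, 1)]
a_bar = mean_action(neighbors)
deltas = residuals(neighbors, a_bar)
residual_sums = [sum(delta[d] for delta in deltas) for d in range(3)]
print(a_bar, residual_sums)
```

The mean action is [0.5, 0.25, 0.25], which reads as "half of the neighbors chose action 0", and every coordinate of the residual sum is zero.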

Reference: Yaodong Yang, Weinan Zhang, et al. "Mean Field Multi-Agent Reinforcement Learning." ICML 2018.

Mean Field Q-Learning

A second-order Taylor expansion of the Q-function

With this setup in place, we can take a second-order Taylor expansion:

$\begin{aligned} Q^{j} & (s,\boldsymbol{a}) = \frac{1}{N^{j}} \sum_{k}Q^{j}(s,a^{j},a^{k}) \\ &=\frac{1}{N^{j}} \sum_{k}\left[Q^{j}\left(s, a^{j}, \bar{a}^{j}\right)+\nabla_{\bar{a}^{j}} Q^{j}\left(s, a^{j}, \bar{a}^{j}\right) \cdot \delta a^{j, k}+\frac{1}{2} \delta a^{j, k} \cdot \nabla_{\tilde{a}^{j, k}}^{2} Q^{j}\left(s, a^{j}, \tilde{a}^{j, k}\right) \cdot \delta a^{j, k}\right]\\ &=Q^{j}\left(s, a^{j}, \bar{a}^{j}\right)+\nabla_{\bar{a}^{j}} Q^{j}\left(s, a^{j}, \bar{a}^{j}\right) \cdot \frac{1}{N^{j}} \sum_{k} \delta a^{j, k}+\frac{1}{2 N^{j}} \sum_{k} \delta a^{j, k} \cdot \nabla_{\tilde{a}^{j, k}}^{2} Q^{j}\left(s, a^{j}, \tilde{a}^{j, k}\right) \cdot \delta a^{j, k}\\ &=Q^{j}\left(s, a^{j}, \bar{a}^{j}\right)+\frac{1}{2 N^{j}} \sum_{k} R_{s, a^{j}}^{j}\left(a^{k}\right)\\ & \approx Q^{j}(s,a^{j},\bar{a}^{j}) \end{aligned}$

where $a^{k} = \bar{a}^{j} + \delta a^{j,k}$, and the Taylor remainder, with $\epsilon^{j,k} \in [0,1]$, acts as an external random signal for agent $j$:

$\begin{aligned} R_{s, a^{j}}^{j}\left(a^{k}\right) \triangleq & \delta a^{j, k} \cdot \nabla_{\tilde{a}^{j, k}}^{2} Q^{j}\left(s, a^{j}, \tilde{a}^{j, k}\right) \cdot \delta a^{j, k} \\ & \tilde{a}^{j, k}=\bar{a}^{j}+\epsilon^{j, k} \delta a^{j, k} \end{aligned}$

A softmax MF-Q policy

Given the current agent's action $a^{j}$ and the distribution of neighboring actions $\bar{a}^{j}$, we build the $Q$ function and can then derive a policy from it, for example a softmax policy or an $\varepsilon$-greedy policy, and update it with the learning algorithm.

$\pi_{t}^{j}\left(a^{j} | s, \bar{a}^{j}\right)=\frac{\exp \left(\beta Q_{t}^{j}\left(s, a^{j}, \bar{a}^{j}\right)\right)}{\sum_{a^{j^{\prime}} \in \mathcal{A}^{j}} \exp \left(\beta Q_{t}^{j}\left(s, a^{j^{\prime}}, \bar{a}^{j}\right)\right)}$
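The softmax policy above takes only a few lines. In the sketch below, the list `q_values` stands for $Q_{t}^{j}(s, a^{j}, \bar{a}^{j})$ evaluated at every own action $a^{j}$ for one fixed $s$ and $\bar{a}^{j}$; the numbers are made up. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the resulting probabilities.

```python
import math


def softmax_policy(q_values, beta):
    """Boltzmann distribution over actions with inverse temperature beta."""
    top = max(q_values)
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0, 3.0]  # hypothetical Q-values for three own actions
exploring = softmax_policy(q_values, beta=0.0)  # beta = 0 ignores the Q-values
moderate = softmax_policy(q_values, beta=1.0)
greedy_like = softmax_policy(q_values, beta=50.0)
print(exploring, moderate, greedy_like)
```

With $\beta = 0$ the policy is uniform, and as $\beta$ grows it concentrates on the action with the largest Q-value, so $\beta$ controls the trade-off between exploration and exploitation.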

Given an experience $\langle s,\boldsymbol{a},\boldsymbol{r},s^{\prime},\bar{\boldsymbol{a}} \rangle$ sampled from the replay buffer:

Sample the next action $a_{-}^{j}$ from $Q_{\phi_{-}^{j}}$.
Set $y^{j} = r^{j} +\gamma Q_{\phi_{-}^{j}}(s^{\prime},a_{-}^{j},\bar{a}^{j})$
Update Q function with the loss function:

$\mathcal{L}(\phi^{j}) = (y^{j} - Q_{\phi^{j}}(s,a^{j},\bar{a}^{j}))^{2}$
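The three steps can be sketched for one agent as follows. Several things here are assumptions made for illustration and are not the setup of the cited paper: the Q-function and the target Q-function are plain dictionaries keyed by (state, own action, mean action) instead of neural networks with parameters $\phi^{j}$ and $\phi_{-}^{j}$, the mean action is stored as a tuple so that it can serve as a key, and "sampling from $Q_{\phi_{-}^{j}}$" is read as sampling from the softmax policy of the target table.

```python
import math
import random


def softmax(values, beta):
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def mfq_step(q, q_target, experience, actions, gamma, beta, lr, rng):
    """One MF-Q update for a single agent on tabular Q-functions."""
    s, a, r, s_next, a_bar = experience
    # Step 1: sample the next action from the target Q-function's softmax policy.
    next_qs = [q_target.get((s_next, b, a_bar), 0.0) for b in actions]
    a_next = rng.choices(actions, weights=softmax(next_qs, beta))[0]
    # Step 2: build the target y = r + gamma * Q_target(s', a_next, a_bar).
    y = r + gamma * q_target.get((s_next, a_next, a_bar), 0.0)
    # Step 3: squared loss, and one gradient-descent step on the table entry.
    key = (s, a, a_bar)
    old = q.get(key, 0.0)
    loss = (y - old) ** 2
    q[key] = old + lr * 2.0 * (y - old)  # d(loss)/dQ = -2 * (y - Q)
    return y, loss


q, q_target = {}, {}  # empty tables: every Q-value starts at 0
experience = ("s0", 1, 1.0, "s1", (0.5, 0.25, 0.25))  # (s, a^j, r^j, s', mean action)
y, loss = mfq_step(q, q_target, experience, actions=[0, 1, 2],
                   gamma=0.9, beta=1.0, lr=0.25, rng=random.Random(0))
print(y, loss, q)
```

Because the target table is still all zeros, the target equals the reward, the loss is 1, and the entry moves from 0 to 0.5. In practice the target table would be synchronized with `q` periodically, in the same way a target network is.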

MF-Q Convergence

Theorem: In a finite-state stochastic game, under suitable technical assumptions (see Yang et al., 2018), the Q values computed by the update rule of MF-Q converge to the Nash Q-value.

Summary of MARL

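In multi-agent training, a model is usually first trained against copies of itself (self-play) and only afterwards evaluated against other models. If it is trained directly against a particular opponent model, it can easily find that model's weaknesses and overfit to them.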

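In the multi-agent setting an agent has to learn not only from data but also about the other agents' policies, so results from game theory are often needed. Machine learning usually deals with an objective between a model and data; in the multi-agent setting the objectives between agents also have to be learned.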