MMoE: An Introduction to the Multi-task Learning Model and a Brief Source-Code Walkthrough

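Preface (unrelated to the main text; feel free to skip)

I plan to write about DMT later, so I am covering some basic building blocks first.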

Paper information

Title: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
Paper URL: https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture-
Code URL: https://github.com/drawbridge/keras-mmoe
Published: KDD, 2018
Authors: Jiaqi Ma; Zhe Zhao; Xinyang Yi; Jilin Chen; Lichan Hong; Ed Chi
Affiliation: Google

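Key ideas

This paper introduces the MMoE (Multi-gate Mixture-of-Experts) model. It targets a weakness of conventional multi-task networks, which mostly use a Shared-Bottom structure: they can perform poorly when the tasks are only weakly related. Prior studies have shown that the quality of a multi-task model depends heavily on how related its tasks are.
MMoE borrows the idea of MoE and introduces several Experts (that is, several neural networks), then adds one gating network per task. Each gating network learns, for its own task, a different way of combining the Experts, i.e. it adaptively weights the Experts' outputs. Frankly, this looks a lot like attention: the Experts produce a sequence of embeddings, the gating network learns adaptive weights and takes a weighted sum of the Experts' outputs, and the result is then fed into the tower network of the corresponding task. Note that the number of gating networks equals the number of tasks.
The paper also generates synthetic data in which task relatedness can be controlled, and uses it to show experimentally that MMoE still performs well when the tasks are only weakly related.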

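Key ideas explained

In recommendation scenarios we often have to optimize several objectives at once. For example, we want to recommend items the user is interested in and also encourage the user to buy them, so it makes sense to build a multi-task learning model that optimizes all objectives jointly. Three common multi-task learning models are: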

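(a) Shared-Bottom Model: each task has its own tower, which captures task-specific information, while all tasks share the underlying Shared-Bottom structure, which learns information common to the tasks. When the tasks are highly related, they help each other and each task improves; when the tasks differ a lot, however, quality drops.
(b) One-gate MoE Model: in Mixture-of-Experts (MoE), the bottom consists of several mutually independent Expert networks, and a single Gate network, shared by all tasks, learns how much each Expert contributes to the output. Written precisely, MoE is: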

$y=\sum_{i=1}^{n} g(x)_{i} f_{i}(x)$

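where $\sum_{i=1}^{n} g(x)_{i}=1$, $g(x)_i$ is the $i$-th entry of the output of the Gate network $g(x)$, $f_i(x)$ is the output of the $i$-th Expert, and $n$ is the number of Experts. In other words, the Gate output is used as the weights of a weighted sum over the Experts' outputs.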

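(c) MMoE Model: the model proposed in the paper. It borrows the idea of MoE and introduces several Experts, then adds one gating network per task. Each gating network learns, for its own task, a different way of combining the Experts, i.e. it adaptively weights the Experts' outputs. Formally, MMoE is: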

$\begin{aligned} y_{k} &=h^{k}\left(f^{k}(x)\right) \\ \text { where } f^{k}(x) &=\sum_{i=1}^{n} g^{k}(x)_{i} f_{i}(x) . \end{aligned}$

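where $y_k$ is the output of the $k$-th task, $h^k$ is the tower of the $k$-th task, and $f^k(x)$ is the input to that tower. $f^k(x)$ is a weighted sum of the outputs of the $n$ Experts, with the weights produced by the Gate network of the $k$-th task. Note that in MMoE the number of Gate networks equals the number of tasks.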

The Gate network $g^{k}(x)$ can be written as:

$g^{k}(x)=\operatorname{softmax}\left(W_{g k} x\right)$

That is, it is simply a linear transformation of the input embedding followed by a softmax.
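To make the three formulas above concrete, here is a minimal NumPy sketch of an MMoE forward pass. It is an illustration under assumptions, not the paper's or the repository's implementation: the helper names (`softmax`, `mmoe_forward`), the single-layer ReLU experts, the purely linear towers and all sizes are made up for the example, and the gate weight is stored as `[d, n]` (the transpose of $W_{gk}$) so that it can be applied as `x @ w_g`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    """x: [B, d]; expert_weights: n arrays of shape [d, E];
    gate_weights: K arrays of shape [d, n]; tower_weights: K arrays of shape [E, 1]."""
    # f_i(x): one ReLU layer per expert (an assumption), stacked to [B, n, E]
    experts = np.stack([np.maximum(x @ w, 0.0) for w in expert_weights], axis=1)
    outputs, gates = [], []
    for w_g, w_h in zip(gate_weights, tower_weights):
        g = softmax(x @ w_g)                         # g^k(x): [B, n], rows sum to 1
        f_k = (g[:, :, None] * experts).sum(axis=1)  # f^k(x): [B, E]
        outputs.append(f_k @ w_h)                    # y_k = h^k(f^k(x)), linear tower (assumption)
        gates.append(g)
    return outputs, gates

# Hypothetical sizes: batch, input dim, expert output dim, experts, tasks
B, d, E, n, K = 4, 8, 5, 3, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(B, d))
expert_weights = [rng.normal(size=(d, E)) for _ in range(n)]
gate_weights = [rng.normal(size=(d, n)) for _ in range(K)]
tower_weights = [rng.normal(size=(E, 1)) for _ in range(K)]

outputs, gates = mmoe_forward(x, expert_weights, gate_weights, tower_weights)
```

The experts are computed once and shared; only the gate and the tower differ per task, which is why the loop runs over tasks rather than over experts.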

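Source code walkthrough

The code is at https://github.com/drawbridge/keras-mmoe/blob/master/mmoe.py; only the MMoE implementation in that file needs to be read.

Initialization creates the Experts and Gate networks (non-essential parts of the code are omitted here).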

class MMoE(Layer):
    """
    Multi-gate Mixture-of-Experts model.
    """

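    def __init__(self,
                 units,  # number of hidden units in each Expert
                 num_experts,  # number of Experts
                 num_tasks,  # number of tasks
                 use_expert_bias=True,
                 use_gate_bias=True,
                 expert_activation='relu',
                 gate_activation='softmax',
                 # ... other parameters
                 **kwargs):
        # Body omitted in this excerpt: the constructor stores the arguments
        # above as attributes (self.units, self.num_experts, ...), which
        # build() and call() read below.
        ...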

    def build(self, input_shape):
        """
        We assume the input tensor has shape [B, I],
        where B is the batch size and I is the size of the input embedding.
        The number of hidden units (units) is denoted by E,
        the number of Experts by N,
        and the number of tasks by K.
        """
        assert input_shape is not None and len(input_shape) >= 2

        input_dimension = input_shape[-1]

        """
        Initialize the Experts' kernels as a single weight of shape [I, E, N],
        where I is the input embedding size, E is the output size of each
        Expert, and N is the number of Experts.
        """
        self.expert_kernels = self.add_weight(
            name='expert_kernel',
            shape=(input_dimension, self.units, self.num_experts),
            initializer=self.expert_kernel_initializer,
            regularizer=self.expert_kernel_regularizer,
            constraint=self.expert_kernel_constraint,
        )

        """
        Initialize the Experts' bias, of shape [E, N].
        """
        if self.use_expert_bias:
            self.expert_bias = self.add_weight(
                name='expert_bias',
                shape=(self.units, self.num_experts),
                initializer=self.expert_bias_initializer,
                regularizer=self.expert_bias_regularizer,
                constraint=self.expert_bias_constraint,
            )

        """
        Initialize the Gate networks. There is one Gate per task, so
        self.gate_kernels is a list of length K. Each Gate's weight has
        shape [I, N], where I is the input embedding size and N is the
        number of Experts. The output of a Gate holds the weight assigned
        to each Expert.
        """
        self.gate_kernels = [self.add_weight(
            name='gate_kernel_task_{}'.format(i),
            shape=(input_dimension, self.num_experts),
            initializer=self.gate_kernel_initializer,
            regularizer=self.gate_kernel_regularizer,
            constraint=self.gate_kernel_constraint
        ) for i in range(self.num_tasks)]

        """
        Initialize the Gates' biases. self.gate_bias is a list of length K,
        and each bias has shape (N,).
        """
        if self.use_gate_bias:
            self.gate_bias = [self.add_weight(
                name='gate_bias_task_{}'.format(i),
                shape=(self.num_experts,),
                initializer=self.gate_bias_initializer,
                regularizer=self.gate_bias_regularizer,
                constraint=self.gate_bias_constraint
            ) for i in range(self.num_tasks)]

        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dimension})

        super(MMoE, self).build(input_shape)
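As a quick check on the shapes created in build, the total number of trainable weights in the layer can be worked out by hand: $I \cdot E \cdot N + E \cdot N$ for the Experts plus $K \cdot (I \cdot N + N)$ for the Gates. The helper below is a hypothetical convenience function written for this article (it is not part of keras-mmoe), and the sizes in the example are arbitrary.

```python
def mmoe_param_count(input_dim, units, num_experts, num_tasks,
                     use_expert_bias=True, use_gate_bias=True):
    """Count the weights created in MMoE.build (I=input_dim, E=units,
    N=num_experts, K=num_tasks)."""
    # expert_kernels: [I, E, N]
    total = input_dim * units * num_experts
    if use_expert_bias:
        total += units * num_experts                  # expert_bias: [E, N]
    # gate_kernels: K weights of shape [I, N]
    total += num_tasks * input_dim * num_experts
    if use_gate_bias:
        total += num_tasks * num_experts              # gate_bias: K weights of shape (N,)
    return total

# Arbitrary example sizes: I=16, E=4, N=8, K=2
count = mmoe_param_count(input_dim=16, units=4, num_experts=8, num_tasks=2)
print(count)  # 512 + 32 + 256 + 16 = 816
```

The count covers only the MMoE layer itself; the per-task towers are built outside this layer and are not included.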

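The actual MMoE computation, i.e. the forward pass of the network, is implemented as follows.

Before reading the code, it helps to look at tf.tensordot (see the official TensorFlow documentation for tf.tensordot). With axes=1, tf.tensordot contracts the last axis of a with the first axis of b, so for the rank-2 inputs used in the code below, tf.tensordot(a, b, axes=1) is equivalent to tf.tensordot(a, b, axes=[[1], [0]]).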
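The equivalence can be checked on small arrays. The sketch below uses np.tensordot as a stand-in for tf.tensordot (an assumption made only to keep the example dependency-light; the two follow the same axes convention), with made-up sizes for B, I, E and N.

```python
import numpy as np

# Made-up sizes: batch, input embedding, expert output, number of experts
B, I, E, N = 2, 3, 4, 5
rng = np.random.default_rng(0)
inputs = rng.normal(size=(B, I))             # [B, I]
expert_kernels = rng.normal(size=(I, E, N))  # [I, E, N]

# axes=1: contract the last axis of `inputs` with the first axis of `expert_kernels`
out_scalar = np.tensordot(inputs, expert_kernels, axes=1)
# The same contraction with the axes spelled out
out_explicit = np.tensordot(inputs, expert_kernels, axes=[[1], [0]])
# And once more as an einsum, which makes the index pattern explicit
out_einsum = np.einsum('bi,ien->ben', inputs, expert_kernels)

print(out_scalar.shape)  # (2, 4, 5), i.e. [B, E, N]
```

Slicing the result along the last axis gives one matrix product per Expert: out_scalar[:, :, j] equals inputs @ expert_kernels[:, :, j].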

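    def call(self, inputs, **kwargs):
        """
        Forward pass: compute the Experts' outputs, the per-task Gate
        weights, and the gate-weighted sum of Experts for each task.
        """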
        gate_outputs = []
        final_outputs = []

        # f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper
        """
        The input tensor `inputs` has shape [B, I], and
        self.expert_kernels has shape [I, E, N], where I is the input
        embedding size, E is the Experts' output size and N is the number
        of Experts. For a rank-2 `a`, tf.tensordot(a, b, axes=1) is
        equivalent to tf.tensordot(a, b, axes=[[1], [0]]),
        so expert_outputs has shape [B, E, N].
        """
        expert_outputs = K.tf.tensordot(a=inputs, b=self.expert_kernels, axes=1)
        # Add the bias term to the expert weights if necessary
        if self.use_expert_bias:
            expert_outputs = K.bias_add(x=expert_outputs, bias=self.expert_bias)
        """
        After adding the bias and applying the activation (relu),
        expert_outputs still has shape [B, E, N].
        """
        expert_outputs = self.expert_activation(expert_outputs)

        # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
        """
        Compute one Gate per task, here with a for loop over the K tasks.
        inputs has shape [B, I] and gate_kernel has shape [I, N], where I
        is the input embedding size and N is the number of Experts, so
        K.dot performs a matrix multiplication of inputs and gate_kernel
        and gate_output has shape [B, N].
        gate_activation is softmax, so after the bias and the activation
        gate_output still has shape [B, N] and holds the weight assigned
        to each Expert.
        """
        for index, gate_kernel in enumerate(self.gate_kernels):
            gate_output = K.dot(x=inputs, y=gate_kernel)
            # Add the bias term to the gate weights if necessary
            if self.use_gate_bias:
                gate_output = K.bias_add(x=gate_output, bias=self.gate_bias[index])
            gate_output = self.gate_activation(gate_output)
            gate_outputs.append(gate_output)

        # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
        """
        gate_outputs is a list of length K (the number of tasks); each
        gate_output has shape [B, N], while expert_outputs has shape
        [B, E, N]. We therefore first apply expand_dims to gate_output on
        axis=1, giving expanded_gate_output of shape [B, 1, N];
        K.repeat_elements then expands it to [B, E, N], and multiplying by
        expert_outputs gives weighted_expert_output of shape [B, E, N].
        At this point every Expert's output has been scaled by its weight,
        so all that remains is to sum over the Experts:
        K.sum(weighted_expert_output, axis=2) has shape [B, E].
        """
        for gate_output in gate_outputs:
            expanded_gate_output = K.expand_dims(gate_output, axis=1) ## [B, 1, N]
            weighted_expert_output = expert_outputs * K.repeat_elements(expanded_gate_output, self.units, axis=1)  ## [B, E, N]
            final_outputs.append(K.sum(weighted_expert_output, axis=2)) ## [B, E]

        return final_outputs
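The last loop (expand_dims, repeat_elements, multiply, sum) is just the weighted sum $f^{k}(x)=\sum_{i} g^{k}(x)_{i} f_{i}(x)$ applied to every example in the batch. The NumPy sketch below mirrors those steps for one task and compares them with two shorter formulations; np.repeat is used as a stand-in for K.repeat_elements, and the sizes and random values are made up for illustration.

```python
import numpy as np

# Made-up sizes: batch, expert output size, number of experts
B, E, N = 2, 4, 3
rng = np.random.default_rng(0)
expert_outputs = rng.normal(size=(B, E, N))  # [B, E, N]

# A softmax over random logits plays the role of one task's gate_output
logits = rng.normal(size=(B, N))
gate_output = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # [B, N]

# Step by step, as in the loop of call()
expanded = np.expand_dims(gate_output, axis=1)             # [B, 1, N]
repeated = np.repeat(expanded, E, axis=1)                  # [B, E, N]
combined_repeat = (expert_outputs * repeated).sum(axis=2)  # [B, E]

# Broadcasting makes the explicit repeat unnecessary
combined_broadcast = (expert_outputs * expanded).sum(axis=2)

# The same contraction written as an einsum over the expert axis
combined_einsum = np.einsum('ben,bn->be', expert_outputs, gate_output)
```

All three variants compute the same [B, E] tensor; the explicit repeat in the original code only makes the shapes easier to follow.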

Summary

Feeling a bit blue~ and in a daze~~