[Paper reading] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts (MMoE model)


Abstract

Multi-task learning based on neural networks is widely used in real-world applications. However, the prediction quality of commonly used multi-task models is sensitive to the relationships between tasks.
The MMoE model proposed in this paper learns to model task relationships from data. It adapts the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert sub-networks across all tasks, while training a separate gating network for each task.

Introduction

Background : Take a recommendation system as an example. When a movie is recommended to a user, we hope not only that the user buys and watches the movie, but also that the user comes back to watch or buy more movies afterwards. In other words, we want a model that simultaneously predicts user purchases and ratings.

The problem : Many DNN-based multi-task learning models are sensitive to factors such as differences in data distribution and the relationships between tasks. Inherent conflicts caused by task differences can actually hurt the predictions of at least some tasks, especially when model parameters are extensively shared among all tasks.

Previous work, limitation 1 : Prior work assumes a specific data-generation process for each task, measures task differences under that assumption, and then makes suggestions accordingly. However, real tasks have much more complex data patterns, which makes task differences hard to measure.

Previous work, limitation 2 : More recent work proposes techniques that do not rely on an explicit measure of task difference, but these techniques usually add many more model parameters per task, which may hurt model quality and is also expensive in real production settings.

MMoE : Inspired by the Mixture-of-Experts (MoE) model, this paper explicitly models task relationships and learns task-specific functionality while exploiting shared representations. It allows parameters to be automatically allocated to capture either shared task information or task-specific information, avoiding the need to add many new parameters per task.
The backbone of MMoE is built on the most commonly used multi-task DNN structure, the shared-bottom model, shown in the figure below: several bottom layers following the input layer are shared across all tasks, and each task then has its own "tower" network on top of the shared bottom.
[Figure: shared-bottom multi-task model]
As shown in the figure below, instead of one bottom network shared by all tasks, the model has a group of bottom networks, each of which is called an expert. In this paper each expert is a feed-forward network. A gating network is then introduced: it takes the input features and outputs softmax gates that assemble the experts with different weights, so that different tasks utilize the experts in different ways. The assembled expert outputs are then passed into task-specific tower networks. In this way, the gating networks of different tasks can learn different mixture patterns over the experts, and thus capture task relationships.
[Figure: Multi-gate Mixture-of-Experts (MMoE) model]
Experimental design : Pearson correlation is used to measure and control task relatedness. Two synthetic regression tasks are generated, with sine functions in the data-generation mechanism to introduce nonlinearity. MMoE outperforms baseline methods, especially when task relatedness is low. The experiments also show that MMoE is easier to train, which relates to the recent finding that modulation and gating mechanisms can improve the trainability of non-convex deep neural networks.
Contributions :

  1. A novel Multi-gate Mixture-of-Experts model is proposed that explicitly models task relationships. Through modulation and gating networks, the model automatically adjusts its parameterization between modeling shared information and modeling task-specific information.
  2. Control experiments were performed using synthetic data. We report how task dependencies affect training dynamics in multi-task learning, and how MMoE improves model expressiveness and trainability.
  3. Experiments are performed on benchmark data and real-world environments, showing effectiveness.

Related Work

Multi-task Learning in DNNs

Multi-task models can learn the commonalities and differences between tasks, which improves both the efficiency and the model quality of each task. Caruana proposed the shared-bottom multi-task model structure, in which bottom layers are shared across tasks. This structure reduces the risk of overfitting, but may cause optimization conflicts because all tasks share the same bottom parameters.

How does task dependency affect model quality? Previous work generated and manipulated different types of task dependencies using synthetic data to evaluate the effectiveness of multi-task models.

More recent work mainly looks for ways to add more task-specific parameters (compared with the shared-bottom model), which performs better when task differences cause conflicting parameter updates. However, more tasks then mean many more parameters, which is inefficient and may be impractical in large-scale models.

Ensemble of Subnets & Mixture of Experts

This paper applies recent findings on parameter modulation and ensembles of subnetworks to modeling task relationships in multi-task learning.

Eigen et al. and Shazeer et al. turned the mixture-of-experts model into a basic building block (the MoE layer) and stacked it in a DNN. The MoE layer selects subnetworks (experts) based on the layer's input at both training time and serving time. The resulting model is therefore not only more powerful in modeling, but also reduces computational cost by introducing sparsity into the gating network.

Multi-task Learning Applications

On multilingual machine translation tasks, under the condition of sharing model parameters, the translation tasks with limited training data can be improved by joint learning with tasks with large training data.

Multi-task learning helps to provide context-aware recommendations.

PRELIMINARY

Shared-bottom Multi-task Model

As shown in the shared-bottom figure above, given $K$ tasks, the model consists of a shared-bottom network, denoted as $f$, and $K$ tower networks $h^k$, $k = 1, \dots, K$. The shared-bottom network follows the input layer, and each tower network is built on top of the shared-bottom output. For task $k$, the model can be expressed as:

$$y_k = h^k(f(x))$$
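
As a concrete illustration, below is a minimal PyTorch sketch of this structure; the two-task setup and all layer sizes are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SharedBottomModel(nn.Module):
    """y_k = h^k(f(x)): one shared bottom f, one tower h^k per task."""
    def __init__(self, input_dim, bottom_dim=64, tower_dim=32, num_tasks=2):
        super().__init__()
        # Shared bottom network f, connected to the input layer.
        self.bottom = nn.Sequential(
            nn.Linear(input_dim, bottom_dim),
            nn.ReLU(),
        )
        # One task-specific tower h^k per task.
        self.towers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(bottom_dim, tower_dim),
                nn.ReLU(),
                nn.Linear(tower_dim, 1),
            )
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        shared = self.bottom(x)           # f(x), shared by all tasks
        return [tower(shared) for tower in self.towers]  # [y_1, ..., y_K]
```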

Synthetic Data Generation

The performance of multi-task learning models depends heavily on the task dependencies inherent in the data. However, studying data correlations for practical applications is difficult because the correlations cannot be easily changed. Therefore, research was first conducted using synthetic data.
Two regression tasks are generated, and the Pearson correlation of their labels is used as the quantitative indicator of task relatedness. Since the models are DNNs, the regression functions are set to be combinations of sine functions so that the mapping is nonlinear. Data are generated as follows:

  1. Given the input feature dimension $d$, generate two orthogonal unit vectors $u_1, u_2 \in \mathbb{R}^d$, i.e.
    $$u_1^T u_2 = 0, \quad \lVert u_1 \rVert_2 = \lVert u_2 \rVert_2 = 1$$
  2. Given a scale constant $c$ and a correlation score $-1 \le p \le 1$, generate two weight vectors $w_1, w_2$ such that
    $$w_1 = c\,u_1, \quad w_2 = c\,(p\,u_1 + \sqrt{1 - p^2}\,u_2)$$
  3. Randomly sample an input data point $x \in \mathbb{R}^d$, with each of its elements drawn from $N(0, 1)$.
  4. Generate labels $y_1, y_2$ for the two regression tasks as follows:
    $$y_1 = w_1^T x + \sum_{i=1}^{m} \sin(\alpha_i w_1^T x + \beta_i) + \epsilon_1, \quad y_2 = w_2^T x + \sum_{i=1}^{m} \sin(\alpha_i w_2^T x + \beta_i) + \epsilon_2$$
    where $\alpha_i, \beta_i$, $i = 1, \dots, m$, control the shapes of the sine functions and $\epsilon_1, \epsilon_2 \sim N(0, 0.01)$ are noise terms.
  5. Repeat (3) and (4) until enough data is generated.

It is not straightforward to generate labels with an exactly specified Pearson correlation. Instead, the cosine similarity of the weight vectors $w_1, w_2$ is controlled, $\cos(w_1, w_2) = p$, and the resulting label Pearson correlation is then measured.
In the linear case (without the sine terms), the label correlation of $y_1, y_2$ is exactly $p$; in the nonlinear case, the label correlation is still positively related to $p$.
[Figure: label Pearson correlation vs. weight-vector cosine similarity]
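
Below is a NumPy sketch of this generation procedure, under stated assumptions: the number of sine terms `m`, the ranges of the sine parameters $\alpha_i, \beta_i$, and the noise scale are arbitrary choices for illustration rather than the paper's exact values.

```python
import numpy as np

def generate_synthetic_tasks(num_samples, d=100, c=1.0, p=0.5, m=6, seed=0):
    """Generate two regression tasks whose label correlation is controlled by p."""
    rng = np.random.default_rng(seed)
    # Step 1: two orthogonal unit vectors u1, u2 (columns of Q from a QR decomposition).
    q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
    u1, u2 = q[:, 0], q[:, 1]
    # Step 2: weight vectors with cos(w1, w2) = p.
    w1 = c * u1
    w2 = c * (p * u1 + np.sqrt(1.0 - p ** 2) * u2)
    # Sine-mixture parameters (arbitrary fixed choices for illustration).
    alpha = rng.uniform(0.5, 1.5, size=m)
    beta = rng.uniform(0.0, 1.0, size=m)
    # Steps 3-5: sample inputs from N(0, 1) and generate both labels.
    x = rng.standard_normal((num_samples, d))

    def label(w):
        lin = x @ w
        nonlin = sum(np.sin(alpha[i] * lin + beta[i]) for i in range(m))
        noise = rng.normal(0.0, 0.01, size=num_samples)  # small Gaussian noise (scale is an assumption)
        return lin + nonlin + noise

    return x, label(w1), label(w2)

# The realized label Pearson correlation can then be measured directly:
x, y1, y2 = generate_synthetic_tasks(10_000, p=0.8)
print(np.corrcoef(y1, y2)[0, 1])
```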

Impact of Task Relatedness

In order to verify that low task correlation can impair model quality in the baseline multi-task model setting, a control experiment is performed as follows.

  1. Given a list of task correlation scores, generate a synthetic dataset for each score;
  2. Train one shared-bottom multi-task model on each dataset, keeping all model and training hyperparameters the same;
  3. Repeat steps (1) and (2) hundreds of times with independently generated datasets, but with the same list of task correlation scores and the same hyperparameters;
  4. Compute the average performance of the model for each task correlation score (a rough sketch of this loop follows the list).
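
Below is a rough Python sketch of this control loop. It reuses `generate_synthetic_tasks` from the earlier sketch; `train_shared_bottom_and_eval` is a hypothetical helper (not defined here) that trains one shared-bottom model with fixed hyperparameters and returns its evaluation loss, and the score list and trial count are assumptions.

```python
import numpy as np

correlation_scores = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]  # step 1: list of scores (values are assumptions)
num_trials = 200                                      # "hundreds of times"; exact count is an assumption

results = {p: [] for p in correlation_scores}
for trial in range(num_trials):                       # step 3: repeat with freshly generated datasets
    for p in correlation_scores:
        x, y1, y2 = generate_synthetic_tasks(10_000, p=p, seed=trial)   # step 1: one dataset per score
        # step 2: hypothetical helper, fixed model and training hyperparameters
        results[p].append(train_shared_bottom_and_eval(x, y1, y2))

for p in correlation_scores:                          # step 4: average performance per correlation score
    print(f"correlation {p}: mean loss {np.mean(results[p]):.4f}")
```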

Results:
[Figure: average loss of the shared-bottom model on data with different task correlations]
The higher the correlation between tasks, the better the performance; the lower the correlation, the worse the performance.
Thoughts : Why does this happen? Because the multi-task network shares the bottom network: if the tasks are similar, the shared parameters can fit all tasks well; if the tasks are dissimilar, the shared parameters are more likely to receive conflicting updates during optimization, so they cannot fit all tasks well and the results degrade.

MODELING APPROACHES

Mixture-of-Experts

The original MoE model can be expressed as:

$$y = \sum_{i=1}^{n} g(x)_i f_i(x)$$

where $\sum_{i=1}^{n} g(x)_i = 1$, and $g(x)_i$, the $i$-th logit of the output of $g(x)$, represents the probability for expert $f_i$.

Here, $f_i$, $i = 1, \dots, n$ are the $n$ expert networks, and $g$ is a gating network that ensembles the results of all experts. More specifically, the gating network $g$ produces a distribution over the $n$ experts based on the input, and the final output is a weighted sum of the outputs of all experts.
MoE Layer : The MoE layer has the same structure as the MoE model, but accepts the output of the previous layer as its input and feeds its output to a successive layer. The whole model is then trained in an end-to-end fashion. The goal is conditional computation, where only parts of the network are active for each example: for each input example, the model selects only a subset of experts via a gating network conditioned on the input.
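
The following is a minimal PyTorch sketch of a dense MoE layer as described above; the sparse top-k expert selection of Shazeer et al. is omitted for simplicity, and the expert sizes are assumptions.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """y = sum_i g(x)_i * f_i(x): softmax gate over n expert networks."""
    def __init__(self, input_dim, expert_dim, num_experts):
        super().__init__()
        # n expert networks f_i, each a small feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        ])
        # Gating network g: linear map of the input, turned into a distribution by softmax.
        self.gate = nn.Linear(input_dim, num_experts, bias=False)

    def forward(self, x):
        # expert_outputs: (batch, num_experts, expert_dim)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, num_experts)
        # Weighted sum of expert outputs.
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)
```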

Multi-gate Mixture-of-Experts

The core idea of the model in this paper is to replace the shared-bottom network $f$ with the MoE layer just described, and to add a separate gating network $g^k$ for each task. The output of task $k$ is:

$$y_k = h^k\big(f^k(x)\big), \quad f^k(x) = \sum_{i=1}^{n} g^k(x)_i f_i(x)$$

The experts are implemented as identical multilayer perceptrons with ReLU activations. Each gating network is simply a linear transformation of the input followed by a softmax layer:

$$g^k(x) = \mathrm{softmax}\big(W_{gk}\,x\big)$$

where $W_{gk} \in \mathbb{R}^{n \times d}$ is a trainable matrix, $n$ is the number of experts, and $d$ is the input feature dimension.

Each gating network learns to select the subset of experts it needs, which makes parameter sharing more flexible. MMoE can model task relationships in a sophisticated way by learning how the expert mixtures chosen by different gates overlap with each other. To show that having multiple gates is indeed useful, a one-gate MoE (OMoE) model is also created for comparison, as shown in the figure below.
[Figure: one-gate Mixture-of-Experts (OMoE) model]
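
Putting the pieces together, here is a minimal PyTorch sketch of MMoE with shared experts, one gate per task, and one tower per task; the number of experts, the layer sizes, and the two scalar regression towers are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Multi-gate MoE: shared experts, one softmax gate and one tower per task."""
    def __init__(self, input_dim, num_experts=8, expert_dim=16, tower_dim=8, num_tasks=2):
        super().__init__()
        # Experts: identical ReLU MLPs, shared across all tasks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        ])
        # One gate per task: g^k(x) = softmax(W_gk x), with W_gk of shape (n, d).
        self.gates = nn.ModuleList([
            nn.Linear(input_dim, num_experts, bias=False) for _ in range(num_tasks)
        ])
        # One tower h^k per task, on top of that task's gated mixture of experts.
        self.towers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(expert_dim, tower_dim),
                nn.ReLU(),
                nn.Linear(tower_dim, 1),
            )
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n, expert_dim)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)         # (B, n, 1)
            mixed = (weights * expert_outputs).sum(dim=1)                  # f^k(x)
            outputs.append(tower(mixed))                                   # y_k = h^k(f^k(x))
        return outputs
```

Note how the only structural difference from the OMoE variant is the per-task gate: with a single shared gate, all towers would receive the same expert mixture.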

MMOE ON SYNTHETIC DATA

This section investigates whether MMoE is indeed better when tasks are less related, again using synthetic data, and also shows that MMoE is easier to train.

Performance on Data with Different Task Correlations

The previous experiments are repeated using the MMoE model and two baseline models: the shared bottom model and the OMoE model.
Results:
[Figure: performance of the shared-bottom, OMoE, and MMoE models on data with different task correlations]

  1. For all models, performance on more correlated data is better than on less correlated data.
  2. The gap between the performance of the MMoE model on data with different correlations is much smaller than that of the OMoE model and the shared bottom model.

Trainability

Trainability: the robustness of the model across ranges of hyperparameter settings and model initializations.

Method: The experiments are repeated multiple times under each setting. In each run, the data are generated from the same distribution but with a different random seed, and the models are initialized differently.
[Figure: histograms of final loss values over repeated runs for the shared-bottom, OMoE, and MMoE models]
Note that the only difference between MMoE and OMoE is the presence or absence of multiple gate structures. This verifies the usefulness of the multi-gate structure in resolving bad local minima caused by task variance conflicts.

REAL DATA EXPERIMENTS

Experiments are carried out on real datasets to verify the effectiveness of the method.

Baseline Methods

In addition to the shared-bottom model, MMoE is compared with several state-of-the-art multi-task deep neural network models:

L2-Constrained : This method was designed for a cross-lingual problem with two tasks. In it, parameters of the different tasks are softly shared by an L2 constraint. Given $y_k$, the ground-truth label of task $k$, $k \in \{1, 2\}$, the prediction of task $k$ is expressed as
$$\hat{y}_k = f(x; \theta_k)$$
where $\theta_1, \theta_2$ are the model parameters. The objective function is:
$$\mathbb{E}\,L_1\big(y_1, f(x;\theta_1)\big) + \mathbb{E}\,L_2\big(y_2, f(x;\theta_2)\big) + \alpha \,\lVert \theta_1 - \theta_2 \rVert_2^2$$
where $\alpha$ is a hyperparameter; the method models task relatedness through the magnitude of $\alpha$.
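
As a sketch, the soft-sharing penalty can be implemented as below, assuming two task networks with identically shaped parameter lists and, purely for illustration, squared-error task losses:

```python
import torch
import torch.nn.functional as F

def l2_constrained_loss(y1_true, y1_pred, y2_true, y2_pred, params1, params2, alpha):
    """Per-task losses plus an L2 penalty that softly ties the two tasks' parameters."""
    # Squared-error task losses (the real tasks may use other losses; this is illustrative).
    task_losses = F.mse_loss(y1_pred, y1_true) + F.mse_loss(y2_pred, y2_true)
    # ||theta_1 - theta_2||^2 over corresponding, identically shaped parameter tensors.
    penalty = sum(((p1 - p2) ** 2).sum() for p1, p2 in zip(params1, params2))
    return task_losses + alpha * penalty

# Usage sketch: params1 / params2 would be e.g. list(model1.parameters()) and
# list(model2.parameters()) for two architecturally identical task networks.
```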

Cross-Stitch : This method shares knowledge between two tasks by introducing a "cross-stitch" unit. The cross-stitch unit takes the separated hidden layers $x_1$ and $x_2$ of tasks 1 and 2 as input and outputs $\tilde{x}_1^i$ and $\tilde{x}_2^i$ by:
$$\begin{bmatrix} \tilde{x}_1^i \\ \tilde{x}_2^i \end{bmatrix} = \begin{bmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{bmatrix} \begin{bmatrix} x_1^i \\ x_2^i \end{bmatrix}$$
where the $\alpha_{jk}$ are trainable parameters representing the cross transfer between the two tasks.

Tensor-Factorization : In this method, weights are modeled as tensors, and tensor factorization is used for parameter sharing across tasks. Given an input hidden-layer size $m$, an output hidden-layer size $n$, and a task number $k$, the weights $W$, an $m \times n \times k$ tensor, are derived by Tucker decomposition:
$$W = \sum_{i_1=1}^{r_1} \sum_{i_2=1}^{r_2} \sum_{i_3=1}^{r_3} S(i_1, i_2, i_3)\, U_1(:, i_1) \circ U_2(:, i_2) \circ U_3(:, i_3)$$
where the core tensor $S$ and the matrices $U_1, U_2, U_3$ are trainable parameters, and $r_1, r_2, r_3$ are hyperparameters (the ranks of the decomposition).
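
A minimal PyTorch sketch of a cross-stitch unit for two tasks follows; the near-identity initialization of the mixing matrix is an assumption for illustration.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learned 2x2 linear mixing of hidden activations x1, x2 from two tasks."""
    def __init__(self):
        super().__init__()
        # Trainable 2x2 cross-stitch matrix, initialized close to the identity
        # so each task initially keeps mostly its own representation.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x1, x2):
        x1_tilde = self.alpha[0, 0] * x1 + self.alpha[0, 1] * x2
        x2_tilde = self.alpha[1, 0] * x1 + self.alpha[1, 1] * x2
        return x1_tilde, x2_tilde
```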

Hyper-Parameter Tuning

A hyperparameter tuner is used to search for the best hyperparameters in the dataset.

Census-income Data

Experimental results on public datasets:
[Table: results on the Census-income multi-task benchmarks]

Large-scale Content Recommendation

Specifically, given the user's current behavior of consuming items, the goal of this recommender system is to show the user a list of related items to consume next.

CONCLUSION

We propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. Controlled experiments on synthetic data show that the proposed method handles less-correlated tasks better. Compared with baseline methods, MMoE is also easier to train. Through experiments on a benchmark dataset and a real large-scale recommender system, we demonstrate the advantages of the proposed method over several state-of-the-art baseline multi-task learning models.

Another important design consideration in practical machine learning production systems is computational efficiency. Since the gating networks are usually lightweight and the expert networks are shared across all tasks, the MMoE model largely preserves the computational advantage of the shared-bottom model.


Origin blog.csdn.net/no1xiaoqianqian/article/details/127716363