Alibaba Cloud open sources EasyTransfer: the industry's first deep transfer learning framework for NLP scenarios

Original link: https://zhuanlan.zhihu.com/p/267392773

Alibaba Cloud has officially open sourced the deep transfer learning framework EasyTransfer. This article introduces the core functions of the EasyTransfer framework in detail.

Published by Synced (机器之心), Synced editorial department.

Recently, Alibaba Cloud officially open sourced the deep transfer learning framework EasyTransfer, which is the industry's first deep transfer learning framework for NLP scenarios.

Open source link: github.com/alibaba/Easy

The framework is developed by the Alibaba Cloud Machine Learning PAI team and aims to make model pre-training and transfer learning for natural language processing scenarios easier and more efficient to develop and deploy.

Deep transfer learning for natural language processing scenarios is in huge demand in real applications. Because new domains keep emerging, traditional machine learning needs to accumulate a large amount of training data for each domain, which consumes a great deal of manpower and annotation resources. Deep transfer learning can transfer knowledge learned in a source domain to tasks in the new domain, greatly reducing the annotation effort required.

Although demand for deep transfer learning in natural language scenarios is high, the open source community does not yet have a complete framework, and building one that is simple, easy to use and high-performance is a huge challenge.

First, pre-trained models plus knowledge transfer is now the mainstream mode of NLP applications. Generally, the larger the pre-trained model, the more effective the learned knowledge representation. However, such large models pose a huge challenge to the framework's distributed architecture: how can a high-performance distributed architecture effectively support training of ultra-large models?

Second, user application scenarios are highly diverse, and no single transfer learning algorithm fits them all: how can a complete transfer learning toolkit be provided to improve results in downstream scenarios?

Third, the path from algorithm development to business deployment is usually long: how can a simple, easy-to-use, one-stop service from model training to deployment be provided?

Faced with these three challenges, the PAI team launched EasyTransfer, a simple, easy-to-use and high-performance transfer learning framework. The framework supports mainstream transfer learning algorithms, automatic mixed precision, compilation optimization, and efficient distributed data/model parallel strategies, and is suitable for industrial-grade distributed application scenarios.

It is worth mentioning that, with mixed precision, compilation optimization and distributed strategies, the ALBERT model supported by EasyTransfer is more than 4 times faster than the community version of ALBERT in terms of distributed training computing speed.

Having been proven in more than 10 business units (BUs) and more than 20 business scenarios within Alibaba, it offers NLP and transfer learning users a range of conveniences, including an industry-leading high-performance pre-training tool chain and pre-trained ModelZoo, a rich and easy-to-use AppZoo, efficient transfer learning algorithms, and full compatibility with Alibaba's PAI ecosystem products, providing a one-stop service from model training to deployment.

Lin Wei, head of the Alibaba Cloud Machine Learning PAI team, said: by open sourcing the EasyTransfer code, we hope to share Alibaba's capabilities with more users, lower the barrier to NLP pre-training and knowledge transfer, and work in depth with more partners to build a simple, easy-to-use, high-performance NLP and transfer learning tool.

<img src="https://pic2.zhimg.com/v2-bd7934043d3037fe46d6a47ef5698dc1_b.jpg" data-caption="" data-size="normal" data-rawwidth="692" data-rawheight="227" class="origin_image zh-lightbox-thumb" width="692" data-original="https://pic2.zhimg.com/v2-bd7934043d3037fe46d6a47ef5698dc1_r.jpg"/>

Six highlights of the framework

  • Simple and high-performance framework: it shields complex underlying implementations so that users only need to focus on the logical structure of the model, lowering the entry barrier for NLP and transfer learning; at the same time, the framework supports industrial-grade distributed application scenarios, with an improved distributed optimizer that, combined with automatic mixed precision, compilation optimization, and efficient distributed data/model parallel strategies, makes multi-machine multi-GPU distributed training more than 4 times faster in computing speed than the community version;
  • Language model pre-training tool chain: a complete pre-training tool chain lets users conveniently pre-train language models such as T5 and BERT; models pre-trained with this tool chain have achieved good results on the Chinese CLUE leaderboard and the English SuperGLUE leaderboard;
  • Rich and high-quality pre-trained ModelZoo: supports PAI-ModelZoo, with continue-pretrain and fine-tuning for mainstream models such as BERT, ALBERT, RoBERTa, XLNet and T5; it also supports the self-developed multi-modal model FashionBERT for the fashion/apparel domain;
  • Rich and easy-to-use AppZoo: supports mainstream NLP applications and applications of self-developed models, such as single-tower models like DAM++ and HCNN for text matching, a BERT two-tower plus vector-recall model, and models such as BERT-HAE for reading comprehension;
  • Automatic knowledge distillation tool: supports knowledge distillation from a large teacher model to a small student model. It integrates the task-adaptive BERT compression algorithm AdaBERT, which uses neural architecture search to find task-specific architectures for compressing the original BERT model, shrinking it to as little as 1/17 of its original size and speeding up inference by up to 29 times, with an accuracy loss within 3%;
  • Compatible with PAI ecosystem products: the framework is developed on top of PAI-TF, so users can enable PAI's self-developed, efficient distributed training and compilation optimization features through simple code or configuration changes; the framework is also fully compatible with PAI ecosystem products, including the PAI web components (PAI Studio), the development platform (PAI DSW), and the PAI serving platform (PAI EAS).

Platform architecture overview

The overall framework of EasyTransfer is shown in the figure below; its design simplifies the development of deep transfer learning algorithms as much as possible. The framework abstracts commonly used IO, layers, losses, optimizers and models; users can develop models on top of these interfaces or directly access the pre-trained model library ModelZoo for rapid modeling. The framework supports five transfer learning (TL) paradigms: model fine-tuning, feature-based TL, instance-based TL, model-based TL and meta learning. It also integrates AppZoo, which supports mainstream NLP applications and makes it easy to build common NLP algorithm applications. Finally, the framework is seamlessly compatible with PAI ecosystem products, giving users a one-stop experience from training to deployment.

<img src="https://pic1.zhimg.com/v2-d4d680ff72ee42d91638a41d117e2540_b.jpg" data-caption="" data-size="normal" data-rawwidth="692" data-rawheight="382" class="origin_image zh-lightbox-thumb" width="692" data-original="https://pic1.zhimg.com/v2-d4d680ff72ee42d91638a41d117e2540_r.jpg"/>

Platform functions in detail

The core functions of the EasyTransfer framework are described in detail below.

Simple and easy-to-use API design

<img src="https://pic1.zhimg.com/v2-77be06b23b004f060e38ad6c61656b44_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="493" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic1.zhimg.com/v2-77be06b23b004f060e38ad6c61656b44_r.jpg"/>

High-performance distributed framework

The EasyTransfer framework supports industrial-grade distributed application scenarios. With an improved distributed optimizer, combined with automatic mixed precision, compilation optimization, and efficient distributed data/model parallel strategies, its multi-machine multi-GPU distributed training is more than 4 times faster in computing speed than the community version.

<img src="https://pic3.zhimg.com/v2-29ed4e6e1abe5446efbcae928d817ffe_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="628" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic3.zhimg.com/v2-29ed4e6e1abe5446efbcae928d817ffe_r.jpg"/>

Rich ModelZoo

The framework provides a set of language model pre-training tools so that users can customize their own pre-trained models, as well as a pre-trained language model library, ModelZoo, that users can call directly. More than 20 pre-trained models are currently supported. Among them, PAI-ALBERT-zh, pre-trained on the PAI platform, took first place on the Chinese CLUE leaderboard, and PAI-ALBERT-en-large took second place on English SuperGLUE. The detailed list of pre-trained models is as follows:

<img src="https://pic3.zhimg.com/v2-1eb88bc7572ca8a175d45539fce3b556_b.jpg" data-caption="" data-size="normal" data-rawwidth="841" data-rawheight="223" class="origin_image zh-lightbox-thumb" width="841" data-original="https://pic3.zhimg.com/v2-1eb88bc7572ca8a175d45539fce3b556_r.jpg"/>

Results of the pre-trained models on the CLUE leaderboard:

<img src="https://pic2.zhimg.com/v2-4991f050102fd41334e13ecef03aaa19_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="520" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic2.zhimg.com/v2-4991f050102fd41334e13ecef03aaa19_r.jpg"/>

Results on SuperGLUE:

<img src="https://pic1.zhimg.com/v2-399f69daedeb4fc61145b5043c121b90_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="478" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic1.zhimg.com/v2-399f69daedeb4fc61145b5043c121b90_r.jpg"/>

Rich AppZoo

EasyTransfer encapsulates AppZoo, which is easy to use, flexible, and cheap to learn. With only a few lines of commands, users can run leading open source and self-developed algorithms at scale and quickly build NLP applications on data from different scenarios and businesses, including text vectorization, matching, classification, reading comprehension, and sequence labeling.

<img src="https://pic3.zhimg.com/v2-e174b2c7e48bfc77988871f77b4ff67a_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="528" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic3.zhimg.com/v2-e174b2c7e48bfc77988871f77b4ff67a_r.jpg"/>

Efficient transfer learning algorithm

The EasyTransfer framework supports all mainstream transfer learning paradigms, including model fine-tuning, feature-based TL, instance-based TL, model-based TL and meta learning. Based on these paradigms, more than 10 algorithms have been developed and have achieved good results in Alibaba's business practice; these algorithms will all subsequently be open sourced in the EasyTransfer code base. For a specific application, users can refer to the figure below to choose a transfer learning paradigm to try.

<img src="https://pic3.zhimg.com/v2-399a0897b2c389d25fedae5f6215099a_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="325" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic3.zhimg.com/v2-399a0897b2c389d25fedae5f6215099a_r.jpg"/>

Pre-trained language model

One of the hot topics in natural language processing is pre-trained language models such as BERT and ALBERT, which have achieved very good results in all kinds of NLP scenarios. To better support users of pre-trained language models, we built a standard paradigm for language model pre-training and the pre-trained model library ModelZoo into the new version of the transfer learning framework EasyTransfer. To reduce the total number of parameters, the original ALBERT abandons BERT's approach of stacking independently parameterized encoders and instead loops over a shared encoder, as shown in the figure below. Fully looping a single encoder does not perform very well on downstream tasks, so we changed the full loop into a loop over a stack of two encoder layers. We then re-trained ALBERT xxlarge on the English C4 data. During pre-training we only use the MLM loss, combined with Whole Word Masking, and based on EasyTransfer's train-on-the-fly functionality we implemented dynamic online masking, i.e. the masked tokens are regenerated each time a sentence is read. Our final pre-trained model, PAI-ALBERT-en-large, ranked second in the world and first in China on the SuperGLUE list, with only 1/10 of the parameters of the first-placed Google T5 and an effect gap within 3.5%. In the future we will continue to optimize the model architecture and strive to achieve better results than T5 with 1/5 of its parameters.

<img src="https://pic1.zhimg.com/v2-eb2e138e7db965fa2b2c7c8f877cacc8_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="579" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic1.zhimg.com/v2-eb2e138e7db965fa2b2c7c8f877cacc8_r.jpg"/>

Multi-modal model FashionBERT

With the development of Web technology, the Internet contains a huge amount of multi-modal information, including text, images, speech and video, and retrieving important information from this massive multi-modal data has long been a focus of academic research. The core of multi-modal matching is text-image matching, a fundamental problem with applications in many fields, such as cross-modal information retrieval, image captioning, visual question answering (VQA) and visual commonsense reasoning (VCR). However, current academic research focuses on multi-modality in general domains, with relatively little multi-modal work in e-commerce. Based on this, we cooperated with the Alibaba ICBU team to propose the FashionBERT multi-modal pre-training model, which conducts pre-training research on image-text information in the e-commerce domain and has been successfully applied in several business scenarios such as cross-modal retrieval and image-text matching. The model architecture is shown below. This work proposes an Adaptive Loss, which is used to balance the three losses for image-text matching, image-only, and text-only tasks.

<img src="https://pic3.zhimg.com/v2-f99a2938e10497216033e5131745957e_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="641" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic3.zhimg.com/v2-f99a2938e10497216033e5131745957e_r.jpg"/>

Task-adaptive knowledge distillation

Pre-trained models extract general knowledge from massive unsupervised data and improve downstream tasks through knowledge transfer, achieving excellent results in practice. Generally, the larger the pre-trained model, the more effective the learned knowledge representation is for downstream tasks and the more obvious the improvement in metrics. However, large models clearly cannot meet the latency requirements of industrial applications, so model compression must be considered. Together with the Alibaba Intelligent Computing team, we proposed a new compression method, AdaBERT, which uses differentiable neural architecture search to automatically compress BERT into task-adaptive small models. In this process, BERT serves as a teacher model from which we distill the knowledge useful for the target task; under the guidance of this knowledge, we adaptively search for a network structure suited to the target task and compress it into a small student model. Experimental evaluations on multiple public NLP tasks show that the small models compressed by AdaBERT are 12.7 to 29.3 times faster than the original BERT in inference, with 11.5 to 17.0 times fewer parameters.

<img src="https://pic3.zhimg.com/v2-ad0456d69f61019db4753a11066b8cfe_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="318" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic3.zhimg.com/v2-ad0456d69f61019db4753a11066b8cfe_r.jpg"/>

Domain relationship learning in QA scenarios

As early as 2017, we tried transfer learning in the Alibaba AliMe Q&A scenario, focusing mainly on DNN-based supervised TL. There are two main frameworks for this type of algorithm: Fully-Shared (FS) and Specific-Shared (SS). The biggest difference between the two is that the former considers only shared representations, while the latter also models domain-specific representations. Generally, SS performs better than FS, since FS can be regarded as a special case of SS. For SS, in the ideal case the shared part captures what the two domains have in common and the specific part captures what is unique to each; in practice this is often hard to achieve, so we consider using an adversarial loss and domain correlation to help the model learn these two kinds of features. Based on this, we proposed a new algorithm, hCNN-DRSS, whose architecture is as follows:

<img src="https://pic4.zhimg.com/v2-723f12e6227a8bf6683e483d045f7d53_b.jpg" data-caption="" data-size="normal" data-rawwidth="620" data-rawheight="238" class="origin_image zh-lightbox-thumb" width="620" data-original="https://pic4.zhimg.com/v2-723f12e6227a8bf6683e483d045f7d53_r.jpg"/>

We applied this algorithm to AliMe's actual business scenarios and achieved good results in multiple businesses (AliExpress, Vientiane, Lazada). The work was also published at WSDM 2018: Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce (Jianfei Yu, Minghui Qiu, et al., WSDM 2018).

Reinforced Transfer Learning

The effectiveness of transfer learning largely depends on the gap between the source and target domains; if the gap is large, transfer is likely to be ineffective. In the AliMe QA scenario, directly transferring Quora's text-matching data brings in many examples that are not suitable. For this scenario we built a general reinforced transfer learning framework based on the Actor-Critic algorithm, using RL for sample selection to help the TL model achieve better results. The whole model consists of three parts: the base QA model, the transfer learning (TL) model and the reinforcement learning (RL) model. The RL policy function is responsible for selecting high-quality samples (actions); the TL model trains the QA model on the selected samples and provides feedback to the RL model, which updates its actions according to the feedback (reward). Models trained with this framework achieved very good improvements in matching accuracy for both the Russian and Spanish matching models of AliExpress during Double 11. The results were also published at WSDM 2019 (Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching. Chen Qu, Feng Ji, Minghui Qiu, et al., WSDM 2019).

<img src="https://pic3.zhimg.com/v2-dba4c435abbfb4cc85c39c4d153d022e_b.jpg" data-caption="" data-size="normal" data-rawwidth="897" data-rawheight="598" class="origin_image zh-lightbox-thumb" width="897" data-original="https://pic3.zhimg.com/v2-dba4c435abbfb4cc85c39c4d153d022e_r.jpg"/>

Meta Fine-tuning

The wide application of pre-trained language models has made the two-stage Pre-training + Fine-tuning paradigm mainstream. We noticed that in the fine-tuning stage, model parameters are only fine-tuned on a specific domain and data set, without considering transfer and tuning across domains. The Meta Fine-tuning algorithm borrows ideas from meta-learning and aims to learn a cross-domain meta-learner for the pre-trained language model, so that the learned meta-learner can be quickly transferred to tasks in a specific domain. The algorithm learns the cross-domain typicality (i.e., transferability) of training samples and adds a domain corruption classifier to the pre-trained language model so that it learns more domain-invariant representations.

<img src="https://pic3.zhimg.com/v2-7112b6596ceae7be9c20acec04d6bbba_b.jpg" data-caption="" data-size="normal" data-rawwidth="1080" data-rawheight="364" class="origin_image zh-lightbox-thumb" width="1080" data-original="https://pic3.zhimg.com/v2-7112b6596ceae7be9c20acec04d6bbba_r.jpg"/>

We applied the meta fine-tuning algorithm to BERT and conducted experiments on multiple tasks such as natural language inference and sentiment analysis. Experimental results show that on these tasks the meta fine-tuning algorithm outperforms both BERT's original fine-tuning and fine-tuning algorithms based on transfer learning. The results were published at EMNLP 2020 (Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining. Chengyu Wang, Minghui Qiu, Jun Huang, et al., EMNLP 2020).

Meta-Knowledge Distillation

As pre-trained language models such as BERT achieve SOTA results on various tasks, they have become an important part of the NLP deep transfer learning pipeline. But BERT is not flawless: this type of model still has two problems, too many parameters and slow training/inference. One direction is therefore to distill BERT's knowledge into a small model. However, most knowledge distillation work focuses on a single domain and ignores how to improve distillation across domains. We propose to use meta-learning to learn cross-domain transferable knowledge and to additionally distill this transferable knowledge in the distillation stage. This approach significantly improves the learned student model in the corresponding domains; on multiple cross-domain tasks we have distilled student models that approach the effect of the teacher model. We will organize this work and release the code and paper in the near future.

List of innovative articles

The EasyTransfer framework has been deployed in dozens of NLP scenarios within Alibaba Group, including intelligent customer service, search and recommendation, security risk control, and digital entertainment, bringing significant business impact. At present, EasyTransfer serves hundreds of millions of calls per day, with an average monthly training call volume exceeding 50,000. While landing in these businesses, the EasyTransfer team has accumulated many innovative algorithm solutions, including work on meta-learning, multi-modal pre-training, reinforced transfer learning and feature-based transfer learning, publishing dozens of papers at top conferences in total. Some representative works are listed below; these algorithms will be open sourced in the EasyTransfer framework for users.

  • [EMNLP 2020] Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining. Full paper.
  • [SIGIR 2020] FashionBERT: Text and Image Matching for Fashion Domain with Adaptive Loss.
  • [ACM MM 2020] One-shot Learning for Text Field Labeling in Structure Information Extraction. Oral, full paper, to appear.
  • [IJCAI 2020] AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search.
  • [KDD 2019] A Minimax Game for Instance-based Selective Transfer Learning. Oral.
  • [CIKM 2019] Cross-domain Attention Network with Wasserstein Regularizers for E-commerce Search.
  • [WWW 2019] Multi-Domain Gated CNN for Review Helpfulness Prediction.
  • [SIGIR 2019] BERT with History Answer Embedding for Conversational Question Answering.
  • [WSDM 2019] Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching. Full paper.
  • [ACL 2018] Transfer Learning for Context-Aware Question Matching in Information-seeking Conversation Systems in E-commerce.
  • [SIGIR 2018] Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. Long paper.
  • [WSDM 2018] Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. Long paper.
  • [CIKM 2017] AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience. Demo paper, Best Demo Award.
  • [ICDM 2017] A Short-Term Rainfall Prediction Model using Multi-Task Convolutional Neural Networks. Long paper.
  • [ACL 2017] AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine.
  • [arXiv] KEML: A Knowledge-Enriched Meta-Learning Framework for Lexical Relation Classification.
