Easy to use and high performance: an in-depth look at the open-source transfer learning framework EasyTransfer

Introduction: Alibaba Cloud has recently open-sourced the deep transfer learning framework EasyTransfer, the industry's first deep transfer learning framework built for NLP scenarios. Developed by the Alibaba Cloud Machine Learning PAI team, the framework makes developing and deploying model pre-training and transfer learning for natural language processing easier and more efficient. This article gives an in-depth look at EasyTransfer. Open source address: https://github.com/alibaba/EasyTransfer

Deep transfer learning for natural language processing is in huge demand in real-world settings, because new domains keep emerging and traditional machine learning requires accumulating a large amount of labeled training data for each of them, which consumes a great deal of human and material resources. Deep transfer learning can transfer knowledge learned in a source domain to tasks in a new domain, greatly reducing the annotation effort required.

Although the demand for deep transfer learning in natural language scenarios is strong, the open source community does not yet have a complete framework for it, and building one that is simple, easy to use, and high-performance is a huge challenge.

  • First, pre-training a model and then transferring its knowledge is now the mainstream way of applying NLP. Generally, the larger the pre-trained model, the more effective the learned knowledge representation, but very large models pose great challenges to a framework's distributed architecture: how to provide a high-performance distributed architecture that can effectively support training at very large scale.
  • Second, user application scenarios are highly diverse, and no single transfer learning algorithm fits all of them: how to provide a complete set of transfer learning tools that improve results in downstream scenarios.
  • Third, it usually takes a long path from algorithm development to business deployment: how to provide a simple, easy-to-use, one-stop service from model training to deployment.

Faced with these three challenges, the PAI team launched EasyTransfer, a simple, easy-to-use, and high-performance transfer learning framework. The framework supports mainstream transfer learning algorithms as well as automatic mixed precision, compilation optimization, and efficient distributed data/model parallel strategies, making it suitable for industrial-scale distributed application scenarios.

It is worth mentioning that, with mixed precision, compilation optimization, and distributed strategies, the ALBERT model in EasyTransfer trains more than 4 times faster in distributed training than the community version of ALBERT.

At the same time, having been polished through more than 10 business units and more than 20 business scenarios within Alibaba, it provides NLP and transfer learning users with a variety of conveniences, including an industry-leading high-performance pre-training tool chain and pre-trained ModelZoo, the rich and easy-to-use AppZoo, efficient transfer learning algorithms, and full compatibility with Alibaba's PAI ecosystem products, offering a one-stop service from model training to deployment.

Lin Wei, head of the Alibaba Cloud Machine Learning PAI team, said: by open-sourcing the EasyTransfer code, we hope to share Alibaba's capabilities with more users, lower the barrier to NLP pre-training and knowledge transfer, and work closely with more partners to build a simple, easy-to-use, and high-performance NLP and transfer learning toolkit.


1. Six highlights of EasyTransfer

Simple and high-performance framework

The framework hides the complex underlying implementation so that users only need to focus on the logical structure of the model, lowering the entry barrier for NLP and transfer learning. At the same time, it supports industrial-grade distributed scenarios: it improves the distributed optimizer and, combined with automatic mixed precision, compilation optimization, and efficient distributed data/model parallel strategies, achieves more than 4 times the training speed of the community version in multi-machine, multi-GPU distributed training.

Language model pre-training tool chain

Provides a complete pre-training tool chain that makes it easy for users to pre-train language models such as T5 and BERT. Models pre-trained with this tool chain have achieved strong results on the Chinese CLUE leaderboard and the English SuperGLUE leaderboard.

Rich and high-quality pre-trained model ModelZoo

Provides PAI ModelZoo with support for continued pre-training and fine-tuning of mainstream models such as BERT, ALBERT, RoBERTa, XLNet, and T5, as well as the self-developed multi-modal model FashionBERT for the fashion industry.

Rich and easy-to-use AppZoo

Supports mainstream NLP applications and self-developed models: for text matching, single-tower models such as DAM++ and HCNN as well as a BERT two-tower model with vector recall; for reading comprehension, models such as BERT-HAE.

Automatic knowledge distillation tool

Supports knowledge distillation from a large teacher model into a small student model. It integrates the task-adaptive BERT compression algorithm AdaBERT, which uses neural architecture search to find task-specific architectures: the original BERT model can be compressed to as little as 1/17 of its size, inference runs up to 29 times faster, and the loss in model quality stays within 3%.

Compatible with PAI ecological products

The framework is built on PAI-TF, so users can enable PAI's self-developed features such as efficient distributed training and compilation optimization through simple code or configuration changes. It is also fully compatible with PAI ecosystem products, including the PAI web component (PAI Studio), the development platform (PAI DSW), and the serving platform (PAI EAS).

2. Platform architecture overview

The overall architecture of EasyTransfer is shown in the figure below; it is designed to make algorithm development for deep transfer learning as simple as possible. The framework abstracts commonly used IO, layers, losses, optimizers, and models; users can develop models on top of these interfaces or directly call the pre-trained model library ModelZoo for rapid modeling. The framework supports five transfer learning (TL) paradigms: model fine-tuning, feature-based TL, instance-based TL, model-based TL, and meta learning. It also integrates AppZoo, which supports mainstream NLP applications and makes it easy for users to build common NLP algorithm applications. Finally, the framework is seamlessly compatible with PAI ecosystem products, giving users a one-stop experience from training to deployment.

[Figure: overall architecture of EasyTransfer]

3. Platform functions in detail

The core functions of the EasyTransfer framework are described in detail below.

Simple and easy-to-use API design
[Figure: EasyTransfer API design]

High-performance distributed framework

The EasyTransfer framework supports industrial-grade distributed scenarios with an improved distributed optimizer. Combined with automatic mixed precision, compilation optimization, and efficient distributed data/model parallel strategies, PAI-ALBERT trains more than 4 times faster than the community version of ALBERT in multi-machine, multi-GPU distributed training.
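For readers who want to try these switches outside of EasyTransfer, the sketch below shows generic TensorFlow 1.15 equivalents of automatic mixed precision and XLA compilation. This is not EasyTransfer's or PAI-TF's own configuration API, just a minimal stand-alone illustration under the assumption of a stock tensorflow==1.15 installation.

```python
# Minimal stand-alone sketch (assumes tensorflow==1.15): generic equivalents
# of the performance switches described above. Not the EasyTransfer/PAI-TF API.
import tensorflow as tf

def build_train_op(loss, learning_rate=1e-4):
    opt = tf.train.AdamOptimizer(learning_rate)
    # Automatic mixed precision: rewrites eligible ops to fp16 while keeping
    # fp32 master weights and applying dynamic loss scaling.
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
    return opt.minimize(loss)

# XLA compilation is switched on through the session config.
session_config = tf.ConfigProto()
session_config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

x = tf.placeholder(tf.float32, [None, 128])
w = tf.get_variable("w", [128, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = build_train_op(loss)

with tf.Session(config=session_config) as sess:
    sess.run(tf.global_variables_initializer())
```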


Rich ModelZoo

The framework provides a pre-training tool chain for users to build their own pre-trained models, and also provides a pre-trained language model library, ModelZoo, that users can call directly. More than 20 pre-trained models are currently supported. Among them, PAI-ALBERT-zh, pre-trained on the PAI platform, took first place on the Chinese CLUE leaderboard, and PAI-ALBERT-en-large took second place on English SuperGLUE. The following is a detailed list of pre-trained models:

[Table: pre-trained models available in ModelZoo]

Results of the pre-trained models on the CLUE leaderboard:

[Figure: results on the CLUE leaderboard]

Results on SuperGLUE:

[Figure: results on the SuperGLUE leaderboard]

Rich AppZoo

EasyTransfer includes AppZoo, which is easy to use, flexible, and cheap to learn. With just a few command lines, users can run leading-edge open-source and self-developed algorithms at scale and quickly build NLP applications on their own scenario and business data, including text vectorization, matching, classification, reading comprehension, and sequence labeling.


Efficient transfer learning algorithms

The EasyTransfer framework supports all mainstream transfer learning paradigms, including model fine-tuning, feature-based TL, instance-based TL, model-based TL, and meta learning. More than 10 algorithms have been developed on top of these paradigms and have achieved good results in Alibaba's business practice; they will all be open-sourced into the EasyTransfer code base. In a specific application, users can follow the figure below to choose a transfer learning paradigm to try.

[Figure: guide to choosing a transfer learning paradigm]

Pre-trained language model

One of the hot topics in natural language processing is pre-trained language models such as BERT and ALBERT, which have achieved very good results across a wide range of NLP scenarios. To better support users of pre-trained language models, we have built a standard pre-training paradigm and the pre-trained model library ModelZoo into the new version of EasyTransfer.

To reduce the total number of parameters, ALBERT replaces BERT's stack of independent encoder layers with a single encoder applied repeatedly in a loop (cross-layer parameter sharing), as shown in the figure below. Since looping over a single layer does not perform very well on downstream tasks, we instead loop over a stack of two encoder layers, and re-trained an ALBERT-xxlarge model on the English C4 corpus. During pre-training we use only the MLM loss, combined with Whole Word Masking and, via EasyTransfer's train-on-the-fly capability, dynamic online masking: the masked tokens are regenerated every time a sentence is read. The resulting model, PAI-ALBERT-en-large, ranked second in the world and first in China on the SuperGLUE leaderboard, with only about 1/10 of the parameters of the first-place Google T5 and a performance gap within 3.5%. Going forward, we will continue to optimize the model architecture and aim to outperform T5 with 1/5 of its parameters.

[Figure: PAI-ALBERT encoder parameter-sharing structure]
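The dynamic whole-word masking mentioned above can be illustrated with a few lines of plain Python. The snippet below is a minimal, self-contained sketch, not EasyTransfer's actual data pipeline; the 15% masking rate and the WordPiece "##" continuation convention are the usual BERT defaults assumed here.

```python
# Illustrative sketch of dynamic whole-word masking (not EasyTransfer's actual
# implementation): whole words are re-sampled for masking every time an
# example is read, so the model sees different masks across epochs.
import random

MASK, MASK_PROB = "[MASK]", 0.15

def whole_word_mask(tokens):
    """tokens: WordPiece tokens where '##'-prefixed pieces continue a word."""
    # Group word-piece indices into whole words.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)

    masked = list(tokens)
    labels = [None] * len(tokens)          # MLM targets, None = not predicted
    for word in words:
        if random.random() < MASK_PROB:    # mask all pieces of the word together
            for i in word:
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels

# Because masking is redone on every read, each epoch yields a new pattern.
print(whole_word_mask(["the", "fashion", "##ista", "wore", "sneak", "##ers"]))
```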

Multi-modal model FashionBERT

With the development of web technology, the Internet contains a huge amount of multi-modal information, including text, images, audio, and video. Finding the important information in this massive multi-modal data has long been a focus of academic research. The core of multi-modal matching is text and image matching, a fundamental problem with applications in many areas, such as cross-modal information retrieval, image caption generation, visual question answering (VQA), and visual commonsense reasoning (VCR). However, current academic research mostly targets multi-modal problems in general domains, and there is relatively little multi-modal work for e-commerce. Against this background, we worked with the Alibaba ICBU team to propose FashionBERT, a multi-modal pre-trained model for image and text in the e-commerce domain, which has been successfully applied in several business scenarios such as cross-modal retrieval and image-text matching. The model architecture is shown below. This work proposes an Adaptive Loss that balances three losses: the image-text matching loss, the image-only loss, and the text-only loss.

[Figure: FashionBERT model architecture]
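As a rough illustration of balancing the three objectives, the sketch below uses the common learnable uncertainty-weighting trick as a stand-in. FashionBERT's actual Adaptive Loss derives the weights differently, so treat this only as a generic multi-task weighting example in TF 1.x style, with names of our own choosing.

```python
# Generic stand-in for balancing FashionBERT's three training objectives
# (a text-only loss, an image-only loss, and an image-text matching loss).
# Not the paper's Adaptive Loss: here each task gets a learnable log-variance
# that down-weights its loss but is penalized by the +log_var term.
import tensorflow as tf

def adaptive_multitask_loss(loss_text, loss_image, loss_match):
    total = 0.0
    for name, loss in [("text", loss_text), ("image", loss_image),
                       ("match", loss_match)]:
        log_var = tf.get_variable("log_var_%s" % name, shape=[],
                                  initializer=tf.zeros_initializer())
        total += tf.exp(-log_var) * loss + log_var
    return total
```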

Task-adaptive knowledge distillation

Pre-trained models extract general knowledge from massive amounts of unsupervised data and improve downstream tasks through knowledge transfer, with excellent results in practice. In general, the larger the pre-trained model, the more effective the learned representations are for downstream tasks and the larger the gains. However, large models clearly cannot meet the latency requirements of industrial applications, so model compression has to be considered. Together with the Alibaba Intelligent Computing team, we proposed a new compression method, AdaBERT, which uses differentiable neural architecture search to automatically compress BERT into task-adaptive small models.

In this process, BERT serves as the teacher model, and its knowledge that is useful for the target task is distilled; guided by this knowledge, we adaptively search for a network structure suited to the target task and compress it into a small student model. Experimental evaluations on multiple public NLP tasks show that the small models compressed by AdaBERT keep comparable accuracy while running 12.7 to 29.3 times faster at inference and being 11.5 to 17.0 times smaller in parameter count than the original BERT.

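To make the teacher-student setup concrete, here is a minimal distillation objective in TF 1.x style: softened teacher targets at a temperature combined with hard labels. The architecture-search half of AdaBERT is omitted, and the temperature and alpha values are illustrative assumptions, not values from the paper.

```python
# Sketch of a task-oriented distillation objective for compressing a large
# teacher (e.g. BERT) into a small student. Illustrative only; the
# differentiable architecture search part of AdaBERT is not shown.
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: teacher distribution at temperature T, no gradient back
    # into the teacher.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_loss = tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=tf.stop_gradient(soft_teacher),
        logits=student_logits / temperature)
    # Hard targets: ordinary cross-entropy against the task labels.
    hard_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=student_logits)
    # T^2 keeps the soft-target gradients comparable across temperatures.
    return tf.reduce_mean(alpha * (temperature ** 2) * soft_loss
                          + (1.0 - alpha) * hard_loss)
```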

Domain relationship learning in QA scenarios

As early as 2017, we experimented with transfer learning in Alibaba's AliMe question-answering scenarios, focusing mainly on DNN-based supervised transfer learning. There are two main frameworks for this class of algorithms: Fully-Shared (FS) and Specific-Shared (SS). The key difference is that the former models only a shared representation, while the latter also models domain-specific representations. In general, SS performs better than FS, since FS can be seen as a special case of SS. Ideally, in SS the shared part captures what the two domains have in common and the specific parts capture what is unique to each. In practice this separation is hard to achieve, so we use an adversarial loss and domain correlation to help the model learn the two kinds of features. Based on this, we proposed a new algorithm, hCNN-DRSS, with the following architecture:

[Figure: hCNN-DRSS architecture]
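The sketch below illustrates the Specific-Shared idea with an adversarial domain discriminator on the shared representation via gradient reversal. It is a simplified stand-in rather than the hCNN-DRSS model itself; the layer sizes and names are assumptions.

```python
# Simplified Specific-Shared (SS) sketch with an adversarial domain loss,
# in the spirit of the description above but not the actual hCNN-DRSS model.
import tensorflow as tf

@tf.custom_gradient
def gradient_reversal(x):
    def grad(dy):
        return -dy  # flip gradients flowing back into the shared encoder
    return tf.identity(x), grad

def specific_shared_logits(x, domain_id, num_domains=2, hidden=128):
    shared = tf.layers.dense(x, hidden, tf.nn.relu,
                             name="shared_enc", reuse=tf.AUTO_REUSE)
    specific = tf.layers.dense(x, hidden, tf.nn.relu,
                               name="specific_enc_%d" % domain_id,
                               reuse=tf.AUTO_REUSE)
    # The task head sees both shared and domain-specific features.
    task_logits = tf.layers.dense(tf.concat([shared, specific], axis=-1), 2,
                                  name="task_head", reuse=tf.AUTO_REUSE)
    # The domain discriminator sees the reversed shared features: it tries to
    # identify the domain while the shared encoder learns to fool it.
    domain_logits = tf.layers.dense(gradient_reversal(shared), num_domains,
                                    name="domain_head", reuse=tf.AUTO_REUSE)
    return task_logits, domain_logits
```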

We applied this algorithm to AliMe's real business scenarios and achieved good results in multiple businesses (AliExpress, Wanxiang, Lazada).

Reinforced Transfer Learning

The effectiveness of transfer learning depends largely on the gap between the source and target domains; if the gap is large, transfer is likely to fail. In the AliMe QA scenario, directly transferring Quora's text-matching data brings in many unsuitable samples. We therefore built a general reinforced transfer learning framework based on the Actor-Critic algorithm and used RL for sample selection to help the TL model achieve better results. The whole model has three parts: a basic QA model, a transfer learning (TL) model, and a reinforcement learning (RL) model. The RL policy selects high-quality samples (actions), the TL model trains the QA model on the selected samples and provides feedback to RL, and RL updates its actions according to this feedback (reward). Models trained with this framework achieved clear improvements in matching accuracy for both the Russian and Spanish matching models of AliExpress during the Double 11 promotion.

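A toy version of the sample-selection loop is sketched below, using a REINFORCE-style policy update instead of the full actor-critic setup described above; `train_fn` and `eval_fn` are hypothetical hooks standing in for the QA/TL model's training and evaluation.

```python
# Toy sketch of RL-based source-sample selection (REINFORCE-style, not the
# actual actor-critic implementation): a policy scores each source example,
# sampled "keep" decisions define the training batch for the target-task
# model, and the change in validation accuracy is fed back as reward.
import numpy as np

rng = np.random.default_rng(0)

def select_and_update(policy_w, features, train_fn, eval_fn,
                      lr=0.1, baseline=0.0):
    """features: [n, d] per-sample features; train_fn/eval_fn wrap the TL model."""
    scores = 1.0 / (1.0 + np.exp(-features @ policy_w))   # keep probabilities
    keep = rng.random(len(scores)) < scores                # sampled actions
    train_fn(keep)                                         # train on kept samples
    reward = eval_fn() - baseline                          # e.g. dev accuracy gain
    # REINFORCE update: push the probabilities of the taken actions up or
    # down depending on the sign of the reward.
    grad = ((keep.astype(float) - scores)[:, None] * features).mean(axis=0)
    return policy_w + lr * reward * grad

# Hypothetical usage with stubbed training/eval hooks:
w = np.zeros(8)
w = select_and_update(w, rng.normal(size=(32, 8)),
                      train_fn=lambda keep: None,
                      eval_fn=lambda: 0.6)
```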

Meta Fine-tuning

The wide adoption of pre-trained language models has made the two-stage pre-training + fine-tuning pipeline the mainstream. We noticed that in the fine-tuning stage, model parameters are tuned only on a specific domain and dataset, ignoring how well the tuning transfers across domains. The Meta Fine-tuning algorithm borrows ideas from meta-learning and aims to learn a cross-domain meta-learner on top of the pre-trained language model, so that the learned meta-learner can be quickly transferred to tasks in a specific domain. The algorithm learns the cross-domain typicality (i.e., transferability) of training samples and adds a domain-corruption classifier to the pre-trained language model so that it learns more domain-invariant representations.

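The sketch below combines the two ingredients just described (a typicality-weighted task loss and a domain classifier trained on corrupted domain labels) in TF 1.x style. It is our own simplified rendering, not the published algorithm's exact objective; the corruption probability is an assumed value.

```python
# Hedged sketch of the Meta Fine-tuning ingredients described above:
# (1) per-example "typicality" weights scale the task loss, and (2) a domain
# classifier trained against partially corrupted domain labels pushes the
# encoder toward domain-invariant representations.
import tensorflow as tf

def meta_finetune_loss(sent_repr, task_logits, task_labels,
                       domain_labels, typicality, num_domains,
                       corrupt_prob=0.3):
    # Typicality-weighted task loss: highly transferable examples count more.
    task_loss = tf.losses.sparse_softmax_cross_entropy(
        labels=task_labels, logits=task_logits,
        weights=typicality, reduction=tf.losses.Reduction.MEAN)

    # Domain corruption: with probability corrupt_prob, replace the domain
    # label with a random one so the encoder cannot rely on domain identity.
    rand = tf.random.uniform(tf.shape(domain_labels), maxval=num_domains,
                             dtype=domain_labels.dtype)
    flip = tf.random.uniform(tf.shape(domain_labels)) < corrupt_prob
    corrupted = tf.where(flip, rand, domain_labels)
    domain_logits = tf.layers.dense(sent_repr, num_domains, name="domain_clf")
    domain_loss = tf.losses.sparse_softmax_cross_entropy(
        labels=corrupted, logits=domain_logits)
    return task_loss + domain_loss
```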

We applied Meta Fine-tuning to BERT and conducted experiments on tasks such as natural language inference and sentiment analysis. The results show that Meta Fine-tuning outperforms both BERT's original fine-tuning and transfer-learning-based fine-tuning algorithms on these tasks.

Meta-Knowledge Distillation

As pre-trained language models such as BERT achieve state-of-the-art results on a wide range of tasks, they have become a key component of NLP deep transfer learning pipelines. But BERT is not perfect: such models still have too many parameters and slow training/inference. One direction is therefore to distill BERT's knowledge into a small model. However, most knowledge distillation work focuses on a single domain and ignores how to improve distillation across domains. We propose to use meta-learning to learn cross-domain transferable knowledge and to additionally distill this transferable knowledge during the distillation stage, which significantly improves the student model in the corresponding domains. On multiple cross-domain tasks we have distilled student models that approach the teacher's performance. We will organize this work and release the code and paper in the near future.

4. Innovative articles

The EasyTransfer framework has been deployed in dozens of NLP scenarios across Alibaba Group, including intelligent customer service, search and recommendation, security and risk control, and entertainment, bringing significant business gains. EasyTransfer currently serves hundreds of millions of online calls per day, with more than 50,000 training calls per month on average. Alongside these deployments, the EasyTransfer team has accumulated many innovative algorithms, covering meta-learning, multi-modal pre-training, reinforced transfer learning, feature-based transfer learning, and more, and has published dozens of papers at top conferences; some representative works are listed below. These algorithms will be open-sourced in the EasyTransfer framework for all users.

  • [EMNLP 2020] Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining. Full paper.
  • [SIGIR 2020] FashionBERT: Text and Image Matching for Fashion Domain with Adaptive Loss.
  • [ACM MM 2020] One-shot Learning for Text Field Labeling in Structure Information Extraction. Full oral paper, to appear.
  • [IJCAI 2020] AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search.
  • [KDD 2019] A Minimax Game for Instance based Selective Transfer Learning. Oral.
  • [CIKM 2019] Cross-domain Attention Network with Wasserstein Regularizers for E-commerce Search.
  • [WWW 2019] Multi-Domain Gated CNN for Review Helpfulness Prediction.
  • [SIGIR 2019] BERT with History Modeling for Conversational Question Answering.
  • [WSDM 2019] Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching. Full paper.
  • [ACL 2018] Transfer Learning for Context-Aware Question Matching in Information-seeking Conversation Systems in E-commerce.
  • [SIGIR 2018] Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. Long paper.
  • [WSDM 2018] Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. Long paper.
  • [CIKM 2017] AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience. Demo paper, Best Demo Award.
  • [ICDM 2017] A Short-Term Rainfall Prediction Model using Multi-Task Convolutional Neural Networks. Long paper.
  • [ACL 2017] AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine.
  • [arXiv] KEML: A Knowledge-Enriched Meta-Learning Framework for Lexical Relation Classification.

Finally, EasyTransfer is the toolkit officially recommended by the Chinese CLUE community. The Alibaba Cloud Tianchi platform will also work with the CLUE community on a multi-task semantic understanding competition in which EasyTransfer is the default development tool, allowing participants to easily build and optimize multi-task baselines. Stay tuned.

Original link: https://developer.aliyun.com/article/776240?

Origin blog.csdn.net/alitech2017/article/details/109241266