An integrated NLP training and inference tool (TurboNLP)

Author: TurboNLP, Tencent TEG backend engineer

Introduction

When training NLP tasks (sequence labeling, classification, sentence-pair relationship judgment, generation), researchers usually work directly in a machine learning framework such as PyTorch or TensorFlow, defining the model and its data preprocessing by hand. Custom models and preprocessing built this way are hard to accumulate, reuse, and share, and putting models online brings further problems: deployment is difficult, latency is high, and cost is high. Starting at the end of 2019, after extensive research, the TEG AI Platform Department Search Business Center developed TurboNLP, an integrated training and inference tool built on top of AllenNLP, covering the training framework TurboNLP-exp and the inference framework TurboNLP-inference. TurboNLP-exp is configurable, simple, multi-framework, multi-task, and reusable, enabling fast and efficient NLP experiments.

The TurboNLP-inference framework is backed by an efficient model inference library, BertInference. It integrates commonly used NLP models, is seamlessly compatible with TurboNLP-exp, and delivers high inference performance: measured on a BERT-base document classification business model, it reaches 0.275ms/query at FP16 precision and 0.126ms/query at INT8 precision, both with batch_size=64 and seq_len=64. Together, these integrated NLP training and inference tools greatly simplify the path from training to inference and reduce the labor costs of task training and model launch. This article introduces these tools.

Background

NLP tasks are usually handled by algorithm researchers who define custom models and data preprocessing in PyTorch or TensorFlow for training, then manually deploy them to libtorch or TensorFlow for serving. This process has the following problems:

  • Model structures and data preprocessing that already exist for NLP tasks are redefined from scratch, with a high degree of duplication.

  • Manually modifying model and preprocessing code while repeatedly tuning training parameters through trial and error leads to messy code.

  • When model complexity is high (multi-model, multi-task) or an existing model needs to be optimized and improved, anyone unfamiliar with the model structure must first untangle the Python model and preprocessing code.

  • Knowledge accumulation, model reuse, and sharing are difficult.

  • Going online is difficult: reimplementing data preprocessing in C++ is complicated, and inference latency is high.

  • The offline training and experimentation workflow of NLP tasks is hard to standardize, so the cost of trial and error is high.

To solve these pain points, we connected the NLP training end to the inference end and developed the training framework TurboNLP-exp and the inference framework TurboNLP-inference ourselves. The overall architecture is shown in the following diagram:

Overview

  • Training framework TurboNLP-exp

    • TurboNLP-exp is modular, configurable, multi-platform, and multi-task, supports multiple model export formats, and provides C++ data preprocessing. It not only supports researchers' rapid experimentation; it also lets models accumulate in the framework through configuration, so researchers can reuse them, share them, and build up knowledge via configuration.

    • TurboNLP-exp has a modular design for both models and data preprocessing. For data preprocessing, NLP tasks of the same type (sequence labeling, classification, sentence-pair relationship judgment, generation) preprocess data in essentially the same way, so existing preprocessing can be reused through configuration. For models, TurboNLP-exp integrates a rich set of sub-modules: embedder, seq2seq_encoder, seq2vec_encoder, decoder, attention, etc., so arbitrary models can be assembled via configuration for rapid experimentation.

    • TurboNLP-exp unifies the underlying machine learning platforms (PyTorch and TensorFlow), so which platform a researcher happens to know does not affect model reuse, sharing, or knowledge accumulation.

    • TurboNLP-exp supports both C++ and Python data preprocessing. Python preprocessing enables fast experimental debugging and mainly serves the training side; C++ preprocessing is high-performance and mainly serves the inference side. The C++ preprocessing exposes the same API as the Python version, so researchers can switch freely between the two during training and guarantee data consistency between the training end and the inference end.

  • Inference framework TurboNLP-inference

    • TurboNLP-inference can directly load models exported by TurboNLP-exp and instantiate data preprocessing according to the configuration.

    • TurboNLP-inference provides a unified API with complete documentation and examples, so model inference code can be implemented quickly from the examples; business code calls the inference library through the API and a .so package.

    • The TurboNLP-inference framework integrates models commonly used in NLP: lstm, encoder-decoder, crf, esim, and BERT. The underlying inference supports five libraries: BertInference (a BERT inference acceleration library), libtorch, TensorFlow, TurboTransformers (WXG's open-source BERT inference acceleration library), and BertInference-cpu (a BERT inference acceleration library for the CPU).

TurboNLP-exp training framework

The TurboNLP-exp training framework is developed on the basis of AllenNLP. To meet the needs of algorithm researchers and of the inference side, TurboNLP-exp is continuously optimized and has features that other industry frameworks lack. The following table compares TurboNLP-exp with other frameworks:

| Framework | Difficulty | Modular | Configurable | PyTorch | TensorFlow | Multi-task training | Multi-format export | Data preprocessing | Inference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyText | difficult | T | T | T | F | F | F | Python | Caffe2 execution engine |
| AllenNLP | simple | T | T | T | F | F | F | Python | Simple Python service |
| TurboNLP-exp | simple | T | T | T | T | T | T | Python, C++ | Efficient TurboNLP-inference |

The following sections describe our optimizations in TurboNLP-exp in detail.

Modular and configurable

TurboNLP-exp owes its high configurability to its well-factored module design. Through modular encapsulation, TurboNLP-exp supports arbitrary combination and extension of models and sub-modules, and offers an interface-driven configuration mode for newcomers: data preprocessing and model configurations can be generated through a visual interface, which greatly lowers the barrier to entry.

Modular and configurable data preprocessing

Data preprocessing is divided into four modules: dataset_reader, token_indexer, tokenizer, and vocabulary (a configuration sketch follows this list).

  • dataset_reader: reads the training data, segments it with the tokenizer, and converts tokens to ids with the indexer. It integrates readers for multiple data formats (text classification, NER, BERT, etc.) and supports custom extensions.

  • token_indexer: indexes tokens (converts them to ids according to the dictionary). It integrates a variety of indexers (by single character, by word, by word attribute, etc.) and supports custom extensions.

  • tokenizer: segments the text. It integrates the tokenizers commonly used in NLP tasks (qqseg, wordpiece, whitespace, character, etc.) and supports custom extensions.

  • vocabulary: the data dictionary. It can be generated automatically from the training data and saved locally after training, or loaded from an existing local dictionary file. A vocabulary stores multiple dictionaries (a tokens dictionary, a labels dictionary, etc.) at the same time, organized by namespace.
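As a concrete illustration, a data-preprocessing configuration might look like the sketch below. This is a minimal AllenNLP-style example written as a Python dict; the type names and keys are illustrative assumptions, not TurboNLP-exp's actual registry keys.

```python
# Hypothetical TurboNLP-exp-style preprocessing config (AllenNLP-like).
# All type names here are illustrative assumptions.
preprocess_config = {
    "dataset_reader": {
        "type": "text_classification",         # which integrated reader to use
        "tokenizer": {"type": "whitespace"},   # alternatives: qqseg, wordpiece, character
        "token_indexers": {
            "tokens": {"type": "single_id"}    # index by word id; could be by character, etc.
        },
    },
    "vocabulary": {
        # built automatically from training data, or loaded from existing files
        "type": "from_instances",
    },
}
```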

Modular and configurable model

The model side is divided into three modular parts: model, trainer, and exporter (a combined configuration sketch follows this list).

  • model: integrates the sub-models common in NLP tasks (encoder, decoder, embedder, etc.). Each sub-model is itself composed of other models, and this compositional, modular design makes it easy to define a model from configuration. Since models for similar NLP tasks have largely the same structure, researchers can quickly adjust a model by editing its configuration, and can extend it with custom sub-models.

  • trainer: encapsulates the optimizer, learning rate, evaluation metrics, and other training settings; training parameters are modified through configuration for rapid experimentation.

  • exporter: integrates multiple model export formats (caffe, onnx, pt); the export format is specified in the configuration.
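Continuing the hedged example above, the model, trainer, and exporter might be configured together as follows; again, all names are illustrative assumptions rather than the framework's real keys.

```python
# Hypothetical model/trainer/exporter config, composed from sub-modules.
model_config = {
    "model": {
        "type": "text_classifier",
        "embedder": {"type": "embedding", "embedding_dim": 128},
        "encoder": {"type": "lstm", "input_size": 128, "hidden_size": 256},
        "decoder": {"type": "linear", "num_labels": 5},
    },
    "trainer": {
        "optimizer": {"type": "adam", "lr": 1e-3},
        "num_epochs": 10,
        "validation_metric": "+accuracy",
    },
    "exporter": {"format": "pt"},  # or "onnx" / "caffe"
}
```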

Multi-platform support

TurboNLP-exp abstracts over the underlying machine learning platforms and implements a unified framework interface that dispatches to PyTorch or TensorFlow (as shown in the figure below); which platform implements the interface is chosen by configuration. The PyTorch backend is currently the standard.
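The dispatch idea can be sketched as follows. This is a minimal illustration of a backend-agnostic layer, not TurboNLP-exp's actual implementation; the class and parameter names are assumptions.

```python
# Minimal sketch of backend dispatch: one interface, two implementations.
class LinearLayer:
    """Framework-agnostic linear layer; the backend is picked from config."""

    def __init__(self, in_dim: int, out_dim: int, backend: str = "pytorch"):
        if backend == "pytorch":
            import torch.nn as nn
            self.impl = nn.Linear(in_dim, out_dim)
        elif backend == "tensorflow":
            import tensorflow as tf
            self.impl = tf.keras.layers.Dense(out_dim, input_shape=(in_dim,))
        else:
            raise ValueError(f"unknown backend: {backend}")

    def __call__(self, inputs):
        return self.impl(inputs)
```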

Multitask training

Multi-task learning mimics the multi-task nature of human cognition by integrating different types of tasks, such as entity recognition and compactness scoring, into one model: each task trains its own tagger layer on top of a shared pre-trained language model. During training, the knowledge and objectives of the task domains complement one another and jointly improve the task models; at serving time the tasks share the same underlying model, saving storage and computing resources. Demand for multi-task training keeps growing, and TurboNLP-exp supports it with multiple combination and training-scheduling methods (as shown in the figure below).

The multi-task model of TurboNLP-exp has the following characteristics:

  • Multi-task models can be quickly composed from existing single-task models.

  • Supports multiple combination rules: sharing, accumulation, and shortcut.

    • Sharing: Multiple models share the same encoder output.

    • Accumulation: the encoder outputs of all tasks are accumulated and fed to each task's tagger layer.

    • Shortcut: the encoder output of each task is used as the encoder input of the next task.

  • Supports multiple training scheduling methods: sequential scheduling, random scheduling, and joint scheduling (see the sketch after this list).

    • Sequential and random scheduling are alternating training: each task keeps its own loss and can reach its own optimum on top of the shared model, and no unified input needs to be constructed, which is simpler.

    • Joint scheduling is joint training over a unified input: the losses of all tasks are accumulated into one objective, so it searches for an overall multi-task optimum.

  • Users can freely configure the combination mode and scheduling mode that match the actual task scenario, so that multi-task training achieves the best effect.
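The scheduling modes can be illustrated with a small PyTorch sketch: two task heads over a shared encoder (the "sharing" combination), trained either alternately or jointly. This is a hedged toy example with made-up shapes and random data, not TurboNLP-exp's actual trainer.

```python
# Toy sketch: two tasks sharing one encoder, with alternating vs. joint scheduling.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)   # shared
heads = {"ner": nn.Linear(256, 9), "topic": nn.Linear(256, 5)}         # per-task taggers
params = list(encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

def task_loss(name: str, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    hidden, _ = encoder(x)                       # shared representation
    logits = heads[name](hidden[:, -1, :])       # task-specific head
    return F.cross_entropy(logits, y)

for step in range(100):
    x = torch.randn(32, 10, 128)                 # fake batch: (batch, seq, dim)
    # Sequential scheduling: alternate tasks, one loss per step
    # (random scheduling would pick the task with random.choice instead).
    name = ["ner", "topic"][step % 2]
    n_cls = heads[name].out_features
    loss = task_loss(name, x, torch.randint(0, n_cls, (32,)))
    # Joint scheduling would instead accumulate all task losses:
    #   loss = sum(task_loss(n, x, labels[n]) for n in heads)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```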

Multi-model format export

TurboNLP-exp can export models in the caffe, onnx, and pt formats. These are exactly the formats supported by the TurboNLP-inference framework, so the inference end loads the exported model directly instead of going through a complex model conversion.
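For the ONNX path, export from a trained PyTorch model is typically a single call; a minimal sketch using the standard torch.onnx API is shown below. The toy model and file name are assumptions, and the TurboNLP-exp exporter presumably wraps a step like this.

```python
# Minimal ONNX export sketch using standard PyTorch APIs.
import torch
import torch.nn as nn

# Stand-in classifier over token ids (illustrative only).
model = nn.Sequential(nn.Embedding(30522, 128), nn.Flatten(), nn.Linear(64 * 128, 5))
model.eval()

dummy_ids = torch.randint(0, 30522, (1, 64))     # (batch=1, seq_len=64)
torch.onnx.export(
    model, dummy_ids, "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}},    # allow variable batch size
)
```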

Data preprocessing

TurboNLP-exp's data preprocessing supports both Python and C++. Python preprocessing mainly serves the training side; C++ preprocessing mainly serves the inference side but can also serve the training side (as shown in the figure below).

On the training side, while preprocessing is still being modified and debugged, Python preprocessing allows fast experimentation. Once the Python preprocessing is fixed, switching to C++ preprocessing and verifying that its results match guarantees data consistency between the training end and the inference end (a parity-check sketch follows below).

On the inference side, using the same configuration as the training side, the C++ preprocessing output feeds the model input directly. The C++ preprocessing module, TurboNLP-data, uses multi-threading and a preprocessing queue to keep preprocessing latency low; measured on a BERT-base five-class classification model, it achieves 0.05ms/query at batch_size=64 and seq_len=64.
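Because the two preprocessing backends expose the same API, the consistency check mentioned above can be as simple as comparing their outputs sample by sample. The function below is an illustrative sketch; the pipeline objects and their call signature are assumptions.

```python
# Hedged sketch: verify Python/C++ preprocessing parity before switching backends.
def assert_preprocessing_parity(samples, py_pipeline, cpp_pipeline):
    """Both pipelines are assumed callable with the same API: text -> token ids."""
    for text in samples:
        py_out = py_pipeline(text)
        cpp_out = cpp_pipeline(text)
        assert py_out == cpp_out, f"preprocessing mismatch on {text!r}"
```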

TurboNLP-inference inference framework

The TurboNLP-inference framework is seamlessly compatible with TurboNLP-exp, and it is low-latency and configurable. Its underlying inference supports five libraries: BertInference (a BERT inference acceleration library), libtorch, TensorFlow, TurboTransformers (WXG's open-source BERT inference acceleration library), and BertInference-cpu (a BERT inference acceleration library for the CPU). Among them, BertInference is a high-performance BERT inference library we developed on top of TensorRT, and BertInference-cpu is a CPU BERT inference acceleration library developed in cooperation with Intel.

The following is the integrated architecture diagram of the inference framework TurboNLP-inference and the training framework TurboNLP-exp:

TurboNLP-inference has the following features:

  • Integrates the models used in NLP tasks: lstm, esim, seq2seq_encoder, attention, transformer, etc.; the model structure and model inputs are assembled according to the configuration.

  • Directly loads the model.weights model format exported by TurboNLP-exp's exporter.

  • Uses the C++ data preprocessing module TurboNLP-data and automatically feeds the preprocessing output into the model inputs.

  • Inference code is embedded into business code as a C++ .so package with an API, intruding on the business code as little as possible and keeping modifications flexible and convenient.

Business Applications

The NLP integrated tools (the TurboNLP-exp training framework and the TurboNLP-inference inference framework) greatly simplify the process from model training to launch (as shown in the figure below). Measured on an actual business model's launch process, manual training and deployment takes 14.5 person-days, while with the NLP integrated tools it takes only 4 person-days, saving **72.4%** of the labor cost.

TurboNLP-inference has successfully supported five services of the TEG AI Platform Department Search Business Center:

  • For one business's BERT document classification model, FP16 precision reached 0.290ms/query at batch_size=64 and seq_len=64, machine resources were cut by 97%, and the launch cycle was shortened by nearly 50%, greatly reducing machine and labor costs.

  • For one business's BERT model judging the relationship between text and video, response latency dropped to 2/3 of the original, and machine resources were cut by 92.8%.

  • One business's query-rewriting BERT-base model greatly shortened the launch cycle and labor costs compared with before.

  • One business's multi-task model (BERT encoder, GRU decoder) reached 2ms/query at FP16 precision.

  • One business's query BERT-base model greatly shortened its launch cycle and reached 1.1ms/query at FP16 precision.

TurboNLP-inference's performance in these businesses is inseparable from the seamless support of the training framework and the efficiency of the underlying inference libraries.

Latest progress

BertInference, one of TurboNLP-inference's underlying high-performance inference libraries, now supports INT8 inference and further optimizes its attention computation. We ran performance tests with a BERT-base text classification business model on real online data, with the following results:

At batch_size=64 and seq_len=64, performance reaches 0.126ms/query; INT8 improves on FP16 by about **54.2%**.

TurboNLP-inference supports INT8 calibration: an existing model can be calibrated directly, with the calibration process driven by configuration. The process is simple, and after calibration the model can be served at INT8 precision immediately.
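Since BertInference is built on TensorRT, the configuration-driven calibration presumably wraps a flow like the standard TensorRT INT8 engine build below. This sketch uses the public TensorRT Python API; the ONNX file name and the calibrator object (which must implement IInt8EntropyCalibrator2 over real query data) are assumptions.

```python
# Hedged sketch of a standard TensorRT INT8 engine build (not TurboNLP's code).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:           # assumed exported model path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = my_calibrator        # assumed: calibrator over real queries
engine_bytes = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine_bytes)
```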

Summary and outlook

The NLP integrated tools (the TurboNLP-exp training framework and the TurboNLP-inference inference framework) are now evolving within the TEG AI working group, with some cooperative applications in pre-trained models as well. We are also actively cooperating with the AI working group's compute team and the Taiji machine learning platform team to better open up training capabilities on the platform. Next, the training and inference frameworks will also evolve within TencentNLP's unified collaboration oteam, and we look forward to cooperating with more teams inside the company.

BertInference's model quality at INT8 precision still has room for improvement, and we are currently focusing on QAT (quantization-aware training) and knowledge distillation. Measured on the five-class BERT-base model, QAT reduces accuracy by only 0.8%, and with knowledge distillation added we expect no accuracy drop at all.


Source: blog.csdn.net/Tencent_TEG/article/details/113409057