Percent Cognitive Intelligence Lab: Design and Practice of NLP Model Development Platform in Public Opinion Analysis (Part 2)

Editor's note

The NLP model development platform takes the rapid creation of intelligent business capabilities as its core goal. Without requiring machine learning expertise, users can complete the entire workflow of model creation, data upload, data labeling (smart labeling, data augmentation), model training, model release, and model verification through visual, convenient operations, obtaining a high-precision NLP model in a short time and truly empowering the business.

After Beijing Baifen Information Technology Co., Ltd. released its NLP model development platform, more than 200 personalized, customized real-time prediction models were launched in the public opinion analysis business. Relying on a powerful resource scheduling and computing platform, dozens of models are iterated, updated, and optimized every day, truly realizing a closed loop of data and models across the whole process. This article introduces the architecture and implementation details of the NLP model development platform, as well as its application in the public opinion business, in the hope of providing some useful references.

1. Background introduction

This article focuses on the design and practice of the NLP model development platform in the Percent Public Opinion Insight System (MediaForce). MediaForce is a SaaS product that provides information monitoring and intelligent analysis for government and enterprise customers. Since its launch in 2014, the standardization of customer requirements and the accumulation of data assets have laid a solid foundation for automation and intelligence. Internally, we need to improve production and operation efficiency and shorten the feedback loop; externally, we need to provide personalized services to strengthen customer relationships. Public opinion information is obtained by retrieving relevant data with keywords. Traditional information retrieval mechanisms such as BM25 and TF-IDF only consider the degree of matching between keywords and documents, ignoring factors such as document topic, query understanding, and search intent, which results in a large gap between the recalled documents and the customer's actual needs. On the other hand, in customer customization scenarios, customer data must be labeled manually, which is an extremely time-consuming and laborious process.

An NLP model development task generally consists of three major modules, as shown in the figure below:
[Figure: the three major modules of an NLP model development task]

In the early days, work revolved around these three modules and was repeated manually to support the business. When the business scale was small, manual methods ensured flexibility and room for innovation. However, as the business matured and grew, the limitations of manual methods became increasingly apparent, mainly in the following aspects:

(1) The growing number of NLP model development tasks greatly increases the maintenance burden on developers, especially for algorithm iteration and model version management, which can become unmanageable.

(2) Business personnel control the core business, but because the threshold for model development is relatively high, their participation is greatly limited.

Building an NLP model development platform not only solves the above problems, but also lets algorithm engineers focus on model development and benchmark validation, making the division of labor clearer and allowing everyone to participate. By integrating features such as data management, model lifecycle management, and unified management of computing and storage resources, we strive to achieve the following goals:

(1) Reusability: integrate and manage general-purpose algorithms to avoid reinventing the wheel. Moving from script development to visual operations lets developers focus on improving algorithm performance and reusing modules.

(2) Ease of use: even operations (business) personnel can build customized, private business models, truly enabling the business. Data labeling, model training, effect evaluation, model release, and other operations can all be performed according to their own personalized requirements.

(3) Scalability: computing resources, model algorithm frameworks (TensorFlow, PyTorch, H2O), and languages (Java, Python, R) can all be extended.

2. Review of NLP model development tool stack

In traditional software development, we hard-code the program's behavior. In developing NLP machine learning models, we leave much of the behavior for the machine to learn from data. The two development processes are essentially different, as shown in the following figure:
[Figure: traditional software development vs. machine learning model development]

Many traditional software engineering tools can be used to develop and serve machine learning tasks, but because of the particularities of machine learning, dedicated tools are often needed. For example, Git performs version control by comparing differences line by line, which suits most software development but is not suitable for versioning datasets or model checkpoints. With the rise of deep learning since 2012, the types and number of machine learning tools have exploded, including All-in-one (one-stop machine learning platforms) such as Polyaxon and MLflow; Modeling & Training (model development and training) such as PyTorch, Colab, and JAX; and Serving (release, monitoring, A/B testing) such as Seldon and Datatron. The following figure shows the number of MLOps tools in each category:
[Figure: number of MLOps tools in each category]

The machine learning tool stack can be broken down further into labeling, monitoring, version management, experiment tracking, CI/CD, and more. The details are not repeated here; refer to the following figure:
[Figure: breakdown of the machine learning tool stack]

As can be seen, the types and numbers of machine learning tools are extremely diverse; some are open source and some are commercial. The following figure illustrates representative products for each type of tool stack:
[Figure: representative products for each type of tool stack]

3. Construction of NLP model development platform

  1. Basic process of training an AI model
[Figure: basic workflow of training an AI model]

(1) Analyze business requirements: before formally starting model training, the business requirements must be analyzed and broken down to clarify which type of model to choose.

(2) Collect and preprocess data: collect data that is as consistent as possible with real business scenarios and covers all possible data situations.

(3) Label data: label the data according to the defined rules. Classification labels can be annotated directly offline; entity and relation annotation requires an efficient online labeling tool.

(4) Train the model: in the training stage, an algorithm is selected and trained on the labeled data based on the preliminarily determined model type.

(5) Evaluate the effect: before the trained model is formally integrated, its effectiveness must be evaluated. This requires a detailed model evaluation report, online visual upload of data for evaluating the model, and business verification in a gray-release environment.

(6) Deploy the model: after confirming that the model works as expected, it can be deployed to the production environment. Deployment must also support features such as multi-version management and autoscaling.

  2. Overall architecture
[Figure: overall architecture of the NLP model development platform]

(1) Distributed storage includes NFS, HDFS, and Ceph. HDFS stores raw data and sample features, NFS stores trained model files, and Ceph serves as the distributed file storage system of the Kubernetes cluster.

(2) The underlying computing resources are divided into CPU clusters and GPU clusters. High-performance CPU clusters are mainly used to deploy and train traditional machine learning models, while GPU clusters are used to deploy and train deep (transfer) learning models.

(3) Different resources use different computing modes. Machine learning training uses Alink for computation and YARN to schedule computing resources; deep learning training is scheduled by Kubernetes and supports mainstream frameworks such as PyTorch, TensorFlow, PaddlePaddle, and H2O (currently only single-machine training is implemented). All model deployment is released and managed uniformly through Kubernetes.

The platform provides external functions such as data annotation, model training, model evaluation, model management, model deployment, and model prediction. It also abstracts components such as classification, NER, evaluation, and prediction.

  3. Platform construction practice
[Figure: layered structure of the platform]

The upper layer of the platform provides a set of standard visual operation interfaces for business operators, while the lower layer provides full-lifecycle model management and supports the expansion of upper-layer applications.

The sections above introduced the basic process and overall architecture of the NLP model development platform. The rest of this chapter expands on technology selection and practice.

(1) Selection of container management scheduling platform

There are three mainstream container management and scheduling platforms: Docker Swarm, Mesos Marathon, and Kubernetes. The one that combines scheduling, affinity/anti-affinity, health checks, fault tolerance, scalability, service discovery, rolling upgrades, and many other features is none other than Kubernetes. Most open-source machine learning tool stacks are also built on top of Kubernetes, such as the well-known Kubeflow. In addition, deep learning usually relies on GPUs for computation, and Kubernetes has good support and extensibility for GPU scheduling and resource allocation. For example, when a cluster contains multiple types of GPU cards, you can label the GPU nodes and configure a nodeSelector when starting a task so that a specific card type is allocated precisely. In the end, we chose Kubernetes as the platform's container management system.
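To make the card-type scheduling above concrete, here is a minimal sketch assuming a hypothetical node label key gpu-type with value tesla-v100 and a placeholder training image; it only illustrates the label/nodeSelector mechanism, not the platform's actual manifests.

```yaml
# Assumed one-off step (hypothetical label key/value):
#   kubectl label node gpu-node-01 gpu-type=tesla-v100
apiVersion: v1
kind: Pod
metadata:
  name: bert-train-demo                 # hypothetical pod name
spec:
  nodeSelector:
    gpu-type: tesla-v100                # schedule only onto nodes carrying this card type
  containers:
    - name: trainer
      image: tensorflow/tensorflow:2.4.1-gpu   # assumed public image, stands in for the real training image
      command: ["python", "train.py"]          # placeholder entry point
      # GPU resource requests themselves are covered in the next subsection
```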

(2) GPU resource scheduling management

Newer versions of Docker support NVIDIA GPUs at the runtime level, so the older nvidia-docker and nvidia-docker2 wrappers are no longer considered. With the runtime alone, GPUs can already be used directly for deep learning tasks, but GPU resources cannot be limited and heterogeneous devices are not supported. There are two main solutions:

a. Device Plugin
[Figure: Nvidia Device Plugin workflow]

To manage and schedule GPUs in Kubernetes, Nvidia provides a Device Plugin for Nvidia GPUs. Its main functions are as follows:

- Supports the ListAndWatch interface to report the number of GPUs on a node.
- Supports the Allocate interface to handle GPU allocation.
However, this mechanism makes each GPU card exclusive to a single pod. Utilization is therefore very low, especially in the inference stage, which is the main reason we adopted the second approach.
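For reference, a minimal sketch of the exclusive allocation described above, assuming the standard Nvidia device plugin is installed; the pod name and image are placeholders. The extended resource nvidia.com/gpu hands the whole card to this pod alone.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exclusive-gpu-demo              # hypothetical pod name
spec:
  containers:
    - name: inference
      image: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime   # assumed public image
      resources:
        limits:
          nvidia.com/gpu: 1             # one whole card; it cannot be shared with other pods
```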

b. GPU Sharing

The GPU Device Plugin achieves good isolation, ensuring that each application's GPU usage is not affected by other applications. It is well suited to deep learning training scenarios, but for model development and model inference it wastes resources. Therefore, users need a way to express requests for shared GPU resources while ensuring that GPUs are not oversubscribed at the scheduling level. Here we adopted Aliyun's open-source GPU Sharing implementation, as shown in the following figure:
[Figure: Aliyun GPU Sharing scheduling architecture]

In the pod configuration file, the amount of GPU memory is limited; the unit here is GiB:
[Figure: pod configuration limiting GPU memory]
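A hedged reconstruction of such a configuration is shown below, assuming the gpushare device plugin and scheduler extender from reference [2] are installed with GiB as the memory unit; the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo                  # hypothetical pod name
spec:
  containers:
    - name: inference
      image: tensorflow/tensorflow:2.4.1-gpu   # assumed public image
      resources:
        limits:
          aliyun.com/gpu-mem: 3         # request 3 GiB of GPU memory instead of a whole card
```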

Execute the following commands:
[Figure: commands executed]

On the 11GiB graphics card, the GPU allocation status is as follows:
[Figure: GPU allocation status on an 11GiB card]

(3) Gateway selection

When using a Kubernetes Service, an unavoidable question is how to access a Service created inside Kubernetes from outside the cluster. The most common method is NodePort, which allows access through any host IP, but it requires specifying a nodePort in advance or generating one randomly, which makes interface resources hard to manage. The problem with mainstream gateways such as Kong, Nginx, and HAProxy is that they are not self-service, not Kubernetes-native, and are designed for API management rather than microservices. Istio is a service mesh for microservices, designed to add application-layer (L7) observability, routing, and resilience to service-to-service traffic; however, model deployment here needs to be deeply integrated with Kubernetes and does not involve calls between services. In the end, Ambassador was selected as the gateway. As a relatively new open-source microservice gateway, Ambassador integrates well with Kubernetes: its annotation- or CRD-based configuration is managed through Kubernetes itself, making it truly Kubernetes-native. An actual example is given below:
[Figure: example Ambassador Mapping configuration]

The default value of timeout_ms is 3000. When a CPU is used as the inference device, this can lead to timeouts, so the value is tuned for different scenarios. The corresponding URL can also be viewed in the route table.
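As a hedged sketch of such a configuration, the Mapping CRD below raises timeout_ms well above the 3000 ms default; the mapping name, URL prefix, and backing Service are hypothetical and depend on how the model service is actually exposed.

```yaml
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: relevance-model-mapping         # hypothetical name
  namespace: default
spec:
  prefix: /nlp/relevance/               # hypothetical external URL prefix
  service: relevance-model.default:8000 # hypothetical in-cluster Service and port
  timeout_ms: 60000                     # raised from the 3000 ms default for CPU inference
```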

(4) Visualization

Visualization here refers to evaluating and tuning model performance during training. TensorBoard was integrated first, followed by Baidu's VisualDL. A separate container is started during training, and its interface is exposed for developers to review and analyze.
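A minimal sketch of such a companion container is given below, assuming the training job writes TensorBoard event files to a shared volume mounted at /logs; the pod name, image, and PVC are illustrative assumptions only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tensorboard-demo                # hypothetical pod name
spec:
  containers:
    - name: tensorboard
      image: tensorflow/tensorflow:2.4.1        # assumed public image that ships the tensorboard CLI
      command: ["tensorboard", "--logdir=/logs", "--host=0.0.0.0", "--port=6006"]
      ports:
        - containerPort: 6006           # exposed so developers can review training curves
      volumeMounts:
        - name: train-logs
          mountPath: /logs
  volumes:
    - name: train-logs
      persistentVolumeClaim:
        claimName: train-logs-pvc       # hypothetical PVC shared with the training job
```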

(5) Model deployment

Chapter 2 introduced machine learning tools with different functions. For model deployment we use Seldon Core as the CD tool; Seldon Core is also deeply integrated into Kubeflow. Seldon is an open-source platform for deploying machine learning models at scale on Kubernetes, converting ML models (TensorFlow, PyTorch, H2O, etc.) or language wrappers (Python, Java, etc.) into REST/gRPC microservices.

The following is the construction process of the inference image, where MyModel.py is the prediction file:
[Figure: building the inference image from MyModel.py]

Part of the deployment description file is as follows:
[Figure: excerpt of the deployment description file]
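As a hedged sketch of what such a description file typically contains, the SeldonDeployment below wires a single-model inference graph to an image built from MyModel.py; the deployment name, image reference, and replica count are assumptions, not the platform's actual values.

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: relevance-classifier            # hypothetical name
spec:
  predictors:
    - name: default
      replicas: 1
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: registry.example.com/nlp/relevance:0.1   # hypothetical image built from MyModel.py
      graph:
        name: classifier                # must match the container name above
        type: MODEL
```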

4. Platform application and effectiveness

The NLP model development platform greatly lowers the threshold for model development, enabling business personnel not only to participate in formulating rules, but also in data labeling, service release, effect evaluation, and other stages. Meanwhile, data scientists and machine learning engineers can focus more on the algorithms and performance of the models themselves, which greatly improves work efficiency and simplifies workflows. The following examples illustrate the platform's effectiveness in data relevance and label processing.

  1. Relevance

Over the past few decades, various automatic information retrieval systems have been implemented. Effective document representation is at the core of being able to retrieve documents. Vector space models and probabilistic models both rely on features such as TF, IDF, and document length to convert documents from text into numeric or vector representations, and the ranking function then orders documents by their relevance to a particular query. Among these, Okapi BM25 is the best-known and most widely used ranking algorithm in IR. Traditional information retrieval methods, however, do not take semantic information into account. BERT later achieved state-of-the-art results on benchmarks such as GLUE and on IR-related tasks, partly because of its large amount of training data, and partly because its Transformer architecture captures deep relationships between the tokens in the input, allowing the model to better understand the relationships it contains. In real applications, many techniques such as query intent, query rewriting, and synonym expansion also need to be considered. The following describes our attempts to improve retrieval relevance, the evolution of the approach, and the effectiveness and application of the NLP model development platform in this regard.

(1) Traditional information retrieval based on query intent

Queries in public opinion monitoring are often single words or short phrases, and without external knowledge the search intent is usually unknown. With a traditional Okapi BM25 retrieval method, only the match between the query keywords and the documents is considered, which does not align with the search intent. Under the architecture at the time, retrieval was mainly based on Elasticsearch full-text search, so we considered whether ES could be used to build a more general processing framework. Elasticsearch is built on Lucene, and many concepts carry over directly, such as documents and fields, the relevance model, and the various query modes. The process is shown in the figure below:
[Figure: retrieval process with the intent expansion library]

The intent expansion library expands the query keywords. For example, for the keyword "True Kungfu", if the search intent refers to the restaurant brand of that name, a series of industry-related terms can be expanded: catering, stores, coupons, and so on, and queried together with the original query. The expanded terms only contribute to the relevance score and are not used as filter conditions, which can be achieved with a should clause in ES. This mechanism alleviates the relevance problem to a certain extent and works very well in vertical domains, but once cross-domain or broad search intent is involved, it falls short.

(2) Application of a BERT-based classification model

The mechanisms above are all unsupervised ranking algorithms, which are not optimal. In an extremely simplified case, if the label is defined as whether a document is relevant to a given keyword, i.e. a binary label, then training a ranking algorithm is transformed into a binary classification problem. In this way, almost any off-the-shelf binary classifier can be used to train the ranking model without modification; for example, logistic regression or a support vector machine is a good choice. This family of algorithms is called pointwise learning to rank. The mechanism fits our application scenario well, except that the query is lifted to the topic level. Classical text matching algorithms such as DSSM address the degree of match between a query string and a document, where the query string is usually a sentence rather than a single word. We therefore transform the relevance problem into a binary classification problem: the Elasticsearch index is used for retrieval in the recall phase, and the classifier judges the recalled documents in the ranking phase.
[Figure: Elasticsearch recall followed by classifier-based ranking]

Under this mechanism, personalized services are provided for customers. With the help of the NLP model development platform, everything is handled in one place, and model versions can be iteratively optimized.
[Figures: customized model configuration and iteration on the platform]

  2. Offline labeling

In some customized scenarios, offline data needs to be labeled, which is time-consuming and laborious, and earlier labeling work could not be reused for later tasks. With the annotation module we integrate existing data and annotate sample data for new labels, quickly empowering the business and freeing up productivity.
[Figures: integrating existing data and annotating samples for new labels in the annotation module]

For entity recognition, labeling can be done directly in the annotation module:
[Figure: entity annotation in the labeling module]

5. Platform Outlook

  1. Improve the labeling function

In addition to basic annotation tasks such as text classification and NER, support for mainstream tasks such as relation annotation and seq2seq needs to be added, along with features such as task assignment and multi-person collaboration.

  2. Enrich the algorithm modules

Beyond the basic needs, algorithm modules such as text matching also need to be added to cover a wider range of application scenarios.

  3. Build a pipeline-based NLP model development platform

Model training and model evaluation are currently coupled, which hinders the reuse of component modules. They should be split into fine-grained modules that can be freely combined in a pipeline fashion to maximize reuse.

Reference materials:

[1] https://huyenchip.com/2020/06/22/mlops.html

[2] https://github.com/AliyunContainerService/gpushare-scheduler-extender

[3] https://docs.seldon.io/projects/seldon-core/en/latest/
