How to deploy machine learning models without unnecessary risk?

Author: Zen and the Art of Computer Programming

1. Introduction

The term "no risk" is a mysterious word for anyone who takes machine learning very seriously, but it is irreplaceable in practical applications. For example, the click-through rate prediction model of a large e-commerce website is a very important model no matter how accurate it is, but its prediction results will definitely bring huge economic and social value, so regardless of its accuracy, it must A high degree of reliability is required. So how do you deploy without the model having surprises?

To ensure the reliability of a machine learning model, whether it serves online requests or runs as offline batch processing, the following points need to be covered:

  1. Reproducibility of the model training process: all code and configuration used during training must be recorded in a version control system (such as Git) so that training runs can be repeated;
  2. Division and verification of the test set: the training data should cover the full data distribution, and a separate test set must be held out to evaluate the model;
  3. Streamlined data cleaning and preprocessing: data processing involves multiple steps, which should be automated as a pipeline to keep the data consistent and valid;
  4. Optimization and tuning of hyperparameters: the model's hyperparameters must be tuned to the situation at hand to improve its performance;
  5. Continuous monitoring of model performance: once the model is online, its performance must be monitored promptly to ensure it remains stable in production;
  6. Backup of model parameters: in production, model updates may change parameters or add new model features, so the latest parameters and version numbers should be backed up regularly.

This article discusses what to watch out for when deploying machine learning models, organized around these six aspects, and I hope it provides some help.

2. Explanation of basic concepts and terms

Git

  • A version control system used to manage historical versions of code, documents, configuration files, and other files; it provides operations such as checkout, commit, merge, and rollback.
  • Open source free software.

Python

  • An interpreted, object-oriented, high-level programming language with dynamic typing, widely used in data science, web crawling, automated operations, cloud computing, artificial intelligence, and other fields.
  • Code is usually organized into modules, each of which defines an independent piece of functionality.

Scikit-learn

  • A Python-based machine learning toolkit with an easy-to-use yet powerful API covering tasks such as classification, regression, clustering, dimensionality reduction, feature selection, missing-value handling, model selection, cross-validation, and grid search.
  • Open source free software.

TensorFlow

  • An open-source machine learning framework from Google that provides efficient numerical computation and is suitable for large-scale model training and inference.
  • Open source free software.

Jupyter Notebook

  • A web-based interactive notebook environment that can execute code and display rich output such as text, figures, and other media, making it a convenient platform for researchers to do data analysis and modeling.
  • Open source free software.

Docker

  • A containerization technology that packages software together with its dependencies so it can run on any system without worrying about the underlying environment, enabling deployment across platforms and hardware.
  • Open source free software.

Flask

  • A lightweight ("micro") web application framework written in Python.
  • Open source free software.

MySQL

  • A widely used relational database management system and database server.
  • Open source free software.

3. Explanation of core algorithm principles, specific operating steps and mathematical formulas

1. Reproducibility of the model training process

(1) Use git for version control

Generally, the training process of a machine learning model includes three stages: data preprocessing, feature engineering, and model training. To ensure that training is reproducible, manage the project with a version control system, hosting the repository on a service such as GitHub, GitLab, or Bitbucket.

First, create a remote repository (on GitHub, GitLab, etc.) and push the local project to it. Then clone the remote repository locally, create a branch for development, edit the code locally, and commit to that branch each time a piece of functionality is completed.

After committing, push the local branch to the remote repository. Every committed version is recorded by the version control system for easy reference.
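Beyond committing the code, it helps to record exactly which commit and configuration produced each trained model. Below is a minimal sketch of that idea, assuming the training script runs inside a Git working tree; the function name, file path, and configuration keys are illustrative, not part of any standard tool.

```python
import json
import subprocess
from datetime import datetime, timezone

def snapshot_run_metadata(config: dict, path: str = "run_metadata.json") -> dict:
    """Record the exact code version and configuration used for a training run."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    metadata = {
        "git_commit": commit,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config": config,
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

# Example: store the hyperparameters alongside the commit that produced them.
snapshot_run_metadata({"model": "logistic_regression", "C": 1.0, "random_state": 42})
```

Committing this metadata file together with the model artifacts makes any past training run easy to reconstruct.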

(2) Separate data preprocessing, feature engineering, and model training

The time required for each of these stages differs from data set to data set, so it is recommended to separate data preprocessing, feature engineering, and model training and handle them as independent steps. This helps prevent errors or missing data introduced during preprocessing or feature engineering.

For example, in a news text classification scenario, you can use the TextBlob toolkit for data preprocessing (word segmentation, stop-word removal, stemming, etc.), the NLP library spaCy for feature engineering (part-of-speech tagging, named entity recognition, text similarity, etc.), and the machine learning library Scikit-learn for model training (naive Bayes classifiers, logistic regression, decision trees, etc.).
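As a concrete illustration of keeping the stages separate, here is a minimal sketch for a toy text-classification task. It uses spaCy for preprocessing and Scikit-learn for feature extraction and training (TF-IDF plus naive Bayes); it assumes the `en_core_web_sm` spaCy model is installed, and the example texts and labels are invented for illustration.

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def preprocess(text: str) -> str:
    """Stage 1: cleaning -- tokenize, drop stop words, lemmatize."""
    doc = nlp(text)
    return " ".join(t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop)

# Toy corpus, purely for illustration.
texts = ["Stocks rallied after the quarterly earnings report.",
         "The home team won the championship game last night."]
labels = ["business", "sports"]

cleaned = [preprocess(t) for t in texts]                   # Stage 1: preprocessing
model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # Stage 2: feature engineering
model.fit(cleaned, labels)                                 # Stage 3: model training
print(model.predict([preprocess("Earnings beat analyst expectations.")]))
```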

(3) Use Dockerfile to build the image

Although Docker provides convenient environment isolation and resource allocation, it still pays to standardize and manage how images are built, to avoid wasting resources and to reduce environment conflicts. It is therefore recommended to build images from a Dockerfile.

A Dockerfile is a file that describes how to create a Docker image. Its main contents are the basic settings of the image, the installation of dependent packages, the addition of the code, and the startup command. Building from a Dockerfile makes the image reproducible and portable, so the same code can run in any environment.
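A minimal Dockerfile sketch along those lines is shown below; the base image tag and the file names (`requirements.txt`, `serve.py`) are assumptions about the project layout, not fixed conventions.

```dockerfile
# Basic settings of the image
FROM python:3.10-slim
WORKDIR /app

# Install dependent packages (assumes requirements.txt lists them)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Add the code
COPY . .

# Startup command (serve.py is a hypothetical Flask entry point)
CMD ["python", "serve.py"]
```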

(4) Configure environment variables

To make the code more portable, it is recommended to configure environment variables for the project path, data directory, model directory, log directory, and so on. This makes it easier for team members to collaborate and share code and models, and reduces maintenance effort.

For example, you can give the project a unified name, place data, models, and logs under a common directory, and expose those paths as environment variables for easy reference.
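A minimal sketch of reading such paths from environment variables, with defaults when they are unset (the variable names here are illustrative):

```python
import os
from pathlib import Path

# Hypothetical variable names; fall back to sensible defaults when unset.
PROJECT_ROOT = Path(os.getenv("PROJECT_ROOT", "."))
DATA_DIR = Path(os.getenv("DATA_DIR", PROJECT_ROOT / "data"))
MODEL_DIR = Path(os.getenv("MODEL_DIR", PROJECT_ROOT / "models"))
LOG_DIR = Path(os.getenv("LOG_DIR", PROJECT_ROOT / "logs"))

for d in (DATA_DIR, MODEL_DIR, LOG_DIR):
    d.mkdir(parents=True, exist_ok=True)  # every collaborator gets the same layout
```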

2. Division and verification of test set

(1) Training data proportion

Generally, a machine learning model should be trained on more data than it is tested on; otherwise the measured performance will not reflect the real situation. A common choice is for the training data to account for 80% to 90% of the total data set.

(2) Division of test set

The test set should have the same data distribution and data quality as the training set; otherwise the model's accuracy cannot be judged objectively.

The test set should also be kept as independent of the training set as possible; that is, the same data must never be used for both training and testing. This is achieved by splitting the data set.
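A minimal sketch of such a split with Scikit-learn, holding out 20% of the data and stratifying so the class distribution stays comparable (the iris data set is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as a test set that is never touched during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```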

(3) Division of verification set

Before training the final model, the best hyperparameters can be selected through cross-validation. The data set is divided into K subsets; K-1 of them are used to train the model and the remaining one is used to validate it, rotating until every subset has served as the validation fold. K is usually 5 to 10.
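A minimal sketch of 5-fold cross-validation with Scikit-learn (the data set and model are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# K = 5: train on 4 folds, validate on the remaining one, and rotate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```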

(4) Embedded verification

To guard against overfitting, embedded validation is recommended before training the machine learning model: deliberately inject some unexpected noise (random perturbations, label errors, etc.) into the training set and observe how much the model's performance degrades. If the model is not robust to this noise, it indicates a risk of overfitting. A minimal sketch of this noise-injection check appears after the list below.

Commonly used validation techniques in this context include:

  • k-fold cross validation
  • Leave One Out (LOO)
  • Bootstrapping
  • Chi-square test
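The noise-injection idea itself can be sketched as follows; this is not a standard library routine but a hand-rolled check, with the data set, model, and 10% label-flip rate chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Deliberately corrupt 10% of the training labels.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.10
y_noisy[flip] = 1 - y_noisy[flip]

clean_acc = RandomForestClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
noisy_acc = RandomForestClassifier(random_state=0).fit(X_train, y_noisy).score(X_test, y_test)

# A large drop suggests the model is memorizing noise rather than generalizing.
print(f"clean: {clean_acc:.3f}, with label noise: {noisy_acc:.3f}")
```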

3. Streamlining data cleaning and preprocessing

(1) Use a pipeline approach for data cleaning and preprocessing

Data preprocessing refers to cleaning, converting, deleting, resampling and other operations on the original data to finally generate a data set for modeling. Since data processing often involves multiple steps, a pipeline approach is needed for automated processing.

Done manually, data cleaning and preprocessing is a complex, tedious process that requires multiple iterations, tailored to the characteristics of the data, before the results are satisfactory. With a pipeline, simple configuration is enough to automate the same processing, which greatly shortens processing time and improves efficiency.
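A minimal sketch of such a pipeline with Scikit-learn's Pipeline and ColumnTransformer; the column names and imputation/encoding choices are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with missing values and mixed column types.
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "income": [40000, 52000, None, 61000],
    "city": ["NY", "SF", "NY", None],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric, ["age", "income"]),
                                ("cat", categorical, ["city"])])

# Every run applies exactly the same steps in the same order.
X = preprocess.fit_transform(df)
print(X.shape)
```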

(2) Use cleaning library for data cleaning

In the process of machine learning, data cleaning is an important task, especially on larger data sets. Therefore, it is recommended to use a cleaning library to clean the data instead of manual cleaning. The introduction of the cleaning library can effectively save time and eliminate data quality problems.

Currently, popular cleaning libraries include Pandas, Dask DataFrame, and similar tools.
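For example, a few typical Pandas cleaning operations (the column names and rules are invented):

```python
import pandas as pd

# Hypothetical raw records with duplicates, a missing value, and string-typed numbers.
raw = pd.DataFrame({
    "user_id": ["1", "2", "2", "3"],
    "clicks": ["10", "5", "5", None],
})

clean = (
    raw.drop_duplicates()                        # remove duplicate rows
       .dropna(subset=["clicks"])                # drop rows missing the target column
       .astype({"user_id": int, "clicks": int})  # fix column types
)
print(clean)
```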

(3) Use Python for data preprocessing

In addition to cleaning and converting raw data, Python can be used for the rest of preprocessing as well. For text data, use an NLP library for text analysis and processing; for images, video, audio, and other unstructured data, use computer vision and related libraries for feature extraction; for structured data, use SQL, Pandas, and similar libraries for cleaning and conversion.

4. Optimization and tuning of hyperparameters

(1) Parameter optimization of random forest

Random Forest is a typical ensemble learning algorithm. It trains quickly, generalizes well, and is well suited to multi-class classification problems on large data sets.

The key to parameter optimization is to determine appropriate hyperparameters. Hyperparameters are parameters set before model training, which determine the model generation process, such as the number of trees, tree depth, tree type, sampling strategy, etc.

Commonly used hyperparameter optimization algorithms include Grid Search, Bayesian Optimization, Genetic Algorithm, Simulated Annealing, etc.
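For example, a minimal grid search over a random forest's tree count and depth with Scikit-learn (the data set and grid values are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Small illustrative grid over the number of trees and the tree depth.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```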

(2) Hyperparameter optimization of XGBoost

XGBoost (Extreme Gradient Boosting) is a tree-based gradient boosting algorithm. Its performance is excellent and it ranks among the best in many competitions.

For XGBoost, the hyperparameter optimization approaches described above for random forests also apply. In addition, XGBoost hyperparameters are often tuned with Bayesian optimization methods such as TPE (Tree-structured Parzen Estimator).
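A minimal sketch of TPE-style tuning for XGBoost using Optuna (whose default sampler is TPE); it assumes the `optuna` and `xgboost` packages are installed, and the search ranges and trial count are arbitrary:

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Sample a candidate configuration; Optuna's default sampler is TPE.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    return cross_val_score(XGBClassifier(**params), X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```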

5. Continuously monitor the effectiveness of the model

(1) Effect indicators of the model

Model performance metrics are an important criterion for measuring the quality of a model, and different metrics correspond to different optimization goals.

Commonly used model performance indicators include the following; a minimal sketch of computing them appears after the list:

  • Accuracy
  • F1 Score
  • ROC Curve
  • AUC (area under the ROC curve)
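A minimal sketch of computing these indicators with Scikit-learn; the labels and predicted probabilities are invented:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

# Hypothetical held-out labels, predicted labels, and predicted probabilities.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.3, 0.7, 0.6]

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points on the ROC curve
```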

(2) Model monitoring strategy

Since the production environment and actual business needs can vary widely, the model's monitoring strategy should be adapted to the situation at hand. Commonly used monitoring strategies include the following; a minimal threshold-alert sketch appears after the list:

  • Fusion of models using ensemble learning methods.
  • Use A/B testing in a production environment to verify model effectiveness.
  • Automatically adjust the model's hyperparameters according to the model's indicator fluctuations.
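As one possible building block for such monitoring, here is a minimal threshold-alert sketch; the metric floor and the printed alert are placeholders for whatever alerting system is actually in use:

```python
from sklearn.metrics import roc_auc_score

AUC_ALERT_THRESHOLD = 0.70  # hypothetical acceptable floor for this model

def check_model_health(y_true, y_prob) -> bool:
    """Compare the live metric against the floor and flag degradation."""
    auc = roc_auc_score(y_true, y_prob)
    if auc < AUC_ALERT_THRESHOLD:
        print(f"ALERT: AUC dropped to {auc:.3f}, below {AUC_ALERT_THRESHOLD}")
        return False
    return True

# Would be run on a schedule against recently labeled production traffic.
check_model_health([0, 1, 1, 0, 1], [0.3, 0.4, 0.8, 0.2, 0.9])
```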

6. Back up model parameters

(1) Version management

To achieve version control of the model, it is recommended to create a repository on a hosting service (such as GitHub) and manage it with Git.

Whenever a training run completes, commit the latest model configuration and parameters to the repository. This makes it easy to track and restore any version of the model.
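A minimal sketch of saving a trained model together with its parameters and the Git commit that produced it; it assumes a Git working tree and uses joblib for serialization, with the file-name scheme being just one possible convention:

```python
import json
import subprocess
from datetime import datetime, timezone

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Name the artifact with a timestamp and the current Git commit.
stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()

joblib.dump(model, f"model_{stamp}_{commit}.joblib")
with open(f"model_{stamp}_{commit}.json", "w") as f:
    json.dump({"git_commit": commit, "params": model.get_params()}, f, indent=2)
```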

(2) Regular backup

To avoid losing or corrupting models through operator error, set up a regular backup strategy; daily, weekly, or monthly backups are common choices.

Backups can be automated with scripts or triggered manually.

7. Future development trends and challenges

There are many other things to pay attention to when deploying machine learning models. For example, how to handle data sets when deploying models, third-party libraries that models depend on, load balancing on the deployment server, etc.

Deploying machine learning models matters in every field. Because of the nature of machine learning, even a well-designed deployment can leave the model's effectiveness at risk. The safest approach is to deploy the model into an existing test environment first and test the whole system there.

However, even if the deployed model shows no problems in the test environment, do not blindly trust its effectiveness. A better approach is to use A/B testing to verify the model and find out how it performs on real business traffic.

8. Appendix Frequently Asked Questions and Answers

Reprinted from: blog.csdn.net/universsky2015/article/details/133446371