The specific steps of converting a machine learning model into an API in Python

Convert machine learning models into APIs in Python

  1. API introduction
  2. Flask basics
  3. Build a machine learning model
  4. Save machine learning models: serialization and deserialization
  5. Use Flask to create an API for the model
  6. Test the API in Postman

API introduction

Simply put, an API is an interface between two pieces of software. If the end-user-facing software provides input in a predefined format, the other software can extend its functionality and return output to the end-user-facing software. In essence, an API is very similar to a web application, but it usually returns data in a standard data exchange format (such as JSON or XML). Once developers have the desired output, they can present it however they need.

Flask basics

Flask is a small web framework written in Python, based on the Werkzeug WSGI toolkit and the Jinja2 template engine. Flask is released under a BSD license. It is also called a "microframework" because it keeps a simple core and relies on extensions to add other functionality. Flask has no default database or form validation tool, but it remains flexible: Flask extensions can add an ORM, form validation, file uploads, and various open authentication technologies. Comparable frameworks include Django, Falcon, Hug, etc.
If you have installed the Anaconda distribution, Flask is already included. You can also install it with pip:

pip install flask

You will find that it is very small, which is one reason why it is so popular among Python developers. Another reason is that the Flask framework comes with a built-in lightweight web server, which requires less configuration and can be directly controlled by Python code.

The following code shows the simplicity of Flask very well. It creates a simple Web-API that generates a specific output when a specific URL is received.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Welcome to machine learning model APIs!"

if __name__ == '__main__':
    app.run(debug=True)

After running the script, open the URL printed in the terminal in your browser and observe the result.

  • Jupyter Notebook is well suited to work involving Python, R and Markdown, but running a web server from it tends to produce strange bugs. It is therefore recommended to write Flask code in a text editor such as Sublime and run it from a terminal/command prompt.

  • Never name the file flask.py, because it would conflict with the Flask package itself when imported.

  • By default, Flask runs on port 5000. Sometimes the server starts normally on this port, but when you open its URL in a web browser or an API client (such as Postman), it reports an error.

  • In that situation Flask reports that the server started successfully on port 5000, yet requesting the URL in the browser returns no content. This may be a port number conflict. In this case, change the default port 5000 to another port by passing it to app.run():
    app.run(debug=True,port=12345)

  • After this change, the Flask server starts on port 12345.

Now let's take a look at the code we entered:

  • The Flask instance is created with Python's built-in __name__ variable. If the file is run directly as a script, __name__ is "__main__"; if the file is imported, __name__ is the name of the imported module. For example, if you have test.py and run.py, and import test.py into run.py, then __name__ inside test.py is "test" (equivalent to app = Flask("test")).

  • The hello() function above is decorated with @app.route("/"). The route() decorator tells Flask which URL should trigger hello().

  • hello() produces the output returned when the API is called. In this case, opening localhost:5000/ in a web browser produces the expected output (assuming the default port).
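
One way to check a route like hello() without opening a browser, and without worrying about port conflicts, is Flask's built-in test client. The following is only a minimal sketch: it repeats the small app from above and calls the route in-process. The names client and response are illustrative and not part of the original code.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Welcome to machine learning model APIs!"

# The test client issues requests in-process, so no server or port is needed.
client = app.test_client()
response = client.get("/")

print(response.status_code)
print(response.get_data(as_text=True))
```

Because nothing is bound to a network port here, this kind of check can be run from any script or test suite while the real server is stopped.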

If we want to create APIs for machine learning models, here are some things to keep in mind.

Build a machine learning model

Here, we take a typical Scikit-learn model as an example to show how to use Flask to serve a Scikit-learn model. First, let's review the main modules of Scikit-learn:

  • Clustering
  • Regression
  • Classification
  • Dimensionality reduction
  • Model selection
  • Preprocessing

Whenever data is sent or received, objects have to be converted into a format suitable for transmission and back again. These operations are called serialization and deserialization. A model is quite different from plain data, but Scikit-learn models can also be serialized and deserialized, which saves us from retraining the model every time. The Flask API is written around a serialized copy of the Scikit-learn model.

Scikit-learn models also require the data to be numeric, which is why the categorical features in the data set have to be converted into numeric features such as 0 and 1. For this purpose, Scikit-learn's sklearn.preprocessing module provides encoders such as LabelEncoder and OneHotEncoder.

In addition, the Scikit-learn model used here does not fill in missing values in the data set automatically; we need to handle them ourselves before passing the data to the model. Handling missing values and encoding features, as mentioned above, are important data preprocessing steps for building a well-performing machine learning model.

To keep the demonstration simple, we use one of the most popular data sets on Kaggle, Titanic. It poses a classification problem: the task is to predict whether a passenger survived based on tabular data. To simplify further, we only use four variables: Age, Sex, Embarked (port of embarkation: C=Cherbourg, Q=Queenstown, S=Southampton) and Survived. Survived is the class label.

# Import dependencies
import pandas as pd
import numpy as np
# Load the dataset in a dataframe object and include only four features as mentioned
url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
df = pd.read_csv(url)
include = ['Age', 'Sex', 'Embarked', 'Survived'] # Only four features
df_ = df[include].copy() # Work on a copy so later edits do not modify a slice of df


"Sex" and "Embarked" are non-numeric categorical features, so we need to encode them. The "Age" feature has many missing values, which can be filled with the median or mean after looking at summary statistics. Scikit-learn cannot handle NaN here, so we write a small helper loop for this:


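categoricals = []
for col, col_type in df_.dtypes.items(): # iteritems() was removed in pandas 2.0
     if col_type == 'O': # object dtype, i.e. a categorical text column
          categoricals.append(col)
     else:
          df_[col] = df_[col].fillna(0) # assign back instead of inplace=True on a column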


The code above fills in missing values in the data set. Note that the way missing values are handled matters a great deal for model performance: when many values are missing, filling them all with a single value must be done carefully, otherwise it can introduce a large bias. In this data set the column with missing values is Age, so filling NaN with 0 is not a good choice in practice.
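
A common alternative to filling with 0 is to impute the column's median. The sketch below uses a small made-up Age column rather than the real data set, and the names ages and median_age are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic "Age" column (values are made up).
ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# median() skips NaN, so this is the median of 22, 26 and 38.
median_age = ages["Age"].median()

# Assign the result back to the column rather than using inplace=True.
ages["Age"] = ages["Age"].fillna(median_age)

print(median_age)
print(ages["Age"].tolist())
```

If the median is computed on the training data, the same value would normally be reused when preparing new data for prediction.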

To convert non-numeric features into numeric ones, you can use one-hot encoding, for example with the get_dummies() function provided by pandas:

df_ohe = pd.get_dummies(df_, columns=categoricals, dummy_na=True)
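
To see what get_dummies() produces, here is a minimal sketch on a two-row frame with made-up values (the name toy is hypothetical). Each categorical column is replaced by one indicator column per category, and dummy_na=True adds an extra indicator column for missing values.

```python
import pandas as pd

# Two made-up passengers; the second has no port of embarkation recorded.
toy = pd.DataFrame({
    "Age": [85, 24],
    "Sex": ["male", "female"],
    "Embarked": ["S", None],
})

# One indicator column per category, plus a "_nan" indicator per encoded column.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dummy_na=True)

print(list(encoded.columns))
print(encoded)
```

The numeric Age column passes through unchanged, while Sex and Embarked no longer appear under their original names.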

With preprocessing complete, we are ready to train the machine learning model. We choose a logistic regression classifier.

from sklearn.linear_model import LogisticRegression
dependent_variable = 'Survived'
x = df_ohe[df_ohe.columns.difference([dependent_variable])]
y = df_ohe[dependent_variable]
lr = LogisticRegression()
lr.fit(x, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
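
As a small illustration of what a fitted classifier offers, the sketch below trains LogisticRegression on a made-up one-feature data set (not the Titanic data). It shows that predict() returns class labels while predict_proba() returns one probability per class. The names X, y and clf are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: one feature, two classes (not the Titanic data).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

labels = clf.predict(X)       # hard class labels, here 0 or 1
probs = clf.predict_proba(X)  # shape (n_samples, n_classes); each row sums to 1

print(labels.shape, probs.shape)
```

An API that should return survival probabilities rather than 0/1 labels could, under this assumption, use the second column of predict_proba() instead of predict().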

Once the model is trained, save it. Technically speaking, we serialize the model. In Python, this operation is called pickling.

Save machine learning models: serialization and deserialization

Use joblib, the library commonly used for persisting Scikit-learn models:

import joblib # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(lr, 'model.pkl')
['model.pkl']

The logistic regression model is now persisted to disk. We can load it back into memory with one line of code; loading the model back into the workspace is deserialization.

lr = joblib.load('model.pkl')
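
The same round trip can be illustrated with Python's standard pickle module, which implements the pickling mechanism that joblib builds on. This is only a sketch: a plain dictionary with made-up values stands in for the trained model, and fake_model, payload and restored are hypothetical names.

```python
import pickle

# A plain dictionary stands in for a trained model (hypothetical values).
fake_model = {"name": "logistic_regression", "coefficients": [0.5, -1.2, 0.3]}

# Serialization: turn the object into bytes that could be written to a file.
payload = pickle.dumps(fake_model)

# Deserialization: rebuild an equal, but separate, object from those bytes.
restored = pickle.loads(payload)

print(restored == fake_model)
print(restored is fake_model)
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.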

Use Flask to create an API for the model

To use Flask to create a server for the model, we need to do two things:

  • When the app starts, load the existing model into memory.
  • Create an API endpoint that accepts input variables, converts them to the appropriate format, and returns predictions.

More specifically, when you enter the following:

[
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": "female", "Embarked": "C"},
    {"Age": 3, "Sex": "male", "Embarked": "C"},
    {"Age": 21, "Sex": "male", "Embarked": "S"}
]

You want the output of the API to be:

{"prediction": [0, 1, 1, 0]}

Here, 0 means the passenger died and 1 means the passenger survived. The input format is JSON, one of the most widely used data exchange formats.
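
To make the exchange format concrete, the sketch below parses a shortened version of the request body with Python's standard json module and builds a response body the other way round. The prediction values in it are placeholders, not real model output, and raw_body, passengers and response_body are illustrative names.

```python
import json

# A shortened request body, as the raw text a client would send.
raw_body = """
[
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": "female", "Embarked": "C"}
]
"""

# Parsing yields a list of dicts, one per passenger.
passengers = json.loads(raw_body)

# A response goes the other way: Python object -> JSON text.
# The values 0 and 1 are placeholders, not real predictions.
response_body = json.dumps({"prediction": [0, 1]})

print(len(passengers), passengers[1]["Sex"])
print(response_body)
```

Inside a Flask view, request.json and jsonify() perform these two conversions.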

To achieve this, we first need to write a function predict(). Its goals are the ones mentioned earlier:

  • When the app starts, load the existing model into memory.
  • Create an API endpoint that accepts input variables, converts them to the appropriate format, and returns predictions.

We have already shown how to load an existing model; the function below predicts a person's survival status from the input it receives:

from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
     json_ = request.json
     query_df = pd.DataFrame(json_)
     query = pd.get_dummies(query_df)
     prediction = lr.predict(query)
     # tolist() turns NumPy integers into plain ints that jsonify can serialize
     return jsonify({'prediction': prediction.tolist()})

Although it seems simple, you may run into a small problem at this step.

For the function above to work, every possible value of the categorical variables must be present in the incoming request, which may not be the case in practice. If a value is missing from the request, the data generated inside predict() as currently defined will have fewer columns than the classifier expects, and the model will raise an error.

To solve this problem, we save the list of columns used during model training. Any Python object can be serialized into a .pkl file in the same way.

model_columns = list(x.columns)
joblib.dump(model_columns, 'model_columns.pkl')
['model_columns.pkl']

Since the column list has been saved, you can handle missing columns at prediction time (remember to load the model before the app starts):

@app.route('/predict', methods=['POST']) # Your API endpoint URL ends with /predict
def predict():
    if lr:
        try:
            json_ = request.json
            query = pd.get_dummies(pd.DataFrame(json_))
            # Add any columns missing from the request and match the training order
            query = query.reindex(columns=model_columns, fill_value=0)

            # tolist() turns NumPy integers into plain ints that jsonify can serialize
            prediction = lr.predict(query).tolist()

            return jsonify({'prediction': prediction})

        except Exception:
            # Return the stack trace to the caller (requires "import traceback")
            return jsonify({'trace': traceback.format_exc()})
    else:
        print ('Train the model first')
        return ('No model here to use')
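
The effect of the reindex() call can be seen in isolation. In the sketch below, model_columns is a hypothetical, hand-written column list standing in for the one saved during training, and the request contains a single passenger, so only some of the dummy columns are produced.

```python
import pandas as pd

# Hypothetical column list, standing in for the saved model_columns.
model_columns = ["Age", "Embarked_C", "Embarked_S", "Sex_female", "Sex_male"]

# A request with a single passenger only produces some of the dummy columns.
query = pd.get_dummies(pd.DataFrame([{"Age": 24, "Sex": "female", "Embarked": "C"}]))

# reindex adds the missing columns (filled with 0) and fixes the column order.
aligned = query.reindex(columns=model_columns, fill_value=0)

print(list(query.columns))
print(list(aligned.columns))
```

After this step the frame always has the same columns, in the same order, as the data the classifier was trained on.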



The "/predict" API now contains all the required elements; you only need to write the main block:

if __name__ == '__main__':
    try:
        port = int(sys.argv[1]) # This is for a command-line argument (requires "import sys")
    except (IndexError, ValueError):
        port = 12345 # If you don't provide any port then the port will be set to 12345
    lr = joblib.load('model.pkl') # Load "model.pkl"
    print ('Model loaded')
    model_columns = joblib.load('model_columns.pkl') # Load "model_columns.pkl"
    print ('Model columns loaded')
    app.run(port=port, debug=True)
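
The port handling can also be pulled into a small helper, which makes the fallback behaviour easy to check on its own. This is a sketch under stated assumptions: parse_port and DEFAULT_PORT are hypothetical names, not part of the original code.

```python
DEFAULT_PORT = 12345  # same fallback value as in the text

def parse_port(argv, default=DEFAULT_PORT):
    """Return the port given as the first command-line argument, else the default."""
    try:
        return int(argv[1])
    except (IndexError, ValueError):
        # No argument was given, or it was not an integer.
        return default

print(parse_port(["api.py", "8080"]))        # explicit port
print(parse_port(["api.py"]))                # no argument
print(parse_port(["api.py", "not-a-port"]))  # invalid argument
```

In the API script this helper would be called with sys.argv.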




Now, this API is all completed and ready to be hosted.

If you want to separate the logistic regression model code and the Flask API code into separate .py files, which is good programming practice, your model.py should look like this:

# Import dependencies
import pandas as pd
import numpy as np

# Load the dataset in a dataframe object and include only four features as mentioned
url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
df = pd.read_csv(url)
include = ['Age', 'Sex', 'Embarked', 'Survived'] # Only four features
df_ = df[include].copy() # Work on a copy so later edits do not modify a slice of df

# Data Preprocessing
categoricals = []
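for col, col_type in df_.dtypes.items(): # iteritems() was removed in pandas 2.0
     if col_type == 'O': # object dtype, i.e. a categorical text column
          categoricals.append(col)
     else:
          df_[col] = df_[col].fillna(0) # assign back instead of inplace=True on a column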

df_ohe = pd.get_dummies(df_, columns=categoricals, dummy_na=True)

# Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
dependent_variable = 'Survived'
x = df_ohe[df_ohe.columns.difference([dependent_variable])]
y = df_ohe[dependent_variable]
lr = LogisticRegression()
lr.fit(x, y)

# Save your model
import joblib # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(lr, 'model.pkl')
print("Model dumped!")

# Load the model that you just saved
lr = joblib.load('model.pkl')

# Saving the data columns from training
model_columns = list(x.columns)
joblib.dump(model_columns, 'model_columns.pkl')
print("Models columns dumped!")


And api.py looks like this:


# Dependencies
from flask import Flask, request, jsonify
import joblib
import sys
import traceback
import pandas as pd
import numpy as np

# Your API definition
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    if lr:
        try:
            json_ = request.json
            print(json_)
            query = pd.get_dummies(pd.DataFrame(json_))
            query = query.reindex(columns=model_columns, fill_value=0)

            # tolist() yields plain ints, so the response is {"prediction": [0, 1, ...]}
            prediction = lr.predict(query).tolist()

            return jsonify({'prediction': prediction})

        except Exception:

            return jsonify({'trace': traceback.format_exc()})
    else:
        print ('Train the model first')
        return ('No model here to use')

if __name__ == '__main__':
    try:
        port = int(sys.argv[1]) # This is for a command-line input
    except (IndexError, ValueError):
        port = 12345 # If you don't provide any port the port will be set to 12345

    lr = joblib.load("model.pkl") # Load "model.pkl"
    print ('Model loaded')
    model_columns = joblib.load("model_columns.pkl") # Load "model_columns.pkl"
    print ('Model columns loaded')

    app.run(port=port, debug=True)

Test the API in Postman

Now you can test this API in an API client such as Postman. Just make sure that model.py and api.py are in the same directory, that model.py has been run (so that model.pkl and model_columns.pkl exist), and that api.py is running before you send a request.
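
If you prefer to stay in Python rather than use a separate API client, the same POST request can be built with the standard library. The sketch below assumes the API is running locally on port 12345 as configured above; build_request and predict_remote are hypothetical helper names, and the actual network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

# Assumes api.py is running locally on the fallback port used in this article.
URL = "http://localhost:12345/predict"

passengers = [
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": "female", "Embarked": "C"},
]

def build_request(url, records):
    """Build a POST request carrying the records as a JSON body."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict_remote(url, records):
    """Send the request and return the decoded JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(url, records)) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_request(URL, passengers)
print(req.get_method(), req.full_url)
# With the server running: print(predict_remote(URL, passengers))
```

The Content-Type header matters here, because Flask's request.json only parses the body when the request is declared as JSON.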


Origin blog.csdn.net/weixin_44763047/article/details/106604754