PyCaret: A Low-Code ML Library Usage Guide

In this article, I will demonstrate how to use PyCaret to quickly and easily build a machine learning project and prepare the final model for deployment.

When working on supervised machine learning problems, it is easy to train a random forest or gradient boosting model, see how it performs, and stop experimenting once we are satisfied with the results. But what if you could compare many different models with a single line of code? What if you could reduce every step of the data science process, from feature engineering to model deployment, to just a few lines of code?

This is where PyCaret comes into play. PyCaret is a high-level, low-code Python library that makes it easy to compare, train, evaluate, tune, and deploy machine learning models with only a few lines of code. Under the hood, PyCaret is essentially a wrapper around many data science libraries such as Scikit-learn, Yellowbrick, SHAP, Optuna, and spaCy. You could use these libraries directly for the same tasks, but if you don't want to write a lot of code, PyCaret can save you a great deal of time.

Install PyCaret

PyCaret is a large library with many dependencies. I recommend using Conda to create a virtual environment for PyCaret so that the installation does not affect any of your existing libraries. To create and activate a virtual environment in Conda, run the following commands:

conda create --name pycaret_env python=3.6 
conda activate pycaret_env 

To install the default, slimmer version of PyCaret with only the required dependencies, run the following command:

pip install pycaret 

To install the full version of PyCaret, run the following command instead:

pip install pycaret[full] 

Once PyCaret is installed, deactivate the virtual environment and then add it to Jupyter with the following commands:

conda deactivate 
python -m ipykernel install --user --name pycaret_env --display-name "pycaret_env" 

Now, after launching Jupyter Notebook in your browser, you should see an option to change the kernel to the environment you just created.
> Changing the Conda virtual environment in Jupyter.

Import libraries

You can find the complete code for this article in this GitHub repository. In the code below, I only import NumPy and pandas to process the data for this demo.

import numpy as np 
import pandas as pd 

Read data

For this example, I used the "California Housing Prices" dataset available on Kaggle. In the code below, I read this dataset into a data frame and display its first ten rows.

housing_data = pd.read_csv('./data/housing.csv')
housing_data.head(10)

> First ten rows of the housing dataset.

The above output gives us an idea of what the data looks like. The dataset contains mostly numerical features, along with one categorical feature describing each house's proximity to the ocean. The target column we are trying to predict is "median_house_value". The entire dataset contains 20,640 observations.
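
Before handing the data to PyCaret, it is worth a quick structural check of the data frame. The sketch below assumes the standard Kaggle column names for this dataset (for example, "ocean_proximity" for the categorical feature):

housing_data.info()                             # dtypes and non-null counts per column
housing_data['ocean_proximity'].value_counts()  # levels of the categorical feature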

Initialize the experiment

Now that we have the data, we can initialize a PyCaret experiment, which will preprocess the data and enable logging for all of the models that we train on this dataset.

from pycaret.regression import * 
reg_experiment = setup(housing_data,  
                       target = 'median_house_value',  
                       session_id=123,  
                       log_experiment=True,  
                       experiment_name='ca_housing') 

As shown in the GIF below, running the code above preprocesses the data and then produces a data frame listing the experiment's settings.
> Pycaret setup function output.
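
If you want to inspect what setup did to the data, PyCaret 2.x exposes the experiment's internal objects through the get_config function. A minimal sketch, assuming PyCaret 2.x configuration keys:

from pycaret.regression import get_config

X_train = get_config('X_train')  # preprocessed training features
y_train = get_config('y_train')  # training target
print(X_train.shape)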

Benchmark model

We can immediately compare different baseline models to find the model with the best K-fold cross-validation performance using the compare_models function, as shown in the code below. In the example below, I have excluded XGBoost for demonstration purposes.

best_model = compare_models(exclude=['xgboost'], fold=5) 

> Results of comparing different models.

This function produces a data frame with performance statistics for each model and highlights the metrics of the best-performing model, in this case the CatBoost regressor.
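
If you want the full comparison table as a regular data frame rather than the highlighted display, PyCaret's pull function returns the most recent scoring grid. A minimal sketch (the 'MAE' column name is based on PyCaret's default regression metrics):

from pycaret.regression import pull

comparison_df = pull()  # last scoring grid as a pandas DataFrame
print(comparison_df.sort_values('MAE').head())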

Modeling

We can also use PyCaret to train a model in a single line of code. The create_model function simply takes a string corresponding to the type of model you want to train. You can find a complete list of the accepted strings and the corresponding regression models on the PyCaret documentation page.

catboost = create_model('catboost') 

The create_model function produces the data frame above with the cross-validation metrics of the trained CatBoost model.
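
If you are unsure which ID string maps to which estimator, the models function lists every estimator available in the current module. A quick sketch:

from pycaret.regression import models

models()  # data frame of model IDs (e.g. 'rf', 'lightgbm') and their full names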

Hyperparameter tuning

Now that we have a trained model, we can optimize it further with hyperparameter tuning. With just one line of code, we can tune the model's hyperparameters as shown below.

tuned_catboost = tune_model(catboost, n_iter=50, optimize = 'MAE') 

> Results of hyperparameter tuning with 10-fold cross-validation.

The most important results (in this case, the mean metrics across folds) are highlighted in yellow.
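
By default, tune_model samples hyperparameters from a predefined search space, but you can supply your own search space through the custom_grid parameter. The grid below is hypothetical and uses CatBoost's parameter names:

params = {'depth': [4, 6, 8],
          'learning_rate': [0.01, 0.05, 0.1]}
tuned_catboost = tune_model(catboost, custom_grid=params, optimize = 'MAE')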

Visualize the performance of the model

We can use PyCaret to create a variety of plots for visualizing a model's performance. PyCaret uses another high-level library called Yellowbrick to build these visualizations.

Residual plot

By default, the plot_model function will generate a residual plot for the regression model, as shown below.

plot_model(tuned_catboost) 

> Residual plot for the tuned CatBoost model.

Prediction error

We can also visualize the predicted values against the actual target values by creating a prediction error plot.

plot_model(tuned_catboost, plot = 'error') 

> Prediction error plot for the tuned CatBoost regressor.

The plot above is particularly useful because it gives us an intuitive picture of the model's R² coefficient. In the ideal case (R² = 1), where the predicted values exactly match the actual target values, the plot would contain only points along the dotted identity line.
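
To make the connection to R² concrete, we can compute it directly from the hold-out predictions with scikit-learn and the predict_model function (covered later in this article). A sketch, assuming PyCaret 2.x names the prediction column "Label":

from sklearn.metrics import r2_score

holdout_preds = predict_model(tuned_catboost)  # predictions on the hold-out set
print(r2_score(holdout_preds['median_house_value'], holdout_preds['Label']))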

Feature importance

We can also visualize the model's feature importances as shown below.

plot_model(tuned_catboost, plot = 'feature') 

> Feature importance plot for the CatBoost regressor.

As we can see from the plot above, median income is the most important feature for predicting house prices. Since this feature corresponds to the median income of the area in which a house is located, this ranking makes sense: houses in high-income areas are likely to be more expensive than houses in low-income areas.

Use all plots to evaluate the model

We can also create multiple plots to evaluate the model using the evaluate_model function.

evaluate_model(tuned_catboost) 

> The interface created using the evaluate_model function.

Interpret the model

The interpret_model function is a useful tool for explaining model predictions. This function uses an interpretable machine learning library called SHAP, which I have covered in a separate article.

With just one line of code, we can create a SHAP beeswarm plot for the model.

interpret_model(tuned_catboost) 

> SHAP plot produced by calling the interpret_model function.

Based on the plot above, we can see that the median income feature has the greatest impact on the predicted house value.
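
The interpret_model function can also produce other SHAP views through its plot parameter, such as a correlation (dependence) plot for a single feature. A sketch, where 'median_income' is assumed to be the column name in this dataset:

interpret_model(tuned_catboost, plot = 'correlation', feature = 'median_income')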

Automatic machine learning

PyCaret also has functionality for running automated machine learning (AutoML). We can specify the loss function or metric we want to optimize and then let the library take over, as shown below.

automl_model = automl(optimize = 'MAE') 

In this example, the AutoML model also happens to be a CatBoost regressor, which we can confirm by printing it.

print(automl_model) 

Running the print statement above will produce the following output:

<catboost.core.CatBoostRegressor at 0x7f9f05f4aad0>

Generate predictions

The predict_model function allows us to generate predictions using either data from the experiment or new, unseen data.

pred_holdouts = predict_model(automl_model) 
pred_holdouts.head() 

The predict_model call above generates predictions for the hold-out dataset that was set aside for validating the model. The call also gives us a data frame with performance statistics for the predictions generated by the AutoML model.
> Predictions generated by the AutoML model.

In the output above, the "Label" column contains the predictions generated by the AutoML model. We can also make predictions on the entire dataset, as shown in the code below.

new_data = housing_data.copy() 
new_data.drop(['median_house_value'], axis=1, inplace=True) 
predictions = predict_model(automl_model, data=new_data) 
predictions.head() 

Save the model

PyCaret also lets us save a trained model with the save_model function. This function saves the model's transformation pipeline to a pickle file.

save_model(automl_model, model_name='automl-model') 

We can also use the load_model function to load the saved AutoML model.

loaded_model = load_model('automl-model') 
print(loaded_model) 

Printing out the loaded model will produce the following output:

Pipeline(memory=None, 
         steps=[('dtypes', 
                 DataTypes_Auto_infer(categorical_features=[], 
                                      display_types=True, features_todrop=[], 
                                      id_columns=[], ml_usecase='regression', 
                                      numerical_features=[], 
                                      target='median_house_value', 
                                      time_features=[])), 
                ('imputer', 
                 Simple_Imputer(categorical_strategy='not_available', 
                                fill_value_categorical=None, 
                                fill_value_numerical=None, 
                                numer... 
                ('cluster_all', 'passthrough'), 
                ('dummy', Dummify(target='median_house_value')), 
                ('fix_perfect', Remove_100(target='median_house_value')), 
                ('clean_names', Clean_Colum_Names()), 
                ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'), 
                ('dfs', 'passthrough'), ('pca', 'passthrough'), 
                ['trained_model', 
                 ]], 
         verbose=False) 

As you can see from the output above, PyCaret saves not only the trained model at the end of the pipeline, but also the feature engineering and data preprocessing steps at its beginning. Since we now have a production-ready machine learning pipeline in a single file, we don't have to worry about stitching the pipeline's pieces together ourselves.
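
Because the loaded object is a complete pipeline, we can pass it raw, unprocessed rows and let it handle the preprocessing itself. A minimal sketch:

sample = housing_data.drop(['median_house_value'], axis=1).head()
predict_model(loaded_model, data=sample)  # preprocessing and prediction in one call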

Model deployment

Now that we have a production-ready model pipeline, we can also use the deploy_model function to deploy it to a cloud platform such as AWS. If you plan to deploy the model to an S3 bucket, you must configure the AWS command-line interface by running the following command before calling the function:

aws configure 

Running the command above triggers a series of prompts for information such as your AWS Secret Access Key. Once this process is complete, you can deploy the model with the deploy_model function.

deploy_model(automl_model, model_name = 'automl-model-aws',  
             platform='aws', 
             authentication = {'bucket' : 'pycaret-ca-housing-model'}) 

In the code above, I deployed the AutoML model to an S3 bucket named pycaret-ca-housing-model. From here, you could write an AWS Lambda function that pulls the model from S3 and runs it in the cloud. PyCaret also lets you load a model from S3 using the load_model function, as shown in the sketch below.
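
A sketch of loading the deployed model back from S3, using the same bucket and model name as above:

cloud_model = load_model('automl-model-aws',
                         platform='aws',
                         authentication = {'bucket' : 'pycaret-ca-housing-model'})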

MLflow user interface

Another nice feature of PyCaret is that it can log and track your machine learning experiments with MLflow, a machine learning lifecycle tool. Running the following command will launch the MLflow user interface in your browser on localhost.

!mlflow ui 

> MLflow dashboard.

In the dashboard above, we can see that MLflow tracks the runs for the different models in your PyCaret experiment. You can view each run's performance metrics and its run time.
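
You can also query the logged runs programmatically through the MLflow API instead of the UI. A sketch, assuming PyCaret logs the MAE metric under that name:

import mlflow

experiment = mlflow.get_experiment_by_name('ca_housing')  # name passed to setup()
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
print(runs[['run_id', 'metrics.MAE']].head())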

Pros and cons of using PyCaret

If you have read this far, you now have a basic understanding of how to use PyCaret. While PyCaret is a great tool, it has its own advantages and disadvantages, which you should be aware of if you plan to use it for data science projects.

Advantages:

Low-code library.
Well suited to simple, standard tasks and general machine learning.
Provides support for regression, classification, natural language processing, clustering, anomaly detection, and association rule mining.
Makes it easy to create and save a model's complex transformation pipeline.
Makes it easy to visualize model performance.

Disadvantages:

The NLP utilities are limited to topic modeling algorithms, so PyCaret is not ideal for text classification.
PyCaret is not ideal for deep learning and does not support Keras or PyTorch models.
You cannot perform more complex machine learning tasks, such as image classification and text generation, with PyCaret (at least with version 2.2.0).
By using PyCaret, you sacrifice a degree of control in exchange for simple, high-level code.

Summary

In this article, I demonstrated how to use PyCaret to complete all the steps of a machine learning project, from data preprocessing to model deployment. While PyCaret is a useful tool, you should understand its pros and cons if you plan to use it for your data science projects. PyCaret is well suited to general machine learning on tabular data, but as of version 2.2.0 it is not designed for more complex natural language processing, deep learning, or computer vision tasks. It is still a time-saving tool, though, and who knows, perhaps the developers will add support for more complex tasks in the future.

As mentioned earlier, you can find the complete code for this article on GitHub: https://github.com/AmolMavuduru/PyCaretTutorial
