[Machine Learning] Differences between three serialization methods for tree models (model storage size, serialization memory usage, serialization speed)

1. Introduction

This article summarizes the common ways of saving and loading (serializing and deserializing) tree models such as rf (random forest), xgboost, catboost and lightgbm, and compares the methods in terms of runtime memory usage and storage size.

Installation Environment:

pip install xgboost
pip install lightgbm
pip install catboost
pip install scikit-learn

A version can be pinned for each package if desired; installing without one simply fetches the latest release.

2. Example of model operation

A multi-class classification task on the iris dataset:

import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier


iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

# xgb
xgb_train = xgb.DMatrix(X_train, y_train)
xgb_test = xgb.DMatrix(X_test, y_test)
xgb_params = {'objective': 'multi:softmax', 'eval_metric': 'mlogloss', 'num_class': 3, 'verbosity': 0}
xgb_model = xgb.train(xgb_params, xgb_train)
y_pred = xgb_model.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)

# lgb
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval], early_stopping_rounds=5)  # in LightGBM >= 4.0, pass callbacks=[lgb.early_stopping(5)] instead
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred = [list(x).index(max(x)) for x in y_pred]  # argmax over the per-class probabilities
lgb_acc = accuracy_score(y_test, y_pred)

# rf
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred)

# catboost
cat_boost_model = CatBoostClassifier(depth=9, learning_rate=0.01,
                                     loss_function='MultiClass', custom_metric=['AUC'],
                                     eval_metric='MultiClass', random_seed=1996)

cat_boost_model.fit(X_train, y_train, eval_set=(X_test, y_test), use_best_model=True, early_stopping_rounds=1000)
y_pred = cat_boost_model.predict(X_test)
cat_acc = accuracy_score(y_test, y_pred)

print(xgb_acc, lgb_acc, rf_acc, cat_acc)

3. Running in-memory computing

import os
import psutil


def cal_current_memory():
    # Get the memory usage (USS) of the current process, in GB.
    pid = os.getpid()
    p = psutil.Process(pid)
    info = p.memory_full_info()
    memory_used = info.uss / 1024. / 1024. / 1024.
    return {
        'memoryUsed': memory_used
    }

This obtains the pid of the current process and uses psutil to query that process's memory usage.
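
As a quick usage sketch (the workload and numbers are illustrative, not from the article), the memory cost of any operation can be bracketed with two calls to cal_current_memory defined above:

before = cal_current_memory()['memoryUsed']
buf = bytearray(100 * 1024 * 1024)  # some memory-hungry work, ~100 MB
after = cal_current_memory()['memoryUsed']
print(f'delta: {after - before:.4f} GB')  # roughly 0.1 GB expected here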

4. Save and load

There are three main methods:

  1. jsonpickle
  2. pickle
  3. model api

4.1 jsonpickle

jsonpickle is a Python serialization and deserialization library that converts Python objects to JSON-formatted strings, or JSON-formatted strings to Python objects.

Call jsonpickle.encode to serialize and jsonpickle.decode to deserialize.
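
A minimal round trip with a toy class (a sketch; the Point class is made up for illustration):

import jsonpickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

s = jsonpickle.encode(Point(1, 2))   # a JSON string that includes type information
p = jsonpickle.decode(s)             # a reconstructed Point, no manual re-instantiation
print(s)                             # e.g. {"py/object": "__main__.Point", "x": 1, "y": 2}
print(p.x, p.y)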

Take xgb as an example

Save:

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

# xgb
xgb_train = xgb.DMatrix(X_train, y_train)
xgb_test = xgb.DMatrix(X_test, y_test)
xgb_params = {'objective': 'multi:softmax', 'eval_metric': 'mlogloss', 'num_class': 3, 'verbosity': 0}
xgb_model = xgb.train(xgb_params, xgb_train)
y_pred = xgb_model.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)

xgb_str = jsonpickle.encode(xgb_model)
with open(f'{save_dir}/xgb_model_jsonpickle.json', 'w') as f:
    f.write(xgb_str)

load:

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

xgb_test = xgb.DMatrix(X_test, y_test)

with open(f'{save_dir}/xgb_model_jsonpickle.json', 'r') as f:
    xgb_model_jsonpickle = f.read()
xgb_model_jsonpickle = jsonpickle.decode(xgb_model_jsonpickle)
y_pred = xgb_model_jsonpickle.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)
print(xgb_acc)

This completes the saving and loading of the model.

Advantage:

  1. The model loading process does not need to re-instantiate the model class; calling jsonpickle.decode on the model file returns a usable model directly
  2. The model file is plain JSON, which makes it easy to exchange models across programming languages and platforms, and to transmit and share them between systems (see the sketch below)
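
Since the file is plain JSON, any standard JSON library can at least parse it; a quick sanity check (a sketch, assuming the file written above exists):

import json

with open(f'{save_dir}/xgb_model_jsonpickle.json', 'r') as f:
    obj = json.load(f)  # parses as ordinary JSON; jsonpickle's "py/object" type tags live inside
print(type(obj))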

Disadvantages:

  1. When dealing with large or complex models, serialization may run into performance problems (it occupies noticeably more memory)
  2. The resulting model file takes comparatively large storage space

4.2 pickle

pickle is Python's built-in serialization and deserialization module. It converts Python objects into byte streams and byte streams back into Python objects, enabling persistent storage and recovery of Python objects. (A model is also just an object.)

Call pickle.dump/dumps to serialize and pickle.load/loads to deserialize (dump writes the serialized data directly to a file object, while dumps returns the serialized bytes; load and loads mirror them).
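
A minimal illustration of the two pairs with a toy object (a sketch, unrelated to the models above):

import pickle

obj = {'a': 1, 'b': [2, 3]}

data = pickle.dumps(obj)        # dumps: returns the serialized bytes
restored = pickle.loads(data)   # loads: rebuilds the object from bytes

with open('obj.pkl', 'wb') as f:
    pickle.dump(obj, f)         # dump: writes straight to an open file
with open('obj.pkl', 'rb') as f:
    restored2 = pickle.load(f)  # load: reads straight from an open file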

A comparison with other Python serialization modules is available here: https://docs.python.org/zh-cn/3/library/pickle.html

Take xgb as an example

Save:

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

# xgb
xgb_train = xgb.DMatrix(X_train, y_train)
xgb_test = xgb.DMatrix(X_test, y_test)
xgb_params = {'objective': 'multi:softmax', 'eval_metric': 'mlogloss', 'num_class': 3, 'verbosity': 0}
xgb_model = xgb.train(xgb_params, xgb_train)
y_pred = xgb_model.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)

with open(f'{save_dir}/xgb_model_pickle.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

load:

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

xgb_test = xgb.DMatrix(X_test, y_test)

with open(f'{save_dir}/xgb_model_pickle.pkl', 'rb') as f:
    xgb_model_pickle = pickle.load(f)
y_pred = xgb_model_pickle.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)
print(xgb_acc)

As with jsonpickle, loading does not require re-instantiating the model; and the serialized file is much smaller than jsonpickle's, so reading and saving are faster.

When dealing with large or complex objects, there may still be performance problems (more memory usage); and since the output is not JSON, it is hard to use across platforms and languages.

4.3 The model's built-in save/load API

Take xgb as an example

Save:

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

# xgb
xgb_train = xgb.DMatrix(X_train, y_train)
xgb_test = xgb.DMatrix(X_test, y_test)
xgb_params = {'objective': 'multi:softmax', 'eval_metric': 'mlogloss', 'num_class': 3, 'verbosity': 0}
xgb_model = xgb.train(xgb_params, xgb_train)
y_pred = xgb_model.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)

model_path = f'{save_dir}/xgb_model_self.bin'  # could also be .json; the resulting file size differs
xgb_model.save_model(model_path)

load:

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

xgb_test = xgb.DMatrix(X_test, y_test)

xgb_model_self = xgb.Booster()
xgb_model_self.load_model(f'{save_dir}/xgb_model_self.bin')
y_pred = xgb_model_self.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)
print(xgb_acc)

Only the model's parameter file is saved (the tree structure plus required model parameters such as the objective function), so the model file is small and the serialization process uses little extra memory. The model can also be saved as JSON (since XGBoost 1.0.0, saving in JSON format is recommended).

An instance of the model class must be created before the model can be loaded.
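
For example, a short sketch of the JSON variant (same API as above; only the file extension changes):

model_path = f'{save_dir}/xgb_model_self.json'
xgb_model.save_model(model_path)

xgb_model_json = xgb.Booster()          # the instance must exist before loading
xgb_model_json.load_model(model_path)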

5. Experiment

The following experiments focus mainly on smaller models.

5.1 Model storage size comparison experiment

(figure omitted: on-disk sizes of the serialized model files)
_jsonpickle is a model file serialized with the jsonpickle method

_pickle is a model file serialized with the pickle method

_self is the model file saved with the model's own save_model method

It can be seen that file size follows jsonpickle > pickle > self.
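
The sizes can be read off with a few lines (a sketch, assuming all model files live in save_dir):

import os

for name in sorted(os.listdir(save_dir)):
    path = os.path.join(save_dir, name)
    print(f'{name}: {os.path.getsize(path)} bytes')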

5.2 Running memory comparison experiment

Memory is monitored before and after serialization. For xgb (considering serialization only, excluding the memory needed to write the file):

print("before:", cal_current_memory())
model_path = f'{save_dir}/xgb_model_self.bin'
xgb_model.save_model(model_path)
print("after:", cal_current_memory())

Output:

before: {'memoryUsed': 0.1490936279296875}
after: {'memoryUsed': 0.14911270141601562}

print("before:", cal_current_memory())
pickle.dumps(xgb_model)
print("after:", cal_current_memory())

Output:

before: {'memoryUsed': 0.1498260498046875}
after: {'memoryUsed': 0.14990234375}

print("before:", cal_current_memory())
xgb_str = jsonpickle.encode(xgb_model)
print("after:", cal_current_memory())

Output:

before: {'memoryUsed': 0.14917755126953125}
after: {'memoryUsed': 0.15140914916992188}

It can be seen that for the xgb model, jsonpickle needs dozens of times more memory than the other two methods, which are very close to each other.

lgb results (in the same order as above):

self:
before: {'memoryUsed': 0.14953994750976562}
after: {'memoryUsed': 0.14959716796875}

pickle:
before: {'memoryUsed': 0.14938735961914062}
after: {'memoryUsed': 0.14946746826171875}

jsonpickle:
before: {'memoryUsed': 0.14945602416992188}
after: {'memoryUsed': 0.14974594116210938}

jsonpickle is still the largest here, though by a smaller multiple.

catboost results:

self:
before: {'memoryUsed': 0.24615478515625}
after: {'memoryUsed': 0.25492095947265625}

pickle:
before: {'memoryUsed': 0.2300567626953125}
after: {'memoryUsed': 0.25820159912109375}

jsonpickle:
before: {'memoryUsed': 0.2452239990234375}
after: {'memoryUsed': 0.272674560546875}

6. Comparison of serialization time

Because the catboost model is larger overall, it reflects serialization speed more clearly:

self:       0.02413797378540039 s
pickle:     0.04681825637817383 s
jsonpickle: 0.3211638927459717 s

jsonpickle takes noticeably more time.
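
These timings can be reproduced by wrapping each call with time.time() (a sketch reusing cat_boost_model and save_dir from section 2):

import pickle
import time

import jsonpickle

t = time.time()
cat_boost_model.save_model(f'{save_dir}/cat_boost_model_self.bin')
print('self:', time.time() - t, 's')

t = time.time()
pickle.dumps(cat_boost_model)
print('pickle:', time.time() - t, 's')

t = time.time()
jsonpickle.encode(cat_boost_model)
print('jsonpickle:', time.time() - t, 's')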

7. Source code

import base64
import json
import os
import pickle
import time
import jsonpickle
import psutil
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier

save_dir = "./models"


def cal_current_memory():
    # Get the memory usage (USS) of the current process, in GB.
    pid = os.getpid()
    p = psutil.Process(pid)
    info = p.memory_full_info()
    memory_used = info.uss / 1024. / 1024. / 1024.
    return {
        'memoryUsed': memory_used
    }


iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

# xgb
xgb_train = xgb.DMatrix(X_train, y_train)
xgb_test = xgb.DMatrix(X_test, y_test)
xgb_params = {'objective': 'multi:softmax', 'eval_metric': 'mlogloss', 'num_class': 3, 'verbosity': 0}
xgb_model = xgb.train(xgb_params, xgb_train)
y_pred = xgb_model.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)
#
# print("before:", cal_current_memory())
# model_path = f'{save_dir}/xgb_model_self.bin'
# xgb_model.save_model(model_path)
# print("after", cal_current_memory())
with open(f'{save_dir}/xgb_model_pickle.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)
print(cal_current_memory())
xgb_str = jsonpickle.encode(xgb_model)
with open(f'{save_dir}/xgb_model_jsonpickle.json', 'w') as f:
    f.write(xgb_str)
print(cal_current_memory())


# lgb
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval], early_stopping_rounds=5)
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred = [list(x).index(max(x)) for x in y_pred]
lgb_acc = accuracy_score(y_test, y_pred)
#
# print("before:", cal_current_memory())
# model_path = f'{save_dir}/lgb_model_self.bin'
# gbm.save_model(model_path)
# print("after", cal_current_memory())

with open(f'{save_dir}/lgb_model_pickle.pkl', 'wb') as f:
    pickle.dump(gbm, f)

lgb_str = jsonpickle.encode(gbm)
with open(f'{save_dir}/lgb_model_jsonpickle.json', 'w') as f:
    f.write(lgb_str)


# rf
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred)


with open(f'{save_dir}/rf_model_pickle.pkl', 'wb') as f:
    pickle.dump(rf, f)

rf_str = jsonpickle.encode(rf)
with open(f'{save_dir}/rf_model_jsonpickle.json', 'w') as f:
    f.write(rf_str)



# catboost
cat_boost_model = CatBoostClassifier(depth=9, learning_rate=0.01,
                                     loss_function='MultiClass', custom_metric=['AUC'],
                                     eval_metric='MultiClass', random_seed=1996)

cat_boost_model.fit(X_train, y_train, eval_set=(X_test, y_test), use_best_model=True, early_stopping_rounds=1000)
y_pred = cat_boost_model.predict(X_test)
cat_acc = accuracy_score(y_test, y_pred)

# t = time.time()
# model_path = f'{save_dir}/cat_boost_model_self.bin'
# cat_boost_model.save_model(model_path)
# print("after", time.time() - t)

# print("before:", cal_current_memory())
# model_path = f'{save_dir}/cat_boost_model_self.bin'
# cat_boost_model.save_model(model_path)
# print("after", cal_current_memory())
with open(f'{save_dir}/cat_boost_model_pickle.pkl', 'wb') as f:
    pickle.dump(cat_boost_model, f)

cat_boost_model_str = jsonpickle.encode(cat_boost_model)
with open(f'{save_dir}/cat_boost_model_jsonpickle.json', 'w') as f:
    f.write(cat_boost_model_str)

print(xgb_acc, lgb_acc, rf_acc, cat_acc)

# Test: load the saved models back

import pickle

import jsonpickle
import psutil
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier

save_dir = './models'

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1996)

xgb_test = xgb.DMatrix(X_test, y_test)

xgb_model_self = xgb.Booster()
xgb_model_self.load_model(f'{save_dir}/xgb_model_self.bin')
y_pred = xgb_model_self.predict(xgb_test)
xgb_acc = accuracy_score(y_test, y_pred)
print(xgb_acc)

# with open(f'{save_dir}/xgb_model_pickle.pkl', 'rb') as f:
#     xgb_model_pickle = pickle.load(f)
# y_pred = xgb_model_pickle.predict(xgb_test)
# xgb_acc = accuracy_score(y_test, y_pred)
# print(xgb_acc)

#
# with open(f'{save_dir}/xgb_model_jsonpickle.json', 'r') as f:
#     xgb_model_jsonpickle = f.read()
# xgb_model_jsonpickle = jsonpickle.decode(xgb_model_jsonpickle)
# y_pred = xgb_model_jsonpickle.predict(xgb_test)
# xgb_acc = accuracy_score(y_test, y_pred)
# print(xgb_acc)

8. Summary

The figures above are averages over several runs. For more convincing numbers you could average over more runs, but the overall picture stays the same. (Larger models would also be worth a separate discussion.)

  1. If you want minimal effort plus cross-platform, cross-language exchange, choose jsonpickle, but estimate memory first: if the model is complex and large (perhaps one model class containing several other model objects), serialization will take a lot of memory and the model file will be very large. On the other hand, there is no need to serialize each sub-model individually; a single decode restores everything.

  2. To save storage space and runtime memory, choose the model's own save method (it stores essentially just the model parameter file). With this approach, however, you may need to implement serialization and deserialization in the model's top-level class yourself (each sub-model must implement them, calling the underlying model's save_model and load_model methods); see the sketch after this list.

  3. If cross-platform, cross-language use is not a concern and everything stays in Python, pickle is the straightforward and relatively easy choice.
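
As an illustration of point 2, here is a minimal sketch (my own, not from the article) of a top-level class whose serialization delegates to each sub-model's save_model/load_model:

import xgboost as xgb

class ModelBundle:
    """Hypothetical wrapper holding one xgb sub-model; extend with more as needed."""

    def __init__(self, xgb_model=None):
        self.xgb_model = xgb_model

    def save(self, save_dir):
        # each sub-model is stored via its own lightweight format
        self.xgb_model.save_model(f'{save_dir}/xgb_sub_model.json')

    @classmethod
    def load(cls, save_dir):
        booster = xgb.Booster()  # the instance must exist before load_model
        booster.load_model(f'{save_dir}/xgb_sub_model.json')
        return cls(xgb_model=booster)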
