Machine Learning: LightGBM-Based Classification Practices

Study time: 2022.04.30 ~ 2022.05.01

I entered the FinTech competition of China Merchants Bank, where LightGBM (lgb) was said to work very well, so I picked it up on short notice.

1. Introduction to LightGBM

LightGBM is a lightweight (Light) gradient boosting machine (GBM) and another evolution of the GBDT model. It follows the same ensemble-learning lineage as XGBoost, but trains faster and uses less memory.

Well, I admit I don't yet know what GBM, GBDT, or XGBoost actually are, but that doesn't stop me from applying lgb; I'll fill in the machine-learning background as I go.

  • Model accuracy: XGBoost and LightGBM are comparable;
  • Training speed: LightGBM is much faster than XGBoost;
  • Memory consumption: LightGBM uses much less memory than XGBoost;
  • Missing values: both XGBoost and LightGBM handle missing feature values automatically;
  • Categorical features: XGBoost does not support categorical features and requires one-hot encoding as preprocessing.

LightGBM supports categorical features directly. Machine-learning algorithms generally handle categorical features with one-hot encoding, but one-hot is not recommended for decision-tree algorithms: when a feature has many categories, it causes ① severely unbalanced splits whose split gain is very small, and ② degraded learning of the decision tree.
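
For example, with the sklearn interface a pandas column of dtype category is picked up as categorical automatically, so no one-hot step is needed. A minimal sketch with made-up toy data (column names and values are illustrative only):

import pandas as pd
import lightgbm as lgb

# Made-up toy data: 'city' is a categorical column
X = pd.DataFrame({
    'income': [3.2, 1.8, 5.5, 2.4, 4.1, 3.7, 2.9, 5.0],
    'city': pd.Categorical(['bj', 'sh', 'bj', 'gz', 'sh', 'gz', 'bj', 'sh']),
})
y = [0, 1, 0, 1, 1, 0, 0, 1]

# Columns with pandas dtype 'category' are treated as categorical features directly;
# alternatively, pass categorical_feature=['city'] to fit()
clf = lgb.LGBMClassifier(min_child_samples=1)
clf.fit(X, y)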

Pros and cons of LightGBM:

  • faster

    • LightGBM adopts the histogram algorithm, replacing a traversal over samples with a traversal over histogram bins, which greatly reduces time complexity (see the parameter sketch after this list);
    • LightGBM uses Gradient-based One-Side Sampling (GOSS) to filter out samples with small gradients during training, saving a lot of computation;
    • LightGBM grows trees with a leaf-wise strategy, which avoids a lot of unnecessary computation;
    • LightGBM uses optimized feature parallelism and data parallelism to accelerate computation, and can switch to a voting-based parallel strategy when the data volume is very large;
    • LightGBM also optimizes caching, raising the cache hit rate;
  • less memory

    • LightGBM's histogram algorithm stores discretized bin values instead of raw feature values and needs no index from feature values to samples, reducing memory consumption;
    • LightGBM uses Exclusive Feature Bundling (EFB) to reduce the number of features, cutting memory use during training.
  • shortcoming

    • Leaf-wise growth can produce very deep trees and overfit, so LightGBM adds a maximum-depth limit on top of leaf-wise growth to prevent overfitting while keeping efficiency;
    • Boosting is an iterative family: each round reweights samples according to the previous round's predictions, so the error and the model's bias keep shrinking, but this also makes the model more sensitive to noise and outliers;
    • The split search is greedy, choosing the single best split variable at each step, without considering that the overall optimum may be a combination of all features.
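
Several of the points above map directly onto LightGBM parameters. A hedged sketch of the relevant knobs (the values here are illustrative assumptions, not tuned settings):

import lightgbm as lgb

# Illustrative settings for the mechanisms listed above; values are assumptions, not tuned
clf = lgb.LGBMClassifier(
    max_bin=255,           # histogram algorithm: number of bins per feature
    num_leaves=31,         # leaf-wise growth: caps tree complexity
    max_depth=7,           # the depth limit that guards leaf-wise growth against overfitting
    boosting_type='gbdt',  # 'goss' would switch on gradient-based one-side sampling
)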

2. LightGBM practice

pip install lightgbm

LightGBM offers two categories of interfaces: the LightGBM native interface and the scikit-learn interface; both classification and regression tasks are supported.

2.1 Data processing

The Python version of LightGBM can load data from the following formats:

  • libsvm/tsv/csv/txt format file
  • NumPy 2D array(s), pandas DataFrame, SciPy sparse matrix
  • LightGBM binary file
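
As a small illustration of these loading paths through the native interface (a sketch; the file names are hypothetical):

import numpy as np
import lightgbm as lgb

# From a NumPy 2D array, with labels passed separately
X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)
dtrain = lgb.Dataset(X, label=y)

# From a CSV file (hypothetical path); LightGBM reads the file itself
# dtrain = lgb.Dataset('data/train.csv')

# Save to / reload from the LightGBM binary format for faster re-loading
dtrain.save_binary('train.bin')
dtrain_bin = lgb.Dataset('train.bin')

This post uses the sklearn-style workflow instead, starting from a pandas DataFrame: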
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

df = pd.read_excel('data/train.xlsx', sheet_name=0)
df_data = df.drop('LABEL', axis=1)
df_label = df['LABEL']

# Apply the author's custom preprocessing function (defined elsewhere in the project)
df_data = fintech_preprocess(df_data)
tr_x, te_x, tr_y, te_y = train_test_split(df_data, df_label, test_size=0.01, random_state=42)

2.2 Parameter setting

We train through the sklearn interface provided by lightgbm. The meaning of each parameter can be found in the official documentation, which is the most complete and detailed reference.

gbm = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc')
# Grid search (already completed in the original run)
params = {
    ……}  # the searched grid is omitted in the original post
gsearch = GridSearchCV(gbm, param_grid=params, scoring='roc_auc', cv=n)  # n: number of CV folds (value not given)
gsearch.fit(tr_x, tr_y)
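
The grid itself is not shown in the original post. Purely for illustration, a plausible grid for this binary task might look like the following (hypothetical parameter choices and values):

# Hypothetical grid for illustration only; the original values are not given
params = {
    'num_leaves': [15, 31, 63],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100, 200, 400],
    'min_child_samples': [10, 20, 40],
}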

Then output the result:

print('Best parameters: {0}'.format(gsearch.best_params_))
bst = gsearch.best_estimator_

2.3 Validation set evaluation

# For ROC AUC, use the positive-class probability rather than the hard 0/1 prediction
y_pred = bst.predict_proba(te_x)[:, 1]
AUC = roc_auc_score(te_y, y_pred)
print("Validation AUC: ", AUC)

The final result (validation AUC) is only 0.952. All I can say is that it is really hard for a liberal arts student to switch into this field halfway.
