A plain-language introduction to the LightGBM algorithm

Machine learning is one of the fastest-growing fields, and new algorithms are released all the time. LightGBM, a boosting framework introduced by Microsoft, has recently become widely used in Kaggle data competitions. It looks set to challenge XGBoost's standing in the community, and for data mining (competition) enthusiasts it is one more good tool to have. This article explains in plain terms what the LightGBM algorithm is and how to use it in practice.

What is LightGBM?

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

  • Faster training speed and higher efficiency

  • Lower memory usage

  • Better accuracy

  • Support for parallel learning and GPU learning

  • Ability to handle large-scale data

  • Categorical feature support

As data volumes grow, traditional data science algorithms struggle to keep up. LightGBM's speed and its support for GPU learning are the main reasons data scientists use it so widely when building data science applications.

How is it different from other tree-based algorithms?

LightGBM grows trees leaf-wise (vertically), while most other tree-based algorithms grow trees level-wise (horizontally). At each step, LightGBM chooses the leaf with the largest loss reduction (max delta loss) and splits it. When growing the same number of leaves, the leaf-wise algorithm can reduce the loss more than the level-wise algorithm.
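
The difference in growth order can be sketched with a toy example. The split gains below are made-up numbers, and the two functions are hypothetical helpers, not LightGBM internals. Level-wise growth is simplified here to splitting one node at a time in breadth-first order, so that both strategies can be compared after the same number of splits.

```python
import heapq
from collections import deque

# Hypothetical loss reduction obtained by splitting each node of a toy tree.
# Nodes are numbered heap-style: the children of node i are 2*i and 2*i + 1.
SPLIT_GAIN = {1: 10.0, 2: 6.0, 3: 1.0, 4: 5.0, 5: 0.5, 6: 0.4, 7: 0.3}


def children(node):
    """Return the children of a node that can still be split in the toy tree."""
    return [c for c in (2 * node, 2 * node + 1) if c in SPLIT_GAIN]


def grow_leaf_wise(num_splits):
    """Best-first growth: always split the leaf with the largest gain."""
    heap = [(-SPLIT_GAIN[1], 1)]  # max-heap via negated gains
    order = []
    while heap and len(order) < num_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in children(node):
            heapq.heappush(heap, (-SPLIT_GAIN[child], child))
    return order


def grow_level_wise(num_splits):
    """Breadth-first growth: split nodes level by level, left to right."""
    queue = deque([1])
    order = []
    while queue and len(order) < num_splits:
        node = queue.popleft()
        order.append(node)
        queue.extend(children(node))
    return order


def total_gain(order):
    """Sum the loss reduction of all splits that were made."""
    return sum(SPLIT_GAIN[node] for node in order)


leaf_order = grow_leaf_wise(3)
level_order = grow_level_wise(3)
print(leaf_order, total_gain(leaf_order))    # [1, 2, 4] 21.0
print(level_order, total_gain(level_order))  # [1, 2, 3] 17.0
```

With the same budget of three splits, the leaf-wise order follows the most promising branch and removes more loss in this toy setting, at the price of a deeper, less balanced tree.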

The main difference between LightGBM and other boosting algorithms is illustrated in the following figures:
[Figures: how LightGBM works; how other boosting algorithms work]

LightGBM overfits small data sets easily, so it is not recommended for them. As a rule of thumb, use it only on data with more than 10,000 rows.

Using LightGBM is actually simple; the only complicated part is parameter tuning. LightGBM has more than 100 parameters, but don't worry, you don't need to learn all of them.

Parameter explanation and tuning

Control parameters

num_leaves: The maximum number of leaf nodes in a tree. In theory, a tree of depth max_depth can have up to 2^(max_depth) leaves, but that is not a good setting for LightGBM: keep num_leaves below 2^(max_depth), otherwise the model may overfit.

max_depth: The maximum depth of the tree. This parameter is used to deal with overfitting; whenever you feel that your model is overfitted, first consider reducing the maximum depth.

min_data_in_leaf: The minimum number of records a leaf node may contain. The default is 20. It is also used to deal with overfitting.

early_stopping_round: This parameter can speed up training. If a metric on the validation data has not improved in the last early_stopping_round rounds, training stops, which reduces the number of iterations.

lambda: Regularization strength (set through lambda_l1 and lambda_l2 in LightGBM). Typical values range from 0 to 1.
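
Two of the rules above are easy to express in plain Python. This is only a sketch: num_leaves_is_safe and stopping_round are hypothetical helper names, the validation losses are made up, and the early-stopping logic is a simplified illustration of the rule, not LightGBM's actual implementation.

```python
def num_leaves_is_safe(num_leaves, max_depth):
    """Check the rule of thumb num_leaves < 2^(max_depth)."""
    return num_leaves < 2 ** max_depth


def stopping_round(valid_losses, early_stopping_round):
    """Return (iteration where training stops, best iteration), both 1-based.

    Training stops once the validation loss has not improved for
    `early_stopping_round` consecutive iterations.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= early_stopping_round:
            return i, best_iter
    # Never triggered: training ran for all available iterations.
    return len(valid_losses), best_iter


print(num_leaves_is_safe(31, 5))  # True, because 31 < 32
print(num_leaves_is_safe(64, 5))  # False, because 64 >= 32

# Made-up validation losses, one per boosting iteration.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]
print(stopping_round(losses, early_stopping_round=3))  # (9, 6)
```

In the example, the best loss is reached at iteration 6 and the following three iterations bring no improvement, so training would stop at iteration 9 and the model from iteration 6 would be kept.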

Core parameters

application: This is the most important parameter. It specifies the task the model is used for, such as regression or classification (it is an alias of objective). LightGBM treats the task as regression by default.

  • regression: for regression problems
  • binary: for binary classification
  • multiclass: for multiclass classification

boosting: The type of algorithm to run. The default is gbdt.

  • gbdt: traditional Gradient Boosting Decision Tree
  • rf: random forest
  • dart: Dropouts meet Multiple Additive Regression Trees
  • goss: Gradient-based One-Side Sampling

num_boost_round: The number of boosting iterations, usually 100 or more.

learning_rate: This determines the impact of each tree on the final result. GBMs start with an initial estimate and then use the output of each tree to update that estimate. The learning rate controls the magnitude of each update. Typical values are 0.1, 0.001, 0.003…
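
The effect of the learning rate on these updates can be sketched in a few lines. This is a deliberately idealized illustration, not LightGBM code: boosted_estimates is a hypothetical helper, there is a single target value, and each "tree" is assumed to output the current residual exactly.

```python
def boosted_estimates(target, initial_estimate, learning_rate, num_rounds):
    """Track the estimate after each boosting round.

    Each round adds learning_rate * (tree output) to the estimate, where the
    idealized tree output is the remaining residual.
    """
    estimate = initial_estimate
    history = [estimate]
    for _ in range(num_rounds):
        tree_output = target - estimate          # residual fitted by the tree
        estimate += learning_rate * tree_output  # shrunken update
        history.append(estimate)
    return history


# A large learning rate approaches the target quickly ...
print(boosted_estimates(10.0, 0.0, learning_rate=0.5, num_rounds=3))
# ... while a small one takes many more rounds to get close.
print(boosted_estimates(10.0, 0.0, learning_rate=0.1, num_rounds=3))
```

Under these assumptions the residual shrinks by a factor of (1 - learning_rate) per round, which is why a smaller learning rate usually has to be paired with a larger num_boost_round.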

device: The device used for training. The default is cpu; you can also pass gpu.

Metric parameters

metric: Also an important parameter, because it specifies how the model's loss is evaluated. Below are a few common loss functions for regression and classification.

  • mae: mean absolute error
  • mse: mean squared error
  • binary_logloss: loss for binary classification
  • multi_logloss: loss for multiclass classification
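
For reference, the first three metrics can be written out in plain Python. This is a minimal sketch using the standard textbook definitions; the function names simply mirror the metric names above, and the clipping constant eps is an assumption added to avoid log(0).

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_logloss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


print(mae([1, 2, 3], [1, 2, 5]))
print(mse([1, 2, 3], [1, 2, 5]))
print(binary_logloss([1, 0], [0.5, 0.5]))
```

multi_logloss generalizes binary_logloss by averaging the negative log of the probability assigned to the true class.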

Install LightGBM

Installing the CPU version of LightGBM is simple: it can be installed with pip. The GPU version takes more steps and requires installing CUDA, Boost, CMake, MS Build or Visual Studio, and MinGW.

Installation differs between platforms; see the official documentation for details:
https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html

Here is a simpler approach. If you already have Anaconda installed, installing LightGBM through it is much more convenient and takes a single command:

conda install -c conda-forge lightgbm
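
After installing, a quick check from Python can confirm that the package is visible in the current environment. This is a small sketch; is_installed is a hypothetical helper built on the standard library, and the snippet runs whether or not LightGBM is present.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("lightgbm"):
    import lightgbm
    print("LightGBM version:", lightgbm.__version__)
else:
    print("LightGBM is not installed in this environment")
```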

Use LightGBM

Data set

The data set used in this article contains personal information about people from different countries. Our goal is to predict, from the other available information, whether a person earns <=50k or >50k a year. The data set consists of 32561 training records and 14 features. It can be downloaded from:

http://archive.ics.uci.edu/ml/datasets/Adult

Preprocessing

#importing standard libraries 
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame 

#import lightgbm
import lightgbm as lgb 

#loading our training dataset 
data=pd.read_csv("../input/train_data.csv",header=None) 

#Assigning names to the columns 
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income'] 

# Label-encode the target variable 'Income'
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit(data.Income)
data.Income=Series(le.transform(data.Income))
#One-hot encode the categorical features
one_hot_workclass=pd.get_dummies(data.workclass) 
one_hot_education=pd.get_dummies(data.education) 
one_hot_marital_Status=pd.get_dummies(data.marital_Status) 
one_hot_occupation=pd.get_dummies(data.occupation)
one_hot_relationship=pd.get_dummies(data.relationship) 
one_hot_race=pd.get_dummies(data.race) 
one_hot_sex=pd.get_dummies(data.sex) 
one_hot_native_country=pd.get_dummies(data.native_country)

#removing categorical features 
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)
#Merging one hot encoded features with our dataset 'data' 
data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1) 
data.head()

#removing duplicate columns (dummy columns from different features can share a name; keep the first of each)
_, i = np.unique(data.columns, return_index=True)
data=data.iloc[:, i]

#Here our target variable is 'Income' with values as 1 or 0.  
#Separating our data into features dataset x and our target dataset y 
x=data.drop('Income',axis=1) 
y=data.Income 

#Imputing missing values: fill missing entries with the mode (the most frequent value)
y=y.fillna(y.mode()[0])

#Now splitting our dataset into test and train 
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
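
The one-hot encoding above creates one frame per column and then has to drop duplicate column names afterwards. A more compact alternative, sketched here on a made-up three-row frame (the values are illustrative, not taken from the real data), is to let pd.get_dummies encode several columns at once. It prefixes each dummy column with the name of its source column, so identical category labels in different columns no longer collide.

```python
import pandas as pd

# Toy frame with two categorical columns that share the category label "?".
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "?"],
})

# Encode both columns in one call; dummy columns are named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["workclass", "occupation"])

print(list(encoded.columns))
print(encoded.columns.is_unique)
```

Because the column names are unique, the step that removes duplicate columns would not be needed with this approach, and the information in each source column is kept separate.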

Model building and training

We need to convert the training data into the Dataset format that LightGBM supports. After that we create a Python dictionary of parameters and their values. The accuracy of the model depends heavily on the values chosen for these parameters.

import lightgbm as lgb
# LightGBM: wrap the training data in a Dataset
train_data=lgb.Dataset(x_train,label=y_train)
#setting parameters for lightgbm
param = {
    'num_leaves':150,
    'objective':'binary',
    'max_depth':7,
    'learning_rate':0.05,
    'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']

#training our model using light gbm
num_round=50
from datetime import datetime 
start=datetime.now()
lgbm=lgb.train(param,train_data,num_round)
stop=datetime.now()

#Execution time of the model
execution_time_lgbm = stop-start
execution_time_lgbm
#a datetime.timedelta, displayed as (days, seconds, microseconds)

A brief explanation of the parameters:

  • objective is set to binary because this is a binary classification problem
  • metric uses auc and binary_logloss (binary logarithmic loss)
  • boosting is left at its default, gbdt (you can also try random forest)

Model prediction

The output is a list of probabilities, which I convert into class labels for binary classification using a threshold of 0.5.

#Prediction
#predict on the test set; the output is the probability of the positive class
ypred=lgbm.predict(x_test)
#convert the probabilities into 0 or 1 with a threshold of 0.5
ypred=(ypred>=0.5).astype(int)

Model evaluation

We can check the results directly by computing the accuracy, or evaluate the model with the ROC AUC score.

#calculating accuracy of our model
from sklearn.metrics import accuracy_score
accuracy_lgbm = accuracy_score(y_test,ypred)
accuracy_lgbm # 0.8624219469751254

from sklearn.metrics import roc_auc_score
#calculating roc_auc_score for LightGBM
#note: ypred holds thresholded 0/1 labels here, not probabilities
auc_lgbm =  roc_auc_score(y_test,ypred)
auc_lgbm # 0.7629644010391523
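
One caveat: the AUC above is computed on ypred after it has been thresholded to 0 and 1, so the ranking information in the predicted probabilities is lost. The pure-Python sketch below shows the effect on made-up numbers; pairwise_auc is a hypothetical helper that uses the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half).

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
probs = [0.1, 0.6, 0.7, 0.9]
labels = [1 if p >= 0.5 else 0 for p in probs]  # thresholded at 0.5

print(pairwise_auc(y_true, probs))   # 1.0
print(pairwise_auc(y_true, labels))  # 0.75
```

Every positive example is ranked above every negative one, so the AUC on probabilities is perfect, yet thresholding turns one negative into a tie with the positives and lowers the score. Passing the raw output of predict to roc_auc_score, before thresholding, avoids this.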

LightGBM has shown good results here and is claimed to compare favorably with existing boosting algorithms. You can make a detailed comparison between LightGBM and other algorithms (such as XGBoost) to see the difference for yourself. Keep in mind that, as with other machine learning algorithms, you need to tune the parameters properly before training the model.

References

https://www.zhihu.com/question/51644470
https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

Origin blog.csdn.net/lomodays207/article/details/88045852