LightGBM: a detailed introduction to the gradient boosting machine algorithm (with code)

LightGBM: Gradient Boosting Machine Algorithm

Foreword

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It can be used for ranking, classification, regression and many other machine learning tasks.

XGBoost is very popular in machine learning competitions. It is an excellent boosting framework, but in practice its training takes a long time and consumes a lot of memory. In January 2017, Microsoft open sourced a new boosting tool, LightGBM, on GitHub. Without reducing accuracy, training is about 10 times faster and memory usage is about 3 times lower. LightGBM is based on decision trees and splits leaf nodes with a best-first, leaf-wise strategy, whereas other boosting algorithms generally grow trees depth-wise (level-wise). When growing a tree to the same number of leaves, leaf-wise growth reduces the loss more than level-wise growth, which can yield higher accuracy than other existing boosting algorithms. Its speed is also striking, which is the reason for the "Light" in the algorithm's name.

  • In March 2014, XGBoost was first proposed as a research project by Chen Tianqi (XGBoost is covered in my other blog post).

  • In January 2017, Microsoft released the first stable version of LightGBM. In "Introduction to LightGBM", shared in the AI Headlines series of Microsoft Research Asia, Wang Taifeng, the lead researcher of the machine learning group, mentioned that after the Microsoft DMTK team open sourced LightGBM on GitHub, a decision tree tool that outperforms other boosting tools, it was starred more than 1,000 times and forked more than 200 times within three days. Nearly a thousand people on Zhihu followed the question "What do you think of Microsoft's open source LightGBM?", and it was praised for "amazing speed", "very inspiring", "distributed support", "clear and easy-to-understand code", "small memory footprint", and so on. The following are the advantages of LightGBM officially listed by Microsoft, as well as the open source address of the project.

 

1. "What We Do in LightGBM?"

The following table gives a more detailed comparison between XGBoost and LightGBM, including how the trees are grown. LightGBM directly selects the node with the maximum gain to expand, while XGBoost grows the tree level by level. In this way, LightGBM can build the decision tree we need at a smaller computational cost. Of course, with such an algorithm we also need to control the depth of the tree and the minimum amount of data in each leaf node in order to reduce overfitting.

The table is my own rough translation; please point out any problems.

| | **XGBoost** | **LightGBM** |
|---|---|---|
| Tree growth algorithm | **Level-wise (layer-by-layer) growth**: good for engineering optimization, but not efficient for learning the model | Leaf-wise growth: directly **selects the node with the greatest gain** to expand, building the decision tree we need at a smaller computational cost; controlling the depth of the tree and the amount of data in each leaf node reduces overfitting |
| Split point search algorithm | Pre-sorting of feature values | Histogram algorithm: feature values are divided into many small buckets and split points are searched over the buckets, which reduces computation and storage cost and gives better performance; the change of data structure also leads to efficiency differences in the details |
| Memory overhead | 8 bytes | 1 byte |
| Split gain computed over | Data features | Container (bin) features |
| Cache optimization | None | 40% faster on the Higgs dataset |
| Categorical feature handling | None | 8x faster on the Expo dataset |

 

2. Comparison on different data sets

Higgs and Expo are classification datasets, while Yahoo LTR and MS LTR are ranking datasets. On all of them, LightGBM achieves better accuracy and lower memory usage.

[Figure: accuracy comparison]

[Figure: memory usage comparison]

In terms of training speed, XGBoost usually takes several times longer than LightGBM to complete the same amount of training. On the Higgs dataset, the gap is more than 15 times.

3. Technical details of LightGBM

1. Histogram optimization

XGBoost adopts the pre-sorting method. The values of each feature are sorted, and the split gain is computed at every data sample one by one. This algorithm can find the optimal split value exactly, but it is expensive and does not generalize particularly well.

LightGBM does not use the traditional pre-sorting idea. Instead, the precise, continuous values are divided into a series of discrete ranges, that is, into bins. Taking floating-point data as an example, all values within an interval are placed in the same bucket, and a histogram is then built with these buckets as the unit of precision. The representation of the data becomes simpler, memory usage is reduced, and the histogram brings a certain regularization effect, which helps the model avoid overfitting and generalize better.

Look at the details of histogram optimization:

The histogram is indexed by bin, so there is no need to sort the data by each feature, and there is no need to compare individual feature values one by one, which greatly reduces the amount of computation.
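The idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not LightGBM's actual implementation: the helper names (`to_bins`, `build_histogram`, `best_split`), the equal-width bucketing, and the second-order gain formula with regularization term `lam` are all chosen for the example.

```python
import numpy as np

def to_bins(values, n_bins):
    """Map continuous values to integer bin indices using equal-width buckets."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use interior edges only, so indices fall in 0 .. n_bins - 1.
    return np.digitize(values, edges[1:-1])

def build_histogram(bin_idx, grad, hess, n_bins):
    """Sum gradients and hessians per bin in a single pass over the data."""
    g_hist = np.bincount(bin_idx, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bin_idx, weights=hess, minlength=n_bins)
    return g_hist, h_hist

def best_split(g_hist, h_hist, lam=1.0):
    """Scan bin boundaries (not samples) and return (bin, gain) of the best split."""
    g_total, h_total = g_hist.sum(), h_hist.sum()
    g_left = np.cumsum(g_hist)[:-1]   # split after bin 0, 1, ..., n_bins - 2
    h_left = np.cumsum(h_hist)[:-1]
    g_right, h_right = g_total - g_left, h_total - h_left
    gain = (g_left ** 2 / (h_left + lam)
            + g_right ** 2 / (h_right + lam)
            - g_total ** 2 / (h_total + lam))
    best = int(np.argmax(gain))
    return best, float(gain[best])

x = np.array([0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
grad = 0.5 - y            # squared-loss gradient for a constant prediction of 0.5
hess = np.ones_like(y)    # squared-loss hessian is 1

n_bins = 4
bin_idx = to_bins(x, n_bins)
g_hist, h_hist = build_histogram(bin_idx, grad, hess, n_bins)
split_bin, split_gain = best_split(g_hist, h_hist)
print(bin_idx, g_hist, split_bin, split_gain)
```

Note that the split search loops over `n_bins - 1` candidate boundaries regardless of how many samples there are; only building the histogram touches every row.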

2. Storage memory optimization

Describing features by their bins brings two changes. First, there is no need to store the sorted order of each feature as the pre-sorting algorithm does (the gray table in the figure below); in LightGBM this cost is 0. Second, the number of bins is generally kept within a relatively small range, so a smaller data type can be used to store the bin indices.
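As a rough illustration of the second point, the sketch below compares the storage of raw `float64` values with `uint8` bin indices. The quantile-based bucketing and the choice of 255 bins are assumptions made for the example, not a description of LightGBM's internal binning.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=100_000)          # float64 feature values, 8 bytes each

# 254 interior quantile edges give at most 255 bins, which fits in one byte.
edges = np.quantile(raw, np.linspace(0.0, 1.0, 256)[1:-1])
binned = np.digitize(raw, edges).astype(np.uint8)

print(raw.nbytes, binned.nbytes)        # bytes used by each representation
```

With these types the binned column needs one eighth of the memory of the raw column, in line with the "8 bytes" versus "1 byte" row of the comparison table.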

3. Depth-limited node expansion method

LightGBM uses leaf-wise growth with a depth limit to improve model accuracy. This is more efficient than the level-wise growth used in XGBoost: for the same number of leaves it reduces the training error more and gives better accuracy. However, plain leaf-wise growth may produce a much deeper tree, which can cause overfitting on a small dataset, so LightGBM adds a maximum depth limit on top of leaf-wise growth.
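In practice this trade-off is controlled through training parameters. The fragment below is a minimal sketch: `num_leaves`, `max_depth` and `min_data_in_leaf` are LightGBM parameter names, while the values are hypothetical and should be tuned for your own data.

```python
# Hypothetical values for illustration; tune them for your own data.
max_depth = 6
params = {
    "num_leaves": 31,         # cap on the number of leaves per tree (leaf-wise growth)
    "max_depth": max_depth,   # depth limit added on top of leaf-wise growth
    "min_data_in_leaf": 20,   # minimum number of samples allowed in a leaf
}

# A binary tree of depth d has at most 2**d leaves, so a larger num_leaves
# could never be reached under this depth limit.
assert params["num_leaves"] <= 2 ** max_depth
```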

4. Histogram difference optimization

Histogram subtraction can roughly double the speed. The histogram of a leaf node can be obtained by subtracting the histogram of its sibling from the histogram of its parent. Using this property, we build the histogram only for the child with the smaller amount of data, and then obtain the histogram of the child with the larger amount of data by subtraction, which speeds up training.
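The subtraction property can be checked with a small NumPy sketch. The data, the random split mask `go_left` and the helper `histogram` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 8
bin_idx = rng.integers(0, n_bins, size=1000)   # pre-binned feature values
grad = rng.normal(size=1000)                   # per-sample gradients
go_left = rng.random(1000) < 0.2               # hypothetical split: about 20% go left

def histogram(mask):
    """Per-bin gradient sums over the rows selected by mask."""
    return np.bincount(bin_idx[mask], weights=grad[mask], minlength=n_bins)

parent = histogram(np.ones(1000, dtype=bool))
small_child = histogram(go_left)          # built by scanning only the smaller child
large_child = parent - small_child        # obtained by subtraction, with no data scan

# Same result as building the histogram directly from the larger child's rows.
assert np.allclose(large_child, histogram(~go_left))
```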

5. Sequential access gradient

Two frequent operations in the pre-sorting algorithm cause cache misses, which have a great impact on speed: especially when the amount of data is large, sequential access is more than 4 times faster than random access.

  • Access to gradients: gradients are needed when computing gains. For different features, the order in which the gradients are accessed is different and random.

  • Access to the index table: the pre-sorting algorithm uses an index table from row numbers to leaf node numbers, so that not every feature has to be re-partitioned when the data is split. As with gradients, every feature has to look up this index table.

Both operations are random accesses, which cause a very large drop in system performance.

The histogram algorithm used by LightGBM solves this problem well. First, for gradient access: features do not need to be sorted and all features are accessed in the same order, so the gradients only need to be reordered once, after which every feature can access them contiguously. Second, the histogram algorithm does not need to map data ids to leaf node numbers, so the index table is not needed and the corresponding cache misses go away.

6. Support category features

Traditional machine learning tools generally cannot take categorical features as direct input; they first have to be converted into multi-dimensional 0/1 (one-hot) features, which is inefficient in both space and time. By changing the decision rules of the decision tree algorithm, LightGBM supports categorical features natively without conversion, which increases speed by nearly 8 times.
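The difference in representation can be seen in a small NumPy sketch. The `colors` column is made-up example data; in LightGBM's Python API, an integer-coded column like `codes` can be declared through the `categorical_feature` argument of `lgb.Dataset` instead of being expanded.

```python
import numpy as np

colors = np.array(["red", "green", "blue", "green", "red", "yellow"])

# Integer coding: a single column holding one small integer per row.
categories, codes = np.unique(colors, return_inverse=True)

# One-hot coding: one 0/1 column for every distinct category.
one_hot = np.eye(len(categories), dtype=np.uint8)[codes]

print(codes.shape, one_hot.shape)
```

The one-hot matrix grows with the number of distinct categories, while the integer-coded column stays a single column.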

7. Support parallel learning

LightGBM natively supports parallel learning. It currently supports feature parallelization (Feature Parallelization) and data parallelization (Data Parallelization), as well as a voting-based form of data parallelization (Voting Parallelization).

  • The main idea of feature parallelism is to find the optimal split points on different machines over different feature subsets, and then synchronize the optimal split point between machines.

  • Data parallelism lets different machines build histograms locally first, then merges them globally, and finally finds the optimal split point on the merged histogram.

LightGBM is optimized for both parallel methods.

  • In the feature parallel algorithm, communicating the data split results is avoided by keeping all data locally on every machine.

  • In data parallelism, reduce-scatter is used to distribute the task of merging histograms across machines, which reduces communication and computation, and histogram subtraction further cuts the communication traffic in half.

  • **Voting Parallelization** further optimizes the communication cost of data parallelization so that it becomes constant. When the amount of data is large, voting parallelism gives a very good speedup.

The following figure illustrates the overall process of the three kinds of parallel learning:

When merging histograms, the communication cost is relatively high, and data parallelism based on voting can solve this problem very well.
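The core of data parallelism, local histograms merged into a global one, can be simulated on a single machine. The sketch below is only an illustration: the "workers" are slices of one array, and a plain sum stands in for the reduce-scatter step used in a real cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_workers = 8, 4
bin_idx = rng.integers(0, n_bins, size=1000)   # pre-binned feature values
grad = rng.normal(size=1000)                   # per-sample gradients

# Each simulated worker holds a horizontal slice of the rows.
shards = np.array_split(np.arange(1000), n_workers)
local_hists = [
    np.bincount(bin_idx[rows], weights=grad[rows], minlength=n_bins)
    for rows in shards
]

# Merging is an element-wise sum, so only n_bins numbers per feature
# need to be exchanged instead of the raw rows.
merged = np.sum(local_hists, axis=0)

# The merged histogram equals the one built from all rows at once.
global_hist = np.bincount(bin_idx, weights=grad, minlength=n_bins)
assert np.allclose(merged, global_hist)
```

In LightGBM itself the parallel mode is selected with the `tree_learner` parameter (`serial`, `feature`, `data` or `voting`).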

4. Install LightGBM on macOS

# Install cmake and gcc first; skip these two steps if they are already installed
brew install cmake
brew install gcc

git clone --recursive https://github.com/Microsoft/LightGBM 
cd LightGBM

# Set the compiler environment variables before running cmake
export CXX=g++-7 CC=gcc-7
mkdir build ; cd build

cmake ..
make -j4
cd ../python-package
sudo python setup.py install

Let's test it out:

You're done!

It is worth noting that lightgbm does not appear in the pip list, so in the future you need to run Python from a specific folder to use lightgbm. My path is:

/Users/fengxianhe/LightGBM/python-package

 

5. Implementing the LightGBM algorithm in Python

To demonstrate how to use LightGBM in Python, the following code takes the iris dataset bundled with sklearn as an example and uses LightGBM to predict the iris species. Note that the script fits the numeric class labels with a regression objective and reports the root mean square error.

# coding: utf-8
# pylint: disable = invalid-name, C0111

# For more usage, see the official LightGBM documentation: http://lightgbm.readthedocs.io/en/latest/Python-Intro.html

import lightgbm as lgb
import pandas as pd  # only needed if you load your own data below
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()   # load the iris dataset
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)


# Load your own data
# print('Load data...')
# df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
# df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')
#
# y_train = df_train[0].values
# y_test = df_test[0].values
# X_train = df_train.drop(0, axis=1).values
# X_test = df_test.drop(0, axis=1).values

# Create datasets in LightGBM's own format
lgb_train = lgb.Dataset(X_train, y_train)  # saving a Dataset to a LightGBM binary file makes loading faster
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # validation data

# Write the parameters as a dictionary
params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # boosting type
    'objective': 'regression',  # objective function (the labels 0/1/2 are fitted as numbers)
    'metric': 'l2',  # evaluation metric ('auc' dropped: it is a binary classification metric)
    'num_leaves': 31,   # number of leaves per tree
    'learning_rate': 0.05,  # learning rate
    'feature_fraction': 0.9,  # fraction of features sampled when building a tree
    'bagging_fraction': 0.8,  # fraction of samples used when building a tree
    'bagging_freq': 5,  # k means bagging is performed every k iterations
    'verbose': 1  # <0: fatal only, =0: errors (warnings), >0: info
}

print('Start training...')
# Train with a parameter dict and a dataset; early stopping is passed as a
# callback because recent LightGBM versions no longer accept early_stopping_rounds here
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

print('Save model...')

gbm.save_model('model.txt')   # save the trained model to a file

print('Start predicting...')
# Predict on the test set; with early stopping enabled, best_iteration
# gives predictions from the best iteration
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# Evaluate: root mean square error between the true and predicted values
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

 Output result:

It can be seen that the root mean square error between the predicted value and the true value is 0.722972.



Origin blog.csdn.net/weixin_64338372/article/details/130118562