"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.
LightGBM
In terms of performance within the GBDT family, XGBoost is already very efficient, but it is not without flaws. LightGBM is an improved version that targets XGBoost's weaknesses. It makes the GBDT algorithm family lighter and more efficient, so that it is both fast and accurate.
Starting from the places where XGBoost can be optimized, this chapter introduces the basic principles of LightGBM, including the histogram algorithm, gradient-based one-side sampling, the exclusive feature bundling algorithm, and the leaf-wise growth strategy.
1 Where XGBoost can be optimized
XGBoost uses a pre-sorting algorithm to find the optimal split point of each feature. Although pre-sorting locates split points exactly, it consumes too much memory, and with a large amount of data and many features it seriously degrades performance. The complexity of XGBoost's search for the optimal split point can be estimated as:
$$\text{complexity} = \text{number of features} \times \text{number of feature split points} \times \text{sample size}$$
Since the complexity of XGBoost is determined by the number of features, the number of feature split points, and the sample size, LightGBM's optimizations are also approached from these three directions.
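To make the estimate concrete, here is a small arithmetic sketch. The function name `split_search_cost` and all sizes are hypothetical, chosen only to show that shrinking any one of the three factors shrinks the estimate proportionally.

```python
def split_search_cost(n_features, n_split_points, n_samples):
    """Rough cost estimate: features x split points per feature x samples."""
    return n_features * n_split_points * n_samples


# Hypothetical sizes, used only to show how each factor scales the estimate.
base = split_search_cost(n_features=100, n_split_points=1000, n_samples=10_000)
fewer_splits = split_search_cost(100, 250, 10_000)     # fewer candidate split points
fewer_samples = split_search_cost(100, 1000, 2_500)    # fewer samples
fewer_features = split_search_cost(25, 1000, 10_000)   # fewer features

print(base, fewer_splits, fewer_samples, fewer_features)
```

Cutting any single factor to a quarter cuts the estimate to a quarter, which is why each of the three factors is a separate optimization target.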
2 Basic Principles of LightGBM
LightGBM stands for light gradient boosting machine. It is a top-tier boosting framework open-sourced by Microsoft in 2017. Like XGBoost, LightGBM is an engineering implementation of the GBDT algorithm framework, but it is faster and more efficient.
2.1 Histogram algorithm
To reduce the number of feature split points and find the optimal split point more efficiently, LightGBM replaces XGBoost's pre-sorting algorithm with a histogram algorithm. The main idea is to discretize continuous floating-point feature values into k integers and build a histogram with k bins. While traversing the data of each feature, the discretized value is used as an index to accumulate statistics in the histogram. After one pass over the data, the histogram holds the required statistics, and the optimal split point is then found by scanning the histogram.
In essence, the histogram is a data discretization and binning operation. Although it is not a particularly novel design, it is fast and performs well, and it greatly reduces both computation cost and memory usage.
Another benefit of histograms is speedup by subtraction. The histogram of a leaf node can be obtained by subtracting the histogram of its sibling node from the histogram of its parent node, which further speeds up node splitting.
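The binning, accumulation, split search, and histogram subtraction described above can be sketched in plain Python. This is a minimal illustration, not LightGBM's actual implementation. It assumes equal-width bins and a squared-error-style gain with unit Hessians, so only gradient sums and counts are stored. The function names and the toy data are hypothetical.

```python
def bin_feature(values, k):
    """Discretize continuous values into integer bin indices 0..k-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 for a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]


def build_histogram(bins, grads, k, rows):
    """Accumulate [gradient sum, sample count] per bin over the given rows."""
    hist = [[0.0, 0] for _ in range(k)]
    for i in rows:
        hist[bins[i]][0] += grads[i]
        hist[bins[i]][1] += 1
    return hist


def subtract_histogram(parent, sibling):
    """Histogram of a leaf = parent histogram minus sibling histogram."""
    return [[pg - sg, pc - sc] for (pg, pc), (sg, sc) in zip(parent, sibling)]


def best_split(hist):
    """Scan the bin boundaries and return (best_gain, bin_index).

    Samples with bin <= bin_index go left. The gain is
    G_L^2 / n_L + G_R^2 / n_R - G^2 / n (unit Hessians assumed).
    """
    total_g = sum(g for g, _ in hist)
    total_n = sum(c for _, c in hist)
    best = (0.0, None)
    left_g, left_n = 0.0, 0
    for b in range(len(hist) - 1):
        left_g += hist[b][0]
        left_n += hist[b][1]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best[0]:
            best = (gain, b)
    return best


# Toy data (made up): one feature and per-sample gradients.
feature = [0.1, 0.4, 0.5, 1.2, 2.3, 2.9, 3.1, 3.8]
grads = [-1.0, -0.8, -1.1, -0.9, 1.0, 1.2, 0.9, 1.1]
k = 4

bins = bin_feature(feature, k)
parent = build_histogram(bins, grads, k, range(len(feature)))
gain, split_bin = best_split(parent)

left_rows = [i for i in range(len(feature)) if bins[i] <= split_bin]
right_rows = [i for i in range(len(feature)) if bins[i] > split_bin]
left = build_histogram(bins, grads, k, left_rows)
right = subtract_histogram(parent, left)  # no second pass over the data
```

The split search only looks at k - 1 bin boundaries instead of every distinct feature value, and the right child's histogram comes from a subtraction over k bins rather than another pass over its samples.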
2.2 Gradient-based one-side sampling
The gradient-based one-side sampling (GOSS) algorithm is an optimization designed from the perspective of reducing the number of samples, and it is one of the core principles of LightGBM.
The main idea of GOSS is to drop most of the low-weight samples (those with small gradients) during training and to compute the information gain only on the remaining samples.
In the AdaBoost algorithm of Chapter 10, a key element is the sample weight: the best classification result is reached by continually adjusting the sample weights during training. The GBDT family, however, has no sample-weight design. In GBDT, the notion of weight is replaced by the sample gradient. Generally speaking, samples with small gradients have small empirical errors, which indicates that this part of the data is already well trained. The natural idea for GBDT is to discard these samples in the next round of residual fitting, but doing so may change the data distribution of the training samples and hurt the final training accuracy.
LightGBM proposes the GOSS sampling algorithm. Its purpose is to retain as many of the samples that help with computing information gain as possible while speeding up model training. The basic method of GOSS is as follows, with a and b written as fractions. First, sort the samples in descending order of the absolute value of their gradients and keep the top a × 100% of the data with the largest absolute values. Assuming the sample size is n, then randomly select b × n samples from the remaining (1-a) × 100% of the data, and multiply these sampled small-gradient samples by the constant (1-a)/b. Multiplying the small-gradient samples by this weight coefficient pulls the sample distribution back toward the original as much as possible. This lets the algorithm focus more on under-trained samples without changing the original data distribution much. Finally, the selected (a+b) × 100% of the data is used to compute the information gain of the feature.
The GOSS algorithm optimizes GBDT mainly from the perspective of reducing samples. By discarding samples with small gradients, it speeds up model training without losing too much accuracy, which is one of the reasons LightGBM is faster.
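The sampling procedure can be sketched as follows. This is a minimal illustration in plain Python, not LightGBM's implementation. The function name `goss_sample`, the `seed` parameter, and the toy gradients are hypothetical, and `a` and `b` are taken as fractions of the total sample size n.

```python
import random


def goss_sample(grads, a, b, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling.

    a: fraction of samples kept for having the largest |gradient|
    b: fraction of all samples drawn at random from the remainder
    """
    n = len(grads)
    top_n = round(a * n)
    rand_n = round(b * n)

    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    top = order[:top_n]
    rest = order[top_n:]

    rng = random.Random(seed)
    sampled = rng.sample(rest, rand_n)

    # Large-gradient samples keep weight 1; sampled small-gradient samples are
    # scaled by (1 - a) / b to compensate for the ones that were dropped.
    weights = {i: 1.0 for i in top}
    weights.update({i: (1 - a) / b for i in sampled})
    indices = top + sampled
    return indices, [weights[i] for i in indices]


# Toy gradients (made up): 10 samples, keep the top 20% and sample 30% of n.
grads = [0.05, -2.0, 0.1, 1.5, -0.02, 0.3, -0.4, 0.01, 0.2, -0.15]
idx, w = goss_sample(grads, a=0.2, b=0.3)
print(idx, w)
```

With these settings only half of the samples are used, yet the weights sum to n again (2 × 1 + 3 × 0.8/0.3 = 10), which is how the (1-a)/b factor keeps the gradient statistics close to those of the full data.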
2.3 Mutually exclusive feature bundling algorithm
The histogram algorithm optimizes the number of feature split points, and gradient-based one-side sampling optimizes the sample size. What remains is to optimize the number of features.
The exclusive feature bundling (EFB) algorithm speeds up model training by bundling two mutually exclusive features into a single feature, which reduces the number of features without losing feature information. Most of the time, two features are not completely mutually exclusive. In that case a conflict ratio can be defined to measure how far the features are from being mutually exclusive. When the conflict ratio is low, two features that are not completely mutually exclusive can still be bundled without much impact on the final model accuracy.
Feature mutual exclusion means that two features are never non-zero at the same time, which is somewhat similar to the one-hot representation of categorical features. The exclusive feature bundling algorithm has two key issues: how to decide which features to bundle, and how to bundle them, that is, how the bundled feature takes its values.
For the first problem, the EFB algorithm converts it into a graph coloring problem. The basic idea is to treat all features as vertices of a graph and to connect two features that are not mutually exclusive with an edge, where the edge weight represents the conflict ratio of the two connected features. The features to be bundled together are then the vertices (features) that receive the same color in the graph coloring problem.
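A greedy heuristic in the spirit of this graph coloring view can be sketched as follows. It is an illustration under stated assumptions, not LightGBM's implementation: conflicts are measured as the number of rows where both features are non-zero (a count rather than a ratio), features are visited in descending order of total conflict, and the names `conflict_count`, `greedy_bundles`, and `max_conflict` are hypothetical.

```python
def conflict_count(col_a, col_b):
    """Number of rows where both features are non-zero at the same time."""
    return sum(1 for x, y in zip(col_a, col_b) if x != 0 and y != 0)


def greedy_bundles(columns, max_conflict=0):
    """Group feature indices into bundles (colors) with a greedy heuristic.

    columns: list of feature columns, each a list of values
    max_conflict: largest total conflict count tolerated inside one bundle
    """
    m = len(columns)
    conflicts = [[conflict_count(columns[i], columns[j]) for j in range(m)]
                 for i in range(m)]
    # Visit the features with the most conflicts first (highest "degree").
    order = sorted(range(m),
                   key=lambda i: sum(conflicts[i][j] for j in range(m) if j != i),
                   reverse=True)

    bundles = []          # each bundle is a list of feature indices
    bundle_conflict = []  # running conflict count per bundle
    for f in order:
        for b, members in enumerate(bundles):
            extra = sum(conflicts[f][g] for g in members)
            if bundle_conflict[b] + extra <= max_conflict:
                members.append(f)
                bundle_conflict[b] += extra
                break
        else:
            # No existing bundle can take this feature: open a new one.
            bundles.append([f])
            bundle_conflict.append(0)
    return bundles


# Toy sparse features (made up): 0 and 1 never overlap, 2 overlaps with both.
columns = [
    [1, 0, 0, 2, 0, 0],  # feature 0
    [0, 3, 0, 0, 0, 1],  # feature 1
    [4, 5, 0, 0, 0, 0],  # feature 2
]
strict = greedy_bundles(columns)                   # no conflicts tolerated
relaxed = greedy_bundles(columns, max_conflict=1)  # one conflicting row tolerated
print(strict, relaxed)
```

With no conflicts tolerated, features 0 and 1 share a bundle and feature 2 stays alone. Raising `max_conflict` lets feature 2 absorb feature 0 instead, trading a small amount of information for different bundles.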
The second problem is how the bundled feature takes its values. The key is that the original features must remain separable from the merged feature: after bundling into a single feature, we can still recover the original features from the bundle. The EFB algorithm approaches this from the perspective of histograms. Specifically, it places the values of different features into different histogram bins within the bundle, which is done by adding an offset constant to the feature values.
As a simple example, suppose we want to bundle feature A and feature B, where the value range of feature A is [10, 20) and the value range of feature B is [10, 30). We can add an offset of 10 to the values of feature B, so that its range becomes [20, 40) and the range of the bundled feature becomes [10, 40). Features A and B can then be merged.
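The offset trick from this example can be sketched in a few lines. The sketch assumes that a value of 0 means "feature absent" on that row and that the two features never conflict. The function names `merge_bundle` and `split_bundle` and the `boundary` parameter are hypothetical, and the defaults mirror the ranges used in the example.

```python
def merge_bundle(value_a, value_b, offset_b=10):
    """Merge two mutually exclusive features into one bundled value.

    Feature A is stored unchanged; feature B is shifted by offset_b so that
    its range [10, 30) becomes [20, 40), starting where A's range ends.
    A result of 0 means that both features are zero on this row.
    """
    if value_a != 0 and value_b != 0:
        raise ValueError("features conflict on this row")
    if value_a != 0:
        return value_a
    if value_b != 0:
        return value_b + offset_b
    return 0


def split_bundle(bundled, boundary=20, offset_b=10):
    """Recover (value_a, value_b) from a bundled value."""
    if bundled == 0:
        return 0, 0
    if bundled < boundary:
        return bundled, 0
    return 0, bundled - offset_b


print(merge_bundle(15, 0))   # a row where only feature A is present
print(merge_bundle(0, 12))   # a row where only feature B is present
print(split_bundle(22))      # recover the original pair from the bundle
```

Because the shifted range of B starts at the upper bound of A, any bundled value below 20 can only have come from A and any value from 20 upward only from B, so no information is lost.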
2.4 Leaf-wise growth strategy
LightGBM also proposes a way of growing leaf nodes that differs from XGBoost's level-wise growth: a depth-limited leaf-wise decision tree growth method.
XGBoost's level-wise growth algorithm has the advantages that it can be optimized with multiple threads, makes it easy to control model complexity, and is less prone to overfitting. Its disadvantage is that all leaf nodes on the same level are treated indiscriminately, so many node splits and gain calculations are unnecessary, which produces extra computational overhead.
LightGBM's leaf-wise growth algorithm is more accurate and efficient, and it avoids this unnecessary computation. At the same time, to keep any single node from growing too deep, a depth limit can be added, which guards against overfitting while preserving accuracy.
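The difference between the two growth strategies can be sketched on a toy tree. This is an illustration only: in a real GBDT the split gains are computed from the data, whereas here they are given as a made-up table, nodes use heap numbering (root 0, children 2i+1 and 2i+2), and the function names are hypothetical. The parameter names `num_leaves` and `max_depth` are borrowed from LightGBM, but the logic is a simplified stand-in.

```python
import heapq


def depth(node):
    """Depth of a node in heap numbering (root 0 has children 1 and 2)."""
    d = 0
    while node > 0:
        node = (node - 1) // 2
        d += 1
    return d


def grow_leaf_wise(gains, num_leaves, max_depth):
    """Return the order in which nodes are split under leaf-wise growth.

    gains: dict mapping node id -> gain obtained by splitting that node
           (nodes missing from the dict cannot be split)
    """
    heap = [(-gains[0], 0)] if 0 in gains else []
    split_order, leaves = [], 1
    while heap and leaves < num_leaves:
        _, node = heapq.heappop(heap)  # current leaf with the largest gain
        split_order.append(node)
        leaves += 1                    # one leaf is replaced by two
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains and depth(child) < max_depth:
                heapq.heappush(heap, (-gains[child], child))
    return split_order


def grow_level_wise(gains, max_depth):
    """Return the split order when every splittable node of a level is split."""
    split_order, level = [], [0]
    for _ in range(max_depth):
        level = [n for n in level if n in gains]
        split_order.extend(level)
        level = [c for n in level for c in (2 * n + 1, 2 * n + 2)]
    return split_order


# Hypothetical split gains for a small tree (made up for illustration).
gains = {0: 10.0, 1: 6.0, 2: 0.5, 3: 4.0, 4: 0.2, 5: 0.1, 6: 0.3}

leaf_wise = grow_leaf_wise(gains, num_leaves=4, max_depth=3)
level_wise = grow_level_wise(gains, max_depth=3)
print(leaf_wise, level_wise)
```

In this toy table the leaf-wise order spends its three splits on the highest-gain nodes (total gain 20.0), while the first three level-wise splits include the low-gain node 2 (total gain 16.5). The `max_depth` check is what keeps a single branch from growing without bound.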
In addition to the four algorithmic improvements above, LightGBM also includes a number of engineering optimizations, such as direct support for categorical features (no one-hot encoding needed), efficient parallelism, and cache hit rate optimization.
3 LightGBM native library example
The Microsoft development team behind the open-source LightGBM project provides a native library implementation of the algorithm. The lightgbm library offers two major interfaces, classification and regression. The following uses a classification problem on the iris dataset as an example of the native lightgbm interface.
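import lightgbm as lgb
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

# Load the iris dataset and split it into training and test sets
iris = load_iris()
data, target = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=43)

gbm = lgb.LGBMClassifier(
    objective='multiclass',
    num_class=3,
    num_leaves=31,  # number of leaf nodes per tree, i.e. the complexity of each tree
    learning_rate=0.05,
    n_estimators=20
)
# Early stopping and per-iteration logging are passed as callbacks;
# the early_stopping_rounds argument of fit() was removed in LightGBM 4.0
gbm.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=5), lgb.log_evaluation(period=1)]
)
# Predict with the best iteration and report accuracy on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
print(accuracy_score(y_test, y_pred))
# Plot feature importance
lgb.plot_importance(gbm)
plt.show()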
[1] valid_0's multi_logloss: 1.02277
[2] valid_0's multi_logloss: 0.943765
[3] valid_0's multi_logloss: 0.873274
[4] valid_0's multi_logloss: 0.810478
[5] valid_0's multi_logloss: 0.752973
[6] valid_0's multi_logloss: 0.701621
[7] valid_0's multi_logloss: 0.654982
[8] valid_0's multi_logloss: 0.611268
[9] valid_0's multi_logloss: 0.572202
[10] valid_0's multi_logloss: 0.53541
[11] valid_0's multi_logloss: 0.502582
[12] valid_0's multi_logloss: 0.472856
[13] valid_0's multi_logloss: 0.443853
[14] valid_0's multi_logloss: 0.417764
[15] valid_0's multi_logloss: 0.393613
[16] valid_0's multi_logloss: 0.370679
[17] valid_0's multi_logloss: 0.349936
[18] valid_0's multi_logloss: 0.330669
[19] valid_0's multi_logloss: 0.312805
[20] valid_0's multi_logloss: 0.296973
1.0