"Machine Learning Formula Derivation and Code Implementation" chapter13-LightGBM

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

LightGBM

As far as the performance of the GBDT family of algorithms is concerned, XGBoost is already very efficient, but it is not without flaws. LightGBM is an improved version that targets XGBoost's weaknesses, making the GBDT algorithm framework lighter and more efficient while remaining fast and accurate. This chapter introduces the basic principles of LightGBM, covering the histogram algorithm, gradient-based one-side sampling, the exclusive feature bundling algorithm, and the leaf-wise growth strategy, each aimed at a place where XGBoost can be optimized.

1 Where XGBoost can be optimized

XGBoost uses a pre-sorting algorithm to find the optimal split point of each feature. Although pre-sorting can locate split points exactly, it takes up too much memory; with a large amount of data and many features, it seriously degrades algorithm performance. The complexity of XGBoost's search for the optimal split point can be estimated as:
complexity = number of features × number of feature split points × sample size
Since XGBoost's complexity is determined by the number of features, the number of feature split points, and the sample size, LightGBM's optimizations are also designed along these three directions.

2 Basic Principles of LightGBM

The full name of LightGBM is light gradient boosting machine, a top Boosting algorithm framework open-sourced by Microsoft in 2017. Like XGBoost, LightGBM is an engineering implementation of the GBDT algorithm framework, but it is faster and more efficient.

2.1 Histogram algorithm

To reduce the number of feature split points and to find the optimal split point more efficiently, LightGBM replaces XGBoost's pre-sorting algorithm with a histogram algorithm. The main idea is to discretize continuous floating-point feature values into k integers and build a histogram of width k. While traversing the feature values once, the discretized value is used as an index to accumulate statistics into the histogram; after this single pass, the histogram holds the accumulated statistics, and the optimal split point is then searched over the histogram bins.

The histogram algorithm is essentially a data discretization and binning operation. Although it is not a particularly novel optimization design, it is fast and performs well, and it greatly reduces both computation cost and memory usage.
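The following is a minimal sketch of histogram-based split finding for a single feature (illustrative only, not LightGBM's implementation; the bin count k, the regularization term, and the toy data are assumptions): the feature is bucketed into k bins, gradient and hessian statistics are accumulated per bin in one pass, and the k-1 bin boundaries are scanned for the best split.

import numpy as np

def best_split_by_histogram(feature, grad, hess, k=16, reg_lambda=1.0):
    # discretize the continuous feature values into k integer bins
    edges = np.linspace(feature.min(), feature.max(), k + 1)
    bin_idx = np.digitize(feature, edges[1:-1])             # values in 0 .. k-1

    # a single pass over the data accumulates the per-bin statistics
    g_hist = np.bincount(bin_idx, weights=grad, minlength=k)
    h_hist = np.bincount(bin_idx, weights=hess, minlength=k)

    def score(g, h):                                         # XGBoost-style structure score
        return g * g / (h + reg_lambda)

    g_total, h_total = g_hist.sum(), h_hist.sum()
    best_gain, best_bin = 0.0, None
    g_left = h_left = 0.0
    for b in range(k - 1):                                   # scan the k-1 bin boundaries
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=500)
g = np.where(x > 0.3, -1.0, 1.0) + rng.normal(scale=0.1, size=500)   # toy gradients
h = np.ones(500)                                                     # toy hessians
print(best_split_by_histogram(x, g, h))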
Another benefit of histograms is histogram subtraction for further speedup: the histogram of a leaf node can be obtained by subtracting the histogram of its sibling node from the histogram of its parent node, which further accelerates node splitting.
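A tiny illustration of the subtraction trick (toy numbers, assumed for illustration): once the parent's histogram and one child's histogram are known, the sibling's histogram follows by elementwise subtraction, so only the smaller child needs to be built from the raw data.

import numpy as np

parent_hist = np.array([4.0, 2.5, 1.0, 3.5])   # per-bin gradient sums of the parent node
left_hist   = np.array([1.0, 2.0, 0.5, 1.5])   # histogram built for the left child
right_hist  = parent_hist - left_hist           # right child obtained by subtraction
print(right_hist)                               # [3.  0.5 0.5 2. ]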

2.2 Gradient-based one-side sampling (GOSS)

Gradient-based one-side sampling (GOSS) is an algorithm designed to optimize LightGBM from the perspective of reducing the number of samples, and it is one of LightGBM's core components.

The main idea of GOSS is to drop most of the samples with small gradients during training and to compute the information gain using only the remaining samples.

In chapter 10, on AdaBoost, a key element of the algorithm is the sample weight: the best classification performance is achieved by continuously adjusting sample weights during training. The GBDT family, however, has no such notion of sample weights; the sample gradient plays the role of the weight. Generally speaking, samples with small gradients have small training error, which indicates that they are already well trained. A natural idea would be to discard these samples in the next round of residual fitting, but doing so may change the distribution of the training data and hurt the final training accuracy.

LightGBM therefore proposes the GOSS sampling algorithm, whose purpose is to keep as many samples as possible that contribute to the information gain while speeding up model training. The basic procedure of GOSS is as follows: first sort the samples in descending order by the absolute value of their gradients; keep the top a×100% samples with the largest absolute gradients; from the remaining (1-a)×100% of the data, randomly sample b×100% of the full sample size; then multiply these small-gradient samples by the constant weight (1-a)/b, which pulls the sample distribution back as much as possible. This makes the algorithm focus more on the samples that are not yet sufficiently trained without changing the original data distribution too much. Finally, the information gain of the feature is computed on the combined a×100% + b×100% subset.
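A minimal sketch of the GOSS sampling step (illustrative, not LightGBM's internal code; the values a=0.2, b=0.1 and the toy gradients are assumptions):

import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gradients)
    top_k, rand_k = int(a * n), int(b * n)
    order = np.argsort(-np.abs(gradients))                  # sort by |gradient|, descending
    top_idx = order[:top_k]                                 # large-gradient samples: always kept
    rand_idx = rng.choice(order[top_k:], size=rand_k, replace=False)  # sampled small-gradient part
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b                           # compensate for the discarded samples
    return idx, weights

grads = np.random.default_rng(42).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w.min(), w.max())   # 300 samples kept; small-gradient ones carry weight 8.0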

GOSS optimizes GBDT mainly from the perspective of reducing the number of samples: by discarding samples with small gradients, model training is accelerated without losing much accuracy, which is one of the reasons LightGBM is faster.

2.3 Exclusive feature bundling algorithm

The histogram algorithm optimizes the number of feature split points, and gradient-based one-side sampling optimizes the sample size; what remains is to optimize the number of features.

The exclusive feature bundling (EFB) algorithm speeds up model training by bundling mutually exclusive features into a single feature, reducing the number of features without losing feature information. Most of the time two features are not completely mutually exclusive, so a conflict ratio can be defined to measure how far they are from being mutually exclusive. When the conflict ratio is small, two features that are not completely mutually exclusive can still be bundled without much impact on the final model accuracy.

Two features are said to be mutually exclusive if they are never non-zero at the same time, which is somewhat similar to the one-hot representation of categorical features. The exclusive feature bundling algorithm involves two key problems: one is deciding which features should be bundled together, and the other is how to bundle them, that is, how the bundled feature takes its values.

For the first problem, the EFB algorithm converts it into a graph coloring problem. The basic idea is to treat all features as vertices of a graph and to connect any two features that are not mutually exclusive with an edge whose weight is the conflict ratio of the two features. The features that should be bundled together are then the vertices (features) that would receive the same color in the graph coloring problem.
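A minimal greedy sketch of the bundling search (a rough approximation of the coloring idea, not the library's code; the conflict tolerance K, the ordering by non-zero count, and the toy matrix are assumptions): features are visited in order of density and each one is placed into the first bundle where it conflicts with the already-placed features at most K times.

import numpy as np

def greedy_bundle(X, K=0):
    n_samples, n_features = X.shape
    nonzero = X != 0
    bundles, bundle_masks = [], []                     # feature ids per bundle, occupied rows
    # visit denser features first so sparse ones fill the remaining gaps
    for j in sorted(range(n_features), key=lambda j: -nonzero[:, j].sum()):
        placed = False
        for feats, mask in zip(bundles, bundle_masks):
            conflict = np.sum(mask & nonzero[:, j])    # rows where both are non-zero
            if conflict <= K:
                feats.append(j)
                mask |= nonzero[:, j]
                placed = True
                break
        if not placed:                                 # start a new bundle for this feature
            bundles.append([j])
            bundle_masks.append(nonzero[:, j].copy())
    return bundles

X = np.array([[1, 0, 0, 2],
              [0, 3, 0, 0],
              [0, 0, 4, 1],
              [2, 0, 0, 0]], dtype=float)
print(greedy_bundle(X))   # -> [[0, 1, 2], [3]]: feature 3 conflicts with feature 0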

The second problem is how the bundled feature takes its values. The key is to be able to separate the original features from the merged feature, that is, after bundling we must still be able to identify the original features within the bundle. The EFB algorithm handles this from the histogram perspective: different original features are assigned to different bins of the bundle's histogram by adding an offset constant to their feature values.

As a simple example, suppose we want to bundle feature A, whose values lie in [10, 20), with feature B, whose values lie in [10, 30). We can add an offset of 10 to the values of feature B so that its range becomes [20, 40); the bundled feature then takes values in [10, 40), and features A and B can be merged.
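A minimal sketch of this merging step (the arrays and the use of zero to mean "feature absent" are assumptions for illustration):

import numpy as np

def merge_exclusive(a, b, offset=10.0):
    merged = np.zeros_like(a)
    merged[a != 0] = a[a != 0]                 # feature A keeps its original range [10, 20)
    merged[b != 0] = b[b != 0] + offset        # feature B is shifted into [20, 40)
    return merged                              # the bundled feature lives in [10, 40)

A = np.array([12.0, 0.0, 0.0, 18.0])           # zero marks "this feature is absent"
B = np.array([0.0, 11.0, 25.0, 0.0])
print(merge_exclusive(A, B))                   # [12. 21. 35. 18.]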

2.4 Leaf-wise growth strategy

LightGBM also proposes a decision tree growth strategy that differs from XGBoost's level-wise (layer-by-layer) growth: leaf-wise growth, i.e. growing by leaf nodes, with a depth limit.

The advantage of XGBoost's level-wise algorithm is that it can be optimized with multiple threads and makes it easy to control model complexity, so it is not prone to overfitting. The disadvantage is that it treats all leaf nodes of the same level indiscriminately; many node splits and gain calculations are unnecessary, which incurs extra computational overhead.

LightGBM's leaf-wise algorithm, which grows the tree by leaf nodes, is more accurate and efficient and saves unnecessary computation. At the same time, to prevent a single branch from growing too deep, a depth limit is added, which preserves accuracy while preventing overfitting.
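The toy sketch below illustrates the idea of leaf-wise growth with a depth limit (purely conceptual, with made-up gain values; it is not how LightGBM actually builds trees): at each step the single leaf with the largest split gain is split, until the leaf budget or the depth limit is reached.

import heapq

def leaf_wise_growth(root_gain, child_gains, max_leaves=8, max_depth=3):
    child_gains = iter(child_gains)            # made-up gains handed to newly created leaves
    heap = [(-root_gain, 0, 0)]                # (-gain, depth, leaf_id): max-heap by gain
    next_id, n_leaves, splits = 1, 1, []
    while heap and n_leaves < max_leaves:
        neg_gain, depth, leaf = heapq.heappop(heap)
        if depth >= max_depth or -neg_gain <= 0:
            continue                           # depth limit reached or no positive gain
        splits.append((leaf, -neg_gain))       # split this leaf: it had the largest gain
        n_leaves += 1                          # one leaf becomes two
        for _ in range(2):                     # enqueue the two children with toy gains
            heapq.heappush(heap, (-next(child_gains, 0.0), depth + 1, next_id))
            next_id += 1
    return splits

# the leaf with the largest current gain is always split first
print(leaf_wise_growth(5.0, [3.0, 4.0, 1.0, 0.5, 2.0, 0.1]))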

In addition to the four improvements above, LightGBM also includes some engineering improvements and optimizations, such as direct support for categorical features (no need for one-hot encoding of categorical features), efficient parallelism, and cache hit rate optimization.

3 LightGBM native library example

The Microsoft team behind the open-source LightGBM project provides a native library implementation of the algorithm. The lightgbm library provides both classification and regression interfaces; the following uses a classification problem on the iris dataset as an example of the lightgbm interface.

import lightgbm as lgb
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

iris = load_iris()
data, target = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=43)
gbm = lgb.LGBMClassifier(
    objective='multiclass',
    num_class=3,
    num_leaves=31, # number of leaf nodes in each decision tree, i.e. the tree's complexity
    learning_rate=0.05,
    n_estimators=20
)

gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=5)
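# note: lightgbm >= 4.0 removed early_stopping_rounds from fit(); there you would pass callbacks=[lgb.early_stopping(5)] instead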
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
print(accuracy_score(y_pred, y_test))
lgb.plot_importance(gbm)
plt.show()
[1]	valid_0's multi_logloss: 1.02277
[2]	valid_0's multi_logloss: 0.943765
[3]	valid_0's multi_logloss: 0.873274
[4]	valid_0's multi_logloss: 0.810478
[5]	valid_0's multi_logloss: 0.752973
[6]	valid_0's multi_logloss: 0.701621
[7]	valid_0's multi_logloss: 0.654982
[8]	valid_0's multi_logloss: 0.611268
[9]	valid_0's multi_logloss: 0.572202
[10]	valid_0's multi_logloss: 0.53541
[11]	valid_0's multi_logloss: 0.502582
[12]	valid_0's multi_logloss: 0.472856
[13]	valid_0's multi_logloss: 0.443853
[14]	valid_0's multi_logloss: 0.417764
[15]	valid_0's multi_logloss: 0.393613
[16]	valid_0's multi_logloss: 0.370679
[17]	valid_0's multi_logloss: 0.349936
[18]	valid_0's multi_logloss: 0.330669
[19]	valid_0's multi_logloss: 0.312805
[20]	valid_0's multi_logloss: 0.296973
1.0
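
For completeness, here is a minimal regression sketch with the same library (the diabetes dataset and the hyperparameters are my own assumptions, not from the book):

import lightgbm as lgb
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# load the diabetes regression dataset and fit an LGBMRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

reg = lgb.LGBMRegressor(
    objective='regression',
    num_leaves=31,
    learning_rate=0.05,
    n_estimators=100
)
reg.fit(X_train, y_train, eval_set=[(X_test, y_test)])
y_pred = reg.predict(X_test)
print(mean_squared_error(y_test, y_pred))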

Notebook_Github address


Origin blog.csdn.net/cjw838982809/article/details/131329783