[Machine Learning] Interpretation of LightGBM (Ensemble Learning, Boosting, GBM)

1. Introduction

GBDT (Gradient Boosting Decision Tree) is a long-standing model in machine learning. Its core idea is to train weak learners (decision trees) iteratively and combine them into a strong model. GBDT trains well and is not prone to overfitting. It is widely used in industry for tasks such as multi-class classification, click-through rate prediction, and search ranking, and it is also a powerful weapon in data mining competitions: by some counts, more than half of the winning solutions in Kaggle competitions are based on GBDT.

LightGBM (Light Gradient Boosting Machine) is also a GBDT implementation and can be seen as an optimization of XGBoost. It likewise uses decision trees as base learners. LightGBM was designed for efficient parallel computing, and its "Light" is reflected in the following points:

  • Faster training
  • Lower memory usage
  • Support for single-machine multi-threading, multi-machine parallel training, and GPU training
  • Ability to handle large-scale data



2. Data preprocessing

Large-scale data has two aspects: many samples and many features. LightGBM addresses both in the data preprocessing stage.

  • For the many-samples problem, it proposes Gradient-based One-Side Sampling (GOSS);
  • For the many-features problem, it proposes Exclusive Feature Bundling (EFB).

2.1 Gradient-based One-Side Sampling (GOSS)

When there are too many samples, some of them can be dropped. LightGBM uses the gradients computed so far to decide which samples to keep.

  • Intuitively, the larger a sample's gradient, the less well it has been learned so far. Correctly predicting large-gradient samples contributes more to the gain, so we want node splits to classify large-gradient samples accurately while tolerating errors on small-gradient samples.
  • Given that, when screening samples we keep all large-gradient samples and only drop a portion of the small-gradient ones. This is the idea behind GOSS, and the specific procedure is as follows:
    [Figure: the GOSS sampling procedure]

1) Sort the samples by the absolute value of their gradients in descending order, which makes screening easy.
2) Set a ratio threshold $a$: the top $a \times 100\%$ of the sorted samples are called large-gradient samples, and all of them are kept.
3) The remaining $(1-a) \times 100\%$ of samples are called small-gradient samples; they are randomly sampled with a sampling ratio of $b \times 100\%$.
4) Since the random sampling is applied only to the small-gradient samples, it changes the original data distribution. To preserve the distribution as much as possible, the sampled small-gradient samples are multiplied by a compensating coefficient:
$$\frac{1-a}{b}$$
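
To make the procedure above concrete, here is a minimal NumPy sketch of the GOSS sampling step. The function name and signature are illustrative, not LightGBM's API; it assumes a per-sample gradient vector `gradients` and the ratios `a` and `b` described above.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Sketch of GOSS: keep the top a*100% samples by |gradient|, randomly
    sample b*100% of all samples from the rest, and up-weight the sampled
    small-gradient part by (1 - a) / b to preserve the data distribution."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))      # 1) sort by |gradient|, descending
    n_top = int(a * n)
    top_idx = order[:n_top]                     # 2) large-gradient samples: keep all
    rest = order[n_top:]                        # 3) small-gradient samples
    sampled_idx = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - a) / b             # 4) compensating coefficient
    return idx, weights
```
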
This also explains the algorithm's name, gradient-based one-side sampling:

  • "Gradient-based" means the samples are sorted by their gradients;
  • "One-side sampling" means sampling is applied only to the small-gradient side.

The author tested GOSS on five datasets, as shown in the table below; the values are the number of seconds needed to train a single decision tree. Only the two columns in the red box matter here:

  • The EFB_only column is LightGBM using only EFB (exclusive feature bundling);
  • The other column is the complete LightGBM algorithm.

The only difference between these two columns is that the left one does not use GOSS while the right one does, so comparing them isolates the effect of GOSS. GOSS is roughly twice as fast on average.

[Table: per-tree training time in seconds on five datasets; compare the EFB_only column with the full LightGBM column]


2.2 Exclusive Feature Bundling (EFB)

When there are too many features, dimensionality can be reduced. LightGBM exploits sparsity to merge features losslessly.

  • From the feature perspective, a sparse feature contains many zero elements;
  • From the sample perspective, several sparse features of the same sample are often zero at the same time.

Based on this observation, EFB bundles mutually exclusive features together. The overall process is somewhat like the inverse of One-Hot encoding. The figure below walks through a detailed example.
[Figure: EFB example with 6 samples, 3 sparse features and 2 dense features]
1) Look at the first table first; this is the original table before EFB. It contains 6 samples, each with 5 features: the first 3 features are sparse and the last 2 are dense. Ignore the dense features and look only at the sparse ones. The goal is to merge these three sparse features into a single new feature, which we call a bundle.

  • When a sample has only one non-zero element among its 3 sparse features, the zeros can be dropped and only the non-zero element kept, reducing 3 features to 1.
    • But this merge obviously cannot be reversed from 1 back to 3, because it is no longer clear which original feature a merged non-zero element came from; some information is lost in the merge.
  • The original feature that a merged element belongs to can instead be encoded implicitly through its value range. Suppose each of the three features takes values in 1~10. The first feature is left unchanged; the second feature is shifted so it does not overlap the first, with every element offset by 10 on the axis; the third feature is shifted past the first two, with every element offset by 20. This produces the second table below the arrow, and each merged element can be mapped back to its original feature from its value range. The offsetting process is shown below, followed by a small code sketch:
    [Figure: offsetting the value ranges of the three sparse features before merging]
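
As an illustration of the offset trick, here is a small NumPy sketch. The sample values and the 1~10 value range are only assumptions for this example, and `merge_bundle` is not LightGBM's actual routine.

```python
import numpy as np

def merge_bundle(features, value_range=10):
    """Sketch of merging mutually exclusive sparse features into one bundle:
    feature j is offset by j * value_range, so the merged value still tells
    us which original feature it came from."""
    bundle = np.zeros(features.shape[0])
    for j in range(features.shape[1]):
        nz = features[:, j] != 0
        # on a conflict, the feature merged last wins, as in the example above
        bundle[nz] = features[nz, j] + j * value_range
    return bundle

X = np.array([[3, 0, 0],
              [0, 7, 0],
              [3, 0, 8],     # a conflict: two non-zero sparse features
              [0, 0, 5]], dtype=float)
print(merge_bundle(X))       # [ 3. 17. 28. 25.]
```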

2) Note that when sample 3 is merged, it has two non-zero elements, which violates the requirement. LightGBM calls this situation a conflict. If conflicts were rejected outright, very few features could be merged, so conflicts have to be tolerated to some degree.

When the conflict ratio between features is small (the threshold in the source code is 1/10000), the impact is negligible and the conflict is ignored; such features are called mutually exclusive features. When the conflict ratio is large, it cannot be ignored and EFB does not apply. For the conflict in sample 3, the feature merged last wins, so the table shows 20+8 rather than 10+3.

But how do we find these mutually exclusive feature groups? Trying every combination is NP-hard and beyond any realistic computing budget, so a greedy algorithm is used instead. The process is as follows:

  • Traverse the features and take the first feature as the initial bundle.
  • Try to put the second feature into this bundle: if the conflict ratio is small, add it and merge them into one feature; if the conflict ratio is large, start a new bundle with it.
  • Try to put the third feature into an existing bundle; add it if possible, otherwise start a new bundle.
  • Continue in the same way for all remaining features.

For the traversal order, the author gives two options: one builds a graph from the mutual-exclusion relations and traverses features in descending order of node degree; the other counts the number of non-zero values of each feature and traverses in descending order of that count. The role of this ordering is weakened in the source code; see the paper and source code if interested.
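
A minimal sketch of the greedy bundling search described above, assuming a precomputed `conflict_counts[i][j]` matrix holding the number of samples on which features i and j are both non-zero (the names and data structure are illustrative, not LightGBM's internals):

```python
def greedy_bundle(conflict_counts, n_samples, max_conflict_rate=1e-4):
    """Greedy feature bundling: place each feature into the first existing
    bundle whose total conflict count stays below the threshold, otherwise
    start a new bundle."""
    bundles = []                                   # each bundle is a list of feature indices
    max_conflicts = max_conflict_rate * n_samples  # e.g. 1/10000 of the samples
    for feat in range(len(conflict_counts)):
        for bundle in bundles:
            conflicts = sum(conflict_counts[feat][j] for j in bundle)
            if conflicts <= max_conflicts:
                bundle.append(feat)
                break
        else:                                      # no bundle accepted this feature
            bundles.append([feat])
    return bundles
```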

The author tested EFB on five datasets, as shown in the table below; the values are the seconds needed to train a single decision tree. Only the two columns in the red box matter here: the lgb_baseline column is LightGBM with ordinary sparse optimization, and EFB_only is LightGBM with EFB. EFB speeds training up by roughly a factor of 8.

[Table: per-tree training time in seconds; compare the lgb_baseline column with the EFB_only column]
The speed increase mainly comes from two points:

  • Ordinary sparse optimization must maintain a table of non-zero values. After bundling, several sparse features become one dense feature, the non-zero table is no longer needed, and both the memory and the maintenance time are saved;
  • Scanning several sparse features one after another causes cache misses every time the feature switches. After they are merged into one feature there is no switching and no cache-miss problem.

3. Decision tree learning

The learning process of a decision tree is divided into two parts:

  • Node level: finding the optimal split point (a feature value) for a given leaf node;
  • Tree-structure level: choosing which leaf node to split next.

3.1 Finding the optimal split point for continuous features

For continuous features, the pre-sorting method can be used to compute gains, but every candidate split point has to be tried, so the gain is computed a huge number of times. To reduce the computation, LightGBM discretizes continuous features into equal-width bins, i.e. builds histograms, and computes the gain only once per histogram bin. For a single feature, the time complexity of gain computation drops from the number of distinct feature values, $O(n_{distinct\_value})$, to the number of histogram bins, $O(n_{bin})$, and the number of bins is kept at 256 or below. Because $n_{distinct\_value} \gg n_{bin}$, the speedup is significant. The figure below compares pre-sorting with the histogram method:
[Figure: pre-sorting vs. histogram-based split finding]
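
A minimal sketch of histogram-based split finding for a single pre-binned feature. It assumes the XGBoost-style second-order gain mentioned at the end of this article (per-bin gradient sums G, hessian sums H, regularization `lam`); the function is illustrative, not LightGBM's internal code.

```python
import numpy as np

def best_split_on_feature(bin_idx, grad, hess, n_bins=256, lam=1.0):
    """Histogram-based split finding for one feature: accumulate per-bin
    gradient/hessian sums, then scan the bin boundaries once and keep the
    split with the largest gain."""
    G = np.bincount(bin_idx, weights=grad, minlength=n_bins)
    H = np.bincount(bin_idx, weights=hess, minlength=n_bins)
    G_total, H_total = G.sum(), H.sum()

    best_gain, best_bin = -np.inf, None
    G_left = H_left = 0.0
    for b in range(n_bins - 1):                       # candidate split: after bin b
        G_left += G[b]; H_left += H[b]
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = (G_left ** 2 / (H_left + lam)
                + G_right ** 2 / (H_right + lam)
                - G_total ** 2 / (H_total + lam))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```
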
Besides being faster, the histogram algorithm also consumes less memory.

  • With the pre-sorting algorithm, every feature of every sample must store a 32-bit floating-point feature value plus a 32-bit index into the sorted order;
  • With the histogram algorithm, every feature of every sample only stores a bin index, and since $n_{bin}$ is capped at 256 an 8-bit integer is enough. The histogram method therefore uses roughly $1/8$ of the memory of pre-sorting.

The histogram algorithm evaluates the gain at fewer positions, so compared with pre-sorting it may miss the exact best split position.

  • The author argues that this actually helps avoid overfitting and improves generalization, and that the error is corrected over subsequent boosting iterations anyway.

In addition, LightGBM uses histogram subtraction to speed up node splitting.

  • For the feature chosen for the split, its histogram can simply be cut at the split point; but for every other feature, the child histograms of the left and right nodes would normally have to be rebuilt according to the chosen split rule.

Histogram subtraction speeds this step up. When a parent node splits, its histogram is already known, so only one child's histogram needs to be built from the data; the other child's histogram is obtained by subtraction, which cuts the computation. This can be optimized further: build the histogram for the child with fewer samples and obtain the larger child by subtraction, reducing the work even more. Histogram subtraction roughly halves the time, to about $1/2$ of the original.
[Figure: histogram subtraction — sibling histogram = parent histogram − child histogram]
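
A tiny sketch of histogram subtraction, assuming each histogram is stored as an array of per-bin (sum of gradients, sum of hessians, sample count) rows; the layout and numbers are assumptions for illustration, not LightGBM's actual storage format.

```python
import numpy as np

def sibling_histogram(parent_hist, child_hist):
    """The sibling's histogram equals the parent's histogram minus the other
    child's, so only the child with fewer samples is built from the data."""
    return parent_hist - child_hist

# columns: sum_gradient, sum_hessian, count (toy numbers)
parent = np.array([[1.2, 0.9, 10.0], [0.4, 0.5, 6.0], [2.0, 1.1, 9.0]])
left   = np.array([[0.7, 0.4,  6.0], [0.1, 0.2, 2.0], [1.5, 0.6, 5.0]])
right  = sibling_histogram(parent, left)   # no second pass over the data
```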

3.2 Finding the optimal split point for categorical features

Categorical features are usually One-Hot encoded before being fed to a decision tree. With this encoding, the tree splits nodes in a one-vs-rest fashion and can only separate one category at a time, as shown in the figure below. This is inefficient and not friendly to decision tree learning.

  • LightGBM optimizes this by splitting nodes in a many-vs-many fashion, as shown in the figure below.

[Figure: one-vs-rest vs. many-vs-many splits on a categorical feature]
Following the paper "On Grouping for Maximum Homogeneity", LightGBM sorts the categories of each categorical feature by $\frac{G}{H} = \frac{\sum \text{gradient}}{\sum \text{hessian}}$ and then builds the histogram in this order to find the optimal split point.

  • Here is a common machine learning question: why can continuous features build histograms directly in the order of their values, while categorical features must be ordered by $\frac{G}{H}$ rather than by the feature values themselves?
    • Because the values of a continuous feature have a natural ordering, whereas the values of a categorical feature do not; they merely name categories, such as orange vs. apple, which are on equal footing. Histogram-based node splitting requires an ordering of the values, so the categories are first sorted by $\frac{G}{H}$ to introduce an ordering, and only then is the histogram built.
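A minimal sketch of this ordering step, assuming per-sample gradient and hessian arrays; the helper is illustrative and not LightGBM's actual implementation:

```python
import numpy as np

def order_categories(cat_values, grad, hess, eps=1e-6):
    """Sort the categories of one categorical feature by sum(grad)/sum(hess)
    and return each sample's rank in that ordering, so a histogram split over
    the ranks can send any subset of categories to one side."""
    cat_values = np.asarray(cat_values)
    cats = np.unique(cat_values)
    ratio = np.array([grad[cat_values == c].sum() / (hess[cat_values == c].sum() + eps)
                      for c in cats])
    rank_of = {c: r for r, c in enumerate(cats[np.argsort(ratio)])}
    return np.array([rank_of[c] for c in cat_values])
```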

3.3 Learning the tree structure with a leaf-wise growth strategy

There are two ways to grow the tree structure, as shown in the figure below:

  • One is the level-wise growth strategy used by most GBDT implementations;
  • The other is the leaf-wise growth strategy used by LightGBM.

[Figure: level-wise growth vs. leaf-wise growth]

  • With level-wise growth, the nodes of one level together cover all the training data, so every node in the level can be split with a single pass over the data, which is simple and convenient.
    • However, this growth pattern produces many unnecessary splits: some nodes in a level gain very little from splitting, and such nodes should be left unsplit to save computation.
  • LightGBM avoids this problem with the leaf-wise growth strategy. The rule is:
    • At each step, split the leaf that yields the largest gain.
    • For the same number of splits, leaf-wise growth clearly reduces the loss function more.
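
A schematic sketch of leaf-wise growth using a max-heap of candidate splits. `evaluate_split(leaf)` and `leaf.split(...)` are hypothetical helpers standing in for gain evaluation and node splitting; this is not LightGBM's code.

```python
import heapq
import itertools

def grow_leaf_wise(root, max_leaves, evaluate_split):
    """Leaf-wise growth: repeatedly split the leaf whose best split has the
    largest gain, until max_leaves is reached or no split has positive gain."""
    counter = itertools.count()                    # tie-breaker for the heap
    heap = []

    def push(leaf):
        gain, left, right = evaluate_split(leaf)   # hypothetical: best split of this leaf
        heapq.heappush(heap, (-gain, next(counter), leaf, left, right))

    push(root)
    n_leaves = 1
    while heap and n_leaves < max_leaves:
        neg_gain, _, leaf, left, right = heapq.heappop(heap)
        if -neg_gain <= 0:
            break                                  # no remaining split is worthwhile
        leaf.split(left, right)                    # hypothetical: apply the split
        n_leaves += 1
        push(left)
        push(right)
    return root
```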

4. Multi-machine parallel optimization

If one machine is not enough, several machines work together. Multi-machine parallelism introduces a communication cost: the amount of data exchanged between machines must stay small, otherwise it adds a lot of extra overhead.

Two parallelization schemes existed before: 1) feature parallelism; 2) data parallelism. Both parallelize the search for the optimal split point, and each has its drawbacks.
LightGBM builds on data parallelism and proposes voting parallelism.

The three methods are described below.

4.1 Feature Parallelism

In feature parallelism, each worker holds a subset of the features for all samples. Each worker finds its local optimal split point and gain; the results are gathered, the gains are compared, and the global optimal split point (feature + threshold) is selected. All workers then split their nodes according to this global optimum.

However, the chosen globally optimal feature is stored on only one worker; the others do not have it, so they cannot tell from the threshold which samples go to the left leaf and which to the right. The only fix is for the worker holding that feature to broadcast each sample's side to the other workers. Since there are only two sides, one bit per sample suffices, but the communication volume is still $O(n_{sample})$, which is large.

[Figure: feature parallelism]
Besides the large communication volume, a closer look reveals two steps that are not parallelized at all:

  • Node splitting: every worker must perform the split itself, otherwise it has no new nodes; the time spent is proportional to the number of samples on every worker.
  • Gradient computation: every worker must compute the gradients itself, otherwise it does not have them; this again costs time proportional to the number of samples on every worker.

Therefore, there is still a lot of room for optimization in feature parallelism.

4.2 Data Parallelism

In data parallelism, each worker holds all features for a subset of the samples. Each worker builds its local histograms, the local histograms are merged into global histograms, and the optimal split point is found on the global histograms.
[Figure: data parallelism]
Data parallelism avoids the problems of feature parallelism, but its drawback is also obvious: all histograms must be sent, so the communication cost is $O(n_{bin} \times n_{feature})$. When the number of features reaches the millions, this cost is unacceptable.
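
A toy sketch of the aggregation step in data parallelism, assuming each worker's local histograms are stacked into one array of shape (n_feature, n_bin, 2) holding per-bin gradient and hessian sums; in a real deployment this sum would be an all-reduce across machines rather than a local call.

```python
import numpy as np

def aggregate_histograms(worker_hists):
    """Merge the local histograms from all workers into the global histogram
    by element-wise summation; every worker must send n_feature * n_bin bins."""
    return np.sum(np.stack(worker_hists), axis=0)
```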

4.3 Voting Parallelism

LightGBM therefore uses neither of these two schemes directly but adopts voting parallelism, an improved algorithm built on data parallelism that the LightGBM authors proposed in another paper.

  • When the number of features is large, data parallelism incurs a huge communication cost; but if the number of features that must be communicated can be reduced, data parallelism becomes a very good method. Voting parallelism is built on this idea. The procedure is as follows (a toy sketch in code follows the list):
    [Figure: the voting-parallel procedure]

    • Each worker selects the k features most likely to be globally optimal according to their local gains; these are the local optimal features.
    • The gains of these local optimal features, together with the number of samples $n_{sample}$ used to compute each gain, are sent for aggregation.
    • Since a gain computed from few samples is unreliable, each gain is weighted by $\frac{n_{sample}}{\max(n_{sample})}$.
    • Different workers may nominate the same feature; in that case their weighted gains are simply summed.
    • The features are re-ranked by weighted gain and the top k are selected again; these are the global optimal features.
    • Each worker then sends only the histograms of these k global optimal features, and the optimal split point is found after aggregation.
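
A toy single-process simulation of the voting step described above. `local_gains[w][f]` and `local_counts[w][f]` are assumed to be worker w's best gain and sample count for feature f; the weighting detail is a simplification of the actual PV-Tree algorithm, not a faithful reimplementation.

```python
import numpy as np

def voting_parallel_select(local_gains, local_counts, k=2):
    """Each worker nominates its top-k features by local gain; the nominated
    gains are weighted by the relative amount of data behind them, summed
    over workers, and the global top-k features are returned."""
    local_gains = np.asarray(local_gains, dtype=float)
    local_counts = np.asarray(local_counts, dtype=float)
    n_workers, _ = local_gains.shape

    scores = np.zeros(local_gains.shape[1])
    for w in range(n_workers):
        top = np.argsort(-local_gains[w])[:k]                # local candidates
        weight = local_counts[w, top] / local_counts.max()   # down-weight small-sample gains
        scores[top] += weight * local_gains[w, top]          # same feature: weighted gains add up

    return np.argsort(-scores)[:k]                           # global optimal features
```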

Since each worker sends only k histograms, the communication volume of voting parallelism is $O(n_{bin} \times k)$. By tuning $k$, the communication volume can be controlled and the communication time greatly reduced; LightGBM's parallel speedup grows nearly linearly with the number of machines.

The author measured the running time (in hours) of the three parallel schemes on two tasks, learning to rank and click-through rate prediction. The results are shown in the table below; voting parallelism is the fastest.
[Table: running time in hours of feature, data and voting parallelism on the ranking and CTR tasks]
The dataset statistics, shown in the table below, explain these results. The ranking dataset has fewer samples but more features, so feature parallelism takes much less time than data parallelism; the click-through-rate dataset has many samples but fewer features, so data parallelism takes much less time than feature parallelism. Voting parallelism is the fastest on both tasks, and its speedup grows nearly linearly.
[Table: dataset statistics (numbers of samples and features) for the ranking and CTR tasks]
For the gradient boosting itself, LightGBM uses the same approach as XGBoost: a second-order Taylor expansion with regularization terms, and shrinkage when adding each new model to the ensemble. Overall, LightGBM's improvements come down to faster training and lower memory usage. It is a very promising algorithm and a valuable reference for follow-up work.

References

【1】https://zhuanlan.zhihu.com/p/366952043
【2】Ke Guolin. Research on parallel learning algorithms for gradient boosting decision tree (GBDT). https://d.wanfangdata.com.cn/thesis/Y3025839
【3】A Communication-Efficient Parallel Algorithm for Decision Tree. https://papers.nips.cc/paper/2016/file/10a5ab2db37feedfdeaab192ead4ac0e-Paper.pdf
【4】LightGBM: A Highly Efficient Gradient Boosting Decision Tree. https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
