Machine Learning-Decision Tree (XGBoost, LightGBM)

 


This article introduces the mainstream ensemble algorithms built on the Boosting framework: XGBoost and LightGBM.

1. XGBoost

XGBoost is a massively parallel boosted-tree tool. It is currently one of the fastest and best open-source boosted-tree toolkits, more than ten times faster than many common toolkits. Both XGBoost and GBDT are boosting methods; apart from some differences in engineering implementation and problem formulation, the biggest difference lies in the definition of the objective function. This article therefore introduces the mathematical principles and the engineering implementation, and closes with the advantages and disadvantages of XGBoost.

1.1 Mathematical principles

1.1.1 Objective function

We know that the XGBoost model is an additive expression composed of $k$ base models:

$$\hat{y}_i = \sum_{t=1}^{k} f_t(x_i)$$

where $f_t$ is the $t$-th base model and $\hat{y}_i$ is the predicted value for sample $i$.

The loss function can be expressed in terms of the predicted value $\hat{y}_i$ and the true value $y_i$:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

where $n$ is the sample size.

We know that the prediction accuracy of a model is determined by its bias and variance. The loss function captures the bias of the model; to keep the variance small, the model needs to be simple. The objective function is therefore composed of the model's loss function plus a regularization term that suppresses model complexity:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{k} \Omega(f_t)$$

Here $\Omega$ is the regularization term of the model. Since XGBoost supports both decision trees and linear models as base learners, its concrete form depends on the base model and is given later for decision trees.

We know that boosting models are trained by forward stagewise addition. Taking step $t$ as an example, the model's prediction for sample $i$ is:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

where $\hat{y}_i^{(t-1)}$ is the prediction produced by the model at step $t-1$, a known constant, and $f_t(x_i)$ is the prediction of the new model we need to add at this step. The objective function can then be written as:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}$$

Minimizing the objective at this point is equivalent to solving for $f_t(x_i)$.

Taylor's formula approximates a function that is $n$-times differentiable at a point $x_0$ by a polynomial of degree $n$. If the function $f(x)$ has an $n$-th order derivative on a closed interval containing $x_0$ (and an $(n+1)$-th order derivative on the open interval), then for any $x$ in that interval the polynomial $\sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k + R_n(x)$ is called the Taylor expansion of $f$ at $x_0$, where $R_n(x)$ is the remainder of the Taylor formula, a higher-order infinitesimal of $(x - x_0)^n$.

According to Taylor's formula, we take the second-order Taylor expansion of $f(x + \Delta x)$ at the point $x$:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$

Treating $\hat{y}_i^{(t-1)}$ as $x$ and $f_t(x_i)$ as $\Delta x$, the objective function can be written as:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}$$

where $g_i$ is the first derivative and $h_i$ is the second derivative of the loss function. Note that these derivatives are taken with respect to $\hat{y}_i^{(t-1)}$.

Let's take the squared loss function as an example:

$$l\left(y_i, \hat{y}_i^{(t-1)}\right) = \left(y_i - \hat{y}_i^{(t-1)}\right)^2$$

then:

$$g_i = \frac{\partial\, l}{\partial\, \hat{y}_i^{(t-1)}} = -2\left(y_i - \hat{y}_i^{(t-1)}\right), \qquad h_i = \frac{\partial^2 l}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2} = 2$$

Since $\hat{y}_i^{(t-1)}$ is already known at step $t$, $l(y_i, \hat{y}_i^{(t-1)})$ is a constant and does not affect the optimization, so the objective function can be written as:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

So at each step we only need to compute the first and second derivatives of the loss function (because the prediction of the previous step is known, these two values are constants), then optimize the objective function to obtain $f_t$, and finally assemble the overall model according to the additive scheme.
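
The following is a minimal sketch, assuming the xgboost and scikit-learn Python packages are available, of how the per-sample $g_i$ and $h_i$ enter training through a custom objective; squared_error_obj is an illustrative name, not part of either library (in practice the built-in reg:squarederror objective would be used instead).

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

def squared_error_obj(preds, dtrain):
    """Gradient and Hessian of 0.5 * (y - yhat)^2 w.r.t. the current prediction."""
    y = dtrain.get_label()
    grad = preds - y           # g_i
    hess = np.ones_like(y)     # h_i
    return grad, hess

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_obj)
```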

1.1.2 Objective function based on decision tree

We know that the base model of XGBoost supports not only decision trees but also linear models. Here we mainly introduce the objective function based on decision trees.

We can define a decision tree as $f_t(x) = w_{q(x)}$, where $q(x)$ denotes which leaf node sample $x$ falls on and $w$ is the vector of leaf values, so $w_{q(x)}$ is the value assigned to each sample (that is, its predicted value).

The complexity of a decision tree is determined by the number of leaves: the fewer leaf nodes, the simpler the model. In addition, the leaf nodes should not carry excessively large weights (analogous to the weight of each variable in LR), so the regularization term of the objective function can be defined as:

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$

That is, the complexity of the tree model is determined by the number of leaf nodes $T$ of the generated trees and the L2 norm of the vector of leaf weights $w$.

This figure shows how the decision-tree-based regularization term of XGBoost is computed.

We define $I_j = \{\, i \mid q(x_i) = j \,\}$ as the sample set of the $j$-th leaf node, so the objective function can be written as:

$$\begin{aligned} Obj^{(t)} &\approx \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \\ &= \sum_{i=1}^{n}\left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^{2} \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} \\ &= \sum_{j=1}^{T}\left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^{2} \right] + \gamma T \end{aligned}$$

The step from the second line to the third may not be obvious, so here is an explanation: the second line sums the loss over all samples, but every sample eventually falls on some leaf node, so we can instead iterate over the leaf nodes, collect the sample set on each leaf, and sum the loss there. In other words, the sum over samples is rewritten as a sum over leaf nodes. Because a leaf node contains multiple samples, the terms $\sum_{i \in I_j} g_i$ and $\sum_{i \in I_j} h_i$ appear, and $w_j$ is the value of the $j$-th leaf node.

To simplify the expression, we define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, so the objective function becomes:

$$Obj^{(t)} = \sum_{j=1}^{T}\left[ G_j w_j + \frac{1}{2}\left( H_j + \lambda \right) w_j^{2} \right] + \gamma T$$

Note that $G_j$ and $H_j$ are computed from the results of the previous step, so their values are known and can be treated as constants; only the leaf weights $w_j$ of the last tree are still undetermined. Taking the first derivative of the objective function with respect to $w_j$ and setting it to zero yields the weight of each leaf node:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

So the objective function can be simplified to:

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

The figure above gives an example of computing the objective function: compute the first and second derivatives $g_i$ and $h_i$ of each sample, sum them over the samples contained in each node to obtain $G_j$ and $H_j$, and finally traverse the leaf nodes of the decision tree to obtain the objective function.
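
As a small illustration of the closed-form solution above (illustrative code, not XGBoost internals), the following sketch computes $w_j^{*} = -G_j/(H_j+\lambda)$ and the structure score for a fixed assignment of samples to leaves:

```python
import numpy as np

def leaf_weights_and_score(g, h, leaf_of_sample, n_leaves, lam=1.0, gamma=0.0):
    """g, h: per-sample first/second derivatives; leaf_of_sample: q(x_i)."""
    G = np.bincount(leaf_of_sample, weights=g, minlength=n_leaves)  # G_j
    H = np.bincount(leaf_of_sample, weights=h, minlength=n_leaves)  # H_j
    w = -G / (H + lam)                                              # w_j*
    obj = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * n_leaves
    return w, obj

# Example: 6 samples assigned to 3 leaves of the current tree.
g = np.array([-1.2, 0.4, 0.9, -0.3, 1.1, -0.7])
h = np.ones(6)
w, obj = leaf_weights_and_score(g, h, np.array([0, 0, 1, 1, 2, 2]), n_leaves=3)
print(w, obj)
```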

1.1.3 Optimal segmentation point partition algorithm

In the growth process of a decision tree, a critical issue is how to find the optimal split point of a leaf node. XGBoost supports two node-splitting methods: the greedy algorithm and the approximate algorithm.

1) Greedy algorithm

  1. Start from a tree of depth 0 and enumerate all available features for each leaf node;
  2. For each feature, sort the training samples belonging to the node in ascending order of the feature value, determine the best split point of that feature by a linear scan, and record the split gain of the feature;
  3. Select the feature with the highest gain as the split feature, use its best split point as the split position, split the node into two new left and right leaf nodes, and associate the corresponding sample set with each new node;
  4. Return to step 1 and execute recursively until the stopping conditions are met.

So how is the split gain of each feature calculated?

Assume we split a node on some feature. The objective function before the split can be written as:

$$Obj_{\text{before}} = -\frac{1}{2}\left[ \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] + \gamma$$

The objective function after the split is:

$$Obj_{\text{after}} = -\frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right] + 2\gamma$$

Then the gain of the split with respect to the objective function is:

$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Note that this split gain can also be used as an important basis for the feature importance output.

For each split we need to enumerate all possible split points of every feature. How can all these splits be enumerated efficiently?

Assume we want to enumerate all splits of the form $x_j < a$; for a particular split point $a$ we need the sums of the first and second derivatives on the left and on the right.

We can see that for all split points, a single left-to-right scan is enough to accumulate $G_L$ and $H_L$ (and hence $G_R = G - G_L$ and $H_R = H - H_L$), after which the formula above gives the score of every candidate split.

Looking at the gain after splitting, we find that splitting a node does not necessarily improve the result, because the term $\gamma$ penalizes the introduction of a new leaf: if the gain of a split is smaller than this threshold, the split can be pruned away.
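
The left-to-right scan described above can be sketched as follows; this is an illustrative re-implementation for a single dense feature, not the library's code:

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)                      # samples sorted by feature value
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thr = 0.5 * (x[order[i]] + x[order[i + 1]])
    return best_thr, best_gain   # a negative gain means: do not split
```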

2) Approximate algorithm

The greedy algorithm can find the optimal split, but when the amount of data is too large to fit in memory it cannot be applied directly. The approximate algorithm addresses this shortcoming of the greedy algorithm by providing an approximately optimal solution.

For each feature, examining only a set of quantile points reduces the computational complexity.

The algorithm first proposes candidate split points according to the quantiles of the feature distribution, then maps the continuous feature values into buckets delimited by these candidate points, aggregates the statistics per bucket, and finds the best split point among these buckets.

There are two strategies when proposing candidate segmentation points:

  • Global: propose candidate segmentation points before learning each tree, and use this segmentation during each split;
  • Local: The candidate segmentation point will be proposed again before each split.

Intuitively, the Local strategy requires more computation steps, while the Global strategy requires more candidate points, because candidates are not re-proposed after nodes are split.

The following figure shows the test AUC curves of different split strategies. The horizontal axis is the number of iterations, the vertical axis is the test-set AUC, eps is the accuracy parameter of the approximate algorithm, and its reciprocal is the number of buckets.

We can see that the Global strategy with many candidate points (small eps) and the Local strategy with fewer candidate points (larger eps) achieve similar accuracy. In addition, with a reasonable value of eps the quantile strategy can reach the same accuracy as the greedy algorithm.

  • The first for loop: propose the candidate set of split points for feature k according to the quantiles of its distribution. XGBoost supports both the Global strategy and the Local strategy.
  • The second for loop: for each feature, map the samples into the buckets formed by that feature's candidate point set, accumulate the statistics ($G$, $H$) per bucket, and finally find the best split point over these bucket statistics.

The following figure gives a concrete example of the approximate algorithm, using the one-third quantiles:

Sort the samples by feature value, divide them into buckets according to the quantiles, accumulate the $g_i$ and $h_i$ values within each of the three buckets, and finally compute the gain of each candidate node split.
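
A rough sketch of this bucketing idea under the Global strategy is shown below; the function name and the use of plain quantiles are illustrative assumptions, not XGBoost's internal implementation:

```python
import numpy as np

def approx_best_split(x, g, h, n_buckets=3, lam=1.0, gamma=0.0):
    # Candidate split points from the feature's quantiles.
    qs = np.quantile(x, np.linspace(0, 1, n_buckets + 1)[1:-1])
    bucket = np.searchsorted(qs, x)               # bucket id for each sample
    G = np.bincount(bucket, weights=g, minlength=n_buckets)
    H = np.bincount(bucket, weights=h, minlength=n_buckets)
    Gtot, Htot = G.sum(), H.sum()
    best_gain, best_q = -np.inf, None
    GL = HL = 0.0
    for j in range(n_buckets - 1):                # split at each bucket boundary
        GL += G[j]; HL += H[j]
        GR, HR = Gtot - GL, Htot - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - Gtot**2 / (Htot + lam)) - gamma
        if gain > best_gain:
            best_gain, best_q = gain, qs[j]
    return best_q, best_gain
```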

1.1.4 Weighted Quantile Sketch

In fact, XGBoost does not take quantiles simply by sample count; instead, it uses the second derivative $h_i$ as the weight of each sample, as shown below:

So the question is: Why do we use sample weighting?

We know that the objective function of the model is:

$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

Rearranging it slightly (completing the square), we can see a weighting effect on the loss:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \frac{1}{2} h_i \left( f_t(x_i) + \frac{g_i}{h_i} \right)^{2} + \Omega(f_t) + C$$

where $\frac{g_i}{h_i}$ and $C$ are constants with respect to $f_t$. We can see that $h_i$ plays the role of the sample's weight in this squared loss.

For data sets in which every sample has the same weight there is an existing algorithm for finding candidate quantiles (the GK algorithm), but how can candidate quantiles be found when the sample weights differ? (The authors propose a Weighted Quantile Sketch algorithm, which is not detailed here.)
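
The following is a much-simplified illustration of the idea (not the paper's Weighted Quantile Sketch data structure): candidate points are chosen so that each bucket holds roughly the same total Hessian weight rather than the same number of samples.

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.1):
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], h[order]
    rank = np.cumsum(h_sorted) / h_sorted.sum()   # weighted rank in (0, 1]
    targets = np.arange(eps, 1.0, eps)            # roughly 1/eps candidates
    idx = np.searchsorted(rank, targets)
    return np.unique(x_sorted[np.clip(idx, 0, len(x) - 1)])

x = np.random.randn(1000)
h = np.random.rand(1000)          # e.g. second derivatives from the last round
print(weighted_quantile_candidates(x, h, eps=0.25))
```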

1.1.5 Sparse sensing algorithm

In the first article on decision trees we introduced how the CART tree handles missing data when splitting; XGBoost also gives its own solution.

During the construction of a tree node, XGBoost traverses only the non-missing values and adds a default direction to each node: when a sample's value for the split feature is missing, the sample is routed to the default direction. The best default direction can be learned from the data, and learning it is in fact very simple: enumerate sending all the missing-value samples to the left branch and then to the right branch, compute the gain after each assignment, and take the direction with the larger gain as the optimal default direction.

At first glance, enumerating the missing-value samples while building the tree seems to double the amount of computation, but in fact the algorithm only traverses the samples with non-missing feature values when building the tree; the missing-value samples do not need to be traversed and are simply assigned directly to the left or right node, so the number of samples the algorithm must traverse is reduced. The following figure shows that the sparsity-aware algorithm is more than 50 times faster than the basic algorithm.
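
A sketch of learning the default direction is shown below; the helper names are illustrative, and the code assumes missing values are encoded as NaN:

```python
import numpy as np

def gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    G, H = GL + GR, HL + HR
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - G**2 / (H + lam)) - gamma

def split_with_default_direction(x, g, h, threshold):
    present = ~np.isnan(x)
    left = present & (x < threshold)
    right = present & ~left
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[right].sum(), h[right].sum()
    Gm, Hm = g[~present].sum(), h[~present].sum()   # stats of missing samples
    gain_left = gain(GL + Gm, HL + Hm, GR, HR)      # missing -> left
    gain_right = gain(GL, HL, GR + Gm, HR + Hm)     # missing -> right
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)
```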

1.2 Engineering implementation

1.2.1 Block structure design

We know that one of the most time-consuming steps in learning a decision tree is sorting the feature values every time a best split point is sought. XGBoost sorts the data by feature before training and stores the result in a block structure, using the Compressed Sparse Column (CSC) format inside each block. This block structure is reused repeatedly throughout the subsequent training process, which greatly reduces the amount of computation.

  • Each block structure contains one or more features, already sorted;
  • Missing feature values are not stored or sorted;
  • Each feature value stores an index pointing to the corresponding sample's gradient statistics, which makes it convenient to look up the first and second derivatives;

The features stored in the block structure are independent of each other, which makes it convenient for the computer to compute in parallel. When splitting a node, the feature with the largest gain must be selected, and the gain computation for every feature can be carried out simultaneously; this is also why XGBoost can support distributed and multi-threaded computation.
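
A minimal sketch of the presorted column-block idea, assuming NumPy and SciPy are available; this mirrors the idea of sorting each feature column once and reusing the order, not XGBoost's actual block layout:

```python
import numpy as np
from scipy import sparse

X = np.random.randn(8, 3)
X[np.random.rand(*X.shape) < 0.3] = 0.0          # sparsify: zeros are omitted

Xcsc = sparse.csc_matrix(X)                      # column-wise compressed storage
sorted_rows_per_col = []
for j in range(Xcsc.shape[1]):
    col = Xcsc.getcol(j)
    rows, vals = col.indices, col.data           # only stored (non-missing) entries
    order = np.argsort(vals)                     # sorted once, reused every split
    sorted_rows_per_col.append(rows[order])      # index back to sample gradients
```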

1.2.2 Cache access optimization algorithm

The block structure reduces the amount of computation when splitting a node, but looking up the sample gradient statistics through the index stored with each feature value makes the memory accesses non-contiguous, which lowers the cache hit rate and therefore hurts the efficiency of the algorithm.

To address the low cache hit rate, XGBoost proposes a cache-aware access algorithm: allocate a contiguous buffer for each thread and prefetch the required gradient information into that buffer, so that non-contiguous accesses are converted into contiguous ones and the efficiency of the algorithm is improved.

In addition, adjusting the block size appropriately can also help cache optimization.

1.2.3 "Out-of-core" block calculation

When the amount of data is too large to load entirely into memory, the portion that does not fit has to be written temporarily to disk and loaded again when needed. This inevitably causes wasted resources and performance bottlenecks because memory and disk speeds differ. To solve this, XGBoost dedicates a separate thread to reading data from disk, so that computation and data loading can proceed at the same time.

In addition, XGBoost also uses two methods to reduce the overhead of hard disk read and write:

  • Block compression: compress the Block by column and decompress it when reading;
  • Block splitting: Store each block in a different disk, and reading from multiple disks can increase throughput.

1.3 Advantages and disadvantages

1.3.1 Advantages

  1. Higher accuracy: GBDT uses only a first-order Taylor expansion, while XGBoost expands the loss function to second order. The second-order information improves accuracy on the one hand and, on the other, makes it possible to customize the loss function, since a second-order Taylor expansion can approximate a large family of loss functions;
  2. Greater flexibility: GBDT uses CART as the base learner, while XGBoost supports not only CART but also linear base learners (XGBoost with a linear base learner is equivalent to logistic regression for classification, or linear regression for regression, with L1 and L2 regularization). In addition, XGBoost supports custom loss functions, as long as they are first- and second-order differentiable;
  3. Regularization: XGBoost adds a regularization term to the objective function to control model complexity. The term contains the number of leaf nodes and the L2 norm of the leaf weights. Regularization reduces the variance of the model, makes the learned model simpler, and helps prevent overfitting;
  4. Shrinkage: equivalent to a learning rate. After each iteration XGBoost multiplies the leaf weights by this coefficient, mainly to weaken the influence of each individual tree and leave more room for later trees to learn;
  5. Column sampling: XGBoost borrows the idea from random forests and supports column subsampling, which both reduces overfitting and reduces computation;
  6. Missing value handling: the sparsity-aware algorithm adopted by XGBoost greatly accelerates node splitting;
  7. Parallel computation: the block structure supports parallel computation well. Several of these points appear as parameters in the sketch after this list.
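
The hedged example below shows how several of the points above surface as xgboost training parameters; the values are purely illustrative, not recommendations.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eta": 0.1,                # shrinkage / learning rate
    "gamma": 1.0,              # minimum gain to split (the gamma penalty above)
    "lambda": 1.0,             # L2 regularization on leaf weights
    "colsample_bytree": 0.8,   # column sampling
    "max_depth": 6,
    "tree_method": "hist",     # histogram-based approximate split finding
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```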

1.3.2 Disadvantages

  1. Although pre-sorting and the approximate algorithm reduce the computation needed to find the best split point, the data set still has to be traversed during node splitting;
  2. The space complexity of pre-sorting is high: it stores not only the feature values but also the index from each feature value to the corresponding sample's gradient statistics, which effectively doubles the memory consumption.

2. LightGBM

LightGBM was proposed by Microsoft, mainly to solve the problems GBDT encounters on massive data, so that GBDT can be used better and faster in industrial practice.

From the name we can see that LightGBM is a lightweight (Light) gradient boosting machine (GBM). Compared with XGBoost it trains faster and uses less memory. The following figure compares the memory usage and training time of XGBoost, XGBoost_hist (XGBoost with gradient histograms) and LightGBM on different data sets:

So how does LightGBM achieve faster training speed and lower memory usage?

We have just analyzed the shortcomings of XGBoost, and LightGBM has proposed the following solutions to solve these problems:

  1. Gradient-based One-Side Sampling (GOSS);
  2. Histogram algorithm;
  3. Exclusive Feature Bundling (EFB);
  4. Leaf-wise growth strategy with a depth limit;
  5. Optimal split for categorical features;
  6. Feature parallelism and data parallelism;
  7. Cache optimization.

This section will continue to introduce LightGBM from the perspectives of mathematical principles and engineering implementation.

2.1 Mathematical principles

2.1.1 Gradient-based One-Side Sampling (GOSS)

In GBDT, the magnitude of a sample's gradient reflects its weight: the smaller the gradient, the better the sample is already fitted. Gradient-based One-Side Sampling (GOSS) exploits this by discarding a large fraction of the small-gradient samples and focusing the subsequent computation on the samples with large gradients, which greatly reduces the amount of calculation.

The GOSS algorithm keeps the samples with large gradients and randomly samples the ones with small gradients. In order not to change the data distribution, a constant factor is introduced to re-weight the small-gradient samples when computing the gain. The specific algorithm is as follows:

We can see that GOSS first sorts the samples by the absolute value of their gradients (the sorted result does not need to be saved), takes the top a% of samples with large gradients, and randomly samples b% of the remaining samples. When computing the gain, the sampled small-gradient samples are multiplied by the factor (1 − a) / b to amplify their weight. On the one hand the algorithm pays more attention to under-trained samples, and on the other hand the re-weighting prevents the sampling from distorting the original data distribution too much.
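
A sketch of the sampling step described above (illustrative code, not LightGBM's implementation):

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    n = len(grad)
    order = np.argsort(-np.abs(grad))
    top_k = int(a * n)
    top_idx = order[:top_k]                          # large-gradient samples
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] *= (1.0 - a) / b                # compensate the sampling
    keep = np.concatenate([top_idx, sampled])
    return keep, weights[keep]

grad = np.random.randn(10000)
idx, w = goss_sample(grad)
print(len(idx), w.min(), w.max())
```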

2.1.2 Histogram Algorithm

1) Histogram algorithm

The basic idea of the histogram algorithm is to discretize continuous feature values into k discrete bins and build a histogram of width k to accumulate the statistics (it contains k bins). With the histogram algorithm we no longer need to traverse every data point; traversing the k bins is enough to find the best split point.

We know that discretizing features has many advantages, such as easier storage, faster computation, stronger robustness, and more stable models. For the histogram algorithm the two most direct advantages are the following (taking k = 256 as an example):

  • Lower memory footprint: XGBoost needs 32-bit floating-point numbers to store feature values and 32-bit integers to store indices, while LightGBM only needs 8 bits to store the bin values, reducing memory to roughly 1/8;
  • Lower computational cost: when computing the split gain of a feature, XGBoost must traverse the data once to find the best split point, while LightGBM only needs to traverse the k bins, reducing the time complexity from O(#data × #feature) to O(k × #feature), with k ≪ #data.

Although the exact split point can no longer be found after discretizing the features, which may affect the accuracy of the model somewhat, the coarser splits also act as a form of regularization and reduce the variance of the model to a certain extent.
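
The following sketch illustrates histogram-based split finding on a single feature: bin the values once, accumulate per-bin G and H, and scan the k bins instead of all samples. It is a simplified illustration using plain quantile binning, not LightGBM's actual binning code.

```python
import numpy as np

def histogram_split(x, g, h, k=256, lam=1.0):
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])  # k-1 bin boundaries
    bins = np.searchsorted(edges, x)                         # bin id of each sample
    G = np.bincount(bins, weights=g, minlength=k)            # per-bin sum of g_i
    H = np.bincount(bins, weights=h, minlength=k)            # per-bin sum of h_i
    Gtot, Htot = G.sum(), H.sum()
    GL, HL = np.cumsum(G)[:-1], np.cumsum(H)[:-1]            # split after each bin
    GR, HR = Gtot - GL, Htot - HL
    gains = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                   - Gtot**2 / (Htot + lam))
    best = int(np.argmax(gains))
    return edges[best], gains[best]
```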

2) Histogram acceleration

When constructing the histogram of a leaf node, we can also obtain it by subtracting the histogram of its sibling leaf from the histogram of the parent node, which halves the computation. In practice, we first build the histogram for the leaf with fewer samples and then obtain the histogram of the larger leaf by subtraction.

3) Sparse feature optimization

XGBoost only considers non-zero values for acceleration during pre-sorting, and LightGBM uses a similar strategy: only non-zero feature values are used to construct the histograms.

2.1.3 Mutually exclusive feature bundling algorithm

High-dimensional features are often sparse, and features may be mutually exclusive (for example, two features that never take non-zero values at the same time). If two features are not completely mutually exclusive (for example, they take non-zero values simultaneously only in a fraction of the cases), the degree of exclusivity can be measured by a conflict rate. The Exclusive Feature Bundling (EFB) algorithm points out that bundling such features together reduces the number of features.

In response to this idea, we will encounter two problems:

  1. Which features can be bound together?
  2. After the feature is bound, how to determine the feature value?

For question 1: the EFB algorithm builds a weighted undirected graph from the pairwise relations between features and turns the problem into graph coloring. We know that graph coloring is NP-hard, so a greedy algorithm is used to obtain an approximate solution. The specific steps are as follows:

  1. Build a weighted undirected graph whose vertices are the features and whose edge weights are the degree of conflict between two features;
  2. Sort the features in descending order of their degree: the larger the degree, the more a feature conflicts with the others;
  3. Traverse each feature and assign it to an existing bundle or create a new bundle, so that the overall conflict is minimized.

The algorithm allows features that are not completely mutually exclusive to be bundled, which further reduces the number of resulting feature bundles, and it balances accuracy and efficiency by setting a maximum conflict rate. The pseudo code of the EFB algorithm is as follows:

We can see that the time complexity is O(#feature²), which is manageable when the number of features is not large, but when the feature dimension reaches the million level the computation becomes very heavy. To improve efficiency, the paper proposes a faster variant: instead of building the graph and sorting by node degree, sort the features by their count of non-zero values, because the more non-zero values a feature has, the higher the probability of conflict.

For question 2: the paper gives a feature-merging algorithm whose key requirement is that the original features can still be distinguished inside the merged feature. Suppose a bundle contains two features, A with values in [0, 10] and B with values in [0, 20]. To keep A and B separable, we add an offset of 10 to feature B so that it takes values in [10, 30]; the bundled feature then takes values in [0, 30], and the merge is complete. The specific algorithm is as follows:
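
A toy illustration of the offset trick on the A/B example above (illustrative code only):

```python
import numpy as np

A = np.array([3.0, 0.0, 7.0, 0.0, 0.0])   # zero means "value not taken"
B = np.array([0.0, 12.0, 0.0, 5.0, 0.0])

offset = 10.0                              # upper bound of A's value range
bundled = np.where(B != 0, B + offset, A)  # B is shifted into (10, 30]
print(bundled)                             # [ 3. 22.  7. 15.  0.]

# Decoding: values > offset came from B, values in (0, offset] came from A.
decoded_is_B = bundled > offset
```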

2.1.4 Leaf-wise algorithm with depth limitation

There are two strategies in the process of building a tree:

  • Level-wise: grow based on layers until the stop condition is reached;
  • Leaf-wise: The leaf node with the largest gain is split each time until the stop condition is reached.

XGBoost adopts the Level-wise growth strategy, which makes it easy to split the nodes of each layer in parallel and improves training speed; however, because many nodes have very small gain, it also performs many unnecessary splits and increases computation. LightGBM adopts the Leaf-wise growth strategy, which reduces computation and, combined with a maximum-depth limit, prevents overfitting; but since the node with the largest gain must be found at each step, the splits cannot be carried out in parallel.
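
A hedged example of how this strategy is exposed in LightGBM's Python API, assuming the lightgbm and scikit-learn packages are installed: num_leaves bounds the leaf-wise growth and max_depth adds the depth limit. The values are illustrative only.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    "num_leaves": 31,        # leaf-wise growth: limit on the number of leaves
    "max_depth": 8,          # depth limit that complements leaf-wise growth
    "learning_rate": 0.1,
    "max_bin": 255,          # histogram algorithm: number of bins per feature
}
booster = lgb.train(params, train_set, num_boost_round=100)
```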

2.1.5 Optimal segmentation of category features

Most machine learning algorithms cannot handle categorical features directly, so categorical features are usually encoded before being fed into the model. The most common encoding is one-hot, but we know that one-hot encoding is not recommended for decision trees:

  1. It causes imbalanced splits, and the split gain becomes very small. For example, splitting on nationality produces a series of features such as is-China and is-USA; in each of them only a few samples are 1 and the vast majority are 0. The gain of such a split is tiny: the smaller child contains too small a fraction of the samples, so however large its gain looks, it is almost negligible after multiplying by that fraction, while the larger child is nearly the original sample set and its gain is close to zero;
  2. It harms decision tree learning. Decision trees rely on statistics of the data, and one-hot encoding scatters the data into many small fragments; the statistics computed in these small fragments are inaccurate, so learning deteriorates. In essence, one-hot encoding weakens the expressive power of the feature: its predictive power is artificially split into many pieces, none of which can compete with the other features for the best split point, so the measured importance of the feature ends up lower than it really is.

LightGBM natively supports categorical features and uses a many-vs-many split that partitions the categories into two subsets to find the optimal split. If a feature has k categories, there are $2^{k-1} - 1$ possible partitions and the time complexity is $O(2^k)$; based on Fisher's "On Grouping for Maximum Homogeneity", LightGBM achieves a time complexity of $O(k \log k)$.

In the figure above, the left panel shows a split based on one-hot encoding, and the right panel shows LightGBM's many-vs-many split. For the same depth, the latter can learn a better model.

The basic idea is that, at each split, the categories are re-ordered according to the training objective: the histogram of the categorical feature is sorted by its accumulated statistics, and the best split is then searched on the sorted histogram. In addition, LightGBM adds regularization constraints to this procedure to prevent overfitting.

We can see that this way of processing category features increases the AUC by 1.5 points, and the time is only 20% longer.
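
A hedged usage example of the native categorical support in LightGBM's Python API; the data and the column index passed to categorical_feature are illustrative assumptions.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 30, n),      # column 0: a categorical feature (30 categories)
    rng.normal(size=n),          # column 1: a numeric feature
])
y = (X[:, 0] % 3 == 0).astype(int)

train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
booster = lgb.train({"objective": "binary", "num_leaves": 31},
                    train_set, num_boost_round=50)
```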

2.2 Engineering implementation

2.2.1 Feature Parallel

The traditional feature-parallel algorithm partitions the data vertically (by feature), lets different machines find the optimal split point for different features, obtains the global best split by integrating the local results through communication, and then notifies the other machines of the resulting partition, again through communication.

Traditional feature parallelism has a big drawback: every machine must be informed of the final partition of the samples, which adds extra complexity (because the data is partitioned vertically, each machine holds different data, and the partition result has to be communicated).

LightGBM does not partition the data vertically; every machine holds the complete training data. Once the best split is determined, each machine can perform the split locally, which removes the unnecessary communication.

2.2.2 Data Parallel

The traditional data parallel strategy is mainly to divide the data horizontally, then build a histogram locally and integrate it into a global histogram, and finally find the best division point in the global histogram.

This kind of partitioning has a big drawback: the communication overhead is too large. With point-to-point communication, the cost per machine is approximately O(#machine × #feature × #bin); with collective communication, the cost is approximately O(2 × #feature × #bin).

LightGBM uses the Reduce Scatter operation to distribute the histogram-merging work across different machines, which lowers the communication cost, and it further reduces the communication between machines through histogram subtraction.

2.2.3 Parallel voting

When both the amount of data and the number of features are particularly large, voting parallelism can be used. Voting parallelism mainly optimizes the large communication cost of merging histograms in data parallelism: through voting, only the histograms of a subset of features are merged, which reduces the amount of communication.

It works in two steps:

  1. Find the Top-K features locally and use voting to select the features that are likely to contain the best split point;
  2. When merging, merge only the histograms of the features selected by each machine.

2.2.4 Cache optimization

As mentioned above, XGBoost's pre-sorted features access the sample gradient statistics through indices; because this indexed access is not contiguous, XGBoost introduces a cache-aware access algorithm to compensate.

The histogram algorithm used by LightGBM is inherently friendly to Cache:

  1. First, all features obtain the gradients in the same way (rather than through different feature-specific indices, as in XGBoost), so sorting the gradients once yields contiguous access and greatly improves cache hits;
  2. Second, because there is no need to store an index from features to samples, storage consumption is reduced and there is no cache-miss problem caused by indexed lookups.

2.3 Comparison with XGBoost

This section mainly summarizes the advantages of LightGBM over XGBoost, and introduces it from two aspects of memory and speed.

2.3.1 Less memory

  1. After pre-sorting, XGBoost must store both the feature values and the indices from feature values to the corresponding sample's gradient statistics, while LightGBM's histogram algorithm converts feature values into bin values and needs no feature-to-sample index, which lowers the space complexity and greatly reduces memory consumption;
  2. LightGBM uses a histogram algorithm to convert stored feature values ​​into stored bin values, which reduces memory consumption;
  3. LightGBM uses a mutually exclusive feature bundling algorithm in the training process to reduce the number of features and reduce memory consumption.

2.3.2 Faster

  1. LightGBM uses the histogram algorithm to turn the traversal of samples into a traversal of histogram bins, which greatly reduces the time complexity;
  2. LightGBM uses gradient-based one-side sampling (GOSS) to filter out samples with small gradients during training, which saves a lot of computation;
  3. LightGBM builds trees with the Leaf-wise growth strategy, which avoids a lot of unnecessary computation;
  4. LightGBM uses optimized feature-parallel and data-parallel methods to accelerate computation, and when the amount of data is very large it can also use the voting-parallel strategy;
  5. LightGBM also optimizes the cache, increasing the cache hit rate.

references

  1. Tianqi Chen and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System".
  2. Tianqi Chen's paper presentation slides (PPT).
  3. "What are the differences between GBDT and XGBoost in machine learning algorithms?" - Wepon's answer - Zhihu.
  4. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree".
  5. LightGBM documentation.
  6. "Paper reading - the principles of LightGBM".
  7. "LightGBM in machine learning algorithms".
  8. "Should decision trees in sklearn use one-hot encoding?" - Ke Guolin's answer - Zhihu.
  9. "How to play with LightGBM".
  10. "A Communication-Efficient Parallel Algorithm for Decision Tree".


Origin blog.csdn.net/qq_36816848/article/details/114690291