[Machine Learning] An Explanation of Ensemble Learning

1. Ensemble learning

1.1 Preface

In supervised machine learning, our goal is to learn a model that is stable and performs well in every respect.

  • In practice, however, things are rarely so ideal: often we can only obtain several biased models (weak models that perform well only in certain respects).

Ensemble learning emerged to combine multiple weak models into a better, more comprehensive strong model.

  • The underlying idea of ensemble learning is that even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error.
    • A single learner is called a weak learner (usually a learner whose generalization performance is only slightly better than random guessing, e.g. a classifier with accuracy slightly above 50% on a binary classification problem).
    • The ensemble, in contrast, is a strong learner (a classifier that, by combining several weak learners in some way, achieves higher accuracy than any of them).

Ensemble learning is not itself a single machine learning algorithm; it completes a learning task by building and combining multiple learners. Ensemble learning can be used for

  • ensembles for classification, ensembles for regression, ensembles for feature selection, ensembles for outlier detection, and so on; ensemble learning can be found in virtually every area of machine learning.

Common ensemble learning methods include Boosting, Bagging (Bootstrap Aggregating), Voting (which, strictly speaking, is not really an ensemble algorithm in its own right) and Stacking.

1.2 What is ensemble learning

To summarize the idea of ensemble learning: for a given training set, we train several individual learners and combine them through some combination strategy to complete the learning task; the result is often a strong learner that is significantly better than any single learner.
Ensemble learning is a framework that combines base models according to different ideas in order to achieve a better result. It has two main problems to solve:

  • The first is how to obtain several individual learners;
  • The second is how to choose a combination strategy that merges these individual learners into a strong learner.

2. How to get several individual learners

There are two options for how to get several individual learners.

  • The first is that all individual learners are of the same type, i.e. homogeneous.
    • For example, all of them are decision trees, or all of them are neural networks; the Bagging and Boosting families work this way.
  • The second is that the individual learners are not all of the same type, i.e. heterogeneous.
    • For example, for a classification problem we might train a support vector machine, a logistic regression model and a naive Bayes model on the training set, and then determine the final strong classifier through some combination strategy. This kind of ensemble is called Stacking.

Homogeneous individual learners can be divided into two categories according to whether there are dependencies between them.

  • In the first, there are strong dependencies between individual learners, so the learners essentially have to be generated serially; the representative algorithms are the Boosting family.
  • In the second, there are no strong dependencies between individual learners, so the learners can be generated in parallel; the representative algorithms are the Bagging family.

2.1 Bagging

  • Representative algorithm: Random Forest

Bagging (train multiple classifiers and average their outputs): sub-sample the training set to form the sub-training set required by each base model, then combine the predictions of all base models to produce the final prediction.
1) Bagging working mechanism (see the sketch after this list):

  • Draw training sets from the original sample set: in each round, n training samples are drawn from the original set using bootstrap sampling (some samples may be drawn several times, others not at all). After k rounds we obtain k training sets, which are independent of each other;
  • Each training set is used to train one model, so k training sets yield k models. (No particular classification or regression algorithm is prescribed here; depending on the problem we can use decision trees, perceptrons, and so on.);
  • For classification problems: the k models vote to obtain the classification result;
  • For regression problems: the mean of the k model outputs is taken as the final result (all models carry equal weight).
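A minimal sketch of this working mechanism, assuming scikit-learn decision trees as the base models and the Iris data as the sample set; the helper `bagging_predict` is a made-up name for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
k, n = 10, len(X)          # k rounds of bootstrap sampling, n samples per round

models = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)                 # draw n samples with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagging_predict(X_new):
    """Majority vote of the k models (for regression, take the mean instead)."""
    votes = np.array([m.predict(X_new) for m in models])     # shape (k, n_new)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

print(bagging_predict(X[:5]))      # predictions for the first five samples
```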

2) Random sampling of the training set
The training sets of Bagging's individual weak learners are obtained by random sampling.

  • Through T rounds of random sampling we obtain T sampling sets. On these T sampling sets we can independently train T weak learners, and then combine them with a combination strategy to obtain the final strong learner.
  • The random sampling here generally uses Bootstrap sampling: from the original training set of m samples we randomly draw one sample at a time, add it to the sampling set, and then put it back, so the same sample may be drawn again later. After m draws we obtain a sampling set of m samples. Because the sampling is random, each sampling set differs from the original training set and from the other sampling sets, which yields multiple different weak learners. A small illustration follows.
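A tiny illustration of Bootstrap sampling on a toy set of m = 10 indices (the numbers are arbitrary); it shows that some samples are drawn several times while others are left out, so each sampling set differs from the original.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10
original = np.arange(m)                               # indices of the original training set

sample = rng.choice(original, size=m, replace=True)   # m draws with replacement
print("bootstrap sample:", sample)
print("left out this round:", sorted(set(original) - set(sample)))
```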

3) Random Forest
Random Forest: many decision trees built in parallel; the data samples are drawn at random with replacement, and the candidate features at each split are also chosen at random. It is a specialized, enhanced version of bagging.

  • It is "specialized" because the weak learners of a random forest are all decision trees.
  • It is "enhanced" because random forest adds random feature selection on top of bagging's random sample selection, while its basic idea stays within the bagging framework. A brief example follows.
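A brief example using scikit-learn's RandomForestClassifier; the parameter values are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,       # number of decision trees grown in parallel
    max_features="sqrt",    # random subset of features considered at each split
    bootstrap=True,         # random sampling of rows with replacement, as in bagging
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```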

2.2 Boosting

  • Representative algorithms: AdaBoost, XGBoost, GBDT

Boosting (a boosting method: start from weak learners and train with re-weighting): training proceeds in stages, and the base models are trained one after another in order (serial in theory, although parts of it can be parallelized in practice).

  • The training set of the base model is transformed at every step according to some strategy: if a sample was misclassified in this round, it is given a larger weight in the next round. A linear combination of the predictions of all base models produces the final prediction.

1) Boosting working mechanism:

  • First, weak learner 1 is trained on the training set with the initial weights;
  • Then the training-sample weights are updated according to the error of weak learner 1, so that the samples it misclassified receive higher weights, i.e. the points with high error rates get more attention from the subsequent weak learner 2;
  • Weak learner 2 is then trained on the re-weighted training set;
  • This is repeated until the number of weak learners reaches the pre-specified number T;
  • Finally, the T weak learners are combined through an ensemble strategy to obtain the final strong learner. A brief example follows.
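A brief AdaBoost example with scikit-learn; by default its weak learners are depth-1 decision trees (stumps), and the sample weights are re-adjusted after every round as described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    n_estimators=50,     # T: the number of weak learners trained one after another
    learning_rate=1.0,   # contribution of each weak learner to the additive model
    random_state=0,
)
ada.fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
```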

2) Two core issues about Boosting:

  • How are the weights or the probability distribution of the training data changed in each round?

    • By increasing the weights of the samples that the previous round's weak classifier misclassified and decreasing the weights of the samples it classified correctly, so that the classifier works better on the previously misclassified data.
  • By what means are weak classifiers combined?

    • The weak classifiers are linearly combined through an additive model.
      • For example, the AdaBoost (Adaptive Boosting) algorithm: at the start of training every example is given an equal weight; the training set is then trained for t rounds, and after each round the examples that were misclassified receive larger weights, so that the learning algorithm pays more attention to the hard examples in later rounds. This yields multiple prediction functions; the error is reduced step by step, and the models generated at each step are added together to obtain the final model.
      • GBDT (Gradient Boosting Decision Tree): each step reduces the residual left by the previous step; GBDT builds the new model in the direction that reduces the residual (the negative gradient). A toy residual-fitting sketch follows.
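A toy sketch of the "fit the residual at each step" idea behind GBDT, using shallow regression trees on synthetic data; it illustrates the principle, not the exact algorithm of any particular GBDT library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)        # start from a constant prediction of 0
trees = []

for _ in range(100):
    residual = y - prediction                        # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)    # step in the residual-reducing direction

print("training MSE:", np.mean((y - prediction) ** 2))
```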

2.3 Stacking

Stacking (stack various classifiers such as KNN, SVM, RF, etc.) works in two stages:

  • In the first stage, each base model is trained on the input features and produces its own predictions;
  • In the second stage, the outputs of the first stage are used as training data to obtain the final classification result.
    • In other words, a model is trained to combine the results of the base models: all trained base models predict on the training set, and the prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set; the final model is then trained on this new training set.
      • Equivalently, if there are n samples and k base learners, the new training set has n samples, each described by k features (an n × k matrix).
    • Likewise, at prediction time the test set is first passed through all the base models to form a new test set, and the final prediction is made on that new test set.


1) Stacking working mechanism:

  • First, several different models are trained;
  • Then the outputs of these trained models are used as inputs to train another model, which produces the final output.

2) Stacking trains a model to combine the other base models.

  • Concretely, the data is split into two parts: one part is used to train several base models A1, A2, A3; the other part is fed to these base models, and their outputs are used as inputs to train the combining model B.
  • Note that what is combined is the models, not merely their results. In theory Stacking can combine arbitrary models; in practice a single-layer logistic regression is often used as the combining model. (It may look as if only the results are being combined, but combining the results of the models is, in essence, a way of combining the models.)
  • The stacking model is also implemented in sklearn as StackingClassifier; a short example follows.
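A short StackingClassifier example, assuming KNN, SVM and a random forest as the base models (A1, A2, A3) and logistic regression as the combining model B.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("svm", SVC()),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # model B combining the base outputs
)
stack.fit(X_tr, y_tr)
print("test accuracy:", stack.score(X_te, y_te))
```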

2.4 Summary

In summary, according to the relationship between individual learners, ensemble learning is generally divided into three categories: Bagging, Boosting and Stacking (which can loosely be thought of as parallel, serial and tree-structured).

  • Bagging combines the results of the base models and takes an averaged, compromise result;
  • Boosting trains each new model on the mistakes of the previous ones, improving layer by layer;
  • Stacking combines the base models themselves, not just their results; this approach is more flexible but also more complex.

3. How to choose a combination strategy

Suppose the T weak learners we obtain are $\{h_1, h_2, \dots, h_T\}$.

3.1 Average method

For numerical regression problems, the most common combination strategy is averaging: the outputs of the weak learners are averaged to obtain the final prediction.

  • The simplest average is the arithmetic mean, i.e. the final prediction is $H(x) = \frac{1}{T}\sum_{i=1}^{T} h_i(x)$.

  • If each individual learner $h_i$ has a weight $w_i$, the final prediction is $H(x) = \sum_{i=1}^{T} w_i h_i(x)$, where $w_i$ is the weight of learner $h_i$, usually with $w_i \ge 0$ and $\sum_{i=1}^{T} w_i = 1$. A tiny numerical example follows.
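A tiny numerical illustration of the two averaging formulas, with made-up predictions from T = 3 weak regressors.

```python
import numpy as np

h = np.array([2.9, 3.4, 3.1])    # h_i(x): predictions of the three weak learners
w = np.array([0.2, 0.5, 0.3])    # weights w_i, non-negative and summing to 1

print("arithmetic mean:", h.mean())        # (1/T) * sum_i h_i(x)
print("weighted average:", float(w @ h))   # sum_i w_i * h_i(x)
```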

3.2 Voting method

For classification problems we usually use voting. Suppose the prediction categories are $\{c_1, c_2, \dots, c_K\}$; for any sample x, the predictions of the T weak learners are $\{h_1(x), h_2(x), \dots, h_T(x)\}$.

  • The simplest is relative majority (plurality) voting, i.e. "the minority obeys the majority": among the T weak learners' predictions for sample x, the category $c_i$ with the most votes is the final class. If several categories tie for the most votes, one of them is chosen at random.
  • A slightly more complicated method is absolute majority voting, i.e. requiring more than half of the votes: on top of the plurality rule, the winning category must also receive over half of all votes; otherwise the prediction is rejected.
  • More complicated still is weighted voting: as in the weighted average, each weak learner's vote is multiplied by a weight, the weighted votes for each category are summed, and the category with the largest total is the final class. A short example follows.
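A short example of plurality ("hard") voting and weighted voting with scikit-learn's VotingClassifier; the weights chosen here are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]

hard = VotingClassifier(estimators, voting="hard").fit(X_tr, y_tr)  # relative majority vote
weighted = VotingClassifier(estimators, voting="soft",
                            weights=[2, 1, 1]).fit(X_tr, y_tr)      # weighted vote
print("hard voting accuracy:", hard.score(X_te, y_te))
print("weighted voting accuracy:", weighted.score(X_te, y_te))
```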

3.3 Learning method

The two types of methods above simply average or vote on the outputs of the weak learners. They are easy to apply, but the resulting learning error may be large, which motivates the learning method.

  • The representative learning method is Stacking. When stacking is used as the combination strategy, instead of doing simple logical processing on the outputs of the weak learners, we add another layer of learner: the outputs of the weak learners are used as inputs, the labels of the training set are used as targets, and a new learner is trained to produce the final result.
  • In this case the weak learners are called primary learners and the learner used for combining is called the secondary learner.
  • For the test set, we first use the primary learners to predict once, obtaining the input samples for the secondary learner, and then use the secondary learner to predict once more to obtain the final result. A sketch follows.
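A hedged sketch of the learning method: out-of-fold predictions of the primary learners become the input features of the secondary learner; the model choices here are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

primaries = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]

# new training set: one column of out-of-fold predictions per primary learner
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in primaries])
secondary = LogisticRegression().fit(Z_tr, y_tr)

# prediction: run the primary learners first, then the secondary learner
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in primaries])
print("test accuracy:", secondary.score(Z_te, y_te))
```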

4. Summary

An ensemble method is a meta-algorithm that combines several machine learning techniques into one predictive model in order to reduce variance (bagging), reduce bias (boosting), or improve predictions (stacking). Characteristics of ensemble learning methods:

① It brings together multiple classification methods to improve classification accuracy (these may be different algorithms or the same algorithm).

② An ensemble learning method builds a set of base classifiers from the training data and then classifies by voting on the predictions of the base classifiers.

③ Strictly speaking, ensemble learning is not a classifier but a way of combining classifiers.

④ The classification performance of an ensemble classifier is usually better than that of a single classifier.

⑤ If a single classifier is compared to one decision maker, ensemble learning is like multiple decision makers making a decision together.

5. Supplement

1) The difference between Bagging and Boosting

  • Sample selection:
    • Bagging: each round's training set is drawn from the original set with replacement, and the training sets of different rounds are independent of each other.
    • Boosting: the training set stays the same in every round, but the weight of each sample changes; the weights are adjusted according to the classification results of the previous round.
  • Sample weights:
    • Bagging: uniform sampling; every sample has equal weight.
    • Boosting: the sample weights are continually adjusted according to the error rate; the larger the error rate, the larger the weight.
  • Prediction function:
    • Bagging: all prediction functions are weighted equally.
    • Boosting: each weak classifier has its own weight, and classifiers with smaller classification errors receive larger weights.
  • Parallel computation:
    • Bagging: the individual prediction functions can be generated in parallel.
    • Boosting: the prediction functions can only be generated sequentially, because each model's parameters depend on the results of the previous round.

2) New algorithms obtained by combining decision trees with these frameworks:

  • Bagging + decision trees = Random Forest
  • AdaBoost + decision trees = boosted trees
  • Gradient Boosting + decision trees = GBDT

