Machine Learning Algorithms - Stacking Ensembles

This article draws on the "Kaggle Ensembling Guide" to summarize what ensemble learning is and its more commonly used techniques. The ensemble techniques covered here are for classification tasks; I am less familiar with regression and forecasting, so interested readers can consult the relevant blogs or papers themselves.

1 What is an ensemble of models?

An ensemble method builds an overall model from multiple weak classifier models. What we need to study is:

  • ① The form of the weak classifier models
  • ② How these weak classifiers are combined into a strong classifier

Readers who have learned the basics of machine learning will know that ensemble learning traditionally comes in two flavors: Boosting, represented by AdaBoost, and Bagging, represented by Random Forest. Both belong to the homogeneous ensemble methods. Today I will mainly introduce an ensemble method more widely used in Kaggle competitions, Stacked Generalization (SG), a typical representative of heterogeneous ensembles.

[Figure: several weak learners (grey) and their combined prediction (red) fitted to a temperature-ozone dataset]

2 The concept of Stacked Generalization

As a technique commonly used by top scorers in Kaggle competitions, SG can in some cases reduce the error rate by as much as 30% compared with the current best method.

The following figure gives an example to briefly introduce what SG is:

  • ① Divide the training set into 3 parts, which are used to fit the 3 base classifiers (Base-learners) respectively.
  • ② Use the predictions of the three base classifiers as the input of the next-layer classifier (Meta-learner).
  • ③ Take the output of the next-layer classifier as the final prediction.

[Figure: a two-level stacking example, with three base classifiers whose predictions feed a meta-learner]

The key feature of this model is that, by using the predictions of the first stage (level 0) as features for the next layer, it attains stronger nonlinear expressive power than the individual prediction models and reduces the generalization error. Its goal is to reduce both the bias and the variance of the machine learning model at the same time.

In short, stacked generalization further generalizes the aggregation step of ensemble learning: it replaces the Voting/Averaging used in Bagging and Boosting with a Meta-Learner, thereby reducing bias and variance together. For example, Voting can be realized with kNN, weighted voting with softmax (Logistic Regression), and Averaging with linear regression.
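
To make the idea concrete, here is a minimal hand-rolled sketch of two-level stacking (my own illustration, not code from the original article): out-of-fold predictions of the level-0 learners become the features of a Logistic Regression Meta-learner. It assumes scikit-learn is installed; the dataset and base learners are placeholders.

    # Minimal hand-rolled stacking sketch (illustrative; assumes scikit-learn is available).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_predict, train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    base_learners = [
        DecisionTreeClassifier(max_depth=3, random_state=0),
        KNeighborsClassifier(n_neighbors=5),
        SVC(probability=True, random_state=0),
    ]

    # Level 0: out-of-fold predicted probabilities become the meta-features, so the
    # meta-learner never sees predictions made on data a base learner was trained on.
    meta_train = np.column_stack([
        cross_val_predict(clf, X_train, y_train, cv=5, method="predict_proba")[:, 1]
        for clf in base_learners
    ])

    # Refit each base learner on the full training set to produce test-time meta-features.
    meta_test = np.column_stack([
        clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]
        for clf in base_learners
    ])

    # Level 1: Logistic Regression acts as the Meta-learner (a weighted-voting style combiner).
    meta_learner = LogisticRegression().fit(meta_train, y_train)
    print("Stacked accuracy:", accuracy_score(y_test, meta_learner.predict(meta_test)))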

3 A small example

Voting and Averaging were mentioned above as the classic aggregation methods of homogeneous ensembles. Here we take a classification task as an example to illustrate Voting. So what is Voting?

Voting, as the name suggests, means taking a vote. Assume our test set has 10 samples and the ground truth for every sample is 1 (positive).
We have 3 binary classifiers, each with an accuracy of 70%, denoted A, B, and C. You can think of each classifier as a pseudo-random number generator that outputs "1" with probability 70% and "0" with probability 30%.

Below we explain how ensemble learning, using the principle of majority rule (the minority yields to the majority), raises the accuracy from 70% to about 78.4%.

All three are correct
  0.7 * 0.7 * 0.7
= 0.343

Two are correct
  0.7 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.7
= 0.441

Two are wrong
  0.3 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.3
= 0.189

All three are wrong
  0.3 * 0.3 * 0.3
= 0.027

We see that, in addition to the 34.3% of cases where all three classifiers predict positive, there is a further 44.1% chance (2 positive and 1 negative; by the majority rule the ensemble still predicts positive) that the ensemble is correct. Majority voting therefore gives a final accuracy of about 78.4% (0.343 + 0.441 = 0.784).

Note that all base classifiers are given equal weight here.
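
The arithmetic above can be checked with a few lines of Python (a small illustrative script, not part of the original article):

    # Exact majority-vote accuracy for three independent classifiers, each correct with probability 0.7.
    from itertools import product

    p = 0.7
    majority_acc = 0.0
    for outcomes in product([True, False], repeat=3):       # each classifier is either correct or not
        prob = 1.0
        for correct in outcomes:
            prob *= p if correct else (1 - p)
        if sum(outcomes) >= 2:                               # at least 2 of the 3 are correct
            majority_acc += prob

    print(round(majority_acc, 4))   # prints 0.784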


4 The development of Stacked Generalization

The following content is excerpted from Shi Chunqi https://www.jianshu.com/p/46ccf40222d6

Stacking was first studied and proposed by David H. Wolpert. His 1992 paper Stacked Generalization can be regarded as a more sophisticated version of cross-validation that integrates models in a winner-takes-all fashion.

[Figure: comparison of Stacking with Boosting and Bagging in terms of objective, data, classifiers, and aggregation method]
The figure above contrasts Stacking with Boosting and Bagging in terms of objective, data, classifiers, and aggregation method. In fact, the flexibility of Stacking makes it possible to implement both Bagging and Boosting within its framework.

On the theoretical side, after Wolpert proposed SG in 1992, Leo Breiman combined generalized linear models with the SG method and proposed "Stacked Regressions" in 1996. Later, Mark J. van der Laan of UC Berkeley theoretically proved the effectiveness of the Stacking method when he introduced the Super Learner in 2007.

On the practical side, beyond breakthroughs in SG theory itself, the breadth and depth of SG applications keep expanding: one direction concerns how the training data are split (which gave rise to Blending); another is deep Stacking with three or more levels. In Python, Stacking can currently be implemented with the mlxtend library, as sketched below.
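
For example, a minimal sketch with mlxtend's StackingClassifier might look as follows (the dataset, base learners, and hyperparameters are illustrative choices of mine, not from the original article):

    # Stacking with mlxtend: three heterogeneous base learners, Logistic Regression as Meta-learner.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from mlxtend.classifier import StackingClassifier

    X, y = load_iris(return_X_y=True)

    base_learners = [KNeighborsClassifier(n_neighbors=3),
                     RandomForestClassifier(n_estimators=100, random_state=0),
                     GaussianNB()]
    meta_learner = LogisticRegression()

    stack = StackingClassifier(classifiers=base_learners, meta_classifier=meta_learner)
    scores = cross_val_score(stack, X, y, cv=5)
    print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))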

5 Common Meta-Learner Selections

The contents of this section are excerpted from "Now I think about it, Stacked Generalization" (see References); a code sketch illustrating several of these choices follows the list below.

  1. Meta-Learner of Statistical Methods:

    Voting (Majority-based, Probability-based)

    Averaging (Weighted, Ranked)

  2. Classic easy-to-explain machine learning algorithms:

    Logistic Regression (LR)

    Decision Tree (C4.5)

  3. Non-linear machine learning algorithms:

    Gradient Boosting Machine (GBM, XGBoost),

    Nearest Neighbor (NN),

    k-Nearest Neighbors (k-NN),

    Random Forest (RF)

    Extremely Randomized Trees (ERT).

  4. Weighted Linear / Quadratic Models

    Feature weighted linear stacking

    Quadratic-Linear stacking

  5. Multiple response analysis (non-linear) framework

    Multi-response linear regression (MLR)

    Multi-response model trees (MRMT)

  6. Others: Online Learning, Neural Networks, Genetic Learning, Swarm Intelligence

    6.1 Online learning: Online stacking (OS)

    Linear perceptron with online random tree
    
    Random bit regression (RBR)
    
    Vowpal Wabbit  (VW)
    
    Follow the Regularized Leader (FTRL)
    

    6.2 Artificial neural network (ANN)

      2 layer - ANN
    
      3 layer - ANN
    

    6.3 Genetic learning: Genetic algorithm (GA)

      GA-Stacking
    

    6.4 Swarm intelligence (SI)

    Artificial bee colony algorithm
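
To make the taxonomy above more tangible, the sketch below swaps a few of the listed Meta-learners (Logistic Regression, GBM, k-NN, Random Forest) into the same mlxtend stack. This is my own illustrative comparison with placeholder data and base learners, not an experiment from the cited article.

    # Comparing a few Meta-Learner choices on the same set of base learners
    # (illustrative sketch; assumes scikit-learn and mlxtend are installed).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from mlxtend.classifier import StackingClassifier

    X, y = load_breast_cancer(return_X_y=True)

    base_learners = [DecisionTreeClassifier(max_depth=3, random_state=0),
                     GaussianNB(),
                     KNeighborsClassifier(n_neighbors=5)]

    # Candidate Meta-learners drawn from the list above: LR, GBM, k-NN, RF.
    meta_learners = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "GBM": GradientBoostingClassifier(random_state=0),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }

    for name, meta in meta_learners.items():
        stack = StackingClassifier(classifiers=base_learners,
                                   meta_classifier=meta,
                                   use_probas=True)   # feed class probabilities to the Meta-learner
        score = cross_val_score(stack, X, y, cv=5).mean()
        print(f"{name:20s} 5-fold CV accuracy: {score:.3f}")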

Also, a survey of papers from 1997 to 2013 shows that Meta-learner choices have become increasingly novel and diverse:
[Figure: table of Meta-learner choices in the literature, 1997-2013]

To sum up, SG is a very powerful ensemble method. In a sense it resembles deep learning in that it increases the depth of learning vertically, but it also increases the complexity and opacity of the model. Carefully choosing the Meta-learner and Base-learners, the training procedure, the evaluation criteria, and so on is a matter of experience that deserves attention.

References

  1. Now I think about it, Stacked Generalization
  2. Kaggle Ensembling Guide
