Model Fusion and Prediction Fusion

1. Model Fusion Improvement Techniques

  • Model fusion first generates a group of individual learners and then combines them with a certain strategy to strengthen the overall model.
  • Why model fusion? Analysis shows that as the number T of individual classifiers in the ensemble increases, the error rate of the ensemble learner decreases exponentially and eventually tends to zero (see the bound sketched after this list). Through fusion, the individual learners "learn from each other's strengths and make up for each other's weaknesses": combining their advantages reduces prediction error and improves overall model performance. Moreover, the higher the accuracy and the greater the diversity of the individual learners, the greater the improvement from model fusion.
  • According to the relationship between the individual learners, model fusion techniques can be divided into two categories:
    1. Parallel methods, in which no strong dependency exists between the individual learners and they can be generated simultaneously; representatives are the Bagging method and random forests.
    2. Sequential (serial) methods, in which strong dependencies exist between the individual learners and they must be generated one after another; the representative is the Boosting method.
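
The exponential decrease mentioned above comes from a standard argument (e.g. the Hoeffding-bound analysis found in ensemble-learning textbooks) that assumes the T base classifiers err independently, each with error rate ε < 0.5, and are combined by simple majority vote H; under that assumption,

$$
P\bigl(H(x) \neq f(x)\bigr)
  = \sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1-\epsilon)^{k}\,\epsilon^{T-k}
  \le \exp\!\Bigl(-\tfrac{1}{2}\,T\,(1-2\epsilon)^{2}\Bigr),
$$

which goes to zero exponentially in T. The independence assumption rarely holds exactly in practice, which is precisely why the diversity of the individual learners matters.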

1. Bagging method and random forest

  • The Bagging method obtains the sub-training set needed by each base model by sampling from the training set, and then combines the prediction results of all the base models to produce the final prediction, as shown in the figure below:
    [Figure: Bagging training and prediction flow]
  • The Bagging method uses bootstrap sampling (self-service sampling): for an original training set of m samples, one sample is drawn at random, placed into the sampling set, and then put back, so it may be drawn again in later draws; after m draws, a sampling set of m samples is obtained. Because the sampling is random, each sampling set differs from the original training set and from the other sampling sets, so multiple different weak learners can be obtained. (Note: the original dataset has m samples, and each newly sampled dataset also has m samples.)
  • Random forest is an improvement on the Bagging method, with two changes: the base learner is restricted to decision trees, and in addition to Bagging's perturbation on the samples, a perturbation on the attributes is added, which is equivalent to introducing random attribute selection into decision-tree learning. For each node of a base decision tree, a subset of k attributes is randomly selected from that node's attribute set, and the optimal splitting attribute is then chosen from this subset. (A minimal code sketch follows this list.)
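
A minimal sketch of the two methods above, using scikit-learn's BaggingClassifier and RandomForestClassifier; the dataset and hyperparameters are illustrative assumptions, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real training set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: each base learner (a decision tree by default) is trained on a
# bootstrap sample of m examples drawn with replacement from the m-sample set
bagging = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42)

# Random forest: Bagging on decision trees plus random attribute selection,
# i.e. each node only considers a random subset of k features when splitting
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```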

2. Boosting method

  • The training process of the Boosting method is stage-wise: the base models are trained one by one in order (so, unlike Bagging, the training itself cannot easily be parallelized). The training set of each base model is transformed according to some strategy at every step, and the predictions of all base models are then combined linearly to produce the final prediction, as shown in the figure below:
    [Figure: Boosting training and prediction flow]
  • Well-known algorithms in the Boosting family include the AdaBoost algorithm and the boosting tree series of algorithms. Among the boosting tree algorithms, the most widely used is the gradient boosting tree (Gradient Boosting Tree). They are briefly introduced below (a small code sketch follows this list):
    1. AdaBoost algorithm: the special case in which the model is additive, the loss function is the exponential loss, and the learning algorithm is the forward stagewise (forward distribution) algorithm; it is used for binary classification.
    2. Boosting tree: the algorithm obtained when the model is additive, the learning algorithm is the forward stagewise algorithm, and the base learner is restricted to decision trees. For binary classification the loss function is the exponential loss, which amounts to the AdaBoost algorithm with its base learner restricted to a binary decision tree; for regression the loss function is the squared error, and each new tree fits the residual of the current model.
    3. Gradient boosting tree: an improvement on the boosting tree algorithm. The boosting tree algorithm is only convenient when the loss function is the exponential loss or the squared error; for a general loss function, the gradient boosting algorithm uses the value of the negative gradient of the loss function at the current model as an approximation of the residual.
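
As a rough illustration of point 3, the minimal sketch below fits each new tree to the negative gradient of the squared-error loss, which for that loss is simply the residual of the current model; the data, tree depth and learning rate are made-up assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_rounds, lr = 50, 0.1
pred = np.full_like(y, y.mean())   # initial constant model
trees = []
for _ in range(n_rounds):
    residual = y - pred            # negative gradient of 1/2 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)   # additive update (forward stagewise)

print("training MSE:", np.mean((y - pred) ** 2))
```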

2. Prediction result fusion strategy

1. Voting

  • Voting is divided into hard voting and soft voting. Its principle is majority rule ("the minority obeys the majority"), and it is used for classification problems.
    1. Hard voting: each model votes for a class directly, and the class with the most votes is the final predicted class.
    2. Soft voting: follows the same idea as hard voting, but averages the predicted class probabilities instead of counting votes and adds the ability to set weights, so different models can be given different weights and thus different importance.

2. Soft voting code example:

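The original post links to an external code example; as a stand-in, here is a minimal soft-voting sketch using scikit-learn's VotingClassifier (the models, weights, and data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",        # average the predicted class probabilities
    weights=[1, 2, 1],    # per-model weights; "hard" voting would count votes instead
)

print("soft voting CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```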

3. Averaging and Ranking

  • The principle of Averaging is to take the average of the model results as the final prediction; a weighted average can also be used. There is a problem, however: if the prediction results of the different regression models fluctuate over very different ranges, the results with small fluctuations play a relatively small role in the fusion.
  • The idea of Ranking is the same as that of Averaging. Because the averaging method has the problem described above, the rankings are averaged here instead. If weights are used, the weighted sum of each sample's rank across the n models is computed, and that is the final result.
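
A minimal sketch of both strategies, assuming the predictions of three regression models on the same test set are already available (the arrays and weights below are made up for illustration):

```python
import numpy as np
from scipy.stats import rankdata

# Assume these are predictions of three regression models on the same test set
pred1 = np.array([0.2, 0.8, 0.4, 0.9])
pred2 = np.array([10.0, 40.0, 20.0, 45.0])   # much larger scale / fluctuation
pred3 = np.array([0.3, 0.7, 0.5, 0.8])
weights = np.array([0.3, 0.4, 0.3])

# Averaging: (weighted) mean of the raw predictions; the small-range models
# contribute little when another model fluctuates over a much larger range
avg_fusion = weights[0] * pred1 + weights[1] * pred2 + weights[2] * pred3

# Ranking: average each sample's rank instead of its raw value,
# which removes the scale problem above
ranks = np.vstack([rankdata(p) for p in (pred1, pred2, pred3)])
rank_fusion = (weights[:, None] * ranks).sum(axis=0)

print(avg_fusion, rank_fusion)
```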

4. Blending

  • Blending divides the original training set into two parts, for example 70% of the data as a new training set and the remaining 30% as a hold-out set.
  • In the first layer, the 70% portion is used to train multiple models, which then predict the labels of the remaining 30%. In the second layer, the first-layer predictions on that 30% of the data are used directly as new features to continue training the second-layer model.
  • Advantages of Blending: it is simpler than Stacking (no k-fold cross-validation is needed to obtain the stacker features), and it avoids some information-leakage problems, because the generalizers and the stacker use different data.
  • Disadvantages of Blending:
    1. Very little data is used: the second-stage blender is trained only on the hold-out portion of the training set (30% in the split above).
    2. The blender may overfit to this small hold-out set.

    Explanation: In terms of practical results, Stacking and Blending have similar effects.
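
A minimal blending sketch under the 70/30 split described above; the base models, meta-model, and data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Split the training data: 70% for the first-layer models, 30% held out for the blender
X_base, X_hold, y_base, y_hold = train_test_split(X_train, y_train, test_size=0.3, random_state=1)

base_models = [RandomForestClassifier(n_estimators=100, random_state=1),
               GradientBoostingClassifier(random_state=1)]

# First layer: train on the 70% part, predict on the 30% hold-out and on the test set
hold_feats = np.column_stack([m.fit(X_base, y_base).predict_proba(X_hold)[:, 1] for m in base_models])
test_feats = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# Second layer: the blender is trained only on the hold-out predictions
blender = LogisticRegression().fit(hold_feats, y_hold)
print("blending accuracy:", blender.score(test_feats, y_test))
```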

5. Stacking

  • The basic principle of Stacking is to use all of the trained base models to make predictions on the training set: the prediction of the j-th base model for the i-th training sample becomes the j-th feature value of the i-th sample in a new training set, and the second-level model is then trained on this new training set. Likewise, at prediction time the test samples are first passed through all of the base models to form a new test set, and the final prediction is made on that test set, as shown in the figure below:
    [Figure: Stacking training and prediction flow]
  • Stacking is a layered model-ensembling framework. Taking two layers as an example: the first layer consists of multiple base learners whose input is the original training set; the second-layer model is trained using the outputs of the first-layer learners as its training set, which yields the complete Stacking model. The two-layer Stacking model uses all of the training-set data.
  • The following example illustrates the procedure further (a code sketch follows at the end of this section):
    1. The data consist of a training set and a test set, and the training set is divided into 5 parts: train1, train2, train3, train4, train5.
    2. Select the base models. Suppose xgboost, lightgbm and random forest are chosen as base models. Taking the xgboost part as an example: train1, train2, train3, train4 and train5 are used in turn as the validation set, with the remaining 4 parts as the training set, i.e. the model is trained with 5-fold cross-validation, and a prediction is made on the test set after each fold. This yields 5 sets of predictions from the xgboost model on the training set and one prediction B1 on the test set; the 5 training-set predictions are stacked vertically to form A1. The lightgbm and random forest parts are handled in the same way, as shown in the figure below:
       [Figure: 5-fold out-of-fold prediction for each base model]
    3. After the three base models have been trained, their predictions on the training set are used as three new "features" A1, A2 and A3, and an LR (logistic regression) model is trained on them.
    4. The trained LR model then makes the final prediction from the corresponding "feature" values predicted by the three base models on the test set (B1, B2, B3), giving the final predicted class or probability.

Explanation: In the Stacking process, combining the first-layer models' predictions with the original features when training the second-layer model can further improve the model's performance and can also help prevent overfitting.
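
A minimal sketch of the out-of-fold procedure described above. To keep it self-contained, the three base models from the example are replaced by two generic scikit-learn models (xgboost or lightgbm would plug in the same way), and each B_j is taken as the average of the five fold models' test-set predictions, which is one common variant:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)

base_models = [RandomForestClassifier(n_estimators=100, random_state=2),
               GradientBoostingClassifier(random_state=2)]
kf = KFold(n_splits=5, shuffle=True, random_state=2)

train_meta = np.zeros((len(X_train), len(base_models)))  # A_j: out-of-fold predictions
test_meta = np.zeros((len(X_test), len(base_models)))    # B_j: averaged test predictions

for j, model in enumerate(base_models):
    test_fold_preds = []
    for tr_idx, va_idx in kf.split(X_train):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        train_meta[va_idx, j] = model.predict_proba(X_train[va_idx])[:, 1]
        test_fold_preds.append(model.predict_proba(X_test)[:, 1])
    test_meta[:, j] = np.mean(test_fold_preds, axis=0)   # one value B_j per test sample

# Second layer: train LR on the new features A_j, then predict from B_j
meta_model = LogisticRegression().fit(train_meta, y_train)
print("stacking accuracy:", meta_model.score(test_meta, y_test))
```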

3. Other ways to improve

  • By analysing model weights or feature importance, you can pinpoint the important data and fields and the related feature directions, and keep refining in those directions; you can also look for more data in those directions and build related feature combinations, which can improve model performance.
  • Through bad-case analysis, you can effectively find the samples with inaccurate predictions, trace them back to the data to analyse the causes, and thereby find ways to improve the model's accuracy.
