Financial Risk Control Task 5: Model Fusion

1 Introduction

Model fusion is an important way to improve your score in the later stages of a competition, especially in team competitions: fusing the models of different teammates can produce unexpectedly good results. In general, provided the individual models all perform well, the greater the difference between them, the larger the improvement from fusion. The main model fusion methods are listed below.

  1. Average:
    a. Simple average method
    b. Weighted average method
  2. Voting:
    a. Simple voting method
    b. Weighted voting method
  3. Synthesis:
    a. Rank (sort) fusion
    b. Log fusion
  4. Stacking:
    a. Build a multi-layer model and use the base learners' predictions to refit a new model.
  5. Blending:
    a. Use part of the data to train base learners and obtain their predictions as new features, then bring those features into the remaining data for prediction. Blending has only one layer, while stacking can have multiple layers.
  6. Boosting/bagging (already covered in Task 4, so not repeated here)

2 Detailed explanation of stacking/blending

  1. Stacking
    Stacking uses the predictions of several base learners as a new training set to train a higher-level learner. As shown in the figure below, suppose there are five base learners: the data is fed into them to obtain predictions, which are then fed into model six for training and testing. However, because the outputs of the five base learners go directly into model six, it is easy to overfit. Therefore, when using the five base learners for prediction, consider K-fold cross-validation to prevent overfitting.
    [Figure: stacking, with five base learners whose predictions feed a second-level model (model six)]
  2. Blending
    Blending differs from stacking in that it combines the predicted values, as new features, with the original features for prediction. To prevent overfitting, the data is divided into two parts, d1 and d2: d1 serves as the training set for the base learners and d2 as their test set. The base learners' predictions on d2 then serve as new features, and d2, combined with these new features, serves as the training set for predicting the result of the final test set.
    [Figure: blending, with a d1/d2 split and a second-level model trained on d2]
  3. The difference between blending and stacking
    a. Stacking
    In stacking, the two layers use different data, so the problem of information leakage is avoided; during a team competition there is also no need to share your random seeds with teammates.
    b. Blending
    Blending is simpler than stacking and does not require building a multi-layer model. However, since blending divides the data into two parts, some of the data's information is ignored in the final prediction. At the same time, because the second layer sees less data, it may overfit.

3 Code examples

3.1 Average

  1. Simple average method: the results are fused directly
    Take the mean of multiple prediction results. pre1 to pren are the predictions of n models, fused as:
pre = (pre1 + pre2 + pre3 + ... + pren) / n
  2. Weighted average method
    Weighted fusion is generally based on the accuracy of the individual models, assigning higher weights to more accurate models, for example (a short code sketch of both methods follows):
pre = 0.3 * pre1 + 0.3 * pre2 + 0.4 * pre3
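A minimal sketch of both averaging methods, assuming pre1, pre2 and pre3 are numpy arrays of predicted probabilities from three hypothetical models (the weights are illustrative; in practice they would come from validation scores):

import numpy as np

# Hypothetical predictions of three models on the same samples
pre1 = np.array([0.2, 0.7, 0.4])
pre2 = np.array([0.3, 0.6, 0.5])
pre3 = np.array([0.1, 0.8, 0.3])

# Simple average: every model contributes equally
pre_mean = (pre1 + pre2 + pre3) / 3

# Weighted average: more accurate models receive larger weights
pre_weighted = 0.3 * pre1 + 0.3 * pre2 + 0.4 * pre3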

3.2 Voting

  1. Simple voting
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Three heterogeneous base classifiers
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7, objective='binary:logistic')

# Hard (majority) voting over the three classifiers;
# x_train, y_train and x_test are assumed to be prepared beforehand
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)])
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))
  2. Weighted voting
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4, min_child_weight=2, subsample=0.7, objective='binary:logistic')

# Soft voting: average the predicted class probabilities, counting clf1 twice as heavily
vclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('xgb', clf3)], voting='soft', weights=[2, 1, 1])
vclf = vclf.fit(x_train, y_train)
print(vclf.predict(x_test))
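With voting='soft', the ensemble averages the class probabilities predicted by the three models using the weights [2, 1, 1] and picks the class with the highest weighted average; the simple voting example above instead takes a majority vote over the predicted labels.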

3.3 Stacking
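A minimal sketch of the stacking scheme from section 2, using sklearn's StackingClassifier with the base classifiers from section 3.2 (section 2 uses five base learners; three are enough to illustrate). x_train, y_train and x_test are assumed to be prepared beforehand; cv=5 supplies the K-fold out-of-fold predictions that guard against overfitting:

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

# First-layer base learners
estimators = [
    ('lr', LogisticRegression(random_state=1)),
    ('rf', RandomForestClassifier(random_state=1)),
    ('xgb', XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=4))
]

# final_estimator plays the role of "model six" in section 2;
# cv=5 trains it on out-of-fold predictions to prevent overfitting
sclf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
sclf = sclf.fit(x_train, y_train)
print(sclf.predict(x_test))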

3.4 Blending
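sklearn has no built-in blending class, so the following is a minimal hand-rolled sketch of the procedure from section 2: d1 trains the base learners, their predictions on d2 become new features, and a second-layer model trained on d2 predicts the test set. The split ratio and the choice of models are illustrative assumptions, and x_train, y_train and x_test are again assumed to exist:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Split the training data into d1 (base-learner training) and d2 (holdout)
x_d1, x_d2, y_d1, y_d2 = train_test_split(x_train, y_train, test_size=0.3, random_state=1)

base_models = [LogisticRegression(random_state=1), RandomForestClassifier(random_state=1)]

# Train each base learner on d1, then predict on d2 and on the test set
d2_preds, test_preds = [], []
for model in base_models:
    model.fit(x_d1, y_d1)
    d2_preds.append(model.predict_proba(x_d2)[:, 1])
    test_preds.append(model.predict_proba(x_test)[:, 1])

# The base learners' predictions on d2 become features for the second layer
# (to match section 2 exactly, stack these next to the original d2 features)
blend_train = np.column_stack(d2_preds)
blend_test = np.column_stack(test_preds)

blender = LogisticRegression()
blender.fit(blend_train, y_d2)
print(blender.predict(blend_test))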
