Model fusion in Python data science competitions

Model fusion

The idea behind model fusion is that combining multiple models can improve overall performance. Ensembling is a powerful technique for improving accuracy on a wide range of machine learning tasks, and model fusion is an important part of the late stage of a Kaggle competition. Generally speaking, there are the following types:

1. Simple weighted fusion:
regression (or classification probabilities): arithmetic-mean or geometric-mean fusion
classification: voting
combination: rank averaging, log fusion

2. Stacking/blending:
build a multi-layer model, using the first layer's predictions as features to fit the final prediction

3. Boosting/bagging:
multi-tree ensemble methods, as used in XGBoost, AdaBoost, GBDT, etc.
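
As a quick illustration of the bagging/boosting family, here is a minimal sketch using scikit-learn's built-in wrappers; the breast-cancer dataset and hyperparameters are stand-ins for illustration, not from the original:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2020)

# bagging: train many trees on bootstrap samples and average their votes
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=2020)
# boosting: train weak trees sequentially, each focusing on the previous errors
boost = AdaBoostClassifier(n_estimators=50, random_state=2020)

for name, clf in [('bagging', bag), ('boosting', boost)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))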

Averaging

Basic idea: for regression problems, a simple and straightforward approach is to average the predictions. A slight improvement is a weighted average, where the weights can be determined by ranking: for three base models A, B, and C ranked 1, 2, and 3 by performance, the weights assigned to them would be 3/6, 2/6, and 1/6 respectively.

The averaging and weighted-averaging methods may look simple, but many of the later, more advanced algorithms can be said to build on this idea: bagging and boosting both fuse many weak learners into a strong learner.

  1. Simple arithmetic averaging: the averaging method takes the mean of the predictions of multiple models. It can be used both for regression problems and for averaging the class probabilities of classification problems.
  2. Weighted arithmetic averaging: an extension of the averaging method. Since different models have different abilities and therefore contribute differently to the final result, weights are used to express the relative importance of each model.
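
A minimal sketch of both schemes, assuming pred_a, pred_b, and pred_c are hypothetical predictions of the three models A, B, and C ranked as above:

import numpy as np

# hypothetical predictions of models A, B, C on the same samples
pred_a = np.array([0.2, 0.9, 0.4])
pred_b = np.array([0.3, 0.8, 0.5])
pred_c = np.array([0.1, 1.0, 0.3])

# 1. simple arithmetic average
avg = (pred_a + pred_b + pred_c) / 3

# 2. rank-based weighted average: ranks 1, 2, 3 -> weights 3/6, 2/6, 1/6
weights = np.array([3, 2, 1]) / 6
weighted = weights[0] * pred_a + weights[1] * pred_b + weights[2] * pred_c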

Voting

Basic idea: suppose we have 3 base models for a binary classification problem. We can build a voting classifier on top of these base learners and take the class with the most votes as the predicted category.

  1. Absolute majority voting: the winning class must receive more than half of the votes.
  2. Relative majority voting: the winning class is simply the one with the most votes.
  3. Weighted voting: each model's vote is multiplied by a weight before counting.
  4. Hard voting: vote directly on the models' predicted classes without distinguishing the relative importance of the models; the class with the most votes is the final prediction.
  5. Soft voting: averages the predicted probabilities and adds support for weights, so that different weights can be set to distinguish the importance of different models.

Voting method implementation code:

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# x_train, y_train, x_test, y_test are assumed to have been prepared beforehand
model1 = LogisticRegression(random_state=2020)
model2 = DecisionTreeClassifier(random_state=2020)
# hard voting: each model casts one vote for its predicted class
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
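
For soft voting the call is almost identical. A sketch reusing model1 and model2 from above (the weights are illustrative), noting that every estimator must support predict_proba:

# soft voting: average the predicted class probabilities, optionally weighted
model_soft = VotingClassifier(estimators=[('lr', model1), ('dt', model2)],
                              voting='soft',
                              weights=[2, 1])  # e.g. trust the LR twice as much
model_soft.fit(x_train, y_train)
model_soft.score(x_test, y_test)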

Stacking

Basic idea: stacking uses the initial training data to learn several base learners, then uses the predictions of these learners as a new training set to learn a new learner. In other words, it builds a model on top of the predictions of different models.
In stacking, the individual learners are called primary learners, the learner used to combine them is called the secondary learner or meta-learner, and the data used to train the secondary learner is called the secondary training set. The secondary training set is obtained by applying the primary learners to the training set.

Stacking is essentially this straightforward idea, but it can run into trouble when the distributions of the training set and the test set are not consistent. The problem is that the secondary learner is trained on labels predicted by the initial models while the real labels are reused for retraining, which inevitably causes the model to overfit the training set to some extent, so its generalization ability on the test set may suffer. The question then becomes how to reduce this overfitting at retraining time, for which there are generally two methods:

  1. Choose a simple linear model as the secondary model (although in practice linear models often do not work well here; LightGBM is a common recommendation).
  2. Use K-fold cross-validation, so that each primary learner only contributes out-of-fold predictions.

Calling an API (the heamy library) to implement stacking:

from heamy.dataset import Dataset
from heamy.estimator import Regressor
from heamy.pipeline import ModelsPipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
from sklearn.datasets import load_boston
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
# create the dataset
dataset = Dataset(X_train, y_train, X_test)
# create the RF model and the LR model
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50}, name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, parameters={'normalize': True}, name='lr')
# stack the two models
# returns a new dataset with out-of-fold predictions
pipeline = ModelsPipeline(model_rf, model_lr)
stack_ds = pipeline.stack(k=10, seed=111)
# second layer: stack with an LR model
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.predict()
# evaluate with 5-fold cross-validation
results10 = stacker.validate(k=5, scorer=mean_absolute_error)
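
heamy is no longer actively maintained; as a rough equivalent, scikit-learn (0.22+) ships StackingRegressor, which builds the out-of-fold predictions internally. A sketch reusing the imports and the train/test split from the block above:

from sklearn.ensemble import StackingRegressor

# same idea: RF and LR as first-level learners, LR as the meta-learner,
# with 10-fold out-of-fold predictions built internally (cv=10)
stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=50)),
                ('lr', LinearRegression())],
    final_estimator=LinearRegression(),
    cv=10)
stack.fit(X_train, y_train)
print(mean_absolute_error(y_test, stack.predict(X_test)))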

Manual implementation (creating just one base learner as an example):

from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
import numpy as np

# train, train_y and test are assumed to have been prepared beforehand
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=16)
oof_xgb = np.zeros(len(train))  # out-of-fold predictions on the training set
pre_xgb = np.zeros(len(test))   # averaged predictions on the test set

params = {'learning_rate': 0.008,
          'n_estimators': 1000,
          'max_depth': 5,
          'subsample': 0.8,
          'colsample_bytree': 0.8,
          'objective': 'binary:logistic',
          'eval_metric': 'auc',
          'nthread': 4}

for k, (train_in, test_in) in enumerate(skf.split(train, train_y)):
    X_train, X_test = train[train_in], train[test_in]
    y_train, y_test = train_y[train_in], train_y[test_in]
    # train with early stopping on the held-out fold
    clf = XGBClassifier(**params)
    clf.fit(X_train, y_train, eval_set=[(X_test, y_test)],
            early_stopping_rounds=100, verbose=100)
    print('Start predicting...')
    # store out-of-fold probabilities and average the test-set probabilities
    oof_xgb[test_in] = clf.predict_proba(X_test)[:, 1]
    pre_xgb += clf.predict_proba(test)[:, 1] / skf.n_splits

print('XGB predict over')
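
To complete the stack, the out-of-fold predictions of each base learner become the columns of the secondary training set. A minimal sketch with the single oof_xgb/pre_xgb pair from above (with several base learners you would stack their columns side by side):

from sklearn.linear_model import LogisticRegression

# second level: out-of-fold predictions as features, original labels as target
stack_train = oof_xgb.reshape(-1, 1)  # one column per base learner
stack_test = pre_xgb.reshape(-1, 1)

meta = LogisticRegression()
meta.fit(stack_train, train_y)
final_pred = meta.predict_proba(stack_test)[:, 1]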

Blending

Basic idea: blending follows the same idea as stacking, but instead of out-of-fold predictions it uses only the predictions on a single hold-out portion of the training set, concatenates them (optionally together with the original features) as the features for the meta-learner, and performs the same operation on the test set.
Divide the original training set into two parts, for example 70% of the data as the new training set and the remaining 30% as the hold-out set.

  1. In the first layer, train multiple models on the 70% of the data, then predict the labels of the 30% hold-out data and also the labels of the test set.
  2. In the second layer, use the first layer's predictions on the 30% hold-out data as new features to train the meta-learner; then take the first layer's predictions on the test set as features and use the second-layer model to make the final predictions, as in the sketch below.
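
A minimal sketch of this two-layer procedure, assuming X, y, and X_test are already prepared and using two simple base models purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out 30% of the training data for the blender
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.3, random_state=2020)

# layer 1: fit base models on the 70%, predict on the 30% hold-out and on the test set
base_models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier()]
hold_feats, test_feats = [], []
for m in base_models:
    m.fit(X_base, y_base)
    hold_feats.append(m.predict_proba(X_hold)[:, 1])
    test_feats.append(m.predict_proba(X_test)[:, 1])

# layer 2: train the blender on the hold-out predictions
# (optionally concatenate the original features as well)
blender = LogisticRegression()
blender.fit(np.column_stack(hold_feats), y_hold)
final_pred = blender.predict_proba(np.column_stack(test_feats))[:, 1]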

Comparison of Blending and Stacking:
Advantages:

  1. Blending is simpler than stacking, because there is no need to perform K rounds of cross-validation to obtain the stacker features.
  2. Blending avoids an information leakage problem: the generators and the stacker use different data sets.

Disadvantages:

  1. Blending uses very little data (the second-stage blender uses only a small fraction of the training set, e.g. 10%), so the blender may overfit.
  2. Stacking, by using multiple rounds of cross-validation, is more robust.
