Machine learning: multi-model fusion for prediction on the iris data set

1. Multi-model fusion

As is well known, machine learning performance can often be improved by fusing multiple models. In the algorithm competitions of recent years, the top entries are almost all multi-model fusions; for example, the models that took first and second place in the Otto Group Product Classification Challenge on Kaggle were "behemoths" integrating 1000+ models. On some level, multi-model fusion is a "brute force" answer to the data: several models are thrown at the problem to squeeze out more performance. Of course, some experts can achieve the same effect with careful data preprocessing plus a single model, but I am too lazy here to dig into the data and do feature analysis. Enough chit-chat, let's get to the topic!

2. The way of multi-model fusion

1. Voting / Averaging

Sample        1  2  3  4  5  6  7  8  9  10
Real label    1  0  1  0  1  1  0  1  1  1
Classifier 1  1  0  1  0  1  0  0  0  1  1
Classifier 2  1  1  1  0  1  1  1  1  1  1
Classifier 3  1  1  0  0  1  1  1  1  1  0

From the table above we can see that the accuracy of classifier 1 is 0.8, the accuracy of classifier 2 is 0.8, and the accuracy of classifier 3 is 0.6. If we follow the majority-rule principle, the accuracy after model fusion is 0.8.
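
As a quick sanity check (not part of the original post), the vote in this toy example can be reproduced with a minimal numpy sketch; the array names are purely illustrative:

import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 1, 0, 1, 1, 1])
preds = np.array([
    [1, 0, 1, 0, 1, 0, 0, 0, 1, 1],  # classifier 1
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],  # classifier 2
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 0],  # classifier 3
])

# Accuracy of each individual classifier
for i, p in enumerate(preds, start=1):
    print(f"classifier {i} accuracy:", (p == y_true).mean())

# Majority vote: predict 1 when at least 2 of the 3 classifiers predict 1
fused = (preds.sum(axis=0) >= 2).astype(int)
print("fused accuracy:", (fused == y_true).mean())

Running this prints 0.8, 0.8, 0.6 for the individual classifiers and 0.8 for the fused prediction, matching the table.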

1.1 Finding the problem

You may find that after multi-model fusion the performance does not improve. On the one hand, there may be too few models in the vote, and several models may happen to misclassify the same samples, so fusion brings no obvious gain. On the other hand, it may also be because the differences between the models are too small. Let's look at another example.

Sample        1  2  3  4  5  6  7  8  9  10
Real label    1  0  1  0  1  1  0  1  1  1
Classifier 1  1  0  1  0  1  0  0  0  1  1
Classifier 2  0  1  1  0  1  1  0  1  0  1
Classifier 3  1  0  0  0  1  1  1  0  1  0

From the table above we can see that the accuracy of classifier 1 is 0.8, the accuracy of classifier 2 is 0.7, and the accuracy of classifier 3 is 0.6. If we follow the majority-rule principle, the accuracy after model fusion is 0.9.

1.2 Why?

Looking closely, we find that in the second experiment the differences between the models are greater; in other words, the models make their mistakes on different samples. The reason accuracy was low, or did not improve, in the first experiment is that the correlation between the models is high: they make errors on the same samples, which leaves the performance of the fused model unchanged or even makes it worse.

1.3 Conclusion

For model fusion, the greater the difference between the models, the better the fusion result tends to be. The difference referred to here is not a difference in accuracy, but a difference in the correlation between the models, i.e. in which samples they get wrong.
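
The post does not quantify this, but one rough way to see it (my own sketch, reusing the two toy tables above) is to measure how correlated the classifiers' errors are. The helper below computes the average Pearson correlation between the per-sample error indicators of each pair of classifiers:

import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# Predictions from the first and second toy examples above
example_1 = [
    np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 1]),
    np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1]),
    np.array([1, 1, 0, 0, 1, 1, 1, 1, 1, 0]),
]
example_2 = [
    np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 1]),
    np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 1]),
    np.array([1, 0, 0, 0, 1, 1, 1, 0, 1, 0]),
]

def mean_error_correlation(preds, y):
    # 1 where a classifier is wrong, 0 where it is right
    errors = [(p != y).astype(float) for p in preds]
    corrs = []
    for i in range(len(errors)):
        for j in range(i + 1, len(errors)):
            corrs.append(np.corrcoef(errors[i], errors[j])[0, 1])
    return float(np.mean(corrs))

print("example 1:", mean_error_correlation(example_1, y_true))
print("example 2:", mean_error_correlation(example_2, y_true))

The second example should come out with a clearly lower average error correlation than the first, which is exactly the kind of difference that makes the 0.9 fused accuracy possible.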

3. Multi-model fusion on the iris data set that ships with sklearn

Building on the experiments above, we now use sklearn's VotingClassifier to implement an example of multi-model fusion.

1. Import the required packages

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

2. Data reading

iris = load_iris()
data = iris.data
labels = iris.target

3. Split into training and test sets, and set up 10-fold cross-validation

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=10)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)

4. Model construction

# Base models
cls_knn = KNeighborsClassifier(n_neighbors=3)
cls_gnb = GaussianNB()
cls_dt = DecisionTreeClassifier(random_state=10)

Only three classifiers are used here: K-nearest neighbors, Gaussian naive Bayes, and a decision tree. The reason for not choosing more classifiers is that many others work on principles similar to these three and would not provide the kind of model differences we discussed at the beginning.

5. Voting and performance comparison

# Hard voting: majority vote, with KNN given twice the weight of the other two models
voting = VotingClassifier(estimators=[('knn', cls_knn), ('gnb', cls_gnb), ('dt', cls_dt)],
                          voting='hard', weights=[2, 1, 1])

# Compare performance: refit each model on every CV training fold,
# score it on the held-out test set, and average the 10 scores
acc_list = []
for clf in (cls_knn, cls_gnb, cls_dt, voting):
    for train, _ in kfold.split(x_train, y_train):
        clf.fit(x_train[train], y_train[train])
        y_pred = clf.predict(x_test)
        acc_list.append(accuracy_score(y_test, y_pred))
    print(clf.__class__.__name__, np.mean(acc_list))
    acc_list.clear()

4. Results

KNeighborsClassifier 0.9577777777777777
GaussianNB 0.9533333333333334
DecisionTreeClassifier 0.9333333333333333
VotingClassifier 0.9488888888888889

The results of the first run are shown above. We found that the accuracy after voting was not as good as K-nearest neighbors or Gaussian naive Bayes, probably because the differences between the classifiers are too small: they all make mistakes on the same part of the data. So I changed tack and adjusted the voting weights: first tune KNN to maximize its accuracy, then give it a larger weight, a bit like giving the best student in the class a bigger say, while also adjusting the models' hyperparameters to further improve performance.
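
The post does not show the tuning code itself; the sketch below is my own guess at what this step could look like, reusing the objects defined earlier and picking KNN hyperparameters with GridSearchCV before re-weighting the vote 2:1:1 (the parameter grid is purely illustrative):

from sklearn.model_selection import GridSearchCV

# Tune KNN first so that the strongest voter is as accurate as possible
knn_search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={'n_neighbors': [3, 5, 7, 9],
                                      'weights': ['uniform', 'distance']},
                          cv=kfold, scoring='accuracy')
knn_search.fit(x_train, y_train)
best_knn = knn_search.best_estimator_

# Rebuild the vote with the tuned KNN given twice the weight of the others
voting = VotingClassifier(estimators=[('knn', best_knn), ('gnb', cls_gnb), ('dt', cls_dt)],
                          voting='hard', weights=[2, 1, 1])
voting.fit(x_train, y_train)
print('voting test accuracy:', accuracy_score(y_test, voting.predict(x_test)))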

KNeighborsClassifier 0.9844444444444445
GaussianNB 0.9888888888888889
DecisionTreeClassifier 0.96
VotingClassifier 0.9911111111111112

With the 2:1:1 weighting, we found that the accuracy of the fused model did indeed improve, and the goal was achieved!

Origin blog.csdn.net/JaysonWong/article/details/125097257