Data mining learning - ensemble learning (classifier combination)

Table of contents

1. Basic idea of ensemble learning

2. Combination strategies for ensemble models

(1) Averaging & weighted averaging (combination strategy for ensemble regression models)

(2) Relative majority voting & weighted voting (combination strategy for ensemble classification models)

3. Bagging and random forests

(1) Bagging method

(2) Random forest

4. Boosting and AdaBoost

(1) Boosting method

(2) AdaBoost method

5. Python implementation of ensemble learning


1. Basic idea of ensemble learning

Ensemble learning (classifier combination) integrates multiple data mining models (base models, or base classifiers) for learning. Each base model learns from the data set and produces its own output, and the ensemble model then combines these outputs by some strategy to form its final result.

2. Combination strategies for ensemble models

(1) Averaging & weighted averaging (combination strategy for ensemble regression models)

Simple averaging:
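In the standard form, with $h_t(x)$ the output of the $t$-th of $T$ base models:

$$H(x) = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$$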

Weighted averaging:
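In the standard form, with non-negative weights $w_t$ that sum to 1:

$$H(x) = \sum_{t=1}^{T} w_t\,h_t(x), \qquad w_t \ge 0, \quad \sum_{t=1}^{T} w_t = 1$$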

(Compared with simple averaging, the extra weight parameters make this strategy more prone to overfitting.)

In actual use, the weighted average sometimes turns out to perform no better than the simple average.

(2) Relative majority voting & weighted voting (combination strategy for ensemble classification models)

Relative majority (plurality) voting:

The category that receives the most votes is the output category of the ensemble model (if several categories tie for the most votes, one of them is chosen at random as the final output).
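In the standard notation, where $h_t^{(j)}(x)$ is 1 if base model $t$ votes for class $c_j$ and 0 otherwise:

$$H(x) = c_{\arg\max_{j} \sum_{t=1}^{T} h_t^{(j)}(x)}$$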

Weighted voting:

(A special form of the voting method: in weighted voting, different base models carry different voting power.)
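In the standard form, with per-model weights $w_t$:

$$H(x) = c_{\arg\max_{j} \sum_{t=1}^{T} w_t\,h_t^{(j)}(x)}$$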

3. Bagging and random forests

(1) Bagging method

The Bagging method is a parallel ensemble learning method. Its basic structure: several training sets are drawn from the original data by bootstrap sampling (sampling with replacement), one base model is trained on each training set independently and in parallel, and the base models' outputs are combined by voting (classification) or averaging (regression).
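As a minimal, illustrative sketch (using the same breast-cancer dataset as the full example in section 5), sklearn's BaggingClassifier implements this structure directly; by default its base model is a decision tree:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

# 10 base models, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(n_estimators=10)
bag.fit(x_train, y_train)
print("bagging accuracy:", bag.score(x_test, y_test))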

(2) Random forest

Random forest is a specific implementation of the Bagging method, using decision trees as base models.

The basic steps are as follows (a code sketch follows the list):

1. Select training samples for each base model

2. Train a decision-tree base model

3. Combine the multiple decision trees

(In a random forest, using more decision-tree base models generally yields better prediction results.)
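A minimal sketch of these three steps, assuming x_train and y_train are numpy arrays; max_features='sqrt' gives each split the random feature subset that distinguishes a random forest from plain Bagging:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_sketch(x_train, y_train, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # step 1: bootstrap-sample the training set (with replacement)
        idx = rng.integers(0, len(x_train), size=len(x_train))
        # step 2: train a decision tree; each split considers only a
        # random subset of the features
        tree = DecisionTreeClassifier(max_features='sqrt')
        tree.fit(x_train[idx], y_train[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, x):
    # step 3: combine the trees by majority vote for each sample
    votes = np.stack([t.predict(x) for t in trees])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)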

4. Boosting and AdaBoost

(1) Boosting method

The Boosting method is a serial ensemble learning method: the base models are trained one after another.

(In Boosting, each base model's training samples depend on the prediction results of the previous base model: the current base model focuses on the samples the previous base model predicted incorrectly. A minimal sketch of this loop follows.)
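A minimal, illustrative sketch of this serial structure, assuming numpy arrays x_train and y_train; the reweighting factor of 2.0 here is arbitrary, whereas AdaBoost below derives it from the weighted error rate:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_sketch(x_train, y_train, n_rounds=5):
    # start with uniform sample weights
    weights = np.full(len(y_train), 1.0 / len(y_train))
    models = []
    for _ in range(n_rounds):
        # fit a decision stump under the current sample weights
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(x_train, y_train, sample_weight=weights)
        # up-weight the misclassified samples so the next base model
        # focuses on them
        wrong = stump.predict(x_train) != y_train
        weights[wrong] *= 2.0  # illustrative factor only
        weights /= weights.sum()
        models.append(stump)
    return models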

(2) AdaBoost method

The AdaBoost method (Adaptive Boosting) is a concrete implementation of Boosting ensemble learning.

Specific steps (the standard formulas are given after the list):

1. Initialize the sample weights

2. Train a base model

3. Calculate the weight of the base model

4. Update the sample weights

5. Iterate to train multiple base models

6. Combine the base models' predictions
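For reference, the standard AdaBoost formulas behind steps 3, 4 and 6, for binary labels $y_i \in \{-1,+1\}$, sample weights $w_{t,i}$, base model $h_t$, and normalization factor $Z_t$:

$$\epsilon_t = \sum_{i=1}^{N} w_{t,i}\,\mathbb{I}\left(h_t(x_i) \ne y_i\right) \qquad \text{(weighted error of the base model)}$$

$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \qquad \text{(weight of the base model, step 3)}$$

$$w_{t+1,i} = \frac{w_{t,i}\,\exp\left(-\alpha_t\,y_i\,h_t(x_i)\right)}{Z_t} \qquad \text{(sample-weight update, step 4)}$$

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\,h_t(x)\right) \qquad \text{(combination, step 6)}$$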

5. Python implementation of ensemble learning

The sklearn library provides implementations of many ensemble learning models (e.g. RandomForestClassifier and AdaBoostClassifier in sklearn.ensemble), which can be called directly.

Example code (using the random forest classification model):

The code here is a modification of the code from the previous article. Compared with training a single Gaussian naive Bayes classifier, the model trained with a random forest achieves higher accuracy.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier

# train the model and return its confusion matrix
def model_fit(x_train,y_train,x_test,y_test):
    model=RandomForestClassifier(n_estimators=50)
    model.fit(x_train,y_train)  # fit on the training set
    # print(model.score(x_train,y_train))
    print("accuracy:",model.score(x_test,y_test))
    Y_pred=model.predict(x_test)
    cm=confusion_matrix(y_test,Y_pred)  # (true labels, predicted labels)
    return cm

# visualize the confusion matrix
def matplotlib_show(cm):
    plt.figure(dpi=100)  # set figure size (resolution)
    plt.title('Confusion Matrix')

    labels = ['malignant', 'benign']  # the two classes of the breast-cancer dataset
    sns.heatmap(cm, cmap=sns.color_palette("Blues"), annot=True, fmt='d',
                xticklabels=labels, yticklabels=labels)
    plt.ylabel('real_type')  # y axis: actual class
    plt.xlabel('pred_type')  # x axis: predicted class
    plt.show()

if __name__ == '__main__':
    cancer = load_breast_cancer()
    x, y = cancer.data, cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)
    cm=model_fit(x_train,y_train,x_test,y_test)
    matplotlib_show(cm)

Running results: the accuracy printed to the console and the confusion-matrix plot.

 
