Machine learning algorithm selection

Choosing the right algorithm for a given scenario has always troubled me. I have consulted many machine learning experts, and the conclusion is that, if time permits, it is best to try each algorithm and apply the best-performing one to the scenario at hand.

The following code selects five mainstream machine learning algorithms (SVM, KNN, decision tree, logistic regression, and naive Bayes) together with four ensemble learning algorithms (Bagging, AdaBoost, GBDT, and random forest). A general function builds each of these models and plots their ROC curves for model evaluation.

The test data can be downloaded from the address below (extraction code: lg9r):

https://pan.baidu.com/s/1e8txYy-PZrwKKP3JD4sJAg

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'Seven'

import pandas as pd
# Ensemble learning algorithms
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
# Individual (non-ensemble) models
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Model evaluation
from sklearn.metrics import roc_curve, auc
# Train/test split
from sklearn.model_selection import train_test_split
# Min-max normalization
from sklearn.preprocessing import MinMaxScaler
# Plotting
import matplotlib.pyplot as plt

# Use a font that can render Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Suppress warning messages
import warnings

warnings.filterwarnings('ignore')

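data_path = '/Users/gaofei/Desktop/ensemble/data.csv' # Download the data yourself and replace this path

# Read the data file
data_frame = pd.read_csv(data_path, encoding='gbk')

# Get the column names
cols = list(data_frame.columns)

# Normalize the feature columns (every column except the last, which is the label)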
scaler = MinMaxScaler()
values = scaler.fit_transform(data_frame.values[:, :-1])

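# Split into training and test sets.
# `values` already excludes the label column, so all of its columns are used as features.
X_train, X_test, y_train, y_test = train_test_split(values, data_frame.values[:, -1], test_size=0.3)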


# Plot the ROC curves
def plot_roc():
    # Build the Bagging model
    clf_bagging = BaggingClassifier()
    # Build the AdaBoost model
    clf_ada = AdaBoostClassifier()
    # Build the GBDT model
    clf_gbdt = GradientBoostingClassifier()
    # Build the random forest model
    clf_rf = RandomForestClassifier()
    # Build the SVM model
    clf_svm = SVC(probability=True) # predict_proba is unavailable unless probability=True
    # Build the KNN model
    clf_knn = KNeighborsClassifier()
    # Build the naive Bayes model
    clf_nb = GaussianNB()
    # Build the logistic regression model
    clf_logistic = LogisticRegression()
    # Build the decision tree model
    clf_dt = DecisionTreeClassifier()

    # List of models
    clfs = [clf_bagging, clf_ada, clf_gbdt, clf_rf, clf_svm, clf_knn, clf_nb, clf_logistic, clf_dt]
    # List of model names, in the same order as clfs
    names = ['Bagging', 'Adaboost', 'GBDT', 'RandomForest', 'SVM', 'KNN', 'Naive Bayes', 'Logistic Regression',
             'Decision Tree']

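    # Probability of class 1 predicted by each model
    prbs_1 = []
    for clf in clfs:
        # Fit on the training data
        clf.fit(X_train, y_train)
        # Predict class labels on the test set (not used further below)
        pre = clf.predict(X_test)
        # Predicted probability of class 1 on the test set
        y_prb_1 = clf.predict_proba(X_test)[:, 1]
        prbs_1.append(y_prb_1)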

    for index, value in enumerate(prbs_1):
        # Get the false positive rate, true positive rate, and thresholds
        fpr, tpr, thresholds = roc_curve(y_test, value)
        # Compute the AUC
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label='{0}_AUC = {1:.5f}'.format(names[index], roc_auc))

    plt.title('ROC curve')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.legend(loc='lower right')
    # Diagonal reference line for a random classifier
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.show()
    return prbs_1


if __name__ == '__main__':
    print(plot_roc())

Model evaluation:

The ROC curves drawn from the test data are as follows:

As can be seen from the above figure, the ensemble learning algorithms perform significantly better than the individual algorithms on this data set.
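
A ranking read off a single train/test split can be noisy, so one way to double-check it is to compare cross-validated AUC values. The following is only a minimal sketch: it assumes synthetic data from make_classification in place of data.csv, covers a subset of the models, and uses illustrative settings such as n_estimators=50, none of which come from the original script. Its output says nothing about the results on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for data.csv (an assumption, not the article's data)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# A subset of the models, with illustrative settings
candidates = {
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=0),
    'GBDT': GradientBoostingClassifier(n_estimators=50, random_state=0),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

mean_auc = {}
for name, clf in candidates.items():
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(MinMaxScaler(), clf)
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    mean_auc[name] = scores.mean()

# Print the models from highest to lowest mean AUC
for name, score in sorted(mean_auc.items(), key=lambda item: item[1], reverse=True):
    print('{0}: mean AUC = {1:.5f}'.format(name, score))
```

To run this on the real data, X and y would be replaced by the feature columns and the label column of data.csv.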

The above code only provides a reference for algorithm selection; it is just the first step. Next, you can pick the few algorithms with the best results, tune their parameters with sklearn's GridSearchCV, and finally settle on one algorithm.
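
The tuning step might look roughly like the sketch below. It is an illustration under stated assumptions, not the author's procedure: the data is synthetic (make_classification stands in for data.csv), random forest is picked arbitrarily as the model to tune, and the parameter grid is deliberately tiny. Wrapping MinMaxScaler and the classifier in a Pipeline keeps the scaling inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for data.csv (an assumption, not the article's data)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaler and classifier are tuned and evaluated as one estimator
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])

# Illustrative grid only; real grids depend on the model and the data
param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [3, None],
}

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Cross-validated AUC: {0:.5f}'.format(search.best_score_))
# With scoring='roc_auc', score() reports the AUC of the refit best model
print('Test AUC: {0:.5f}'.format(search.score(X_test, y_test)))
```

The same pattern applies to any of the other classifiers by swapping the 'clf' step and the keys in param_grid.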

Origin blog.csdn.net/gf19960103/article/details/89475609