Learning Python: sklearn

Assignment:

For this assignment you need to generate a random binary classification
problem, and then train and test (using 10-fold cross-validation) the three
algorithms. For some algorithms, an inner cross-validation (5-fold) is needed
for choosing the parameters. Then, show the classification performance
(per-fold and averaged) in the report, and briefly discuss the results.

Steps:

1. Create a classification dataset (n_samples >= 1000, n_features >= 10)
2. Split the dataset using 10-fold cross-validation
3. Train the algorithms (a parameter-search sketch follows this list):
   - GaussianNB
   - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
   - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4. Evaluate the cross-validated performance:
   - Accuracy
   - F1-score
   - AUC ROC
5. Write a short report summarizing the methodology and the results
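
Step 3 lists candidate parameter values, which is where the inner 5-fold
cross-validation comes in. The code below hardcodes C=0.1 and
n_estimators=100 instead; here is a minimal sketch of the parameter search,
assuming a scikit-learn version where GridSearchCV is available from
sklearn.model_selection (the grids mirror the candidate values above):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Inner 5-fold CV over the candidate C values for the RBF-kernel SVM.
svc_search = GridSearchCV(SVC(kernel='rbf'),
                          param_grid={'C': [1e-02, 1e-01, 1e00, 1e01, 1e02]},
                          cv=5)

# Inner 5-fold CV over the candidate number of trees.
rf_search = GridSearchCV(RandomForestClassifier(),
                         param_grid={'n_estimators': [10, 100, 1000]},
                         cv=5)

# Fitting on an outer fold's training split refits the best estimator,
# which can then be used like any classifier on that fold's test split:
# svc_search.fit(X_train, y_train)
# pred = svc_search.predict(X_test)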

Code:

from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

import numpy as np

# performance[fold, algorithm, metric]: 10 folds x 3 classifiers x 3 scores
performance = np.empty((10, 3, 3))


def Gaussian_naive_Bayes(X_train, y_train, X_test, y_test):
    # Train Gaussian naive Bayes on the fold's training split and
    # score it on the fold's held-out test split.
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    return metric(y_test, pred)


def SVM(X_train, y_train, X_test, y_test):
    # RBF-kernel SVM with a fixed C; the assignment asks for C to be
    # chosen by inner 5-fold CV (see the GridSearchCV sketch above).
    clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    return metric(y_test, pred)


def Random_Forest(X_train, y_train, X_test, y_test):
    # Random forest with a fixed number of trees; n_estimators could
    # likewise be chosen by inner cross-validation.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    return metric(y_test, pred)


def metric(y_test, pred):
    # Compute the three requested scores for one fold.
    acc = metrics.accuracy_score(y_test, pred)
    f1 = metrics.f1_score(y_test, pred)
    auc = metrics.roc_auc_score(y_test, pred)

    return acc, f1, auc


# Random binary classification problem: 1000 samples, 10 features.
X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=2, n_redundant=2,
                                    n_repeated=0, n_classes=2)

# Outer 10-fold cross-validation.
kf = KFold(n_splits=10, shuffle=True)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    performance[i, 0, :] = Gaussian_naive_Bayes(X_train, y_train, X_test, y_test)
    performance[i, 1, :] = SVM(X_train, y_train, X_test, y_test)
    performance[i, 2, :] = Random_Forest(X_train, y_train, X_test, y_test)

name = ['GaussianNB', 'SVC', 'RandomForestClassifier']
mean = np.mean(performance, axis=0)
for i in range(3):
    print(name[i])
    print('  Accuracy: ', performance[:, i, 0], ' Averaged: ', mean[i, 0])
    print('  F1-score: ', performance[:, i, 1], ' Averaged: ', mean[i, 1])
    print('  AUC ROC:  ', performance[:, i, 2], ' Averaged: ', mean[i, 2], '\n')
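
One caveat on the AUC: passing hard 0/1 predictions to metrics.roc_auc_score
gives the area under a single-point ROC curve. A minimal sketch of a helper
(the name auc_from_scores is my own) that ranks by continuous scores instead,
reusing the metrics import above:

def auc_from_scores(clf, X_test, y_test):
    # Rank by continuous scores rather than hard labels.
    if hasattr(clf, 'predict_proba'):
        scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    else:
        scores = clf.decision_function(X_test)    # e.g. SVC without probability=True
    return metrics.roc_auc_score(y_test, scores)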

Results:

Screenshot: (the screenshot from the original post is not reproduced here)

Conclusion:

Overall, RandomForestClassifier performs best, SVC comes second, and GaussianNB does worst, but all three reach an average accuracy above 80%.

Reposted from blog.csdn.net/manjiang8743/article/details/80723309