Exercise - sklearn

exercise

Create a classification dataset (n_samples >= 1000, n_features >= 10)
Split the dataset using 10-fold cross validation
Train the algorithms
- GaussianNB
- SVC (possible C values: [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
- RandomForestClassifier (possible n_estimators values: [10, 100, 1000])
Evaluate the cross-validated performance
- Accuracy
- F1-score
- AUC ROC

solution

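Build a binary classification dataset, and create one dictionary each for accuracy, F1-score, and AUC-ROC to store the per-fold results of the different algorithms.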

from sklearn import datasets
sample = datasets.make_classification(n_samples=1000, n_features=10)
data, target = sample
acc_dic = {'NB':[], 'SVM':[], 'RFC':[]}
f1_dic = {'NB':[], 'SVM':[], 'RFC':[]}
auc_dic = {'NB':[], 'SVM':[], 'RFC':[]}

Split the dataset into 10 folds

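# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold
# materialize the (train_index, test_index) pairs so kf can be iterated directly
kf = list(KFold(n_splits=10, shuffle=True).split(data))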
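To make the split concrete, here is a minimal sketch of what 10-fold splitting produces for 1000 samples. The array `X`, the name `kf_demo`, and the `random_state=0` seed are illustrative assumptions, not part of the original solution.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 1000-sample dataset; only the number of rows matters here.
X = np.arange(1000 * 10).reshape(1000, 10)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = []
seen_test = []
for train_index, test_index in kf_demo.split(X):
    # Each iteration yields two integer index arrays into X.
    fold_sizes.append((len(train_index), len(test_index)))
    seen_test.extend(test_index.tolist())

print(fold_sizes[0])        # (900, 100)
print(len(set(seen_test)))  # 1000: every sample is tested exactly once
```

Each sample lands in exactly one test fold, so the ten per-fold scores together cover the whole dataset.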

The algorithms to be called are written as follows; performance() appends each fold's evaluation results to the corresponding dictionaries:

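# performance: compute the three metrics for one fold and store them
from sklearn import metrics
def performance(y_test, pred, algorithm):
    acc = metrics.accuracy_score(y_test, pred)
    print(acc)
    # weighted F1: per-class F1 averaged by class support
    f1 = metrics.f1_score(y_test, pred, average='weighted')
    print(f1)
    # AUC-ROC is computed from the hard 0/1 predictions here, not from scores
    auc = metrics.roc_auc_score(y_test, pred)
    print(auc)
    # append this fold's results under the algorithm's key
    acc_dic[algorithm].append(acc)
    f1_dic[algorithm].append(f1)
    auc_dic[algorithm].append(auc)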
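performance() passes hard class labels to roc_auc_score. With 0/1 predictions the ROC curve has a single interior point, so the value equals balanced accuracy, which is why the AUC numbers further down track accuracy so closely. A hedged sketch of the alternative, passing positive-class probabilities instead, is shown here; the single train/test split and the `random_state=0` seeds are assumptions made for brevity.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
hard_labels = clf.predict(X_test)             # 0/1 predictions
pos_scores = clf.predict_proba(X_test)[:, 1]  # probability of class 1

# AUC from hard labels collapses to balanced accuracy.
auc_from_labels = metrics.roc_auc_score(y_test, hard_labels)
balanced_acc = metrics.balanced_accuracy_score(y_test, hard_labels)
# AUC from scores uses the full ranking of the test samples.
auc_from_scores = metrics.roc_auc_score(y_test, pos_scores)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

For SVC, decision_function(X_test) could serve as the score in the same way, since SVC does not expose predict_proba unless it is constructed with probability=True.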

# Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
def GussianNBayes(X_train, y_train, X_test, y_test, algorithm='NB'):
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(pred, y_test)
    performance(y_test, pred, algorithm)

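# SVM with RBF kernel
from sklearn.svm import SVC
def S_V_M(X_train, y_train, X_test, y_test, algorithm='SVM'):
    # C and gamma are fixed here; the exercise lists C in [1e-02, 1e-01, 1e00, 1e01, 1e02] as candidates
    clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)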
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(pred, y_test)
    performance(y_test, pred, algorithm)

# random forest
from sklearn.ensemble import RandomForestClassifier
def R_F_C(X_train, y_train, X_test, y_test, algorithm='RFC'):
    # n_estimators is fixed at 6 here; the exercise lists [10, 100, 1000] as candidates
    clf = RandomForestClassifier(n_estimators=6)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(pred, y_test)
    performance(y_test, pred, algorithm)

Call the algorithms above on each fold as follows

for train_index, test_index in kf:
    X_train, y_train = data[train_index], target[train_index]
    X_test, y_test = data[test_index], target[test_index]
    GussianNBayes(X_train, y_train, X_test, y_test)
    S_V_M(X_train, y_train, X_test, y_test)
    R_F_C(X_train, y_train, X_test, y_test)
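The exercise lists several candidate values for C and n_estimators, while the functions above fix one value each. One possible way to choose among the candidates is an inner grid search on the training portion only. The sketch below assumes a single train/test split, 3 inner folds, and the names `searches`, `best_params`, and `test_scores`; none of these come from the original solution.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Candidate values taken from the exercise statement.
searches = {
    'SVM': GridSearchCV(SVC(kernel='rbf'),
                        {'C': [1e-02, 1e-01, 1e00, 1e01, 1e02]}, cv=3),
    'RFC': GridSearchCV(RandomForestClassifier(random_state=0),
                        {'n_estimators': [10, 100, 1000]}, cv=3),
}

best_params = {}
test_scores = {}
for name, search in searches.items():
    search.fit(X_train, y_train)                       # tune on training data only
    best_params[name] = search.best_params_
    test_scores[name] = search.score(X_test, y_test)   # accuracy of the refit best model
    print(name, best_params[name], test_scores[name])
```

Inside the 10-fold loop, the same search could be fitted on each fold's X_train and y_train, which keeps the test fold out of the parameter selection.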

Print each evaluation result and the corresponding mean

#values
for key, value in acc_dic.items():
    print('for', key, ':', 'Accuracy of all folds is ', value)
for key, value in f1_dic.items():
    print('for', key, ':', 'F1-score of all folds is ', value)
for key, value in auc_dic.items():
    print('for', key, ':', 'AUC-ROC of all folds is ', value)

#mean
for key, value in acc_dic.items():
    print('for', key, ':', 'mean of Accuracy of all folds is ', sum(value) / len(value))
for key, value in f1_dic.items():
    print('for', key, ':', 'mean of F1-score of all folds is ', sum(value) / len(value))
for key, value in auc_dic.items():
    print('for', key, ':', 'mean of AUC-ROC of all folds is ', sum(value) / len(value))
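The same per-fold values and means can also be obtained without hand-maintained dictionaries. The following is a minimal sketch using cross_validate with several scorers at once, shown for GaussianNB only; the seeds and the `summary` dictionary are illustrative assumptions. Note that the built-in 'roc_auc' scorer works from probabilities or decision values, so its numbers will generally differ from the label-based AUC computed in performance().

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB

X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

metric_names = ['accuracy', 'f1_weighted', 'roc_auc']
scores = cross_validate(GaussianNB(), X, y, cv=cv, scoring=metric_names)

summary = {}
for metric in metric_names:
    values = scores['test_' + metric]  # one value per fold
    summary[metric] = (values.mean(), values.std())
    print('NB', metric, 'mean', summary[metric][0], 'std', summary[metric][1])
```

Reporting the standard deviation next to the mean gives a rough sense of whether the gaps between algorithms exceed the fold-to-fold variation.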

The output is as follows:

for NB : Accuracy of all folds is  [0.88, 0.9, 0.94, 0.96, 0.96, 0.92, 0.95, 0.91, 0.91, 0.89]
for SVM : Accuracy of all folds is  [0.89, 0.9, 0.95, 0.97, 0.96, 0.93, 0.95, 0.93, 0.93, 0.92]
for RFC : Accuracy of all folds is  [0.92, 0.95, 0.93, 0.97, 0.98, 0.96, 0.97, 0.95, 0.94, 0.97]
for NB : F1-score of all folds is  [0.8806854345165239, 0.9, 0.94, 0.96, 0.960016006402561, 0.92, 0.9500150015001501, 0.9103570338724444, 0.9100816614578082, 0.8890371743383791]
for SVM : F1-score of all folds is  [0.8907272912216457, 0.8995376208490964, 0.9500150135121609, 0.9699788965933073, 0.96, 0.9300350035003501, 0.9500150015001501, 0.9302776930119012, 0.9300635144671843, 0.9191070690367922]
for RFC : F1-score of all folds is  [0.9204569563443493, 0.9500985579416951, 0.9300210189170253, 0.9699788965933073, 0.98, 0.960032012805122, 0.97000900090009, 0.9500767263427109, 0.9401445202729828, 0.9699603295697284]
for NB : AUC-ROC of all folds is  [0.8834229020256303, 0.8958333333333334, 0.9399038461538461, 0.9598554797270173, 0.9607371794871795, 0.9233239662786029, 0.9511217948717948, 0.9125615763546798, 0.9101010101010101, 0.8823051948051949]
for SVM : AUC-ROC of all folds is  [0.8956180239768499, 0.8916666666666667, 0.9503205128205129, 0.9692894419911683, 0.9615384615384616, 0.9327579285427539, 0.9511217948717948, 0.933087027914614, 0.9303030303030303, 0.911525974025974]
for RFC : AUC-ROC of all folds is  [0.9247622984704423, 0.95, 0.9302884615384616, 0.9692894419911683, 0.9799679487179487, 0.9622641509433962, 0.9711538461538461, 0.9503284072249589, 0.9434343434343434, 0.9683441558441558]
for NB : mean of Accuracy of all folds is  0.922
for SVM : mean of Accuracy of all folds is  0.933
for RFC : mean of Accuracy of all folds is  0.9540000000000001
for NB : mean of F1-score of all folds is  0.9220192312087866
for SVM : mean of F1-score of all folds is  0.932975710369259
for RFC : mean of F1-score of all folds is  0.9540778019687011
for NB : mean of AUC-ROC of all folds is  0.9219166283138289
for SVM : mean of AUC-ROC of all folds is  0.9327228862651825
for RFC : mean of AUC-ROC of all folds is  0.9549833054318722

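As the output shows, all three methods reach a mean accuracy above 90%, and random forest scores highest on all three metrics. Repeated runs gave essentially the same picture, so on this dataset random forest performs best of the three.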

(Week 14)Python-sklearn_exercises
