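Problem
Approach
This problem covers generating a random dataset for machine learning, splitting that dataset, training a few basic models, and empirically comparing how the different algorithms perform. The goal of the exercise is to get familiar with the basic machine learning workflow in Python.
Code and comments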
import numpy as np
from sklearn import datasets, ensemble, metrics, model_selection, naive_bayes, svm

# Per-fold scores for each classifier: accuracy (acy), F1, and ROC AUC (AR)
Bayes_acy = []
Bayes_F1 = []
Bayes_AR = []
SVM_acy = []
SVM_AR = []
SVM_F1 = []
RF_acy = []
RF_F1 = []
RF_AR = []

# Synthetic binary classification dataset: 1000 samples, 10 features
X, Y = datasets.make_classification(n_samples=1000, n_features=10)

# Outer 10-fold cross-validation. The old sklearn.cross_validation module was
# removed in scikit-learn 0.20; model_selection.KFold takes n_splits and
# yields the train/test index arrays through split().
for train, test in model_selection.KFold(n_splits=10, shuffle=True).split(X):
    Xtrain = X[train]
    Ytrain = Y[train]
    Xtest = X[test]
    Ytest = Y[test]

    # Naive Bayes (no hyperparameters to tune)
    Bayes_clf = naive_bayes.GaussianNB()
    Bayes_clf.fit(Xtrain, Ytrain)
    Bayes_pred = Bayes_clf.predict(Xtest)
    Bayes_acy.append(metrics.accuracy_score(Ytest, Bayes_pred))
    Bayes_F1.append(metrics.f1_score(Ytest, Bayes_pred))
    # roc_auc_score receives hard 0/1 predictions here, not continuous scores
    Bayes_AR.append(metrics.roc_auc_score(Ytest, Bayes_pred))

    # Support vector machine: choose C by inner 5-fold CV on the training fold
    Cvalues = [1e-02, 1e-01, 1e00, 1e01, 1e02]
    Cscore = []
    for C in Cvalues:
        Score2 = []
        for train2, test2 in model_selection.KFold(n_splits=5, shuffle=True).split(Xtrain):
            Xtrain2 = Xtrain[train2]
            Xtest2 = Xtrain[test2]
            Ytrain2 = Ytrain[train2]
            Ytest2 = Ytrain[test2]
            innerSVM_clf = svm.SVC(C=C, kernel='rbf')
            innerSVM_clf.fit(Xtrain2, Ytrain2)
            innerSVM_pred = innerSVM_clf.predict(Xtest2)
            Score2.append(metrics.accuracy_score(Ytest2, innerSVM_pred))
        # Mean inner-fold accuracy for this value of C
        Cscore.append(sum(Score2)/len(Score2))
    bestC = Cvalues[np.argmax(Cscore)]
    print("The best C of SVM is %f"%bestC)
    # Refit on the whole training fold with the selected C, then test
    SVM_clf = svm.SVC(C=bestC, kernel='rbf')
    SVM_clf.fit(Xtrain, Ytrain)
    SVM_pred = SVM_clf.predict(Xtest)
    SVM_acy.append(metrics.accuracy_score(Ytest, SVM_pred))
    SVM_F1.append(metrics.f1_score(Ytest, SVM_pred))
    SVM_AR.append(metrics.roc_auc_score(Ytest, SVM_pred))

    # Random forest: choose n_estimators by inner 5-fold CV on the training fold
    values = [10, 100, 1000]
    scores = []
    for est in values:
        Score2 = []
        for train2, test2 in model_selection.KFold(n_splits=5, shuffle=True).split(Xtrain):
            Xtrain2 = Xtrain[train2]
            Xtest2 = Xtrain[test2]
            Ytrain2 = Ytrain[train2]
            Ytest2 = Ytrain[test2]
            RF_clf2 = ensemble.RandomForestClassifier(n_estimators=est)
            RF_clf2.fit(Xtrain2, Ytrain2)
            RF_pred2 = RF_clf2.predict(Xtest2)
            Score2.append(metrics.accuracy_score(Ytest2, RF_pred2))
        # Mean inner-fold accuracy for this number of trees
        scores.append(sum(Score2)/len(Score2))
    best_est = values[np.argmax(scores)]
    print("The best n_estimator of Ramdom Forest is %f"%best_est)
    # Refit on the whole training fold with the selected n_estimators, then test
    RF_clf = ensemble.RandomForestClassifier(n_estimators=best_est)
    RF_clf.fit(Xtrain, Ytrain)
    RF_pred = RF_clf.predict(Xtest)
    RF_acy.append(metrics.accuracy_score(Ytest, RF_pred))
    RF_F1.append(metrics.f1_score(Ytest, RF_pred))
    RF_AR.append(metrics.roc_auc_score(Ytest, RF_pred))

# Report the per-fold scores and their averages
print("Compare Accuracy:")
print("Bayes:")
print(Bayes_acy)
print("Average = %f"%(sum(Bayes_acy)/len(Bayes_acy)))
print("SVM:")
print(SVM_acy)
print("Average = %f"%(sum(SVM_acy)/len(SVM_acy)))
print("Random Forest:")
print(RF_acy)
print("Average = %f"%(sum(RF_acy)/len(RF_acy)))
print("\nCompare F1-score:")
print("Bayes:")
print(Bayes_F1)
print("Average = %f"%(sum(Bayes_F1)/len(Bayes_F1)))
print("SVM:")
print(SVM_F1)
print("Average = %f"%(sum(SVM_F1)/len(SVM_F1)))
print("Random Forest:")
print(RF_F1)
print("Average = %f"%(sum(RF_F1)/len(RF_F1)))
print("\nCompare AUC ROC:")
print("Bayes:")
print(Bayes_AR)
print("Average = %f"%(sum(Bayes_AR)/len(Bayes_AR)))
print("SVM:")
print(SVM_AR)
print("Average = %f"%(sum(SVM_AR)/len(SVM_AR)))
print("Random Forest:")
print(RF_AR)
print("Average = %f"%(sum(RF_AR)/len(RF_AR)))
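
The hand-written inner and outer loops above implement nested cross-validation. As a rough sketch, the same idea for the SVM can be written with GridSearchCV and cross_val_score from sklearn.model_selection. The random_state values and the names inner_cv, outer_cv, search, and outer_scores are assumptions added here for illustration, and this is not the code that produced the results in this post. One difference: GridSearchCV reuses the same inner splits for every candidate C, whereas the loop above reshuffles them for each C.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# random_state is an assumption added for reproducibility
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=1)

# Inner loop: pick C by mean 5-fold accuracy, then refit on the full training fold
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]},
    scoring="accuracy",
    cv=inner_cv,
)

# Outer loop: every outer fold runs its own search and is scored on held-out data
outer_scores = cross_val_score(search, X, y, scoring="accuracy", cv=outer_cv)
print(outer_scores)
print("Average = %f" % outer_scores.mean())
```

Swapping in RandomForestClassifier with a param_grid over n_estimators would cover the random forest case in the same way.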
Results and analysis
The best C of SVM is 1.000000
The best n_estimator of Ramdom Forest is 100.000000
The best C of SVM is 1.000000
The best n_estimator of Ramdom Forest is 100.000000
The best C of SVM is 10.000000
The best n_estimator of Ramdom Forest is 1000.000000
The best C of SVM is 10.000000
The best n_estimator of Ramdom Forest is 1000.000000
The best C of SVM is 10.000000
The best n_estimator of Ramdom Forest is 100.000000
The best C of SVM is 1.000000
The best n_estimator of Ramdom Forest is 1000.000000
The best C of SVM is 10.000000
The best n_estimator of Ramdom Forest is 1000.000000
The best C of SVM is 1.000000
The best n_estimator of Ramdom Forest is 1000.000000
The best C of SVM is 1.000000
The best n_estimator of Ramdom Forest is 1000.000000
The best C of SVM is 1.000000
The best n_estimator of Ramdom Forest is 1000.000000
Compare Accuracy:
Bayes:
[0.93, 0.92, 0.92, 0.94, 0.92, 0.92, 0.93, 0.9, 0.96, 0.93]
Average = 0.927000
SVM:
[0.96, 0.93, 0.91, 0.95, 0.95, 0.95, 0.94, 0.91, 0.97, 0.93]
Average = 0.940000
Random Forest:
[0.99, 0.96, 0.95, 0.98, 0.97, 0.98, 0.96, 0.96, 0.99, 0.99]
Average = 0.973000
Compare F1-score:
Bayes:
[0.9306930693069307, 0.923076923076923, 0.923076923076923,
0.9433962264150944, 0.923076923076923, 0.9130434782608695,
0.9278350515463918, 0.9038461538461539, 0.962962962962963,
0.9230769230769231]
Average = 0.927408
SVM:
[0.9591836734693877, 0.9320388349514563, 0.9158878504672898,
0.9523809523809524, 0.9504950495049505, 0.945054945054945,
0.9375000000000001, 0.9142857142857143, 0.9724770642201834,
0.9230769230769231]
Average = 0.940238
Random Forest:
[0.9896907216494846, 0.9607843137254902, 0.9514563106796117,
0.9811320754716981, 0.970873786407767, 0.9777777777777777,
0.9591836734693877, 0.96, 0.9909909909909909, 0.9887640449438202]
Average = 0.973065
Compare AUC ROC:
Bayes:
[0.9318910256410255, 0.9199999999999999, 0.9227053140096618,
0.9391025641025641, 0.9195678271308523, 0.9212121212121213,
0.9297719087635054, 0.8995598239295717, 0.9642857142857143,
0.9326298701298701]
Average = 0.928073
SVM:
[0.9607371794871794, 0.93, 0.9102254428341385, 0.9495192307692308,
0.9501800720288115, 0.9505050505050505, 0.9395758303321328,
0.9093637454981992, 0.9732142857142857, 0.9326298701298701]
Average = 0.940595
Random Forest:
[0.9903846153846153, 0.96, 0.9537037037037037, 0.9791666666666667,
0.9697879151660664, 0.9797979797979798, 0.9599839935974389,
0.9603841536614646, 0.9910714285714286, 0.9910714285714285]
Average = 0.973535
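
To judge whether the gaps between the averages are meaningful, it helps to look at the spread across folds as well. The sketch below only re-uses the per-fold accuracy lists printed above; the names accuracy, baseline, and wins are made up for this example. Because all three classifiers are scored on the same outer folds, a fold-by-fold comparison against naive Bayes is a fair one.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the output above
accuracy = {
    "Bayes": [0.93, 0.92, 0.92, 0.94, 0.92, 0.92, 0.93, 0.9, 0.96, 0.93],
    "SVM": [0.96, 0.93, 0.91, 0.95, 0.95, 0.95, 0.94, 0.91, 0.97, 0.93],
    "Random Forest": [0.99, 0.96, 0.95, 0.98, 0.97, 0.98, 0.96, 0.96, 0.99, 0.99],
}

# Mean and sample standard deviation over the 10 folds
for name, scores in accuracy.items():
    print("%s: mean = %.3f, std = %.3f" % (name, mean(scores), stdev(scores)))

# Number of folds in which each model strictly beats naive Bayes
baseline = accuracy["Bayes"]
wins = {}
for name in ("SVM", "Random Forest"):
    wins[name] = sum(s > b for s, b in zip(accuracy[name], baseline))
    print("%s beats Bayes in %d of %d folds" % (name, wins[name], len(baseline)))
```

On these numbers the standard deviations are roughly 0.015 to 0.02. The SVM is ahead of naive Bayes in 8 of the 10 folds and the random forest in all 10.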
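- Comparing the three metrics, the average F1 and ROC AUC scores are slightly higher than the average accuracy for every classifier, but the gap is only about 0.001 or less (for example 0.927000, 0.927408, and 0.928073 for naive Bayes). That is much smaller than the variation between folds, so it is weak evidence at best that accuracy is the stricter metric. The close agreement is expected here: make_classification generates roughly balanced classes by default, and the ROC AUC above is computed from hard class predictions rather than continuous scores.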
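- The SVM and the random forest both outperform naive Bayes on average under all three metrics (average accuracy 0.940 and 0.973 versus 0.927), so on this dataset they are the better choices, with the random forest clearly ahead. The margin between the SVM and naive Bayes is small compared with the variation across folds.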
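
One detail behind the ROC AUC numbers: the code passes the output of predict() to roc_auc_score. With hard 0/1 labels the ROC curve has a single interior point, and the area under it reduces to (TPR + TNR) / 2, which is the balanced accuracy. The sketch below illustrates this on a single train/test split and also computes the AUC from continuous decision_function scores, which is the more common usage. The split, the random_state values, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# random_state is an assumption added for reproducibility
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

clf = SVC(C=1.0, kernel="rbf").fit(X_train, y_train)
labels = clf.predict(X_test)             # hard 0/1 predictions
scores = clf.decision_function(X_test)   # signed distance to the decision boundary

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_scores = roc_auc_score(y_test, scores)
balanced_acc = balanced_accuracy_score(y_test, labels)

print("AUC from hard labels: %f" % auc_from_labels)
print("Balanced accuracy:    %f" % balanced_acc)
print("AUC from scores:      %f" % auc_from_scores)
```

The first two printed values coincide, which is why the AUC averages in this post track accuracy so closely on a roughly balanced dataset.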