Support Vector Machines and Decision Trees - Contents
1 Introduction
1.1 Code Download
The code is available at the links below; stars are welcome!
https://github.com/spareribs/kaggleSpareribs/blob/master/Overdue/ml/code/sklearn_config.py
https://github.com/spareribs/kaggleSpareribs/blob/master/Overdue/ml/code/sklearn_train.py
1.2 How to Use the Code
- [Required] Set the file storage paths in config.py
- [Required] First run base.py in features to preprocess the data [Note: adjust to your actual setup]
- [Optional] Then set the model parameters in code/sklearn_config.py [Note: modify as needed]
- [Required] Finally run code/sklearn_train.py to train the model and output the results
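Before the training step can run, base.py must leave the features on disk in the format sklearn_train.py expects: a pickle file containing the (x_train, y_train) tuple. A minimal sketch of that hand-off; the file name and the toy data are illustrative assumptions, not the repository's actual values:

```python
import pickle

# Hypothetical stand-in for what features/base.py produces: a pickle file
# holding the (x_train, y_train) tuple that sklearn_train.py loads back.
features_path = "features.pkl"  # assumed path; the real one comes from config.py

x_train = [[0.1, 1.2], [0.8, 0.3], [0.5, 0.9], [0.9, 0.1]]  # toy feature rows
y_train = [0, 1, 0, 1]                                      # toy labels

with open(features_path, "wb") as fp:
    pickle.dump((x_train, y_train), fp)

# The training script can then restore the exact same objects:
with open(features_path, "rb") as fp:
    x_loaded, y_loaded = pickle.load(fp)
```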
3 Core Code Walkthrough
3.1 Model Configuration
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier

""" Toggle validation-set evaluation (cross-validation) """
status_vali = False
""" Model parameters """
clfs = {
    'svm': LinearSVC(C=0.5, penalty='l2', dual=True),
    'svm_linear': SVC(kernel='linear', probability=True),
    'svm_ploy': SVC(kernel='poly', probability=True),
    'rf': RandomForestClassifier(n_estimators=10, criterion='gini'),
}
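The dictionary above maps a short key to a pre-configured estimator, so swapping models only means changing one string. A minimal self-contained sketch of how that selection works (the toy data here is an assumption for illustration):

```python
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier

# Same key-to-estimator mapping as in sklearn_config.py.
clfs = {
    'svm': LinearSVC(C=0.5, penalty='l2', dual=True),
    'svm_linear': SVC(kernel='linear', probability=True),
    'svm_ploy': SVC(kernel='poly', probability=True),
    'rf': RandomForestClassifier(n_estimators=10, criterion='gini'),
}

# Toy data, purely for demonstration.
x = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

# Picking a different model is a one-line change of the key.
clf = clfs['rf'].fit(x, y)
acc = clf.score(x, y)
```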
3.2 Model Training
Switch models by choosing one of the keys [ svm, svm_linear, svm_ploy, rf ].
import pickle
from sklearn.metrics import f1_score, roc_auc_score

""" 1 Load the data """
data_fp = open(features_path, 'rb')
x_train, y_train = pickle.load(data_fp)
data_fp.close()
""" 2 Train the classifier; clf_name selects the model """
clf_name = "svm"
clf = clfs[clf_name]
clf.fit(x_train, y_train)
""" 3 Evaluate the model on the validation set """
if status_vali:
    print("Model under test & parameters:\n{0}".format(clf))
    print("=" * 20)
    pre_train = clf.predict(x_train)
    print("Training accuracy: {0:.4f}".format(clf.score(x_train, y_train)))
    print("Training F1 score: {0:.4f}".format(f1_score(y_train, pre_train)))
    print("Training AUC: {0:.4f}".format(roc_auc_score(y_train, pre_train)))
    print("-" * 20)
    pre_vali = clf.predict(x_vali)
    print("Validation accuracy: {0:.4f}".format(clf.score(x_vali, y_vali)))
    print("Validation F1 score: {0:.4f}".format(f1_score(y_vali, pre_vali)))
    print("Validation AUC: {0:.4f}".format(roc_auc_score(y_vali, pre_vali)))
    print("=" * 20)
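The snippet above uses x_vali and y_vali without showing where they come from; presumably the split happens in base.py. A self-contained sketch of how such a split and the same metrics could be produced with scikit-learn (synthetic data, a hypothetical 80/20 split ratio, and the added max_iter are assumptions; names mirror the snippet):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, roc_auc_score

# Synthetic stand-in for the real features; base.py would supply these.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% as the validation set (assumed ratio).
x_train, x_vali, y_train, y_vali = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Same 'svm' configuration as in sklearn_config.py; max_iter raised
# here only to ensure convergence on this toy problem.
clf = LinearSVC(C=0.5, penalty='l2', dual=True, max_iter=5000)
clf.fit(x_train, y_train)

# The three metrics reported in the training script.
pre_vali = clf.predict(x_vali)
vali_acc = clf.score(x_vali, y_vali)
vali_f1 = f1_score(y_vali, pre_vali)
vali_auc = roc_auc_score(y_vali, pre_vali)
```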
3.3 Output Results
3.3.1 LinearSVC
Model under test & parameters:
LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge',
max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)
====================
Training accuracy: 0.8022
Training F1 score: 0.4489
Training AUC: 0.6422
--------------------
Validation accuracy: 0.7954
Validation F1 score: 0.4449
Validation AUC: 0.6396
====================
3.3.2 SVC (linear kernel)
Model under test & parameters:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3,
gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False)
====================
Training accuracy: 0.7977
Training F1 score: 0.3910
Training AUC: 0.6181
--------------------
Validation accuracy: 0.7884
Validation F1 score: 0.3837
Validation AUC: 0.6146
====================
3.3.3 SVC (poly kernel)
Model under test & parameters:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='poly', max_iter=-1, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False)
====================
Training accuracy: 0.8206
Training F1 score: 0.4373
Training AUC: 0.6398
--------------------
Validation accuracy: 0.7526
Validation F1 score: 0.2067
Validation AUC: 0.5482
====================
3.3.4 RandomForestClassifier
Model under test & parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False)
====================
Training accuracy: 0.9814
Training F1 score: 0.9609
Training AUC: 0.9624
--------------------
Validation accuracy: 0.7730
Validation F1 score: 0.3721
Validation AUC: 0.6060
====================
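The random forest fits the training set almost perfectly (accuracy 0.9814) yet drops to 0.7730 on validation, a much larger gap than any of the SVMs show, which suggests overfitting. K-fold cross-validation is one standard way to surface such gaps; a sketch on synthetic data (the dataset, fold count, and random seeds are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Same 'rf' configuration as in the clfs dictionary.
clf = RandomForestClassifier(n_estimators=10, criterion='gini', random_state=0)

# Held-out accuracy, one score per fold.
cv_scores = cross_val_score(clf, X, y, cv=5)

# In-sample accuracy after fitting on everything.
train_score = clf.fit(X, y).score(X, y)

# A large positive gap between in-sample and held-out accuracy
# is the overfitting signature seen in the logs above.
gap = train_score - cv_scores.mean()
```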