Machine Learning Reading Notes (6): Model Evaluation and Hyperparameter Tuning


Hehehe~

I. Streamlining workflows with Pipeline

[Figure: pipeline workflow diagram]
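Before the pipeline can run we need X_train, y_train, X_test, y_test. A minimal setup sketch, assuming the Breast Cancer Wisconsin (WDBC) dataset loaded from UCI as in the book; the URL and the 80/20 stratified split come from that context:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# load the WDBC dataset straight from the UCI repository
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data', header=None)

X = df.loc[:, 2:].values           # 30 numeric features
y = df.loc[:, 1].values            # diagnosis labels 'M'/'B'
le = LabelEncoder()
y = le.fit_transform(y)            # 'B' -> 0, 'M' -> 1

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)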

The code:
#The intermediate steps in a pipeline constitute scikit-learn transformers, and the last one is an estimator.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))

# fit actually calls StandardScaler's fit_transform on the training data,
# then passes the standardized data on to PCA, which behaves the same way
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)  # the final LogisticRegression step does the predicting
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))

II. Holdout cross-validation and k-fold cross-validation

1. Holdout cross-validation

The basic method:

Split the data into three parts: a training set, a validation set, and a test set.
The training set is used to fit different models, the validation set is used for model selection, and the test set is used to obtain a less biased estimate of the model's ability to generalize to unseen data.

Drawback:

The result is very sensitive to how the dataset is partitioned into the three subsets; it is not robust enough (compared with the k-fold cross-validation introduced below).

[Figure: the holdout split into training, validation, and test sets]

2. K-fold cross-validation

The basic method:

Randomly split the dataset into k folds without replacement, use k−1 folds for training and the remaining fold for testing, and repeat this procedure k times to obtain k models and k performance estimates. Averaging them gives an estimate that depends far less on the particular split of the data than holdout cross-validation does.
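In code, the loop looks roughly like this (a plain, unstratified KFold sketch of my own; pipe_lr is the pipeline from section I):

import numpy as np
from sklearn.model_selection import KFold

# k = 10: train on 9 folds, score on the held-out fold, repeat 10 times
kfold10 = KFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kfold10.split(X_train):
    pipe_lr.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(pipe_lr.score(X_train[test_idx], y_train[test_idx]))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(fold_scores), np.std(fold_scores)))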

[Figure: k-fold cross-validation, with the validation fold rotating through the k folds]
Advantage:

Every sample is part of a test fold exactly once (and of a training fold in the other k−1 rounds). Compared with the holdout method, this gives a lower-variance estimate of the model's performance.

A few notes:

  • 1. This method is generally used for hyperparameter tuning; once tuning is done, retrain the model on the whole training dataset to obtain a final performance estimate on the test set.

  • 2. The common default is k = 10. For fairly small datasets we can increase k: more data is used for training in each iteration, which lowers the bias of the estimate (but increases the computation and may lead to higher variance). For large datasets we can decrease k (e.g. k = 5), which correspondingly reduces the computation.

  • 3. For extremely small datasets, the extreme case of k-fold is k = n, known as leave-one-out (LOO) cross-validation, where each test fold contains exactly one sample; a minimal sketch follows.
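A minimal LOO sketch (my own illustration; LeaveOneOut behaves like KFold with n_splits equal to the number of samples):

from sklearn.model_selection import LeaveOneOut, cross_val_score

# each of the n iterations tests on exactly one held-out sample
loo_scores = cross_val_score(pipe_lr, X_train, y_train,
                             cv=LeaveOneOut(), n_jobs=-1)
print('LOO accuracy: %.3f' % loo_scores.mean())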

3. Stratified k-fold

The basic method:
In stratified cross-validation, the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the training dataset. (That is, each fold contains the different classes in the same proportions as the whole set, hence "stratified"; it brings to mind the stratify parameter of train_test_split().)

import numpy as np
from sklearn.model_selection import StratifiedKFold


# note: in recent scikit-learn, random_state may only be set together with
# shuffle=True; without shuffling the splits are deterministic anyway
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)

scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,
          np.bincount(y_train[train]), score))
    
print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Output:

Fold:  1, Class dist.: [256 153], Acc: 0.935
Fold:  2, Class dist.: [256 153], Acc: 0.935
Fold:  3, Class dist.: [256 153], Acc: 0.957
Fold:  4, Class dist.: [256 153], Acc: 0.957
Fold:  5, Class dist.: [256 153], Acc: 0.935
Fold:  6, Class dist.: [257 153], Acc: 0.956
Fold:  7, Class dist.: [257 153], Acc: 0.978
Fold:  8, Class dist.: [257 153], Acc: 0.933
Fold:  9, Class dist.: [257 153], Acc: 0.956
Fold: 10, Class dist.: [257 153], Acc: 0.956

CV accuracy: 0.950 +/- 0.014

A more concise way to run stratified k-fold cross-validation:
# (Honestly, there doesn't seem to be a stratification parameter among the inputs, but looking at the source on GitHub, cross_val_score defaults to stratified splitting for classifiers anyway, so fine... ヽ(ー_ー)ノ)

from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=pipe_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=-1)  # number of parallel jobs (CPU cores); -1 uses all available cores
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Output:

CV accuracy scores: [0.93478261 0.93478261 0.95652174 0.95652174 0.93478261 0.95555556
 0.97777778 0.93333333 0.95555556 0.95555556]
CV accuracy: 0.950 +/- 0.014

III. Diagnosing bias and variance problems with learning curves

1. Illustration of bias and variance

First, consider the following figure.

[Figure: schematic learning curves for high bias (top left), high variance (top right), and a good bias-variance tradeoff]

In general:

  • When a model is too complex relative to the training data (there are too many degrees of freedom or parameters in the model), it tends to overfit and lacks the ability to generalize, so it performs poorly on the validation or test set. On the plot, the validation accuracy curve (the blue line) stays below the training accuracy curve.

  • Increasing the number of training samples often reduces the degree of overfitting: in every panel, the two curves converge as the x-axis grows. In practice, though, collecting more data is often infeasible or uneconomical.

Top left: underfitting (high bias)
Both the training and validation accuracy are low, which indicates underfitting. The fix is to increase the model's complexity, e.g. collect or engineer more features, or reduce the degree of regularization (increase C, i.e. decrease λ).

Top right: overfitting (high variance)
There is a large gap between the training and validation accuracy, which indicates overfitting. (The model predicts the training data very well but the validation data much worse, and the gap narrows as the sample size grows.) The fix is to reduce the model's complexity or collect more training data, e.g. increase the regularization strength, select features with SBS, or extract features with PCA or LDA.

Now let's draw an actual learning curve on our own dataset:

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty='l2', random_state=1,solver = 'lbfgs'))

# the returned train_sizes are the absolute numbers of training samples used,
# not the fractions passed in via the train_sizes argument
train_sizes, train_scores, test_scores =\
                learning_curve(estimator=pipe_lr,
                               X=X_train,
                               y=y_train,
                               train_sizes=np.linspace(0.1, 1.0, 10),
                               cv=10,
                               n_jobs=1)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
# standard deviations, for drawing the shaded bands
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(train_sizes, train_mean,
         color='blue', marker='o',
         markersize=5, label='training accuracy')
# shaded band: mean ± one standard deviation
plt.fill_between(train_sizes,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.15, color='blue')

plt.plot(train_sizes, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='validation accuracy')

plt.fill_between(train_sizes,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.8, 1.03])
plt.tight_layout()
#plt.savefig('images/06_05.png', dpi=300)
plt.show()

[Figure: learning curve for the pipeline, with shaded ±1 std bands]

IV. Addressing overfitting and underfitting with validation curves

1. Basic version: tuning with sklearn's validation_curve

Unlike the plots above, which use the number of training samples as the independent variable, here we focus on how tuning a parameter improves the validation accuracy. We therefore modify the code slightly: keeping the number of samples fixed, we plot accuracy against the inverse regularization parameter C:

from sklearn.model_selection import validation_curve


param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
                estimator=pipe_lr, 
                X=X_train, 
                y=y_train, 
                param_name='logisticregression__C', 
                param_range=param_range,
                cv=10)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(param_range, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='training accuracy')

plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_range, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='validation accuracy')

plt.fill_between(param_range, 
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')

plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1.0])
plt.tight_layout()
# plt.savefig('images/06_06.png', dpi=300)
plt.show()

As shown:

[Figure: validation curve, accuracy vs. the regularization parameter C on a log scale]

We can see that when C is small, i.e. regularization is strong, the model underfits. As C keeps growing, meaning regularization weakens, the model starts to show signs of overfitting. The sweet spot here is C = 1.
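Incidentally, the param_name 'logisticregression__C' follows scikit-learn's pipeline naming convention: the lowercased step name, a double underscore, then the parameter name. A quick sketch of my own that pins the sweet spot and checks it on the test set:

# set the tuned value via the pipeline's <step>__<param> convention
pipe_lr.set_params(logisticregression__C=1.0)
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))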

2. Upgraded version: tuning with grid search

First, let's distinguish the two kinds of parameters in machine learning: the first kind are determined from the training data, such as the weights in logistic regression; the second kind are set by us and not affected by the training process, i.e. hyperparameters (also called tuning parameters), such as the regularization parameter of logistic regression or the depth of a decision tree. A small illustration follows.
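Here's a tiny sketch of my own making the distinction concrete (reusing X_train and y_train from above):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# C is a hyperparameter: we choose it before training
lr_pipe = make_pipeline(StandardScaler(),
                        LogisticRegression(C=100.0, solver='lbfgs'))
lr_pipe.fit(X_train, y_train)

# the weights are parameters: fit() learns them from the training data
print(lr_pipe.named_steps['logisticregression'].coef_)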

Grid search is nothing fancy: it's a brute-force exhaustive search paradigm. It simply tries every combination of the candidate parameter values and finds the best-performing one.
The code:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe_svc = make_pipeline(StandardScaler(),
                         SVC(random_state=1))

param_range = [0.0001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'svc__C': param_range, 
               'svc__kernel': ['linear']},
              {'svc__C': param_range, 
               'svc__gamma': param_range, 
               'svc__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

Output:

0.9846153846153847
{'svc__C': 1000.0, 'svc__gamma': 0.0001, 'svc__kernel': 'rbf'}

Finally, we plug the test set into the best model.
The code:

clf = gs.best_estimator_

# clf.fit(X_train, y_train)
# Note that the line above is not necessary, because
# the best_estimator_ will already be refit to the complete
# training set because of the refit=True setting in GridSearchCV
# (refit=True by default). Thanks to a reader, German Martinez,
# for pointing it out.
# this reader is pretty sharp, catching even the book's errata
print('Test accuracy: %.3f' % clf.score(X_test, y_test))

Output:

Test accuracy: 0.974

3. Nested cross-validation

Used for choosing between algorithms. In one study, Varma and Simon concluded that the true error of the estimate is almost unbiased relative to the test set when nested cross-validation is used. (I don't know exactly why; in any case, it's the right tool for comparing algorithms.)

How it works:

[Figure: nested cross-validation, with an outer loop wrapped around an inner cross-validation loop]

Inner loop: model selection; feature engineering on the data can be done here.
Outer loop: model evaluation. The outer split is taken over the whole dataset rather than only the training set, and Stratified K-Fold keeps the class proportions unchanged. Each outer fold is trained with the best parameter combination found by the inner loop.
# Reference for the above:
# Author: 行走的程序猿
# Link: https://www.jianshu.com/p/cdf6df99b44b

Now let's compare the performance of an SVM and a decision tree:

scikit-learn implementation:

1. SVM

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2)

scores = cross_val_score(gs, X_train, y_train, 
                         scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
                                      np.std(scores)))

Output:

CV accuracy: 0.974 +/- 0.015

2. Decision tree

from sklearn.tree import DecisionTreeClassifier

gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                  scoring='accuracy',
                  cv=2)

scores = cross_val_score(gs, X_train, y_train, 
                         scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), 
                                      np.std(scores)))

Output:

CV accuracy: 0.934 +/- 0.016

V. Performance evaluation metrics

The figure explains it clearly~~

[Figure: the confusion matrix: true positives (TP), false negatives (FN), false positives (FP), true negatives (TN)]
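In case the figure doesn't render, the standard definitions derived from it are:

$$\mathrm{ERR} = \frac{FP + FN}{FP + FN + TP + TN}, \qquad \mathrm{ACC} = \frac{TP + TN}{FP + FN + TP + TN} = 1 - \mathrm{ERR}$$

$$\mathrm{PRE} = \frac{TP}{TP + FP}, \qquad \mathrm{REC} = \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$$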
scikit-learn's ready-made helpers are quite convenient:

from sklearn.metrics import confusion_matrix

pipe_svc.fit(X_train, y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)

Output:

[[71  1]
 [ 2 40]]

Let's prettify it:

fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')

plt.xlabel('Predicted label')
plt.ylabel('True label')

plt.tight_layout()
#plt.savefig('images/06_09.png', dpi=300)
plt.show()

[Figure: the confusion matrix rendered with matshow]

from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1: %.3f' % f1_score(y_true=y_test, y_pred=y_pred))

Output:

Precision: 0.976
Recall: 0.952
F1: 0.964

We can now swap in a different scoring metric for the grid search above (using f1_score as the metric, with class 0 as the positive label):

from sklearn.metrics import make_scorer

scorer = make_scorer(f1_score, pos_label=0)

c_gamma_range = [0.01, 0.1, 1.0, 10.0]

param_grid = [{'svc__C': c_gamma_range,
               'svc__kernel': ['linear']},
              {'svc__C': c_gamma_range,
               'svc__gamma': c_gamma_range,
               'svc__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring=scorer,
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

Output:

0.9862021456964396
{'svc__C': 10.0, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}

VI. ROC curves (Receiver Operating Characteristic)

Plot the ROC curve with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis:

from sklearn.metrics import roc_curve, auc
# note: scipy.interp has been removed from recent SciPy; np.interp replaces it

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(penalty='l2',
                                           random_state=1,
                                           solver='lbfgs',
                                           C=100.0))

X_train2 = X_train[:, [4, 14]]  # only two features, to make the task more challenging

cv = list(StratifiedKFold(n_splits=3).split(X_train, y_train))

fig = plt.figure(figsize=(7, 5))

mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
all_tpr = []

for i, (train, test) in enumerate(cv):
    probas = pipe_lr.fit(X_train2[train],
                         y_train[train]).predict_proba(X_train2[test])

    fpr, tpr, thresholds = roc_curve(y_train[test],
                                     probas[:, 1],
                                     pos_label=1)
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr,
             tpr,
             label='ROC fold %d (area = %0.2f)'
                   % (i+1, roc_auc))

plt.plot([0, 1],
         [0, 1],
         linestyle='--',
         color=(0.6, 0.6, 0.6),
         label='random guessing')

mean_tpr /= len(cv)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
         label='mean ROC (area = %0.2f)' % mean_auc, lw=2)
plt.plot([0, 0, 1],
         [0, 1, 1],
         linestyle=':',
         color='black',
         label='perfect performance')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend(loc="lower right")

plt.tight_layout()
# plt.savefig('images/06_10.png', dpi=300)
plt.show()

[Figure: per-fold ROC curves, mean ROC, the random-guessing diagonal, and the perfect-performance reference]
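If only the scalar AUC is needed, a quick follow-up sketch (my own; it refits on the same two feature columns used above):

from sklearn.metrics import roc_auc_score

pipe_lr = pipe_lr.fit(X_train2, y_train)
y_score = pipe_lr.predict_proba(X_test[:, [4, 14]])[:, 1]
print('ROC AUC: %.3f' % roc_auc_score(y_true=y_test, y_score=y_score))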
