Machine Learning: Model Evaluation and sklearn Implementation (Part 2): Cross-Validation

I. Introduction

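S-fold cross-validation: the data is randomly partitioned into S mutually disjoint subsets of equal size; S-1 of the subsets are used to train the model, and the remaining subset is used to test it. Because there are S ways to choose the test subset, the procedure is repeated for each of the S combinations and the resulting test errors are averaged. This mean serves as the estimate of the generalization error.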
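The procedure above can be sketched by hand with NumPy alone. This is a minimal illustration, not how sklearn implements it: the helper name `s_fold_error`, the least-squares model, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def s_fold_error(X, y, S=5, seed=0):
    """Estimate the generalization error (MSE) of least-squares regression by S-fold CV.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # random order of the sample indices
    folds = np.array_split(indices, S)     # S disjoint subsets of (near-)equal size
    errors = []
    for i in range(S):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
        # Train on the S-1 remaining subsets
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # Test on the held-out subset
        residual = X[test_idx] @ w - y[test_idx]
        errors.append(np.mean(residual ** 2))
    # The mean of the S test errors is the estimate of the generalization error
    return float(np.mean(errors))

# Synthetic linear data with a small amount of noise (assumed for the example)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv_error = s_fold_error(X, y, S=5)
print("Estimated generalization error (MSE):", cv_error)
```

The estimate should land near the noise variance (0.01 here), since the model family matches the data-generating process.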

II. KFold

1. Prototype

sklearn.model_selection.KFold(n_splits=3,shuffle=False,random_state=None)

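2. Parameters

n_splits: an integer k, the number of folds to split the dataset into. The default of 3 shown in the prototype comes from older releases; since scikit-learn 0.22 the default is 5.
shuffle: a bool. If True, the dataset is shuffled before it is split.
random_state: an integer, a RandomState instance, or None. It only takes effect when shuffle=True.

If an integer, it specifies the seed of the random number generator.

If a RandomState instance, it specifies the random number generator itself.

If None, the default random number generator is used.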
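As a small check of what an integer random_state gives you, the sketch below builds two shuffled splitters with the same seed and compares their test folds. The helper name `shuffled_test_folds` and the toy array are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

def shuffled_test_folds(seed):
    """Return the test indices of each fold for a shuffled 5-fold split (illustrative helper)."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kf.split(X)]

same_a = shuffled_test_folds(0)
same_b = shuffled_test_folds(0)

# The same integer seed reproduces exactly the same partition
print(same_a == same_b)
```

Fixing the seed is what makes a shuffled cross-validation run repeatable across executions.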

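3. Purpose

For a dataset with n samples, KFold first divides the integers from 0 to n-1, in order, evenly into n_splits parts, and on each iteration takes the next part as the indices of the test set. To pick the test indices at random rather than sequentially, shuffle the data before splitting by setting shuffle=True.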
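When n is not divisible by n_splits the parts cannot all be the same size. The sketch below, using an assumed toy dataset of 10 samples, shows how the sequential (non-shuffled) split behaves in that case: the first n % n_splits folds receive one extra sample.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((10, 1))        # 10 samples; the feature values are irrelevant to the split
kf = KFold(n_splits=3)       # shuffle defaults to False, so folds are contiguous

folds = [test.tolist() for _, test in kf.split(X)]
for fold in folds:
    print(fold)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```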

4. Example Code

Generate data

from sklearn.model_selection import KFold
import numpy as np
X = np.random.rand(9,4)
y = np.array([1,1,0,0,1,1,0,0,1])

KFold example (no shuffling)

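# random_state is omitted: it has no effect without shuffling, and recent
# scikit-learn releases raise an error if it is set while shuffle=False
folder = KFold(n_splits=3,shuffle=False)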
for train_index,test_index in folder.split(X,y):
    print("Train Index:",train_index)
    print("Test Index:",test_index)
    print("X_train:",X[train_index])
    print("X_test:",X[test_index])
    print("")

Train Index: [3 4 5 6 7 8]
Test Index: [0 1 2]
X_train: [[ 0.13079725  0.48578664  0.64161516  0.16668596]
 [ 0.4999551   0.41196095  0.87824022  0.58348625]
 [ 0.25872091  0.73951121  0.04957464  0.45203743]
 [ 0.72628999  0.52417452  0.06881971  0.95963271]
 [ 0.02276032  0.98144591  0.37960828  0.61095952]
 [ 0.41491324  0.42039075  0.95688853  0.15339434]]
X_test: [[ 0.17697103  0.42337491  0.44060735  0.12488469]
 [ 0.54331568  0.63086644  0.02425023  0.00419293]
 [ 0.37441732  0.27994645  0.7224304   0.82671591]]

Train Index: [0 1 2 6 7 8]
Test Index: [3 4 5]
X_train: [[ 0.17697103  0.42337491  0.44060735  0.12488469]
 [ 0.54331568  0.63086644  0.02425023  0.00419293]
 [ 0.37441732  0.27994645  0.7224304   0.82671591]
 [ 0.72628999  0.52417452  0.06881971  0.95963271]
 [ 0.02276032  0.98144591  0.37960828  0.61095952]
 [ 0.41491324  0.42039075  0.95688853  0.15339434]]
X_test: [[ 0.13079725  0.48578664  0.64161516  0.16668596]
 [ 0.4999551   0.41196095  0.87824022  0.58348625]
 [ 0.25872091  0.73951121  0.04957464  0.45203743]]

Train Index: [0 1 2 3 4 5]
Test Index: [6 7 8]
X_train: [[ 0.17697103  0.42337491  0.44060735  0.12488469]
 [ 0.54331568  0.63086644  0.02425023  0.00419293]
 [ 0.37441732  0.27994645  0.7224304   0.82671591]
 [ 0.13079725  0.48578664  0.64161516  0.16668596]
 [ 0.4999551   0.41196095  0.87824022  0.58348625]
 [ 0.25872091  0.73951121  0.04957464  0.45203743]]
X_test: [[ 0.72628999  0.52417452  0.06881971  0.95963271]
 [ 0.02276032  0.98144591  0.37960828  0.61095952]
 [ 0.41491324  0.42039075  0.95688853  0.15339434]]

KFold example (with shuffling)

folder = KFold(n_splits=3,random_state=0,shuffle=True)
for train_index,test_index in folder.split(X,y):
    print("Shuffled Train Index:",train_index)
    print("Shuffled Test Index:",test_index)
    print("Shuffled X_train:",X[train_index])
    print("Shuffled X_test:",X[test_index])
    print("")

Shuffled Train Index: [0 3 4 5 6 8]
Shuffled Test Index: [1 2 7]
Shuffled X_train: [[ 0.17697103  0.42337491  0.44060735  0.12488469]
 [ 0.13079725  0.48578664  0.64161516  0.16668596]
 [ 0.4999551   0.41196095  0.87824022  0.58348625]
 [ 0.25872091  0.73951121  0.04957464  0.45203743]
 [ 0.72628999  0.52417452  0.06881971  0.95963271]
 [ 0.41491324  0.42039075  0.95688853  0.15339434]]
Shuffled X_test: [[ 0.54331568  0.63086644  0.02425023  0.00419293]
 [ 0.37441732  0.27994645  0.7224304   0.82671591]
 [ 0.02276032  0.98144591  0.37960828  0.61095952]]

Shuffled Train Index: [0 1 2 3 5 7]
Shuffled Test Index: [4 6 8]
Shuffled X_train: [[ 0.17697103  0.42337491  0.44060735  0.12488469]
 [ 0.54331568  0.63086644  0.02425023  0.00419293]
 [ 0.37441732  0.27994645  0.7224304   0.82671591]
 [ 0.13079725  0.48578664  0.64161516  0.16668596]
 [ 0.25872091  0.73951121  0.04957464  0.45203743]
 [ 0.02276032  0.98144591  0.37960828  0.61095952]]
Shuffled X_test: [[ 0.4999551   0.41196095  0.87824022  0.58348625]
 [ 0.72628999  0.52417452  0.06881971  0.95963271]
 [ 0.41491324  0.42039075  0.95688853  0.15339434]]

Shuffled Train Index: [1 2 4 6 7 8]
Shuffled Test Index: [0 3 5]
Shuffled X_train: [[ 0.54331568  0.63086644  0.02425023  0.00419293]
 [ 0.37441732  0.27994645  0.7224304   0.82671591]
 [ 0.4999551   0.41196095  0.87824022  0.58348625]
 [ 0.72628999  0.52417452  0.06881971  0.95963271]
 [ 0.02276032  0.98144591  0.37960828  0.61095952]
 [ 0.41491324  0.42039075  0.95688853  0.15339434]]
Shuffled X_test: [[ 0.17697103  0.42337491  0.44060735  0.12488469]
 [ 0.13079725  0.48578664  0.64161516  0.16668596]
 [ 0.25872091  0.73951121  0.04957464  0.45203743]]

III. StratifiedKFold

1. Prototype

sklearn.model_selection.StratifiedKFold(n_splits=3,shuffle=False,random_state=None)

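2. Parameters

n_splits: an integer k, the number of folds to split the dataset into. The default of 3 shown in the prototype comes from older releases; since scikit-learn 0.22 the default is 5.
shuffle: a bool. If True, the dataset is shuffled before it is split.
random_state: an integer, a RandomState instance, or None. It only takes effect when shuffle=True.

If an integer, it specifies the seed of the random number generator.

If a RandomState instance, it specifies the random number generator itself.

If None, the default random number generator is used.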

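3. Purpose

The stratified-sampling version of KFold: each fold keeps approximately the same class proportions as the whole dataset.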
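Stratification matters most when the classes are imbalanced. The sketch below uses an assumed toy label vector with a 3:1 class ratio and counts the labels in each test fold; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 12 samples of class 0 and 4 samples of class 1 (a 3:1 ratio)
y = np.array([0] * 12 + [1] * 4)
X = np.zeros((len(y), 1))    # the splitter only needs the number of samples, not the values

skf = StratifiedKFold(n_splits=4)
fold_class_counts = [
    np.bincount(y[test], minlength=2).tolist() for _, test in skf.split(X, y)
]
print(fold_class_counts)
# [[3, 1], [3, 1], [3, 1], [3, 1]]
```

Every test fold keeps the 3:1 ratio of the full dataset, whereas a plain sequential KFold on the same labels would place all four class-1 samples in the last fold.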

4. Example Code

Generate data

from sklearn.model_selection import KFold,StratifiedKFold
import numpy as np
X = np.random.rand(8,4)
y = np.array([1,1,0,0,1,1,0,0])

KFold

# random_state is omitted because it has no effect when shuffle=False
folder = KFold(n_splits=4,shuffle=False)
for train_index,test_index in folder.split(X,y):
    print("Train Index:",train_index)
    print("Test Index:",test_index)
    print("y_train:",y[train_index])
    print("y_test:",y[test_index])
    print("")

Train Index: [2 3 4 5 6 7]
Test Index: [0 1]
y_train: [0 0 1 1 0 0]
y_test: [1 1]

Train Index: [0 1 4 5 6 7]
Test Index: [2 3]
y_train: [1 1 1 1 0 0]
y_test: [0 0]

Train Index: [0 1 2 3 6 7]
Test Index: [4 5]
y_train: [1 1 0 0 0 0]
y_test: [1 1]

Train Index: [0 1 2 3 4 5]
Test Index: [6 7]
y_train: [1 1 0 0 1 1]
y_test: [0 0]

StratifiedKFold

# random_state is omitted because it has no effect when shuffle=False
stratified_folder = StratifiedKFold(n_splits=4,shuffle=False)
for train_index,test_index in stratified_folder.split(X,y):
    print("Stratified Train Index:",train_index)
    print("Stratified Test Index:",test_index)
    # The labels y are distributed more evenly than with KFold
    print("Stratified y_train:",y[train_index])
    print("Stratified y_test:",y[test_index])
    print("")

Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

IV. cross_val_score

1. Prototype

sklearn.model_selection.cross_val_score(estimator,X,y=None,scoring=None,cv=None,n_jobs=1,verbose=0,fit_params=None,pre_dispatch='2*n_jobs')

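2. Parameters

estimator: the estimator to evaluate; it must have a .fit method for training.
X: the samples of the dataset.
y: the labels of the dataset.
scoring: a string, a callable, or None. It specifies the scoring function, whose prototype is scorer(estimator,X,y). If None, the estimator's own .score method is used. If a string, it can be one of the following.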

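'accuracy': uses the metrics.accuracy_score scoring function.

'average_precision': uses the metrics.average_precision_score scoring function.

The f1 family: uses the metrics.f1_score scoring function.

'neg_log_loss' (named 'log_loss' in older releases): uses the metrics.log_loss scoring function, negated so that a larger value is better.

The 'precision' family: uses the metrics.precision_score scoring function.

The 'recall' family: uses the metrics.recall_score scoring function.

'roc_auc': uses the metrics.roc_auc_score scoring function.

'adjusted_rand_score': uses the metrics.adjusted_rand_score scoring function.

'neg_mean_absolute_error' (named 'mean_absolute_error' in older releases): uses the metrics.mean_absolute_error scoring function, negated so that a larger value is better.

'neg_mean_squared_error' (named 'mean_squared_error' in older releases): uses the metrics.mean_squared_error scoring function, negated so that a larger value is better.

'r2': uses the metrics.r2_score scoring function.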
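cv: an integer, a k-fold cross-validation generator, an iterator, or None.
If None, the default k-fold generator is used (3 folds in older releases, 5 folds since scikit-learn 0.22).

If an integer, it specifies the value of k for the k-fold generator.

If a k-fold cross-validation generator, that generator is used directly.

If an iterator, the splits it yields are used as the partition of the dataset.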

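fit_params: a dict of keyword arguments passed to the estimator's .fit method.

n_jobs: the degree of parallelism. The prototype's default of 1 runs on a single CPU; -1 dispatches the tasks to all CPUs of the machine.

verbose: an integer that controls the verbosity of the log output.

pre_dispatch: an integer or a string that controls the total number of tasks dispatched during parallel execution.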
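To make the scoring and cv parameters concrete, here is a minimal sketch that passes a scoring string and an explicit splitter object. The iris dataset, LogisticRegression, and the variable names are assumptions chosen to keep the example small; they do not come from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cv given as a k-fold generator object: shuffled, stratified, reproducible
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=splitter)

# cv given as an integer, with a member of the f1 family as the scoring string
f1_scores = cross_val_score(clf, X, y, scoring="f1_macro", cv=5)

print("Accuracy per fold:", acc_scores)
print("Macro F1 per fold:", f1_scores)
```

Passing a splitter object rather than an integer gives control over shuffling and the seed, which an integer cv does not expose.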

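3. Purpose

A convenience function: it runs the specified estimator on the specified dataset with k-fold cross-validation and returns an array of scores, one per fold.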

4. Example Code

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
digits = load_digits()
X = digits.data
y = digits.target
result = cross_val_score(LinearSVC(),X,y,cv=10)
print("Cross Val Score is:",result)

Cross Val Score is: [ 0.9027027   0.95081967  0.89502762  0.88888889  0.93296089  0.97206704
  0.96648045  0.93820225  0.85875706  0.9375    ]
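Per-fold scores like those printed above are usually summarized by their mean and standard deviation. A minimal sketch follows; it swaps in KNeighborsClassifier (an assumption, chosen only because it trains quickly and needs no convergence tuning) in place of LinearSVC.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# One accuracy score per fold, 10 folds in total
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)

mean_score = scores.mean()   # point estimate of the generalization accuracy
std_score = scores.std()     # spread across folds, a rough indication of stability
print("Mean accuracy: %.3f (+/- %.3f)" % (mean_score, std_score))
```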