Data splitting: hold-out method (train_test_split), leave-one-out (LeaveOneOut), grid search with cross-validation (GridSearchCV), Bootstrap



1.10 Cross-validation and grid search

Learning objectives

  • Aims
    • Know the concepts of cross-validation and grid search
    • Use cross-validation and grid search to tune and optimize the trained model

What is cross-validation (cross validation)?

Cross-validation: take the training data and divide it into a training set and a validation set. For example, divide the data into 4 parts and use one of them as the validation set. The test is then run 4 times (rounds), swapping in a different part as the validation set each time. The 4 sets of model results are averaged to give the final result. This is also known as 4-fold cross-validation.
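As an illustrative sketch (not from the original text; the dataset and model here are just examples), the 4-fold procedure described above can be reproduced with sklearn's KFold and cross_val_score:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=0)  # 4 folds, each used once as the validation set
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
print(scores)          # one score per fold (4 in total)
print(scores.mean())   # the average is taken as the final result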

1.1 Analysis

We already know that the data is divided into a training set and a test set. To obtain a more accurate and reliable evaluation of the model from training, the data is handled as follows:

  • Training data: training set + validation set
  • Test data: test set

1.2 Why do we need cross-validation

The purpose of cross-validation: to make the evaluation of the model more accurate and reliable.

Question: this only makes the evaluation of the model more trustworthy; how, then, do we select or tune the parameters?

In general, many parameters must be specified manually (such as the value of K in the k-nearest neighbors algorithm); these are called hyperparameters. Manual tuning is tedious, however, so we preset several combinations of hyperparameters for the model. Each hyperparameter combination is evaluated with cross-validation, and finally the optimal combination of parameters is selected to build the model.
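To make this concrete, here is a minimal hand-rolled sketch (my own illustration, not the author's code; the candidate K values are arbitrary) that scores each candidate with cross-validation and keeps the best one. This is exactly the loop that GridSearchCV, introduced below, automates:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Preset several hyperparameter candidates and score each with cross-validation
results = {}
for k in [1, 3, 5]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=4)
    results[k] = scores.mean()

best_k = max(results, key=results.get)
print(results, "best k:", best_k)  # the best combination is then used to build the final model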

3 Cross-validation and grid search (model selection and tuning) API

  • sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
    • Performs an exhaustive search over the specified parameter values for the estimator
    • estimator: the estimator object
    • param_grid: estimator parameters as a dict, e.g. {"n_neighbors": [1, 3, 5]}
    • cv: number of cross-validation folds
    • fit: input the training data
    • score: accuracy
    • Result analysis:
      • best_score_: the best validation score obtained during cross-validation
      • best_estimator_: the model with the best parameters
      • cv_results_: the validation-set and training-set accuracy for each cross-validation round

4 Case: iris classification with added K-value tuning

  • Use GridSearchCV to build the estimator
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the dataset
iris = load_iris()
# 2. Basic data processing -- split the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
# 3. Feature engineering: standardization
# Instantiate a transformer
transfer = StandardScaler()
# Fit on the training set, then apply the same scaling to the test set
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# 4. KNN estimator workflow
# 4.1 Instantiate the estimator
estimator = KNeighborsClassifier()

# 4.2 Model selection and tuning -- grid search and cross-validation
# Prepare the hyperparameters to tune
param_dict = {"n_neighbors": [1, 3, 5]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
# 4.3 Fit the training data
estimator.fit(x_train, y_train)
# 5. Evaluate the model
# Method a: compare the predictions with the true values
y_predict = estimator.predict(x_test)
print("Predictions compared with true values:\n", y_predict == y_test)
# Method b: compute the accuracy directly
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)
  • Inspect the finally selected parameters and the cross-validation results
print("在交叉验证中验证的最好结果:\n", estimator.best_score_)
print("最好的参数模型:\n", estimator.best_estimator_)
print("每次交叉验证后的准确率结果:\n", estimator.cv_results_)
  • Final Results
Predictions compared with true values:
 [ True  True  True  True  True  True  True False  True  True  True  True
  True  True  True  True  True  True False  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]
Accuracy:
 0.947368421053
Best cross-validation score:
 0.973214285714
Best parameter model:
 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Accuracy results for each cross-validation round:
 {'mean_fit_time': array([ 0.00114751,  0.00027037,  0.00024462]), 'std_fit_time': array([  1.13901511e-03,   1.25300249e-05,   1.11011951e-05]), 'mean_score_time': array([ 0.00085751,  0.00048693,  0.00045625]), 'std_score_time': array([  3.52785082e-04,   2.87650037e-05,   5.29673344e-06]), 'param_n_neighbors': masked_array(data = [1 3 5],
             mask = [False False False],
       fill_value = ?)
, 'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}], 'split0_test_score': array([ 0.97368421,  0.97368421,  0.97368421]), 'split1_test_score': array([ 0.97297297,  0.97297297,  0.97297297]), 'split2_test_score': array([ 0.94594595,  0.89189189,  0.97297297]), 'mean_test_score': array([ 0.96428571,  0.94642857,  0.97321429]), 'std_test_score': array([ 0.01288472,  0.03830641,  0.00033675]), 'rank_test_score': array([2, 3, 1], dtype=int32), 'split0_train_score': array([ 1.        ,  0.95945946,  0.97297297]), 'split1_train_score': array([ 1.        ,  0.96      ,  0.97333333]), 'split2_train_score': array([ 1.  ,  0.96,  0.96]), 'mean_train_score': array([ 1.        ,  0.95981982,  0.96876877]), 'std_train_score': array([ 0.        ,  0.00025481,  0.0062022 ])}

5 Summary

  • Cross-validation [know]
    • Definition:
      • The training data is divided into a training set and a validation set
      • *-fold cross-validation
    • Split:
      • Training data: training set + validation set
      • Test data: test set
    • Why we need cross-validation
      • To make the evaluation of the model more accurate and reliable
      • Note: cross-validation does not by itself improve the accuracy of the model
  • Grid search [know]
    • Hyperparameters:
      • In sklearn, parameters that must be specified manually are called hyperparameters
    • Grid search passes the candidate values of these hyperparameters in the form of a dictionary and then selects the optimal value
  • API [know]
    • sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
      • estimator: the model to train
      • param_grid: the hyperparameters to pass in
      • cv: number of cross-validation folds

Chapter 1 supplementary knowledge: revisiting data splitting

As mentioned earlier, we can evaluate a learner's generalization error through experimental tests and then make a choice accordingly.

For this purpose, a "testing set" is used to test the learner's ability to discriminate new samples, and the "testing error" on the test set is taken as an approximation of the generalization error.

We usually assume that the test samples are drawn independently and identically distributed from the true sample distribution. Note, however, that the test set should be mutually exclusive with the training set as far as possible.

Mutually exclusive means that the test samples should, as far as possible, not appear in the training set and not be used during the training process.

Why should the test samples not appear in the training set? Consider the following scenario:

A teacher gives students 10 exercises to practice and then uses the same 10 questions in the exam. Can the exam results effectively reflect how well the students have learned?

The answer is no: some students might get a high score even though they can only do these 10 questions.

Returning to our problem, we want a model with good generalization performance, just as we want students to learn the course well and acquire the ability to "draw inferences from one example"; the training samples correspond to the practice exercises, and the testing process corresponds to the exam. Clearly, if the test samples were also used for training, the resulting estimate would be overly "optimistic".

However, we only have a single dataset D containing m samples.

It has to serve both for training and for testing; how can we manage that?

  • The answer is: by appropriately processing D, we obtain a training set S and a test set T from it. (This is what we have been doing all along.)

Below we summarize some common approaches:

  • Hold-out method
  • Cross-validation
  • Bootstrapping

1 Hold-out method

When using the hold-out method, note that the training/test split should preserve the consistency of the data distribution as much as possible, to avoid introducing extra bias during partitioning that would affect the final result; for example, in a classification task, at least the class proportions of the samples should be kept similar.

If we view the splitting of the dataset from the perspective of sampling, a sampling method that preserves class proportions is usually called "stratified sampling".

For example, suppose we obtain, by stratified sampling from D, a training set S containing 70% of the samples and a test set T containing 30% of the samples.

If D contains 500 positive examples and 500 negative examples, then the stratified S should contain 350 positive and 350 negative examples, and T should contain 150 positive and 150 negative examples.

If the class proportions in S and T differ greatly, the difference between the training and test data distributions will bias the error estimate.
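In sklearn, this kind of stratified hold-out split can be requested with the stratify argument of train_test_split. The sketch below is my own illustration; the 500/500 dataset is a synthetic stand-in for the example above:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 positive and 500 negative samples
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 500 + [0] * 500)

# stratify=y keeps the class proportions identical in S and T
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print((y_train == 1).sum(), (y_train == 0).sum())  # 350 350
print((y_test == 1).sum(), (y_test == 0).sum())    # 150 150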

Another issue worth noting is that, even with the training/test ratio fixed, there are still many different ways to split the initial dataset D.

For example, in the case above, the samples in D could be sorted and the first 350 positive examples put into the training set, or the last 350 positive examples put into the training set; these choices produce different training/test splits, and the corresponding model evaluation results will differ accordingly.

Therefore, the estimate obtained from a single hold-out split is often not reliable enough. When using the hold-out method, we generally repeat it with several random splits and take the average of the evaluations as the result of the hold-out method.

For example, with 100 random splits, each split produces a training/test set for an experimental evaluation; after 100 splits we obtain 100 results, and the hold-out method returns the average of these 100 results.
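A minimal sketch of this repeated hold-out procedure (my own illustration; the dataset, model, and number of repetitions are arbitrary choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Repeat the hold-out split 100 times with different random seeds and average the scores
scores = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(np.mean(scores))  # the hold-out method returns the mean of the 100 results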

In addition, what we want to evaluate is the performance of the model trained on D, but the hold-out method has to split D into a training set and a test set, which leads to a dilemma:

  • If the training set S contains the vast majority of the samples, the trained model may be closer to the model trained on D, but because T is small, the evaluation result may not be stable or accurate;
  • If the test set T contains more samples, then the difference between the training set S and D is larger, and the evaluated model may differ considerably from the model trained on D, which reduces the fidelity of the evaluation result.

There is no perfect solution to this problem; the common practice is to use roughly 2/3 to 4/5 of the samples for training and to keep the remaining samples for testing.

Hold-out method in Python:

from sklearn.model_selection import train_test_split
# Use train_test_split to split into a training set and a test set (X is the feature matrix, Y the labels)
train_X, test_X, train_Y, test_Y = train_test_split(
        X, Y, test_size=0.2, random_state=0)

There is a special case of the hold-out method called leave-one-out (LOO), in which a single sample is used as the test set each time.

Obviously, leave-one-out is not affected by the randomness of the split, because the only way to divide m samples in this manner is into m subsets, each containing one sample.

Leave-one-out in Python:

from sklearn.model_selection import LeaveOneOut

data = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(data):
    print("%s %s" % (train, test))
'''Result
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
'''

Advantages and disadvantages of leave-one-out:

Advantages:

  • The training set used by leave-one-out differs from the initial dataset by only one sample, so in most cases the model evaluated by leave-one-out is very similar to the model trained on D that we actually want to evaluate. Leave-one-out evaluation results are therefore often considered relatively accurate.

Disadvantages:

  • Leave-one-out also has its drawback: when the dataset is large, the computational cost of training m models may be prohibitive (for example, a dataset containing 1 million samples requires training 1 million models, and that is before any hyperparameter tuning of the algorithm is considered).

2 Cross-validation

Cross-validation first divides the dataset D into k mutually exclusive subsets of similar size. Then, each time, k-1 of the subsets together form the training set and the remaining subset is used as the test set; in this way we obtain k training/test set pairs, so we can carry out k rounds of training and testing, and the final result returned is the mean of the k test results.

Obviously, the stability and fidelity of the cross-validation evaluation depend largely on the value of k; to emphasize this point, cross-validation is usually called "k-fold cross validation". The most commonly used value of k is 10, in which case it is called 10-fold cross-validation; other common values of k include 5, 20, and so on. The accompanying figure shows a schematic of 10-fold cross-validation.

Similar to the hold-out method, there are many ways to divide the dataset D into k subsets. To reduce the differences introduced by different splits, k-fold cross-validation is usually repeated p times with different random partitions, and the final evaluation result is the mean of these p runs of k-fold cross-validation; a common example is "10-times 10-fold cross-validation".
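sklearn provides RepeatedKFold (and RepeatedStratifiedKFold) for exactly this purpose. The sketch below is my own illustration, with an arbitrary choice of model and data, of a "10-times 10-fold" evaluation:

from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-times 10-fold cross-validation: 10 different random partitions, 10 folds each
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
print(len(scores))      # 100 individual fold scores
print(scores.mean())    # the final evaluation is the mean over all repetitions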

For implementing cross-validation, in addition to GridSearchCV discussed above, there are also KFold and StratifiedKFold.

KFold and StratifiedKFold

from sklearn.model_selection import KFold,StratifiedKFold
  • Usage:
    • Split the training/test data into n_splits mutually exclusive subsets; each time, one subset is used as the validation set and the remaining n_splits-1 subsets as the training set, so training and testing are performed n_splits times, giving n_splits results
    • The difference between StratifiedKFold and KFold: StratifiedKFold uses stratified sampling, ensuring that the proportion of each class of samples in the training set and the test set is consistent with the original dataset.
  • Note:
    • If the dataset cannot be divided evenly, the first n_samples % n_splits subsets each have n_samples // n_splits + 1 samples, while the remaining subsets each have n_samples // n_splits samples
  • Parameters:

    • n_splits: the number of folds to split into
    • shuffle: whether to shuffle the data before splitting
      • If False, the effect is the same as fixing random_state to an integer: the split is the same every time
      • If True, the split is different each time, because the data is shuffled and sampled randomly
  • Methods:

    • split(X, y=None, groups=None): splits the dataset into training and test sets and returns a generator of index arrays
import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold

X = np.array([
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74]
])

y = np.array([1,1,0,0,1,1,0,0])

# shuffle=False gives a deterministic split; random_state is omitted because it has no
# effect when shuffle is False (newer sklearn versions reject the combination)
folder = KFold(n_splits=4, shuffle=False)
sfolder = StratifiedKFold(n_splits=4, shuffle=False)

for train, test in folder.split(X, y):
    print('train:%s | test:%s' %(train, test))
    print("")

for train, test in sfolder.split(X, y):
    print('train:%s | test:%s'%(train, test))
    print("")

Result:

# Output of the first for loop:
train:[2 3 4 5 6 7] | test:[0 1]

train:[0 1 4 5 6 7] | test:[2 3]

train:[0 1 2 3 6 7] | test:[4 5]

train:[0 1 2 3 4 5] | test:[6 7]

# Output of the second for loop:
train:[1 3 4 5 6 7] | test:[0 2]

train:[0 2 4 5 6 7] | test:[1 3]

train:[0 1 2 3 5 7] | test:[4 6]

train:[0 1 2 3 4 6] | test:[5 7]

As we can see, when sfolder performs the 4-fold split it balances the distribution of positive and negative samples in each test fold, whereas folder does not.

3 Bootstrapping

What we want to evaluate is the model trained on D. However, with the hold-out method and cross-validation, part of the samples are held back for testing, so the training set actually used by the evaluated model is smaller than D, which inevitably introduces estimation bias caused by the difference in training-set size. Leave-one-out is less affected by the change in training-set size, but its computational complexity is too high.

Is there a way to reduce the impact of differing training-set sizes while still carrying out the experimental estimation efficiently?

"Bootstrapping" is a fairly good solution; it is based directly on bootstrap sampling. Given a dataset D containing m samples, we sample from it to produce a dataset D′:

  • Each time, randomly pick one sample from D, copy it into D′, and then put the sample back into the initial dataset D, so that it can still be drawn in the next round of sampling;
  • After this process has been repeated m times, we obtain a dataset D′ containing m samples; this is the result of bootstrap sampling.

A given sample is never drawn in any of the m rounds with probability (1 - 1/m)^m, which tends to 1/e ≈ 0.368 as m grows. That is, through bootstrap sampling, about 36.8% of the samples in the initial dataset D do not appear in the sampled dataset D′.

We can then use D′ as the training set and D\D′ as the test set. In this way, the model actually evaluated and the model we want to evaluate both use m training samples, while we still have about 1/3 of the total data, never seen during training, available for testing.

Such a test result is also called the "out-of-bag estimate".
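A minimal numpy sketch of bootstrap sampling and the out-of-bag estimate (my own illustration; the dataset and model are placeholder choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
m = len(X)
rng = np.random.default_rng(0)

# Bootstrap sampling: draw m indices from D with replacement to form D'
boot_idx = rng.integers(0, m, size=m)
# Out-of-bag samples: those in D that never appear in D'
oob_mask = ~np.isin(np.arange(m), boot_idx)
print(oob_mask.mean())  # roughly 0.368 of the samples are out-of-bag

# Train on D', estimate on D \ D' (the out-of-bag estimate)
model = KNeighborsClassifier().fit(X[boot_idx], y[boot_idx])
print(model.score(X[oob_mask], y[oob_mask]))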

Advantages and disadvantages of bootstrapping:

  • Advantages:
    • Bootstrapping is useful when the dataset is small and it is difficult to split it effectively into training and test sets;
    • In addition, bootstrapping can generate multiple different training sets from the initial dataset, which is very beneficial for methods such as ensemble learning.
  • Disadvantages:
    • The dataset generated by bootstrapping changes the distribution of the initial dataset, which introduces estimation bias. Therefore, when there is enough data, the hold-out method and cross-validation are used more often.

4 Summary

In summary:

  • When we have enough data, the hold-out method is simple and saves time, trading a very small loss in accuracy for computational convenience;
  • When the amount of data is small, we should choose cross-validation, because splitting off a hold-out set would leave too little training data;
  • When the data is extremely scarce, we can consider leave-one-out.
