Python: K-fold cross-validation, the difference between KFold and StratifiedKFold, and random_state

K-fold cross-validation

How the training set and test set are divided greatly affects the final model and its parameter values. In general, K-fold cross-validation is used during model tuning to find the hyperparameter values that give the best generalization performance, while at the same time providing an estimate of the current model's performance.
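As an illustrative sketch of hyperparameter tuning with K-fold cross-validation (the dataset, model, and candidate values here are my own assumptions, not from the original post), scikit-learn's cross_val_score can evaluate each candidate on the same folds:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Evaluate each candidate hyperparameter value on the same 5 folds
mean_scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=cv)  # one accuracy per fold
    mean_scores[k] = scores.mean()
    print(f"n_neighbors={k}: mean accuracy={mean_scores[k]:.3f}")

# Pick the candidate with the best average generalization performance
best_k = max(mean_scores, key=mean_scores.get)
print("best n_neighbors:", best_k)
```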
When k is large, more data is used for model training in each iteration, which yields a lower-bias performance estimate but lengthens the runtime.
When k is small, the computational cost of repeatedly fitting and evaluating the model on different data blocks is reduced, while the model is still assessed through its average performance across folds.
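A minimal sketch of this trade-off (the 50-sample array is a made-up example): the larger k is, the more samples land in the training part of each fold, at the cost of fitting the model k times.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features

train_sizes = {}
for k in (2, 5, 10):
    kf = KFold(n_splits=k)
    # Inspect the first fold: train/test sizes are the same for every fold here
    train_idx, test_idx = next(iter(kf.split(X)))
    train_sizes[k] = len(train_idx)
    print(f"k={k}: {len(train_idx)} training samples, {len(test_idx)} test samples per fold")
```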

Two-fold implementation code

This is usually implemented with the following classes

from sklearn.model_selection import KFold,StratifiedKFold

StratifiedKFold parameter description:

class sklearn.model_selection.StratifiedKFold(n_splits='warn', shuffle=False, random_state=None)
n_splits: the number of folds
shuffle: whether to shuffle each class's samples before splitting them into batches.
         With shuffle=True the data is reshuffled on each run; otherwise the same splits are produced every time
random_state: controls the random state; the seed used by the random number generator

Two points to note:
1. kf.split(x) returns the indices of the dataset; the data itself must be extracted with x[train_index]
2. When shuffle=True (meaning shuffled), each run of the code is different because the randomly drawn indices differ; otherwise they stay the same.

import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold
x = np.array([[1, 1], [2, 2], [3, 3], [4, 4],[5,5],[6,6]])
kf = KFold(n_splits=2,shuffle=True)
for train_index, test_index in kf.split(x):
    print('train_index:', train_index)
    print("train_data:",x[train_index])
    print('test_index', test_index)
    print("--------fold separator: in 2-fold CV, one fold's test set is the other's training set--------")

train_index: [1 2 3]
train_data: [[2 2]
 [3 3]
 [4 4]]
test_index [0 4 5]
--------fold separator: in 2-fold CV, one fold's test set is the other's training set--------
train_index: [0 4 5]
train_data: [[1 1]
 [5 5]
 [6 6]]
test_index [1 2 3]
--------fold separator: in 2-fold CV, one fold's test set is the other's training set--------
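To illustrate the second note above, a small sketch (reusing the same 6-sample array): with shuffle=False the splits are identical on every run, and with shuffle=True a fixed random_state makes the shuffled splits reproducible as well.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])

def split_indices(kf):
    # Collect (train, test) index lists for every fold
    return [(train.tolist(), test.tolist()) for train, test in kf.split(x)]

# shuffle=False: deterministic, same folds on every run
a = split_indices(KFold(n_splits=2, shuffle=False))
b = split_indices(KFold(n_splits=2, shuffle=False))

# shuffle=True with a fixed seed: shuffled, but reproducible
c = split_indices(KFold(n_splits=2, shuffle=True, random_state=0))
d = split_indices(KFold(n_splits=2, shuffle=True, random_state=0))

print("unshuffled folds:", a)
print("shuffled folds:  ", c)
```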

The difference between KFold and StratifiedKFold

"Stratified" means stratified sampling: it ensures that the proportion of each class in the training and test sets matches that of the original dataset.
In the following example, 6 samples correspond to 6 labels. We split them into three folds, so in each iteration 4 samples are used for training and 2 for testing.

With plain KFold, a split such as train_index = [0,1,2,3] (train_label = [1,1,1,0]) with
test_index = [4,5] (test_label = [0,0]) can occur, a case of biased class distribution. StratifiedKFold avoids this by keeping the class ratio in every fold the same as in the original dataset.

import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold

x = np.array([[1, 1], [2, 2], [3, 3], [4, 4],[5,5],[6,6]])
y=np.array([1,1,1,0,0,0])

kf = StratifiedKFold(n_splits=3,shuffle=True)
for train_index, test_index in kf.split(x,y):
    print('train_index:', train_index)
    print('test_index', test_index)
    print("--------fold separator--------")

train_index: [0 1 4 5]
test_index [2 3]
--------fold separator--------
train_index: [0 2 3 5]
test_index [1 4]
--------fold separator--------
train_index: [1 2 3 4]
test_index [0 5]
--------fold separator--------
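To make the contrast concrete, a sketch comparing the test-fold labels produced by the two splitters on the same data (shuffle is left off so the result is deterministic):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
y = np.array([1, 1, 1, 0, 0, 0])

# Plain KFold ignores the labels: two of the three test folds are single-class
kfold_test_labels = [y[test].tolist() for _, test in KFold(n_splits=3).split(x)]

# StratifiedKFold keeps the 50/50 class ratio: every test fold has one of each class
skf_test_labels = [y[test].tolist() for _, test in StratifiedKFold(n_splits=3).split(x, y)]

print("KFold test labels:          ", kfold_test_labels)
print("StratifiedKFold test labels:", skf_test_labels)
```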

random_state (random state)

Why is a parameter like random_state (random state) needed?

1. When building a model:
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
2. When generating a dataset:
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
3. When splitting a dataset into a training set and a test set:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

If random_state is not set, a different model is built on each run, a different dataset is generated each time, and a different train/test split is produced each time, so whether to fix it depends on your needs.
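A minimal sketch of the third case (using a small synthetic array rather than the cancer dataset mentioned above): with the same seed, train_test_split returns an identical split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed: identical split on every run
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=42)

# No seed: the split may differ between runs
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y)

print("test labels, seeded run 1:", y_te1)
print("test labels, seeded run 2:", y_te2)
```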


Origin blog.csdn.net/m0_46204224/article/details/105617436