KFold cross validation

1. Reasons for setting up validation set

  In machine learning modeling, the data is divided into a training set and a test set. The test set is completely separate from the training set: it does not participate in training at all and is only used to evaluate the model after it is finalized. The training set, in turn, needs to set aside part of its data to verify the training effect of the model; this held-out part is the validation set. After each round of training, the model is evaluated on the validation set. The reason for setting up a validation set is that training alone can lead to over-fitting: the model fits the training data well but performs poorly on data outside the training set. Because the validation set does not participate in training, it can objectively evaluate how well the model generalizes to data it has not been trained on.
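A minimal sketch of such a train / validation / test split using sklearn's train_test_split; the toy arrays X and y and the split ratios are illustrative assumptions, not part of the original text:

from sklearn.model_selection import train_test_split
import numpy as np

# Toy data: 10 samples, 2 features (illustrative values)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First hold out a test set (20%), then carve a validation set out of the rest (25% of it)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2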

2. Principles of cross-validation

  Cross-validation is commonly used to evaluate a model. The principle is to divide the data into n groups; each group is used once as the validation set while the remaining n-1 groups are used as the training set. This requires n rounds in total and produces n models, and the n validation errors from these models are averaged to obtain the cross-validation error.
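A minimal sketch of this averaging, using sklearn's cross_val_score; the iris dataset and the logistic regression model are illustrative assumptions and any estimator works the same way:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative data and model (assumptions)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is the validation set once,
# the other 4 folds form the training set
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one score per fold
print(scores.mean())   # cross-validation score (average over the 5 folds)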

 

sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)

n_splits: the k in k-fold cross-validation, i.e. how many equal parts the data is divided into

shuffle: whether to shuffle the data before splitting

① If False, random_state has no effect and the result of each split is always the same.

② If True, the result of each split differs unless random_state is fixed.

random_state: acts as the random number seed when shuffle=True. When random_state is set to the same value, the generated splits are the same (illustrated below).
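A minimal sketch of the shuffle / random_state behaviour described above; the 6-sample array X is an illustrative assumption:

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(12).reshape(6, 2)  # 6 samples (illustrative)

# shuffle=False: the split is deterministic, random_state is not needed
kf_plain = KFold(n_splits=3, shuffle=False)

# shuffle=True with a fixed random_state: shuffled, but reproducible across runs
kf_seeded = KFold(n_splits=3, shuffle=True, random_state=42)

for name, kf in [("no shuffle", kf_plain), ("shuffle, seed 42", kf_seeded)]:
    print(name)
    for train_idx, test_idx in kf.split(X):
        print(train_idx, test_idx)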

 

K-fold cross-validation (KFold) divides the data set into k groups, called folds. If the value of k equals the number of instances in the data set, then each test set contains only a single instance. This special case is called "leave-one-out".
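A minimal sketch of the leave-one-out case; the 4-sample list and the LeaveOneOut splitter are not part of the original example:

from sklearn.model_selection import KFold, LeaveOneOut

X = ["a", "b", "c", "d"]  # 4 samples (illustrative)

# k equal to the number of instances: every test set holds exactly one sample
kf = KFold(n_splits=len(X))
for train, test in kf.split(X):
    print("%s-%s" % (train, test))   # [1 2 3]-[0], [0 2 3]-[1], ...

# sklearn also provides a dedicated splitter for this case
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s-%s" % (train, test))   # same splits as above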

# Import the relevant module
from sklearn.model_selection import KFold
# Prepare the data
X = ["a", "b", "c", "d", "e", "f"]
# Set up the grouping; here the data is split into 3 folds
kf = KFold(n_splits=3)
# View the grouping result
for train, test in kf.split(X):
    print("%s-%s" % (train, test))

# Output
[2 3 4 5]-[0 1]
[0 1 4 5]-[2 3]
[0 1 2 3]-[4 5]
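The index arrays returned by kf.split can be used to select the training and validation data for each fold and fit a model on each round. A minimal sketch, assuming a numeric feature array and a KNeighborsClassifier, neither of which appears in the original:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Illustrative numeric data replacing the string example above (assumption)
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 1])

kf = KFold(n_splits=3)
scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])                   # train on the k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))    # validate on the held-out fold

print(np.mean(scores))  # cross-validation accuracy (average over folds)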

 

When using KFold cross-validation, a larger value of k is better, up to a limit: the larger k is, the more validation rounds are run, and the resulting average score better represents the accuracy of the trained model.

But k needs to stay within a limit; a k that is too large has two disadvantages:

1. It places an excessive burden on the machine and takes a lot of time.

2. Each round's test set (or validation set) contains too few samples to give an accurate estimate of the error rate.

In general, k is usually set to 10, though the value depends on the project; in any case k must not exceed n (the number of samples in the data set). A sketch of the common k = 10 setting follows below.
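A minimal sketch of the common k = 10 setting, passing a KFold splitter to cross_val_score; the iris dataset and the logistic regression model are again illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Illustrative data and model (assumptions)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k = 10, the commonly used choice discussed above
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)
print(scores.mean())  # average score over the 10 folds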

Origin blog.csdn.net/weixin_48135624/article/details/114881801