Machine Learning Notes: Cross Validation

In supervised learning, a model is trained on labeled data. When doing so, cross-validation is commonly used to select model parameters.

Divide the labeled data into three sets: a training set, a (cross) validation set, and a test set.
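
As a minimal sketch of this three-way split (assuming scikit-learn and an illustrative 60/20/20 ratio, which these notes do not specify):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy labeled data: X holds features, y holds labels.
    X = np.random.randn(1000, 10)
    y = np.random.randint(0, 2, size=1000)

    # Carve off the test set first, then split the remainder into
    # training and (cross) validation sets.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    X_train, X_cv, y_train, y_cv = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)
    # Result: 60% training, 20% validation, 20% test.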


In a machine learning model, some model parameters need to be specified in advance. They are constants fixed before training (different from the parameters obtained by minimizing the objective function during training). Specifying such constants purely from experience is not necessarily reliable, so cross-validation is done before the final training to choose their values.
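
To make the distinction concrete, a tiny sketch (assuming scikit-learn; the data is made up): the polynomial degree is a constant set when the model is constructed, while fit() learns the trained parameters.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    # degree is a constant chosen before training (a hyperparameter).
    clf = SVC(kernel="poly", degree=3)

    # fit() learns the trained parameters (support vectors, dual
    # coefficients) by minimizing the objective; degree never changes.
    clf.fit(X, y)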

When doing model selection, or more precisely model parameter selection (here, taking an SVM with a polynomial kernel as an example, the parameter we need to select is the order k of the polynomial), proceed as follows; a code sketch of all three steps appears after step 3:

1 Training: first fit the first-order, second-order, third-order, and higher-order models on the Training Set (the order refers to the degree of the polynomial in the hypothesis h_θ(x)), then use a numerical optimization algorithm to find, for each given order, the parameters θ that minimize the training error.

2 Parameter selection: run each model M(1, θ_1), ..., M(k, θ_k), ..., M(n, θ_n) obtained in step 1 on the Cross Validation Set, compute the cross-validation error of each, and select the order k that minimizes that error.

3 Final training: with the order k fixed, train on the Training Set + CV Set to find the parameters θ that minimize the training error, obtaining the final model M(k, θ_k').
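
A minimal sketch of steps 1-3, assuming scikit-learn's polynomial-kernel SVC; the data, split sizes, and candidate orders are made up for illustration:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + X[:, 1] ** 2 > 0).astype(int)

    # 60/20/20 split into Training, CV, and Test sets.
    X_train, X_cv, X_test = X[:180], X[180:240], X[240:]
    y_train, y_cv, y_test = y[:180], y[180:240], y[240:]

    # Step 1: for each candidate order k, fit M(k, theta_k) on the
    # Training Set (fit() minimizes the training objective).
    orders = [1, 2, 3, 4, 5]
    models = {k: SVC(kernel="poly", degree=k).fit(X_train, y_train)
              for k in orders}

    # Step 2: pick the order whose model has the lowest CV error.
    cv_errors = {k: 1.0 - m.score(X_cv, y_cv) for k, m in models.items()}
    best_k = min(cv_errors, key=cv_errors.get)

    # Step 3: refit with the chosen order on Training Set + CV Set.
    X_final = np.vstack([X_train, X_cv])
    y_final = np.concatenate([y_train, y_cv])
    final_model = SVC(kernel="poly", degree=best_k).fit(X_final, y_final)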

 

Evaluation:

Run the model M(k, θ_k') obtained in step 3 on the Test Set (the selected model is fixed at this point, and its parameters can no longer change), and compute the test error as an evaluation of the model's generalization ability, i.e., its prediction accuracy on unknown data.

An important point is that the test set data must never have been seen by the trained model; only then can the test error predict the model's behavior on new, unknown data. The Test Set therefore cannot participate in model selection. It is used only for evaluation after the model is fixed, to see how the trained model performs on new data and whether it is fit to go online.
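
Continuing the sketch above (reusing final_model, best_k, and the held-out X_test/y_test), the frozen model is scored exactly once on the test set:

    # Test error on data the selected model has never seen; this is
    # the estimate of its generalization ability.
    test_error = 1.0 - final_model.score(X_test, y_test)
    print(f"chosen order k = {best_k}, test error = {test_error:.3f}")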

The above is basic (hold-out) cross-validation. A variant with a good balance of effectiveness and time cost is n-fold cross-validation, the most commonly used method in machine learning experiments; a code sketch of the four steps appears after the list:

1 Divide the data proportionally into a (generalized) training set A and a test set B.
2 Run n-fold validation on A: split the (generalized) training set into n parts a1 through an; each part ai serves in turn as the CV set while the remaining n-1 parts serve as the Training set. This yields n training runs, and the average of the n CV errors is taken as a model's final CV error, which is used to select the optimal model parameters.
3 Retrain a model on the entire (generalized) training set A under the selected optimal model parameters.
4 Finally, report the results on the test set B.
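
A minimal sketch of these four steps, again assuming scikit-learn (5 folds and made-up data and candidate orders):

    import numpy as np
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + X[:, 1] ** 2 > 0).astype(int)

    # Step 1: split into a (generalized) training set A and a test set B.
    X_A, X_B, y_A, y_B = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

    # Step 2: n-fold CV on A; the mean CV error over the n folds
    # scores each candidate order.
    cv_errors = {}
    for k in [1, 2, 3, 4, 5]:
        scores = cross_val_score(SVC(kernel="poly", degree=k),
                                 X_A, y_A, cv=5)
        cv_errors[k] = 1.0 - scores.mean()
    best_k = min(cv_errors, key=cv_errors.get)

    # Step 3: retrain on all of A with the selected order.
    model = SVC(kernel="poly", degree=best_k).fit(X_A, y_A)

    # Step 4: report the test error on B.
    print(f"k = {best_k}, test error = {1.0 - model.score(X_B, y_B):.3f}")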
