Model assessment methods in machine learning

Empirical error and overfitting

In machine learning, the proportion of misclassified samples to the total number of samples is generally called the "error rate". If a of the m samples are misclassified, the error rate is E = a / m; correspondingly, 1 - E is called the "accuracy".
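To make the definitions concrete, here is a minimal Python sketch that computes the error rate and accuracy from a pair of label lists; the toy labels and predictions are made up purely for illustration.

```python
# A minimal sketch of the error-rate / accuracy definitions above.
# The label arrays here are made-up toy data for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # true labels of m = 10 samples
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # predictions from some learner

m = len(y_true)
a = sum(1 for t, p in zip(y_true, y_pred) if t != p)  # misclassified samples

error_rate = a / m          # E = a / m
accuracy = 1 - error_rate   # 1 - E

print(f"error rate E = {error_rate:.2f}, accuracy = {accuracy:.2f}")
```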

The difference between the learner's predicted output and the sample's true output is called the "error". The error on the training set is called the "training error" or "empirical error", and the error on new samples is called the "generalization error".

Obviously we want a learner with small generalization error, but since new samples are unknown in advance, we can only try to reduce the training error. However, making the training error as small as possible is often not a good idea: the learner may treat peculiarities of the training set as general rules that hold for all samples, which weakens its generalization ability. This phenomenon is called "overfitting", and its opposite is called "underfitting". Overfitting is a key obstacle faced by machine learning.

Model assessment methods

We generally use experimental tests to assess a learner's generalization error, which requires a "test set"; the error measured on the test set serves as an approximation of the generalization error. Note that the test set and the training set should be mutually exclusive, just as exam questions should not be the same as the exercises practiced in class; otherwise the results may come out overly optimistic.

Given a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)} containing n samples, let S denote the training set and T the test set. The following are several common ways of producing S and T from D.

Hold-out method

D is directly divided into two mutually exclusive subsets, one used as S and the other as T, i.e., D = S ∪ T, S ∩ T = Ø

For example: suppose n = 1000, with 700 samples used as the training set S and 300 samples as the test set T. We first train with S and then test the resulting model with T. If 90 test samples are misclassified, the error rate is (90/300) * 100% = 30%, and the accuracy is 70%.
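As a rough sketch of this hold-out procedure (assuming scikit-learn is available; the synthetic data and the logistic-regression classifier are placeholders, not part of the original example):

```python
# Hold-out sketch: 700 training / 300 test samples, as in the example above.
# The synthetic data and the choice of classifier are only placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                  # n = 1000 samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=300, random_state=0)   # S: 700 samples, T: 300 samples

model = LogisticRegression().fit(X_train, y_train)
errors = (model.predict(X_test) != y_test).sum()
print(f"test error rate = {errors / len(y_test):.2%}")
```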

Note that the division into training and test sets should preserve the consistency of the data distribution as much as possible, to avoid introducing additional bias from the partitioning itself. For example, in a classification task the class proportions in the two sets should be kept similar; this can be achieved with "stratified sampling".

For example: if D contains 500 positive and 500 negative examples, and we use 70% of them as the training set and the remaining 30% as the test set, then the training set should contain 350 positive and 350 negative examples, and the test set 150 positive and 150 negative examples.
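A small sketch of such a stratified split, assuming scikit-learn's train_test_split with its stratify option; the toy labels mirror the 500/500 example above:

```python
# Stratified hold-out sketch: keep class proportions similar in S and T.
# 500 positive / 500 negative toy labels, as in the example above.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 500 + [0] * 500)
X = np.arange(1000).reshape(-1, 1)     # dummy features, just for the split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print("train positives:", (y_train == 1).sum(), "train negatives:", (y_train == 0).sum())
print("test positives:",  (y_test == 1).sum(),  "test negatives:",  (y_test == 0).sum())
# Expected: 350/350 in the training set and 150/150 in the test set.
```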

Furthermore, different ways of dividing D produce different S and T, and the test results may differ accordingly. To improve stability, we can perform several random divisions, evaluate on each, and average the results.

Generally, about 2/3 to 4/5 of the samples are used for training and the rest for testing.
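The following sketch illustrates averaging over several random hold-out divisions, as suggested above; the data, the number of repetitions, and the classifier are illustrative choices only:

```python
# Sketch: repeat the random hold-out split several times and average
# the test accuracies to get a more stable estimate.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] - X[:, 2] > 0).astype(int)

accuracies = []
for seed in range(10):                      # 10 different random divisions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

print(f"mean accuracy over 10 splits: {np.mean(accuracies):.3f}")
```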

Cross-validation

D is divided into k mutually exclusive subsets of similar size, namely:

D = D1 ∪ D2 ∪ ... ∪ Dk, where Di ∩ Dj = Ø, i ≠ j

To maintain consistency of the data distribution, each Di is obtained by stratified sampling.

Each time, k-1 of the subsets are used as S and the remaining subset as T, so training and testing can be performed k times, yielding k results; the final result is the average of these k results.
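A sketch of these k-fold mechanics, assuming scikit-learn's KFold splitter; the synthetic data and the choice of learner are placeholders:

```python
# k-fold cross-validation sketch: each fold serves once as the test set T,
# the remaining k-1 folds form the training set S; the k results are averaged.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

k = 10
scores = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"{k}-fold cross-validation accuracy: {np.mean(scores):.3f}")
```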

Obviously the value of k largely determines the stability and fidelity of the evaluation results, so cross-validation is also called "k-fold cross-validation".

The most commonly used value of k is 10, in which case it is called 10-fold cross-validation; other common values are 5 and 20.

As with the hold-out method, different ways of dividing D into k subsets produce different training and test sets and thus affect the results. We therefore usually repeat the random division p times and take the mean of the p evaluations as the final result, e.g. "10-times 10-fold cross-validation".
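As a sketch of "10-times 10-fold cross-validation", one possible implementation uses scikit-learn's RepeatedStratifiedKFold together with cross_val_score; the data and estimator below are placeholders:

```python
# Sketch of "10-times 10-fold cross-validation": the k-fold division is
# repeated 10 times with different random partitions and all scores averaged.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = (X[:, 0] - X[:, 3] > 0).astype(int)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)   # 100 scores

print(f"mean accuracy over 10 x 10 folds: {scores.mean():.3f}")
```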

Special case: leave-one-out

If k = n, the division is unique, and since S differs from D by only one sample, the results are usually considered more accurate. However, when n is very large, for example when D contains one million samples, one million models would need to be trained, and that is before even considering parameter tuning for the algorithm.
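A minimal leave-one-out sketch, assuming scikit-learn's LeaveOneOut splitter and using a deliberately small n so that training n models stays cheap:

```python
# Leave-one-out sketch: k = n, so each sample is the test set exactly once.
# Only practical for small n; data and model here are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                     # small n keeps this affordable
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.3f}  ({len(scores)} models trained)")
```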

Bootstrapping

We build a data set D' by sampling with replacement: each time we randomly pick a sample from D, copy it into D', and put it back into D. Repeating this n times yields D' containing n samples, the same number as D.

In this case D' may contain duplicate samples, while some samples in D are never picked in the n draws. Let P be the probability that a given sample is never picked in n draws; then

P = ( 1 - 1 / n ) ^ n, 

Taking the limit as n approaches infinity:

lim (n → ∞) ( 1 - 1 / n ) ^ n = 1 / e ≈ 0.368

That is, about 36.8% of the samples in D never appear in D'. We can therefore use D' as the training set and D \ D' as the test set: the model being evaluated still uses n training samples, while about 1/3 of the samples, which never appeared in the training set, remain available for testing. Such a test result is called the "out-of-bag estimate".
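A sketch of the bootstrap sampling itself, using only NumPy; it also checks empirically that roughly 1/e of the samples end up out of the bag (the sample size n is an arbitrary illustrative value):

```python
# Bootstrap sketch: draw n indices with replacement to form D', use the
# never-picked samples as the out-of-bag test set, and check that roughly
# 1/e of D stays out of D'.
import numpy as np

rng = np.random.RandomState(0)
n = 10000
D = np.arange(n)                               # stand-in for the data set D

boot_idx = rng.randint(0, n, size=n)           # n draws with replacement -> D'
oob_mask = ~np.isin(D, boot_idx)               # samples never drawn
D_oob = D[oob_mask]                            # test set D \ D'

print(f"out-of-bag fraction: {len(D_oob) / n:.3f}  (theory: 1/e ≈ 0.368)")
```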

Bootstrapping is mainly useful when the data set is small and it is difficult to divide it effectively into training and test sets.

Bootstrapping can also produce multiple different training sets from the initial data, which is very useful for ensemble learning and similar methods.

However, the data set produced by bootstrapping changes the distribution of the original data, which introduces estimation bias. Therefore, when there is enough initial data, the hold-out method and cross-validation are more commonly used.

This series of blog posts is a summary of my study of the watermelon book, intended as study notes.

That's all.


Origin: www.cnblogs.com/shenxi-ricardo/p/12056787.html