Data Mining: Evaluating Model States


Previous posts on model evaluation only looked at how to evaluate a model's prediction accuracy; they did not consider whether the model is in an overfitting or underfitting state. In other words, once a model has been fitted we want to optimize it, and how to optimize it depends on its current state (overfitting, underfitting, and so on). So we need to assess the model's state before optimizing it.

First, the model state

A model's state falls into one of two categories (a rough diagnostic sketch follows the list):

  1. Overfitting: the model performs well on the training set but poorly on the test set.
  2. Underfitting: the model performs poorly on both the training and the test sets.
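As a rough illustration of how these two states show up in practice, here is a minimal sketch; the accuracy numbers and thresholds are made up purely for demonstration, not universal rules:

```python
def diagnose(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.70):
    """Rough diagnosis of model state from training/test accuracy.

    The thresholds are illustrative only.
    """
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfitting: poor on both training and test sets"
    if train_acc - test_acc > gap_threshold:
        return "overfitting: good on training set, much worse on test set"
    return "reasonable fit: similar, acceptable accuracy on both sets"

print(diagnose(train_acc=0.99, test_acc=0.78))  # -> overfitting
print(diagnose(train_acc=0.62, test_acc=0.60))  # -> underfitting
```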


The "performance" here refers to the accuracy used in model evaluation; looked at from the other side, accuracy is a question of how large the error is.
Error: the difference between the learner's predicted output on a sample and the sample's true output (written out as a formula after the list below).
According to how the data set is divided, we have the following definitions:

  1. Training error, also called empirical error: the learner's error on the training set.
  2. Test error: the learner's error on the test set.
  3. Generalization error: the learner's error on new, unseen samples.
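Written out for a classifier, the error rate on a data set $D$ of $m$ samples is the standard definition below; accuracy is simply its complement:

$$
E(f;D)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\big(f(x_i)\neq y_i\big),\qquad \text{acc}(f;D)=1-E(f;D)
$$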

The purpose of training a model is to obtain a learner with a small generalization error. However, new samples are not known in advance, so what we can actually do is try to minimize the empirical error. Be aware, though, that even a learner with a misclassification rate of 0, i.e. 100% accuracy on the training data, may still predict poorly on new samples. What we really want is a learner that performs well on new samples.

To achieve this, the learner should extract from the training samples, as far as possible, "general rules" that apply to all potential samples, so that it can make the right judgment when faced with new samples. Because the generalization error cannot be measured directly, we usually treat the test error as an approximation of the generalization error.
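A minimal scikit-learn sketch of this idea, with the dataset and model chosen purely as placeholders: hold out a test set and treat its error as a stand-in for the generalization error.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set: its error approximates the generalization error.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)  # empirical (training) error
test_error = 1 - model.score(X_test, y_test)     # proxy for generalization error
print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```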


Second, bias and variance

How do we assess the state of a model? Here we use a learning curve: as the sample size increases, it reflects how the bias and the variance on the training data and the test data change.

Generally, we describe a model's generalization performance in terms of its bias and its variance; this is called the bias-variance decomposition.

Bias: the difference between the average output of models trained on all possible training sets of size m and the true output. Bias usually stems from wrong assumptions built into the learning algorithm; it measures how far the algorithm's expected prediction deviates from the true result, i.e. it characterizes the fitting ability of the learning algorithm itself. Bias shows up when a model is compared with other models: different models' accuracies on the training and test sets are compared.

Variance: the variance of the outputs of models trained on all possible training sets of size m. Put simply, it measures how much the learner's performance changes when the training set (of the same size) changes, i.e. it characterizes how perturbations of the data affect the learned model, in other words the stability of the learning algorithm. Variance shows up when a model is compared with itself: the same model's accuracy on the training set and on the test set is compared.

Noise: the lower bound on the expected generalization error that any learning algorithm can achieve on the current task; it characterizes the inherent difficulty of the problem, i.e. the error in the data set's labels themselves.
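Putting the three terms together, the expected generalization error under squared loss decomposes as follows (this is the standard bias-variance decomposition; $f(x;D)$ is the model trained on data set $D$, $\bar{f}(x)=\mathbb{E}_D[f(x;D)]$ its average prediction, $y_D$ the observed label, and $y$ the true label):

$$
\mathbb{E}_D\big[(f(x;D)-y_D)^2\big]
=\underbrace{\big(\bar{f}(x)-y\big)^2}_{\text{bias}^2}
+\underbrace{\mathbb{E}_D\big[(f(x;D)-\bar{f}(x))^2\big]}_{\text{variance}}
+\underbrace{\mathbb{E}_D\big[(y_D-y)^2\big]}_{\text{noise}}
$$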

The classic "shooting at a target" picture gives a good intuition for the difference between bias and variance: **bias describes how accurate the model is, and variance describes how stable the model is**.

Suppose each shot fired is a machine learning model making a prediction on one sample. Hitting the bull's-eye represents an accurate prediction; the farther a shot lands from the bull's-eye, the larger the prediction error.
We obtain n training sets, each of size m, train n models, and let each of them predict the same sample; this corresponds to firing n shots, and the n hits form a scatter of points around the bull's-eye.
The most desirable outcome is the upper-left case: the shots are both accurate and tightly clustered, meaning the model has low bias and low variance.
In the upper-right case the shots are centered around the bull's-eye but widely scattered: low bias, but high variance.
Similarly, the lower-left case shows low variance but high bias,
and the lower-right case shows both high variance and high bias.
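This "n shots" picture can be simulated directly. Below is a minimal sketch on a synthetic regression task; the data-generating function, noise level, and model are assumptions chosen only to illustrate the idea: train many models on different training sets of the same size, predict the same point, and look at how far the mean prediction is from the truth (bias) and how spread out the predictions are (variance).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)                       # assumed ground-truth function

x0, n_models, m = 1.0, 200, 50             # query point, number of "shots", training-set size
preds = []
for _ in range(n_models):
    X = rng.uniform(0, 2 * np.pi, size=(m, 1))
    y = true_f(X).ravel() + rng.normal(0, 0.3, size=m)   # noisy labels
    model = DecisionTreeRegressor(max_depth=3).fit(X, y)
    preds.append(model.predict([[x0]])[0])

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias at x0
variance = preds.var()                       # variance of the predictions at x0
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```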


So when building a model, we want both its bias and its variance to be relatively small.


Overfitting: low bias, high variance.
Underfitting: high bias, low variance.

Adjusting the state of a model is therefore a trade-off between bias and variance.

Third, the learning curve

By plotting the training-set accuracy and the cross-validation accuracy at different training-set sizes, we can see how the model performs on new data, judge whether its variance or its bias is too high, and decide whether enlarging the training set would reduce overfitting.
When the training and validation accuracies converge to a similarly low value, the bias is high compared with the reference classifier (the red line).
The upper-left figure shows such a high-bias classifier: accuracy is very low on both the training and validation sets, so the model is probably underfitting.

When the training-set accuracy is close to that of the reference classifier, the bias is low; but a large gap between training-set and test-set accuracy indicates high variance.
The upper-right figure shows such a high-variance classifier: the gap between training and validation accuracy is too large, so the model is overfitting.

The more complex the model, the stronger its learning ability, so the smaller its error on the training set. The test-set error, however, only decreases up to a point: once the model becomes too complex it starts to overfit, and the test error rises again. At that point, measures such as dimensionality reduction and stronger regularization can reduce the overfitting.
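As one hedged sketch of "increase regularization" (logistic regression is used purely as an example here; in scikit-learn's LogisticRegression a smaller C means stronger L2 regularization, so decreasing C simplifies the model):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Smaller C = stronger regularization = simpler model (helps against overfitting).
for C in (100.0, 1.0, 0.01):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=C, max_iter=1000))
    clf.fit(X_train, y_train)
    print(f"C={C:>6}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```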

| State | Bias / variance | How to deal with it |
| --- | --- | --- |
| Overfitting | low bias, high variance | increase the amount of data; reduce model complexity (use fewer features, increase regularization) |
| Underfitting | high bias, low variance | adding more data does not help; increase model complexity (use more features, reduce regularization) |

For complete code to draw learning curves, you can consult the official scikit-learn documentation.
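A minimal sketch using sklearn.model_selection.learning_curve (the dataset and estimator here are placeholders chosen only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Accuracy on the training set and under cross-validation for growing training sizes.
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="cross-validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```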



Origin blog.csdn.net/AvenueCyy/article/details/104572784