Machine Learning Notes (2): Evaluating Model Quality

Part 1: Splitting the data into training and test sets

  • Training set: the set of data used to train the model.
  • Test set: the set of data used to test the trained model.

Common methods for splitting a dataset:

1. Hold-out method

The hold-out method splits the dataset D directly into two mutually exclusive sets, one used as the training set S and the other as the test set T, i.e. D = S ∪ T and S ∩ T = ∅. After the model is trained on S, it is evaluated on T, and the test error is taken as an estimate of the generalization error.
Notes:
(1) The training/test split should keep the data distribution as consistent as possible, to avoid the splitting process introducing extra bias that affects the final result.
(2) Different splits produce different training/test sets, and the model evaluation results will differ accordingly.
(3) A common practice is to use roughly 2/3 to 4/5 of the samples for training and the remaining samples for testing.
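
As a concrete sketch, assuming scikit-learn (the toy data, the 25% test fraction, and the random seed are all made up for illustration), a hold-out split with stratified sampling could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 2 features, binary labels (illustrative only).
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)

# Split D into S (training) and T (test); stratify=y keeps the class
# distribution consistent between S and T, as note (1) recommends.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 75 25
```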

2. Cross-validation

Cross-validation first splits the dataset D into k disjoint subsets of similar size, each of which preserves the data distribution as much as possible, i.e. the subsets are obtained from D by stratified sampling. In each round, the union of k−1 subsets is used as the training set and the remaining subset as the test set. This yields k training/test pairs, so training and testing are performed k times, and the final result is the average of the k test results. A common choice is k = 10, in which case the procedure is called 10-fold cross-validation; other common values of k are 5 and 20.
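A minimal sketch of k-fold cross-validation with k = 10, assuming scikit-learn (the toy data and the logistic-regression model are placeholders chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 2)               # toy features (illustrative only)
y = np.random.randint(0, 2, size=100)    # toy binary labels

# Split D into k = 10 folds; StratifiedKFold does the stratified
# sampling described above, preserving class proportions per fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The final result is the average of the k test results.
print(np.mean(scores))
```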

3. Bootstrapping

Bootstrapping is based on sampling with replacement (self-sampling). It reduces the impact of differing training-set sizes and allows experiments and estimates to be carried out more efficiently. Given a dataset D containing m samples, we generate a new dataset D' by sampling from it: each time, randomly pick one sample from D, copy it into D', and put it back into D so that it may be picked again in the next draw. Repeating this process m times yields a dataset D' containing m samples; this is the result of bootstrap sampling. Some samples of D will appear many times in D', while others will never appear. The probability that a given sample is never picked in m draws is (1 − 1/m)^m, which is approximately 0.368.
Bootstrapping is useful when the dataset is small and it is hard to split it effectively into training and test sets. Because it can generate multiple different training sets from the same initial dataset, it is also a great advantage for ensemble learning.
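
A small sketch of bootstrap sampling in plain NumPy (the sample count and seed are made up for illustration); the fraction of samples never drawn should come out near 0.368:

```python
import numpy as np

m = 1000
D = np.arange(m)                       # indices of the m samples in D

rng = np.random.default_rng(0)
# Draw m times from D with replacement -> the bootstrap sample D'.
D_prime = rng.choice(D, size=m, replace=True)

# Samples never drawn can serve as a test set ("out-of-bag" samples).
never_drawn = np.setdiff1d(D, D_prime)
print(len(never_drawn) / m)            # ~ (1 - 1/m)^m -> 1/e ≈ 0.368
```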

Part 2: Evaluating classification results: accuracy, confusion matrix, precision, recall, F1 score, ROC curve

1. Precision:

Among the samples whose predicted value is 1, the proportion whose true value is also 1. In other words: for the event we care about, how accurate are our positive predictions.

2. Recall:

Among all samples whose true label is 1, the proportion that the model predicts correctly. In other words: when the event we care about actually occurs, what share of those occurrences did we successfully predict.

In general, precision and recall are a pair of conflicting measures: when precision is high, recall is often low, and when recall is high, precision is often low.
When trading off between the two metrics, the choice usually depends on the scenario. For a binary classification problem such as predicting whether a stock will rise, what people generally care about is precision: among the stocks predicted to rise, the higher the proportion that actually rise, the better. Even if we miss some rising stocks, we do not lose much, so precision is what we want to raise. For disease diagnosis in the medical field, a low recall means that some patients who are actually sick are not predicted as such, so we want to catch as many sick patients as possible; here it is recall that needs to be improved.

3. Confusion matrix:

For a binary classification problem, combining the true class with the learner's predicted class divides the samples into four cases: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). TP + FP + TN + FN = total number of samples. Arranged as a matrix, they form the confusion matrix:

| True class \ Predicted class | Positive | Negative |
| --- | --- | --- |
| Positive | TP | FN |
| Negative | FP | TN |

Where:
(1) Precision: Precision = TP / (TP + FP)

(2) Recall (sensitivity, true positive rate): Recall = TPR = TP / (TP + FN)
(3) True negative rate (specificity): TNR = TN / (FP + TN)
(4) False negative rate (missed-diagnosis rate, 1 − sensitivity): FNR = FN / (TP + FN)
(5) False positive rate (misdiagnosis rate, 1 − specificity): FPR = FP / (FP + TN)
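
A sketch that derives all five quantities from a confusion matrix, assuming scikit-learn (the labels are toy values). Note that for binary labels scikit-learn's confusion_matrix returns the cells in the order [[TN, FP], [FN, TP]]:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall    = tp / (tp + fn)   # TPR, sensitivity
tnr       = tn / (fp + tn)   # specificity
fnr       = fn / (tp + fn)   # missed-diagnosis rate, 1 - sensitivity
fpr       = fp / (fp + tn)   # misdiagnosis rate, 1 - specificity
print(precision, recall, tnr, fnr, fpr)
```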

4. F1 Score:

If we care about both precision (P) and recall (R), we can use the F1 score, which is the harmonic mean of precision and recall.
F1 = 2PR / (P + R)
i.e. 1/F1 = (1/P + 1/R) / 2
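
To see why the harmonic mean is used rather than the arithmetic mean, here is a tiny illustrative helper (the function name is ours, not a library call): the harmonic mean is dragged down by whichever of P and R is smaller, so a model cannot hide a poor recall behind a high precision:

```python
def f1_from_pr(p, r):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f1_from_pr(0.9, 0.3))  # 0.45 -- the low recall dominates
print((0.9 + 0.3) / 2)       # 0.60 -- the arithmetic mean hides it
```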

5. Classification threshold:

That is, the threshold used to decide whether a sample counts as a positive example.
If a logistic regression model predicts for one email a probability of 0.9995, the model considers that email very likely to be spam. Conversely, another email with a predicted score of 0.0003 is very likely not spam. But what about an email with a predicted score of 0.6? To map a logistic regression value to a binary category, you must specify a classification threshold (also called a decision threshold). A value above the threshold means "spam"; a value below it means "not spam". People tend to assume the classification threshold should always be 0.5, but the threshold depends on the specific problem, so you have to tune it.
Precision generally rises as the threshold increases, while recall falls as the threshold increases. If a scenario requires, say, that recall stay at 80%, a suitable threshold can be found in this way.
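
A sketch of the trade-off, assuming scikit-learn and some hypothetical predicted probabilities (all values made up): sweeping the threshold upward shows precision rising while recall falls:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical probabilities from a logistic regression model.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55, 0.7, 0.3])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_proba >= threshold).astype(int)   # apply the threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```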

6. ROC curve:

The ROC curve (Receiver Operating Characteristic curve) describes the relationship between TPR and FPR: the x-axis is FPR and the y-axis is TPR.
TPR measures, among all positive examples, how many were correctly judged positive; FPR measures, among all negative examples, how many were incorrectly judged positive. Different classification thresholds produce different TPR and FPR values. In the most ideal case, all positive and negative examples are predicted successfully, so TPR = 1 and FPR = 0; this happens when every positive example's predicted value exceeds every negative example's predicted value, in which case the threshold can be any value between the maximum predicted value of the negative examples and the minimum predicted value of the positive examples.
We want TPR as large as possible and FPR as small as possible, but the two metrics often conflict: to increase TPR, we can predict more samples as positive, but doing so also turns more negative examples into false positives.
The closer the ROC curve is to the top-left corner, the better the classifier. The top-left point (TPR = 1, FPR = 0) corresponds to perfect classification; in the medical analogy, a highly skilled doctor who gets every diagnosis right.
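
A sketch of how the curve's points arise, assuming scikit-learn (the labels and scores are hypothetical): roc_curve sweeps the classification threshold and returns one (FPR, TPR) point per threshold:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted scores (made up for illustration).
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
# Plotting tpr against fpr gives the ROC curve.
```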

7. AUC:

For an ROC curve, what we care about is the area under the curve, called the AUC (Area Under Curve). Since the horizontal axis ranges over (0, 1) and the vertical axis ranges over (0, 1), the total area is at most 1.

The area under the ROC curve can be decomposed into trapezoids (a rectangle is a special case of a trapezoid), each with area (upper base + lower base) × height / 2; the area under the curve is then obtained by summing the areas of the trapezoids. The larger the AUC, the better the classifier's results.
• AUC = 1: a perfect classifier. With such a model, there exists at least one threshold that yields a perfect prediction. In the vast majority of practical settings, no perfect classifier exists.
• 0.5 < AUC < 1: better than random guessing. If the threshold is set properly, the classifier (model) has predictive value.
• AUC = 0.5: the same as random guessing; the model has no predictive value.
• AUC < 0.5: worse than random guessing; but if you always invert its predictions, it becomes better than random guessing.
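
A sketch of the trapezoid summation described above, checked against scikit-learn's roc_auc_score (the labels and scores are the same hypothetical values as before):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55, 0.7, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_score)

# Sum of trapezoid areas: (upper base + lower base) * height / 2,
# where the height is each step along the FPR axis.
auc = np.sum((tpr[1:] + tpr[:-1]) / 2 * (fpr[1:] - fpr[:-1]))

print(auc, roc_auc_score(y_true, y_score))  # the two should agree
```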

Part 3: Evaluating regression results: MSE, RMSE, MAE, R-squared

1. Mean squared error (MSE)

The accumulated squared error depends on the size m of the test set: as the amount of data grows, the error sum keeps accumulating, so the raw sum is tied to m. To cancel out the amount of data, we divide by it, offsetting the effect of m. The result is called the mean squared error: the mean of the squared differences between the original data points and the corresponding predictions.
MSE (Mean Squared Error):
MSE = (1/m) Σ (yᵢ − ŷᵢ)²,  summing over the m test samples

2. Root mean squared error (RMSE)

However, the MSE is affected by the units of the target. For example, when predicting house prices, if y is measured in a monetary unit, the MSE comes out in that unit squared. To fix the units, we can take the square root (the squared units are resolved by taking the square root of the squared error), giving the root mean squared error, RMSE (Root Mean Squared Error):
RMSE = √MSE = √[ (1/m) Σ (yᵢ − ŷᵢ)² ]

3. Mean absolute error (MAE)

For linear regression there is another very simple evaluation criterion: measure the distance between the true values and the predicted results directly, by subtracting and taking the absolute value, then summing over the m samples and dividing by m. The resulting average distance is called the mean absolute error, MAE (Mean Absolute Error):
MAE = (1/m) Σ |yᵢ − ŷᵢ|
As we mentioned when choosing a loss function, the absolute value function is not differentiable everywhere, so we avoid it as a training loss. But that does not affect its use for evaluating a model: the evaluation metric and the training loss function can be different.
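
All three regression metrics are one-liners in NumPy; a sketch with made-up values:

```python
import numpy as np

# Hypothetical true values and predictions for a regression problem.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse  = np.mean((y_true - y_pred) ** 2)      # squared units of y
rmse = np.sqrt(mse)                         # back in the units of y
mae  = np.mean(np.abs(y_true - y_pred))     # average absolute distance
print(mse, rmse, mae)
```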

4. R-squared (coefficient of determination)

The coefficient of determination is determined mainly by two quantities, SSR and SST.
(1) SSR (regression sum of squares): the sum of the squared differences between the predicted values and the mean of the original data, as follows:
SSR = Σ (ŷᵢ − ȳ)²
(2) SST (total sum of squares): the sum of the squared differences between the original data and their mean, as follows:
SST = Σ (yᵢ − ȳ)²
It can be shown that SST = SSE + SSR, where SSE = Σ (yᵢ − ŷᵢ)² is the residual sum of squares.
and our "coefficient of determination" is defined as the ratio of SSR and SST, so

R² = SSR / SST
"Coefficient of determination" is characterized by a change in the data fit is good or bad. The above expression can know the "coefficient of determination" normal value range of [0 1], the closer to 1, indicates that the variable equation y stronger explanatory power of this model are better fit to the data.
Equivalently: R² = 1 − SSE / SST = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²
R & lt Why good side this indicator?
For molecules, the square of the difference between the predicted value and the true value and that using the wrong Our model predicts that produced.
For the denominator, it is the square of the difference between the mean and the true value and that thought wrong, "predictive value = sample mean" this model (Baseline Model) produced.
We use the Baseline model produced more errors, we use their own models with fewer errors. So with the wrong fewer errors divided by one minus the more, in fact, it is a measure of our model fit place to live data, that is no corresponding index error.
From the above, we can draw the following conclusions:
R & lt ^ 2 <= 1
R2 Ye greater, the greater the subtraction of a small molecule, a low error rate; prediction model when we make any mistakes, R2 maximum value of 1
when we when model is equal to the reference model, R ^ 2 = 0
if R ^ 2 <0, that we learned model not as a reference model. At this point, it is likely the data is not any linear relationship exists.


Source: blog.csdn.net/weixin_43312354/article/details/104739970