Data mining: model assessment


In competitions, regression and classification problems come with their own evaluation metrics, and different competitions use different ones. If, when training a model, we only remember to minimize the mean squared error for regression and pay no attention to the competition's own objective evaluation metric, we tend to get poor results, sometimes wildly off the mark.

First, why are there so many evaluation metrics

Because every scenario has its own particularities, a competition will emphasize different things. For example, on the same problem the results under metrics M1 and M2 may not be the same; we then need to look at the specific requirements, and if M1 matters a little more in that scenario, the optimization should focus on M1.

Some commonly used evaluation metrics are covered in the sections below.
Here I list some fairly good articles about model evaluation found online; since my own knowledge is limited, I cannot yet give a good summary myself, so I refer to others' work. Once I am more familiar with the topic, I will add my own supplements.
Evaluation methods for machine learning models
Evaluating Machine Learning Models: major concepts and pitfalls
The most complete summary of machine-learning model evaluation
Commonly used model evaluation metrics
In addition, this article draws on Cai Cai's machine learning (sklearn) course.

Second, regression evaluation metrics

Evaluating a regression model follows the same basic principle as evaluating a classification model: measure the difference between the predicted values and the true labels. For regression algorithms, we look at performance from two different angles:
first, whether we predicted the correct values;
second, whether we fit enough information.
These two angles correspond to different evaluation metrics.

Whether we predicted the correct values: MSE and MAE

RSS, the residual sum of squares, is the sum of the squared differences between the predicted values and the true values. It evaluates regression from the first angle: it is both the loss function of our regression model and one of its evaluation metrics. However, RSS has a fatal flaw: it is unbounded and can grow without limit (the more samples there are, the larger the residual sum of squares becomes).
To deal with this, sklearn uses a variant of RSS, the mean squared error MSE (mean squared error), to measure the difference between predictions and true values. Dividing by the number of samples removes the effect of sample size on the evaluation and gives an average error per sample, which can be compared with the range of the labels themselves (for example, against the label mean) to give a more reliable basis for assessment. In sklearn there are two ways to call this metric:

  1. Call mean_squared_error from the model-evaluation module sklearn.metrics.
  2. Use cross-validation via cross_val_score and select the mean squared error through its scoring parameter.

1. Why is the mean squared error negative?
cross_val_score(reg, X, y, cv=10, scoring="neg_mean_squared_error")
The mean squared error is itself an error, i.e. a loss, so sklearn classifies it as a loss (loss). In sklearn, all losses are represented by negative numbers, which is why the mean squared error is also displayed as a negative number; the real MSE is simply neg_mean_squared_error with the minus sign removed.
The metrics module, by contrast, returns the positive mean squared error.
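
Below is a minimal sketch of both calling styles and of the sign convention; the dataset, the regressor reg and the train/test split are made up for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# made-up regression data for illustration
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(Xtrain, Ytrain)

# 1. metrics interface: positive MSE between true labels and predictions
mse = mean_squared_error(Ytest, reg.predict(Xtest))

# 2. cross-validation interface: sklearn reports losses as negative numbers
neg_mse = cross_val_score(reg, X, y, cv=10, scoring="neg_mean_squared_error")

print(mse, -neg_mse.mean())   # removing the sign recovers the ordinary MSE
```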

MAE: Mean Absolute Error, the mean absolute error
MAE = (1/m) * Σ |y_i - ŷ_i|, where the sum runs over the m samples.
The idea is the same as the mean squared error, except that the difference between the true labels and the predicted values is measured with the L1 norm (absolute value). In practice, either MSE or MAE may be chosen (MAE is less sensitive to outliers than MSE, i.e. more robust).
In sklearn, call MAE with from sklearn.metrics import mean_absolute_error.
In cross-validation, use scoring="neg_mean_absolute_error" to call MAE.
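
A minimal sketch of both MAE calls, reusing the reg, X, y and test split from the MSE sketch above:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

mae = mean_absolute_error(Ytest, reg.predict(Xtest))

# as with MSE, the cross-validation scorer is reported as a negative loss
neg_mae = cross_val_score(reg, X, y, cv=10, scoring="neg_mean_absolute_error")
print(mae, -neg_mae.mean())
```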

Whether we fit enough information
For regression algorithms, it is not enough to ask only whether the predicted values are accurate. Beyond the numerical size of the data, we also hope the model can capture the "patterns" in the data, such as its distribution and monotonicity, and whether this information is captured cannot be measured with MSE.
In the figure below, measured by MSE the error would be small, yet the fit is clearly wrong: it does not follow the changing trend of the data. So regression cannot be evaluated from MSE alone.
(Figure: a fit with a small MSE that nevertheless misses the changing trend of the data.)

We use variance to measure the amount of information in the data. The larger the variance, the more information the data carries, and this information includes not only the size of the values but also the patterns we hope the model captures. To measure how much of that information the model captures, we define R^2:
R^2 = 1 - Σ (y_i - ŷ_i)^2 / Σ (y_i - ȳ)^2
Here the numerator is (up to the sample size) the MSE and the denominator is the variance; since both would be divided by the number of samples, the 1/m factors cancel out. The variance
is essentially the difference between each y value and the sample mean: the larger that difference, the more information the values carry.
In R^2, the numerator is the difference between the true values and the predicted values (a measure of the degree of error), i.e. the amount of information our model fails to capture, while the denominator is the total amount of information carried by the true labels. R^2 therefore equals 1 minus the ratio of the information the model fails to capture to the information carried by the true labels, so the closer R^2 is to 1, the better.
sklearn offers three ways to call it:

  1. Import r2_score directly from metrics and pass in the predicted values and the true values.
  2. Call the score interface of the linear regression class LinearRegression directly (all regression models have this interface).
  3. In cross-validation, pass scoring="r2".
from sklearn.metrics import r2_score
r2_score(yhat,Ytest)
r2 = reg.score(Xtest,Ytest)
r2

However, these two results are not the same...

2. Different results from the same evaluation metric

When we evaluated classification models, a comparison such as if a == b is exactly the same thing as if b == a, so argument order never tripped us up there.
But look at the formula for R^2: unlike classification metrics such as accuracy or precision, computing R^2 involves the difference between the predicted and true values, and the denominator is built only from the true labels. So when we call an evaluation metric from the metrics module, we must check its documentation to see which parameter expects the true values and which expects the predicted values. (For R^2 in particular, note whether the first argument should be the predicted values or the true values; passing them in a different order gives a different result.)
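
A minimal sketch of the pitfall and its fix, reusing reg, Xtest and Ytest from the earlier regression sketch; in r2_score the true labels come first, and keyword arguments make the order explicit:

```python
from sklearn.metrics import r2_score

yhat = reg.predict(Xtest)

r2_wrong = r2_score(yhat, Ytest)        # arguments swapped: a different number
r2_right = r2_score(Ytest, yhat)        # true labels first, as documented
r2_interface = reg.score(Xtest, Ytest)  # same value as r2_right

# keyword arguments make the intent explicit and avoid the pitfall
r2_kw = r2_score(y_true=Ytest, y_pred=yhat)
print(r2_wrong, r2_right, r2_interface, r2_kw)
```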

3. Negative R^2
TSS = ESS + RSS does not always hold, which is why R^2 = 1 - RSS/TSS is not always equal to ESS/TSS and can even be negative.
TSS = Σ (y_i - ȳ)^2 = Σ (y_i - ŷ_i)^2 + Σ (ŷ_i - ȳ)^2 + 2 Σ (ŷ_i - ȳ)(y_i - ŷ_i) = RSS + ESS + 2 Σ (ŷ_i - ȳ)(y_i - ŷ_i)
Whether the cross term 2 Σ (ŷ_i - ȳ)(y_i - ŷ_i) (divided by TSS) equals 0 depends on the assumptions of linear regression:

  1. There is a linear relationship between the dependent variable and the independent variables.
  2. Under repeated sampling, the values of the independent variables are fixed, i.e. x is assumed to be non-random.
  3. The error term is a random variable with expectation 0.
  4. For all x, the variance of the error term is the same.
  5. The error term follows a normal distribution.
    When condition 3 is not satisfied, the cross term will not be zero.

When R^2 is negative, first check whether the modeling process and the data handling are correct: the data itself may have been corrupted, or there may be a bug in your modeling process. If you are using an ensemble regression model, check whether the number of weak estimators is sufficient; random forests and boosted tree models easily produce negative values when there are only a few trees. If you have checked all of your code, your preprocessing is fine, and R^2 is still negative, then that proves linear regression simply does not fit your data.


Third, classification evaluation metrics

When a classification problem is measured with accuracy alone, the minority class cannot be properly captured. (For example, with 80 positive samples and 20 negative ones, predicting everything as positive already gives 80% accuracy, but that is meaningless; for problems such as credit-card fraud we care more about the minority class, so accuracy alone cannot be the measure.) Instead, we look for a balance between the ability to capture the minority class and the cost paid for misjudging the majority class. If a model captures the minority class as well as possible while still judging the majority class correctly, it is a very good model. To assess this ability, we introduce new evaluation tools: the confusion matrix and the ROC curve.

Confusion matrix:
The confusion matrix is a multi-dimensional way of measuring binary classification problems and is extremely useful when the samples are imbalanced. In the confusion matrix, we treat the minority class as the positive class and the majority class as the negative class. (The true value is always written first.)
| | Predicted 1 (minority) | Predicted 0 (majority) |
| --- | --- | --- |
| True 1 (minority) | 11 (true positive) | 10 (false negative) |
| True 0 (majority) | 01 (false positive) | 00 (true negative) |
From the confusion matrix we derive six different model evaluation metrics. They all lie in [0, 1]: metrics with 11 and 00 in the numerator should be as close to 1 as possible, while metrics with 01 and 10 in the numerator should be as close to 0 as possible.

3.1 The art of capturing the minority class: precision, recall and F1 score

The overall effect of the model: accuracy
Accuracy is the number of correctly predicted samples of both classes divided by the total number of samples; in general, the closer to 1 the better.
Accuracy = (11 + 00) / (11 + 10 + 01 + 00)

The art of capturing the minority class: precision, recall and F1 score

Precision, also known as the positive predictive value, is the proportion of the samples we predicted to be the minority class that truly belong to the minority class.
Precision = 11 / (11 + 01)
The lower the precision, the more majority-class samples we have "injured" (among the samples we predict as the minority class there are both minority- and majority-class samples, so low precision means many majority-class samples that should have been left alone were misclassified as the minority class). If the cost of misjudging the majority class is high, i.e. we very much want the majority class to be judged correctly, then we should avoid hurting the majority class and pursue high precision. Of course, if our goal is to capture the minority class at any cost, then we do not care much about precision. In short, choose precision when you care about whether the majority class is judged correctly.

Recall, also known as sensitivity or the true positive rate,
is the proportion of all samples that are truly 1 that we predict correctly.
Recall = 11 / (11 + 10)
Recall is concerned only with the minority class. The higher the recall, the more of the minority class we capture (the denominator is all samples that are truly the minority class, so a higher recall means a larger share of the true minority class is predicted correctly, i.e. more of the minority class is captured); the lower the recall, the more of the minority class we fail to capture.

Recall and precision trade off against each other (the minority and the majority class very often overlap, so while capturing the minority class, some majority-class samples will inevitably be caught in the crossfire). The balance between them reflects the balance between the need to capture the minority class and the need to avoid hurting the majority class. Which cost is higher, hurting the majority class or failing to capture the minority class, has to be weighed against the business requirements.

To take both precision and recall into account, we use their harmonic mean as a single metric that balances the two, called the F1-measure. The harmonic mean of two numbers tends toward the smaller of the two, so pursuing the highest possible F1-measure guarantees that both precision and recall are reasonably high. The F1-measure lies in [0, 1], and the closer to 1 the better.
F1 = 2 * Precision * Recall / (Precision + Recall)

3.2 Considering misjudgement of the majority class: specificity and the false positive rate

**Specificity** is the proportion of all samples that are truly 0 that are correctly predicted as 0.
Specificity = 00 / (01 + 00)

Specificity measures a model's ability to judge the majority class correctly, while 1 - specificity measures how often the model misjudges the majority class; this quantity, computed as below, is called the false positive rate (False Positive Rate). (False positive rate: how many of the majority class are judged wrongly. Precision: how many of the samples judged as the minority class are judged correctly.)

FPR = 01 / (01 + 00) = 1 - Specificity

The six evaluation metrics derived from the confusion matrix can be computed in sklearn with functions from the metrics module:
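As a minimal sketch, on a made-up imbalanced dataset with a hypothetical classifier clf, the commonly used metric functions look like this; specificity and FPR have no dedicated function but can be read off the confusion matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# made-up imbalanced toy data: class 1 is the minority
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xtrain, Ytrain)
ypred = clf.predict(Xtest)

print(confusion_matrix(Ytest, ypred, labels=[1, 0]))  # minority class written first
print(accuracy_score(Ytest, ypred))
print(precision_score(Ytest, ypred))   # 11 / (11 + 01)
print(recall_score(Ytest, ypred))      # 11 / (11 + 10)
print(f1_score(Ytest, ypred))

# specificity and FPR read off the confusion matrix (the "true 0" row)
tn, fp, fn, tp = confusion_matrix(Ytest, ypred).ravel()
specificity = tn / (tn + fp)
fpr = fp / (tn + fp)
print(specificity, fpr)
```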

3.3 The ROC curve and related issues

Of the six evaluation metrics, the most important is recall, followed by precision and the false positive rate FPR.

The false positive rate has a very important application. As we push Recall higher, Precision will fall: as more of the minority class is captured, more of the majority class gets misjudged. But we are very curious: as Recall gradually increases, how does the model's tendency to misjudge the majority class change? We want to know, for every minority-class sample judged correctly, how many majority-class samples are judged wrongly. The false positive rate (out of the majority class, how many are judged wrongly) measures exactly this ability. (The larger the Recall, the more of the minority class is captured, but also the more of the majority class is hit; more collateral damage means more majority-class samples judged wrongly, and FPR, computed over the majority class, measures exactly how many of them were judged wrongly, so FPR rises as Recall rises. If Recall rises faster than FPR, the cost of identifying the minority class is not large; if FPR rises faster than Recall, then each minority-class sample we identify costs us many majority-class samples.)

By contrast, Precision cannot tell us what fraction of the whole majority class these misjudged samples represent, nor can it reflect the model's overall accuracy while Recall is being raised. Therefore we use the balance between Recall and FPR, instead of the balance between Recall and Precision, to measure how the collateral damage to the majority class changes as the model captures as much of the minority class as possible. This is the balance that the ROC curve measures.

The ROC curve, in full The Receiver Operating Characteristic curve, plots the false positive rate FPR at different thresholds on the horizontal axis against Recall at those thresholds on the vertical axis.

Probability and threshold:
Probability: the probability that a sample point belongs to a given class.
Threshold: a manually chosen number; if the probability exceeds the threshold, the sample is assigned to that class.

To understand probability and threshold, it is easiest to recall how we classify with logistic regression. The predict_proba interface of logistic regression produces, for each sample, a likelihood (class probability) under every class label. Logistic regression then stipulates that when a sample's likelihood under a class exceeds 0.5, the sample is assigned to that class. In this process, 0.5 is called the threshold.
By continually adjusting the threshold we can reach different combinations of Recall and FPR. Generally speaking, lowering the threshold raises Recall.
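
A minimal sketch of threshold adjustment, reusing the clf, Xtest and Ytest from the confusion-matrix sketch above; lowering the threshold below 0.5 typically raises Recall, at the price of a higher FPR:

```python
import numpy as np
from sklearn.metrics import recall_score

proba = clf.predict_proba(Xtest)[:, 1]   # probability of the minority class (label 1)

for threshold in (0.5, 0.3, 0.1):
    ypred = (proba > threshold).astype(int)
    tn = np.sum((Ytest == 0) & (ypred == 0))
    fp = np.sum((Ytest == 0) & (ypred == 1))
    print(threshold,
          recall_score(Ytest, ypred),   # usually rises as the threshold drops
          fp / (fp + tn))               # FPR usually rises too
```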

Note, however, that raising or lowering the threshold will not always raise or lower Recall; it all depends on the actual distribution of the data. And to study the effect of the threshold at all, we must first obtain the classifier's predicted probabilities for the minority class. For algorithms that naturally produce likelihoods, such as logistic regression, or that compute probabilities, such as naive Bayes, probabilities are easy to get; but for some other classifiers, such as decision trees or SVMs, classification is not inherently tied to probability.

A decision tree has leaf nodes, and a leaf node may contain samples of different classes. Suppose a sample falls into a leaf that contains 10 samples, 6 of class 1 and 4 of class 0. Then class 1 occurs on this leaf with probability 60% and class 0 with probability 40%, and for every sample on that leaf, the "probability" it is assigned for classes 1 and 0 is exactly these frequencies; you can verify this yourself, as sketched below. But consider one problem: a decision tree can be grown very deep, and when it is deep enough each leaf may contain only a single class label, i.e. the impurity of every leaf is 0. In that case the "probability" for each sample is either 0 or 1, and we can no longer adjust Recall and FPR by moving the threshold.
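A minimal sketch of this behaviour on the same made-up data: a shallow tree returns leaf class fractions, while a fully grown tree with pure leaves returns almost only hard 0/1 "probabilities":

```python
from sklearn.tree import DecisionTreeClassifier

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xtrain, Ytrain)
deep = DecisionTreeClassifier(random_state=0).fit(Xtrain, Ytrain)  # grown until leaves are pure

print(shallow.predict_proba(Xtest)[:5])  # leaf class fractions, not just 0/1
print(deep.predict_proba(Xtest)[:5])     # almost entirely 0s and 1s
```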
So if we need probabilities, we prefer logistic regression or naive Bayes. That said, SVM can in fact also produce probabilities.

SVM probability predictions: the important parameter probability, and the interfaces predict_proba and decision_function

The closer a point lies to the decision boundary, the more ambiguous its class is; the farther it lies from the boundary, the more confidently it can be assigned to a class. The distance from the decision boundary can therefore be used to measure how likely a sample is to belong to a class, which is why the values returned by the decision_function interface are treated as the SVM's confidence. To obtain probabilities as well, SVC has the important parameter probability.
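A minimal sketch, again reusing the made-up Xtrain, Ytrain and Xtest: decision_function returns signed distances to the boundary usable as confidence scores, and setting probability=True additionally enables predict_proba:

```python
from sklearn.svm import SVC

svm = SVC(kernel="linear", probability=True, random_state=0).fit(Xtrain, Ytrain)

print(svm.decision_function(Xtest)[:5])  # signed distance to the boundary (confidence)
print(svm.predict_proba(Xtest)[:5])      # probabilities, available because probability=True
```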

The fundamental purpose of the ROC curve is to find the balance between Recall and FPR, so that we can measure how the collateral damage to the majority class changes while the model captures as much of the minority class as possible.
If the ROC curve is concave (below the diagonal), simply swapping the positive and negative classes fixes it. The closer the curve lies to the middle diagonal, the worse the model.
The horizontal axis is FPR, representing the model's tendency to misjudge the majority class; the vertical axis is Recall, representing the model's ability to capture the minority class. So the ROC curve shows how FPR increases as Recall increases.
We hope that, as Recall keeps improving, FPR rises as slowly as possible: that means we can capture the minority class efficiently without misjudging large numbers of the majority class. In the image, we therefore want the vertical coordinate to rise quickly while the horizontal coordinate grows slowly, i.e. an arc sweeping across the top left of the plot. Such a curve indicates a model that captures the minority class well.
Judging by the curve alone is not reliable; we need a number to measure how close the curve gets to the top-left corner. That number is the AUC, the area under the ROC curve: the larger the area, the closer the ROC curve is to the top-left corner and the better the model. Computing the AUC is fairly involved, so we let sklearn do it for us.

The ROC curve and the AUC in sklearn:
In sklearn, sklearn.metrics.roc_curve computes the horizontal coordinate FPR, the vertical coordinate Recall and the corresponding thresholds of the ROC curve for us, while sklearn.metrics.roc_auc_score computes the AUC.
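
A minimal sketch, reusing clf, Xtest and Ytest from the classification sketches above:

```python
from sklearn.metrics import roc_auc_score, roc_curve

proba = clf.predict_proba(Xtest)[:, 1]          # scores for the minority class

fpr, recall, thresholds = roc_curve(Ytest, proba, pos_label=1)
auc = roc_auc_score(Ytest, proba)
print(auc)
```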

When a model cannot produce probabilities, there is no need to force it to: as long as it provides a confidence score, we can still build the ROC curve.

Using the ROC curve to find the best threshold:
What the ROC curve reflects is how FPR changes as Recall increases, i.e. how badly the majority class is misjudged as the model's ability to capture the minority class grows. What we hope for is that, as this ability grows, the majority class is hurt as little as possible: as Recall increases, FPR should stay as small as possible. The point we are looking for is therefore the one where the gap between Recall and FPR is largest. This point is also known as the Youden index.

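A minimal sketch of picking that point from the fpr, recall and thresholds returned by roc_curve above; the threshold at the maximum of Recall - FPR is taken as the best threshold:

```python
import numpy as np

best = np.argmax(recall - fpr)   # Youden index: largest gap between Recall and FPR
print(thresholds[best], recall[best], fpr[best])
```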

Fourth, clustering evaluation metrics

Unlike evaluating classification and regression models, evaluating a clustering algorithm is not a simple matter.
In classification there is a direct output (the label), and the result is either right or wrong, so we use metrics such as predictive accuracy, the confusion matrix and the ROC curve; however we assess it, we are measuring the model's ability to "find the correct answer".
In regression, since we are fitting data, we have the SSE / mean squared error and loss functions to measure how well the model fits. But none of these measures can be used for clustering.

4.1 True labels known

(Figures: metrics used when the ground-truth labels are known.)

4.2 True labels unknown

When the true labels are unknown, clustering is evaluated entirely by how dense each cluster is internally (small differences within a cluster) and how dispersed different clusters are from one another (large differences between clusters). The silhouette coefficient is the most commonly used such metric (we want b to be as large as possible and a as small as possible). It ranges over (-1, 1); s greater than 0 indicates that the clustering is reasonable, and the closer to 1 the better. A value close to 0 indicates that two clusters are barely distinguishable and could be merged into one.
For a single sample, let a be its average distance to the other samples in the same cluster and b its average distance to the samples in the nearest other cluster; the sample's silhouette coefficient is s = (b - a) / max(a, b).
In addition, the Calinski-Harabasz index (Calinski-Harabaz Index, abbreviated CHI, also called the variance ratio criterion), the Davies-Bouldin index and the contingency matrix (Contingency Matrix) are also available.
Among these, sklearn.metrics.calinski_harabaz_score(X, y_pred) is used fairly often because it is a matrix computation and runs fast.
We usually plot the distribution of silhouette coefficients together with the clustered data in order to choose the best n_clusters; see the Cai Cai machine-learning notes mentioned above.
A learning curve of the silhouette coefficient alone cannot settle the choice, because without analysing how good each resulting cluster actually is, it is hard to say that the value with the highest silhouette coefficient really gives the best clustering.
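
A minimal sketch on made-up blob data; note that in recent sklearn versions the function is spelled calinski_harabasz_score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for n_clusters in (2, 3, 4, 5):
    y_pred = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(X)
    print(n_clusters,
          silhouette_score(X, y_pred),          # closer to 1 is better
          calinski_harabasz_score(X, y_pred))   # larger is better
```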



Origin blog.csdn.net/AvenueCyy/article/details/104552621