Machine Learning Notes 1 - Empirical Errors, Model Evaluation Methods and Performance Measurements

About model evaluation

Empirical error

Error rate

Given m samples of which a are misclassified, the error rate is $E = a/m$;

Empirical error

The error of the learner on the training set is called the empirical error or training error;

The error on new samples is called generalization error;

A higher accuracy on the training set does not necessarily mean a better learner; what we ultimately care about is the generalization error;

Overfitting and underfitting

Overfitting: the learner fits the training samples too well, so that peculiarities of the training samples themselves are mistaken for general properties of all potential samples (training error is small, generalization error is large);

Underfitting: the learner has not yet captured the general properties of the training samples, so the training error is large;

(Figure: an illustration of overfitting vs. underfitting; source: the Machine Learning Watermelon Book.)


Underfitting is relatively easy to overcome, but overfitting cannot be completely avoided; we can only try to reduce its impact;

Model evaluation method

The test set should be mutually exclusive with the training set as much as possible;

Hold-out method

Divide the data set into two mutually exclusive sets, one as the training set and the other as the test set; usually 2/3~4/5 of the samples are used for training and the rest are used for testing;

Note that stratified sampling is needed to keep the data distributions of the two sets consistent;

To make the evaluation both accurate and general, we can repeat the random split n times and take the mean of the n results;
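A minimal hold-out sketch using scikit-learn (the iris data, the 70/30 split, and the logistic-regression model are illustrative choices, not part of the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing; stratify=y performs the
# stratified sampling mentioned above so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```

Repeating this with different `random_state` values and averaging the scores gives the n-fold repeated hold-out estimate described above.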

Cross-validation method

Divide the data set into k mutually exclusive subsets of similar size, then run k rounds (a short code sketch follows this list):

  • In round i, the i-th subset is used as the test set and the union of the remaining subsets as the training set ($i \in [1, k]$);
  • After k rounds, each subset has served once as the test set; the mean of the k test results is returned;
  • The stability of the evaluation result depends on the value of k; k = 10 is the usual choice;
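A minimal k-fold cross-validation sketch with scikit-learn (the data set, the model, and k = 10 below are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# k = 10 folds: each fold is used once as the test set while the rest train the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("per-fold accuracy:", np.round(scores, 3))
print("mean of the k results:", scores.mean())
```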


A special case of cross-validation – the leave-one-out method
  • If the data set contains m samples and we perform k divisions with k = m, the way of dividing is unique: each subset contains exactly one sample;

  • It is called the leave-one-out method because each round leaves exactly one sample out as the test set;

  • Since each training set differs from the initial data set by only one sample, the leave-one-out estimate is often considered quite accurate;

  • The drawback is the computational overhead: the number of divisions (and models to train) grows with the amount of data (a short sketch follows this list);
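A leave-one-out sketch using scikit-learn's `LeaveOneOut` splitter (again on illustrative data; with m samples this fits m models, which is exactly where the computational cost comes from):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One model per sample: each split leaves exactly one sample out as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))          # equals the number of samples
print("LOO accuracy estimate:", scores.mean())
```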

Bootstrap method
  • Given a data set D containing m samples and a sampling set T that is initially empty;
  • Each time, randomly pick one sample from D and copy it into T; a sample that has already been picked may be picked again (sampling with replacement);
  • Repeat this step m times to obtain the final sampling set T; use T as the training set and D \ T as the test set;

The probability that a given sample is never picked in any of the m draws is $\lim_{m \to \infty}\left(1-\frac{1}{m}\right)^m = \frac{1}{e} \approx 0.368$

That is, each sample has a probability of roughly 1/3 (more precisely, about 36.8%) of never being picked; viewed macroscopically, about 1/3 of the samples end up serving as the test set;

The bootstrap method is useful for small data sets, or when it is otherwise hard to divide the data into training/test sets; it can also generate multiple different training sets from the initial data set. However, it changes the distribution of the initial data set (the sampling is random, so stratified sampling cannot be guaranteed), which introduces estimation bias. When the amount of data is sufficient, the first two methods are preferable.
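A minimal NumPy sketch of bootstrap sampling; the sample count is arbitrary and only serves to make the out-of-bag fraction visibly close to 1/e ≈ 0.368:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                            # size of the data set D

# Draw m indices with replacement: this is the sampling set T.
boot_idx = rng.integers(0, m, size=m)

# Samples never drawn form the out-of-bag set D \ T, used as the test set.
oob_mask = np.ones(m, dtype=bool)
oob_mask[boot_idx] = False

print("out-of-bag fraction:", oob_mask.mean())   # close to 1/e ≈ 0.368
```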

Performance metrics

Accuracy and error rate

These two indicators are suitable for both binary classification tasks and multi-classification tasks.

Assume a sample set $D = \{(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)\}$ is given, where $y_i$ is the true label of $x_i$ and $f(x_i)$ is the learner's prediction for $x_i$;

  • The error rate refers to the proportion of misclassified samples to the total number of samples, defined as follows:
    $E(f;D) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\left(f(x_i) \neq y_i\right)$, where $\mathbb{I}(\cdot)$ is the indicator function;

  • In regression methods, the error rate is often expressed as the mean square error:
    $E(f;D) = \frac{1}{m} \sum_{i=1}^{m} \left(f(x_i) - y_i\right)^2$

  • Accuracy is defined as $1 - E(f;D)$ (a small worked example follows this list);
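The three definitions above in a few lines of NumPy (the label and prediction arrays are made up for illustration):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])     # classifier output f(x_i)

error_rate = np.mean(y_pred != y_true)     # E(f;D)
accuracy = 1 - error_rate                  # 1 - E(f;D)

# Regression analogue: mean squared error of real-valued predictions.
f_x = np.array([2.5, 0.0, 2.1, 7.8])
y   = np.array([3.0, -0.5, 2.0, 7.0])
mse = np.mean((f_x - y) ** 2)

print(error_rate, accuracy, mse)
```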

Recall and precision

The error rate alone cannot satisfy every task's requirements, so we introduce the concepts of recall and precision;

For example, to judge how good a search engine is, let D be the full set of correct information for a user's query (the information that can satisfy the user's need), and let S be the set of information the engine returns. There are two main questions we care about:

  • How much information in S is in D?
  • How much information in D is in S?

The first question asks what proportion of the returned information (|S|) is actually correct; this is the precision;

If very few results are returned, the above metric alone is not enough, so we must consider the second question: how much of the correct information does the engine manage to return in total? This is the recall;

From this example we can see that the two metrics are mainly used for binary classification problems. Consider the following table:

| Ground truth | Retrieved by the engine | Discarded by the engine |
| --- | --- | --- |
| Correct information | TP (true positive) | FN (false negative) |
| Incorrect information | FP (false positive) | TN (true negative) |

The results retrieved by a search engine are either truly correct (TP) or actually incorrect (FP). By the definition of precision, the formula for the precision P is:
$P = \frac{TP}{TP+FP}$
Among the truly correct information, some is retrieved by the engine (TP) and some is discarded (FN). The formula for the recall R is therefore:
$R = \frac{TP}{TP+FN}$
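Precision and recall computed directly from the confusion-matrix counts (a small made-up example, with 1 standing for correct/retrieved information):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # 1 = relevant, 0 = irrelevant
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = retrieved by the engine

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

precision = TP / (TP + FP)   # P = TP / (TP + FP)
recall    = TP / (TP + FN)   # R = TP / (TP + FN)
print(precision, recall)
```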
Recall and precision are two conflicting metrics: to raise recall the engine must return more results, which tends to lower precision. We therefore introduce the P-R curve to show both metrics at once and describe learner performance (figure source: the Watermelon Book);

(Figure: P-R curves of learners A, B, and C.)

Curve A and curve B completely "wrap" curve C, so we can judge that learners A and B are better than C. But A and B intersect: B is stronger where recall is high, while A is stronger elsewhere, so it is hard to say which of the two is better.

We therefore introduce the balance point (Break-Even Point, BEP): the point where P = R. Whichever learner has the higher value there is considered relatively stronger; since A's break-even value is higher, A can be considered better than B.

However, this evaluation is still not objective enough, because it implicitly assumes recall and precision are equally important. In practice their relative importance varies with the task; in Taobao recommendations, for example, precision matters more. We therefore define a weighted measure:
$F_{\beta} = \frac{(1+\beta^2) \times P \times R}{\beta^2 \times P + R}$
When $\beta < 1$, precision has the greater influence; conversely, when $\beta > 1$, recall has the greater influence;

When $\beta = 1$, the formula reduces to the standard F1 measure:
$F_1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{m + TP - TN}$
where m is the total number of samples;
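A small helper for the weighted F-measure (the P, R, and β values passed in are illustrative):

```python
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily, beta > 1 weights recall more,
    and beta = 1 reduces to the standard F1 measure.
    """
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(f_beta(0.75, 0.60))             # F1
print(f_beta(0.75, 0.60, beta=0.5))   # precision-weighted
print(f_beta(0.75, 0.60, beta=2.0))   # recall-weighted
```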

Receiver operating characteristic – ROC curve

This is an extension of recall and precision. The idea is to apply a classification threshold to a real-valued prediction to obtain a binary decision; it is mainly used for learners, such as neural networks, whose outputs are continuous scores;

Neural networks generally output values between 0 and 1. Set a threshold a between 0 and 1: if the output is greater than a the sample is judged positive, otherwise negative;

Take recall and precision as an example. If precision matters more, then, referring to the P-R trade-off above, we can set the threshold higher; conversely, if recall matters more, we set the threshold lower;

The ROC curve is a tool for judging the expected generalization performance of a learner. The concept is a little hard to grasp in the abstract; reading through the steps below first should make it clearer;

First, repeat the table from above:

| Ground truth | Retrieved by the engine | Discarded by the engine |
| --- | --- | --- |
| Correct information | TP (true positive) | FN (false negative) |
| Incorrect information | FP (false positive) | TN (true negative) |

Define two concepts:

  • Hit rate (true positive rate): $TPR = \frac{TP}{TP+FN}$ (the proportion of positive examples that are found);
  • False alarm rate (false positive rate): $FPR = \frac{FP}{TN+FP}$ (the proportion of negative examples that are wrongly reported as positive);

We then sort the samples by the learner's predicted score;

Next, plot a curve with FPR on the horizontal axis and TPR on the vertical axis:

The drawing process is roughly as follows:

  • Given m positive examples and n negative examples, sort all samples by the learner's predicted score to obtain the sequence $s_1, s_2, ..., s_{m+n}$;
  • First set the classification threshold above the highest score, so that every sample is predicted negative. Then TP = FP = 0, the hit rate and false alarm rate are both 0, and a point is drawn at the origin;
  • Then set the threshold to the predicted value of each sample in turn, compute the hit rate and false alarm rate at each threshold, and plot the corresponding coordinate points $(x_i, y_i)$;
  • Finally connect the points into a curve. The larger the area enclosed under the curve, the better the learner separates positives from negatives (a short code sketch follows this list);
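A sketch of the procedure above: start at the origin, lower the threshold through every predicted score, and record a (FPR, TPR) point at each step (the scores and labels are made up):

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])  # learner outputs

P, N = np.sum(y_true == 1), np.sum(y_true == 0)

# Start at the origin: threshold above every score, so everything is negative.
points = [(0.0, 0.0)]
for t in np.sort(scores)[::-1]:          # lower the threshold one sample at a time
    y_pred = (scores >= t).astype(int)
    TPR = np.sum((y_pred == 1) & (y_true == 1)) / P   # hit rate
    FPR = np.sum((y_pred == 1) & (y_true == 0)) / N   # false alarm rate
    points.append((FPR, TPR))

print(points)   # connect these points to obtain the ROC curve
```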

From the process above we can see that this is really an extended application of recall and precision: to raise the hit rate we must predict more samples as positive, which inevitably raises the false alarm rate as well. If the hit rate grows quickly while the false alarm rate stays low, the learner is better.

If you want to use the ROC curve to quantitatively express the performance of two learners, there are two methods:

  • Find the optimal cut-off point: find the threshold that maximizes $TPR - FPR$. The closer this point lies to the upper-left corner of the plot, the better the ROC curve;

  • Area method (AUC): compare the areas enclosed by each curve, the x-axis, and the vertical line on the right of the plot; the area can be computed with the following formula (a short computation follows this list):
    $AUC = \frac{1}{2} \sum_{i=1}^{m-1} (x_{i+1} - x_i)(y_i + y_{i+1})$
    The larger the AUC, the better the learner;
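The AUC formula applied to a list of (FPR, TPR) points such as those produced by the previous sketch (the points here are made-up values):

```python
# (FPR, TPR) points of a ROC curve, e.g. as computed in the sketch above.
points = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 1.0), (1.0, 1.0)]

# Trapezoid rule over consecutive points, exactly the formula above.
auc = sum(0.5 * (x2 - x1) * (y1 + y2)
          for (x1, y1), (x2, y2) in zip(points[:-1], points[1:]))
print(auc)   # 0.875 for this example
```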

Cost-sensitive error rates and cost curves

Taking the two-classification problem as an example, there are two main situations of classification errors:

  • Some false information is returned as if it were true information (false positives);
  • Some true information is discarded as if it were false information (false negatives);

In practice the prices we pay for the two kinds of error are completely different. In an access-control system, for example, wrongly locking a legitimate user out and wrongly letting an unauthorized user in have very different consequences. The previous measures all assumed equal error costs: they simply counted the number of errors without considering the consequences of different errors. We therefore weight the two error categories separately.

Assume that in a sample set D the positive example set is T and the negative example set is F; the cost incurred by a false positive is $cost_{10}$ and the cost incurred by a false negative is $cost_{01}$;

Weighting the error rate formula gives the cost-sensitive error rate:
$E(f;D;cost) = \frac{1}{m}\left(\sum_{x_i \in T} \mathbb{I}\left(f(x_i) \neq y_i\right) \times cost_{01} + \sum_{x_i \in F} \mathbb{I}\left(f(x_i) \neq y_i\right) \times cost_{10}\right)$
When the subscripts i and j in $cost_{ij}$ are not limited to 0 and 1, the formula can be rewritten as a cost-sensitive error rate for multi-class problems;
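A sketch of the binary cost-sensitive error rate (labels, predictions, and the two costs are made-up values):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0])
cost_01 = 5.0    # cost of a false negative (positive misclassified as negative)
cost_10 = 1.0    # cost of a false positive (negative misclassified as positive)

m = len(y_true)
fn_cost = np.sum((y_true == 1) & (y_pred == 0)) * cost_01   # errors on the positive set T
fp_cost = np.sum((y_true == 0) & (y_pred == 1)) * cost_10   # errors on the negative set F

E_cost = (fn_cost + fp_cost) / m
print(E_cost)
```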

Similarly, the ROC curve does not consider the cost issue, but the cost curve does. The process is as follows:

The first concept is the weighted positive-example probability (the cost-weighted probability that a sample is positive):
$P_{cost} = \frac{p \times cost_{01}}{p \times cost_{01} + (1-p) \times cost_{10}}$
where p is the probability that a sample is a positive example (not the precision defined earlier);

The second concept is the normalized cost; its formula is:
$cost_{norm} = \frac{(1-TPR) \times p \times cost_{01} + FPR \times (1-p) \times cost_{10}}{p \times cost_{01} + (1-p) \times cost_{10}}$
TPR and FPR are the hit rate and false alarm rate from the ROC section, so $(1-TPR)$ is the proportion of true (positive) information that is missed, i.e. the ratio of false negatives to all positive examples. The whole expression is a weighted combination of $(1-TPR)$ and $FPR$, and the result can be understood as the average weighted cost of the two types of error.

Using the weighted positive probability as the abscissa and the normalized cost as the ordinate, the cost curve can be obtained;
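The two quantities that build the cost curve, written as small functions (the p, TPR, FPR, and cost values used below are illustrative):

```python
def positive_prob_cost(p, cost_01, cost_10):
    """Weighted positive-example probability: the abscissa of the cost curve."""
    return p * cost_01 / (p * cost_01 + (1 - p) * cost_10)

def normalized_cost(p, tpr, fpr, cost_01, cost_10):
    """Normalized expected cost: the ordinate of the cost curve."""
    fnr = 1 - tpr
    num = fnr * p * cost_01 + fpr * (1 - p) * cost_10
    den = p * cost_01 + (1 - p) * cost_10
    return num / den

# One ROC point (TPR, FPR) sweeps out a segment from (0, FPR) to (1, FNR)
# as p goes from 0 to 1.
for p in (0.0, 0.5, 1.0):
    print(positive_prob_cost(p, 5.0, 1.0), normalized_cost(p, 0.8, 0.3, 5.0, 1.0))
```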

Expected overall cost and cost curve

For each point on the ROC curve, compute the two quantities $(1-TPR)$ (i.e. FNR) and $FPR$; then, on the cost plane, draw the line segment from $(0, FPR)$ to $(1, 1-TPR)$. Traversing all ROC points in this way produces a figure like the one in the Watermelon Book.

The shaded area is the expected overall cost, and the polyline bounding it (the lower envelope of all the segments) is the cost curve. Comparing with the previous section, when the abscissa is 0 (i.e. p = 0), $cost_{norm}$ reduces to $FPR$, and when the abscissa is 1 (p = 1), $cost_{norm}$ reduces to $1 - TPR$, which is the $FNR$ in the figure.

When facing a learning model that must take costs into account, you can refer to these two approaches; they are essentially two ways of carrying out the same idea, and which one to choose depends on the specific situation.

Next section: Machine Learning Notes 2 - Comparative Test


Origin blog.csdn.net/qq_45882682/article/details/122588141