Zhou Zhihua's "Machine Learning": performance measures

The evaluation criteria used to measure a model's generalization ability are called performance measures.

(1) error rate and accuracy

(2) precision, recall and F1

Based on the combination of a sample's true class and the class predicted by the learner, samples can be divided into true positives, false positives, true negatives, and false negatives. Let TP, FP, TN, FN denote the corresponding numbers of samples; then TP + FP + TN + FN = total number of samples.

Precision P and recall R are defined as:

P = TP / (TP + FP)   (number of true positives / number of samples predicted positive)

R = TP / (TP + FN)   (number of true positives / number of actual positive samples)
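To make the definitions concrete, here is a minimal Python sketch (my own, not code from the book) that derives the four counts and the two rates from 0/1 label lists:

```python
# A minimal sketch: count TP/FP/FN from 0/1 labels, then compute P and R.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Toy example: 1 = positive (a good melon), 0 = negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.666..., 0.666...)
```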

When precision is high, recall is often low, and when recall is high, precision is often low. (Take picking watermelons as an example: if you want to recall as many good melons as possible, you should pick more melons to be safe, but this inevitably lowers precision; if you want high precision, you should pick only the melons you are surest of, but then you will certainly miss some good melons, so recall is low.) Usually, only in some simple tasks can both recall and precision be high.

We can sort the samples by how likely the learner believes each one is to be a positive example: the sample ranked first is the one the learner considers "most likely" to be positive, and the sample ranked last is the one it considers "least likely" to be positive. Predicting the samples as positive one by one from top to bottom, we compute the current recall and precision at each step. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, called the P-R curve, shown in a P-R diagram.
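The curve-tracing procedure just described can be sketched in a few lines of Python (an illustrative sketch; the function name and toy data are my own):

```python
# Trace a P-R curve: sort samples by score (most likely positive first),
# then sweep the cut point down one sample at a time.
def pr_curve(y_true, scores):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    tp = fp = 0
    points = []
    for i in order:          # predict one more sample as positive each step
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points

y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
for r, p in pr_curve(y_true, scores):
    print(f"recall={r:.2f}  precision={p:.2f}")
```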

If one learner's P-R curve is completely "enclosed" by another learner's curve, the latter can be asserted to outperform the former. When the two curves cross, performance measures that consider precision and recall jointly have been designed. The "Break-Even Point" (BEP) is one such measure: it is the value at which precision = recall. But BEP is oversimplified; the more commonly used measure is F1:

F1 = 2 × P × R / (P + R) = 2 × TP / (total number of samples + TP − TN)

In real-world applications, the relative emphasis on precision and recall varies. For example, in product recommendation systems precision matters more, while in fugitive information retrieval, where we hope to miss as few fugitives as possible, recall matters more. The general form of the F measure, Fβ, can express different preferences over precision and recall:

Fβ = (1 + β²) × P × R / ((β² × P) + R)

Here β > 0 measures the relative importance of recall to precision: β = 1 reduces to the standard F1; β > 1 gives recall more weight; β < 1 gives precision more weight.
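As a small illustration of how β shifts the preference (a hedged sketch, not the book's code):

```python
# General F_beta measure, following the formula above.
def f_beta(precision, recall, beta=1.0):
    # beta > 1 weights recall more heavily; beta < 1 weights precision more.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.4))            # F1 ~= 0.533 (harmonic mean)
print(f_beta(0.8, 0.4, beta=2.0))  # ~0.444, pulled toward the low recall
print(f_beta(0.8, 0.4, beta=0.5))  # ~0.667, pulled toward the high precision
```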

When we want to comprehensively examine precision and recall over n binary confusion matrices, one approach is: first compute the precision and recall on each confusion matrix, denoted (P1, R1), (P2, R2), …, (Pn, Rn), then average them, obtaining the "macro precision" (macro-P), "macro recall" (macro-R), and corresponding "macro F1" (macro-F1):

macro-P = (1/n) × Σ Pi
macro-R = (1/n) × Σ Ri
macro-F1 = (2 × macro-P × macro-R) / (macro-P + macro-R)

Another approach is to average the corresponding elements of the confusion matrices, obtaining the average values of TP, FP, TN, FN, and then compute the "micro precision" (micro-P), "micro recall" (micro-R), and "micro F1" (micro-F1) from these averages:

micro-P = avg(TP) / (avg(TP) + avg(FP))
micro-R = avg(TP) / (avg(TP) + avg(FN))
micro-F1 = (2 × micro-P × micro-R) / (micro-P + micro-R)
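A sketch contrasting the two averaging orders, assuming each confusion matrix is given as a (TP, FP, TN, FN) tuple (my own representation, not the book's):

```python
# Macro: average the per-matrix P and R. Micro: average the counts first.
def macro_micro(confusions):
    n = len(confusions)
    ps = [tp / (tp + fp) for tp, fp, tn, fn in confusions]
    rs = [tp / (tp + fn) for tp, fp, tn, fn in confusions]
    macro_p, macro_r = sum(ps) / n, sum(rs) / n
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

    avg_tp = sum(c[0] for c in confusions) / n
    avg_fp = sum(c[1] for c in confusions) / n
    avg_fn = sum(c[3] for c in confusions) / n
    micro_p = avg_tp / (avg_tp + avg_fp)
    micro_r = avg_tp / (avg_tp + avg_fn)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    return (macro_p, macro_r, macro_f1), (micro_p, micro_r, micro_f1)

print(macro_micro([(8, 2, 85, 5), (30, 10, 50, 10)]))
```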

(3) ROC and AUC

Many learners produce a real-valued or probability prediction for each test sample and compare it with a threshold: samples above the threshold are classified as positive, otherwise as negative. The test samples can be sorted by this real value or probability, with the "most likely positive" at the top and the "least likely positive" at the bottom. The classification process then amounts to choosing a "cut point" in this ranking that splits the samples in two: the first part is predicted positive, the rest negative. Where to place the cut point depends on the task, while the quality of the ranking itself reflects the learner's "expected generalization performance" across different tasks. The ROC curve studies learners from this ranking perspective and is a powerful tool for measuring generalization performance.

The ROC (Receiver Operating Characteristic) curve is drawn in a manner similar to the P-R curve: sort the samples by the prediction results, predict them as positive one by one, and each time compute the "True Positive Rate" (TPR) and the "False Positive Rate" (FPR), which are plotted on the vertical and horizontal axes respectively.

TPR = TP / (TP + FN)   (true positives / all actual positive samples)

FPR = FP / (TN + FP)   (false positives / all actual negative samples)

In the ROC plot, the point (0, 1) corresponds to the "ideal model" that ranks all positive examples before all negative ones, and the diagonal corresponds to a "random guessing" model. Drawing procedure: given m+ positive and m− negative examples (the two counts need not be equal), sort the examples by the prediction results. First set the classification threshold to the maximum, so that every example is predicted negative; the true positive rate and false positive rate are then both 0, so mark a point at (0, 0). Then set the classification threshold to each example's predicted value in turn, i.e., move the examples into the positive range one by one from top to bottom. Let the previous marked point be (x, y): if the current example is a true positive, mark the next point at (x, y + 1/m+); if it is a false positive, mark it at (x + 1/m−, y). Finally, connect all adjacent points.
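The drawing procedure above translates directly into a short sketch (toy data and names are my own):

```python
# Start at (0, 0); for each sample taken in score order, step up by
# 1/m+ on a true positive and right by 1/m- on a false positive.
def roc_points(y_true, scores):
    m_pos = sum(y_true)
    m_neg = len(y_true) - m_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    x, y = 0.0, 0.0
    points = [(x, y)]
    for i in order:
        if y_true[i] == 1:
            y += 1 / m_pos   # true positive: move up
        else:
            x += 1 / m_neg   # false positive: move right
        points.append((x, y))
    return points

print(roc_points([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.2]))
```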

When comparing learners: if one learner's ROC curve is completely "enclosed" by another learner's curve, the latter outperforms the former; if the two curves cross, the criterion is the area under the ROC curve, the AUC. The AUC can be obtained by summing the areas of the parts under the ROC curve. Assuming the ROC curve is formed by connecting the points {(x1, y1), (x2, y2), …, (xm, ym)} in order, with x1 = 0 and xm = 1, the AUC is estimated as:

AUC = (1/2) × Σ_{i=1}^{m−1} (x_{i+1} − x_i) × (y_i + y_{i+1})
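A sketch of this trapezoidal estimate, reusing roc_points from the previous sketch:

```python
# Sum trapezoid areas between consecutive ROC points.
def auc_trapezoid(points):
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += 0.5 * (x2 - x1) * (y1 + y2)
    return area

pts = roc_points([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.2])
print(auc_trapezoid(pts))  # 0.666...
```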

Formally, the AUC considers the ranking quality of the sample predictions, and it is closely tied to the ranking error. Given m+ positive and m− negative examples, let D+ and D− denote the sets of positive and negative examples respectively; the ranking "loss" is defined as:

ℓ_rank = (1/(m+ × m−)) × Σ_{x+ ∈ D+} Σ_{x− ∈ D−} ( I(f(x+) < f(x−)) + 0.5 × I(f(x+) = f(x−)) )

where I(·) is the indicator function.

That is, for every pair of one positive and one negative example, a "penalty" of 1 is recorded if the positive example's predicted value is smaller than the negative example's, and a penalty of 0.5 if they are equal. Then AUC = 1 − ℓ_rank.
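A brute-force sketch of ℓ_rank and the identity AUC = 1 − ℓ_rank (quadratic in the sample counts, fine for illustration; data and names are my own):

```python
# Count 1 penalty per inverted positive/negative pair, 0.5 per tie.
def auc_from_rank_loss(y_true, scores):
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    penalty = 0.0
    for sp in pos:
        for sn in neg:
            if sp < sn:
                penalty += 1.0
            elif sp == sn:
                penalty += 0.5
    l_rank = penalty / (len(pos) * len(neg))
    return 1.0 - l_rank

# 0.666..., matching the trapezoidal estimate above on the same toy data.
print(auc_from_rank_loss([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.2]))
```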

(4) cost-sensitive error rate and the cost curve

To weigh the different losses caused by different types of errors, errors can be assigned "unequal costs" (unequal cost). For example, a binary cost matrix (Table 2):

                 predicted class 0    predicted class 1
true class 0            0                 cost01
true class 1          cost10                0

Under unequal costs, we no longer want simply to minimize the number of errors, but rather to minimize the "total cost" (total cost). For the binary classification problem of Table 2 (taking class 0 as the positive class), the "cost-sensitive" error rate is:

E(f; D; cost) = (1/m) × ( Σ_{xi ∈ D+} I(f(xi) ≠ yi) × cost01 + Σ_{xi ∈ D−} I(f(xi) ≠ yi) × cost10 )

where D+ and D− denote the positive (class-0) and negative (class-1) subsets of the data set D.
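A minimal sketch of this cost-sensitive error rate, treating class 0 as the positive class as in the table (labels, costs, and data are my own):

```python
# Misclassified class-0 samples are charged cost01; class-1 samples, cost10.
def cost_sensitive_error(y_true, y_pred, cost01=1.0, cost10=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t != p:
            total += cost01 if t == 0 else cost10
    return total / len(y_true)

# Here missing a class-0 sample (cost01=10) is 10x worse than a false alarm.
print(cost_sensitive_error([0, 0, 1, 1], [0, 1, 0, 1],
                           cost01=10.0, cost10=1.0))  # 2.75
```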

A cost-sensitive error rate defined over a distribution can also be given, along with cost-sensitive versions of other performance measures such as accuracy. If the subscripts i, j in cost_ij are not restricted to 0 and 1, cost-sensitive performance measures for multi-class tasks can be defined as well. The horizontal axis of the "cost curve" (cost curve) is the positive-example probability cost, taking values in [0, 1]:

P(+)cost = (p × cost01) / (p × cost01 + (1 − p) × cost10)

where p is the probability that a sample is positive. The vertical axis is the normalized cost, taking values in [0, 1]:

cost_norm = (FNR × p × cost01 + FPR × (1 − p) × cost10) / (p × cost01 + (1 − p) × cost10)

where FPR is the false positive rate and FNR = 1 − TPR is the false negative rate.
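The two axis quantities can be sketched as follows (function names are my own):

```python
# Cost-curve coordinates for one operating point of a classifier.
def positive_prob_cost(p, cost01, cost10):
    # horizontal axis: positive-example probability cost, in [0, 1]
    return p * cost01 / (p * cost01 + (1 - p) * cost10)

def normalized_cost(p, cost01, cost10, fpr, fnr):
    # vertical axis: normalized expected cost, in [0, 1]
    return (fnr * p * cost01 + fpr * (1 - p) * cost10) / (
        p * cost01 + (1 - p) * cost10)

print(positive_prob_cost(0.5, 10.0, 1.0))         # ~0.909
print(normalized_cost(0.5, 10.0, 1.0, 0.2, 0.1))  # ~0.109
```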

Drawing the cost curve: for the detailed procedure, see P36 of the book.


(5) comparison tests

Problems in comparing learning performance: first, what we care about is generalization performance, while what we measure is test-set performance; second, the result depends on the choice of test set; third, learning algorithms themselves have inherent randomness.

Statistical hypothesis tests (hypothesis test) provide a basis for comparing learner performance. If learner A is observed to be better than B on a test set, a hypothesis test tells us whether A's generalization performance is better than B's in a statistical sense, and how confident we can be in that conclusion.

① hypothesis testing

A hypothesis is some judgment or conjecture about the distribution of a learner's generalization error rate, e.g., ε = ε0. In practice we can only observe the test error rate ε̂; the generalization error rate and the test error rate are not necessarily equal, but they are likely to be close and unlikely to differ greatly. Therefore, the distribution of the generalization error rate can be inferred from the test error rate.

For a learner with generalization error rate ε, the probability that it misclassifies m′ given samples out of m and classifies the rest correctly is ε^m′ × (1 − ε)^(m − m′). From this, the probability that it misclassifies exactly ε̂ × m samples (i.e., the probability that a learner with generalization error rate ε is measured to have test error rate ε̂) is:

P(ε̂; ε) = C(m, ε̂ × m) × ε^(ε̂ × m) × (1 − ε)^(m − ε̂ × m)

where C(m, k) denotes the binomial coefficient.

Taking the partial derivative of this expression with respect to ε shows that P(ε̂; ε) is maximized at ε = ε̂ and decreases as |ε − ε̂| grows. This probability follows a binomial distribution.
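A quick numeric check of this claim (my own sketch): with ε = 0.3 and m = 10, the probability mass peaks at 3 errors, i.e. ε̂ = 0.3:

```python
# Probability that a learner with generalization error eps misclassifies
# exactly k of m test samples, per the binomial formula above.
from math import comb

def p_test_error(k, m, eps):
    return comb(m, k) * eps**k * (1 - eps)**(m - k)

for k in range(11):
    print(k, round(p_test_error(k, 10, 0.3), 4))  # maximum at k = 3
```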


A "binomial test" (binomial test) can be used to test a hypothesis such as ε ≤ 0.3 (i.e., whether the generalization error rate is at most 0.3). More generally, consider the hypothesis ε ≤ ε0. The maximum error rate that can be observed with probability 1 − α (which reflects the confidence of the conclusion) is computed as:

ε̄ = max ε   subject to   Σ_{i = ε0 × m + 1}^{m} C(m, i) × ε^i × (1 − ε)^(m − i) < α

(My own understanding: this is the maximum value of the generalization error rate ε under the constraint that the probability of observing that many errors is less than α?)

If the test error rate ε̂ is below the critical value ε̄, then by the binomial test we can conclude: at significance level α, the hypothesis ε ≤ ε0 cannot be rejected, i.e., we can believe with confidence 1 − α that the learner's generalization error rate is not greater than ε0.
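A hedged sketch of the test as I read it: search for the smallest error count whose tail probability under ε0 falls below α, giving the critical rate ε̄ (this is my interpretation of the formula, not code from the book):

```python
from math import comb

def binom_tail(m, k, eps):
    # P(X >= k) for X ~ Binomial(m, eps)
    return sum(comb(m, i) * eps**i * (1 - eps)**(m - i)
               for i in range(k, m + 1))

def critical_error_rate(m, eps0, alpha=0.05):
    # smallest error count whose tail probability under eps0 drops below alpha
    for k in range(m + 1):
        if binom_tail(m, k, eps0) < alpha:
            return k / m
    return 1.0

eps_bar = critical_error_rate(m=20, eps0=0.3, alpha=0.05)
print(eps_bar)  # if the observed eps_hat < eps_bar, "eps <= 0.3"
                # cannot be rejected at significance level alpha
```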

 
