Questions:
- What is AUC?
- What can AUC be used for?
- How to compute AUC (for a deeper understanding of AUC)
What is AUC
Confusion matrix
The confusion matrix is the basis for understanding most evaluation metrics, and certainly the basis for understanding AUC. Plenty of material already introduces the concept; here, a classic diagram illustrates what the confusion matrix is.
Clearly, the confusion matrix contains four pieces of information:
1. True negatives (TN): the number of actual negative samples predicted as negative
2. False positives (FP): the number of actual negative samples predicted as positive
3. False negatives (FN): the number of actual positive samples predicted as negative
4. True positives (TP): the number of actual positive samples predicted as positive
With the confusion matrix in front of you, these relationships are easy to understand, but over time the concepts are easy to forget. A useful mnemonic is to split each name into two parts by position. The first part, True/False, describes whether the prediction is correct; the second part, positive/negative, is the predicted label. Each cell of the confusion matrix can thus be read as a pair of correctness and predicted result. With that, the four parts become (each denotes a number of samples, omitted below):
1. TN: predicted negative, and the prediction is correct
2. FP: predicted positive, and the prediction is wrong
3. FN: predicted negative, and the prediction is wrong
4. TP: predicted positive, and the prediction is correct
Almost every evaluation metric I know of is built on the confusion matrix, including accuracy, precision, recall, F1-score, and of course AUC.
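As a concrete illustration, the four cells can be tallied with a few lines of Python. This is a minimal sketch with made-up toy labels; `confusion_matrix` is an illustrative helper, not a library call:

```python
# Tally the four confusion-matrix cells from true labels and
# predicted labels (1 = positive, 0 = negative).
def confusion_matrix(y_true, y_pred):
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # actual positive, predicted positive
        elif t == 1 and p == 0:
            fn += 1  # actual positive, predicted negative
        elif t == 0 and p == 1:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return tn, fp, fn, tp

y_true = [1, 0, 1, 1, 0, 0]  # toy data
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))  # (2, 1, 1, 2)
```

From these four counts, all the metrics mentioned above (accuracy, precision, recall, and so on) follow directly.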
ROC curve
In fact, it is not easy to grasp what AUC is all at once; we first have to start from the ROC curve. For a binary classifier, the output label (0 or 1) usually depends on the output probability and a predetermined threshold. A common threshold is 0.5: samples scoring above 0.5 are labeled positive, and those below 0.5 negative. If you raise the threshold, fewer negative samples will be mistakenly predicted as positive, but fewer positive samples will be correctly caught; if you lower it, more positives will be caught, but more negatives will be mistakenly flagged. The choice of threshold therefore reflects, to some extent, the classifier's ability. Ideally, a strong classifier should classify as correctly as possible no matter which threshold is chosen — to some degree, this can be understood as a kind of robustness.
To measure this classification ability visually, the ROC curve was born! The figure below shows an example ROC curve (the data behind it is introduced in the third part). What matters now are the two axes:
- Horizontal axis: False Positive Rate (FPR)
- Vertical axis: True Positive Rate (TPR)
- The false positive rate, FPR = FP / (FP + TN), is the fraction of actual negative samples that are incorrectly predicted as positive. Obviously, we do not want this to be high.
- The true positive rate, TPR = TP / (TP + FN), is the fraction of actual positive samples that are correctly predicted as positive. Naturally, the higher the better.
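To make the two rates concrete, here is a minimal sketch that computes TPR and FPR at a single threshold; the scores and labels are made-up toy data, and `tpr_fpr` is a hypothetical helper name:

```python
# Compute TPR and FPR for one classification threshold:
# a sample is predicted positive when its score >= threshold.
def tpr_fpr(scores, labels, threshold):
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if y == 1:
            tp += pred       # positive caught
            fn += 1 - pred   # positive missed
        else:
            fp += pred       # negative mis-flagged
            tn += 1 - pred   # negative correctly rejected
    tpr = tp / (tp + fn)  # TP / (TP + FN)
    fpr = fp / (fp + tn)  # FP / (FP + TN)
    return tpr, fpr

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # toy data
labels = [1, 1, 0, 1, 0, 0]
tpr, fpr = tpr_fpr(scores, labels, 0.5)
print(tpr, fpr)  # 2/3 and 1/3
```

Sweeping the threshold from high to low and collecting the resulting (FPR, TPR) pairs traces out exactly the ROC curve discussed above.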
Clearly, both coordinates of the ROC curve lie in [0, 1], so the area under the curve is at most 1. Let us analyze a few special cases to better grasp the curve's properties:
- (0, 0): FPR and TPR are both 0, i.e., the classifier predicts every sample as negative
- (0, 1): FPR is 0 and TPR is 1 — every prediction is perfectly correct; the ideal case
- (1, 0): FPR is 1 and TPR is 0 — every prediction is wrong; the worst case
- (1, 1): FPR and TPR are both 1, i.e., the classifier predicts every sample as positive
- TPR = FPR: the diagonal, where the samples predicted positive are half right and half wrong — the behavior of a random classifier
Thus we reach a basic conclusion: if the ROC curve lies below the diagonal, the classifier performs worse than a random classifier; above it, better. Naturally, we want the ROC curve to lie above the diagonal as much as possible, bowing toward the upper-left corner (0, 1).
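The diagonal claim is easy to sanity-check empirically. The sketch below (fixed seed, toy setup) simulates a classifier that guesses by flipping a fair coin and shows that its TPR and FPR come out nearly equal, i.e., a point on the diagonal:

```python
import random

# A random-guessing classifier should land near TPR = FPR (the diagonal).
random.seed(0)
labels = [random.randint(0, 1) for _ in range(2000)]   # toy ground truth
preds = [random.randint(0, 1) for _ in labels]         # coin-flip predictions

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

tpr, fpr = tp / (tp + fn), fp / (fp + tn)
print(round(tpr, 2), round(fpr, 2))  # both close to 0.5
```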
AUC (Area Under the ROC Curve)
The ROC curve reflects a classifier's performance to some extent, but not very directly. We would like a single number that is larger when the classifier is better and smaller when it is worse — this is the AUC. The AUC is simply the area under the ROC curve, and it directly summarizes the classification ability expressed by the ROC curve.
- AUC = 1: a perfect classifier
- 0.5 < AUC < 1: better than a random classifier
- 0 < AUC < 0.5: worse than a random classifier
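There is also a standard equivalent definition worth knowing (not derived above, but well established): AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal pure-Python sketch on made-up toy data, with `auc_by_rank` as an illustrative name:

```python
# AUC as a rank statistic: the fraction of (positive, negative) pairs
# in which the positive sample is scored higher (ties count 0.5).
def auc_by_rank(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # toy data
labels = [1, 1, 0, 1, 0, 0]
print(auc_by_rank(scores, labels))  # 8/9 ≈ 0.889
```

This pairwise view is exactly why an AUC of 0.5 corresponds to random guessing: for a random scorer, a positive beats a negative only half the time.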
What can AUC be used for
In the author's limited experience, the biggest application of AUC is the offline evaluation of click-through rate (CTR) prediction. Offline CTR evaluation plays an important role in a company's engineering workflow: A/B tests and full-traffic rollouts are relatively expensive in time, manpower, and resources, so a suitable offline metric saves a great deal of all three. Why, then, can AUC be used to evaluate CTR? Two things need to be clear first:
1. CTR prediction treats the probability output by the classifier as the estimated click-through rate. For example, the LR (logistic regression) model commonly used in industry connects the input features to an output probability through the sigmoid function, and this output probability is the CTR estimate. Which content gets recalled is often decided by ranking on predicted CTR.
2. AUC quantifies the classification ability expressed by the ROC curve. This ability is closely tied to the probabilities and thresholds: the better the classification ability (the larger the AUC), the more reasonable the output probabilities, and the more reasonable the resulting ranking.
We want the classifier not only to say whether a click will happen, but also to output an accurate probability to rank by. So the AUC here directly reflects the accuracy of the CTR estimates — that is, the model's ranking ability.
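The "ranking ability" reading can be made precise: AUC can be computed purely from the ranks of the scores, so any recalibration that preserves the ordering leaves it unchanged. The sketch below uses the Wilcoxon rank formula on made-up toy data (`auc_from_ranks` is an illustrative name; the formula assumes no tied scores):

```python
import math

# Wilcoxon rank formula: AUC = (R_pos - n_pos*(n_pos+1)/2) / (n_pos*n_neg),
# where R_pos is the sum of the ascending ranks of the positive samples.
def auc_from_ranks(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # toy data
labels = [1, 1, 0, 1, 0, 0]
# A monotone transform changes the probabilities but not their order...
recalibrated = [math.sqrt(s) for s in scores]
# ...so the AUC is exactly the same.
print(auc_from_ranks(scores, labels) == auc_from_ranks(recalibrated, labels))  # True
```

This is also a caveat: a high AUC certifies good ordering, not well-calibrated probability values.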
How to compute AUC
The steps are as follows:
1. Collect the prediction results; each record has the form (predicted probability, true label)
2. Group the records by predicted probability, obtaining (predicted probability, number of true positives at that probability, number of true negatives at that probability); this makes the per-threshold statistics in the following steps easier
3. Sort the records by predicted probability, from high to low
4. Going from high to low, take each predicted probability in turn as the classification threshold, and compute the TPR and FPR at that threshold
5. Compute the area under the ROC curve by summing the small rectangles between adjacent points, and plot the curve
The code is as follows:
```python
import ast

import pylab as pl


def read_file(file_path):
    """Read (score, label) lines; return ([score, nonclk, clk], pos, neg)."""
    db = []  # each item: [score, nonclk, clk]
    pos, neg = 0, 0  # number of positive / negative samples
    with open(file_path, 'r') as fs:
        for line in fs:
            temp = ast.literal_eval(line)  # safer than eval()
            # precision can be controlled here, e.g. score = '%.1f' % float(temp[0])
            score = float(temp[0])
            true_label = int(temp[1])
            sample = [score, 0, 1] if true_label == 1 else [score, 1, 0]
            _, nonclk, clk = sample
            pos += clk      # positive sample
            neg += nonclk   # negative sample
            db.append(sample)
    return db, pos, neg


def get_roc(db, pos, neg):
    """Sweep thresholds from high score to low; collect (FPR, TPR) points."""
    db = sorted(db, key=lambda x: x[0], reverse=True)  # sort by score, descending
    xy_arr = []
    tp, fp = 0.0, 0.0
    for _score, nonclk, clk in db:
        tp += clk
        fp += nonclk
        xy_arr.append([fp / neg, tp / pos])
    return xy_arr


def get_AUC(xy_arr):
    """Accumulate the area under the curve, rectangle by rectangle."""
    auc = 0.0
    prev_x = 0.0
    for x, y in xy_arr:
        if x != prev_x:
            auc += (x - prev_x) * y
            prev_x = x
    return auc


def draw_ROC(xy_arr, auc):
    x = [v[0] for v in xy_arr]
    y = [v[1] for v in xy_arr]
    pl.title("ROC curve of %s (AUC = %.4f)" % ('clk', auc))
    pl.xlabel("False Positive Rate")
    pl.ylabel("True Positive Rate")
    pl.plot(x, y)   # use pylab to plot x and y
    pl.show()       # show the plot on the screen


if __name__ == '__main__':
    db, pos, neg = read_file('data.txt')  # path to the (probability, label) data
    xy_arr = get_roc(db, pos, neg)
    auc = get_AUC(xy_arr)
    draw_ROC(xy_arr, auc)
```
Data: each sample is provided as a (predicted probability, true label) tuple.
Data link: https://pan.baidu.com/s/1c1FUzVM , password 1ax8
Calculation result: AUC = 0.747925810016, essentially consistent with the value computed by roc_AUC in Spark MLlib.
Summary
- The ROC curve reflects the classification ability of a classifier, taking the accuracy of its output probabilities into account
- AUC quantifies the classification ability expressed by the ROC curve: the larger the AUC, the better the classification performance and the more reasonable the output probabilities
- AUC is often used for offline evaluation of CTR models; the larger the AUC, the stronger the model's ranking ability
References
Many experts have shared their own knowledge and understanding of AUC. On the question of what AUC means, here are some answers that can help in understanding it:
[1] How do you understand AUC in machine learning and statistics?
[2] How do you understand AUC in machine learning and statistics? (another answer to the same question)
[3] What are the advantages and disadvantages of precision, recall, F1 score, ROC, and AUC?
[4] How high does an AUC have to be to count as high?
Some other reference materials:
Using Python to draw an ROC curve and compute the AUC value
Precision and recall, ROC curves and PR curves
Introduction to ROC and AUC, and how to compute AUC
Evaluation metrics based on the confusion matrix
Machine learning performance evaluation metrics