Model evaluation metrics—the ROC curve

After a classification model is built, we want to evaluate it. Common metrics include the confusion matrix, F1 score, KS curve, ROC curve, and AUC. You can also define your own metric, for example by splitting the model scores into n (e.g. 100) bins and computing the precision and coverage of the top bin. The confusion matrix, KS curve, and F1 score were explained in earlier articles. This article explains the principle of the ROC curve with a Python implementation example; other metrics will be explained in detail in subsequent articles.

  

1. The ROC curve in detail

  

1 What is the ROC curve

  
The ROC curve, also called the receiver operating characteristic curve (Receiver Operating Characteristic Curve), is a metric for classification problems. It plots the false positive rate FPR (False Positive Rate) on the horizontal axis against the true positive rate TPR (True Positive Rate) on the vertical axis, tracing out a curve as the classification threshold varies. The closer the ROC curve is to the upper left corner, the better the model's predictions; the reason will become clear below.
  

2 A small example to understand the ROC curve

  
Assume that 1 represents an account involved in gambling or fraud, and 0 represents a normal account that is not.
  
[Figure: confusion matrix]

                 Predicted 1    Predicted 0
    Actual 1         TP             FN
    Actual 0         FP             TN
  
T (True): the prediction is correct; F (False): the prediction is wrong; P (Positive): predicted as 1; N (Negative): predicted as 0.
  
TP (True Positive): the number of samples the model correctly predicts as 1, i.e. the true value is 1 and the model predicts 1.
  
FN (False Negative): the number of samples the model wrongly predicts as 0, i.e. the true value is 1 and the model predicts 0.
  
FP (False Positive): the number of samples the model wrongly predicts as 1, i.e. the true value is 0 and the model predicts 1.
  
TN (True Negative): the number of samples the model correctly predicts as 0, i.e. the true value is 0 and the model predicts 0.
  

True positive rate TPR: among the samples that are actually 1, the proportion the model predicts as 1. In this example, it is the proportion of actual gambling/fraud accounts that the model flags. The formula is:
  
        TPR=TP/(TP+FN)
  
False positive rate FPR: among the samples that are actually 0, the proportion the model predicts as 1. In this example, it is the proportion of normal accounts that the model wrongly flags as involved in gambling or fraud. The formula is:
  
        FPR=FP/(FP+TN)
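As a quick check on the two formulas above, they can be computed directly from the four confusion-matrix counts. This is a minimal sketch; the function name and example counts are illustrative, not from the article:

```python
def tpr_fpr(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of actual positives the model catches
    fpr = fp / (fp + tn)  # share of actual negatives the model falsely flags
    return tpr, fpr

# e.g. 4 of 5 positives caught, 1 of 5 negatives falsely flagged
print(tpr_fpr(tp=4, fn=1, fp=1, tn=4))  # → (0.8, 0.2)
```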
  
Suppose there are 10 samples, of which 5 customer accounts are involved in gambling or fraud and 5 are normal. A logistic regression model gives the following predictions:
  
[Table: predicted scores and true labels, as used in the code in section 2]

    Score:  0.9  0.8  0.3  0.7  0.5  0.6  0.4  0.3  0.1  0.2
    Label:    1    1    1    1    1    0    0    0    0    0

Then, sweeping over different thresholds (predicting 1 when the score is at or above the threshold), the corresponding FPR and TPR values are as follows:
  
    Threshold:  0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1
    TPR:        0.2  0.4  0.6  0.6  0.8  0.8  1.0  1.0  1.0
    FPR:        0.0  0.0  0.0  0.2  0.2  0.4  0.6  0.8  1.0
  
Drawing these points on a graph gives the corresponding ROC curve.
  
[Figure: ROC curve drawn through the points above]
  
When the number of customers is large, a finer division of thresholds yields a smoother ROC curve. Furthermore, the area enclosed between the ROC curve and the x-axis is the AUC value.
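The threshold sweep described above can be sketched by hand before turning to sklearn. This is an illustrative implementation using the ten scores from this example; the function name and the trapezoidal-area step are my own, not from the article:

```python
y_true  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.5, 0.6, 0.4, 0.3, 0.1, 0.2]

def roc_points(y_true, y_score):
    """Sweep every distinct score as a threshold (descending) and
    return the list of (FPR, TPR) points, starting at (0, 0)."""
    pts = [(0.0, 0.0)]
    P = sum(y_true)            # number of actual positives
    N = len(y_true) - P        # number of actual negatives
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        pts.append((fp / N, tp / P))
    return pts

pts = roc_points(y_true, y_score)
# AUC as the trapezoidal area under the piecewise-linear curve
auc_val = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(round(auc_val, 2))  # → 0.86
```

The points match the table above, and the resulting AUC of 0.86 is what sklearn reports for the same data in section 2.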

  

2. How to draw the ROC curve with Python

  
Drawing the ROC curve in Python relies mainly on the roc_curve and auc functions in the sklearn library: roc_curve computes FPR and TPR, and auc computes the area under the curve.
  

1 Detailed explanation of roc_curve function

  
First, look at the signature of the roc_curve function:

roc_curve(y_true, y_score, *, pos_label=None, sample_weight=None, drop_intermediate=True)

Detailed explanation of input parameters:
  
y_true: the true labels of the samples, a one-dimensional vector whose length equals the number of samples, generally binary. If the labels are not {-1, 1} or {0, 1}, pos_label should be specified explicitly.
  
y_score: the model's predicted scores, which can be probability estimates of the positive class, confidence values, or non-thresholded decision measures (as returned by decision_function on some classifiers, such as SVM); also a one-dimensional vector whose length equals the number of samples. Simply put, it is a score for each test sample measuring how strongly the model believes it is positive.
  
pos_label: if y_true does not use {0, 1} or {-1, 1} labels, this parameter specifies which label counts as positive; all other samples are treated as negative. Not set by default.
  
sample_weight: a one-dimensional vector whose length equals the number of samples, specifying the weight of each sample. Not set by default.
  
drop_intermediate: when True (the default), some suboptimal thresholds that would not appear on the ROC curve are dropped.
  
Detailed explanation of the return values:
  
fpr: the false positive rate sequence, a one-dimensional vector the same length as the thresholds.
  
tpr: the true positive rate sequence, a one-dimensional vector the same length as the thresholds.
  
thresholds: a descending sequence of thresholds. At each threshold, y_score is split: scores at or above the threshold are predicted positive and lower scores negative, giving the fpr and tpr corresponding to that threshold.
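To see the three return values concretely, here is a small sketch on made-up data (the labels and scores are illustrative, not from the article); the exact threshold values can differ across sklearn versions, so only the guaranteed properties are noted in comments:

```python
from sklearn.metrics import roc_curve

y_true  = [0, 0, 1, 1]          # true labels
y_score = [0.1, 0.4, 0.35, 0.8] # model scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The three arrays always have the same length; fpr and tpr are
# non-decreasing and end at 1.0, while thresholds is decreasing.
print(fpr)
print(tpr)
print(thresholds)
```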

  

2 Concrete examples of drawing ROC curves

  
For ease of understanding, we use the example from section 1 as the input and draw its ROC curve. The code is as follows:

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['axes.unicode_minus'] = False
# fix the rendering of minus signs when a non-default font is set
import seaborn as sns
sns.set(font="Kaiti", style="ticks", font_scale=1.4)
from sklearn.metrics import roc_curve, auc

y_pred = [0.9, 0.8, 0.3, 0.7, 0.5, 0.6, 0.4, 0.3, 0.1, 0.2]
# predicted scores
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# true labels
fpr_Nb, tpr_Nb, _ = roc_curve(y_true, y_pred)
aucval = auc(fpr_Nb, tpr_Nb)    # compute the AUC value
plt.figure(figsize=(10, 8))
plt.plot([0, 1], [0, 1], 'k--')              # diagonal reference line
plt.plot(fpr_Nb, tpr_Nb, "r", linewidth=3)   # the ROC curve
plt.grid()
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.title("ROC curve")
plt.text(0.15, 0.9, "AUC = " + str(round(aucval, 4)))
plt.show()

The results are as follows:
  
[Figure: ROC curve drawn by the code above]
  
It can be seen that the plotted result matches our manual drawing. This concludes the explanation of the ROC curve's principle and its Python implementation; interested readers can try it themselves.

  




Origin blog.csdn.net/qq_32532663/article/details/129772801