A brief description of sklearn.metrics.roc_curve usage

1. Background

I'm a machine learning beginner who has just started learning sklearn. Like many newcomers, I was confused by the confusion matrix. Recently a friend asked me how the result returned by roc_curve is generated, and I was stumped. The docstring I pulled up with Shift + Tab didn't help (my English is weak; friends, do study English well!). Searching online didn't turn up the answer I wanted either. Finally, inspired by this article http://www.bubuko.com/infodetail-2718749.html , I worked out the rules myself, and I'm sharing my notes here in the hope that they help.

2. TP, TN, FP, FN concepts

TP (true positive): a positive sample correctly predicted as positive.
TN (true negative): a negative sample correctly predicted as negative.
FP (false positive): a negative sample incorrectly predicted as positive.
FN (false negative): a positive sample incorrectly predicted as negative.

3. TPR, TNR, FPR, FNR concepts
1. TPR = tp / (tp + fn)
TPR: the true positive rate, also known as sensitivity, recall, or power; the number of positive samples predicted as positive ÷ the total number of actually positive samples.
Note: precision equals tp / (tp + fp), and accuracy equals (tp + tn) / (tp + fp + fn + tn).
2. FNR = fn / (tp + fn) = 1 - TPR
FNR: the false negative rate; the number of positive samples predicted as negative ÷ the total number of actually positive samples.
In hypothesis-testing terms, this is the probability of a Type II error (β).
3. FPR = fp / (fp + tn)
FPR: the false positive rate; the number of negative samples predicted as positive ÷ the total number of actually negative samples.
In hypothesis-testing terms, this is the probability of a Type I error (α).
4. TNR = tn / (fp + tn) = 1 - FPR
TNR: the true negative rate or specificity; the number of negative samples predicted as negative ÷ the total number of actually negative samples.
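To make these formulas concrete, here is a minimal sketch (the labels and predictions are made up for illustration) that derives all of the above rates, plus precision and accuracy, from sklearn's confusion_matrix:

# Python code
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])  # illustrative true labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])  # illustrative predictions

# For binary {0, 1} labels, confusion_matrix returns [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)        # true positive rate (sensitivity, recall)
fnr = fn / (tp + fn)        # false negative rate = 1 - TPR
fpr = fp / (fp + tn)        # false positive rate
tnr = tn / (fp + tn)        # true negative rate (specificity) = 1 - FPR
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)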

4. A simple analysis of how roc_curve works

4.1 A brief introduction to roc_curve

4.1.1 Important parameters

y_true: the true labels; an array.
y_score: the predicted results, either labels or probability values; an array with the same shape as y_true.
pos_label: defaults to None; the default works only when the labels form a recognized binary pair such as {0, 1} or {-1, 1}. Otherwise, the value that marks the positive class must be set explicitly.
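For instance, with string labels sklearn cannot guess which class is positive, so pos_label must be given explicitly (a minimal sketch with made-up labels):

# Python code
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array(['spam', 'ham', 'ham', 'spam'])
y_score = np.array([0.9, 0.4, 0.2, 0.6])  # scores for the 'spam' class

# Without pos_label this call raises an error, because 'spam'/'ham'
# is not a recognized binary pair like {0, 1} or {-1, 1}.
fpr, tpr, threshold = roc_curve(y_true, y_score, pos_label='spam')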

4.1.2 Results returned

roc_curve returns three arrays: fpr (the false positive rate at each threshold), tpr (the true positive rate, i.e. recall), and threshold (the thresholds themselves).

4.2 Case 1: y_score contains label data

4.2.1 Examples

Code:

# Python code
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
fpr, tpr, threshold = roc_curve(y_true, y_score)

Returned results:

threshold:array([2, 1, 0])
tpr:array([0.        , 0.66666667, 1.        ])
fpr:array([0., 0., 1.])

4.2.2 Explanation

1. The threshold array is built by deduplicating the elements of y_score, sorting them in descending order, and prepending a value equal to 'maximum value + 1'. Each element is then used in turn as a threshold; the result is a one-dimensional array. For example, y_score = np.array([0, 1, 2, 0, 3, 1]) gives threshold = np.array([4, 3, 2, 1, 0]).
2. When index = 0, the threshold equals threshold[0] = 2. All samples whose y_score is greater than or equal to 2 are assumed positive and the rest negative; comparing these predictions with y_true forms a confusion matrix. Since no score is greater than or equal to 2, TP and FP are both 0, so tpr[0] = 0/3 = 0.0 and fpr[0] = 0/7 = 0.0.
3. When index = 1, the threshold equals threshold[1] = 1. All samples whose y_score is greater than or equal to 1 are assumed positive and the rest negative. Two scores are greater than or equal to 1, and both belong to positive samples, so TP = 2 and FP = 0, giving tpr[1] = 2/3 = 0.66666667 and fpr[1] = 0/7 = 0.0.
4. When index = 2, the threshold equals threshold[2] = 0. All samples whose y_score is greater than or equal to 0 are assumed positive. Since all 10 scores are greater than or equal to 0, TP = 3 and FP = 7, giving tpr[2] = 3/3 = 1.0 and fpr[2] = 7/7 = 1.0.
So the final result is tpr = array([0., 0.66666667, 1.]) and fpr = array([0., 0., 1.]); the sketch below reproduces this sweep by hand.
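The following minimal sketch mirrors the rule described above (it is not sklearn's actual implementation, which vectorizes the sweep with cumulative sums; newer sklearn versions also prepend np.inf rather than 'maximum value + 1' as the first threshold):

# Python code
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0])

# Thresholds: unique scores in descending order, with 'max + 1' prepended.
uniq_desc = np.unique(y_score)[::-1]
thresholds = np.r_[uniq_desc[0] + 1, uniq_desc]

P = (y_true == 1).sum()  # actual positives: 3
N = (y_true == 0).sum()  # actual negatives: 7
for t in thresholds:
    tp = ((y_score >= t) & (y_true == 1)).sum()  # predicted positive, actually positive
    fp = ((y_score >= t) & (y_true == 0)).sum()  # predicted positive, actually negative
    print(f'threshold={t}: tpr={tp / P}, fpr={fp / N}')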

4.3 Case 2: y_score contains probability values

4.3.1 Examples

Code:

# Python code
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, threshold = roc_curve(y_true, y_score)

Returned results:

threshold:array([1.8 , 0.8 , 0.4 , 0.35, 0.1])
tpr:array([0. , 0.5, 0.5, 1. , 1.])
fpr:array([0. , 0. , 0.5, 0.5, 1. ])

4.3.2 Explanation

1. When index = 0, the threshold equals threshold[0] = 1.8. All samples whose y_score is greater than or equal to 1.8 are assumed positive and the rest negative; comparing these predictions with y_true forms a confusion matrix. Since no score is greater than or equal to 1.8, TP and FP are both 0, so tpr[0] = 0/2 = 0.0 and fpr[0] = 0/2 = 0.0.
2. When index = 1, the threshold equals threshold[1] = 0.8. Exactly one score is greater than or equal to 0.8, and the corresponding element of y_true is 1, so TP = 1 and FP = 0, giving tpr[1] = 1/2 = 0.5 and fpr[1] = 0/2 = 0.0.
3. When index = 2, the threshold equals threshold[2] = 0.4. Two scores are greater than or equal to 0.4; one belongs to a positive sample and one to a negative sample, so TP = 1 and FP = 1, giving tpr[2] = 1/2 = 0.5 and fpr[2] = 1/2 = 0.5.
4. When index = 3, the threshold equals threshold[3] = 0.35. Three scores are greater than or equal to 0.35, so TP = 2 and FP = 1, giving tpr[3] = 2/2 = 1.0 and fpr[3] = 1/2 = 0.5.
5. When index = 4, the threshold equals threshold[4] = 0.1. All four scores are greater than or equal to 0.1, so TP = 2 and FP = 2, giving tpr[4] = 2/2 = 1.0 and fpr[4] = 2/2 = 1.0.
So the final result is tpr = array([0., 0.5, 0.5, 1., 1.]) and fpr = array([0., 0., 0.5, 0.5, 1.]).
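With fpr and tpr in hand, drawing the ROC curve itself is straightforward. A minimal sketch using matplotlib and sklearn's auc on the example above (the plot styling is just one possible choice):

# Python code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, threshold = roc_curve(y_true, y_score)

roc_auc = auc(fpr, tpr)  # area under the ROC curve
plt.plot(fpr, tpr, marker='o', label='ROC curve (AUC = %.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='random guess')
plt.xlabel('fpr (false positive rate)')
plt.ylabel('tpr (true positive rate)')
plt.legend()
plt.show()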

That's the end of my sharing; please correct me promptly if I got anything wrong!

I also recommend a blog with a good explanation of the ROC curve: https://blog.csdn.net/yuxiaosmd/article/details/83046162

Original article: blog.csdn.net/sun91019718/article/details/101314545