Classification algorithms: the precision-recall curve

Precision and recall are two different evaluation metrics, and they often disagree. Which one to prioritize depends on the specific application scenario.

In some scenarios precision matters more. In a stock-forecasting system, for example, where a rising stock is labeled 1 and a falling stock 0, we care most about what fraction of the stocks we predict to rise actually do rise. In other scenarios recall matters more. In a cancer-screening system, where illness is labeled 1 and health 0, we care most about not missing any actual cancer patients.

F1 Score

The F1 Score takes both precision and recall into account: it is the harmonic mean of the two.

\[\frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{\text{precision}} + \frac{1}{\text{recall}}\right)\]
\[F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\]
Define the F1 Score:

def f1_score(precision, recall):
    try:
        return 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        # both precision and recall are 0
        return 0.0

As a harmonic mean, the F1 Score is dominated by the smaller of the two metrics.
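A quick numeric check, using arbitrary example values, shows how the harmonic mean is pulled toward the smaller metric:

```python
def f1_score(precision, recall):
    try:
        return 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        # both precision and recall are 0
        return 0.0

print(f1_score(0.5, 0.5))  # balanced metrics: F1 equals both, 0.5
print(f1_score(0.9, 0.1))  # arithmetic mean is 0.5, but F1 is only 0.18
print(f1_score(0.9, 0.0))  # one metric is 0, so F1 collapses to 0
```

So a classifier cannot inflate its F1 Score by maximizing one metric at the expense of the other.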

The precision-recall balance

Precision and recall are two mutually conflicting goals: raising one typically lowers the other. How do we strike a balance between the two?

Recall the principle of the logistic regression algorithm: when the predicted probability of an event is greater than 0.5, the sample is classified as 1; when it is less than 0.5, it is classified as 0. The decision boundary is \(\theta^T \cdot x_b = 0\).

This straight line (or curve) determines the classification outcome. Translating the decision boundary means setting \(\theta^T \cdot x_b\) equal not to 0 but to some threshold: \(\theta^T \cdot x_b = threshold\).


In the figure, circles represent samples classified as 0 and five-pointed stars represent samples classified as 1. It shows that precision and recall are conflicting indicators: as the threshold increases, recall decreases while precision increases.
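The threshold idea can be sketched on synthetic data. This is a minimal illustration, assuming `make_classification` as a stand-in for the handwritten-digit data used below; it checks that sklearn's default prediction is exactly "decision score >= 0", and that raising the threshold predicts class 1 less often:

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data (not the handwritten-digit set)
X, y = make_classification(n_samples=200, random_state=666)
log_reg = LogisticRegression().fit(X, y)

scores = log_reg.decision_function(X)
default_pred = numpy.array(scores >= 0, dtype='int')  # theta^T . x_b >= 0
strict_pred = numpy.array(scores >= 2, dtype='int')   # raised threshold

# predict() agrees with thresholding at 0; the stricter threshold
# flags fewer samples as positive (precision up, recall down)
print((log_reg.predict(X) == default_pred).all())
print(strict_pred.sum(), default_pred.sum())
```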


Computing predictions and confusion matrices under different thresholds

from sklearn.linear_model import LogisticRegression

# use the handwritten-digit dataset prepared in the previous section
log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)

Compute the decision score of each sample in the test set under the logistic regression model:

decision_score = log_reg.decision_function(x_test)

Predict the results under different thresholds:

import numpy

y_predict_1 = numpy.array(decision_score >= -5, dtype='int')
y_predict_2 = numpy.array(decision_score >= 0, dtype='int')
y_predict_3 = numpy.array(decision_score >= 5, dtype='int')

View confusion matrix under different thresholds:
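The code for this step is not shown above; a self-contained sketch might look like the following. It uses `make_classification` and `train_test_split` as stand-ins, since the actual `x_test`/`y_test` come from the previous section's handwritten-digit data:

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# skewed synthetic data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, weights=[0.9], random_state=666)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression().fit(x_train, y_train)
decision_score = log_reg.decision_function(x_test)

for threshold in (-5, 0, 5):
    y_predict = numpy.array(decision_score >= threshold, dtype='int')
    # rows: true class 0/1; columns: predicted class 0/1
    print('threshold =', threshold)
    print(confusion_matrix(y_test, y_predict))
```

As the threshold rises, counts move from the right column (predicted 1) to the left column (predicted 0): false positives shrink but false negatives grow.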

The precision-recall curve

With a step of 0.1, compute the precision and recall for thresholds across the [min, max] interval of the decision scores:

from sklearn.metrics import precision_score, recall_score

threshold_scores = numpy.arange(numpy.min(decision_score), numpy.max(decision_score), 0.1)

precision_scores = []
recall_scores = []

# compute the predictions, precision, and recall at each classification threshold
for score in threshold_scores:
    y_predict = numpy.array(decision_score >= score, dtype='int')
    precision_scores.append(precision_score(y_test, y_predict))
    recall_scores.append(recall_score(y_test, y_predict))

Plot precision and recall as functions of the threshold:

import matplotlib.pyplot as plt

plt.plot(threshold_scores, precision_scores)
plt.plot(threshold_scores, recall_scores)
plt.show()

Plot the precision-recall curve:

plt.plot(precision_scores,recall_scores)
plt.show()

The precision-recall curve in sklearn

from sklearn.metrics import precision_recall_curve

precisions,recalls,thresholds = precision_recall_curve(y_test,decision_score)

# in sklearn the last precision is 1 and the last recall is 0, with no corresponding threshold
plt.plot(thresholds,precisions[:-1])
plt.plot(thresholds,recalls[:-1])
plt.show()
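One practical use of `precision_recall_curve` is picking a threshold that meets a precision target. The sketch below is self-contained on synthetic data (the 0.9 precision target is an arbitrary choice for illustration), and also confirms the comment above: sklearn appends a final point with precision 1 and recall 0 that has no matching threshold.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# skewed synthetic stand-in data
X, y = make_classification(n_samples=500, weights=[0.9], random_state=666)
log_reg = LogisticRegression().fit(X, y)
scores = log_reg.decision_function(X)

precisions, recalls, thresholds = precision_recall_curve(y, scores)

# precisions/recalls have one more element than thresholds; drop the final
# (precision=1, recall=0) point when searching for a usable threshold
candidates = numpy.where(precisions[:-1] >= 0.9)[0]
if candidates.size:
    idx = candidates[0]  # lowest threshold reaching the target precision
    print('threshold:', thresholds[idx], 'recall there:', recalls[idx])
```

Taking the lowest qualifying threshold keeps recall as high as possible while still meeting the precision target.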


Origin www.cnblogs.com/shuai-long/p/11609409.html