Evaluating classification models in machine learning

For regression problems, MSE, MAE, RMSE, and R^2 are the four usual ways to judge how well a model performs. For classification problems, the simplest way is to evaluate the model with accuracy. For example, the default score for classifiers in sklearn is accuracy.

Accuracy is easy to understand, but it becomes a serious problem on extremely skewed data. Take cancer prediction: the ratio of healthy to sick people could be 10,000:1. For data this skewed, a trivial model that predicts every sample as healthy reaches an accuracy of 99.99% without identifying a single patient.
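The 10,000:1 case can be checked with a few lines of plain Python. This is a minimal sketch: the label lists are made up for illustration, with 0 meaning healthy and 1 meaning sick.

```python
# Hypothetical population: 10,000 healthy people (0) and 1 sick person (1).
y_true = [0] * 10_000 + [1]

# The trivial "model": predict healthy for everyone.
y_pred = [0] * len(y_true)

# Accuracy is the fraction of samples whose prediction matches the label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Number of sick people the model actually found.
detected_sick = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.4%}")
print(f"sick people detected = {detected_sick}")
```

The accuracy is very close to 1 even though the only patient in the data is missed.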

For this type of data, a classification model can be assessed with the help of the confusion matrix.

Confusion matrix

To make the confusion matrix and the terms precision and recall easier to explain, we first analyze a two-class problem as an example.

Actual / Predicted    0     1
0                     TN    FP
1                     FN    TP
  1. In the table above, rows represent the actual values and columns represent the predicted values.
  2. 0 stands for negative, 1 stands for positive.
  3. TN (True Negative): the actual value is negative and the predicted value is negative; the negative prediction is correct.
  4. FP (False Positive): the actual value is negative but the predicted value is positive; the positive prediction is wrong.
  5. FN (False Negative): the actual value is positive but the predicted value is negative; the negative prediction is wrong.
  6. TP (True Positive): the actual value is positive and the predicted value is positive; the positive prediction is correct.
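The four cells can be counted directly from a list of actual labels and a list of predicted labels. The sketch below is only an illustration: `confusion_counts` is a hypothetical helper name, and the two label lists are made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # negative, predicted negative
        elif actual == 0 and predicted == 1:
            fp += 1  # negative, wrongly predicted positive
        elif actual == 1 and predicted == 0:
            fn += 1  # positive, wrongly predicted negative
        else:
            tp += 1  # positive, predicted positive
    return tn, fp, fn, tp


# Made-up labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```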

That is rather abstract, so here is a concrete example.

Actual / Predicted    0       1
0                     9980    10
1                     3       7
  1. 9980 people do not have cancer, and the algorithm predicts that they do not have cancer.
  2. 10 people do not have cancer, but the algorithm predicts that they have cancer.
  3. 3 people have cancer, but the algorithm predicts that they do not have cancer.
  4. 7 people have cancer, and the algorithm predicts that they have cancer.
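As a quick check of the numbers in this example, the sketch below computes the plain accuracy from the four counts and compares it with a hypothetical predictor that answers "no cancer" for everyone. Only the four counts come from the table; the variable names are illustrative.

```python
# Counts copied from the example confusion matrix.
tn, fp, fn, tp = 9980, 10, 3, 7

total = tn + fp + fn + tp

# Accuracy of the algorithm: all correct predictions over all samples.
accuracy = (tn + tp) / total

# Accuracy of a predictor that always answers 0 (no cancer):
# it is right for every actual negative and wrong for every actual positive.
always_negative_accuracy = (tn + fp) / total

print(f"total samples            = {total}")
print(f"algorithm accuracy       = {accuracy:.2%}")
print(f"always-negative accuracy = {always_negative_accuracy:.2%}")
```

On these counts the always-negative predictor scores slightly higher on accuracy than the algorithm, although it finds none of the 10 patients.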

Precision

Precision is defined as follows: among the samples predicted to be the event of interest (17 in total), the proportion that is predicted correctly (7 correct, 10 wrong).

Precision = TP / (TP + FP) = 7 / (7 + 10) ≈ 41.2%. In other words, out of every 17 predictions of "sick", on average 7 are correct.

Recall

Recall is defined as follows: among the samples that actually belong to the class of interest (the 10 patients), the proportion that is successfully predicted (7 are predicted).

Recall = TP / (TP + FN) = 7 / (7 + 3) = 70%. In other words, out of every 100 patients, on average the algorithm successfully identifies 70 and misses 30.
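Both ratios can be reproduced from the counts in the example matrix. This is a small sketch; the function names are illustrative, and returning 0.0 when the denominator is zero is an assumption made here, not something the text specifies.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if nothing was predicted positive (assumed convention)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no actual positives (assumed convention)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Counts from the cancer example: 7 true positives, 10 false positives, 3 false negatives.
tp, fp, fn = 7, 10, 3

p = precision(tp, fp)
r = recall(tp, fn)

print(f"precision = {p:.3f}")
print(f"recall    = {r:.3f}")
```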

F1-Score

In some scenarios precision is the more appropriate metric. In stock forecasting, for example, where we predict whether a stock will rise or fall, the business needs the stocks predicted to rise to really rise. In disease prediction, where we predict whether a visitor is sick, the business need is to find every patient and miss none: diagnosing a healthy person as sick does not matter much, but a sick person must not be diagnosed as healthy, so recall is the more appropriate metric.

In other scenarios, however, we need to take both precision and recall into account. What do we do then? The F1-score solves this: F1 is the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
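The sketch below evaluates the formula for the cancer example and for a made-up lopsided case, to show how the harmonic mean reacts when one of the two values is very low. The helper name `f1_from` and the lopsided values are illustrative; returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero, by assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Cancer example: precision = 7/17, recall = 7/10.
example_f1 = f1_from(7 / 17, 7 / 10)

# Made-up lopsided case: perfect precision, almost no recall.
lopsided_f1 = f1_from(1.0, 0.01)
lopsided_arithmetic_mean = (1.0 + 0.01) / 2

print(f"F1 for the cancer example    = {example_f1:.4f}")
print(f"F1 for the lopsided case     = {lopsided_f1:.4f}")
print(f"arithmetic mean for the same = {lopsided_arithmetic_mean:.4f}")
```

The harmonic mean stays close to the smaller of the two values, so a model cannot get a high F1 by doing well on only one of them.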

Examples

To demonstrate the three concepts above, we first construct highly skewed data. We use the handwritten digits data set provided by sklearn. The digits 0-9 in it are fairly evenly distributed, so we convert it into a two-class problem: one class is the digit 9 and the other class is every digit that is not 9, which produces the skew.

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()

y[digits.target==9] = 1
y[digits.target!=9] = 0

Use logistic regression to make the predictions:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

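log_reg.score(X_test, y_test)  # accuracy on the test set

# Predicted labels for the test set, used by the metric functions later on
y_log_predict = log_reg.predict(X_test)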

Because the data is extremely skewed, predicting 0 for every sample already gives an accuracy of about 90%. Accuracy only describes how often the model is right over all samples; it does not show whether the model correctly identifies the samples of class 1. The sklearn metrics package provides direct support for the confusion matrix, precision, recall, and F1 score.

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_log_predict)

from sklearn.metrics import precision_score

precision_score(y_test, y_log_predict)

from sklearn.metrics import recall_score

recall_score(y_test, y_log_predict)

from sklearn.metrics import f1_score

f1_score(y_test, y_log_predict)
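To make the roughly 90% figure concrete, the following sketch scores a hypothetical predictor that always answers 0. It assumes scikit-learn and NumPy are installed, and for simplicity it uses the whole digits data set rather than the test split, so the numbers differ slightly from those of the split used in the article.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, recall_score

digits = datasets.load_digits()

# Binary labels: 1 for the digit 9, 0 for every other digit.
y = (digits.target == 9).astype(int)

# Hypothetical baseline: predict 0 for every sample.
y_all_zero = np.zeros_like(y)

baseline_accuracy = accuracy_score(y, y_all_zero)
baseline_recall = recall_score(y, y_all_zero)

print(f"baseline accuracy = {baseline_accuracy:.4f}")
print(f"baseline recall   = {baseline_recall:.4f}")
```

The baseline reaches an accuracy near 0.9 while its recall is zero, which is why the other metrics are needed next to the score.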

PR curve

For a binary classification problem, we can move the decision threshold to trade precision against recall. A sample is classified as 1 when its score is at or above the threshold and as 0 when its score is below it. Raising the threshold generally increases precision and reduces recall; lowering the threshold generally reduces precision and increases recall. Precision and recall constrain each other: they are two conflicting quantities that generally cannot be increased at the same time.
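The trade-off can be seen on a handful of made-up scores before running it on real data. This is a toy sketch: the scores, labels, thresholds, and the helper name `precision_recall_at` are all invented for illustration, and the strictly opposite movement shown here is a property of this toy data rather than a guarantee.

```python
def precision_recall_at(scores, labels, threshold):
    """Classify score >= threshold as 1 and return (precision, recall)."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up decision scores and their true labels.
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

results = []
for threshold in (-1.5, 0.0, 0.75, 1.5):
    precision, recall = precision_recall_at(scores, labels, threshold)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold:5.2f}  precision={precision:.3f}  recall={recall:.3f}")
```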

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)


from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

precisions = []
recalls = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), 0.1)

for threshold in thresholds:
    y_predict = np.array(decision_scores >= threshold, dtype='int')
    precisions.append(precision_score(y_test, y_predict))
    recalls.append(recall_score(y_test, y_predict))
plt.plot(precisions, recalls)
plt.show()

ROC curve

The ROC curve (Receiver Operating Characteristic curve) describes the relationship between TPR and FPR, where:

  1. TPR (True Positive Rate): the number of positive samples predicted as positive / the actual number of positive samples: TPR = TP / (TP + FN)
  2. TNR (True Negative Rate): the number of negative samples predicted as negative / the actual number of negative samples: TNR = TN / (TN + FP)
  3. FPR (False Positive Rate): the number of negative samples predicted as positive / the actual number of negative samples: FPR = FP / (TN + FP)
  4. FNR (False Negative Rate): the number of positive samples predicted as negative / the actual number of positive samples: FNR = FN / (TP + FN)
\text{FPR} = \frac{\text{FP}}{\text{TN} + \text{FP}}
\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}
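As a worked check, the four rates can be computed for the cancer example from earlier in the article (TN = 9980, FP = 10, FN = 3, TP = 7). The sketch is plain arithmetic; only the variable names are new.

```python
# Counts from the cancer example confusion matrix.
tn, fp, fn, tp = 9980, 10, 3, 7

tpr = tp / (tp + fn)  # share of actual positives predicted positive (same as recall)
fnr = fn / (tp + fn)  # share of actual positives predicted negative
tnr = tn / (tn + fp)  # share of actual negatives predicted negative
fpr = fp / (tn + fp)  # share of actual negatives predicted positive

print(f"TPR = {tpr:.4f}, FNR = {fnr:.4f}")
print(f"TNR = {tnr:.4f}, FPR = {fpr:.4f}")
```

TPR and FNR share the same denominator and add up to 1, as do TNR and FPR, so the ROC curve only needs TPR and FPR.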

Examples

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)

from sklearn.metrics import roc_curve

fprs, tprs, thresholds = roc_curve(y_test, decision_scores)

import matplotlib.pyplot as plt
plt.plot(fprs, tprs)
plt.show()

The area enclosed by the ROC curve and the axes (the area under the curve, AUC) is used as a measure of model quality: the larger the area, the better the model.
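To illustrate what that area is, the sketch below builds ROC points for a few made-up scores and integrates them with the trapezoid rule. It is a simplified illustration, not how sklearn computes it: the helper names are hypothetical, the data is invented, and the point construction assumes that no two samples share the same score.

```python
def roc_points(scores, labels):
    """ROC points obtained by lowering the threshold past one sample at a time.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))  # (FPR, TPR)
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up decision scores and their true labels.
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

points = roc_points(scores, labels)
auc = trapezoid_area(points)
print(f"AUC = {auc:.3f}")
```

An area of 1.0 would mean every positive sample is scored above every negative one; on this toy data one positive is ranked below two negatives, which lowers the area.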

Reproduced from: https://juejin.im/post/5cf5ea866fb9a07efb6971b1


Origin blog.csdn.net/weixin_33881041/article/details/91443415