Python code implementation
I. Classification models
1. sklearn.metrics provides the commonly used evaluation metrics:
# Accuracy
accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
Parameters:
- y_true: the ground-truth labels (for example, those of the validation set)
- y_pred: the labels returned by the classifier
- normalize: defaults to True, which returns the fraction of correctly classified samples; if False, returns the number of correctly classified samples
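As a small illustration of the normalize parameter, the sketch below scores four made-up labels (the data is purely illustrative and not from the article's iris example):

```python
from sklearn.metrics import accuracy_score

# Toy labels, chosen only for illustration: 2 of the 4 predictions are correct.
y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

# normalize=True (the default) returns the fraction of correct predictions.
fraction_correct = accuracy_score(y_true, y_pred)

# normalize=False returns the number of correct predictions instead.
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct)  # 0.5
print(count_correct)
```

Here the fraction is 2/4 and the count is 2; whether the count prints as an integer or a float may depend on the installed scikit-learn version.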
# Precision
precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
# Recall
recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
# F1
f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
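To make the three binary metrics concrete, here is a minimal sketch that counts true positives, false positives and false negatives by hand and compares the results with sklearn. The labels are invented for illustration only.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels, chosen only for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Count the outcomes for the positive class (label 1) by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

With these labels there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3, recall is 1/2 and F1 is 4/7.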
Parameters:
- average: string or None, one of [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']
- When a binary metric is extended to a multiclass or multilabel problem, the data is treated as a collection of binary problems, one per class. The binary metric can then be averaged across the classes in several ways, each of which is useful in some scenario. The average argument selects the method.
- macro: computes the plain mean of the per-class binary metrics, giving every class the same weight. This helps when infrequent classes are nonetheless important, because it highlights their performance. On the other hand, it assumes that all classes are equally important, so the typically low performance on an infrequent class can dominate the macro-averaged result.
- weighted: accounts for class imbalance by computing the mean of the per-class binary metrics with each class's score weighted by its number of true samples.
- micro: gives every sample-class pair an equal contribution to the overall metric (apart from sample_weight). Rather than averaging the per-class metrics, it sums the numerators and denominators that make up those metrics and computes one overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
- samples: applies only to multilabel problems. It does not compute a per-class measure; instead it computes the metric over the true and predicted classes of each sample in the evaluation data and returns their (sample_weight-weighted) average.
- None: average=None returns an array that contains the score for each class.
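The averaging options above can be compared on a small multiclass example. The sketch below uses invented labels for three classes; it is only meant to show how the average argument changes the result, not to reproduce the article's iris figures.

```python
from sklearn.metrics import precision_score

# Toy three-class labels, chosen only for illustration.
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Per-class precision: class 0 -> 2/3, class 1 -> 0, class 2 -> 0.
per_class = precision_score(y_true, y_pred, average=None)

# macro: unweighted mean of the per-class scores.
macro = precision_score(y_true, y_pred, average='macro')

# weighted: mean of the per-class scores weighted by class support.
# Every class has 2 true samples here, so it equals the macro score.
weighted = precision_score(y_true, y_pred, average='weighted')

# micro: pooled true positives / pooled predicted positives = 2 / 6.
micro = precision_score(y_true, y_pred, average='micro')

print(per_class)
print(macro, weighted, micro)
```

Because the supports are balanced in this toy data, macro and weighted agree (2/9), while micro (1/3) differs because it pools counts across classes before dividing.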
The following example code puts these metrics into practice:
# Import the required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Load the data X, y
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)

# Build a KNN model with K = 3 and train it
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Compute the accuracy by hand
from sklearn.metrics import accuracy_score
correct = np.count_nonzero(clf.predict(X_test) == y_test)
print("Accuracy is: %.3f" % (correct / len(X_test)))
# Output: Accuracy is: 0.921

# Turn iris recognition into a binary classification problem
y_train_binary = y_train.copy()  # copy of the training labels
y_test_binary = y_test.copy()    # copy of the test labels
# Set every label that is not 1 to 0, so y takes only the values 1 and 0
# and the original three-class problem becomes a two-class problem
y_train_binary[y_train_binary != 1] = 0
y_test_binary[y_test_binary != 1] = 0

# Train a KNN model on the binary labels
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train_binary)
y_pred = knn.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy
print('Accuracy: {:.3f}'.format(accuracy_score(y_test_binary, y_pred)))
# Precision
print('Precision: {:.3f}'.format(precision_score(y_test_binary, y_pred)))
# Recall
print('Recall: {:.3f}'.format(recall_score(y_test_binary, y_pred)))
# F1 score
print('F1 score: {:.3f}'.format(f1_score(y_test_binary, y_pred)))
# Output Accuracy: 0.921
# Output Precision: 0.867
# Output Recall: 0.929
# Output F1 score: 0.897
2. Precision-recall (PR) curve functions in sklearn
1) precision_recall_curve() computes the precision and recall values of the PR curve
sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)
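A minimal usage sketch follows, with four invented samples. The scores are passed positionally on the assumption that the keyword name of the second argument may differ between scikit-learn releases; the exact length of the returned arrays can also vary by version, so only the endpoints are commented on.

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

# Toy labels and classifier scores, chosen only for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (precision, recall) pair per threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarises the curve as a single number.
ap = average_precision_score(y_true, y_score)

print(precision)
print(recall)
print(thresholds)
print(ap)
```

For this data the average precision is 0.5 * 1 + 0.5 * (2/3), which is about 0.833: the first positive is ranked first (precision 1), and the second positive is ranked third (precision 2/3).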
The AP (average precision) calculation described here follows the method that the PASCAL VOC Challenge used before 2010. First, set a group of recall thresholds [0, 0.1, 0.2, ..., 1]. Then, for each threshold, take the maximum precision among all points whose recall is greater than or equal to that threshold (for example, recall >= 0.3). This yields 11 precision values, and AP is their mean. The method is known as 11-point interpolated average precision.
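The 11-point rule can be sketched in a few lines of NumPy. The helper name eleven_point_ap is hypothetical (it is not an sklearn function), and the sketch assumes that a threshold with no qualifying recall contributes a precision of 0.

```python
import numpy as np


def eleven_point_ap(recall, precision):
    """11-point interpolated AP from matching recall/precision arrays.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    thresholds = np.arange(11) / 10  # 0.0, 0.1, ..., 1.0
    interpolated = []
    for t in thresholds:
        mask = recall >= t
        # Best precision among points with recall at or above the threshold;
        # assumed to be 0 when no point reaches that recall.
        interpolated.append(precision[mask].max() if mask.any() else 0.0)
    return float(np.mean(interpolated))


# Toy curve with two operating points, chosen only for illustration.
ap_11 = eleven_point_ap(recall=[0.5, 1.0], precision=[1.0, 0.5])
print(ap_11)
```

In this toy curve the six thresholds from 0 to 0.5 see a best precision of 1.0 and the five thresholds from 0.6 to 1.0 see 0.5, so the result is (6 * 1.0 + 5 * 0.5) / 11.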
The PASCAL VOC Challenge switched to a different calculation from 2010 onward. The new method assumes that M of the N samples are positive, which gives M recall values (1/M, 2/M, ..., M/M). For each recall value r, we take the maximum precision over all recall values r' >= r, and then average these M precision values to obtain the final AP.
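The post-2010 rule described above can be sketched as follows. The function name voc_ap_all_points is hypothetical, and the sketch assumes binary labels with 1 as the positive class and ignores tied scores; it is an illustration of the description, not the official VOC evaluation code.

```python
import numpy as np


def voc_ap_all_points(y_true, y_score):
    """AP averaged over the M recall levels 1/M, ..., M/M.

    Hypothetical helper for illustration; ties in y_score are not handled.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # Rank the samples from highest to lowest score.
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order] == 1
    m = int(hits.sum())
    if m == 0:
        return 0.0

    # Precision after each ranked sample.
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)

    # Recall never decreases with rank, so "max precision over r' >= r"
    # is the maximum precision at this rank or any later rank.
    interpolated = np.maximum.accumulate(precision[::-1])[::-1]

    # Average the interpolated precision at the rank of each positive.
    return float(interpolated[hits].mean())


# Toy ranking, chosen only for illustration: positives at ranks 1 and 3.
ap_all = voc_ap_all_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6])
print(ap_all)
```

Here M = 2: at recall 1/2 the best precision is 1, and at recall 2/2 it is 2/3, so the AP is (1 + 2/3) / 2.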
2) How to draw a PR curve with sklearn
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# y_test_binary and y_pred come from the binary KNN example above.
# Note: y_pred holds hard 0/1 predictions, so the curve has very few points;
# passing scores such as knn.predict_proba(X_test)[:, 1] gives a finer curve.
precision, recall, _ = precision_recall_curve(y_test_binary, y_pred)

plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AP={:.3f}'.format(
    average_precision_score(y_test_binary, y_pred)))
plt.show()
3. ROC curve functions in sklearn
1) roc_curve() computes the fpr and tpr values
sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
2) roc_auc_score() computes the AUC value
roc_auc_score(y_true, y_score, average='macro', sample_weight=None)
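As a quick sanity check of the two functions, the sketch below computes the ROC points and the AUC for four invented samples, once by integrating the curve with auc() and once directly with roc_auc_score().

```python
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Toy labels and classifier scores, chosen only for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve, computed two ways.
area_from_curve = auc(fpr, tpr)               # trapezoidal rule on the points
area_direct = roc_auc_score(y_true, y_score)  # straight from labels and scores

print(fpr)
print(tpr)
print(area_from_curve, area_direct)
```

Both areas are 0.75 here: of the four (positive, negative) pairs, three are ranked correctly and one (0.35 versus 0.4) is not.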
3) How to draw a ROC curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output (one column per class)
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

# Learn to predict each class against the others
classifier = OneVsRestClassifier(
    svm.SVC(kernel='linear', probability=True, random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot the ROC curve of class 2 against the chance diagonal
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange', lw=lw,
         label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
4. Using the confusion matrix in sklearn
confusion_matrix() computes the confusion matrix
sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
Parameters:
- y_true: the true values of the target variable
- y_pred: the predicted values of the target variable
- labels: list of labels that indexes the matrix; it sets the order of the rows and columns
- sample_weight: sample weights
Output:
a matrix of shape [n_classes, n_classes], where n_classes is the number of distinct classes in y
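A short sketch of the call follows, using invented three-class labels. Rows correspond to the true class and columns to the predicted class; the second call shows how labels changes the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Toy three-class labels, chosen only for illustration.
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Default order: classes sorted ascending (0, 1, 2).
cm = confusion_matrix(y_true, y_pred)

# Same counts with rows and columns ordered as 2, 1, 0.
cm_reordered = confusion_matrix(y_true, y_pred, labels=[2, 1, 0])

print(cm)
print(cm_reordered)
```

In the default matrix, entry [2, 0] is 1 because one sample of class 2 was predicted as class 0, and entry [1, 2] is 1 because the single sample of class 1 was predicted as class 2.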
II. Regression models
"""
# Use the diabetes dataset to learn linear regression.
# diabetes is a dataset about diabetes; it contains physiological data for 442 patients
# and a measure of disease progression one year later.
# The dataset has 10 features in total:
# age
# sex
# body mass index
# blood pressure
# s1, s2, s3, s4, s5, s6 (six blood serum measurements)
# Note that the data has been preprocessed: each of the 10 features has been mean-centered
# and then scaled by the standard deviation times the square root of the number of samples.
# You can verify that the sum of squares of every column is 1.
"""
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
# Add a dimension to get a column array of body mass index values [[1], [2], ..., [442]]
diabetes_X = diabetes.data[:, np.newaxis, 2]
print(diabetes_X)

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients (the fitted regression coefficient)
print('Coefficients: \n', regr.coef_)
# Output Coefficients: [938.23786125]

# The mean squared error (MSE): the mean of the squared residuals
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Output Mean squared error: 2548.07

# R2, the coefficient of determination (goodness of fit): 1 is perfect prediction
# The better the model, the closer R2 is to 1
# The worse the model, the closer R2 is to 0
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Output Variance score: 0.47

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
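Two claims from the example above can be checked with a short sketch: that every feature column of the diabetes data has a sum of squares of 1, and that MSE and R2 follow their textbook formulas. The four-point regression data is invented for illustration and is unrelated to the diabetes results.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import mean_squared_error, r2_score

# 1) Each feature column of the (scaled) diabetes data has sum of squares 1.
X = datasets.load_diabetes().data
column_sums_of_squares = (X ** 2).sum(axis=0)
print(column_sums_of_squares)

# 2) MSE and R2 computed by hand on toy values, chosen only for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residual_ss = ((y_true - y_pred) ** 2).sum()       # sum of squared residuals
total_ss = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares

mse_manual = residual_ss / len(y_true)  # mean of the squared residuals
r2_manual = 1 - residual_ss / total_ss  # 1 minus the unexplained share

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

For the toy values the squared residuals sum to 1.5, giving an MSE of 0.375, and the total sum of squares is 29.1875, giving an R2 of about 0.949.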