Chapter 3: Classification
Preface
Reference book
"Hands-On Machine Learning with Scikit-Learn and TensorFlow"
Tools
Python 3.5.1, Jupyter Notebook, PyCharm
Problem solved
MNIST download failure: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond.
Reference link: "Solution to the problem that scikit-learn's fetch_mldata cannot download the MNIST dataset"
StratifiedKFold
- Compared with a cross-validation function like cross_val_score(), this class gives you more control: you can implement the cross-validation loop yourself.
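As a sketch of that extra control, here is a hand-rolled cross-validation loop built on StratifiedKFold. The toy arrays and the SGDClassifier stand in for the book's MNIST "is it a 5?" setup:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Toy binary data standing in for the MNIST task.
rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

accuracies = []
for train_index, test_index in skfolds.split(X, y):
    clf = clone(SGDClassifier(random_state=42))  # fresh, untrained copy per fold
    clf.fit(X[train_index], y[train_index])
    y_pred = clf.predict(X[test_index])
    accuracies.append(np.mean(y_pred == y[test_index]))

print(accuracies)  # one accuracy per fold, like cross_val_score() would return
```

Because you own the loop, you could just as easily log per-fold confusion matrices or use a custom sampling scheme — things cross_val_score() does not expose.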
cross_val_predict
- Instead of returning an evaluation score, it returns the predictions made on each fold.
- The result is a one-dimensional array: in K-fold cross-validation each sample lands in the validation set exactly once, so you get exactly one predicted label per original sample.
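A minimal sketch (toy data in place of MNIST) showing that cross_val_predict() returns exactly one out-of-fold prediction per sample:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Toy binary data: label depends on the first feature.
rng = np.random.RandomState(0)
X = rng.randn(150, 3)
y = (X[:, 0] > 0).astype(int)

# Each sample is predicted exactly once, by a model that never trained on it.
y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3)
print(y_pred.shape)  # (150,) -- one "clean" prediction per original sample
```

These out-of-fold predictions are exactly what you want to feed into a confusion matrix, since each one was made by a model that never saw that sample during training.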
confusion_matrix
- Confusion matrix
- Each row of the confusion matrix represents an actual class; each column represents a predicted class.
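A small self-contained example of the row/column convention (the labels are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# Row i = actual class i, column j = predicted class j:
# cm[0, 1] counts samples that are actually 0 but were predicted as 1.
print(cm)
```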
decision_function()
This method returns a score for each instance; you can then make predictions from those scores using any threshold you choose.
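A sketch on toy data (the dataset is invented for illustration) showing that for SGDClassifier, predict() is just decision_function() thresholded at 0, and that raising the threshold yields fewer positive predictions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: class 1 when the first feature is positive.
rng = np.random.RandomState(1)
X = rng.randn(100, 2)
y = (X[:, 0] > 0).astype(int)

clf = SGDClassifier(random_state=42).fit(X, y)
scores = clf.decision_function(X)  # one real-valued score per instance

# predict() is equivalent to thresholding the scores at 0;
# a stricter threshold trades recall for precision.
default_pred = (scores > 0).astype(int)
strict_pred = (scores > 2.0).astype(int)  # fewer (but more confident) positives
```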
Use the cross_val_predict() function to get the decision scores of all instances in the training set:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
With these scores you can use the precision_recall_curve() function to compute precision and recall for all possible thresholds:
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
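A common use of these arrays is to pick the lowest threshold that reaches a target precision. A sketch with synthetic labels and scores (the 90% target and the data are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic binary labels and scores: positives tend to score higher.
rng = np.random.RandomState(42)
y_true = rng.randint(0, 2, 500)
y_scores = y_true * 1.5 + rng.randn(500)

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# Index of the lowest threshold whose precision is at least 90%.
idx = (precisions[:-1] >= 0.90).argmax()
threshold_90 = thresholds[idx]

# Classify with that threshold instead of the default.
y_pred_90 = (y_scores >= threshold_90)
```

Note that precisions and recalls have one more entry than thresholds (the final point is precision 1, recall 0 by convention), which is why the plotting function above slices with [:-1].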
roc_curve()
Receiver operating characteristic (ROC) curve
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr)
plt.show()
roc_auc_score()
- One way to compare classifiers is to measure the area under the curve (AUC).
- A perfect classifier has a ROC AUC equal to 1; a purely random classifier has a ROC AUC equal to 0.5.
- from sklearn.metrics import roc_auc_score
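A tiny example of both extremes (the score lists are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Perfect ranking: every positive instance scores above every negative one.
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])  # AUC = 1.0

# A ranking that gets half of the positive/negative pairs wrong scores 0.5,
# no better than a purely random classifier.
auc_random_like = roc_auc_score(y_true, [0.8, 0.2, 0.1, 0.9])
```

AUC only depends on how the scores rank the instances, not on their absolute values, which is what makes it useful for comparing classifiers with incomparable score scales.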
Choosing between the ROC curve and the PR curve
- Since the ROC curve and the precision/recall (PR) curve are very similar, you may wonder how to decide which one to use.
- A rule of thumb: when the positive class is rare, or when you care more about false positives than false negatives, choose the PR curve; otherwise use the ROC curve. A good PR curve should hug the top-right corner as closely as possible.
Multi-class classifier
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass task and automatically runs OvR (except for SVM classifiers, for which it uses OvO).
If you want to force Scikit-Learn to use one-versus-one or one-versus-rest, you can use the OneVsOneClassifier or OneVsRestClassifier class.
from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
len(ovo_clf.estimators_)  # one binary classifier per pair of classes
Error Analysis
- cross_val_predict() + confusion_matrix()
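A sketch of that recipe on toy multiclass data (a stand-in for MNIST): get out-of-fold predictions, build the confusion matrix, normalize each row by class size, and zero the diagonal so only the errors remain — the result can then be visualized with plt.matshow():

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Toy 3-class data standing in for MNIST.
rng = np.random.RandomState(42)
X = rng.randn(300, 5)
y = (X[:, :3] + 0.5 * rng.randn(300, 3)).argmax(axis=1)

y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3)
conf_mx = confusion_matrix(y, y_pred)

# Normalize each row by the number of samples in that actual class,
# then zero the diagonal so only the error rates stand out.
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0)
print(norm_conf_mx)  # entry (i, j): fraction of class i misclassified as j
```

Rows that stay bright after normalization point at classes the model systematically confuses, which tells you where to focus (more data, better features, preprocessing).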
My CSDN: https://blog.csdn.net/qq_21579045
My blog garden: https://www.cnblogs.com/lyjun/
My Github: https://github.com/TinyHandsome
What you learn on paper always feels shallow; to truly understand, you must practice it yourself ~
Welcome to drop by ~
by Li Yingjun