"Machine Learning in Practice: Based on Scikit-Learn, Keras and TensorFlow Version 2" - Study Notes (3)

Chapter III Classification

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly). Copyright 2019 Aurélien Géron, 978-1-492-03264-9. Environment: Anaconda (Python 3.8) + PyCharm
Learning time: 2022.04.01~2022.04.02

The most common supervised learning tasks include regression tasks (predicting values) and classification tasks (predicting classes).
Chapter 2 attempts a regression task—predicting housing prices—using algorithms such as linear regression, decision trees, and random forests.
This chapter works through a classification task. The content is roughly as follows:

3.1 MNIST

This chapter will use the MNIST dataset, a collection of images of 70,000 digits handwritten by U.S. high school students and Census Bureau employees. Each picture is labeled with the number it represents.
This dataset is so widely used that it has been called the "Hello World" of machine learning: anyone thinking of a new classification algorithm will want to see how it performs on MNIST.
Therefore, anyone who studies machine learning will face MNIST sooner or later.

Scikit-Learn provides a number of helper functions to help you download popular datasets. MNIST is also one of them. Here is the code to get the MNIST dataset:

# Fetch the MNIST dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# as_frame=False is added here because following the book's code caused a problem later on:
# the "view one digit from the dataset" step would not display. Possible reason: https://www.cnpython.com/qa/1394204
print(mnist.keys())  # list all the keys
# Fetching is a bit slow; a way to download the files directly is described at https://www.jianshu.com/p/d282bce1a999
# Note: "By default, Scikit-Learn caches downloaded datasets in the $HOME/scikit_learn_data directory", so later calls should be faster.

Datasets loaded by Scikit-Learn usually have a similar dictionary structure, including:

  • DESCR key: describes the dataset;
  • data key: contains an array with one row per instance and one column per feature;
  • target key: contains an array of labels.
# Inspect the data arrays in the MNIST dataset
X, y = mnist["data"], mnist["target"]
print(X.shape)
print(y.shape)

There are 70,000 images in total, and each image has 784 features, because each image is 28×28 pixels (28×28=784) and each feature represents one pixel's intensity, from 0 (white) to 255 (black).
Let's look at one digit from the dataset first. All you need to do is grab an instance's feature vector, reshape it into a 28×28 array, and display it with Matplotlib's imshow() function:

# View one digit from the dataset
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()  # display one digit from the dataset
print(y[0])  # view that digit's label
# Note: the labels are strings, and most machine learning algorithms expect numbers, so cast y to integers:
import numpy as np
y = y.astype(np.uint8)

You should still create a test set and set it aside before you start digging into this data.
In fact, the MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):

X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

Finally, we shuffle the training set data (see the sketch after the list below).

  • This ensures that all cross-validation folds are similar (you certainly don't want one fold to be missing some digits).

  • In addition, some machine learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures this doesn't happen.
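A minimal sketch of one way to do the shuffling (an assumption about the approach, not code from these notes; applying the same random permutation to X_train and y_train keeps images and labels aligned):

shuffle_index = np.random.permutation(60000)  # a random ordering of the 60,000 training indices
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]  # apply the same permutation to both arrays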

3.2 Training a Binary Classifier

As a beginner, let's simplify the problem and just try to recognize a single number, such as the number 5.
This "digit-5 detector" is an example of a binary classifier, capable of distinguishing between just two classes: 5 and not-5. First create the target vectors for this classification task:

y_train_5 = (y_train == 5)  # y_train_5/y_test_5 hold a series of True/False values
y_test_5 = (y_test == 5)  # True means "this is the digit 5", False means "this is not the digit 5"

Then pick a classifier and start training. A good initial choice is a stochastic gradient descent (SGD) classifier, using Scikit-Learn's SGDClassifier class.
The advantage of this classifier is that it can efficiently handle very large datasets. This is partly because SGD processes training instances independently, one at a time (which also makes SGD well suited for online learning), as we will see later.

# First create an SGDClassifier and train it on the whole training set:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)  # define an SGDClassifier and set random_state=42
# SGDClassifier relies on randomness during training (hence the name "stochastic"), so set random_state if you want reproducible results.
sgd_clf.fit(X_train, y_train_5)  # train the classifier on the training data

Use it to detect whether an image is the digit 5 (some_digit is the image X[0] viewed above):

print(sgd_clf.predict([some_digit]))  # the output will be [ True]

The classifier guesses that this image represents 5 (True). Looks like it guessed right this time! So, let's evaluate the performance of this model.

3.3 Performance measurement

Evaluating classifiers is much more difficult than evaluating regressors, so this chapter will spend a lot of time on this topic, and will cover many methods of performance measurement.

3.3.1 Measuring Accuracy Using Cross Validation

As mentioned in Chapter 2, cross-validation is a good way to evaluate models.

(1) Implementing cross-validation with a for loop

Occasionally you will need more control over the cross-validation process than what Scikit-Learn's cross-validation functions such as cross_val_score() provide.
In that case, you can implement cross-validation yourself; it is fairly straightforward.

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
# Measure accuracy with cross-validation
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)  # 3 folds, random seed 42 for reproducibility
# The book's original code raised an error when run; the fix is to add shuffle=True
for train_index, test_index in skfolds.split(X_train, y_train_5):
    # Each fold is produced by StratifiedKFold via stratified sampling, so its class proportions match the overall ratio.
    clone_clf = clone(sgd_clf)  # clone the SGDClassifier; a fresh copy per iteration keeps the folds independent
    X_train_folds = X_train[train_index]  # split this iteration's training and test folds using train_index/test_index
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)  # train the cloned classifier on the training folds
    y_pred = clone_clf.predict(X_test_fold)  # predict on this iteration's test fold
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints approximately: 0.9502, 0.96565, 0.96495

(2) Cross-validation with the cross_val_score() function

Now use the cross_val_score() function to evaluate the SGDClassifier model with K-fold cross-validation (3 folds).
Remember, K-fold cross-validation means splitting the training set into K folds (in this case, 3), then holding out one fold at a time for prediction while training on the remaining folds (see Chapter 2).

from sklearn.model_selection import cross_val_score
cross_val_score_result = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print(cross_val_score_result)

The cross-validation accuracy (the ratio of correct predictions) exceeds 93% on every fold. Looks pretty amazing, doesn't it?

(3) Putting accuracy in perspective

Before you get too excited, let's define a dumb classifier, Never5Classifier, that classifies every image as "not 5":

from sklearn.base import BaseEstimator


class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self
    
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

Can you guess the accuracy of this model? Let's see:

never_5_clf = Never5Classifier()
cross_val_score_result = cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print(cross_val_score_result)  # output: [0.91125, 0.90855, 0.90915]

That's right, its accuracy is also over 90%!
This is simply because only about 10% of the images are 5s, so if you always guess that an image is not a 5, you will be right about 90% of the time.
This demonstrates that accuracy is generally not the preferred performance metric for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).

3.3.2 Confusion Matrix

A better way to evaluate the performance of a classifier is the confusion matrix. The general idea is to count the number of times an instance of class A is classified into class B.
The rows in the confusion matrix represent the actual classes and the columns represent the predicted classes.

For example, to know how many times the classifier confused images of 5s with 3s (actual 5s predicted as 3s), you look at row 5, column 3 of the confusion matrix.

(1) Computing the confusion matrix

To compute a confusion matrix, you first need a set of predictions to compare with the actual targets. You could make predictions on the test set, but don't touch it for now
(the test set is best saved for the very end of the project, once you have a classifier you are ready to launch). As an alternative, use the cross_val_predict() function:

Like cross_val_score(), the cross_val_predict() function performs K-fold cross-validation, but instead of returning evaluation scores, it returns the predictions made on each fold. This means you get a clean prediction for every instance in the training set ("clean" meaning that the model predicts on data it never saw during training).

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)  # cross-validated predictions

You can now get the confusion matrix with the confusion_matrix() function. Just pass it the target classes (y_train_5) and the predicted classes (y_train_pred):

from sklearn.metrics import confusion_matrix
SGD_confusion_matrix = confusion_matrix(y_train_5, y_train_pred)  # compute the confusion matrix
print(SGD_confusion_matrix)
# output: [[53892   687]
#          [ 1891  3530]]

In this example, the first row represents all the "not 5" images (the negative class): 53,892 of them were correctly classified as "not 5" (true negatives), while 687 were wrongly classified as "5" (false positives). The second row represents all the "5" images (the positive class): 1,891 were wrongly classified as "not 5" (false negatives), while 3,530 were correctly classified as "5" (true positives).
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):

y_train_perfect_predictions = y_train_5  # pretend we reached perfection (use the true labels as the "predictions")
print(confusion_matrix(y_train_5, y_train_perfect_predictions))
# output: [[54579     0]
#          [    0  5421]]

Confusion matrices can be very informative, but sometimes you may want the metrics to be a little more concise.

(2) Evaluation metrics

An interesting metric is the accuracy of the positive predictions, known as the precision of the classifier, where TP is the number of true positives and FP is the number of false positives:

$Precision = \frac{TP}{TP + FP}$

A trivial way to get perfect precision is to make a single positive prediction and make sure it is correct (precision = 1/1 = 100%). But that would not be very useful, since the classifier would ignore all but that one positive instance.
So precision is typically used together with another metric called recall, also known as sensitivity or the true positive rate (TPR):
it is the ratio of positive instances that are correctly detected by the classifier, where FN is the number of false negatives.

$Recall = \frac{TP}{TP + FN}$
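As a quick sanity check, plugging in the values from the confusion matrix printed above (TP = 3530, FP = 687, FN = 1891; these numbers depend on that particular run):

$Precision = \frac{3530}{3530 + 687} \approx 0.837, \qquad Recall = \frac{3530}{3530 + 1891} \approx 0.651$

These should match what precision_score() and recall_score() report in the next section.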

3.3.3 Precision, recall and F1-score

Scikit-Learn provides functions to compute various classifier metrics, including precision and recall:

from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_pred))  # output: 0.8370879772350012
print(recall_score(y_train_5, y_train_pred))  # output: 0.6511713705958311

Now the 5-detector does not look as shiny as its accuracy suggested: when it claims an image is a 5, it is correct only about 83.7% of the time, and it only detects about 65.1% of the 5s.
It is often convenient to combine precision and recall into a single metric called the F1 score, especially when you need a simple way to compare two classifiers.
The F1 score is the harmonic mean of precision and recall.

Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, a classifier only gets a high F1 score if both its recall and its precision are high.

$F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = 2 \times \frac{precision \times recall}{precision + recall} = \frac{TP}{TP + \frac{FN + FP}{2}}$
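Plugging in the precision and recall obtained above gives a rough idea of what f1_score() should print (approximate, since the exact value depends on the run):

$F_1 \approx 2 \times \frac{0.837 \times 0.651}{0.837 + 0.651} \approx 0.73$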

To calculate the F1 score, just call f1_score():

from sklearn.metrics import f1_score
print(f1_score(y_train_5, y_train_pred))

The F1 score favors those classifiers with similar precision and recall. This doesn't necessarily always do what you want:
in some cases you care more about precision, while in others you might really care about recall.

For example, suppose you train a classifier to detect videos that are safe for children. You would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision),
rather than a classifier with much higher recall that lets a few really bad videos slip into production (in which case
you might even want to add a human pipeline to check the classifier's video selections). Conversely, suppose you train a classifier to detect shoplifters in surveillance images:
it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).

3.3.4 Precision/recall trade-off (how to set the threshold)

Unfortunately, you can't have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall trade-off.
To understand this trade-off, let's look at how SGDClassifier makes classification decisions.

For each instance, it computes a score based on a decision function. If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class.
If you raise the threshold, some false positives become true negatives, so precision increases, but some true positives become false negatives, so recall drops.
Conversely, lowering the threshold increases recall and reduces precision.
Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores it uses to make predictions. Instead of calling the classifier's predict() method, call its decision_function() method,
which returns a score for each instance; you can then use any threshold you want to make predictions based on those scores:

# Call decision_function() and apply a threshold manually
y_scores = sgd_clf.decision_function([some_digit])  # get some_digit's decision score
print(y_scores)  # output: 2164.22030239
threshold = 0  # set the threshold to 0
y_some_digit_pred1 = (y_scores > threshold)
threshold = 8000  # set the threshold to 8000
y_some_digit_pred2 = (y_scores > threshold)
print(y_some_digit_pred1, '\n', y_some_digit_pred2)  # prints True and then False

This confirms that raising the threshold does reduce recall: the image is actually a 5, and the classifier detects it when the threshold is 0 but misses it when the threshold is raised to 8000.
So how do you decide which threshold to use? First, use the cross_val_predict() function to get the scores of all instances in the training set, this time asking it to return decision scores instead of predictions.
With these scores, use the precision_recall_curve() function to compute precision and recall for all possible thresholds:

# Decide which threshold is more appropriate
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")  # decision scores for every instance in the training set
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)  # precision and recall for all possible thresholds
# precision_recall_curve computes precision-recall pairs for different decision thresholds (note: it is restricted to binary classification); it returns precisions, recalls, and thresholds

Finally, use Matplotlib to plot precision and recall as a function of threshold:

# Plot precision and recall as functions of the threshold value
import matplotlib as mpl


def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.style.use('seaborn')  # set the plotting style
    mpl.rcParams["font.sans-serif"] = ["SimHei"]  # use the SimHei font so the Chinese axis labels display correctly (Arial would garble them)
    mpl.rcParams["axes.unicode_minus"] = False  # display minus signs correctly
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")  # threshold vs. precision curve
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")  # threshold vs. recall curve
    plt.xlabel('阈值', fontsize=20)  # x-axis label ("threshold")
    plt.legend(fontsize=18)  # add the legend


plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

You may wonder why the precision curve is bumpier than the recall curve in Figure 3-4. The reason is that precision may sometimes go down when you raise the threshold (even though the general trend is upward).
Another way to select a good precision/recall trade-off is to plot precision directly against recall.

# Plot the precision-recall curve
from sklearn.metrics import PrecisionRecallDisplay
pr_display = PrecisionRecallDisplay(precision=precisions, recall=recalls).plot()
plt.show()

Suppose you decide to aim for 90% precision. Looking at the first plot suggests a threshold of around 8,000. To be more precise, you can search for the lowest threshold that gives you at least 90% precision
(np.argmax() returns the first index of the maximum value, which here means the first True value):

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
print(threshold_90_precision)  # output: roughly 3370.02

To make predictions (on the training set for now), instead of calling the classifier's predict() method, you can just run this code:

y_train_pred_90 = (y_scores >= threshold_90_precision)
print(y_train_pred_90)

Check the precision and recall of these predictions:

y_train_pred_90_p_score = precision_score(y_train_5, y_train_pred_90)
y_train_pred_90_r_score = recall_score(y_train_5, y_train_pred_90)
print('精度:', y_train_pred_90_p_score, '\n', '召回率:', y_train_pred_90_r_score)  # '精度' = precision, '召回率' = recall

Now you have a 90% precision classifier (or close enough)! As you can see, it is fairly easy to create a classifier with virtually any precision you want:
just set the threshold high enough. However, a high-precision classifier is not very useful if its recall is too low!
If someone says, "We need 99% precision," you should ask, "At what recall?"

3.3.5 ROC curve

There is another tool that is often used with binary classifiers called the Receiver Operating Characteristic Curve (or ROC for short).

The ROC curve is very similar to the precision/recall curve, but instead of plotting precision versus recall, it plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR).
The FPR is the ratio of negative instances that are incorrectly classified as positive. It equals 1 minus the true negative rate (TNR), which is the ratio of negative instances correctly classified as negative, also called specificity.
The ROC curve therefore plots sensitivity (recall) versus (1 − specificity).
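In the same notation as the precision and recall formulas above (with TN denoting the number of true negatives):

$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} = 1 - TNR, \qquad TNR = \frac{TN}{TN + FP}$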

To plot the ROC curve, first use the roc_curve() function to compute the TPR and FPR for various thresholds, then use Matplotlib to plot TPR against FPR:

# Use the roc_curve() function to compute the TPR and FPR for various thresholds
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)


# Use Matplotlib to plot TPR against FPR.
def plot_roc_curve(fpr, tpr, label=None):
    plt.style.use('seaborn')  # set the plotting style
    mpl.rcParams["font.sans-serif"] = ["SimHei"]  # use the SimHei font so the Chinese axis labels display correctly (Arial would garble them)
    mpl.rcParams["axes.unicode_minus"] = False  # display minus signs correctly
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal
    plt.xlabel('假正率', fontsize=18)  # x-axis label ("false positive rate")
    plt.ylabel('真正率(召回率)', fontsize=18)  # y-axis label ("true positive rate (recall)")


plot_roc_curve(fpr, tpr)
plt.show()

Again, there is a trade-off here: the higher the recall rate (TPR), the more false positives (FPR) the classifier produces.
The dotted line represents the ROC curve of a purely random classifier, and an excellent classifier should be as far away from this line as possible (to the upper left corner).
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has an ROC AUC equal to 1, while a purely random classifier has an ROC AUC equal to 0.5.
Scikit-Learn provides functions to calculate ROC AUC:

from sklearn.metrics import roc_auc_score
roc_auc_score_forSGD = roc_auc_score(y_train_5, y_scores)
print(roc_auc_score_forSGD)

Since the ROC curve is so similar to the precision/recall (PR) curve, you may wonder how to decide which one to use.
As a rule of thumb, prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives; otherwise, use the ROC curve.

For example, looking at the previous ROC curve (and the ROC AUC score), you might think the classifier is really good. But that is mostly because there are few positives (5s) compared to negatives (non-5s).
In contrast, the PR curve makes it clear that the classifier still has room for improvement (the curve could be closer to the top-right corner).

3.3.6 Experiment (ROC AUC score of the RandomForestClassifier)

Now let's train a RandomForestClassifier and compare its ROC curve and ROC AUC score with those of the SGDClassifier.

First, you need to get scores for each instance in the training set.
But because of the way it works, the RandomForestClassifier class does not have a decision_function() method; instead, it has a predict_proba() method.

Scikit-Learn classifiers generally have one or the other (or both).

The predict_proba() method returns an array containing a row per instance and a column per class, each containing the probability that the given instance belongs to the given class (for example, a 70% chance that the image
represents a 5).

# Evaluate the random forest model's ROC curve and AUC
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)  # define a random forest classifier
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")  # cross-validated class probabilities
# roc_curve() expects labels and scores, but instead of scores we have class probabilities,
# so we simply use the probability of the positive class as the score:
y_scores_forest = y_probas_forest[:, 1]  # score = probability of the positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)  # compute FPR and TPR for the random forest
# Plot both ROC curves to compare the two models:
plt.plot(fpr, tpr, "b:", label="SGD")  # the SGD model's curve and label
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")  # the random forest model's curve and label
plt.legend(loc="lower right")  # show the legend and set its position
plt.show()
# Compute the random forest's ROC AUC score
RandomForest_roc_auc_score = roc_auc_score(y_train_5, y_scores_forest)
print(RandomForest_roc_auc_score)

The ROC curve of RandomForestClassifier looks much better than that of SGDClassifier, it is closer to the upper left corner, so its ROC AUC score is also much higher.

Hopefully you now have a good handle on how to train a binary classifier, how to choose an appropriate metric and evaluate a classifier with cross-validation, how to
select the precision/recall trade-off that fits your needs, and how to compare multiple models using ROC curves and ROC AUC scores.

3.4 Multiclass Classifier

A binary classifier distinguishes between two classes, whereas a multiclass classifier (also called a multinomial classifier) can distinguish between more than two classes.
Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) can handle multiple classes directly. Others (such as Support Vector Machine classifiers or linear classifiers) are strictly binary classifiers.
However, there are various strategies that let you use several binary classifiers for multiclass classification:

  • One way to create a system that classifies digit images into 10 classes (0 to 9) is to train 10 binary classifiers, one per digit (a 0-detector, a 1-detector, a 2-detector, and so on).
    Then, when you need to classify an image, you get each classifier's decision score for that image and pick the class whose classifier outputs the highest score.
    This is called the one-versus-rest (OvR) strategy, also known as one-versus-all.
  • Another approach is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another for 0s and 2s, another for 1s and 2s, and so on.
    This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N×(N−1)/2 classifiers. For MNIST, that means training 45 binary classifiers!
    To classify an image, you run it through all 45 classifiers and see which class wins the most duels.
    The main advantage of OvO is that each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish. Some algorithms, such as Support Vector Machine classifiers, scale poorly with the size of the training set.
    For these algorithms OvO is preferred, because it is faster to train many classifiers on small training sets than to train few classifiers on large training sets.
    For most binary classification algorithms, however, OvR is the better choice.

Scikit-Learn can detect that you are trying to use a binary classification algorithm for multi-class classification tasks, and it will automatically run OvR or OvO depending on the situation. Let's try out the SVM classifier with the sklearn.svm.SVC class:

one_digit = X[0]  # take one digit image from the dataset for testing
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train)  # y_train, not y_train_5
one_digit_pre = svm_clf.predict([one_digit])
print(one_digit_pre)

Very easy! This code trains the SVC on the training set using the original target classes 0 to 9 (y_train), instead of the 5-versus-rest target classes (y_train_5), and then makes a prediction (which is correct in this case).
Under the hood, Scikit-Learn actually trained 45 binary classifiers, got their decision scores for the image, and selected the class with the highest score.
To see that this is indeed the case, call the decision_function() method. It returns 10 scores per instance, one per class, instead of just one score per instance:

one_digit_scores = svm_clf.decision_function([one_digit])
print(one_digit_scores)
one_digit_scores_max = np.argmax(one_digit_scores)  # index of the highest score (the one around 9.31)
print(one_digit_scores_max)  # confirms that the highest-scoring class is indeed 5

When a classifier is trained, it stores the list of target classes in its classes_ attribute, sorted by value.
In this example, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be the digit 5), but in general you won't be so lucky.

print(svm_clf.classes_)
print(svm_clf.classes_[5])

If you want to force Scikit-Learn to use the one-versus-one (OvO) or one-versus-rest (OvR) strategy, you can use the OneVsOneClassifier or OneVsRestClassifier classes.
Simply create an instance and pass a classifier to its constructor (it does not even have to be a binary classifier).
For example, the following code creates a multiclass classifier based on SVC using the OvR strategy:

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC())  # an SVC wrapped in the OvR strategy
ovr_clf.fit(X_train, y_train)
ovr_one_digit_pre = ovr_clf.predict([one_digit])
print(ovr_one_digit_pre)  # the prediction made by the OvR-based SVC
print(len(ovr_clf.estimators_))  # the number of underlying binary classifiers (one per class, so 10)
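For comparison, forcing the OvO strategy works the same way. A minimal sketch (not from the notes; note that OneVsOneClassifier trains N×(N−1)/2 = 45 SVCs here, so this is slow on the full MNIST training set):

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SVC())  # an SVC wrapped in the OvO strategy
ovo_clf.fit(X_train, y_train)  # trains 45 binary classifiers, one per pair of digits
print(ovo_clf.predict([one_digit]))  # predicted class for the test digit
print(len(ovo_clf.estimators_))  # 45 underlying binary classifiers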

In addition, SGDClassifier or RandomForestClassifier can also be used directly for multiclass classification:

# Multiclass classification with SGD
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)  # define an SGDClassifier with random_state=42 for reproducibility
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([one_digit])
# This time Scikit-Learn did not have to run OvR or OvO: an SGD classifier can assign instances
# to multiple classes directly.
# Calling decision_function() returns one score per class for each instance:
sgd_clf.decision_function([one_digit])
# Now, of course, you want to evaluate this classifier. As usual, use cross-validation.
# Use cross_val_score() to evaluate the SGDClassifier's accuracy:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

It gets over 84% on all test folds. A purely random classifier would get about 10% accuracy, so this is not a bad score, but there is still room for improvement.
For example, simply scaling the inputs (as described in Chapter 2) increases accuracy to over 89%:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

3.5 Error Analysis

Here, suppose you have found a promising model and you now want to find ways to improve it further. One way is to analyze the types of errors it makes.

First look at the confusion matrix. As before, use the cross_val_predict() function to make predictions, and then call the confusion_matrix() function:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
print(conf_mx)
plt.matshow(conf_mx, cmap=plt.cm.gray)  # use Matplotlib to view the confusion matrix as an image
plt.show()

The confusion matrix looks pretty good, since most images are on the main diagonal, which means they were classified correctly.
The 5s look slightly darker than the other digits, which could mean that there are fewer images of 5s in the dataset, or that the classifier does not perform as well on 5s as on the other digits. In fact, you can verify that both are the case.
Let's focus on the errors.
First, divide each value in the confusion matrix by the number of images in the corresponding class, so that you compare error rates rather than absolute error counts (which would be unfair to classes with more images).
Then fill the diagonal with zeros to keep only the errors, and plot the result again (as sketched below):
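The notes do not include the code for this normalization step; a minimal sketch of what it might look like, using the conf_mx computed above:

row_sums = conf_mx.sum(axis=1, keepdims=True)  # number of images in each actual class
norm_conf_mx = conf_mx / row_sums  # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)  # zero out the diagonal to keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)  # brighter cells = more frequent errors
plt.show()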

Now you can clearly see the kinds of errors the classifier makes. Remember that rows represent actual classes and columns represent predicted classes.
The column for class 8 is quite bright, which tells you that many images get misclassified as 8s. However, the row for class 8 is not that bad, telling you that actual 8s generally do get correctly classified as 8s.
Notice that the errors are not perfectly symmetrical; for example, 3s and 5s are often confused (in both directions).
**Analyzing the confusion matrix often gives you insight into ways to improve your classifier.** Judging from this plot, your effort is best spent on reducing the false 8s.

For example, you could try to gather more training data for digits that look like 8s (but are not), so the classifier can learn to distinguish them from real 8s.
Or you could engineer new features that would help the classifier, for example, an algorithm that counts the number of closed loops (the number 8 has two, 6 has one, 5 has none).
Or you could preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to make certain patterns, such as closed loops, stand out more.

Analyzing individual errors can also be a good way to gain insight into what your classifier is doing and why it is failing, but it is more difficult and time-consuming.
For example, let's look at some examples of 3s and 5s (the plot_digits() function just uses Matplotlib's imshow() function):

# View examples of 3s and 5s
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=mpl.cm.binary, **options)
    plt.axis('off')


cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]  # actual 3s classified as 3s
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]  # actual 3s classified as 5s
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]  # actual 5s classified as 3s
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]  # actual 5s classified as 5s
plt.figure(figsize=(8, 8))
plt.subplot(221)
plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222)
plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223)
plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224)
plot_digits(X_bb[:25], images_per_row=5)
plt.show()

Some of the digits the classifier got wrong (i.e., the bottom-left and top-right blocks) are indeed written so badly that even a human would have trouble classifying them (e.g., one of the 5s in the first row really does look like a 3).
However, most misclassified images look like obvious mistakes to us, and it is hard to understand why the classifier got them wrong.
The reason is that the simple SGDClassifier model we used is a linear model.
All it does is assign a weight per pixel to each class, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class.
Since 3s and 5s differ by only a few pixels, the model easily confuses them.
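Schematically, for each class $k$ the model computes a score of the form $score_k(\mathbf{x}) = \mathbf{w}_k \cdot \mathbf{x} + b_k$ (where $\mathbf{w}_k$ and $b_k$ stand for the learned per-class weights and bias; notation chosen here for illustration), and the image is assigned to the class with the highest score.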

The main difference between the number 3 and the number 5 is the position of the small line in the middle that connects the top line to the lower arc.
If you write a number 3 that moves the connection point slightly to the left, the classifier might classify it as a number 5, and vice versa. In other words, this classifier is very sensitive to image shift and rotation.
So one of the ways to reduce the confusion between the number 3 and the number 5 is to preprocess the image to make sure they are centered and not rotated. This also helps reduce other errors.

3.6 Multi-label classification

So far, each instance has always been assigned to just one class. But in some cases you may want the classifier to output multiple classes for each instance.

For example, a classifier for face recognition: what if more than one person is recognized in a photo? Of course, a tag should be attached to each person identified.

Say the classifier has been trained to recognize three faces: Alice, Bob, and Charlie. Then when it is shown a photo of Alice and Charlie, it should output [1, 0, 1]
(meaning "Alice yes, Bob no, Charlie yes"). A classification system that outputs multiple binary labels is called a multilabel classification system.
Let's look at a simpler example:

# Multilabel classifier (two labels)
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)  # whether the digit is large (>= 7)
y_train_odd = (y_train % 2 == 1)  # whether the digit is odd
y_multilabel = np.c_[y_train_large, y_train_odd]  # combine the two label arrays into one
knn_clf = KNeighborsClassifier()  # define a KNeighborsClassifier (not every classifier supports multilabel classification)
knn_clf.fit(X_train, y_multilabel)  # train the KNeighborsClassifier on the multilabel targets
print(knn_clf.predict([one_digit]))  # predict for X[0] and print its two labels: [[False  True]]

The result is correct! The number 5 is indeed small (False) and odd (True).
There are many ways to evaluate multi-label classifiers, and choosing the right metric depends on your project.
One way to do this is to measure the F1 score (or any other binary classifier metric discussed earlier) for each label, and simply calculate the average score.

# Compute the average F1 score across all labels:
from sklearn.metrics import f1_score
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
knn_f1_score = f1_score(y_multilabel, y_train_knn_pred, average="macro")  # or average="weighted", see below
print(knn_f1_score)

This assumes that all labels are equally important, which may not be the case. In particular, if you have many more pictures of Alice than of Bob and Charlie, you may want to give more weight to the classifier's score on pictures of Alice.
One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label).
To do this, simply set average="weighted" in the code above.

3.7 Multiple Output Classification

The last type of classification task we will discuss is called multioutput-multiclass classification (or simply multioutput classification).
It is a generalization of multilabel classification where each label can itself be multiclass (i.e., it can have more than two possible values).
To illustrate, let's build a system that removes noise from images. It will take a noisy digit image as input and (hopefully) output a clean digit image,
represented as an array of pixel intensities, just like any other MNIST image. Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensities range from 0 to 255).
It is thus an example of a multioutput classification system.

The line between classification and regression is sometimes blurry, like this example. Arguably, predicting pixel intensities is more of a regression task than classification.
The multi-output system is not limited to classification tasks, it is possible for a system to output multiple labels for each instance, including both class labels and value labels.

Start by creating the training and test sets, adding noise to the MNIST images' pixel intensities with NumPy's randint() function. The target images are the original (clean) images:

noise = np.random.randint(0, 100, (len(X_train), 784))  # generate noise
X_train_mod = X_train + noise  # add the noise to the training images
noise = np.random.randint(0, 100, (len(X_test), 784))  # generate noise
X_test_mod = X_test + noise  # add the noise to the test images
y_train_mod = X_train  # the original images are the training targets
y_test_mod = X_test  # the original images are the test targets
# The input is the noisy image; the target is the clean image.
# Now train the classifier and have it clean up a test image:
knn_clf.fit(X_train_mod, y_train_mod)
some_index = 0
clean_digit = knn_clf.predict([X_test_mod[some_index]])


def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=mpl.cm.binary, interpolation="nearest")
    plt.axis("off")


plot_digit(clean_digit)
plt.show()

Seems close enough to the target. This concludes the classifier tour. Hopefully now you have a grasp of how to choose good metrics for classification tasks, how to choose an appropriate precision/recall tradeoff, how to compare multiple classifiers, and more generally, how to build superior classification systems for a variety of tasks.

3.8 Exercises

1. Build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set.
Hint: KNeighborsClassifier works quite well for this task; you just need to find good hyperparameter values (try a grid search on the weights and n_neighbors hyperparameters; a sketch of such a search follows).
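Not the official solution (that is in the notebook linked at the end); just a minimal sketch of the kind of grid search the hint describes, with placeholder parameter values (this is slow on the full MNIST training set):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = [{'weights': ['uniform', 'distance'], 'n_neighbors': [3, 4, 5]}]  # candidate values to try
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')  # 5-fold CV over all combinations
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)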

2. Write a function that can shift an MNIST image one pixel in any direction (up, down, left, or right). Then, for each image in the training set, create four shifted copies (one per direction) and
add them to the training set. Finally, train the model on this expanded training set and measure its accuracy on the test set. You should observe that the model performs even better!
This technique of artificially growing the training set is called data augmentation or training set expansion. A sketch of the shifting function follows.
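A minimal sketch of such a shifting function, using scipy.ndimage.shift (one possible approach, not the official solution):

from scipy.ndimage import shift

def shift_image(image, dx, dy):
    # shift a flattened 28x28 MNIST image by (dx, dy) pixels, filling new pixels with 0
    shifted = shift(image.reshape(28, 28), [dy, dx], cval=0)
    return shifted.reshape(-1)

# example: four copies of one image, shifted by one pixel in each direction
augmented = [shift_image(X_train[0], dx, dy) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]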

3. A great place to start on Kaggle: tackle the Titanic dataset.

4. Create a spam classifier (more challenging exercise):

  • Download spam and non-spam emails from Apache SpamAssassin's public datasets.
  • Unzip the dataset and familiarize yourself with the data format.
  • Divide the dataset into training and testing sets.
  • Write a data preparation pipeline that converts each email into a feature vector. Your pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word.
    For example, if all emails only ever contain the four words "Hello", "how", "are", "you", then the email "Hello you Hello Hello you" would be converted into the vector [1, 0, 0, 1]
    (meaning "Hello" is present, "how" is absent, "are" is absent, "you" is present), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
  • Add hyperparameters to your pipeline to control whether to strip email headers, whether to convert each email to lowercase, whether to remove punctuation, whether to replace all URLs with "URL", whether to replace all
    numbers with "NUMBER", or even whether to perform stemming (i.e., trim off word endings; there are Python libraries available for this).
  • Finally, try a few more classifiers to see if you can create a spam classifier with high recall and high precision.

The answers to these exercises are available in the Jupyter notebooks at https://github.com/ageron/handson-ml2.

PS: I didn't plan to write notes for Chapters 1 and 2, so they aren't organized yet; I'll sort them out after finishing my thesis (tired.jpg). For now, LeetCode still comes first.
