Classification (4): Multiclass Classification

Multiclass classification

The binary classifiers introduced earlier can only separate data into two classes (for example, "the digit 5" and "not the digit 5"). Multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.

Some algorithms (such as Random Forest or Naive Bayes) can handle multiple classes directly. Others, such as SVMs and linear classifiers, are strictly binary classifiers. Even so, there are several strategies for using binary classifiers to perform multiclass classification.

For example, to classify handwritten digit images into 10 classes (0 to 9), one approach is to train 10 binary classifiers, one per digit (a 0-detector, a 1-detector, a 2-detector, and so on). To classify an image, we feed it to all 10 classifiers, get a decision score from each, and assign the image to the class whose classifier gives the highest score. This is called the one-versus-all (OvA) strategy, also known as one-versus-the-rest.
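To make the strategy concrete, here is a minimal sketch of doing OvA by hand with SGDClassifier. It assumes X_train, y_train and X_test follow the MNIST setup from the earlier posts and that y_train holds numeric digit labels; in practice sk-learn handles all of this for you automatically, as we will see below.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Train one binary classifier per digit: "is this image the digit k or not?"
binary_clfs = []
for k in range(10):
    clf = SGDClassifier(random_state=42)
    clf.fit(X_train, (y_train == k))          # binary target: digit k vs. everything else
    binary_clfs.append(clf)

# To classify an image, collect the 10 decision scores and take the highest one.
scores = [clf.decision_function(X_test[:1])[0] for clf in binary_clfs]
predicted_digit = np.argmax(scores)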

Another strategy is to train one binary classifier for every pair of digits: one to distinguish 0 from 1, one to distinguish 0 from 2, one for 1 and 2, and so on. This is called the one-versus-one (OvO) strategy. For N classes you need to train N × (N − 1) / 2 classifiers; for MNIST that means 45 binary classifiers, which is quite time-consuming and cumbersome. The main benefit of OvO is that each classifier only needs to be trained on the part of the training set containing the two classes it has to distinguish.
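As a quick check of that formula (plain Python arithmetic, nothing sk-learn specific):

n_classes = 10
n_pairs = n_classes * (n_classes - 1) // 2    # 10 * 9 / 2 = 45 pairwise classifiers for MNIST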

Some algorithms, such as SVMs, scale poorly as the training set grows, so OvO suits them better: it is faster to train many classifiers on many small training sets than to train a few classifiers on one large training set. For most binary classification algorithms, however, OvA is the better choice.
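As an aside, we can see this with sk-learn's SVC, which uses OvO under the hood for multiclass data. A rough sketch (the 1,000-sample slice is an arbitrary choice just to keep the SVM fit fast):

from sklearn.svm import SVC

svm_clf = SVC(decision_function_shape='ovo')
svm_clf.fit(X_train[:1000], y_train[:1000])       # small subset only; the full MNIST set would be slow
svm_clf.decision_function(X_test[:1]).shape       # (1, 45) when all 10 digits appear: one score per pair of classes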

In sk-learn, if you use a binary classification algorithm for a multiclass task, it automatically runs OvA (except for SVM classifiers, for which it uses OvO). Let's try SGDClassifier:

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
sgd_clf.predict(X_test[:5])
>[7 2 1 0 4]

As you can see, the usage is very simple. Under the hood, sk-learn actually trained 10 binary classifiers; at prediction time the image is given to all 10 of them, and the class whose classifier returns the highest decision score is taken as the prediction.

To verify this, we can call the decision_function() method: instead of returning a single score per instance, it returns 10 scores, one for each class:

import numpy as np

digit_scores = sgd_clf.decision_function(X_test[:1])
digit_scores
>array([[-27972.77566096, -52417.77039463, -14344.98217961,
         -1308.44575644, -19922.84531732,  -9208.91066356,
        -38331.13646795,   8007.54256279,  -4273.31795296,
         -5951.32911022]])

np.argmax(digit_scores)
>7

sgd_clf.classes_
>array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

sgd_clf.classes_[7]
>7

We can see that the highest score is the one at index 7, so in the end the image is classified as the digit 7.

Note that once a classifier is trained, it stores the list of target classes in its classes_ attribute, sorted by value. In this example the index of each class in the classes_ array happens to match the class itself (for example, index 5 holds the class 5), but in general you will not be that lucky.
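For instance, here is a contrived sketch on random data (not the MNIST example) where the labels are only 3, 7 and 9, so the index in classes_ no longer matches the class value:

import numpy as np
from sklearn.linear_model import SGDClassifier

X_small = np.random.rand(30, 784)           # fake data, just so there is something to fit
y_small = np.array([3, 7, 9] * 10)          # only three classes: 3, 7 and 9
clf = SGDClassifier(random_state=42).fit(X_small, y_small)
clf.classes_                                # array([3, 7, 9]): index 0 holds class 3, not class 0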

If we want to force sk-learn to use one-versus-one or one-versus-all, we can use the OneVsOneClassifier or OneVsRestClassifier classes. Simply create an instance of a binary classifier and pass it to the constructor. For example, the following code creates a multiclass classifier using the OvO strategy, based on SGDClassifier:

from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict(X_test[:5])
>array([7, 2, 1, 0, 4], dtype=uint8)

len(ovo_clf.estimators_)
>45

We can see that a total of 45 binary classifiers were trained.
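For completeness, a symmetric sketch forcing OvA with OneVsRestClassifier, which only trains one classifier per class:

from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_clf.fit(X_train, y_train)
len(ovr_clf.estimators_)                    # 10 estimators, one per digit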

Now let's train a RandomForestClassifier; it is just as simple to use and noticeably faster than the previous one:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)
forest_clf.predict(X_test[:5])
>array([7, 2, 1, 0, 4], dtype=uint8)

This time sk-learn did not need to run OvA or OvO, because a random forest classifier can classify instances into multiple classes directly. We can call predict_proba() to get the probability that the classifier assigns to each class for a given instance:

forest_clf.predict_proba(X_test[1:2])
>array([[0. , 0. , 0.7, 0.2, 0. , 0. , 0.1, 0. , 0. , 0. ]])

We can see that the classifier is 70% confident that this image belongs to class index 2 (counting from 0), i.e. the digit 2. It also thinks the image could be a 3 or a 6, with probabilities of 20% and 10% respectively.

Next, let's evaluate these classifiers using cross-validation, as before. We use the cross_val_score() method to assess the accuracy of SGDClassifier:

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')
>array([0.87082583, 0.87089354, 0.88628294])

We can see that accuracy reaches roughly 87–88% on every fold. For reference, a completely random classifier would only get around 10% accuracy; a quick sanity check of that baseline is sketched below.
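A rough baseline check using sk-learn's DummyClassifier (it predicts classes uniformly at random, so the exact numbers will vary but should hover around 0.10):

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy='uniform', random_state=42)          # predicts a class uniformly at random
cross_val_score(dummy_clf, X_train, y_train, cv=3, scoring='accuracy')    # roughly 0.10 per fold

Compared with that 10% baseline our classifier is clearly much better, but we can still do better. For example, simply standardizing the inputs helps: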

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring='accuracy')
>array([0.89957009, 0.89344467, 0.89963495])

We can see that accuracy rises to about 89–90%. Now that we have a trained model, following the usual machine-learning project workflow, the next step is to try to improve it. One way to do that is error analysis, which is covered in the next post.
