The four most common classification models in machine learning

Author: Jason Brownlee
Translation: Hou Boxue

Foreword

Machine learning is a field of study concerned with algorithms that learn from examples.

Classification is a task that requires machine learning algorithms to learn how to assign class labels to input examples.

An easy-to-understand example is classifying emails as "spam" or "not spam" (an either/or outcome, which is the hallmark of binary classification; we will return to binary classification later).

You may encounter many different types of classification tasks in machine learning, and each type calls for a corresponding modeling approach.

So in this article, you will learn about different types of classification predictive modeling methods in machine learning.

  • Classification predictive modeling assigns class labels to input samples;

  • Binary classification refers to predicting one of two classes (either/or), while multi-class classification involves predicting one of more than two classes;

  • Multi-label classification involves predicting one or more classes for each sample;

  • In imbalanced classification, samples are not equally distributed across classes;

Overview

This article is divided into five parts:

  1. Classification Predictive Modeling

  2. Binary classification

  3. Multi-class classification

  4. Multi-label classification

  5. Imbalanced classification

Classification Predictive Modeling

In machine learning, classification [1] refers to a predictive modeling problem in which a class label is predicted for a given example of input data.

Examples of classification problems include:

  • Categorize whether an email is spam or not.

  • Given a handwritten character, classify it as one of the known characters.

  • Given recent user behavior, classify whether a user will churn or not.

From a modeling perspective, classification requires a training dataset that contains many examples of inputs and outputs from which to learn.

The model uses the training dataset to learn how best to map input samples to specific class labels.

Therefore, the training dataset must be sufficiently representative and have many samples for each class label.

Class labels are often string values, such as "spam" and "not spam", and must be mapped to numeric values before being given to a modeling algorithm. This is often referred to as label encoding [2], where each class label is assigned a unique integer, such as "not spam" = 0, "spam" = 1.
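
As a quick illustration (my addition, not part of the original article), scikit-learn's LabelEncoder performs exactly this kind of mapping; note that it assigns integers to classes in sorted order:

# label-encoding sketch: map string class labels to unique integers
from sklearn.preprocessing import LabelEncoder
labels = ["spam", "not spam", "not spam", "spam"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoded)           # [1 0 0 1] -- classes are sorted, so "not spam" = 0, "spam" = 1
print(encoder.classes_)  # ['not spam' 'spam']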

There are many different types of classification algorithms that can model classification prediction problems.

There is no fixed rule for matching an algorithm to a specific classification problem. Instead, the choice is usually made through controlled experiments, where the practitioner discovers which algorithm and algorithm configuration achieve the best performance on a given classification task.

Classification predictive modeling algorithms are evaluated based on their predictions. Classification accuracy is a popular metric that evaluates the performance of a model based on its predicted class labels. Classification accuracy is not perfect, but it is a good starting point for many classification tasks.

Some tasks may instead require predicting a probability of class membership for each sample, rather than a hard label. This captures additional uncertainty in the prediction, and a common way to evaluate predicted probabilities is the area under the ROC curve (ROC AUC).
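
A hedged sketch of this idea (my addition): a scikit-learn classifier can output class-membership probabilities via predict_proba(), which can then be scored with roc_auc_score():

# sketch: predict probabilities and evaluate them with ROC AUC
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of class 1 for each test sample
print(roc_auc_score(y_test, probs))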

There are four main types of classification tasks you may encounter, which are:

  • Binary classification

  • Multi-class classification

  • Multi-label classification

  • Imbalanced classification

Let's take a closer look at each type in turn.

Binary Classification Model

Binary classification [3] refers to classification tasks with two class labels.

Examples include:

  • Email spam detection (spam or not)

  • Churn prediction (churn or not)

  • Conversion prediction (buy or not)

Typically, binary classification tasks involve one class that represents the normal state and another class that represents the abnormal state.

For example, "not spam" is the normal state, while "spam" is the abnormal state. Another example from medical testing is that "cancer not detected" is the normal state, while "cancer detected" is the abnormal state.

The class in the normal state is assigned the class label 0, and the class in the abnormal state is assigned the class label 1.

A binary classification task is typically modeled with a model that predicts a Bernoulli probability distribution for each sample.

The Bernoulli distribution is a discrete probability distribution that covers the case where the outcome of an event is either 0 or 1. For classification, this means the model predicts the probability that a sample belongs to class 1, that is, the abnormal state.

Popular algorithms that can be used for binary classification include:

  • Logistic Regression

  • k-Nearest Neighbors

  • Decision Trees

  • Support Vector Machine

  • Naive Bayes

Some algorithms, such as logistic regression and support vector machines, are designed specifically for binary classification and do not natively support more than two classes.

Next, let's generate a dataset to develop an intuition for binary classification problems.

We can generate a synthetic binary classification dataset using the make_blobs() function [4] .

The sample code below generates a dataset with 1,000 samples that belong to one of two classes, each sample having two input features.

# example of a binary classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color it by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Running the example first summarizes the created dataset, showing the 1,000 samples split into input (X) and output (y) elements.

The distribution of class labels is then summarized, showing that each sample belongs to class 0 or class 1 and that there are 500 samples in each class.

Next, the first 10 samples in the dataset are summarized, showing that the input values are numeric and the target values are integers (0 or 1) that represent class membership.

(1000, 2) (1000,)

Counter({0: 500, 1: 500})

[-3.05837272  4.48825769] 0
[-8.60973869 -3.72714879] 1
[1.37129721 5.23107449] 0
[-9.33917563 -2.9544469 ] 1
[-11.57178593  -3.85275513] 1
[-11.42257341  -4.85679127] 1
[-10.44518578  -3.76476563] 1
[-10.44603561  -3.26065964] 1
[-0.61947075  3.48804983] 0
[-10.91115591  -4.5772537 ] 1

Finally, a scatter plot of the input variables in the dataset is created, with the points colored according to their class value.

We can intuitively distinguish two different clusters.

Scatter plot of binary classification dataset
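
As a follow-on sketch (my addition, not part of the original tutorial), we can fit one of the algorithms listed above, such as logistic regression, on this synthetic dataset and report its accuracy on a held-out split:

# sketch: fit a binary classifier on the synthetic two-cluster dataset
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X, y = make_blobs(n_samples=1000, centers=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LogisticRegression().fit(X_train, y_train)
# the two blobs are well separated, so accuracy should be near 1.0
print(accuracy_score(y_test, model.predict(X_test)))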

Multi-Class Classification Model

Multi-class classification [5] refers to classification tasks with more than two class labels.

Examples include:

  • Face classification

  • Classification of plant species

  • Optical Character Recognition

Unlike binary classification, multiclass classification has no concept of normal and abnormal outcomes. Instead, samples are classified as belonging to one of a series of known categories.

On some problems, the number of class labels can be very large. For example, a model can predict that a photo belongs to one of thousands or tens of thousands of faces in a facial recognition system.

Problems involving predicting sequences of words, such as text translation models, can also be viewed as a special type of multi-class classification. Each word in the sequence of words to be predicted involves a multi-class classification, where the vocabulary size defines the number of possible classes that can be predicted, which may be in the thousands of words.

Multi-class classification tasks are often modeled with a model that predicts a Multinoulli probability distribution for each sample.

The Multinoulli distribution is a discrete probability distribution that covers the case where an event has a categorical outcome k in {1, 2, 3, ..., K}. For classification, this means the model predicts the probability that a sample belongs to each class label.

Many of the algorithms used for binary classification can also be used to solve multiclass problems.

Popular algorithms that can be used for multiclass classification include:

  • k-Nearest Neighbors

  • Decision Trees

  • Naive Bayes

  • Random Forest

  • Gradient Boosting

Algorithms designed for binary classification can also be adapted to multi-class problems. This involves fitting multiple binary classification models, either one model for each class versus all other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-one).

  • One-vs-Rest: Fit one binary classification model for each class vs. all other classes.

  • One-vs-One: Fit one binary classification model for each pair of classes.

Binary classification algorithms that can use these strategies for multi-class classification include the following (a short sketch of both strategies appears after the list):

  • Logistic Regression

  • Support Vector Machine
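
A minimal sketch of both strategies (my addition), using scikit-learn's wrapper classes around logistic regression:

# sketch: adapt a binary algorithm to multi-class via one-vs-rest and one-vs-one
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # one model per class
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)   # one model per pair of classes
print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))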

Next, let's generate a dataset to develop an intuition for multi-class classification problems.

We can generate a synthetic multi-class classification dataset using the make_blobs() function [6].

The code below generates a dataset with 1,000 samples that belong to one of three classes, each sample having two input features.

# example of a multi-class classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define dataset
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color it by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Running the example first summarizes the created dataset, showing the 1,000 samples split into input (X) and output (y) elements.

The distribution of class labels is then summarized, showing that samples belong to class 0, class 1, or class 2 and that there are approximately 333 samples in each class.

Next, the first 10 samples in the dataset are shown, indicating that the input values are numeric and the target values are integers that represent class membership.

(1000, 2) (1000,)

Counter({0: 334, 1: 333, 2: 333})

[-3.05837272  4.48825769] 0
[-8.60973869 -3.72714879] 1
[1.37129721 5.23107449] 0
[-9.33917563 -2.9544469 ] 1
[-8.63895561 -8.05263469] 2
[-8.48974309 -9.05667083] 2
[-7.51235546 -7.96464519] 2
[-7.51320529 -7.46053919] 2
[-0.61947075  3.48804983] 0
[-10.91115591  -4.5772537 ] 1

Finally, a scatter plot of the input variables in the dataset is created, with the points colored according to their class value.

We can easily distinguish three different clusters.

Scatter plot for a multiclass classification dataset
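
As with the binary case, here is a short sketch (my addition) that fits one of the natively multi-class algorithms listed above, k-nearest neighbors, on this dataset:

# sketch: fit a natively multi-class algorithm on the three-cluster dataset
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = KNeighborsClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))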

Multi-Label Classification Model

Multi-label classification [7] refers to classification tasks that have two or more class labels, where one or more class labels may be predicted for each sample.

Consider the example of photo classification [8], where a given photo may contain multiple objects in the scene and the model may predict the presence of multiple known objects in the photo, such as "bicycle", "apple", "person", etc.

This differs from binary and multi-class classification, where a single class label is predicted for each sample.

Multi-label classification tasks are typically modeled with a model that predicts multiple outputs, where each output is predicted as a Bernoulli probability distribution. Essentially, the model makes multiple binary classification predictions for each sample.

Classification algorithms used for binary or multi-class classification cannot be used directly for multi-label classification. Specialized versions of standard classification algorithms, so-called multi-label versions of the algorithms, can be used instead, including:

  • Multi-label Decision Trees

  • Multi-label Random Forests

  • Multi-label Gradient Boosting

Another approach is to use a separate classification algorithm to predict the labels for each class, as in the sketch below.
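
One hedged way to realize this separate-model approach in scikit-learn (my addition) is the MultiOutputClassifier wrapper, which fits one copy of a binary classifier per label column:

# sketch: one binary classifier per label via MultiOutputClassifier
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
X, y = make_multilabel_classification(n_samples=1000, n_features=2, n_classes=3, n_labels=2, random_state=1)
# one LogisticRegression is fit independently for each of the three labels
model = MultiOutputClassifier(LogisticRegression()).fit(X, y)
print(model.predict(X[:5]))  # each row is a vector of three 0/1 labels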

Next, let's generate a dataset to develop an intuition for multi-label classification problems.

We can use the make_multilabel_classification() function [9] to generate a synthetic multi-label classification dataset.

The code below generates a dataset with 1,000 samples, each with two input features. There are three classes, each of which may take on one of two labels (0 or 1).

# example of a multi-label classification task
from sklearn.datasets import make_multilabel_classification
# define dataset
X, y = make_multilabel_classification(n_samples=1000, n_features=2, n_classes=3, n_labels=2, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])

Running the example first summarizes the created dataset, showing the 1,000 samples split into input (X) and output (y) elements.

Next, the first 10 samples in the dataset are shown, indicating that the input values are numeric and the target values are integers that represent class-label membership.

(1000, 2) (1000, 3)

[18. 35.] [1 1 1]
[22. 33.] [1 1 1]
[26. 36.] [1 1 1]
[24. 28.] [1 1 0]
[23. 27.] [1 1 0]
[15. 31.] [0 1 0]
[20. 37.] [0 1 0]
[18. 31.] [1 1 1]
[29. 27.] [1 0 0]
[29. 28.] [1 1 0]
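
For completeness, a small sketch (my addition): scikit-learn's random forest accepts this multi-label target format natively, so it can be fit on the dataset exactly as generated above:

# sketch: random forest supports multi-label targets natively in scikit-learn
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_multilabel_classification(n_samples=1000, n_features=2, n_classes=3, n_labels=2, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)
print(model.predict(X[:5]))  # one 0/1 prediction per label for each sample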

Imbalanced Classification Model

Imbalanced classification [10] refers to classification tasks in which the number of samples in each class is not evenly distributed.

Typically, imbalanced classification tasks are binary classification tasks where the majority of samples in the training dataset belong to the normal class and a minority of samples belong to the abnormal class.

Examples include:

  • Fraud identification

  • Outlier detection

  • Medical diagnostic tests

These problems are modeled as binary classification tasks, although specialized techniques may be required.

Specialized techniques can be used to change the composition of samples in the training dataset, by undersampling the majority class or oversampling the minority class; a resampling sketch follows the list below.

Examples include:

  • Random undersampling [11]

  • SMOTE oversampling [12]
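
Here is the resampling sketch (my addition); it assumes the third-party imbalanced-learn package, which provides both techniques:

# sketch: rebalance a skewed dataset by oversampling or undersampling
# assumes: pip install imbalanced-learn
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=1)
print(Counter(y))  # heavily skewed toward class 0
X_over, y_over = SMOTE().fit_resample(X, y)                 # synthesize new minority samples
X_under, y_under = RandomUnderSampler().fit_resample(X, y)  # discard majority samples
print(Counter(y_over), Counter(y_under))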

Specialized modeling algorithms, such as cost-sensitive machine learning algorithms, can also be used to pay more attention to the minority class when fitting the model to the training dataset; a sketch follows the list below.

Examples include:

  • Cost-sensitive Logistic Regression

  • Cost-sensitive Decision Trees

  • Cost-sensitive Support Vector Machines
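
As a hedged sketch of the cost-sensitive idea (my addition), many scikit-learn classifiers expose a class_weight parameter that penalizes errors on the minority class more heavily:

# sketch: cost-sensitive logistic regression via class weighting
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=1)
# 'balanced' weights classes inversely proportional to their frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
print(model.predict(X[:10]))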

Finally, alternative performance metrics may be required for evaluation, because classification accuracy can be misleading when the classes are imbalanced; a sketch follows the list below.

Examples include:

  • Precision

  • Recall

  • F-Measure
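
A short sketch (my addition) computing these metrics with scikit-learn on an imbalanced problem:

# sketch: precision, recall, and F-measure on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
yhat = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train).predict(X_test)
print(precision_score(y_test, yhat), recall_score(y_test, yhat), f1_score(y_test, yhat))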

Next, let's generate a dataset to develop an intuition for imbalanced classification problems.

We can use the make_classification() function [13] to generate a synthetic imbalanced binary classification dataset.

The code below generates a dataset with 1,000 samples that belong to one of two classes, each sample having two input features.

# example of an imbalanced binary classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, weights=[0.99,0.01], random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color it by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Running the example first summarizes the created dataset, showing the 1,000 samples split into input (X) and output (y) elements.

The distribution of class labels is then summarized, showing a severe class imbalance: about 980 samples belong to class 0 and about 20 samples belong to class 1.

Next, the first 10 samples in the dataset are shown, indicating that the input values are numeric and the target values are integers that represent class membership. In this case, we can see that most of the samples belong to class 0.

(1000, 2) (1000,)

Counter({0: 983, 1: 17})

[0.86924745 1.18613612] 0
[1.55110839 1.81032905] 0
[1.29361936 1.01094607] 0
[1.11988947 1.63251786] 0
[1.04235568 1.12152929] 0
[1.18114858 0.92397607] 0
[1.1365562  1.17652556] 0
[0.46291729 0.72924998] 0
[0.18315826 1.07141766] 0
[0.32411648 0.53515376] 0

Finally, create a scatterplot for the input variables in the dataset and color the points according to their class values.

We can see one large cluster of samples belonging to class 0 and a few scattered samples belonging to class 1. As can be seen, datasets with this imbalanced class property are more challenging to model.

Scatter plot of imbalanced binary classification dataset

Summary

This article presents different types of classification predictive modeling approaches in machine learning.

Specifically, the following points:

  • Classification predictive modeling assigns class labels to input samples;

  • Binary classification refers to predicting one of two classes, while multi-class classification involves predicting one of more than two classes;

  • Multi-label classification involves predicting one or more classes for each sample;

  • Imbalanced classification refers to the classification task when the samples are not equally distributed among the various categories;

References

[1] Classification: https://en.wikipedia.org/wiki/Statistical_classification

[2] Label encoding: https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/

[3] Binary classification: https://en.wikipedia.org/wiki/Binary_classification

[4] make_blobs() function: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

[5] Multi-class classification: https://en.wikipedia.org/wiki/Multiclass_classification

[6] make_blobs() function: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

[7] Multi-label classification: https://en.wikipedia.org/wiki/Multi-label_classification

[8] Photo classification: https://machinelearningmastery.com/object-recognition-with-deep-learning/

[9] make_multilabel_classification() function: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html

[10] Imbalanced classification: https://machinelearningmastery.com/what-is-imbalanced-classification/

[11] Random undersampling: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

[12] SMOTE oversampling: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

[13] make_classification() function: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
