Machine Learning Basics 09 - Reviewing Classification Algorithms (Based on the Pima Indians Diabetes Dataset)

Algorithm review is one of the main ways to choose a suitable machine learning algorithm. Before the algorithms are reviewed, there is no way to know which one will be the most effective for the problem; a set of experiments has to be designed to find out. In this chapter we will review six machine learning classification algorithms with scikit-learn, compare their evaluation metrics, and choose an appropriate algorithm.

How to review classification algorithms for machine learning?

Before the algorithms are reviewed, there is no way to judge which one is the most effective for the dataset and will produce the optimal model. We have to run a series of experiments to see which algorithm works best on the problem and then select it. This process is called algorithm review.

When choosing an algorithm, change the way you think about the question: instead of asking which algorithm should be used on the data, ask which algorithms should be reviewed with the data. Make a guess first about which algorithm is likely to give the best result; this is a good way to train your intuition about the data. I strongly recommend running different algorithms on the same dataset, reviewing how effective each one is, and then picking the most effective algorithm.
Here are some suggestions for reviewing algorithms:

  • Try a mix of algorithm representations (for example, instance-based methods and trees).
  • Try a mix of learning algorithms (different algorithms for learning the same type of representation).
  • Try a mix of model types (for example, parametric and nonparametric models).

Next, several common classification algorithms will be introduced.

In classification there are many types of classifiers: linear classifiers, Bayesian classifiers, distance-based classifiers, and so on. Six classification algorithms are introduced here, starting with two linear algorithms:

  • Logistic regression.
  • Linear discriminant analysis.

followed by four nonlinear algorithms:

  • K-nearest neighbors.
  • Bayesian classifier.
  • Classification and regression trees.
  • Support vector machines.

Next, we will continue to use the Pima Indians dataset to review the algorithms, using 10-fold cross-validation to evaluate each algorithm's accuracy. The mean accuracy is used as a standardized score for each algorithm, which reduces the impact of an uneven data split on the comparison.

Both logistic regression and linear discriminant analysis assume that the input data conforms to a Gaussian distribution.

Logistic Regression

Regression is an easy model to understand: it is simply y = f(x), a relationship between an independent variable x and a dependent variable y. It is like a doctor diagnosing a patient: first observing, listening, questioning, and taking the pulse, and then deciding whether the patient is ill and what the illness is. The "observing, listening, questioning, and pulse-taking" corresponds to collecting the independent variable x, that is, the feature data; deciding whether the patient is ill corresponds to the dependent variable y, that is, the predicted class.

Logistic regression is actually a classification algorithm rather than a regression algorithm. It uses known independent variables to predict the value of a discrete dependent variable (such as 0/1, yes/no, true/false). Put simply, it predicts the probability of an event by fitting the data to a logistic function, so its output is a probability value between 0 and 1, which makes it well suited to binary classification problems.
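As a small illustration of why the output lies between 0 and 1, here is a minimal sketch of the logistic (sigmoid) function itself; the input values are arbitrary examples and are not taken from the dataset:

import numpy as np

# Minimal sketch: the logistic (sigmoid) function maps any real-valued score z
# (for example z = w.x + b) into a probability between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0))  # roughly 0.119, 0.5, 0.881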

The implementation class in scikit-learn is LogisticRegression. The code is shown below:

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Data preprocessing
path = r'D:\down\archive\diabetes.csv'
data = pd.read_csv(path)

# Print the column names
print(data.columns)

# Convert the data to a NumPy array
array = data.values
# Split the data: the first 8 columns are features, the last column is the label
X = array[:, 0:8]
Y = array[:, 8]

num_folds = 10
seed = 7

# 10-fold cross-validation
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model = LogisticRegression()

result = cross_val_score(model, X, Y, cv=kfold)

print("Algorithm evaluation result: %.3f (%.3f)" % (result.mean(), result.std()))



Output:

Algorithm evaluation result: 0.776 (0.045)

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA), also known as Fisher's Linear Discriminant (FLD), is a classic algorithm in pattern recognition; it was introduced into the fields of pattern recognition and artificial intelligence by Belhumeur in 1996.

The basic idea of linear discriminant analysis is to project high-dimensional pattern samples onto the best discriminant vector space, which both extracts classification information and compresses the dimensionality of the feature space. After projection, the samples have the largest between-class distance and the smallest within-class distance in the new subspace; in other words, the patterns have the best separability in that space.

It is therefore an effective feature extraction method: it maximizes the between-class scatter matrix and minimizes the within-class scatter matrix of the projected samples. Like principal component analysis, linear discriminant analysis is widely used for dimensionality reduction.
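Because LDA doubles as a dimensionality-reduction technique, it can also be used as a transformer in scikit-learn. A minimal sketch is shown below; it assumes X and Y have already been loaded as in the evaluation code that follows, and for a binary problem LDA can project onto at most one discriminant axis:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hedged sketch: LDA as a dimensionality-reduction transformer.
# With two classes there is at most n_classes - 1 = 1 discriminant component.
lda = LinearDiscriminantAnalysis(n_components=1)
X_projected = lda.fit_transform(X, Y)  # X, Y loaded as in the evaluation code below
print(X_projected.shape)               # (768, 1) for the Pima Indians dataset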

The implementation class in scikit-learn is LinearDiscriminantAnalysis. The code is shown below:

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.model_selection import KFold, cross_val_score

# Data preprocessing
path = r'D:\down\archive\diabetes.csv'
data = pd.read_csv(path)

# Print the column names
print(data.columns)

# Convert the data to a NumPy array
array = data.values
# Split the data: the first 8 columns are features, the last column is the label
X = array[:, 0:8]
Y = array[:, 8]

num_folds = 10
seed = 7

# 10-fold cross-validation
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model = LinearDiscriminantAnalysis()

result = cross_val_score(model, X, Y, cv=kfold)

print("Algorithm evaluation result: %.3f (%.3f)" % (result.mean(), result.std()))



Output:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
Algorithm evaluation result: 0.767 (0.048)

Nonlinear Algorithms

Four nonlinear algorithms are introduced below: K-nearest neighbors (KNN), the Bayesian classifier, classification and regression trees (CART), and support vector machines (SVM).

K-Nearest Neighbors

The K-nearest neighbors algorithm is a theoretically mature method and one of the simplest machine learning algorithms.

In KNN, the distance between objects is used as a measure of their dissimilarity, which avoids the problem of matching objects directly; the distance is usually the Euclidean or Manhattan distance. The decision is then made from the classes of the k nearest neighbors taken together rather than from a single object's class, which is an advantage of the KNN algorithm. The implementation class in scikit-learn is KNeighborsClassifier. The code is shown below:

import pandas as pd

from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Data preprocessing
path = r'D:\down\archive\diabetes.csv'
data = pd.read_csv(path)

# Convert the data to a NumPy array
array = data.values
# Split the data: the first 8 columns are features, the last column is the label
X = array[:, 0:8]
Y = array[:, 8]

num_folds = 10
seed = 7

# 10-fold cross-validation
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
model = KNeighborsClassifier()

result = cross_val_score(model, X, Y, cv=kfold)

print("Algorithm evaluation result: %.3f (%.3f)" % (result.mean(), result.std()))



Output:

Algorithm evaluation result: 0.711 (0.051)
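The default KNeighborsClassifier uses k = 5 neighbors and the Euclidean (Minkowski, p = 2) distance. A hedged sketch of how the parameters described above could be varied is shown below; the values are purely illustrative and have not been tuned for this dataset:

from sklearn.neighbors import KNeighborsClassifier

# Illustrative parameter choices (not tuned): n_neighbors sets how many neighbors vote,
# metric sets the distance measure used to find them.
model_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
model_manhattan = KNeighborsClassifier(n_neighbors=7, metric='manhattan')
# Either model can be evaluated with cross_val_score exactly as in the code above.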

Bayesian Classifier

The classification principle of a Bayesian classifier is to start from an object's prior probability, use Bayes' formula to compute its posterior probability for every class (that is, the probability that the object belongs to each class), and then assign the object to the class with the largest posterior probability. In this sense, the Bayesian classifier is optimal with respect to the minimum error rate.
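As a toy illustration of that rule, the snippet below computes posterior probabilities from made-up prior and likelihood values; the numbers are purely hypothetical and are not taken from the dataset:

# Toy Bayes-rule computation with hypothetical numbers (not from the dataset):
# posterior P(class | x) is proportional to P(x | class) * P(class).
prior = {'diabetic': 0.35, 'healthy': 0.65}
likelihood = {'diabetic': 0.08, 'healthy': 0.02}  # hypothetical P(observed feature | class)

evidence = sum(prior[c] * likelihood[c] for c in prior)
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

print(posterior)                          # the two posteriors sum to 1
print(max(posterior, key=posterior.get))  # the predicted class has the largest posterior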

Whichever class has the largest posterior probability is taken to be the class of the item being classified. Bayesian classifiers have the following characteristics:

  • A Bayesian classifier is a statistically based classifier that assigns a given sample to a class according to the probability that the sample belongs to that class.
  • Its theoretical basis is Bayes' theorem.
  • A simple form of Bayesian classifier is the naive Bayes classifier, whose performance is comparable to classifiers such as random forests and neural networks.
  • A Bayesian classifier is an incremental classifier.

Here, too, the input data is assumed to follow a Gaussian distribution. The implementation class in scikit-learn is GaussianNB. The code is shown below:

import pandas as pd

from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB


# Data preprocessing
path = r'D:\down\archive\diabetes.csv'
data = pd.read_csv(path)

# Convert the data to a NumPy array
array = data.values
# Split the data: the first 8 columns are features, the last column is the label
X = array[:, 0:8]
Y = array[:, 8]

num_folds = 10
seed = 7

# 10-fold cross-validation
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
# Gaussian naive Bayes
model = GaussianNB()

result = cross_val_score(model, X, Y, cv=kfold)

print("Algorithm evaluation result: %.3f (%.3f)" % (result.mean(), result.std()))



Output:

Algorithm evaluation result: 0.759 (0.039)

Classification and Regression Trees

Classification and regression trees are abbreviated as CART; they are a type of decision tree whose construction is based on the Gini index.
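As a quick illustration of the Gini index, the sketch below computes the impurity for a few assumed class proportions; a pure node has a Gini index of 0 and a 50/50 binary node has 0.5:

# Gini index of a node: 1 - sum of squared class proportions (assumed example values).
def gini(proportions):
    return 1.0 - sum(p ** 2 for p in proportions)

print(gini([0.5, 0.5]))   # 0.5  -> most impure two-class node
print(gini([0.9, 0.1]))   # 0.18 -> much purer node
print(gini([1.0, 0.0]))   # 0.0  -> pure node
# CART chooses the split that produces children with the lowest weighted Gini index.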

CART assumes the decision tree is binary: the values of an internal node's feature test are "yes" and "no", the left branch takes the value "yes" and the right branch the value "no". Such a tree recursively bisects each feature, divides the input space (feature space) into a finite number of cells, and determines the predicted probability distribution on those cells, that is, the conditional distribution of the output given the input.

The CART algorithm consists of the following two steps:

  • Tree generation: grow a decision tree from the training dataset; the generated tree should be as large as possible.
  • Tree pruning: prune the generated tree using a validation dataset and select the optimal subtree, using the minimum loss function as the pruning criterion (a sketch follows this list).
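In scikit-learn the pruning step corresponds to cost-complexity pruning, controlled by the ccp_alpha parameter of DecisionTreeClassifier. A hedged sketch is shown below; the value 0.01 is purely illustrative and has not been tuned for this dataset:

from sklearn.tree import DecisionTreeClassifier

# Hedged sketch: ccp_alpha > 0 prunes the tree by trading tree size against training fit.
# The value below is illustrative only; in practice it would be chosen by cross-validation.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)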

A decision tree is generated by recursively constructing a binary tree: a regression tree uses the squared-error minimization criterion and a classification tree uses the Gini-index minimization criterion to perform feature selection. A CART model can be built with the DecisionTreeClassifier class in scikit-learn. The code is shown below:

import pandas as pd

from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Data preprocessing
path = r'D:\down\archive\diabetes.csv'
data = pd.read_csv(path)

# Convert the data to a NumPy array
array = data.values
# Split the data: the first 8 columns are features, the last column is the label
X = array[:, 0:8]
Y = array[:, 8]

num_folds = 10
seed = 7

# 10-fold cross-validation
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
# Classification and regression tree (CART)
model = DecisionTreeClassifier()

result = cross_val_score(model, X, Y, cv=kfold)

print("Algorithm evaluation result: %.3f (%.3f)" % (result.mean(), result.std()))



Output:

Algorithm evaluation result: 0.695 (0.051)

Support Vector Machines

The support vector machine was first proposed by Corinna Cortes and Vladimir Vapnik in 1995. It has many unique advantages for small-sample, nonlinear, and high-dimensional pattern recognition, and it can also be extended to other machine learning problems such as function fitting.

In machine learning, a support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training samples, each labeled with its category, the SVM training algorithm builds a model that assigns new data instances to one of the categories, making it a non-probabilistic binary linear classifier.

Intuitively, an SVM model maps the instances to points in space so that instances of different classes are separated by a gap that is as wide as possible. New instances are then mapped into the same space and predicted to belong to one class or the other depending on which side of the gap they fall on. SVM has since been extended to multi-class problems, and an SVM model can be built with the SVC class in scikit-learn.
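A hedged sketch of the most common SVC parameters is shown below; the kernel controls the mapping into the new space described above and C controls how strictly the margin is enforced. The values are illustrative and have not been tuned for this dataset:

from sklearn.svm import SVC

# Illustrative parameter choices (not tuned): the RBF kernel gives a nonlinear boundary,
# the linear kernel gives a straight separating hyperplane; C trades margin width for errors.
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale')
linear_svm = SVC(kernel='linear', C=0.5)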

The code is shown below:

import pandas as pd

from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Data preprocessing
path = r'D:\down\archive\diabetes.csv'
data = pd.read_csv(path)

# Convert the data to a NumPy array
array = data.values
# Split the data: the first 8 columns are features, the last column is the label
X = array[:, 0:8]
Y = array[:, 8]

num_folds = 10
seed = 7

# 10-fold cross-validation
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
# Support vector machine
model = SVC()

result = cross_val_score(model, X, Y, cv=kfold)

print("Algorithm evaluation result: %.3f (%.3f)" % (result.mean(), result.std()))



Output:

Algorithm evaluation result: 0.760 (0.035)

This chapter introduced six classification algorithms and their implementations in scikit-learn. The algorithms fall broadly into linear, distance-based, tree-based, and statistical algorithms; each has different applicable scenarios and different requirements on the dataset.

Several algorithms were reviewed here on the Pima Indians dataset, which is an effective way to choose an appropriate algorithm and model.

The evaluation results of the six algorithms are summarized in the table below:

Algorithm                            scikit-learn class            Evaluation result
Logistic regression                  LogisticRegression            0.776 (0.045)
Linear discriminant analysis         LinearDiscriminantAnalysis    0.767 (0.048)
K-nearest neighbors                  KNeighborsClassifier          0.711 (0.051)
Bayesian classifier                  GaussianNB                    0.759 (0.039)
Classification and regression tree   DecisionTreeClassifier        0.695 (0.051)
Support vector machine               SVC                           0.760 (0.035)
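For convenience, the six evaluations can also be run in a single loop. A minimal sketch is shown below; it reuses the same dataset path and KFold settings as the code above, so the scores should be close to those in the table:

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the Pima Indians dataset and split features/label as in the examples above
data = pd.read_csv(r'D:\down\archive\diabetes.csv')
array = data.values
X, Y = array[:, 0:8], array[:, 8]

# Same 10-fold cross-validation settings as above
kfold = KFold(n_splits=10, random_state=7, shuffle=True)

models = {
    'LogisticRegression': LogisticRegression(),
    'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'GaussianNB': GaussianNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'SVC': SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, Y, cv=kfold)
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))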

Source: blog.csdn.net/hai411741962/article/details/132478062