Comparison of classifiers commonly used in machine learning

Traditional supervised learning splits into classification and regression: classification targets discrete labels, while regression targets continuous values. This article compares, using the sklearn library, the regressors commonly applied to continuous data. On top of data preprocessing, the model must then be evaluated and selected, usually with cross-validation (CV), as in the sketch below.
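A minimal sketch of that cross-validation workflow; the diabetes dataset and the ridge regressor here are illustrative choices, not from the original post:

```python
# Minimal CV-based model evaluation sketch (dataset/model are illustrative).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="r2")
print("5-fold R^2 scores:", scores)
print("mean R^2:", scores.mean())
```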

1. Linear regression

Disadvantages: As the name suggests, linear regression assumes the data follow a linear relationship. This assumption limits the model's accuracy, because real data contain noise and are rarely strictly linear.

Advantages: Under this assumption, linear regression has a closed-form solution: solving the normal equation yields the weights, and hence y_predict, directly.
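A small sketch of that normal-equation solution in NumPy; the toy data and true weights below are invented for illustration:

```python
import numpy as np

# Normal equation for least squares: w = (X^T X)^{-1} X^T y.
# Toy data; a column of ones absorbs the intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)       # closed-form solution
y_predict = Xb @ w
print("intercept and weights:", w)
```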

2. Logistic regression

Disadvantages: Derived from linear regression, it compresses the linear output into the (0, 1) range through the sigmoid function. It shares the drawbacks of linear regression, and it also requires that the data have no missing values.

Advantages: There are two ways to fit it: a full-batch solver that optimizes the loss to convergence when accuracy is required, and SGD iteration when time efficiency is required.
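A sketch of that trade-off in sklearn: LogisticRegression with a batch solver versus SGDClassifier with logistic loss. The dataset is an illustrative choice, and the loss name "log_loss" assumes a recent sklearn version:

```python
# Batch solver vs SGD for logistic regression (dataset is illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

batch = LogisticRegression(max_iter=1000).fit(X_train, y_train)
sgd = SGDClassifier(loss="log_loss", max_iter=1000).fit(X_train, y_train)
print("batch solver accuracy:", batch.score(X_test, y_test))
print("SGD accuracy:         ", sgd.score(X_test, y_test))
```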

3. SVM (Support Vector Machine)

Disadvantages: The computational cost is relatively high. SVM maps low-dimensional, linearly inseparable data into a high-dimensional space through kernel functions (RBF, poly, linear, sigmoid) and separates the classes with a hyperplane there.

Advantages: SVM classifies using only the support vectors, that is, it does not need to keep all the samples; only the small subset of samples near the decision boundary determines the model, which saves memory.

With sklearn's default configuration, the accuracy of the kernels is roughly: RBF > poly > linear.
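A quick sketch comparing the kernels named above with sklearn defaults; the digits dataset is an illustrative choice, and the ordering will vary with the data and its scaling:

```python
# Compare SVM kernels under sklearn defaults (dataset is illustrative).
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
for kernel in ("rbf", "poly", "linear", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel:8s} mean accuracy: {scores.mean():.3f}")
```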

4. Naive Bayes

Disadvantages: This model is best suited to text samples. Naive Bayes assumes the features are mutually independent, so it performs poorly when the features are strongly correlated.

Advantages: Thanks to that same independence assumption, the probability calculation is greatly simplified, saving memory and time.
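A minimal text-classification sketch, the setting where naive Bayes works best; the tiny corpus and labels below are invented purely for illustration:

```python
# Naive Bayes on a toy text corpus (documents/labels are invented).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting at noon",
        "win cash now", "lunch with team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["buy cheap cash"])))  # expected: spam (1)
```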

5. K nearest neighbors (KNN)

Disadvantages: k has to be set manually, and the algorithm's prediction cost is very high, since each query is compared against the stored training samples.

Advantages: "He who stays near vermilion gets stained red; he who stays near ink gets stained black" — KNN predicts from the labels of nearby points. It is a non-parametric model with no training phase.

6. Decision Tree (DT)

Disadvantages: Training can be time-consuming.

Advantages: The model with the lowest data requirements: the data may contain missing values, be nonlinear, and mix feature types. It is the model closest to human logical reasoning, with good interpretability.
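A sketch of that interpretability: a fitted tree's splits can be printed as human-readable rules. The iris dataset and the depth limit are illustrative choices:

```python
# Print a decision tree's learned splits as readable rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
print(export_text(tree, feature_names=data.feature_names))
```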

7. Ensemble models

Random forest: Draws random samples to build multiple classifiers and decides the final result by majority vote; the individual classifiers are unrelated to each other.

Gradient boosting: Builds a strong learner from weak ones; the most typical representative is AdaBoost ("three cobblers with their wits combined equal Zhuge Liang" — many weak learners together match a single strong one). The weak classifiers are combined according to a weighting scheme into a strong classifier, each depending on the ones before it, and the final prediction combines all of them.

Generally, GB > RF > DT, as checked in the sketch below.

However, the disadvantage of ensemble models is that the random sampling makes their results somewhat non-deterministic.
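A rough check of the GB > RF > DT ordering under sklearn defaults; the diabetes dataset and the fixed random_state are illustrative choices, not from the original post:

```python
# Compare DT, RF, and GB regressors under sklearn defaults with 5-fold CV.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
for name, model in [("DT", DecisionTreeRegressor(random_state=0)),
                    ("RF", RandomForestRegressor(random_state=0)),
                    ("GB", GradientBoostingRegressor(random_state=0))]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```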


The above compares the commonly used regressors. Knowing the strengths and weaknesses of each, you can pick the right one for your own data. The following table compares them on the same task by their residuals. Under sklearn's default parameters, the accuracy ranking is: ensemble models > DT > SVM > KNN > linear models.

| Regressor | Import statement | Construction | Residual (%) |
|---|---|---|---|
| Linear regression | from sklearn.linear_model import LinearRegression | lr = LinearRegression() | 5.223 |
| SGD regression (L2 penalty) | from sklearn.linear_model import SGDRegressor | sgdr = SGDRegressor(penalty="l2") | 5.780 |
| SGD regression (L1 penalty) | (same as above) | sgdr = SGDRegressor(penalty="l1") | 5.765 |
| SVR (rbf kernel) | from sklearn.svm import SVR (penalty parameter: C; kernel coefficient: gamma) | svr = SVR(kernel="rbf") | 0.627 |
| SVR (sigmoid kernel) | (same as above) | svr = SVR(kernel="sigmoid") | 82.507 |
| SVR (poly kernel) | (same as above) | svr = SVR(kernel="poly") | 20.862 |
| SVR (linear kernel) | (same as above) | svr = SVR(kernel="linear") | 6.451 |
| KNN (k=5, uniform weights) | from sklearn.neighbors import KNeighborsRegressor | knn = KNeighborsRegressor(n_neighbors=5, weights="uniform") | 0.731 |
| KNN (k=5, distance weights) | (same as above) | knn = KNeighborsRegressor(n_neighbors=5, weights="distance") | 1.087 |
| Decision tree | from sklearn.tree import DecisionTreeRegressor | dt = DecisionTreeRegressor() | 0.447 |
| Random forest | from sklearn.ensemble import RandomForestRegressor | rf = RandomForestRegressor() | 0.270 |
| Extra trees | from sklearn.ensemble import ExtraTreesRegressor | et = ExtraTreesRegressor() | 0.246 |
| Gradient boosting | from sklearn.ensemble import GradientBoostingRegressor | gb = GradientBoostingRegressor() | 0.284 |

Original address: https://blog.csdn.net/july_sun/article/details/53088673
