Traditional supervised machine learning covers two tasks: classification and regression. Classification predicts discrete labels, while regression predicts continuous values. After data preprocessing, models are usually evaluated and selected with cross-validation (CV). This article compares several regressors on continuous data using the sklearn library:
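The cross-validation workflow mentioned above can be sketched as follows; the dataset and fold count here are illustrative choices, not from the original article:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Load a built-in continuous-target dataset (illustrative choice).
X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for evaluation,
# giving five R^2 scores instead of a single train/test split.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())
```

Averaging the per-fold scores gives a more stable estimate of generalization than a single split.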
1. Linear regression
Disadvantages: As the name suggests, linear regression assumes the data follow a linear relationship. This assumption limits the model's accuracy, because real data are rarely strictly linear due to noise.
Advantages: Under this assumption, linear regression has a closed-form solution obtained from the normal equation, which yields y_predict directly without iteration.
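The normal-equation solution mentioned above can be written out directly. This is a minimal sketch on synthetic near-linear data; the variable names and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)  # near-linear data, small noise

# Normal equation: w = (X^T X)^{-1} X^T y.
# Solving the linear system is more stable than forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
y_predict = X @ w
```

With so little noise, the recovered weights land very close to `true_w`, illustrating why the closed-form solution works well when the linearity assumption holds.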
2. Logistic regression
Disadvantages: Derived from linear regression; the linear output is squashed into the (0, 1) range by the sigmoid function. It shares linear regression's weaknesses and also requires data without missing values.
Advantages: There are two ways to fit it: a full-batch solver that converges precisely to the optimum, and stochastic gradient descent (SGD). Use the full-batch solver when accuracy matters and SGD iteration when time efficiency matters.
3. SVM (Support Vector Machine)
Disadvantages: The computational cost is relatively high. SVM maps low-dimensional, linearly inseparable data into a high-dimensional space through kernel functions (RBF, poly, linear, sigmoid) and separates the classes with a hyperplane there.
Advantages: SVM classifies using only the support vectors, that is, it does not need all the samples: only a small number of samples near the decision boundary are retained, which saves memory.
With sklearn's default configuration, the accuracy of the three main kernels is roughly RBF > poly > linear.
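The kernel comparison above can be reproduced with `SVR` and default parameters. The dataset here is illustrative, and the ranking among kernels can differ from dataset to dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVR is sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit one SVR per kernel with sklearn defaults (C=1.0, gamma="scale").
scores = {}
for kernel in ("rbf", "poly", "linear", "sigmoid"):
    svr = SVR(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = svr.score(X_test, y_test)  # R^2 on held-out data
print(scores)
```

In practice the penalty parameter `C` and kernel coefficient `gamma` should be tuned (e.g. with grid search) rather than left at defaults.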
4. Naive Bayes
Disadvantages: This model is well suited to text samples. Naive Bayes assumes the features are conditionally independent of each other given the class, so it performs poorly when features are strongly correlated.
Advantages: The same independence assumption greatly simplifies the probability computation, saving memory and time.
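A minimal sketch of the text-classification use case mentioned above; the tiny corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus (labels: 1 = sports, 0 = tech).
texts = [
    "the team won the match", "great goal in the final",
    "new laptop cpu released", "the phone has a fast chip",
]
labels = [1, 1, 0, 0]

# Bag-of-words counts; naive Bayes then treats each word as
# independent given the class, so fitting is just counting.
vec = CountVectorizer()
X = vec.fit_transform(texts)

nb = MultinomialNB().fit(X, labels)
pred = nb.predict(vec.transform(["the team scored a goal"]))
print(pred)
```

Because "team" and "goal" only appear in the sports documents, the class-conditional word counts push the prediction toward the sports label.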
5. K-nearest neighbors (KNN)
Disadvantages: k must be set manually, and prediction is computationally expensive, since each query is compared against the stored training samples.
Advantages: "He who stays near vermilion turns red; he who stays near ink turns black", i.e. a sample takes after its neighbors. KNN is a non-parametric model: there is no training phase beyond storing the data.
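The "no training" property above is visible in code: fitting only stores the data, and prediction averages the targets of the k nearest neighbors. Dataset and parameters here are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() just stores the training set; each prediction then averages
# the targets of the 5 nearest training points.
knn = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X_train, y_train)
print(knn.score(X_test, y_test))
```

Setting `weights="distance"` instead weighs closer neighbors more heavily, which is the second KNN variant in the table below.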
6. Decision Tree (DT)
Disadvantage: Training on the data is time-consuming.
Advantages: The model with the lowest data requirements: the data may contain missing values, be nonlinear, and mix different feature types. It is the model closest to human logical reasoning, with good interpretability.
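The interpretability claim above can be demonstrated by printing a fitted tree as nested if/else rules; the dataset and depth limit are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = load_diabetes(return_X_y=True)

# A shallow tree keeps the printed rule set small and readable.
dt = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as human-readable threshold rules,
# which is where the good interpretability comes from.
print(export_text(dt))
```

Each printed line is a feature-threshold split, so a domain expert can read the model's decision logic directly.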
7. Ensemble models
Random forest: draw random samples to train multiple classifiers, then decide the final result by majority vote (the minority yields to the majority); the individual classifiers are independent of one another.
Gradient boosting: builds a strong learner from weak ones; the best-known representative is AdaBoost ("three cobblers with their wits combined equal one Zhuge Liang", i.e. many weak heads together beat one strong one). Weak classifiers are combined by a weighting scheme into a strong classifier; the classifiers depend on one another, and the final prediction combines them all.
Generally, GB>RF>DT
However, a drawback of ensemble models is that their built-in randomness introduces some run-to-run uncertainty.
The above compares commonly used regressors. Knowing their strengths and weaknesses, you can choose the right one for your own data. The table below compares these regressors on the same task by computing each one's residual; with sklearn's default parameters, the accuracy ranking is: ensemble models > DT > SVM > KNN > linear.
| Regressor | Import command | Constructor call | Residual (%) |
| --- | --- | --- | --- |
| Linear regression | `from sklearn.linear_model import LinearRegression` | `lr = LinearRegression()` | 5.223 |
| SGD regression (L2 penalty) | `from sklearn.linear_model import SGDRegressor` | `sgdr = SGDRegressor(penalty="l2")` | 5.780 |
| SGD regression (L1 penalty) | (same as above) | `sgdr = SGDRegressor(penalty="l1")` | 5.765 |
| SVR (rbf kernel) | `from sklearn.svm import SVR` (penalty parameter: `C`, kernel coefficient: `gamma`) | `svr = SVR(kernel="rbf")` | 0.627 |
| SVR (sigmoid kernel) | (same as above) | `svr = SVR(kernel="sigmoid")` | 82.507 |
| SVR (poly kernel) | (same as above) | `svr = SVR(kernel="poly")` | 20.862 |
| SVR (linear kernel) | (same as above) | `svr = SVR(kernel="linear")` | 6.451 |
| KNN (n=5, weights=uniform) | `from sklearn.neighbors import KNeighborsRegressor` | `knn = KNeighborsRegressor(n_neighbors=5, weights="uniform")` | 0.731 |
| KNN (n=5, weights=distance) | (same as above) | `knn = KNeighborsRegressor(n_neighbors=5, weights="distance")` | 1.087 |
| Decision tree | `from sklearn.tree import DecisionTreeRegressor` | `dt = DecisionTreeRegressor()` | 0.447 |
| Random forest | `from sklearn.ensemble import RandomForestRegressor` | `rf = RandomForestRegressor()` | 0.270 |
| Extra trees | `from sklearn.ensemble import ExtraTreesRegressor` | `et = ExtraTreesRegressor()` | 0.246 |
| Gradient boosting | `from sklearn.ensemble import GradientBoostingRegressor` | `gb = GradientBoostingRegressor()` | 0.284 |
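The original article does not state which dataset produced the residuals in the table. As a sketch, a comparable comparison can be run on the diabetes dataset, reading "residual (%)" as the mean relative error; both choices are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

residuals = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    # Mean relative residual in percent (one plausible reading of the table).
    residuals[name] = 100 * np.mean(np.abs(pred - y_test) / y_test)
print(residuals)
```

Re-running this loop on your own data is an easy way to reproduce the table's kind of comparison before committing to one model.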