Interview Requirements: Comparison of Advantages and Disadvantages of Machine Learning Algorithms (Summary)

The purpose of this article is to provide a pragmatic and concise inventory of current machine learning algorithms.

It mainly reviews the typical application scenarios, advantages, and disadvantages of several commonly used algorithms.

There are a great many machine learning algorithms, spanning classification, regression, clustering, recommendation, image recognition, and so on, and finding a suitable one is genuinely not easy. In practice we therefore usually experiment with a heuristic approach: at the beginning, start with algorithms that are widely recognized, such as SVM, GBDT, and AdaBoost. Deep learning is very hot right now, and neural networks are also a good choice.

If you care about accuracy, the best approach is to test and compare the algorithms one by one with cross-validation, then tune the parameters so that each algorithm reaches its optimum, and finally choose the best performer.
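
As a minimal sketch of this workflow (assuming scikit-learn is available; the dataset below is a synthetic stand-in for your own X and y):

```python
# Compare several candidate models with cross-validation and report mean accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # stand-in data

candidates = {
    "SVM": SVC(),
    "GBDT": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)          # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```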

But if you are just looking for a "good enough" algorithm to solve your problem, or you want a few quick tips for reference, the analysis of each algorithm's advantages and disadvantages below should make the choice easier.


1. There is no such thing as a free lunch

In the field of machine learning, a basic theorem is that "there is no such thing as a free lunch". In other words, no algorithm is perfect for every problem, especially for supervised learning (e.g. predictive modeling).

For example, you cannot say that a neural network is better than a decision tree in any case, and vice versa. They are influenced by many factors, such as the size or structure of your dataset.

As a result, you should try different algorithms depending on the specific problem, evaluating their performance on a given test set before picking one.

Of course, the chosen algorithm must be applicable to your own problem, which requires choosing the right machine learning task. As an analogy, if you need to clean your house, you might use a vacuum cleaner, a broom, or a mop, but you should never pull out a shovel and dig.

2. Bias & Variance

In statistics, the quality of a model is measured by its bias and variance, so let's first review what bias and variance mean:

  • Bias: describes the gap between the expected value E of the predictions (the estimates) and the true value Y. The larger the bias, the further the predictions deviate from the real data.

  • Variance: describes the spread and dispersion of the predictions P around their expected value E, i.e., the variance of the predicted values. The larger the variance, the more spread out the predictions are.

The true error of the model is the combination of the two (for squared error, Error = Bias^2 + Variance plus an irreducible noise term).

In general, if the training set is small, a classifier with high bias and low variance (e.g., Naive Bayes) has an advantage over one with low bias and high variance (e.g., KNN), because the latter tends to overfit. However, as the training set grows, the model becomes better at capturing the underlying data and bias becomes the limiting factor; at that point low-bias/high-variance classifiers gradually show their advantage (they have lower asymptotic error), and a high-bias classifier is no longer sufficient to provide an accurate model.

Why is Naive Bayes high bias and low variance?

The following content is quoted from Zhihu:

First, assume that you know the relationship between the training set and the test set. Simply put, we learn a model on the training set and then use it on the test set, and how well it works is measured by the test error rate. In many cases, however, we can only assume that the test set and the training set follow the same data distribution, without ever seeing the real test data. How, then, can we judge the test error rate when we can only observe the training error rate?

Since there are few training samples (at least never enough), the model learned from the training set is never entirely faithful. Even 100% accuracy on the training set does not mean the model captures the true data distribution; capturing the true distribution is our goal, not merely fitting the finite points of the training set. Moreover, in practice training samples often contain noise, so if you pursue perfection on the training set with a very complex model, the model will treat the noise in the training set as genuine characteristics of the data distribution and arrive at a wrong estimate of it; it will then fall apart on the real test set (this phenomenon is called overfitting). But you cannot use too simple a model either, otherwise, when the data distribution is complex, the model will be unable to describe it (reflected in a high error rate even on the training set; this is underfitting). Overfitting means the model adopted is more complex than the true data distribution, while underfitting means it is simpler than the true distribution.

Under the framework of statistical learning, when we describe model complexity we take the view that Error = Bias + Variance. Error here can roughly be understood as the model's prediction error rate, and it has two components: the inaccuracy caused by the model being too simple (Bias), and the extra room for variation and uncertainty caused by the model being too complex (Variance).

It is therefore easy to analyze Naive Bayes this way. It simply assumes that the features are independent of one another, so it is a severely simplified model. For such a simple model, the Bias term dominates the Variance term in most cases, i.e., high bias and low variance.

In practice, in order to make the Error as small as possible, we need to balance the proportion of Bias and Variance when selecting a model, that is, balance over-fitting and under-fitting.

As model complexity increases, bias gradually decreases while variance gradually increases.
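
A small illustrative sketch of this trade-off, using scikit-learn's validation_curve on a decision tree whose max_depth serves as the complexity knob (the synthetic data and parameter range are assumptions for illustration): as depth grows, training accuracy rises while the gap to validation accuracy widens.

```python
# Bias-variance trade-off sketch: deeper trees fit training data better (lower bias)
# but the train/validation gap widens (higher variance).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
depths = np.arange(1, 11)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train acc={tr:.3f}  val acc={va:.3f}  gap={tr - va:.3f}")
```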

3. Advantages and disadvantages of common algorithms

3.1 Naive Bayes

Naive Bayes is a generative model (the key distinction between generative and discriminative models is whether the joint distribution is modeled). It is relatively simple; you basically just need to do a bunch of counting. Given the (fairly strict) conditional independence assumption, a Naive Bayes classifier converges faster than a discriminative model such as logistic regression, so it needs less training data. Even when the conditional independence assumption does not hold, the NB classifier still performs very well in practice. Its main disadvantage is that it cannot learn interactions between features (in terms of the R in mRMR, feature redundancy). To quote a classic example: even if you like the movies of both Brad Pitt and Tom Cruise, it cannot learn that you dislike the movies they made together.
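
A minimal text-classification sketch with scikit-learn's MultinomialNB (the tiny corpus and labels below are made up purely for illustration):

```python
# Naive Bayes for text classification: word counts + class-conditional probabilities.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win money now", "meeting at noon", "cheap pills win", "project schedule update"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # counting is essentially all the training
model.fit(texts, labels)

print(model.predict(["win a cheap prize"]))     # likely 'spam'
print(model.predict_proba(["meeting update"]))  # posterior probability per class
```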

Advantages

  • The Naive Bayes model originates from classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency;

  • It trains and predicts quickly on large datasets. Even with a very large training set, each item usually has only a relatively small number of features, and training and classification reduce to simple arithmetic on feature probabilities;

  • It performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training (i.e., it can be updated with new samples in real time);

  • Less sensitive to missing data, and the algorithm is relatively simple, often used in text classification;

  • The results produced by Naive Bayes are easy to interpret;

Disadvantages

  • Need to calculate the prior probability;

  • There is an error rate in the classification decision;

  • Sensitive to the expression form of the input data;

  • Since it assumes that sample attributes are independent, it does not work well when the attributes are correlated;

Naive Bayes application field

  • It is widely used in fraud detection;

  • whether an email is spam;

  • Should an article be classified into technology, politics, or sports;

  • Does a piece of text express a positive or negative emotion?

  • face recognition.

3.2 Logistic Regression

Logistic regression is a discriminative model that comes with many regularization options (L0, L1, L2, etc.), and you don't have to worry about correlated features the way you do with Naive Bayes. Compared with decision trees and SVMs, you also get a decent probabilistic interpretation, and you can easily update the model with new data (using online gradient descent). Use it if you need a probabilistic framework (e.g., to adjust classification thresholds, express uncertainty, or obtain confidence intervals), or if you expect to incorporate more training data into the model later.

Sigmoid function: the expression is σ(z) = 1 / (1 + e^(-z)), which squashes any real-valued score into the interval (0, 1).
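
A short sketch connecting the sigmoid to scikit-learn's LogisticRegression output (synthetic data; the check at the end simply confirms that predict_proba is the sigmoid of the linear score):

```python
# Sigmoid function and logistic regression probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)   # L2-regularized

# The predicted probability is sigmoid(w.x + b); predict_proba returns the same values.
scores = X @ clf.coef_.ravel() + clf.intercept_
print(np.allclose(sigmoid(scores), clf.predict_proba(X)[:, 1]))  # True
```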

Advantages

  • Easy to implement and widely used in industrial problems;

  • The amount of computation is small, prediction is fast, and storage requirements are low;

  • It conveniently provides a probability score for each sample;

  • Multicollinearity is not a fatal problem for logistic regression; it can be handled by combining the model with L2 regularization;

  • Computationally inexpensive, easy to understand and implement;

Disadvantages

  • When the feature space is large, the performance of logistic regression is not very good;

  • Prone to underfitting; the accuracy is generally not very high;

  • Does not handle large numbers of multi-class features or variables well;

  • Can only handle binary classification problems (the softmax extension derived from it handles multi-class), and the data must be linearly separable;

  • For nonlinear features, conversion is required;

Logistic regression application field

  • Used for binary classification where a probability value is needed, which makes it suitable for ranking by classification probability, such as search ranking;

  • Its softmax extension can be applied to multi-class problems such as handwritten digit recognition;

  • credit evaluation;

  • Measuring marketing success;

  • Predict the revenue of a product;

  • Whether an earthquake will occur on a given day.

3.3 Linear regression

Linear regression is used for regression, not for classification the way logistic regression is. Its basic idea is to optimize a least-squares error function with gradient descent. Of course, the parameters can also be solved for directly with the normal equation, which gives w = (X^T X)^(-1) X^T y.

In LWLR (locally weighted linear regression), the parameters are instead computed as w = (X^T W X)^(-1) X^T W y, where W is a diagonal matrix of weights assigned to the training samples.

It can be seen that LWLR differs from ordinary linear regression: LWLR is a non-parametric model, because every prediction must traverse the training samples at least once.
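
A minimal NumPy sketch of both closed-form solutions above, the ordinary normal equation and a locally weighted fit for a single query point (the data, the Gaussian weighting kernel, and the bandwidth tau are illustrative assumptions):

```python
# Normal equation and locally weighted linear regression (LWLR) on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 5, 100)]        # add a bias column
y = 2.0 + 1.5 * X[:, 1] + rng.normal(0, 0.3, 100)

# Ordinary least squares via the normal equation: w = (X^T X)^-1 X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

def lwlr_predict(x_query, X, y, tau=0.5):
    """Locally weighted prediction: w = (X^T W X)^-1 X^T W y with Gaussian weights."""
    d = np.sum((X - x_query) ** 2, axis=1)
    W = np.diag(np.exp(-d / (2 * tau ** 2)))           # weight nearby samples more heavily
    w_local = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ w_local

print(w)                                               # close to [2.0, 1.5]
print(lwlr_predict(np.array([1.0, 2.5]), X, y))        # prediction for one query point
```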

Advantages

  • Simple to implement and simple to calculate;

Disadvantages

  • Cannot fit nonlinear data.

3.4 Nearest Neighbor Algorithm - KNN

KNN is the k-nearest neighbors algorithm, and its main process is as follows (a minimal sketch appears after the steps):

1. Compute the distance between the test sample and every training sample (common distance metrics include Euclidean distance, Mahalanobis distance, etc.);
2. Sort all of the distances above in ascending order;
3. Select the k samples with the smallest distances;
4. Vote according to the labels of these k samples to obtain the final class.
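
Here is the promised sketch: a tiny NumPy implementation that mirrors the four steps for a single test point (the toy training data are made up):

```python
# Minimal KNN classifier following the four steps above.
import numpy as np
from collections import Counter

def knn_predict(x_test, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # 1. Euclidean distance to every training sample
    order = np.argsort(dists)                          # 2. sort the distances (ascending)
    nearest = y_train[order[:k]]                       # 3. take the k nearest samples
    return Counter(nearest).most_common(1)[0][0]       # 4. majority vote on their labels

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.5, 1.5]), X_train, y_train, k=3))  # -> 0
```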

How to choose an optimal K value depends on the data. In general, a larger K value reduces the influence of noise during classification but blurs the boundaries between classes. A good K value can be found with various heuristic techniques, such as cross-validation. In addition, noise and irrelevant features reduce the accuracy of the K-nearest neighbor algorithm. The nearest neighbor algorithm has a strong consistency result: as the amount of data tends to infinity, the algorithm is guaranteed an error rate no worse than twice the Bayes error rate, and for suitable values of K, the K-nearest neighbor error rate approaches the Bayes (theoretically optimal) error rate.

Advantages

  • The theory is mature and the idea is simple; it can be used for both classification and regression;

  • Can be used for nonlinear classification;

  • Training is essentially just storing the data, so its time complexity is O(n);

  • No assumptions about the data, high accuracy, insensitive to outliers;

  • KNN is an online technology, new data can be directly added to the data set without retraining;

  • KNN theory is simple and easy to implement;

Disadvantages

  • It does not handle sample imbalance well (i.e., some classes have many samples while others have very few);

  • Requires a lot of memory;

  • For data sets with large sample sizes, the amount of calculation is relatively large (reflected in distance calculation);

  • When the sample is unbalanced, the prediction deviation is relatively large. For example, there are fewer samples in a certain category, while there are more samples in other categories;

  • KNN will re-perform a global operation for each classification;

  • There is no theoretically optimal choice of k; K-fold cross-validation is often used to select a good value;

KNN algorithm application field

Text classification, pattern recognition, cluster analysis, multi-category fields.

3.5 Decision tree

One of the great advantages of decision trees is that they are easy to interpret. They handle interactions between features without trouble and are non-parametric, so you don't have to worry about outliers or whether the data are linearly separable (for example, a decision tree easily handles a case where class A appears at one end of feature dimension x, class B in the middle, and class A again at the other end). One shortcoming is that they do not support online learning, so the tree must be rebuilt from scratch when new samples arrive. Another disadvantage is that they overfit easily, which is where ensemble methods such as random forest (or boosted trees) come in. In addition, random forest is often the winner on many classification problems (usually slightly better than an SVM); it trains quickly and is easy to tune, and you don't have to worry about adjusting a lot of parameters the way you do with an SVM, so it has always been popular.

A very important step in building a decision tree is selecting an attribute to branch on, so pay attention to the information gain formula and understand it thoroughly.

The information entropy is calculated as follows:

H = -∑ p_i · log2(p_i), summed over the n classes,

where n is the number of classes (for a binary problem, n = 2) and p_i is the proportion of class-i samples in the total sample set. From these proportions we can compute the information entropy before branching on any attribute.

Now select an attribute x_i to branch on. The branching rule is: if a sample's value of x_i equals v, it goes into one branch of the tree; otherwise it goes into the other branch. Obviously, the samples in each branch are likely to include both classes, so compute the entropy H1 and H2 of the two branches, take their weighted sum to get the total entropy after the split, and the information gain is the entropy before the split minus this weighted sum. Following the information-gain principle, test all the attributes and select the one that maximizes the gain as this round's branching attribute.
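
A small Python sketch of these formulas, computing the entropy of a label set and the information gain of a candidate binary split (the labels and the split are made-up illustrative data):

```python
# Entropy and information gain for a candidate split.
import numpy as np
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, left_labels, right_labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) + \
               (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

parent = ["A"] * 5 + ["B"] * 5
left, right = ["A"] * 4 + ["B"] * 1, ["A"] * 1 + ["B"] * 4
print(entropy(parent))                         # 1.0 bit for a balanced binary set
print(information_gain(parent, left, right))   # > 0: the split reduces impurity
```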

Advantages

  • The decision tree is easy to understand and explain, can be visualized and analyzed, and rules can be easily extracted;

  • Can handle nominal and numerical data at the same time;

  • It is more suitable for processing samples with missing attributes;

  • Able to handle irrelevant features;

  • When testing the data set, the running speed is relatively fast;

  • Ability to produce actionable and effective results on large data sources in a relatively short period of time.

Disadvantages

  • It is prone to overfitting (random forest can greatly reduce overfitting);

  • It is easy to overlook the correlation of attributes in the dataset;

  • For data with an inconsistent number of samples per class, different splitting criteria bring different attribute-selection tendencies: the information gain criterion prefers attributes with many possible values (ID3 is the typical representative), while the gain ratio criterion (C4.5) prefers attributes with fewer values, although C4.5 does not simply split on the gain ratio directly but uses a heuristic rule. Any method that relies on information gain shares this shortcoming (e.g., RF built from such trees).

  • When the ID3 algorithm computes information gain, the result is biased toward features with more distinct values.

Improvement measures

  • Pruning the decision tree. You can use cross-validation and add regularization.

  • Using tree-based ensemble algorithms, such as bagging and random forest, can alleviate overfitting;

"Application Fields of Decision Tree Algorithms"

Enterprise management practice and corporate investment decision-making: thanks to its strong analytical ability, the decision tree is widely used in decision-making processes.

3.5.1 ID3, C4.5 algorithm

The ID3 algorithm is based on information theory and uses information entropy and information gain as its measures to induce and classify data. ID3 computes the information gain of each attribute and selects the attribute with the highest gain as the test attribute. The C4.5 algorithm keeps the core idea of ID3 and improves on it. The improvements include:

  • Using the information gain ratio to select attributes, which overcomes the bias toward attributes with many values that arises when using raw information gain;

  • pruning during tree construction;

  • Can handle non-discrete data;

  • Can handle incomplete data.

Advantages

  • The resulting classification rules are easy to understand and have high accuracy.

Disadvantages

  • In the process of constructing the tree, the data set needs to be scanned and sorted multiple times, which leads to the inefficiency of the algorithm;

  • C4.5 is only suitable for data sets that can reside in memory. When the training set is too large to fit in memory, the program cannot run.

3.5.2 CART classification and regression tree

CART is a decision tree method that uses a Gini-index estimation function (based on minimum distance) to decide how the tree generated from each sub-dataset is expanded. If the target variable is nominal, the result is called a classification tree; if the target variable is continuous, it is called a regression tree. Classification trees use a tree-structured algorithm to separate data into discrete classes.

Advantages

  • Very flexible: it can incorporate misclassification costs, allows specifying a prior probability distribution, and can use automatic cost-complexity pruning to obtain a tree that generalizes better;

  • CART is very robust to problems such as missing values and large numbers of variables.

3.6 AdaBoost

AdaBoost is an additive model. Each new model is built according to the error rate of the previous one, paying more attention to misclassified samples and less to correctly classified ones; after successive iterations a fairly good model is obtained. It is a typical boosting algorithm, and the advantage of its additive theory can be explained with Hoeffding's inequality. Interested readers can see the detailed description of the AdaBoost algorithm in my earlier article. Its advantages and disadvantages are summarized below.
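
A minimal scikit-learn sketch of AdaBoost (synthetic data; by default the weak learner is a depth-1 decision tree, i.e., a stump):

```python
# AdaBoost: decision stumps combined additively, each round reweighting
# the samples the previous round misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)  # default weak learner: depth-1 tree
print(cross_val_score(clf, X, y, cv=5).mean())
```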

Advantages

  • Adaboost is a classifier with very high accuracy.

  • Sub-classifiers can be built using various methods, and the Adaboost algorithm provides the framework.

  • When using simple classifiers, the calculated results are understandable, and the construction of weak classifiers is extremely simple.

  • Simple, without feature screening.

  • Not prone to overfitting.

For the difference between Adaboost, GBDT and XGBoost algorithms, refer to this article: Differences between Adaboost, GBDT and XGBoost

Disadvantages

  • Sensitive to outliers.

3.7 SVM support vector machine

The support vector machine is an enduring, high-accuracy algorithm that provides good theoretical guarantees against overfitting. Even when the data are linearly inseparable in the original feature space, it can perform very well as long as a suitable kernel function is given. It is especially popular for text classification problems, which tend to be extremely high-dimensional. Unfortunately, it is memory-intensive, hard to interpret, and somewhat fiddly to run and tune, whereas random forest avoids exactly these shortcomings and is more practical.

Advantages

  • Can solve high-dimensional problems, that is, large feature spaces;

  • Solve machine learning problems under small samples;

  • Ability to handle interactions of non-linear features;

  • No local minimum problem; (relative to neural network and other algorithms)

  • No need to rely on the entire dataset (the decision boundary depends only on the support vectors);

  • Strong generalization ability;

Disadvantages

  • When there are many observation samples, the efficiency is not very high;

  • There is no general solution to nonlinear problems, and sometimes it is difficult to find a suitable kernel function;

  • The explanatory power of high-dimensional mapping of kernel functions is not strong, especially radial basis functions;

  • Conventional SVM only supports binary classification;

  • Sensitive to missing data;

Choosing the kernel also takes some skill (libsvm comes with four kernel functions: linear, polynomial, RBF, and sigmoid):

  • First, if the number of samples is less than the number of features, then there is no need to choose a nonlinear kernel, simply use a linear kernel;

  • Second, if the number of samples is greater than the number of features, a nonlinear kernel can be used to map the samples to a higher dimension, and generally better results can be obtained;

  • Third, if the number of samples and the number of features are equal, a nonlinear kernel can be used in this case, and the principle is the same as the second one.

For the first case, another option is to reduce the dimensionality of the data first and then use a nonlinear kernel.
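
A small sketch of this kernel-selection rule of thumb with scikit-learn's SVC (the helper function and the synthetic data are illustrative assumptions, not a hard rule):

```python
# Pick a linear kernel when features outnumber samples, RBF otherwise.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def pick_svm(n_samples, n_features):
    # Rule of thumb from the text above: few samples / many features -> linear kernel.
    return SVC(kernel="linear") if n_samples <= n_features else SVC(kernel="rbf", gamma="scale")

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
clf = pick_svm(*X.shape)
print(clf.kernel, cross_val_score(clf, X, y, cv=5).mean())
```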

Application field of SVM

Text classification and image recognition (mainly binary classification, since conventional SVM can only solve binary problems).

3.8 Artificial neural network

Advantages

  • High classification accuracy;

  • Strong parallel distributed processing ability, with distributed storage and learning capability;

  • Strong robustness and fault tolerance to noisy data;

  • It has associative-memory capability and can approximate complex nonlinear relationships to arbitrary accuracy;

Disadvantages

  • Neural networks require a large number of parameters, such as the network topology and the initial values of weights and thresholds;

  • Black-box process: the learning process cannot be observed and the outputs are hard to explain, which affects the credibility and acceptability of the results;

  • Learning can take a long time, it may get stuck in a local minimum, and it may even fail to achieve the learning objective.

Application field of artificial neural network

At present, deep neural networks have been applied to computer vision, natural language processing, speech recognition, and other fields, with good results.

3.9 K-Means Clustering

K-Means is a simple clustering algorithm that divides n objects into k partitions (k < n) according to their attributes. The core of the algorithm is to optimize a distortion function J so that it converges to a local minimum, though not necessarily the global minimum.

For more about K-Means clustering, see Machine Learning Algorithms - K-means Clustering. The derivation of K-Means contains a lot of insight, including the powerful EM idea.
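
A minimal K-Means sketch with scikit-learn (synthetic blob data; inertia_ is the distortion J mentioned above):

```python
# K-Means clustering on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)  # multiple restarts guard against bad local minima
labels = km.fit_predict(X)

print(km.cluster_centers_)   # the k learned centroids
print(km.inertia_)           # the distortion J: sum of squared distances to the nearest centroid
```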

Advantages

  • The algorithm is simple and easy to implement;

  • The algorithm is fast;

  • The algorithm is relatively scalable and efficient for processing large datasets, since its complexity is approximately O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations; usually k << n. The algorithm normally converges to a local optimum.

  • The algorithm tries to find the k partitions that minimize the value of the squared error function. Clustering works better when the clusters are dense, spherical, or clumpy, and the clusters are distinct from each other.

Disadvantages

  • It has high requirements on data types and is suitable for numerical data;

  • May converge to a local minimum, slow to converge on large-scale data

  • The number of groups k is an input parameter, an inappropriate k may return poor results.

  • It is sensitive to the initial cluster centers, so different initializations may lead to different clustering results;

  • Not suitable for finding clusters of non-convex shape, or clusters of widely varying sizes.

  • Sensitive to "noise" and outlier data, a small amount of such data can have a great impact on the average.

3.10 EM (Expectation-Maximization) Algorithm

The EM algorithm is a model-based clustering method: an algorithm for finding maximum likelihood estimates of the parameters of a probabilistic model that depends on unobserved hidden variables. The E-step estimates the hidden variables, the M-step estimates the other parameters, and the two steps alternate, pushing the likelihood toward a maximum.

The EM algorithm is more complex than the K-means algorithm, and the convergence is slower. It is not suitable for large-scale data sets and high-dimensional data, but it is more stable and accurate than the K-means algorithm. EM is often used in the field of data clustering in machine learning and computer vision.
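
A small sketch of EM-based clustering using scikit-learn's GaussianMixture, which runs EM internally (the synthetic data and the choice of three components are assumptions):

```python
# Gaussian mixture clustering: the E-step assigns soft responsibilities, the M-step
# re-estimates means/covariances/weights; the two alternate until convergence.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                            # runs EM internally

print(gmm.means_)                     # estimated component means
print(gmm.predict_proba(X[:3]))       # soft (E-step style) responsibilities for a few points
```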

3.11 Ensemble algorithm (AdaBoost)

Advantages

  • It makes good use of weak classifiers by cascading them;

  • Different classification algorithms can be used as weak classifiers;

  • AdaBoost has high precision;

  • Compared with bagging and Random Forest, AdaBoost fully takes the weight of each classifier into account;

Disadvantages

  • The number of AdaBoost iterations, that is, the number of weak classifiers is not easy to set, and can be determined by cross-validation;

  • Data imbalance leads to a decrease in classification accuracy;

  • Training is time-consuming, since the best split point for the current classifier is re-selected at each iteration;

AdaBoost application field

In the field of pattern recognition and computer vision, it is used in binary classification and multi-classification scenarios.

3.12 Ranking algorithm (PageRank)

PageRank is Google's web page ranking algorithm. It judges the importance of every page from the recursive relationship that a page linked to by many high-quality pages must itself be a high-quality page. (In other words, the more impressive friends a person has, the more likely that person is impressive.)
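
A tiny power-iteration sketch of the PageRank idea on a hand-made four-page link graph (the graph and the damping factor 0.85 are illustrative assumptions, not Google's actual data or implementation):

```python
# PageRank via power iteration on a toy link graph.
import numpy as np

# links[i] = list of pages that page i links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, d = 4, 0.85

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if page i links to page j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):                              # power iteration
    rank = (1 - d) / n + d * (M @ rank)

print(rank / rank.sum())   # pages 2 and 0, which receive the most links, score highest
```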

Advantages

  • It is completely independent of the query, only depends on the link structure of the web page, and can be calculated offline.

Disadvantages

  • The PageRank algorithm ignores the timeliness of web search.

  • Old pages rank high, have existed for a long time, and have accumulated a large number of in-links, while new pages with the latest information have low ranks because they have almost no in-links.

3.13 Association rule algorithm (Apriori algorithm)

The Apriori algorithm is an algorithm for mining association rules; it is used to uncover relationships that are inherent in the data but not yet known. Its core is a recursive algorithm based on the two-stage frequent-itemset idea.

The Apriori algorithm is divided into two stages (a small sketch follows the list):

  • Find frequent itemsets

  • Finding association rules from frequent itemsets
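
Here is the small sketch referred to above: a pure-Python toy version of the two stages on made-up transactions, limited to itemsets of size one and two and skipping Apriori's level-wise candidate pruning for brevity:

```python
# Toy Apriori: (1) find frequent itemsets by support counting, (2) derive a rule's confidence.
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"bread", "diaper", "beer"}, {"milk", "bread", "diaper", "beer"}]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Stage 1: frequent itemsets of size 1 and 2 (larger sizes would follow the same pattern)
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
frequent += [frozenset(c) for c in combinations(items, 2) if support(set(c)) >= min_support]
print(frequent)

# Stage 2: confidence of the rule {diaper} -> {beer}
print(support({"diaper", "beer"}) / support({"diaper"}))   # confidence = 1.0 here
```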

Disadvantages

  • When candidate itemsets are generated at each step, too many combinations are produced in a single pass, and elements that should not take part in the combinations are not excluded;

  • Every time the support of an itemset is calculated, all the records in the database are scanned and compared, which imposes a heavy I/O load.

4. Algorithm selection reference

The author has translated some foreign articles before, and one of them gave a simple algorithm selection technique:

  1. The first choice should be logistic regression. If it does not perform well, its results can still serve as a baseline against which to compare other algorithms;

  2. Then try decision trees (random forests) to see if you can drastically improve your model performance. Even if you don't use it as the final model in the end, you can use random forest to remove noise variables and do feature selection;

  3. If the number of features and observation samples are very large, then when resources and time are sufficient (this premise is very important), using SVM is an option.

Usually: GBDT >= SVM >= RF >= AdaBoost >= others. Deep learning is very popular now and is used in many fields; it is based on neural networks. The author is still learning it and does not yet have a solid enough theoretical grasp or deep enough understanding, so it is not introduced here; hopefully a future article will cover it.

Algorithms are important, but good data beats a good algorithm, and designing good features pays off greatly. If you have a very large dataset, then whichever algorithm you use may not matter much for classification performance (in that case, choose based on speed and ease of use).

References

  1. Machine Learning Algorithms Comparison

  2. Machine Learning - Advantages and Disadvantages of Common Algorithms

  3. Selecting the best Machine Learning algorithm for your regression problem
