[Machine learning notes] Summary of common algorithms for machine learning (updated)

Summary of common algorithms for machine learning

Supervised Learning

Linear regression algorithm

Advantages:

  • Fast to train and requires little storage;
  • Conceptually simple and easy to implement; models can be built quickly, which works well for small data volumes with simple relationships;
  • Forms the basis of many powerful nonlinear models;
  • The linear regression model is very easy to understand and its results are highly interpretable, which helps decision analysis;
  • Embodies many important ideas in machine learning;
  • Can solve regression problems.

Disadvantages:

  • Fits complex data poorly and tends to underfit;
  • Very sensitive to outliers;
  • Difficult to model nonlinear data, or polynomial relationships where the features are correlated;
  • Struggles to represent highly complex data well.
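As a minimal sketch (the synthetic data below is illustrative, not from these notes), fitting a line with scikit-learn takes a few lines, and the learned slope and intercept are directly interpretable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative synthetic data: y ≈ 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # slope and intercept, easy to read off and explain
print(model.predict([[5.0]]))          # prediction for a new point (a regression problem)
```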

Several key points of elastic net regression

  • It encourages a group effect among highly correlated variables instead of zeroing some of them out as Lasso does. The elastic net is very useful when several features are correlated with one another: Lasso tends to pick one of them at random, while the elastic net tends to keep both;
  • There is no limit to the number of selected variables.
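A small sketch of that group effect, under the assumption of two nearly identical features (the data and the alpha / l1_ratio values below are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Two highly correlated columns plus one irrelevant column (illustrative data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.01 * rng.normal(size=200), rng.normal(size=200)])
y = x + 0.1 * rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)                     # tends to keep only one of the correlated pair
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)  # tends to spread weight over both
```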

Key points of polynomial regression

  • Able to model nonlinearly separable data, which plain linear regression cannot; it is more flexible overall and can capture quite complex relationships;
  • Gives full control over how the feature variables are modelled (the exponents of the variables must be set);
  • Requires careful design: some prior knowledge of the data is needed to choose the best exponents;
  • If the exponents are chosen poorly, it is easy to overfit.
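A minimal sketch of that trade-off, assuming illustrative synthetic data and degree choices: the training fit keeps improving with the degree even once the model has started to overfit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative quadratic data with noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(50, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(0, 0.5, size=50)

for degree in (1, 2, 15):  # under-fit, good fit, over-fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(degree, model.score(X, y))  # training R^2 never decreases as the degree grows
```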

The main points of ridge regression

  • Its assumptions are the same as those of least-squares regression. The difference is that ordinary least squares assumes Gaussian errors and fits the parameters by maximum likelihood estimation (MLE), whereas ridge regression adds a prior on the weights w and obtains the final parameters by maximum a posteriori (MAP) estimation;
  • It shrinks the coefficient values but does not drive them to zero, so it has no feature-selection capability.

The main points of Lasso regression

There are some differences between ridge regression and Lasso regression, which can basically be attributed to the difference in the nature of L2 and L1 regularization:

  • Built-in feature selection: a very useful property of the L1 norm that the L2 norm lacks, because the L1 norm tends to produce sparse coefficients. For example, if a model has 100 coefficients but only 10 of them are non-zero, this effectively says "the other 90 variables are not useful for predicting the target." The L2 norm produces non-sparse coefficients and therefore has no such property. Lasso regression thus performs a form of "parameter selection": feature variables that are not selected contribute a weight of exactly 0.
  • Sparsity: only a few entries of a matrix (or vector) are non-zero. The L1 norm tends to produce many coefficients that are zero or very small, with only a few large coefficients.
  • Computational efficiency: the L1-regularized problem has no analytical solution, while the L2-regularized one does, so the L2 solution can be computed in closed form. However, the L1 solution is sparse, which lets it be used with sparse algorithms and can make computation more efficient.
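A small sketch of this L2-versus-L1 contrast, assuming a synthetic regression problem generated with scikit-learn (the data and the alpha values are illustrative): Ridge shrinks all coefficients but keeps them non-zero, while Lasso zeroes most of them out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative data: 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically close to 5
```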

Key points of KNN algorithm

Advantages:

  • KNN can handle classification problems and naturally extends to multi-class problems, such as classifying iris flowers;
  • Simple and easy to understand, yet powerful: for handwritten-digit and iris recognition the accuracy is very high;
  • KNN can also handle regression problems, i.e. prediction.

Disadvantages:

  • Low efficiency: every classification or regression requires comparing the query against all of the training data again. If the data volume is large, the required computation is enormous, yet processing large amounts of data is very common in machine learning;
  • Heavy dependence on the training data. All machine-learning algorithms depend on their data, but KNN is especially sensitive: if even one or two training samples are wrong and happen to sit right next to the value we need to classify, the prediction will be directly inaccurate, so its tolerance for faulty training data is very poor;
  • The curse of dimensionality: KNN does not handle high-dimensional data well.
  Algorithm advantages:

    (1) Simple, easy to understand and implement, and no parameters need to be estimated.

    (2) Training time is zero. KNN does not train an explicit model: unlike other supervised algorithms that fit a model (i.e. a function) on the training set and then use it to classify the validation or test set, KNN simply stores the samples and only processes them when test data arrives, so its training time is zero.

    (3) KNN can handle classification problems, naturally extends to multi-class problems, and is suitable for classifying rare events.

    (4) It is especially suitable for multi-modal problems (objects with multiple category labels), where KNN performs better than SVM.

    (5) KNN can also handle regression problems, i.e. prediction.

    (6) Compared with algorithms such as Naive Bayes, it makes no assumptions about the data, achieves high accuracy, and is not sensitive to outliers.

  Algorithm disadvantages:

    (1) The amount of computation is very large, especially when the number of features is large: each item to be classified must have its distance to all known samples computed in order to find its K nearest neighbors.

    (2) Interpretability is poor; it cannot produce rules the way a decision tree can.

    (3) It is a lazy learning method that does essentially no learning up front, so prediction is slower than with algorithms such as logistic regression.

    (4) When the classes are imbalanced, prediction accuracy for rare classes is low. If one class has a very large sample size while the others are small, a new sample's K nearest neighbors may be dominated by the large class.

    (5) It depends heavily on the training data and has very poor tolerance for faults in it: if even one or two training samples are wrong and happen to lie right next to the value to be classified, the prediction will be directly inaccurate.
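A minimal sketch on the iris data mentioned above (k = 5 is an illustrative choice): fitting only stores the samples, while all distance computation happens at prediction time.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" only stores the samples (lazy learning)
print(knn.score(X_test, y_test))   # distances to all stored samples are computed here
```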

Logistic regression

Advantages:

  • Easy to understand and implement; requires little storage;
  • Applicable to both continuous and categorical independent variables;

Disadvantages:

  • Prone to underfitting, so classification accuracy may not be high;
  • When the feature space is large, logistic regression does not perform very well;
  • Non-linear features must be transformed before use;
  • Cannot handle a large number of multi-class features or variables well;
  • The basic form can only handle binary classification problems, and the classes must be linearly separable;
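A minimal binary-classification sketch (the dataset and the max_iter setting are illustrative choices); the predicted class probabilities are part of what makes the model easy to interpret:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print(clf.score(X_test, y_test))       # test accuracy
print(clf.predict_proba(X_test[:3]))   # class probabilities for the first three test samples
```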

Naive Bayes (NBA, Naive Bayesian Algorithm)

Advantages:

  • The Naive Bayes model is rooted in classical mathematical theory and has stable classification performance;
  • It performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training: when the amount of data exceeds memory, the model can be trained incrementally in batches;
  • Not very sensitive to missing data, and the algorithm is fairly simple; it is often used for text classification;
  • Real-time prediction: very fast, so it can be used in real time;
  • Scales to large data sets;
  • Not sensitive to irrelevant features;
  • Good performance on high-dimensional data (a large number of features).

Disadvantages:

  • In theory, Naive Bayes has the smallest error rate compared with other classification methods. In practice this is not always the case, because the model assumes that the attributes are independent of each other given the output class, an assumption that often fails in real applications. When there are many attributes, or the correlation between attributes is large, classification performance suffers; when attribute correlation is small, Naive Bayes performs at its best. For this reason, extensions such as semi-naive Bayes improve on it by modelling a moderate amount of attribute dependence;
  • The prior probabilities must be known, and the prior often depends on assumptions. There can be many candidate hypothesis models, so in some cases the assumed prior leads to poor prediction performance;
  • Since the classification is decided from the posterior probability, which is determined by the prior and the data, the classification decision carries a certain error rate;
  • Very sensitive to the form in which the input data is expressed.

The application scope of Naive Bayes:

Naive Bayes can be used for text classification (it can predict multiple categories and does not mind irrelevant features), spam filtering (identifying spam), sentiment analysis (recognizing positive and negative sentiment in social-media analysis), and recommender systems (predicting what will be purchased next).
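A small sketch of the text-classification / spam-filtering use case with a multinomial Naive Bayes model; the tiny corpus and labels below are purely illustrative assumptions, not real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy corpus with spam/ham labels.
texts = ["win a free prize now", "meeting scheduled for monday",
         "cheap pills free offer", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
print(clf.predict(["free prize offer this monday"]))  # predicted label for a new message
```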

Support Vector Machine (SVM)

Advantages:

  • Performs well in high dimensions. Real-world data is rarely just 2-D or 3-D: image data, genetic data, medical data, and so on have far more dimensions, and support vector machines are useful here. Basically, when the number of features/columns is large, SVM performs well;
  • The best algorithm when the classes are separable (when instances of the two classes can easily be separated by a straight line or a non-linear boundary);
  • The effect of outliers is small;
  • SVM is well suited to binary classification in extreme cases.

Disadvantages:

  • Slow: processing large machine-learning data sets takes a lot of time;
  • Poor performance with overlapping classes;
  • Choosing appropriate hyperparameters is important, as this is what gives adequate generalization performance;
  • Choosing the right kernel function can be troublesome.

Application range of SVM:

Bag-of-words applications (many features and columns), speech-recognition data, image classification (non-linear data), medical analysis (non-linear data), text classification (many features).
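A minimal sketch of a kernel SVM on high-dimensional image data (the 64-pixel digits set); the RBF kernel and the C/gamma values are illustrative defaults, not tuned choices:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 64-dimensional image features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy on held-out digits
```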

Decision tree

Advantages:

  • Easy to understand and explain; can be analysed visually and rules are easy to extract;
  • Can handle nominal and numeric data at the same time;
  • Well suited to samples with missing attribute values;
  • Able to deal with irrelevant features;
  • Runs fast at prediction (test) time;
  • Can produce feasible, good results on large data sources in a relatively short time.

Disadvantages:

  • Decision trees are prone to overfitting, although random forests can greatly reduce this;
  • Decision trees easily ignore correlations between attributes in the data set;
  • For data with unequal sample sizes across classes, different splitting criteria introduce different biases in attribute selection: the information-gain criterion prefers attributes with many possible values, while the gain-ratio criterion (as used by C4.5) prefers attributes with fewer possible values; moreover, the gain-ratio-based split is not chosen by simply maximizing the gain ratio but by a heuristic rule.
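A minimal sketch showing how easily rules can be extracted from a fitted tree (max_depth=3 is an illustrative cap to limit overfitting):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # the learned if/else rules, easy to read and explain
```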

Random forest

Advantages:

  • For most data, its classification performance is good;
  • Can handle high-dimensional features, is not prone to overfitting, and trains relatively quickly, especially on big data;
  • When deciding on a class, it can assess the importance of each variable;
  • Adapts well to data sets: it can handle both discrete and continuous data, and the data does not need to be normalized.

Disadvantages:

  • On small or low-dimensional data sets, classification may not necessarily give good results;
  • Computation is slower than for a single decision tree;
  • When we need to extrapolate to independent or dependent variable values outside the training range, random forests do not perform well.
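A minimal sketch showing cross-validated accuracy and the per-variable importance scores mentioned above (n_estimators=200 is an illustrative setting):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print(cross_val_score(forest, X, y, cv=5).mean())   # averaged cross-validated accuracy
print(forest.fit(X, y).feature_importances_[:5])    # importance of the first five variables
```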

Unsupervised Learning

K-Means

Advantages:

  • The principle is easy to understand and implement;
  • When the differences between clusters are clear, the clustering effect is good;

Disadvantages:

  • When the sample set is large, convergence becomes slow;
  • Sensitive to outliers: a small amount of noise has a large effect on the cluster means;
  • The value of k is critical; for a new data set there is no reference for choosing k, so many experiments are needed.
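A minimal sketch in which k must be supplied up front; here k = 3 matches the synthetic data by construction (the data and settings are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative, clearly separated blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # k must be chosen beforehand
print(km.cluster_centers_)       # the three learned centroids
print(np.bincount(km.labels_))   # size of each cluster
```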

DBSCAN

Advantages:

  • Clustering is fast, noise points are handled effectively, and spatial clusters of arbitrary shape can be found;
  • Compared with K-Means, there is no need to specify the number of clusters in advance;
  • It has no bias toward any particular cluster shape;
  • Noise-filtering parameters can be supplied when needed.

Disadvantages:

  • As the amount of data grows, more memory is needed and I/O consumption increases;
  • When cluster densities are uneven and the gaps between clusters vary widely, clustering quality is poor, because it is then hard to choose the parameters MinPts and Eps;
  • The clustering result depends on the choice of distance measure; Euclidean distance is commonly used in practice, but for high-dimensional data it suffers from the curse of dimensionality.
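A minimal sketch on non-convex synthetic data: no cluster count is given, but eps and min_samples still have to be chosen (the values below are illustrative for this data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two interleaved crescents

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids found; the label -1 marks points treated as noise
```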

FP Growth

Classic association rule mining algorithms include Apriori algorithm and FP-growth algorithm.

The Apriori algorithm scans the transaction database multiple times, each time using candidate frequent itemsets to generate the frequent itemsets; FP-growth instead uses a tree structure and obtains the frequent itemsets directly without generating candidates, which greatly reduces the number of scans of the transaction database and improves the efficiency of the algorithm. However, the Apriori algorithm scales well and can be used in parallel computing and other settings.

The Apriori algorithm is a basic association-rule algorithm, proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. The purpose of association rules is to find relationships between items in a data set; this is also known as market basket analysis, because the shopping-basket scenario expresses the algorithm's typical application very well.

Apriori

The Apriori algorithm is an algorithm for mining association rules. It is used to uncover data relationships that are implicit and unknown in advance but actually present. Its core is a recursive algorithm based on the two-stage frequent-itemset idea.

The Apriori algorithm is divided into two stages:

1) Find frequent itemsets

2) Find association rules from frequent itemsets

Disadvantages:

  • When generating candidate itemsets at each step, the loop produces too many combinations and does not exclude elements that should not take part in the combinations;
  • Every time the support of an itemset is computed, all records in the database are scanned and compared, which imposes a heavy I/O load.
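A minimal pure-Python sketch of the two Apriori stages on a toy basket data set; the transactions and the min_support / min_confidence thresholds are illustrative assumptions, not part of these notes:

```python
from itertools import combinations

# Illustrative toy transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
min_support, min_confidence = 0.4, 0.6
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Stage 1: find frequent itemsets level by level (candidate generation, then pruning by support).
items = sorted({i for t in transactions for i in t})
frequent = {}
k, candidates = 1, [frozenset([i]) for i in items]
while candidates:
    kept = [c for c in candidates if support(c) >= min_support]
    frequent.update({c: support(c) for c in kept})
    k += 1
    # Next-level candidates: unions of frequent itemsets that have exactly k items.
    candidates = list({a | b for a in kept for b in kept if len(a | b) == k})

# Stage 2: derive association rules A -> B from the frequent itemsets.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / frequent[antecedent]   # subsets of a frequent itemset are frequent
            if confidence >= min_confidence:
                print(set(antecedent), "->", set(itemset - antecedent),
                      f"support={sup:.2f} confidence={confidence:.2f}")
```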

PCA

Advantages:

  • Makes the data set easier to use;
  • Reduces the computational cost of the algorithm;
  • Removes noise;
  • Makes the results easier to understand;
  • No parameter restrictions at all.

Disadvantages:

  • If the user has prior knowledge of the objects being observed and knows certain characteristics of the data, but cannot influence the processing through parameterization or similar means, the expected effect may not be obtained and efficiency may suffer;
  • Eigenvalue decomposition has some limitations, for example the transformation matrix must be square;
  • For non-Gaussian distributions, the principal components obtained by PCA may not be optimal.

PCA algorithm application:

  • Exploration and visualization of high-dimensional data sets;
  • Data compression;
  • Data preprocessing;
  • Analysis and processing of image, speech, and communication data;
  • Dimensionality reduction (most important): removing data redundancy and noise.
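A minimal dimensionality-reduction sketch on the 64-dimensional digits data; the choice of 10 components is illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 64 features per sample

pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)                         # compressed representation, 10 features
print(pca.explained_variance_ratio_.sum())     # fraction of the variance that is kept
```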

Ensemble learning

XGBoost

Advantages:

  • Requires little feature engineering (no need for data scaling or normalization, and missing values are handled well);
  • Can report feature importance (it outputs the importance of each feature, which can be used for feature selection);
  • Outliers have minimal impact;
  • Handles large data sets well;
  • Good execution speed;
  • Excellent model performance (a winner in many Kaggle competitions);
  • Not easy to overfit.

Disadvantages:

  • Difficult to explain, difficult to visualize;
  • If the parameters are not tuned properly, it may overfit;
  • Because there are many hyperparameters, tuning is difficult.

Application areas of XGBoost:

XGBoost can be used for almost any classification problem. It is particularly useful when there are many features, the data set is large, there are outliers and missing values, and you do not want to do much feature engineering. It has won a great many competitions, so it is an algorithm to keep in mind when tackling any classification problem.
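A minimal sketch using XGBoost's scikit-learn-style API, assuming the xgboost package is installed; the hyperparameter values are illustrative, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)               # missing values in X would be handled natively
print(clf.score(X_test, y_test))        # test accuracy
print(clf.feature_importances_[:5])     # per-feature importance, usable for feature selection
```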

AdaBoosting

AdaBoost is an additive model: each new model is built according to the error rate of the previous one, paying more attention to misclassified samples and less to correctly classified ones, so that each iteration yields a progressively better combined model. It is a typical boosting algorithm, and the benefit of this additive combination can be explained with Hoeffding's inequality.

Its advantages are very high accuracy; the sub-classifiers can be constructed with various methods, with the AdaBoost algorithm providing the framework; when simple classifiers are used the results are understandable, and constructing the weak classifiers is extremely simple; it is simple to use and requires no feature selection; and it does not overfit easily. Its one disadvantage is that it is rather sensitive to outliers.
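A minimal sketch using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a stump); n_estimators=100 is an illustrative setting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Default base learner is a decision stump; each round re-weights the misclassified samples.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())   # cross-validated accuracy
```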

 

Common algorithms for machine learning

Category                          | Sub-category               | Algorithm                      | Computational complexity | Interpretability | Sensitivity to missing values
----------------------------------|----------------------------|--------------------------------|--------------------------|------------------|------------------------------
Supervised Learning               | Regression                 | Linear Regression              | low                      | easy             | sensitive
Supervised Learning               | Regression                 | ElasticNet Regression          |                          |                  |
Supervised Learning               | Regression                 | Polynomial Regression          |                          |                  |
Supervised Learning               | Regression                 | Ridge Regression               |                          |                  |
Supervised Learning               | Regression                 | Lasso Regression               |                          |                  |
Supervised Learning               | Classification             | K-Nearest Neighbors (KNN)      | high                     | moderate         | moderate
Supervised Learning               | Classification             | Logistic Regression            | low                      | easy             | sensitive
Supervised Learning               | Classification             | Naive Bayes (NBA)              | medium                   | easy             | less sensitive
Supervised Learning               | Classification             | Support Vector Machine (SVM)   | medium                   | easy             | sensitive
Supervised Learning               | Classification             | Decision Tree                  | low                      | easy             | not sensitive
Supervised Learning               | Classification             | Random Forest                  | low                      | easy             | not sensitive
Unsupervised Learning             | Clustering                 | Fuzzy C-Means                  |                          |                  |
Unsupervised Learning             | Clustering                 | Mean Shift                     |                          |                  |
Unsupervised Learning             | Clustering                 | K-Means                        | low                      | easy             | moderate
Unsupervised Learning             | Clustering                 | DBSCAN                         |                          |                  |
Unsupervised Learning             | Clustering                 | Hierarchical Clustering        |                          |                  |
Unsupervised Learning             | Association Rule Learning  | FP Growth                      |                          |                  |
Unsupervised Learning             | Association Rule Learning  | Apriori                        |                          |                  |
Unsupervised Learning             | Association Rule Learning  | Eclat                          |                          |                  |
Unsupervised Learning             | Dimensionality Reduction   | LDA                            |                          |                  |
Unsupervised Learning             | Dimensionality Reduction   | SVD                            |                          |                  |
Unsupervised Learning             | Dimensionality Reduction   | LSA                            |                          |                  |
Unsupervised Learning             | Dimensionality Reduction   | PCA                            |                          |                  |
Unsupervised Learning             | Dimensionality Reduction   | t-SNE                          |                          |                  |
Ensemble Learning                 | Boosting                   | XGBoost                        |                          |                  |
Ensemble Learning                 | Boosting                   | LightGBM                       |                          |                  |
Ensemble Learning                 | Boosting                   | CatBoost                       |                          |                  |
Ensemble Learning                 | Boosting                   | AdaBoosting                    | low                      | easy             | not sensitive
Neural Networks and Deep Learning |                            | CNN                            | high                     | difficult        | not sensitive
Neural Networks and Deep Learning |                            | RNN                            |                          |                  |
Reinforcement Learning            |                            | Q-Learning                     |                          |                  |
Reinforcement Learning            |                            | DQN                            |                          |                  |
Reinforcement Learning            |                            | SARSA                          |                          |                  |
Reinforcement Learning            |                            | A3C                            |                          |                  |
Reinforcement Learning            |                            | Genetic Algorithm              |                          |                  |