Common Classification Algorithms: Application Scope and Data Requirements

Single classification algorithms: decision trees, Bayesian classifiers, artificial neural networks, k-nearest neighbors, support vector machines, association rule-based classification, and hidden Markov models (HMM)

Ensemble (combined) classification algorithms: Bagging and Boosting

1. k-Nearest Neighbors (kNN)

Find the k training samples closest to the unknown sample x, determine which category most of those k samples belong to, and assign x to that category.

Model input requirements: continuous values; categorical variables need to be one-hot encoded. Because the algorithm computes distances, the features need to be normalized so that no single feature dominates

Important parameters of the model: the value of k and the distance metric

Advantages: easy to understand and implement

Disadvantages: every prediction compares the query against all training samples, so the computation cost is high and the method is poorly suited to real-time scenarios

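Application scenarios: similarity-based classification, for example handwritten digit recognition (image compression is more typically a k-means clustering task)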
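
As a rough illustration of the kNN procedure described above, here is a minimal pure-Python sketch. The helper names (fit_min_max, scale, knn_predict) and the toy data are made up for this example; it assumes Euclidean distance and min-max normalization, which are only one possible choice of metric and scaling.

```python
import math
from collections import Counter


def fit_min_max(rows):
    """Record each column's min and max so features can be rescaled to [0, 1]."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row, lows, highs):
    """Apply min-max normalization; constant columns map to 0.0."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, lows, highs)
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features on very different scales.
raw_x = [[1.0, 1000], [1.2, 1100], [0.9, 900], [3.0, 5000], [3.2, 5200], [2.9, 4800]]
labels = ["a", "a", "a", "b", "b", "b"]

lows, highs = fit_min_max(raw_x)
train_x = [scale(row, lows, highs) for row in raw_x]

# The query is scaled with the statistics learned from the training data.
prediction = knn_predict(train_x, labels, scale([1.1, 1050], lows, highs), k=3)
print(prediction)  # prints "a"
```

Note that all the work happens at prediction time: each call sorts the full training set by distance, which is the cost mentioned under the disadvantages.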

2. Naive Bayes

Use Bayes' theorem to estimate the probability that a sample of unknown category belongs to each category, and select the category with the highest probability as the final category of the sample

Model input requirements: continuous values need to be discretized or modeled with a probability density such as a Gaussian (see http://blog.csdn.net/u012162613/article/details/48323777); count-based variants such as the multinomial model treat features as frequencies, so those inputs must be non-negative

Important parameters of the model:

Advantages: as a generative model it classifies by computing probabilities; it handles multi-class problems, performs well on small-scale data, supports incremental training, and the algorithm is relatively simple

Disadvantages: relies on a strong conditional independence assumption between features

Application scenarios: text classification (e.g., spam identification)
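
To make the spam example concrete, the following is a small sketch of a multinomial naive Bayes text classifier written from scratch. The function names (train_nb, predict_nb), the smoothing value alpha=1.0 (Laplace smoothing), and the four toy messages are assumptions made for illustration, not something taken from the original post.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, words, alpha=1.0):
    """Pick the class with the highest log posterior under the independence assumption."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # words never seen in training carry no evidence
            # Smoothed per-word likelihood; words are treated as independent.
            likelihood = (word_counts[label][w] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, already tokenized.
docs = [
    ["win", "money", "now"],
    ["free", "money"],
    ["meeting", "at", "noon"],
    ["lunch", "now"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["free", "money", "now"]))  # prints "spam"
```

Because the model is just a set of counts, incremental training amounts to updating the counters with new documents.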

3. Neural Networks

An artificial neural network (ANN) is a mathematical model that processes information using a structure loosely modeled on the synaptic connections of the brain

Model input requirements: normalized features

Important parameters of the model: the number of network layers and the number of nodes

Advantages: can approximate arbitrarily complex nonlinear mappings

Disadvantages: slow convergence, heavy computation, long training time, and a tendency to converge to a local optimum

Application scenarios: image processing, pattern recognition
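
The role of layers and nodes can be seen in a forward pass through a tiny network. The sketch below uses one hidden layer of two ReLU nodes to compute XOR, a mapping no single linear layer can represent. The weights are hand-picked for illustration rather than learned, and the names (dense, forward, HIDDEN_W, and so on) are hypothetical.

```python
def relu(v):
    return max(0.0, v)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: each row of `weights` feeds one output node."""
    outputs = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    return [activation(o) for o in outputs] if activation else outputs


# Hand-picked (not trained) parameters for a 2-2-1 network.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUT_W = [[1.0, -2.0]]
OUT_B = [0.0]


def forward(x):
    """Hidden layer with ReLU, then a linear output node."""
    hidden = dense(x, HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUT_W, OUT_B)[0]


for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, forward(point))
```

In practice the weights would be found by training (for example, backpropagation with gradient descent), which is where the slow convergence and local optima mentioned above come in.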

4. Support Vector Machines (SVM)

Following the principle of structural risk minimization, construct the optimal separating hyperplane that maximizes the classification margin, which improves the generalization ability of the learner

Model input: normalized features; the basic formulation assumes binary class labels

Important parameters of the model: kernel function

Advantages: works well with small samples, handles high-dimensional problems, and avoids the structure-selection and local-minimum problems of neural networks

Disadvantages: sensitive to the choice of kernel function; without modification it only supports binary classification

Application scenarios: high-dimensional text classification, small sample classification
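
The margin idea can be sketched with a linear SVM trained by batch subgradient descent on the L2-regularized hinge loss. This is a simplified stand-in, not the kernelized solver a library would use; the function names, the hyperparameters (lam, lr, epochs), and the toy points are all assumptions for this example. Labels are +1 and -1, matching the binary setting noted above.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(hinge loss) by subgradient descent."""
    n_features = len(xs[0])
    n = len(xs)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(n_features):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
xs = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
ys = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(xs, ys)
print(svm_predict(w, b, [3.0, 3.0]), svm_predict(w, b, [-3.0, -3.0]))  # prints "1 -1"
```

Only points with margin below 1 contribute to the update, which mirrors the fact that the final hyperplane is determined by the samples closest to the boundary.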

5. Decision Trees

A decision tree is a tree structure (it can be binary or non-binary). Each of its non-leaf nodes represents a test on a feature attribute, each branch represents the output of this feature attribute on a certain value range, and each leaf node stores a category

Model input: can handle continuous values; categorical variables need to be one-hot encoded

Important parameters of the model: the depth (height) of the tree

Advantages: strong learning (fitting) ability and fast training speed

Disadvantages: prone to overfitting; random forests (RF) are a common improvement

Application scenarios: search ranking
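
The tree structure described above (a feature test at each internal node, a category at each leaf) can be sketched as nested dictionaries. The version below assumes binary splits on numeric thresholds chosen by Gini impurity, with max_depth as the depth limit; the function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when all labels in the node agree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every (feature, threshold) pair and keep the lowest weighted impurity."""
    best = None  # (score, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, max_depth=3):
    """Return a leaf label, or a dict holding the test and the two subtrees."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels) if max_depth > 0 and len(set(labels)) > 1 else None
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    }


def tree_predict(node, row):
    """Walk from the root to a leaf, following the test at each internal node."""
    while isinstance(node, dict):
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node


# Hypothetical data: [age, income] -> whether the user clicked.
rows = [[25, 30], [30, 35], [35, 80], [40, 85], [50, 40], [55, 90]]
labels = ["no", "no", "yes", "yes", "no", "yes"]

tree = build_tree(rows, labels, max_depth=2)
print(tree)  # {'feature': 1, 'threshold': 40, 'left': 'no', 'right': 'yes'}
```

Lowering max_depth is the simplest guard against the overfitting mentioned above, since an unrestricted tree keeps splitting until every leaf is pure.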

6. Logistic Regression (LR)

Fit a regression formula for the classification boundary from the existing data, and use it to classify

Model input: continuous values are usually discretized, and categorical variables need to be one-hot encoded

Important parameters of the model: how the input features are discretized

Advantages: fast training speed, suitable for real-time scenarios

Disadvantages: limited fitting ability; cannot handle nonlinear scenarios on its own, so combination (cross) features have to be constructed by hand

Application scenarios: various real-time systems, such as CTR (click-through rate) estimation
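
The effect of discretizing inputs can be shown with a small logistic regression trained by batch gradient descent. In the sketch below a continuous value is bucketed and one-hot encoded before training; the bucket edges, learning rate, epoch count, function names, and data are all assumptions made for illustration.

```python
import bisect
import math


def one_hot_bucket(value, edges):
    """Discretize a continuous value into a bucket and one-hot encode the bucket."""
    index = bisect.bisect_right(edges, value)
    return [1.0 if i == index else 0.0 for i in range(len(edges) + 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=1000):
    """Batch gradient descent on the log-loss; returns (weights, bias)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj / n
            grad_b += error / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical data: the label is 1 only in the middle range, which a linear
# model on the raw value cannot express but bucketed features can.
EDGES = [10, 20]
values = [1, 2, 3, 11, 12, 13, 21, 22, 23]
targets = [0, 0, 0, 1, 1, 1, 0, 0, 0]

features = [one_hot_bucket(v, EDGES) for v in values]
w, b = train_logistic(features, targets)
print([predict(w, b, one_hot_bucket(v, EDGES)) for v in (5, 15, 25)])  # prints [0, 1, 0]
```

Prediction is a single dot product followed by a sigmoid, which is why the model suits real-time systems; the price is that nonlinear structure has to be supplied through features like these buckets.
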

Reposted from http://f.dataguru.cn/thread-896022-1-1.html

Origin blog.csdn.net/xllzuibangla/article/details/124971314