[Data Mining] ---- Big data mining: clustering, classification, and regression

1. Classification

       One of the two main applications of supervised learning; it produces discrete results.

For example, given training samples containing various data about people, the model produces a result of the form "input a person's data, determine whether they have cancer". The result is necessarily discrete: only "yes" or "no".

  Classification is a supervised learning task for modeling or predicting discrete random variables. Use cases include e-mail filtering, detecting financial transaction fraud, and predicting employee categories, all of which have categorical outputs.

  Many regression algorithms have corresponding classification algorithms; a classification algorithm usually predicts a category (or the probabilities of categories) rather than a continuous value.

       Classification is a very important data mining task. Classification techniques extract from the data a model (a function or description of the data classes, often called a classifier) and assign each object in the data set to a known class. From the machine learning point of view, classification is supervised learning: the class of each training sample is already identified, and by learning, a knowledge representation can be formed that maps data objects to their corresponding classes. In this sense, the goal of this type of data mining is to form knowledge that classifies data according to the sample data, and thus to predict the classes of future data. Classification has a wide range of applications, such as medical diagnosis, credit-card credit rating, and image pattern recognition.

  The classification model produced by classification mining can be described in various forms. The major representations are classification rules, decision trees, neural networks, and mathematical formulas. In addition, a newer approach has risen recently, rough sets, whose knowledge representation uses production rules.

  Classification is the process of finding a model or concept type (or function) that describes and distinguishes data classes, so that the model can be used to predict the class label of objects whose class is unknown. Classification analysis is one of the more important tasks in data mining and is currently the most widely used commercially. The purpose of classification is to learn a classification function or classification model (also often called a classifier) that can map data items in a database to one of a given set of classes.
  Both classification and regression can be used for prediction: both aim to derive automatically, from historical data records, a generalized description of the given data that can be used to predict future data. The difference is that the output of classification is a discrete class value, while the output of regression is continuous. Both are often expressed in the form of a decision tree: starting from the root, the search follows the branches whose conditions the data values satisfy, and the category is determined once a leaf is reached.
   To construct a classifier, a training data set is needed as input. The training set consists of a set of database records or tuples; each tuple is a feature vector made up of the values of the relevant fields (also called attributes or features), and each training sample additionally carries a class label. A specific sample can be expressed in the form (v1, v2, ..., vn; c), where vi denotes a field value and c denotes the class. Classifier construction methods include statistical methods, machine learning, neural networks, and so on.
  Different classifiers have different characteristics. There are three criteria for evaluating or comparing classifiers: 1) prediction accuracy; 2) computational complexity; 3) conciseness of the model description. Prediction accuracy is the most commonly used measure, especially for predictive classification tasks. Computational complexity depends on the specific implementation details and the hardware environment; since the target of data mining operations is a huge amount of data, space and time complexity are a very important aspect of the problem. For descriptive classification tasks, the simpler the model description, the more popular it is.
  Also note that the effectiveness of classification is generally related to the characteristics of the data: some data are very noisy, some have missing values, some are sparse, some fields or attributes are strongly correlated, and some attributes are discrete while others are continuous or mixed. It is now widely recognized that no single method suits data of every characteristic.

 

1.1 Logistic regression (regularized)

Logistic regression is the classification method corresponding to linear regression, and its basic concepts are derived from the linear regression algorithm. Logistic regression maps predictions to the interval 0-1 through the logistic function (i.e., the sigmoid function), so the predicted value can be interpreted as the probability of a class.

The model is still "linear", so the algorithm only performs well when the data are linearly separable (i.e., the data can be completely separated by a hyperplane). Like linear models, logistic models can be regularized by penalizing the model coefficients.

  • Advantages: the output has a good probabilistic interpretation, and the algorithm can be regularized to avoid over-fitting. Logistic models are easy to update with new data using stochastic gradient descent.

  • Disadvantages: logistic regression performs relatively poorly when there are multiple or non-linear decision boundaries.

  • Python implementation: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

  • R implementation: https://cran.r-project.org/web/packages/glmnet/index.html
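
To make the probability interpretation concrete, here is a minimal sketch of regularized logistic regression in scikit-learn. The synthetic data and the parameter choices (L2 penalty, C=1.0) are illustrative assumptions, not part of the original article.

```python
# Minimal sketch: regularized logistic regression with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C is the inverse of the regularization strength: smaller C = stronger penalty.
clf = LogisticRegression(penalty="l2", C=1.0)
clf.fit(X_train, y_train)

# predict_proba returns per-class probabilities via the sigmoid mapping.
print(clf.predict_proba(X_test[:3]))
print("accuracy:", clf.score(X_test, y_test))
```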

1.2 Classification tree (ensemble methods)

The classification algorithm corresponding to the regression tree is the classification tree. Both are usually referred to as decision trees; a slightly more rigorous name is "classification and regression tree (CART)", which is the very famous CART algorithm.

(Figure: a simple random forest.)

  • Advantages: as with regression, ensembles of classification trees also perform very well in practice. They are generally robust to abnormal data and scalable. Because of their hierarchical structure, ensembles of classification trees can naturally model non-linear decision boundaries.

  • Disadvantages: unconstrained single trees tend to over-fit, but using an ensemble method can weaken this effect.

  • Random forest Python implementation: http://scikit-learn.org/stable/modules/ensemble.html#regression

  • Random forest R implementation: https://cran.r-project.org/web/packages/randomForest/index.html

  • Gradient boosting tree Python implementation: http://scikit-learn.org/stable/modules/ensemble.html#classification

  • Gradient boosting tree R implementation: https://cran.r-project.org/web/packages/gbm/index.html
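
As a hedged illustration of the ensemble idea, the sketch below trains a random forest and a gradient boosting classifier on synthetic data; the data set and hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch: tree ensembles for classification (assumed synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Combining many unconstrained trees weakens the over-fitting of any single tree.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)

print("RF  accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("GBM accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```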

 

1.3 Deep learning

Deep learning is also easily adapted to classification problems. In fact, deep learning is applied more often to classification tasks, such as image classification.

  • Advantages: deep learning is ideal for classifying audio, text, and image data.

  • Disadvantages: as with regression problems, deep neural networks require a lot of training data, so they are not general-purpose algorithms.

  • Python resources: https://keras.io/

  • R resources: http://mxnet.io/
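
For orientation, here is a minimal Keras sketch of a small feed-forward binary classifier. The toy data, layer sizes, and training settings are assumptions; a real image or text task would need far more data and a task-specific architecture (e.g., convolutions).

```python
# Minimal sketch: a small feed-forward classifier in Keras (toy data assumed).
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")  # toy binary labels

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),    # hidden layers learn
    keras.layers.Dense(64, activation="relu"),    # intermediate representations
    keras.layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```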

 

1.4 SVM

A support vector machine (SVM) can be extended to non-linear classification problems using a technique called the kernel trick, and the algorithm essentially computes the distance between two observations called support vectors. The SVM algorithm looks for the decision boundary that maximizes its margin to the samples, so SVM is also known as a large-margin classifier.

The kernel function used by SVM applies a non-linear transform, turning a non-linear problem into a linear one.

For example, an SVM using a linear kernel gives results similar to logistic regression, but the support vector machine is more robust because it only maximizes the margin. Therefore, in practice the biggest advantage of SVM is that non-linear kernel functions can be used to model non-linear decision boundaries.

  • Advantages: SVM can model non-linear decision boundaries, and there are many kernels to choose from. SVM is also fairly robust against over-fitting, especially in high-dimensional space.

  • Disadvantages: however, SVM is memory-intensive; because choosing the right kernel matters so much, it is difficult to tune, and it does not scale well to larger data sets. Currently in industry, random forests usually outperform SVM.

  • Python implementation: http://scikit-learn.org/stable/modules/svm.html#classification

  • R implementation: https://cran.r-project.org/web/packages/kernlab/index.html
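
The kernel idea can be seen on a small example: concentric circles are not linearly separable, but an RBF kernel separates them. This is a sketch with assumed toy data and near-default hyperparameters.

```python
# Minimal sketch: linear vs. RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)           # behaves like a linear classifier
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)  # non-linear decision boundary

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))
```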

 

1.5 Naive Bayes

Naive Bayes (NB) is a Bayesian classification method based on the conditional independence assumption of the features. A naive Bayes model is essentially a probability table whose entries are updated from the training data. To predict a new observation, the algorithm looks up in the probability table the category with the largest probability given the sample's feature values.

It is called "naive" because the core of the algorithm is the feature conditional independence assumption (each feature is independent of the others), and this basic assumption is unrealistic in the real world.

  • Advantages: even though the conditional independence assumption rarely holds, the naive Bayes algorithm performs surprisingly well in practice. The algorithm is easy to implement and can be updated as the data set grows.

  • Disadvantages: because the naive Bayes algorithm is so simple, it is frequently replaced by the classification algorithms listed above.

  • Python implementation: http://scikit-learn.org/stable/modules/naive_bayes.html

  • R implementation: https://cran.r-project.org/web/packages/naivebayes/index.html
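
A minimal sketch of naive Bayes on the e-mail-filtering style of task mentioned above; the four-document corpus and its labels are invented purely for illustration.

```python
# Minimal sketch: multinomial naive Bayes for toy spam filtering (assumed corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting at noon", "cheap money win", "project meeting notes"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)

# The fitted model is essentially a table of word-given-class probabilities.
nb = MultinomialNB().fit(X, labels)
print(nb.predict(vec.transform(["win cheap money"])))  # expected: [1] (spam)
```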


      

2. Clustering

       An application of unsupervised learning. Clustering produces a set of collections of objects, where objects in the same collection are similar to each other and differ from objects in other collections.

For example, a student with no reference standard sorts books into groups, indicating only that they think those books may belong to the same category (without knowing what that category actually is).

  Clustering is an unsupervised learning task; the algorithm finds natural groups of observed samples (i.e., clusters) based on the internal structure of the data. Use cases include customer segmentation, clustering news articles, and recommendation.

  Because clustering is unsupervised learning (i.e., the data are not labeled), its results are often evaluated by visualizing the data. If there is a "right answer" (i.e., the training set contains pre-labeled clusters), a classification algorithm may be more appropriate.

  Like classification, clustering is a machine learning technique, but it is unsupervised learning. In other words, clustering is a method of grouping information according to the similarity of that information, without knowing the categories in advance. The purpose of clustering is to make the differences between objects belonging to the same category as small as possible, and the differences between objects in different categories as large as possible. The significance of clustering therefore lies in organizing the observed content into a hierarchical structure that groups similar things together. Through clustering, one can identify dense and sparse areas, and thus discover interesting relationships among global distribution patterns and data attributes.

   Cluster analysis is a booming field. Clustering techniques are mainly based on statistical methods, machine learning, and neural network methods. The more representative clustering techniques are those based on geometric distance, such as Euclidean distance, Manhattan distance, and Minkowski distance. Cluster analysis is widely used in commerce, biology, geography, network services, and other areas.

  Clustering means aggregating samples without type labels into different groups according to the principle that "like attracts like"; such a collection of data objects is called a cluster, and each cluster is then described. It is designed so that samples belonging to the same cluster are similar to each other, while samples of different clusters should be sufficiently dissimilar. Unlike classification rules, clustering does not know in advance how many groups the data will be divided into or what kinds of groups they will be, nor is there a predefined set of rules partitioning the space. Its purpose is to discover functional relationships among the attributes of spatial entities; the mined knowledge is expressed as mathematical equations with attribute names as variables.
  Currently, clustering technology is booming, covering fields including data mining, statistics, machine learning, spatial database technology, biology, and marketing. Cluster analysis has become a very active research topic in data mining.

       Common clustering algorithms include:

Partition-based methods: the K-means clustering algorithm (the most typical clustering algorithm), the K-medoids algorithm, and the CLARANS algorithm;

Hierarchy-based methods: the BIRCH algorithm, the CURE algorithm, and the CHAMELEON algorithm;

Density-based methods: the DBSCAN algorithm, the OPTICS algorithm, and the DENCLUE algorithm;

Grid-based methods: the STING algorithm, the CLIQUE algorithm, and the WAVE-CLUSTER algorithm; and model-based methods.

 

2.1 K-means clustering

K-means clustering is a general-purpose algorithm that clusters based on the geometric distance between sample points (i.e., distance in the coordinate plane). Clusters group around cluster centers, and the clusters tend to be globular and of similar size. K-means is the recommended algorithm for beginners because it is not only simple, but also flexible enough to give reasonable results for most problems.

  • Advantages: K-means clustering is the most popular clustering algorithm because it is fast and simple, and if your data preprocessing and feature engineering are effective, it has amazing flexibility.

  • Disadvantages: the algorithm needs the number of clusters to be specified, and the choice of K is usually not easy to determine. In addition, if the true clusters in the training data are not globular, K-means will produce some relatively poor clusters.

  • Python implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means

  • R implementation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
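
A minimal K-means sketch on synthetic blob data, which matches the globular, similar-size clusters the algorithm favors; the data and K=3 are assumptions for illustration.

```python
# Minimal sketch: K-means on three roughly spherical blobs (assumed data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K (n_clusters) must be chosen by the user; here we assume it is known.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centers:\n", km.cluster_centers_)
print("first labels:", km.labels_[:10])
```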

2.2 Affinity Propagation Clustering

Affinity propagation (AP) is a relatively new clustering algorithm that determines clusters based on graph distances between pairs of sample points. The clusters produced by this method tend to be smaller and of unequal size.

  • Advantages: the algorithm does not need the number of clusters to be specified explicitly (but it does require hyperparameters such as the "sample preference" and "damping" to be set).

  • Disadvantages: the main drawbacks of AP clustering are that training is relatively slow and it requires a lot of memory, so it is difficult to scale to large data sets. In addition, the algorithm assumes that the underlying clusters are spherical.

  • Python implementation: http://scikit-learn.org/stable/modules/clustering.html#affinity-propagation

  • R implementation: https://cran.r-project.org/web/packages/apcluster/index.html
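
A minimal affinity propagation sketch; note that no cluster count is passed, only the "damping" and "preference" hyperparameters, whose values here are illustrative assumptions.

```python
# Minimal sketch: affinity propagation finds the number of clusters itself.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# "damping" and "preference" take the place of an explicit cluster count.
ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0).fit(X)
print("clusters found:", len(ap.cluster_centers_indices_))
```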

2.3 Hierarchical clustering (hierarchical/agglomerative)

Hierarchical clustering is a family of clustering algorithms based on the following idea:

  1. At the beginning, each data point is its own cluster.

  2. For each cluster, merge clusters according to the same criterion.

  3. Repeat this process until only one cluster remains, thereby obtaining a hierarchy of clusters.

  • Advantages: the main advantage of hierarchical clustering is that the clusters are no longer assumed to be spherical. Furthermore, it can scale to large data sets.

  • Disadvantages: a bit like K-means clustering, the algorithm needs the number of clusters (i.e., the level of the hierarchy to keep after the algorithm finishes) to be set.

  • Python implementation: http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

  • R implementation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html
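
A minimal agglomerative clustering sketch following the three steps above; the data and the choice of Ward linkage are assumptions for illustration.

```python
# Minimal sketch: agglomerative (bottom-up) hierarchical clustering.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Each point starts as its own cluster; merges continue until n_clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print("first labels:", agg.labels_[:10])
```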

2.4 DBSCAN

DBSCAN is a density-based algorithm in which dense regions of sample points form clusters. There is also a recent development called HDBSCAN, which allows clusters of varying density.

  • Advantages: DBSCAN does not assume spherical clusters, and its performance is scalable. Furthermore, it does not require every point to be assigned to a cluster, which reduces the noise that abnormal data introduce into the clusters.

  • Disadvantages: the user must tune the hyperparameters "epsilon" and "min_samples", which define cluster density. DBSCAN is very sensitive to these hyperparameters.

  • Python implementation: http://scikit-learn.org/stable/modules/clustering.html#dbscan

  • R implementation: https://cran.r-project.org/web/packages/dbscan/index.html
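
A minimal DBSCAN sketch on two crescent-shaped clusters, a non-spherical case where K-means struggles; the eps and min_samples values are illustrative assumptions that would need tuning on real data.

```python
# Minimal sketch: DBSCAN on non-spherical (crescent) clusters.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples define the density; DBSCAN is sensitive to both.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points (label -1):", list(labels).count(-1))
```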

 

Epilogue

This article gives a preliminary look, from the three angles of regression, classification, and clustering, at the advantages and disadvantages of each method, as well as a basic understanding of what those algorithms actually are. But each of the above algorithms has many more concepts and details that were not shown here: we did not cover their loss functions, what the training objectives are, what the weight-update strategies are, and similar questions. Interested readers are therefore encouraged to search Jiqizhixin (机器之心) for articles that provide the specific details of these algorithms, covering linear regression, decision trees (ensemble methods), support vector machines, deep learning, and clustering algorithms.


 

3. Regression

      One of the two main applications of supervised learning; it produces continuous results.

For example, given training samples containing various data about people, the model produces a result of the form "input a person's data, estimate that person's economic capacity 20 years from now". The result is continuous, often taking the form of a regression curve: as the input independent variable varies, the output dependent variable is not discretely distributed.
 

  Regression is a supervised learning algorithm for predicting and modeling numeric, continuous random variables. Typical use cases include forecasting continuously varying values such as house prices, stock prices, or test scores.

  The distinguishing feature of regression tasks is that the data sets are labeled with a numeric target variable. In other words, each observed sample has a numeric label, a true value that supervises the algorithm.

3.1 Linear regression (regularized)

Linear regression is one of the most common algorithms for regression tasks. In its simplest form, the algorithm fits the data set with a hyperplane (a straight line when there are only two variables). If linear relationships exist among the variables in the data set, it can fit very well.

In practice, simple linear regression is typically replaced by regularized regression methods (LASSO, ridge, and elastic net). Regularization is a technique that penalizes overly large regression coefficients in order to reduce the risk of over-fitting. Of course, the penalty strength has to be chosen so that the model achieves a balance between under-fitting and over-fitting.

  • Advantages: linear regression is very intuitive to understand and interpret, and regularization can reduce the risk of over-fitting. Furthermore, linear models are easy to update with new data using stochastic gradient descent.

  • Disadvantages: linear regression performs poorly when the relationships among variables are non-linear. It is also not flexible enough to capture more complex patterns, and adding the correct interaction or polynomial terms is very difficult and time-consuming.

  • Python implementation: http://scikit-learn.org/stable/modules/linear_model.html

  • R implementation: https://cran.r-project.org/web/packages/glmnet/index.html
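
The sketch below compares the three regularized variants named above on synthetic data; the data set and the alpha values are assumptions, and alpha is the knob that trades off under- and over-fitting.

```python
# Minimal sketch: ridge, LASSO, and elastic net on assumed synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# alpha is the penalty strength on the coefficients.
for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0)):
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, round(r2, 3))
```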

 

3.2 Regression tree (ensemble methods)

A regression tree (a decision tree) learns by repeatedly splitting the data set into different branches in a hierarchy, with the splitting criterion of maximizing the information gain of each separation. This branching structure allows a regression tree to learn non-linear relationships naturally.

Ensemble methods such as random forests (RF) and gradient boosting trees (GBM) combine many independently trained trees. The main idea of these algorithms is to combine multiple weak learners into one strong learner, which we will not unpack in detail here. In practice, RF usually performs well with little effort, while GBM is harder to tune but usually has a higher performance ceiling.

  • Advantages: decision trees can learn non-linear relationships and are quite robust to outliers. Ensemble learning performs very well in practice and often wins classical (non-deep-learning) machine learning competitions.

  • Disadvantages: unconstrained, a single tree is easy to over-fit, because it can keep branching (without pruning) until it memorizes the training data. Ensemble methods can weaken this disadvantage.

  • Random forest Python implementation: http://scikit-learn.org/stable/modules/ensemble.html#random-forests

  • Random forest R implementation: https://cran.r-project.org/web/packages/randomForest/index.html

  • Gradient boosting tree Python implementation: http://scikit-learn.org/stable/modules/ensemble.html#classification

  • Gradient boosting tree R implementation: https://cran.r-project.org/web/packages/gbm/index.html
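
A hedged sketch of the two ensembles discussed above on synthetic regression data; the data and hyperparameters (100 trees, default learning rate) are illustrative assumptions.

```python
# Minimal sketch: random forest vs. gradient boosting for regression.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)       # easy to use well
gbm = GradientBoostingRegressor(n_estimators=100, random_state=0)  # harder to tune

print("RF  R^2:", cross_val_score(rf, X, y, cv=5).mean())
print("GBM R^2:", cross_val_score(gbm, X, y, cv=5).mean())
```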

 

3.3 Deep learning

Deep learning refers to multi-layer neural networks that can learn very complex patterns. The algorithm uses hidden layers between the input layer and the output layer to model intermediate representations of the data, which is the part other algorithms find difficult to learn.

Deep learning has several other important mechanisms, such as convolution and dropout, which allow the algorithm to learn from high-dimensional data. However, deep learning requires more data than other algorithms, because it has a larger number of parameters to estimate.

  • Advantages: deep learning is the state of the art in certain domains, such as computer vision and speech recognition. Deep neural networks perform outstandingly on image, audio, and text data, and the model parameters are easy to update with new data using the back-propagation algorithm. Their architectures (i.e., the number and structure of the layers) can be adapted to many kinds of problems, and the hidden layers also reduce the algorithm's reliance on feature engineering.

  • Disadvantages: deep learning algorithms are generally not suitable as general-purpose algorithms because they require a large amount of data. In fact, on classical machine learning problems deep learning usually does not outperform ensemble methods. In addition, it is computationally intensive and needs a more experienced practitioner to assist in training (i.e., setting the architecture and hyperparameters) to reduce training time.

  • Python resources: https://keras.io/

  • R resources: http://mxnet.io/
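
To show the hidden-layer and dropout mechanisms in a regression setting, here is a minimal Keras sketch; the toy target function, layer sizes, and training settings are assumptions for illustration only.

```python
# Minimal sketch: a small regression network with dropout (toy data assumed).
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 8).astype("float32")
y = np.sin(X.sum(axis=1)).astype("float32")  # toy non-linear target

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),  # dropout regularizes the hidden layer
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),      # linear output for a continuous target
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print("train MSE:", model.evaluate(X, y, verbose=0))
```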

 

3.4 Nearest Neighbor algorithm

The nearest neighbor algorithm is "instance-based", which means it needs to retain every training sample. It predicts the value of a new observation by searching for the most similar training samples.

This algorithm is memory-intensive, does not handle high-dimensional data very well, and requires an efficient distance function to measure and compute similarity. In practice, using regularized regression or tree ensembles is basically always a better choice.
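
A minimal k-nearest-neighbors regression sketch; the synthetic data, k=5, and Euclidean distance are illustrative assumptions.

```python
# Minimal sketch: instance-based prediction with k-nearest neighbors.
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# The model simply stores the training samples; predictions average the
# targets of the k closest stored samples under the chosen distance.
knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict(X[:3]))
```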
