This article walks you through data mining

The era of Big Data has arrived. Sifting through the huge volumes of data that life on the Internet generates, in order to identify problems and create value, has turned data mining into a new science and technology in its own right. So what exactly is data mining, what does its process look like, and what are its specific algorithms? This article will walk you through all of these questions.

01 First of all, what exactly is data mining?
 
The formal definition: data mining (Data Mining) is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data.
 
Put more simply, data mining means finding the "stuff" we want in large amounts of data.
 
02 What exactly does this "stuff" refer to?
 
The first kind is called a prediction task.

That is, given some known attributes, predict the value of a specific target attribute. If the target attribute is discrete, the task is usually called "classification"; if the target attribute is continuous, it is called "regression".
 
The other kind is called a description task.
 
This refers to identifying potential patterns and links in the data. For instance, when two features are strongly associated, big data analysis may surface rules like the famous one: men who buy diapers usually also buy beer, so a merchant might sell these two goods as a bundle to improve performance. Another very important description task is cluster analysis, which appears extremely often in day-to-day data analysis. It aims to discover groups of closely related observations, so that data can be sorted into suitable categories for analysis or dimensionality reduction even when no labels are available.
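As a sketch of the association idea, here is how the two standard measures, support and confidence, would be computed for a rule like {diapers} → {beer} on a small, entirely hypothetical set of shopping baskets:

```python
# Hypothetical shopping baskets. Support and confidence are the two
# measures behind an association rule like {diapers} -> {beer}.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

n = len(baskets)
both = sum(1 for b in baskets if {"diapers", "beer"} <= b)
diapers = sum(1 for b in baskets if "diapers" in b)

support = both / n           # share of all baskets containing both items
confidence = both / diapers  # share of diaper baskets that also have beer
print(support, confidence)
```

Here three of the five baskets contain both items, so the rule has support 0.6; of the four baskets with diapers, three also contain beer, giving confidence 0.75.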
 
Other description tasks include anomaly detection. Its process is roughly the inverse of clustering: where clustering pulls similar data together, anomaly detection singles out the points that lie far away from everything else.
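A minimal statistical version of this idea is the z-score rule: flag any point that sits more than a couple of standard deviations from the mean. The readings below are hypothetical:

```python
import statistics

# Hypothetical sensor readings: flag any point more than two sample
# standard deviations away from the mean as an outlier.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0, 10.2]

mean = statistics.mean(data)
spread = statistics.stdev(data)
outliers = [x for x in data if abs(x - mean) / spread > 2]
print(outliers)
```

Real anomaly detectors (density- or clustering-based) are more sophisticated, but the intent is the same: isolate the points the rest of the data disagrees with.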
 
03 The general process of data mining includes the following steps:
 
data preprocessing → data mining → post-processing
 
First, data preprocessing. This step exists because data mining generally requires a relatively large amount of data, which may come from different sources in inconsistent formats and may contain missing or invalid values. If this "dirty" data is fed into a model untreated, the computation can easily fail or produce a model of poor usability, so data preprocessing is an indispensable step in the data mining process.
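One of the simplest preprocessing fixes is mean imputation: replacing missing values in a numeric column with the column mean. A tiny sketch, with hypothetical numbers:

```python
# A tiny preprocessing sketch: fill missing values (None) in a numeric
# column with the column mean. The numbers are hypothetical.
column = [3.0, None, 4.0, 5.0, None, 8.0]

present = [x for x in column if x is not None]
mean = sum(present) / len(present)
cleaned = [x if x is not None else mean for x in column]
print(cleaned)
```

Real pipelines use many other strategies (median imputation, dropping rows, model-based filling), but all of them serve the same goal: no "dirty" values reach the model.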
 
Data mining itself and post-processing are comparatively easier to understand. Once preprocessing is complete, we usually feed the constructed features into a specific model, run the computation, and use some criterion to judge the performance of different models or combinations of models in order to settle on a final one. Post-processing then amounts to taking the results we were looking for and putting them to use, or presenting them in a suitable form.
 
Data mining involves a whole series of algorithms. They fall into three broad categories: classification algorithms, clustering algorithms, and association rules, which between them cover essentially all of the current commercial demand. Among them, the most classic are the following ten.
 
 
1. The C4.5 classification decision tree algorithm

C4.5 is a classification decision tree algorithm in machine learning. It is an improvement on ID3, the core decision tree algorithm (a decision tree organizes decisions between nodes like an upside-down tree).
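C4.5's improvement over ID3 is its split criterion, the gain ratio: information gain normalised by the entropy of the split itself. A sketch on a tiny hypothetical weather-style dataset (attribute value → class labels of the samples with that value):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical split of six samples by an 'outlook' attribute:
# each attribute value maps to the class labels of its samples.
split = {"sunny": ["yes", "yes", "no"], "rainy": ["no", "yes", "no"]}
labels = [label for subset in split.values() for label in subset]

n = len(labels)
gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in split.values())
split_info = entropy([v for v, s in split.items() for _ in s])
gain_ratio = gain / split_info  # C4.5 prefers the attribute maximising this
print(round(gain_ratio, 4))
```

ID3 would rank attributes by `gain` alone, which unfairly favours attributes with many values; dividing by `split_info` is exactly C4.5's correction.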
 
2. The k-means algorithm

The k-means algorithm is a clustering algorithm that partitions n objects into k clusters (k < n) according to their attributes, so that objects within the same cluster are highly similar to one another.
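A minimal 1-D k-means sketch, alternating the two classic steps (assign each point to its nearest centroid, then move each centroid to the mean of its cluster). The data points are hypothetical:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: repeatedly assign points to the nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initialise from random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # keep an empty cluster's old centroid to avoid division by zero
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

# Hypothetical 1-D data with two obvious groups around 1.0 and 10.0.
points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
centroids = kmeans(points, 2)
print(centroids)
```

On real data k-means is run in multiple dimensions with several random restarts, since the result depends on the initial centroids.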
 
3. The support vector machine algorithm

The support vector machine (SVM) is a supervised learning method widely used in statistical classification and regression analysis.
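The quantity at the heart of an SVM is the hinge loss: a point classified correctly and outside the margin costs nothing, while a point inside the margin is penalised even if it is on the right side. A sketch with a fixed, hand-picked hyperplane and hypothetical points (training the hyperplane is the part real SVM solvers do):

```python
# All numbers here are hypothetical. For the fixed hyperplane x + y = 0,
# correctly classified points outside the margin incur zero hinge loss,
# while points inside the margin are penalised even when on the right side.
w, b = (1.0, 1.0), 0.0
data = [((2.0, 2.0), 1),      # safely on the positive side
        ((0.3, 0.2), 1),      # right side, but inside the margin
        ((-2.0, -2.0), -1)]   # safely on the negative side

def hinge(x, y):
    # hinge loss: max(0, 1 - y * (w . x + b))
    return max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b))

losses = [hinge(x, y) for x, y in data]
print(losses)
```

An SVM chooses `w` and `b` to minimise total hinge loss plus a penalty on the size of `w`, which is equivalent to maximising the margin between the classes.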
 
4. The Apriori algorithm

Apriori is one of the most influential algorithms for mining the frequent itemsets needed for Boolean association rules. Its core is a level-wise, two-stage procedure built on the idea of frequent itemsets. The association rules it produces are classed as single-dimensional, single-level, Boolean association rules.
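The key Apriori insight is that a k-itemset can only be frequent if all of its (k−1)-subsets are frequent, so candidates at each level are built by joining the frequent sets from the previous level. A compact sketch on hypothetical baskets (real implementations add a pruning step on the candidates):

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: candidates for level k+1 are
    built by joining the frequent k-itemsets (the Apriori property)."""
    n = len(transactions)
    frequent = {}
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    while candidates:
        level = {}
        for s in candidates:
            support = sum(1 for t in transactions if s <= t) / n
            if support >= min_support:
                level[s] = support
        frequent.update(level)
        # join step: merge frequent k-itemsets into (k+1)-item candidates
        candidates = list({a | b for a in level for b in level
                           if len(a | b) == len(a) + 1})
    return frequent

# Hypothetical baskets; at min_support 0.6 every singleton and pair
# survives, but the triple {a, b, c} does not.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
                {"a", "b", "c"}]
freq = apriori(transactions, 0.6)
print(len(freq))
```

Association rules are then read off the frequent itemsets by checking confidence, as in the diapers-and-beer example earlier.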
 
5. The expectation-maximization (EM) algorithm

The expectation-maximization (EM) algorithm finds maximum likelihood estimates of the parameters of a probabilistic model when the model depends on unobservable hidden variables. EM is often used for data clustering in machine learning and computer vision.
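A bare-bones EM sketch for a two-component 1-D Gaussian mixture, where the hidden variable is which component generated each point. The E-step computes each component's responsibility for each point; the M-step re-estimates the parameters from the responsibility-weighted data. All data values are hypothetical:

```python
import math

def em_gmm(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture (didactic sketch)."""
    mu = [min(data), max(data)]            # crude but adequate initialisation
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k]) *
                 math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate parameters from responsibility-weighted data
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu

# Hypothetical 1-D data drawn near two centres, roughly 1.0 and 5.0.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu = em_gmm(data)
print(mu)
```

Each EM iteration is guaranteed not to decrease the data likelihood, though like k-means it can converge to a local optimum depending on initialisation.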
 
6. The PageRank algorithm

PageRank measures the value of a website based on the number and quality of its external and internal links.
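The standard way to compute PageRank is power iteration: each page repeatedly spreads its current rank evenly over its outgoing links, damped toward a uniform distribution. A sketch on a tiny hypothetical link graph (this version assumes every page has at least one out-link):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank: each page spreads its rank evenly over
    its out-links, damped toward the uniform distribution."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

# Hypothetical three-page graph; every page here has an out-link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(rank)
```

In this graph "c" receives links from both "a" and "b" and so ends up with the highest rank, while "b", reached only via one of "a"'s two links, gets the lowest.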
 
7. The AdaBoost iterative algorithm

AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set, and then combine these weak classifiers into a stronger final classifier (the strong classifier).
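A didactic AdaBoost sketch using 1-D threshold stumps as the weak classifiers: each round fits the best stump on the re-weighted data, gives it a vote `alpha` based on its error, and up-weights the points it got wrong. The training data is hypothetical:

```python
import math

def train_adaboost(xs, ys, rounds=10):
    """AdaBoost over 1-D threshold stumps (a didactic sketch)."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []                          # (alpha, threshold, sign) triples
    for _ in range(rounds):
        best = None
        for thr in sorted(set(xs)):        # try every threshold and polarity
            for sign in (1, -1):
                preds = [sign if x < thr else -sign for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign, preds)
        err, thr, sign, preds = best
        err = max(err, 1e-12)              # guard against log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # up-weight misclassified points, then renormalise
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    vote = sum(a * (s if x < thr else -s) for a, thr, s in ensemble)
    return 1 if vote >= 0 else -1

# Hypothetical 1-D data: positives on the left, negatives on the right.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [1, 1, 1, -1, -1, -1]
model = train_adaboost(xs, ys)
preds = [predict(model, x) for x in xs]
print(preds)
```

The final classifier is a weighted vote of the stumps, which is exactly the "weak classifiers combined into a strong classifier" idea described above.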
 
8. The kNN nearest neighbor classification algorithm

The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea: if most of the k samples most similar to a given sample (i.e. its nearest neighbors in feature space) belong to a particular category, then that sample belongs to that category too.
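That idea translates almost directly into code: find the k closest training points and take a majority vote. A minimal 2-D sketch with hypothetical labelled points:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify a 2-D point by majority vote among its k nearest
    neighbours under Euclidean distance (a minimal sketch)."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: one cloud around (1, 1), one around (5, 5).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_predict(train, (1.1, 1.0)))
```

Note that kNN has no training phase at all: all the work happens at query time, which is why it is often called a "lazy" learner.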
 
9. The Naive Bayes algorithm

Naive Bayes starts from the prior probability of an object, computes its posterior probability for each class using Bayes' formula, and assigns the object to the class with the largest posterior probability. A naive Bayes model requires few estimated parameters, is not very sensitive to missing data, and the algorithm is relatively simple.
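For categorical features the whole algorithm reduces to counting: estimate the prior P(class) and the likelihoods P(feature | class) from the data, multiply, and pick the largest product. A sketch on a hypothetical play-or-not dataset (no smoothing, for brevity):

```python
from collections import Counter

def nb_classify(data, features):
    """Pick the class maximising P(class) * prod P(feature_i | class),
    with probabilities estimated by simple counting (no smoothing)."""
    class_counts = Counter(label for _, label in data)
    n = len(data)
    best, best_p = None, -1.0
    for cls, cnt in class_counts.items():
        p = cnt / n                              # prior P(class)
        rows = [f for f, label in data if label == cls]
        for i, value in enumerate(features):
            p *= sum(1 for f in rows if f[i] == value) / cnt  # P(x_i | class)
        if p > best_p:
            best, best_p = cls, p
    return best

# Hypothetical weather data: (outlook, windy) -> whether a match was played.
data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rainy", "no"), "yes"), (("rainy", "yes"), "no"),
        (("sunny", "no"), "yes")]
print(nb_classify(data, ("sunny", "no")))
```

The "naive" part is the multiplication itself: it assumes the features are conditionally independent given the class, which is rarely true but works surprisingly well in practice.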
 
10. CART: the classification and regression tree algorithm

Classification and regression trees (CART) is a decision tree algorithm for data mining built on two key ideas: the first is recursively partitioning the space of independent variables; the second is using validation data to prune the tree.
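For classification, CART chooses each partition by minimising Gini impurity rather than the entropy used by ID3/C4.5. A candidate split is scored as the weighted Gini of its two child nodes; the labels below are hypothetical:

```python
from collections import Counter

def gini(labels):
    """Gini impurity, CART's split criterion for classification."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Score a candidate binary split as the weighted Gini of the two children.
left, right = ["yes", "yes", "yes"], ["no", "no", "yes"]
n = len(left) + len(right)
weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
print(round(weighted, 3))
```

A pure node (all one class, like `left` here) has impurity 0; CART greedily picks the split with the lowest weighted impurity, then prunes the grown tree back using validation data.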
 
Conclusion:
 
Once you dive into data mining, the sea turns out to be deep. These are just ten algorithms, and they alone will give you plenty to chew on for quite some time……
 
But don't panic: being able to harness the power of machines and of mathematics to understand how the world works, to make predictions, or to use your analysis to do something interesting is a rare pleasure!


Origin www.cnblogs.com/Liz-Murray-coming/p/11332950.html