Learning data mining: getting to know the field

1. What is data mining?

Data mining is a hot topic in artificial intelligence and database research. So-called data mining is the non-trivial process of revealing implicit, previously unknown, and potentially useful information from large amounts of data in a database. Data mining is a decision support process: drawing mainly on artificial intelligence, machine learning, pattern recognition, statistics, database, and visualization technologies, it analyzes business data in a highly automated way, performs inductive reasoning, and digs out underlying patterns to help decision-makers adjust marketing strategies, reduce risk, and make the right decisions. The knowledge discovery process consists of three stages: ① data preparation; ② data mining; ③ expression and interpretation of the results. Data mining can interact with the user or with a knowledge base.
In short, data mining is the analysis of data collected from some source: finding patterns in huge amounts of data, and finding the treasure hidden inside.

2. The basic flow of data mining

Data mining can be divided into six steps:
 1. Business understanding: data mining is not the goal in itself; the goal is to serve the business better. So the first step is to understand the project requirements from a business perspective, and only on that basis define the goal of the data mining effort.
 2. Data understanding: collect part of the data and explore it, including writing data descriptions and verifying data quality. This gives you a preliminary feel for the data you will be collecting.
 3. Data preparation: collect the data, then clean and integrate it, completing the preparation before mining begins.
 4. Modeling: select and apply data mining models, and tune them to obtain better results.
 5. Model evaluation: evaluate the model and review each step of its construction, checking whether the model achieves the intended business objectives.
 6. Deployment: the role of the model is to find the gold in the data, i.e., what we call "knowledge". The knowledge gained must be translated into a form users can actually use: it may be presented as a report, or, more ambitiously, implemented as a repeatable data mining process.
If the results of data mining become part of daily operations, subsequent monitoring and maintenance become important.

3. The top ten data mining algorithms

To carry out data mining tasks, data scientists have proposed a wide variety of models. Among the many data mining models, the authoritative international academic organization ICDM (the IEEE International Conference on Data Mining) selected the top ten classic algorithms.
By purpose, these algorithms can be grouped into four categories, which makes them easier to understand:
  1. Classification algorithms: C4.5, Naive Bayes, SVM, KNN, AdaBoost, CART
  2. Clustering algorithms: K-Means, EM
  3. Association analysis: Apriori
  4. Link analysis: PageRank

  1. C4.5
        C4.5 received the most votes, so it can be called the algorithm of the top ten. C4.5 is a decision tree algorithm; its innovations are pruning during the construction of the tree, handling continuous attributes, and handling incomplete data. It is a landmark among decision tree classification algorithms.
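C4.5's distinctive split criterion is the information gain ratio. Here is a minimal pure-Python sketch of that criterion on a made-up toy dataset (the attribute values and labels below are purely illustrative, not from any real data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """Information gain of splitting on one attribute, divided by the
    split's own entropy (C4.5's criterion, penalizing many-valued splits)."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy([row[attr_index] for row in rows])
    return gain / split_info if split_info else 0.0

# Toy data: one attribute ("outlook"), binary label
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(gain_ratio(rows, labels, 0))  # a perfect split gives ratio 1.0
```

A full C4.5 implementation would apply this criterion recursively, then prune; the sketch only shows the scoring step.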
      2. Naive Bayes
        The Naive Bayes model is grounded in probability theory. The idea is this: given an unknown object you want to classify, compute the probability of each category conditioned on the object's observed features, and assign the object to the category with the largest probability.
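A minimal pure-Python sketch of that idea for categorical features, with simple add-one smoothing (the tiny "weather/wind" dataset is invented for illustration, and the smoothing denominator is a simplification):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-class feature-value frequencies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(y, i)][v] += 1
    return priors, cond

def predict_nb(priors, cond, row):
    """Pick the class maximizing P(class) * prod P(value | class),
    with add-one smoothing so unseen values never give zero probability."""
    best, best_p = None, -1.0
    total = sum(priors.values())
    for y, ny in priors.items():
        p = ny / total
        for i, v in enumerate(row):
            counts = cond[(y, i)]
            p *= (counts[v] + 1) / (ny + len(counts) + 1)
        if p > best_p:
            best, best_p = y, p
    return best

# Toy data: (weather, wind) -> did we play?
rows = [("sunny", "weak"), ("sunny", "strong"),
        ("rain", "weak"), ("rain", "strong")]
labels = ["yes", "yes", "yes", "no"]
priors, cond = train_nb(rows, labels)
print(predict_nb(priors, cond, ("sunny", "weak")))  # "yes"
```

The "naive" part is multiplying the per-feature probabilities as if the features were independent given the class.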
      3. SVM
        SVM stands for Support Vector Machine. During training, SVM builds a hyperplane as its classification model. If the term "hyperplane" is unfamiliar, don't worry; I'll explain this algorithm in a later article.
      4. KNN
        KNN is the K-Nearest Neighbor algorithm. "K nearest neighbors" means that each sample can be represented by the K samples closest to it. If the K nearest neighbors of a sample all belong to category A, then the sample belongs to category A as well.
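Because KNN needs no training phase at all, it fits in a few lines. A minimal sketch with Euclidean distance and majority voting (the six labeled points are made up for illustration):

```python
from collections import Counter

def knn_predict(train, k, point):
    """Classify a point by majority vote among its k nearest neighbors.
    `train` is a list of ((x, y), label) pairs; distance is Euclidean
    (squared distance suffices, since sorting order is the same)."""
    by_dist = sorted(train, key=lambda item:
                     sum((a - b) ** 2 for a, b in zip(item[0], point)))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, 3, (1, 1)))  # "A"
print(knn_predict(train, 3, (5, 4)))  # "B"
```

An odd k avoids ties in two-class problems; real implementations also use spatial indexes instead of sorting every query.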
      5. AdaBoost
        AdaBoost builds a joint classification model during training. "Boost" means to strengthen, and that is what AdaBoost does as it constructs its classifier: it combines several weak classifiers into one strong classifier. This makes AdaBoost a commonly used classification algorithm.
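The heart of the combination is AdaBoost's weight update: samples a weak classifier gets wrong gain weight, so the next weak classifier focuses on them. A minimal sketch of a single boosting round (the five-sample setup is invented for illustration; a full AdaBoost would also train the weak learners and sum their alpha-weighted votes):

```python
from math import log, exp

def adaboost_round(weights, correct):
    """One AdaBoost round: compute the weak learner's weighted error,
    its vote weight alpha, and the re-normalized sample weights
    (mistakes gain weight, correct samples lose weight)."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1 - err) / err)
    new = [w * exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]

# Five samples, equal weights; the weak classifier gets the last one wrong.
weights = [0.2] * 5
alpha, weights = adaboost_round(weights, [True, True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After this round, the one misclassified sample carries half of the total weight, which is exactly what forces the next weak classifier to attend to it.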
      6. CART
        CART stands for Classification and Regression Trees. As the English name says, it can build two kinds of trees: classification trees and regression trees. Like C4.5, it is a decision tree learning method.
      7. Apriori
        Apriori is an algorithm for mining association rules. It reveals relationships between items by mining frequent itemsets, and it is widely used in business and in network security. A frequent itemset is a set of items that often appear together, suggesting there may be a strong relationship between them.
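A minimal level-wise sketch of the frequent-itemset step (the shopping baskets are made up for illustration; real Apriori also generates candidates more carefully and derives rules from the itemsets afterwards):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: count k-item combinations, keep those whose
    support (fraction of transactions containing them) meets the
    threshold, then build (k+1)-item candidates only from items that
    survived, per the Apriori pruning principle."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    result, k = {}, 1
    while True:
        counts = {}
        for combo in combinations(items, k):
            s = set(combo)
            support = sum(1 for t in transactions if s <= t) / n
            if support >= min_support:
                counts[combo] = support
        if not counts:
            return result
        result.update(counts)
        items = sorted({i for combo in counts for i in combo})
        k += 1

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"},
           {"milk", "eggs"}, {"bread", "eggs"}]
print(frequent_itemsets(baskets, min_support=0.5))
```

Here {bread, milk} survives with support 0.5 (it appears in two of four baskets), while anything containing beer is pruned at the single-item level.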
      8. K-Means
        K-Means is a clustering algorithm. You can think of it as ultimately dividing objects into K classes. Assume that within each class there is a "central point", an opinion leader of sorts, which is the core of that class. To classify a new point, just compute its distance to each of the K central points; the class of the nearest central point becomes the class of the new point.
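The "central points" are found by alternating two steps: assign every point to its nearest center, then move each center to the mean of its points. A minimal sketch on one-dimensional data (the numbers and starting centers are invented for illustration):

```python
def kmeans_1d(points, centers, rounds=10):
    """Plain k-means on numbers: assign each point to its nearest
    center, then move each center to the mean of its assigned points.
    A fixed round count stands in for a proper convergence check."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(sorted(kmeans_1d(points, centers=[0.0, 5.0])))  # ≈ [1.0, 9.0]
```

The final centers land on the means of the two obvious groups; real data needs a sensible choice of K and of the initial centers.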
      9. EM
        The EM algorithm, also called the Expectation-Maximization algorithm, is a method of maximum likelihood estimation for model parameters. The principle is this: suppose we want to estimate two parameters, A and B, both initially unknown, where knowing A lets us infer B and knowing B in turn lets us infer A. First give A an initial value to obtain an estimate of B, then use that estimate of B to re-estimate A, and repeat the process until it converges.
    The EM algorithm is often used for clustering, and in machine learning more generally.
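The alternation can be made concrete with the classic two-coins example: each session, one of two coins with unknown biases is picked at random and flipped ten times, and we never see which coin it was. A minimal sketch (the flip counts and starting biases below follow the commonly used textbook version of this example):

```python
def em_two_coins(flips, theta_a, theta_b, iters=20):
    """EM for two coins of unknown bias. E-step: from the current bias
    estimates, compute the probability each session came from coin A.
    M-step: re-estimate both biases from the responsibility-weighted
    head/tail counts. Repeat until (approximately) converged."""
    for _ in range(iters):
        ha = ta = hb = tb = 0.0
        for h, t in flips:
            la = theta_a ** h * (1 - theta_a) ** t
            lb = theta_b ** h * (1 - theta_b) ** t
            pa = la / (la + lb)          # responsibility of coin A
            ha += pa * h; ta += pa * t
            hb += (1 - pa) * h; tb += (1 - pa) * t
        theta_a = ha / (ha + ta)
        theta_b = hb / (hb + tb)
    return theta_a, theta_b

# (heads, tails) for five sessions of ten flips each
flips = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
print(em_two_coins(flips, 0.6, 0.5))
```

The two estimates separate as the iterations proceed: one bias climbs toward roughly 0.8 and the other settles near 0.5, even though no session is ever labeled with its coin.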
      10. PageRank
        PageRank originated in the way the influence of academic papers is calculated: the more often a paper is cited, the stronger its influence. Google creatively applied the same idea to weighting web pages: the more other pages link to a page, the more "references" it has, and the more frequently it is linked, the higher its importance. Based on this principle, we can rank the weight of every page.
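The ranking can be computed by simple iteration. A minimal sketch on a four-page toy web (the link structure is invented for illustration, and every page here has at least one out-link, which sidesteps the dangling-page complication real implementations must handle):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration: each page's rank is (1-d)/N plus d times the
    rank flowing in from pages linking to it, split evenly over each
    linker's out-links. d is the usual damping factor."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# A tiny web: every page links to "home", which links back to "a" only.
links = {"home": ["a"], "a": ["home"], "b": ["home"], "c": ["home"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "home" collects the most rank
```

Because "home" receives links from all three other pages, it ends up with the largest share of the total rank, which always sums to 1.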

4. Mathematical Principles of Data Mining

Having covered the classic data mining algorithms: without probability theory and mathematical statistics it is still hard to grasp the essence of these algorithms; without linear algebra it is hard to appreciate the value of matrix and vector operations in data mining; and without the basic concepts of optimization it is hard to deeply understand iterative convergence. So, to understand data mining methods more deeply, it is well worth understanding the mathematics behind them.
  1. Probability theory and mathematical statistics
   Most of us took probability theory in college, but the courses tend to emphasize probability and spend less time on statistics. Probability theory appears in more and more places in data mining: conditional probability, independence, random variables, multi-dimensional random variables, and so on.
   Many algorithms are probabilistic in nature, which makes probability theory and mathematical statistics an important mathematical foundation of data mining.
  2. Linear algebra
   Vectors and matrices are the key topics of linear algebra, and they are used everywhere in data mining. We often abstract an object as a matrix (an image, for instance, can be abstracted as a matrix), and we often compute eigenvalues and eigenvectors, using the eigenvectors to approximate and represent the features of the object. This is the basic idea behind dimensionality reduction for big data.
   Matrix theory and its applications are mature, and the many operations built on matrices help solve practical problems: methods such as PCA, SVD, MF, and NMF are all widely used in data mining.
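The eigenvector idea behind PCA can be sketched with power iteration: repeatedly apply the matrix to a vector and renormalize, and the vector turns toward the dominant eigenvector. A minimal pure-Python sketch (the 2×2 matrix below is a made-up stand-in for a covariance matrix):

```python
def power_iteration(matrix, iters=100):
    """Dominant eigenvalue/eigenvector of a square matrix by power
    iteration; for a covariance matrix this dominant eigenvector is the
    direction PCA would keep when reducing to one dimension."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * av[i] for i in range(n))  # Rayleigh quotient
    return eigenvalue, v

# A symmetric matrix standing in for a 2-D covariance matrix
cov = [[2.0, 1.0], [1.0, 2.0]]
val, vec = power_iteration(cov)
print(round(val, 3), [round(x, 3) for x in vec])  # 3.0, ~[0.707, 0.707]
```

This matrix has eigenvalues 3 and 1, so the iteration settles on the direction [1, 1]/√2; production code would use a linear algebra library rather than hand-rolled loops.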
  3. Graph theory
   The rise of social networks has made graph theory ever more widely applicable. A relationship between two people can be modeled as an edge connecting two nodes in a graph, and the degree of a node can be read as the number of friends a person has. We have all heard of six degrees of separation; Facebook's data showed that, on average, one person is connected to any other through only 3.57 intermediate links. Graph theory is very effective for analyzing network structure, and it also plays an important role in relationship mining and image segmentation.
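Measuring those degrees of separation is just a shortest-path search on the friendship graph. A minimal breadth-first-search sketch (the four-person network is invented for illustration):

```python
from collections import deque

def degrees_of_separation(friends, start, goal):
    """Breadth-first search over a friendship graph: returns the number
    of hops on the shortest path between two people, or None if they
    are not connected at all."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for f in friends.get(person, []):
            if f not in seen:
                seen.add(f)
                queue.append((f, hops + 1))
    return None

friends = {"ann": ["bob"], "bob": ["ann", "cat"],
           "cat": ["bob", "dan"], "dan": ["cat"]}
print(degrees_of_separation(friends, "ann", "dan"))  # 3
```

On a chain like ann–bob–cat–dan the answer is 3 hops; BFS guarantees the shortest path because it explores the graph level by level.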
  4. Optimization methods
   Optimization corresponds to the self-learning process in machine learning: once the machine knows the target, and the training results deviate from it, iterative adjustment is needed, and that adjustment is precisely optimization. In general this iterative learning process is long. Optimization methods were proposed to reach convergence in less time and achieve better results.
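The simplest such iterative adjustment is gradient descent: step against the gradient until the deviation from the target shrinks to nothing. A minimal sketch on a one-variable function (the function and learning rate are chosen purely for illustration):

```python
def gradient_descent(grad, x, lr=0.1, iters=100):
    """Repeatedly step against the gradient; the learning rate `lr`
    trades convergence speed against stability (too large diverges,
    too small converges slowly)."""
    for _ in range(iters):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x=0.0)
print(round(x_min, 4))  # ≈ 3.0
```

Each step multiplies the remaining error by 0.8 here, so the iterate converges geometrically to the minimum at x = 3; real training loops add stopping criteria and adaptive learning rates.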



Origin blog.csdn.net/qq_30868737/article/details/104215525