2. Overview of Data Mining - "Data Mining and Data Operation Practice"

2.1 Data Mining Concepts

         Data mining (Data Mining) is the core step of knowledge discovery in databases (KDD). It refers to the non-trivial process of automatically extracting useful information, patterns, and models hidden in large data sets. In general, data mining draws on theories and techniques from databases, artificial intelligence, machine learning, statistics, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, and spatial data analysis.

2.2 The main differences between statistical analysis and data mining

Compared with traditional statistical analysis techniques, data mining has the following characteristics:

  • Data mining is good at handling big data, especially data sets with millions of rows or more.
  • In practice, data mining relies on dedicated mining tools, many of which do not require a professional statistical background as a prerequisite.
  • Data mining tools are designed around the practical needs of enterprises.
  • From the operator's point of view, data mining technology is used more by enterprise data analysts than by statisticians.

There are significant differences between data mining and statistical analysis in the following aspects:

  • In statistical analysis, it is often necessary to make assumptions in advance about the data distribution and the relationships among variables, to decide which probability function to use to describe those relationships, and to test the statistical significance of the parameters. In data mining applications, there is no need to make any assumptions about the data distribution; data mining algorithms discover the relationships among variables automatically.
  • In forecasting, statistical analysis usually expresses its results as a set of functional relations, whereas data mining focuses on the predicted results themselves and often does not produce an explicit functional relation.

2.3 Main mature technologies of data mining

2.3.1 Decision Tree

In a decision tree, the analyzed data samples are first gathered into a root node and then split layer by layer into branches, finally forming a number of leaf nodes, each of which represents a conclusion.

Building a decision tree requires no domain knowledge. Its biggest advantage is that the series of rules generated from the root to the branches can be easily understood by analysts, and these rules hardly need any further processing: they are ready-made business optimization strategies. In addition, decision tree techniques are very tolerant of the data distribution and are not easily affected by extreme values.

At present, the most commonly used decision tree algorithms are CHAID, CART, ID3:

  • CHAID (Chi-square Automatic Interaction Detector): chi-square automatic interaction detection. It follows a local-optimization principle, meaning that the nodes are treated independently: once a node is determined, the subsequent growth takes place entirely within that node. The chi-square test is used to select the independent variable that has the strongest influence on the dependent variable, and its application presupposes that the dependent variable is categorical. If an independent variable has missing data, the missing value is treated as a separate category.
  • CART (Classification and Regression Tree): classification and regression tree. It focuses on global optimization: the tree is first allowed to grow as much as possible and is then pruned back. The splitting criterion is not the chi-square test but an impurity index such as the Gini coefficient. The tree generated by CART is binary, so each node can only be split into two branches, and the same independent variable can be used repeatedly as the tree grows. If an independent variable has missing data, a surrogate value is found to fill in the missing value.
  • ID3 (Iterative Dichotomiser): iterative dichotomizer. Its defining feature is that independent variables are selected by an information-gain criterion: the attribute with the highest information gain becomes the split attribute of the node, reflecting the idea of maximizing the purity of the split. However, information gain has a drawback: it tends to favor attributes with many distinct values, which can make a split meaningless. The later C4.5 algorithm therefore replaces ID3's information gain with the gain ratio (Gain Ratio), adding a split information (SplitInformation) term to normalize it, as illustrated in the sketch after this list.
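
To make the difference between ID3's information gain and C4.5's gain ratio concrete, here is a minimal Python sketch on a made-up toy attribute; the membership levels and response labels below are purely illustrative and not taken from the book:

```python
# Minimal sketch of the ID3/C4.5 split criteria on a toy, made-up data set.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """ID3 criterion: entropy reduction obtained by splitting on the attribute."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def gain_ratio(attribute_values, labels):
    """C4.5 criterion: information gain normalized by the split information,
    which penalizes attributes with many distinct values."""
    split_info = entropy(list(attribute_values))  # entropy of the attribute itself
    gain = information_gain(attribute_values, labels)
    return gain / split_info if split_info > 0 else 0.0

# Toy example: does a user respond to a campaign, split by membership level?
membership = ["gold", "gold", "silver", "silver", "basic", "basic", "basic", "gold"]
responded  = ["yes",  "yes",  "no",     "yes",    "no",    "no",    "no",    "yes"]
print("information gain:", round(information_gain(membership, responded), 3))
print("gain ratio:      ", round(gain_ratio(membership, responded), 3))
```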

In data operation, decision tree technology serves as a typical supporting technique for classification and prediction problems. It is widely used in user segmentation, behavior prediction, and rule extraction, and can even be used as a variable-screening method in the early stages of other modeling techniques.
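
As a rough illustration of how a CART-style tree yields the "ready-made" rules mentioned above, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on synthetic data; the feature names (recency, frequency, monetary, tenure) are assumptions for illustration only:

```python
# Minimal sketch: a CART-style tree on synthetic data, with readable rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["recency", "frequency", "monetary", "tenure"]  # illustrative only

# CART uses Gini impurity and produces binary splits at every node.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# The root-to-leaf paths are exactly the kind of rules analysts can read directly.
print(export_text(tree, feature_names=feature_names))
print("feature importances:", dict(zip(feature_names, tree.feature_importances_.round(2))))
```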

2.3.2 Neural Network

A neural network produces an output model by taking multiple nonlinear models as input and connecting them through weighted interconnections. The mainstream neural network algorithm today is backpropagation, which learns on a multi-layer feed-forward network consisting of an input layer, one or more hidden layers, and an output layer, as shown below:

[Figure: a multi-layer feed-forward neural network with an input layer, hidden layer(s), and an output layer]

Because a neural network has a massively parallel structure and processes information in parallel, it offers good self-adaptation, self-organization, and high fault tolerance, along with strong learning, memory, and recognition capabilities. Its main disadvantage is that its knowledge and results are not interpretable: no one knows exactly how the nonlinear functions in the hidden layers transform the independent variables. This drawback, however, does not prevent its practical application.

In the process of neural network technology modeling, the following five factors have a significant impact on the model results:

  • The number of layers
  • The number of input variables in each layer
  • The type of connections
  • The degree of connection
  • The transfer function (also called the activation function or squashing function)
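
These five factors map roughly onto the parameters of a standard implementation. Below is a minimal sketch, using scikit-learn's MLPClassifier on synthetic data, of a feed-forward network trained by backpropagated gradients; the layer sizes and learning settings are illustrative assumptions only:

```python
# Minimal sketch: a multi-layer feed-forward network trained by backpropagation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = MLPClassifier(
    hidden_layer_sizes=(8, 4),   # number of hidden layers and units per layer
    activation="logistic",       # transfer (activation / squashing) function
    solver="sgd",                # weights adjusted via backpropagated gradients
    learning_rate_init=0.1,
    max_iter=2000,
    random_state=0,
)
model.fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 3))
```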

In data operation, neural network technology serves as an important technical support for classification and prediction problems and is widely used in user segmentation, behavior prediction, and marketing-response prediction.

2.3.3 Regression

Regression here generally refers to multiple linear regression and logistic regression. Logistic regression is used more often in data-driven operations, covering response prediction, classification, and segmentation. Multiple linear regression is more commonly covered in statistical analysis and is not introduced here.

A logistic regression equation predicts the probability of a binary ("either-or") event. The predicted dependent variable is a probability between 0 and 1; after a logit transformation, this probability can be related to the independent variables by a linear formula:

ln(p / (1 - p)) = b0 + b1x1 + b2x2 + ... + bkxk

where p is the probability of the event and x1, ..., xk are the independent variables.

Logistic regression estimates its parameters by the maximum likelihood method. The advantage of this method is that, for large samples, the parameter estimates are stable, with small bias and small estimation variance.
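
As a minimal sketch of this in practice, the following snippet fits a logistic regression to synthetic response data; the "true" coefficients and feature meanings are assumptions for illustration, and scikit-learn's LogisticRegression fits its coefficients by (penalized) maximum likelihood:

```python
# Minimal sketch: fitting a logistic regression for a yes/no response.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # e.g. recency, frequency, spend (assumed)
logit = 0.8 * X[:, 0] - 1.2 * X[:, 1] + 0.5    # assumed "true" log-odds
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # simulated binary response

model = LogisticRegression()  # coefficients estimated by (penalized) maximum likelihood
model.fit(X, y)

print("intercept:", model.intercept_.round(2))
print("coefficients:", model.coef_.round(2))
print("P(respond) for first 3 customers:", model.predict_proba(X[:3])[:, 1].round(3))
```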

2.3.4 Association Rule

The main purpose of association rule mining is to find frequent patterns (Frequent Patterns), i.e., patterns that recur many times in a data set, and co-occurrence relationships (Co-occurrence Relationships), i.e., items that appear together; a frequent co-occurrence relationship is called an association.

The most classic case of applying association rules is shopping basket analysis. By analyzing the associations between items in a customer's shopping basket, customers' shopping habits can be mined, thereby helping retailers to better formulate targeted marketing strategies. Here is a simple example:

Baby diapers → beer [support=10%, confidence=70%]

Support and confidence reflect the usefulness and certainty of the rule, respectively. This rule states that 10% of customers buy both baby diapers and beer, and 70% of all customers who buy baby diapers also buy beer. After discovering this association rule, diapers and beer can be placed together for promotion, which is the classic marketing case of "beer and diapers" in Walmart.

Among the many association rule data mining algorithms, the most famous is the Apriori algorithm, which is divided into the following two steps:

(1) Generate all frequent itemsets. A frequent itemset is an itemset whose support is higher than the minimum support threshold (min-sup).

(2) Generate all trusted association rules from the frequent itemsets. A trusted association rule here is a rule whose confidence is greater than the minimum confidence threshold (min-conf).
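
The following is a brute-force Python sketch of these two steps on a handful of made-up shopping baskets; it enumerates all itemsets rather than using Apriori's candidate pruning, and the items and thresholds are illustrative only:

```python
# Brute-force sketch of the two Apriori steps on made-up shopping baskets.
from itertools import combinations

baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread", "beer"},
    {"milk", "beer"},
]
min_sup, min_conf = 0.4, 0.6
n = len(baskets)

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= b for b in baskets) / n

# Step 1: generate all frequent itemsets (support >= min_sup).
items = sorted({i for b in baskets for i in b})
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c): support(set(c))
             for c in combinations(items, k) if support(set(c)) >= min_sup}
    if not level:
        break  # by the Apriori property, no larger itemset can be frequent
    frequent.update(level)

# Step 2: generate trusted rules (confidence >= min_conf) from frequent itemsets.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = sup / support(set(lhs))
            if conf >= min_conf:
                rhs = set(itemset - lhs)
                print(f"{set(lhs)} -> {rhs} [support={sup:.0%}, confidence={conf:.0%}]")
```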

Association rule algorithms are useful not only for numerical data sets but also for plain-text documents and web pages. For example, discovering co-occurrence relationships among words and mining web usage patterns form the basis for data mining, search, and recommendation.

2.3.5 Clustering

Clustering algorithms can be divided into partitioning methods (Partitioning Method), hierarchical methods (Hierarchical Method), density-based methods (Density-based Method), grid-based methods (Grid-based Method), and model-based methods (Model-based Method), of which the first two are the most commonly used. Specific clustering methods are covered in most statistics textbooks and are not repeated here.
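
Purely as a quick illustration, here is a minimal sketch contrasting the two most common families, a partitioning method (k-means) and a hierarchical (agglomerative) method, on synthetic data using scikit-learn:

```python
# Minimal sketch: partitioning vs. hierarchical clustering on synthetic data.
from collections import Counter
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # partitioning method
agglo = AgglomerativeClustering(n_clusters=3).fit(X)             # hierarchical method

print("k-means cluster sizes:      ", sorted(Counter(kmeans.labels_).values()))
print("agglomerative cluster sizes:", sorted(Counter(agglo.labels_).values()))
```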

2.3.6 Bayesian Classifier

Bayesian classification methods predict the likelihood of class membership, for example the probability that a given observation belongs to a particular category, based on the observation's attributes. Studies have shown that the naive Bayes method is comparable in performance even to decision tree and neural network algorithms.

Let X be an observation described by measurements on n attributes, and let H be a hypothesis, for example that the observation X belongs to a particular category C. For a classification problem, we want the probability that H holds given the measurement description X, i.e., the probability that the new observation belongs to category C. The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X. If the data set contains m categories, the naive Bayes classifier assigns a given observation to the category with the highest posterior probability; that is, it predicts that X belongs to category Ci if and only if:

P(Ci|X) > P(Cj|X),  1 ≤ j ≤ m, j ≠ i

That is, P(Ci|X) is maximized. By Bayes' theorem, since P(X) is the same for every category, maximizing P(Ci|X) is equivalent to maximizing P(X|Ci)P(Ci).
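
To connect this decision rule to something runnable, here is a minimal hand-rolled sketch on a made-up categorical data set; the attributes, values, and class labels are purely illustrative, and Laplace smoothing is omitted for brevity:

```python
# Hand-rolled sketch of the naive Bayes rule: pick the class maximizing P(X|Ci)P(Ci).
from collections import Counter, defaultdict

# (attribute_1, attribute_2) -> class label (made-up training data)
train = [
    (("young", "high"), "buy"), (("young", "low"), "no"),
    (("old", "high"), "buy"),   (("old", "low"), "buy"),
    (("young", "low"), "no"),   (("old", "high"), "buy"),
]

priors = Counter(label for _, label in train)   # counts for P(Ci)
likelihood = defaultdict(Counter)               # per-attribute counts for P(xk|Ci)
for attrs, label in train:
    for k, value in enumerate(attrs):
        likelihood[(label, k)][value] += 1

def posterior_score(attrs, label):
    """P(X|Ci)P(Ci) under the naive conditional-independence assumption
    (no smoothing: unseen values give a score of zero)."""
    score = priors[label] / len(train)
    for k, value in enumerate(attrs):
        score *= likelihood[(label, k)][value] / priors[label]
    return score

new_obs = ("young", "high")
scores = {label: posterior_score(new_obs, label) for label in priors}
print(scores)                                   # unnormalized posteriors
print("predicted class:", max(scores, key=scores.get))
```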

2.3.7 Support Vector Machine

Compared with traditional neural networks, the support vector machine (SVM) is not only simpler in structure but also offers significantly better performance, which has made it one of the hot topics in machine learning.

Support vector machines are based on the structural risk minimization principle. In the linearly separable case, the optimal separating hyperplane between the two classes of samples is found in the original space. In the nonlinear case, a nonlinear mapping projects the original training data into a higher-dimensional space, where an optimal linear separating hyperplane is sought. With an appropriate, sufficiently high-dimensional nonlinear mapping, two classes of data can always be separated by a hyperplane.

The basic concepts of support vector machines are as follows:

Let the given training sample set be {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi ∈ R^n and yi ∈ {-1, +1}.

Suppose the training set can be separated linearly by a hyperplane; let the hyperplane be (w · x) + b = 0.

The basic idea of SVM can be illustrated by the following two-dimensional case:
[Figure: two classes of samples separated by line H, with parallel lines H1 and H2 through the nearest samples defining the margin]
In the figure, the plus and minus signs represent the two classes of samples, H is the separating line, and H1 and H2 are the straight lines that pass through the samples closest to the separating line and are parallel to it; the distance between them is called the classification margin (Margin). The optimal separating line must not only separate the two classes correctly (zero training error) but also maximize the classification margin. Generalized to a high-dimensional space, the optimal separating line becomes the optimal separating hyperplane.

The vectors closest to the hyperplane are called support vectors, and a set of support vectors uniquely determines the hyperplane. Through its learning algorithm, SVM automatically finds the support vectors that best distinguish the classes, and the resulting classifier maximizes the margin between the classes, which gives it good generalization ability and high classification accuracy. The disadvantage of SVM is that training can be very expensive on large data sets, but its advantages are also obvious: it can model complex nonlinear decision boundaries with high accuracy and is not prone to overfitting.
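
As a minimal sketch of these ideas, the following snippet trains a linear SVM on synthetic data and inspects the support vectors that define the maximum-margin hyperplane, then switches to an RBF kernel as an example of the implicit nonlinear mapping; all data and parameters are illustrative assumptions:

```python
# Minimal sketch: linear SVM (explicit hyperplane) vs. kernel SVM on synthetic data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           class_sep=2.0, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", linear_svm.n_support_)
print("hyperplane: w =", linear_svm.coef_.round(2), " b =", round(linear_svm.intercept_[0], 2))

# For data that is not linearly separable, a kernel implicitly maps the
# observations into a higher-dimensional space (here the RBF kernel).
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy (illustration only):", round(rbf_svm.score(X, y), 3))
```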

2.3.8 Principal Components Analysis

2.3.9 Hypothesis Test

The above two methods, principal component analysis and hypothesis testing, are discussed in detail in common statistical analysis books, and will not be repeated here.
