Internet data mining and analysis explained

1. Definition

Data mining, also translated as data exploration or data dredging, is a step in Knowledge Discovery in Databases (KDD). It generally refers to the process of searching large amounts of data for hidden information by means of algorithms. Data mining is closely associated with computer science and pursues this goal through methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition.

Data mining is currently a hot topic in artificial intelligence and database research. Drawing mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases, and visualization technology, it analyzes enterprise data in a highly automated way, organizes it inductively, and uncovers potential patterns that help decision makers adjust market strategies and reduce risk. Application areas include information retrieval, intelligence analysis, and pattern recognition.
 

2. Data mining objects

The data involved can be structured, semi-structured, or even heterogeneous. The methods used to discover knowledge can be mathematical, non-mathematical, or inductive. The knowledge finally discovered can be used for information management, query optimization, decision support, and maintenance of the data itself.

The object of data mining can be any type of data source. It can be a relational database, a source of structured data; it can also be a data warehouse, text, multimedia data, spatial data, time-series data, or Web data, that is, a source of semi-structured or even heterogeneous data.

3. Data Mining Steps


The data mining process mainly consists of the following steps: defining the problem, establishing a data mining library, analyzing the data, preparing the data, building the model, evaluating the model, and implementation. Let's take a closer look at each step:

(1) Define the problem: The first and most important requirement before starting knowledge discovery is to understand the data and the business problem. You must have a clear definition of your goal, that is, decide exactly what you want to do. For example, when you want to improve the use of your email service, you may want to "increase the user utilization rate" or you may want to "increase the value of a single use by a user." The models built to solve these two problems are almost completely different, so this decision must be made first.

(2) Establish a data mining library: Establishing a data mining library includes the following steps: data collection, data description, selection, data quality assessment and data cleaning, merging and integration, building metadata, loading the data mining library, and maintaining the data mining library.
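As a rough illustration of these sub-steps, here is a minimal pandas sketch of the quality assessment, cleaning, merging, and loading stages; the file names and columns (customers.csv, orders.csv, age, customer_id) are hypothetical.

```python
# A minimal sketch, assuming pandas and two hypothetical CSV files.
import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical source table
orders = pd.read_csv("orders.csv")         # hypothetical source table

# Data quality assessment: count missing values per column.
print(customers.isna().sum())

# Data cleaning: drop exact duplicates and fill missing ages with the median.
customers = customers.drop_duplicates()
customers["age"] = customers["age"].fillna(customers["age"].median())

# Merging and integration: join the two sources on a shared key.
mining_table = customers.merge(orders, on="customer_id", how="left")

# "Loading the data mining library": persist the integrated table.
mining_table.to_parquet("mining_library.parquet", index=False)
```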

(3) Analyze the data: The purpose of analysis is to find the data fields that have the greatest impact on the prediction output and to decide whether derived fields need to be defined. If the data set contains hundreds or thousands of fields, browsing and analyzing the data is a very time-consuming and tiring task; in that case, you need a tool with a good interface and powerful functions to assist you with these tasks.
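A small sketch of this screening step, assuming scikit-learn and pandas, the file produced in the previous sketch, and a hypothetical numeric 0/1 target column named churned:

```python
# Screen fields by their impact on the prediction target (two quick views).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_parquet("mining_library.parquet")   # hypothetical file from the previous step
X = df.select_dtypes("number").drop(columns=["churned"])
y = df["churned"]

# Linear view: absolute correlation of each field with the target.
print(X.corrwith(y).abs().sort_values(ascending=False).head(10))

# Nonlinear view: feature importances from a quick tree ensemble.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```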

(4) Prepare data: This is the last step of data preparation before building the model. This step can be divided into four parts: selecting variables, selecting records, creating new variables, and converting variables.
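A sketch of these four sub-steps on a hypothetical prepared table; the column names are made up for illustration:

```python
# Select variables, select records, create new variables, convert variables.
import numpy as np
import pandas as pd

df = pd.read_parquet("mining_library.parquet")   # hypothetical input

# 1. Select variables: keep only the fields judged relevant in the analysis step.
df = df[["age", "order_total", "order_count", "signup_date", "churned"]]

# 2. Select records: e.g. keep only customers with at least one order.
df = df[df["order_count"] > 0]

# 3. Create new (derived) variables.
df["avg_order_value"] = df["order_total"] / df["order_count"]
df["tenure_days"] = (pd.Timestamp("2023-01-01") - pd.to_datetime(df["signup_date"])).dt.days

# 4. Convert variables: rescale a skewed field and bin a continuous one.
df["order_total_log"] = np.log1p(df["order_total"])
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 40, 60, 120],
                        labels=["<=25", "26-40", "41-60", "60+"])
```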

(5) Build the model: Model building is an iterative process. Different models need to be examined carefully to determine which is most useful for the business problem at hand. Training and testing a data mining model requires splitting the data into at least two parts: one to train the model and the other to test it. First use a portion of the data to build the model, then use the remaining data to test and validate it. Sometimes a third data set, called the validation set, is used as well, because the test set may be influenced by the characteristics of the model, and an independent data set is needed to verify the model's accuracy.
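A minimal scikit-learn sketch of this split-build-test cycle, assuming a prepared feature matrix X and label vector y (for example the table from the previous step); the model type is chosen only for illustration:

```python
# Split the data into training, test, and an independent validation set.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)                                # build the model on one portion
print("test accuracy:", model.score(X_test, y_test))       # compare candidate models here
print("validation accuracy:", model.score(X_val, y_val))   # final independent check
```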

(6) Evaluate the model: After the model is built, the results must be evaluated and the value of the model explained. The accuracy obtained on the test set is only meaningful for the data used to build the model. In practical applications, it is necessary to further understand the types of errors and the costs they cause. Experience has shown that an effective model is not necessarily a correct model; the direct reason is the various assumptions implicit in model building, so it is important to test the model directly in the real world: apply it to a small area first, obtain test data, and roll it out more widely once the results are satisfactory.
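Continuing the sketch above, a confusion matrix separates the error types, whose business costs usually differ; this assumes the fitted model and test data from the previous block:

```python
# Look beyond raw accuracy at the kinds of errors the model makes.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Rows are the true classes, columns the predicted classes; off-diagonal cells
# are the false positives and false negatives whose costs must be weighed.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```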

(7) Implementation: Once the model is established and validated, there are two main ways to use it. The first is to provide analysts with a reference; the other is to apply this model to different data sets.

4. Data mining analysis methods

4.1 Concept

Data mining is divided into guided (supervised) data mining and unguided (unsupervised) data mining. Guided data mining uses the available data to build a model that describes a specific attribute. Unguided data mining looks for relationships among all attributes. Specifically, classification, estimation, and prediction belong to guided data mining, while association rules and clustering belong to unguided data mining. A minimal sketch contrasting the two appears after the list below.

1. Classification: It first selects a training set that has been classified from the data, uses data mining technology on the training set to build a classification model, and then uses the model to classify unclassified data.

2. Estimation: Estimation is similar to classification, but its final output is a continuous value, and the quantity being estimated is not predetermined. Estimation can serve as a preparatory step for classification.

3. Prediction: Prediction is carried out through classification or estimation: a model is obtained by training a classifier or estimator, and if the model achieves high accuracy on a test sample set, it can be used to predict the unknown variables of new samples.

4. Affinity grouping or association rules: The purpose is to discover which things tend to occur together.

5. Clustering: Clustering is a method of automatically finding and establishing grouping rules. It assigns similar samples to the same cluster by judging the similarity between samples.
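As promised above, here is a minimal scikit-learn sketch contrasting guided and unguided mining on the same synthetic data; the data, labels, and chosen algorithms are for illustration only:

```python
# Guided (classification) vs. unguided (clustering) mining on the same points.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Guided: classification uses the known labels y to build a model.
clf = KNeighborsClassifier().fit(X, y)
print("classification accuracy:", clf.score(X, y))

# Unguided: clustering looks for structure without using y at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments for first 10 samples:", km.labels_[:10])
```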

4.2 Analysis methods

1) Decision tree method

Decision trees have strong capabilities for classification and prediction problems. They are expressed in the form of rules, and these rules take the form of a series of questions; by asking questions repeatedly, the required result is finally derived. A typical decision tree has a root at the top and many leaves at the bottom, and it splits records into different subsets, where the fields in each subset may be described by a simple rule. In addition, decision trees can take different shapes, such as binary trees, ternary trees, or mixed types.
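A small decision-tree sketch, assuming scikit-learn and its bundled iris data set; the printed rules show the "series of questions" form described above:

```python
# Fit a shallow decision tree and print its rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each path from the root to a leaf is one rule made of simple questions.
print(export_text(tree, feature_names=list(iris.feature_names)))
```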

2) Neural network method

The neural network method simulates the structure and function of the biological nervous system. It is a nonlinear predictive model learned through training: it treats each connection as a processing unit, attempts to simulate the function of neurons in the human brain, and can carry out various data mining tasks such as classification, clustering, and feature extraction. Learning in a neural network is mainly reflected in the adjustment of weights. Its advantages are interference resistance, nonlinear learning, and associative memory, and it can produce accurate predictions in complex situations. Its disadvantages are that it is not well suited to high-dimensional variables, the intermediate learning process cannot be observed (it behaves as a "black box"), the output is difficult to interpret, and it requires long training times. The neural network method is mainly applied to the clustering side of data mining.
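A minimal feed-forward network sketch, assuming scikit-learn's MLPClassifier (one of many possible neural-network tools); the hidden-layer size and data set are chosen only for illustration:

```python
# A small neural network classifier; scaling matters because learning
# happens through iterative weight updates.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 32 units; the learned weights are the "black box".
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0))
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```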

3) Association rules method

Association rules describe relationships between data items in a database: the occurrence of certain items in a transaction can imply the occurrence of other items in the same transaction, that is, hidden associations or interrelations in the data. In customer relationship management, mining the large volume of data in an enterprise's customer database can reveal interesting relationships among a large number of records, identify the key factors that affect marketing effectiveness, and provide a reference basis for decision support such as product positioning, pricing and customer group definition, customer acquisition, segmentation and retention, marketing and promotion, marketing risk assessment, and fraud prediction.
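A hand-rolled sketch of support and confidence for one candidate rule, using made-up shopping transactions; a real project would typically use an Apriori or FP-growth implementation:

```python
# Compute support and confidence for the rule {diapers} -> {beer}.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"diapers"}, {"beer"}
conf = support(antecedent | consequent) / support(antecedent)
print("support:", support(antecedent | consequent), "confidence:", conf)
```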

4) Genetic algorithm

The genetic algorithm simulates reproduction, mating, and mutation as they occur in natural selection and inheritance. It is a machine learning method based on evolutionary theory that generates rules through operations such as genetic recombination, crossover and mutation, and natural selection. Its basic viewpoint is the principle of "survival of the fittest", and it has properties such as implicit parallelism and ease of combination with other models. Its main advantage is that it can handle many data types and process various data in parallel; its disadvantages are that it requires many parameters, encoding solutions is difficult, and it generally involves a large amount of computation. Genetic algorithms are often used to optimize neural networks and can solve problems that are difficult for other techniques.
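A toy genetic-algorithm sketch in plain Python (evolving bit strings toward all ones), illustrating the selection, crossover, and mutation operations mentioned above; all parameters are arbitrary:

```python
# One-max toy problem: evolve bit strings toward the all-ones string.
import random

LENGTH, POP, GENERATIONS = 20, 30, 40

def fitness(ind):
    return sum(ind)                         # "survival of the fittest" score

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    def pick():                             # selection: tournament of 3
        return max(random.sample(population, 3), key=fitness)
    next_gen = []
    while len(next_gen) < POP:
        a, b = pick(), pick()
        cut = random.randrange(1, LENGTH)   # crossover: recombine the two parents
        child = a[:cut] + b[cut:]
        child = [bit ^ (random.random() < 0.02) for bit in child]   # mutation
        next_gen.append(child)
    population = next_gen

print("best fitness:", max(map(fitness, population)), "of", LENGTH)
```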

5) Cluster analysis method

Cluster analysis divides a set of data into several classes based on their similarities and differences. The goal is to make the similarity between data in the same class as large as possible and the similarity between data in different classes as small as possible. Clustering methods can be divided into four categories: hierarchical clustering methods, partitioning clustering algorithms, density-based clustering algorithms, and grid-based clustering algorithms. Commonly used classic clustering methods include K-means, K-medoids, and ISODATA.
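A K-means sketch on synthetic 2-D points, assuming scikit-learn; the loop shows one common way of choosing the number of clusters by looking for an "elbow" in the inertia:

```python
# Try several cluster counts and print the within-cluster dispersion for each.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia is the within-cluster dispersion; an "elbow" in this curve suggests k.
    print(k, round(km.inertia_, 1))
```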

6) Fuzzy set method

The fuzzy set method uses fuzzy set theory to carry out fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis. Fuzzy set theory uses degrees of membership to describe the attributes of fuzzy things. The higher the complexity of a system, the stronger its fuzziness.
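A library-free sketch of a membership degree: instead of a crisp "hot / not hot" split, each temperature belongs to the fuzzy set "hot" to some degree between 0 and 1 (the thresholds below are made up):

```python
# Piecewise-linear membership function for the fuzzy set "hot".
def membership_hot(temp_c: float) -> float:
    """0 below 20 degrees C, 1 above 35 degrees C, linear in between."""
    if temp_c <= 20:
        return 0.0
    if temp_c >= 35:
        return 1.0
    return (temp_c - 20) / (35 - 20)

for t in (15, 22, 28, 34, 40):
    print(f"{t} C -> membership in 'hot' = {membership_hot(t):.2f}")
```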

7) Web page mining

By mining the Web, we can use its massive data to collect and analyze information related to politics, the economy, policy, technology, finance, various markets, competitors, supply and demand, and customers; to focus on external environmental information and internal operating information that has, or may have, a significant impact on the enterprise; and, based on the analysis results, to identify the various problems that may arise in the course of business management and the precursors of potential crises, so that crises can be identified, analyzed, evaluated, and managed.

8) Logistic regression analysis

It reflects the temporal characteristics of attribute values in a transaction database, produces a function that maps data items to a real-valued predictor variable, and discovers dependencies between variables or attributes. Its main research problems include the trend characteristics of data sequences, the prediction of data sequences, and the correlations between data.
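A minimal logistic regression sketch, assuming scikit-learn; the fitted function maps each record to a real-valued probability, which is then read as a prediction (the data set is chosen only for illustration):

```python
# Fit a logistic regression model and inspect its predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)
print("test accuracy:", logit.score(X_test, y_test))
print("predicted probabilities for first 3 test records:",
      logit.predict_proba(X_test)[:3, 1])
```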

9) Rough set method

It is a new mathematical tool for dealing with vague, imprecise, and incomplete problems. It can handle tasks such as data reduction, discovery of data correlations, and evaluation of data significance. Its advantages are that the algorithm is simple, it requires no prior knowledge about the data, and it can automatically discover the inherent laws of a problem. Its disadvantage is that it is difficult to process continuous attributes directly; attributes must be discretized first, so the discretization of continuous attributes is a difficulty that restricts the practical application of rough set theory.
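Since continuous attributes must be discretized before rough set methods can be applied, here is a small pandas sketch of two common binning strategies on a hypothetical income column:

```python
# Discretize a continuous attribute by equal-width and equal-frequency binning.
import pandas as pd

df = pd.DataFrame({"income": [12_000, 25_000, 40_000, 58_000, 90_000, 130_000]})

# Equal-width bins split the value range into intervals of the same size.
df["income_eqwidth"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])
# Equal-frequency (quantile) bins put roughly the same number of records in each bin.
df["income_eqfreq"] = pd.qcut(df["income"], q=3, labels=["low", "mid", "high"])

print(df)
```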

10) Connection analysis

Link analysis takes relationships as its subject, and many applications have been developed based on the relationships between people, between things, or between people and things. For example, the telecommunications industry can use link analysis to collect the times and frequencies of customers' phone calls, infer customers' usage preferences, and propose solutions that benefit the company. Beyond telecommunications, more and more marketers are also using link analysis for research that benefits the enterprise.
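A small link-analysis sketch, assuming the networkx library and made-up call records; degree centrality highlights the customers at the centre of the calling network:

```python
# Build a call graph and rank customers by how connected they are.
import networkx as nx

calls = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
         ("carol", "dave"), ("dave", "erin"), ("alice", "dave")]

g = nx.Graph()
g.add_edges_from(calls)

# Degree centrality: customers connected to many others are likely influencers.
centrality = nx.degree_centrality(g)
for name, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 2))
```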

Origin blog.csdn.net/m0_68949064/article/details/129494996