What is the data mining process? An easy-to-understand guide

Author: Charu C. Aggarwal
Source: Big Data DT

Overview: The data mining process includes multiple stages such as data cleaning, feature extraction, and algorithm design. This article discusses each of these stages.

01 Data Mining Process

The process of a typical data mining application includes the following stages.

1. Data Collection

Data collection may rely on specialized hardware such as sensor networks, on manually entered user surveys, or on software tools such as Web crawlers that gather documents. Although this stage is highly application-specific and often falls outside the purview of the data mining analyst, it is critical to the data mining process, because the choices made here significantly affect everything that follows.

The data generated during the collection phase is usually first stored in a database, or more broadly a data warehouse, and then processed.

2. Feature extraction and data cleaning

The data obtained during the collection phase is often not in a format suitable for direct processing. For example, it may consist of logs encoded in complex formats or of free-form documents, and in many cases different types of data are mixed together arbitrarily within such documents.

To make such data suitable for further processing, it must be converted into a format that data mining algorithms can work with, such as multidimensional data, time series data, or semi-structured data.

Multidimensional data is the most common format: its different fields correspond to different measured properties, which may be called features, attributes, or dimensions. Extracting these features is a crucial stage of data mining, and feature extraction is usually carried out in parallel with data cleaning, in which missing and erroneous values are estimated or corrected.

In addition, data may in many cases be aggregated from multiple sources and need to be converted into a unified format for processing. The final result of this process is a well-structured data set that computer programs can use effectively. After feature extraction, the data may be stored back in the database for further processing.
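As a minimal sketch of what this conversion can look like, the snippet below turns records from two hypothetical sources with incompatible formats into a single multidimensional (tabular) data set; all field names and the pipe-delimited format are assumptions made for illustration only.

```python
import pandas as pd

# Two hypothetical sources with different, incompatible formats.
source_a = [
    {"id": "C1", "signup": "2021-03-01", "country": "US"},
    {"id": "C2", "signup": "2021-03-05", "country": "DE"},
]
source_b = [
    "C1|age=34|plan=premium",
    "C2|age=45|plan=basic",
]

# Parse the delimited free-form records of source B into dictionaries.
parsed_b = []
for line in source_b:
    cid, *fields = line.split("|")
    record = {"id": cid}
    record.update(dict(f.split("=") for f in fields))
    parsed_b.append(record)

# Unify both sources into a single multidimensional data set:
# one row per entity, one column per feature.
table = pd.DataFrame(source_a).merge(pd.DataFrame(parsed_b), on="id")
print(table)
```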

3. Analytical processing and algorithms

The last step of the data mining process is to design effective analysis methods for the processed data. In many cases, the application at hand cannot be directly posed as one of the standard data mining "superproblems", such as association pattern mining, clustering, classification, or anomaly detection.

However, these four superproblems have wide coverage and serve as the basic building blocks of data mining tasks; most applications can be realized by assembling these components as building blocks.

The entire data mining process can be represented as in Figure 1-1. Note that the analytical processing module in the figure shows a solution designed for a specific application and composed of several building blocks; this part depends on the skill of the analyst. The usual practice is to build it from one or more of the four main problems.
▲Figure 1-1 Data processing pipeline

It must be admitted that not every data mining application can be built from these four main problems, but many can, which is why they deserve special status. Below, we use a recommendation application as an example to explain the entire data mining process.

  • Consider an online retailer that keeps logs of customers' visits to its website and also collects basic information about each customer. Assuming that each page of the site corresponds to a product, a customer's visit to a page may indicate interest in the corresponding product. By analyzing customers' personal data and their buying behavior, the retailer hopes to recommend products to customers in a targeted way.

Example of the solution process: The analyst's first job is to collect data from two different sources: the logs extracted from the website's logging system, and the customers' personal data extracted from the retailer's database. One difficulty is that these two kinds of data use very different formats and cannot easily be processed together. For example, a log entry may take the following form.
[Figure: a sample entry from the Web access log]
The log may contain thousands of such entries. The entry above shows that a customer with IP address 98.206.207.157 visited the page productA.htm. The customer behind an IP address can be identified from previous login information, from cookies stored by the website, or sometimes from the IP address itself, but this identification process is noisy and does not always produce accurate results.
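The exact log layout appeared as an image in the original article; as a rough sketch, the snippet below assumes a common Apache-style access-log format (an assumption, not necessarily the retailer's actual format) and extracts the IP address and the product page from each entry.

```python
import re

# Assumed Apache-style access-log layout (the actual format in the original
# figure is not available); fields: client IP, timestamp, request line, status.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3})'
)

def parse_log_line(line):
    """Return a dict with ip, timestamp, method, page, and status, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed or irrelevant entries can be filtered out later
    return match.groupdict()

# Hypothetical entry, consistent with the IP and page described in the text.
sample = '98.206.207.157 - - [01/Mar/2021:10:15:32 +0000] "GET /productA.htm HTTP/1.1" 200'
print(parse_log_line(sample))
```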

As part of the data cleaning and extraction process, the analyst also needs to design algorithms to filter the log entries effectively, so that only the segments that yield accurate results are used, because the raw log contains a great deal of extra information that is of no use to the retailer.

In the feature extraction stage, the retailer decides to extract features from the Web access log by creating one record per customer, in which each product is an attribute recording the number of times that customer visited the corresponding product page.

This feature extraction therefore has to process each raw log entry and aggregate the features extracted across many entries. Later, during data integration, these attributes are added to the retailer's customer database, which holds the customers' personal data. If entries are missing from a personal-data record, further data cleaning is required.
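A minimal sketch of this aggregation and integration step, assuming the parsed log has already been resolved to customer IDs and loaded into pandas; the column names (customer_id, product, age, city) and the tiny sample values are illustrative assumptions rather than the retailer's actual schema.

```python
import pandas as pd

# Parsed log entries after resolving IP addresses / cookies to customer IDs.
visits = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "product":     ["A", "B", "A", "A", "C", "B"],
})

# Customer personal data from the retailer's database.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age":         [34, 45, 29],
    "city":        ["Austin", "Boston", "Chicago"],
})

# One record per customer; each product becomes an attribute holding the visit count.
visit_counts = (
    visits.groupby(["customer_id", "product"]).size()
          .unstack(fill_value=0)
          .add_prefix("visits_")
          .reset_index()
)

# Data integration: attach the visit-count attributes to the customer profiles.
dataset = customers.merge(visit_counts, on="customer_id", how="left").fillna(0)
print(dataset)
```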

The end result is a data set that combines the customers' personal-information attributes with their per-product visit-count attributes.

At this point, the analyst must decide how to use this cleaned data set to make recommendations to customers. One approach is to divide similar customers into groups and to base recommendations on the buying behavior of each group.

Cluster analysis can serve as the building block for identifying groups of similar customers. For each customer, the products whose pages are visited most often by the customer's group as a whole can then be recommended. This case illustrates a complete data mining process.
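Continuing from the integrated dataset sketched above, the snippet below illustrates one possible way to realize this step, using scikit-learn's KMeans; the number of clusters and the rule of recommending each group's most visited product are illustrative assumptions.

```python
from sklearn.cluster import KMeans

# Use only the visit-count attributes as the clustering features.
feature_cols = [c for c in dataset.columns if c.startswith("visits_")]
X = dataset[feature_cols]

# Group similar customers; the number of clusters is an illustrative choice.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
dataset["cluster"] = kmeans.fit_predict(X)

# For each cluster, recommend the product page visited most often by the group.
top_product_per_cluster = (
    dataset.groupby("cluster")[feature_cols].sum().idxmax(axis=1)
)
dataset["recommendation"] = dataset["cluster"].map(top_product_per_cluster)
print(dataset[["customer_id", "cluster", "recommendation"]])
```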

There are many elegant ways of generating recommendations, each with its own advantages and disadvantages in different situations. The entire data mining process is therefore something of an art, determined largely by the skill of the analyst rather than by any particular technique or building block, and that skill can be acquired only by practicing on many different kinds of data under different application requirements.

02 Data preprocessing stage

The data preprocessing stage may be the most critical stage of the data mining process, yet it rarely gets the attention it deserves, because most discussions of data mining focus on the analysis itself. This stage begins after data collection and consists of the following steps.

1. Feature extraction

Analysts may be faced with huge numbers of raw documents, system logs, and business transactions, with hardly any ready-made guidance on how to turn this raw data into meaningful features. This step depends heavily on the analyst's ability to abstract the features most relevant to the application at hand.

For example, in credit card fraud detection, the charge amount, the frequency of repeated charges, and location information are often effective indicators of fraud, while many other attributes may be of little use. Extracting the right features is therefore a skilled task that requires a thorough understanding of the application domain.
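As a small, hypothetical illustration of this kind of feature construction, the sketch below derives per-card features (typical charge amount, number of charges, number of distinct locations) from raw transactions; the column names and sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw card transactions (all fields are illustrative assumptions).
tx = pd.DataFrame({
    "card_id": ["X", "X", "X", "Y", "Y"],
    "amount":  [25.0, 980.0, 975.0, 12.5, 14.0],
    "city":    ["Austin", "Austin", "Dallas", "Boston", "Boston"],
})

# Per-card features that are often informative for fraud detection:
# typical charge amount, how often charges occur, and how many locations appear.
features = tx.groupby("card_id").agg(
    mean_amount=("amount", "mean"),
    max_amount=("amount", "max"),
    n_charges=("amount", "size"),
    n_cities=("city", "nunique"),
)
print(features)
```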

2. Data cleaning

The data produced by feature extraction may contain errors, and some entries may have been lost during collection and extraction. We may therefore have to discard records containing errors, estimate and fill in missing entries, and remove inconsistencies in the data.
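A minimal sketch of such a cleaning step, assuming the data sits in a pandas DataFrame; the specific policy shown (dropping records with impossible values and filling missing numeric entries with the column median) is one illustrative choice among many.

```python
import pandas as pd

records = pd.DataFrame({
    "age":    [34, None, 29, -5],   # -5 is an obviously erroneous entry
    "visits": [12, 3, None, 7],
})

# Discard records with clearly invalid values.
records = records[(records["age"].isna()) | (records["age"] >= 0)]

# Estimate missing entries with a simple statistic (here, the column median).
records = records.fillna(records.median(numeric_only=True))
print(records)
```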

3. Feature selection and conversion

Many data mining algorithms break down when the data dimensionality is very high, and high dimensionality also amplifies noise, which can lead to mining errors. Methods are therefore needed either to remove features that are irrelevant to the application or to transform the data into a new dimensional space in which analysis is easier.
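One common way to transform data into a lower-dimensional space is principal component analysis; the sketch below uses scikit-learn's PCA on a toy matrix, and the choice of PCA and of two components is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# A small high-dimensional toy matrix: 6 records, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))

# Project into a 2-dimensional space that captures most of the variance.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (6, 2)
```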

A related issue is data transformation, in which an attribute is converted into another attribute of the same or a similar type. For example, converting raw age values into age groups may make the data more effective and convenient to analyze.
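A minimal sketch of this kind of conversion, using pandas' cut to bin raw ages into age groups; the bin boundaries and labels are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([23, 37, 45, 61, 18, 52])

# Convert numeric ages into categorical age groups.
age_groups = pd.cut(
    ages,
    bins=[0, 25, 40, 60, 120],
    labels=["0-25", "26-40", "41-60", "61+"],
)
print(age_groups)
```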

The data cleaning process usually needs to use statistical methods to estimate missing data. In addition, in order to ensure the accuracy of the mining results, it is usually necessary to eliminate incorrect data entries.

Because feature selection and data transformation are highly dependent on the specific analysis problem, they are sometimes not regarded as part of data preprocessing. In some cases, feature selection may even be tightly integrated with a specific algorithm or method, appearing in the form of a wrapper model or an embedded model. In general, however, the feature selection phase is performed before the specific mining algorithm is applied.
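As one illustration of an embedded feature selection model, the sketch below uses scikit-learn's SelectFromModel with an L1-penalized logistic regression, which selects features as a side effect of training; the data and parameters are toy assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy high-dimensional data: 100 records, 20 features, only a few informative.
X, y = make_classification(n_samples=100, n_features=20, n_informative=4, random_state=0)

# Embedded feature selection: the L1 penalty drives irrelevant coefficients to zero,
# and SelectFromModel keeps only the features with non-zero weights.
selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
```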

03 Analysis phase

A major challenge is that each data mining application is unique, which makes it difficult to create flexible, reusable mining techniques that serve many kinds of applications. Nevertheless, certain data mining formulations recur across applications; these are the so-called "superproblems", the basic building blocks of data mining.

How these building blocks are used in a specific data mining application depends largely on the analyst's skill and experience. Although the building blocks themselves can be described precisely, learning to apply them in real applications comes only with practice.

Origin: blog.csdn.net/qq_32727095/article/details/114323495