Data Mining (1)--Basic Knowledge Learning

Table of contents

foreword

text

1. History and development of data mining

a. Basic description

b. Typical knowledge discovery process 

c. Typical data mining system structure

d. There are still many problems in data mining to be further studied

3. Research content and functions of data mining 

a. Research content 

b. Main functions of data mining

4. Common techniques and tools for data mining

a. Data mining commonly used techniques

b. Ten classic algorithms for data mining

c. Tools for data mining

d. Traditional data analysis methods and data mining

5. Data mining application hotspots

6. The main problems faced by data mining

a. Problems faced by mining methods

b. Problems with user interactivity

c. Application and social impact

summary

References


foreword

Since the 1990s, as database technology has come into widespread use, data mining (Data Mining) technology has attracted great attention from academia and industry. Large-scale data have accumulated, and their actual value has yet to be fully realized. To meet the needs of data analysis and management, these data must be converted into useful information and knowledge, which means moving from traditional data statistics to data mining and analysis. In addition, the information and knowledge obtained through data mining can be applied widely across industries, including market development and analysis, business management, production control, engineering design, and scientific exploration. (Excerpt from "Data Mining: Methods and Applications", Xu Hua)

text

1. History and development of data mining

a. Basic description

Data Mining (DM), often used interchangeably with Knowledge Discovery in Databases (KDD), is an interdisciplinary research field involving machine learning, artificial intelligence, database theory, and statistics.
Data mining extracts useful information from large amounts of data in databases. More precisely, it is the non-trivial process of discovering implicit, regular, previously unknown, but potentially useful and ultimately comprehensible information and knowledge from large amounts of incomplete, noisy, fuzzy, and random real-world data.

Not all operations and analysis related to databases belong to the scope of data mining research.

Strictly speaking, Data Mining (DM) is the core step of Knowledge Discovery in Databases (KDD).
The mathematical foundations of data mining have developed hand in hand with statistics.

b. Typical knowledge discovery process 

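[Figure: typical knowledge discovery process, from data cleaning, integration, selection, and transformation through data mining to pattern evaluation and knowledge representation]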
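The steps of a typical knowledge discovery process (data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation) can be sketched as a chain of small functions. This is only a toy illustration on made-up records: every function name, field name, and the `min_count` threshold is an assumption made for this example, not part of any library or of the book.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: bucket numeric age into a coarse label."""
    return [{"age_group": "young" if r["age"] < 30 else "older", "item": r["item"]}
            for r in records]

def mine(records):
    """Data mining: count how often each (age_group, item) pair occurs."""
    return Counter((r["age_group"], r["item"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Knowledge representation: render patterns as readable strings."""
    return [f"{group} customers bought {item} ({count}x)"
            for (group, item), count in sorted(patterns.items())]

# Two hypothetical data sources; one record has a missing age.
store_a = [{"age": 25, "item": "PC", "city": "X"}, {"age": 27, "item": "PC", "city": "Y"}]
store_b = [{"age": 45, "item": "TV", "city": "X"}, {"age": None, "item": "PC", "city": "Z"}]

merged = integrate(clean(store_a), clean(store_b))
prepared = transform(select(merged, ["age", "item"]))
knowledge = present(evaluate(mine(prepared), min_count=2))
print(knowledge)
```

In a real system each step is far more involved, but the order of the calls mirrors the order of the steps listed above.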

c. Typical data mining system structure

[Figure: typical data mining system structure]

d. There are still many problems in data mining to be further studied

There are still many problems in data mining to be further studied, including the following research directions:
① Algorithm efficiency and scalability
② Handling different types of data and data sources
③ Interactivity of data mining systems
④ Information protection and data security in data mining
⑤ Exploring new application areas
⑥ Usability, certainty, and expressiveness of data mining results
⑦ Visual data mining

3. Research content and functions of data mining 

a. Research content 

The most common types of knowledge discovered by data mining are the following five:
① Generalization knowledge (Generalization)
Generalization knowledge is a general description of the characteristics of a category, reflecting the common properties of similar things. It is the result of generalizing, refining, and abstracting the data.
② Association knowledge (Association)
Association knowledge reflects dependence or association between one event and other events. It is also known as a dependency (Dependency) relationship.
③ Classification and clustering knowledge (Classification & Clustering)
Classification knowledge reflects the characteristic features shared by things of the same kind and the discriminating features that distinguish different kinds of things.
④ Prediction knowledge (Prediction)
Prediction knowledge predicts future values. It can also be regarded as association knowledge with time as the key attribute.
⑤ Deviation knowledge (Deviation)
Deviation knowledge describes differences and extreme special cases, revealing abnormal phenomena in which things deviate from the norm, such as special cases outside the standard classes and outliers that fall outside the data clusters.

b. Main functions of data mining

1. Class/Concept Description: Characterization and Discrimination
A concise and accurate descriptive summary of a data set containing a large amount of data is called a class/concept description (Class/Concept Description).
Such a description can be obtained by the following methods:
(1) data characterization
(2) data discrimination
(3) data characterization and comparison

2. Association Analysis
Association analysis (Association Analysis) finds frequently occurring itemset patterns in a given data set. The resulting knowledge is also known as association rules, for example:
age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
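The support and confidence attached to an association rule can be computed directly from transaction data. The sketch below is a minimal illustration: the five transactions are invented (so the numbers differ from the 2% and 60% quoted above), and the helper names `support` and `confidence` are assumptions for this example. Support is the fraction of transactions containing all items of the rule; confidence is the fraction of transactions matching the left-hand side that also match the right-hand side.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base > 0 else 0.0

# Invented toy data: each transaction is a set of attribute values and purchases.
transactions = [
    {"age:20..29", "income:20..29K", "PC"},
    {"age:20..29", "income:20..29K", "PC"},
    {"age:20..29", "income:20..29K", "printer"},
    {"age:30..39", "income:30..39K", "PC"},
    {"age:40..49", "income:40..49K", "TV"},
]
rule_if = {"age:20..29", "income:20..29K"}
rule_then = {"PC"}

s = support(transactions, rule_if | rule_then)      # 2 of 5 transactions
c = confidence(transactions, rule_if, rule_then)    # 2 of the 3 matching the left side
print(f"support = {s:.0%}, confidence = {c:.0%}")
```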

3. Classification and prediction

Data mining research often tries to build a model or descriptive function that describes or distinguishes different classes and concepts, so that the model can be used to make predictions. For example, countries are often classified by climate type into tropical, temperate, and frigid countries, and cars are classified by engine displacement into small-displacement vehicles, large-displacement vehicles, and other types. When data mining is applied to practical problems, classification techniques are often used to predict unknown outcomes or unknown quantitative features.
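As a small illustration of classification, the sketch below labels a car as "small" or "large" with a k-nearest-neighbor vote. The training data, the two features (engine displacement and weight), and the function name `knn_predict` are all invented for this example; a real application would use many more records and properly scaled features.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (displacement in litres, weight in tonnes) -> class label
train = [
    ((1.0, 0.9), "small"), ((1.2, 1.0), "small"), ((1.4, 1.1), "small"),
    ((3.0, 1.8), "large"), ((3.5, 2.0), "large"), ((4.0, 2.2), "large"),
]

print(knn_predict(train, (1.3, 1.0)))
print(knn_predict(train, (3.2, 1.9)))
```

The model here is simply the labeled training set itself; the prediction for an unseen car comes from the classes of the most similar known cars.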

4. Cluster analysis
The data handled by cluster analysis, whether during learning or when objects are assigned to groups, have no class labels determined in advance.
Clustering principle:

Maximize the similarity within the class

Minimize the similarity between classes 
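One common way to pursue this principle is the k-means algorithm, which groups points around centroids so that points in the same cluster are close to each other. The sketch below is a plain-Python illustration under simplifying assumptions: the points are invented, the initial centroids are picked by hand, and the function name `kmeans` and the fixed iteration count are choices made for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate an assignment step and a centroid update step."""
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

On this toy data the two groups are well separated, so the three points near (1, 1.5) and the three points near (8.5, 8.5) end up in different clusters.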

5. Outlier analysis
Most data mining methods discard outliers as noise or exceptions, yet outliers can be of interest in their own right, and they can be detected using statistical tests.
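A very simple statistical test of this kind flags values that lie far from the mean, measured in standard deviations (a z-score). The sketch below assumes roughly bell-shaped data; the readings, the function name `zscore_outliers`, and the threshold of 2.0 are assumptions chosen for this example, and other thresholds or more robust tests may suit real data better.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_outliers(readings))
```

Note that a single extreme value also inflates the mean and standard deviation it is compared against, which is one reason the threshold has to be chosen with care.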

6. Evolution Analysis
Data evolution analysis (Evolution Analysis) models and describes the regularities and trends of data objects that change over time.
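As a minimal illustration of describing a trend over time, the sketch below fits a straight line to a short time series by least squares and extrapolates one step ahead. The sales figures and the function name `linear_trend` are invented for this example, and a straight line is only one of many possible trend models.

```python
def linear_trend(series):
    """Least-squares slope and intercept of series against time steps 0, 1, 2, ..."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return slope, y_mean - slope * t_mean

# Hypothetical monthly sales figures.
monthly_sales = [100, 104, 109, 115, 118, 125]
slope, intercept = linear_trend(monthly_sales)
forecast = intercept + slope * len(monthly_sales)  # value predicted for the next month
print(f"trend: {slope:.2f} per month, next month about {forecast:.1f}")
```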

4. Common techniques and tools for data mining

a. Data mining commonly used techniques

Prediction techniques, cluster analysis, evolutionary computing, fuzzy logic, game trees, statistical analysis, decision and control theory, parallel computing, massive storage, association rule techniques, rough set techniques, grey systems, artificial intelligence, knowledge reasoning, and visualization techniques.

b. Ten classic algorithms for data mining

1. Decision tree classifier C4.5 (classification algorithm)

2. K-means algorithm (clustering algorithm)

3. Support vector machine (classification algorithm)

4. Apriori algorithm (frequent pattern analysis algorithm)

5. Expectation-maximization (EM) algorithm (parameter estimation and clustering algorithm)

6. PageRank algorithm (link-based ranking algorithm)

7. AdaBoost algorithm (ensemble of weak classifiers)

8. K nearest neighbor classification algorithm (classification algorithm)

9. Naive Bayesian Algorithm (Classification Algorithm)

10. Classification and regression tree (CART) algorithm (classification and regression algorithm)


C4.5 (61 votes)
K-Means (60 votes)
SVM (58 votes)
Apriori (52 votes)
EM (48 votes)
PageRank (46 votes)
AdaBoost (45 votes)
kNN (45 votes)
Naive Bayes (45 votes)
CART (34 votes)

c. Tools for data mining

1. Neural network-based tools
Neural networks are used for classification, feature mining, prediction, and pattern recognition.
2. Tools based on rules and decision trees
Their main advantage is that both rules and decision trees are human-readable.
3. Tools based on fuzzy logic
These tools use fuzzy logic for data querying, ranking, and similar tasks.
4. Comprehensive multi-method tools
These tools are generally large in scale and suitable for large databases (including parallel databases).

d. Traditional data analysis methods and data mining

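Traditional data analysis methods have difficulty with the following kinds of data, which data mining is intended to handle:

(1) Massive data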

(2) High-dimensional data

(3) Highly complex data. Typical kinds of complex data encountered in daily work include:

① Data streams and sensor data.

② Time series data, that is, data sequences that change over time.

③ Structured data, graphs, social networks, and multi-linked relational data.

④ Heterogeneous databases and legacy databases.

⑤ Spatial data, spatiotemporal data, multimedia data, and Web data.

⑥ Software programs, scientific simulation data, etc.
 

5. Data mining application hotspots

Data mining technology arose from direct business needs and is valuable across many fields.
1. Applications in the financial field
2. Online financial transactions
3. Retail business applications
4. Medical and telecommunications applications

6. The main problems faced by data mining

a. Problems faced by mining methods

(1) In practice, a mining method is usually expected to discover different kinds of knowledge from different types of data.

(2) Data mining usually targets very large data sets, so the performance of mining algorithms is a constant concern.

(3) In descriptive data mining tasks, the frequent patterns or regularities that are found must be assessed with suitable pattern evaluation measures.

(4) Data mining often serves users with different professional backgrounds. How to incorporate relevant background knowledge into the mining method so that mining is more targeted is also an important research issue.

(5) The data to be mined are often noisy and incomplete, so mining methods must be able to cope with such data.

(6) In recent years, as parallel computing has matured and cloud computing platforms have been built, mining methods for massive data are increasingly required to be parallel, distributed, and incremental.

(7) The mining algorithm should be able to actively integrate the knowledge it discovers, that is, to achieve knowledge fusion.
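To make the "incremental" requirement in item (6) concrete, the sketch below keeps a running mean and variance that are updated one record at a time, so earlier data never has to be rescanned. It uses Welford's update formula; the class name `RunningStats` and the sample values are assumptions made for this example, and real incremental mining algorithms maintain far richer models than two summary statistics.

```python
class RunningStats:
    """Incrementally maintained count, mean, and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new value into the statistics without storing past values."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # each arriving record updates the summary in O(1)
print(stats.n, stats.mean, stats.variance)
```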

b. Problems with user interactivity

(1) A data-mining-oriented query language is needed so that data mining can be carried out in real time.

(2) Techniques for representing and visualizing mining results are needed so that results are presented to users intuitively. In other words, research on visualization methods for data mining is required.

(3) Users often need to implement interactive mining at multiple levels of abstraction, that is, the entire data mining process is required to be interactive.

c. Application and social impact

(1) In terms of application, there is an urgent need for domain-oriented data mining and for "invisible" data mining that runs without ordinary users perceiving it.

(2) When data mining is applied, the protection of data security, integrity, and privacy must be strengthened.

summary

This chapter analyzes some basic concepts of data mining in detail, describes the history and development of data mining technology, summarizes the research content and functions of data mining, reviews existing data mining techniques and tools, and introduces data mining application hotspots.

As an inevitable result of the development of database technology, data mining technology has been extensively researched and applied. Data mining discovers valuable knowledge from massive data. A typical knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation. Data mining can be carried out on different data warehouses. It can perform tasks such as data characterization and discrimination, association analysis, classification, clustering, outlier analysis, and trend analysis. (Excerpt from "Data Mining: Methods and Applications", Xu Hua)

References

"Data Mining: Methods and Applications" by Xu Hua

Origin blog.csdn.net/weixin_53197693/article/details/129247208