Data Mining Study Notes

1. Overview

Data mining has become popular mainly because large amounts of data are now widely available, and there is an urgent need to turn that data into useful information and knowledge.

Data mining is a natural outcome of the evolution of information technology. In the database industry, this evolution shows in the successive development of the following functions: data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and data analysis and understanding (involving data warehousing and data mining).

 

Now, data can be stored in different types of databases.

A data warehouse is a repository in which data from multiple heterogeneous sources is organized under a unified schema at a single site to support management decision making. Data warehousing techniques include data cleaning, data integration, and online analytical processing (OLAP). OLAP is an analysis technique that supports summarization, consolidation, and aggregation, as well as viewing information from different perspectives.

 

Data mining tools perform data analysis and can uncover important data patterns, contributing greatly to business decisions, knowledge bases, and scientific and medical research. The gap between data and information calls for the systematic development of data mining tools that turn data tombs into "golden nuggets" of knowledge.

 

 

Data mining is the extraction or "mining" of knowledge from large amounts of data. Terms with a similar meaning include knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

 

Knowledge discovery process:

Data cleaning (removing noise or inconsistent data)

Data integration (multiple data sources can be combined)

Data selection (extracting data from the database relevant to the analytical task)

Data transformation (transformation or unification of data into a form suitable for mining; e.g., by summarizing or aggregating operations)

Data mining (the essential step, in which intelligent methods are applied to extract data patterns)

Pattern evaluation (identifying the truly interesting patterns that represent knowledge, according to some interestingness measure)

Knowledge presentation (using visualization and knowledge representation techniques to present the mined knowledge to the user)
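
The steps above can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions: the records, field names, the age-group generalization, and the spend threshold are all invented for the example, and the "mining" step is a deliberately trivial aggregate rather than a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"customer": "ann", "age": 34, "spend": 120.0},
            {"customer": "bob", "age": None, "spend": 80.0}]
source_b = [{"customer": "cat", "age": 41, "spend": 200.0},
            {"customer": "dan", "age": 29, "spend": 40.0}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: generalize age into a coarse age group."""
    return [{"age_group": "under_40" if r["age"] < 40 else "40_plus",
             "spend": r["spend"]} for r in records]

def mine(records):
    """Data mining: average spend per age group (a toy 'pattern')."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_group"], []).append(r["spend"])
    return {g: mean(v) for g, v in groups.items()}

def evaluate(patterns, min_spend):
    """Pattern evaluation: keep patterns above an interestingness threshold."""
    return {g: s for g, s in patterns.items() if s >= min_spend}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{g}: average spend {s:.1f}" for g, s in sorted(patterns.items())]

integrated = integrate(clean(source_a), clean(source_b))
data = transform(select(integrated, ["age", "spend"]))
patterns = mine(data)
report = present(evaluate(patterns, min_spend=100.0))
```

Here `report` holds only the age group whose average spend passes the threshold. In a real system each function would be a substantial component in its own right.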

 

The data mining step can interact with the user or with the knowledge base. Interesting patterns are provided to the user or stored in the knowledge base as new knowledge.

 

A broad view of data mining: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

 

A typical data mining system has the following main components:

Database, data warehouse, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration can be performed on the data.

Database or data warehouse server: Based on the user's data mining request, the database or data warehouse server is responsible for fetching the relevant data.

Knowledge base: This is the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge may include concept hierarchies, which organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs can also be included, and can be used to assess a pattern's interestingness based on its unexpectedness. Other examples of domain knowledge are interestingness constraints or thresholds, and metadata (e.g., descriptions of data from multiple heterogeneous sources).

Data mining engine: This is the essential part of the data mining system. It consists of a set of functional modules for characterization, association, classification, cluster analysis, and evolution and deviation analysis.

Pattern evaluation module: This component typically applies interestingness measures and interacts with the data mining modules to focus the search on interesting patterns. It may use an interestingness threshold to filter the discovered patterns. Alternatively, the pattern evaluation module can be integrated with the mining module, depending on how the data mining method is implemented. For efficient data mining, it is recommended to push pattern evaluation as deep as possible into the mining process, so that the search is confined to interesting patterns.

Graphical user interface: This module communicates between the user and the data mining system. It lets the user interact with the system by specifying a data mining query or task, providing information that helps focus the search, and performing exploratory data mining based on intermediate results. In addition, this component lets the user browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
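
As a small illustration of the concept hierarchies mentioned under the knowledge base, the sketch below maps attribute values to higher layers of abstraction. The `PARENT` table and the location values are hypothetical, chosen only to show the idea.

```python
# Hypothetical concept hierarchy for a "location" attribute:
# each value maps to its parent at the next level of abstraction.
PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "British Columbia": "Canada",
    "Ontario": "Canada",
}

def generalize(value, levels=1):
    """Climb the hierarchy by `levels` steps, stopping at the top."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value

def roll_up(values, levels=1):
    """Count attribute values after generalizing them to a higher level."""
    counts = {}
    for v in values:
        key = generalize(v, levels)
        counts[key] = counts.get(key, 0) + 1
    return counts

cities = ["Vancouver", "Victoria", "Toronto", "Vancouver"]
by_province = roll_up(cities, levels=1)
by_country = roll_up(cities, levels=2)
```

The same city-level data can then be inspected per province or per country without changing the underlying records.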

 

 

Data mining involves the integration of multidisciplinary technologies, including database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information extraction, image and signal processing, and spatial data analysis.

 

An algorithm is scalable if, given the available system resources such as memory and disk space, its running time grows linearly with the size of the database. Through data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on.

 

 

In principle, data mining can be performed on any kind of information repository. This includes relational databases, data warehouses, transactional databases, advanced database systems, flat files, and the World Wide Web. Advanced database systems include object-oriented and object-relational databases, as well as databases for specific applications, such as spatial databases, time-series databases, text databases, and multimedia databases. The challenges and techniques of mining may differ for each kind of repository.

 

A data warehouse is constructed through data cleaning, data transformation, data integration, data loading, and periodic data refreshing.

 

Typically, a data warehouse is modeled with a multidimensional database structure, in which each dimension corresponds to one attribute or a group of attributes in the schema, and each cell stores the value of some aggregate measure. The actual physical structure of a data warehouse can be a relational data store or a multidimensional data cube.

A data cube provides a multidimensional view of the data and allows fast access to precomputed, summarized data.
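
To make the idea of precomputed aggregate cells concrete, here is a minimal sketch of a tiny data cube built in memory. The fact rows, the three dimensions (quarter, city, item), and the use of "*" to mean "aggregated over this dimension" are assumptions made for the example, not a description of how any particular warehouse stores its data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (quarter, city, item, units_sold)
facts = [
    ("Q1", "Vancouver", "TV", 10),
    ("Q1", "Toronto", "TV", 7),
    ("Q1", "Vancouver", "phone", 5),
    ("Q2", "Vancouver", "TV", 8),
]

def build_cube(rows):
    """Precompute the summed measure for every subset of the dimensions."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(range(len(dims)), k):
                # Dimensions that are not kept are aggregated away ("*").
                cell = tuple(d if i in kept else "*" for i, d in enumerate(dims))
                cube[cell] += measure
    return dict(cube)

cube = build_cube(facts)
total = cube[("*", "*", "*")]           # all sales
q1_tv = cube[("Q1", "*", "TV")]         # TVs in Q1, across all cities
vancouver = cube[("*", "Vancouver", "*")]  # everything sold in Vancouver
```

Once the cube is built, each of these views is a single dictionary lookup rather than a scan of the fact rows, which is the point of precomputation.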

 

A data warehouse collects information about subjects that span the entire organization, and is therefore enterprise-wide in scope. A data mart, on the other hand, is a departmental subset of a data warehouse. It focuses on selected subjects and is therefore department-wide in scope.

 

In general, a transactional database consists of a file in which each record represents a transaction. A transaction typically includes a unique transaction identifier (trans_ID) and a list of the items that make up the transaction.

 

 

Object-oriented databases are based on the object-oriented programming paradigm. In general terms, each entity is treated as an object.

 

Data and code involving an object are encapsulated in a unit. Each object is associated with:

A set of variables that describe the object. These correspond to attributes in the entity-relationship and relational models.

A set of messages that the object can use to communicate with other objects or with the rest of the database system.

A set of methods, where each method holds the code that implements a message. On receiving a message, the method returns a value in response.

 

Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class. Object classes can be organized into class/subclass hierarchies, so that each class represents the properties common to the objects in that class.
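
The object model described here maps closely onto classes in an object-oriented programming language. The sketch below is a loose analogy in Python, not a database implementation: the `Employee` and `Manager` classes, their attributes, and the `send` helper are hypothetical names introduced for illustration.

```python
class Employee:
    """Object class: variables describe the data, methods implement messages."""

    def __init__(self, name, salary):
        # Variables (attributes in entity-relationship terms).
        self.name = name
        self.salary = salary

    def annual_pay(self):
        """Method implementing the hypothetical 'annual_pay' message."""
        return self.salary * 12


class Manager(Employee):
    """Subclass: inherits Employee's variables and methods, adds a bonus."""

    def __init__(self, name, salary, bonus):
        super().__init__(name, salary)
        self.bonus = bonus

    def annual_pay(self):
        return super().annual_pay() + self.bonus


def send(obj, message):
    """Deliver a message to an object and return the method's response value."""
    return getattr(obj, message)()


staff = [Employee("ann", 3000), Manager("bob", 5000, 10000)]
responses = [send(person, "annual_pay") for person in staff]
```

Both objects answer the same message, but the `Manager` instance responds with the method defined in its own subclass.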

 

Temporal databases and time-series databases both store time-related data. A temporal database usually stores data that includes time-related attributes. These attributes may involve several timestamps, each with different semantics. A time-series database stores sequences of values that change over time.

Text databases are databases that contain textual descriptions of objects; multimedia databases store image, audio, and video data.

 

Data mining functionalities specify the kinds of patterns to be found in data mining tasks. In general, data mining tasks fall into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

 

Data can be associated with classes or concepts, and it can be useful to describe each class or concept in summarized, concise, yet precise terms. Such a description is called a class/concept description. It can be derived by (1) data characterization, which summarizes the data of the class under study (often called the target class); (2) data discrimination, which compares the target class with one or more comparison classes (often called contrasting classes); or (3) both data characterization and discrimination.

 

Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected by a database query.

 

 

The output of data characterization can be presented in various forms, including pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables such as crosstabs. The resulting descriptions can also be presented as generalized relations or in rule form (called characteristic rules).

 

Data discrimination is a comparison of the general features of target class objects with the general features of objects from one or more contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are retrieved through database queries.
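
A minimal sketch of summarizing a target class and comparing it with a contrasting class might look like the following. The customer records, the class labels, and the choice of mean values as the summary are assumptions for illustration; real systems use far richer generalization techniques.

```python
from statistics import mean

# Hypothetical customer records: (class label, age, yearly purchases)
customers = [
    ("big_spender", 45, 12), ("big_spender", 50, 15), ("big_spender", 40, 9),
    ("budget", 25, 2), ("budget", 30, 4),
]

def characterize(records, target):
    """Summarize the general features of one class of data."""
    rows = [r for r in records if r[0] == target]
    return {"count": len(rows),
            "mean_age": mean(r[1] for r in rows),
            "mean_purchases": mean(r[2] for r in rows)}

def discriminate(records, target, contrast):
    """Compare the target class summary with a contrasting class summary."""
    t, c = characterize(records, target), characterize(records, contrast)
    return {k: t[k] - c[k] for k in ("mean_age", "mean_purchases")}

summary = characterize(customers, "big_spender")
difference = discriminate(customers, "big_spender", "budget")
```

`summary` describes the target class on its own, while `difference` shows how far it sits from the contrasting class on each summarized feature.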

 

Association analysis discovers association rules, which show attribute-value conditions that frequently occur together in a given set of data. Association analysis is widely used for market basket and transaction data analysis.
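
Association rules are commonly scored by support and confidence. These two measures are not defined in the notes above, so treat the sketch below as an assumption about how "frequently occur together" is usually quantified. The transactions and item names are hypothetical.

```python
# Hypothetical market basket data: trans_ID -> items bought together.
transactions = {
    "T100": {"computer", "software", "mouse"},
    "T200": {"computer", "software"},
    "T300": {"computer", "printer"},
    "T400": {"milk", "bread"},
}

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the share that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: computer => software
rule_support = support({"computer", "software"})
rule_confidence = confidence({"computer"}, {"software"})
```

For the rule computer => software, half of all transactions contain both items, and two of the three transactions containing a computer also contain software.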

 

 

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class labels are known).

Classification predicts the class label of a data object. When the value to be predicted is numeric rather than a class label, the task is usually called prediction.
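
As one very simple instance of learning from labeled training data, the sketch below uses a nearest-neighbor rule. This technique is chosen here only for brevity and is not named in the notes; the feature values and labels are hypothetical.

```python
import math

# Hypothetical training set: (features, class label) with known labels.
training = [
    ((1.0, 1.0), "low_risk"),
    ((1.5, 2.0), "low_risk"),
    ((8.0, 9.0), "high_risk"),
    ((9.0, 8.5), "high_risk"),
]

def classify(point):
    """Predict the label of `point` as the label of its nearest training object."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], point))
    return label

# Objects whose class labels are unknown.
predicted = [classify((2.0, 1.0)), classify((8.5, 9.5))]
```

Decision trees, rule sets, and neural networks are other common model forms; the essential pattern of fitting on labeled data and then predicting unlabeled objects stays the same.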

 

Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.

 

Unlike classification and prediction, clustering analyzes data objects without consulting known class labels. In general, class labels are absent from the training data simply because they are not known to begin with. Clustering can be used to generate such labels. Objects are clustered or grouped according to the principle of maximizing intra-class similarity and minimizing inter-class similarity. That is, clusters are formed so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Each cluster can be viewed as a class of objects, from which rules can be derived. Clustering also facilitates taxonomy formation, that is, organizing observations into a hierarchy of classes that group similar events together.
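
The grouping principle can be illustrated with a bare-bones one-dimensional k-means loop. k-means is one of many clustering methods and is an assumption of this example, as are the data values and the hand-picked starting centers.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move centers to group means."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled observations and starting centers.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
```

No labels are supplied; the two groups emerge from the distances alone, and the cluster index can then serve as a generated class label.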

 

 

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. The analysis of outlier data is called outlier mining. Outliers can be detected using statistical tests.
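
One simple statistical test flags values that lie many standard deviations from the mean (a z-score test). The sketch assumes roughly bell-shaped data, and the threshold of 2.5 and the sample readings are arbitrary choices for illustration.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.5):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = z_score_outliers(readings)
```

On very small samples a single extreme value inflates the standard deviation enough to hide itself, so this test works best when there are many ordinary observations.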

 

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, the distinct features of such analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
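
A very small example of time-series trend analysis is smoothing with a moving average and reading off the overall direction. The sales figures, the window size, and the three trend labels are assumptions for the sketch.

```python
def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def trend(series, window=3):
    """Label the overall direction by comparing the first and last smoothed values."""
    smooth = moving_average(series, window)
    if smooth[-1] > smooth[0]:
        return "rising"
    if smooth[-1] < smooth[0]:
        return "falling"
    return "flat"

# Hypothetical monthly sales figures.
sales = [10, 12, 11, 13, 15, 14, 16]
smoothed = moving_average(sales, 3)
direction = trend(sales)
```

Smoothing removes the month-to-month wobble in `sales`, so the underlying upward movement becomes visible in `smoothed`.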

 

A pattern is interesting if (1) it is easy to understand, (2) it is valid on new or test data with some degree of certainty, (3) it is potentially useful, and (4) it is novel. A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.

 

Depending on the kind of data to be mined or on the given data mining application, a data mining system may also integrate techniques from spatial data analysis, information extraction, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

 

Data mining systems can be categorized according to various criteria: by the kind of database mined, by the kind of knowledge mined, by the techniques used, and by the application.

 

Major issues in data mining

Mining methodology and user interaction issues: These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.

 

Issues relating to the diversity of database types. Handling of relational and complex types of data: Because relational databases and data warehouses are widely used, it is important to develop efficient and effective data mining systems for them. However, other databases may contain complex data objects, hypertext and multimedia data, spatial data, temporal data, or transaction data. Given the diversity of data types and the different goals of data mining, it is unrealistic to expect one system to mine all kinds of data. Specific data mining systems should be constructed for mining specific kinds of data, so different data mining systems can be expected for different kinds of data.

 

 

