"Data mining concepts and technologies" study notes

Reprinted from "Data mining concepts and technologies" study notes (1)

Chapter 1: Introduction
Why data mining?
To solve the "data rich, but information poor" problem.
Explosive growth of widely available, huge volumes of data -> the data age -> the need for powerful and versatile tools that find valuable information in massive data.

What is data mining?
Mining knowledge from data.

Knowledge discovery from data involves the following steps:
Data cleaning (removing noise and inconsistent data)
Data integration (combining multiple data sources)
Data selection (retrieving from the database the data relevant to the analysis task)
Data transformation (transforming and consolidating the data into a form suitable for mining, for example through summary or aggregation operations)
Data mining (the essential step, in which intelligent methods are applied to extract data patterns)
Pattern evaluation (identifying, based on interestingness measures, the truly interesting patterns that represent knowledge)
Knowledge presentation (using visualization and knowledge representation techniques to present the mined knowledge to users)
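
The steps above can be sketched as a small pipeline. This is only an illustration under made-up assumptions: the two sample sources, the record fields (`id`, `city`, `amount`), and the threshold of 100 are hypothetical and do not come from the notes.

```python
from collections import defaultdict

# Two hypothetical data sources, e.g. the databases of two branches.
branch_a = [
    {"id": 1, "city": "Paris", "amount": 120.0},
    {"id": 2, "city": "Paris", "amount": None},   # missing value
]
branch_b = [
    {"id": 3, "city": "Rome", "amount": 80.0},
    {"id": 4, "city": "Paris", "amount": 200.0},
    {"id": 4, "city": "Paris", "amount": 200.0},  # duplicate record
]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [row for source in sources for row in source]

def clean(rows):
    """Data cleaning: drop rows with a missing amount or a repeated id."""
    seen, result = set(), []
    for row in rows:
        if row["amount"] is None or row["id"] in seen:
            continue
        seen.add(row["id"])
        result.append(row)
    return result

def select(rows):
    """Data selection: keep only the fields relevant to the task."""
    return [(row["city"], row["amount"]) for row in rows]

def transform(pairs):
    """Data transformation: aggregate amounts per city."""
    totals = defaultdict(float)
    for city, amount in pairs:
        totals[city] += amount
    return dict(totals)

def mine(totals):
    """Data mining (toy version): rank cities by total sales."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def evaluate(ranked, threshold):
    """Pattern evaluation: keep results above an interestingness threshold."""
    return [(city, total) for city, total in ranked if total > threshold]

def present(patterns):
    """Knowledge presentation: format the results for a reader."""
    return [f"{city}: total sales {total:.2f}" for city, total in patterns]

# Cleaning runs after integration here so that duplicates across
# sources are caught; both belong to the preprocessing stage.
rows = clean(integrate(branch_a, branch_b))
totals = transform(select(rows))
patterns = evaluate(mine(totals), threshold=100.0)
for line in present(patterns):
    print(line)
```

Each function maps to one step, so a real project could swap any stage (for instance a proper mining algorithm in `mine`) without touching the others.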

What kinds of data can be mined?
Database data
A database system consists of a collection of interrelated data together with a set of software programs for managing and accessing the data.
A relational database is a collection of tables (attributes -> fields or columns, tuples -> records or rows). Each tuple represents an object that is identified by a unique key and described by a set of attribute values.
Data warehouse
A repository of information collected from multiple data sources (e.g. the databases of a company's branches all over the world).

Transaction data
Each record represents a transaction.
A transaction contains a unique transaction identifier and the list of items that make up the transaction.
For example, in transaction no. 50 at a shopping mall, the customer buys three items: A, D, and F.
Other kinds of data
Spatial data, hypertext and multimedia data, and so on.
What kinds of patterns can be mined?
Descriptive (characterizing the properties of the data in the target data set)
Predictive (performing induction on the current data in order to make predictions)
Class/concept description: characterization and discrimination
Data characterization: summarizing the general properties or features of the target class of data.
Data discrimination: comparing the target class with one or more contrasting classes.
For example: comparing customers who regularly purchase computer products with customers who do not.
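
As a rough illustration of the two ideas, the sketch below summarizes a target class (characterization) and then compares it with a contrasting class (discrimination). The customer records, the field names, and the choice of mean age and mean income as the summary are all invented for this example.

```python
from statistics import mean

# Hypothetical customer records; fields and values are invented.
customers = [
    {"age": 25, "income": 3000, "frequent_buyer": True},
    {"age": 31, "income": 4200, "frequent_buyer": True},
    {"age": 28, "income": 3900, "frequent_buyer": True},
    {"age": 52, "income": 2500, "frequent_buyer": False},
    {"age": 47, "income": 2700, "frequent_buyer": False},
]

def characterize(rows):
    """Characterization: summarize the general properties of one class."""
    return {
        "count": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        "mean_income": mean(r["income"] for r in rows),
    }

def discriminate(target, contrast):
    """Discrimination: difference between target and contrasting summaries."""
    t, c = characterize(target), characterize(contrast)
    return {key: t[key] - c[key] for key in ("mean_age", "mean_income")}

target = [r for r in customers if r["frequent_buyer"]]
contrast = [r for r in customers if not r["frequent_buyer"]]

summary = characterize(target)
difference = discriminate(target, contrast)
print("target class:", summary)
print("target minus contrast:", difference)
```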

Mining frequent patterns, associations, and correlations
Frequent patterns: patterns that appear frequently in the data.

Frequent itemsets: sets of items that frequently appear together in a transactional data set (e.g. customers often buy milk and bread together at a store)
Frequent sequences: e.g. customers first buy a digital camera and then buy a memory card.
Frequent substructures
Association analysis (e.g. analyzing which items are frequently purchased together)

"computer" => "software" [support = 1%, confidence = 50%] means that a computer and software are purchased together in 1% of all transactions, and that a customer who buys a computer has a 50% chance of also buying software.
Correlation analysis (uncovering statistical correlations between associated attribute-value pairs)
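
To make support and confidence concrete, here is a minimal sketch that counts frequent itemsets by brute force and scores one rule. The five transactions are invented toy data, so the numbers differ from the 1% / 50% rule quoted above; `min_support` and `max_size` are parameters of this sketch, not of any particular library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent).

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Brute-force counting is fine for toy data; algorithms such as
    Apriori or FP-growth prune the search space on real data.
    """
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), size):
                counts[frozenset(itemset)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(transactions, min_support=0.6)
rule_support = support(transactions, {"milk", "bread"})
rule_confidence = confidence(transactions, {"milk"}, {"bread"})
print(f"milk => bread [support = {rule_support:.0%}, "
      f"confidence = {rule_confidence:.0%}]")
```

Support is measured against all transactions, while confidence is measured only against the transactions that contain the antecedent, which is why the two values in a rule can differ so much.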

Classification and regression for predictive analysis

Classification: finding a model (or function) that describes and distinguishes data classes or concepts, and using it to predict the class label of objects.
Regression: building a continuous-valued function model to predict missing or hard-to-obtain numerical data values.
Relevance analysis may need to precede classification and regression; it attempts to identify the attributes that are significantly relevant to the classification and regression process.
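
The difference between the two tasks shows up in the type of the prediction: a category for classification, a number for regression. The sketch below uses a 1-nearest-neighbor rule and a least-squares line as stand-ins; both techniques and all data values are choices made for this example, not methods prescribed by the notes.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def fit_line(points):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: (income, class label) and (years, salary).
labeled = [(20, "low"), (25, "low"), (60, "high"), (75, "high")]
numeric = [(1, 30.0), (2, 35.0), (3, 40.0), (4, 45.0)]

predicted_label = nearest_neighbor_label(labeled, 58)   # a category
slope, intercept = fit_line(numeric)
predicted_value = slope * 5 + intercept                 # a number

print("classification:", predicted_label)
print("regression:", predicted_value)
```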

Cluster analysis
Cluster analysis: analyzes data objects without consulting class labels. Clustering can be used to generate class labels for groups of data.
-> "maximize the intraclass similarity and minimize the interclass similarity"
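
One common way to pursue that objective is k-means, sketched here for one-dimensional data. The notes do not name a specific algorithm, so k-means is an assumption of this example, as are the sample values and the hand-picked initial centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    centers: initial guesses, one per cluster. Picking them by hand is
    a simplification; real implementations usually initialize randomly.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster,
        # keeping the old center if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No class labels are given as input; the cluster index each value ends up with can then serve as a generated label.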

Outlier analysis
Also called anomaly mining. Outliers are often treated as noise and discarded, but in applications such as fraud detection, the rare events can be more interesting than the regular ones.
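
A simple statistical way to flag such rare events is a z-score test, sketched below. The threshold of 2 standard deviations and the payment amounts are arbitrary assumptions for illustration; the notes do not prescribe a detection method.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than threshold std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical payment amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
suspicious = zscore_outliers(amounts)
print(suspicious)
```

In a fraud-detection setting the flagged value is the interesting result, whereas a cleaning step would discard it as noise.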

Are all patterns interesting?

A pattern is interesting if it is:
easily understood by humans
valid on new or test data with some degree of certainty
potentially useful
novel
Objective measures: support, confidence.

Which technologies are used?

Statistics: for example, after a classification or prediction model is built, statistical hypothesis testing is used to verify it
Machine learning: supervised learning (essentially a synonym for classification), unsupervised learning (essentially a synonym for clustering), semi-supervised learning, active learning
Information retrieval: searching for documents or for information in documents
What kinds of applications are targeted?
Business intelligence, Web search, bioinformatics, health informatics, finance, digital libraries, and so on.

Major issues in data mining
Mining methodology
Mining new kinds of knowledge, mining knowledge in multidimensional space, interdisciplinary efforts, boosting mining power in a networked environment, pattern evaluation
User interaction
Incorporating background knowledge
Presenting and visualizing knowledge so that it is easier to understand
Efficiency and scalability
Handling a wide variety of data types
Complex and dynamic data.
Data mining and society
Privacy protection
Social impact

Author: Nepalese are all Nepalese
Source: CSDN
Original: https://blog.csdn.net/echody/article/details/53301756
Copyright: This article is the blogger's original work; when reposting, please include a link to the original post.

Origin blog.csdn.net/lynchyueliu/article/details/104361863