Some basic concepts and processes of data mining

Data mining: knowledge discovery in databases. The typical process:

  1. Data cleaning: remove noise and inconsistent data
  2. Data integration: combine data from multiple sources
  3. Data selection: retrieve the data relevant to the analysis task from the database
  4. Data transformation: transform and consolidate the data into forms suitable for mining, for example through summary or aggregation operations (a code sketch of steps 1-4 follows this list)
  5. Data mining: apply intelligent methods to extract data patterns
  6. Pattern evaluation: identify the truly interesting patterns that represent knowledge, based on interestingness measures
  7. Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to users
The 6 phases of the CRISP-DM data mining process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
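As a toy illustration of steps 1-4 of the KDD process above, here is a minimal pandas sketch; the two source DataFrames, their column names, and the aggregation are all hypothetical.

```python
import pandas as pd

# Hypothetical source tables; all column names are illustrative assumptions.
sales_db = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, None],
    "amount": [120.0, 80.0, 80.0, -5.0, 60.0],   # -5.0 is a noisy value
    "month": ["2019-01", "2019-01", "2019-01", "2019-02", "2019-02"],
})
crm_db = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "wholesale"],
})

# 1. Data cleaning: drop rows with missing keys, duplicates, impossible amounts.
clean = sales_db.dropna(subset=["customer_id"]).drop_duplicates()
clean = clean[clean["amount"] >= 0].astype({"customer_id": int})

# 2. Data integration: combine the two data sources on their shared key.
integrated = clean.merge(crm_db, on="customer_id", how="left")

# 3. Data selection: keep only the columns relevant to the analysis task.
selected = integrated[["segment", "month", "amount"]]

# 4. Data transformation: summarize into a form suitable for mining.
mining_view = selected.groupby(["segment", "month"], as_index=False)["amount"].sum()
print(mining_view)
```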

OLTP (online transaction processing): mainly handles operational/production data and sits where the data is generated, so it is a real-time processing system. For example, as soon as a transaction completes, it is recorded in the database. The database is therefore designed around the three normal forms, which makes inserting, deleting, and updating data convenient.
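A minimal OLTP-style sketch using Python's built-in sqlite3 module, assuming a hypothetical normalized customer/orders schema: each completed transaction is written to the database immediately, and customer attributes live in their own table so updates touch a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF-style) schema: customer attributes are stored once,
# and orders only reference them by key.
cur.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL,
    order_time  TEXT NOT NULL
);
""")

cur.execute("INSERT INTO customer (customer_id, name) VALUES (?, ?)", (1, "Alice"))

# An OLTP write: the transaction is recorded the moment it happens.
cur.execute(
    "INSERT INTO orders (customer_id, amount, order_time) VALUES (?, ?, datetime('now'))",
    (1, 99.5),
)
conn.commit()
print(cur.execute("SELECT * FROM orders").fetchall())
```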

OLAP (online analytical processing): mainly organizes historical data for convenient querying, so its tables are generally flat (denormalized) and data is rarely changed after it is inserted. The data is usually split into fact tables and dimension tables so that analysts can retrieve it for analysis; this is also how data warehouses and data marts are organized.
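A minimal OLAP-style sketch with pandas, assuming a hypothetical fact table of sales and a product dimension table: the analyst joins the fact table to the dimension and aggregates, instead of updating individual rows.

```python
import pandas as pd

# Hypothetical dimension table: descriptive attributes, rarely updated.
dim_product = pd.DataFrame({
    "product_id": [10, 11, 12],
    "category":   ["dairy", "dairy", "bakery"],
})

# Hypothetical fact table: one row per sale, appended but not modified.
fact_sales = pd.DataFrame({
    "product_id": [10, 11, 12, 10, 12],
    "quantity":   [2, 1, 3, 5, 1],
    "revenue":    [6.0, 3.5, 9.0, 15.0, 3.0],
})

# A typical OLAP query: join the fact table to the dimension, then aggregate.
report = (
    fact_sales.merge(dim_product, on="product_id")
              .groupby("category", as_index=False)[["quantity", "revenue"]].sum()
)
print(report)
```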

Knowledge Discovery in Databases (KDD)

ER diagram (entity-relationship diagram): a diagram showing the relationships between different entities.

Frequent itemset: a set of items that frequently appear together in a transaction data set, for example milk and bread, which many customers often buy together.
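A minimal sketch of finding frequent itemsets by brute-force counting (not the full Apriori algorithm); the transactions and the minimum support count of 2 are made up for illustration.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data set.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 2  # an itemset is "frequent" if it appears in at least 2 transactions

# Brute-force count of every itemset of size 1 or 2 in each transaction.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
print(frequent)   # e.g. ('bread', 'milk') appears in 3 transactions
```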

Cluster: a collection of data objects such that objects within the same cluster are similar to one another and dissimilar to objects in other clusters.
Outlier analysis: building on clustering techniques, treats objects that are highly different from all other objects as possible outliers.
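A minimal sketch of the outlier idea, using a simple distance-based variant instead of full clustering: each object is scored by its average distance to all other objects, and objects whose score is far above the typical score are flagged. The data and the cutoff of two standard deviations are illustrative.

```python
from statistics import mean, stdev

# Hypothetical 1-D data: most values form two tight groups, 95.0 is the odd one out.
data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.7, 95.0]

# Score each object by its average distance to all other objects.
scores = []
for idx, x in enumerate(data):
    scores.append(mean(abs(x - y) for k, y in enumerate(data) if k != idx))

# Flag objects whose score is far above the typical score
# (here: more than two standard deviations above the mean score).
cutoff = mean(scores) + 2 * stdev(scores)
outliers = [x for x, s in zip(data, scores) if s > cutoff]
print(outliers)   # expected: [95.0]
```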

Data matrix and dissimilarity matrix
Generally, memory-based clustering and KNN (nearest neighbor) algorithms run on these two data structures.
Data matrix (object-by-attribute structure): stores the n data objects and their p attributes in the form of a relational table, i.e. an n × p matrix.
Dissimilarity matrix (object-by-object structure): stores the pairwise proximity of the n objects in an n × n matrix.
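A minimal sketch, assuming a small hypothetical data matrix with n = 4 objects and p = 2 numeric attributes: it builds the n × n dissimilarity matrix of pairwise Euclidean distances that memory-based clustering or kNN could then consume.

```python
from math import dist

# Hypothetical n x p data matrix: 4 objects, 2 numeric attributes each.
data_matrix = [
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
    [7.5, 9.0],
]

# n x n dissimilarity matrix: entry (i, j) is the Euclidean distance
# between object i and object j; it is symmetric with a zero diagonal.
n = len(data_matrix)
dissimilarity = [
    [dist(data_matrix[i], data_matrix[j]) for j in range(n)]
    for i in range(n)
]

for row in dissimilarity:
    print([round(d, 2) for d in row])
```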
Proximity measures for binary attributes: symmetric and asymmetric binary attributes are used to characterize the dissimilarity and similarity between objects.
Jaccard coefficient: sim(i, j) describes the degree of similarity between objects i and j; for binary attributes, sim(i, j) = q / (q + r + s), where q counts the attributes equal to 1 for both objects and r, s count the attributes on which the two objects disagree.
For asymmetric binary attributes, the attributes on which both objects i and j take the value 0 are removed, i.e., they do not participate in the comparison; the resulting measure is called the asymmetric binary dissimilarity, d(i, j) = (r + s) / (q + r + s).
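A minimal sketch for binary attributes, using the standard contingency counts q (both 1), r (i = 1, j = 0), s (i = 0, j = 1), and t (both 0): the symmetric dissimilarity uses all four counts, while the asymmetric dissimilarity and the Jaccard coefficient ignore the 0/0 matches. The two example attribute vectors are made up.

```python
# Two hypothetical objects described by 6 binary (0/1) attributes.
i = [1, 1, 0, 0, 0, 1]
j = [1, 0, 0, 0, 1, 1]

q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1/1 matches
r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # i = 1, j = 0
s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # i = 0, j = 1
t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # 0/0 matches

# Symmetric binary dissimilarity: 0/0 matches count as agreement.
d_symmetric = (r + s) / (q + r + s + t)

# Asymmetric binary dissimilarity: 0/0 matches are ignored.
d_asymmetric = (r + s) / (q + r + s)

# Jaccard coefficient (similarity): sim(i, j) = q / (q + r + s) = 1 - d_asymmetric.
sim_jaccard = q / (q + r + s)

print(q, r, s, t)                               # 2 1 1 2
print(d_symmetric, d_asymmetric, sim_jaccard)
```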

Refer to: Slowly Changing Dimensions: https://www.nuwavesolutions.com/slowly-changing-dimensions/
