Data mining: knowledge discovery in databases
Process:
- Data cleaning: eliminate noise and inconsistent data
- Data integration: multiple data sources can be combined
- Data selection: retrieve the data relevant to the analysis task from the database
- Data transformation: transform and consolidate data into a form suitable for mining, for example through summary or aggregation operations
- Data mining: apply intelligent methods to extract data patterns
- Pattern evaluation: identify the truly interesting patterns that represent knowledge, based on interestingness measures
- Knowledge representation: use visualization and knowledge representation techniques to present the mined knowledge to users
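The first four steps above can be sketched on toy data. This is only an illustration: the records, field names, and helper functions (`clean`, `integrate`, `select`, `transform`) are hypothetical, and a real pipeline would typically use a database or a data-frame library.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "milk", "qty": 2},
    {"customer": "c2", "item": "bread", "qty": None},
]
store_b = [
    {"customer": "c1", "item": "bread", "qty": 1},
    {"customer": "c3", "item": "milk", "qty": 3},
]


def clean(records):
    """Data cleaning: drop records whose quantity is missing."""
    return [r for r in records if r["qty"] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, item):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["item"] == item]


def transform(records):
    """Data transformation: aggregate the quantity per customer."""
    totals = Counter()
    for r in records:
        totals[r["customer"]] += r["qty"]
    return dict(totals)


records = integrate(clean(store_a), clean(store_b))
milk_per_customer = transform(select(records, "milk"))
print(milk_per_customer)  # {'c1': 2, 'c3': 3}
```

The mining, evaluation, and presentation steps would then run on the transformed data.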
The six phases of CRISP-DM (Cross-Industry Standard Process for Data Mining): business understanding, data understanding, data preparation, modeling, evaluation, and deployment
OLTP (online transaction processing): handles day-to-day production data at the point where it is generated, so it is a real-time data processing system. For example, when a transaction completes, it is immediately recorded in the database. The schema is therefore usually designed according to the three normal forms (1NF, 2NF, 3NF), which makes inserting, deleting, and updating data easier.
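A minimal sketch of the "record each transaction immediately" idea, using Python's built-in `sqlite3` module with an in-memory database. The `orders` table and the `record_order` helper are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer TEXT NOT NULL, "
    "amount REAL NOT NULL)"
)


def record_order(customer, amount):
    """Write one order inside a transaction; report whether it was stored."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if the block raises an exception.
        with conn:
            conn.execute(
                "INSERT INTO orders (customer, amount) VALUES (?, ?)",
                (customer, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = record_order("c1", 9.5)
rejected = record_order(None, 1.0)  # violates NOT NULL, so nothing is stored
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok, rejected, count)  # True False 1
```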
OLAP (online analytical processing): organizes historical data so that it is easy to query. Its tables are generally flat (denormalized), and data is generally not changed once it has been inserted. The data is usually divided into fact tables and dimension tables so that analysts can conveniently retrieve it for analysis. This is also how data warehouses and data marts handle data.
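The split into fact tables and dimension tables can be sketched with `sqlite3` as well. The table names (`fact_sales`, `dim_product`) and the sample rows are assumptions made for this example, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes of each product.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT
    );
    -- Fact table: one row per sale, pointing at the dimension.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'milk', 'dairy'), (2, 'cheese', 'dairy'), (3, 'bread', 'bakery');
    INSERT INTO fact_sales (product_id, amount) VALUES
        (1, 2.0), (2, 5.0), (3, 3.0), (1, 2.0);
    """
)

# A typical analytical query: total sales per product category.
rows = conn.execute(
    """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(rows)  # [('bakery', 3.0), ('dairy', 9.0)]
```

The query only reads and aggregates, which matches the insert-once, query-often usage described above.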
Knowledge Discovery in Data (KDD)
ER (entity-relationship) diagram: a diagram of the relationships between different entities
Frequent itemset: a set of items that frequently appear together in a transaction data set, such as milk and bread, which many customers buy together.
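As a rough illustration, the support of every item pair can be counted directly. The baskets and the `min_support` threshold are made up for this sketch, and brute-force counting is used instead of a dedicated algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data set: each basket is a set of items.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.5  # assumed threshold: a pair must occur in at least half the baskets

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep the pairs whose support (fraction of baskets) reaches the threshold.
frequent_pairs = {
    pair: count / len(transactions)
    for pair, count in pair_counts.items()
    if count / len(transactions) >= min_support
}
print(frequent_pairs)
```

Here milk and bread appear together in three of the four baskets, so that pair has support 0.75.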
Cluster: a collection of data objects such that objects in the same cluster are similar to each other and dissimilar to objects in other clusters.
Outlier analysis: using clustering techniques, treat as possible outliers the objects that differ greatly from all other objects.
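The idea of "objects that differ greatly from the rest" can be shown with a simple distance-based stand-in. Note that this is not a clustering algorithm: it only flags points whose nearest neighbor is farther away than an assumed `threshold`. The points and the threshold are invented for the example.

```python
import math

# Hypothetical 2-D objects: two tight groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]
threshold = 2.0  # assumed cut-off for "far from everything else"


def nearest_neighbor_distance(index):
    """Euclidean distance from points[index] to its closest other point."""
    return min(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )


outliers = [
    p for i, p in enumerate(points) if nearest_neighbor_distance(i) > threshold
]
print(outliers)  # [(10.0, 0.0)]
```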
Data matrix and dissimilarity matrix
Generally, memory-based clustering and KNN (k-nearest neighbor) algorithms run on these two data structures.
Data matrix (object-attribute structure): stores n data objects with p attributes, in the form of a relational table or an n × p matrix.
Dissimilarity matrix (object-object structure): stores the proximities of all pairs of the n objects, represented as an n × n matrix.
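A small sketch of how the two structures relate: an n × p data matrix is turned into an n × n dissimilarity matrix. Euclidean distance is assumed here as the dissimilarity measure, and the sample values are arbitrary.

```python
import math

# Data matrix: n = 3 objects (rows), p = 2 attributes (columns).
data_matrix = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 2.0],
]


def dissimilarity_matrix(rows):
    """Return the n x n matrix of pairwise Euclidean distances."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


d = dissimilarity_matrix(data_matrix)
for row in d:
    print(row)
```

The result is symmetric with zeros on the diagonal, so in practice only the lower triangle needs to be stored.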
Proximity measures for binary attributes: dissimilarity and similarity measures for objects described by symmetric or asymmetric binary attributes.
Jaccard coefficient: sim(i, j) describes the degree of similarity between objects i and j
example:
In the example above, the attributes for which both objects i and j take the value 0 are removed, i.e. those attributes do not take part in the comparison. The resulting measure is called asymmetric binary dissimilarity.
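A sketch of these measures, using the usual contingency counts for two binary objects: q (both 1), r (1 in i, 0 in j), s (0 in i, 1 in j), and t (both 0). The symmetric dissimilarity is (r + s) / (q + r + s + t), the asymmetric one drops t, giving (r + s) / (q + r + s), and the Jaccard coefficient is q / (q + r + s). The two sample vectors and the function names are made up for illustration.

```python
def contingency(x, y):
    """Count the four 0/1 combinations for two equal-length binary vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(x, y):
    """0-0 matches count as agreement."""
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(x, y):
    """0-0 matches are ignored; undefined if the objects share no 1 at all."""
    q, r, s, _ = contingency(x, y)
    return (r + s) / (q + r + s)


def jaccard(x, y):
    """Jaccard coefficient: 1 minus the asymmetric binary dissimilarity."""
    q, r, s, _ = contingency(x, y)
    return q / (q + r + s)


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(contingency(obj_i, obj_j))  # (2, 0, 1, 3)
print(symmetric_dissimilarity(obj_i, obj_j))
print(asymmetric_dissimilarity(obj_i, obj_j))
print(jaccard(obj_i, obj_j))
```

With these vectors the three shared zeros lower the symmetric dissimilarity to 1/6, while the asymmetric version ignores them and gives 1/3.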
Refer to
Slowly Changing Dimension:
https://www.nuwavesolutions.com/slowly-changing-dimensions/