Data Warehouse and Data Mining Practice Final Review Summary

This article covers the scope of the final review outline for the author's data warehouse and data mining practice course. The section numbers follow the table of contents of "Data Warehouse Mining Practice".

1.1.2 What is a data warehouse

Definition

A data warehouse is a subject-oriented, integrated, non-volatile (stable), and time-variant collection of data.

Features (4)

  1. Subject-oriented
    A subject is a key area that users care about when using the data warehouse to make decisions. From the perspective of data organization, a subject is a collection of data.
    Data organized by subject has the following characteristics:
    A. Each subject has complete and consistent content, which serves as the basis for analysis and processing.
    B. Subjects may overlap in content, reflecting the connections between them. The overlap is logical, not physical.
    C. Subjects differ in how their data is integrated.
    D. Each subject area should be independent and complete.
  2. Integrated
    The data stored in the data warehouse is generally extracted from the database systems the enterprise has already built. It is not a simple copy of the original data; it has been extracted, filtered, cleaned, converted, and integrated.
  3. Non-volatile
    Data in the data warehouse remains unchanged for a certain period of time.
  4. Time-variant
    The data warehouse periodically receives new data from the operational database systems.
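
The last two features can be illustrated with a small sketch. It assumes a toy in-memory store (the class `SnapshotStore` and its methods are hypothetical names, not part of any real product) in which each periodic load appends dated rows and nothing is ever overwritten:

```python
from datetime import date


class SnapshotStore:
    """Toy append-only store: rows are added per load date and never updated."""

    def __init__(self):
        self._rows = []  # list of (load_date, key, value)

    def load(self, load_date, records):
        # Time-variant: each periodic load appends a new dated snapshot.
        for key, value in records.items():
            self._rows.append((load_date, key, value))

    def as_of(self, key, when):
        # Non-volatile: old rows are never overwritten, so history stays queryable.
        candidates = [(d, v) for d, k, v in self._rows if k == key and d <= when]
        return max(candidates)[1] if candidates else None


store = SnapshotStore()
store.load(date(2024, 1, 31), {"cust-1": "Beijing"})
store.load(date(2024, 2, 29), {"cust-1": "Shanghai"})

print(store.as_of("cust-1", date(2024, 2, 1)))  # value from the January load
print(store.as_of("cust-1", date(2024, 3, 1)))  # value from the February load
```

A real warehouse would hold these snapshots in tables rather than a Python list; the point here is only that loads add history instead of replacing it.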

1.2.1 Composition of data warehouse system

Definition

The data warehouse system takes the data warehouse as its core, integrates the various application systems, and provides a unified platform for analyzing historical data. Through data analysis and reporting modules and analysis tools such as OLAP (Online Analytical Processing), decision analysis, and data mining, it extracts the information needed to support decision-making.

Composition

A data warehouse system usually refers to a database environment consisting of the following three parts:

  1. Data storage and management
    This layer includes the following four components:
    A. Data warehouse: the core of the entire data warehouse environment. It is where the data is stored, and it provides support for data retrieval.
    B. Extraction tools: extract data from the various source environments, perform the necessary transformation and cleanup, and store the results in the data warehouse.
    C. Metadata: data about data. It sits in the layer above the data warehouse and describes the structure, location, and construction method of the data in the data warehouse.
    D. Data mart: a subset split off from the data warehouse for a particular subject.
  2. OLAP server layer
    The OLAP service is software that provides analysis of the data stored in the data warehouse.
  3. Front-end analysis tool layer
    Data reporting, data analysis, and data mining tools produce analysis and summary reports, as well as data mining results, for users.

1.2.2 ETL

ETL: Extract, Transform, Load. Often referred to simply as data extraction, ETL integrates data and increases its value according to unified rules. It is the process that carries data from the data sources to the target data warehouse.

  1. Data extraction
    Extract data from the various original business systems.
  2. Data transformation
    Transform the extracted data according to the predefined rules, unifying the originally heterogeneous data formats.
  3. Data loading
    Import the transformed data into the data warehouse, incrementally or in full, according to the plan.
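
The three steps can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the two source systems (`SOURCE_A`, `SOURCE_B`), their field names, and the target table `sales` are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows from two business systems with inconsistent formats.
SOURCE_A = [{"id": 1, "amount": "12.50", "day": "2024-01-05"}]
SOURCE_B = [{"order_no": 2, "amt_cents": 990, "day": "05/01/2024"}]  # DD/MM/YYYY


def extract():
    # Extract: pull the raw rows from each source system.
    return SOURCE_A, SOURCE_B


def transform(rows_a, rows_b):
    # Transform: unify field names, units, and date formats.
    unified = []
    for r in rows_a:
        unified.append((r["id"], float(r["amount"]), r["day"]))
    for r in rows_b:
        dd, mm, yyyy = r["day"].split("/")
        unified.append((r["order_no"], r["amt_cents"] / 100, f"{yyyy}-{mm}-{dd}"))
    return unified


def load(conn, rows):
    # Load: incremental insert; rows whose id is already present are skipped.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL, day TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))
load(conn, transform(*extract()))  # running the load again does not duplicate rows
rows = conn.execute("SELECT id, amount, day FROM sales ORDER BY id").fetchall()
print(rows)
```

A full load would instead clear and refill the target table; the `INSERT OR IGNORE` above is just one simple way to make an incremental load repeatable.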

1.4 Relationship between data warehouse and operational database

The difference between operational and analytical data

Operational data | Analytical data
Detailed | Summarized
Accurate at the moment of access | Represents historical data
Updatable | Not updatable
Operational requirements are known in advance | Operational requirements are not known in advance
Follows the software development life cycle | Follows a completely different life cycle
High performance requirements | Relaxed performance requirements
Operates on one unit at a time | Operates on a set at a time
Transaction-driven | Analysis-driven
The amount of data in one operation is small | The amount of data in one operation is large

Data Warehouse vs. Operational Database

Data warehouse | Operational database
Subject-oriented | Application-oriented
Huge capacity | Relatively small capacity
Data is summarized or refined | Data is detailed
Stores historical data | Stores current data
Data is usually not updatable | Data is updatable
Operational requirements are ad hoc | Operational requirements are known in advance
One operation accesses a set of data | One operation accesses one record
Data is often redundant | Data is not redundant
Operations are relatively infrequent | Operations are relatively frequent
Queries return processed data | Queries return raw data
Supports decision analysis | Supports transaction processing
Decision analysis requires historical data | Transaction processing requires current data
Requires complex calculations | Few complex calculations
Serves the enterprise's senior decision-makers | Serves the enterprise's business-processing staff
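
The difference in access pattern can be made concrete with a short sketch. It assumes a hypothetical `orders` table in an in-memory SQLite database. In practice the two workloads would run on separate systems; a single table is used here only to contrast the two styles of operation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "north", 2023, 100.0),
        (2, "north", 2024, 150.0),
        (3, "south", 2023, 80.0),
        (4, "south", 2024, 120.0),
    ],
)

# Operational style: touch one record and update it in place.
conn.execute("UPDATE orders SET amount = 110.0 WHERE id = 1")
one_order = conn.execute("SELECT amount FROM orders WHERE id = 1").fetchone()

# Analytical style: read a whole set of historical rows and summarize them.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(one_order)
print(by_region)
```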

2.3.1 Multidimensional data model and related concepts

  1. Granularity
    Granularity refers to the level of detail of the data units in the data warehouse. The more detailed the data, the smaller the granularity and the lower the level.
  2. Dimension
    A dimension is a particular angle from which people observe things. Conceptually it is similar to an attribute of a relational table.
  3. Dimension attributes and dimension members
    A dimension is described by a set of attributes, and each value a dimension takes is called a member of that dimension.
  4. Dimension hierarchy
    The same dimension can have values at different levels of detail, and fine-grained values (e.g., days) can be mapped to coarse-grained values (e.g., months), thus forming a hierarchy.
  5. Measure/Fact
    A measure is the unit of information in a data warehouse, that is, the value stored in a cell of the multidimensional space. It is also known as a fact.
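
These concepts can be tied together in a small sketch. It assumes hypothetical sales facts with two dimensions (time and city) and one measure (amount), and rolls the time dimension up one level of its hierarchy, from day to month. The function names `day_to_month` and `roll_up` are illustrative only.

```python
from collections import defaultdict

# Fact rows at the finest granularity: (day, city, sales amount).
facts = [
    ("2024-01-05", "Beijing", 10),
    ("2024-01-20", "Beijing", 5),
    ("2024-02-03", "Beijing", 7),
    ("2024-01-05", "Shanghai", 4),
]


def day_to_month(day):
    # One step up the time hierarchy: day -> month.
    return day[:7]


def roll_up(rows, time_level):
    # Aggregate the measure after mapping each day to a coarser time member.
    totals = defaultdict(int)
    for day, city, amount in rows:
        totals[(time_level(day), city)] += amount
    return dict(totals)


monthly = roll_up(facts, day_to_month)
print(monthly)
```

Here "2024-01-05" and "Beijing" are dimension members, the amount is the measure, and the monthly totals have a larger granularity than the daily rows they were computed from.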

2.3.4 Several common multidimensional data models based on relational databases

Three schemas: star schema, snowflake schema, and fact constellation schema.
The star schema is the most basic. It has multiple dimension tables but only one fact table. Normalizing the dimension tables of a star schema into a layered structure yields the snowflake schema. If the single-fact-table restriction of the star schema is lifted and the multiple fact tables share some or all of the dimension tables, the result is called a fact constellation schema.
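
A minimal star schema might look like the following sketch, which uses an in-memory SQLite database. The table and column names (`fact_sales`, `dim_date`, `dim_product`) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Single fact table: foreign keys to each dimension plus the measure.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'lamp', 'furniture');
INSERT INTO fact_sales VALUES (1, 1, 3.0), (1, 2, 40.0), (2, 1, 6.0);
""")

# Join the fact table to its dimensions and aggregate the measure.
result = conn.execute("""
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()
print(result)
```

Moving `category` out of `dim_product` into its own table referenced by key would turn this into a snowflake schema. Adding a second fact table that also references `dim_date` would make it a fact constellation.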

3.1 Overview of OLAP

3.2 Multidimensional data model of OLAP

3.3.1 Efficient Computation of Data Cubes

5.1 The concept of association analysis

5.2 Apriori Algorithm

7.1 Classification process

7.3 Decision tree classification algorithm

7.4 Naive Bayes classification algorithm

10.1 Overview of Clustering

10.2 K-means Algorithm

10.3.1 Overview of Hierarchical Clustering Algorithms

10.3.2 DIANA algorithm and AGNES algorithm

Origin blog.csdn.net/qq_43759081/article/details/122387259