Introduction to Data Warehouse Theory

One. Data Warehouse

1. What is a data warehouse

  • A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data

1) Subject-oriented

A subject is an abstract concept: a higher-level grouping under which the data in an enterprise information system is integrated, categorized, and analyzed.
Each subject basically corresponds to one macro analysis field.
In a logical sense, it corresponds to the analysis objects involved in a particular macro analysis field of the enterprise.

  • For example, "sales analysis" is an analysis field, so the subject of that data warehouse application is "sales analysis"

Extracting subjects

  • Consider a transaction-oriented "shopping mall" database system with the following data model:
    Purchasing subsystem:
    Order (order number, supplier number, total amount, date)
    Order details (order number, product number, category, unit price, quantity)
    Supplier (supplier number, supplier name, address, telephone)
    Sales subsystem:
    Customer (customer number, name, gender, age, education level, address, telephone)
    Sales (employee number, customer number, product number, quantity, unit price, date)
    Inventory management subsystem:
    Picking list (picking list number, picker, product number, quantity, date)
    Receiving list (receiving list number, order number, deliverer, receiver, date)
    Inventory (product number, warehouse number, inventory quantity, date)
    Warehouse (warehouse number, warehouse manager, location, inventory description)
    Personnel management subsystem:
    Employee (employee number, name, gender, age, education level, department number)
    Department (department number, department name, department supervisor, telephone)

2) Integration

  • Integration means that the data in the data warehouse must be consistent.
    The data in the data warehouse is extracted from multiple originally scattered databases, data files, and data segments.
    The data sources may include both internal and external data, so the same value may be encoded differently from source to source: for example, F/M, 0/1, or A/B.
  • Integration methods
    Unification: eliminate inconsistencies
    Synthesis: combine and compute over the original data
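
The unification step can be sketched in a few lines of Python. This is only an illustration: the source system names (`crm`, `billing`, `legacy`), the assumption that the codes stand for gender, and the choice of which code maps to which value are all hypothetical and not taken from the original data model.

```python
# Hypothetical per-source encodings for the same attribute.
# Source names, the gender interpretation, and the code-to-value
# assignments are assumptions made for illustration only.
SOURCE_ENCODINGS = {
    "crm": {"F": "female", "M": "male"},
    "billing": {"0": "female", "1": "male"},
    "legacy": {"A": "female", "B": "male"},
}


def unify_gender(source: str, raw_value: str) -> str:
    """Map a source-specific code to one canonical warehouse value."""
    mapping = SOURCE_ENCODINGS[source]
    # Normalize whitespace and case before the lookup; unmapped codes
    # are flagged as "unknown" instead of being silently dropped.
    return mapping.get(raw_value.strip().upper(), "unknown")


rows = [("crm", "F"), ("billing", "1"), ("legacy", "a"), ("crm", "?")]
unified = [unify_gender(source, value) for source, value in rows]
print(unified)  # ['female', 'male', 'female', 'unknown']
```

Keeping the mapping in one table per source makes every inconsistency explicit, which is the point of the unification step.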

3) Non-volatile

  • The data in the data warehouse is analytical data formed through extraction; it is derived rather than original source data.
    It is mainly used for enterprise decision analysis.
    The main operation is "query"; under normal circumstances, "update" operations are not performed.
    A stable data environment is also conducive to data analysis and decision making.

4) Change over time

  • A data warehouse organizes data by dimensions, and time is one of its most important dimensions.
    New data is continually added.
    Old data is continually removed.
    Time-related aggregate data is continually refreshed.

Two. The difference between a data warehouse and a database

A database is designed to capture and store data
A data warehouse is designed to analyze data

Aspect            Database                       Data warehouse
Nature            Collection of data             Collection of data
Positioning       Transaction processing (OLTP)  Data analysis (OLAP)
Target users      Front-end users                Managers
Operations        Insert, delete, update         Query
Data granularity  Record                         Dimension
Table structure   3NF                            Star, snowflake

The difference between OLTP and OLAP

  • OLTP
    On-Line Transaction Processing
    OLTP is the main application of traditional relational databases. It handles basic, day-to-day transaction processing, such as bank transactions.
  • OLAP
    On-Line Analytical Processing
    OLAP is the main application of data warehouse systems. It supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results.

Property               OLTP                                      OLAP
Read characteristics   Each query returns only a few records     Aggregates over a large number of records
Write characteristics  Random, low-latency writes of user input  Batch import
Use case               End users, Java EE applications           Internal analysts, supporting decision making
Data representation    Latest state of the data                  Historical state over time
Data size              GB                                        TB to PB
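
To make the read characteristics concrete, here is a small sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are made up for illustration. A real warehouse would not run on SQLite, but the shape of the two queries is the same: the OLTP-style query fetches one record by key, while the OLAP-style query aggregates over many records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, for illustration only.
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales (customer, amount, sale_date) VALUES (?, ?, ?)",
    [
        ("alice", 10.0, "2020-01-05"),
        ("bob", 20.0, "2020-01-17"),
        ("alice", 5.0, "2020-02-02"),
        ("carol", 7.5, "2020-02-20"),
    ],
)

# OLTP-style read: look up a single record by its primary key.
oltp_row = conn.execute(
    "SELECT customer, amount FROM sales WHERE id = ?", (2,)
).fetchone()

# OLAP-style read: aggregate all records, grouped by month.
olap_rows = conn.execute(
    "SELECT substr(sale_date, 1, 7) AS month, SUM(amount) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()

print(oltp_row)   # ('bob', 20.0)
print(olap_rows)  # [('2020-01', 30.0), ('2020-02', 12.5)]
```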

Three. Data warehouse architecture

Inmon architecture

[Figure: Inmon architecture]

Kimball architecture

[Figure: Kimball architecture]

Hybrid architecture solution

[Figure: hybrid architecture]
Data Warehouse

  • Data collection
    Flume, Sqoop, Logstash, DataX
  • Data storage
    MySQL, HDFS, HBase, Redis, MongoDB
  • Data calculation
    Hive, Tez, Spark, Flink, Storm, Impala
  • Data visualization
    Tableau, ECharts, Superset, Quick BI, DataV
  • Task scheduling
    Oozie, Azkaban, Crontab

Data ETL

  • Extract
    Obtain data from operational data sources
  • Transform
    Convert the data into a form and structure suitable for query and analysis
  • Load
    Import the transformed data into the final target data warehouse
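
The three steps above can be sketched as a tiny pipeline in plain Python. Everything here is an assumption for illustration: the CSV export stands in for an operational source, an in-memory SQLite database stands in for the warehouse, and the `fact_order` table and function names are hypothetical. Real ETL tools add scheduling, error handling, and incremental loading on top of this basic shape.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system.
SOURCE_CSV = """order_id,product,quantity,unit_price
1,apple,3,0.50
2,banana,2,0.25
3,apple,1,0.50
"""


def extract(text):
    """Extract: read raw rows from the operational source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and derive the amount measure."""
    records = []
    for row in rows:
        quantity = int(row["quantity"])
        amount = quantity * float(row["unit_price"])
        records.append((int(row["order_id"]), row["product"], quantity, amount))
    return records


def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_order "
        "(order_id INTEGER, product TEXT, quantity INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
total = warehouse.execute("SELECT SUM(amount) FROM fact_order").fetchone()[0]
print(total)  # 2.5
```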

ETL tools

  • Oracle
    Oracle Warehouse Builder (OWB) and Oracle Data Integrator (ODI)
  • Microsoft
    SQL Server Integration Services
  • SAP
    Data Integrator
  • IBM
    InfoSphere DataStage
  • Informatica
    PowerCenter
  • Pentaho
    Kettle

Four. Data warehouse modeling

1. Choose a business process

  • Confirm which business process the data warehouse should cover.
    For example: understanding and analyzing the sales of a retail store
  • Ways to document it
    Plain text
    Business Process Model and Notation (BPMN)
    Unified Modeling Language (UML)

2. Declare granularity

  • Granularity determines what a single fact represents
    For example: one purchased item on a retail store customer's shopping receipt
  • Granularity must be declared before selecting dimensions and facts
  • It is recommended to start the design from the most atomic data,
    since atomic records can satisfy user queries that were not anticipated.
  • Different facts can have different granularity
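
A short sketch shows why atomic granularity is recommended. The receipt data and the `roll_up` helper below are hypothetical. With one row per purchased item, the same records can be rolled up by receipt or by product. Had only per-receipt totals been stored, the per-product question could no longer be answered.

```python
from collections import defaultdict

# Atomic grain: one row per purchased item on a receipt (made-up data).
line_items = [
    {"receipt": "R1", "product": "milk", "quantity": 2, "amount": 3.0},
    {"receipt": "R1", "product": "bread", "quantity": 1, "amount": 2.0},
    {"receipt": "R2", "product": "milk", "quantity": 1, "amount": 1.5},
]


def roll_up(rows, key):
    """Sum the amount measure at a coarser grain defined by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


per_receipt = roll_up(line_items, "receipt")
per_product = roll_up(line_items, "product")
print(per_receipt)  # {'R1': 5.0, 'R2': 1.5}
print(per_product)  # {'milk': 4.5, 'bread': 2.0}
```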

3. Confirm the dimensions

  • Dimensions describe the context in which the fact table's data was collected
  • Typical dimensions are nouns,
    such as date, store, and inventory
  • A dimension table stores all the data related to one dimension.
    For example, the date dimension should include year, quarter, month, week, and day.
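
As an illustration of the date dimension described above, the following sketch generates one row per calendar day. The column names, the `YYYYMMDD` integer key, and the use of ISO week numbers are assumptions. Real date dimensions often carry many more attributes, such as holidays or fiscal periods.

```python
from datetime import date, timedelta


def build_date_dimension(start: date, end: date):
    """Return one dimension row per calendar day in [start, end]."""
    rows = []
    day = start
    while day <= end:
        # isocalendar() yields (ISO year, ISO week number, ISO weekday).
        _, iso_week, _ = day.isocalendar()
        rows.append(
            {
                "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20200830
                "year": day.year,
                "quarter": (day.month - 1) // 3 + 1,
                "month": day.month,
                "week": iso_week,
                "day": day.day,
            }
        )
        day += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2020, 8, 30), date(2020, 9, 1))
for row in dim_date:
    print(row)
```

Fact rows then only need to store `date_key`; year, quarter, month, and week become available through a join.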

4. Confirm the facts

  • Identify the numeric measures that make up the records of the fact table
  • This step is closely tied to the system's business users
  • Most fact table measures are numeric types
    that can be summed and used in calculations.
    For example: cost, quantity, amount

5. Star model features

  • Consists of fact tables and dimension tables
  • A star schema can have one or more fact tables, and each fact table references any number of dimension tables
  • The star model divides a business process into facts and dimensions.
    Facts contain business measures, which are quantitative data such as sales price, sales quantity, distance, speed, and weight.
    Dimensions describe the attributes of fact data, such as date, product, customer, and geographic location.
  • Advantages
    Simplify queries
    Simplify business report logic
    Better query performance
    Fast aggregation
    Easy to provide data to the cube
  • Disadvantages
    Data integrity cannot be guaranteed
    Not flexible enough for analysis needs
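
A minimal star schema can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_date`, `dim_product`), columns, and sample rows are hypothetical. The fact table holds the measures plus one key per dimension, and a typical analytical query joins the fact table directly to each dimension and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table, two denormalized dimensions.
conn.executescript(
    """
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        amount REAL);

    INSERT INTO dim_date VALUES (20200901, 2020, 9), (20201001, 2020, 10);
    INSERT INTO dim_product VALUES (1, 'milk', 'dairy'), (2, 'bread', 'bakery');
    INSERT INTO fact_sales VALUES
        (20200901, 1, 2, 3.0),
        (20200901, 2, 1, 2.0),
        (20201001, 1, 4, 6.0);
    """
)

# Each dimension is exactly one join away from the fact table.
star_result = conn.execute(
    """
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
    """
).fetchall()
print(star_result)  # [(9, 'bakery', 2.0), (9, 'dairy', 3.0), (10, 'dairy', 6.0)]
```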

6. Features of Snowflake Model

  • A logical layout of tables in a multidimensional model
  • Composed of fact table and dimension table
  • Normalize the dimension table in the star schema.
    Remove low cardinality attributes from the dimension table and form a separate table
  • One dimension is normalized into multiple related tables
  • Advantages
    Some OLAP multidimensional database modeling tools are optimized for the snowflake model.
    Normalized dimension attributes save storage space
  • Disadvantages
    Normalizing dimension attributes adds join operations and increases query complexity
    Data integrity is not guaranteed
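
For comparison, here is a sketch of a snowflaked product dimension, again with `sqlite3` and hypothetical table names and data. The low-cardinality category attribute is moved out of `dim_product` into its own `dim_category` table, so reaching the category name from the fact table now takes two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical snowflake schema: the product dimension is normalized
# into dim_product plus a separate dim_category table.
conn.executescript(
    """
    CREATE TABLE dim_category (
        category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key));
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        amount REAL);

    INSERT INTO dim_category VALUES (1, 'dairy'), (2, 'bakery');
    INSERT INTO dim_product VALUES
        (1, 'milk', 1), (2, 'cheese', 1), (3, 'bread', 2);
    INSERT INTO fact_sales VALUES (1, 3.0), (2, 4.0), (3, 2.0), (1, 1.5);
    """
)

# The category name is two joins away: fact -> product -> category.
snowflake_result = conn.execute(
    """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
    """
).fetchall()
print(snowflake_result)  # [('bakery', 2.0), ('dairy', 8.5)]
```

Each category name is stored once instead of once per product, which is the storage saving. The extra join is the cost.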


Origin blog.csdn.net/sun_0128/article/details/108333493