One. Data Warehouse
1. What is a data warehouse
- A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data
1) Subject-oriented
A subject is an abstract concept: a higher-level integration, classification, and analysis of the data in the enterprise information system.
Each subject basically corresponds to one macro analysis field.
In a logical sense, it corresponds to the analysis objects involved in a macro analysis field of the enterprise.
- For example, "sales analysis" is an analysis field, so the subject of this data warehouse application is "sales analysis"
Extracting subjects
- Consider a transaction-oriented "shopping mall" database system whose data model is as follows:
Purchasing subsystem:
order (order number, supplier number, total amount, date)
order details (order number, product number, category, unit price, quantity)
supplier (supplier number, supplier name, address, telephone)
Sales subsystem:
customer (customer number, name, gender, age, education level, address, telephone)
sales (employee number, customer number, product number, quantity, unit price, date)
Inventory management subsystem:
picking list (requisition number, picker, product number, quantity, date)
goods receipt (receipt number, order number, delivered by, received by, date)
inventory (product number, warehouse number, inventory, date)
warehouse (warehouse number, warehouse manager, location, inventory description)
Personnel management subsystem:
employees (employee number, name, gender, age, education level, department number)
department (department number, department name, department supervisor, telephone)
2) Integrated
- Integration means that the data in the data warehouse must be consistent.
The data in the data warehouse is extracted from multiple scattered source databases, data files, and data segments.
The data sources may include both internal and external data, and they may encode the same value differently, for example F/M, 0/1, or A/B.
- Integration methods
Unification: eliminating inconsistencies
Synthesis: aggregating and calculating over the original data
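
The unification step can be sketched in Python. The article only lists the code pairs F/M, 0/1, and A/B; the source names (`crm`, `erp`, `web`), the reading of those codes as gender values, and the choice of which raw code maps to which unified value are all assumptions made for illustration.

```python
# Hypothetical per-source mappings onto one warehouse-wide coding ("F"/"M").
# Which raw code means which value is an assumption for this sketch.
GENDER_MAPS = {
    "crm": {"F": "F", "M": "M"},
    "erp": {"0": "F", "1": "M"},
    "web": {"A": "F", "B": "M"},
}


def unify_gender(source: str, raw_code: str) -> str:
    """Translate a source-specific code into the unified code."""
    try:
        return GENDER_MAPS[source][raw_code.strip().upper()]
    except KeyError:
        # Unknown source or code: flag it instead of guessing.
        return "UNKNOWN"


rows = [("crm", "F"), ("erp", "1"), ("web", "a"), ("erp", "9")]
unified = [unify_gender(source, code) for source, code in rows]
print(unified)
```

Returning an explicit `UNKNOWN` marker rather than raising keeps a batch load running while still making bad codes visible; a real pipeline might instead route such rows to a reject table.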
3) Non-volatile
- The data in the data warehouse is analytical data formed after extraction.
It is not raw operational data, and it is mainly used for enterprise decision analysis.
The main operation is "query"; under normal circumstances, "update" operations are not performed.
A stable data environment is also conducive to data analysis and decision making.
4) Time-variant
- The data warehouse organizes data in the form of dimensions, and the time dimension is a very important one.
New data is constantly added.
Old data is constantly deleted.
Time-related aggregated data is updated.
Two. The difference between a data warehouse and a database
A database is designed to capture and store data
A data warehouse is designed to analyze data
Comparison | Database | Data warehouse |
---|---|---|
Nature | Collection of data | Collection of data |
Positioning | Transaction processing (OLTP) | Data analysis (OLAP) |
Target users | Front-end users | Managers |
Operations | Insert, delete, update | Query |
Data granularity | Record | Dimension |
Table structure | 3NF | Star, snowflake |
The difference between OLTP and OLAP
- OLTP: On-Line Transaction Processing
OLTP is the main application of traditional relational databases, mainly for basic, day-to-day transaction processing, such as bank transactions
- OLAP: On-Line Analytical Processing
OLAP is the main application of the data warehouse system. It supports complex analysis operations, focuses on decision support, and provides intuitive and easy-to-understand query results
Property | OLTP | OLAP |
---|---|---|
Read characteristics | Only a few records are returned per query | Aggregates a large number of records |
Write characteristics | Random, low-latency writes of user input | Batch import |
Usage scenario | End users, Java EE projects | Internal analysts, providing decision support |
Data representation | Latest data state | Historical state over time |
Data size | GB | TB to PB |
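
The difference in read characteristics can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are hypothetical, and a single in-memory SQLite database stands in for both kinds of system purely to contrast the query shapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, "
    "product TEXT, quantity INTEGER, unit_price REAL, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales (customer, product, quantity, unit_price, sale_date) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("alice", "pen", 2, 1.5, "2023-01-05"),
        ("bob", "pen", 1, 1.5, "2023-01-06"),
        ("alice", "book", 1, 12.0, "2023-02-01"),
        ("carol", "book", 3, 12.0, "2023-02-10"),
    ],
)

# OLTP-style access: fetch one record by its key.
oltp_row = conn.execute(
    "SELECT customer, product, quantity FROM sales WHERE id = ?", (3,)
).fetchone()

# OLAP-style access: aggregate many records along a dimension (month).
olap_rows = conn.execute(
    "SELECT substr(sale_date, 1, 7) AS month, SUM(quantity * unit_price) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query touches a single row through the primary key, while the second scans the whole table and returns one summarized row per month.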
Three. Data warehouse architecture
Inmon architecture
Kimball architecture
Hybrid architecture
Data warehouse solutions
- Data collection
Flume, Sqoop, Logstash, DataX
- Data storage
MySQL, HDFS, HBase, Redis, MongoDB
- Data calculation
Hive, Tez, Spark, Flink, Storm, Impala
- Data visualization
Tableau, ECharts, Superset, QuickBI, DataV
- Task scheduling
Oozie, Azkaban, Crontab
Data ETL
- Extract
obtains data from operational data sources
- Transform
converts the data into a form and structure suitable for query and analysis
- Load
imports the transformed data into the final target data warehouse
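
The three ETL stages can be sketched as three small Python functions. This is only an illustration under stated assumptions: the CSV text, the column names, the `fact_sales` target table, and the cleaning rules (skip rows without a quantity, rewrite `/` in dates to `-`) are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical extract from an operational source, shown as CSV text.
SOURCE_CSV = """order_no,product,quantity,unit_price,date
1001,pen,2,1.50,2023/01/05
1002,book,1,12.00,2023/02/01
1003,pen,,1.50,2023/02/03
"""


def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop incomplete rows, cast types, normalize dates."""
    cleaned = []
    for r in records:
        if not r["quantity"]:
            continue  # skip rows with a missing quantity
        qty = int(r["quantity"])
        price = float(r["unit_price"])
        cleaned.append(
            (int(r["order_no"]), r["product"], qty, qty * price,
             r["date"].replace("/", "-"))
        )
    return cleaned


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (order_no INTEGER, "
        "product TEXT, quantity INTEGER, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM fact_sales ORDER BY order_no").fetchall()
print(loaded)
```

Keeping each stage a separate function mirrors how the tools listed in this article separate collection, computation, and storage, although those tools work at a very different scale.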
ETL tools
- Oracle
OWB (Oracle Warehouse Builder) and ODI (Oracle Data Integrator)
- Microsoft
SQL Server Integration Services
- SAP
Data Integrator
- IBM
InfoSphere DataStage
- Informatica
PowerCenter
- Pentaho
Kettle
Four. Data warehouse modeling
1. Choose a business process
- Confirm which business process the data warehouse should cover.
For example: understanding and analyzing the sales of a retail store
- Recording methods
Use plain text
Use Business Process Model and Notation (BPMN)
Use Unified Modeling Language (UML)
2. Declare granularity
- The granularity determines what a single record in the fact table represents.
For example: one purchased item on the shopping receipt of a retail store customer
- Granularity must be declared before selecting dimensions and facts
- It is recommended to start the design from the most detailed (atomic) data, because the original records can answer user queries that were not anticipated.
- Different facts can have different granularity
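
Why the finest grain is the safer starting point can be shown with a small Python sketch. The receipt line items and the `roll_up` helper are hypothetical; the point is only that detailed rows can be aggregated to any coarser grain, whereas a table stored at receipt grain could not be broken down by product afterwards.

```python
from collections import defaultdict

# Atomic grain: one row per purchased item on a receipt (hypothetical data).
line_items = [
    {"receipt": "R1", "product": "pen", "quantity": 2, "amount": 3.0},
    {"receipt": "R1", "product": "book", "quantity": 1, "amount": 12.0},
    {"receipt": "R2", "product": "pen", "quantity": 4, "amount": 6.0},
]


def roll_up(rows, key):
    """Aggregate atomic rows to a coarser grain identified by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


by_receipt = roll_up(line_items, "receipt")  # grain: one row per receipt
by_product = roll_up(line_items, "product")  # a question nobody planned for
print(by_receipt)
print(by_product)
```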
3. Confirm the dimensions
- Dimensions explain where the data in the fact table was collected from
- Typical dimensions are nouns, such as date, store, inventory, etc.
- A dimension table stores all the relevant data of one dimension.
For example, the date dimension should include data such as year, quarter, month, week, and day.
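
A date dimension with the attributes listed above can be generated rather than entered by hand. The following is a minimal sketch; the function name, the `date_key` format (`YYYYMMDD` as an integer), and the use of ISO 8601 week numbers are assumptions, not something the article prescribes.

```python
from datetime import date, timedelta


def build_date_dimension(start: date, end: date):
    """Return one dimension row per calendar day from start to end inclusive."""
    rows = []
    current = start
    while current <= end:
        iso_year, iso_week, iso_weekday = current.isocalendar()
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 20230330
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "week": iso_week,        # ISO 8601 week number
            "day": current.day,
            "weekday": iso_weekday,  # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2023, 3, 30), date(2023, 4, 2))
for row in dim_date:
    print(row)
```

Fact rows would then store only `date_key`, and every query that needs the year, quarter, or week reads it from the dimension row instead of recomputing it.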
4. Confirm the facts
- Identify the numeric measures that form the records of the fact table
- Facts are closely related to the business users of the system
- Most measures in the fact table are numeric types that can be summed and calculated.
For example: cost, quantity, amount
5. Star model features
- Consists of fact tables and dimension tables
- There can be one or more fact tables in a star schema, and each fact table references any number of dimension tables
- The star model divides the business process into facts and dimensions.
Facts contain business measures, which are quantitative data, such as sales price, sales quantity, distance, speed, weight, etc.
Dimensions describe the attributes of the fact data; date, product, customer, geographic location, etc. are dimensions
- Advantages
Simplifies queries
Simplifies business reporting logic
Improves query performance
Fast aggregation
Makes it easy to feed data into cubes
- Disadvantages
Data integrity cannot be guaranteed
Not flexible enough for analysis needs
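
A star schema can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_date`, `dim_product`), their columns, and the sample rows are hypothetical; the sketch only illustrates one fact table referencing denormalized dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_date VALUES (20230105, 2023, 1), (20230201, 2023, 2);
INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'novel', 'books');
INSERT INTO fact_sales VALUES
    (20230105, 1, 2, 3.0),
    (20230105, 2, 1, 12.0),
    (20230201, 1, 4, 6.0);
""")

# Each dimension is exactly one join away from the fact table.
sales_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(sales_by_category)
```

Because `category` is stored directly in `dim_product`, the same category string is repeated for every product in it. That redundancy is what keeps the query simple, and it is also why the schema itself cannot guarantee that the repeated values stay consistent.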
6. Features of Snowflake Model
- A logical layout of tables in a multidimensional model
- Composed of fact tables and dimension tables
- Normalizes the dimension tables of the star schema.
Low-cardinality attributes are removed from a dimension table and placed in a separate table
- One dimension is normalized into multiple related tables
- Advantages
Some OLAP multidimensional database modeling tools are optimized for the snowflake model.
Normalized dimension attributes save storage space
- Disadvantages
Normalizing dimension attributes increases the number of joins and the complexity of queries
Does not ensure data integrity
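
The effect of normalizing a dimension can be sketched with Python's built-in `sqlite3` module. All table and column names here are hypothetical: a low-cardinality `category` attribute is moved out of the product dimension into its own `dim_category` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute lives in its own table instead of dim_product.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
INSERT INTO dim_category VALUES (1, 'stationery'), (2, 'books');
INSERT INTO dim_product VALUES (1, 'pen', 1), (2, 'pencil', 1), (3, 'novel', 2);
INSERT INTO fact_sales VALUES (1, 3.0), (2, 1.0), (3, 12.0);
""")

# Reaching the category from the fact table now takes two joins.
sales_by_category = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(sales_by_category)
```

Each category name is stored once, which is the storage saving listed above, and the extra join in the query is the corresponding cost.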