How to design a data warehouse? What exactly is a data warehouse? A detailed explanation

1. What is a data warehouse? – Data Warehouse Concepts

Bill Inmon, the founder of the data warehouse concept, defined it in his book "Building the Data Warehouse" as follows: a data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data used to support management decision making.
The data warehouse is the structured data environment of the decision support system (DSS). As shown in the figure below, the decision support system performs online analytical processing (OLAP) on top of the data warehouse. Commonly used technologies include HDFS, HBase, Hive, Spark SQL, and so on.

(figure: a decision support system built on top of the data warehouse)

  1. Data acquisition: ingest source data into the data warehouse.
  2. Data analysis: analyze the data in the data warehouse.
  3. Reporting: generate reports from the analysis results.

2. The difference between OLTP and OLAP

OLTP (On-Line Transaction Processing) is online transaction processing, also known as transaction-oriented processing. Its basic characteristic is that user requests received by the front end are passed immediately to the processing system and a result is returned within a very short time; it is one of the ways to respond quickly to user operations. Examples include ERP systems, CRM systems, and Internet e-commerce systems. These systems are characterized by frequent transactional operations and small data volumes per operation.
OLAP (On-Line Analytical Processing), sometimes called a decision support system (DSS), supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results. These systems are characterized by few or no transactional operations, mostly queries, and large data volumes.

Detailed differences:

| Comparison item | OLTP | OLAP |
| --- | --- | --- |
| Function | transaction-oriented, day-to-day transaction processing | query-oriented, analytical processing |
| Design | business-process-oriented | subject-oriented |
| Data | latest data, two-dimensional | historical data, multidimensional |
| Storage size | MB, GB | TB, PB, EB |
| Response time | fast | slow |
| Users | business operators | management and decision makers |

3. Characteristics of data warehouse

3.1 Subject-oriented

The concept of a subject can be understood by comparison with database application systems.
Database application systems are divided into applications and databases according to business processes. For example, ERP (Enterprise Resource Planning) includes a purchase-sales-inventory system, a human resources management system, a financial management system, a warehouse management system, and so on. The purchase-sales-inventory system manages purchasing, sales, storage and related business processes, and the human resources system manages employee information, compensation and other related information. A data warehouse, in contrast, organizes and divides data into subjects based on data analysis requirements, such as a sales subject, an employee subject, and a product subject. A subject is an abstract concept that can be understood as a classification or catalog of related data. Through the sales subject, analyses such as annual sales rankings and monthly order statistics can be carried out. In short, subjects organize data around analysis needs, while database application systems organize data around business processes. Note: the data within a subject crosses application systems.

3.2 Data Integration

The data within a subject crosses application systems, that is, the data is scattered across various application systems. For example, sales data exists in the purchase-sales-inventory system and also in the financial system. To perform sales analysis, the sales data must first be integrated into the sales subject; analysis can then be carried out from the sales subject.

3.3 Non-volatile

Database application systems process and store data according to business requirements, while the data warehouse stores data according to data analysis requirements. Data in the data warehouse is used for query and analysis. To ensure the accuracy and stability of analysis, data in the warehouse is rarely updated, and historical snapshots are preserved.

3.4 Time Variation

The data stored in the data warehouse is historical data, and it changes over time. For example, sales data from past years is stored in the data warehouse. Even though data in the warehouse is rarely updated, it is not completely static; it varies with time in the following ways:
1) New data is added continuously. For example, each year's sales data is gradually loaded into the data warehouse.
2) Expired data is deleted. Data in the warehouse is kept for a long time (for example 5–10 years), but it also has an expiration time, and data past that time is removed.
3) Historical detail data is aggregated. To make analysis easier, finer-grained data is aggregated and stored according to analysis requirements, which is another manifestation of time variance. For example, to make annual sales statistics easier, sales records are first aggregated by month; annual sales can then be computed from the monthly results.
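As a rough illustration of point 3, the sketch below uses Hive-style SQL with hypothetical table and column names (`sales_detail`, `sales_monthly`, a `sale_date` stored as a 'yyyy-MM-dd' string) to aggregate transaction-level sales into a monthly summary, from which annual sales can then be derived.

```sql
-- Hypothetical detail table: one row per sales transaction
-- sales_detail(order_id, product_id, amount, sale_date)

CREATE TABLE IF NOT EXISTS sales_monthly (
  sale_month   STRING,          -- e.g. '2024-01'
  total_amount DECIMAL(18, 2),
  order_count  BIGINT
);

-- Aggregate detail records into the monthly summary
INSERT OVERWRITE TABLE sales_monthly
SELECT
  substr(sale_date, 1, 7) AS sale_month,   -- 'yyyy-MM'
  sum(amount)             AS total_amount,
  count(*)                AS order_count
FROM sales_detail
GROUP BY substr(sale_date, 1, 7);

-- Annual sales can then be computed from the monthly results
SELECT substr(sale_month, 1, 4) AS sale_year, sum(total_amount) AS annual_amount
FROM sales_monthly
GROUP BY substr(sale_month, 1, 4);
```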

4. Data warehouse system architecture

4.1 System Structure Diagram

The data warehouse provides the data environment for enterprise decision analysis. Where does the data come from? How is it stored in the data warehouse? How does the decision analysis system obtain data from the warehouse for analysis? All the parts involved in data acquisition, storage, and analysis together make up the data warehouse system.
(figure: data warehouse system architecture)

  1. Identify the source data on which the analysis depends.
  2. Ingest source data into data warehouse via ETL.
  3. Data is stored according to the subject structure defined by the data warehouse.
  4. Create a data mart (a subset of a data warehouse) based on the business analysis requirements of each department.
  5. Application systems such as decision analysis and reporting query and analyze data from the data warehouse.
  6. Users query analysis results and reports through the application system.

4.2. Source data

Source data refers to the original data used for analysis. This step mainly determines the source data according to the analysis requirements. The data is distributed across internal systems and external systems: internal data mainly comes from the enterprise's ERP system, while external data refers to data generated outside the enterprise, usually industry data. The biggest characteristic of source data is that its formats are not uniform. To analyze it, the source data must go through ETL to be centrally acquired, filtered, and converted.

4.3. ETL

ETL (Extract, Transform, Load) includes the three processes of data extraction, data transformation, and data loading.

  1. Extraction
    Data extraction collects source data from sources such as the various business systems and external systems.
  2. Transformation
    To be stored in the data warehouse, the collected source data must be converted to a certain data format. Common conversions include data type conversion, format conversion, missing-value filling, and data synthesis.
  3. Loading
    Loading stores the transformed data into the data warehouse. Data loading is usually performed at a fixed frequency, such as loading the day's order data every day, or loading customer information every week (see the sketch below).
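A minimal sketch of the load step in Hive-style SQL, assuming a hypothetical staging table `stg_orders` (raw extracted data, all string columns) and a cleaned, day-partitioned warehouse table `dwd_orders`; the transformation here is limited to type conversion, missing-value filling, and format conversion.

```sql
-- Daily incremental load of one day's extracted orders (hypothetical names)
INSERT OVERWRITE TABLE dwd_orders PARTITION (dt = '2024-06-01')
SELECT
  order_id,
  cast(amount AS DECIMAL(18, 2))   AS amount,       -- type conversion
  coalesce(customer_id, 'UNKNOWN') AS customer_id,  -- missing-value filling
  to_date(create_time)             AS order_date    -- format conversion
FROM stg_orders
WHERE dt = '2024-06-01';
```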

4.4. Data Warehouse and Data Mart

A data warehouse is a collection of data used for enterprise-wide analysis, covering, for example, the sales, customer, and product subjects. A data mart is a collection of data used for departmental analysis; in scope it is a subset of the data warehouse. For example, the sales department's data mart contains only the sales subject.
Why is there a concept of a data mart?
Building a data warehouse for the enterprise as a whole is usually difficult, because many business and analysis requirements are involved. The data mart concept was therefore proposed: the warehouse can be built starting from a single department, which is more efficient.
The industry calls building the warehouse starting from the enterprise as a whole "top-down", and building it starting from data marts and gradually completing the entire warehouse "bottom-up". Building bottom-up is usually recommended, although this is still debated in the industry.
What is the difference between a data warehouse and a data mart?
1. The difference in scope
The data warehouse is a collection of data for the overall analysis of the enterprise.
A data mart is a collection of data for department-level analysis.
2. Different data granularity
The data warehouse usually contains finer-grained detail data.
The data mart aggregates data on top of the data warehouse, and the aggregated data is used directly for departmental business analysis.

4.5. Application system

The application system here refers to systems that use the data warehouse to perform data analysis, data query, and data reporting, such as OLAP systems and data query systems. These applications query and analyze data from the data warehouse.

4.6. Users

The users who use the data warehouse system mainly include data analysts, management decision-makers (company executives), etc.

5. Dimensional Analysis

5.1. Introduction to Dimensional Analysis

Data analysis usually takes a dimensional approach. For example, suppose a user proposes an indicator: the number of course visits. To meet different analysis needs, course visits can be analyzed from the time dimension (visits per day or per hour) or from the course dimension (visits per course or per course category).

5.2. Indicators and dimensions

Dimensional analysis requires understanding two terms: indicators (also called metrics or measures) and dimensions.
Indicators are the standards for measuring business development, such as price and sales volume; indicators can be summed, averaged, and otherwise computed.
Indicators are divided into absolute values and relative values. An absolute value reflects a specific size or amount, such as price, sales volume, or score; a relative value reflects a degree, such as pass rate, purchase rate, or growth rate.
Dimensions are the characteristics of things, such as color, region, and time, and indicators can be analyzed and compared along different dimensions. For example, product sales can be analyzed by region (sales in each region) or by time (sales in each month); the same sales indicator gives different results when analyzed from different dimensions.
Dimensions are divided into qualitative and quantitative dimensions. Qualitative dimensions are character-type characteristics; for example, the region dimension includes all provinces in the country. Quantitative dimensions are numeric-type characteristics, such as price ranges or sales ranges. For example, if the price-range dimension is divided into two ranges, 0–100 and 100–1000, indicators can then be analyzed by price range. In fact, an indicator can be converted into a dimension, and the converted dimension is a quantitative dimension. Roughly speaking, dimensions play the role of the x-axis and the indicator values the role of the y-axis.
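For instance, the sketch below (Hive-style SQL over a hypothetical `orders` table) analyzes the same sales indicator along a qualitative dimension (province) and along a quantitative dimension derived from the price indicator (price range).

```sql
-- Same indicator (sales amount), analyzed along two different dimensions

-- Qualitative dimension: province
SELECT province, sum(amount) AS total_amount
FROM orders
GROUP BY province;

-- Quantitative dimension: a price range derived from the price indicator
SELECT
  CASE WHEN price < 100 THEN '0-100' ELSE '100-1000' END AS price_range,
  sum(amount) AS total_amount
FROM orders
GROUP BY CASE WHEN price < 100 THEN '0-100' ELSE '100-1000' END;
```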

5.3. Dimension hierarchies and levels

Usually, the first thing seen in an analysis result is a total, such as annual course purchases; then the quarterly and monthly purchases are examined in more detail. Year, quarter, and month belong to one hierarchy of the time dimension, and year, quarter, and month are the three levels of that hierarchy. Similarly, when analyzing course purchases by region, country, province, city, and county belong to one hierarchy of the region dimension, and that hierarchy has four levels.
This amounts to subdividing the dimension: if it is subdivided into two tiers, the dimension contains one hierarchy with multiple levels; if into three tiers, it contains multiple hierarchies, each with multiple levels.
Each dimension has at least one hierarchy, and that hierarchy has at least one level.

5.4 Drill down and roll up

A dimension has hierarchies, and each hierarchy can have multiple levels, so analysis can be performed along different dimension hierarchies and levels, flexibly obtaining either high-level summary information or low-level detail information.
The process of obtaining higher-level summary information is called roll-up, and the process of obtaining lower-level detail information is called drill-down. For example, in the analysis of course visits, the time dimension has four levels: year, month, day, and hour. If we are currently analyzing course visits at the day level, we can drill down to the hour level to get visits per hour within a day, or roll up to the month level to get visits per month.
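A sketch of roll-up and drill-down in Hive-style SQL over a hypothetical `course_visit` fact table whose time dimension is stored as year, month, day, and hour columns; moving between levels is simply a matter of changing the GROUP BY columns.

```sql
-- Current level: visits per day
SELECT visit_year, visit_month, visit_day, count(*) AS visits
FROM course_visit
GROUP BY visit_year, visit_month, visit_day;

-- Drill down: visits per hour within each day
SELECT visit_year, visit_month, visit_day, visit_hour, count(*) AS visits
FROM course_visit
GROUP BY visit_year, visit_month, visit_day, visit_hour;

-- Roll up: visits per month
SELECT visit_year, visit_month, count(*) AS visits
FROM course_visit
GROUP BY visit_year, visit_month;
```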

6. Data Warehouse Modeling

There are two commonly used data warehouse modeling methods: the third-normal-form (3NF) modeling method and the dimensional modeling method. The 3NF method is mainly used in traditional enterprise-level data warehouses, which are usually implemented on relational databases; it was proposed by Inmon and applies to the top-down data warehouse architecture. The dimensional model builds the model around dimensional analysis; it was proposed by Kimball and applies to the bottom-up data warehouse architecture. The rest of this article uses the dimensional modeling approach.
Dimensional modeling is abbreviated DM (Dimensional Modeling). In the view of data warehousing expert Kimball, the dimensional data model is a design technique oriented toward supporting end-user queries against the data warehouse, built around performance and understandability. A dimensional model organizes data according to how users view or analyze it.
The two core concepts of dimensional modeling are the fact table and the dimension table.

6.1. Fact table

The fact table records the numeric information of a specific event, and generally consists of numeric measures and foreign keys pointing to dimension tables.
The design of the fact table depends on the business system, and the data in the fact table are the indicator data of the business system. The essence of data analysis is computation over the fact table.
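A minimal sketch of such a table in Hive-style SQL, with hypothetical names: only numeric measures and keys referring to dimension tables.

```sql
-- Sales fact table: one row per sales transaction (hypothetical example)
CREATE TABLE fact_sales (
  date_key     BIGINT,         -- refers to dim_date
  product_key  BIGINT,         -- refers to dim_product
  store_key    BIGINT,         -- refers to dim_store
  customer_key BIGINT,         -- refers to dim_customer
  quantity     BIGINT,         -- measure
  amount       DECIMAL(18, 2)  -- measure
);
```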

6.1.1. Classification

6.1.1.1 Transaction fact table

The transaction fact table, the periodic snapshot fact table, and the cumulative snapshot fact table can share the same dimensions, but they differ greatly in how they describe business facts.
The transaction fact table records transaction-level facts and stores the most atomic data; it is also called an "atomic fact table". Data in a transaction fact table is generated after a transaction event occurs, and the granularity is usually one record per transaction. Once the transaction is committed and the fact row is inserted, the data is no longer changed; its update method is incremental (append-only).
The date dimension of a transaction fact table records the date the transaction occurred, and the facts it records are the content of the transaction activity. Through the transaction fact table, users can perform very detailed analysis of transaction behavior.
The "fact table" referred to in everyday discussion usually means the transaction fact table.

6.1.1.2 Periodic snapshot fact table

The periodic snapshot fact table records facts at regular, predictable intervals, such as daily, monthly, or yearly. Typical examples are a daily sales snapshot table and a daily inventory snapshot table.
Imagine the following scenario: how do we obtain statistics on the transaction volume of commodities in a quarter? If the calculation is run over a quarter's worth of the transaction fact table, the result can be obtained, but the efficiency is too low to be feasible in production. Therefore, the specified measures need to be aggregated periodically and stored as a periodic snapshot table for downstream use. Generally, when fact tables are designed, the transaction fact table and the periodic snapshot table are designed in pairs. Most periodic snapshot tables are produced by processing the transaction table, while some special data is written directly by the application system (such as order reviews).
The granularity of a periodic snapshot fact table is one record per time period, which is usually coarser than the granularity of the transaction fact table; it is an aggregation table built on top of the transaction fact table. For example, if the time period is one week, one record in the periodic snapshot fact table is the statistical value of a certain measure for that week. Periodic snapshot fact tables usually have fewer dimensions than transaction fact tables.
The date dimension of the periodic snapshot fact table is usually the end date of the recorded period, and the recorded facts are aggregated values over that period. Once a row is inserted it is not changed; its update method is incremental (append-only).
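A sketch of a monthly periodic snapshot built from the hypothetical `fact_sales` transaction fact table above: one row per product per month, appended by a scheduled aggregation job (the `year_month` column on `dim_date` is also assumed).

```sql
-- Periodic snapshot: one row per product per month (hypothetical)
CREATE TABLE IF NOT EXISTS fact_sales_monthly_snapshot (
  month_key    STRING,         -- end-of-period month, e.g. '2024-06'
  product_key  BIGINT,
  total_qty    BIGINT,
  total_amount DECIMAL(18, 2)
);

-- Append this period's aggregates (incremental update)
INSERT INTO TABLE fact_sales_monthly_snapshot
SELECT '2024-06' AS month_key,
       f.product_key,
       sum(f.quantity) AS total_qty,
       sum(f.amount)   AS total_amount
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year_month = '2024-06'
GROUP BY f.product_key;
```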

6.1.1.3 Cumulative snapshot fact table

The cumulative (accumulating) snapshot fact table is somewhat similar to the periodic snapshot fact table: both store snapshots of transaction data. But there is also a big difference between them: the periodic snapshot fact table records data for a fixed period, while the cumulative snapshot fact table records data over an indeterminate period.
The cumulative snapshot fact table represents a time span that completely covers the life cycle of a transaction or product, and it usually has multiple date fields to record the key points in that life cycle. For example, an order cumulative snapshot fact table has milestones such as payment date, shipment date, and receipt date.
A complete transaction in the transaction fact table has a series of rows in different states that record the whole process; in the cumulative snapshot fact table there is only one row, which is updated until the process ends.
In addition, the cumulative snapshot fact table usually has an extra date field indicating when the row was last updated.
Since many of the dates in this fact table are not known at first load, surrogate keys must be used to handle the undefined dates, and this type of fact table can be updated after loading to fill in the dates as they become known.
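A sketch of maintaining an order cumulative-snapshot table with one row per order and several milestone dates. The MERGE statement below assumes a Hive ACID (transactional) table or another engine that supports MERGE; table and column names such as `fact_order_accumulating` and `daily_order_events` are hypothetical.

```sql
-- One row per order, covering the whole order life cycle (hypothetical)
-- fact_order_accumulating(order_id, order_date, payment_date,
--                         ship_date, receipt_date, last_update_date)

MERGE INTO fact_order_accumulating t
USING daily_order_events s            -- today's new milestone events
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET          -- fill in newly known milestone dates
  payment_date     = coalesce(s.payment_date, t.payment_date),
  ship_date        = coalesce(s.ship_date,    t.ship_date),
  receipt_date     = coalesce(s.receipt_date, t.receipt_date),
  last_update_date = s.event_date
WHEN NOT MATCHED THEN INSERT VALUES   -- first time this order is seen
  (s.order_id, s.order_date, s.payment_date,
   s.ship_date, s.receipt_date, s.event_date);
```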

| Feature | Transaction fact table | Periodic snapshot fact table | Cumulative snapshot fact table |
| --- | --- | --- | --- |
| Time/period | a point in time | a time period | multiple points in time over a short time span |
| Granularity | each row represents a transaction event | each row represents a time period | each row represents a business life cycle |
| Loading | insert (append) | insert (append) | insert and update |
| Update | not updated | not updated | updated when new events occur |
| Time dimension | business (transaction) date | end of period | completion dates of multiple business processes |
| Facts | business activity | performance over a time interval | performance over multiple defined business stages |

6.2. Dimension table

A dimension is an angle from which data is observed, and is generally a noun. For example, for the fact "sales amount", we can observe and analyze it from dimensions such as sales time, product, store, and customer.
Dimension tables have fewer records than fact tables, but each record may contain many fields.

6.2.1. Types of dimension data

Dimension tables mainly include two types of data:
1. High-cardinality dimension data: typically tables like user tables and product tables; the number of rows may be in the tens of millions or hundreds of millions.
2. Low-cardinality dimension data: typically configuration tables, such as the textual meaning of enumeration values, or date and geographic dimension tables; the number of rows may be in the single digits, thousands, or tens of thousands.
Cardinality refers to the number of distinct values in a field. For example, a primary-key column contains only unique values and therefore has the highest cardinality, while columns such as gender and other enumerations (dates, regions, etc.) have very low cardinality.

6.3 Common modeling methods

6.3.1. Star schema

The star schema is a multidimensional data structure with a fact table at the center surrounded by multiple dimension tables.
There can be one or more fact tables in a star schema, and each fact table can refer to any number of dimension tables.
A star schema divides business processes into facts and dimensions. Facts are measures of business, quantitative data such as price, sales volume, distance, speed, quality, etc. Dimensions are descriptions of attributes of fact data, such as dates, products, customers, geographic locations, and so on.

(figure: star schema, a fact table surrounded by dimension tables)

6.3.2. Snowflake schema

When one or more dimension tables are not joined directly to the fact table but are connected to it through other dimension tables, the result looks like several snowflakes joined together, hence the name snowflake schema. The snowflake schema is an extension of the star schema: it further normalizes the dimension tables of the star schema, and the original dimension tables may be split into smaller dimension tables that form local "hierarchy" areas. These decomposed tables are joined to the main dimension table rather than to the fact table.
(figure: snowflake schema)
How is a dimension table split into layers?
Attributes with low cardinality (many repeated values, low distinguishing power, little dimension data, such as gender) are removed from the dimension table and placed in a separate table.
For example, in the case mentioned above, the course-purchase indicator has a course dimension; the course dimension can be snowflaked by splitting the course category out into a new dimension table.
Impact of snowflaking:
Snowflaking moves highly repetitive fields of a dimension table into a new table, so it inevitably increases the number of tables; it reduces data storage space and makes data updates more efficient, but queries need to join more tables.
To sum up, in the snowflake schema a dimension is normalized into multiple associated tables, while in the star schema each dimension is represented by a single dimension table.
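A sketch of snowflaking a course dimension in Hive-style SQL (hypothetical names): the low-cardinality course category is split out of `dim_course` into its own table, which joins to the main dimension table rather than to the fact table.

```sql
-- Star schema: category stored directly in the course dimension
-- dim_course(course_key, course_name, category_name, ...)

-- Snowflake schema: category normalized into its own table
CREATE TABLE dim_course_category (
  category_key  BIGINT,
  category_name STRING
);

CREATE TABLE dim_course (
  course_key   BIGINT,
  course_name  STRING,
  category_key BIGINT   -- joins to dim_course_category, not to the fact table
);

-- Queries now need one extra join
SELECT c.category_name, sum(f.amount) AS total_amount
FROM fact_course_purchase f
JOIN dim_course d          ON f.course_key   = d.course_key
JOIN dim_course_category c ON d.category_key = c.category_key
GROUP BY c.category_name;
```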

6.4. Slowly changing dimensions

According to their degree of change, dimensions can be divided into unchanging dimensions and changing dimensions. For example, for information about a person, the ID number, name, and gender are the unchanging part, while marital status, work experience, employer, and training history are fields that may change.
Most dimension data changes slowly over time. For example, a new product may be added, a product's ID may be modified, or a new attribute may be added to a product; the dimension table is then modified or a new row is added. Therefore, when designing and using dimensions, the handling of slowly changing dimension data must be considered.
A slowly changing dimension is one whose attributes may change over time. For example, a DimCustomer dimension containing the user's address may change, which can affect the accuracy of business statistics. Such a dimension is called a Slowly Changing Dimension (SCD).

6.4.1. SCD1 (Slowly Changing Dimension Type 1)

Existing values are overwritten directly by updating the dimension record. No history is kept. SCD1 is generally used to correct erroneous data, that is, cases where the historical data was simply wrong and has no other use.

In the data warehouse, we can keep the warehouse data consistent with the business data at all times: use the business key from the business database (for example, CustomerID) in the Customer dimension to track changes in the business data, and overwrite the old value whenever a change occurs.
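A sketch of SCD1 on a hypothetical `dim_customer` table, assuming a table that supports in-place UPDATE (for example a Hive ACID table): the changed attribute is simply overwritten, keyed by the business key `customer_id`, and no history is kept.

```sql
-- SCD1: overwrite the old value in place; no history is kept (hypothetical)
UPDATE dim_customer
SET    address = '123 New Street'
WHERE  customer_id = 'C0001';
```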

6.4.2. SCD2 (Slowly Changing Dimension Type 2)

When the source data changes, a new "version" record is created for the dimension record to preserve the dimension's history. SCD2 does not delete or modify existing data. An SCD2 table is also called a zipper table.
Many requirements in the data warehouse involve summarizing and analyzing historical data, so historical data from the business systems is preserved as much as possible, allowing the system to truly capture such historical changes.
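A common zipper-table layout, sketched in Hive-style SQL with hypothetical names (`dim_customer_zip`, `changed_customers`): each version of a customer row carries an effective-date range, the current version has the open-ended end date '9999-12-31', and a change closes the old version and inserts a new one. The rebuild-by-overwrite pattern below avoids relying on UPDATE support.

```sql
-- dim_customer_zip(customer_id, name, address, start_date, end_date)
-- changed_customers: today's new or changed customer records from the source

INSERT OVERWRITE TABLE dim_customer_zip
SELECT * FROM (
  -- existing rows: close the current version of customers that changed today
  SELECT z.customer_id, z.name, z.address, z.start_date,
         CASE WHEN z.end_date = '9999-12-31' AND c.customer_id IS NOT NULL
              THEN '2024-06-01' ELSE z.end_date END AS end_date
  FROM dim_customer_zip z
  LEFT JOIN changed_customers c ON z.customer_id = c.customer_id
  UNION ALL
  -- new rows: today's changed customers become the new current version
  SELECT customer_id, name, address,
         '2024-06-01' AS start_date, '9999-12-31' AS end_date
  FROM changed_customers
) merged;

-- Query the dimension "as of" any historical date
SELECT *
FROM dim_customer_zip
WHERE '2024-03-15' >= start_date AND '2024-03-15' < end_date;
```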

6.4.3. SCD3 (Slowly Changing Dimension Type 3)

In fact, SCD1 and SCD2 can meet most needs, but there are other options, such as SCD3, which keeps only a limited amount of history.

For example, add a "previous" column for each field whose history is to be kept, and on each change update only the Current column and the Previous column. In this way only the last two values are kept, and the history sits in the same row. But if many fields need to be tracked, this becomes cumbersome, because each needs its own Current and Previous columns. So SCD3 is less common than SCD1 and SCD2; it is mainly applicable when storage space is limited and users accept limited history.
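A sketch of SCD3 on a hypothetical `dim_customer` table that keeps only the current and the previous address in two columns (again assuming a table that supports in-place UPDATE):

```sql
-- SCD3: keep only the current and the previous value, in the same row
-- dim_customer(customer_id, name, current_address, previous_address)

UPDATE dim_customer
SET    previous_address = current_address,   -- shift the old value into the "previous" column
       current_address  = '123 New Street'
WHERE  customer_id = 'C0001';
```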

7. Layering of data warehouse

7.1. Why layering?

As data planners, we want our data to flow in an orderly way, with the entire life cycle of the data clearly visible to both designers and users. Ideally, the layers are clear and the dependencies are intuitive, as shown in the figure below.

(figure: a clear, layered data flow)
However, in most cases the data systems we build end up with complex dependencies and chaotic layers, as shown in the figure below: without realizing it, we may create a data system with a chaotic table-dependency structure, or even circular dependencies.
(figure: chaotic table dependencies)

Therefore, we need an effective way of organizing and managing data so that our data system is more orderly, and that is the data layering discussed here. Data layering cannot solve every data problem, but it brings the following benefits:
1. Clear data structure: each layer has its own scope and responsibilities, which makes data easier to locate and understand.
2. Simplified complexity: a complex task is decomposed into multiple steps, and each layer solves one specific problem.
3. Easier maintenance: when there is a problem with the data, there is no need to repair everything; repairs can start from the step where the problem occurred.
4. Less repetitive development: standardizing the layers and developing common intermediate-layer data reduces duplicated development work.
5. High performance: the data warehouse greatly shortens the time needed to obtain information, especially for associated and complex queries over massive data; layering the warehouse helps implement complex statistical requirements and improves the efficiency of data statistics.

The data model is usually divided into three layers: the data operation layer (ODS), the data warehouse layer (DW), and the data application layer (APP). Simply put, the ODS layer stores the raw data as ingested, the DW layer stores the intermediate-layer data that the data warehouse focuses on, and the APP layer stores application data customized for the business. The design of these three layers is described in detail below.

7.2. Hierarchical method

7.2.1. Source Data Layer (ODS)

The data in this layer is not changed: the data structures and data of the source (peripheral) systems are used directly, and the layer is not exposed externally. It is a staging layer, a temporary storage area for interface data, preparing for the next step of processing.
(Data that does not need to be modified.)

7.2.2. Data warehouse layer (DW)

The data in the DW layer should be consistent, accurate, and clean, that is, data that has been cleaned (impurities removed) from the source system data.

7.2.2.1 DWD detail layer

Stores detail data, the most fine-grained fact data. This layer generally keeps the same granularity as the ODS layer and provides a certain level of data quality assurance. At the same time, to improve the usability of the detail layer, this layer usually applies dimension degeneration, folding some dimensions into the fact table to reduce joins between fact tables and dimension tables. At this step we can first determine the business subjects and build the tables in this layer by subject.

7.2.2.2 DWM middle layer

Store intermediate data, which is the intermediate table data that needs to be created for data statistics. This data is generally aggregated data for multiple dimensions. The data in this layer usually comes from the data in the DWD layer.

7.2.2.3 DWS business layer

Stores wide-table data. This layer contains aggregated data for a particular business domain, and the application layer usually draws its data from here. Why is it called a wide table? Mainly because, to serve the application layer, all data related to a business domain is collected and stored together, making it convenient for the business layer to obtain. The data in this layer usually comes from the DWD and DWM layers.

In practice, if wide-table statistics are computed directly from DWD or ODS, the computation is too heavy and the dimensions too few. The usual approach is therefore to first compute several small intermediate tables in the DWM layer and then join them into the DWS wide table. Since the boundary between wide and narrow is hard to define, the DWM layer can also be dropped, leaving only the DWS layer and placing all of this data in DWS.
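A sketch of assembling a DWS wide table from hypothetical DWM intermediate tables (a daily visit summary and a daily order summary), joined on the course key, in Hive-style SQL.

```sql
-- DWM: small intermediate aggregates (hypothetical names)
-- dwm_course_visit_1d(course_id, visit_cnt, dt)
-- dwm_course_order_1d(course_id, order_cnt, order_amount, dt)

-- DWS: one wide row per course per day, stitched together from the DWM tables
INSERT OVERWRITE TABLE dws_course_1d PARTITION (dt = '2024-06-01')
SELECT
  coalesce(v.course_id, o.course_id) AS course_id,
  coalesce(v.visit_cnt, 0)           AS visit_cnt,
  coalesce(o.order_cnt, 0)           AS order_cnt,
  coalesce(o.order_amount, 0)        AS order_amount
FROM (SELECT * FROM dwm_course_visit_1d WHERE dt = '2024-06-01') v
FULL OUTER JOIN
     (SELECT * FROM dwm_course_order_1d WHERE dt = '2024-06-01') o
  ON v.course_id = o.course_id;
```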

7.2.3 Data application layer (ADS or DA or APP)

This layer holds the data sources read directly by front-end applications; the data here is generated according to the requirements of reports and thematic analyses.

7.2.4. Dimension layer (DIM)

The dimension layer mainly contains two types of data:

  1. High-cardinality dimensional data: generally, data tables similar to user data tables and product data tables. The amount of data may be tens of millions or hundreds of millions.
  2. Low-cardinality dimension data: generally a configuration table, such as the Chinese meaning corresponding to an enumeration value, or a date dimension table. The amount of data may be single digits or tens of thousands.

Origin blog.csdn.net/weixin_48143996/article/details/121988548