Data Warehouse Modeling Theory and Warehouse Design Ideas

1. Data Warehouse

1.1. Overview of Data Warehouse

A data warehouse is an enterprise-level data management system designed for data analysis. The data warehouse can centralize and integrate a large amount of data from multiple information sources. With the analysis capability of the data warehouse, enterprises can obtain valuable information from the data to improve decision-making. At the same time, the large amount of historical data accumulated in the data warehouse over time is also invaluable to data scientists and business analysts.

1.2. Core Architecture of Data Warehouse

[Figure: core architecture of the data warehouse]

2. Overview of data warehouse modeling

2.1. Significance of data warehouse modeling

  • If we think of data as books in a library, we want them shelved by category;
  • If we think of data as the buildings of a city, we want the urban layout to be well planned;
  • If we think of data as files and folders on a computer, we want a folder structure that matches our habits, rather than a cluttered desktop where finding a file is a struggle.

The data model is the method of organizing and storing data. It emphasizes storing data sensibly from the perspectives of the business, data access, and data use. Only when data is organized and stored in an orderly way can it be used with high performance, low cost, high efficiency, and high quality.

  • High performance: A good data model can help us quickly query the required data.
  • Low cost: A good data model can reduce repeated calculations, realize the reuse of calculation results, and reduce calculation costs.
  • High efficiency: A good data model can greatly improve the user experience of using data and improve the efficiency of using data.
  • High quality: A good data model can reduce inconsistency in how statistics are defined and lower the risk of calculation errors.

2.2. Data Warehouse Modeling Methodology

2.2.1. ER model

The modeling method proposed by Bill Inmon, the father of data warehousing, uses the entity-relationship (ER) model to describe the business from the perspective of the entire enterprise, and expresses it in a normalized form that conforms to third normal form (3NF).

2.2.1.1. Entity-relationship model

The entity-relationship model abstracts complex data into two concepts: entities and relationships. An entity represents an object, such as a student or a class. A relationship is an association between two entities, such as a student belonging to a class.

2.2.1.2. Database normalization

Database normalization is the process of designing a database (usually a relational database) according to a series of normal forms in order to reduce data redundancy and improve data consistency.
The normal forms are the rules that a relational database design is expected to follow.

There are six commonly cited normal forms for relational databases: first normal form (1NF), second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF).

The higher the normal form a design satisfies, the lower its data redundancy.

2.2.1.3. The three normal forms

2.2.1.3.1. Functional dependencies

[Figure: definition of functional dependency]

  1. full functional dependency
  2. partial functional dependency
  3. transitive functional dependency

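The three kinds of dependency can be checked mechanically on sample data. The sketch below is only an illustration: the `scores` table and its columns are hypothetical, and a check on sample rows can only show that a dependency is not violated by those rows, not that it holds for the schema in general. A dependency X -> Y is treated as full when no proper subset of X also determines Y, partial when some proper subset does, and transitive when X -> Y and Y -> Z both hold.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_full_dependency(rows, lhs, rhs):
    """lhs -> rhs holds, and no proper non-empty subset of lhs determines rhs."""
    if not fd_holds(rows, lhs, rhs):
        return False
    for size in range(1, len(lhs)):
        for subset in combinations(lhs, size):
            if fd_holds(rows, list(subset), rhs):
                return False
    return True

# Hypothetical score table whose key is (student_id, course).
scores = [
    {"student_id": 1, "course": "math", "score": 90, "name": "Ann", "dept": "CS", "dean": "Lee"},
    {"student_id": 1, "course": "art", "score": 85, "name": "Ann", "dept": "CS", "dean": "Lee"},
    {"student_id": 2, "course": "math", "score": 70, "name": "Bob", "dept": "EE", "dean": "Kim"},
    {"student_id": 3, "course": "art", "score": 85, "name": "Cid", "dept": "CS", "dean": "Lee"},
]

KEY = ["student_id", "course"]
score_is_full = is_full_dependency(scores, KEY, ["score"])      # full dependency
name_is_full = is_full_dependency(scores, KEY, ["name"])        # partial: student_id alone suffices
dean_is_transitive = (fd_holds(scores, ["student_id"], ["dept"])
                      and fd_holds(scores, ["dept"], ["dean"]))  # student_id -> dept -> dean
```

In this sample, `score` depends fully on the key, `name` depends on only part of it, and `dean` depends on `student_id` through `dept`.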
2.2.1.3.2. First normal form

[Figure: first normal form]

2.2.1.3.3. Second normal form

[Figure: second normal form]

2.2.1.3.4. Third normal form

[Figure: third normal form]

The figure below shows a model built with the approach advocated by Bill Inmon. It is relatively loose and fragmented, with a large number of physical tables.

[Figure: example ER model]

The starting point of this modeling method is data integration: it combines and merges the data of the entire enterprise and normalizes it to reduce redundancy and ensure consistency. A model of this kind is not well suited to direct use in analysis and statistics.

2.2.2. Dimensional model

Dimensional modeling is the method advocated by Ralph Kimball, another leading figure in data warehousing. The dimensional model presents complex business through two concepts: facts and dimensions. Facts usually correspond to business processes, and dimensions usually correspond to the context in which business processes occur.

Note: A business process can be summarized as an indivisible behavioral event. In e-commerce transactions, placing an order, canceling an order, paying, and refunding are all business processes.

The figure below shows a typical dimensional model. The SalesOrder table at the center is a fact table, which stores every record of the order-placement business process. Each surrounding table is a dimension table, including Date, Customer, Product, and Location. These dimension tables describe the context of each order: who ordered which product, when, and where. The model is relatively clear and concise.

[Figure: typical dimensional model with SalesOrder at the center]

Dimensional modeling starts from data analysis and serves data analysis. It therefore focuses on how users can complete their analysis faster and how to achieve good response times for large, complex queries.
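A minimal sketch of such a model, using Python's built-in `sqlite3` module, is given here. The table and column names (`fact_sales_order`, `dim_location`, and so on) and the sample rows are assumptions chosen for illustration; only the overall shape, one fact table with foreign keys to the Date, Customer, Product, and Location dimensions, comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, province TEXT);

-- Fact table: dimension foreign keys plus additive measures.
CREATE TABLE fact_sales_order (
    date_id     INTEGER REFERENCES dim_date(date_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    quantity    INTEGER,
    amount      INTEGER
);

INSERT INTO dim_date     VALUES (1, '2023-03-01');
INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_product  VALUES (1, 'Phone X', 'Phone'), (2, 'Phone case', 'Accessory');
INSERT INTO dim_location VALUES (1, 'Zhejiang'), (2, 'Guangdong');
INSERT INTO fact_sales_order VALUES
    (1, 1, 1, 1, 1, 3000),
    (1, 2, 1, 2, 2, 6000),
    (1, 1, 2, 1, 3, 90);
""")

# Dimensions supply the grouping columns; the fact table supplies the measure.
amount_by_province_category = conn.execute("""
    SELECT l.province, p.category, SUM(f.amount)
    FROM fact_sales_order AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    GROUP BY l.province, p.category
    ORDER BY l.province, p.category
""").fetchall()
```

Each analysis query joins the fact table directly to the dimensions it needs, which is what keeps queries against this kind of model short.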

3. Fact tables in dimensional modeling theory

3.1. Fact table overview

As the core of data warehouse dimensional modeling, the fact table is designed around a business process. It contains references to the dimensions related to the business process (dimension table foreign keys) and the measures of the business process (usually additive numeric fields).

3.2. Fact table features

Fact tables are usually narrow and long: they have relatively few columns but many rows, and the number of rows grows quickly.

3.3. Fact table classification

There are three types of fact tables: transactional fact tables, periodic snapshot fact tables, and cumulative snapshot fact tables. Each has different characteristics and applicable scenarios.

3.3.1. Transactional fact table

3.3.1.1. Overview

A transactional fact table records each business process. It stores the atomic operation events of the business process, that is, events at the finest granularity. Granularity refers to the level of business detail represented by one row of the fact table.

The transactional fact table can be used to analyze various statistical indicators related to various business processes. Because it saves the most fine-grained records, it can provide maximum flexibility and support unexpected statistical requirements at various levels of detail.

3.3.1.2. Design process

The following four steps can generally be followed when designing a transactional fact table.

Select business process → Declare granularity → Determine dimensions → Determine facts

  1. Select business process
    In the business system, select the business processes of interest. A business process can be summarized as an indivisible behavioral event. In e-commerce transactions, placing an order, canceling an order, paying, and refunding are all business processes. Usually, one business process corresponds to one transactional fact table.

  2. Declare granularity
    After the business process is chosen, declare its granularity, that is, define precisely what each row in the transactional fact table represents. Choose the finest granularity possible so that requirements at any level of detail can be met. A typical granularity statement is: one row in the order fact table represents one commodity item in one order.

  3. Determine dimensions
    Determining dimensions means identifying the dimensions related to each transactional fact table. Select as much context information related to the business process as possible, because the richness of the dimensions determines the richness of the indicators the dimensional model can support.

  4. Determine facts
    The word "fact" here refers to the measures of each business process (usually additive numeric values, such as counts, quantities, number of items, and amounts).

After these four steps, the design of the transactional fact table is essentially complete.

  • The first step, selecting the business process, determines which transactional fact tables exist.
  • The second step determines what each row in each transactional fact table represents.
  • The third step determines the dimension foreign keys of each transactional fact table.
  • The fourth step determines the measure fields of each transactional fact table.
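The granularity statement can be turned into a simple check: if one row represents one commodity item in one order, no two rows may share the same order and item. The sketch below assumes a hypothetical order-detail fact table held as a list of dicts. The column names are illustrative, with `order_id` and `sku_id` forming the grain, `user_id` and `date_id` standing in for dimension foreign keys, and `sku_num` and `amount` as measures.

```python
from collections import Counter

# Declared grain: one row = one commodity item (sku) in one order.
GRAIN = ("order_id", "sku_id")

def grain_violations(rows, grain=GRAIN):
    """Return the grain keys that appear on more than one row."""
    counts = Counter(tuple(row[c] for c in grain) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

order_detail_fact = [
    {"order_id": 1001, "sku_id": 1, "user_id": 7, "date_id": "2023-03-01", "sku_num": 2, "amount": 200},
    {"order_id": 1001, "sku_id": 2, "user_id": 7, "date_id": "2023-03-01", "sku_num": 1, "amount": 50},
    {"order_id": 1002, "sku_id": 1, "user_id": 8, "date_id": "2023-03-01", "sku_num": 1, "amount": 100},
]

clean = grain_violations(order_detail_fact)
# Loading the first row twice breaks the declared grain.
duplicated = grain_violations(order_detail_fact + [order_detail_fact[0]])
```

A check like this is cheap to run after each load and catches duplicated source rows before they inflate the measures.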

3.3.1.3. Shortcomings

The transactional fact table stores the finest-grained operation events of every business process, so in theory it can support statistics at any granularity for each business process. For some types of requirements, however, the logic becomes complicated or the efficiency is low.
Two such cases follow.

3.3.1.3.1. Stock indicators

Examples include commodity inventory and account balances. Take the virtual currency of an e-commerce platform as an example. The virtual currency business mainly comprises two business processes: acquiring currency and using currency. Each corresponds to a transactional fact table: one stores all atomic events of acquiring currency, and the other stores all atomic events of using currency.

Suppose we need the virtual currency balance of each user as of the current day. Because both acquiring and using currency affect the balance, both transactional fact tables must be aggregated, the sign of each effect (plus or minus) must be distinguished, and the full history of both tables must be scanned to get the result.

This is not a good solution in terms of either logic or efficiency.
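The cost described above is visible in a small sketch. The two lists stand in for the two transactional fact tables. Their names and columns are hypothetical, and dates are ISO strings so that string comparison matches date order.

```python
from collections import defaultdict

coin_gain_fact = [
    {"user_id": 1, "date": "2023-03-01", "coins": 100},
    {"user_id": 2, "date": "2023-03-02", "coins": 30},
    {"user_id": 1, "date": "2023-03-03", "coins": 50},
]
coin_use_fact = [
    {"user_id": 1, "date": "2023-03-02", "coins": 40},
    {"user_id": 2, "date": "2023-03-04", "coins": 10},
]

def balance_as_of(gains, uses, as_of):
    """Scan the full history of both tables; gains add, uses subtract."""
    balance = defaultdict(int)
    for row in gains:
        if row["date"] <= as_of:
            balance[row["user_id"]] += row["coins"]
    for row in uses:
        if row["date"] <= as_of:
            balance[row["user_id"]] -= row["coins"]
    return dict(balance)

balances_mar_03 = balance_as_of(coin_gain_fact, coin_use_fact, "2023-03-03")
balances_mar_04 = balance_as_of(coin_gain_fact, coin_use_fact, "2023-03-04")
```

Every call rereads all history from both tables, so the work grows with the total number of events ever recorded rather than with the number of users.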

3.3.1.3.2. Statistics across multiple transactions

For example, suppose we need the average time interval from order placement to payment over the last 30 days. The approach is to take the order fact table and the payment fact table, filter each to the last 30 days, join them on order id, subtract the order time from the payment time, and then compute the average.

Although the logic is not complicated, it is inefficient. Both the order fact table and the payment fact table are large tables, and joining two large tables should be avoided whenever possible.
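The join-based approach looks roughly like the following sketch. The two lists stand in for the order and payment fact tables, and the column names and the fixed `today` value are assumptions made so the example is reproducible.

```python
from datetime import datetime, timedelta

order_fact = [
    {"order_id": 1, "order_time": datetime(2023, 3, 10, 10, 0)},
    {"order_id": 2, "order_time": datetime(2023, 3, 20, 12, 0)},
    {"order_id": 3, "order_time": datetime(2023, 1, 1, 0, 0)},   # outside the window
]
payment_fact = [
    {"order_id": 1, "payment_time": datetime(2023, 3, 10, 10, 10)},
    {"order_id": 2, "payment_time": datetime(2023, 3, 20, 12, 30)},
    {"order_id": 3, "payment_time": datetime(2023, 1, 1, 0, 5)},
]

def avg_order_to_pay_seconds(orders, payments, today, days=30):
    """Filter both tables to the window, join on order_id, average the gaps."""
    cutoff = today - timedelta(days=days)
    order_time = {o["order_id"]: o["order_time"]
                  for o in orders if o["order_time"] >= cutoff}
    gaps = [
        (p["payment_time"] - order_time[p["order_id"]]).total_seconds()
        for p in payments
        if p["payment_time"] >= cutoff and p["order_id"] in order_time
    ]
    return sum(gaps) / len(gaps) if gaps else None

avg_gap = avg_order_to_pay_seconds(order_fact, payment_fact, datetime(2023, 3, 31))
```

The dictionary lookup plays the role of the join. On real tables this step is the expensive part, because both inputs are large.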

The transactional fact table is therefore not ideal in these two scenarios. The other two types of fact tables introduced below make up for these shortcomings.

3.3.2. Periodic snapshot fact table

3.3.2.1. Overview

The periodic snapshot fact table records facts at regular and predictable time intervals, and is mainly used to analyze some stock-type (such as commodity inventory, account balance) or state-type (air temperature, driving speed) indicators.

For stock indicators such as commodity inventory and account balance, the business system usually already calculates and stores the latest value. A full copy of that data can therefore be synchronized to the data warehouse on a regular schedule to build a periodic snapshot fact table, which meets such statistical needs easily without aggregating a large volume of historical records from transactional fact tables.

For state indicators such as air temperature and driving speed, the values are continuous, so there are no atomic transaction events to capture and a transactional fact table cannot serve such requirements. The values can only be sampled periodically to build a periodic snapshot fact table.

3.3.2.2. Design process

3.3.2.2.1. Determine the granularity

The granularity of a periodic snapshot fact table is described by its sampling period and its dimensions, so it is fixed once those two are decided.

The sampling period is usually one day.

Dimensions can be determined based on statistical indicators. For example, if the indicator is to count the inventory of each commodity in each warehouse, then the dimensions can be determined as warehouses and commodities.

After the sampling period and dimensions are determined, the granularity of the table can be determined as daily-warehouse-commodity.

3.3.2.2.2. Determine the facts

Facts can also be determined based on statistical indicators. For example, if the indicator is to count the inventory of each commodity in each warehouse, then the fact is the inventory of the commodity.

3.3.3. Fact type

The fact type here refers to the type of the measure, not the type of the fact table. Facts (measures) fall into three categories: additive facts, semi-additive facts, and non-additive facts.

  1. Additive facts
    Additive facts can be summed across all dimensions related to the fact table, such as the facts in transactional fact tables.
  2. Semi-additive facts
    Semi-additive facts can be summed across only some of the dimensions related to the fact table, such as the facts in periodic snapshot fact tables. Take a daily snapshot fact table of the inventory of each product in each warehouse as an example. The inventory fact can be summed across the warehouse or product dimension, but not across the time dimension, because adding up the inventory of different days is meaningless.
  3. Non-additive facts
    Non-additive facts cannot be summed across any dimension, such as ratios. Non-additive facts are often converted into additive facts. For example, a ratio can be stored as its numerator and denominator.
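The difference between the fact types can be illustrated with two small tables. Both are hypothetical: a daily inventory snapshot at the daily-warehouse-commodity grain, and a daily table that stores a conversion rate as its numerator (`orders`) and denominator (`visits`).

```python
from collections import defaultdict

inventory_snapshot = [
    {"date": "2023-03-01", "warehouse": "W1", "sku": "A", "stock": 10},
    {"date": "2023-03-01", "warehouse": "W2", "sku": "A", "stock": 5},
    {"date": "2023-03-02", "warehouse": "W1", "sku": "A", "stock": 8},
    {"date": "2023-03-02", "warehouse": "W2", "sku": "A", "stock": 7},
]

def stock_by_sku(rows, date):
    """Semi-additive: sum across warehouses, but only within one snapshot date."""
    totals = defaultdict(int)
    for row in rows:
        if row["date"] == date:
            totals[row["sku"]] += row["stock"]
    return dict(totals)

stock_mar_02 = stock_by_sku(inventory_snapshot, "2023-03-02")
# Summing over dates as well would give 30, which is not any real stock level.
meaningless_total = sum(row["stock"] for row in inventory_snapshot)

# Non-additive: keep numerator and denominator, derive the ratio at query time.
daily_conversion = [
    {"date": "2023-03-01", "orders": 20, "visits": 100},
    {"date": "2023-03-02", "orders": 30, "visits": 300},
]
overall_rate = (sum(d["orders"] for d in daily_conversion)
                / sum(d["visits"] for d in daily_conversion))
mean_of_daily_rates = (sum(d["orders"] / d["visits"] for d in daily_conversion)
                       / len(daily_conversion))
```

The overall rate computed from the stored numerator and denominator differs from the average of the daily rates, which is why the ratio itself should not be stored and then aggregated.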

3.3.4. Cumulative snapshot fact table

3.3.4.1. Overview

The cumulative snapshot fact table is built on several key business processes within one business workflow, such as placing an order, paying, shipping, and confirming receipt in a transaction workflow.

Cumulative snapshot fact tables typically have multiple date fields, one for each key business process (milestone) in the workflow.

[Figure: cumulative snapshot fact table with one date field per milestone]

Cumulative snapshot fact tables are mainly used to analyze time intervals between business processes (milestones). For example, the average interval from order placement to payment mentioned above can be computed from a cumulative snapshot fact table without joining two transactional fact tables, which makes it simple and efficient.
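A minimal sketch of the same calculation on a cumulative snapshot fact table follows. The table layout is an assumption: one row per order, one timestamp column per milestone, and `None` for milestones that have not happened yet.

```python
from datetime import datetime

trade_snapshot = [
    {"order_id": 1, "order_time": datetime(2023, 3, 10, 10, 0),
     "payment_time": datetime(2023, 3, 10, 10, 10), "ship_time": None, "receive_time": None},
    {"order_id": 2, "order_time": datetime(2023, 3, 20, 12, 0),
     "payment_time": datetime(2023, 3, 20, 12, 30), "ship_time": None, "receive_time": None},
    {"order_id": 4, "order_time": datetime(2023, 3, 25, 9, 0),
     "payment_time": None, "ship_time": None, "receive_time": None},  # not paid yet
]

def avg_milestone_gap_seconds(rows, start, end, since):
    """Average time between two milestone columns in a single pass, no join."""
    gaps = [
        (row[end] - row[start]).total_seconds()
        for row in rows
        if row[start] is not None and row[end] is not None and row[start] >= since
    ]
    return sum(gaps) / len(gaps) if gaps else None

avg_gap_snapshot = avg_milestone_gap_seconds(
    trade_snapshot, "order_time", "payment_time", since=datetime(2023, 3, 1)
)
```

Because both timestamps live on the same row, the interval needs only one scan of one table.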

3.3.4.2. Design process

The design process of the cumulative snapshot fact table is similar to that of the transactional fact table and follows the same four steps. The description below focuses on the differences.

Select business process → Declare granularity → Determine dimensions → Determine facts

  1. Select business process
    Select the key business processes within one workflow that need to be analyzed together. Multiple business processes correspond to one cumulative snapshot fact table.
  2. Declare granularity
    Define precisely what each row represents, and choose the finest granularity possible.
  3. Determine dimensions
    Select the dimensions related to each business process. Note that each business process needs its own date dimension.
  4. Determine facts
    Select the measures of each business process.

4. Dimension tables in dimensional modeling theory

4.1. Overview of dimension tables

Dimension tables are the foundation and soul of dimensional modeling. Fact tables are designed around business processes, while dimension tables are designed around the context in which business processes occur.

A dimension table mainly consists of a primary key and a set of dimension fields, which are called dimension attributes.

4.2. Dimension table design steps

4.2.1. Determine the dimension (table)

The dimensions related to each fact table were already identified when the fact tables were designed. In theory, each of these dimensions needs a corresponding dimension table.

Note that several fact tables may be related to the same dimension. In that case the dimension must remain unique, that is, only one dimension table should be created for it.

In addition, if a dimension has very few attributes, such as only a name, the dimension table can be omitted and its attributes added directly to the related fact table. This is called a degenerate dimension.

4.2.2. Determine the main dimension table and related dimension tables

Both the main dimension table and the related dimension tables here refer to tables in the business system that are related to a given dimension. For example, the commodity-related tables in the business system include sku_info, spu_info, base_trademark, base_category3, base_category2, and base_category1. Here sku_info is the main dimension table of the commodity dimension, and the others are its related dimension tables. The dimension table in the warehouse usually has the same granularity as the main dimension table.

4.2.3. Determine dimension attributes

Determining dimension attributes means determining dimension table fields. Dimension attributes mainly come from the main dimension table and related dimension tables corresponding to the dimension in the business system. Dimension attributes can be directly selected from the main dimension table or related dimension tables, or can be obtained through further processing.

When determining dimension attributes, the following requirements need to be followed:

  1. Generate as many dimension attributes as possible.
    Dimension attributes are the main source of query filters and grouping fields in later analysis, and are key to how usable the data is. The richness of dimension attributes directly affects the richness of the indicators the data model can support.

  2. Prefer clear text descriptions over codes. Codes and text can usually coexist.

  3. Consolidate commonly used dimension attributes wherever possible.
    Some dimension attributes require more complex processing to obtain, for example concatenating several fields. To avoid repeating that processing on every use, store these attributes in the dimension table.

4.3. Key points of dimension design

4.3.1. Normalization and denormalization

Normalization is the process of designing a database according to a series of normal forms in order to reduce data redundancy and improve data consistency. After normalization, the fields of one table are usually split across several tables.

Denormalization merges data from several tables redundantly into one table in order to reduce join operations and improve query performance.

If dimension tables are normalized, the resulting dimensional model is called a snowflake schema. If they are denormalized, the resulting model is called a star schema. The main purpose of a data warehouse is data analysis and statistics, so how convenient the model is for users doing that analysis determines its quality.

[Figure: snowflake schema compared with star schema]

  • With the snowflake schema, users need many joins during analysis, which means high complexity and poor query performance.
  • The star schema is convenient, easy to use, and performs well.

Therefore, for ease of use and performance, dimension tables are generally not normalized, and the star schema is usually adopted.
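Denormalizing a dimension amounts to resolving the chain of lookups once, when the dimension table is built, instead of in every query. The sketch below flattens the commodity tables named earlier (sku_info, spu_info, base_trademark, base_category3, base_category2, base_category1) into one wide row per sku. The table names come from the text, but every column name and value here is an assumption.

```python
# Main dimension table: one row per sku, holding foreign keys to related tables.
sku_info = [
    {"id": 1, "sku_name": "Phone X 128G", "spu_id": 10, "tm_id": 100, "category3_id": 1000},
]
# Related dimension tables, keyed by id (snowflake layout).
spu_info = {10: {"spu_name": "Phone X"}}
base_trademark = {100: {"tm_name": "BrandA"}}
base_category3 = {1000: {"name": "Mobile phones", "category2_id": 200}}
base_category2 = {200: {"name": "Phones and communication", "category1_id": 20}}
base_category1 = {20: {"name": "Electronics"}}

def build_dim_sku():
    """Flatten the snowflake into one wide row per sku (star layout)."""
    dim = []
    for sku in sku_info:
        c3 = base_category3[sku["category3_id"]]
        c2 = base_category2[c3["category2_id"]]
        c1 = base_category1[c2["category1_id"]]
        dim.append({
            "sku_id": sku["id"],
            "sku_name": sku["sku_name"],
            "spu_name": spu_info[sku["spu_id"]]["spu_name"],
            "tm_name": base_trademark[sku["tm_id"]]["tm_name"],
            "category3_name": c3["name"],
            "category2_name": c2["name"],
            "category1_name": c1["name"],
        })
    return dim

dim_sku = build_dim_sku()
```

The resulting table keeps the granularity of sku_info and repeats category and brand names on every row, trading storage for the removal of five joins per query.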

4.3.2. Dimension change

Dimension attributes are usually not static but change over time. An important feature of a data warehouse is that it reflects historical changes, so preserving the historical state of dimensions is one of the key tasks of dimension design. There are usually two ways to do this: the full snapshot table and the zipper table.

4.3.2.1. Full snapshot table

An offline data warehouse usually runs its calculations once a day, so a full copy of the dimension data can be saved every day. The advantages and disadvantages of this approach are clear.

  • Advantages: simple and effective, low development and maintenance cost, and easy to understand and use.
  • Disadvantage: it wastes storage space, especially when only a small share of the data changes.

4.3.2.2. Zipper table

The value of the zipper table is that it stores the historical state of dimension data more efficiently.

  1. What is a zipper table
    [Figure: what a zipper table is]

  2. Why use a zipper table
    [Figure: why a zipper table is used]

  3. How to use a zipper table
    [Figure: how to use a zipper table]
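Since the figures are not available, the following is only a sketch of one common way a zipper table is laid out and maintained, not necessarily the layout the original figures showed. It assumes each row carries a `start_date` and an `end_date`, that the currently valid row uses the sentinel end date `9999-12-31`, and that each day's load contains at most one change per key. The `user_id` and `phone` columns are hypothetical.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # assumed sentinel marking the currently valid row

def apply_daily_changes(zipper, changes, day):
    """Close the open row of each changed key, then append the new version."""
    previous_day = (date.fromisoformat(day) - timedelta(days=1)).isoformat()
    changed_ids = {c["user_id"] for c in changes}
    result = []
    for row in zipper:
        if row["end_date"] == OPEN_END and row["user_id"] in changed_ids:
            result.append({**row, "end_date": previous_day})
        else:
            result.append(dict(row))
    for c in changes:
        result.append({**c, "start_date": day, "end_date": OPEN_END})
    return result

def state_as_of(zipper, day):
    """Rows valid on the given day: start_date <= day <= end_date."""
    return [r for r in zipper if r["start_date"] <= day <= r["end_date"]]

zipper = [
    {"user_id": 1, "phone": "111", "start_date": "2023-03-01", "end_date": OPEN_END},
    {"user_id": 2, "phone": "222", "start_date": "2023-03-01", "end_date": OPEN_END},
]
# On 2023-03-05, user 1 changes phone number.
zipper = apply_daily_changes(zipper, [{"user_id": 1, "phone": "119"}], "2023-03-05")

phones_mar_04 = {r["user_id"]: r["phone"] for r in state_as_of(zipper, "2023-03-04")}
phones_mar_05 = {r["user_id"]: r["phone"] for r in state_as_of(zipper, "2023-03-05")}
```

Only changed keys add rows, so the table grows with the number of changes rather than with the number of days, which is the storage advantage over a daily full snapshot.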

4.3.3. Multi-valued dimensions

If one record in the fact table corresponds to multiple records in a dimension table, the dimension is called a multi-valued dimension. For example, if one record in the order fact table is an order, and an order can contain several items, then one fact record may correspond to several rows in the item dimension table.

Two solutions are usually adopted.

  • The first: lower the granularity of the fact table, for example from one order to one commodity item in one order.
  • The second: use multiple fields in the fact table to hold the dimension values, one dimension id per field. This only applies when the number of values is fixed.

The first solution is recommended wherever possible.

4.3.4. Multi-valued attributes

When an attribute in a dimension table holds several values at once, it is called a multi-valued attribute. Examples are the platform attributes and sales attributes of the product dimension, where each product has several attribute values.

Two solutions are generally available.

  • The first: put the multi-valued attribute into one field in the form key1:value1, key2:value2. For example, the platform attribute value of a mobile phone product is "brand: Huawei, system: Hongmeng, CPU: Kirin 990".
  • The second: put the values into multiple fields, one field per attribute. This only applies when the number of attributes is fixed.
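For the first approach, readers of the dimension table need to split the field back into pairs. A minimal sketch, assuming that neither keys nor values contain a comma or a colon:

```python
def parse_attributes(text):
    """Split 'key1: value1, key2: value2' into a dict."""
    attrs = {}
    for pair in text.split(","):
        key, _, value = pair.partition(":")
        attrs[key.strip()] = value.strip()
    return attrs

platform_attrs = parse_attributes("brand: Huawei, system: Hongmeng, CPU: Kirin 990")
```

If values may contain the separator characters, a structured type or an escaped format would be safer than plain string splitting.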

5. Data Warehouse Design

5.1. Hierarchical Planning of Data Warehouse

A reliable data warehouse system needs a well-designed data hierarchy. Reasonable layering makes the data system clearer and breaks complex problems into simpler ones. The layer plan of this project is as follows.

[Figure: data warehouse layer plan]

5.2. Data warehouse construction process

[Figure: data warehouse construction process]

5.2.1. Data research

Data research consists of two tasks: business research and requirements analysis. How thoroughly they are done directly affects the quality of the data warehouse.

5.2.1.1. Business research

The main goal of business research is to become familiar with the business processes and the business data.

Becoming familiar with the business processes means clarifying the specific flow of each line of business and listing every business process it contains.

Becoming familiar with the business data means matching the data (including tracking logs and business data tables) to the business processes, and clarifying which tables each business process affects and how. The effect must be specific: whether a row is inserted or modified, and what the inserted content or the modification logic is.

The e-commerce transaction business is used as an example below. Its business processes include the buyer placing an order, the buyer paying, the seller shipping the goods, and the buyer receiving the goods. The specific flow is shown in the figure below.

[Figure: business processes of the e-commerce transaction flow]

5.2.1.2. Requirements analysis

A typical requirement is an indicator such as the total order amount for the mobile phone category in each province over the last day. When analyzing a requirement, clarify the business process and the dimensions it needs. Here, the business process is the buyer placing an order, and the dimensions are date, province, and product category.

5.2.1.3. Summary

After the business research and requirements analysis are complete, make sure every requirement maps to a business process and its dimensions.
If the existing data cannot meet a requirement, coordinate with the business side, for example to add a tracking event for a certain behavior on a page.

5.2.2. Define data domains

In addition to the horizontal layering, data warehouse model design usually also divides data vertically into data domains according to the business. The purpose of dividing data domains is to make data easier to manage and use.

Data domains are usually divided by business process or by department. This project divides them by business process. Note that a business process can belong to only one data domain.

The figure below lists all business processes required by this data warehouse project and their data domains.

[Figure: business processes and data domain division]

5.2.3. Build a business bus matrix

The business bus matrix contains all the facts (business processes) and dimensions required by the dimensional model, together with the relationship between each business process and each dimension. Each row of the matrix is a business process, each column is a dimension, and each cell indicates whether that business process is related to that dimension.

[Figure: structure of the business bus matrix]

A business process corresponds to a transactional fact table in the dimensional model, and a dimension corresponds to a dimension table. Building the business bus matrix is therefore the same activity as designing the dimensional model.

Note, however, that the bus matrix usually covers only transactional fact tables. The other two types of fact tables need to be designed separately.

Following the design process of the transactional fact table (select business process → declare granularity → determine dimensions → determine facts), the resulting business bus matrix is shown in the following table.

[Table: business bus matrix]
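A bus matrix is small enough to keep as plain data. The sketch below uses the business processes from the e-commerce example. The dimensions assigned to each process are hypothetical and are not taken from the original table.

```python
# Rows: business processes. Values: the dimensions each process relates to.
bus_matrix = {
    "place order": {"date", "user", "product", "province"},
    "pay": {"date", "user", "product", "province", "payment type"},
    "ship": {"date", "user", "product"},
}

def all_dimensions(matrix):
    """Columns of the matrix: every dimension used by at least one process."""
    return sorted(set().union(*matrix.values()))

def processes_using(matrix, dimension):
    """Business processes (fact tables) related to one dimension."""
    return [process for process, dims in matrix.items() if dimension in dims]

def render(matrix):
    """Render the matrix as text, marking related cells with 'x'."""
    dims = all_dimensions(matrix)
    lines = ["business process | " + " | ".join(dims)]
    for process, used in matrix.items():
        marks = ["x" if d in used else "-" for d in dims]
        lines.append(process + " | " + " | ".join(marks))
    return "\n".join(lines)

matrix_text = render(bus_matrix)
```

Each row maps to one transactional fact table and each column to one dimension table, so a dimension shared by several rows is created once and reused.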

5.2.4. Define statistical indicators

Defining statistical indicators means analyzing the requirements in depth and building an indicator system. The main purpose of an indicator system is to standardize indicator definitions. When all indicators follow the same set of standards, problems such as ambiguous or duplicated definitions are largely avoided.

5.2.4.1. Concepts related to the indicator system

5.2.4.1.1. Atomic indicators

An atomic indicator is based on a measure of a business process and cannot be broken down further in business terms. Its core role is to define the aggregation logic of the indicator. An atomic indicator therefore has three elements: business process, measure, and aggregation logic.

For example, total order amount is a typical atomic indicator: the business process is the user placing an order, the measure is the order amount, and the aggregation logic is sum(). Note that atomic indicators only help define the concept of an indicator and usually do not correspond to actual statistical requirements.

5.2.4.1.2. Derived indicators

Derived indicators are built on atomic indicators. Their relationship to atomic indicators is shown in the figure below.

[Figure: relationship between derived indicators and atomic indicators]

Unlike atomic indicators, derived indicators usually correspond to actual statistical requirements.

5.2.4.1.3. Derivative indicators

Derivative indicators are composed from one or more derived indicators through logical operations, for example ratios and proportions. Derivative indicators also correspond to actual statistical requirements.

[Figure: example of a derivative indicator]
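The three kinds of indicator can be sketched as data structures. The atomic indicator's three elements come from the text. The composition of a derived indicator as an atomic indicator plus a statistical period, a business qualifier, and a statistical granularity is an assumption here, since the original figure is missing. All class, field, and column names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class AtomicIndicator:
    business_process: str
    measure: str        # column holding the measure
    aggregation: str    # name of the aggregation logic, e.g. "sum"

@dataclass(frozen=True)
class DerivedIndicator:
    atomic: AtomicIndicator
    period: str                          # label, e.g. "last 1 day"
    qualifier: str                       # label, e.g. "category = phone"
    granularity: tuple                   # grouping columns
    row_filter: Callable[[dict], bool]   # implements period + qualifier

AGGREGATIONS = {"sum": sum, "count": len}

def evaluate(indicator, rows):
    """Filter rows, group by granularity, apply the atomic aggregation."""
    groups = defaultdict(list)
    for row in rows:
        if indicator.row_filter(row):
            key = tuple(row[c] for c in indicator.granularity)
            groups[key].append(row[indicator.atomic.measure])
    aggregate = AGGREGATIONS[indicator.atomic.aggregation]
    return {key: aggregate(values) for key, values in groups.items()}

order_amount = AtomicIndicator("place order", "order_amount", "sum")
phone_amount_1d_by_province = DerivedIndicator(
    atomic=order_amount,
    period="last 1 day",
    qualifier="category = phone",
    granularity=("province",),
    row_filter=lambda r: r["date"] == "2023-03-01" and r["category"] == "phone",
)

order_rows = [
    {"date": "2023-03-01", "province": "Zhejiang", "category": "phone", "order_amount": 3000},
    {"date": "2023-03-01", "province": "Zhejiang", "category": "phone", "order_amount": 2000},
    {"date": "2023-03-01", "province": "Guangdong", "category": "phone", "order_amount": 4000},
    {"date": "2023-03-01", "province": "Zhejiang", "category": "book", "order_amount": 50},
    {"date": "2023-02-28", "province": "Zhejiang", "category": "phone", "order_amount": 999},
]

by_province = evaluate(phone_amount_1d_by_province, order_rows)
# A derivative indicator combines derived values, here as a share of the total.
zhejiang_share = by_province[("Zhejiang",)] / sum(by_province.values())
```

Keeping the atomic part separate means the aggregation logic is defined once, and each derived indicator only adds the period, qualifier, and granularity.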

5.2.4.2. The significance of the index system for data warehouse modeling

From the above two specific cases, it can be seen that the vast majority of statistical requirements can be defined with the standard set of atomic, derived, and derivative indicators. These requirements all correspond, directly or indirectly, to one or more derived indicators.

Once there are enough statistical requirements, some of them will inevitably correspond to the same derived indicators. In that case, these shared derived indicators can be stored, mainly to reduce repeated calculation and improve data reuse.

The shared derived indicators are stored in the DWS layer of the data warehouse. The design of the DWS layer can therefore follow the derived indicators sorted out from the existing statistical requirements.

5.2.5. Dimensional model design

The design of the dimensional model can refer to the business bus matrix obtained above. Fact tables are stored in the DWD layer, and dimension tables are stored in the DIM layer.

5.2.6. Summary Model Design

The design of the summary model can follow the indicator system (mainly the derived indicators) sorted out above. The mapping between summary tables and derived indicators is that one summary table usually contains multiple derived indicators that share the same business process, the same statistical period, and the same statistical granularity.
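This grouping rule can be sketched as follows. The indicator names and attribute values are hypothetical; the only rule taken from the text is that indicators sharing business process, period, and granularity go into the same summary table.

```python
from collections import defaultdict

derived_indicators = [
    {"name": "order_amount_1d", "business_process": "place order", "period": "1d", "granularity": "province"},
    {"name": "order_count_1d", "business_process": "place order", "period": "1d", "granularity": "province"},
    {"name": "order_amount_7d", "business_process": "place order", "period": "7d", "granularity": "province"},
    {"name": "payment_amount_1d", "business_process": "pay", "period": "1d", "granularity": "province"},
]

def plan_summary_tables(indicators):
    """Group derived indicators by (business process, period, granularity)."""
    tables = defaultdict(list)
    for ind in indicators:
        key = (ind["business_process"], ind["period"], ind["granularity"])
        tables[key].append(ind["name"])
    return dict(tables)

summary_tables = plan_summary_tables(derived_indicators)
```

Each key in the result corresponds to one candidate summary table, and the listed names become its indicator columns.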

Origin blog.csdn.net/prefect_start/article/details/129271521