Data warehouse series (1): What dimensional modeling is and its basic elements

I. Introduction

Anyone who studies data warehousing will come across two names: Bill Inmon, the father of the data warehouse, and Ralph Kimball, a leading authority on data warehousing.

The two DW architectures of Inmon and Kimball have supported the development of data warehousing and business intelligence for nearly two decades. Inmon advocates a top-down architecture: data from different OLTP systems is consolidated into a subject-oriented, integrated, non-volatile, and time-variant structure for later analysis. The data can be drilled down to the finest level of detail or rolled up to a summary level. A data mart should be a subset of the data warehouse, and each data mart is designed for a specific department.

Kimball takes the opposite approach. The Kimball architecture is bottom-up: it treats the data warehouse as a collection of data marts. An enterprise can build its data warehouse incrementally through a series of data marts that share the same dimensions. Because these consistent (conformed) dimensions are defined once and shared by all marts, information from different data marts can be viewed together.
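
The idea of viewing several data marts together through a shared dimension can be sketched in a few lines. This is a minimal illustration using Python's standard sqlite3 module, not Kimball's own example: the tables dim_date, fact_sales, and fact_returns, their columns, and the sample values are all hypothetical.

```python
import sqlite3

# Hypothetical setup: two data marts (sales and returns) that share one
# date dimension. All table names, column names, and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE fact_sales (
        date_id INTEGER REFERENCES dim_date(date_id), amount REAL);
    CREATE TABLE fact_returns (
        date_id INTEGER REFERENCES dim_date(date_id), amount REAL);

    INSERT INTO dim_date VALUES (1, '2020-07'), (2, '2020-08');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 80.0);
    INSERT INTO fact_returns VALUES (1, 20.0), (2, 5.0);
""")

# Aggregate each mart separately by the shared dimension attribute,
# then line the two results up on that attribute.
report = conn.execute("""
    SELECT s.month, s.sales_total, r.returns_total
    FROM (SELECT d.month AS month, SUM(f.amount) AS sales_total
          FROM fact_sales f JOIN dim_date d ON d.date_id = f.date_id
          GROUP BY d.month) s
    JOIN (SELECT d.month AS month, SUM(f.amount) AS returns_total
          FROM fact_returns f JOIN dim_date d ON d.date_id = f.date_id
          GROUP BY d.month) r
      ON r.month = s.month
    ORDER BY s.month
""").fetchall()

for month, sales_total, returns_total in report:
    print(month, sales_total, returns_total)
```

The two fact tables never reference each other. They can be compared only because both use the same dim_date keys and the same definition of month.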

This article mainly introduces the dimensional modeling method. The method was first proposed by Kimball, and its simplest description is: build data warehouses and data marts out of fact tables and dimension tables. In dimensional modeling, dimensions are the perspectives from which facts are described, such as date, customer, and supplier, and facts are the measures to be analyzed, such as customer count and sales amount. Most books go on to divide dimensional models into the star model, the snowflake model, and so on, each with its own advantages and disadvantages. What they rarely answer directly is this: why do data warehouses use dimensional modeling?

Figure: star model
Figure: snowflake model

A data warehouse covers a lot of ground, including architecture, modeling, and methodology. In day-to-day work, this translates into the following:

1. Data architecture: a data architecture centered on Hadoop and Spark.

2. Data modeling methods: such as dimensional modeling, normal-form modeling, and entity modeling.

3. Auxiliary systems: scheduling systems, metadata systems, ETL systems, visualization systems, and so on.

However broad the scope of the data warehouse, the data model holds an irreplaceable, central position in it. The rest of this article therefore focuses on a typical representative of data modeling, dimensional modeling, and analyzes its theory and practical use in depth.

To make dimensional modeling more concrete, a follow-up article will simulate an e-commerce scenario that everyone is familiar with and model it using the theory. There is always a gap between theory and real work, so I will also share the trade-offs that companies make in practice. For now, let's learn more about dimensional modeling.

II. What is dimensional modeling

The dimensional model is advocated by Ralph Kimball, a master of the data warehouse field. His book "The Data Warehouse Toolkit" is one of the most popular data warehouse modeling classics in data warehouse engineering. Dimensional modeling builds the model around the needs of analysis and decision-making, so the resulting data model serves analytical requirements. It focuses on letting users complete their analysis quickly while still giving good response times for large, complex queries.

Let us explain dimensional modeling another way. Readers who have studied databases will probably know the star model, which is the typical dimensional model. In dimensional modeling we build a fact table, which sits at the center of the star, surrounded by a number of dimension tables that radiate outward like the points of the star. What fact tables and dimension tables are is explained below.

Figure: star model

III. The basic elements of dimensional modeling

Dimensional modeling has a few key concepts. Once you understand them, you basically understand what dimensional modeling is.

3.1 Fact table

The measurable values produced by operational events in the real world are stored in the fact table. At the lowest level of granularity, one fact table row corresponds to one measurement event, and vice versa. The definition is abstract, so take an example: a single purchase can be understood as a fact. Let's look at an example of the star model.

Figure: star model example
The order table (ICstockbill) in the figure is a fact table. It records operational events that happen in reality: every time an order is completed, one record is added to the order table. Its structure shows the characteristic of a fact table: apart from its measures, it stores little descriptive content and is mostly a collection of IDs. Each ID is a foreign key that corresponds to a record in a dimension table.
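
A fact table of this shape can be sketched as follows. The table names ICstockbill, customer, goods, and d_time come from the figure, but every column name, the helper record_order, and the sample values are assumptions made for illustration, since the figure's actual columns are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (column names are hypothetical).
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE goods (goods_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE d_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

    -- Fact table: keys pointing at the dimensions, plus numeric measures.
    CREATE TABLE ICstockbill (
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        goods_id    INTEGER NOT NULL REFERENCES goods(goods_id),
        time_id     INTEGER NOT NULL REFERENCES d_time(time_id),
        quantity    INTEGER NOT NULL,
        amount      REAL NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Alice", "Beijing"), (2, "Bob", "Shanghai")],
)
conn.execute("INSERT INTO goods VALUES (10, 'Keyboard', 'Electronics')")
conn.execute("INSERT INTO d_time VALUES (20200801, '2020-08-01', '2020-08')")


def record_order(customer_id, goods_id, time_id, quantity, amount):
    """Hypothetical helper: one completed order adds exactly one fact row."""
    conn.execute(
        "INSERT INTO ICstockbill VALUES (?, ?, ?, ?, ?)",
        (customer_id, goods_id, time_id, quantity, amount),
    )


record_order(1, 10, 20200801, 1, 200.0)
record_order(2, 10, 20200801, 2, 400.0)
record_order(1, 10, 20200801, 1, 200.0)

# Three order events produce three fact rows.
row_count = conn.execute("SELECT COUNT(*) FROM ICstockbill").fetchone()[0]

# Column 2 of each foreign_key_list row is the referenced (dimension) table.
referenced_tables = {
    row[2] for row in conn.execute("PRAGMA foreign_key_list(ICstockbill)")
}
```

The fact row holds no names, cities, or categories. Those stay in the dimension tables and are reached through the keys.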

3.2 Dimension table

Each dimension table has a single primary key column. That primary key can serve as a foreign key in any fact table associated with the dimension. The descriptive context of a dimension table row should, of course, correspond exactly to the fact table rows that reference it. Dimension tables are usually wide: flat, denormalized tables containing a large number of low-cardinality text attributes. The customer (customer table), goods (commodity table), and d_time (time table) tables in the figure are all dimension tables. Each has a unique primary key, and the detailed descriptive data is stored in the table itself.
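
To show how the keys are used at query time, here is a small sketch that joins the fact table to two dimension tables and groups by their text attributes. As before, only the table names come from the figure. The columns and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names and rows are hypothetical.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE goods (goods_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE ICstockbill (
        customer_id INTEGER REFERENCES customer(customer_id),
        goods_id    INTEGER REFERENCES goods(goods_id),
        quantity    INTEGER,
        amount      REAL
    );

    INSERT INTO customer VALUES (1, 'Alice', 'Beijing'), (2, 'Bob', 'Shanghai');
    INSERT INTO goods VALUES (10, 'Keyboard', 'Electronics'),
                             (11, 'Notebook', 'Stationery');
    INSERT INTO ICstockbill VALUES (1, 10, 1, 200.0),
                                   (1, 11, 3, 15.0),
                                   (2, 10, 2, 400.0);
""")

# Resolve each foreign key to its dimension row, then slice the measure
# (amount) by descriptive attributes taken from the dimensions.
sales_by_city_and_category = conn.execute("""
    SELECT c.city, g.category, SUM(f.amount)
    FROM ICstockbill f
    JOIN customer c ON c.customer_id = f.customer_id
    JOIN goods g ON g.goods_id = f.goods_id
    GROUP BY c.city, g.category
    ORDER BY c.city, g.category
""").fetchall()
```

Every attribute used for grouping comes from a dimension table, and the only value aggregated comes from the fact table.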

Finally, the advantages and disadvantages of the dimensional model. Items 1 to 3 are advantages; items 4 and 5 are disadvantages:

1. Low data redundancy (much of the descriptive information is stored in the corresponding dimension table; for example, there is only one copy of each customer's information).

2. Clear structure (the table structure can be understood at a glance).

3. Convenient for OLAP analysis.

4. Higher cost of use: for example, queries have to join multiple tables.

5. Possible data inconsistency: for example, the data as it was when a user placed an order may differ from what is now stored in the dimension table.
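
The inconsistency in item 5 is easy to reproduce. In this hypothetical sketch, a customer row is overwritten after an order has been placed, and the old order then appears to belong to the new city. The columns, the order_id key, and the helper city_of_order are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names and rows are hypothetical.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE ICstockbill (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Alice', 'Beijing');
    INSERT INTO ICstockbill VALUES (100, 1, 200.0);
""")


def city_of_order(order_id):
    """Look up the customer's city for an order through the dimension table."""
    row = conn.execute(
        """
        SELECT c.city
        FROM ICstockbill f
        JOIN customer c ON c.customer_id = f.customer_id
        WHERE f.order_id = ?
        """,
        (order_id,),
    ).fetchone()
    return row[0]


city_at_purchase = city_of_order(100)

# The customer later moves, and the dimension row is overwritten in place.
conn.execute("UPDATE customer SET city = 'Shanghai' WHERE customer_id = 1")

# The fact row is unchanged, but the join now reports the new city.
city_after_update = city_of_order(100)
```

The fact row only stores the key, so anything read through the join reflects the dimension's current state rather than its state at purchase time.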

Now the advantages and disadvantages of a wide table used on its own, without a data warehouse. Items 1 and 2 are advantages; item 3 is the disadvantage:

1. Intuitive for the business. This kind of table is especially convenient in day-to-day business work because it maps directly onto the business.

2. Easy to use: SQL against it is very convenient to write.

3. Huge data redundancy. With hundreds of millions of users, the volume of order data becomes enormous. The granularity is also rigid, so the table copes badly with change, and its reusability is very low.
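
The trade-off in items 2 and 3 can be made concrete with a small sketch in plain Python. The data and field names are hypothetical. The same orders are laid out once as a wide table and once as a fact table plus a customer dimension.

```python
# Hypothetical data: 2 customers and 5 orders.
customers = {
    1: {"name": "Alice", "city": "Beijing"},
    2: {"name": "Bob", "city": "Shanghai"},
}
orders = [  # (order_id, customer_id, amount)
    (100, 1, 200.0),
    (101, 1, 15.0),
    (102, 1, 30.0),
    (103, 2, 400.0),
    (104, 2, 60.0),
]

# Wide table: every order row carries a full copy of the customer attributes.
wide_rows = [
    {"order_id": order_id, "amount": amount, **customers[customer_id]}
    for order_id, customer_id, amount in orders
]

# Star layout: the order row carries only the key; the attributes are
# stored once, in the customer dimension.
fact_rows = [
    {"order_id": order_id, "customer_id": customer_id, "amount": amount}
    for order_id, customer_id, amount in orders
]

# Redundancy: how many times is Alice's information stored in each layout?
copies_in_wide = sum(1 for row in wide_rows if row["name"] == "Alice")
copies_in_star = sum(1 for c in customers.values() if c["name"] == "Alice")

# Convenience: the wide table answers "amount by city" with no lookup or join.
total_by_city = {}
for row in wide_rows:
    total_by_city[row["city"]] = total_by_city.get(row["city"], 0.0) + row["amount"]
```

In the wide layout the number of copies grows with the number of orders, while in the star layout it stays at one per customer.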

Origin blog.csdn.net/BeiisBei/article/details/108229292