Comprehensive interpretation of Hive data warehouse modeling

Comprehensive interpretation of star models, snowflake models and constellation models

Background

In business intelligence solutions for multidimensional analysis, the common models can be divided, according to the relationship between fact tables and dimension tables, into the star model, the snowflake model and the constellation model. When designing a logical data model, you should decide which of these three models the data will be organized by.

1. Star model

In the star model, there is one fact table and one or more dimension tables. The fact table and each dimension table are related through a primary key and foreign key, and there are no relationships between the dimension tables themselves. When all the dimension tables are connected directly to the fact table, the whole diagram looks like a star, which is where the model gets its name. The star model is the simplest and most commonly used model. Since it has only one large table (the fact table), it is better suited to big data processing than the other models, and the other models can be transformed into star models.
The star schema is a denormalized structure. Each dimension of the cube is connected directly to the fact table, and no dimension is broken down into further sub-dimension tables, so the data has some redundancy. For example, suppose the regional dimension table holds two records for country A: city C in province B, and city D in province B. The information for country A and province B is then stored twice, once in each record. That is the redundancy.
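The redundancy example above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for Hive, and the table and column names (dim_region, fact_sales, amount) and the sample rows are hypothetical, not taken from any real warehouse.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical denormalized regional dimension: one row per city.
CREATE TABLE dim_region (
    region_id INTEGER PRIMARY KEY,
    country   TEXT,
    province  TEXT,
    city      TEXT
);
-- Hypothetical fact table pointing directly at the dimension.
CREATE TABLE fact_sales (
    sale_id   INTEGER PRIMARY KEY,
    region_id INTEGER REFERENCES dim_region(region_id),
    amount    REAL
);
INSERT INTO dim_region VALUES (1, 'A', 'B', 'C'), (2, 'A', 'B', 'D');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 2, 250.0), (3, 1, 50.0);
""")

# Country A and province B appear once per city row: this is the redundancy.
star_country_rows = conn.execute(
    "SELECT COUNT(*) FROM dim_region WHERE country = 'A' AND province = 'B'"
).fetchone()[0]

# A single join from the fact table is enough to aggregate by province.
star_totals = conn.execute("""
    SELECT d.province, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_region d ON f.region_id = d.region_id
    GROUP BY d.province
""").fetchall()

print(star_country_rows)  # 2
print(star_totals)        # [('B', 400.0)]
```

The pair (A, B) is stored twice, but any level of the region hierarchy can be reached from the fact table with one join.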

2. Snowflake model

When one or more dimension tables are not connected directly to the fact table, but reach it through other dimension tables, the diagram looks like several snowflakes joined together, so it is called the snowflake model. The snowflake model is an extension of the star model that further splits the star model's dimension tables into levels. An original dimension table may be expanded into several smaller dimension tables, forming local "hierarchical" areas, and these decomposed tables are connected to the main dimension table instead of to the fact table. For example, the regional dimension table can be broken down into country, province, and city dimension tables. Its advantages are that removing data redundancy minimizes storage, and that joins against the smaller dimension tables can improve the performance of some queries.
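The country, province and city decomposition can be sketched as follows. As before, this is a minimal illustration: sqlite3 stands in for Hive, and every table name, column name and sample row is an assumption made for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized region hierarchy: each level is its own table.
CREATE TABLE dim_country  (country_id  INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE dim_province (province_id INTEGER PRIMARY KEY, province_name TEXT,
                           country_id  INTEGER REFERENCES dim_country(country_id));
CREATE TABLE dim_city     (city_id     INTEGER PRIMARY KEY, city_name TEXT,
                           province_id INTEGER REFERENCES dim_province(province_id));
-- Only the lowest level (city) is connected to the fact table.
CREATE TABLE fact_sales   (sale_id INTEGER PRIMARY KEY,
                           city_id INTEGER REFERENCES dim_city(city_id),
                           amount  REAL);
INSERT INTO dim_country  VALUES (1, 'A');
INSERT INTO dim_province VALUES (1, 'B', 1);
INSERT INTO dim_city     VALUES (1, 'C', 1), (2, 'D', 1);
INSERT INTO fact_sales   VALUES (1, 1, 100.0), (2, 2, 250.0), (3, 1, 50.0);
""")

# Country A is now stored exactly once, so the redundancy is gone.
country_rows = conn.execute("SELECT COUNT(*) FROM dim_country").fetchone()[0]

# Reaching the country level from the fact table takes three joins.
snowflake_totals = conn.execute("""
    SELECT co.country_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_city     c  ON f.city_id     = c.city_id
    JOIN dim_province p  ON c.province_id = p.province_id
    JOIN dim_country  co ON p.country_id  = co.country_id
    GROUP BY co.country_name
""").fetchall()

print(country_rows)      # 1
print(snowflake_totals)  # [('A', 400.0)]
```

Each name is stored once, at the cost of a longer join chain for queries that aggregate at the upper levels of the hierarchy.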

3. Constellation model

The constellation model is also an extension of the star model. The star model is based on one fact table, whereas the constellation model is based on multiple fact tables that share dimension table information. This model is often applied when the data relationships are more complicated than the star and snowflake models can express. Because the constellation model requires multiple fact tables to share dimension tables, it can be regarded as a collection of star models, so it is also called the galaxy model.
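A small sketch of two fact tables sharing one dimension table is shown below. The names fact_sales, fact_inventory and dim_product, and all sample rows, are hypothetical, and sqlite3 is again only a stand-in for Hive.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared dimension table (hypothetical).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
-- Two fact tables that both reference the shared dimension.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER
);
CREATE TABLE fact_inventory (
    snapshot_id INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product(product_id),
    on_hand     INTEGER
);
INSERT INTO dim_product    VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales     VALUES (1, 1, 3), (2, 1, 2), (3, 2, 7);
INSERT INTO fact_inventory VALUES (1, 1, 40), (2, 2, 15);
""")

# Aggregate each fact table on its own, then line the results up through
# the shared dimension. Joining the raw fact tables to each other directly
# would multiply rows and inflate the sums.
report = conn.execute("""
    SELECT d.product_name, s.sold, i.on_hand
    FROM dim_product d
    JOIN (SELECT product_id, SUM(quantity) AS sold
          FROM fact_sales GROUP BY product_id) s
      ON s.product_id = d.product_id
    JOIN (SELECT product_id, SUM(on_hand) AS on_hand
          FROM fact_inventory GROUP BY product_id) i
      ON i.product_id = d.product_id
    ORDER BY d.product_name
""").fetchall()

print(report)  # [('gadget', 7, 15), ('widget', 5, 40)]
```

Each fact table together with dim_product is a star on its own; the shared dimension is what lets the two be reported side by side.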

4. Comparison


  • Because of its data redundancy, the star model needs no additional joins between dimension tables for many statistical queries, so it is generally more efficient than the snowflake model.
  • The star structure does not have to take many normalization factors into account, so its design and implementation are relatively simple.
  • Since the snowflake model removes redundancy, some statistics have to be produced through table joins, so its query efficiency is relatively low.
  • Normalization is itself a relatively complicated process, and the corresponding database structure design, data ETL, and later maintenance are all more complicated as well.

5. Summary

The comparison shows that, most of the time, the star model is the better choice for building the underlying Hive tables of a data warehouse. It accepts a large amount of redundancy in order to reduce the number of table joins and so improve query efficiency. The star model is also relatively friendly to OLAP analysis engines, which is especially evident in Kylin. The snowflake model is very common in relational databases such as MySQL and Oracle, especially in e-commerce database tables. In the data warehouse, application scenarios for the snowflake model and the constellation model are relatively few, but they do exist. In a specific design, therefore, you can consider whether the advantages of the different models can be combined to optimize the result.
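The trade of redundancy for fewer joins can be sketched as a one-time flattening step that turns normalized dimension tables into a single star-style dimension. This is an assumption about how such a conversion might look, not a procedure given in the article: sqlite3 stands in for Hive, and all table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized (snowflake-style) source tables.
CREATE TABLE dim_country  (country_id  INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE dim_province (province_id INTEGER PRIMARY KEY, province_name TEXT,
                           country_id  INTEGER);
CREATE TABLE dim_city     (city_id     INTEGER PRIMARY KEY, city_name TEXT,
                           province_id INTEGER);
INSERT INTO dim_country  VALUES (1, 'A');
INSERT INTO dim_province VALUES (1, 'B', 1);
INSERT INTO dim_city     VALUES (1, 'C', 1), (2, 'D', 1);

-- Flatten once at load time: the joins are paid here instead of in
-- every later query against the warehouse.
CREATE TABLE dim_region AS
SELECT c.city_id        AS region_id,
       co.country_name  AS country,
       p.province_name  AS province,
       c.city_name      AS city
FROM dim_city c
JOIN dim_province p  ON c.province_id = p.province_id
JOIN dim_country  co ON p.country_id  = co.country_id;
""")

flattened = conn.execute(
    "SELECT region_id, country, province, city FROM dim_region ORDER BY region_id"
).fetchall()

print(flattened)  # [(1, 'A', 'B', 'C'), (2, 'A', 'B', 'D')]
```

The resulting dim_region repeats country A and province B on every city row, which is the redundancy the star model accepts in exchange for single-join queries.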


Origin blog.csdn.net/qq_42706464/article/details/108947460