Designing slowly changing dimension tables in a data warehouse
Slowly changing dimensions:
Dimension data changes over time, but relatively slowly. Dimensions of this kind are usually called slowly changing dimensions. Because a data warehouse needs to track historical changes, especially for important data, some mechanism is needed to preserve historical states.
There are roughly three ways to implement this:
1) Full snapshot:
Save a full snapshot of the current data every day. This approach suits dimensions with a small data volume and preserves the historical state in a simple way.
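The daily snapshot idea can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than Hive; the table names `dim_user` and `dim_user_snapshot`, the `ds` date column, and the helper functions are assumptions made for the example, not part of the original design.

```python
import sqlite3

def take_snapshot(conn, bizdate):
    """Copy every current dimension row into the snapshot table, tagged with bizdate."""
    # Delete first so that re-running the load for the same day does not duplicate rows.
    conn.execute("DELETE FROM dim_user_snapshot WHERE ds = ?", (bizdate,))
    conn.execute(
        "INSERT INTO dim_user_snapshot (ds, id, name, dept) "
        "SELECT ?, id, name, dept FROM dim_user",
        (bizdate,),
    )

def dept_on(conn, user_id, bizdate):
    """Return the department recorded in the snapshot of the given day, or None."""
    row = conn.execute(
        "SELECT dept FROM dim_user_snapshot WHERE ds = ? AND id = ?",
        (bizdate, user_id),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_user (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("CREATE TABLE dim_user_snapshot (ds TEXT, id INTEGER, name TEXT, dept TEXT)")
conn.execute("INSERT INTO dim_user VALUES (1, 'jiangtai', 'Dep1')")

take_snapshot(conn, "20180509")                                  # state before the change
conn.execute("UPDATE dim_user SET dept = 'Dep2' WHERE id = 1")   # the dimension changes
take_snapshot(conn, "20180510")                                  # state after the change
```

Each day adds one full copy of the table, which is why this only stays cheap for small dimensions. In a Hive warehouse the `ds` column would typically be a partition column, so a day's state is read from a single partition.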
2) Extra data columns to save historical state
Add one or more extra columns that retain one or more previous values.
Id | name | dept | Last_dept | …
1 | jiangtai | Dep1 | Dep3 |
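A minimal sketch of how such a column could be maintained, again using sqlite3 for illustration. The table name `dim_user` and the helper `change_dept` are assumptions for this example. Note that with a single extra column only the most recent previous value survives.

```python
import sqlite3

def change_dept(conn, user_id, new_dept):
    """Move the current dept into last_dept, then store the new dept."""
    # In an SQL UPDATE, the right-hand sides are evaluated against the row's
    # old values, so last_dept receives the department from before the change.
    conn.execute(
        "UPDATE dim_user SET last_dept = dept, dept = ? WHERE id = ?",
        (new_dept, user_id),
    )

def current_row(conn, user_id):
    """Return (dept, last_dept) for the given id."""
    return conn.execute(
        "SELECT dept, last_dept FROM dim_user WHERE id = ?", (user_id,)
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_user (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, last_dept TEXT)"
)
conn.execute("INSERT INTO dim_user VALUES (1, 'jiangtai', 'Dep3', NULL)")

change_dept(conn, 1, "Dep1")   # row now matches the table above: dept=Dep1, last_dept=Dep3
```

A further call such as `change_dept(conn, 1, "Dep2")` would leave `dept=Dep2, last_dept=Dep1`, and Dep3 would be lost. Keeping more history this way requires more columns.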
3) Zipper table technique
When the dimension data changes, the old record is marked as expired, and the changed data is inserted into the dimension table as a new record that takes effect from that point. This records the history of data changes at a chosen granularity.
This combines with the surrogate key mentioned earlier: Uid_org is the original business primary key, and Uid_agency is the surrogate key.
Uid_agency | Uid_org | name | dept | Start_date | End_date
1 | 1 | jiangtai | Dep1 | 20180501 | 20180509
2 | 1 | jiangtai | Dep2 | 20180510 | 20991231
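The two rows above could be produced by a load step like the following sketch. It uses sqlite3 and an auto-incrementing column as a stand-in surrogate key generator; the helper names `prev_day` and `apply_change` and the constant `OPEN_END` are assumptions for illustration. In Hive, where in-place updates are usually unavailable without ACID tables, the same logic is typically expressed by rewriting the table instead.

```python
import sqlite3
from datetime import datetime, timedelta

OPEN_END = "20991231"  # end_date that marks the currently valid record

def prev_day(yyyymmdd):
    """Return the day before the given yyyymmdd string, in the same format."""
    day = datetime.strptime(yyyymmdd, "%Y%m%d") - timedelta(days=1)
    return day.strftime("%Y%m%d")

def apply_change(conn, uid_org, name, dept, bizdate):
    """Close the open record for uid_org (if any) and insert the new version."""
    current = conn.execute(
        "SELECT name, dept FROM lalian_table WHERE uid_org = ? AND end_date = ?",
        (uid_org, OPEN_END),
    ).fetchone()
    if current == (name, dept):
        return  # nothing changed, keep the open record as it is
    # Expire the old version on the day before the new one takes effect.
    conn.execute(
        "UPDATE lalian_table SET end_date = ? WHERE uid_org = ? AND end_date = ?",
        (prev_day(bizdate), uid_org, OPEN_END),
    )
    # Insert the new version; uid_agency (the surrogate key) is auto-generated.
    conn.execute(
        "INSERT INTO lalian_table (uid_org, name, dept, start_date, end_date) "
        "VALUES (?, ?, ?, ?, ?)",
        (uid_org, name, dept, bizdate, OPEN_END),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lalian_table ("
    "uid_agency INTEGER PRIMARY KEY AUTOINCREMENT, uid_org INTEGER, "
    "name TEXT, dept TEXT, start_date TEXT, end_date TEXT)"
)
apply_change(conn, 1, "jiangtai", "Dep1", "20180501")
apply_change(conn, 1, "jiangtai", "Dep2", "20180510")
rows = conn.execute("SELECT * FROM lalian_table ORDER BY uid_agency").fetchall()
```

Because the old record ends on the day before the new one starts, the validity ranges never overlap and each day maps to at most one version per business key.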
Question 1: How do you get the state of the data on a given day from the zipper table?
SELECT *
FROM lalian_table
WHERE start_date <= '${bizdate}' AND end_date >= '${bizdate}'
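To see the query at work, here is a small runnable sketch against the two example rows, using sqlite3 and a bound parameter in place of the `${bizdate}` scheduler variable. The helper name `state_on` is an assumption for illustration. Dates stored as fixed-width yyyymmdd strings compare correctly as text.

```python
import sqlite3

def state_on(conn, bizdate):
    """Return the dimension rows that were valid on the given yyyymmdd day."""
    return conn.execute(
        "SELECT uid_org, name, dept FROM lalian_table "
        "WHERE start_date <= ? AND end_date >= ? ORDER BY uid_org",
        (bizdate, bizdate),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lalian_table (uid_agency INTEGER, uid_org INTEGER, "
    "name TEXT, dept TEXT, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO lalian_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "jiangtai", "Dep1", "20180501", "20180509"),
        (2, 1, "jiangtai", "Dep2", "20180510", "20991231"),
    ],
)

before = state_on(conn, "20180505")   # falls inside the first record's range
after = state_on(conn, "20180510")    # first day of the second record
```

Both bounds are inclusive, so this relies on the ranges of consecutive versions not overlapping.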
Ext: Slowly changing dimensions
The surrogate key is a highly recommended technique in dimensional modeling. It isolates the data warehouse from structural instability caused by changes in the source systems, and it can also improve data retrieval performance.
However, as you can see, surrogate keys are expensive to maintain, especially during data loading, where they have a large impact on the fact table. The impact is even more serious when building a Hive-based data warehouse: surrogate keys have to be generated, each fact row has to look up its matching key, and non-equi joins are not supported, all of which make the ETL process more complicated.
Therefore, in a big data system, use surrogate keys with caution. For slowly changing dimensions, you can instead consider trading space for time by keeping a full snapshot of the dimension table every day. This increases storage costs, so weigh it against your actual situation.
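To make the fact table cost concrete, the sketch below shows the kind of lookup a fact load needs: each fact row must find the dimension version whose validity range contains the fact's date. The `fact_order` table and the helper `attach_surrogate_keys` are hypothetical, and sqlite3 stands in for Hive. The range condition is placed in WHERE with an equality-only join on the business key, which is one common way to work around engines that reject non-equi join conditions.

```python
import sqlite3

def attach_surrogate_keys(conn):
    """Pair each fact row with the surrogate key valid on its order date."""
    return conn.execute(
        "SELECT f.order_id, d.uid_agency "
        "FROM fact_order f "
        "JOIN lalian_table d ON f.uid_org = d.uid_org "   # equality-only join
        "WHERE f.order_date >= d.start_date "             # range filter applied
        "AND f.order_date <= d.end_date "                 # after the join
        "ORDER BY f.order_id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lalian_table (uid_agency INTEGER, uid_org INTEGER, "
    "name TEXT, dept TEXT, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO lalian_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "jiangtai", "Dep1", "20180501", "20180509"),
        (2, 1, "jiangtai", "Dep2", "20180510", "20991231"),
    ],
)
# Hypothetical fact table: one row per order, keyed by the business key.
conn.execute("CREATE TABLE fact_order (order_id INTEGER, uid_org INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO fact_order VALUES (?, ?, ?)",
    [(100, 1, "20180503"), (101, 1, "20180512")],
)

keyed = attach_surrogate_keys(conn)
```

The join first pairs every fact row with every version of its dimension record and only then filters, so the intermediate result grows with the number of versions per key. With daily snapshots the same lookup becomes a plain equality join on the business key and the date.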