Data Warehouse: Zipper Table Implementation Ideas

This article mainly explains:

 1. What a zipper table is, with an example

 2. How to build a zipper table from different kinds of source tables

 

First, let's look at what a zipper table is.

1. What a zipper table is, with an example

What is a zipper table

     When dimension data changes, the old record is marked as expired, and the changed data is inserted into the dimension table as a new record that becomes the effective one. Each record carries an effective date and an expiry date, so the table records the history of data changes at a chosen granularity (for example, one day).
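To make the definition concrete, here is a minimal Python sketch (Python rather than HiveQL, so it can be run without a cluster). The field names start_date and end_date, and the far-future date 9999-12-31 meaning "still effective", follow the tables built later in this article; the helper name rows_as_of is only an illustration.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # marks a version that is still effective

# Two versions of the same user: the name changed on 2020-03-21.
zipper_rows = [
    {"user_id": 1, "user_name": "szh",
     "start_date": date(2020, 3, 20), "end_date": date(2020, 3, 20)},
    {"user_id": 1, "user_name": "szh2",
     "start_date": date(2020, 3, 21), "end_date": OPEN_END},
]

def rows_as_of(rows, day):
    """Return the versions that were effective on the given day."""
    return [r for r in rows if r["start_date"] <= day <= r["end_date"]]

print(rows_as_of(zipper_rows, date(2020, 3, 20))[0]["user_name"])  # szh
print(rows_as_of(zipper_rows, date(2020, 3, 21))[0]["user_name"])  # szh2
```

The old version is never deleted; it simply stops being effective, which is what lets the table answer "what did this row look like on day X?".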

 

 

Example of zipper table

      For a worked example, please see my other article:

 https://blog.csdn.net/u010003835/article/details/104420723

 

 

 

 2. How to build a zipper table from different kinds of source tables

 

1. The first case: a standard source table

    A standard source table is one that contains both create_time and update_time columns.

Assume the source table is loaded into the data warehouse once a day, with each day's load stored in its own partition (pt).

First, we create the table and insert test data:


use data_warehouse_test;

CREATE TABLE IF NOT EXISTS user_zipper_org (
	user_id BIGINT COMMENT 'user id'
	,user_name STRING COMMENT 'user name'
	,create_time DATE COMMENT 'creation time'
	,update_time DATE COMMENT 'modification time'
) 
PARTITIONED BY(
	pt STRING COMMENT 'data partition'
)
STORED AS ORC
;


ALTER TABLE user_zipper_org DROP IF EXISTS PARTITION (pt = '20200320');
ALTER TABLE user_zipper_org DROP IF EXISTS PARTITION (pt = '20200321');


INSERT INTO TABLE user_zipper_org PARTITION (pt = '20200320')
VALUES 
(1, 'szh', '2020-01-01', '2020-01-01')
,(2, 'yuqin', '2020-01-01', '2020-01-01')
,(3, 'heping', '2020-01-01', '2020-01-01')
,(4, 'quxingma', '2020-01-01', '2020-01-01')
,(5, 'zhouzhou', '2020-01-01', '2020-01-01')
;

INSERT INTO TABLE user_zipper_org PARTITION (pt = '20200321')
VALUES 
(1, 'szh2', '2020-01-01', '2020-03-21')
,(2, 'yuqin', '2020-01-01', '2020-01-01')
,(3, 'heping3', '2020-01-01', '2020-03-21')
,(4, 'quxingma', '2020-01-01', '2020-01-01')
,(5, 'zhouzhou', '2020-01-01', '2020-01-01')
,(6, 'newuser', '2020-03-21', '2020-03-21')
;

 

Next we build the final zipper table. To keep the data safe, the final zipper table is a partitioned table: each day's full zipper data is written to its own partition, leaving earlier partitions untouched.

First, because the zipper table starts from 20200320, every row in the source table's 20200320 partition counts as currently effective. We use these rows to build the 20200320 partition of the final zipper table.

Second, we copy the 20200320 partition of the final table into a temporary (intermediate) table.

The SQL for these two steps is as follows:

use data_warehouse_test;

CREATE TABLE IF NOT EXISTS user_zipper_final
(
	user_id BIGINT COMMENT 'user id'
	,user_name STRING COMMENT 'user name'
	,create_time DATE COMMENT 'creation time'
	,update_time DATE COMMENT 'modification time'
	,start_date DATE COMMENT 'effective date'
	,end_date DATE COMMENT 'expiry date'
)
PARTITIONED BY(
	pt STRING COMMENT 'data partition'
);


INSERT OVERWRITE TABLE user_zipper_final PARTITION (pt = '20200320')
SELECT 
	org.user_id
	,org.user_name
	,org.create_time
	,org.update_time
	,'2020-03-20' AS start_date
	,'9999-12-31' AS end_date
FROM user_zipper_org AS org
WHERE pt = '20200320'
;


DROP TABLE IF EXISTS tmp_user_zipper_mid;
CREATE TABLE tmp_user_zipper_mid AS
SELECT 
	org.user_id
	,org.user_name
	,org.create_time
	,org.update_time
	,org.start_date
	,org.end_date
FROM user_zipper_final AS org
WHERE pt = '20200320'
;

 

 

The key step is to build the zipper table data of 20200321 from two inputs: the rows that changed on 20200321 (new and modified) and the full zipper table data of 20200320. The result is first stored in a temporary (intermediate) table.

A new row has create_time = 2020-03-21.

A modified row has update_time = 2020-03-21.

 

First, we can obtain the new and changed rows as follows:

-- Rows that changed on 2020-03-21

SELECT 
	org.user_id
	,org.user_name
	,org.create_time
	,org.update_time
	,'2020-03-21' AS start_date
	,'9999-12-31' AS end_date
FROM user_zipper_org AS org
WHERE pt = '20200321'
	AND (
		(
			create_time = '2020-03-21'
		)
		OR
		(
			update_time = '2020-03-21'
		)
	)
;

 

The full zipper table data of 20200321 can then be built as follows:

use data_warehouse_test;

DROP TABLE IF EXISTS tmp_user_zipper_mid;

CREATE TABLE tmp_user_zipper_mid AS
SELECT *
FROM
(
-- Part 1: rows already in the zipper table; close the open version of every user that changed on 2020-03-21
SELECT 
	final.user_id
	,final.user_name
	,final.create_time
	,final.update_time
	,final.start_date
	,CAST (
		(
			CASE 
				WHEN 
					(
						new_data.user_id IS NOT NULL 
						AND
						final.end_date >= '2020-03-21' 
					)
					THEN '2020-03-20'
				ELSE final.end_date
			END 
		)	
		AS DATE
	)
	AS end_date
FROM 
	user_zipper_final AS final
LEFT JOIN (
	SELECT 
	org.user_id
	,org.user_name
	,org.create_time
	,org.update_time
	FROM user_zipper_org AS org
	WHERE pt = '20200321'
		AND (
			(
				create_time = '2020-03-21'
			)
			OR
			(
				update_time = '2020-03-21'
			)
		)
) AS new_data
ON  new_data.user_id = final.user_id
WHERE final.pt = '20200320'

UNION ALL
-- Part 2: the new version of every user that was added or modified on 2020-03-21

SELECT 
	new_data.user_id
	,new_data.user_name
	,new_data.create_time
	,new_data.update_time
	,CAST( '2020-03-21' AS DATE ) AS start_date
	,CAST ('9999-12-31' AS DATE ) AS end_date
FROM (
	SELECT 
	org.user_id
	,org.user_name
	,org.create_time
	,org.update_time
	FROM user_zipper_org AS org
	WHERE pt = '20200321'
		AND (
			(
				create_time = '2020-03-21'
			)
			OR
			(
				update_time = '2020-03-21'
			)
		)
) AS new_data
) AS tmp
;
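The same merge logic can be sketched in plain Python to see what the query should produce for the test data. This is only an illustration under the article's assumptions (one open version per user, changed rows already selected by create_time or update_time). The function name build_next_zipper is hypothetical, and the create_time and update_time columns are left out for brevity.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)

def build_next_zipper(prev_rows, changed_rows, day):
    """Close the open version of every changed user, then append new versions."""
    changed_ids = {r["user_id"] for r in changed_rows}
    result = []
    for row in prev_rows:
        row = dict(row)  # copy so the previous partition stays untouched
        if row["user_id"] in changed_ids and row["end_date"] >= day:
            row["end_date"] = day - timedelta(days=1)
        result.append(row)
    for row in changed_rows:
        result.append({**row, "start_date": day, "end_date": OPEN_END})
    return result

d20 = date(2020, 3, 20)
prev = [{"user_id": i, "user_name": n, "start_date": d20, "end_date": OPEN_END}
        for i, n in [(1, "szh"), (2, "yuqin"), (3, "heping"),
                     (4, "quxingma"), (5, "zhouzhou")]]
changed = [{"user_id": 1, "user_name": "szh2"},
           {"user_id": 3, "user_name": "heping3"},
           {"user_id": 6, "user_name": "newuser"}]

for r in build_next_zipper(prev, changed, date(2020, 3, 21)):
    print(r["user_id"], r["user_name"], r["start_date"], r["end_date"])
```

For the test data this yields eight rows: the five rows from 20200320 (with the old versions of users 1 and 3 closed at 2020-03-20) plus three new versions starting on 2020-03-21.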

 

 

Finally, we insert the contents of the temporary table into the new zipper table partition.

Then we check the data:

1. Get the latest zipper table partition data

2. Get the data as of 2020-03-20 from the zipper table

3. Get the data as of 2020-03-21 from the zipper table


use data_warehouse_test;

INSERT OVERWRITE TABLE user_zipper_final PARTITION (pt = '20200321')
SELECT 
	* 
FROM tmp_user_zipper_mid
;


SELECT *
FROM user_zipper_final
WHERE pt = '20200321'
;


SELECT * 
FROM user_zipper_final 
WHERE pt = '20200321' 
	AND start_date <= '2020-03-20' 
	AND end_date >= '2020-03-20'
;


SELECT * 
FROM user_zipper_final 
WHERE pt = '20200321' 
	AND start_date <= '2020-03-21' 
	AND end_date >= '2020-03-21'
;
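Beyond these spot checks, one further sanity check that may be useful is that every user has exactly one open version (end_date = 9999-12-31). The Python sketch below illustrates that check, assuming users are never deleted from the source table; the helper name users_without_single_open_row is hypothetical.

```python
from collections import Counter
from datetime import date

OPEN_END = date(9999, 12, 31)

def users_without_single_open_row(rows):
    """Return user_ids whose number of open versions is not exactly one."""
    all_ids = {r["user_id"] for r in rows}
    open_counts = Counter(r["user_id"] for r in rows if r["end_date"] == OPEN_END)
    return sorted(uid for uid in all_ids if open_counts[uid] != 1)

rows = [
    {"user_id": 1, "end_date": date(2020, 3, 20)},
    {"user_id": 1, "end_date": OPEN_END},
    {"user_id": 2, "end_date": OPEN_END},
    {"user_id": 3, "end_date": OPEN_END},
    {"user_id": 3, "end_date": OPEN_END},   # duplicated open version: a bug
]
print(users_without_single_open_row(rows))  # [3]
```

An empty result means each user has one and only one currently effective row.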

 

 

 

2. The second case: a non-standard source table

     A non-standard source table is one that lacks create_time or update_time, or even both. How should we build the zipper table in that situation?

Let's analyze the problem carefully.

First, with only one of create_time or update_time (or neither), we cannot reliably tell which rows changed.

Therefore, the idea is to construct an extra column that identifies changed data.

We do this by taking the MD5 value over all columns.

 

Suppose our source table has the following columns:

user_id, user_name, create_time

Then the flag column is computed as MD5(CONCAT(user_id, user_name, create_time))
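Two details are worth checking when building such a flag. Concatenating values with no separator can make different rows produce the same string, and Hive's CONCAT returns NULL as soon as any argument is NULL. The following Python sketch illustrates the idea of a separator plus a NULL placeholder. The helper names row_flag and naive_flag, the '|' separator and the '<NULL>' placeholder are illustrative choices, not something prescribed by the original.

```python
import hashlib

def row_flag(*values, sep="|", null_token="<NULL>"):
    """MD5 over all column values, with a separator and a NULL placeholder."""
    parts = [null_token if v is None else str(v) for v in values]
    return hashlib.md5(sep.join(parts).encode("utf-8")).hexdigest()

def naive_flag(*values):
    """Plain concatenation without a separator, for comparison."""
    return hashlib.md5("".join(str(v) for v in values).encode("utf-8")).hexdigest()

# Without a separator, (1, "23") and (12, "3") both become "123".
print(naive_flag(1, "23") == naive_flag(12, "3"))  # True
print(row_flag(1, "23") == row_flag(12, "3"))      # False
# A NULL column still gives a usable flag, distinct from an empty string.
print(row_flag(1, None) == row_flag(1, ""))        # False
```

The separator should be a character that does not occur in the data itself, otherwise the same ambiguity can reappear.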

Assuming we already have an initial zipper table, the difficulty lies in obtaining the changed data (new and modified rows).

The SQL for obtaining the changed data is as follows:

SELECT
	ta.*
FROM 
(
	SELECT 
		user_id
		,user_name
		,create_time
		,MD5(CONCAT(user_id,user_name,create_time)) AS user_flag
	FROM user_zipper_org
	WHERE pt = '20200321'
) AS ta
LEFT JOIN (
	-- assumes the zipper table stores the user_flag of each version
	SELECT 
		user_id
		,user_flag
	FROM user_zipper_final
	WHERE pt = '20200320'
		AND start_date <= '2020-03-20'
		AND end_date >= '2020-03-20'
) AS tb
ON ta.user_id = tb.user_id
WHERE tb.user_id IS NULL             -- new row: not in the zipper table yet
	OR ta.user_flag != tb.user_flag  -- modified row: flag differs
;
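The comparison can also be sketched in Python to show which rows should come back. This is a minimal illustration, assuming the flag of each currently effective version is available per user_id. The function names flag and changed_rows and the '|' separator are hypothetical choices for this sketch.

```python
import hashlib

def flag(row):
    """Fingerprint of the business columns of one row."""
    text = "|".join(str(row[c]) for c in ("user_id", "user_name", "create_time"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def changed_rows(today_rows, current_flags):
    """Rows that are new (no flag stored yet) or modified (flag differs)."""
    return [r for r in today_rows if current_flags.get(r["user_id"]) != flag(r)]

yesterday = [
    {"user_id": 1, "user_name": "szh", "create_time": "2020-01-01"},
    {"user_id": 2, "user_name": "yuqin", "create_time": "2020-01-01"},
]
today = [
    {"user_id": 1, "user_name": "szh2", "create_time": "2020-01-01"},     # modified
    {"user_id": 2, "user_name": "yuqin", "create_time": "2020-01-01"},    # unchanged
    {"user_id": 6, "user_name": "newuser", "create_time": "2020-03-21"},  # new
]

# user_id -> flag of the currently effective version in the zipper table
current_flags = {r["user_id"]: flag(r) for r in yesterday}
print([r["user_id"] for r in changed_rows(today, current_flags)])  # [1, 6]
```

Unchanged rows have an identical flag and are filtered out, so only new and modified rows feed the next zipper build.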

 

 

 

 

 

 


Origin blog.csdn.net/u010003835/article/details/104849019