1. Background
What is a zipper table? It is an important data-processing technique used when building a data warehouse, and the name can be understood by analogy with a data structure in algorithms. The zipper table addresses the SCD requirements that arise when building a data warehouse. SCD stands for slowly changing dimension: as time goes by, dimension data changes, but slowly relative to the fact table.
The common processing methods of SCD are as follows:
Keep original value
Overwrite the value directly
Add new attribute column
Snapshot table
Zipper table
This article explains how to use a zipper table to handle SCD. A zipper table is a good fit when the following conditions apply:
1. The table holds a large amount of data, so keeping a full copy of the table for every day would take up a lot of storage.
2. The table data is modified, and with incremental tables alone it is difficult to handle duplicate and modified rows.
3. There is a need for backtracking: you need to know the full data as of a certain point in history.
4. The data is modified, but neither often nor in large volume, for example only one row in a million changes.
2. Zipper table processing theory
First of all, the zipper table is a full table, not a partitioned table. To achieve the effects described above, an intermediate table is needed as a springboard. This intermediate table is a partitioned table that holds incremental data, covering both modifications and additions, which are usually the rows whose create_time or update_time falls on the current day. The zipper table itself needs two extra fields, unrelated to the original data, that mark when a row version takes effect and when it expires. In this article these two dates are start_date and end_date. There are three main operations on a zipper table: initialization, daily updates, and rollback.
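To show what the two extra fields make possible, here is a minimal Python sketch (plain Python, not Hive) of reading a zipper table as of a given day. The names `Version`, `snapshot_at`, and `current`, and the sample rows, are hypothetical and chosen only for illustration. It assumes dates are stored as ISO strings (YYYY-MM-DD), so they compare correctly as text.

```python
from typing import NamedTuple

OPEN_END = "9999-12-31"  # sentinel end_date meaning "this version is still current"


class Version(NamedTuple):
    key: int          # business key, for example a shop id (hypothetical field)
    value: str        # the tracked attribute (hypothetical field)
    start_date: str   # first day on which this version is valid
    end_date: str     # last day on which this version is valid, or OPEN_END


def snapshot_at(table, day):
    """Return the versions that were valid on `day`."""
    # ISO dates compare correctly as plain strings.
    return [v for v in table if v.start_date <= day <= v.end_date]


def current(table):
    """Return only the versions that are still open."""
    return [v for v in table if v.end_date == OPEN_END]


zipper = [
    Version(1, "level 1", "2020-11-20", "2020-11-20"),  # closed version of key 1
    Version(1, "level 2", "2020-11-21", OPEN_END),      # current version of key 1
    Version(2, "level 1", "2020-11-20", OPEN_END),      # key 2 never changed
]

for v in snapshot_at(zipper, "2020-11-20"):
    print(v.key, v.value)
```

Run as written, the loop prints `1 level 1` and `2 level 1`, that is, the full data as it stood on 2020-11-20, even though key 1 has since changed.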
2.1 Initialization and new data
The daily rolling method is as follows:
Initialization sets the start time of the full zipper table, which is also the earliest point that a rollback can return to. The daily update logic is as shown in the figure above. Each day's incoming data falls into two parts: rows newly added that day, and rows in that day's partition that differ from the existing data. Setting start_date for the new row versions and end_date for the versions they replace brings the zipper table up to date.
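The daily update can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a Hive implementation: the function name `apply_increment` and the `id` and `level` fields are hypothetical, and the increment is assumed to contain at most one row per `id` per day.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # sentinel end_date meaning "still current"


def apply_increment(zipper, increment, load_date):
    """Merge one day's increment (new and modified rows) into the zipper table.

    Returns a new list; the input lists and their rows are not modified.
    """
    prev_day = (date.fromisoformat(load_date) - timedelta(days=1)).isoformat()
    changed_ids = {row["id"] for row in increment}

    merged = []
    for row in zipper:
        # Close the open version of every key that appears in the increment.
        if row["id"] in changed_ids and row["end_date"] == OPEN_END:
            row = {**row, "end_date": prev_day}
        merged.append(row)

    # Every incoming row becomes a new open version starting on load_date.
    for row in increment:
        merged.append({**row, "start_date": load_date, "end_date": OPEN_END})
    return merged


zipper = [
    {"id": 1, "level": 1, "start_date": "2020-11-20", "end_date": OPEN_END},
    {"id": 2, "level": 1, "start_date": "2020-11-20", "end_date": OPEN_END},
]
increment = [
    {"id": 2, "level": 2},  # existing key, modified on the load date
    {"id": 3, "level": 1},  # key added on the load date
]

updated = apply_increment(zipper, increment, "2020-11-21")
for row in updated:
    print(row)
```

After the call, the old version of id 2 ends on 2020-11-20, its new version and id 3 are open from 2020-11-21, and id 1 is untouched.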
2.2 Data rollback
Given the update logic above, consider how to roll back the data, that is, return the table to its state at a certain point in history. Because the zipper table is a full table, a rollback is applied to the whole table at once. The rollback strategy follows from comparing the rollback date with each row's start_date and end_date, as the following diagram shows:
Rows with end_date < rollback_date are kept unchanged. For rows with start_date ≤ rollback_date ≤ end_date, end_date is set to 9999-12-31. Rows with start_date > rollback_date did not yet exist at the rollback date and are discarded. To preserve the integrity of the data, the rolled-back result can first be written to a new temporary zipper table.
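The rollback rule can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the Hive job itself: the function name `rollback` and the `id` and `level` fields are hypothetical, and dates are ISO strings so that string comparison matches date order. Like the temporary table, the function builds a new list instead of changing the input.

```python
OPEN_END = "9999-12-31"  # sentinel end_date meaning "still current"


def rollback(zipper, rollback_date):
    """Return the zipper table as it looked at the end of `rollback_date`."""
    restored = []
    for row in zipper:
        if row["end_date"] < rollback_date:
            # Closed before the rollback date: keep the version unchanged.
            restored.append(dict(row))
        elif row["start_date"] <= rollback_date <= row["end_date"]:
            # Valid on the rollback date: it was the current version then.
            restored.append({**row, "end_date": OPEN_END})
        # Otherwise start_date > rollback_date: the version did not exist yet.
    return restored


zipper = [
    {"id": 1, "level": 1, "start_date": "2020-11-20", "end_date": OPEN_END},
    {"id": 2, "level": 1, "start_date": "2020-11-20", "end_date": "2020-11-20"},
    {"id": 2, "level": 2, "start_date": "2020-11-21", "end_date": "2020-11-21"},
    {"id": 2, "level": 3, "start_date": "2020-11-22", "end_date": OPEN_END},
    {"id": 3, "level": 1, "start_date": "2020-11-22", "end_date": OPEN_END},
]

restored = rollback(zipper, "2020-11-21")
for row in restored:
    print(row)
```

Rolling back to 2020-11-21 keeps three rows: id 1 stays open, the first version of id 2 stays closed, and the second version of id 2 is reopened. Both versions that started on 2020-11-22 are dropped.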
3. Zipper table processing case
The dimension (DIM) layer of a layered data warehouse is a common place to use zipper tables. The following example shows how to load new data into a zipper table and how to roll it back.
The case uses a zipper table to implement the merchant dimension table in the DIM layer for core transaction analysis, and then rolls the zipper table back.
3.1 Create table and import data
The incremental merchant table (ODS layer) and the merchant dimension table (DIM layer) are defined as follows:
-- Create the merchant information table (incremental table, partitioned table)
drop table if exists ods.ods_trade_shops;
create table ods.ods_trade_shops(
  `shopid` int COMMENT 'Shop ID',
  `userid` int COMMENT 'Shop owner',
  `areaid` int COMMENT 'Area ID',
  `shopname` string COMMENT 'Shop name',
  `shoplevel` int COMMENT 'Shop level',
  `status` int COMMENT 'Shop status',
  `createtime` string COMMENT 'Creation date',
  `modifytime` string COMMENT 'Modification date'
) COMMENT 'Merchant information table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';
-- Create the merchant information dimension table
drop table if exists dim.dim_trade_shops;
create table dim.dim_trade_shops(
  `shopid` int COMMENT 'Shop ID',
  `userid` int COMMENT 'Shop owner',
  `areaid` int COMMENT 'Area ID',
  `shopname` string COMMENT 'Shop name',
  `shoplevel` int COMMENT 'Shop level',
  `status` int COMMENT 'Shop status',
  `createtime` string COMMENT 'Creation date',
  `modifytime` string COMMENT 'Modification date',
  `startdate` string COMMENT 'Effective start date',
  `enddate` string COMMENT 'Expiry end date'
) COMMENT 'Merchant information table';
Import the following test data:
/root/data/shop-2020-11-20.dat
100050,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100053,3,100011,华为xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100054,4,100159,小米xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100055,5,100211,苹果xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
/root/data/shop-2020-11-21.dat
100057,7,100311,三只xxx鼠零食,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,良子xxx铺美食,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,小米xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,苹果xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
/root/data/shop-2020-11-22.dat
100059,9,100225,乐居xxx日用品,1,1,2020-06-28,2020-11-22 13:22:22
100060,10,100211,同仁xxx大健康,1,1,2020-06-28,2020-11-22 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,2,2020-06-28,2020-11-22 13:22:22
load data local inpath '/root/data/shop-2020-11-20.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-20');
load data local inpath '/root/data/shop-2020-11-21.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-21');
load data local inpath '/root/data/shop-2020-11-22.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-22');
3.2 Zipper table initialization
Assume that all of the first day's data is historical data:
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
CASE
WHEN modifytime IS NOT NULL THEN substr(modifytime, 1, 10)
ELSE substr(createtime, 1, 10)
END AS startdate,
'9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt ='2020-11-20';
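To make the startdate rule in the statement above concrete, here is a small Python sketch that applies the same rule to one line of the test data. The function name `init_row` is hypothetical. As a simplifying assumption, an empty modifytime field stands in for NULL here, which is not necessarily how Hive reads an empty field from a delimited text file.

```python
OPEN_END = "9999-12-31"  # every row starts out as the current version


def init_row(line):
    """Turn one comma-separated ODS line into a zipper-table row (a dict)."""
    (shopid, userid, areaid, shopname,
     shoplevel, status, createtime, modifytime) = line.split(",")

    # Same rule as the CASE expression: prefer modifytime, else createtime.
    # Assumption: an empty string plays the role of NULL in this sketch.
    source = modifytime if modifytime else createtime

    return {
        "shopid": int(shopid),
        "userid": int(userid),
        "areaid": int(areaid),
        "shopname": shopname,
        "shoplevel": int(shoplevel),
        "status": int(status),
        "createtime": createtime,
        "modifytime": modifytime,
        "startdate": source[:10],  # keep only the YYYY-MM-DD part
        "enddate": OPEN_END,
    }


row = init_row("100050,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22")
print(row["shopid"], row["startdate"], row["enddate"])
```

For this line the sketch prints `100050 2020-11-20 9999-12-31`: the version starts on the modification date and is left open.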
3.3 Update zipper table
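For an incremental table, the usual convention is that the date part of createtime or modifytime determines the daily partition dt, and that modifytime is greater than or equal to createtime. The statement below merges the 2020-11-21 partition into the zipper table, deriving startdate from modifytime when it is present and from createtime otherwise: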
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
CASE
WHEN modifytime IS NOT NULL THEN substr(modifytime, 1, 10)
ELSE substr(createtime, 1, 10)
END AS startdate,
'9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '2020-11-21'
UNION ALL
SELECT b.shopid,
b.userid,
b.areaid,
b.shopname,
b.shoplevel,
b.status,
b.createtime,
b.modifytime,
b.startdate,
CASE
WHEN a.shopid IS NOT NULL
AND b.enddate ='9999-12-31' THEN date_add('2020-11-21', -1)
ELSE b.enddate
END AS enddate
FROM
(SELECT *
FROM ods.ods_trade_shops
WHERE dt='2020-11-21') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
The script to load the zipper table is as follows:
dim_load_shops.sh
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=$(date -d "-1 day" +%F)
fi
sql="
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
CASE
WHEN modifytime IS NOT NULL THEN substr(modifytime, 1, 10)
ELSE substr(createtime, 1, 10)
END AS startdate,
'9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '$do_date'
UNION ALL
SELECT b.shopid,
b.userid,
b.areaid,
b.shopname,
b.shoplevel,
b.status,
b.createtime,
b.modifytime,
b.startdate,
CASE
WHEN a.shopid IS NOT NULL
AND b.enddate ='9999-12-31' THEN date_add('$do_date', -1)
ELSE b.enddate
END AS enddate
FROM
(SELECT *
FROM ods.ods_trade_shops
WHERE dt='$do_date') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
"
hive -e "$sql"
You can run this script to load the data for 2020-11-22:
sh dim_load_shops.sh 2020-11-22
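Each run of the script closes the versions that are open at that moment, so when several days need to be loaded they have to be processed in chronological order. The following Python sketch builds one command per day, oldest first. The helper name `backfill_commands` is hypothetical, and the sketch only prints the commands. Running them would require the script and a working Hive installation.

```python
from datetime import date, timedelta


def backfill_commands(first_day, last_day, script="dim_load_shops.sh"):
    """Build one load command per day from first_day to last_day, oldest first."""
    start = date.fromisoformat(first_day)
    end = date.fromisoformat(last_day)
    if start > end:
        raise ValueError("first_day must not be after last_day")

    day_count = (end - start).days + 1
    return [
        ["sh", script, (start + timedelta(days=offset)).isoformat()]
        for offset in range(day_count)
    ]


commands = backfill_commands("2020-11-21", "2020-11-22")
for cmd in commands:
    # To execute instead of print: subprocess.run(cmd, check=True),
    # after adding `import subprocess` at the top.
    print(" ".join(cmd))
```

For the test data this prints `sh dim_load_shops.sh 2020-11-21` followed by `sh dim_load_shops.sh 2020-11-22`. The 2020-11-20 partition is left out because it is loaded by the initialization step.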
3.4 Roll back the zipper table to a certain point in time
First create a temporary table, tmp.tmp_shops, to hold the rolled-back data (the rollback date here is 2020-11-21):
DROP TABLE IF EXISTS tmp.tmp_shops;
CREATE TABLE IF NOT EXISTS tmp.tmp_shops AS
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
startdate,
enddate
FROM dim.dim_trade_shops
WHERE enddate < '2020-11-21'
UNION ALL
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
startdate,
'9999-12-31' AS enddate
FROM dim.dim_trade_shops
WHERE startdate <= '2020-11-21'
AND enddate >= '2020-11-21';
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT *
FROM tmp.tmp_shops;
The rollback script is similar to the update script: only the SQL inside it needs to change, so it is not repeated here.
Wu Xie, Xiao San Ye, a newcomer to backend development, big data, and artificial intelligence.