Big data development: an overview of the data warehouse zipper table, and how to update or roll it back

1. Background

What is a zipper table? It is an important way of maintaining table data when a data warehouse is built, and the name can be understood by analogy with a data structure from algorithms. Its purpose is to meet SCD requirements in the warehouse. SCD stands for slowly changing dimension: dimension data that changes over time, but slowly compared with the fact table.

The common processing methods of SCD are as follows:

  • Keep original value

  • Overwrite directly

  • Add new attribute column

  • Snapshot table

  • Zipper table

This article explains how the zipper table handles SCD. Its characteristics are best summarized by the scenarios it suits. Consider a zipper table when the following apply:

1. The table holds a large amount of data, and full snapshots of the table would take up a lot of storage

2. Rows in the table are modified, and with incremental tables alone it is difficult to deal with duplicate and modified data

3. Backtracking is required: you need to know the full data as of a certain point in history

4. The data is modified, but neither often nor in large volume, for example only one row in a million changes

2. Zipper table processing theory

First of all, the zipper table is a full table, not a partitioned table. To achieve the effects described above, an intermediate table is needed as a springboard. This intermediate table is partitioned and holds incremental data: the rows added or modified on a given day, whose create_time or update_time usually falls on that day. The zipper table itself needs two extra fields, unrelated to the original data, that mark when a row version takes effect and when it expires. In the example these are start_date and end_date. There are three main operations on a zipper table: initialization, the daily update, and rollback.
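
The start_date and end_date pair is what makes backtracking possible: a row version is valid on a given day when start_date <= day <= end_date. A minimal Python sketch of that lookup follows. It assumes rows are plain dicts with ISO-formatted date strings and that 9999-12-31 marks the version that is still current; the helper name snapshot_as_of and the sample rows are illustrative and not part of the original example.

```python
# Sketch only: dates are ISO strings, so plain string comparison
# orders them chronologically.
OPEN_END = "9999-12-31"  # assumed marker for "still current"


def snapshot_as_of(zipper_rows, day):
    """Return the row versions that were valid on `day` (inclusive on both ends)."""
    return [r for r in zipper_rows if r["start_date"] <= day <= r["end_date"]]


# Hypothetical zipper table: key 1 changed on 2020-11-21, key 2 never changed.
zipper = [
    {"id": 1, "level": 1, "start_date": "2020-11-20", "end_date": "2020-11-20"},
    {"id": 1, "level": 2, "start_date": "2020-11-21", "end_date": OPEN_END},
    {"id": 2, "level": 1, "start_date": "2020-11-20", "end_date": OPEN_END},
]

# Full data as it looked on 2020-11-20: the old version of key 1 and key 2.
for row in snapshot_as_of(zipper, "2020-11-20"):
    print(row["id"], row["level"])
```

Because every key has at most one version covering any given day, the same filter returns the current data when called with today's date.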

2.1 Initialization and new data

The daily rolling method is as follows:

[Figure: daily update logic of the zipper table]

The initialization date is the start of the zipper table's full history, and it is also the earliest point the table can be rolled back to. The daily update logic is shown in the figure above. Each day's data falls into two parts: rows newly added that day, and rows in the day's partition that differ from what the zipper table already holds. Setting start_date on the incoming versions and end_date on the versions they replace carries out the update.
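
The daily update can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the production logic: rows are dicts keyed by a hypothetical id field, the increment contains at most one row per key, and every incoming row starts on the load day (the Hive example later in the article derives the start date from the row's own timestamps instead). The function name apply_daily_update is made up for illustration.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # assumed marker for the current version


def apply_daily_update(zipper_rows, increment_rows, day):
    """Merge one day's increment into the zipper rows and return the new table."""
    changed_keys = {r["id"] for r in increment_rows}
    previous_day = (date.fromisoformat(day) - timedelta(days=1)).isoformat()

    result = []
    # 1. Every incoming row (new or modified) becomes the current version.
    for r in increment_rows:
        result.append({**r, "start_date": day, "end_date": OPEN_END})
    # 2. Existing rows are kept; an open row whose key reappears is closed
    #    on the day before the load day.
    for r in zipper_rows:
        if r["id"] in changed_keys and r["end_date"] == OPEN_END:
            result.append({**r, "end_date": previous_day})
        else:
            result.append(dict(r))
    return result


zipper = [
    {"id": 100053, "shoplevel": 1, "start_date": "2020-11-20", "end_date": OPEN_END},
    {"id": 100054, "shoplevel": 1, "start_date": "2020-11-20", "end_date": OPEN_END},
]
increment = [
    {"id": 100054, "shoplevel": 2},  # modified row
    {"id": 100057, "shoplevel": 1},  # newly added row
]

updated = apply_daily_update(zipper, increment, "2020-11-21")
for row in updated:
    print(row)
```

After the call, key 100054 has two versions: the old one closed on 2020-11-20 and the new one open from 2020-11-21. Key 100053 is untouched.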

2.2 Data rollback

Given the update logic above, consider how to roll the data back, that is, return the table to its state at a certain point in history. Because the zipper table is a single full table, one rollback pass over it is enough. The rollback strategy is determined by the rollback date and by each row's start_date and end_date. The diagram below shows how:

[Figure: rollback logic of the zipper table]

Rows with end_date < rollback_date are kept unchanged. Rows with end_date ≥ rollback_date ≥ start_date have end_date set to 9999-12-31. Rows with start_date > rollback_date did not exist yet at the rollback date and are dropped. To protect the integrity of the data, the rollback result can first be written to a new temporary zipper table.
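
The rollback rules can be modelled as a short Python function. This is a sketch under the same assumptions as before (dict rows, ISO date strings, 9999-12-31 as the open marker); the name roll_back and the sample rows are hypothetical. It returns a new list rather than modifying its input, which mirrors writing the result to a temporary table first.

```python
OPEN_END = "9999-12-31"  # assumed marker for the current version


def roll_back(zipper_rows, rollback_date):
    """Return the zipper rows as they stood at the end of `rollback_date`."""
    result = []
    for r in zipper_rows:
        if r["end_date"] < rollback_date:
            # Closed before the rollback date: keep unchanged.
            result.append(dict(r))
        elif r["start_date"] <= rollback_date:
            # Valid on the rollback date: it was the current version then.
            result.append({**r, "end_date": OPEN_END})
        # Otherwise the version started after the rollback date: drop it.
    return result


# Hypothetical state after three daily loads (2020-11-20 to 2020-11-22).
zipper = [
    {"id": 100052, "status": 1, "start_date": "2020-11-20", "end_date": "2020-11-21"},
    {"id": 100052, "status": 2, "start_date": "2020-11-22", "end_date": OPEN_END},
    {"id": 100054, "status": 1, "start_date": "2020-11-20", "end_date": "2020-11-20"},
    {"id": 100054, "status": 1, "start_date": "2020-11-21", "end_date": OPEN_END},
]

restored = roll_back(zipper, "2020-11-21")
for row in restored:
    print(row)
```

Rolling back to 2020-11-21 reopens the first version of 100052, drops the version loaded on 2020-11-22, and leaves both versions of 100054 as they were.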

3. Zipper table processing case

The zipper table is most commonly used in the DIM (dimension) layer of a layered data warehouse. The example below shows how to update a zipper table and roll it back.

The zipper table is used here to implement the DIM-layer merchant dimension table from core transaction analysis, together with its rollback.

3.1 Create table and import data

The structure of the merchant dimension table is as follows:

-- Create the merchant information table (incremental, partitioned)
drop table if exists ods.ods_trade_shops;
create table ods.ods_trade_shops(
  `shopid` int COMMENT 'Shop ID',
  `userid` int COMMENT 'Person in charge of the shop',
  `areaid` int COMMENT 'Area ID',
  `shopname` string COMMENT 'Shop name',
  `shoplevel` int COMMENT 'Shop level',
  `status` int COMMENT 'Shop status',
  `createtime` string COMMENT 'Creation date',
  `modifytime` string COMMENT 'Modification date'
) COMMENT 'Merchant information table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';

-- Create the merchant information dimension table
drop table if exists dim.dim_trade_shops;
create table dim.dim_trade_shops(
  `shopid` int COMMENT 'Shop ID',
  `userid` int COMMENT 'Person in charge of the shop',
  `areaid` int COMMENT 'Area ID',
  `shopname` string COMMENT 'Shop name',
  `shoplevel` int COMMENT 'Shop level',
  `status` int COMMENT 'Shop status',
  `createtime` string COMMENT 'Creation date',
  `modifytime` string COMMENT 'Modification date',
  `startdate` string COMMENT 'Effective start date',
  `enddate` string COMMENT 'Expiry end date'
) COMMENT 'Merchant information table';

Import the following test data:

/root/data/shop-2020-11-20.dat
100050,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100053,3,100011,华为xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100054,4,100159,小米xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100055,5,100211,苹果xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22


/root/data/shop-2020-11-21.dat
100057,7,100311,三只xxx鼠零食,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,良子xxx铺美食,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,小米xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,苹果xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22


/root/data/shop-2020-11-22.dat
100059,9,100225,乐居xxx日用品,1,1,2020-06-28,2020-11-22 13:22:22
100060,10,100211,同仁xxx大健康,1,1,2020-06-28,2020-11-22 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,2,2020-06-28,2020-11-22 13:22:22

load data local inpath '/root/data/shop-2020-11-20.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-20');
load data local inpath '/root/data/shop-2020-11-21.dat' overwrite  into table ods.ods_trade_shops partition(dt='2020-11-21');
load data local inpath '/root/data/shop-2020-11-22.dat' overwrite  into table ods.ods_trade_shops partition(dt='2020-11-22');

3.2 Zipper table initialization

Assume that all of the first day's data is historical data:

INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt ='2020-11-20';

3.3 Update zipper table

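For incremental tables, the general logic is that the date part of createtime or modifytime determines the day partition dt, and modifytime is greater than or equal to createtime. Here the first two days of data are used: the 2020-11-21 partition is merged into the zipper table initialized from 2020-11-20.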

INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '2020-11-21'
UNION ALL
SELECT b.shopid,
       b.userid,
       b.areaid,
       b.shopname,
       b.shoplevel,
       b.status,
       b.createtime,
       b.modifytime,
       b.startdate,
       CASE
           WHEN a.shopid IS NOT NULL
                AND b.enddate ='9999-12-31' THEN date_add('2020-11-21', -1)
           ELSE b.enddate
       END AS enddate
FROM
  (SELECT *
   FROM ods.ods_trade_shops
   WHERE dt='2020-11-21') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;

The script to load the zipper table is as follows:

dim_load_shops.sh

#!/bin/bash

source /etc/profile
if [ -n "$1" ]
then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi

sql="
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '$do_date'
UNION ALL
SELECT b.shopid,
       b.userid,
       b.areaid,
       b.shopname,
       b.shoplevel,
       b.status,
       b.createtime,
       b.modifytime,
       b.startdate,
       CASE
           WHEN a.shopid IS NOT NULL
                AND b.enddate ='9999-12-31' THEN date_add('$do_date', -1)
           ELSE b.enddate
       END AS enddate
FROM
  (SELECT *
   FROM ods.ods_trade_shops
   WHERE dt='$do_date') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
"

hive -e "$sql"

You can run this script to load the 2020-11-22 data: bash dim_load_shops.sh 2020-11-22
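
One thing worth checking after a load is that the table is still well formed. Reading the update SQL above, running it twice for the same date would appear to close the rows the first run inserted, giving them an enddate one day before their startdate. The Python sketch below is a hedged sanity check over rows exported from dim.dim_trade_shops. The function check_zipper and the sample rows are hypothetical, and rows are assumed to be dicts keyed by column name.

```python
from collections import Counter

OPEN_END = "9999-12-31"  # open marker used by the article's SQL


def check_zipper(rows):
    """Return a list of human-readable problems found in the zipper rows."""
    problems = []
    # A key should never have more than one current (open) version.
    open_counts = Counter(r["shopid"] for r in rows if r["enddate"] == OPEN_END)
    for shopid, count in sorted(open_counts.items()):
        if count > 1:
            problems.append(f"shopid {shopid}: {count} open rows")
    # A version cannot expire before it takes effect.
    for r in rows:
        if r["enddate"] < r["startdate"]:
            problems.append(
                f"shopid {r['shopid']}: enddate {r['enddate']} "
                f"before startdate {r['startdate']}"
            )
    return problems


healthy = [
    {"shopid": 100054, "startdate": "2020-11-20", "enddate": "2020-11-20"},
    {"shopid": 100054, "startdate": "2020-11-21", "enddate": OPEN_END},
]
# What a repeated load for 2020-11-21 is assumed to leave behind.
after_rerun = healthy + [
    {"shopid": 100054, "startdate": "2020-11-21", "enddate": "2020-11-20"},
]

print(check_zipper(healthy))
print(check_zipper(after_rerun))
```

If the check reports problems, rolling the table back to the previous day and loading again is one way to recover.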

3.4 Roll back the zipper table to a certain point in time

First create a temporary table tmp.tmp_shops to hold the rolled-back data:

DROP TABLE IF EXISTS tmp.tmp_shops;
CREATE TABLE IF NOT EXISTS tmp.tmp_shops AS
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       startdate,
       enddate
FROM dim.dim_trade_shops
WHERE enddate < '2020-11-21'
UNION ALL
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       startdate,
       '9999-12-31' AS enddate
FROM dim.dim_trade_shops
WHERE startdate <= '2020-11-21'
  AND enddate >= '2020-11-21';


INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT *
FROM tmp.tmp_shops;

The rollback script is similar to the update script: only the SQL in it needs to change, so it is not repeated here.

Wu Xie, Xiao San Ye, a rookie in backend development, big data, and artificial intelligence.

Origin blog.csdn.net/hu_lichao/article/details/111147473