Design and implementation of a zipper table in a data warehouse

1. Introduction

  • Incremental table: partitioned by date; each partition stores only that day's incremental data, that is, new and changed records.
  • Full table: no date partition; it is overwritten every day and stores only the latest state of the data, so historical changes cannot be recorded.
  • Snapshot table: partitioned by date; each partition stores the full data set for that day, whether or not anything changed. The disadvantage is that the partitions hold a lot of duplicate data, which wastes storage space.
  • Zipper table: a table that maintains both the historical states and the latest state of the data. At a given zipper granularity it is equivalent to a snapshot table, but optimized so that unchanged records are not stored again.

2. Application scenarios

The zipper table suits scenarios where the data volume is large, only a small proportion of fields change and they change infrequently, and historical snapshots still need to be queried.

For example, consider a customer table with tens of millions of records and hundreds of fields. Even with ORC compression, a full copy of such a table takes more than 50 GB per day, and with three replicas in HDFS the storage footprint is three times that.
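To make the storage pressure concrete, the arithmetic can be sketched as follows. The 50 GB figure and the three replicas come from the text; the 365-day retention period is an assumption made only for this illustration.

```python
daily_table_gb = 50      # lower bound per daily full copy, as stated in the text
hdfs_replicas = 3        # HDFS replication factor mentioned in the text
retention_days = 365     # assumption for illustration only

# Raw disk space consumed by one daily full copy, and by a year of them.
daily_raw_gb = daily_table_gb * hdfs_replicas
yearly_raw_gb = daily_raw_gb * retention_days

print(f"per day:  {daily_raw_gb} GB")
print(f"per year: {yearly_raw_gb / 1000:.2f} TB")
```

Under these assumptions, keeping a daily full copy of this single table costs at least 150 GB of raw disk per day.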

So how should this table be designed? Here are several options:

  1. Option 1 (full table): Extract the latest data every day and overwrite the previous day's data. This is simple to implement and saves space, but the disadvantage is obvious: no historical state is kept.
  2. Option 2 (snapshot table): Store a full copy every day. Historical data can be queried, but the storage cost is too high; when customer information changes infrequently, most of the stored field values are duplicates.
  3. Option 3 (zipper table): With a zipper table, the historical state can still be queried and the storage cost is very low, because data that has not changed is not stored again.

3. Hive SQL practice

First, create a source table of customer information for testing:

CREATE TABLE IF NOT EXISTS datadev.zipper_table_test_cust_src (
  `cust_id` STRING COMMENT 'customer ID',
  `phone` STRING COMMENT 'phone number'
)PARTITIONED BY (
  dt STRING COMMENT 'etldate'
)STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;

Then insert some test data

cust_id phone dt
001 1111 20210601
002 2222 20210601
003 3333 20210601
004 4444 20210601
001 1111 20210602
002 2222-1 20210602
003 3333 20210602
004 4444-1 20210602
005 5555 20210602
001 1111-1 20210603
002 2222-2 20210603
003 3333 20210603
004 4444-1 20210603
005 5555-1 20210603
006 6666 20210603
002 2222-3 20210604
003 3333 20210604
004 4444-1 20210604
005 5555-1 20210604
006 6666 20210604
007 7777 20210604

A brief description of the data:

  • 20210601 is the starting date, with 4 customers in total
  • 20210602 updated customers 002 and 004, and added customer 005
  • 20210603 updated customers 001, 002 and 005, and added customer 006
  • 20210604 updated customer 002, added customer 007, and deleted customer 001

Now back to the topic: how should the zipper table be designed?

First of all, the zipper table has two important audit fields: the data effective date and the data expiration date. As the names imply, the effective date records when a record became valid, and the expiration date records when it stopped being valid (9999-12-31 means the record is still valid). Operations on the data then fall into the following categories:

  1. Newly added records: the effective date is today and the expiration date is 9999-12-31
  2. Unchanged records: keep the previous effective date; the expiration date remains unchanged
  3. Changed records: the old record is kept and its expiration date is set to today; a new record is added with today as the effective date and 9999-12-31 as the expiration date
  4. Deleted records: the record must be closed, so its expiration date is set to today
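The four cases above can be sketched in plain Python to check the logic on a small example. This is a minimal illustration and not the HQL implementation itself; the function name `apply_day` and the tuple layout are assumptions made for this sketch.

```python
OPEN_END = "9999-12-31"  # sentinel expiration date: the record is still valid

def apply_day(zipper, snapshot, today):
    """Apply one day's full extract to a zipper table.

    zipper   -- list of (cust_id, phone, s_date, e_date) tuples
    snapshot -- dict {cust_id: phone} holding today's full extract
    today    -- load date as 'yyyy-MM-dd'
    """
    result = []
    still_open = set()
    for cust_id, phone, s_date, e_date in zipper:
        if e_date != OPEN_END:
            # Already closed history: carry it over untouched.
            result.append((cust_id, phone, s_date, e_date))
        elif cust_id in snapshot and snapshot[cust_id] == phone:
            # Case 2, unchanged: keep the original effective date.
            result.append((cust_id, phone, s_date, OPEN_END))
            still_open.add(cust_id)
        else:
            # Case 3 (old version) and case 4 (deleted): close today.
            result.append((cust_id, phone, s_date, today))
    for cust_id, phone in snapshot.items():
        if cust_id not in still_open:
            # Case 1 (new) and case 3 (new version): open today.
            result.append((cust_id, phone, today, OPEN_END))
    return sorted(result)

day1 = apply_day([], {"001": "1111", "002": "2222"}, "2021-06-01")
day2 = apply_day(day1, {"002": "2222-1", "003": "3333"}, "2021-06-02")
for row in day2:
    print(row)
```

In this toy run, customer 001 is deleted on the second day, customer 002 changes its phone number, and customer 003 is new, so all of the closing and opening rules are exercised.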

Therefore, the HQL implementation of the zipper table is as follows:

-- DDL for the zipper table
CREATE TABLE IF NOT EXISTS datadev.zipper_table_test_cust_dst (
  `cust_id` STRING COMMENT 'customer ID',
  `phone` STRING COMMENT 'phone number',
  `s_date` DATE COMMENT 'effective date',
  `e_date` DATE COMMENT 'expiration date'
)STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;
-- Zipper table load (also supports data rollback and refresh)
INSERT OVERWRITE TABLE datadev.zipper_table_test_cust_dst
-- part1: new records, unchanged records, and the new version of changed records
select NVL(curr.cust_id, prev.cust_id) as cust_id,
       NVL(curr.phone, prev.phone) as phone,
       -- unchanged records: keep the previous s_date
       case when NVL(curr.phone, '') = NVL(prev.phone, '') then prev.s_date
            else NVL(curr.s_date, prev.s_date)
            end as s_date,
       NVL(curr.e_date, prev.e_date) as e_date
from (
  select cust_id, phone, DATE(from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd')) as s_date, DATE('9999-12-31') as e_date
  from datadev.zipper_table_test_cust_src
  where dt = '${etldate}'
) as curr

left join (
  select cust_id, phone, s_date, if(e_date > from_unixtime(unix_timestamp('${etldate}', 'yyyyMMdd'), 'yyyy-MM-dd'), DATE('9999-12-31'), e_date) as e_date,
         row_number() over(partition by cust_id order by e_date desc) as r_num -- take the latest state
  from datadev.zipper_table_test_cust_dst
  where regexp_replace(s_date, '-', '') <= '${etldate}' -- roll the zipper table back to etldate
) as prev
on curr.cust_id = prev.cust_id
and prev.r_num = 1

union all

-- part2: deleted records, and the old version of changed records
select prev_cust.cust_id, prev_cust.phone, prev_cust.s_date,
       case when e_date <> '9999-12-31' then e_date
            else DATE(from_unixtime(unix_timestamp('${etldate}', 'yyyyMMdd'), 'yyyy-MM-dd'))
            END as e_date
from (
  select cust_id, phone, s_date, if(e_date > from_unixtime(unix_timestamp('${etldate}', 'yyyyMMdd'), 'yyyy-MM-dd'), DATE('9999-12-31'), e_date) as e_date
  from datadev.zipper_table_test_cust_dst
  where regexp_replace(s_date, '-', '') <= '${etldate}' -- roll the zipper table back to etldate
) as prev_cust

left join (
  select cust_id, phone
  from datadev.zipper_table_test_cust_src
  where dt = '${etldate}'
) as curr_cust
on curr_cust.cust_id = prev_cust.cust_id
-- keep only the rows that changed
where NVL(prev_cust.phone, '') <> NVL(curr_cust.phone, '')
;

4. Test

4.1 The first day (20210601): Replace ${etldate} with 20210601 and execute the SQL. This is the initial state and no customer information has changed yet, so the effective date is 2021-06-01 and the expiration date is 9999-12-31 (meaning the record is currently valid).

zipper_table_test_cust_dst.cust_id zipper_table_test_cust_dst.phone zipper_table_test_cust_dst.s_date zipper_table_test_cust_dst.e_date
001 1111 2021-06-01 9999-12-31
002 2222 2021-06-01 9999-12-31
003 3333 2021-06-01 9999-12-31
004 4444 2021-06-01 9999-12-31

4.2 The second day (20210602): Replace ${etldate} with 20210602 and execute the SQL. The source table has changed the phone numbers of customers 002 and 004, so each of them now has two records: one holds the historical state and the other the current state. The source table has also added customer 005, whose record has an effective date of 2021-06-02 and an expiration date of 9999-12-31.

zipper_table_test_cust_dst.cust_id zipper_table_test_cust_dst.phone zipper_table_test_cust_dst.s_date zipper_table_test_cust_dst.e_date
001 1111 2021-06-01 9999-12-31
002 2222 2021-06-01 2021-06-02
002 2222-1 2021-06-02 9999-12-31
003 3333 2021-06-01 9999-12-31
004 4444 2021-06-01 2021-06-02
004 4444-1 2021-06-02 9999-12-31
005 5555 2021-06-02 9999-12-31

4.3 The third day (20210603): Replace ${etldate} with 20210603 and execute the SQL. The source table has changed customers 001, 002 and 005, and added customer 006.

zipper_table_test_cust_dst.cust_id zipper_table_test_cust_dst.phone zipper_table_test_cust_dst.s_date zipper_table_test_cust_dst.e_date
001 1111 2021-06-01 2021-06-03
001 1111-1 2021-06-03 9999-12-31
002 2222 2021-06-01 2021-06-02
002 2222-1 2021-06-02 2021-06-03
002 2222-2 2021-06-03 9999-12-31
003 3333 2021-06-01 9999-12-31
004 4444 2021-06-01 2021-06-02
004 4444-1 2021-06-02 9999-12-31
005 5555 2021-06-02 2021-06-03
005 5555-1 2021-06-03 9999-12-31
006 6666 2021-06-03 9999-12-31

4.4 The fourth day (20210604): Replace ${etldate} with 20210604 and execute the SQL. The source table has updated customer 002, added customer 007, and deleted customer 001. Note that when a record is deleted, its expiration date is set to the current day.

zipper_table_test_cust_dst.cust_id zipper_table_test_cust_dst.phone zipper_table_test_cust_dst.s_date zipper_table_test_cust_dst.e_date
001 1111 2021-06-01 2021-06-03
001 1111-1 2021-06-03 2021-06-04
002 2222 2021-06-01 2021-06-02
002 2222-1 2021-06-02 2021-06-03
002 2222-2 2021-06-03 2021-06-04
002 2222-3 2021-06-04 9999-12-31
003 3333 2021-06-01 9999-12-31
004 4444 2021-06-01 2021-06-02
004 4444-1 2021-06-02 9999-12-31
005 5555 2021-06-02 2021-06-03
005 5555-1 2021-06-03 9999-12-31
006 6666 2021-06-03 9999-12-31
007 7777 2021-06-04 9999-12-31
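As a rough check on the storage argument, the row counts of the test data can be compared with what a snapshot table would hold for the same four days. The sketch below only counts rows taken from the listings above; it ignores the two extra date columns of the zipper table, and the variable names are chosen for this example.

```python
# Customers present in the source table on each day (from the test data above).
daily_rows = {"20210601": 4, "20210602": 5, "20210603": 6, "20210604": 6}
# Rows in the zipper table after the 20210604 run (from the result above).
zipper_rows = 13

# A snapshot table stores every customer once per day.
snapshot_rows = sum(daily_rows.values())
saving = 1 - zipper_rows / snapshot_rows

print(f"snapshot table rows: {snapshot_rows}")
print(f"zipper table rows:   {zipper_rows}")
print(f"rows saved:          {saving:.0%}")
```

The test data changes far more often than a real customer table would, so the saving here is modest; the fewer the changes, the larger the gap becomes.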

5. Data rollback and refresh of the zipper table

The latest status of the zipper table can be viewed through the following code

select * from datadev.zipper_table_test_cust_dst where e_date = '9999-12-31';

View the historical state/snapshot of the zipper table through the following code

-- view the snapshot of the zipper table as of 20210602
select cust_id, phone, s_date, if(e_date > '2021-06-02', DATE('9999-12-31'), e_date) as e_date
from datadev.zipper_table_test_cust_dst
where s_date <= '2021-06-02'; 

Therefore, to roll back and refresh the zipper table, we only need to compute the historical snapshot of the target day with the code above and then rerun the load from there. (Note: the zipper table insert statement posted earlier already includes the rollback and refresh logic. Readers can test it themselves: replace ${etldate} with the date to roll back to, comment out the INSERT OVERWRITE TABLE line, and run only the select to view the result.)
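The two queries above can be mirrored in a few lines of Python, which may help when reasoning about the rollback rule. This is a sketch under the assumption that dates are 'yyyy-MM-dd' strings, so plain string comparison orders them correctly; the function names are hypothetical.

```python
OPEN_END = "9999-12-31"

def snapshot_as_of(zipper, as_of):
    """Return the zipper table as it looked after the load of `as_of`."""
    rows = []
    for cust_id, phone, s_date, e_date in zipper:
        if s_date > as_of:
            continue  # this version did not exist yet
        if e_date > as_of:
            e_date = OPEN_END  # closed later, so still open on that day
        rows.append((cust_id, phone, s_date, e_date))
    return rows

def latest_state(zipper):
    """Return only the records that are currently valid."""
    return [row for row in zipper if row[3] == OPEN_END]

# Rows for customers 001, 002 and 007 from the 20210604 result above.
zipper = [
    ("001", "1111", "2021-06-01", "2021-06-03"),
    ("001", "1111-1", "2021-06-03", "2021-06-04"),
    ("002", "2222", "2021-06-01", "2021-06-02"),
    ("002", "2222-1", "2021-06-02", "2021-06-03"),
    ("002", "2222-2", "2021-06-03", "2021-06-04"),
    ("002", "2222-3", "2021-06-04", "9999-12-31"),
    ("007", "7777", "2021-06-04", "9999-12-31"),
]

for row in snapshot_as_of(zipper, "2021-06-02"):
    print(row)
```

For these three customers, the rows returned for 2021-06-02 agree with the 4.2 result: versions that started later are dropped, and versions that were closed later are reopened.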

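6. Another implementation

The previous implementation has one drawback: as the zipper table grows, each run takes longer. It can be improved by combining Hive with ES (Elasticsearch).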

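-- Zipper table implementation (Hive stores only the new/updated rows; the full data set is stored in ES)

-- Staging table: holds only the records added or changed on day T-1
CREATE TABLE IF NOT EXISTS datadev.zipper_table_test_cust_dst_2 (
  `id` STRING COMMENT 'es id',
  `cust_id` STRING COMMENT 'customer ID',
  `phone` STRING COMMENT 'phone number',
  `s_date` DATE COMMENT 'effective date',
  `e_date` DATE COMMENT 'expiration date'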
)STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;

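-- Helper statements for resetting or inspecting the table between test runs; run manually when needed
-- drop table datadev.zipper_table_test_cust_dst_2;
-- select * from datadev.zipper_table_test_cust_dst_2 a;
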
INSERT OVERWRITE TABLE datadev.zipper_table_test_cust_dst_2
  select concat_ws('-', curr.s_date, curr.cust_id) as id,
         curr.cust_id as cust_id,
         curr.phone as phone,
         DATE(curr.s_date) as s_date,
         DATE('9999-12-31') as e_date
  from (
    select cust_id, phone, from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd') as s_date
    from datadev.zipper_table_test_cust_src
    where dt = '20210603' -- etldate 
  ) as curr

    left join (
      select *
      from datadev.zipper_table_test_cust_src
      where dt = '20210602' -- prev_date
    ) as prev
      on prev.cust_id = curr.cust_id
  where NVL(curr.phone, '') <> NVL(prev.phone, '')

  union all

  select concat_ws('-', STRING(prev.s_date), prev.cust_id) as id,
         prev.cust_id as cust_id,
         prev.phone as phone,
         prev.s_date as s_date,
         case when NVL(prev.phone, '') = NVL(curr.phone, '') then prev.e_date
         else DATE(from_unixtime(unix_timestamp(curr.dt, 'yyyyMMdd'), 'yyyy-MM-dd'))
         end as e_date
  from (
    select cust_id, phone, s_date, e_date,
    -- update only the latest record of each customer
    row_number() over(partition by cust_id order by s_date desc) as r_num
    from datadev.zipper_table_test_cust_dst_2
  ) as prev
  
  inner join (
      select *
      from datadev.zipper_table_test_cust_src
      where dt = '20210603' -- etldate 
  ) as curr
  on prev.cust_id = curr.cust_id
  where prev.r_num = 1 
  ;
  
  
   
  
-- mock: load delta data to es
CREATE TABLE IF NOT EXISTS datadev.es_zipper (
  `id` STRING COMMENT 'es id',
  `cust_id` STRING COMMENT 'customer ID',
  `phone` STRING COMMENT 'phone number',
  `s_date` DATE COMMENT 'effective date',
  `e_date` DATE COMMENT 'expiration date'
)STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;  

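-- Helper statements for resetting or inspecting the mock table between test runs; run manually when needed
-- drop table datadev.es_zipper;
-- select * from datadev.es_zipper;
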
INSERT OVERWRITE TABLE datadev.es_zipper
SELECT nvl(curr.id, prev.id) as id,
nvl(curr.cust_id, prev.cust_id) as cust_id,
nvl(curr.phone, prev.phone) as phone,
nvl(curr.s_date, prev.s_date) as s_date,
nvl(curr.e_date, prev.e_date) as e_date
FROM datadev.es_zipper prev

full join datadev.zipper_table_test_cust_dst_2 curr
on curr.id = prev.id;
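The full join above behaves like an upsert keyed by id: rows from the staging table overwrite rows with the same id, and rows with a new id are appended. A minimal Python sketch of that merge follows; the function name `merge_by_id` and the dict layout are assumptions made for this example, and `None` stands in for SQL NULL.

```python
def merge_by_id(store, delta):
    """Merge delta rows into store by id, preferring non-null delta values.

    store, delta -- dict {id: (cust_id, phone, s_date, e_date)}
    """
    merged = dict(store)
    for row_id, new_row in delta.items():
        old_row = merged.get(row_id)
        if old_row is None:
            merged[row_id] = new_row  # id not seen before: insert
        else:
            # Same id: take each delta column, falling back like nvl(curr, prev).
            merged[row_id] = tuple(
                new if new is not None else old
                for new, old in zip(new_row, old_row)
            )
    return merged

store = {
    "2021-06-02-002": ("002", "2222-1", "2021-06-02", "9999-12-31"),
}
delta = {
    # The existing version is closed on 2021-06-03 ...
    "2021-06-02-002": ("002", "2222-1", "2021-06-02", "2021-06-03"),
    # ... and a new version is opened on the same day.
    "2021-06-03-002": ("002", "2222-2", "2021-06-03", "9999-12-31"),
}

merged = merge_by_id(store, delta)
for row_id in sorted(merged):
    print(row_id, merged[row_id])
```

Because the id combines the effective date and the customer ID, closing an old version and opening a new one never collide on the same key.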

Origin blog.csdn.net/qq_37771475/article/details/118112246