1. Introduction
- Incremental table: Partitioned by date; each partition stores only that day's incremental data, that is, new and changed records.
- Full-scale table: No date partition; it is overwritten every day and stores only the latest state of the data, so historical changes cannot be recorded.
- Snapshot table: Partitioned by date, and every partition holds the full data for that day (whether anything changed or not). The disadvantage is that the partitions store a lot of duplicate data, wasting storage space.
- Zipper table: A table that maintains both the historical states and the latest state of the data. At a given zipper granularity (for example, daily), it is effectively equivalent to a snapshot table, but optimized so that unchanged records are not stored again.
2. Application scenarios
The zipper table suits scenarios where the data volume is large, historical snapshots must remain viewable, and only a small proportion of fields change, and change infrequently.
For example, consider a customer table with tens of millions of records and hundreds of fields. Even with ORC compression, a single day's copy of such a table exceeds 50 GB, and with HDFS keeping three replicas the storage footprint is larger still.
So how should this table be designed? Here are several options:
- Option 1 (full-scale table): Extract the latest data every day and overwrite the previous day's data. The advantage is that it is simple to implement and saves space; the obvious disadvantage is that no historical state is kept.
- Option 2 (snapshot table): Store a full copy every day. Historical data can be viewed, but the storage cost is too high; when customer information rarely changes, most of what is stored is duplicated.
- Option 3 (zipper table): The historical state can still be viewed, and storage usage is far lower, because data that has not changed is not stored repeatedly.
3. Hive SQL practice
First, create an original table of customer information for testing:
CREATE TABLE IF NOT EXISTS datadev.zipper_table_test_cust_src (
    `cust_id` STRING COMMENT 'customer ID',
    `phone` STRING COMMENT 'mobile phone number'
) PARTITIONED BY (
    dt STRING COMMENT 'etldate'
) STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;
Then insert some test data:
cust_id | phone | dt |
001 | 1111 | 20210601 |
002 | 2222 | 20210601 |
003 | 3333 | 20210601 |
004 | 4444 | 20210601 |
001 | 1111 | 20210602 |
002 | 2222-1 | 20210602 |
003 | 3333 | 20210602 |
004 | 4444-1 | 20210602 |
005 | 5555 | 20210602 |
001 | 1111-1 | 20210603 |
002 | 2222-2 | 20210603 |
003 | 3333 | 20210603 |
004 | 4444-1 | 20210603 |
005 | 5555-1 | 20210603 |
006 | 6666 | 20210603 |
002 | 2222-3 | 20210604 |
003 | 3333 | 20210604 |
004 | 4444-1 | 20210604 |
005 | 5555-1 | 20210604 |
006 | 6666 | 20210604 |
007 | 7777 | 20210604 |
A brief description of the data is as follows:
- 20210601 is the starting date; there are 4 customers in total
- 20210602 updated the information of customers 002 and 004, and added customer 005
- 20210603 updated the information of customers 001, 002, and 005, and added customer 006
- 20210604 updated the information of customer 002, added customer 007, and deleted customer 001
Now back to the topic, how to design the zipper table?
First of all, the zipper table has two important audit fields: the data effective date and the data expiration date. As the names imply, the effective date records when a record became effective, and the expiration date records when it stopped being valid (9999-12-31 means the record is still valid). The operations on the data then fall into the following categories:
- Newly added records: the effective date is today, and the expiration date is 9999-12-31
- Records without change: keep the previous effective date, and leave the expiration date unchanged
- Records with changes: the old record is kept and its expiration date is set to today; the new record is added with today as the effective date and 9999-12-31 as the expiration date
- Deleted records: the chain must be closed, so the expiration date is set to today
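The four rules above can be sketched in a few lines of Python before turning to SQL. This is only an illustrative model, not the author's implementation: the function name `apply_day`, the constant `OPEN_END`, and the in-memory tuple layout are assumptions made for the example, and only the `phone` field is tracked, as in the test table.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel expiration date: "still valid"


def apply_day(zipper, snapshot, today):
    """Return the next zipper rows from the previous rows and today's full extract.

    zipper:   list of (cust_id, phone, s_date, e_date) tuples
    snapshot: dict mapping cust_id -> phone for today's data
    """
    result = []
    still_open = set()
    for cust_id, phone, s_date, e_date in zipper:
        if e_date != OPEN_END:
            # Already-closed history is carried over untouched.
            result.append((cust_id, phone, s_date, e_date))
        elif cust_id in snapshot and snapshot[cust_id] == phone:
            # Record without change: keep the previous effective date.
            result.append((cust_id, phone, s_date, OPEN_END))
            still_open.add(cust_id)
        else:
            # Changed or deleted record: close the old row today.
            result.append((cust_id, phone, s_date, today))
    for cust_id, phone in snapshot.items():
        if cust_id not in still_open:
            # Newly added record, or the new version of a changed record.
            result.append((cust_id, phone, today, OPEN_END))
    return sorted(result)


day1 = apply_day([], {"001": "1111", "002": "2222"}, date(2021, 6, 1))
day2 = apply_day(day1, {"001": "1111", "002": "2222-1", "005": "5555"}, date(2021, 6, 2))
for row in day2:
    print(row)
```

Customer 002 ends up with one closed row and one open row, 001 keeps its original effective date, and 005 starts on the second day, which is the behavior the SQL has to reproduce.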
Therefore, the HQL implementation code of the zipper table is as follows:
-- Zipper table DDL
CREATE TABLE IF NOT EXISTS datadev.zipper_table_test_cust_dst (
    `cust_id` STRING COMMENT 'customer ID',
    `phone` STRING COMMENT 'mobile phone number',
    `s_date` DATE COMMENT 'effective date',
    `e_date` DATE COMMENT 'expiration date'
) STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;
-- Zipper table load (includes data rollback and refresh)
INSERT OVERWRITE TABLE datadev.zipper_table_test_cust_dst
-- part1: new records, unchanged records, and the new version of changed records
select curr.cust_id as cust_id,
       curr.phone as phone,
       -- unchanged record (same phone as the still-open previous row): keep the previous s_date
       case when prev.e_date = DATE('9999-12-31')
                 and NVL(curr.phone, '') = NVL(prev.phone, '') then prev.s_date
            else curr.s_date
       end as s_date,
       curr.e_date as e_date
from (
    select cust_id, phone, DATE(from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd')) as s_date, DATE('9999-12-31') as e_date
    from datadev.zipper_table_test_cust_src
    where dt = '${etldate}'
) as curr
left join (
    select cust_id, phone, s_date, if(e_date > from_unixtime(unix_timestamp('${etldate}', 'yyyyMMdd'), 'yyyy-MM-dd'), DATE('9999-12-31'), e_date) as e_date,
           row_number() over(partition by cust_id order by e_date desc) as r_num -- latest state per customer
    from datadev.zipper_table_test_cust_dst
    where regexp_replace(s_date, '-', '') <= '${etldate}' -- roll the zipper table history back to etldate
) as prev
on curr.cust_id = prev.cust_id
and prev.r_num = 1
union all
-- part2: deleted records, the old version of changed records, and already-closed history
select prev_cust.cust_id, prev_cust.phone, prev_cust.s_date,
       case when prev_cust.e_date <> DATE('9999-12-31') then prev_cust.e_date -- already closed: keep as is
            else DATE(from_unixtime(unix_timestamp('${etldate}', 'yyyyMMdd'), 'yyyy-MM-dd')) -- close today
       end as e_date
from (
    select cust_id, phone, s_date, if(e_date > from_unixtime(unix_timestamp('${etldate}', 'yyyyMMdd'), 'yyyy-MM-dd'), DATE('9999-12-31'), e_date) as e_date
    from datadev.zipper_table_test_cust_dst
    where regexp_replace(s_date, '-', '') <= '${etldate}' -- roll the zipper table history back to etldate
) as prev_cust
left join (
    select cust_id, phone
    from datadev.zipper_table_test_cust_src
    where dt = '${etldate}'
) as curr_cust
on curr_cust.cust_id = prev_cust.cust_id
-- always keep closed history rows; of the open rows, keep only those deleted or changed today
-- (open rows without change are already emitted by part1)
where prev_cust.e_date <> DATE('9999-12-31')
   or curr_cust.cust_id is null
   or NVL(prev_cust.phone, '') <> NVL(curr_cust.phone, '')
;
4. Test
4.1 The first day (20210601): Replace ${etldate} with 20210601 and execute the SQL. This is the initial state and no customer information has changed yet, so every record has an effective date of 2021-06-01 and an expiration date of 9999-12-31 (meaning it is currently valid).
zipper_table_test_cust_dst.cust_id | zipper_table_test_cust_dst.phone | zipper_table_test_cust_dst.s_date | zipper_table_test_cust_dst.e_date |
001 | 1111 | 2021-06-01 | 9999-12-31 |
002 | 2222 | 2021-06-01 | 9999-12-31 |
003 | 3333 | 2021-06-01 | 9999-12-31 |
004 | 4444 | 2021-06-01 | 9999-12-31 |
4.2 The second day (20210602): Replace ${etldate} with 20210602 and execute the SQL. The original table has changed the mobile phone numbers of customers 002 and 004, so each of them now has two records: one holds the historical state and the other holds the current state. The original table has also added customer 005, whose record has an effective date of 2021-06-02 and an expiration date of 9999-12-31.
zipper_table_test_cust_dst.cust_id | zipper_table_test_cust_dst.phone | zipper_table_test_cust_dst.s_date | zipper_table_test_cust_dst.e_date |
001 | 1111 | 2021-06-01 | 9999-12-31 |
002 | 2222 | 2021-06-01 | 2021-06-02 |
002 | 2222-1 | 2021-06-02 | 9999-12-31 |
003 | 3333 | 2021-06-01 | 9999-12-31 |
004 | 4444 | 2021-06-01 | 2021-06-02 |
004 | 4444-1 | 2021-06-02 | 9999-12-31 |
005 | 5555 | 2021-06-02 | 9999-12-31 |
4.3 The third day (20210603): Replace ${etldate} with 20210603 and execute the SQL. The original table has modified customers 001, 002, and 005, and added customer 006.
zipper_table_test_cust_dst.cust_id | zipper_table_test_cust_dst.phone | zipper_table_test_cust_dst.s_date | zipper_table_test_cust_dst.e_date |
001 | 1111 | 2021-06-01 | 2021-06-03 |
001 | 1111-1 | 2021-06-03 | 9999-12-31 |
002 | 2222 | 2021-06-01 | 2021-06-02 |
002 | 2222-1 | 2021-06-02 | 2021-06-03 |
002 | 2222-2 | 2021-06-03 | 9999-12-31 |
003 | 3333 | 2021-06-01 | 9999-12-31 |
004 | 4444 | 2021-06-01 | 2021-06-02 |
004 | 4444-1 | 2021-06-02 | 9999-12-31 |
005 | 5555 | 2021-06-02 | 2021-06-03 |
005 | 5555-1 | 2021-06-03 | 9999-12-31 |
006 | 6666 | 2021-06-03 | 9999-12-31 |
4.4 The fourth day (20210604): Replace ${etldate} with 20210604 and execute the SQL. The original table has updated customer 002, added customer 007, and deleted customer 001. Note that when a record is deleted, its expiration date is set to the current day.
zipper_table_test_cust_dst.cust_id | zipper_table_test_cust_dst.phone | zipper_table_test_cust_dst.s_date | zipper_table_test_cust_dst.e_date |
001 | 1111 | 2021-06-01 | 2021-06-03 |
001 | 1111-1 | 2021-06-03 | 2021-06-04 |
002 | 2222 | 2021-06-01 | 2021-06-02 |
002 | 2222-1 | 2021-06-02 | 2021-06-03 |
002 | 2222-2 | 2021-06-03 | 2021-06-04 |
002 | 2222-3 | 2021-06-04 | 9999-12-31 |
003 | 3333 | 2021-06-01 | 9999-12-31 |
004 | 4444 | 2021-06-01 | 2021-06-02 |
004 | 4444-1 | 2021-06-02 | 9999-12-31 |
005 | 5555 | 2021-06-02 | 2021-06-03 |
005 | 5555-1 | 2021-06-03 | 9999-12-31 |
006 | 6666 | 2021-06-03 | 9999-12-31 |
007 | 7777 | 2021-06-04 | 9999-12-31 |
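A result like the one above can also be checked mechanically. The sketch below is an optional sanity check, not part of the original workflow: `check_zipper` is a hypothetical helper that verifies two properties of the rows shown in 4.4, namely that each customer has at most one open record and that each record's expiration date equals the effective date of the next one. It assumes a customer is never re-added after deletion, since that would create a legitimate gap.

```python
from collections import defaultdict

OPEN_END = "9999-12-31"  # sentinel expiration date used by the zipper table

# Zipper table contents after the fourth day's run (copied from 4.4).
DAY4 = [
    ("001", "1111", "2021-06-01", "2021-06-03"),
    ("001", "1111-1", "2021-06-03", "2021-06-04"),
    ("002", "2222", "2021-06-01", "2021-06-02"),
    ("002", "2222-1", "2021-06-02", "2021-06-03"),
    ("002", "2222-2", "2021-06-03", "2021-06-04"),
    ("002", "2222-3", "2021-06-04", "9999-12-31"),
    ("003", "3333", "2021-06-01", "9999-12-31"),
    ("004", "4444", "2021-06-01", "2021-06-02"),
    ("004", "4444-1", "2021-06-02", "9999-12-31"),
    ("005", "5555", "2021-06-02", "2021-06-03"),
    ("005", "5555-1", "2021-06-03", "9999-12-31"),
    ("006", "6666", "2021-06-03", "9999-12-31"),
    ("007", "7777", "2021-06-04", "9999-12-31"),
]


def check_zipper(rows):
    """Return a list of human-readable problems found in the zipper rows."""
    problems = []
    by_cust = defaultdict(list)
    for cust_id, phone, s_date, e_date in rows:
        by_cust[cust_id].append((s_date, e_date, phone))
    for cust_id, versions in by_cust.items():
        versions.sort()  # ISO dates sort chronologically as strings
        if sum(1 for v in versions if v[1] == OPEN_END) > 1:
            problems.append(f"{cust_id}: more than one open record")
        for (_, e1, _), (s2, _, _) in zip(versions, versions[1:]):
            if e1 != s2:
                problems.append(f"{cust_id}: gap or overlap between {e1} and {s2}")
    return problems


print(check_zipper(DAY4))
```

An empty list means the rows form a clean chain per customer; any message points at the customer whose history needs a closer look.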
5. Data rollback and refresh of the zipper table
The latest state of the zipper table can be viewed with the following code:
select * from datadev.zipper_table_test_cust_dst where e_date = '9999-12-31';
The historical state (snapshot) of the zipper table can be viewed with the following code:
-- View the snapshot of the zipper table as of 20210602
select cust_id, phone, s_date, if(e_date > '2021-06-02', DATE('9999-12-31'), e_date) as e_date
from datadev.zipper_table_test_cust_dst
where s_date <= '2021-06-02';
Therefore, to roll back and refresh the zipper table, we only need to find the historical snapshot of the target day with the code above and then rebuild from it. (Note: the zipper table INSERT statement I posted above already includes data rollback and refresh. Readers can test it themselves: replace ${etldate} with the date to roll back to, comment out the INSERT OVERWRITE TABLE line, and run only the select to view the results.)
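The two queries in this section can be mirrored in Python to make the rollback rule easier to follow. This is a minimal sketch over in-memory tuples with ISO date strings; `snapshot_as_of` and `latest_state` are illustrative names, not functions from Hive or from the original code.

```python
OPEN_END = "9999-12-31"  # sentinel expiration date, as in the zipper table


def latest_state(rows):
    """Rows that are still valid (e_date = 9999-12-31)."""
    return [row for row in rows if row[3] == OPEN_END]


def snapshot_as_of(rows, day):
    """Rebuild the zipper table as it looked at the end of `day` ('YYYY-MM-DD')."""
    snapshot = []
    for cust_id, phone, s_date, e_date in rows:
        if s_date > day:
            continue  # this version did not exist yet on `day`
        # A row that was closed after `day` was still open on `day`.
        snapshot.append((cust_id, phone, s_date, OPEN_END if e_date > day else e_date))
    return snapshot


# Rows of customers 001 and 002 after the fourth day's run (from 4.4).
rows = [
    ("001", "1111", "2021-06-01", "2021-06-03"),
    ("001", "1111-1", "2021-06-03", "2021-06-04"),
    ("002", "2222", "2021-06-01", "2021-06-02"),
    ("002", "2222-1", "2021-06-02", "2021-06-03"),
    ("002", "2222-2", "2021-06-03", "2021-06-04"),
    ("002", "2222-3", "2021-06-04", "9999-12-31"),
]
for row in snapshot_as_of(rows, "2021-06-02"):
    print(row)
```

For these two customers, the snapshot as of 2021-06-02 matches the rows listed in 4.2: later versions are dropped, and rows closed after that day are reopened.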
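6. Another implementation
The previous implementation has one drawback: as the amount of data in the zipper table grows, the execution time of each run grows with it. It therefore needs to be improved, and one way is to combine Hive with ES (Elasticsearch).
-- Zipper table implementation (Hive stores only the new/updated delta; the full data is stored in ES)
-- Temporary table that holds only the records added or changed on day T-1
CREATE TABLE IF NOT EXISTS datadev.zipper_table_test_cust_dst_2 (
    `id` STRING COMMENT 'es id',
    `cust_id` STRING COMMENT 'customer ID',
    `phone` STRING COMMENT 'mobile phone number',
    `s_date` DATE COMMENT 'effective date',
    `e_date` DATE COMMENT 'expiration date'
) STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;
-- Helper statements for resetting and inspecting the table while testing (not part of the load):
-- drop table datadev.zipper_table_test_cust_dst_2;
-- select * from datadev.zipper_table_test_cust_dst_2 a;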
INSERT OVERWRITE TABLE datadev.zipper_table_test_cust_dst_2
select concat_ws('-', curr.s_date, curr.cust_id) as id,
curr.cust_id as cust_id,
curr.phone as phone,
DATE(curr.s_date) as s_date,
DATE('9999-12-31') as e_date
from (
select cust_id, phone, from_unixtime(unix_timestamp(dt, 'yyyyMMdd'), 'yyyy-MM-dd') as s_date
from datadev.zipper_table_test_cust_src
where dt = '20210603' -- etldate
) as curr
left join (
select *
from datadev.zipper_table_test_cust_src
where dt = '20210602' -- prev_date
) as prev
on prev.cust_id = curr.cust_id
where NVL(curr.phone, '') <> NVL(prev.phone, '')
union all
select concat_ws('-', STRING(prev.s_date), prev.cust_id) as id,
prev.cust_id as cust_id,
prev.phone as phone,
prev.s_date as s_date,
case when NVL(prev.phone, '') = NVL(curr.phone, '') then prev.e_date
else DATE(from_unixtime(unix_timestamp(curr.dt, 'yyyyMMdd'), 'yyyy-MM-dd'))
end as e_date
from (
select cust_id, phone, s_date, e_date,
-- only the latest record of each customer is updated
row_number() over(partition by cust_id order by s_date desc) as r_num
from datadev.zipper_table_test_cust_dst_2
) as prev
inner join (
select *
from datadev.zipper_table_test_cust_src
where dt = '20210603' -- etldate
) as curr
on prev.cust_id = curr.cust_id
where prev.r_num = 1
;
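-- Mock: load the delta data into ES (a Hive table stands in for ES here)
CREATE TABLE IF NOT EXISTS datadev.es_zipper (
    `id` STRING COMMENT 'es id',
    `cust_id` STRING COMMENT 'customer ID',
    `phone` STRING COMMENT 'mobile phone number',
    `s_date` DATE COMMENT 'effective date',
    `e_date` DATE COMMENT 'expiration date'
) STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
;
-- Helper statements for resetting and inspecting the table while testing (not part of the load):
-- drop table datadev.es_zipper;
-- select * from datadev.es_zipper;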
INSERT OVERWRITE TABLE datadev.es_zipper
SELECT nvl(curr.id, prev.id) as id,
nvl(curr.cust_id, prev.cust_id) as cust_id,
nvl(curr.phone, prev.phone) as phone,
nvl(curr.s_date, prev.s_date) as s_date,
nvl(curr.e_date, prev.e_date) as e_date
FROM datadev.es_zipper prev
full join datadev.zipper_table_test_cust_dst_2 curr
on curr.id = prev.id;
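The full join with nvl(curr.x, prev.x) above behaves like an upsert keyed by id: rows present in the delta replace or extend the stored rows, and all other stored rows stay as they are. A minimal Python sketch of that idea follows. The names `make_id` and `upsert` are hypothetical, the store is a plain dict rather than a real ES index, and delta rows are assumed to carry no NULL columns.

```python
def make_id(s_date, cust_id):
    """Same key as concat_ws('-', s_date, cust_id) in the SQL above."""
    return f"{s_date}-{cust_id}"


def upsert(store, delta):
    """Merge delta rows into the store; both map id -> (cust_id, phone, s_date, e_date)."""
    merged = dict(store)   # rows absent from the delta are kept unchanged
    merged.update(delta)   # rows in the delta overwrite or extend the store
    return merged


# Full data before the 20210603 run: customer 002 has one open record.
store = {
    make_id("2021-06-02", "002"): ("002", "2222-1", "2021-06-02", "9999-12-31"),
}
# Delta for 20210603: the old record is closed and a new one is added.
delta = {
    make_id("2021-06-02", "002"): ("002", "2222-1", "2021-06-02", "2021-06-03"),
    make_id("2021-06-03", "002"): ("002", "2222-2", "2021-06-03", "9999-12-31"),
}
store = upsert(store, delta)
for key in sorted(store):
    print(key, store[key])
```

Because the id combines the effective date and the customer ID, closing an old record and adding its successor are both ordinary writes by key, so the daily job only has to touch the rows in the delta.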