Implementing a zipper table in a Hive data warehouse

When building a data warehouse, we often need to find the historical states of a piece of data and the points in time at which its state changed, for example all historical change records of a user. Rows in the business database are updated in place and sometimes physically deleted, and because id is the primary key, the table only ever holds the latest record for each user. If the user table is synchronized in full every day, a user's change records cannot be found in the latest partition alone. Keeping every daily full snapshot makes them recoverable, but it wastes a lot of space, and the queries are slow and their logic is complicated. A zipper table saves space and also makes it quick to query a user's change records, change types, and change dates.

This solution improves on the one in my previous blog post, "The update method of the zipper table in the data warehouse".
It supports re-running data: if one day's data turns out to be wrong and the ods table is re-synchronized, you only need to re-run the zipper table from that day onward, because this time the zipper table is designed as a partitioned table.
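
A re-run of this kind could look roughly like the sketch below. It assumes the Hive CLI, and it assumes the daily merge statement from the last step of this post has been saved in a file; the path /path/to/zipper_daily.sql is hypothetical. Because each day's partition is built from the previous day's partition, the days have to be re-run in ascending date order, starting from the first affected day.

```sql
-- Sketch only: re-run the zipper table from 20201213 onward with the Hive CLI.
-- /path/to/zipper_daily.sql is a hypothetical file holding the daily merge statement;
-- that statement reads ${today} and ${yesterday}, which are set here as hivevar variables.

-- first affected day
set hivevar:yesterday=20201212;
set hivevar:today=20201213;
source /path/to/zipper_daily.sql;

-- every later day, in ascending order, because each run reads
-- the zipper partition written by the run before it
set hivevar:yesterday=20201213;
set hivevar:today=20201214;
source /path/to/zipper_daily.sql;
```

Since the daily statement uses insert overwrite on a single partition, each re-run simply replaces that day's partition and leaves earlier partitions untouched.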

The following is the implementation method:

  1. First, create a table in ods with the same structure as the business database table, and synchronize its data in full every day
drop table if exists flowtest.tmp_user_test_df;
create table if not exists flowtest.tmp_user_test_df (
      user_id          string comment 'user id',
      user_name        string comment 'user name',
      age              bigint comment 'user age',
      gender           string comment 'gender'
) comment 'user information table'
partitioned by ( pt_day string comment 'partition date (day)' )
--lifecycle 5  --DataWorks and 数栈 provide this keyword to set the table lifecycle
;
  2. Create the historical full zipper table in ods
drop table if exists flowtest.tmp_user_std_test_df;
create table if not exists flowtest.tmp_user_std_test_df (
      user_id          string comment 'user id',
      user_name        string comment 'user name',
      age              bigint comment 'user age',
      gender           string comment 'gender',
      md5_key          string comment 'md5 of all fields',
      create_date      string comment 'creation date',
      operation_date   string comment 'operation date',
      operation_type   string comment 'operation type (A: initial full load, U: update, I: insert, D: physical delete)'
) comment 'user information zipper table'
partitioned by ( pt_day string comment 'partition date (day)' )
--lifecycle 5  --DataWorks and 数栈 provide this keyword to set the table lifecycle
;
  3. Insert simulated data

insert into table flowtest.tmp_user_test_df partition (pt_day = '20201212') values
("00001", "00001", 26, "男"), 
("00002", "00002", 27, "男"), 
("00003", "00003", 21, "女"), 
("00004", "00004", 22, "女"), 
("00005", "00005", 26, "女"), 
("00006", "00006", 26, "男"), 
("00007", "00007", 26, "男"), 
("00008", "00008", 26, "男"),
("00009", "00009", 26, "男"),
("00010", "00010", 26, "男"),
("00011", "00011", 26, "男"),
("00012", "00012", 26, "男");

--20201213: delete user 00012, add 00013, modify 00006
insert into table flowtest.tmp_user_test_df partition (pt_day = '20201213') values
("00001", "00001", 26, "男"), 
("00002", "00002", 27, "男"), 
("00003", "00003", 21, "女"), 
("00004", "00004", 22, "女"), 
("00005", "00005", 26, "女"), 
("00006", "浮云", 26, "男"), 
("00007", "00007", 26, "男"), 
("00008", "00008", 26, "男"),
("00009", "00009", 26, "男"),
("00010", "00010", 26, "男"),
("00011", "00011", 26, "男"),
("00013", "00013", 26, "男");

--20201214: delete user 00011, add 00014 and re-add 00012, modify 00001
alter table flowtest.tmp_user_test_df drop partition (pt_day = '20201214');
insert into table flowtest.tmp_user_test_df partition (pt_day = '20201214') values
("00001", "ganling", 26, "男"), 
("00002", "00002", 27, "男"), 
("00003", "00003", 21, "女"), 
("00004", "00004", 22, "女"), 
("00005", "00005", 26, "女"), 
("00006", "浮云", 26, "男"), 
("00007", "00007", 26, "男"), 
("00008", "00008", 26, "男"),
("00009", "00009", 26, "男"),
("00010", "00010", 26, "男"),
("00013", "00013", 26, "男"),
("00014", "00014", 27, "男"),
("00012", "insert", 27, "男")
;
  4. Import the full data of 20201212 into the zipper table as the initial load
--initial full load
insert overwrite table flowtest.tmp_user_std_test_df partition (pt_day = '20201212')
select user_id          --user id
      ,user_name        --user name
      ,age              --user age
      ,gender           --gender
      ,md5(concat(nvl(user_id, '')
                 ,nvl(user_name, '')
                 ,nvl(age, '')
                 ,nvl(gender, ''))) as md5_key
      ,'20201212' as create_date      --creation date
      ,'99991231' as operation_date   --operation date ('99991231' until the row is updated or deleted)
      ,'A' as operation_type   --operation type (A: initial full load, U: update, I: insert, D: physical delete)
  from flowtest.tmp_user_test_df
 where pt_day = '20201212'
;
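
A quick sanity check after the initial load can confirm that every row was written with type 'A'. The query below is only a sketch and the column aliases row_cnt and user_cnt are made up for illustration; with the 12 sample rows inserted for 20201212 it should return a single row, A with 12 rows and 12 distinct users.

```sql
-- Sketch: verify the initial partition of the zipper table
select operation_type                       --expected to be 'A' only
      ,count(*) as row_cnt                  --number of rows in the partition
      ,count(distinct user_id) as user_cnt  --should equal row_cnt, one row per user
  from flowtest.tmp_user_std_test_df
 where pt_day = '20201212'
 group by operation_type
;
```
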
  5. On every subsequent day, merge that day's full data into the zipper table
--subsequent daily zipper merge
insert overwrite table flowtest.tmp_user_std_test_df partition (pt_day = '${today}')
select coalesce(t1.user_id, t2.user_id) as user_id          --user id
      ,coalesce(t1.user_name, t2.user_name) as user_name        --user name
      ,coalesce(t1.age, t2.age) as age              --user age
      ,coalesce(t1.gender, t2.gender) as gender           --gender
      ,coalesce(t1.md5_key, t2.md5_key) as md5_key           --md5 value
      --if t2's primary key user_id is null (new user), the creation date is the run date; otherwise keep t2.create_date
      ,case when t2.user_id is null then '${today}' --run date, ${bdp.system.bizdate}
       else t2.create_date end as create_date      --creation date
      --if t1.user_id is null (row missing from today's data, 'D'): a row already marked 'D' keeps its t2.operation_date, any other row is newly deleted and gets the run date
      --if t2.user_id is null (new row, 'I'): '99991231'; if t1.md5_key <> t2.md5_key (changed row, 'U'): the run date; otherwise keep t2.operation_date
      ,case when t1.user_id is null then if(t2.operation_type <> 'D', '${today}', t2.operation_date) --run date, ${bdp.system.bizdate}; covers rows that were updated before being deleted
            when t2.user_id is null then '99991231'
            when t1.md5_key <> t2.md5_key then '${today}' --run date, ${bdp.system.bizdate}
       else t2.operation_date end as operation_date   --operation date
      --if t1.user_id is null then 'D'; if t2.user_id is null then 'I'; if t1.md5_key <> t2.md5_key then 'U'; otherwise keep t2.operation_type
      ,case when t1.user_id is null then 'D'
            when t2.user_id is null then 'I'
            when t1.md5_key <> t2.md5_key then 'U'
       else t2.operation_type end as operation_type   --operation type (A: initial full load, U: update, I: insert, D: physical delete)
  from 
  ( --full data of the new partition
    select user_id          --user id
          ,user_name        --user name
          ,age              --user age
          ,gender           --gender
          ,md5(concat(nvl(user_id, '')
                         ,nvl(user_name, '')
                         ,nvl(age, '')
                         ,nvl(gender, ''))) as md5_key
      from flowtest.tmp_user_test_df
     where pt_day = '${today}'
  ) t1 
  full outer join 
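  ( --full zipper data of the previous partition
    select user_id          --user id
          ,user_name        --user name
          ,age              --user age
          ,gender           --gender
          ,md5_key          --md5 value
          ,create_date      --creation date
          ,operation_date   --operation date
          ,operation_type   --operation type (A: initial full load, U: update, I: insert, D: physical delete)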
      from flowtest.tmp_user_std_test_df
     where pt_day = '${yesterday}'
  ) t2 on t1.user_id = t2.user_id 
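      --physical deletes exist, and a deleted user_id may later be entered again, so physically deleted records are excluded from the join; earlier delete records are simply carried over unchanged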
      and t2.operation_type <> 'D'
;
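
To see what the merge recorded for a given day, the zipper partition can be filtered on the two date columns. The query below is a sketch that assumes the statement above was run with today = 20201213 and yesterday = 20201212 on the sample data. Inserted rows carry the run date in create_date and keep '99991231' in operation_date, so both columns are checked.

```sql
-- Sketch: list the rows that changed on 20201213
select user_id
      ,user_name
      ,operation_type
      ,create_date
      ,operation_date
  from flowtest.tmp_user_std_test_df
 where pt_day = '20201213'
   and (operation_date = '20201213'   --updated or deleted on that day
        or create_date = '20201213')  --inserted on that day
 order by user_id
;
```

With the sample data this should return three rows: 00006 as 'U' (renamed to 浮云), 00012 as 'D', and 00013 as 'I'. After the 20201214 run, user 00012 appears twice in that partition: the old 'D' row, carried over with operation_date 20201213, and a new 'I' row with create_date 20201214.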

Origin blog.csdn.net/lz6363/article/details/111187678