Incremental tables and zipper tables: do you understand them?

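Data preparation

Data on day 1 (September 10). Each row is order_id, order_status, create_time, update_time; 待支付 means "pending payment" and 已支付 means "paid".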

1,待支付,2020-09-10 12:20:11,2020-09-10 12:20:11
2,待支付,2020-09-10 14:20:11,2020-09-10 14:20:11
3,待支付,2020-09-10 16:20:11,2020-09-10 16:20:11

Data on day 2 (September 11)

1,待支付,2020-09-10 12:20:11,2020-09-10 12:20:11
2,已支付,2020-09-10 14:20:11,2020-09-11 14:21:11
3,已支付,2020-09-10 16:20:11,2020-09-11 16:21:11
4,待支付,2020-09-11 12:20:11,2020-09-11 12:20:11
5,待支付,2020-09-11 14:20:11,2020-09-11 14:20:11

Comparing the MySQL data for the two days shows that two new orders (ids 4 and 5) were added on the second day, and the status of orders 2 and 3 was updated from 待支付 (pending payment) to 已支付 (paid).

Full table

Each daily load holds the latest state of every record.

1. A full load is taken regardless of whether anything changed

2. Every load contains all rows (changed + unchanged)

Full extraction to the ods layer on September 10

create table wedw_ods.order_info_20200910(
 order_id     string    COMMENT 'order id'
,order_status string    COMMENT 'order status'
,create_time  timestamp COMMENT 'creation time'
,update_time  timestamp COMMENT 'update time'
) COMMENT 'order table'
row format delimited fields terminated by ','
;
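
The article does not show how the sample rows reach the ods table. One possible way, sketched here under the assumption that the three September 10 rows are saved as a comma-separated text file at the hypothetical local path /tmp/order_info_20200910.csv, is Hive's LOAD DATA statement:

```sql
-- Assumption: /tmp/order_info_20200910.csv is a hypothetical local file
-- holding the three comma-separated rows from the data preparation step.
load data local inpath '/tmp/order_info_20200910.csv'
overwrite into table wedw_ods.order_info_20200910;

-- Quick check that each line was split into the four columns.
select
 order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200910
;
```

The file layout matches the table definition because the table is declared with fields terminated by ','.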

Create the dwd layer full table, partitioned by day

create table wedw_dwd.order_info_df(
 order_id     string    COMMENT 'order id'
,order_status string    COMMENT 'order status'
,create_time  timestamp COMMENT 'creation time'
,update_time  timestamp COMMENT 'update time'
) COMMENT 'order table'
partitioned by (date_id string)
row format delimited fields terminated by ','
;

-- Load all rows of wedw_ods.order_info_20200910 into the 2020-09-10 partition of the dwd layer
insert overwrite table wedw_dwd.order_info_df
 partition(date_id = '2020-09-10')
select
order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200910
;

On September 11, run another full extraction to the ods layer


create table wedw_ods.order_info_20200911(
 order_id     string    COMMENT 'order id'
,order_status string    COMMENT 'order status'
,create_time  timestamp COMMENT 'creation time'
,update_time  timestamp COMMENT 'update time'
) COMMENT 'order table'
row format delimited fields terminated by ','
;



-- Load all rows of wedw_ods.order_info_20200911 into the 2020-09-11 partition of the dwd layer
insert overwrite table wedw_dwd.order_info_df 
partition(date_id = '2020-09-11')
select
order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200911
;

With full extraction, each partition retains a complete snapshot of that day's data.
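
Because every partition of the full table is a complete snapshot, two days can be compared directly. The following is a minimal sketch, assuming that update_time changes whenever a row is modified; the aliases d11 and d10 and the change_type labels are illustrative names, not part of the original article.

```sql
-- Classify each order in the 2020-09-11 snapshot by comparing it with
-- the 2020-09-10 snapshot of the same full table.
select
 d11.order_id
,case
   when d10.order_id is null then 'new'
   when d11.update_time <> d10.update_time then 'updated'
   else 'unchanged'
 end as change_type
from
(
    select order_id, update_time
    from wedw_dwd.order_info_df
    where date_id = '2020-09-11'
) d11
left join
(
    select order_id, update_time
    from wedw_dwd.order_info_df
    where date_id = '2020-09-10'
) d10
on d11.order_id = d10.order_id
;
```

For the sample data this labels order 1 as unchanged, orders 2 and 3 as updated, and orders 4 and 5 as new.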

Incremental table

Incremental table: holds new data only. Incremental data is the data added or changed since the last export.

1. Record only what was added in each load, not the total;

2. Only changed rows are loaded; unchanged rows do not need to be loaded

3. The business database table must have a primary key, a creation time, and a modification time
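
The third requirement is what makes the daily increment selectable at the source. As a rough sketch, assuming the business table in MySQL is called order_info (a hypothetical name, since the article does not give it) and uses the same column names as the Hive tables, the September 11 increment could be selected like this:

```sql
-- Run against the business (MySQL) database, not Hive.
-- Assumption: the source table is named order_info (hypothetical) and
-- update_time is refreshed on every insert and update.
select
 order_id
,order_status
,create_time
,update_time
from order_info
where (create_time >= '2020-09-11 00:00:00' and create_time < '2020-09-12 00:00:00')
   or (update_time >= '2020-09-11 00:00:00' and update_time < '2020-09-12 00:00:00')
;
```

Against the sample data this selects orders 2, 3, 4, and 5 and skips order 1, whose timestamps both fall on September 10.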

Full extraction to the ods layer on September 10 (initial full load)
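
The article never shows the definition of the incremental dwd table wedw_dwd.order_info_di that the following statements write to. A plausible definition, assuming it simply mirrors wedw_dwd.order_info_df, would be:

```sql
-- Assumption: same columns and partitioning as wedw_dwd.order_info_df;
-- the original article does not show this statement.
create table if not exists wedw_dwd.order_info_di(
 order_id     string    COMMENT 'order id'
,order_status string    COMMENT 'order status'
,create_time  timestamp COMMENT 'creation time'
,update_time  timestamp COMMENT 'update time'
) COMMENT 'order table (incremental load)'
partitioned by (date_id string)
row format delimited fields terminated by ','
;
```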


-- Load all rows of wedw_ods.order_info_20200910 into the 2020-09-10 partition of the dwd layer
insert overwrite table wedw_dwd.order_info_di
 partition(date_id = '2020-09-10')
select
order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200910
;

On September 11, extract only the rows that were updated or newly added that day, i.e. order ids 2, 3, 4, and 5, into the ods layer. In this section wedw_ods.order_info_20200911 holds only these incremental rows.
To build the September 11 partition, merge the September 10 partition of wedw_dwd.order_info_di with the incremental rows in wedw_ods.order_info_20200911. There are 2 options

a. Join the two tables on the primary key and keep the rows that exist in the dwd table but not in the ods table (the unchanged rows), then union all with every row of wedw_ods.order_info_20200911. The result is written to the September 11 partition of the dwd table

insert overwrite table wedw_dwd.order_info_di 
partition(date_id = '2020-09-11')
select
 t1.order_id
,t1.order_status
,t1.create_time
,t1.update_time
from
wedw_dwd.order_info_di t1
left join
wedw_ods.order_info_20200911 t2
on t1.order_id = t2.order_id
where t1.date_id = '2020-09-10'
and t2.order_id is  null
union all
select 
 order_id
,order_status
,create_time
,update_time
from 
wedw_ods.order_info_20200911
;

b. Union all the rows of the two tables, then de-duplicate by order_id (partition by order_id, sort by update_time in descending order, and keep the first row)


insert overwrite table wedw_dwd.order_info_di partition(date_id = '2020-09-11')
select
 t2.order_id
,t2.order_status
,t2.create_time
,t2.update_time 
from
(
    select
     t1.order_id
    ,t1.order_status
    ,t1.create_time
    ,t1.update_time
    ,row_number() over(partition by order_id order by update_time desc) as rn
    from
    (
        select
         order_id
        ,order_status
        ,create_time
        ,update_time
        from
        wedw_dwd.order_info_di 
        where date_id = '2020-09-10'
 
        union all
 
        select 
         order_id
        ,order_status
        ,create_time
        ,update_time
        from 
        wedw_ods.order_info_20200911
    ) t1
) t2
where t2.rn = 1
;
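
Either option should leave exactly one row per order in the new partition. A simple check, offered as a sketch rather than as part of the original procedure (row_cnt and order_cnt are illustrative aliases), compares the row count with the number of distinct order ids:

```sql
-- The two counts differ only if some order_id appears more than once.
select
 count(*)                 as row_cnt
,count(distinct order_id) as order_cnt
from wedw_dwd.order_info_di
where date_id = '2020-09-11'
;
```

With the sample data both counts are 5: order 1 carried over from September 10, plus orders 2, 3, 4, and 5 from the increment.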

Special incremental table: the da table. Each daily partition holds only that day's data, and the data never changes once it is generated, as in a log table.

Zipper table

Maintains both the historical states and the latest state of the data.

When to use it:

1. The amount of data is relatively large

2. Some fields in the table will be updated

3. You need to view a historical snapshot for a certain point in time or time period, for example the status of a certain order at a certain point in history, or the number of orders a certain user placed during a certain past period

4. The proportion and frequency of updates are not very large

If the information in the table changes little and a full copy is kept every day, each copy stores a lot of unchanged information, which wastes a great deal of storage

Advantages

1. Reflects the historical states of the data

2. Saves storage to the greatest extent

Full extraction to the ods layer on September 10

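-- "if not exists" avoids an error when the table was already created in the earlier sections
create table if not exists wedw_ods.order_info_20200910(
 order_id     string    COMMENT 'order id'
,order_status string    COMMENT 'order status'
,create_time  timestamp COMMENT 'creation time'
,update_time  timestamp COMMENT 'update time'
) COMMENT 'order table'
row format delimited fields terminated by ','
;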

Create the dwd layer zipper table
Add two fields:
start_dt (the start date of the record's life cycle, i.e. the date from which this state is valid)
end_dt (the end date of the record's life cycle)

end_dt = '9999-12-31' means that the record is currently valid

create table wedw_dwd.order_info_dz(
 order_id     string    COMMENT 'order id'
,order_status string    COMMENT 'order status'
,create_time  timestamp COMMENT 'creation time'
,update_time  timestamp COMMENT 'update time'
,start_dt     date      COMMENT 'effective start date'
,end_dt       date      COMMENT 'effective end date'
) COMMENT 'order table'
partitioned by (date_id string)
row format delimited fields terminated by ','
;

Note: the first run must initialize all data. start_dt is set to the data date 2020-09-10 (taken from update_time in the statement below), and end_dt is set to 9999-12-31


insert overwrite table wedw_dwd.order_info_dz
 partition(date_id = '2020-09-10')
select
 order_id    
,order_status
,create_time 
,update_time 
,to_date(update_time) as start_dt   
,'9999-12-31' as end_dt  
from
wedw_ods.order_info_20200910
;

On September 11, extract the rows updated or newly added that day to the ods layer, i.e. order ids 2, 3, 4, and 5. Then close the superseded records and append the new versions:

insert overwrite table wedw_dwd.order_info_dz
 partition(date_id = '2020-09-11')
-- Existing records: close the open record of every order that appears in today's increment
select
 t1.order_id    
,t1.order_status
,t1.create_time 
,t1.update_time
,t1.start_dt
,case when t1.end_dt = '9999-12-31' 
and t2.order_id is not null 
then cast(t1.date_id as date)
else t1.end_dt end as end_dt
from
wedw_dwd.order_info_dz t1
left join wedw_ods.order_info_20200911 t2
on t1.order_id = t2.order_id
where t1.date_id = '2020-09-10'
union all
-- New and updated rows: open a new record; explicit casts keep both union branches typed as date
select
 t1.order_id    
,t1.order_status
,t1.create_time 
,t1.update_time
,cast(to_date(t1.update_time) as date) as start_dt
,cast('9999-12-31' as date) as end_dt
from wedw_ods.order_info_20200911 t1
;
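
After the merge, every order should have exactly one open record (end_dt = '9999-12-31') in the new partition. The query below is a small sanity check under that assumption; it is not part of the original article, and open_cnt is an illustrative alias.

```sql
-- List orders whose number of open records is not exactly one
-- (zero means the order was closed without a successor, more than one
-- means a superseded record was not closed).
select
 order_id
,sum(case when end_dt = '9999-12-31' then 1 else 0 end) as open_cnt
from wedw_dwd.order_info_dz
where date_id = '2020-09-11'
group by order_id
having sum(case when end_dt = '9999-12-31' then 1 else 0 end) <> 1
;
```

For the sample data the query returns no rows: order 1 keeps its original open record, orders 2 and 3 each have one closed record and one open record, and orders 4 and 5 each have a single open record.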

Query all currently valid records:

select 
* 
from 
wedw_dwd.order_info_dz 
where 
date_id = '2020-09-11'
and end_dt ='9999-12-31'
;

Query the historical snapshot for September 10:


select 
* 
from 
wedw_dwd.order_info_dz 
where 
date_id = '2020-09-10' 
and start_dt <= '2020-09-10' 
and end_dt >='2020-09-10'
;

Query the historical snapshot for September 11:


select 
* 
from 
wedw_dwd.order_info_dz 
where 
date_id = '2020-09-11' 
and start_dt <= '2020-09-11' 
and end_dt >='2020-09-11'
;
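
The two use cases named earlier, the status of one order over time and the state of the table on a past date, can also be answered from the latest partition alone. The queries below are a sketch that assumes the 2020-09-11 partition was built by the merge statement in this article.

```sql
-- Status history of order 2, read from the latest partition.
select
 order_id
,order_status
,start_dt
,end_dt
from wedw_dwd.order_info_dz
where date_id = '2020-09-11'
and order_id = '2'
order by start_dt
;

-- State of the table as of 2020-09-10, also read from the latest partition.
select
 order_id
,order_status
,start_dt
,end_dt
from wedw_dwd.order_info_dz
where date_id = '2020-09-11'
and start_dt <= '2020-09-10'
and end_dt >= '2020-09-10'
;
```

With the sample data the first query returns two rows for order 2 (待支付 valid from 2020-09-10 to 2020-09-10, then 已支付 valid from 2020-09-11 to 9999-12-31), and the second returns orders 1, 2, and 3, all in status 待支付, which matches the September 10 source data.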

Origin blog.csdn.net/qq_42706464/article/details/108942442