Data preparation
Day 1: data on September 10
1,待支付,2020-09-10 12:20:11,2020-09-10 12:20:11
2,待支付,2020-09-10 14:20:11,2020-09-10 14:20:11
3,待支付,2020-09-10 16:20:11,2020-09-10 16:20:11
Day 2: data on September 11
1,待支付,2020-09-10 12:20:11,2020-09-10 12:20:11
2,已支付,2020-09-10 14:20:11,2020-09-11 14:21:11
3,已支付,2020-09-10 16:20:11,2020-09-11 16:21:11
4,待支付,2020-09-11 12:20:11,2020-09-11 12:20:11
5,待支付,2020-09-11 14:20:11,2020-09-11 14:20:11
Comparing the MySQL data for the first and second days shows that two new rows, order ids 4 and 5, were added on the second day, and that the status of order ids 2 and 3 was updated from 待支付 (pending payment) to 已支付 (paid).
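The comparison can be checked with a small Python sketch. The two dictionaries copy the sample rows above; the helper name diff_snapshots is hypothetical and only used for illustration.

```python
# Sample rows keyed by order id: (order_status, create_time, update_time).
day1 = {
    1: ("待支付", "2020-09-10 12:20:11", "2020-09-10 12:20:11"),
    2: ("待支付", "2020-09-10 14:20:11", "2020-09-10 14:20:11"),
    3: ("待支付", "2020-09-10 16:20:11", "2020-09-10 16:20:11"),
}
day2 = {
    1: ("待支付", "2020-09-10 12:20:11", "2020-09-10 12:20:11"),
    2: ("已支付", "2020-09-10 14:20:11", "2020-09-11 14:21:11"),
    3: ("已支付", "2020-09-10 16:20:11", "2020-09-11 16:21:11"),
    4: ("待支付", "2020-09-11 12:20:11", "2020-09-11 12:20:11"),
    5: ("待支付", "2020-09-11 14:20:11", "2020-09-11 14:20:11"),
}


def diff_snapshots(old, new):
    """Hypothetical helper: return (added ids, updated ids) between two snapshots."""
    added = sorted(k for k in new if k not in old)
    updated = sorted(k for k in new if k in old and new[k] != old[k])
    return added, updated


added, updated = diff_snapshots(day1, day2)
print("added:", added)      # added: [4, 5]
print("updated:", updated)  # updated: [2, 3]
```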
Full table
Holds the latest state of all data, every day.
1. The full table is reported whether or not anything has changed.
2. Each report contains all the data (changed and unchanged).
Full extraction to the ods layer on September 10
create table wedw_ods.order_info_20200910(
order_id string COMMENT 'order id'
,order_status string COMMENT 'order status'
,create_time timestamp COMMENT 'creation time'
,update_time timestamp COMMENT 'update time'
) COMMENT 'order table'
row format delimited fields terminated by ','
;
create table wedw_dwd.order_info_df(
order_id string COMMENT 'order id'
,order_status string COMMENT 'order status'
,create_time timestamp COMMENT 'creation time'
,update_time timestamp COMMENT 'update time'
) COMMENT 'order table'
partitioned by (date_id string)
row format delimited fields terminated by ','
;
-- Insert all data from wedw_ods.order_info_20200910 into the 2020-09-10 partition of the dwd layer
insert overwrite table wedw_dwd.order_info_df
partition(date_id = '2020-09-10')
select
order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200910
;
Full extraction to the ods layer on September 11
create table wedw_ods.order_info_20200911(
order_id string COMMENT 'order id'
,order_status string COMMENT 'order status'
,create_time timestamp COMMENT 'creation time'
,update_time timestamp COMMENT 'update time'
) COMMENT 'order table'
row format delimited fields terminated by ','
;
-- Insert all data from wedw_ods.order_info_20200911 into the 2020-09-11 partition of the dwd layer
insert overwrite table wedw_dwd.order_info_df
partition(date_id = '2020-09-11')
select
order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200911
;
With full extraction, each partition retains a complete historical snapshot.
Incremental table
Incremental table: holds newly added data. Incremental data is the data added since the last export.
1. It records what was added each time, not the total.
2. Only changes are reported; unchanged data does not need to be reported.
3. The business database table must have a primary key, a creation time, and a modification time.
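The third requirement exists because the creation and modification times are what make it possible to pick out a day's increment. A minimal Python sketch, assuming the increment is defined as "rows created or modified on the data date"; the helper name extract_increment is hypothetical, and the rows are the September 11 sample data.

```python
from datetime import datetime

# State of the business table on September 11 (from the sample data):
# (order_id, order_status, create_time, update_time)
source_rows = [
    (1, "待支付", "2020-09-10 12:20:11", "2020-09-10 12:20:11"),
    (2, "已支付", "2020-09-10 14:20:11", "2020-09-11 14:21:11"),
    (3, "已支付", "2020-09-10 16:20:11", "2020-09-11 16:21:11"),
    (4, "待支付", "2020-09-11 12:20:11", "2020-09-11 12:20:11"),
    (5, "待支付", "2020-09-11 14:20:11", "2020-09-11 14:20:11"),
]

FMT = "%Y-%m-%d %H:%M:%S"


def extract_increment(rows, data_date):
    """Hypothetical helper: keep rows created or modified on data_date (YYYY-MM-DD)."""
    day = datetime.strptime(data_date, "%Y-%m-%d").date()
    return [
        row for row in rows
        if datetime.strptime(row[2], FMT).date() == day
        or datetime.strptime(row[3], FMT).date() == day
    ]


increment = extract_increment(source_rows, "2020-09-11")
print([row[0] for row in increment])  # [2, 3, 4, 5]
```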
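Full extraction to the ods layer on September 10 (full initialization). The wedw_dwd.order_info_di table is not created in this walkthrough; the statements below assume it has the same columns and partitioning as wedw_dwd.order_info_df.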
-- Insert all data from wedw_ods.order_info_20200910 into the 2020-09-10 partition of the dwd layer
insert overwrite table wedw_dwd.order_info_di
partition(date_id = '2020-09-10')
select
order_id
,order_status
,create_time
,update_time
from wedw_ods.order_info_20200910
;
On September 11, extract the data updated that day and the data newly added that day, that is, the rows with order id 2, 3, 4, and 5.
There are 2 options for merging the September 10 partition of the wedw_dwd.order_info_di table with the incrementally extracted data in wedw_ods.order_info_20200911:
a. Join the two tables on the primary key and keep the rows that exist in the dwd table but not in the ods table, then union all the result with all the data in the wedw_ods.order_info_20200911 table. The full data set is then inserted into the September 11 partition of the dwd table.
insert overwrite table wedw_dwd.order_info_di
partition(date_id = '2020-09-11')
select
t1.order_id
,t1.order_status
,t1.create_time
,t1.update_time
from
wedw_dwd.order_info_di t1
left join
wedw_ods.order_info_20200911 t2
on t1.order_id = t2.order_id
where t1.date_id = '2020-09-10'
and t2.order_id is null
union all
select
order_id
,order_status
,create_time
,update_time
from
wedw_ods.order_info_20200911
;
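The logic of option a can be traced on the sample data with a short Python sketch. It is an in-memory illustration, not a replacement for the Hive statement; the helper name merge_anti_join is hypothetical, and the ods list is assumed to hold only the incremental rows 2, 3, 4, and 5.

```python
# September 10 partition of the dwd table:
# (order_id, order_status, create_time, update_time)
dwd_0910 = [
    (1, "待支付", "2020-09-10 12:20:11", "2020-09-10 12:20:11"),
    (2, "待支付", "2020-09-10 14:20:11", "2020-09-10 14:20:11"),
    (3, "待支付", "2020-09-10 16:20:11", "2020-09-10 16:20:11"),
]
# Incremental extract for September 11 (assumed to contain only ids 2, 3, 4, 5).
ods_0911 = [
    (2, "已支付", "2020-09-10 14:20:11", "2020-09-11 14:21:11"),
    (3, "已支付", "2020-09-10 16:20:11", "2020-09-11 16:21:11"),
    (4, "待支付", "2020-09-11 12:20:11", "2020-09-11 12:20:11"),
    (5, "待支付", "2020-09-11 14:20:11", "2020-09-11 14:20:11"),
]


def merge_anti_join(prev_rows, inc_rows):
    """Hypothetical helper mirroring: left join ... where t2.order_id is null, union all."""
    inc_ids = {row[0] for row in inc_rows}
    # Rows that exist in the previous partition but not in the increment.
    unchanged = [row for row in prev_rows if row[0] not in inc_ids]
    # Union all with every incremental row.
    return unchanged + inc_rows


dwd_0911 = merge_anti_join(dwd_0910, ods_0911)
for order_id, status, _, update_time in dwd_0911:
    print(order_id, status, update_time)
```

Order 1 is carried over unchanged, orders 2 and 3 come from the increment with their new status, and orders 4 and 5 are new.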
b. Union all the data from the two tables, then deduplicate by order_id (partition by order_id, sort by update_time in descending order, and take the first row)
insert overwrite table wedw_dwd.order_info_di partition(date_id = '2020-09-11')
select
t2.order_id
,t2.order_status
,t2.create_time
,t2.update_time
from
(
select
t1.order_id
,t1.order_status
,t1.create_time
,t1.update_time
,row_number() over(partition by order_id order by update_time desc) as rn
from
(
select
order_id
,order_status
,create_time
,update_time
from
wedw_dwd.order_info_di
where date_id = '2020-09-10'
union all
select
order_id
,order_status
,create_time
,update_time
from
wedw_ods.order_info_20200911
) t1
) t2
where t2.rn = 1
;
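Option b can be sketched the same way in Python. The helper name merge_dedup is hypothetical; it keeps, for each order_id, the row with the greatest update_time, which is what row_number() ... rn = 1 selects. The timestamps are compared as strings, which is safe here only because they share the fixed format yyyy-MM-dd HH:mm:ss.

```python
# (order_id, order_status, create_time, update_time)
dwd_0910 = [
    (1, "待支付", "2020-09-10 12:20:11", "2020-09-10 12:20:11"),
    (2, "待支付", "2020-09-10 14:20:11", "2020-09-10 14:20:11"),
    (3, "待支付", "2020-09-10 16:20:11", "2020-09-10 16:20:11"),
]
# Incremental extract for September 11 (assumed to contain only ids 2, 3, 4, 5).
ods_0911 = [
    (2, "已支付", "2020-09-10 14:20:11", "2020-09-11 14:21:11"),
    (3, "已支付", "2020-09-10 16:20:11", "2020-09-11 16:21:11"),
    (4, "待支付", "2020-09-11 12:20:11", "2020-09-11 12:20:11"),
    (5, "待支付", "2020-09-11 14:20:11", "2020-09-11 14:20:11"),
]


def merge_dedup(prev_rows, inc_rows):
    """Hypothetical helper: union all, then keep the latest row per order_id."""
    latest = {}
    for row in prev_rows + inc_rows:  # union all
        order_id, update_time = row[0], row[3]
        # Keep the row with the greatest update_time; on a tie the first one seen wins.
        if order_id not in latest or update_time > latest[order_id][3]:
            latest[order_id] = row
    return [latest[order_id] for order_id in sorted(latest)]


dwd_0911 = merge_dedup(dwd_0910, ods_0911)
print([(row[0], row[1]) for row in dwd_0911])
```

On this sample data the result is the same five rows that option a produces.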
Special incremental table: the da table. Each daily partition holds only that day's data, and the data does not change once it has been generated, as in a log table.
Zipper table
Maintains both the historical states and the latest state of the data.
When to use it:
1. The amount of data is relatively large.
2. Some fields in the table will be updated.
3. You need to view historical snapshot information for a certain point in time or time period, for example the status of a certain order at a certain point in history, or the number of orders a certain user placed during a certain period in the past.
4. The proportion and frequency of updates are not very large.
If the information in the table changes little and a full copy is kept every day, each full copy stores a lot of unchanged information, which wastes a great deal of storage.
Advantages
1. It can reflect the historical state of the data.
2. It saves storage to the greatest extent.
Full extraction to the ods layer on September 10
create table wedw_ods.order_info_20200910(
order_id string COMMENT 'order id'
,order_status string COMMENT 'order status'
,create_time timestamp COMMENT 'creation time'
,update_time timestamp COMMENT 'update time'
) COMMENT 'order table'
row format delimited fields terminated by ','
;
Create the dwd layer zipper table
Add two fields:
start_dt (the start date of the record's life cycle, that is, the date from which this snapshot state is effective)
end_dt (the end date of the record's life cycle)
end_dt = '9999-12-31' means that the record is currently valid
create table wedw_dwd.order_info_dz(
order_id string COMMENT 'order id'
,order_status string COMMENT 'order status'
,create_time timestamp COMMENT 'creation time'
,update_time timestamp COMMENT 'update time'
,start_dt date COMMENT 'effective start date'
,end_dt date COMMENT 'effective end date'
) COMMENT 'order table'
partitioned by (date_id string)
row format delimited fields terminated by ','
;
Note: All data needs to be initialized the first time it is processed: start_dt is set to the data date 2020-09-10, and end_dt is set to 9999-12-31
insert overwrite table wedw_dwd.order_info_dz
partition(date_id = '2020-09-10')
select
order_id
,order_status
,create_time
,update_time
,to_date(update_time) as start_dt
,'9999-12-31' as end_dt
from
wedw_ods.order_info_20200910
;
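On September 11, extract the data updated that day and the data newly added that day to the ods layer, that is, the rows with order id 2, 3, 4, and 5. Records in the September 10 partition that are still open and also appear in the increment have their end_dt closed, and the new versions are appended with end_dt = '9999-12-31':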
insert overwrite table wedw_dwd.order_info_dz
partition(date_id = '2020-09-11')
select
t1.order_id
,t1.order_status
,t1.create_time
,t1.update_time
,t1.start_dt
,case when t1.end_dt = '9999-12-31'
and t2.order_id is not null
then t1.date_id
else t1.end_dt end as end_dt
from
wedw_dwd.order_info_dz t1
left join wedw_ods.order_info_20200911 t2
on t1.order_id = t2.order_id
where t1.date_id = '2020-09-10'
union all
SELECT
t1.order_id
,t1.order_status
,t1.create_time
,t1.update_time
,to_date(t1.update_time) as start_dt
,'9999-12-31' as end_dt
FROM wedw_ods.order_info_20200911 t1
;
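The zipper update can be followed step by step with a Python sketch on the sample data. It is an in-memory illustration of the statement above, not the Hive job itself; the helper name zipper_merge is hypothetical, and the September 10 rows are the result of the initialization (start_dt = 2020-09-10, end_dt = 9999-12-31).

```python
OPEN_END = "9999-12-31"  # marks a record that is currently valid

# September 10 partition of the zipper table:
# (order_id, order_status, create_time, update_time, start_dt, end_dt)
zipper_0910 = [
    (1, "待支付", "2020-09-10 12:20:11", "2020-09-10 12:20:11", "2020-09-10", OPEN_END),
    (2, "待支付", "2020-09-10 14:20:11", "2020-09-10 14:20:11", "2020-09-10", OPEN_END),
    (3, "待支付", "2020-09-10 16:20:11", "2020-09-10 16:20:11", "2020-09-10", OPEN_END),
]
# Incremental extract for September 11:
# (order_id, order_status, create_time, update_time)
ods_0911 = [
    (2, "已支付", "2020-09-10 14:20:11", "2020-09-11 14:21:11"),
    (3, "已支付", "2020-09-10 16:20:11", "2020-09-11 16:21:11"),
    (4, "待支付", "2020-09-11 12:20:11", "2020-09-11 12:20:11"),
    (5, "待支付", "2020-09-11 14:20:11", "2020-09-11 14:20:11"),
]


def zipper_merge(prev_rows, inc_rows, prev_date):
    """Hypothetical helper: close changed open records, then append new versions."""
    inc_ids = {row[0] for row in inc_rows}
    merged = []
    for order_id, status, created, updated, start_dt, end_dt in prev_rows:
        # An open record that reappears in the increment is closed at prev_date
        # (the SQL uses t1.date_id, the date of the previous partition).
        if end_dt == OPEN_END and order_id in inc_ids:
            end_dt = prev_date
        merged.append((order_id, status, created, updated, start_dt, end_dt))
    for order_id, status, created, updated in inc_rows:
        # New version: start_dt is the date part of update_time, end_dt stays open.
        merged.append((order_id, status, created, updated, updated[:10], OPEN_END))
    return merged


zipper_0911 = zipper_merge(zipper_0910, ods_0911, "2020-09-10")
closed_ids = sorted(row[0] for row in zipper_0911 if row[5] == "2020-09-10")
current_ids = sorted(row[0] for row in zipper_0911 if row[5] == OPEN_END)
print(len(zipper_0911))  # 7
print(closed_ids)        # [2, 3]
print(current_ids)       # [1, 2, 3, 4, 5]
```

The September 11 partition therefore holds seven rows: the three September 10 records (two of them now closed) plus the four rows from the increment.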
Query all currently valid records:
select
*
from
wedw_dwd.order_info_dz
where
date_id = '2020-09-11'
and end_dt ='9999-12-31'
;
Query the historical snapshot on September 10:
select
*
from
wedw_dwd.order_info_dz
where
date_id = '2020-09-10'
and start_dt <= '2020-09-10'
and end_dt >='2020-09-10'
;
Query the historical snapshot on September 11:
select
*
from
wedw_dwd.order_info_dz
where
date_id = '2020-09-11'
and start_dt <= '2020-09-11'
and end_dt >='2020-09-11'
;
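The snapshot filter start_dt <= day and end_dt >= day can be tried out in Python. The rows below are what the September 11 partition would contain after the update step on the sample data (derived by hand, with the timestamp columns omitted for brevity), and the helper name snapshot is hypothetical. Unlike the queries above, the sketch reads both snapshots from the same partition, to show that the date range alone is enough to separate them.

```python
# Assumed content of the 2020-09-11 partition of the zipper table:
# (order_id, order_status, start_dt, end_dt)
zipper_0911 = [
    (1, "待支付", "2020-09-10", "9999-12-31"),
    (2, "待支付", "2020-09-10", "2020-09-10"),
    (3, "待支付", "2020-09-10", "2020-09-10"),
    (2, "已支付", "2020-09-11", "9999-12-31"),
    (3, "已支付", "2020-09-11", "9999-12-31"),
    (4, "待支付", "2020-09-11", "9999-12-31"),
    (5, "待支付", "2020-09-11", "9999-12-31"),
]


def snapshot(rows, day):
    """Hypothetical helper: rows whose validity interval covers day (YYYY-MM-DD)."""
    # ISO dates compare correctly as plain strings.
    return [row for row in rows if row[2] <= day <= row[3]]


snap_0910 = {row[0]: row[1] for row in snapshot(zipper_0911, "2020-09-10")}
snap_0911 = {row[0]: row[1] for row in snapshot(zipper_0911, "2020-09-11")}
print(snap_0910)
print(snap_0911)
```

The September 10 snapshot contains orders 1 to 3, all still 待支付; the September 11 snapshot contains orders 1 to 5, with orders 2 and 3 as 已支付.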