Cumulative snapshot fact tables suit processes that run for a short period and have clear start and end states, such as order fulfillment. They record the time at which each step of the process occurs, so that analysts can see the whole process at a glance. The timestamps are filled in gradually: the row already in the fact table is updated each time the process reaches another step.
Coupon life cycle: receive the coupon → place an order with the coupon → pay for the order with the coupon
Use of the cumulative snapshot fact table: count the number of coupons received, the number of orders placed with coupons, and the number of payments made with coupons
1. Create the table
-- Drop the table if it already exists
drop table if exists dwd_fact_coupon_use;
create external table dwd_fact_coupon_use(
`id` string COMMENT 'Record ID',
`coupon_id` string COMMENT 'Coupon ID',
`user_id` string COMMENT 'User ID',
`order_id` string COMMENT 'Order ID',
`coupon_status` string COMMENT 'Coupon status',
`get_time` string COMMENT 'Time received',
`using_time` string COMMENT 'Time used (order placed)',
`used_time` string COMMENT 'Time used (payment made)'
) COMMENT 'Coupon use fact table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwd/dwd_fact_coupon_use/';
2. Data loading
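The load uses dynamic partitions: every partition that contains affected rows is overwritten with the merged result.
FULL OUTER JOIN returns every row from both the left table (old) and the right table (new). Rows that match on the join condition are combined; rows with no match on the other side are returned with NULLs in that side's columns.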
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_fact_coupon_use partition(dt)
select
if(new.id is null,old.id,new.id),
if(new.coupon_id is null,old.coupon_id,new.coupon_id),
if(new.user_id is null,old.user_id,new.user_id),
if(new.order_id is null,old.order_id,new.order_id),
if(new.coupon_status is null,old.coupon_status,new.coupon_status),
if(new.get_time is null,old.get_time,new.get_time),
if(new.using_time is null,old.using_time,new.using_time),
if(new.used_time is null,old.used_time,new.used_time),
date_format(if(new.get_time is null,old.get_time,new.get_time),'yyyy-MM-dd') -- dynamic partition value (dt), derived from get_time
from
(
select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time
from dwd_fact_coupon_use
where dt in
(
select
date_format(get_time,'yyyy-MM-dd')
from ods_coupon_use
where dt='2020-10-30'
)
)old
full outer join
(
select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time
from ods_coupon_use
where dt='2020-10-30'
)new
on old.id=new.id;
3. Query loading results
select * from dwd_fact_coupon_use where dt='2020-10-30';
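The three counts named at the start (coupons received, orders placed with coupons, payments made with coupons) could be read off the loaded table roughly as follows. This is a minimal sketch, assuming that a step which has not happened yet leaves its time column NULL rather than an empty string; the result column names are illustrative only.

```sql
-- For coupons received on 2020-10-30 (dt follows get_time):
-- count(col) counts only the rows where col is not NULL.
select
    count(get_time)   as receive_count,  -- coupons received
    count(using_time) as order_count,    -- of those, used to place an order
    count(used_time)  as payment_count   -- of those, used in a payment
from dwd_fact_coupon_use
where dt='2020-10-30';
```

Because each row carries all three timestamps, a single scan of one partition is enough; no join across separate event tables is needed.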
Summary of questions
What are the new table and the old table?
The new table holds each day's new and changed rows. Newly added rows only need to be inserted into that day's partition. Changed rows must be merged with the old data, overwriting the corresponding rows in the old table.
The old table is the set of partitions of this fact table that need to be synchronized. In each of these partitions, some rows need to be modified and some do not.
How are the two tables joined? (New table: new and changed rows; old table: this fact table)
- The row is in both the new table and the old table: the old row is updated with the new values
- The row is in the new table but not in the old table: it is inserted into that day's partition
- The row is in the old table but not in the new table: it is kept as it is
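The three cases above can be seen on a toy example. The sketch below uses two inline CTEs, old_rows and new_rows, with made-up ids and status values (all hypothetical, not taken from the real tables), and assumes a Hive version that supports CTEs and SELECT without FROM. coalesce(new.x, old.x) is used here as a shorter equivalent of if(new.x is null, old.x, new.x).

```sql
with old_rows as (
    select '1' as id, 'received' as coupon_status
    union all
    select '2' as id, 'received' as coupon_status
),
new_rows as (
    select '2' as id, 'used' as coupon_status
    union all
    select '3' as id, 'received' as coupon_status
)
select
    coalesce(new.id, old.id)                       as id,
    coalesce(new.coupon_status, old.coupon_status) as coupon_status,
    case
        when old.id is null then 'insert'  -- only in new
        when new.id is null then 'keep'    -- only in old
        else 'update'                      -- in both: new values win
    end as action
from old_rows old
full outer join new_rows new
on old.id = new.id;
```

Id 1 exists only in old_rows and is kept, id 2 exists on both sides and takes the new status, and id 3 exists only in new_rows and is inserted. The order of the returned rows is not guaranteed.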
How are partitions chosen when updating data?
There are two kinds: dynamic partitions and static partitions.
With a dynamic partition, the target partition is not known in advance; each row is routed to a partition according to its dt value. With a static partition, the target partition is fixed in the statement.
The SQL above is partitioned by dt. What data belongs in each dt?
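The difference shows up in the insert syntax. The sketch below writes to demo_coupon_use, a hypothetical table assumed to have the same columns and dt partition as dwd_fact_coupon_use; it illustrates the two forms only and is not the load logic for the fact table.

```sql
-- Static partition: the partition value is written in the statement,
-- and the select list does not include the partition column.
insert overwrite table demo_coupon_use partition(dt='2020-10-30')
select id, coupon_id, user_id, order_id, coupon_status,
       get_time, using_time, used_time
from ods_coupon_use
where dt='2020-10-30';

-- Dynamic partition: only the partition column is named, and the last
-- expression in the select list supplies its value for each row.
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table demo_coupon_use partition(dt)
select id, coupon_id, user_id, order_id, coupon_status,
       get_time, using_time, used_time,
       date_format(get_time,'yyyy-MM-dd')
from ods_coupon_use
where dt='2020-10-30';
```

In the dynamic form, one statement can overwrite several partitions at once, which is why the load for the fact table uses it.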
dt is the partition derived from the coupon receipt time, get_time. using_time and used_time may fall on the current day while get_time falls on an earlier day, so these two times have to be written back to the row that was created when the coupon was received. For that reason, rows with the same get_time date are kept together in one partition.
Specifically, the table is partitioned so that each partition stores the coupons received on that day. The data pulled each day carries three times: get_time, using_time, and used_time. Rows received that day go directly into that day's partition; rows that bring a new using_time or used_time are used to update the existing row in the partition matching their get_time.
How is the data updated?
The data in the old table is updated with the data in the new table. new is read from that day's partition of ods_coupon_use.
old is the set of partitions containing rows to be modified: dt in (select date_format(get_time,'yyyy-MM-dd') from ods_coupon_use where dt='2020-10-30'). For rows that need updating, the old values are replaced with the corresponding new values.
Rows that exist in new but not in old are inserted into the new partition. With a full outer join, the cases above have to be told apart, and if...
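To see which partitions a given day's load will touch, the subquery can be run on its own. A small sketch, using the same date as the example load; affected_dt is just an illustrative alias.

```sql
-- Each distinct receipt date in the day's ODS data is one partition of
-- dwd_fact_coupon_use that will be read as old and then overwritten.
select distinct date_format(get_time,'yyyy-MM-dd') as affected_dt
from ods_coupon_use
where dt='2020-10-30';
```

Partitions that do not appear in this list are neither read nor rewritten by the load.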