Accumulating snapshot fact table: the coupon usage table

Accumulating snapshots suit processes that run for a short period and have clear start and end states, such as order fulfilment. The fact table records the time at which each step of the process was executed, so that analysts get an overall view of the process. The step timestamps on a fact row are filled in gradually: the row is updated in the fact table as the process advances.

Coupon life cycle: receive the coupon → place an order with the coupon → pay for the order with the coupon

Purpose of the accumulating snapshot fact table: count the number of coupons received, the number of coupons used in orders, and the number of coupons used in payments

1. Create the table

-- Drop the table if it already exists
drop table if exists dwd_fact_coupon_use;
create external table dwd_fact_coupon_use(
    `id` string COMMENT 'ID',
    `coupon_id` string  COMMENT 'coupon ID',
    `user_id` string  COMMENT 'user ID',
    `order_id` string  COMMENT 'order ID',
    `coupon_status` string  COMMENT 'coupon status',
    `get_time` string  COMMENT 'time received',
    `using_time` string  COMMENT 'time used (order placed)',
    `used_time` string  COMMENT 'time used (payment)'
) COMMENT 'coupon usage fact table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwd/dwd_fact_coupon_use/';

2. Data loading

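Use dynamic partitioning: each row is routed to the partition computed from its own data, and every partition that receives rows is overwritten.

FULL OUTER JOIN returns every row from both the left table (old) and the right table (new), whether or not the row has a match on the other side. Columns from the side without a match are NULL.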

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_fact_coupon_use partition(dt)
select
    if(new.id is null,old.id,new.id),
    if(new.coupon_id is null,old.coupon_id,new.coupon_id),
    if(new.user_id is null,old.user_id,new.user_id),
    if(new.order_id is null,old.order_id,new.order_id),
    if(new.coupon_status is null,old.coupon_status,new.coupon_status),
    if(new.get_time is null,old.get_time,new.get_time),
    if(new.using_time is null,old.using_time,new.using_time),
    if(new.used_time is null,old.used_time,new.used_time),
    date_format(if(new.get_time is null,old.get_time,new.get_time),'yyyy-MM-dd')  -- dynamic partition value (dt), derived from get_time
from
(
    select
        id,
        coupon_id,
        user_id,
        order_id,
        coupon_status,
        get_time,
        using_time,
        used_time
    from dwd_fact_coupon_use
    where dt in
    (
        select
            date_format(get_time,'yyyy-MM-dd')
        from ods_coupon_use
        where dt='2020-10-30'
    )
)old
full outer join
(
    select
        id,
        coupon_id,
        user_id,
        order_id,
        coupon_status,
        get_time,
        using_time,
        used_time
    from ods_coupon_use
    where dt='2020-10-30'
)new
on old.id=new.id;

3. Query the loaded data

select * from dwd_fact_coupon_use where dt='2020-10-30';
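
The three counts named as the purpose of this table could be read off with a query like the following. This is only a sketch: the date range is made up for illustration, and it assumes that a step which has not happened yet is stored as NULL in using_time or used_time (count(col) skips NULLs but would count empty strings).

```sql
-- Sketch: coupon funnel for coupons received in October 2020.
-- Assumption: steps not yet reached are NULL, not empty strings.
select
    count(get_time)   as coupons_received,  -- coupons received
    count(using_time) as coupons_ordered,   -- coupons used to place an order
    count(used_time)  as coupons_paid       -- coupons used in a payment
from dwd_fact_coupon_use
where dt >= '2020-10-01' and dt <= '2020-10-31';
```

Because dt is derived from get_time, the date range selects coupons by the day they were received; the order and payment counts then show how many of those coupons have progressed so far.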

Summary of questions

What are the new table and the old table?

The new table holds each day's new and changed rows. Newly added rows only need to be inserted into that day's partition. Changed rows have to be reconciled with the old data: they overwrite the corresponding rows in the old table.

The old table is the set of partitions that need to be reconciled. Any given partition may contain more or fewer affected rows: some rows in it need to be modified and others do not.

How are the two tables joined? (New table: the day's new and changed rows; old table: this fact table)

  • The row is in both the new table and the old table: the old row is updated with the new values
  • The row is only in the new table: it is inserted into the current day's partition
  • The row is only in the old table: it is kept unchanged
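
The three cases can be told apart by which side of the full outer join is NULL. The query below is a diagnostic sketch, not part of the original load: it reuses the same old and new subqueries (reduced to the id column) and labels each row with the case it falls into. The merge_case column name and its label strings are illustrative assumptions.

```sql
-- Sketch: classify every id produced by the full outer join.
select
    coalesce(new.id, old.id) as id,
    case
        when old.id is not null and new.id is not null then 'in both: update old row'
        when new.id is not null then 'new only: insert'
        else 'old only: keep'
    end as merge_case
from
(
    select id
    from dwd_fact_coupon_use
    where dt in
    (
        select date_format(get_time,'yyyy-MM-dd')
        from ods_coupon_use
        where dt='2020-10-30'
    )
)old
full outer join
(
    select id
    from ods_coupon_use
    where dt='2020-10-30'
)new
on old.id=new.id;
```

Here coalesce(new.id, old.id) returns the same value as if(new.id is null,old.id,new.id) in the load statement.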

How are partitions chosen when updating data?

There are two options: dynamic partitioning and static partitioning.

With dynamic partitioning the target partition is not fixed in advance: each row goes to the partition given by its own dt value, so one statement can write to several partitions. With static partitioning the target partition is named explicitly in the statement.
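
To make the contrast concrete, here is a sketch of both forms. It writes to coupon_use_demo, a hypothetical scratch table introduced only for this illustration, so that the real fact table is not overwritten without the merge step.

```sql
-- Hypothetical scratch table with the same columns and dt partitioning
-- as the fact table (assumption: exists only for this illustration).
create table if not exists coupon_use_demo like dwd_fact_coupon_use;

-- Static partition: the target partition is named in the statement,
-- so the select list contains only the eight data columns.
insert overwrite table coupon_use_demo partition(dt='2020-10-30')
select id, coupon_id, user_id, order_id, coupon_status,
       get_time, using_time, used_time
from ods_coupon_use
where dt='2020-10-30';

-- Dynamic partition: the last select expression supplies dt for each row,
-- so a single statement can write to several partitions.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table coupon_use_demo partition(dt)
select id, coupon_id, user_id, order_id, coupon_status,
       get_time, using_time, used_time,
       date_format(get_time,'yyyy-MM-dd')
from ods_coupon_use
where dt='2020-10-30';
```

The nonstrict mode is needed in the second statement because no partition column is given a static value.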

The SQL above partitions by dt. Which rows belong in each dt?

dt is derived from the coupon's receipt time, get_time. The using_time and used_time values that arrive on a given day can belong to a coupon that was received on an earlier day, and they have to be written back to the row that already holds that coupon's get_time. For that reason, all rows with the same get_time date are kept together in one partition.

Specifically, the table is partitioned so that each partition stores the coupons received on that day. The data pulled each day carries three timestamps: get_time, using_time, and used_time. A newly received coupon goes straight into the current day's partition, while using_time and used_time are used to update the existing row in the partition where that coupon's get_time places it.

How is the data updated?

The data in the old table is updated with the data in the new table. The new data comes from the current day's partition of ods_coupon_use.

The old data consists of the partitions that contain rows being modified: dt in (select date_format(get_time,'yyyy-MM-dd') from ods_coupon_use where dt='2020-10-30'). For rows that need updating, the old values are replaced by the corresponding new values.

Rows that exist in new but not in old are inserted into the new partition. With the full outer join, the logic above has to be decided for each row, using if...

 

 


Origin blog.csdn.net/Poolweet_/article/details/109472771