Design of Zipper Table Based on MaxCompute

Abstract: Simple zipper table design

Background information:

In the process of designing a data warehouse model, the following requirements often appear together:
  1. The amount of data is relatively large;
  2. Some fields in the table are updated over time, such as the user's address, product description, order status, or mobile phone number;
  3. You need to view a historical snapshot at a certain point in time or over a time period (for example, check the status of an order at some point in the past, or check how many times a user's record was updated during a past period);
  4. The proportion and frequency of changes are not very large. For example, out of a total of 10 million members, only about 100,000 are added or changed each day. If you keep a full copy of this table every day, each copy stores a large amount of unchanged information, which is a huge waste of storage.
 


To sum up: introducing a 'zipper history table' both satisfies the need to query the historical status of the data and saves storage to the greatest extent.
(Note: Inside Alibaba, the practice largely trades storage for computation to improve development efficiency and ease of use, because the cost of storage today is far lower than that of CPU and memory. Therefore, Alibaba internally adopts the snapshot approach: full daily snapshots are kept and stored with extreme storage, whose compression rate is high; in suitable scenarios the data can be compressed to about 1/30 of its original size.)

Demo data


The following only demonstrates how to implement a zipper table in MaxCompute, based on a few assumptions:
  ● The same order has at most one status on the same day;
  ● The scenario simulated with the data of 20150821 and earlier is a simple one, in which the same order does not appear with two statuses;
  ● The data source is Alibaba Cloud RDS for MySQL, in a table named orders (a hypothetical sketch of it follows).
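
For reference, a minimal sketch of what such a source table in RDS for MySQL might look like (the column names mirror the MaxCompute tables below; the rows and statuses are made up for illustration):

-- Hypothetical source table; one row per order, updated in place
CREATE TABLE orders (
    orderid      BIGINT      NOT NULL PRIMARY KEY,
    createtime   VARCHAR(8)  NOT NULL,  -- yyyymmdd the order was created
    modifiedtime VARCHAR(8)  NOT NULL,  -- yyyymmdd of the latest status change
    o_status     VARCHAR(32) NOT NULL   -- current order status, e.g. 'created', 'paid'
);

-- Made-up demo rows; order 8 will change status again on later days
INSERT INTO orders (orderid, createtime, modifiedtime, o_status) VALUES
    (1, '20150818', '20150818', 'created'),
    (8, '20150821', '20150821', 'created');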


Create a MaxCompute table


--ODS layer: incremental data table of orders, partitioned by day, storing daily incremental data

CREATE TABLE ods_orders_inc_d
(
orderid BIGINT
,createtime STRING
,modifiedtime STRING
,o_status STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 7;
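
If you just want to walk through the logic without configuring a data synchronization task, you can hand-load a few rows into an ODS partition instead; a sketch with made-up values:

-- Illustrative only: simulate one day's extracted increment for 20150822
INSERT INTO TABLE ods_orders_inc_d PARTITION (dt = '20150822')
VALUES (8, '20150821', '20150822', 'paid')
      ,(9, '20150822', '20150822', 'created');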

--DW Layer: historical data zipper table, storing the historical status data of the order

CREATE TABLE dw_orders_his_d
(
orderid BIGINT COMMENT 'Order ID'
,createtime STRING COMMENT 'Order creation time'
,modifiedtime STRING COMMENT 'Order modification time'
,o_status STRING COMMENT 'Order status'
,dw_start_date STRING COMMENT 'Order life cycle start time'
,dw_end_date STRING COMMENT 'Order life cycle end time'
);
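
To make the two life-cycle columns concrete: a row is 'open' while dw_end_date is '99991231', and it is closed by setting dw_end_date to the day before the next status takes effect. For a hypothetical order 8 whose status changes on three consecutive days, the zipper table should end up looking like the comment below (the statuses are made up):

-- Expected contents for order 8 after the 20150823 run (illustrative):
--   orderid  o_status  dw_start_date  dw_end_date
--   8        created   20150821       20150821
--   8        paid      20150822       20150822
--   8        shipped   20150823       99991231
SELECT orderid, o_status, dw_start_date, dw_end_date
FROM dw_orders_his_d
WHERE orderid = 8
ORDER BY dw_start_date
LIMIT 100;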

Implementation ideas


  ● Full initialization: synchronize the full historical data of 2015-08-21 and earlier into ODS, then flush it in full into the DW layer.
  ● Incremental update: flush the whole-day incremental data of 2015-08-22 and 2015-08-23 incrementally into the downstream DW zipper table.

Full initialization


  1. Create a node task: data synchronization.
  2. Select the scheduling type: manual scheduling.
  3. Configure the data synchronization task: MySQL: orders --> ODPS: ods_orders_inc_d.
  4. Configure the where condition: modifiedtime <= '20150821'.
  5. Set the partition value to dt=20150821.

Submit the task to the scheduling system. After the data synchronization task executes successfully, flush the ODS data into the DW layer.
Create SQL script:

INSERT overwrite TABLE dw_orders_his_d
SELECT orderid,createtime,modifiedtime,o_status,createtime AS dw_start_date,'99991231' AS dw_end_date
FROM ods_orders_inc_d
WHERE dt = '20150821';
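
After the initialization script finishes, a quick sanity check (just a sketch) is to confirm that every row is still open, i.e. its dw_end_date is '99991231':

-- Expected to return 0 right after full initialization
SELECT COUNT(*) AS closed_rows
FROM dw_orders_his_d
WHERE dw_end_date <> '99991231';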


Incrementally extract and generate the zipper table


  1. Create a workflow task and select Periodic scheduling.
  2. Drag in the data synchronization node task and the SQL task in turn.
  3. The where condition is configured in the data synchronization task as: modifiedtime=${bdp.system.bizdate}
  4. The target table ods_orders_inc_d partition is configured as dt=${bdp.system.bizdate}
  5. Configure the SQL node, which is the downstream node of the data synchronization node.

--Refresh the DW zipper table from the existing DW historical data plus the ODS incremental data
INSERT OVERWRITE TABLE dw_orders_his_d
SELECT a0.orderid, a0.createtime, a0.modifiedtime, a0.o_status, a0.dw_start_date, a0.dw_end_date
FROM (
    -- Open a window per order record, then sort by life-cycle end date in descending order; this supports rerunning
    SELECT a1.orderid, a1.createtime, a1.modifiedtime, a1.o_status, a1.dw_start_date, a1.dw_end_date
    , ROW_NUMBER() OVER (DISTRIBUTE BY a1.orderid, a1.createtime, a1.modifiedtime, a1.o_status SORT BY a1.dw_end_date DESC) AS nums
    FROM (
        -- Match the historical data against the incremental data of the 22nd: a row that also exists in the new data of the 22nd
        -- and whose end_date is later than the current date has changed status, so close its life cycle
        -- (make it expire yesterday), then union in the latest incremental data
        SELECT a.orderid, a.createtime, a.modifiedtime, a.o_status, a.dw_start_date
            , CASE
                WHEN b.orderid IS NOT NULL AND a.dw_end_date > '${bdp.system.bizdate}' THEN '${yesterday}'
                ELSE a.dw_end_date
            END AS dw_end_date
        FROM dw_orders_his_d a
        LEFT OUTER JOIN (
            SELECT *
            FROM ods_orders_inc_d
            WHERE dt = '${bdp.system.bizdate}'
        ) b
        ON a.orderid = b.orderid
        UNION ALL
        -- Flush the incremental data of 2015-08-22 into the DW
        SELECT orderid, createtime, modifiedtime, o_status, modifiedtime AS dw_start_date
            , '99991231' AS dw_end_date
        FROM ods_orders_inc_d
        WHERE dt = '${bdp.system.bizdate}'
    ) a1
) a0
-- Keep only the first row per window, so that an order's '99991231' life-cycle record is not written twice when the script is rerun
WHERE a0.nums = 1;

Note: When testing, select 20150822 as the business date. You can also flush the incremental data of 20150822 and 20150823 into the DW directly by backfilling data. In the SQL above, ${yesterday} is a custom variable whose value is ${yyyymmdd-1}.
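
If you prefer not to declare a custom variable, roughly the same value can be derived with MaxCompute date functions (a sketch; it assumes the business date is substituted as a yyyymmdd string):

-- Illustrative: compute the day before the business date as a yyyymmdd string
SELECT TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}', 'yyyymmdd'), -1, 'dd'), 'yyyymmdd') AS yesterday
FROM dw_orders_his_d
LIMIT 1;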


How to use the zipper table


  ● View the full historical snapshot data of a certain day.

SELECT *
FROM dw_orders_his_d
WHERE dw_start_date <= '20150822'
    AND dw_end_date >= '20150822'
ORDER BY orderid
LIMIT 10000;

  ● Get a set of change records over a period of time, such as records changed in 20150822-20150823.

SELECT *
FROM dw_orders_his_d
WHERE dw_start_date <= '20150823'
    AND dw_end_date >= '20150822'
ORDER BY orderid
LIMIT 10000;

  ● View the historical changes of an order.

SELECT *
FROM dw_orders_his_d
WHERE orderid = 8
ORDER BY dw_start_date;

  ● Get the latest data.

SELECT *
FROM dw_orders_his_d
WHERE dw_end_date = '99991231';
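
Because the merge script is meant to be rerun-safe, a useful check (a sketch) is that no order ends up with more than one open record:

-- Expected to return no rows: each order should have exactly one '99991231' record
SELECT orderid, COUNT(*) AS open_rows
FROM dw_orders_his_d
WHERE dw_end_date = '99991231'
GROUP BY orderid
HAVING COUNT(*) > 1;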

Rolling back the data of a certain day or a period of time based on the historical zipper table is still a relatively complicated topic, which can be discussed separately later.

Original link: https://yq.aliyun.com/articles/542146?spm=a2c41.11181499.0.0


