Implementing zipper tables in a data warehouse

Disclaimer: This is an original article by the blogger. Source: http://blog.csdn.net/silentwolfyh https://blog.csdn.net/silentwolfyh/article/details/89361785

Table of contents

First, the purpose and applications of zipper tables

Second, what a zipper table looks like

Third, a worked zipper table example

1) Table creation statements for the zipper table

2) The initial full load explained

3) Subsequent incremental loads explained

4) The complete SQL in detail



First, the purpose and applications of zipper tables

Sometimes the history of a record's states has to be preserved. A zipper table does this: it keeps every historical state while still saving storage space.

Zipper tables suit the following situation:

The table is fairly large and some of its fields change, but not very often, and the business needs statistics on those state changes. Storing a full copy of the table every day is unrealistic.

A daily full copy wastes storage space and can also make the statistics awkward to compute. This is where the zipper table proves its worth: it saves space and still meets the requirement.
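To put rough numbers on the space saving, here is a back-of-the-envelope sketch in Python. The table size, change rate, and retention period are made-up assumptions for illustration, not figures from any real system, and the sketch ignores growth of the source table.

```python
# Hypothetical workload: all three numbers are illustrative assumptions.
TOTAL_ROWS = 10_000_000      # rows in the source table
CHANGED_PER_DAY = 100_000    # rows whose state changes on a given day (1%)
DAYS = 365                   # how many days of history are kept


def full_snapshot_rows(total_rows: int, days: int) -> int:
    """Rows stored when a complete copy of the table is kept every day."""
    return total_rows * days


def zipper_rows(total_rows: int, changed_per_day: int, days: int) -> int:
    """Rows stored in a zipper table: one row per record from the initial
    load, plus one new row per state change on each following day."""
    return total_rows + changed_per_day * (days - 1)


snapshot = full_snapshot_rows(TOTAL_ROWS, DAYS)
zipper = zipper_rows(TOTAL_ROWS, CHANGED_PER_DAY, DAYS)
print(f"daily full copies: {snapshot:,} rows")
print(f"zipper table:      {zipper:,} rows")
```

Under these assumptions the daily full copies hold 3,650,000,000 rows and the zipper table 46,400,000. The gap narrows as the change rate rises, which is why the approach fits slowly changing data.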

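In a data warehouse this is usually represented, as in this example, by adding two columns: a start date and an end date (begin_date and end_date; they are named dw_start_date and dw_end_date in the tables created later in this article). The sample rows below list the order ID, creation time, modification time, status, start date, and end date. The status values 创建, 支付, and 完成 mean created, paid, and completed.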

1  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-20
1  2016-08-20  2016-08-21  支付 2016-08-21  2016-08-21
1  2016-08-20  2016-08-22  完成 2016-08-22  9999-12-31
2  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-20
2  2016-08-20  2016-08-21  完成 2016-08-21  9999-12-31
3  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-21
3  2016-08-20  2016-08-22  支付 2016-08-22  9999-12-31
4  2016-08-21  2016-08-21  创建 2016-08-21  2016-08-21
4  2016-08-21  2016-08-22  支付 2016-08-22  9999-12-31
5  2016-08-22  2016-08-22  创建 2016-08-22  9999-12-31

begin_date (dw_start_date) is the start of this record's life cycle, and end_date (dw_end_date) is the end of its life cycle.

end_date = '9999-12-31' means the record is currently active.

To query all currently active records: select * from dw_orders_his where dw_end_date = '9999-12-31'

To query the historical snapshot for 2016-08-21: select * from dw_orders_his where dw_start_date <= '2016-08-21' and dw_end_date >= '2016-08-21'
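The two queries can be tried without a Hive cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made purely for convenience. The table and column names follow the dw_orders_his table created later in this article, and the helper names current_records and snapshot are hypothetical.

```python
import sqlite3

# The ten sample rows shown above:
# (orderid, createtime, modifiedtime, status, dw_start_date, dw_end_date)
ROWS = [
    (1, "2016-08-20", "2016-08-20", "创建", "2016-08-20", "2016-08-20"),
    (1, "2016-08-20", "2016-08-21", "支付", "2016-08-21", "2016-08-21"),
    (1, "2016-08-20", "2016-08-22", "完成", "2016-08-22", "9999-12-31"),
    (2, "2016-08-20", "2016-08-20", "创建", "2016-08-20", "2016-08-20"),
    (2, "2016-08-20", "2016-08-21", "完成", "2016-08-21", "9999-12-31"),
    (3, "2016-08-20", "2016-08-20", "创建", "2016-08-20", "2016-08-21"),
    (3, "2016-08-20", "2016-08-22", "支付", "2016-08-22", "9999-12-31"),
    (4, "2016-08-21", "2016-08-21", "创建", "2016-08-21", "2016-08-21"),
    (4, "2016-08-21", "2016-08-22", "支付", "2016-08-22", "9999-12-31"),
    (5, "2016-08-22", "2016-08-22", "创建", "2016-08-22", "9999-12-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_orders_his ("
    "orderid INTEGER, createtime TEXT, modifiedtime TEXT, status TEXT, "
    "dw_start_date TEXT, dw_end_date TEXT)"
)
conn.executemany("INSERT INTO dw_orders_his VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def current_records(conn):
    """Every record that is active right now."""
    sql = ("SELECT orderid, status FROM dw_orders_his "
           "WHERE dw_end_date = '9999-12-31' ORDER BY orderid")
    return conn.execute(sql).fetchall()


def snapshot(conn, day):
    """The state of every order as of `day` (an ISO date string)."""
    sql = ("SELECT orderid, status FROM dw_orders_his "
           "WHERE dw_start_date <= ? AND dw_end_date >= ? ORDER BY orderid")
    return conn.execute(sql, (day, day)).fetchall()


print(current_records(conn))
print(snapshot(conn, "2016-08-21"))
```

Because the dates are ISO-formatted strings, plain string comparison orders them correctly. The snapshot for 2016-08-21 returns exactly one row per order that existed on that day, because the validity intervals of one order never overlap.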

Second, what a zipper table looks like

Next, a brief look at how a zipper table is updated.

Assume a granularity of one day: the last state an order reaches on a given day is taken as its final state for that day.

Take an orders table as an example. The following is the raw data, the daily order status details:

1   2016-08-20  2016-08-20  创建
2   2016-08-20  2016-08-20  创建
3   2016-08-20  2016-08-20  创建
1   2016-08-20  2016-08-21  支付
2   2016-08-20  2016-08-21  完成
4   2016-08-21  2016-08-21  创建
1   2016-08-20  2016-08-22  完成
3   2016-08-20  2016-08-22  支付
4   2016-08-21  2016-08-22  支付
5   2016-08-22  2016-08-22  创建

As a zipper table, the result we want is:

1  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-20
1  2016-08-20  2016-08-21  支付 2016-08-21  2016-08-21
1  2016-08-20  2016-08-22  完成 2016-08-22  9999-12-31
2  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-20
2  2016-08-20  2016-08-21  完成 2016-08-21  9999-12-31
3  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-21
3  2016-08-20  2016-08-22  支付 2016-08-22  9999-12-31
4  2016-08-21  2016-08-21  创建 2016-08-21  2016-08-21
4  2016-08-21  2016-08-22  支付 2016-08-22  9999-12-31
5  2016-08-22  2016-08-22  创建 2016-08-22  9999-12-31

You can see that every state of orders 1, 2, 3, and 4 is preserved, and the currently active state of each order can still be queried.
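The mapping from the raw detail rows to the zipper rows can be sketched in a few lines of Python. This is an illustration, not the Hive procedure used later in the article. It assumes at most one row per order per day (the day's final state), and the names build_zipper and day_before are hypothetical.

```python
from datetime import date, timedelta
from itertools import groupby

OPEN_END = "9999-12-31"  # end date that marks an active record

# Raw daily order status details: (orderid, createtime, modifiedtime, status)
DETAIL = [
    (1, "2016-08-20", "2016-08-20", "创建"),
    (2, "2016-08-20", "2016-08-20", "创建"),
    (3, "2016-08-20", "2016-08-20", "创建"),
    (1, "2016-08-20", "2016-08-21", "支付"),
    (2, "2016-08-20", "2016-08-21", "完成"),
    (4, "2016-08-21", "2016-08-21", "创建"),
    (1, "2016-08-20", "2016-08-22", "完成"),
    (3, "2016-08-20", "2016-08-22", "支付"),
    (4, "2016-08-21", "2016-08-22", "支付"),
    (5, "2016-08-22", "2016-08-22", "创建"),
]


def day_before(day: str) -> str:
    """Return the ISO date one day earlier than `day`."""
    return (date.fromisoformat(day) - timedelta(days=1)).isoformat()


def build_zipper(detail):
    """Turn detail rows into zipper rows with a start and end date.

    Each state starts on its modifiedtime and ends the day before the
    order's next state starts; the latest state stays open-ended.
    """
    rows = sorted(detail, key=lambda r: (r[0], r[2]))
    result = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        versions = list(group)
        for i, (orderid, created, modified, status) in enumerate(versions):
            if i + 1 < len(versions):
                end = day_before(versions[i + 1][2])
            else:
                end = OPEN_END
            result.append((orderid, created, modified, status, modified, end))
    return result


for row in build_zipper(DETAIL):
    print(*row)
```

Order 3 shows why the end date is derived from the next state: it had no change on 2016-08-21, so its first record stays valid through that day and ends on 2016-08-21 rather than 2016-08-20.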

Third, a worked zipper table example

1) Table creation statements for the zipper table

This example uses Hive. It only aims to get the logic working and does not consider performance.

First, create the tables:

-- Source table holding the raw order status details
CREATE TABLE orders (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- ODS incremental table, one partition per day
CREATE TABLE ods_orders_inc (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- DW zipper (history) table
CREATE TABLE dw_orders_his (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING,
    dw_start_date STRING,
    dw_end_date STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2) The initial full load explained

The first step is a full load. We start with the data for 2016-08-20.

Initialization: load the first batch of data, for 2016-08-20, into the ODS table.

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2016-08-20')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime < '2016-08-21' and modifiedtime <'2016-08-21';

Then load it into the DW table:

INSERT overwrite TABLE dw_orders_his
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2016-08-20';

The result is as follows:

select * from dw_orders_his;
OK
1  2016-08-20  2016-08-20  创建 2016-08-20  9999-12-31
2  2016-08-20  2016-08-20  创建 2016-08-20  9999-12-31
3  2016-08-20  2016-08-20  创建 2016-08-20  9999-12-31

3) Subsequent incremental loads explained

The remaining days are loaded incrementally:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2016-08-21')
SELECT orderid,createtime,modifiedtime,status
FROM orders
-- In the sample data a newly created order has modifiedtime equal to createtime,
-- so this single filter picks up both new and changed orders.
WHERE modifiedtime = '2016-08-21';
 
select * from ods_orders_inc where day='2016-08-21';
OK
1  2016-08-20  2016-08-21  支付 2016-08-21
2  2016-08-20  2016-08-21  完成 2016-08-21
4  2016-08-21  2016-08-21  创建 2016-08-21
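The extraction step amounts to selecting the detail rows whose modification date matches the partition being loaded. A minimal Python sketch follows, with an in-memory list standing in for the orders table; the function name extract_increment is hypothetical.

```python
# Raw order status details standing in for the `orders` table:
# (orderid, createtime, modifiedtime, status)
ORDERS = [
    (1, "2016-08-20", "2016-08-20", "创建"),
    (2, "2016-08-20", "2016-08-20", "创建"),
    (3, "2016-08-20", "2016-08-20", "创建"),
    (1, "2016-08-20", "2016-08-21", "支付"),
    (2, "2016-08-20", "2016-08-21", "完成"),
    (4, "2016-08-21", "2016-08-21", "创建"),
    (1, "2016-08-20", "2016-08-22", "完成"),
    (3, "2016-08-20", "2016-08-22", "支付"),
    (4, "2016-08-21", "2016-08-22", "支付"),
    (5, "2016-08-22", "2016-08-22", "创建"),
]


def extract_increment(orders, day):
    """Rows for partition `day`: orders created or changed on that day."""
    return [row for row in orders if row[2] == day]


for day in ("2016-08-21", "2016-08-22"):
    print(day, [(r[0], r[3]) for r in extract_increment(ORDERS, day)])
```

For 2016-08-21 this selects orders 1, 2, and 4, matching the partition contents shown above.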

4) The complete SQL in detail

Load the incremental table first, then join it with the history table into a temporary table, and finally insert the result into the new table:
1. Determine the expired records.
2. Determine the active records.
3. Combine the two with UNION ALL: http://www.w3school.com.cn/sql/sql_union.asp

DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
    -- Determine expired records: close the active row of every order that
    -- appears in the 2016-08-21 increment, ending it on the previous day
    SELECT a.orderid,
    a.createtime,
    a.modifiedtime,
    a.status,
    a.dw_start_date,
    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2016-08-21' THEN '2016-08-20' ELSE a.dw_end_date END AS dw_end_date
    FROM dw_orders_his a
    left outer join (SELECT * FROM ods_orders_inc WHERE day = '2016-08-21') b
    ON (a.orderid = b.orderid)

    UNION ALL

    -- Determine active records: every row in the increment becomes a new active row
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS dw_start_date,
    '9999-12-31' AS dw_end_date
    FROM ods_orders_inc
    WHERE day = '2016-08-21'

) x
ORDER BY orderid,dw_start_date;
 
INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;

Following the same steps, load the data for 2016-08-22 as well. The final result is as follows:

select * from dw_orders_his;
OK
1  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-20
1  2016-08-20  2016-08-21  支付 2016-08-21  2016-08-21
1  2016-08-20  2016-08-22  完成 2016-08-22  9999-12-31
2  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-20
2  2016-08-20  2016-08-21  完成 2016-08-21  9999-12-31
3  2016-08-20  2016-08-20  创建 2016-08-20  2016-08-21
3  2016-08-20  2016-08-22  支付 2016-08-22  9999-12-31
4  2016-08-21  2016-08-21  创建 2016-08-21  2016-08-21
4  2016-08-21  2016-08-22  支付 2016-08-22  9999-12-31
5  2016-08-22  2016-08-22  创建 2016-08-22  9999-12-31

At this point, we get the data we want.
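The daily merge can also be replayed outside Hive to check the logic. The Python sketch below mirrors the two branches of the UNION ALL: it closes the active record of every order that appears in the day's increment, using the day before the increment date as the end date (which is what the result table above shows), and then appends the increment rows as new active records. The function names are hypothetical, and an in-memory list stands in for dw_orders_his.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # end date that marks an active record

# One increment per day: (orderid, createtime, modifiedtime, status)
INCREMENTS = {
    "2016-08-20": [
        (1, "2016-08-20", "2016-08-20", "创建"),
        (2, "2016-08-20", "2016-08-20", "创建"),
        (3, "2016-08-20", "2016-08-20", "创建"),
    ],
    "2016-08-21": [
        (1, "2016-08-20", "2016-08-21", "支付"),
        (2, "2016-08-20", "2016-08-21", "完成"),
        (4, "2016-08-21", "2016-08-21", "创建"),
    ],
    "2016-08-22": [
        (1, "2016-08-20", "2016-08-22", "完成"),
        (3, "2016-08-20", "2016-08-22", "支付"),
        (4, "2016-08-21", "2016-08-22", "支付"),
        (5, "2016-08-22", "2016-08-22", "创建"),
    ],
}


def day_before(day: str) -> str:
    """Return the ISO date one day earlier than `day`."""
    return (date.fromisoformat(day) - timedelta(days=1)).isoformat()


def apply_increment(history, increment, day):
    """Merge one day's increment into the zipper history."""
    changed = {row[0] for row in increment}
    closing = day_before(day)
    merged = []
    # Branch 1: expire the active record of each order that changed today.
    for orderid, created, modified, status, start, end in history:
        if orderid in changed and end > day:
            end = closing
        merged.append((orderid, created, modified, status, start, end))
    # Branch 2: every increment row becomes a new active record.
    for orderid, created, modified, status in increment:
        merged.append((orderid, created, modified, status, modified, OPEN_END))
    return sorted(merged, key=lambda r: (r[0], r[4]))


history = []  # starting empty makes the first day behave like the full load
for day in sorted(INCREMENTS):
    history = apply_increment(history, INCREMENTS[day], day)

for row in history:
    print(*row)
```

Replaying the three days yields the same ten rows as the final result above. On the first day the start date comes from modifiedtime rather than createtime as in the initial load SQL; the two are equal for the sample data.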

Fourth, how zipper tables are implemented at work (can be skipped)

1. The final data table of a zipper table is a single table, and a join operation is always required.
2. Zipper tables without encryption or decryption:
1) stage must be joined with dwd, and the field names in stage and dwd are the same.
2) The data in tmp.dwd is loaded into dwd by overwrite; dwd is the final result table.
3. Zipper tables with encryption and decryption:
1) stage is joined with Tmp.Result_dwd (Tmp.Result_ewd), but encryption changes the field name (for example, with SHA2 the field name field becomes field_sha2), so Tmp.Result_dwd (Tmp.Result_ewd) keeps both the original field name and the encrypted field name.
2) tmp.dwd and tmp.ewd are temporary tables produced by joining stage with Tmp.Result_dwd (Tmp.Result_ewd).
3) Tmp.Result_dwd (Tmp.Result_ewd) stores both the encrypted and the unencrypted fields, which are then stored into dwd and ewd.
