Big data: slowly changing dimensions

Preface

Business scenarios involve many dimension tables, and these tables are not static. For dimensions whose history does not matter, you can simply overwrite the old data when it changes. But if you need to keep historical versions, you can add two fields that mark the validity window of each dimension row: a start date and an end date. You can also add a flag that marks whether a row is the latest version of the dimension data.

Because data on HDFS is immutable, Hive and Spark SQL cannot UPDATE rows the way traditional databases can. Changing data is usually handled with a zipper table, but dimension tables tend to be small — from a handful of rows up to a few hundred thousand — so maintaining a full zipper table for them is rather overkill.

Therefore, changing dimension tables are usually handled with the slowly changing dimension approach.

Approach

Update logic

When a day's new data arrives from the DWD layer to be merged into the DWS layer, and the primary key is key, the rows in dwd and dws fall into five cases:

  1. The dwd row is identical to the current dws row: no change is needed.
  2. The dwd row differs from the current dws row: mark the dws row as expired and insert the dwd row into dws.
  3. Expired (historical) rows in dws: carry them over to dws unchanged.
  4. Rows that are new in dwd, i.e. keys that do not exist in dws: insert them into dws.
  5. Keys that exist in dws but are missing from this dwd batch: carry them over to dws unchanged.

Here's an example:

The following data exists in dws:

key data start_date end_date is_current_flag
1 a 2020-08-01 9999-12-31 1
2 b 2020-08-01 9999-12-31 1
3 c 2020-08-01 2020-08-14 0
3 d 2020-08-15 9999-12-31 1

The dwd data for the day (assume the current date is 2020-08-26):

key data
2 b
3 e
4 f

After the merge, the dws table for that day should contain:

key data start_date end_date is_current_flag Remarks
1 a 2020-08-01 9999-12-31 1 Case 5
2 b 2020-08-01 9999-12-31 1 Case 1
3 c 2020-08-01 2020-08-14 0 Case 3
3 d 2020-08-15 2020-08-25 0 Case 2
3 e 2020-08-26 9999-12-31 1 Case 2
4 f 2020-08-26 9999-12-31 1 Case 4

Analysis:

key 1: already exists in the dws layer, and this dwd batch contains no row with key 1, so the row is carried over to dws unchanged (case 5).

key 2: already exists in the dws layer, and the dwd row for key 2 is exactly the same as the dws row, so nothing needs to change; the row is kept in dws as-is (case 1).

key 3: the row whose data is c has is_current_flag = 0, meaning it is historical data, and it is carried over to dws unchanged (case 3). The row whose data is d has is_current_flag = 1 in dws, but dwd contains a row with key 3 whose data differs, meaning this dws row is about to be superseded: its end_date is set to yesterday and its is_current_flag to 0 (case 2). The row whose data is e comes from dwd and differs from the current dws row for key 3, so it is inserted into dws with start_date set to today, end_date set to the maximum date, and is_current_flag set to 1 (case 2).

key 4: no row with key 4 exists in dws, so this is a brand-new dimension record. It is inserted into dws with start_date set to today, end_date set to the maximum date, and is_current_flag set to 1 (case 4).

Comparing rows for changes

As described above, implementing a slowly changing dimension requires deciding, for rows with the same key, whether the remaining fields are identical. With only a few fields you can compare them one by one with CASE WHEN, but what if there are hundreds of fields?

Consider computing an md5 or hash over all fields except key, start_date, end_date, and is_current_flag; comparing the two results is then equivalent to comparing all of the fields at once.
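
A minimal sketch of the idea (hypothetical tables a and b that share a key column and business columns c1, c2, c3 — not the actual tables used later):

-- Rows whose key matches but whose content differs
SELECT a.key
FROM a
JOIN b ON a.key = b.key
WHERE HASH(a.c1, a.c2, a.c3) <> HASH(b.c1, b.c2, b.c3);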

I chose hash, because md5 over null values does not yield comparable results: if the data contains nulls, you cannot judge whether two rows are consistent.
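
For instance, the null behavior can be seen with a quick check (as observed in Spark SQL; worth verifying on your version):

SELECT HASH(NULL);  -- deterministic: a null input folds into the seed, returning 42
SELECT MD5(NULL);   -- NULL, and since NULL <> NULL is not true, the comparison breaks on nulls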

However, hash comes with a problem of its own, which I will discuss at the end.

Implementation

Prepare data

Create the tables

create table ldltmp.test_scd_dwd
(
    id int,
    name string
)
stored as parquet;

create table ldltmp.test_scd_dws
(
    id int,
    name string,
    start_date string,
    end_date string,
    is_current_flag tinyint
)
stored as parquet;

create table ldltmp.test_scd_dws_bak
(
    id int,
    name string,
    start_date string,
    end_date string,
    is_current_flag tinyint
)
PARTITIONED BY (scd_date string)
stored as parquet;

Insert data

INSERT OVERWRITE TABLE ldltmp.test_scd_dwd
SELECT 1, 'a1'
UNION ALL
SELECT 2, 'b4'
UNION ALL
SELECT 3, 'c1'
UNION ALL
SELECT 4, 'd'
UNION ALL
SELECT 5, 'e'
UNION ALL
SELECT 7, 'g';

INSERT OVERWRITE TABLE ldltmp.test_scd_dws
SELECT 1, 'a', '2020-01-01', '9999-12-31', 1
UNION ALL
SELECT 2, 'b', '2020-01-01', '2020-01-07', 0
UNION ALL
SELECT 3, 'c', '2020-01-01', '9999-12-31', 1
UNION ALL
SELECT 4, 'd', '2020-01-01', '9999-12-31', 1
UNION ALL
SELECT 6, 'f', '2020-01-01', '9999-12-31', 1
UNION ALL
SELECT 2, 'b1', '2020-01-08', '9999-12-31', 1;

Code

Read all of the dws data and register it as a temporary view named dws.

Compute the dwd data (including its hash) and register it as a temporary view named dwd.

Inner join the dws and dwd views with the condition dws.id = dwd.id AND dws.hash_all <> dwd.hash_all AND dws.is_current_flag = 1. From the join result, take the dwd columns plus the three SCD fields (start_date, end_date, is_current_flag) to build the view dws_modify, which holds the new versions of the dws keys that are about to be updated.

 

In addition, because this dimension table is not partitioned, the whole table is overwritten on every run; if the data ever comes out wrong once, all earlier history becomes unreliable. A backup table, partitioned by day, is therefore used to keep daily snapshots: whenever the dimension table is updated, the same result is written into that day's partition of the backup table. To keep the two writes consistent, a WITH ... AS clause is used to write the result into both tables in a single statement. Partitions older than one month are cleaned up periodically, so that if the data ever goes wrong you can at least restore any state from within the past month.
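
The code below leans on two environment-specific features, so depending on your setup you may need configuration along these lines (the settings are an assumption about the environment, not part of the original pipeline): the backtick-quoted column patterns such as `(id)?+.+` use Spark SQL's regex column selection, and the write into the backup table uses dynamic partitioning.

-- Interpret backtick-quoted identifiers in SELECT as column regexes (Spark SQL)
SET spark.sql.parser.quotedRegexColumnNames=true;
-- Allow dynamic-partition writes into the backup table (Hive settings)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

The periodic cleanup of month-old backup partitions can be a statement like the following sketch (the cutoff date is a placeholder; Hive accepts comparison operators in DROP PARTITION, while Spark SQL may only accept exact partition values):

ALTER TABLE ldltmp.test_scd_dws_bak DROP IF EXISTS PARTITION (scd_date < '2020-07-26');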

-- Create the dws temporary view
CREATE OR REPLACE TEMPORARY VIEW dws
AS
    SELECT *, HASH(`(id|start_date|end_date|is_current_flag)?+.+`) AS hash_all FROM ldltmp.test_scd_dws
;   

-- Create the dwd temporary view
CREATE OR REPLACE TEMPORARY VIEW dwd
AS
    SELECT *, HASH(`(id)?+.+`) AS hash_all FROM ldltmp.test_scd_dwd
;

-- From the dws and dwd views, build the view of dws rows that are about to be replaced
-- (keys present in both, where dws.is_current_flag = 1 and the contents differ)
CREATE OR REPLACE TEMPORARY VIEW dws_modify
AS
    SELECT dwd.`(hash_all)?+.+`, DATE_FORMAT(CURRENT_TIMESTAMP(), 'yyyy-MM-dd') AS start_date, '9999-12-31' AS end_date, 1 AS is_current_flag
    FROM dws INNER JOIN dwd ON dws.id = dwd.id AND dws.hash_all <> dwd.hash_all AND dws.is_current_flag = 1
;


WITH scd_table AS
(
    -- Case 4: keys that exist only in dwd
    SELECT dwd.`(hash_all)?+.+`, DATE_FORMAT(CURRENT_TIMESTAMP(), 'yyyy-MM-dd') AS start_date, '9999-12-31' AS end_date, 1 AS is_current_flag
    FROM dwd LEFT JOIN dws ON dwd.id = dws.id
    WHERE dws.id IS NULL
    UNION ALL
    -- Cases 1, 3, and 5: rows not being updated, plus historical rows
    SELECT dws.`(hash_all)?+.+`
    FROM dws LEFT JOIN dws_modify ON dws.id = dws_modify.id
    WHERE dws_modify.id IS NULL OR dws.is_current_flag = 0
    UNION ALL
    -- Case 2, dws side: expire the current rows that are being replaced
    SELECT dws.`(end_date|is_current_flag|hash_all)?+.+`, DATE_FORMAT(DATE_SUB(CURRENT_TIMESTAMP(), 1), 'yyyy-MM-dd') AS end_date, 0 AS is_current_flag
    FROM dws INNER JOIN dws_modify ON dws.id = dws_modify.id
    WHERE dws.is_current_flag = 1
    UNION ALL
    -- Case 2, dwd side: the new current rows
    SELECT * FROM dws_modify
)
FROM scd_table
-- Write to the dimension table
INSERT OVERWRITE TABLE ldltmp.test_scd_dws SELECT *
-- Write to the backup table (dynamic partition on scd_date)
INSERT OVERWRITE TABLE ldltmp.test_scd_dws_bak PARTITION (scd_date) SELECT *, DATE_FORMAT(CURRENT_TIMESTAMP(), 'yyyy-MM-dd')
;
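
After a run, a quick sanity check like the following (my own check, not part of the original post) shows whether each key landed in the expected case:

SELECT * FROM ldltmp.test_scd_dws ORDER BY id, start_date;

With the sample data above, keys 1, 2, and 3 should each end up with the old current row expired (end_date set to yesterday, is_current_flag 0) and a new current row (a1, b4, c1), key 4 should be unchanged, key 6 should be carried over as-is, and keys 5 and 7 should appear as brand-new current rows.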

When applying the code above to another table, only the three view definitions need real changes; in the insert logic you only need to change the table names and the join keys.

Pitfalls

After the first run, the dimension table held 300,000+ rows. I then ran the exact same data again. Since the data was identical, nothing should have been modified and the row count should have stayed the same, but after the run it had grown to 600,000+ rows. Fully doubled.

Taking one key and comparing the hash values of its data in dws and dwd showed that the hash values differed even though the data was exactly the same.

I suspected that inconsistent data types were causing identical data to hash to different values.

For example, hash(5) and hash('5') return different results.
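
This is quick to confirm (a hypothetical check in Spark SQL; worth verifying on your version):

SELECT HASH(5), HASH('5'), HASH(CAST('5' AS INT));
-- the first and third results match; the second differs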

However, the fields dwd reads from the source table have the same types as those in dws. My guess is that after operations such as join or sum, Spark converts values to string internally and converts them back to the declared types only when the table is written. So, when building the dwd view, first CAST the non-string fields explicitly, for example:

CREATE OR REPLACE TEMPORARY VIEW dwd
AS
SELECT *, HASH(`(material_code)?+.+`) AS hash_all
FROM
(
  select
  CASE WHEN core.material_code REGEXP '^[0-9]*$' then ltrim('0', core.material_code) else core.material_code  end as material_code,
  info.itemname as material_name_cn,
  core.material_name_en as material_name_en,
     ......
  cast(core.gross_contents as double)  AS gross_contents,
  cast(core.gross_weight as double)    AS gross_weight,
  cast(core.net_contents as double)    AS net_contents,
  cast(core.net_weight as double)      AS net_weight,
  core.industry_standard_description   ,
  cast(core.volume as double)          AS volume,
  cast(core.material_height as double) AS material_height,
  cast(core.material_length as double) AS material_length,
  core.purchasing_valuekey             ,
  cast(core.material_width as double)  AS material_width,
  core.tax_classification_of_material  ,
    ......
  from ldldws.dim_core_material core
    left join
    (
      ...
    )info
      on CASE WHEN core.material_code REGEXP '^[0-9]*$' then ltrim('0', core.material_code) else core.material_code  end = info.itemunitcode
  --core join info
    left join
    (
     ...
    )grp
      on core.material_group_code = grp.material_group_code
  --join material group
     ...
     ...
     ...
;

After this change, the hash values matched those computed from dws, and the problem was solved.

 

Source: blog.csdn.net/x950913/article/details/108319140