Hive / Data Warehouse: How to generate surrogate keys in Hive

 

Background:

Dimensional modeling for data warehouses (dimension tables and fact tables) recommends using surrogate keys instead of business (entity) keys. This article explains what a surrogate key is and how to generate one in Hive as an auto-increment column.

 

Surrogate key:

Every dimension table needs a column that uniquely identifies each row. This column maintains the relationship between the dimension table and the fact table. In general, the business primary key of the dimension table can serve as the dimension primary key if it meets this condition.

 

Supplement:

A surrogate key is generated by the data warehouse process and has nothing to do with the business itself. It uniquely identifies a record in the dimension table, serves as the primary key of the dimension table, and is the link between the dimension table and the fact table.

Therefore, when a dimension table is designed with a surrogate key, the join key in the fact table is the surrogate key rather than the original business primary key. Because the relationship is maintained through the surrogate key, changes in the source system have far less impact on the data warehouse.

In practice, the surrogate key is usually a numeric, auto-increment value.
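
As a rough illustration of how a fact table refers to a dimension through the surrogate key, the query below joins a fact table to the user dimension on uid. The fact table fact_login and its columns are hypothetical and only serve this sketch; dim_user is assumed to have the layout (uid, id, name, note, source, partitioned by dt) used later in this article.

```sql
-- Hypothetical fact table: fact_login(uid BIGINT, login_time STRING), partitioned by dt.
-- Each fact row stores dim_user.uid (the surrogate key), not the source system id.
SELECT
    d.source
    ,d.name
    ,COUNT(*) AS login_cnt
FROM fact_login AS f
JOIN dim_user AS d
    ON f.uid = d.uid            -- join on the surrogate key
WHERE f.dt = '2018-06-01'
    AND d.dt = '2018-06-01'     -- dim_user is a daily snapshot, so read one partition
GROUP BY d.source, d.name
;
```

Because the join never touches the business id, two users that share id = 1 in different source systems remain distinct.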

 

 

Surrogate keys solve problems such as the following (not an exhaustive list):

1. When dimensions from multiple data sources are integrated, the business primary keys of different sources may be duplicated.

2. In a dimension zipper table (a history table that keeps every version of a record), the same entity has several records, so the business key is duplicated.

 

 

 

An example follows.

Take data from multiple data sources, for example the following two user tables from different source systems:

 

Technical department users

Table S1

id    name      note
1     Da1       Tech-leader
2     Jiang1    tech

 

Finance department users

Table S2

id    name      note
1     Tian1     Fine-1
2     Or        Fine-2

 

 

 

The two tables are integrated into the following data:

Table dim_user

uid   id    name      note          source
1     1     Da1       Tech-leader   S1
2     2     Jiang1    tech          S1
3     1     Tian1     Fine-1        S2
4     2     Or        Fine-2        S2

 

There are several ways to achieve this:

1) Generate the auto-increment column with a UDF.

2) Generate the auto-increment key with Hive SQL itself.
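
For method 1, one possible option is the UDFRowSequence class from the hive-contrib module. The sketch below is not taken from the original article: the jar path is a placeholder that depends on your installation, and the query assumes the incremental table Tmp_s_inc from Step 1 below with columns id, name, note and source.

```sql
-- Assumption: adjust the jar path to wherever hive-contrib lives in your installation.
ADD JAR /path/to/hive-contrib.jar;
CREATE TEMPORARY FUNCTION row_sequence
    AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- row_sequence() returns 1, 2, 3, ... for the rows processed by one task.
SELECT
    row_sequence() AS uid
    ,id
    ,name
    ,note
    ,source
FROM tmp_s_inc
WHERE dt = '2018-06-01'
;
```

The counter lives inside each task, so when the query runs with more than one task the sequences restart and the values are no longer unique. That limitation is one reason to prefer method 2.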

 

Below we focus on method 2, implementing the auto-increment key in Hive SQL.

 

Suppose we collect data from Table S1 and Table S2 incrementally.

 

Step 1: Collect data incrementally and build a daily incremental table

S1, S2  -> Tmp_s_inc

Collection SQL:

S1:

SELECT * FROM S1 WHERE created_time > '2018-06-01';

S2:

SELECT * FROM S2 WHERE created_time > '2018-06-01';

Final SQL:

INSERT OVERWRITE TABLE Tmp_s_inc PARTITION (dt = '2018-06-01')
SELECT
    S1.*
    ,'S1' AS source
FROM S1
WHERE created_time > '2018-06-01'
UNION ALL
SELECT
    S2.*
    ,'S2' AS source
FROM S2
WHERE created_time > '2018-06-01'
;

 

 

Step 2: Get the maximum Uid (surrogate key) from the previous day's partition of the dimension table. The SQL is as follows:

SELECT COALESCE(MAX(uid), 0) AS max_uid FROM dim_user WHERE dt = '2018-05-31';
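
The COALESCE matters on the very first run, when the previous day's partition does not exist or is empty. A small sketch of the difference, using the same dim_user table (the column alias raw_max is introduced here only for illustration):

```sql
-- An aggregate without GROUP BY returns exactly one row, even over zero input rows.
-- In that case MAX(uid) is NULL, and COALESCE turns it into 0,
-- so the first batch of surrogate keys starts at 1.
SELECT
    MAX(uid)               AS raw_max   -- NULL when the partition is empty or missing
    ,COALESCE(MAX(uid), 0) AS max_uid   -- 0 in that case
FROM dim_user
WHERE dt = '2018-05-31'
;
```

Without the COALESCE, adding NULL to ROW_NUMBER() would produce NULL surrogate keys for the whole batch.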

 

 

Step 3: Generate surrogate keys for the incremental data and write it, together with the previous day's data, into the new partition:

INSERT OVERWRITE TABLE dim_user PARTITION (dt = '2018-06-01')
SELECT
    ROW_NUMBER() OVER (ORDER BY tb.id) + ta.max_uid AS uid
    ,tb.id
    ,tb.name
    ,tb.note
    ,tb.source
FROM tmp_s_inc AS tb
CROSS JOIN
(
    -- single-row subquery: the largest surrogate key assigned so far
    SELECT COALESCE(MAX(uid), 0) AS max_uid FROM dim_user WHERE dt = '2018-05-31'
) AS ta
WHERE tb.dt = '2018-06-01'
UNION ALL
SELECT
    uid
    ,id
    ,name
    ,note
    ,source
FROM dim_user
WHERE dt = '2018-05-31'
;

 

 

 

 

Further notes:

 

CROSS JOIN in Hive:

CROSS JOIN in Hive produces a Cartesian product. Use it with caution: unless there is a specific need and the data volume is small, the job may take a very long time to return a result or fail to run at all. In the SQL above it is safe because the subquery on one side of the join returns exactly one row.

 

JOIN in Hive

Optimization tip:

Specify the join key in the ON clause rather than in WHERE. Otherwise Hive may compute the Cartesian product first and filter it afterwards.
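
The difference can be sketched with two queries that return the same rows. The tables fact_login(uid, login_time) and dim_user(uid, name) are hypothetical here, and how much the second form costs depends on the Hive version and optimizer settings.

```sql
-- Preferred: the join key is in ON, so rows are matched during the join itself.
SELECT f.login_time, d.name
FROM fact_login AS f
JOIN dim_user AS d
    ON f.uid = d.uid
;

-- Avoid: there is no join key in ON; the match condition only appears in WHERE,
-- so the plan may build every (f, d) pair before filtering.
SELECT f.login_time, d.name
FROM fact_login AS f
CROSS JOIN dim_user AS d
WHERE f.uid = d.uid
;
```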

 

ROW_NUMBER() OVER() in Hive

Reference article: https://blog.csdn.net/u010003835/article/details/88179677

ROW_NUMBER() OVER ([PARTITION BY column_a] ORDER BY column_b [ASC | DESC])

This function numbers the rows within each partition according to the given order. When no PARTITION BY is specified, all rows are numbered consecutively starting from 1.
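
A small self-contained sketch of both forms, using four inline rows that mimic the ids and sources of the example users (the aliases rn_all and rn_in_source are made up for this illustration):

```sql
SELECT
    id
    ,source
    ,ROW_NUMBER() OVER (ORDER BY source, id)             AS rn_all        -- one sequence over all rows
    ,ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS rn_in_source  -- restarts for each source
FROM (
    SELECT 1 AS id, 'S1' AS source
    UNION ALL SELECT 2 AS id, 'S1' AS source
    UNION ALL SELECT 1 AS id, 'S2' AS source
    UNION ALL SELECT 2 AS id, 'S2' AS source
) AS t
ORDER BY source, id
;
-- id  source  rn_all  rn_in_source
-- 1   S1      1       1
-- 2   S1      2       2
-- 1   S2      3       1
-- 2   S2      4       2
```

For surrogate keys the form without PARTITION BY is the one needed, since the numbering must be unique across the whole batch.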

 

 

 

 

Related SQL:

Create the source tables and insert sample data:

CREATE TABLE IF NOT EXISTS tmp_S1 (
    id BIGINT 
    ,name STRING
    ,note STRING 
) PARTITIONED BY (
    pt STRING 
);


INSERT INTO TABLE tmp_S1 PARTITION (pt = '20190601')
VALUES (1, 's1-haha', 'CC'), (2, 's1-zk', 'CC');



CREATE TABLE IF NOT EXISTS tmp_S2 (
    id BIGINT 
    ,name STRING
    ,note STRING 
) PARTITIONED BY (
    pt STRING 
);


INSERT INTO TABLE tmp_S2 PARTITION (pt = '20190601')
VALUES (1, 's2-cx', 'CC'), (2, 's2-zk', 'CC');

 

Load the incremental table from the data sources (including the CREATE TABLE statement for the incremental table)

CREATE TABLE IF NOT EXISTS tmp_S_inc (
    id BIGINT 
    ,name STRING
    ,note STRING 
    ,source STRING
) PARTITIONED BY (
    pt STRING 
);



INSERT OVERWRITE TABLE tmp_S_inc PARTITION (pt = '20190601')
SELECT 
    tmp_s1.id
    ,tmp_s1.name
    ,tmp_s1.note
    ,'S1' AS source  
FROM tmp_s1 
WHERE pt = '20190601'
UNION ALL 
SELECT 
    tmp_s2.id
    ,tmp_s2.name
    ,tmp_s2.note
    ,'S2' AS source  
FROM tmp_s2 
WHERE pt = '20190601'
;

 

Load the target table from the incremental table (including the CREATE TABLE statement for the target table)

CREATE TABLE IF NOT EXISTS tmp_dim_S
(
    uid BIGINT,
    id BIGINT,
    name STRING,
    note STRING,
    source STRING 
)
PARTITIONED BY  
( 
    pt STRING 
);


-- SELECT COALESCE(MAX(uid), 0)
-- FROM tmp_dim_s 
-- WHERE pt = '20190531'
-- ;


INSERT OVERWRITE TABLE tmp_dim_S PARTITION (pt = '20190601')
SELECT 
  (ROW_NUMBER() OVER(ORDER BY ta.id) + max_uid) AS uid
  ,ta.*
FROM (
    SELECT 
        id
        ,name
        ,note
        ,source
    FROM tmp_S_inc
    WHERE pt = '20190601'
) AS ta
CROSS JOIN (
    SELECT COALESCE(MAX(uid), 0) AS max_uid
    FROM tmp_dim_S
    WHERE pt = '20190531' 
) AS tb 
UNION ALL 
SELECT 
    tmp_dim_S.uid
    ,tmp_dim_S.id
    ,tmp_dim_S.name
    ,tmp_dim_S.note
    ,tmp_dim_S.source
FROM tmp_dim_S
WHERE pt = '20190531'
;
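
After the load, a quick sanity check can confirm that the generated keys are unique. The queries below are a sketch against the tmp_dim_S table created above; the alias cnt is introduced only for this check.

```sql
-- Every uid should occur exactly once in the new partition,
-- so this query is expected to return no rows.
SELECT uid, COUNT(*) AS cnt
FROM tmp_dim_S
WHERE pt = '20190601'
GROUP BY uid
HAVING COUNT(*) > 1
;

-- Inspect the assigned keys next to the business ids.
SELECT uid, id, name, source
FROM tmp_dim_S
WHERE pt = '20190601'
ORDER BY uid
;
```

Note that ids 1 and 2 occur in both sources, so ORDER BY ta.id alone does not determine which source receives the lower uid. Ordering by both columns, for example ORDER BY ta.source, ta.id, would make the assignment repeatable across reruns.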

 

 

 


Origin blog.csdn.net/u010003835/article/details/104420508