Premise:
Dimensional modeling in the data warehouse, with its dimension tables and fact tables, advocates using surrogate keys instead of business (entity) keys. Below we explain the concept of surrogate keys and how to generate them in Hive as an auto-incrementing column.
Surrogate key:
A dimension table must have a column that uniquely identifies each row. This column maintains the relationship between the dimension table and the fact table. In general, the business primary key in the dimension table can serve as the dimension primary key if it meets this condition.
Supplement:
A surrogate key is generated by the data warehouse process and has nothing to do with the business itself. It is the column that uniquely identifies a record in the dimension table, serves as the primary key of the dimension table, and links the dimension table to the fact table.
Therefore, when a dimension table is designed with a surrogate key, the join key in the fact table is the surrogate key rather than the original business primary key. The relationship is maintained by the surrogate key, which shields the data warehouse from changes in the source system.
In practice, the surrogate key is usually numeric and auto-incrementing.
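As a rough illustration of a fact table joining on the surrogate key, the query below assumes a hypothetical fact table fact_order(order_id, uid, amount) that stores the surrogate key uid of a dimension table dim_user partitioned by dt. The fact table and its columns are invented for this sketch and are not part of the original example.

```sql
-- fact_order is a hypothetical fact table: (order_id, uid, amount).
-- It stores the surrogate key uid, not the source system's id.
SELECT
d.name
,d.source
,SUM(f.amount) AS total_amount
FROM fact_order AS f
JOIN dim_user AS d
ON d.uid = f.uid              -- join on the surrogate key
WHERE d.dt = '2018-06-01'     -- assumed daily partition of the dimension
GROUP BY
d.name
,d.source
;
```

Because the join never touches the business key, a change to the id format in a source system would only affect the process that loads dim_user, not this query.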
Surrogate keys solve problems such as the following (a partial list):
1. When integrating dimensions from multiple data sources, what if the business primary keys of different sources collide?
2. In a zipper (history) dimension table, the same entity has multiple records, so how should the duplicated business key be handled?
Examples are as follows:
Data from multiple data sources. For example, the following two sets of users come from different sources:
Technical department users
Table S1
Id | name | note
1 | Da1 | Tech-leader
2 | Jiang1 | tech
Finance Department users
Table S2
Id | name | note
1 | Tian1 | Fine-1
2 | Or | Fine-2
After integration, the data looks like this:
Table dim_user
Uid | id | name | note | source
1 | 1 | Da1 | Tech-leader | S1
2 | 2 | Jiang1 | tech | S1
3 | 1 | Tian1 | Fine-1 | S2
4 | 2 | Or | Fine-2 | S2
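A quick way to see why the business key alone is not enough is to look for duplicates. The two queries below are a sketch that assumes dim_user holds exactly the four rows shown above in a partition dt = '2018-06-01'; the partition value is an assumption taken from the dates used later in this article.

```sql
-- The business key id repeats across sources.
-- With the four rows above this returns (1, 2) and (2, 2).
SELECT
id
,COUNT(*) AS cnt
FROM dim_user
WHERE dt = '2018-06-01'
GROUP BY id
HAVING COUNT(*) > 1
;

-- The surrogate key Uid is unique, so this returns no rows.
SELECT
uid
,COUNT(*) AS cnt
FROM dim_user
WHERE dt = '2018-06-01'
GROUP BY uid
HAVING COUNT(*) > 1
;
```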
There are several ways to implement this:
1) Use a UDF to generate the auto-incrementing column.
2) Generate the auto-incrementing key with Hive SQL.
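Method 1 is not covered in detail in this article. A minimal sketch could use the UDFRowSequence class shipped in the hive-contrib module. The JAR path and the function name row_sequence are assumptions for illustration, and tmp_s_inc stands for an incremental table with id and name columns.

```sql
-- Hypothetical path; point this at the hive-contrib JAR of your installation.
ADD JAR /path/to/hive-contrib.jar;

-- row_sequence is a name chosen for this sketch.
CREATE TEMPORARY FUNCTION row_sequence
AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- Each task keeps its own counter starting at 1.
SELECT
row_sequence() AS uid
,id
,name
FROM tmp_s_inc
;
```

Because every task keeps its own counter, the values are only unique when the query runs in a single task; with several mappers the same number can be issued more than once.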
Below we mainly explain method 2, implementing the auto-increment key in Hive SQL.
Suppose we are collecting data from Table S1 and Table S2 incrementally.
Step 1: Collect data incrementally and build a daily incremental table
S1, S2 -> Tmp_s_inc
Collect SQL:
S1 :
SELECT * FROM S1 WHERE created_time > '2018-06-01'
S2:
SELECT * FROM S2 WHERE created_time > '2018-06-01'
Final SQL:
INSERT OVERWRITE TABLE Tmp_s_inc PARTITION (dt = '2018-06-01')
SELECT
S1.*
,'S1' AS source
FROM S1 WHERE created_time > '2018-06-01'
UNION ALL
SELECT
S2.*
,'S2' AS source
FROM S2 WHERE created_time > '2018-06-01'
Step 2: Get the maximum Uid (surrogate key) from the previous day's partition of the dimension table. The SQL is as follows:
SELECT COALESCE(MAX(Uid), 0) AS max_id FROM dim_user WHERE dt = '2018-05-31'
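Step 3: Finally, generate surrogate keys for the incremental data and combine it with the previous day's data into the new partition:
INSERT OVERWRITE TABLE dim_user PARTITION (dt = '2018-06-01')
-- New rows: number them 1..N and shift by the previous maximum key
SELECT
ROW_NUMBER() OVER(ORDER BY tb.id) + ta.max_id AS uid
,tb.id
,tb.name
,tb.note
,tb.source
FROM tmp_s_inc AS tb
CROSS JOIN
(
-- Single-row subquery, so the CROSS JOIN does not multiply rows
SELECT COALESCE(MAX(Uid), 0) AS max_id FROM dim_user WHERE dt = '2018-05-31'
) AS ta
WHERE tb.dt = '2018-06-01'
UNION ALL
-- Existing rows: carried over unchanged, without the partition column
SELECT
uid
,id
,name
,note
,source
FROM dim_user WHERE dt = '2018-05-31'
;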
Further notes:
CROSS JOIN in Hive:
CROSS JOIN in Hive produces a Cartesian product. Use it with caution, and only when there is a specific need and the data volume is not particularly large. Otherwise it is difficult to get the correct result, or the job may not finish at all. In the surrogate key SQL above it is safe because the joined subquery returns exactly one row.
Joins in Hive
Optimization tips:
The join key in Hive should be specified in the ON clause, not in WHERE; otherwise Hive computes the Cartesian product first and then filters.
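The difference can be sketched with the two tables from the Related SQL section of this article (tmp_S_inc and tmp_dim_S). Both queries are meant to return the same rows; the aliases i and d are chosen for this example.

```sql
-- Preferred: the join key is in ON, so rows are matched during the join.
SELECT
d.uid
,i.id
,i.name
FROM tmp_S_inc AS i
JOIN tmp_dim_S AS d
ON d.id = i.id
AND d.source = i.source
WHERE i.pt = '20190601'
AND d.pt = '20190601'
;

-- Discouraged: the join key is only in WHERE, which describes a
-- Cartesian product that is filtered afterwards.
SELECT
d.uid
,i.id
,i.name
FROM tmp_S_inc AS i
CROSS JOIN tmp_dim_S AS d
WHERE d.id = i.id
AND d.source = i.source
AND i.pt = '20190601'
AND d.pt = '20190601'
;
```

Depending on the Hive version and optimizer settings, the second form may be rewritten into an ordinary join, so treat the first form as a safe habit rather than a statement about every execution plan.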
ROW_NUMBER() OVER() in Hive:
Reference article: https://blog.csdn.net/u010003835/article/details/88179677
ROW_NUMBER() OVER ([PARTITION BY COLUMN_A] ORDER BY COLUMN_B ASC/DESC)
This function numbers rows in sorted order within each group. When no PARTITION BY is specified, the numbers increase sequentially over the whole result set.
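The two behaviours can be compared side by side. The sketch below assumes the tmp_S_inc table and sample rows from the Related SQL section of this article; the column aliases global_rn and rn_in_source are chosen for this example.

```sql
SELECT
source
,id
,name
-- No PARTITION BY: one running sequence over all rows
,ROW_NUMBER() OVER(ORDER BY source, id) AS global_rn
-- PARTITION BY source: numbering restarts at 1 for each source
,ROW_NUMBER() OVER(PARTITION BY source ORDER BY id) AS rn_in_source
FROM tmp_S_inc
WHERE pt = '20190601'
;
-- With the four sample rows:
-- S1 | 1 | s1-haha | 1 | 1
-- S1 | 2 | s1-zk   | 2 | 2
-- S2 | 1 | s2-cx   | 3 | 1
-- S2 | 2 | s2-zk   | 4 | 2
```

A window without PARTITION BY typically sends all rows through a single reducer, which is worth keeping in mind when the incremental data is large.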
Related SQL:
Create the source tables and load sample data:
CREATE TABLE IF NOT EXISTS tmp_S1 (
id BIGINT
,name STRING
,note STRING
) PARTITIONED BY (
pt STRING
);
INSERT INTO TABLE tmp_S1 PARTITION (pt = '20190601')
VALUES (1, 's1-haha', 'CC'), (2, 's1-zk', 'CC');
CREATE TABLE IF NOT EXISTS tmp_S2 (
id BIGINT
,name STRING
,note STRING
) PARTITIONED BY (
pt STRING
);
INSERT INTO TABLE tmp_S2 PARTITION (pt = '20190601')
VALUES (1, 's2-cx', 'CC'), (2, 's2-zk', 'CC');
Load the incremental table from the data sources (including the CREATE statement for the incremental table):
CREATE TABLE IF NOT EXISTS tmp_S_inc (
id BIGINT
,name STRING
,note STRING
,source STRING
) PARTITIONED BY (
pt STRING
);
INSERT OVERWRITE TABLE tmp_S_inc PARTITION (pt = '20190601')
SELECT
tmp_s1.id
,tmp_s1.name
,tmp_s1.note
,'S1' AS source
FROM tmp_s1
WHERE pt = '20190601'
UNION ALL
SELECT
tmp_s2.id
,tmp_s2.name
,tmp_s2.note
,'S2' AS source
FROM tmp_s2
WHERE pt = '20190601'
;
Load the target table from the incremental table (including the CREATE statement for the target table):
CREATE TABLE IF NOT EXISTS tmp_dim_S
(
uid BIGINT,
id BIGINT,
name STRING,
note STRING,
source STRING
)
PARTITIONED BY
(
pt STRING
);
-- SELECT COALESCE(MAX(uid), 0)
-- FROM tmp_dim_s
-- WHERE pt = '20190531'
-- ;
INSERT OVERWRITE TABLE tmp_dim_S PARTITION (pt = '20190601')
SELECT
(ROW_NUMBER() OVER(ORDER BY ta.id) + max_uid) AS uid
,ta.*
FROM (
SELECT
id
,name
,note
,source
FROM tmp_S_inc
WHERE pt = '20190601'
) AS ta
CROSS JOIN (
SELECT COALESCE(MAX(uid), 0) AS max_uid
FROM tmp_dim_S
WHERE pt = '20190531'
) AS tb
UNION ALL
SELECT
tmp_dim_S.uid
,tmp_dim_S.id
,tmp_dim_S.name
,tmp_dim_S.note
,tmp_dim_S.source
FROM tmp_dim_S
WHERE pt = '20190531'
;
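After the load, a simple check can confirm that the generated keys are unique and continuous. This is a sketch that assumes the statements above were run on empty tables, so the 20190531 partition of tmp_dim_S has no rows and max_uid falls back to 0.

```sql
SELECT
COUNT(*) AS row_cnt
,COUNT(DISTINCT uid) AS uid_cnt
,MIN(uid) AS min_uid
,MAX(uid) AS max_uid
FROM tmp_dim_S
WHERE pt = '20190601'
;
-- With the four sample rows: row_cnt = 4, uid_cnt = 4, min_uid = 1, max_uid = 4
```

If row_cnt and uid_cnt ever differ, the same surrogate key was issued twice, which usually means the load ran against the wrong previous partition.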