Hive: zipper table design and implementation

1 Data synchronization problem

In practice, Hive is mainly used to build offline data warehouses: data is collected periodically from various data sources into Hive and then served to data applications after layered transformation. For example, the latest order, user, and store information must be synchronized from MySQL to the data warehouse every day for order analysis and user analysis.

For example, MySQL has a user table, tb_user. Each time a user registers, that user's information is added to the user table.

 

Since users register every day and new user information is generated, the user data in MySQL needs to be synchronized to the Hive data warehouse every day.

Suppose that on the 1st a table was created in Hive and the data was pulled, and that on the 2nd two new users registered in MySQL and one existing user's data was updated.

We then need to synchronize the data from the 2nd to Hive. The new rows can be loaded directly into the Hive table, but how should the updated row be stored in the Hive table?

Option 1: Direct overwrite

Use the data from the 2nd to directly overwrite the data from the 1st.
Advantages: The simplest to implement and the most convenient to use.
Disadvantages: No historical state is kept. If you want to query what user 008's data looked like before the update, you cannot see it.

Option 2: Build a full snapshot table for each date

Create a table on the 1st and pull all the data into it.
Create another table on the 2nd and pull all the data again.
... Create a new table every day.
Advantages: The state of all data at every point in time is recorded.
Disadvantages: A large amount of data that did not change is stored redundantly, so the total volume of stored data becomes excessive.

Option 3: Build a zipper table and use time fields to mark the period during which each state of the changed data was valid

A zipper table adds a new state row only for data that has been updated; data that has not changed gets no additional row. The table therefore holds every state of every record over time, and the life cycle of each state is marked by a start time and an end time. When querying, the state that was valid in a given time range can be retrieved as needed. By default, a maximum value such as 9999-12-31 is used as the end time to represent the latest state.
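To make the time-period idea concrete, here is a minimal plain-Python sketch of the two usual lookups on a zipper table: the state valid on a given day, and the latest state. It is only a model of the logic, not Hive code. The two rows for user 008 are an assumed history after the address change described above, and the function names are hypothetical.

```python
OPEN_END = "9999-12-31"  # sentinel end time that marks the latest state

# Assumed zipper rows for user 008 after an address change on
# 2021-01-02: (userid, addr, starttime, endtime).
ROWS = [
    ("008", "gz", "2021-01-01", "2021-01-01"),
    ("008", "sh", "2021-01-02", OPEN_END),
]


def state_as_of(rows, userid, day):
    """Return the row whose [starttime, endtime] period contains `day`."""
    for row in rows:
        uid, _addr, starttime, endtime = row
        # Dates in yyyy-MM-dd form compare correctly as plain strings.
        if uid == userid and starttime <= day <= endtime:
            return row
    return None


def latest_state(rows, userid):
    """Return the row that is still open, i.e. ends at the sentinel."""
    for row in rows:
        if row[0] == userid and row[3] == OPEN_END:
            return row
    return None


print(state_as_of(ROWS, "008", "2021-01-01"))
print(latest_state(ROWS, "008"))
```

In Hive the same lookups would be WHERE filters on starttime and endtime; the string comparison works there for the same reason it works here, because the dates are stored in yyyy-MM-dd form.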

2 Zipper table implementation principle

 

1. Incrementally collect the changed data and put it into the incremental table

2. Merge the data of the zipper table and the incremental table in Hive, and write the merged result into the temporary table

3. Overwrite the zipper table with the data of the temporary table
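The three steps above can be sketched in plain Python before looking at the Hive statements. This is a simplified model under stated assumptions, not the Hive implementation: rows are dicts, the incremental data holds at most one row per userid, and `merge_zipper` and `day_before` are hypothetical helper names.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # sentinel end time that marks the latest state


def day_before(day):
    """Return the day before `day`, both as yyyy-MM-dd strings."""
    return (date.fromisoformat(day) - timedelta(days=1)).isoformat()


def merge_zipper(zipper, delta):
    """Step 2: merge the zipper rows with the incremental rows.

    Assumes at most one incremental row per userid.
    """
    new_start = {row["userid"]: row["starttime"] for row in delta}
    merged = [dict(row) for row in delta]  # new and updated states, as-is
    for row in zipper:
        out = dict(row)
        # Close only the open row of a user that appears in the delta;
        # rows that are already closed keep their original endtime.
        if row["userid"] in new_start and row["endtime"] == OPEN_END:
            out["endtime"] = day_before(new_start[row["userid"]])
        merged.append(out)
    return merged


# Step 1: the incremental rows collected on the 2nd (subset of columns).
delta = [
    {"userid": "008", "addr": "sh", "starttime": "2021-01-02", "endtime": OPEN_END},
    {"userid": "011", "addr": "jx", "starttime": "2021-01-02", "endtime": OPEN_END},
]
zipper = [
    {"userid": "007", "addr": "sz", "starttime": "2021-01-01", "endtime": OPEN_END},
    {"userid": "008", "addr": "gz", "starttime": "2021-01-01", "endtime": OPEN_END},
]

tmp = merge_zipper(zipper, delta)  # step 2: result goes to a temporary holder
zipper = tmp                       # step 3: overwrite the zipper data
for row in zipper:
    print(row)
```

Writing to a temporary holder first mirrors the reason for the temporary table: the merge reads the zipper data while producing its replacement, so the result is staged before it overwrites the source.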

3 Zipper table implementation demonstration

Create the zipper table

-- Data preparation (fields in zipper.txt must be tab-separated to match the table definition)
vi zipper.txt
001 186xxxx1234 laoda 0 sh 2021-01-01 9999-12-31
002 186xxxx1235 laoer 1 bj 2021-01-01 9999-12-31
003 186xxxx1236 laosan 0 sz 2021-01-01 9999-12-31
004 186xxxx1237 laosi 1 gz 2021-01-01 9999-12-31
005 186xxxx1238 laowu 0 sh 2021-01-01 9999-12-31
006 186xxxx1239 laoliu 1 bj 2021-01-01 9999-12-31
007 186xxxx1240 laoqi 0 sz 2021-01-01 9999-12-31
008 186xxxx1241 laoba 1 gz 2021-01-01 9999-12-31
009 186xxxx1242 laojiu 0 sh 2021-01-01 9999-12-31
010 186xxxx1243 laoshi 1 bj 2021-01-01 9999-12-31

-- Create the zipper table
create table dw_zipper
(
    userid    string,
    phone     string,
    nick      string,
    gender    int,
    addr      string,
    starttime string,
    endtime   string
) row format delimited fields terminated by '\t';
load data local inpath '/root/zipper.txt' into table dw_zipper;
select * from dw_zipper;

Create the incremental table

vi update.txt
008 186xxxx1241 laoba 1 sh 2021-01-02 9999-12-31
011 186xxxx1244 laoshi 1 jx 2021-01-02 9999-12-31
012 186xxxx1245 laoshi 0 zj 2021-01-02 9999-12-31

 

create table ods_update
(
    userid    string,
    phone     string,
    nick      string,
    gender    int,
    addr      string,
    starttime string,
    endtime   string
) row format delimited fields terminated by '\t';

load data local inpath '/root/update.txt' overwrite into table ods_update;

select * from ods_update;

Create temporary table

create table tmp_zipper
(
    userid    string,
    phone     string,
    nick      string,
    gender    int,
    addr      string,
    starttime string,
    endtime   string
) row format delimited fields terminated by '\t';

Merge data into the temporary table

insert overwrite table tmp_zipper
select
    userid,
    phone,
    nick,
    gender,
    addr,
    starttime,
    endtime
from ods_update
union all
-- Select all rows of the original zipper table, and set the endtime of the rows being updated to the day before the new row's starttime
select
    a.userid,
    a.phone,
    a.nick,
    a.gender,
    a.addr,
    a.starttime,
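    -- If this row has no update, or it is an already closed historical row, keep its original endtime;
    -- otherwise set endtime to the new row's starttime minus 1 day
    if(b.userid is null or a.endtime < '9999-12-31', a.endtime, date_sub(b.starttime, 1)) as endtime
from dw_zipper a
         left join ods_update b
                   on a.userid = b.userid;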

Overwrite zipper table data

insert overwrite table dw_zipper
select * from tmp_zipper;
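After the overwrite, it can be worth checking that the zipper table still holds together. The sketch below is a plain-Python model of two sanity checks, assuming rows reduced to (userid, starttime, endtime): each user should have exactly one open row, and a user's periods should not overlap. The function name `check_zipper` and the choice of checks are illustrative assumptions, not part of the original procedure.

```python
from collections import defaultdict

OPEN_END = "9999-12-31"  # sentinel end time that marks the latest state


def check_zipper(rows):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    by_user = defaultdict(list)
    for userid, starttime, endtime in rows:
        by_user[userid].append((starttime, endtime))

    for userid, periods in by_user.items():
        periods.sort()  # yyyy-MM-dd strings sort chronologically
        open_rows = [p for p in periods if p[1] == OPEN_END]
        if len(open_rows) != 1:
            problems.append(f"{userid}: expected 1 open row, found {len(open_rows)}")
        # Each period must end before the next one starts.
        for (_s1, e1), (s2, _e2) in zip(periods, periods[1:]):
            if s2 <= e1:
                problems.append(f"{userid}: period starting {s2} overlaps one ending {e1}")
    return problems


# Rows as they would look after the demonstration merge (subset of users).
merged = [
    ("007", "2021-01-01", OPEN_END),
    ("008", "2021-01-01", "2021-01-01"),
    ("008", "2021-01-02", OPEN_END),
    ("011", "2021-01-02", OPEN_END),
]
print(check_zipper(merged))
```

An old row that was never closed would be reported twice here, once as a second open row and once as an overlap, which is the kind of error a wrong merge condition tends to produce.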

Origin blog.csdn.net/m0_53400772/article/details/130828882