Hive: using temporary tables to maintain a full table
Requirements:
In a Hive environment, table a is a full table and table b is an incremental table (it holds only the data from the current day's run).
Rows that exist in table a but not in table b must be kept in table a,
and rows that exist in table b but not in table a must be appended to table a.
Solution 1:
Use a left outer join to select the rows that are in table a but not in table b,
and then merge all rows of table b with those selected rows. A row that exists in both tables therefore takes its values from table b.
--------------------- Create data (demo in Oracle)
-- Query table b's data against table a
with a as(
select 1 as id, 'Lisi' as name ,'2019-10-01' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-01' as time from dual
union all
select 3 as id, 'Zhaoliu' as name,'2019-10-01' as time from dual
union all
select 4 as id, 'Pangsan' as name,'2019-10-01' as time from dual
),
b as(
select 1 as id, 'Lisi' as name,'2019-10-03' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-03' as time from dual
union all
select 5 as id, 'Huangsan' as name,'2019-10-03' as time from dual
)
-- Use a join
select a.id, a.name,a.time
from a
left join b
on a.id = b.id
where b.id is null
union all
select b.id,b.name,b.time
from b
;
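
The demo above runs in Oracle against inline sample data and does not show the temporary-table step named in the title. One possible way to apply the same logic in Hive is sketched below. It assumes that a and b are existing Hive tables with the columns id, name and time, and that tmp_a_merged is a hypothetical name chosen for this example. The column time is quoted with backticks in case the Hive version in use treats it as a keyword.

```sql
-- Sketch only. Assumptions: a and b are existing Hive tables with the
-- same columns (id, name, `time`); tmp_a_merged is a hypothetical name.

-- Step 1: stage the merged result in a session-scoped temporary table.
CREATE TEMPORARY TABLE tmp_a_merged AS
SELECT a.id, a.name, a.`time`
FROM a
LEFT JOIN b
  ON a.id = b.id
WHERE b.id IS NULL          -- rows that exist only in a
UNION ALL
SELECT b.id, b.name, b.`time`
FROM b;                     -- every row of today's increment

-- Step 2: replace the contents of the full table with the staged result.
INSERT OVERWRITE TABLE a
SELECT id, name, `time`
FROM tmp_a_merged;
```

Staging the result first means the query that reads table a has finished before table a is overwritten. The temporary table is visible only to the current session and is dropped when the session ends.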
Solution 2:
First merge the data of tables a and b with UNION ALL,
and then use the analytic function row_number() to partition the rows by id and order them by time in descending order, keeping only the latest row for each id
--------------------- Create data (demo in Oracle)
-- Query table b's data against table a
with a as(
select 1 as id, 'Lisi' as name ,'2019-10-01' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-01' as time from dual
union all
select 3 as id, 'Zhaoliu' as name,'2019-10-01' as time from dual
union all
select 4 as id, 'Pangsan' as name,'2019-10-01' as time from dual
),
b as(
select 1 as id, 'Lisi' as name,'2019-10-02' as time from dual
union all
select 2 as id, 'Wangmen' as name,'2019-10-02' as time from dual
union all
select 5 as id, 'Huangsan' as name,'2019-10-02' as time from dual
)
-- Merge both tables, then keep the latest row per id with row_number()
SELECT id
      ,NAME
      ,TIME
      ,rr
  FROM (SELECT id
              ,NAME
              ,TIME
              ,row_number() over(PARTITION BY id ORDER BY TIME DESC) AS rr
          FROM (SELECT a.id
                      ,a.name
                      ,a.time
                  FROM a a
                UNION ALL
                SELECT b.id
                      ,b.name
                      ,b.time
                  FROM b b) c) d
 WHERE d.rr = 1
;
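
As with Solution 1, the query above is an Oracle demo on inline data. A minimal Hive sketch of the same idea follows, under the same assumptions: a and b are existing Hive tables with the columns id, name and time, and tmp_a_latest is a hypothetical name used only here. It also assumes that time is stored as a 'yyyy-MM-dd' string, so that string ordering matches date ordering.

```sql
-- Sketch only. Assumptions: a and b are existing Hive tables with the
-- same columns (id, name, `time`); tmp_a_latest is a hypothetical name;
-- `time` holds 'yyyy-MM-dd' strings, which sort in date order.

-- Step 1: union both tables and keep the newest row for each id.
CREATE TEMPORARY TABLE tmp_a_latest AS
SELECT id, name, `time`
FROM (
  SELECT id, name, `time`,
         row_number() OVER (PARTITION BY id ORDER BY `time` DESC) AS rr
  FROM (
    SELECT id, name, `time` FROM a
    UNION ALL
    SELECT id, name, `time` FROM b
  ) c
) d
WHERE rr = 1;

-- Step 2: replace the contents of the full table with the deduplicated rows.
INSERT OVERWRITE TABLE a
SELECT id, name, `time`
FROM tmp_a_latest;
```

If two rows for the same id carry the same time value, row_number() picks one of them arbitrarily, so this variant relies on the increment in b always being newer than the matching row in a.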
Because the sample data set is small, it is not yet possible to tell which solution performs better; this will be followed up later.