When building a data warehouse, the extraction layer usually needs to pull incremental data from the business database. But business data is not append-only: existing rows change state over time, and those changes must be synchronized into the Hive tables. A warehouse built on Oracle can use the MERGE statement to combine old and new data; Hive, however, does not have this feature. This article shows how to extract data with Sqoop and merge it into Hive automatically.
Table Design
Three tables are used for extraction:
- A table with the _arc suffix, which stores the merged daily snapshot, partitioned by the pt field
- A table with the _inc suffix, which stores each day's extracted incremental data, partitioned by the pt field
- A table without a suffix, which points to the latest merged data and is read by downstream ETL tasks
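The naming and partition-path conventions implied by this layout can be sketched in Python. The warehouse path and table names below are taken from the examples later in this article; treat the helper functions as illustrative assumptions, not part of the original tooling.

```python
# Sketch of the three-table convention described above.
# WAREHOUSE matches the HDFS path used in the set-location example below.
WAREHOUSE = "hdfs://hadoop01:9000/user/hive/warehouse/ods.db"

def table_names(base):
    """Return (incremental, archive, final) table names for a base name."""
    return (base + "_inc", base + "_arc", base)

def partition_path(table, pt):
    """HDFS location of one daily partition, e.g. the target of SET LOCATION."""
    return f"{WAREHOUSE}/{table}/pt={pt}"

inc, arc, final = table_names("mytable")
print(partition_path(arc, "20200407"))
# hdfs://hadoop01:9000/user/hive/warehouse/ods.db/mytable_arc/pt=20200407
```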
Steps
- Use sqoop with hive-import to load the day's incremental data into that day's partition of the _inc table
- The core step: a SQL statement combining a full join with coalesce and if merges that day's _inc partition with the previous day's _arc partition, writing the result into that day's partition of the _arc table
- Finally, point the unsuffixed table at the newest _arc partition with a Hive set location command
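The merge in the second step is a "latest wins" upsert: for each key, a row from the incremental partition replaces the row with the same key in yesterday's snapshot, and all other rows are carried forward unchanged. A minimal Python simulation of the same semantics (dicts keyed by id; the column values are made up for illustration):

```python
# Simulation of the FULL JOIN + COALESCE/IF merge:
# for each id, take the incremental row if present, else keep yesterday's.
def merge(inc_rows, arc_rows):
    """inc_rows / arc_rows: dicts of id -> row; incremental rows win."""
    merged = dict(arc_rows)   # start from yesterday's snapshot
    merged.update(inc_rows)   # overwrite changed rows, insert new ones
    return merged

arc = {1: ("A", 100), 2: ("B", 200)}   # yesterday's _arc partition
inc = {2: ("B", 250), 3: ("C", 300)}   # today's _inc partition
print(merge(inc, arc))
# {1: ('A', 100), 2: ('B', 250), 3: ('C', 300)}
```

Row 2 is updated from the increment, row 3 is newly inserted, and row 1 is carried over; this is exactly what the coalesce/if combination in the SQL below computes.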
Key code:
merge SQL
use ods;
insert overwrite table mytable_arc partition (pt='20200407')
select coalesce(a.id,b.id), if(a.id is null, b.type, a.type), if(a.id is null, b.amt, a.amt) from (
select id, type, amt
from mytable_inc where pt='20200407'
) a full join (
select id, type, amt
from mytable_arc where pt='20200406'
) b on a.id = b.id;
hive set location
use ods;
alter table mytable set location 'hdfs://hadoop01:9000/user/hive/warehouse/ods.db/mytable_arc/pt=20200407';
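In practice both statements are generated fresh each day rather than written by hand. A sketch of that generation step, assuming the table layout above (key column id, value columns type and amt; the template strings and function are hypothetical, not the author's original script):

```python
from datetime import datetime, timedelta

# Hypothetical templates mirroring the two statements shown above.
MERGE_SQL = """insert overwrite table {t}_arc partition (pt='{today}')
select coalesce(a.{k},b.{k}), if(a.{k} is null, b.type, a.type),
       if(a.{k} is null, b.amt, a.amt)
from (select {k}, type, amt from {t}_inc where pt='{today}') a
full join (select {k}, type, amt from {t}_arc where pt='{yday}') b
on a.{k} = b.{k}"""

LOCATION_SQL = ("alter table {t} set location "
                "'hdfs://hadoop01:9000/user/hive/warehouse/ods.db/{t}_arc/pt={today}'")

def daily_statements(table, key, pt):
    """Render the merge and set-location statements for one day (pt=YYYYMMDD)."""
    day = datetime.strptime(pt, "%Y%m%d")
    yday = (day - timedelta(days=1)).strftime("%Y%m%d")
    return (MERGE_SQL.format(t=table, k=key, today=pt, yday=yday),
            LOCATION_SQL.format(t=table, today=pt))

merge_sql, loc_sql = daily_statements("mytable", "id", "20200407")
```

Computing the previous day's partition from the current date keeps the full join always pairing today's increment with yesterday's snapshot.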