Big data: how to handle multiple data pulls per day when the batches contain duplicate data

Scenario:

In the company's datalake reconstruction project, for special reasons, the smallest time unit for pulling data from the source database into the big data platform is one day. That is, if the data is pulled at 11:00 and the source database receives more data at 14:00, pulling again returns a second batch that also contains everything from the first batch, so the data is duplicated.
However, the dws layer of the big data platform may already have computed results from the first batch. Deleting the whole day's data and rerunning it would affect the business (perhaps a manager has already seen the dashboard), and some tables are partitioned by month, so a single data error could mean rerunning a whole month of data.
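
To make the problem concrete, here is a minimal sketch in Python. The rows, the names first_pull and second_pull, and the amounts are made up for illustration and are not data from the project. It shows how naively appending the second pull counts the first batch twice.

```python
# Hypothetical rows: (order_id, amount). The source can only be pulled one
# whole day at a time, so the 14:00 pull repeats what the 11:00 pull returned.
first_pull = [(1, 100), (2, 250)]
second_pull = [(1, 100), (2, 250), (3, 75)]

# Naive approach: append every pull to the same table.
naive_table = first_pull + second_pull

naive_total = sum(amount for _, amount in naive_table)
true_total = sum(amount for _, amount in set(second_pull))

print(naive_total)  # 775: orders 1 and 2 are counted twice
print(true_total)   # 425
```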

 

Solutions:


1. When extracting data, add a field load_ts whose value is the extraction time (passed in by the azkaban scheduling script). All rows extracted in the same batch therefore share the same load_ts, which can be used to tell batches apart.
2. The datalake is divided into tmp, src, dwd, dws and ads layers. The tmp layer is added to the four layers suggested by Ali and holds the extracted data. It is cleared before every extraction, so the tmp layer always holds exactly one batch.
3. When going from the tmp layer to the src layer, use rank partitioned by all fields except load_ts and ordered by load_ts ascending, and keep the rows with rank 1. In other words, for identical rows, keep the old one and discard the new one. As a result, the rows in the src layer that carry the second batch's load_ts do not include any data from the first batch.
4. When going from the src layer to the dwd layer, use rank to fetch all rows with the largest load_ts, that is, fetch only the latest batch each time.
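
Steps 1 and 2 can be sketched roughly as follows. This is only an illustration in Python, not the project's azkaban script: the function stage_batch, the row layout, and the timestamps are assumptions made for the example.

```python
from datetime import datetime

def stage_batch(rows, load_ts):
    """Return the new contents of the tmp layer for one extraction.

    Every row gets the same load_ts, and the result replaces whatever tmp
    held before, so tmp only ever contains a single batch.
    """
    return [dict(row, load_ts=load_ts) for row in rows]

# Assumption: the scheduler passes in one timestamp per run.
tmp = stage_batch([{"id": 1}, {"id": 2}],
                  datetime(2020, 6, 24, 11, 0).isoformat())
tmp = stage_batch([{"id": 1}, {"id": 2}, {"id": 3}],
                  datetime(2020, 6, 24, 14, 0).isoformat())

print(len(tmp))                          # 3
print({row["load_ts"] for row in tmp})   # {'2020-06-24T14:00:00'}
```

After the second run, tmp holds only the 14:00 batch, and every row in it carries the same load_ts.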


Code:


tmp-src

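-- Merge the new batch in tmp into src; identical rows keep their earliest load_ts.
insert overwrite table src.tableName partition(xxx)
select *  -- in practice, list the table's columns here so that rk is not inserted
from
  (
  select 
    *, rank() over (partition by xxx,xxx,xxx order by load_ts asc) as rk
  from
    (
    select * from src.tableName
    union all
    select * from tmp.tableName
    ) tmp_union_src
  ) tmp_union_src_with_rk
where rk = 1;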
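
The tmp to src merge can be tried out locally with a small runnable sketch. It uses Python's built-in sqlite3 as a stand-in for Hive (window functions need SQLite 3.25 or later). The tables src_t and tmp_t, their columns, and the sample rows are assumptions for illustration. For identical rows it keeps the earliest load_ts, which is the "keep the old, discard the new" rule from step 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table src_t (id integer, amount integer, load_ts text);
    create table tmp_t (id integer, amount integer, load_ts text);
    -- src already holds the 11:00 batch; tmp holds the 14:00 pull.
    insert into src_t values (1, 100, '11:00'), (2, 250, '11:00');
    insert into tmp_t values (1, 100, '14:00'), (2, 250, '14:00'), (3, 75, '14:00');
""")

# Partition by every column except load_ts; rank 1 is the earliest copy.
merged = conn.execute("""
    select id, amount, load_ts
    from (
        select *,
               rank() over (partition by id, amount order by load_ts asc) as rk
        from (
            select * from src_t
            union all
            select * from tmp_t
        ) as tmp_union_src
    ) as tmp_union_src_with_rk
    where rk = 1
    order by id
""").fetchall()

print(merged)
# [(1, 100, '11:00'), (2, 250, '11:00'), (3, 75, '14:00')]
```

Only the row that is really new (id 3) ends up with the second batch's load_ts.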

src-dwd

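-- Load only the latest batch (largest load_ts) from src into dwd.
insert into table dwd.tableName partition(xxx)
select *  -- in practice, list the table's columns here so that rk is not inserted
from
  (
  select 
    *, rank() over (order by load_ts desc) as rk
  from src.tableName
  ) src_with_rk
where rk = 1;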
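
The src to dwd step can be checked the same way. The sketch below again uses Python's sqlite3 in place of Hive (SQLite 3.25 or later), with a hypothetical table src_t whose rows already carry the load_ts of the batch in which each row first appeared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table src_t (id integer, amount integer, load_ts text);
    -- Rows 1 and 2 arrived with the 11:00 batch, row 3 with the 14:00 batch.
    insert into src_t values (1, 100, '11:00'), (2, 250, '11:00'), (3, 75, '14:00');
""")

# No partition clause: rank over the whole table, so rank 1 is every row
# that shares the largest load_ts, i.e. the latest batch.
latest_batch = conn.execute("""
    select id, amount, load_ts
    from (
        select *, rank() over (order by load_ts desc) as rk
        from src_t
    ) as src_with_rk
    where rk = 1
""").fetchall()

print(latest_batch)  # [(3, 75, '14:00')]
```

Because rank gives tied rows the same value, all rows of the latest batch get rank 1, and only those rows are appended to dwd.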

 

Origin blog.csdn.net/x950913/article/details/106949280