Incremental ETL optimization (long-period indicators)

In daily data processing, it is often unavoidable to calculate statistical indicators that span long periods of data, such as the following:

1. Count user visits per city (past 30 days | this year | history to date);

2. Count transacting users per city (past 30 days | this year | history to date);

3. Indicator requirements where some rows of the dataset undergo state changes.

When facing a batch of such indicator requirements, the biggest problem is that the data volume across long periods tends to be huge, and the period range is often not fixed. Each scenario is addressed in turn below.

Scenario 1: Count user visits per city (past 30 days | this year | history to date)

The conventional solution is as follows:

select
    city_id,
    count(1) as pv
from events
where p_date >= date_add(current_date(), -30)
  and p_date <= date_add(current_date(), -1)
group by city_id
;

The problem: the tracking-event data for the past 30 days is enormous; the servers cannot bear the load, and even this simple SQL runs far too long.

Solution:

Build an incrementally updated intermediate table to reduce the amount of data the final SQL has to scan, thereby cutting the task's execution time and resource consumption.

1. Intermediate table design

create table e_city_day_indi(
    city_id bigint,
    pv      int
)
partitioned by (p_date string)
stored as parquet
;

There are two ways to load e_city_day_indi; for ease of the discussion below I call them the daily incremental update method and the loop iteration update method.

- Daily incremental update: each day, compute that day's per-city PV summary and store it in that day's date partition.
  Advantage: more versatile.
  Disadvantage: multi-partition scans cannot be avoided entirely, and the method may produce too many small files (see the config sketch after this list).

- Loop iteration update: each day, merge the new day's data with the previous day's snapshot and store the result in the current day's partition.
  Advantage: the consuming side only needs to scan the newest partition.
  Disadvantage: narrower applicability; some long-period requirements cannot be met.
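On the small-file disadvantage: Hive can merge the output files of each daily load. The settings below are standard Hive merge options; the threshold values are illustrative assumptions, not part of the original post.

-- merge small output files after each load (threshold values illustrative)
set hive.merge.mapfiles=true;               -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;            -- merge outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize=16000000; -- merge when avg output file < ~16 MB
set hive.merge.size.per.task=256000000;     -- target ~256 MB per merged file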

2. ETL implementation

Daily incremental update

insert overwrite table e_city_day_indi partition(p_date)
select
    city_id,
    count(1) as pv,
    date_add(current_date(), -1) as p_date
from events
where p_date = date_add(current_date(), -1)
group by city_id
;
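Both loads write the partition value dynamically. Depending on cluster defaults, Hive may require dynamic partitioning to be enabled first; a minimal sketch, assuming stock settings:

-- allow the fully dynamic p_date partition used by the inserts above
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;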

Loop iteration update

insert overwrite table e_city_day_indi partition(p_date)
select
    city_id,
    sum(pv) as pv,
    date_add(current_date(), -1) as p_date
from (
    -- yesterday's new events, aggregated per city
    select
        city_id,
        count(1) as pv
    from events
    where p_date = date_add(current_date(), -1)
    group by city_id
    union all
    -- cumulative snapshot from the day before yesterday
    select
        city_id,
        pv
    from e_city_day_indi
    where p_date = date_add(current_date(), -2)
) t
group by city_id
;
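Note that the merge assumes the snapshot partition for the day before yesterday already exists; on the very first run the table has to be bootstrapped. A one-off sketch (my assumption, not part of the original post):

-- one-off bootstrap: build the initial cumulative snapshot from all history
insert overwrite table e_city_day_indi partition(p_date)
select
    city_id,
    count(1) as pv,
    date_add(current_date(), -2) as p_date
from events
where p_date <= date_add(current_date(), -2)
group by city_id
;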

3. Optimized access logic

Access SQL against the daily incremental update table

select
    city_id,
    sum(pv) as pv
from e_city_day_indi
where p_date >= date_add(current_date(), -30)
  and p_date <= date_add(current_date(), -1)
group by city_id
;
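The same table also covers the "this year" window by simply widening the date filter; a sketch using Hive's trunc function (my illustration, not from the original post):

-- visits per city, year to date (Jan 1 of the current year through yesterday)
select
    city_id,
    sum(pv) as pv
from e_city_day_indi
where p_date >= trunc(current_date(), 'YYYY')
  and p_date <= date_add(current_date(), -1)
group by city_id
;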

Access SQL against the loop iteration update table

select
    city_id,
    pv
from e_city_day_indi
where p_date = date_add(current_date(), -1)
;

Both incremental loading methods above satisfy requirement 1 while greatly reducing the amount of data processed.

Scenario 2: Count transacting users per city (past 30 days | this year | history to date)

The conventional solution is as follows:

select
    city_id,
    count(distinct user_id) as uv
from events
where p_date >= date_add(current_date(), -30)
  and p_date <= date_add(current_date(), -1)
group by city_id
;

This is a typical count distinct requirement, and it faces the same problem: the tracking-event data for the past 30 days is enormous, the servers cannot bear the load, and even this simple SQL runs far too long.

Solution:

Build an incrementally updated intermediate table to reduce the amount of data the final SQL has to scan, thereby cutting the task's execution time and resource consumption.

1. Intermediate table design

create table e_city_day_indi(
    city_id bigint,
    user_id bigint,
    pv      int
)
partitioned by (p_date string)
stored as parquet
;

For loading e_city_day_indi, only the daily incremental update method is covered here; unlike PV sums, distinct user counts cannot simply be added across daily snapshots (the same user may appear on many days), so the loop iteration method is a poor fit.

2. ETL implementation

Daily incremental update

insert overwrite table e_city_day_indi partition(p_date)
select
    city_id,
    user_id,
    count(1) as pv,
    date_add(current_date(), -1) as p_date
from events
where p_date = date_add(current_date(), -1)
group by city_id, user_id
;

3. Optimized access logic

Daily incremental update

select
    city_id,
    count(distinct user_id) as uv,
    sum(pv) as pv
from e_city_day_indi
where p_date >= date_add(current_date(), -30)
  and p_date <= date_add(current_date(), -1)
group by city_id
;

Scenario 3: Indicator requirements where some rows of the dataset undergo state changes

This scenario is rather special: it amounts to performing update operations against a full dataset. Typically the full table is large while the volume of rows that need updating is small, and in a Hive-based solution, update is a difficult operation to handle.

Solution: separate the changing data from the data that is known not to change (whether this hot/cold data separation can be done sensibly is the key). Concretely, design a partitioned table that stores the small amount of changing data in one partition and the large amount of unchanging data in others, so that historical-data updates only need to scan the active partition.

Implementation plan:

- Design two partitions for the target table (or a second-level partition under the date partition):
  Active partition: stores rows whose life cycle has not ended and which may still be updated.
  Inactive partition: stores rows whose life cycle is known to have ended.

- Design the date partitions (see the sketch after this list):
  Historical date partitions (p_date <= current_date()) store rows whose life cycle has ended;
  the future date partition (p_date = '9999-12-31') stores rows whose life cycle has not yet ended.
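A minimal sketch of this layout, assuming a hypothetical orders_snapshot table whose status = 'closed' marks the end of a row's life cycle, with daily changes arriving in an assumed orders_updates feed:

-- target table: the future partition ('9999-12-31') holds active rows,
-- historical date partitions hold closed rows
create table orders_snapshot(
    order_id    bigint,
    city_id     bigint,
    status      string,
    update_time string
)
partitioned by (p_date string)
stored as parquet
;

-- daily refresh: scan only the active partition plus yesterday's changes;
-- rows that closed move to yesterday's historical partition, the rest are
-- rewritten into the active partition (dynamic partition overwrite only
-- touches the partitions present in the output)
insert overwrite table orders_snapshot partition(p_date)
select
    order_id,
    city_id,
    status,
    update_time,
    case when status = 'closed'
         then cast(date_add(current_date(), -1) as string)
         else '9999-12-31'
    end as p_date
from (
    select
        order_id,
        city_id,
        status,
        update_time,
        -- keep only the latest version of each order
        row_number() over (partition by order_id
                           order by update_time desc) as rn
    from (
        select order_id, city_id, status, update_time
        from orders_snapshot
        where p_date = '9999-12-31'            -- active partition only
        union all
        select order_id, city_id, status, update_time
        from orders_updates                    -- hypothetical daily change feed
        where p_date = date_add(current_date(), -1)
    ) m
) t
where rn = 1
;

Because closed rows land in immutable historical partitions, the daily job's write volume is bounded by the size of the active data rather than the full table.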

The three typical scenarios above are among the most basic and most common incremental ETL development scenarios in data development; handling them with the appropriate method can greatly reduce resource consumption while effectively cutting execution time.

Origin: www.cnblogs.com/wcwen1990/p/12017826.html