Ten-billion-scale data processing optimization

In a recent big data processing job, we ran into a join between two large tables that was too slow to compute (and sometimes could not finish at all). Our data warehouse is built on Alibaba's ODPS, which is quite similar to Hive, so this article also applies to Hive users. The usual way to tackle such problems is to set a few common optimization parameters, but when parameter tuning alone does not help, we have to rewrite the SQL around the specific business logic. To keep the article easy to read, the business scenario described here has been simplified.

Problem

This is an offline data processing problem. The business involves two tables, whose structure and meaning are described below:

user_article_tb:

Field descriptions:

  • uid: user ID
  • itemid: article ID
  • dur: reading duration; a value greater than 0 means the user read the article, 0 means the user did not click it
  • dt: daily partition field; about 5.5 billion records per day

user_profile_tb:

Field descriptions:

  • uid: user ID
  • gender: gender, F for female, M for male
  • age: age
  • city: city
  • dt: daily partition field; this is a full snapshot table that stores the complete set of user profile attributes every day, with the latest partition at the billion-record level

The requirement is: compute each article's CTR among female users over seven days (the final result is sorted by CTR in descending order and truncated). It is easy to write the SQL directly:

select 
  itemid
  , count(if(dur > 0, 1, null)) / count(1) ctr
from
  (
      select uid, itemid, dur
      from user_article_tb
      where dt>='20190701' and dt<='20190707'
  ) data_tb
  join
  (
    select *
    from user_profile_tb
    where dt = '20190707' -- latest partition
      and gender = 'F'
  ) profile_tb
  on 
    data_tb.uid = profile_tb.uid
group by 
  itemid
order by ctr desc
limit 50000
;

So the problems are:

  • For user_article_tb, seven days of data amounts to nearly 40 billion records, which have to be joined with a billion-level profile table. A join of this size basically never finishes.
  • Exploratory requirements like this change frequently. What if the requirement becomes computing the CTR for male users, or for users in second-tier cities? We would have to re-run the job over the full data, paying a high cost in both time and compute resources.

Solution

Let's solve the two problems above one by one. Consider the first: since the two tables being joined are too large, can we make them smaller? The answer is yes. The profile table obviously cannot be shrunk, but user_article_tb can. We can split user_article_tb by its partition field dt, join each day's data with the profile table separately, and store each day's result in a daily temporary table (a sketch of one day's join is given below). Each join then only involves billion-level data, which basically solves the scale problem. However, there are still redundant joins within a single day's data. For example, if user uid = 00001 reads 1,000 articles in one day, that uid is joined with the profile table 999 more times than necessary. In our business it is very common for a user to read more than 10 articles a day, so this kind of redundant join is quite serious.
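A minimal sketch of the per-day approach, assuming a hypothetical temporary table user_article_profile_daily_tb partitioned by dt (one such statement per day):

insert overwrite table user_article_profile_daily_tb partition (dt='20190701')
select data_tb.uid, itemid, dur, gender, age, city
from
  (
    select uid, itemid, dur
    from user_article_tb
    where dt = '20190701' -- one day's data at a time
  ) data_tb
  join
  (
    select uid, gender, age, city
    from user_profile_tb
    where dt = '20190707' -- latest partition of the profile table
  ) profile_tb
  on data_tb.uid = profile_tb.uid;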

For the redundant joins described above, the most thorough solution is to convert user_article_tb to uid granularity, just like the profile table. The SQL that converts seven days of data to uid granularity is as follows:

insert overwrite table user_article_uid_tb
select uid, wm_concat(':', concat_ws(',', itemid, dur)) item_infos
from 
  (
    select *
    from user_article_tb
    where dt >= '20190701' and dt <= '20190707'
  ) tmp
group by uid
;

As the SQL above shows, we first group the seven days of data by uid and build the item_infos field. Because we are computing CTR, converting the table to uid granularity is safe; exactly what goes into item_infos should be chosen according to the business need. There are fewer than 100 million uids per day and fewer than 1 billion uids across the seven days, so a join between two uid-granularity tables is much faster.
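For intuition, a hypothetical user 00001 who read article 101 for 30 seconds and did not click article 102 would end up as a single row in user_article_uid_tb:

uid      item_infos
00001    101,30:102,0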

At this point the redundant-join problem is solved, so let's look at the second problem. What it calls for is essentially the wide table from dimensional modeling theory: to avoid joining dimension tables over and over when computing statistics along different dimensions, we join the commonly used dimensions upstream in advance to form one large wide table, which downstream jobs can then use directly without further joins. For our problem, the SQL is as follows:

create table user_profile_article_uid_tb as
select 
    data_tb.uid
    , item_infos
    , gender
    , age
    , city
    -- other dimension fields
from
  (
    select uid, item_infos
    from user_article_uid_tb 
  ) data_tb
  join
  (
    select uid, gender, age, city
    from user_profile_tb
    where dt = '20190707' -- latest partition
  ) profile_tb
  on 
    data_tb.uid = profile_tb.uid
;

With that, both problems are solved. The final calculation for our requirement, each article's CTR among female users, is:

select 
    itemid
    , count(if(dur > 0, 1, null)) / count(1) ctr
from 
  (
    select 
        split(item_info, ',')[0] itemid
        , split(item_info, ',')[1] dur
    from user_profile_article_uid_tb 
    lateral view explode(split(item_infos, ':')) item_tb as item_info
    where gender = 'F' -- the requirement asks for female users
  ) tmp
group by itemid
order by ctr desc
limit 50000
;
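This wide table is also what solves the second problem: switching to another dimension only changes the filter on user_profile_article_uid_tb, with no need to re-join the 40-billion-row raw data. As a sketch, for male users in second-tier cities only the inner subquery would change (the city list here is hypothetical):

select 
    split(item_info, ',')[0] itemid
    , split(item_info, ',')[1] dur
from user_profile_article_uid_tb 
lateral view explode(split(item_infos, ':')) item_tb as item_info
where gender = 'M'
  and city in ('Hangzhou', 'Chengdu', 'Wuhan') -- hypothetical second-tier-city list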

Parameter optimization

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapred.reduce.tasks

These are the more common tuning options. When adjusting them still does not achieve good results, we have to optimize based on the business, as described above.
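A minimal sketch of how such options are typically set at the session level in Hive; the values are purely illustrative, not tuned for this job:

set mapreduce.map.memory.mb=4096;    -- memory per map task, in MB
set mapreduce.reduce.memory.mb=8192; -- memory per reduce task, in MB
set mapred.reduce.tasks=2000;        -- number of reduce tasks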

Summary

This article described how to optimize a ten-billion-scale data join on ODPS or Hive. The core idea is to reduce the amount of data entering the join. These optimizations are not universal; they must be designed in combination with the business.

Welcome to follow the official account "crossing codes" and witness the growth.



Source: juejin.im/post/5d2de7bcf265da1b8608b9e7