Analysis and optimization of Hive's count (distinct)


A colleague from the marketing department asks: "Why is my SQL so slow?"
For deduplication statistics, she usually writes something like:

select
  count(distinct (bill_no)) as visit_users
from
  i_usoc_user_info_d
where
  p_day = '20200408'
  and bill_no is not null
  and bill_no != ''

This query is not wrong in itself, but keep in mind that we are writing HQL whose underlying engine is MapReduce, a distributed computing framework, so typical distributed-computing problems such as data skew naturally appear. Here, for example, we are counting all distinct mobile phone numbers over a consolidated table that stores every user's information in the data warehouse. This kind of query will certainly produce a result, but the running time turns out to be rather long:
[screenshot: running time of the count(distinct) job]
The result is 177,269,899, i.e. over a hundred million records. Checking the files on HDFS shows the partition is 55 GB, and at this point the execution plan and the logs reveal that there is only one stage:
Hadoop job information for Stage-1: number of mappers: 226; number of reducers: 1
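
If you want to see this for yourself before launching the job, Hive's explain statement prints the stage layout. A minimal sketch (the exact output differs across Hive versions, so it is abbreviated here):

-- Print the plan for the slow query; a single reduce stage
-- handles both the distinct and the count.
explain
select
  count(distinct (bill_no)) as visit_users
from
  i_usoc_user_info_d
where
  p_day = '20200408'
  and bill_no is not null
  and bill_no != ''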
If you are familiar with how MR works, you can already see why this SQL runs so slowly: there is very serious data skew. 226 mappers but only 1 reducer, so after the mappers finish processing, all of the data flows into a single reducer. The logical plan looks roughly like this:
[diagram: logical plan, all map output funneled into a single reducer]
Why only one reducer? Because when distinct is combined with count (a full aggregate), the generated MR job produces only one reducer, and even explicitly setting set mapred.reduce.tasks = 100000 is useless.
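
You can confirm that the reducer count is pinned by forcing it in the session and rerunning the same query; the job log will still report "number of reducers: 1". A sketch of that experiment:

-- Explicitly request 100000 reducers; for a full aggregate over
-- distinct values Hive still schedules exactly one reducer.
set mapred.reduce.tasks = 100000;
-- ...then rerun the count(distinct) query above: the log
-- still shows "number of reducers: 1".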
So for this kind of deduplication statistic, when the data volume is large enough (in my experience, more than 100 million records, though this depends on each company's cluster size and computing power), I choose count plus group by instead:

select
  count(a.bill_no)
from
  (
    select
      bill_no
    from
      dwfu_hive_db.i_usoc_user_info_d
    where
      p_day = '20200408'
      and bill_no is not null
      and bill_no != ''
    group by
      bill_no
  ) a

Run it again; the elapsed time is as follows:
[screenshot: running time of the count + group by job]
Holy cow, 168 s: the speed has improved by nearly 7 times. Check the execution plan and logs:
Hadoop job information for Stage-1: number of mappers: 226; number of reducers: 92
Hadoop job information for Stage-2: number of mappers: 88; number of reducers: 1

We find two stages, that is, two MR jobs were launched. The number of mappers for stage 1 is unchanged, but the number of reducers grows to 92, because the introduced group by spreads the data across multiple reducers by key. Roughly, stage 1 deduplicates bill_no across 92 reducers, and stage 2 merely counts the already-deduplicated rows, so its single reducer is no longer a bottleneck.
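
A related knob worth knowing: if the group by itself suffers from a few hot keys, Hive has a built-in two-job rewrite that can be switched on per session. This is a standard Hive setting rather than something used in the run above:

-- When true, Hive compiles a group by into two MR jobs: the first
-- distributes rows randomly to pre-aggregate, the second merges by
-- key, which smooths out hot keys.
set hive.groupby.skewindata = true;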

Summary: with a large amount of data, using count + group by instead of count(distinct) can greatly improve job execution efficiency and speed. Generally speaking, the larger the data volume, the more obvious the improvement.
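
The same two-stage pattern extends naturally to per-group distinct counts. A hypothetical variation on the table above (the date range is made up for illustration):

-- Distinct visitors per day: stage 1 deduplicates (p_day, bill_no),
-- stage 2 counts per p_day across many reducers.
select
  a.p_day,
  count(a.bill_no) as visit_users
from
  (
    select
      p_day,
      bill_no
    from
      dwfu_hive_db.i_usoc_user_info_d
    where
      p_day between '20200401' and '20200408'
      and bill_no is not null
      and bill_no != ''
    group by
      p_day,
      bill_no
  ) a
group by
  a.p_day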

Note: it is best to check the data volume before development. Do not reflexively rewrite every deduplication count as count + group by when the table holds only tens of thousands or hundreds of thousands of rows (a few MB of data); you may find it is not even as fast as a direct count(distinct): your two-job query has not started yet while someone else's count(distinct) result is already out. So this optimization must be judged against the data volume, and every SQL is different.
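
One quick way to check the volume before choosing a pattern (assuming table statistics have been collected; otherwise the numbers may be missing or stale):

-- Shows partition metadata, including numRows and totalSize when
-- statistics exist, so you can decide between count(distinct)
-- and count + group by.
describe formatted dwfu_hive_db.i_usoc_user_info_d partition (p_day = '20200408')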

Source: blog.csdn.net/qq_33891419/article/details/103019254