Hive GROUP BY vs. DISTINCT: differences and performance comparison

Deduplicated counting in Hive

People who use Hive regularly run deduplicated counts all the time, but few seem to worry about the performance of the deduplication itself. Once a table gets very large, though, you will find that a simple `count(distinct order_no)` runs particularly slowly, far slower than a plain `count(order_no)`, so I dug into it a bit.
Conclusion first: use `group by` instead of `distinct`; avoid `distinct` where you can. Examples:

Proof in practice

`order_snap` is a snapshot table of order records: 763,191,489 rows in total (nearly 800 million), 108.877 GB in size, storing all of the order information in about 20 fields. The order number is never duplicated, so counting orders with and without deduplication returns the same total. Let's take a look:
Count how many orders there are in total, and see how each way of writing the `count` performs.

  • DISTINCT

```sql
select count(distinct order_no) from order_snap;
```

```
Stage-Stage-1: Map:  396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent:  0 days 2 hours 11 minutes 55 seconds 670 msec
OK
_c0
763191489
Time taken: 1818.864 seconds, Fetched:  1 row(s)
```
  • GROUP BY

```sql
select count(t.order_no) from (select order_no from order_snap group by order_no) t;
```

```
Stage-Stage-1: Map:  396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS
Stage-Stage-2: Map:  177 Reduce: 1 Cumulative CPU: 280.22 sec HDFS Read: 472596 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent:  0 days 2 hours 52 minutes 16 seconds 920 msec
OK
_c0
763191489
Time taken: 244.192 seconds, Fetched:  1 row(s)
```

Conclusion: the second version finished in 244.192 s versus 1818.864 s for the first, roughly 7.45 times faster.
Why the difference? Hadoop is built to handle big data: Hive is not afraid of data volume, it is afraid of data skew. Look at the output of the two jobs:

 
     
```
# distinct
Stage-Stage-1: Map:  396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS
# group by
Stage-Stage-1: Map:  396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS
```

 

See the problem? With `distinct`, every `order_no` has to be shuffled into a single reducer, which is exactly the data skew we just mentioned: with everything piled onto one reducer, how could performance not suffer? The `group by` version, by contrast, ran with 457 reducers, so the data was spread across many machines, and naturally it finished much faster.
Since no reducer count was specified manually, Hive chose the number of reducers dynamically based on the data size; you can also set it by hand:
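Beyond rewriting the query, Hive also ships a switch aimed specifically at skewed `group by` aggregations; a minimal sketch, assuming a Hive version where this setting applies:

```sql
-- When enabled, Hive runs the aggregation as two jobs: the first
-- distributes keys randomly across reducers for partial aggregation,
-- the second merges those partial results by the real group-by key.
set hive.groupby.skewindata=true;
```

Note that this only helps plain `group by` skew; it does not rescue a `count(distinct ...)` on a single hot reducer.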

 
     
```
hive> set mapred.reduce.tasks=100;
```

 

Like this. So if the amount of data is particularly large, try not to use `distinct`.
But if you want one statement to report both the total record count and the deduplicated count, there is no way to pre-filter, so you have two choices: either run two sql statements and combine them with `union all`, or just use a plain `distinct`. Which to pick depends on the situation: `distinct` has better readability, so use it directly when the data volume allows; if the data is so large that performance suffers, then consider optimizing.
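As an aside, `mapred.reduce.tasks` is the old parameter name; on current Hive-on-MapReduce deployments the equivalent knobs are spelled differently. A sketch of the settings involved (values here are illustrative, not recommendations):

```sql
-- How much input each reducer should handle; Hive divides data size
-- by this to pick the reducer count dynamically.
set hive.exec.reducers.bytes.per.reducer=256000000;
-- Upper bound on the number Hive will pick on its own.
set hive.exec.reducers.max=1009;
-- Or force an exact reducer count (newer spelling of mapred.reduce.tasks).
set mapreduce.job.reduces=100;
```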
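For that last case, a middle ground worth knowing is that a single `group by` pass can produce both numbers at once, since the per-key counts sum back to the row total. A sketch against the same `order_snap` table (the column aliases are mine):

```sql
-- Total rows and deduplicated order_no count from one group-by subquery,
-- avoiding the single-reducer shuffle that count(distinct ...) causes.
select sum(t.cnt) as total_orders,    -- per-key counts sum to the row total
       count(*)   as distinct_orders  -- one row per order_no after group by
from (
  select order_no, count(*) as cnt
  from order_snap
  group by order_no
) t;
```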


Origin www.cnblogs.com/hdc520/p/11797517.html