In Hive, the differences between order by, sort by, distribute by, and cluster by, and the significance of cluster by

1. order by

      order by performs a global sort.
      Whenever a Hive query specifies order by, all of the data is processed by a single reducer (no matter how many map tasks run, and no matter how many blocks the file has, only one reducer is started). With large amounts of data, this takes a long time to execute.
      There is one difference from traditional SQL: if hive.mapred.mode = strict is set (nonstrict is the default in older Hive releases), a limit clause must be specified to cap the number of output rows. The reason is that all of the data passes through a single reducer, which may be unable to produce a result when the data volume is large, so strict mode requires the output size to be bounded.
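
      As a minimal sketch of the strict-mode rule above, assuming the stu table and grade column used later in this article, and an arbitrary limit of 10 chosen only for illustration:

```sql
-- Turn on strict mode for this session
-- (assumes a Hive version that honors hive.mapred.mode).
set hive.mapred.mode=strict;

-- With strict mode on, an order by without limit is expected to be rejected:
-- select * from stu order by grade desc;

-- Adding limit bounds what the single reducer has to return.
select * from stu order by grade desc limit 10;
```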

2. sort by

      sort by sorts the data within each reducer, so the result is only partially ordered: each reducer's output is sorted, but the overall output is not. Multiple reducers can be used. To see the effect, it is recommended to save the query results locally and adjust the number of reducers (I set it to 3).

      Save the query results locally:

insert overwrite 
local directory '/home/data'
select * from stu sort by grade desc;

      Set the number of reducers:

set mapreduce.job.reduces=3;

      View the number of reducers:

set mapreduce.job.reduces;

      However, when only sort by is used, rows are assigned to the reducers arbitrarily rather than by the value of any column.

3. distribute by

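      distribute by specifies how rows are partitioned across the reducers: rows with the same value in the given field go to the same reducer. It is usually used together with sort by, and distribute by must be written before sort by. In other words: partition by one field, then sort within each partition by another field.
      For example: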

      Sort by the grade field only, without specifying a partition field:

select * from stu sort by grade;

      Distribute by class first, then sort by grade within each reducer:

select * from stu distribute by class sort by grade;
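
      When a reducer receives more than one class, sorting by grade alone leaves the classes interleaved in its output. One possible variant, shown only as a sketch with the same stu table, sorts by class first and then by grade in descending order, so that each class forms a contiguous block:

```sql
-- Rows with the same class go to the same reducer (distribute by);
-- within each reducer, rows are ordered by class, then by grade from high to low.
select * from stu distribute by class sort by class, grade desc;
```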

4. cluster by

      When the fields specified by distribute by and sort by are the same, cluster by can be used instead.
      Note: the columns specified by cluster by are always sorted in ascending order; asc and desc cannot be specified.

      For example:

select * from stu distribute by class sort by class;

      Equivalent to:

select * from stu cluster by class;
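
      Because cluster by takes no asc or desc keyword and sorts in ascending order, a descending sort on the same column needs the longer form. A minimal sketch with the same stu table:

```sql
-- cluster by class always sorts class in ascending order;
-- for descending order, spell out distribute by and sort by.
select * from stu distribute by class sort by class desc;
```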

----------------------------------------Dividing line----------------------------------------
      Another example, where sort by uses more columns than distribute by:

select * from stu distribute by class sort by class, name;

      This query cannot be shortened with cluster by: Hive does not accept cluster by together with sort by (or with distribute by) in the same query, so the explicit distribute by ... sort by form has to be kept whenever the two clauses use different columns.

5. What is the significance of cluster by?

      For details, please see the article: "In Hive, what is the point of cluster by".

Origin blog.csdn.net/weixin_42845682/article/details/104953351