1. order by
order by performs a global sort.
Whenever order by appears in a Hive query, all data is processed by a single reducer, no matter how many map tasks run or how many blocks the file has. For large amounts of data this takes a long time to execute.
There is one difference from traditional SQL: if hive.mapred.mode is set to strict (the default is nonstrict, at least in older Hive releases), an order by query must also include a limit clause. Because every row passes through the same reducer, a large input may never produce a result, so strict mode forces you to cap the number of output rows.
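The strict-mode rule can be sketched as follows. This is a minimal illustration that assumes the stu table and grade column used later in this article; newer Hive releases may replace hive.mapred.mode with finer-grained strict-check settings, so treat the property name as version dependent.

```sql
-- Assumption: table stu with a grade column, as in the examples below.
set hive.mapred.mode=strict;

-- In strict mode this form is expected to be rejected (order by without limit):
-- select * from stu order by grade desc;

-- Accepted: the single reducer only has to emit the first 10 rows.
select * from stu order by grade desc limit 10;
```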
2. sort by
Each reducer sorts its own input, so the result is only partially ordered: sorted within each reducer, but not globally. Multiple reducers can be used. To see the effect, save the query result to a local directory and adjust the number of reducers (I set it to 3).
Save the query results locally:
insert overwrite
local directory '/home/data'
select * from stu sort by grade desc;
Adjust the number of reducers:
set mapreduce.job.reduces=3;
View the number of reducers:
set mapreduce.job.reduces;
However, when sort by is used on its own, rows are assigned to reducers at random, so you cannot control which rows end up in the same reducer.
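Put together in the order you would actually run them, the steps above might look like the sketch below. The row format clause is an addition not in the original steps; it only makes the local files easier to read. With 3 reducers you would typically expect three output files under /home/data (for example 000000_0, 000001_0, 000002_0), each sorted on its own.

```sql
-- Set the reducer count first so it applies to the query that follows.
set mapreduce.job.reduces=3;

-- Each reducer writes one file; every file is sorted by grade descending,
-- but the files are not ordered relative to each other.
insert overwrite local directory '/home/data'
row format delimited fields terminated by '\t'
select * from stu sort by grade desc;
```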
3. distribute by
distribute by specifies how rows are partitioned across reducers. It is usually combined with sort by, and distribute by must be written before sort by. Read it as: partition the rows by field X, then sort by field Y within each partition.
For example:
Sort by the grade field only, without specifying a partition field:
select * from stu sort by grade;
Partition by class first, then sort by grade within each reducer:
select * from stu distribute by class sort by grade;
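One thing to keep in mind: several classes can be routed to the same reducer, so sorting by grade alone may interleave rows of different classes in that reducer's output. A possible variant, sketched here under the same stu table assumption, puts class first in the sort key so that each class stays contiguous and is ordered by grade inside it.

```sql
set mapreduce.job.reduces=3;

-- Rows with the same class go to the same reducer (distribute by),
-- then each reducer orders its rows by class and, within a class,
-- by grade from highest to lowest (sort by).
select * from stu
distribute by class
sort by class, grade desc;
```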
4. cluster by
When distribute by and sort by name the same field, cluster by can be used instead.
Note: cluster by always sorts in ascending order; asc and desc cannot be specified.
For example:
select * from stu distribute by class sort by class
Equivalent to:
select * from stu cluster by class
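Another example, with a second sort column:
select * from stu distribute by class sort by class, name;
This query has no cluster by shorthand. Hive does not allow cluster by to be combined with distribute by or sort by in the same query, and cluster by class, name would distribute on both columns, which is a different partitioning.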
Note that cluster by always sorts in ascending order; asc and desc cannot be specified.
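Because cluster by takes no sort direction, a descending result has to fall back to the two-clause form. A short sketch, again assuming the stu table from the earlier examples:

```sql
-- Ascending order: cluster by is shorthand for
-- "distribute by class sort by class".
select * from stu cluster by class;

-- Descending order: cluster by cannot express this,
-- so write both clauses explicitly.
select * from stu distribute by class sort by class desc;
```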
5. What is the significance of cluster by?
For details, see the article "What is the point of cluster by in Hive?".