order by, sort by, distribute by, and cluster by in Hive
2019-12-12 11:22:56
Order By
- order by performs a full sort of its input, so it runs with only one reducer (multiple reducers cannot guarantee a global order). With a single reducer, a large input leads to a long computation time.
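The single-reducer point can be illustrated with a small toy model in Python. This is only a sketch of the idea, not Hive's implementation: the function names, the sample values, and the way rows are split between two reducers are all assumptions made for illustration.

```python
# Toy model: "order by" sends every row to one reducer, which sorts them all.
def order_by(rows, key):
    single_reducer_input = list(rows)  # all map output lands on one reducer
    return sorted(single_reducer_input, key=key)


# Contrast: two reducers that each sort only their own share of the rows.
def sort_in_two_reducers(rows, key):
    # Arbitrary split (hypothetical), just to give each reducer some rows.
    reducers = [rows[0::2], rows[1::2]]
    return [sorted(part, key=key) for part in reducers]


salaries = [300, 100, 400, 200]

globally_sorted = order_by(salaries, key=lambda s: s)
per_reducer = sort_in_two_reducers(salaries, key=lambda s: s)
concatenated = per_reducer[0] + per_reducer[1]

print(globally_sorted)  # one reducer: globally ordered
print(concatenated)     # two reducers: each part ordered, the whole is not
```

Each of the two reducer outputs is sorted, but simply placing them one after the other does not produce a globally ordered result, which is why a global sort falls back to one reducer.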
Sort By
- sort by is not a global sort; it sorts the data before the data enters the reducer.
- Therefore, if you sort with sort by and set mapred.reduce.tasks to a value greater than 1, sort by guarantees only that the output of each reducer is ordered; it does not guarantee a global order.
- Unlike order by, sort by is not affected by the hive.mapred.mode property. It only guarantees that the rows within the same reducer are sorted by the specified field.
- With sort by you can specify the number of reducers (via set mapred.reduce.tasks=n). Running a merge sort over the reducers' outputs then yields the complete sorted result.
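The last point, merging the per-reducer outputs into one ordered result, can be sketched as follows. This is a toy Python model under stated assumptions: the round-robin split is a hypothetical stand-in for however rows actually reach the reducers, and `sort_by` is an illustrative name, not a Hive API.

```python
import heapq


def sort_by(rows, num_reducers, key):
    # Hypothetical round-robin split standing in for the real row distribution.
    parts = [rows[i::num_reducers] for i in range(num_reducers)]
    # Each reducer sorts only the rows it received.
    return [sorted(part, key=key) for part in parts]


rows = [7, 2, 9, 4, 1, 8]

# Comparable to running with three reducers: three sorted outputs.
outputs = sort_by(rows, num_reducers=3, key=lambda r: r)

# A merge step over the already-sorted outputs gives the global order.
merged = list(heapq.merge(*outputs))

print(outputs)
print(merged)
```

Because every reducer output is already sorted, the final step is a cheap merge rather than a second full sort.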
Distribute By
- Controls how the rows output by the map side are split among the reducers.
- Hive distributes rows according to the column(s) named after distribute by and the number of reducers, using a hash algorithm by default.
- sort by generates one sorted file for each reducer.
- In some cases you need to control which reducer a particular row goes to, usually so that a subsequent aggregation can be performed.
- distribute by and sort by are often used together.
- Usage scenarios for distribute by and sort by:
- Uneven map output file sizes
- Uneven reduce output file sizes
- Too many small files
- Files that are too large
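The combination of distribute by and sort by can be pictured with the toy Python model below. It is a sketch under assumptions: `zlib.crc32` is only a deterministic stand-in for whatever hash Hive uses, and the column names (`dept`, `salary`) and helper functions are hypothetical.

```python
import zlib
from collections import defaultdict


def distribute_by(rows, column, num_reducers):
    # Route each row to a reducer by hashing the distribute-by column.
    # crc32 is an assumed stand-in, not Hive's actual hash function.
    buckets = defaultdict(list)
    for row in rows:
        h = zlib.crc32(str(row[column]).encode("utf-8"))
        buckets[h % num_reducers].append(row)
    return buckets


def distribute_then_sort(rows, dist_col, sort_col, num_reducers):
    # distribute by dist_col, then sort by sort_col inside each reducer.
    buckets = distribute_by(rows, dist_col, num_reducers)
    return {
        reducer: sorted(bucket, key=lambda row: row[sort_col])
        for reducer, bucket in buckets.items()
    }


rows = [
    {"dept": "sales", "salary": 300},
    {"dept": "ops", "salary": 100},
    {"dept": "sales", "salary": 200},
    {"dept": "hr", "salary": 400},
    {"dept": "ops", "salary": 250},
]

result = distribute_then_sort(rows, "dept", "salary", num_reducers=2)
for reducer, output in sorted(result.items()):
    print(reducer, output)
```

All rows that share a `dept` value end up in the same reducer, so a per-department aggregation can run there, and each reducer's output file is ordered by `salary`.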
Cluster By
- In addition to the function of distribute by, cluster by also has the function of sort by.
- However, it can only sort in ascending order; you cannot specify ASC or DESC as the sort order.
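Since cluster by on a column behaves like distribute by plus sort by on that same column, it can be modeled in a few lines of Python. As before, this is an illustrative sketch: `zlib.crc32` is an assumed stand-in for Hive's hash, and the `id` column and `cluster_by` helper are hypothetical.

```python
import zlib


def cluster_by(rows, column, num_reducers):
    outputs = [[] for _ in range(num_reducers)]
    for row in rows:
        # Distribute: hash the column to pick a reducer (stand-in hash).
        h = zlib.crc32(str(row[column]).encode("utf-8"))
        outputs[h % num_reducers].append(row)
    # Sort: each reducer orders its rows by the same column, ascending only.
    return [sorted(part, key=lambda row: row[column]) for part in outputs]


rows = [{"id": 5}, {"id": 3}, {"id": 5}, {"id": 1}, {"id": 4}]
clustered = cluster_by(rows, "id", num_reducers=2)

for reducer, part in enumerate(clustered):
    print(reducer, part)
```

The model has no parameter for the sort direction, mirroring the restriction described above; when a different order is needed, distribute by and sort by can be written out separately.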
Origin www.cnblogs.com/ronnieyuan/p/12027812.html