order by, sort by, distribute by, and cluster by in Hive

Order By

  • order by performs a full sort of the input, so it runs on a single reducer (multiple reducers cannot guarantee a global order). With only one reducer, a large input leads to a long computation time.
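The point about a single reducer can be illustrated with a toy model in Python. This is not Hive itself: the rows are plain integers, each "reducer" is a list, and the round-robin split across two reducers is an assumption made only for illustration.

```python
# Toy model of ORDER BY versus sorting inside several reducers.
rows = [42, 7, 19, 3, 88, 61, 5, 27]

def order_by(rows):
    """Model of ORDER BY: every row goes to one reducer, which sorts them all."""
    single_reducer = list(rows)
    return sorted(single_reducer)

def sort_in_two_reducers(rows):
    """Split rows round-robin over two reducers (illustrative split only),
    sort each reducer's rows, then concatenate the two outputs."""
    reducers = [rows[0::2], rows[1::2]]
    return [value for part in reducers for value in sorted(part)]

global_result = order_by(rows)
concatenated = sort_in_two_reducers(rows)

print(global_result)   # globally ordered
print(concatenated)    # two sorted runs back to back, not globally ordered
```

Each reducer's output is sorted on its own, but simply concatenating the outputs does not give a global order, which is why order by falls back to one reducer.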

Sort By

  • sort by does not sort globally; the sorting is completed before the data enters the reducer.
  • Therefore, if you sort with sort by and set mapred.reduce.tasks > 1, sort by guarantees that the output of each reducer is ordered, but it does not guarantee a global order.
  • Unlike order by, sort by is not affected by the hive.mapred.mode property. It only guarantees that the data within the same reducer is sorted by the specified field.
  • With sort by you can specify the number of reducers (via set mapred.reduce.tasks=n). Running a merge sort over the reducer outputs then yields the complete, ordered result.
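The last bullet can be sketched with a small Python model, assuming integer rows and an arbitrary row-to-reducer split (Hive's actual split is not modeled here). The variable n plays the role of mapred.reduce.tasks, and heapq.merge from the standard library stands in for the final merge sort.

```python
import heapq

rows = [42, 7, 19, 3, 88, 61, 5, 27, 14]
n = 3  # plays the role of: set mapred.reduce.tasks=3

def sort_by(rows, n):
    """Model of SORT BY: return one sorted run per reducer.
    The row-to-reducer assignment here is arbitrary (illustrative only)."""
    reducers = [rows[i::n] for i in range(n)]
    return [sorted(part) for part in reducers]

runs = sort_by(rows, n)

# Each run is ordered, but the runs overlap, so there is no global order yet.
for index, run in enumerate(runs):
    print("reducer", index, run)

# Merging the already-sorted runs produces the full ordered result.
merged = list(heapq.merge(*runs))
print(merged)
```

Because every run is already sorted, the merge only has to compare the heads of the runs, which is much cheaper than sorting all rows again in one place.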

Distribute By

  • distribute by controls how the map-side data is split and handed to the reduce side.
  • Hive distributes rows according to the column that follows distribute by and the number of reducers, using a hash algorithm by default.
  • sort by generates one sorted file for each reducer.
  • In some cases you need to control which reducer a particular row goes to, usually so that a subsequent aggregation can be performed.
  • distribute by and sort by are often used together.
  • Usage scenarios for distribute by and sort by:
    • Uneven map output file sizes
    • Uneven reduce output file sizes
    • Too many small files
    • Oversized files
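A minimal Python sketch of combining distribute by with sort by, under stated assumptions: the (user, amount) rows are made up, and zlib.crc32 is only a deterministic stand-in for Hive's hash function, which is not reproduced here. The model corresponds roughly to distributing by user and sorting by amount.

```python
import zlib

# Hypothetical sample rows: (user, amount).
rows = [("alice", 30), ("bob", 12), ("alice", 5),
        ("carol", 44), ("bob", 7), ("carol", 1)]

def reducer_for(key, n):
    """Pick a reducer by hashing the key (crc32 is a stand-in, not Hive's hash)."""
    return zlib.crc32(key.encode("utf-8")) % n

def distribute_by_sort_by(rows, n):
    """Send each row to the reducer chosen by its user, then sort each
    reducer's rows by amount."""
    reducers = [[] for _ in range(n)]
    for user, amount in rows:
        reducers[reducer_for(user, n)].append((user, amount))
    return [sorted(part, key=lambda row: row[1]) for part in reducers]

parts = distribute_by_sort_by(rows, 2)
for index, part in enumerate(parts):
    print("reducer", index, part)
```

All rows for a given user land in the same reducer, so a later per-user aggregation can run without looking at any other reducer's output.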

Cluster By

  • cluster by combines the function of distribute by with that of sort by on the same column.
  • However, the sort is ascending only; you cannot specify ASC or DESC as the sort order.
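The relationship can be modeled in a few lines of Python. This is a sketch with made-up (department, name) rows, and zlib.crc32 is again an assumed stand-in for Hive's hash; the point is only that cluster by behaves like distribute by followed by an ascending sort by on the same column.

```python
import zlib

# Hypothetical sample rows: (department, name).
rows = [("sales", "li"), ("ops", "wang"), ("dev", "zhao"),
        ("sales", "chen"), ("dev", "sun"), ("ops", "liu")]

def distribute_by(rows, column, n):
    """Hash the chosen column to pick a reducer (crc32 is a stand-in hash)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(row[column].encode("utf-8")) % n].append(row)
    return parts

def sort_by(parts, column):
    """Sort each reducer's rows by the chosen column, ascending."""
    return [sorted(part, key=lambda row: row[column]) for part in parts]

def cluster_by(rows, column, n):
    """Model of CLUSTER BY: distribute and sort on the same column,
    with no way to request a descending order."""
    return sort_by(distribute_by(rows, column, n), column)

clustered = cluster_by(rows, 0, 2)
for index, part in enumerate(clustered):
    print("reducer", index, part)
```

When a descending order or a different sort column is needed, the two clauses have to be written separately instead of using cluster by.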

Origin www.cnblogs.com/ronnieyuan/p/12027812.html