The differences between group by, distribute by, sort by, cluster by, and order by in Hive

order by

Much of Hive's syntax is inherited from MySQL, first to keep the learning cost down and second because MySQL's syntax is easy to get used to. order by was carried over the same way, but in a big data environment it is not as useful as it is in MySQL. A data warehouse processes a very large amount of data, and order by performs a global sort over all of it using only a single reducer. That is inefficient and consumes a lot of resources, so order by is only suitable when the amount of data is small and should be used with caution.

There is one small difference from traditional SQL here: if you set hive.mapred.mode = strict (nonstrict is the default in older Hive releases), an order by must be accompanied by a limit that caps the number of output rows. The reason is that all the data is processed in a single reducer, and with a large amount of data the result may never come back, so strict mode requires you to bound the output.
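As a rough sketch of the rule above, the queries below show order by with and without limit under strict mode. The table employees and its columns name and salary are hypothetical and used only for illustration.

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary DOUBLE)
SET hive.mapred.mode=strict;

-- Expected to be rejected in strict mode: order by without limit
-- SELECT name, salary FROM employees ORDER BY salary DESC;

-- Accepted: the limit bounds how many rows the single reducer has to emit
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;
```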

sort by

sort by only performs a partial sort: it sorts the rows inside each reducer and does not guarantee a global order. If there is more than one reducer, the output of each reducer is ordered on its own, but the combined result across reducers is not.
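A minimal sketch of this behavior, assuming a hypothetical employees table and an output path chosen only for the example. With three reducers, each output file should be sorted by salary on its own, while the files taken together are not.

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary DOUBLE)
-- Ask for three reducers (older Hadoop versions use mapred.reduce.tasks)
SET mapreduce.job.reduces=3;

-- Each reducer writes one file; rows are ordered only within a file
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sort_by_demo'
SELECT name, salary
FROM employees
SORT BY salary DESC;
```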

distribute by

distribute by partitions the rows by a field, so that rows with the same value in that field are sent to the same reducer. It is generally used together with sort by, and in the query distribute by must be written before sort by.
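The combination might look like the following, again on a hypothetical employees table. Rows of the same dept land on the same reducer, and within each reducer they are sorted by dept and then by salary in descending order.

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary DOUBLE)
SET mapreduce.job.reduces=3;

-- distribute by comes first, sort by second
SELECT dept, name, salary
FROM employees
DISTRIBUTE BY dept
SORT BY dept, salary DESC;
```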

cluster by

cluster by is the combination of distribute by and sort by on the same column. The sort is always ascending: you cannot specify asc or desc for a cluster by column.
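As an illustration on a hypothetical employees table, the two queries below should distribute and order the rows in the same way. If a descending order is needed, the explicit distribute by plus sort by form has to be used instead.

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary DOUBLE)

-- Shorthand form
SELECT dept, name
FROM employees
CLUSTER BY dept;

-- Long form with the same effect: same column, ascending sort
SELECT dept, name
FROM employees
DISTRIBUTE BY dept
SORT BY dept;
```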

Origin blog.csdn.net/qq_43205282/article/details/105017337