Hive inherited most of its syntax from MySQL, partly to lower the cost of learning it and partly because MySQL's syntax is easy to get used to. order by was carried over in this way, but in a big-data environment it is far less useful than in MySQL. A data warehouse processes very large volumes of data, and order by performs a full sort of all the data using only a single reduce task. This is inefficient and consumes a lot of resources, so order by suits only small amounts of data and should otherwise be used with caution.
There is one small difference from traditional SQL here: if you set hive.mapred.mode=strict (the default is nonstrict), an order by must be accompanied by a limit that caps the number of output rows. The reason is that all the data is processed in a single reducer, which may never produce a result when the data volume is large, so strict mode requires the output to be bounded.
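As a minimal sketch of the strict-mode rule, assuming a hypothetical table sales(store_id, amount) that is not part of the original text:

```sql
-- Hypothetical table: sales(store_id INT, amount DOUBLE)
SET hive.mapred.mode=strict;

-- In strict mode an ORDER BY without LIMIT is rejected, e.g.:
--   SELECT store_id, amount FROM sales ORDER BY amount DESC;

-- Accepted: the single reducer only has to emit the top 10 rows.
SELECT store_id, amount
FROM sales
ORDER BY amount DESC
LIMIT 10;
```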
sort by
sort by performs only a partial sort and does not guarantee a global order. The sort happens within each reducer of the MapReduce job: if there is more than one reducer, the output of each reducer is ordered, but there is no order across reducers.
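The following sketch shows the per-reducer behaviour, again assuming the hypothetical table sales(store_id, amount) and an arbitrarily chosen reducer count of 3:

```sql
-- Ask for several reducers (the property is named mapred.reduce.tasks
-- in older Hadoop versions).
SET mapreduce.job.reduces=3;

-- Each of the 3 reducers sorts its own rows by amount,
-- so each output file is ordered but the files together are not.
SELECT store_id, amount
FROM sales
SORT BY amount DESC;
```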
distribute by
distribute by partitions rows by a field, so that rows with the same value of that field are sent to the same reducer for processing. It is generally used in combination with sort by, and when the two are combined, distribute by must be written before sort by.
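A sketch of the combination, assuming the same hypothetical table sales(store_id, amount):

```sql
-- All rows of one store go to the same reducer (DISTRIBUTE BY),
-- and inside each reducer the rows are ordered by store and then
-- by amount, largest first (SORT BY).
-- DISTRIBUTE BY has to come before SORT BY.
SELECT store_id, amount
FROM sales
DISTRIBUTE BY store_id
SORT BY store_id, amount DESC;
```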
cluster by
cluster by is equivalent to using distribute by and sort by together on the same column. The sort order of cluster by is ascending only; it cannot be set to descending.
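The equivalence can be sketched with two queries, assuming the hypothetical table sales(store_id, amount):

```sql
-- Shorthand form: distribute and sort on the same column, ascending.
SELECT store_id, amount
FROM sales
CLUSTER BY store_id;

-- Long form that should behave the same way.
SELECT store_id, amount
FROM sales
DISTRIBUTE BY store_id
SORT BY store_id;
```

If a descending order is needed, the long form with an explicit sort by ... desc is the one to use.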