Tuning Hive GROUP BY

  • By default, after the Map stage, all rows with the same key are sent to a single reducer, so a key with too much data causes data skew. Not every aggregation has to be completed on the Reduce side: many aggregations can be partially performed on the Map side first, with the Reduce side then producing the final result.

  • Parameters for enabling Map-side aggregation

    • Whether to aggregate on the Map side (default: true): hive.map.aggr = true

    • Number of rows on which Map-side aggregation is performed: hive.groupby.mapaggr.checkinterval = 100000

    • Whether to load-balance when the data is skewed (default: false): hive.groupby.skewindata = true

      • When this option is set to true, the generated query plan has two MR jobs. In the first MR job, the Map output is distributed randomly across the reducers; each reducer performs a partial aggregation and outputs its result. Because rows with the same key may go to different reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to the reducers by key (which guarantees that rows with the same key go to the same reducer) and completes the final aggregation.
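The two-job plan described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Hive's implementation: the function names, the CRC32-based partitioner, the fixed random seed, and the sample data are all assumptions made for illustration, and the aggregation is a simple COUNT per key.

```python
import random
import zlib
from collections import Counter


def job1_random_partial(rows, num_reducers, seed=0):
    """First job: spread rows randomly over the reducers.

    Each reducer counts only the rows it receives, so a hot key
    is split across several reducers instead of landing on one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    reducers = [Counter() for _ in range(num_reducers)]
    for key in rows:
        reducers[rng.randrange(num_reducers)][key] += 1
    return reducers


def job2_final_by_key(partials, num_reducers):
    """Second job: route the partial counts by key.

    The same key always maps to the same reducer, which then
    sums the partial counts into the final result.
    """
    reducers = [Counter() for _ in range(num_reducers)]
    for partial in partials:
        for key, count in partial.items():
            target = zlib.crc32(key.encode("utf-8")) % num_reducers
            reducers[target][key] += count
    return reducers


def group_by_count(rows, num_reducers=4):
    """Run both simulated jobs and merge the reducer outputs."""
    partials = job1_random_partial(rows, num_reducers)
    finals = job2_final_by_key(partials, num_reducers)
    result = Counter()
    for reducer in finals:
        result.update(reducer)
    return dict(result)


# Hypothetical skewed input: one hot key and three small keys.
rows = ["hot"] * 1000 + ["a", "b", "c"] * 10
counts = group_by_count(rows)
```

In the first job no single reducer has to process all 1000 "hot" rows; in the second job each reducer receives at most one partial count per key from each first-stage reducer, so the amount of data shuffled by key is small even for the hot key.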

Origin www.cnblogs.com/xiangyuguan/p/11411603.html