Some Hive optimization strategies

1. Hive fetch strategy
    hive.fetch.task.conversion = more/none
    "more": simple queries (select, filter, limit) are served by a direct fetch and do not launch a MapReduce job; "none": every query goes through MapReduce
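A minimal sketch of the two settings in action, assuming a hypothetical table `emp`:

```sql
-- assuming a hypothetical table emp
set hive.fetch.task.conversion=none;
select * from emp limit 5;   -- launches a MapReduce job

set hive.fetch.task.conversion=more;
select * from emp limit 5;   -- served by a direct fetch, no MapReduce job
```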
 
2. explain: show the execution plan
 
3. Local execution mode
    set hive.exec.mode.local.auto = true
    hive.exec.mode.local.auto.inputbytes.max (default 128M) is the maximum input size for local mode; if the input is larger than this value, the query runs in cluster mode
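The local-mode settings as a configuration sketch (128M expressed in bytes):

```sql
set hive.exec.mode.local.auto=true;
-- inputs larger than this (in bytes) fall back to cluster mode
set hive.exec.mode.local.auto.inputbytes.max=134217728;
```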
 
4. Parallel execution
    set hive.exec.parallel = true/false
    set hive.exec.parallel.thread.number (default 8)
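A sketch of parallel execution, assuming hypothetical tables `a` and `b`; the two aggregation stages do not depend on each other, so with parallelism enabled Hive can run them concurrently:

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- the two count stages are independent and can run in parallel
select count(*) from a
union all
select count(*) from b;
```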
 
5. Strict mode
    set hive.mapred.mode = strict/nonstrict
    Strict mode restricts queries:
  • A query on a partitioned table must include a where filter on the partition column
  • An order by statement must include a limit clause
  • Queries that would execute a Cartesian product are rejected
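A sketch of the three restrictions, assuming a hypothetical table `logs` partitioned by `dt`:

```sql
set hive.mapred.mode=strict;

-- rejected: no filter on the partition column
-- select * from logs;

-- rejected: order by without limit
-- select * from logs where dt='2019-09-01' order by uid;

-- accepted
select * from logs where dt='2019-09-01' order by uid limit 100;
```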
 
6. Hive sorting
  • order by: globally sorts the query result; only one reducer is allowed (with large data volumes, use with caution; in strict mode it must be combined with limit)
  • sort by: sorts the data within each individual reducer
  • distribute by: partitions rows among reducers; usually combined with sort by
  • cluster by: equivalent to distribute by + sort by on the same column
    •   cluster by cannot specify an asc/desc sort order; to control the order, use distribute by column sort by column asc|desc instead
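A sketch contrasting the two forms, assuming a hypothetical table `emp` with a `deptno` column:

```sql
-- cluster by: distribute and sort on the same column, ascending only
select * from emp cluster by deptno;

-- equivalent, but the sort direction can be chosen
select * from emp distribute by deptno sort by deptno desc;
```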
 
7. Hive join
  • When computing a join, put the small (driving) table on the left side of the join
  • Map join: the join is completed on the map side
    •   SQL way: add a mapjoin hint to the SQL statement
      •   Syntax: select /*+ MAPJOIN(b) */ a.key, a.value from a join b on a.key = b.key
    •   Automatic map join

      •       Enable automatic map join with the following configuration
          •   set hive.auto.convert.join = true (when true, Hive automatically checks the table sizes and starts a map join if the small table fits in memory)
          •   hive.mapjoin.smalltable.filesize, default 25M
          •   hive.ignore.mapjoin.hint: whether to ignore the mapjoin hint
  • Whenever possible, use the same join keys (so the joins are converted into a single MR job)
  • Large table join large table (the following techniques do not always help)
    • Empty key filtering: a join sometimes times out because some keys have too much data; all rows with the same key are sent to the same reducer, which then runs out of memory. Analyze these unusual keys carefully; in many cases their data is abnormal and should be filtered out in the SQL statement.
    • Empty key conversion: sometimes a null key corresponds to a lot of data that is not abnormal and must be included in the join result. In this case, assign a random value to the null key fields so that the rows are distributed evenly across different reducers.
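The join techniques above can be sketched as follows, assuming hypothetical tables `emp`, `dept`, `nullidtable`, and `ori`:

```sql
-- automatic map join: dept is below the size threshold,
-- so the join completes on the map side
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
select e.ename, d.dname from emp e join dept d on e.deptno = d.deptno;

-- empty key filtering: drop null keys before the join
select n.* from (select * from nullidtable where id is not null) n
join ori o on n.id = o.id;

-- empty key conversion: give null keys random values so they spread
-- across reducers instead of piling onto one
select n.* from nullidtable n full join ori o
on case when n.id is null then concat('hive', rand()) else n.id end = o.id;
```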
 
8. Map-side aggregation
  • Enable map-side aggregation: set hive.map.aggr = true
  • hive.groupby.mapaggr.checkinterval: number of rows processed per map-side group by aggregation (default 100000)
  • hive.map.aggr.hash.min.reduction: minimum aggregation ratio (pre-aggregate the first 100,000 rows; if rows remaining after aggregation / 100,000 is greater than this value, default 0.5, map-side aggregation is abandoned)
  • hive.map.aggr.hash.percentmemory: maximum memory usable by map-side aggregation
  • hive.map.aggr.hash.force.flush.memory.threshold: maximum memory for the hash table used by map-side aggregation; when this value is exceeded, a flush is triggered
  • hive.groupby.skewindata: whether to optimize for data skew produced by group by; default false
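A configuration sketch for map-side aggregation, assuming a hypothetical `emp` table:

```sql
set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.groupby.skewindata=true;  -- splits a skewed group by into two jobs

select deptno, count(*) from emp group by deptno;
```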
 
9. Merging small files: many small files put pressure on the storage side (HDFS) and hurt query efficiency
  • Merge-related properties
    • Whether to merge map output files: hive.merge.mapfiles = true
    • Whether to merge reduce output files: hive.merge.mapredfiles = true
    • Size of the merged files: hive.merge.size.per.task = 256 * 1000 * 1000
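The merge settings above as a ready-to-paste fragment:

```sql
set hive.merge.mapfiles=true;            -- merge map-only output files
set hive.merge.mapredfiles=true;         -- merge map-reduce output files
set hive.merge.size.per.task=256000000;  -- target size of merged files
```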
 
10. Deduplicated counting: with small data volumes this does not matter, but with large volumes a COUNT DISTINCT must be completed by a single Reduce task, and the amount of data that one reducer has to process can make the whole job hard to finish. The usual fix is to replace COUNT DISTINCT with a GROUP BY followed by a COUNT.
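A sketch of the rewrite, assuming a hypothetical table `bigtable` with an `id` column:

```sql
-- all distinct values funnel through a single reducer
select count(distinct id) from bigtable;

-- rewrite: group by spreads the work across reducers,
-- then a second stage counts the groups
select count(*) from (select id from bigtable group by id) t;
```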
 
11. Controlling the number of maps and reduces in Hive
  • Parameters related to the number of maps
    • mapred.max.split.size: maximum size of each split, i.e. the amount of file data each map processes
    • mapred.min.split.size.per.node: minimum split size on a node
    • mapred.min.split.size.per.rack: minimum split size on a rack
  • Parameters related to the number of reduces
    • mapred.reduce.tasks: force a specific number of reduce tasks
    • hive.exec.reducers.bytes.per.reducer: amount of data processed by each reduce task
    • hive.exec.reducers.max: maximum number of reduce tasks per job
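The split and reducer parameters as a configuration sketch (the values are illustrative, not recommendations):

```sql
set mapred.max.split.size=256000000;           -- ~256M per map
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;

set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=100;
-- or force the reducer count directly:
set mapred.reduce.tasks=10;
```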
 
12. Hive JVM reuse
    • Applicable scenarios
      • Excessive number of small files
      • Excessive number of tasks
    • Enable with set mapred.job.reuse.jvm.num.tasks = n
      •   Drawback: once enabled, a task slot keeps occupying its resources whether or not there are tasks left to run, and only releases them after every task in the whole job has finished executing

Origin www.cnblogs.com/liufei-yes/p/11518338.html