Hive optimization strategy

Hive optimization strategy is broadly divided into three areas: configuration optimization (parameters set in hive-site.xml or in the hive CLI before a query runs), table and query optimization, and solutions for data skew in Hive.
When discussing this topic, be able to state the specific configuration parameters accurately; this is a hard-won lesson.

Configuration optimization

1- Fetch task configuration

  • Fetch means that for some queries, Hive does not necessarily need to run a MapReduce computation at all. For example: SELECT * FROM employees; in this case, Hive can simply read the files in the storage directory of the employees table and output the query results to the console.
  • In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (in old versions of Hive the default was minimal; the property was later changed to more). With the property set to more, global look-ups, look-ups on a field, LIMIT look-ups, and so on do not go through MapReduce.
<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to a single FETCH task, minimizing latency.
      Currently the query should be single sourced, not having any subquery, and should not have
      any aggregations or distincts (which incur RS), lateral views or joins.
      0. none    : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more    : SELECT, FILTER, LIMIT only (supports TABLESAMPLE and virtual columns)
    </description>
</property>
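A quick way to check the behavior is to toggle the property in a hive session and watch whether a simple query launches MapReduce (emp is just an illustrative table name):

hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;    -- now goes through MapReduce
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;    -- reads the table files directly, no MapReduce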

2- Enable Hive local mode

  • Most Hadoop jobs need the full scalability that Hadoop provides in order to process large data sets. Sometimes, however, the amount of data input to Hive is very small. In such cases, the time consumed triggering and scheduling the tasks for the query may be far more than the actual execution time of the job. For most of these cases, Hive's local mode can handle all of the tasks on a single machine; for small data sets, execution time can be significantly shortened.
  • Users can set hive.exec.mode.local.auto to true to make Hive enable this optimization automatically at the appropriate time; the sketch below shows the relevant settings.
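A minimal sketch of the session settings involved; the two threshold parameters control when the automatic switch to local mode actually happens (the values shown are the usual defaults):

hive (default)> set hive.exec.mode.local.auto=true;
-- maximum total input size for local mode; beyond this, cluster mode is used (128MB)
hive (default)> set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- maximum number of input files for local mode; beyond this, cluster mode is used
hive (default)> set hive.exec.mode.local.auto.input.files.max=4;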

3- Enable Hive parallel execution

  • A Hive query is converted into one or more stages. Such a stage can be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or any other stage needed during Hive's execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages, and these stages may not be entirely dependent on one another; that is, some stages can be executed in parallel, which may shorten the execution time of the whole job. The more stages that can run in parallel, the faster the job may complete.
  • Setting the parameter hive.exec.parallel to true enables parallel execution. Note, however, that on a shared cluster, if the number of parallel stages in a job increases, cluster utilization will increase as well.
  • Setting parameters (an illustration follows below):
    • set hive.exec.parallel=true;              // enable parallel execution of stages
    • set hive.exec.parallel.thread.number=16;  // maximum parallelism allowed for one SQL statement; the default is 8
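As an illustration, a query whose branches do not depend on each other, such as the UNION ALL below (emp_2023 and emp_2024 are hypothetical tables), produces stages that Hive can run concurrently once hive.exec.parallel is on:

set hive.exec.parallel=true;
select deptno, count(*) from emp_2023 group by deptno
union all
select deptno, count(*) from emp_2024 group by deptno;
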
4-Hive strict mode
  • Hive provides a strict mode that can prevent users from running queries with potentially harmful, unintended effects.
  • The default value of hive.mapred.mode is nonstrict (non-strict mode). Turning on strict mode requires setting hive.mapred.mode to strict; with strict mode on, three types of query are disabled (see the sketch after this list).
    • 1) For partitioned tables, queries are not allowed unless the WHERE clause contains a filter condition on the partition column to limit the scope.
    • 2) Queries that use ORDER BY must also use a LIMIT clause.
    • 3) Queries that would produce a Cartesian product are restricted.
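A brief sketch of what strict mode rejects; logs is a hypothetical table partitioned by dt:

set hive.mapred.mode=strict;
select * from logs;                                              -- rejected: no partition filter
select * from logs where dt = '2023-01-01';                      -- allowed
select * from logs where dt = '2023-01-01' order by ts;          -- rejected: ORDER BY without LIMIT
select * from logs where dt = '2023-01-01' order by ts limit 10; -- allowed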

5-JVM reuse

  • JVM reuse is a Hadoop tuning parameter that has a very large impact on Hive performance, especially in scenarios where a very large number of small files is hard to avoid, or in task-heavy scenarios where most tasks have very short execution times.
  • Hadoop's default configuration usually uses a forked JVM to execute each map and reduce task, and JVM startup can cause considerable overhead, especially for jobs containing hundreds or thousands of tasks. JVM reuse allows a JVM instance to be reused up to N times within the same job. The value of N can be configured in Hadoop's mapred-site.xml file; it is usually set between 10 and 20, and the exact value needs to be determined by testing against the specific business scenario.
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
</property>
  • The drawback of enabling JVM reuse is that the task slots held for reuse remain occupied and are not released until the job completes. If an "unbalanced" job has a few reduce tasks whose execution time is much longer than that of the other reduce tasks, the reserved slots sit idle and cannot be used by any other job until every task has finished and the slots are released.
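Assuming the cluster allows it, the same value can also be set per job from the Hive CLI rather than cluster-wide in mapred-site.xml; a minimal sketch:

hive (default)> set mapreduce.job.jvm.numtasks=10;  -- reuse each JVM for up to 10 tasks in this job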

6- Enable Hive speculative execution

  • In a distributed cluster environment, program bugs (including bugs in Hadoop itself), load imbalance, uneven resource distribution, and similar causes can make the speed of multiple tasks within one job inconsistent; some tasks may run significantly slower than the others (for example, one task of a job is only 50% complete while all other tasks have already finished), and these tasks drag down the overall progress of the job. To prevent this, Hadoop uses the speculative execution (Speculative Execution) mechanism: it identifies "straggler" tasks according to certain rules, launches a backup task for each of them so that the backup processes the same data as the original, and takes the result of whichever copy finishes first as the final result.
  • Parameters to enable speculative execution, configured in Hadoop's mapred-site.xml file:
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks may be executed in parallel.</description>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks may be executed in parallel.</description>
</property>
  • But Hive itself also provides a configuration item to control speculative execution on the reduce side:
<property>
    <name>hive.mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
    <description>Whether speculative execution for reducers should be turned on.</description>
</property>
  • It is difficult to give a specific recommendation for tuning these speculative-execution variables. If users are very sensitive to runtime deviations, they may want to turn these features off. If a user's map or reduce tasks run for a long time because the input data volume is very large, then the waste caused by launching speculative backup tasks can be enormous.
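All three switches can also be flipped per session from the Hive CLI; a minimal sketch:

hive (default)> set mapreduce.map.speculative=true;
hive (default)> set mapreduce.reduce.speculative=true;
hive (default)> set hive.mapred.reduce.tasks.speculative.execution=true;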

7- Compression

Enable map output stage compression

  • Enabling compression at the map output stage reduces the volume of data transferred between the map and reduce tasks of a job. The specific configuration is as follows:
1) Enable Hive intermediate data compression
hive (default)> set hive.exec.compress.intermediate=true;
2) Enable map output compression in MapReduce
hive (default)> set mapreduce.map.output.compress=true;
3) Set the codec used to compress map output in MapReduce
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4) Run a query
hive (default)> select count(ename) name from emp;

Enable reduce output stage compression

  • When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this feature. Users will probably want to keep the default value of false in their default settings file, so that the default output is an uncompressed plain-text file; the value can then be set to true in a query or script to enable output compression.
1) Enable compression of Hive's final output data
hive (default)> set hive.exec.compress.output=true;
2) Enable MapReduce final output compression
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3) Set the MapReduce final output compression codec
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4) Set MapReduce final output compression to block compression
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5) Test whether the output is a compressed file
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

Hive table optimization

1- Small table JOIN large table

  • Putting the table whose keys are relatively dispersed and whose data volume is small on the left side of the JOIN can effectively reduce the chance of out-of-memory errors; going further, a map join can load small dimension tables (fewer than about 1000 records) into memory first, completing the join on the map side and avoiding the reduce side.
  • Actual testing has found that newer versions of Hive optimize "small table JOIN large table" and "large table JOIN small table" alike; there is no longer a significant difference between putting the small table on the left or on the right.
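A small illustration, with dept as the small dimension table and emp_big as a hypothetical large fact table; in current Hive versions both orderings are planned the same way:

select e.ename, d.dname
from dept d
join emp_big e on d.deptno = e.deptno;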

2- Large table JOIN large table

  • 1) Empty-key filtering
    • Sometimes a join times out because certain keys carry too much data; rows with the same key are all sent to the same reducer, and memory runs out. In such cases we should carefully analyze these unusual keys; very often the data they correspond to is abnormal data, and we need to filter it out in the SQL statement, for example rows whose key field is NULL.
  • 2) Empty-key transformation
    • At other times an empty key corresponds to a lot of data that is not abnormal at all and must be included in the join result. In that case we can assign a random value to the rows of table a whose key field is empty, so that the data is randomized and spread evenly across different reducers (see the sketch after this list).
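A sketch of both techniques; the tables nullidtable, ori, and jointable are hypothetical names:

-- 1) Empty-key filtering: drop NULL keys before the join
insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null) n
join ori o on n.id = o.id;

-- 2) Empty-key transformation: replace NULL keys with random values so the
--    rows scatter evenly across reducers instead of piling onto one
insert overwrite table jointable
select n.* from nullidtable n
full join ori o
on case when n.id is null then concat('hive', rand()) else n.id end = o.id;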

3-MapJoin

  • If MapJoin is not enabled, or the conditions for MapJoin are not met, the Hive parser converts the JOIN into a Common Join, i.e. the join is completed in the Reduce stage, where data skew easily occurs. MapJoin can instead load the small table entirely into memory and join on the map side, avoiding reducer processing altogether (a usage sketch follows the parameter list below).
    • 1) Parameters to enable MapJoin:
      (1) Enable automatic MapJoin conversion (defaults to true):
      set hive.auto.convert.join = true;
      (2) Threshold below which a table is considered small (the default is about 25MB):
      set hive.mapjoin.smalltable.filesize = 25000000;
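With automatic conversion enabled, any join against a table below the size threshold is planned as a MapJoin; a sketch (dept small, emp_big a hypothetical large table):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
select e.ename, d.dname
from emp_big e
join dept d on e.deptno = d.deptno;  -- dept is under the threshold, so the join runs map-side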

4-Group By

  • By default, during the map phase all rows with the same key are distributed to a single reducer; when one key's data volume is too large, the data is skewed.
  • Not all aggregation operations need to be done on the reduce side; many aggregations can first be partially performed on the map side, with the final result obtained on the reduce side.
  • 1) Parameters to enable map-side aggregation
    (1) Whether to aggregate on the map side (defaults to true)
    hive.map.aggr = true
    (2) Number of entries the map side processes per aggregation
    hive.groupby.mapaggr.checkinterval = 100000
    (3) Load balancing when the data is skewed (defaults to false)
    hive.groupby.skewindata = true
  • When this option is set to true, the generated query plan has two MR jobs. In the first MR job, the map output is distributed randomly among the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may be distributed to different reducers, which achieves the load-balancing purpose. The second MR job then distributes the preprocessed results to the reducers by Group By key (this step guarantees that rows with the same Group By key are distributed to the same reducer) and finally completes the final aggregation.
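A sketch of the settings in front of a skew-prone aggregation (emp is an illustrative table name):

set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.groupby.skewindata=true;
select deptno, count(*) from emp group by deptno;  -- now planned as two MR jobs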

5- Count(Distinct) deduplicated counting

  • With a small amount of data this does not matter, but with a large amount of data, a COUNT DISTINCT operation is completed by a single Reduce task, and the volume of data that this one reducer has to process is so large that the whole job becomes difficult to complete. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT.
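The rewrite looks like this (bigtable is a hypothetical name); the GROUP BY spreads the deduplication across many reducers, at the cost of one extra job:

-- a single reducer performs all of the deduplication:
select count(distinct id) from bigtable;
-- equivalent result, but the GROUP BY stage parallelizes:
select count(t.id) from (select id from bigtable group by id) t;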

6- Cartesian product

  • Avoid Cartesian products: when a join has no ON condition, or an invalid ON condition, Hive can only use a single reducer to complete the Cartesian product.
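For example (a and b are hypothetical tables):

select * from a join b;                  -- no join condition: Cartesian product, single reducer
select * from a join b on a.id = b.id;   -- valid equi-join condition: parallelizable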

7- Column and row filtering

  • Column filtering: in SELECT, take only the columns you need; when partitions exist, use partition filters as much as possible, and use SELECT * less.
  • Row filtering: cut rows early during partition pruning. When using an outer join, if the filter condition on the secondary table is written in the WHERE clause, the full tables are joined first and only filtered afterwards; push the filter into a subquery instead, as the sketch below shows.
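A sketch of the difference (ori and bigtable are hypothetical names):

-- joins the full tables first, filters afterwards:
select o.id from bigtable b join ori o on b.id = o.id where o.id <= 10;
-- filters first in a subquery, then joins far fewer rows:
select b.id from bigtable b join (select id from ori where id <= 10) o on b.id = o.id;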

Origin www.cnblogs.com/sx66/p/12039571.html