Hive tuning operations

1. hive.fetch.task.conversion=more. With this property set to more, Hive does not use MapReduce for simple queries such as global fetches (SELECT *), column selection, and LIMIT queries; it fetches the results directly instead.
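For example (a sketch; `employees` is a hypothetical table), with fetch conversion enabled the following queries return results directly without launching a MapReduce job:

```sql
SET hive.fetch.task.conversion=more;

-- All three run as simple fetch tasks; no MapReduce job is launched:
SELECT * FROM employees;            -- global fetch
SELECT name, salary FROM employees; -- column selection
SELECT * FROM employees LIMIT 10;   -- LIMIT query
```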

2. When the input data volume is small, the overhead of triggering a query's tasks may exceed the actual job execution time. Hive's local mode processes all tasks on a single machine, which can significantly shorten execution time for small data sets.

// Enable local MR
hive.exec.mode.local.auto=true;

// Maximum input size for local MR; local mode is used when the input is smaller than this value (default 134217728, i.e. 128 MB)
hive.exec.mode.local.auto.inputbytes.max=50000000;

// Maximum number of input files for local MR; local mode is used when the file count is below this value (default 4)
hive.exec.mode.local.auto.input.files.max=10;

3. Sometimes a join times out because certain keys carry too much data; all rows with the same key are sent to the same reducer, which then runs out of memory.
Handle these abnormal values first: either filter out rows whose key is NULL, or replace NULL keys with a special marker plus a random number so they spread across reducers, then strip the marker in the reduce stage before performing the final aggregation.
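A common way to write the salting trick in HiveQL (a sketch; `log` and `users` are hypothetical tables joined on a user id) is:

```sql
-- NULL join keys would all hash to the same reducer; salt them with rand()
-- so they spread evenly. The salted keys never match users.id, so the
-- LEFT JOIN semantics for NULL keys are preserved (they simply miss).
SELECT l.*, u.name
FROM log l
LEFT JOIN users u
  ON CASE WHEN l.user_id IS NULL
          THEN concat('null_', rand())
          ELSE l.user_id
     END = u.id;
```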

4. MapJoin. Without MapJoin, the join is aggregated on the Reducer side, and heavily skewed keys can make some reducers very slow. With MapJoin, the small table is broadcast into the memory of every Map task and the join is performed directly on the Map side, so no reduce step is needed.

// Enable automatic map-side join (default true)
hive.auto.convert.join=true;

// Threshold separating large and small tables (by default, tables under 25 MB are treated as small):
hive.mapjoin.smalltable.filesize=25000000;
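With hive.auto.convert.join=true Hive converts the join automatically when one side is below the threshold; the older explicit hint still works (a sketch, assuming a small hypothetical `dim_city` dimension table):

```sql
-- Broadcast dim_city into every map task's memory and join on the map side
SELECT /*+ MAPJOIN(c) */ o.order_id, c.city_name
FROM orders o
JOIN dim_city c ON o.city_id = c.city_id;
```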

5. GROUP BY. By default, all Map-phase rows with the same key are distributed to one reducer; when a single key's data volume is too large, the load skews.

// Enable map-side partial aggregation (aggregation, not a join)
hive.map.aggr = true

// Number of entries aggregated on the Map side before a check
hive.groupby.mapaggr.checkinterval = 100000

// When set to true, the query plan contains two MR jobs. In the first job, map output is distributed randomly across reducers, each of which performs
// partial aggregation and emits its result; identical Group By keys may land in different reducers, which balances the load. The second job then
// distributes the pre-aggregated results to reducers by Group By key (guaranteeing that identical keys reach the same reducer) and completes the final aggregation.
hive.groupby.skewindata = true


6. COUNT(DISTINCT) deduplication

This is harmless on small data. On large data, the COUNT DISTINCT operation is completed by a single Reduce task, and the volume that this one reducer must process can make the whole job difficult to finish.
Generally, replace COUNT DISTINCT with a GROUP BY followed by COUNT.
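For instance (assuming a hypothetical `visits` table with a `user_id` column):

```sql
-- Skewed: one reducer must deduplicate every user_id
SELECT COUNT(DISTINCT user_id) FROM visits;

-- Balanced: GROUP BY deduplicates across many reducers first,
-- then a cheap COUNT runs over the already-distinct keys
SELECT COUNT(*)
FROM (SELECT user_id FROM visits GROUP BY user_id) t;
```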


7. Avoid Cartesian products. Do not write a join without an ON condition, or with an invalid one; Hive can only use a single reducer to compute a Cartesian product.
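As a sketch with hypothetical tables `a` and `b`:

```sql
-- Bad: no ON condition, so Hive computes the full Cartesian product on 1 reducer
SELECT * FROM a JOIN b;

-- Good: an equality condition lets the join be distributed across reducers
SELECT * FROM a JOIN b ON a.id = b.id;
```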


8. Column and row filtering

Column filtering: in SELECT, take only the columns you need; use partition filtering wherever partitions exist, and avoid SELECT *.

Row filtering: with partition pruning and outer joins, if the filter condition on the secondary table is written in the WHERE clause, the whole table is joined first and the result is filtered afterwards; push the filter into a subquery (or the ON clause) instead.
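A sketch with a hypothetical `orders` table and a date-partitioned `users` table (whether the optimizer pushes the predicate down varies by Hive version; the subquery form makes the pruning explicit):

```sql
-- Filter applied after the join: users may be scanned whole, then filtered
SELECT o.id
FROM orders o
JOIN users u ON o.uid = u.id
WHERE u.dt = '2024-01-01';

-- Filter pushed into a subquery: only one partition of users is joined
SELECT o.id
FROM orders o
JOIN (SELECT id FROM users WHERE dt = '2024-01-01') u
  ON o.uid = u.id;
```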

9. Dynamic partition adjustment 

// Enable dynamic partitioning (default true, enabled)
hive.exec.dynamic.partition=true

// Use non-strict mode (the dynamic-partition mode defaults to strict, which requires at least one partition to be specified statically; nonstrict allows all partition columns to be dynamic)
hive.exec.dynamic.partition.mode=nonstrict

// Maximum total number of dynamic partitions that may be created across all nodes executing the MR job
hive.exec.max.dynamic.partitions=1000

// Maximum number of dynamic partitions that may be created on each node executing the MR job; set this according to the actual data
hive.exec.max.dynamic.partitions.pernode=100

// Maximum number of HDFS files the whole MR job may create
hive.exec.max.created.files=100000

// Whether to throw an exception when an empty partition is generated; usually does not need to be set
hive.error.on.empty.partition=false
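A dynamic-partition insert then looks like this (a sketch; `target_log` and `ori_log`, both carrying a `dt` partition column, are hypothetical):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column dt is not given a fixed value; Hive derives it
-- from the last column of the SELECT list for each row.
INSERT INTO TABLE target_log PARTITION (dt)
SELECT user_id, event, dt FROM ori_log;
```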

10. Adjusting the number of map tasks. Since one Hive map task corresponds to one block-sized (BlockSize) split of a file, we can increase or decrease the number of map tasks by changing the maximum split size.

// Merge small files
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

// Maximum size of each split, in bytes
mapreduce.input.fileinputformat.split.maxsize=100;

11. Adjusting the number of reduce tasks

// Data volume processed by each reducer (default 256 MB)
hive.exec.reducers.bytes.per.reducer=256000000

// Maximum number of reducers per job (default 1009)
hive.exec.reducers.max=1009
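From these two settings Hive estimates the reducer count as N = min(hive.exec.reducers.max, ceil(total input bytes / hive.exec.reducers.bytes.per.reducer)); you can also pin it explicitly:

```sql
-- Override the estimate with a fixed reducer count for subsequent jobs
SET mapreduce.job.reduces = 15;
```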

12. Parallel execution

// Enable parallel execution of independent stages
hive.exec.parallel=true;

// Maximum degree of parallelism allowed for a single SQL statement (default 8)
hive.exec.parallel.thread.number=16;


13. JVM reuse. JVM reuse lets a JVM instance be reused up to N times within the same job, avoiding the cost of starting a fresh JVM for every task.

<property>
 <name>mapreduce.job.jvm.numtasks</name>
 <value>10</value>
 <description>How many tasks to run per jvm. If set to -1, there is no limit.
 </description>
</property>

The drawback of this feature is that JVM reuse keeps the task slots it has used occupied for reuse, releasing them only when the job finishes. In an "unbalanced" job where a few reduce tasks take much longer than the other reduce tasks, the slots they reserve sit idle yet cannot be used by other jobs, and are not released until all tasks complete.


14. Speculative execution

Hadoop's speculative execution mechanism detects "straggler" tasks according to certain rules and launches a backup task for each one,
letting the backup process the same data in parallel with the original; the result of whichever task finishes successfully first is taken as the final result.

<property>
 <name>mapreduce.map.speculative</name>
 <value>true</value>
 <description>If true, then multiple instances of some map tasks may be executed in parallel.</description>
</property>

<property>
 <name>mapreduce.reduce.speculative</name>
 <value>true</value>
 <description>If true, then multiple instances of some reduce tasks may be executed in parallel.</description>
</property>

 


Origin blog.csdn.net/qq_32323239/article/details/114254655