35. Optimization of Hive

This article mainly discusses Hive optimization. Follow the column "Broken Cocoon and Become a Butterfly - Big Data" for more related content~


Table of Contents

1. Hive data storage format

2. Hive tuning

2.1 Fetch

2.2 Local mode query

2.3 Join between tables

2.4 Map-side aggregation operation

2.5 Deduplication statistics

2.6 Cartesian product

2.7 Query optimization

2.8 Enable dynamic partition

2.9 Set a reasonable number of Map and Reduce

2.10 JVM reuse

2.11 Strict mode

2.12 Parallel execution

2.13 Speculative execution

2.14 Explain statement


 

1. Hive data storage format

Before diving into optimization, let's look at the data storage formats Hive supports. There are four main formats: TEXTFILE, SEQUENCEFILE, ORC, and PARQUET. TEXTFILE and SEQUENCEFILE are row-based storage formats, while ORC and PARQUET are column-based. When querying an entire row that meets the conditions, columnar storage has to visit each column's data to assemble the corresponding values, whereas row storage only needs to locate one value and the remaining values sit in adjacent places, so row storage is faster for this kind of query. When a query needs only a few fields, however, columnar storage can greatly reduce the amount of data read; and since all values in a column share the same data type, columnar storage also allows better, type-specific compression algorithms. For an introduction to specific compression methods, please refer to our previous "16. Data Compression in Hadoop".
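The storage format is chosen per table with the STORED AS clause. A minimal sketch (the table and column names here are made up for illustration):

-- Row-based storage: plain delimited text
create table log_text(track_time string, url string, ip string)
row format delimited fields terminated by '\t'
stored as textfile;

-- Columnar storage: ORC, which applies columnar compression internally
create table log_orc(track_time string, url string, ip string)
stored as orc;

Next, let's look at today's topic: Hive tuning.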

2. Hive tuning

2.1 Fetch

In Hive, some queries do not need to go through the underlying MapReduce at all, for example querying all the data of a table: select * from table_name;. This case is called Fetch. It works because hive.fetch.task.conversion in the hive-default.xml.template file defaults to the value more.

When the value is set to none, all query operations will go to the underlying MapReduce.
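The valid values for this setting are none, minimal, and more. A quick way to experiment with it in a session (a sketch, not a recommendation to change the default):

-- "more": simple SELECT/FILTER/LIMIT queries are fetched directly, skipping MapReduce
set hive.fetch.task.conversion=more;
-- "none": disable fetch conversion; every query runs through MapReduce
set hive.fetch.task.conversion=none;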

2.2 Local mode query

When the amount of input data for a Hive job is very small, all tasks can be processed on a single machine in local mode, which can significantly shorten execution time for small data sets. Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate; the default is false.

-- Enable local-mode MapReduce
set hive.exec.mode.local.auto=true;
-- Maximum input size for local MapReduce; local mode is used when the input is below this value. Default 134217728 (128M)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local MapReduce; local mode is used when the file count is below this value. Default 4
set hive.exec.mode.local.auto.input.files.max=10;

2.3 Join between tables

When a small table is joined with a large table, MapJoin is generally chosen: the small table is loaded into memory and the join is completed on the map side, avoiding a ReduceJoin and, with it, the data skew a reduce-side join can cause.

-- Automatically convert to MapJoin where possible; default true
set hive.auto.convert.join = true;
-- Threshold between large and small tables; by default, tables under 25M count as small
set hive.mapjoin.smalltable.filesize=25000000;

When a large table is joined with another large table, timeouts sometimes occur because some keys carry too much data: all rows with the same key are sent to the same reducer, which can run out of memory. In many cases the data for these keys is abnormal, and we should filter it out in the SQL statement, for example rows where the join key is null. Sometimes, though, the rows with a null key are not abnormal and must be included in the join result. In that case we can assign a random value to the null key field so the rows are distributed randomly and evenly across different reducers.
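A minimal sketch of both techniques; the table and column names (nulltable, bigtable, id) are hypothetical, and id is assumed to be a string:

-- 1) Filter out abnormal null keys before the join
select a.* from (select * from nulltable where id is not null) a
join bigtable b on a.id = b.id;

-- 2) Keep null-key rows, but salt them with a random value so they spread
--    evenly across reducers (the salted keys match nothing in bigtable)
select n.* from nulltable n
left join bigtable b
on nvl(n.id, concat('hive_', rand())) = b.id;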

For more on join optimization, please refer to our previous "14. Join Operation in MapReduce" and "18. Hadoop Optimization". After all, optimizing Hive is largely optimizing Hadoop, because Hive relies on MapReduce underneath.

2.4 Map-side aggregation operation

By default, rows with the same key produced in the Map phase are all sent to one reducer, so a key with too much data causes data skew. In fact, not all aggregation operations need to be completed on the Reduce side: many can be partially aggregated on the Map side first, with the final result assembled on the Reduce side. Enable map-side aggregation with the following settings:

-- Whether to aggregate on the Map side; default true
set hive.map.aggr = true;
-- Number of entries processed per map-side aggregation
set hive.groupby.mapaggr.checkinterval = 100000;
-- Load-balance when there is data skew; default false
set hive.groupby.skewindata = true;

When hive.groupby.skewindata is set to true, the generated query plan contains two MapReduce jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and outputs its result, so rows with the same Group By key may land on different reducers, which balances the load. The second job then distributes the pre-aggregated results by Group By key (guaranteeing that identical keys reach the same reducer) and completes the final aggregation.

2.5 Deduplication statistics

When the amount of data is large, COUNT DISTINCT is a full aggregation: even if the number of reduce tasks is set with set mapred.reduce.tasks=100;, Hive will still start only one reducer for it. That single reducer then has to process too much data, making the whole job hard to complete. COUNT DISTINCT is therefore generally replaced by GROUP BY followed by COUNT.
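A sketch of the rewrite (bigtable and id are hypothetical names):

-- Original: a single reducer performs all the deduplication
select count(distinct id) from bigtable;

-- Rewrite: GROUP BY spreads the deduplication across reducers,
-- then an outer COUNT totals the distinct keys (runs as two jobs)
select count(id) from (select id from bigtable group by id) t;

The rewritten form costs an extra job, so it only pays off when the data volume is large.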

2.6 Cartesian product

Try to avoid joins with a missing or invalid ON condition, because Hive can only use one reducer to complete a Cartesian product.
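For example (a and b are hypothetical tables):

-- Cartesian product: no join key, so a single reducer handles every row pair
select * from a join b;

-- Prefer an explicit, valid join key
select * from a join b on a.id = b.id;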

2.7 Query optimization

When querying data, select only the columns you need and avoid SELECT *. When using outer joins, put a table's filter conditions in the join (or filter in a subquery first) rather than only in the WHERE clause; otherwise the full tables are joined first and filtered afterwards.
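A sketch with hypothetical tables bigtable and ori:

-- Joins the full tables first, then filters (and the WHERE also drops unmatched rows)
select b.id from bigtable b
left join ori o on b.id = o.id
where o.id <= 10;

-- Filters the secondary table before joining, avoiding a full-table association
select b.id from bigtable b
left join (select id from ori where id <= 10) o on b.id = o.id;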

2.8 Enable dynamic partition

Hive provides dynamic partitioning: when inserting data into a partitioned table, it automatically routes each row into the corresponding partition according to the value of the partition field. Enable it as follows.

-- Enable dynamic partitioning; default true
set hive.exec.dynamic.partition=true;
-- Dynamic partition mode; default strict, meaning at least one partition must be specified statically.
-- nonstrict allows all partition fields to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
-- Maximum number of dynamic partitions that can be created across all MapReduce nodes; default 1000
set hive.exec.max.dynamic.partitions=1000;
-- Maximum number of dynamic partitions per MapReduce node; set this according to the actual data.
-- E.g. if the source holds a year of data (365 distinct day values), it must exceed 365;
-- the default of 100 would cause an error.
set hive.exec.max.dynamic.partitions.pernode=100;
-- Maximum number of HDFS files the whole MapReduce job may create; default 100000
set hive.exec.max.created.files=100000;
-- Whether to throw an exception when an empty partition is generated; usually left alone. Default false
set hive.error.on.empty.partition=false;
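With those settings in place, an insert can leave the partition value to be derived from the data. A minimal sketch with hypothetical tables, where target is partitioned by day:

-- The partition column (day) gets no value in PARTITION and is listed last in the SELECT
insert overwrite table target partition (day)
select id, name, day from ori;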

2.9 Set a reasonable number of Map and Reduce

Normally a job generates one or more map tasks from its input directory. The number of map tasks is mainly determined by the total number of input files, the input file sizes, and the block size configured on the cluster.

More maps are not always better. If a task has many small files (far smaller than the 128M block size), each small file is treated as a block and handled by its own map task, and the time to start and initialize a map task is far longer than its actual processing time, wasting a lot of resources. Moreover, the number of maps that can run at the same time is limited.

At the same time, making each map handle a block close to 128M does not solve everything. For example, a 127M file would normally be processed by a single map, but if that file has only one or two small fields yet millions of records, and the map's processing logic is complicated, doing it with one map task will certainly be very time-consuming.

Therefore, we must flexibly set a reasonable number of Map and Reduce according to different scenarios. The following situations are listed here.

1. When the input files are large, the task logic is complex, and map execution is very slow, consider increasing the number of maps to reduce the amount of data each map processes and thereby improve task efficiency. Per the split-size formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))) = blockSize = 128M, setting maxSize below the block size increases the number of maps, as sketched below.
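A session-level sketch (the 100M value is just an example):

-- Lower the maximum split size below the 128M block size so more splits (and maps) are created
set mapreduce.input.fileinputformat.split.maxsize=104857600;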

2. Merge small files before map execution to reduce the number of maps. CombineHiveInputFormat (the system default) can merge small files; HiveInputFormat cannot. Use the following setting to merge small files before map execution:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

3. The following command merges small files at the end of the MapReduce task:

-- (1) Merge small files at the end of a map-only task; default true
SET hive.merge.mapfiles = true;

-- (2) Merge small files at the end of a map-reduce task; default false
SET hive.merge.mapredfiles = true;

-- (3) Size of the merged files; default 256M
SET hive.merge.size.per.task = 268435456;

-- (4) When the average size of the output files is below this value, start a separate map-reduce job to merge them
SET hive.merge.smallfiles.avgsize = 16777216;

4. More reducers are not always better either: starting and initializing too many reducers also consumes time and resources. In addition, there will be as many output files as there are reducers; if many small files are produced and then used as the input of the next task, the small-file problem simply reappears. Adjust the number of reducers with the following settings:

-- (1) Amount of data processed per reducer; default 256MB
set hive.exec.reducers.bytes.per.reducer=256000000;

-- (2) Maximum number of reducers per task; default 1009
set hive.exec.reducers.max=1009;

-- (3) Number of reducers per job; can also be set in Hadoop's mapred-default.xml
set mapreduce.job.reduces = 15;

2.10 JVM reuse

JVM reuse has a very large impact on Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of which finish quickly. Hadoop's default configuration usually forks a separate JVM to execute each map or reduce task; when a job contains hundreds of tasks, JVM startup becomes considerable overhead. JVM reuse lets a JVM instance be reused N times within the same job. N is configured in Hadoop's mapred-site.xml file, usually between 10 and 20, and should be set according to the specific business scenario. The drawback is that JVM reuse holds on to the task slots it used for the whole job: if a few reduce tasks take much longer than the rest, their slots sit idle yet cannot be used by other jobs until every task of the job has finished.

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

2.11 Strict mode

To enable strict mode in Hive, set hive.mapred.mode to strict. Strict mode prohibits three types of queries (see the sketch after this list).

1. For partitioned tables, a query is not allowed to execute unless the WHERE clause contains a partition-field filter to limit the scope.

2. Queries that use ORDER BY must also use LIMIT. Because ORDER BY sends all result rows to a single reducer to perform the global sort, forcing the user to add LIMIT prevents that reducer from running for a very long time.

3. Queries producing Cartesian products are restricted. Users of relational databases may expect to put join conditions in the WHERE clause instead of the ON clause, because a relational optimizer can efficiently convert such WHERE conditions into join conditions. Hive does not perform this optimization, so if the tables are large enough, the query becomes uncontrollable.
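A quick illustration (logs and its dt partition column are hypothetical):

set hive.mapred.mode=strict;

-- Rejected in strict mode: no partition filter on a partitioned table
-- select * from logs;

-- Allowed: the WHERE clause limits the partitions scanned
select * from logs where dt = '2020-12-01';

-- Rejected in strict mode: ORDER BY without LIMIT
-- select * from logs where dt = '2020-12-01' order by id;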

2.12 Parallel execution

Hive converts a query into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. However, a particular job may contain many stages that are not completely dependent on each other; some of them can run in parallel, which may shorten the execution time of the whole job. Parallel execution is turned on by setting hive.exec.parallel to true.

-- Enable parallel stage execution
set hive.exec.parallel=true;

-- Maximum degree of parallelism for a single SQL statement; default 8
set hive.exec.parallel.thread.number=16;

2.13 Speculative execution

In a distributed cluster, program bugs (including bugs in Hadoop itself), unbalanced load, or uneven resource distribution can make tasks of the same job run at very different speeds, with some tasks noticeably slower than the rest. For example, one task of a job may be only 50% complete while all the others have finished, dragging down the job's overall progress. To avoid this, Hadoop uses a speculative execution mechanism: it identifies slow-running tasks according to certain rules and launches a backup task for each, letting the backup process the same data in parallel with the original, and finally takes the result of whichever task finishes first.

Speculative execution is enabled by setting mapreduce.map.speculative and mapreduce.reduce.speculative to true in Hadoop's mapred-site.xml file; both default to true. Hive itself also provides hive.mapred.reduce.tasks.speculative.execution to control reduce-side speculative execution, which is likewise true by default.
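These can also be toggled per session from Hive; for example, long-running jobs over large inputs are sometimes run with speculation off to avoid duplicate work (a sketch):

-- Turn speculative execution off for this session only
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
set hive.mapred.reduce.tasks.speculative.execution=false;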

2.14 Explain statement

In Hive, you can use the Explain statement to view the execution plan of the statement. The syntax is as follows:

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query

For example:

1. View the execution plan of the following statement:

explain select * from people;

2. View the detailed execution plan:

explain extended select * from people;

 

Okay, that's all for Hive tuning; this article is coming to an end. What problems did you encounter along the way? Feel free to leave a message and let me know~
