Hive series 9: Enterprise-level tuning

1. Execution plan (Explain)
explain select * from emp;
# Case where no MR job is generated
Explain
STAGE DEPENDENCIES:
 Stage-0 is a root stage
STAGE PLANS:
 Stage: Stage-0
 Fetch Operator
...
# Case where an MR job is generated
Explain
STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-0 depends on stages: Stage-1
STAGE PLANS:
 Stage: Stage-1
 Map Reduce
 Map Operator Tree:
 Reduce Operator Tree:
...
2. Fetch

Fetch means that for certain queries Hive does not need to run MapReduce at all. For example: SELECT * FROM employees; In this case, Hive can simply read the files in the storage directory corresponding to the employees table and output the query results to the console.

In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older versions of Hive default to minimal). With this property set to more, full-table selects, column selection, and LIMIT queries do not run MapReduce.

  • Set hive.fetch.task.conversion to none and then execute the query statements; the MapReduce program will be executed
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
  • Set hive.fetch.task.conversion to more and then execute the query statements; the following queries will not execute the MapReduce program
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
3. Local mode

Most Hadoop jobs require the full scalability that Hadoop provides in order to handle large data sets. Sometimes, however, the input to Hive is very small; in such cases, launching the execution tasks for the query may take much longer than the actual job execution. For most of these cases, Hive can handle all tasks on a single machine in local mode, which can significantly reduce execution time for small data sets.

Users can set hive.exec.mode.local.auto to true to let Hive automatically start this optimization when appropriate; the default is false.

# Enable local MR
set hive.exec.mode.local.auto=true; 
# Maximum input data size for local MR; local MR is used when the input size is below this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
# Maximum number of input files for local MR; local MR is used when the number of input files is below this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
4. Table optimization
4.1 Join of Small Table and Large Table (MapJOIN)

Place the table with relatively scattered keys and a small data volume on the left side of the join, and use a map join to load the small dimension table into memory in advance so that the join is completed on the map side.
Actual testing has found that newer versions of Hive have already optimized small-table JOIN large-table and large-table JOIN small-table; there is no longer any difference between putting the small table on the left or on the right.
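As a sketch of how this is typically enabled (the values shown are the usual defaults, given here only for illustration), the map join conversion can be switched on and the small-table threshold adjusted:

# Enable automatic conversion to map join (default true in recent Hive versions)
set hive.auto.convert.join = true;
# Size threshold in bytes below which a table is treated as the small table (default 25000000, about 25 MB)
set hive.mapjoin.smalltable.filesize = 25000000;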

4.2 Large Table Join Large Table
  1. Null key filtering

Sometimes a join times out because some keys have too much data associated with them; since rows with the same key are sent to the same reducer, that reducer runs out of memory. In such cases we should carefully analyze these abnormal keys. Often the data associated with them is invalid and needs to be filtered out in the SQL statement, for example when the key field is NULL:

insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null) n
left join bigtable o on n.id = o.id;
  2. Null key conversion

Sometimes a key is NULL but has a lot of associated rows, and those rows are not abnormal data and must be included in the join result. In that case we can assign a random value to the NULL key fields so that the data is distributed randomly and evenly across different reducers:

insert overwrite table jointable
select n.* from nullidtable n
full join bigtable o on nvl(n.id, rand()) = o.id;
  3. SMB (Sort Merge Bucket join)

A join between two large tables can be optimized into a join between two bucketed tables. The number of buckets should not exceed the number of available CPU cores.

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
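A minimal sketch of how the tables would be prepared (table and column names here are hypothetical): both tables are bucketed and sorted on the join key with the same number of buckets, so that with the settings above the join can run as a sort merge bucket join.

-- hypothetical bucketed, sorted tables sharing the join key and bucket count
create table big_a(id bigint, val string)
clustered by (id) sorted by (id) into 8 buckets;
create table big_b(id bigint, val string)
clustered by (id) sorted by (id) into 8 buckets;
-- with the settings above, this join can be executed bucket-by-bucket on the map side
select a.id, b.val from big_a a join big_b b on a.id = b.id;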
4.3 Group By

By default, rows with the same key produced in the Map stage are sent to the same reducer, so when one key has too much data the job becomes skewed. Not all aggregations need to be completed on the Reduce side: many aggregations can be partially aggregated on the Map side first, with the final result computed on the Reduce side.

  1. Map-side aggregation parameter settings

Whether to aggregate on the Map side, the default is True

set hive.map.aggr = true

The number of entries aggregated on the Map side

set hive.groupby.mapaggr.checkinterval = 100000

Load balance when there is data skew (default is false). When set to true, the query plan generates two MR jobs: the first distributes map output randomly across reducers and performs partial aggregation, and the second distributes the partially aggregated data by the GROUP BY key to complete the final aggregation.

set hive.groupby.skewindata = true
4.4 Count(Distinct) deduplication statistics

This does not matter when the data volume is small, but when it is large, a COUNT DISTINCT operation is completed by a single Reduce task, and the amount of data that this one reducer has to process can make it hard to finish the whole job. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by COUNT, though you need to watch out for the data skew that the GROUP BY may cause.

select count(id) from (select id from bigtable group by id) a;
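For comparison, the direct form that this rewrite replaces (forcing all distinct values through a single reducer) would be:

select count(distinct id) from bigtable;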
4.5 Cartesian product

Try to avoid Cartesian products. When a join has no ON condition, or an invalid ON condition, Hive can only use a single reducer to compute the Cartesian product.
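As an illustration (reusing the hypothetical table name from above), a join written without an ON condition degenerates into exactly this kind of single-reducer Cartesian product:

-- avoid: no join condition, so every row of a is paired with every row of b
select a.id, b.id from bigtable a join bigtable b;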

4.6 Row and column filtering

Column processing: in SELECT, take only the columns you need; if there are partitions, use partition filtering as much as possible and avoid SELECT *.
Row processing: if a filter condition is written in the WHERE clause after the join, the full tables are joined first and only then filtered; instead, filter in a subquery first and then join.

Test: join the two tables first, then filter with the WHERE condition

hive (default)> select o.id from bigtable b join bigtable o on o.id = b.id where o.id <= 10;
Time taken: 34.406 seconds, Fetched: 100 row(s)

Test: filter through a subquery first, then join the tables

select b.id from bigtable b
join (select id from bigtable where id <= 10) o on b.id = o.id;
Time taken: 30.058 seconds, Fetched: 100 row(s)
4.7 Partitions
4.8 Buckets

5. Reasonably set the number of Map and Reduce tasks

  • Increase the number of Map tasks for complex files
  • Merge small files with CombineHiveInputFormat
  • Use an appropriate number of Reduce tasks when handling large amounts of data (see the parameter sketch below)
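A sketch of the parameters usually involved; the values shown are illustrative and the stated defaults are those of newer Hive versions, not recommendations:

# Lower the maximum split size (bytes) to produce more map tasks for complex files
set mapreduce.input.fileinputformat.split.maxsize=67108864;
# Merge small files into combined splits on the map side (this is Hive's default input format)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
# Amount of data handled by each reducer (default 256000000 bytes)
set hive.exec.reducers.bytes.per.reducer=256000000;
# Maximum number of reducers per job (default 1009)
set hive.exec.reducers.max=1009;
# Or set the number of reduce tasks explicitly
set mapreduce.job.reduces=15;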

6. Parallel execution

By default, Hive executes only one stage at a time. A given job, however, may contain many stages, and these stages are not necessarily fully dependent on each other, which means some of them can be executed in parallel, possibly shortening the execution time of the entire job.

# Enable parallel execution of stages
set hive.exec.parallel=true; 
# Maximum degree of parallelism allowed for a single SQL statement (default 8)
set hive.exec.parallel.thread.number=16;

7. Strict Mode

Hive can prevent some dangerous operations through the following settings:

  1. Partition filtering not used on partitioned tables: when hive.strict.checks.no.partition.filter is set to true, queries against partitioned tables are not allowed to execute unless the WHERE clause contains a partition field filter to limit the range
  2. ORDER BY without LIMIT: when hive.strict.checks.orderby.no.limit is set to true, queries that use an ORDER BY statement must also use a LIMIT statement
  3. Cartesian product: when hive.strict.checks.cartesian.product is set to true, Cartesian product queries are restricted
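For example, the three checks listed above can be switched on explicitly (their shipped defaults vary by Hive version):

set hive.strict.checks.no.partition.filter=true;
set hive.strict.checks.orderby.no.limit=true;
set hive.strict.checks.cartesian.product=true;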

Origin: blog.csdn.net/SJshenjian/article/details/131873458