Hive tuning method

1 Execution plan (Explain)

1.1 Basic syntax

Hive provides the EXPLAIN command to display the execution plan of a query. The execution plan is very helpful for understanding how Hive works underneath, for tuning, and for troubleshooting data skew.

The usage syntax is as follows:

EXPLAIN [EXTENDED|CBO|AST|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION|ANALYZE] query

Parameter Description:

EXTENDED: Adds additional information about the plan, usually physical details such as file names. This extra information is rarely useful.

CBO: Outputs the plan generated by the Calcite optimizer. Supported since Hive 4.0.0.

AST: Outputs the abstract syntax tree of the query. AST output was removed in Hive 2.1.0 because of a bug (dumping a large AST could cause an OOM error); it returns in Hive 4.0.0.

DEPENDENCY: Produces additional information about the inputs of the plan, showing the various attributes of each input.

AUTHORIZATION: Displays all the entities that must be authorized for the query to run (if any), along with any authorization failures.

LOCKS: Shows which locks the system will acquire to run the specified query. Supported since Hive 3.2.0.

VECTORIZATION: Adds detail to the EXPLAIN output showing why the Map and Reduce stages were or were not vectorized. Supported since Hive 2.3.0.

ANALYZE: Annotates the plan with actual row counts. Supported since Hive 2.2.0.
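A few usage sketches of the variants described above (which ones are accepted depends on your Hive version):

explain extended select * from emp;
explain dependency select * from emp;
explain authorization select * from emp;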

1.2 Practical operation

-- execute
explain select * from emp;
-- the following result is obtained
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
            ListSink
The EXPLAIN of a query that launches a MapReduce job (here an aggregation such as explain select deptno, avg(sal) avg_sal from emp group by deptno;) produces the following plan:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: emp
            Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sal (type: double), deptno (type: int)
              outputColumnNames: sal, deptno
              Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(sal), count(sal)
                keys: deptno (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: double), _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), count(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), (_col1 / _col2) (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

2 Fetch task conversion

Fetch task conversion means that, for certain queries, Hive does not need to run MapReduce at all. For example, for SELECT * FROM employees; Hive can simply read the files in the table's storage directory and print the results to the console.
In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older versions of Hive default to minimal). With this property set to more, full-table queries, column projections, and LIMIT queries do not run MapReduce.

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>
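A quick sketch of the effect (assuming the emp table used in section 1.2):

-- with the default "more", none of these launch a MapReduce job
set hive.fetch.task.conversion=more;
select * from emp;
select ename from emp;
select ename from emp limit 3;

-- with "none", even a plain SELECT * goes through MapReduce
set hive.fetch.task.conversion=none;
select * from emp;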

3 Local mode

Most Hadoop jobs need the full scalability Hadoop provides in order to handle large data sets. Sometimes, however, the input to Hive is very small, and the overhead of launching the query can take much longer than the actual execution. For most such cases, Hive can handle all the tasks on a single machine in local mode, which significantly reduces execution time for small data sets.

Users can let Hive apply this optimization automatically when appropriate by setting hive.exec.mode.local.auto to true.

set hive.exec.mode.local.auto=true;  -- enable local MR
-- maximum input data size for local MR; local mode is used when the input is
-- smaller than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- maximum number of input files for local MR; local mode is used when there are
-- fewer input files than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
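For example, with the settings above, a query whose input totals 40 MB spread over 8 files stays under both thresholds (50,000,000 bytes and 10 files), so Hive runs it locally on a single machine.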

4 Table optimization

4.1 Join of Small Table and Large Table (MapJoin)

Place the table with relatively dispersed keys and a small data volume on the left side of the join; this effectively reduces the chance of out-of-memory errors. Going further, a map join can load small dimension tables (fewer than about 1,000 records) into memory and complete the join on the map side.

Practical tests show that newer versions of Hive optimize both small-table-JOIN-large-table and large-table-JOIN-small-table, so there is no longer a noticeable difference between putting the small table on the left or on the right.

-- test the efficiency of large table JOIN small table vs. small table JOIN large table
-- MapJoin parameter settings
	-- choose MapJoin automatically (default true)
	set hive.auto.convert.join = true;
	-- threshold between small and large tables (by default, below 25 MB counts as small)
	set hive.mapjoin.smalltable.filesize = 25000000;
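A minimal sketch of such a test (smalltable is an assumed small dimension table; bigtable and jointable are the tables used in the following sections): with the settings above, Hive converts this join into a MapJoin automatically:

-- smalltable is assumed to fit under hive.mapjoin.smalltable.filesize
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b on s.id = b.id;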

(Figure: MapJoin working mechanism)

4.2 Large table join Large table

4.2.1 Empty key filtering

Sometimes a join times out because certain keys have too much data, and rows with the same key are all sent to the same reducer, exhausting its memory. We should analyze these abnormal keys carefully: in many cases they correspond to abnormal data, which we should filter out in the SQL statement. For example, when the field used as the key is NULL, proceed as follows:

  1. Configure the history server
<!-- configure mapred-site.xml -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>
-- start the history server
sbin/mr-jobhistory-daemon.sh start historyserver

View the job history at http://hadoop102:19888/jobhistory

  2. Create the original data table, the empty-id table, and the result table (the empty-id table is created here; a sketch of the other two follows)
-- create the empty-id table
create table nullidtable(
    id bigint, 
    t bigint, 
    uid string, 
    keyword string, 
    url_rank int, 
    click_num int, 
    click_url string
) 
row format delimited fields terminated by '\t';
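The original data table (bigtable) and the result table (jointable) are not created in this excerpt; a plausible sketch, assuming they share the schema of nullidtable (section 4.4 creates bigtable with an equivalent schema):

create table bigtable(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

create table jointable(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';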
  3. Load the original data and the empty-id data into the corresponding tables
hive (default)> load data local inpath '/opt/module/hive/datas/nullid' into table nullidtable;
  4. Test without filtering empty ids
hive (default)> insert overwrite table jointable select n.* from nullidtable n
left join bigtable o on n.id = o.id;
  5. Test with empty ids filtered
hive (default)> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n
left join bigtable o on n.id = o.id;

4.2.2 Empty key conversion

Sometimes a key is empty but has a large amount of data attached to it, and that data is not abnormal: it must be included in the join result. In this case we can assign a random value to the empty key field in the left table so that the data is distributed randomly and evenly across the reducers.

When the null values are not randomly distributed, the operation looks like this:

  1. Set the number of reduces to 5
set mapreduce.job.reduces = 5;
  2. JOIN the two tables
insert overwrite table jointable
select n.* from nullidtable n left join bigtable b on n.id = b.id;

You can observe the data skew: some reducers consume far more resources than the others.

When the null values are randomly distributed, the operation looks like this:

  1. Set the number of reduces to 5
set mapreduce.job.reduces = 5;
  2. JOIN the two tables
insert overwrite table jointable
select n.* from nullidtable n full join bigtable o on 
nvl(n.id,rand()) = o.id;
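A common alternative formulation of the same idea, sketched here for reference (not part of the original test): tag null keys with a random string inside a CASE expression; the tagged keys spread across the reducers and match nothing on the other side:

insert overwrite table jointable
select n.* from nullidtable n
left join bigtable o
on case when n.id is null then concat('hive', rand()) else n.id end = o.id;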

4.2.3 SMB (Sort Merge Bucket Join)

(1) Create a second large table

create table bigtable2(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable2;

Test a direct JOIN of the two large tables:

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
join bigtable2 b
on b.id = s.id;

(2) Create bucketed table 1; the number of buckets should not exceed the number of available CPU cores

create table bigtable_buck1(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
clustered by(id) 
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

insert into bigtable_buck1 select * from bigtable; 

(3) Create bucketed table 2; the number of buckets should not exceed the number of available CPU cores

create table bigtable_buck2(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
clustered by(id)
sorted by(id) 
into 6 buckets
row format delimited fields terminated by '\t';

insert into bigtable_buck2 select * from bigtable; 

(4) Set the parameters

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

(5) Test

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;

4.3 Group By

By default, rows with the same key produced in the map stage are distributed to a single reducer, and data skew occurs when any one key has too much data.

Not every aggregation needs to be completed on the reduce side. Many aggregations can be partially performed on the map side first, with the final result assembled on the reduce side.

  1. Enable the map-side aggregation parameters
-- whether to aggregate on the map side (default true)
set hive.map.aggr = true;
-- number of entries processed per map-side aggregation
set hive.groupby.mapaggr.checkinterval = 100000;
-- load-balance when there is data skew (default false)
set hive.groupby.skewindata = true;

When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result. Because rows with the same Group By key may land on different reducers, the load is balanced. The second job then distributes the pre-aggregated results to the reducers by Group By key (which guarantees that identical keys reach the same reducer) and completes the final aggregation.
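A minimal sketch (reusing the emp table from section 1.2): with the settings above, an aggregation like this compiles into two MR jobs instead of one:

select deptno, count(*) from emp group by deptno;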

4.4 Count(Distinct) deduplication statistics

This does not matter when the data volume is small, but when it is large, COUNT DISTINCT must be completed by a single Reduce task, and the amount of data that one reducer has to process can make the whole job hard to finish. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT, though you then need to watch out for the data skew that GROUP BY itself can cause.

Case practice:

-- create a large table
hive (default)> create table bigtable(id bigint, time bigint, uid string, keyword
string, url_rank int, click_num int, click_url string) row format delimited
fields terminated by '\t';
-- load the data
hive (default)> load data local inpath '/opt/module/datas/bigtable' into table bigtable;
-- set the number of reduces to 5
set mapreduce.job.reduces = 5;
-- deduplicate ids with count(distinct)
hive (default)> select count(distinct id) from bigtable;
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.12 sec   HDFS Read: 120741990 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 120 msec
OK
_c0
100001
Time taken: 23.607 seconds, Fetched: 1 row(s)
-- deduplicate ids with GROUP BY
hive (default)> select count(id) from (select id from bigtable group by id) a;
Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 17.53 sec   HDFS Read: 120752703 HDFS Write: 580 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.29 sec   HDFS Read: 9409 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 820 msec
OK
_c0
100001
Time taken: 50.795 seconds, Fetched: 1 row(s)

Although this takes one extra job to complete, it is definitely worth it when the data volume is large.

4.5 Cartesian product

Avoid Cartesian products whenever possible: do not omit the ON condition of a join, and do not use an invalid ON condition. Hive can only use one reducer to complete a Cartesian product.
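For illustration, a sketch using the tables from section 4.2.3 (it only runs if the strict Cartesian-product check of section 4.9 is off): a join with no ON condition degenerates into a Cartesian product that funnels through a single reducer:

-- avoid: every row of bigtable pairs with every row of bigtable2
select s.id, b.id
from bigtable s
join bigtable2 b;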

4.6 Row and column filtering

Column processing: in the SELECT, take only the columns you need. If the table is partitioned, use partition filtering whenever possible, and avoid SELECT *.

Row processing: with partition pruning, when using an outer join, if the filter condition on the secondary table is written after WHERE, the whole tables are joined first and filtered afterwards. For example:

Case practice:

-- test: join the two tables first, then filter with a where condition
hive (default)> select o.id from bigtable b
join bigtable o on o.id = b.id
where o.id <= 10;
Time taken: 34.406 seconds, Fetched: 100 row(s)
-- filter in a subquery first, then join the tables
hive (default)> select b.id from bigtable b
join (select id from bigtable where id <= 10 ) o on b.id = o.id;
Time taken: 30.058 seconds, Fetched: 100 row(s)

4.7 Reasonably set the numbers of Maps and Reduces

1) Usually, a job generates one or more map tasks from the input directory.
The main determining factors are: the total number of input files, the input file sizes, and the file block size configured on the cluster.

2) Is having more maps always better?
No. If a task has many small files (much smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task, and starting and initializing a map task takes much longer than its processing logic. This wastes a great deal of resources. Moreover, the number of maps that can execute simultaneously is limited.

3) If every map processes a block close to 128 MB, can we sit back and relax?
Not necessarily. For example, a 127 MB file is normally handled by a single map, but if that file has only one or two small fields and tens of millions of records, and the map's processing logic is fairly complex, finishing it with one map task is still time-consuming.

For problems 2 and 3 above, the solutions are, respectively, to decrease and to increase the number of maps.

4.7.1 Increase the number of Maps for complex files

When the input files are large, the task logic is complex, and the map execution is very slow, you can consider increasing the number of maps to reduce the amount of data processed by each map, thereby improving the execution efficiency of the task.

The way to increase the number of maps: according to the formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))) = blockSize = 128M, lower the value of maxSize. Setting maxSize below blockSize increases the number of maps.

Case practice:

-- run the query
hive (default)> select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
-- set the maximum split size to 100 bytes
hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;
hive (default)> select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1

4.7.2 Merge small files

1) Merge small files before map execution to reduce the number of maps: CombineHiveInputFormat (the system default format) can merge small files; HiveInputFormat cannot.

set hive.input.format= org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

2) Settings for merging small files at the end of a Map-Reduce task:

-- merge small files at the end of a map-only task (default true)
SET hive.merge.mapfiles = true;
-- merge small files at the end of a map-reduce task (default false)
SET hive.merge.mapredfiles = true;
-- target size of merged files (default 256 MB)
SET hive.merge.size.per.task = 268435456;
-- when the average output file size is below this value, start an independent
-- map-reduce task to merge the files
SET hive.merge.smallfiles.avgsize = 16777216;

4.7.3 Reasonably set the number of Reduces

1) Method 1 for adjusting the number of reduces

-- amount of data processed by each reduce (default 256 MB)
hive.exec.reducers.bytes.per.reducer=256000000
-- maximum number of reduces per job (default 1009)
hive.exec.reducers.max=1009
-- formula for computing the number of reducers:
N = min(parameter 2, total input data size / parameter 1)
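For example, with the defaults above, a job whose total input is about 1 GB gets roughly N = min(1009, 1 GB / 256 MB) ≈ 4 reducers.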

2) Method 2 for adjusting the number of reduces

-- can also be changed in Hadoop's mapred-default.xml
-- set the number of reduces for each job
set mapreduce.job.reduces = 15;

3) More reduces are not always better

  1. Starting and initializing too many reduces also consumes time and resources;
  2. There will be as many output files as there are reduces. If many small files are produced and they become the input of the next task, the too-many-small-files problem reappears;

When setting the number of reduces, also weigh these two principles: use an appropriate number of reduces to handle large data volumes, and keep the amount of data processed by a single reduce task appropriate.

4.8 Parallel Execution

Hive transforms a query into one or more stages. Such stages can be MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages, and those stages are not necessarily fully dependent on one another; some can be executed in parallel, which may shorten the execution time of the whole job.

Setting the parameter hive.exec.parallel to true enables concurrent execution. On a shared cluster, note that the more stages a job runs in parallel, the higher the cluster utilization.

set hive.exec.parallel=true;              -- enable parallel task execution
set hive.exec.parallel.thread.number=16;  -- maximum parallelism allowed for one SQL statement (default 8)

Of course, this only helps when system resources are relatively idle; without free resources, the parallel stages have nothing to run on.

4.9 Strict Mode

Hive can prevent some dangerous operations through the following settings:

  1. Partitioned tables without partition filters

    When hive.strict.checks.no.partition.filter is set to true, queries against partitioned tables are not allowed to execute unless the WHERE clause contains a partition-column filter to limit the range. In other words, users may not scan all partitions. The reason for this restriction is that partitioned tables usually hold very large, fast-growing data sets; a query with no partition constraint could consume unacceptably large resources.

  2. ORDER BY without LIMIT

When hive.strict.checks.orderby.no.limit is set to true, queries that use ORDER BY must also use a LIMIT clause. Because ORDER BY distributes all result rows to a single reducer to perform the sort, forcing the user to add a LIMIT prevents that reducer from running for a very long time.

  3. Cartesian products

When hive.strict.checks.cartesian.product is set to true, Cartesian-product queries are restricted. Users who know relational databases well may expect to put the join condition in a WHERE clause instead of an ON clause, counting on the database's optimizer to convert the WHERE into an efficient ON. Unfortunately, Hive does not perform this optimization, so if the tables are large enough, such a query becomes uncontrollable.
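To turn these checks on in a session, a minimal sketch (the property names are as in Hive 2.x and later):

set hive.strict.checks.no.partition.filter=true;
set hive.strict.checks.orderby.no.limit=true;
set hive.strict.checks.cartesian.product=true;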
