Hive optimization (tuning summary)

1. View the execution plan

explain extended <hql>; shows the execution plan, including the HDFS paths of the data to be scanned.
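For example, a minimal sketch (using the user_read_log table from the skew example below):

explain extended
select userid, count(*) from user_read_log group by userid;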

A common case of data skew: the join key contains many null values or outliers.
Here you can replace the nulls/outliers with random values to spread the key out,
for example:
select a.userid, a.name
from user_info a
join (
  select case when userid is null then cast(rand(47) * 100000 as int)
              else userid
         end as userid
  from user_read_log
) b on a.userid = b.userid;

The rand function disperses the null keys across different values, so they no longer pile up on a single reducer; this resolves the data skew.

Note: if the outlier rows are not actually needed, it is best to filter them out in advance; this also greatly reduces the amount of computation.
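A minimal sketch of filtering the nulls out up front instead (same tables as above):

select a.userid, a.name
from user_info a
join (select userid from user_read_log where userid is not null) b
on a.userid = b.userid;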

2. Hive table optimization

Partition (different folders):

Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Default value: strict
Description: strict mode prevents all partition fields from being dynamic; at least one partition field must be given a static value. nonstrict allows every partition field to be dynamic.

Avoid creating a large number of partitions
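A minimal sketch of a dynamic-partition insert (table and column names are hypothetical; the partition column dt must come last in the select list):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table user_log_partitioned partition (dt)
select userid, action, dt from user_log_staging;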

Bucketing (different files):

set hive.enforce.bucketing=true; forces data inserted into a bucketed table to be bucketed accordingly. The default is false.
set hive.enforce.sorting=true; turns on forced sorting: when data is inserted into the table, it is sorted within each bucket. The default is false.
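A minimal sketch of creating and loading a bucketed, sorted table (names are hypothetical, user_info is reused from above):

create table user_bucketed (userid int, name string)
clustered by (userid) sorted by (userid) into 32 buckets;

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
insert overwrite table user_bucketed
select userid, name from user_info;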

3. Hive SQL optimization

group by data skew optimization:
hive.groupby.skewindata=true; (adds one more job)

1. join optimization

(1) Data skew

hive.optimize.skewjoin=true;
Set this to true if the join process is skewed.
set hive.skewjoin.key=100000;
If the number of records for a single join key exceeds this value, the skew optimization is applied.
Simply put, one job becomes two jobs to execute the HQL.

(2) mapjoin (map side executes join)

Startup method one (automatic judgment):
set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize defaults to 25MB;
a small table below 25MB automatically triggers a map join.
Startup method two (manual hint):
select /*+ mapjoin(A) */ f.a, f.b from A t join B f on (f.a = t.a);

A map join supports non-equality join conditions;
a reduce-side join does not support non-equality predicates in the ON clause.

(3) bucketjoin (data access can be accurate to the bucket level)

Conditions of use:
1. The two tables are bucketed in the same way (clustered on the join key).
2. The bucket counts of the two tables are equal or one is a multiple of the other.
Example:
create table order(cid int, price float) clustered by (cid) into 32 buckets;
create table customer(id int, first string) clustered by (id) into 32 buckets; (64 also works, being a multiple of 32)

select price from order t join customer s on t.cid=s.id;

(4) where condition optimization

Before optimization (a relational database would apply this rewrite automatically, so it needs no thought there):
select m.cid,u.id from order m join customer u on m.cid =u.id where m.dt='2013-12-12';

After optimization (the where condition is executed on the map side instead of the reduce side):
select m.cid,u.id from (select * from order where dt='2013-12-12') m join customer u on m.cid =u.id;

(5) group by optimization

hive.groupby.skewindata=true;
Set this to true if the group by process is skewed.
set hive.groupby.mapaggr.checkinterval=100000;
If the number of records for a single group key exceeds this value, the optimization is applied.

Here, too, one job becomes two jobs: the first job distributes the map output randomly for partial aggregation, and the second job performs the final aggregation by group key.

(6) count distinct optimization

Before optimization (there is only one reducer, and the burden of de-duplicating and then counting all falls on it):
select count(distinct id) from tablename;
After optimization (two jobs are started: one runs the subquery, which can use multiple reducers, and the other runs count(1)):
select count(1) from (select distinct id from tablename) tmp;

select count(1) from (select id from tablename group by id) tmp;

set mapred.reduce.tasks=3; (gives the subquery stage multiple reducers)

(7) Multiple count(distinct) optimization

Before optimization:
select a, sum(b), count(distinct c), count(distinct d) from test group by a;

After optimization:
select a, sum(b) as b, count(c) as c, count(d) as d
from (
  select a, 0 as b, c, null as d from test group by a, c
  union all
  select a, 0 as b, null as c, d from test group by a, d
  union all
  select a, b, null as c, null as d from test
) tmp
group by a;

4. Hive job optimization

1. Parallel execution

By default, Hive executes the jobs of a query sequentially. One HQL statement may be split into multiple jobs; jobs with no dependency on or influence over each other can be executed in parallel.
set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;
controls the maximum number of jobs that can run concurrently for a single SQL statement. The default is 8, so at most 8 jobs run at the same time.
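A minimal sketch of a query that benefits (reusing the order and customer tables from above): the two aggregations are independent jobs, so with parallel execution enabled they can run at the same time.

set hive.exec.parallel=true;
select * from (
  select cid, count(*) as cnt from order group by cid
  union all
  select id, count(*) as cnt from customer group by id
) t;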

2. Localized execution (executed on the node where the data is stored)

set hive.exec.mode.local.auto=true;

Localized execution must meet all of the following conditions:
(1) The input size of the job is smaller than the parameter
hive.exec.mode.local.auto.inputbytes.max (default 128MB)
(2) The number of map tasks of the job is smaller than the parameter
hive.exec.mode.local.auto.tasks.max (default 4); with too many maps there would not be enough local slots
(3) The number of reduce tasks of the job is 0 or 1
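A minimal snippet enabling local mode together with its thresholds (the values shown simply restate the defaults):

set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.tasks.max=4;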

3.Job merges input small files

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
This combines multiple splits into one; the size of the combined splits is bounded by mapred.max.split.size.

4.Job merges and outputs small files (to prepare for subsequent job optimization)

set hive.merge.smallfiles.avgsize=256000000; when the average size of the output files is below this value, an extra job is started to merge them

set hive.merge.size.per.task=64000000; the target size of each file after merging

5.JVM reuse

set mapred.job.reuse.jvm.num.tasks=20;

This sets how many tasks each JVM runs.

Note that JVM reuse makes the job hold its task slots until the whole job finishes.

6. Compress data (multiple jobs)

(1) Intermediate compression compresses the data passed between the multiple jobs of a Hive query. For intermediate compression, prefer a codec that saves CPU time.
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK; compress by block rather than by record
(2) Final output compression (prefer a codec with a good compression ratio, to reduce storage space):
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK; compress by block rather than by record

5. Hive Map optimization

1. Setting mapred.map.tasks=10 directly does not necessarily take effect; the actual number of maps is derived as follows:
(1) Default number of maps:
default_num = total_size / block_size
(2) Desired number (the manually set value):
goal_num = mapred.map.tasks
(3) Split size (the number of maps implied by the split size):
split_size = max(block_size, mapred.min.split.size)
split_num = total_size / split_size
(4) Final number of maps (the number actually used):
compute_map_num = min(split_num, max(default_num, goal_num))

Summary:
(1) To increase the number of maps, set mapred.map.tasks to a larger value;
(2) To decrease the number of maps, set mapred.min.split.size to a larger value.
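A worked example with illustrative values (not from the source):
total_size = 10GB, block_size = 128MB -> default_num = 80
mapred.map.tasks = 100 -> goal_num = 100
mapred.min.split.size = 256MB -> split_size = max(128MB, 256MB) = 256MB, split_num = 40
compute_map_num = min(split_num, max(default_num, goal_num)) = min(40, 100) = 40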

2.Map-side aggregation

set hive.map.aggr=true; this is equivalent to running a combiner on the map side

3. Speculative execution (default is true)

mapred.map.tasks.speculative.execution
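If speculative execution is not wanted (for example, when slow tasks are caused by data skew rather than by slow nodes), it can be switched off; a minimal sketch:

set mapred.map.tasks.speculative.execution=false;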

6. Hive Shuffle optimization

Map side:
io.sort.mb (size of the map-side sort buffer)
io.sort.spill.percent (buffer usage threshold that triggers a spill to disk)
min.num.spill.for.combine (minimum number of spill files before the combiner runs during the merge)
io.sort.factor (number of streams merged at once)
io.sort.record.percent (fraction of the sort buffer reserved for record metadata)

Reduce side:
mapred.reduce.parallel.copies (number of parallel threads fetching map output)
mapred.reduce.copy.backoff (back-off control for failed map-output fetches)
io.sort.factor (merge factor on the reduce side)
mapred.job.shuffle.input.buffer.percent (fraction of the reducer heap used to buffer shuffle input)
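These can be tuned per session like any other parameter; a sketch with purely illustrative values:

set io.sort.mb=200;
set io.sort.factor=100;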

7. Hive Reduce optimization

1. Speculative execution (default is true)

mapred.reduce.tasks.speculative.execution (the Hadoop parameter)
hive.mapred.reduce.tasks.speculative.execution (the same switch exposed in Hive; the effect is identical)
Either one works.

2.Reduce optimization (reduce number setting)
set mapred.reduce.tasks=10; set directly

Maximum number of reducers:
hive.exec.reducers.max (default: 999)

Data size handled by each reducer:
hive.exec.reducers.bytes.per.reducer (default: 1G)

Calculation formula (despite all these settings, the configured values are not necessarily all used):
numReduceTasks = min(maxReducers, input.size / perReducer)
maxReducers = hive.exec.reducers.max
perReducer = hive.exec.reducers.bytes.per.reducer
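A worked example with illustrative values (not from the source): with input.size = 10GB, perReducer = 1G and maxReducers = 999, numReduceTasks = min(999, 10GB / 1G) = 10.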

8. Queue

set mapred.queue.name=queue3; submit to the queue named queue3
set mapred.job.queue.name=queue3; the job runs in queue3
set mapred.job.priority=HIGH;

Queue reference article:
http://yaoyinjie.blog.51cto.com/3189782/872294

Reprinted from https://blog.51cto.com/tianxingzhe/1705565
