Hive tuning: shyMing's road to mastery

Preface

It is no exaggeration to say that mastery of Hive tuning is an important indicator of whether a data engineer is qualified. Hive tuning covers compression and storage-format tuning, parameter tuning, SQL tuning, data-skew handling, and the small-file problem.

1. Data compression and storage format


  1. Compress the map-stage output. Prefer a codec with low CPU overhead at this stage.
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
// or, alternatively, LZO:
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
  2. Compress the final output.
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
 
Of course, you can also specify the table's file format and compression codec when creating the table in Hive.
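A minimal sketch (table and column names are illustrative):

-- ORC storage with Snappy compression, set via table properties
create table user_orders (
    user_id bigint,
    amount  double
)
stored as orc
tblproperties ("orc.compress"="SNAPPY");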

In conclusion, generally choose orcfile/parquet + snappy

2. Reasonable use of partitions and buckets

Partitioning physically divides a table's data into different folders, so that a query can target exactly the partition directories it needs to read, reducing the amount of data scanned.

Bucketing hashes the specified column and divides the table's data into different files according to the hash. At query time, Hive can use the bucket structure to quickly locate the bucket file that contains a given row, improving read efficiency.
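A sketch combining both (table, column, and bucket count are illustrative):

-- on older Hive versions, also: set hive.enforce.bucketing=true;
-- so that inserts honor the bucketing
create table user_events (
    user_id bigint,
    event   string
)
partitioned by (dt string)
clustered by (user_id) into 8 buckets
stored as orc;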

3. Hive parameter optimization


// If a query can be served without a MapReduce job, skip MapReduce entirely
hive> set hive.fetch.task.conversion=more;
 
// Enable parallel execution of jobs
set hive.exec.parallel=true;
// Explanation: when one SQL statement produces multiple jobs with no dependencies between them, they can run in parallel instead of sequentially (typically when union all is used)
 
// Maximum number of parallel threads (jobs) allowed for one SQL statement
set hive.exec.parallel.thread.number=8;
 
// Enable JVM reuse
// JVM reuse has a very large impact on Hive performance, especially for workloads where small files are hard to avoid or where there are very many tasks, most of which finish quickly. JVM startup can add considerable overhead, especially when a job contains thousands of tasks.
set mapred.job.reuse.jvm.num.tasks=10;
 
// Set a reasonable number of reducers
// Option 1: adjust the amount of data each reducer receives
set hive.exec.reducers.bytes.per.reducer=500000000; // 500 MB
// Option 2: set the number of reducers directly
set mapred.reduce.tasks=20;
// Map-side aggregation, reducing the amount of data sent to the reducers
set hive.map.aggr=true;
// Enable Hive's built-in data-skew optimization for group by
set hive.groupby.skewindata=true;

4. SQL optimization

4.1 where condition optimization

Before optimization (relational databases perform this optimization automatically, so it needs no thought there):

select m.cid, u.id
from order m join customer u
on (m.cid = u.id)
where m.dt = '20180808';

After optimization (the where condition is applied on the map side instead of the reduce side):

select m.cid, u.id
from (select * from order where dt = '20180808') m
join customer u
on (m.cid = u.id);

4.2 Union optimization

Try not to use union (union removes duplicate records); use union all instead, then remove duplicates with group by, as sketched below.
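A sketch of the rewrite (table names are illustrative):

-- Instead of:  select id from t1 union select id from t2;
select id
from (
    select id from t1
    union all
    select id from t2
) tmp
group by id;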

4.3 count distinct optimization

Don't use count(distinct column); use a subquery instead:

select count(1)
from (select id from tablename group by id) tmp;

4.4 Use in instead of join

If you need to constrain one table based on the fields of another, try to use in instead of join; in is faster than join.


select a.id, a.name from tb1 a join tb2 b on (a.id = b.id);
 
select id, name from tb1 where id in (select id from tb2);
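Note: very old Hive versions (before 0.13) did not support in subqueries in the where clause; left semi join is the equivalent form there:

select a.id, a.name from tb1 a left semi join tb2 b on (a.id = b.id);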

4.5 Optimize subqueries

Where possible, eliminate group by, count(distinct), max, and min from subqueries; this can reduce the number of jobs.

4.6 join optimization

Common/shuffle/reduce join: the join happens in the reduce phase; suitable for joining a large table to a large table (the default).
Map join: the join happens in the map phase; suitable for joining a small table to a large table. The large table is streamed from files while the small table is held in memory (Hive optimizes this automatically, determining which table is small and caching it).

set hive.auto.convert.join=true;
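A sketch using the legacy mapjoin hint (table names are illustrative; with auto conversion enabled the hint is usually unnecessary, and the hive.mapjoin.smalltable.filesize threshold decides what counts as a small table):

select /*+ mapjoin(b) */ a.id, b.name
from big_table a
join small_table b on (a.id = b.id);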

SMB join
Sort-Merge-Bucket join optimizes joins of large tables to large tables using the bucketed-table concept: matching buckets are joined against each other (both tables must be bucketed on the join key).

set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

5. Data skew

Symptom: the task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a few (one or a handful of) reduce subtasks remain unfinished, because the amount of data they process differs greatly from that of the other reducers.
Cause: the input to one reducer is far larger than the input to the others.

5.1 Skew caused by the SQL itself

1) group by and join keys
If a group by suffers from data skew, first ask whether the group by dimension can be made finer. If it cannot, append a random number to the original group key, group and aggregate once, then strip the random number from the result and group and aggregate again, as sketched below.
When joining, if there are many null join keys, convert the nulls to random values so they do not pile up on a single reducer (see 5.4).
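A sketch of the two-stage aggregation with a random salt (table and column names are illustrative; assumes the key values contain no '~'):

-- stage 1 (inner): salt the key, pre-aggregate on the salted key
-- stage 2 (outer): strip the salt, finish the aggregation
select split(salted_key, '~')[1] as group_key,
       sum(cnt) as cnt
from (
    select salted_key, count(1) as cnt
    from (
        select concat(cast(floor(rand() * 10) as string), '~', group_key) as salted_key
        from skewed_table
    ) t
    group by salted_key
) pre
group by split(salted_key, '~')[1];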

2) count(distinct)
Situation: too many rows with one special value (for example, null).
Consequence: the reducer that processes the special value is slow, and there is only one reduce task.
Solution: when doing count distinct, treat the null value separately: for example, filter out the null rows and add 1 to the final result, as sketched below. If other calculations also need a group by, process the records with null values separately and then merge the result with the other calculations.
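A sketch (table and column names are illustrative; the + 1 assumes at least one null row exists):

select count(1) + 1 as distinct_users
from (
    select user_id
    from logs
    where user_id is not null
    group by user_id
) tmp;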

3) Data skew caused by joining on mismatched data types
Situation: for example, user_id in the user table is int, while user_id in the log table holds both string and int values, and the two tables are joined on user_id.
Consequence: the default hash partitioning is computed on the int ids, so all records whose id is a string are sent to a single reducer, and that one reduce task becomes the bottleneck.
Solution: cast the numeric type to string so both sides hash consistently:

select * from users a
left outer join logs b
on cast(a.user_id as string) = b.user_id;

4) Use map join (join small tables to large tables in the map phase; see section 4.6).

5.2 Skew from the characteristics of the business data itself (a hot key)

When both join inputs are fairly large and a hot value causes a long tail, the hot values and non-hot values can be processed separately and the results merged, as sketched below.
When the distribution of the key itself is uneven, you can append a random number to the key or increase the number of reduce tasks.
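A sketch of splitting a hot key out of the join (table names and 'hot_key' are illustrative; the hot branch filters the right side first, so it is small enough to map join):

select key, val
from (
    -- non-hot keys: regular shuffle join
    select a.key, b.val
    from big_a a
    join big_b b on a.key = b.key
    where a.key <> 'hot_key'
    union all
    -- the hot key alone: filter b first, then map join it
    select /*+ mapjoin(b) */ a.key, b.val
    from big_a a
    join (select key, val from big_b where key = 'hot_key') b
      on a.key = b.key
) tmp;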

5.3 Enable load balancing when data is skewed

set hive.groupby.skewindata=true;
Idea: first distribute the data randomly and partially aggregate, then distribute by the group by key and finish the aggregation.
Operation: when this option is set to true, the generated query plan contains two MR jobs.
In the first MR job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits the result. Because rows with the same group by key may be sent to different reducers, the load is balanced.

The second MR job then distributes the pre-aggregated results to the reducers by the group by key (this step guarantees that identical group by keys reach the same reducer) and completes the final aggregation.

5.4 Control the null value distribution

Convert null keys into a string plus a random number (or a purely random value), so that data which would otherwise skew because of the nulls is spread across multiple reducers, as sketched below.
Note: if the outlier rows are not needed, it is best to filter them out in the where clause ahead of time, which greatly reduces the amount of computation.
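A sketch (table and column names are illustrative; assumes user_id is a string column; a salted null key can never match the other side, so the join result is unchanged):

select a.id, b.name
from logs a
left outer join users b
  on (case when a.user_id is null
           then concat('null_', cast(rand() as string))
           else a.user_id
      end) = b.user_id;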

6. Merge small files

Small files are generated in three places: map input, map output, and reduce output. Too many small files also hurt the efficiency of Hive analysis:

Set small-file merging for the map input:

// Maximum size of each split
set mapred.max.split.size=256000000;
// Minimum split size on one node (this value determines whether files on multiple DataNodes need to be merged)
set mapred.min.split.size.per.node=100000000;
// Minimum split size under one rack (this value determines whether files across racks need to be merged)
set mapred.min.split.size.per.rack=100000000;
// Merge small files before the map phase runs
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Set the relevant parameters for merging map output and reduce output:

// Merge small files at the end of a map-only job (default true)
set hive.merge.mapfiles=true;
// Merge small files at the end of a MapReduce job (default false)
set hive.merge.mapredfiles=true;
// Target size of the merged files (256 MB)
set hive.merge.size.per.task=256000000;
// When the average size of the output files is below this value, start a separate MapReduce job to merge them
set hive.merge.smallfiles.avgsize=16000000;

7. View the SQL execution plan

explain <sql statement>;
Learn to view the execution plan of a SQL statement, optimize the business logic, and reduce the amount of data the jobs process. This is also very important for tuning.
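A minimal example (table name is illustrative):

explain
select dt, count(1)
from user_events
group by dt;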

Origin: blog.csdn.net/qq_42706464/article/details/108966876