Hive tuning: compression, partitioning and bucketing, table optimization

Mastery of Hive tuning is an important indicator of whether a data engineer is qualified.

1. Data compression and storage format

(Figure: compression codecs supported by MapReduce — DEFLATE, gzip, bzip2, LZO, Snappy)

(Figure: codec performance comparison — compression ratio vs. compress/decompress speed)
① bzip2 has a high compression ratio, but its compression/decompression speed is slow.
② LZO has a relatively low compression ratio, but its compression/decompression speed is fast.
③ Note: LZO is a general-purpose compression codec for Hadoop data. Its design goal is a compression speed comparable to hard-disk read speed, so speed, not compression ratio, is the priority factor. Compared with the gzip codec, LZO compresses about 5 times faster and decompresses about 2 times faster. The same file compressed with LZO is about 50% larger than when compressed with gzip, but still 25%–50% smaller than the uncompressed original. This is very beneficial for performance: the Map phase can complete up to 4 times faster.
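As a sketch of how these codecs are applied in practice (table and column names here are illustrative, not from the original post), compression can be enabled per table through the storage format, and for intermediate and final MR output through session settings:

```sql
-- Store the table as ORC with Snappy compression (fast codec, good for frequently read data).
-- Table and column names are illustrative.
CREATE TABLE user_events_orc (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Compress intermediate map output to cut shuffle I/O (Snappy favors speed over ratio).
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
```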

2. Reasonable use of partitions and buckets

Partitioning governs the storage path of the data; bucketing governs the data files.
Partitioning physically splits a table's data into different folders, so a query can precisely specify which partition directories to read, thereby reducing the amount of data scanned.
Bucketing hashes a specified column and splits the table's data into different files. At query time, Hive can use the bucketing structure to quickly locate the bucket file containing a given row, thereby improving read efficiency.
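A minimal sketch of both techniques together (table, column, and partition names are illustrative):

```sql
-- Partition by day (one HDFS folder per dt value); hash user_id into 16 bucket files.
-- All names here are illustrative.
CREATE TABLE user_logs (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;

-- On older Hive versions, required so that inserts honor the bucket definition.
SET hive.enforce.bucketing=true;

-- Partition pruning: only the dt='2020-01-01' folder is scanned.
SELECT count(*) FROM user_logs WHERE dt = '2020-01-01';
```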

3. Hive parameter optimization

// Let simple queries fetch data directly, skipping the MapReduce job where possible
hive> set hive.fetch.task.conversion=more;

// Turn on task parallel execution
set hive.exec.parallel=true;
// Explanation: when a single SQL statement produces multiple jobs with no dependencies between them, sequential execution can become parallel execution (typically when union all is used)

// The maximum number of threads allowed for parallel tasks in the same SQL
set hive.exec.parallel.thread.number=8;

// Enable JVM reuse. JVM reuse has a large impact on Hive performance, especially in scenarios with many unavoidable small files or many tasks, where most tasks finish quickly. JVM startup overhead becomes considerable, particularly when a job contains thousands of tasks.
set mapred.job.reuse.jvm.num.tasks=10;

// Set the number of reduce reasonably
// Method 1: adjust the amount of data handled by each reducer
set hive.exec.reducers.bytes.per.reducer=500000000; // about 500 MB
// Method 2: set the number of reducers directly
set mapred.reduce.tasks=20;

// Aggregate on the map side to reduce the amount of data shipped to the reducers
set hive.map.aggr=true;

// Turn on Hive's built-in load balancing for skewed group-by keys
set hive.groupby.skewindata=true;
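With hive.groupby.skewindata=true, Hive compiles a group-by into two MR jobs: the first spreads rows randomly across reducers for partial aggregation, and the second completes the final aggregation by key. A query that typically benefits, sketched with illustrative table and column names:

```sql
SET hive.map.aggr=true;
SET hive.groupby.skewindata=true;

-- If a handful of user_id values dominate the data, the first job's random
-- redistribution prevents one reducer from receiving almost all of the rows.
SELECT user_id, count(*) AS pv
FROM user_logs
GROUP BY user_id;
```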

4. Optimize SQL

① Push where conditions down so that filtering happens on the map side instead of the reduce side.
② When deduplication is not strictly required, use union all (optionally followed by group by) instead of union.
③ Avoid count(distinct ...): it funnels all the data through a single reducer.
④ Where an existence check suffices, use in instead of a full join.
⑤ Optimize subqueries: reducing the group by, count(distinct), max, min, etc. inside subqueries can reduce the number of jobs.
⑥ Join optimization: use map join to load small dimension tables (fewer than about 1000 rows) into memory and complete the join on the map side.
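Two of the rewrites above, sketched with illustrative table names:

```sql
-- ③ Replace count(distinct ...), a single-reducer bottleneck, with group by + count:
SELECT count(*)
FROM (SELECT user_id FROM user_logs GROUP BY user_id) t;

-- ⑥ Map join: broadcast the small dimension table into memory so the join
--    finishes on the map side with no shuffle. Hive auto-converts joins when
--    the small table is below hive.mapjoin.smalltable.filesize; the explicit
--    hint form is shown here.
SELECT /*+ MAPJOIN(d) */ f.user_id, d.region
FROM user_logs f
JOIN dim_region d ON f.region_id = d.region_id;
```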

5. Data skew

Data skew: the task progress sits at 99% (or 100%) for a long time, and the task monitoring page shows that only one or a few reduce subtasks are unfinished, because they are processing far more data than the other reducers.
(1) Data skew caused by the SQL itself
① Set a reasonable number of maps. More maps is not always better: a map task's startup and initialization time can far exceed its processing time, so too many maps wastes resources.
② Merge small files: combining small files before map execution reduces the number of maps. CombineHiveInputFormat (the system default) merges small files.
③ Increase the number of maps for complex jobs: when the input files are large, the task logic is complex, and maps run slowly, consider increasing the map count to reduce the data handled by each map and improve execution efficiency.
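The standard knobs for ② and ③ look like the following (the 64 MB value is an illustrative choice, not from the original post):

```sql
-- ② Merge small files before map tasks start (default input format in recent Hive):
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- ③ Increase the number of maps by lowering the max split size (in bytes):
--    e.g. 64 MB roughly doubles the map count versus a 128 MB block size.
SET mapreduce.input.fileinputformat.split.maxsize=67108864;
```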

6. View the SQL execution plan

Learning to read the SQL execution plan, to optimize business logic and reduce job data volume, is also very important for tuning!
explain <sql statement>
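For example (the query is illustrative; any statement works):

```sql
-- Print the stage plan: stage dependencies, operator tree, estimated row counts.
EXPLAIN
SELECT dt, count(*) FROM user_logs GROUP BY dt;

-- EXTENDED adds file paths and more operator detail.
EXPLAIN EXTENDED
SELECT dt, count(*) FROM user_logs GROUP BY dt;
```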


Origin blog.csdn.net/Cxf2018/article/details/109288976