Enterprise-level Hive performance tuning

As an important framework on big data platforms, Hive has become one of the most widely used tools for building enterprise-level data warehouses thanks to its stability and ease of use.

However, if we simply use Hive without paying attention to performance, it is hard to build a first-class data warehouse. Hive performance tuning is therefore a skill every big data practitioner must master. This article covers some of the methods and techniques of Hive performance tuning.

How to tune Hive performance

Why is performance optimization said to be relatively difficult? Because optimizing a technology can never be done in isolation: it is a comprehensive effort that combines multiple techniques. If we confine ourselves to a single technique, we will never optimize well.

The following sections introduce the many facets of Hive optimization from several completely different angles.

1. SQL statement optimization

SQL statement optimization covers far too much ground to introduce exhaustively here, so I will walk through a few typical examples to convey the way of thinking. When you meet similar tuning problems later, you can focus your attention along these lines.

1. union all

insert into table stu partition(tp)
select s_age, max(s_birth) stat, 'max' tp
from stu_ori
group by s_age

union all

select s_age, min(s_birth) stat, 'min' tp
from stu_ori
group by s_age;

A quick analysis of the SQL above: it computes the maximum and the minimum birthday for each age and writes both into the same table. The two branches of the union all each group the same table by s_age and then take the max and the min respectively. Grouping the same table on the same field twice is a huge waste. Can we rewrite it? Of course. Hive has a multi-insert syntax for exactly this: the from clause is written first, and a single from ... insert into ... statement can insert into a table multiple times:

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

from stu_ori 

insert into table stu partition(tp) 
select s_age,max(s_birth) stat,'max' tp 
group by s_age

insert into table stu partition(tp) 
select s_age,min(s_birth) stat,'min' tp 
group by s_age;

With the SQL above, the s_age field of the stu_ori table is grouped only once while two different insert operations are performed.

This example tells us that we need to know SQL itself in more depth: if we did not know this syntax, we would never have thought of this approach.

2. distinct

Let's first look at a SQL statement that does a deduplicated count:

select count(1) 
from( 
  select s_age 
  from stu 
  group by s_age 
) b;

This simply counts the number of distinct age values. Why not use distinct directly?

select count(distinct s_age) 
from stu;

Some people say the first approach effectively avoids data skew on the reduce side when the data volume is particularly large. But is that really the case?

Setting the data-volume question aside, in the current business scenario and environment, using distinct will definitely be more efficient than the sub-query approach above, for several reasons:

  1. The field being deduplicated is age, whose enumeration values are very limited: even counting every age from 1 to 100, s_age has at most 100 distinct values. Put in MapReduce terms, each Map deduplicates s_age in the map phase; since the enumeration values are limited, each Map emits a limited set of s_age values, and the data volume reaching the reducers is at most the number of maps times the number of s_age enumeration values.

  2. distinct builds a hashtable in memory, making lookup and deduplication O(1) in time. The behavior of group by varies greatly across versions: some versions deduplicate with a hashtable, others by sorting, and sorting cannot reach O(1) even at its best. In addition, the first method (group by) is converted into two jobs, which consumes more disk and network I/O.

  3. Hive 3.0 added a count(distinct) optimization: through the configuration hive.optimize.countdistinct, even when data skew occurs, Hive can optimize automatically by rewriting the SQL execution logic (see the snippet after this list).

  4. The second method (distinct) is more concise than the first (group by), and its meaning is simple and clear. Absent special concerns, concise code is better code!
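For reference, the Hive 3.0 optimization mentioned in point 3 is just a configuration switch; this is a sketch, so check the documentation of your Hive version for availability:

-- Hive 3.0+: automatically rewrite count(distinct) into a skew-resistant plan
set hive.optimize.countdistinct=true;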

This example tells us not to over-optimize. Tuning is about timing: premature tuning may be wasted work or even counterproductive, and the cost of tuning is not always proportional to the return. Tuning needs to follow certain principles.

2. Data format optimization

Hive provides a variety of data storage organization formats, and different formats will have a great impact on the operating efficiency of the program.

The formats Hive provides include TEXTFILE, SequenceFile, RCFile, ORC, and Parquet.

SequenceFile is a flat file with a binary key/value structure. It was widely used on the early Hadoop platform as a MapReduce input/output format and as a data storage format.

Parquet is a columnar storage format compatible with multiple computing engines, such as MapReduce and Spark, and it offers good performance for deeply nested data structures. It is currently one of the mainstream storage choices for Hive production environments.

ORC is an optimization of RCFile. It provides an efficient way to store Hive data, improves Hive's read, write, and processing performance, and is compatible with various computing engines. In actual production environments, ORC has indeed become one of Hive's mainstream storage choices.

We use the same data and SQL statement, but the data storage format is different, and the execution time is as follows:

Data format      CPU time               User wait time
TextFile         33 minutes             171 seconds
SequenceFile     38 minutes             162 seconds
Parquet          2 minutes 22 seconds   50 seconds
ORC              1 minute 52 seconds    56 seconds

Note: CPU time is the server CPU time consumed by the running job.
User wait time is the wall-clock time from job submission until results are returned.

Querying the TextFile table costs 33 minutes of CPU time, while querying the ORC table costs only 1 minute 52 seconds, a dramatic reduction. Clearly, the data storage format alone can have a great impact on HiveSQL performance.
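The format is chosen when the table is created. A minimal sketch, using a hypothetical table name:

-- store the table as ORC; stored as parquet / sequencefile / textfile work the same way
create table stu_orc (
  s_age   int,
  s_birth string
)
stored as orc
tblproperties ("orc.compress"="SNAPPY");  -- optional: ORC compression codec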

3. Optimizing for too many small files

If there are too many small files, Hive treats each small file as a block at query time and starts one Map task for it, and the startup and initialization time of a Map task far exceeds its logical processing time, which wastes a great deal of resources. Moreover, the number of Map tasks that can run at the same time is limited. A common set of merge-related parameters is sketched below.
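A commonly used mitigation is to combine small input files and merge small output files through configuration. The parameters below are real Hive settings, but the values are illustrative and should be tuned for your own cluster:

-- combine small files into larger input splits on the map side
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- merge small files produced by map-only jobs
set hive.merge.mapfiles=true;
-- merge small files produced by map-reduce jobs
set hive.merge.mapredfiles=true;
-- target size of merged files, in bytes (about 256 MB here)
set hive.merge.size.per.task=256000000;
-- start an extra merge job when the average output file size falls below this
set hive.merge.smallfiles.avgsize=16000000;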

4. Parallel Execution Optimization

Hive converts a query into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages, and those stages are not always fully dependent on one another; some can run in parallel, which may shorten the execution time of the whole job.

Setting the parameter hive.exec.parallel to true enables concurrent execution. On a shared cluster, note that more parallel stages in a job means higher cluster utilization.

set hive.exec.parallel=true;              -- enable parallel execution of stages
set hive.exec.parallel.thread.number=16;  -- maximum parallelism for a single SQL statement; the default is 8

Of course, this only pays off when system resources are relatively idle; otherwise there are no spare resources and the parallel stages cannot actually start.

5. Data skew optimization

We all know the principle behind data skew: one key, or a handful of keys, accounts for, say, 90% of the entire data set, so the whole task is held back by the processing of those keys, and piling up so many rows of the same key may even cause a memory overflow.

Hive's general approaches to data skew:

Common practice, through parameter tuning:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;

When the latter option is set to true, the generated query plan contains two MapReduce jobs.

In the first job, the map output is distributed randomly among the reducers, and each reducer performs a partial aggregation and emits its results.

As a result, rows with the same Group By Key may land on different reducers, which achieves load balancing.

The second MapReduce job then distributes the pre-aggregated results to the reducers by Group By Key (this guarantees that identical Group By Keys reach the same reducer) and completes the final aggregation.

But this solution is a black box for us and cannot be controlled.

So how do we handle this kind of data skew by hand when the need arises:

  1. Sample the data to find out which keys are skewed;

  2. Prefix those concentrated keys with random numbers according to some rule;

  3. Perform the join; since the data is now spread out, the skew is avoided;

  4. In the result, strip the previously added random numbers to recover the original keys (a general sketch follows the null-value example below).

Example: suppose we find that 90% of the keys are null. Once the data volume is large enough, data skew is inevitable. The following approach can be used:

SELECT *
FROM a
 LEFT JOIN b ON CASE 
   WHEN a.user_id IS NULL THEN concat('hive_', rand())
   ELSE a.user_id
  END = b.user_id;

Note: the values randomly assigned to nulls must not coincide with values that already exist in the table, otherwise the result will be wrong.
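For the more general case where a few hot keys (not just nulls) dominate, the four steps above can be sketched as follows. The table and column names (fact_table, dim_table, uid, val, info) are hypothetical, and for simplicity every key is salted; in practice you may salt only the hot keys:

-- Step 2: salt the skewed side by appending a random suffix 0..9
-- Step 3: explode the other side 10 times so every salted key can still match
-- Step 4: the outer select strips the salt, restoring the original keys
select split(t1.salted_uid, '_')[0] as uid, t2.info
from (
  select concat(uid, '_', cast(floor(rand() * 10) as int)) as salted_uid, val
  from fact_table
) t1
join (
  select concat(uid, '_', n) as salted_uid, info
  from dim_table
  lateral view explode(array(0,1,2,3,4,5,6,7,8,9)) tmp as n
) t2
on t1.salted_uid = t2.salted_uid;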

6. Limit optimization

In general, a Limit statement still executes the entire query and only then returns a subset of the results.

A configuration property can be enabled to avoid this by sampling the data source:

hive.limit.optimize.enable=true -- enable sampling of the data source

hive.limit.row.max.size -- set the minimum sample size

hive.limit.optimize.limit.file -- set the maximum number of files to sample

Disadvantage: some input data may never get processed.

7. JOIN optimization

1. Use the same join key

When joining three or more tables, if every on clause uses the same join key, only one MapReduce job is generated.
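A minimal sketch with hypothetical tables a, b, and c: because both on clauses join on the same key, Hive compiles this into a single MapReduce job.

select a.uid, b.order_id, c.city
from a
join b on a.uid = b.uid
join c on a.uid = c.uid;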

2. Filter data as early as possible

Reduce the amount of data at every stage: add partition filters when querying partitioned tables, and select only the columns you actually need.
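A sketch of the idea, with hypothetical tables and a ds partition column: filter and project inside the subqueries instead of after the join.

select t1.uid, t2.amount
from (select uid from user_log where ds = '2020-08-10') t1
join (select uid, amount from orders where ds = '2020-08-10') t2
on t1.uid = t2.uid;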

3. Keep operations atomic

Avoid packing complex logic into a single SQL statement; use intermediate tables to carry the complex logic step by step.
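One way this can look, with hypothetical table and column names: materialize the heavy aggregation once, then keep the downstream query simple.

-- step 1: materialize the complex aggregation into an intermediate table
create table tmp_user_stat as
select uid, count(*) as pv, max(ts) as last_ts
from user_log
where ds = '2020-08-10'
group by uid;

-- step 2: downstream logic reads the simple intermediate table
select u.uid, u.pv
from tmp_user_stat u
join vip_user v on u.uid = v.uid;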

8. Predicate pushdown optimization

Predicate Pushdown in Hive means, in short, pushing filter conditions down so they run before the join, wherever doing so does not affect the result. After pushdown, the filter conditions execute on the map side, which reduces map output, cuts the amount of data shipped across the cluster, saves cluster resources, and improves task performance.

Let's look at the following statement:

select s1.key, s2.key 
from s1 left join s2 
on s1.key > '2';

The above is a Left Join statement: s1 is the left table, called the preserved-row table, and s2 is the right table.

Question: is the on condition s1.key > '2' executed before the join or after it? In other words, will predicate pushdown happen?

Answer: no, the predicate will not be pushed down. Because s1 is the preserved-row table, the filter condition is executed after the join.

And the following statement:

select s1.key, s2.key 
from s1 left join s2 
on s2.key > '2';

The s2 table is not the preserved-row table, so the condition s2.key > '2' can be pushed down to the s2 table, i.e., executed before the join.

Now look at the following statement:

select s1.key, s2.key 
from s1 left join s2 
where s1.key > '2';

Here the right table s2 is the null-supplying table.

s1 is not a null-supplying table, so the predicate s1.key > '2' can be pushed down.

And the following statement:

select s1.key, s2.key 
from s1 left join s2 
where s2.key > '2';

Since s2 is the null-supplying table, the filter condition s2.key > '2' cannot be pushed down.

So what are the rules for predicate pushdown: when is a condition pushed down, and when is it not? The four examples above can be summarized in the following table (for a left join); it is worth saving for reference:

Location of condition   On preserved-row table (s1)         On null-supplying table (s2)
In the on clause        not pushed down (runs after join)   pushed down (runs before join)
In the where clause     pushed down (runs before join)      not pushed down (runs after join)

Case:

select a.*
from a
left join b on a.uid = b.uid
where a.ds = '2020-08-10'
and b.ds = '2020-08-10';

The SQL above makes two main mistakes:

  1. The where condition on the right table (table b above) is written after the join, which causes the whole of table b to be joined first, with the partition filter applied only afterwards.

Note: although table a's where condition is also written after the join, predicate pushdown applies to table a, so its where condition runs before the join; table b's condition does not get pushed down!

  2. The on condition does not filter out null values; if both tables contain a large number of nulls, data skew will result. A corrected sketch follows.
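A corrected version might look like the sketch below. It assumes the intent is to keep all rows of a in that partition and enrich them with matching b rows from the same partition, and that b's null uids can be dropped before the join:

select a.*
from a
left join (
  select *
  from b
  where ds = '2020-08-10'    -- partition filter applied before the join
    and uid is not null      -- drop nulls so they cannot skew the join
) b
on a.uid = b.uid
where a.ds = '2020-08-10';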

Finally

Code optimization principles:

  • Understand the requirement: that is the foundation of optimization;

  • Grasp the full data link: that is the context of optimization;

  • Keep code concise: that makes optimization easier;

  • Talking about optimization where there is no bottleneck is asking for trouble.

     

Origin: blog.csdn.net/ytp552200ytp/article/details/130419790