Solving the problem of data skew in Hive

  In everyday work, when we use Hive to handle business problems, we inevitably run into data skew. The essence of data skew is an uneven distribution of keys, which makes the amount of data assigned to different reducers vary widely. When the gap grows too large, one reducer carries too heavy a burden and the whole job is delayed until it finishes.

Main causes

1. Uneven key distribution.
2. Skew on the map side: too many input files of uneven sizes.
3. Skew on the reduce side: problems with the partitioner.
4. Characteristics of the business data itself.

Solutions

1. Adjust Hive parameters as follows:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;

The parameter hive.map.aggr=true enables partial aggregation on the map side, equivalent to a Combiner.
When hive.groupby.skewindata=true is set, two MR jobs are generated if data skew occurs. The first MR job distributes keys randomly, spreading them across the reducers as evenly as possible, and each reducer performs a partial aggregation. The second MR job then takes the partially aggregated results and sends all rows with the same key to the same reducer, where the final group by is performed. This largely avoids putting excessive pressure on any single reducer.
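
For illustration, here is a minimal sketch of a skewed aggregation run with these settings enabled; the table and column names (user_log, user_id) are hypothetical:

-- Hypothetical example: count page views per user when a few user_ids dominate
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

select user_id, count(*) as pv
from user_log
group by user_id;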

2. Optimize the map and reduce sides as follows:

set hive.merge.mapfiles=true;
set mapred.map.tasks=number;

set hive.merge.mapfiles=true handles the case of too many small files, which puts a lot of pressure on the map side; this parameter merges the small files produced by map-only jobs. Alternatively, use set mapred.map.tasks=number to adjust the number of mappers, so that more mappers share the map-side load.

set mapred.reduce.tasks=number;

set mapred.reduce.tasks=number adjusts the number of reducers. This generally suits only certain business scenarios. For example, suppose 80 different product types, each with a large amount of data, end up being aggregated in one reducer; after increasing the number of reducers, the hash partitioner may distribute those 80 product types across different reducers. But increasing the reducer count is a very limited way of handling data skew and may not work at all.
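
A minimal sketch of the scenario above; the table and column names (orders, product_type, amount) are hypothetical:

-- Hypothetical example: spread the 80 heavy product types across more reducers
set mapred.reduce.tasks=80;

select product_type, sum(amount) as total_amount
from orders
group by product_type;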

3. SQL optimization
3.1 Data skew caused by null values
Data skew caused by null values is a common problem in real business. For example, in user behavior logs in the traffic domain, the user_id is sometimes lost, producing a large number of null user_id values. If such a table is then joined with the user information table, data skew occurs. The solutions are as follows:

-- 1. Filter the null keys out of the join, then union all the null-key rows back in
select
  a.*
from a
join b
on a.id is not null and a.id = b.id
union all
select
  a.*
from a
where a.id is null;
 
-- 2. Replace null keys with a string constant plus a random number
select
  *
from a
left outer join b
on case
     when a.id is null
     then concat('null_key_', rand())
     else a.id
   end = b.id;

The second method is better than the first: it needs fewer I/O passes and fewer jobs. The first method reads the user behavior log table twice and requires 2 jobs, while the second requires only 1. This optimization suits data skew caused by invalid IDs. By turning null keys into a string plus a random number, the rows that cause the skew are spread across different reducers: records with null keys no longer crowd into the same reduce task but are scattered across multiple reduce tasks by the substitute random values. Since the null keys match nothing in the join, the final result is unaffected.

3.2 Joining a small table with a large table
When using Hive to process data, joining a small table with a large table is usually an easy problem to solve: we just load the small table into memory first, similar to Spark's broadcast variables. The implementation is as follows:

-- enable automatic MapJoin optimization
set hive.auto.convert.join=true;
-- small-table size threshold in bytes; set it as needed
set hive.mapjoin.smalltable.filesize=25000000;
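
Hive also accepts an explicit hint to force the broadcast; a minimal sketch, where the table names small_dim and big_fact are hypothetical:

-- Hypothetical example: force the small dimension table to be broadcast into each mapper
select /*+ MAPJOIN(d) */
  f.order_id,
  d.dim_name
from big_fact f
join small_dim d
on f.dim_id = d.dim_id;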

3.3 Joining a large table with a large table
1. When a large table is joined with another large table, map join no longer helps, because neither large table can be broadcast to the map side: doing so would put excessive pressure on memory and waste it. In this case we can filter the useless columns and rows out of both large tables before joining, as shown in the sketch below. If much of the data is useless, this greatly reduces the pressure on the nodes and avoids the data skew problem.
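
A minimal sketch of such pre-filtering, assuming hypothetical tables big_a and big_b that each carry a status column worth filtering on:

-- Hypothetical example: trim useless rows and columns before joining two large tables
select a.id, a.amount, b.category
from (select id, amount from big_a where status = 'valid') a
join (select id, category from big_b where status = 'valid') b
on a.id = b.id;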

2. If the two large tables really do have that much data to join, we can try the following parameters:

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=skew_key_threshold;  -- the default is 100000

When Hive runs, it has no way of knowing in advance which key will skew and by how much, so this parameter sets a skew threshold: once the number of rows for a key exceeds it, the excess rows are split off and processed by a follow-up job instead of piling up on one reducer.
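
A minimal usage sketch with these settings, again using the hypothetical tables big_a and big_b:

-- Hypothetical example: let Hive move rows of skewed keys into a follow-up job at run time
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;

select a.id, a.amount, b.category
from big_a a
join big_b b
on a.id = b.id;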
