Troubleshooting data skew caused by join in hive


1. Scenario

If the number of records under one key greatly exceeds that of the other keys, the reduce task handling that key can be extremely slow during a join or group by. This article analyzes the join scenario.
The example SQL is as follows: query the number of times each appid has been opened, excluding cheating imeis.

select appid, count(*)
from (
  select md5imei, appid
  from (
    select t1.md5imei, t1.appid, t2.cheat_flags
    from imei_open_app t1
    left outer join cheat_imei t2
    on t1.md5imei = t2.md5imei
  ) t3
  where t3.cheat_flags is null
) t4
group by appid;

 
Description: table cheat_imei (75 million rows, no big key) stores the cheating imeis. Table imei_open_app (565.26 million rows) stores the appids opened by each imei. This table contains a big key: there are 236.59 million records with md5imei=54bc0748b1c0fb46135d117b6d26885e.

2. Hadoop environment

hadoop 2.6.0-cdh5.8.0
hive-1.1.0-cdh5.8.0

3. Problems caused

A big key may cause the following two problems.

3.1 Task stuck

A certain reduce task is stuck at 99.9% for a very long time, as shown below.
 
 

3.2 Task killed due to timeout

The reduce task processes a huge amount of data. During full GC the JVM stops the world, so the task fails to report progress within the default 600 seconds and is killed. The error message is as follows:

AttemptID:attempt_1498075186313_242232_r_000021_1 Timed out after 600 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143


4. How to determine whether a big key is the cause

You can use the following method.

4.1 Judging by execution time

If one reduce task takes far longer than the others, skew is likely. (Note: if every reduce task takes roughly the same, long time, the cause is more likely too few reducers.) As shown below, most tasks completed within 4 minutes, while task r_000021 was still unfinished after 33 minutes.

 

Note one special case that must be excluded. Sometimes the node running a task has a problem, making the task very slow. In that case, MapReduce speculative execution starts a second attempt. If the new attempt finishes quickly, the slowness of the original attempt was likely a node problem. If the speculative attempt is also particularly slow, that is stronger evidence of a skew problem.

4.2 Judging by Task Counter

Counters record statistics for the whole job and for each task. The counter URL is generally similar to:

http://rm:9099/proxy/application_1498075186313_242232/mapreduce/taskcounters/task_1498075186313_242232_r_000017  

1) By input record count

The counters of a normal task are as follows:


The counters of task 000021 are as follows: the input record count is 240 million, more than 10 times that of the other tasks.


2) By output byte count

The counters of a normal task are as follows:


The counters of task 000021 are as follows: its output is dozens of times that of the other tasks.


5. How to find the big key and the corresponding SQL code

5.1 Find the corresponding big key

Normally, when Hive performs a join, it prints join-key progress to the log. We can find the big key through this log.

1) Find the particularly slow task and open its log. The URL is similar to:

http://rm:8042/node/containerlogs/container_e115_1498075186313_242232_01_000416/hdp-ads-audit/syslog/?start=0

2) Search the log for "rows for joinkey", as shown below.


 
 
3) Find the key with the longest time span, as shown below. For example, for [54bc0748b1c0fb46135d117b6d26885e], processing ran from 2017-08-03 11:31:30 to 2017-08-03 11:46:35, already 15 minutes, and the task was still not finished.
 
 
 
 
...... Because the log is too long, the middle part is omitted ......
 
 
In addition, the log shows that 54bc0748b1c0fb46135d117b6d26885e had already processed 236,528,000 rows. In fact, this key has 236.59 million rows in imei_open_app, so it is the key that skews the join.
 
 

5.2 Determine the stage where the task is stuck

1) Determine the stage from the job name

Generally, Hive's default job name includes the query name and the stage, as below for Stage-1.


2) If the job name is customized, the stage may not be identifiable from it, and the task log is needed. In the particularly slow task's log, search for "CommonJoinOperator: JOIN struct". When Hive does a join, it prints the join key structure to the log, as follows.


The key information in the above figure is struct<_col1:string,_col6:string>

At this point, consult the SQL execution plan. By comparing the join key against the plan, it can be concluded that this is the Stage-1 stage.


5.3 Determine the SQL execution code

Once the stage is determined, the execution plan tells you which piece of code is being executed when the skew occurs. From the figure above, it can be inferred that the data skew occurred while executing the code in the red box below.
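For reference, the execution plan can be printed with Hive's explain statement. A minimal sketch using this article's example query:

```sql
-- Print the stage graph and operator tree for the example query.
-- Each Stage-N in the output corresponds to one MapReduce job,
-- which is how the skewed job is matched to a piece of SQL.
explain
select appid, count(*)
from (
  select md5imei, appid
  from (
    select t1.md5imei, t1.appid, t2.cheat_flags
    from imei_open_app t1
    left outer join cheat_imei t2
    on t1.md5imei = t2.md5imei
  ) t3
  where t3.cheat_flags is null
) t4
group by appid;
```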


6. Solutions

6.1 Filter out dirty data

If the big key is meaningless dirty data, filter it out directly. In this scenario, the big key is dirty data with no practical meaning, so it is simply filtered out.
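Applied to the example, the filter can be pushed into the subquery that reads imei_open_app, so the dirty key never reaches the join. A sketch, using the key identified in section 5:

```sql
select appid, count(*)
from (
  select md5imei, appid
  from (
    select t1.md5imei, t1.appid, t2.cheat_flags
    from (
      -- drop the dirty big key before the join
      select md5imei, appid
      from imei_open_app
      where md5imei != "54bc0748b1c0fb46135d117b6d26885e"
    ) t1
    left outer join cheat_imei t2
    on t1.md5imei = t2.md5imei
  ) t3
  where t3.cheat_flags is null
) t4
group by appid;
```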

6.2 Data Preprocessing

Preprocess the data, and try to ensure that there are not too many records corresponding to the same key when joining.

6.3 Increase the number of reducers

If there are multiple big keys in the data, increasing the number of reducers can make the probability of these big keys falling into the same reduce much smaller.
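In Hive, the reducer count can be raised explicitly, or indirectly by lowering the data volume assigned to each reducer. The values below are illustrative, not recommendations:

```sql
-- Fix the number of reducers explicitly...
set mapred.reduce.tasks=200;

-- ...or let Hive derive a larger reducer count by shrinking
-- the input bytes handled per reducer (value in bytes)
set hive.exec.reducers.bytes.per.reducer=134217728;
```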

6.4 Convert to mapjoin

If one of the two joined tables is small, use a mapjoin: the small table is loaded into memory and the join is completed on the map side, so the skewed key never reaches a reducer.
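Assuming the right-hand table is small enough to fit in memory (which may not hold for the 75-million-row cheat_imei in this article), the conversion can be enabled or hinted like this; the size threshold is illustrative:

```sql
-- Let Hive convert the join automatically when the small table
-- is under the threshold (value in bytes)
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- Or force it with the older-style hint
select /*+ MAPJOIN(t2) */ t1.md5imei, t1.appid, t2.cheat_flags
from imei_open_app t1
left outer join cheat_imei t2
on t1.md5imei = t2.md5imei;
```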

6.5 Large keys are handled separately

Process the big key and the other keys separately, then union the results. The SQL is as follows:

select appid, count(*)
from (
  select md5imei, appid
  from (
    select t1.md5imei, t1.appid, t2.cheat_flags
    from (
      select md5imei, appid
      from imei_open_app
      where md5imei = "54bc0748b1c0fb46135d117b6d26885e"
    ) t1
    left outer join cheat_imei t2
    on t1.md5imei = t2.md5imei
    union all
    select t11.md5imei, t11.appid, t12.cheat_flags
    from (
      select md5imei, appid
      from imei_open_app
      where md5imei != "54bc0748b1c0fb46135d117b6d26885e"
    ) t11
    left outer join cheat_imei t12
    on t11.md5imei = t12.md5imei
  ) t3
  where t3.cheat_flags is null
) t4
group by appid;

6.6 hive.optimize.skewjoin

When hive.optimize.skewjoin is enabled, a join SQL is split into two jobs: keys whose row count exceeds hive.skewjoin.key (default 100000) are set aside and processed in a follow-up map join. Refer to: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties. Note that this parameter has no effect on full outer join.
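The corresponding session settings, with hive.skewjoin.key shown at its documented default:

```sql
-- Split skewed join keys into a separate follow-up job
set hive.optimize.skewjoin=true;
-- A key is treated as skewed once its row count exceeds this value
set hive.skewjoin.key=100000;
```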

6.7 Adjust memory settings

Suitable for scenarios where the task is killed for exceeding its memory limit. Increasing memory at least lets the task run to completion instead of being killed; it does not necessarily shorten execution time.

Such as:

set mapreduce.reduce.memory.mb=5120;

set mapreduce.reduce.java.opts=-Xmx5000M -XX:MaxPermSize=128m;


