Some common methods to solve data skew in Spark

Data skew is one of the hardest problems in big-data computation. Once skew occurs, the performance of a Spark job can be far worse than expected. Tuning for data skew is part of the art of Spark job performance: different types of skew call for different solutions in order to keep job performance under control.

First, the principle of data skew

A Spark application is divided into multiple jobs according to the action operations inside it; within each job, stages are divided according to the shuffle operations; and each stage is assigned multiple tasks to execute. Each task takes one partition of data to process.

Tasks within the same stage process data in parallel, while stages that depend on each other run serially. Because of this mechanism, suppose a job of a Spark application has two stages, stage0 and stage1; stage1 can only start after stage0 has finished. If stage0 is assigned n tasks, and one of those tasks receives a partition so large that it takes over an hour, while the remaining n-1 tasks finish within half an hour, then everything has to wait for that last task before moving on to the next stage. This phenomenon, where one task inside a stage receives far more data than the others, is data skew.

The figure below is an example: the 7 records with the key hello are all mapped to the same task, while the other two tasks each process only one record.

(Figure: data skew)

Second, symptoms of data skew

1, the vast majority of tasks execute very fast, but a few individual tasks are extremely slow. For example, out of 1000 tasks in total, 997 finish within 1 minute, but the remaining two or three tasks take one or two hours. This situation is very common.

2, a Spark job that used to run normally suddenly throws an OOM (out of memory) exception one day; inspecting the exception stack shows that it comes from the business code we wrote. This situation is relatively rare.

Third, how to locate the code that causes data skew

Data skew only occurs during a shuffle. Common operators that may trigger a shuffle include distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, and so on. When data skew occurs, it is very likely caused by the use of one of these operators.

If we submit in yarn-client mode, we can look directly at the local log and find in it which stage the job is currently running; if we submit in yarn-cluster mode, we can check which stage is currently running through the Spark web UI. Whichever mode is used, we can view in the Spark web UI the amount of data and the running time of every task in the current stage, which lets us determine whether uneven data distribution across tasks is causing the skew.

Once we have confirmed the stage where the skew occurs, we can find the operator that triggers the shuffle and thereby locate the skewed code corresponding to that stage. Besides the shuffle operators mentioned above, also pay attention to certain SQL statements used in Spark SQL, such as group by.

Fourth, solutions

The idea behind solving data skew is to ensure that the data received by each task within a stage is sufficiently uniform. One way is to partition the data at the source, at the granularity used by the Spark computation, so that it is already uniform enough; failing that, we must try to ensure uniformity in the stage that reads and processes the data source. Generally there are the following options.

1, aggregate the source data and filter the keys that cause skew

a, aggregate at the data source

Suppose the data source of one of our Spark jobs is Hive, into which ETL'd data is written every day, mainly the daily behavior logs of users on an e-commerce platform. The Spark job analyzes at session granularity; in that case, when the data is written to Hive we can make sure that all the information for one session is aggregated into a single record, i.e. write the data into Hive at session granularity.

This way our Spark job no longer needs operations like groupByKey + map; it can map directly over the value of each key and compute the data we need. Eliminating the shuffle operation avoids the data skew the shuffle could cause.
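For illustration, here is a minimal Scala sketch of the difference, assuming a hypothetical pre-aggregated Hive table named session_aggr in which each row already holds one whole session (the column names are also made up):

import org.apache.spark.sql.SparkSession

object SessionAggrExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SessionAggrExample")
      .enableHiveSupport()
      .getOrCreate()

    // Without pre-aggregation, raw log rows would first have to be grouped by session id
    // (a shuffle):  rawLogs.groupByKey().map { case (sessionId, rows) => summarize(rows) }

    // With pre-aggregation in Hive, each row already holds one whole session,
    // so a shuffle-free map is enough to compute what we need.
    val sessions = spark.table("session_aggr").rdd   // hypothetical pre-aggregated table
    val result = sessions.map { row =>
      (row.getAs[String]("session_id"), row.getAs[Long]("action_count"))
    }
    result.take(10).foreach(println)
    spark.stop()
  }
}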

However, a Spark job may analyze at more than one granularity: besides session granularity there may also be date granularity, userId granularity, and so on. In that case there is no guarantee that all of the data can be aggregated at a single granularity. Here we can compromise and choose a relatively coarse granularity to pre-aggregate the data. For example, we might originally store 1,000,000 records, but after aggregating at some granularity, say date, the data may be reduced to 500,000 records, which at least mitigates the data skew.

b, filter out the keys that cause skew

For example, our Hive data has 1,000,000 keys in total, 5 of which correspond to very large amounts of data, possibly hundreds of thousands of records each (this happens on e-commerce platforms when a single user maliciously spams requests), while every other key corresponds to only a few dozen records. If the business can accept discarding the data for those 5 keys, we can simply filter them out in the Hive SQL when reading the data, thereby avoiding the data skew.
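A minimal Scala sketch of this filtering, assuming a hypothetical Hive table user_action_log with a user_id column and five keys already known to be abnormally large:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FilterSkewedKeys {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterSkewedKeys")
      .enableHiveSupport()
      .getOrCreate()

    // The table name, column name and key list below are all hypothetical.
    val skewedKeys = Seq("u_10001", "u_10002", "u_10003", "u_10004", "u_10005")

    // Drop the skewed keys at read time so they never enter the downstream shuffle.
    val filtered = spark.table("user_action_log")
      .filter(!col("user_id").isin(skewedKeys: _*))

    println(filtered.count())
    spark.stop()
  }
}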

2, increase the reduce-side parallelism of the shuffle

When Spark performs a shuffle, it uses HashPartitioner by default to partition the data. If the reduce-side parallelism is set inappropriately, a large number of different keys can easily be assigned to the same task, so that one task processes much more data than the others, resulting in data skew.

Increasing the reduce-side parallelism reduces the amount of data each reduce-side task has to process, which eases the data skew.

How to set it: when executing a shuffle operator on an RDD, pass an extra parameter to the operator, for example reduceByKey(_ + _, 1000); this parameter sets the number of shuffle-read tasks used when the operator performs its shuffle. For shuffle-type statements in Spark SQL, such as group by and join, set the parameter spark.sql.shuffle.partitions, which represents the parallelism of the shuffle-read tasks; its default value is 200, which is a bit too small for many scenarios.
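A minimal Scala sketch of both settings on a word-count style job; the input path and the value 1000 are illustrative:

import org.apache.spark.sql.SparkSession

object IncreaseShuffleParallelism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IncreaseShuffleParallelism")
      // For Spark SQL shuffles (group by, join, ...): raise the default of 200.
      .config("spark.sql.shuffle.partitions", "1000")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.textFile("hdfs:///input/words")   // path is illustrative
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    // For RDD shuffle operators, pass the partition count directly:
    // reduceByKey will run its shuffle-read side with 1000 tasks instead of the default.
    val counts = pairs.reduceByKey(_ + _, 1000)
    counts.take(10).foreach(println)
    spark.stop()
  }
}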

3, two-stage aggregation with random key prefixes

The groupByKey and reduceByKey operators in Spark both perform a shuffle. If the amount of map-side data per key differs greatly, data skew easily results.

The idea is: first prepend a random number to each key to scatter the data, so that records that originally shared one key are spread over multiple keys; then perform a first aggregation; once it completes, strip the random number off to restore the original key, and aggregate again. Handling skewed data this way gives a much better result.

It works as follows:

(Figure: two-stage aggregation with random key prefixes)

This scheme is suitable for aggregation-type shuffle operators such as reduceByKey on RDDs, or group by statements used for grouping and aggregation in Spark SQL.
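A minimal Scala sketch of this two-stage aggregation on a word-count style pair RDD; the input path and the prefix range n = 10 are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object TwoStageAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TwoStageAggregation"))
    val pairs = sc.textFile("hdfs:///input/words").flatMap(_.split(" ")).map((_, 1L))
    val n = 10

    // Stage 1: scatter each key by prepending a random prefix, then aggregate.
    val prefixed = pairs.map { case (key, value) => (s"${Random.nextInt(n)}_$key", value) }
    val partial  = prefixed.reduceByKey(_ + _)

    // Stage 2: strip the prefix to restore the original key, then aggregate again.
    val restored = partial.map { case (prefixedKey, value) =>
      (prefixedKey.split("_", 2)(1), value)
    }
    val result = restored.reduceByKey(_ + _)
    result.take(10).foreach(println)
    sc.stop()
  }
}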

4, convert a reduce-side join into a map-side join

Joining two RDDs triggers a shuffle; if the amount of data per key is unevenly distributed, data skew occurs.

In this case, if one of the two RDDs does not contain much data, we can collect its data, turn it into a broadcast variable, and then run a map operator over the RDD with the large amount of data, performing the join against the broadcast variable inside the map operator. This avoids the shuffle of the join, and therefore avoids the data skew that could occur during that shuffle.
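A minimal Scala sketch of such a map-side join; the input paths and the comma-separated record format are made up, and smallRdd is assumed small enough to be collected to the driver:

import org.apache.spark.{SparkConf, SparkContext}

object MapSideJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapSideJoin"))
    // bigRdd is large and possibly skewed; smallRdd is small enough to broadcast.
    val bigRdd   = sc.textFile("hdfs:///input/orders").map(line => (line.split(",")(0), line))
    val smallRdd = sc.textFile("hdfs:///input/users").map(line => (line.split(",")(0), line))

    // Pull the small RDD to the driver and broadcast it; no shuffle is needed afterwards.
    val smallMap = sc.broadcast(smallRdd.collectAsMap())

    // "Join" inside a map over the big RDD by looking each key up in the broadcast map.
    val joined = bigRdd.flatMap { case (key, bigValue) =>
      smallMap.value.get(key).map(smallValue => (key, (bigValue, smallValue)))
    }
    joined.take(10).foreach(println)
    sc.stop()
  }
}

Note that this only works when the small RDD fits comfortably in driver and executor memory, since every executor holds a full copy of the broadcast map.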

5, sample the skewed keys and split the join

This applies to the following situation: when joining two RDDs, one RDD has a few keys with very large amounts of data that cause the skew, while the data of the other RDD is distributed relatively evenly. In that case we can use this method.

1, sample the RDD that contains the few keys with very large data volumes using the sample operator, count the number of records per key in the sample, and work out which few keys have the largest data volumes.

2, then split the data corresponding to these few keys out of the original RDD to form a separate RDD, prefixing each key with a random number less than n; the majority of keys that do not cause skew form another RDD.

3, then from the other RDD to be joined, also filter out the data corresponding to those few skewed keys to form a separate RDD, and expand each record into n records, attaching the prefixes 0 to n-1 to them in turn; the majority of keys that do not cause skew likewise form another RDD.

4, then join the independent RDD with the random prefixes against the independent RDD that was expanded n times; records that originally shared one key are now split into n parts and spread over multiple tasks to perform the join.

5, meanwhile the other two ordinary RDDs are joined as usual. Finally, merge the two join results with the union operator to obtain the final join result. A code sketch follows the figure below.

The process is illustrated below:

(Figure: sample the skewed keys and join them separately)
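A minimal Scala sketch of the five steps above; the input paths, the prefix range n = 10, the sampling fraction, and the choice to treat only the single most frequent sampled key as skewed are all illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SampleAndSplitJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SampleAndSplitJoin"))
    // rddA holds a few heavily skewed keys; rddB is roughly uniform.
    val rddA = sc.textFile("hdfs:///input/a").map(l => (l.split(",")(0), l))
    val rddB = sc.textFile("hdfs:///input/b").map(l => (l.split(",")(0), l))
    val n = 10

    // Step 1: sample rddA and find the most frequent key(s).
    val skewedKeys = rddA.sample(withReplacement = false, 0.1)
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .keys
      .take(1)
      .toSet
    val skewedKeysBc = sc.broadcast(skewedKeys)

    // Step 2: split rddA into the skewed part (with a random 0..n-1 prefix) and the rest.
    val skewedA = rddA.filter { case (k, _) => skewedKeysBc.value.contains(k) }
      .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
    val commonA = rddA.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

    // Step 3: take the matching keys from rddB and expand each record n times, prefixed 0..n-1.
    val skewedB = rddB.filter { case (k, _) => skewedKeysBc.value.contains(k) }
      .flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }
    val commonB = rddB.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

    // Step 4: join the prefixed parts (the skewed key is now spread over n tasks), strip the prefix.
    val skewedJoined = skewedA.join(skewedB)
      .map { case (prefixedKey, v) => (prefixedKey.split("_", 2)(1), v) }

    // Step 5: join the ordinary parts as usual and union the two results.
    val result = skewedJoined.union(commonA.join(commonB))
    result.take(10).foreach(println)
    sc.stop()
  }
}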

6, join using random prefixes and RDD expansion

If, during a join operation, a large number of keys in an RDD cause data skew, then splitting out individual keys makes no sense; in that case this scheme can be used.

The basic idea of this scheme is similar to the previous one: first examine the data distribution in the RDDs and find the RDD that causes the data skew, for example one in which many keys each map to large amounts of data. Then prefix every key in that RDD with a random number within n. At the same time, expand the other RDD n times, turning each record into n records, with the expanded records prefixed in turn with 0 to n-1. Then join the two RDDs.
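A minimal Scala sketch of this scheme; the input paths and n = 10 are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object RandomPrefixExpandJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RandomPrefixExpandJoin"))
    // skewedRdd has many skewed keys; otherRdd will be expanded n times.
    val skewedRdd = sc.textFile("hdfs:///input/skewed").map(l => (l.split(",")(0), l))
    val otherRdd  = sc.textFile("hdfs:///input/other").map(l => (l.split(",")(0), l))
    val n = 10

    // Prefix every key of the skewed RDD with a random number in [0, n).
    val prefixedSkewed = skewedRdd.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

    // Expand the other RDD n times, attaching prefixes 0..n-1, so every prefixed key can match.
    val expandedOther = otherRdd.flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }

    // Join, then strip the prefix to recover the original key.
    val joined = prefixedSkewed.join(expandedOther)
      .map { case (prefixedKey, v) => (prefixedKey.split("_", 2)(1), v) }
    joined.take(10).foreach(println)
    sc.stop()
  }
}

Keep in mind that expanding the other RDD multiplies its size by n, so this trades memory and I/O for a more even key distribution.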

Compared with the previous scheme, this one skips the sampling step, because the previous scheme targets an RDD in which only a very small number of keys suffer data skew and those keys need special handling, whereas this scheme targets an RDD in which a large number of keys suffer data skew, so there is no need to sample.

Reproduced from: https://www.jianshu.com/p/768d9e8536c3


Source: blog.csdn.net/weixin_34183910/article/details/91073741