[Spark] Data skew

1. Meaning and harm

During computation, the data is not dispersed evenly enough, so a large amount of it ends up concentrated on one or a few machines.

The skewed portion runs far slower than the average task, so the whole job becomes too slow.
Some tasks process too much data and may hit OOM, causing the task and then the application to fail.

2. Phenomenon and reasons

1. Phenomenon: (Spark log or monitoring)

1. Executor lost, (Driver) OOM, and errors during the shuffle process;
2. Tasks that normally run fine suddenly start failing;
3. A single Executor runs for an extremely long time, and the job as a whole is stuck at a certain stage and never finishes;

Spark Streaming is especially prone to data skew, above all for join and group operations (including those expressed in SQL): since its executors are usually given little memory, skew easily leads to OOM.

2. Reason

Data skew only occurs in the shuffle phase. During a shuffle, all records with the same key must be pulled to a single task on some node for processing, for example when aggregating or joining by key. If one key has a particularly large amount of data associated with it, data skew occurs.

Operators that trigger a shuffle: distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, etc.

3. Locating the skew

1. Spark Web UI

Check the amount of data allocated to each task to determine whether the problem is caused by uneven data distribution.

  1. The execution of a certain task is particularly slow.
    1) Know which operators commonly trigger a shuffle;
    2) Determine from the log which stage the job is in, then check in the Spark Web UI how much data is allocated to each task of that stage; if the distribution is severely uneven, data skew is very likely.
    3) Work back from the stage-division rules to the code at fault, and pinpoint the shuffle operator. (This part requires an in-depth understanding of how Spark divides stages.)

  2. A certain task inexplicably hits OOM.
    Locate it directly from the exception stack in the log (yarn-client/yarn-cluster mode).

2. Statistics by Key

Because the full data set is huge, sample it, count how often each key occurs, and take the top few keys by count. If most keys are evenly distributed while a few are several orders of magnitude larger, data skew has occurred.
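
A minimal Scala sketch of this sampling step, assuming a pair RDD named `pairRdd` (the name and the 10% sampling fraction are illustrative):

```scala
// Sample the data and count occurrences per key to spot skewed keys.
val sampled = pairRdd
  .sample(withReplacement = false, fraction = 0.1) // ~10% sample is usually enough
  .map { case (key, _) => (key, 1L) }
  .reduceByKey(_ + _)

// Take the top 10 keys by count; if a few counts are orders of magnitude
// larger than the rest, those keys are the likely cause of the skew.
sampled.map { case (key, cnt) => (cnt, key) }
  .sortByKey(ascending = false)
  .take(10)
  .foreach(println)
```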

4. Solutions

1. Avoid data skew at the data source

  1. Filter abnormal data
    Content: handle abnormal/null values correctly; where appropriate, drop invalid data that barely affects the results; for valid but unevenly distributed data, use a custom Partitioner to re-hash it, break it up so it is processed with more parallelism, and then aggregate again (a sketch of a custom Partitioner follows this list).

  2. Hive ETL preprocessing data
    Content: aggregate the data by key (or join the other tables) ahead of time in a Hive ETL job, so that the Spark job no longer needs shuffle operators to perform these operations.
    Pros and cons: data skew is completely avoided in Spark and job performance improves greatly, but this treats the symptom rather than the cause; the skew is merely moved upstream into the ETL.
    Applicable: when the Spark job must respond quickly and the heavy work runs infrequently, pushing it forward into the ETL gives users a better experience.
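
A minimal sketch of the custom Partitioner mentioned above, assuming String keys (all names are illustrative). Note that a Partitioner can only spread out keys that happen to collide under the default HashPartitioner; a single huge key still lands in one partition, which is what the salting in Solution 4 addresses:

```scala
import org.apache.spark.Partitioner

// Re-hash keys so that hot keys colliding under the default hash are
// spread over a larger number of partitions.
class RehashPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0, "number of partitions must be positive")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    // Mix the hash code once more before taking the modulo.
    val mixed = scala.util.hashing.MurmurHash3.stringHash(key.toString)
    val mod = mixed % partitions
    if (mod < 0) mod + partitions else mod
  }
}

// Usage (illustrative): pairRdd.partitionBy(new RehashPartitioner(400))
```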

2. Filter out the few keys that cause the skew

Content: if the few keys with an unusually large amount of data are not important to the computation or its results, simply filter them out.
Pros and cons: simple to implement and very effective, but the applicable scenarios are limited; in most cases many keys cause the skew, not just a few.
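
A minimal sketch, assuming the skewed keys found by the earlier sampling step can safely be dropped (the key values are illustrative):

```scala
// Keys identified as skewed and known to be irrelevant to the result.
val skewedKeys = Set("", "null", "UNKNOWN")

// Drop those keys before the shuffle.
val filtered = pairRdd.filter { case (key, _) => !skewedKeys.contains(key) }
```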

3. Improve the parallelism of shuffle operations

Content : increase the number of shuffle read tasks, either by passing a larger partition count to the shuffle operator itself or, for Spark SQL, by raising spark.sql.shuffle.partitions (default 200). Keys that were previously all assigned to one task are spread over several tasks, so each task processes less data than before.
Pros and cons : simple to implement and effective at alleviating the problem, but it does not solve it completely; all records of a single huge key still end up in the same task.
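
A minimal sketch of both knobs, assuming `pairRdd` from above and an active SparkSession named `spark`:

```scala
// RDD API: pass the number of partitions directly to the shuffle operator.
val counts = pairRdd.reduceByKey(_ + _, 1000) // instead of the default parallelism

// Spark SQL / DataFrames: raise the shuffle partition count (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "1000")
```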

4. Two-stage aggregation (local aggregation + global aggregation)

Content : first do a local aggregation: prepend a random number to every key, run a local aggregation such as reduceByKey, then strip the prefix from each key and aggregate again globally to get the final result.
Pros and cons : works for aggregation (grouping) operations such as reduceByKey / group by; for join-type shuffles, other solutions are needed.
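
A minimal sketch of the two stages for a word-count-style reduceByKey, assuming `pairRdd: RDD[(String, Long)]` and a salt range of 10 (both illustrative):

```scala
import scala.util.Random

val n = 10 // how many pieces each hot key is broken into

// Stage 1: local aggregation under a random prefix.
val locallyAggregated = pairRdd
  .map { case (key, value) => (s"${Random.nextInt(n)}_$key", value) }
  .reduceByKey(_ + _)

// Stage 2: strip the prefix and aggregate globally.
val result = locallyAggregated
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
  .reduceByKey(_ + _)
```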

5. Convert reduce join to map join

Content : when joining a large table with a small table (roughly < 2 GB), do not use the join operator at all; implement the join with a broadcast variable plus map-side operators, avoiding the shuffle entirely. Broadcast the full data of the smaller RDD, then, for each record of the larger RDD, look up its key in the broadcast data and connect the matching records.
Pros and cons : works very well for skewed joins, but only applies to a large-table/small-table join; if the "small" table is too large, broadcasting it will cause OOM.
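
A minimal sketch of the broadcast (map-side) join, assuming pair RDDs named `largeRdd` and `smallRdd` and a SparkContext `sc` (all illustrative):

```scala
// Collect the small RDD to the driver and broadcast it to every executor.
val smallMap = smallRdd.collectAsMap()
val smallBroadcast = sc.broadcast(smallMap)

// Perform the join on the map side: no shuffle is triggered.
val joined = largeRdd.flatMap { case (key, value) =>
  smallBroadcast.value.get(key).map(other => (key, (value, other)))
}
```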

6. Sample the skewed keys and split the join operation

Applicable : when joining two RDDs/Hive tables whose data volumes are both large, where a small number of keys in one RDD carry too much data while the keys in the other RDD are evenly distributed.
Content : use sample to find the keys with the largest data volume, split those keys out of the original RDD into a separate RDD, and prepend a random number within n to each of them as a prefix. From the other RDD, also filter out the records with those keys into a separate RDD and expand each of its records into n records, one for each prefix 0 to n-1. The remaining keys, which do not cause skew, form another pair of RDDs and are joined as usual. What used to be a single key is now broken into n parts spread over multiple tasks for the join; finally, union the two join results to get the final result.
Pros and cons : only the few skewed keys are broken up, and only the matching keys of the other RDD are expanded n times. But if there are too many skewed keys, say thousands, this approach is not suitable.
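
A minimal sketch of the split join, assuming `rddA` holds a few skewed keys and `rddB` is evenly distributed (the names, the 10% sample, and n = 10 are illustrative):

```scala
import scala.util.Random

val n = 10

// 1. Sample rddA and take the key with the most records as the skewed key.
val skewedKeys = rddA.sample(false, 0.1)
  .map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)
  .map { case (k, c) => (c, k) }
  .sortByKey(false)
  .take(1)
  .map(_._2)
  .toSet

// 2. Split both RDDs into a skewed part and a normal part.
val skewedA = rddA.filter { case (k, _) => skewedKeys.contains(k) }
val normalA = rddA.filter { case (k, _) => !skewedKeys.contains(k) }
val skewedB = rddB.filter { case (k, _) => skewedKeys.contains(k) }
val normalB = rddB.filter { case (k, _) => !skewedKeys.contains(k) }

// 3. Salt the skewed keys of A; expand the matching records of B n times.
val saltedA   = skewedA.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
val expandedB = skewedB.flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }

// 4. Join the two parts separately, strip the salt, and union the results.
val joinedSkewed = saltedA.join(expandedB)
  .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }
val result = joinedSkewed.union(normalA.join(normalB))
```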

7. Use a random prefix and expand the RDD

Content : similar to Solution 6, but for the case where a join is skewed because a large number of keys in one RDD carry too much data. Prefix every key of that RDD (A_RDD) with a random number, and expand the other RDD (B_RDD) n times, once per prefix. The difference from Solution 6 is that that solution targets a small number of keys and splits them out; here so many keys are skewed that splitting some of them out is pointless, so the data of the entire RDD has to be expanded, which demands a lot of memory.
Pros and cons : effective for skewed join-type operations, though it alleviates the skew rather than avoiding it completely, and the expansion requires a lot of memory.
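
A minimal sketch, again with illustrative names `rddA` (heavily skewed) and `rddB`, and n = 10. It is the salting from Solution 6 applied to the whole RDD instead of a few split-out keys:

```scala
import scala.util.Random

val n = 10

// Prefix every key of the skewed RDD with a random number in [0, n).
val saltedA = rddA.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

// Expand the other RDD n times, once per prefix (this is the memory-hungry part).
val expandedB = rddB.flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }

// Join on the salted keys, then strip the prefix from the result.
val joined = saltedA.join(expandedB)
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
```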

8. Combining the solutions

Content : first use Solutions 1 and 2 to preprocess and filter part of the data; then increase the shuffle parallelism; finally, choose the appropriate solution above for each aggregation or join operation.

9. Other angles:

Business perspective : for example, if Shanghai contributes a disproportionate amount of data, count it separately and merge the result with the other cities at the end;
Program level : count(distinct) funnels everything into a single reduce; rewrite it as a group by on the column with a count wrapped around the outside (see the sketch after this list);
Parameter tuning : Spark ships with many parameters, and using them sensibly can solve most problems.
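
A minimal sketch of that count(distinct) rewrite in Spark SQL, assuming a table `logs` with a column `user_id` and a SparkSession `spark` (the names are illustrative):

```scala
// Instead of: SELECT COUNT(DISTINCT user_id) FROM logs
// group by the column first, then count the groups in an outer query.
val distinctUsers = spark.sql(
  """
    |SELECT COUNT(*) AS distinct_users
    |FROM (
    |  SELECT user_id FROM logs GROUP BY user_id
    |) t
  """.stripMargin)
```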

