Spark Performance Optimization Guide: Common Approaches to Data Skew

When Spark processes data, data skew problems often occur, and alleviating data skew brings an obvious improvement in Spark computing efficiency. Below I have listed several common solutions to data skew, summarized from the Meituan Tech team.

Symptoms when data skew occurs

Most tasks execute very quickly, but a few individual tasks are extremely slow. For example, out of a total of 1,000 tasks, 997 finish within one minute, while the remaining two or three take an hour or two. This situation is very common.

A Spark job that previously ran normally suddenly reports an OOM (out of memory) exception one day, and the exception stack points to the business code we wrote. This situation is relatively rare.

The principle of data skew

The principle of data skew is simple: during a shuffle, all records with the same key on every node must be pulled to a single task on one node for processing, for example to aggregate or join by that key. If the amount of data for a particular key is especially large, data skew occurs. For example, most keys correspond to 10 records, but a few keys correspond to 1 million records. Then most tasks may be assigned only 10 records and finish in one second, while the few tasks assigned 1 million records may run for an hour or two. The progress of the entire Spark job is therefore determined by the task that runs the longest.

Therefore, when data is skewed, the Spark job appears to run very slowly, and a single task processing a very large amount of data may even cause an out-of-memory error.

The following picture is a very clear example: the key hello corresponds to a total of 7 records across three nodes, and these records are all pulled into the same task for processing, while the keys world and you each correspond to only one record, so the other two tasks only need to process one record each. The first task may therefore run 7 times longer than the other two, and the running speed of the whole stage is determined by its slowest task.

[Figure: the 7 records for the key hello are pulled into one task, while world and you each map to a single record]

Data skew solutions

1. Filter out the few keys that cause skew

Applicable scenarios: If only a few keys cause the skew, and their impact on the computation itself is small, this solution is very suitable. For example, 99% of the keys correspond to 10 records each, but one key corresponds to 1 million records, causing data skew.

Solution implementation idea: If we judge that the few keys with a very large amount of data are not particularly important to the job's execution and results, simply filter out those keys. For example, in Spark SQL you can use a where clause to filter them out, or in Spark Core you can execute the filter operator on the RDD. If you need to dynamically determine which keys have the most data each time the job runs, you can use the sample operator to sample the RDD, count the occurrences of each key, and then filter out the keys with the largest counts.
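A minimal sketch of the dynamic variant in Spark (Scala), assuming a pair RDD of (key, value) records; the sampling fraction, topN, and the function name filterSkewedKeys are illustrative, not from the original article:

```scala
import org.apache.spark.rdd.RDD

// Sketch: sample the RDD, estimate per-key counts, and drop the heaviest keys.
// Assumes pairRdd: RDD[(String, String)]; fraction and topN are illustrative.
def filterSkewedKeys(pairRdd: RDD[(String, String)], topN: Int = 3): RDD[(String, String)] = {
  // Sample ~10% of the data to estimate key frequencies cheaply.
  val sampledCounts = pairRdd
    .sample(withReplacement = false, fraction = 0.1)
    .map { case (key, _) => (key, 1L) }
    .reduceByKey(_ + _)

  // Take the topN heaviest keys in the sample -- assumed droppable for the job's result.
  val skewedKeys = sampledCounts
    .top(topN)(Ordering.by(_._2))
    .map(_._1)
    .toSet

  // Keep everything except the skewed keys; skewedKeys is tiny, so a plain closure is fine.
  pairRdd.filter { case (key, _) => !skewedKeys.contains(key) }
}
```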

Solution implementation principle: After the keys that cause data skew are filtered out, they no longer participate in the computation, so they naturally cannot produce data skew.

Advantages of the solution: Simple to implement and very effective; it can completely avoid data skew.

Disadvantages of the solution: The applicable scenarios are limited. In most cases, many keys cause the skew, not just a few.

Practical experience with the solution: We have also used this solution in a project to deal with data skew. Once we found that a Spark job suddenly hit an OOM while running; tracing it back, we discovered that one key in a Hive table had abnormal data that day, causing a sudden surge in data volume. Therefore, we now sample before each run, compute the keys with the largest counts in the sample, and filter those keys out directly in the program.

2. Increase the parallelism of shuffle operations

Applicable scenarios: If we must face the data skew head-on, it is recommended to try this solution first, because it is the simplest way to handle data skew.

Solution implementation idea: When executing a shuffle operator on an RDD, pass a parameter to the operator, for example reduceByKey(1000). This parameter sets the number of shuffle read tasks for that operator. For shuffle-inducing statements in Spark SQL, such as group by and join, you need to set a configuration parameter, spark.sql.shuffle.partitions, which controls the parallelism of the shuffle read tasks. Its default value of 200 is a bit too small for many scenarios.
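A brief sketch, assuming pairRdd is an existing RDD of (String, Long) pairs and spark is an existing SparkSession; the partition count of 1000 is illustrative:

```scala
// Shuffle operator on an RDD: pass the desired number of shuffle read tasks directly.
val aggregated = pairRdd.reduceByKey(_ + _, 1000) // 1000 shuffle read tasks instead of the default

// Spark SQL (group by / join): raise the shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", "1000")
// Equivalently, via spark-submit:
//   --conf spark.sql.shuffle.partitions=1000
```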

Solution implementation principle: Increasing the number of shuffle read tasks allows the multiple keys originally assigned to one task to be spread across several tasks, so each task processes less data than before. For example, if there are originally 5 keys, each corresponding to 10 records, and all 5 keys are assigned to one task, that task processes 50 records. After increasing the number of shuffle read tasks, each task may be assigned one key and thus processes only 10 records, so naturally each task's execution time becomes shorter. The specific principle is shown in the figure below.

Advantages of the solution: It is relatively simple to implement and can effectively mitigate the effects of data skew.

Disadvantages of the solution: It only relieves data skew rather than eradicating it. In practice, its effect is limited.

Practical experience with the solution: This solution usually cannot completely solve data skew. In extreme cases, such as a single key corresponding to 1 million records, no matter how much you increase the number of tasks, that key's 1 million records will still be assigned to one task, so data skew is bound to occur. This solution is therefore only the first thing to try when skew is found, an attempt to relieve the skew in the simplest way, or something to use in combination with other solutions.
[Figure: increasing the number of shuffle read tasks spreads the keys across more tasks]

3. Two-stage aggregation (local aggregation + global aggregation)

Applicable scenarios: This solution is suitable when performing aggregation-class shuffle operators such as reduceByKey on an RDD, or when using a group by statement for grouped aggregation in Spark SQL.

Solution implementation idea: The core idea of this solution is to aggregate in two stages. The first stage is local aggregation: give each key a random prefix, such as a random number within 10, so that identical keys become different. For example, (hello, 1) (hello, 1) (hello, 1) (hello, 1) becomes (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then run an aggregation such as reduceByKey on the prefixed data to perform local aggregation, giving a local result such as (1_hello, 2) (2_hello, 2). Next, remove the prefix from each key to get (hello, 2) (hello, 2), and perform a global aggregation once more to obtain the final result, such as (hello, 4).

Solution implementation principle: By adding a random prefix, identical keys become multiple different keys, so the data originally processed by one task is distributed across multiple tasks for local aggregation, which solves the problem of a single task handling too much data. The random prefix is then removed and a global aggregation is performed again to obtain the final result. The specific principle is shown in the figure below.
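A minimal sketch of two-stage aggregation in Spark (Scala), assuming wordCounts is an RDD of (String, Long) pairs; the salt count of 10 and the function name twoStageAggregate are illustrative:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Sketch of two-stage aggregation. Assumes wordCounts: RDD[(String, Long)];
// the salt count (10) and the function name are illustrative.
def twoStageAggregate(wordCounts: RDD[(String, Long)], salts: Int = 10): RDD[(String, Long)] = {
  // Stage 1: local aggregation -- prefix each key with a random number within `salts`.
  val locallyAggregated = wordCounts
    .map { case (key, value) => (s"${Random.nextInt(salts)}_$key", value) }
    .reduceByKey(_ + _)

  // Stage 2: global aggregation -- strip the random prefix and aggregate again.
  locallyAggregated
    .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
    .reduceByKey(_ + _)
}
```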

Advantages of the solution: It works very well for data skew caused by aggregation-class shuffle operations. It can usually resolve the skew, or at least greatly alleviate it, improving Spark job performance by several times or more.

Disadvantages of the solution: It only applies to aggregation-class shuffle operations, so its scope is relatively narrow. For join-class shuffle operations, other solutions must be used.
[Figure: two-stage aggregation with random key prefixes, local aggregation, then global aggregation]

4. Convert reduce join to map join

Applicable scenarios: This solution is suitable when using the join operator on RDDs or a join statement in Spark SQL, and the data volume of one of the RDDs or tables in the join is relatively small (for example, a few hundred MB or one to two GB).

Solution implementation idea: Instead of using the join operator for the join, use a Broadcast variable plus map-class operators, thereby completely avoiding the shuffle and hence completely avoiding data skew. Pull the data of the smaller RDD into the Driver's memory with the collect operator and create a Broadcast variable from it; then execute a map-class operator on the other RDD, and in the operator function fetch the full data of the smaller RDD from the Broadcast variable, compare it with each record of the current RDD by the join key, and if the keys match, connect the two records in whatever way you need.

Solution implementation principle: An ordinary join goes through the shuffle process. Once the shuffle happens, records with the same key are pulled into one shuffle read task and then joined, which is a reduce join. But if one RDD is relatively small, you can broadcast the small RDD's full data and use map-class operators to achieve the same effect as a join, i.e., a map join. In this case no shuffle occurs, so no data skew occurs. The specific principle is shown in the figure below.
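A minimal sketch of a broadcast map join in Spark (Scala), assuming smallRdd fits comfortably in driver and executor memory and has unique keys; the function name mapJoin and the RDD names are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch of a broadcast map join. Assumes smallRdd fits in driver/executor memory and
// has unique keys; bigRdd is the large, possibly skewed side. Names are illustrative.
def mapJoin(sc: SparkContext,
            bigRdd: RDD[(String, String)],
            smallRdd: RDD[(String, String)]): RDD[(String, (String, String))] = {
  // Pull the small RDD to the driver and broadcast it as a lookup map (one value per key).
  val smallMap = smallRdd.collectAsMap()
  val smallBroadcast = sc.broadcast(smallMap)

  // Perform the join inside a map-side operation -- no shuffle, hence no skew.
  bigRdd.flatMap { case (key, bigValue) =>
    smallBroadcast.value.get(key).map(smallValue => (key, (bigValue, smallValue)))
  }
}
```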

Advantages of the solution: It works very well for data skew caused by join operations, because no shuffle happens at all, and therefore no data skew occurs at all.

Disadvantages of the solution: The applicable scenarios are fewer, because it only works when joining a large table with a small table. After all, we need to broadcast the small table, which consumes more memory resources: the driver and every Executor hold a full copy of the small RDD's data. If the broadcast RDD is relatively large, say more than 10 GB, a memory overflow may occur. It is therefore not suitable when both sides are large tables.
[Figure: broadcasting the small RDD and joining on the map side avoids the shuffle]

5. Sample the skewed keys and split the join operation

Applicable scenarios: When joining two RDDs / Hive tables, if the data volume is relatively large and solution 4 (converting the join to a map join) cannot be adopted, look at the key distribution in the two RDDs / Hive tables. If the skew is caused by a few keys in one RDD / Hive table having too much data, while the keys in the other RDD / Hive table are evenly distributed, this solution is quite appropriate.

Solution implementation idea:
* For the RDD that contains the few keys with a very large amount of data, take a sample with the sample operator, count the occurrences of each key, and determine which keys have the most data.
* Split the data for these keys out of the original RDD into a separate RDD, prefixing each key with a random number within n; the majority of keys that do not cause skew form another RDD.
* Likewise, from the other RDD to be joined, filter out the data corresponding to those skewed keys to form a separate RDD, expanding each record into n records that are sequentially given the prefixes 0 through n-1; its keys that do not cause skew also form another RDD.
* Then join the salted standalone RDD with the other standalone RDD that was expanded n times. The originally identical key is now scattered into n parts and distributed across multiple tasks for the join.
* The other two ordinary RDDs can be joined as usual.
* Finally, combine the results of the two joins with the union operator to obtain the final join result.

Solution implementation principle: For data skew caused by a join, if only a few keys cause the skew, those keys can be split into a standalone RDD and given random prefixes to break them into n parts for the join. The data for each such key is then no longer concentrated in a few tasks but spread across multiple tasks. The specific principle is shown in the figure below.
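A minimal sketch in Spark (Scala), assuming rddA is skewed on a small number of keys and rddB is evenly distributed; the sampling fraction, the choice of taking only the single heaviest key, n, and the function name skewedJoin are all illustrative:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Sketch of splitting out skewed keys. Assumes rddA is skewed on a few keys while rddB
// is evenly distributed; the sampling fraction, top-1 choice, n and names are illustrative.
def skewedJoin(rddA: RDD[(String, String)],
               rddB: RDD[(String, String)],
               n: Int = 100): RDD[(String, (String, String))] = {
  // 1. Sample rddA and pick the heaviest key(s) -- here just the single heaviest one.
  val skewedKeys = rddA.sample(withReplacement = false, fraction = 0.1)
    .map { case (k, _) => (k, 1L) }
    .reduceByKey(_ + _)
    .top(1)(Ordering.by(_._2))
    .map(_._1)
    .toSet

  // 2. Split rddA: salt the skewed part with a random prefix within n; keep the rest as-is.
  val skewedA = rddA.filter { case (k, _) => skewedKeys.contains(k) }
    .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
  val normalA = rddA.filter { case (k, _) => !skewedKeys.contains(k) }

  // 3. Split rddB: expand its skewed part n times, one copy per prefix 0..n-1.
  val skewedB = rddB.filter { case (k, _) => skewedKeys.contains(k) }
    .flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }
  val normalB = rddB.filter { case (k, _) => !skewedKeys.contains(k) }

  // 4. Join the two pairs separately, strip the salt, then union the results.
  val joinedSkewed = skewedA.join(skewedB)
    .map { case (saltedKey, values) => (saltedKey.split("_", 2)(1), values) }
  val joinedNormal = normalA.join(normalB)

  joinedSkewed.union(joinedNormal)
}
```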

Advantages of the solution: For data skew caused by a join where only a few keys cause the skew, this approach breaks up those keys for the join in the most effective way. Moreover, only the data for the few skewed keys needs to be expanded n times; there is no need to expand the full dataset, which avoids excessive memory usage.

Disadvantages of the solution: If a particularly large number of keys cause the skew, for example thousands of keys, this approach is not suitable.
[Figure: splitting out the skewed keys, salting and expanding, joining separately, then unioning the results]

6. Join using random prefixes and an expanded RDD

Applicable scenarios: If a large number of keys in an RDD cause data skew during the join operation, splitting out individual keys is pointless; in that case only this last solution can be used.

Solution implementation idea:
* The implementation idea is basically similar to solution 5. First examine the data distribution in the RDDs / Hive tables and find the one that causes the data skew, for example one where many keys each correspond to more than 10,000 records.
* Prefix every record of that RDD with a random number within n.
* At the same time, expand the other, normal RDD: turn every record into n records, each sequentially given a prefix from 0 through n-1.
* Finally, join the two processed RDDs.

Solution implementation principle: By appending a random prefix, identical keys become different keys, and these processed "different keys" can be distributed across multiple tasks instead of one task handling a huge number of identical keys. The difference from solution 5 is that solution 5 applies special processing only to the data of a few skewed keys; since the processing requires expanding an RDD, and solution 5 only expands the data for those few keys, its extra memory usage is small. This solution, in contrast, targets the case of a large number of skewed keys, where a few keys cannot be split out for separate handling, so the entire RDD must be expanded, which demands a lot of memory.
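A minimal sketch in Spark (Scala), assuming skewedRdd is the side with many skewed keys and otherRdd is the side that gets expanded n times; n and the function name saltedJoin are illustrative:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Sketch of the full random-prefix + expansion join. Assumes skewedRdd has many skewed
// keys and otherRdd is the side that gets expanded n times; n and names are illustrative.
def saltedJoin(skewedRdd: RDD[(String, String)],
               otherRdd: RDD[(String, String)],
               n: Int = 10): RDD[(String, (String, String))] = {
  // Prefix every record of the skewed RDD with a random number in [0, n).
  val saltedSkewed = skewedRdd.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

  // Expand every record of the other RDD into n records, one per prefix 0..n-1.
  val expandedOther = otherRdd.flatMap { case (k, v) =>
    (0 until n).map(i => (s"${i}_$k", v))
  }

  // Join on the salted keys, then strip the prefix to recover the original key.
  saltedSkewed.join(expandedOther)
    .map { case (saltedKey, values) => (saltedKey.split("_", 2)(1), values) }
}
```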

Advantages of the solution: It can handle basically any join-class data skew, and the effect is relatively significant, with a very good performance improvement.

Disadvantages of the solution: This solution alleviates data skew rather than avoiding it completely, and it requires expanding the entire RDD, which demands a lot of memory.

Practical experience with the solution: While developing a data requirement we found that a join caused data skew. Before optimization the job took about 60 minutes to run; after applying this solution the execution time was shortened to about 10 minutes, a 6x performance improvement.



Source: blog.csdn.net/weixin_42134034/article/details/105623070