Spark from 0 to 1 (11): Solving Data Skew in Spark

1. Use Hive ETL to preprocess data

1.1 Application scenarios

This solution fits when the skew originates in the Hive table itself: the data in the Hive table is unevenly distributed (for example, one key corresponds to 1 million records while other keys correspond to only 10 records each), and the business scenario requires frequently running Spark analysis jobs against that table.

1.2 Implementation idea

Evaluate whether the data can be preprocessed in Hive (that is, pre-aggregate the data by key, or pre-join it with the other tables, in a Hive ETL job), so that the Spark job no longer reads the original Hive table but the preprocessed one. Because the aggregation or join has already been done, the Spark job no longer needs to run the shuffle operators that originally performed those operations.
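
A minimal sketch of the idea, assuming Hive support is enabled; the table and column names (user_visits, user_visits_agg, user_id) are hypothetical, and the target table is assumed to already exist:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-etl-preprocess")
  .enableHiveSupport()
  .getOrCreate()

// Step 1 (run once as the Hive ETL job, e.g. nightly): pre-aggregate by key,
// so the skew-prone group by happens here instead of in every Spark job.
spark.sql(
  """
    |INSERT OVERWRITE TABLE user_visits_agg
    |SELECT user_id, COUNT(*) AS visit_cnt
    |FROM user_visits
    |GROUP BY user_id
  """.stripMargin)

// Step 2 (the frequently run Spark analysis job): read the pre-aggregated table,
// so no shuffle on the skewed key is needed here any more.
val agg = spark.table("user_visits_agg")
agg.show()
```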

1.3 Implementation principle

This solution removes data skew from the Spark job at the root: since the shuffle operators are no longer executed in Spark, there is simply no opportunity for skew to occur there. A reminder, though: this treats the symptom rather than the cause. The data itself is still unevenly distributed, so when the Hive ETL performs its group by or join shuffle, the skew still happens and the Hive ETL itself can be very slow. We merely move the skew forward into the Hive ETL stage so that the Spark program avoids it.

2. Filter out the few keys that cause the skew

2.1 Application scenarios

This solution is suitable when only a few keys cause the skew and their impact on the computation itself is small. For example, 99% of the keys correspond to 10 records each, while a single key corresponds to 1 million records and causes the skew.

2.2 Implementation idea

If we judge that the few keys carrying most of the data are not particularly important to the job's execution or its results, simply drop those keys. For example, in Spark SQL you can remove them with a where clause, and in Spark Core you can apply the filter operator to the RDD. If the heaviest keys need to be determined dynamically on each run, use the sample operator to sample the RDD, count the occurrences of each key, and filter out the keys with the largest counts, as in the sketch below.
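
A minimal sketch of the dynamic variant, assuming a pair RDD named pairRDD of type RDD[(String, Long)]; the 10% sampling fraction is illustrative, and only the single heaviest key is dropped here:

```scala
import org.apache.spark.rdd.RDD

def dropMostSkewedKey(pairRDD: RDD[(String, Long)]): RDD[(String, Long)] = {
  // Sample ~10% of the data, count each key in the sample,
  // and take the key with the largest count as the skewed key.
  val skewedKey = pairRDD
    .sample(withReplacement = false, fraction = 0.1)
    .map { case (k, _) => (k, 1L) }
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
    .first()._1

  // Filter the skewed key out before the real aggregation or join.
  pairRDD.filter { case (k, _) => k != skewedKey }
}
```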

2.3 Implementation principle

Once the keys that cause the skew are filtered out, they no longer participate in the computation, so they can no longer cause data skew.

3. Increase the parallelism of shuffle operations

3.1 Implementation idea

When executing a shuffle operator on an RDD, pass a parallelism argument to the operator, for example reduceByKey(_ + _, 1000); this sets the number of shuffle read tasks used when that operator executes. For shuffle statements in Spark SQL, such as group by and join, set the parameter spark.sql.shuffle.partitions, which controls the parallelism of the shuffle read tasks. Its default value is 200, which is too small for many scenarios.
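
A minimal sketch of both variants; wordPairs, spark, and the table name t are assumed to exist, and 1000 is only an illustrative value:

```scala
// RDD API: pass the desired number of shuffle read tasks to the shuffle operator.
// wordPairs is an assumed RDD[(String, Int)].
val counts = wordPairs.reduceByKey(_ + _, 1000)

// Spark SQL: raise spark.sql.shuffle.partitions (default 200) before running
// group by / join; spark is an assumed SparkSession and t an assumed table.
spark.conf.set("spark.sql.shuffle.partitions", "1000")
val grouped = spark.sql("SELECT key, COUNT(*) AS cnt FROM t GROUP BY key")
```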

3.2 Implementation principle

Increasing the number of shuffle read tasks allows the multiple keys originally assigned to one task to be spread across several tasks, so each task processes less data than before. For example, suppose there are 5 different keys, each with 10 records, and all 5 keys are assigned to the same task; that task then processes 50 records. After increasing the number of shuffle read tasks, each task may be assigned a single key and therefore process only 10 records, so each task naturally finishes sooner.

4. Double aggregation

4.1 Application scenarios

This solution is suitable when running aggregation shuffle operators such as reduceByKey on an RDD, or when using group by statements in Spark SQL for grouped aggregation.

4.2 Implementation idea

The core idea of this solution is to aggregate in two stages.

The first stage is partial (local) aggregation:

Attach a random prefix to every key, for example a random number within 10, so that identical keys become different. For example, (hello, 1) (hello, 1) (hello, 1) (hello, 1) may become (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then run an aggregation operator such as reduceByKey on the prefixed data to perform the partial aggregation, giving results like (1_hello, 2) (2_hello, 2).

The second stage is global aggregation:

Remove the prefix from each key, turning the data into (hello, 2) (hello, 2), then perform the aggregation again globally to obtain the final result (hello, 4).
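
A minimal sketch of the two stages, assuming a pair RDD named rdd of type RDD[(String, Long)] and a salt range of 10:

```scala
import scala.util.Random

// Stage 1: partial aggregation. Prefix every key with a random number in [0, 10)
// so that one hot key is spread across up to 10 salted keys, then aggregate.
val partial = rdd
  .map { case (key, value) => (s"${Random.nextInt(10)}_$key", value) }
  .reduceByKey(_ + _)

// Stage 2: global aggregation. Strip the salt prefix and aggregate again
// to get the final per-key result.
val result = partial
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
  .reduceByKey(_ + _)
```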

4.3 Implementation principle

Adding a random prefix turns one heavily loaded key into several different keys, so the data that would originally be processed by a single task is spread across multiple tasks for partial aggregation, which solves the problem of one task handling too much data. Removing the random prefix and aggregating again globally then yields the final result.

(Figure: two-stage aggregation with random key prefixes; original image spark_double_groupby.png is unavailable.)

If one key in an RDD causes data skew while other keys are present as well, the usual approach is to sample the data first to find the skewed key, then use filter to split the original RDD into two: RDD1, containing only the skewed key, and RDD2, containing the other keys. RDD1 can then be salted with random prefixes and computed across multiple partitions and tasks, while RDD2 is aggregated normally, and finally the two results are combined, as in the sketch below.
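
A minimal sketch of this split-and-union variant, assuming rdd: RDD[(String, Long)] and that skewedKey has already been identified by sampling, as in solution 2:

```scala
import scala.util.Random

// Split the RDD into the skewed key and everything else.
val skewedRDD = rdd.filter { case (k, _) => k == skewedKey }
val normalRDD = rdd.filter { case (k, _) => k != skewedKey }

// Salt only the skewed part, aggregate it in two stages, then drop the salt.
val skewedAgg = skewedRDD
  .map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
  .reduceByKey(_ + _)
  .map { case (saltedK, v) => (saltedK.split("_", 2)(1), v) }
  .reduceByKey(_ + _)

// Aggregate the non-skewed part normally and combine the two results.
val finalResult = normalRDD.reduceByKey(_ + _).union(skewedAgg)
```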

5. Convert reduce join to map join

Broadcast + filter (or map)

5.1 Application scenarios

This solution is suitable when joining RDDs with the join operator, or when using join statements in Spark SQL, and one of the RDDs or tables in the join is relatively small (for example, a few hundred MB, or one or two GB).

5.2 Implementation idea

Instead of using the join operator, implement the join with a broadcast variable plus a map-type operator, which avoids the shuffle entirely and therefore avoids data skew altogether. Pull the data of the smaller RDD into the driver's memory with the collect operator and create a broadcast variable from it; then execute a map-type operator on the other RDD. Inside the operator function, read the full data of the smaller RDD from the broadcast variable and compare it with each record of the current RDD by the join key; when the keys match, connect the records of the two RDDs in whatever way you need.
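
A minimal sketch of such a broadcast map join, assuming smallRDD and bigRDD are RDD[(String, String)] pairs, spark is the SparkSession, and smallRDD is small enough to collect to the driver:

```scala
// Pull the small RDD to the driver and broadcast it to every executor.
val smallMap: Map[String, String] = smallRDD.collectAsMap().toMap
val smallBroadcast = spark.sparkContext.broadcast(smallMap)

// No shuffle: each partition of bigRDD looks up the join key in the broadcast copy.
// Unmatched keys are dropped, i.e. inner-join semantics.
val joined = bigRDD.flatMap { case (key, bigValue) =>
  smallBroadcast.value.get(key).map(smallValue => (key, (bigValue, smallValue)))
}
```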

5.3 Implementation principle

An ordinary join goes through the shuffle process, which pulls the data with the same key into one shuffle read task before joining; this is a reduce join. If one RDD is relatively small, you can instead broadcast the full data of the small RDD and use a map-type operator to achieve the same effect as the join, i.e. a map join. No shuffle occurs, and therefore no data skew can occur.

6. Sample the skewed keys and split the join operation

6.1 Application scenarios

When joining two RDDs/Hive tables, if the data volume is large and solution 5 cannot be used, look at the key distribution on both sides. If the skew is caused by a few keys with very large data volumes in one RDD/Hive table, while the keys in the other RDD/Hive table are distributed fairly evenly, this solution is a good fit.

6.2 Implementation idea

For the RDD that contains the few keys with excessive data volumes, take a sample with the sample operator, count the occurrences of each key, and determine which keys carry the most data. Extract the data for those keys from the original RDD into a separate RDD and prefix each of its keys with a random number within n; the remaining keys, which do not cause skew, form another RDD. Then, from the other RDD to be joined, likewise filter the data corresponding to the skewed keys into a separate RDD and expand each record into n records, prefixed in turn with 0 through n-1; its remaining keys also form another RDD. Join the salted RDD with the n-times expanded RDD: the originally identical key is now broken into n parts and spread across multiple tasks. The two remaining ordinary RDDs are joined as usual. Finally, combine the two join results with the union operator to obtain the final result, as sketched below.
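
A minimal sketch, assuming rdd1 (skewed) and rdd2 (evenly distributed) are RDD[(String, String)]; n = 100 and the 10% sampling fraction are illustrative values, and only the single heaviest key is split out here:

```scala
import scala.util.Random

val n = 100

// 1. Sample rdd1 and find the most frequent key (the top-k keys could be taken instead).
val skewedKey = rdd1.sample(false, 0.1)
  .map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .first()._1

// 2. Split both RDDs into a skewed part and a normal part.
val skewedRDD1 = rdd1.filter { case (k, _) => k == skewedKey }
val normalRDD1 = rdd1.filter { case (k, _) => k != skewedKey }
val skewedRDD2 = rdd2.filter { case (k, _) => k == skewedKey }
val normalRDD2 = rdd2.filter { case (k, _) => k != skewedKey }

// 3. Salt the skewed part of rdd1 with a random prefix in [0, n),
//    and expand the matching part of rdd2 n times with prefixes 0..n-1.
val saltedRDD1   = skewedRDD1.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
val expandedRDD2 = skewedRDD2.flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }

// 4. Join the salted/expanded pair and strip the prefix, join the normal pair, then union.
val skewedJoin = saltedRDD1.join(expandedRDD2)
  .map { case (saltedK, v) => (saltedK.split("_", 2)(1), v) }
val normalJoin = normalRDD1.join(normalRDD2)
val result = skewedJoin.union(normalJoin)
```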


7. Use random prefixes and RDD expansion for the join

7.1 Application scenarios

If a large number of keys in the RDD cause skew during the join operation, there is no point in splitting out individual keys, and this last solution is the only remaining option.

7.2 Implementation idea

The implementation is essentially similar to solution 6. First examine the data distribution in the RDDs/Hive tables and find the one that causes the skew, for example one in which many keys correspond to more than 10,000 records each. Then prefix every record in that RDD with a random number within n. At the same time, expand the other, normal RDD: turn each of its records into n records, prefixed in turn with 0 through n-1. Finally, join the two processed RDDs, as sketched below.
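
A minimal sketch, assuming skewedRDD and otherRDD are RDD[(String, String)]; n = 10 is illustrative, and note that the expansion multiplies the size of otherRDD by n, so the value of n is a trade-off between balance and memory/IO:

```scala
import scala.util.Random

val n = 10

// Prefix every record of the skewed RDD with a random number in [0, n).
val salted = skewedRDD.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

// Expand the other RDD n times, once per possible prefix.
val expanded = otherRDD.flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }

// Join on the salted keys, then strip the prefix to recover the original keys.
val joined = salted.join(expanded)
  .map { case (saltedK, v) => (saltedK.split("_", 2)(1), v) }
```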



Source: blog.csdn.net/dwjf321/article/details/109056363