Spark data skew scenarios and solutions

Phenomenon when data skew occurs

Most tasks execute very quickly, but individual tasks execute extremely slowly.

The principle of data skew

When a shuffle is performed, all records with the same key on every node are pulled to a single task on some node for processing, for example to aggregate or join by key. If one key corresponds to a particularly large amount of data, that task receives far more data than the others and data skew occurs. The Spark job then runs very slowly, and the task processing the oversized key may even run out of memory. Because a stage finishes only when its slowest task finishes, the running speed of the entire stage is determined by that one task; for example, most tasks may each handle only 10 records while the skewed task handles 1 million.

How to locate the code that causes data skew

Data skew only occurs during a shuffle. Operators that may trigger a shuffle include: distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, etc.

The case where a task executes very slowly

The stage in which the slow task runs is the stage where the data skew occurs. If the job is submitted in yarn-client mode, the driver log is available locally and the currently running stage can be found in it; if the job is submitted in yarn-cluster mode, the currently running stage can be checked through the Spark Web UI.
On the Spark Web UI, look closely at the amount of data allocated to each task in that stage to confirm whether uneven data allocation across tasks is causing the skew.
Once you know the stage where the data skew occurs, work out which part of the code that stage corresponds to based on how Spark divides stages; that part of the code will contain a shuffle operator, since shuffle operators are generally what split the job into stages.

A task inexplicably runs out of memory

In yarn-client mode, look at the exception stack in the local driver log; in yarn-cluster mode, look at the exception stack in the logs retrieved through YARN. The stack trace usually pinpoints the line of code where the memory overflow happened. Look around that line and there is usually a shuffle operator, which is likely the one causing the data skew. However, a memory overflow does not necessarily mean that data skew has occurred; it may also be caused by a bug in the code.

Check the data distribution of the keys that cause data skew

After knowing where the data skew occurs, you usually need to analyze the RDD or Hive table on which the shuffle operation that caused the skew was executed, and examine the distribution of its keys. This mainly provides the basis for choosing one of the technical solutions described later.

  • If the data skew is caused by group by or join statements in Spark SQL, query the key distribution of the tables used in the SQL.
  • If the data skew is caused by executing a shuffle operator on a Spark RDD, you can add code to the Spark job to check the key distribution, for example RDD.countByKey(); then collect/take the resulting counts to the client and print them to see how the keys are distributed, as in the sketch after this list.
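
For the RDD case, a minimal sketch of such a check might look like the following (pairRdd is a placeholder for the pair RDD that feeds the skewed shuffle, and the sampling fraction is arbitrary):

```scala
import org.apache.spark.rdd.RDD

// Minimal sketch: inspect the key distribution of a (key, value) RDD.
// `pairRdd` is a placeholder for the RDD feeding the skewed shuffle.
def printKeyDistribution(pairRdd: RDD[(String, Long)]): Unit = {
  // countByKey() brings the per-key record counts back to the driver;
  // sample first so very high-cardinality data does not overwhelm the driver.
  val counts = pairRdd
    .sample(withReplacement = false, fraction = 0.1)
    .countByKey()

  // Print the ten heaviest keys.
  counts.toSeq.sortBy(-_._2).take(10).foreach { case (k, c) =>
    println(s"key=$k approx count=$c")
  }
}
```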

Data skew solutions

1. The data in the Hive table itself is very uneven - use Hive ETL to preprocess the data

Use Hive ETL to aggregate the data by key in advance, or to join it with the other table in advance, so that the data source read by the Spark job is no longer the original Hive table but the preprocessed one. Since the data has already been aggregated or joined, the Spark job no longer needs to run the shuffle operator that originally performed those operations.
This solves the problem at its source for the Spark job, because it completely avoids executing shuffle operators in Spark, so data skew cannot occur there. However, it treats the symptom rather than the root cause: the data is still skewed, and the skew simply reoccurs inside the Hive ETL job.
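
As an illustration only, the pre-aggregation step could be expressed as a Spark SQL statement run against Hive (the table names user_visits and user_visits_agg are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-etl-preaggregation")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical ETL step, run ahead of the Spark job (it could just as well be a
// scheduled Hive job): aggregate by key once, so the downstream job reads the
// already-aggregated table and no longer needs its own groupBy/reduceByKey shuffle.
spark.sql(
  """
    |INSERT OVERWRITE TABLE user_visits_agg
    |SELECT user_id, COUNT(*) AS visit_cnt
    |FROM user_visits
    |GROUP BY user_id
  """.stripMargin)
```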

2. Only a few keys cause the skew, and they have little impact on the calculation itself - filter out the few skewed keys

If the few keys with a very large amount of data are not particularly important to the job's execution or results, simply filter those keys out. Once the keys that cause the data skew are filtered out, they no longer participate in the calculation, so they naturally cannot produce skew.
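
A minimal sketch, assuming the skewed keys have already been identified by the sampling/countByKey analysis above (pairRdd and skewedKeys are placeholders):

```scala
import org.apache.spark.rdd.RDD

// Drop a handful of known-skewed keys before the shuffle.
def dropSkewedKeys(pairRdd: RDD[(String, Long)],
                   skewedKeys: Set[String]): RDD[(String, Long)] = {
  // Broadcast the (small) set of keys so every executor can filter locally.
  val bcSkewed = pairRdd.sparkContext.broadcast(skewedKeys)
  pairRdd.filter { case (k, _) => !bcSkewed.value.contains(k) }
}
```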

3. The data skew has to be faced head-on - increase the parallelism of the shuffle operation

For shuffle statements in Spark SQL such as group by and join, set the parameter spark.sql.shuffle.partitions, which controls the parallelism of the shuffle read tasks. The default value is 200, which is a little too small for many scenarios. By increasing the number of shuffle read tasks, keys that were originally all assigned to one task can be spread across multiple tasks, so each task processes less data than before.
Disadvantages: this solution usually cannot completely resolve data skew. In extreme cases, such as a single key corresponding to 1 million records, that key will still be assigned to one task no matter how many tasks you add, so data skew is bound to occur anyway.
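
Two illustrative ways to raise the parallelism (spark and pairRdd are placeholders, and the value 1000 is arbitrary):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Raise shuffle parallelism for both Spark SQL and the RDD API.
def increaseShuffleParallelism(spark: SparkSession,
                               pairRdd: RDD[(String, Long)]): RDD[(String, Long)] = {
  // Spark SQL: raise the shuffle-read parallelism from the default of 200.
  spark.conf.set("spark.sql.shuffle.partitions", "1000")

  // RDD API: most shuffle operators accept an explicit partition count.
  pairRdd.reduceByKey(_ + _, 1000)
}
```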

4. Aggregation-type shuffles (reduceByKey / group by) - two-stage aggregation (local aggregation + global aggregation)

The core idea of this solution is to aggregate in two stages. The first stage is local aggregation: each key is first tagged with a random prefix, such as a random number within 10, so that identical keys become different. For example, (hello, 1) (hello, 1) (hello, 1) (hello, 1) becomes (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then run an aggregation operation such as reduceByKey on the prefixed data to aggregate locally, giving (1_hello, 2) (2_hello, 2). Next, strip the prefix from each key to get (hello, 2) (hello, 2) and run the aggregation again globally to obtain the final result, (hello, 4).
By adding a random prefix, records that originally shared the same key are turned into several different keys, so the data that one task would have processed is spread across multiple tasks for local aggregation, which solves the problem of a single task processing too much data. The prefix is then removed and a second, global aggregation produces the final result.
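
A minimal sketch of the two-stage aggregation for a word-count style reduceByKey job (the salting factor n = 10 matches the example above, and the prefix_key encoding is just one possible choice):

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Two-stage (salted) aggregation: local aggregation under random prefixes,
// then global aggregation after the prefixes are stripped.
def twoStageCount(pairRdd: RDD[(String, Long)], n: Int = 10): RDD[(String, Long)] = {
  pairRdd
    // Stage 1: prepend a random prefix so one hot key is split across n sub-keys.
    .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
    .reduceByKey(_ + _)
    // Stage 2: strip the prefix and aggregate globally.
    .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }
    .reduceByKey(_ + _)
}
```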

5. Joining a small table with a large table - convert the reduce join into a map join

Pull the data of the smaller RDD to the driver with the collect operator and create a broadcast variable from it; then run a map-type operator over the other RDD, and inside the operator function fetch the full data of the smaller RDD from the broadcast variable, compare it with each record of the current RDD by the join key, and connect the records as needed whenever the keys match. An ordinary join goes through a shuffle, which pulls all the data with the same key into one shuffle read task before joining; that is a reduce join. If one RDD is fairly small, broadcasting its full data and using a map operator achieves the same effect as the join, i.e. a map join. There is then no shuffle operation at all, and therefore no data skew.
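
A minimal sketch of the broadcast map join, using concrete placeholder types and assuming the small RDD comfortably fits in memory:

```scala
import org.apache.spark.rdd.RDD

// Convert a reduce join into a map join by broadcasting the small side.
def mapJoin(bigRdd: RDD[(String, String)],
            smallRdd: RDD[(String, Int)]): RDD[(String, (String, Int))] = {
  // Pull the small RDD to the driver and broadcast its full contents.
  val smallMap = smallRdd.collectAsMap()
  val bcSmall  = bigRdd.sparkContext.broadcast(smallMap)

  // No shuffle: each record of the big RDD looks up the broadcast map locally,
  // keeping only the records whose key exists on the small side (inner join).
  bigRdd.flatMap { case (k, v) =>
    bcSmall.value.get(k).map(w => (k, (v, w)))
  }
}
```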

6. Both tables are large, but a few keys in one of them have far too much data while the keys in the other are evenly distributed - sample the skewed keys and split the join operation

  • For the RDD that contains the few keys with a very large amount of data, use the sample operator to take a sample and count which keys have the most data.
  • Then split the data for those keys out of the original RDD into a separate RDD and prefix each key with a random number within n; the majority of keys that do not cause skew form another RDD.
  • The other RDD to be joined likewise filters the data for those skewed keys into a separate RDD and expands each record into n records, appending the prefixes 0 through n-1 in order; its keys that do not cause skew also form another RDD.
  • Then join the separate RDD carrying the random prefixes against the separate RDD that was expanded n times. The records that originally shared one key are now broken into n parts and distributed across multiple tasks for the join.
  • The other two ordinary RDDs can be joined as usual.
  • Finally, combine the results of the two joins with the union operator to obtain the final join result.
    For data skew caused by a join, if only a few keys cause the skew, those keys can be split into separate RDDs and given random prefixes to break them into n parts for the join. The data for those keys is then no longer concentrated on a few tasks but scattered across many tasks; a sketch follows below.
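
A sketch of the whole flow, with hypothetical RDD names and a top-1 skewed-key sample (extend to top-k as needed); the salting factor n and the sampling fraction are arbitrary:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Split-skew join: salt only the skewed keys, expand only their counterpart
// records on the other side, and join the remaining keys normally.
def splitSkewJoin(rddA: RDD[(String, String)],
                  rddB: RDD[(String, Int)],
                  n: Int = 10): RDD[(String, (String, Int))] = {
  // 1. Sample rddA and pick the heaviest key (take(1); extend to top-k as needed).
  val skewedKeys: Set[String] = rddA.sample(withReplacement = false, fraction = 0.1)
    .countByKey().toSeq.sortBy(-_._2).take(1).map(_._1).toSet
  val bcSkewed = rddA.sparkContext.broadcast(skewedKeys)

  // 2. Split rddA: skewed keys get a random 0..n-1 prefix, the rest stay as-is.
  val skewedA = rddA.filter { case (k, _) => bcSkewed.value.contains(k) }
    .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
  val normalA = rddA.filter { case (k, _) => !bcSkewed.value.contains(k) }

  // 3. Split rddB: expand each skewed-key record into n copies, one per prefix,
  //    so every salted key on the A side can find its match.
  val skewedB = rddB.filter { case (k, _) => bcSkewed.value.contains(k) }
    .flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }
  val normalB = rddB.filter { case (k, _) => !bcSkewed.value.contains(k) }

  // 4. Join the salted parts and strip the prefix, join the normal parts as usual,
  //    then union the two results.
  val joinedSkewed = skewedA.join(skewedB)
    .map { case (saltedKey, pair) => (saltedKey.split("_", 2)(1), pair) }
  val joinedNormal = normalA.join(normalB)
  joinedSkewed.union(joinedNormal)
}
```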

7. A large number of keys in the RDD cause data skew during the join, and splitting them out is meaningless - join using random prefixes and an expanded RDD

  • First check the data distribution in the RDDs/Hive tables and find the one causing the data skew, for example one in which many keys each correspond to more than 10,000 records.
  • Then prefix every record in that RDD with a random number within n.
  • At the same time, expand the other, normal RDD: turn each record into n records and prefix each expanded record with one of the numbers 0 through n-1.
  • Finally, join the two processed RDDs.
    This is essentially the same as the previous solution, just applied to the entire RDD without splitting out the skewed keys; a sketch is shown below.
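
A minimal sketch, again with hypothetical RDD names; compared with the previous sketch, the random prefixes are applied to the entire skewed RDD rather than to a split-out subset:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Salt the whole skewed RDD and expand the other RDD n times before joining.
def saltedFullJoin(skewedRdd: RDD[(String, String)],
                   normalRdd: RDD[(String, Int)],
                   n: Int = 10): RDD[(String, (String, Int))] = {
  // Salt every record of the skewed RDD with a random prefix in 0..n-1.
  val saltedSkewed = skewedRdd.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

  // Expand the other RDD n times, once per possible prefix.
  val expandedNormal = normalRdd.flatMap { case (k, v) =>
    (0 until n).map(i => (s"${i}_$k", v))
  }

  // Join on the salted keys, then strip the prefix to recover the original key.
  saltedSkewed.join(expandedNormal)
    .map { case (saltedKey, pair) => (saltedKey.split("_", 2)(1), pair) }
}
```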

References

"Learning Big Data in Five Minutes - Spark Data Tilt and Solutions"
