Spark Performance Optimization: Data Skew Tuning

Foreword

Following "Spark Performance Optimization: Development Tuning" and "Spark Performance Optimization: Resource Tuning", which cover the development tuning and resource tuning every Spark developer should be familiar with, this article is the advanced chapter of the "Spark Performance Optimization Guide". It analyzes data skew tuning and shuffle tuning in depth to solve more difficult performance problems.

1. Data skew tuning
Tuning overview

Sometimes we run into one of the hardest problems in big data computing, data skew, and the performance of the Spark job becomes much worse than expected. Data skew tuning means using various technical solutions to address the different types of data skew and thereby ensure the performance of Spark jobs.
Symptoms when data skew occurs

Most tasks execute very fast, but a few tasks execute extremely slowly. For example, out of 1000 tasks, 997 finish within 1 minute, while the remaining two or three take an hour or two. This situation is very common.

A Spark job that used to run normally suddenly throws an OOM (out of memory) exception one day, and the exception stack shows it was thrown from the business code we wrote. This situation is relatively rare.

How Data Skew Occurs

The principle of data skew is very simple: during a shuffle, all records with the same key on every node must be pulled to a single task on one node for processing, for example for key-based aggregation or join operations. If the amount of data for some key is particularly large, data skew occurs. For example, if most keys correspond to 10 records but an individual key corresponds to 1 million records, then most tasks are assigned only 10 records and finish in about a second, while the task that gets the 1-million-record key may run for an hour or two. The progress of the whole Spark job is therefore determined by the longest-running task.

Therefore, when data skew occurs, the Spark job appears to run very slowly, and a single task may even cause a memory overflow because of the huge amount of data it has to process.

The following figure is a very clear example: the key hello corresponds to a total of 7 records across the three nodes, and all of them are pulled into the same task for processing, while the keys world and you each correspond to 1 record, so the other two tasks only need to process 1 record each. The first task may then take 7 times as long as the other two, and the running speed of the whole stage is determined by the slowest task.

[Figure: the hello key's 7 records are pulled into a single task, while world and you are each handled by a task with 1 record]
How to locate the code that causes data skew

Data skew only happens during shuffle. Commonly used operators that may trigger a shuffle include distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, etc. When data skew occurs, it is likely caused by the use of one of these operators in your code.
When a task executes very slowly

The first thing to look at is which stage the data skew occurs in.

If you submit in yarn-client mode, you can see the logs locally and find in them which stage is currently running; if you submit in yarn-cluster mode, you can check which stage is currently running through the Spark Web UI. In addition, whether you use yarn-client or yarn-cluster mode, you can dig into the Spark Web UI to see how much data each task of the current stage has been assigned, and thereby further determine whether uneven data distribution across tasks is causing the skew.

[Figure: per-task data volumes for the current stage in the Spark Web UI]

After knowing which stage the data skew occurs in, we need to work out which part of the code the skewed stage corresponds to, based on the principles of stage division; that part of the code must contain a shuffle operator. Precisely mapping stages to code requires a deep understanding of the Spark source code, but there is a relatively simple and practical rule of thumb: wherever a shuffle operator appears in Spark code, or a statement that causes a shuffle (such as group by) appears in Spark SQL, you can take that place as the boundary dividing the two adjacent stages.

Here we take Spark's most basic entry program, word count, as an example to show how to roughly work out, in the simplest way, the code that corresponds to a stage. In the following example, the whole code contains only one operator that triggers a shuffle, reduceByKey, so we can take that operator as the boundary dividing the code into two stages.

  1. stage0 mainly performs the operations from textFile to map, as well as the shuffle write. The shuffle write can be understood simply as partitioning the data in the pairs RDD: within the data processed by each task, records with the same key are written to the same disk file.

  2. stage1 mainly performs the operations from reduceByKey to collect. When each task of stage1 starts, it first performs the shuffle read: it pulls the keys it is responsible for from the nodes where the tasks of stage0 ran, and then performs a global aggregation or join on records with the same key; here, the values of each key are accumulated. After the reduceByKey operator, stage1 has computed the final wordCounts RDD, and the collect operator then pulls all the data to the Driver so that we can iterate over it and print it.

val conf = new SparkConf()
val sc = new SparkContext(conf)

// stage0: textFile -> flatMap -> map, then the shuffle write partitions pairs by key
val lines = sc.textFile("hdfs://...")
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))

// stage1: shuffle read + reduceByKey aggregation, then collect pulls the results to the Driver
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println(_))

Through this analysis of the word-count program, you should understand the most basic principle of stage division and how the shuffle is performed at the boundary between two stages, and therefore how to quickly locate the code corresponding to the stage where data skew occurs. For example, if we find in the Spark Web UI or the local log that some tasks of stage1 run very slowly and judge that stage1 has data skew, we can go back to the code and see that stage1 mainly contains the reduceByKey shuffle operator; at this point we can basically determine that the skew is caused by reduceByKey. For example, if one word appears 1 million times while the other words appear 10 times each, then one task of stage1 has to process 1 million records, and the whole stage is slowed down by that one task.
A task inexplicably overflows memory

In this case it is easier to locate the problematic code. We recommend looking at the exception stack directly in the local log in yarn-client mode, or in the logs accessible through YARN in yarn-cluster mode. Generally, the exception stack tells you which line of your code the memory overflow occurred on. Look around that line; there is usually a shuffle operator nearby, and it is likely that this operator is causing the data skew.

Note, however, that data skew cannot be concluded from an occasional memory overflow alone: bugs in your own code, or occasional data anomalies, can also cause memory overflows. You should therefore still use the method described above, checking the running time and the amount of data assigned to each task of the failing stage in the Spark Web UI, to determine whether the overflow was caused by data skew.
View the data distribution of the keys that cause the data skew

After knowing where the data skew occurs, you usually need to analyze the RDD or Hive table on which the shuffle operation caused the skew and check the distribution of its keys. This mainly provides a basis for choosing a technical solution later; different combinations of key distributions and shuffle operators may call for different solutions.

At this point, depending on how you perform the operation, there are many ways to view the key distribution:

If the data skew is caused by group by or join statements in Spark SQL, query the key distribution of the tables used in the SQL.
If the data skew is caused by executing a shuffle operator on a Spark RDD, you can add code to the Spark job to inspect the key distribution, for example RDD.countByKey(), then collect/take the per-key counts to the client and print them to see the distribution of keys.

For example, for the word-count program above, if we have determined that the reduceByKey operator of stage1 causes the data skew, then we should look at the key distribution of the RDD on which reduceByKey is executed, which in this example is the pairs RDD. As shown in the sketch below, we can first sample 10% of the data in pairs, use the countByKey operator to count the occurrences of each key, and then iterate over the sampled counts on the client and print them.
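A minimal sketch of that sampling check, building on the pairs RDD from the word-count code above:

// sample 10% of pairs without replacement
val sampledPairs = pairs.sample(false, 0.1)
// count how many times each key appears in the sample (returns a Map on the driver)
val sampledKeyCounts = sampledPairs.countByKey()
// print the per-key counts on the client
sampledKeyCounts.foreach(println(_))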
2. Solutions for data skew

Solution 1: Preprocess data with Hive ETL

Applicable scenarios: the data skew is caused by a Hive table. If the data in the Hive table itself is very uneven (for example, one key corresponds to 1 million records while other keys correspond to only 10), and the business scenario requires frequently using Spark to analyze that Hive table, then this solution is a good fit.

Solution implementation idea: evaluate whether the data can be preprocessed through Hive (that is, pre-aggregated by key, or joined with the other tables in advance, via Hive ETL). The Spark job then reads the preprocessed Hive table rather than the original one. Since the data has already been aggregated or joined, the Spark job no longer needs shuffle operators to perform those operations.

Solution implementation principle: this approach solves the data skew at the source for the Spark job, since shuffle operators are avoided entirely in Spark and therefore no data skew can occur there. But note that it is a stopgap rather than a cure: the data itself is still unevenly distributed, so when the group by or join is performed in Hive ETL, the skew still occurs there and the Hive ETL becomes very slow. We merely move the data skew forward into Hive ETL so that it does not occur in the Spark program.

Advantages of the scheme: simple and convenient to implement, with very good results. Data skew is completely avoided in the Spark job, and its performance is greatly improved.

Disadvantages of the solution: it treats the symptoms but not the root cause; the data skew still occurs in Hive ETL.

Practical experience with the solution: in projects where a Java system is combined with Spark, there are scenarios where Java code calls Spark jobs frequently and the Spark jobs must execute quickly; this solution fits well there. The data skew is pushed upstream into a Hive ETL job that runs only once a day, so only that run is relatively slow; afterwards, every Spark job triggered from Java runs fast and provides a better user experience.

Project practical experience: this solution is used in Meituan-Dianping's interactive user behavior analysis system. The system lets users submit data analysis and statistics tasks through a Java web application, and the backend submits Spark jobs via Java to perform the analysis. The Spark jobs are required to be fast, ideally finishing within 10 minutes, otherwise the user experience suffers. We therefore moved the shuffle operations of some Spark jobs forward into Hive ETL so that Spark reads the preprocessed Hive intermediate tables directly. This minimizes Spark's shuffle operations, greatly improves performance, and speeds up some jobs by more than 6 times.
Solution 2: Filter a few keys that cause skew

Applicable scenarios of the scheme: if only a few keys cause the skew, and their impact on the computation itself is small, this scheme is very suitable. For example, 99% of the keys correspond to 10 records each, but one key corresponds to 1 million records and causes the skew.

Solution implementation idea: if we judge that the few keys with huge amounts of data are not particularly important to the job's execution and results, simply filter out those keys. For example, in Spark SQL you can use a where clause, or in Spark Core you can apply a filter operator to the RDD, to remove these keys. If you need to determine dynamically, each time the job runs, which keys have the most data and then filter them out, you can use the sample operator to sample the RDD, count the occurrences of each key, and filter out the keys with the largest amounts of data.
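A minimal sketch of the dynamic variant, assuming a pair RDD named pairs; the 10% sample fraction and keeping only the single heaviest key are illustrative choices:

// sample the RDD and count how often each key appears in the sample
val sampled = pairs.sample(false, 0.1)
val sampleKeyCounts = sampled.countByKey()
// pick the heaviest key(s) in the sample (here just the top one)
val heaviestKeys = sampleKeyCounts.toSeq.sortBy(-_._2).take(1).map(_._1).toSet
// drop those keys before the shuffle happens
val filteredPairs = pairs.filter { case (key, _) => !heaviestKeys.contains(key) }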

Scheme implementation principle: after the keys that cause the skew are filtered out, they no longer participate in the computation, so they can no longer produce data skew.

Advantages of the scheme: simple to implement and very effective; data skew can be completely avoided.

Disadvantages of the scheme: not many applicable scenarios; in most cases there are many keys causing the skew, not just a few.

Practical experience with the scheme: we have also used this scheme to solve data skew in a project. One day a Spark job that had been running fine suddenly hit OOM. Investigation showed that on that day a particular key in the Hive table contained abnormal data, causing its volume to surge. We therefore sample before each execution, compute the keys with the largest volumes in the sample, and filter those keys out directly in the program.
Solution 3: Improve the parallelism of shuffle operations

Applicable scenarios of the scheme: if we must confront the data skew head-on, it is recommended to try this scheme first, because it is the simplest way to deal with data skew.

Solution implementation idea: when executing a shuffle operator on an RDD, pass a parallelism parameter to the operator, such as reduceByKey(1000), which sets the number of shuffle read tasks for that shuffle. For shuffle-inducing statements in Spark SQL, such as group by and join, set the parameter spark.sql.shuffle.partitions, which controls the parallelism of the shuffle read tasks; its default value of 200 is a bit too small for many scenarios.
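A minimal sketch of both knobs, reusing the conf and pairs values from the word-count example; 1000 is just an illustrative value:

// RDD API: request 1000 shuffle read tasks for this reduceByKey
val wordCounts = pairs.reduceByKey(_ + _, 1000)

// Spark SQL: raise the shuffle partition count from its default of 200
conf.set("spark.sql.shuffle.partitions", "1000")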

Implementation principle of the solution: increasing the number of shuffle read tasks lets keys that were originally all assigned to one task be spread over multiple tasks, so that each task processes less data than before. For example, if there were originally 5 keys, each corresponding to 10 records, and all 5 keys were assigned to one task, that task would process 50 records. After increasing the number of shuffle read tasks, each task may be assigned just one key, i.e. 10 records, so each task's execution time naturally becomes shorter. The specific principle is shown in the figure below.
[Figure: increasing shuffle read parallelism spreads the keys across more tasks]

Advantages of the scheme : It is relatively simple to implement, and can effectively alleviate and reduce the impact of data skew.

Disadvantages of the scheme: it only alleviates the data skew rather than eradicating it; in practical experience its effect is limited.

Practical experience with the scheme: this scheme usually cannot solve the data skew completely, because in extreme cases, such as a key corresponding to 1 million records, no matter how much you increase the number of tasks, that key's 1 million records will still be assigned to a single task, so data skew is bound to occur. This scheme should therefore be seen as the first thing to try when skew is discovered, an attempt to alleviate it in the simplest way, or as something to use in combination with other schemes.

Solution 4: Two-stage aggregation (local aggregation + global aggregation)

Applicable scenarios of the scheme: this scheme is suitable when executing aggregation-type shuffle operators such as reduceByKey on an RDD, or using a group by statement in Spark SQL for grouped aggregation.

Scheme implementation idea: the core idea is to aggregate in two stages. The first stage is local aggregation: attach a random prefix to each key, for example a random number within 10, so that identical keys become different. For instance, (hello, 1) (hello, 1) (hello, 1) (hello, 1) becomes (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then run an aggregation such as reduceByKey on the prefixed data to perform the local aggregation, producing (1_hello, 2) (2_hello, 2). Next remove the prefix from each key, giving (hello, 2) (hello, 2), and perform the aggregation again globally to get the final result, such as (hello, 4).
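A minimal sketch of the two-stage aggregation, assuming the pairs RDD from the word-count example and a prefix range of 10:

import scala.util.Random

// stage one: salt each key with a random prefix 0..9, then aggregate locally
val locallyAggregated = pairs
  .map { case (word, count) => (s"${Random.nextInt(10)}_$word", count) }
  .reduceByKey(_ + _)

// stage two: strip the prefix and aggregate globally
val globallyAggregated = locallyAggregated
  .map { case (prefixedWord, count) => (prefixedWord.split("_", 2)(1), count) }
  .reduceByKey(_ + _)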

Implementation principle of the scheme: by adding random prefixes, the records that originally shared one key are turned into several different keys, so the data originally handled by a single task is spread over multiple tasks for local aggregation, solving the problem of a single task processing too much data. The random prefix is then removed and a global aggregation is performed again to obtain the final result.
[Figure: two-stage aggregation with random key prefixes, local then global]

Advantages of the scheme: very effective for data skew caused by aggregation-type shuffle operations. It usually eliminates the skew, or at least greatly alleviates it, improving Spark job performance by several times.

Disadvantages of the scheme: it only applies to aggregation-type shuffle operations, so its scope is relatively narrow. For join-type shuffle operations, other solutions must be used.

Solution 5: Convert reduce join to map join

Applicable scenarios of the scheme: this scheme works well when using the join operator on RDDs or join statements in Spark SQL, and one RDD or table in the join is relatively small (for example a few hundred MB, or one or two GB).

Solution implementation idea: instead of using the join operator, use a Broadcast variable plus a map-type operator to implement the join, thereby avoiding the shuffle entirely and completely avoiding data skew. Pull the data of the smaller RDD to the Driver with the collect operator and create a Broadcast variable from it; then apply a map-type operator to the other RDD and, inside the operator function, fetch the full data of the smaller RDD from the Broadcast variable and compare it with each record of the current RDD by the join key; when the keys match, join the records of the two RDDs in the way you need.
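A minimal sketch, assuming two hypothetical pair RDDs rddLarge and rddSmall and the SparkContext sc:

// pull the small RDD to the Driver and broadcast it as a map
val smallMap = rddSmall.collectAsMap()
val smallBroadcast = sc.broadcast(smallMap)

// inner-join semantics on the large RDD, with no shuffle at all
val joined = rddLarge.flatMap { case (key, largeValue) =>
  smallBroadcast.value.get(key).map(smallValue => (key, (largeValue, smallValue)))
}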

Solution implementation principle: an ordinary join goes through the shuffle process, and once it shuffles, records with the same key are pulled into a single shuffle read task and then joined; this is a reduce join. If one RDD is relatively small, however, broadcasting the full data of the small RDD and using a map-type operator achieves the same effect as a join; this is a map join. No shuffle occurs, so no data skew occurs. The specific principle is shown in the figure below.
[Figure: broadcast-based map join that avoids the shuffle]

Advantages of the solution: very effective against data skew caused by join operations, because no shuffle occurs at all and therefore no data skew occurs at all.

Disadvantages of the scheme: its applicable scenarios are limited, because it only works when one table is large and the other is small. After all, the small table must be broadcast, which consumes memory: the Driver and every Executor hold the full data of the small RDD in memory. If the broadcast data is relatively large, say more than 10 GB, a memory overflow may occur. It is therefore not suitable when both sides are large tables.

Solution 6: Sampling skewed keys and splitting join operations

Applicable scenarios of the solution: when joining two RDDs/Hive tables, if both are fairly large and "Solution 5" cannot be used, look at the key distribution in the two RDDs/Hive tables. If the skew occurs because a few keys in one RDD/Hive table have too much data while the keys in the other RDD/Hive table are distributed evenly, this solution is appropriate.

Scheme implementation idea (a code sketch follows the steps below):

For the RDD that contains the few keys with large amounts of data, use the sample operator to draw a sample, count the occurrences of each key, and work out which keys have the most data.
Then split the data corresponding to those keys out of the original RDD into a separate RDD, prefixing each key with a random number within n; the majority of keys, which do not cause skew, form another RDD.
From the other RDD to be joined, likewise filter out the data corresponding to those skewed keys into a separate RDD, and expand each record into n records, appending the prefixes 0 through n-1 in turn; the remaining keys, which do not cause skew, also form another RDD.
Then join the independent, randomly prefixed RDD with the independent RDD that has been expanded n times. The records that originally shared one key are now scattered into n parts and distributed over multiple tasks for the join.
The other two ordinary RDDs can be joined as usual.
Finally, combine the results of the two joins with the union operator to obtain the final join result.
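A minimal sketch of these steps, assuming two hypothetical pair RDDs rddA (skewed) and rddB (evenly distributed), the SparkContext sc, n = 10 prefixes, and treating the top 5 sampled keys as skewed:

import scala.util.Random

val n = 10  // number of random prefixes (illustrative value)

// step 1: sample rddA and find the keys with the most data
val skewedKeys = rddA.sample(false, 0.1).countByKey()
  .toSeq.sortBy(-_._2).take(5).map(_._1).toSet
val skewedKeysBc = sc.broadcast(skewedKeys)

// step 2: split rddA; salt the skewed part with a random prefix 0..n-1
val skewedA = rddA.filter { case (k, _) => skewedKeysBc.value.contains(k) }
  .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
val normalA = rddA.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

// step 3: split rddB; expand its skewed part n times, once per prefix
val skewedB = rddB.filter { case (k, _) => skewedKeysBc.value.contains(k) }
  .flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }
val normalB = rddB.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

// steps 4-6: join each pair separately, strip the prefix, then union the results
val joinedSkewed = skewedA.join(skewedB)
  .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }
val joinedNormal = normalA.join(normalB)
val finalJoined = joinedSkewed.union(joinedNormal)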

Solution implementation principle: for data skew caused by a join, if only a few keys cause the skew, you can split those keys out into independent RDDs, add random prefixes, and break them into n parts for the join. The data for those keys is then no longer concentrated in a few tasks but spread across many tasks.

Advantages of the scheme: for data skew caused by a join where only a few keys are skewed, this approach breaks up the keys so the join can proceed efficiently. Moreover, only the data corresponding to the few skewed keys needs to be expanded n times; the full data set does not have to be expanded, which avoids using too much memory.

Disadvantages of the scheme: if a very large number of keys cause the skew, for example thousands of keys, this method is not suitable.
Solution 7: Join using random prefixes and RDD expansion

Applicable scenarios of the solution: if a large number of keys in an RDD cause the skew during the join, splitting out individual keys is pointless; in that case only this last solution can be used.

Scheme implementation idea (a code sketch follows the steps below):

The idea is essentially similar to "Solution 6". First examine the data distribution in the RDDs/Hive tables and find the one that causes the skew, for example an RDD/Hive table in which many keys correspond to more than 10,000 records each.
Then prefix every record in that RDD with a random number within n.
At the same time, expand the other, normal RDD: turn each record into n records, each carrying one of the prefixes 0 through n-1 in turn.
Finally, join the two processed RDDs.
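A minimal sketch, assuming two hypothetical pair RDDs rddSkewed (many heavy keys) and rddOther, with n = 10:

import scala.util.Random

val n = 10  // expansion factor (illustrative value)

// salt every record of the skewed RDD with a random prefix 0..n-1
val prefixedSkewed = rddSkewed.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

// expand every record of the other RDD n times, once per possible prefix
val expandedOther = rddOther.flatMap { case (k, v) =>
  (0 until n).map(i => (s"${i}_$k", v))
}

// join on the salted keys, then strip the prefix from the result
val joined = prefixedSkewed.join(expandedOther)
  .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }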

Implementation principle of the scheme: appending random prefixes turns the records that originally shared one key into records with different keys, so these processed "different keys" are dispersed to multiple tasks instead of one task handling a huge number of identical keys. The difference from "Solution 6" is that the previous scheme only applies special processing to the data of a few skewed keys, so the RDD expansion there is limited and its memory footprint is small; this scheme targets the case of many skewed keys, where individual keys cannot be split out for separate handling, so the entire RDD must be expanded, which demands a lot of memory.

Advantages of the scheme: it can handle basically any join-type data skew, and the effect is relatively significant, often with a very good performance improvement.

Disadvantages of the scheme: this scheme alleviates the skew rather than avoiding it completely, and it requires expanding the entire RDD, which demands a lot of memory.

Practical experience with the solution: while developing one data requirement, we found that a join was causing data skew. Before optimization the job took about 60 minutes; after applying this scheme, the execution time dropped to about 10 minutes, a 6x performance improvement.
Solution 8: Use a combination of multiple solutions

In practice, if you only need to handle a relatively simple data skew scenario, one of the above solutions is usually enough. But for a more complex scenario, you may need to combine several of them. For example, for a Spark job with multiple places where skew occurs, you can first use solutions 1 and 2 to preprocess part of the data and filter part of it, alleviating some of the skew; then raise the parallelism of certain shuffle operations to improve their performance; and finally choose an appropriate solution for each remaining aggregation or join to optimize it. Once you thoroughly understand the ideas and principles behind these solutions, you can flexibly combine them in practice to solve your own data skew problems according to the situation.
This article is reproduced from: http://tech.meituan.com/spark-tuning-basic.html
