One article to teach you how to quickly solve Spark data skew!

  Hello everyone, I am 不温不火 (Buwenbuhuo), a sophomore computer science student focusing on big data. The nickname comes from the phrase 不温不火, which originally expressed my hope to keep a calm and even temperament. As a newcomer to the Internet industry, I write this blog partly to record my own learning process and partly to summarize the mistakes I have made, hoping it can help other beginners like me. Due to my limited experience, there will inevitably be mistakes in the blog; if you spot any, I hope you will point them out! For now there is only one platform, CSDN. Blog homepage: https://buwenbuhuo.blog.csdn.net/

  This blog post brings you an article that teaches you how to quickly solve Spark data skew!


Data skew in Spark mainly refers to skew that occurs during the shuffle process, caused by different keys corresponding to very different amounts of data.

For example, suppose the reduce side has to process a total of 1 million records. The first and second tasks are each allocated 10,000 records and finish within 5 minutes, while the third task is allocated 980,000 records and may take 10 hours to complete. The entire Spark job therefore takes 10 hours, which is the consequence of data skew.

Note the distinction between data skew and excessive data volume. Data skew means that a few tasks are allocated most of the data, so those few tasks run slowly; excessive data volume means that all tasks are allocated a similarly large amount of data, so all tasks run slowly.

Symptoms of data skew:

  1. Most tasks of a Spark job execute quickly, while only a few tasks execute very slowly. Data skew may be occurring; the job can still run, but it runs very slowly;
  2. Most tasks of a Spark job execute quickly, but some tasks suddenly report OOM during execution and keep failing with OOM even after several retries. Data skew may be occurring, and the job cannot run normally.

Locating the data skew problem:

  1. Check the shuffle operators in the code, such as reduceByKey, countByKey, groupByKey, join, and so on, and judge from the code logic whether data skew can occur there;
  2. Check the Spark job's log files. The log records an error down to a specific line of code; from the location of the exception you can determine which stage the error occurred in and which shuffle operator it corresponds to.

1. Aggregate original data

  • 1. Avoid the shuffle process

In most cases, the data source of a Spark job is a Hive table, which basically contains yesterday's data after ETL. To avoid data skew, we can consider avoiding the shuffle process altogether; if the shuffle is avoided, the possibility of data skew is fundamentally eliminated.

If the Spark job's data comes from a Hive table, you can pre-aggregate the data in Hive first, for example group by key and concatenate all values of the same key into a single string in a special format, so that each key has only one row. Afterwards, processing all the values of a key requires only a map operation and no shuffle at all. In this way the shuffle is avoided and data skew cannot occur.

When pre-aggregating the data in the Hive table, it is not necessary to concatenate values into a string; you can also directly accumulate the values for each key.
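As an illustration only, here is a minimal sketch of this idea, assuming a hypothetical Hive table user_visits(user_id, page) that has already been ETL'd; the table name, columns, and separator are assumptions, not from the original post.

```scala
import org.apache.spark.sql.SparkSession

object PreAggregatedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avoid-shuffle-by-pre-aggregation")
      .enableHiveSupport()
      .getOrCreate()

    // Pre-aggregate in Hive: one row per key, all values concatenated into a string.
    // Table and column names (user_visits, user_id, page) are hypothetical.
    val aggregated = spark.sql(
      """SELECT user_id, concat_ws(',', collect_list(page)) AS pages
        |FROM user_visits
        |GROUP BY user_id""".stripMargin)

    // Downstream processing now only needs a map over one row per key -- no shuffle.
    val pageCounts = aggregated.rdd.map { row =>
      val userId = row.getString(0)
      val pages  = row.getString(1).split(",")
      (userId, pages.length) // e.g. number of pages visited per user
    }

    pageCounts.take(10).foreach(println)
    spark.stop()
  }
}
```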

Again, be careful to distinguish between simply processing a large amount of data and data skew.

  • 2. Reduce the key granularity (increase the possibility of data skew and reduce the amount of data per task)

The increase in the number of keys may make the data skew more serious.

  • 3. Increase key granularity (reduce the possibility of data skew and increase the amount of data per task)

If there is no way to aggregate the data down to one record per key, in certain scenarios you can consider expanding the aggregation granularity of the key.

For example, there are currently 100,000 pieces of user data, and the current key granularity is (province, city, district, date). Now consider coarsening the granularity to (province, city, date). In this case the number of distinct keys is reduced, and the difference in data volume between keys may also be reduced, which can alleviate data skew. (This method is only effective for specific types of data; when the scenario does not fit, it will aggravate data skew.)
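A minimal sketch of coarsening the key, assuming a pair RDD keyed by a (province, city, district, date) tuple; the sample values and the aggregation (summing an amount) are assumptions for illustration.

```scala
// spark-shell style snippet: `sc` is the SparkContext provided by the shell.
// Records keyed by (province, city, district, date); the value is a hypothetical amount.
val fineGrained = sc.parallelize(Seq(
  (("Zhejiang", "Hangzhou", "Xihu", "2020-08-01"), 10.0),
  (("Zhejiang", "Hangzhou", "Binjiang", "2020-08-01"), 20.0),
  (("Zhejiang", "Ningbo", "Haishu", "2020-08-01"), 5.0)
))

// Coarsen the key from (province, city, district, date) to (province, city, date):
// fewer distinct keys, and per-key data volumes tend to even out.
val coarseGrained = fineGrained
  .map { case ((province, city, _, date), amount) => ((province, city, date), amount) }
  .reduceByKey(_ + _)

coarseGrained.collect().foreach(println)
```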

2. Filter the key that causes skew

If the Spark job is allowed to discard certain data, you can consider filtering out the keys that may cause data skew, that is, dropping the records whose keys are responsible for the skew; data skew will then no longer occur in that Spark job.
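A minimal sketch, assuming the skewed keys are already known (here a hypothetical set of key strings supplied by the caller):

```scala
// spark-shell style: `sc` is the SparkContext.
val data = sc.parallelize(Seq(("hot_key", 1), ("hot_key", 2), ("normal_key", 3)))

// Hypothetical set of keys known to cause skew and safe to discard.
val skewedKeys = Set("hot_key")

// Drop all records whose key is in the skewed set before any shuffle happens.
val filtered = data.filter { case (key, _) => !skewedKeys.contains(key) }

filtered.collect().foreach(println)
```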

3. Increase reduce-side parallelism in shuffle operations

When schemes 1 and 2 do not handle the data skew well, consider increasing the reduce-side parallelism of the shuffle process. Increasing the reduce-side parallelism increases the number of reduce-side tasks, so the amount of data allocated to each task decreases accordingly, thereby alleviating the data skew.

  • 1. Parallelism setting on reduce side

Most shuffle operators accept a parameter for setting the degree of parallelism, such as reduceByKey(500). This parameter determines the reduce-side parallelism of the shuffle; when the shuffle is performed, the corresponding number of reduce tasks is created.

For shuffle statements in Spark SQL, such as group by and join, you need to set the parameter spark.sql.shuffle.partitions, which controls the parallelism of shuffle read tasks. Its default value is 200, which is a bit too small for many scenarios.

Increasing the number of shuffle read tasks allows multiple keys originally assigned to one task to be assigned to multiple tasks, thereby allowing each task to process less data than before.

For example, if there are originally 5 keys, each corresponding to 10 records, and all 5 keys are assigned to one task, that task processes 50 records. After shuffle read tasks are added, each task may be assigned a single key, that is, each task processes 10 records, so the execution time of each task naturally becomes shorter.
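A minimal sketch of both settings; the parallelism value 500 follows the reduceByKey(500) example above, and the sample data is illustrative:

```scala
// spark-shell style: `sc` and `spark` are provided by the shell.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// RDD API: pass the desired reduce-side parallelism directly to the shuffle operator.
val counts = pairs.reduceByKey((x, y) => x + y, 500)

// Spark SQL: raise the shuffle partition count from the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "500")
```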

  • 2. Defects in the parallelism setting on the reduce side

Increasing the reduce-side parallelism does not fundamentally change the nature of data skew (schemes 1 and 2 fundamentally avoid it); it only eases and reduces the data pressure on the shuffle reduce tasks as much as possible. It is applicable when relatively many keys each correspond to a relatively large amount of data.

This scheme usually cannot completely solve data skew, because in extreme cases, for example when a single key corresponds to 1 million records, no matter how much you increase the number of tasks, that key's 1 million records will still be allocated to a single task, so data skew is bound to occur. This scheme is therefore only the first thing to try when data skew is found, an attempt to alleviate it with the simplest possible means, or something to use in combination with other schemes.

Ideally, after the reduce-side parallelism is increased, the data skew is alleviated to some extent or even basically eliminated. In some cases, however, it only makes the tasks that were running slowly because of data skew run slightly faster, or avoids OOM in some tasks while they still run slowly; in that case, give up on scheme 3 in time and start trying the later schemes.

4. Use random keys to achieve double aggregation

When using operators such as groupByKey and reduceByKey, you can consider using random keys to achieve double aggregation.
First, use a map operator to add a random number prefix to the key of each record, scattering the data so that the original identical keys become different keys, and then perform the first aggregation. In this way, data that was originally processed by a single task is distributed to multiple tasks for partial aggregation;

Subsequently, the prefix of each key is removed, and aggregation is performed again.
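A minimal sketch of the two-stage (salted) aggregation for a word-count style reduceByKey; the salt range of 10 and the sample data are assumptions:

```scala
import scala.util.Random

// spark-shell style: `sc` is the SparkContext.
val words = sc.parallelize(Seq("spark", "spark", "spark", "hive", "spark"))
val pairs = words.map(w => (w, 1))

val saltRange = 10 // assumed number of random prefixes

// Stage 1: prefix each key with a random salt and do a partial aggregation,
// so one hot key is spread across up to `saltRange` tasks.
val salted = pairs.map { case (key, value) =>
  (s"${Random.nextInt(saltRange)}_$key", value)
}
val partiallyAggregated = salted.reduceByKey(_ + _)

// Stage 2: strip the salt and aggregate again to get the final result per key.
val finalCounts = partiallyAggregated
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
  .reduceByKey(_ + _)

finalCounts.collect().foreach(println)
```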

This method works relatively well for data skew caused by operators such as groupByKey and reduceByKey. It is only suitable for aggregation-type shuffle operations and therefore has a relatively narrow scope of application.

If it is a shuffle operation of the join class, other solutions must be used.

This method is also one to try when the previous schemes have not produced good results.

5. Convert reduce join to map join

Under normal circumstances, a join operation performs a shuffle and is implemented as a reduce join, that is, all identical keys and their values are first gathered into one reduce-side task, and then the join is performed.

The process of an ordinary join is shown in the figure below:
[Figure: the ordinary (reduce) join process]
An ordinary join goes through the shuffle process; once shuffle happens, it is equivalent to pulling the data of the same key into one shuffle read task and then joining, which is a reduce join.

But if one RDD is relatively small, you can broadcast the full data of the small RDD and use a map operator to achieve the same effect as a join, that is, a map join. In this case no shuffle occurs, so no data skew can occur.

(Note that an RDD itself cannot be broadcast; the data inside the RDD must first be pulled to Driver memory via collect and then broadcast.)

  • Core idea

Do not use a join operator for the join; instead, implement the join with a Broadcast variable and a map-type operator, thereby completely avoiding the shuffle and the data skew it could cause.

Pull the data in the smaller RDD directly into the memory on the Driver side through the collect operator, and then create a Broadcast variable for it;

Then execute a map-type operator on the other RDD. Inside the operator function, obtain the full data of the smaller RDD from the Broadcast variable and compare it against each record of the current RDD by the join key; if the keys match, connect the data of the two RDDs in whatever way you need.

Following this approach, no shuffle occurs at all, and the data skew that the join operation might cause is fundamentally eliminated.
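A minimal sketch of the broadcast map join described above; the RDD contents and the inner-join semantics are assumptions for illustration:

```scala
// spark-shell style: `sc` is the SparkContext.
val bigRdd   = sc.parallelize(Seq((1, "order-A"), (2, "order-B"), (1, "order-C")))
val smallRdd = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

// Pull the small RDD to the Driver and broadcast it as a Map keyed by the join key.
val smallMap  = smallRdd.collectAsMap()
val broadcast = sc.broadcast(smallMap)

// Map over the big RDD and look up the join key in the broadcast map -- no shuffle.
val joined = bigRdd.flatMap { case (key, orderValue) =>
  broadcast.value.get(key).map(userValue => (key, (orderValue, userValue)))
}

joined.collect().foreach(println)
```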

When the join operation has data skew problems and one of the RDDs has a small amount of data, this method can be given priority, and the effect is very good. The process of map join is shown in the figure:
[Figure: the map join process]

  • Analysis of unsuitable scenarios

Since a Spark broadcast variable keeps a copy in every Executor, if both RDDs are large, turning one of the large RDDs into a broadcast variable is likely to cause memory overflow.

6. Use sampling to join the skewed key separately

In Spark, if an RDD has only one key, the data corresponding to this key will be broken up by default during the shuffle process and processed by different reduce tasks.

Therefore, when data skew is caused by a single key, you can extract that skewed key into a separate RDD and join that RDD with the other RDD on its own. According to Spark's execution mechanism, the data in this RDD will then be scattered across multiple tasks during the shuffle stage of the join. The process of joining the skewed key separately is shown in the figure:
[Figure: joining the skewed key separately]
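A minimal sketch of splitting out one skewed key, joining it separately, and unioning the results; the skewed key value and the sample data are assumptions:

```scala
// spark-shell style: `sc` is the SparkContext.
val left  = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)))
val right = sc.parallelize(Seq(("hot", "H"), ("cold", "C")))

val skewedKey = "hot" // assumed: identified beforehand, e.g. via countByKey

// Split the left RDD into the skewed part and the rest.
val leftSkewed = left.filter { case (k, _) => k == skewedKey }
val leftRest   = left.filter { case (k, _) => k != skewedKey }

// Join each part separately, then union the results.
val joinedSkewed = leftSkewed.join(right)
val joinedRest   = leftRest.join(right)
val joined       = joinedSkewed.union(joinedRest)

joined.collect().foreach(println)
```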

  • Analysis of suitable scenarios

For the data in the RDD, you can either convert it into an intermediate table or simply call countByKey() to see how much data corresponds to each key. If you find that only one key in the entire RDD has an especially large amount of data, you can consider using this method.

When the amount of data is very large, you can use sample to take, say, 10% of the data, analyze which key in that sample is likely to cause data skew, and then extract the data for that key separately.
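A minimal sketch of spotting a skewed key via sampling; the 10% sampling fraction comes from the text, the rest (data and variable names) is illustrative:

```scala
// spark-shell style: `sc` is the SparkContext.
val pairs = sc.parallelize(Seq.fill(1000)(("hot", 1)) ++ Seq.fill(10)(("cold", 1)))

// Sample roughly 10% of the data without replacement and count records per key.
val sampledCounts = pairs.sample(withReplacement = false, fraction = 0.1).countByKey()

// The key with the largest sampled count is the most likely cause of the skew.
val likelySkewedKey = sampledCounts.maxBy(_._2)._1
println(s"likely skewed key: $likelySkewedKey")
```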

  • Analysis of unsuitable scenarios

If there are many keys in the RDD that cause data skew, this scheme is not applicable.

7. Join using random numbers and expansion

If a large number of keys in an RDD cause data skew during a join, splitting out individual keys is pointless. In that case only this last scheme can solve the problem: for the join operation, consider expanding the data of one RDD and diluting the other RDD before joining.

By appending a random prefix we turn identical keys into different keys, so these processed "different keys" can be distributed to multiple tasks instead of one task having to process a large number of identical keys.

This solution is aimed at the situation where there are a large number of skewed keys, and it is impossible to split some keys for separate processing. Data expansion of the entire RDD is required, which requires high memory resources.

  • Core idea

Pick one RDD and use flatMap to expand it: for each record, add every numeric prefix from 1 to N to the key, mapping one record into N records (expansion);

Pick the other RDD and perform a map: prefix the key of each record with a single random number from 1 to N (dilution); then join the two processed RDDs and strip the prefix from the keys of the result, as sketched below.
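A minimal sketch of the expansion-plus-dilution join, with N = 3 and tiny sample RDDs as assumptions:

```scala
import scala.util.Random

// spark-shell style: `sc` is the SparkContext.
val skewedRdd = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)))
val otherRdd  = sc.parallelize(Seq(("hot", "H"), ("cold", "C")))

val n = 3 // assumed expansion factor

// Expansion: every record of one RDD is duplicated N times, once per prefix 1..N.
val expanded = otherRdd.flatMap { case (key, value) =>
  (1 to n).map(prefix => (s"${prefix}_$key", value))
}

// Dilution: each record of the skewed RDD gets a single random prefix 1..N.
val diluted = skewedRdd.map { case (key, value) =>
  (s"${Random.nextInt(n) + 1}_$key", value)
}

// Join on the prefixed keys, then strip the prefix to recover the original key.
val joined = diluted.join(expanded).map { case (prefixedKey, values) =>
  (prefixedKey.split("_", 2)(1), values)
}

joined.collect().foreach(println)
```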

  • Limitations

If both RDDs are large, expanding one of them N times is obviously not feasible; the expansion approach can only alleviate data skew, not solve it completely.

  That is all for this sharing.



  A good book never tires of being read a hundred times. If I want to be the most brilliant person in the room, I must keep acquiring knowledge through learning, use knowledge to change my destiny, use the blog to witness my growth, and use action to prove that I am working hard.
  If my blog helps you, or if you like its content, please like, comment, and bookmark it! I hear that people who hit like will not have bad luck and will be full of energy every day! And if you just want to read for free, I wish you happiness every day; you are welcome to visit my blog often.
  Writing is not easy, and your support is what keeps me going. Don't forget to follow me after you like the post!

