Spark Performance Tuning and Fault Handling (5): Spark Data Skew Optimization

The data skew problem in Spark mainly refers to skew that appears during the shuffle process: different keys correspond to different amounts of data, so different tasks end up processing very different amounts of data.

For example, suppose the reduce side has to process a total of 1 million records. The first and second tasks are each allocated 10,000 records and finish within 5 minutes, while the third task is allocated 980,000 records and may take 10 hours to complete. The whole Spark job therefore takes 10 hours to finish; this is the consequence of data skew.

Note: it is necessary to distinguish between data skew and excessive data volume.
(1) Data skew means that a few tasks are allocated most of the data, so only those few tasks run slowly;
(2) Excessive data volume means that every task is allocated a similarly large amount of data, so all tasks run slowly.

Symptoms of data skew:

(1) Most tasks of the Spark job execute quickly and only a few tasks execute very slowly. In this case data skew may be present; the job can still run, but it runs very slowly.
(2) Most tasks of the Spark job execute quickly, but some tasks suddenly report OOM while running, and after several retries a certain task still reports an OOM error. In this case data skew may be present and the job cannot run normally.

Locating the data skew problem:

(1) Check the shuffle operators in the code, such as reduceByKey, countByKey, groupByKey and join, and judge from the code logic whether data skew can occur there;
(2) Check the log file of the Spark job. The log records errors down to a specific line of code, so the location of the exception can be used to determine which stage the error occurred in and which shuffle operator it corresponds to.

1. Aggregate the original data

That is: avoid the shuffle process altogether, thereby fundamentally eliminating the possibility of data skew.

In most cases, the data sources of Spark jobs are Hive tables, which are basically yesterday's data after ETL.

In order to avoid data skew, we can consider avoiding the shuffle process. If the shuffle process is avoided, then the possibility of data skew is fundamentally eliminated.

If the data of the Spark job comes from a Hive table, you can first aggregate the data inside Hive, for example by grouping by key and concatenating all values of the same key into one string in a special format, so that each key has exactly one record. Afterwards, processing all values of a key only requires a map operation and no shuffle at all. With this approach the shuffle is avoided, and data skew is very unlikely to occur.

When preparing the data in the Hive table, it is not strictly necessary to concatenate the values into one string; you can also directly pre-accumulate (for example, sum up) the values for each key.
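As a rough illustration of this idea, the pre-aggregation could be done in the ETL layer roughly as follows. This is only a minimal sketch using Spark SQL with Hive support; the table name user_visits and the columns key and value are hypothetical placeholders.

```scala
// Minimal sketch: pre-aggregate in the ETL layer so each key maps to exactly one row.
// Table/column names (user_visits, key, value) are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Concatenate all values of a key into one specially formatted string.
spark.sql(
  """
    |CREATE TABLE user_visits_agg AS
    |SELECT key, concat_ws('|', collect_list(value)) AS values_concat
    |FROM user_visits
    |GROUP BY key
  """.stripMargin)

// Downstream Spark jobs can now process each key with a plain map, no shuffle needed:
val agg = spark.table("user_visits_agg").rdd
  .map(row => (row.getString(0), row.getString(1).split("\\|")))
```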

2. Filter out the keys that cause skew

If the Spark job is allowed to discard some data, you can consider filtering out the keys that may cause data skew, i.e. dropping the records corresponding to those keys; this way, data skew will not occur in the Spark job.
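For instance, if business logic allows dropping a handful of problematic keys, a sketch like the following could be used (the pair RDD `rdd`, the key set `skewedKeys`, and an existing SparkContext `sc` are all assumptions for illustration):

```scala
// Minimal sketch: drop records whose key is known (or suspected) to cause skew.
// `rdd`, `skewedKeys` and `sc` are hypothetical / assumed to already exist.
val skewedKeys: Set[String] = Set("hot_key_1", "hot_key_2")
val skewedKeysBc = sc.broadcast(skewedKeys)

val filtered = rdd.filter { case (key: String, _) =>
  !skewedKeysBc.value.contains(key)   // keep only records whose key is not skewed
}
```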

3. Increase the reduce-side parallelism of shuffle operations

When the two methods above do not handle the data skew well, you can consider increasing the reduce-side parallelism of the shuffle process. Increasing the reduce-side parallelism increases the number of reduce-side tasks, so the amount of data allocated to each task decreases accordingly, which alleviates the data skew problem.

3.1 Setting the reduce-side parallelism

Most shuffle operators accept a parameter that sets the degree of parallelism, for example reduceByKey(500). This parameter determines the reduce-side parallelism of the shuffle: when the shuffle is performed, the specified number of reduce tasks is created. For shuffle statements in Spark SQL, such as group by and join, you need to set the parameter spark.sql.shuffle.partitions, which represents the parallelism of the shuffle read tasks. Its default value is 200, which is a bit too small for many scenarios.

Increasing the number of shuffle read tasks lets the multiple keys originally allocated to one task be spread over multiple tasks, so each task processes less data than before. For example, if there are originally 5 keys, each with 10 records, and all 5 keys are assigned to one task, that task processes 50 records. After increasing the number of shuffle read tasks, each task is assigned one key, i.e. each task processes 10 records, so the execution time of each task naturally becomes shorter.
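Both knobs mentioned above can be set as in the following sketch (the pair RDD `pairs`, the table `t`, and an existing SparkSession `spark` are assumptions for illustration):

```scala
// Minimal sketch: raise reduce-side parallelism for RDD shuffles and Spark SQL shuffles.
// `pairs` is a hypothetical (key, value) RDD; `spark` is an existing SparkSession; table `t` is hypothetical.

// RDD API: pass the desired number of reduce tasks directly to the shuffle operator.
val summed = pairs.reduceByKey(_ + _, 500)   // 500 reduce-side tasks instead of the default

// Spark SQL: raise the shuffle partition count above the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "800")
val grouped = spark.sql("SELECT key, count(*) FROM t GROUP BY key")   // group by now uses 800 shuffle partitions
```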

3.2 Limitations of increasing the reduce-side parallelism

Increasing the reduce-side parallelism does not fundamentally change the nature of data skew (schemes 1 and 2 fundamentally avoid data skew); it only tries to ease and reduce the data pressure on the shuffle reduce tasks and mitigate the skew. It is suitable for situations where there are relatively many keys and the data volume is relatively large.

This scheme usually cannot completely solve data skew, because in extreme cases, for example when a single key corresponds to 1 million records, no matter how many tasks you add, the key with 1 million records will still be assigned to one task, so data skew is still bound to occur. Therefore, this scheme is only the first thing to try when data skew is discovered: try to alleviate the skew with the simplest possible means, or use it in combination with other schemes.

In the ideal case, increasing the reduce-side parallelism relieves the data skew to a certain extent, and may even basically eliminate it. In some cases, however, it only makes the tasks that were slow due to data skew run slightly faster, or avoids OOM errors in some tasks while they still run slowly. In that situation you should give up on scheme 3 in time and start trying the later schemes.

4. Use random keys for two-stage aggregation

When using operators such as groupByKey and reduceByKey, you can consider using random keys to achieve a two-stage (double) aggregation, as shown in the figure below:
[Figure: two-stage aggregation with random key prefixes]
First, use a map operator to add a random-number prefix to the key of each record, which breaks the keys apart and turns records that originally had the same key into records with different keys; then perform the first aggregation, so that the data originally processed by a single task is distributed across multiple tasks for partial aggregation. After that, remove the prefix from each key and aggregate again to obtain the final result.
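A minimal sketch of this two-stage ("salted") aggregation, assuming a hypothetical pair RDD `pairs: RDD[(String, Long)]` and a prefix range of 10 (both are illustrative choices, not from the original):

```scala
// Minimal sketch: two-stage aggregation with a random key prefix ("salting").
// `pairs` is a hypothetical RDD[(String, Long)]; the prefix range (10) is arbitrary.
import scala.util.Random

// Stage 1: salt each key with a random prefix and do a partial aggregation.
val partiallyAggregated = pairs
  .map { case (key, value) => (s"${Random.nextInt(10)}_$key", value) }
  .reduceByKey(_ + _)

// Stage 2: strip the prefix and aggregate again to get the final result per original key.
val fullyAggregated = partiallyAggregated
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
  .reduceByKey(_ + _)
```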

This method works well for data skew caused by aggregation operators such as groupByKey and reduceByKey, but it only applies to aggregation-type shuffle operations, so its applicable range is relatively narrow. For join-type shuffle operations, other solutions have to be used.

This method is also worth trying when the previous schemes have not produced good results.

5. Convert reduce join to map join

Under normal circumstances, a join operation goes through the shuffle process and performs a reduce join: all identical keys and their corresponding values are first gathered into one reduce-side task and then joined.

The process of an ordinary join is shown in the figure below:
[Figure: ordinary (reduce-side) join through the shuffle]
An ordinary join goes through the shuffle process. Once there is a shuffle, the data of the same key is pulled into one shuffle read task and joined there, which is a reduce join. However, if one of the RDDs is relatively small, you can broadcast the full data of the small RDD and use a map-class operator to achieve the same effect as the join, i.e. a map join. In that case no shuffle operation occurs, and no data skew can occur.

Note: an RDD itself cannot be broadcast. The data inside the RDD must first be pulled to Driver memory with collect and only then broadcast.

1. The core idea:

Instead of using the join operator for the join operation, use a Broadcast variable together with a map-class operator to implement the join, thereby completely avoiding the shuffle and the data skew it may cause. Pull the data of the smaller RDD to the Driver's memory with the collect operator and create a Broadcast variable from it; then run a map-class operator on the other RDD. Inside the operator function, fetch the full data of the smaller RDD from the Broadcast variable and compare it with each record of the current RDD by the join key; if the join keys are equal, connect the data of the two RDDs in whatever way you need.

Following this approach, no shuffle operation happens at all, and the data skew problem that the join operation might otherwise cause is fundamentally eliminated.
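A minimal sketch of this broadcast-based map join, assuming two hypothetical pair RDDs `largeRdd` and `smallRdd` (both `RDD[(String, String)]`), unique join keys in the small RDD, and an existing SparkContext `sc`:

```scala
// Minimal sketch: replace a shuffle join with "broadcast the small RDD + map over the large RDD".
// `largeRdd`, `smallRdd` and `sc` are assumed to already exist; the small RDD's keys are assumed unique.

// 1. Pull the small RDD's data to the Driver, build a lookup map, and broadcast it.
val smallMap: Map[String, String] = smallRdd.collect().toMap
val smallMapBc = sc.broadcast(smallMap)

// 2. Join on the map side: look each key up in the broadcast map; no shuffle is involved.
val joined = largeRdd.flatMap { case (key, leftValue) =>
  smallMapBc.value.get(key) match {
    case Some(rightValue) => Some((key, (leftValue, rightValue)))   // inner-join semantics
    case None             => None                                   // drop keys with no match
  }
}
```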

When a join operation has a data skew problem and one of the RDDs has a small amount of data, this method can be considered first; its effect is very good. The process of a map join is shown in the figure below:
[Figure: map join implemented with a broadcast variable, no shuffle]

2. Analysis of unsuitable scenarios:

Since Spark's broadcast variables keep one copy in every Executor, if both RDDs have a relatively large data volume, turning the larger one into a broadcast variable is likely to cause memory overflow.

6. Sample and join the skewed key separately

In Spark, if an RDD contains only a single key, the data for that key is by default spread out during the shuffle process and processed by different reduce-side tasks.

When data skew is caused by a single key, the data for that skewed key can be extracted separately to form its own RDD, and this RDD is then joined on its own with the other RDD that would otherwise take part in the skewed join. According to Spark's operating mechanism, the data of this single-key RDD will then be distributed to multiple tasks during the shuffle stage of the join, instead of piling up in one task.

The process of extracting the skewed key and joining it separately is shown in the figure below:
[Figure: extracting the skewed key and joining it separately]
1. Applicable scenario analysis:
For the data in an RDD, you can convert it into an intermediate table, or directly use countByKey() to look at the amount of data per key in the RDD. If you find that one single key in the whole RDD accounts for an exceptionally large amount of data, you can consider using this method.

When the amount of data is very large, you can consider using sample to draw, say, 10% of the data, analyse which key in this 10% is likely to cause data skew, and then extract the data for that key separately.
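A minimal sketch of this procedure, assuming hypothetical pair RDDs `rddA` (the skewed side) and `rddB`, both `RDD[(String, String)]`; the 10% sampling fraction follows the text, the rest of the names are illustrative:

```scala
// Minimal sketch: find the skewed key via sampling, join it separately, then union the results.
// `rddA` (skewed) and `rddB` are hypothetical RDD[(String, String)].

// 1. Sample ~10% of the data (without replacement) and count records per key on the Driver.
val sampledCounts = rddA.sample(withReplacement = false, fraction = 0.1).countByKey()
val skewedKey: String = sampledCounts.maxBy(_._2)._1   // the key with the most sampled records

// 2. Split both RDDs into "skewed key only" and "everything else".
val skewedA = rddA.filter { case (k, _) => k == skewedKey }
val normalA = rddA.filter { case (k, _) => k != skewedKey }
val skewedB = rddB.filter { case (k, _) => k == skewedKey }
val normalB = rddB.filter { case (k, _) => k != skewedKey }

// 3. Join the two parts independently and union the results.
val joined = skewedA.join(skewedB).union(normalA.join(normalB))
```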

2. Analysis of unsuitable scenarios:
If there are many keys that cause data skew in an RDD, then this solution is not applicable.

7. Use random prefixes and RDD expansion for the join

If a large number of keys in the RDD cause data skew during the join operation, splitting out individual keys is pointless. In that case only the last scheme remains: for the join, consider expanding the data of one RDD and diluting the other RDD with random prefixes before joining.

Keys that used to be identical are turned into different keys by attaching random prefixes, so these processed "different keys" are spread across multiple tasks instead of one task handling a large number of identical keys. This scheme targets the case where there are a large number of skewed keys and it is impossible to split out a few keys for separate processing; the data of the entire RDD has to be expanded, which demands a lot of memory.

1. The core idea:

Choose one RDD and expand it with flatMap: map each record to N records, attaching each numeric prefix from 1 to N to the key (expansion);

Choose the other RDD and perform a map operation: prefix the key of each record with one random number from 1 to N (dilution);

Join the two processed RDDs.
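A minimal sketch of this expand-and-dilute join, assuming hypothetical pair RDDs `skewedRdd` (many skewed keys) and `otherRdd`, both `RDD[(String, String)]`, and an illustrative expansion factor N = 10 (here the prefixes run 0 to N-1 rather than 1 to N, which is equivalent):

```scala
// Minimal sketch: random-prefix ("dilute") the skewed RDD, expand the other RDD N times, then join.
// `skewedRdd` and `otherRdd` are hypothetical RDD[(String, String)]; N = 10 is arbitrary.
import scala.util.Random

val N = 10

// Expansion: every record of the other RDD is duplicated N times, once per prefix 0..N-1.
val expanded = otherRdd.flatMap { case (key, value) =>
  (0 until N).map(prefix => (s"${prefix}_$key", value))
}

// Dilution: each record of the skewed RDD gets one random prefix in the same 0..N-1 range.
val diluted = skewedRdd.map { case (key, value) =>
  (s"${Random.nextInt(N)}_$key", value)
}

// Join on the prefixed keys, then strip the prefix to recover the original key.
val joined = diluted.join(expanded)
  .map { case (prefixedKey, values) => (prefixedKey.split("_", 2)(1), values) }
```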

[Figure: joining after expanding one RDD and adding random prefixes to the other]
2. Limitations:
If both RDDs are very large, expanding an RDD by a factor of N is obviously not feasible; the expansion method can only alleviate data skew, it cannot solve the problem completely.

3. Use the idea of scheme 7 to further optimize scheme 6:
When several keys in the RDD cause data skew, scheme 6 is no longer applicable, and scheme 7 is very resource-intensive. In this case, the idea of scheme 7 can be borrowed to improve scheme 6 (a code sketch follows the list below):

(1) For the RDD that contains a few keys with excessively large data volume, use the sample operator to draw a sample, then count the number of records per key to determine which keys have the largest data volume.

(2) Then split the data corresponding to these keys out of the original RDD to form a separate RDD, and prefix each of its keys with a random number within n; the majority of keys, which do not cause skew, form another RDD.

(3) Then, from the other RDD that needs to be joined, also filter out the data corresponding to the skewed keys to form a separate RDD, expand each record into n records, and attach the prefixes 0 through n to these n records in order; the majority of keys, which do not cause skew, form another RDD.

(4) Join the separated RDD that carries random prefixes with the other separated RDD that has been expanded n times. The data of what was originally one key is now broken into n parts and distributed to multiple tasks for the join.

(5) The remaining two ordinary RDDs (without the skewed keys) are joined as usual.

(6) Finally, use the union operator to combine the results of the two joins; this is the final join result.
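Putting steps (1) through (6) together, a minimal sketch might look like the following. The RDDs `rddA` (containing a few heavily skewed keys) and `rddB`, the salt range n = 10, the 10% sampling fraction, the top-3 cutoff, and the SparkContext `sc` are all illustrative assumptions.

```scala
// Minimal sketch of the combined approach: split out the skewed keys, salt/expand only that part,
// join the normal part as usual, then union. `rddA`, `rddB` and `sc` are assumed; n = 10 is arbitrary.
import scala.util.Random

val n = 10

// (1) Sample rddA and pick the few keys with the largest counts (here: top 3).
val topSkewedKeys = rddA.sample(withReplacement = false, fraction = 0.1)
  .countByKey().toSeq.sortBy(-_._2).take(3).map(_._1).toSet
val skewedKeysBc = sc.broadcast(topSkewedKeys)

// (2) Split rddA; salt the skewed part with a random prefix in 0..n-1.
val skewedA = rddA.filter { case (k, _) => skewedKeysBc.value.contains(k) }
  .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
val normalA = rddA.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

// (3) Split rddB; expand the skewed part n times, one copy per prefix 0..n-1.
val skewedB = rddB.filter { case (k, _) => skewedKeysBc.value.contains(k) }
  .flatMap { case (k, v) => (0 until n).map(p => (s"${p}_$k", v)) }
val normalB = rddB.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

// (4) Join the salted part and strip the prefix afterwards.
val joinedSkewed = skewedA.join(skewedB)
  .map { case (pk, vs) => (pk.split("_", 2)(1), vs) }

// (5) Join the normal part as usual.
val joinedNormal = normalA.join(normalB)

// (6) Union the two results to get the final join output.
val result = joinedSkewed.union(joinedNormal)
```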


Origin blog.csdn.net/weixin_43520450/article/details/108651456