Spark data skew and its solutions

  • This article first appeared on the vivo Internet Technology WeChat public account: https://mp.weixin.qq.com/s/lqMu6lfk-Ny1ZHYruEeBdA
  • About the author: Zheng Zhibin graduated from South China University of Technology with a degree in Computer Science and Technology (bilingual class). He has worked on development and architecture in e-commerce, open platforms, mobile browsers, recommendation advertising, big data, and artificial intelligence. He currently works at vivo's intelligent platform center on AI platform construction and the recommendation advertising business, and specializes in business architecture, platforms, and solutions for a variety of business forms.

Starting from the harm, symptoms, and causes of data skew, this article elaborates in depth on data skew in Spark and its solutions.

1. What is data skew?

For big data distributed systems such as Spark and Hadoop, a large volume of data is not the scary part; data skew is.

In a distributed system, ideally, overall runtime decreases linearly as the system grows (more nodes). If processing a dataset on one machine takes 120 minutes, increasing the number of machines to three should ideally bring the processing time down to 120 / 3 = 40 minutes. However, for each machine's execution time to be 1/N of the single-machine time, the work must be divided evenly across machines. Unfortunately, task allocation is often uneven, sometimes extremely so: most of the work ends up on a few machines, while the majority of machines handle only a small fraction. For example, one machine may be responsible for 80% of the work while two other machines each handle only 10%.

"Do not suffer much and suffer from inequality." This is the biggest problem in distributed environments. Means that computing power is not linear expansion, but there is a short board effect: Stage a time-consuming, it is determined by the slowest of the Task.

Since all tasks within the same stage perform the same computation, and leaving aside differences in computing power between nodes, the difference in running time between tasks is determined mainly by the amount of data each task processes. So, to reap the benefits of a distributed parallel computing system, the data skew problem must be solved.

2. The harm of data skew

When data skew occurs, a small number of tasks take far longer than the rest, so the overall runtime becomes too long and the parallelism of the distributed system cannot be fully exploited.

Moreover, when data skew occurs, the tasks that process the oversized portions of data may run out of memory and fail, which can in turn cause the entire application to fail.

3. Symptoms of data skew

If you notice the following phenomena, data skew is very likely the cause:

  • The vast majority of tasks finish very quickly, but a few individual tasks run extremely slowly, and the whole job seems stuck at a certain stage and cannot finish.

  • A Spark job that used to run normally suddenly throws an OOM (out of memory) exception one day; inspecting the exception stack shows it comes from the business code we wrote. This situation is relatively rare.

TIPS

Data skew is especially likely in Spark Streaming programs, particularly those containing SQL-like operations such as join or group by. Because Spark Streaming jobs are usually not given a particularly large amount of memory, any data skew during processing can easily cause an OOM.

4. Causes of data skew

When a shuffle is performed, the data for the same key on every node must be pulled into a single task on one node, for example to aggregate or join by key. If a single key corresponds to an especially large amount of data, data skew occurs. For example, most keys may correspond to only 10 records each, but an individual key corresponds to 1 million records; then most tasks are assigned only 10 records and finish in about a second, while an individual task is assigned 1 million records and runs for an hour or two.

Therefore, when data skew appears, the Spark job seems to run very slowly, and the task that processes an excessive amount of data may even cause a memory overflow.

5. Finding and locating the problem

1. Via the Spark Web UI

Use the Spark Web UI to look at the amount of data (Shuffle Read Size / Records) assigned to each task of the currently running stage, and determine whether an uneven distribution of data across tasks has caused data skew.

Once you know which stage the data skew occurs in, work out, according to the stage-splitting rules, which part of the code corresponds to that skewed stage; that part of the code is bound to contain a shuffle-type operator. The distribution of each key can be inspected with countByKey.
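
For example, a minimal sketch of inspecting the key distribution with countByKey (the input path and the assumption that the key is the first CSV field are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skew-check").getOrCreate()
val sc = spark.sparkContext

// Hypothetical pair RDD whose key is the join/aggregation key.
val pairs = sc.textFile("hdfs:///path/to/input")
  .map(line => (line.split(",")(0), 1))

// countByKey brings a Map back to the driver, so only use it when the number of
// distinct keys is small; otherwise sample first (see the next subsection).
pairs.countByKey().toSeq.sortBy(-_._2).take(10).foreach(println)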

TIPS

Data skew occurs only during a shuffle. Here is a list of common operators that may trigger a shuffle: distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, and so on. When data skew occurs, it is most likely caused by one of these operators in your code.

2. Key statistics

You can also verify it by sampling the data and counting how often each key appears.

Because the data volume is huge, you can sample the data, count the number of occurrences of each key, and sort by count in descending order to take the top few:

df.select("key").sample(false, 0.1)           // 数据采样
    .(k => (k, 1)).reduceBykey(_ + _) // 统计 key 出现的次数 .map(k => (k._2, k._1)).sortByKey(false) // 根据 key 出现次数进行排序 .take(10) // 取前 10 个。

If you find that most keys are fairly evenly distributed while a few individual keys have several orders of magnitude more data than the rest, then data skew has occurred.

6. How to mitigate data skew

Basic ideas

  • Business logic: optimize the skew away at the business-logic level. For example, when counting orders per city, count the few first-tier cities separately first, and then merge the result with the other cities.

  • Program design: for example, in Hive a count(distinct) often ends up in a single reducer; we can group by first and then wrap a count around the outer query (a sketch of this rewrite follows this list). In Spark, use reduceByKey instead of groupByKey, and so on.

  • Parameter tuning: Hadoop and Spark ship with many parameters and mechanisms for handling data skew; used appropriately, they solve most of the problems.
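
For illustration, a minimal sketch of the count(distinct) rewrite, written here with Spark SQL from Scala (the table name orders and the column user_id are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distinct-count-rewrite").getOrCreate()

// Skew-prone form: in Hive, count(distinct) often funnels every value through a single reducer.
val slow = spark.sql("SELECT COUNT(DISTINCT user_id) FROM orders")

// Rewritten form: deduplicate with GROUP BY first (spread over many tasks),
// then count the grouped rows in an outer query.
val fast = spark.sql("SELECT COUNT(1) FROM (SELECT user_id FROM orders GROUP BY user_id) t")

fast.show()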

Approach 1: Filter out abnormal data

If the keys that cause the skew are abnormal data, simply filter them out.

First analyze the keys to determine which ones cause the data skew. The specific method has been introduced above and is not repeated here.

Then analyze the records corresponding to those keys. They usually fall into one of three categories:

  1. Null values or outliers; most cases are caused by these.

  2. Invalid data: large amounts of duplicate data, or test data that has little effect on the result.

  3. Valid data: the skew is caused by the normal distribution of business data.

Solution

For cases 1 and 2, simply filter the data out.

Case 3 requires special handling, which we discuss in detail below.

Approach 2: Increase the shuffle parallelism

When Spark performs a shuffle, it uses HashPartitioner by default (this is unrelated to Hash Shuffle) to partition the data. If the parallelism is set inappropriately, a large amount of data belonging to many different keys may be assigned to the same task, making that task process far more data than the others and causing data skew.

If the shuffle parallelism is adjusted so that different keys originally assigned to the same task are scattered across different tasks, the amount of data each task has to process shrinks, which alleviates the short-board effect caused by data skew.

(1) Operation

For the RDD API, the desired parallelism can be set directly on the shuffle operator, or via spark.default.parallelism. For Spark SQL, the parallelism can be set with SET spark.sql.shuffle.partitions=[num_tasks]. The defaults of these parameters depend on the cluster manager.

For DataFrames and Spark SQL, the shuffle concurrency is controlled by spark.sql.shuffle.partitions=[num_tasks]; the default is 200.
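
A minimal sketch of both ways of setting it (the input path and the tables a and b are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tune-parallelism").getOrCreate()
val sc = spark.sparkContext

// RDD API: pass the desired number of partitions straight to the shuffle operator.
val pairs = sc.textFile("hdfs:///path/to/input").map(line => (line.split(",")(0), 1))
val counts = pairs.reduceByKey(_ + _, 1000)   // 1000 shuffle tasks instead of the default

// Spark SQL / DataFrame API: raise the shuffle partition count before the join or aggregation.
spark.sql("SET spark.sql.shuffle.partitions=1000")
val joined = spark.sql("SELECT a.key, b.value FROM a JOIN b ON a.key = b.key")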

(2) Applicable scenarios

A large number of different keys are assigned to the same task, making that task's data volume too large.

(3) Solution

Adjust the parallelism. Usually this means increasing it, but sometimes reducing it achieves the desired effect.

(4) Advantages

Simple: only a parameter needs to be tuned, so the problem can be solved at the lowest cost. If data skew appears, you can try this approach a few times first; if it does not solve the problem, try the other methods.

(5) Disadvantages

The applicable scenarios are limited: it only makes each task handle fewer distinct keys. It cannot resolve skew caused by a single oversized key; if one key is extremely large, the task that processes it will still be skewed even if it handles nothing else. The method generally only alleviates data skew rather than eliminating it completely. In practice its effect is mediocre.

TIPS Data skew can be thought of as analogous to hash collisions: increasing the parallelism is like enlarging the hash table.

Approach 3: Custom Partitioner

(1) Principle

Use a custom Partitioner (instead of the default HashPartitioner) so that keys originally assigned to the same task are assigned to different tasks.

For example, apply a custom Partitioner to a groupByKey operator:

.groupByKey(new Partitioner() {
    @Override
    public int numPartitions() { return 12; }

    @Override
    public int getPartition(Object key) {
        int id = Integer.parseInt(key.toString());
        if (id >= 9500000 && id <= 9500084 && ((id - 9500000) % 12) == 0) {
            return (id - 9500000) / 12;
        }
        return id % 12;
    }
})

TIPS This approach is equivalent to customizing the hash function of a hash table.

(2) Applicable scenarios

A large number of different keys are assigned to the same task, making that task's data volume too large.

(3) Solution

Use a custom Partitioner implementation instead of the default HashPartitioner, and distribute the different keys as evenly as possible across the tasks.

(4) Advantages

It does not affect the original parallelism design. (Changing the parallelism instead would change the default parallelism of subsequent stages along with it, which may affect them.)

(5) Disadvantages

The applicable scenarios are limited: it can only spread out different keys and does not help when the data for a single key is extremely large. The effect is similar to adjusting the parallelism: it alleviates data skew but does not eliminate it. It also requires a Partitioner tailored to the characteristics of the data, so it is not very flexible.

Approach 4: Turn the reduce-side join into a map-side join

Using Spark's broadcast mechanism, the reduce-side join is converted into a map-side join. Spark then no longer needs to shuffle data across nodes; the join is done directly from local data instead, which completely eliminates the data skew that the shuffle would bring.

from pyspark.sql.functions import broadcast
result = broadcast(A).join(B, ["join_col"], "left")

Here A is the smaller of the two DataFrames and can fit entirely in executor memory.

(1) Applicable scenarios

One side of the join is small enough to be loaded into the driver and broadcast to every executor via the Broadcast mechanism.

(2) Solution

In Java/Scala code, collect the small data set to the driver and then broadcast it to every executor as a broadcast variable; or, in SQL, raise the broadcast threshold so that the small table is broadcast. Either way, the reduce-side join is replaced by a map-side join.
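
A minimal RDD-level sketch of this idea in Scala (the input paths, the CSV key assumption, and the premise that the small side fits in memory are all assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-side-join").getOrCreate()
val sc = spark.sparkContext

// Hypothetical (key, value) inputs; the small side must fit in driver and executor memory.
val bigRdd   = sc.textFile("hdfs:///path/to/big").map(l => (l.split(",")(0), l))
val smallRdd = sc.textFile("hdfs:///path/to/small").map(l => (l.split(",")(0), l))

// Pull the small data set to the driver and broadcast it to every executor.
val smallMap = sc.broadcast(smallRdd.collectAsMap())

// Join on the map side with no shuffle: look each key up in the broadcast map.
val joined = bigRdd.flatMap { case (k, v) =>
  smallMap.value.get(k).map(sv => (k, (v, sv)))
}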

(3) Advantages

The shuffle is avoided, so the precondition for data skew is completely removed, and performance can improve dramatically.

(4) Disadvantages

Because the small data set is first broadcast to every executor, one of the data sets participating in the join must be small enough. The technique mainly applies to join scenarios and not to aggregation scenarios, so its applicability is limited.

NOTES

When using Spark SQL, the broadcast threshold must be set large enough for the broadcast to take effect, e.g. SET spark.sql.autoBroadcastJoinThreshold=104857600.

Approach 5: Split, join separately, then union

The idea is simple: split the join into a join over the skewed data and a join over the non-skewed data, then union the results (a code sketch of these steps follows the tips below):

  1. For the RDD in which a handful of keys carry an excessive amount of data (assume it is leftRDD), take a sample with the sample operator, count the number of records per key, and work out the few keys with the largest data volume. The specific method has been introduced above and is not repeated here.

  2. Filter the data for those k keys out of leftRDD into a separate RDD, prefixing each key with a random number in the range 1 to n, to form leftSkewRDD; the remaining keys, which do not cause skew, form another RDD, leftUnSkewRDD.

  3. For rightRDD, the other side of the join, likewise filter out the data for the skewed keys, and use flatMap to turn each of its records into n records (prefixed in turn with 1 through n), forming rightSkewRDD; the non-skewed keys likewise form rightUnSkewRDD.

  4. Join leftSkewRDD with the n-times-expanded rightSkewRDD, removing the random prefixes in the process, to obtain skewedJoinRDD, the join result of the skewed data sets. Note that at this point we have successfully split each originally identical key into n parts that are joined across multiple tasks.

  5. Join leftUnSkewRDD with rightUnSkewRDD to obtain unskewedJoinRDD.

  6. Union skewedJoinRDD with unskewedJoinRDD to obtain the complete join result set.

TIPS

  1. The data in rightRDD corresponding to the skewed keys needs a Cartesian product with the full set of random prefixes (1 to n), which expands that data n times, to guarantee that whatever random prefix a skewed key on the other side carries, it can still find a match to join with.
  2. The parallelism of the join on the skewed RDDs can be set to n * k (k is the number of top skewed keys).
  3. Since the skewed keys and the non-skewed keys are processed completely independently, the two joins can run in parallel.
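
A minimal sketch of steps 1-6 (leftRDD, rightRDD, their paths, and the skewed-key set are hypothetical; the skewed keys would come from the sampling shown earlier):

import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("split-join-union").getOrCreate()
val sc = spark.sparkContext

val leftRDD  = sc.textFile("hdfs:///path/to/left").map(l => (l.split(",")(0), l))
val rightRDD = sc.textFile("hdfs:///path/to/right").map(l => (l.split(",")(0), l))
val skewedKeys = Set("hot_key_1", "hot_key_2")   // assumption: found by sampling
val n = 10                                       // number of random prefixes (assumption)

// Step 2: salt the skewed part of the left side with a random prefix in [1, n].
val leftSkewRDD = leftRDD.filter { case (k, _) => skewedKeys.contains(k) }
  .map { case (k, v) => (s"${Random.nextInt(n) + 1}_$k", v) }
val leftUnSkewRDD = leftRDD.filter { case (k, _) => !skewedKeys.contains(k) }

// Step 3: expand the skewed part of the right side n times, one copy per prefix.
val rightSkewRDD = rightRDD.filter { case (k, _) => skewedKeys.contains(k) }
  .flatMap { case (k, v) => (1 to n).map(i => (s"${i}_$k", v)) }
val rightUnSkewRDD = rightRDD.filter { case (k, _) => !skewedKeys.contains(k) }

// Steps 4-6: join the two parts separately, strip the salt, then union the results.
val skewedJoinRDD = leftSkewRDD.join(rightSkewRDD)
  .map { case (saltedKey, vs) => (saltedKey.split("_", 2)(1), vs) }
val unskewedJoinRDD = leftUnSkewRDD.join(rightUnSkewRDD)
val result = skewedJoinRDD.union(unskewedJoinRDD)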

(1) Applicable scenarios

Both tables are large, so a map-side join cannot be used, but in one RDD only a few keys carry too much data while the keys of the other RDD are distributed fairly evenly.

(2) Solution

Separate out the data for the skewed keys in the skewed RDD and add random prefixes to it; combine the corresponding part of the other RDD with every possible prefix (equivalent to expanding that data to N times its size, where N is the number of random prefixes); join the two and strip the prefixes. Join the remaining data, which contains no skewed keys, as usual. Finally, union the two join results to obtain the complete join result.

(3) Advantages

Compared with the map-side join, it can handle joins between large data sets. If resources allow, the skewed part and the non-skewed part can be joined in parallel, further improving efficiency. And because only the skewed part of the data is expanded, the extra resource consumption is limited.

(4) Disadvantages

If there are very many skewed keys, the expansion of the other side becomes very large and this scheme no longer applies. In addition, because the skewed and non-skewed keys are processed separately, the data set has to be scanned twice, which adds overhead.

Approach 6: Salt the keys of the large table and expand the small table N times, then join

If there are many skewed keys, splitting them out one by one as in the previous approach is not very meaningful. In that case it is more appropriate to add a random prefix directly to every record of the data set that has the skew, and to take the Cartesian product of the other, evenly distributed data set with the full set of random prefixes (i.e. expand its data volume to N times the original).

This is in fact a special, simplified case of the previous approach: nothing is split out, so no union is needed.
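
A minimal sketch (the input paths, the prefix count N, and the CSV key assumption are hypothetical):

import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("salt-and-expand-join").getOrCreate()
val sc = spark.sparkContext

val skewedRDD = sc.textFile("hdfs:///path/to/skewed").map(l => (l.split(",")(0), l))
val evenRDD   = sc.textFile("hdfs:///path/to/even").map(l => (l.split(",")(0), l))
val N = 10   // number of random prefixes (assumption)

// Salt every key of the skewed data set with a random prefix in [1, N].
val saltedSkewed = skewedRDD.map { case (k, v) => (s"${Random.nextInt(N) + 1}_$k", v) }

// Expand the evenly distributed data set N times, one copy per prefix,
// so that every salted key can still find its match.
val expandedEven = evenRDD.flatMap { case (k, v) => (1 to N).map(i => (s"${i}_$k", v)) }

// Join on the salted keys, then strip the prefix from the result keys.
val joined = saltedSkewed.join(expandedEven)
  .map { case (saltedKey, vs) => (saltedKey.split("_", 2)(1), vs) }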

(1) Applicable scenarios

One data set has many skewed keys, and the other data set is distributed fairly evenly.

(2) Advantages

Applicable to most scenarios, with good results.

(3) Disadvantages

An entire data set needs to be expanded N times, which increases resource consumption.

Approach 7: Partial aggregation on the map side first

Add a combiner to perform partial aggregation on the map side. Adding a combiner amounts to doing part of the reduce early: identical keys are aggregated inside the mapper first, which reduces both the amount of data shuffled and the amount of computation on the reduce side. This method can effectively alleviate data skew, but it does not help much when the skew-causing keys are spread in large volumes across many different mappers.

TIPS Use reduceByKey instead of groupByKey.
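
A minimal sketch of the difference on a toy pair RDD:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-side-combine").getOrCreate()
val sc = spark.sparkContext
val pairs = sc.parallelize(Seq(("hello", 1), ("hello", 1), ("world", 1)))

// groupByKey shuffles every record before the values are combined: no map-side aggregation.
val withoutCombiner = pairs.groupByKey().mapValues(_.sum)

// reduceByKey runs the combine function inside each map task first (a combiner),
// so far less data crosses the network during the shuffle.
val withCombiner = pairs.reduceByKey(_ + _)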

Approach 8: Local aggregation with salting + global aggregation after removing the salt

The core idea of this approach is a two-stage aggregation. The first stage is local aggregation: give each key a random prefix from 1 to n (say a random number up to 3), so that identical keys become different. For example, (hello, 1) (hello, 1) (hello, 1) (hello, 1) (hello, 1) becomes (1_hello, 1) (3_hello, 1) (2_hello, 1) (1_hello, 1) (2_hello, 1). Then run an aggregation such as reduceByKey on the salted data to get the partial result: (1_hello, 2) (2_hello, 2) (3_hello, 1). Next, remove the prefix from each key, giving (hello, 2) (hello, 2) (hello, 1), and aggregate again globally to obtain the final result, e.g. (hello, 5).

import scala.util.Random

def antiSkew(): RDD[(String, Int)] = {
  val SPLIT = "-"
  val n = 10
  pairs                                                       // pairs: RDD[(String, Int)], defined elsewhere
    .map(t => (s"${Random.nextInt(n)}$SPLIT${t._1}", t._2))   // salt each key with a per-record random prefix
    .reduceByKey((v1, v2) => v1 + v2)                         // stage 1: local aggregation on the salted keys
    .map(t => (t._1.split(SPLIT)(1), t._2))                   // remove the salt prefix
    .reduceByKey((v1, v2) => v1 + v2)                         // stage 2: global aggregation
}

However, this performs two MapReduce passes, so performance is slightly worse than doing the aggregation in one pass.

7. Data skew in Hadoop

What Hadoop users write directly are MapReduce programs and Hive programs. Although Hive is ultimately executed as MapReduce (at least for now, in-memory execution for Hive is not widespread), what you write is quite different in the two cases: one is a program, the other is SQL. So they are treated slightly separately here.

Data skew in Hadoop mostly shows up as the reduce phase getting stuck at 99.99% and staying at 99.99% without ever finishing.

If you dig into the detailed logs or the monitoring interface, you will find:

  • A few reduce tasks remain stuck for a long time

  • Various container OOM errors

  • The stuck reduces read and write far more data than the other, normal reduces

  • Along with the data skew, other strange symptoms may appear, such as tasks being killed

Experience: data skew in Hive SQL generally occurs on group by and join on, and is closely tied to the data and the business logic.

Optimization

Here are some methods and ideas; for the specific parameters and usage, see the official documentation.

  1. Use a map join

  2. Turn count(distinct) into a group by first, then count

  3. Tuning parameters

    set hive.map.aggr=true

    set hive.groupby.skewindata=true

  4. Use left semi join

  5. Compress the intermediate results output by the map side. (This does not fully solve data skew, but it reduces read/write IO and network transfer, which can improve efficiency considerably.)

Explanation

hive.map.aggr=true: perform partial aggregation in the map phase; it is more efficient but requires more memory.

hive.groupby.skewindata=true: load-balance when the data is skewed. When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly to the reducers; each reducer performs a partial aggregation and outputs its result. The effect is that records with the same GroupBy key may land on different reducers, which achieves load balancing. The second MR job then distributes the pre-aggregated results to the reducers by GroupBy key (this step guarantees that identical GroupBy keys go to the same reducer) and completes the final aggregation.

