Spark Data Skew Optimization

What is data skew? Data skew refers to the situation where, during parallel processing of a dataset, one subset of the data (e.g., a single Spark or Kafka partition) is significantly larger than the others, so that processing this portion becomes the bottleneck for the entire job.

A conclusion from experience: under normal circumstances, OOM errors are caused by data skew. When a single task's data is too large, GC pressure becomes enormous. This is unlike Kafka, because Kafka's memory does not go through the JVM; it is based on the Linux kernel's page cache.

Data skew has two directly fatal consequences.

1. Data skew can directly lead to an Out Of Memory error.

2. The job runs slowly: particularly slowly, extremely slowly, unacceptably slowly.

 

Locating the data skew problem:

  1.   Look through the shuffle operators in the code, e.g., reduceByKey, countByKey, groupByKey, join, etc., and judge from the code logic whether they are likely to produce data skew;
  2.   Look at the Spark job's log files. The logs report an error down to an exact line of code, from which you can determine in which stage the exception occurred and which shuffle operator it corresponds to;

At this point, depending on how the operation is executed, there are several ways to view the key distribution:

If a group by or join statement in Spark SQL causes the skew, query the distribution of the keys used in that statement over the table.

If a shuffle operator executed on a Spark RDD causes the skew, add code to the Spark job to view the key distribution, such as RDD.countByKey(). Then collect/take the per-key counts back to the client and print them to see the distribution of the keys.
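For example, a minimal sketch (the RDD name and the 10% sampling fraction are illustrative, not part of the original job) that samples a pair RDD, counts records per key, and prints the heaviest keys:

```scala
import org.apache.spark.rdd.RDD

// A minimal sketch: pairRdd is a hypothetical (key, value) RDD.
// Sample 10% of it without replacement, count records per key,
// and print the ten heaviest keys to spot skew candidates.
def printKeyDistribution(pairRdd: RDD[(String, String)]): Unit = {
  val counts = pairRdd
    .sample(withReplacement = false, fraction = 0.1)
    .countByKey()                       // returns a Map[key, count] on the driver

  counts.toSeq
    .sortBy { case (_, cnt) => -cnt }   // heaviest keys first
    .take(10)
    .foreach { case (key, cnt) => println(s"$key -> $cnt") }
}
```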

Solution one: aggregate the source data

For example, the data source is Kafka:

Spark Streaming reads data from Kafka, for example via the DirectStream approach. Since each Kafka partition corresponds to one Spark task (partition), whether the data is skewed depends on whether the data is balanced across the partitions of the relevant topic in Kafka, and this directly determines how evenly Spark processes the data.

Within a Kafka topic, how messages are distributed among partitions is mainly determined by the Partitioner implementation class used by the producer. If a random Partitioner is used, each message is sent to a random partition, so probabilistically the data across partitions will be balanced. In that case the source stage (the stage that reads data directly from Kafka) has no data skew.
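A minimal sketch of configuring the producer side this way, assuming a Kafka 2.4+ client where the built-in RoundRobinPartitioner is available (the broker address and topic name are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

// Configure a producer whose partitioner spreads records evenly across
// partitions, so the Spark stage reading this topic starts out balanced.
val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
  "org.apache.kafka.clients.producer.RoundRobinPartitioner")

val producer = new KafkaProducer[String, String](props)
// No key is set, so partition assignment is left entirely to the partitioner.
producer.send(new ProducerRecord[String, String]("events", "value-only-record"))
producer.close()
```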

 

1. Avoid the shuffle process

For example, the data source is Hive:

A skewed Hive table causes the data skew. If the data in the Hive table is itself unevenly distributed (for example, one key corresponds to one million rows while other keys correspond to only ten each), and the business scenario requires frequently running Spark analysis jobs on that Hive table, this solution is a good fit.

Implementation idea: evaluate whether the data can be preprocessed in Hive (i.e., aggregate the data by key in advance through Hive ETL, or pre-join it with the other tables). In the Spark job, the data source is then no longer the original Hive table but the preprocessed one. Since the data has already been aggregated or joined in advance, there is no need to use shuffle operators in Spark to perform these operations.
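As an illustration (the database, table, and column names are hypothetical), the ETL step could pre-aggregate by key once, after which the Spark analysis job reads the aggregated table with no shuffle of its own:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-preaggregated")
  .enableHiveSupport()
  .getOrCreate()

// Run once as an ETL step (e.g., nightly): aggregate by key ahead of time.
spark.sql(
  """CREATE TABLE IF NOT EXISTS db.orders_agg AS
    |SELECT user_id, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
    |FROM db.orders
    |GROUP BY user_id""".stripMargin)

// The analysis job reads the pre-aggregated table; no groupBy/shuffle is
// needed here, so the hot key no longer concentrates in a single task.
val agg = spark.table("db.orders_agg")
agg.show()
```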

2. Reduce key granularity (make the keys finer-grained so that each key corresponds to less data, at the cost of more distinct keys)

 

3. Increase key granularity

If, in a particular scenario, there is no way to aggregate the data for each key into one record, consider expanding the aggregation granularity of the key instead.

For example, there are 100,000 user records and the current key granularity is (province, city, district, date). We now consider expanding the granularity to (province, city, date). In this case the number of distinct keys will decrease, and the differences in data volume between keys may also shrink, which can reduce the data skew. (This method is only valid for certain types of data; when the application scenario is inappropriate, it will make the skew worse.)
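A minimal sketch of this re-keying, with illustrative field values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coarsen-key").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// (province, city, district, date) -> user count, illustrative data.
val fine = sc.parallelize(Seq(
  (("GD", "SZ", "Nanshan", "2019-07-01"), 120L),
  (("GD", "SZ", "Futian",  "2019-07-01"), 80L),
  (("GD", "GZ", "Tianhe",  "2019-07-01"), 50L)
))

// Drop the district component to enlarge the key granularity,
// then aggregate again at the coarser (province, city, date) level.
val coarse = fine
  .map { case ((province, city, _, date), cnt) => ((province, city, date), cnt) }
  .reduceByKey(_ + _)

coarse.collect().foreach(println)
```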

 

Solution two: filter out the keys that cause the skew

If the Spark job is allowed to discard some of the data, consider filtering out the keys that may cause the skew, i.e., the data corresponding to those keys. After filtering, data skew no longer occurs in the Spark job.
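A minimal sketch, assuming the skewed keys have already been identified (e.g., via countByKey, as above) and gathered into a hypothetical set:

```scala
import org.apache.spark.rdd.RDD

// pairRdd is a hypothetical (key, value) RDD; skewedKeys holds the keys
// previously identified as causing the skew. Drop them before the shuffle.
def dropSkewedKeys(pairRdd: RDD[(String, String)],
                   skewedKeys: Set[String]): RDD[(String, String)] = {
  pairRdd.filter { case (key, _) => !skewedKeys.contains(key) }
}
```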

Solution three: increase the reduce-side parallelism of the shuffle operation

Most shuffle operators accept a parameter that sets the degree of parallelism, such as reduceByKey(500). This parameter determines the reduce-side parallelism of the shuffle: when the shuffle operation is performed, the corresponding, specified number of reduce tasks will be created.

Increasing the number of shuffle read tasks allows the multiple keys originally assigned to a single task to be spread across multiple tasks, so that each task processes less data than before.
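A minimal sketch (the parallelism of 500 follows the example above; the RDD is hypothetical):

```scala
import org.apache.spark.rdd.RDD

// pairRdd is a hypothetical (key, count) RDD. Passing numPartitions = 500
// makes the shuffle create 500 reduce tasks instead of the default,
// spreading the keys over more, smaller tasks.
def aggregateWithMoreReducers(pairRdd: RDD[(String, Long)]): RDD[(String, Long)] =
  pairRdd.reduceByKey(_ + _, 500)
```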

This scheme usually does not completely solve data skew, because in some extreme situations, such as a key whose data volume is one million records, no matter how much you increase the number of tasks, the one million records for that key must still all be allocated to a single task, so data skew is still destined to happen.

The degree of parallelism can also be adjusted either up or down to suit the situation.

Solution four: use random keys to achieve two-stage aggregation

First, use a map operator to attach a random-number prefix to each key, breaking the keys up so that keys that were originally the same become different keys. Then perform the first, local aggregation: the data originally processed by one task is now dispersed across multiple tasks for local aggregation. After that, remove the prefix from each key and aggregate again globally.
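A minimal sketch of the two stages (the prefix range n = 10 and the RDD are illustrative):

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// pairRdd is a hypothetical (key, count) RDD; n is the number of
// random prefixes used to break up each hot key.
def twoStageAggregate(pairRdd: RDD[(String, Long)], n: Int = 10): RDD[(String, Long)] = {
  pairRdd
    // Stage 1: prefix each key with a random number in [0, n),
    // so one hot key is split into up to n different keys.
    .map { case (key, cnt) => (s"${Random.nextInt(n)}_$key", cnt) }
    .reduceByKey(_ + _)                                 // local aggregation on salted keys
    // Stage 2: strip the prefix and aggregate again globally.
    .map { case (salted, cnt) => (salted.split("_", 2)(1), cnt) }
    .reduceByKey(_ + _)
}
```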

This method works fairly well for skew caused by aggregation operators such as groupByKey and reduceByKey, but it is only suitable for aggregation-style shuffle operations, so its scope is relatively narrow. For join-style shuffle operations, other solutions have to be used.

This method is also worth trying when the previous solutions have not produced good results.

 

Solution five: convert the reduce join into a map join

Applicable scenario: when performing a join on RDDs, or using a join statement in Spark SQL, and one of the RDDs or tables in the join is relatively small in data volume (for example, a few hundred MB, or one or two GB), this scheme is appropriate.

Applicable scenario

The dataset on one side of the join is small enough that it can be loaded into the Driver and then broadcast to each Executor via the Broadcast mechanism.

Advantage

The shuffle is avoided entirely, completely eliminating the condition under which data skew arises, which can greatly improve performance.

Disadvantage

It requires the dataset on one side of the join to be small enough, and it mainly applies to join scenarios rather than aggregation scenarios, so the applicable conditions are limited.
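A minimal sketch of the map join on RDDs, with hypothetical datasets (Spark SQL can often do this automatically via a broadcast hint or the autoBroadcastJoinThreshold setting):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-join").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val bigRdd   = sc.parallelize(Seq((1, "order-a"), (2, "order-b"), (1, "order-c")))
val smallRdd = sc.parallelize(Seq((1, "user-x"), (2, "user-y")))

// Collect the small side to the driver and broadcast it to every executor.
val smallMap  = smallRdd.collectAsMap()
val broadcast = sc.broadcast(smallMap)

// Join on the map side with a simple lookup: no shuffle, hence no skew.
val joined = bigRdd.flatMap { case (key, value) =>
  broadcast.value.get(key).map(other => (key, (value, other)))
}

joined.collect().foreach(println)
```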

Solution six: use sample sampling to join the skewed keys separately

  1. Applicable scenario analysis:

For the data in an RDD, you can convert it to an intermediate table, or directly use countByKey(), to see the amount of data corresponding to each key in the RDD. If you find that the data volume of one particular key dominates the whole RDD, you can consider using this method.

When the amount of data is very large, consider taking a 10% sample, analyzing which keys in that sample may cause data skew, and then extracting the data corresponding to those keys and joining it separately (see the sketch after this list).

  2. Non-applicable scenario analysis:

If an RDD has a large number of keys causing the skew, this scheme does not apply.
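A minimal sketch of the applicable case, assuming a single hot key identified by sampling; the RDDs, the salting range n, and the helper name are all hypothetical:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// leftRdd and rightRdd are hypothetical (key, value) RDDs; hotKey is a key
// previously identified (e.g., via sampling + countByKey) as skewed.
def skewAwareJoin(leftRdd: RDD[(String, String)],
                  rightRdd: RDD[(String, String)],
                  hotKey: String,
                  n: Int = 10): RDD[(String, (String, String))] = {
  // Split both sides into the skewed part and the rest.
  val leftHot   = leftRdd.filter  { case (k, _) => k == hotKey }
  val leftRest  = leftRdd.filter  { case (k, _) => k != hotKey }
  val rightHot  = rightRdd.filter { case (k, _) => k == hotKey }
  val rightRest = rightRdd.filter { case (k, _) => k != hotKey }

  // Salt the hot key on the left with a random prefix in [0, n) and expand
  // the hot key on the right n times, once per prefix, so the skewed join
  // spreads over n tasks instead of one.
  val leftSalted  = leftHot.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
  val rightSalted = rightHot.flatMap { case (k, v) =>
    (0 until n).map(i => (s"${i}_$k", v))
  }

  val hotJoined = leftSalted.join(rightSalted)
    .map { case (salted, pair) => (salted.split("_", 2)(1), pair) } // strip prefix

  // Join the remaining keys normally and union the two results.
  hotJoined.union(leftRest.join(rightRest))
}
```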

Solution seven: use random numbers and expansion to perform the join

If, when performing the join operation, a large number of keys in the RDD cause data skew, then splitting off individual keys makes no sense, and only this last solution remains: for the join operation, consider expanding the data of one RDD and diluting the other RDD with random prefixes, then joining them.

By adding random prefixes, keys that were originally the same become different keys, and these "different keys" can then be spread across multiple tasks for processing instead of one task handling a large number of identical keys. This scheme targets the case with a large number of skewed keys, where splitting off part of the keys for separate processing is not feasible; it requires expanding the data of the entire RDD and is demanding on memory resources.

Main idea:

Select one RDD and use flatMap to expand it: for each record, emit n copies whose keys are prefixed with each numeric value from 1 to n, mapping one record into multiple records. (Expansion.)

Select the other RDD and perform a map operation: prefix the key of each record with a single random number from 1 to n. (Dilution.)

Join the two processed RDDs.
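A minimal sketch of the expansion-and-dilution join (n, the RDDs, and the function name are illustrative; note that the expanded side grows n times):

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// smallRdd is expanded n times; bigRdd (the heavily skewed side) is diluted
// with random prefixes. Both are hypothetical (key, value) RDDs.
def saltedJoin(bigRdd: RDD[(String, String)],
               smallRdd: RDD[(String, String)],
               n: Int = 10): RDD[(String, (String, String))] = {
  // Expansion: each record of smallRdd becomes n records, prefixed 0..n-1.
  val expanded = smallRdd.flatMap { case (k, v) =>
    (0 until n).map(i => (s"${i}_$k", v))
  }

  // Dilution: each record of bigRdd gets one random prefix in [0, n).
  val diluted = bigRdd.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

  // Join on the salted keys, then strip the prefix from the result.
  diluted.join(expanded)
    .map { case (salted, pair) => (salted.split("_", 2)(1), pair) }
}
```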

Limitations:

If both RDDs are large, expanding one RDD n times obviously will not work;

The expansion approach can only ease the data skew, not completely solve it.

 


Source: blog.csdn.net/weixin_45194374/article/details/95043529