Spark Data Skew Solutions

Solution one: preprocess the data with Hive ETL

Applicable scenarios:

This solution fits cases where the skew originates in the Hive table itself. If the data in the Hive table is very unevenly distributed (for example, one key maps to a million records while other keys map to only about 10 records each), and the business scenario requires Spark to run analysis jobs against that Hive table frequently, then this solution is a good match.

Implementation approach:

Evaluate whether the data can be preprocessed in Hive, that is, pre-aggregate it by key in a Hive ETL job, or pre-join it with the other tables there. The data source of the Spark job is then no longer the original Hive table but the preprocessed one. Because the aggregation or join has already been performed upstream, the Spark job no longer needs the shuffle operators that would otherwise carry out those operations.
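
As an illustration, here is a minimal sketch of the idea in Spark (Scala). The table names raw_events and raw_events_agg are hypothetical, and the Hive ETL job that builds the pre-aggregated table is assumed to run separately, for example as a nightly batch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Before: Spark aggregates the raw (skewed) Hive table itself, triggering a shuffle.
// val skewedCounts = spark.sql("SELECT user_id, COUNT(*) AS cnt FROM raw_events GROUP BY user_id")

// After: a nightly Hive ETL job has already materialized the aggregation into
// raw_events_agg (hypothetical table name); the Spark job simply reads it and
// no shuffle-heavy group-by runs inside Spark at all.
val preAggregated = spark.table("raw_events_agg")
preAggregated.show()
```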

Implementation principle:

This solution removes the skew from the Spark job at its source: because no shuffle-class operators are executed in Spark at all, there simply cannot be a data skew problem there. But be aware that this approach treats the symptom rather than the cause. The data itself is still unevenly distributed, so when the Hive ETL performs the group by or join, the skew still occurs and the Hive ETL job runs very slowly. We have merely moved the point where the skew happens forward into the Hive ETL, so that the Spark program itself avoids it.

Advantages:

Simple and convenient to implement, and the effect is very good: data skew is avoided entirely, and the performance of the Spark job improves significantly.

Disadvantages:

It treats the symptom, not the cause: the data skew still occurs, just inside the Hive ETL.

Practical experience:

This solution suits projects where Spark is used together with a Java system, Java code frequently triggers Spark jobs, and the execution performance of those jobs is demanding. Pushing the skew upstream into the Hive ETL means the skewed processing runs only once a day, and only that run is slow; every Spark job subsequently triggered from Java executes quickly, giving users a much better experience.

Project experience:

This solution was used in Meituan-Dianping's interactive user behavior analysis system. Users submit statistical analysis tasks through a Java Web frontend, and the Java backend submits Spark jobs to perform the analysis. The Spark jobs have to be fast, ideally finishing within 10 minutes; anything slower makes the user experience poor. So we moved the shuffle operations of some Spark jobs upstream into Hive ETL, letting Spark read the precomputed intermediate Hive tables directly. This minimized Spark-side shuffle operations and greatly improved performance; some operations became more than 6 times faster.

Solution two: filter out the few keys that cause the skew

Applicable scenarios:

If you find that only a few keys cause the skew, and those keys have little impact on the computation itself, this solution is a good fit. For example, 99% of the keys each map to about 10 records, while a single key maps to a million records and causes the skew.

Implementation approach:

If we determine that only a handful of keys have an abnormally large amount of data, and those keys are not important to the job's execution or results, we can simply filter them out. In Spark SQL this can be done with a where clause; in Spark Core, use the filter operator on the RDD. If the heaviest keys need to be determined dynamically on each run, first sample the RDD with the sample operator, count the records per key, and then filter out the keys with the largest data volumes.
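
A minimal sketch in Spark Core (Scala), assuming a hypothetical pairRdd of type RDD[(String, Long)]; the key name "hot_key", the sampling fraction, and the top-3 cutoff are illustrative only:

```scala
// Statically known skewed keys can be dropped with a simple filter
// (in Spark SQL, a WHERE clause achieves the same thing):
val cleaned = pairRdd.filter { case (key, _) => key != "hot_key" }

// If the heaviest keys change from run to run, sample first, count per key,
// and then filter out the top-N keys by volume:
val sampled = pairRdd.sample(withReplacement = false, fraction = 0.1)
val topKeys = sampled
  .countByKey()              // Map[String, Long] collected to the driver
  .toSeq
  .sortBy(-_._2)
  .take(3)                   // the 3 largest keys in the sample
  .map(_._1)
  .toSet

val filtered = pairRdd.filter { case (key, _) => !topKeys.contains(key) }
```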

Implementation principle:

Once the keys that cause the skew are filtered out, they no longer participate in the computation, so they naturally cannot produce data skew.

Advantages:

Simple to implement and very effective: data skew is avoided entirely.

Disadvantages:

Few scenarios qualify. In most cases many keys contribute to the skew, not just a few.

Practical experience:

We used this solution against data skew in one of our projects. One day a Spark job suddenly started failing with OOM errors; investigation showed that one key in the Hive table had abnormal data on that day, causing its volume to surge. So before each computation we sampled the data, found the few keys with the largest volumes, and filtered those keys out directly in the program.

Solution three: increase the parallelism of shuffle operations

Applicable scenarios:

If we have to face the data skew head-on, it is recommended to try this solution first, because it is the simplest way to handle data skew.

Implementation approach:

When executing a shuffle operator on an RDD, pass a parameter to the operator, for example reduceByKey(1000); this parameter sets the number of shuffle read tasks for that operator. For shuffle statements in Spark SQL, such as group by and join, set the parameter spark.sql.shuffle.partitions, which controls the parallelism of the shuffle read tasks. Its default value is 200, which is a little too small for many scenarios.
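
For example (a sketch, assuming a hypothetical pairRdd and an existing SparkSession named spark):

```scala
// RDD shuffle operators accept the number of partitions explicitly:
// 1000 shuffle-read tasks instead of the default parallelism.
val counts = pairRdd.reduceByKey(_ + _, 1000)

// For Spark SQL (group by / join), raise spark.sql.shuffle.partitions
// from its default of 200:
spark.conf.set("spark.sql.shuffle.partitions", "1000")
```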

Implementation principle:

Increasing the number of shuffle read tasks allows the multiple keys originally assigned to one task to be spread across several tasks, so each task processes less data than before. For example, suppose there are originally 5 keys, each with 10 records, and all 5 keys are assigned to the same task; that task then processes 50 records. After increasing the number of shuffle read tasks, each task may be assigned a single key, i.e., each task processes only 10 records, so each task naturally finishes sooner.

Advantages:

Relatively simple to implement, and it can effectively mitigate the impact of data skew.

Disadvantages:

It only alleviates the skew; it does not eradicate the problem, and practical experience shows its effect is limited.

Practical experience:

This solution usually cannot solve data skew completely. In extreme situations, for example when a single key maps to a million records, no matter how much you increase the number of tasks, that key's million records must still be assigned to a single task, so skew is bound to occur. This solution is therefore only a first, simplest attempt at easing data skew when you start tackling it, or something to use in combination with the other solutions.

Solution four: two-stage aggregation (local aggregation + global aggregation)

Applicable scenarios:

This solution works well when performing aggregations on an RDD with a shuffle operator such as reduceByKey, or when doing group-by aggregations with a group by statement in Spark SQL.

Implementation approach:

The core idea of this solution is to perform the aggregation in two stages. The first stage is a local aggregation: prefix each key with a random number, for example a random number less than 10, so that identical keys become distinct. For instance, (hello, 1) (hello, 1) (hello, 1) (hello, 1) might become (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then run a local aggregation such as reduceByKey on the prefixed data, yielding partial results like (1_hello, 2) (2_hello, 2). Next, strip the prefix from each key, producing (hello, 2) (hello, 2), and run the global aggregation again to get the final result, for example (hello, 4).
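
A minimal sketch of the two stages in Spark Core (Scala), assuming a hypothetical wordPairs of type RDD[(String, Int)]:

```scala
import scala.util.Random

// Stage 1: local aggregation under random-prefixed keys.
val prefixed = wordPairs.map { case (key, value) =>
  val prefix = Random.nextInt(10)          // random prefix in [0, 10)
  (s"${prefix}_$key", value)
}
val locallyAggregated = prefixed.reduceByKey(_ + _)

// Stage 2: strip the prefix and aggregate globally.
val globallyAggregated = locallyAggregated
  .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }
  .reduceByKey(_ + _)
```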

Implementation principle:

By appending a random prefix, records that originally share the same key are turned into several different keys, so data that would have been processed by a single task is spread across multiple tasks for the local aggregation, solving the problem of a single task processing too much data. Removing the random prefix and performing the global aggregation then yields the final result.

Advantages:

Very effective for skew caused by aggregation-type shuffle operations. It can usually eliminate the skew, or at least greatly alleviate it, improving Spark job performance several times over.

Disadvantages:

It only applies to aggregation-type shuffle operations, so its scope is relatively narrow. For join-type shuffle operations, other solutions must be used.

Solution five: convert reduce join into map join

Applicable scenarios:

When using the join operator on RDDs, or a join statement in Spark SQL, and one of the RDDs or tables in the join is relatively small (say a few hundred MB or one to two GB), this solution is a good fit.

Implementation approach:

Instead of performing the join with the join operator, use a Broadcast variable together with a map-type operator; this completely avoids the shuffle and therefore completely avoids the data skew. First pull the data of the smaller RDD into the driver's memory with the collect operator and create a Broadcast variable from it. Then execute a map-type operator on the other RDD; inside the operator function, fetch the full data of the smaller RDD from the Broadcast variable, compare it with each record of the current RDD according to the join key, and whenever the keys match, join the records from the two RDDs in whatever form you need.
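
A minimal sketch (Scala), assuming hypothetical RDDs smallRdd: RDD[(Int, String)] (small enough to fit in driver and executor memory) and largeRdd: RDD[(Int, Long)], plus an existing SparkContext sc; this implements an inner join on the map side:

```scala
// Collect the small RDD to the driver and broadcast it.
val smallMap = smallRdd.collectAsMap()
val smallBroadcast = sc.broadcast(smallMap)

// Join on the map side: each partition of largeRdd looks up the broadcast map,
// so no shuffle is triggered at all.
val joined = largeRdd.mapPartitions { iter =>
  val lookup = smallBroadcast.value
  iter.flatMap { case (key, value) =>
    lookup.get(key).map(smallValue => (key, (value, smallValue)))
  }
}
```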

Implementation principle:

An ordinary join goes through the shuffle process: records with the same key are pulled into the same shuffle read task and joined there, which is a reduce join. But if one RDD is relatively small, you can broadcast the full data of the small RDD and use a map-type operator to achieve the same effect as a join, i.e., a map join. No shuffle occurs, so no data skew can occur.

Advantages:

Works very well for skew caused by join operations: since no shuffle happens at all, no data skew can happen.

Disadvantages:

The applicable scenarios are limited, because this only works for joining a large table with a small table. The small table has to be broadcast, which consumes memory: the driver and every Executor keep a full copy of the small RDD's data. If the broadcast RDD is relatively large, say 10 GB or more, a memory overflow may occur. It is therefore not suitable when both tables are large.

Solution six: sample the skewed keys and split the join

Applicable scenarios:

When joining two RDDs / Hive tables, if both contain a large amount of data so that Solution five cannot be used, examine the key distribution in the two RDDs / Hive tables. If the skew occurs because a few keys in one RDD / Hive table have too much data, while the keys in the other RDD / Hive table are fairly evenly distributed, this solution is quite appropriate.

Implementation approach:

1. Use the sample operator to sample the RDD that contains the few oversized keys, count the records per key, and determine the few keys with the largest data volumes.
2. Split the data for those keys out of the original RDD into a separate RDD, prefixing each key with a random number less than n; the remaining keys, which do not cause skew, form another RDD.
3. From the other RDD to be joined, also filter out the data for those few skewed keys into a separate RDD, expanding each record into n records and attaching the prefixes 0 through n-1 in turn; the remaining keys, which do not cause skew, again form another RDD.
4. Join the separate RDD with random prefixes against the separate RDD that was expanded n times; the data for each originally identical key is now broken into n parts and spread across multiple tasks for the join.
5. The other two ordinary RDDs can simply be joined as usual.
6. Finally, merge the two join results with the union operator to obtain the final join result. (A code sketch covering all six steps follows this list.)
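
Here is a minimal sketch of the six steps (Scala), assuming hypothetical RDDs rddA (skewed, RDD[(String, Int)]) and rddB (evenly distributed, RDD[(String, String)]) and n = 10 expansion copies:

```scala
import scala.util.Random

val n = 10

// 1. Sample rddA and find the heaviest key (take more if several keys are skewed).
val skewedKeys = rddA.sample(false, 0.1)
  .countByKey().toSeq.sortBy(-_._2).take(1).map(_._1).toSet

// 2. Split rddA into the skewed part (with a random prefix in [0, n)) and the rest.
val skewedA = rddA.filter { case (k, _) => skewedKeys.contains(k) }
  .map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
val normalA = rddA.filter { case (k, _) => !skewedKeys.contains(k) }

// 3. From rddB, take only the skewed keys and expand each record into n copies,
//    one per prefix 0..n-1.
val skewedB = rddB.filter { case (k, _) => skewedKeys.contains(k) }
  .flatMap { case (k, v) => (0 until n).map(i => (s"${i}_$k", v)) }
val normalB = rddB.filter { case (k, _) => !skewedKeys.contains(k) }

// 4. Join the prefixed skewed parts, then strip the prefix again.
val joinedSkewed = skewedA.join(skewedB)
  .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }

// 5. Join the normal parts as usual.
val joinedNormal = normalA.join(normalB)

// 6. Union the two results into the final join output.
val result = joinedSkewed.union(joinedNormal)
```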

Implementation principle:

For skew caused by a join, if only a few keys cause the skew, you can split those keys into a separate RDD and break them into n parts with random prefixes before joining. The data for those keys is then no longer concentrated in a few tasks but spread across multiple tasks for the join.

Advantages:

For join skew caused by only a few keys, this approach breaks up those keys very effectively for the join. It only expands the data for the few skewed keys by a factor of n and does not need to expand the full dataset, which avoids excessive memory consumption.

Disadvantages:

If a very large number of keys cause the skew, for example tens of thousands of keys, this approach is not suitable.

Solution seven: join using random prefixes and RDD expansion

Applicable scenarios:

If a large number of keys in an RDD cause data skew during the join, splitting out individual keys makes no sense; in that case only this last solution can solve the problem.

Implementation approach:

1. The basic idea is similar to Solution six: first examine the data distribution of the RDDs / Hive tables and find the one that causes the skew, for example one in which many keys each map to more than 10,000 records.
2. Prefix every record of that RDD with a random number less than n.
3. Expand the other, normal RDD: turn each record into n records, attaching the prefixes 0 through n-1 in turn.
4. Finally, join the two resulting RDDs. (A code sketch follows this list.)
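
A minimal sketch of the full-expansion variant (Scala), assuming hypothetical RDDs skewedRdd: RDD[(String, Int)] and normalRdd: RDD[(String, String)] and n = 10:

```scala
import scala.util.Random

val n = 10

// Prefix every record of the skewed RDD with a random number in [0, n).
val prefixedSkewed = skewedRdd.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

// Expand every record of the other RDD into n copies, one per prefix 0..n-1.
val expandedNormal = normalRdd.flatMap { case (k, v) =>
  (0 until n).map(i => (s"${i}_$k", v))
}

// Join on the prefixed keys and strip the prefix from the result.
val joined = prefixedSkewed.join(expandedNormal)
  .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }
```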

Implementation principle:

By adding random prefixes, records that originally shared the same key are turned into different keys, so these "different keys" can be spread across multiple tasks instead of one task handling a huge number of identical keys. The difference from Solution six is that Solution six only gives special treatment to the data of a few skewed keys, and since only that portion needs to be expanded, the memory footprint after expansion is small; this solution targets a large number of skewed keys, so it cannot split out a portion of the keys for separate treatment and must expand the entire RDD, which is very demanding on memory resources.

Advantages:

It can handle essentially any join-type data skew, and the effect is fairly significant; the performance improvement is very good.

Disadvantages:

This solution alleviates the skew rather than avoiding it completely, and expanding the entire RDD is very demanding on memory resources.

Practical experience:

Once, while building a data pipeline, we found that a join was causing data skew. Before optimization the job took about 60 minutes to execute; after applying this solution the execution time dropped to around 10 minutes, a 6x performance improvement.

Solution eight: combine multiple solutions

In practice I have found that for relatively simple data skew scenarios, one of the solutions above is usually enough. But for more complex scenarios, a combination of several solutions may be needed. For example, in a Spark job where skew appears at multiple points in the data flow, you can first use solutions one and two to preprocess part of the data and filter part of it, alleviating some of the skew; then increase the parallelism of certain shuffle operations to improve their performance; and finally choose an appropriate solution for the various aggregation or join operations to optimize each of them. You need to understand the ideas and principles behind these solutions thoroughly, and then, in practice, apply them flexibly according to the specific situation to solve your own data skew problem.

Origin: blog.csdn.net/The_Inertia/article/details/104055932