Spark data skew solutions

1. Increase the parallelism on the reduce side appropriately

 

Applicable scenario:

If a single task handles 100 keys and a particularly large amount of data, it is likely to hit OOM or run very slowly. Increasing the degree of parallelism at this point breaks up the data handled by that task: for example, the 100 keys originally handled by one task can be spread across 10 tasks, reducing the amount of data per task, which may resolve both the OOM and the slow run.
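A minimal sketch of raising the reduce-side parallelism, assuming a spark-shell session with sc in scope; the input path and the partition count of 1000 are illustrative assumptions, not values from the article:

    // Word counts keyed by word; the input path below is hypothetical.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Passing a larger numPartitions to reduceByKey spreads keys that previously
    // landed on one overloaded reduce task across more, smaller tasks.
    val counts = pairs.reduceByKey(_ + _, 1000)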


2. Two-stage aggregation with random key prefixes (reduceByKey)

 

The technique prepends a random number to each key and then aggregates the values in two stages:
(1) First aggregation (local aggregation): prepend a random number to each key, then run the first reduceByKey aggregation.
(2) Second aggregation (global aggregation): strip the random-number prefix from each key, then run a second reduceByKey aggregation to obtain the final overall result.
Applicable scenario:

Random key prefixes suit aggregation operators such as groupByKey and reduceByKey when some key values are skewed. For example, in an e-commerce system's ad-click analysis, user clicks are aggregated by province, so the original key is the province; if the values for a few provinces are particularly large, data skew occurs. Each skewed key can be split into several keys by prepending a random number, so the new key becomes random_province; call reduceByKey to aggregate locally, then remove the random prefix so the key is the province again, and call reduceByKey once more to aggregate globally, as the sketch below shows.
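A minimal sketch of the two-stage aggregation, assuming a spark-shell session with sc in scope; the sample click data and the prefix range of 10 are illustrative assumptions:

    import scala.util.Random

    // (province, clickCount) pairs; in the skewed case one province dominates.
    val clicks = sc.parallelize(Seq(
      ("Guangdong", 1L), ("Guangdong", 1L), ("Guangdong", 1L), ("Beijing", 1L)
    ))

    // Stage 1 (local aggregation): salt each key with a random prefix so the
    // skewed key's records are aggregated by several tasks instead of one.
    val locallyAggregated = clicks
      .map { case (province, cnt) => (s"${Random.nextInt(10)}_$province", cnt) }
      .reduceByKey(_ + _)

    // Stage 2 (global aggregation): strip the random prefix so the key is the
    // province again, then reduceByKey for the final per-province totals.
    val globallyAggregated = locallyAggregated
      .map { case (saltedKey, cnt) => (saltedKey.split("_", 2)(1), cnt) }
      .reduceByKey(_ + _)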


3. Sample out the skewed keys first, then join them separately

 

Applicable scenario:

When joining two RDDs, if one of them (RDD1) has serious data skew, the skewed keys in RDD1 can be found by sampling. RDD1 is then split into RDD11 (the data for the skewed keys) and RDD12 (the data for the non-skewed keys); RDD11 and RDD12 are each joined with RDD2 separately, and the two join results are combined with a union.

Principle:

If the amount of data for some keys in RDD1 is especially large, this approach can alleviate the skew: once the skewed keys are split out into RDD11 and joined separately, Spark Core's natural parallelism breaks up the data that previously piled onto a single task, so the originally skewed keys are dispersed across different tasks and the data skew is eased.
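A minimal sketch of this split-and-join approach, assuming a spark-shell session with sc in scope; the tiny sample data, the 0.1 sample fraction, and treating only the single most frequent key as skewed are illustrative assumptions:

    // rdd1 is the skewed side of the join, rdd2 the other side.
    val rdd1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))
    val rdd2 = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // Sample rdd1 and count keys to identify the most skewed one(s).
    val skewedKeys = rdd1
      .sample(withReplacement = false, fraction = 0.1)
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .keys
      .take(1)
      .toSet
    val skewedKeysBc = sc.broadcast(skewedKeys)

    // Split rdd1 into the skewed part (rdd11) and the non-skewed part (rdd12).
    val rdd11 = rdd1.filter { case (k, _) => skewedKeysBc.value.contains(k) }
    val rdd12 = rdd1.filter { case (k, _) => !skewedKeysBc.value.contains(k) }

    // Join each part with rdd2 separately, then union the two join results.
    val joined = rdd11.join(rdd2).union(rdd12.join(rdd2))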
 

Origin blog.csdn.net/xuehuagongzi000/article/details/104053052