Spark Performance Optimization | 12 Optimizations

Original: http://litaotao.github.io/boost-spark-application-performance

This series is a record of my own process of learning Spark: my own understanding, referenced articles, and ideas from personal practice with Spark. I wrote it mainly to organize my study notes, so it keeps only the essentials and leaves out unnecessary detail; the original English text of the documentation sometimes appears untranslated where that does not hurt understanding. To learn more, it is best to read the referenced articles and the official documentation.

Second, this series is based on spark 1.6.0, the latest release as of writing; Spark is updated quickly, so noting the version is useful, if not necessary.

Finally, if you think any of the content is wrong, please leave a comment; I will reply to every comment within 24 hours. Thank you very much.

Tips: if an illustration is hard to read, you can: 1. zoom the page; 2. open the image in a new tab to view the original.

1. Optimization? Why? How? When? What?

"Spark applications also need to be optimized?", Many people may have this doubt, "not already have code generator, the optimizer execution, pipeline or something of it?." Yes, spark does have some powerful columns of built-in tools to make your code faster at execution time. If, however, all depends on the tools, frameworks to do, I think it only shows two problems: 1 you just know these this framework, and not know why; it seems you are only 2 photos. painted gourd dipper you, did not you, others can easily write such a spark of application, so you are replaceable;

When optimizing a Spark application, it is enough to start from the following few points:

  • Why: because your resources are limited, and because your production environment has many sources of instability; optimizing and testing before going to production is the only way to reduce their impact;
  • How: the web UI plus the logs are the two essential weapons of optimization; master these two and you are set;
  • When: once the application is mature and the business requirements are met, start optimizing according to need and schedule;
  • What: in general, 80% of Spark application optimization concentrates on three areas: memory, disk IO, and network IO. In finer detail, that means driver and executor memory, shuffle settings, and the configuration of the file system and the cluster [e.g. try to keep the file system and the cluster within the same local network, where the network is faster; if possible, put the driver in the same local network as the cluster too, because data sometimes has to be returned from the workers to the driver];
  • Note: do not assume every optimization has to start from the program itself. Although most of the time the program is the cause, before inspecting the program it is best to confirm that all worker machines are in a normal state, e.g. machine load and network conditions.

The picture below is from the Databricks talk Tuning and Debugging Apache Spark; it is very interesting and, I think, spot on.

OK, let's take a look at some common optimization methods.

2. repartition and coalesce

Original text:

Spark provides the `repartition()` function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of `repartition()` called `coalesce()` that allows avoiding data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), you can check the size of the RDD using `rdd.partitions.size()` in Java/Scala and `rdd.getNumPartitions()` in Python and make sure

Summary: when you want to re-partition an RDD and the target number of partitions is smaller than the current number, use coalesce rather than repartition. For more details on partition optimization, see chapter 4 of Learning Spark.
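A minimal PySpark sketch of the difference, assuming an existing SparkContext sc as in the other examples (my own illustration, not from the original post):

rdd = sc.parallelize(range(1000), 100)   # start with 100 partitions
print(rdd.getNumPartitions())            # 100

# decreasing the partition count: coalesce merges partitions and can avoid a full shuffle
smaller = rdd.coalesce(10)
print(smaller.getNumPartitions())        # 10

# increasing the partition count requires a shuffle, so use repartition
larger = rdd.repartition(200)
print(larger.getNumPartitions())         # 200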

3. Passing Functions to Spark

In Python, we have three options for passing functions into Spark.

  • lambda expressions

    word = rdd.filter(lambda s: "error" in s)

  • top-level functions

    import my_personal_lib
    word = rdd.filter(my_personal_lib.containsError)

  • locally defined functions

    def containsError(s):
        return "error" in s
    word = rdd.filter(containsError)

One issue to watch out for when passing functions is inadvertently serializing the object containing the function. When you pass a function that is the member of an object, or contains references to fields in an object (e.g., self.field), Spark sends the entire object to worker nodes, which can be much larger than the bit of information you need. Sometimes this can also cause your program to fail, if your class contains objects that Python can’t figure out how to pickle.


   
   
### wrong way
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query

    def isMatch(self, s):
        return self.query in s

    def getMatchesFunctionReference(self, rdd):
        # Problem: references all of "self" in "self.isMatch"
        return rdd.filter(self.isMatch)

    def getMatchesMemberReference(self, rdd):
        # Problem: references all of "self" in "self.query"
        return rdd.filter(lambda x: self.query in x)

### the right way
class WordFunctions(object):
    ...
    def getMatchesNoReference(self, rdd):
        # Safe: extract only the field we need into a local variable
        query = self.query
        return rdd.filter(lambda x: query in x)

4. worker resource allocation: cpu, memory, executors

This is a relatively deep topic, and the answer differs between deployment modes [standalone, yarn, mesos], so no universal advice can be given here. Only one guideline: do not assume that all of a machine's resources can be handed over to Spark. Take into account the machine's own processes, the processes Spark depends on, the network situation, and the nature of the tasks [compute-intensive, IO-intensive, long-lived tasks], and so on.

For now I can only recommend some videos, slides, and blogs; concrete cases of resource tuning will be written up later, when I have real examples from practice.
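For reference only, a hedged sketch of the most common knobs, set through SparkConf; the values are made up for illustration, not recommendations, and on YARN you would typically also control the executor count via spark-submit's --num-executors:

from pyspark import SparkConf

sc_conf = SparkConf()
sc_conf.set("spark.executor.memory", "4g")   # heap size per executor
sc_conf.set("spark.executor.cores", "2")     # cores per executor
sc_conf.set("spark.cores.max", "16")         # standalone/mesos: total cores the application may take
# driver memory is usually set via spark-submit (--driver-memory),
# since the driver JVM is already running by the time this code executes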

5. shuffle block size limitation

Spark shuffle blocks can be no greater than 2 GB - a single block in a Spark shuffle cannot exceed 2 GB.

Spark uses a ByteBuffer as the buffer when handling shuffle data, and a ByteBuffer can hold at most 2 GB, so once a single block of shuffle data exceeds 2 GB the shuffle goes wrong. The common factors that affect the size of the shuffle data are:

  • The number of partitions: the fewer the partitions, the more data each partition holds, and the easier it is for a partition's shuffle data to become too large;
  • Uneven data distribution, typically after a groupByKey: a few keys hold far too much data, so the partitions containing those keys become too large and may later produce shuffle blocks greater than 2 GB;

The general way to solve this is to increase the number of partitions. Top 5 Mistakes When Writing Spark Applications suggests aiming for roughly 128 MB of data per partition; that is only a reference figure, and each scenario needs its own analysis. The point here is only to make the principle clear; there is no universally correct value. Ways to increase the partition count are listed below, with a small code sketch after the list:

  • pass a larger partition number (minPartitions) when calling sc.textFile
  • spark.sql.shuffle.partitions
  • rdd.repartition
  • rdd.coalesce
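A small sketch of those four knobs in PySpark; the path and the partition counts are placeholders, and sc / sqlContext are assumed to exist as in the other examples:

# sc.textFile: ask for more input partitions up front (minPartitions is a lower bound)
rdd = sc.textFile("hdfs:///path/to/data", minPartitions=400)

# spark.sql.shuffle.partitions: partitions used by Spark SQL shuffles (default 200)
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

# rdd.repartition: full shuffle into the requested number of partitions
rdd = rdd.repartition(400)

# rdd.coalesce: normally used to decrease; pass shuffle=True if you need to increase
rdd = rdd.coalesce(400, shuffle=True)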

TIPS :

Spark uses different data structures to record shuffle block information depending on whether the number of partitions is below or above 2000; above 2000, a more efficient [compressed] data structure is used. So if your partition count is just short of 2000 but close to it, you can safely set it to a value above 2000.


   
   
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}

6. level of parallelism - partition

Let's look at the performance metrics reported for all tasks of a stage, with a brief explanation of each:

  • Scheduler Delay: the time spent waiting for Spark to schedule the task
  • Executor Computing Time: the time the executor spends actually running the task
  • Getting Result Time: the time spent fetching the task's execution result
  • Result Serialization Time: the time spent serializing the task's result
  • Task Deserialization Time: the time spent deserializing the task
  • Shuffle Write Time: the time spent writing shuffle data
  • Shuffle Read Time: the time spent reading shuffle data

The key point here is the level of parallelism, which in most cases means the number of partitions. Changing the partition count changes the metrics above, and when tuning we very often watch how these metrics move. When the partition count changes, the metrics typically behave as follows:

  • too few partitions [easy to introduce data skew]
    • Scheduler Delay: no significant change
    • Executor Computing Time: unstable, some tasks fast, but the average is relatively high
    • Getting Result Time: unstable, some tasks fast, but the average is relatively high
    • Result Serialization Time: unstable, some tasks fast, but the average is relatively high
    • Task Deserialization Time: unstable, some tasks fast, but the average is relatively high
    • Shuffle Write Time: unstable, some tasks fast, but the average is relatively high
    • Shuffle Read Time: unstable, some tasks fast, but the average is relatively high
  • too many partitions
    • Scheduler Delay: no significant change
    • Executor Computing Time: relatively stable, and the average is relatively low
    • Getting Result Time: relatively stable, and the average is relatively low
    • Result Serialization Time: relatively stable, and the average is relatively low
    • Task Deserialization Time: relatively stable, and the average is relatively low
    • Shuffle Write Time: relatively stable, and the average is relatively low
    • Shuffle Read Time: relatively stable, and the average is relatively low

So how should the number of partitions be set? Again, there is no exact formula or rule; in practice we usually try a few values and keep the one that works best. The goal is: avoid data skew as far as possible, and keep the execution times of individual tasks within a reasonably narrow range.

7. data skew

Most of the time, we hope the effect of distributed computing looks like the figure below:

Sometimes, however, the effect is more like the figure below; this is what we call data skew. The data is not distributed roughly evenly across the cluster, so for such a job the overall execution time is determined by the task that has to process the largest block of data. Data skew is a big problem in many distributed systems. Take a distributed cache: suppose there are 10 machines but 50% of the data lands on a single one of them; when that machine goes down, roughly half of the cached data is lost and the cache hit rate drops by at least [in fact certainly more than] 50%. This is also why many distributed caches introduce consistent hashing and virtual nodes [vnodes].

Consistent hashing schematic:

Back to the topic: how do we solve the data skew problem in Spark? First, be clear about the scenario and the root cause. In general, the data consists of (key, value) pairs and the keys are unevenly distributed. A common remedy in this scenario is to salt the keys [I am not sure how "salt" should be said in Chinese]. For example, suppose there are two keys (key1, key2), where key1 corresponds to a very large data set and key2 to a relatively small one. We can expand them into many keys (key1-1, key1-2, ..., key1-n, key2-1, key2-2, ..., key2-m), making sure that the data under key1-* comes from splitting the original key1 data and the data under key2-* from splitting the original key2 data. After this we have m+n keys, each with a relatively small data set; parallelism goes up, each parallel task handles a data set of similar size, and processing is greatly accelerated. I have mentioned this approach in both of these shares:
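A minimal PySpark sketch of this salting approach (my own illustration with made-up numbers, assuming an existing SparkContext sc):

import random

num_salts = 10  # how many pieces to split each hot key into (illustrative)

# skewed input: key1 has 1000 records, key2 only 10
pairs = sc.parallelize([("key1", 1)] * 1000 + [("key2", 1)] * 10)

# step 1: salt the keys so key1's records are spread over num_salts distinct keys
salted = pairs.map(lambda kv: ("%s-%d" % (kv[0], random.randint(0, num_salts - 1)), kv[1]))

# step 2: aggregate on the salted keys; the hot key is now handled by up to num_salts tasks
partial = salted.reduceByKey(lambda a, b: a + b)

# step 3: strip the salt and aggregate the (now small) partial results
final = (partial
         .map(lambda kv: (kv[0].rsplit("-", 1)[0], kv[1]))
         .reduceByKey(lambda a, b: a + b))

print(final.collect())  # [('key1', 1000), ('key2', 10)], order may vary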

8. avoid cartesian operation

The rdd.cartesian operation is very expensive, especially on large data sets: the size of the cartesian product grows quadratically, consuming both time and space.


   
   
>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

9. avoid shuffle when possible

By default, Spark's shuffle writes data to disk at the end of one stage, and the next stage then reads it back from disk. The disk IO involved has a large impact on performance, especially when data volumes are large.
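One common way to avoid a shuffle entirely, when one side of a join is small enough to fit in memory, is to broadcast it and do a map-side lookup instead of a regular join. A hedged sketch of my own, not from the original post (assuming an existing SparkContext sc):

# small lookup table, kept on the driver and broadcast to every executor
small = {1: "a", 2: "b"}
small_bc = sc.broadcast(small)

large = sc.parallelize([(1, 100), (2, 200), (1, 300)])

# large.join(small_rdd) would shuffle both sides; this map-side lookup shuffles nothing
joined = large.map(lambda kv: (kv[0], (kv[1], small_bc.value.get(kv[0]))))
print(joined.collect())  # [(1, (100, 'a')), (2, (200, 'b')), (1, (300, 'a'))]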

10. use reduceByKey instead of groupByKey when possible
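reduceByKey pre-aggregates values inside each partition before the shuffle, so far less data crosses the network than with groupByKey, which ships every single (key, value) pair. A small sketch of my own (assuming an existing SparkContext sc):

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey: every pair crosses the network, then the values are summed on the reduce side
counts_slow = pairs.groupByKey().mapValues(lambda vs: sum(vs))

# reduceByKey: values are combined map-side first, so much less data is shuffled
counts_fast = pairs.reduceByKey(lambda a, b: a + b)

print(counts_fast.collect())  # [('a', 2), ('b', 1)], order may vary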

11. use treeReduce instead of reduce when possible
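treeReduce combines partial results on the executors in several rounds (controlled by depth) instead of sending every partition's result straight to the driver, which helps when there are many partitions or the partial results are large. A small sketch of my own (assuming an existing SparkContext sc):

rdd = sc.parallelize(range(100000), 100)

# plain reduce: all 100 partition results are sent to the driver and combined there
total = rdd.reduce(lambda a, b: a + b)

# treeReduce: partial results are first combined on executors in `depth` rounds,
# so the driver only receives a handful of already-merged results
total_tree = rdd.treeReduce(lambda a, b: a + b, depth=2)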

12. use Kryo serializer

In a Spark application, data has to be serialized when it is shuffled and when RDDs are cached. Besides the IO cost, serialization itself can become a bottleneck for the application. Using the Kryo serialization library is recommended; it keeps serialization efficient.


   
   
from pyspark import SparkConf

sc_conf = SparkConf()
sc_conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
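The conf then has to be passed in when the SparkContext is created, e.g.:

from pyspark import SparkContext

sc = SparkContext(conf=sc_conf)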
