Spark principle analysis: Executor + Task + Shuffle + BlockManager + CacheManager + Checkpoint

 

 Executor principle analysis

 

Task principle analysis

 

 

Shuffle principle analysis

1. Under what circumstances does a shuffle happen in Spark? During operations such as reduceByKey, groupByKey, sortByKey, countByKey, join, cogroup, and so on (see the sketch after this list).
2. Principle analysis of the default (unoptimized) Shuffle operation
3. Principle analysis of the optimized Shuffle operation
4. Analysis of the Shuffle-related source code
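As a minimal sketch of the first point (the application name, input data, and local master are made up for illustration), the word-count job below runs narrow transformations until reduceByKey, which forces a shuffle because all records with the same key must be brought to the same partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ShuffleExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Narrow transformations: no shuffle yet, data stays in its partition
    val pairs = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // reduceByKey needs all values of one key on one node, so it shuffles
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```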

Two characteristics of Spark Shuffle

The first characteristic
In earlier versions of Spark, the bucket cache was very important: all the data produced by a ShuffleMapTask was first written into an in-memory cache and only then flushed to disk. The problem is that when the map-side data volume is large, this can easily cause an out-of-memory error. So newer versions of Spark optimize this: the default in-memory cache is only 100 KB, and once the data written reaches that threshold, it is flushed to disk bit by bit.
The advantage of this approach is that it is far less prone to memory overflow. The disadvantage is that if the memory cache is too small, an excessive number of disk I/O writes may occur. The size of this memory cache can therefore be tuned according to the actual workload.
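As a hedged sketch of that tuning knob (the exact property name and default buffer size have changed across Spark versions, so treat `spark.shuffle.file.buffer` and the value below as assumptions to verify against your version's documentation):

```scala
import org.apache.spark.SparkConf

object ShuffleBufferTuning {
  // Enlarging the per-task shuffle write buffer reduces the number of disk flushes
  // at the cost of extra memory held by each ShuffleMapTask.
  val conf: SparkConf = new SparkConf()
    .setAppName("ShuffleBufferTuning")
    // Property name and default differ across Spark versions; verify before relying on it.
    .set("spark.shuffle.file.buffer", "64k") // example value, not a recommendation
}
```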

The second characteristic
This is completely different from MapReduce. MapReduce must write all of its map output to local disk files before the reduce phase can start pulling data. Why? Because MapReduce sorts by key by default, so it must first finish writing all the data before it can sort it, and only then can reduce pull it.
Spark does not need this. By default, Spark does not sort the shuffled data. Therefore, as soon as a ShuffleMapTask has written a little data, the ResultTask can pull that bit of data and run the user-defined aggregation functions and operators on it locally.
The advantage of this mechanism is that it is much faster than MapReduce. But there is also a drawback: the reduce provided by MapReduce can operate directly on all the values of each key, which is very convenient. In Spark, because of this on-the-fly pulling mechanism, there is no operator that hands you a key together with all of its values directly; you have to call groupByKey first, which shuffles into a MapPartitionsRDD, and then use the map operator to process the values of each key (see the sketch below). In that sense it is less convenient than the MapReduce computation model.
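A minimal sketch of that groupByKey-then-map pattern (assuming the same local SparkContext `sc` as in the earlier example; the sample pairs are made up), emulating MapReduce's "one key, all its values" reducer:

```scala
// Sample key/value pairs, e.g. word counts from upstream transformations
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Step 1: groupByKey shuffles so each key's values end up together,
// yielding an RDD[(String, Iterable[Int])] backed by a MapPartitionsRDD.
val grouped = pairs.groupByKey()

// Step 2: map over each (key, values) pair, like a MapReduce reducer would.
val sums = grouped.map { case (key, values) => (key, values.sum) }

sums.collect().foreach(println) // e.g. (a,4), (b,2)
```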

 

 Principle analysis of the normal Shuffle operation

 

Origin www.cnblogs.com/Transkai/p/11354843.html