Hadoop and Spark Shuffle: Similarities and Differences

Disclaimer: This is an original article by the blogger, published under the CC 4.0 BY-SA copyright agreement; if you reproduce it, please include the original source link and this statement.
Original link: https://blog.csdn.net/qq_40713537/article/details/102567885

From a high-level perspective, the two are not very different.

Both partition the mapper's output (in Spark, the output of a ShuffleMapTask), sending different partitions to different reducers (in Spark, the reducer may be a ShuffleMapTask in the next stage, or a ResultTask). The reducer keeps a buffer in memory and aggregates the data as it is shuffled in; once the data has been aggregated, it runs reduce() (in Spark, this may be a series of subsequent operations).
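To make this concrete, here is a minimal Spark sketch (not from the original article) in which reduceByKey triggers exactly this kind of shuffle: the first stage's ShuffleMapTasks partition the (word, 1) pairs by key, and the next stage aggregates the counts while reading them.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-example").setMaster("local[2]"))

    // The map side (ShuffleMapTask) partitions these records by key;
    // the reduce side fetches its partition and aggregates while reading.
    val counts = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // triggers a shuffle between the two stages

    counts.collect().foreach(println)
    sc.stop()
  }
}
```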

From a low-level point of view, the differences are considerable.

Hadoop MapReduce is sort-based: records entering combine() and reduce() must first be sorted. The advantage is that combine()/reduce() can process very large amounts of data, because their input can be obtained by external merging (the mapper sorts each chunk of data first, and the reducer's shuffle merges the sorted chunks). Spark's current default is hash-based, usually using a HashMap to aggregate the shuffled data, without sorting in advance. If the data needs to be sorted, you have to call an operation such as sortByKey() yourself; if you are a Spark 1.1 user, you can set spark.shuffle.manager to sort, and the data will then be sorted. In Spark 1.2, sort became the default shuffle implementation.
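A small sketch of what this looks like in user code, assuming the Spark 1.1/1.2-era configuration key the article mentions (spark.shuffle.manager) and an explicit sortByKey() call when ordering is needed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch following the article's Spark 1.1/1.2-era description:
// the hash-based shuffle does not sort, so ordering must be requested explicitly.
val conf = new SparkConf()
  .setAppName("sort-example")
  .setMaster("local[2]")
  .set("spark.shuffle.manager", "sort")  // select the sort-based shuffle (default since 1.2)

val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

// The hash-based shuffle aggregates without ordering; to get sorted output,
// call an explicit sorting transformation such as sortByKey().
val sorted = pairs.sortByKey()
sorted.collect().foreach(println)
```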

From the implementation point of view, the two differ in many ways.

Hadoop MapReduce divides processing into clear phases: map(), spill, merge, shuffle, sort, reduce(), and so on. Each phase has its own responsibility, and following a procedural programming style, each phase can be implemented by its own function. In Spark there are no such clearly separated phases, only different stages and a series of transformation()s, so operations such as spill, merge, and aggregate have to be embedded inside the transformation()s.
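For example, in the hypothetical sketch below the aggregation logic that MapReduce would run in its separate sort/reduce phases is instead passed as functions to a single combineByKey() transformation; spilling and merging happen inside that transformation rather than as standalone phases. It assumes an existing SparkContext named sc, as in the earlier example.

```scala
// The per-key aggregation is supplied entirely as functions to one transformation.
val scores = sc.parallelize(Seq(("math", 90), ("math", 70), ("eng", 80)))

val avgByKey = scores.combineByKey(
  (v: Int) => (v, 1),                                          // create the per-key combiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // map-side (in-memory) merge
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // reduce-side merge after the shuffle
).mapValues { case (sum, n) => sum.toDouble / n }

avgByKey.collect().foreach(println)
```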

If we call the map-side process of partitioning and persisting the data "shuffle write", and the reducer-side process of reading and aggregating the data "shuffle read", then for Spark the question becomes: how should the shuffle write and shuffle read processing logic be added to the job's logical or physical execution graph? And how can these two pieces of processing logic be implemented efficiently?

Because shuffle write does not need to sort the data, its task is simple: partition the data and persist it. The data is persisted, on the one hand, to reduce memory pressure, and on the other hand, for fault tolerance.
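The sketch below is a deliberately simplified, hypothetical illustration of hash-based shuffle write, not Spark's actual implementation (which goes through its block manager and serializers): each map task hashes a record's key to a reducer partition and appends the record to that partition's bucket file on local disk, which is the "partition and persist" step described above.

```scala
import java.io.{File, FileOutputStream, ObjectOutputStream}

// Hypothetical sketch of hash-based shuffle write: one bucket file per reducer partition.
def shuffleWrite(records: Iterator[(String, Int)], numReducers: Int, mapTaskId: Int): Unit = {
  // One output stream per reducer partition (bucket file on local disk).
  val writers = Array.tabulate(numReducers) { r =>
    new ObjectOutputStream(new FileOutputStream(new File(s"shuffle_${mapTaskId}_$r.bin")))
  }
  try {
    for ((key, value) <- records) {
      val partition = math.abs(key.hashCode) % numReducers  // partition the record by key
      writers(partition).writeObject((key, value))          // persist it to that bucket file
    }
  } finally {
    writers.foreach(_.close())
  }
}
```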
