Similarities and differences between Spark and MapReduce (MR)

Spark borrows from MapReduce and was developed on its basis. It inherits the advantages of distributed computing and improves on MapReduce's obvious shortcomings, but the two also differ in a number of ways:

1. Spark can keep the intermediate data of a job in memory, which makes iterative computation more efficient. MapReduce has to write its intermediate results to disk, and the resulting disk I/O hurts performance.

2. Spark has high fault tolerance, which it achieves through the resilient distributed dataset (RDD). An RDD is a read-only collection of data that is partitioned across nodes and held in memory. These collections are resilient: if part of a dataset is lost or corrupted, it can be rebuilt from the lineage that records how the dataset was computed. MapReduce can only recover from a failure by recomputing, which costs more.

3. Spark is more general. It provides many API functions, which fall into two categories, transformations and actions. MapReduce provides only two operations, map and reduce.

4. Spark's framework and ecosystem are more complex. There are RDDs, lineage, the directed acyclic graph (DAG), and the division of a job into stages during execution, and a Spark job often needs a lot of tuning for each scenario before it meets its performance requirements. MapReduce's framework and ecosystem are relatively simple, and less is demanded of it in terms of performance, but it runs more stably and suits long-running background jobs.
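The ideas in points 1 to 3 (lazy transformations versus actions, intermediate results kept in memory, and rebuilding lost data from lineage) can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions: ToyRDD and its methods are hypothetical names invented for this illustration, not Spark's API, and everything runs in a single process with no partitioning or distribution.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self._source = source    # only set on the root dataset
        self._parent = parent    # lineage: the dataset this one derives from
        self._fn = fn            # how to derive this dataset from its parent
        self._name = name
        self._persist = False    # whether to keep the result in memory
        self._cache = None       # in-memory copy of the computed result
        self.compute_count = 0   # how many times this dataset was computed

    # Transformations: describe a new dataset, compute nothing yet.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(x) for x in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [x for x in rows if pred(x)], name="filter")

    def cache(self):
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache   # served from memory, no recomputation
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent._compute())
        if self._persist:
            self._cache = rows
        return rows

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def reduce(self, f):
        return reduce(f, self._compute())

    def lineage(self):
        names, node = [], self
        while node is not None:
            names.append(node._name)
            node = node._parent
        return " <- ".join(names)

    def lose_cached_data(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None


numbers = ToyRDD(source=range(1, 11))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

runs_before_action = evens_squared.compute_count   # transformations are lazy
first = evens_squared.collect()                    # first action computes
total = evens_squared.reduce(lambda a, b: a + b)   # second action reuses memory
runs_after_two_actions = evens_squared.compute_count

evens_squared.lose_cached_data()                   # simulated failure
rebuilt = evens_squared.collect()                  # rebuilt by replaying lineage

print(evens_squared.lineage())
print(first, total, runs_before_action, runs_after_two_actions)
```

In this sketch the second action does not recompute anything because the result is already in memory, and after the simulated loss the dataset is rebuilt by walking the lineage back to the source. A real RDD does this per partition, so only the lost partitions are recomputed.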

In summary, Spark has a richer ecosystem, more functionality, better performance, and a wider range of applications. MapReduce is simpler and more stable, and is suited to offline data mining and computation over massive data sets.

Origin www.cnblogs.com/xiangyuguan/p/11227971.html