What is the difference between Spark and MapReduce?

Both Spark and MapReduce can process massive amounts of data, but they differ in how they process it and in how fast they run. The main differences are summarized below:

1. Spark processes data mainly in memory, while MapReduce processes data mainly on disk.

  MapReduce writes intermediate results to disk, which reduces memory usage at the cost of computing performance.

  Spark keeps the intermediate results of a computation in memory, where they can be reused repeatedly, which improves data processing performance.
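The effect of reusing an in-memory intermediate result can be illustrated with a toy model. This is plain Python, not Spark code: `run_iterations` and its parameters are hypothetical names made up for this sketch, and the "expensive" step is only a stand-in. In Spark itself the reuse is requested by calling `cache()` or `persist()` on an RDD or DataFrame.

```python
def run_iterations(records, iterations, cache_intermediate):
    """Return (results, number of times the intermediate result was computed)."""
    computations = 0

    def compute_intermediate():
        nonlocal computations
        computations += 1
        # Stand-in for an expensive step such as parsing or a join.
        return [r * 2 for r in records]

    # With caching, the intermediate result is computed once and kept in memory.
    cached = compute_intermediate() if cache_intermediate else None

    results = []
    for i in range(iterations):
        # Without caching, every iteration recomputes the intermediate result.
        data = cached if cache_intermediate else compute_intermediate()
        results.append(sum(data) + i)
    return results, computations


with_cache = run_iterations([1, 2, 3], iterations=5, cache_intermediate=True)
without_cache = run_iterations([1, 2, 3], iterations=5, cache_intermediate=False)
print(with_cache[1], without_cache[1])
```

Both runs produce the same results; the only difference is how many times the intermediate step is executed, which is where an iterative job saves time.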
 

2. Spark builds a DAG (directed acyclic graph) when processing data, which reduces the number of shuffles and the number of times data is written to disk.

  The fundamental reason Spark is faster than MapReduce is its DAG computing model. In most cases, a DAG needs fewer shuffles than the equivalent MapReduce jobs. Spark's DAGScheduler can be thought of as an improved version of the MapReduce model. If a computation does not need to exchange data with other nodes, Spark can complete the operations in memory in a single pass, so intermediate results do not need to be written to disk and disk I/O is reduced. However, if the computation does involve data exchange between nodes, Spark also writes the shuffle data to disk!

  A common misunderstanding is that Spark is fast simply because it computes in memory. That is not the main reason: any computation has to load data into memory first, and the same is true for Hadoop. What Spark adds is a cache for data that needs to be used repeatedly, which reduces the time spent reloading it. This is why Spark is better suited to running machine learning algorithms, which iterate over the same data many times.
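To make the idea of "operations that do not exchange data run together, and a shuffle starts a new stage" concrete, here is a minimal sketch in plain Python. It is not Spark's DAGScheduler: `split_into_stages` and `WIDE_OPS` are hypothetical names, the pipeline is assumed to be a simple linear chain, and the set of shuffle-causing operations is deliberately simplified.

```python
# Operations assumed to need a shuffle (data exchange between nodes).
# Simplified for illustration; real Spark decides this from partitioning.
WIDE_OPS = {"reduceByKey", "groupByKey", "repartition"}


def split_into_stages(ops):
    """Group a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        # A shuffle-causing operation closes the current stage and opens a new one.
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


ops = ["textFile", "map", "filter", "reduceByKey", "map", "groupByKey", "map"]
stages = split_into_stages(ops)
shuffles = sum(1 for op in ops if op in WIDE_OPS)

for number, stage in enumerate(stages):
    print(number, stage)
print("shuffles:", shuffles)
```

In this model the seven operations collapse into three stages with two shuffles between them. Everything inside a stage can be applied to a partition in one pass, without writing intermediate results to disk.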

3. Spark requests resources in a coarse-grained way, while MapReduce requests resources in a fine-grained way.

  Coarse-grained resource requests mean that when an application is submitted, Spark requests resources from the resource manager (YARN, Mesos) in advance. If the resources are not granted, it waits. Once they are granted, it runs its tasks without each task having to request resources again.

  MapReduce requests resources in a fine-grained way: a job is submitted, then each task requests its own resources, runs, and releases them. Resources can be fully utilized this way, but tasks run very slowly because every task pays the cost of requesting resources.
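The trade-off can be sketched with a rough cost model. This is an illustration under stated assumptions, not a measurement: the function names and the numbers are hypothetical, tasks are assumed to run one after another, and each resource request is assumed to cost a fixed amount of time.

```python
def coarse_grained_time(num_tasks, task_time, request_time):
    # Resources are requested once, up front, then reused by every task.
    return request_time + num_tasks * task_time


def fine_grained_time(num_tasks, task_time, request_time):
    # Every task requests its own resources before it runs.
    return num_tasks * (request_time + task_time)


# Hypothetical workload: 100 tasks of 1 second each, 2 seconds per resource request.
coarse = coarse_grained_time(num_tasks=100, task_time=1, request_time=2)
fine = fine_grained_time(num_tasks=100, task_time=1, request_time=2)
print(coarse, fine)
```

Under these assumptions the request overhead is paid once in the coarse-grained case and once per task in the fine-grained case. The model leaves out the other side of the trade-off described above: coarse-grained resources stay reserved for the whole application, even when some of them are idle.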

https://blog.csdn.net/JENREY/article/details/84873874

Origin blog.csdn.net/u013963379/article/details/106460616