The difference between MR and Spark

1. In Hadoop, a unit of work is called a job. A job is divided into map tasks and reduce tasks, and each task runs in its own process; when a task ends, its process ends with it.
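As a minimal sketch of that structure (not code from the article; class names, input/output paths, and the word-count logic are illustrative assumptions), a classic Hadoop MapReduce job in Scala looks roughly like this, with the map and reduce phases as separate task classes and a driver that submits the job:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Each map task runs instances of this class in its own JVM process.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one) // emit (word, 1) for the shuffle
    }
  }
}

// Each reduce task sums the counts for the keys assigned to it.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get())
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // hypothetical input path
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // hypothetical output path
    // The processes started for this job end when the job finishes.
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```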

2. In Spark, a unit of work is called an application. An application contains multiple jobs: each time an action is triggered, a job is generated, and these jobs can run in parallel or serially. Each job contains multiple stages, and stages are split at shuffle boundaries: DAGScheduler divides the job into stages based on the dependencies between RDDs. Each stage contains multiple tasks, which form a TaskSet that TaskScheduler distributes to executors for execution. An executor's lifecycle is the same as the application's: it keeps running even when no job is active, so tasks can start quickly and compute against data already in memory.
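A minimal sketch of that structure (the file name `data.txt` and the object name are assumptions, not from the article): one application, in which each action triggers a job, and `reduceByKey` introduces a shuffle that DAGScheduler uses as a stage boundary.

```scala
import org.apache.spark.sql.SparkSession

object AppJobsStages {
  def main(args: Array[String]): Unit = {
    // One application; its executors stay alive for the whole run.
    val spark = SparkSession.builder()
      .appName("app-jobs-stages")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("data.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)                  // wide dependency: shuffle, hence a stage boundary

    // Action 1: triggers job 0 (two stages, one on each side of the shuffle).
    println(counts.count())

    // Action 2: triggers job 1; the same executors are reused, no new processes start.
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```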

3. A Hadoop job only offers map and reduce operations, which limits expressiveness, and an MR pipeline repeatedly reads and writes HDFS between steps, producing a large amount of I/O; the relationships between multiple chained jobs must be managed by the user. Spark's iterative computations run in memory, its API provides many RDD operations such as join and groupBy, and good fault tolerance is achieved through the DAG lineage.
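A short sketch of those richer RDD operations and of in-memory reuse (the datasets and the repeated actions are illustrative assumptions): `join` and `groupByKey` are single calls that in plain MapReduce would each be a hand-written job chained through HDFS, and `cache()` keeps the reused RDD in memory while the DAG lineage lets Spark recompute lost partitions.

```scala
import org.apache.spark.sql.SparkSession

object RddOpsCache {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("rdd-ops-cache").master("local[*]")
      .getOrCreate().sparkContext

    val orders = sc.parallelize(Seq((1, "apple"), (2, "pear"), (1, "plum")))
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

    val joined  = orders.join(users)          // (id, (item, user))
    val grouped = joined.groupByKey().cache() // keep in memory for repeated use

    // Both actions reuse the cached RDD from memory instead of re-reading from disk.
    println(grouped.count())
    grouped.mapValues(_.size).collect().foreach(println)

    sc.stop()
  }
}
```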

Original link: https://blog.csdn.net/weixin_43704599/article/details/109610374
