The difference between Spark and Hadoop

1. Similarities and differences between Spark and MapReduce

  • Both use the MapReduce parallel computation model.
  • Hadoop's unit of work is the job:
    • A job is divided into map tasks and reduce tasks, and each task runs in its own process.
    • When a task finishes, its process ends.
  • What a Spark user submits is an application:
    • An application corresponds to one SparkContext, and an application can contain multiple jobs.
    • Each time an action operation is triggered, a job is produced.
    • These jobs can run in parallel or serially.
    • Each job has multiple stages; the DAGScheduler divides the job into stages at the shuffle dependencies between its RDDs (see the first sketch after this list).
    • Each stage contains multiple tasks; the TaskScheduler packs them into a TaskSet and distributes them to the executors for execution.
    • An executor has the same life cycle as the application and stays alive even when no job is running, so tasks start quickly and can compute on data held in memory.
  • A Hadoop job offers only map and reduce operations, so it is far less expressive.
    • A MapReduce pipeline repeatedly reads from and writes to HDFS between jobs, which causes heavy IO, and the relationships between the multiple jobs have to be managed by hand.
  • Spark performs iterative computation in memory (see the second sketch after this list).
    • The RDD API provides many operations such as join, groupBy, etc.
    • Good fault tolerance is achieved through the DAG: lost partitions can be recomputed from their lineage.
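
To make the application/job/stage/task hierarchy concrete, here is a minimal sketch (not from the original post; the app name, local master URL and sample data are assumptions): one application with one SparkContext triggers two actions, so two jobs run, and the reduceByKey shuffle makes the DAGScheduler split the work into a map-side stage and a reduce-side stage.

```scala
// Minimal sketch (assumed example): one application, one SparkContext,
// two actions => two jobs, and a shuffle that splits a job into stages.
import org.apache.spark.{SparkConf, SparkContext}

object JobsAndStagesSketch {
  def main(args: Array[String]): Unit = {
    // One application corresponds to exactly one SparkContext.
    val conf = new SparkConf().setAppName("jobs-and-stages-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Transformations are lazy: nothing runs yet.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Each action triggers one job. reduceByKey introduces a shuffle
    // dependency, so the DAGScheduler splits the job into a map-side stage
    // and a reduce-side stage; each stage becomes a TaskSet that the
    // TaskScheduler distributes to the executors.
    println(counts.collect().mkString(", "))  // action 1 -> job 1
    println(counts.count())                   // action 2 -> job 2

    sc.stop()
  }
}
```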

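A second sketch (again an assumed example, not from the original post) illustrates the last two points: join and groupByKey go beyond plain map/reduce, cache() keeps an RDD in executor memory so each iteration reads it without an HDFS round trip, and a lost cached partition is simply recomputed from its lineage in the DAG.

```scala
// Minimal sketch (assumed example): richer RDD operations plus iterative
// computation over a cached RDD; no HDFS round trip between iterations.
import org.apache.spark.{SparkConf, SparkContext}

object InMemoryIterationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("in-memory-iteration-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Operations beyond map/reduce: join and groupByKey.
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 12.00)))
    val spendPerUser = users.join(orders)       // (id, (name, amount))
      .groupByKey()                             // all purchases per user id
      .mapValues(_.map(_._2).sum)               // (id, total spent)

    // cache() keeps the RDD in executor memory; later jobs read it from
    // memory, and a lost partition is rebuilt from the DAG lineage.
    val totals = spendPerUser.cache()

    for (i <- 1 to 3) {
      // Each iteration triggers a new job over the in-memory data.
      val aboveThreshold = totals.filter { case (_, total) => total > i * 5.0 }.count()
      println(s"iteration $i: $aboveThreshold users spent more than ${i * 5.0}")
    }

    sc.stop()
  }
}
```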
 


Origin www.cnblogs.com/hdc520/p/11425177.html