1. Similarities and differences between Spark and MapReduce
- Both Spark and MapReduce (MR) are parallel computation frameworks.
- In Hadoop, the unit of user-submitted work is a job.
- A job is divided into map tasks and reduce tasks, and each task runs in its own JVM process.
- When a task finishes, its process exits (see the sketch below).
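A minimal Scala sketch of the Hadoop job model described above. The class names, paths, and word-count logic are illustrative assumptions, not taken from the original text; the point is that one submission is one job, whose map and reduce tasks each run in short-lived processes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Map side: each map task runs in its own process over one input split.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reduce side: each reduce task runs in its own process after the shuffle.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it  = values.iterator()
    while (it.hasNext) sum += it.next().get
    ctx.write(key, new IntWritable(sum))
  }
}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // One submission = one job; task processes exit as soon as their task ends.
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path("/input"))     // hypothetical paths
    FileOutputFormat.setOutputPath(job, new Path("/output"))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```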
- In Spark, the unit of user-submitted work is an application.
- An application corresponds to one SparkContext, and an application can contain multiple jobs.
- Every action operation triggers a job.
- Jobs can run in parallel or serially.
- Each job has multiple stages; the DAGScheduler divides a job into stages at the shuffle dependencies in the RDD lineage.
- Each stage contains multiple tasks; the TaskScheduler bundles them into a TaskSet and distributes them to executors for execution.
- An executor has the same lifecycle as the application: it stays alive even when no job is running, so tasks start quickly and can compute on data already held in memory (see the sketch below).
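A minimal sketch of this execution model, assuming local mode and a hypothetical input path. Two actions trigger two jobs inside one application, and the shuffle introduced by reduceByKey is where the DAGScheduler draws a stage boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AppModelSketch {
  def main(args: Array[String]): Unit = {
    // One application = one SparkContext; its executors are acquired for the
    // application and live for its whole duration.
    val conf = new SparkConf().setAppName("app-model-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("/input/words.txt")       // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      // reduceByKey is a shuffle dependency: the DAGScheduler cuts the job
      // into two stages here; each stage's tasks are bundled into a TaskSet
      // by the TaskScheduler and shipped to the executors.
      .reduceByKey(_ + _)

    // Each action triggers a job, so this application runs two jobs on the
    // same long-lived executors.
    val distinct = counts.count()                      // action -> job
    val sample   = counts.take(5)                      // another action -> another job
    println(s"distinct words: $distinct; sample: ${sample.mkString(", ")}")

    sc.stop()
  }
}
```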
- A Hadoop job provides only map and reduce operations, which limits its expressiveness.
- MR repeatedly reads from and writes to HDFS between chained jobs, causing heavy IO, and the user must manage the dependencies among those jobs.
- In Spark, iterative computation is carried out in memory.
- The RDD API provides many richer operations, such as join, groupBy, etc.
- Good fault tolerance is achieved through the DAG: lost partitions are recomputed from their lineage (see the sketch below).
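A sketch of this contrast, using a simplified PageRank-style loop over hypothetical data (not from the original text): cache() keeps the dataset in memory across iterations instead of re-reading HDFS, join and groupByKey stand in for the richer RDD operators, and toDebugString prints the lineage DAG that Spark replays to rebuild lost partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Hypothetical edge list (page -> neighbor); groupByKey is one of the
    // richer operators the two-phase MR model does not offer directly.
    val links = sc.parallelize(Seq(("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")))
      .groupByKey()
      .cache() // keep in memory across iterations: no HDFS round-trip per step

    var ranks = links.mapValues(_ => 1.0)

    // Each iteration joins against the cached links instead of re-reading
    // input from disk, which is what a chain of MR jobs would have to do.
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (neighbors, rank) => neighbors.map(n => (n, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(r => 0.15 + 0.85 * r)
    }

    // The lineage (DAG) Spark recorded for this RDD: a lost partition is
    // rebuilt by replaying it, which is the basis of Spark's fault tolerance.
    println(ranks.toDebugString)
    println(ranks.collect().toSeq.sortBy(-_._2).mkString(", "))
    sc.stop()
  }
}
```

Because the lineage is deterministic, Spark can recover a lost partition by recomputing it rather than by replicating intermediate results, which is also why no per-iteration HDFS write is needed.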