Comparing the multi-process model of MapReduce with the multi-threaded model of Spark

Apache Spark's performance depends, to some extent, on the asynchronous concurrency model it adopts (here referring to the model used on the server / driver side), and in this respect it is consistent with Hadoop 2.0 (including YARN and MapReduce). Hadoop 2.0 implements its own Actor-like asynchronous concurrency model internally, built on epoll plus a state machine, while Apache Spark directly uses the open-source library Akka, which implements the Actor model and performs very well. Although both use the same concurrency model on the server side, at the task level (that is, Spark tasks versus MapReduce tasks) they adopt different parallelism mechanisms: Hadoop MapReduce uses a multi-process model, whereas Spark uses a multi-threaded model.
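To make the Actor model concrete, below is a minimal sketch using the classic Akka API that Spark relied on at the time; the actor class, the system name and the message are made up for illustration.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// A minimal actor: messages land in a mailbox and are handled one at a time,
// asynchronously and without explicit locking -- the essence of the Actor model.
class Worker extends Actor {
  def receive = {
    case msg: String => println(s"handled: $msg")
  }
}

object ActorDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("demo")                     // hypothetical system name
    val worker = system.actorOf(Props[Worker](), "worker")
    worker ! "task-finished"                             // asynchronous, fire-and-forget send
    system.terminate()
  }
}
```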

Note that "multi-process" and "multi-threaded" here refer to how the tasks on a single node are run. Viewed as a whole, both MapReduce and Spark are multi-process: a MapReduce application consists of multiple independent Task processes, while the runtime environment of a Spark application is a temporary resource pool built from multiple independent Executor processes.

The multi-process model makes it easy to control, at a fine granularity, the resources each task consumes, but it spends more time starting tasks, which makes it unsuitable for low-latency jobs; this is one of the things MapReduce is widely criticized for. The multi-threaded model adopted by Spark, by contrast, makes it well suited to running low-latency jobs. In short, on each node Spark runs its tasks as threads inside a single JVM process, which brings the following benefits (a configuration sketch follows the list):

1) Tasks start quickly. In contrast, a MapReduce Task process starts slowly, usually taking about 1 s to launch;

2) All tasks on a node run in the same process, which makes it easy to share memory. This suits memory-intensive tasks; in particular, applications that need to load large dictionaries can save a great deal of memory because the dictionary is loaded only once per node.

3) All tasks on a node run in one JVM process (the Executor), and the Executor's resources can be used by successive batches of tasks instead of being released after only part of the work is done. This avoids the time cost of requesting resources separately for every task, and for applications with many tasks it can greatly reduce the running time. Contrast this with a MapReduce Task: each Task requests its own resources and releases them immediately after use, so they cannot be reused by other tasks. MRv1 supported JVM reuse, which alleviated the problem to some extent, but Hadoop 2.0 does not yet support that feature.
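To see what this looks like in practice, here is a minimal sketch of a Spark job whose many tasks all reuse a small pool of long-lived Executor JVMs; the application name, executor counts and resource sizes are assumptions made for the example, and the master is assumed to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorReuseSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical settings: a small, fixed Executor pool that every task reuses.
    val conf = new SparkConf()
      .setAppName("executor-reuse-sketch")
      .set("spark.executor.instances", "4")   // 4 Executor JVM processes
      .set("spark.executor.cores", "2")       // 2 task slots per Executor
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)

    // 1000 partitions => 1000 tasks, but they all run as threads inside the
    // same 4 long-lived Executor JVMs: no per-task process start-up and no
    // per-task resource request, unlike a MapReduce Task.
    val sum = sc.parallelize(1 to 1000000, numSlices = 1000)
      .map(_.toLong)
      .reduce(_ + _)
    println(s"sum = $sum")

    sc.stop()
  }
}
```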

 

Although Spark's multi-threaded model brings many benefits, it also has shortcomings, mainly:

1) Because all tasks run in one process, tasks on the same node compete heavily for resources, and it is difficult to control each task's resource usage at a fine granularity. MapReduce, in contrast, lets users set resources separately for Map Tasks and Reduce Tasks, giving fine-grained control over resource-intensive tasks and helping large jobs run smoothly.

The multi-process model of MapReduce and the multi-threaded model of Spark are briefly introduced below.

1. MapReduce multi-process model

1) Each Task runs in a separate JVM process;

2) Different types of Task can be given different amounts of resources; currently two resource types, memory and CPU, are supported (see the configuration sketch after this list);

3) After each Task finishes, it releases the resources it occupied, and those resources cannot be reused by any other Task, even a Task of the same type in the same job. In other words, every Task goes through the cycle "request resources -> run Task -> release resources".
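As an illustration of point 2), the sketch below shows how a Hadoop 2.x MapReduce job might request different memory and CPU amounts for its Map Tasks and Reduce Tasks; the job name and the concrete values are assumptions made for the example.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object MRResourceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Per-task-type resources (Hadoop 2.x / YARN property names): each Map Task
    // and each Reduce Task gets its own container with exactly these amounts,
    // requested when the task starts and released as soon as it finishes.
    conf.set("mapreduce.map.memory.mb", "1024")      // memory per Map Task
    conf.set("mapreduce.map.cpu.vcores", "1")        // CPU per Map Task
    conf.set("mapreduce.reduce.memory.mb", "2048")   // memory per Reduce Task
    conf.set("mapreduce.reduce.cpu.vcores", "2")     // CPU per Reduce Task

    val job = Job.getInstance(conf, "mr-resource-sketch") // hypothetical job name
    // ... set mapper, reducer, input and output paths here ...
  }
}
```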

2. Spark multi-threaded model

1) Each node can run one or more Executor services;

2) Each Executor is configured with a certain number of slots, which indicates how many ShuffleMapTasks or ResultTasks the Executor can run concurrently;

3) Each Executor runs in a single JVM process, and each Task is a thread running inside that Executor;

4) Tasks inside the same Executor can share memory; for example, a file or data structure broadcast through SparkContext#broadcast is loaded only once per Executor, rather than once per Task as in MapReduce (see the sketch after this list);

5) Once an Executor starts, it keeps running, and its resources can be reused by Tasks continuously; they are released only when the whole Spark program finishes and the Executor exits.
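The sketch below illustrates point 4): a dictionary shared through SparkContext#broadcast. The dictionary contents and the input path are made up for the example, and the master is assumed to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDictSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-dict-sketch"))

    // An illustrative "dictionary"; in practice this might be loaded from a file.
    val dict = Map("spark" -> 1, "mapreduce" -> 2, "yarn" -> 3)

    // SparkContext#broadcast ships the dictionary to each Executor once; every
    // Task (thread) in that Executor then reads the same in-memory copy, instead
    // of each task loading its own copy the way a MapReduce Task would.
    val bcDict = sc.broadcast(dict)

    val matched = sc.textFile("hdfs:///tmp/words.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .filter(w => bcDict.value.contains(w))             // look up the shared copy
      .count()

    println(s"matched words: $matched")
    sc.stop()
  }
}
```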

Overall, Spark uses a classic scheduler/workers pattern: the first step of every Spark application is to build a reusable resource pool, and all ShuffleMapTasks and ResultTasks then run inside that pool. (Note that although Spark programs can be written very flexibly and are no longer limited to a Mapper and a Reducer, internally the Spark engine expresses even a complex application with only these two kinds of Task: ShuffleMapTask and ResultTask.) A MapReduce application is different: it does not build a reusable resource pool; instead, each Task requests resources dynamically and releases them as soon as it finishes running.
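A rough word-count sketch shows how even a freely written program reduces to these two task types; the input path is a made-up example and submission via spark-submit is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoTaskTypesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-task-types-sketch"))

    val pairs = sc.textFile("hdfs:///tmp/input")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))

    // reduceByKey introduces a shuffle boundary, splitting the job into stages:
    //   - the stage before the shuffle is executed as ShuffleMapTasks,
    //   - the final stage that produces the result for collect() is executed
    //     as ResultTasks.
    // Both kinds of task run as threads in the already-started Executor pool.
    val counts = pairs.reduceByKey(_ + _).collect()

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```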


Origin www.cnblogs.com/coco2015/p/11272671.html