Spark overall architecture and running process

This section first introduces Spark's basic terminology and overall architecture, then describes the basic running process, and finally introduces the core concepts and operating principles of RDDs.

Spark overall architecture

The Spark runtime architecture, shown in Figure 1, consists of a cluster resource manager (Cluster Manager), a number of worker nodes that run job tasks (Worker Node), the task control node of each application (Driver), and the process on each worker node responsible for executing specific tasks (Executor).

Figure 1  Spark runtime architecture

The Driver runs the main() function of a Spark Application and creates the SparkContext in it. The SparkContext is responsible for communicating with the Cluster Manager, applying for resources, and allocating and monitoring tasks.
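A minimal sketch of such a driver program (the application name, the local[*] master, and the job itself are illustrative, not from the original article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs everything in-process; on a real cluster this would
    // be the Cluster Manager's URL instead.
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
    val sc = new SparkContext(conf) // registers with the Cluster Manager

    // A trivial job, just to give the scheduler something to run.
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"even numbers: $evens")

    sc.stop() // deregisters and releases all resources
  }
}
```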

The Cluster Manager is responsible for managing applications and the resources they need to run on the Worker Nodes. Spark currently supports its native (standalone) Cluster Manager, the Mesos Cluster Manager, and the Hadoop YARN Cluster Manager.
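Which of these is used is decided by the master URL handed to the SparkContext. A sketch, with placeholder host names and the managers' default ports:

```scala
import org.apache.spark.SparkConf

// Native (standalone) Cluster Manager
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077")
// Apache Mesos
val mesosConf = new SparkConf().setMaster("mesos://mesos-host:5050")
// Hadoop YARN (the address is taken from the Hadoop configuration)
val yarnConf = new SparkConf().setMaster("yarn")
```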

An Executor is a process belonging to an Application that runs on a Worker Node. It is responsible for running Tasks and for keeping data in memory or on disk; each Application has its own independent set of Executors. Each Executor holds a certain amount of resources for running the tasks assigned to it.

The Executors on a Worker Node serve different Applications and share no data between them. Compared with the MapReduce computing framework, the Executor design gives Spark two major advantages.

  • An Executor uses multiple threads to execute specific tasks. Compared with MapReduce's one-process-per-task model, this makes resource usage and startup overhead much smaller.
  • An Executor contains a BlockManager storage module that uses memory and disk together as a storage device. When a computation needs multiple rounds of iteration, intermediate results can be kept in this storage module and read directly in the next round instead of being re-read from disk, which effectively reduces I/O overhead. In interactive query scenarios, data can also be pre-cached into the BlockManager module to improve read/write I/O performance (see the sketch after this list).
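The second point can be made concrete with a small sketch: persist an RDD once, then reuse it across iterations. The input path and the loop are hypothetical; MEMORY_AND_DISK mirrors the memory-plus-disk behavior of the BlockManager described above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // Persist once: the BlockManager keeps the partitions in memory,
    // spilling to disk when they do not fit.
    val lengths = sc.textFile("hdfs:///data/input") // hypothetical path
      .map(_.length)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Each round reads the cached partitions from the BlockManager
    // instead of re-reading the file from disk.
    for (threshold <- Seq(10, 20, 30)) {
      println(s"lines longer than $threshold: " + lengths.filter(_ > threshold).count())
    }

    sc.stop()
  }
}
```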

Spark running process

The basic Spark running process is shown in Figure 2 and consists of the following steps.

1) Build the runtime environment for the Spark Application (start the SparkContext). The SparkContext registers with the Cluster Manager and applies for resources to run Executors (see the configuration sketch after step 4).

2) The Cluster Manager allocates resources and starts the Executor processes; the Executors' running status is sent along with their "heartbeats" to the Cluster Manager.

Figure 2  Basic Spark running process

3) The SparkContext builds a DAG, splits the DAG into multiple Stages, and sends each Stage's TaskSet (set of tasks) to the Task Scheduler. Executors apply to the SparkContext for Tasks, and the Task Scheduler dispatches Tasks to the Executors; at the same time, the SparkContext ships the application code to the Executors (a stage-splitting sketch follows the scheduler discussion below).

4) Tasks run on the Executors and report their results back to the Task Scheduler and then to the DAG Scheduler. Once the run has finished and the data has been written, the SparkContext deregisters from the Cluster Manager and releases all resources.
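The resource application in step 1 is driven by ordinary configuration. A sketch of the relevant settings (the values are illustrative; spark.executor.instances applies when running under YARN):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ResourceSketch")
  .set("spark.executor.instances", "4") // how many Executors to apply for (YARN)
  .set("spark.executor.cores", "2")     // task threads per Executor
  .set("spark.executor.memory", "2g")   // heap per Executor process

val sc = new SparkContext(conf) // step 1: register and apply for resources
```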

The DAG Scheduler decides the ideal locations for running Tasks and passes this information down to the Task Scheduler.

The DAG Scheduler converts a Spark job into a DAG of Stages, works out the lowest-cost scheduling strategy from the relationships between RDDs and Stages, and then submits the Stages to the Task Scheduler in the form of TaskSets. In addition, the DAG Scheduler handles failures caused by lost Shuffle data, which may require resubmitting Stages that have already run.

The Task Scheduler maintains all TaskSets. When an Executor sends a "heartbeat" to the Driver, the Task Scheduler assigns Tasks to it according to its remaining resources. The Task Scheduler also tracks the running state of all Tasks and retries the ones that fail.
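To see where the DAG Scheduler cuts the DAG into Stages, consider this small word-count sketch: the shuffle required by reduceByKey forces a Stage boundary, so the DAG Scheduler hands the Task Scheduler two TaskSets, one per Stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("StageSketch").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))
      .map(word => (word, 1)) // narrow dependencies: all of this is Stage 0

    // reduceByKey needs a shuffle, so the DAG is cut here; the reduce
    // side becomes Stage 1, submitted as a second TaskSet.
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println) // the action triggers the whole job
    sc.stop()
  }
}
```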

Overall, the Spark execution model has the following characteristics.

1) Each Application has its own dedicated Executor processes, which stay resident for the whole lifetime of the Application and run tasks in multiple threads.

This per-Application isolation has natural advantages, both in scheduling (each Driver schedules its own tasks) and in execution (Tasks from different Applications run in separate JVMs).

At the same time, because an Executor process runs tasks as multiple threads, the overhead of repeatedly starting processes is avoided, making task execution efficient and reliable. Of course, this also means that a Spark Application cannot share data with other applications unless the data is written to an external storage system.

2) Spark is independent of the specific Cluster Manager: all it needs is to be able to acquire Executor processes and keep communicating with them.

3) The client that submits the SparkContext should be close to the Worker Nodes, preferably in the same rack, because a large amount of information is exchanged between the SparkContext and the Executors while a Spark Application runs.

4) Tasks use data locality and speculative execution as optimization mechanisms. Data locality means performing the computation on the node where the data resides whenever possible, because moving computation incurs far less network overhead than moving data. In addition, Spark uses a delay scheduling mechanism, which allows the execution process to be optimized further (see the configuration sketch after point 5).

5) The BlockManager (storage module) on each Executor uses memory and disk together as a storage device. When a task performs iterative computation, intermediate results do not need to be written to a distributed file system; they are stored directly in this storage system, and subsequent iterations can read them from there, avoiding disk reads and writes. In interactive query scenarios, the relevant data can also be cached in this storage system in advance to improve query performance.
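Points 4) and 5) map to ordinary configuration switches. A sketch of the knobs for speculative execution and delay scheduling (the values are illustrative, not tuning advice):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Point 4): re-launch suspiciously slow Tasks on another node
  .set("spark.speculation", "true")
  // Delay scheduling: how long to wait for a data-local slot
  // before accepting a less local one
  .set("spark.locality.wait", "3s")
```

Caching for point 5) was already sketched in the architecture section above (persist with StorageLevel.MEMORY_AND_DISK).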

