Spark Core: Runtime Architecture

Reference (official documentation): http://spark.apache.org/docs/latest/cluster-overview.html

Cluster Mode Overview

This document gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved. Read through the application submission guide to learn about launching applications on a cluster.
In short, this document briefly outlines how Spark runs on a cluster so that the components involved are easier to understand; read the application submission guide for information on starting an application on a cluster.

Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
"Sets of processes" simply means a collection of processes: a Spark application running on a cluster is a set of independent processes, coordinated and organized by the SparkContext object created in the main routine (referred to as the driver). A Spark application therefore comprises one driver and multiple executors.
Analysis: a Spark application is a group of independent processes, consisting of one driver program process and several executor processes.

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
To put this concretely: in cluster mode, SparkContext can connect to different types of cluster managers, for example Spark's own standalone cluster manager, Mesos, or YARN; the role of the cluster manager is to allocate resources across the various applications. Once connected to one of these cluster managers, Spark acquires executors on the nodes of the cluster; these executors are processes that perform the computations of our application and store its data. Then SparkContext sends our application code, defined by the JAR or Python files passed to SparkContext, to the executors. Finally, SparkContext sends tasks to the executors to run.
[Figure: Spark cluster overview diagram: the driver program (SparkContext) talks to a cluster manager, which provides executors on the worker nodes]
The driver program contains the SparkContext. For the application to run on a cluster, SparkContext must first apply to the cluster manager for resources, for example two nodes and two executors. Once it is connected and has obtained the resources, i.e. the executors distributed across the nodes of the cluster, SparkContext can send our application code to those executors, and finally SparkContext sends tasks to the executors to run.
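To make the flow just described concrete, here is a minimal sketch of a driver program in Scala. It is only an illustration: the master URL, application name, and data below are placeholder values, not something taken from the original article or the Spark documentation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // The SparkConf names the application and points SparkContext at a cluster
    // manager via the master URL (a hypothetical standalone master here).
    val conf = new SparkConf()
      .setAppName("minimal-driver")
      .setMaster("spark://master-host:7077")

    // Creating the SparkContext is the step that asks the cluster manager for executors.
    val sc = new SparkContext(conf)

    // This work is shipped to the executors as tasks; countByValue() brings the
    // aggregated result back to the driver.
    val counts = sc.parallelize(1 to 1000).map(_ % 10).countByValue()
    println(counts)

    // Stopping the context releases the executors held by this application.
    sc.stop()
  }
}
```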

There are several useful points to note about this architecture:

  • 1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
    Each Spark application has its own executor processes, which stay alive for the whole lifetime of that application and run multiple tasks using multiple threads. The benefit is that applications are isolated from each other, both on the scheduling side (each driver schedules only its own tasks) and on the executor side (tasks from different applications run in different JVMs; even if two Spark applications each have an executor on the same node, those executors run independently of each other). However, it also means that different Spark applications (that is, different SparkContext instances) cannot share their data directly; to share data it must be written to an external storage system, for example a distributed in-memory store such as Alluxio.

  • 2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
    Spark does not care which cluster manager sits underneath it. As long as Spark can acquire executor processes and those processes can communicate with each other, it is relatively easy to run a Spark application even on a cluster manager that also supports other applications (such as Mesos or YARN).
    In other words, Spark runs well in many environments: whether the underlying manager is standalone, Mesos, or YARN, the application code stays the same.

  • 3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
    Throughout its lifetime, the driver program listens for and accepts connections from its executors (see spark.driver.port in the network config section). Therefore the driver program must be network addressable from every worker node (a configuration sketch follows this list).

  • 4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
    Because the driver schedules the various tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. If you want to send requests to the cluster remotely, it is better to open an RPC connection to the driver and have it submit operations from nearby, rather than running the driver far away from the worker nodes.
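As mentioned in points 3 and 4 above, the driver has to be reachable from the worker nodes. The snippet below is a small configuration sketch, assuming the application is launched with spark-submit (which supplies the master URL); the port number and host name are placeholders.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("network-addressable-driver")
  // Executors open connections back to the driver, so the driver must be
  // network addressable from the worker nodes. Pinning the port makes it
  // possible to write firewall rules for it.
  .set("spark.driver.port", "35000")                 // placeholder port
  .set("spark.driver.host", "driver-host.internal")  // placeholder resolvable host name
// The master URL is assumed to come from spark-submit; the driver then listens
// on the fixed port above for its executors throughout the application's lifetime.
```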

Cluster Manager Types

The system currently supports several cluster managers:

  • Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
  • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN – the resource manager in Hadoop 2.
  • Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.

Spark currently supports the following cluster managers directly (a master URL sketch follows this list):

  • Standalone - a simple cluster manager that ships with Spark, which makes setting up a cluster very easy.
  • Apache Mesos - a general-purpose cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN - the resource manager in Hadoop 2.x.
  • Kubernetes - an open-source system for automating the deployment, scaling, and management of containerized applications.
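The cluster manager is chosen purely by the master URL handed to Spark. The following sketch lists the URL forms for the managers above; the host names and ports are placeholders.

```scala
import org.apache.spark.SparkConf

// Each master URL below selects a different cluster manager; only the URL changes,
// the application code stays the same.
val standalone = new SparkConf().setMaster("spark://master-host:7077")          // Spark standalone
val mesos      = new SparkConf().setMaster("mesos://mesos-master:5050")         // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn")                              // Hadoop YARN (cluster located via HADOOP_CONF_DIR)
val k8s        = new SparkConf().setMaster("k8s://https://k8s-apiserver:6443")  // Kubernetes
val local      = new SparkConf().setMaster("local[*]")                          // local mode, for testing (no cluster manager)
```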

Submitting Applications

Applications can be submitted to a cluster of any type using the spark-submit script. The application submission guide describes how to do this.
Using the spark-submit script, we can submit an application to a cluster of any type. The application submission guide describes exactly how to do this.
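spark-submit is the standard entry point, but for completeness here is a hedged sketch of submitting from JVM code with org.apache.spark.launcher.SparkLauncher, which ships with Spark. The jar path, main class, and settings are made-up placeholders.

```scala
import org.apache.spark.launcher.SparkLauncher

// Launches an application roughly the way spark-submit would; returns a
// SparkAppHandle that can be polled for the application's state.
val handle = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")   // placeholder application jar
  .setMainClass("com.example.MyApp")       // placeholder main class
  .setMaster("yarn")
  .setDeployMode("cluster")                // driver is launched inside the cluster
  .setConf("spark.executor.memory", "2g")
  .startApplication()

println(handle.getState)                   // e.g. SUBMITTED, RUNNING, FINISHED
```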

Monitoring

Each driver program has a web UI, typically on port 4040, that displays information about running tasks, executors, and storage usage. Simply go to http://<driver-node>:4040 in a web browser to access this UI. The monitoring guide also describes other monitoring options.
Each driver has a web UI, typically on port 4040, where you can see information about running tasks, executors, storage usage, and so on. We can open http://<driver-node>:4040 in a browser to reach this UI. The monitoring guide describes the other monitoring options in detail.
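If port 4040 is already taken, Spark normally just tries the next ports (4041, 4042, ...), but the UI port can also be set explicitly. A tiny sketch, with 4050 as an arbitrary example value:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ui-on-custom-port")
  .set("spark.ui.port", "4050")   // the driver UI is then served at http://<driver-node>:4050
```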

Job Scheduling

Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within applications (if multiple computations are happening on the same SparkContext). The job scheduling overview describes this in more detail.
Spark provides control over resource allocation both across applications (at the cluster manager level) and within an application (when multiple computations run on the same SparkContext). The job scheduling overview describes this feature in more detail.
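As a small illustration of within-application scheduling, the sketch below enables the FAIR scheduler; the pool name and the local master are placeholder choices for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-sketch")
  .setMaster("local[4]")                  // placeholder; any cluster manager works
  .set("spark.scheduler.mode", "FAIR")    // default is FIFO

val sc = new SparkContext(conf)

// Jobs submitted from this thread are tagged with the named pool, so concurrent
// jobs launched from different threads share the application's executors fairly.
sc.setLocalProperty("spark.scheduler.pool", "reporting")
```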

Glossary

Term: Meaning
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
Term: Meaning
Application: A user program built on Spark, consisting of one driver program and multiple executors on the cluster.
Application jar: A jar containing the user's Spark application. In some scenarios you may want to create an "uber jar" that contains the application together with its dependencies. The user's jar should not include the Hadoop or Spark libraries; those libraries are added at runtime.
Driver program: The process that runs the application's main() function and creates the SparkContext inside that main function.
Cluster manager: An external service used to acquire cluster resources (for example: standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside the cluster, i.e. the driver starts locally: whichever machine you submit from is the machine the driver program runs on.
Worker node: Any node in the cluster that can run application code.
Executor: A process launched for an application on a worker node; it can run multiple tasks and keep data in memory or on disk storage. Each Spark application has its own set of executors.
Task: A unit of work sent to one executor. Each executor can run multiple tasks.
Job: A parallel computation made up of multiple tasks, triggered by a Spark action. When a Spark action (such as save or collect) is triggered, a job containing many parallel tasks is generated; you will see this term in the driver's logs. Put simply, whenever a Spark program triggers an action, that is a job.
Stage: Each job is divided into smaller sets of tasks called stages, which depend on each other (similar to the map and reduce stages in MapReduce); you will see this term in the driver's logs.

To sum up: a Spark application consists of one driver plus multiple executors. The driver is the process that runs our Spark application's main method and creates the SparkContext inside it. The cluster manager is an external service used to acquire cluster resources. Deploy mode comes in two flavors, client and cluster: in client mode the driver runs locally, while in cluster mode the driver runs inside the cluster. A worker node is any node that can run executor processes on it; on YARN, that means containers running on the NodeManagers. An executor is a process that can run multiple tasks and store data in memory or on disk, and each Spark application has its own set of executors. A job is created whenever a Spark application encounters an action; a job contains many tasks, the task being the smallest unit of computation, and tasks are sent to the executors to run.
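A short sketch tying these glossary terms together; the data and the local master below are placeholders used only to show when a job is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("glossary-sketch").setMaster("local[2]"))

val words = sc.parallelize(Seq("spark", "driver", "executor", "spark"))

// Transformations only build up the lineage; no job exists yet.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// collect() is an action, so it triggers one job. The shuffle introduced by
// reduceByKey splits that job into two stages, and each stage runs as tasks
// on the executors (threads inside the local[2] process in this sketch).
counts.collect().foreach(println)

sc.stop()
```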

Origin blog.csdn.net/liweihope/article/details/91293267