In-depth understanding of the Spark framework 3: operating architecture and the core data set RDD

Preface


Since the Spark framework is mostly built on top of the Hadoop system, understanding the core operating principles of Spark requires a working knowledge of Hadoop. You can first review the Hadoop system through the blog "From Hadoop 1.0 to Hadoop 2.0: architecture optimization and development exploration", and then come back to the Spark framework with more context.

I originally wanted to cover the reasons for Spark's development, its advantages and disadvantages, its ecosystem, and its operating architecture and principles in a single article, but it grew too long, so it was split into two earlier articles:

Part 1: In-depth understanding of the Spark framework 1: reasons for its development, advantages and disadvantages

Part 2: In-depth understanding of the Spark framework 2: the ecosystem


1. Spark cluster architecture

Spark's architecture diagram:

  • Application: a Spark application written by the user, consisting of the Driver code and the Executor code distributed across multiple nodes in the cluster.
  • Client program: the client from which the user submits the application.
  • Driver: runs the main function of the Application and creates the SparkContext.
  • SparkContext: the application context, which controls the entire life cycle. It is responsible for communicating with the Cluster Manager, applying for resources, and allocating and monitoring tasks. When the Executors finish running, the Driver is also responsible for closing the SparkContext.
  • Cluster Manager: the external service that acquires resources on the cluster. There are currently three types:

1) Standalone: Spark's native resource management, where the Master is responsible for resource allocation; in other words, Standalone is Spark's built-in resource manager.

2) Apache Mesos: a resource scheduling framework with good compatibility with Hadoop MapReduce.

3) Hadoop YARN: mainly refers to the ResourceManager in YARN.

  • Spark Worker: any node in the cluster that can run Application code; it runs one or more Executor processes.
  • Executor: a process that runs on a Worker node and is responsible for running Tasks. The Executor starts a thread pool to run the Tasks and is responsible for storing data in memory or on disk. Each Application applies for its own Executors to process its tasks.
  • Task: The unit of work running on Executor.
  • Job: A Job contains multiple RDDs and various operations on the corresponding RDDs.
  • Stage: the basic scheduling unit of a Job. A Job is divided into multiple groups of Tasks; each group of Tasks is called a Stage, or TaskSet, and represents a set of related tasks with no Shuffle dependencies between them.
  • RDD: short for Resilient Distributed Dataset, the core module and class of Spark.
  • DAGScheduler: Build a Stage-based DAG according to the Job, and submit the Stage to TaskScheduler

An Application is composed of a Driver and several Jobs, a Job is composed of multiple Stages, and a Stage is composed of multiple Tasks that have no Shuffle relationship.

When an Application is executed, the Driver applies for resources from the cluster manager, starts the Executors, and sends the application code and files to them; the Tasks are then executed on the Executors. After execution finishes, the results are returned to the Driver or written to HDFS or another storage system.
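To make these roles concrete, here is a minimal sketch of a complete Application in Scala; the object name, the sample data, and the local[*] master are illustrative choices, not taken from this article. The reduceByKey step introduces a Shuffle, so the single Job is split into two Stages, each consisting of Tasks that run in the Executors.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of an Application: main() below is the Driver code.
    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        // Creating the SparkContext registers the Application with the cluster manager.
        val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Transformations only build the RDD lineage; nothing runs yet.
        val lines = sc.parallelize(Seq("spark driver executor", "spark task stage"))
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)      // Shuffle boundary: the Job is split into two Stages here

        // The action submits a Job; its Tasks run in the Executors and the results return to the Driver.
        counts.collect().foreach(println)

        sc.stop()
      }
    }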

Compared with the Hadoop MapReduce computing framework, the Executor used by Spark has two advantages:

  1. It uses multithreading to execute specific tasks, which reduces task startup overhead;
  2. It has a BlockManager storage module inside the Executor that uses both memory and disk as storage devices, effectively reducing IO overhead (see the sketch below).
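As a hedged illustration of the second point, an RDD that is reused across Jobs can be kept in the Executors' BlockManager through persist; the sample data and variable names are placeholders, and sc is assumed to be an existing SparkContext.

    import org.apache.spark.storage.StorageLevel

    // Sketch only: the data below stands in for a real log file.
    val logs = sc.parallelize(Seq("INFO start", "ERROR timeout", "ERROR disk full", "INFO done"))
    val errors = logs.filter(_.startsWith("ERROR"))

    // Cache the filtered RDD in the Executors' BlockManager (memory first, spilling to disk),
    // so later Jobs reuse the cached blocks instead of recomputing them.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    println(errors.count())                                // first Job: computes and caches the RDD
    println(errors.filter(_.contains("timeout")).count())  // second Job: reads the cached blocks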

The overall flow in one figure:

Operation process:

2. Spark operating mode

The main operating environments are:

  • Local (local mode): often used for local development and testing; it is further divided into local single-threaded and local-cluster multi-threaded modes.
  • Standalone (cluster mode): the typical Master/Slave mode, in which the Master is a single point of failure; Spark supports ZooKeeper-based HA to address this.
  • On YARN (cluster mode): runs on the YARN resource management framework; YARN is responsible for resource management, while Spark is responsible for task scheduling and computation.
  • On Mesos (cluster mode): runs on the Mesos resource management framework; Mesos is responsible for resource management, while Spark is responsible for task scheduling and computation.
  • On cloud (cluster mode): for example AWS EC2; this mode makes it easy to access Amazon S3. Spark supports multiple distributed storage systems, such as HDFS and S3.

Among them, Mesos and YARN mode are similar. Currently, Standalone mode and YARN mode are more commonly used.
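In practice the mode is selected through the master URL passed to SparkConf (or to spark-submit). The following sketch only illustrates the URL formats; the host names and ports are placeholders, not addresses taken from this article.

    import org.apache.spark.{SparkConf, SparkContext}

    // The master URL decides which operating mode is used.
    val conf = new SparkConf()
      .setAppName("ModeDemo")
      .setMaster("local[*]")                     // Local mode: one worker thread per CPU core
    // .setMaster("spark://master:7077")         // Standalone cluster (assumed Master address)
    // .setMaster("yarn")                        // On YARN (client or cluster mode chosen at submit time)
    // .setMaster("mesos://mesos-master:5050")   // On Mesos (assumed Mesos master address)

    val sc = new SparkContext(conf)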

1. Standalone operation mode

Standalone mode is the resource scheduling framework implemented by Spark itself. Its main nodes are the Client node, the Master node, and the Worker nodes. The Driver can run either on the Master node or on the local Client. When a Spark job is submitted with the spark-shell interactive tool, the Driver runs on the Master node; when a job is submitted with the spark-submit tool, or run from a development environment such as Eclipse or IDEA using "new SparkConf().setMaster", the Driver runs on the local Client.

(1) First, SparkContext connects to Master, registers with Master and applies for resources

(2) The Worker periodically sends heartbeat information to the Master and reports the Executor status.

(3) Based on the resource requests from the SparkContext and the information reported in the Workers' heartbeats, the Master decides on which Worker to allocate resources, acquires the resources on that Worker, and starts the StandaloneExecutorBackend.

(4) The StandaloneExecutorBackend registers with the SparkContext.

(5) The SparkContext sends the Application code to the StandaloneExecutorBackend. The SparkContext then parses the Application code and builds the DAG graph, which is submitted to the DAGScheduler and decomposed into Stages (each Action operation spawns a Job, and each Job contains one or more Stages). The Stages (i.e. TaskSets) are submitted to the TaskScheduler, which assigns the Tasks to the appropriate Workers, where they are finally handed to the StandaloneExecutorBackend to run.

(6) The StandaloneExecutorBackend creates an Executor thread pool, starts executing the Tasks, and reports to the SparkContext until the Tasks are completed.

(7) After all tasks are completed, SparkContext logs off from the Master and releases resources.

If you want to dig deeper, you can look at the underlying Scala source code, for example StandaloneSchedulerBackend.start:

    val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
    val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)

    val initialExecutorLimit =
      if (Utils.isDynamicAllocationEnabled(conf)) {
        Some(0)
      } else {
        None
      }
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    // create the AppClient
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    // start the AppClient
    client.start()

This article outlines the principle of operation.

2. Spark on YARN

Spark on YARN is divided into two modes according to where the Driver sits in the cluster: YARN-Client mode (client mode) and YARN-Cluster mode (cluster mode).

In YARN mode there is no need to start a standalone Spark cluster, so http://master:8080 cannot be visited at this time. The Spark shell can be started in YARN client mode with:

bin/spark-shell --master yarn-client

while trying to start it in YARN cluster mode reports an error:

bin/spark-shell --master yarn-cluster

The reason is that the two operating procedures are different.

In cluster mode, the Driver runs inside the Application Master, and the Application Master process is also responsible for driving the Application and applying to YARN for resources. This process runs in a YARN Container, so the Client that launches the Application Master can be shut down immediately instead of staying alive for the application's entire life cycle.

Figure: YARN-Cluster mode operation process:

(1) The client generates job information and submits it to ResourceManager.

(2) ResourceManager starts the Container on a NodeManager (determined by YARN) and assigns the Application Master to the NodeManager.

(3) The NodeManager receives the resource allocation from the ResourceManager, which allocates the resources while notifying the other NodeManagers to start the corresponding Executors.

(4) The Application Master registers with and reports to the ResourceManager and completes the corresponding tasks.

(5) The Executors register with and report to the Application Master on the NodeManager and complete the corresponding tasks.

Figure: YARN-Client mode job running process:

In this mode, the Application Master only applies to YARN for the Executors' resources, after which the Client communicates with the Containers to schedule the jobs.

(1) The Client generates the job information and submits it to the ResourceManager.

(2) The ResourceManager starts a Container on the local NodeManager and assigns the Application Master to that NodeManager.

(3) The NodeManager receives the allocation from the ResourceManager and starts the Application Master, which initializes the job; in client mode, the Driver itself runs on the Client side.

(4) The Application Master applies to the ResourceManager for resources, and the ResourceManager allocates resources and notifies other NodeManagers to start the corresponding Executors.

(5) The Executor registers and reports to the locally started Application Master and completes the corresponding tasks.

Comparing the job running processes of the two modes: in YARN-Cluster mode the Spark Driver runs inside the Application Master (AM), which applies for resources from YARN and monitors the running status of the job. After the user submits the job, the Client can be closed and the job keeps running on YARN, so YARN-Cluster mode is not suitable for interactive jobs. In YARN-Client mode, the AM only requests Executors from YARN, and the Client communicates with the requested Containers to schedule their work, which means the Client cannot leave.

In summary, in cluster mode the Spark Driver runs in the AM, while in client mode the Spark Driver runs on the client. Therefore, YARN-Cluster is suitable for production, and YARN-Client is suitable for interaction and debugging, that is, for when you want to see the application's output quickly.

3. Spark core data set: RDD

RDD (Resilient Distributed Dataset) is the most important concept in Spark. An RDD can be simply understood as a data set that provides many operating interfaces. Unlike an ordinary data set, its actual data is stored distributed across a batch of machines (in memory or on disk); the partitions here can be loosely compared to files in Hadoop HDFS.

Why RDD is called a resilient distributed dataset:

1. Elasticity 1: data storage automatically switches between memory and disk;
2. Elasticity 2: efficient fault tolerance based on Lineage (if the nth node fails, it is recovered from the (n-1)th node through the lineage);
3. Elasticity 3: if a Task fails, it is automatically retried a specific number of times (4 by default);
4. Elasticity 4: if a Stage fails, it is automatically retried a specific number of times, and only the failed Stage is rerun, recomputing only the data shards whose computation failed;
5. Elasticity 5: checkpoint and persist (see the sketch after this list);
6. Data scheduling elasticity: DAG and Task scheduling are independent of resource management;
7. Data sharding elasticity: the sharding can be adjusted manually and freely, e.g. with repartition.
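A hedged sketch of the checkpoint, persist, and repartition points above, assuming an existing SparkContext named sc; the checkpoint directory and the numbers are placeholders.

    import org.apache.spark.storage.StorageLevel

    // Sketch only.
    sc.setCheckpointDir("/tmp/spark-checkpoints")      // where checkpointed RDDs are materialized

    val parsed = sc.parallelize(1 to 1000).map(n => (n % 10, n))

    parsed.persist(StorageLevel.MEMORY_AND_DISK)       // elasticity: memory first, spill to disk if needed
    parsed.checkpoint()                                // cut the lineage by writing to reliable storage
    parsed.count()                                     // the first action triggers both the caching and the checkpoint

    val rebalanced = parsed.repartition(8)             // elasticity: manually change the number of data shards
    println(rebalanced.getNumPartitions)               // 8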

Suppose we define an RDD named "myRDD". This data set is divided into multiple partitions; each partition may actually be stored on a different machine, and may reside either in memory or on disk (e.g. HDFS).
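A minimal sketch of such an RDD, again assuming a live SparkContext named sc; the name myRDD and the partition count are only illustrative.

    // Build "myRDD" with an explicit number of partitions; Spark distributes the
    // partitions across the Executors (or across local threads in local mode).
    val myRDD = sc.parallelize(1 to 100, numSlices = 4)

    println(myRDD.getNumPartitions)                               // 4
    println(myRDD.glom().map(_.length).collect().mkString(", "))  // number of elements held in each partition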

RDDs have a fault-tolerance mechanism, and they are read-only and cannot be modified; transformation operations can be performed on them to create new RDDs. Specifically, an RDD has the following properties.

  • Read-only: It cannot be modified, and a new RDD can only be generated through the conversion operation.
  • Distributed: It can be distributed on multiple machines for parallel processing.
  • Resilience: It will exchange data with the disk when the memory is insufficient during the calculation process.
  • Memory-based: It can be cached in memory in whole or in part, and reused between multiple calculations.

RDD is essentially a more general iterative parallel computing framework. Users can explicitly control the intermediate results of a computation and then reuse them freely in subsequent computations.

There are many iterative algorithms in the actual application development of big data, such as machine learning, graph algorithms, etc., and interactive data mining tools. What these application scenarios have in common is that intermediate results are reused between different calculation stages, that is, the output result of one stage will be used as the input of the next stage.

RDD is designed to meet this need. Although MapReduce has the advantages of automatic fault tolerance, load balancing, and scalability, its biggest disadvantage is the use of an acyclic data flow model, which requires a large number of disk I/O operations during iterative calculations.

By using RDDs, users do not need to worry about the distributed nature of the underlying data; they only need to express the specific application logic as a series of transformations, which can then be pipelined. This avoids storing intermediate results and greatly reduces data replication, disk I/O, and data serialization overhead.

RDD operations are divided into transformation (Transformation) operations and action (Action) operations. A transformation generates a new RDD from an existing RDD, while an action performs the actual computation.

1. Build operations

The calculations in Spark are all done by manipulating RDDs. The first problem in learning RDDs is how to build RDDs. The way to build RDDs is divided into the following two categories from the perspective of data sources.

  • Read data directly from memory.
  • Read data from a file system; there are many kinds of file systems, the common ones being HDFS and the local file system.

The first method constructs an RDD from memory using the makeRDD method, as shown below.

val rdd01 = sc.makeRDD(List(1, 2, 3, 4, 5, 6))

This statement creates an RDD consisting of six elements "1,2,3,4,5,6".

The second method is to construct RDD through the file system, the code is shown below.

import org.apache.spark.rdd.RDD
val rdd: RDD[String] = sc.textFile("file:///D:/sparkdata.txt", 1)

This example uses the local file system, so the file path protocol prefix is file://.

2. Conversion operation

A transformation on an RDD is an operation that returns a new RDD. Transformed RDDs are evaluated lazily and are only computed when they are used in an action.

Many transformation operations are for each element, that is, these transformation operations will only operate on one element in the RDD at a time, but not all transformation operations are like this.

Table 1: RDD transformation operations (rdd1 = {1, 2, 3, 3}, rdd2 = {3, 4, 5})
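As a hedged sketch of some common transformations on these two example RDDs (a representative subset rather than the full table; sc is assumed to be a live SparkContext):

    val rdd1 = sc.parallelize(List(1, 2, 3, 3))
    val rdd2 = sc.parallelize(List(3, 4, 5))

    val mapped     = rdd1.map(_ + 1)           // {2, 3, 4, 4}
    val filtered   = rdd1.filter(_ != 1)       // {2, 3, 3}
    val deduped    = rdd1.distinct()           // {1, 2, 3}
    val unioned    = rdd1.union(rdd2)          // {1, 2, 3, 3, 3, 4, 5}
    val inter      = rdd1.intersection(rdd2)   // {3}
    val subtracted = rdd1.subtract(rdd2)       // {1, 2}
    val pairs      = rdd1.cartesian(rdd2)      // all (x, y) pairs such as (1,3), (1,4), ...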

3. Actions

Action operations are used to perform calculations and output the results in a specified way. The action operation accepts RDD, but returns non-RDD, that is, outputs a value or result. During the execution of the RDD, the actual calculation takes place in the action operation. Table 2 describes the commonly used RDD actions.

Table 2: RDD action operations (rdd = {1, 2, 3, 3})
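A hedged sketch of some commonly used actions on this example RDD (again a representative subset, with sc assumed to be a live SparkContext):

    val rdd = sc.parallelize(List(1, 2, 3, 3))

    rdd.collect()          // Array(1, 2, 3, 3): returns all elements to the Driver
    rdd.count()            // 4: the number of elements
    rdd.first()            // 1: the first element
    rdd.take(2)            // Array(1, 2): the first two elements
    rdd.reduce(_ + _)      // 9: aggregates the elements
    rdd.countByValue()     // Map(1 -> 1, 2 -> 1, 3 -> 2): occurrences of each value
    rdd.foreach(println)   // runs on the Executors, printing each element there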

RDD operations are lazy: when a transformation is executed, no actual computation happens; only when an action is executed is a computation job submitted and the corresponding calculation performed.
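A small sketch of this laziness, assuming sc is a live SparkContext; toDebugString only prints the lineage that has been recorded, and nothing is computed until the action runs.

    // Building the lineage: both lines return immediately, nothing is computed yet.
    val numbers = sc.parallelize(1 to 1000000)
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)

    println(evenSquares.toDebugString)            // shows the recorded chain of transformations

    // Only this action triggers a Job and runs the whole pipeline on the Executors.
    println(evenSquares.take(5).mkString(", "))   // 4, 16, 36, 64, 100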

Features of RDD

  • It is an immutable, partitioned collection of objects distributed across cluster nodes;
  • It is created through parallel transformations such as map, filter, and join;
  • It is automatically rebuilt on failure;
  • Its storage level (memory, disk, etc.) can be controlled for reuse;
  • It must be serializable;
  • An RDD can only be created from persistent storage or through transformation operations. Compared with distributed shared memory (DSM), this makes fault tolerance more efficient: a lost data partition only needs to be recomputed from its lineage, without requiring a dedicated checkpoint;
  • The data partitioning of RDDs can improve performance through data locality, just as in Hadoop MapReduce;
  • RDDs are serializable and can automatically degrade to disk storage when memory is insufficient; performance then drops sharply, but it is still no worse than today's MapReduce.

 



 
