Big data foundation: Spark working principle and basic concepts

Introduction | Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is widely used in data mining and machine learning, and it has grown into a rapidly evolving, widely adopted ecosystem. This article gives a detailed introduction to the core technical principles of Spark. Author: Feng Xiong, Tencent Big Data R&D engineer.


1. Spark introduction and ecosystem

Spark is an open-source, general-purpose distributed parallel computing framework originally developed at UC Berkeley's AMP Lab, and it is now a top-level project of the Apache Software Foundation. The reasons for learning Spark can be summarized as follows:


1. Spark's advantages over Hadoop

(1) High performance

Spark retains all the advantages of Hadoop MapReduce while avoiding one of its main costs: Hadoop MapReduce writes the intermediate results of every computation to HDFS, whereas Spark can keep intermediate results in memory for subsequent processing.

(2) High fault tolerance

  • Data recovery based on lineage: Spark introduces the Resilient Distributed Dataset (RDD), an abstraction for a read-only collection of data partitioned across a group of nodes. These collections are resilient and carry dependency information, so if part of a dataset is lost, it can be recomputed from its lineage.

  • Checkpoint-based fault tolerance: RDD computations can also be protected through checkpointing. In general there are two approaches to fault tolerance: storing redundant data and logging update operations. Calling an RDD's checkpoint method (doCheckpoint internally) corresponds to the first approach, persisting a redundant copy of the data, while lineage corresponds to the second, recording coarse-grained update operations. Checkpointing complements lineage-based recovery and avoids the high recovery cost of an overly long lineage chain (see the sketch below).
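
For illustration, a minimal checkpointing sketch in Scala, assuming an existing SparkContext named sc and placeholder HDFS paths:

    // Assumes an existing SparkContext `sc`; the paths below are hypothetical.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val raw     = sc.textFile("hdfs:///data/events.log")
    val cleaned = raw.filter(_.nonEmpty).map(_.toLowerCase)

    // Truncate the lineage here: the next action persists `cleaned` to the
    // checkpoint directory, so recovery no longer replays the full chain.
    cleaned.checkpoint()
    cleaned.count()   // checkpointing happens when the first action runs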

(3) The versatility of Spark

Spark is a general-purpose big data computing framework that supports a richer set of usage scenarios than Hadoop.


Compared with Hadoop MapReduce's two operations (map and reduce), Spark provides a much richer set of operators, divided into actions (collect, reduce, save, ...) and transformations (map, union, join, filter, ...). In its inter-node communication model, Spark also goes beyond Hadoop's shuffle with explicit partitioning, control over where intermediate results are stored, materialized (cached) datasets, and so on.
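
As a rough illustration of this richer operator set, a small Scala sketch with made-up data, assuming an existing SparkContext named sc:

    // Two small RDDs of (userId, value) pairs -- illustrative data only.
    val purchases = sc.parallelize(Seq((1, 20.0), (2, 35.5), (1, 12.0)))
    val names     = sc.parallelize(Seq((1, "alice"), (2, "bob")))

    // Transformations: filter, join, map (each lazily builds a new RDD).
    val bigSpenders = purchases
      .filter { case (_, amount) => amount > 15.0 }
      .join(names)                                   // shuffle: join by key
      .map { case (id, (amount, name)) => s"$name spent $amount" }

    // Actions: collect returns results to the driver, count tallies them.
    bigSpenders.collect().foreach(println)
    println(bigSpenders.count())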

2. Introduction to the Spark ecosystem

Spark supports multiple programming languages, including Java, Python, R, and Scala. At the resource-scheduling layer, it supports local, standalone, YARN, and Kubernetes (k8s) modes.

Spark also provides components for different application scenarios. On top of Spark Core, it offers Spark Streaming, Spark SQL, Spark MLlib, SparkR, GraphX, and other components.

Spark Streaming is used for real-time stream processing, Spark SQL combines familiar SQL queries with more complex algorithmic analysis, GraphX is used for graph computation, Spark MLlib is used for machine learning, and SparkR brings Spark's data processing to the R language.

Spark supports a variety of storage systems. At the storage layer, it can read and write data on HDFS, Hive, AWS S3, and similar systems, as well as large-scale stores such as HBase and Elasticsearch, and relational databases such as MySQL and PostgreSQL. For real-time stream computing, data can be ingested from sources such as Flume and Kafka.

Spark also supports a rich set of data formats, including common formats such as text, JSON, and CSV, as well as Parquet, ORC, and Avro, which offer clear advantages in data compression and large-scale query performance.
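
For example, reading and writing a few of these formats through the DataFrame API might look like the following sketch, assuming an existing SparkSession named spark and hypothetical paths:

    // Assumes an existing SparkSession `spark`; all paths are placeholders.
    val csvDf = spark.read
      .option("header", "true")        // first line contains column names
      .option("inferSchema", "true")   // let Spark guess column types
      .csv("/data/input/users.csv")

    val jsonDf = spark.read.json("/data/input/events.json")

    // Columnar formats such as Parquet and ORC compress well and scan efficiently.
    csvDf.write.mode("overwrite").parquet("/data/output/users.parquet")
    jsonDf.write.mode("overwrite").orc("/data/output/events.orc")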

2. Spark principles and characteristics

1. Spark Core

Spark Core is the foundation of the framework and includes the following parts:

(1) Spark basic configuration

SparkContext is the entry point of a Spark application; submitting and executing an application always goes through a SparkContext. It hides the details of network communication, distributed deployment, messaging, storage, and computation, so developers only need to program against the APIs exposed through SparkContext.
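
A minimal sketch of creating a SparkContext; the application name and local master URL are illustrative placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // Local master with 2 threads for illustration; in a cluster this would
        // be set by spark-submit or point at the cluster manager instead.
        val conf = new SparkConf()
          .setAppName("my-spark-app")
          .setMaster("local[2]")

        val sc = new SparkContext(conf)   // entry point for the whole application

        val numbers = sc.parallelize(1 to 100)
        println(numbers.sum())            // action: triggers the computation

        sc.stop()                         // release cluster resources
      }
    }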

Spark's RPC layer is implemented on top of Netty and supports both synchronous and asynchronous calls. The event bus is mainly used for message exchange among SparkContext components; it follows the listener pattern and uses asynchronous calls. The metrics system is mainly used for monitoring the running system.

(2) Spark storage system

The storage system manages how and where the data Spark depends on at runtime is stored. It prefers to keep data in memory on each node and spills to disk when memory is insufficient, which is an important reason for Spark's high computing performance.

We can flexibly control whether data is stored in memory or on disk, and results can be written over the network to remote storage such as HDFS or HBase.
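
For example, an RDD can be persisted with a storage level that spills to disk when memory runs out. A sketch assuming an existing SparkContext named sc and placeholder HDFS paths:

    import org.apache.spark.storage.StorageLevel

    // Keep partitions in memory when possible, spill the rest to local disk.
    val errors = sc.textFile("hdfs:///data/app.log")
      .filter(_.contains("ERROR"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(errors.count())                         // first action materializes the cache
    errors.saveAsTextFile("hdfs:///data/errors-out") // reuses the cached data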


(3) Spark scheduling system

The Spark scheduling system mainly consists of the DAGScheduler and the TaskScheduler.

The DAGScheduler divides a job into multiple stages according to the dependencies between RDDs. Each stage is abstracted as a TaskSet composed of one or more tasks and handed to the TaskScheduler for further scheduling; the TaskScheduler is responsible for scheduling each individual task.


The available scheduling policies are FIFO and FAIR:

  • FIFO scheduling: first in, first out; this is Spark's default scheduling mode.

  • FAIR scheduling: supports grouping jobs into pools and assigning each pool a different scheduling weight; jobs are then ordered for execution according to those weights (see the sketch below).
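
A configuration sketch for FAIR scheduling; the pool name is a made-up placeholder, and the pool weights themselves would be defined in an allocation file referenced by spark.scheduler.allocation.file:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduling-demo")
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR")               // default is FIFO
      // .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the "reporting" pool (placeholder name).
    sc.setLocalProperty("spark.scheduler.pool", "reporting")
    sc.parallelize(1 to 1000).map(_ * 2).count()

    // Unset the property to fall back to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)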

2. Spark SQL

Spark SQL provides SQL-based data processing, making it easier to work with distributed datasets; this is an important reason for Spark's wide adoption.

Today, whether a big data computing engine supports SQL is an important evaluation criterion, since SQL lowers the barrier to entry for users. Spark SQL provides two abstractions over distributed data: DataFrame and Dataset.

A DataFrame is Spark SQL's abstraction of structured data; it can be thought of as a table in Spark. Compared with an RDD, it carries additional table structure information (a schema): DataFrame = data + schema.

An RDD is a distributed collection of objects, while a DataFrame is a distributed collection of Row objects; the DataFrame provides a richer set of operators than the RDD and also improves execution efficiency.

A Dataset is a distributed collection of data that combines the strong typing of RDDs with the optimized execution of Spark SQL. A Dataset can be constructed from JVM objects and then manipulated with functional operators such as map, filter, and flatMap.
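
A brief sketch of the two abstractions in Scala; the case class and column names are made up for illustration:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder()
      .appName("sql-demo")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // DataFrame: rows plus a schema, queried with column expressions or SQL.
    val df = Seq(("alice", 29), ("bob", 35)).toDF("name", "age")
    df.filter($"age" > 30).show()

    // Dataset: the same data, but strongly typed as Person objects.
    val ds = df.as[Person]
    ds.filter(_.age > 30).map(_.name).show()

    spark.stop()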

3. Spark Streaming

This module processes streaming data, supporting scalable and fault-tolerant stream processing, and integrates with established data sources such as Flume and Kafka. Spark Streaming is also built on the RDD abstraction, which makes writing applications for streaming data more convenient.
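
A classic word-count sketch using the DStream API, assuming a text source on localhost port 9999 (for example, one started with nc -lk 9999):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // At least 2 local threads: one receives data, the others process it.
        val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print()       // print each batch's counts to stdout

        ssc.start()
        ssc.awaitTermination()
      }
    }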

4. Spark features

(1) Fast computation

Spark organizes each job into a DAG for computation, and the internal computation runs over resilient distributed datasets (RDDs); for some in-memory workloads this can be up to 100 times faster than Hadoop MapReduce.

(2) Easy to use

Spark provides a large number of operators; developers only need to call the relevant APIs and do not need to worry about the underlying implementation.

General big data solution


Previously, offline jobs were implemented with MapReduce and real-time jobs with Storm; today both can be implemented with Spark, reducing development cost. Spark also lowers the learning curve through Spark SQL, and additionally provides machine learning and graph computing engines.

(3) Support for multiple resource management modes

Local mode can be used for debugging while learning and developing, and standalone, YARN, and other modes are available for production, so users can choose the resource management mode that fits their environment.

(4) Community support

Spark has a rich ecosystem and fast iterations, making it an indispensable computing engine in the big data field.


3. Spark operating modes and cluster roles

1. Spark operating modes

Operating mode | Run type | Description
local | Local mode | Often used for local development and testing; divided into local single-threaded and local-cluster multi-threaded modes
standalone | Cluster mode | Independent mode, running on Spark's own resource scheduling framework, which adopts a master/slave structure
yarn | Cluster mode | Runs on the YARN resource management framework; YARN is responsible for resource management, while Spark is responsible for task scheduling and computation
mesos | Cluster mode | Runs on the Mesos resource management framework; Mesos is responsible for resource management, while Spark is responsible for task scheduling and computation
k8s | Cluster mode | Runs on Kubernetes


2. Spark cluster roles

Spark's cluster roles consist of five main parts: the cluster manager, worker nodes, executors, the driver, and the application. The characteristics of each part are described in detail below.

(1) Cluster Manager

The cluster manager, which runs in the Master process, is mainly responsible for managing the resources requested by applications. Depending on the deployment mode, it can be local, standalone, YARN, Mesos, or another mode.

(2) Worker

Workers are Spark's worker nodes, which execute submitted tasks. Their main responsibilities are as follows:

  • Worker nodes report their CPU, memory, and other resource information to the cluster manager through the registration mechanism.

  • Under the direction of the Spark master, a worker node creates and starts executors, which are the actual computing units.

  • The Spark master assigns tasks to the executors on worker nodes to execute the application.

  • The worker nodes synchronize resource information and executor status information to the cluster manager.

In YARN mode, worker nodes generally refer to NodeManager nodes; in standalone mode, they generally refer to slave nodes.

(3) Executor

The executor is the component that actually performs computing tasks; it is a process launched for the application on a worker node. This process runs tasks, can keep data in memory or on disk, and can return result data to the driver.

(4) Application

An application is a program written against the Spark API. It includes the code that implements the driver functionality and the code to be executed on each executor. An application is composed of multiple jobs, and its entry point is the user-defined main method.

(5) Driver

The driver is the process that runs the application's main function and creates the SparkContext. The application communicates with the cluster manager and the executors through the driver. The driver can run on the node where the application is submitted, or the application can hand it to the cluster manager, which then arranges for a worker to run it.


The Driver node is also responsible for submitting Jobs, converting Jobs into Tasks, and coordinating Task scheduling among various Executor processes.

(6) SparkContext

SparkContext is the most important object in a Spark application and the main entry point for all Spark functionality. Its core role is to initialize the components the application needs and to register the application with the master.

3. Other core Spark concepts

(1) RDD

The RDD (Resilient Distributed Dataset) is the most important concept in Spark: a fault-tolerant collection of elements that can be operated on in parallel and the basic abstraction for all data processing in Spark. RDDs are manipulated through a series of operators, which fall into two categories: transformations and actions.

  • Transformation: produces a new RDD by wrapping an existing RDD. Transformations use a lazy computation mechanism; the result is not computed immediately. Common transformations include map, filter, and flatMap.

  • Action: performs a computation on an RDD to produce a result, which is returned to the driver or written to external storage. Common actions include reduce, collect, and saveAsTextFile (see the sketch below).
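
A small sketch of this lazy-transformation-then-action behavior, assuming an existing SparkContext named sc:

    // Nothing is computed yet: transformations only record the lineage.
    val numbers = sc.parallelize(1 to 10)
    val doubled = numbers.map(_ * 2)          // transformation
    val evens   = doubled.filter(_ % 4 == 0)  // transformation

    // The action triggers the whole chain and returns results to the driver.
    val result = evens.collect()
    println(result.mkString(", "))            // 4, 8, 12, 16, 20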

(2) DAG

A DAG is a directed acyclic graph. In Spark, the DAG describes the computation logic of a job, and scheduling around it is mainly divided between the DAG Scheduler and the Task Scheduler.

(Figure source: https://blog.csdn.net/newchitu/article/details/92796302)

(3) DAG Scheduler

The DAG Scheduler is a high-level, stage-oriented scheduler. It divides the DAG into groups of tasks, where each group is a stage. Stages are constructed in reverse from the final RDD, with shuffles as boundaries: whenever a shuffle is encountered, Spark creates a new stage. Each stage is encapsulated as a TaskSet and submitted to the lower-level scheduler (the Task Scheduler). The DAG Scheduler also records which RDDs are materialized to disk, seeks the optimal scheduling for tasks, and monitors failures caused by lost shuffle output across nodes.

(4) Task Scheduler

The Task Scheduler is responsible for executing each specific task. Its main responsibilities include:

  • Scheduling and managing task sets;

  • Tracking task status and results;

  • Managing physical resource scheduling;

  • Executing tasks;

  • Retrieving results.

(5) Job

A job is a parallel computation composed of multiple stages and is triggered by a Spark action. A job involves multiple RDDs and the various operators applied to them.

(6) Stage

The DAG Scheduler divides the DAG into multiple interdependent stages; the basis for this division is the dependencies between RDDs.

Stage division generally proceeds in reverse order, starting from the action: operations with narrow dependencies are placed in the same stage, while a wide dependency starts a new stage, which becomes a parent of the previous one; the process then continues recursively.

A child stage can only run after all of its parent stages have completed, so the stages themselves form a coarse-grained DAG according to their dependencies. Within a stage, all operations are executed as a pipeline by a set of tasks (see the sketch below).
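
A sketch of how a wide dependency (here reduceByKey) introduces a shuffle and therefore a new stage; toDebugString prints the lineage with its shuffle boundaries. An existing SparkContext named sc is assumed:

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // map: narrow dependency, stays in the same stage.
    val pairs  = words.map(word => (word, 1))

    // reduceByKey: wide dependency, requires a shuffle and starts a new stage.
    val counts = pairs.reduceByKey(_ + _)

    // The indented ShuffledRDD in the debug string marks the stage boundary.
    println(counts.toDebugString)
    counts.collect().foreach(println)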

(7) TaskSet and Task

A TaskSet corresponds to a stage and is the set of tasks produced from it. All tasks in a TaskSet can be computed in parallel, since there are no shuffle dependencies among them.


A task is the smallest independent unit of computation in Spark. It is sent by the driver to an executor for execution, and one task usually processes one partition of an RDD. Tasks are divided into ShuffleMapTasks and ResultTasks: tasks in the final stage are ResultTasks, while tasks in all other stages are ShuffleMapTasks (see the sketch below).
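
A sketch of the one-task-per-partition relationship, assuming an existing SparkContext named sc:

    // Request 4 partitions explicitly; the final stage then runs 4 ResultTasks.
    val data = sc.parallelize(1 to 1000, numSlices = 4)
    println(data.getNumPartitions)          // 4

    // After repartitioning to 8, actions on this RDD run 8 tasks in the last stage.
    val wider = data.repartition(8)
    println(wider.getNumPartitions)         // 8
    wider.count()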


4. Spark job running process

1. Spark job running process

A Spark application runs on a distributed cluster as a collection of processes. The driver program's main method creates a SparkContext, which interacts with the cluster. The specific process is as follows:

  • sparkContext requests CPU, memory and other computing resources from the cluster manager.

  • The cluster Manager allocates the resources needed for application execution and creates executors on the worker nodes.

  • The SparkContext sends the program code and tasks to the executors for execution; the code can be a compiled JAR package or a Python file. The SparkContext then collects the results back to the driver.

2. Spark RDD iteration process

  • The SparkContext creates RDD objects, computes the dependencies between RDDs, and builds a DAG (directed acyclic graph).

  • The DAGScheduler divides the DAG into multiple stages and submits each stage's TaskSet to the cluster management center. Stages are divided according to the wide and narrow dependencies between RDDs: Spark starts a new stage whenever it encounters a wide dependency, and each stage contains one or more tasks, which avoids the system overhead of passing messages between many small stages.

  • The TaskScheduler requests resources for each task through the cluster management center and submits the tasks to worker nodes for execution.

  • The executor on the worker performs specific tasks.


3. Introduction to the YARN resource manager

Spark programs generally run on clusters, and Spark on YARN is one of the most commonly used modes in work and production environments.

Before YARN, each distributed framework had to run on its own cluster: Hadoop ran on one cluster, while Spark ran on its own standalone cluster. This led to low overall resource utilization and more troublesome management.


YARN is a distributed resource management and task scheduling framework, mainly composed of three modules: ResourceManager, NodeManager, and ApplicationMaster.

The ResourceManager is mainly responsible for managing, monitoring, and allocating cluster resources; it has ultimate control and resource-management authority over all applications.


The NodeManager is responsible for maintaining its node, launching tasks, and monitoring their running status; it reports its resource usage to the ResourceManager via heartbeats.

Every node managed by YARN runs a NodeManager, which acts as the ResourceManager's agent on that node. If the primary ResourceManager goes down, NodeManagers connect to the standby ResourceManager.


The ApplicationMaster is responsible for scheduling a specific application and coordinating its resources. It negotiates with the ResourceManager for resources, which the ResourceManager allocates to the application in the form of containers, and it is also responsible for starting and stopping the application's tasks.


A container is an abstraction of resources; it encapsulates the resource information (CPU, memory, disk, network, etc.) available on a node. YARN assigns tasks to containers, and a task can only use the resources described by its container, which provides resource isolation between tasks.

4. Spark program execution process on YARN

Spark on YARN has two modes: yarn-client and yarn-cluster. In production, yarn-cluster mode is generally used.

(1) Yarn-client mode


In yarn-client mode, the driver runs locally on the client. This allows the Spark application to interact with the client, and the application's status can be inspected on the client through the driver's web UI. However, the driver communicates heavily with the executors in the YARN cluster, which can cause a large increase in the client machine's network traffic.

(2) yarn-cluster mode


Yarn-cluster mode is mainly used in production. Because the driver runs inside a NodeManager in the YARN cluster, the machine that hosts the driver is effectively random for each submission, so no single machine suffers a surge in network traffic. The disadvantage is that the driver logs are not visible on the client after submission; they can only be viewed through YARN.

The yarn-cluster execution flow is as follows:

The client submits the application to YARN, including the ApplicationMaster program, the command to start the ApplicationMaster, and the program to be run in the executors.

The ResourceManager allocates the first container for the application and asks the corresponding NodeManager to start the ApplicationMaster in that container.

The ApplicationMaster registers with the ResourceManager so that users can view the application's running status directly through the ResourceManager.

Once the ApplicationMaster has obtained resources (that is, containers), it communicates with the corresponding NodeManagers and starts the tasks.

Each task reports its running status and progress to the ApplicationMaster, so the ApplicationMaster can track every task and restart a task when it fails.

After the application finishes running, the ApplicationMaster asks the ResourceManager to deregister it and then shuts itself down.


References:

[1] Spark on YARN architecture principles: https://blog.csdn.net/lijingjingchn/article/details/85012470

[2] Spark on YARN detailed explanation: https://www.cnblogs.com/bigdata1024/p/12116621.html

[3] Spark task submission methods and execution process: https://www.cnblogs.com/frankdeng/p/9301485.html

[4] Spark fault tolerance mechanism: https://www.cnblogs.com/cynchanpin/p/7163160.html

[5] Spark scheduler: https://mp.weixin.qq.com/s/9g5e5WlmXUyQDXiU6PTGZA?token=1292183487&lang=zh_CN

[6] How Spark works: https://blog.csdn.net/qq_16681169/article/details/82432841

[7] Spark RDD: https://www.cnblogs.com/zlslch/p/5942204.html

[8] Quick start to Spark basic concepts: https://www.leonlu.cc/profession/17-spark-terminology/

[9] Introduction to the DAG in Spark: https://blog.csdn.net/newchitu/article/details/92796302

[10] Spark documentation: https://spark.apache.org/docs/3.0.0-preview/index.html
