Spark series: RDD wide and narrow dependencies, and Spark's running architecture, running process, and framework characteristics

One, Narrow dependency

A narrow dependency means that each partition of a parent RDD is used by at most one partition of the child RDD. Operations such as map, filter, and union produce narrow dependencies. The relationship is analogous to a parent with an only child.
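
As a minimal sketch (the app name, local master, and partition counts below are just placeholders, not anything prescribed by the article), the narrow dependency produced by map and filter can be observed through the RDD's dependencies field:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local sketch; the app name and master are placeholders.
val conf = new SparkConf().setAppName("narrow-dep-demo").setMaster("local[2]")
val sc   = new SparkContext(conf)

val parent = sc.parallelize(1 to 100, numSlices = 4)
val mapped = parent.map(_ * 2).filter(_ % 3 == 0)

// map and filter keep data in place, so each child partition reads from
// exactly one parent partition -- a narrow (OneToOneDependency) dependency.
println(mapped.dependencies.head)
```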

Two, Wide dependency (shuffle dependency)

A wide dependency means that a single partition of a parent RDD is used by multiple partitions of the child RDD: each parent partition may send data to every partition of the child RDD, and each child partition depends on multiple parent partitions. Operations such as groupByKey, reduceByKey, and sortByKey produce wide dependencies. The relationship is analogous to parents with several children.
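
A rough sketch of the contrast, reusing the hypothetical sc and parent RDD from the previous snippet: a reduceByKey introduces a ShuffleDependency, which is where a stage boundary will later be drawn.

```scala
// Continuing the sketch above (same sc and parent RDD):
val pairs   = parent.map(n => (n % 10, n))
val reduced = pairs.reduceByKey(_ + _)

// reduceByKey repartitions records by key, so every child partition may pull
// data from every parent partition -- a wide (ShuffleDependency) dependency.
println(reduced.dependencies.head)
```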

Three, Basic concepts of Spark

  1. Basic concepts: RDD, DAG, Executor, Application, Task, Job, Stage
  2. RDD: Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model.
  3. DAG: Directed Acyclic Graph, used to represent the lineage between RDDs in Spark, that is, the dependencies between RDDs.
  4. Executor: A process running on a worker node that is responsible for running tasks and storing data for applications.
  5. Application: a Spark application written by the user.
  6. Task: a unit of work that runs on an Executor.
  7. Job: a job contains multiple RDDs and the various operations applied to those RDDs.
  8. Stage: the basic scheduling unit of a job. A job is divided into multiple groups of tasks; each group of tasks is called a "Stage", also known as a task set (see the sketch after this list).
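A small, hypothetical word-count sketch (assuming the sc from the earlier snippets) that ties these terms together: calling one action submits one job, the shuffle splits that job into two stages, and each stage runs one task per partition.

```scala
// Hypothetical word-count pipeline, assuming the sc from the earlier sketch.
val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)
val pairs  = words.map(w => (w, 1))   // narrow: stays in the first stage
val counts = pairs.reduceByKey(_ + _) // shuffle: begins a second stage

// collect() is an action: it submits one Job consisting of two Stages.
// The first stage runs 3 tasks (one per partition of `words`); the second
// stage runs one task per partition of `counts`.
counts.collect()
println(counts.getNumPartitions)      // number of tasks in the second stage
```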

Four, Spark's architecture design

Spark's running architecture includes the cluster resource manager (Cluster Manager), worker nodes (Worker Node), a task control node (Driver), and execution processes (Executor). The cluster resource manager can be Spark's built-in resource manager, or a resource management framework such as YARN.
In Spark, an Application consists of one task control node (Driver) and multiple jobs (Job); a job consists of multiple stages (Stage), and a stage consists of multiple tasks (Task). When an Application is executed, the task control node applies to the cluster manager (Cluster Manager) for resources, starts the Executors, and sends the application code and files to them; the Executors then execute the tasks. When execution finishes, the results are returned to the task control node, or written to HDFS or another database.
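
A sketch of this flow from the application's point of view (the master and the HDFS paths are placeholders): the driver creates a SparkContext, which asks the cluster manager for Executors; the job's result either comes back to the driver or is written out to HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a driver program; the master and HDFS paths are placeholders.
val conf = new SparkConf()
  .setAppName("architecture-demo")
  .setMaster("yarn") // cluster manager: Spark's own standalone manager, YARN, ...

val sc = new SparkContext(conf) // Driver registers with the cluster manager,
                                // which launches Executors for this application

val logs   = sc.textFile("hdfs:///tmp/input/logs.txt")
val errors = logs.filter(_.contains("ERROR"))

println(errors.count())                            // result returned to the Driver
errors.saveAsTextFile("hdfs:///tmp/output/errors") // or written to HDFS

sc.stop()
```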

Five, Spark's running process


  1. When a Spark application is submitted, a basic running environment is first built for it: the task control node (Driver) creates a SparkContext. The SparkContext communicates with the cluster resource manager (Cluster Manager) and handles resource application, task allocation, and monitoring; it registers with the resource manager and applies for the resources needed to run Executors.
  2. The resource manager allocates resources to the Executors and starts the Executor processes; the Executors' running status is reported to the resource manager along with their heartbeats.
  3. The SparkContext builds a DAG from the dependencies between RDDs and submits the DAG to the DAG scheduler (DAGScheduler) for analysis. The DAGScheduler decomposes the DAG into multiple stages (Stage), each of which is a task set, and computes the dependencies between stages; it then submits the task sets one by one to the underlying task scheduler (TaskScheduler). Executors apply to the SparkContext for tasks, the task scheduler assigns tasks to the Executors to run, and at the same time the SparkContext sends the application code to the Executors (see the lineage sketch after this list).
  4. The tasks run on the Executors and feed their results back to the task scheduler, which feeds them back to the DAG scheduler. After the run finishes, the data is written out and all resources are released.
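
As an illustration of step 3 (assuming the sc from the earlier sketches; the input path is a placeholder and the exact output format varies by Spark version), toDebugString prints the lineage that the DAGScheduler cuts into stages at each shuffle:

```scala
// Assumes the sc from the earlier sketches; the input path is a placeholder.
val wordCounts = sc.textFile("hdfs:///tmp/input/logs.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// toDebugString prints the lineage; the indentation changes at the ShuffledRDD,
// which is exactly where the DAGScheduler cuts the job into separate stages.
println(wordCounts.toDebugString)
```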

Six, Characteristics of Spark's running architecture

  1. Each application has its own dedicated Executor processes, and these processes stay alive for the whole duration of the application. An Executor process runs tasks in a multi-threaded manner, which avoids the overhead of frequently starting a separate process per task and makes task execution efficient and reliable.
  2. The Spark running process is independent of the specific resource manager, as long as it can obtain Executor processes and keep communicating with them.
  3. Each Executor has a BlockManager storage module, which works like a key-value store (using both memory and disk as storage devices). When processing iterative computation tasks, intermediate results do not need to be written to a file system such as HDFS; they are kept in this storage module and read directly from it when needed later. In interactive query scenarios, tables can also be cached in this storage module in advance to improve read/write I/O performance (see the caching sketch after this list).
  4. Tasks use optimization mechanisms such as data locality and speculative execution. Data locality means trying to move computation to the node where the data resides, i.e. "moving computation closer to the data", because moving computation consumes far less network resource than moving data. In addition, Spark uses a delayed scheduling mechanism, which can further optimize execution. For example, if the node that holds the data is currently occupied by other tasks, should the data be moved to another idle node? Not necessarily: if the current node is predicted to finish its current task sooner than the data could be moved, scheduling simply waits until that node becomes available.
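
A sketch of the caching pattern described in item 3 (assuming the sc from the earlier sketches; the dataset path and the chosen storage level are illustrative): persisting an RDD keeps the intermediate result in the Executors' BlockManager, so iterative passes reread it from memory instead of from HDFS.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes the sc from the earlier sketches; path and storage level are illustrative.
val points = sc.textFile("hdfs:///tmp/input/points.txt")
  .map(_.split(",").map(_.toDouble))
  .persist(StorageLevel.MEMORY_AND_DISK)

var total = 0.0
for (_ <- 1 to 10) {
  // After the first pass, `points` is served from the Executors' BlockManager
  // (memory, spilling to disk) instead of being re-read from HDFS.
  total += points.map(_.sum).reduce(_ + _)
}
println(total)
points.unpersist()
```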

To do: the Executor process runs tasks in a multi-threaded manner; look into the details.

Origin blog.csdn.net/Cxf2018/article/details/109433352