1. Narrow dependency
A narrow dependency means that each partition of a parent RDD is used by at most one partition of the child RDD, for example: map, filter, union. Such operations produce narrow dependencies, which are analogous to a parent with an only child.
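The idea can be illustrated with a minimal plain-Python sketch (not Spark itself, just lists standing in for partitions): with a narrow operation like map, child partition i is computed entirely from parent partition i, so no data moves between partitions.

```python
# A parent "RDD" represented as a list of partitions.
parent_partitions = [[1, 2], [3, 4], [5, 6]]

def narrow_map(partitions, f):
    """map is a narrow dependency: child partition i depends
    only on parent partition i, so partitions never exchange data."""
    return [[f(x) for x in part] for part in partitions]

child_partitions = narrow_map(parent_partitions, lambda x: x * 10)
print(child_partitions)  # [[10, 20], [30, 40], [50, 60]]
```

Because each child partition has exactly one parent partition, such operations can be pipelined on one node without any shuffle.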
2. Wide dependency (shuffle dependency)
A wide dependency means that a single partition of a parent RDD is used by multiple partitions of the child RDD: each partition of the parent RDD may send data to every partition of the child RDD, and each partition of the child RDD may depend on all partitions of the parent RDD. Examples: groupByKey, reduceByKey, sortByKey. This relationship is analogous to parents with multiple children.
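A minimal plain-Python sketch (again, not Spark itself) of the shuffle behind an operation like groupByKey: every parent partition may route records to every child partition. A deterministic `ord`-based routing function stands in for Spark's hash partitioner here, purely for illustration.

```python
# Two parent partitions of (key, value) records.
parent_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
num_child_partitions = 2

def shuffle_group_by_key(partitions, n):
    """Route each record to a child partition by key, then group by key.
    Every parent partition may feed every child partition (wide dependency)."""
    child = [dict() for _ in range(n)]
    for part in partitions:                  # every parent partition...
        for key, value in part:
            target = ord(key[0]) % n         # ...may send to every child partition
            child[target].setdefault(key, []).append(value)
    return child

grouped = shuffle_group_by_key(parent_partitions, num_child_partitions)
print(grouped)  # [{'b': [2]}, {'a': [1, 3], 'c': [4]}]
```

Note that the values for key "a" came from two different parent partitions: that cross-partition data movement is exactly what makes the dependency wide and forces a shuffle.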
3. Basic concepts of Spark
- Basic concepts: RDD, DAG, Executor, Application, Task, Job, Stage
- RDD: Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model.
- DAG: Directed acyclic graph, used to represent the lineage between RDDs in a Spark job, that is, the dependencies between RDDs.
- Executor: A process running on a worker node that is responsible for running tasks and storing data for the application.
- Application: A Spark application written by the user.
- Task: The unit of work that runs on an Executor.
- Job: A job contains multiple RDDs and the various operations applied to them.
- Stage: The basic scheduling unit of a job. A job is divided into multiple groups of tasks; each group is called a "Stage", also known as a task set.
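The relationship between jobs, stages, and dependencies can be sketched in a few lines of Python. This is a hypothetical simplification, not Spark's actual DAGScheduler: a new stage begins at every wide (shuffle) dependency, while consecutive narrow operations are pipelined into the same stage.

```python
# (operation, dependency_kind) pairs for a hypothetical job's lineage.
lineage = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),   # shuffle -> stage boundary
    ("map", "narrow"),
    ("sortByKey", "wide"),     # shuffle -> stage boundary
]

def split_into_stages(ops):
    """Cut the operator chain at each wide dependency;
    narrow operations stay pipelined within one stage."""
    stages, current = [], []
    for name, kind in ops:
        if kind == "wide" and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(lineage)
print(stages)
# [['textFile', 'map', 'filter'], ['reduceByKey', 'map'], ['sortByKey']]
```

The sketch shows why stage counts are driven by shuffles: the three narrow operations collapse into a single stage, while each wide operation opens a new one.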
4. Spark architecture design
Spark's runtime architecture includes a cluster resource manager (Cluster Manager), worker nodes (Worker Node), a task control node (Driver), and execution processes (Executor). The cluster resource manager can be Spark's built-in resource manager, or a resource management framework such as YARN.
In Spark, an Application consists of one task control node (Driver) and multiple jobs (Job); a job consists of multiple Stages, and a Stage consists of multiple Tasks. When an Application is executed, the Driver applies to the Cluster Manager for resources, starts the Executors, and sends the application code and files to them; the Executors then execute the tasks. When execution finishes, the results are returned to the Driver, or written to HDFS or another storage system.
5. Spark running process
- When a Spark application is submitted, a basic running environment is first constructed for it: the task control node (Driver) creates a SparkContext. The SparkContext communicates with the Cluster Manager and handles resource application, task allocation, and monitoring. The SparkContext registers with the resource manager and applies for resources to run Executors.
- The resource manager allocates resources to the Executors and starts the Executor processes; each Executor reports its running status to the resource manager along with its heartbeats.
- The SparkContext builds a DAG from the dependencies among the RDDs and submits the DAG to the DAG scheduler (DAGScheduler) for analysis. The DAGScheduler decomposes the DAG into multiple Stages (each stage is a task set) and computes the dependencies between stages, then submits the task sets one by one to the underlying task scheduler (TaskScheduler). Executors apply to the SparkContext for tasks; the TaskScheduler assigns tasks to the Executors to run, and the SparkContext also sends the application code to the Executors.
- Tasks run on the Executors; their results are fed back to the TaskScheduler and then to the DAGScheduler. After the run completes, the data is written out and all resources are released.
6. Characteristics of the Spark running architecture
- Each application has its own dedicated Executor processes, which persist for the lifetime of the application. Executor processes run tasks in a multi-threaded manner, which avoids the overhead of frequently starting multi-process tasks and makes task execution efficient and reliable.
- The Spark running process is independent of the resource manager: all Spark needs is to obtain Executor processes and keep communicating with them.
- Each Executor has a BlockManager storage module, which works like a key-value store (using both memory and disk as storage devices). When processing iterative computations, intermediate results need not be written to a file system such as HDFS; they are kept in this storage module and read directly when needed later. In interactive query scenarios, tables can also be cached in this storage module in advance to improve read/write I/O performance.
- Tasks use optimization mechanisms such as data locality and speculative execution. Data locality means moving the computation to the node where the data resides ("moving computation closer to the data"), because moving computation consumes far less network bandwidth than moving data. Spark also uses a delay scheduling mechanism, which can further optimize execution. For example, if the node that holds the data is currently occupied by other tasks, should the data be moved to another idle node? Not necessarily: if the predicted time for the busy node to finish its current task is less than the time needed to move the data, scheduling waits until that node becomes available.
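The delay-scheduling trade-off above boils down to a simple comparison. The following sketch uses hypothetical timings (the function and numbers are illustrative, not Spark internals): wait for the data-local node if it is predicted to free up sooner than the data could be moved elsewhere.

```python
def schedule_decision(predicted_node_free_s, data_move_s):
    """Delay scheduling in one line: wait for the data-local node
    if that is predicted to be cheaper than moving the data."""
    return "wait" if predicted_node_free_s < data_move_s else "move"

# The data-local node frees up in 2 s; copying the partition elsewhere takes 10 s.
print(schedule_decision(predicted_node_free_s=2.0, data_move_s=10.0))   # wait
# The data-local node is busy for 30 s; now moving the data is cheaper.
print(schedule_decision(predicted_node_free_s=30.0, data_move_s=10.0))  # move
```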
To do: look into the details of how the Executor process runs tasks in a multi-threaded manner.