Spark Learning from 0 to 1 (2): Apache Spark Operating Modes and Principles

1. Two ways to submit tasks in Standalone mode

1.1 Standalone-client task submission mode

  1. Submit command

    $ ./spark-submit --master spark://masterNode:7077 --class <main class> <path to jar> 100
    

    or

    $ ./spark-submit --master spark://masterNode:7077 --deploy-mode client --class <main class> <path to jar> 100
    
  2. Implementation principle diagram

[Diagram: Standalone-client submission flow]

  3. Implementation process

    1. After the task is submitted in client mode, a Driver process is started on the client.
    2. The Driver applies to the Master for the resources needed to run the Application.
    3. Once the resource application succeeds, the Driver sends tasks to the Worker nodes for execution.
    4. The Workers return the task execution results to the Driver.
  4. Summary

    Client mode is suitable for testing and debugging programs. The Driver process starts on the client, where the client is the node from which the application is submitted, so task progress can be observed directly on the Driver side. Client mode should not be used in production: if 100 applications were submitted to the cluster, a Driver would be started on the client every time, causing a surge in network card traffic on the client for each of the 100 submissions.

1.2 Standalone-cluster task submission mode

  1. Submit command

    $ ./spark-submit --master spark://masterNode:7077 --deploy-mode cluster --class <main class> <path to jar> 100
    
  2. Implementation principle diagram

[Diagram: Standalone-cluster submission flow]

  3. Implementation process

    1. After the application is submitted in cluster mode, the client asks the Master to start the Driver.
    2. The Master receives the request and starts the Driver process on a random node in the cluster.
    3. After the Driver starts, it applies for resources for the current application.
    4. The Driver sends tasks to the Worker nodes for execution.
    5. The Workers return the execution status and results to the Driver.
  4. Summary

    The Driver process starts on a Worker inside the cluster, so task progress cannot be viewed from the client. If 100 applications were submitted to the cluster, the Driver would be started on a random Worker each time, so the network card traffic surge caused by the 100 submissions is spread across the cluster.

1.3 Summary

Standalone mode has two ways to submit tasks. In both, the communication between the Driver and the cluster includes:

  • Applying for resources for the Application
  • Distributing tasks
  • Collecting results
  • Monitoring task execution

2. Two ways to submit tasks in Yarn mode

2.1 yarn-client task submission mode

  1. Submit command

    ./spark-submit --master yarn --class <class name> <path to jar> 100
    

    or

    ./spark-submit --master yarn-client --class <class name> <path to jar> 100
    

    or

    ./spark-submit --master yarn --deploy-mode client --class <class name> <path to jar> 100
    
  2. Implementation principle diagram

[Diagram: yarn-client submission flow]

  3. Implementation process

    1. The client submits an Application and starts a Driver process on the client.
    2. After the Driver starts, it sends a request to the RS (ResourceManager) to start an AM (ApplicationMaster).
    3. The RS receives the request and randomly selects an NM (NodeManager) on which to start the AM. The NM here is equivalent to a Worker node in Standalone mode.
    4. After the AM starts, it requests a batch of container resources from the RS for starting Executors.
    5. The RS finds a batch of NMs and returns them to the AM for starting Executors.
    6. The AM sends commands to the NMs to start the Executors.
    7. After the Executors start, they register in reverse with the Driver. The Driver sends tasks to the Executors, and the Executors return execution status and results to the Driver.
  4. Summary

    Yarn-client mode is likewise suited to testing, because the Driver runs locally. The Driver communicates heavily with the Executors in the YARN cluster, which causes a large increase in the client's network card traffic.

    The role of ApplicationMaster:

    1. Apply for resources for the current Application.

    2. Send messages to the NodeManagers to start Executors.

      Note: in yarn-client mode the ApplicationMaster launches Executors and applies for resources, but it does not perform job scheduling.

2.2 yarn-cluster task submission mode

  1. Submit command

    ./spark-submit --master yarn --deploy-mode cluster --class <main class> <path to jar> 100
    

    or

    ./spark-submit --master yarn-cluster --class <main class> <path to jar> 100
    
  2. Implementation principle diagram

    [Diagram: yarn-cluster submission flow]

  3. Implementation process

    1. The client submits the Application and sends a request to the RS (ResourceManager) asking it to start an AM (ApplicationMaster).
    2. After receiving the request, the RS randomly starts the AM (which acts as the Driver side) on an NM (NodeManager).
    3. After the AM starts, it sends a request to the RS for a batch of containers in which to start Executors.
    4. The RS returns a batch of NM nodes to the AM.
    5. The AM connects to the NMs and sends them requests to start the Executors.
    6. The Executors register in reverse with the Driver on the node where the AM is located, and the Driver sends tasks to the Executors.
  4. Summary

    Yarn-cluster mode is mainly used in production, because the Driver runs on a NodeManager inside the YARN cluster. Each submission starts the Driver on a random machine, so no single machine suffers a surge in network card traffic. The disadvantage is that the task's logs cannot be viewed from the client after submission; they can only be viewed through YARN.

    The role of ApplicationMaster:

    1. Apply for resources for the current Application.
    2. Send messages to the NodeManagers to start Executors.
    3. Task scheduling.

    Command to stop a task running on the cluster: yarn application -kill applicationId

3. Spark terminology

  • Master (standalone): the master node (process) for resource management.
  • Cluster Manager: an external service for acquiring resources on the cluster (e.g. Standalone, Mesos, YARN).
  • Worker Node (standalone): the slave node (process) for resource management; it manages the resources of its local node.
  • Application: a user program built on Spark, consisting of the driver program and the Executors running on the cluster.
  • Driver Program: the process that runs the application's main() function and connects to the worker processes (Workers).
  • Executor: a process started for an Application on a node managed by a Worker process. It runs tasks and stores data in memory or on disk. Each Application has its own independent Executors.
  • Task: a unit of work sent to an Executor.
  • Job: a parallel computation consisting of many tasks; a Job corresponds to an Action.
  • Stage: a Job is divided into groups of tasks, and each group of tasks is called a Stage (much as a MapReduce job is divided into Map tasks and Reduce tasks).
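
To make the Job, Stage, and Task terms concrete, here is a minimal sketch (assuming an existing SparkContext named sc, as in the code sample later in this article): the single action triggers one Job, the shuffle introduced by reduceByKey splits that Job into two Stages, and each Stage runs one Task per partition.

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)  // 2 partitions
val counts = rdd.reduceByKey(_ + _)  // wide dependency -> a Stage boundary
counts.collect()                     // Action -> triggers one Job made of two Stages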

4. Narrow dependencies and wide dependencies

There are a series of dependencies between RDDs, which are divided into narrow dependencies and wide dependencies.

  • Narrow dependency

    Each partition of the parent RDD corresponds to exactly one partition of the child RDD (one-to-one), or several partitions of the parent RDD correspond to a single partition of the child RDD (many-to-one). No shuffle occurs.

  • Wide dependency

    A partition of the parent RDD corresponds to multiple partitions of the child RDD (one-to-many). A shuffle occurs.

  • Narrow and wide dependency diagrams

[Diagrams: narrow and wide dependencies]
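
Both kinds of dependency can be seen in a short sketch (again assuming an existing SparkContext sc); toDebugString prints the lineage, and the ShuffledRDD in it marks the wide dependency:

val nums    = sc.parallelize(1 to 8, 4)
val doubled = nums.map(_ * 2)                           // narrow: one parent partition -> one child partition
val grouped = doubled.map(x => (x % 2, x)).groupByKey() // wide: one parent partition feeds many child partitions (shuffle)
println(grouped.toDebugString)                          // the lineage shows a ShuffledRDD at the wide dependency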

5. Stage

A Spark job forms a DAG (directed acyclic graph) based on the dependencies between RDDs. The DAG is submitted to the DAGScheduler, which divides it into stages that depend on each other. The division is based on the wide and narrow dependencies between the RDDs: whenever a wide dependency is encountered, a stage boundary is cut. Each stage contains one or more tasks, and these tasks are submitted to the TaskScheduler as TaskSets to be run.

A stage is composed of a set of parallel tasks.

5.1 Stage cutting rules

Cutting rule: walk the DAG from back to front and cut a new stage whenever a wide dependency is encountered.

[Diagram: stage cutting]
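
For example (a sketch assuming an existing SparkContext sc), a word count contains exactly one wide dependency, so cutting from back to front yields two stages:

val lines = sc.parallelize(Seq("a b a", "b c"))
val words = lines.flatMap(_.split(" ")).map((_, 1))  // narrow dependencies -> stay in stage 0
val counts = words.reduceByKey(_ + _)                // wide dependency -> cut here: the shuffled result is stage 1
counts.collect()                                     // the action submits both stages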

5.2 Stage calculation mode

Stages are computed in pipeline fashion. The pipeline is just a computational idea, a model, not a physical structure.

[Diagram: pipeline computation within a stage]

Data keeps flowing through the pipeline; when is it materialized (written out)?

  1. When the RDD is persisted.
  2. When the shuffle write happens.
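
A small sketch of these two cases (assumed local data; cache() is one form of persistence):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
              .map { case (k, v) => (k, v * 2) }
pairs.cache()                          // 1. persisting the RDD materializes its partitions when first computed
val counts = pairs.reduceByKey(_ + _)  // 2. the shuffle write materializes the map-side output to disk
counts.collect()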

The task parallelism of the stage is determined by the number of partitions of the last RDD of the stage.

How to change the number of RDD partitions?

For example: reduceByKey(_ + _, 4), groupByKey(4).
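
For instance (a sketch with assumed data), passing a numPartitions argument to a shuffle operator sets the partition count of the resulting RDD and therefore the parallelism of the next stage:

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val reduced = pairs.reduceByKey(_ + _, 4)  // the shuffled RDD has 4 partitions -> 4 tasks in the next stage
val grouped = pairs.groupByKey(4)          // same idea with groupByKey
println(reduced.getNumPartitions)          // 4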

5.3 Verifying the pipeline computation model

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.setMaster("local").setAppName("pipeline")
val sc = new SparkContext(conf)
val rdd1 = sc.parallelize(Array(1, 2, 3, 4))
val rdd2 = rdd1.map { x =>
  println("map======>" + x)
  x
}
val rdd3 = rdd2.filter { x =>
  println("filter=====>" + x)
  true
}
rdd3.collect()
sc.stop()
// Because map and filter are pipelined inside one stage, the output interleaves per element
// (map======>1, filter=====>1, map======>2, ...) instead of printing all map lines first.

6. Spark resource scheduling and task scheduling

[Diagram: Spark resource scheduling and task scheduling overview]

6.1 Graphical Spark resource scheduling and task scheduling process

[Diagram: Spark resource scheduling and task scheduling process]

6.2 Spark resource scheduling and task scheduling process

  1. After the cluster starts, the Worker nodes report their resource status to the Master, so the Master knows the resource status of the whole cluster.
  2. When a Spark Application is submitted, a DAG (directed acyclic graph) is built from the dependencies between its RDDs.
  3. After the task is submitted, Spark creates two objects on the Driver side: a DAGScheduler and a TaskScheduler.
  4. The DAGScheduler is the high-level scheduler of task scheduling.
  5. The main job of the DAGScheduler is to divide the DAG into stages according to the wide and narrow dependencies between RDDs, and then submit those stages to the TaskScheduler in the form of TaskSets (the TaskScheduler is the low-level scheduler of task scheduling; a TaskSet is simply a collection that encapsulates the tasks of a stage, i.e. the stage's parallel tasks).
  6. The TaskScheduler traverses each TaskSet and sends every task to an Executor on a compute node for execution (in fact, to the thread pool inside the Executor).
  7. The running status of tasks in the Executor thread pool is reported back to the TaskScheduler. When a task fails, the TaskScheduler retries it by re-sending it to an Executor; by default a task is retried 3 times.
  8. If a task still fails after 3 retries, the stage it belongs to fails. When a stage fails, the DAGScheduler retries it by resending the TaskSet to the TaskScheduler; a stage is retried 4 times by default. If it still fails after 4 retries, the job fails, and if the job fails, the Application fails.
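
The retry counts described in steps 7 and 8 correspond roughly to configuration properties; a hedged sketch (property names and defaults assumed from recent Spark versions):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("retry-settings")
  .set("spark.task.maxFailures", "4")              // failures allowed for a single task (first attempt + 3 retries)
  .set("spark.stage.maxConsecutiveAttempts", "4")  // consecutive attempts allowed for a stage before it is aborted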

The TaskScheduler can not only retry failed tasks but also retry straggling (lagging, slow) tasks, i.e. tasks that run much more slowly than the others. When it finds a slow-running task, the TaskScheduler starts a new task that performs the same processing logic as the slow one; whichever of the two tasks finishes first, its result is used. This is Spark's speculative execution mechanism. Speculative execution is turned off by default and can be enabled through the spark.speculation property.

Notes:

  • For ETL-type jobs that write to a database, speculative execution should be disabled so that no duplicate data is written to the database.
  • If there is data skew, enabling speculation may cause new tasks to be launched over and over to process the same logic, and the task may stay forever in an unfinished state.
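
A sketch of how speculation might be enabled (property names assumed from the standard Spark configuration; speculation stays off unless set):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.multiplier", "1.5")  // a task this many times slower than the median counts as a straggler
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculation starts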

6.3 Coarse-grained resource application and fine-grained resource application

  • Coarse-grained resource application (Spark)

    Before the Application executes, all of its resources are applied for up front; task scheduling starts only after the resource application succeeds, and these resources are released only after all tasks have finished (see the configuration sketch after this list).

    Advantages: because all resources are applied for before the Application runs, every task can use them directly without applying for resources itself before executing. Tasks start fast, tasks run fast, and the Application finishes fast.

    Disadvantages: the resources are not released until the last task completes, so the cluster's resources cannot be fully utilized.

  • Fine-grained resource application (MapReduce)

    The Application does not apply for resources before executing; it starts directly, and each task in the job applies for resources by itself before it runs and releases them as soon as it finishes.

    Advantages: the cluster's resources can be fully utilized.

    Disadvantages: because every task applies for resources itself, task startup becomes slower, and the application correspondingly runs slower.
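
As a hedged sketch of the coarse-grained model described above: the executor count, cores, and memory are fixed when the Application starts and are held until it exits (property names assumed from the standard Spark configuration, applicable for example on YARN):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("coarse-grained-resources")
  .set("spark.executor.instances", "4")  // all 4 executors are requested before any task runs
  .set("spark.executor.cores", "2")      // cores per executor, fixed for the whole application
  .set("spark.executor.memory", "2g")    // memory per executor, released only when the application ends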

Origin blog.csdn.net/dwjf321/article/details/109047803