Spark Architecture and Operating Mechanism (3) - Application Initialization

When an application is submitted to a Spark environment, an initialization process must be completed first. The main work is to load the configuration, initialize the job, and finally create a SparkContext instance. There are two ways to trigger Spark application initialization:
(1) Running a Spark program with spark-shell. When the spark-shell interactive environment starts, it automatically completes the Spark configuration for the user and automatically creates a SparkContext that connects to the Spark cluster. By the time the spark-shell prompt appears, the application initialization process has already been completed.
(2) Submitting a Spark program with spark-submit. The user packages the application into a JAR file and then submits it to the Spark cluster for processing through the spark-submit script with the appropriate parameters. Here is an example of a spark-submit script running on a YARN cluster:

# ...                      : other spark-submit options
# /home/wordcount.jar      : path of the application JAR package
# /user/input/Readme.txt   : first program argument (input path)
# /user/output             : second program argument (output path)
./bin/spark-submit --master yarn --deploy-mode cluster --class wordcount \
  --name wordcount --executor-memory 1g --executor-cores 1 \
  ... \
  /home/wordcount.jar \
  /user/input/Readme.txt \
  /user/output

The main parameters of the above script are as follows:
1. master: indicates the cluster master (resource manager) that the application connects to;
2. class: most user applications are written in Scala or Java, and an application's JAR package contains many class files; class indicates which class is the entry point of the application (the class whose main method the Driver will run);
3. deploy-mode: indicates where the Driver runs; the Driver can run on the user's client machine (client mode) or on a node inside the cluster (cluster mode);
4. name: specifies the name of the application;
5. executor-memory: specifies the memory size of each Executor. Spark's operations on RDDs are memory-based, so the Executor memory setting directly affects performance;
6. executor-cores: specifies the number of CPU cores used by each Executor. In YARN mode, executor-cores sets the number of cores per Executor; in Mesos mode, total-executor-cores specifies the total number of cores used by all Executors.

There are many more parameters that can be set through the spark-submit script. After the user submits the application with spark-submit, the Spark cluster selects a host to run the Driver program according to the deploy-mode setting in the script, and the Driver enters the application through the entry class specified by class.
The main function of each application's entry class contains a SparkContext instance. SparkContext is the interface through which the entire application connects to the cluster, and it is mainly responsible for the following tasks:

1) Accepting SparkConf parameters
When SparkContext is initialized, the Spark runtime passes the configuration parameters held in SparkConf to the SparkContext to set the application's runtime properties, such as the master to connect to, the application name, sparkHome, and environment variables.
Note that runtime parameters can be set both in SparkConf and through spark-submit, but values set in SparkConf take precedence over those passed to spark-submit, as illustrated in the sketch below.
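To make the relationship between the entry class, SparkConf and SparkContext concrete, here is a minimal Scala sketch. The object name wordcount and the specific settings are assumptions chosen to match the spark-submit example above, not a prescribed layout.

import org.apache.spark.{SparkConf, SparkContext}

// A minimal entry class; the object name is assumed to match the --class
// value passed to spark-submit in the example above.
object wordcount {
  def main(args: Array[String]): Unit = {
    // Values set directly on SparkConf take precedence over the
    // corresponding spark-submit options.
    val conf = new SparkConf()
      .setAppName("wordcount")            // would override --name
      .set("spark.executor.memory", "1g") // would override --executor-memory

    // Creating the SparkContext completes application initialization:
    // it builds SparkEnv, applies for Executor resources, starts the web UI,
    // and creates the TaskScheduler and DAGScheduler described below.
    val sc = new SparkContext(conf)

    // ... application logic ...

    sc.stop()
  }
}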

2) Creating the SparkEnv runtime environment
Spark cannot run without several important management modules, such as the BlockManager and the CacheManager. SparkEnv creates these management modules according to the cluster parameters set earlier.
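For a rough look at what SparkEnv holds, the snippet below reads it through Spark's internal developer API. SparkEnv.get and its blockManager and serializer fields exist in current Spark releases, but they are internal and may change between versions, so treat this purely as an illustrative sketch.

import org.apache.spark.SparkEnv

// After the SparkContext exists, the environment created during
// initialization can be looked up on the driver (or inside an executor).
// SparkEnv is a developer API and is not part of the stable public surface.
val env = SparkEnv.get
println(env.blockManager) // the storage/block management module created by SparkEnv
println(env.serializer)   // the serializer configured for this application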

3) Applying for resources
The entire application connects to the cluster through SparkContext and applies to the Spark cluster manager (Cluster Manager) for the Executor resources needed to run. Once the resource application succeeds, the application obtains Executor resources distributed across different nodes; SparkContext then sends the application code to each Executor, and the Executors actually carry out the work.

4) Creating the SparkUI
Spark provides a separate web UI management interface for each application.

5) Creating the TaskScheduler
The initialization of the TaskScheduler varies according to the Spark running mode. After initialization, the TaskScheduler is responsible for the actual physical scheduling of each task.

6) Creating the DAGScheduler
The DAGScheduler is created on top of the TaskScheduler created in the previous step. It accepts the submitted computing jobs and is responsible for the logical (stage-level) scheduling of tasks.

7) Providing operator methods
SparkContext provides a number of important methods for manipulating data, such as the textFile method used earlier.
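As a reminder of what these methods look like in use, here is a minimal word-count sketch built on sc.textFile; the argument indices and paths are assumptions matching the spark-submit example above.

// Assuming sc is the SparkContext created during initialization and that
// args(0)/args(1) are the input and output paths passed via spark-submit
// (e.g. /user/input/Readme.txt and /user/output in the example above).
val lines  = sc.textFile(args(0))          // read the input file as an RDD of lines
val counts = lines.flatMap(_.split(" "))   // split each line into words
                  .map(word => (word, 1))  // pair each word with a count of 1
                  .reduceByKey(_ + _)      // sum the counts per word
counts.saveAsTextFile(args(1))             // write the result to the output path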


Once the SparkContext has been created successfully, the initialization of the Spark application is complete. At this point, the status of the application can be viewed by visiting port 4040 on the Driver node.
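If the Driver's address is not known in advance, Spark 2.0 and later also expose the UI address on the SparkContext itself; a small sketch:

// Print the address of this application's web UI (Spark 2.0+).
// uiWebUrl is empty if the UI has been disabled via spark.ui.enabled=false.
sc.uiWebUrl.foreach(url => println(s"Application UI available at $url"))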
