How to submit Spark tasks [spark-shell, spark-submit]

Spark tasks are submitted with the spark-shell and spark-submit commands.

To run test programs interactively, use spark-shell, Spark's interactive command line.

To submit a packaged Spark program to a Spark cluster to run, use spark-submit.

1、spark-shell

spark-shell is an interactive shell program that ships with Spark and makes interactive programming convenient: users can write Spark programs in Scala directly at the command line.
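For example, a short interactive session might look like the following (a minimal sketch; the data and numbers are arbitrary):

// Typed line by line at the scala> prompt; spark-shell has already
// created the SparkContext and exposes it as sc.
val nums  = sc.parallelize(1 to 100)      // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)       // transformation: keep the even numbers
println(evens.count())                    // action: runs the job and prints 50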

1.1 Overview

1. Spark-shell usage help

(py27) [root@master ~]# cd /usr/local/src/spark-2.0.2-bin-hadoop2.6
(py27) [root@master spark-2.0.2-bin-hadoop2.6]# cd bin
(py27) [root@master bin]# ./spark-shell --help
Usage: ./bin/spark-shell [options]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
      
2. spark-shell source code
(py27) [root@master bin]# cat spark-shell
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

  • The main function of spark-shell runs the spark-submit script with --name set to "Spark shell", which explains why the name shown in the web UI is "Spark shell" when spark-shell is running. It also specifies org.apache.spark.repl.Main as the main class; this REPL entry point creates the SparkContext object for us, which is why a SparkContext is already available as soon as spark-shell starts (a quick check follows this list).
  • In other words, the source code shows that spark-shell ultimately just executes the spark-submit command.
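As a quick check (a sketch, assuming a plain ./spark-shell start with the defaults), you can inspect the pre-created context as soon as the prompt appears:

// Typed at the scala> prompt; nothing needs to be constructed by hand.
sc                  // the SparkContext created automatically by spark-shell
sc.appName          // "Spark shell" -- the --name passed by the script above
sc.master           // e.g. "local[*]" when no --master option was given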

1.2 Start

1. Start spark-shell directly from the bin directory

./spark-shell


  • By default this starts in local mode, and a SparkSubmit process is launched on the local machine

  • You can also specify the --master parameter, for example (see the sketch after this list):

    spark-shell --master local[N]   simulates N threads locally to run the current task
    spark-shell --master local[*]   uses all available cores on the current machine

  • Without any parameters, the default is equivalent to

    spark-shell --master local[*]

  • To exit spark-shell, type :quit or use the shortcut Ctrl+D
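For instance (a sketch), after starting with ./spark-shell --master local[4] you can confirm the setting from inside the shell:

// Typed at the scala> prompt after ./spark-shell --master local[4]
sc.master              // returns "local[4]"
sc.defaultParallelism  // typically 4: one slot per simulated thread
                       // (unless spark.default.parallelism is set explicitly)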

2. ./spark-shell --master spark://master:7077

  • Starts spark-shell in Standalone mode, that is, against the cluster

3. ./spark-shell --master yarn-client

  • Starts spark-shell in YARN client mode
  • ./spark-shell --master yarn-cluster would request YARN cluster mode; note that cluster deploy mode is not applicable to interactive shells, so spark-submit rejects it for spark-shell

1.3 Application scenarios

  • spark-shell is mainly used for testing
  • So it is usually started directly with ./spark-shell, which enters local mode for testing

2、spark-submit

Once a program has been packaged, the bin/spark-submit script can be used to launch the application. This script takes care of setting up the classpath and Spark's dependencies, and supports the different cluster managers and deploy modes that Spark provides.

2.1 Overview

spark-submit is mainly used to submit a compiled and packaged jar to a cluster environment for execution. It is similar to the hadoop jar command in Hadoop: hadoop jar submits an MR task, while spark-submit submits a Spark task. The script sets up Spark's classpath (CLASSPATH) and the application's dependencies, and supports all of the cluster managers and deploy modes that Spark supports. Unlike spark-shell, it has no REPL (interactive programming environment); before running, you need to specify the application's startup class, the path to the jar package, and the application's parameters.
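To make those three pieces concrete (startup class, jar path, arguments), a minimal application that could be packaged and submitted might look like the following sketch (the package, object name, and logic are illustrative, not from the original article):

package com.example                              // hypothetical package name

import org.apache.spark.sql.SparkSession

// Startup class, passed to spark-submit via --class com.example.SimpleCount
object SimpleCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)                      // application argument
    val spark = SparkSession.builder()
      .appName("SimpleCount")
      .getOrCreate()                             // the master comes from spark-submit --master
    val lines = spark.sparkContext.textFile(inputPath)
    println(s"line count: ${lines.count()}")
    spark.stop()
  }
}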

1. Help for using spark-submit

(py27) [root@master bin]#  ./spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

2. spark-submit source code
(py27) [root@master bin]# cat spark-submit
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
  • spark-submit is relatively simple: it runs the spark-class script with org.apache.spark.deploy.SparkSubmit as the main class and passes along all of the arguments it received

2.2 Basic syntax

  • Example: submit a task to a Hadoop YARN cluster for execution.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/target/scala-2.11/jars/spark-examples*.jar 10

Explanation of parameters:

  Parameter                   Description
  --class                     The application's main class (Java or Scala applications only)
  --master                    The master address, i.e. where the task is submitted to run,
                              e.g. spark://host:port, mesos://host:port, yarn, or local
  --deploy-mode               Start the driver locally ("client") or on the cluster ("cluster");
                              the default is client
  --name                      The name of the application, shown in the Spark web UI
  --jars                      Comma-separated local jars; they are added to the classpath of the
                              driver and the executors
  --packages                  Maven coordinates of jars to include on the driver and executor
                              classpaths
  --exclude-packages          Packages to exclude, to avoid dependency conflicts
  --repositories              Additional remote repositories
  --conf PROP=VALUE           Set a Spark configuration property, e.g.
                              --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m"
  --properties-file           Configuration file to load; the default is conf/spark-defaults.conf
  --driver-memory             Driver memory, default 1G
  --driver-java-options       Extra Java options passed to the driver
  --driver-library-path       Extra library path entries passed to the driver
  --driver-class-path         Extra classpath entries passed to the driver
  --driver-cores              Number of driver cores, default 1; used under YARN or standalone
  --executor-memory           Memory per executor, default 1G
  --total-executor-cores      Total number of cores for all executors; only used under Mesos or
                              standalone
  --num-executors             Number of executors to launch, default 2; used under YARN
  --executor-cores            Number of cores per executor; used under YARN or standalone
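For reference, the SparkPi class submitted in the example above roughly does the following (a simplified sketch of the bundled example, not its exact source); the trailing argument (10 above) sets the number of slices, i.e. partitions:

import scala.math.random

import org.apache.spark.sql.SparkSession

object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("Spark Pi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2   // the "10" in the command above
    val n      = 100000 * slices
    // Monte Carlo estimate: the fraction of random points that fall inside
    // the unit circle approaches pi/4
    val inside = spark.sparkContext.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")
    spark.stop()
  }
}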

3. Comparison of spark-shell and spark-submit

1. Similarity: both are located in Spark's bin directory

2. Differences:

(1) spark-shell itself is interactive: it provides a development environment, somewhat like an IDE, right in the terminal so that developers can program in it. At runtime it calls spark-submit underneath to execute.
(2) spark-submit itself is not interactive. It is used to submit a jar package, compiled and packaged in an IDE such as IDEA, to the cluster environment for execution.

Origin blog.csdn.net/weixin_45666566/article/details/112724328