How to submit Spark tasks
Spark tasks are submitted with the spark-shell and spark-submit commands:
- Use spark-shell, the Spark interactive command line, when testing a program.
- Use spark-submit when submitting a Spark program to run on a Spark cluster.
1、spark-shell
spark-shell is an interactive shell program that ships with Spark, convenient for interactive programming: users can write Spark programs in Scala directly at the command line.
1.1 Overview
1. Spark-shell usage help
(py27) [root@master ~]# cd /usr/local/src/spark-2.0.2-bin-hadoop2.6
(py27) [root@master spark-2.0.2-bin-hadoop2.6]# cd bin
(py27) [root@master bin]# ./spark-shell --help
Usage: ./bin/spark-shell [options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
2. Spark-shell source code (excerpt)
(py27) [root@master bin]# cat spark-shell
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}
- In the main method of spark-shell, the spark-submit script is executed with --name set to "Spark shell", which explains why the name displayed in the web UI is "Spark shell" when spark-shell is running. The main class specified is org.apache.spark.repl.Main, the REPL entry point, which creates the SparkContext for us; that is why a SparkContext is available automatically whenever spark-shell starts.
- As the source shows, spark-shell ultimately just executes the spark-submit command.
1.2 Start
1. Start spark-shell directly from the bin directory:
./spark-shell
- This starts in local mode by default, launching a SparkSubmit process on the local machine.
- You can also specify the --master parameter, for example:
spark-shell --master local[N]
which simulates N threads locally to run the current task, or
spark-shell --master local[*]
which uses all available cores on the current machine.
- Without parameters, the default is equivalent to spark-shell --master local[*].
- To exit spark-shell, type :quit or use the shortcut key Ctrl+D.
2. ./spark-shell --master spark://master:7077
- Starts against the Standalone cluster, i.e. in cluster mode.
3. ./spark-shell --master yarn-client
- Starts in YARN client mode. Note that an interactive shell's driver must run locally, so YARN cluster mode (--master yarn-cluster) is not applicable to spark-shell and is rejected by Spark.
1.3 Application scenarios
- spark-shell is mainly used for testing.
- So it is generally started directly with ./spark-shell, which enters local mode for testing.
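Once the shell is up, a quick smoke test confirms the session works. The following is a transcript sketch of what you might type at the `scala>` prompt (not a standalone program: `sc` is the SparkContext that spark-shell creates automatically, and the results assume the 1-to-100 example data):

```scala
// `sc` is the SparkContext that spark-shell creates automatically.
val rdd = sc.parallelize(1 to 100)    // distribute a local range across threads
rdd.sum()                             // 5050.0
rdd.filter(_ % 2 == 0).count()        // 50
```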
2、spark-submit
Once the program is packaged, the bin/spark-submit script can be used to launch the application. This script takes care of setting up the classpath and Spark's dependencies, and supports the different cluster managers and deploy modes that Spark provides.
2.1 Overview
spark-submit is mainly used to submit a compiled and packaged jar to a cluster environment to run. It is analogous to the hadoop jar command in Hadoop: hadoop jar submits an MR task, while spark-submit submits a Spark task. The script sets up Spark's classpath (CLASSPATH) and the application's dependencies, and can target the different cluster managers and deploy modes that Spark supports. Unlike spark-shell, it has no REPL (interactive programming environment); before running, you must specify the application's main class, the path to the jar, and any arguments.
1. Help for using spark-submit
./spark-submit --help
(py27) [root@master bin]# ./spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
(The options are identical to those of spark-shell --help, listed in full in section 1.1 above.)
2. Spark-submit source code
(py27) [root@master bin]# cat spark-submit
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
- spark-submit is comparatively simple: it execs the spark-class script with org.apache.spark.deploy.SparkSubmit as the main class, passing along all the arguments it received.
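The `export PYTHONHASHSEED=0` line matters for PySpark: Python 3.3+ randomizes string hashing per interpreter, so without a pinned seed each executor process could hash the same key differently and break hash partitioning. A minimal sketch of the effect, using plain Python (no Spark required) and a hypothetical helper `string_hash`:

```python
import os
import subprocess
import sys

def string_hash(seed: str) -> int:
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash('spark')."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('spark'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With the seed pinned (as spark-submit does), every interpreter -- and hence
# every PySpark executor -- computes the same string hash, so keys that are
# hash-partitioned land in the same partition on every worker.
assert string_hash("0") == string_hash("0")
```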
2.2 Basic syntax
- Example: Submit a task to the hadoop yarn cluster for execution.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/target/scala-2.11/jars/spark-examples*.jar 10
Explanation of parameters:
| Parameter | Description |
|---|---|
| --class | The application's main class (Java/Scala applications only) |
| --master | The master address, i.e. where the task is submitted to run, e.g. local, spark://host:port, yarn |
| --deploy-mode | Start the driver locally (client) or on the cluster (cluster); default is client |
| --name | The application name, displayed in the Spark web UI |
| --jars | Comma-separated local jars; they are added to the driver and executor classpaths |
| --packages | Maven coordinates of jars to include on the driver and executor classpaths |
| --exclude-packages | Packages to exclude, to avoid dependency conflicts |
| --repositories | Additional remote repositories |
| --conf PROP=VALUE | Sets a Spark configuration property, e.g. --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m" |
| --properties-file | The configuration file to load; default is conf/spark-defaults.conf |
| --driver-memory | Driver memory; default 1G |
| --driver-java-options | Extra Java options passed to the driver |
| --driver-library-path | Extra library path entries passed to the driver |
| --driver-class-path | Extra classpath entries passed to the driver |
| --driver-cores | Number of driver cores; default 1 (YARN or standalone only) |
| --executor-memory | Memory per executor; default 1G |
| --total-executor-cores | Total cores across all executors (Mesos or standalone only) |
| --num-executors | Number of executors to launch; default 2 (YARN only) |
| --executor-cores | Cores per executor (YARN or standalone only) |
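Since the submit command line above is long, teams often wrap it in a small script. A minimal sketch of such a wrapper (a hypothetical helper, not part of Spark; the SPARK_HOME default and jar path are assumptions for your own environment) that assembles the command and can be dry-run by printing it:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build the spark-submit command line for a YARN
# cluster-mode job so the common flags need not be retyped for every job.
build_submit_cmd() {
  local main_class="$1" app_jar="$2"; shift 2
  echo "${SPARK_HOME:-/usr/local/src/spark-2.0.2-bin-hadoop2.6}/bin/spark-submit" \
    "--class $main_class" \
    "--master yarn" \
    "--deploy-mode cluster" \
    "--driver-memory 1g" \
    "--executor-memory 1g" \
    "--executor-cores 1" \
    "$app_jar" "$@"
}

# Dry run: print the assembled command instead of executing it.
build_submit_cmd org.apache.spark.examples.SparkPi \
  examples/target/scala-2.11/jars/spark-examples.jar 10
```

To actually submit, replace the `echo` with `exec`, or pipe the printed command to the shell after inspecting it.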
3、Comparison of spark-shell and spark-submit
1. Similarity: both live in the bin directory under the Spark installation.
2. Differences:
(1) spark-shell is interactive: it provides an IDE-like development environment on the command line where developers can write code directly. At runtime, it calls spark-submit under the hood.
(2) spark-submit is not interactive: it submits a jar, compiled and packaged in an IDE such as IDEA, to the cluster environment for execution.