Spark: Submitting Spark Jobs [spark-shell and spark-submit]

Spark jobs are submitted with the spark-shell and spark-submit commands.

For running test programs interactively, use spark-shell, Spark's interactive command line.

For submitting a packaged Spark application to run on a Spark cluster, use spark-submit.

1. spark-shell

spark-shell is the interactive shell program that ships with Spark. It is convenient for interactive programming: users can write Spark programs in Scala directly at this command line.
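
For example, a minimal word count can be typed line by line at the scala> prompt. This is only a sketch: the input path is a placeholder, so substitute a file that actually exists in your environment.

scala> val lines = sc.textFile("file:///tmp/input.txt")   // hypothetical input file
scala> val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
scala> counts.take(10).foreach(println)                   // print the first 10 (word, count) pairs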

1.1 Overview

1. spark-shell usage help

(py27) [root@master ~]# cd /usr/local/src/spark-2.0.2-bin-hadoop2.6
(py27) [root@master spark-2.0.2-bin-hadoop2.6]# cd bin
(py27) [root@master bin]# ./spark-shell --help
Usage: ./bin/spark-shell [options]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
      
  • spark-shell source (the main function, excerpted)
(py27) [root@master bin]# cat spark-shell
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

  • In spark-shell's main function, the spark-submit script is executed with --name set to "Spark shell". This explains why the application shows up as "Spark shell" in the web UI when we run spark-shell. It also specifies org.apache.spark.repl.Main as the main class, from which we can guess that a SparkContext is created for us; that is why a SparkContext object is automatically available when we launch spark-shell.
  • In other words, the source shows that spark-shell ultimately just runs the spark-submit command.
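
Because org.apache.spark.repl.Main builds the context, a SparkContext bound to sc and (in Spark 2.x) a SparkSession bound to spark are already available the moment the shell starts. A quick check, with output abbreviated:

scala> sc              // the SparkContext created by org.apache.spark.repl.Main
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...

scala> spark.version   // the pre-built SparkSession
res1: String = 2.0.2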

1.2 Launching

1. Launch spark-shell directly from the bin directory

./spark-shell


  • With no options, spark-shell starts in local mode by default, launching a SparkSubmit process on the local machine.

  • You can also pass the --master option (a quick way to confirm it from inside the shell is shown right after this list), for example:

    spark-shell --master local[N]   run the job locally, simulating N threads
    spark-shell --master local[*]   use all of the cores available on the current machine

  • Running it with no arguments is therefore equivalent to:

    spark-shell --master local[*]

  • To exit spark-shell, type :quit or press Ctrl+D.
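
A quick way to confirm, from inside the shell, which master was used and how much parallelism local[N] gives you (assuming the default configuration):

scala> sc.master              // the master URL the shell was started with, e.g. local[*]
scala> sc.defaultParallelism  // with local[N] this is N; with local[*], the number of available cores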

2. ./spark-shell --master spark://master:7077

  • Launches the shell against a Standalone cluster, i.e. in cluster mode.

3. ./spark-shell --master yarn-client

  • Launches the shell in YARN client mode.
  • Note that ./spark-shell --master yarn-cluster does not work in practice: cluster deploy mode is not applicable to an interactive shell and is only meaningful for spark-submit.

1.3 Typical use

  • spark-shell is mainly used for testing.
  • So it is usually started simply as ./spark-shell, entering local mode for testing.

2. spark-submit

Once an application has been packaged, it can be launched with the bin/spark-submit script. This script takes care of setting up Spark's classpath and dependencies, and it supports the different cluster managers and deploy modes.

2.1 Overview

spark-submit is mainly used to submit a compiled and packaged jar to the cluster environment for execution, much like the hadoop jar command in Hadoop: hadoop jar submits a MapReduce task, whereas spark-submit submits a Spark job. The script can set the Spark classpath and the application's dependencies, and it supports all of the cluster managers and deploy modes that Spark supports. Unlike spark-shell, it has no REPL (interactive programming environment); before running it you must specify the application's main class, the jar path, the arguments, and so on.
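
As an illustration of what the main class passed via --class looks like, here is a minimal, hypothetical Scala application; the package, object name, and logic are illustrative only and not taken from the original post. After packaging it into a jar, its fully qualified name example.SimpleApp would be given to spark-submit with --class:

package example

import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // The master and deploy mode are normally supplied on the spark-submit
    // command line (--master / --deploy-mode), so they are not hard-coded here.
    val spark = SparkSession.builder().appName("SimpleApp").getOrCreate()
    val sc = spark.sparkContext

    val sum = sc.parallelize(1 to 100).sum()   // a trivial job, just to exercise the cluster
    println(s"sum of 1..100 = $sum")

    spark.stop()                               // release the application's resources
  }
}

It would then be submitted with something like spark-submit --class example.SimpleApp --master yarn path/to/simple-app.jar (the jar path here is hypothetical).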

1. spark-submit usage help

(py27) [root@master bin]#  ./spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

  • spark-submit source
(py27) [root@master bin]# cat spark-submit
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
  • spark-submit is quite simple: it just runs the spark-class script with org.apache.spark.deploy.SparkSubmit as the main class and forwards all the arguments it receives to that class.

2.2 Basic syntax

  • Example: submitting a job to a Hadoop YARN cluster.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/target/scala-2.11/jars/spark-examples*.jar 10
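
Note that the trailing 10 is not a spark-submit option: everything after the jar path is passed to the main class as its args array. A simplified sketch (not the actual SparkPi source) of how a Pi-style main class might consume it:

object PiSketch {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder().appName("PiSketch").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2    // the trailing "10" arrives here
    val n = 100000 * slices
    val inside = spark.sparkContext.parallelize(1 to n, slices).filter { _ =>
      val (x, y) = (math.random, math.random)
      x * x + y * y <= 1                                      // point falls inside the unit circle
    }.count()
    println(s"Pi is roughly ${4.0 * inside / n}")
    spark.stop()
  }
}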

Explanation of the options:

  Option                  Description
  --class                 The application's main class (Java/Scala applications only)
  --master                The master URL, i.e. where to run the job, e.g. local, spark://host:port, yarn
  --deploy-mode           Launch the driver locally ("client") or inside the cluster ("cluster"); default is client
  --name                  Application name, shown in the Spark web UI
  --jars                  Comma-separated list of local jars, added to the driver and executor classpaths
  --packages              Maven coordinates of jars to include on the driver and executor classpaths
  --exclude-packages      Packages to exclude when resolving --packages, to avoid dependency conflicts
  --repositories          Additional remote repositories to search for the --packages coordinates
  --conf PROP=VALUE       Arbitrary Spark configuration property, e.g. --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m"
  --properties-file       File to load extra properties from; defaults to conf/spark-defaults.conf
  --driver-memory         Memory for the driver; default 1G
  --driver-java-options   Extra Java options to pass to the driver
  --driver-library-path   Extra library path entries to pass to the driver
  --driver-class-path     Extra class path entries to pass to the driver
  --driver-cores          Number of cores for the driver; default 1 (YARN or standalone only)
  --executor-memory       Memory per executor; default 1G
  --total-executor-cores  Total cores for all executors (standalone or Mesos only)
  --num-executors         Number of executors to launch; default 2 (YARN only)
  --executor-cores        Number of cores per executor (YARN or standalone)

3. Comparing spark-shell and spark-submit

1. Similarity: both scripts live in the spark/bin directory.

2. Differences:

(1) spark-shell is interactive: the console gives you an IDE-like environment in which developers can write code directly. At runtime it still calls spark-submit underneath to do the actual work.
(2) spark-submit is not interactive: it is used to submit a jar, compiled and packaged in IDEA or another IDE, to the cluster environment and run it.


Reposted from blog.csdn.net/weixin_45666566/article/details/112724328