Spark Official Documentation (1) - Submitting Applications
Spark Version: 1.6.2
spark-submit provides a unified interface for submitting applications to any of the supported cluster platforms, so you do not need to change your application's configuration when migrating between platforms. Spark supports three cluster managers: Standalone, Apache Mesos, and Hadoop YARN.
Bundling your application's dependencies
If your application depends on other projects, they must be packaged together with it, including any third-party libraries it relies on. Both sbt and Maven have assembly plugins for this. When building the assembly, mark the Hadoop and Spark dependencies as provided so they are not bundled into the jar, since the Hadoop and Spark libraries are supplied by the cluster at runtime. Then submit your application with the spark-submit tool in the Spark installation directory.
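For example, a minimal build.sbt sketch that marks the Spark dependency as provided so it is left out of the assembly (the project name and version numbers here are illustrative):

// build.sbt -- names and versions are illustrative
name := "my-spark-app"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"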
For a Python program, pass the dependent files with the --py-files option; if there are multiple Python files, it is recommended to package them as a .zip or .egg before submitting.
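A minimal sketch of such a submission (the zip and script names are illustrative):

./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_script.py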
Submitting applications with spark-submit
spark-submit can submit applications to any of the three cluster managers; the basic syntax is as follows:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Detailed parameters:
- --class: the entry point of your application, e.g. org.apache.spark.examples.SparkPi
- --master: the master URL of the cluster, e.g. local (local mode), spark://HOST:7077 (standalone mode), yarn-client
- --deploy-mode: whether to deploy the driver on a worker node (cluster) or run it locally as an external client (client, the default)
- --conf: arbitrary Spark configuration properties in key=value form (wrap the value in quotes if it contains spaces)
- application-jar: the path to the jar containing your application and its dependencies
- application-arguments: arguments passed to the main method of your application
There are also options specific to each cluster manager. With a Spark standalone cluster, for example, you can pass the --supervise flag so that the driver is automatically restarted if it exits with a non-zero code. The following are some common examples:
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000
Master URLs
Next, the values accepted by the spark-submit --master parameter are introduced; the main types are:
Master URL | Brief introduction |
---|---|
local | Run Spark locally with a single worker thread |
local[K] | Run Spark locally with K worker threads |
local[*] | Run Spark locally with as many worker threads as there are logical cores on the machine |
spark://HOST:PORT | Connect to the master of a Spark standalone cluster; the default port is 7077 |
mesos://HOST:PORT | Connect to a Mesos cluster; the default port is 5050 |
yarn | Connect to a YARN cluster in yarn-client or yarn-cluster mode, chosen with --deploy-mode; the cluster location is taken from the HADOOP_CONF_DIR or YARN_CONF_DIR variable |
Loading configuration from a file
Spark configuration can be loaded from a properties file, set in application code, or passed as spark-submit parameters. By default, spark-submit reads the settings in conf/spark-defaults.conf; refer to the Spark configuration documentation for the available defaults.
If spark.master is set in code, the --master parameter is ignored. In general, properties set explicitly on a SparkConf have the highest priority, followed by flags passed to spark-submit, and finally values from the configuration file: code > spark-submit parameters > configuration file.
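A minimal sketch of the highest-priority level (the application name and master value here are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Properties set directly on the SparkConf override both spark-submit flags
// and conf/spark-defaults.conf; the app name and master below are examples only.
val conf = new SparkConf()
  .setAppName("ConfPriorityExample")
  .setMaster("local[2]") // takes precedence over any --master flag
val sc = new SparkContext(conf)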
Advanced dependency management
Jars passed to spark-submit with the **--jars** option are disseminated with different strategies depending on the cluster and the URL scheme. Spark supports the following URL schemes (an example mixing them follows the list):
- file: - absolute paths and file:/ URIs; every worker copies the file over HTTP from a file server run on the driver node;
- hdfs:, http:, https:, ftp: - the jar file is pulled down to the local machine using the corresponding protocol;
- local: - a URL starting with local:/ is expected to already exist as a local file on every worker node, so no network IO is triggered.
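A minimal sketch mixing these schemes (the class name, jar names, and paths are illustrative):

./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars local:/opt/libs/native-heavy.jar,hdfs:///libs/shared-util.jar \
  /path/to/my-app.jar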
Since every worker copies these files locally, cleanup becomes a concern over time. YARN handles cleanup automatically on a regular basis; on a Spark standalone cluster, the retention period can be configured with spark.worker.cleanup.appDataTtl, which defaults to 7 days.
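On a standalone worker this is typically set through SPARK_WORKER_OPTS; a sketch (the TTL shown is simply the 7-day default expressed in seconds):

# conf/spark-env.sh on each standalone worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"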
Users can also include additional dependencies by supplying their Maven coordinates with --packages; transitive dependencies are resolved and included as well. Extra repositories can be added with --repositories. These parameters are supported by pyspark, spark-shell, and spark-submit.
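A minimal sketch of pulling a package from Maven Central (the coordinates, class name, and jar path are illustrative):

./bin/spark-submit \
  --packages com.databricks:spark-csv_2.10:1.4.0 \
  --class com.example.MyApp \
  /path/to/my-app.jar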
Estimating the storage occupied by an RDD in Spark
(Sample part of the records and extrapolate the space occupied by the RDD from the total record count):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

def getTotalSize(rdd: RDD[Row]): Long = {
  // Number of rows to sample; this could be made a parameter
  val NO_OF_SAMPLE_ROWS = 10L
  val totalRows = rdd.count()
  var totalSize = 0L
  if (totalRows > NO_OF_SAMPLE_ROWS) {
    // Sample roughly NO_OF_SAMPLE_ROWS rows and extrapolate to the whole RDD
    val sampleRDD = rdd.sample(true, NO_OF_SAMPLE_ROWS.toDouble / totalRows)
    val sampleRDDSize = getRDDSize(sampleRDD)
    totalSize = sampleRDDSize * totalRows / NO_OF_SAMPLE_ROWS
  } else {
    // The RDD is smaller than the sample size, so just measure it directly
    totalSize = getRDDSize(rdd)
  }
  totalSize
}
def getRDDSize(rdd: RDD[Row]): Long = {
  // Collect the rows to the driver and sum the estimated in-memory size of each one
  var rddSize = 0L
  val rows = rdd.collect()
  for (i <- 0 until rows.length) {
    rddSize += SizeEstimator.estimate(rows(i).toSeq.map { value => value.asInstanceOf[AnyRef] })
  }
  rddSize
}
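A hypothetical usage sketch, assuming a SQLContext named sqlContext is available and using a purely illustrative table name:

// Estimate the size of the rows returned by a query (table name is hypothetical)
val df = sqlContext.sql("SELECT * FROM some_table")
val estimatedBytes = getTotalSize(df.rdd) // DataFrame.rdd is an RDD[Row]
println("Estimated RDD size: " + estimatedBytes + " bytes")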
More information
Once your application is deployed, the cluster mode overview describes the components involved in distributed execution and how to monitor and debug your program.