Spark Official Documentation (1) - Submitting Applications
Spark Version: 1.6.2
spark-submit provides a unified interface for submitting applications to any of the supported cluster platforms, so you do not need to change your application's configuration when migrating between platforms. Spark supports three cluster managers: Standalone, Apache Mesos, and Hadoop YARN.
Bundling your application's dependencies
If your application depends on other projects, they must be packaged together with it, including any third-party libraries it relies on. Both sbt and Maven have assembly plugins for this. When building the assembly, mark the Hadoop and Spark dependencies as provided so they are not bundled into the jar, since the Hadoop and Spark libraries are supplied by the cluster at runtime. Then submit your application with the spark-submit tool in the Spark installation directory.
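For example, a minimal build.sbt sketch that marks the Spark dependency as provided so it is left out of the assembly (the project name and version numbers here are illustrative):

// build.sbt -- names and versions are illustrative
name := "my-spark-app"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"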
For a Python program, pass the dependent files with the --py-files option; if there are multiple Python files, it is recommended to package them as a .zip or .egg before submitting.
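A minimal sketch of such a submission (the zip and script names are illustrative):

./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_script.py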
Submitting applications with spark-submit
spark-submit can submit applications to any of the three cluster managers; the basic syntax is as follows:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Detailed parameters:
- --class: the entry point of your application, e.g. org.apache.spark.examples.SparkPi
- --master: the master URL of the cluster, e.g. local (local mode), spark://HOST:7077 (standalone mode), yarn-client
- --deploy-mode: whether to deploy the driver on a worker node (cluster) or run it locally as an external client (client, the default)
- --conf: arbitrary Spark configuration properties in key=value form (wrap the value in quotes if it contains spaces)
- application-jar: the path to the jar containing your application and its dependencies
- application-arguments: arguments passed to the main method of your application
There are also options specific to each cluster manager. With a Spark standalone cluster, for example, you can pass the --supervise flag so that the driver is automatically restarted if it exits with a non-zero code. The following are some common examples:
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000
Master URLs
Next, the values accepted by the spark-submit --master parameter are introduced; the main types are:
Master URL | Brief introduction |
---|---|
local | Run Spark locally with a single worker thread |
local[K] | Run Spark locally with K worker threads |
local[*] | Run Spark locally with as many worker threads as there are logical cores on the machine |
spark://HOST:PORT | Connect to the master of a Spark standalone cluster; the default port is 7077 |
mesos://HOST:PORT | Connect to a Mesos cluster; the default port is 5050 |
yarn | Connect to a YARN cluster in yarn-client or yarn-cluster mode, chosen with --deploy-mode; the cluster location is taken from the HADOOP_CONF_DIR or YARN_CONF_DIR variable |
Loading configuration from a file
Spark configuration can be loaded from a properties file, set in application code, or passed as spark-submit parameters. By default, spark-submit reads the settings in conf/spark-defaults.conf; refer to the Spark configuration documentation for the available defaults.
If spark.master is set in code, the --master parameter is ignored. In general, properties set explicitly on a SparkConf have the highest priority, followed by flags passed to spark-submit, and finally values from the configuration file: code > spark-submit parameters > configuration file.
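A minimal sketch of the highest-priority level (the application name and master value here are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Properties set directly on the SparkConf override both spark-submit flags
// and conf/spark-defaults.conf; the app name and master below are examples only.
val conf = new SparkConf()
  .setAppName("ConfPriorityExample")
  .setMaster("local[2]") // takes precedence over any --master flag
val sc = new SparkContext(conf)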
Advanced dependency management
Jars passed to spark-submit with the **--jars** option are disseminated with different strategies depending on the cluster and the URL scheme. Spark supports the following URL schemes (an example mixing them follows the list):
- file: - absolute paths and file:/ URIs; every worker copies the file over HTTP from a file server run on the driver node;
- hdfs:, http:, https:, ftp: - the jar file is pulled down to the local machine using the corresponding protocol;
- local: - a URL starting with local:/ is expected to already exist as a local file on every worker node, so no network IO is triggered.
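A minimal sketch mixing these schemes (the class name, jar names, and paths are illustrative):

./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars local:/opt/libs/native-heavy.jar,hdfs:///libs/shared-util.jar \
  /path/to/my-app.jar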
Since every worker copies these files locally, cleanup becomes a concern over time. YARN handles cleanup automatically on a regular basis; on a Spark standalone cluster, the retention period can be configured with spark.worker.cleanup.appDataTtl, which defaults to 7 days.
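On a standalone worker this is typically set through SPARK_WORKER_OPTS; a sketch (the TTL shown is simply the 7-day default expressed in seconds):

# conf/spark-env.sh on each standalone worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"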
Users can also include additional dependencies by supplying their Maven coordinates with --packages; transitive dependencies are resolved and included as well. Extra repositories can be added with --repositories. These parameters are supported by pyspark, spark-shell, and spark-submit.
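A minimal sketch of pulling a package from Maven Central (the coordinates, class name, and jar path are illustrative):

./bin/spark-submit \
  --packages com.databricks:spark-csv_2.10:1.4.0 \
  --class com.example.MyApp \
  /path/to/my-app.jar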
Estimating the storage occupied by an RDD in Spark
(Sample part of the records and extrapolate the space occupied by the RDD from the total record count):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

def getTotalSize(rdd: RDD[Row]): Long = {
  // Number of rows to sample; this could be made a parameter
  val NO_OF_SAMPLE_ROWS = 10L
  val totalRows = rdd.count()
  var totalSize = 0L
  if (totalRows > NO_OF_SAMPLE_ROWS) {
    // Sample roughly NO_OF_SAMPLE_ROWS rows and extrapolate to the whole RDD
    val sampleRDD = rdd.sample(true, NO_OF_SAMPLE_ROWS.toDouble / totalRows)
    val sampleRDDSize = getRDDSize(sampleRDD)
    totalSize = sampleRDDSize * totalRows / NO_OF_SAMPLE_ROWS
  } else {
    // The RDD is smaller than the sample size, so just measure it directly
    totalSize = getRDDSize(rdd)
  }
  totalSize
}
def getRDDSize(rdd: RDD[Row]): Long = {
  // Collect the rows to the driver and sum the estimated in-memory size of each one
  var rddSize = 0L
  val rows = rdd.collect()
  for (i <- 0 until rows.length) {
    rddSize += SizeEstimator.estimate(rows(i).toSeq.map { value => value.asInstanceOf[AnyRef] })
  }
  rddSize
}
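A hypothetical usage sketch, assuming a SQLContext named sqlContext is available and using a purely illustrative table name:

// Estimate the size of the rows returned by a query (table name is hypothetical)
val df = sqlContext.sql("SELECT * FROM some_table")
val estimatedBytes = getTotalSize(df.rdd) // DataFrame.rdd is an RDD[Row]
println("Estimated RDD size: " + estimatedBytes + " bytes")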
More information
Once your application is deployed, the cluster mode overview describes the components involved in distributed execution and how to monitor and debug your program.