spark official document (1) - Submit application

Spark Version: 1.6.2

spark-submit provides a unified interface for submitting applications to any of the supported cluster managers, so you do not need to change your application's configuration when moving between platforms. Spark supports three cluster managers: Standalone, Apache Mesos, and Hadoop YARN.

Bundling your application's dependencies

If your application depends on other projects, you need to package those dependencies together with your application, including any third-party libraries. Both sbt and Maven provide assembly plugins for this. When building the assembly jar, mark the Hadoop and Spark dependencies as provided: they do not need to be bundled, because the cluster environment supplies them at runtime. You then submit the assembled jar with the spark-submit script in the Spark installation directory.
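For reference, a minimal build.sbt sketch for use with the sbt-assembly plugin might mark Spark as a provided dependency (the project name and versions below are illustrative placeholders, not taken from this document; the sbt-assembly plugin itself is declared in project/plugins.sbt):

// build.sbt (sketch): Spark is marked "provided" so it is not bundled into the assembly jar
name := "my-spark-app"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"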

For Python applications, pass any dependent Python files with the --py-files option; if there are many Python files, it is recommended to package them into a .zip or .egg file and submit that instead.
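For instance, a submission with zipped Python dependencies could look like the following sketch (deps.zip and my_script.py are placeholder names):

./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_script.py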

Submitting applications with spark-submit

spark-submit can submit applications to any of the three cluster managers. The general syntax is as follows:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Detailed parameters:

  • --class:  the entry point of your application, e.g. org.apache.spark.examples.SparkPi
  • --master:  the master URL of the cluster, e.g. local (run locally), spark://HOST:7077 (standalone mode), yarn-client
  • --deploy-mode:  whether to deploy the driver on a worker node (cluster) or run it locally as an external client (client); the default is client
  • --conf:  arbitrary Spark configuration in key=value form; values containing spaces should be wrapped in quotes (see the example after this list)
  • application-jar:  the path to the jar containing your application
  • application-arguments:  arguments passed to the main method of your application
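As a small sketch of the quoting rule for --conf, a hypothetical submission that sets a configuration value containing spaces would quote the whole key=value pair (the property shown is a standard Spark setting; the rest of the command reuses the SparkPi example):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/examples.jar \
  100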

There are also options specific to each cluster manager. For example, with a Spark standalone cluster in cluster deploy mode you can pass --supervise, so that the driver is automatically restarted if it exits with a non-zero exit code. The following are some common examples:

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

Run on a Spark standalone cluster in client deploy mode

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

Run on a Spark standalone cluster in cluster deploy mode with supervise

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

Run on a YARN cluster

export HADOOP_CONF_DIR=XXX
# --deploy-mode can be client for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Run a Python application on a Spark standalone cluster

./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

Run on a Mesos cluster in cluster deploy mode with supervise

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

Master URLs

The --master parameter of spark-submit accepts the following kinds of master URLs:

  • local:  run Spark locally with one worker thread
  • local[K]:  run Spark locally with K worker threads
  • local[*]:  run Spark locally with as many worker threads as there are logical cores on the machine
  • spark://HOST:PORT:  connect to the master of a Spark standalone cluster; the default port is 7077
  • mesos://HOST:PORT:  connect to a Mesos cluster; the default port is 5050
  • yarn:  connect to a YARN cluster in client or cluster mode depending on --deploy-mode; the cluster location is taken from the HADOOP_CONF_DIR or YARN_CONF_DIR variable

Loading configuration from a file

Spark configuration can be set in the application code (through SparkConf), passed as spark-submit parameters, or loaded from a configuration file. By default, spark-submit reads conf/spark-defaults.conf; refer to the Spark configuration documentation for the available defaults.
If spark.master is set in code, the --master parameter is ignored. In general, properties set explicitly on a SparkConf take the highest priority, then flags passed to spark-submit, then values in the defaults file: code > spark-submit parameters > configuration file.
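The following is a minimal sketch of this precedence (the application name and master value are placeholders): a master set on the SparkConf in code wins over --master on the command line, which in turn wins over spark.master in conf/spark-defaults.conf.

import org.apache.spark.{SparkConf, SparkContext}

object PrecedenceDemo {
  def main(args: Array[String]): Unit = {
    // Setting the master here overrides --master on the spark-submit command line,
    // which itself overrides spark.master in conf/spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("precedence-demo")  // hypothetical application name
      .setMaster("local[2]")          // highest priority: set in code
    val sc = new SparkContext(conf)
    println(sc.master)                // prints local[2] even if --master was passed
    sc.stop()
  }
}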

Advanced dependency management

Jars passed to spark-submit with the --jars option are distributed to the cluster using different strategies depending on the URL scheme. Spark supports the following URL schemes:

  • file:  absolute paths and file:/ URIs are served by the driver's HTTP file server, and every worker copies the file from the driver node;
  • hdfs:, http:, https:, ftp:  the jar or file is pulled from the URI using the corresponding protocol;
  • local:  a URI starting with local:/ is expected to already exist as a local file on every worker node, so no network IO is triggered.

Since each worker copies the files to its local disk, cleanup can become an issue over time. YARN handles this cleanup automatically on a regular basis; on a Spark standalone cluster, automatic cleanup can be enabled through the spark.worker.cleanup.appDataTtl property, which defaults to 7 days.
Users can also pull in additional dependencies with --packages, whose transitive dependencies are included as well; extra repositories can be added with --repositories. These options are supported by pyspark, spark-shell, and spark-submit alike, as in the sketch below.
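For instance, a submission combining these options might look like the following (the jar paths and the Maven coordinate are illustrative placeholders, not taken from this document):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --jars local:/opt/libs/common.jar,hdfs://namenode:8020/libs/extra.jar \
  --packages com.databricks:spark-csv_2.10:1.5.0 \
  /path/to/examples.jar \
  1000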

 

Estimating the storage occupied by an RDD in Spark

(Sample a few records, measure the space they occupy, and extrapolate the total size of the RDD from its record count):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

def getTotalSize(rdd: RDD[Row]): Long = {
  // Number of rows to sample; this could be made a parameter
  val NO_OF_SAMPLE_ROWS = 10L
  val totalRows = rdd.count()
  var totalSize = 0L
  if (totalRows > NO_OF_SAMPLE_ROWS) {
    // sample() expects a fraction, so convert the desired row count into one
    val fraction = NO_OF_SAMPLE_ROWS.toDouble / totalRows
    val sampleRDD = rdd.sample(withReplacement = false, fraction)
    val sampleRows = sampleRDD.count()
    if (sampleRows > 0) {
      // Extrapolate from the sampled rows to the full RDD
      totalSize = getRDDSize(sampleRDD) * totalRows / sampleRows
    }
  } else {
    // The RDD is smaller than the sample size, so just measure it directly
    totalSize = getRDDSize(rdd)
  }
  totalSize
}

def getRDDSize(rdd: RDD[Row]): Long = {
  var rddSize = 0L
  // collect() brings the rows to the driver, so only use this on small (sampled) RDDs
  val rows = rdd.collect()
  for (i <- 0 until rows.length) {
    rddSize += SizeEstimator.estimate(rows.apply(i).toSeq.map { value => value.asInstanceOf[AnyRef] })
  }
  rddSize
}

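A minimal usage sketch, assuming a DataFrame named df already exists (the name is a placeholder):

// Estimate the storage occupied by the rows of a DataFrame
val estimatedBytes = getTotalSize(df.rdd)
println(s"Estimated RDD size: $estimatedBytes bytes")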

 

More information

Once your application is deployed, the cluster mode overview describes the components involved in distributed execution and how to monitor and debug programs.


Reprinted from blog.csdn.net/Mei_ZS/article/details/89852038