Spark 1.0.0 on YARN Deployment

1: Overview
      Rather than speaking of "deploying Spark on YARN", it is more accurate to say that the Spark application runs in a YARN environment. Depending on where the driver (SparkContext) of the Spark application runs, Spark on YARN has two modes:
  • yarn-client mode, in which the Spark driver runs on the client and then requests executors from YARN to run the tasks.
  • yarn-cluster mode, in which the Spark driver is started inside the YARN cluster as the ApplicationMaster, and the ApplicationMaster then requests resources from the RM to start the executors that run the tasks, as described in the Spark on YARN part of Spark 1.0.0 Running Architecture Basic Concepts.
      In Spark 1.0.0, applications are deployed to YARN with the bin/spark-submit tool (see Spark 1.0.0 Application Deployment Tool spark-submit for details). When deploying a Spark application on YARN, it is not necessary to pass a URL as the value of the master parameter, as in Standalone and Mesos modes; since the Spark application can obtain the relevant information from the Hadoop configuration files, it suffices to assign yarn-cluster or yarn-client to master.
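      To illustrate the difference, here is a minimal Scala sketch of setting the master through SparkConf; the host name is a placeholder, and in practice the master is usually passed to spark-submit with --master, as in the commands later in this post:

    import org.apache.spark.{SparkConf, SparkContext}

    // Standalone mode needs the cluster's master URL (placeholder host):
    //   conf.setMaster("spark://hadoop1:7077")
    // On YARN no URL is needed, because the cluster location is read from
    // the Hadoop configuration files pointed to by HADOOP_CONF_DIR:
    val conf = new SparkConf().setMaster("yarn-client").setAppName("YarnExample")
    val sc = new SparkContext(conf)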
 
2: Configuration
      Precisely because this information is obtained from the Hadoop configuration files, the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR must be set. The settings described in Spark 1.0.0 Properties Configuration also apply to Spark on YARN, and Spark additionally provides some YARN-specific configuration items:
  • Configuration of environment variables
    • SPARK_YARN_USER_ENV: users can set environment variables for Spark on YARN through this parameter, for example SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar".
  • Configuration of system properties
YARN-specific properties can be set through SparkConf or through the conf/spark-defaults.conf file (see Spark 1.0.0 Properties Configuration):
    • spark.yarn.applicationMaster.waitTries (default: 10): the number of times the RM waits for the Spark AppMaster to start, that is, the number of SparkContext initialization attempts; if this value is exceeded, the startup fails.
    • spark.yarn.submit.file.replication (default: 3): the HDFS replication factor for files uploaded to HDFS by the application.
    • spark.yarn.preserve.staging.files (default: false): set to true to keep staging files instead of deleting them after the job ends.
    • spark.yarn.scheduler.heartbeat.interval-ms (default: 5000): the interval at which the Spark AppMaster sends heartbeats to the YARN RM.
    • spark.yarn.max.executor.failures (default: 2 times the number of executors): the maximum number of executor failures before the application is declared failed.
    • spark.yarn.historyServer.address (default: none): the address of the Spark history server (do not include http://). This address is submitted to the YARN RM when the application finishes, so that information can be linked from the RM UI to the history server UI.
For more details, see Spark 1.0.0 Properties Configuration.
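      Since these YARN-specific properties can be set through SparkConf as well as conf/spark-defaults.conf, here is a minimal Scala sketch; the values are examples only, not recommendations:

    import org.apache.spark.SparkConf

    // The same entries could instead go in conf/spark-defaults.conf, e.g.:
    //   spark.yarn.preserve.staging.files  true
    val conf = new SparkConf()
      .set("spark.yarn.submit.file.replication", "2")          // fewer HDFS copies of uploaded files
      .set("spark.yarn.preserve.staging.files", "true")        // keep staging files for debugging
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "3000")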
 
3: Experiment Environment
For the experiment environment, see Spark 1.0.0 Development Environment Quick Setup.
The following experiments cover the deployment of a Scala program in YARN and then the deployment of a Python program in YARN.
 
4: Deploying a Scala Program in YARN
      A Spark application can be deployed either from the client or from within the cluster. If the network between the client and the cluster is poor, first copy the Spark application to a node in the cluster and deploy it from there; this matters especially for yarn-client deployment, because there is a large amount of communication between the driver and the executors, and a bad network would affect the running application. This experiment deploys from the client, using the program package week6.jar built in Spark 1.0.0 Multi-Language Programming: the Scala Implementation.
  • Deploy in yarn-client mode
    • ./bin/spark-submit --master yarn-client --class week6.SogouA --executor-memory 3g --driver-memory 1g week6.jar hdfs://hadoop1:8000/dataguru/data/mini.txt
    • ./bin/spark-submit --master yarn-client --class week6.SogouB --executor-memory 3g --num-executors 2 --driver-memory 1g week6.jar hdfs://hadoop1:8000/dataguru/data/mini.txt
    • ./bin/spark-submit --master yarn-client --class week6.SogouC --executor-memory 3g --executor-cores 4 --num-executors 3 --driver-memory 1g week6.jar hdfs://hadoop1:8000/dataguru/data/mini.txt
    • The commands above run three classes; the parameters can be set by the user (see Spark 1.0.0 Application Deployment Tool spark-submit for details). Different parameters are used here only to show that resources can be requested from YARN according to the state of the cluster.
    • With yarn-client mode, since the driver runs on the client, the driver's status can be viewed through the web UI, in this example at http://wyy:4040, while YARN itself is accessed at http://hadoop1:8088.
    • With yarn-client mode, since the driver runs on the client, the program's results can be displayed on the client.
    • After the client-side driver submits the application to YARN, YARN starts the AppMaster and then the executors. Both the AppMaster and the executors run inside containers, whose default memory is 1 GB (defined by the parameter yarn.scheduler.minimum-allocation-mb). The AppMaster is allocated driver-memory and each executor is allocated executor-memory, so the memory requested from YARN is (driver-memory + 1) + (executor-memory + 1) * num-executors. Running week6.SogouC above therefore uses (1 + 1) + (3 + 1) * 3 = 14 GB, as shown in the figure below (a small worked check follows this list):
 
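      As a quick check of the formula above, here is a tiny Scala sketch that recomputes the memory requested by the week6.SogouB and week6.SogouC commands; all numbers come from the command-line parameters used above:

    // requested GB = (driver-memory + 1) + (executor-memory + 1) * num-executors
    val sogouB = (1 + 1) + (3 + 1) * 2  // 10 GB with 2 executors
    val sogouC = (1 + 1) + (3 + 1) * 3  // 14 GB with 3 executors, as in the figure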
  • Deploy in yarn-cluster mode
    • ./bin/spark-submit --master yarn-cluster --class week6.SogouA --executor-memory 3g --driver-memory 1g week6.jar hdfs://hadoop1:8000/dataguru/data/mini.txt
    • ./bin/spark-submit --master yarn-cluster --class week6.SogouB --executor-memory 3g --num-executors 2 --driver-memory 1g week6.jar hdfs://hadoop1:8000/dataguru/data/mini.txt
    • ./bin/spark-submit --master yarn-cluster --class week6.SogouC --executor-memory 3g --executor-cores 4 --num-executors 3 --driver-memory 1g week6.jar hdfs://hadoop1:8000/dataguru/data/mini.txt
    • With yarn-cluster mode, since the driver runs inside YARN, viewing the driver's status through the web UI requires clicking the Tracking UI link of the job in YARN.
    • With yarn-cluster mode, since the driver runs inside YARN, the program's results cannot be displayed on the client, so it is best to save the results to HDFS (see the sketch after this list); what the client terminal shows is the status of the YARN job.
    • The memory allocation in YARN is the same as with yarn-client mode, as shown in the figure below:
 
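      Because yarn-cluster mode cannot display results on the client, here is a minimal Scala sketch of writing results to HDFS instead of printing them; the class name, field splitting, and output path are placeholders modeled on this experiment, not the actual week6 source:

    import org.apache.spark.{SparkConf, SparkContext}

    object SaveToHdfsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SaveToHdfsExample"))
        // args(0): input, e.g. hdfs://hadoop1:8000/dataguru/data/mini.txt
        val lines = sc.textFile(args(0))
        val counts = lines.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)
        // In yarn-cluster mode, collect() + println would be invisible on the client,
        // so write the results to HDFS where they can be inspected afterwards.
        counts.saveAsTextFile(args(1))  // e.g. hdfs://hadoop1:8000/dataguru/output
        sc.stop()
      }
    }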
5: Deploying a Python Program in YARN
      Deploying a Python program in YARN is the same as deploying a Scala program package; only the commands differ slightly:
  • Deploy in yarn-client mode
    • ./bin/spark-submit --master yarn-client --executor-memory 3g --driver-memory 1g SogouA.py hdfs://hadoop1:8000/dataguru/data/mini.txt
    • ./bin/spark-submit --master yarn-client --executor-memory 3g --num-executors 2 --driver-memory 1g SogouA.py hdfs://hadoop1:8000/dataguru/data/mini.txt
    • ./bin/spark-submit --master yarn-client --executor-memory 3g --executor-cores 4 --num-executors 3 --driver-memory 1g SogouA.py hdfs://hadoop1:8000/dataguru/data/mini.txt
  • Deploy in yarn-cluster mode
    • Not yet implemented in Spark 1.0.0
