Spark Series, Operating Modes (3): Yarn Mode Configuration (Detailed)

yarn mode

00_Introduction

The Spark client connects directly to Yarn, so there is no need to build a separate Spark cluster. There are two modes, yarn-client and yarn-cluster; the main difference is where the Driver program runs.

Yarn-client: the Driver program runs on the client machine. This mode is suitable for interactive use and debugging, where you want to see the application's output immediately.

Yarn-cluster: the Driver program runs inside the ApplicationMaster (AM) started by the ResourceManager (RM). This mode is suitable for production environments.
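As a sketch (assuming the SparkPi example jar that ships with the spark-2.4.0-bin-hadoop2.7 distribution), the two modes are selected purely by the --deploy-mode flag; the application itself does not change:

```shell
# Driver runs on the local client machine (interactive use / debugging):
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100

# Driver runs inside the ApplicationMaster on the cluster (production):
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100
```

In cluster mode the output of the job appears in the ApplicationMaster's logs rather than on the client console.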

The Yarn operating mode is shown in the figure:

01_Configuration

1.1 Modify the Hadoop configuration file yarn-site.xml and distribute it to the other nodes.

Because our test environment uses virtual machines with very little memory, we disable Yarn's memory checks so that future tasks are not accidentally killed.

cd /usr/hadoop/hadoop-2.10.0/etc/hadoop/
vi yarn-site.xml 
# Add the following between the <configuration> tags
<!-- Whether to start a thread that checks the amount of physical memory each
     task is using. If a task exceeds its allocation it is killed. Default: true -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
<!-- Whether to start a thread that checks the amount of virtual memory each
     task is using. If a task exceeds its allocation it is killed. Default: true -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
# Distribute to the other nodes
scp -r yarn-site.xml slave1:/usr/hadoop/hadoop-2.10.0/etc/hadoop/
scp -r yarn-site.xml slave2:/usr/hadoop/hadoop-2.10.0/etc/hadoop/
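NodeManagers read yarn-site.xml only at startup, so (as a sketch, assuming the install path used above) restart YARN after distributing the file so the new settings take effect:

```shell
# Restart YARN so every NodeManager picks up the new memory-check settings
/usr/hadoop/hadoop-2.10.0/sbin/stop-yarn.sh
/usr/hadoop/hadoop-2.10.0/sbin/start-yarn.sh
```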

1.2 Copy the Spark installation directory and name the copy spark-yarn

cp -r spark-2.4.0-bin-hadoop2.7 spark-yarn

1.3 Modify the spark-env.sh file

cd /usr/hadoop/spark-yarn/conf
mv spark-env.sh.template spark-env.sh
# rename the remaining .template files as well
for i in *.template; do mv ${i} ${i%.*}; done
vi spark-env.sh
# add the following line
HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.10.0/etc/hadoop
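The rename loop above relies on the shell's `${i%.*}` parameter expansion, which strips the shortest trailing dot-suffix from a value. A small self-contained sketch of what it does:

```shell
# ${i%.*} removes the last dot-suffix (here ".template") from the filename
i="spark-env.sh.template"
echo "${i%.*}"   # prints spark-env.sh
```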

1.4 Modify the environment variables

vi /etc/profile
export SPARK_HOME=/usr/hadoop/spark-yarn
export PATH=$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin
source /etc/profile
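To confirm the new variables are active in the current shell (names as defined above), for example:

```shell
# Confirm SPARK_HOME is set and the Spark binaries are on the PATH
echo $SPARK_HOME
which spark-submit
spark-submit --version
```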

1.5 Add the history service

1.5.1 Configure spark-env.sh

vi spark-env.sh
# append the following line
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=30 -Dspark.history.fs.logDirectory=hdfs://node1:8020/spark-logs-yarn"

1.5.2 Create the log directory on HDFS (HDFS must be started first)

bin/hadoop fs -mkdir /spark-logs-yarn

1.5.3 Configure spark-defaults.conf

vi spark-defaults.conf 
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://node1:8020/spark-logs-yarn
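With the log directory and configuration in place, the history server can be started from the Spark sbin directory (a sketch; port 18080 comes from the SPARK_HISTORY_OPTS set above):

```shell
# Start the Spark history server; it reads events from /spark-logs-yarn on HDFS
$SPARK_HOME/sbin/start-history-server.sh
# Then browse to http://node1:18080 to view finished applications
```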

1.6 Example

The spark-shell log shows that the spark-shell --master yarn-client form has been deprecated since Spark 2.0; use spark-shell --master yarn --deploy-mode client instead.

Note:
(1) The HDFS and YARN clusters must be started before submitting a task.
(2) For questions about the port numbers used by the Hadoop UI pages, refer to the article Big Data Platform - Hadoop Environment Configuration.

(1) In the shell

spark-shell --master yarn --deploy-mode client
scala> val rdd=sc.parallelize(1 to 100,5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.count
res0: Long = 100   

(2) With spark-submit from the spark-yarn directory

# Submit the SparkPi example; the jar ships with Spark under examples/jars
bin/spark-submit \
--master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
./examples/jars/spark-examples_2.11-2.4.0.jar \
100
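After a job is submitted, its status can also be checked from the YARN side with the standard yarn CLI (assumes YARN is running):

```shell
# List applications YARN has seen, including finished ones
yarn application -list -appStates ALL
```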

02_Summary

(1) Standalone mode and Yarn mode cannot be configured on the same copy of Spark: the Spark environment variables will conflict, and the Yarn UI will become inaccessible.

(2) Yarn mode does not require a Spark cluster; just start Hadoop (HDFS and YARN) before using it.


Origin blog.csdn.net/qq_46009608/article/details/108911193