Spark study notes (3) - Standalone mode

The previous notes covered Local mode, but Local mode has few practical uses; it mainly exists to make learning and testing convenient. In a real production environment, Standalone mode is more appropriate.

1. Basic overview

Despite the name, Standalone is not a single-machine mode; it is a cluster mode built on Spark's own independent scheduler, which means it is a deployment mode unique to Spark. It offers two submission modes, client and cluster, and the main difference between them is where the Driver program runs. How to understand this? If the Driver starts on the machine from which the task is submitted, that is client mode; if the Driver starts on a Worker node inside the cluster, that is cluster mode.

To put it plainly, in Standalone mode Spark schedules its own cluster and does not rely on Yarn or Mesos. So there is no ResourceManager, NodeManager, or Container as in Yarn; the corresponding concepts in Spark are Master, Worker, and Executor.
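To make the client/cluster distinction concrete, here is a minimal sketch of how the same job could be submitted in either mode once the cluster from section 2 is running. Only the --deploy-mode flag changes (client is the default); in cluster mode the jar path must be reachable from the Worker that ends up hosting the Driver, which works here because the spark directory is synced to every node under /opt/module/spark.

# Client mode (default): the Driver starts on the machine where spark-submit runs
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.1.1.jar 100

# Cluster mode: the Driver is launched on one of the Worker nodes inside the cluster
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--deploy-mode cluster \
/opt/module/spark/examples/jars/spark-examples_2.11-2.1.1.jar 100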

Here is a diagram illustrating how Standalone mode operates:

2. Installation

1) Modify the slaves file and add the worker nodes:

[simon@hadoop102 conf]$ vim slaves

hadoop102
hadoop103
hadoop104

2) Modify the spark-env.sh file and add the following configuration:

[simon@hadoop102 conf]$ vim spark-env.sh

SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077

3) Distribute the spark directory to the other nodes

[simon@hadoop102 module]$ xsync spark/

4) Start the cluster

[simon@hadoop102 spark]$ sbin/start-all.sh

# Check the startup output
hadoop103:   JAVA_HOME is not set
hadoop103: full log in /opt/module/spark/logs/spark-simon-org.apache.spark.deploy.worker.Worker-1-hadoop103.out

It reports the error JAVA_HOME is not set. This can be fixed by adding the following line to the spark-config.sh file in the sbin directory:

export JAVA_HOME=/opt/module/jdk1.8.0_144

Then restart the cluster:

[simon@hadoop102 spark]$ sbin/start-all.sh

# Check the startup status
[simon@hadoop102 spark]$ jpsall
--------------------- hadoop102 -------------------------------
4755 NameNode
4900 DataNode
5704 NodeManager
6333 Master
6623 Worker
--------------------- hadoop103 -------------------------------
8342 DataNode
9079 NodeManager
10008 Worker
8893 ResourceManager
--------------------- hadoop104 -------------------------------
8882 NodeManager
8423 SecondaryNameNode
9560 Worker
8347 DataNode

We can see that the Spark cluster has started successfully: hadoop102 is the Master node, and each of the three nodes runs a Worker.

5) Run the official example:

[simon@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

The difference from Local mode is that the master node is specified explicitly.

Execution result:
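For comparison, a minimal sketch of the same example submitted in Local mode; only the --master value changes (local[2] runs everything in a single JVM with two threads, no cluster required):

[simon@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[2] \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100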

3. JobHistoryServer configuration

If we want to view the logs of completed tasks, we also need to configure the history server.

1) Modify the spark-defaults.conf file to enable event logging:

[simon@hadoop102 conf]$ vi spark-defaults.conf
spark.eventLog.enabled           true
# The directory must be created in advance
spark.eventLog.dir               hdfs://hadoop102:9000/directory  

2) Create a folder on HDFS

[simon@hadoop102 hadoop]$ hadoop fs -mkdir /directory

3) Modify the spark-env.sh file and add the following configuration:

[simon@hadoop102 conf]$ vi spark-env.sh

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 
-Dspark.history.retainedApplications=30 
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9000/directory"

# Parameter descriptions:
spark.eventLog.dir: all information generated while an application runs is recorded under the path specified by this property;
spark.history.ui.port=18080: the web UI is accessed on port 18080;
spark.history.fs.logDirectory=hdfs://hadoop102:9000/directory: with this property set, there is no need to specify the path explicitly when running start-history-server.sh; the Spark History Server page only shows information from this path;
spark.history.retainedApplications=30: the number of applications whose history is retained; once this limit is exceeded, information about the oldest applications is deleted. This is the number of applications kept in memory, not the number displayed on the page.

4) Distribute the configuration files

[simon@hadoop102 conf]$ xsync spark-defaults.conf
[simon@hadoop102 conf]$ xsync spark-env.sh

5) Start the history server

[simon@hadoop102 spark]$ sbin/start-history-server.sh
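As a quick check, jps should now show a HistoryServer process on hadoop102 (a sketch; the same process also appears in the jpsall output later in these notes):

[simon@hadoop102 spark]$ jps | grep HistoryServer
8498 HistoryServer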

6) Run the task again

[simon@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

7) View the task history in the web UI:

hadoop102:18080

4. HA configuration

The Spark cluster is now deployed, but there is a big problem: the Master node is a single point of failure. To solve this we turn to ZooKeeper and start at least two Master nodes to achieve high availability. The configuration is relatively simple.

Overall architecture diagram of HA mode:

As we can see, it relies on ZooKeeper and works much like HDFS HA. Now let's start the configuration.

1) Install and start ZooKeeper as usual

[simon@hadoop102 spark]$ zk-start.sh
[simon@hadoop102 spark]$ jpsall
--------------------- hadoop102 -------------------------------
8498 HistoryServer
4755 NameNode
4900 DataNode
5704 NodeManager
6333 Master
9231 QuorumPeerMain
6623 Worker
--------------------- hadoop103 -------------------------------
8342 DataNode
9079 NodeManager
10008 Worker
10940 QuorumPeerMain
8893 ResourceManager
--------------------- hadoop104 -------------------------------
11073 QuorumPeerMain
8882 NodeManager
8423 SecondaryNameNode
9560 Worker
8347 DataNode

QuorumPeerMain is the ZooKeeper process, so we can see that ZooKeeper has started correctly.

2) Modify the spark-env.sh file with the following changes:

[simon@hadoop102 conf]$ vi spark-env.sh

Comment out the following:
#SPARK_MASTER_HOST=hadoop102
#SPARK_MASTER_PORT=7077
Add the following:
export SPARK_DAEMON_JAVA_OPTS="
-Dspark.deploy.recoveryMode=ZOOKEEPER 
-Dspark.deploy.zookeeper.url=hadoop102,hadoop103,hadoop104 
-Dspark.deploy.zookeeper.dir=/spark"

3) Distribute the configuration file

[simon@hadoop102 conf]$ xsync spark-env.sh

4) Start the whole cluster from hadoop102

[simon@hadoop102 spark]$ sbin/start-all.sh

5) Separately start a Master on hadoop103

[simon@hadoop103 spark]$ sbin/start-master.sh

6) Access the Spark HA cluster

/opt/module/spark/bin/spark-shell \
--master spark://hadoop102:7077,hadoop103:7077 \
--executor-memory 2g \
--total-executor-cores 2

This test is not something we use often while learning, but it is worth trying once. Both hadoop102 and hadoop103 now run a Master; shut down the active Master and check whether the standby Master takes over automatically, as sketched below.
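A minimal sketch of the failover test, assuming hadoop102 is currently the active Master (the Master web UI listens on port 8080 by default):

# Stop the active Master on hadoop102
[simon@hadoop102 spark]$ sbin/stop-master.sh

# After the ZooKeeper session times out, refresh hadoop103:8080 in a browser:
# its status should change from STANDBY to ALIVE, and running applications
# keep their Executors while the new Master takes over scheduling.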


