3 Spark cluster installation

3.1 Spark installation addresses

1. Official website address

http://spark.apache.org/

2. Documentation address

https://spark.apache.org/docs/2.1.1/

3. Download link

https://spark.apache.org/downloads.html

3.2 Standalone mode installation

1) Upload and extract the Spark installation package

[atguigu@hadoop102 sorfware]$ tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/module/

[atguigu@hadoop102 module]$ mv spark-2.1.1-bin-hadoop2.7 spark

2) Enter the conf folder under the Spark installation directory

[atguigu@hadoop102 module]$ cd spark/conf/

3) Rename the configuration template files

[atguigu@hadoop102 conf]$ mv slaves.template slaves

[atguigu@hadoop102 conf]$ mv spark-env.sh.template spark-env.sh

4) Modify the slaves file and add the worker nodes:

[atguigu@hadoop102 conf]$ vim slaves

 

hadoop102

hadoop103

hadoop104

5) Modify the spark-env.sh file and add the following configuration:

[atguigu@hadoop102 conf]$ vim spark-env.sh

 

SPARK_MASTER_HOST=hadoop102

SPARK_MASTER_PORT=7077        # service port

 

6) Distribute the Spark package

[atguigu@hadoop102 module]$ xsync spark/

7) Start the cluster

[atguigu@hadoop102 spark]$ sbin/start-all.sh

[atguigu@hadoop102 spark]$ util.sh

================atguigu@hadoop102================

3330 Jps

3238 Worker

3163 Master

================atguigu@hadoop103================

2966 Jps

2908 Worker

================atguigu@hadoop104================

2978 Worker

3036 Jps

 

View the master web UI at hadoop102:8080

Note: If you encounter a "JAVA_HOME not set" error, you can add the following configuration to the spark-config.sh file under the sbin directory:

export JAVA_HOME=XXXX

8) Submit a job and run the program

[atguigu@hadoop102 spark]$ bin/spark-submit \

--class org.apache.spark.examples.SparkPi \

--master spark://hadoop102:7077 \

--executor-memory 1G \

--total-executor-cores 2 \

./examples/jars/spark-examples_2.11-2.1.1.jar \

100

 

./bin/spark-submit \

--class <main-class> \

--master <master-url> \

--deploy-mode <deploy-mode> \

--conf <key>=<value> \

... # other options

<application-jar> \

[application-arguments]

 

Parameter Description:

--master spark://hadoop102:7077: specifies the master address

--class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi)

--deploy-mode: whether to deploy your driver on a worker node (cluster) or locally as an external client (client) (default: client)

--conf: an arbitrary Spark configuration property in key=value format; if the value contains spaces, wrap it in quotes ("key=value")

application-jar: path to the packaged application jar, including dependencies. The URL must be globally visible inside the cluster, for example an hdfs:// path on a shared storage system, or a file:// path that exists with the same jar on every node

application-arguments: arguments passed to the main() method

--executor-memory 1G: gives each executor 1G of memory

--total-executor-cores 2: sets the total number of CPU cores used across all executors to 2

 

 

The SparkPi example estimates Pi with a Monte Carlo algorithm.
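As a rough sketch (not the exact source of the bundled SparkPi class), the same Monte Carlo estimate can be reproduced in the Spark shell, relying only on the pre-created sc:

// Monte Carlo estimate of Pi: sample random points in the square [-1, 1) x [-1, 1)
// and count how many land inside the unit circle; 4 * inside / total approximates Pi.
// Paste into spark-shell, which already provides `sc`.
val slices = 100                     // plays the role of the "100" argument above
val n = 100000 * slices              // total number of random points
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * inside / n}")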

 

9) Start the Spark shell

/opt/module/spark/bin/spark-shell \

--master spark://hadoop102:7077 \

--executor-memory 1g \

--total-executor-cores 2

Note: If you start the Spark shell without specifying a master address, the shell still starts and runs Spark programs normally, but it is actually running in Spark's local mode, which only starts a single process on the local machine and does not connect to the cluster.
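A quick way to confirm which mode the shell actually started in (this check is not in the original text; sc.master is a standard SparkContext field holding the master URL):

scala> sc.master    // e.g. "local[*]" when no --master was given, "spark://hadoop102:7077" on the cluster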

The Spark shell initializes a SparkContext object named sc by default, so user code can use sc directly. It also creates a SparkSession (named spark), which is the entry point for Spark SQL.

scala> sc.textFile("./word.txt")

.flatMap(_.split(" "))

.map((_,1))

.reduceByKey(_+_)

.collect

 

res0: Array[(String, Int)] = Array((hive,1), (atguigu,1), (spark,1), (hadoop,1), (hbase,1))
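As a minimal sketch of the pre-created SparkSession mentioned above (the view name nums is only an illustrative choice, not from the original text):

scala> spark.range(1, 6).createOrReplaceTempView("nums")   // register ids 1..5 as a temporary view

scala> spark.sql("SELECT id, id * id AS square FROM nums").show()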

 

3.3 JobHistoryServer configuration

1) Rename spark-defaults.conf.template

[atguigu@hadoop102 conf]$ mv spark-defaults.conf.template spark-defaults.conf

2) Modify the spark-defaults.conf file to enable event logging

[atguigu@hadoop102 conf]$ vi spark-defaults.conf

spark.eventLog.enabled           true

spark.eventLog.dir               hdfs://hadoop102:9000/directory  

Note: the directory on HDFS must already exist.

3) Modify the spark-env.sh file and add the following configuration:

[atguigu@hadoop102 conf]$ vi spark-env.sh

 

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000

-Dspark.history.retainedApplications=3

-Dspark.history.fs.logDirectory=hdfs://hadoop102:9000/directory"

Parameter descriptions:

spark.eventLog.dir: all information an application produces while it runs is recorded under the path specified by this property.

spark.history.ui.port=4000: changes the port of the history server web UI to 4000.

spark.history.fs.logDirectory=hdfs://hadoop102:9000/directory: once this property is set, there is no need to explicitly specify the path again when running start-history-server.sh; the Spark History Server page only shows information from the specified path.

spark.history.retainedApplications=3: the number of application history records to keep; once the limit is exceeded, the oldest application information is removed. This counts applications held in memory, not the number displayed on the page.

4) Distribute the configuration files

[atguigu@hadoop102 conf]$ xsync spark-defaults.conf

[atguigu@hadoop102 conf]$ xsync spark-env.sh

5) Start the history server

[atguigu@hadoop102 spark]$ sbin/start-history-server.sh

6) Run the job again

[atguigu@hadoop102 spark]$ bin/spark-submit \

--class org.apache.spark.examples.SparkPi \

--master spark://hadoop102:7077 \

--executor-memory 1G \

--total-executor-cores 2 \

./examples/jars/spark-examples_2.11-2.1.1.jar \

100

7) View the history server page

hadoop102:4000

3.4 HA configuration

1) Install and start Zookeeper normally

2) Modify the spark-env.sh file and add the following configuration:

[atguigu@hadoop102 conf]$ vi spark-env.sh

 

Comment out the following lines:

#SPARK_MASTER_HOST=hadoop102

#SPARK_MASTER_PORT=7077

Add the following lines:

export SPARK_DAEMON_JAVA_OPTS="

-Dspark.deploy.recoveryMode=ZOOKEEPER

-Dspark.deploy.zookeeper.url=hadoop102,hadoop103,hadoop104 

-Dspark.deploy.zookeeper.dir=/spark"

3) Distribute the configuration file

[atguigu@hadoop102 conf]$ xsync spark-env.sh

4) Start all nodes on hadoop102

[atguigu@hadoop102 spark]$ sbin/start-all.sh

5) Separately start a master on hadoop103

[atguigu@hadoop103 spark]$ sbin/start-master.sh

6) Access the Spark HA cluster

/opt/module/spark/bin/spark-shell \

--master spark://hadoop102:7077,hadoop103:7077 \

--executor-memory 2g \

--total-executor-cores 2

Note: specifying only hadoop102 as the master also works.

3.5 Yarn mode installation

1) Modify the Hadoop configuration file yarn-site.xml and add the following content:

[atguigu@hadoop102 hadoop]$ vi yarn-site.xml

        <!-- Whether to start a thread that checks how much physical memory each task is using and kills the task if it exceeds the allocated value; default is true -->

        <property>

                <name>yarn.nodemanager.pmem-check-enabled</name>

                <value>false</value>

        </property>

        <!-- Whether to start a thread that checks how much virtual memory each task is using and kills the task if it exceeds the allocated value; default is true -->

        <property>

                <name>yarn.nodemanager.vmem-check-enabled</name>

                <value>false</value>

        </property>

2) Modify spark-env.sh and add the following configuration:

[atguigu@hadoop102 conf]$ vi spark-env.sh

 

YARN_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop  

HADOOP_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop

3) Distribute the configuration files

[atguigu@hadoop102 conf]$ xsync /opt/module/hadoop-2.7.2/etc/hadoop/yarn-site.xml

[atguigu@hadoop102 conf]$ xsync spark-env.sh

4) Run a program

[atguigu@hadoop102 spark]$ bin/spark-submit \

--class org.apache.spark.examples.SparkPi \

--master yarn \

--deploy-mode client \

./examples/jars/spark-examples_2.11-2.1.1.jar \

100

Note: HDFS and the YARN cluster must be started before submitting the job.
