Installing Spark 2.4 on CentOS 7

Preparation

1. Hadoop has already been deployed (if not, see: Installing Hadoop 2.7 on CentOS 7). The cluster is laid out as follows (the IP addresses differ from the earlier article):

hostname    IP address     deployment plan
node1       172.20.0.2     NameNode, DataNode
node2       172.20.0.3     DataNode
node3       172.20.0.4     DataNode

2. Download the installation package spark-2.4.4-bin-hadoop2.7.tgz from the official website (using the Tsinghua or USTC mirror sites is recommended).
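
If a direct download link is convenient, older releases are also kept in the Apache archive; the URL below is an assumption based on the usual archive layout, and the mirror sites may organize their paths differently:

shell> wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz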

3. Spark will be deployed under the existing path /mydata on all three machines; configure the environment variables:

export SPARK_HOME=/mydata/spark-2.4.4
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH
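
One way to make these variables permanent, assuming a system-wide setup in /etc/profile is acceptable, is to append them and reload the profile (repeat on node2 and node3 as well):

shell> echo 'export SPARK_HOME=/mydata/spark-2.4.4' >> /etc/profile
shell> echo 'export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH' >> /etc/profile
shell> source /etc/profile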

Local Mode

On node1, extract spark-2.4.4-bin-hadoop2.7.tgz into /mydata and rename the folder to /mydata/spark-2.4.4.
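
A sketch of that step, assuming the tarball sits in the current directory:

shell> tar -zxf spark-2.4.4-bin-hadoop2.7.tgz -C /mydata
shell> mv /mydata/spark-2.4.4-bin-hadoop2.7 /mydata/spark-2.4.4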

Consistent with the Hadoop article, run a Spark version of the wordcount task (Python version):

shell> vim 1.txt  # create a file and write something into it
  hadoop hadoop
  HBase HBase HBase
  spark spark spark spark
shell> spark-submit $SPARK_HOME/examples/src/main/python/wordcount.py 1.txt  # submit the wordcount task to Spark; it counts the words in 1.txt, with results like:
  spark: 4
  HBase: 3
  hadoop: 2

Spark is a compute engine. Opening wordcount.py shows that it implements the same function in far less code than the MapReduce version, which greatly lowers the barrier to big-data development.

Standalone Mode

Standalone can be translated as "independent mode": Spark's own built-in cluster handles everything except storage. First configure it on node1.

Spark's configuration files live in $SPARK_HOME/conf:

Copy spark-env.sh.template to spark-env.sh and slaves.template to slaves, then edit the two files as shown below.
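
The copy step itself, run from inside $SPARK_HOME/conf, is simply:

shell> cp spark-env.sh.template spark-env.sh
shell> cp slaves.template slaves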

# file: spark-env.sh
SPARK_MASTER_HOST=node1
SPARK_LOCAL_DIRS=/mydata/data/spark/scratch
SPARK_WORKER_DIR=/mydata/data/spark/work
SPARK_PID_DIR=/mydata/data/pid
SPARK_LOG_DIR=/mydata/logs/spark

# file: slaves
node1
node2
node3
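
If the scratch, work, pid, and log directories referenced in spark-env.sh do not exist yet, they can be created up front on each node (a precaution; Spark's scripts may create some of them on their own):

shell> mkdir -p /mydata/data/spark/scratch /mydata/data/spark/work /mydata/data/pid /mydata/logs/spark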

Because start-all.sh and stop-all.sh under $SPARK_HOME/sbin clash with Hadoop's scripts of the same name, renaming them is recommended:

shell> mv start-all.sh spark-start-all.sh
shell> mv stop-all.sh spark-stop-all.sh

Once the configuration is complete, copy the Spark installation to the other two machines:

shell> scp -qr /mydata/spark-2.4.4/ root@node2:/mydata/
shell> scp -qr /mydata/spark-2.4.4/ root@node3:/mydata/

Then start the cluster on node1:

shell> spark-start-all.sh
  # verify with the jps command: node1 shows Master and Worker; node2 and node3 each show a Worker

The web UI can be opened in a browser at http://node1:8080/:

Next, make a copy of 1.txt named 2.txt, put both files onto HDFS, and then run the wordcount task through the Spark cluster:

shell> cp 1.txt 2.txt
shell> hdfs dfs -mkdir /tmp/wc/
shell> hdfs dfs -put 1.txt 2.txt /tmp/wc/
shell> spark-submit --master spark://node1:7077 $SPARK_HOME/examples/src/main/python/wordcount.py hdfs://node1:9000/tmp/wc/*
shell> spark-submit --master spark://node1:7077 $SPARK_HOME/examples/src/main/python/pi.py 9  # a test task that computes pi; the trailing 9 is the number of partitions (slices); the output looks like: Pi is roughly 3.137564

At http://node1:8080/ you can see the tasks that were executed:

YARN Mode

In real-world use, Spark jobs typically run on an existing cluster, for example using Hadoop's own YARN for resource scheduling.

Spark on YARN does not need the standalone Spark cluster, so stop it:

shell> spark-stop-all.sh

The configuration is very simple; only this environment variable is needed:

export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
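
With that in place, a quick way to verify that Spark can talk to YARN is to submit the bundled pi example in client mode (client mode is used here only as an illustration; the wordcount run further down uses cluster mode):

shell> spark-submit --master yarn --deploy-mode client $SPARK_HOME/examples/src/main/python/pi.py 9  # 9 is again the partition count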

However, to make it easy to view history and logs, we also configure the Spark history server here and hook it up to Hadoop's JobHistory server:

Go into the directory $SPARK_HOME/conf, copy spark-defaults.conf.template to spark-defaults.conf, and add the settings shown below.
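
The copy step, for reference:

shell> cd $SPARK_HOME/conf
shell> cp spark-defaults.conf.template spark-defaults.conf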

# file: spark-defaults.conf
spark.eventLog.enabled                 true
spark.eventLog.dir                     hdfs://node1:9000/spark/history
spark.history.fs.logDirectory          hdfs://node1:9000/spark/history
spark.yarn.historyServer.allowTracking    true
spark.yarn.historyServer.address       node1:18080

Go into the directory $HADOOP_HOME/etc/hadoop and add the following to yarn-site.xml:

# file: yarn-site.xml
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs/</value>
</property>

Create the necessary path on HDFS:

shell> hdfs dfs -mkdir -p /spark/history

Sync the updated Hadoop and Spark configuration to all the other nodes (do not forget this step).
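
One way to do the sync, a sketch assuming every node uses the same paths as set up earlier (only the files changed in this section are copied):

shell> scp -q $SPARK_HOME/conf/spark-defaults.conf root@node2:$SPARK_HOME/conf/
shell> scp -q $SPARK_HOME/conf/spark-defaults.conf root@node3:$SPARK_HOME/conf/
shell> scp -q $HADOOP_HOME/etc/hadoop/yarn-site.xml root@node2:$HADOOP_HOME/etc/hadoop/
shell> scp -q $HADOOP_HOME/etc/hadoop/yarn-site.xml root@node3:$HADOOP_HOME/etc/hadoop/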

Next, restart YARN on node1 and start the Spark history server:

shell> stop-yarn.sh
shell> start-yarn.sh
shell> start-history-server.sh  # after starting, jps shows an additional HistoryServer process

Run the following command to execute the wordcount task on YARN in cluster mode:

shell> spark-submit --master yarn --deploy-mode cluster $SPARK_HOME/examples/src/main/python/wordcount.py hdfs://node1:9000/tmp/wc/*

Open http://node1:18080/ in a browser to see the Spark history:

Click the App ID to drill in, then go to Executors, find the entry whose Executor ID is driver, and view its stdout or stderr:

There you can see the logs and the computation result:

The logs can also be accessed with the yarn command:

shell> yarn logs -applicationId [application id]
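
If the application id is not at hand, finished applications can be listed first to find it:

shell> yarn application -list -appStates FINISHED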

over

Origin www.cnblogs.com/toSeek/p/12068159.html