Preparation
1. Hadoop has already been deployed (if not, see: CentOS 7 installation of Hadoop 2.7). The cluster is as follows (the IP addresses may differ from those in the previous article):
| hostname | IP address | deployment plan |
| --- | --- | --- |
| node1 | 172.20.0.2 | NameNode, DataNode |
| node2 | 172.20.0.3 | DataNode |
| node3 | 172.20.0.4 | DataNode |
2. Download the installation package from the official website: spark-2.4.4-bin-hadoop2.7.tgz (the Tsinghua or USTC mirror sites are recommended).
3. Spark will be deployed under the existing path /mydata on all three machines; configure the environment variables:

export SPARK_HOME=/mydata/spark-2.4.4
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH
Local Mode
On node1, extract spark-2.4.4-bin-hadoop2.7.tgz into /mydata and rename the resulting folder to /mydata/spark-2.4.4.
Consistent with the Hadoop article, run the Spark version of the wordcount task (Python version) as follows:
shell> vim 1.txt  # create a file and write some content into it
hadoop hadoop
HBase HBase HBase
spark spark spark spark
shell> spark-submit $SPARK_HOME/examples/src/main/python/wordcount.py 1.txt  # submit the wordcount task to Spark, counting the words in 1.txt; the results are as follows
spark: 4
HBase: 3
hadoop: 2
Spark is a compute engine. Looking at the file wordcount.py, you can see that it implements the same function with far less code than MapReduce, which greatly lowers the difficulty of big-data development.
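For reference, the counting logic that the Spark example performs boils down to splitting lines into words and reducing by key. A minimal plain-Python sketch of the same idea (no Spark involved, just to illustrate what the example computes on the 1.txt above):

```python
from collections import Counter

def wordcount(lines):
    """Split each line on whitespace and count occurrences of each word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

# Same content as the 1.txt created above
text = [
    "hadoop hadoop",
    "HBase HBase HBase",
    "spark spark spark spark",
]
print(wordcount(text))  # {'hadoop': 2, 'HBase': 3, 'spark': 4}
```

Spark's wordcount.py expresses the same map (split) and reduce (count by key) steps, but distributes them across the cluster.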
Standalone Mode
The name can be translated as "independent mode": Spark itself provides everything the cluster needs except storage. First, configure it on node1.
The Spark configuration files are located in $SPARK_HOME/conf:

- copy spark-env.sh.template to spark-env.sh
- copy slaves.template to slaves
# filename: spark-env.sh
SPARK_MASTER_HOST=node1
SPARK_LOCAL_DIRS=/mydata/data/spark/scratch
SPARK_WORKER_DIR=/mydata/data/spark/work
SPARK_PID_DIR=/mydata/data/pid
SPARK_LOG_DIR=/mydata/logs/spark
# filename: slaves
node1
node2
node3
Because start-all.sh and stop-all.sh under $SPARK_HOME/sbin conflict with Hadoop's scripts of the same name, renaming them is recommended:

shell> mv start-all.sh spark-start-all.sh
shell> mv stop-all.sh spark-stop-all.sh
After the configuration is complete, copy the Spark program files to the other two nodes:

shell> scp -qr /mydata/spark-2.4.4/ root@node2:/mydata/
shell> scp -qr /mydata/spark-2.4.4/ root@node3:/mydata/
Then start the cluster on node1:
shell> spark-start-all.sh
Verify the processes with the jps command on each node:

| node | processes |
| --- | --- |
| node1 | Master, Worker |
| node2 | Worker |
| node3 | Worker |
The web UI can be accessed in a browser at http://node1:8080/:
Next, copy 1.txt to a second file 2.txt, put both onto HDFS, and then execute the wordcount task on the Spark cluster:
shell> cp 1.txt 2.txt
shell> hdfs dfs -mkdir -p /tmp/wc/
shell> hdfs dfs -put 1.txt 2.txt /tmp/wc/
shell> spark-submit --master spark://node1:7077 $SPARK_HOME/examples/src/main/python/wordcount.py hdfs://node1:9000/tmp/wc/*
shell> spark-submit --master spark://node1:7077 $SPARK_HOME/examples/src/main/python/pi.py 9  # a test task that computes pi; the number 9 is the number of slices (partitions); the output is similar to: Pi is roughly 3.137564
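The pi.py example in the Spark distribution estimates π by Monte Carlo sampling: throw random points into the unit square and count the fraction that land inside the quarter circle. A minimal plain-Python sketch of that idea (the partition count passed to spark-submit only controls how the sampling is split across workers):

```python
import random

def estimate_pi(n, seed=42):
    """Monte Carlo estimate of pi: 4 times the fraction of random points
    in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # fixed seed makes the sketch deterministic
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(f"Pi is roughly {estimate_pi(100_000)}")
```

The estimate converges slowly (error shrinks as 1/sqrt(n)), which is why the Spark example spreads the samples over many partitions.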
At http://node1:8080/ you can see the tasks that were executed:
Yarn Mode
In actual use, Spark is typically run on an existing cluster, for example using Hadoop's own YARN for resource scheduling.
Spark on YARN does not require the Spark standalone cluster, so stop it first:
shell> spark-stop-all.sh
The configuration is very simple; only this environment variable is needed:
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
However, to make job history and logs easy to view, also configure the Spark history server here and associate it with Hadoop's jobhistory:
Enter the directory $SPARK_HOME/conf and copy spark-defaults.conf.template to spark-defaults.conf:
# filename: spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:9000/spark/history
spark.history.fs.logDirectory hdfs://node1:9000/spark/history
spark.yarn.historyServer.allowTracking true
spark.yarn.historyServer.address node1:18080
Enter the directory $HADOOP_HOME/etc/hadoop and add the following to yarn-site.xml:
# filename: yarn-site.xml
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs/</value>
</property>
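Hadoop will refuse to start if yarn-site.xml is not well-formed XML, so it is worth checking the snippet before restarting YARN. A quick sketch that builds the same element with Python's standard library and prints it (the name and value are exactly those from the yarn-site.xml addition above):

```python
import xml.etree.ElementTree as ET

# Build the <property> element that is added to yarn-site.xml
prop = ET.Element("property")
ET.SubElement(prop, "name").text = "yarn.log.server.url"
ET.SubElement(prop, "value").text = "http://node1:19888/jobhistory/logs/"

# Serializes to a single line: <property><name>...</name><value>...</value></property>
xml_text = ET.tostring(prop, encoding="unicode")
print(xml_text)
```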
Create the necessary path on HDFS:
shell> hdfs dfs -mkdir -p /spark/history
Synchronize the updated Hadoop and Spark configurations to all other nodes (don't forget this step).
Now restart YARN on node1 and start the Spark history server:
shell> stop-yarn.sh
shell> start-yarn.sh
shell> start-history-server.sh  # after starting, jps shows an additional HistoryServer process
Run the following command to execute the wordcount task via YARN in cluster mode:
shell> spark-submit --master yarn --deploy-mode cluster $SPARK_HOME/examples/src/main/python/wordcount.py hdfs://node1:9000/tmp/wc/*
Visit http://node1:18080/ in a browser to see the Spark history:
Click the App ID to enter, then go to Executors, find the entry whose Executor ID is "driver", and view its stdout or stderr:
There you can see the logs and the computation result:
Similarly, the logs can also be accessed with the yarn command:
shell> yarn logs -applicationId [application id]
Done.