Setting up a Hadoop cluster
hadoop2.7.3 + spark1.6.1 + scala2.11.8 + jdk1.8.0_101
Download Hadoop 2.7.3 and edit the hadoop-env.sh file under $HADOOP_HOME/etc/hadoop:
export JAVA_HOME=/soft/jdk1.8.0_101
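A quick sanity check (not part of the original steps) is to run the configured JDK directly and confirm the path resolves:
# should report version 1.8.0_101
/soft/jdk1.8.0_101/bin/java -version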
Edit core-site.xml (here the data directory is simply placed under $HADOOP_HOME):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.186.128:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/spark/hadoop-2.7.3/data</value>
  </property>
</configuration>
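To verify the configuration file is actually being picked up, hdfs getconf can echo a key back (a verification step added here, not in the original notes):
# expect hdfs://192.168.186.128:9000
$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS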
Edit hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
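Note that dfs.replication=3 only pays off with at least three DataNodes; on a single-node test cluster, blocks will simply be flagged as under-replicated. Once HDFS is running (next steps), fsck shows the actual block state (a verification sketch, not in the original notes):
# report files, blocks, and replication status across the filesystem
$HADOOP_HOME/bin/hdfs fsck / -files -blocks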
Format the NameNode first:
$HADOOP_HOME/bin/hdfs namenode -format
Start the NameNode and DataNode (the scripts live in $HADOOP_HOME/sbin):
./hadoop-daemon.sh start namenode
./hadoop-daemon.sh start datanode
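To confirm both daemons are healthy, dfsadmin can report the live DataNodes (a verification step added here):
# the DataNode started above should be listed as a live node
$HADOOP_HOME/bin/hdfs dfsadmin -report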
Turn off iptables:
service iptables stop
chkconfig --level 35 iptables off
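The commands above target SysV-init systems such as CentOS 6. If the node runs a systemd-based distro (an assumption about your OS), the equivalent is:
# stop the firewall now and keep it off across reboots
systemctl stop firewalld
systemctl disable firewalld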
Change the hostname:
# several ways to change it
hostname <new-hostname>                 # takes effect immediately, lost on reboot
vim /etc/sysconfig/network              # persistent across reboots
sysctl kernel.hostname=<new-hostname>
vim /etc/hosts                          # map the name to this node's IP
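On systemd-based distros, a single command covers both the live change and persistence (hostnamectl assumed available; vm128 is the name used in yarn-site.xml below):
# sets the hostname immediately and persists it
hostnamectl set-hostname vm128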
Setting up Hadoop YARN
Edit yarn-env.sh:
JAVA=/soft/jdk1.8.0_101/bin/java
Edit yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>vm128</value>
  </property>
</configuration>
Start the YARN daemons (also under $HADOOP_HOME/sbin):
./yarn-daemon.sh start resourcemanager
./yarn-daemon.sh start nodemanager
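Once both daemons are up, the NodeManager should register with the ResourceManager; yarn node -list makes that visible (a verification step added here):
# the local NodeManager should appear with state RUNNING
$HADOOP_HOME/bin/yarn node -list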
Setting up Spark
Download Scala (the latest release is fine; 2.11.8 is used here), then configure SCALA_HOME:
JAVA_HOME=/soft/jdk1.8.0_101
SCALA_HOME=/root/spark/scala-2.11.8
PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL JAVA_HOME SCALA_HOME
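Assuming these lines were added to ~/.bash_profile (the original notes do not name the file), reload the profile and verify both tools resolve:
source ~/.bash_profile
scala -version   # should report 2.11.8
java -version    # should report 1.8.0_101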
Edit spark-env.sh under $SPARK_HOME/conf:
export SCALA_HOME=/root/spark/scala-2.11.8
export JAVA_HOME=/soft/jdk1.8.0_101
export SPARK_MASTER_IP=192.168.186.128
export SPARK_WORKER_MEMORY=512M
export HADOOP_CONF_DIR=/root/spark/hadoop-2.7.3/etc/hadoop
Start the master and a worker (the scripts are in $SPARK_HOME/sbin, not bin):
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://192.168.186.128:7077
jps output:
NameNode and DataNode are HDFS processes.
ResourceManager and NodeManager are YARN processes.
Master and Worker are Spark processes.
6368 Master
7666 Jps
6756 Worker
4343 DataNode
5052 NodeManager
4446 NameNode
4798 ResourceManager
Running a simple example
$SPARK_HOME/bin/spark-shell
First upload a file to HDFS:
$HADOOP_HOME/bin/hdfs dfs -mkdir /test
$HADOOP_HOME/bin/hdfs dfs -put /root/spark/spark-2.0.0-bin-hadoop2.7/conf/spark-defaults.conf.template /test/xx
Then, inside spark-shell:
val textFile = sc.textFile("hdfs://192.168.186.128:9000/test/xx")
val line = textFile.filter(line => line.contains("spark"))
// nothing is computed until an action such as count is called
line.count()
// map, filter, collect
sc.parallelize(1 to 100).map(_*2).filter(_>50).filter(_<180).collect
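As a slightly bigger sketch, the classic word count can run over the same file. The one-liner below pipes a Scala snippet into spark-shell non-interactively (assuming /test/xx was uploaded as above):
# word count: split on whitespace, count each word, print the result
echo 'sc.textFile("hdfs://192.168.186.128:9000/test/xx").flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _).collect().foreach(println)' | $SPARK_HOME/bin/spark-shell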
Web UI ports
# Hadoop UI
http://192.168.186.128:50070/dfshealth.html#tab-datanode
# YARN UI
http://192.168.186.128:8088/cluster/apps/RUNNING
# Spark master UI
http://192.168.186.128:8080/
# job monitoring UI once spark-shell is running
http://192.168.186.134:4040/