Just for fun, I set up a Spark cluster across three machines (192.168.1.10, 192.168.1.11, 192.168.1.12). Since Spark was to run on top of Hadoop, Hadoop had to be set up first. After two afternoons the setup was finally complete, and the process is recorded below.
Preparation
1. Install the JDK.
2. Download the packages for Scala, Hadoop, and Spark.
3. Set up passwordless SSH authentication.
Steps for mutual passwordless authentication among the three machines:
Step 1: generate an RSA key pair on each machine:
[root@jw01 .ssh]# ssh-keygen -t rsa
[root@jw02 .ssh]# ssh-keygen -t rsa
[root@kt01 .ssh]# ssh-keygen -t rsa
[root@kt02 .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): # press Enter for passwordless login
Enter passphrase (empty for no passphrase): # press Enter
Enter same passphrase again: # press Enter
Your identification has been saved in /root/.ssh/id_rsa. # the private key
Your public key has been saved in /root/.ssh/id_rsa.pub. # the public key
The key fingerprint is:
04:45:0b:47:10:92:0c:b2:b9:d7:11:5b:49:05:e4:d9 root@jw01
Step 2: rename the public keys id_rsa.pub generated on 192.168.1.11 and 192.168.1.12 to id_rsa.pub_11 and id_rsa.pub_12, and transfer them to the /root/.ssh/ directory on 192.168.1.10. Then, on 192.168.1.10, append all the public keys to the authentication file authorized_keys (if the file does not exist, the command below creates it):
cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys
Step 3: copy the authorized_keys file on 192.168.1.10 to the /root/.ssh/ directory on the two machines 192.168.1.11 and 192.168.1.12.
Finally, test whether each machine can ssh into the others by IP address.
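The key-collection part of step 2 can be sketched locally with throwaway files; the /tmp/ssh-demo path and the fake key strings below are placeholders, and the real files would live under /root/.ssh/:

```shell
# A runnable sketch of step 2, using throwaway files under /tmp/ssh-demo;
# the key strings are fake stand-ins for real id_rsa.pub contents.
mkdir -p /tmp/ssh-demo && cd /tmp/ssh-demo
rm -f authorized_keys                                  # keep the demo repeatable
echo "ssh-rsa AAAA...10 root@jw01" > id_rsa.pub        # local key on .10
echo "ssh-rsa AAAA...11 root@jw02" > id_rsa.pub_11     # transferred from .11
echo "ssh-rsa AAAA...12 root@kt01" > id_rsa.pub_12     # transferred from .12
cat id_rsa.pub* >> authorized_keys                     # collect all public keys
# step 3 then copies authorized_keys back to /root/.ssh/ on .11 and .12
wc -l < authorized_keys                                # one line per machine
```

Note that the glob `id_rsa.pub*` matches the local key as well as the transferred ones, which is why the local machine can also log into itself afterwards.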
Preparing the Environment
Modify the hostname
We will build a cluster with one master and two slaves. First, modify the hostname with vi /etc/hostname: on the master change it to master, on one slave change it to slave1, and do the same for the others.
Configuring hosts
Modify the hosts file (/etc/hosts) on each host.
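The hosts entries themselves were elided in the original; the mapping below is an assumption reconstructed from the IPs and hostnames used throughout this post, written to a temporary file here so the sketch can run anywhere (on a real node you would append the three lines to /etc/hosts):

```shell
# Assumed /etc/hosts entries for this cluster, written to a demo file.
cat > /tmp/hosts-demo <<'EOF'
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
EOF
cat /tmp/hosts-demo
```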
Hadoop installation
1. Extract
tar -zxvf hadoop-2.6.0.tar.gz
2. Modify the configuration files
As shown in Reference [1]
On machine 192.168.1.10 (the master), enter the Hadoop configuration directory; the following seven files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.
- In hadoop-env.sh, configure JAVA_HOME:
# The java implementation to use.
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
- In yarn-env.sh, configure JAVA_HOME:
# some Java parameters
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
- In slaves, list the IPs (or hostnames) of the slave nodes:
192.168.1.11
192.168.1.12
- Modify core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/spark/workspace/hadoop-2.6.0/tmp</value>
  </property>
</configuration>
- Modify hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
- Modify mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- Modify yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
</configuration>
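Before distributing the edited files, a quick sanity check that each *-site.xml is well-formed XML can save a confusing startup failure. This sketch assumes python3 is available and runs against a temporary copy of core-site.xml rather than the live config:

```shell
# Verify a config file parses as XML before shipping it to the slaves.
cat > /tmp/core-site-check.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
EOF
python3 -c "import xml.dom.minidom; xml.dom.minidom.parse('/tmp/core-site-check.xml')" \
  && echo "well-formed"
```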
3. Distribute the configured hadoop-2.6.0 folder to the slave machines 192.168.1.11 and 192.168.1.12.
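The distribution step could be done with scp; the root account and the ~/workspace path are assumptions carried over from the earlier commands, and the commands are written to a file for review rather than executed directly, since they target live hosts:

```shell
# Hypothetical distribution of the configured tree to each slave.
for host in 192.168.1.11 192.168.1.12; do
  echo "scp -r ~/workspace/hadoop-2.6.0 root@$host:~/workspace/"
done | tee /tmp/dist-cmds
```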
4. Start on 192.168.1.10:
cd ~/workspace/hadoop-2.6.0 # enter the hadoop directory
bin/hadoop namenode -format # format the namenode
sbin/start-dfs.sh # start dfs
sbin/start-yarn.sh # start yarn
5. Test
On machine 192.168.1.10:
$ jps #run on master
3407 SecondaryNameNode
3218 NameNode
3552 ResourceManager
3910 Jps
On machines 192.168.1.11 and 192.168.1.12:
$ jps #run on slaves
2072 NodeManager
2213 Jps
1962 DataNode
Admin side
Open http://192.168.1.10:8088 in a browser; the Hadoop management interface should appear, showing the slave1 and slave2 nodes. The port is configured in yarn-site.xml:
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
Scala installation
See Reference [1].
Perform the following on all three machines: 192.168.1.10, 192.168.1.11, 192.168.1.12.
Extract the archive.
Then modify the environment variables with sudo vi /etc/profile, adding the Scala entries.
Make the environment variables take effect in the same way, and verify that Scala was installed successfully.
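The three steps above were elided in the original; the sketch below reconstructs them as assumptions, with the archive name and install path taken from the spark-env.sh settings later in this post. The profile lines are written to a temporary file here instead of the real /etc/profile so the sketch is runnable anywhere:

```shell
# 1. extract (archive name assumed):  tar -zxvf scala-2.10.4.tgz
# 2. lines to add to /etc/profile, written to a demo file here:
cat > /tmp/profile-demo <<'EOF'
export SCALA_HOME=/home/spark/workspace/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
EOF
# 3. apply and verify on a real node:  source /etc/profile && scala -version
cat /tmp/profile-demo
```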
Problems encountered and their solutions:
[1] Hadoop jps shows a "process information unavailable" message. Solution: see Reference [2].
After starting Hadoop, running the jps command to view the system's current Java processes displays:
hduser@jack:/usr/local/hadoop$ jps
18470 SecondaryNameNode
19096 Jps
12167 -- process information unavailable
19036 NodeManager
18642 ResourceManager
18021 DataNode
17640 NameNode
In this case, go into the /tmp directory of the local file system, delete the folder named hsperfdata_{username}, and then restart Hadoop.
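The fix boils down to one rm command; it is shown here against a demo directory so the sketch does not delete a live JVM's monitoring data, and the hduser name is taken from the jps listing above:

```shell
# Emulate the stale directory, then apply the fix described above.
mkdir -p /tmp/hsperf-demo/hsperfdata_hduser
rm -rf /tmp/hsperf-demo/hsperfdata_hduser
# on a real node: rm -rf /tmp/hsperfdata_$(whoami), then restart Hadoop
ls /tmp/hsperf-demo   # the stale directory is gone
```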
[2] Various permission problems.
Solution: redo the passwordless SSH authentication setup described above.
[3] "Incompatible clusterIDs" error when starting Hadoop HDFS.
Solution: the "Incompatible clusterIDs" error is caused by not emptying the data directories of the DataNode nodes before running "hdfs namenode -format". Empty them.
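As commands, the fix empties the directory configured as dfs.datanode.data.dir earlier; the sketch emulates that directory under /tmp so it is runnable here:

```shell
# Stand-in for /home/spark/workspace/hadoop-2.6.0/dfs/data on a DataNode.
datadir=/tmp/dfs-demo/data
mkdir -p "$datadir/current"   # pretend there is stale DataNode state
rm -rf "$datadir"/*           # empty the data directory on each DataNode
# then reformat and restart: bin/hadoop namenode -format && sbin/start-dfs.sh
ls "$datadir"                 # directory is now empty
```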
Spark installation
As shown in Reference [1]
Extract on machine 192.168.1.10:
tar -zxvf spark-1.4.0-bin-hadoop2.6.tgz
mv spark-1.4.0-bin-hadoop2.6 spark-1.4 # the original name is too long, so shorten it
Modify the configuration:
cp spark-env.sh.template spark-env.sh
Append the following to the end of spark-env.sh (this is my configuration; you can adjust it yourself):
export SCALA_HOME=/home/spark/workspace/scala-2.10.4
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
export HADOOP_HOME=/home/spark/workspace/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=master
SPARK_LOCAL_DIRS=/home/spark/workspace/spark-1.4
SPARK_DRIVER_MEMORY=1G
Modify the slaves file:
cp slaves.template slaves
Set it to:
192.168.1.11
192.168.1.12
Distribute the above configuration to 192.168.1.11 and 192.168.1.12.
Start on machine 192.168.1.10:
sbin/start-all.sh
Check whether everything started.
On master:
$ jps
7949 Jps
7328 SecondaryNameNode
7805 Master
7137 NameNode
7475 ResourceManager
On slave2:
$jps
3132 DataNode
3759 Worker
3858 Jps
3231 NodeManager
Open Spark's web management page: http://192.168.1.10:8080. If port 8080 is occupied by another program, port 8081 is used.
END