[Hadoop learning 2] Hadoop fully distributed environment configuration
1 Clone a virtual machine
The virtual machine should be shut down before cloning.
Modify the hostname:
vi /etc/sysconfig/network
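A minimal sketch of what /etc/sysconfig/network might contain for the clone that becomes host2 (the hostname value is an assumption based on the naming used later in this article):
NETWORKING=yes
HOSTNAME=host2.chybinmy.com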
View the current IP configuration:
ip addr
Modify the IP address in the interface configuration file; only the last octet of the IP needs to be changed (to any value not already in use on the subnet):
vi /etc/sysconfig/network-scripts/ifcfg-ens33
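A sketch of the relevant fields for the clone that becomes host2, assuming a static IP configuration (other lines in the file are left as cloned):
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.159.158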
vi /etc/hosts
Note: /etc/hosts must be updated on both the source machine and the clone; add the hostname-to-IP mappings for all nodes (a sample is shown below).
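A sketch of the /etc/hosts entries, using the IP addresses that appear later in this article:
192.168.159.159 host1.chybinmy.com
192.168.159.158 host2.chybinmy.com
192.168.159.157 host3.chybinmy.com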
Restart the virtual machine.
At this point, the host2 (IP: 192.168.159.158) virtual machine has been cloned. In the same way, clone a host3 (IP: 192.168.159.157) virtual machine.
2 Server function planning
Determine the function of each server
host1 | host2 | host3 |
---|---|---|
NameNode | ResourceManager | |
DataNode | DataNode | DataNode |
NodeManager | NodeManager | NodeManager |
HistoryServer | | SecondaryNameNode |
3 Install new Hadoop on the first machine
3.1 Preparation
To distinguish this installation from the pseudo-distributed Hadoop previously installed on the host1 machine, stop all of host1's Hadoop services, and then install another Hadoop in a new directory, /opt/modules/app.
To install the cluster, we first extract and configure Hadoop on the first machine, and then distribute it to the other two machines.
3.2 Unzip the Hadoop directory:
tar -zxf /opt/hadoop/hadoop-2.10.1.tar.gz -C /opt/modules/app/
3.3 Configure the JDK: modify hadoop-env.sh, mapred-env.sh, yarn-env.sh
Open hadoop-env.sh, mapred-env.sh, and yarn-env.sh and set the JAVA_HOME path in each: JAVA_HOME=/opt/modules/jdk1.8.0_171
[hadoop@host1 ~]$ vi /opt/modules/app/hadoop-2.10.1/etc/hadoop/hadoop-env.sh
[hadoop@host1 ~]$ vi /opt/modules/app/hadoop-2.10.1/etc/hadoop/mapred-env.sh
[hadoop@host1 ~]$ vi /opt/modules/app/hadoop-2.10.1/etc/hadoop/yarn-env.sh
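In each of the three files, set JAVA_HOME to the JDK path, e.g. an export line like the following (uncomment it first in mapred-env.sh and yarn-env.sh if it is commented out):
export JAVA_HOME=/opt/modules/jdk1.8.0_171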
3.4 Configure core-site.xml
[hadoop@host1 ~]$ cd /opt/modules/app/hadoop-2.10.1
[hadoop@host1 hadoop-2.10.1]$ vi etc/hadoop/core-site.xml
Add the following content between <configuration> and </configuration>:
<property>
<name>fs.defaultFS</name>
<value>hdfs://host1.chybinmy.com:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/modules/app/hadoop-2.10.1/data/tmp</value>
</property>
Explanation:
fs.defaultFS is the address of the NameNode. hadoop.tmp.dir is Hadoop's temporary directory; by default, the data files of the NameNode and DataNode are stored in subdirectories of this directory.
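The data/tmp directory configured above is normally created automatically, but it can also be created up front; a quick sketch (the path matches the value configured above):
[hadoop@host1 hadoop-2.10.1]$ mkdir -p /opt/modules/app/hadoop-2.10.1/data/tmp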
3.5 Configure hdfs-site.xml
[hadoop@host1 hadoop-2.10.1]$ vi etc/hadoop/hdfs-site.xml
Add the following content between <configuration> and </configuration>:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>host3.chybinmy.com:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/modules/app/hadoop-2.10.1/data/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/modules/app/hadoop-2.10.1/data/tmp/dfs/data</value>
</property>
Explanation:
dfs.namenode.secondary.http-address specifies the HTTP address and port of the SecondaryNameNode. Because host3 is planned as the SecondaryNameNode server, this is set to host3.chybinmy.com:50090.
3.6 Configure slaves
[hadoop@host1 hadoop-2.10.1]$ vi etc/hadoop/slaves
Add the following hostnames to the file:
host1.chybinmy.com
host2.chybinmy.com
host3.chybinmy.com
The slaves file specifies which nodes the HDFS DataNodes run on.
3.7 Configure yarn-site.xml
[hadoop@host1 hadoop-2.10.1]$ vi etc/hadoop/yarn-site.xml
Add the following content between <configuration> and </configuration>:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>host2.chybinmy.com</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>106800</value>
</property>
Explanation:
According to the plan, yarn.resourcemanager.hostname designates the ResourceManager server and points it to host2.chybinmy.com. yarn.log-aggregation-enable configures whether to enable log aggregation. yarn.log-aggregation.retain-seconds configures how long (in seconds) the aggregated logs are kept on HDFS.
3.8 Configure mapred-site.xml
Copy mapred-site.xml.template to create the mapred-site.xml file.
[hadoop@host1 hadoop-2.10.1]$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
[hadoop@host1 hadoop-2.10.1]$ vi etc/hadoop/mapred-site.xml
Add the following content between <configuration> and </configuration>:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.159.159:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>192.168.159.159:19888</value>
</property>
Explanation:
mapreduce.framework.name sets MapReduce jobs to run on YARN. mapreduce.jobhistory.address sets the MapReduce history server to the host1 machine (192.168.159.159). mapreduce.jobhistory.webapp.address sets the web address and port of the history server.
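If the /etc/hosts mappings above are in place on every node, the same settings can also be written with the hostname instead of the raw IP; a sketch, assuming host1.chybinmy.com resolves to 192.168.159.159 everywhere:
<property>
<name>mapreduce.jobhistory.address</name>
<value>host1.chybinmy.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>host1.chybinmy.com:19888</value>
</property>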
4 Set up SSH login without password
The machines in a Hadoop cluster access each other over SSH. Entering a password for every access is impractical, so passwordless SSH login needs to be configured between the machines.
4.1 Generate public key
First switch to the root user:
su root
vi /etc/ssh/sshd_config
Find the following three lines and remove the comments. I did not find RSAAuthentication yes, so I added it directly:
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
Restart the sshd service, switch back to the hadoop user, and generate the public/private key pair:
service sshd restart
su hadoop
ssh-keygen -t dsa
This generates a .ssh directory under /home/hadoop/ (it is hidden; use ll -a to see it). Inside .ssh there are two files, the public key and the private key. Copy the public key into the authorized_keys file, change the permissions of authorized_keys, and then ssh to the local machine; if no password is required, passwordless login to this machine works.
cd .ssh
cat id_dsa.pub >> authorized_keys   (or: cat id_rsa.pub >> authorized_keys)
chmod 600 authorized_keys
ssh localhost
If ssh localhost still asks for a password, try the following commands; change /home/hadoop/ to match your own user's home directory.
chmod 755 /home/hadoop/
chmod 700 ~/.ssh
chmod 644 ~/.ssh/authorized_keys
Perform the above operations on host1, host2, and host3. Once done, go to the next step.
4.2 Distribution of public keys
In the following commands, id_dsa.pub may need to be id_rsa.pub, depending on which key type was generated.
cat ~/.ssh/id_dsa.pub | ssh hadoop@host2.chybinmy.com 'cat - >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_dsa.pub | ssh hadoop@host3.chybinmy.com 'cat - >> ~/.ssh/authorized_keys'
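If ssh-copy-id is available, it can replace these cat | ssh pipes; a sketch (use -i to point at the DSA key if that is the one generated above):
[hadoop@host1 ~]$ ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@host2.chybinmy.com
[hadoop@host1 ~]$ ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@host3.chybinmy.com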
Also perform the key distribution on host2 and host3.
To verify that passwordless SSH login works, simply run ssh <hostname>.
5 Distribute Hadoop files
First create a directory to store Hadoop on the other two machines
[hadoop@host2 ~]$ mkdir /opt/modules/app
[hadoop@host3 ~]$ mkdir /opt/modules/app
The share/doc directory under the Hadoop root directory contains the Hadoop documentation and is quite large. It is recommended to delete this directory before distributing with scp; this saves disk space and speeds up the distribution.
[hadoop@host1 hadoop-2.10.1]$ du -sh /opt/modules/app/hadoop-2.10.1/share/doc
[hadoop@host1 hadoop-2.10.1]$ scp -r /opt/modules/app/hadoop-2.10.1/ 192.168.159.158:/opt/modules/app
[hadoop@host1 hadoop-2.10.1]$ scp -r /opt/modules/app/hadoop-2.10.1/ 192.168.159.157:/opt/modules/app
6 Format NameNode
Perform formatting on the NameNode machine (host1):
[hadoop@host1 hadoop-2.10.1]$ /opt/modules/app/hadoop-2.10.1/bin/hdfs namenode -format
Note:
If you need to reformat the NameNode, first delete the dfs/name/current and dfs/data/current directories on every host (they live under the data directory configured as hadoop.tmp.dir in core-site.xml).
The reason is that each format creates a cluster ID by default and writes it into the VERSION files of the NameNode and the DataNodes (under dfs/name/current and dfs/data/current respectively). When reformatting, a new cluster ID is generated; if the original directories are not deleted, the NameNode's VERSION file holds the new cluster ID while the DataNodes still hold the old one, and the mismatch causes an error.
Another approach is to specify the cluster ID as a parameter when formatting and set it to the old cluster ID, as sketched below.
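A sketch of that approach, assuming the old cluster ID has already been read from an existing VERSION file (replace the placeholder with that value; -clusterid is an option of hdfs namenode -format):
[hadoop@host1 hadoop-2.10.1]$ bin/hdfs namenode -format -clusterid <old-cluster-id>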
7 Start the cluster
7.1 Start HDFS
[hadoop@host1 ~]$ /opt/modules/app/hadoop-2.10.1/sbin/start-dfs.sh
[hadoop@host1 ~]$ jps
Run jps on each of the three hosts; if the expected processes are present, the startup was successful.
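Based on the role planning above, after start-dfs.sh the jps output should roughly include the following processes (process IDs omitted):
host1: NameNode, DataNode
host2: DataNode
host3: DataNode, SecondaryNameNode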
To shut down HDFS, use the following command:
[hadoop@host1 ~]$ /opt/modules/app/hadoop-2.10.1/sbin/stop-dfs.sh
7.2 Start YARN
Start YARN on host1:
[hadoop@host1 ~]$ /opt/modules/app/hadoop-2.10.1/sbin/start-yarn.sh
Start the ResourceManager on host2:
[hadoop@host2 hadoop-2.10.1]$ /opt/modules/app/hadoop-2.10.1/sbin/yarn-daemon.sh start resourcemanager
[hadoop@host2 hadoop-2.10.1]$ jps
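If everything matches the plan, jps on host2 should now also list ResourceManager, and once YARN is up a NodeManager should be running on all three hosts. A rough sketch for host2 (process IDs omitted):
host2: DataNode, NodeManager, ResourceManager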
7.3 Start the log server
Because the MapReduce log service is planned to run on the host3 server, it needs to be started on host3:
[hadoop@host3 hadoop-2.10.1]$ /opt/modules/app/hadoop-2.10.1/sbin/mr-jobhistory-daemon.sh start historyserver
7.4 View HDFS web page
http://192.168.159.159:50070/
(Adjust the URL to your own environment; the IP here is host1's IP.)
7.5 View YARN Web page
http://192.168.159.158:8088/cluster
(Adjust the URL to your own environment; the IP here is host2's IP.)
8 Test Job
Here we use the wordcount example that ships with Hadoop to test running MapReduce on the cluster.
8.1 Prepare mapreduce input file wc.input
[hadoop@host1 ~]$ cat /opt/data/wc.input
8.2 Create an input directory input in HDFS
[hadoop@host1 ~]$ cd /opt/modules/app/hadoop-2.10.1
[hadoop@host1 hadoop-2.10.1]$ bin/hdfs dfs -mkdir /input
Upload wc.input to HDFS
[hadoop@host1 hadoop-2.10.1]$ bin/hdfs dfs -put /opt/data/wc.input /input/wc.input
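As a quick check that the upload worked, listing the directory should show the file:
[hadoop@host1 hadoop-2.10.1]$ bin/hdfs dfs -ls /input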
8.3 Run the mapreduce Demo that comes with hadoop
[hadoop@host1 hadoop-2.10.1]$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /input/wc.input /output
8.4 View output files
[hadoop@host1 hadoop-2.10.1]$ bin/hdfs dfs -ls /output
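The word counts themselves land in a part file under /output; for this example the file is typically named part-r-00000, so something like the following should print the results:
[hadoop@host1 hadoop-2.10.1]$ bin/hdfs dfs -cat /output/part-r-00000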