Hadoop fully distributed platform construction
Construction steps:
1. Static IP configuration
2. JDK installation, cloning the virtual machine
3. Modify the hostname of each virtual machine and add hostname mappings
4. Configure SSH password-free login
5. Configure the time synchronization service
6. Hadoop installation (performed on master)
7. Distribution of the Hadoop folder
8. Cluster startup
Preface:
A fully distributed Hadoop cluster requires multiple virtual machines, and installing and configuring each one separately is tedious. Instead, we can create one virtual machine in VMware, complete the common base configuration, and then make full clones of it, which is much more efficient.
A fully distributed Hadoop cluster is a typical master-slave architecture: one master node and multiple slave nodes. Here I use three virtual machines, one as the master node and the other two as the slave1 and slave2 nodes.
Required installation packages: the JDK installation package and the Hadoop installation package (the versions I use are included in the resources).
Unless otherwise noted, all operations are performed in Xshell.
Installation layout: I create a unified export folder under the root directory, containing two folders: servers (the software installation directory) and softwares (the installation package storage directory). Installing under /usr/local/ works as well.
mkdir -p /export/servers
mkdir -p /export/softwares
1. Static IP configuration
After the CentOS 7 installation completes, the network card is not started by default. Run
ifconfig
or
ip a
to view the network card name, then edit the corresponding configuration file:
vi /etc/sysconfig/network-scripts/ifcfg-ens33
Change BOOTPROTO to static:
BOOTPROTO=static
Change ONBOOT on the last line to yes:
ONBOOT=yes
Add the following lines:
IPADDR=<your IP address>
NETMASK=<your subnet mask>
GATEWAY=<your gateway IP>
DNS1=8.8.8.8
DNS2=8.8.4.4
If you can ping an external site, the static IP configuration succeeded:
ping www.qq.com
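As a reference, here is what a filled-in ifcfg-ens33 might look like. The IP and netmask are the example master-node values used later in this guide, and the gateway is an assumed value for a typical VMware NAT network; replace all three with your own network's settings. The sketch writes to a temporary file purely for illustration; on the real machine the target is /etc/sysconfig/network-scripts/ifcfg-ens33.

```shell
# Sample static-IP configuration for the master node.
# IPADDR/NETMASK match the example addresses used in this guide;
# GATEWAY is an assumed VMware NAT value -- use your own network's gateway.
# Written to a temp file here instead of the real ifcfg-ens33.
cat > /tmp/ifcfg-ens33.sample <<'EOF'
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.200.200
NETMASK=255.255.255.0
GATEWAY=192.168.200.2
DNS1=8.8.8.8
DNS2=8.8.4.4
EOF
grep '^IPADDR=' /tmp/ifcfg-ens33.sample
```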
2. Installation of jdk
* Upload the JDK installation package to the softwares folder, then extract it into the servers folder:
cd /export/softwares/
rz
Select the JDK archive and upload it to the current directory, then extract it and rename the extracted directory:
tar -zxvf jdk-8u161-linux-x64.tar.gz -C ../servers/
mv ../servers/jdk1.8.0_161 ../servers/jdk
If the rz command reports an error, install lrzsz first and then run rz again:
yum -y install lrzsz
- Configure the JDK environment variables
Add the following environment variables at the end of the /etc/profile file:
export JAVA_HOME=/export/servers/jdk
export PATH=$PATH:$JAVA_HOME/bin
Save and exit.
-
Reload the configuration file to make the environment variables configured just now take effect.
source /etc/profile
-
Check whether the configuration is successful:
Enter java -version; if the JDK version information appears, the installation and configuration succeeded.
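Note that `java -version` prints to stderr, not stdout, so piping its output requires `2>&1`. A small sketch of pulling the version number out of the output; the sample line below is an assumed output for jdk-8u161, parsed directly so the sketch does not depend on java being installed:

```shell
# `java -version` writes to stderr; to pipe it you would use:
#   java -version 2>&1 | head -n 1
# Here we parse an assumed sample line for jdk-8u161 instead.
sample='java version "1.8.0_161"'
# Extract the quoted version string.
ver=$(echo "$sample" | sed -n 's/.*"\(.*\)".*/\1/p')
echo "$ver"
```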
The static IP configuration and JDK installation above are needed on every machine, so after one machine is configured successfully, clone it directly (note that a full clone must be created). Since all three machines will then have the same IP address, you need to modify the IP of the two cloned virtual machines in the configuration file. After the modification, restart the network service and check whether the machines can ping each other; if the pings succeed, the IP changes took effect.
Restart the network service:
systemctl restart network
3. Modify the hostname of the virtual machine and add a mapping
* Edit the /etc/hostname file, delete the default first line, change it to
master
and restart the virtual machine for the hostname change to take effect.
In the same way, change the hostnames of the other two virtual machines to
slave1
and
slave2
and restart them.
* Edit the /etc/hosts file and add the following content (note: change the IP addresses to your own):
192.168.200.200 master
192.168.200.201 slave1
192.168.200.202 slave2
Add the same content on the other two machines.
To verify that the change succeeded, check whether the machines can ping each other by hostname. For example, execute the following commands on any machine:
ping master
ping slave1
ping slave2
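Since the three example addresses are consecutive, the mapping lines can also be generated instead of typed by hand. A small sketch, using the 192.168.200.x example subnet from above (adjust the base and starting octet to your own network); it writes to a temp file rather than /etc/hosts:

```shell
# Generate the three /etc/hosts mapping lines from a base address.
# Base subnet and starting octet match the example addresses above;
# change them for your own network. Output goes to a temp file here.
base="192.168.200"
octet=200
for name in master slave1 slave2; do
  echo "$base.$octet $name"
  octet=$((octet + 1))
done > /tmp/hosts.append
cat /tmp/hosts.append
```

On the real machines the generated lines would be appended to /etc/hosts on each node.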
4. Configure SSH password-free login
-
Check whether SSH is already installed (CentOS 7 installs it by default):
rpm -qa | grep ssh
If output like the following appears, SSH is already installed:
openssh-7.4p1-21.el7.x86_64
libssh2-1.8.0-3.el7.x86_64
openssh-clients-7.4p1-21.el7.x86_64
openssh-server-7.4p1-21.el7.x86_64
If it is not installed, install it manually:
yum -y install openssh-server
yum -y install openssh-clients
Tip: if openssh-clients is not installed, the ssh and scp commands will fail with a "command not found" error.
-
Test whether SSH is available (the IP address is that of the target machine to log in to, i.e. a slave node):
ssh 192.168.200.201
Enter the target machine's login password when prompted. If the login succeeds, SSH is available; then execute the following command to return to the original host:
exit
-
Generate a key pair:
ssh-keygen
Press Enter at each prompt to accept the defaults. Note: sometimes you may need to answer
yes
or
no
instead.
-
On the master node, perform the password-free login setup for all nodes, including the master itself:
ssh-copy-id -i ~/.ssh/id_rsa.pub master
ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub slave2
Since the master needs to start services on the slave nodes, it must be able to log in to them, which is why the above commands are executed on the master node.
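The three ssh-copy-id commands can also be wrapped in a loop. The sketch below is a dry run that only prints the commands (drop the `echo` to actually distribute the key):

```shell
# Dry run: print the ssh-copy-id command for each node instead of
# executing it. Remove `echo` to perform the real key distribution.
cmds=$(for node in master slave1 slave2; do
  echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $node"
done)
echo "$cmds"
```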
5. Configure Time Synchronization Service
-
Check if ntp service is installed
rpm -qa | grep ntp
Install ntp service if not installed
yum -y install ntp
-
Set the master node as the NTP time server (that is, the master node's time is authoritative).
Edit the /etc/ntp.conf file, comment out the lines that start with server, and add the following:
restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap
server 127.127.1.0
fudge 127.127.1.0 stratum 10
-
Configure the slave nodes' time synchronization (they get the time from the master node).
Modify the /etc/ntp.conf file on the slave1 and slave2 nodes as well: comment out the lines that start with server and add the following line:
server master
Set the slave nodes to synchronize time with the master (the time server) every 10 minutes:
crontab -e
Add the scheduled task:
*/10 * * * * /usr/sbin/ntpdate master
-
Start NTP service
-
Start the ntp service on the master node and enable it at boot:
service ntpd start
chkconfig ntpd on
-
Manually synchronize the time once on the slave node (slave1, slave2)
ntpdate master
-
Start the ntp service on the slave nodes (slave1, slave2) and enable it at boot:
service ntpd start
chkconfig ntpd on
-
Check whether the ntp server has connected to its upstream ntp source:
ntpstat
This command may show "unsynchronised" at first. That is normal: after the configuration is done, it takes a while before the node synchronizes with the time source configured in /etc/ntp.conf.
-
View the status of the ntp server and its upstream source:
ntpq -p
Field descriptions:
when: seconds since the last time synchronization
poll: seconds until the next update
reach: number of update requests already sent to the upstream ntp server
delay: network delay
offset: time offset value
jitter: difference between the system time and the BIOS time
-
Test whether the configuration is successful
Modify the time on any node:
date -s "2011-11-11 11:11:11"
Wait ten minutes and check whether the time has been synchronized (to save time during the experiment, the 10-minute interval can be shortened to 1 minute):
date
Extension: if you need to keep the clock synchronized with an external network time source, set up a scheduled task (in fact, the virtual machine's time is already synchronized with network time). Here I use the time server of Alibaba Cloud. ---- Operate on the master
Open the crontab editor:
crontab -e
Add the following line (the five leading fields stand for minute, hour, day, month, and weekday):
*/1 * * * * /usr/sbin/ntpdate ntp4.aliyun.com
6. Hadoop installation (operation on master)
-
Unzip the Hadoop installation package
Go to the softwares directory, then:
tar -zxvf hadoop-2.7.2.tar.gz -C ../servers/
cd ../servers/
mv hadoop-2.7.2 hadoop
-
Configure environment variables
Add the following lines to /etc/profile:
export HADOOP_HOME=/export/servers/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Remember to reload the file after saving:
source /etc/profile
-
Modify the configuration file
All configuration files are in the following path:
hadoop/etc/hadoop/
slaves
This configuration file stores the slave node information, i.e. the machine names of the slave nodes. Delete localhost and replace it with:
slave1
slave2
core-site.xml
Hadoop core configuration file
Add the following content between <configuration></configuration>:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/export/servers/hadoop/tmp</value>
</property>
The fs.defaultFS property specifies the default file system; its value is the hdfs scheme with the master node host and port.
The hadoop.tmp.dir property specifies the directory where Hadoop's temporary data is stored; the default is the Linux tmp directory.
hdfs-site.xml
HDFS-related configuration files
Add the following content between <configuration></configuration>:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/export/servers/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/export/servers/hadoop/tmp/dfs/data</value>
</property>
The dfs.replication property sets the number of replicas of each data block; the default is 3. Here we set it to 1 for convenience of testing.
The dfs.namenode.name.dir property sets the storage directory of the NameNode data.
The dfs.datanode.data.dir property sets the storage directory of the DataNode data.
mapred-site.xml
MapReduce related configuration
The default file name is mapred-site.xml.template; rename it to mapred-site.xml and then edit mapred-site.xml:
mv mapred-site.xml.template mapred-site.xml
Add the following content between <configuration></configuration>:
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
The mapreduce.framework.name property sets the framework on which MapReduce programs run. The default value is local, meaning they run locally; here we set it to yarn so that MapReduce programs run on the YARN framework.
yarn-site.xml
YARN framework configuration
Add the following content between <configuration></configuration>:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
The yarn.resourcemanager.hostname property specifies which node the ResourceManager runs on.
The yarn.nodemanager.aux-services property specifies YARN's auxiliary shuffle service; here it is set to MapReduce's default shuffle implementation.
hadoop-env.sh
This file holds the basic environment configuration for running Hadoop and is used when the automated startup scripts run.
On line 25 of the file, remove the leading comment and change the path to your own JAVA_HOME path:
export JAVA_HOME=/export/servers/jdk
-
At this point, the Hadoop configuration is complete. Note that these are only the settings required for a normal startup!
Note: for more detailed configuration information, visit the Hadoop official website. After opening it, scroll to the bottom of the page; under Configuration in the lower left corner there are links to the detailed documentation for each configuration file.
7. Distribution of Hadoop folders
After all configuration files have been configured on the master node, distribute the Hadoop folder to the other two slave nodes:
scp -r /export/servers/hadoop slave1:/export/servers/
scp -r /export/servers/hadoop slave2:/export/servers/
8. Cluster startup
Be sure to turn off the firewall before starting the cluster, and note that the firewall must be turned off on every machine!
Check the firewall status:
systemctl status firewalld
Stop the firewall:
systemctl stop firewalld
Disable the firewall at boot (this turns it off permanently; it can still be started manually later if needed):
systemctl disable firewalld
Start the firewall:
systemctl start firewalld
-
Format the NameNode before the first startup (no formatting is needed on later startups):
hdfs namenode -format
If you forget to format and start the cluster anyway, you need to stop all NameNode and DataNode processes, delete the data and log directories, and then reformat.
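A dry-run sketch of that cleanup, to make the paths concrete: the data location follows from hadoop.tmp.dir configured in core-site.xml above (/export/servers/hadoop/tmp), and the log location assumes Hadoop's default log directory under $HADOOP_HOME/logs. The loop only prints the commands; drop the `echo` to run them for real.

```shell
# Dry run: print the pre-reformat cleanup command for each node.
# Paths assume hadoop.tmp.dir=/export/servers/hadoop/tmp (set above)
# and the default log directory $HADOOP_HOME/logs.
out=$(for node in master slave1 slave2; do
  echo "ssh $node rm -rf /export/servers/hadoop/tmp /export/servers/hadoop/logs"
done)
echo "$out"
```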
-
There are two ways to start the cluster: manually, or with the automated scripts.
-
Manual start
1. Start HDFS:
Start the NameNode (on the master node):
hadoop-daemon.sh start namenode
Start the DataNode (on the slave nodes):
hadoop-daemon.sh start datanode
Start the SecondaryNameNode (on any node):
hadoop-daemon.sh start secondarynamenode
2. Start YARN:
Start the ResourceManager (on the master node):
yarn-daemon.sh start resourcemanager
Start the NodeManager (on the slave nodes):
yarn-daemon.sh start nodemanager
3. Start the history server:
mr-jobhistory-daemon.sh start historyserver
-
Automated script startup (executed on the master node)
1. Start HDFS:
start-dfs.sh
2. Start YARN:
start-yarn.sh
3. Start the history server:
mr-jobhistory-daemon.sh start historyserver
Note: there is also a start-all.sh script, but it is generally not recommended because it is error-prone.
-
View process
jps
If you see the following processes on the master, the startup succeeded:
[root@master ~]# jps
2016 ResourceManager
2353 Jps
1636 NameNode
1845 SecondaryNameNode
2310 JobHistoryServer
If you see the following processes on a slave, the startup succeeded:
[root@slave1 ~]# jps
1554 DataNode
1830 Jps
1671 NodeManager
Description:
NameNode and SecondaryNameNode are the HDFS processes on the master, and DataNode is the HDFS process on the slaves; the presence of these three processes means HDFS started successfully.
ResourceManager is the YARN process on the master, and NodeManager is the YARN process on the slaves; the presence of these two processes means YARN started successfully.
JobHistoryServer is the history service process.
-
View webpage
After the cluster starts, you can enter the IP address and port number in a browser to visit the web UIs and view detailed cluster information.
View HDFS cluster details:
192.168.200.200:50070
View YARN cluster details:
192.168.200.200:8088
View historyserver (history service) details:
192.168.200.200:19888
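The three addresses differ only in port (50070 for HDFS, 8088 for YARN, 19888 for the JobHistory server in Hadoop 2.x). A small sketch that builds the full URLs from the master IP; 192.168.200.200 is the example master address from this guide, so substitute your own:

```shell
# Build the three web UI URLs from the master address and the
# default Hadoop 2.x ports: 50070 (HDFS), 8088 (YARN), 19888 (JobHistory).
ip="192.168.200.200"   # example master address from this guide
urls=$(for port in 50070 8088 19888; do
  echo "http://$ip:$port"
done)
echo "$urls"
```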
-
Shutting down the processes
To stop the services, change start to stop in the corresponding startup commands.
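The start/stop symmetry is mechanical, which a one-line substitution illustrates; the history-server command from above serves as the sample:

```shell
# Each stop command is just the start command with `start` -> `stop`.
start_cmd="mr-jobhistory-daemon.sh start historyserver"
stop_cmd=$(echo "$start_cmd" | sed 's/ start / stop /')
echo "$stop_cmd"
```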