Hands-on Hadoop: Build a Fully Distributed Hadoop Platform for Big Data

Hadoop fully distributed platform construction

Construction steps:

1. Static IP configuration

2. jdk installation, clone virtual machine

3. Modify the hostname of the virtual machine and add a mapping

4. Configure SSH password-free login

5. Configure Time Synchronization Service

6. Hadoop installation (operation on master)

7. Distribution of Hadoop folders

8. Cluster startup

[Preface]

A fully distributed Hadoop cluster requires multiple virtual machines, and installing and configuring each one separately is tedious. Therefore, we can create a single virtual machine in VMware, complete the common base configuration, and then make full clones of it directly. This is more efficient.

A fully distributed Hadoop cluster is a typical master-slave architecture, that is, one master node and multiple slave nodes. Here I use three virtual machines: one as the master node, and the other two as the slave1 and slave2 nodes.

Required installation packages: the jdk installation package and the Hadoop installation package (the versions I use are included in the resources).

Unless otherwise noted, all operations are performed in Xshell.

Installation recommendation: I create a unified export folder in the root directory, and inside it create two folders: servers (where software is installed) and softwares (where installation packages are stored). Alternatively, you can install under /usr/local/.

mkdir -p /export/servers
mkdir -p /export/softwares

1. Static IP configuration

After the CentOS 7 installation completes, the network card is not enabled by default. First run ifconfig or ip a to find the card name, then modify the corresponding configuration file.
vi /etc/sysconfig/network-scripts/ifcfg-ens33
Change BOOTPROTO to static:
BOOTPROTO=static
Change ONBOOT on the last line to yes:
ONBOOT=yes
Add the following (fill in your own values):
IPADDR=<your IP address>
NETMASK=<subnet mask>
GATEWAY=<gateway IP>
DNS1=8.8.8.8
DNS2=8.8.4.4
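
For reference, here is a sample completed ifcfg-ens33 for the master, assuming the 192.168.200.200 address used later in this guide; the gateway value is an assumption (check VMware's Virtual Network Editor for yours):

TYPE=Ethernet
BOOTPROTO=static
NAME=ens33
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.200.200
NETMASK=255.255.255.0
GATEWAY=192.168.200.2
DNS1=8.8.8.8
DNS2=8.8.4.4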

If you can ping an external site, the static IP configuration succeeded:

ping www.qq.com

2. Installation of jdk

* Upload the jdk installation package to the softwares folder and extract it to the servers folder. (Note: extract the archive first, then rename the extracted directory; renaming the tarball itself would leave JAVA_HOME pointing at nothing.)
	cd /export/softwares/
	rz
	# select the jdk archive and upload it to the current directory
	tar -zxvf jdk-8u161-linux-x64.tar.gz -C ../servers/
	cd ../servers/
	mv jdk1.8.0_161 jdk

If the rz command reports an error, run the following installation first and then retry:
	
	yum -y install lrzsz
  • Configure jdk environment variables

Append the following environment variables to the end of the /etc/profile file:

export JAVA_HOME=/export/servers/jdk
export PATH=$PATH:$JAVA_HOME/bin

Exit after saving.

  • Reload the configuration file to make the environment variables configured just now take effect.

      source /etc/profile
    
  • Check whether the configuration is successful:

Run java -version; if the jdk version information appears, the installation and configuration succeeded.
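
With jdk-8u161 the output should look roughly like the following (exact build strings may differ):

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)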


The static IP configuration and jdk installation above are needed on every machine, so once one machine is configured successfully, clone the virtual machine directly (note that you must create a full clone). The three machines now share the same IP address, so you need to modify the IP of the two cloned virtual machines in their configuration files. After the modification, restart the network service and check whether the machines can ping each other; if the pings succeed, the IP changes took effect.
Restart the network service:
systemctl restart network
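
For example, on the first clone (which becomes slave1 in the next step), only the IPADDR line needs to change:

vi /etc/sysconfig/network-scripts/ifcfg-ens33
# change only this line:
IPADDR=192.168.200.201
# then restart networking:
systemctl restart network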

3. Modify the hostname of the virtual machine and add a mapping

* Edit the /etc/hostname file: delete the default first line, replace it with master, and restart the virtual machine for the hostname change to take effect.
Modify the other two virtual machines' hostnames the same way, naming them slave1 and slave2, and restart them. * Edit the /etc/hosts file and add the following content (change the IP addresses to your own):
	192.168.200.200 master
	192.168.200.201 slave1
	192.168.200.202 slave2

The same content is added on the other two machines.

To verify the change, check whether the machines can ping each other by hostname. For example, execute the following commands on any machine:

ping master
ping slave1
ping slave2

4. Configure SSH password-free login

  • Check whether SSH is already installed (CentOS 7 installs it by default)

      rpm -qa | grep ssh 
    

    The following output means it is already installed:

      openssh-7.4p1-21.el7.x86_64
      libssh2-1.8.0-3.el7.x86_64
      openssh-clients-7.4p1-21.el7.x86_64
      openssh-server-7.4p1-21.el7.x86_64
    

    If it is not installed, you need to install it manually:

      yum -y install openssh-server
      yum -y install openssh-clients
    

    Tip: if openssh-clients is not installed, the ssh and scp commands will report a "command not found" error.

  • Test whether SSH is available (the IP address is that of the target machine to log in to, i.e. a child node)

      ssh 192.168.200.201
    

    Follow the prompt to enter the target machine's login password. After a successful login, SSH is available; then execute the following command to return to the original host:

      exit
    
  • Generate a key pair

      ssh-keygen
    

    This process asks for confirmation several times; just press Enter each time. Note: sometimes you need to answer yes or no.

  • On the master node, set up password-free login to all nodes, including the master itself:

      ssh-copy-id -i ~/.ssh/id_rsa.pub master
      ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
      ssh-copy-id -i ~/.ssh/id_rsa.pub slave2
    

    Since the master needs to start services on the slave nodes, it must be able to log in to them; that is why the above commands are executed on the master node.
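
    A quick check (my addition): from the master, log in to a slave; no password prompt should appear.

      ssh slave1
      hostname    # should print slave1
      exit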

5. Configure Time Synchronization Service

  1. Check if ntp service is installed

     rpm -qa | grep ntp
    

    Install ntp service if not installed

     yum -y install ntp
    
  2. Set the master node as the NTP time server (that is, the master node's time is authoritative)

    Edit the /etc/ntp.conf file, comment out the lines that start with server, and add the following code (the restrict subnet must match your cluster's network; with the 192.168.200.x addresses used in this guide, that is 192.168.200.0):

     restrict 192.168.200.0 mask 255.255.255.0 nomodify notrap
     server 127.127.1.0
     fudge 127.127.1.0 stratum 10
    
  3. Configure node time synchronization (get the time from the master node)

    Modify the /etc/ntp.conf file on the slave1 and slave2 nodes as well: comment out the lines that start with server, and add the following code:

     server master
    

    Set the child nodes to synchronize time with the master (the time server) every 10 minutes

     crontab -e
     # add the scheduled task:
     */10 * * * * /usr/sbin/ntpdate master
    
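    You can confirm the entry was saved with:

     crontab -l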
  4. Start NTP service

    1. Start the ntp service on the master node and add self-start at boot

       service ntpd start && chkconfig ntpd on
      
    2. Manually synchronize the time once on the slave node (slave1, slave2)

       ntpdate master
      
    3. Start the ntp service on the slave node (slave1, slave2) and add the self-start at boot

       service ntpd start && chkconfig ntpd on
      
    4. Check whether the ntp server has connected to its upstream ntp server

       ntpstat
       You may see "unsynchronised" here; that is normal. After the configuration is complete, it takes a while before the node synchronizes with the time source configured in /etc/ntp.conf.
      
    5. View the status of the ntp server and its upstream ntp server

       ntpq -p
       
       Field meanings:
       when:   seconds since the last time synchronization
       poll:   seconds until the next update
       reach:  how many times the upstream ntp server has been polled for updates
       delay:  network delay
       offset: time offset between the local clock and the server
       jitter: dispersion (variability) of the measured offset
      
  5. Test whether the configuration is successful

    Modify the time on any node:

     date -s "2011-11-11 11:11:11" 
    

    Wait for ten minutes and check whether the time is synchronized (10 minutes can be adjusted to 1 minute during the experiment to save time)

     date
    

Extension: if you need to keep the clock synchronized with external (Internet) time, set up a scheduled task (in fact, the virtual machine's time is already synchronized with network time). Here I use an Alibaba Cloud time server. ---- Operate on the master

	Edit the scheduled tasks:
	crontab -e

	Add the following line (the fields are: minute hour day month weekday):
	*/1 * * * * /usr/sbin/ntpdate ntp4.aliyun.com
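
	You can run the command once by hand first to confirm the server is reachable. Note that ntpdate cannot run while the local ntpd service is active; if you see "the NTP socket is in use", stop ntpd first:

	service ntpd stop
	/usr/sbin/ntpdate ntp4.aliyun.com
	service ntpd start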

6. Hadoop installation (operation on master)

  1. Unzip the Hadoop installation package

     cd /export/softwares/
     tar -zxvf hadoop-2.7.2.tar.gz -C ../servers/
     cd ../servers/
     mv hadoop-2.7.2 hadoop
    
  2. Configure environment variables

    Append the following to the /etc/profile file:

     export HADOOP_HOME=/export/servers/hadoop
     export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    

    Remember to run source /etc/profile after saving.
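
    A quick sanity check (my addition): if the variables took effect, the hadoop command should now resolve.

     hadoop version
     # the first line of output should be: Hadoop 2.7.2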

  3. Modify the configuration file

    All configuration files are in the following path:

     hadoop/etc/hadoop/
    

    slaves

    This configuration file stores the slave node information, that is, the hostnames of the slave machines. Delete localhost and replace it with the following:

     slave1
     slave2
    

    core-site.xml

    Hadoop core configuration file

    Add the following content between <configuration></configuration>:

     <property>
     	<name>fs.defaultFS</name>
     	<value>hdfs://master:9000</value>
     </property>
     <property>
     	<name>hadoop.tmp.dir</name>
     	<value>file:/export/servers/hadoop/tmp</value>
     </property>
    

    The fs.defaultFS property specifies the default file system; its value is the HDFS address and port of the master node.

    The hadoop.tmp.dir property specifies the directory where Hadoop stores temporary data; the default is the Linux tmp directory.
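
    One way to confirm Hadoop reads this value (a check I am adding; it works once the environment variables are loaded):

     hdfs getconf -confKey fs.defaultFS
     # expected output: hdfs://master:9000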

    hdfs-site.xml

    HDFS-related configuration files

    Add the following content between <configuration></configuration>:

     <property>
     	   <name>dfs.replication</name>
     	   <value>1</value>
     </property>
     <property>
     	  <name>dfs.namenode.name.dir</name>
     	  <value>file:/export/servers/hadoop/tmp/dfs/name</value>
     </property>
     <property>
    		 <name>dfs.datanode.data.dir</name>
    		 <value>file:/export/servers/hadoop/tmp/dfs/data</value>
     </property>
    

    The dfs.replication property sets the number of replicas of each data block; its default value is 3. Here we set it to 1 for the convenience of testing.

    The dfs.namenode.name.dir property sets the directory where the NameNode stores its data.

    The dfs.datanode.data.dir property sets the directory where the DataNode stores its data.

    mapred-site.xml

    MapReduce related configuration

    Rename the file: the default file name is mapred-site.xml.template; rename it to mapred-site.xml before editing:

     mv mapred-site.xml.template mapred-site.xml
    

    Add the following content between <configuration></configuration>:

     <property>
     	<name>mapreduce.framework.name</name>
     	<value>yarn</value>
     </property>
    

    The mapreduce.framework.name property sets the framework that MapReduce programs run on. The default value is local, meaning run locally; here we set it to yarn so that MapReduce programs run on the YARN framework.

    yarn-site.xml

    YARN framework configuration

    Add the following content between <configuration></configuration>:

     <property>
    		 <name>yarn.resourcemanager.hostname</name>
    		 <value>master</value>
     </property>
     <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
     </property>
    

    The yarn.resourcemanager.hostname property specifies which node the ResourceManager runs on.

    The yarn.nodemanager.aux-services property specifies YARN's auxiliary service; setting it to mapreduce_shuffle enables MapReduce's default shuffle mechanism.

    hadoop-env.sh

    This file holds the basic environment configuration for running Hadoop; it is used when the automated startup scripts run.

     export JAVA_HOME=/export/servers/jdk
    

    On line 25 of the file, remove the leading comment if present, and change the path to your own JAVA_HOME.
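
    For reference, a before/after sketch of that line (in the stock Hadoop 2.7.2 file it reads JAVA_HOME from the environment):

     # before:
     export JAVA_HOME=${JAVA_HOME}
     # after:
     export JAVA_HOME=/export/servers/jdk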

  4. At this point, the Hadoop configuration is complete. Note that these are only the settings required for a normal startup!

Note: for more detailed configuration information, visit the Hadoop official website; after opening the documentation, scroll to the bottom of the page, where links to each configuration file's reference appear under Configuration in the lower left corner.

7. Distribution of Hadoop folders

After configuring all the configuration files on the master node, distribute the Hadoop folder to the other two child nodes:

scp -r /export/servers/hadoop slave1:/export/servers/
scp -r /export/servers/hadoop slave2:/export/servers/
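
If you want to run hadoop commands directly on the slaves, the environment variables in /etc/profile need to exist there too. One option (my addition, not in the original steps):

scp /etc/profile slave1:/etc/profile
scp /etc/profile slave2:/etc/profile
# then run on each slave: source /etc/profile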

8. Cluster startup

Be sure to turn off the firewall before starting the cluster. Note that the firewall on every machine must be turned off!

Check the firewall status:
systemctl status firewalld

Stop the firewall:
systemctl stop firewalld

Disable the firewall at boot (permanently off; you can still start it manually later if needed):
systemctl disable firewalld

Start the firewall:
systemctl start firewalld
  • Format the NameNode before the first start (you do not need to format it for later starts)

      hdfs namenode -format
    

    If you forget to format and start the cluster anyway, you need to stop all NameNode and DataNode processes, delete the data and log directories, and then reformat.
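
    If the formatting succeeds, the output should contain a line like the following (path per this guide's configuration):

      ... Storage directory /export/servers/hadoop/tmp/dfs/name has been successfully formatted.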

  • There are two ways to start the cluster: manually, or with the automated scripts

    1. Manual start

       1. Start HDFS:
       
       Start the NameNode (on the master node):
       hadoop-daemon.sh start namenode
      
       Start the DataNode (on the slave nodes):
       hadoop-daemon.sh start datanode
      
       Start the SecondaryNameNode (on any node):
       hadoop-daemon.sh start secondarynamenode
      
      
       2. Start YARN:
      
       Start the ResourceManager (on the master node):
       yarn-daemon.sh start resourcemanager
      
       Start the NodeManager (on the slave nodes):
       yarn-daemon.sh start nodemanager
      
       3. Start the job history service:
      
       mr-jobhistory-daemon.sh start historyserver
      
    2. Automated script startup (executed on the master node)

       1. Start HDFS:
       
       start-dfs.sh
      
       2. Start YARN:
      
       start-yarn.sh
      
       3. Start the job history service:
      
       mr-jobhistory-daemon.sh start historyserver
      
      
       Note: there is also a start-all.sh script, but it is generally not recommended because it is error-prone.
      
  • View process

      jps
    

    If you see the following process on the master, the startup is successful:

      [root@master ~]# jps
      2016 ResourceManager
      2353 Jps
      1636 NameNode
      1845 SecondaryNameNode
      2310 JobHistoryServer
      [root@master ~]# 
    

    If you see the following process on the slave, the startup is successful:

      [root@slave1 ~]# jps
      1554 DataNode
      1830 Jps
      1671 NodeManager
      [root@slave1 ~]# 
    

    Description:

    NameNode and SecondaryNameNode are the HDFS processes on the master, and DataNode is the HDFS process on the slaves. The presence of these three processes means HDFS started successfully.

    ResourceManager is the YARN process on the master, and NodeManager is the YARN process on the slaves. The presence of these two processes means YARN started successfully.

    JobHistoryServer is the job history service process.

  • View webpage

    After the cluster is started, you can enter the IP address and port number in a browser to visit the corresponding UI page and view detailed cluster information.

      View HDFS cluster details:
      192.168.200.200:50070
    
      View YARN cluster details:
      192.168.200.200:8088
    
      View historyserver job history details:
      192.168.200.200:19888
    
  • Shutdown process

    To stop the services, change start to stop in the corresponding startup commands.
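
    For example, with the automated scripts:

      stop-yarn.sh
      stop-dfs.sh
      mr-jobhistory-daemon.sh stop historyserver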


Origin blog.csdn.net/qq_45796486/article/details/115272321