Notes on building a Hadoop 3.2.0 (HA) and YARN (HA) cluster on CentOS 7.4

1. Basic information

Official website   http://hadoop.apache.org/

Quick start   http://hadoop.apache.org/docs/r1.0.4/cn/quickstart.html

Online documentation   http://tool.oschina.net/apidocs/apidoc?api=hadoop

Yibai tutorial https://www.yiibai.com/hadoop/ 

W3Cschool tutorial  https://www.w3cschool.cn/hadoop/?

2. Description of environment and tools

1. Operating system: CentOS 7.4 x64 Minimal 1708

Install 5 virtual machines

NameNode: 2 machines, each with 2 GB of memory and 1 CPU core

DataNode: 3 machines, each with 2 GB of memory and 1 CPU core

2. JDK version: jdk1.8

3. Tool: Xshell 5

4. VMware version: VMware Workstation Pro15

5. Hadoop: 3.2.0

6. ZooKeeper: 3.4.5

3. Installation and deployment (preparation of basic environment)

1. Virtual machine installation (install 5 virtual machines)

Reference https://blog.csdn.net/llwy1428/article/details/89328381 

2. Connect each virtual machine to the network (the network card must be configured on all 5 nodes)

For network card configuration, refer to:

https://blog.csdn.net/llwy1428/article/details/85058028

3. Set the hostname (all 5 nodes need their hostname set)

Set the hostname of each node in the cluster (take the first node, node1.cn, as an example):

[root@localhost ~]# hostnamectl set-hostname node1.cn

The five hostnames used in this cluster are:

node1.cn
node2.cn
node3.cn
node4.cn
node5.cn

4. Install JDK 8 (required on all 5 nodes)

Refer to  https://blog.csdn.net/llwy1428/article/details/85232267

5. Configure the firewall (required on all 5 nodes)

Stop the firewall and disable it at boot:

Stop the firewall : systemctl stop firewalld
Check the status  : systemctl status firewalld
Disable at boot   : systemctl disable firewalld

6. Configure static IP

Take node1.cn as an example (other nodes are omitted):

[root@node1 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

Note: change BOOTPROTO to static, set ONBOOT=yes, and add the IP address entries (the original screenshot marked the changed lines; a sketch follows the reference link below).

Can refer to: https://blog.csdn.net/llwy1428/article/details/85058028
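The original screenshot is omitted here. As a rough sketch, a static configuration in ifcfg-ens33 for node1.cn might look like the following (BOOTPROTO and ONBOOT are changed, the address lines are added; the gateway and DNS values are placeholders and must match your own network):

TYPE=Ethernet
BOOTPROTO=static
NAME=ens33
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.11.131
NETMASK=255.255.255.0
GATEWAY=192.168.11.2
DNS1=192.168.11.2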

7. Configure the hosts file

Take the node1.cn node as an example:

[root@node1 ~]# vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.131 node1.cn
192.168.11.132 node2.cn
192.168.11.133 node3.cn
192.168.11.134 node4.cn
192.168.11.135 node5.cn

8. Install basic tools

[root@node1 ~]# yum install -y vim wget lrzsz tree zip unzip net-tools ntp
[root@node1 ~]# yum update -y (optional)

(Depending on your own network situation, you may need to wait a few minutes)

9. Configure password-free login between nodes

Refer to the specific steps: 

https://blog.csdn.net/llwy1428/article/details/85911160

https://blog.csdn.net/llwy1428/article/details/85641999

10. Increase the open file limit on each node of the cluster

Take the node1.cn node as an example:

[root@node1 ~]# vim /etc/security/limits.conf

Reference:

https://blog.csdn.net/llwy1428/article/details/89389191
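The exact values are in the linked reference; as an illustration only (an assumption, not copied from that article), typical entries added to limits.conf to raise the open-file and process limits look like:

* soft nofile 65536
* hard nofile 65536
* soft nproc  65536
* hard nproc  65536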

11. Configure time synchronization on each node of the cluster

This article uses the Aliyun time server, at the address ntp6.aliyun.com.

Note: if you have a dedicated time server, change the time server's hostname or IP address accordingly. The hostname needs to be mapped in the /etc/hosts file.

Take node1.cn as an example:

Set the system time zone to UTC+8 (the Shanghai time zone):

[root@node1 ~]# timedatectl set-timezone Asia/Shanghai

Stop the ntpd service:

[root@node1 ~]# systemctl stop ntpd.service

Disable the ntpd service at boot:

[root@node1 ~]# systemctl disable ntpd

Set up a scheduled task

[root@node1 ~]# crontab -e

Write the following (synchronize with Alibaba Cloud time server every 10 minutes):

0-59/10 * * * * /usr/sbin/ntpdate ntp6.aliyun.com

Restart the scheduled task service

[root@node1 ~]# /bin/systemctl restart crond.service

Make the cron service start at boot:

[root@node1 ~]# vim /etc/rc.local

Add the following content, then save and exit (:wq):

/bin/systemctl start crond.service
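Note (not in the original post): on CentOS 7, /etc/rc.d/rc.local is not executable by default, so the entry above only takes effect at boot after making the file executable:

[root@node1 ~]# chmod +x /etc/rc.d/rc.local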

The other nodes in the cluster are the same as node1.cn.

Reference https://blog.csdn.net/llwy1428/article/details/89330330 

12. Disable SELinux on each node of the cluster

Take node1.cn as an example:

[root@node1 ~]# vim /etc/selinux/config

Change SELINUX=enforcing to SELINUX=disabled, then save and exit (:wq).
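Equivalently, a non-interactive way to make the same change and switch SELinux off for the current session (a sketch, not from the original post):

[root@node1 ~]# sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config
[root@node1 ~]# setenforce 0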

The other nodes in the cluster are the same as node1.cn.

13. Disable Transparent HugePages on each node of the cluster

Reference   https://blog.csdn.net/llwy1428/article/details/89387744

14. Set the system locale to UTF-8

Take node1.cn as an example:

[root@node1 ~]# echo "export LANG=zh_CN.UTF-8 " >> ~/.bashrc
[root@node1 ~]# source ~/.bashrc

The other nodes in the cluster are the same as node1.cn.

15. Install the database

Note: MariaDB (MySQL) is installed to provide metadata support for Hive, Spark, Oozie, Superset, etc. If you do not use these tools, you do not need to install it.

For the MariaDB (MySQL) installation process, refer to:

https://blog.csdn.net/llwy1428/article/details/84965680

https://blog.csdn.net/llwy1428/article/details/85255621

4. Install and deploy the Hadoop cluster (HA mode)

(Note: During the construction and operation of the cluster, ensure that the time of all nodes in the cluster is synchronized)

1. Create directories and upload files

Note: First configure the basic information on node1.cn, then distribute the configured files to each node, and then perform further configuration

Create a directory /opt/cluster/ on each node

Take node1.cn as an example:

[root@node1 ~]# mkdir /opt/cluster

2. Download (or upload) and extract the file

Download

[root@node1 opt]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

or

Download the file manually: hadoop-3.2.0.tar.gz

Upload the downloaded file hadoop-3.2.0.tar.gz to the /opt/cluster path, and decompress hadoop-3.2.0.tar.gz

Enter the /opt/cluster directory

Extract the archive:

[root@node1 cluster]# tar zxvf hadoop-3.2.0.tar.gz

View directory structure

3. Create several directories under the Hadoop directory

[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/tmp
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/name
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/data
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/journaldata
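Equivalently, the same directories can be created with a single command using brace expansion (a minor convenience, not from the original post):

[root@node1 ~]# mkdir -p /opt/cluster/hadoop-3.2.0/hdfs/{tmp,name,data,journaldata}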

4. Configure hadoop environment variables (add Hadoop environment variable information)

[root@node1 ~]# vim /etc/profile
Append the following at the end:
export HADOOP_HOME="/opt/cluster/hadoop-3.2.0"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save and exit :wq

Make the configuration file effective

[root@node1 ~]# source /etc/profile

Check the version:
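For example, the installed version can be printed with (output omitted here):

[root@node1 ~]# hadoop version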

5. Configure hadoop-env.sh

[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/hadoop-env.sh

Add the following content

export JAVA_HOME=/opt/utils/jdk1.8.0_191
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

6. Configure core-site.xml

[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/core-site.xml
<configuration>
     <property>
           <name>fs.defaultFS</name>
           <value>hdfs://cluster</value>
     </property>
     <property>
           <name>hadoop.tmp.dir</name>
           <value>/opt/cluster/hadoop-3.2.0/hdfs/tmp</value>
     </property>
     <property>
            <name>ha.zookeeper.quorum</name>
            <value>node3.cn:2181,node4.cn:2181,node5.cn:2181</value>
    </property>
</configuration>

7. Edit the file hdfs-site.xml

[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/hdfs-site.xml
<configuration>
	<property>
		<name>dfs.nameservices</name>
		<value>cluster</value>
	</property>
	<property>
		<name>dfs.ha.namenodes.cluster</name>
		<value>nn1,nn2</value>
	</property>
	<property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>

	<property>
		<name>dfs.namenode.rpc-address.cluster.nn1</name>
		<value>node1.cn:8020</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.cluster.nn2</name>
		<value>node2.cn:8020</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.cluster.nn1</name>
		<value>node1.cn:50070</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.cluster.nn2</name>
		<value>node2.cn:50070</value>
	</property>
	<property>
		<name>dfs.namenode.shared.edits.dir</name>
		<value>qjournal://node3.cn:8485;node4.cn:8485;node5.cn:8485/cluster</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/opt/cluster/hadoop-3.2.0/hdfs/name</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/opt/cluster/hadoop-3.2.0/hdfs/data</value>
	</property>
	<property>
		<name>dfs.journalnode.edits.dir</name>
		<value>/opt/cluster/hadoop-3.2.0/hdfs/journaldata</value>
	</property>
	<property>
		<name>dfs.ha.automatic-failover.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>dfs.client.failover.proxy.provider.cluster</name>
		<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
	</property>
	<property>
		<name>dfs.ha.fencing.methods</name>
		<value>sshfence
shell(/bin/true)</value>
	</property>
	<property>
		<name>dfs.ha.fencing.ssh.private-key-files</name>
		<value>/root/.ssh/id_rsa</value>
	</property>
	<property>
 		<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
 		<value>false</value>
	</property>
</configuration>

8. Edit the file mapred-site.xml

[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/mapred-site.xml
<configuration>
       	<property>
           	<name>mapreduce.framework.name</name>
           	<value>yarn</value>
       	</property>
       	<property>
               	<name>yarn.app.mapreduce.am.env</name>
              	<value>HADOOP_MAPRED_HOME=/opt/cluster/hadoop-3.2.0</value>
       	</property>
       	<property>
           	<name>mapreduce.map.env</name>
           	<value>HADOOP_MAPRED_HOME=/opt/cluster/hadoop-3.2.0</value>
      	</property>
      	<property>
        	<name>mapreduce.reduce.env</name>	
        	<value>HADOOP_MAPRED_HOME=/opt/cluster/hadoop-3.2.0</value>
      	</property>
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>node1.cn:10020</value>
	</property>
	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>node1.cn:19888</value>
	</property>
</configuration>

9. Edit the file yarn-site.xml

[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/yarn-site.xml
<configuration>
	<property>
  		<name>yarn.resourcemanager.ha.enabled</name>
  		<value>true</value>
	</property>
	<property>
  		<name>yarn.resourcemanager.cluster-id</name>
  		<value>cluster-yarn</value>
	</property>
	<property>
  		<name>yarn.resourcemanager.ha.rm-ids</name>
  		<value>rm1,rm2</value>
	</property>
	<property>
  		<name>yarn.resourcemanager.hostname.rm1</name>
  		<value>node1.cn</value>
	</property>
	<property>
  		<name>yarn.resourcemanager.hostname.rm2</name>
  		<value>node2.cn</value>
	</property>
	<property>
  		<name>yarn.resourcemanager.webapp.address.rm1</name>
  		<value>node1.cn:8088</value>
	</property>
	<property>
  		<name>yarn.resourcemanager.webapp.address.rm2</name>
  		<value>node2.cn:8088</value>
	</property>
	<property>
 		<name>yarn.resourcemanager.zk-address</name>
  		<value>node3.cn:2181,node4.cn:2181,node5.cn:2181</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>					
	<property>
		<name>yarn.log-aggregation-enable</name>
		<value>true</value>
	</property>
	<property>
		<name>yarn.log-aggregation.retain-seconds</name>
		<value>106800</value>
	</property>
</configuration>

10. Configure the workers file

[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/workers
node3.cn
node4.cn
node5.cn

11. Distribute the entire hadoop-3.2.0 directory to each node

[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node2.cn:/opt/cluster/
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node3.cn:/opt/cluster/
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node4.cn:/opt/cluster/
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node5.cn:/opt/cluster/
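Note (not spelled out in the original post): later steps run hdfs and yarn commands directly on every node, so the environment variables from step 4 must also be present on node2.cn through node5.cn. One way (a sketch) is to append the same two export lines to /etc/profile on each node and reload it, for example on node2.cn:

[root@node2 ~]# vim /etc/profile
[root@node2 ~]# source /etc/profile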

12. Configure and start ZooKeeper

Reference:

https://hunter.blog.csdn.net/article/details/96651537

https://hunter.blog.csdn.net/article/details/85937442

13. Start the JournalNode on the three designated nodes

(Here node3.cn, node4.cn, and node5.cn are used as JournalNodes)

[root@node3 ~]# hdfs --daemon start journalnode
[root@node4 ~]# hdfs --daemon start journalnode
[root@node5 ~]# hdfs --daemon start journalnode

14. Format the NameNode on node1.cn

[root@node1 ~]# hdfs namenode -format

15. Start the NameNode on node1.cn

[root@node1 ~]# hdfs --daemon start namenode

16. On node2.cn, synchronize the NameNode metadata that was formatted on node1.cn

[root@node2 ~]# hdfs namenode -bootstrapStandby

17. Start the NameNode on node2.cn

[root@node2 ~]# hdfs --daemon start namenode

Check the result (the original screenshot is omitted).

18. Stop the services

(1) Stop the NameNode on node1.cn and node2.cn

[root@node1 ~]# hdfs --daemon stop namenode
[root@node2 ~]# hdfs --daemon stop namenode

(2) Stop the JournalNode on node3.cn, node4.cn, and node5.cn

[root@node3 ~]# hdfs --daemon stop journalnode
[root@node4 ~]# hdfs --daemon stop journalnode
[root@node5 ~]# hdfs --daemon stop journalnode

19. Format ZKFC

First start ZooKeeper on node3.cn, node4.cn, and node5.cn.

Reference  https://blog.csdn.net/llwy1428/article/details/85937442

After starting zookeeper, execute on node1.cn:

[root@node1 ~]# hdfs zkfc -formatZK

20. Start the HDFS and YARN services

[root@node1 ~]# /opt/cluster/hadoop-3.2.0/sbin/start-dfs.sh

[root@node1 ~]# /opt/cluster/hadoop-3.2.0/sbin/start-yarn.sh

21. Check the service startup status of each node
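The original screenshots are omitted. One way to verify (a sketch) is to run jps on every node; with the layout used above, node1.cn and node2.cn should show NameNode, DFSZKFailoverController, and ResourceManager, while node3.cn, node4.cn, and node5.cn should show DataNode, NodeManager, JournalNode, and QuorumPeerMain (ZooKeeper):

[root@node1 ~]# jps
[root@node3 ~]# jps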


At this point, the Hadoop (HA) cluster on CentOS 7.4 has been built and is operational.

5. Basic shell operations

(1) Create a directory in hdfs

[root@node1 ~]# hdfs dfs -mkdir /hadoop
[root@node1 ~]# hdfs dfs -mkdir /hdfs
[root@node1 ~]# hdfs dfs -mkdir /tmp

(2) List the directory

[root@node2 ~]# hdfs dfs -ls /

(3) Upload files

For example: create a file test.txt in the /opt directory and write a few words into it (a sketch follows), then upload it to HDFS:
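A quick way to create such a file (a minimal sketch; any text content will do):

[root@node3 ~]# echo "hello hadoop hello hdfs hello yarn" > /opt/test.txt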

[root@node3 ~]# hdfs dfs -put /opt/test.txt /hadoop

View the uploaded file:

[root@node4 ~]# hdfs dfs -ls /hadoop
[root@node4 ~]# hdfs dfs -cat /hadoop/test.txt

(4) Delete files

[root@node5 ~]# hdfs dfs -rm /hadoop/test.txt
Deleted /hadoop/test.txt

6. View the UI pages of some services in the browser

1. View HDFS information

Use the IP addresses of node1.cn and node2.cn respectively:

http://<node1.cn IP>:50070

http://<node2.cn IP>:50070

Other pages: omitted.

2. View ResourceManager information

Enter:

http://<node1.cn IP>:8088

or

http://<node2.cn IP>:8088

 

Other pages: omitted.

7. Run the MapReduce wordcount example

Take the test.txt above as an example (if it was deleted in the previous step, upload it to /hadoop again first):

[root@node5 ~]# /opt/cluster/hadoop-3.2.0/bin/yarn jar /opt/cluster/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /hadoop /hadoop/output

View the execution result in the browser (the job and its result appear in the ResourceManager UI; screenshots omitted).

View the execution result on the command line:

[root@node5 ~]# hdfs dfs -cat /hadoop/output/part-r-00000

Origin blog.csdn.net/llwy1428/article/details/94467687