Using VMware to build a cluster system and a Hadoop cluster

Construction of the cluster system

1. Matters needing attention

1. Make sure all VMware services on the Windows host have been started.
2. Confirm the gateway address generated by VMware.
Viewing steps: in VMware, go to Edit --> Virtual Network Editor --> VMnet8 --> NAT Settings.
3. Confirm that the VMnet8 network adapter has been configured with an IP address and DNS (domain name server).
Viewing steps: right-click the network (Wi-Fi) icon in the lower-right corner of the screen --> Open Network & Internet settings --> Ethernet --> Change adapter options --> find VMnet8, right-click it and select Properties --> double-click Internet Protocol Version 4 --> select "Use the following IP address" (in essence, configure the adapter's IP address manually).

2. Copy the virtual machine

First install one virtual machine, then make several copies of that virtual machine's folder and rename each copy.
Enter each copied folder and open the file with the .vmx extension to register the copy as a new virtual machine.
Pay attention to memory allocation: taking a host with 16 GB of memory and 3 virtual machines as an example, (16 - 4) / 3 = 4, so each virtual machine gets about 4 GB and 4 GB is left for the host system.

3. Set a static IP address for each virtual machine

Use the command below to open the virtual machine's network interface configuration file:

vim /etc/sysconfig/network-scripts/ifcfg-ens33

The configured file is as follows:
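A minimal sketch of what the file might look like after editing is shown below. The specific values are illustrative assumptions: the 192.168.28.x address matches the hosts used later in this post, the gateway 192.168.28.2 is a typical VMnet8 NAT gateway, and the DNS servers are common public ones. Replace them with the values from your own NAT Settings dialog.

TYPE=Ethernet
NAME=ens33
DEVICE=ens33
# obtain the IP address statically instead of via DHCP
BOOTPROTO=static
# bring this interface up at boot
ONBOOT=yes
# static IP of this node (assumed; use an address in your own VMnet8 subnet)
IPADDR=192.168.28.147
NETMASK=255.255.255.0
# assumed VMnet8 NAT gateway; read the real value from NAT Settings
GATEWAY=192.168.28.2
# example public DNS servers
DNS1=114.114.114.114
DNS2=8.8.8.8
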
Next, I will explain the added and modified parts:
The BOOTPROTO attribute has two common options, dhcp and static:
1. dhcp means that CentOS 7 obtains its IP address dynamically; using dhcp requires that the router (here, the VMware NAT service) has DHCP enabled.
2. static means that CentOS 7 uses a statically configured IP address.
The ONBOOT attribute is off by default in CentOS 7 and must be changed to yes manually so that the interface connects to the network.
IPADDR is the static IP address the machine will use.
NETMASK is the subnet mask.
GATEWAY is the gateway IP; it can be read from the NAT Settings dialog mentioned in point 2 of the notes above.
DNS1 is the first domain name server (any public DNS server found online will do).
DNS2 is the second domain name server.
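After saving the file, the new settings do not take effect until the network service is restarted. A quick way to apply and check them on CentOS 7 (assuming the interface is named ens33, as above):

systemctl restart network      # reload the network configuration
ip addr show ens33             # confirm the static IP address is in place
ping -c 3 www.baidu.com        # confirm that the gateway and DNS settings work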

4. Modify the hostname of each virtual machine

Here are a few commands that can be used to work with the hostname:

hostname                                  # with no arguments, shows the current hostname
hostname <new-hostname>                   # temporarily changes the hostname
hostnamectl                               # shows host information
hostnamectl set-hostname <new-hostname>   # permanently changes the hostname

Here I set the hostnames for my three virtual machines as node01, node02, and node03 respectively (the following commands are run in three different virtual machines)

hostnamectl set-hostname node01
hostnamectl set-hostname node02
hostnamectl set-hostname node03

5. Set up IP and domain name mapping for each virtual machine

Use the following command to open the /etc/hosts file

vim /etc/hosts

Add the following statement in the file:

192.168.28.147 node01 node01.hadoop.com
192.168.28.148 node02 node02.hadoop.com
192.168.28.149 node03 node03.hadoop.com

To explain the statements above: each line contains, in order, the host's IP address, its hostname, and its domain name. The domain name is optional and can be omitted.
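As a quick check that the mapping works (assuming /etc/hosts has been edited on every node), the other nodes can be pinged by name:

ping -c 1 node02      # should resolve to 192.168.28.148
ping -c 1 node03      # should resolve to 192.168.28.149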

6. Turn off the firewall and SELinux

The commands for turning off the firewall are as follows:

systemctl stop firewalld      # stops the firewall, but only until the next reboot
systemctl disable firewalld   # permanently disables the firewall

The method to disable SELinux is as follows:

vim /etc/selinux/config
SELINUX=disabled # disable SELinux

Here is a brief explanation of what SELinux is. SELinux is a security subsystem of Linux. Ordinary Linux permission management applies to files rather than processes, so if the root user starts a process, that process can operate on any file: since it was started by the root user, it effectively has root's permissions, which is a serious security risk. SELinux is therefore used to add restrictions on processes, so that a process can only operate on resources within an allowed range.
Next, the three values that SELINUX can take, i.e. its three operating modes:
1. enforcing, the enforcing mode (operations that violate the SELinux rules are blocked outright, and the behavior is recorded in the log)
2. permissive, the permissive mode (operations that violate the SELinux rules are not blocked, but the behavior is recorded in the log)
3. disabled, which turns SELinux off
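Since the change in /etc/selinux/config only takes effect after a reboot, SELinux can also be switched off for the current session and its state checked right away:

setenforce 0      # switch SELinux to permissive mode immediately (lasts until reboot)
getenforce        # print the current mode: Enforcing, Permissive, or Disabled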

7. Set up password-free login between the virtual machines

This is done to avoid the trouble of entering a password every time one node starts services on another node.
The principle of password-free SSH login:
1. The public key of node A is first configured on node B.
2. Node A sends a login request to node B.
3. Node B encrypts a piece of random text with node A's public key.
4. Node A decrypts it with its private key and sends the result back to node B.
5. Node B verifies whether the text is correct.
Next, the steps to set up password-free login between the virtual machines:
1. Run the following command on each of the three nodes to generate a public/private key pair on every node:

ssh-keygen -t rsa # generate a public/private key pair using the RSA algorithm

2. Run the following command on each of the three hosts to copy its public key to a single host; here node01 is used:

ssh-copy-id node01

3. On the host that has collected the public keys of all three hosts (node01), run the following commands to copy the combined authorized_keys file to the other hosts:

scp /root/.ssh/authorized_keys node02:/root/.ssh
scp /root/.ssh/authorized_keys node03:/root/.ssh

Note that because I use the root user directly to build the cluster, the keys are also generated by the root user in root's home directory, so the .ssh directory is /root/.ssh on every node. If you are not using the root user, the .ssh directory is /home/<username>/.ssh instead.
4. Check whether a known_hosts file has been generated in the .ssh directory of each node; this file stores the hosts that have already been recognized.
5. Use ssh <hostname> to log in to the other nodes without a password.
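As a quick check (run from node01, for example), logging in to another node should now work without a password prompt:

ssh node02      # should log in to node02 without asking for a password
hostname        # prints node02, confirming which machine you are on
exit            # return to node01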

8. Set up clock synchronization

This is done to prevent the applications running on the three nodes from misbehaving in unpredictable ways because their clocks are out of sync. As a simple example, when using HBase, if the time difference between the nodes is too large, HBase will fail. Here is how to synchronize the clocks:
1. Install ntp

yum -y install ntp

2. Use a scheduled (cron) task to keep the clock synchronized.
Execute the following command:

crontab -e

When the vim editor opens, enter the following line and save it; the clock will then be synchronized every minute:

*/1 * * * * /usr/sbin/ntpdate ntp4.aliyun.com
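To confirm that the job is in place and that the time server is reachable (assuming ntp4.aliyun.com is reachable from your network), the following can be run:

crontab -l                            # list scheduled tasks; the ntpdate line should appear
/usr/sbin/ntpdate ntp4.aliyun.com     # run one synchronization manually and print the offset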

Hadoop cluster construction

1. Download and unzip Hadoop

If you do not yet have the Hadoop archive, you can use the following command to download it:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

If the Hadoop archive has already been downloaded on Windows, you can use the rz command to upload it to CentOS 7. This command is installed and used as follows:

yum -y install lrzsz # install the rz command
rz # open the upload dialog to transfer a file from Windows to Linux

The command to decompress Hadoop is as follows:

tar -zxf hadoop-2.9.2.tar.gz -C /usr/local # extract Hadoop into /usr/local

2. Download and unzip the JDK

If you do not have a JDK archive, you can run the following command to install it online:

yum -y install java-1.8.0-openjdk*

After installation, the files are generally stored under /usr/lib/jvm . If you want to confirm where they were installed, you can use the following command:

ls -l /etc/alternatives

Find the java soft link; the location it points to is where the files were installed.
If the JDK archive was instead uploaded via the rz command, it is usually extracted to /usr/lib as follows:

tar -zxf jdk-8u271-linux-x64.tar.gz -C /usr/lib

Note that the two JDK packages used for the demonstrations here are different; everything that follows uses java-1.8.0-openjdk*

3. Configure system environment variables

Use the following command to open the configuration file of system environment variables:

vim /etc/profile

Add the following content at the end of the file. (You can type G in command mode to jump directly to the last line. The following lines were typed by hand, so there may be a stray typo, but this is the general idea.)

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop-2.9.2
export PATH=$PATH:$HADOOP_HOME/bin

Save and exit, then use the following command to make the changes take effect:

. /etc/profile # or use source /etc/profile; . is shorthand for source

Check whether the Hadoop environment variables were configured successfully. (The reason for configuring them is that, once set, the hadoop command can be run from any directory without entering the Hadoop installation directory and using ./ to run it.) Simply type the hadoop command; if it prints its usage information, the configuration succeeded.
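A quick way to check both the JDK and the Hadoop environment variables from any directory:

java -version      # should print the installed JDK version
hadoop version     # should print "Hadoop 2.9.2" followed by build information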

4. Configure the JAVA_HOME environment variable in the relevant Hadoop files

Use the following command to enter the Hadoop configuration directory:

cd /usr/local/hadoop-2.9.2/etc/hadoop

Open each of the three files below with vim and add a statement at the end of the file:

vim hadoop-env.sh
vim mapred-env.sh
vim yarn-env.sh

Statements added at the end of each file:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64

5. Configure HDFS

Use vim to open the core-site.xml file (it is in the same directory as the files above) and
add the following content:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://node01:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:/usr/local/hadoop-2.9.2/tmp</value>
        </property>
</configuration>

To explain the above statements:
1. fs.defaultFS : the default access path of HDFS, i.e. the access address of the NameNode.
2. hadoop.tmp.dir : the base directory for Hadoop's data files. If this parameter is not configured it defaults to /tmp , and the /tmp directory is automatically emptied on reboot.
Use vim to modify the hdfs-site.xml file and add the following content:

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property><!-- do not check user permissions -->
                <name>dfs.permissions.enabled</name>
                <value>false</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/usr/local/hadoop-2.9.2/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/usr/local/hadoop-2.9.2/tmp/dfs/data</value>
        </property>
</configuration>

To explain the above statements:
1. dfs.replication : the number of replicas kept of each file in HDFS.
2. dfs.namenode.name.dir : where the NameNode stores its data in the local file system.
3. dfs.datanode.data.dir : where the DataNode stores its data in the local file system.
4. dfs.permissions.enabled : whether to check user permissions.
Then open the slaves file with vim and add the following content (the hostnames):

node01
node02
node03

Note that if you use a Hadoop 3.x version, modify the workers file instead and add the same content.

6. Configure YARN

The files to be modified are still in the same directory as above.

First rename the mapred-site.xml.template file to mapred-site.xml as follows (in Hadoop 3.x the file already has this name by default, so there is no need to rename it):

mv mapred-site.xml.template mapred-site.xml

Then modify the file and add the following statement:

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

To explain the meaning of the statement: it specifies YARN as the execution framework for MapReduce jobs.
Then open the yarn-site.xml file and add the following content:

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>node01:8032</value>
        </property>
</configuration>

To explain the above content:
1. yarn.nodemanager.aux-services : an auxiliary service that runs on the NodeManager. It must be configured as mapreduce_shuffle , otherwise MapReduce programs cannot run. YARN provides this configuration item for extending custom services on the NodeManager, and the shuffle step of MapReduce is implemented as such an extended service (why this is needed will be covered in a later post on MapReduce).
2. yarn.resourcemanager.address : the node and port on which the ResourceManager listens; the default port is 8032. Here it specifies that the ResourceManager runs on node01. If this property is not set, the ResourceManager starts by default on whichever node executes the YARN start command (start-yarn.sh).

7. Copy the Hadoop installation files to other hosts

scp -r /usr/local/hadoop-2.9.2 node02:/usr/local/
scp -r /usr/local/hadoop-2.9.2 node03:/usr/local/

Copying the configured installation directly saves the trouble of repeating the configuration on the other two nodes.
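Note that the environment variables added to /etc/profile also need to exist on node02 and node03 (and a JDK must be installed on each node as well). One way to do this, as a sketch, is to copy the profile over and then re-source it (or simply log in again) on each node:

scp /etc/profile node02:/etc/profile
scp /etc/profile node03:/etc/profile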

8. Format the NameNode

Before starting Hadoop, the NameNode needs to be formatted. Its purpose is to initialize some directories and files in the HDFS file system. We execute the following commands on node01 to format:

hadoop namenode -format

After the format succeeds, the /usr/local/hadoop-2.9.2/tmp/dfs/name/current directory is created, and an fsimage file storing the metadata of the HDFS file system is generated in that directory.
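A quick way to confirm this, assuming the hadoop.tmp.dir configured above:

ls /usr/local/hadoop-2.9.2/tmp/dfs/name/current   # should list fsimage and related files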

9. Start Hadoop

Execute the following command on the node01 node to start the Hadoop cluster; the command is located in /usr/local/hadoop-2.9.2/sbin

./start-all.sh

The following describes what happens if certain items are not configured:
1. If the node for the SecondaryNameNode is not configured, it starts by default on the node where the HDFS start command (start-dfs.sh) is executed (a sketch of how to pin it to a specific node is shown after this list).
2. If the node for the ResourceManager is not configured, it starts by default on the node where the YARN start command (start-yarn.sh) is executed; if the ResourceManager node is configured, YARN must be started on that node, otherwise an exception is thrown when it is started on another node.
3. The NodeManager needs no configuration; it runs on the same nodes as the DataNodes so that tasks can take advantage of data locality. That is, every node with a DataNode also gets a NodeManager.
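As an illustration of point 1, the SecondaryNameNode can be pinned to a particular node with the dfs.namenode.secondary.http-address property in hdfs-site.xml. This was not part of the original setup; node02 and the default port 50090 are only an example:

<property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node02:50090</value>
</property>
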
When starting Hadoop 3.x I ran into a problem: the following error was reported at startup:

ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.

I will write down the solution here for now; since I have not yet tracked down the underlying cause, it is not very detailed, and I will leave that to the next post.
Modify the following four files:
1. For the start-dfs.sh and stop-dfs.sh files, add the following lines just before the first non-comment line:

HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

2. For the start-yarn.sh and stop-yarn.sh files, add the following lines just before the first non-comment line:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Then run ./start-all.sh again.

10. View the processes started on each node

[root@node01 sbin]# jps
15088 NodeManager
14786 SecondaryNameNode
1987 QuorumPeerMain
14617 DataNode
14941 ResourceManager
15374 Jps
14479 NameNode

[root@node02 ~]# jps
15697 QuorumPeerMain
20611 Jps
20360 DataNode
20475 NodeManager

[root@node03 ~]# jps
20548 DataNode
20663 NodeManager
20793 Jps
15789 QuorumPeerMain

Note that if DataNode does not appear in the jps output, it may be because, after running hadoop namenode -format a second time, the old data under /usr/local/hadoop-2.9.2/tmp/dfs/data/current was not deleted on each node; the DataNodes then fail to start because the cluster ID they stored no longer matches the newly formatted NameNode. Delete that directory on every node before reformatting.
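A sketch of the recovery steps under that assumption (the paths follow the configuration above; run the deletion on every node):

/usr/local/hadoop-2.9.2/sbin/stop-all.sh                # stop the cluster first
rm -rf /usr/local/hadoop-2.9.2/tmp/dfs/data/current     # remove the stale DataNode data on each node
hadoop namenode -format                                 # reformat the NameNode (on node01 only)
/usr/local/hadoop-2.9.2/sbin/start-all.sh               # start the cluster again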


Origin blog.csdn.net/myWarren/article/details/109278438