1. Virtual machine configuration
Versions: CentOS 7 installed under VMware Workstation 15, image: CentOS-7-x86_64-Minimal-1908; Hadoop 2.10.0, JDK 1.8.0_241
1.1 Install CentOS system
It is not recommended to install the latest version of CentOS.
Note: This article uses VMware. If you use VirtualBox instead, everything is the same except for slightly different virtual machine network settings.
1.2 Introduction to network connection
1. Bridged mode: the virtual machine and the physical machine are connected to the same network as peers of equal status. Any two systems, virtual or real, can ping each other as long as they are on the same network segment.
2. NAT mode: the physical machine acts as a "router". The virtual machine reaches the Internet through the physical machine, so if the physical machine is offline, the virtual machine is too. No extra configuration is required beyond the physical machine having Internet access. The virtual machine can communicate in both directions only with its own host; it cannot reach other real hosts on the same network segment.
3. Host-only mode: equivalent to connecting the physical machine and the virtual machine directly with a network cable. All virtual systems can communicate with each other, but they are isolated from the real network.
1.3 Virtual machine network configuration
1. Open the Virtual Network Editor in VMware, select VMnet8 (NAT) mode, click NAT Settings, and configure the gateway address (the default is fine).
2. Open Control Panel, Network and Sharing Center, Change adapter settings, open the properties of VMnet8, and configure the TCP/IPv4 properties.
3. Keep the gateway address consistent with VMware, and do not assign an IP address that is already in use.
4. Log in to the CentOS system, enter /etc/sysconfig/network-scripts, and edit the NAT network interface configuration file:
vi ifcfg-ens33
5. Configure the following variables:
BOOTPROTO=static        # disable DHCP (optional; does not affect the result)
ONBOOT=yes              # bring the interface up at boot
IPADDR=192.168.223.100  # this virtual machine's IP address
GATEWAY=192.168.223.2   # gateway address
DNS1=192.168.223.2      # DNS; adjust for your own network
NETMASK=255.255.255.0   # subnet mask
1.4 Restart the network service so the configuration takes effect
service network restart
or
systemctl restart network
1.5 Network verification
View the IP address in CentOS:
ip a
View the IP address in Windows (run cmd):
ipconfig
If all of the following succeed, the network is configured correctly:
- Virtual machine ping gateway
- Virtual machine ping physical host
- Physical host ping virtual machine
- Virtual machine pings Baidu (if external addresses cannot be pinged, try a different DNS1 value in the network configuration file from the previous step)
1.6 Turn off the CentOS firewall
In a distributed cluster, communication between nodes is hindered by the firewall, so turn it off (security is handled by dedicated software at the network perimeter and managed centrally)
systemctl stop firewalld.service
Prevent the firewall from starting at boot:
systemctl disable firewalld.service
If there is a network problem, check whether the network configuration file is wrong
vi /etc/sysconfig/network-scripts/ifcfg-ens33
1.7 Install XShell, XFtp
XShell and XFtp are paid software; a free educational version can be downloaded from the English official website (the Chinese official website does not offer it)
After installation, create a new connection, enter the IP address of the master virtual machine to connect to, then enter the user name and password
After logging in successfully, use XShell instead of the virtual machine console for shell operations
2. Basic environment configuration
2.1 Switch user to root user
su root
2.2 Configure clock synchronization
Install ntpdate online, synchronize the time against the Alibaba Cloud NTP server, and use the date command to check the current time:
yum install ntpdate
ntpdate ntp.aliyun.com
date
If you cannot connect to the external network, install ntpdate offline; the package can be downloaded from https://pkgs.org/
rpm -ivh <path-to-the-ntpdate-package>
ntpdate ntp.aliyun.com
date
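ntpdate syncs the clock only once per invocation, so the clock will drift again between manual runs. One common remedy is a cron entry that reruns it periodically. A sketch of such an entry (the file name, schedule, and the /usr/sbin/ntpdate path are assumptions; verify the path with `which ntpdate`):

```shell
# /etc/cron.d/ntpdate-sync (sketch): resync the clock against the
# Alibaba Cloud NTP server at the top of every hour, as root.
0 * * * * root /usr/sbin/ntpdate ntp.aliyun.com
```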
2.3 Configure the host name
A host name uniquely identifies the host on the network; like the IP address, the host can then be reached either by IP or by name, which is simpler and more convenient.
Modify the host name:
hostnamectl set-hostname node1
View the modified host name:
hostname
2.4 Configure the hosts list
The hosts list lets each server in the cluster map the others' host names to their IP addresses
vi /etc/hosts
Add each host's IP address and host name:
192.168.223.100 node1
192.168.223.101 node2
192.168.223.102 node3
Verification: pinging the IP address and pinging the host name should give identical results
ping 192.168.223.100
ping node1
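The three entries above must be identical on every node. A minimal sketch that appends them only when missing, so rerunning it never duplicates lines (the `add_hosts` helper is hypothetical; the addresses and names follow this article):

```shell
# add_hosts: append "ip hostname" lines to a hosts file only if the exact
# line is not already present, so the script is safe to run repeatedly.
add_hosts() {
    file="$1"; shift
    for entry in "$@"; do
        grep -qxF "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
    done
}

# On the master and every slave (run as root):
# add_hosts /etc/hosts "192.168.223.100 node1" "192.168.223.101 node2" "192.168.223.102 node3"
```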
2.5 Install Java environment
Create a personal data directory plus the java and hadoop directories:
mkdir /usr/wallasunrui
mkdir /usr/java
mkdir /usr/hadoop
Use XFtp to copy the Java installation package to the wallasunrui directory, unpack it, and move it to the java directory:
tar -zxvf jdk-8u241-linux-x64.tar.gz
mv jdk1.8.0_241 /usr/java/jdk1.8.0_241
Enter the system configuration file
vi /etc/profile
Add the following two lines at the end of the file
export JAVA_HOME=/usr/java/jdk1.8.0_241
export PATH=$JAVA_HOME/bin:$PATH
Make the configuration effective:
source /etc/profile
java -version
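A quick sanity check that the two export lines landed correctly: JAVA_HOME must point at a directory that actually contains an executable bin/java. A sketch (the `check_java_home` helper is hypothetical):

```shell
# check_java_home: succeed only if the given path is non-empty and
# contains an executable bin/java.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if check_java_home "$JAVA_HOME"; then
    echo "JAVA_HOME looks good: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or wrong: '$JAVA_HOME'" >&2
fi
```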
2.6 Install Hadoop environment
Use XFtp to upload the Hadoop installation package to the wallasunrui folder, unpack it, and move it to a new folder:
tar -zxvf hadoop-2.10.0.tar.gz
mv hadoop-2.10.0 /usr/hadoop/hadoop-2.10.0
Configure Hadoop environment variables
vi /etc/profile
Add the following two lines at the end of the configuration file
export HADOOP_HOME=/usr/hadoop/hadoop-2.10.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Make the Hadoop configuration take effect and verify it:
source /etc/profile
hadoop version
whereis hdfs
If no errors are reported, Hadoop has been successfully added to the CentOS system environment
2.7 Binding Hadoop to Java (many students forget this step)
Set up hadoop configuration file
cd /usr/hadoop/hadoop-2.10.0/etc/hadoop
vi hadoop-env.sh
Find the following line of code:
export JAVA_HOME=${JAVA_HOME}
Modify this line of code to
export JAVA_HOME=/usr/java/jdk1.8.0_241
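The same edit can be scripted, which is handy when it has to be repeated on every node. A sketch using sed (the `set_java_home` helper is hypothetical; the paths follow this article, and it is worth trying on a copy of the file first):

```shell
# set_java_home: replace the "export JAVA_HOME=..." line in a
# hadoop-env.sh file, in place, with the given JDK path.
set_java_home() {
    sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$2|" "$1"
}

# On each node:
# set_java_home /usr/hadoop/hadoop-2.10.0/etc/hadoop/hadoop-env.sh /usr/java/jdk1.8.0_241
```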
3. Configure Hadoop
3.1 Hadoop core file configuration
Enter Hadoop's etc/hadoop folder, open the core-site.xml file, and add the following content
<configuration>
<!-- entry address of the file system; may be a host name or an IP -->
<!-- the default port is 8020 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:8020</value>
</property>
<!-- hadoop's temporary working directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/hadoop-2.10.0/hadoop-data</value>
</property>
</configuration>
Configure the yarn-env.sh file and find this line:
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
Remove the leading # and amend it to:
export JAVA_HOME=/usr/java/jdk1.8.0_241
Configure the hdfs-site.xml file and add the following content:
<configuration>
<!-- HDFS replication factor; must not exceed the number of slave nodes -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- custom storage location for the NameNode in HDFS -->
<!-- <property>-->
<!-- <name>dfs.namenode.name.dir</name>-->
<!-- <value>file:/usr/hadoop/dfs/name</value>-->
<!-- </property>-->
<!-- custom storage location for the DataNode in HDFS -->
<!-- <property>-->
<!-- <name>dfs.datanode.data.dir</name>-->
<!-- <value>file:/usr/hadoop/dfs/data</value>-->
<!--</property>-->
</configuration>
mapred-site.xml does not exist by default; generate it from the template file with the cp command:
cp mapred-site.xml.template mapred-site.xml
Edit the mapred-site.xml file and add the following content:
<configuration>
<!-- run Hadoop's MapReduce programs on YARN -->
<!-- the default value is local -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configure the yarn-site.xml file
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<!-- the NodeManager fetches map output via shuffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Modify the slaves file: delete the original content and add the host names of the DataNode nodes. The list is up to you; here I use three.
node1
node2
node3
3.2 Clone multiple slave machines
Clone node1 (create full clones) to produce the slave virtual machines node2 and node3; the number of clones can be adjusted to your computer's capacity. Then repeat the following on each slave:
- Modify the hostname, in the same way as 2.3:
hostnamectl set-hostname node2
- Enter /etc/sysconfig/network-scripts and modify each slave's IP address, in the same way as 1.3.
3.3 Optional: synchronize the Hadoop configuration (only needed when the master and slave configurations differ)
Send the hadoop folder from the master machine to the slave machines:
scp -r /usr/hadoop node2:/usr/hadoop
scp -r /usr/hadoop node3:/usr/hadoop
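With more slaves the scp commands multiply, so a small loop keeps them in one place. A sketch (the `sync_hadoop` helper and its DRY_RUN switch are hypothetical; node names follow this article, and DRY_RUN=1 prints the commands instead of running them):

```shell
# sync_hadoop: copy a directory from the master to every listed slave,
# placing it at the same path on each. With DRY_RUN=1, only print the
# scp commands that would run.
sync_hadoop() {
    src="$1"; shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp -r $src $host:$src"
        else
            scp -r "$src" "$host:$src"
        fi
    done
}

# On node1:
# sync_hadoop /usr/hadoop node2 node3
```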
4. Configure SSH login
4.1 Generate public key and private key
On the master and each slave, generate the public and private keys with the RSA algorithm (press Enter to accept each prompt)
ssh-keygen -t rsa
View the generated private key id_rsa and public key id_rsa.pub
cd /root/.ssh/
ls
4.2 Send public key
On the master, append the public key to a shared authorized_keys file, fix its permissions, restart sshd, and send the file to each slave
cat id_rsa.pub >> authorized_keys
chmod 644 authorized_keys
systemctl restart sshd.service
scp /root/.ssh/authorized_keys node2:/root/.ssh
scp /root/.ssh/authorized_keys node3:/root/.ssh
4.3 Introduction to Linux file permissions
This is background knowledge, not a Hadoop configuration step.
- Digits 1-3 represent the permissions of the file owner
- Digits 4-6 represent the permissions of users in the same group
- Digits 7-9 represent the permissions of other users
- Read permission equals 4, written r
- Write permission equals 2, written w
- Execute permission equals 1, written x
444 r--r--r--
600 rw-------
644 rw-r--r--
666 rw-rw-rw-
700 rwx------
744 rwxr--r--
755 rwxr-xr-x
777 rwxrwxrwx
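The table above is just addition on each digit: 6 = 4+2 = rw-, 5 = 4+1 = r-x. A tiny sketch that expands a three-digit octal mode into its symbolic form (the `mode_to_rwx` helper is hypothetical, for illustration only):

```shell
# mode_to_rwx: expand a 3-digit octal mode (e.g. 644) into rwx notation
# by testing the read (4), write (2) and execute (1) bit of each digit.
mode_to_rwx() {
    out=""
    for i in 0 1 2; do
        d=$(printf '%s' "$1" | cut -c$((i + 1)))
        [ $((d & 4)) -ne 0 ] && out="${out}r" || out="${out}-"
        [ $((d & 2)) -ne 0 ] && out="${out}w" || out="${out}-"
        [ $((d & 1)) -ne 0 ] && out="${out}x" || out="${out}-"
    done
    printf '%s\n' "$out"
}

mode_to_rwx 644   # rw-r--r--
mode_to_rwx 755   # rwxr-xr-x
```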
4.4 Verify SSH
Verify the SSH login: passwordless login should now work from any directory, not just '~/.ssh' (log out with exit)
ssh node1
ssh node2
exit
ssh node3
exit
Note: this only establishes one-way passwordless login, from the master to the slaves!
5. Ready to run hadoop
5.1 Format HDFS
On the master machine, enter the bin folder under hadoop and run the following code:
hdfs namenode -format
Note: format only once! Formatting repeatedly makes the cluster IDs of the NameNode and DataNodes diverge; before formatting again, you must delete the NameNode, DataNode, and log folders.
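If a re-format is unavoidable, the stale directories have to go first on every node. A sketch matching this article's layout, where hadoop.tmp.dir is hadoop-data under the installation directory (the `clean_hadoop_state` helper is hypothetical; double-check the paths before running, since rm -rf is unforgiving):

```shell
# clean_hadoop_state: remove the HDFS working data (which holds the
# NameNode and DataNode state) and the logs under an installation dir.
clean_hadoop_state() {
    base="$1"
    rm -rf "$base/hadoop-data" "$base/logs"
}

# On every node, then re-run `hdfs namenode -format` on the master only:
# clean_hadoop_state /usr/hadoop/hadoop-2.10.0
```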
5.2 Start hadoop
Enter the sbin folder in hadoop and run:
start-dfs.sh
start-yarn.sh
5.3 View hadoop process
jps
5.4 Accessing hadoop through the web
View NameNode, DataNode:
http://192.168.223.100:50070
View SecondaryNameNode information:
http://192.168.223.100:50090
View YARN interface:
http://192.168.223.100:8088