Building a Hadoop cluster on Linux

  Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the details of the underlying distribution, making full use of the cluster's power for high-speed computation and storage. Hadoop implements a distributed file system, HDFS (Hadoop Distributed File System). HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes some POSIX requirements and allows streaming access to data in the file system. The core of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over it.

1. Create 4 virtual machines

Create one virtual machine and clone three more from it; here everything is installed with the defaults.

When I created the machines, the username was vm1; if yours is different, adjust the later commands accordingly.


2. Modify the host name

Method 1: temporary change, lost after a reboot:
hostname vm2
This changes the host name to vm2.
Method 2: edit the file; takes effect after a reboot (all 4 hosts should be configured this way):
vi /etc/hostname
Then restart with reboot (you can skip the reboot for now, configure step 4, and then reboot).

3. Configure the network

All 4 machines must be configured!

Command: vim /etc/sysconfig/network-scripts/ifcfg-ens33
Then restart the network: service network restart

Add the following configuration (each machine uses its own IPADDR; see the sketch after the block):

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="cce078fd-5c83-4f0a-a39f-ddaae8c9a1c3"
DEVICE="ens33"
ONBOOT="yes"

IPADDR=192.168.122.101
GATEWAY=192.168.122.2
DNS1=8.8.8.8
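
Each machine needs its own IPADDR, matching the hosts file in step 4 (vm1→101, vm2→102, vm3→103, vm4→104). A minimal sketch of what changes on vm2, with every other line staying the same:

IPADDR=192.168.122.102
GATEWAY=192.168.122.2
DNS1=8.8.8.8

On cloned machines you may also want to remove or regenerate the UUID line so the clones do not share the same interface UUID.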


4. Configure the hosts file

All 4 machines must be configured!

Command: vim /etc/hosts
Then reboot.

The configuration is as follows:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.122.101 vm1
192.168.122.102 vm2
192.168.122.103 vm3
192.168.122.104 vm4
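
After editing /etc/hosts, a quick sanity check is to ping the other nodes by name, for example:

ping -c 1 vm2
ping -c 1 vm3
ping -c 1 vm4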


5. Assign the local network to the virtual machines

Configuration reference:

Test external connectivity: ping www.baidu.com

6. Download the JDK and Hadoop packages

I downloaded JDK 1.8 and Hadoop 3.3.5.


7. Use Xftp to transfer the packages to the virtual machine

If you don't have Xftp you can download it, or use a shared folder instead; any way of getting the packages onto the Linux machine is fine (a scp sketch follows the mkdir commands below).

First create the folders:
mkdir /usr/java
mkdir /home/hadoop
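
If you prefer the command line over Xftp, a rough sketch using scp from the machine where you downloaded the packages (the file names and the root login are assumptions; adjust them to your setup):

scp jdk-8u202-linux-x64.tar.gz root@192.168.122.101:/usr/java/
scp hadoop-3.3.5.tar.gz root@192.168.122.101:/home/hadoop/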

Transfer complete.

8. Configure the JDK

Extract it: tar -zxvf /usr/java/jdk-8u202-linux-x64.tar.gz -C /usr/java/
Configure the environment variables: vim /etc/profile.d/my_env.sh

The configuration is as follows:

#JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.8.0_202
export PATH=$PATH:$JAVA_HOME/bin

Reload the profile to make the configuration take effect immediately: source /etc/profile
Check whether the configuration succeeded: java -version


9. Configure Hadoop

Extract it: tar -zxvf /home/hadoop/hadoop-3.3.5.tar.gz -C /home/hadoop/
Configure the environment variables: vim /etc/profile.d/my_env.sh
Append the following configuration:

#HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop-3.3.5
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin


Reload the profile to make the configuration take effect immediately: source /etc/profile
Check whether the configuration succeeded: hadoop


10. Create a shell script to simplify data synchronization

mkdir /home/vm1/bin (vm1 is the username)
vim /home/vm1/bin/xsync

Enter the following:

#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
	echo "Not Enough Arguments!"
	exit
fi
# 2. Loop over every machine in the cluster
for host in vm1 vm2 vm3 vm4
do
	echo ============================  $host  ============================
	# 3. Loop over all files/directories and send them one by one
	for file in "$@"
	do
		# 4. Check whether the file exists
		if [ -e "$file" ]
		then
			# 5. Get the parent directory
			pdir=$(cd -P "$(dirname "$file")"; pwd)

			# 6. Get the file name
			fname=$(basename "$file")
			ssh "$host" "mkdir -p $pdir"
			rsync -av "$pdir/$fname" "$host:$pdir"
		else
			echo "$file does not exist!"
		fi
	done
done


Make the script executable: chmod 777 /home/vm1/bin/xsync


Test: xsync /home/vm1/bin/xsync

This synchronizes the file to the other three machines. Since passwordless SSH login is not configured yet, you have to enter each machine's password; just type them as prompted. Passwordless SSH login is configured later.


Go to another machine and check whether the file was synchronized:
ls /home/vm1/bin


The other machines have received the file. From now on, whenever we want the other machines to pick up files from this machine, we can run this script directly.

11. Configure passwordless SSH login

Hadoop requires this to be configured. Once it is, you can log in to the other machines directly without entering a password.

Execute the following commands on vm1:
Generate a key pair: ssh-keygen -t rsa (press Enter three times)
Copy the public key to vm1: ssh-copy-id vm1 (type yes and enter the password)
Copy the public key to vm2: ssh-copy-id vm2 (type yes and enter the password)
Copy the public key to vm3: ssh-copy-id vm3 (type yes and enter the password)
Copy the public key to vm4: ssh-copy-id vm4 (type yes and enter the password)

Test the passwordless login:
Log in: ssh vm2
Log out: exit

  At this point only vm1 can log in to the other machines without a password. If you want all machines to log in to each other without passwords, repeat the steps above once on each machine (the small loop sketched below saves some typing).
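
As a convenience, a minimal sketch of that repetition: after a machine has generated its own key pair with ssh-keygen -t rsa, run the following loop on it.

for host in vm1 vm2 vm3 vm4
do
	ssh-copy-id "$host"
done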

  After it succeeds, go back and test again: the earlier script no longer asks for a password, which makes synchronizing data very convenient.

xsync /home/vm1/bin/xsync


12. Synchronize the JDK and Hadoop to the other machines

On vm1, execute the synchronization command:
xsync /home/hadoop/hadoop-3.3.5 /usr/java/jdk1.8.0_202 /etc/profile.d/my_env.sh

A large number of files are synchronized the first time, so please wait patiently.

After it finishes, check the result on another machine:

ls /usr/java/
ls /home/hadoop/


Reload the environment to make the variables take effect immediately: source /etc/profile
Test whether it worked: java -version


13. Hadoop core configuration files

(1) Configure core-site.xml

Command: vim $HADOOP_HOME/etc/hadoop/core-site.xml

The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://vm1:8020</value>
    </property>

    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoop-3.3.5/data</value>
    </property>

    <!-- Static user for the HDFS web UI: vm1 -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>vm1</value>
    </property>
</configuration>


(2) HDFS configuration file
Configure hdfs-site.xml

Command: vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml

The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>vm1:9870</value>
    </property>
    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>vm3:9868</value>
    </property>
</configuration>


(3) YARN configuration file

Configure yarn-site.xml
Command: vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Tell MapReduce to use the shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Address (hostname) of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>vm2</value>
    </property>

    <!-- Environment variables inherited by containers -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

(4) MapReduce configuration file
Configure mapred-site.xml
Command: vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
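
Optionally, once these files are in place you can ask Hadoop to echo back the values it actually reads (run on vm1; hdfs getconf ships with Hadoop):

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.namenode.http-address
hdfs getconf -confKey dfs.namenode.secondary.http-address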


14. Configure workers

Command: vim $HADOOP_HOME/etc/hadoop/workers

Add the following to this file:

vm1
vm2
vm3
vm4

Note: entries added to this file must not have trailing spaces, and the file must not contain blank lines.

15. Synchronize the configuration files modified above to the other machines

Command: xsync $HADOOP_HOME


You can check on the other machines that the synchronization completed.

16. Start the cluster

(1) If the cluster is being started for the first time, the NameNode must be formatted on the vm1 node. (Note: formatting the NameNode generates a new cluster ID. Reformatting later leaves the NameNode and DataNodes with inconsistent cluster IDs, so the cluster cannot find its old data. If the cluster reports errors while running and the NameNode really needs to be reformatted, first stop the namenode and datanode processes, delete the data and logs directories on all machines, and only then format.)

Format command: hdfs namenode -format
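
For reference, here is a sketch of the cleanup described above before any later re-format (the data path follows hadoop.tmp.dir from core-site.xml and the default logs directory; treat it as an outline, not an exact procedure):

# stop the daemons first
$HADOOP_HOME/sbin/stop-yarn.sh    # on vm2
$HADOOP_HOME/sbin/stop-dfs.sh     # on vm1
# wipe the data and logs directories on every node, then format again on vm1
for host in vm1 vm2 vm3 vm4
do
	ssh "$host" "rm -rf /home/hadoop/hadoop-3.3.5/data /home/hadoop/hadoop-3.3.5/logs"
done
hdfs namenode -format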

(2) Start HDFS
Command: $HADOOP_HOME/sbin/start-dfs.sh

(3) Start YARN on the node configured as the ResourceManager (vm2)

Command: $HADOOP_HOME/sbin/start-yarn.sh
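
A simple way to confirm that the expected daemons came up on each node (jps ships with the JDK; the loop below reuses the JDK path from step 8):

for host in vm1 vm2 vm3 vm4
do
	echo ==== $host ====
	ssh "$host" /usr/java/jdk1.8.0_202/bin/jps
done

With the configuration above you would expect a NameNode on vm1, the ResourceManager on vm2, the SecondaryNameNode on vm3, and a DataNode and NodeManager on every node.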

(4) View the HDFS NameNode web UI

Enter in the browser: http://vm1:9870


(5) View the YARN ResourceManager web UI

Enter in the browser: http://vm2:8088


17. WordCount test

1. Write files locally
Create folder: mkdir $HADOOP_HOME/WordCount
Create file 1:

echo 'This is the first hadoop test program!' > $HADOOP_HOME/WordCount/file1.txt

Create file 2:

echo 'This program is not very difficult, but this program is a common hadoop program!' > $HADOOP_HOME/WordCount/file2.txt

2. Upload the files to HDFS
Create the input directory: hadoop fs -mkdir /input
Upload the files: hadoop fs -put $HADOOP_HOME/WordCount/*.txt /input

3. Execute the program
Execute the wordcount program: hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar wordcount /input /output

List the output files: hadoop fs -ls /output
View the results: hadoop fs -cat /output/part-r-00000
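
Note that MapReduce refuses to run if the output directory already exists, so delete it before re-running the job:

hadoop fs -rm -r /output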


Like it if you think it is good!

Origin blog.csdn.net/weixin_52115456/article/details/131190981