A super-detailed version of Linux to build a Hadoop cluster from scratch (CentOS7+hadoop 3.2.0+JDK1.8+Mapreduce fully distributed cluster case + detailed source code graphic explanation)

Keywords and related configuration versions

Keywords: Linux CentOS Hadoop Java
Versions: CentOS 7, Hadoop 3.2.0, JDK 1.8
Virtual machine parameters: 3.2 GB of memory, 2 processors x 2 cores, 50 GB of disk
ISO: CentOS-7-x86_64-DVD-2009.iso

Basic master-slave idea:

First, configure all the basic settings (SSH, JDK, Hadoop, environment variables, and the Hadoop and MapReduce configuration files) on one virtual machine (master). Then clone it and, on the clone, modify the node IP address and host name and add the master/slave IP-to-host-name mappings to obtain the remaining virtual machine (node1).
The cluster built here has a master-slave structure with one master and one slave.
(You can set up more slave machines according to your own situation; in this article I use one slave. Adding a few more nodes is just as simple and depends on personal preference or actual need.)

Note: Hadoop has shipped the YARN resource manager since version 2.x, so YARN does not need to be installed separately. As long as the JDK is installed on the machine, Hadoop can be installed directly; a plain Hadoop installation does not depend on anything else such as ZooKeeper.

1. First, build Linux CentOS7 on a virtual machine.

Friends who don’t know how to build it can read the blog I wrote before:

Xingchuan is fine: Virtual machine setup for Linux CentOS7 (detailed graphic explanation) https://blog.csdn.net/Myx74270512/article/details/127883266?spm=1001.2014.3001.5502

Insert image description here
The virtual machine memory and processor parameters of the master and node1 nodes that I have configured are as follows.
Insert image description here
Insert image description here

2. Directly select the root user to log in and turn off the firewall.

(This is optional; choose according to your own needs. I log in directly as root, which is simpler and less hassle.)

Logging in directly as root avoids the environment problems caused by ordinary-user authorization and user switching; in short, it is efficient and convenient.
Insert image description here

Insert image description here
The advantage is that you enter the system directly as the root user and do not have to go through the hassle of typing a password for authorization and authentication:
Insert image description here
Then we turn off the firewall first:

systemctl stop firewalld       # stop the firewall
systemctl disable firewalld    # disable it at boot
systemctl status firewalld     # check the firewall status

Insert image description here

systemctl status firewalld     # check the firewall status again
Insert image description here

When the firewall is turned off we continue.

3. Implement ssh password-free login

Configure passwordless access to ssh

ssh-keygen -t rsa

Press Enter at every prompt to accept the defaults.
Insert image description here

cd ~/.ssh
cat id_rsa.pub >> authorized_keys
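This works here because node1 will later be cloned from master and therefore already carries the same key pair and authorized_keys file. If you ever add a slave that was not cloned from the master, you can push the public key to it explicitly; a minimal sketch, assuming the new node is reachable as node1:

ssh-copy-id root@node1     # copies ~/.ssh/id_rsa.pub into node1's authorized_keys
ssh node1                  # should now log in without a password prompt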

Insert image description here
Let the ssh server start automatically

vi ~/.bashrc 

Insert image description here

Press o at the end of the file to enter insert mode, and add:

/etc/init.d/ssh start

(Note: this line is really aimed at Debian-style systems; on CentOS 7, sshd is managed by systemd and is normally enabled by default, so systemctl enable sshd and systemctl start sshd are the more natural way to make sure it is running.)

Insert image description here

Press ESC to return to command mode, then type :wq to save and exit.
Make the change take effect immediately:

source ~/.bashrc

Insert image description here

4. Install jdk1.8 on CentOS7

1. yum installation

  1. Before installing, check whether the system already comes with a JDK; if so, uninstall it first:
rpm -qa | grep jdk

[root@master ~]#rpm -qa | grep jdk
copy-jdk-configs-3.3-10.el7_5.noarch
java-1.8.0-openjdk-headless-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-devel-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-zip-1.8.0.322.b06-1.el7_9.noarch
java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-accessibility-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-demo-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-src-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-1.8.0.322.b06-1.el7_9.noarch

Insert image description here

Uninstall jdk:

rpm -e --nodeps <each jdk package listed by the previous query>

For example:
[root@master ~]# rpm -e --nodeps copy-jdk-configs-3.3-10.el7_5.noarch
[root@master ~]# rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.322.b06-1.el7_9.x86_64
[root@master ~]# rpm -qa | grep jdk
java-1.8.0-openjdk-devel-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-zip-1.8.0.322.b06-1.el7_9.noarch
java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-accessibility-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-demo-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-src-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-1.8.0.322.b06-1.el7_9.noarch
[root@test ~]#

I have only run rpm -e --nodeps twice here; the remaining seven packages are uninstalled with exactly the same operation, so the commands are not repeated.

Verify that it has been uninstalled cleanly:

rpm -qa|grep java

Insert image description here

java -version

Insert image description here

After uninstalling, start installing jdk1.8:

View installable versions

yum list java*

Insert image description here

Install version 1.8.0 openjdk

yum -y install java-1.8.0-openjdk*

Insert image description here
Insert image description here

Check the installation location:

rpm -qa | grep java
rpm -ql java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64

Insert image description here

Environment variable configuration:
For the current user only, edit:

vi ~/.bashrc

or, for all users globally, edit:

vi /etc/profile

Add:

export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

Insert image description here

Then execute

 source ~/.bashrc

or

 source /etc/profile 

Insert image description here

to make the modified configuration file take effect.

Verify installation:

which java
java -version

Insert image description here

5. Download hadoop

The hadoop I am using here is version 3.2.0.
Open the download address selection page: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

Insert image description here

This link also hosts the other Hadoop 3.2.0 files: https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/
Download hadoop files:
Insert image description here
You can also pick whichever Hadoop version you need and download it directly. The URL is:

https://archive.apache.org/dist/hadoop/common/

Then upload the file and decompress it.
1. Create a new directory named hadoop under /opt and upload the downloaded hadoop-3.2.0.tar.gz into it:

   mkdir /opt/hadoop

Insert image description here

Unzip and install (run the command inside /opt/hadoop):

 tar -zxvf hadoop-3.2.0.tar.gz

Insert image description here

Configure Hadoop environment variables:

vim ~/.bashrc

Insert image description here

Add hadoop environment variables:

export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:/opt/hadoop/hadoop-3.2.0/bin:/opt/hadoop/hadoop-3.2.0/sbin
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.0
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

Note: the expression $PATH:$JAVA_HOME/bin:/opt/hadoop/hadoop-3.2.0/bin:/opt/hadoop/hadoop-3.2.0/sbin keeps the original $PATH environment variable and appends $JAVA_HOME/bin, /opt/hadoop/hadoop-3.2.0/bin and /opt/hadoop/hadoop-3.2.0/sbin to it as the new $PATH.
Insert image description here

Then we execute

source  ~/.bashrc

Insert image description here

Make the modified configuration file take effect.
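As an optional sanity check, you can confirm that the hadoop command is now resolvable from the new PATH:

which hadoop
hadoop version     # should report Hadoop 3.2.0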

6. Modify the Hadoop configuration files

Create several new directories:

mkdir /root/hadoop
mkdir /root/hadoop/tmp
mkdir /root/hadoop/var
mkdir /root/hadoop/dfs
mkdir /root/hadoop/dfs/name
mkdir /root/hadoop/dfs/data

Insert image description here

Modify a series of configuration files in etc/hadoop

 vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/core-site.xml

Add the following configuration to the <configuration> node:

<configuration>
 <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
   </property>
   <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
   </property>
   </configuration>
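Note: fs.default.name is the legacy key; it still works on Hadoop 3.x but logs a deprecation warning. If you prefer the current-style key, the equivalent property would be:

<property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
</property>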

Insert image description here

Modify hadoop-env.sh

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hadoop-env.sh

Change export JAVA_HOME=${JAVA_HOME}
to: export JAVA_HOME=/usr/lib/jvm/java-openjdk
Note: change this to your own JDK path.
Insert image description here

Modify hdfs-site.xml

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hdfs-site.xml 

Add the following configuration to the <configuration> node:

<configuration>
<property>
   <name>dfs.name.dir</name>
   <value>/root/hadoop/dfs/name</value>
   <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.
   </description>
</property>

<property>
   <name>dfs.data.dir</name>
   <value>/root/hadoop/dfs/data</value>
   <description>Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
   </description>
</property>

<property>
   <name>dfs.replication</name>
   <value>2</value>
</property>

<property>
   <name>dfs.permissions</name>
   <value>false</value>
   <description>need not permissions</description>
</property>
</configuration>
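Similarly, dfs.name.dir, dfs.data.dir and dfs.permissions are legacy names that Hadoop 3.x still accepts while printing deprecation warnings; if you would rather use the current equivalents, they would be:

<property>
   <name>dfs.namenode.name.dir</name>
   <value>/root/hadoop/dfs/name</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>/root/hadoop/dfs/data</value>
</property>
<property>
   <name>dfs.permissions.enabled</name>
   <value>false</value>
</property>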

Insert image description here

Create and modify mapred-site.xml:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/mapred-site.xml

Add the following configuration to the <configuration> node:

<configuration>
<!-- Run MapReduce on YARN (the default is to run locally) -->
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
</configuration>
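As an aside (this tutorial does not rely on it): another commonly used way to avoid the "main class cannot be found" error that is addressed later via yarn.application.classpath is to point MapReduce at the Hadoop installation directly in mapred-site.xml. A hedged sketch, assuming the install path used in this article:

<property>
   <name>yarn.app.mapreduce.am.env</name>
   <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.0</value>
</property>
<property>
   <name>mapreduce.map.env</name>
   <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.0</value>
</property>
<property>
   <name>mapreduce.reduce.env</name>
   <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.0</value>
</property>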

Insert image description here

Modify workers file:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/workers

Delete the localhost inside and add the following content (both the master and node1 nodes must be modified):

master
node1

Insert image description here

Note: There can be no extra spaces here, and no blank lines are allowed in the file.

You can also just modify the /opt/hadoop/hadoop-3.2.0/etc/hadoop/workers file on the master node and then distribute it to the whole cluster with a single command, so that the workers files of the other nodes do not have to be edited by hand:

xsync  /opt/hadoop/hadoop-3.2.0/etc
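Note that xsync is not shipped with Hadoop; it is the usual name for a home-made rsync wrapper script. If you do not have such a script, copying the configuration directory with scp achieves the same result; a sketch, assuming node1 resolves to the slave and the same install path exists there:

scp -r /opt/hadoop/hadoop-3.2.0/etc/hadoop root@node1:/opt/hadoop/hadoop-3.2.0/etc/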

Modify the yarn-site.xml file:

HADOOP_CLASSPATH tells Hadoop where to find the classes it has to run. If it is not set, running a program with hadoop classname [args] fails with an error saying the class to be run cannot be found (running it with hadoop jar jar_name.jar classname [args] works either way).

You therefore need to record the Hadoop classpath here, otherwise MapReduce jobs will fail with an error that the application main class cannot be found:

hadoop classpath

Insert image description here

Note the string it returns; you will need it below.

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/yarn-site.xml 

Add a property:

<property>
        <name>yarn.application.classpath</name>
        <value>(paste the output of the hadoop classpath command here)</value>
</property>
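To avoid copying the long classpath string by hand, you can capture the command output first and then paste that single line between the <value> and </value> tags, for example:

hadoop classpath > /tmp/hadoop-classpath.txt
cat /tmp/hadoop-classpath.txt    # paste this line into the <value>...</value> element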

This is my yarn-site.xml configuration:

<configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
        <property>
            <name>yarn.application.classpath</name>
            <value>/opt/hadoop/hadoop-3.2.0/etc/hadoop:/opt/hadoop/hadoop-3.2.0/share/hadoop/common/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/common/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/hdfs:/opt/hadoop/hadoop-3.2.0/share/hadoop/hdfs/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/hdfs/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/yarn:/opt/hadoop/hadoop-3.2.0/share/hadoop/yarn/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/yarn/*</value>
        </property>
</configuration>

Insert image description here

Configure the start-dfs.sh, start-yarn.sh, stop-dfs.sh, stop-yarn.sh files in the hadoop-3.2.0/sbin/ directory

Service startup permission configuration

cd /opt/hadoop/hadoop-3.2.0

Configure start-dfs.sh and stop-dfs.sh files

vi sbin/start-dfs.sh 

and

 vi sbin/stop-dfs.sh

Add the following to both files:

HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

(On Hadoop 3.x, these scripts and the YARN ones below may print a warning that HADOOP_SECURE_DN_USER has been replaced by HDFS_DATANODE_SECURE_USER; for this setup the warning is harmless.)
Insert image description here
Insert image description here

Configure start-yarn.sh and stop-yarn.sh files

vi sbin/start-yarn.sh 

and

vi sbin/stop-yarn.sh

Add the following

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Insert image description here
Insert image description here

After configuring the basic settings (SSH, JDK, Hadoop, environment variables, Hadoop and MapReduce configuration information), clone the virtual machine and obtain the slave node1 node.

If you have already cloned master into the node1 virtual machine but have not yet adjusted node1's Hadoop configuration files, you can simply run the following command on the master node to distribute the finished Hadoop configuration to every node in the cluster, so that the other nodes' Hadoop configuration files do not have to be modified by hand (again using the xsync script mentioned above):
xsync /opt/hadoop/hadoop-3.2.0/etc/hadoop

After cloning the master host, obtain the slave node1 node.
Insert image description here

Then start modifying the network card information:

vim /etc/sysconfig/network-scripts/ifcfg-ens33

Modify node1 node ip information:
Insert image description here
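As a rough sketch, a static configuration for node1 typically looks like the following; the address 192.168.95.21 and the gateway/DNS values are assumptions for this article's 192.168.95.x network and must be adapted to your own VM network:

TYPE=Ethernet
BOOTPROTO=static
NAME=ens33
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.95.21     # assumed address for node1; master uses 192.168.95.20
NETMASK=255.255.255.0
GATEWAY=192.168.95.2     # assumed gateway, check your virtual network settings
DNS1=192.168.95.2

After saving, restart the network service (systemctl restart network) so the new address takes effect.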

Modify the node1 node host name:

vi /etc/hostname

Insert image description here

Add the IP-to-host-name mappings for the nodes (the /etc/hosts entries must be consistent on the master and the slave):

vim /etc/hosts  
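A minimal example of the mapping, assuming node1 was given 192.168.95.21 (192.168.95.20 is the master address used throughout this article):

192.168.95.20   master
192.168.95.21   node1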

Insert image description here
Try ssh interconnection between the master and slave nodes.
First, try connecting from the master node to node1:

ssh node1

Insert image description here
Then, from node1, try connecting back to the master node:

ssh master

Insert image description here
OK, the interconnection works. (Type exit to leave the ssh session.)

7. Start Hadoop

Because master is the NameNode and node1 is a DataNode, only the master needs to be initialized; that is, HDFS is formatted on the master.
Enter the /opt/hadoop/hadoop-3.2.0/bin directory on the master machine:

  cd /opt/hadoop/hadoop-3.2.0/bin

Execute initialization script

  ./hadoop namenode -format
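On Hadoop 3.x this still works but prints a deprecation notice; the current form of the same command is:

  ./hdfs namenode -format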

Insert image description here
Insert image description here

Then go back to the Hadoop home directory (/opt/hadoop/hadoop-3.2.0) and start all the daemons:

./sbin/start-all.sh

Insert image description here

Check the running processes:

jps
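If everything came up correctly, jps on the master should show roughly the following daemons (PIDs omitted); since master is also listed in workers it runs a DataNode and NodeManager too, while node1 typically shows only DataNode and NodeManager:

NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
Jps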

Insert image description here

Results:

The master is our NameNode and its IP is 192.168.95.20, so open the following address in a browser on your local machine:

http://192.168.95.20:9870/
Insert image description here

Visit the following address in your local browser:

http://192.168.95.20:8088/cluster

You are redirected to the cluster overview page automatically.
Insert image description here

Create a directory on hdfs to store files

./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input

Check that the directory was created:

./bin/hdfs dfs -ls /home/hadoop/myx/wordcount/input

Insert image description here

8. Run MapReduce cluster

MapReduce example:
Create a directory on HDFS to hold the input files,
for example:

./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input

You can first simply write two small files called text1 and text2, as shown below.
file:text1.txt
hadoop is very good
mapreduce is very good

vim text1

Add the following:

hadoop is  very good 
mapreduce is very good

Insert image description here

These two files can then be stored in HDFS and processed using WordCount.

./bin/hdfs dfs -put text1 /home/hadoop/myx/wordcount/input

Insert image description here
Check that the file was uploaded:
Insert image description here
and run MapReduce to process it with WordCount.

./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar  wordcount /home/hadoop/myx/wordcount/input /home/hadoop/myx/wordcount/output

Insert image description here
Insert image description here

The final results will be stored in the specified output directory. Check the output directory to see the following content.

./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output/part-r-00000*
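For the text1 content above, the WordCount output should look roughly like this (each line is a word and its count, tab-separated in the real output):

good      2
hadoop    1
is        2
mapreduce 1
very      2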

Insert image description here
The job output can also be viewed in the web UI, with more detail:

http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output

Insert image description here
Insert image description here
Let’s try the second case:
file:text2.txt

vim text2

Add the following

hadoop is  easy to learn 
mapreduce is  easy to learn

Insert image description here
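The upload step itself is not shown above; it mirrors the first case. A sketch, assuming the same directory layout as before:

./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input2
./bin/hdfs dfs -put text2 /home/hadoop/myx/wordcount/input2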
Check the newly created input2 directory in the browser:
Insert image description here
Run MapReduce on it with WordCount, setting the output directory to output2 (the output directory does not need to be, and must not be, created in advance; the output2 directory is generated automatically when the MapReduce job runs).

./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar  wordcount /home/hadoop/myx/wordcount/input2 /home/hadoop/myx/wordcount/output2

Insert image description here
Insert image description here

After running, view the output of text2

./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output2/part-r-00000*

Insert image description here
The job output can also be viewed in the web UI, with more detail:

http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output2

Insert image description here
Insert image description here
The above output is the number of occurrences of each word.

Finally, let's try running the WordCount test program with our own files.
First, create a new folder named WordCount under the Hadoop installation directory (/opt/hadoop/hadoop-3.2.0) and create two test files in it, file1.txt and file2.txt; fill in their content yourself.
Create the WordCount folder:

mkdir WordCount
ls

Insert image description here

cd WordCount
vim file1.txt

Insert image description here
The content of file1.txt is:

This is the first hadoop test program!

Insert image description here
vim file2.txt
The content of file2.txt is:

This  program is not very difficult,but this program is a common hadoop program!

Insert image description here
Then create a new folder /input at the root of the Hadoop file system (HDFS) and list the contents. The specific commands are as follows.

cd /opt/hadoop/hadoop-3.2.0
./bin/hadoop fs -mkdir /input
./bin/hadoop fs -ls /

Insert image description here
View in browser:

http://192.168.95.20:9870/explorer.html#/input

Insert image description here
Upload the file1.txt and file2.txt files from the WordCount folder to the /input folder you just created. The specific command is as follows.

./bin/hadoop fs -put /opt/hadoop/hadoop-3.2.0/WordCount/*.txt  /input

Insert image description here

Run the Hadoop sample program, setting the output directory to /output (the output directory does not need to be, and must not be, created in advance; the /output directory is generated automatically when the MapReduce job runs).

./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar  wordcount  /input /output

Insert image description here
Insert image description here

View the output directory listing and the WordCount results.
Use the following command to list the output directory.

./bin/hadoop fs -ls /output

Insert image description here
Use the following command to view the results of WordCount.

./bin/hdfs dfs -cat /output/part-r-00000*

The output results are as follows.
Insert image description here
The job output can also be viewed in the web UI, with more detail:

http://192.168.95.20:9870/explorer.html#/output

Insert image description here
The above output is the number of occurrences of each word.
At this point, building a Hadoop cluster on CentOS and running three MapReduce examples on it is complete!

I remember a simple interview question about Hadoop versions and ports: you may be asked for the port numbers of some of the important service processes.
In case some of you do not know them, here is a quick summary.
For the newer Hadoop 3.x releases:

Hadoop 3.x service                               Port
HDFS NameNode internal (RPC) communication       8020 / 9000 / 9820
HDFS NameNode web UI (for users)                 9870
YARN ResourceManager web UI (view running jobs)  8088
MapReduce JobHistory server web UI               19888

For Hadoop 2.x:

Hadoop 2.x service                               Port
HDFS NameNode internal (RPC) communication       8020 / 9000
HDFS NameNode web UI (for users)                 50070
YARN ResourceManager web UI (view running jobs)  8088
MapReduce JobHistory server web UI               19888

OK, building this Hadoop cluster from scratch on Linux (CentOS 7 + Hadoop 3.2.0 + JDK 1.8 + a fully distributed MapReduce cluster) took almost two hours, but it is finally done, and hard-working student Xiaoma has decided to reward himself with a big meal. I hope this tutorial is helpful to you. Everything above has been tested; if your environment configuration and operations match, the deployment should basically succeed. I deployed one slave (node1) here; you can add three or more nodes as needed, and the steps for modifying each extra node's configuration are the same.
Friends who need them are welcome to ask me for the complete project source code and detailed documentation.
Finally, I wish you all smooth deployments!

Origin blog.csdn.net/Myx74270512/article/details/127947252