A super-detailed guide to building a Hadoop cluster on Linux from scratch (CentOS 7 + Hadoop 3.2.0 + JDK 1.8 + a fully distributed MapReduce cluster case, with detailed source code and graphic explanations)
Keywords and related configuration versions
Keywords: Linux CentOS Hadoop Java
version: CentOS7 Hadoop3.2.0 JDK1.8
Virtual machine parameters: memory 3.2 GB, processors 2x2, disk 50 GB
ISO: CentOS-7-x86_64-DVD-2009.iso
Basic master-slave idea:
First configure the basic settings on one virtual machine (master): SSH, JDK, Hadoop, the environment variables, and the Hadoop and MapReduce configuration. Then clone it, modify the clone's IP and host name, and add the master and slave IPs with their corresponding host names to /etc/hosts — this yields the remaining virtual machine (node1).
The cluster built this time has a master-slave structure with one master and one slave.
(You can set up multiple slave machines according to your actual situation; in this article I use one slave machine. Adding a few more nodes is also very simple and depends on personal preference or need.)
Note: Hadoop has added the Yarn resource manager since version 2. Yarn does not need to be installed separately. As long as the JDK is installed on the machine, Hadoop can be installed directly. Simply installing Hadoop does not rely on other things such as Zookeeper.
1. First, build Linux CentOS7 on a virtual machine.
Friends who don’t know how to build it can read the blog I wrote before:
The virtual machine memory and processor parameters of the master and node1 nodes that I have configured are as follows.
2. Directly select the root user to log in and turn off the firewall.
(You can choose it or not according to your personal needs. I choose root to log in directly, which is simpler and more trouble-free)
I directly choose the root user to log in, which avoids some environmental problems caused by ordinary user authorization and user switching. In short, it is efficient and convenient.
The advantage of this is that you can enter directly as the root user, and you don’t have to go through the trouble of entering a password for authorization and authentication:
Then we turn off the firewall first:
systemctl stop firewalld      # stop the firewall
systemctl disable firewalld   # disable it at boot
systemctl status firewalld    # check the firewall status
When the firewall is turned off we continue.
3. Implement ssh password-free login
Configure passwordless access to ssh
ssh-keygen -t rsa
Press Enter at each prompt to accept the defaults.
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
Make sure the sshd service is running and starts automatically at boot. On CentOS 7, sshd is managed by systemd (there is no Debian-style /etc/init.d/ssh script here):
systemctl enable sshd
systemctl start sshd
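You can quickly verify the key setup on the current machine; it should log in without prompting for a password (answer yes to the one-time host-key question):
ssh localhost
exit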
4. Install jdk1.8 on CentOS7
1. yum installation
- Before installation, check to see if there is a jdk that comes with the system. If so, uninstall it first.
rpm -qa | grep jdk
[root@master ~]#rpm -qa | grep jdk
copy-jdk-configs-3.3-10.el7_5.noarch
java-1.8.0-openjdk-headless-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-devel-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-zip-1.8.0.322.b06-1.el7_9.noarch
java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-accessibility-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-demo-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-src-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-1.8.0.322.b06-1.el7_9.noarch
Uninstall jdk:
rpm -e --nodeps <each jdk package found in the previous step>
For example:
[root@master ~]# rpm -e --nodeps copy-jdk-configs-3.3-10.el7_5.noarch
[root@master ~]# rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.322.b06-1.el7_9.x86_64
[root@master ~]# rpm -qa | grep jdk
java-1.8.0-openjdk-devel-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-zip-1.8.0.322.b06-1.el7_9.noarch
java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-accessibility-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-demo-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-src-1.8.0.322.b06-1.el7_9.x86_64
java-1.8.0-openjdk-javadoc-1.8.0.322.b06-1.el7_9.noarch
[root@test ~]#
I only ran rpm -e --nodeps twice here as a demonstration; the remaining seven packages are uninstalled with exactly the same command, so I won't repeat them.
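If you would rather not type nine separate rpm -e commands, the whole query-and-uninstall step can be collapsed into one line (same effect as above; xargs -r simply skips the call when nothing matches):
rpm -qa | grep jdk | xargs -r rpm -e --nodeps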
Verify that it has been uninstalled cleanly:
rpm -qa|grep java
java -version
After uninstalling, start installing jdk1.8:
View installable versions
yum list java*
Install version 1.8.0 openjdk
yum -y install java-1.8.0-openjdk*
Check the installation location:
rpm -qa | grep java
rpm -ql java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64
Environment variable configuration:
Current user uses:
vi ~/.bashrc
Or for global users use:
vi /etc/profile
Add to:
export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
and then execute
source ~/.bashrc
or
source /etc/profile
command to make the modified configuration file take effect.
Verify installation:
which java
java -version
5. Download hadoop
The hadoop I am using here is version 3.2.0.
Open the download mirror selection page: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
This link also hosts the other Hadoop 3.2.0 files: https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/
Download hadoop files:
You can also pick whichever Hadoop version you need and download its file directly from the archive:
https://archive.apache.org/dist/hadoop/common/
Then upload the file and decompress it
1. Create a new directory named hadoop in the opt directory and upload the downloaded hadoop-3.2.0.tar.gz to this directory
mkdir /opt/hadoop
Unzip and install (run inside /opt/hadoop):
cd /opt/hadoop
tar -zxvf hadoop-3.2.0.tar.gz
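Alternatively, if your VM can reach the internet, you can skip the manual upload and fetch the tarball in place — a sketch using the archive URL given above (install wget first with yum -y install wget if it is missing):
mkdir -p /opt/hadoop
cd /opt/hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -zxvf hadoop-3.2.0.tar.gz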
Configure Hadoop environment variables:
vim ~/.bashrc
Add hadoop environment variables:
export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:/opt/hadoop/hadoop-3.2.0/bin:/opt/hadoop/hadoop-3.2.0/sbin
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.0
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
Note: the PATH line keeps the original $PATH and appends $JAVA_HOME/bin, /opt/hadoop/hadoop-3.2.0/bin, and /opt/hadoop/hadoop-3.2.0/sbin to it as the new $PATH.
Then we execute
source ~/.bashrc
Make the modified configuration file take effect.
6. Modification of Hadoop configuration file
Create several new directories:
mkdir /root/hadoop
mkdir /root/hadoop/tmp
mkdir /root/hadoop/var
mkdir /root/hadoop/dfs
mkdir /root/hadoop/dfs/name
mkdir /root/hadoop/dfs/data
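The same directory layout can also be created in a single command using shell brace expansion:
mkdir -p /root/hadoop/{tmp,var,dfs/name,dfs/data}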
Modify a series of configuration files in etc/hadoop
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/core-site.xml
Add configuration to the node:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
Note: fs.default.name is the deprecated name for this property; in Hadoop 3.x the preferred key is fs.defaultFS, but the old name is still accepted.
Modify hadoop-env.sh
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hadoop-env.sh
Modify export JAVA_HOME=${JAVA_HOME}
to: export JAVA_HOME=/usr/lib/jvm/java-openjdk
Note: change this to your own JDK path.
Modify hdfs-site.xml
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hdfs-site.xml
Add configuration to the node:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/root/hadoop/dfs/name</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.
</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/root/hadoop/dfs/data</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
<description>Disable HDFS permission checking.</description>
</property>
</configuration>
Note: dfs.name.dir, dfs.data.dir, and dfs.permissions are the older property names; Hadoop 3.x prefers dfs.namenode.name.dir, dfs.datanode.data.dir, and dfs.permissions.enabled, though the deprecated names still work.
Create and modify mapred-site.xml:
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/mapred-site.xml
Add configuration to the node:
<configuration>
<!-- Run MapReduce on YARN (the default is to run locally) -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Modify workers file:
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/workers
Delete the localhost inside and add the following content (both the master and node1 nodes must be modified):
master
node1
Note: There can be no extra spaces here, and no blank lines are allowed in the file.
You can also modify only the master node's /opt/hadoop/hadoop-3.2.0/etc/hadoop/workers file and then distribute it to the whole cluster with a single command, so that you do not need to edit the workers files on the other nodes (xsync here is the commonly used rsync-based distribution script, not a built-in command — write or install it first):
xsync /opt/hadoop/hadoop-3.2.0/etc
Modify the yarn-site.xml file:
HADOOP_CLASSPATH sets the classpath used when you run a class directly. Without it, running a program with `hadoop classname [args]` fails because the class to run cannot be found (running via `hadoop jar jar_name.jar classname [args]` is unaffected).
You therefore need to set the Hadoop classpath here, otherwise MapReduce jobs will report that the main class cannot be found. First run:
hadoop classpath
Note the results returned
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/yarn-site.xml
Add a configuration
<property>
<name>yarn.application.classpath</name>
<value>(paste the output returned by hadoop classpath here)</value>
</property>
This is my yarn-site.xml configuration:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/opt/hadoop/hadoop-3.2.0/etc/hadoop:/opt/hadoop/hadoop-3.2.0/share/hadoop/common/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/common/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/hdfs:/opt/hadoop/hadoop-3.2.0/share/hadoop/hdfs/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/hdfs/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/yarn:/opt/hadoop/hadoop-3.2.0/share/hadoop/yarn/lib/*:/opt/hadoop/hadoop-3.2.0/share/hadoop/yarn/*</value>
</property>
</configuration>
Configure the start-dfs.sh, start-yarn.sh, stop-dfs.sh, stop-yarn.sh files in the hadoop-3.2.0/sbin/ directory
Service startup permission configuration
cd /opt/hadoop/hadoop-3.2.0
Configure start-dfs.sh and stop-dfs.sh files
vi sbin/start-dfs.sh
and
vi sbin/stop-dfs.sh
Add the following
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Configure start-yarn.sh and stop-yarn.sh files
vi sbin/start-yarn.sh
and
vi sbin/stop-yarn.sh
Add the following
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
After configuring the basic settings (SSH, JDK, Hadoop, environment variables, Hadoop and MapReduce configuration information), clone the virtual machine and obtain the slave node1 node.
If the master has already been cloned to the node1 virtual machine but node1's Hadoop configuration has not yet been modified, we can run the following command on the master node to distribute the finished Hadoop configuration to every node in the cluster, so that we do not need to edit the Hadoop configuration files of the other nodes:
xsync /opt/hadoop/hadoop-3.2.0/etc/hadoop
After cloning the master host, obtain the slave node1 node.
Then start modifying the network card information:
vim /etc/sysconfig/network-scripts/ifcfg-ens33
Modify node1 node ip information:
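A minimal static-IP sketch for node1's ifcfg-ens33 — the address 192.168.95.21, netmask, gateway, and DNS below are assumptions based on the 192.168.95.x network used in this article, so substitute your own values (also delete or regenerate the UUID/HWADDR lines inherited from the cloned master):
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.95.21
NETMASK=255.255.255.0
GATEWAY=192.168.95.2
DNS1=192.168.95.2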
Modify the node1 node host name:
vi /etc/hostname
Modify the IP and host name corresponding to the node1 node (the master and slave nodes must be consistent)
vim /etc/hosts
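For example (192.168.95.20 is the master address used later in this article; node1's address follows the assumption above):
192.168.95.20 master
192.168.95.21 node1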
Try ssh interconnection between master and slave nodes:
first try connecting the master node to the node1 node
ssh node1
Try node1 node to connect to the master node again:
ssh master
OK, the interconnection is successful. (Type exit to disconnect.)
7. Start Hadoop
Because master is the namenode and node1 is a datanode (master also doubles as a datanode, since it is listed in the workers file), only the master needs to be initialized — that is, HDFS is formatted.
Enter the master machine/opt/hadoop/hadoop-3.2.0/bin directory:
cd /opt/hadoop/hadoop-3.2.0/bin
Execute the initialization script, i.e. format HDFS:
./hadoop namenode -format
(This still works but is the deprecated form; the current equivalent is ./hdfs namenode -format.)
Then go back up to the Hadoop home directory and start all the daemons:
cd /opt/hadoop/hadoop-3.2.0
./sbin/start-all.sh
Check the startup process.
jps
operation result:
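Since master is listed in the workers file, it runs the worker daemons too. A typical jps listing on master looks like the sketch below (the PIDs are purely illustrative); node1 should show DataNode and NodeManager:
34561 NameNode
34712 DataNode
34945 SecondaryNameNode
35213 ResourceManager
35367 NodeManager
35682 Jps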
The master is our namenode. The IP of the machine is 192.168.95.20. Access the following address on the local computer:
http://192.168.95.20:9870/
Visit the following address in your local browser:
http://192.168.95.20:8088/cluster
Automatically jump to the cluster page
8. Run MapReduce cluster
Mapreduce running case:
Create a directory on hdfs to store files.
For example
./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input
You can first simply write two small files called text1 and text2, as shown below.
file:text1.txt
hadoop is very good
mapreduce is very good
vim text1
Add the following:
hadoop is very good
mapreduce is very good
These two files can then be stored in HDFS and processed using WordCount.
./bin/hdfs dfs -put text1 /home/hadoop/myx/wordcount/input
Check the distribution:
./bin/hdfs dfs -ls /home/hadoop/myx/wordcount/input
Then run MapReduce to process it with WordCount:
./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /home/hadoop/myx/wordcount/input /home/hadoop/myx/wordcount/output
The final results will be stored in the specified output directory. Check the output directory to see the following content.
./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output/part-r-00000*
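Because the example WordCount splits on whitespace and emits keys in sorted order, the expected output for text1's two lines is:
good	2
hadoop	1
is	2
mapreduce	1
very	2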
The running output results can also be viewed on the web side, with detailed information:
http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output
Let’s try the second case:
file:text2.txt
vim text2
Add the following
hadoop is easy to learn
mapreduce is easy to learn
Create the input2 directory on HDFS and upload text2 (these two commands mirror the first case), then check the newly created input2 directory in the browser:
./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input2
./bin/hdfs dfs -put text2 /home/hadoop/myx/wordcount/input2
Run MapReduce for processing and set the output directory to output2 (the output directory does not need to be created in advance; the output2 directory is generated automatically while MapReduce runs).
./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /home/hadoop/myx/wordcount/input2 /home/hadoop/myx/wordcount/output2
After running, view the output of text2
./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output2/part-r-00000*
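By the same logic, the expected output for text2 is:
easy	2
hadoop	1
is	2
learn	2
mapreduce	1
to	2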
The running output results can also be viewed on the web side, with detailed information:
http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output2
The above output is the number of occurrences of each word.
Let's try running the test program WordCount ourselves.
First, create a new folder named WordCount under the Hadoop installation directory (/opt/hadoop/hadoop-3.2.0) and create two test files in it, file1.txt and file2.txt, filling in the content yourself.
Create a new folder WordCount.
mkdir WordCount
ls
cd WordCount
vim file1.txt
The content of file1.txt file is:
This is the first hadoop test program!
vim file2.txt
file2.txt file content is:
This program is not very difficult,but this program is a common hadoop program!
Then create a new folder input in the root directory of the Hadoop file system HDFS and view it. The specific commands are as follows.
cd /opt/hadoop/hadoop-3.2.0
./bin/hadoop fs -mkdir /input
./bin/hadoop fs -ls /
View in browser:
http://192.168.95.20:9870/explorer.html#/input
Upload the file1.txt and file2.txt files in the WordCount folder to the "input" folder you just created. The specific command is as follows.
./bin/hadoop fs -put /opt/hadoop/hadoop-3.2.0/WordCount/*.txt /input
Run the Hadoop sample program and set the output directory to /output (the output result directory does not need to be created in advance, the /output output directory will be automatically generated during Mapreduce operation).
./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /input /output
View the file directory information and WordCount results of the output results.
Use the following command to view the file directory information of the output results.
./bin/hadoop fs -ls /output
Use the following command to view the results of WordCount.
./bin/hdfs dfs -cat /output/part-r-00000*
The output results are as follows.
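With the file contents above, the counts you should expect are sketched below — note that WordCount splits only on whitespace, so "difficult,but" and "program!" count as single words, and the uppercase "This" sorts before the lowercase words:
This	2
a	1
common	1
difficult,but	1
first	1
hadoop	2
is	3
not	1
program	2
program!	2
test	1
the	1
this	1
very	1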
The running output results can also be viewed on the web side, with detailed information:
http://192.168.95.20:9870/explorer.html#/output
The above output is the number of occurrences of each word.
At this point, the case of building a hadoop cluster and running three MapReduce clusters on Centos is completed!
I remember a simple interview question about Hadoop ports: you may be asked for the port numbers of some important service processes.
In case some friends don't know them, I have sorted them out for you.
For the newer Hadoop 3.x versions:

| Hadoop 3.x service | Port |
| --- | --- |
| HDFS NameNode internal communication (RPC) | 8020 / 9000 / 9820 |
| HDFS NameNode web UI (HTTP, for users) | 9870 |
| YARN web UI for viewing MapReduce task execution | 8088 |
| JobHistory server communication | 19888 |
For Hadoop 2.x versions:

| Hadoop 2.x service | Port |
| --- | --- |
| HDFS NameNode internal communication (RPC) | 8020 / 9000 |
| HDFS NameNode web UI (HTTP, for users) | 50070 |
| YARN web UI for viewing MapReduce task execution | 8088 |
| JobHistory server communication | 19888 |
OK, it took almost 2 hours to build this Hadoop cluster from scratch on Linux (CentOS 7 + Hadoop 3.2.0 + JDK 1.8 + a fully distributed MapReduce cluster), but it is finally done, and this hard-working student Xiaoma has decided to reward himself with a big meal. I hope this tutorial is helpful to you. Everything here has been tested: if your environment configuration and operations match, the deployment should basically succeed. I deployed one slave node (node1) here; you can add 3 or more nodes according to your needs, and the configuration changes for the additional nodes are exactly the same.
Friends in need can ask me for the complete project source code and detailed documents. Welcome to visit me!
Finally, I wish you all the best in your deployment!