Cloud Computing and Big Data - Deploying a Hadoop Cluster and Running MapReduce Jobs (Super Detailed!)
Building a Hadoop cluster on Linux (CentOS 7 + Hadoop 3.2.0 + JDK 1.8, fully distributed cluster with MapReduce)
Versions used in this article: CentOS 7, Hadoop 3.2.0, JDK 1.8
Basic concepts and importance
Many friends deploy Hadoop and MapReduce clusters without knowing what is actually being deployed or what it is for. Before deploying the cluster, let me go over the basic concepts of Hadoop and MapReduce and their importance in big data processing:
- Hadoop is an open-source software framework developed by the Apache Foundation for the distributed storage and processing of large-scale data sets. Its core components are the Hadoop Distributed File System (HDFS) and MapReduce.
- HDFS is a distributed file system that stores large amounts of data on commodity hardware. It splits data into blocks and distributes them across multiple nodes in the cluster, providing high fault tolerance and high throughput.
- MapReduce is a programming model for processing and generating large data sets. A MapReduce job has two phases: a Map phase and a Reduce phase. In the Map phase, the input data is split into independent chunks that are processed in parallel. In the Reduce phase, the intermediate results are combined into the final output.
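To make the two phases concrete, here is a tiny worked word-count example (the same computation this article runs later with the WordCount program): the Map phase emits a (word, 1) pair for every word, and the Reduce phase sums the pairs per word.

Input lines:        Map output:                    Reduce output:
hadoop is good      (hadoop,1) (is,1) (good,1)     fast   1
hadoop is fast      (hadoop,1) (is,1) (fast,1)     good   1
                                                   hadoop 2
                                                   is     2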
The importance of Hadoop and MapReduce in big data processing is mainly reflected in the following points:
- Scalability: Hadoop can run on hundreds or thousands of machines and process petabytes of data.
- Fault tolerance: Hadoop automatically handles node failures, ensuring data reliability and integrity.
- Cost-effectiveness: Hadoop runs on commodity hardware, reducing the cost of big data processing.
- Flexibility: the MapReduce programming model can handle both structured and unstructured data and adapts to many types of data processing tasks.
Let’s officially get to the point!
1. Directly select the root user to log in and turn off the firewall
Logging in directly as the root user avoids environment problems caused by ordinary-user permissions and user switching. Simply put, it is efficient and convenient.
Then turn off the firewall:
systemctl stop firewalld      # stop the firewall
systemctl disable firewalld   # disable start on boot
systemctl status firewalld    # check the firewall status
Leave the firewall turned off.
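With the firewall stopped and disabled, the status command should report something like:

● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)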
2. Implement ssh password-free login
Configure passwordless SSH access. Generate a key pair:
ssh-keygen -t rsa
Press Enter at every prompt to accept the defaults, then append the public key to the authorized keys:
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
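You can check that key-based login works before continuing:

ssh localhost    # should log in without asking for a password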
Make sure the SSH server starts automatically. On CentOS 7, sshd is managed by systemd, so enable and start it directly:
systemctl enable sshd
systemctl start sshd
Check the ssh service status.
systemctl status sshd
3. Install JDK 1.8 on CentOS 7
1. yum installation
- Before installing, check whether the system ships with a bundled JDK; if so, uninstall it first.
List the installed JDK packages:
rpm -qa | grep jdk
Uninstall every JDK package found in the previous step with rpm -e --nodeps, for example:
[root@master ~]# rpm -e --nodeps copy-jdk-configs-3.3-10.el7_5.noarch
Verify that it has been uninstalled cleanly:
java -version
After uninstalling, start installing jdk1.8:
View installable versions
yum list java*
Install version 1.8.0 openjdk
yum -y install java-1.8.0-openjdk*
Check the installation location:
rpm -qa | grep java
rpm -ql java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64
Add the environment variables to the current user's profile:
vi ~/.bashrc
Append:
export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
and then execute
source ~/.bashrc
Verify installation:
which java
View the Java version information:
java -version
If the version information prints correctly, the JDK configuration is complete.
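For reference, with the OpenJDK package installed above the output looks something like this (exact build numbers will vary):

openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)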
4. Download Hadoop
The Apache archive has all the Hadoop 3.2.0 release files:
https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/
Here is a network-disk link for the downloaded hadoop-3.2.0.tar.gz:
Link: https://pan.baidu.com/s/1a3GJH_fNhUkfaDbckrD8Gg?pwd=2023
Download the Hadoop archive, then upload it to the server and unpack it.
1. Create a new directory named hadoop under /opt, and upload the downloaded hadoop-3.2.0.tar.gz into it:
mkdir /opt/hadoop
Unpack and install:
cd /opt/hadoop
tar -zxvf hadoop-3.2.0.tar.gz
Configure Hadoop environment variables:
vim ~/.bashrc
Add hadoop environment variables:
export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:/opt/hadoop/hadoop-3.2.0/bin:/opt/hadoop/hadoop-3.2.0/sbin
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.0
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
Then execute
source ~/.bashrc
to make the modified configuration take effect.
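You can confirm that the variables took effect:

hadoop version    # the first line of output should read: Hadoop 3.2.0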
5. Modification of Hadoop configuration file
Create several new directories:
mkdir /root/hadoop
mkdir /root/hadoop/tmp
mkdir /root/hadoop/var
mkdir /root/hadoop/dfs
mkdir /root/hadoop/dfs/name
mkdir /root/hadoop/dfs/data
Next, modify a series of configuration files under etc/hadoop. Start with core-site.xml:
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/core-site.xml
Add the following properties to the <configuration> node:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
Modify hadoop-env.sh
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hadoop-env.sh
Change export JAVA_HOME=${JAVA_HOME}
to: export JAVA_HOME=/usr/lib/jvm/java-openjdk
Note: use your own JDK path here.
Modify hdfs-site.xml
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hdfs-site.xml
and add configuration to the node:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/root/hadoop/dfs/name</value>
<description>Path on the local filesystem where the NameNode persistently stores the namespace and transaction logs.
</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/root/hadoop/dfs/data</value>
<description>Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
<description>disable permission checking</description>
</property>
</configuration>
Create and modify mapred-site.xml:
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/mapred-site.xml
Add configuration to the node:
<configuration>
<!-- Run MapReduce on YARN (by default it runs locally) -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Modify workers file:
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/workers
Delete the localhost inside and add the following content (both the master and node1 nodes must be modified):
master
node1
Modify the yarn-site.xml file:
vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/yarn-site.xml
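A minimal configuration that matches this cluster layout (ResourceManager on the master, plus the shuffle service MapReduce needs) looks like the following sketch; adjust it to your own setup:

<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>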
Next, configure the start-dfs.sh, start-yarn.sh, stop-dfs.sh, and stop-yarn.sh files in the hadoop-3.2.0/sbin/ directory.
Service startup permission configuration:
cd /opt/hadoop/hadoop-3.2.0
Configure start-dfs.sh and stop-dfs.sh files
vi sbin/start-dfs.sh
vi sbin/stop-dfs.sh
Add the following content:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Configure start-yarn.sh and stop-yarn.sh files
vi sbin/start-yarn.sh
vi sbin/stop-yarn.sh
Add the following
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
After completing the basic settings (SSH, JDK, Hadoop, environment variables, and the Hadoop/MapReduce configuration files), clone the virtual machine to obtain the slave node, node1.
After cloning the master host, start modifying node1's network card information:
vim /etc/sysconfig/network-scripts/ifcfg-ens33
Modify the node1 node's IP information.
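For example, assuming the master is at 192.168.95.20 (the address used later in this article) and node1 is given 192.168.95.21, the relevant lines of ifcfg-ens33 on node1 might look like this (the gateway and DNS values are illustrative; match them to your own network):

BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.95.21
NETMASK=255.255.255.0
GATEWAY=192.168.95.2
DNS1=192.168.95.2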
Modify the node1 node host name:
vi /etc/hostname
Map the IP and host name of each node (the entries must be identical on the master and slave nodes):
vim /etc/hosts
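Under the same address assumptions as above, /etc/hosts on both machines would contain:

192.168.95.20 master
192.168.95.21 node1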
Now try SSH interconnection between the master and slave nodes.
First, from the master node, connect to node1:
ssh node1
Then, from node1, connect back to the master node:
ssh master
OK, the interconnection is successful. (Type exit to return to the original shell.)
6. Start Hadoop
Because master is the NameNode and node1 is a DataNode, only the master needs to be initialized; that is, HDFS must be formatted.
Enter the master machine/opt/hadoop/hadoop-3.2.0/bin directory:
cd /opt/hadoop/hadoop-3.2.0/bin
Execute the initialization script:
./hadoop namenode -format
(This still works in Hadoop 3 but is deprecated; the modern equivalent is ./hdfs namenode -format.)
Then return to the Hadoop home directory and start all the daemons:
cd /opt/hadoop/hadoop-3.2.0
./sbin/start-all.sh
Check the started processes:
jps
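Since the workers file lists both master and node1, the master also runs a DataNode and a NodeManager, so jps on the master should show something like the following (each line preceded by its process ID):

NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
Jps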
The master is our NameNode and the machine's IP is 192.168.95.20. Access the HDFS web UI from your local computer:
http://192.168.95.20:9870/
Also visit the YARN address in your local browser:
http://192.168.95.20:8088/cluster
You will land on the cluster overview page.
7. Run MapReduce jobs on the cluster
MapReduce running case:
First create a directory on HDFS to store the input files, for example:
./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input
You can first write two small files named text1 and text2, as shown below.
File text1:
vim text1
hadoop is very good
mapreduce is very good
Store text1 in HDFS so that WordCount can process it:
./bin/hdfs dfs -put text1 /home/hadoop/myx/wordcount/input
Check that the file was distributed and replicated normally:
./bin/hdfs dfs -ls /home/hadoop/myx/wordcount/input
Then run WordCount on it:
./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /home/hadoop/myx/wordcount/input /home/hadoop/myx/wordcount/output
The final results will be stored in the specified output directory. Check the output directory to see the following content.
./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output/part-r-00000*
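Given the text1 contents above, the counts should be:

good	2
hadoop	1
is	2
mapreduce	1
very	2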
The running output results can also be viewed on the web side, with detailed information:
http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output
The above output is the number of occurrences of each word.
Let's try the second case.
File text2:
vim text2
hadoop is easy to learn
mapreduce is easy to learn
Create an input2 directory on HDFS and upload text2 to it:
./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input2
./bin/hdfs dfs -put text2 /home/hadoop/myx/wordcount/input2
You can check the newly created input2 directory in the browser. Then run MapReduce on it, setting the output directory to output2 (the output directory must not exist in advance; it is generated automatically while the job runs):
./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /home/hadoop/myx/wordcount/input2 /home/hadoop/myx/wordcount/output2
After the job finishes, view the output for text2:
./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output2/part-r-00000*
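Given the text2 contents above, the counts should be:

easy	2
hadoop	1
is	2
learn	2
mapreduce	1
to	2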
The running output results can also be viewed on the web side, with detailed information:
http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output2
The above output is the number of occurrences of each word.
Let's try running the test program WordCount ourselves.
First, create a new folder named WordCount under the Hadoop installation directory, and create two test files in it, file1.txt and file2.txt; fill in their content yourself.
Create the WordCount folder:
cd /opt/hadoop/hadoop-3.2.0
mkdir WordCount
ls
cd WordCount
vim file1.txt
The content of file1.txt is:
This is the first hadoop test program!
vim file2.txt
The content of file2.txt is:
This program is not very difficult,but this program is a common hadoop program!
Then create a new folder input in the root directory of the Hadoop file system (HDFS) and view it. The specific commands are as follows.
cd /opt/hadoop/hadoop-3.2.0
./bin/hadoop fs -mkdir /input
./bin/hadoop fs -ls /
View in browser:
http://192.168.95.20:9870/explorer.html#/input
Upload the file1.txt and file2.txt files in the WordCount folder to the "input" folder you just created. The specific command is as follows.
./bin/hadoop fs -put /opt/hadoop/hadoop-3.2.0/WordCount/*.txt /input
Run the Hadoop sample program and set the output directory to /output (the output directory must not exist in advance; it is generated automatically while the job runs).
./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /input /output
View the file directory information and WordCount results of the output results.
Use the following command to view the file directory information of the output results.
./bin/hadoop fs -ls /output
Use the following command to view the results of WordCount.
./bin/hdfs dfs -cat /output/part-r-00000*
Given the two files above, and noting that WordCount splits only on whitespace (so punctuation stays attached to words and "program!" is counted separately from "program"), the output should be:
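This	2
a	1
common	1
difficult,but	1
first	1
hadoop	2
is	3
not	1
program	2
program!	2
test	1
the	1
this	1
very	1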
The running output results can also be viewed on the web side, with detailed information:
http://192.168.95.20:9870/explorer.html#/output
The above output is the number of occurrences of each word.
At this point, building a Hadoop cluster on CentOS and running three MapReduce cases is complete!
Here are some tips and suggestions for optimizing Hadoop cluster performance and MapReduce task efficiency:
- Hardware optimization: choosing an appropriate hardware configuration is key to Hadoop cluster performance, for example a faster CPU, more memory, faster disks (such as SSDs), and a high-speed network connection.
- Configuration optimization: Hadoop and MapReduce parameters can be tuned for a specific workload. For example, the HDFS block size can be increased to speed up processing of large files (see the snippet after this list), or MapReduce memory settings can be raised to accommodate larger tasks.
- Data locality: run MapReduce tasks on the nodes where the data resides whenever possible, to reduce network transfer overhead.
- Parallel processing: increasing the parallelism of MapReduce tasks makes fuller use of the cluster's resources.
- Programming optimization: when writing a MapReduce program, minimize data transfer and sorting. For example, a Combiner function can reduce the data shuffled between the Map and Reduce stages.
- Advanced tools: higher-level tools such as Apache Hive and Apache Pig can automatically optimize the MapReduce jobs they generate.
- Monitoring and debugging: Hadoop's built-in monitoring tools, such as the Hadoop web UI and Hadoop Metrics, help you find and resolve performance problems.
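As a concrete illustration of the block-size point above, the dfs.blocksize property can be raised in hdfs-site.xml; the 256 MB value below is only an example, not a recommendation for this cluster:

<property>
<name>dfs.blocksize</name>
<value>268435456</value>
</property>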
The above are just some basic optimization tips and suggestions; concrete strategies need to be adjusted to your specific needs and environment. Xiao Ma wishes you all the best with your deployment!