Cloud Computing and Big Data - Deploy a Hadoop Cluster and Run MapReduce Jobs (Super Detailed!)


Building a Hadoop cluster on Linux (CentOS 7 + Hadoop 3.2.0 + JDK 1.8, fully distributed, with MapReduce)

Versions used in this article: CentOS 7, Hadoop 3.2.0, JDK 1.8

Basic concepts and importance

Many people deploy Hadoop and MapReduce clusters without really knowing what they are deploying or what it is for. Before building the cluster, here is a quick overview of the basic concepts of Hadoop and MapReduce and why they matter for big data processing:

- Hadoop is an open-source software framework developed by the Apache Foundation for distributed storage and processing of large-scale data sets. Its core components are the Hadoop Distributed File System (HDFS) and MapReduce.

  • HDFS is a distributed file system that stores large amounts of data on commodity hardware. It splits data into blocks and distributes them across multiple nodes in the cluster, providing high fault tolerance and high throughput.

  • MapReduce is a programming model for processing and generating large data sets. A MapReduce job has two phases: a Map phase and a Reduce phase. In the Map phase, the input data is split into independent chunks that are processed in parallel; in the Reduce phase, the intermediate results are combined into the final output. (A rough command-line sketch of this flow follows below.)
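
As a rough command-line analogy (an illustration only, not Hadoop itself), the word-count flow can be imitated with ordinary shell tools, assuming a small text file text1.txt like the one used later in this article: the "map" step emits (word, 1) pairs, the shuffle step sorts identical words together, and the "reduce" step sums the counts per word.

# Illustration only: a local imitation of the MapReduce word-count flow, not Hadoop itself.
# map: emit one (word, 1) pair per line; shuffle: sort identical words together; reduce: sum per word.
cat text1.txt | tr -s ' \t' '\n' | sed '/^$/d; s/$/\t1/' | sort | awk -F'\t' '{c[$1]+=$2} END {for (w in c) print w "\t" c[w]}'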

The importance of Hadoop and MapReduce in big data processing is mainly reflected in the following points:

  1. Scalability: Hadoop can run on hundreds or thousands of machines and process petabytes of data.

  2. Fault tolerance: Hadoop can automatically handle node failures and ensure data reliability and integrity.

  3. Cost-effectiveness: Hadoop can run on common hardware, reducing the cost of big data processing.

  4. Flexibility: The MapReduce programming model can handle structured and unstructured data and adapt to various types of data processing tasks.

Let’s officially get to the point!

1. Log in as root and turn off the firewall


Logging in directly as root avoids the permission and user-switching problems that come with an ordinary user account. Simply put, it is efficient and convenient.

Then turn off the firewall:

systemctl stop firewalld     # stop the firewall

systemctl disable firewalld  # disable starting at boot

systemctl status firewalld   # check the firewall status
Leave the firewall turned off.

2. Set up passwordless SSH login

Generate an SSH key pair:

ssh-keygen -t rsa

Press Enter at each prompt to accept the defaults.

cd ~/.ssh
cat id_rsa.pub >> authorized_keys

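A quick supplementary check (not in the original steps) that passwordless login now works on this node:

ssh localhost    # should log in without prompting for a password
exit
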
Make sure the ssh server starts automatically at boot. On CentOS 7, sshd is managed by systemd (there is no /etc/init.d/ssh script to call from ~/.bashrc), so simply enable the service:

systemctl enable sshd

Check the ssh service status.

systemctl status sshd


3. Install JDK 1.8 on CentOS 7

1. yum installation

  1. Before installing, check whether the system ships with a bundled JDK (for example with rpm -qa | grep jdk); if so, uninstall it first.
    Uninstall the built-in JDK by running rpm -e --nodeps on each package found in the previous step, for example:
[root@master ~]# rpm -e --nodeps copy-jdk-configs-3.3-10.el7_5.noarch

Verify that it has been uninstalled cleanly:

java -version

After uninstalling, start installing JDK 1.8.

View installable versions

yum list java*

Install the 1.8.0 version of OpenJDK:

yum -y install java-1.8.0-openjdk*

Check the installation location:

rpm -qa | grep java
rpm -ql java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64

Add the Java environment variables to ~/.bashrc (vi ~/.bashrc) and append:

export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

Then execute:

source ~/.bashrc

Verify installation:

which java

View java version information

java -version

If the version information is printed, the JDK configuration is complete.

4. Download Hadoop

Hadoop 3.2.0 (and the other files for that release) can be downloaded from the Apache archive:
https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/
A network-disk link for the downloaded hadoop-3.2.0.tar.gz is also provided:
Link: https://pan.baidu.com/s/1a3GJH_fNhUkfaDbckrD8Gg?pwd=2023

Download the Hadoop file, then upload it to the server and decompress it.
1. Create a new directory named hadoop under /opt and upload the downloaded hadoop-3.2.0.tar.gz into it:

mkdir /opt/hadoop

Unzip it inside /opt/hadoop:

cd /opt/hadoop
tar -zxvf hadoop-3.2.0.tar.gz

Configure the Hadoop environment variables:

vim ~/.bashrc

Add the Hadoop environment variables:

export JAVA_HOME=/usr/lib/jvm/java-openjdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:/opt/hadoop/hadoop-3.2.0/bin:/opt/hadoop/hadoop-3.2.0/sbin
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.0
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

Then execute:

source ~/.bashrc

to make the modified configuration take effect.
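
A quick sanity check (a supplementary step, not in the original) that the Hadoop binaries are now on the PATH:

hadoop version    # should report Hadoop 3.2.0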

5. Modify the Hadoop configuration files

Create several new directories:

mkdir /root/hadoop
mkdir /root/hadoop/tmp
mkdir /root/hadoop/var
mkdir /root/hadoop/dfs
mkdir /root/hadoop/dfs/name
mkdir /root/hadoop/dfs/data


Modify a series of configuration files under etc/hadoop. First edit core-site.xml:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/core-site.xml

Add the following configuration:

<configuration>
   <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
   </property>
   <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
   </property>
</configuration>

Modify hadoop-env.sh

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hadoop-env.sh

Change export JAVA_HOME=${JAVA_HOME}
to: export JAVA_HOME=/usr/lib/jvm/java-openjdk
(Note: use your own JDK path here.)

Modify hdfs-site.xml:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/hdfs-site.xml

Add the following configuration:

<configuration>
<property>
   <name>dfs.name.dir</name>
   <value>/root/hadoop/dfs/name</value>
   <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>

<property>
   <name>dfs.data.dir</name>
   <value>/root/hadoop/dfs/data</value>
   <description>Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>

<property>
   <name>dfs.replication</name>
   <value>2</value>
</property>

<property>
   <name>dfs.permissions</name>
   <value>false</value>
   <description>Disable HDFS permission checking.</description>
</property>
</configuration>

Modify mapred-site.xml:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/mapred-site.xml

Add the following configuration:

<configuration>
<!-- Run MapReduce on YARN (by default it runs locally) -->
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
</configuration>
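
Note: the following is not part of the original configuration. On some Hadoop 3.x installations, jobs submitted to YARN fail with an error like "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster". If that happens, adding these properties to mapred-site.xml (pointing at your own Hadoop installation directory) usually fixes it:

<property>
   <name>yarn.app.mapreduce.am.env</name>
   <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.0</value>
</property>
<property>
   <name>mapreduce.map.env</name>
   <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.0</value>
</property>
<property>
   <name>mapreduce.reduce.env</name>
   <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.0</value>
</property>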

Modify workers file:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/workers

Delete the localhost inside and add the following content (both the master and node1 nodes must be modified):

master
node1

Modify the yarn-site.xml file:

vi /opt/hadoop/hadoop-3.2.0/etc/hadoop/yarn-site.xml

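The configuration shown in the original screenshot is not reproduced here; a typical minimal yarn-site.xml for a two-node setup like this one (an assumption, adjust to your environment) is:

<configuration>
<property>
   <name>yarn.resourcemanager.hostname</name>
   <value>master</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
</configuration>
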
Next, configure the start-dfs.sh, start-yarn.sh, stop-dfs.sh, and stop-yarn.sh files in the hadoop-3.2.0/sbin/ directory so that the services are allowed to start as root.

cd /opt/hadoop/hadoop-3.2.0

Configure start-dfs.sh and stop-dfs.sh files

vi sbin/start-dfs.sh
vi sbin/stop-dfs.sh

Add the following content to both files:

HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root


Configure start-yarn.sh and stop-yarn.sh files

vi sbin/start-yarn.sh
vi sbin/stop-yarn.sh

Add the following content to both files:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

After completing the basic configuration (SSH, JDK, Hadoop, environment variables, and the Hadoop/MapReduce configuration files), clone the virtual machine to obtain the slave node, node1.
Then modify node1's network card configuration:

vim /etc/sysconfig/network-scripts/ifcfg-ens33

Give node1 its own IP information (a static address in the same subnet as the master):
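
The original screenshot is not reproduced here; a typical static configuration might look like the following sketch. The concrete addresses are assumptions (the master in this article is 192.168.95.20, so node1 is given a neighbouring address); use values that match your own virtual network.

TYPE=Ethernet
BOOTPROTO=static
NAME=ens33
DEVICE=ens33
ONBOOT=yes
# example values only, adjust to your own subnet
IPADDR=192.168.95.21
NETMASK=255.255.255.0
GATEWAY=192.168.95.2
DNS1=192.168.95.2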

Modify the node1 node host name:

vi /etc/hostname

Map each node's IP to its hostname in /etc/hosts (the entries must be consistent on both the master and slave nodes):

vim /etc/hosts  

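The master's address is 192.168.95.20 (as used later in this article); node1's address is whatever was assigned above, shown here with the assumed example value:

192.168.95.20 master
192.168.95.21 node1
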
Test SSH interconnection between the master and slave nodes.
First, connect from the master node to node1:

ssh node1

Then connect from node1 back to the master node:

ssh master

OK, the interconnection works. (Type exit to close the SSH session.)

6. Start Hadoop

Because master is the NameNode and node1 is a DataNode, only the master needs to be initialized; that is, HDFS must be formatted.
On the master, enter the /opt/hadoop/hadoop-3.2.0/bin directory:

  cd /opt/hadoop/hadoop-3.2.0/bin

Execute initialization script

  ./hadoop namenode -format


Then return to the Hadoop root directory and start all the daemons:

cd /opt/hadoop/hadoop-3.2.0
./sbin/start-all.sh

Check the running processes:

jps

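If everything started correctly, jps on the master should show roughly the daemons below; node1, as a pure worker, shows only DataNode and NodeManager (plus Jps):

NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
Jps
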
The master is our NameNode; its IP address is 192.168.95.20. Open the following address in a browser on the local machine:

http://192.168.95.20:9870/

Visit the following address in your local browser:

http://192.168.95.20:8088/cluster

This automatically jumps to the YARN cluster page.
Next, create a directory on HDFS for storing files:

./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input

Check that the directory was created:

./bin/hdfs dfs -ls /home/hadoop/myx/wordcount/input

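As a supplementary check (not part of the original steps), the HDFS report confirms that both DataNodes have registered and that replication can work:

./bin/hdfs dfsadmin -report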

7. Run MapReduce jobs on the cluster

MapReduce example:
Create a directory on HDFS to store the input files, for example:

./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input

First, simply write two small input files, text1 and text2, as shown below.
File text1:

vim text1

hadoop is  very good 
mapreduce is very good


The files can then be stored in HDFS and processed with WordCount. First upload text1:

./bin/hdfs dfs -put text1 /home/hadoop/myx/wordcount/input

Check the distribution of the uploaded file, then run MapReduce to process it with WordCount:

./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar  wordcount /home/hadoop/myx/wordcount/input /home/hadoop/myx/wordcount/output

The final results will be stored in the specified output directory. Check the output directory to see the following content.

./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output/part-r-00000*
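
Given the contents of text1 above, the result should look roughly like this (each line is a word followed by its count):

good        2
hadoop      1
is          2
mapreduce   1
very        2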

The running output results can also be viewed on the web side, with detailed information:

http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output

The above output is the number of occurrences of each word.

Let’s try the second case:
File text2:

vim text2
hadoop is  easy to learn 
mapreduce is  easy to learn

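The commands for creating the input2 directory and uploading text2 are not shown in the original; presumably they mirror the text1 case, for example:

./bin/hdfs dfs -mkdir -p /home/hadoop/myx/wordcount/input2
./bin/hdfs dfs -put text2 /home/hadoop/myx/wordcount/input2
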
Check the newly created input2 directory in the browser, then run MapReduce with the output directory set to output2 (the output directory does not need to be created in advance; it is generated automatically when the MapReduce job runs):

./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar  wordcount /home/hadoop/myx/wordcount/input2 /home/hadoop/myx/wordcount/output2

After the job finishes, view the output for text2:

./bin/hdfs dfs -cat /home/hadoop/myx/wordcount/output2/part-r-00000*


The running output results can also be viewed on the web side, with detailed information:

http://192.168.95.20:9870/explorer.html#/home/hadoop/myx/wordcount/output2

The above output is the number of occurrences of each word.

Let's try running the WordCount test with our own input files.
First, create a new folder named WordCount under the Hadoop installation directory (/opt/hadoop/hadoop-3.2.0) and create two test files in it, file1.txt and file2.txt, filling in the content yourself.
Create the WordCount folder:

mkdir WordCount
ls


cd WordCount
vim file1.txt

The content of file1.txt is:

This is the first hadoop test program!


vim file2.txt

The content of file2.txt is:

This  program is not very difficult,but this program is a common hadoop program!

Then create a new folder named input in the root directory of the Hadoop file system HDFS and list the contents. The specific commands are as follows.

cd /opt/hadoop/hadoop-3.2.0
./bin/hadoop fs -mkdir /input
./bin/hadoop fs -ls /

View in browser:

http://192.168.95.20:9870/explorer.html#/input

Upload the file1.txt and file2.txt files from the WordCount folder to the /input folder just created. The specific commands are as follows.

./bin/hadoop fs -put /opt/hadoop/hadoop-3.2.0/WordCount/*.txt  /input

Run the Hadoop example program with the output directory set to /output (the output directory does not need to be created in advance; it is generated automatically when the MapReduce job runs):

./bin/hadoop jar /opt/hadoop/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar  wordcount  /input /output

View the output directory listing and the WordCount results.
Use the following command to list the output directory.

./bin/hadoop fs -ls /output


Use the following command to view the results of WordCount.

./bin/hdfs dfs -cat /output/part-r-00000*

The output results are as follows.
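
Given the contents of file1.txt and file2.txt above, it should look roughly like the listing below. Note that the example WordCount splits on whitespace only and is case-sensitive, so "This"/"this" and "program"/"program!" are counted as separate words:

This            2
a               1
common          1
difficult,but   1
first           1
hadoop          2
is              3
not             1
program         2
program!        2
test            1
the             1
this            1
very            1
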
The running output results can also be viewed on the web side, with detailed information:

http://192.168.95.20:9870/explorer.html#/output

The above output is the number of occurrences of each word.
At this point, building a Hadoop cluster on CentOS and running three MapReduce word-count jobs is complete!

Here are some tips and suggestions for optimizing Hadoop cluster performance and MapReduce task efficiency:

  1. Hardware optimization: Choosing appropriate hardware configuration is the key to improving Hadoop cluster performance. For example, use a faster CPU, larger memory, faster hard drive (such as SSD), and a high-speed network connection.

  2. Configuration optimization: Hadoop and MapReduce configuration parameters can be tuned for the specific workload. For example, the HDFS block size can be increased to speed up processing of large files, or the MapReduce memory settings can be raised to accommodate larger tasks (see the configuration sketch after this list).

  3. Data localization: Run MapReduce tasks on the nodes where the data is located as much as possible to reduce network transmission overhead.

  4. Parallel processing: By increasing the parallelism of MapReduce tasks, the resources of the cluster can be more fully utilized.

  5. Programming optimization: When writing a MapReduce program, the transmission and sorting of data should be reduced as much as possible. For example, the Combiner function can be used to reduce data transfer between Map and Reduce stages.

  6. Use advanced tools: Some advanced data processing tools, such as Apache Hive and Apache Pig, can automatically optimize MapReduce tasks to make them more efficient.

  7. Monitoring and debugging: Using Hadoop's own monitoring tools, such as Hadoop Web UI and Hadoop Metrics, can help you find and solve performance problems.
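
For point 2 above, a small configuration sketch with illustrative values only (an assumption, tune for your own workload); dfs.blocksize belongs in hdfs-site.xml and the memory settings in mapred-site.xml:

<!-- hdfs-site.xml: 256 MB blocks for large sequential files -->
<property>
   <name>dfs.blocksize</name>
   <value>268435456</value>
</property>

<!-- mapred-site.xml: per-task memory -->
<property>
   <name>mapreduce.map.memory.mb</name>
   <value>2048</value>
</property>
<property>
   <name>mapreduce.reduce.memory.mb</name>
   <value>4096</value>
</property>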

The above are just some basic optimization tips; the exact strategy needs to be adjusted to your specific requirements and environment. Xiao Ma wishes you all the best with your deployment!


Origin blog.csdn.net/Myx74270512/article/details/133246660