Hadoop (deployment)

Table of contents

 Hadoop's three operating modes

Local run mode

Pseudo-distributed operating mode

Fully distributed operating mode (development focus)


 Hadoop's three operating modes

Hadoop operating modes include: local mode, pseudo-distributed mode and fully distributed mode.
Hadoop official website: http://hadoop.apache.org/

Local run mode

1. Official Grep case
①Create an input folder under the hadoop-2.7.6 directory:
[root@master hadoop-2.7.6]# mkdir input
②Copy Hadoop's xml configuration files to input:
[root@master hadoop-2.7.6]# cp etc/hadoop/*.xml input
③Execute the MapReduce program in the share directory
[root@master hadoop-2.7.6]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar grep input output 'dfs[a-z.]+'
④View the output results:
[root@master hadoop-2.7.6]# cat output/*
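The Grep example simply extracts every match of the given regex from the input files and counts the occurrences. As a sanity check of what the job computes, the same calculation can be sketched with plain grep on a hypothetical local sample (the /tmp path and file contents below are made up for illustration):

```shell
# Create a tiny sample file standing in for the copied xml configs
mkdir -p /tmp/grepdemo
printf '<name>dfs.replication</name>\n<name>dfs.permissions</name>\n<name>dfs.replication</name>\n' > /tmp/grepdemo/sample.xml
# Extract every match of 'dfs[a-z.]+' and count occurrences, most frequent first
grep -oE 'dfs[a-z.]+' /tmp/grepdemo/sample.xml | sort | uniq -c | sort -rn
```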

Pseudo-distributed operating mode

1. Start HDFS and run the MapReduce program

(1) Analysis
        ①Configure the cluster
        ②Start the cluster and test creating, deleting and viewing files on it
        ③Execute WordCount case
(2) Execution steps
        ①Configure the cluster
(a) Configuration: hadoop-env.sh
In the main node, get the path of jdk:
[root@master hadoop-2.7.6]# echo $JAVA_HOME
Edit hadoop-env.sh and add the jdk path:
[root@master hadoop]# vim hadoop-env.sh
export JAVA_HOME=/usr/local/soft/jdk1.8.0_171

(b) Configuration: core-site.xml
<!-- Specify the address of NameNode in HDFS -->
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
</property>
<!-- Specify the storage directory for files generated when Hadoop is running -->
<property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/soft/hadoop-2.7.6/tmp</value>
</property>
(c) Configuration: hdfs-site.xml
<!-- Specify the number of HDFS replicas -->
<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>
②Start the cluster
(a) Format the NameNode (format only on the first startup; do not format it repeatedly afterwards)
[root@master hadoop-2.7.6]# hadoop namenode -format

 

(b) Start NameNode
[root@master hadoop-2.7.6]# hadoop-daemon.sh start namenode

 

(c) Start DataNode
[root@master hadoop-2.7.6]# hadoop-daemon.sh start datanode
③View cluster
(a) Check whether the startup is successful
[root@master hadoop-2.7.6]# jps
7461 NameNode
7559 DataNode
7641 Jps
(b) View the HDFS file system on the web side
http://master:50070

 

(c) View the generated Log
Note: in production, when you hit a bug, the log messages are often the first place to look when diagnosing and fixing the problem.
[root@master hadoop-2.7.6]# cd logs/
[root@master logs]# ll

(d) Thinking: why shouldn't you format the NameNode repeatedly? What should you pay attention to when formatting it?
[root@master hadoop-2.7.6]# cd tmp/dfs/name/current
[root@master current]# cat VERSION
#Mon Jul 11 10:27:14 CST 2022
namespaceID=1179648118
clusterID=CID-92721281-f46b-419c-bab2-e23edc300e06
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1720451217-192.168.18.133-1657506434948
layoutVersion=-63
Then check the DataNode side (tmp/dfs/data):
[root@master data]# cd current/
[root@master current]# ll
total 4
drwx------. 4 root root  54 Jul 11 10:30 BP-1720451217-192.168.18.133-1657506434948
-rw-r--r--. 1 root root 229 Jul 11 10:30 VERSION
[root@master current]# cat VERSION
#Mon Jul 11 10:30:12 CST 2022
storageID=DS-7efd22c8-fc53-4e03-beae-6699e4398181
clusterID=CID-92721281-f46b-419c-bab2-e23edc300e06
cTime=0
datanodeUuid=5774abb7-ab49-4740-8747-a6923bd17d8e
storageType=DATA_NODE
layoutVersion=-56
Note: Formatting the NameNode will generate a new cluster ID, causing the cluster IDs of the NameNode and DataNode to be inconsistent, and the cluster cannot find past data. Therefore, when formatting the NameNode, be sure to delete the data and logs first, and then format the NameNode.
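The check described in that note can be scripted: compare the clusterID lines of the NameNode's and DataNode's VERSION files. The sketch below uses mock files under /tmp with a made-up ID rather than the real tmp/dfs paths:

```shell
# Mock VERSION files (paths and ID are illustrative only)
mkdir -p /tmp/nn /tmp/dn
echo 'clusterID=CID-demo-1111' > /tmp/nn/VERSION
echo 'clusterID=CID-demo-1111' > /tmp/dn/VERSION
# Extract and compare the two clusterIDs
nn_id=$(grep '^clusterID=' /tmp/nn/VERSION | cut -d= -f2)
dn_id=$(grep '^clusterID=' /tmp/dn/VERSION | cut -d= -f2)
if [ "$nn_id" = "$dn_id" ]; then
  echo "clusterIDs match: $nn_id"
else
  echo "MISMATCH: NameNode=$nn_id DataNode=$dn_id (clear data/ and logs/ before reformatting)"
fi
```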

④Operate the cluster

(a) Create an input folder on the HDFS file system
[root@master hadoop-2.7.6]# hadoop dfs -mkdir /input
(b) Upload the test file contents to the file system
[root@master hadoop-2.7.6]# hadoop dfs -put wordcountinput/wc.input /input
(c) Check whether the uploaded file is correct
[root@master hadoop-2.7.6]# hadoop dfs -ls /input
[root@master hadoop-2.7.6]# hadoop dfs -cat /input/wc.input
(d) Run the MapReduce program
[root@master hadoop-2.7.6]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
(e) View the output results
Command line view:
[root@master hadoop-2.7.6]# hadoop dfs -cat /output/*
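WordCount counts how often each word appears across the input. The same computation the job performs can be sketched with a coreutils pipeline over a made-up local file (path and contents are hypothetical):

```shell
# A stand-in for the uploaded wc.input
printf 'hadoop yarn\nhadoop mapreduce\n' > /tmp/wc.input
# Split into one word per line (the "map"), group and count (the "reduce")
tr -s ' ' '\n' < /tmp/wc.input | sort | uniq -c
```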

 2. Start YARN and run the MapReduce program

(1) Analysis
①Configure the cluster to run MR on YARN
②Start the cluster and test creating, deleting and viewing files on it
③Execute WordCount case on YARN
(2) Execution steps
①Configure the cluster
(a) Configure yarn-env.sh
Configure JAVA_HOME
export JAVA_HOME=/usr/local/soft/jdk1.8.0_171
(b) Configure yarn-site.xml
<!-- Specify the address of YARN's ResourceManager -->
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
</property>
<!-- How the Reducer obtains data -->
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>
(c) Configuration: mapred-env.sh
Configure JAVA_HOME
export JAVA_HOME=/usr/local/soft/jdk1.8.0_171
(d) Configuration: mapred-site.xml (first rename mapred-site.xml.template)
[root@master hadoop]# mv mapred-site.xml.template mapred-site.xml
[root@master hadoop]# vim mapred-site.xml
<!-- Specify MR to run on YARN -->
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
②Start the cluster
(a) Before starting, you must ensure that NameNode and DataNode have been started.
(b) Start ResourceManager
[root@master hadoop]# yarn-daemon.sh start resourcemanager
(c) Start NodeManager
[root@master hadoop]# yarn-daemon.sh start nodemanager
③Cluster operation
(a) View the YARN browser page: http://master:8088/cluster
(b) Execute MapReduce program:
[root@master hadoop-2.7.6]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
(c) View the running results:
[root@master hadoop-2.7.6]# hadoop dfs -cat /output/*

 

3. Configure the history server

To view the historical runs of finished programs, you need to configure the history server. The specific steps are as follows:
1. Configure mapred-site.xml
Add the following configuration to the file.
<!--Historical server address-->
<property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
</property>
<!-- History server web address -->
<property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
</property>
2. Start the history server
[root@master hadoop-2.7.6]# mr-jobhistory-daemon.sh start historyserver
3. Check whether the history server is started
[root@master hadoop-2.7.6]# jps
10016 Jps
9234 NodeManager
7461 NameNode
7559 DataNode
9948 JobHistoryServer
8941 ResourceManager
4. View JobHistory
http://master:19888/jobhistory

4. Configure log aggregation
Log aggregation: after an application finishes, its run logs are uploaded to HDFS.
Benefits of log aggregation: you can easily inspect the details of a program run, which helps development and debugging.
Note: enabling log aggregation requires restarting the NodeManager, ResourceManager and JobHistoryServer.
The specific steps to enable the log aggregation function are as follows:
1. Configure yarn-site.xml
[root@master hadoop]# vim yarn-site.xml
Add the following configuration to the file.
<!-- Enable log aggregation function -->
<property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
</property>
<!-- Set log retention time to 7 days -->
<property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
</property>
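The retention value is given in seconds; a quick arithmetic check that 604800 is exactly seven days:

```shell
# 7 days x 24 hours x 60 minutes x 60 seconds
echo $(( 7 * 24 * 60 * 60 ))   # prints 604800
```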
2. Stop the NodeManager, ResourceManager and JobHistoryServer
[root@master hadoop]# yarn-daemon.sh stop resourcemanager
[root@master hadoop]# yarn-daemon.sh stop nodemanager
[root@master hadoop]# mr-jobhistory-daemon.sh stop historyserver
3. Start the NodeManager, ResourceManager and JobHistoryServer
[root@master hadoop]# yarn-daemon.sh start resourcemanager
[root@master hadoop]# yarn-daemon.sh start nodemanager
[root@master hadoop]# mr-jobhistory-daemon.sh start historyserver
4. Delete existing output files on HDFS
[root@master hadoop]# hadoop dfs -rm -r /output
5. Execute the WordCount program
[root@master hadoop-2.7.6]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
6. View logs
http://master:19888/jobhistory

 

5. Configuration file description
Hadoop configuration files fall into two categories: default configuration files and custom configuration files. Only when users want to change a default value do they need to modify the corresponding property in a custom configuration file.
(1) Default configuration files
core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml, each packaged inside the corresponding Hadoop jar.

(2) Custom configuration files:
The four files core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml are stored in $HADOOP_HOME/etc/hadoop; users can modify their configuration there.
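A minimal sketch of that precedence rule: when the same property appears in both a default file and a site file, the site value wins. The flattened properties file below is only an illustration of the "last definition wins" idea, not Hadoop's real loading mechanism, and the values are made up:

```shell
# Simulate loading defaults first, then the site override
echo 'dfs.replication=3' >  /tmp/conf.properties   # from hdfs-default.xml
echo 'dfs.replication=1' >> /tmp/conf.properties   # from hdfs-site.xml, loaded later
# The effective value is the last one defined
effective=$(grep '^dfs.replication=' /tmp/conf.properties | tail -1 | cut -d= -f2)
echo "effective dfs.replication = $effective"
```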

Completely distributed operating mode (development focus)

1. Preparation
(1) Machines
Three virtual machines: master, node1, node2
(2) Time synchronization
Check that the three machines agree on the time:
date
(3) Modify the host name
The three host names are: master, node1, node2
(4) Turn off the firewall
systemctl stop firewalld
(5) Check firewall status
systemctl status firewalld
(6) Cancel firewall auto-start
systemctl disable firewalld
(7) Password-free login
All three machines need passwordless login to each other:
# 1. Generate key
ssh-keygen -t rsa
# 2. Configure password-free login
ssh-copy-id master
ssh-copy-id node1
ssh-copy-id node2
# 3. Test password-free login
ssh node1
2. Build a Hadoop cluster
(1) Upload the installation package and decompress it
①Use xftp to upload the compressed package to /usr/local/packages/ of the master
②Unzip
tar -zxvf hadoop-2.7.6.tar.gz -C /usr/local/soft/
(2) Configure environment variables
vim /etc/profile
export JAVA_HOME=/usr/local/soft/jdk1.8.0_171
export HADOOP_HOME=/usr/local/soft/hadoop-2.7.6
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Reload environment variables
source /etc/profile
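What the profile lines accomplish can be checked by confirming that the Hadoop bin directories end up at the front of PATH (the directories need not actually exist for the variables to be set):

```shell
# Same lines as in /etc/profile above
export HADOOP_HOME=/usr/local/soft/hadoop-2.7.6
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Show the first two PATH entries: bin and sbin of HADOOP_HOME
echo "$PATH" | tr ':' '\n' | head -n 2
```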
(3) Modify Hadoop configuration file
①Enter hadoop configuration file
cd /usr/local/soft/hadoop-2.7.6/etc/hadoop/
②core-site.xml
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
</property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/soft/hadoop-2.7.6/tmp</value>
</property>
<property>
        <name>fs.trash.interval</name>
        <value>1440</value>
</property>
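fs.trash.interval is specified in minutes, so the value 1440 keeps deleted files recoverable from the trash for one day:

```shell
# 1440 minutes / 60 = hours of trash retention
echo $(( 1440 / 60 ))   # prints 24
```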
③hadoop-env.sh
export JAVA_HOME=/usr/local/soft/jdk1.8.0_171

④hdfs-site.xml
<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>
<property>
        <name>dfs.permissions</name>
        <value>false</value>
</property>
⑤mapred-site.xml
Copy mapred-site.xml.template to mapred-site.xml:
cp mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
<property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
</property>
<property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
</property>
⑥slaves
node1
node2
⑦yarn-site.xml
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
</property>
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>
<property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
</property>
<property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
</property>
(4) Distribute Hadoop to node1 and node2
cd /usr/local/soft/
scp -r hadoop-2.7.6/ node1:`pwd`
scp -r hadoop-2.7.6/ node2:`pwd`
(5) Format the NameNode (only needed before the first startup)
hdfs namenode -format

(6) Start the Hadoop cluster
start-all.sh
(7) Check the processes on master, node1, and node2
①master: jps should show NameNode, SecondaryNameNode and ResourceManager

②node1: jps should show DataNode and NodeManager

③node2: jps should show DataNode and NodeManager
(8) Access the WEB interface of HDFS
http://master:50070

(9) Access YARN’s WEB interface
http://master:8088/cluster

3.Usage of scp&rsync
(1) scp (secure copy)
①scp definition:
scp copies data between servers (from server1 to server2).
②Basic syntax
scp -r <source path/name> <destination user>@<host>:<destination path/name>
For example:
scp -r /usr/local/soft/data/ node1:`pwd`
③Case practice
On the master, copy the data/ directory under /usr/local/soft/ to the same path on node1:
[root@master soft]# scp -r data/ node1:`pwd`
(2) rsync remote synchronization tool
rsync is mainly used for backup and mirroring. It has the advantages of being fast, avoiding copying the same content, and supporting symbolic links.
The difference between rsync and scp: using rsync to copy files is faster than scp. rsync only updates the difference files. scp copies all files.
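That difference can be demonstrated locally (assuming rsync is installed; two /tmp directories stand in for the remote host): after one file changes, a second itemized run transfers only that file.

```shell
# Set up a source with two files and sync it once
mkdir -p /tmp/rsrc /tmp/rdst
echo one > /tmp/rsrc/a.txt
echo two > /tmp/rsrc/b.txt
rsync -rt /tmp/rsrc/ /tmp/rdst/    # first run copies everything (-t preserves mtimes)
echo three > /tmp/rsrc/b.txt       # change only one file
rsync -rti /tmp/rsrc/ /tmp/rdst/   # -i itemizes changes: only b.txt is transferred
```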
①Basic syntax
rsync -rvl <source path/name> <destination user>@<host>:<destination path/name>
For example:
[root@master soft]# rsync -rvl data/ node2:`pwd`
Option description: -r recurse into directories, -v verbose output, -l copy symlinks as symlinks.
②Case practice
Synchronize the /usr/local/packages directory on the master to /usr/local/packages on node1:
[root@master local]# rsync -rvl packages/ node1:/usr/local/packages/

Origin blog.csdn.net/qq_40322236/article/details/128894396