Hadoop has four advantages:
(1) High availability: Hadoop maintains multiple copies of data at the storage layer, so the failure of a single compute or storage element does not cause data loss.
(2) High scalability: tasks and data are distributed across the cluster, which can easily scale out to thousands of nodes.
(3) Efficiency: under the MapReduce model, Hadoop works in parallel, speeding up task processing.
(4) High fault tolerance: failed tasks are automatically reassigned.
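The high-availability point can be sketched with a toy model (plain Python, not Hadoop code): with each block replicated on three nodes, no single node failure makes a block unreadable. The block and node names here are made up for illustration.

```python
# Toy illustration (not Hadoop code): each block is replicated
# on three different nodes, so one node failing loses no block.
replicas = {
    "blk_1": {"node1", "node2", "node3"},
    "blk_2": {"node2", "node3", "node4"},
}

def surviving_blocks(failed_node):
    """Blocks that still have at least one replica on a healthy node."""
    return sorted(b for b, nodes in replicas.items()
                  if nodes - {failed_node})

print(surviving_blocks("node2"))  # ['blk_1', 'blk_2'] -- nothing lost
```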
Hadoop composition:
Overview of the HDFS architecture:
The Hadoop Distributed File System (HDFS for short) is a distributed file system.
(1) NameNode (nn): stores file metadata, such as file names, the directory structure, and file attributes (creation time, number of replicas, permissions), as well as the block list of each file and the DataNodes on which each block resides.
(2) DataNode (dn): stores file block data in the local file system, along with checksums of the block data.
(3) Secondary NameNode (2nn): backs up NameNode metadata at regular intervals.
Overview of the YARN architecture:
Yet Another Resource Negotiator (YARN for short) is Hadoop's resource manager; it coordinates and allocates cluster resources to running applications.
Overview of the MapReduce architecture:
MapReduce divides the computation into two stages: Map and Reduce.
1) The Map stage processes the input data in parallel.
2) The Reduce stage aggregates the Map results.
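The two stages can be sketched in plain Python (a conceptual illustration, not a real MapReduce job): the Map phase emits (word, 1) pairs from each input line independently, and the Reduce phase sums the counts per word.

```python
from collections import defaultdict

# Map phase: process each input line independently,
# emitting (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Reduce phase: aggregate the counts for each word.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop yarn", "hadoop mapreduce"]
print(reduce_phase(map_phase(lines)))
# {'hadoop': 2, 'yarn': 1, 'mapreduce': 1}
```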
The relationship between HDFS, YARN, and MapReduce:
Big data technology ecosystem:
1) Sqoop: Sqoop is an open-source tool mainly used to transfer data between Hadoop/Hive and traditional databases (such as MySQL). It can import data from a relational database (e.g., MySQL, Oracle) into Hadoop's HDFS, and can also export HDFS data into a relational database.
2) Flume: Flume is a highly available, highly reliable distributed system for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizing various data senders in the log system to collect data.
3) Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system;
4) Spark: Spark is currently the most popular open-source in-memory big data computing framework. It can run computations over the big data stored on Hadoop.
5) Flink: Flink is a popular open-source framework for in-memory big data computing, widely used in real-time (streaming) computation scenarios.
6) Oozie: Oozie is a workflow scheduling management system for managing Hadoop jobs.
7) HBase: HBase is a distributed, column-oriented open-source database. Unlike a typical relational database, HBase is well suited to storing unstructured data.
8) Hive: Hive is a data warehouse tool built on Hadoop. It can map structured data files to database tables and provides simple SQL query functionality, converting SQL statements into MapReduce jobs for execution. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in data warehouses.
9) ZooKeeper: ZooKeeper is a reliable coordination system for large distributed systems. It provides configuration maintenance, naming services, distributed synchronization, group services, etc.
Hadoop operating environment construction
(1) Install the template virtual machine: IP address 192.168.10.100, hostname hadoop100, 4 GB memory, 100 GB disk
(2) Install epel-release
(3) Turn off the firewall
(4) Create the yhd user
(5) Give the yhd user root privileges so that sudo commands are convenient to run later
(6) Create module and software folders to hold the software packages and the extracted programs
(7) Change the owner and group of the directories under /opt
(8) Uninstall the JDK that ships with the virtual machine:
rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
Virtual machine cloning: hadoop102 hadoop103 hadoop104
Use hadoop100 as the template for cloning; the following takes hadoop102 as an example and modifies the configuration as follows:
(1) Modify the IP address (nmtui or vim /etc/sysconfig/network-scripts/ifcfg-ens32)
(2) Modify the host name (hostnamectl set-hostname hadoop102)
(3) Modify the host name mapping /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108
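The mappings above follow a simple pattern (192.168.10.1NN pairs with hadoop1NN), so the same lines can be generated programmatically; a small sketch:

```python
# Generate the /etc/hosts entries for hadoop100 through hadoop108,
# following the 192.168.10.1NN <-> hadoop1NN pattern.
entries = [f"192.168.10.{n} hadoop{n}" for n in range(100, 109)]
print("\n".join(entries))
```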
(4) Modify the Windows hosts file and add the same host mapping information:
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108
(5) On hadoop102, upload, decompress, and install the JDK and Hadoop
[yhd@hadoop102 software]$ ll
total 520600
-rw-r--r--. 1 yhd yhd 338075860 Feb 24 09:00 hadoop-3.1.3.tar.gz
-rw-r--r--. 1 yhd yhd 195013152 Feb 24 09:09 jdk-8u212-linux-x64.tar.gz
[yhd@hadoop102 software]$ ll ../module/
total 0
drwxr-xr-x. 13 yhd yhd 204 Mar 20 19:56 hadoop-3.1.3
drwxr-xr-x. 7 yhd yhd 245 Apr 2 2019 jdk1.8.0_212
(6) Configure environment variables and source them to take effect
[yhd@hadoop102 ~]$ vim /etc/profile.d/my_env.sh
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
[yhd@hadoop102 ~]$ source /etc/profile.d/my_env.sh
(7) Test whether the environment variables took effect
(8) View the Hadoop directory structure
[yhd@hadoop102 hadoop-3.1.3]$ ll
total 180
drwxr-xr-x. 2 yhd yhd 183 Sep 12 2019 bin
drwxrwxr-x. 4 yhd yhd 37 Mar 20 19:57 data
drwxr-xr-x. 3 yhd yhd 20 Sep 12 2019 etc
drwxr-xr-x. 2 yhd yhd 106 Sep 12 2019 include
drwxr-xr-x. 3 yhd yhd 20 Sep 12 2019 lib
drwxr-xr-x. 4 yhd yhd 288 Sep 12 2019 libexec
-rw-rw-r--. 1 yhd yhd 147145 Sep 4 2019 LICENSE.txt
drwxrwxr-x. 3 yhd yhd 4096 Mar 20 19:57 logs
-rw-rw-r--. 1 yhd yhd 21867 Sep 4 2019 NOTICE.txt
-rw-rw-r--. 1 yhd yhd 1366 Sep 4 2019 README.txt
drwxr-xr-x. 3 yhd yhd 4096 Sep 12 2019 sbin
drwxr-xr-x. 4 yhd yhd 31 Sep 12 2019 share
Important directories:
(1) bin directory: store scripts for operating Hadoop related services (hdfs, yarn, mapred)
(2) etc directory: Hadoop configuration file directory, storing Hadoop configuration files
(3) lib directory: stores Hadoop native libraries (for compressing and decompressing data)
(4) sbin directory: store scripts to start or stop Hadoop related services
(5) share directory: store Hadoop dependent jar packages, documents, and official cases
(9) Configure clock synchronization (ntp) and set it to start automatically on boot
(10) Configure passwordless SSH login
Both the yhd and root users must distribute their public keys
(11) xsync cluster distribution script
[yhd@hadoop102 ~]$ mkdir bin
[yhd@hadoop102 ~]$ cd bin
[yhd@hadoop102 bin]$ vim xsync
#!/bin/bash
# 1. Check the number of arguments
if [ $# -lt 1 ]
then
  echo "Not enough arguments!"
  exit
fi
# 2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
  echo ==================== $host ====================
  # 3. Loop over all files/directories and send them one by one
  for file in "$@"
  do
    # 4. Check whether the file exists
    if [ -e "$file" ]
    then
      # 5. Get the parent directory (resolving symlinks)
      pdir=$(cd -P "$(dirname "$file")"; pwd)
      # 6. Get the name of the current file
      fname=$(basename "$file")
      ssh "$host" "mkdir -p $pdir"
      rsync -av "$pdir/$fname" "$host:$pdir"
    else
      echo "$file does not exist!"
    fi
  done
done
(12) Grant execute permission and run the script
(13) Distribute copies of the software packages, extracted files, and configuration under /opt to hadoop103 and hadoop104, and verify
Hadoop cluster configuration
1. Cluster deployment planning
(1) NameNode and SecondaryNameNode should not be installed on the same server
(2) ResourceManager also consumes memory, do not configure it on the same machine as NameNode and SecondaryNameNode
2. Configuration file description
Hadoop configuration files fall into two categories: default configuration files and custom configuration files. Only when users want to change a default configuration value do they need to modify the corresponding property in a custom configuration file.
Default configuration file:
Custom configuration file:
The four configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop, and users can modify them according to project requirements.
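Each of these *-site.xml files pairs a <name> with a <value> inside <property> elements; as a quick sketch (using a minimal, hypothetical snippet), the properties can be read with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A minimal core-site.xml-style snippet (value is illustrative).
xml_text = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop102:8020</value>
  </property>
</configuration>
"""

root = ET.fromstring(xml_text)
# Collect every <property> as a name -> value mapping.
props = {p.findtext("name"): p.findtext("value")
         for p in root.iter("property")}
print(props["fs.defaultFS"])  # hdfs://hadoop102:8020
```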
(1) Configure core-site.xml
<configuration>
<!-- Specify the address of the NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<!-- Specify the Hadoop data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.1.3/data</value>
</property>
</configuration>
(2) Configure hdfs-site.xml
<configuration>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop102:9870</value>
</property>
<!-- Secondary NameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:9868</value>
</property>
</configuration>
(3) Configure yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Have MR use the shuffle auxiliary service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Specify the address of the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<!-- Inheritance of environment variables -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
(4) Configure mapred-site.xml
<configuration>
<!-- Run MapReduce programs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
(5) Configure the workers file
[root@hadoop102 hadoop]# vim workers
hadoop102
hadoop103
hadoop104
(6) Distribute the configured Hadoop configuration files on the cluster
[yhd@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc
==================== hadoop102 ====================
sending incremental file list
sent 910 bytes received 19 bytes 1,858.00 bytes/sec
total size is 107,468 speedup is 115.68
==================== hadoop103 ====================
sending incremental file list
etc/
etc/hadoop/
etc/hadoop/workers
sent 993 bytes received 54 bytes 2,094.00 bytes/sec
total size is 107,468 speedup is 102.64
==================== hadoop104 ====================
sending incremental file list
etc/
etc/hadoop/
etc/hadoop/workers
sent 993 bytes received 54 bytes 698.00 bytes/sec
total size is 107,468 speedup is 102.64
[yhd@hadoop102 hadoop]$
(7) Verify the distributed files
(8) Start the cluster
1. If the cluster is being started for the first time, format the NameNode on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID. If the NameNode and DataNodes end up with inconsistent cluster IDs, the cluster cannot find its past data. If an error occurs while the cluster is running and the NameNode must be reformatted, first stop the namenode and datanode processes and delete the data and logs directories on all machines before formatting.)
[yhd@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
2021-03-20 19:56:01,526 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop102/192.168.10.102
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.1.3
STARTUP_MSG: classpath = /opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/accessors-smart-1.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/animal-sniffer-annotations-1.17.j
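The "inconsistent cluster ID" problem mentioned in step 1 shows up as differing clusterID fields in the VERSION files under the NameNode's and DataNodes' data directories. A sketch of the comparison (the VERSION contents here are made-up examples):

```python
# Compare the clusterID lines of two HDFS VERSION files (toy input).
def cluster_id(version_text):
    for line in version_text.splitlines():
        if line.startswith("clusterID="):
            return line.split("=", 1)[1]

nn_version = "clusterID=CID-aaa\ncTime=0"   # NameNode side (example)
dn_version = "clusterID=CID-bbb\ncTime=0"   # DataNode side (example)
# A mismatch means the DataNode will refuse to join this NameNode.
print(cluster_id(nn_version) == cluster_id(dn_version))  # False
```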
2. Start hdfs
[yhd@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
Starting namenodes on [hadoop102]
Starting datanodes
hadoop103: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
hadoop104: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
Starting secondary namenodes [hadoop104]
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
12660 Jps
12266 NameNode
3. Start yarn (in hadoop103)
[yhd@hadoop103 hadoop]$ sbin/start-yarn.sh
[yhd@hadoop103 hadoop]$ jps
11762 Jps
9801 DataNode
9993 ResourceManager
10126 NodeManager
4. View the HDFS NameNode web UI (http://hadoop102:9870, as configured above)
5. View the YARN ResourceManager web UI (http://hadoop103:8088 by default)
(9) Cluster test
Upload files to the cluster
[yhd@hadoop102 ~]$ hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
2021-03-20 19:58:39,234 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[yhd@hadoop102 ~]$ hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /
2021-03-20 19:59:22,656 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 19:59:24,974 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
6. Check where the files are stored on disk after uploading
[yhd@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-510334103-192.168.10.102-1616241362316/current/finalized/subdir0/subdir0
[yhd@hadoop102 subdir0]$ ll
total 192208
-rw-rw-r--. 1 yhd yhd 46 Mar 20 19:58 blk_1073741825
-rw-rw-r--. 1 yhd yhd 11 Mar 20 19:58 blk_1073741825_1001.meta
-rw-rw-r--. 1 yhd yhd 134217728 Mar 20 19:59 blk_1073741826
-rw-rw-r--. 1 yhd yhd 1048583 Mar 20 19:59 blk_1073741826_1002.meta
-rw-rw-r--. 1 yhd yhd 60795424 Mar 20 19:59 blk_1073741827
-rw-rw-r--. 1 yhd yhd 474975 Mar 20 19:59 blk_1073741827_1003.meta
-rw-rw-r--. 1 yhd yhd 38 Mar 20 20:04 blk_1073741834
-rw-rw-r--. 1 yhd yhd 11 Mar 20 20:04 blk_1073741834_1010.meta
-rw-rw-r--. 1 yhd yhd 439 Mar 20 20:04 blk_1073741835
-rw-rw-r--. 1 yhd yhd 11 Mar 20 20:04 blk_1073741835_1011.meta
-rw-rw-r--. 1 yhd yhd 25306 Mar 20 20:04 blk_1073741836
-rw-rw-r--. 1 yhd yhd 207 Mar 20 20:04 blk_1073741836_1012.meta
-rw-rw-r--. 1 yhd yhd 214462 Mar 20 20:04 blk_1073741837
-rw-rw-r--. 1 yhd yhd 1683 Mar 20 20:04 blk_1073741837_1013.meta
[yhd@hadoop102 subdir0]$
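The block sizes listed above line up with HDFS's default 128 MB block size: the 195,013,152-byte JDK archive was split into one full 134,217,728-byte block (blk_1073741826) plus a 60,795,424-byte remainder (blk_1073741827). A quick arithmetic check:

```python
# Check how a 195,013,152-byte file splits into 128 MB HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024   # 134217728 bytes, the HDFS default
file_size = 195013152            # jdk-8u212-linux-x64.tar.gz

full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
print(full_blocks, remainder)    # 1 60795424
```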
7. View the contents of the files stored in HDFS on the disk
[yhd@hadoop102 subdir0]$ cat blk_1073741825
hadoop yarn
hadoop mapreduce
atguigu
atguigu
8. Execute wordcount program
[yhd@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
2021-03-20 20:03:51,652 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.10.103:8032
2021-03-20 20:03:52,181 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yhd/.staging/job_1616241436786_0001
2021-03-20 20:03:52,272 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 20:03:52,407 INFO input.FileInputFormat: Total input files to process : 1
2021-03-20 20:03:52,435 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
9. View the running status of the compute job on the YARN web UI
10. Check whether the output data exists in HDFS
11. Configure the history server (the history link clicks above fail because the history server was not configured)
Edit mapred-site.xml:
[yhd@hadoop102 hadoop]$ vim mapred-site.xml
<configuration>
<!-- Run MapReduce programs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- History server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<!-- History server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
</property>
</configuration>
Distribute the configuration:
[yhd@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/mapred-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 64 bytes received 12 bytes 152.00 bytes/sec
total size is 1,170 speedup is 15.39
==================== hadoop103 ====================
sending incremental file list
mapred-site.xml
sent 585 bytes received 47 bytes 1,264.00 bytes/sec
total size is 1,170 speedup is 1.85
==================== hadoop104 ====================
sending incremental file list
mapred-site.xml
sent 585 bytes received 47 bytes 1,264.00 bytes/sec
total size is 1,170 speedup is 1.85
[yhd@hadoop102 hadoop]$
Start the history server on hadoop102:
[yhd@hadoop102 hadoop-3.1.3]$ bin/mapred --daemon start historyserver
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
15666 JobHistoryServer
12266 NameNode
12733 NodeManager
15725 Jps
Run a new job:
[yhd@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output1
2021-03-20 22:33:14,973 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.10.103:8032
2021-03-20 22:33:16,139 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yhd/.staging/job_1616241436786_0002
2021-03-20 22:33:16,274 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 22:33:17,351 INFO input.FileInputFormat: Total input files to process : 1
2021-03-20 22:33:17,405 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 22:33:17,586 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 22:33:17,664 INFO mapreduce.JobSubmitter: number of splits:1
Check the HDFS page
Test again whether the history server works
Clicking "logs" shows a prompt indicating that the log aggregation function needs to be configured
Configure log aggregation
Log aggregation concept: after an application finishes, its run logs are uploaded to HDFS.
Benefit of log aggregation: you can conveniently view program run details, which helps development and debugging.
To enable log aggregation, NodeManager, ResourceManager, and HistoryServer must be restarted.
Configure yarn-site.xml:
[yhd@hadoop102 hadoop]$ vim yarn-site.xml
<!-- Enable the log aggregation function -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Set the log aggregation server URL -->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Set the log retention time to 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
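The retention value configured above is simply 7 days expressed in seconds:

```python
# yarn.log-aggregation.retain-seconds: 7 days in seconds.
retention = 7 * 24 * 60 * 60
print(retention)  # 604800
```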
Distribute to the other two servers
[yhd@hadoop102 hadoop]$ xsync yarn-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 62 bytes received 12 bytes 148.00 bytes/sec
total size is 1,621 speedup is 21.91
==================== hadoop103 ====================
sending incremental file list
yarn-site.xml
sent 1,034 bytes received 47 bytes 720.67 bytes/sec
total size is 1,621 speedup is 1.50
==================== hadoop104 ====================
sending incremental file list
yarn-site.xml
sent 1,034 bytes received 47 bytes 720.67 bytes/sec
total size is 1,621 speedup is 1.50
Stop the history server process on hadoop102:
[yhd@hadoop102 hadoop-3.1.3]$ mapred --daemon stop historyserver
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
16617 Jps
12266 NameNode
12733 NodeManager
Stop the yarn processes on hadoop103:
[yhd@hadoop103 hadoop-3.1.3]$ jps
12531 Jps
9801 DataNode
9993 ResourceManager
10126 NodeManager
[yhd@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
Stopping nodemanagers
hadoop102: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
hadoop104: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
Stopping resourcemanager
Start yarn on 103
[yhd@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[yhd@hadoop103 hadoop-3.1.3]$ jps
12913 ResourceManager
13046 NodeManager
9801 DataNode
13391 Jps
Start the history server on 102
[yhd@hadoop102 hadoop-3.1.3]$ mapred --daemon start historyserver
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
17029 Jps
16969 JobHistoryServer
12266 NameNode
16797 NodeManager
Run a new job to check and verify, as follows:
[yhd@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output2
2021-03-20 22:55:55,213 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.10.103:8032
2021-03-20 22:55:55,683 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yhd/.staging/job_1616251848467_0001
2021-03-20 22:55:55,764 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
First check on HDFS whether the job succeeded
Then look at the job scheduling and execution on the YARN page
Click "logs" to view the aggregated logs