Introduction to Hadoop and cluster setup and testing (1)

Hadoop has four main advantages:

(1) High availability: Hadoop keeps multiple copies of data at the storage layer, so the failure of a single computing element or storage node does not cause data loss.

(2) High scalability: tasks and data are distributed across the cluster, which can easily be scaled out to thousands of nodes.

(3) Efficiency: under the MapReduce model, Hadoop processes tasks in parallel, which speeds up processing.

(4) High fault tolerance: failed tasks are automatically reassigned.

Hadoop components:

Overview of HDFS architecture:

The Hadoop Distributed File System, HDFS for short, is a distributed file system.

(1) NameNode (nn): Stores file metadata, such as the file name, directory structure, and file attributes (creation time, number of replicas, permissions), as well as each file's block list and the DataNodes on which each block resides.

(2) DataNode (dn): Stores file block data on the local file system, along with checksums of the block data.

(3) Secondary NameNode (2nn): Backs up NameNode metadata at regular intervals.

Overview of the YARN architecture:

Yet Another Resource Negotiator, YARN for short, is Hadoop's resource manager; it allocates cluster resources to running applications.

Overview of the MapReduce architecture:

MapReduce divides the computation into two stages, Map and Reduce:

1) The Map phase processes the input data in parallel

2) The Reduce phase aggregates the results of the Map phase

The relationship between HDFS, YARN, and MapReduce:

Big data technology ecosystem:

1) Sqoop: Sqoop is an open source tool used mainly to transfer data between Hadoop/Hive and traditional relational databases such as MySQL. It can import data from a relational database (for example MySQL or Oracle) into HDFS, and can also export data from HDFS back into a relational database.

2) Flume: Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizing various data senders in the log system to collect data.

3) Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system;

4) Spark: Spark is currently the most popular open source in-memory computing framework for big data. It can run computations over the big data stored in Hadoop.

5) Flink: Flink is an open source in-memory computing framework for big data that is widely used in real-time computation scenarios.

6) Oozie: Oozie is a workflow scheduling management system for managing Hadoop jobs.

7) HBase: HBase is a distributed, column-oriented open source database. Unlike a typical relational database, HBase is well suited to storing unstructured data.

8) Hive: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL-like query language whose statements are converted into MapReduce jobs for execution. Its advantage is a low learning cost: simple MapReduce-style statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it very suitable for statistical analysis in a data warehouse.

9) ZooKeeper: ZooKeeper is a reliable coordination service for large-scale distributed systems, providing configuration maintenance, naming services, distributed synchronization, group services, and so on.

Setting up the Hadoop runtime environment

(1) Install a template virtual machine: IP address 192.168.10.100, hostname hadoop100, 4 GB of memory, 100 GB of disk

(2) Install epel-release (example commands for steps (2) through (7) are sketched after this list)

(3) Turn off the firewall

(4) Create the yhd user

(5) Give the yhd user root privileges so that sudo can be used in later steps

(6) Create module and software folders under /opt to hold the software packages and the extracted programs

(7) Change the group and owner of the directories under /opt to yhd

(8) Uninstall the JDK that ships with the virtual machine:

rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
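
A minimal sketch of steps (2) through (7), assuming a CentOS 7 template and that the commands are run as root (or prefixed with sudo); adjust paths and the username to your environment:

yum install -y epel-release                      # (2) install the epel-release repository
systemctl stop firewalld                         # (3) turn off the firewall
systemctl disable firewalld.service              #     and keep it off after reboot
useradd yhd                                      # (4) create the yhd user
passwd yhd
# (5) grant yhd passwordless sudo: add a line such as
#     yhd ALL=(ALL) NOPASSWD:ALL
#     to /etc/sudoers (edit with visudo)
mkdir /opt/module /opt/software                  # (6) directories for extracted programs and packages
chown yhd:yhd /opt/module /opt/software          # (7) change owner and group to yhd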

Virtual machine cloning: hadoop102 hadoop103 hadoop104

Use hadoop100 as the template for cloning; the following takes hadoop102 as an example and modifies its configuration as follows

(1) Modify the IP address (nmtui or vim /etc/sysconfig/network-scripts/ifcfg-ens32)

(2) Modify the host name (hostnamectl set-hostname hadoop102)

(3) Modify the hostname mappings in /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108
(4) Modify the Windows hosts file and add the same host mapping entries to it

192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108

(5) On hadoop102, upload, extract, and install the JDK and Hadoop
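
Assuming the two archives have been uploaded to /opt/software, they can be extracted into /opt/module, for example:

[yhd@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
[yhd@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/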

[yhd@hadoop102 software]$ ll
total 520600
-rw-r--r--. 1 yhd yhd 338075860 Feb 24 09:00 hadoop-3.1.3.tar.gz
-rw-r--r--. 1 yhd yhd 195013152 Feb 24 09:09 jdk-8u212-linux-x64.tar.gz
[yhd@hadoop102 software]$ ll ../module/
total 0
drwxr-xr-x. 13 yhd yhd 204 Mar 20 19:56 hadoop-3.1.3
drwxr-xr-x.  7 yhd yhd 245 Apr  2  2019 jdk1.8.0_212

(6) Configure environment variables and source them to take effect

[yhd@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
[yhd@hadoop102 ~]$ source  /etc/profile.d/my_env.sh 

(7) Test whether the environment variables take effect
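
For example, both commands should print their version information if the variables are in effect:

[yhd@hadoop102 ~]$ java -version
[yhd@hadoop102 ~]$ hadoop version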

(8) View the directory structure of hadoop

[yhd@hadoop102 hadoop-3.1.3]$ ll
total 180
drwxr-xr-x. 2 yhd yhd    183 Sep 12  2019 bin
drwxrwxr-x. 4 yhd yhd     37 Mar 20 19:57 data
drwxr-xr-x. 3 yhd yhd     20 Sep 12  2019 etc
drwxr-xr-x. 2 yhd yhd    106 Sep 12  2019 include
drwxr-xr-x. 3 yhd yhd     20 Sep 12  2019 lib
drwxr-xr-x. 4 yhd yhd    288 Sep 12  2019 libexec
-rw-rw-r--. 1 yhd yhd 147145 Sep  4  2019 LICENSE.txt
drwxrwxr-x. 3 yhd yhd   4096 Mar 20 19:57 logs
-rw-rw-r--. 1 yhd yhd  21867 Sep  4  2019 NOTICE.txt
-rw-rw-r--. 1 yhd yhd   1366 Sep  4  2019 README.txt
drwxr-xr-x. 3 yhd yhd   4096 Sep 12  2019 sbin
drwxr-xr-x. 4 yhd yhd     31 Sep 12  2019 share

Important directories

(1) bin directory: stores scripts for operating the Hadoop-related services (hdfs, yarn, mapred)

(2) etc directory: the Hadoop configuration directory, which stores Hadoop's configuration files

(3) lib directory: stores Hadoop's native libraries (used to compress and decompress data)

(4) sbin directory: stores scripts that start or stop the Hadoop-related services

(5) share directory: stores Hadoop's dependency jars, documentation, and official examples

(9) Configure clock synchronization (ntp) and enable it to start on boot
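
A minimal sketch, assuming the ntp package from the CentOS repositories is used on each node:

[yhd@hadoop102 ~]$ sudo yum install -y ntp
[yhd@hadoop102 ~]$ sudo systemctl start ntpd
[yhd@hadoop102 ~]$ sudo systemctl enable ntpd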

(10) SSH passwordless login configuration

Public keys must be distributed for both the yhd and root users
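
For example, for the yhd user on hadoop102 (repeat on hadoop103 and hadoop104, and again as root where needed):

[yhd@hadoop102 ~]$ ssh-keygen -t rsa
[yhd@hadoop102 ~]$ ssh-copy-id hadoop102
[yhd@hadoop102 ~]$ ssh-copy-id hadoop103
[yhd@hadoop102 ~]$ ssh-copy-id hadoop104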

(11) xsync cluster distribution script

[yhd@hadoop102 ~]$ mkdir bin
[yhd@hadoop102 ~]$ cd bin
[yhd@hadoop102 bin]$ vim xsync 
#!/bin/bash
# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not enough arguments!"
    exit
fi
# 2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    # 3. Loop over all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory (resolving symlinks)
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the name of the current file
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo "$file does not exist!"
        fi
    done
done

(12) Make the script executable and run it
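
For example, granting execute permission and distributing the script directory itself as a first test:

[yhd@hadoop102 bin]$ chmod +x xsync
[yhd@hadoop102 bin]$ ./xsync /home/yhd/bin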

(13) Distribute the packages and extracted programs under /opt/ and the environment configuration to hadoop103 and hadoop104, then verify
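
A sketch of the distribution, assuming the xsync script from step (11) and that root's SSH keys have already been distributed (the environment file under /etc needs root to write):

[yhd@hadoop102 ~]$ xsync /opt/module/
[yhd@hadoop102 ~]$ sudo /home/yhd/bin/xsync /etc/profile.d/my_env.sh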

Hadoop cluster configuration

1. Cluster deployment planning

(1) The NameNode and the SecondaryNameNode should not run on the same server.

(2) The ResourceManager also consumes a lot of memory; do not place it on the same machine as the NameNode or the SecondaryNameNode.

Following these rules with three nodes, hadoop102 hosts the NameNode, hadoop103 the ResourceManager, and hadoop104 the SecondaryNameNode, while all three nodes run a DataNode and a NodeManager (this matches the configuration files below).

2. Configuration file description

Hadoop configuration files fall into two categories: default configuration files and custom configuration files. A custom configuration file only needs to be modified when the user wants to change one of the default values.

Default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml, packaged inside the corresponding Hadoop jars.

Custom configuration files:

The four files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml live under $HADOOP_HOME/etc/hadoop, and users modify them according to project requirements.

(1) Configure core-site.xml

<configuration>
    <!-- Set the address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Set the directory where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
</configuration>

(2) Configure hdfs-site.xml

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- Secondary NameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>

(3) Configure yarn-site.xml

<configuration>
    <!-- Site specific YARN configuration properties -->
    <!-- Have MapReduce use the shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Set the ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Environment variables to inherit -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

(4) Configure mapred-site.xml

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

(5) Configure the workers file

[root@hadoop102 hadoop]# vim workers 

hadoop102
hadoop103
hadoop104

(6) Distribute the configured Hadoop configuration files to all nodes of the cluster

[yhd@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc
==================== hadoop102 ====================
sending incremental file list

sent 910 bytes  received 19 bytes  1,858.00 bytes/sec
total size is 107,468  speedup is 115.68
==================== hadoop103 ====================
sending incremental file list
etc/
etc/hadoop/
etc/hadoop/workers

sent 993 bytes  received 54 bytes  2,094.00 bytes/sec
total size is 107,468  speedup is 102.64
==================== hadoop104 ====================
sending incremental file list
etc/
etc/hadoop/
etc/hadoop/workers

sent 993 bytes  received 54 bytes  698.00 bytes/sec
total size is 107,468  speedup is 102.64
[yhd@hadoop102 hadoop]$

(7) Verify on hadoop103 and hadoop104 that the configuration arrived
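
For example, a quick spot-check on hadoop103:

[yhd@hadoop103 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml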

(8) Start the cluster

1. If the cluster is being started for the first time, the NameNode must be formatted on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID; if the NameNode and the DataNodes end up with inconsistent cluster IDs, the cluster cannot find its previous data. If the cluster reports errors while running and the NameNode has to be reformatted, first stop the namenode and datanode processes and delete the data and logs directories on all machines, and only then format.)

[yhd@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
2021-03-20 19:56:01,526 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop102/192.168.10.102
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.1.3
STARTUP_MSG:   classpath = /opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/accessors-smart-1.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/animal-sniffer-annotations-1.17.j

2. Start HDFS (on hadoop102)

[yhd@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh 
Starting namenodes on [hadoop102]
Starting datanodes
hadoop103: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
hadoop104: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
Starting secondary namenodes [hadoop104]
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
12660 Jps
12266 NameNode

3. Start YARN (on hadoop103)

[yhd@hadoop103 hadoop]$ sbin/start-yarn.sh
[yhd@hadoop103 hadoop]$ jps
11762 Jps
9801 DataNode
9993 ResourceManager
10126 NodeManager

4. View the HDFS NameNode web UI in a browser at http://hadoop102:9870

5. View the YARN ResourceManager web UI at http://hadoop103:8088

(9) Cluster test

Upload files to the cluster
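
If the target directory does not exist yet, it can be created first, for example:

[yhd@hadoop102 ~]$ hadoop fs -mkdir /input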

[yhd@hadoop102 ~]$ hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
2021-03-20 19:58:39,234 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[yhd@hadoop102 ~]$ hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /
2021-03-20 19:59:22,656 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 19:59:24,974 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

6. Check where the uploaded files are stored on disk

[yhd@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-510334103-192.168.10.102-1616241362316/current/finalized/subdir0/subdir0
[yhd@hadoop102 subdir0]$ ll
total 192208
-rw-rw-r--. 1 yhd yhd        46 Mar 20 19:58 blk_1073741825
-rw-rw-r--. 1 yhd yhd        11 Mar 20 19:58 blk_1073741825_1001.meta
-rw-rw-r--. 1 yhd yhd 134217728 Mar 20 19:59 blk_1073741826
-rw-rw-r--. 1 yhd yhd   1048583 Mar 20 19:59 blk_1073741826_1002.meta
-rw-rw-r--. 1 yhd yhd  60795424 Mar 20 19:59 blk_1073741827
-rw-rw-r--. 1 yhd yhd    474975 Mar 20 19:59 blk_1073741827_1003.meta
-rw-rw-r--. 1 yhd yhd        38 Mar 20 20:04 blk_1073741834
-rw-rw-r--. 1 yhd yhd        11 Mar 20 20:04 blk_1073741834_1010.meta
-rw-rw-r--. 1 yhd yhd       439 Mar 20 20:04 blk_1073741835
-rw-rw-r--. 1 yhd yhd        11 Mar 20 20:04 blk_1073741835_1011.meta
-rw-rw-r--. 1 yhd yhd     25306 Mar 20 20:04 blk_1073741836
-rw-rw-r--. 1 yhd yhd       207 Mar 20 20:04 blk_1073741836_1012.meta
-rw-rw-r--. 1 yhd yhd    214462 Mar 20 20:04 blk_1073741837
-rw-rw-r--. 1 yhd yhd      1683 Mar 20 20:04 blk_1073741837_1013.meta
[yhd@hadoop102 subdir0]$ 

  7. View the contents of the files stored in HDFS on the disk

[yhd@hadoop102 subdir0]$ cat blk_1073741825
hadoop yarn
hadoop mapreduce
atguigu
atguigu

8. Run the wordcount example program

[yhd@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
2021-03-20 20:03:51,652 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.10.103:8032
2021-03-20 20:03:52,181 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yhd/.staging/job_1616241436786_0001
2021-03-20 20:03:52,272 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 20:03:52,407 INFO input.FileInputFormat: Total input files to process : 1
2021-03-20 20:03:52,435 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

9. Check how the job ran on the YARN web UI

10. Check whether the output data is in HDFS
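
For example, from the command line (the web UI at http://hadoop102:9870 can also be used):

[yhd@hadoop102 ~]$ hadoop fs -ls /output
[yhd@hadoop102 ~]$ hadoop fs -cat /output/*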

11. Configure the JobHistory server (it was not configured above, so clicking History on the YARN page fails to redirect)

Edit the configuration: [yhd@hadoop102 hadoop]$ vim mapred-site.xml

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop102:10020</value>
    </property>
    <!-- JobHistory server web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop102:19888</value>
    </property>
</configuration>

Distribute the configuration

[yhd@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/mapred-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 64 bytes  received 12 bytes  152.00 bytes/sec
total size is 1,170  speedup is 15.39
==================== hadoop103 ====================
sending incremental file list
mapred-site.xml

sent 585 bytes  received 47 bytes  1,264.00 bytes/sec
total size is 1,170  speedup is 1.85
==================== hadoop104 ====================
sending incremental file list
mapred-site.xml

sent 585 bytes  received 47 bytes  1,264.00 bytes/sec
total size is 1,170  speedup is 1.85
[yhd@hadoop102 hadoop]$ 

Start the history server on hadoop102

[yhd@hadoop102 hadoop-3.1.3]$ bin/mapred --daemon start historyserver
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
15666 JobHistoryServer
12266 NameNode
12733 NodeManager
15725 Jps

Submit a new job

[yhd@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output1
2021-03-20 22:33:14,973 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.10.103:8032
2021-03-20 22:33:16,139 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yhd/.staging/job_1616241436786_0002
2021-03-20 22:33:16,274 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 22:33:17,351 INFO input.FileInputFormat: Total input files to process : 1
2021-03-20 22:33:17,405 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 22:33:17,586 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-03-20 22:33:17,664 INFO mapreduce.JobSubmitter: number of splits:1

Check the HDFS web page

Test again whether the history server works

Clicking the logs link shows the following prompt; the log aggregation function needs to be configured first

Configure log aggregation

Log aggregation concept: after an application finishes, its run logs are uploaded to the HDFS system

Benefits of log aggregation: the details of a program run can be viewed easily, which is convenient for development and debugging

To enable log aggregation, the NodeManager, ResourceManager, and HistoryServer must be restarted

Edit the configuration: [yhd@hadoop102 hadoop]$ vim yarn-site.xml

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set the log server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Keep logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Distribute to the other two servers

[yhd@hadoop102 hadoop]$ xsync yarn-site.xml 
==================== hadoop102 ====================
sending incremental file list

sent 62 bytes  received 12 bytes  148.00 bytes/sec
total size is 1,621  speedup is 21.91
==================== hadoop103 ====================
sending incremental file list
yarn-site.xml

sent 1,034 bytes  received 47 bytes  720.67 bytes/sec
total size is 1,621  speedup is 1.50
==================== hadoop104 ====================
sending incremental file list
yarn-site.xml

sent 1,034 bytes  received 47 bytes  720.67 bytes/sec
total size is 1,621  speedup is 1.50

Stop the history server process on hadoop102

[yhd@hadoop102 hadoop-3.1.3]$ mapred --daemon stop historyserver
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
16617 Jps
12266 NameNode
12733 NodeManager

Stop the YARN processes on hadoop103

[yhd@hadoop103 hadoop-3.1.3]$ jps
12531 Jps
9801 DataNode
9993 ResourceManager
10126 NodeManager
[yhd@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh 
Stopping nodemanagers
hadoop102: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
hadoop104: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
Stopping resourcemanager

Start YARN on hadoop103

[yhd@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh 
Starting resourcemanager
Starting nodemanagers
[yhd@hadoop103 hadoop-3.1.3]$ jps
12913 ResourceManager
13046 NodeManager
9801 DataNode
13391 Jps

Start the history server on hadoop102

[yhd@hadoop102 hadoop-3.1.3]$ mapred --daemon start historyserver
[yhd@hadoop102 hadoop-3.1.3]$ jps
12433 DataNode
17029 Jps
16969 JobHistoryServer
12266 NameNode
16797 NodeManager

Run a new job, then check and verify as follows:

[yhd@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output2
2021-03-20 22:55:55,213 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.10.103:8032
2021-03-20 22:55:55,683 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yhd/.staging/job_1616251848467_0001
2021-03-20 22:55:55,764 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

First check on HDFS whether the job output was written successfully

Then look at the job on the YARN scheduling page

Click the logs link to view the aggregated logs
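
The aggregated logs can also be fetched from the command line; for example, for the job above (the application ID corresponds to the job ID shown in the submission output):

[yhd@hadoop102 hadoop-3.1.3]$ yarn logs -applicationId application_1616251848467_0001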


Source: https://blog.csdn.net/yanghuadong_1992/article/details/115034152