Big Data Learning Summary (2021 Edition): Hadoop (Getting Started)


Chapter 1. Overview of Hadoop

  • Hadoop is a distributed system infrastructure
  • It mainly solves two problems: massive data storage and massive data analysis/computation
  • Big data characteristics (4 V): high Volume (a lot of data), high Velocity, great Variety, low Value density
  • Hadoop advantages (4 highs): high reliability, high scalability, high efficiency, high fault tolerance

1.1 Big data department business processes and organizational structure (focus)

(Figures: business process flow and department organizational structure)
Platform group : mainly collects data and keeps each framework running stably; technology-oriented

Data warehouse group (high demand) :

  • Data cleaning: often done by interns
  • Hive engineer: data analysis and data warehouse modeling; business-oriented

Data mining group (good prospects) : try to develop in this direction

1.2 Hadoop composition (interview focus)

Differences between Hadoop 1.x, 2.x, and 3.x
(Figure: differences between Hadoop 1.x, 2.x, and 3.x)

1.2.1 HDFS Architecture: Distributed File System

  • NameNode (nn): stores file metadata: file names, directory structure, file attributes, and each file's block list plus the DataNodes holding those blocks
  • DataNode (dn): stores the actual data blocks, along with checksums of the block data
  • SecondaryNameNode (2nn): periodically merges the NameNode's fsimage and edits log; it is not a hot standby for the NameNode

(Figure: HDFS architecture)

1.2.2 YARN: Resource Manager for Hadoop

  • ResourceManager (RM): the boss of the entire cluster resources (memory, CPU, etc.)
  • NodeManager (NM): single node server resource boss (so each node must be deployed)
  • ApplicationMaster (AM): the boss of a single task
  • Container : works like an independent server; it encapsulates the resources a task needs to run, such as memory, CPU, disk, and network
    (Figure: YARN architecture)

1.2.3 MapReduce architecture: Map (parallel processing data) and Reduce (summarization of data results)

MapReduce divides the calculation process into two stages: Map and Reduce

  • Parallel processing of input data in the Map stage
  • Summarize the Map results in the Reduce phase
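
As a concrete check once the cluster from Chapter 3 is running, you can submit the WordCount example that ships with Hadoop 3.1.3 (the /input directory must already exist on HDFS and /output must not exist yet; paths are illustrative):

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

The Map stage tokenizes the input files in parallel; the Reduce stage sums the counts per word and writes the result to /output.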

1.3 Big data technology ecosystem

(Figure: big data technology ecosystem)
The technical terms involved in the figure will be studied separately later

1.4 Recommendation system framework diagram

(Figure: recommendation system framework)

1.5 The relationship between HDFS, YARN, and MapReduce

(Figure: the relationship between HDFS, YARN, and MapReduce)

Chapter 2 Setting Up the Hadoop Environment (Development Focus)

2.1 Virtual machine environment preparation

1. Clone a virtual machine

2. Modify the static IP of the cloned virtual machine

3. Modify the host name

4. Turn off the firewall

  • Stop now: systemctl stop firewalld
  • Disable at boot: systemctl disable firewalld.service

5. Create atguigu user

6. Give the atguigu user root privileges, so that sudo can be used later to run commands that need root

[root@hadoop100 ~]# vim /etc/sudoers

Modify the /etc/sudoers file: add the following line below the %wheel line. It must go below it because sudoers applies the last matching rule; placed above, the %wheel entry would override NOPASSWD:

atguigu ALL=(ALL) NOPASSWD:ALL

7. Create the module and software folders in the /opt directory, then change their owner and group to the atguigu user
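
First create the two directories (the chown commands below then hand them to atguigu):

[root@hadoop100 ~]# mkdir /opt/module /opt/software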

[root@hadoop100 ~]# chown atguigu:atguigu /opt/module 
[root@hadoop100 ~]# chown atguigu:atguigu /opt/software

8. Uninstall the JDK that comes with the virtual machine

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps

(rpm -qa lists all installed packages, grep -i java keeps the Java ones, and xargs -n1 rpm -e --nodeps removes them one at a time without checking dependencies)

9. Restart the virtual machine

2.2 Clone a virtual machine (take hadoop102 as an example below)

1. Use the template machine hadoop100 to clone three virtual machines: hadoop102 hadoop103 hadoop104

2. Modify the static IP of the cloned virtual machine

[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

Note: make sure the IP settings in the Linux ifcfg-ens33 file, the VMware Virtual Network Editor, and the Windows VMnet8 adapter are all on the same subnet
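
A minimal sketch of the static-IP lines in ifcfg-ens33 (the gateway and DNS values assume the common VMware NAT default of x.x.x.2 on this subnet; adjust to your Virtual Network Editor settings):

BOOTPROTO=static
IPADDR=192.168.10.102
# assumption: VMware NAT gateway and DNS
GATEWAY=192.168.10.2
DNS1=192.168.10.2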

3. Modify the host name of the clone machine

[root@hadoop100 ~]# vim /etc/hostname
hadoop102

4. Configure the Linux clone host name mapping hosts file, open /etc/hosts

[root@hadoop100 ~]# vim /etc/hosts
Add the following content:
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104

5. Restart the clone machine hadoop102
6. Modify the Windows hosts file (C:\Windows\System32\drivers\etc\hosts) to add the same mappings

2.3 Install JDK in hadoop102

1. Uninstall the existing JDK
Note: Before installing the JDK, be sure to delete the JDK that comes with the virtual machine in advance.

2. Import the JDK into the software folder under the opt directory

  • Option 1: use the Xftp transfer tool.
  • Option 2: in SecureCRT or Xshell, cd to the directory where the JDK should go, press "alt+p" to enter sftp mode, then drag the jdk1.8 archive in.

3. Unzip the JDK to the /opt/module directory

[atguigu@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

4. Configure JDK environment variables
Create a new /etc/profile.d/my_env.sh file:

[atguigu@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh
Add the following content:
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin

Note: source the /etc/profile file to make the new environment variable PATH take effect

[atguigu@hadoop102 ~]$ source /etc/profile

5. Test whether the JDK is installed successfully, and then restart
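
A quick check; jdk1.8.0_212 should report a version string like this:

[atguigu@hadoop102 ~]$ java -version
java version "1.8.0_212"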

2.4 Install Hadoop in hadoop102

The steps are basically the same as for the JDK (an install sketch follows the variable block below). When configuring environment variables, append the following to the end of the my_env.sh file:

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
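
A minimal sketch of the install itself, mirroring the JDK steps (the archive name assumes the standard hadoop-3.1.3.tar.gz distribution):

[atguigu@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/
[atguigu@hadoop102 ~]$ source /etc/profile
[atguigu@hadoop102 ~]$ hadoop version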

2.5 Hadoop directory structure

[atguigu@hadoop102 hadoop-3.1.3]$ ll
total 52
drwxr-xr-x. 2 atguigu atguigu  4096 May 22 2017 bin
drwxr-xr-x. 3 atguigu atguigu  4096 May 22 2017 etc
drwxr-xr-x. 2 atguigu atguigu  4096 May 22 2017 include
drwxr-xr-x. 3 atguigu atguigu  4096 May 22 2017 lib
drwxr-xr-x. 2 atguigu atguigu  4096 May 22 2017 libexec
-rw-r--r--. 1 atguigu atguigu 15429 May 22 2017 LICENSE.txt
-rw-r--r--. 1 atguigu atguigu   101 May 22 2017 NOTICE.txt
-rw-r--r--. 1 atguigu atguigu  1366 May 22 2017 README.txt
drwxr-xr-x. 2 atguigu atguigu  4096 May 22 2017 sbin
drwxr-xr-x. 4 atguigu atguigu  4096 May 22 2017 share

Important directories:

  • bin: Hadoop operating commands (hadoop, hdfs, yarn, mapred)
  • etc: Hadoop configuration files
  • lib: Hadoop local (native) libraries
  • sbin: scripts that start, stop, and manage the cluster
  • share: Hadoop dependency jars, documentation, and official examples

Chapter 3 Hadoop Operation Mode (Fully Distributed Mode: Development Focus)

Analysis:
1) Prepare 3 machines (firewall off, static IP, host names set)
2) Install the JDK
3) Configure environment variables
4) Install Hadoop
5) Configure environment variables
6) Configure the cluster
7) Single-point start
8) Configure ssh
9) Start the whole cluster and test it

3.1 Write the cluster distribution script xsync

1. Create an xsync file in the /home/atguigu/bin directory

[atguigu@hadoop102 opt]$ cd /home/atguigu
[atguigu@hadoop102 ~]$ mkdir bin
[atguigu@hadoop102 ~]$ cd bin
[atguigu@hadoop102 bin]$ vim xsync
#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
	echo Not Enough Arguments!
	exit;
fi

#2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
	echo ==================== $host ====================
	#3. Loop over all files/directories and send them one by one
	for file in "$@"
	do
		#4. Check that the file exists
		if [ -e $file ]
		then
			#5. Get the parent directory (cd -P resolves symlinks)
			pdir=$(cd -P $(dirname $file); pwd)

			#6. Get the file name
			fname=$(basename $file)
			ssh $host "mkdir -p $pdir"
			rsync -av $pdir/$fname $host:$pdir
		else
			echo "$file does not exist!"
		fi
	done
done

2. Modify the script xsync to have execution permissions

[atguigu@hadoop102 bin]$ chmod +x xsync

3. Copy the script to /bin so that it can be called globally

[atguigu@hadoop102 bin]$ sudo cp xsync /bin/
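
Usage sketch: distribute a path to hadoop103 and hadoop104 (any file or directory works; this example syncs the bin directory that holds the script itself):

[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin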

3.2 SSH passwordless login configuration (once configured, you can log in to another node directly with "ssh hostname")

1. Principle of password-free login
(Figure: principle of passwordless login)

2. Generate the public and private keys (see the command sketch below)

(Figures: the files under ~/.ssh and their functions, e.g. id_rsa, id_rsa.pub, authorized_keys, known_hosts)
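
A minimal sketch, assuming the default RSA key type (press Enter three times to accept the default file location and empty passphrase):

[atguigu@hadoop102 .ssh]$ ssh-keygen -t rsa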
3. Copy the public key to the target machine to log in without secret

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104

Note: every node must do this! Keys are also per-user: if another user needs passwordless login, configure it the same way for that user

3.3 Cluster configuration

3.3.1 Cluster deployment planning

(Figure: cluster deployment plan)

Per the start commands in 3.3.3, NameNode runs on hadoop102 and ResourceManager on hadoop103; conventionally SecondaryNameNode is placed on hadoop104, so the three memory-hungry master daemons sit on different nodes, while every node also runs a DataNode and a NodeManager.

3.3.2 Configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml)

  • Hadoop configuration files come in two categories: default configuration files and custom configuration files. Only when you want to override a default value do you need to set the corresponding property in a custom configuration file.
  • Default configuration files : shipped inside the Hadoop jar packages
  • Custom configuration files : core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml
  • The four custom files live under $HADOOP_HOME/etc/hadoop and can be modified again as project requirements change

Distribute the configured Hadoop configuration files to the whole cluster (the xsync script from 3.1 works here); a core-site.xml sketch follows.
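
A minimal core-site.xml sketch: fs.defaultFS points at the NameNode on hadoop102 (port 8020 is an assumed, commonly used NameNode RPC port in Hadoop 3.x), and hadoop.tmp.dir matches the data path that shows up later in 3.4:

<configuration>
    <!-- NameNode address; the 8020 port is an assumption -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Base directory for Hadoop data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
</configuration>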

3.3.3 Group start configuration (start all nodes at once)

1. Configure workers

[atguigu@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following content (no blank lines and no trailing spaces; the start scripts read each line as a hostname):
hadoop102
hadoop103
hadoop104

Remember to synchronize this file to all nodes (xsync works here)!

2. Start the cluster

  • If the cluster is being started for the first time, the NameNode must be formatted on the hadoop102 node (verification sketch after this list)
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
  • Start HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
  • Configured ResourceManager node (hadoop103) starts YARN
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
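
To verify, run jps on each node. A sketch of the expected daemons, assuming the conventional deployment plan from 3.3.1 (SecondaryNameNode on hadoop104 is the assumed placement):

[atguigu@hadoop102 ~]$ jps   # expect NameNode, DataNode, NodeManager
[atguigu@hadoop103 ~]$ jps   # expect ResourceManager, DataNode, NodeManager
[atguigu@hadoop104 ~]$ jps   # expect SecondaryNameNode, DataNode, NodeManager

The NameNode web UI is served at http://hadoop102:9870 and the YARN UI at http://hadoop103:8088 (Hadoop 3.x default ports).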

(Figures and practical screenshots: cluster startup and web UI views)

3.4 Basic cluster test

Create a directory on HDFS, then upload a file into it (command sketch below).

(Figures: the directory and uploaded file as seen in the NameNode web UI)
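
A minimal sketch of the two steps; the directory and file names are illustrative:

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /input
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put wcinput/word.txt /input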

View the HDFS file storage path:

[atguigu@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-962968270-192.168.10.102-1616034469344/current/finalized/subdir0/subdir0

(Figure: the block files as stored on the local disk)

3.5 Configure History Server

1. Configure mapred-site.xml (see the sketch after this list)

2. Distribute the configuration (xsync)

3. Start the history server on hadoop102

4. Check whether the history server has started (e.g. with jps)

(Figures: history server configuration and web UI)
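
A minimal sketch of steps 1 and 3. mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address are the standard Hadoop history-server properties; putting the server on hadoop102 with the default ports 10020/19888 is the assumed layout here:

<!-- mapred-site.xml -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>

[atguigu@hadoop102 hadoop-3.1.3]$ mapred --daemon start historyserver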

3.6 Configure log aggregation

Note: To enable the log aggregation function, you need to restart NodeManager, ResourceManager and HistoryServer.

1. Configure yarn-site.xml (see the sketch after this list)

2. Distribute the configuration (xsync)

3. Stop NodeManager, ResourceManager and HistoryServer

4. Start NodeManager, ResourceManager and HistoryServer

5. View the logs
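
A minimal sketch of step 1, using the standard YARN log-aggregation properties. The log-server URL assumes the history server from 3.5 at hadoop102:19888, and the 7-day retention (604800 seconds) is an illustrative choice:

<!-- yarn-site.xml -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>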

Chapter 4 Summary of Cluster Start/Stop Method

4.1 Start/stop of each module separately (configuration of ssh is a prerequisite)

(1) Start/stop HDFS as a whole

start-dfs.sh/stop-dfs.sh

(2) Start/stop YARN as a whole

start-yarn.sh/stop-yarn.sh

4.2 Each service component starts/stops one by one

(1) Start/stop HDFS components separately

hdfs --daemon start/stop namenode/datanode/secondarynamenode

(2) Start/stop YARN

yarn --daemon start/stop resourcemanager/nodemanager

Chapter 5 Writing Common Scripts for Hadoop Clusters

5.1 Hadoop cluster start/stop script (covering HDFS, YARN, and the history server): myhadoop.sh

(Figures: the myhadoop.sh script and a sample run)
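
A minimal sketch of such a script, assuming the layout used throughout this chapter (HDFS started from hadoop102, YARN from hadoop103, history server on hadoop102):

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit;
fi

case $1 in
"start")
    echo " =================== starting hadoop cluster ==================="
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== stopping hadoop cluster ==================="
    # stop in reverse order
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Like xsync, give it execute permission (chmod +x myhadoop.sh) and keep it in /home/atguigu/bin.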

5.2 Script to view the Java processes on all three servers: jpsall

(Figure: the jpsall script and a sample run)
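
A minimal sketch: ssh to each node and run jps (relies on the passwordless ssh configured in 3.2):

#!/bin/bash
for host in hadoop102 hadoop103 hadoop104
do
    echo =============== $host ===============
    ssh $host jps
done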

5.3 Cluster time synchronization

  • If the servers can reach the public internet, cluster time synchronization is not required, because each server is periodically calibrated against public time servers;
  • If the servers are on an isolated internal network, cluster time synchronization must be configured; otherwise the clocks will drift apart over time and the nodes' notion of time will fall out of sync.

Origin blog.csdn.net/m0_51755061/article/details/115000237