Catalog
- Chapter 1. Overview of Hadoop
- Chapter 2 Hadoop Operating Environment Construction (Development Focus)
- Chapter 3 Hadoop Operation Mode (Fully Distributed Mode: Development Focus)
- Chapter 4 Summary of Cluster Start/Stop Method
- Chapter 5 Writing Common Scripts for Hadoop Clusters
Chapter 1. Overview of Hadoop
- Hadoop is a distributed system infrastructure
- It mainly solves two problems: massive data storage and massive data analysis/computation
- Big data characteristics (4V): large volume, high velocity, variety, low value density
- Hadoop advantages (the "4 highs"): high reliability, high scalability, high efficiency, high fault tolerance
1.1 Big data department business process analysis, department organizational structure (focus)
Platform group: mainly collects data and keeps each framework running stably; technology-oriented
Data warehouse group (high demand):
- Data cleaning: often done by interns
- Hive engineer: data analysis and data warehouse modeling; business-oriented
Data mining group (good prospects): try to develop in this direction
1.2 Hadoop composition (interview focus)
Differences between Hadoop 1.x, 2.x, and 3.x
1.2.1 HDFS Architecture: Distributed File System
NameNode (nn), DataNode (dn), SecondaryNameNode (2nn)
1.2.2 YARN architecture: Yet Another Resource Negotiator, Hadoop's resource manager and scheduler
- ResourceManager (RM): the boss of the entire cluster resources (memory, CPU, etc.)
- NodeManager (NM): single node server resource boss (so each node must be deployed)
- ApplicationMaster (AM): the boss of a single task
- Container: a container, roughly an independent server, which encapsulates the resources a task needs to run, such as memory, CPU, disk, and network
1.2.3 MapReduce architecture: Map (parallel processing data) and Reduce (summarization of data results)
MapReduce divides the calculation process into two stages: Map and Reduce
- Parallel processing of input data in the Map stage
- Summarize the Map results in the Reduce phase
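The two stages can be illustrated with an ordinary shell pipeline (a word-count analogy, not actual Hadoop code): `tr` plays the role of Map by splitting lines into words, `sort` plays the role of the shuffle by grouping identical words, and `uniq -c` plays the role of Reduce by counting each group.

```shell
# Map: split each line into one word per line
# Shuffle: group identical words together with sort
# Reduce: count each group with uniq -c
printf 'hello world\nhello hadoop\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

"hello" appears twice and therefore sorts first; the order of the single-count ties may vary.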
1.3 Big data technology ecosystem
The technical terms involved in the figure will be studied separately later
1.4 Recommended system framework diagram
1.5 The relationship between HDFS, YARN, and MapReduce
Chapter 2 Hadoop Operating Environment Construction (Development Focus)
2.1 Virtual machine environment preparation
1. Clone a virtual machine
2. Modify the static IP of the cloned virtual machine
3. Modify the host name
4. Turn off the firewall
- Stop now: systemctl stop firewalld
- Disable at boot: systemctl disable firewalld.service
5. Create atguigu user
6. Give the atguigu user root privileges, so that it can later run privileged commands with sudo
[root@hadoop100 ~]# vim /etc/sudoers
Modify the /etc/sudoers file: below the %wheel line, add the following line:
atguigu ALL=(ALL) NOPASSWD:ALL
7. Create the module and software folders in the /opt directory, then change their owner and group to the atguigu user
[root@hadoop100 ~]# chown atguigu:atguigu /opt/module
[root@hadoop100 ~]# chown atguigu:atguigu /opt/software
8. Uninstall the JDK that comes with the virtual machine
[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
9. Restart the virtual machine
2.2 Clone a virtual machine (take hadoop102 as an example below)
1. Use the template machine hadoop100 to clone three virtual machines: hadoop102 hadoop103 hadoop104
2. Modify the static IP of the cloned virtual machine
[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
Note: make sure the IP address in the Linux ifcfg-ens33 file, the address in the VMware Virtual Network Editor, and the VMnet8 adapter IP on Windows are all on the same subnet
3. Modify the host name of the clone machine
[root@hadoop100 ~]# vim /etc/hostname
hadoop102
4. Configure the Linux clone host name mapping hosts file, open /etc/hosts
[root@hadoop100 ~]# vim /etc/hosts
Add the following content:
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
5. Restart the clone machine hadoop102
6. Modify the Windows host mapping file (hosts file)
2.3 Install JDK in hadoop102
1. Uninstall the existing JDK
Note: Before installing the JDK, be sure to delete the JDK that comes with the virtual machine in advance.
2. Import the JDK into the software folder under the opt directory
- Use Xftp transfer tool
- SecureCRT or Xshell: cd to the target directory, press alt+p to enter sftp mode, then select jdk1.8 and drag it in
3. Unzip the JDK to the /opt/module directory
[atguigu@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
4.Configure JDK environment variables
Create a new /etc/profile.d/my_env.sh file
[atguigu@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh
Add the following content:
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
Note: source the /etc/profile file to make the new environment variable PATH take effect
[atguigu@hadoop102 ~]$ source /etc/profile
5. Test whether the JDK is installed successfully, and then restart
2.4 Install Hadoop in hadoop102
Steps: basically the same as installing the JDK; when configuring environment variables, append the following to the end of my_env.sh:
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
2.5 Hadoop directory structure
[atguigu@hadoop102 hadoop-3.1.3]$ ll
total 52
drwxr-xr-x. 2 atguigu atguigu 4096 May 22 2017 bin
drwxr-xr-x. 3 atguigu atguigu 4096 May 22 2017 etc
drwxr-xr-x. 2 atguigu atguigu 4096 May 22 2017 include
drwxr-xr-x. 3 atguigu atguigu 4096 May 22 2017 lib
drwxr-xr-x. 2 atguigu atguigu 4096 May 22 2017 libexec
-rw-r--r--. 1 atguigu atguigu 15429 May 22 2017 LICENSE.txt
-rw-r--r--. 1 atguigu atguigu 101 May 22 2017 NOTICE.txt
-rw-r--r--. 1 atguigu atguigu 1366 May 22 2017 README.txt
drwxr-xr-x. 2 atguigu atguigu 4096 May 22 2017 sbin
drwxr-xr-x. 4 atguigu atguigu 4096 May 22 2017 share
Important directories
Chapter 3 Hadoop Operation Mode (Fully Distributed Mode: Development Focus)
Analysis:
1) Prepare 3 clients (close firewall, static IP, host name)
2) Install JDK
3) Configure environment variables
4) Install Hadoop
5) Configure environment variables
6) Configure cluster
7) Single point start
8) Configure ssh
9) Assemble and test the cluster
3.1 Write the cluster distribution script xsync
1. Create an xsync file in the /home/atguigu/bin directory
[atguigu@hadoop102 opt]$ cd /home/atguigu
[atguigu@hadoop102 ~]$ mkdir bin
[atguigu@hadoop102 ~]$ cd bin
[atguigu@hadoop102 bin]$ vim xsync
#!/bin/bash
# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi
# 2. Iterate over all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    # 3. Iterate over all files/directories and send them one by one
    for file in "$@"
    do
        # 4. Check that the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory (resolving symlinks)
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
2. Modify the script xsync to have execution permissions
[atguigu@hadoop102 bin]$ chmod +x xsync
3. Copy the script to /bin so that it can be called globally
[atguigu@hadoop102 bin]$ sudo cp xsync /bin/
3.2 SSH passwordless login configuration (once configured, you can run "ssh hostname" directly without entering a password)
1. Principle of password-free login
2. Generate the public/private key pair (the key files live under ~/.ssh, alongside files such as known_hosts and authorized_keys)
3. Copy the public key to the target machine to log in without secret
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104
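A minimal key-generation sketch (the /tmp/demo_id_rsa path is only for illustration; in practice accept the default ~/.ssh/id_rsa and then run ssh-copy-id against each node as above):

```shell
# Generate an RSA key pair non-interactively (demo path, illustration only)
rm -f /tmp/demo_id_rsa /tmp/demo_id_rsa.pub
ssh-keygen -t rsa -N '' -f /tmp/demo_id_rsa -q
# The real flow then copies the public key to every node, e.g.:
# for host in hadoop102 hadoop103 hadoop104; do ssh-copy-id $host; done
ls -l /tmp/demo_id_rsa /tmp/demo_id_rsa.pub
```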
Note: every node must do this! Keys are per user, so if other users also need passwordless login, configure them the same way.
3.3 cluster configuration
3.3.1 Cluster deployment planning
3.3.2 Configuration file description (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml)
- Hadoop configuration files come in two kinds: default configuration files and custom configuration files. A user modifies a custom configuration file only to override a default value.
- Default configuration files: shipped inside the Hadoop jar packages
- Custom configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml
- These four files live under $HADOOP_HOME/etc/hadoop; users can modify them again as project requirements change
Distribute the configured Hadoop configuration file on the cluster
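As an illustration, a minimal core-site.xml for this cluster might look like the following (the NameNode address hadoop102:8020 and the data directory are assumptions consistent with these notes, not values taken from the original):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Address of the NameNode (HDFS entry point) -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Base directory for Hadoop's data files -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
</configuration>
```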
3.3.3 Starting the cluster as a group (start multiple nodes at once)
1. Configure workers
[atguigu@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers
Add the following content to the file:
hadoop102
hadoop103
hadoop104
Remember to synchronize this file to all nodes!
2. Start the cluster
- When the cluster is started for the first time, the NameNode must be formatted on the hadoop102 node
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
- Start HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
- Start YARN on the node where ResourceManager is configured (hadoop103)
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
Practical screenshots
3.4 Basic cluster test
Create a directory on HDFS:
Upload files:
View the HDFS file storage path:
[atguigu@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-962968270-192.168.10.102-1616034469344/current/finalized/subdir0/subdir0
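The blank commands above can be sketched as follows (the /input directory and word.txt file are hypothetical examples for illustration, not from the original notes):

```shell
hadoop fs -mkdir /input              # create a directory on HDFS
hadoop fs -put ./word.txt /input     # upload a local file
hadoop fs -cat /input/word.txt       # read it back to verify
```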
3.5 Configure History Server
1. Configure mapred-site.xml
2. Distribution configuration
3. Start the history server in hadoop102
4. Check whether the history server is started
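The mapred-site.xml entries for step 1 are typically the two history-server addresses below (hadoop102 as the host is an assumption consistent with these notes):

```xml
<!-- History server RPC address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>
<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>
```

The server is then started on hadoop102 with `mapred --daemon start historyserver` and checked with `jps`.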
3.6 Configure log aggregation
Note: To enable the log aggregation function, you need to restart NodeManager, ResourceManager and HistoryServer.
1. Configure yarn-site.xml
2. Distribution configuration
3. Close NodeManager, ResourceManager and HistoryServer
4. Start NodeManager, ResourceManager and HistoryServer
5. View log
- History server address: http://hadoop102:19888/jobhistory
- Historical task list
- View task running log
- Run log details
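The yarn-site.xml entries in step 1 above typically enable aggregation and point YARN at the history server (the hadoop102 host and the 7-day retention value are assumptions consistent with these notes):

```xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Point YARN at the history server's log page -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Keep aggregated logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
```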
Chapter 4 Summary of Cluster Start/Stop Method
4.1 Start/stop of each module separately (configuration of ssh is a prerequisite)
(1) Start/stop HDFS as a whole
start-dfs.sh/stop-dfs.sh
(2) Start/stop YARN as a whole
start-yarn.sh/stop-yarn.sh
4.2 Each service component starts/stops one by one
(1) Start/stop HDFS components separately
hdfs --daemon start/stop namenode/datanode/secondarynamenode
(2) Start/stop YARN
yarn --daemon start/stop resourcemanager/nodemanager
Chapter 5 Writing Common Scripts for Hadoop Clusters
5.1 Hadoop cluster startup and shutdown script (including HDFS, Yarn, Historyserver): myhadoop.sh
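A minimal sketch of such a script, assuming the install path and node roles used in these notes (HDFS started from hadoop102, YARN from hadoop103, history server on hadoop102):

```shell
# Create ~/bin/myhadoop.sh (a sketch; adjust paths/hosts to your cluster)
mkdir -p ~/bin
cat > ~/bin/myhadoop.sh <<'EOF'
#!/bin/bash
if [ $# -lt 1 ]; then
    echo "No Args Input..."
    exit 1
fi
case $1 in
"start")
    echo " =================== starting hadoop cluster ==================="
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo " =================== stopping hadoop cluster ==================="
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
    ;;
*)
    echo "Input Args Error..."
    ;;
esac
EOF
chmod +x ~/bin/myhadoop.sh
```

Usage: `myhadoop.sh start` or `myhadoop.sh stop`. Note the stop order mirrors the start order in reverse.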
5.2 View the Java process scripts of three servers: jpsall
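A sketch of jpsall, assuming passwordless ssh (section 3.2) is already configured for the three hosts used in these notes:

```shell
# Create ~/bin/jpsall: run jps on every node and label the output
mkdir -p ~/bin
cat > ~/bin/jpsall <<'EOF'
#!/bin/bash
for host in hadoop102 hadoop103 hadoop104
do
    echo "=============== $host ==============="
    ssh $host jps
done
EOF
chmod +x ~/bin/jpsall
```

Distribute both scripts with `xsync /home/atguigu/bin` so they work from any node.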
5.3 Cluster time synchronization
- If the servers can reach the public internet, cluster time synchronization is not required, because each server periodically calibrates itself against public time servers.
- If the servers are on an isolated internal network, cluster time synchronization must be configured; otherwise the clocks drift apart over time and the cluster's task execution gets out of sync.
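One common setup on CentOS 7 (a sketch; it assumes hadoop102 runs ntpd as the internal time source, which is not stated in the original notes) is a cron job on the other nodes:

```shell
# On hadoop103/hadoop104: sync against hadoop102 once a minute
# (add this line via `crontab -e`)
*/1 * * * * /usr/sbin/ntpdate hadoop102
```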