Getting Started with Hadoop 01---Basic Concepts and Deployment Tutorial


Reference for this article: Hadoop3.x Tutorial


What is Hadoop

Hadoop is an open-source software framework implemented in Java under the Apache Software Foundation. It is a software platform for developing and running large-scale data processing, and it allows distributed processing of large datasets across clusters of many computers using a simple programming model.

In a narrow sense, Hadoop refers to Apache Hadoop, an open source framework whose core components are:

  • HDFS (Hadoop Distributed File System): solves massive data storage
  • YARN (a framework for job scheduling and cluster resource management): solves resource and task scheduling
  • MapReduce (a distributed computing programming framework): solves massive data computation

In a broad sense, Hadoop usually refers to a broader concept - the Hadoop ecosystem.

Hadoop has now grown into a huge system. As the ecosystem grows, more and more new projects have emerged, some of which are not managed by Apache. These projects are good supplements to Hadoop or higher-level abstractions over it. For example:

Framework      Purpose
HDFS           Distributed file system
MapReduce      Distributed computing programming framework
ZooKeeper      Basic component for distributed coordination services
Hive           Hadoop-based distributed data warehouse, providing SQL-style data query operations
Flume          Log data collection framework
Oozie          Workflow scheduling framework
Sqoop          Data import/export tool (e.g. between MySQL and HDFS)
Impala         Real-time SQL query analysis based on Hive
Mahout         Machine learning algorithm library built on distributed computing frameworks such as MapReduce/Spark/Flink

History of Hadoop

Hadoop was created by Doug Cutting, the founder of Apache Lucene. It originated from Nutch, a sub-project of Lucene whose design goal was to build a large-scale search engine for the whole web, including web crawling, indexing, and querying. However, as the number of crawled web pages grew, Nutch ran into serious scalability problems: how to store and index billions of web pages.

In 2003, Google published a paper providing a feasible solution to this problem. The paper describes Google's distributed file system, the Google File System (GFS), which solves the storage needs of the large files generated during web crawling and indexing.

In 2004, Google published a paper introducing the Google version of the MapReduce system to the world.

Based on Google's papers, the Nutch developers then completed corresponding open source implementations of HDFS and MapReduce and split them off from Nutch into an independent project, Hadoop. In January 2008, Hadoop became a top-level Apache project and entered a period of rapid development.

In 2006, Google published a paper on BigTable, which prompted the development of HBase.

Therefore, the development of Hadoop and its ecosystem is inseparable from the contribution of Google.


What are the characteristics of Hadoop

  • Scalable: Hadoop distributes data and computing tasks across the available computer clusters, and these clusters can easily be expanded to thousands of nodes.
  • Economical: Hadoop distributes and processes data on clusters of ordinary, inexpensive machines, so the cost is very low.
  • Efficient: Hadoop can move data dynamically between nodes and process it in parallel, which makes processing very fast.
  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks after failures, so its ability to store and process data can be trusted.

Hadoop version

Hadoop historical version:

• 1.x series: the second-generation open source Hadoop line, which mainly fixed bugs in the 0.x versions; this line is now obsolete

• 2.x series: the architecture changed significantly, introducing many new features such as the YARN platform; it is the mainstream version line currently in use.

• 3.x series: HDFS, MapReduce, and YARN were substantially upgraded, and Ozone key-value storage was added.

Hadoop distribution company:

  • Hadoop distributions are divided into the open source community edition and commercial editions. The community edition is the version maintained by the Apache Software Foundation, i.e. the officially maintained release line.

  • Commercial Hadoop distributions are released by third-party companies on the basis of the community edition, with modifications, integration, and compatibility testing of the various service components. The better-known ones include Cloudera's CDH, MapR, and Hortonworks' HDP.

Community Edition:

Free open source version Apache: http://hadoop.apache.org/

  • Advantages: with open source contributors all over the world, code updates and version iterations are faster

  • Disadvantages: version upgrades, version maintenance, version compatibility, and version patches may be less polished

The download address of all Apache software (including various historical versions): http://archive.apache.org/dist/

Free open source version HortonWorks:

  • Hortonworks was founded by a vice president of Yahoo who led Hadoop development there, together with more than 20 core members. Its core products, HDP and HDF (managed through Ambari), are free and open source and provide a complete web management interface for managing cluster status (the web management software is Ambari, http://ambari.apache.org/ ). In 2018, Cloudera and Hortonworks, two giants in the big data field, announced a merger of equals; Cloudera acquired Hortonworks in an all-stock deal, with Cloudera shareholders ultimately holding 60% of the combined company.

Paid version:

  • Software paid version Cloudera: https://www.cloudera.com/

  • Cloudera is a US big data company that builds on the Apache open source Hadoop release. Through its own internal patches, it achieves stable operation across versions and provides matched versions of each piece of software in the big data ecosystem, solving problems such as difficult version upgrades and version incompatibility.


Hadoop Architecture

Hadoop 1.0 :

  • HDFS (Distributed File Storage)
  • MapReduce (resource management and distributed data processing)

Hadoop 2.0 :

  • HDFS (Distributed File Storage)
  • MapReduce (distributed data processing)
  • YARN (cluster resource management, task scheduling)

Since Hadoop 2.0 was developed on JDK 1.7, and JDK 1.7 stopped being updated in April 2015, the Hadoop community was forced to release a new Hadoop version based on JDK 1.8, namely Hadoop 3.0.

Hadoop 3.0 introduces some important features and optimizations, including HDFS erasure coding, multi-NameNode support, MR native task optimization, YARN cgroup-based memory and disk I/O isolation, and YARN container resizing.

According to the Apache Hadoop project team, Hadoop 3.x will continue to adjust its architecture in the future and use MapReduce based on memory, I/O, and disk together to process data.

The biggest change is in HDFS: HDFS computes against the nearest block. Following this nearest-computation principle, the local block is loaded into memory and computed first, then combined through I/O and a shared in-memory computation area to quickly produce the result, which is claimed to be up to 10 times faster than Spark.


New features of Hadoop 3.0

At the beginner stage, a general understanding of these is enough.

In terms of functions and performance, Hadoop 3.0 has made many major improvements to the hadoop kernel, mainly including:

Versatility:

  1. Simplify the Hadoop kernel, including removing outdated APIs and implementations, and replacing default component implementations with the most efficient implementations.
  2. Classpath isolation: to prevent conflicts between different versions of jar packages
  3. Shell script refactoring: Hadoop 3.0 refactored Hadoop management scripts, fixed a lot of bugs, and added new features.

HDFS:

Hdfs in Hadoop3.x has made great improvements in reliability and support capabilities:

1. HDFS supports data erasure coding, which allows HDFS to save half of the storage space without reducing reliability.

2. Multiple NameNode support, that is, support for one active and multiple standby namenode deployments in a cluster.

Note: The multi-ResourceManager feature is already supported in hadoop 2.0.

HDFS erasure code:

  • In Hadoop 3.x, HDFS implements a new feature, Erasure Coding (EC). Erasure coding is a data protection technology that was first used for data recovery in data transmission in the communication industry; it is a fault-tolerant coding technique.
  • By adding new parity data to the original data, it makes the different parts of the data related to each other; within a certain range of data errors, the data can be recovered through the erasure code.
  • Before Hadoop 3.0, HDFS stored 3 replicas of each block, so storage utilization was only 1/3. Hadoop 3.0 introduces erasure coding (EC), achieving a storage scheme of roughly 1 copy of data plus 0.5 copies of redundant parity data.
  • Compared with replicas, erasure coding is a more space-saving method of persistent data storage. A standard encoding (such as Reed-Solomon(10,4)) would have a 1.4x space overhead; whereas an HDFS copy would have a 3x space overhead.
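
For reference, here is a minimal sketch of how an erasure coding policy can be applied to a directory with the hdfs ec subcommand; RS-10-4-1024k is one of the built-in Reed-Solomon policies in Hadoop 3, and the path /ec_data is only an illustrative example:

# List the erasure coding policies known to the cluster
hdfs ec -listPolicies

# Enable a Reed-Solomon(10,4) policy and apply it to a directory (illustrative path)
hdfs ec -enablePolicy -policy RS-10-4-1024k
hdfs ec -setPolicy -path /ec_data -policy RS-10-4-1024k

# Confirm which policy the directory now uses
hdfs ec -getPolicy -path /ec_data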

Multiple NameNodes are supported:

  • The original HDFS NameNode high-availability implementation provided only one active NameNode and one standby NameNode; by replicating the edit log to three JournalNodes, that architecture can tolerate the failure of any one node in the system.
  • However, some deployments require a higher degree of fault tolerance. We can do this with this new feature, which allows users to run multiple Standby NameNodes. For example, by configuring three NameNodes and five JournalNodes, the system can tolerate the failure of two nodes instead of just one.
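
As an illustration only, a partial hdfs-site.xml sketch of how three NameNodes might be declared for one nameservice; the nameservice name mycluster and the hosts nn1-host/nn2-host/nn3-host are hypothetical, and a complete HA setup also needs the JournalNode shared edits directory, failover proxy, and fencing settings, which are omitted here:

<!-- Logical name of the nameservice (hypothetical) -->
<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<!-- Three NameNode IDs instead of the usual two -->
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2,nn3</value>
</property>
<!-- RPC address of each NameNode (hypothetical hosts) -->
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1-host:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2-host:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn3</name>
    <value>nn3-host:8020</value>
</property>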

MapReduce:

MapReduce in Hadoop3.X has made the following changes compared with previous versions:

  • Task native optimization: a C/C++ implementation of the map output collector (including Spill, Sort, IFile, etc.) was added to MapReduce, and jobs can switch to it by adjusting job-level parameters. For shuffle-intensive applications, performance can improve by about 30%.

  • MapReduce memory parameters are automatically inferred. In Hadoop 2.0, it is very cumbersome to set memory parameters for MapReduce jobs. If the settings are not reasonable, memory resources will be seriously wasted. This situation is avoided in Hadoop 3.0.

  • MapReduce in Hadoop3.x adds a local implementation of the Map output collector, which will improve performance by more than 30% for shuffle-intensive jobs.

Default port changes:

  • Before Hadoop 3.x, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000), which means a service could fail to start because of a port conflict with another application.
  • These potentially conflicting ports have now been moved out of the ephemeral range. The changes affect the NameNode, Secondary NameNode, DataNode, and KMS, and the official documentation has been updated accordingly; see HDFS-9427 and HADOOP-12811 for details.
NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820

Secondary NameNode ports: 50091 → 9869, 50090 → 9868

DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864

KMS server port: 16000 → 9600 (the original 16000 conflicted with the HBase HMaster port)

YARN resource type:

  • The YARN resource model has been generalized to support user-defined countable resource types, not just CPU and memory.

  • For example, cluster administrators can define resources such as GPUs, software licenses, or locally-attached storage. YARN tasks can be scheduled based on the availability of these resources.
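
As a hedged illustration (not part of the original tutorial), an additional resource type can be declared in a resource-types.xml file on the ResourceManager and NodeManagers; yarn.io/gpu is the conventional name used for GPUs, and the capacity value 4 below is only an example:

<!-- resource-types.xml: declare an extra countable resource besides memory and vcores -->
<configuration>
    <property>
        <name>yarn.resource-types</name>
        <value>yarn.io/gpu</value>
    </property>
</configuration>

<!-- node-resources.xml on each NodeManager: how many units of that resource this node offers (example value) -->
<property>
    <name>yarn.nodemanager.resource-type.yarn.io/gpu</name>
    <value>4</value>
</property>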


Hadoop cluster construction

Cluster Introduction

Specifically, the HADOOP cluster includes two clusters: the HDFS cluster and the YARN cluster. The two are logically separated, but they are often physically together.

The HDFS cluster is responsible for the storage of massive data . The main roles in the cluster are:

  • NameNode
  • DataNode
  • SecondaryNameNode

The YARN cluster is responsible for resource scheduling during massive data operations . The main roles in the cluster are:

  • ResourceManager
  • NodeManager

So what is mapreduce?

  • It is a distributed computing programming framework, essentially an application development package. Users develop programs according to its programming specification, package them, and submit them to run against the data stored in the HDFS cluster, while the YARN cluster handles their resource scheduling and management.

Cluster deployment method

There are three ways to deploy Hadoop :

standalone mode

Standalone mode is also called single-machine mode. Only one machine runs a single Java process, and it is mainly used for debugging.

Pseudo-Distributed mode (pseudo-distributed mode)

In pseudo-distributed mode, the NameNode and DataNode of HDFS and the ResourceManager and NodeManager of YARN all run on one machine, but each is started as a separate Java process. It is mainly used for debugging.

Cluster mode

Cluster mode is mainly used for production environment deployment. N hosts will be used to form a Hadoop cluster. In this deployment mode, the master node and slave nodes will be deployed separately on different machines.


Hadoop recompilation

Why compile Hadoop?

  • The Hadoop installation package provided by Apache does not include support for native C libraries, so problems arise when we need to use native libraries (which provide compression, C program support, etc.). Therefore the source package needs to be recompiled.

Hadoop 3 compilation

  • Basic environment: CentOS 7.7

Installation directory for the build environment software

mkdir -p /export/server
  • Install and compile related dependencies
1. yum install gcc gcc-c++

# The command below does not need to be executed; cmake is installed manually later
2. yum install make cmake  # (cmake 3.6 or above is recommended; the source cannot be compiled with an older version. It can be installed manually.)

3. yum install autoconf automake libtool curl

4. yum install lzo-devel zlib-devel openssl openssl-devel ncurses-devel

5. yum install snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop libXtst
  • Install cmake manually
# Uninstall the cmake installed via yum (its version is too old)
yum erase cmake

# Extract
tar zxvf cmake-3.13.5.tar.gz

# Build and install
cd /export/server/cmake-3.13.5

./configure

make && make install

# Verify
[root@node4 ~]# cmake -version
cmake version 3.13.5

# If the version is not shown correctly, disconnect the SSH session and log in again
  • Manually install snappy
# Remove the previously installed copy
cd /usr/local/lib

rm -rf libsnappy*

# Upload and extract
tar zxvf snappy-1.1.3.tar.gz

# Build and install
cd /export/server/snappy-1.1.3
./configure
make && make install

# Verify the installation
[root@node4 snappy-1.1.3]# ls -lh /usr/local/lib |grep snappy
-rw-r--r-- 1 root root 511K Nov  4 17:13 libsnappy.a
-rwxr-xr-x 1 root root  955 Nov  4 17:13 libsnappy.la
lrwxrwxrwx 1 root root   18 Nov  4 17:13 libsnappy.so -> libsnappy.so.1.3.0
lrwxrwxrwx 1 root root   18 Nov  4 17:13 libsnappy.so.1 -> libsnappy.so.1.3.0
-rwxr-xr-x 1 root root 253K Nov  4 17:13 libsnappy.so.1.3.0
  • Install and configure JDK 1.8
# Extract the installation package
tar zxvf jdk-8u65-linux-x64.tar.gz

# Configure environment variables
vim /etc/profile

export JAVA_HOME=/export/server/jdk1.8.0_65
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

source /etc/profile

# Verify the installation
java -version

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
  • Install and configure maven
# Extract the installation package
tar zxvf apache-maven-3.5.4-bin.tar.gz

# Configure environment variables
vim /etc/profile

export MAVEN_HOME=/export/server/apache-maven-3.5.4
export MAVEN_OPTS="-Xms4096m -Xmx4096m"
export PATH=:$MAVEN_HOME/bin:$PATH

source /etc/profile

# Verify the installation
[root@node4 ~]# mvn -v
Apache Maven 3.5.4

# Add the Aliyun Maven mirror to speed up compilation within China
vim /export/server/apache-maven-3.5.4/conf/settings.xml

<mirrors>
     <mirror>
           <id>alimaven</id>
           <name>aliyun maven</name>
           <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
           <mirrorOf>central</mirrorOf>
      </mirror>
</mirrors>
  • Install ProtocolBuffer 2.5.0
# Extract
tar zxvf protobuf-2.5.0.tar.gz

# Build and install
cd /export/server/protobuf-2.5.0
./configure
make && make install

# Verify the installation
[root@node4 protobuf-2.5.0]# protoc --version
libprotoc 2.5.0
  • Compile Hadoop
# Upload and extract the source package
tar zxvf hadoop-3.1.4-src.tar.gz

# Compile
cd /export/server/hadoop-3.1.4-src

mvn clean package -Pdist,native -DskipTests -Dtar -Dbundle.snappy -Dsnappy.lib=/usr/local/lib

# Parameter description:

Pdist,native : build the recompiled Hadoop native libraries
DskipTests : skip the tests
Dtar : package the result as a tar file at the end
Dbundle.snappy : add snappy compression support (the package downloaded from the official site does not support it by default)
Dsnappy.lib=/usr/local/lib : the library path where snappy is installed on the build machine
  • Compiled installation package path
/export/server/hadoop-3.1.4-src/hadoop-dist/target

Hadoop cluster installation

Cluster mode is mainly used for production deployments. It requires multiple hosts that can reach each other over the network. Here we build Hadoop on the three virtual machines whose basic environment was prepared earlier.

Guidelines for role planning of each node in the cluster:

  • Allocate roles reasonably according to each component's working characteristics and the servers' hardware resources
  • For example, the NameNode relies heavily on memory, so it should be deployed on a machine with plenty of memory

Notes on role planning:

  • Roles that compete for resources should, as far as possible, not be deployed together
  • Roles that need to cooperate closely with each other should be deployed together as much as possible

  1. Cluster planning: prepare three servers
                                node1              node2              node3
HDFS cluster daemons:
               NameNode          Y
      SecondaryNameNode                              Y
              DataNode           Y                   Y                  Y
YARN cluster daemons:
        ResourceManager          Y
            NodeManager          Y                   Y                  Y

  2. Server basic environment preparation
1. Set the hostnames of the three servers to node1, node2, and node3 respectively
vim /etc/hostname


2. Configure the hosts mapping on every server
vim /etc/hosts

If the cluster is deployed on cloud servers, then in the hosts file on each cloud node, fill in that node's own entry with its internal (private) IP and the other nodes' entries with their external (public) IPs.
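
For example, a sketch of what /etc/hosts might look like on node1 in that case; the IP addresses below are placeholders, not values from this tutorial:

# node1's own entry uses its private IP; the peers use their public IPs (placeholder addresses)
192.168.0.11    node1
203.0.113.12    node2
203.0.113.13    node3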

(Related reading: building a Hadoop cluster on Aliyun servers and the internal vs. external network access issues involved.)


  3. Open all ports involved in Hadoop communication

Communication between Hadoop 3.x components requires the following ports to be open:

  • HDFS:
    • Namenode Web UI: 9870 (configurable)
    • Datanode data transfer port: 9866 (configurable)
    • Datanode Web UI: 9864 (configurable)
    • Secondary Namenode Web UI: 9868 (configurable)
    • JournalNode RPC: 8485 (configurable)
    • NFS Gateway service: 111, 2049 (configurable)
    • NFS Gateway Web UI: 2049 (configurable)
  • YARN:
    • Resource Manager Web UI: 8088 (configurable)
    • Node Manager Web UI: 8042 (configurable)
    • Application History Web UI: 8188 (configurable)
    • Timeline Server Web UI: 8188 (configurable)
  • MapReduce:
    • JobHistory Server Web UI: 19888 (configurable)

Note: these are the default port numbers; the actual ports depend on how the deployment is configured. The various Hadoop components also need to communicate with each other, and those ports likewise depend on the actual deployment.
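
On CentOS 7 with firewalld, the ports can be opened along these lines (a sketch only; adjust the list to your actual configuration, and note that many test setups simply disable the firewall instead):

# Open a few of the common Hadoop ports (illustrative subset)
firewall-cmd --permanent --add-port=9870/tcp    # NameNode Web UI
firewall-cmd --permanent --add-port=8020/tcp    # NameNode RPC
firewall-cmd --permanent --add-port=9868/tcp    # Secondary NameNode Web UI
firewall-cmd --permanent --add-port=8088/tcp    # ResourceManager Web UI
firewall-cmd --permanent --add-port=19888/tcp   # JobHistory Server Web UI
firewall-cmd --reload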


  4. SSH password-free login configuration

Password-free login does not need to be configured between every pair of machines. Here, to make the one-click startup scripts below work, it is enough to configure one-way password-free login from node1 to node1, node2, and node3.

Here we take configuring node1 to log in to node1, node2, and node3 without a password as an example:

ssh-keygen                  # press Enter 4 times to generate the public/private key pair
ssh-copy-id node1; ssh-copy-id node2; ssh-copy-id node3    # copy the public key to each node
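
A quick way to check that the configuration works (run on node1; if password-free login is in place, no password prompt appears):

for host in node1 node2 node3; do ssh $host hostname; done   # should print node1, node2, node3 without prompting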

  5. Cluster time synchronization
yum -y install ntpdate

ntpdate ntp4.aliyun.com
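
To keep the clocks aligned over time, a periodic sync can optionally be added on every node via crontab -e; the 30-minute interval below is just an example:

# Example crontab entry: sync with the Aliyun NTP server every 30 minutes
*/30 * * * * /usr/sbin/ntpdate ntp4.aliyun.com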



  6. Install JDK 1.8 on the three servers
Ubuntu:
sudo apt-get install openjdk-8-jdk

CentOS:
yum install java-1.8.0-openjdk* -y

  7. Create a unified working directory
mkdir -p /export/server    # software installation path
mkdir -p /export/data      # data storage path
mkdir -p /export/software  # installation package storage path

  8. Upload and extract the Hadoop installation package (node1)
# hadoop-3.1.4-bin-snappy-CentOS7.tar.gz is the pre-compiled installation package
tar zxvf hadoop-3.1.4-bin-snappy-CentOS7.tar.gz -C /export/server/

Hadoop installation package directory structure:


  9. Enter the etc directory and edit the hadoop-env.sh configuration file
cd /export/server/hadoop-3.1.4/etc/hadoop/

vim hadoop-env.sh
# Configure JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b08-1.el7_9.x86_64/jre

# Set the users that run the shell commands of the corresponding roles
export HDFS_NAMENODE_USER=root

export HDFS_DATANODE_USER=root

export HDFS_SECONDARYNAMENODE_USER=root

export YARN_RESOURCEMANAGER_USER=root

export YARN_NODEMANAGER_USER=root 

  10. Edit and modify core-site.xml
cd /export/server/hadoop-3.1.4/etc/hadoop/

vim core-site.xml
<configuration>     
<!-- Name of the default file system. The URI scheme distinguishes different file systems. -->
<!-- file:/// local file system, hdfs:// Hadoop distributed file system, gfs:// etc. -->
<!-- HDFS access address: hdfs://nn_host:8020 (nn_host is the node where the NameNode runs). -->
<property>         
    <name>fs.defaultFS</name>         
    <value>hdfs://node1:8020</value>     
</property>     
<!-- Hadoop local data storage directory; generated automatically at format time -->
<property>         
     <name>hadoop.tmp.dir</name>         
     <value>/export/data/hadoop-3.1.4</value>     
</property>     
<!-- Username used when accessing HDFS through the Web UI. -->
<property>         
     <name>hadoop.http.staticuser.user</name>         
     <value>root</value>     
</property> 
</configuration>

  11. Edit and modify hdfs-site.xml
cd /export/server/hadoop-3.1.4/etc/hadoop/

vim hdfs-site.xml
<!-- Set the host and port where the SecondaryNameNode (the helper of the master role) runs -->
<property>
   <name>dfs.namenode.secondary.http-address</name>
   <value>node2:9868</value>
 </property>

  12. Edit and modify mapred-site.xml
 cd /export/server/hadoop-3.1.4/etc/hadoop/

 vim mapred-site.xml
<!-- Default execution mode for MR programs: yarn = YARN cluster mode, local = local mode -->
<property>     
   <name>mapreduce.framework.name</name>     
   <value>yarn</value> 
</property> 
<!-- MR App Master environment variables. -->
<property>     
   <name>yarn.app.mapreduce.am.env</name>     
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value> 
</property> 
<!-- MR MapTask environment variables. -->
<property>     
   <name>mapreduce.map.env</name>     
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value> 
</property> 
<!-- MR ReduceTask environment variables. -->
<property>     
   <name>mapreduce.reduce.env</name>     
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value> 
</property>

  13. Edit and modify yarn-site.xml
cd /export/server/hadoop-3.1.4/etc/hadoop/

vim yarn-site.xml
<!-- Machine where the YARN master role (ResourceManager) runs. -->
<property>     
   <name>yarn.resourcemanager.hostname</name>     
   <value>node1</value> 
</property> 
<!-- 0.0.0.0 means this address listens for requests on all network interfaces, not just localhost. 8088 is the port used by the ResourceManager Web UI and can be changed as needed. -->
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>0.0.0.0:8088</value>
</property>
<!-- Auxiliary service running on the NodeManager. It must be set to mapreduce_shuffle for MR programs to run. -->
<property>     
   <name>yarn.nodemanager.aux-services</name>     
   <value>mapreduce_shuffle</value> 
</property> 
<!-- Minimum memory (in MB) that can be requested for each container. -->
<property>     
   <name>yarn.scheduler.minimum-allocation-mb</name>     
   <value>512</value> 
</property> 
<!-- Maximum memory (in MB) that can be requested for each container. -->
<property>     
    <name>yarn.scheduler.maximum-allocation-mb</name>     
    <value>2048</value> 
</property> 
<!-- Ratio of virtual memory to physical memory allowed for containers. -->
<property>     
    <name>yarn.nodemanager.vmem-pmem-ratio</name>     
    <value>4</value> 
</property>

  14. Edit the workers configuration file (it lists the hosts where the slave roles run)
cd /export/server/hadoop-3.1.4/etc/hadoop/
vim workers
node1
node2
node3

  15. Distribute the installation package: on node1, use scp to synchronize the Hadoop installation package to the other machines
 cd /export/server/

 scp -r hadoop-3.1.4 root@node2:/export/server/

 scp -r hadoop-3.1.4 root@node3:/export/server/

  16. Configure Hadoop environment variables
  • Configure Hadoop environment variables on node1
 vim /etc/profile

 export HADOOP_HOME=/export/server/hadoop-3.1.4

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  • Synchronize the modified environment variables to other machines
 scp /etc/profile root@node2:/etc/

 scp /etc/profile root@node3:/etc/
  • Reload the environment variable to verify whether it takes effect (3 machines)
 source /etc/profile

 hadoop    # verify that the environment variables take effect



Deployment summary

1. Server basic environment

2. Hadoop source code compilation

3. Hadoop configuration file modification

4. The shell file, 4 XML files, and the workers file

5. Configuration file synchronization across the cluster


NameNode format (formatting operation)

When starting HDFS for the first time, it must be formatted.

  • format is essentially an initialization operation: cleaning and preparing HDFS

Command:

hdfs namenode -format

The format creates the NameNode's working directory and related files as part of initializing the NameNode.
Note:

1. The format operation is required before the first startup

2. format only needs to be done once; do not repeat it afterwards

3. Repeating the format not only causes data loss, it also leaves the master and slave roles of the HDFS cluster unable to recognize each other. This can be fixed by deleting the hadoop.tmp.dir directory on all machines and formatting again (see the sketch below).
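
A minimal recovery sketch, assuming hadoop.tmp.dir is /export/data/hadoop-3.1.4 as configured in core-site.xml above (stop the cluster first; this wipes all HDFS data):

# Stop the cluster, delete the data directory on every node, then re-format (destroys all HDFS data)
stop-all.sh
for host in node1 node2 node3; do ssh $host "rm -rf /export/data/hadoop-3.1.4"; done
hdfs namenode -format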


hadoop cluster start and stop

Manually start and stop processes one by one:

  • Start or stop one role process at a time, by hand, on each machine
  • HDFS cluster
hdfs --daemon start namenode|datanode|secondarynamenode
hdfs --daemon stop namenode|datanode|secondarynamenode
  • YARN cluster
yarn --daemon start resourcemanager|nodemanager
yarn --daemon stop resourcemanager|nodemanager

One-click start and stop with shell scripts:

The one-click scripts here rely on the password-free login configured above

  • On node1, use the shell scripts that ship with Hadoop to start everything with one click
  • Premise: Configure SSH password-free login and workers files between machines.

Scripts to start and stop the HDFS cluster alone:

  • start-dfs.sh
  • stop-dfs.sh

Scripts to start and stop the YARN cluster alone:

  • start-yarn.sh
  • stop-yarn.sh

Scripts to start and stop the whole Hadoop cluster (both clusters):

  • start-all.sh
  • stop-all.sh

After the startup is complete, you can use the jps command to check whether the process starts successfully:
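
A convenient way to check all three nodes at once from node1 (this relies on the password-free SSH configured earlier):

for host in node1 node2 node3; do echo "== $host =="; ssh $host jps; done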


Hadoop startup log path: /export/server/hadoop-3.1.4/logs/



Web UI interface access test

Hadoop Web UI interface HDFS cluster address: http://namenode_host:9870

Where namenode_host is the host name or ip of the machine where namenode is running. If you use the host name to access, don’t forget to configure hosts in Windows



Hadoop Web UI interface YARN cluster address: http://resourcemanager_host:8088/cluster

Where resourcemanager_host is the hostname or ip of the machine where resourcemanager is running. If you use the hostname to access, don’t forget to configure hosts in Windows



Hadoop HDFS first experience

  • Create a directory
hadoop fs -mkdir /dhy
  • upload files
hadoop fs -put hadoop-root-datanode-dhy.out.1 /dhy
  • query file
hadoop fs -ls /
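
A few more commands worth trying, continuing the /dhy example above (the file name is whatever you uploaded):

hadoop fs -ls /dhy                                        # list the uploaded file
hadoop fs -cat /dhy/hadoop-root-datanode-dhy.out.1        # print its contents
hadoop fs -get /dhy/hadoop-root-datanode-dhy.out.1 .      # download it back to the local directory
hadoop fs -rm /dhy/hadoop-root-datanode-dhy.out.1         # remove it when done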



summary

  1. HDFS is essentially a file system

  2. There is a directory tree structure similar to Linux, divided into files and folders

  3. But why is uploading a small file so slow? Think about it.


Hadoop MapReduce + YARN first experience

  • Execute Hadoop's official MapReduce example: estimating the value of pi (in the command below, 2 is the number of map tasks and 4 is the number of samples per map)
 cd /export/server/hadoop-3.1.4/share/hadoop/mapreduce/

 hadoop jar hadoop-mapreduce-examples-3.1.4.jar pi 2 4

Access the Web UI provided by YARN to view the result of the executed task:


summary

1. Is MapReduce essentially a program?

2. Why does executing MapReduce first require requesting YARN?

3. Does MapReduce seem to consist of two stages?

4. Map first, then Reduce?

5. Is MapReduce fast when dealing with small amounts of data?


Hadoop HDFS Benchmarks

Test write speed

  • Make sure the HDFS cluster and YARN cluster start successfully
hadoop jar /export/server/hadoop-3.1.4/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.4-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 10MB

Description: write data to the HDFS file system: 10 files, 10MB each, stored under /benchmarks/TestDFSIO

In the output, Throughput is the overall throughput, Average IO rate is the average I/O rate, and IO rate std deviation is the standard deviation of the I/O rate.
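
The benchmark also appends a summary to a local result file, typically TestDFSIO_results.log in the directory where the command was run (the name can be changed with -resFile):

cat TestDFSIO_results.log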



Test read speed

  • Make sure the HDFS cluster and YARN cluster start successfully
hadoop jar /export/server/hadoop-3.1.4/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.4-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB

Description: read 10 files from the HDFS file system, each 10MB

In the output, Throughput is the overall throughput, Average IO rate is the average I/O rate, and IO rate std deviation is the standard deviation of the I/O rate.



Clear test data

  • Make sure the HDFS cluster starts successfully
hadoop jar /export/server/hadoop-3.1.4/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.4-tests.jar TestDFSIO -clean

Note: during the test, a /benchmarks directory is created on the HDFS cluster; after testing, we can clean it up.




Origin blog.csdn.net/m0_53157173/article/details/130539795