1. Basic introduction to Hadoop tutorial (detailed)

Preface

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation.
It primarily solves two problems: the storage of massive data and the analysis and computation of massive data.
The core components of the Hadoop framework are HDFS and MapReduce: HDFS provides distributed storage, while MapReduce provides distributed analysis and processing.
Broadly speaking, "Hadoop" usually refers to a wider concept: the Hadoop ecosystem.

1. HDFS file system

HDFS (Hadoop Distributed File System) is a highly fault-tolerant system designed to be deployed on cheap commodity machines. It provides high-throughput data access and is well suited to applications with large data sets.

(1) Design features of HDFS:

1. Large data files: well suited to storing single files at the TB scale, or large collections of big files.
2. Block-based storage: HDFS splits a large file into equal-sized blocks and stores them on different machines. This means that when reading a file, different blocks can be fetched from multiple hosts at the same time, which is far more efficient than reading from a single host.
3. Streaming data access: write once, read many times. Unlike a traditional file system, HDFS does not support modifying file contents in place; a file is written once and never changed, except that new content may be appended at the end.
4. Cheap hardware: HDFS runs on ordinary PCs, which lets a company build a big-data cluster out of a few dozen inexpensive machines.
5. Hardware failure tolerance: HDFS assumes that any machine can fail. To make sure a block stays readable when its host goes down, HDFS places replicas of each block on several other hosts; if one host fails, a replica can immediately be read from another.
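The block-splitting and replication ideas in points 2 and 5 can be sketched in a few lines of Python. This is a minimal illustrative model, not Hadoop's actual implementation; the function names and the round-robin placement are my own simplifications (real HDFS placement is rack-aware).

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS block size
REPLICATION = 3                # HDFS's default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs describing each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` DataNodes, round-robin.

    Real HDFS uses a rack-aware policy; round-robin is just a stand-in.
    """
    ring = itertools.cycle(datanodes)
    return {i: [next(ring) for _ in range(replication)]
            for i in range(len(blocks))}
```

For example, a 200 MB file yields four blocks (three full 64 MB blocks plus one 8 MB tail block), and each block is mapped to three DataNodes.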

(2) HDFS master/slave architecture:

An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is the central server: it manages the file system namespace and regulates client access to files. Each DataNode, typically one per node in the cluster, manages the storage attached to that node. Internally, a file is split into one or more blocks, and these blocks are stored across the set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories, and determines the mapping of blocks to specific DataNodes. DataNodes create, delete, and replicate blocks on instruction from the NameNode. Both the NameNode and the DataNodes are designed to run on ordinary, cheap machines running Linux.

(3) How HDFS works (diagrams based on my own understanding):


1. Block: a file is split into blocks, typically 64 MB each (128 MB by default since Hadoop 2.x).
2. NameNode: stores the directory information, file metadata, and block mapping for the entire file system. This metadata is kept on a single dedicated host, so if that host fails, the NameNode becomes unavailable. Starting with Hadoop 2.x, an active-standby mode is supported: if the primary NameNode fails, a standby host takes over and runs the NameNode.
3. DataNode: distributed across cheap machines, used to store the block files.

The NameNode fully manages the replication of data blocks. It periodically receives heartbeat signals and block reports (Blockreport) from each DataNode in the cluster. Receiving a heartbeat means the DataNode is working normally; a block report lists all the blocks stored on that DataNode.
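The heartbeat and block-report bookkeeping described above can be sketched as follows. The class and method names here are hypothetical, for illustration only; they are not the real Hadoop API, and the timeout value is illustrative rather than HDFS's actual setting.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, not the real HDFS value

class NameNode:
    """Tracks DataNode liveness and the block -> DataNode mapping."""

    def __init__(self):
        self.last_heartbeat = {}  # datanode id -> timestamp of last heartbeat
        self.block_map = {}       # block id -> set of datanode ids holding it

    def receive_heartbeat(self, dn_id, now=None):
        """A heartbeat means the DataNode is alive; record when we saw it."""
        self.last_heartbeat[dn_id] = now if now is not None else time.time()

    def receive_block_report(self, dn_id, block_ids):
        """Replace this DataNode's entries in the block map with its report."""
        for replicas in self.block_map.values():
            replicas.discard(dn_id)
        for b in block_ids:
            self.block_map.setdefault(b, set()).add(dn_id)

    def live_datanodes(self, now=None):
        """DataNodes whose last heartbeat is within the timeout window."""
        now = now if now is not None else time.time()
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t < HEARTBEAT_TIMEOUT}
```

A DataNode that stops sending heartbeats simply ages out of `live_datanodes()`, at which point a real NameNode would schedule re-replication of the blocks it held.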

(4) How HDFS writes a file (diagram based on my own understanding):


Step 1: the client sends a request to the NameNode: "upload file a.txt to the root directory /";
Step 2: the NameNode checks whether a file with the same name already exists in the file system directory tree;
Step 3: the NameNode returns to the client whether the upload is allowed;
Step 4: after receiving permission, the client asks the NameNode where to upload the first split (block) of a.txt;
Step 5: the NameNode looks up the IP addresses of available DataNodes in its DataNode pool;
Step 6: the NameNode returns the IP addresses of the available DataNodes (sorted by network-topology distance, with the DataNode closest to the client first);
Step 7: the client attempts to connect to a DataNode and requests a data transfer;
Step 8: the pipeline is established;
Step 9: a data-transfer stream is opened;
Step 10: the data is transferred.

2. The MapReduce computation framework

MapReduce is a programming model for parallel computation over large-scale data sets (larger than 1 TB). A MapReduce job is divided into two phases: "Map" and "Reduce".

When you submit a job to the MapReduce framework, it first splits the job into several Map tasks and assigns them to different nodes for execution. Each Map task processes a portion of the input data. When a Map task completes, it produces intermediate files, which serve as the input to the Reduce tasks. The main goal of a Reduce task is to aggregate the outputs of the preceding Map tasks and emit the final result.
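The Map/shuffle/Reduce flow just described can be illustrated with the classic word-count example. This is a single-process sketch of the programming model, not the distributed framework itself; in real Hadoop the shuffle step is performed by the framework between the two phases.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key (here, sum the counts)."""
    return {word: sum(counts) for word, counts in groups.items()}

# End-to-end word count over a tiny "data set".
result = reduce_phase(shuffle(map_phase(["hadoop hdfs", "hadoop mapreduce"])))
```

Running this on the two sample lines yields `{"hadoop": 2, "hdfs": 1, "mapreduce": 1}`: each Map emission is a count of 1, and Reduce sums the counts per word.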

(1) How MapReduce works (diagrams based on my own understanding):


My personal take on how MapReduce is implemented will be covered in a later update...

If you found this helpful, please like, follow, and favorite!

Origin blog.csdn.net/xgb2018/article/details/109327480