Hadoop - HDFS file system

Table of contents

HDFS file system 

1. File system definition

2. What challenges do traditional file storage systems face with massive data in the big data era?

3. The core attributes and functional meanings of distributed storage systems

1. Advantages of distributed storage

2. The function of metadata recording

3. Benefits of block storage

4. The role of the copy mechanism

4. Introduction to HDFS

HDFS Applicable Scenarios

5. HDFS master-slave architecture

block storage

copy mechanism

metadata management

block storage by DataNodes

6. HDFS shell operation

Common operations:

7. HDFS workflow and mechanism

Primary role: NameNode

NameNode responsibilities:

Slave role: DataNode

DataNode responsibilities

Secondary role: SecondaryNameNode

8. The process of reading and writing data in HDFS

Core concept -- Pipeline

Core concept -- ACK acknowledgement

Default 3-replica placement strategy:

The full write process:

The full read process:


HDFS file system 

1. File system definition

A file system is a method of storing and organizing data. It implements the storage, hierarchical organization, access, and retrieval of data, making it easier for users to access and find files.

A file system replaces the data-block view of physical devices such as hard disks with the abstract logical concept of a directory tree. Users do not need to care where the underlying data lives on disk; they only need to remember the directory and file name.

File systems typically use storage devices such as hard disks and optical disks, and maintain the physical location of files on the device.

    A traditional file system usually means a single-machine file system, that is, one whose underlying storage does not span multiple machines. Examples include the file systems on Windows
and Linux, the FTP file system, and so on.
    Common features of these file systems include:
        1. An abstract directory tree structure that starts at the / root directory and spreads downward;
        2. The nodes in the tree fall into two categories: directories and files;
        3. Starting from the root directory, the path to each node is unique.

########################################################## 

2. What challenges do traditional file storage systems face with massive data in the big data era?

High cost:

Traditional storage hardware has poor generality; the investment in equipment plus later maintenance, upgrades, and expansion is very expensive.

Difficulty supporting efficient computation and analysis:

With traditional storage, storage is storage and computing is computing; when data needs to be processed, the data has to be moved to where the computation runs.

Programs and data storage are provided by different technology vendors and cannot be organically integrated.

Poor performance:

The I/O bottleneck of a single node cannot be overcome, making it hard to support high-concurrency, high-throughput scenarios over massive data.

Poor scalability:

Rapid deployment and elastic expansion cannot be achieved; dynamic scaling out or in is costly and technically difficult.

##########################################################  

3. The core attributes and functional meanings of distributed storage systems

Core attributes of a distributed storage system

• Distributed storage

• Metadata records

• Block storage

• Copy mechanism

##########################################################  

1. Advantages of distributed storage

Problem: the amount of data is huge, and single-machine storage hits a bottleneck

Solution:

Vertical scaling of a single machine: add disks when disks run short, but there is an upper limit

Horizontal scaling across machines: add machines when machines run short; in theory, expansion is unlimited

##########################################################  

2. The function of metadata recording

Problem:

Files spread across different machines are hard to locate

Solution:

Metadata records each file and its storage location, so a file can be located quickly

 

##########################################################  

 

3. Benefits of block storage

Problem:

A file may be too large to store on a single machine, and uploading or downloading it as a whole is inefficient

Solution:

Files are stored in blocks on different machines, and operating on blocks in parallel improves efficiency

 ########################################################## 

4. The role of the copy mechanism

Problem:

Hardware failure is inevitable, so data is easily lost

Solution:

Keep redundant backup copies on different machines to ensure data safety

 ########################################################## 

4. Introduction to HDFS

HDFS (Hadoop Distributed File System) means the Hadoop distributed file system.

It is one of the core components of Apache Hadoop and serves as the underlying distributed storage service of the big data ecosystem; after all, the first problem big data has to solve is storing massive amounts of data.

 

HDFS mainly solves the problem of how to store big data. Distributed means that HDFS is a storage system that spans multiple computers.

HDFS is a distributed file system that can run on ordinary commodity hardware. It is highly fault tolerant and suited to applications with large data sets, making it a good fit for storing data at the TB and PB scale.

HDFS stores files across multiple computers while presenting a unified access interface, so clients use the distributed file system as if it were an ordinary local file system.

 

HDFS Applicable Scenarios

##########################################################  

5. HDFS master-slave architecture

An HDFS cluster is a standard master/slave architecture cluster.

Generally, an HDFS cluster consists of one NameNode and a number of DataNodes.

The NameNode is the master node of HDFS and the DataNodes are its slave nodes. The two roles each perform their own duties and coordinate to provide the distributed file storage service.

The official architecture diagram shows one master and five slaves, with the five slave roles placed on different servers across two racks (RACK).

 

block storage

Files in HDFS are physically stored in blocks; the default block size is 128 MB. A file (or its final portion) smaller than 128 MB still forms a single block of its own.

The block size can be specified through the configuration parameter dfs.blocksize, whose default is defined in hdfs-default.xml.

copy mechanism

Every block of a file has replicas. The replication factor can be specified when the file is created and can also be changed later by command.

The number of replicas is controlled by the parameter dfs.replication; the default is 3, which means 2 extra copies are made, so together with the block itself there are 3 replicas in total.
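As a sketch of how these two parameters can be set from a client, the following Java snippet is an illustration only: the NameNode address hdfs://node1:8020, the path /test/data.bin, and the chosen values are assumptions. It sets dfs.blocksize and dfs.replication on the client configuration and also shows the per-file overload of FileSystem.create().

// Sketch only: client-side configuration of block size and replication.
// The NameNode address, path, and values below are assumptions for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB instead of the 128 MB default
        conf.set("dfs.replication", "2");                  // 2 replicas instead of the default 3
        FileSystem fs = FileSystem.get(URI.create("hdfs://node1:8020"), conf);

        // Replication and block size can also be given per file at create time.
        FSDataOutputStream out = fs.create(new Path("/test/data.bin"),
                true,                 // overwrite if the file exists
                4096,                 // I/O buffer size
                (short) 2,            // replication factor
                256L * 1024 * 1024);  // block size in bytes
        out.writeBytes("hello hdfs");
        out.close();
        fs.close();
    }
}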

metadata management

In HDFS, the NameNode manages two types of metadata.

file attribute information

File name, permissions, modification time, file size, replication factor, data block size.

File block location mapping information

Records the mapping between file blocks and DataNodes, that is, which block is located on which node.

block storage by DataNodes

The concrete storage of each block of a file is managed by the DataNode nodes

Each block can be stored on multiple DataNodes
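To make the block-to-DataNode mapping concrete, here is a minimal Java sketch (the NameNode address and the file path are assumptions for illustration) that asks for the block locations of a file and prints which DataNodes hold each block.

// Sketch only: print which DataNodes hold each block of a file.
// The NameNode address and file path are assumptions for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://node1:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/test/data.bin"));

        // The block-to-DataNode mapping comes from the NameNode's metadata.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}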

##########################################################  

6. HDFS shell operation

Introduction:

The command-line interface is a form of human-computer interaction in which the user types instructions on the keyboard and the computer executes them after receiving them.

Hadoop provides a shell command-line client for the file system: hadoop fs [options]

[hadoop@node1 mapreduce]$ hadoop fs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum [-v] <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge [-immediate] [-fs <path>]]
[hadoop@node1 local]$ hadoop fs -ls /
[hadoop@node1 local]$ hadoop fs -mkdir /test
[hadoop@node1 local]$ hadoop fs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2023-03-28 14:03 /test
The HDFS shell CLI supports multiple file systems, including the local file system (file:///) and the distributed file system (hdfs://).

		Which file system a command actually operates on is determined by the scheme prefix of the file path URL in the command.

		If no prefix is given, the fs.defaultFS property from the configuration is read and its value is used as the default file system.
		hadoop dfs operates only on the HDFS file system (including operations between HDFS and the local FS), but it is deprecated;
		hdfs dfs also operates only on the HDFS file system (including operations between HDFS and the local FS) and is commonly used;
		hadoop fs can operate on any file system, not just HDFS, so it has the widest scope;
		as of current versions, hadoop fs is the officially recommended command, although hdfs dfs is also widely used in practice.
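The same rule can be illustrated from the Java side (the hdfs://node1:8020 address is an assumption): the URI scheme passed to FileSystem.get() decides which file system the client talks to, and omitting it falls back to fs.defaultFS.

// Sketch only: the URI scheme selects the target file system; without a scheme,
// fs.defaultFS is used. The hdfs://node1:8020 address is an assumption.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://node1:8020/"), conf); // distributed FS
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);           // local FS
        FileSystem dflt  = FileSystem.get(conf);                                   // whatever fs.defaultFS says

        System.out.println(hdfs.getUri() + " | " + local.getUri() + " | " + dflt.getUri());
        // Instances returned by FileSystem.get() are cached, so they are not closed here.
    }
}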

Common operations:

hadoop fs -mkdir [-p] <path> ...
	path is the directory to create
	the -p option behaves much like Unix mkdir -p, creating any parent directories along the path
hadoop fs -ls [-h] [-R] [<path> ...]
	path specifies the directory path
	-h shows file sizes in human-readable form
	-R recursively lists the given directory and its subdirectories
hadoop fs -put [-f] [-p] <localsrc> ... <dst>
	-f overwrite the destination file if it already exists
	-p preserve access and modification times, ownership and permissions
	localsrc is on the local file system (the client machine)
	dst is on the destination file system (HDFS)
hadoop fs -cat <src> ...
	reads the entire content of the given file and prints it to standard output
	note: be careful when reading the content of large files
hadoop fs -get [-f] [-p] <src> ... <localdst>
	downloads files to the given directory on the local file system; localdst must be a directory
	-f overwrite the destination file if it already exists
	-p preserve access and modification times, ownership and permissions
hadoop fs -cp [-f] <src> ... <dst>
	-f overwrite the destination file if it already exists
hadoop fs -mv <src> ... <dst>
	moves files into the given directory
	this command can be used to move data or to rename files
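The same common operations can also be performed programmatically through the Java FileSystem API. A minimal sketch, with assumed paths and an assumed NameNode address:

// Sketch only: common shell operations expressed with the Java FileSystem API.
// The NameNode address and all paths are assumptions for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CommonOpsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://node1:8020"), new Configuration());

        fs.mkdirs(new Path("/test/input"));                                        // hadoop fs -mkdir -p
        fs.copyFromLocalFile(new Path("/tmp/a.txt"), new Path("/test/input/"));    // hadoop fs -put

        FSDataInputStream in = fs.open(new Path("/test/input/a.txt"));             // hadoop fs -cat
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();

        fs.copyToLocalFile(new Path("/test/input/a.txt"), new Path("/tmp/b.txt")); // hadoop fs -get
        fs.rename(new Path("/test/input/a.txt"), new Path("/test/a.txt"));         // hadoop fs -mv

        fs.close();
    }
}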

##########################################################  

7. HDFS workflow and mechanism

Primary role: NameNode

NameNode is the core of the Hadoop distributed file system and the main role in the architecture

NameNode maintains and manages file system metadata, including namespace directory tree structure, location information of files and blocks, access permissions, etc.

On this basis, the NameNode becomes the sole entry point for accessing HDFS.

The NameNode manages metadata internally using both memory and files on disk.

The metadata files on disk include the fsimage (an image of the in-memory metadata) and the edits log (journal).

NameNode responsibilities:

The NameNode stores only HDFS metadata: the directory tree of all files in the file system and the tracking information for files across the cluster; it does not store the actual data.

The NameNode knows the list of blocks and their locations for any given file in HDFS; with this information it knows how to rebuild the file from its blocks.

The NameNode does not persist which DataNode stores each block of each file; this information is rebuilt from the DataNodes when the system starts.

The NameNode is a single point of failure in a Hadoop cluster.

The machine where the NameNode resides is usually configured with a large amount of memory (RAM).

Slave role: DataNode

The DataNode is the slave role in Hadoop HDFS, responsible for the actual storage of data blocks.

The number of DataNodes determines the overall storage capacity of the HDFS cluster; DataNodes maintain the data blocks in cooperation with the NameNode.

DataNode responsibilities

The DataNode is responsible for the final storage of data blocks; it is the slave role of the cluster, also known as the Slave.

When the DataNode starts, it will register itself to the NameNode and report the list of blocks it is responsible for holding.

When a DataNode is shut down, data availability is not affected; the NameNode arranges for the blocks it managed to be replicated on other DataNodes.

The machine where the DataNode is located is usually configured with a large amount of hard disk space, because the actual data is stored in the DataNode.

Secondary role: SecondaryNameNode

The SecondaryNameNode acts as an auxiliary node to the NameNode, but it cannot replace the NameNode.

Its main job is to help the primary role merge the metadata files; it can be thought of as the primary role's secretary.

########################################################## 

8. The process of reading and writing data in HDFS

Core concept -- Pipeline

A pipeline is the data transmission method HDFS uses while uploading files and writing data.

The client writes the data block to the first DataNode; the first DataNode saves the data and forwards the block to the second DataNode, which saves it and forwards it to the third DataNode.

Why do DataNodes transfer data linearly along a pipeline instead of the client sending to all three DataNodes at once?

Because the data is transmitted sequentially in one direction along the pipeline, the bandwidth of every machine is fully used, network bottlenecks and high-latency links are avoided, and the delay in pushing all the data is minimized.

In linear push mode, each machine's full outbound bandwidth is used to transfer the data as fast as possible, rather than being divided among multiple recipients.

Core concept -- ACK acknowledgement

ACK (acknowledge character) is a confirmation character: in data communication, it is a transmission control character sent by the receiver to the sender to confirm that the transmitted data was received correctly.

During HDFS pipeline data transmission, ACK checks are performed in the direction opposite to the data flow to ensure the data is transmitted safely.

Default 3-replica placement strategy:

First replica: preferably on the client's local machine; otherwise a random node.

Second replica: on a rack different from the first replica.

Third replica: on the same rack as the second replica but on a different machine.

The full write process:

1. The HDFS client creates a DistributedFileSystem object instance, which encapsulates the methods for operating on the HDFS file system.

2. The client calls the create() method on the DistributedFileSystem object, which requests over RPC that the NameNode create the file.
The NameNode performs several checks: whether the target file already exists, whether the parent directory exists, and whether the client has permission to create the file. If the checks pass,
the NameNode records the request and an FSDataOutputStream output stream object is returned to the client for writing data.

3. The client starts writing data through the FSDataOutputStream output stream.

4. As the client writes, the data is split into packets (64 KB by default), and the stream's internal DataStreamer component asks the NameNode for
a set of DataNode addresses suitable for storing the replicas (3 replicas by default).
The DataStreamer streams each packet to the first DataNode in the pipeline, which stores the packet and sends it to the second
DataNode in the pipeline. Likewise, the second DataNode stores the packet and sends it to the third (and last) DataNode.

5. In the direction opposite to the data flow, the ACK mechanism is used to verify that each packet was transmitted successfully.

6. After the client finishes writing data, it calls the close() method on the FSDataOutputStream output stream to close it.

7. The DistributedFileSystem contacts the NameNode to report that the file write is complete and waits for the NameNode's confirmation.
Because the NameNode already knows which blocks the file consists of (the DataStreamer requested the block allocations), it only has to wait until the blocks reach their minimum replication before returning success.
The minimum replication is specified by the parameter dfs.namenode.replication.min and defaults to 1.
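A minimal Java sketch of the client side of this write path (the NameNode address and path are assumptions); the pipeline, packet splitting, and ACK handling described above all happen inside the returned output stream:

// Sketch only: the client side of the write path described above.
// The NameNode address and path are assumptions for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteSketch {
    public static void main(String[] args) throws Exception {
        // For an hdfs:// URI this returns a DistributedFileSystem instance.
        FileSystem fs = FileSystem.get(URI.create("hdfs://node1:8020"), new Configuration());

        // create() asks the NameNode over RPC to create the file entry and
        // returns the output stream the client writes into.
        FSDataOutputStream out = fs.create(new Path("/test/write-demo.txt"));

        // Data written here is split into packets and pushed down the DataNode
        // pipeline by the stream's internal DataStreamer.
        out.writeBytes("hello pipeline");

        // close() flushes the remaining packets and waits for acknowledgement.
        out.close();
        fs.close();
    }
}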

##########################################################  

The full read process:

1. The HDFS client creates an object instance DistributedFileSystem, and calls the object's open() method to open the file it wants to read.

2. The DistributedFileSystem calls the NameNode over RPC to determine the locations of the first batch of blocks in the file (block locations are read in batches).

For each block, the NameNode returns the addresses of the DataNodes that hold a replica of that block, and the address list is sorted so that the DataNode closest to the client in the network topology comes first.

3. DistributedFileSystem returns the FSDataInputStream input stream to the client for it to read data.

4. The client calls the read() method on the FSDataInputStream input stream. The input stream, which already holds the DataNode addresses, connects to the closest DataNode that stores the first block of the file. Data is streamed from that DataNode back to the client, so the client can call read() on the stream repeatedly.

5. When a block ends, the FSDataInputStream closes the connection to that DataNode and then looks for the best DataNode for the next block. These operations are transparent to the client, which simply perceives a continuous stream.

When the client reads data from the stream, it also asks the NameNode as needed to retrieve the DataNode location information for the next batch of data blocks.

6. Once the client finishes reading, it calls the close() method on the FSDataInputStream.
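And a matching minimal Java sketch of the read path (again with an assumed address and path); fetching block locations from the NameNode and switching between DataNodes happen inside the input stream:

// Sketch only: the client side of the read path described above.
// The NameNode address and path are assumptions for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://node1:8020"), new Configuration());

        // open() fetches block locations from the NameNode and returns a stream
        // that reads each block from the nearest DataNode holding a replica.
        FSDataInputStream in = fs.open(new Path("/test/write-demo.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false); // stream the file to stdout
        in.close();
        fs.close();
    }
}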
