HDFS Distributed File System—Principle and Shell Operation

1. Introduction to HDFS

   With the rapid growth of data volume, traditional file storage faces two problems: a capacity bottleneck, and the time it takes to upload and download large files. For the capacity bottleneck there are two ways to expand. The first is vertical scaling, that is, adding disks and memory to a single server; the second is horizontal scaling, that is, adding more servers. For upload and download efficiency, the solution is to split a large file into multiple data blocks and store those blocks in parallel. Distributed file systems are built on these solutions.

1. The basic concept of HDFS

HDFS is an easily scalable distributed file system that runs on hundreds of low-cost machines and has a high degree of fault tolerance; it provides high-throughput access to application data and can store and manage massive file information.
(1) NameNode (name node)
The master server of an HDFS cluster is called the name node or master node. The NameNode manages the file system and stores its state in the form of metadata: it keeps information such as the file system's operation records, while the number of replicas is set in the configuration file.
(2) DataNode (data node)
The slave servers of an HDFS cluster are called data nodes. The file system splits files into data blocks and stores them on DataNodes, so DataNodes need large amounts of disk space. Each DataNode keeps constant communication with the NameNode, reporting events such as the creation and deletion of data blocks.
(3) Block (data block)
A data block is the smallest unit in which the disk reads and writes data. In Hadoop 2.x, the default block size is 128 MB and the default replication factor is 3; each replica is stored on a different DataNode whenever possible. Replication provides data fault tolerance and availability.
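To make the block size concrete, here is a minimal Python sketch (illustrative only, not Hadoop code; the `split_into_blocks` function and the 300 MB example file are invented for this article) of how a file maps onto fixed-size blocks:

```python
# Sketch: how HDFS splits a file into fixed-size blocks (illustrative, not Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024  # default block size in Hadoop 2.x (128 MB)
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be shorter
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                     # 3
print(blocks[-1][1] // (1024 * 1024))  # 44 (the last block only uses what it needs)
```

Note that the last block occupies only as much space as it needs, which is also why HDFS is poorly suited to huge numbers of small files: each file still costs at least one block's worth of metadata on the NameNode.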
(4) Rack (rack)
HDFS uses a rack-awareness strategy: the NameNode determines the rack ID each DataNode belongs to and applies a replica-placement policy to improve data reliability, availability, and network bandwidth utilization.
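The default placement policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack) can be sketched roughly as follows. This is a simplified illustration: the `place_replicas` helper and the node/rack names are hypothetical, and the real policy also considers load and node health.

```python
# Sketch of rack-aware replica placement (simplified from Hadoop's default policy):
# 1st replica on the writer's node, 2nd on a node in a different rack,
# 3rd on a different node in that same remote rack. Illustrative only.
def place_replicas(writer, topology):
    """topology: {node: rack}. Return three nodes following the default policy."""
    local_rack = topology[writer]
    remote = [n for n, r in topology.items() if r != local_rack]
    first = writer
    second = remote[0]
    third = next(n for n, r in topology.items()
                 if r == topology[second] and n != second)
    return [first, second, third]

topology = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}
print(place_replicas("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```

Placing two replicas in one remote rack (instead of three racks) trades a little reliability for less cross-rack write traffic.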
(5) MetaData (metadata)
Metadata holds three kinds of information: first, HDFS file and directory information; second, per-file information such as which blocks hold the file's contents; third, information about all DataNodes, used to manage them.

2. Features of HDFS

Advantages:
(1) High fault tolerance: with the replica mechanism, DataNodes periodically send heartbeat signals to the NameNode, and when a DataNode is found to be down, its data can be recovered automatically from the replicas.
(2) Streaming data access
(3) Supporting very large files
(4) High data throughput: write once, read many times. Once written, a file cannot be modified, only appended, which guarantees data consistency.
(5) Can be built on cheap machines.
Disadvantages:
(1) High latency
(2) Not suitable for small file storage
(3) Not suitable for concurrent writing: only one client can write to a file at a time.

2. Principles of HDFS architecture

1. HDFS storage architecture

1. HDFS adopts a master/slave architecture, consisting of one NameNode and multiple DataNodes. The NameNode manages the file system namespace and clients' access to files; the DataNodes, as slave nodes, store the actual data.
2. The NameNode maintains its metadata as an FsImage image file and an EditLog log file. The FsImage file stores the file system's namespace information; the EditLog file persistently records changes to the file system metadata.
3. As the NameNode accumulates metadata, the EditLog file keeps growing. When the cluster restarts, the NameNode must restore its metadata by loading the FsImage file and replaying the operations recorded in the EditLog, which can take a long time. HDFS therefore provides a Secondary NameNode (secondary node), which periodically merges the EditLog into the FsImage, keeping the EditLog small and shortening cluster restart time.
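The checkpoint idea above can be illustrated with a toy sketch. This is not Hadoop's actual FsImage/EditLog format: the set-of-paths "image" and the operation tuples are invented for illustration.

```python
# Toy sketch of a checkpoint: fold EditLog records into the FsImage snapshot
# so a restart only needs to replay a short log. Not Hadoop's real format.
def apply_edit(image, edit):
    """Replay one logged operation against the in-memory namespace image."""
    op, path = edit
    if op == "mkdir":
        image.add(path)
    elif op == "delete":
        image.discard(path)
    return image

def checkpoint(fsimage, editlog):
    """Merge all pending edits into a new image; the log can then be emptied."""
    new_image = set(fsimage)
    for edit in editlog:
        apply_edit(new_image, edit)
    return new_image, []

image = {"/"}
log = [("mkdir", "/itcast"), ("mkdir", "/itcast/hadoop"), ("delete", "/itcast/hadoop")]
image, log = checkpoint(image, log)
print(sorted(image))  # ['/', '/itcast']
print(len(log))       # 0
```

After the checkpoint, the image alone reflects all past edits, so a restart no longer has to replay the long log.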

2. HDFS file reading and writing principles

The client reads and writes data in HDFS as follows.
(1) HDFS write data operation
The client initiates a file upload request and establishes communication with the NameNode via RPC (remote procedure call), then uploads the file: the file is split into data blocks, and the blocks and their replicas are uploaded in sequence.
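During the upload, each block and its replicas travel through a replication pipeline of DataNodes: the client sends the block to the first node, which forwards it to the second, and so on. A toy sketch (the function, node names, and data structures are invented for illustration, not the real DataNode protocol):

```python
# Toy sketch of HDFS's pipeline write: each DataNode persists the block,
# then forwards it to the next node in the pipeline. Illustrative only.
def pipeline_write(block, pipeline, storage):
    """Write a block through a replication pipeline; return True once every
    node in the pipeline has stored it."""
    if not pipeline:
        return True                              # end of pipeline: all stored
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)   # this DataNode persists the block
    return pipeline_write(block, rest, storage)  # forward to the next DataNode

storage = {}
ok = pipeline_write("blk_0001", ["dn1", "dn2", "dn3"], storage)
print(ok)               # True
print(sorted(storage))  # ['dn1', 'dn2', 'dn3']
```

In the real protocol, acknowledgements flow back up the pipeline to the client; here success is simply the return value.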
(2) HDFS read data operation
The client sends an RPC request to the NameNode to obtain the locations of the data blocks, reads the blocks from the DataNodes, and merges them into the final file.
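A toy sketch of this read path (the block IDs and the in-memory `block_map` stand in for what the NameNode and DataNodes would actually provide):

```python
# Toy sketch of the HDFS read path: fetch each block in order (from any
# replica) and concatenate them into the final file. Illustrative only.
block_map = {  # hypothetical block locations/contents returned for a file
    "blk_0": b"Hello, ",
    "blk_1": b"HDFS!",
}

def read_file(block_ids, block_map):
    """Concatenate the file's blocks in order to reconstruct its contents."""
    return b"".join(block_map[b] for b in block_ids)

print(read_file(["blk_0", "blk_1"], block_map))  # b'Hello, HDFS!'
```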

3. Shell operation of HDFS

1. HDFS Shell parameters

command parameter    function description
-ls                  list the directory structure of the specified path
-du                  show the size of directories and files
-mv                  move files
-cp                  copy files
-rm                  delete files or empty directories
-put                 upload files
-cat                 view file contents
-mkdir               create directories
-text                output a source file in text format
-help                show help

2. ls command

hadoop fs -ls [parameters] [path]
parameters:

  • -d: list directories as plain files
  • -h: show file sizes in human-readable units
  • -R: recursively list all subdirectories

View all files and folders under the HDFS root directory:
hadoop fs -ls /

3. mkdir command

Use the -p parameter to create parent directories along the path as needed:

hadoop fs -mkdir -p /itcast/hadoop

4. put command

parameters:

  • -f: overwrite the target file if it already exists
  • -p: preserve access and modification times, ownership, and permissions

Copy a local file (here, aa) to the HDFS root directory:
hadoop fs -put -f aa /

Origin: blog.csdn.net/tang5615/article/details/125669045