HDFS principle and operation

1. HDFS principle

HDFS (Hadoop Distributed File System) is a distributed file system. It is highly fault-tolerant, provides high-throughput data access, and is well suited to applications that work with large-scale data sets, making it a practical solution for storing massive amounts of data.

  • High-throughput access: Each Block of a file is distributed across different Racks. When a client accesses a file, HDFS serves the request from the closest replica on the least-loaded server. Because Blocks are replicated across Racks, access is not tied to a single copy, which keeps reads fast and efficient. HDFS can also read and write in parallel across the server cluster, increasing the aggregate bandwidth for file I/O (see the sketch after this list for how Block placement is visible to clients).
  • High fault tolerance: System failures are inevitable, so recovering data and tolerating faults after a failure is crucial. HDFS ensures data reliability in several ways: each Block has multiple replicas distributed across servers in different physical locations, data is checksummed, and a background process continuously verifies data consistency. Together these make high fault tolerance possible.
  • Linear scalability: Because Block metadata is stored on the NameNode while the Blocks themselves are distributed across the DataNodes, the cluster is expanded simply by adding DataNodes. Expansion happens while the system keeps serving requests, with no downtime and no manual intervention.
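
As an illustration of how this Block placement is exposed to clients, the sketch below uses the standard Hadoop Java API to list which DataNodes hold each Block of a file. It is a minimal example: the NameNode address and file path are assumptions, not values from the article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's setting
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/a.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation lists the DataNodes holding the replicas of one Block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```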

2. HDFS architecture

HDFS follows a Master/Slave architecture, divided into three roles: NameNode, Secondary NameNode, and DataNode.

  • NameNode: The Master node; in Hadoop 1.x there is only one. It manages the file system namespace and the mapping from files to Blocks.
  • Secondary NameNode: Assists the NameNode and shares part of its work; it periodically merges the fsimage and edits log and pushes the result back to the NameNode, and it can help recover the NameNode in an emergency.
  • DataNode: The Slave node, which actually stores the data; it performs Block reads and writes and reports storage information to the NameNode (a configuration sketch mapping each role to its classic settings follows this list).
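
The division of responsibilities above corresponds to the classic Hadoop 1.x configuration keys. The sketch below is illustrative only: the keys are standard Hadoop 1.x settings, but the values are placeholders chosen for this example.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsRoleConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // NameNode: address clients use to reach the single master in Hadoop 1.x
        conf.set("fs.default.name", "hdfs://namenode:9000");

        // DataNode: local directories where Block data is actually stored
        conf.set("dfs.data.dir", "/data/hdfs/data");

        // Secondary NameNode: how often (seconds) fsimage and edits are merged
        conf.set("fs.checkpoint.period", "3600");

        // Replication: number of copies of each Block kept on different DataNodes
        conf.set("dfs.replication", "3");

        for (String key : new String[]{"fs.default.name", "dfs.data.dir",
                "fs.checkpoint.period", "dfs.replication"}) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}
```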

3. HDFS read operation

  1. The client opens the file it wishes to read by calling the open() method of the FileSystem object; for HDFS, this object is an instance of DistributedFileSystem;
  2. DistributedFileSystem uses RPC to call the NameNode to determine the locations of the file's starting Blocks. For each Block, multiple locations are returned according to the replication factor, sorted by the Hadoop cluster topology with those closest to the client listed first;
  3. The first two steps return an FSDataInputStream object, which wraps a DFSInputStream; DFSInputStream manages the data streams to the DataNodes and the NameNode. The client calls the read() method on this input stream;
  4. DFSInputStream, which holds the DataNode addresses for the file's starting Blocks, connects to the nearest DataNode. By repeatedly calling read() on the stream, data is transferred from the DataNode to the client;
  5. When the end of a block is reached, DFSInputStream closes the connection to the DataNode and then looks for the best DataNode for the next block. These operations are transparent to the client. From the client's perspective, it just reads a continuous stream;
  6. Once the client has finished reading, it calls the close() method on FSDataInputStream to close the file (a minimal read sketch in Java follows these steps).
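
To make the flow above concrete, here is a minimal client-side read sketch using the standard Hadoop Java API. The NameNode address and file path are illustrative assumptions, not values from the article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf); // DistributedFileSystem for an hdfs:// URI

        // Step 1: open() asks the NameNode (via RPC) for the locations of the first Blocks
        Path path = new Path("/user/hadoop/a.txt"); // hypothetical file
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // Steps 4-5: read() pulls data from the nearest DataNode, switching
            // DataNodes transparently at Block boundaries
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // Step 6: try-with-resources calls close() to end the read
        fs.close();
    }
}
```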

4. HDFS write operation

  1. The client creates a new file by calling the create() method of DistributedFileSystem;
  2. DistributedFileSystem calls the NameNode via RPC to create a new file that has no Blocks associated with it yet. Before creating it, the NameNode performs various checks, such as whether the file already exists and whether the client has permission to create it. If the checks pass, the NameNode records the new file; otherwise it throws an IOException;
  3. After the first two steps, an FSDataOutputStream object is returned. As when reading a file, FSDataOutputStream wraps a DFSOutputStream, which coordinates with the NameNode and the DataNodes. The client starts writing data to DFSOutputStream, which cuts the data into small packets and writes them to an internal queue called the data queue;
  4. DataStreamer consumes the data queue. It first asks the NameNode which DataNodes are most suitable for storing the new Block; for example, if the replication factor is 3, it picks the 3 most suitable DataNodes and arranges them into a pipeline. DataStreamer sends the packets in the queue to the first DataNode in the pipeline, the first DataNode forwards each packet to the second, and so on;
  5. DFSOutputStream also maintains a queue of packets called the ack queue, which waits for acknowledgements from the DataNodes. When every DataNode in the pipeline has acknowledged a packet, that packet is removed from the ack queue;
  6. After the client completes writing data, it calls the close() method to close the write stream;
  7. DataStreamer flushes the remaining packets into the pipeline and waits for their acknowledgements; after receiving the last ack, it notifies the NameNode to mark the file as complete (a minimal write sketch in Java follows these steps).
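
As with reading, the pipeline mechanics are hidden behind the stream API. Below is a minimal write sketch under the same assumptions (hypothetical NameNode address and output path).

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the NameNode to record the new file (existence and
        // permission checks happen here); no Blocks are allocated yet
        Path path = new Path("/user/hadoop/b.txt"); // hypothetical output file
        try (FSDataOutputStream out = fs.create(path)) {
            // Steps 3-5: the data is split into packets and pushed through the DataNode
            // pipeline behind the scenes; the client just writes to the stream
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } // Steps 6-7: close() flushes the remaining packets and completes the file
        fs.close();
    }
}
```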

5. Common commands in HDFS

hadoop fs -ls /                         # list the contents of the root directory
hadoop fs -lsr                          # list a directory recursively (deprecated; use -ls -R)
hadoop fs -mkdir /user/hadoop           # create a directory
hadoop fs -put a.txt /user/hadoop/      # upload a local file to HDFS
hadoop fs -get /user/hadoop/a.txt /     # download a file from HDFS to the local file system
hadoop fs -cp src dst                   # copy a file within HDFS
hadoop fs -mv src dst                   # move or rename a file within HDFS
hadoop fs -cat /user/hadoop/a.txt       # print the contents of a file
hadoop fs -rm /user/hadoop/a.txt        # delete a file
hadoop fs -rmr /user/hadoop/a.txt       # delete recursively (deprecated; use -rm -r)
hadoop fs -text /user/hadoop/a.txt      # print a file as text, decoding compressed or sequence files
hadoop fs -copyFromLocal localsrc dst   # similar to hadoop fs -put
hadoop fs -moveFromLocal localsrc dst   # upload a local file to HDFS and delete the local copy
