Introduction to Big Data (2) Hadoop Distributed File System - Introduction to HDFS

1. What is HDFS

The Hadoop Distributed File System (HDFS) is the core sub-project of the Hadoop project and the foundation of data storage and management in distributed computing. It was developed to meet the need to access and process very large files in a streaming-data fashion, and it runs on inexpensive commodity servers. It offers high fault tolerance, high reliability, high scalability, and high throughput. HDFS is not suitable for applications that require low-latency data access, for storing large numbers of small files, or for scenarios where multiple users write to or arbitrarily modify files.

  • Distributed computing splits a computation over a large amount of data into small pieces that are processed separately by multiple computers; the partial results are then collected and combined into a final result.
  • A distributed file system (Distributed File System) is a file system that manages storage spread across multiple computers in a network.
  • Streaming data access: HDFS adopts an efficient "write once, read many times" access pattern. A data set is typically generated by, or copied from, a data source and then analyzed repeatedly over a long period. Each analysis touches most or all of the data set, so the time to read the entire data set matters more than the latency of reading the first record. The wording in the book is a bit obscure, so to put it simply: streaming data access means processing data a little at a time as it arrives (for example, playing a video while it is still downloading), whereas non-streaming access waits until all the data is ready before processing it (playing the video only after the download finishes).
  • Low-latency data access: applications that need low-latency access to data, on the order of tens of milliseconds, are not suited to HDFS. Remember that HDFS is optimized for high data throughput, which may come at the cost of higher latency. For low-latency access, HBase is currently a better choice.
  • A large number of small files: small files are files much smaller than the HDFS block size. A large number of them puts strain on the whole storage system: (1) since the namenode keeps the file system metadata in memory, a huge number of small files can exhaust the namenode's memory and limit how many files HDFS can store; (2) if MapReduce is used to process the small files, the number of map tasks increases and so does the seek/addressing overhead.
  • Multiple writers, arbitrary file modification: a file in HDFS supports only a single writer at a time, and writes always append data to the end of the file. Multiple concurrent writers and modifications at arbitrary offsets within a file are not supported (see the sketch below).
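To make the single-writer, append-only model concrete, here is a minimal Java sketch using the Hadoop FileSystem API. It assumes a reachable cluster at the hypothetical address hdfs://namenode:9000 and an existing file /data/log.txt; both are placeholders, not values from this article.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnlyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/log.txt"); // hypothetical, must already exist
        // Only one writer may have the file open at a time, and data can only
        // be appended to the end; there is no API for writing at an arbitrary offset.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("one more record\n");
            out.hflush(); // make the appended data visible to readers
        }
        fs.close();
    }
}
```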

2. Analysis of HDFS related concepts

  1.  Block: in an operating system, each disk has a default data block size, which is the smallest unit of data that can be read from or written to the disk (compare the storage terms sector, disk block, and page). HDFS also has the concept of a block, but a much larger one: 128MB by default (configurable via dfs.blocksize). As on a single-disk file system, files on HDFS are split into block-sized chunks that are stored as independent units. Unlike a single-disk file system, however, a file smaller than a block does not occupy the space of an entire block: a 1MB file stored with a 128MB block size uses 1MB of disk space, not 128MB (further reading: block, packet, and chunk in HDFS). Note: the larger the block, the smaller the share of time spent seeking relative to transferring data, but the longer each block transfer takes; the smaller the block, the larger the share of seek time, even though each individual transfer is shorter.
  2. Namenode (management node): manages the file system namespace, maintaining the file system tree and the metadata of all files and directories in it. This information is stored persistently on local disk in two files: the namespace image file (FSImage) and the edit log file (EditLog). The namenode also records which datanodes hold the blocks of each file, but it does not persist the block locations, because this information is rebuilt from datanode reports when the system starts.

    1. FSImage (namespace image file): the FSImage holds the most recent metadata checkpoint and is loaded when HDFS starts. It contains information about every directory and file in the HDFS file system: for a file, the block descriptions, modification time, access time, and so on; for a directory, the modification time and access-control information (owner and group), and so on. FSImage files are stored with the prefix fsimage_.
    2. EditLogs (edit log files): the EditLogs record every update operation performed on HDFS after the NameNode has started; every write operation issued by an HDFS client is recorded in the EditLogs. Edit log files are stored with the prefix edits_. Both the FSImage and EditLogs files live under the path ${dfs.namenode.name.dir}/current/.
  3. Datanode (data node): the worker node of the file system, where the data is actually stored. Datanodes store and retrieve blocks as requested (by clients or by the namenode) and periodically send the namenode a list of the blocks they hold.
  4. Secondary Namenode (auxiliary/checkpoint node): the NameNode keeps metadata as "FSImage + EditLogs", and the edit logs are merged into the FSImage only when the NameNode restarts, producing the latest snapshot of the file system (this merge of the edit log into the FSImage is called a checkpoint). In production, however, the NameNode is rarely restarted, so after running for a long time the edit logs grow very large, which causes problems: restarting the NameNode when it becomes necessary (for example after a crash) takes a long time, because all the EditLogs must be merged into the FSImage. To solve this, we need an easy-to-manage mechanism that keeps the edit logs small and maintains an up-to-date FSImage, which also relieves pressure on the NameNode. This is similar to Windows restore points, which take a snapshot of the OS so that when a problem occurs the system can roll back to the latest restore point. The Secondary Namenode does exactly this: it periodically (every hour by default, configurable via dfs.namenode.checkpoint.period; in addition, dfs.namenode.checkpoint.txns, 1 million by default, defines the number of uncheckpointed transactions on the NameNode that forces an urgent checkpoint even if the checkpoint period has not been reached) fetches the FSImage and EditLogs from the NameNode, merges them, and sends the new FSImage back to the NameNode. The configuration sketch after this list shows these parameters.
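The parameters mentioned in this list (dfs.blocksize, dfs.replication, dfs.namenode.checkpoint.period, dfs.namenode.checkpoint.txns) normally live in hdfs-site.xml. As a minimal sketch, they can also be set programmatically on a Hadoop Configuration object; the values below simply restate the defaults quoted above.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsDefaultsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Block size: 128 MB by default; files are split into chunks of this size.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // Replication factor: each block is stored as 3 replicas by default.
        conf.setInt("dfs.replication", 3);

        // Secondary NameNode checkpoint interval: 1 hour (value is in seconds).
        conf.setLong("dfs.namenode.checkpoint.period", 3600);

        // Force a checkpoint after 1,000,000 uncheckpointed transactions,
        // even if the checkpoint period has not yet elapsed.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println("dfs.blocksize = " + conf.get("dfs.blocksize"));
    }
}
```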

HDFS adopts a master/slave architecture. An HDFS cluster consists of one NameNode and a number of DataNodes.

3. HDFS file writing process

As shown in the figure above, HDFS is distributed across three racks: Rack1, Rack2, and Rack3.

At this point, assume that there is a file FileA with a size of 100MB, and the block size of HDFS is 64MB.

a.  The client sends a write request to the NameNode, as shown by the blue dotted line ① in the figure above.

b.  Based on the file size and the block size setting, the NameNode divides FileA into Block1 (64MB) and Block2 (36MB);

c.  The NameNode returns the available DataNodes according to the DataNodes' addresses and the rack-awareness policy, as shown by the pink dotted line ②. Further reading: HDFS Rack Awareness.

Note: HDFS uses a rack-awareness strategy to improve data reliability, availability, and network bandwidth utilization. By default the replication factor is 3 (configurable via dfs.replication). The placement policy is to store one replica on a node in the local rack, a second replica on another node in the same rack, and the last replica on a node in a different rack. This reduces data transfer between racks and improves the efficiency of write operations. Rack failures are far rarer than node failures, so the policy does not hurt data reliability or availability. It also lowers the total network bandwidth needed for reads, because the blocks are placed on only two (not three) different racks.
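To check where the replicas of a file actually landed, a client can ask the NameNode for the block locations. The sketch below is illustrative only: the NameNode address and the path /data/fileA are hypothetical, and the topology strings (such as /rack1/host2) are populated only when the cluster has rack awareness configured.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaPlacementSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // hypothetical address

        FileStatus status = fs.getFileStatus(new Path("/data/fileA")); // hypothetical file
        System.out.println("replication factor: " + status.getReplication());

        // Ask the NameNode for the DataNodes holding each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block @ offset " + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts())
                    + " topology=" + String.join(",", block.getTopologyPaths()));
        }
        fs.close();
    }
}
```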

The NameNode returns DataNodes that are closest in the network topology; if several candidates are equally close, one is chosen at random. In this example, the returned DataNodes are:

    Block1: host2,host1,host3

    Block2: host7,host8,host4

d.  The client sends Block1 to the DataNodes; the data is written as a stream. The streaming write proceeds as follows:

        1> The 64MB Block1 is divided into 64KB packets;

        2> The client sends the first packet to host2;

        3> After host2 receives it, it forwards the first packet to host1, and at the same time the client sends the second packet to host2;

        4> After host1 receives the first packet, it forwards it to host3 while receiving the second packet from host2;

        5> And so on, as shown by the solid red line in the figure, until all of Block1 has been sent;

        6> host2, host1, and host3 then notify the NameNode, and host2 notifies the client, that Block1 has been received, as shown by the solid pink line in the figure;

        7> After receiving the notification from host2, the client tells the NameNode that Block1 has been written completely, as shown by the thick yellow line in the figure; this finishes Block1;

        8> After Block1 has been sent, Block2 is sent to host7, host8, and host4, as shown by the solid blue line in the figure;

        9> After Block2 has been sent, host7, host8, and host4 notify the NameNode, and host7 notifies the client, as shown by the solid light-green line in the figure;

        10> The client tells the NameNode that Block2 (and thus the whole file) has been written, as shown by the thick yellow solid line in the figure. The write is complete; a client-side sketch of this write path follows below.
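From the client's point of view, all of the pipeline detail above is hidden behind a single output stream: the client asks the NameNode to create the file, then writes bytes, and the stream implementation splits them into packets and pushes them block by block through the DataNode pipeline chosen by the NameNode. A minimal sketch, again with hypothetical addresses and paths, and using the 64MB block size from the example:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // hypothetical address

        // Local source file and HDFS destination path (both hypothetical).
        InputStream in = new BufferedInputStream(new FileInputStream("/tmp/FileA"));
        Path dst = new Path("/user/demo/FileA");

        // create() asks the NameNode to allocate the file; the returned stream
        // buffers the data into packets and streams it through the DataNode
        // pipeline (host2 -> host1 -> host3 in the example above).
        FSDataOutputStream out = fs.create(dst, true /* overwrite */,
                4096 /* buffer size */, (short) 3 /* replication */,
                64L * 1024 * 1024 /* 64MB block size */);

        IOUtils.copyBytes(in, out, conf, true); // copy everything, then close both streams
        fs.close();
    }
}
```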

4. HDFS file reading process

The read operation is relatively simple. As shown in the figure above, FileA consists of Block1 and Block2. The client reads FileA from the DataNodes as follows:

a.  The client sends a read request to the NameNode.

b.  The NameNode checks its metadata and returns the locations of FileA's blocks:

    Block1: host2,host1,host3

    Block2: host7,host8,host4

c.  The blocks are read in order: Block1 first, then Block2. Block1 is read from host2 and Block2 from host7.

In the example above, the client is located outside the racks. If the client were instead running on a DataNode inside a rack, for example on host6, the rule for reading would be: prefer to read the data from the local rack.
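On the read side the client simply opens the file: open() fetches the block locations from the NameNode, and the input stream then reads Block1, Block2, and so on in order, each from the nearest DataNode holding a replica (preferring the local rack, as noted above). A minimal sketch under the same hypothetical addresses and paths:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // hypothetical address

        Path src = new Path("/user/demo/FileA"); // hypothetical file
        // open() contacts the NameNode for the block locations; the stream then
        // reads each block in order from the closest replica.
        try (FSDataInputStream in = fs.open(src)) {
            IOUtils.copyBytes(in, System.out, 4096, false); // dump the file to stdout
        }
        fs.close();
    }
}
```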

5. Conclusion

I feel that the description of the HDFS read and write process here is not very thorough, so here are a few blog posts for further reading:

Introduction to the HDFS read and write process, what is the principle of HDFS reading and writing data?

HDFS read and write process (the most refined and detailed in history)

HDFS write file process (details must see)

 

 
