HDFS principles to ace the interview

HDFS architecture

[Figure: HDFS architecture diagram]

NameNode

Stores the metadata of files: file names, the directory structure, file attributes (creation time, number of replicas, permissions), as well as the block list of each file and the DataNodes on which each block is located.
① fsimage: the metadata image file. It stores a snapshot of the NameNode's in-memory metadata as of a certain point in time.
② edits: the operation (edit) log file.
③ fstime: records the time of the last checkpoint.

DataNode

Stores the file block data in the local file system, as well as the checksums of that data.

SecondaryNameNode

An auxiliary daemon that monitors the state of HDFS and obtains snapshots of the HDFS metadata at regular intervals.
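
As an illustration of the block metadata the NameNode serves, here is a minimal client-side sketch (assuming the Hadoop client libraries and a reachable cluster; the class name and the path /tmp/example.txt are illustrative) that asks for the block list of a file and the DataNodes holding each block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // The NameNode answers this from its in-memory block map:
            // one BlockLocation per block, listing the DataNodes that hold a replica.
            FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt"));
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }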

Persistence of meta information

The file that stores metadata in the NameNode is fsimage. While the system is running, all metadata operations are applied in memory and also persisted to a second file, edits. The edits file and the fsimage file are periodically merged by the SecondaryNameNode (the merge process is described in detail in the SecondaryNameNode section).
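
A quick way to see the split between in-memory metadata and the persisted edits log: any pure namespace operation goes through exactly this path. A minimal sketch (class name and path are illustrative; the Hadoop client libraries and a reachable cluster are assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetadataOpExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Creating a directory touches only the namespace: the NameNode updates its
            // in-memory metadata and appends the operation to edits; no DataNode is involved.
            fs.mkdirs(new Path("/tmp/metadata-demo"));
            fs.close();
        }
    }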

NameNode features

Running the NameNode consumes a lot of memory and I/O resources, so the NameNode generally does not store user data or execute MapReduce tasks.
To simplify the system design, Hadoop has only one NameNode, which makes it a single point of failure for the whole cluster.
Fault tolerance for the NameNode is therefore particularly important, and Hadoop provides the following two mechanisms to address it:

Write the Hadoop metadata to the local file system and synchronously replicate it to a remotely mounted network file system (NFS).
Run a SecondaryNameNode. Its role is to interact with the NameNode and periodically merge the namespace image with the edit log. When the NameNode fails, the cluster can be recovered from the merged copy of the namespace image kept by the SecondaryNameNode. Note that the state saved by the SecondaryNameNode always lags behind that of the NameNode, so this approach inevitably loses some recent data (more on this later).

SecondaryNameNode

Note that the SecondaryNameNode is not a backup of the NameNode. As described above, the metadata of all HDFS files is held in the NameNode's memory. When the NameNode starts, it first loads fsimage into memory; while the system is running, every metadata operation is applied in memory as well. To prevent data loss, these operations are also continuously persisted to the local edits file. The purpose of the edits file is to keep the system efficient: the NameNode appends each operation to the edits file before updating the metadata in memory. When the NameNode restarts, edits and fsimage are merged, but this merge slows down the restart. The SecondaryNameNode exists to solve this problem.

The role of SecondaryNameNode is to periodically merge edits and fsimage files. Let's take a look at the merge steps:

1. Before the merge, the NameNode is told to write all new operations to a fresh edits file, named edits.new.
2. The SecondaryNameNode requests the fsimage and edits files from the NameNode.
3. The SecondaryNameNode merges fsimage and edits into a new fsimage file.
4. The NameNode fetches the merged fsimage from the SecondaryNameNode, replaces the old fsimage with it, replaces edits with the edits.new file created in step 1, and updates the checkpoint time in fstime.

To summarize, the NameNode files involved in this process are:
fsimage: holds the HDFS metadata as of the last checkpoint
edits: holds the metadata changes that have occurred since the last checkpoint
fstime: holds the timestamp of the last checkpoint

HDFS read operation

[Figure: HDFS read flow]
1. First call the open method of the FileSystem object, which is actually an instance of DistributedFileSystem.

2. DistributedFileSystem obtains the locations of the first blocks of the file from the NameNode via RPC. The same block returns multiple locations according to its replication factor. These locations are sorted according to the Hadoop network topology, with the one closest to the client ranked first.

3. The first two steps return an FSDataInputStream object, which wraps a DFSInputStream. DFSInputStream manages the data streams to the DataNodes and the NameNode. When the client calls read, DFSInputStream finds the DataNode closest to the client and connects to it.

4. Data flows continuously from the datanode to the client.

5. When the first block has been read completely, the connection to that block's DataNode is closed, and the client moves on to the next block. These operations are transparent to the client, which just sees a continuous stream.

6. When the first batch of blocks has been read, DFSInputStream asks the NameNode for the locations of the next batch of blocks and continues reading. When all blocks have been read, all streams are closed.

If DFSInputStream hits a communication error with a DataNode while reading, it tries the next closest DataNode for the block being read, records which DataNode failed, and skips that DataNode for the remaining blocks. DFSInputStream also verifies the block checksums; if a corrupt block is found, it is first reported to the NameNode, and DFSInputStream then reads a replica of that block from another DataNode.

The design is that the client connects directly to the DataNodes to retrieve the data, while the NameNode is only responsible for providing the optimal DataNode for each block. The NameNode handles only block-location requests, and this information fits in the NameNode's memory, so HDFS can sustain a large number of concurrent client accesses through the DataNode cluster.
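
The client-side code corresponding to this flow is very short; everything above happens behind the open and read calls. A minimal sketch, assuming a reachable cluster, the Hadoop client libraries, and an illustrative class name and path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // For an hdfs:// default file system this returns a DistributedFileSystem.
            FileSystem fs = FileSystem.get(conf);
            // open() triggers the block-location lookup on the NameNode (steps 1-3).
            try (FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
                // The stream pulls block after block from the closest DataNodes (steps 4-6).
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }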

HDFS write operation

[Figure: HDFS write flow]
1. The client creates a new file by calling the create method of DistributedFileSystem.

2. DistributedFileSystem calls the NameNode via RPC to create a new file entry that is not yet associated with any blocks. Before creating it, the NameNode performs various checks, such as whether the file already exists and whether the client has permission to create it. If the checks pass, the NameNode records the new file; otherwise an IOException is thrown.

3. The first two steps return an FSDataOutputStream object which, as with reading, wraps a DFSOutputStream. DFSOutputStream coordinates with the NameNode and the DataNodes. The client writes data to DFSOutputStream, which splits the data into small packets and arranges them into a data queue.

4. The DataStreamer consumes the data queue. It first asks the NameNode which DataNodes are best suited to store the new block (for example, with a replication factor of 3, it gets the 3 most suitable DataNodes) and arranges them into a pipeline. The DataStreamer sends each packet from the queue to the first DataNode in the pipeline; the first DataNode forwards it to the second, and so on.

5. DFSOutputStream also maintains an acknowledgement queue (ack queue), likewise made up of packets, which waits for acknowledgements from the DataNodes. Only when every DataNode in the pipeline has confirmed receipt of a packet is that packet removed from the ack queue.

If a DataNode fails during the write, the following steps are taken:

1) The pipeline is closed.
2) To prevent packet loss, the packets in the ack queue are put back into the data queue.
3) The partially written block on the failed DataNode is deleted.
4) The remainder of the block is written to the remaining healthy DataNodes (the other two in this example).
5) The NameNode later arranges another DataNode to create an additional replica of this block. All of this is invisible to the client.

6. After the client finishes writing data, call the close method to close the write stream.

7. The DataStreamer flushes the remaining packets into the pipeline and waits for their acks. After the last ack is received, the client notifies the NameNode that the file is complete.

Note: after the client has written data, blocks that have been fully written are visible to readers, but the block currently being written is not. Only by calling the sync method (hflush/hsync in current APIs) can the client guarantee that the data written so far is visible. When the client calls close, sync is invoked implicitly. Whether you need to call it manually is a trade-off between data robustness and throughput, depending on the needs of the application.
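
For completeness, a minimal write-side sketch (same assumptions as the read example; class name and path are illustrative). The packet queues, pipeline and acks described above are all hidden behind the output stream, and hflush() plays the role of the sync call mentioned in the note:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // create() performs the NameNode RPC (existence/permission checks) from step 2.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example-out.txt"))) {
                out.writeBytes("hello hdfs\n");
                // hflush() makes the data written so far visible to new readers
                // without closing the stream (the "sync" of the note above).
                out.hflush();
            } // close() flushes the remaining packets and completes the file on the NameNode.
            fs.close();
        }
    }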

hdfs file deletion

The hdfs file deletion process generally requires the following steps:

1. When the deletion starts, the NameNode simply renames the deleted file into the /trash directory. Because renaming is only a metadata change, this step is very fast. Files in /trash are kept for a configurable interval (6 hours by default), and during this period the file can easily be restored simply by moving it back out of /trash.
2. When the configured time has elapsed, the NameNode deletes the file from the namespace.
3. The file's blocks are marked as deleted so their space can be released, and the free space reported by the HDFS file system increases.
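
Programmatically, the trash behaviour described above is not automatic: FileSystem.delete() bypasses the trash, while the shell's rm uses the Trash helper when fs.trash.interval is greater than 0. A small sketch under those assumptions (class name and path are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class TrashDeleteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/example-out.txt");
            // Renames the file into the user's .Trash directory, mirroring what
            // "hdfs dfs -rm" does when the trash is enabled; fs.delete(file, false)
            // would remove it immediately without going through the trash.
            Trash.moveToAppropriateTrash(fs, file, conf);
            fs.close();
        }
    }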

hdfs file recovery

Like the recycle bin in Linux, HDFS creates a recycle-bin directory for each user: /user/username/.Trash/. Every file or directory deleted by the user through the shell has a life cycle in this recycle bin: if the user does not restore it within a certain period, HDFS automatically deletes it permanently, and after that the file or directory can never be recovered. Internally, the NameNode starts a background thread called Emptier, which manages and monitors everything under the recycle bins and automatically deletes the files and directories whose life cycle has expired, although the granularity of this management is coarse. Users can also empty the recycle bin manually. Emptying the recycle bin works the same way as deleting an ordinary directory, except that HDFS detects that the directory is a recycle bin and, of course, does not move it into the trash again.

As described above, when a user deletes a file via the command line (the HDFS shell), the file is not removed from HDFS immediately. Instead, HDFS renames the file and moves it into the operating user's recycle bin directory (for example /user/hdfs/.Trash/Current, where hdfs is the user name). If a file or directory with the same name already exists in the user's recycle bin, HDFS renames the newly deleted one by appending a number to its name (starting from 1 and increasing until there is no name collision).

While the file is still under /user/hdfs/.Trash/Current, it can be restored quickly. How long a file is kept in /user/hdfs/.Trash/Current is configurable; once this time is exceeded, the NameNode deletes the file from the namespace, which also releases the data blocks associated with it. Note that there is a delay between the moment the user deletes the file and the moment the free space in HDFS actually increases.

While the deleted file is still under /user/hdfs/.Trash/Current, the user can restore it by browsing that directory and moving the file back out. The /user/hdfs/.Trash/Current directory keeps only the most recent copy of each deleted file. It is no different from any other directory, except for one thing: HDFS applies a special policy to it that deletes its files automatically. The current default policy is to delete files that have been kept for more than 6 hours; in the future this policy will be exposed as a configurable interface.

In addition, the NameNode uses a background thread (org.apache.hadoop.fs.TrashPolicyDefault.Emptier by default; a different TrashPolicy class can be specified via fs.trash.classname) to periodically empty all users' recycle bins. It empties each user's recycle bin every interval minutes: it first scans all directories of the form yyMMddHHmm under /user/username/.Trash, then deletes the ones whose life span exceeds interval, and finally renames the current trash directory /user/username/.Trash/Current to a new checkpoint directory /user/username/.Trash/yyMMddHHmm.

From the way this Emptier thread works, a file deleted from the command line stays in the recycle bin for at most 2*interval minutes and at least interval minutes; after this period the deleted file can no longer be restored.
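
Restoring a file is nothing more than renaming it back out of the trash, since the copy under .Trash/Current keeps the original directory layout. A minimal sketch with illustrative paths and class name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrashRestoreExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // The trash copy preserves the original path layout under .Trash/Current,
            // so restoring the file is just a rename back to its original location.
            Path inTrash = new Path("/user/hdfs/.Trash/Current/tmp/example-out.txt");
            Path original = new Path("/tmp/example-out.txt");
            fs.rename(inTrash, original);
            fs.close();
        }
    }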

Configuration

The Hadoop recycle bin (trash) is disabled by default.

1. Modify conf/core-site.xml and add

    <property>  
      <name>fs.trash.interval</name>  
      <value>1440</value>  
      <description>Number of minutes between trash checkpoints.  
      If zero, the trash feature is disabled.  
      </description>  
    </property>  

The default is 0 and the unit is minutes. Here it is set to 1 day (60*24 = 1440 minutes).
After this is enabled, data removed with rm is moved to the .Trash directory under the current user's home directory instead of being deleted immediately.
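
To check from client code whether this setting has taken effect, the Trash helper can be queried directly. A small sketch (class name illustrative; it reads fs.trash.interval from the client configuration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Trash;

    public class TrashCheckExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // isEnabled() reports true when fs.trash.interval > 0 in the client configuration.
            Trash trash = new Trash(fs, conf);
            System.out.println("Trash enabled: " + trash.isEnabled());
            fs.close();
        }
    }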


Source: blog.csdn.net/qq_42706464/article/details/108800531