Comparison of data deletion principles between HDFS and Ozone

Preface


Some time ago I spent a while wrestling with the performance problems of deleting large directories in HDFS, and tried several different approaches to reduce the impact of large directory deletion, including changing the internal child-list structure of INodeDirectory and improving things at the snapshot level (for details, see the earlier article: Practical thinking on the HDFS large directory file deletion program). However, when I compared the HDFS delete operation with the delete operation inside the new-generation storage system Ozone, I found many differences between the two, and the latter shows considerable improvement in its design. In this article I will compare the delete operations of HDFS and Ozone, and brainstorm some ideas about HDFS's existing delete operation.

Performance issues of existing HDFS delete operations


Back to this old question, the performance problem of the HDFS delete operation, which can be summarized in two points: first, the operation itself is heavy; second, its impact on the system is large.

The process is briefly as follows:

  • 1) Recursively traverse the directory tree, collecting the INode instances that need to be removed (both INode directories and INode files) and, under each INodeFile, the blocks to be deleted.
  • 2) Then execute the block deletions in batches, releasing the global lock after each batch and re-acquiring it before processing the next one.
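The two steps above can be sketched in a few lines. This is a minimal illustration of the collect-then-batch pattern, not the actual NameNode code: the class and method names are made up, and the batch size constant is illustrative (HDFS's block deletion increment is configurable).

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of HDFS's two-phase delete: blocks are collected first,
// then removed in fixed-size batches, releasing the global lock between
// batches so other operations can make progress.
public class TwoPhaseDelete {
    // Per-batch limit (illustrative; HDFS's deletion increment is configurable).
    static final int BLOCK_DELETION_INCREMENT = 1000;

    public static int deleteBlocks(List<Long> collectedBlocks, Object fsnLock) {
        int batches = 0;
        int i = 0;
        while (i < collectedBlocks.size()) {
            synchronized (fsnLock) {           // re-acquire the lock for each batch
                int end = Math.min(i + BLOCK_DELETION_INCREMENT, collectedBlocks.size());
                for (int j = i; j < end; j++) {
                    // removeBlock(collectedBlocks.get(j)) would run here
                }
                i = end;
                batches++;
            }                                   // lock released between batches
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> blocks = new ArrayList<>();
        for (long b = 0; b < 2500; b++) blocks.add(b);
        System.out.println(deleteBlocks(blocks, new Object())); // prints 3
    }
}
```

Note that the collection step itself still happens under the lock in today's HDFS; only the block removal is chunked.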

Because the above process modifies NameNode metadata, it is executed while holding the FSN global lock, which makes the whole process very heavy.

The following figure is a brief schematic diagram of the HDFS internal data deletion process:


In the above process, the block information collected for deletion is never persisted; everything happens within a single operation. A natural optimization point is therefore:

Can this process be made asynchronous? And can the one-shot cleanup of a directory be transformed into multiple batched deletes? RemoveBlock is already batched; can the INode cleanup be batched as well, instead of being done in a single traversal?

OK, the starting points of the above proposals are all good, but to achieve these improvements the following problems must be solved:

  • The delete operation is processed asynchronously, but HDFS only releases quota once the INodes are actually cleaned up. How do we keep the quota consistent here? Of course, if we accept a slight loss of quota accuracy in exchange for asynchronous deletion, that trade-off may be acceptable.
  • If INode deletion in the directory tree is batched, and processing stops and exits after deleting a certain number of INodes, how do we resume next time? Some child INodes may be left without parent INode instances; the interrupted batch becomes dirty data residing in the NameNode. Another question: while deletion is paused midway, could a concurrent operation be reading or writing the partially deleted directory?

Brainstorm improvement ideas for HDFS delete operations


Given the thorny problems raised in the previous section, are they unsolvable? No, there is always a feasible approach.

First, the quota inconsistency problem. As discussed above, we can sacrifice a little real-time quota accuracy to make the delete operation asynchronous; I think this trade-off is acceptable.

Then the second one, batching INode deletion, which breaks down into two sub-problems:

  • Recovering from an interrupted deletion. If the number of deleted INodes reaches a threshold midway and processing is forced to exit to release the lock, problems do arise. But we can transform the task: first recursively traverse the directory tree only to collect all the INode instances into a list, then run the batched delete over that list. Resuming an interrupted delete inside the original directory tree is hard; resuming over a flat list is not.
  • The second sub-problem is that the directory being deleted might be read or written during the window between batches when the lock is released. The fix is relatively simple: at the start of the batched INode delete, first rename the target directory or file into a reserved name that users can never access, similar to the existing .Trash directory convention. In short: rename first, then delete.
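The rename-first-then-delete idea above can be sketched as follows. This is a toy model under stated assumptions: the namespace is a flat map rather than a real INode tree, and names like `.reserved-delete` and the class `RenameThenDelete` are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of rename-then-delete: the client-facing "delete" is just a rename
// into a reserved namespace that users can never address, and a background
// pass does the real cleanup later.
public class RenameThenDelete {
    private final Map<String, String> namespace = new ConcurrentHashMap<>();
    private static final String RESERVED_PREFIX = "/.reserved-delete";

    public void create(String path, String data) { namespace.put(path, data); }

    // The delete seen by clients: a cheap rename, no tree traversal.
    public String delete(String path) {
        String hidden = RESERVED_PREFIX + path + "." + System.nanoTime();
        String data = namespace.remove(path);
        if (data != null) namespace.put(hidden, data);
        return hidden;
    }

    public boolean isVisible(String path) { return namespace.containsKey(path); }

    // Background cleanup: actually remove the reserved entries.
    public int backgroundPurge() {
        int purged = 0;
        for (String key : namespace.keySet()) {
            if (key.startsWith(RESERVED_PREFIX)) {
                namespace.remove(key);
                purged++;
            }
        }
        return purged;
    }
}
```

Because the path disappears from the user-visible namespace at rename time, no concurrent reader or writer can touch the directory while the background batches run.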

With the above points handled, we can then start a background thread to formally perform the asynchronous batched cleanup of the directory's files. The imagined new process is shown in the figure below.

In fact, the process shown above is very close to HDFS's current trash cleanup mechanism. The trash mechanism already covers the two stages of renaming into a reserved name and deleting in the background; the only improvable point is to make the INode deletion a batched process instead of directly invoking the NameNode's delete method. My personal, more aggressive idea is that an HDFS delete could become a simple rename, with all the actual work handled asynchronously by a dedicated background service. There is no need to force the client to wait for the delete to complete synchronously, especially in a distributed storage system.

Related design: Delete operation processing in Ozone system


Let's take a look at how another system, the object store Ozone, designs this part.

First of all, Ozone is also a storage system and likewise faces large-scale data deletion. However, its namespace is much simpler: everything is a key-value pair, and there is currently no directory-tree deletion to handle. Its delete behavior is therefore easier to batch; one can simply fetch a batch of keys to be deleted and process them together.

It likewise uses the rename-to-reserved-name plus background-delete-thread model. The figure below shows Ozone's current deletion flow.

The figure shows that Ozone's internal deletion process is slightly more complicated than HDFS's, which stems from the following background reasons:

  • Ozone internally uses Container-based storage, which requires interactive communication between the OM and SCM services, rather than the fully unified management found in HDFS.
  • Ozone stores metadata in a third-party KV database rather than purely in memory, so deletion information can be persisted, which also makes it easier to recover pending deletes after a system restart.

In addition, inside both the OM and SCM services, each delete pass removes only a limited batch of data, so the overall delete behavior has only a small impact on the Ozone system.
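The persisted-pending-keys plus per-pass-limit behavior described above can be sketched like this. It is a simplified model, not Ozone's actual code: a deque stands in for the KV database's pending-delete table, and the class name `KeyDeletingService` and the `limitPerTask` parameter are illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of Ozone-style background deletion: deleted keys are persisted to
// a pending-delete table (here an in-memory deque standing in for the KV db),
// and a background service purges at most a fixed number of keys per run,
// so one large delete cannot monopolize the system.
public class KeyDeletingService {
    private final Deque<String> pendingDeleteTable = new ArrayDeque<>();
    private final int limitPerTask;

    public KeyDeletingService(int limitPerTask) { this.limitPerTask = limitPerTask; }

    // Client-side delete: just persist the key as pending; returns immediately.
    public void deleteKey(String key) { pendingDeleteTable.addLast(key); }

    // One background run: purge at most limitPerTask keys and return them.
    public List<String> runOnce() {
        List<String> purged = new ArrayList<>();
        while (purged.size() < limitPerTask && !pendingDeleteTable.isEmpty()) {
            purged.add(pendingDeleteTable.pollFirst());
        }
        return purged;
    }

    public int pending() { return pendingDeleteTable.size(); }
}
```

Because the pending table is persisted (in RocksDB, in real Ozone), a restart simply resumes purging from whatever keys remain, which is exactly the recovery property HDFS's in-memory approach lacks.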

Comparing Ozone's deletion mechanism with HDFS's existing delete behavior, the former is better in both scalability and performance, while the latter has a simpler design whose drawbacks only show up at large scale. Readers interested in the deletion internals of Ozone and HDFS can continue with the related articles I have written; the links are at the end of this article.

Related articles


[1]. https://blog.csdn.net/Androidlushangderen/article/details/105778885
[2]. https://blog.csdn.net/Androidlushangderen/article/details/77619513

Origin blog.csdn.net/Androidlushangderen/article/details/106456427