Troubleshooting an HDFS Snapshot that cannot be deleted

Preface


As many readers know, HDFS has a very useful Snapshot feature that can be used to protect data from accidental deletion. Some may ask: if data has been deleted, can't I simply restore it from the trash directory? An HDFS Snapshot is not the same thing as the trash. Moving deleted data into the trash directory is a delayed-deletion strategy; if a user performs an operation that truly deletes the data (meaning it is completely removed at the namespace level and does not even exist in the trash), and we have enabled Snapshot protection for that data, then the data can still be recovered from the HDFS Snapshot. However, recovery from a Snapshot involves copying the actual physical data, rather than a simple rename from the trash directory back to the original path, so the recovery procedures for snapshots and the trash are quite different. In this article, the author will share a problem we hit in our internal cluster where HDFS Snapshots could not be deleted. The timeline of the whole troubleshooting process was fairly long, and we took many detours along the way.
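
To make the contrast concrete, here is a minimal Java sketch of the two recovery routes against the Hadoop FileSystem API. The paths and the snapshot name s20210101 are invented for illustration; the point is that a trash restore is a rename while a snapshot restore is a physical copy via the read-only .snapshot path.

  // imports: org.apache.hadoop.conf.Configuration,
  //          org.apache.hadoop.fs.FileSystem, FileUtil, Path
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  // Trash recovery: just rename the data back out of the trash directory.
  fs.rename(new Path("/user/hdfs/.Trash/Current/data/part-00000"),
      new Path("/data/part-00000"));

  // Snapshot recovery: copy the actual bytes back from the read-only
  // .snapshot path, which is far more expensive than a rename.
  FileUtil.copy(fs, new Path("/data/.snapshot/s20210101/part-00000"),
      fs, new Path("/data/part-00000"), false, conf);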

Background


Our internal cluster's HDFS Snapshot strategy is to take a daily snapshot of each protected directory. Put simply, we only protect against accidental deletion of data within the last 24 hours; data deleted or lost before that window is not covered. The reason is that the longer a snapshot is held, the more data that should have been cleaned up stays pinned by it, which eats heavily into already tight cluster storage. Then one day the problem appeared: we found the cluster's used storage growing larger and larger, and the total object count in the NN's metadata staying high. It turned out that the daily snapshots of many large directories had not been deleted. The accumulated daily snapshots looked similar to the following figure:
[Figure: listing of the accumulated daily snapshots under the protected directories]
We quickly checked the script responsible for snapshot creation and deletion, and found that it threw an NPE exception when executing the deleteSnapshot command, so the day's snapshot was not deleted in time. The next day the subsequent daily snapshot was created on schedule, and again was not deleted afterwards.
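
Our script itself is not reproduced here, but its rotation logic amounts to something like the sketch below, built on the real FileSystem#createSnapshot/deleteSnapshot API; the directory handling and the "s" + date naming scheme are illustrative assumptions.

  // imports: java.io.IOException; java.time.LocalDate;
  //          org.apache.hadoop.fs.FileSystem, Path
  void rotateDailySnapshot(FileSystem fs, Path dir) throws IOException {
    String today = "s" + LocalDate.now();                  // e.g. s2021-02-01
    String yesterday = "s" + LocalDate.now().minusDays(1);
    // Delete yesterday's snapshot first -- this is the call that threw the
    // NPE, leaving the old snapshots to pile up.
    fs.deleteSnapshot(dir, yesterday);
    // Then create today's snapshot, so each directory keeps only one.
    fs.createSnapshot(dir, today);
  }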

OK, so the problem had happened. The first task was to figure out how to delete these redundant snapshots; if they could not be deleted, the cluster's storage would blow up sooner or later. The second step was then to analyze the root cause.

Cleaning up the problem snapshots


For the snapshots that could not be cleaned up, I tried executing deleteSnapshot again. The result was still an NPE error, and deleteSnapshot still failed. The exception stack thrown was as follows:

java.lang.NullPointerException
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.storagespaceConsumedNoReplication(INodeFile.java:706)
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.storagespaceConsumed(INodeFile.java:692)
 at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.updateQuotaAndCollectBlocks(FileWithSnapshotFeature.java:147)
 at org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff.destroyDiffAndCollectBlocks(FileDiff.java:118)
 at org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff.destroyDiffAndCollectBlocks(FileDiff.java:38)
 at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:94)
 at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.cleanFile(FileWithSnapshotFeature.java:135)
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.cleanSubtree(INodeFile.java:504)
 at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively

Afterwards I checked the NN's local log, but could not find which path's snapshot deletion had thrown the NPE. At this point our working hypothesis was that the namespace metadata in the HDFS NN's memory had become corrupted.

Since the snapshot could not be deleted through the command no matter how many times we tried, and we suspected the NN's in-memory data was at fault, we restarted the NN, failed over to the freshly restarted NN, and executed deleteSnapshot again. The snapshots were finally cleaned up.

But this was only the beginning; we still did not know the real root cause. At first we treated it as a one-off problem and assumed that restarting the NN would work around it, but a few days later the undeletable-snapshot problem promptly reappeared. At that point we decisively disabled the snapshot feature first and preserved the NN's fsimage file from the time of the failure, and then prepared for further analysis.

Analysis of the Snapshot NPE code


We analyzed the Snapshot code logic that throws the NPE; the exception originates in the following method:

  public final long storagespaceConsumedNoReplication() {
    FileWithSnapshotFeature sf = getFileWithSnapshotFeature();
    if (sf == null) {
      return computeFileSize(true, true);
    }

    // Collect all distinct blocks
    long size = 0;
    Set<Block> allBlocks = new HashSet<Block>(Arrays.asList(getBlocks()));
    List<FileDiff> diffs = sf.getDiffs().asList();
    for (FileDiff diff : diffs) {
      BlockInfoContiguous[] diffBlocks = diff.getBlocks();  // <== NPE here: diff is null
      if (diffBlocks != null) {
        allBlocks.addAll(Arrays.asList(diffBlocks));
      }
    }
    for (Block block : allBlocks) {
      size += block.getNumBytes();
    }
    // check if the last block is under construction
    BlockInfoContiguous lastBlock = getLastBlock();
    if (lastBlock != null &&
        lastBlock instanceof BlockInfoContiguousUnderConstruction) {
      size += getPreferredBlockSize() - lastBlock.getNumBytes();
    }
    return size;
  }

Following sf.getDiffs() leads to the class FileDiffList, which inherits from the parent class AbstractINodeDiffList. The essential problem is that a null element is present in the AbstractINodeDiffList's internal list. But looking at the insert paths of this list class, only the following addDiff method performs an insert:

  /** Add an {@link AbstractINodeDiff} for the given snapshot. */
  final D addDiff(int latestSnapshotId, N currentINode) {
    return addLast(createDiff(latestSnapshotId, currentINode));
  }

Every time the program executes addDiff, the diff is produced by the createDiff call above, which never returns null, so under single-threaded execution no null should ever be inserted into the diffList.
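
For context on what addLast appends to: the diff list inside AbstractINodeDiffList is a plain java.util.ArrayList with no synchronization of its own. The sketch below paraphrases the relevant structure from the branch-2 source, abridged for illustration:

  // Paraphrased/abridged from AbstractINodeDiffList (branch-2)
  /** Diff list sorted by snapshot IDs, i.e. in chronological order. */
  private final List<D> diffs = new ArrayList<D>();

  /** Append the diff at the end of the list. */
  private final D addLast(D diff) {
    final D last = getLast();
    diffs.add(diff);        // a bare ArrayList add, no locking here
    if (last != null) {
      last.setPosterior(diff);
    }
    return diff;
  }

This detail matters later: the correctness of this list relies entirely on callers holding the namespace write lock.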

At this point the code analysis left us in a dilemma. As a code-level mitigation we then made the following 2 changes (a sketch follows the list):

1) Skip null items when traversing the diff list
2) Print out the path information related to the snapshot diff
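
A minimal sketch of what changes 1) and 2) could look like inside the traversal shown earlier; the exact log message and the use of NameNode.LOG here are illustrative rather than our verbatim patch:

    for (FileDiff diff : diffs) {
      if (diff == null) {
        // Mitigation sketch: tolerate a corrupted diff-list entry and log
        // the file's full path so the damaged inode can be traced.
        NameNode.LOG.warn("Found null diff in the diff list of "
            + getFullPathName() + ", skipping it");
        continue;
      }
      BlockInfoContiguous[] diffBlocks = diff.getBlocks();
      if (diffBlocks != null) {
        allBlocks.addAll(Arrays.asList(diffBlocks));
      }
    }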

After redeploying with the above changes, the NN still reported NPE errors at other places where the diffList is traversed, and the printed path information was not helpful enough. So we decided to try to reproduce the problem offline; debugging it in a production cluster was too costly and risky.

Failed offline reproduction of the snapshot problem


Deploying the new code online still did not lead us to the root cause. So we planned to copy the previously backed-up fsimage file to another machine for a pure-NN-mode test (no JN, no DN, no HA). For the operational steps of this part, please refer to the author's earlier blog post: HDFS NameNode fsimage file is corrupted. What to do.

In parallel, we searched the community for JIRA issues related to what we had encountered. In the process we found two JIRAs closely related to deleteSnapshot: HDFS-9406 (FSImage may get corrupted after deleting snapshot) and HDFS-13101 (Yet another fsimage corruption related to snapshot). The fix for the former was already included in our version, so we only verified HDFS-13101, and we did successfully reproduce it on our current Hadoop version. But further analysis showed that the HDFS-13101 scenario is not the same as our snapshot scenario:

First, deleting a snapshot in the HDFS-13101 scenario involves two snapshots at the same time.
Second, it involves data renamed across snapshots.

In our usage scenario, each data directory corresponds to only one snapshot at a time: the previous snapshot must be deleted before the next one can be created. We therefore concluded that HDFS-13101 was not the fix for our problem.

Since no matching issue could be found in the community, could this be a snapshot bug introduced by our own internal code changes? We increasingly suspected a bug caused by the logic we had changed internally.

Re-examining our internal HDFS code changes


We analyzed the calling logic of the problematic method AbstractINodeDiffList#addDiff, and finally found a suspicious internal change.

When we optimized NN performance earlier, we found that a setTimes call only changes a path's times, yet it held the write lock, which had a large impact on the NN. So we changed setTimes from taking the write lock to taking the read lock:

  static boolean setTimes(
      FSDirectory fsd, INode inode, long mtime, long atime, boolean force,
      int latestSnapshotId) throws QuotaExceededException {
    fsd.readLock();  // <-- switched from write lock to read lock
    try {
      return unprotectedSetTimes(fsd, inode, mtime, atime, force,
                                 latestSnapshotId);
    } finally {
      fsd.readUnlock();
    }
  }

This is where the problem comes from. In the downstream logic of unprotectedSetTimes, the setModificationTime and setAccessTime calls on the INode class actually involve changes to the snapshot diff:

  private static boolean unprotectedSetTimes(
      FSDirectory fsd, INode inode, long mtime, long atime, boolean force,
      int latest) throws QuotaExceededException {
    // remove writelock assert due to HADP-35711
    // assert fsd.hasWriteLock();
    boolean status = false;
    if (mtime != -1) {
      inode = inode.setModificationTime(mtime, latest);
      status = true;
    }

    // if the last access time update was within the last precision interval,
    // then no need to store access time
    if (atime != -1 && (status || force
        || atime > inode.getAccessTime() + fsd.getFSNamesystem().getAccessTimePrecision())) {
      inode.setAccessTime(atime, latest);
      status = true;
    }
    return status;
  }

  /** Set the last modification time of inode. */
  public final INode setModificationTime(long modificationTime,
      int latestSnapshotId) {
    recordModification(latestSnapshotId);
    setModificationTime(modificationTime);
    return this;
  }

Every time the modification time or access time is set, recordModification first records the file's current state into the latest snapshot diff, and only then is the time field updated to the new value. Because setTimes now runs under the read lock, multiple threads can perform these diff updates concurrently; in other words, the AbstractINodeDiffList#addDiff shown earlier may execute concurrently. The snapshot diffList is essentially an ArrayList, and ArrayList is not thread-safe, which is how null entries appear.
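
To see why an unsynchronized ArrayList can end up holding null, look at its add method (simplified from the JDK 8 source):

  // java.util.ArrayList (JDK 8), simplified
  public boolean add(E e) {
    ensureCapacityInternal(size + 1);  // may replace elementData with a grown copy
    elementData[size++] = e;           // unsynchronized read-modify-write of size
    return true;
  }

With two racing writers, one thread's increment of size can become visible before its element write, and a concurrent grow() can copy the backing array while another thread's write is still in flight; either way a slot inside [0, size) is left as null, which is exactly what the diffList traversal later trips over.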

When I tested ArrayList directly, I also reproduced null elements being inserted under concurrency. The test code is as follows:

  // Requires: import java.util.ArrayList; import org.junit.Test;
  @Test
  public void test() throws InterruptedException {
    final ArrayList<String> array = new ArrayList<>();

    int numThreads = 100;
    Thread[] threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      threads[i] = new Thread() {
        @Override
        public void run() {
          // Unsynchronized add; under contention this can lose updates,
          // leave null slots, or even throw ArrayIndexOutOfBoundsException.
          array.add(System.currentTimeMillis() + "");
        }
      };
    }
    for (int i = 0; i < numThreads; i++) {
      threads[i].start();
    }
    for (int i = 0; i < numThreads; i++) {
      threads[i].join();
    }
    System.out.println("Array size: " + array.size());
    System.out.println(array);
    // Lost updates can leave size < numThreads, so scan only what is there.
    for (int i = 0; i < array.size(); i++) {
      if (array.get(i) == null) {
        System.out.println("Detect null element: " + i);
      }
    }
  }

The fix: make setTimes ignore snapshot diff updates


After finding the root cause, we immediately set about changing the code. We did not want to roll back our earlier lock change, so instead we made the setTimes path ignore snapshot diff updates, turning setTimes into a pure time-value update. Since setModificationTime/setAccessTime are also referenced by other methods, we added dedicated methods for the setTimes call path, such as:

  public final INode setAccessTimeWithoutSnapshot(long accessTime, int latestSnapshotId) {
    setAccessTime(accessTime);
    return this;
  }
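
For completeness, the modification-time side and the call sites in unprotectedSetTimes change in the same spirit. The sketch below is our illustration of that shape, with the method name setModificationTimeWithoutSnapshot assumed by analogy rather than quoted from the internal patch:

  // Assumed counterpart of setAccessTimeWithoutSnapshot:
  public final INode setModificationTimeWithoutSnapshot(long modificationTime,
      int latestSnapshotId) {
    // Deliberately skip recordModification(latestSnapshotId): no snapshot
    // diff is recorded, so no concurrent addDiff under the read lock.
    setModificationTime(modificationTime);
    return this;
  }

  // unprotectedSetTimes then calls the snapshot-free variants instead:
  //   inode = inode.setModificationTimeWithoutSnapshot(mtime, latest);
  //   inode.setAccessTimeWithoutSnapshot(atime, latest);

The trade-off is that a snapshot taken before a setTimes call no longer preserves the old time values; for a cluster that uses snapshots purely as accidental-deletion protection, as ours does, that is an acceptable price.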

Summary


With that, the undeletable-snapshot problem described in this article was finally solved. The troubleshooting timeline was actually quite long. The lesson here is to review the code logic of every commit more carefully, and to back each merged change with enough test cases to ensure its safety; otherwise troubleshooting will take many detours. In this case, we had not carefully evaluated the potential risk points of setTimes when changing it from the write lock to the read lock.

References


[1] HDFS-9406. FSImage may get corrupted after deleting snapshot. https://issues.apache.org/jira/browse/HDFS-9406
[2] HDFS-13101. Yet another fsimage corruption related to snapshot. https://issues.apache.org/jira/browse/HDFS-13101
