HDFS Decommission Analysis

By tuning a few configuration options and reworking one data structure, we quickly resolved the problem of slow HDFS decommission.
Previous article: Troubleshooting notes on frequent OOM in Open-Falcon-Graph

Background

The c3prc-xiaomi RAID cluster stores a large number of single-replica files, and decommissioning a single datanode was very slow. Looking at the monitoring metrics, we found:

  • NIC traffic stays low, around 60-80 MB/s

  • Disk IO util: only a single disk is at 100% at any given time, i.e. only one disk is working at a time

So decommissioning is very slow: a datanode finishes only about 40k blocks per day, while a datanode holds about 200k blocks on average (roughly 5 days per node at that rate). This obviously does not meet the speed requirement.

We initially suspected a configuration problem. We analyzed several configuration options, which helped somewhat and raised the rate to about 60k blocks per day, but it was still slow.

Finally we went down to the code level and changed a data structure, which made decommission significantly faster: a datanode can now be decommissioned in 1-2 days on average.

Configuration Analysis

public static final String DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY = "dfs.namenode.replication.max-streams";
public static final int DFS_NAMENODE_REPLICATION_MAX_STREAMS_DEFAULT = 2;
public static final String DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_KEY = "dfs.namenode.replication.max-streams-hard-limit";
public static final int DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT = 4;

BlockManager.maxReplicationStreams has two effects:

(1) When selecting the source of a block to copy in chooseSourceDatanode(), if a DN already has more than maxReplicationStreams blocks being copied out, it is no longer chosen as the source.

(2) Each time HeartbeatManager wants to send a DNA_TRANSFER command to a DN, it takes a certain number of blocks to transfer from DatanodeDescriptor.replicateBlocks, and the number of DataTransfer threads a DN can start must not exceed maxReplicationStreams.

BlockManager.replicationStreamsHardLimit is similar to the variable above, but in chooseSourceDatanode(), if the block has the highest priority, the DN is allowed to copy a bit more (the defaults are 2 and 4 respectively), though never more than replicationStreamsHardLimit.

So at most replicationStreamsHardLimit blocks can be selected at the same time, and they all sit on the same DN. Simply raising the hard limit, however, did not help much.
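To make the interplay between the two limits concrete, here is a minimal, self-contained sketch of our reading of the source-selection check (an illustration only, not the actual chooseSourceDatanode() code; the class and method names are made up):

public class SourceLimitSketch {
  // defaults of dfs.namenode.replication.max-streams and
  // dfs.namenode.replication.max-streams-hard-limit
  static final int MAX_REPLICATION_STREAMS = 2;
  static final int REPLICATION_STREAMS_HARD_LIMIT = 4;

  // a DN may serve as a replication source only while its count of blocks
  // being copied out stays under the soft limit, or under the hard limit
  // for highest-priority blocks
  static boolean eligibleAsSource(int blocksBeingReplicated, boolean highestPriority) {
    if (blocksBeingReplicated >= REPLICATION_STREAMS_HARD_LIMIT) {
      return false;                     // never exceed the hard limit
    }
    if (!highestPriority && blocksBeingReplicated >= MAX_REPLICATION_STREAMS) {
      return false;                     // ordinary blocks respect the soft limit
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(eligibleAsSource(3, false)); // false: over the soft limit
    System.out.println(eligibleAsSource(3, true));  // true: highest priority may go up to the hard limit
  }
}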

What really controls how much data a DN copies out is still maxReplicationStreams. The DN reports the number of running transfer threads to the NN through its heartbeat, and the NN then sends at most maxTransfer blocks in a DNA_TRANSFER command back to the DN:

// get datanode commands
final int maxTransfer = blockManager.getMaxReplicationStreams() - xmitsInProgress;
// xmitsInProgress = the number of transfers currently in progress on the DN

For each DN, up to maxTransfers blocks to be copied are taken off its queue:

public List<BlockTargetPair> getReplicationCommand(int maxTransfers) {
  return replicateBlocks.poll(maxTransfers);
}
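Tying the two snippets above together, a rough model of what a single heartbeat hands one DN might look like this (a sketch under our assumptions; the block names and counts are invented for illustration):

import java.util.*;

public class HeartbeatTransferSketch {
  public static void main(String[] args) {
    int maxReplicationStreams = 12; // dfs.namenode.replication.max-streams
    int xmitsInProgress = 5;        // transfer threads the DN reported in its heartbeat

    // pending replication work queued on the DatanodeDescriptor
    Deque<String> replicateBlocks = new ArrayDeque<>(Arrays.asList(
        "blk_1", "blk_2", "blk_3", "blk_4", "blk_5",
        "blk_6", "blk_7", "blk_8", "blk_9", "blk_10"));

    // the NN hands out at most maxReplicationStreams - xmitsInProgress blocks
    int maxTransfer = maxReplicationStreams - xmitsInProgress;
    List<String> command = new ArrayList<>();
    for (int i = 0; i < maxTransfer && !replicateBlocks.isEmpty(); i++) {
      command.add(replicateBlocks.poll()); // packed into one DNA_TRANSFER command
    }
    System.out.println("DNA_TRANSFER carries " + command.size() + " blocks: " + command);
  }
}

With 5 transfers already in progress and a limit of 12, only 7 new blocks are handed out, which is why the per-DN transfer rate is capped by maxReplicationStreams.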

Conclusion 1:

Raising these two parameters, from 2 to 4 and then to 8, had an obvious effect; the DN logs show a corresponding increase in the number of concurrent transfer threads.

But raising them to 12 or beyond brought little further effect and no overall speedup.

A value of 12 is currently reasonable, matching the number of disks on a single DN.

public static final String DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION = "dfs.namenode.replication.work.multiplier.per.iteration";
public static final int DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION_DEFAULT = 2;

In the NameNode, BlockManager's ReplicationMonitor runs every 3 seconds: computeDatanodeWork() takes a batch of blocks and then tells DNs to copy them. The number of blocks taken per round is:

final int blocksToProcess = numlive * this.blocksReplWorkMultiplier;

I initially thought raising it would speed up decommission, but this parameter has little impact: its only role is to move blocks into the DatanodeDescriptor waiting queue (replicateBlocks).

Observing the NN logs, we found the following problems:

  • Every ReplicationMonitor iteration prints log lines like:

ask dn_ip to replicate blk_xxx to another_dn

The number of such lines equals blocksToProcess, and dn_ip is the same in all of them.

  • Each iteration (and for a period of time afterwards) asks the same DN to replicate blocksToProcess blocks (800 on c3prc-xiaomi), while each DN copies at most maxReplicationStreams at a time.

  • In other words, the NN does a lot of ineffective work: the blocks it takes all belong to the same DN, and when it moves on to the next DN the same problem repeats. The DNs do not all work at the same time; monitoring shows each DN copying data out only intermittently.

When ReplicationMonitor takes blocksToProcess blocks in one round, those blocks may all sit on the same DN, or even on the same disk of that DN.

So we analyzed the object the blocks are taken from each round: if the blocks taken out were not consecutive, different DNs and different disks could work simultaneously.

Data Structure Analysis

The decommission-related code is concentrated mainly in BlockManager, and in particular UnderReplicatedBlocks:

/**
 * Store set of Blocks that need to be replicated 1 or more times.
 * We also store pending replication-orders.
 */
public final UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();

Every block that needs to be replicated is placed into BlockManager.neededReplications.

UnderReplicatedBlocks is a composite structure holding five priority queues (LEVEL = 5):

private final List<LightWeightHashSet<Block>> priorityQueues

A block with only one replica, or whose replicas are all on decommissioning nodes, goes into the highest-priority queue; decommissioning a single-replica RAID cluster is exactly this case.

Each priority queue is implemented as a LightWeightLinkedSet, an ordered HashSet whose elements are additionally linked together in insertion order.

UnderReplicatedBlocks guarantees that every element in the queues is eventually taken, and that elements within a queue are taken out in order, so no block is starved of the chance to be taken.

To achieve this, it keeps an offset (replication index) for each queue:

private Map<Integer, Integer> priorityToReplIdx = new HashMap<Integer, Integer>(LEVEL);

When we reach the end of the last queue (LEVEL-1), all offsets are reset to 0 and we start over.

The five queues are FIFO: elements are appended at the tail, and priority level 0 is preferred when taking blocks. The exact selection algorithm is in UnderReplicatedBlocks.chooseUnderReplicatedBlocks().
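To illustrate the offset-based round robin described above, here is a simplified model (our own sketch, not the real chooseUnderReplicatedBlocks() implementation, which also balances how many blocks each priority contributes):

import java.util.*;

public class RoundRobinSketch {
  static final int LEVEL = 5;
  final List<List<String>> priorityQueues = new ArrayList<>();
  final Map<Integer, Integer> priorityToReplIdx = new HashMap<>();

  RoundRobinSketch() {
    for (int i = 0; i < LEVEL; i++) {
      priorityQueues.add(new ArrayList<>());
      priorityToReplIdx.put(i, 0);
    }
  }

  // take up to blocksToProcess blocks, continuing from each queue's saved offset
  List<String> choose(int blocksToProcess) {
    List<String> chosen = new ArrayList<>();
    for (int p = 0; p < LEVEL && chosen.size() < blocksToProcess; p++) {
      int idx = priorityToReplIdx.get(p);
      List<String> q = priorityQueues.get(p);
      while (idx < q.size() && chosen.size() < blocksToProcess) {
        chosen.add(q.get(idx++));            // blocks come out in stored order
      }
      priorityToReplIdx.put(p, idx);
    }
    // once every queue has been consumed to its end, reset all offsets
    boolean allConsumed = true;
    for (int p = 0; p < LEVEL; p++) {
      if (priorityToReplIdx.get(p) < priorityQueues.get(p).size()) {
        allConsumed = false;
        break;
      }
    }
    if (allConsumed) {
      for (int p = 0; p < LEVEL; p++) {
        priorityToReplIdx.put(p, 0);
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    RoundRobinSketch s = new RoundRobinSketch();
    s.priorityQueues.get(0).addAll(Arrays.asList("blk_a", "blk_b", "blk_c"));
    System.out.println(s.choose(2)); // [blk_a, blk_b]
    System.out.println(s.choose(2)); // [blk_c], then offsets reset for the next pass
  }
}

In a LightWeightLinkedSet the stored order is insertion order, which is exactly why a decommissioning DN's blocks come out back to back.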

During decommission, every block on the DN may be added to the neededReplications queue (with three replicas a block has three possible sources, but on the RAID cluster a single-replica block has only one source DN). At that point the blocks added to a given priority queue (LightWeightLinkedSet) are ordered, and tens of thousands of consecutive blocks belong to the same DN, possibly even laid out contiguously on the same disk. Since blocks are taken from the queue in order from head to tail, this causes the problem above, especially in the single-replica case.

We therefore want to take blocks out of the priority queue in random order, yet still guarantee that every block is eventually taken, so some notion of order must be preserved.

Data Structure Transformation: ShuffleAddSet

The original design is in fact reasonable: whatever enters the priority queue first gets taken first. But our scenario does not require blocks to come out in the same order they went in. If the order is shuffled, each iteration can take out blocks that live on different disks or different DNs.

LightWeightLinkedSet: the default priority queue used by UnderReplicatedBlocks. It extends LightWeightHashSet and additionally links its elements in a doubly linked list with head and tail pointers.

LightWeightHashSet: a lightweight HashSet.

ShuffleAddSet: behaves like LightWeightLinkedSet from the outside.

We implemented ShuffleAddSet by inheriting from LightWeightHashSet and keeping its behavior as close to LightWeightLinkedSet as possible, so that UnderReplicatedBlocks needs very few changes.

ShuffleAddSet holds two queues internally but behaves like a single priority queue externally.

The first queue is the same as in LightWeightLinkedSet: an ordered, doubly linked set. External calls take elements from this set.

The second queue is a buffer, cachedAddList. We first tried ArrayList and LinkedList but dropped them for performance reasons; it now also uses LightWeightHashSet, taking advantage of a HashSet's naturally unordered nature.

Whenever a new element is added, it is first placed into cachedAddList. When the first queue becomes empty or has been consumed to the end, the data in cachedAddList is immediately shuffled and copied into the first queue, and cachedAddList is emptied so it can keep receiving new elements. Because a HashSet is already unordered, the explicit shuffle step can even be skipped and the data copied straight from cachedAddList into the first queue.

Note that taking data (via the iterator) and adding data must be synchronized, because a take that reaches the end of the first queue, or finds it empty, triggers the shuffle-and-add operation that empties cachedAddList.

In summary, data is moved in from the second queue only when the first queue is empty or has been consumed to the end; if both are empty, the whole priority queue is empty. Each batch coming from cachedAddList is in random order; although this shuffles batch by batch rather than the whole set at once, the first queue is on the whole still unordered.
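A minimal model of the two-queue idea (our own simplification; the real ShuffleAddSet extends LightWeightHashSet and keeps the interface UnderReplicatedBlocks expects):

import java.util.*;

public class ShuffleAddSketch<T> {
  private final Deque<T> mainQueue = new ArrayDeque<>();    // consumed by callers
  private final List<T> cachedAddList = new ArrayList<>();  // buffer for new adds

  // new elements always land in the buffer first
  public synchronized void add(T element) {
    cachedAddList.add(element);
  }

  // when the main queue runs out, shuffle the buffer into it; add and take
  // are synchronized because a take may trigger the shuffle-and-move
  public synchronized T poll() {
    if (mainQueue.isEmpty()) {
      Collections.shuffle(cachedAddList);
      mainQueue.addAll(cachedAddList);
      cachedAddList.clear();
    }
    return mainQueue.poll();      // null means the whole set is empty
  }

  // both queues must be checked, which is why contains() costs more than before
  public synchronized boolean contains(T element) {
    return mainQueue.contains(element) || cachedAddList.contains(element);
  }
}

The production version uses a LightWeightHashSet as the buffer, so, as noted above, the explicit shuffle can even be dropped.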

Two caveats with this approach:

(1) An element added through the external add operation does not go straight into the first queue, so insertion order is no longer respected: an element added later may end up being scheduled before one that was added earlier.

(2) In the extreme case where elements are added one at a time and taken immediately, or where the number and frequency of takes far exceeds that of adds (e.g. cachedAddList receives a single element and the first queue immediately reaches its end), no real randomization is achieved.

Fortunately our scenario does not require FIFO behavior, and decommission adds a huge number of blocks to UnderReplicatedBlocks up front, while the blocksToProcess that ReplicationMonitor takes and processes each round is comparatively small (decommissioning 8 nodes, UnderReplicatedBlocks reaches 1.8 million entries). The shuffle-and-add therefore happens infrequently and is not a performance bottleneck; the longest time observed was 200 ms.

We also made ShuffleAddSet configurable, so that UnderReplicatedBlocks can choose between ShuffleAddSet and LightWeightLinkedSet at initialization time via

dfs.namenode.blockmanagement.queues.shuffle

The initialization looks like this:

/** the queues themselves */
private final List<LightWeightHashSet<Block>> priorityQueues
    = new ArrayList<LightWeightHashSet<Block>>();

public static final String NAMENODE_BLOCKMANAGEMENT_QUEUES_SHUFFLE =
    "dfs.namenode.blockmanagement.queues.shuffle";

/** Stores the replication index for each priority */
private Map<Integer, Integer> priorityToReplIdx = new HashMap<Integer, Integer>(LEVEL);

/** Create an object. */
UnderReplicatedBlocks() {
  Configuration conf = new HdfsConfiguration();
  boolean useShuffle = conf.getBoolean(NAMENODE_BLOCKMANAGEMENT_QUEUES_SHUFFLE, false);
  for (int i = 0; i < LEVEL; i++) {
    if (useShuffle) {
      priorityQueues.add(new ShuffleAddSet<Block>());
    } else {
      priorityQueues.add(new LightWeightLinkedSet<Block>());
    }
    priorityToReplIdx.put(i, 0);
  }
}

Performance Issues

We found that with ShuffleAddSet, the NameNode would get stuck when starting to decommission 8 DNs, mostly in DecommissionManager$Monitor: every check walks all blocks on the DN to see whether they are under-replicated, and with that many blocks the write lock is held for a very long time.

The check also calls ShuffleAddSet.contains(block); since there are now two internal queues, contains() costs more than before.

2019-04-12,12:09:35,876 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Long read lock is held at 1555042175235. And released after 641 milliseconds.
Call stack is:
java.lang.Thread.getStackTrace(Thread.java:1479)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:914)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.checkAndLogLongReadLockDuration(FSNamesystemLock.java:104)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1492)
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.isReplicationInProgress(BlockManager.java:3322)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.checkDecommissionState(DatanodeManager.java:751)
org.apache.hadoop.hdfs.server.blockmanagement.DecommissionManager$Monitor.check(DecommissionManager.java:93)
org.apache.hadoop.hdfs.server.blockmanagement.DecommissionManager$Monitor.run(DecommissionManager.java:70)
java.lang.Thread.run(Thread.java:662)

We modified the isReplicationInProgress method, similarly to how block reports are processed, to release and reacquire the lock after every fixed number of blocks, which eases the long write-lock holds that leave other RPC requests unresponsive.

++processed;
// Release lock per 5w blocks processed and has too many underReplicatedBlocks.
if (processed == numReportBlocksPerIteration &&
    namesystem.hasWriteLock() && underReplicatedBlocks > numReportBlocksPerIteration) {
  namesystem.writeUnlock();
  processed = 0;
  namesystem.writeLock();
}

Conclusion

For clusters with many single-replica blocks, decommission can be done as follows:

dfs.namenode.blockmanagement.queues.shuffle = true
dfs.namenode.replication.max-streams = 12                   (default 2; limits the number of replication streams per datanode)
dfs.namenode.replication.max-streams-hard-limit = 12        (default 4)
dfs.namenode.replication.work.multiplier.per.iteration = 4  (default 2; blocks scheduled per NameNode round = this value × datanode count)

Enable shuffle-and-add, set the replication stream limits to the number of physical disks on a single DN, and on small clusters raise work.multiplier so that each round processes 4 times the number of live datanodes in blocks. This maximizes decommission speed.
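For reference, the same settings expressed through Hadoop's Configuration API (a hypothetical helper class; in practice the values go into hdfs-site.xml on the NameNode, and dfs.namenode.blockmanagement.queues.shuffle is the custom key added by our patch, not a stock HDFS option):

import org.apache.hadoop.conf.Configuration;

public class DecommissionTuningSketch {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.namenode.blockmanagement.queues.shuffle", true); // enable ShuffleAddSet
    conf.setInt("dfs.namenode.replication.max-streams", 12);              // match the DN disk count
    conf.setInt("dfs.namenode.replication.max-streams-hard-limit", 12);
    conf.setInt("dfs.namenode.replication.work.multiplier.per.iteration", 4);
    return conf;
  }
}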

Points to note:

Decommission 2 nodes per operation (refreshNodes), then another 2 every 10 minutes, up to 8. No more than 8 DNs should be decommissioning at the same time, otherwise the overhead of DecommissionManager significantly impacts normal NN service.

This article first appeared on the WeChat public account "Xiaomi Cloud Technology".


