A data balance tool solution for HDFS federation clusters

Preface


When a single large HDFS cluster can no longer support our business scenarios, more and more companies start to consider HDFS federation solutions. A question naturally arises here: how do we move data from the original cluster's namespace (NameNode) into a new namespace of the federation? During this process, there is still daily incremental data being written to the old cluster. If the data were purely static, we could simply launch a distcp job to copy it across namespaces, but with incremental writes in the picture the problem becomes a bit more complicated. Recently the Hadoop community has been discussing and implementing a solution in this area, and an initial balance tool between HDFS federation namespaces is already available. This article walks through the internal implementation details of this tool. I believe it will be very useful in real production environments.

Coarse-grained federation balance scheme


The usage scenarios of federation balance mainly come from the following two situations:

  • Purely balancing data across the federation NameNodes, to prevent any single namespace from holding too much data and hitting performance bottlenecks. This situation is more common when a multi-cluster management solution (HDFS RBF or ViewFs) is already in use.
  • When a new federation namespace is added and data needs to be moved out of the original namespace into the new one. In this case we also have a data synchronization requirement.

Regarding the second point, the author has previously written up a scheme based on an initial distcp followed by multiple snapshot-diff distcp rounds. That scheme is rather coarse-grained in execution and far from perfect: it requires a lot of extra manual work and repeated execution of commands. For the details of this coarse-grained federation balance solution, please refer to the author's earlier article: The ingenious use of the DistCp tool in the HDFS data migration solution.
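To make the repeated manual checking concrete, here is a minimal sketch (not from the original article) of how one might test whether another diff-copy round is still needed, using the standard DistributedFileSystem.getSnapshotDiffReport API. The directory path and the snapshot names "s1"/"s2" are hypothetical examples; the source directory is assumed to be snapshottable.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

public class DiffCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical source directory on the old namespace; it must be snapshottable.
    Path source = new Path("hdfs://ns-old/user/data/warehouse");
    DistributedFileSystem dfs =
        (DistributedFileSystem) source.getFileSystem(conf);

    // Compare the snapshot used by the last distcp round ("s1") with a fresh one ("s2").
    dfs.createSnapshot(source, "s2");
    SnapshotDiffReport report = dfs.getSnapshotDiffReport(source, "s1", "s2");

    if (report.getDiffList().isEmpty()) {
      System.out.println("No new changes, the directories are in sync.");
    } else {
      System.out.println("Pending changes: " + report.getDiffList().size()
          + ", another distcp -diff round is required.");
    }
  }
}
```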

Systematic federation balance tool solution


What this article explains is the complete, tool-based version of the coarse-grained federation balance scheme above. The community has already discussed the corresponding implementation.

In principle, the tool-based implementation is the same as the coarse-grained balance scheme discussed above, and it is executed in two stages:

1) In the first stage, perform an initial full distcp copy of the data.
2) In the second stage, repeatedly copy the incremental changes with snapshot-diff-based distcp: after each diff copy, take a new snapshot and compare it with the current state of the file directory. If there are still new changes, run another diff distcp, and repeat until there is no difference left between the two sides (see the sketch after the figure below).

The synchronization principle is shown in the figure below:
[Figure: an initial full distcp copy followed by repeated snapshot-diff distcp rounds]
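As a rough illustration of these two stages, here is a minimal sketch using DistCp's Java API (org.apache.hadoop.tools.DistCp with DistCpOptions): an initial full copy followed by snapshot-diff rounds. The paths and snapshot naming are hypothetical, both directories are assumed to be snapshottable, and error handling and snapshot cleanup are omitted; the real tool in HDFS-15294 is considerably more involved.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class TwoStageSync {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("hdfs://ns-old/user/data/warehouse");   // hypothetical source
    Path dst = new Path("hdfs://ns-new/user/data/warehouse");   // hypothetical target
    DistributedFileSystem srcFs = (DistributedFileSystem) src.getFileSystem(conf);
    DistributedFileSystem dstFs = (DistributedFileSystem) dst.getFileSystem(conf);

    // Stage 1: take a snapshot on the source and perform the initial full copy.
    srcFs.createSnapshot(src, "s0");
    DistCpOptions initial = new DistCpOptions.Builder(
        Collections.singletonList(src), dst).withSyncFolder(true).build();
    new DistCp(conf, initial).execute();
    // distcp -diff requires the "from" snapshot to exist on the target as well.
    dstFs.createSnapshot(dst, "s0");

    // Stage 2: repeat snapshot-diff copies until no changes remain.
    int round = 0;
    while (true) {
      String from = "s" + round;
      String to = "s" + (round + 1);
      srcFs.createSnapshot(src, to);
      if (srcFs.getSnapshotDiffReport(src, from, to).getDiffList().isEmpty()) {
        break;  // no new changes: source and target are fully in sync
      }
      DistCpOptions diff = new DistCpOptions.Builder(
          Collections.singletonList(src), dst)
          .withSyncFolder(true)
          .withUseDiff(from, to)   // copy only the changes between the two snapshots
          .build();
      new DistCp(conf, diff).execute();
      dstFs.createSnapshot(dst, to);  // keep the target's snapshots in step for the next round
      round++;
    }
  }
}
```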
Compared with manually triggering distcp for the initial and diff copies, the federation balance scheme also takes care of the following additional points:

  • Scheduled execution and management of the distcp tasks: the user only needs to submit the balance task once, and the subsequent distcp tasks are scheduled inside the BalanceJob.
  • Persistence of the BalanceJob information, with HDFS as the external storage. This ensures that if a BalanceJob has not finished executing, the unfinished jobs can be restored from the data on HDFS next time (a sketch of this idea follows the list).
  • Automatic cleanup of the source file directory at the end of the BalanceJob execution. Once the data has been completely copied from the source cluster to the target cluster, this part of the data can be cleaned up on the source cluster.
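To make the persistence point concrete, below is a minimal sketch of journaling a job's state to HDFS so it can be restored after a failure. The BalanceJobJournal class, the journal path, and the serialization format are hypothetical illustrations, not the actual classes used in HDFS-15294.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical journal that persists a job's state to HDFS. */
public class BalanceJobJournal {
  private final FileSystem fs;
  private final Path journalDir;

  public BalanceJobJournal(Configuration conf, Path journalDir) throws IOException {
    this.fs = journalDir.getFileSystem(conf);
    this.journalDir = journalDir;
  }

  /** Write (overwrite) the latest state of the given job. */
  public void save(String jobId, String state) throws IOException {
    try (FSDataOutputStream out = fs.create(new Path(journalDir, jobId), true)) {
      out.writeUTF(state);
    }
  }

  /** Restore the last persisted state, e.g. after a restart. */
  public String restore(String jobId) throws IOException {
    try (FSDataInputStream in = fs.open(new Path(journalDir, jobId))) {
      return in.readUTF();
    }
  }

  /** Remove the journal once the job has finished. */
  public void clear(String jobId) throws IOException {
    fs.delete(new Path(journalDir, jobId), false);
  }
}
```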

These improvements make the whole tool much more convenient for admins to use, saving a lot of extra manual data synchronization work.

The internal workflow of the federation balance tool is shown in the figure below:
[Figure: internal workflow of the federation balance tool]
The steps in the figure correspond to the following process:

  • 1, 2) The command-line entry point, i.e. the entry of the submission program: submit a new BalanceJob, or read BalanceJobs back from HDFS to recover the unfinished jobs that failed last time, and then put them into a JobQueue.
  • 3) A WorkThread worker thread pulls a BalanceJob from the JobQueue and executes it.
  • 4) While scheduling and executing the BalanceJob, the WorkThread persists the job information to HDFS so that it can recover from a job failure.
  • 5) When the BalanceJob finishes, the WorkThread removes the previously written BalanceJob information from HDFS (a simplified sketch of this scheduling loop follows the list).
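For illustration, here is a simplified sketch of that scheduling loop: a worker thread pulling jobs from a queue, journaling their state before running them, and clearing the journal when they finish. The BalanceJob interface is a hypothetical stand-in, and the journal is the BalanceJobJournal sketch above; the real scheduler in HDFS-15294 handles multiple worker threads, retries, and richer job state.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical job abstraction: one unit of federation balance work. */
interface BalanceJob {
  String getId();
  void run() throws Exception;   // e.g. initial distcp + diff rounds + cleanup
}

/** Simplified scheduler: one queue, one worker thread, journal-based recovery. */
public class BalanceScheduler {
  private final BlockingQueue<BalanceJob> jobQueue = new LinkedBlockingQueue<>();
  private final BalanceJobJournal journal;   // from the earlier sketch

  public BalanceScheduler(BalanceJobJournal journal) {
    this.journal = journal;
  }

  /** Steps 1, 2: submit a new job (or a job recovered from HDFS) into the queue. */
  public void submit(BalanceJob job) {
    jobQueue.add(job);
  }

  /** Steps 3-5: the WorkThread loop. */
  public void startWorkThread() {
    Thread workThread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          BalanceJob job = jobQueue.take();          // 3) pull a job from the JobQueue
          journal.save(job.getId(), "RUNNING");      // 4) persist state before executing
          job.run();
          journal.clear(job.getId());                // 5) remove the journal when done
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } catch (Exception e) {
          // Leave the journal in place so the job can be recovered on the next start.
          e.printStackTrace();
        }
      }
    }, "WorkThread");
    workThread.setDaemon(true);
    workThread.start();
  }
}
```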

The whole process is not particularly complicated.

Overall, the implementation of this federation balance tooling solution is relatively clear. It does not try to achieve exact synchronization by locking directories inside HDFS; instead it uses distcp's snapshot-diff copy for incremental synchronization, which, from the current point of view, should be the approach with relatively small impact on users. All of the implementation details above come from the community JIRA HDFS-15294: RBF: Balance data across federation namespaces with DistCp and snapshot diff; interested readers can study the specific code implementation there.

References


[1] https://issues.apache.org/jira/browse/HDFS-15294

Original article: blog.csdn.net/Androidlushangderen/article/details/106014155