Usage scenarios and considerations for data migration in HDFS

Data migration usage scenarios

  • Synchronization of data between hot and cold clusters for tiered storage
  • Overall migration of cluster data
    • When a company's business grows rapidly and current server resources run short, all of the data in data center A may be migrated to data center B in order to use resources more efficiently. Common reasons are that data center B has more machines, or that its operating cost is lower than that of data center A.
  • Near real-time synchronization of data
    • The point of near-real-time data synchronization is to keep a second, live copy of the data available. For example, if cluster A suddenly becomes unavailable, online traffic can be switched directly to the synchronized cluster B: because cluster B synchronizes with cluster A in near real time, it holds fully consistent data and metadata, so business use is unaffected.

Data migration factors to consider

  • Bandwidth
    • If migration consumes too much bandwidth, it interferes with the jobs serving the online business; if it consumes too little, full data synchronization cannot be completed in time.
  • Performance
    • Is the migration tool a simple single-machine program, or a multi-threaded, distributed program with better performance?
  • Incremental synchronization
    • When TB- or PB-scale data needs to be synchronized, copying everything in full every time performs very poorly; it is far better to synchronize only the changed, incremental data. Incremental synchronization can be built on technologies such as HDFS snapshots (see the sketch after this list).
  • Syncability: timeliness of data migration
    • During migration, the data must be fully synchronized within each cycle, and the lag must not grow too large. For example, if synchronizing 7 days of incremental data from cluster A to cluster B takes only half a day, the next synchronization can safely wait until the following week. The worst case is when synchronizing 7 days of data takes longer than 7 days: the next cycle arrives before the previous one finishes, and near-real-time consistency becomes impossible. In practice, 7 days is already a long cycle; daily synchronization is preferable.
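
As an illustration of the snapshot-based incremental synchronization mentioned above, here is a minimal sketch; the path /user/data is a hypothetical example, and a directory must be made snapshottable by an administrator before snapshots can be taken.

hdfs dfsadmin -allowSnapshot /user/data        # run once, as an administrator
hdfs dfs -createSnapshot /user/data s1         # snapshot at the start of a cycle
# ... a cycle of writes happens here ...
hdfs dfs -createSnapshot /user/data s2         # snapshot at the end of the cycle
hdfs snapshotDiff /user/data s1 s2             # report only what changed between s1 and s2

The snapshotDiff report lists created, deleted, renamed, and modified paths, which is exactly the changed incremental data that a synchronization program needs to copy.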

The HDFS distributed copy tool: DistCp

  • DistCp is a tool in Hadoop and exists as an independent sub-project under the hadoop-tools project.
  • It is positioned for data migration and for regular data backups between clusters and within a cluster
  • During backups, each run of DistCp is treated as one backup cycle. Despite its relatively slow performance, it has become widely used
  • Under the hood, DistCp uses MapReduce to copy files in parallel, either between clusters or within the same cluster. The MapReduce job that performs the copy has only a map phase, with no reducers

Advantages and performance of distcp

  • Bandwidth limiting
    • distcp provides the -bandwidth command parameter to limit the bandwidth (in MB/s) consumed by each map task.
  • Incremental data synchronization
    • In DistCp, incremental synchronization can be achieved through three parameters: -update, -append and -diff (see the sketch after this list).
    • -update copies files or directories that do not yet exist at the target, or that differ from the target version
    • -append appends new data to files that already exist under the target path
    • -diff synchronizes the source path to the target path using the diff information between two snapshots.
      -update handles synchronization of new files and directories; -append handles incremental updates to files that already exist; -diff handles files that were deleted or renamed
  • Efficient performance: distributed nature
    • Under the hood, DistCp performs data synchronization with MapReduce, which is itself a distributed computing framework, so the copy work is spread across many map tasks running in parallel.
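
A minimal sketch of these options, assuming hypothetical NameNode addresses nn1 and nn2. Note that -append and -diff are only valid together with -update, and -diff additionally requires that snapshots s1 and s2 exist on the source and that the target holds an identical snapshot s1:

# copy only files that are missing or changed at the target, capping each map task at 30 MB/s
hadoop distcp -update -bandwidth 30 hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data
# also append new data to target files that have grown at the source
hadoop distcp -update -append hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data
# apply the changes recorded between snapshots s1 and s2 to the target
hadoop distcp -update -diff s1 s2 hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data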

Command

hadoop distcp [options] <source_path> <target_path>

  • Here, source_path and target_path need to carry address prefixes (for example hdfs://namenode:port) to distinguish between different clusters.
hadoop distcp hdfs://src_cluster:8020/user/data hdfs://dest_cluster:8020/user/data_backup

This command tells the distcp tool to copy the data in the hdfs://src_cluster:8020/user/data directory to the hdfs://dest_cluster:8020/user/data_backup directory.
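
For repeated backup cycles, the same command can be rerun with the options from the previous section, for example (reusing the hypothetical addresses above):

hadoop distcp -update -bandwidth 10 hdfs://src_cluster:8020/user/data hdfs://dest_cluster:8020/user/data_backup

With -update, files that are already identical at the target are skipped, so each rerun transfers only changed data, and -bandwidth 10 caps each map task at roughly 10 MB/s.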
