[Hadoop-Distcp] tool introduction and parameter description

1) Overview

DistCp (distributed copy) is a tool for large-scale intra-cluster and inter-cluster copying. It uses MapReduce to carry out file distribution, error handling and recovery, and report generation.

DistCp expands a list of files and directories into the input of map tasks, and each task copies a portion of the files in the source list.
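For orientation, the most basic invocation copies a directory from one cluster to another; the NameNode addresses and paths below are placeholders for illustration, not values from the official documentation:

    # Copy /user/data from the cluster behind nn1 to the cluster behind nn2 (placeholder addresses)
    hadoop distcp hdfs://nn1:8020/user/data hdfs://nn2:8020/user/data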

Official website address: http://hadoop.apache.org/docs/r2.7.0/hadoop-distcp/DistCp.html

2) Suitable scenarios and advantages

1. Suitable scenarios:

Remote data disaster recovery, data center (computer room) decommissioning, data migration, and similar scenarios.

2. Advantages:

① Bandwidth limiting: the -bandwidth parameter limits the throughput of each DistCp map task, and together with the number of concurrent maps this controls the bandwidth of the whole copy job, preventing the copy from saturating the network and affecting other services (see the example after this list).

② Multiple copy modes with source/destination checks: overwrite (overwrite destination files unconditionally, even if they exist), update (incremental copy: overwrite when the destination file's name and size differ from the source file, skip when they are the same), delete (remove files that exist at the destination but not at the source), and so on. When copying large amounts of data, these checks during the copy process ensure that the source and destination data stay consistent.
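As a rough sketch of the two points above (the cluster addresses, paths, and limits are illustrative assumptions, not recommended values):

    # Limit each map to 10 MB/s and run at most 20 maps (roughly 200 MB/s for the whole job);
    # -update copies only files whose size differs from the existing destination copy
    hadoop distcp -bandwidth 10 -m 20 -update hdfs://nn1:8020/warehouse hdfs://nn2:8020/warehouse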

3) Parameter description

The following parameters apply to the Hadoop 2.x version:

  • -append: Reuse existing data in the target files and append new data to them where possible, instead of overwriting
  • -async: Run DistCp asynchronously; the command returns immediately instead of blocking until the job finishes
  • -atomic: Commit all changes or none
  • -bandwidth <arg>: Specify the bandwidth of each map in MB/second
  • -delete: Delete files that exist at the destination but not at the source; the deleted files go to the HDFS trash (used together with -update or -overwrite)
  • -diff <arg>: Use the snapshot diff report to identify differences between the source and target
  • -f <arg>: Use <arg> as a file containing the list of source paths to copy
  • -filelimit <arg>: (DEPRECATED!) Limit number of files copied to <= n
  • -filters <arg>: Path to a file of patterns; source paths matching a pattern are excluded from the copy
  • -i: Ignore failures during copying
  • -log <arg>: HDFS folder where the DistCp execution log is saved
  • -m <arg>: Limit the maximum number of simultaneous map tasks; the source files are divided among the maps, and the default maximum is 20 maps for the whole job
  • -mapredSslConf <arg>: Specify the SSL configuration file to use (for an hsftp source)
  • -numListstatusThreads <arg>: The number of threads used to build the file list (up to 40), when the file directory structure is complex, the value should be increased appropriately
  • -overwrite: Overwrite target files unconditionally, even if they exist
  • -p <arg>: Preserve source file state (rbugpcaxt) (replication, block size, user, group, permissions, checksum type, ACL, XATTR, timestamp)
  • -sizelimit <arg>: (DEPRECATED!) Limit the total size of copied files to <= n bytes
  • -skipcrccheck: Whether to skip the CRC check between the source and destination paths.
  • -strategy <arg>: Select the copy strategy; the default, uniformsize, balances the total size of the files copied by each map, while dynamic lets faster maps copy more files, which can improve performance
  • -tmp <arg>: Intermediate work path to be used for the atomic commit (together with -atomic)
  • -update: Incremental copy: overwrite the destination file if its size differs from the source file of the same name; skip it if the name and size are the same
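
Putting several of the parameters above together, an incremental synchronization could look like the following sketch (the paths, log folder, and preserved attributes are assumptions for illustration):

    # Incremental sync: preserve replication, block size, user, group and permissions (-prbugp),
    # delete destination files that no longer exist at the source, ignore individual copy failures,
    # and save the job log to an HDFS folder
    hadoop distcp -update -delete -prbugp -i -log /tmp/distcp-logs \
        hdfs://nn1:8020/warehouse hdfs://nn2:8020/warehouse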
