The DistCp Utility of Hadoop

1 Overview

DistCp (Distributed Copy) is a tool for high-performance copying within a large cluster or between clusters. It uses MapReduce to implement its distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to its map tasks, and each task copies a partition of the files in the source list.

Note: In cross-team collaboration at work, copying data between clusters running different Hadoop versions is handled differently from copying between clusters of the same version.

2 Usage

Overall, usage falls into two categories:

1) copying data between clusters of the same version;

2) copying data between clusters of different versions;

Copying data between clusters of the same version

For example: copy the /cluster/A directory of cluster A (NameNode NN1 at IP 192.168.7.120) to the /cluster/B1 directory of cluster B (NameNode NN2 at IP 192.168.8.120):

hadoop distcp hdfs://192.168.7.120:8020/cluster/A/ hdfs://192.168.8.120:8020/cluster/B1/

Notes:

a) The hdfs protocol is used. The address 192.168.7.120 is cluster A's NameNode, and 8020 is cluster A's RPC port (see hdfs-site.xml); 192.168.8.120 is the NameNode address of cluster B.

b) This command copies the files and folders under cluster A's /cluster/A folder into cluster B's /cluster/B1 directory, so the directory structure /cluster/B1/A appears on cluster B. If the /cluster/B1 directory does not exist, it is created. Note that the source path must be an absolute path, including the leading hdfs://ip:port.
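To confirm the result, the destination can be listed once the copy finishes (a quick check using the same addresses as above):

hadoop fs -ls hdfs://192.168.8.120:8020/cluster/B1/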

To copy from multiple data sources, specify multiple source directories, as in:

hadoop distcp hdfs://192.168.7.120:8020/cluster/A/a1 hdfs://192.168.7.120:8020/cluster/A/a2 hdfs://192.168.8.120:8020/cluster/B1/

Or use the -f option to get multiple sources from a file:

 hadoop distcp -f hdfs://192.168.7.120:8020/src_A_list hdfs://192.168.8.120:8020/cluster/B1/

where the contents of src_A_list are:

    hdfs://192.168.7.120:8020/cluster/A/a1

    hdfs://192.168.7.120:8020/cluster/A/a2
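In this example the list file itself lives on HDFS. As a sketch, one way to put it there from an identically named local file (the local file name is illustrative):

hadoop fs -put src_A_list hdfs://192.168.7.120:8020/src_A_list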

When copying from multiple sources, if two sources conflict, DistCp aborts the copy with an error message; conflicts at the destination are resolved according to the options specified. By default, files that already exist at the destination are skipped (that is, they are not replaced by the source file; see c) below). The number of skipped files is reported at the end of each job, but if a copy operation failed and then succeeded on a later attempt, the reported figures may not be accurate.

The JobTracker must be able to access and communicate with both the source and destination file systems.

After the copy, it is recommended to generate listings of the source and destination files and cross-check them to verify that the copy truly succeeded. Since DistCp employs both MapReduce and the file system APIs, an issue in or between any of the three can adversely affect the copy.

It is worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file that is being written on HDFS will also fail. And if a source file is moved or deleted before it is copied, the copy fails with a FileNotFoundException.

c) By default, DistCp skips files that already exist at the destination, but they can be overwritten with the -overwrite option; the -update option rewrites only the files that have changed.
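For illustration, the two variants side by side (same addresses as above; -overwrite rewrites every file at the destination, -update rewrites only files that differ in size or checksum):

hadoop distcp -overwrite hdfs://192.168.7.120:8020/cluster/A/ hdfs://192.168.8.120:8020/cluster/B1/
hadoop distcp -update hdfs://192.168.7.120:8020/cluster/A/ hdfs://192.168.8.120:8020/cluster/B1/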

A worked example:

Suppose we need to copy from /cluster/A1/ and /cluster/A2/ to /cluster/B1/, where the source paths contain:

    hdfs://192.168.7.120:8020/cluster/A1
    hdfs://192.168.7.120:8020/cluster/A1/a1
    hdfs://192.168.7.120:8020/cluster/A1/a2
    hdfs://192.168.7.120:8020/cluster/A2
    hdfs://192.168.7.120:8020/cluster/A2/a3
    hdfs://192.168.7.120:8020/cluster/A2/a1

If neither -update nor -overwrite is set, the two sources are mapped to /cluster/B1/A1 and /cluster/B1/A2 at the destination. If either option is set, the contents of each source directory are compared with the contents of the destination directory instead; since a1 exists in both sources, both would map an entry to /cluster/B1/a1, and when DistCp encounters such a conflict it terminates the operation and exits.

By default, the directories /cluster/B1/A1 and /cluster/B1/A2 are created, so there is no conflict.

Now for the usage of -update. The DistCp invocation is:

hadoop distcp -update hdfs://192.168.7.120:8020/cluster/A1 \
              hdfs://192.168.7.120:8020/cluster/A2 \
              hdfs://192.168.8.120:8020/cluster/B1

where the source paths/sizes are:

    hdfs://192.168.7.120:8020/cluster/A1
    hdfs://192.168.7.120:8020/cluster/A1/a1 32
    hdfs://192.168.7.120:8020/cluster/A1/a2 32
    hdfs://192.168.7.120:8020/cluster/A2
    hdfs://192.168.7.120:8020/cluster/A2/a3 64
    hdfs://192.168.7.120:8020/cluster/A2/a4 32

and the destination paths/sizes are:

    hdfs://192.168.8.120:8020/cluster/B1
    hdfs://192.168.8.120:8020/cluster/B1/a1 32
    hdfs://192.168.8.120:8020/cluster/B1/a2 32
    hdfs://192.168.8.120:8020/cluster/B1/a3 128

the copy produces:

    hdfs://192.168.8.120:8020/cluster/B1
    hdfs://192.168.8.120:8020/cluster/B1/a1 32
    hdfs://192.168.8.120:8020/cluster/B1/a2 32
    hdfs://192.168.8.120:8020/cluster/B1/a3 64
    hdfs://192.168.8.120:8020/cluster/B1/a4 32
   

Note that the a1 and a2 files at 192.168.8.120 are not overwritten, because their sizes match the source (a3 is overwritten since its size differs, and a4 is copied because it does not yet exist at the destination). If the -overwrite option is specified, all files are overwritten.

d) DistCp has many other options that can be set, such as ignoring failures or limiting the number of files or the amount of data copied. Running the command without any arguments prints a description of its usage.

Commonly used optional DistCp parameters:
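The following is a minimal selection from the Hadoop 2.x DistCp guide (see the official documentation linked in section 4 for the complete list):

    -m <num_maps>      maximum number of simultaneous copies (map tasks)
    -overwrite         overwrite files that already exist at the destination
    -update            copy only files that differ from the destination
    -f <urilist_uri>   use the contents of the given file as the source list
    -i                 ignore failures during the copy
    -log <logdir>      write DistCp's job log to <logdir>
    -p[rbugp]          preserve status: replication, block size, user, group, permission
    -delete            delete destination files that do not exist in the source
    -bandwidth <n>     limit the bandwidth of each map to n MB/s
    -skipcrccheck      skip CRC checks between source and destination (requires -update)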

Copying data between clusters of different versions

 

hadoop distcp  hftp://192.168.7.120:50070/cluster/A/    hdfs://192.168.8.120:8020/cluster/B1

  

Note that the URI for accessing the source NameNode must use its HTTP network interface, which is set through the dfs.namenode.http-address property (default value 50070); see hdfs-site.xml. Since hftp is a read-only file system, the DistCp job should be run on the destination cluster.
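To double-check which interface the source exposes, the running configuration can be queried on a node configured for the source cluster:

hdfs getconf -confKey dfs.namenode.http-address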

 

3 Problems encountered in practice

a) ipc.StandbyException: http://s.apache.org/sbnn-error

 

 

Solution:

The NameNode that dfs connected to is in the standby state rather than active, and a standby NameNode rejects such requests. The fix: switch to another NameNode and make sure the new one is active.
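As a sketch, in an HA setup the active NameNode can be found with hdfs haadmin (nn1 and nn2 stand for the NameNode IDs configured in hdfs-site.xml; the names here are illustrative):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2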

 

b) java.io.IOException: Check-sum mismatch

Analysis: this problem is very common and easy to find online. It occurs because different Hadoop versions use different checksum versions: old versions use CRC32, while new versions use CRC32C.

Solution: add two parameters (-skipcrccheck -update) to the distcp invocation to skip the CRC check. Note that -skipcrccheck takes effect only when used together with -update.
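Applied to the cross-version copy above, the invocation becomes:

hadoop distcp -update -skipcrccheck hftp://192.168.7.120:50070/cluster/A/ hdfs://192.168.8.120:8020/cluster/B1/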

c) java.net.UnknownHostException

 

Analysis: the distcp job had already started (map 0%) when it reported UnknownHostException: pslave55. The likely cause is that, when fetching data from a DataNode, the hostname pslave55 was used; this hostname is specific to the source cluster and the destination cluster cannot resolve it, hence the UnknownHostException.

Solution: configure the hosts file on the destination cluster. Append the host-to-IP mappings of all hosts in the source cluster to the hosts file on the destination cluster, so that the destination cluster can resolve hostnames such as pslave55 to their IPs.
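A sketch of the fix on one destination node; the IP for pslave55 below is a placeholder, so substitute the real mapping taken from the source cluster, and repeat on every node of the destination cluster:

echo "192.168.7.155  pslave55" >> /etc/hosts    # placeholder IP, use the real one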

4 Summary

To copy across clusters, for example from cluster A to cluster B, confirm the following:

(1) Confirm that every machine in cluster B can ping every IP in cluster A.

(2) The ports in use must be opened in iptables on the respective nodes; do not let iptables block them.

(3) If the inter-department firewall ports have been opened but telnet still cannot connect, confirm that cluster A's iptables rules have admitted cluster B's IPs (the quick checks after this list can help verify items (1) to (3)).

(4) If cluster B reports UnknownHostException, append cluster A's host-to-IP mappings to the hosts file on cluster B.
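A few quick checks, run from a cluster B node against the addresses and ports used in section 2 (assumes ping and telnet are available):

ping -c 3 192.168.7.120          # item (1): basic reachability
telnet 192.168.7.120 8020        # items (2)/(3): NameNode RPC port
telnet 192.168.7.120 50070       # NameNode HTTP port, needed for hftp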

For reference, a comparison of commonly used ports:
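A minimal list, assuming the stock Hadoop 2.x defaults (each value is configurable in hdfs-site.xml or yarn-site.xml):

    NameNode RPC           dfs.namenode.rpc-address (fs.defaultFS)    8020
    NameNode HTTP          dfs.namenode.http-address                  50070
    DataNode data transfer dfs.datanode.address                       50010
    DataNode HTTP          dfs.datanode.http.address                  50075
    ResourceManager RPC    yarn.resourcemanager.address               8032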

For other configuration, refer to the official documentation:

http://hadoop.apache.org/docs/r2.7.6/hadoop-distcp/DistCp.html

 


Origin www.cnblogs.com/jagel-95/p/10945317.html