Handling small files with hadoop archive

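Use the hadoop archive command, which runs a MapReduce job, to pack small files into a HAR archive file (HAR is an archive format; it does not compress the data).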

1. HDFS source files used for the test:

/test/lizhao/2018-8-13/*
/test/lizhao/2018-8-14/*

2. Archive command, with the syntax hadoop archive -archiveName NAME -p <parent path> [-r <replication factor>] <src>* <dest>:

>>> hadoop archive -archiveName 2018-8.har -p /test/lizhao 2018-8-13 2018-8-14 /test/lizhao/

18/08/14 14:11:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/14 14:11:55 INFO client.RMProxy: Connecting to ResourceManager at IC-1/192.168.11.180:8032
18/08/14 14:11:56 INFO client.RMProxy: Connecting to ResourceManager at IC-1/192.168.11.180:8032
18/08/14 14:11:56 INFO client.RMProxy: Connecting to ResourceManager at IC-1/192.168.11.180:8032
18/08/14 14:11:56 INFO mapreduce.JobSubmitter: number of splits:1
18/08/14 14:11:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533867597475_0001
18/08/14 14:11:58 INFO impl.YarnClientImpl: Submitted application application_1533867597475_0001
18/08/14 14:11:58 INFO mapreduce.Job: The url to track the job: http://ic-1:8088/proxy/application_1533867597475_0001/
18/08/14 14:11:58 INFO mapreduce.Job: Running job: job_1533867597475_0001
18/08/14 14:12:07 INFO mapreduce.Job: Job job_1533867597475_0001 running in uber mode : false
18/08/14 14:12:07 INFO mapreduce.Job:  map 0% reduce 0%
18/08/14 14:12:13 INFO mapreduce.Job:  map 100% reduce 0%
18/08/14 14:12:24 INFO mapreduce.Job:  map 100% reduce 100%
18/08/14 14:12:24 INFO mapreduce.Job: Job job_1533867597475_0001 completed successfully
18/08/14 14:12:24 INFO mapreduce.Job: Counters: 49
*****
	Map-Reduce Framework
		Map input records=15
		Map output records=15
		Map output bytes=1205
		Map output materialized bytes=1241
		Input split bytes=116
		Combine input records=0
		Combine output records=0
		Reduce input groups=15
		Reduce shuffle bytes=1241
		Reduce input records=15
		Reduce output records=0
		Spilled Records=30
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=137
		CPU time spent (ms)=6370
		Physical memory (bytes) snapshot=457756672
		Virtual memory (bytes) snapshot=3200942080
		Total committed heap usage (bytes)=398458880
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=995
	File Output Format Counters 
		Bytes Written=0
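
The argument order in the command above is easy to get wrong: the sources (2018-8-13, 2018-8-14) are paths relative to the -p parent, and the destination directory comes last. The following is a minimal sketch of that mapping, not part of Hadoop itself. build_archive_cmd is a hypothetical helper name, and it only prints the command rather than running it, so no cluster is needed to try it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): prints the hadoop archive command
# for the given parts instead of running it.
# Usage: build_archive_cmd NAME PARENT DEST SRC...
build_archive_cmd() {
  local name="$1" parent="$2" dest="$3"
  shift 3
  # The remaining arguments are source paths, relative to the parent path.
  printf 'hadoop archive -archiveName %s -p %s %s %s\n' "$name" "$parent" "$*" "$dest"
}

# Rebuilds the command used in this post.
build_archive_cmd 2018-8.har /test/lizhao /test/lizhao/ 2018-8-13 2018-8-14
```

Note that hadoop archive does not delete the source directories; they remain in HDFS until removed separately.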

3. View the archived files:

>>> hadoop fs -ls  har:///test/lizhao/2018-8.har
drwxr-xr-x   - root supergroup          0 2018-08-14 14:06 har:///test/lizhao/2018-8.har/2018-8-13
drwxr-xr-x   - root supergroup          0 2018-08-14 14:06 har:///test/lizhao/2018-8.har/2018-8-14
>>> hadoop fs -ls  har:///test/lizhao/2018-8.har/2018-8-13
-rw-r--r--   2 root supergroup         22 2018-08-14 14:05 har:///test/lizhao/2018-8.har/2018-8-13/1.txt
-rw-r--r--   2 root supergroup         22 2018-08-14 14:05 har:///test/lizhao/2018-8.har/2018-8-13/2.txt
-rw-r--r--   2 root supergroup         22 2018-08-14 14:05 har:///test/lizhao/2018-8.har/2018-8-13/3.txt
-rw-r--r--   2 root supergroup         22 2018-08-14 14:06 har:///test/lizhao/2018-8.har/2018-8-13/5.txt
-rw-r--r--   2 root supergroup         22 2018-08-14 14:06 har:///test/lizhao/2018-8.har/2018-8-13/6.txt
-rw-r--r--   2 root supergroup         22 2018-08-14 14:06 har:///test/lizhao/2018-8.har/2018-8-13/7.txt
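
The har:/// URIs in the listings above consist of the har:// scheme, the HDFS path of the archive, and then the path of the entry inside the archive. As a rough sketch of that composition, assuming the default filesystem is used (no host or port in the URI), the hypothetical helper har_uri below joins the two parts. The helper name is an illustration only and is not a Hadoop command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): builds a har:// URI from the
# archive's HDFS path and an optional path inside the archive.
har_uri() {
  local archive="$1" inner="${2:-}"
  # Drop a trailing slash on the archive path and a leading slash on the inner path.
  archive="${archive%/}"
  inner="${inner#/}"
  if [ -n "$inner" ]; then
    printf 'har://%s/%s\n' "$archive" "$inner"
  else
    printf 'har://%s\n' "$archive"
  fi
}

# URI of one archived file from the listing above.
har_uri /test/lizhao/2018-8.har 2018-8-13/1.txt
```

The resulting URI can be passed to ordinary filesystem commands, for example hadoop fs -ls -R for a recursive listing.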

4. Download a file from the HAR:

>>> hadoop fs -get har:///test/lizhao/2018-8.har/2018-8-13/1.txt ./
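
Besides fetching single files to the local disk, an archived directory can be copied back into plain HDFS files. The sketch below shows two possible ways, hadoop fs -cp (sequential) and hadoop distcp (parallel, via MapReduce). The target paths are hypothetical and not from the original test, and the run wrapper is an illustrative helper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default so the sketch can be tried without a cluster;
# set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative wrapper: prints the command in dry-run mode, otherwise runs it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical target directories (assumptions, not from the original post).
TARGET_CP=/test/lizhao/restored-13
TARGET_DISTCP=/test/lizhao/restored-14

# Sequential copy of one archived directory back to ordinary HDFS files.
run hadoop fs -cp har:///test/lizhao/2018-8.har/2018-8-13 "$TARGET_CP"

# Parallel copy with DistCp, which may suit larger archives better.
run hadoop distcp har:///test/lizhao/2018-8.har/2018-8-14 "$TARGET_DISTCP"
```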

Reprinted from blog.csdn.net/qq_42006894/article/details/81666960