Some notes on merging small files in Hadoop

0 Some background:

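Small files: files whose size is much smaller than the HDFS block size (64 MB by default in Hadoop 1.x, 128 MB in later releases).
Why they matter: every file, directory and block in HDFS is represented as an object held in the NameNode's memory, and each object takes up about 150 bytes.
With 10 million files, each occupying one block, that is 20 million objects, so the NameNode needs roughly 3 GB of memory just to hold this metadata. As the number of files grows, the memory requirement grows with it.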
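
The arithmetic behind that 3 GB figure can be sketched as follows. This is only a rough estimate: the class and method names are made up for illustration, the 150-byte cost is the rule of thumb quoted above rather than a measured value, and directories are ignored.

```java
// Rough NameNode memory estimate for small files, using the ~150 bytes/object rule of thumb.
final class NameNodeMemoryEstimate {

    // Approximate per-object cost quoted in the text; real usage varies by Hadoop version.
    static final long BYTES_PER_OBJECT = 150L;

    // Each file contributes one file object plus one object per block.
    static long estimateBytes(long fileCount, long blocksPerFile) {
        long objects = fileCount * (1 + blocksPerFile);
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million files, one block each
        long bytes = estimateBytes(10_000_000L, 1);
        System.out.println(bytes + " bytes");
    }
}
```

For 10 million single-block files this gives 3,000,000,000 bytes, i.e. about 3 GB.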

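1 Ways to merge small files:

a) Let the application do the merging itself. Drawback: it is like mixing red ink and blue ink. Once the contents are combined into one file, the original files can no longer be told apart.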

import java.io.File;
import java.io.FileInputStream;
import java.net.URI;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;


public class CopyOfApp1 {

	public static void main(String[] args) throws Exception {
		// 0 Initialize the FileSystem client
		final Configuration conf = new Configuration();
		final FileSystem fileSystem = FileSystem.get(new URI("hdfs://chinadaas109:9000/"), conf);

		System.out.println(fileSystem);
		final Path path = new Path("/combinedfile");
		final FSDataOutputStream create = fileSystem.create(path); // create the target file on HDFS
		final File dir = new File("C:\\Windows\\System32\\drivers\\etc");  // write the contents of this local folder into the HDFS path
		for (File file : dir.listFiles()) {
			if (!file.isFile()) {
				continue; // skip subdirectories, which cannot be opened as streams
			}
			System.out.println(file.getAbsolutePath());
			final FileInputStream fileInputStream = new FileInputStream(file);
			// Copy the raw bytes so that line terminators are preserved
			IOUtils.copy(fileInputStream, create);
			fileInputStream.close();
		}
		create.close();
	}
}

b) archive：

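HAR files work by building a layered file system on top of HDFS.
A HAR file is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into the HAR.
For the client, using HAR files changes nothing: all of the original files remain visible and accessible (using a har:// URL). On the HDFS side, however, the number of files is reduced.

Reading a file through a HAR is no more efficient than reading it directly from HDFS,
and may in fact be slightly slower, because each access to a file in a HAR requires reading two levels of index files in addition to the file data itself.

Also, although HAR files can be used as input to a MapReduce job, there is no special mechanism that lets the maps treat the files packed inside a HAR as a single HDFS file.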

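Create an archive: hadoop archive -archiveName xxx.har -p /src /dest
List its contents: hadoop fs -lsr har:///dest/xxx.har (on newer releases, where -lsr is deprecated: hadoop fs -ls -R har:///dest/xxx.har)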

c) sequence file/map file

sequence file:

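The filename is used as the key and the file contents as the value.
For example, given 10,000 files of 100 KB each, you can write a program that puts all of them into a single SequenceFile.
That SequenceFile can then be processed in a streaming fashion (directly or using MapReduce).
SequenceFiles are also splittable, so MapReduce can break them into chunks and process each chunk independently.
This approach also supports compression.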
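
To see why keeping the filename as the key avoids the "mixed ink" problem of approach a), here is a small plain-Java sketch of the idea. It is an illustration only: the class name and the record layout (a UTF name, an int length, then the bytes) are invented for this example and are not the real SequenceFile on-disk format, and it has no splitting or compression.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a toy key/value container, NOT the real SequenceFile format.
final class KeyValuePackSketch {

    // Write each (filename, contents) pair as one length-prefixed record.
    static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                out.writeUTF(e.getKey());           // key: file name
                out.writeInt(e.getValue().length);  // length of the value
                out.write(e.getValue());            // value: raw file contents
            }
        }
        return buffer.toByteArray();
    }

    // Read the records back, recovering every original file by name.
    static Map<String, byte[]> unpack(byte[] packed) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] contents = new byte[in.readInt()];
                in.readFully(contents);
                files.put(name, contents);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "red ink".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "blue ink".getBytes(StandardCharsets.UTF_8));

        // Everything is stored in one byte array, yet each file can still be told apart.
        Map<String, byte[]> restored = unpack(pack(files));
        for (Map.Entry<String, byte[]> e : restored.entrySet()) {
            System.out.println(e.getKey() + " -> " + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

Because every record carries its own key and length, the merged data can be split back into the original files, which the plain concatenation in approach a) cannot do.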

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;

public class TestSequenceFile {

	/**
	 * Writes one Text/Text record to a SequenceFile, then reads it back.
	 */
	public static void main(String[] args) throws IOException {
		Configuration conf = new Configuration();
		Path seqFile = new Path("/test/seqFile2.seq");
		// SequenceFile.Writer handles writing; both key and value are assumed to be Text here
		SequenceFile.Writer writer = SequenceFile.createWriter(conf,
				Writer.file(seqFile), Writer.keyClass(Text.class),
				Writer.valueClass(Text.class),
				Writer.compression(CompressionType.NONE));

		// Append a record to the file through the writer
		writer.append(new Text("key"), new Text("value"));

		IOUtils.closeStream(writer); // close the writer

		// Read the records back through the reader
		SequenceFile.Reader reader = new SequenceFile.Reader(conf,
				Reader.file(seqFile));
		Text key = new Text();
		Text value = new Text();
		while (reader.next(key, value)) {
			System.out.println(key);
			System.out.println(value);
		}
		IOUtils.closeStream(reader); // close the reader
	}

}

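Reference on the two formats (sequence file and map file): http://blog.csdn.net/javaman_chen/article/details/7241087

For the differences in how sequencefile and textfile are handled in source code, with examples, see:

http://tangjj.blog.51cto.com/1848040/1535555/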

d) CombineFileInputFormat
