Hadoop I/O: SequenceFile

Copyright notice: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/hh66__66hh/article/details/83032177


1. File-based data structures

Hadoop's HDFS and MapReduce frameworks are designed primarily for large files, so they handle small files inefficiently and waste memory: every small file occupies its own block, and the metadata for every block must be held in the namenode. The usual remedy is to pack small files into a container, and Hadoop provides two container types for this: SequenceFile and MapFile.

2. A brief introduction to SequenceFile

(1) A SequenceFile is a flat file designed by Hadoop for storing key-value pairs in binary form.

(2) Within a SequenceFile, each key-value pair is treated as one record (Record).

(3) SequenceFile suggests a solution to HDFS's small-file problem: merge the small files into one large file, for example by using each small file's name as the key and its contents as the value, then writing these pairs into a SequenceFile (a sketch follows below).
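A minimal sketch of that merging idea, assuming the small files live in a local directory smallfiles/ and writing to an output path /test/packed.seq (both names are made up for illustration):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;

public class PackSmallFiles {
	public static void main(String[] args) throws IOException {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path out = new Path("/test/packed.seq"); // hypothetical output path
		SequenceFile.Writer writer = null;
		try {
			writer = SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
			// Hypothetical local directory holding the small files to pack.
			for (File f : new File("smallfiles").listFiles()) {
				byte[] bytes = Files.readAllBytes(f.toPath());
				// The file name becomes the key, the raw file contents become the value.
				writer.append(new Text(f.getName()), new BytesWritable(bytes));
			}
		} finally {
			IOUtils.closeStream(writer);
		}
	}
}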

(4) SequenceFile supports three compression types (SequenceFile.CompressionType):
a. NONE: records are not compressed;
b. RECORD: only the value of each record is compressed;
c. BLOCK: all records in a block are compressed together.
For these three compression types, Hadoop provides three corresponding Writer types (see the sketch after this list):
a. SequenceFile.Writer writes without compression;
b. SequenceFile.RecordCompressWriter compresses only the value of each key-value pair on write;
c. SequenceFile.BlockCompressWriter compresses a batch of key-value pairs into a block on write.
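In practice you do not instantiate these Writer subclasses directly; instead you pass a SequenceFile.CompressionType to one of the createWriter() overloads. A minimal sketch, assuming fs and conf are already set up and using a made-up path:

// Request block compression when creating the writer.
SequenceFile.Writer writer = SequenceFile.createWriter(
		fs, conf, new Path("/test/block.seq"),
		IntWritable.class, Text.class,
		SequenceFile.CompressionType.BLOCK);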

(5) The structure of a SequenceFile
A SequenceFile consists of a header (Header) followed by one or more records:
1) Header
The first 3 bytes of a SequenceFile are SEQ (the sequence-file magic), followed by one byte giving the file-format version number. The header also contains other fields, such as the names of the key and value classes, compression details, and the sync marker. The sync marker lets a reader recognize record boundaries when starting from an arbitrary position in the file; each file has a randomly generated sync marker whose value is stored in the header, and sync markers appear between records in the file. Because the storage overhead of sync markers is required to stay below 1%, a marker is not added after every record.
2) Records
The internal record format depends on whether compression is enabled, and if so, whether it is record compression or block compression.
A. No compression
Each record consists of, in order: the record length (in bytes), the key length, the key, and the value.
B. Record compression
The format is almost identical to the uncompressed case, except that the value is compressed with the codec defined in the header.
C. Block compression
Block compression compresses multiple records at once: records are added to a block until it reaches the size in bytes set by the io.seqfile.compress.blocksize property (1 MB by default). A sync marker is inserted before the start of every new block.
Each block consists of, in order (as documented in the Apache Hadoop 2.9.1 API): the number of records in the block, then the compressed key lengths, keys, value lengths, and values.
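The sync markers described above are what allow a reader to begin in the middle of a file. A minimal sketch, assuming reader, key and value have been set up as in section 4 below: sync(position) advances the reader to the first sync point after the given byte offset, after which next() reads whole records:

// Jump to the next record boundary after byte offset 360, then read onward.
reader.sync(360);
while (reader.next(key, value)) {
	System.out.printf("[%s]\t%s\t%s\n", reader.getPosition(), key, value);
}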

3. Writing a SequenceFile

Create a SequenceFile by calling the static createWriter() method, which returns a SequenceFile.Writer. The method has several overloaded versions, all of which require the destination to write to (an FSDataOutputStream, or a FileSystem plus a Path), a Configuration object, and the key and value types, for example:

org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass)

Once you have a SequenceFile.Writer instance, write key-value pairs by calling its append() method.

Hands-on:
(1) First, list the /test/ directory in HDFS:

root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop fs -ls /test/  
18/10/12 11:50:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup      10240 2018-10-11 06:27 /test/1
-rw-r--r--   1 root supergroup        135 2018-10-11 06:26 /test/1.gz

(2) Write MySequenceWrite.java, which implements the MySequenceWrite class:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;

public class MySequenceWrite {

	private static final String[] Data = {
		"lala, I am an apple",
		"haha, I am an banana",
		"guagua, This is a shoot",
		"mama, These are flowers",
		"gaga, Thoses are tickets"
	};

	public static void main(String[] args) throws IOException {
			String uri = args[0];
			Configuration conf = new Configuration();
			FileSystem fs = FileSystem.get(URI.create(uri), conf);
			IntWritable key = new IntWritable();
			Text value = new Text();
			Path path = new Path(uri);

			SequenceFile.Writer writer = null;
			try {
				// Create the writer (this overload is deprecated in 2.9.1 but still works).
				writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
				for(int i=1; i<21; i++) {
					key.set(i);
					value.set(Data[i%Data.length]);
					// Print the current file position before appending each record.
					System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
					writer.append(key, value);
				}
			} finally {
				IOUtils.closeStream(writer);
			}

	}
}

(3) Compile MySequenceWrite.java:

root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# javac MySequenceWrite.java 
Note: MySequenceWrite.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

(4) Run the class with hadoop:

root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop MySequenceWrite /test/2.txt
18/10/12 11:56:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/12 11:56:41 INFO compress.CodecPool: Got brand-new compressor [.deflate]
[128]	1	haha, I am an banana
[167]	2	guagua, This is a shoot
[208]	3	mama, These are flowers
[252]	4	gaga, Thoses are tickets
[297]	5	lala, I am an apple
[337]	6	haha, I am an banana
[376]	7	guagua, This is a shoot
[417]	8	mama, These are flowers
[461]	9	gaga, Thoses are tickets
[506]	10	lala, I am an apple
[546]	11	haha, I am an banana
[585]	12	guagua, This is a shoot
[626]	13	mama, These are flowers
[670]	14	gaga, Thoses are tickets
[715]	15	lala, I am an apple
[755]	16	haha, I am an banana
[794]	17	guagua, This is a shoot
[835]	18	mama, These are flowers
[879]	19	gaga, Thoses are tickets
[924]	20	lala, I am an apple

(5) List the /test/ directory in HDFS again; a new file 2.txt has appeared:

root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop fs -ls /test/
18/10/12 11:58:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 root supergroup      10240 2018-10-11 06:27 /test/1
-rw-r--r--   1 root supergroup        135 2018-10-11 06:26 /test/1.gz
-rw-r--r--   1 root supergroup        964 2018-10-12 11:56 /test/2.txt
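Note that 2.txt is a binary sequence file despite its name, so hadoop fs -cat would print gibberish. The -text subcommand understands SequenceFiles and prints each record as the key, a tab, and the value, which is a quick way to sanity-check the file:

hadoop fs -text /test/2.txt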

4. Reading a SequenceFile

To read a sequence file from beginning to end, create a SequenceFile.Reader instance and iterate over the records by repeatedly calling one of the next() methods.
(1) If the keys and values are Writable types, call the next() method that takes a key and a value argument; it reads the next key-value pair from the stream into those variables, returning false when the end of the file is reached:

public boolean next(Writable key, Writable val)

(2) For serialization frameworks other than Writable, use these two methods:

public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException

If next() returns a non-null object, a key was read from the stream, and the value can then be retrieved with getCurrentValue(); if next() returns null, the end of the file has been reached. A sketch of this loop follows.
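A minimal sketch of that loop, assuming reader is an open SequenceFile.Reader on such a file:

Object key = null;
Object value = null;
// next(key) returns the key object read (or null at end of file);
// getCurrentValue(value) then deserializes the matching value.
while ((key = reader.next(key)) != null) {
	value = reader.getCurrentValue(value);
	System.out.printf("%s\t%s\n", key, value);
}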

Hands-on:
(1) Write MySequenceRead.java, which implements the MySequenceRead class:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.ReflectionUtils;

public class MySequenceRead {

	public static void main(String[] args) throws IOException {
		String uri = args[0];
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(uri), conf);
		Path path =  new Path(uri);
		SequenceFile.Reader reader = null;

		try {
			reader = new SequenceFile.Reader(fs, path, conf);
			// Instantiate key and value objects of the classes recorded in the file header.
			Writable key = (Writable)ReflectionUtils.newInstance(reader.getKeyClass(), conf);
			Writable value = (Writable)ReflectionUtils.newInstance(reader.getValueClass(), conf);
			long position = reader.getPosition();
			while(reader.next(key, value)) {
				// Print the offset at which this record started, then the record itself.
				System.out.printf("[%s]\t%s\t%s\n", position, key, value);
				position = reader.getPosition(); // offset of the next record
			}

		} finally {
			IOUtils.closeStream(reader);
		}
	}
}

(2) Compile MySequenceRead.java:

root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# javac MySequenceRead.java 
Note: MySequenceRead.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

(3) Run the class with hadoop:

root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop MySequenceRead /test/2.txt
18/10/12 12:12:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/12 12:12:05 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
[128]	1	haha, I am an banana
[167]	2	guagua, This is a shoot
[208]	3	mama, These are flowers
[252]	4	gaga, Thoses are tickets
[297]	5	lala, I am an apple
[337]	6	haha, I am an banana
[376]	7	guagua, This is a shoot
[417]	8	mama, These are flowers
[461]	9	gaga, Thoses are tickets
[506]	10	lala, I am an apple
[546]	11	haha, I am an banana
[585]	12	guagua, This is a shoot
[626]	13	mama, These are flowers
[670]	14	gaga, Thoses are tickets
[715]	15	lala, I am an apple
[755]	16	haha, I am an banana
[794]	17	guagua, This is a shoot
[835]	18	mama, These are flowers
[879]	19	gaga, Thoses are tickets
[924]	20	lala, I am an apple
