After downloading Apache hadoop-1.2.1 (the bin.tar.gz archive) and setting up a cluster, running the wordcount example produces the warning: WARN snappy.LoadSnappy: Snappy native library not loaded.
We want to add Snappy compression support to the Hadoop cluster. Many Hadoop distributions ship with snappy/lzo compression built in, e.g. Cloudera CDH and Hortonworks HDP, but the Apache release tarballs mostly do not. (The Apache hadoop-1.2.1 RPM package (hadoop-1.2.1-1.x86_64.rpm) does include snappy support, but hadoop-1.2.1-bin.tar.gz does not.)
1. Installing snappy
1. Install g++ on the OS:
CentOS:
yum -y update gcc
yum -y install gcc gcc-c++
Ubuntu:
apt-get update
apt-get install g++
2. Download the snappy source from http://code.google.com/p/snappy/downloads/list (e.g. snappy-1.1.1.tar.gz) and extract it (the default directory is snappy-1.1.1).
In the extracted directory, run in order:
1) ./configure
2) make
3) make check
4) make install
snappy installs to /usr/local/lib by default; running ls /usr/local/lib should show libsnappy.so and related files.
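The build steps above can be condensed into a single script (a sketch; it assumes snappy-1.1.1.tar.gz has already been downloaded to the current directory and that you have root privileges for the install):

```shell
# Build and install snappy from source
tar -xzf snappy-1.1.1.tar.gz
cd snappy-1.1.1
./configure
make
make check      # optional: run the bundled unit tests
make install    # installs libsnappy.so into /usr/local/lib by default
# Verify the shared library is in place
ls /usr/local/lib | grep snappy
```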
3. Copy the generated libsnappy.so into $HADOOP_HOME/lib/native/Linux-amd64-64 and restart the Hadoop cluster. Hadoop now has Snappy compression support.
4. Before running wordcount, set the LD_LIBRARY_PATH environment variable so that it includes the directory containing libsnappy.so (export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64:/usr/local/lib).
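To avoid exporting the variable in every shell, the same line can go into conf/hadoop-env.sh, which bin/hadoop sources on every invocation (a sketch; assumes the default Hadoop 1.x directory layout):

```shell
# In $HADOOP_HOME/conf/hadoop-env.sh
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64:/usr/local/lib:$LD_LIBRARY_PATH
```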
5. Run wordcount again; the earlier warning is gone.
If the job's output is configured for Snappy compression, the output directory on HDFS will contain a file named part-r-00000.snappy.
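The compressed output can be inspected directly from the shell: hadoop fs -text decompresses on the fly via the codec inferred from the .snappy extension, whereas -cat would print raw compressed bytes (a sketch; the output path follows the example path used later in this article):

```shell
# List the job output directory
hadoop fs -ls /user/hadoop/wordcount/output
# Decompress and print the Snappy-compressed part file
hadoop fs -text /user/hadoop/wordcount/output/part-r-00000.snappy | head
```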
------The above was verified on CentOS 6.6 64-bit minimal and on Ubuntu 12.04 64-bit server.
2. Using compression in a Hadoop job:
Compression can be configured in mapred-site.xml, or set per job in code.
Compressing intermediate results (map output):
---mrV1:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
---YARN:
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
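As noted above, the same settings can be made per job in code instead of mapred-site.xml. A minimal sketch using the mrV1 JobConf API, equivalent to the mrV1 properties above (the class name is a placeholder of mine):

```java
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapOutputCompression {
    public static void configure(JobConf conf) {
        // Equivalent to mapred.compress.map.output = true
        conf.setCompressMapOutput(true);
        // Equivalent to mapred.map.output.compression.codec = ...SnappyCodec
        conf.setMapOutputCompressorClass(SnappyCodec.class);
    }
}
```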
Compressing the job's final output:
---mrV1:
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description> For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended. </description>
</property>
---YARN:
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.type</name>
<value>BLOCK</value>
<description>For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended.</description>
</property>
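The final-output settings can likewise be applied per job in code. A sketch using the mapreduce (YARN-era) API, matching the properties above (the class name is a placeholder of mine; the compression type only matters for SequenceFile outputs):

```java
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class JobOutputCompression {
    public static void configure(Job job) {
        // mapreduce.output.fileoutputformat.compress = true
        FileOutputFormat.setCompressOutput(job, true);
        // mapreduce.output.fileoutputformat.compress.codec = ...SnappyCodec
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // mapreduce.output.fileoutputformat.compress.type = BLOCK
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}
```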
3. Reading Snappy-compressed output on HDFS: code example
/**
 * At run time, LD_LIBRARY_PATH must include the directory containing libsnappy.so.
 * On Windows, PATH must include the directory containing snappy.dll.
 *
 * @param file hdfs file, such as hdfs://hadoop-master-node:9000/user/hadoop/wordcount/output/part-r-00000.snappy
 * @throws Exception
 */
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public void testReadOutput_Snappy2(String file) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop-master-node:9000");
    FileSystem fs = FileSystem.get(conf);
    // Pick the codec based on the file extension (.snappy -> SnappyCodec)
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path(file));
    if (codec == null) {
        System.out.println("Cannot find codec for file " + file);
        return;
    }
    // The codec wraps the raw HDFS stream and decompresses on the fly
    CompressionInputStream in = codec.createInputStream(fs.open(new Path(file)));
    BufferedReader br = null;
    String line;
    try {
        br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } finally {
        if (br != null) {
            br.close(); // also closes the wrapped CompressionInputStream
        } else {
            in.close();
        }
    }
}
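For completeness, the reverse direction: writing a Snappy-compressed file to HDFS with the same codec machinery (a sketch; the method name and sample data are mine, and the filesystem URI follows the example above):

```java
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public void writeSnappy(String file) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop-master-node:9000");
    FileSystem fs = FileSystem.get(conf);
    // Instantiate the codec via ReflectionUtils so it receives the Configuration
    SnappyCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
    // The codec wraps the raw HDFS stream; bytes are compressed as they are written
    CompressionOutputStream out = codec.createOutputStream(fs.create(new Path(file)));
    Writer w = new OutputStreamWriter(out, "UTF-8");
    try {
        w.write("hello snappy\n");
    } finally {
        w.close(); // flushes and closes the compression stream
    }
}
```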