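Hive Official Manual: LZO Compression

General LZO Concepts

LZO is a lossless data compression library that favors speed over compression ratio. See http://www.oberhumer.com/opensource/lzo and http://www.lzop.org for general information about LZO, and see Compressed Data Storage for information about compressed data storage in Hive.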

Imagine a simple data file that has three columns:

id
first name
last name

Populate the data file with 4 records:

19630001     john          lennon
19630002     paul          mccartney
19630003     george        harrison
19630004     ringo         starr

Let's call this data file /path/to/dir/names.txt.
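As a minimal sketch, the sample file could be produced from the shell as shown here. It assumes the columns are separated by a single tab, which matches a table declared with FIELDS TERMINATED BY '\t'; the temporary directory stands in for the placeholder /path/to/dir.

```shell
# Hypothetical working directory; the text uses the placeholder /path/to/dir.
workdir="$(mktemp -d)"

# printf reuses the format string for every group of three arguments,
# so this writes four tab-separated lines (assumption: tab is the delimiter).
printf '%s\t%s\t%s\n' \
  19630001 john lennon \
  19630002 paul mccartney \
  19630003 george harrison \
  19630004 ringo starr > "$workdir/names.txt"

# Print the result for a quick visual check.
cat "$workdir/names.txt"
```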

To make it into an LZO file, we can use the lzop utility, which creates a file named names.txt.lzo. Then copy this file into HDFS.
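The compress-and-copy step might look like the following sketch. Both paths are the placeholders used elsewhere on this page, and the guards are only there so the snippet does nothing on a machine without lzop, hdfs, or the input file.

```shell
# Placeholder paths from this page; adjust them to your environment.
local_file="/path/to/dir/names.txt"
hdfs_dir="/path/to/HDFS/dir/containing/lzo/files"

# Compress only if lzop is installed and the input file exists.
if command -v lzop >/dev/null 2>&1 && [ -f "$local_file" ]; then
  # lzop keeps the input file by default and writes names.txt.lzo next to it.
  lzop "$local_file"
fi

# Copy the compressed file into HDFS only if the hdfs client is available.
if command -v hdfs >/dev/null 2>&1 && [ -f "$local_file.lzo" ]; then
  hdfs dfs -mkdir -p "$hdfs_dir"
  hdfs dfs -put "$local_file.lzo" "$hdfs_dir/"
fi
```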

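Prerequisites

Lzo/Lzop Installations

lzo and lzop need to be installed on every node in the Hadoop cluster. The details of these installations are beyond the scope of this document.

`core-site.xml`

Add the following codecs to your core-site.xml: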

com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec

For example:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Next, run the following command to create an LZO index file:

hadoop jar /path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.LzoIndexer  /path/to/HDFS/dir/containing/lzo/files

This creates the index file names.txt.lzo.index next to names.txt.lzo on HDFS.

Table Definition

The following hive -e command creates an LZO-compressed external table:

hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS hive_table_name (column_1  datatype_1......column_N datatype_N)
         PARTITIONED BY (partition_col_1 datatype_1 ....col_P  datatype_P)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         STORED AS INPUTFORMAT  \"com.hadoop.mapred.DeprecatedLzoTextInputFormat\"
                   OUTPUTFORMAT \"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\";"

Note: The double quotes have to be escaped so that the 'hive -e' command works correctly.

See CREATE TABLE and Hive CLI for information about command syntax.
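For the names.txt.lzo example, a concrete table definition could be sketched as follows. The table name names, the column names and types, and the LOCATION path are assumptions made for illustration. Writing the statement to a file and running it with hive -f sidesteps the quote escaping that hive -e requires.

```shell
# Write the DDL to a temporary file; the quoted 'EOF' keeps the shell
# from interpreting anything inside the statement.
hql_file="$(mktemp)"
cat > "$hql_file" <<'EOF'
-- Assumed table and column names for the three-column sample file.
CREATE EXTERNAL TABLE IF NOT EXISTS names (
  id         INT,
  first_name STRING,
  last_name  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/path/to/HDFS/dir/containing/lzo/files';
EOF

# Run the statement only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$hql_file"
fi
```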

Hive Queries

Option 1: Directly Create LZO Files

Directly create LZO files as the output of the Hive query.
Use lzop command utility or your custom Java to generate .lzo.index for the .lzo files.

Hive Query Parameters

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true

For example:

hive -e "SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec; SET hive.exec.compress.output=true;SET mapreduce.output.fileoutputformat.compress=true; <query-string>"

Note: This option does not work if the data sets are large or the number of output files is large.

Option 2: Write Custom Java to Create LZO Files

Create text files as the output of the Hive query.
Write custom Java code to
1. convert the text files generated by the Hive query to .lzo files
2. generate .lzo.index files for the .lzo files generated above

Hive Query Parameters

Prefix the query string with these parameters:

SET hive.exec.compress.output=false
SET mapreduce.output.fileoutputformat.compress=false

For example:

hive -e "SET hive.exec.compress.output=false;SET mapreduce.output.fileoutputformat.compress=false;<query-string>"
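Before writing the Java program, the two steps of Option 2 can be prototyped from the shell. The sketch below is a shell substitute and not the custom Java code this option describes: it streams one text file through lzop and then runs the LzoIndexer class used earlier on this page. The helper name lzo_convert_and_index is hypothetical, and the jar path is the placeholder from the indexing command.

```shell
# Placeholder jar path taken from the indexing command earlier on this page.
LZO_JAR="/path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar"

# Hypothetical helper.
# $1: HDFS path of one uncompressed text file produced by the Hive query.
lzo_convert_and_index() {
  src="$1"
  # Step 1: stream the text file through lzop and write <file>.lzo to HDFS.
  # "-put -" reads the data to upload from standard input.
  # Only the exit status of the final put is checked here.
  hdfs dfs -cat "$src" | lzop -c | hdfs dfs -put - "$src.lzo" || return 1
  # Step 2: build the index for the newly written .lzo file.
  hadoop jar "$LZO_JAR" com.hadoop.compression.lzo.LzoIndexer "$src.lzo"
}

# Example call (the path is a placeholder):
# lzo_convert_and_index /path/to/HDFS/query/output/file
```

Because all data passes through the machine that runs the script, this approach only suits small outputs; a Java program running on the cluster avoids that bottleneck.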