[Hive, Part 15] Hive I/O

1. Configuring compression codecs in Hadoop (this is a Hadoop property: Hive reads it from core-site.xml, and it can be set in hive-site.xml to override the Hadoop value)
key: io.compression.codecs
value: org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
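
To confirm which codecs a Hive session actually sees, the property can be printed from the Hive CLI: `set` followed by only a property name echoes its current value. A minimal sketch; the output depends on the cluster's own configuration files.

```sql
-- Print the codec list that Hive picked up from core-site.xml
-- (or from hive-site.xml, if it is overridden there).
set io.compression.codecs;
```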

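Choosing a compression codec is a trade-off between compression/decompression speed and compression ratio:
GZip and BZip2 achieve a high compression ratio but compress relatively slowly;
Snappy and LZO achieve a lower compression ratio but compress and decompress very quickly.
BZip2 and LZO allow a compressed file to be split into blocks and processed in parallel (for LZO this typically requires building an index file first); GZip and Snappy do not. To use GZip or Snappy, a block-compressed Sequence File is recommended.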

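Are compressed files splittable?
Suppose data is compressed into GZIP, BZIP2, Snappy, or LZO format and the resulting file is fairly large, say 1 GB, so that HDFS stores it as 8 blocks (128 MB each). Can each of those blocks be processed in parallel?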

Answer:
In text files, each line is a record, but these boundaries are obscured by GZip and Snappy.
However, BZip2 and LZO provide block-level compression, where each block has
complete records, so Hadoop can split these files on block boundaries.
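
As a sketch of how this plays out in Hive: if query output has to stay in plain text format and still be splittable by downstream jobs, one option is to write it with BZip2Codec. The table name final_comp_on_bz2 is hypothetical, and a stands for any existing source table.

```sql
-- Compress the final query output.
set hive.exec.compress.output=true;
-- BZip2 is block-oriented, so the resulting text files remain splittable.
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- final_comp_on_bz2 is a hypothetical table name; a is an existing table.
CREATE TABLE final_comp_on_bz2
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM a;
```

The price is speed: BZip2 is the slowest of the four codecs discussed here, so this choice suits data that is written rarely and read by many parallel tasks.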

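Hive property configuration
1. hive.exec.compress.intermediate
Whether to compress the intermediate data produced during the MapReduce shuffle. The default is false (no compression).
2. mapred.map.output.compression.codec
The codec used to compress the intermediate data produced during the shuffle.
3. hive.exec.compress.output
Whether to compress the result data of Hive queries. The default is false (no compression).
4. mapred.output.compression.codec
The codec used to compress the final result (the reducer output).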
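
A minimal sketch of how the four properties above fit together in one session, assuming the Snappy native library is installed on the cluster: a fast codec for the shuffle data, which is written and read once, and a higher-ratio codec for the stored result.

```sql
-- Compress intermediate (shuffle) data with a fast codec.
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final query output with a higher-ratio codec.
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

On newer Hadoop releases the mapred.* names are deprecated aliases: mapreduce.map.output.compress.codec and mapreduce.output.fileoutputformat.compress.codec are the current equivalents.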

5. Characteristics of the Sequence File format
a. The sequence file format supported by Hadoop breaks a file into blocks and then optionally compresses the blocks in a splittable way (compression is applied per block, so each block of a SequenceFile holds complete data).
b. Sequence files have three different compression options: NONE, RECORD, and BLOCK. RECORD is the default. However, BLOCK compression is usually more efficient and it still
provides the desired splittability.
c. Setting the Sequence File compression type (compression options)
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles,
how should they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>

d. Using Sequence File as the final file format (the file format of the reduce output; each block can be compressed)
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE
> AS SELECT * FROM a;
Gzip can be used with a Sequence File because the Sequence File format compresses block by block, and each block contains complete data.
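
To check the result, the table's files can be inspected from the same Hive session. This is a sketch under two assumptions: the table lives under the default warehouse directory /user/hive/warehouse, and the output file is named 000000_0 (a typical name; the actual one is shown by the listing).

```sql
-- List the files written for the table (path assumes the default warehouse dir).
dfs -ls /user/hive/warehouse/final_comp_on_gz_seq;

-- dfs -text decodes the SequenceFile header and compression and prints the rows.
-- 000000_0 is an assumed file name; use the one shown by the listing above.
dfs -text /user/hive/warehouse/final_comp_on_gz_seq/000000_0;

-- Hive itself reads the compressed table transparently.
SELECT * FROM final_comp_on_gz_seq LIMIT 5;
```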
