(1) Environment:
hadoop 2.8.1
hive 1.2.2
core-site.xml configuration items
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec,
    com.hadoop.compression.lzo.LzopCodec,
    com.hadoop.compression.lzo.LzoCodec
  </value>
</property>
<!-- lzop -->
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
mapred-site.xml configuration items
<!-- Compress the intermediate map output with lzop -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<!-- Compress the final map/reduce job output with lzop -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<!-- lzop -->
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
(2) Test steps:
1. Create the Hive table
CREATE TABLE `lzo5`(
  `uuid` string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
2. Create a uuid.txt file containing one line of data
uuid1
3. Compress the file with lzop to create uuid.txt.lzo
lzop uuid.txt
4. Load the data into the Hive table
load data inpath "/home/hadoop/uuid.txt.lzo" into table lzo5;
5. Run a Hive query and check that the result is 1 (correct)
select count(1) from lzo5;
6. Create an lzo index for the lzo file under the lzo5 path of the Hive table
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer hdfs://hd1:9000/user/hive/warehouse/lzo5
7. Check that the index file was generated
hdfs dfs -ls hdfs://hd1:9000/user/hive/warehouse/lzo5
8. Run the query again and check that the result is still 1 (correct)
select count(1) from lzo5;
(3) How can you tell whether the lzo index is taking effect?
Create an lzo file slightly larger than the HDFS block size, query it once without an index and once with an index, and compare the number of map tasks:
- Without an index, the number of maps is 1, because an unindexed lzo file cannot be split.
- With an index, the number of maps is roughly the lzo file size divided by the block size (rounded up), because lzo plus an index supports splitting.
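A test file for this comparison can be produced with a small generator. The following is only a sketch: gen_uuid_lines is a hypothetical helper name, the lines it writes are UUID-shaped random hex strings rather than real UUIDs, and the /tmp path is an assumption. Raise the line count until the compressed .lzo file exceeds the block size.

```shell
# Hypothetical helper: write COUNT lines of UUID-shaped random hex to OUT.
# The strings only look like UUIDs; they carry no version or variant bits.
gen_uuid_lines() {
  count=$1
  out=$2
  awk -v n="$count" '
    # One random 16-bit value, printed later as 4 hex digits.
    function r() { return int(rand() * 65536) }
    BEGIN {
      srand()
      for (i = 0; i < n; i++) {
        printf "%04x%04x-%04x-%04x-%04x-%04x%04x%04x\n",
          r(), r(), r(), r(), r(), r(), r(), r()
      }
    }' > "$out"
}

# Small sample run; the path is an assumption for illustration.
gen_uuid_lines 1000 /tmp/uuid_sample.txt

# For the real test, generate a much larger file and then compress and load it
# with the same commands as in section (2): lzop, load data, and the indexer.
```

Random hex text compresses poorly, which helps here because the compressed file, not the raw text, has to be larger than one block.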
(4) Comparison results:
The HDFS block size is 128M, and the generated lzo file is 370M.
The unindexed and indexed runs compare as follows, with the indexed query slightly faster:
No index: 1 map
With index: 3 maps (once the file is indexed, it can be split)
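These map counts match a quick back-of-the-envelope estimate. The sketch below assumes one map per block-sized split (a ceiling division) for an indexed file and a single map otherwise. expected_maps is a hypothetical helper, not part of Hadoop or Hive, and the real count can shift with split settings and the input format in use.

```shell
# Hypothetical helper: estimate the number of map tasks for one lzo file.
# Arguments: file size in MB, block size in MB, "yes" if an .index file exists.
expected_maps() {
  file_mb=$1
  block_mb=$2
  indexed=$3
  if [ "$indexed" = "yes" ]; then
    # Ceiling division: one map per block-sized split.
    echo $(( (file_mb + block_mb - 1) / block_mb ))
  else
    # An unindexed lzo file cannot be split, so a single map reads it all.
    echo 1
  fi
}

expected_maps 370 128 no    # prints 1
expected_maps 370 128 yes   # prints 3
```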